├── Computer Vision
│   ├── Pose Estimation & Squat Counter
│   │   ├── Pose Estimation and squat counter using MoveNet.ipynb
│   │   ├── Readme.md
│   │   ├── ezgif.com-gif-maker.gif
│   │   ├── images.jpg
│   │   ├── jump.gif
│   │   └── requirments
│   └── Real Time Sign Language Interpretation App
│       ├── IBM_cloud_configuration.md.txt
│       ├── ReactComputerVisionTemplate
│       │   ├── Public
│       │   │   ├── favicon.ico
│       │   │   ├── index.html
│       │   │   ├── logo192.png
│       │   │   ├── logo512.png
│       │   │   ├── manifest.json
│       │   │   ├── readme.md
│       │   │   └── robots.txt
│       │   ├── Readme.md
│       │   ├── package-lock.json
│       │   ├── package.json
│       │   ├── src
│       │   │   ├── App.css
│       │   │   ├── App.js
│       │   │   ├── index.css
│       │   │   ├── index.js
│       │   │   ├── readme.md
│       │   │   └── utilities.js
│       │   └── yarn.lock
│       ├── Readme.md
│       └── Sign-language_detection.ipynb
├── Data Visualization
│   └── Python
│       ├── Immigration_to_Canda_Data_Visualization.ipynb
│       ├── Readme.dm
│       └── Spatial visualization of San Francisco incidents.ipynb
├── Deep Learning
│   └── Classification
│       └── Melenoma_Classification
│           ├── Readme.md
│           ├── deep-learning-models
│           │   ├── CNN_model.py
│           │   ├── __init__.py
│           │   ├── main.py
│           │   ├── readme.md
│           │   └── training.py
│           ├── evaluation-metrics
│           │   ├── __init__.py
│           │   ├── classification_metrics.py
│           │   ├── f1_score.py
│           │   └── readme.md
│           ├── loading and storing
│           │   ├── __init__.py
│           │   ├── loading_images.py
│           │   ├── loading_storing_h5py.py
│           │   └── readme.md
│           ├── main.py
│           ├── preprocessing
│           │   ├── __init__.py
│           │   ├── exploration.py
│           │   ├── preprocessing.py
│           │   └── readme.md
│           └── readme.md
├── Machine Learning
│   ├── Classification
│   │   ├── Alzhimers CV-BOLD Classification
│   │   │   ├── Best_mask.py
│   │   │   ├── Best_mask2.py
│   │   │   ├── Model.py
│   │   │   ├── confidence_interval_mask.py
│   │   │   ├── data_preprocessing.py
│   │   │   ├── deep learning
│   │   │   │   ├── CNN_based_models
│   │   │   │   │   ├── AlexNet.py
│   │   │   │   │   ├── CNN.py
│   │   │   │   │   ├── CNN_feature_extractor.py
│   │   │   │   │   ├── DenseNet121.py
│   │   │   │   │   ├── InceptionResNetV2.py
│   │   │   │   │   ├── LeNet.py
│   │   │   │   │   ├── ResNet50.py
│   │   │   │   │   ├── VGG.py
│   │   │   │   │   ├── VGG_pretrained.py
│   │   │   │   │   ├── ZFNet.py
│   │   │   │   │   ├── optimizers.py
│   │   │   │   │   ├── readme.md
│   │   │   │   │   └── simple_model.py
│   │   │   │   ├── evaluation
│   │   │   │   │   ├── metrics.py
│   │   │   │   │   ├── model_evaluation.py
│   │   │   │   │   └── readme.md
│   │   │   │   ├── main.py
│   │   │   │   ├── preprocessing
│   │   │   │   │   ├── data_augmentation.py
│   │   │   │   │   ├── data_preprocessing.py
│   │   │   │   │   ├── preprocessing_methods.py
│   │   │   │   │   └── readme.md
│   │   │   │   └── storing_loading
│   │   │   │       ├── generate_result_.py
│   │   │   │       ├── load_data.py
│   │   │   │       └── readme.md
│   │   │   ├── generate_result.py
│   │   │   ├── hyper_opt.py
│   │   │   ├── load_data.py
│   │   │   ├── load_models.py
│   │   │   ├── main.py
│   │   │   ├── pykliep.py
│   │   │   ├── readme.md
│   │   │   ├── sample_test.py
│   │   │   ├── shuffle.py
│   │   │   └── writing.py
│   │   └── Sensor-activity-recognition
│   │       ├── Sensor Activity Recognition.pdf
│   │       ├── codes
│   │       │   ├── classes_accuarcy.m
│   │       │   ├── classification.m
│   │       │   ├── create_feature_map.m
│   │       │   ├── main.m
│   │       │   ├── performance_evaluation.m
│   │       │   ├── readme.md
│   │       │   └── scalingANDoutliers.m
│   │       └── readme.md
│   ├── Clustering
│   │   ├── Customer identification for mail order products
│   │   │   ├── Identify Customer Segments.ipynb
│   │   │   ├── LICENSE
│   │   │   └── README.md
│   │   ├── Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym
│   │   │   ├── Finding best neighborhood for new gym opening in toronto city.ipynb
│   │   │   ├── LICENSE
│   │   │   ├── Project report.pdf
│   │   │   └── Readme.md
│   │   └── Readme.md
│   └── Regression
│       ├── Automobile price prediction
│       │   ├── Automobile Price Prediction .ipynb
│       │   └── Readme.md
│       └── Readme.md
├── Natural_Language_processing
│   ├── Data-Science-Resume-Selector
│   │   ├── Resume Selector with Naive Bayes .ipynb
│   │   ├── readme.md
│   │   └── resume.csv
│   ├── Sentiment-analysis
│   │   ├── README.md
│   │   ├── SageMaker Project.ipynb
│   │   ├── sevre
│   │   │   ├── model.py
│   │   │   ├── predict.py
│   │   │   ├── requirements.txt
│   │   │   └── utils.py
│   │   ├── train
│   │   │   ├── model.py
│   │   │   ├── requirements.txt
│   │   │   └── train.py
│   │   └── website
│   │       └── index.html
│   └── plagiarism-detector-web-app
│       ├── 1_Data_Exploration.ipynb
│       ├── 2_Plagiarism_Feature_Engineering.ipynb
│       ├── 3_Training_a_Model.ipynb
│       ├── README.md
│       ├── helpers.py
│       ├── palagrism_data
│       │   ├── test.csv
│       │   └── train.csv
│       ├── problem_unittests.py
│       ├── source_pytorch
│       │   ├── model.py
│       │   ├── predict.py
│       │   └── train.py
│       └── source_sklearn
│           └── train.py
├── Readme.md
├── Spark
│   ├── Cluster Analysis of the San Diego Weather Data
│   │   ├── Cluster Analysis of the San Diego Weather Data.ipynb
│   │   └── readme.md
│   └── San Diego Rainforest Fire Predicition
│       ├── Readme.md
│       └── San Diego Rainforest Fire Prediction.ipynb
└── time-series-analysis
    ├── Power-consumption-forecasting
    │   ├── Energy_Consumption_Solution.ipynb
    │   ├── json_energy_data
    │   │   ├── readme.md
    │   │   ├── test.json
    │   │   └── train.json
    │   ├── readme.md
    │   └── txt_preprocessing.py
    └── readme.md
/Computer Vision/Pose Estimation & Squat Counter/Readme.md:
--------------------------------------------------------------------------------
1 | ## Pose Estimation & Squat Counter
2 |
3 | ### Introduction
4 |
5 | Pose estimation refers to a family of computer vision techniques that detect human figures in images and videos, so that one can determine, for example, where someone’s elbow appears in an image. Note that pose estimation merely estimates where key body joints are; it does not recognize who is in an image or video. The pose estimation model takes a processed camera image as input and outputs information about keypoints. The detected keypoints are indexed by a part ID, with a confidence score between 0.0 and 1.0 that indicates the probability that a keypoint exists at that position. An example is shown in the video below.
6 |
7 | 
8 |
9 | Based on the results of the pose estimation, the squat movement is detected, counted, and the count is printed on the screen, as shown in the video below.
10 |
11 | 
12 | ---
13 |
14 | ### Methods
15 |
16 | The model used is MoveNet, which is available in two flavors:
17 |
18 | * MoveNet.Lightning is smaller and faster, but less accurate than the Thunder version. It can run in real time on modern smartphones.
19 | * MoveNet.Thunder is more accurate, but also larger and slower than Lightning. It is useful for use cases that require higher accuracy.
20 |
21 | MoveNet.Lightning is used here.
22 |
23 | MoveNet is a state-of-the-art pose estimation model that can detect the following 17 keypoints:
24 |
25 | * Nose
26 | * Left and right eye
27 | * Left and right ear
28 | * Left and right shoulder
29 | * Left and right elbow
30 | * Left and right wrist
31 | * Left and right hip
32 | * Left and right knee
33 | * Left and right ankle
34 |
35 | The various body joints detected by the pose estimation model are tabulated below:
36 |
37 | | Id | Part |
38 | | --- | ----------- |
39 | | 0 | nose |
40 | | 1 | leftEye |
41 | | 2 | rightEye |
42 | | 3 | leftEar |
43 | | 4 | rightEar |
44 | | 5 | leftShoulder |
45 | | 6 | rightShoulder |
46 | | 7 | leftElbow |
47 | | 8 | rightElbow |
48 | | 9 | leftWrist |
49 | | 10 | rightWrist |
50 | | 11 | leftHip |
51 | | 12 | rightHip |
52 | | 13 | leftKnee |
53 | | 14 | rightKnee |
54 | | 15 | leftAnkle |
55 | | 16 | rightAnkle |
56 |
57 |
58 | ---
59 |
60 | ### Install dependencies
61 |
62 | ```
63 | pip install -r requirments
64 | ```
65 |
66 |
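67 | Below is a minimal illustrative sketch (not the notebook's exact implementation) of how MoveNet.Lightning can be loaded from TensorFlow Hub and its keypoints used to count squats. The knee-angle thresholds (100° for "down", 160° for "up") are assumptions chosen for illustration only.
68 |
69 | ```python
70 | import numpy as np
71 | import tensorflow as tf
72 | import tensorflow_hub as hub
73 |
74 | # Load the single-pose MoveNet.Lightning model from TF Hub
75 | movenet = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4").signatures["serving_default"]
76 |
77 | def detect_keypoints(frame):
78 |     """Return a (17, 3) array of [y, x, confidence] keypoints in normalized coordinates."""
79 |     img = tf.image.resize_with_pad(tf.expand_dims(frame, axis=0), 192, 192)
80 |     out = movenet(tf.cast(img, dtype=tf.int32))
81 |     return out["output_0"].numpy()[0, 0]
82 |
83 | LEFT_HIP, LEFT_KNEE, LEFT_ANKLE = 11, 13, 15  # part IDs from the table above
84 |
85 | def knee_angle(kps):
86 |     """Angle (degrees) at the left knee between the hip-knee and ankle-knee vectors."""
87 |     hip, knee, ankle = kps[LEFT_HIP, :2], kps[LEFT_KNEE, :2], kps[LEFT_ANKLE, :2]
88 |     v1, v2 = hip - knee, ankle - knee
89 |     cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-6)
90 |     return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
91 |
92 | squats, is_down = 0, False
93 | def update_counter(frame):
94 |     """Call once per webcam frame; returns the running squat count."""
95 |     global squats, is_down
96 |     angle = knee_angle(detect_keypoints(frame))
97 |     if angle < 100:                 # knee bent enough to count as "down"
98 |         is_down = True
99 |     elif angle > 160 and is_down:   # back up again: one full repetition
100 |         squats += 1
101 |         is_down = False
102 |     return squats
103 | ```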
--------------------------------------------------------------------------------
/Computer Vision/Pose Estimation & Squat Counter/ezgif.com-gif-maker.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Pose Estimation & Squat Counter/ezgif.com-gif-maker.gif
--------------------------------------------------------------------------------
/Computer Vision/Pose Estimation & Squat Counter/images.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Pose Estimation & Squat Counter/images.jpg
--------------------------------------------------------------------------------
/Computer Vision/Pose Estimation & Squat Counter/jump.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Pose Estimation & Squat Counter/jump.gif
--------------------------------------------------------------------------------
/Computer Vision/Pose Estimation & Squat Counter/requirments:
--------------------------------------------------------------------------------
1 | tensorflow
2 | tensorflow_hub
3 | opencv-python
4 | numpy
5 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/IBM_cloud_configuration.md.txt:
--------------------------------------------------------------------------------
1 | ibmcloud login
2 |
3 | ibmcloud target -r eu-de
4 |
5 | # configurations
6 |
7 | ibmcloud cos bucket-cors-put --bucket tensorflowjsrealtimesign --cors-configuration file://corsconfig.json
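8 |
9 | # corsconfig.json is not included in this repo. An assumed minimal example of the
10 | # S3-style CORS document this command expects (so the web app can GET model.json):
11 | #
12 | # {
13 | #   "CORSRules": [
14 | #     {
15 | #       "AllowedOrigins": ["*"],
16 | #       "AllowedMethods": ["GET", "HEAD"],
17 | #       "AllowedHeaders": ["*"]
18 | #     }
19 | #   ]
20 | # }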
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/favicon.ico:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/favicon.ico
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
12 |
13 |
17 |
18 |
27 | React App
28 |
29 |
30 | You need to enable JavaScript to run this app.
31 |
32 |
42 |
43 |
44 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/logo192.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/logo192.png
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/logo512.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/logo512.png
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/manifest.json:
--------------------------------------------------------------------------------
1 | {
2 | "short_name": "React App",
3 | "name": "Create React App Sample",
4 | "icons": [
5 | {
6 | "src": "favicon.ico",
7 | "sizes": "64x64 32x32 24x24 16x16",
8 | "type": "image/x-icon"
9 | },
10 | {
11 | "src": "logo192.png",
12 | "type": "image/png",
13 | "sizes": "192x192"
14 | },
15 | {
16 | "src": "logo512.png",
17 | "type": "image/png",
18 | "sizes": "512x512"
19 | }
20 | ],
21 | "start_url": ".",
22 | "display": "standalone",
23 | "theme_color": "#000000",
24 | "background_color": "#ffffff"
25 | }
26 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Public/robots.txt:
--------------------------------------------------------------------------------
1 | # https://www.robotstxt.org/robotstxt.html
2 | User-agent: *
3 | Disallow:
4 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/Readme.md:
--------------------------------------------------------------------------------
1 | This project was bootstrapped with [Create React App](https://github.com/facebook/create-react-app).
2 |
3 | ## Available Scripts
4 |
5 | In the project directory, you can run:
6 |
7 | ### `yarn start`
8 |
9 | Runs the app in the development mode.
10 | Open [http://localhost:3000](http://localhost:3000) to view it in the browser.
11 |
12 | The page will reload if you make edits.
13 | You will also see any lint errors in the console.
14 |
15 | ### `yarn test`
16 |
17 | Launches the test runner in the interactive watch mode.
18 | See the section about [running tests](https://facebook.github.io/create-react-app/docs/running-tests) for more information.
19 |
20 | ### `yarn build`
21 |
22 | Builds the app for production to the `build` folder.
23 | It correctly bundles React in production mode and optimizes the build for the best performance.
24 |
25 | The build is minified and the filenames include the hashes.
26 | Your app is ready to be deployed!
27 |
28 | See the section about [deployment](https://facebook.github.io/create-react-app/docs/deployment) for more information.
29 |
30 | ### `yarn eject`
31 |
32 | **Note: this is a one-way operation. Once you `eject`, you can’t go back!**
33 |
34 | If you aren’t satisfied with the build tool and configuration choices, you can `eject` at any time. This command will remove the single build dependency from your project.
35 |
36 | Instead, it will copy all the configuration files and the transitive dependencies (webpack, Babel, ESLint, etc) right into your project so you have full control over them. All of the commands except `eject` will still work, but they will point to the copied scripts so you can tweak them. At this point you’re on your own.
37 |
38 | You don’t have to ever use `eject`. The curated feature set is suitable for small and middle deployments, and you shouldn’t feel obligated to use this feature. However we understand that this tool wouldn’t be useful if you couldn’t customize it when you are ready for it.
39 |
40 | ## Learn More
41 |
42 | You can learn more in the [Create React App documentation](https://facebook.github.io/create-react-app/docs/getting-started).
43 |
44 | To learn React, check out the [React documentation](https://reactjs.org/).
45 |
46 | ### Code Splitting
47 |
48 | This section has moved here: https://facebook.github.io/create-react-app/docs/code-splitting
49 |
50 | ### Analyzing the Bundle Size
51 |
52 | This section has moved here: https://facebook.github.io/create-react-app/docs/analyzing-the-bundle-size
53 |
54 | ### Making a Progressive Web App
55 |
56 | This section has moved here: https://facebook.github.io/create-react-app/docs/making-a-progressive-web-app
57 |
58 | ### Advanced Configuration
59 |
60 | This section has moved here: https://facebook.github.io/create-react-app/docs/advanced-configuration
61 |
62 | ### Deployment
63 |
64 | This section has moved here: https://facebook.github.io/create-react-app/docs/deployment
65 |
66 | ### `yarn build` fails to minify
67 |
68 | This section has moved here: https://facebook.github.io/create-react-app/docs/troubleshooting#npm-run-build-fails-to-minify
69 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/package.json:
--------------------------------------------------------------------------------
1 | {
2 | "name": "handpose",
3 | "version": "0.1.0",
4 | "private": true,
5 | "dependencies": {
6 | "@tensorflow/tfjs": "^3.1.0",
7 | "@testing-library/jest-dom": "^4.2.4",
8 | "@testing-library/react": "^9.3.2",
9 | "@testing-library/user-event": "^7.1.2",
10 | "react": "^16.13.1",
11 | "react-dom": "^16.13.1",
12 | "react-scripts": "3.4.3",
13 | "react-webcam": "^5.2.0"
14 | },
15 | "scripts": {
16 | "start": "react-scripts start",
17 | "build": "react-scripts build",
18 | "test": "react-scripts test",
19 | "eject": "react-scripts eject"
20 | },
21 | "eslintConfig": {
22 | "extends": "react-app"
23 | },
24 | "browserslist": {
25 | "production": [
26 | ">0.2%",
27 | "not dead",
28 | "not op_mini all"
29 | ],
30 | "development": [
31 | "last 1 chrome version",
32 | "last 1 firefox version",
33 | "last 1 safari version"
34 | ]
35 | }
36 | }
37 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/App.css:
--------------------------------------------------------------------------------
1 | .App {
2 | text-align: center;
3 | }
4 |
5 | .App-logo {
6 | height: 40vmin;
7 | pointer-events: none;
8 | }
9 |
10 | @media (prefers-reduced-motion: no-preference) {
11 | .App-logo {
12 | animation: App-logo-spin infinite 20s linear;
13 | }
14 | }
15 |
16 | .App-header {
17 | background-color: #282c34;
18 | min-height: 100vh;
19 | display: flex;
20 | flex-direction: column;
21 | align-items: center;
22 | justify-content: center;
23 | font-size: calc(10px + 2vmin);
24 | color: white;
25 | }
26 |
27 | .App-link {
28 | color: #61dafb;
29 | }
30 |
31 | @keyframes App-logo-spin {
32 | from {
33 | transform: rotate(0deg);
34 | }
35 | to {
36 | transform: rotate(360deg);
37 | }
38 | }
39 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/App.js:
--------------------------------------------------------------------------------
1 | // Import dependencies
2 | import React, { useRef, useState, useEffect } from "react";
3 | import * as tf from "@tensorflow/tfjs";
4 | import Webcam from "react-webcam";
5 | import "./App.css";
6 | import { nextFrame } from "@tensorflow/tfjs";
7 | // 2. TODO - Import drawing utility here
8 | // e.g. import { drawRect } from "./utilities";
9 | import { drawRect } from "./utilities";
10 |
11 | function App() {
12 | const webcamRef = useRef(null);
13 | const canvasRef = useRef(null);
14 |
15 | // Main function
16 | const runCoco = async () => {
17 | // 3. TODO - Load network
18 | // e.g. const net = await cocossd.load();
19 | // https://tensorflowjsrealtimesign.s3.eu-de.cloud-object-storage.appdomain.cloud/model.json
20 | const net = await tf.loadGraphModel('https://tensorflowjsrealtimesign.s3.eu-de.cloud-object-storage.appdomain.cloud/model.json')
21 |
22 | // Loop and detect hands
23 | setInterval(() => {
24 | detect(net);
25 | }, 16.7);
26 | };
27 |
28 | const detect = async (net) => {
29 | // Check data is available
30 | if (
31 | typeof webcamRef.current !== "undefined" &&
32 | webcamRef.current !== null &&
33 | webcamRef.current.video.readyState === 4
34 | ) {
35 | // Get Video Properties
36 | const video = webcamRef.current.video;
37 | const videoWidth = webcamRef.current.video.videoWidth;
38 | const videoHeight = webcamRef.current.video.videoHeight;
39 |
40 | // Set video width
41 | webcamRef.current.video.width = videoWidth;
42 | webcamRef.current.video.height = videoHeight;
43 |
44 | // Set canvas height and width
45 | canvasRef.current.width = videoWidth;
46 | canvasRef.current.height = videoHeight;
47 |
48 | // 4. TODO - Make Detections
49 | const img = tf.browser.fromPixels(video)
50 | const resized = tf.image.resizeBilinear(img, [640, 480])
51 | const casted = resized.cast('int32')
52 | const expanded = casted.expandDims(0)
53 | const obj = await net.executeAsync(expanded)
54 | console.log(obj)
55 |
56 | const boxes = await obj[1].array()
57 | const classes = await obj[6].array()
58 | const scores = await obj[5].array()
59 |
60 | // Draw mesh
61 | const ctx = canvasRef.current.getContext("2d");
62 |
63 | // 5. TODO - Update drawing utility
64 | // drawSomething(obj, ctx)
65 | window.requestAnimationFrame(() => { drawRect(boxes[0], classes[0], scores[0], 0.9, videoWidth, videoHeight, ctx) });
66 |
67 | tf.dispose(img)
68 | tf.dispose(resized)
69 | tf.dispose(casted)
70 | tf.dispose(expanded)
71 | tf.dispose(obj)
72 |
73 | }
74 | };
75 |
76 | useEffect(() => { runCoco() }, []);
77 |
78 | return (
79 |
80 |
112 |
113 | );
114 | }
115 |
116 | export default App;
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/index.css:
--------------------------------------------------------------------------------
1 | body {
2 | margin: 0;
3 | font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', 'Roboto', 'Oxygen',
4 | 'Ubuntu', 'Cantarell', 'Fira Sans', 'Droid Sans', 'Helvetica Neue',
5 | sans-serif;
6 | -webkit-font-smoothing: antialiased;
7 | -moz-osx-font-smoothing: grayscale;
8 | }
9 |
10 | code {
11 | font-family: source-code-pro, Menlo, Monaco, Consolas, 'Courier New',
12 | monospace;
13 | }
14 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/index.js:
--------------------------------------------------------------------------------
1 | import React from 'react';
2 | import ReactDOM from 'react-dom';
3 | import './index.css';
4 | import App from './App';
5 |
6 | ReactDOM.render(
7 | <React.StrictMode>
8 | <App />
9 | </React.StrictMode>,
10 | document.getElementById('root')
11 | );
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/ReactComputerVisionTemplate/src/utilities.js:
--------------------------------------------------------------------------------
1 | // Define our labelmap
2 | const labelMap = {
3 | 1: { name: 'Hello', color: 'red' },
4 | 2: { name: 'Yes', color: 'yellow' },
5 | 3: { name: 'NO', color: 'lime' },
6 | 4: { name: 'Thank_you', color: 'blue' },
7 | 5: { name: 'I_Love_You', color: 'purple' },
8 | }
9 |
10 | let width_scale = 2     // default box scale factors; reassigned per detected sign below
11 | let length_scale = 1.5
12 |
13 |
14 | // Define a drawing function
15 | export const drawRect = (boxes, classes, scores, threshold, imgWidth, imgHeight, ctx) => {
16 | for (let i = 0; i < boxes.length; i++) {
17 | if (boxes[i] && classes[i] && scores[i] > threshold) {
18 | // Extract variables
19 | const [y, x, height, width] = boxes[i]
20 | const text = classes[i]
21 |
22 | // Set styling
23 | ctx.strokeStyle = labelMap[text]['color']
24 | ctx.lineWidth = 10
25 | ctx.fillStyle = 'white'
26 | ctx.font = '30px Arial'
27 |
28 | if (labelMap[text]['name'] == "Hello"){
29 | // Reassign (not re-declare) the scales so the per-sign values actually take effect
30 | width_scale = 2.5
31 | length_scale = 2
32 |
33 | } else if (labelMap[text]['name'] == "Yes") {
34 | width_scale = 1.5
35 | length_scale = 2.5
36 |
37 | } else if (labelMap[text]['name'] == "NO"){
38 |
39 | width_scale = 1.5
40 | length_scale = 2.5
41 |
42 | } else if (labelMap[text]['name'] == "Thank_you"){
43 |
44 | width_scale = 2
45 | length_scale = 2.5
46 | } else if (labelMap[text]['name'] == "I_Love_You"){
47 |
48 | width_scale = 1.2
49 | length_scale = 1.5
50 | } else {
51 |
52 | width_scale = 2
53 | length_scale = 1.5
54 |
55 | }
56 |
57 | // DRAW!!
58 | ctx.beginPath()
59 | ctx.fillText(labelMap[text]['name'] + ' - ' + Math.round(scores[i] * 100) / 100, x * imgWidth, y * imgHeight -10)
60 |
61 | ctx.rect(x * imgWidth, y * imgHeight, width * imgWidth /width_scale, height * imgHeight /length_scale);
62 | ctx.stroke()
63 | }
64 | }
65 | }
--------------------------------------------------------------------------------
/Computer Vision/Real Time Sign Language Interpretation App/Readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Data Visualization/Python/Readme.dm:
--------------------------------------------------------------------------------
1 | Data visualization projects implemented in Python.
2 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/Readme.md:
--------------------------------------------------------------------------------
1 | # Melanoma Classification
2 |
3 | ## 1. Problem Statement
4 |
5 | Classifying melanoma skin lesion images into nine diagnostic classes using deep learning models.
6 |
7 | ## 2. Methods
8 |
9 | ### 2.1. Dataset
10 | The dataset used is the [Skin Lesion Images for Melanoma Classification on Kaggle](https://www.kaggle.com/datasets/andrewmvd/isic-2019). It contains the training data for the ISIC 2019 challenge and already includes data from the previous years (2017 and 2018). The ISIC 2019 dataset provides 25,331 dermoscopic images for classification into nine diagnostic categories:
11 |
12 | * Melanoma
13 | * Melanocytic nevus
14 | * Basal cell carcinoma
15 | * Actinic keratosis
16 | * Benign keratosis (solar lentigo / seborrheic keratosis / lichen planus-like keratosis)
17 | * Dermatofibroma
18 | * Vascular lesion
19 | * Squamous cell carcinoma
20 | * None of the above
21 |
22 | 
23 |
24 | ### 2.2. Data Preprocessing
25 |
26 |
27 | ### 2.3. Feature Engineering
28 |
29 |
30 | ### 2.4. Models
31 | Fine-tuned pretrained CNN-based models. The models used are:
32 |
33 | * VGG-16
34 | * VGG-19
35 | * ResNet-50
36 | * Mobile-Net
37 |
38 |
39 | ## 3. Results
40 | Since the data is imbalanced, the F1 score is used as the evaluation metric rather than plain accuracy.
41 |
42 |
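43 | As a minimal sketch (using scikit-learn rather than the custom Keras metrics in `evaluation-metrics/f1_score.py`), a support-weighted F1 score can be computed from true and predicted class indices like this:
44 |
45 | ```python
46 | import numpy as np
47 | from sklearn.metrics import classification_report, f1_score
48 |
49 | # Illustrative labels only; in practice use the test-set ground truth and the model's argmax predictions.
50 | y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
51 | y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 0, 2, 2])
52 |
53 | print(classification_report(y_true, y_pred))
54 | print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
55 | ```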
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/deep-learning-models/CNN_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 19:00:48 2021
4 |
5 | @author: youss
6 | """
7 | import tensorflow.compat.v1 as tf
8 | from tensorflow.keras.models import Sequential
9 | from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten
10 | from tensorflow.keras.layers import Dense, BatchNormalization, Dropout, Activation
11 |
12 | # https://keras.io/api/applications/
13 | def simple_CNN(train_data_shape,n_classes):
14 | # building a linear stack of layers with the sequential model
15 |
16 | model = Sequential()
17 | model.add(Conv2D(32, (3, 3), input_shape=train_data_shape[1:]))
18 | model.add(Activation('relu'))
19 | model.add(MaxPooling2D(pool_size=(2, 2)))
20 |
21 | model.add(Conv2D(64, (3, 3)))
22 | model.add(Activation('relu'))
23 | model.add(MaxPooling2D(pool_size=(2, 2)))
24 |
25 | model.add(Conv2D(128, (3, 3)))
26 | model.add(Activation('relu'))
27 | model.add(MaxPooling2D(pool_size=(2, 2)))
28 |
29 | model.add(Flatten())
30 | model.add(Dense(64))
31 | model.add(Activation('relu'))
32 | model.add(Dropout(0.5))
33 | model.add(Dense(n_classes))
34 | model.add(Activation('sigmoid'))
35 |
36 | model.summary()
37 | return model
38 |
39 | def MobileNet(num_classes, is_trainable ):
40 |
41 | pretrained_model=tf.keras.applications.MobileNet(
42 | input_shape=(224, 224, 3),
43 | alpha=1.0,
44 | depth_multiplier=1,
45 | dropout=0.001,
46 | include_top=False,
47 | weights="imagenet")
48 |
49 | for layer in pretrained_model.layers[0:18]:
50 | layer.trainable = is_trainable
51 |
52 | model = Sequential()
53 | # first (and only) set of FC => RELU layers
54 | model.add(Flatten())
55 | model.add(Dense(200, activation='relu'))
56 | model.add(Dropout(0.5))
57 | model.add(BatchNormalization())
58 | model.add(Dense(400, activation='relu'))
59 | model.add(Dropout(0.5))
60 | model.add(BatchNormalization())
61 |
62 | # softmax classifier
63 | model.add(Dense(num_classes,activation='softmax'))
64 | pretrainedInput = pretrained_model.input
65 | pretrainedOutput = pretrained_model.output
66 | output = model(pretrainedOutput)
67 | model = tf.keras.models.Model(pretrainedInput, output)
68 | model.summary()
69 | return model
70 |
71 | def VGG_16(num_classes,is_trainable):
72 | from tensorflow.keras.applications.vgg16 import VGG16
73 |
74 | pretrained_model = VGG16(
75 | include_top=False,
76 | input_shape=(224, 224, 3),
77 | weights='imagenet')
78 |
79 | for layer in pretrained_model.layers:
80 | layer.trainable = is_trainable
81 |
82 | model = Sequential()
83 | # first (and only) set of FC => RELU layers
84 | model.add(Flatten())
85 | model.add(Dense(200, activation='relu'))
86 | model.add(Dropout(0.5))
87 | model.add(BatchNormalization())
88 | model.add(Dense(400, activation='relu'))
89 | model.add(Dropout(0.5))
90 | model.add(BatchNormalization())
91 |
92 | # softmax classifier
93 | model.add(Dense(num_classes,activation='softmax'))
94 | pretrainedInput = pretrained_model.input
95 | pretrainedOutput = pretrained_model.output
96 | output = model(pretrainedOutput)
97 | model = tf.keras.models.Model(pretrainedInput, output)
98 | model.summary()
99 | return model
100 |
101 | def Inception_v3(num_classes,is_trainable):
102 | pretrained_model= tf.keras.applications.InceptionV3(
103 | include_top=False,
104 | weights="imagenet",
105 | input_tensor=None,
106 | input_shape=(224, 224, 3),
107 | pooling='max')
108 | for layer in pretrained_model.layers[0:150]:
109 | layer.trainable = is_trainable
110 | model = Sequential()
111 | # first (and only) set of FC => RELU layers
112 | model.add(Flatten())
113 | model.add(Dense(200, activation='relu'))
114 | model.add(Dropout(0.5))
115 | model.add(BatchNormalization())
116 | model.add(Dense(400, activation='relu'))
117 | model.add(Dropout(0.5))
118 | model.add(BatchNormalization())
119 |
120 | # softmax classifier
121 | model.add(Dense(num_classes,activation='softmax'))
122 | pretrainedInput = pretrained_model.input
123 | pretrainedOutput = pretrained_model.output
124 | output = model(pretrainedOutput)
125 | model = tf.keras.models.Model(pretrainedInput, output)
126 | model.summary()
127 | return model
128 |
129 | def InceptionResNetV2(num_classes,is_trainable):
130 | pretrained_model=tf.keras.applications.InceptionResNetV2(
131 | include_top=False,
132 | weights="imagenet",
133 | input_tensor=None,
134 | input_shape=(224,224,3))
135 |
136 | for layer in pretrained_model.layers[0:450]:
137 | layer.trainable = is_trainable
138 | model = Sequential()
139 | # first (and only) set of FC => RELU layers
140 | model.add(Flatten())
141 | model.add(Dense(32, activation='relu'))
142 | model.add(Dropout(0.5))
143 | model.add(BatchNormalization())
144 |
145 | model.add(Dense(64, activation='relu'))
146 | model.add(Dropout(0.5))
147 | model.add(BatchNormalization())
148 |
149 | model.add(Dense(128, activation='relu'))
150 | model.add(Dropout(0.5))
151 | model.add(BatchNormalization())
152 |
153 | model.add(Dense(256, activation='relu'))
154 | model.add(Dropout(0.5))
155 | model.add(BatchNormalization())
156 |
157 | model.add(Dense(512, activation='relu'))
158 | model.add(Dropout(0.5))
159 | model.add(BatchNormalization())
160 |
161 |
162 | # softmax classifier
163 | model.add(Dense(num_classes,activation='softmax'))
164 | pretrainedInput = pretrained_model.input
165 | pretrainedOutput = pretrained_model.output
166 | output = model(pretrainedOutput)
167 | model = tf.keras.models.Model(pretrainedInput, output)
168 | model.summary()
169 | return model
170 |
171 |
172 |
173 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/deep-learning-models/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 19:29:24 2021
4 |
5 | @author: youss
6 | """
7 |
8 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/deep-learning-models/main.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Mon Mar 22 04:58:28 2021
4 |
5 | @author: youss
6 | """
7 | import sys
8 |
9 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/deep learning models')
10 | from CNN_model import Inception_v3
11 | from CNN_model import VGG_16
12 | from CNN_model import simple_CNN
13 | from CNN_model import MobileNet
14 | from CNN_model import InceptionResNetV2
15 |
16 | def select_CNN_model(model_name,num_classes,trainable,input_shape):
17 |
18 | if model_name == 'simple_CNN':
19 | model= simple_CNN(input_shape,num_classes)
20 |
21 | elif model_name=='MobileNet':
22 | model=MobileNet(num_classes,trainable)
23 |
24 | elif model_name=='VGG-16':
25 | model=VGG_16(num_classes,trainable)
26 |
27 | elif model_name=='Inception-v3':
28 | model=Inception_v3(num_classes,trainable)
29 |
30 | elif model_name=='InceptionResNetV2':
31 | model=InceptionResNetV2(num_classes,trainable)
32 |
33 | else:
34 | print("Error value : There is no model with the following name",model_name)
35 | return
36 |
37 | return model
38 |
39 | def getLayerIndexByName(model, layername):
40 |
41 | for idx, layer in enumerate(model.layers):
42 | if layer.name == layername:
43 | return idx
44 | return None
45 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/deep-learning-models/readme.md:
--------------------------------------------------------------------------------
1 | The deep learning models used are a simple CNN and fine-tuned pretrained networks: MobileNet, VGG-16, Inception-v3, and InceptionResNetV2 (defined in `CNN_model.py` and selected by name in `main.py`).
2 |
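3 | A model is selected by name through `select_CNN_model` in this folder's `main.py` and trained with `training_model` from `training.py`. A minimal usage sketch (the argument values are illustrative):
4 |
5 | ```python
6 | from main import select_CNN_model
7 |
8 | # e.g. fine-tune VGG-16 for 8 classes on 224x224 RGB inputs, keeping the pretrained layers frozen
9 | model = select_CNN_model('VGG-16', 8, False, (None, 224, 224, 3))
10 | ```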
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/deep-learning-models/training.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 19:31:01 2021
4 |
5 | @author: youss
6 | """
7 | import numpy as np
8 | import sys
9 |
10 | import matplotlib.pyplot as plt
11 | from keras.callbacks import EarlyStopping
12 | from sklearn.utils import class_weight
13 | from tensorflow.keras.optimizers import SGD
14 |
15 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/evaluation metrics')
16 | from f1_score import f1,f1_micro
17 | from classification_metrics import confusion_matrix_calc
18 |
19 | def training_model(model,train_data,train_labels,val_data,val_labels,test_data,test_labels,num_epoch,batch_size,num_classes,evaulation_metric):
20 |
21 | class_weights = class_weight.compute_class_weight('balanced',np.unique(train_labels.values.argmax(axis=1)),train_labels.values.argmax(axis=1))
22 | sgd = SGD(lr=1e-3, decay=1e-6, momentum=0.9, nesterov=True)
23 | es = EarlyStopping(monitor='val_'+evaulation_metric, mode='max', verbose=1,patience=10,baseline=0.5,min_delta=0.1)
24 |
25 | if evaulation_metric=='accuracy':
26 | model.compile(loss='binary_crossentropy',optimizer=sgd,metrics=['accuracy'])
27 | elif evaulation_metric=='f1':
28 | model.compile(loss='binary_crossentropy',optimizer=sgd,metrics=[f1])
29 | elif evaulation_metric=='f1_micro':
30 | model.compile(loss='binary_crossentropy',optimizer=sgd,metrics=[f1_micro])
31 |
32 |
33 | history=model.fit(train_data,train_labels, validation_data=(val_data,val_labels),epochs=num_epoch, batch_size=batch_size,class_weight=class_weights,callbacks=[es])
34 | score=model.evaluate(test_data,test_labels)
35 | print(f'Test loss: {score[0]} / Test' + ' ' + evaulation_metric + f'score: {score[1]}')
36 | plotting_train_val_metrics(history,evaulation_metric)
37 | predicted_train_labels=model.predict(train_data)
38 | predicted_val_labels=model.predict(val_data)
39 | predicted_test_labels=model.predict(test_data)
40 | confusion_matrix_calc(predicted_train_labels,train_labels,num_classes,'confusion matrix of the training data')
41 | confusion_matrix_calc(predicted_val_labels,val_labels,num_classes,'confusion matrix of the validation data')
42 | confusion_matrix_calc(predicted_test_labels,test_labels,num_classes,'confusion matrix of the test data')
43 |
44 | return None
45 |
46 | def plotting_train_val_metrics(history,evaulation_metric):
47 | plt.plot(history.history[evaulation_metric])
48 | plt.plot(history.history['val_'+ evaulation_metric])
49 | plt.title('training and val ' + evaulation_metric)
50 | plt.ylabel(evaulation_metric + ' score')
51 | plt.xlabel('epoch')
52 | plt.legend(['train', 'val'], loc='upper left')
53 | plt.show()
54 | # summarize history for loss
55 | plt.plot(history.history['loss'])
56 | plt.plot(history.history['val_loss'])
57 | plt.title('model loss')
58 | plt.ylabel('loss')
59 | plt.xlabel('epoch')
60 | plt.legend(['train', 'val'], loc='upper left')
61 | plt.show()
62 | return None
63 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/evaluation-metrics/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 19:21:12 2021
4 |
5 | @author: youss
6 | """
7 |
8 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/evaluation-metrics/classification_metrics.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Wed Mar 17 13:57:42 2021
4 |
5 | @author: youss
6 | """
7 | from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
8 | import numpy as np
9 | import matplotlib.pyplot as plt
10 |
11 | def confusion_matrix_calc(predicted_labels,true_labels,num_classes,title):
12 | positions = np.arange(0,num_classes)
13 | classes = np.arange(0,num_classes)
14 | cm=confusion_matrix(predicted_labels.argmax(axis=1), true_labels.values.argmax(axis=1),labels=classes)
15 | disp=ConfusionMatrixDisplay(cm,display_labels=classes)
16 | classes_name = true_labels.columns.values
17 | plt.figure(figsize=(10,10))
18 | disp.plot()
19 | plt.xticks(positions, classes_name)
20 | plt.yticks(positions, classes_name)
21 | plt.title(title)
22 | return
23 |
24 |
25 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/evaluation-metrics/f1_score.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 19:10:17 2021
4 |
5 | @author: youss
6 | """
7 | import tensorflow.keras.backend as K
8 | import tensorflow.compat.v1 as tf
9 |
10 | def f1(y_true, y_pred):
11 | y_pred = K.round(y_pred)
12 | tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0)
13 | tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0)
14 | fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0)
15 | fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)
16 |
17 | p = tp / (tp + fp + K.epsilon())
18 | r = tp / (tp + fn + K.epsilon())
19 |
20 | f1 = 2*p*r / (p+r+K.epsilon())
21 | f1 = tf.where(tf.is_nan(f1), tf.zeros_like(f1), f1)
22 | return K.mean(f1)
23 |
24 | def f1_loss(y_true, y_pred):
25 |
26 | tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0)
27 | tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0)
28 | fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0)
29 | fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)
30 |
31 | p = tp / (tp + fp + K.epsilon())
32 | r = tp / (tp + fn + K.epsilon())
33 |
34 | f1 = 2*p*r / (p+r+K.epsilon())
35 | f1 = tf.where(tf.is_nan(f1), tf.zeros_like(f1), f1)
36 | return 1 - K.mean(f1)
37 |
38 | def f1_micro(y_true, y_pred):
39 | y_pred = K.round(y_pred)
40 | tp_per_class = K.sum(K.cast(y_true*y_pred, 'float'), axis=1)
41 | tn_per_class = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=1)
42 | fp_per_class = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=1)
43 | fn_per_class = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=1)
44 |
45 | p_per_class = tp_per_class / (tp_per_class + fp_per_class + K.epsilon())
46 | r_per_class = tp_per_class / (tp_per_class + fn_per_class + K.epsilon())
47 |
48 | f1_per_class = 2*p_per_class*r_per_class / (p_per_class+r_per_class+K.epsilon())
49 | f1_total= K.sum(f1_per_class*K.sum(y_true,axis=1))/ K.sum(y_true)
50 |
51 | return f1_total
52 |
53 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/evaluation-metrics/readme.md:
--------------------------------------------------------------------------------
1 | The evaluation metrics used to evaluate the classification models: custom Keras F1 metrics (`f1_score.py`) and confusion-matrix plots (`classification_metrics.py`).
2 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/loading and storing/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 17:23:51 2021
4 |
5 | @author: youss
6 | """
7 |
8 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/loading and storing/loading_images.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Wed Mar 24 02:40:27 2021
4 |
5 | @author: youss
6 | """
7 | import cv2
8 | import os
9 |
10 |
11 | def load_images_from_folder(folder, width, height):
12 |     images = []
13 |     i = 0
14 |
15 |     for filename in os.listdir(folder):
16 |         img = cv2.imread(os.path.join(folder, filename))
17 |         if img is not None:  # skip files OpenCV could not read
18 |             img = cv2.resize(img, (width, height))
19 |             images.append(img)
20 |             i = i + 1
21 |     return images
22 |
23 |
24 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/loading and storing/loading_storing_h5py.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 16:55:49 2021
4 |
5 | @author: youss
6 | """
7 | import numpy as np
8 | import h5py
9 | import os
10 |
11 |
12 | def storing_h5py(input_data,hdf5_dir):
13 | for i in range (len(input_data)):
14 | image_id=i
15 | image= input_data[i]
16 | file = h5py.File(os.path.join(hdf5_dir,str(image_id)+'.h5'), "w")
17 | dataset = file.create_dataset("image", np.shape(image), h5py.h5t.STD_U8BE, data=image)
18 | file.close()
19 |
20 | def read_h5py(hdf5_dir,num_images):
21 | images=[]
22 | for i in range(num_images):
23 | image_id=i
24 | file = h5py.File(os.path.join(hdf5_dir,str(image_id)+'.h5'), "r+")
25 | image = np.array(file["/image"]).astype("uint8")
26 | images.append(image)
27 | return images
28 |
29 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/loading and storing/readme.md:
--------------------------------------------------------------------------------
1 | Loading the input images (`loading_images.py`) and storing/reading them in HDF5 format (`loading_storing_h5py.py`).
2 |
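3 | A minimal usage sketch (the folder paths are placeholders):
4 |
5 | ```python
6 | from loading_images import load_images_from_folder
7 | from loading_storing_h5py import storing_h5py, read_h5py
8 |
9 | # Read and resize the raw images, then cache them as one .h5 file per image
10 | images = load_images_from_folder('path/to/ISIC_2019_Training_Input', 224, 224)
11 | storing_h5py(images, 'path/to/h5_cache')
12 | images_again = read_h5py('path/to/h5_cache', num_images=len(images))
13 | ```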
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/main.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 16:54:05 2021
4 |
5 | @author: youssef Hosni
6 | """
7 | import pandas as pd
8 | import numpy as np
9 | import sys
10 |
11 | import tensorflow.compat.v1 as tf
12 |
13 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/loading and storing')
14 | from loading_images import load_images_from_folder
15 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/preprocessing')
16 | from exploration import bar_plot,class_counts_proportions
17 | from preprocessing import splitting_normalization
18 | from preprocessing import splitting_classes
19 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/deep learning models')
20 | from main import select_CNN_model
21 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/deep learning models')
22 | from training import training_model
23 |
24 |
25 | print('Using:')
26 | print('\t\u2022 TensorFlow version:', tf.__version__)
27 | print('\t\u2022 tf.keras version:', tf.keras.__version__)
28 | print('\t\u2022 Running on GPU' if tf.config.list_physical_devices('GPU') else '\t\u2022 GPU device not found. Running on CPU')
29 |
30 | #%% Loading the png data and split and normalize it
31 | images_dir = "D:\\work & study\\Nawah\\Datasets\\ISIC_2019_Training_Input\\ISIC_2019_Training_Input"
32 | width = 224
33 | height = 224
34 | input_data = load_images_from_folder(images_dir,width,height)
35 | #%% preprocessing
36 | labels = pd.read_csv("D:/work & study/Nawah/Datasets/ISIC_2019_Training_GroundTruth.csv")
37 | labels=labels.iloc[:,1:]
38 | labels.head()
39 | #%% splitting the data and normalizing it
40 | train_data,train_labels, val_data,val_labels,test_data,test_labels = splitting_normalization(
41 | input_data,
42 | labels
43 | )
44 |
45 | #%% dividing the data into datasets one with two classes and one with 8 classes
46 | [train_data_small_classes,
47 | train_labels_small_classes,
48 | train_data_labels_two_classes] = splitting_classes(train_data,train_labels)
49 |
50 | [val_data_small_classes,
51 | val_labels_small_classes,
52 | val_data_labels_two_classes] = splitting_classes(val_data,val_labels)
53 |
54 | [test_data_small_classes,
55 | test_labels_small_classes,
56 | test_data_labels_two_classes] = splitting_classes(test_data,test_labels)
57 |
58 |
59 | #%% undersampling the data
60 | sys.path.insert(0,'D:/work & study/Nawah/Datasets/codes/preprocessing')
61 | from preprocessing import resampling
62 | resampling_stragey = {1:4500}
63 | resampled_training_data, resampled_labels = resampling(
64 | train_data,
65 | train_labels,
66 | 'under_sampling',
67 | resampling_stragey
68 | )
69 | resampled_training_data = resampled_training_data.reshape(
70 | resampled_training_data.shape[0],
71 | 224,
72 | 224,
73 | 3
74 | )
75 | resampled_labels=pd.DataFrame(resampled_labels)
76 | resampled_labels.columns = train_labels.columns[0:-1]
77 | resampled_labels['UNK'] = 0
78 |
79 | #%% Oversampling the data
80 | resampling_stragey = {2:2000,4:1700,5:1700,6:2000}
81 | resampled_training_data_small, resampled_labels_small= resampling(
82 | train_data_small_classes,train_labels_small_classes,
83 | 'over_sampling',resampling_stragey
84 | )
85 | resampled_training_data_small = resampled_training_data_small.reshape(
86 | resampled_training_data_small.shape[0],
87 | 224,224,3
88 | )
89 | resampled_labels_small = pd.DataFrame(resampled_labels_small)
90 | resampled_labels_small.columns=train_labels_small_classes.columns[0:-1]
91 | resampled_labels_small['UNK']=0
92 | #%% Building the CNN model and training it
93 | models_name_list = [
94 | 'simple_CNN',
95 | 'MobileNet',
96 | 'VGG-16',
97 | 'Inception-v3',
98 | 'InceptionResNetV2'
99 | ]
100 | model_name=models_name_list[3]
101 | is_trainable=False
102 | epoch_num=100
103 | batch_num=32
104 | evaluation_metrics_list=[
105 | 'accuracy',
106 | 'f1',
107 | 'f1_micro'
108 | ]
109 | evaluation_metric = evaluation_metrics_list[2]
110 | model = select_CNN_model(model_name,8,is_trainable, np.shape(resampled_training_data_small))
111 | training_model(model,resampled_training_data_small,resampled_labels_small,val_data_small_classes,val_labels_small_classes,test_data_small_classes,
112 | test_labels_small_classes,epoch_num,batch_num,8,evaluation_metric)
113 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/preprocessing/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Mar 21 06:45:28 2021
4 |
5 | @author: youss
6 | """
7 |
8 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/preprocessing/exploration.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Sun Mar 21 04:05:02 2021
4 |
5 | @author: youss
6 | """
7 | import pandas as pd
8 | import matplotlib.pyplot as plt
9 |
10 | def bar_plot(input_data,title):
11 | input_data.columns
12 | df=pd.DataFrame()
13 | df["disease"]=input_data.columns
14 | number_cases_per_class=input_data.sum()
15 | df["number_of_cases"]=number_cases_per_class.values
16 | plt.figure()
17 | plt.bar(df["disease"],df["number_of_cases"])
18 | plt.title(title)
19 |
20 | def class_counts_proportions(labels):
21 | df=pd.DataFrame()
22 | df["Label"]=labels.columns
23 | number_cases_per_class=labels.sum()
24 | df["number_of_cases_each_class"]=number_cases_per_class.values
25 | df["percentage_of_classes"]=number_cases_per_class.values/sum(number_cases_per_class.values)
26 | return df
27 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/preprocessing/preprocessing.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on Tue Mar 16 22:20:13 2021
4 | @author: youss
5 | """
6 |
7 | import numpy as np
8 | import time
9 | import cv2
10 |
11 | from multiprocessing.dummy import Pool
12 | from multiprocessing.sharedctypes import Value
13 | from ctypes import c_int
14 | from sklearn.model_selection import train_test_split
15 | from imblearn.over_sampling import SMOTE, RandomOverSampler
16 | from imblearn.under_sampling import RandomUnderSampler, TomekLinks,NearMiss
17 | from imblearn.under_sampling import OneSidedSelection
18 |
19 | def splitting_normalization(input_data, labels):
20 | train_data,val_data,train_labels,val_labels = train_test_split(input_data,labels, test_size=0.3, random_state=42)
21 | val_data,test_data,val_labels,test_labels = train_test_split(val_data,val_labels, test_size=0.33, random_state=42)
22 |
23 | # building the input vector from the 28x28 pixels
24 | train_data=np.array(train_data)
25 | val_data=np.array(val_data)
26 | test_data=np.array(test_data)
27 |
28 | train_data = train_data.astype('float32')
29 | val_data = val_data.astype('float32')
30 | test_data = test_data.astype('float32')
31 | print(train_data.shape)
32 | print(val_data.shape)
33 | print(test_data.shape)
34 |
35 | # normalizing the data to help with the training
36 | train_data /= 255
37 | val_data /= 255
38 | test_data /= 255
39 | return train_data,train_labels, val_data,val_labels,test_data,test_labels
40 |
41 |
42 |
43 | def splitting_classes(input_data,input_labels):
44 |
45 | """
46 | Parameters
47 | ----------
48 | input_data : Array
49 | The input data with all classes .
50 | input_labels : DataFrame
51 | The input labels of data with all classes.
52 |
53 | Returns
54 | -------
55 | ouput_data_small_classes : Array
56 | The input data with all classes .
57 | output_labels_small_classes : DataFrame
58 | The input labels of data with all classes.
59 | output_labels_two_classes : DataFrame
60 | The input labels of data with all classes.
61 |
62 | """
63 | input_labels.reset_index(drop=True,inplace=True)
64 | output_labels_small_classes=input_labels[input_labels['NV']==0]
65 | output_labels_small_classes.drop(columns='NV',inplace=True)
66 | ouput_data_small_classes=input_data[output_labels_small_classes.index.values]
67 |
68 |
69 | output_labels_two_classes=input_labels.copy()
70 | output_labels_two_classes['other_classes']=0
71 | output_labels_two_classes.iloc[output_labels_small_classes.index.values,9]=1
72 |
73 | labels_to_drop_index=[0,2,3,4,5,6,7,8]
74 | output_labels_two_classes.drop(columns=output_labels_two_classes.columns[labels_to_drop_index],inplace=True)
75 |
76 | return ouput_data_small_classes, output_labels_small_classes,output_labels_two_classes
77 |
78 | def resizing_data(input_data,width, height):
79 | resized_data=[]
80 | def read_imagecv2(img, counter):
81 | img = cv2.resize(img, (width, height))
82 | resized_data.append(img)
83 | with counter.get_lock(): #processing pools give no way to check up on progress, so we make our own
84 | counter.value += 1
85 |
86 | # start 4 worker processes
87 | with Pool(processes=2) as pool: #this should be the same as your processor cores (or less)
88 | counter = Value(c_int, 0) #using sharedctypes with mp.dummy isn't needed anymore, but we already wrote the code once...
89 | chunksize = 4 #making this larger might improve speed (less important the longer a single function call takes)
90 | resized_test_data = pool.starmap_async(read_imagecv2, ((img, counter) for img in input_data) , chunksize) #how many jobs to submit to each worker at once
91 | while not resized_test_data.ready(): #print out progress to indicate program is still working.
92 | #with counter.get_lock(): #you could lock here but you're not modifying the value, so nothing bad will happen if a write occurs simultaneously
93 | #just don't `time.sleep()` while you're holding the lock
94 | print("\rcompleted {} images ".format(counter.value), end='')
95 | time.sleep(.5)
96 | print('\nCompleted all images')
97 | return resized_data
98 |
99 |
100 |
101 | def resampling(train_data,train_labels,resampling_type,resampling_stragey):
102 | train_data_new=np.reshape(train_data,(train_data.shape[0],train_data.shape[1]*train_data.shape[2]*train_data.shape[3]))
103 | if resampling_type == 'SMOTE':
104 | train_data_resampled,train_labels_resampled = SMOTE(random_state=42).fit_resample(train_data_new, train_labels.values)
105 |
106 | elif resampling_type=='over_sampling':
107 | over_sampler=RandomOverSampler(sampling_strategy=resampling_stragey)
108 | train_data_resampled, train_labels_resampled = over_sampler.fit_resample(train_data_new,train_labels.values)
109 |
110 | elif resampling_type== 'under_sampling':
111 | under_sampler=RandomUnderSampler(sampling_strategy=resampling_stragey)
112 | train_data_resampled, train_labels_resampled = under_sampler.fit_resample(train_data_new,train_labels.values)
113 |
114 | elif resampling_type == 'tomelinks':
115 | t1= TomekLinks( sampling_strategy=resampling_stragey)
116 | train_data_resampled, train_labels_resampled = t1.fit_resample(train_data_new,train_labels.values )
117 |
118 | elif resampling_type=='near_miss_neighbors':
119 | undersample = NearMiss(version=1, n_neighbors=3)
120 | train_data_resampled, train_labels_resampled = undersample.fit_resample(train_data_new,train_labels.values )
121 |
122 | elif resampling_type=='one_sided_selection':
123 | undersample = OneSidedSelection(n_neighbors=1, n_seeds_S=200)
124 | train_data_resampled, train_labels_resampled = undersample.fit_resample(train_data_new,train_labels.values )
125 |
126 | return train_data_resampled, train_labels_resampled
127 |
128 |
129 |
130 |
131 |
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/preprocessing/readme.md:
--------------------------------------------------------------------------------
1 | Preprocessing for the data: class exploration plots, train/validation/test splitting with normalization, class splitting, resizing, and resampling of the imbalanced classes.
2 |
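3 | A minimal usage sketch of `splitting_normalization` from `preprocessing.py`, with toy stand-in data (the real images and one-hot labels are loaded in the project's top-level `main.py`):
4 |
5 | ```python
6 | import numpy as np
7 | import pandas as pd
8 | from preprocessing import splitting_normalization
9 |
10 | # Toy stand-ins: 20 blank 224x224 RGB "images" and one-hot labels for two classes
11 | input_data = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(20)]
12 | labels = pd.DataFrame(np.eye(2)[np.random.randint(0, 2, 20)], columns=['MEL', 'NV'])
13 |
14 | # Roughly 70/20/10 train/val/test split with per-pixel scaling to [0, 1]
15 | train_x, train_y, val_x, val_y, test_x, test_y = splitting_normalization(input_data, labels)
16 | ```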
--------------------------------------------------------------------------------
/Deep Learning/Classification/Melenoma_Classification/readme.md:
--------------------------------------------------------------------------------
1 | # Melanoma Classification
2 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/AlexNet.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | from keras.layers import LeakyReLU
12 | import os
13 | import CNN_feature_extractor
14 | import model_evaluation
15 | from sklearn.model_selection import train_test_split
16 |
17 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
18 |
19 |
20 |
21 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13)
22 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
23 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
24 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
25 | batch_size=round(train_data.shape[0]/batch_size_factor)
26 |
27 | #Instantiate an empty model
28 | classification_model = Sequential()
29 |
30 | # 1st Convolutional Layer
31 | classification_model.add(Conv2D(filters=96, input_shape=(train_data.shape[1],train_data.shape[2],train_data.shape[3]), kernel_size=(11,11), strides=(4,4), padding='valid', activation='tanh'))
32 | #classification_model.add(LeakyReLU(alpha=0.01))
33 |
34 | # Max Pooling
35 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
36 | classification_model.add(Dropout(0.25))
37 |
38 | # 2nd Convolutional Layer
39 | classification_model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid',activation='sigmoid'))
40 | #classification_model.add(LeakyReLU(alpha=0.01))
41 |
42 | # Max Pooling
43 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
44 | classification_model.add(Dropout(0.25))
45 |
46 | # 3rd Convolutional Layer
47 | classification_model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid',activation='sigmoid'))
48 | #classification_model.add(LeakyReLU(alpha=0.01))
49 |
50 | # 4th Convolutional Layer
51 | classification_model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='valid',activation='sigmoid'))
52 | #classification_model.add(LeakyReLU(alpha=0.01))
53 |
54 |
55 | # 5th Convolutional Layer
56 | classification_model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='valid',activation='sigmoid'))
57 | #classification_model.add(LeakyReLU(alpha=0.01))
58 |
59 | # Max Pooling
60 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
61 | classification_model.add(Dropout(0.25))
62 |
63 | # Passing it to a Fully Connected layer
64 | classification_model.add(Flatten())
65 | # 1st Fully Connected Layer
66 | classification_model.add(Dense(4096,activation='sigmoid'))
67 | #classification_model.add(LeakyReLU(alpha=0.01))
68 |
69 |
70 | # Add Dropout to prevent overfitting
71 | classification_model.add(Dropout(0.5))
72 |
73 | # 2nd Fully Connected Layer
74 | classification_model.add(Dense(4096,activation='sigmoid'))
75 |
76 | # Add Dropout
77 | classification_model.add(Dropout(0.5))
78 |
79 | # 3rd Fully Connected Layer
80 | classification_model.add(Dense(1000,activation='sigmoid'))
81 | #classification_model.add(LeakyReLU(alpha=0.01))
82 |
83 | # Add Dropout
84 | classification_model.add(Dropout(0.5))
85 |
86 | # Output Layer
87 | classification_model.add(Dense(num_classes,activation='softmax'))
88 |
89 | classification_model.summary()
90 |
91 | # Compile the classification_model
92 | classification_model.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
93 |
94 | es=keras.callbacks.EarlyStopping(monitor='val_acc',
95 | min_delta=0,
96 | patience=5000,
97 | verbose=1, mode='auto')
98 |
99 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True)
100 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
101 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[mc,es])
102 | best_model=load_model(os.path.join(result_path,'best_model.h5'))
103 | file_name=os.path.split(result_path)[1]
104 | date=os.path.split(os.path.split(result_path)[0])[1]
105 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'AlexNet_model.h5'))
106 |
107 |
108 | if feature_extraction==1:
109 | feature_extractor_parameters['CNN_model']=classification_model
110 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path)
111 | return
112 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'Alexnet', result_path,epoch)
113 |
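114 | # Illustrative call (a sketch only; in this repository the architectures are presumably
115 | # dispatched from CNN.py via main.py, which is where the values below come from):
116 | # import optimizers, generate_result_
117 | # opt = optimizers.choosing('adam')
118 | # result_path = generate_result_.create_results_dir('Results')
119 | # model(train_data, train_labels, test_data, test_labels, opt,
120 | #       epoch=1000, batch_size_factor=1, num_classes=2,
121 | #       result_path=result_path, feature_extraction=0, feature_extractor_parameters={})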
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/DenseNet121.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | import os
12 | import CNN_feature_extractor
13 | import model_evaluation
14 | from sklearn.model_selection import train_test_split
15 |
16 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
17 |
18 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13)
19 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
20 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
21 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
22 | '''
23 | train_data=data_preprocessing.depth_reshapeing(train_data)
24 | test_data=data_preprocessing.depth_reshapeing(test_data)
25 | valid_data=data_preprocessing.depth_reshapeing(valid_data)
26 |
27 | train_data = data_preprocessing.size_editing(train_data, 224)
28 | valid_data= data_preprocessing.size_editing(valid_data, 224)
29 | test_data = data_preprocessing.size_editing(test_data, 224)
30 | '''
31 | batch_size=round(train_data.shape[0]/batch_size_factor)
32 | input_shape= (224,224,3)
33 | densenet121_model=keras.applications.densenet.DenseNet121(include_top=False, weights='imagenet', input_shape=input_shape, pooling=None, classes=num_classes)
34 |
35 |
36 |
37 | layer_dict = dict([(layer.name, layer) for layer in densenet121_model.layers])
38 | # Getting the output tensor of the last DenseNet121 base layer that we want to include
39 | x = layer_dict[list(layer_dict.keys())[-1]].output
40 | x = MaxPooling2D(pool_size=(2, 2))(x)
41 |
42 | x = Flatten()(x)
43 | x = Dense(4096, activation='relu')(x)
44 | x = Dropout(0.5)(x)
45 | x = Dense(4096, activation='relu')(x)
46 | x = Dropout(0.5)(x)
47 | x = Dense(num_classes, activation='softmax')(x)
48 | classification_model = Model(inputs=densenet121_model.input, outputs=x)
49 |
50 |
51 | for layer in classification_model.layers:
52 | layer.trainable = True
53 | classification_model.compile(loss='mean_squared_error',optimizer=opt,metrics=['accuracy'])
54 | es = keras.callbacks.EarlyStopping(monitor='val_acc',
55 | min_delta=0,
56 | patience=500,
57 | verbose=1, mode='auto')
58 | mc = ModelCheckpoint(os.path.join(result_path, 'best_model.h5'), monitor='val_acc', mode='auto',
59 | save_best_only=True)
60 |
61 |
62 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
63 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc ])
64 | file_name=os.path.split(result_path)[1]
65 | date=os.path.split(os.path.split(result_path)[0])[1]
66 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'DenseNet121.h5'))
67 | #best_model=load_model(os.path.join(result_path,'best_model.h5'))
68 | best_model=classification_model
69 | if feature_extraction==1:
70 | feature_extractor_parameters['CNN_model']=classification_model
71 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path)
72 | return
73 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'DenseNet121', result_path,epoch)
74 |
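75 | # Note: the base network above is built with input_shape=(224, 224, 3), so the arrays fed to
76 | # this function must already match that shape; the commented-out depth_reshapeing /
77 | # size_editing calls are presumably what produce it when enabled.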
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/InceptionResNetV2.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | import os
12 | import CNN_feature_extractor
13 | import model_evaluation
14 | from sklearn.model_selection import train_test_split
15 |
16 |
17 | def model (train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
18 |
19 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13)
20 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
21 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
22 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
23 |
24 | '''
25 | train_data=data_preprocessing.depth_reshapeing(train_data)
26 | test_data=data_preprocessing.depth_reshapeing(test_data)
27 | valid_data=data_preprocessing.depth_reshapeing(valid_data)
28 |
29 | train_data = data_preprocessing.size_editing(train_data, 224)
30 | valid_data= data_preprocessing.size_editing(valid_data, 224)
31 | test_data = data_preprocessing.size_editing(test_data, 224)
32 | '''
33 | input_shape= (224,224,3)
34 | batch_size=round(train_data.shape[0]/batch_size_factor)
35 | inception_model=keras.applications.inception_resnet_v2.InceptionResNetV2(include_top=False, weights='imagenet',input_shape=input_shape, pooling=None)
36 |
37 |
38 | layer_dict = dict([(layer.name, layer) for layer in inception_model.layers])
39 | # Getting the output tensor of the last InceptionResNetV2 base layer that we want to include
40 | x = layer_dict[list(layer_dict.keys())[-1]].output
41 | x = MaxPooling2D(pool_size=(2, 2))(x)
42 |
43 | x = Flatten()(x)
44 | x = Dense(4096, activation='relu')(x)
45 | x = Dropout(0.5)(x)
46 | x = Dense(4096, activation='relu')(x)
47 | x = Dropout(0.5)(x)
48 | x = Dense(num_classes, activation='softmax')(x)
49 | classification_model = Model(inputs=inception_model.input, outputs=x)
50 |
51 |
52 | for layer in classification_model.layers:
53 | layer.trainable = True
54 | classification_model.compile(loss='mean_squared_error',optimizer=opt,metrics=['accuracy'])
55 |
56 | es = keras.callbacks.EarlyStopping(monitor='val_acc',
57 | min_delta=0,
58 | patience=500,
59 | verbose=1, mode='auto')
60 | mc = ModelCheckpoint(os.path.join(result_path, 'best_model.h5'), monitor='val_acc', mode='auto',
61 | save_best_only=True)
62 |
63 |
64 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
65 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc ])
66 |
67 | #best_model=load_model(os.path.join(result_path,'best_model.h5'))
68 | best_model=classification_model
69 | file_name=os.path.split(result_path)[1]
70 | date=os.path.split(os.path.split(result_path)[0])[1]
71 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'InceptionResNet_model.h5'))
72 |
73 | if feature_extraction==1:
74 | feature_extractor_parameters['CNN_model']=classification_model
75 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path)
76 | return
77 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'InceptionResNetV2', result_path,epoch)
78 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/LeNet.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | from keras.layers import LeakyReLU
12 | import os
13 | import CNN_feature_extractor
14 | import model_evaluation
15 | from sklearn.model_selection import train_test_split
16 |
17 | def model (train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
18 |
19 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13)
20 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
21 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
22 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
23 | batch_size=round(train_data.shape[0]/batch_size_factor)
24 | classification_model = Sequential()
25 | # C1 Convolutional Layer
26 | classification_model.add(layers.Conv2D(6, kernel_size=(5, 5), strides=(1, 1),activation='relu', input_shape=(train_data.shape[1], train_data.shape[2], train_data.shape[3]), padding='same'))
27 |
28 | classification_model.add(Dropout(0.7))
29 |
30 | # S2 Pooling Layer
31 | classification_model.add(layers.AveragePooling2D(pool_size=(2, 2), strides=(1, 1), padding='valid'))
32 | # C3 Convolutional Layer
33 | classification_model.add(layers.Conv2D(16, kernel_size=(5, 5), strides=(1, 1), padding='valid',activation='relu'))
34 |
35 | classification_model.add(Dropout(0.7))
36 |
37 | # S4 Pooling Layer
38 | classification_model.add(layers.AveragePooling2D(pool_size=(2, 2), strides=(2, 2), padding='valid'))
39 | # C5 Fully Connected Convolutional Layer
40 | classification_model.add(layers.Conv2D(120, kernel_size=(5, 5), strides=(1, 1), padding='valid',activation='relu'))
41 |
42 | classification_model.add(Dropout(0.8))
43 | #Flatten the CNN output so that we can connect it with fully connected layers
44 | classification_model.add(layers.Flatten())
45 | # FC6 Fully Connected Layer
46 | classification_model.add(layers.Dense(84,activation='relu'))
47 |
48 | classification_model.add(Dropout(0.8))
49 |
50 | #Output Layer with softmax activation
51 | classification_model.add(layers.Dense(num_classes, activation='softmax'))
52 |
53 |
54 | classification_model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt, metrics=['accuracy'])
55 |
56 |
57 | es=keras.callbacks.EarlyStopping(monitor='val_acc',
58 | min_delta=0,
59 | patience=100,
60 | verbose=1, mode='auto')
61 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True)
62 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
63 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc])
64 | best_model=load_model(os.path.join(result_path,'best_model.h5'))
65 | file_name=os.path.split(result_path)[1]
66 | date=os.path.split(os.path.split(result_path)[0])[1]
67 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'leNet_model.h5'))
68 | if feature_extraction==1:
69 | feature_extractor_parameters['CNN_model']=classification_model
70 | CNN_feature_extractor.CNN_feature_extraction_classsification(train_data_whole,train_labels_whole,test_data,test_labels,feature_extractor_parameters,result_path)
71 | return
72 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data,test_labels_one_hot,'LeNet',result_path,epoch)
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/ResNet50.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | import os
12 | import CNN_feature_extractor
13 | import model_evaluation
14 | from sklearn.model_selection import train_test_split
15 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
16 |
17 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13)
18 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
19 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
20 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
21 |
22 | '''
23 | train_data=data_preprocessing.depth_reshapeing(train_data)
24 | test_data=data_preprocessing.depth_reshapeing(test_data)
25 | valid_data=data_preprocessing.depth_reshapeing(valid_data)
26 |
27 | train_data = data_preprocessing.size_editing(train_data, 224)
28 | valid_data= data_preprocessing.size_editing(valid_data, 224)
29 | test_data = data_preprocessing.size_editing(test_data, 224)
30 | '''
31 | batch_size=round(train_data.shape[0]/batch_size_factor)
32 | input_shape= (224,224,3)
33 | resnet50_model=keras.applications.resnet50.ResNet50(include_top=False, weights='imagenet', input_shape=input_shape, pooling=None)
34 |
35 | layer_dict = dict([(layer.name, layer) for layer in resnet50_model.layers])
36 | #print(layer_dict)
37 | # Getting the output tensor of the last ResNet50 base layer that we want to include
38 | x = layer_dict[list(layer_dict.keys())[-1]].output
39 | x = MaxPooling2D(pool_size=(2, 2))(x)
40 |
41 | x = Flatten()(x)
42 | x = Dense(4096, activation='relu')(x)
43 | x = Dropout(0.5)(x)
44 | x = Dense(4096, activation='relu')(x)
45 | x = Dropout(0.5)(x)
46 | x = Dense(num_classes, activation='softmax')(x)
47 | classification_model = Model(inputs=resnet50_model.input, outputs=x)
48 |
49 | for layer in classification_model.layers[:len(list(layer_dict.keys()))-50]:
50 | layer.trainable = False
51 | classification_model.compile(loss='mean_squared_error',optimizer=opt,metrics=['accuracy'])
52 | es = keras.callbacks.EarlyStopping(monitor='val_acc',
53 | min_delta=0,
54 | patience=500,
55 | verbose=1, mode='auto')
56 | mc = ModelCheckpoint(os.path.join(result_path, 'best_model.h5'), monitor='val_acc', mode='auto',
57 | save_best_only=True)
58 |
59 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
60 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc])
61 | #best_model=load_model(os.path.join(result_path,'best_model.h5'))
62 | best_model=classification_model
63 | file_name=os.path.split(result_path)[1]
64 | date=os.path.split(os.path.split(result_path)[0])[1]
65 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'ResNet50_model.h5'))
66 |
67 | if feature_extraction==1:
68 | feature_extractor_parameters['CNN_model']=classification_model
69 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path)
70 | return
71 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'ResNet50', result_path,epoch)
72 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/VGG.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | import os
12 | import CNN_feature_extractor
13 | import model_evaluation
14 | from sklearn.model_selection import train_test_split
15 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
16 |
17 |
18 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13)
19 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
20 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
21 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
22 |
23 |
24 | batch_size=round(train_data.shape[0]/batch_size_factor)
25 |
26 | #Instantiate an empty model
27 | classification_model = Sequential()
28 |
29 | # 1st Convolutional Layer
30 | classification_model.add(Conv2D(filters=64, input_shape=(train_data.shape[1],train_data.shape[2],train_data.shape[3]),activation='relu', kernel_size=(3,3), strides=(1,1), padding='same'))
31 | classification_model.add(Conv2D(filters=64,activation='relu', kernel_size=(3,3), strides=(1,1), padding='same'))
32 | classification_model.add(Dropout(0.4))
33 | # Max Pooling
34 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
35 |
36 | # 2nd Convolutional Layer
37 | classification_model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
38 | classification_model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
39 | classification_model.add(Dropout(0.4))
40 | # Max Pooling
41 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
42 |
43 | # 3rd Convolutional Layer
44 | classification_model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
45 | classification_model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
46 | classification_model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
47 | classification_model.add(Dropout(0.4))
48 | # Max Pooling
49 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2)))
50 |
51 | # 4th Convolutional Layer
52 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
53 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
54 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
55 | classification_model.add(Dropout(0.4))
56 | # 5th Convolutional Layer
57 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
58 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
59 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='same',activation='relu'))
60 | classification_model.add(Dropout(0.4))
61 |
62 | # Max Pooling
63 | classification_model.add(MaxPooling2D(pool_size=(2,2), strides=(2,2), padding='valid'))
64 |
65 | # Passing it to a Fully Connected layer
66 | classification_model.add(Flatten())
67 | # 1st Fully Connected Layer
68 | classification_model.add(Dense(4096,activation='relu'))
69 | # Add Dropout to prevent overfitting
70 | classification_model.add(Dropout(0.5))
71 |
72 | # 2nd Fully Connected Layer
73 | classification_model.add(Dense(4096,activation='relu'))
74 |
75 | # Add Dropout
76 | classification_model.add(Dropout(0.5))
77 | # Output Layer
78 | classification_model.add(Dense(num_classes,activation='softmax'))
79 | classification_model.summary()
80 | classification_model.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
81 |
82 | es=keras.callbacks.EarlyStopping(monitor='val_acc',
83 | min_delta=0,
84 | patience=1000,
85 | verbose=1, mode='auto')
86 |
87 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True)
88 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
89 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[mc,es])
90 | best_model=load_model(os.path.join(result_path,'best_model.h5'))
91 | file_name=os.path.split(result_path)[1]
92 | date=os.path.split(os.path.split(result_path)[0])[1]
93 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'VGG_model.h5'))
94 |
95 | if feature_extraction==1:
96 | feature_extractor_parameters['CNN_model']=classification_model
97 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path)
98 | return
99 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'VGG', result_path,epoch)
100 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/VGG_pretrained.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | import os
12 | import CNN_feature_extractor
13 | import model_evaluation
14 | from sklearn.model_selection import train_test_split
15 | from keras.applications.vgg16 import VGG16
16 |
17 |
18 | def model (train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
19 |
20 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13)
21 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
22 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
23 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
24 | batch_size=round(train_data.shape[0]/batch_size_factor)
25 |
26 | input_shape= (224,224,3)
27 | vgg_model = VGG16(weights='imagenet',
28 | include_top=False,
29 | input_shape=input_shape)
30 | # Creating dictionary that maps layer names to the layers
31 | layer_dict = dict([(layer.name, layer) for layer in vgg_model.layers])
32 | # Getting output tensor of the last VGG layer that we want to include
33 | x = layer_dict['block2_pool'].output
34 |
35 | x = Conv2D(filters=64, kernel_size=(3, 3), activation='relu',padding='same')(x)
36 | x = MaxPooling2D(pool_size=(2, 2))(x)
37 | x = Conv2D(filters=128, kernel_size=(3, 3), activation='relu',padding='same')(x)
38 | x = MaxPooling2D(pool_size=(2, 2))(x)
39 | x = Conv2D(filters=128, kernel_size=(3, 3), activation='relu',padding='same')(x)
40 | x = MaxPooling2D(pool_size=(2, 2))(x)
41 |
42 | x = Flatten()(x)
43 | x = Dense(4096, activation='relu')(x)
44 | x = Dropout(0.5)(x)
45 | x = Dense(4096, activation='relu')(x)
46 | x = Dropout(0.5)(x)
47 | x = Dense(num_classes, activation='softmax')(x)
48 |
49 | # Creating new model. Please note that this is NOT a Sequential() model.
50 | classification_model = Model(inputs=vgg_model.input, outputs=x)
51 | for layer in classification_model.layers[:7]:
52 | layer.trainable = True
53 |
54 | es=keras.callbacks.EarlyStopping(monitor='val_acc',
55 | min_delta=0,
56 | patience=5000,
57 | verbose=1, mode='auto')
58 |
59 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True)
60 |
61 | classification_model.compile(loss='mean_squared_error',optimizer=opt,metrics=['accuracy'])
62 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
63 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[mc,es])
64 | best_model=load_model(os.path.join(result_path,'best_model.h5'))
65 | file_name=os.path.split(result_path)[1]
66 | date=os.path.split(os.path.split(result_path)[0])[1]
67 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'VGGpretrained_model.h5'))
68 |
69 | if feature_extraction==1:
70 | feature_extractor_parameters['CNN_model']=classification_model
71 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path)
72 | return
73 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'VGG_pretrained', result_path,epoch)
74 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/ZFNet.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | import os
12 | import CNN_feature_extractor
13 | import model_evaluation
14 | from sklearn.model_selection import train_test_split
15 |
16 |
17 | def model(train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
18 |
19 | train_data_whole = data_preprocessing.size_editing(train_data_whole, 224)
20 | test_data = data_preprocessing.size_editing(test_data, 224)
21 |
22 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.1, random_state=13)
23 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
24 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
25 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
26 |
27 |
28 |
29 | batch_size=round(train_data.shape[0]/batch_size_factor)
30 |
31 | #Instantiate an empty model
32 | classification_model = Sequential()
33 |
34 | # 1st Convolutional Layer
35 | classification_model.add(Conv2D(filters=96, input_shape=(train_data.shape[1],train_data.shape[2],train_data.shape[3]),activation='relu', kernel_size=(7,7), strides=(2,2), padding='valid'))
36 |
37 | # Max Pooling
38 | classification_model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))
39 |
40 | # 2nd Convolutional Layer
41 | classification_model.add(Conv2D(filters=256, kernel_size=(11,11), strides=(1,1), padding='valid',activation='relu'))
42 | # Max Pooling
43 | classification_model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))
44 |
45 | # 3rd Convolutional Layer
46 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='valid',activation='relu'))
47 |
48 | # 4th Convolutional Layer
49 | classification_model.add(Conv2D(filters=1024, kernel_size=(3,3), strides=(1,1), padding='valid',activation='relu'))
50 |
51 |
52 | # 5th Convolutional Layer
53 | classification_model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), padding='valid',activation='relu'))
54 | # Max Pooling
55 | classification_model.add(MaxPooling2D(pool_size=(3,3), strides=(2,2), padding='valid'))
56 |
57 | # Passing it to a Fully Connected layer
58 | classification_model.add(Flatten())
59 | # 1st Fully Connected Layer
60 | classification_model.add(Dense(4096,activation='relu'))
61 | # Add Dropout to prevent overfitting
62 | classification_model.add(Dropout(0.4))
63 |
64 | # 2nd Fully Connected Layer
65 | classification_model.add(Dense(4096))
66 |
67 | # Add Dropout
68 | classification_model.add(Dropout(0.4))
69 |
70 | # 3rd Fully Connected Layer
71 | classification_model.add(Dense(1000,activation='relu'))
72 | # Add Dropout
73 | classification_model.add(Dropout(0.4))
74 |
75 | # Output Layer
76 | classification_model.add(Dense(num_classes,activation='softmax'))
77 |
78 | classification_model.summary()
79 |
80 | # Compile the classification_model
81 | classification_model.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
82 |
83 | es=keras.callbacks.EarlyStopping(monitor='val_acc',
84 | min_delta=0,
85 | patience=100,
86 | verbose=1, mode='auto')
87 |
88 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True)
89 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
90 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[mc,es])
91 |
92 | best_model=load_model(os.path.join(result_path,'best_model.h5'))
93 | file_name=os.path.split(result_path)[1]
94 | date=os.path.split(os.path.split(result_path)[0])[1]
95 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'ZNet_model.h5'))
96 |
97 | if feature_extraction==1:
98 | feature_extractor_parameters['CNN_model']=classification_model
99 | CNN_feature_extractor.CNN_feature_extraction_classsification(feature_extractor_parameters,result_path)
100 | return
101 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data, test_labels_one_hot, 'ZFNet', result_path,epoch)
102 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/optimizers.py:
--------------------------------------------------------------------------------
1 | import keras
2 | from keras import optimizers
3 | import sys
4 |
5 |
6 | def choosing(optimizer):
7 | if optimizer=='adam':
8 | opt=keras.optimizers.Adam(lr=0.000001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
9 | elif optimizer=='adamax':
10 | opt= keras.optimizers.Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
11 | elif optimizer=='nadam':
12 | opt= keras.optimizers.Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
13 | elif optimizer=='adadelta':
14 | opt= optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=None, decay=0.0)
15 | elif optimizer=='adagrad':
16 | opt= keras.optimizers.Adagrad(lr=0.01, epsilon=None, decay=0.0)
17 | elif optimizer=='sgd':
18 | opt = optimizers.SGD(lr=0.01, clipnorm=1.)
19 | elif optimizer=='RMSprop':
20 | opt=keras.optimizers.RMSprop(lr=0.0006, rho=0.9, epsilon=None, decay=0.0)
21 | elif optimizer is None:
22 | return
23 | else:
24 | print('Value Error: optimizer took an unexpected value')
25 | sys.exit()
26 |
27 | return opt
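28 |
29 | # Usage sketch: the returned optimizer object is passed straight to the compile step
30 | # of the CNN model files in this directory, e.g. (illustrative only):
31 | # opt = choosing('adam')
32 | # classification_model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])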
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/readme.md:
--------------------------------------------------------------------------------
1 | CNN architectures used for the CV-BOLD classification experiments (AlexNet, LeNet, ZFNet, VGG from scratch and pretrained, ResNet50, DenseNet121, InceptionResNetV2, and a simple baseline), plus the optimizer-selection helper.
2 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/CNN_based_models/simple_model.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import keras
3 | from keras.models import Sequential
4 | from keras.layers import Dropout
5 | from keras.callbacks import ModelCheckpoint
6 | from keras.models import load_model
7 | from keras import layers
8 | from keras.layers import Conv2D, MaxPooling2D
9 | from keras.models import Sequential, Input, Model
10 | from keras.layers import Dense, Dropout, Flatten
11 | from keras.layers import LeakyReLU
12 | import os
13 | import CNN_feature_extractor
14 | import model_evaluation
15 | from sklearn.model_selection import train_test_split
16 |
17 |
18 | def model (train_data_whole,train_labels_whole,test_data,test_labels,opt,epoch,batch_size_factor,num_classes,result_path,feature_extraction,feature_extractor_parameters):
19 | train_data, valid_data, train_labels, valid_labels = train_test_split(train_data_whole, train_labels_whole,test_size=0.2, random_state=13)
20 |
21 | train_labels_one_hot=data_preprocessing.labels_convert_one_hot(train_labels)
22 | valid_labels_one_hot=data_preprocessing.labels_convert_one_hot(valid_labels)
23 | test_labels_one_hot=data_preprocessing.labels_convert_one_hot(test_labels)
24 | batch_size=round(train_data.shape[0]/batch_size_factor)
25 | classification_model = Sequential()
26 | classification_model.add(Conv2D(32, kernel_size=(3, 3), padding='same',activation='relu',
27 | input_shape=(train_data.shape[1], train_data.shape[2], train_data.shape[3])))
28 | #classification_model.add(BatchNormalization())
29 |
30 | classification_model.add(MaxPooling2D((2, 2), padding='same'))
31 | classification_model.add(Dropout(0.5))
32 | classification_model.add(Conv2D(64, (3,3), padding='same'))
33 | classification_model.add(LeakyReLU(alpha=0.1))
34 | #classification_model.add(BatchNormalization())
35 |
36 | classification_model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
37 | classification_model.add(Dropout(0.5))
38 | classification_model.add(Conv2D(128, (3,3), padding='same'))
39 | classification_model.add(LeakyReLU(alpha=0.1))
40 | #classification_model.add(BatchNormalization())
41 | classification_model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
42 | classification_model.add(Dropout(0.5))
43 | classification_model.add(Flatten())
44 | classification_model.add(Dense(128))
45 | classification_model.add(LeakyReLU(alpha=0.1))
46 | #classification_model.add(BatchNormalization())
47 | classification_model.add(Dense(128,))
48 | classification_model.add(Dropout(0.5))
49 | classification_model.add(LeakyReLU(alpha=0.1))
50 |
51 | classification_model.add(Dense(num_classes, activation='softmax'))
52 | classification_model.compile(loss=keras.losses.categorical_crossentropy, optimizer=opt,
53 | metrics=['accuracy'])
54 |
55 | es=keras.callbacks.EarlyStopping(monitor='val_acc',
56 | min_delta=0,
57 | patience=50,
58 | verbose=1, mode='auto',baseline=0.9)
59 |
60 | mc = ModelCheckpoint(os.path.join(result_path,'best_model.h5'), monitor='val_acc', mode='auto', save_best_only=True)
61 | classification_train = classification_model.fit(train_data, train_labels_one_hot, batch_size=batch_size, epochs=epoch,
62 | verbose=1, validation_data=(valid_data, valid_labels_one_hot),callbacks=[es,mc])
63 | best_model=load_model(os.path.join(result_path,'best_model.h5'))
64 | file_name=os.path.split(result_path)[1]
65 | date=os.path.split(os.path.split(result_path)[0])[1]
66 |
67 | classification_model.save(os.path.join(result_path,date+'_'+file_name+'_'+'simple_model.h5'))
68 |
69 | if feature_extraction==1:
70 | feature_extractor_parameters['CNN_model']=classification_model
71 | CNN_feature_extractor.CNN_feature_extraction_classsification(train_data_whole,train_labels_whole,test_data,test_labels,feature_extractor_parameters,result_path)
72 | return
73 |
74 | model_evaluation.testing_and_printing(classification_model,classification_train,best_model,test_data,test_labels_one_hot,'simple_architecture',result_path,epoch)
75 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/evaluation/metrics.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 |
3 | def auc_roc(y_true, y_pred):
4 | # any tensorflow metric (here: the TF 1.x streaming AUC)
5 | value, update_op = tf.contrib.metrics.streaming_auc(y_pred, y_true)
6 |
7 | # find all variables created for this metric
8 | metric_vars = [i for i in tf.local_variables() if 'auc_roc' in i.name.split('/')[1]]
9 |
10 | # Add metric variables to GLOBAL_VARIABLES collection.
11 | # They will be initialized for new session.
12 | for v in metric_vars:
13 | tf.add_to_collection(tf.GraphKeys.GLOBAL_VARIABLES, v)
14 |
15 | # force to update metric values
16 | with tf.control_dependencies([update_op]):
17 | value = tf.identity(value)
18 | return value
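19 |
20 | # Usage sketch (an assumption: the repository snapshot does not show where this metric is wired in):
21 | # classification_model.compile(loss='categorical_crossentropy', optimizer=opt,
22 | #                              metrics=['accuracy', auc_roc])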
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/evaluation/model_evaluation.py:
--------------------------------------------------------------------------------
1 | import matplotlib.pyplot as plt
2 | import generate_result_
3 | from sklearn.metrics import precision_score
4 | from sklearn.metrics import recall_score
5 | from sklearn.metrics import f1_score
6 | from sklearn.metrics import cohen_kappa_score
7 | from sklearn.metrics import roc_auc_score
8 | from sklearn.metrics import confusion_matrix
9 |
10 |
11 | def testing_and_printing(classification_model,classification_train,best_model,test_data,test_labels_one_hot,model_name,results_path,epoch):
12 | #classification_model.summary()
13 | test_eval = classification_model.evaluate(test_data, test_labels_one_hot, verbose=1)
14 | prediction=classification_model.predict(test_data)
15 | predicted_classes=prediction.argmax(axis=1)   # class index per sample from the softmax output
16 | print(predicted_classes)
17 | test_labels=test_labels_one_hot[:,1]
18 |
19 | print(test_labels_one_hot)
20 | print(prediction)
21 | print('Test loss:', test_eval[0])
22 | print('Test accuracy:', test_eval[1])
23 | precision = precision_score(test_labels, predicted_classes)
24 | print('Precision: %f' % precision)
25 | # recall: tp / (tp + fn)
26 | recall = recall_score(test_labels, predicted_classes)
27 | print('Recall: %f' % recall)
28 | # f1: 2 tp / (2 tp + fp + fn)
29 | f1 = f1_score(test_labels, predicted_classes)
30 | print('F1 score: %f' % f1)
31 |
32 | matrix = confusion_matrix(test_labels, predicted_classes)
33 | print(matrix)
34 |
35 |
36 |
37 | #print('AUC on test data:',test_eval[2])
38 | print('the number of epochs:', epoch)
39 | accuracy = classification_train.history['acc']
40 | val_accuracy = classification_train.history['val_acc']
41 | loss = classification_train.history['loss']
42 | val_loss = classification_train.history['val_loss']
43 | epochs = range(len(accuracy))
44 | plt.plot(epochs, accuracy, 'bo', label='Training accuracy')
45 | plt.plot(epochs, val_accuracy, 'b', label='Validation accuracy')
46 | plt.title('Training and validation accuracy')
47 | plt.legend()
48 | plt.figure()
49 | plt.plot(epochs, loss, 'bo', label='Training loss')
50 | plt.plot(epochs, val_loss, 'b', label='Validation loss')
51 | plt.title('Training and validation loss')
52 | plt.legend()
53 | plt.show()
54 |
55 |
56 | test_acc=best_model.evaluate(test_data,test_labels_one_hot)
57 | print('best model test accuracy is ',test_acc)
58 | generate_result_.cnn_save_result(test_eval[1],classification_model,model_name,results_path)
59 |
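60 | # Note: this routine assumes binary classification; column 1 of test_labels_one_hot is
61 | # treated as the positive class for the precision/recall/F1 scores and the confusion matrix.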
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/evaluation/readme.md:
--------------------------------------------------------------------------------
1 | Evaluation helpers: a streaming AUC metric (metrics.py) and test-set evaluation with accuracy, precision/recall/F1, a confusion matrix, and training-curve plots (model_evaluation.py).
2 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/main.py:
--------------------------------------------------------------------------------
1 | import CNN
2 | import load_data
3 | from numpy import load
4 | import numpy as np
5 | import data_preprocessing
6 | import preprocessing_methods
7 | import generate_result_
8 | import os
9 | from scipy.signal import resample_poly
10 |
11 | def main():
12 |
13 |
14 |
15 | train_data_path='/data/fmri/Folder/AD_classification/Data/input_data/preprocessed_data/CV_OULU_Con_AD_preprocessed.npz'
16 | train_data_classifer = load(train_data_path)['masked_voxels']
17 | train_data_path='/data/fmri/Folder/AD_classification/Data/input_data/Augmented_data/CV_OULU_Con_AD_aug.npz'
18 | train_data_CNN = load(train_data_path)['masked_voxels']
19 | test_data_path='/data/fmri/Folder/AD_classification/Data/input_data/CV_ADNI_Con_AD.npz'
20 | test_data_CNN = load(test_data_path)['masked_voxels']
21 | test_data_path='/data/fmri/Folder/AD_classification/Data/input_data/preprocessed_data/CV_ADNI_Con_AD_preprocessed.npz'
22 | test_data_classifer = load(test_data_path)['masked_voxels']
23 |
24 | transposing_order=[3,0,2,1]
25 | train_data_CNN=data_preprocessing.transposnig(train_data_CNN,transposing_order)
26 | test_data_CNN=data_preprocessing.transposnig(test_data_CNN,transposing_order)
27 |
28 | train_labels_path='/data/fmri/Folder/AD_classification/Data/input_data/labels/train_labels_aug_data.npz'
29 | train_labels_CNN=load(train_labels_path)['labels']
30 | shuffling_indicies = np.random.permutation(len(train_labels_CNN))
31 | temp = train_data_CNN[shuffling_indicies, :, :, :]
32 | train_data_CNN=temp
33 | train_labels_CNN = train_labels_CNN[shuffling_indicies]
34 |
35 |
36 |
37 |
38 |
39 | train_labels_path='/data/fmri/Folder/AD_classification/Data/input_data/labels/train_labels.npz'
40 | train_labels_classifer=load(train_labels_path)['labels']
41 | shuffling_indicies = np.random.permutation(len(train_labels_classifer))
42 | temp = train_data_classifer[shuffling_indicies, :, :, :]
43 | train_data_classifer=temp
44 | train_labels_classifer = train_labels_classifer[shuffling_indicies]
45 |
46 | #test_data_path = load_data.find_path(test_data_file_name)
47 | #test_data_path='/data/fmri/Folder/AD_classification/Data/input_data/CV_ADNI_Con_AD.npz'
48 | #test_data = load(test_data_path)['masked_voxels']
49 | #test_labels_path=load_data.find_path(test_labels_file_name)
50 | test_labels_path='/data/fmri/Folder/AD_classification/Data/input_data/labels/test_labels.npz'
51 | test_labels=load(test_labels_path)['labels']
52 | shuffling_indicies = np.random.permutation(len(test_labels))
53 | test_data_CNN = test_data_CNN[shuffling_indicies, :, :, :]
54 | test_data_classifer = test_data_classifer[shuffling_indicies, :, :, :]
55 |
56 | test_labels = test_labels[shuffling_indicies]
57 |
58 | train_data_CNN,test_data_CNN,train_labels_CNN,test_labels=preprocessing_methods.preprocessing(train_data_CNN,test_data_CNN,train_labels_CNN,test_labels,4,0,None,None)
59 |
60 | factors=[(224,45),(224,45),(3,54)]
61 | train_data_CNN=resample_poly(train_data_CNN, factors[0][0], factors[0][1], axis=1)
62 | train_data_CNN=resample_poly(train_data_CNN, factors[1][0], factors[1][1], axis=2)
63 | #train_data_CNN=resample_poly(train_data_CNN, factors[2][0], factors[2][1], axis=3)
64 |
65 | test_data_CNN=resample_poly(test_data_CNN, factors[0][0], factors[0][1], axis=1)
66 | test_data_CNN=resample_poly(test_data_CNN, factors[1][0], factors[1][1], axis=2)
67 | #test_data_CNN=resample_poly(test_data_CNN, factors[2][0], factors[2][1], axis=3)
68 |
69 |
70 | train_CNN=0
71 | feature_extraction=1
72 |
73 | if train_CNN==1 and feature_extraction==1:
74 | line1='CNN model is trained and saved and then used as feature extractor'
75 | line2='CNN model used for feature extraction is :'
76 | elif train_CNN==1 and feature_extraction==0:
77 | line1 ='CNN model is trained and used to test the test data'
78 | line2='CNN model used is :'
79 | elif train_CNN==0 and feature_extraction==1:
80 | line1 ='using a saved model to extract features'
81 | line2='The model used is a saved model'
82 | else:
83 | print('Value Error: train_CNN and feature_extraction cannot have these values')
84 |
85 |
86 | results_directory='Results'
87 | num_classes=2
88 | epoch=1000
89 | batch_size_factor=1
90 | optimizer='adam'
91 | CNN_models=['VGG16','VGG19']
92 | #intermedidate_layer=[7,7,7,16]
93 | hyperparameters={'dropouts':[0.25,0.5,0.5],'activation_function':['relu','relu','relu','sigmoid'],'epoch':10,'opt':'adam','penalty':'l1','C':100,'neighbors':50}
94 | data={'train_data':train_data_CNN,'test_data':test_data_CNN,'train_labels':train_labels_CNN,'test_labels':test_labels}
95 | preprocessing_method='method 4'
96 | i=0
97 | for CNN_model in CNN_models:
98 | result_path = generate_result_.create_results_dir(results_directory)
99 | print(CNN_model)
100 | feature_extractor_parameters={'data':data,'hyperparameters':hyperparameters,'model_type':'pretrained','CNN_model':CNN_model,'intermediate_layer':7,'classifer_name':'all'}
101 | CNN.CNN_main(train_data_CNN,test_data_CNN,result_path,train_labels_CNN,test_labels,num_classes,epoch,batch_size_factor,optimizer,CNN_model,train_CNN,feature_extraction,feature_extractor_parameters)
102 | f = open(os.path.join(result_path, 'README'), "w+")
103 |
104 | line3=CNN_model
105 | line4='The preprocessing method used is '+preprocessing_method
106 | line5='The number of epochs used to train the CNN_model is '+str(epoch)
107 | line6='the optimizer used is '+optimizer
108 | f.write("{}" "\n" "{}" "\n" "{}" "\n" "{}" "\n" "{}" "\n" "{}" "\n" .format(line1,line2,line3,line4,line5,line6))
109 | i=i+1
110 |
111 | if __name__=='__main__':
112 | main()
113 |
114 |
115 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/preprocessing/data_augmentation.py:
--------------------------------------------------------------------------------
1 | import random
2 | import cv2
3 | import skimage as sk
4 | from skimage import transform
5 | from skimage import util
6 | from sklearn.svm import SVC
7 | from sklearn.model_selection import StratifiedKFold
8 | from sklearn.feature_selection import RFECV
9 | from sklearn.datasets import make_classification
10 | from sklearn.model_selection import train_test_split
11 | #from sklearn import decomposition
12 | from sklearn.gaussian_process import GaussianProcessClassifier
13 | from sklearn.gaussian_process.kernels import RBF
14 | from sklearn import decomposition
15 | from sklearn.feature_selection import SelectFromModel
16 | from sklearn.svm import LinearSVC
17 | from sklearn.metrics import roc_auc_score
18 | from sklearn.metrics import f1_score
19 | from sklearn.metrics import confusion_matrix
20 | from sklearn.ensemble import RandomForestClassifier
21 | from sklearn.utils import resample
22 | from sklearn.utils import shuffle
23 | from sklearn.preprocessing import StandardScaler
24 | from sklearn.preprocessing import Normalizer
25 | from sklearn import preprocessing
26 | from scipy import ndimage
27 | import nilearn
28 | import nibabel as nib
29 | import numpy as np
30 | import os
31 | import load_data
32 |
33 | def load_obj(obj):
34 | # Load subjects
35 | in_img = nib.load(obj)
36 | in_shape = in_img.shape
37 | print('Shape: ', in_shape)
38 | in_array = in_img.get_fdata()
39 | return in_array
40 |
41 |
42 |
43 | def flipping(img,axis):
44 | flipped_img = np.flip(img,axis=axis)
45 | return flipped_img
46 |
47 |
48 |
49 | def flipping_HV(img):
50 | flipped_img = np.fliplr(img)
51 | return flipped_img
52 |
53 | def rotate(img,angle):
54 |
55 | img=ndimage.interpolation.rotate(img,angle)
56 | return img
57 |
58 | def shifting(img,shift_amount):
59 |
60 | img=ndimage.interpolation.shift(img,shift_amount)
61 |
62 | return img
63 |
64 | def zooming(img,zooming_amount):
65 | img=ndimage.interpolation.zoom(img,zooming_amount)
66 | return img
67 |
68 |
69 | def add_gaussian_noise(X_imgs):
70 | gaussian_noise_imgs = []
71 | row, col,depth,number_of_samples = X_imgs.shape
72 | # Gaussian distribution parameters
73 | mean = 0
74 | var = 0.1
75 | sigma = var ** 0.5
76 | gaussian_noise_imgs=np.empty(X_imgs.shape)
77 | for i in range(number_of_samples):
78 | gaussian_img=np.zeros((row,col,depth))
79 | gaussian = np.random.random((row, col, 1)).astype(np.float64)
80 | gaussian = np.tile(gaussian,(1,1,depth))
81 | gaussian_img = cv2.addWeighted(X_imgs[:,:,:,i], 0.75, 0.25 * gaussian, 0.25, 0 ,dtype=cv2.CV_64F)
82 | gaussian_noise_imgs[:,:,:,i]=gaussian_img
83 | gaussian_noise_imgs = np.array(gaussian_noise_imgs, dtype = np.float32)
84 | return gaussian_noise_imgs
85 |
86 |
87 | def transposnig(input_data,order):
88 | return input_data.transpose(order)
89 |
90 |
91 |
92 | def mask_print(input,mask,name):
93 | #remained_feature_indices=np.where(mask==1)
94 | masking_img = nib.load('/data/fmri/Folder/AD_classification/Data/input_data/4mm_brain_mask_bin.nii.gz')
95 |
96 | masking_shape = masking_img.shape
97 | print(masking_shape)
98 | masking = np.empty(masking_shape, dtype=float)
99 | masking[:,:,:] = masking_img.get_data().astype(float)
100 | for i in range (np.shape(input)[3]):
101 | input[:,:,:,i]=mask*input[:,:,:,i]
102 | #input[:,:,:,i]=input[:,:,:,i]
103 | hdr = masking_img.header
104 | aff = masking_img.affine
105 | out_img = nib.Nifti1Image(input, aff, hdr)
106 | # Save to disk
107 | out_file_name = '/data/fmri/Folder/AD_classification/Data/input_data/Augmented_data/mask_'+name+'.nii.gz'
108 | nib.save(out_img, out_file_name)
109 |
110 | def slicing(len1,len2 ):
111 | diff=abs(len2-len1)/2
112 |
113 | if (round(diff)>diff):
114 | return round(diff),len2-round(diff)+1
115 |
116 | else:
117 | return int(diff),int(len2-diff)
118 |
119 |
120 | '''
121 | Oulu_data_ad_path = '/data/fmri/Folder/AD_classification/Data/Raw_data/Oulu_Data/CV_OULU_AD.nii.gz'
122 | Oulu_data_con_path='/data/fmri/Folder/AD_classification/Data/Raw_data/Oulu_Data/CV_OULU_CON.nii.gz'
123 | adni_data_ad_path='/data/fmri/Folder/AD_classification/Data/Raw_data/ADNI_Data/CV_ADNI_AD.nii.gz'
124 | adni_data_con_path='/data/fmri/Folder/AD_classification/Data/Raw_data/ADNI_Data/CV_ADNI_CON.nii.gz'
125 | masking_data='/data/fmri/Folder/AD_classification/Data/input_data/4mm_brain_mask_bin.nii.gz'
126 |
127 |
128 |
129 | Oulu_data_ad=load_obj(Oulu_data_ad_path)
130 | Oulu_data_con=load_obj(Oulu_data_con_path)
131 | adni_data_ad=load_obj(adni_data_ad_path)
132 | adni_data_con=load_obj(adni_data_con_path)
133 | mask=load_obj(masking_data)
134 |
135 | order_data=(0,2,1,3)
136 | order_mask=(0,2,1)
137 |
138 | Oulu_data_con_transposed=transposnig(Oulu_data_con,order_data)
139 | Oulu_data_ad_transposed=transposnig(Oulu_data_ad,order_data)
140 | mask_transposed=transposnig(mask,order_mask)
141 |
142 |
143 | #Rotation
144 | angles=[30,-30,60,-60,45,-45]
145 | for i in angles:
146 | Oulu_data_ad_rotated=rotate(Oulu_data_ad_transposed,i)
147 | mask_rotated=rotate(mask_transposed,i)
148 | start,end=slicing(Oulu_data_ad.shape[0],Oulu_data_ad_rotated.shape[0])
149 |
150 | Oulu_data_ad_rotated=Oulu_data_ad_rotated[0:Oulu_data_ad.shape[0],0:Oulu_data_ad.shape[0],:,:]
151 | mask_rotated=mask_rotated[0:Oulu_data_ad.shape[0],0:Oulu_data_ad.shape[0],:]
152 |
153 | Oulu_data_ad_rotated_transposed=transposnig(Oulu_data_ad_rotated,order_data)
154 | mask_rotated_transposed=transposnig(mask_rotated,order_mask)
155 | print(Oulu_data_ad_rotated_transposed.shape)
156 | mask_print(Oulu_data_ad_rotated_transposed,mask_rotated_transposed,'rotated_' +str(i)+ '_Oulu_data_ad')
157 |
158 |
159 |
160 | # adding gussian noise
161 |
162 | Oulu_data_ad_noised=add_gaussian_noise(Oulu_data_ad_transposed)
163 | Oulu_data_ad_noised_transposed=transposnig(Oulu_data_ad_noised,order_data)
164 | mask_print(Oulu_data_ad_noised_transposed,mask,'Oulu_data_ad_gussian_noised')
165 |
166 |
167 | #shifting
168 | shift_amount_data=[0,20,0,0]
169 | shift_amount_mask=[0,20,0]
170 | Oulu_data_con_shifted=shifting(Oulu_data_con_transposed,shift_amount_data)
171 | Oulu_data_con_shifted_transposed=transposnig(Oulu_data_con_shifted,order_data)
172 | mask_shifted=shifting(mask_transposed,shift_amount_mask)
173 | mask_shifted_transposed=transposnig(mask_shifted,order_mask)
174 | mask_print(Oulu_data_con_shifted_transposed,mask_shifted_transposed,'down_Oulu_data_con')
175 |
176 |
177 | # flipping
178 | Oulu_data_ad_flipped=flipping(Oulu_data_ad_transposed,0)
179 | Oulu_data_ad_flipped_transposed=transposnig(Oulu_data_ad_flipped,order_data)
180 | mask_tranposed=transposnig(mask,order_mask)
181 | mask_flipped=flipping(mask_tranposed,1)
182 | mask_flipped_transposed=transposnig(mask_flipped,order_mask)
183 | mask_print(Oulu_data_ad_flipped_transposed,mask_flipped_transposed,'vertical_flipped_Oulu_data_ad')
184 | '''
185 |
186 |
187 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/preprocessing/preprocessing_methods.py:
--------------------------------------------------------------------------------
1 | import data_preprocessing
2 | import numpy as np
3 | import load_data
4 |
5 | def preprocessing(train_data,test_data,train_labels,test_labels,method,save,file_name,output_dir):
6 |
7 |
8 |
9 | dim0_train=train_data.shape[0]
10 | dim1_train=train_data.shape[1]
11 | dim2_train=train_data.shape[2]
12 | dim3_train=train_data.shape[3]
13 |
14 | dim0_test=test_data.shape[0]
15 | dim1_test=test_data.shape[1]
16 | dim2_test=test_data.shape[2]
17 | dim3_test=test_data.shape[3]
18 |
19 | if method==0:
 20 |             return train_data,test_data,train_labels,test_labels  # method 0: return the data unchanged
21 | elif method==1:
22 | train_data=train_data.reshape(dim0_train,dim1_train*dim2_train*dim3_train)
23 | test_data=test_data.reshape(dim0_test,dim1_test*dim2_test*dim3_test)
24 | train_data=data_preprocessing.MinMax_scaler(train_data)
25 | test_data=data_preprocessing.MinMax_scaler(test_data)
26 | train_data,test_data=data_preprocessing.standarization(train_data,test_data)
27 | train_data,test_data=data_preprocessing.KSTest(train_data,test_data,800)
28 | train_data=train_data.reshape(dim0_train,dim1_train,dim2_train,dim3_train)
29 | test_data=test_data.reshape(dim0_test,dim1_test,dim2_test,dim3_test)
30 |
31 | elif method==2:
32 | for i in range(train_data.shape[0]):
33 | for j in range(train_data.shape[3]):
34 | train_data[i, :, :, j] = data_preprocessing.standarization(train_data[i, :, :, j])
35 | train_data[i, :, :, j] = data_preprocessing.MinMax_scaler(train_data[i, :, :, j])
36 |
37 | if i < test_data.shape[0]:
38 | test_data[i, :, :, j] = data_preprocessing.standarization(test_data[i, :, :, j])
39 |
40 | test_data[i, :, :, j] = data_preprocessing.MinMax_scaler(test_data[i, :, :, j])
41 |
42 |
43 |
44 |
45 |
46 |
47 | elif method==3:
48 | train_data=train_data.reshape(dim0_train,dim1_train*dim2_train*dim3_train)
49 | test_data=test_data.reshape(dim0_test,dim1_test*dim2_test*dim3_test)
50 | train_data,test_data=data_preprocessing.KSTest(train_data,test_data,500)
51 | train_data=train_data.reshape(dim0_train,dim1_train,dim2_train,dim3_train)
52 | test_data=test_data.reshape(dim0_test,dim1_test,dim2_test,dim3_test)
53 |
54 |
55 |
56 | elif method==4:
57 | train_data=train_data.reshape(dim0_train,dim1_train*dim2_train,dim3_train)
58 | test_data=test_data.reshape(dim0_test,dim1_test*dim2_test,dim3_test)
59 | for i in range (dim3_train):
60 | train_data[:,:,i]=data_preprocessing.MinMax_scaler(train_data[:,:,i])
61 | train_data[:,:,i]=data_preprocessing.standarization(train_data[:,:,i])
62 |
63 | test_data[:,:,i]=data_preprocessing.MinMax_scaler(test_data[:,:,i])
64 | test_data[:,:,i]=data_preprocessing.standarization(test_data[:,:,i])
65 |
66 | train_data[:,:,i],test_data[:,:,i]=data_preprocessing.KSTest(train_data[:,:,i],test_data[:,:,i],800)
67 | train_data=train_data.reshape(dim0_train,dim1_train,dim2_train,dim3_train)
68 | test_data=test_data.reshape(dim0_test,dim1_test,dim2_test,dim3_test)
69 |
70 | elif method==5:
71 | train_data=train_data.reshape(dim0_train,dim1_train*dim2_train*dim3_train)
72 | train_data,train_labels,index=data_preprocessing.outliers(train_data,train_labels,1)
73 | train_data=train_data.reshape(dim0_train-np.size(index),dim1_train,dim2_train,dim3_train)
74 | if save==0:
75 |
76 | return train_data,test_data,train_labels,test_labels
77 | else:
78 |
79 | transposing_order = [1,3,2,0]
80 | train_data = data_preprocessing.transposnig(train_data, transposing_order)
81 | test_data = data_preprocessing.transposnig(test_data, transposing_order)
82 | output_path=load_data.find_path(output_dir)
83 | np.savez(output_path+file_name+'train_data.npz',masked_voxels=train_data)
84 | np.savez(output_path+file_name+'test_data.npz',masked_voxels=test_data)
85 |
86 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/preprocessing/readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/storing_loading/load_data.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from numpy import load
3 | import os
4 | import nibabel as nib
5 | from pathlib import Path
6 | import warnings
7 | warnings.filterwarnings("ignore")
8 |
9 |
10 | def find_path(file_name):
11 | data_path=None
12 | current_dir_path=os.getcwd()
13 | p=Path(current_dir_path)
14 | root_dir=p.parts[0]+p.parts[1]
15 | for r,d,f in os.walk(root_dir):
16 | for files in f:
17 | if files == file_name:
18 | data_path=os.path.join(r,files)
19 |
20 | else:
21 | for dir in d :
22 | if dir == file_name:
23 | data_path = os.path.join(r, dir)
24 | if data_path is not None:
25 | return data_path
26 | else:
27 | os.makedirs('./'+file_name)
28 | return './'+file_name
29 |
30 |
31 |
32 | def train_data_3d(train_Con_file_name, train_AD_file_name):
33 | train_data_Con_path = find_path(train_Con_file_name)
34 | train_data_AD_path = find_path(train_AD_file_name)
35 | train_data_Con = load(train_data_Con_path)['masked_voxels']
36 | train_data_AD = load(train_data_AD_path)['masked_voxels']
37 | train_data=np.concatenate((train_data_Con,train_data_AD),axis=3)
38 | train_labels = np.hstack((np.zeros(train_data_Con.shape[3]), np.ones(train_data_AD.shape[3])))
39 |
40 |
41 | return train_data, train_labels
42 |
43 | def test_data_3d(test_Con_file_name,test_AD_file_name):
44 | test_data_Con_path=find_path(test_Con_file_name)
45 | test_data_AD_path = find_path(test_AD_file_name)
46 | test_data_Con = load(test_data_Con_path)['masked_voxels']
47 | test_data_AD = load(test_data_AD_path)['masked_voxels']
48 | test_data = np.concatenate((test_data_Con, test_data_AD), axis=3)
49 | test_labels = np.hstack((np.zeros(test_data_Con.shape[3]), np.ones(test_data_AD.shape[3])))
50 |
51 |
52 | return test_data,test_labels
53 |
54 |
55 | def mask(mask_name):
56 | mask_path = find_path(mask_name)
57 | original_mask = nib.load(mask_path)
58 | return original_mask
59 |
60 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/deep learning/storing_loading/readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/load_data.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from numpy import load
3 | import os
4 | import nibabel as nib
5 | from pathlib import Path
6 | import warnings
7 | warnings.filterwarnings("ignore")
8 |
9 |
10 | def find_path(file_name):
11 | data_path=None
12 | current_dir_path=os.getcwd()
13 | p=Path(current_dir_path)
14 | root_dir=p.parts[0]+p.parts[1]
15 | for r,d,f in os.walk(root_dir):
16 | for files in f:
17 | if files == file_name:
18 | data_path=os.path.join(r,files)
19 |
20 | else:
21 | for dir in d :
22 | if dir == file_name:
23 | data_path = os.path.join(r, dir)
24 | if data_path is not None:
25 | return data_path
26 | else:
27 | os.makedirs('./'+file_name)
28 | return './'+file_name
29 |
30 |
31 |
32 | def train_data_3d(train_Con_file_name, train_AD_file_name):
33 | train_data_Con_path = find_path(train_Con_file_name)
34 | train_data_AD_path = find_path(train_AD_file_name)
35 | train_data_Con = load(train_data_Con_path)['masked_voxels']
36 | train_data_AD = load(train_data_AD_path)['masked_voxels']
37 | train_data=np.concatenate((train_data_Con,train_data_AD),axis=3)
38 | train_labels = np.hstack((np.zeros(train_data_Con.shape[3]), np.ones(train_data_AD.shape[3])))
39 |
40 |
41 | return train_data, train_labels
42 |
43 | def test_data_3d(test_Con_file_name,test_AD_file_name):
44 | test_data_Con_path=find_path(test_Con_file_name)
45 | test_data_AD_path = find_path(test_AD_file_name)
46 | test_data_Con = load(test_data_Con_path)['masked_voxels']
47 | test_data_AD = load(test_data_AD_path)['masked_voxels']
48 | test_data = np.concatenate((test_data_Con, test_data_AD), axis=3)
49 | test_labels = np.hstack((np.zeros(test_data_Con.shape[3]), np.ones(test_data_AD.shape[3])))
50 |
51 |
52 | return test_data,test_labels
53 |
54 |
55 | def mask(mask_name):
56 | mask_path = find_path(mask_name)
57 | original_mask = nib.load(mask_path)
58 | return original_mask
59 |
60 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/load_models.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | import load_data
3 | import data_preprocessing
4 | import numpy as np
5 | import nibabel as nib
6 | import generate_result
7 |
8 | #define paths
9 | train_Con_file_name = 'CV_OULU_CON.npz'
10 | train_AD_file_name = 'CV_OULU_AD.npz'
11 | test_Con_file_name = 'CV_ADNI_CON.npz'
12 | test_AD_file_name = 'CV_ADNI_AD.npz'
13 | mask_name = '4mm_brain_mask_bin.nii.gz'
14 | created_mask_high_certainity_file_name='./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz'
15 | created_mask_outlier_file_name='./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz'
16 | high_certainity_model_name='./Output_results_directory/2019-08-10/1/high_certainity_model.sav'
17 | low_certainty_model_name='./Output_results_directory/2019-08-10/1/low_certainty_model.sav'
18 | outliers_model_name='./Output_results_directory/2019-08-10/1/outliers_model.sav'
19 | #define variables
20 | number_of_neighbours = 1
21 |
22 | #load data
23 | train_data,train_labels=load_data.train_data_3d(train_Con_file_name,train_AD_file_name)
24 | test_data, test_labels = load_data.test_data_3d(test_Con_file_name, test_AD_file_name)
25 |
26 | #load masks
27 | mask_4mm = load_data.mask(mask_name)
28 | created_mask_high_certainity = nib.load(created_mask_high_certainity_file_name)
29 | created_mask_outlier = nib.load(created_mask_outlier_file_name)
30 |
31 |
32 | #data preprocessing
33 | train_data = np.moveaxis(train_data.copy(), 3, 0)
34 | test_data = np.moveaxis(test_data.copy(), 3, 0)
35 | original_mask=mask_4mm.get_fdata()
36 | train_data = train_data * original_mask
37 | test_data = test_data * original_mask
38 | created_mask_high_certainity=created_mask_high_certainity.get_fdata()
39 | created_mask_outlier=created_mask_outlier.get_fdata()
40 | orignal_mask_flatten = data_preprocessing.flatten(original_mask[np.newaxis, :, :, :].copy())
41 | orignal_mask_flatten = np.reshape(orignal_mask_flatten, (-1))
42 | created_mask_high_certainity_flatten = data_preprocessing.flatten(created_mask_high_certainity[np.newaxis, :, :, :].copy())
43 | created_mask_high_certainity_flatten = np.reshape(created_mask_high_certainity_flatten, (-1))
44 | created_mask_outlier_flatten = data_preprocessing.flatten(created_mask_outlier[np.newaxis, :, :, :].copy())
45 | created_mask_outlier_flatten = np.reshape(created_mask_outlier_flatten, (-1))
46 | train_data_flattened = data_preprocessing.flatten(train_data.copy())
47 | test_data_flattened = data_preprocessing.flatten(test_data.copy())
48 | train_data_flattened = data_preprocessing.MinMax_scaler(train_data_flattened.copy())
49 | test_data_flattened = data_preprocessing.MinMax_scaler(test_data_flattened.copy())
50 |
51 |
52 | train_data_inlier, train_labels_inlier, outlier_indices_train = data_preprocessing.outliers(train_data_flattened,
53 | train_labels,
54 | number_of_neighbours)
55 | test_data_inlier, test_labels_inlier, outlier_indices_test = data_preprocessing.novelty(train_data_inlier,
56 | train_labels_inlier,
57 | test_data_flattened,
58 | test_labels,
59 | number_of_neighbours)
60 |
61 | test_data_inlier_brain=test_data_inlier[:,np.squeeze(np.where(orignal_mask_flatten>0),axis=0)]
62 | test_data_outlier_brain=(test_data_flattened[outlier_indices_test])[:,np.squeeze(np.where(orignal_mask_flatten>0),axis=0)]
63 | test_data_masked_high_certainity=test_data_inlier_brain* created_mask_high_certainity_flatten[np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)]
64 | test_data_inlier_CVspace = data_preprocessing.coefficient_of_variance(test_data_masked_high_certainity)[:,np.newaxis]
65 | test_data_outlier_cv = data_preprocessing.coefficient_of_variance(
66 | test_data_outlier_brain *created_mask_outlier_flatten[np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)])[:, np.newaxis]
67 | #load models
68 | high_certainity_model = pickle.load(open(high_certainity_model_name, 'rb'))
69 | low_certainty_model = pickle.load(open(low_certainty_model_name, 'rb'))
70 | outliers_model = pickle.load(open(outliers_model_name, 'rb'))
71 | #output results
72 | test_accuracy_high_certainity,F1_score_high_certainity,auc_high_certainity,low_confidence_indices=generate_result.out_result_highprob(test_data_inlier_CVspace,
73 | test_labels_inlier,orignal_mask_flatten,created_mask_high_certainity_flatten,high_certainity_model)
74 | test_accuracy_low_certainty,F1_score_low_certainty,auc_low_certainty=generate_result.out_result(test_data_inlier_CVspace[low_confidence_indices],
75 | test_labels_inlier[low_confidence_indices],orignal_mask_flatten,created_mask_high_certainity_flatten,low_certainty_model)
76 | test_accuracy_outlier,F1_score_outlier,auc_outlier= generate_result.out_result(test_data_outlier_cv ,
77 | test_labels[outlier_indices_test], orignal_mask_flatten,
78 | created_mask_outlier_flatten, outliers_model)
79 |
80 |
81 | #print results
82 | print('total_test_accuracy>',(test_accuracy_high_certainity+test_accuracy_low_certainty+test_accuracy_outlier)/3)
83 | print('total_F1_score>',(F1_score_high_certainity+F1_score_low_certainty+F1_score_outlier)/3)
84 | print('total_AUC_score>',(auc_high_certainity+auc_low_certainty+auc_outlier)/3)
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/main.py:
--------------------------------------------------------------------------------
1 | #from hyper_opt import create_mask,model,model_1D
2 | import load_data
3 | import data_preprocessing
4 | import generate_result
5 | from Model import create_mask
6 | from pathlib import Path
7 |
8 |
9 |
10 | def main():
11 | train_Con_file_name = 'whole_brain_Oulu_Con.npz'
12 | train_AD_file_name = 'whole_brain_Oulu_AD.npz'
13 | test_Con_file_name = 'whole_brain_ADNI_Con.npz'
14 | test_AD_file_name = 'whole_brain_ADNI_AD.npz'
15 | root_dir='/data'
16 | mask_name='4mm_brain_mask_bin.nii.gz'
17 | results_directory='Results'
18 | results_path=load_data.find_path(results_directory)
19 | number_of_cv=5
20 | feature_selection_type='recursion'
21 | data_preprocessing_method='kstest and standarization and Normalization and Density ratio estimation'
22 | Hyperparameter=(4000,10)
23 | train_data,train_labels=load_data.train_data(train_Con_file_name,train_AD_file_name)
24 | test_data, test_labels = load_data.test_data(test_Con_file_name, test_AD_file_name)
25 | #sample_weight = data_preprocessing.density_ratio_estimation(train_data,test_data)
26 | original_mask=load_data.mask(mask_name,root_dir)
27 |
28 | #created_mask,model_,model_name,weights=create_mask(train_data,labels_train,number_of_cv,feature_selection_type,
29 | #Hyperparameter,mask_threshold=2,model_type='gaussian_process')
30 |
31 | #test_data = data_preprocessing.coefficient_of_variance(test_data)
32 |
33 | #model_, model_name=model_1D(train_data,labels_train,created_mask,data_validation=None,labels_validation=None,model_type='gaussian_process')
34 | train_data,test_data=data_preprocessing.KSTest(train_data,test_data,step=Hyperparameter[1])
35 |
36 |
37 | train_data = data_preprocessing.standarization(train_data)
38 | test_data = data_preprocessing.standarization(test_data)
39 | train_data = data_preprocessing.standarization(train_data)
40 | test_data = data_preprocessing.standarization(test_data)
41 |
42 | train_data = data_preprocessing. MinMax_scaler(train_data)
43 | test_data= data_preprocessing. MinMax_scaler(test_data)
44 |
45 | sample_weights=data_preprocessing.density_ratio_estimation(train_data,test_data)
46 |
47 | created_mask,model,model_name,weights =create_mask(train_data,train_labels,number_of_cv,feature_selection_type,Hyperparameter[0],1,model_type='Random_forest', sample_weights=sample_weights)
48 |
49 | generate_result.print_result(test_data, test_labels, original_mask, created_mask, model, model_name, weights,
50 | results_path,feature_selection_type,Hyperparameter,data_preprocessing_method)
51 |
52 |
53 |
54 |
55 |
56 |
57 | if __name__=='__main__':
58 | main()
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/readme.md:
--------------------------------------------------------------------------------
 1 | This project uses resting-state fMRI to classify patients with Alzheimer's disease from controls.
 2 | The project started in June 2019 as part of a summer internship at the University of Oulu, in collaboration with Oulu University
 3 | Hospital.
4 |
5 | ---
6 | ## Installation
7 | ___
8 | ### Dependencies
9 | * Python(>=3.5)
10 | * Keras==2.2.4
11 | * nilearn==0.5.2
12 | * scipy==1.2.1
13 | * nibabel==2.4.1
14 | * numpy==1.16.2
15 | * imbalanced_learn==0.5.0
16 | * imblearn==0.0
17 | * scikit_learn==0.21.2
18 | * densratio==0.2.2
19 | * skimage==0.0
20 | * matplotlib==3.0.3
21 |
22 | Download all using:
23 |
24 | ```bash
25 | pip3 install -r requirements.txt
26 | ```
27 |
28 | ## Data
29 |
30 | In this project, University of Oulu data were used for training the model and [ADNI](http://adni.loni.usc.edu/data-samples/) data were used for testing its performance.
31 | The 4 mm brain mask used to extract the brain from the scalp was created with [FSL](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki).
32 |
33 | ---
34 | ## Model
35 |
36 | The coefficient of variation over time of the BOLD signal is used as the input to the model. The model consists of three main
37 | steps:
38 | 1. Loading and preprocessing the input data
39 | 1. Voxel selection using cross-validation to create a mask of the most informative voxels
40 | 1. Classifying the masked data with several classifiers and printing the results (the Gaussian process classifier gives the best
41 | performance in this high-dimensional space); a minimal sketch of this pipeline is shown below.
42 |
43 |
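Before diving into main.py, the following minimal sketch illustrates the three steps above. It is not the project's code: the file names, the use of `scipy.stats.variation` for the coefficient of variation, and the bare `GaussianProcessClassifier` (without the cross-validated voxel-selection step or the full preprocessing used in the project) are simplifying assumptions.

```python
# Minimal, illustrative sketch of the pipeline (see main.py for the real code).
import nibabel as nib
import numpy as np
from scipy.stats import variation
from sklearn.gaussian_process import GaussianProcessClassifier

def cv_features(nii_path, mask):
    """Coefficient of variation over time for every voxel inside the mask."""
    vol = nib.load(nii_path).get_fdata()   # 4-D volume: (x, y, z, time)
    cv = np.nan_to_num(variation(vol, axis=3))  # CV over the time axis; zero-mean voxels -> 0
    return cv[mask > 0]                    # 1-D feature vector per subject

# 1. Load and preprocess (file names below are hypothetical placeholders)
mask = nib.load('4mm_brain_mask_bin.nii.gz').get_fdata()
subject_paths = ['subject_01.nii.gz', 'subject_02.nii.gz']
labels = np.array([0, 1])                  # 0 = control, 1 = AD
X = np.stack([cv_features(p, mask) for p in subject_paths])

# 2. Voxel selection would go here (cross-validation builds a mask of informative voxels)

# 3. Classify the (masked) features
clf = GaussianProcessClassifier().fit(X, labels)
```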
44 | ---
45 | ### Running the code
46 | ```bash
47 | python3 main.py
48 | ```
49 | Change the input parameters and variables in the main.py file to fit your requirements and preferences.
50 | The default script runs in under 10 minutes on an average laptop using less than 1 GB of memory.
51 |
52 | ---
53 | ## Results
54 |
55 | You should create the results folder yourself and assign its name to the variable `results_directory`. It will contain the measured performance in **Results.txt**, the parameters used and the chosen classifier in **README.txt**, the mask of effective voxels, and the voxel importance weights.
56 | The provided pretrained model in **Output_results_directory** achieves the following performance (median and 95% confidence limits):
57 |
58 | | | Median | Min(.95 CL) | Max(.95 CL) |
59 | |----------|--------|-------------|-------------|
60 | | Accuracy | .702 | .555 | .835 |
61 | | F1_score | .723 | .595 | .841 |
62 | | AUC | .781 | .721 | .856 |
63 |
64 | #### Note: the mask and weight files are NIfTI images; you can process them with FSL utilities or the nibabel library, and visualize them with FSLeyes.
65 | #### Note: always give the *Results* directory a name that is unique within your root directory.
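As a quick illustration of the nibabel option mentioned in the first note, the snippet below loads one of the produced masks and reports how many voxels it keeps; the path is the pretrained model's mask referenced in load_models.py, but any of the produced NIfTI files can be inspected the same way.

```python
# Inspect a produced NIfTI mask with nibabel.
import nibabel as nib
import numpy as np

mask_img = nib.load('./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz')
mask = mask_img.get_fdata()
print('mask shape:', mask.shape)
print('selected voxels:', int(np.count_nonzero(mask)))
```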
66 |
67 |
68 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/sample_test.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | import load_data
3 | import data_preprocessing
4 | import numpy as np
5 | import nibabel as nib
6 | from sklearn.neighbors import LocalOutlierFactor
7 | from scipy.stats import variation
8 |
9 | train_Con_file_name = 'CV_OULU_CON.npz'
10 | train_AD_file_name = 'CV_OULU_AD.npz'
11 | mask_name = '4mm_brain_mask_bin.nii.gz'
12 | created_mask_high_certainity_file_name='./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz'
13 | created_mask_outlier_file_name='./Output_results_directory/2019-08-10/1/high_certainity_model_mask.nii.gz'
14 | high_certainity_model_name='./Output_results_directory/2019-08-10/1/high_certainity_model.sav'
15 | low_certainty_model_name='./Output_results_directory/2019-08-10/1/low_certainty_model.sav'
16 | outliers_model_name='./Output_results_directory/2019-08-10/1/outliers_model.sav'
17 | scaler_name='scaler.sav'
18 | number_of_neighbours = 1
19 | model_type='gaussian_process'
20 |
21 | #load nii file
22 | sample_name='CV_ADNI_AD.nii.gz'
23 | sample_path=load_data.find_path(sample_name)
24 | sample = nib.load(sample_path)
25 | sample = sample.get_fdata()
26 | sample=sample[:,:,:,7] #comment if using 3d data
27 |
28 | #load necessary files
29 | mask_4mm = load_data.mask(mask_name)
30 | original_mask=mask_4mm.get_fdata()
31 | orignal_mask_flatten = data_preprocessing.flatten(original_mask[np.newaxis, :, :, :].copy())
32 | orignal_mask_flatten = np.reshape(orignal_mask_flatten, (-1))
33 | created_mask_high_certainity = nib.load(created_mask_high_certainity_file_name)
34 | created_mask_outlier = nib.load(created_mask_outlier_file_name)
35 | created_mask_high_certainity=created_mask_high_certainity.get_fdata()
36 | created_mask_outlier=created_mask_outlier.get_fdata()
37 | created_mask_high_certainity_flatten = data_preprocessing.flatten(created_mask_high_certainity[np.newaxis, :, :, :].copy())
38 | created_mask_high_certainity_flatten = np.reshape(created_mask_high_certainity_flatten, (-1))
39 | created_mask_outlier_flatten = data_preprocessing.flatten(created_mask_outlier[np.newaxis, :, :, :].copy())
40 | created_mask_outlier_flatten = np.reshape(created_mask_outlier_flatten, (-1))
41 | train_data,train_labels=load_data.train_data_3d(train_Con_file_name,train_AD_file_name)
42 | train_data = np.moveaxis(train_data.copy(), 3, 0)
43 | train_data = train_data * original_mask
44 | train_data_flattened = data_preprocessing.flatten(train_data.copy())
45 | train_data_flattened = data_preprocessing.MinMax_scaler(train_data_flattened.copy())
46 |
47 |
48 |
49 | #preprocessing
50 | sample_pre=np.array(sample)*original_mask
51 | sample_pre=np.reshape(sample_pre,(1,-1))
52 | scaler = pickle.load(open(scaler_name, 'rb'))
53 | sample_pre=scaler.transform(sample_pre)
54 |
55 |
56 | #load_models
57 | high_certainity_model = pickle.load(open(high_certainity_model_name, 'rb'))
58 | low_certainty_model = pickle.load(open(low_certainty_model_name, 'rb'))
59 | outliers_model = pickle.load(open(outliers_model_name, 'rb'))
60 |
61 | #prediction
62 | neigh = LocalOutlierFactor(n_neighbors=number_of_neighbours,novelty=True)
63 | neighbours=neigh.fit(train_data_flattened)
64 | inlier_outlier_state=neighbours.predict(sample_pre.copy())
65 | if (inlier_outlier_state==1):#inlier
66 | sample_pre = variation(sample_pre[:,np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)]*
67 | created_mask_high_certainity_flatten[np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)],axis=1)[:,np.newaxis]
68 |
69 |
70 | if (model_type!='ensamble classifer'):
71 | sample_prob=high_certainity_model.predict_proba(sample_pre)
72 | if ((sample_prob[0,0]>.64)|(sample_prob[0,0]<.35)):
73 | sample_pred=high_certainity_model.predict(sample_pre)
74 | print('high certainty prediction')
75 |
76 | else:
77 | sample_pred = low_certainty_model.predict(sample_pre)
78 | print('low certainty prediction')
79 |
80 | else:
81 | sample_pred = low_certainty_model.predict(sample_pre)
82 |
83 | else:
84 | sample_pre=variation(sample_pre[:,np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)]*
85 | created_mask_outlier_flatten[np.squeeze(np.where(orignal_mask_flatten > 0), axis=0)],axis=1)[:,np.newaxis]
86 | sample_pred = outliers_model.predict(sample_pre)
87 | print('an outlier prediction')
88 | if sample_pred:
89 | print('sample-prediction: AD')
90 | else:
91 | print('sample-prediction: CON')
92 |
93 |
94 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/shuffle.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from numpy import load
3 | import os
4 |
5 |
6 | oulu_con_data=load('/data/fmri/Folder/AD_classification/Data/input_data/whole_brain_Oulu_Con.npz')['masked_voxels']
7 | oulu_ad_data=load('/data/fmri/Folder/AD_classification/Data/input_data/whole_brain_Oulu_AD.npz')['masked_voxels']
8 | adni_con_data=load('/data/fmri/Folder/AD_classification/Data/input_data/whole_brain_ADNI_Con.npz')['masked_voxels']
9 | adni_ad_data=load('/data/fmri/Folder/AD_classification/Data/input_data/whole_brain_ADNI_AD.npz')['masked_voxels']
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 | idx = np.random.permutation(np.shape(oulu_con_data)[0])
18 | oulu_con_data= (oulu_con_data)[idx,:]
19 | oulu_ad_data=(oulu_ad_data)[idx,:]
20 | adni_con_data=(adni_con_data)[idx,:]
21 | adni_ad_data=(adni_ad_data)[idx,:]
22 |
23 | print(idx)
24 | print(np.shape(idx))
25 | os.mkdir('./data')
26 | np.savez('./data/oulu_con_data', masked_voxels=oulu_con_data)
27 | np.savez('./data/oulu_ad_data',masked_voxels=oulu_ad_data)
28 | np.savez('./data/adni_con_data',masked_voxels=adni_con_data)
29 | np.savez('./data/adni_ad_data',masked_voxels=adni_ad_data)
30 | np.savez('./data/key',idx)
31 |
32 | npzfile = np.load('data/key.npz')
33 | npzfile=np.asarray(npzfile['arr_0'])
34 | print(npzfile)
35 | print(np.shape(npzfile))
--------------------------------------------------------------------------------
/Machine Learning/Classification/Alzhimers CV-BOLD Classification/writing.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from numpy import load
3 | import nibabel as nib
4 |
5 | masking_img = nib.load('/data/fmri/Folder/AD_classification/Data/input_data/4mm_brain_mask_bin.nii.gz')
6 | masking_shape = masking_img.shape
7 |
8 | masking = np.empty(masking_shape, dtype=float)
9 | masking[:,:,:] = masking_img.get_data().astype(float)
10 | print(masking.shape)
11 | tmp=np.where(masking[2,:,:]>0)
12 | print(((tmp)))
13 |
14 |
15 |
16 | '''
17 | import os
18 | from datetime import date
19 | import glob
20 |
21 | today = date.today()
22 | x='hello'
23 | f=open('test.txt',"w+")
24 | f.write(x+'world')
25 |
26 | os.mkdir('writing')
27 | LatestFile = sorted(os.listdir('/data/fmri/Folder/AD_classification/codes/model/writing'),reverse = True)
28 |
29 | x=int(LatestFile[0][0])
30 | x=x+1
31 | print(x)
32 |
33 | '''
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/Sensor Activity Recognition.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Machine Learning/Classification/Sensor-activity-recognition/Sensor Activity Recognition.pdf
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/codes/classes_accuarcy.m:
--------------------------------------------------------------------------------
1 | function [class_accuarcy] = classes_accuarcy(test_labels,predicted_labels)
 2 | % Given the predicted labels and the ground-truth labels, this function
 3 | % returns the per-class (per-activity) accuracy of the classifier
4 |
5 | misclassified_index=find(test_labels~=predicted_labels);
6 |
7 | misclassified_labels=test_labels(misclassified_index);
8 |
9 | unique_labels=unique(test_labels);
10 | class_accuarcy=[];
11 | for i =1: length(unique_labels)
12 | counter=length(find(misclassified_labels==unique_labels(i)));
13 | class_accuarcy=[class_accuarcy counter];
14 | end
15 | class_accuarcy=1-(class_accuarcy/length(test_labels));
16 |
17 |
18 | end
19 |
20 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/codes/classification.m:
--------------------------------------------------------------------------------
1 | function [acc,class_accuarcy] = classification(train_features,train_labels,test_features,test_labels,classifier_name)
 2 | % Train the chosen classifier (KNN, LDA or QDA) on the training data and
 3 | %   return its overall accuracy and per-class accuracy on the test data
4 |
5 |
6 | if strcmp(classifier_name,'KNN')
7 | clear classifer_model;
8 | classifer_model=fitcknn(train_features,train_labels,'NumNeighbors',5);
9 | predicted_labels=predict(classifer_model,test_features);
10 | performance_evaluaion=classperf(test_labels);
11 | classperf(performance_evaluaion,predicted_labels);
12 | acc=length(find(predicted_labels==test_labels))/length(predicted_labels);
13 | class_accuarcy = classes_accuarcy(test_labels,predicted_labels);
14 |
15 | elseif strcmp(classifier_name,'LDA')
16 | clear classifer_model;
17 | classifer_model=fitcdiscr(train_features,train_labels,'DiscrimType','linear');
18 | predicted_labels=predict(classifer_model,test_features);
19 | acc=length(find(predicted_labels==test_labels))/length(predicted_labels);
20 | class_accuarcy = classes_accuarcy(test_labels,predicted_labels);
21 |
22 |
23 | elseif strcmp(classifier_name,'QDA')
24 | classifer_model=fitcdiscr(train_features,train_labels,'DiscrimType','quadratic');
25 | predicted_labels=predict(classifer_model,test_features);
26 | acc=length(find(predicted_labels==test_labels))/length(predicted_labels);
27 | class_accuarcy = classes_accuarcy(test_labels,predicted_labels);
28 |
29 | end
30 |
31 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/codes/create_feature_map.m:
--------------------------------------------------------------------------------
1 | function [feature_map] = create_feature_map(activity_data,window_size)
 2 | % Build a windowed feature map for every participant in activity_data and
 3 | %   save it to 'feature_map_<window_size>s.mat'
4 | struct_fields=fieldnames(activity_data);
5 | feature_map=struct;
6 | for i=1:length(struct_fields)
7 | participant=getfield(activity_data,cell2mat(struct_fields(i)));
8 | labels=participant.labels;
9 | new_label_index=find((labels(2:end)-labels(1:end-1))~=0);
10 | new_label_index=[1 ; new_label_index ; length(labels)];
11 | participant_field_names=fieldnames(participant);
12 | labels_col=[];
13 | feat_map_participant=[];
14 | feature_map_all_positions=[];
15 | for j=1:length(participant_field_names)
16 | feat_map_all_labels_one_position=[];
17 | if ~strcmp(cell2mat(participant_field_names(j)), 'labels') && ~strcmp(cell2mat(participant_field_names(j)), 'time')
18 | position_data=getfield(participant,cell2mat(participant_field_names(j)));
19 | labels_col=[];
20 | for k=1:length(new_label_index)-1
21 | feat_map_all_axis_one_label=[];
22 | for c=1:9
23 | if k==length(new_label_index)-1
24 | strip=position_data(new_label_index(k):new_label_index(k+1),c);
25 | current_label=labels(new_label_index(k+1));
26 |
27 | else
28 | strip=position_data(new_label_index(k):new_label_index(k+1),c);
29 | current_label=labels(new_label_index(k+1));
30 | end
31 | feat_map_axis=features(strip,window_size);
32 | feat_map_all_axis_one_label=horzcat(feat_map_all_axis_one_label,feat_map_axis);
33 |
34 | end
35 | labels_col_temp=ones(size(feat_map_all_axis_one_label,1),1)*current_label;
36 | labels_col=vertcat(labels_col,labels_col_temp);
37 | feat_map_all_labels_one_position=vertcat(feat_map_all_labels_one_position,feat_map_all_axis_one_label);
38 | end
39 | feature_map_all_positions=horzcat(feature_map_all_positions,feat_map_all_labels_one_position);
40 |
41 | end
42 | end
43 | feat_map_participant=horzcat(labels_col,feature_map_all_positions);
44 | feature_map.(cell2mat(struct_fields(i)))=feat_map_participant;
45 | end
46 | save([ 'feature_map_' int2str(window_size) 's.mat'],'feature_map')
47 | end
48 |
49 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/codes/main.m:
--------------------------------------------------------------------------------
1 | %% create new feature map and use it
2 | clear all
3 | clc;
4 | activity_data=load('dataActivity');
5 | classification_accuarcy=[];
6 | k_folds=10; % number of folds used for cross validation
 7 | window_size=8; % window size for creating the feature map
 8 | create_new_feature_map=1; % if ==1 then a new feature map will be created, else a saved one will be used
 9 | saved_feature_map_file_name='feature_map_3s.mat'; % the name of the feature map file
10 | scaling=0; % scaling should be ==1 if you would like to scale the data and 0 if not
11 | outliers=0; % outliers should be ==1 if you would like to remove the outliers and 0 if not
12 | %[activity_data] = scalingANDoutliers(activity_data,scaling,outliers);
13 |
14 | % Check whether a new feature map should be created or a saved one should be
15 | % used
16 | if create_new_feature_map==1
17 | feature_map=create_feature_map(activity_data,window_size);
18 | participant_names=fieldnames(feature_map);
19 | else
20 | feature_map=load(saved_feature_map_file_name);
21 | feature_map=feature_map.feature_map;
22 | participant_names=fieldnames(feature_map);
23 | end
24 |
25 |
26 | classifier_name='KNN'; % classifier name; to change the classifier, check the names of the available classifiers in the classification function
27 | train_labels=[];
28 | train_features=[];
29 | cross_validation_each_posiiton_acc=[];
30 | max_class_acc_each_position=[];
31 | max_class_acc_value=[];
32 |
33 |
34 | time_starting_feature_index=1;
35 | frequency_starting_feature_index=7;
36 |
37 | time_ending_feature_index=6;
38 | frequency_ending_feature_index=11;
39 |
40 | time_features=0;
41 | classification_accuarcy=[];
42 | class_acc_all_folds=[];
43 | for j=1:k_folds
44 | train_features=[];
45 | test_features=[];
46 | train_labels=[];
47 | test_labels=[];
48 | for i=1:length(participant_names)
49 |
50 | if i==length(participant_names)-(j-1)
51 | participant=getfield(feature_map,cell2mat(participant_names(i)));
52 | test_labels= participant(:,1);
53 | test_features=participant(:,2:end);
54 | else
55 | participant=getfield(feature_map,cell2mat(participant_names(i)));
56 | train_labels=vertcat(train_labels,participant(:,1));
57 | train_features=vertcat(train_features,participant(:,2:end));
58 | end
59 | end
60 | %%%% Selecting certain features (for example certain positions or certain
61 | %%%% axes) should be done here before giving the data to the classifier.
62 | position_feature_index_all=[];
63 | for sensor=0:8
64 | for position=0:4
65 | if time_features==1
66 | starting_index=time_starting_feature_index;
67 | ending_index= time_ending_feature_index;
68 | elseif time_features==0
69 | starting_index=frequency_starting_feature_index;
70 | ending_index= frequency_ending_feature_index;
71 | else
72 | starting_index=1;
73 | ending_index=11;
74 | end
75 | position_feature_index=linspace(starting_index,ending_index,ending_index-starting_index+1)+11*sensor+position*99;
76 | position_feature_index_all=[position_feature_index_all position_feature_index];
77 | end
78 | end
79 |
80 | train_features_certain_positions=train_features(:,position_feature_index_all);
81 | test_features_certain_positions=test_features(:,position_feature_index_all);
82 | [train_features_certain_positions] = scalingANDoutliers(train_features_certain_positions,scaling,outliers);
83 | [test_features_certain_positions] = scalingANDoutliers(test_features_certain_positions,scaling,outliers);
84 | [acc,class_acc] =classification(train_features_certain_positions,train_labels,test_features_certain_positions,test_labels,classifier_name);
85 | classification_accuarcy=[classification_accuarcy acc];
86 | class_acc_all_folds=[class_acc_all_folds ; class_acc];
87 | end
88 | cross_validation_acc=mean(classification_accuarcy);
89 | class_acc_one_position=mean(class_acc_all_folds,1);
90 | [max_class_acc,max_class_label]=max(class_acc_one_position);
91 |
92 |
93 |
94 |
95 |
96 |
97 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/codes/performance_evaluation.m:
--------------------------------------------------------------------------------
1 | function [cp] = performance_evaluation(classifier_model,test_data,test_labels)
 2 | % Evaluate a trained classifier on the test data and return a classperf
 3 | %   object with the resulting performance statistics
 4 | predicted_labels=predict(classifier_model,test_data);
5 | cp=classperf(test_labels);
6 | classperf(cp,predicted_labels)
7 |
8 |
9 |
10 | end
11 |
12 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/codes/readme.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/codes/scalingANDoutliers.m:
--------------------------------------------------------------------------------
1 | function [scaledANDcleanedData] = scalingANDoutliers(data,scaling,outliers)
 2 | % Scales the activity data person-wise and position-wise
3 | % Detailed explanation goes here
4 |
5 | %positions = {'leftPocket','rightPocket','belt','wrist','upperArm'};
6 |
7 | %scaling
8 |
9 |
10 | if scaling == 1 % rescale each row using its minimum and maximum
11 |
12 | minVal = min(data,[],2);
13 | maxVal = max(data,[],2);
14 |
15 | data = (data - minVal)./maxVal;
16 |
17 |
18 | else
19 | ;
20 | end
21 |
22 | if outliers == 1
23 |
24 |     % The function receives a numeric feature matrix here, so index it directly:
25 |     % values larger than 2 are treated as outliers and replaced with NaN.
26 |     data(data > 2) = NaN;
27 |
28 |
29 |
30 | else
31 | ;
32 | end
33 |
34 |
35 | scaledANDcleanedData = data;
36 |
37 | end
38 |
39 |
40 |
--------------------------------------------------------------------------------
/Machine Learning/Classification/Sensor-activity-recognition/readme.md:
--------------------------------------------------------------------------------
 1 | # Sensor Activity Recognition
2 |
3 |
4 | ### Dataset
5 | The dataset can be downloaded __[here](https://www.kaggle.com/youssef19/sensor-activity-dataset)__
6 |
 7 | The feature matrix has 381 columns; the number of rows depends on the window size.
8 |
 9 | The first column is the labels column; the remaining 380 feature columns are organized as follows:
10 | 380 = 9 * 5 * 8
11 |
12 | 9: (3 axes for each of the three sensors: accelerometer, linear accelerometer and gyroscope)
13 |
14 | 5: (five positions, in this order: left pocket, right pocket, belt, wrist, upper arm)
15 |
16 | 8: (eight features, in this order:
17 |
18 | 1.Mean
19 |
20 | 2.Standard Deviation
21 |
22 | 3.Median
23 |
24 | 4.Variance
25 |
26 | 5.Zero crossings
27 |
28 | 6.Root mean square value
29 |
30 | 7.Sum of FFT coefficients
31 |
32 | 8.Signal energy medium )
33 |
34 |
35 | So the first 8 feature columns are the features of the x-axis of the accelerometer for the left pocket position, and so on.
36 |
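To make the column layout concrete, here is a small NumPy sketch. It assumes one participant's feature matrix has already been exported to a plain array (the `feature_map.npy` file is a hypothetical export, not part of this repository).

```python
# Illustrative slicing of the feature matrix described above.
import numpy as np

feature_map = np.load('feature_map.npy')   # hypothetical export of one participant's matrix

labels = feature_map[:, 0]                 # first column: activity labels
features = feature_map[:, 1:]              # remaining 380 feature columns

# Per the readme, the first 8 feature columns are the 8 features of the
# accelerometer x-axis recorded at the left-pocket position.
left_pocket_accel_x = features[:, 0:8]
print(labels.shape, features.shape, left_pocket_accel_x.shape)
```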
37 |
--------------------------------------------------------------------------------
/Machine Learning/Clustering/Customer identification for mail order products/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 youssefHosni
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Machine Learning/Clustering/Customer identification for mail order products/README.md:
--------------------------------------------------------------------------------
 1 | # Project summary
 2 | In this project, real-life data from Bertelsmann partners AZ Direct and Arvato Finance Solution were used. The data concerns a company that performs mail-order sales in Germany. Their main question of interest is to identify facets of the population that are most likely to be purchasers of their products for a mailout campaign. As a data scientist, I used unsupervised learning techniques to organize the general population into clusters, then used those clusters to see which of them comprise the main user base for the company. Prior to applying the machine learning methods, I cleaned the data to convert it into a usable form.
3 |
 4 | ## Steps
5 | ### Step 1: Preprocessing
6 | When you start an analysis, you must first explore and understand the data that you are working with. In this (and the next) step of the project, you’ll be working with the general demographics data. As part of your investigation of dataset properties, you must attend to a few key points:
7 |
8 | How are missing or unknown values encoded in the data? Are there certain features (columns) that should be removed from the analysis because of missing data? Are there certain data points (rows) that should be treated separately from the rest?
9 | Consider the level of measurement for each feature in the dataset (e.g. categorical, ordinal, numeric). What assumptions must be made in order to use each feature in the final analysis? Are there features that need to be re-encoded before they can be used? Are there additional features that can be dropped at this stage?
10 | You will create a cleaning procedure that you will apply first to the general demographic data, then later to the customers data.
11 |
12 | ### Step 2: Feature Transformation
13 | Now that your data is clean, you will use dimensionality reduction techniques to identify relationships between variables in the dataset, resulting in the creation of a new set of variables that account for those correlations. In this stage of the project, you will attend to the following points:
14 |
15 | The first technique that you should perform on your data is feature scaling. What might happen if we don’t perform feature scaling before applying later techniques you’ll be using?
16 | Once you’ve scaled your features, you can then apply principal component analysis (PCA) to find the vectors of maximal variability. How much variability in the data does each principal component capture? Can you interpret associations between original features in your dataset based on the weights given on the strongest components? How many components will you keep as part of the dimensionality reduction process?
17 | You will use the sklearn library to create objects that implement your feature scaling and PCA dimensionality reduction decisions.
18 |
19 | ### Step 3: Clustering
20 | Finally, on your transformed data, you will apply clustering techniques to identify groups in the general demographic data. You will then apply the same clustering model to the customers dataset to see how market segments differ between the general population and the mail-order sales company. You will tackle the following points in this stage:
21 |
22 | Use the k-means method to cluster the demographic data into groups. How should you make a decision on how many clusters to use?
23 | Apply the techniques and models that you fit on the demographic data to the customers data: data cleaning, feature scaling, PCA, and k-means clustering. Compare the distribution of people by cluster for the customer data to that of the general population. Can you say anything about which types of people are likely consumers for the mail-order sales company?
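The sketch below strings these three stages together with scikit-learn. It is only an outline of the approach described above, not the notebook's code; the cleaned-data file name and the number of PCA components and clusters are illustrative assumptions.

```python
# Outline of the workflow: feature scaling -> PCA -> k-means (illustrative only).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

azdias_clean = pd.read_csv('azdias_clean.csv')     # hypothetical cleaned demographics data

scaler = StandardScaler()
X = scaler.fit_transform(azdias_clean)

pca = PCA(n_components=60)                         # count chosen from an explained-variance (scree) plot
X_pca = pca.fit_transform(X)

kmeans = KMeans(n_clusters=10, random_state=0)     # number of clusters chosen via an elbow plot
population_clusters = kmeans.fit_predict(X_pca)

# The same fitted scaler, PCA and k-means objects are then applied to the customers data
# (after the same cleaning), so the two cluster distributions can be compared.
```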
24 |
25 | ## Requirements
26 | * NumPy
27 | * pandas
28 | * Sklearn / scikit-learn
29 | * Matplotlib (for data visualization)
30 | * Seaborn (for data visualization)
31 |
32 | ## Data used
33 |
34 | Demographic data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
35 | Demographic data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
36 | The data is not provided as it is private data.
37 |
38 |
--------------------------------------------------------------------------------
/Machine Learning/Clustering/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 youssefHosni
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Machine Learning/Clustering/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/Project report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Machine Learning/Clustering/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/Project report.pdf
--------------------------------------------------------------------------------
/Machine Learning/Clustering/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/Readme.md:
--------------------------------------------------------------------------------
1 | # Finding the best neighborhood to open a new gym
2 |
3 | ## Introduction
 4 | Finding the best neighborhood in Toronto to open a new gym, given demographic, geographic, and venue information. More information can be found [here](https://github.com/youssefHosni/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/blob/main/Project%20report.pdf)
5 |
6 | ## Data
7 |
 8 | The dataset used to solve this problem has the following information:
 9 | * The demographic information: the **total population**, the **15-45 population**, the **number of educated people** and the **number of employers** in each neighborhood.
10 | * The geographical data (lat, long) for each neighborhood, which is used to get the venue information for each neighborhood from the Foursquare API.
11 |
12 | The neighbourhoods on the map are shown in the figure below
13 | 
14 | From the Foursquare API, the number of venues and the number of gym/fitness centers per neighbourhood were calculated and then merged with the demographic data; the final dataset used is shown in the figure below.
15 | 
16 | The final dataset can be found [here](https://www.kaggle.com/youssef19/toronto-neighborhoods-inforamtion)
17 |
18 | ## Methodology
19 | ### Data preprocessing
20 | The data were normalized using min-max normalization. This is an important step because the k-means algorithm depends on distance measurements, so the features should be on a similar scale. The min-max scaler is defined as:
21 | `(feature - min(feature)) / (max(feature) - min(feature))`
22 | The neighborhood names and the geographical data were dropped, as they are not used by the clustering algorithm.
23 | ### K-means clustering
24 | The best k was found using the elbow method, in which the average distance to the cluster centers is calculated for different values of k and the best k is the one at the elbow; here it was found to be 3. A minimal sketch of the scaling and elbow search is shown below.
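This sketch is illustrative only: the file name and the dropped column names are assumptions, not the notebook's actual identifiers.

```python
# Min-max scaling and the elbow method for choosing k (illustrative sketch).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

df = pd.read_csv('toronto_neighborhoods.csv')                           # hypothetical merged dataset
features = df.drop(columns=['Neighborhood', 'Latitude', 'Longitude'])   # assumed column names

X = MinMaxScaler().fit_transform(features)          # (x - min) / (max - min) per feature

inertia = []
for k in range(1, 10):
    inertia.append(KMeans(n_clusters=k, random_state=0).fit(X).inertia_)
# Plot inertia vs. k and pick the "elbow"; here it was found to be k = 3.
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
```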
25 |
26 | ## Results
27 | The neighborhoods are clustered into three clusters as shown in the figure below. The red color is the first cluster, the violet is the second cluster, green is the third cluster.
28 | 
29 |
30 | ## Conclusion
31 | Using the demographic data and the venue information for each neighborhood obtained from the Foursquare API, I was able to cluster the neighborhoods into three clusters using the K-means clustering algorithm. The number of gyms was found to be correlated with the number of venues. The neighborhoods with a large number of venues and gyms fall into the third cluster, and the most suitable neighborhood from this cluster is the **Trinity-Bellwoods neighborhood**. The first cluster contains neighborhoods with a large population, a small number of gyms and a moderate number of venues. The **Church-Yonge Corridor neighborhood** is the best choice from this cluster, as it contains 98 venues and a large population. Its number of venues is almost the same as that of Trinity-Bellwoods while its population is twice as large, making it the best neighborhood in which to open a new gym in Toronto.
32 |
33 | ## License & Copyright
34 |
35 | © Youssef Hosni
36 |
37 | Licensed under the [MIT License](https://github.com/youssefHosni/Finding-the-best-Tornoto-neighborhood-to-open-a-new-gym/blob/main/LICENSE).
38 |
--------------------------------------------------------------------------------
/Machine Learning/Clustering/Readme.md:
--------------------------------------------------------------------------------
1 | The clustering projects
2 |
--------------------------------------------------------------------------------
/Machine Learning/Regression/Automobile price prediction/Readme.md:
--------------------------------------------------------------------------------
1 | # Automobile Price Prediction #
2 |
3 | ## 1.Background ##
4 |
 5 | In this project I predict the price of used automobiles. The project includes the following steps:
6 |
7 | * Loading the data
8 | * Preprocessing the data
 9 | * Explore features or characteristics that predict the price of a car
10 | * Develop prediction models
11 | * Evaluate and refine prediction models
12 |
13 | ## 2. Methods
14 |
15 | ### 2.1. Data
16 | The dataset used is the [Automobile Dataset](https://www.kaggle.com/datasets/premptk/automobile-data-changed) from Kaggle.
17 |
18 | ### 2.2. Data Preprocessing
19 |
20 | #### Identify and handle missing values ####
21 |
22 | #### Correct data format ####
23 |
24 | ### 2.3. Feature Engineering ###
25 |
26 | #### Data Standardization ####
27 |
28 |
29 | #### Data Normalization ####
30 |
31 |
32 | #### Binning ####
33 |
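A brief pandas sketch of the three transformations named above is given below; the file and column names (`automobile.csv`, `horsepower`) are illustrative assumptions, not necessarily those used in the notebook.

```python
# Illustrative pandas versions of standardization, normalization and binning.
import pandas as pd

df = pd.read_csv('automobile.csv')   # hypothetical file; see the Kaggle dataset linked above

# Standardization: zero mean, unit variance
df['horsepower_std'] = (df['horsepower'] - df['horsepower'].mean()) / df['horsepower'].std()

# Normalization: rescale to the [0, 1] range
df['horsepower_norm'] = (df['horsepower'] - df['horsepower'].min()) / (df['horsepower'].max() - df['horsepower'].min())

# Binning: group horsepower into three labelled bins
df['horsepower_binned'] = pd.cut(df['horsepower'], bins=3, labels=['low', 'medium', 'high'])
```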
34 |
35 | ### 2.4. Data Exploration ###
36 |
37 |
38 |
39 |
--------------------------------------------------------------------------------
/Machine Learning/Regression/Readme.md:
--------------------------------------------------------------------------------
1 | Regression projects
2 |
--------------------------------------------------------------------------------
/Natural_Language_processing/Data-Science-Resume-Selector/readme.md:
--------------------------------------------------------------------------------
1 | ## Data Science Resume Selector
2 |
 3 | Selecting the resumes that are eligible for data scientist positions. The dataset used contains 125 resumes, in the resumetext column. Resumes were queried from Indeed.com with keyword 'data scientist' and location 'Vermont'. If a resume is 'not flagged', the applicant can submit a modified version at a later date. If it is 'flagged', the applicant is invited to interview.
4 | The data can be downloaded from __[here](https://www.kaggle.com/samdeeplearning/deepnlp)__
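A minimal sketch of this kind of classifier is shown below. It is illustrative only: the column names and the flagged/not-flagged label encoding are assumptions about the CSV, and the notebook's actual preprocessing may differ.

```python
# Illustrative bag-of-words + Naive Bayes resume selector (column names are assumptions).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv('resume.csv')
texts = df['resume_text']                       # assumed name of the resume-text column
y = (df['class'] == 'flagged').astype(int)      # assumed label column: flagged -> 1

X_train, X_test, y_train, y_test = train_test_split(texts, y, test_size=0.2, random_state=0)
vectorizer = CountVectorizer(stop_words='english')
clf = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)
print('test accuracy:', clf.score(vectorizer.transform(X_test), y_test))
```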
5 |
--------------------------------------------------------------------------------
/Natural_Language_processing/Data-Science-Resume-Selector/resume.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/youssefHosni/Data-Science-Portofolio/cfb6f1cdd49735fed4b8c09d4a43e8f00cdf616e/Natural_Language_processing/Data-Science-Resume-Selector/resume.csv
--------------------------------------------------------------------------------
/Natural_Language_processing/Sentiment-analysis/README.md:
--------------------------------------------------------------------------------
1 | # Sentiment Analysis Web App
2 |
 3 | The notebook and Python files provided here, once completed, result in a simple web app which interacts with a deployed recurrent neural network performing sentiment analysis on movie reviews. This project assumes some familiarity with SageMaker; the mini-project, Sentiment Analysis using XGBoost, should provide enough background.
4 |
5 | Please see the [README](https://github.com/udacity/sagemaker-deployment/tree/master/README.md) in the root directory for instructions on setting up a SageMaker notebook and downloading the project files (as well as the other notebooks).
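Once the model is deployed, the web app ultimately sends the raw review text to the SageMaker endpoint. A minimal sketch of that call with boto3 is shown below; the endpoint name is a placeholder, and the plain-text content type matches `input_fn` in sevre/predict.py.

```python
# Sketch of invoking the deployed sentiment endpoint (endpoint name is a placeholder).
import boto3

runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='sentiment-rnn-endpoint',   # placeholder; use your deployed endpoint's name
    ContentType='text/plain',                # matches input_fn in sevre/predict.py
    Body='This movie was absolutely wonderful.'.encode('utf-8'),
)
result = response['Body'].read().decode('utf-8')   # '1' or '0', as returned by output_fn
print(result)
```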
6 |
--------------------------------------------------------------------------------
/Natural_Language_processing/Sentiment-analysis/sevre/model.py:
--------------------------------------------------------------------------------
1 | import torch.nn as nn
2 |
3 | class LSTMClassifier(nn.Module):
4 | """
5 | This is the simple RNN model we will be using to perform Sentiment Analysis.
6 | """
7 |
8 | def __init__(self, embedding_dim, hidden_dim, vocab_size):
9 | """
10 |         Initialize the model by setting up the various layers.
11 | """
12 | super(LSTMClassifier, self).__init__()
13 |
14 | self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
15 | self.lstm = nn.LSTM(embedding_dim, hidden_dim)
16 | self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
17 | self.sig = nn.Sigmoid()
18 |
19 | self.word_dict = None
20 |
21 | def forward(self, x):
22 | """
23 | Perform a forward pass of our model on some input.
24 | """
25 | x = x.t()
26 | lengths = x[0,:]
27 | reviews = x[1:,:]
28 | embeds = self.embedding(reviews)
29 | lstm_out, _ = self.lstm(embeds)
30 | out = self.dense(lstm_out)
31 | out = out[lengths - 1, range(len(lengths))]
32 | return self.sig(out.squeeze())
--------------------------------------------------------------------------------
/Natural_Language_processing/Sentiment-analysis/sevre/predict.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import json
3 | import os
4 | import pickle
5 | import sys
6 | import sagemaker_containers
7 | import pandas as pd
8 | import numpy as np
9 | import torch
10 | import torch.nn as nn
11 | import torch.optim as optim
12 | import torch.utils.data
13 |
14 | from model import LSTMClassifier
15 |
16 | from utils import review_to_words, convert_and_pad
17 |
18 | def model_fn(model_dir):
19 | """Load the PyTorch model from the `model_dir` directory."""
20 | print("Loading model.")
21 |
22 | # First, load the parameters used to create the model.
23 | model_info = {}
24 | model_info_path = os.path.join(model_dir, 'model_info.pth')
25 | with open(model_info_path, 'rb') as f:
26 | model_info = torch.load(f)
27 |
28 | print("model_info: {}".format(model_info))
29 |
30 | # Determine the device and construct the model.
31 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
32 | model = LSTMClassifier(model_info['embedding_dim'], model_info['hidden_dim'], model_info['vocab_size'])
33 |
34 |     # Load the stored model parameters.
35 | model_path = os.path.join(model_dir, 'model.pth')
36 | with open(model_path, 'rb') as f:
37 | model.load_state_dict(torch.load(f))
38 |
39 | # Load the saved word_dict.
40 | word_dict_path = os.path.join(model_dir, 'word_dict.pkl')
41 | with open(word_dict_path, 'rb') as f:
42 | model.word_dict = pickle.load(f)
43 |
44 | model.to(device).eval()
45 |
46 | print("Done loading model.")
47 | return model
48 |
49 | def input_fn(serialized_input_data, content_type):
50 | print('Deserializing the input data.')
51 | if content_type == 'text/plain':
52 | data = serialized_input_data.decode('utf-8')
53 | return data
54 | raise Exception('Requested unsupported ContentType in content_type: ' + content_type)
55 |
56 | def output_fn(prediction_output, accept):
57 | print('Serializing the generated output.')
58 | return str(prediction_output)
59 |
60 | def predict_fn(input_data, model):
61 | print('Inferring sentiment of input data.')
62 |
63 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
64 |
65 | if model.word_dict is None:
66 | raise Exception('Model has not been loaded properly, no word_dict.')
67 |
68 |     # Process input_data so that it is ready to be sent to our model.
69 |     # We produce two variables:
70 | # data_X - A sequence of length 500 which represents the converted review
71 | # data_len - The length of the review
72 |
73 | words = review_to_words(input_data)
74 | data_X, data_len = convert_and_pad(model.word_dict, words)
75 |
76 | # Using data_X and data_len we construct an appropriate input tensor. Remember
77 | # that our model expects input data of the form 'len, review[500]'.
78 | data_pack = np.hstack((data_len, data_X))
79 | data_pack = data_pack.reshape(1, -1)
80 |
81 | data = torch.from_numpy(data_pack)
82 | data = data.to(device)
83 |
84 | # Make sure to put the model into evaluation mode
85 | model.eval()
86 |
87 |     # Compute the result of applying the model to the input data. The variable `result`
88 |     # is a single integer, either 1 or 0, obtained by rounding the sigmoid output.
89 |
90 | with torch.no_grad():
91 | output = model.forward(data)
92 |
93 |     result = int(np.round(output.cpu().numpy()))  # move to CPU before converting to numpy
94 | return result
95 |
--------------------------------------------------------------------------------
/Natural_Language_processing/Sentiment-analysis/sevre/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas
2 | numpy
3 | nltk
4 | beautifulsoup4
5 | html5lib
--------------------------------------------------------------------------------
/Natural_Language_processing/Sentiment-analysis/sevre/utils.py:
--------------------------------------------------------------------------------
1 | import nltk
2 | from nltk.corpus import stopwords
3 | from nltk.stem.porter import *
4 |
5 | import re
6 | from bs4 import BeautifulSoup
7 |
8 | import pickle
9 |
10 | import os
11 | import glob
12 |
13 | def review_to_words(review):
14 | nltk.download("stopwords", quiet=True)
15 | stemmer = PorterStemmer()
16 |
17 | text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
18 | text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case
19 | words = text.split() # Split string into words
20 | words = [w for w in words if w not in stopwords.words("english")] # Remove stopwords
21 |     words = [stemmer.stem(w) for w in words]  # stem each remaining word (reuses the stemmer above)
22 |
23 | return words
24 |
25 | def convert_and_pad(word_dict, sentence, pad=500):
26 | NOWORD = 0 # We will use 0 to represent the 'no word' category
27 | INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
28 |
29 | working_sentence = [NOWORD] * pad
30 |
31 | for word_index, word in enumerate(sentence[:pad]):
32 | if word in word_dict:
33 | working_sentence[word_index] = word_dict[word]
34 | else:
35 | working_sentence[word_index] = INFREQ
36 |
37 | return working_sentence, min(len(sentence), pad)
--------------------------------------------------------------------------------
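A quick sketch of how the two helpers above fit together (the tiny word dictionary here is a placeholder; in the real pipeline it is built from the training corpus):

```python
# Usage sketch: clean/stem a raw review, then map it onto a fixed-length sequence of word ids.
# The word dictionary below is a placeholder; nltk and beautifulsoup4 must be installed.
from utils import review_to_words, convert_and_pad

words = review_to_words("This movie was <br/> absolutely GREAT!")
print(words)                                   # stemmed, stopword-free tokens, e.g. ['movi', 'absolut', 'great']

data, length = convert_and_pad({'movi': 2, 'absolut': 4, 'great': 3}, words)
print(length, data[:5])                        # review length and the first few padded word ids
```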
/Natural_Language_processing/Sentiment-analysis/train/model.py:
--------------------------------------------------------------------------------
1 | import torch.nn as nn
2 |
3 | class LSTMClassifier(nn.Module):
4 | """
5 | This is the simple RNN model we will be using to perform Sentiment Analysis.
6 | """
7 |
8 | def __init__(self, embedding_dim, hidden_dim, vocab_size):
9 | """
10 |         Initialize the model by setting up the various layers.
11 | """
12 | super(LSTMClassifier, self).__init__()
13 |
14 | self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
15 | self.lstm = nn.LSTM(embedding_dim, hidden_dim)
16 | self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
17 | self.sig = nn.Sigmoid()
18 |
19 | self.word_dict = None
20 |
21 | def forward(self, x):
22 | """
23 | Perform a forward pass of our model on some input.
24 | """
25 | x = x.t()
26 | lengths = x[0,:]
27 | reviews = x[1:,:]
28 | embeds = self.embedding(reviews)
29 | lstm_out, _ = self.lstm(embeds)
30 | out = self.dense(lstm_out)
31 | out = out[lengths - 1, range(len(lengths))]
32 | return self.sig(out.squeeze())
--------------------------------------------------------------------------------
/Natural_Language_processing/Sentiment-analysis/train/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas
2 | numpy
3 | nltk
4 | beautifulsoup4
5 | html5lib
--------------------------------------------------------------------------------
/Natural_Language_processing/Sentiment-analysis/train/train.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import json
3 | import os
4 | import pickle
5 | import sys
6 | import sagemaker_containers
7 | import pandas as pd
8 | import torch
9 | import torch.optim as optim
10 | import torch.utils.data
11 |
12 | from model import LSTMClassifier
13 |
14 | def model_fn(model_dir):
15 | """Load the PyTorch model from the `model_dir` directory."""
16 | print("Loading model.")
17 |
18 | # First, load the parameters used to create the model.
19 | model_info = {}
20 | model_info_path = os.path.join(model_dir, 'model_info.pth')
21 | with open(model_info_path, 'rb') as f:
22 | model_info = torch.load(f)
23 |
24 | print("model_info: {}".format(model_info))
25 |
26 | # Determine the device and construct the model.
27 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
28 | model = LSTMClassifier(model_info['embedding_dim'], model_info['hidden_dim'], model_info['vocab_size'])
29 |
30 | # Load the stored model parameters.
31 | model_path = os.path.join(model_dir, 'model.pth')
32 | with open(model_path, 'rb') as f:
33 | model.load_state_dict(torch.load(f))
34 |
35 | # Load the saved word_dict.
36 | word_dict_path = os.path.join(model_dir, 'word_dict.pkl')
37 | with open(word_dict_path, 'rb') as f:
38 | model.word_dict = pickle.load(f)
39 |
40 | model.to(device).eval()
41 |
42 | print("Done loading model.")
43 | return model
44 |
45 | def _get_train_data_loader(batch_size, training_dir):
46 | print("Get train data loader.")
47 |
48 | train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None)
49 |
50 | train_y = torch.from_numpy(train_data[[0]].values).float().squeeze()
51 | train_X = torch.from_numpy(train_data.drop([0], axis=1).values).long()
52 |
53 | train_ds = torch.utils.data.TensorDataset(train_X, train_y)
54 |
55 | return torch.utils.data.DataLoader(train_ds, batch_size=batch_size)
56 |
57 |
58 | def train(model, train_loader, epochs, optimizer, loss_fn, device):
59 | """
60 | This is the training method that is called by the PyTorch training script. The parameters
61 | passed are as follows:
62 | model - The PyTorch model that we wish to train.
63 | train_loader - The PyTorch DataLoader that should be used during training.
64 | epochs - The total number of epochs to train for.
65 | optimizer - The optimizer to use during training.
66 | loss_fn - The loss function used for training.
67 | device - Where the model and data should be loaded (gpu or cpu).
68 | """
69 |
70 |     # Training loop (developed in the accompanying notebook): iterate over epochs and batches.
71 |
72 | for epoch in range(1, epochs + 1):
73 | model.train()
74 | total_loss = 0
75 | for batch in train_loader:
76 | batch_X, batch_y = batch
77 |
78 | batch_X = batch_X.to(device)
79 | batch_y = batch_y.to(device)
80 |
81 |             # Forward pass, compute the BCE loss, backpropagate, and step the optimizer.
82 |
83 | output = model(batch_X)
84 | loss = loss_fn(output, batch_y)
85 | optimizer.zero_grad()
86 | loss.backward()
87 | optimizer.step()
88 |
89 | total_loss += loss.data.item()
90 | print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))
91 |
92 |
93 | if __name__ == '__main__':
94 | # All of the model parameters and training parameters are sent as arguments when the script
95 | # is executed. Here we set up an argument parser to easily access the parameters.
96 |
97 | parser = argparse.ArgumentParser()
98 |
99 | # Training Parameters
100 | parser.add_argument('--batch-size', type=int, default=512, metavar='N',
101 | help='input batch size for training (default: 512)')
102 | parser.add_argument('--epochs', type=int, default=10, metavar='N',
103 | help='number of epochs to train (default: 10)')
104 | parser.add_argument('--seed', type=int, default=1, metavar='S',
105 | help='random seed (default: 1)')
106 |
107 | # Model Parameters
108 | parser.add_argument('--embedding_dim', type=int, default=32, metavar='N',
109 | help='size of the word embeddings (default: 32)')
110 | parser.add_argument('--hidden_dim', type=int, default=100, metavar='N',
111 | help='size of the hidden dimension (default: 100)')
112 | parser.add_argument('--vocab_size', type=int, default=5000, metavar='N',
113 | help='size of the vocabulary (default: 5000)')
114 |
115 | # SageMaker Parameters
116 | parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS']))
117 | parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])
118 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
119 | parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
120 | parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS'])
121 |
122 | args = parser.parse_args()
123 |
124 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
125 | print("Using device {}.".format(device))
126 |
127 | torch.manual_seed(args.seed)
128 |
129 | # Load the training data.
130 | train_loader = _get_train_data_loader(args.batch_size, args.data_dir)
131 |
132 | # Build the model.
133 | model = LSTMClassifier(args.embedding_dim, args.hidden_dim, args.vocab_size).to(device)
134 |
135 | with open(os.path.join(args.data_dir, "word_dict.pkl"), "rb") as f:
136 | model.word_dict = pickle.load(f)
137 |
138 | print("Model loaded with embedding_dim {}, hidden_dim {}, vocab_size {}.".format(
139 | args.embedding_dim, args.hidden_dim, args.vocab_size
140 | ))
141 |
142 | # Train the model.
143 | optimizer = optim.Adam(model.parameters())
144 | loss_fn = torch.nn.BCELoss()
145 |
146 | train(model, train_loader, args.epochs, optimizer, loss_fn, device)
147 |
148 | # Save the parameters used to construct the model
149 | model_info_path = os.path.join(args.model_dir, 'model_info.pth')
150 | with open(model_info_path, 'wb') as f:
151 | model_info = {
152 | 'embedding_dim': args.embedding_dim,
153 | 'hidden_dim': args.hidden_dim,
154 | 'vocab_size': args.vocab_size,
155 | }
156 | torch.save(model_info, f)
157 |
158 | # Save the word_dict
159 | word_dict_path = os.path.join(args.model_dir, 'word_dict.pkl')
160 | with open(word_dict_path, 'wb') as f:
161 | pickle.dump(model.word_dict, f)
162 |
163 | # Save the model parameters
164 | model_path = os.path.join(args.model_dir, 'model.pth')
165 | with open(model_path, 'wb') as f:
166 | torch.save(model.cpu().state_dict(), f)
167 |
--------------------------------------------------------------------------------
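For context, `train.py` above is written as a SageMaker training script: it reads its hyperparameters from the argparse flags and its data locations from the `SM_*` environment variables. A rough launch sketch with the SageMaker Python SDK is shown below; the framework version, instance type, bucket path, and role handling are placeholders, and v2-style estimator arguments are assumed.

```python
# Hypothetical launch sketch for the training script above (SageMaker Python SDK, v2-style
# arguments assumed). Role, versions, instance type, and the S3 input path are placeholders.
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()          # works inside a SageMaker notebook instance

estimator = PyTorch(
    entry_point="train.py",
    source_dir="train",                        # folder containing train.py, model.py, requirements.txt
    role=role,
    framework_version="1.8",                   # placeholder framework / Python versions
    py_version="py3",
    instance_count=1,
    instance_type="ml.p2.xlarge",
    hyperparameters={"epochs": 10, "hidden_dim": 100},  # forwarded to the argparse flags above
)

estimator.fit({"training": "s3://<bucket>/sentiment/train"})  # placeholder S3 path (SM_CHANNEL_TRAINING)
```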
/Natural_Language_processing/Sentiment-analysis/website/index.html:
--------------------------------------------------------------------------------
1 | <!-- Sentiment Analysis Web App: simple page where the user submits a movie review -->
2 | <!-- and gets back whether it is positive or negative (markup abridged). -->
3 | <!-- Heading: "Is your review positive, or negative?" -->
4 | <!-- Prompt: "Enter your review below and click submit to find out..." -->
--------------------------------------------------------------------------------
/Natural_Language_processing/plagiarism-detector-web-app/README.md:
--------------------------------------------------------------------------------
1 | # Plagiarism Detector Web App
2 | This repository contains code and associated files for deploying a plagiarism detector using AWS SageMaker.
3 |
4 | ## Project Overview
5 |
6 | In this project, I build a plagiarism detector that examines a text file and performs binary classification, labeling that file as either *plagiarized* or *not*, depending on how similar the text file is to a provided source text. Detecting plagiarism is an active area of research; the task is non-trivial and the differences between paraphrased answers and original work are often not so obvious.
7 |
8 | This project is broken down into three main notebooks:
9 |
10 | **Notebook 1: Data Exploration**
11 | * Load in the corpus of plagiarism text data.
12 | * Explore the existing data features and the data distribution.
13 | * This first notebook is **not** required in your final project submission.
14 |
15 | **Notebook 2: Feature Engineering**
16 |
17 | * Clean and pre-process the text data.
18 | * Define features for comparing the similarity of an answer text and a source text, and extract similarity features.
19 | * Select "good" features, by analyzing the correlations between different features.
20 | * Create train/test `.csv` files that hold the relevant features and class labels for train/test data points.
21 |
22 | **Notebook 3: Train and Deploy Your Model in SageMaker**
23 |
24 | * Upload your train/test feature data to S3.
25 | * Define a binary classification model and a training script.
26 | * Train your model and deploy it using SageMaker.
27 | * Evaluate your deployed classifier.
28 |
29 | ---
30 |
--------------------------------------------------------------------------------
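The similarity features described under Notebook 2 are built around n-gram containment (plus a longest-common-subsequence score). As a rough illustration, containment can be computed as the fraction of an answer's n-grams that also occur in the source text; the sketch below uses `CountVectorizer` and is independent of the exact helper signatures used in the notebooks.

```python
# Rough illustration of an n-gram containment feature: the fraction of the answer's n-grams
# that also appear in the source text (see the project notebooks for the exact implementation).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def containment(answer_text, source_text, n=1):
    # Count n-grams of both texts over a shared vocabulary.
    counts = CountVectorizer(analyzer="word", ngram_range=(n, n)) \
        .fit_transform([answer_text, source_text]).toarray()
    answer_counts, source_counts = counts[0], counts[1]
    # Intersection of n-gram counts, normalized by the number of n-grams in the answer.
    return np.minimum(answer_counts, source_counts).sum() / answer_counts.sum()

print(containment("this is an answer text", "this is the source text", n=1))  # 0.6
```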
/Natural_Language_processing/plagiarism-detector-web-app/helpers.py:
--------------------------------------------------------------------------------
1 | import re
2 | import pandas as pd
3 | import operator
4 |
5 | # Adds a 'Datatype' column to the dataframe indicating whether each record is an original
6 | # wiki answer (0), training data (1), or test data (2); uses stratified random sampling
7 | # (with a seed) to sample by task & plagiarism amount.
8 | # Helper function used to label the datatype as training (1) or test (2)
9 | def create_datatype(df, train_value, test_value, datatype_var, compare_dfcolumn, operator_of_compare, value_of_compare,
10 | sampling_number, sampling_seed):
11 | # Subsets dataframe by condition relating to statement built from:
12 | # 'compare_dfcolumn' 'operator_of_compare' 'value_of_compare'
13 | df_subset = df[operator_of_compare(df[compare_dfcolumn], value_of_compare)]
14 | df_subset = df_subset.drop(columns = [datatype_var])
15 |
16 | # Prints counts by task and compare_dfcolumn for subset df
17 | #print("\nCounts by Task & " + compare_dfcolumn + ":\n", df_subset.groupby(['Task', compare_dfcolumn]).size().reset_index(name="Counts") )
18 |
19 | # Sets all datatype to value for training for df_subset
20 | df_subset.loc[:, datatype_var] = train_value
21 |
22 | # Performs stratified random sample of subset dataframe to create new df with subset values
23 | df_sampled = df_subset.groupby(['Task', compare_dfcolumn], group_keys=False).apply(lambda x: x.sample(min(len(x), sampling_number), random_state = sampling_seed))
24 | df_sampled = df_sampled.drop(columns = [datatype_var])
25 | # Sets all datatype to value for test_value for df_sampled
26 | df_sampled.loc[:, datatype_var] = test_value
27 |
28 | # Prints counts by compare_dfcolumn for selected sample
29 | #print("\nCounts by "+ compare_dfcolumn + ":\n", df_sampled.groupby([compare_dfcolumn]).size().reset_index(name="Counts") )
30 | #print("\nSampled DF:\n",df_sampled)
31 |
32 | # Labels all datatype_var column as train_value which will be overwritten to
33 | # test_value in next for loop for all test cases chosen with stratified sample
34 | for index in df_sampled.index:
35 |         # Labels all datatype_var columns with test_value for stratified test sample
36 | df_subset.loc[index, datatype_var] = test_value
37 |
38 | #print("\nSubset DF:\n",df_subset)
39 | # Adds test_value and train_value for all relevant data in main dataframe
40 | for index in df_subset.index:
41 | # Labels all datatype_var columns in df with train_value/test_value based upon
42 | # stratified test sample and subset of df
43 | df.loc[index, datatype_var] = df_subset.loc[index, datatype_var]
44 |
45 | # returns nothing because dataframe df already altered
46 |
47 | def train_test_dataframe(clean_df, random_seed=100):
48 |
49 | new_df = clean_df.copy()
50 |
51 | # Initialize datatype as 0 initially for all records - after function 0 will remain only for original wiki answers
52 | new_df.loc[:,'Datatype'] = 0
53 |
54 | # Creates test & training datatypes for plagiarized answers (1,2,3)
55 | create_datatype(new_df, 1, 2, 'Datatype', 'Category', operator.gt, 0, 1, random_seed)
56 |
57 | # Creates test & training datatypes for NON-plagiarized answers (0)
58 | create_datatype(new_df, 1, 2, 'Datatype', 'Category', operator.eq, 0, 2, random_seed)
59 |
60 |     # dictionary mapping the numerical Datatype codes to categorical labels
61 |     mapping = {0:'orig', 1:'train', 2:'test'}
62 | 
63 |     # traverse the dataframe, replacing each numerical code with its categorical label
64 |     new_df.Datatype = [mapping[item] for item in new_df.Datatype]
65 |
66 | return new_df
67 |
68 |
69 | # helper function for pre-processing text given a file
70 | def process_file(file):
71 | # put text in all lower case letters
72 | all_text = file.read().lower()
73 |
74 | # remove all non-alphanumeric chars
75 | all_text = re.sub(r"[^a-zA-Z0-9]", " ", all_text)
76 | # remove newlines/tabs, etc. so it's easier to match phrases, later
77 | all_text = re.sub(r"\t", " ", all_text)
78 | all_text = re.sub(r"\n", " ", all_text)
79 |     # collapse any runs of multiple spaces into a single space
80 |     all_text = re.sub(r" +", " ", all_text)
81 |
82 | return all_text
83 |
84 |
85 | def create_text_column(df, file_directory='data/'):
86 | '''Reads in the files, listed in a df and returns that df with an additional column, `Text`.
87 | :param df: A dataframe of file information including a column for `File`
88 | :param file_directory: the main directory where files are stored
89 | :return: A dataframe with processed text '''
90 |
91 | # create copy to modify
92 | text_df = df.copy()
93 |
94 | # store processed text
95 | text = []
96 |
97 | # for each file (row) in the df, read in the file
98 | for row_i in df.index:
99 | filename = df.iloc[row_i]['File']
100 | #print(filename)
101 | file_path = file_directory + filename
102 | with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
103 |
104 | # standardize text using helper function
105 | file_text = process_file(file)
106 | # append processed text to list
107 | text.append(file_text)
108 |
109 | # add column to the copied dataframe
110 | text_df['Text'] = text
111 |
112 | return text_df
113 |
--------------------------------------------------------------------------------
/Natural_Language_processing/plagiarism-detector-web-app/palagrism_data/test.csv:
--------------------------------------------------------------------------------
1 | 1,1.0,0.9222797927461139,0.8207547169811321
2 | 1,0.7653061224489796,0.5896551724137931,0.6217105263157895
3 | 1,0.8844444444444445,0.18099547511312217,0.597457627118644
4 | 1,0.6190476190476191,0.043243243243243246,0.42783505154639173
5 | 1,0.92,0.39436619718309857,0.775
6 | 1,0.9926739926739927,0.9739776951672863,0.9930555555555556
7 | 0,0.4126984126984127,0.0,0.3466666666666667
8 | 0,0.4626865671641791,0.0,0.18932038834951456
9 | 0,0.581151832460733,0.0,0.24742268041237114
10 | 0,0.5842105263157895,0.0,0.29441624365482233
11 | 0,0.5663716814159292,0.0,0.25833333333333336
12 | 0,0.48148148148148145,0.022900763358778626,0.2789115646258503
13 | 1,0.6197916666666666,0.026595744680851064,0.3415841584158416
14 | 1,0.9217391304347826,0.6548672566371682,0.9294117647058824
15 | 1,1.0,0.9224806201550387,1.0
16 | 1,0.8615384615384616,0.06282722513089005,0.5047169811320755
17 | 1,0.6261682242990654,0.22397476340694006,0.5585585585585585
18 | 1,1.0,0.9688715953307393,0.9966996699669967
19 | 0,0.3838383838383838,0.010309278350515464,0.178743961352657
20 | 1,1.0,0.9446494464944649,0.8546712802768166
21 | 0,0.6139240506329114,0.0,0.2983425414364641
22 | 1,0.9727626459143969,0.8300395256916996,0.9270833333333334
23 | 1,0.9628099173553719,0.6890756302521008,0.9098039215686274
24 | 0,0.4152542372881356,0.0,0.1774193548387097
25 | 0,0.5321888412017167,0.017467248908296942,0.24583333333333332
26 |
--------------------------------------------------------------------------------
/Natural_Language_processing/plagiarism-detector-web-app/palagrism_data/train.csv:
--------------------------------------------------------------------------------
1 | 0,0.39814814814814814,0.0,0.1917808219178082
2 | 1,0.8693693693693694,0.44954128440366975,0.8464912280701754
3 | 1,0.5935828877005348,0.08196721311475409,0.3160621761658031
4 | 0,0.5445026178010471,0.0,0.24257425742574257
5 | 0,0.32950191570881227,0.0,0.16117216117216118
6 | 0,0.5903083700440529,0.0,0.30165289256198347
7 | 1,0.7597765363128491,0.24571428571428572,0.484304932735426
8 | 0,0.5161290322580645,0.0,0.2708333333333333
9 | 0,0.44086021505376344,0.0,0.22395833333333334
10 | 1,0.9794520547945206,0.7887323943661971,0.9
11 | 1,0.9513888888888888,0.5214285714285715,0.8940397350993378
12 | 1,0.9764705882352941,0.5783132530120482,0.8232044198895028
13 | 1,0.8117647058823529,0.28313253012048195,0.45977011494252873
14 | 0,0.4411764705882353,0.0,0.3055555555555556
15 | 0,0.4888888888888889,0.0,0.2826086956521739
16 | 1,0.813953488372093,0.6341463414634146,0.7888888888888889
17 | 0,0.6111111111111112,0.0,0.3246753246753247
18 | 1,1.0,0.9659090909090909,1.0
19 | 1,0.634020618556701,0.005263157894736842,0.36893203883495146
20 | 1,0.5829383886255924,0.08695652173913043,0.4166666666666667
21 | 1,0.6379310344827587,0.30701754385964913,0.4898785425101215
22 | 0,0.42038216560509556,0.0,0.21875
23 | 1,0.6877637130801688,0.07725321888412018,0.5163934426229508
24 | 1,0.6766467065868264,0.11042944785276074,0.4725274725274725
25 | 1,0.7692307692307693,0.45084745762711864,0.6064516129032258
26 | 1,0.7122641509433962,0.08653846153846154,0.536697247706422
27 | 1,0.6299212598425197,0.28,0.39436619718309857
28 | 1,0.7157360406091371,0.0051813471502590676,0.3431372549019608
29 | 0,0.3320610687022901,0.0,0.15302491103202848
30 | 1,0.7172131147540983,0.07916666666666666,0.4559386973180077
31 | 1,0.8782608695652174,0.47345132743362833,0.82
32 | 1,0.5298013245033113,0.31543624161073824,0.45
33 | 0,0.5721153846153846,0.0,0.22935779816513763
34 | 0,0.319672131147541,0.0,0.16535433070866143
35 | 0,0.53,0.0,0.26046511627906976
36 | 1,0.78,0.6071428571428571,0.6699029126213593
37 | 0,0.6526946107784432,0.0,0.3551912568306011
38 | 0,0.4439461883408072,0.0,0.23376623376623376
39 | 1,0.6650246305418719,0.18090452261306533,0.3492647058823529
40 | 1,0.7281553398058253,0.034653465346534656,0.3476190476190476
41 | 1,0.7620481927710844,0.2896341463414634,0.5677233429394812
42 | 1,0.9470198675496688,0.2857142857142857,0.774390243902439
43 | 1,0.3684210526315789,0.0,0.19298245614035087
44 | 0,0.5328947368421053,0.0,0.21818181818181817
45 | 0,0.6184971098265896,0.005917159763313609,0.26666666666666666
46 | 0,0.5103092783505154,0.010526315789473684,0.22110552763819097
47 | 0,0.5798319327731093,0.0,0.2289156626506024
48 | 0,0.40703517587939697,0.0,0.1722488038277512
49 | 0,0.5154639175257731,0.0,0.23684210526315788
50 | 1,0.5845410628019324,0.04926108374384237,0.29493087557603687
51 | 1,0.6171875,0.1693548387096774,0.5037593984962406
52 | 1,1.0,0.84251968503937,0.9117647058823529
53 | 1,0.9916666666666667,0.8879310344827587,0.9923076923076923
54 | 0,0.550561797752809,0.0,0.2833333333333333
55 | 0,0.41935483870967744,0.0,0.2616822429906542
56 | 1,0.8351648351648352,0.034482758620689655,0.6470588235294118
57 | 1,0.9270833333333334,0.29347826086956524,0.85
58 | 0,0.4928909952606635,0.0,0.2350230414746544
59 | 1,0.7087378640776699,0.3217821782178218,0.6619718309859155
60 | 1,0.8633879781420765,0.30726256983240224,0.7911111111111111
61 | 1,0.9606060606060606,0.8650306748466258,0.9298245614035088
62 | 0,0.4380165289256198,0.0,0.2230769230769231
63 | 1,0.7336683417085427,0.07179487179487179,0.4900990099009901
64 | 1,0.5138888888888888,0.0,0.25203252032520324
65 | 0,0.4861111111111111,0.0,0.22767857142857142
66 | 1,0.8451882845188284,0.3021276595744681,0.6437246963562753
67 | 1,0.485,0.0,0.24271844660194175
68 | 1,0.9506726457399103,0.7808219178082192,0.8395061728395061
69 | 1,0.551219512195122,0.23383084577114427,0.2830188679245283
70 | 0,0.3612565445026178,0.0,0.16176470588235295
71 |
--------------------------------------------------------------------------------
/Natural_Language_processing/plagiarism-detector-web-app/problem_unittests.py:
--------------------------------------------------------------------------------
1 | from unittest.mock import MagicMock, patch
2 | import sklearn.naive_bayes
3 | import numpy as np
4 | import pandas as pd
5 | import re
6 |
7 | # test csv file
8 | TEST_CSV = 'data/test_info.csv'
9 |
10 | class AssertTest(object):
11 | '''Defines general test behavior.'''
12 | def __init__(self, params):
13 | self.assert_param_message = '\n'.join([str(k) + ': ' + str(v) + '' for k, v in params.items()])
14 |
15 | def test(self, assert_condition, assert_message):
16 | assert assert_condition, assert_message + '\n\nUnit Test Function Parameters\n' + self.assert_param_message
17 |
18 | def _print_success_message():
19 | print('Tests Passed!')
20 |
21 | # test clean_dataframe
22 | def test_numerical_df(numerical_dataframe):
23 |
24 | # test result
25 | transformed_df = numerical_dataframe(TEST_CSV)
26 |
27 | # Check type is a DataFrame
28 | assert isinstance(transformed_df, pd.DataFrame), 'Returned type is {}.'.format(type(transformed_df))
29 |
30 | # check columns
31 | column_names = list(transformed_df)
32 | assert 'File' in column_names, 'No File column, found.'
33 | assert 'Task' in column_names, 'No Task column, found.'
34 | assert 'Category' in column_names, 'No Category column, found.'
35 | assert 'Class' in column_names, 'No Class column, found.'
36 |
37 | # check conversion values
38 | assert transformed_df.loc[0, 'Category'] == 1, '`heavy` plagiarism mapping test, failed.'
39 | assert transformed_df.loc[2, 'Category'] == 0, '`non` plagiarism mapping test, failed.'
40 | assert transformed_df.loc[30, 'Category'] == 3, '`cut` plagiarism mapping test, failed.'
41 | assert transformed_df.loc[5, 'Category'] == 2, '`light` plagiarism mapping test, failed.'
42 | assert transformed_df.loc[37, 'Category'] == -1, 'original file mapping test, failed; should have a Category = -1.'
43 | assert transformed_df.loc[41, 'Category'] == -1, 'original file mapping test, failed; should have a Category = -1.'
44 |
45 | _print_success_message()
46 |
47 |
48 | def test_containment(complete_df, containment_fn):
49 |
50 | # check basic format and value
51 | # for n = 1 and just the fifth file
52 | test_val = containment_fn(complete_df, 1, 'g0pA_taske.txt')
53 |
54 | assert isinstance(test_val, float), 'Returned type is {}.'.format(type(test_val))
55 | assert test_val<=1.0, 'It appears that the value is not normalized; expected a value <=1, got: '+str(test_val)
56 |
57 | # known vals for first few files
58 | filenames = ['g0pA_taska.txt', 'g0pA_taskb.txt', 'g0pA_taskc.txt', 'g0pA_taskd.txt']
59 | ngram_1 = [0.39814814814814814, 1.0, 0.86936936936936937, 0.5935828877005348]
60 | ngram_3 = [0.0093457943925233638, 0.96410256410256412, 0.61363636363636365, 0.15675675675675677]
61 |
62 | # results for comparison
63 | results_1gram = []
64 | results_3gram = []
65 |
66 | for i in range(4):
67 | val_1 = containment_fn(complete_df, 1, filenames[i])
68 | val_3 = containment_fn(complete_df, 3, filenames[i])
69 | results_1gram.append(val_1)
70 | results_3gram.append(val_3)
71 |
72 | # check correct results
73 | assert all(np.isclose(results_1gram, ngram_1, rtol=1e-04)), \
74 | 'n=1 calculations are incorrect. Double check the intersection calculation.'
75 | # check correct results
76 | assert all(np.isclose(results_3gram, ngram_3, rtol=1e-04)), \
77 | 'n=3 calculations are incorrect.'
78 |
79 | _print_success_message()
80 |
81 | def test_lcs(df, lcs_word):
82 |
83 | test_index = 10 # file 10
84 |
85 | # get answer file text
86 | answer_text = df.loc[test_index, 'Text']
87 |
88 | # get text for orig file
89 | # find the associated task type (one character, a-e)
90 | task = df.loc[test_index, 'Task']
91 | # we know that source texts have Class = -1
92 | orig_rows = df[(df['Class'] == -1)]
93 | orig_row = orig_rows[(orig_rows['Task'] == task)]
94 | source_text = orig_row['Text'].values[0]
95 |
96 | # calculate LCS
97 | test_val = lcs_word(answer_text, source_text)
98 |
99 | # check type
100 | assert isinstance(test_val, float), 'Returned type is {}.'.format(type(test_val))
101 | assert test_val<=1.0, 'It appears that the value is not normalized; expected a value <=1, got: '+str(test_val)
102 |
103 | # known vals for first few files
104 | lcs_vals = [0.1917808219178082, 0.8207547169811321, 0.8464912280701754, 0.3160621761658031, 0.24257425742574257]
105 |
106 | # results for comparison
107 | results = []
108 |
109 | for i in range(5):
110 | # get answer and source text
111 | answer_text = df.loc[i, 'Text']
112 | task = df.loc[i, 'Task']
113 | # we know that source texts have Class = -1
114 | orig_rows = df[(df['Class'] == -1)]
115 | orig_row = orig_rows[(orig_rows['Task'] == task)]
116 | source_text = orig_row['Text'].values[0]
117 | # calc lcs
118 | val = lcs_word(answer_text, source_text)
119 | results.append(val)
120 |
121 | # check correct results
122 | assert all(np.isclose(results, lcs_vals, rtol=1e-05)), 'LCS calculations are incorrect.'
123 |
124 | _print_success_message()
125 |
126 | def test_data_split(train_x, train_y, test_x, test_y):
127 |
128 | # check types
129 | assert isinstance(train_x, np.ndarray),\
130 | 'train_x is not an array, instead got type: {}'.format(type(train_x))
131 | assert isinstance(train_y, np.ndarray),\
132 | 'train_y is not an array, instead got type: {}'.format(type(train_y))
133 | assert isinstance(test_x, np.ndarray),\
134 | 'test_x is not an array, instead got type: {}'.format(type(test_x))
135 | assert isinstance(test_y, np.ndarray),\
136 | 'test_y is not an array, instead got type: {}'.format(type(test_y))
137 |
138 | # should hold all 95 submission files
139 | assert len(train_x) + len(test_x) == 95, \
140 | 'Unexpected amount of train + test data. Expecting 95 answer text files, got ' +str(len(train_x) + len(test_x))
141 | assert len(test_x) > 1, \
142 | 'Unexpected amount of test data. There should be multiple test files.'
143 |
144 | # check shape
145 | assert train_x.shape[1]==2, \
146 | 'train_x should have as many columns as selected features, got: {}'.format(train_x.shape[1])
147 | assert len(train_y.shape)==1, \
148 | 'train_y should be a 1D array, got shape: {}'.format(train_y.shape)
149 |
150 | _print_success_message()
151 |
152 |
153 |
--------------------------------------------------------------------------------
/Natural_Language_processing/plagiarism-detector-web-app/source_pytorch/model.py:
--------------------------------------------------------------------------------
1 | # torch imports
2 | import torch.nn.functional as F
3 | import torch.nn as nn
4 |
5 |
6 | ## Binary classifier, completed below with a minimal, assumed architecture
7 | class BinaryClassifier(nn.Module):
8 | """
9 | Define a neural network that performs binary classification.
10 | The network should accept your number of features as input, and produce
11 | a single sigmoid value, that can be rounded to a label: 0 or 1, as output.
12 |
13 | Notes on training:
14 | To train a binary classifier in PyTorch, use BCELoss.
15 | BCELoss is binary cross entropy loss, documentation: https://pytorch.org/docs/stable/nn.html#torch.nn.BCELoss
16 | """
17 |
18 |     ## Init function; the input params are required (for the loading code in train.py to work)
19 | def __init__(self, input_features, hidden_dim, output_dim):
20 | """
21 | Initialize the model by setting up linear layers.
22 | Use the input parameters to help define the layers of your model.
23 | :param input_features: the number of input features in your training/test data
24 | :param hidden_dim: helps define the number of nodes in the hidden layer(s)
25 | :param output_dim: the number of outputs you want to produce
26 | """
27 |         super(BinaryClassifier, self).__init__()
28 | 
29 |         # Minimal completion (assumed architecture): one hidden linear layer
30 |         # with ReLU, then a linear layer to a single sigmoid output.
31 |         self.fc1 = nn.Linear(input_features, hidden_dim)
32 |         self.fc2 = nn.Linear(hidden_dim, output_dim)
33 |         self.sig = nn.Sigmoid()
34 | 
35 |     def forward(self, x):
36 |         """
37 |         Perform a forward pass of our model on input features, x.
38 |         :param x: A batch of input features of size (batch_size, input_features)
39 |         :return: A single, sigmoid-activated value as output
40 |         """
41 |         # Feedforward behavior: hidden layer + ReLU, then a single sigmoid output per sample
42 |         x = self.sig(self.fc2(F.relu(self.fc1(x)))).squeeze(-1)
43 |         return x
44 |
45 |
--------------------------------------------------------------------------------
/Natural_Language_processing/plagiarism-detector-web-app/source_pytorch/predict.py:
--------------------------------------------------------------------------------
1 | # import libraries
2 | import os
3 | import numpy as np
4 | import torch
5 | from six import BytesIO
6 |
7 | # import model from model.py, by name
8 | from model import BinaryClassifier
9 |
10 | # default content type is numpy array
11 | NP_CONTENT_TYPE = 'application/x-npy'
12 |
13 |
14 | # Provided model load function
15 | def model_fn(model_dir):
16 | """Load the PyTorch model from the `model_dir` directory."""
17 | print("Loading model.")
18 |
19 | # First, load the parameters used to create the model.
20 | model_info = {}
21 | model_info_path = os.path.join(model_dir, 'model_info.pth')
22 | with open(model_info_path, 'rb') as f:
23 | model_info = torch.load(f)
24 |
25 | print("model_info: {}".format(model_info))
26 |
27 | # Determine the device and construct the model.
28 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
29 | model = BinaryClassifier(model_info['input_features'], model_info['hidden_dim'], model_info['output_dim'])
30 |
31 |     # Load the stored model parameters.
32 | model_path = os.path.join(model_dir, 'model.pth')
33 | with open(model_path, 'rb') as f:
34 | model.load_state_dict(torch.load(f))
35 |
36 | # Prep for testing
37 | model.to(device).eval()
38 |
39 | print("Done loading model.")
40 | return model
41 |
42 |
43 | # Provided input data loading
44 | def input_fn(serialized_input_data, content_type):
45 | print('Deserializing the input data.')
46 | if content_type == NP_CONTENT_TYPE:
47 | stream = BytesIO(serialized_input_data)
48 | return np.load(stream)
49 | raise Exception('Requested unsupported ContentType in content_type: ' + content_type)
50 |
51 | # Provided output data handling
52 | def output_fn(prediction_output, accept):
53 | print('Serializing the generated output.')
54 | if accept == NP_CONTENT_TYPE:
55 | stream = BytesIO()
56 | np.save(stream, prediction_output)
57 | return stream.getvalue(), accept
58 | raise Exception('Requested unsupported ContentType in Accept: ' + accept)
59 |
60 |
61 | # Provided predict function
62 | def predict_fn(input_data, model):
63 | print('Predicting class labels for the input data...')
64 |
65 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
66 |
67 | # Process input_data so that it is ready to be sent to our model.
68 | data = torch.from_numpy(input_data.astype('float32'))
69 | data = data.to(device)
70 |
71 | # Put the model into evaluation mode
72 | model.eval()
73 |
74 | # Compute the result of applying the model to the input data
75 | # The variable `out_label` should be a rounded value, either 1 or 0
76 | out = model(data)
77 | out_np = out.cpu().detach().numpy()
78 | out_label = out_np.round()
79 |
80 | return out_label
--------------------------------------------------------------------------------
/Natural_Language_processing/plagiarism-detector-web-app/source_pytorch/train.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import json
3 | import os
4 | import pandas as pd
5 | import torch
6 | import torch.optim as optim
7 | import torch.utils.data
8 |
9 | # imports the model in model.py by name
10 | from model import BinaryClassifier
11 |
12 | def model_fn(model_dir):
13 | """Load the PyTorch model from the `model_dir` directory."""
14 | print("Loading model.")
15 |
16 | # First, load the parameters used to create the model.
17 | model_info = {}
18 | model_info_path = os.path.join(model_dir, 'model_info.pth')
19 | with open(model_info_path, 'rb') as f:
20 | model_info = torch.load(f)
21 |
22 | print("model_info: {}".format(model_info))
23 |
24 | # Determine the device and construct the model.
25 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
26 | model = BinaryClassifier(model_info['input_features'], model_info['hidden_dim'], model_info['output_dim'])
27 |
28 | # Load the stored model parameters.
29 | model_path = os.path.join(model_dir, 'model.pth')
30 | with open(model_path, 'rb') as f:
31 | model.load_state_dict(torch.load(f))
32 |
33 | # set to eval mode, could use no_grad
34 | model.to(device).eval()
35 |
36 | print("Done loading model.")
37 | return model
38 |
39 | # Gets training data in batches from the train.csv file
40 | def _get_train_data_loader(batch_size, training_dir):
41 | print("Get train data loader.")
42 |
43 | train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None)
44 |
45 | train_y = torch.from_numpy(train_data[[0]].values).float().squeeze()
46 | train_x = torch.from_numpy(train_data.drop([0], axis=1).values).float()
47 |
48 | train_ds = torch.utils.data.TensorDataset(train_x, train_y)
49 |
50 | return torch.utils.data.DataLoader(train_ds, batch_size=batch_size)
51 |
52 |
53 | # Provided training function
54 | def train(model, train_loader, epochs, criterion, optimizer, device):
55 | """
56 | This is the training method that is called by the PyTorch training script. The parameters
57 | passed are as follows:
58 | model - The PyTorch model that we wish to train.
59 | train_loader - The PyTorch DataLoader that should be used during training.
60 | epochs - The total number of epochs to train for.
61 | criterion - The loss function used for training.
62 | optimizer - The optimizer to use during training.
63 | device - Where the model and data should be loaded (gpu or cpu).
64 | """
65 |
66 | # training loop is provided
67 | for epoch in range(1, epochs + 1):
68 | model.train() # Make sure that the model is in training mode.
69 |
70 | total_loss = 0
71 |
72 | for batch in train_loader:
73 | # get data
74 | batch_x, batch_y = batch
75 |
76 | batch_x = batch_x.to(device)
77 | batch_y = batch_y.to(device)
78 |
79 | optimizer.zero_grad()
80 |
81 | # get predictions from model
82 | y_pred = model(batch_x)
83 |
84 | # perform backprop
85 | loss = criterion(y_pred, batch_y)
86 | loss.backward()
87 | optimizer.step()
88 |
89 | total_loss += loss.data.item()
90 |
91 | print("Epoch: {}, Loss: {}".format(epoch, total_loss / len(train_loader)))
92 |
93 |
94 | ## Main code, completed below with assumed model parameters and training choices
95 | if __name__ == '__main__':
96 |
97 | # All of the model parameters and training parameters are sent as arguments
98 | # when this script is executed, during a training job
99 |
100 | # Here we set up an argument parser to easily access the parameters
101 | parser = argparse.ArgumentParser()
102 |
103 | # SageMaker parameters, like the directories for training data and saving models; set automatically
104 | # Do not need to change
105 | parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
106 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
107 | parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
108 |
109 | # Training Parameters, given
110 | parser.add_argument('--batch-size', type=int, default=10, metavar='N',
111 | help='input batch size for training (default: 10)')
112 | parser.add_argument('--epochs', type=int, default=10, metavar='N',
113 | help='number of epochs to train (default: 10)')
114 | parser.add_argument('--seed', type=int, default=1, metavar='S',
115 | help='random seed (default: 1)')
116 |
117 |     # Model Parameters (assumed defaults; input_features=3 matches the feature columns in palagrism_data/train.csv)
118 |     parser.add_argument('--input_features', type=int, default=3, metavar='N', help='number of input features (default: 3)')
119 |     parser.add_argument('--hidden_dim', type=int, default=10, metavar='N', help='hidden layer size (default: 10)')
120 |     parser.add_argument('--output_dim', type=int, default=1, metavar='N', help='number of outputs (default: 1)')
121 | # args holds all passed-in arguments
122 | args = parser.parse_args()
123 |
124 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
125 | print("Using device {}.".format(device))
126 |
127 | torch.manual_seed(args.seed)
128 |
129 | # Load the training data.
130 | train_loader = _get_train_data_loader(args.batch_size, args.data_dir)
131 |
132 |
133 |     ## --- Your code here --- ##
134 | 
135 |     # Build the model by passing in the input params from the parser
136 |     # (args.input_features, args.hidden_dim, args.output_dim) and move it
137 |     # to the GPU, if appropriate.
138 |     model = BinaryClassifier(args.input_features, args.hidden_dim, args.output_dim).to(device)
139 | 
140 |     # Define an optimizer and loss function for training (Adam + BCELoss is one reasonable choice here)
141 |     optimizer = optim.Adam(model.parameters())
142 |     criterion = torch.nn.BCELoss()
143 |
144 | # Trains the model (given line of code, which calls the above training function)
145 | train(model, train_loader, args.epochs, criterion, optimizer, device)
146 |
147 |     # Save the parameters used to construct the model.
148 |     # Keep the keys of this dictionary as they are.
149 |     model_info_path = os.path.join(args.model_dir, 'model_info.pth')
150 |     with open(model_info_path, 'wb') as f:
151 |         model_info = {
152 |             'input_features': args.input_features,
153 |             'hidden_dim': args.hidden_dim,
154 |             'output_dim': args.output_dim,
155 |         }
156 | torch.save(model_info, f)
157 |
158 | ## --- End of your code --- ##
159 |
160 |
161 | # Save the model parameters
162 | model_path = os.path.join(args.model_dir, 'model.pth')
163 | with open(model_path, 'wb') as f:
164 | torch.save(model.cpu().state_dict(), f)
165 |
--------------------------------------------------------------------------------
/Natural_Language_processing/plagiarism-detector-web-app/source_sklearn/train.py:
--------------------------------------------------------------------------------
1 | from __future__ import print_function
2 |
3 | import argparse
4 | import os
5 | import pandas as pd
6 |
7 | # sklearn.externals.joblib was deprecated in scikit-learn 0.21 and removed in 0.23,
8 | # so the joblib package is imported directly instead.
9 | import joblib
10 | 
11 | # from sklearn.externals import joblib  # legacy import for very old scikit-learn versions
12 |
13 | ## TODO: Import any additional libraries you need to define a model
14 | from sklearn.ensemble import RandomForestClassifier
15 |
16 | # Provided model load function
17 | def model_fn(model_dir):
18 | """Load model from the model_dir. This is the same model that is saved
19 | in the main if statement.
20 | """
21 | print("Loading model.")
22 |
23 | # load using joblib
24 | model = joblib.load(os.path.join(model_dir, "model.joblib"))
25 | print("Done loading model.")
26 |
27 | return model
28 |
29 |
30 | ## TODO: Complete the main code
31 | if __name__ == '__main__':
32 |
33 | # All of the model parameters and training parameters are sent as arguments
34 | # when this script is executed, during a training job
35 |
36 | # Here we set up an argument parser to easily access the parameters
37 | parser = argparse.ArgumentParser()
38 |
39 | # SageMaker parameters, like the directories for training data and saving models; set automatically
40 | # Do not need to change
41 | parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
42 | parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
43 | parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
44 | #parser.add_argument('--max_depth', type=int, default= 10)
45 |
46 | ## TODO: Add any additional arguments that you will need to pass into your model
47 |
48 | # args holds all passed-in arguments
49 | args = parser.parse_args()
50 |
51 | # Read in csv training file
52 | training_dir = args.data_dir
53 | train_data = pd.read_csv(os.path.join(training_dir, "train.csv"), header=None, names=None)
54 |
55 | # Labels are in the first column
56 | train_y = train_data.iloc[:,0]
57 | train_x = train_data.iloc[:,1:]
58 |
59 |
60 | ## --- Your code here --- ##
61 |
62 |     # Define a model (a random forest classifier with default hyperparameters)
63 |     model = RandomForestClassifier()
64 | 
65 |     # Train the model
66 |     model.fit(train_x, train_y)
67 |
68 |
69 | ## --- End of your code --- ##
70 |
71 |
72 | # Save the trained model
73 |
74 | joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
75 |
76 |
77 |
--------------------------------------------------------------------------------
/Spark/Cluster Analysis of the San Diego Weather Data/readme.md:
--------------------------------------------------------------------------------
1 | Cluster analysis of the San Diego weather data using Spark.
2 | 
--------------------------------------------------------------------------------
/Spark/San Diego Rainforest Fire Predicition/Readme.md:
--------------------------------------------------------------------------------
1 | San Diego rainforest fire prediction using Spark.
2 | 
--------------------------------------------------------------------------------
/time-series-analysis/Power-consumption-forecasting/json_energy_data/readme.md:
--------------------------------------------------------------------------------
1 | The JSON-formatted energy data files.
2 |
--------------------------------------------------------------------------------
/time-series-analysis/Power-consumption-forecasting/readme.md:
--------------------------------------------------------------------------------
1 | Power Consumption Forecasting
2 |
3 | Overview
4 |
5 | Power consumption data is a time series: data collected periodically over time. Time series forecasting is the task of predicting future data points given some historical data. It is commonly used in a variety of tasks, from weather forecasting, retail and sales forecasting, and stock market prediction to behavior prediction (such as predicting the flow of car traffic over a day). There is a lot of time series data out there, and recognizing patterns in that data is an active area of machine learning research!
6 | Motivation and the problem
7 |
8 | Take the household power consumption data from 2007-2009 and use it to accurately predict the average Global active power usage for the following months in 2010.
9 |
10 | Data
11 | Energy Consumption Data
12 | The data we'll be working with in this notebook is household electric power consumption data. The dataset is available on [kaggle](https://www.kaggle.com/uciml/electric-power-consumption-data-set) and represents power consumption collected over several years, from 2006 to 2010. With such a large dataset, we can aim to predict over long periods of time: days, weeks, or months. Predicting energy consumption can be a useful task for a variety of reasons, including determining seasonal prices for power consumption and efficiently delivering power to people according to their predicted usage.
13 | Interesting read: an inversely-related project, recently done by Google and DeepMind, uses machine learning to predict the generation of power by wind turbines and efficiently deliver power to the grid. You can read about that research in this post.
14 |
15 | DeepAR model
16 |
17 | DeepAR utilizes a recurrent neural network (RNN), which is designed to accept some sequence of data points as historical input and produce a predicted sequence of points. So, how does this model learn?
18 | During training, you'll provide a training dataset (made of several time series) to a DeepAR estimator. The estimator looks at all the training time series and tries to identify similarities across them. It trains by randomly sampling training examples from the training time series.
19 | Each training example consists of a pair of adjacent context and prediction windows of fixed, predefined lengths.
20 | The context_length parameter controls how far in the past the model can see.
21 | The prediction_length parameter controls how far in the future predictions can be made.
22 | You can find more details in this **[documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar_how-it-works.html)**.
23 |
24 | Notebook outline
25 |
26 | * Loading and exploring the data
27 | * Creating training and test sets of time series
28 | * Formatting data as JSON files and uploading to S3
29 | * Instantiating and training a DeepAR estimator
30 | * Deploying a model and creating a predictor
31 | * Evaluating the predictor
32 |
--------------------------------------------------------------------------------
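As the outline above notes, the training and test series have to be written in the JSON Lines format DeepAR expects: one object per line with a `start` timestamp and a `target` list of values. A minimal formatting sketch follows; the series values, frequency, and file name are placeholders.

```python
# Minimal sketch of writing a time series in DeepAR's JSON Lines input format:
# one JSON object per line with a "start" timestamp and a "target" list of values.
# The series below is a placeholder; in the project it is daily Global active power.
import json
import pandas as pd

def series_to_jsonline(ts):
    return json.dumps({"start": str(ts.index[0]), "target": ts.tolist()})

daily = pd.Series([1.2, 1.5, 1.1],
                  index=pd.date_range("2007-01-01", periods=3, freq="D"))

with open("train.json", "w") as f:
    for ts in [daily]:
        f.write(series_to_jsonline(ts) + "\n")
```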
/time-series-analysis/Power-consumption-forecasting/txt_preprocessing.py:
--------------------------------------------------------------------------------
1 | # This file was used to process the initial, raw text: household_power_consumption.txt
2 | import pandas as pd
3 |
4 |
5 | ## 1. Raw data processing
6 |
7 | # The 'household_power_consumption.txt' file has the following attributes:
8 | # * Each data point has a date and time (hour) of recording
9 | # * The data points are separated by semicolons (;)
10 | # * Some values are 'nan' or '?', and we'll treat these both as `NaN` values when making a DataFrame
11 |
12 | # A helper function to read the file in and create a DataFrame, indexed by 'Date Time'
13 | def create_df(text_file, sep=';', na_values=['nan','?']):
14 | '''Reads in a text file and converts it to a dataframe, indexed by 'Date Time'.'''
15 |
16 | df = None
17 |
18 | # check that the file is the expected text file
19 | expected_file='household_power_consumption.txt'
20 | if(text_file != expected_file):
21 | print('Unexpected file: '+str(text_file))
22 | return df
23 |
24 | # read in the text file
25 | # each data point is separated by a semicolon
26 |     df = pd.read_csv(text_file, sep=sep,
27 | parse_dates={'Date-Time' : ['Date', 'Time']}, infer_datetime_format=True,
28 | low_memory=False, na_values=na_values, index_col='Date-Time') # indexed by Date-Time
29 |
30 | return df
31 |
32 | ## 2. Managing `NaN` values
33 |
34 | # This DataFrame does include some data points that have missing values.
35 | # So far, we've mainly been dropping these values, but there are other ways to handle `NaN` values.
36 | # One technique is to fill the missing values with the *mean* values from a column,
37 | # this way the added value is likely to be realistic.
38 |
39 | # A helper function to fill NaN values with a column average
40 | def fill_nan_with_mean(df):
41 | '''Fills NaN values in a given dataframe with the average values in a column.
42 | This technique works well for filling missing, hourly values
43 | that will later be averaged into energy stats over a day (24hrs).'''
44 |
45 | # filling nan with mean value of any columns
46 | num_cols = len(list(df.columns.values))
47 | for col in range(num_cols):
48 | df.iloc[:,col]=df.iloc[:,col].fillna(df.iloc[:,col].mean())
49 |
50 | return df
51 |
52 |
--------------------------------------------------------------------------------
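A short usage sketch of the two helpers above, assuming `household_power_consumption.txt` from the Kaggle dataset is in the working directory and has a `Global_active_power` column:

```python
# Usage sketch: load the raw text file, fill missing values with column means,
# then resample to the daily averages that are forecast in the notebook.
from txt_preprocessing import create_df, fill_nan_with_mean

df = create_df('household_power_consumption.txt')
df = fill_nan_with_mean(df)

daily_power = df['Global_active_power'].resample('D').mean()
print(daily_power.head())
```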
/time-series-analysis/readme.md:
--------------------------------------------------------------------------------
1 | The time series analysis projects
2 |
--------------------------------------------------------------------------------