├── .gitignore
├── README.md
├── part1-get-data-and-non-graph-modeling-prep.ipynb
├── part2-simple-non-graph-model-and-pca.ipynb
├── part3-prepare-papers-for-import.ipynb
├── part4-prepare-authors-and-inst-for-import.ipynb
├── part5-admin-import.md
├── part6-analysis-in-neo4j-gds.ipynb
├── part7-graph-feature-engineering-in-gds.ipynb
└── part8-graph-feature-model.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | data/
2 | scratch/
3 | .idea/
4 | .ipynb_checkpoints/
5 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # GDS Webinar Demo: Graph Data Science for Really Big Data
2 |
3 | This repo contains demo code from the 2022 GDS February Webinar - "Graph Data Science for Really Big Data". The exact pattern here may differ slightly from what you saw in the webinar (for example, most of the commands have been placed in notebooks), but the overall steps should be the same.
4 |
5 | The purpose of this demo is to explore engineering graph features using Neo4j and the [Graph Data Science (GDS) Library](https://neo4j.com/docs/graph-data-science/current/) on a larger dataset to see if we can improve accuracy for a classification problem.
6 |
7 | The graph used here is the [MAG240M OGB Large-Scale-Challenge Graph](https://ogb.stanford.edu/docs/lsc/mag240m/). It is a heterogeneous academic paper graph that contains around 240 million nodes and 1.7 billion relationships.
8 |
9 | ## Demo Outline and Notebook Parts
10 |
11 | This demo walks through multiple steps: running a reference model before using the graph, formatting and importing data into Neo4j, analyzing the graph and engineering graph features with GDS, and exporting data to re-run the model with those graph features.
12 |
13 | The demo is split into 8 parts, 7 of which are IPython notebooks. The file names describe what each part covers.
14 |
15 | - Parts 1 and 2 focus on understanding the data and running a classification model with available features before leveraging Neo4j/GDS/graph
16 | - `part1-get-data-and-non-graph-modeling-prep.ipynb`
17 | - `part2-simple-non-graph-model-and-pca.ipynb`
18 |
19 | - Parts 3-5 focus on pre-formatting the data and importing it into the graph
20 | - `part3-prepare-papers-for-import.ipynb`
21 | - `part4-prepare-authors-and-inst-for-import.ipynb`
22 | - `part5-admin-import.md`
23 |
24 | - Parts 6 and 7 focus on work in Neo4j and GDS. Part 6 mostly inspects the graph and demos native projections and the Weakly Connected Components (WCC) algorithm. Part 7 generates and exports graph features (FastRP node embeddings).
25 | - `part6-analysis-in-neo4j-gds.ipynb`
26 | - `part7-graph-feature-engineering-in-gds.ipynb`
27 |
28 | - Finally, Part 8 re-runs the classification model with the graph features (FastRP node embeddings). In this very rough, exploratory first pass we get an increase of roughly 9 percentage points in classification accuracy.
29 | - `part8-graph-feature-model.ipynb`
30 |
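To give a feel for what the FastRP node embeddings in Parts 7 and 8 are doing, here is a minimal NumPy sketch of the general idea: start from a very sparse random projection, repeatedly average it over each node's neighbors, and sum the normalized intermediate results with per-iteration weights. This is a simplified illustration and not the GDS implementation. The function name `fastrp_sketch`, its parameters, and the toy graph are all made up for this example, and it leaves out details such as degree-based scaling of the initial random vectors. The `dim` and `iteration_weights` arguments play roughly the role of the `embeddingDimension` and `iterationWeights` settings in GDS.

```python
import numpy as np


def fastrp_sketch(adjacency, dim, iteration_weights, seed=0):
    """Simplified FastRP-style embedding for a small dense adjacency matrix.

    Hypothetical helper for illustration only; not the GDS implementation.
    """
    rng = np.random.default_rng(seed)
    adjacency = np.asarray(adjacency, dtype=float)
    n_nodes = adjacency.shape[0]

    # Row-normalize so each propagation step averages over a node's neighbors.
    degree = adjacency.sum(axis=1, keepdims=True)
    degree[degree == 0] = 1.0  # isolated nodes keep an all-zero row
    transition = adjacency / degree

    # Very sparse random projection: entries are -sqrt(3), 0, or +sqrt(3)
    # with probabilities 1/6, 2/3, and 1/6.
    current = np.sqrt(3.0) * rng.choice(
        [-1.0, 0.0, 1.0], size=(n_nodes, dim), p=[1 / 6, 2 / 3, 1 / 6]
    )

    embedding = np.zeros((n_nodes, dim))
    for weight in iteration_weights:
        # One propagation step: mix in information from one hop further away.
        current = transition @ current
        # L2-normalize each row before adding it to the weighted sum.
        norms = np.linalg.norm(current, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        embedding += weight * (current / norms)
    return embedding


# Toy undirected graph with 4 nodes (assumed data, not from MAG240M).
adjacency = np.array(
    [
        [0, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 0],
    ]
)
emb = fastrp_sketch(adjacency, dim=8, iteration_weights=[0.0, 1.0, 1.0])
print(emb.shape)  # (4, 8)
```

At the scale of this demo you would not run anything like the dense version above. The point is only that the embeddings are cheap linear-algebra summaries of each node's neighborhood, which is why they can add information to the text-based paper features.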
31 |
32 |
33 | ## Prerequisites & Environment for Running the Demo
34 |
35 | ### Software Versions
36 | - Neo4j = Enterprise Edition 4.4.3
37 | - GDS = Enterprise Edition 1.8.3
38 | - APOC = 4.4.0.3
39 | - Python = 3.9.7
40 |
41 | Important Note: Enterprise (as opposed to Community) Editions were used for both the Neo4j Database and the GDS library in this demo. GDS Enterprise, in particular, provides high concurrency and optimized in-memory compression, which are not available in Community Edition and are key to performance at this scale.
42 |
43 | ### Instance
44 | This demo was run on a single AWS EC2 x1.16xlarge instance (64 vCPUs, 976 GB memory).
45 |
46 | ### Neo4j Configuration
47 | I tweaked a few things, but the settings below are the most critical. You can update them in the Neo4j settings/configuration (a.k.a. `neo4j.conf`):
48 |
49 | - `dbms.memory.heap.max_size=760G`
50 | - `gds.export.location=/data/neo-export` # or set to whatever directory you would like data exports from Neo4j to go
51 |
52 | Depending on your environment and specific needs, you may need to tune these and other settings, such as minimum heap size and page cache. For more details on optimizing Neo4j configuration for data science and analytics at scale, I recommend looking into the [Graph Data Science Configuration Guide](https://neo4j.com/whitepapers/graph-data-science-configuration-guide/).
53 |
54 |
55 | ## Future Experimentation & Improvements
56 |
57 | This demo was just a rough first pass to explore what is possible. There are many ways to improve upon this analysis! Here are just a few areas to experiment:
58 |
59 | 1. Improved tuning of FastRP node embeddings
60 | 2. Inclusion of more graph features
61 | 3. Streamlined data formatting and ETL
62 | 4. Better-tuned and/or more sophisticated classification models and frameworks
63 | 5. Exploration of semi-supervised transductive approaches to label the rest of the papers, such as the Label Propagation Algorithm (LPA) or K-Nearest Neighbors (KNN)
--------------------------------------------------------------------------------
/part1-get-data-and-non-graph-modeling-prep.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "12fe1bee",
6 | "metadata": {},
7 | "source": [
8 | "# Part 1: Prepare Data For Non-Graph (\"Flat\") Modeling"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "id": "50a66a9b",
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "from ogb.lsc import MAG240MDataset\n",
19 | "import numpy as np\n",
20 | "import os\n",
21 | "import pandas as pd"
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "id": "ec956515",
27 | "metadata": {},
28 | "source": [
29 | "## Notebook Setup"
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "id": "70aa26f6",
35 | "metadata": {},
36 | "source": [
37 | "Root Directory for data storage. Will be used in following parts as well."
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 2,
43 | "id": "266ad13e",
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "ROOT_DATA_DIR = \"/data\""
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 3,
53 | "id": "cc481218",
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | "Directory /data already exists\n"
61 | ]
62 | }
63 | ],
64 | "source": [
65 | "if not os.path.exists(ROOT_DATA_DIR):\n",
66 | " os.mkdir(ROOT_DATA_DIR)\n",
67 | " print(f'Created new directory: {ROOT_DATA_DIR}')\n",
68 | "else:\n",
69 | " print(f'Directory {ROOT_DATA_DIR} already exists')"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "id": "74bdc7db",
75 | "metadata": {},
76 | "source": [
77 | "### Get the Dataset Object\n",
78 | "The dataset object handles downloading and provides easy access to the data and its features. It leverages [numpy memmap](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html) functionality to reference large pieces of the dataset on disk, so it does not need to load all the features into memory at once. For more information, please see the [OGB MAG240M Page](https://ogb.stanford.edu/kddcup2021/mag240m/).\n",
79 | "\n",
80 | "__Note: This command takes a while on the *first* run (several hours to a day)__ as the source data needs to be downloaded from OGB. Subsequent runs should be near instantaneous, though.\n"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 4,
86 | "id": "81db468a",
87 | "metadata": {},
88 | "outputs": [],
89 | "source": [
90 | "dataset = MAG240MDataset(root = ROOT_DATA_DIR)"
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "id": "75a27300",
96 | "metadata": {},
97 | "source": [
98 | "## Examine Data Splitting and Labels\n",
99 | "\n",
100 | "Only a fraction of the papers (the arXiv papers) are labeled. An `idx_split` object is provided with indexes mapping the labeled papers to training, validation, and test sets. As we will see below, the test sets have their labels hidden for the purposes of the previous competition. More information on the data and labeling process can be found at the [OGB MAG240M Page](https://ogb.stanford.edu/kddcup2021/mag240m/)."
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 5,
106 | "id": "100a55a5",
107 | "metadata": {},
108 | "outputs": [],
109 | "source": [
110 | "#get the indexes for arXiv paper data splits\n",
111 | "split_dict = dataset.get_idx_split()"
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": 6,
117 | "id": "4080b496",
118 | "metadata": {},
119 | "outputs": [
120 | {
121 | "name": "stdout",
122 | "output_type": "stream",
123 | "text": [
124 | "------------------\n",
125 | "train index size = 1112392\n",
126 | "------------------\n",
127 | "valid index size = 138949\n",
128 | "------------------\n",
129 | "test-whole index size = 146818\n",
130 | "------------------\n",
131 | "test-dev index size = 88092\n",
132 | "------------------\n",
133 | "test-challenge index size = 58726\n"
134 | ]
135 | }
136 | ],
137 | "source": [
138 | "#get the relative sizes of each set\n",
139 | "for i in split_dict.keys():\n",
140 | " print('------------------')\n",
141 | " print(f'{i} index size = {len(split_dict[i])}')"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 7,
147 | "id": "7466446c",
148 | "metadata": {},
149 | "outputs": [
150 | {
151 | "name": "stdout",
152 | "output_type": "stream",
153 | "text": [
154 | "Paper labels for the \"train\" set:\n",
155 | "--------\n",
156 | "Sample values = [17. 29. 38. 5. 1.]\n",
157 | "Number non-missing = 1112392\n",
158 | "============================\n",
159 | "\n",
160 | "Paper labels for the \"valid\" set:\n",
161 | "--------\n",
162 | "Sample values = [140. 129. 33. 59. 24.]\n",
163 | "Number non-missing = 138949\n",
164 | "============================\n",
165 | "\n",
166 | "Paper labels for the \"test-whole\" set:\n",
167 | "--------\n",
168 | "Sample values = [-1. -1. -1. -1. -1.]\n",
169 | "Number non-missing = 0\n",
170 | "============================\n",
171 | "\n",
172 | "Paper labels for the \"test-dev\" set:\n",
173 | "--------\n",
174 | "Sample values = [-1. -1. -1. -1. -1.]\n",
175 | "Number non-missing = 0\n",
176 | "============================\n",
177 | "\n",
178 | "Paper labels for the \"test-challenge\" set:\n",
179 | "--------\n",
180 | "Sample values = [-1. -1. -1. -1. -1.]\n",
181 | "Number non-missing = 0\n",
182 | "============================\n",
183 | "\n"
184 | ]
185 | }
186 | ],
187 | "source": [
188 | "# Note that we only have known labels in the train and validate sets. \n",
189 | "# A value of -1 implies a hidden label\n",
190 | "for i in split_dict.keys():\n",
191 | " paper_labels = dataset.paper_label[split_dict[i]]\n",
192 | " print(f'Paper labels for the \"{i}\" set:')\n",
193 | " print('--------')\n",
194 | " print(f'Sample values = {paper_labels[:5]}')\n",
195 | " print(f'Number non-missing = {np.sum(paper_labels > -1)}')\n",
196 | " print('============================\\n')"
197 | ]
198 | },
199 | {
200 | "cell_type": "markdown",
201 | "id": "439a60ae",
202 | "metadata": {},
203 | "source": [
204 | "## Building a DataFrame for Supervised Model Testing\n",
205 | "\n",
206 | "We will use the 'train' and 'valid' set for pre-graph supervised model analysis\n",
207 | "since they are the only ones with labels"
208 | ]
209 | },
210 | {
211 | "cell_type": "code",
212 | "execution_count": 8,
213 | "id": "30a40151",
214 | "metadata": {},
215 | "outputs": [],
216 | "source": [
217 | "#get the training set\n",
218 | "feat_cols = [f'paper_encoding_{i}' for i in range(768)]\n",
219 | "paper_df_train = pd.DataFrame(dataset.paper_feat[split_dict['train']], columns = feat_cols)\n",
220 | "paper_df_train['split_segment'] = 'TRAIN'\n",
221 | "paper_df_train['paper_subject'] = dataset.paper_label[split_dict['train']]\n",
222 | "paper_df_train['paper_year'] = dataset.paper_year[split_dict['train']]"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 9,
228 | "id": "9c0f485e",
229 | "metadata": {},
230 | "outputs": [],
231 | "source": [
232 | "#get the validation set\n",
233 | "paper_df_validate = pd.DataFrame(dataset.paper_feat[split_dict['valid']], columns = feat_cols)\n",
234 | "paper_df_validate['split_segment'] = 'VALIDATE'\n",
235 | "paper_df_validate['paper_subject'] = dataset.paper_label[split_dict['valid']]\n",
236 | "paper_df_validate['paper_year'] = dataset.paper_year[split_dict['valid']]"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 10,
242 | "id": "49a7873b",
243 | "metadata": {},
244 | "outputs": [
245 | {
246 | "data": {
247 | "text/html": [
248 | "
"
642 | ],
643 | "text/plain": [
644 | " paper_encoding_0\n",
645 | "paper_subject \n",
646 | "0.0 28041\n",
647 | "1.0 2856\n",
648 | "2.0 3907\n",
649 | "3.0 1530\n",
650 | "4.0 1910\n",
651 | "... ...\n",
652 | "148.0 865\n",
653 | "149.0 815\n",
654 | "150.0 837\n",
655 | "151.0 22696\n",
656 | "152.0 1139\n",
657 | "\n",
658 | "[153 rows x 1 columns]"
659 | ]
660 | },
661 | "execution_count": 5,
662 | "metadata": {},
663 | "output_type": "execute_result"
664 | }
665 | ],
666 | "source": [
667 | "papers_df[['paper_subject', 'paper_encoding_0']].groupby('paper_subject').count()"
668 | ]
669 | },
670 | {
671 | "cell_type": "markdown",
672 | "id": "88beb827",
673 | "metadata": {},
674 | "source": [
675 | "## Logistic Regression Using Entire 768 Dimensional Encoding\n",
676 | "\n",
677 | "\n",
678 | "As a first pass we will try to fit this model with simple logistic regression using just the 768 dimensional RoBERTa encoding vectors as features. \n",
679 | "\n",
680 | "__Note: this model fitting step can take a while (several hours) to complete__\n",
681 | "\n",
682 | "We will get convergence warnings when running the model below. I tried various parameters in sklearn to avoid this but could not. In a more rigorous setting, I would recommend looking deeper into tuning parameters, different model types, different machine learning libraries/frameworks, etc. For the purposes of this demo, however, we are just trying to get an initial rough benchmark. In the following sections we will apply a very simple solution, dimensionality reduction with Principal Components Analysis (PCA), to see the effect on results."
683 | ]
684 | },
685 | {
686 | "cell_type": "code",
687 | "execution_count": 6,
688 | "id": "aa8a82dc",
689 | "metadata": {},
690 | "outputs": [],
691 | "source": [
692 | "papers_df = papers_df.astype({'paper_subject':'int32'})"
693 | ]
694 | },
695 | {
696 | "cell_type": "code",
697 | "execution_count": 7,
698 | "id": "4863519d",
699 | "metadata": {},
700 | "outputs": [],
701 | "source": [
702 | "X = papers_df[['paper_encoding_' + str(x) for x in range(768)]]\n",
703 | "y = papers_df.paper_subject"
704 | ]
705 | },
706 | {
707 | "cell_type": "code",
708 | "execution_count": 8,
709 | "id": "486dfa78",
710 | "metadata": {},
711 | "outputs": [],
712 | "source": [
713 | "X_train = X[papers_df.split_segment == \"TRAIN\"]\n",
714 | "X_validate = X[papers_df.split_segment == \"VALIDATE\"]\n",
715 | "y_train = y[papers_df.split_segment == \"TRAIN\"]\n",
716 | "y_validate = y[papers_df.split_segment == \"VALIDATE\"]"
717 | ]
718 | },
719 | {
720 | "cell_type": "code",
721 | "execution_count": 9,
722 | "id": "e3d95251",
723 | "metadata": {},
724 | "outputs": [],
725 | "source": [
726 | "model = LogisticRegression(multi_class='ovr', solver='saga', n_jobs=60, max_iter=200)"
727 | ]
728 | },
729 | {
730 | "cell_type": "code",
731 | "execution_count": 10,
732 | "id": "91cba657",
733 | "metadata": {},
734 | "outputs": [
735 | {
736 | "name": "stderr",
737 | "output_type": "stream",
738 | "text": [
739 | "/home/ubuntu/.conda/envs/graph2/lib/python3.9/site-packages/sklearn/linear_model/_sag.py:352: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge\n",
740 | " warnings.warn(\n"
809 | ]
810 | },
811 | {
812 | "data": {
813 | "text/plain": [
814 | "LogisticRegression(max_iter=200, multi_class='ovr', n_jobs=60, solver='saga')"
815 | ]
816 | },
817 | "execution_count": 10,
818 | "metadata": {},
819 | "output_type": "execute_result"
820 | }
821 | ],
822 | "source": [
823 | "#Note: This can take a while (several hours)\n",
824 | "model.fit(X_train, y_train)"
825 | ]
826 | },
827 | {
828 | "cell_type": "code",
829 | "execution_count": 11,
830 | "id": "a826e1b8",
831 | "metadata": {},
832 | "outputs": [
833 | {
834 | "name": "stdout",
835 | "output_type": "stream",
836 | "text": [
837 | "Accuracy of logistic regression classifier on VALIDATE set: 0.49\n"
838 | ]
839 | }
840 | ],
841 | "source": [
842 | "print('Accuracy of logistic regression classifier on VALIDATE set: {:.2f}'\\\n",
843 | " .format(model.score(X_validate, y_validate)))"
844 | ]
845 | },
846 | {
847 | "cell_type": "markdown",
848 | "id": "67621ceb",
849 | "metadata": {},
850 | "source": [
851 | "## Reducing Dimensionality with Principal Components Analysis (PCA)"
852 | ]
853 | },
854 | {
855 | "cell_type": "code",
856 | "execution_count": 12,
857 | "id": "1cea5185",
858 | "metadata": {},
859 | "outputs": [
860 | {
861 | "data": {
862 | "text/plain": [
863 | "PCA()"
864 | ]
865 | },
866 | "execution_count": 12,
867 | "metadata": {},
868 | "output_type": "execute_result"
869 | }
870 | ],
871 | "source": [
872 | "from sklearn.decomposition import PCA\n",
873 | "pca = PCA()\n",
874 | "pca.fit(X_train)"
875 | ]
876 | },
877 | {
878 | "cell_type": "code",
879 | "execution_count": 13,
880 | "id": "0bcbf2b2",
881 | "metadata": {},
882 | "outputs": [
883 | {
884 | "data": {
885 | "image/png": "\n",
886 | "text/plain": [
887 | "