├── local └── environment.yml ├── binder └── environment.yml ├── .gitignore ├── readme.md ├── 03-prepare-VOSviewer-term-map.ipynb ├── data-files └── vosviewer │ └── terms.txt ├── 02-advanced.ipynb └── 01-basics.ipynb /local/environment.yml: -------------------------------------------------------------------------------- 1 | name: CSSS 2 | channels: 3 | - conda-forge 4 | - defaults 5 | dependencies: 6 | - python=3.8 7 | - jupyter 8 | - nbconvert 9 | - notebook 10 | - tornado 11 | - matplotlib 12 | - numpy 13 | - scipy 14 | - pandas 15 | - pycairo 16 | - python-igraph 17 | - leidenalg 18 | -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | channels: 2 | - vtraag 3 | - conda-forge 4 | - defaults 5 | dependencies: 6 | - python=3.7 7 | - jupyter=1.0.0 8 | - nbconvert=5.4.0 9 | - notebook=5.7.4 10 | - tornado<6 11 | - matplotlib 12 | - numpy 13 | - scipy 14 | - pandas>=0.21.0 15 | - pycairo 16 | - python-igraph 17 | - leidenalg 18 | - metaknowledge 19 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | results/ 3 | latexdiff*/ 4 | *.dat 5 | *.pyc 6 | *.log 7 | *.bbl 8 | *.blg 9 | *.aux 10 | *.pdf 11 | *.eps 12 | *.out 13 | *.synctex.gz 14 | *.synctex 15 | *.swp 16 | *.zip 17 | *.gephi 18 | *.fdb_latexmk 19 | *.fls 20 | *.*~ 21 | *.tcp 22 | *.tps 23 | *.tiw 24 | *Notes.bib 25 | *.tmp 26 | *.docx 27 | *.picklez 28 | *.png 29 | *.spl 30 | **/.ipynb_checkpoints 31 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | 2 | # CWTS Scientometrics Summer School 3 | 4 | This GitHub repository contains the exercises for doing network analysis with Python. 5 | 6 | We would like to encourage you to install [Anaconda](https://www.anaconda.com/distribution/) Python locally. This allows you to run the Python notebooks on your own computer. As an alternative, the notebooks are also available from an online service if you don't manage to install [Anaconda](https://www.anaconda.com/distribution/) Python locally. 7 | 8 | # Local installation 9 | 10 | We encourage you to install Python on your own computer. When you have everything installed locally, you can simply run the notebooks without depending on any online service. Moreover, your local environment is then already set up if you want to use it in the future. 11 | 12 | Please follow these steps to correctly set up your environment: 13 | 14 | 1. [Download](https://www.anaconda.com/distribution/) and install Anaconda Python. When asked, select to install it only for a single user. 15 | 16 | 2. [Download](https://github.com/CWTSLeiden/CSSS/archive/master.zip) this repository and unzip it to a directory of your choice. 17 | 18 | - More technical users may also clone the repository; make sure that you use the master branch. 19 | 20 | 21 | 3. In Windows, please launch the "Anaconda prompt". In Mac OS/Linux, open the terminal and activate conda by running `source ~/anaconda3/bin/activate`. This enables the installation of the required packages.
In the prompt/terminal, navigate to the directory to which you unzipped the repository using 22 | 23 | ``` 24 | cd [DIRECTORY] 25 | ``` 26 | 27 | where you should replace `[DIRECTORY]` with the directory to which you unzipped the repository. 28 | 29 | 4. Set up the new environment ``CSSS`` using 30 | 31 | ``` 32 | conda env create -f local/environment.yml 33 | ``` 34 | 35 | This automatically creates the new environment ``CSSS`` and installs the correct versions of all required packages. 36 | 37 | **Note:** Installation may take some time. 38 | 39 | ## Run Jupyter notebook 40 | 41 | There are two ways in which you can run a Jupyter notebook. 42 | 43 | 1. Launch the "Anaconda navigator" and start the Jupyter notebook from there. Make sure to select the correct environment ``CSSS`` from the dropdown box at the top of the window. The Jupyter notebook will start in a specific directory; you may need to move the directory to which you unzipped the repository so that it is also visible from the Jupyter notebook. 44 | 45 | 2. In Windows, please launch the "Anaconda prompt". In Mac OS/Linux, open the terminal and activate conda by running ``conda activate CSSS`` or, if that does not work, ``source ~/anaconda3/bin/activate CSSS``. Navigate to the directory to which you unzipped the repository using 46 | ``` 47 | cd [DIRECTORY] 48 | ``` 49 | Then launch the Jupyter notebook using 50 | ``` 51 | jupyter notebook 52 | ``` 53 | 54 | In both approaches, you can open the desired notebook: `01-basics.ipynb` or `02-advanced.ipynb`. 55 | 56 | ## Issues 57 | 58 | If you encounter any problems during installation, or with the Python notebooks, please report them as an issue at https://github.com/CWTSLeiden/CSSS/issues. 59 | 60 | # Run online 61 | 62 | The Python notebooks can be run online without the need for installation. Please click on one of the badges below to start the interactive environment. Note that resources are limited, and that you cannot use your own data files for further analysis. Unfortunately, the online services may also not always be available.
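If you have installed everything locally, you can quickly check that the main packages of the ``CSSS`` environment import correctly by running the following in a notebook cell, or in a Python session started from the activated environment (a minimal sanity check, not part of the exercises):

```python
# Quick sanity check of the local CSSS environment: these imports should all succeed
import igraph
import leidenalg
import pandas

print('igraph', igraph.__version__)
print('pandas', pandas.__version__)
```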
63 | 64 | ## `01-basics.ipynb` 65 | * GESIS (Leibniz Institute for the Social Sciences) 66 | [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/CWTSLeiden/CSSS/master?filepath=01-basics.ipynb) 67 | 68 | * PANGEO 69 | [![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/CWTSLeiden/CSSS/master?filepath=01-basics.ipynb) 70 | 71 | * MyBinder.org 72 | [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/CWTSLeiden/CSSS/master?filepath=01-basics.ipynb) 73 | 74 | ## `02-advanced.ipynb` 75 | 76 | * GESIS (Leibniz Institute for the Social Sciences) 77 | [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/CWTSLeiden/CSSS/master?filepath=02-advanced.ipynb) 78 | 79 | * PANGEO 80 | [![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/CWTSLeiden/CSSS/master?filepath=02-advanced.ipynb) 81 | 82 | * MyBinder.org 83 | [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/CWTSLeiden/CSSS/master?filepath=02-advanced.ipynb) 84 | 85 | 86 | ## `03-prepare-VOSviewer-term-map.ipynb` 87 | 88 | * GESIS (Leibniz Institute for the Social Sciences) 89 | [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/CWTSLeiden/CSSS/master?filepath=03-prepare-VOSviewer-term-map.ipynb) 90 | 91 | * PANGEO 92 | [![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/CWTSLeiden/CSSS/master?filepath=03-prepare-VOSviewer-term-map.ipynb) 93 | 94 | * MyBinder.org 95 | [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/CWTSLeiden/CSSS/master?filepath=03-prepare-VOSviewer-term-map.ipynb) 96 | -------------------------------------------------------------------------------- /03-prepare-VOSviewer-term-map.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Preparing files for VOSviewer overlays" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this notebook we will load some files from Web of Science, parse them, and use them to prepare an advanced overlay map in VOSviewer. Many of these operations you have already seen earlier during the summer school." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "As usual, we will start by importing the relevant packages. We will need the `pandas` package, which we will again call `pd`. In addition, we need the `csv` package for some options and the `glob` package to easily find the relevant files." 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import pandas as pd\n", 31 | "import csv\n", 32 | "import glob" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "We will start by reading in all files. We already did this in an earlier notebook; we repeat it below."
40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n", 49 | "publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n", 50 | "                                         usecols=range(68), index_col='UT') for f in files)\n", 51 | "publications_df = publications_df.sort_index()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "We will now prepare files manually for VOSviewer. We will have to prepare two files: \n", 59 | " 1. a so-called corpus file that contains all text for each document.\n", 60 | " 2. a so-called scores file that contains \"scores\" for each document." 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## Corpus file" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "We will first prepare the corpus file. For this purpose, we concatenate the title and abstract of each publication. VOSviewer simply considers each line in the corpus file a document and uses all of its text when creating a term map. In other words, you can apply this to any type of file." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "publications_df['text'] = publications_df['TI'] + '. ' + publications_df['AB']" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "We have added the additional full stop (`.`) to make sure that VOSviewer is able to parse the sentences correctly." 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "Since VOSviewer expects one document per line, we need to make sure that each title and abstract is on a single line. In more technical terms: they cannot contain any newlines, which are represented by a combination of special characters that depends on the platform you are using. We will simply remove all possible newline characters as follows:" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "publications_df['text'] = publications_df['text'].str.replace('\\n', '').str.replace('\\r', '');" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "Now we write the text for each document to a corpus file." 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "publications_df['text'].to_csv('corpus.txt', index=False, header=False)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "## Scores file" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "Now we have to determine what type of scores we want to project as overlays in VOSviewer. We will show how to do this using journals; you can repeat the exercise for countries." 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "Scores in VOSviewer work as follows. For each score, VOSviewer calculates the average of that score over the documents that contain a specific term. It then colors the terms in the term map according to these averages.
This can then highlight certain parts of the map, showing where this score is particularly high or low. The objective now is to show this for journals, highlighting what part of the map is particularly relevant to a certain journal." 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "We will do this for each journal separately. At the moment, the journal is contained in the field `SO`." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "publications_df['SO']" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "You may remember that you can group the dataframe by the journal to get an overview per journal." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "publications_df.groupby('SO').size().sort_values(ascending=False)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "Now we would like to translate the `SO` column in such a way that VOSviewer can show a separate overlay for each journal. For those of you who are familiar with statistics, we will do this using so-called \"dummy\" variables. That is, for each journal, we will create a new column, and indicate whether the publication is from that journal (Yes, `1`) or not (No, `0`). If VOSviewer then takes the average, this comes down to showing the percentage of publications with a certain term that are published in that journal. Fortunately, this is implemented in `pandas`, so we can easily do that." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "journal_scores_df = publications_df['SO'].str.get_dummies()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "If we now look at journal_scores_df, you will see many column names that represent the journals, and only `0` or `1` in each entry." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "journal_scores_df.head()" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "VOSviewer expects a specific column name for scores. In particular, it should be called `Score<...>`. We therefore change the column names accordingly." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "journal_scores_df.columns = ['Score<{}>'.format(c) for c in journal_scores_df.columns]" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Finally, we write the dataframe to a scores file, which should be tab-delimited."
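To see concretely what `get_dummies` and the `Score<...>` renaming produce, here is a minimal, self-contained sketch on hypothetical toy data (the journal names below are made up); averaging such 0/1 columns over the documents that contain a term gives the share of those documents published in each journal, which is what VOSviewer displays as an overlay.

```python
import pandas as pd

# Hypothetical toy data, purely to illustrate the dummy-variable idea;
# the real notebook uses the 'SO' column of publications_df.
toy_df = pd.DataFrame({'SO': ['JOURNAL A', 'JOURNAL B', 'JOURNAL A']})

# One 0/1 column per distinct journal
toy_scores_df = toy_df['SO'].str.get_dummies()

# Rename the columns to the Score<...> convention expected by VOSviewer
toy_scores_df.columns = ['Score<{}>'.format(c) for c in toy_scores_df.columns]

# Writing this tab-delimited yields a file in the format of a VOSviewer scores file
print(toy_scores_df.to_csv(sep='\t', index=False))
```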
231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "journal_scores_df.to_csv('scores.txt', sep='\\t', index=None)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "## VOSviewer" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "You can now create a term map in VOSviewer using the two files you produced `corpus.txt` and `scores.txt`. To create a term map based on these files, choose \"Create a map based on text data\" in VOSviewer, and then select \"Read data from VOSviewer files.\"" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "# Exercise Document type" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "
\n", 268 | " Now repeat the same exercise but using the document type DT.\n", 269 | "
" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "
\n", 284 | " Create the term map in VOSviewer with the document type score file. Does the category of \"Meeting Abstract\" show a particular pattern? Why (not)? Can you explain you observation?\n", 285 | "
" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "
\n", 293 | " You probably now have two different dataframes. You then cannot see the document type overlay at the same time as the journal overlay. Could you try to combine the two dataframes? (Hint: check out the concat function we encountered earlier.)\n", 294 | "
" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [] 303 | } 304 | ], 305 | "metadata": { 306 | "kernelspec": { 307 | "display_name": "Python 3", 308 | "language": "python", 309 | "name": "python3" 310 | }, 311 | "language_info": { 312 | "codemirror_mode": { 313 | "name": "ipython", 314 | "version": 3 315 | }, 316 | "file_extension": ".py", 317 | "mimetype": "text/x-python", 318 | "name": "python", 319 | "nbconvert_exporter": "python", 320 | "pygments_lexer": "ipython3", 321 | "version": "3.8.3" 322 | } 323 | }, 324 | "nbformat": 4, 325 | "nbformat_minor": 4 326 | } 327 | -------------------------------------------------------------------------------- /data-files/vosviewer/terms.txt: -------------------------------------------------------------------------------- 1 | id term occurrences relevance score 2 | 1 a lumbricoide 11 2.3141 3 | 2 abundance 28 0.8707 4 | 3 acceptability 17 0.8648 5 | 4 access 76 0.9195 6 | 5 accuracy 32 0.2844 7 | 6 act 26 1.6277 8 | 7 action 25 0.7057 9 | 8 acts 10 2.301 10 | 9 adherence 26 0.7939 11 | 10 administration 28 0.736 12 | 11 admission 21 0.9007 13 | 12 adverse event 16 2.0095 14 | 13 agreement 26 0.6926 15 | 14 amodiaquine 21 2.595 16 | 15 anaemia 30 0.6928 17 | 16 animal 56 0.7233 18 | 17 anopheles 10 1.0869 19 | 18 antibody 87 0.4992 20 | 19 antibody response 27 0.7887 21 | 20 antigen 108 0.6071 22 | 21 antigen detection 12 1.4239 23 | 22 antimalarial drug 10 1.5043 24 | 23 antimalarial treatment 18 1.6177 25 | 24 antiretroviral therapy 22 2.0214 26 | 25 antiretroviral treatment 29 1.8361 27 | 26 aor 13 0.7575 28 | 27 arabiensis 11 2.1203 29 | 28 art 34 1.8376 30 | 29 artemether lumefantrine 28 1.7528 31 | 30 artemisinin 45 1.2598 32 | 31 artesunate 41 1.6278 33 | 32 article 32 0.4991 34 | 33 ascaris lumbricoide 19 2.1097 35 | 34 assay 105 0.3265 36 | 35 asymptomatic individual 10 0.3452 37 | 36 attention 34 0.6734 38 | 37 attitude 15 1.1021 39 | 38 awareness 20 0.8483 40 | 39 bangladesh 14 1.3618 41 | 40 barrier 32 1.1333 42 | 41 bed net 26 0.9034 43 | 42 behaviour 43 0.7571 44 | 43 belgium 35 0.5284 45 | 44 benin 35 0.6813 46 | 45 bihar 33 1.3218 47 | 46 birth 19 1.0532 48 | 47 blood 98 0.2187 49 | 48 blood sample 87 0.4195 50 | 49 bolivia 18 1.0507 51 | 50 brazil 12 0.654 52 | 51 buruli ulcer 40 1.163 53 | 52 burundi 23 0.4816 54 | 53 card agglutination test 16 1.6911 55 | 54 care 145 1.1195 56 | 55 case control study 22 0.5025 57 | 56 case management 27 0.7348 58 | 57 case study 18 1.1898 59 | 58 catt 16 1.4986 60 | 59 cattle 31 0.958 61 | 60 cell 70 0.3892 62 | 61 central africa 14 0.3532 63 | 62 central vietnam 32 0.9589 64 | 63 cerebrospinal fluid 26 1.1873 65 | 64 chagas disease 20 0.8005 66 | 65 chloroquine 26 1.4592 67 | 66 choice 27 0.5989 68 | 67 classification 14 0.3726 69 | 68 clinic 41 0.92 70 | 69 clinical isolate 14 0.946 71 | 70 clinical malaria 19 0.6313 72 | 71 clinical sample 14 0.8511 73 | 72 clinical sign 12 0.3697 74 | 73 clinical trial 64 0.7364 75 | 74 clinician 12 0.5821 76 | 75 combination therapy 45 1.2739 77 | 76 combination treatment 13 1.973 78 | 77 community health worker 10 1.1139 79 | 78 compliance 26 0.9479 80 | 79 complication 32 0.6004 81 | 80 compound 25 0.4168 82 | 81 conclusion significance 15 0.433 83 | 82 conclusions significance 67 0.4678 84 | 83 congenital chagas disease 12 2.276 85 | 84 congenital infection 15 2.1774 86 | 85 congo 164 0.5043 87 | 86 control group 14 0.5092 88 | 87 cost effectiveness 30 0.6627 89 | 
88 cote divoire 16 0.5223 90 | 89 count 64 0.5516 91 | 90 coverage 61 0.7996 92 | 91 csf 18 1.2873 93 | 92 csp 10 1.1474 94 | 93 cuba 46 1.0261 95 | 94 culture 56 0.3843 96 | 95 cure 20 0.3492 97 | 96 cure rate 21 1.1835 98 | 97 curtain 12 2.1872 99 | 98 cutaneous leishmaniasis 24 0.5467 100 | 99 cyst 20 1.1545 101 | 100 cysticercosis 45 1.9432 102 | 101 dat 21 1.3285 103 | 102 ddt 13 2.3299 104 | 103 degrees c 19 0.2078 105 | 104 delay 31 0.6738 106 | 105 delivery 53 1.1261 107 | 106 deltamethrin 19 2.3467 108 | 107 democratic republic 157 0.5191 109 | 108 dengue 28 0.7332 110 | 109 density 47 0.5081 111 | 110 detection 182 0.2664 112 | 111 diabete 10 0.9531 113 | 112 diagnostic accuracy 21 0.731 114 | 113 diagnostic performance 13 0.8762 115 | 114 diagnostic test 26 0.4004 116 | 115 diagnostic tool 29 0.3881 117 | 116 diarrhoea 10 1.0987 118 | 117 dihydroartemisinin piperaquine 12 2.1793 119 | 118 diptera 18 1.1759 120 | 119 direct agglutination test 21 1.2924 121 | 120 discharge 13 1.325 122 | 121 diversity 30 0.4426 123 | 122 dna 67 0.5467 124 | 123 dog 20 0.8383 125 | 124 domestic animal 11 1.2878 126 | 125 dose 90 0.6189 127 | 126 dr congo 20 0.4279 128 | 127 drc 57 0.554 129 | 128 drug efficacy 19 1.3295 130 | 129 east africa 21 0.3362 131 | 130 ecology 12 0.8811 132 | 131 editorial 11 2.3821 133 | 132 education 30 0.6797 134 | 133 effectiveness 88 0.7578 135 | 134 efficacy 172 0.6426 136 | 135 efficiency 12 0.6026 137 | 136 egg 49 1.2147 138 | 137 elisa 73 0.8491 139 | 138 endemic setting 25 0.7431 140 | 139 environmental factor 13 0.8999 141 | 140 enzyme 50 0.8277 142 | 141 epilepsy 35 1.7675 143 | 142 europe 40 0.4784 144 | 143 example 19 0.6507 145 | 144 expectation 13 1.0642 146 | 145 experience 68 1.0056 147 | 146 experiment 22 0.5704 148 | 147 expression 24 0.7327 149 | 148 facility 39 0.8738 150 | 149 faecal sample 16 0.8769 151 | 150 failure 38 0.6996 152 | 151 falciparum malaria 14 0.7246 153 | 152 feasibility 29 0.4219 154 | 153 fec 10 2.423 155 | 154 field condition 18 0.4187 156 | 155 filter paper 13 0.8792 157 | 156 first line treatment 26 0.9524 158 | 157 first report 10 0.4715 159 | 158 first time 16 0.5756 160 | 159 fly 13 1.9506 161 | 160 focus group discussion 14 1.1715 162 | 161 fold 18 0.3602 163 | 162 forest 14 1.0595 164 | 163 forest malaria 10 1.4586 165 | 164 formulation 24 0.8315 166 | 165 framework 28 0.6275 167 | 166 gambiae 16 2.2306 168 | 167 gambiense 35 1.3564 169 | 168 gender 22 0.4645 170 | 169 gene 81 0.422 171 | 170 genetic diversity 22 0.4079 172 | 171 genotype 36 0.4428 173 | 172 goal 22 0.6184 174 | 173 goat 12 1.23 175 | 174 government 15 1.1415 176 | 175 hat 45 0.9611 177 | 176 health 108 0.652 178 | 177 health care 38 1.7577 179 | 178 health centre 52 1.0803 180 | 179 health district 22 1.3851 181 | 180 health facility 52 0.7874 182 | 181 health service 58 1.6105 183 | 182 health system 51 1.8437 184 | 183 health worker 32 0.8809 185 | 184 healthy control 13 0.691 186 | 185 helminth 31 1.7871 187 | 186 helminth infection 18 2.5369 188 | 187 high sensitivity 12 0.5359 189 | 188 higher risk 24 0.5486 190 | 189 histidine rich protein 15 1.6542 191 | 190 hiv 84 0.8617 192 | 191 hiv aids 15 1.2892 193 | 192 hiv infection 14 0.8844 194 | 193 hiv testing 12 2.4182 195 | 194 home 26 0.9487 196 | 195 hookworm 16 2.7181 197 | 196 hospital 125 0.5609 198 | 197 host 75 0.6726 199 | 198 hour 19 0.4183 200 | 199 house 39 0.7926 201 | 200 hrp 12 1.7831 202 | 201 human 94 0.5259 203 | 202 human african trypanosomiasis 61 0.9204 204 | 203 human cysticercosis 
17 2.2077 205 | 204 human host 11 0.7954 206 | 205 human immunodeficiency virus 15 0.5911 207 | 206 human infection 13 0.7083 208 | 207 identification 84 0.2787 209 | 208 ifn gamma 13 1.0255 210 | 209 igm 14 0.8616 211 | 210 illness 37 0.4749 212 | 211 immune response 33 0.7466 213 | 212 immunization 12 1.3073 214 | 213 immunogenicity 47 1.9979 215 | 214 immunosorbent assay 34 0.967 216 | 215 implementation 85 0.5954 217 | 216 important cause 11 0.7125 218 | 217 important role 13 0.5971 219 | 218 incidence rate 20 0.5325 220 | 219 india 84 0.7692 221 | 220 indian subcontinent 36 1.2129 222 | 221 indirect cost 11 1.4543 223 | 222 indoor residual spraying 20 1.8447 224 | 223 infant 36 0.9412 225 | 224 infected mother 13 1.7968 226 | 225 infection intensity 13 2.1954 227 | 226 infection rate 20 0.667 228 | 227 inflammation 14 0.6263 229 | 228 initiative 41 0.9908 230 | 229 insecticidal net 13 2.5042 231 | 230 insecticide 69 1.2326 232 | 231 insecticide resistance 16 2.0071 233 | 232 integration 22 1.2484 234 | 233 intensity 62 0.3743 235 | 234 intermittent preventive treatment 25 1.1001 236 | 235 interview 58 1.0813 237 | 236 iqr 18 0.8924 238 | 237 irs 15 1.8777 239 | 238 isolate 66 0.5381 240 | 239 issue 41 0.5952 241 | 240 ixodes ricinus 12 1.7308 242 | 241 kala azar 24 1.0179 243 | 242 kappa 10 1.0776 244 | 243 kinshasa 50 0.5328 245 | 244 l donovani 13 1.244 246 | 245 laboratory 48 0.2708 247 | 246 larvae 19 0.8288 248 | 247 larval stage 14 1.246 249 | 248 latin america 21 0.5882 250 | 249 leishmania 35 0.7479 251 | 250 leishmania donovani 34 1.0186 252 | 251 leishmania donovani infection 10 1.909 253 | 252 leishmaniasis 22 0.6317 254 | 253 lesion 40 0.5606 255 | 254 lesson 23 1.4635 256 | 255 life cycle 13 0.891 257 | 256 light 16 0.4287 258 | 257 limit 29 0.3421 259 | 258 line 37 0.3771 260 | 259 lineage 18 0.943 261 | 260 literature 37 0.3828 262 | 261 liver 16 0.5715 263 | 262 livestock 15 0.6055 264 | 263 llin 23 2.3642 265 | 264 llins 15 2.6416 266 | 265 locality 16 0.4866 267 | 266 logistic regression 23 0.3479 268 | 267 long lasting insecticidal net 15 1.7476 269 | 268 longitudinal study 15 0.4194 270 | 269 low birth weight 12 1.5831 271 | 270 low income country 25 0.708 272 | 271 low level 10 0.5379 273 | 272 m ulceran 18 1.3594 274 | 273 malaria burden 22 0.693 275 | 274 malaria control 28 0.6064 276 | 275 malaria diagnosis 20 0.749 277 | 276 malaria incidence 19 0.71 278 | 277 malaria infection 54 0.4422 279 | 278 malaria parasite 17 0.4658 280 | 279 malaria prevalence 21 0.5361 281 | 280 malaria rapid diagnostic test 32 1.1088 282 | 281 malaria transmission 62 0.5975 283 | 282 malaria transmission intensity 10 0.7573 284 | 283 malaria treatment 16 1.002 285 | 284 malaria vaccine 28 1.3842 286 | 285 malaria vector 26 1.7322 287 | 286 malawi 21 1.7172 288 | 287 mali 29 0.7531 289 | 288 malnutrition 18 0.7466 290 | 289 manufacturer 10 0.5714 291 | 290 map 27 0.5847 292 | 291 mapping 15 0.6918 293 | 292 mass drug administration 16 0.8529 294 | 293 mda 12 0.6935 295 | 294 mean 35 0.4014 296 | 295 mean age 13 0.7752 297 | 296 medecins sans frontieres 14 0.9379 298 | 297 median 28 0.635 299 | 298 median age 25 0.8884 300 | 299 medical record 11 1.0352 301 | 300 medium 20 0.4754 302 | 301 medline 11 1.0152 303 | 302 meta analysis 13 0.6555 304 | 303 methodology principal finding 15 0.5703 305 | 304 methodology principal findings 56 0.5121 306 | 305 mg kg 17 0.5852 307 | 306 microscopy 66 0.4977 308 | 307 middle income country 10 0.766 309 | 308 migrant 13 0.5622 310 | 309 miltefosine 
15 0.8129 311 | 310 ministry 18 1.167 312 | 311 mixed infection 17 0.8085 313 | 312 monoclonal antibody 17 1.2685 314 | 313 monotherapy 16 1.2203 315 | 314 morocco 14 0.6616 316 | 315 mosquito 51 1.0189 317 | 316 mother 48 0.8227 318 | 317 mouse 45 0.7895 319 | 318 mozambique 30 0.7535 320 | 319 msf 11 1.1882 321 | 320 mu l 29 0.9433 322 | 321 mutation 49 0.5988 323 | 322 mycobacterium ulceran 23 1.3135 324 | 323 mycobacterium ulcerans disease 12 1.6582 325 | 324 neglected disease 12 0.6051 326 | 325 neglected tropical disease 14 0.7484 327 | 326 nepal 62 1.0398 328 | 327 net 47 1.5551 329 | 328 neurocysticercosis 29 1.8901 330 | 329 new infection 12 1.8707 331 | 330 newborn 25 1.2266 332 | 331 niger 11 0.8743 333 | 332 none 43 0.2978 334 | 333 northern senegal 19 1.6002 335 | 334 nurse 15 1.3774 336 | 335 observational study 13 0.9762 337 | 336 onchocerciasis 16 0.6799 338 | 337 opportunity 34 0.8025 339 | 338 organization 52 0.5218 340 | 339 overall sensitivity 13 1.4718 341 | 340 overview 17 0.5264 342 | 341 p falciparum 38 1.0042 343 | 342 p falciparum malaria 16 0.9503 344 | 343 p vivax 28 1.1235 345 | 344 pair 15 0.425 346 | 345 panel 36 0.7332 347 | 346 parasitaemia 35 0.3778 348 | 347 parasite density 27 1.1346 349 | 348 participation 17 1.6009 350 | 349 pathogen 41 0.462 351 | 350 pcr 152 0.297 352 | 351 pcr rflp 19 0.5897 353 | 352 perception 35 0.9809 354 | 353 permanet 16 2.8893 355 | 354 person year 12 0.7189 356 | 355 perspective 37 0.8188 357 | 356 peruvian amazon 21 0.6081 358 | 357 phase 64 0.5029 359 | 358 phlebotomus argentipe 12 1.8954 360 | 359 pig 44 1.2969 361 | 360 pilot study 11 0.8983 362 | 361 pkdl 11 1.6392 363 | 362 placebo 11 1.2504 364 | 363 plasma 17 0.8023 365 | 364 plasmodium 37 1.0145 366 | 365 plasmodium falciparum 75 0.4626 367 | 366 plasmodium falciparum infection 12 0.801 368 | 367 plasmodium falciparum malaria 22 1.005 369 | 368 plasmodium malariae 17 1.5481 370 | 369 plasmodium ovale 12 2.1912 371 | 370 plasmodium species 15 1.2288 372 | 371 plasmodium vivax 27 0.877 373 | 372 policy 62 0.6073 374 | 373 polymerase chain reaction 53 0.3009 375 | 374 polymorphism 34 0.6202 376 | 375 porcine cysticercosis 20 2.209 377 | 376 positive predictive value 14 0.3762 378 | 377 post kala azar dermal leishmaniasis 13 1.6738 379 | 378 poverty 22 0.656 380 | 379 praziquantel 14 1.3558 381 | 380 pregnancy 60 0.9399 382 | 381 pregnant woman 60 0.934 383 | 382 present study 63 0.256 384 | 383 prevention 62 0.6867 385 | 384 principle 18 0.2863 386 | 385 programme 166 0.4473 387 | 386 progression 17 0.3755 388 | 387 proof 12 0.4392 389 | 388 prospective study 14 0.4042 390 | 389 protection 51 0.7418 391 | 390 protein 59 0.5995 392 | 391 provincial hospital 11 1.3132 393 | 392 public health 21 0.5638 394 | 393 pyrethroid 13 2.1532 395 | 394 qpcr 11 0.5278 396 | 395 qualitative study 19 1.341 397 | 396 quantification 19 0.4003 398 | 397 rapid diagnostic test 59 0.5282 399 | 398 rdt 47 0.6766 400 | 399 rdts 33 0.6877 401 | 400 real time pcr 28 0.6338 402 | 401 reference laboratory 10 0.804 403 | 402 reference method 11 1.9641 404 | 403 referral 15 1.253 405 | 404 regulation 11 0.5623 406 | 405 relapse 27 0.5991 407 | 406 reproducibility 21 0.9738 408 | 407 researcher 12 0.9098 409 | 408 residence 16 0.5061 410 | 409 resistance 159 0.3618 411 | 410 resource 65 0.6262 412 | 411 resource limited setting 11 0.9349 413 | 412 resource poor setting 10 2.0335 414 | 413 respondent 16 0.9775 415 | 414 retention 15 1.3042 416 | 415 rodent 18 0.9648 417 | 416 rts 12 1.1688 418 | 417 
rural burkina faso 18 1.814 419 | 418 rural community 19 0.7496 420 | 419 rural district 17 1.5223 421 | 420 rwanda 38 1.1134 422 | 421 s haematobium 24 1.4706 423 | 422 s mansoni 21 1.6004 424 | 423 s mansoni infection 11 1.4071 425 | 424 safety 98 1.2279 426 | 425 sample 202 0.2239 427 | 426 sand fly 14 1.2369 428 | 427 schistosoma haematobium 15 1.3448 429 | 428 schistosoma mansoni 29 1.3119 430 | 429 schistosoma mansoni infection 18 1.6762 431 | 430 schistosomiasis 45 0.915 432 | 431 school 17 1.4877 433 | 432 schoolchild 26 1.0815 434 | 433 semi 13 0.6222 435 | 434 senegal 27 0.7378 436 | 435 sensitivity 155 0.434 437 | 436 september 30 0.5516 438 | 437 sequence 47 0.5255 439 | 438 sequencing 27 0.4027 440 | 439 sera 30 0.7373 441 | 440 serious adverse event 16 1.982 442 | 441 serology 17 0.6214 443 | 442 seropositivity 15 1.0083 444 | 443 seroprevalence 23 1.0039 445 | 444 serum 38 0.7358 446 | 445 serum sample 24 0.9064 447 | 446 service 57 1.2804 448 | 447 severe malaria 19 0.6225 449 | 448 sheep 11 1.534 450 | 449 short report 12 0.7082 451 | 450 sickness 40 1.0169 452 | 451 sickness patient 12 1.7034 453 | 452 sierra leone 15 0.8664 454 | 453 significant association 15 0.4252 455 | 454 significant correlation 10 0.8619 456 | 455 significant reduction 11 0.9272 457 | 456 single dose 17 1.3867 458 | 457 skin 17 1.0023 459 | 458 socio economic status 12 0.8796 460 | 459 soil 44 1.9191 461 | 460 south africa 28 0.6455 462 | 461 south america 13 0.3408 463 | 462 southeast asia 19 0.9771 464 | 463 southern benin 10 1.3566 465 | 464 spatial distribution 13 0.7991 466 | 465 species 170 0.4428 467 | 466 species identification 21 0.6765 468 | 467 specific antibody 17 1.057 469 | 468 specificity 125 0.4021 470 | 469 specimen 43 0.3643 471 | 470 staff 32 0.7186 472 | 471 start 16 0.5357 473 | 472 sth 22 2.3533 474 | 473 sth infection 11 2.3304 475 | 474 stool 35 0.9815 476 | 475 stool sample 31 1.0999 477 | 476 strain 77 0.4459 478 | 477 subset 16 0.5158 479 | 478 sudan 26 0.5484 480 | 479 sulfadoxine pyrimethamine 25 1.5292 481 | 480 sulphadoxine pyrimethamine 26 1.5423 482 | 481 supervision 14 1.5976 483 | 482 supply 21 1.2577 484 | 483 support 38 1.1209 485 | 484 susceptibility 49 0.4468 486 | 485 systematic review 31 0.5215 487 | 486 t b 16 1.4086 488 | 487 t congolense 11 1.7586 489 | 488 t cruzi 16 1.7261 490 | 489 t solium 25 1.7418 491 | 490 tablet 18 1.1446 492 | 491 taenia solium 27 1.7986 493 | 492 taenia solium cysticercosis 23 1.9692 494 | 493 taeniasis 15 1.8607 495 | 494 technique 87 0.2832 496 | 495 test 211 0.2321 497 | 496 test result 27 0.6688 498 | 497 tete 11 1.6207 499 | 498 tick 21 1.2533 500 | 499 time point 12 0.4318 501 | 500 tissue 25 0.5192 502 | 501 titre 20 0.7125 503 | 502 tolerability 18 1.8715 504 | 503 total cost 14 1.1251 505 | 504 training 32 0.9558 506 | 505 transmission dynamic 16 0.7011 507 | 506 trap 18 0.8482 508 | 507 traveler 13 0.9764 509 | 508 treatment failure 38 0.6839 510 | 509 treatment outcome 33 0.4888 511 | 510 trial 142 0.4758 512 | 511 trichuris trichiura 18 2.8185 513 | 512 tropical disease 11 0.6173 514 | 513 tropical medicine 16 0.5709 515 | 514 trypanosoma 50 1.0323 516 | 515 trypanosoma brucei 13 1.5148 517 | 516 trypanosoma brucei gambiense 15 1.4779 518 | 517 trypanosoma congolense 11 1.5734 519 | 518 trypanosoma cruzi 25 1.8604 520 | 519 trypanosome 33 1.1162 521 | 520 trypanosomiasis 21 1.4596 522 | 521 trypanosomosis 17 1.2905 523 | 522 tsetse 14 1.6541 524 | 523 tsetse fly 17 1.971 525 | 524 uncomplicated falciparum malaria 17 
2.1791 526 | 525 uncomplicated malaria 30 1.7924 527 | 526 uncomplicated plasmodium falciparum malaria 19 2.4514 528 | 527 uptake 29 0.7425 529 | 528 urban area 24 0.4335 530 | 529 urine 24 0.8298 531 | 530 urine sample 13 0.9968 532 | 531 vaccination 38 1.0759 533 | 532 vaccine 70 1.0634 534 | 533 validation 17 0.3754 535 | 534 validity 16 0.366 536 | 535 vector 93 0.6492 537 | 536 vector control 34 1.0335 538 | 537 venezuela 10 1.4684 539 | 538 vietnam 61 0.579 540 | 539 virus 48 0.4782 541 | 540 visceral leishmaniasis 129 0.8124 542 | 541 visit 32 0.9114 543 | 542 vitro 28 0.5322 544 | 543 vivo 10 0.6591 545 | 544 vl case 16 1.3056 546 | 545 vl patient 18 0.8485 547 | 546 vl treatment 12 1.2288 548 | 547 volunteer 22 0.6536 549 | 548 weight 33 0.5841 550 | 549 western kenya 11 0.8439 551 | 550 whole blood 12 1.3525 552 | 551 wide range 13 0.4298 553 | 552 woman 80 0.8942 554 | 553 year period 15 0.7455 555 | 554 zoonotic disease 13 0.8331 556 | -------------------------------------------------------------------------------- /02-advanced.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In the previous sessions, you learned how to construct scientometric networks in Python. It was clear that this can be quite challenging. VOSviewer takes care of a lot of the necessary work in creating scientometric networks. You can hence use VOSviewer to create networks, which you could then export and analyse further in Python. We will here take this approach." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## VOSviewer" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "You have previously constructed scientometric networks using VOSviewer. You can import the resulting network for further analysis in `igraph`. In order to import the file in `igraph` you need to have saved both the `map` file and the `network` file in VOSviewer. See the manual of VOSviewer for more explanation. As in the previous Python notebook, we have prepared some files for you, in this case the author collaboration network from the Web of Science files that we analysed previously." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "We first import the necessary packages. You will presumably recognize these still from the previous Python notebook." 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "import pandas as pd\n", 45 | "import igraph as ig" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "Now let us read the map and network file from VOSviewer." 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "
\n", 60 | " Read the file data-files/vosviewer/vosviewer_map.txt using tabs ('\\t') as a field separator, and call the resulting variable map_df.\n", 61 | "
" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "The network file from VOSviewer has no header, so we set it manually" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "network_df = pd.read_csv('data-files/vosviewer/vosviewer_network.txt', sep='\\t', header=None,\n", 85 | " names=['idA', 'idB', 'weight'])" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "Now we have loaded the data, so we can simply construct a network as before." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "G_vosviewer = ig.Graph.DictList(\n", 102 | " vertices=map_df.to_dict('records'),\n", 103 | " edges=network_df.to_dict('records'),\n", 104 | " vertex_name_attr='id',\n", 105 | " edge_foreign_keys=('idA', 'idB'),\n", 106 | " directed=False\n", 107 | " )" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "The layout and clustering is also stored by VOSviewer, and we can use that to display the same visualization in `igraph`." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "layout = ig.Layout(coords=zip(*[G_vosviewer.vs['x'], G_vosviewer.vs['y']]))\n", 124 | "clustering = ig.VertexClustering.FromAttribute(G_vosviewer, 'cluster')\n", 125 | "\n", 126 | "ig.plot(clustering, layout=layout, vertex_size=4, vertex_frame_width=0, vertex_label=None)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "## Clustering" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "A common phenomenon in many networks is the presence of group structure, where nodes within the same group are densely connected. Such a structure is sometimes called a *modular* structure, and a frequently used measure of group structure is known as *modularity*. You have already encountered this functionality briefly in VOSviewer, which provides clusters. Here we will explore this a bit more in-depth." 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "First, we will import a package called `leidenalg` which is the *Leiden algorithm*, which we will use for clustering. It is built on top of `igraph` so that it easily integrates with all the exisiting methods of `igraph`." 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "import leidenalg" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "Now let us find clusters in the collaboration network from VOSviewer, using the weight of the edges. Because the algorithm is stochastic, it may yield somewhat different results every time you run it. To prevent that from happening, and to always get the same result, we will set the random seed to 0. The result is a `VertexClustering`, which we already briefly encountered when using the clustering results from VOSviewer." 
164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "We will first find clusters using *modularity*." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "optimiser = leidenalg.Optimiser()\n", 180 | "optimiser.set_rng_seed(0)\n", 181 | "clusters = leidenalg.ModularityVertexPartition(G_vosviewer, weights='weight')\n", 182 | "optimiser.optimise_partition(clusters)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "The length of the `clusters` variable indicates the number of clusters." 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "len(clusters)" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "When accessing the `clusters` variable as a list, each element corresponds to the set of nodes in that cluster." 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "<div class=\"alert alert-block alert-info\">
\n", 213 | " What are the nodes in cluster 30?\n", 214 | "
" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "Hence, node `548`, node `1052`, etc... belong to cluster `30`. Another way to look at the clusters is by looking at the `membership` of `clusters`." 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "
\n", 236 | " What is the membership of the first 10 nodes?\n", 237 | "
" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "Hence, node `0` belongs to cluster `7`, node `1` belongs to cluster `9`, node `2` belongs to cluster `4`, et cetera." 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "Let us take a closer look at the largest cluster." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "H = clusters.giant()\n", 268 | "print(H.summary())" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "We could again detect clusters using modularity in the largest cluster." 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "optimiser.set_rng_seed(0)\n", 285 | "subclusters = leidenalg.ModularityVertexPartition(H, weights='weight')\n", 286 | "optimiser.optimise_partition(subclusters)\n", 287 | "ig.plot(subclusters, vertex_size=5, vertex_label=None)" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "In general, modularity will continue to find subclusters in this way. An alternative approach, called CPM, does not suffer from that problem. \n", 295 | "\n", 296 | "Let us detect clusters using CPM. We do have to specify a parameter, called the `resolution_parameter`. As its name suggests, it specifies the resolution of the clusters we would like to find. At a higher resolution we will tend to find smaller clusters, while at a lower resolution we find larger clusters. Let us use the resolution parameter 0.01." 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "optimiser.set_rng_seed(0)\n", 306 | "clusters = leidenalg.CPMVertexPartition(G_vosviewer,\n", 307 | " weights='weight',\n", 308 | " resolution_parameter=0.1)\n", 309 | "optimiser.optimise_partition(clusters)\n", 310 | "clusters.giant().vcount()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "
\n", 318 | "Detect subclusters in the largest cluster using CPM, using the same resolution_parameter. How many subclusters do you find? How does that compare to modularity?\n", 319 | "
" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "
\n", 334 | "Try to find more subclusters by specifying a higher resolution_parameter.\n", 335 | "
" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "Modularity adapts itself to the network. In a sense that is convenient, because you then do not have to specify any parameters. On the other hand, it makes the definition of what a \"cluster\" is less clear.\n", 350 | "\n", 351 | "CPM does not adapt itself to the network, and maintains the same defintion across different networks. That is convenient, because it brings more clarity to what we mean by a \"cluster\". Whenever you try to find subclusters using the same `resolution_parameter`, CPM should not find any subclusters. In practice, it may happen that CPM still finds some subclusters, in which case the original clusters were actually not the best possible. The Leiden algorithm can be run for multiple iterations, and with each iteration, the chances are smaller that CPM would find such subclusters. Modularity will always find subclusters, independent of the number of iterations." 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "
\n", 359 | " Try to optimise the partition further. Note that the function optimise_partition returns how much further it managed to improve the function, so that if it returns 0.0, it means it couldn't find any further improvement. Execute the cell repeatedly. Does it return 0.0 after some time?\n", 360 | "
" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "Let us compare the clusters that we detected in Python with the clustering results from VOSviewer.\n", 375 | "\n", 376 | "We can summarize the overall similarity to the partition based on the disciplines using the Normalised Mutual Information (NMI). The NMI varies between 0 and 1 and equals 1 if both are identical." 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "clusters.compare_to(clustering, method='nmi')" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "There are some differences between the clustering from VOSviewer and the clusters we detected in Python. This will of course highly depend on what resolution parameter we have used for both results. One other important difference is that VOSviewer will by default use *normalized* weights. By default, it will divide the weight of a link by the expected weight, assuming that the total link weight of each node would remain the same, which is sometimes referred to as the *association strength*. We also perform this normalization here." 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "G_vosviewer.es['weight_normalized'] = [\n", 402 | " e['weight']/( G_vosviewer.vs[e.source]['weight']*G_vosviewer.vs[e.target]['weight'] / (2*sum(G_vosviewer.es['weight'])) ) \n", 403 | " for e in G_vosviewer.es]" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "By default VOSviewer uses the default resolution of `1` for these normalized weights. If we now detect clusters using these weights, you will see that the result are more closely aligned to the VOSviewer results." 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "clusters = leidenalg.find_partition(G_vosviewer, leidenalg.CPMVertexPartition, \n", 420 | " weights='weight_normalized', resolution_parameter=1,\n", 421 | " n_iterations=10)\n", 422 | "\n", 423 | "clusters.compare_to(clustering, method='nmi')" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "Finally, the Leiden algorithm is also directly implemented in `igraph` itself nowadays. It is somewhat less elaborate than the `leidenalg` package, but it is also substantially faster. If you are analysing very large networks, it might be better to use the `igraph` Leiden algorithm. Using it is straightforward." 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [ 439 | "clusters = G_vosviewer.community_leiden(objective_function='CPM',weights='weight_normalized', \n", 440 | " resolution_parameter=1.0, n_iterations=10)\n", 441 | "\n", 442 | "clusters.compare_to(clustering, method='nmi')" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "Now let us explore cluster detection a bit further." 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "
\n", 457 | " Vary the resolution_parameter when detecting clusters using the CPM method. What resolution_parameter seems reasonable to you, and why?\n", 458 | "
" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "
\n", 473 | " Try to find a resolution_parameter such that the network separates in two large clusters (and some remaining small clusters). What is the cause of these two large clusters? (Hint: examine the author names)\n", 474 | "
" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "
\n", 489 | "Compare the co-authorship network that we created previously in Python to the network created in VOSviewer. What are the differences?\n", 490 | "
" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "# Document-term clustering" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "We will now use the same type of clustering technique that we used previously in a slightly different way. Instead of clustering a network, we will cluster a specific type of network, namely a bipartite network. This requires a slightly different (and more complicated) approach. More specifically, we will cluster a document-term network, where documents are linked to terms if those terms appear in a document.\n", 512 | "\n", 513 | "We leave the task of extracting terms to VOSviewer, and simply import the resulting document-term network in Python. At the end of the notebook, you will find instructions how to extract the document-term network from VOSviewer yourself.\n", 514 | "\n", 515 | "We read two files: (1) the `terms.txt` file, which simply contains the terms and their `id`; and (2) the `doc-term.txt` file, which contains which term occurs in which document. The `document id` refers to the line number of the WoS files that were read by VOSviewer. We will encounter this later." 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "terms_df = pd.read_csv('data-files/vosviewer/terms.txt', sep='\\t', index_col='id')\n", 525 | "doc_terms_df = pd.read_csv('data-files/vosviewer/doc-term.txt', sep='\\t')" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "In this file, both the documents and the terms are using the same numbers, so that `igraph` cannot distinguish them (e.g. there is both a document `1` and a term `1`). We therefore create separate ids for both the documents and the terms." 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "metadata": {}, 539 | "outputs": [], 540 | "source": [ 541 | "doc_terms_df['document id'] = doc_terms_df['document id'].map(lambda x: str(x) + '-doc');\n", 542 | "doc_terms_df['term id'] = doc_terms_df['term id'].map(lambda x: str(x) + '-term');" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "We can now create the network." 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "G_doc_term = ig.Graph.TupleList(\n", 559 | " edges=doc_terms_df.values,\n", 560 | " vertex_name_attr='id',\n", 561 | " directed=False\n", 562 | " )" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "This is a bipartite network, and we create a specific vertex attribute to indicate what the type is of the node: either a `doc` or a `term`." 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": null, 575 | "metadata": {}, 576 | "outputs": [], 577 | "source": [ 578 | "G_doc_term.vs['type'] = ['doc' if 'doc' in v['id'] else 'term' for v in G_doc_term.vs]" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "Similar to the co-authorship network, VOSviewer typically normalizes the weights in a network by using the association strength, and we will also use that here." 
586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": null, 591 | "metadata": {}, 592 | "outputs": [], 593 | "source": [ 594 | "G_doc_term.es['weight'] = [2.0*G_doc_term.ecount()/(G_doc_term.vs[e.source].degree()*G_doc_term.vs[e.target].degree()) \n", 595 | " for e in G_doc_term.es];" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "We now employ a small trick in the `leidenalg` package in order to do clustering in a bipartite network. We will not explain the full details here, please see the [documentation](https://leidenalg.readthedocs.io/en/latest/multiplex.html#bipartite) for a brief explanation of this approach. Please note that this approach is *not* possible using the internal `igraph` Leiden algorithm." 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": null, 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "partition, partition_docs, partition_terms = leidenalg.CPMVertexPartition.Bipartite(\n", 612 | " G_doc_term, types='type', weights='weight', resolution_parameter_01=1)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "We are now ready to detect clusters, but we are going to use all three partitions we created. We do so by using the function `optimise_partition_multiplex` instead of the `optimise_partition` function that we used previously. We have to pass a list of partitions to that function. For the trick to work, we also need to pass the argument `layer_weights=[1,-1,-1]`, which assumes that the `partition` is the first element of the list that we pass." 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": null, 625 | "metadata": {}, 626 | "outputs": [], 627 | "source": [ 628 | "optimiser = leidenalg.Optimiser()\n", 629 | "optimiser.set_rng_seed(0)\n", 630 | "optimiser.optimise_partition_multiplex(\n", 631 | " [partition, partition_docs, partition_terms], \n", 632 | " layer_weights=[1,-1,-1], n_iterations=100)" 633 | ] 634 | }, 635 | { 636 | "cell_type": "markdown", 637 | "metadata": {}, 638 | "source": [ 639 | "Now `partition` contains the clustering results (actually, `partition_docs` and `partition_terms` contain the identical clustering results). We extract the cluster membership of each node, and make it a new node attribute." 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": null, 645 | "metadata": {}, 646 | "outputs": [], 647 | "source": [ 648 | "G_doc_term.vs['cluster'] = partition.membership\n", 649 | "G_doc_term.vs['degree'] = G_doc_term.degree();" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "We will now create a so-called *projection* of the bipartite graph, which actually simply refers to the creation of a co-occurrence network." 
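To make the idea of a projection concrete, here is a minimal sketch on a hypothetical toy document-term network (the document and term names are made up). In the projection, two terms are connected if they occur in the same document, and the multiplicity counts in how many documents they co-occur.

```python
import igraph as ig

# Toy bipartite network: two documents and three terms (hypothetical names)
g = ig.Graph.TupleList([('doc1', 'malaria'), ('doc1', 'vaccine'),
                        ('doc2', 'malaria'), ('doc2', 'treatment')],
                       directed=False)
g.vs['type_int'] = [0 if v['name'].startswith('doc') else 1 for v in g.vs]

# Keep only the term side (which=1); 'multiplicity' counts shared documents
toy_terms = g.bipartite_projection(types='type_int', which=1, multiplicity=True)
print(toy_terms.vs['name'])          # ['malaria', 'vaccine', 'treatment']
print(toy_terms.es['multiplicity'])  # [1, 1]: each connected pair of terms shares one document
```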
657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": null, 662 | "metadata": {}, 663 | "outputs": [], 664 | "source": [ 665 | "G_doc_term.vs['type_int'] = [1 if v['type'] == 'term' else 0 for v in G_doc_term.vs];\n", 666 | "G_terms = G_doc_term.bipartite_projection(types='type_int', which=1);\n", 667 | "G_terms.simplify(combine_edges='sum');\n", 668 | "\n", 669 | "G_terms.vs['id'] = [int(v['id'][:-5]) for v in G_terms.vs];\n", 670 | "G_terms.vs['term'] = [terms_df.loc[v['id'],'term'] for v in G_terms.vs];" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "Now `G_terms` contains only terms and the co-occurrence between them. We will export this network to a file format so that we can read it back into VOSviewer. First, let us create the output directory (if necessary)." 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": null, 683 | "metadata": {}, 684 | "outputs": [], 685 | "source": [ 686 | "import os\n", 687 | "output_dir = 'results/'\n", 688 | "if not os.path.exists(output_dir):\n", 689 | " os.makedirs(output_dir)" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "metadata": {}, 695 | "source": [ 696 | "Now we export the network `G_terms` in file format which is understandable to VOSviewer." 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": {}, 703 | "outputs": [], 704 | "source": [ 705 | "nodes_df = pd.DataFrame.from_dict({attr: G_terms.vs[attr] for attr in G_terms.vs.attributes()});\n", 706 | "nodes_df['label'] = nodes_df['term'];\n", 707 | "nodes_df['cluster'] += 1;\n", 708 | "nodes_df['weight'] = nodes_df['degree'];\n", 709 | "nodes_df = nodes_df.sort_values('id')\n", 710 | "nodes_df[['id', 'label', 'cluster', 'weight']].to_csv(output_dir + 'map_vosviewer.txt', sep='\\t', index=False);\n", 711 | "\n", 712 | "edge_df = pd.DataFrame([(G_terms.vs[e.source]['id'], G_terms.vs[e.target]['id'], e['weight']) for e in G_terms.es],\n", 713 | " columns=['source', 'target', 'weight']);\n", 714 | "edge_df = edge_df.sort_values(['source', 'target']);\n", 715 | "edge_df.to_csv(output_dir + 'network_vosviewer.txt', sep='\\t', index=False, header=False);" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "The great benefit of doing the clustering in Python is that we now also have a clustering of the publications. This is something that is not possible in VOSviewer." 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "metadata": {}, 728 | "source": [ 729 | "Let us first load the actual publication files which were used by VOSviewer (we have already done this in the previous notebook). As said, the `document id` refers to the line number of the WoS files that were read by VOSviewer, starting from `1`. We therefore also create a `document id` that is the same." 
730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [ 738 | "import glob\n", 739 | "import csv\n", 740 | "files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n", 741 | "publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n", 742 | " usecols=range(68), index_col='UT') for f in files)\n", 743 | "publications_df['document id'] = range(1,publications_df.shape[0]+1)" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "Now let us create a dataframe from `G_doc_term` with all the information from the documents." 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": {}, 757 | "outputs": [], 758 | "source": [ 759 | "nodes_df = pd.DataFrame.from_dict({attr: G_doc_term.vs[attr] for attr in G_doc_term.vs.attributes()});\n", 760 | "nodes_df = nodes_df[nodes_df['type'] == 'doc'];" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": {}, 766 | "source": [ 767 | "Now we need back the original integer `document id`, instead of the identifiers we created `doc-1`, `doc-2`, etc... We can then use those `document id` to merge back the results with the original information from the publications." 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": null, 773 | "metadata": {}, 774 | "outputs": [], 775 | "source": [ 776 | "nodes_df['document id'] = nodes_df['id'].str[:-4].astype(int);\n", 777 | "publications_df = pd.merge(nodes_df[['document id', 'cluster']], publications_df, \n", 778 | " left_on='document id', right_on='document id')" 779 | ] 780 | }, 781 | { 782 | "cell_type": "markdown", 783 | "metadata": {}, 784 | "source": [ 785 | "Finally, for further inspection, we may want to export our results to a `.csv` file." 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "metadata": {}, 792 | "outputs": [], 793 | "source": [ 794 | "publications_df[['AU', 'PY', 'TI', 'SO', 'cluster']].to_csv(output_dir + 'publications_clustering.csv', \n", 795 | " index=False)" 796 | ] 797 | }, 798 | { 799 | "cell_type": "markdown", 800 | "metadata": {}, 801 | "source": [ 802 | "# Own analysis" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "
\n", 810 | "Load your own data in VOSviewer and create a co-citation network of journals.\n", 811 | "
" 812 | ] 813 | }, 814 | { 815 | "cell_type": "code", 816 | "execution_count": null, 817 | "metadata": {}, 818 | "outputs": [], 819 | "source": [] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "
\n", 826 | "Detect comunities in the journal co-citation network. What do you think the different clusters mean?\n", 827 | "
" 828 | ] 829 | }, 830 | { 831 | "cell_type": "code", 832 | "execution_count": null, 833 | "metadata": {}, 834 | "outputs": [], 835 | "source": [] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": {}, 840 | "source": [ 841 | "
\n", 842 | " Load your own data in VOSviewer and create a term-map. Please take the following steps to create the term-map and extract the terms.csv file and the doc-terms.csv file.\n", 843 | " \n", 844 | "
    \n", 845 | "
  1. Open VOSviewer and press the button \"Create...\".
  2. \n", 846 | "
  3. Choose \"Create a map based on text data\" and press \"Next\".
  4. \n", 847 | "
  5. Choose \"Read data from bibliographic database files\" and press \"Next\".
  6. \n", 848 | "
  7. Choose the \"Web of Science\" tab and select the files you have downloaded yourself and press \"Next\".
  8. \n", 849 | "
  9. Choose \"Title and abstract fields\" (the default) and press \"Next\".
  10. \n", 850 | "
  11. Choose \"Binary counting\" (the default) and press \"Next\".
  12. \n", 851 | "
  13. Leave the default threshold of 10 and press \"Next\".
  14. \n", 852 | "
  15. Leave the default number of terms to be selected and press \"Next\".
  16. \n", 853 | "
\n", 854 | "\n", 855 | "VOSviewer will now calculate the \"relevance\" scores. When it is done, you will be shown a list of terms together with the number of their occurrences and the relevance scores. Please follow the following remaining steps.\n", 856 | "\n", 857 | "
    \n", 858 | "
  1. On the list of terms, click-right, and choose \"Export selected terms...\". Choose an appropriate file name (terms.txt) and make sure you choose an appropriate directory and then press \"Export\".
  2. \n", 859 | "
  3. On the list of terms, click-right, and choose \"Export document-term relations...\". Choose an appropriate file name (doc-terms.txt) and make sure you choose an appropriate directory and then press \"Export\".
  4. \n", 860 | "
\n", 861 | "
" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": {}, 867 | "source": [ 868 | "
\n", 869 | " Load the terms.csv file and the doc-terms.csv files. Detect the clusters in this bipartite network, as explained above.\n", 870 | "
" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": null, 876 | "metadata": {}, 877 | "outputs": [], 878 | "source": [] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "
\n", 885 | " Compare the results to the clusters you can detect immediately in VOSviewer itself. Are they similar or not?\n", 886 | "
" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": null, 892 | "metadata": {}, 893 | "outputs": [], 894 | "source": [] 895 | }, 896 | { 897 | "cell_type": "markdown", 898 | "metadata": {}, 899 | "source": [ 900 | "
\n", 901 | " Try to identify the main topic for the largest few clusters on the basis of the terms in the term map. Does that match well with the publications in the same cluster? Do you see any discrepancies?\n", 902 | "
" 903 | ] 904 | }, 905 | { 906 | "cell_type": "code", 907 | "execution_count": null, 908 | "metadata": {}, 909 | "outputs": [], 910 | "source": [] 911 | } 912 | ], 913 | "metadata": { 914 | "kernelspec": { 915 | "display_name": "Python 3", 916 | "language": "python", 917 | "name": "python3" 918 | }, 919 | "language_info": { 920 | "codemirror_mode": { 921 | "name": "ipython", 922 | "version": 3 923 | }, 924 | "file_extension": ".py", 925 | "mimetype": "text/x-python", 926 | "name": "python", 927 | "nbconvert_exporter": "python", 928 | "pygments_lexer": "ipython3", 929 | "version": "3.8.3" 930 | } 931 | }, 932 | "nbformat": 4, 933 | "nbformat_minor": 2 934 | } 935 | -------------------------------------------------------------------------------- /01-basics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this lab exercise, you will learn how to perform scientometric network analysis in Python. We will start with practicalities on some basic data handling and import. We then move on to creating a network and cover some basic analysis. In the next session, we will be using more advanced techniques." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "
\n", 22 | "This Python notebook is intended to be used as an exercise. We have prepared it for you to include many details, but at some parts we will ask you to fill in some of the blanks. Exercises where you are asked to do something, or to think about something, will be indicated like this. If you need to execute and write your own code, we provide empty space below to do so.\n", 23 | "
" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "
\n", 31 | "If you need any help with anything, please don't hesitate to ask your teachers. \n", 32 | "
" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "# Data handling" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "Python is a general purpose programming language and it can be used to handle data in general. In this notebook we will specifically deal with scientometric datasets, but you can also use it for other purposes.\n", 47 | "\n", 48 | "We will start by handling some data from a scientometric data source. There are many different possible data sources, and we discussed some of them earlier this week. In this notebook we will focus on data downloaded from Web of Science. We have already downloaded some data for you to demonstrate Python. At the end of the exercise you will be asked to load your own data. \n", 49 | "\n", 50 | "The data that we provided is a selection of publications from authors from Belgium from Tropical Medicine from 2000-2017." 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "
\n", 58 | " Note: You cannot load your own data when you run this notebook online using Binder.\n", 59 | "
" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "We start by loading the data. In order to read in the data, we first need to make sure that Python is able to read it. A very versatile *package* for handling data in Python is called `pandas`. For those of you familiar with `R`, it is similar to the `data.frame` in `R`." 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "We *import* this package as follows, and we call the `pandas` package `pd`, for easy reference. We also need the `csv` package to indicate some options to the `pandas` package." 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "
\n", 81 | " In order to execute the code you have to press Ctrl-Enter while selecting the code cell below. Alternatively, you can press the \"Play\" button at the top of the screen. This also moves to the next cell at the same time. Using Shift-Enter instead of Ctrl-Enter will also execute the code and move to the next cell at the same time.\n", 82 | "
" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "import pandas as pd\n", 92 | "import csv" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "
\n", 100 | " If you have executed that code cell correctly, it should now be numbered 1. While the code in a cell is being executed it is marked by an asterisk *. Each cell of executed code will be numbered in the order in which you execute it. If you execute it again, it will be numbered 2, et cetera.\n", 101 | "
" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "We are now ready to read in the data that you just downloaded. We have named the `pandas` package `pd`, which will save us some typing." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "publications_df = pd.read_csv('data-files/wos/tab-delimited/savedrecs_0001_0500.txt', \n", 118 | " sep='\\t', index_col='UT',\n", 119 | " quoting=csv.QUOTE_NONE, usecols=range(68))" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "We called the *function* `read_csv` of the `pandas` package. We provide it with several *arguments*. \n", 127 | "\n", 128 | "1. The location of the file we want to read.\n", 129 | "\n", 130 | "2. The second argument is a *named argument*, we provide both the name of the argument (`sep`) and its value (`'\\t'`). This indicates the *sep*arator between different fields. In this case it is a tab-delimited file, so the fields are separated by tabs, which is indicated by `'\\t'`.\n", 131 | "\n", 132 | "3. The third argument is again a named argument. We indicate that the `UT` field should be the index. This is the unique identifier that WoS uses.\n", 133 | "\n", 134 | "The two subqeuent arguments are needed to correctly handle some peculiarities of WoS files." 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "
\n", 142 | "We downloaded some example files for you, which are located in the folder data_files/wos. At the end of this notebook, you will be asked to download your own data. If you want to load that data instead, use the path to that data.\n", 143 | "
" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "
\n", 151 | " Note: Windows usually uses backslashes \\ to separate directories, in Python you can also use the forward slash /, which is usually more convenient for a number of reasons.\n", 152 | "
" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "The `pandas` package took care of reading the file, and has now stored it in the variable called `publications_df`. You can take a closer look at `publications_df` to see the data that we just read." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "publications_df" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "You will see that the data has quite cryptic column headers. Each line contains information about a single publication, and contains various details, such as the title (`TI`), abstract (`AB`), authors (`AU`), journal title (`SO`) and cited references (`CR`). Unfortunately, the documentation of Web of Science is relatively limited, but some explanation can be found here. You can retrieve this information in various ways from the pandas dataframe `publications_df`. For example, you can list the first five titles as follows:" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "publications_df.TI[:5]" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "Here, `[:5]` indicates that you want the first elements (starting at 0) until (but excluding) 5, so item 0, 1, 2, 3 and 4. This is called a *slice* of the data. You can also look at authors for rows 5 until 10 as follows:" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "publications_df.AU[5:10]" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "In order to get the last few elements, you can use negative indices. The last element is indicated by `-1`, the penultimate element is indicated by `-2`, and so on. You can get the journals for the last five sources as follows:" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "publications_df.SO[-5:]" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Alternatively, there are various ways to index the dataframe. For example, to get the title and abstract for the first five elements you can do the following." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "publications_df[0:5][['TI', 'AB']]" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "The notation `['TI', 'AB']` creates a *list* of elements in Python. We now used it to get multiple columns from the dataframe. " 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "The following does exactly the same:" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "publications_df[['TI', 'AB']][0:5]" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "The `pandas` package automatically determines whether you try to get columns or rows. Slices are always assumed to refer to rows." 
263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "
\n", 270 | " Show the title (TI), abstract (AB), journal (SO) and publication year (PY) for rows 200-210.\n", 271 | "
" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "
\n", 279 | "To start typing in the cell below, select the cell using the mouse, or select it using the arrows on the keyboard and press Enter\n", 280 | "
" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "You can also access a particular `UT` directly by using the `.loc` indexer." 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "publications_df.loc['WOS:000419235100004', ['TI', 'AU', 'SO', 'PY']]" 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "## Reading multiple files" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "Until now we have only loaded one file. But we have of course downloaded more files, and we need to load all of them. We can list all files in a directory using the package `glob`. We first import the package." 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "import glob" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "Now, let us get a list of all files in the directory `data_files/wos/tab-delimited/`." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n", 343 | "files" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "We asked `glob` for a list of files that end with `txt` (`*.txt`) in the directory `data-files/wos/tab-delimited`. We sorted the list to ensure that we read the files in the correct order. We can now simply pass this list of files to read multiple WoS files." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [ 359 | "publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n", 360 | " usecols=range(68), index_col='UT') for f in files)\n", 361 | "publications_df = publications_df.sort_index()" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "
\n", 369 | " Now checkout the new publications_df data frame, and see how many rows it has.\n", 370 | "
" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "## Data summarisation" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "The `pandas` package provides various ways to summarise the data and get a useful overview of the data. For example, you can group by a certain column, and count or sum things. For example, we can count the number of articles in each journal that is included in this dataset:" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "grouped_by_journal = publications_df.groupby('SO')\n", 401 | "grouped_by_journal.size().sort_values(ascending=False)[:10]" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "We could also ask the mean publication year of publications in those journals" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "grouped_by_journal['PY'].mean()" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "
\n", 425 | " Group by the year (PY) and count the number of paper from each year.\n", 426 | "
" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "metadata": {}, 432 | "source": [ 433 | "
\n", 434 | "Now it is time to introduce you a little trick: you can get a list of all functions and argument of some variable by simply pressing Tab. For example, you can type publications_df., including the . and then press Tab (make sure the cursor is located after the .). If you then start typing the name of the function you are looking for and press Tab again, Python will automatically finish it as much as possible. This is something general: whenever you press Tab Python will try to autocomplete whatever you are typing.\n", 435 | "\n", 436 | "One other trick: if you have selected a function and press Shift-Tab you will get documentation of what this function does. You can press the + to find out more.\n", 437 | "
" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "## Network generation" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "Ultimately, we would like to use this data to generation scientometric networks. This is not a trivial task, and we will now show how to construct a co-authorship network and a journal level bibliographic coupling network.\n", 459 | "\n", 460 | "We first load the network analysis package that we will use in the notebook, `igraph`." 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "
\n", 468 | " Import the pacakge igraph and call it ig.\n", 469 | "
" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "### Co-authorship" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "We first build a co-authorship network. We will do this one publication at the time. All combinations of authors that are involved in a publication are co-authors. Let us look at the authors for publication 0." 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [ 499 | "publications_df['AU'][0]" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "Note that the authors are all listed and separated with a semicolon (`;`). In computer terms, it is now a single *string*. We will split this string of all authors into a list of strings where each string then represents a single author." 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": null, 512 | "metadata": {}, 513 | "outputs": [], 514 | "source": [ 515 | "publications_df['AU_split'] = publications_df['AU'].fillna('').str.split('; ')" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "authors = publications_df['AU_split'][0]\n", 525 | "authors" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "In order to create all possible combinations, we can use a convenient package, called `itertools`. The function `combinations` can generate all possible combinations of the elements of a list." 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "metadata": {}, 539 | "outputs": [], 540 | "source": [ 541 | "import itertools as itr\n", 542 | "list(itr.combinations(authors, 2))" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "Of course, we don't want to do this for a single publication only, but rather, for all publications in our dataset. We can do that using the function `apply`. We can supply it with a small function (called a `lambda` function) that simply takes some input and produces some output. In this case, the input are the `authors`, and the output is the result of `itr.combinations(...)`." 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "coauthors_per_publication = publications_df['AU_split'].apply(\n", 559 | " lambda authors: list(itr.combinations(authors, 2)))" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "The variable `coauthors_per_publication` is now a list of a list of co-authors per publication. That is, each element of `coauthors_per_publication` contains a list of all co-authors for that publication. So, `coauthors_per_publication[0]` contains the coauthors we examined previously." 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": null, 572 | "metadata": {}, 573 | "outputs": [], 574 | "source": [ 575 | "coauthors_per_publication[0]" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "Let us turn each element of this list into a separate row. 
This is done by using `explode` in `pandas`. Publications with only one author have no co-authors, which results in an `NA` (Not Available) value. We will drop those using `dropna`." 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "metadata": {}, 589 | "outputs": [], 590 | "source": [ 591 | "coauthors = coauthors_per_publication.explode().dropna()" 592 | ] 593 | }, 594 | { 595 | "cell_type": "markdown", 596 | "metadata": {}, 597 | "source": [ 598 | "Finally, we can create the actual network as follows" 599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": null, 604 | "metadata": {}, 605 | "outputs": [], 606 | "source": [ 607 | "G_coauthorship = ig.Graph.TupleList(\n", 608 | " edges=coauthors.to_list(),\n", 609 | " vertex_name_attr='author',\n", 610 | " directed=False\n", 611 | " )" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "Note that this graph will still contain many duplicate edges, because there are multiple edges present. Let us therefore simplify this graph, and simply count the number of multiple edges. We first create a so-called edge attribute `n_joint_papers`. We can create it by using the edge sequence `es` of the graph. We can then simply sum this weight when we simplify the graph." 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": null, 624 | "metadata": {}, 625 | "outputs": [], 626 | "source": [ 627 | "G_coauthorship.es['n_joint_papers'] = 1\n", 628 | "G_coauthorship = G_coauthorship.simplify(combine_edges='sum')" 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": {}, 634 | "source": [ 635 | "Let us see how many authors (i.e. nodes) there are in the network. This is called the `vcount` (vertex count) in `igraph`." 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "G_coauthorship.vcount()" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "Similarly, the number of edges is available as the `ecount` of the graph." 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "metadata": {}, 658 | "outputs": [], 659 | "source": [ 660 | "G_coauthorship.ecount()" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": {}, 666 | "source": [ 667 | "We can do all sorts of analysis on this network. But first, we will create a bibliographic coupling network." 668 | ] 669 | }, 670 | { 671 | "cell_type": "markdown", 672 | "metadata": {}, 673 | "source": [ 674 | "### Bibliographic coupling" 675 | ] 676 | }, 677 | { 678 | "cell_type": "markdown", 679 | "metadata": {}, 680 | "source": [ 681 | "Bibliographic coupling and co-authorship is in a sense very similar. Previously, we computed for each publication a combination of all co-authors. For bibliographic coupling we can compute for each cited reference the combinations of all citing journals. We will first create a dataframe of all journal citations (`SO`) of a certain cited reference (`CR`). Similar to the authors, we need to first split the cited references." 
682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": {}, 688 | "outputs": [], 689 | "source": [ 690 | "publication_with_cr_df = publications_df.loc[pd.notnull(publications_df['CR']), ['SO', 'CR']]\n", 691 | "publication_with_cr_df['CR'] = publication_with_cr_df['CR'].str.split('; ')" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "metadata": {}, 697 | "source": [ 698 | "We now simply list all citations from a certain journal (`SO`) to a certain cited reference (`CR`)." 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": null, 704 | "metadata": {}, 705 | "outputs": [], 706 | "source": [ 707 | "journal_cits_df = publication_with_cr_df[['SO', 'CR']].explode('CR')" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "We then create all bibliographic couplings per cited reference as follows. We first group by the cited reference (`CR`) and then take all combinations of citing journals." 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": null, 720 | "metadata": {}, 721 | "outputs": [], 722 | "source": [ 723 | "bibcoupling_per_cr = journal_cits_df.groupby('CR').apply(lambda x: list(itr.combinations(x['SO'], 2)))" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": {}, 729 | "source": [ 730 | "We again `explode` all combinations of two sources citing the same reference." 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "metadata": {}, 737 | "outputs": [], 738 | "source": [ 739 | "bibcouplings = bibcoupling_per_cr.explode().dropna()" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "We can then create the network." 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": null, 752 | "metadata": {}, 753 | "outputs": [], 754 | "source": [ 755 | "G_coupling = ig.Graph.TupleList(\n", 756 | " edges=bibcouplings,\n", 757 | " vertex_name_attr='SO',\n", 758 | " directed=False\n", 759 | " )" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "metadata": {}, 765 | "source": [ 766 | "
\n", 767 | " We again need to simplify this network. Create a new edge attribute called coupling set it to 1 and then sum this attribute when simplifying the network.\n", 768 | "
" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": {}, 775 | "outputs": [], 776 | "source": [] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": {}, 781 | "source": [ 782 | "This network should be reasonably sized, and you should be able to visualize this network by calling `ig.plot`." 783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": null, 788 | "metadata": {}, 789 | "outputs": [], 790 | "source": [ 791 | "ig.plot(G_coupling, vertex_label=G_coupling.vs['SO'])" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": {}, 797 | "source": [ 798 | "# Network analysis" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": {}, 804 | "source": [ 805 | "Now that we have created some scientometric networks, let us look at some basic analyses of these networks." 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "metadata": {}, 811 | "source": [ 812 | "## Connectivity" 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": {}, 818 | "source": [ 819 | "Let us start with a very simple question. Is the co-authorship network connected?" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": null, 825 | "metadata": {}, 826 | "outputs": [], 827 | "source": [ 828 | "G_coauthorship.is_connected()" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "Apparently, not all authors in this dataset are connected via co-authored papers." 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": {}, 841 | "source": [ 842 | "
\n", 843 | "How many authors do you think will be connected to each other? 500? 5000? Almost everybody?\n", 844 | "
" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "In order to take a closer look, we need to detect the *connected components*. This is easily done, but the function is confusingly called `clusters`." 852 | ] 853 | }, 854 | { 855 | "cell_type": "code", 856 | "execution_count": null, 857 | "metadata": {}, 858 | "outputs": [], 859 | "source": [ 860 | "components = G_coauthorship.clusters()" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "We only want the so-called giant component. " 868 | ] 869 | }, 870 | { 871 | "cell_type": "markdown", 872 | "metadata": {}, 873 | "source": [ 874 | "
\n", 875 | "What function do you think returns the giant component?\n", 876 | "
" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "
\n", 884 | " Remember, you can use Tab and Shift-Tab to find out more about possible functions.\n", 885 | "
" 886 | ] 887 | }, 888 | { 889 | "cell_type": "code", 890 | "execution_count": null, 891 | "metadata": {}, 892 | "outputs": [], 893 | "source": [] 894 | }, 895 | { 896 | "cell_type": "markdown", 897 | "metadata": {}, 898 | "source": [ 899 | "Let us only look at the giant component." 900 | ] 901 | }, 902 | { 903 | "cell_type": "code", 904 | "execution_count": null, 905 | "metadata": {}, 906 | "outputs": [], 907 | "source": [ 908 | "H = components.giant()" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "metadata": {}, 914 | "source": [ 915 | "Let us check how many nodes are in the giant component. We can call the function `summary`." 916 | ] 917 | }, 918 | { 919 | "cell_type": "code", 920 | "execution_count": null, 921 | "metadata": {}, 922 | "outputs": [], 923 | "source": [ 924 | "print(H.summary())" 925 | ] 926 | }, 927 | { 928 | "cell_type": "markdown", 929 | "metadata": {}, 930 | "source": [ 931 | "The first line indicates that we have an undirected graph (`U`) with 7871 nodes and 69928 links. The next line shows vertex attributes (indicated by the `v` behind the name of the attribute), and edge attributes (indicated by the `e`)." 932 | ] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "metadata": {}, 937 | "source": [ 938 | "
\n", 939 | "
    \n", 940 | "
  1. What is the percentage of nodes that are in the giant component? \n", 941 | "
  2. Double check whether the giant component is connected.\n", 942 | "
\n", 943 | "
" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": {}, 950 | "outputs": [], 951 | "source": [] 952 | }, 953 | { 954 | "cell_type": "markdown", 955 | "metadata": {}, 956 | "source": [ 957 | "Let us take a closer look at how far authors in this data set are apart from one another. Let us simply take a look at node number `0` (remember, the first node has number `0`, not `1`) and node number `355`. " 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": null, 963 | "metadata": {}, 964 | "outputs": [], 965 | "source": [ 966 | "paths = G_coauthorship.get_shortest_paths(0, 355)\n", 967 | "paths" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "metadata": {}, 973 | "source": [ 974 | "This returns a list of all shortests paths of the nodes between node number 0 and node number 355. In fact, there is only one path, so let us select that." 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": null, 980 | "metadata": {}, 981 | "outputs": [], 982 | "source": [ 983 | "path = paths[0]\n", 984 | "path" 985 | ] 986 | }, 987 | { 988 | "cell_type": "markdown", 989 | "metadata": {}, 990 | "source": [ 991 | "
\n", 992 | "How many nodes are in the path? What is the path length?\n", 993 | "
" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "metadata": {}, 999 | "source": [ 1000 | "These numbers probably do not mean that much to you. You can find out more about an individual node by looking at the `VertexSequence` of `igraph`, abbreviated as `vs`. This is a sort of list of all vertices, and is indexed by brackets `[ ]`, similar to lists, instead of parentheses `( )` as we do for functions." 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": null, 1006 | "metadata": {}, 1007 | "outputs": [], 1008 | "source": [ 1009 | "G_coauthorship.vs[0]" 1010 | ] 1011 | }, 1012 | { 1013 | "cell_type": "markdown", 1014 | "metadata": {}, 1015 | "source": [ 1016 | "The vertex itself is also a type of list (called a *dictionary*), and you can only return the author name as follows" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": null, 1022 | "metadata": {}, 1023 | "outputs": [], 1024 | "source": [ 1025 | "G_coauthorship.vs[0]['author']" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "You can also list multiple vertices at once." 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": null, 1038 | "metadata": {}, 1039 | "outputs": [], 1040 | "source": [ 1041 | "G_coauthorship.vs[[0, 3, 223, 355]]['author']" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "markdown", 1046 | "metadata": {}, 1047 | "source": [ 1048 | "You can of course also simply pass the variable `path` that we constructed earlier." 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": null, 1054 | "metadata": {}, 1055 | "outputs": [], 1056 | "source": [ 1057 | "G_coauthorship.vs[path]['author']" 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "markdown", 1062 | "metadata": {}, 1063 | "source": [ 1064 | "This shows that Osaer collaborated with Geert, who collaborated with Van Mark, who in the end collaborated with Watkins." 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "markdown", 1069 | "metadata": {}, 1070 | "source": [ 1071 | "You can also get the vertex by searching for the author name. For example, if we want to find `'Van Marck, E'` we can use the following." 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "execution_count": null, 1077 | "metadata": {}, 1078 | "outputs": [], 1079 | "source": [ 1080 | "G_coauthorship.vs.find(author_eq = 'Van Marck, E')" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "markdown", 1085 | "metadata": {}, 1086 | "source": [ 1087 | "Here `author_eq` refers to the condition that the vertex attribute `author` should **eq**ual `'Van Marck, E'`." 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "markdown", 1092 | "metadata": {}, 1093 | "source": [ 1094 | "
\n", 1095 | " Find the shortest path from 'Van Marck, E' to 'Migchelsen, S'. Who is in between?\n", 1096 | "
" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "code", 1101 | "execution_count": null, 1102 | "metadata": {}, 1103 | "outputs": [], 1104 | "source": [] 1105 | }, 1106 | { 1107 | "cell_type": "markdown", 1108 | "metadata": {}, 1109 | "source": [ 1110 | "We can let `igraph` also calculate how far apart all nodes are." 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "markdown", 1115 | "metadata": {}, 1116 | "source": [ 1117 | "
\n", 1118 | "The following may take some time to run\n", 1119 | "
" 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "code", 1124 | "execution_count": null, 1125 | "metadata": {}, 1126 | "outputs": [], 1127 | "source": [ 1128 | "path_lengths = G_coauthorship.path_length_hist()\n", 1129 | "print(path_lengths)" 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "markdown", 1134 | "metadata": {}, 1135 | "source": [ 1136 | "
\n", 1137 | "How far apart are most authors? Do you think most authors are close by? Or do you think they are far apart?\n", 1138 | "
" 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "markdown", 1143 | "metadata": {}, 1144 | "source": [ 1145 | "Let us take a closer look at the path between node 0 and node 355 again. Instead of the nodes on the path, we now want to take a closer look at the edges on the path." 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "code", 1150 | "execution_count": null, 1151 | "metadata": {}, 1152 | "outputs": [], 1153 | "source": [ 1154 | "epath = G_coauthorship.get_shortest_paths(0, 355, output='epath')\n", 1155 | "epath" 1156 | ] 1157 | }, 1158 | { 1159 | "cell_type": "markdown", 1160 | "metadata": {}, 1161 | "source": [ 1162 | "There are three edges on this path, but the numbers themselves are not very informative. They refer to the edges, and similar to the `VertexSequence` we encountered earlier, there is also an `EdgeSequence`, abbreviated as `es`. Let us take a closer look to the number of joint papers that the authors had co-authored." 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "code", 1167 | "execution_count": null, 1168 | "metadata": {}, 1169 | "outputs": [], 1170 | "source": [ 1171 | "G_coauthorship.es[epath[0]]['n_joint_papers']" 1172 | ] 1173 | }, 1174 | { 1175 | "cell_type": "markdown", 1176 | "metadata": {}, 1177 | "source": [ 1178 | "Perhaps there are other paths that connect the two authors with more joint papers? Perhaps we could use the number of joint papers as weights?" 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "code", 1183 | "execution_count": null, 1184 | "metadata": {}, 1185 | "outputs": [], 1186 | "source": [ 1187 | "epath = G_coauthorship.get_shortest_paths(0, 355, weights='n_joint_papers', output='epath')\n", 1188 | "epath" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "markdown", 1193 | "metadata": {}, 1194 | "source": [ 1195 | "We do get a different path, which it is actually longer. Let us take a look at the number of joint papers." 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "code", 1200 | "execution_count": null, 1201 | "metadata": {}, 1202 | "outputs": [], 1203 | "source": [ 1204 | "G_coauthorship.es[epath[0]]['n_joint_papers']" 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "markdown", 1209 | "metadata": {}, 1210 | "source": [ 1211 | "The total number of joint papers is lower! That is because *shortest path* means: the path with the lowest sum of the weights. This is clearly not what we want. You should always be aware of this whenever using the concept of the *shortest path*." 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "markdown", 1216 | "metadata": {}, 1217 | "source": [ 1218 | "
\n", 1219 | "Attention! Weighted shortest paths have the lowest total weight.\n", 1220 | "
" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "markdown", 1225 | "metadata": {}, 1226 | "source": [ 1227 | "## Clustering coefficient" 1228 | ] 1229 | }, 1230 | { 1231 | "cell_type": "markdown", 1232 | "metadata": {}, 1233 | "source": [ 1234 | "Let us look whether co-authors of an author also tend to be co-authors among themselves." 1235 | ] 1236 | }, 1237 | { 1238 | "cell_type": "markdown", 1239 | "metadata": {}, 1240 | "source": [ 1241 | "Let us take a look at the co-authors of of author number 0, which are called the *neighbors* in network terminology." 1242 | ] 1243 | }, 1244 | { 1245 | "cell_type": "code", 1246 | "execution_count": null, 1247 | "metadata": {}, 1248 | "outputs": [], 1249 | "source": [ 1250 | "G_coauthorship.neighborhood(0)" 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "markdown", 1255 | "metadata": {}, 1256 | "source": [ 1257 | "What we actually want to know is whether many of those neighors are connected. That is, we want to take the subgraph of all authors that have co-authored with author number 0." 1258 | ] 1259 | }, 1260 | { 1261 | "cell_type": "code", 1262 | "execution_count": null, 1263 | "metadata": {}, 1264 | "outputs": [], 1265 | "source": [ 1266 | "H = G_coauthorship.induced_subgraph(G_coauthorship.neighborhood(0))\n", 1267 | "print(H.summary())" 1268 | ] 1269 | }, 1270 | { 1271 | "cell_type": "markdown", 1272 | "metadata": {}, 1273 | "source": [ 1274 | "This subgraph only has 4 nodes (including node 0, so it has 3 neighbours) and 6 edges. This is sufficiently small to be easily plotted for visual inspection." 1275 | ] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": null, 1280 | "metadata": {}, 1281 | "outputs": [], 1282 | "source": [ 1283 | "H.vs['color'] = 'red'\n", 1284 | "H.vs[0]['color'] = 'grey'\n", 1285 | "ig.plot(H)" 1286 | ] 1287 | }, 1288 | { 1289 | "cell_type": "markdown", 1290 | "metadata": {}, 1291 | "source": [ 1292 | "
\n", 1293 | "Do many of the co-authors collaborate among themselves as well? Why do you think this happens?\n", 1294 | "
" 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "markdown", 1299 | "metadata": {}, 1300 | "source": [ 1301 | "We can also ask `igraph` to calculate the clustering coefficient (which is called *transitivity* in igraph, which is the same concept using different terms) of node 0." 1302 | ] 1303 | }, 1304 | { 1305 | "cell_type": "code", 1306 | "execution_count": null, 1307 | "metadata": {}, 1308 | "outputs": [], 1309 | "source": [ 1310 | "G_coauthorship.transitivity_local_undirected(0)" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "markdown", 1315 | "metadata": {}, 1316 | "source": [ 1317 | "
\n", 1318 | "What percentage of the co-authors of node 0 have also written papers with each other?\n", 1319 | "
" 1320 | ] 1321 | }, 1322 | { 1323 | "cell_type": "code", 1324 | "execution_count": null, 1325 | "metadata": {}, 1326 | "outputs": [], 1327 | "source": [] 1328 | }, 1329 | { 1330 | "cell_type": "markdown", 1331 | "metadata": {}, 1332 | "source": [ 1333 | "You can calculate the average for all nodes using the function `transitivity_avglocal_undirected`." 1334 | ] 1335 | }, 1336 | { 1337 | "cell_type": "markdown", 1338 | "metadata": {}, 1339 | "source": [ 1340 | "
\n", 1341 | "What percentage of the co-authors have also written papers with each other on average? Do you think this is high or not?\n", 1342 | "
" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "code", 1347 | "execution_count": null, 1348 | "metadata": {}, 1349 | "outputs": [], 1350 | "source": [] 1351 | }, 1352 | { 1353 | "cell_type": "markdown", 1354 | "metadata": {}, 1355 | "source": [ 1356 | "## Centrality" 1357 | ] 1358 | }, 1359 | { 1360 | "cell_type": "markdown", 1361 | "metadata": {}, 1362 | "source": [ 1363 | "Often, people want to identify wich nodes seem to be most important in some way in the network. This is often thought of as a type of *centrality* of a node." 1364 | ] 1365 | }, 1366 | { 1367 | "cell_type": "markdown", 1368 | "metadata": {}, 1369 | "source": [ 1370 | "### Degree" 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "markdown", 1375 | "metadata": {}, 1376 | "source": [ 1377 | "The simplest type of centrality is the *degree* of a node, which is simply the number of its neighbors. Previously, we saw that node 0 had 3 neighbors, we therefore say its degree is 3." 1378 | ] 1379 | }, 1380 | { 1381 | "cell_type": "code", 1382 | "execution_count": null, 1383 | "metadata": {}, 1384 | "outputs": [], 1385 | "source": [ 1386 | "G_coauthorship.degree(0)" 1387 | ] 1388 | }, 1389 | { 1390 | "cell_type": "markdown", 1391 | "metadata": {}, 1392 | "source": [ 1393 | "We can also simply calculate the degree for everybody and store it in a new vertex attribute called `degree`." 1394 | ] 1395 | }, 1396 | { 1397 | "cell_type": "code", 1398 | "execution_count": null, 1399 | "metadata": {}, 1400 | "outputs": [], 1401 | "source": [ 1402 | "G_coauthorship.vs['degree'] = G_coauthorship.degree()" 1403 | ] 1404 | }, 1405 | { 1406 | "cell_type": "markdown", 1407 | "metadata": {}, 1408 | "source": [ 1409 | "
\n", 1410 | " What is the degree of 'Van Marck, E'?\n", 1411 | "
" 1412 | ] 1413 | }, 1414 | { 1415 | "cell_type": "code", 1416 | "execution_count": null, 1417 | "metadata": {}, 1418 | "outputs": [], 1419 | "source": [] 1420 | }, 1421 | { 1422 | "cell_type": "markdown", 1423 | "metadata": {}, 1424 | "source": [ 1425 | "We can also take a look at the complete degree distribution. To plot it, we load the `matplotlib` package. We import the plotting functionality and name the package `plt`. We also include a statement telling Python to show the plots immediately in this notebook." 1426 | ] 1427 | }, 1428 | { 1429 | "cell_type": "code", 1430 | "execution_count": null, 1431 | "metadata": {}, 1432 | "outputs": [], 1433 | "source": [ 1434 | "import matplotlib.pyplot as plt\n", 1435 | "%matplotlib inline" 1436 | ] 1437 | }, 1438 | { 1439 | "cell_type": "markdown", 1440 | "metadata": {}, 1441 | "source": [ 1442 | "Now let us plot a histogram of the degree, using 50 bins." 1443 | ] 1444 | }, 1445 | { 1446 | "cell_type": "code", 1447 | "execution_count": null, 1448 | "metadata": {}, 1449 | "outputs": [], 1450 | "source": [ 1451 | "plt.hist(G_coauthorship.vs['degree'], 50);\n", 1452 | "plt.yscale('log')" 1453 | ] 1454 | }, 1455 | { 1456 | "cell_type": "markdown", 1457 | "metadata": {}, 1458 | "source": [ 1459 | "This clearly shows that the degree distribution is quite skewed. Most authors have only few collaborators, while a few authors have many collaborators. If the degree distribution is so skewed, it is sometimes referred to as a *scale-free* network, although the exact definition has been a topic of intense discussion recently." 1460 | ] 1461 | }, 1462 | { 1463 | "cell_type": "markdown", 1464 | "metadata": {}, 1465 | "source": [ 1466 | "The code below sorts the nodes in descending order of the degree." 1467 | ] 1468 | }, 1469 | { 1470 | "cell_type": "code", 1471 | "execution_count": null, 1472 | "metadata": {}, 1473 | "outputs": [], 1474 | "source": [ 1475 | "highest_degree = sorted(G_coauthorship.vs, key=lambda v: v['degree'], reverse=True)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "markdown", 1480 | "metadata": {}, 1481 | "source": [ 1482 | "The `sorted` function takes a list as input, `G_coauthorship.vs` in our case, and sorts it according to a sort key. We indicate the sort key by a small function, called a `lambda` function, that returns the degree. In other words, the `sorted` function will sort the nodes according to the degree. By indicating `reverse=True` we obtain a list that is sorted highest to lowest, instead of the other way around." 1483 | ] 1484 | }, 1485 | { 1486 | "cell_type": "markdown", 1487 | "metadata": {}, 1488 | "source": [ 1489 | "You can look at the first five results in the following way." 1490 | ] 1491 | }, 1492 | { 1493 | "cell_type": "code", 1494 | "execution_count": null, 1495 | "metadata": {}, 1496 | "outputs": [], 1497 | "source": [ 1498 | "highest_degree[:5]" 1499 | ] 1500 | }, 1501 | { 1502 | "cell_type": "markdown", 1503 | "metadata": {}, 1504 | "source": [ 1505 | "So, apparently, U D'Allessandro has collaborated with 715 other authors! This of course only considers the number of co-authors, it does not take into account the number of papers written with somebody else.\n", 1506 | "When specifying such *edge weights* like the number of joint papers, the weighted degree is referred to as the *strength* of a node (which is sometimes a bit confusing term). \n", 1507 | "\n", 1508 | "Let us look at the strength of node 0." 
1509 | ] 1510 | }, 1511 | { 1512 | "cell_type": "code", 1513 | "execution_count": null, 1514 | "metadata": {}, 1515 | "outputs": [], 1516 | "source": [ 1517 | "G_coauthorship.strength(0, weights='n_joint_papers')" 1518 | ] 1519 | }, 1520 | { 1521 | "cell_type": "markdown", 1522 | "metadata": {}, 1523 | "source": [ 1524 | "Apparently, author 0 collaborated with 3 different authors, and has a total strength of 3. But what does this 3 mean? We need to carefully think about this. Suppose that author 0 has co-authored a single publication with three other co-authors, then each of the three co-authors would have an edge weight of `n_joint_papers = 1`. So, the strenght would be 3. Hence, the strength denotes the total number of collaborations that an author had, which depends both on the number of publications and the number of collaborators per paper.\n", 1525 | "\n", 1526 | "Sometimes, we wish to take into account the number of co-authorships when creating a link weight. We can then fractionally count the weight of each collaboration between $n_a$ authors as\n", 1527 | "\n", 1528 | "$$\\frac{1}{n_a - 1}.$$\n", 1529 | "\n", 1530 | "We need to go back to the `publications_df` in order to construct such a *fractional* edge weight." 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "code", 1535 | "execution_count": null, 1536 | "metadata": {}, 1537 | "outputs": [], 1538 | "source": [ 1539 | "import itertools as itr\n", 1540 | "[(coauthor[0], coauthor[1], 1/(len(authors) - 1)) for coauthor in itr.combinations(authors, 2)]" 1541 | ] 1542 | }, 1543 | { 1544 | "cell_type": "markdown", 1545 | "metadata": {}, 1546 | "source": [ 1547 | "We again do this for all publications." 1548 | ] 1549 | }, 1550 | { 1551 | "cell_type": "code", 1552 | "execution_count": null, 1553 | "metadata": {}, 1554 | "outputs": [], 1555 | "source": [ 1556 | "coauthors_per_publication = publications_df['AU_split'].apply(\n", 1557 | " lambda authors: \n", 1558 | " [(coauthor[0], coauthor[1], 1, 1/(len(authors) - 1)) \n", 1559 | " for coauthor in itr.combinations(authors, 2)])" 1560 | ] 1561 | }, 1562 | { 1563 | "cell_type": "markdown", 1564 | "metadata": {}, 1565 | "source": [ 1566 | "The variable `coauthors_per_publication` is now a list of a list of co-authors per publication, but including a full weight of `1` and a fractional weight of `1/(len(authors) - 1)`, where `len(authors)` is the number of authors of the publications. We again `explode` this list." 1567 | ] 1568 | }, 1569 | { 1570 | "cell_type": "code", 1571 | "execution_count": null, 1572 | "metadata": {}, 1573 | "outputs": [], 1574 | "source": [ 1575 | "coauthors = coauthors_per_publication.explode().dropna()" 1576 | ] 1577 | }, 1578 | { 1579 | "cell_type": "markdown", 1580 | "metadata": {}, 1581 | "source": [ 1582 | "We can again create the network, but now we can pass two edge attributes, `n_joint_papers` and `n_joint_papers_frac`. We of course also have to simplify the network again." 1583 | ] 1584 | }, 1585 | { 1586 | "cell_type": "code", 1587 | "execution_count": null, 1588 | "metadata": {}, 1589 | "outputs": [], 1590 | "source": [ 1591 | "G_coauthorship = ig.Graph.TupleList(\n", 1592 | " edges=coauthors.to_list(),\n", 1593 | " vertex_name_attr='author',\n", 1594 | " directed=False,\n", 1595 | " edge_attrs=('n_joint_papers', 'n_joint_papers_frac')\n", 1596 | " )\n", 1597 | "G_coauthorship = G_coauthorship.simplify(loops=False, combine_edges='sum')" 1598 | ] 1599 | }, 1600 | { 1601 | "cell_type": "markdown", 1602 | "metadata": {}, 1603 | "source": [ 1604 | "
\n", 1605 | "What is the sum of n_joint_papers_frac over all co-authors? Then shouldn't the strength sum up to a whole number? Why isn't that the case here? (Hint: look at the authors of publication 'WOS:000242241600004'" 1607 | ] 1608 | }, 1609 | { 1610 | "cell_type": "code", 1611 | "execution_count": null, 1612 | "metadata": {}, 1613 | "outputs": [], 1614 | "source": [ 1615 | "publications_df.loc['WOS:000242241600004', 'AU']" 1616 | ] 1617 | }, 1618 | { 1619 | "cell_type": "markdown", 1620 | "metadata": {}, 1621 | "source": [ 1622 | "### Betweenness centrality" 1623 | ] 1624 | }, 1625 | { 1626 | "cell_type": "markdown", 1627 | "metadata": {}, 1628 | "source": [ 1629 | "Betweenness centrality is much more elaborate, and gives an indication of the number of times a node is on the shortest path from one node to another node.\n", 1630 | "\n", 1631 | "As you can imagine, this can take quite some time to calculate for all nodes. We will therefore use the somewhat smaller bibliographic coupling network of journals." 1632 | ] 1633 | }, 1634 | { 1635 | "cell_type": "markdown", 1636 | "metadata": {}, 1637 | "source": [ 1638 | "
\n", 1639 | " Note: On larger networks, it may take a long time to calculate the betweenness centrality.\n", 1640 | "
" 1641 | ] 1642 | }, 1643 | { 1644 | "cell_type": "code", 1645 | "execution_count": null, 1646 | "metadata": {}, 1647 | "outputs": [], 1648 | "source": [ 1649 | "G_coupling.vs['betweenness'] = G_coupling.betweenness()" 1650 | ] 1651 | }, 1652 | { 1653 | "cell_type": "markdown", 1654 | "metadata": {}, 1655 | "source": [ 1656 | "Now we can look at the journals that have the highest betweenness." 1657 | ] 1658 | }, 1659 | { 1660 | "cell_type": "code", 1661 | "execution_count": null, 1662 | "metadata": {}, 1663 | "outputs": [], 1664 | "source": [ 1665 | "sorted(G_coupling.vs, key=lambda v: v['betweenness'], reverse=True)[:5]" 1666 | ] 1667 | }, 1668 | { 1669 | "cell_type": "markdown", 1670 | "metadata": {}, 1671 | "source": [ 1672 | "As we did previously when dealing with shortest paths, we can also use a weight for determining the shortest paths." 1673 | ] 1674 | }, 1675 | { 1676 | "cell_type": "code", 1677 | "execution_count": null, 1678 | "metadata": {}, 1679 | "outputs": [], 1680 | "source": [ 1681 | "G_coupling.vs['betweenness_weighted'] = G_coupling.betweenness(weights='coupling')" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "markdown", 1686 | "metadata": {}, 1687 | "source": [ 1688 | "
\n", 1689 | "What is journal with the highest weighted betweenness centrality? Does this make sense if you compare it to the unweighted betweenness centrality?\n", 1690 | "
" 1691 | ] 1692 | }, 1693 | { 1694 | "cell_type": "code", 1695 | "execution_count": null, 1696 | "metadata": {}, 1697 | "outputs": [], 1698 | "source": [] 1699 | }, 1700 | { 1701 | "cell_type": "markdown", 1702 | "metadata": {}, 1703 | "source": [ 1704 | "
\n", 1705 | " Attention! Weighted shortest paths have the lowest total weight.\n", 1706 | "
" 1707 | ] 1708 | }, 1709 | { 1710 | "cell_type": "markdown", 1711 | "metadata": {}, 1712 | "source": [ 1713 | "### Pagerank" 1714 | ] 1715 | }, 1716 | { 1717 | "cell_type": "markdown", 1718 | "metadata": {}, 1719 | "source": [ 1720 | "One way of identifying central nodes relies on the idea of a random walk in a network. We will study this in the journal bibliographic coupling network. When performing such a random walk, we simply go from one journal to the next, following the bibliographic coupling links. The journal that is most frequently visited during such a random walk is then seen as most central. This is actually the idea that underlies Google's famous search engine. Luckily, we can compute that a lot faster than betweenness." 1721 | ] 1722 | }, 1723 | { 1724 | "cell_type": "code", 1725 | "execution_count": null, 1726 | "metadata": {}, 1727 | "outputs": [], 1728 | "source": [ 1729 | "G_coupling.vs['pagerank'] = G_coupling.pagerank()" 1730 | ] 1731 | }, 1732 | { 1733 | "cell_type": "markdown", 1734 | "metadata": {}, 1735 | "source": [ 1736 | "
\n", 1737 | "Get the top 5 most central journals according to Pagerank. Who is the most central? Are the results very different from the betweenness?\n", 1738 | "
" 1739 | ] 1740 | }, 1741 | { 1742 | "cell_type": "code", 1743 | "execution_count": null, 1744 | "metadata": {}, 1745 | "outputs": [], 1746 | "source": [] 1747 | }, 1748 | { 1749 | "cell_type": "markdown", 1750 | "metadata": {}, 1751 | "source": [ 1752 | "We can again take into account the weights. In pagerank this means that a journal that is a more closely bibliographically coupled will be more likely to be visited during a random walk. This is actually much more in line with our intuition than the shortest path. Let us see what we get if we do that." 1753 | ] 1754 | }, 1755 | { 1756 | "cell_type": "code", 1757 | "execution_count": null, 1758 | "metadata": {}, 1759 | "outputs": [], 1760 | "source": [ 1761 | "G_coupling.vs['pagerank_weighted'] = G_coupling.pagerank(weights='coupling')" 1762 | ] 1763 | }, 1764 | { 1765 | "cell_type": "markdown", 1766 | "metadata": {}, 1767 | "source": [ 1768 | "
\n", 1769 | "Are the results different for the weighted version of pagerank?\n", 1770 | "
" 1771 | ] 1772 | }, 1773 | { 1774 | "cell_type": "code", 1775 | "execution_count": null, 1776 | "metadata": {}, 1777 | "outputs": [], 1778 | "source": [] 1779 | }, 1780 | { 1781 | "cell_type": "markdown", 1782 | "metadata": {}, 1783 | "source": [ 1784 | "
\n", 1785 | "Pagerank is very similar to the techniques that underly the journal \"Eigenfactor\" and the \"SCImago Journal Rank\", which are seen as indicators of the scientific impact of a journal. Do you think it makes sense to interpret Pagerank on a bibliographic coupling network as the scientific impact of a journal? Why (not)?\n", 1786 | "
" 1787 | ] 1788 | }, 1789 | { 1790 | "cell_type": "markdown", 1791 | "metadata": {}, 1792 | "source": [ 1793 | "## Co-authorship using bipartite projection (optional)" 1794 | ] 1795 | }, 1796 | { 1797 | "cell_type": "markdown", 1798 | "metadata": {}, 1799 | "source": [ 1800 | "We can also create co-authorship using a more theoretical approach from graph theory. We can first construct a network consisting of publications and authors." 1801 | ] 1802 | }, 1803 | { 1804 | "cell_type": "markdown", 1805 | "metadata": {}, 1806 | "source": [ 1807 | "We first again `explode` all authors for each publication, and create a graph out of it." 1808 | ] 1809 | }, 1810 | { 1811 | "cell_type": "code", 1812 | "execution_count": null, 1813 | "metadata": {}, 1814 | "outputs": [], 1815 | "source": [ 1816 | "author_pubs_df = publications_df['AU_split'].explode()\n", 1817 | "\n", 1818 | "G_pub_authors = ig.Graph.TupleList(\n", 1819 | " edges=author_pubs_df.reset_index().values,\n", 1820 | " vertex_name_attr='name',\n", 1821 | " directed=False\n", 1822 | " )" 1823 | ] 1824 | }, 1825 | { 1826 | "cell_type": "markdown", 1827 | "metadata": {}, 1828 | "source": [ 1829 | "This network consists of two types: publications and authors. This is called a *bipartite* graph. We can automatically get the types using `is_bipartite`." 1830 | ] 1831 | }, 1832 | { 1833 | "cell_type": "code", 1834 | "execution_count": null, 1835 | "metadata": {}, 1836 | "outputs": [], 1837 | "source": [ 1838 | "is_bipartite, types = G_pub_authors.is_bipartite(return_types = True)\n", 1839 | "print(is_bipartite)" 1840 | ] 1841 | }, 1842 | { 1843 | "cell_type": "markdown", 1844 | "metadata": {}, 1845 | "source": [ 1846 | "The actual types are simply returned as a list of `True` and `False` values, which are arbitrary labels for publications and authors. Let us see what the first label stands for." 1847 | ] 1848 | }, 1849 | { 1850 | "cell_type": "code", 1851 | "execution_count": null, 1852 | "metadata": {}, 1853 | "outputs": [], 1854 | "source": [ 1855 | "print(types[0])\n", 1856 | "print(G_pub_authors.vs[0])" 1857 | ] 1858 | }, 1859 | { 1860 | "cell_type": "markdown", 1861 | "metadata": {}, 1862 | "source": [ 1863 | "From the `name` of node `0` we can see that it refers to a publication, and so `False` indicates publications, while `True` indicates authors." 1864 | ] 1865 | }, 1866 | { 1867 | "cell_type": "markdown", 1868 | "metadata": {}, 1869 | "source": [ 1870 | "We now would like to perform a so-called *bipartite projection* onto the authors. This is exactly the type of operation that leads to a co-authorship network. If we were to *project* onto the publication, we would end up with a network of publications where each pair of publications is linked if it is authored by the same author." 1871 | ] 1872 | }, 1873 | { 1874 | "cell_type": "code", 1875 | "execution_count": null, 1876 | "metadata": {}, 1877 | "outputs": [], 1878 | "source": [ 1879 | "G_author_projection = G_pub_authors.bipartite_projection(types=types, which=True)" 1880 | ] 1881 | }, 1882 | { 1883 | "cell_type": "markdown", 1884 | "metadata": {}, 1885 | "source": [ 1886 | "By default, it keeps track of the *multiplicity* (i.e. the number of joint papers) in the `weight` edge attribute. Unfortunately, it is not possible to do fractional counting using this approach." 1887 | ] 1888 | }, 1889 | { 1890 | "cell_type": "markdown", 1891 | "metadata": {}, 1892 | "source": [ 1893 | "
\n", 1894 | " Check the number of nodes in the bipartite projection. Why is it different from the number of nodes in the earlier constructed G_coauthorship? (Hint: checkout the degree.)\n", 1895 | "
" 1896 | ] 1897 | }, 1898 | { 1899 | "cell_type": "code", 1900 | "execution_count": null, 1901 | "metadata": {}, 1902 | "outputs": [], 1903 | "source": [] 1904 | }, 1905 | { 1906 | "cell_type": "markdown", 1907 | "metadata": {}, 1908 | "source": [ 1909 | "# Analysis of your own data" 1910 | ] 1911 | }, 1912 | { 1913 | "cell_type": "markdown", 1914 | "metadata": {}, 1915 | "source": [ 1916 | "You have now learned the basics of handling WoS files and transforming them into scientometric networks. Please take some time now to do your own analysis." 1917 | ] 1918 | }, 1919 | { 1920 | "cell_type": "markdown", 1921 | "metadata": {}, 1922 | "source": [ 1923 | "
\n", 1924 | "Go to Web of Science and select a publication set of interest. Make sure that the number of publications is higher than 1000, but lower than 5000. Export the files as follows:\n", 1925 | "
    \n", 1926 | "
  1. Export using \"Save to Other File Formats\".\n", 1927 | "
  2. Select the appropriate records (e.g. 1-500, 501-1000, etc...).\n", 1928 | "
  3. Select the Record Content \"Full Record and Cited References\".\n", 1929 | "
  4. Select the File Format \"Tab delimited (Win, UTF8)\".\n", 1930 | "
  5. Click on Send.\n", 1931 | "
\n", 1932 | "Repeat the above steps for each batch of 500 publications.\n", 1933 | "\n", 1934 | "Load the data from all files you downloaded using pandas\n", 1935 | "
" 1936 | ] 1937 | }, 1938 | { 1939 | "cell_type": "code", 1940 | "execution_count": null, 1941 | "metadata": {}, 1942 | "outputs": [], 1943 | "source": [] 1944 | }, 1945 | { 1946 | "cell_type": "markdown", 1947 | "metadata": {}, 1948 | "source": [ 1949 | "
\n", 1950 | "Create a co-authorship network of your publications. Hint: use the approach you encountered earlier.\n", 1951 | "
" 1952 | ] 1953 | }, 1954 | { 1955 | "cell_type": "code", 1956 | "execution_count": null, 1957 | "metadata": {}, 1958 | "outputs": [], 1959 | "source": [] 1960 | }, 1961 | { 1962 | "cell_type": "markdown", 1963 | "metadata": {}, 1964 | "source": [ 1965 | "
\n", 1966 | "Identify the authors that are most central to the coauthorship network and interpret the results.\n", 1967 | "
" 1968 | ] 1969 | }, 1970 | { 1971 | "cell_type": "code", 1972 | "execution_count": null, 1973 | "metadata": {}, 1974 | "outputs": [], 1975 | "source": [] 1976 | }, 1977 | { 1978 | "cell_type": "markdown", 1979 | "metadata": {}, 1980 | "source": [ 1981 | "
\n", 1982 | "Create a co-citation network of your publications. Hint: use the bibliographic coupling approach, but switch the roles of the source and the target.\n", 1983 | "
" 1984 | ] 1985 | }, 1986 | { 1987 | "cell_type": "code", 1988 | "execution_count": null, 1989 | "metadata": {}, 1990 | "outputs": [], 1991 | "source": [] 1992 | }, 1993 | { 1994 | "cell_type": "markdown", 1995 | "metadata": {}, 1996 | "source": [ 1997 | "
\n", 1998 | "Identify the publications that are most central to the co-citation network and interpret the results. Are they relatively recent publications or not?\n", 1999 | "
" 2000 | ] 2001 | }, 2002 | { 2003 | "cell_type": "code", 2004 | "execution_count": null, 2005 | "metadata": {}, 2006 | "outputs": [], 2007 | "source": [] 2008 | } 2009 | ], 2010 | "metadata": { 2011 | "kernelspec": { 2012 | "display_name": "Python 3", 2013 | "language": "python", 2014 | "name": "python3" 2015 | }, 2016 | "language_info": { 2017 | "codemirror_mode": { 2018 | "name": "ipython", 2019 | "version": 3 2020 | }, 2021 | "file_extension": ".py", 2022 | "mimetype": "text/x-python", 2023 | "name": "python", 2024 | "nbconvert_exporter": "python", 2025 | "pygments_lexer": "ipython3", 2026 | "version": "3.8.3" 2027 | } 2028 | }, 2029 | "nbformat": 4, 2030 | "nbformat_minor": 2 2031 | } 2032 | --------------------------------------------------------------------------------