├── local └── environment.yml ├── binder └── environment.yml ├── .gitignore ├── readme.md ├── 03-prepare-VOSviewer-term-map.ipynb ├── data-files └── vosviewer │ └── terms.txt ├── 02-advanced.ipynb └── 01-basics.ipynb /local/environment.yml: -------------------------------------------------------------------------------- 1 | name: CSSS 2 | channels: 3 | - conda-forge 4 | - defaults 5 | dependencies: 6 | - python=3.8 7 | - jupyter 8 | - nbconvert 9 | - notebook 10 | - tornado 11 | - matplotlib 12 | - numpy 13 | - scipy 14 | - pandas 15 | - pycairo 16 | - python-igraph 17 | - leidenalg 18 | -------------------------------------------------------------------------------- /binder/environment.yml: -------------------------------------------------------------------------------- 1 | channels: 2 | - vtraag 3 | - conda-forge 4 | - defaults 5 | dependencies: 6 | - python=3.7 7 | - jupyter=1.0.0 8 | - nbconvert=5.4.0 9 | - notebook=5.7.4 10 | - tornado<6 11 | - matplotlib 12 | - numpy 13 | - scipy 14 | - pandas>=0.21.0 15 | - pycairo 16 | - python-igraph 17 | - leidenalg 18 | - metaknowledge 19 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | results/ 3 | latexdiff*/ 4 | *.dat 5 | *.pyc 6 | *.log 7 | *.bbl 8 | *.blg 9 | *.aux 10 | *.pdf 11 | *.eps 12 | *.out 13 | *.synctex.gz 14 | *.synctex 15 | *.swp 16 | *.zip 17 | *.gephi 18 | *.fdb_latexmk 19 | *.fls 20 | *.*~ 21 | *.tcp 22 | *.tps 23 | *.tiw 24 | *Notes.bib 25 | *.tmp 26 | *.docx 27 | *.picklez 28 | *.png 29 | *.spl 30 | **/.ipynb_checkpoints 31 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | 2 | # CWTS Scientometrics Summer School 3 | 4 | This GitHub repository contains the exercises for doing network analysis with Python. 5 | 6 | We would like to encourage you to install [Anaconda](https://www.anaconda.com/distribution/) Python locally. This allows you to run the Python notebooks on your own computer. As an alternative, the notebooks are also available from an online service if you don't manage to install [Anaconda](https://www.anaconda.com/distribution/) Python locally. 7 | 8 | # Local installation 9 | 10 | We encourage you to install Python on your own computer. When you have everything installed locally, you can simply run the notebooks without depending on any online service. Moreover, your local environment is then already set up if you want to use it in the future. 11 | 12 | Please follow these steps to correctly set up your environment: 13 | 14 | 1. [Download](https://www.anaconda.com/distribution/) and install Anaconda Python. When asked, select to install it only for a single user. 15 | 16 | 2. [Download](https://github.com/CWTSLeiden/CSSS/archive/master.zip) this repository and unzip it to a directory of your choice. 17 | 18 | - More technical users may also clone the repository; make sure that you use the master branch. 19 | 20 | 21 | 3. In Windows, please launch the "Anaconda prompt". In Mac OS/Linux, open the terminal and activate conda by running `source ~/anaconda3/bin/activate`. This enables the installation of the required packages.
In the prompt/terminal, navigate to the directory to which you unzipped the repository using 22 | 23 | ``` 24 | cd [DIRECTORY] 25 | ``` 26 | 27 | where you should replace `[DIRECTORY]` with the directory to which you unzipped the repository. 28 | 29 | 4. Set up the new environment ``CSSS`` using 30 | 31 | ``` 32 | conda env create -f local/environment.yml 33 | ``` 34 | 35 | This automatically creates the new environment ``CSSS`` and installs the correct versions of all required packages. 36 | 37 | **Note:** Installation may take some time. 38 | 39 | ## Run Jupyter notebook 40 | 41 | There are two ways in which you can run a Jupyter notebook. 42 | 43 | 1. Launch the "Anaconda navigator" and start the Jupyter notebook from there. Make sure to select the correct environment ``CSSS`` from the dropdown box at the top of the window. The Jupyter notebook will start in a specific directory; you may need to move the directory to which you unzipped the repository so that it is also visible from the Jupyter notebook. 44 | 45 | 2. In Windows, please launch the "Anaconda prompt". In Mac OS/Linux, open the terminal and activate conda by running ``conda activate CSSS`` or, if that does not work, ``source ~/anaconda3/bin/activate CSSS``. Navigate to the directory to which you unzipped the repository using 46 | ``` 47 | cd [DIRECTORY] 48 | ``` 49 | Then launch the Jupyter notebook using 50 | ``` 51 | jupyter notebook 52 | ``` 53 | 54 | In both approaches, you can open the desired notebook: `01-basics.ipynb` or `02-advanced.ipynb`. 55 | 56 | ## Issues 57 | 58 | If you encounter any problems during installation, or with the Python notebooks, please report them as an issue at https://github.com/CWTSLeiden/CSSS/issues. 59 | 60 | # Run online 61 | 62 | The Python notebooks can be run online without the need for installation. Please click on one of the badges below to start the interactive environment. Note that resources are limited, and that you cannot use your own data files for further analysis. Unfortunately, the online services may also not always be available.
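If you have installed everything locally, you can quickly check that the main packages of the ``CSSS`` environment import correctly by running the following in a notebook cell, or in a Python session started from the activated environment (a minimal sanity check, not part of the exercises):

```python
# Quick sanity check of the local CSSS environment: these imports should all succeed
import igraph
import leidenalg
import pandas

print('igraph', igraph.__version__)
print('pandas', pandas.__version__)
```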
63 | 64 | ## `01-basics.ipynb` 65 | * GESIS (Leibniz Institute for the Social Sciences) 66 | [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/CWTSLeiden/CSSS/master?filepath=01-basics.ipynb) 67 | 68 | * PANGEO 69 | [![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/CWTSLeiden/CSSS/master?filepath=01-basics.ipynb) 70 | 71 | * MyBinder.org 72 | [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/CWTSLeiden/CSSS/master?filepath=01-basics.ipynb) 73 | 74 | ## `02-advanced.ipynb` 75 | 76 | * GESIS (Leibniz Institute for the Social Sciences) 77 | [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/CWTSLeiden/CSSS/master?filepath=02-advanced.ipynb) 78 | 79 | * PANGEO 80 | [![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/CWTSLeiden/CSSS/master?filepath=02-advanced.ipynb) 81 | 82 | * MyBinder.org 83 | [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/CWTSLeiden/CSSS/master?filepath=02-advanced.ipynb) 84 | 85 | 86 | ## `03-prepare-VOSviewer-term-map.ipynb` 87 | 88 | * GESIS (Leibniz Institute for the Social Sciences) 89 | [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/CWTSLeiden/CSSS/master?filepath=03-prepare-VOSviewer-term-map.ipynb) 90 | 91 | * PANGEO 92 | [![Binder](https://binder.pangeo.io/badge_logo.svg)](https://binder.pangeo.io/v2/gh/CWTSLeiden/CSSS/master?filepath=03-prepare-VOSviewer-term-map.ipynb) 93 | 94 | * MyBinder.org 95 | [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/CWTSLeiden/CSSS/master?filepath=03-prepare-VOSviewer-term-map.ipynb) 96 | -------------------------------------------------------------------------------- /03-prepare-VOSviewer-term-map.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Preparing files for VOSviewer overlays" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this notebook we will load some files from Web of Science, parse them, and use them to prepare an advanced overlay map in VOSviewer. Many of these operations you have already seen earlier during the summer school." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "As usual, we will start by importing the relevant packages. We will need the `pandas` package, which we will again call `pd`. In addition, we need the `csv` package for some options and the `glob` package to easily find the relevant files." 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": null, 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "import pandas as pd\n", 31 | "import csv\n", 32 | "import glob" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "We will start by reading in all files. We already did this in an earlier notebook; we repeat it below."
40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n", 49 | "publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n", 50 | "                                         usecols=range(68), index_col='UT') for f in files)\n", 51 | "publications_df = publications_df.sort_index()" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "We will now prepare files manually for VOSviewer. We will have to prepare two files: \n", 59 | " 1. a so-called corpus file that contains all text for each document.\n", 60 | " 2. a so-called scores file that contains \"scores\" for each document." 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## Corpus file" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "We will first prepare the corpus file. For this purpose, we concatenate the title and abstract of each publication. VOSviewer simply considers each line in the corpus file a document and uses all of its text when creating a term map. In other words, you can apply this to any type of file." 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "publications_df['text'] = publications_df['TI'] + '. ' + publications_df['AB']" 84 | ] 85 | }, 86 | { 87 | "cell_type": "markdown", 88 | "metadata": {}, 89 | "source": [ 90 | "We have added the additional full stop (`.`) to make sure that VOSviewer is able to parse the sentences correctly." 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "Since VOSviewer expects one document per line, we need to make sure that each title and abstract is on a single line. In more technical terms: they cannot contain any newlines, which are represented by a combination of special characters that depends on the platform you are using. We will simply remove all possible newline characters as follows:" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "publications_df['text'] = publications_df['text'].str.replace('\\n', '').str.replace('\\r', '');" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "Now we write the text for each document to a corpus file." 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "metadata": {}, 120 | "outputs": [], 121 | "source": [ 122 | "publications_df['text'].to_csv('corpus.txt', index=False, header=False)" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "metadata": {}, 128 | "source": [ 129 | "## Scores file" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "Now we have to determine what type of scores we want to project as overlays in VOSviewer. We will show how to do this using journals; you can repeat the exercise for countries." 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "Scores in VOSviewer work as follows. For each score, VOSviewer calculates the average of that score over the documents that contain a specific term. It then colors the terms in the term map according to these averages.
This can then highlight certain parts of the map, showing where this score is particularly high or low. The objective now is to show this for journals, highlighting what part of the map is particularly relevant to a certain journal." 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "We will do this for each journal separately. At the moment, the journal is contained in the field `SO`." 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "publications_df['SO']" 160 | ] 161 | }, 162 | { 163 | "cell_type": "markdown", 164 | "metadata": {}, 165 | "source": [ 166 | "You may remember that you can group the dataframe by the journal to get an overview per journal." 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": null, 172 | "metadata": {}, 173 | "outputs": [], 174 | "source": [ 175 | "publications_df.groupby('SO').size().sort_values(ascending=False)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "Now we would like to translate the `SO` column in such a way that VOSviewer can show a separate overlay for each journal. For those of you who are familiar with statistics, we will do this using so-called \"dummy\" variables. That is, for each journal, we will create a new column, and indicate whether the publication is from that journal (Yes, `1`) or not (No, `0`). If VOSviewer then takes the average, this comes down to showing the percentage of publications with a certain term that are published in that journal. Fortunately, this is implemented in `pandas`, so we can easily do that." 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": null, 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "journal_scores_df = publications_df['SO'].str.get_dummies()" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "If we now look at journal_scores_df, you will see many column names that represent the journals, and only `0` or `1` in each entry." 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "journal_scores_df.head()" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "VOSviewer expects a specific column name for scores. In particular, it should be called `Score<...>`. We therefore change the column names accordingly." 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "journal_scores_df.columns = ['Score<{}>'.format(c) for c in journal_scores_df.columns]" 224 | ] 225 | }, 226 | { 227 | "cell_type": "markdown", 228 | "metadata": {}, 229 | "source": [ 230 | "Finally, we write the dataframe to a scores file, which should be tab-delimited."
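To see concretely what `get_dummies` and the `Score<...>` renaming produce, here is a minimal, self-contained sketch on hypothetical toy data (the journal names below are made up); averaging such 0/1 columns over the documents that contain a term gives the share of those documents published in each journal, which is what VOSviewer displays as an overlay.

```python
import pandas as pd

# Hypothetical toy data, purely to illustrate the dummy-variable idea;
# the real notebook uses the 'SO' column of publications_df.
toy_df = pd.DataFrame({'SO': ['JOURNAL A', 'JOURNAL B', 'JOURNAL A']})

# One 0/1 column per distinct journal
toy_scores_df = toy_df['SO'].str.get_dummies()

# Rename the columns to the Score<...> convention expected by VOSviewer
toy_scores_df.columns = ['Score<{}>'.format(c) for c in toy_scores_df.columns]

# Writing this tab-delimited yields a file in the format of a VOSviewer scores file
print(toy_scores_df.to_csv(sep='\t', index=False))
```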
231 | ] 232 | }, 233 | { 234 | "cell_type": "code", 235 | "execution_count": null, 236 | "metadata": {}, 237 | "outputs": [], 238 | "source": [ 239 | "journal_scores_df.to_csv('scores.txt', sep='\\t', index=None)" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "## VOSviewer" 247 | ] 248 | }, 249 | { 250 | "cell_type": "markdown", 251 | "metadata": {}, 252 | "source": [ 253 | "You can now create a term map in VOSviewer using the two files you produced `corpus.txt` and `scores.txt`. To create a term map based on these files, choose \"Create a map based on text data\" in VOSviewer, and then select \"Read data from VOSviewer files.\"" 254 | ] 255 | }, 256 | { 257 | "cell_type": "markdown", 258 | "metadata": {}, 259 | "source": [ 260 | "# Exercise Document type" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "
\n", 268 | " Now repeat the same exercise but using the document type DT.\n", 269 | "
" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": null, 275 | "metadata": {}, 276 | "outputs": [], 277 | "source": [] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "
\n", 284 | " Create the term map in VOSviewer with the document type score file. Does the category of \"Meeting Abstract\" show a particular pattern? Why (not)? Can you explain you observation?\n", 285 | "
" 286 | ] 287 | }, 288 | { 289 | "cell_type": "markdown", 290 | "metadata": {}, 291 | "source": [ 292 | "
\n", 293 | " You probably now have two different dataframes. You then cannot see the document type overlay at the same time as the journal overlay. Could you try to combine the two dataframes? (Hint: check out the concat function we encountered earlier.)\n", 294 | "
" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [] 303 | } 304 | ], 305 | "metadata": { 306 | "kernelspec": { 307 | "display_name": "Python 3", 308 | "language": "python", 309 | "name": "python3" 310 | }, 311 | "language_info": { 312 | "codemirror_mode": { 313 | "name": "ipython", 314 | "version": 3 315 | }, 316 | "file_extension": ".py", 317 | "mimetype": "text/x-python", 318 | "name": "python", 319 | "nbconvert_exporter": "python", 320 | "pygments_lexer": "ipython3", 321 | "version": "3.8.3" 322 | } 323 | }, 324 | "nbformat": 4, 325 | "nbformat_minor": 4 326 | } 327 | -------------------------------------------------------------------------------- /data-files/vosviewer/terms.txt: -------------------------------------------------------------------------------- 1 | id term occurrences relevance score 2 | 1 a lumbricoide 11 2.3141 3 | 2 abundance 28 0.8707 4 | 3 acceptability 17 0.8648 5 | 4 access 76 0.9195 6 | 5 accuracy 32 0.2844 7 | 6 act 26 1.6277 8 | 7 action 25 0.7057 9 | 8 acts 10 2.301 10 | 9 adherence 26 0.7939 11 | 10 administration 28 0.736 12 | 11 admission 21 0.9007 13 | 12 adverse event 16 2.0095 14 | 13 agreement 26 0.6926 15 | 14 amodiaquine 21 2.595 16 | 15 anaemia 30 0.6928 17 | 16 animal 56 0.7233 18 | 17 anopheles 10 1.0869 19 | 18 antibody 87 0.4992 20 | 19 antibody response 27 0.7887 21 | 20 antigen 108 0.6071 22 | 21 antigen detection 12 1.4239 23 | 22 antimalarial drug 10 1.5043 24 | 23 antimalarial treatment 18 1.6177 25 | 24 antiretroviral therapy 22 2.0214 26 | 25 antiretroviral treatment 29 1.8361 27 | 26 aor 13 0.7575 28 | 27 arabiensis 11 2.1203 29 | 28 art 34 1.8376 30 | 29 artemether lumefantrine 28 1.7528 31 | 30 artemisinin 45 1.2598 32 | 31 artesunate 41 1.6278 33 | 32 article 32 0.4991 34 | 33 ascaris lumbricoide 19 2.1097 35 | 34 assay 105 0.3265 36 | 35 asymptomatic individual 10 0.3452 37 | 36 attention 34 0.6734 38 | 37 attitude 15 1.1021 39 | 38 awareness 20 0.8483 40 | 39 bangladesh 14 1.3618 41 | 40 barrier 32 1.1333 42 | 41 bed net 26 0.9034 43 | 42 behaviour 43 0.7571 44 | 43 belgium 35 0.5284 45 | 44 benin 35 0.6813 46 | 45 bihar 33 1.3218 47 | 46 birth 19 1.0532 48 | 47 blood 98 0.2187 49 | 48 blood sample 87 0.4195 50 | 49 bolivia 18 1.0507 51 | 50 brazil 12 0.654 52 | 51 buruli ulcer 40 1.163 53 | 52 burundi 23 0.4816 54 | 53 card agglutination test 16 1.6911 55 | 54 care 145 1.1195 56 | 55 case control study 22 0.5025 57 | 56 case management 27 0.7348 58 | 57 case study 18 1.1898 59 | 58 catt 16 1.4986 60 | 59 cattle 31 0.958 61 | 60 cell 70 0.3892 62 | 61 central africa 14 0.3532 63 | 62 central vietnam 32 0.9589 64 | 63 cerebrospinal fluid 26 1.1873 65 | 64 chagas disease 20 0.8005 66 | 65 chloroquine 26 1.4592 67 | 66 choice 27 0.5989 68 | 67 classification 14 0.3726 69 | 68 clinic 41 0.92 70 | 69 clinical isolate 14 0.946 71 | 70 clinical malaria 19 0.6313 72 | 71 clinical sample 14 0.8511 73 | 72 clinical sign 12 0.3697 74 | 73 clinical trial 64 0.7364 75 | 74 clinician 12 0.5821 76 | 75 combination therapy 45 1.2739 77 | 76 combination treatment 13 1.973 78 | 77 community health worker 10 1.1139 79 | 78 compliance 26 0.9479 80 | 79 complication 32 0.6004 81 | 80 compound 25 0.4168 82 | 81 conclusion significance 15 0.433 83 | 82 conclusions significance 67 0.4678 84 | 83 congenital chagas disease 12 2.276 85 | 84 congenital infection 15 2.1774 86 | 85 congo 164 0.5043 87 | 86 control group 14 0.5092 88 | 87 cost effectiveness 30 0.6627 89 | 
88 cote divoire 16 0.5223 90 | 89 count 64 0.5516 91 | 90 coverage 61 0.7996 92 | 91 csf 18 1.2873 93 | 92 csp 10 1.1474 94 | 93 cuba 46 1.0261 95 | 94 culture 56 0.3843 96 | 95 cure 20 0.3492 97 | 96 cure rate 21 1.1835 98 | 97 curtain 12 2.1872 99 | 98 cutaneous leishmaniasis 24 0.5467 100 | 99 cyst 20 1.1545 101 | 100 cysticercosis 45 1.9432 102 | 101 dat 21 1.3285 103 | 102 ddt 13 2.3299 104 | 103 degrees c 19 0.2078 105 | 104 delay 31 0.6738 106 | 105 delivery 53 1.1261 107 | 106 deltamethrin 19 2.3467 108 | 107 democratic republic 157 0.5191 109 | 108 dengue 28 0.7332 110 | 109 density 47 0.5081 111 | 110 detection 182 0.2664 112 | 111 diabete 10 0.9531 113 | 112 diagnostic accuracy 21 0.731 114 | 113 diagnostic performance 13 0.8762 115 | 114 diagnostic test 26 0.4004 116 | 115 diagnostic tool 29 0.3881 117 | 116 diarrhoea 10 1.0987 118 | 117 dihydroartemisinin piperaquine 12 2.1793 119 | 118 diptera 18 1.1759 120 | 119 direct agglutination test 21 1.2924 121 | 120 discharge 13 1.325 122 | 121 diversity 30 0.4426 123 | 122 dna 67 0.5467 124 | 123 dog 20 0.8383 125 | 124 domestic animal 11 1.2878 126 | 125 dose 90 0.6189 127 | 126 dr congo 20 0.4279 128 | 127 drc 57 0.554 129 | 128 drug efficacy 19 1.3295 130 | 129 east africa 21 0.3362 131 | 130 ecology 12 0.8811 132 | 131 editorial 11 2.3821 133 | 132 education 30 0.6797 134 | 133 effectiveness 88 0.7578 135 | 134 efficacy 172 0.6426 136 | 135 efficiency 12 0.6026 137 | 136 egg 49 1.2147 138 | 137 elisa 73 0.8491 139 | 138 endemic setting 25 0.7431 140 | 139 environmental factor 13 0.8999 141 | 140 enzyme 50 0.8277 142 | 141 epilepsy 35 1.7675 143 | 142 europe 40 0.4784 144 | 143 example 19 0.6507 145 | 144 expectation 13 1.0642 146 | 145 experience 68 1.0056 147 | 146 experiment 22 0.5704 148 | 147 expression 24 0.7327 149 | 148 facility 39 0.8738 150 | 149 faecal sample 16 0.8769 151 | 150 failure 38 0.6996 152 | 151 falciparum malaria 14 0.7246 153 | 152 feasibility 29 0.4219 154 | 153 fec 10 2.423 155 | 154 field condition 18 0.4187 156 | 155 filter paper 13 0.8792 157 | 156 first line treatment 26 0.9524 158 | 157 first report 10 0.4715 159 | 158 first time 16 0.5756 160 | 159 fly 13 1.9506 161 | 160 focus group discussion 14 1.1715 162 | 161 fold 18 0.3602 163 | 162 forest 14 1.0595 164 | 163 forest malaria 10 1.4586 165 | 164 formulation 24 0.8315 166 | 165 framework 28 0.6275 167 | 166 gambiae 16 2.2306 168 | 167 gambiense 35 1.3564 169 | 168 gender 22 0.4645 170 | 169 gene 81 0.422 171 | 170 genetic diversity 22 0.4079 172 | 171 genotype 36 0.4428 173 | 172 goal 22 0.6184 174 | 173 goat 12 1.23 175 | 174 government 15 1.1415 176 | 175 hat 45 0.9611 177 | 176 health 108 0.652 178 | 177 health care 38 1.7577 179 | 178 health centre 52 1.0803 180 | 179 health district 22 1.3851 181 | 180 health facility 52 0.7874 182 | 181 health service 58 1.6105 183 | 182 health system 51 1.8437 184 | 183 health worker 32 0.8809 185 | 184 healthy control 13 0.691 186 | 185 helminth 31 1.7871 187 | 186 helminth infection 18 2.5369 188 | 187 high sensitivity 12 0.5359 189 | 188 higher risk 24 0.5486 190 | 189 histidine rich protein 15 1.6542 191 | 190 hiv 84 0.8617 192 | 191 hiv aids 15 1.2892 193 | 192 hiv infection 14 0.8844 194 | 193 hiv testing 12 2.4182 195 | 194 home 26 0.9487 196 | 195 hookworm 16 2.7181 197 | 196 hospital 125 0.5609 198 | 197 host 75 0.6726 199 | 198 hour 19 0.4183 200 | 199 house 39 0.7926 201 | 200 hrp 12 1.7831 202 | 201 human 94 0.5259 203 | 202 human african trypanosomiasis 61 0.9204 204 | 203 human cysticercosis 
17 2.2077 205 | 204 human host 11 0.7954 206 | 205 human immunodeficiency virus 15 0.5911 207 | 206 human infection 13 0.7083 208 | 207 identification 84 0.2787 209 | 208 ifn gamma 13 1.0255 210 | 209 igm 14 0.8616 211 | 210 illness 37 0.4749 212 | 211 immune response 33 0.7466 213 | 212 immunization 12 1.3073 214 | 213 immunogenicity 47 1.9979 215 | 214 immunosorbent assay 34 0.967 216 | 215 implementation 85 0.5954 217 | 216 important cause 11 0.7125 218 | 217 important role 13 0.5971 219 | 218 incidence rate 20 0.5325 220 | 219 india 84 0.7692 221 | 220 indian subcontinent 36 1.2129 222 | 221 indirect cost 11 1.4543 223 | 222 indoor residual spraying 20 1.8447 224 | 223 infant 36 0.9412 225 | 224 infected mother 13 1.7968 226 | 225 infection intensity 13 2.1954 227 | 226 infection rate 20 0.667 228 | 227 inflammation 14 0.6263 229 | 228 initiative 41 0.9908 230 | 229 insecticidal net 13 2.5042 231 | 230 insecticide 69 1.2326 232 | 231 insecticide resistance 16 2.0071 233 | 232 integration 22 1.2484 234 | 233 intensity 62 0.3743 235 | 234 intermittent preventive treatment 25 1.1001 236 | 235 interview 58 1.0813 237 | 236 iqr 18 0.8924 238 | 237 irs 15 1.8777 239 | 238 isolate 66 0.5381 240 | 239 issue 41 0.5952 241 | 240 ixodes ricinus 12 1.7308 242 | 241 kala azar 24 1.0179 243 | 242 kappa 10 1.0776 244 | 243 kinshasa 50 0.5328 245 | 244 l donovani 13 1.244 246 | 245 laboratory 48 0.2708 247 | 246 larvae 19 0.8288 248 | 247 larval stage 14 1.246 249 | 248 latin america 21 0.5882 250 | 249 leishmania 35 0.7479 251 | 250 leishmania donovani 34 1.0186 252 | 251 leishmania donovani infection 10 1.909 253 | 252 leishmaniasis 22 0.6317 254 | 253 lesion 40 0.5606 255 | 254 lesson 23 1.4635 256 | 255 life cycle 13 0.891 257 | 256 light 16 0.4287 258 | 257 limit 29 0.3421 259 | 258 line 37 0.3771 260 | 259 lineage 18 0.943 261 | 260 literature 37 0.3828 262 | 261 liver 16 0.5715 263 | 262 livestock 15 0.6055 264 | 263 llin 23 2.3642 265 | 264 llins 15 2.6416 266 | 265 locality 16 0.4866 267 | 266 logistic regression 23 0.3479 268 | 267 long lasting insecticidal net 15 1.7476 269 | 268 longitudinal study 15 0.4194 270 | 269 low birth weight 12 1.5831 271 | 270 low income country 25 0.708 272 | 271 low level 10 0.5379 273 | 272 m ulceran 18 1.3594 274 | 273 malaria burden 22 0.693 275 | 274 malaria control 28 0.6064 276 | 275 malaria diagnosis 20 0.749 277 | 276 malaria incidence 19 0.71 278 | 277 malaria infection 54 0.4422 279 | 278 malaria parasite 17 0.4658 280 | 279 malaria prevalence 21 0.5361 281 | 280 malaria rapid diagnostic test 32 1.1088 282 | 281 malaria transmission 62 0.5975 283 | 282 malaria transmission intensity 10 0.7573 284 | 283 malaria treatment 16 1.002 285 | 284 malaria vaccine 28 1.3842 286 | 285 malaria vector 26 1.7322 287 | 286 malawi 21 1.7172 288 | 287 mali 29 0.7531 289 | 288 malnutrition 18 0.7466 290 | 289 manufacturer 10 0.5714 291 | 290 map 27 0.5847 292 | 291 mapping 15 0.6918 293 | 292 mass drug administration 16 0.8529 294 | 293 mda 12 0.6935 295 | 294 mean 35 0.4014 296 | 295 mean age 13 0.7752 297 | 296 medecins sans frontieres 14 0.9379 298 | 297 median 28 0.635 299 | 298 median age 25 0.8884 300 | 299 medical record 11 1.0352 301 | 300 medium 20 0.4754 302 | 301 medline 11 1.0152 303 | 302 meta analysis 13 0.6555 304 | 303 methodology principal finding 15 0.5703 305 | 304 methodology principal findings 56 0.5121 306 | 305 mg kg 17 0.5852 307 | 306 microscopy 66 0.4977 308 | 307 middle income country 10 0.766 309 | 308 migrant 13 0.5622 310 | 309 miltefosine 
15 0.8129 311 | 310 ministry 18 1.167 312 | 311 mixed infection 17 0.8085 313 | 312 monoclonal antibody 17 1.2685 314 | 313 monotherapy 16 1.2203 315 | 314 morocco 14 0.6616 316 | 315 mosquito 51 1.0189 317 | 316 mother 48 0.8227 318 | 317 mouse 45 0.7895 319 | 318 mozambique 30 0.7535 320 | 319 msf 11 1.1882 321 | 320 mu l 29 0.9433 322 | 321 mutation 49 0.5988 323 | 322 mycobacterium ulceran 23 1.3135 324 | 323 mycobacterium ulcerans disease 12 1.6582 325 | 324 neglected disease 12 0.6051 326 | 325 neglected tropical disease 14 0.7484 327 | 326 nepal 62 1.0398 328 | 327 net 47 1.5551 329 | 328 neurocysticercosis 29 1.8901 330 | 329 new infection 12 1.8707 331 | 330 newborn 25 1.2266 332 | 331 niger 11 0.8743 333 | 332 none 43 0.2978 334 | 333 northern senegal 19 1.6002 335 | 334 nurse 15 1.3774 336 | 335 observational study 13 0.9762 337 | 336 onchocerciasis 16 0.6799 338 | 337 opportunity 34 0.8025 339 | 338 organization 52 0.5218 340 | 339 overall sensitivity 13 1.4718 341 | 340 overview 17 0.5264 342 | 341 p falciparum 38 1.0042 343 | 342 p falciparum malaria 16 0.9503 344 | 343 p vivax 28 1.1235 345 | 344 pair 15 0.425 346 | 345 panel 36 0.7332 347 | 346 parasitaemia 35 0.3778 348 | 347 parasite density 27 1.1346 349 | 348 participation 17 1.6009 350 | 349 pathogen 41 0.462 351 | 350 pcr 152 0.297 352 | 351 pcr rflp 19 0.5897 353 | 352 perception 35 0.9809 354 | 353 permanet 16 2.8893 355 | 354 person year 12 0.7189 356 | 355 perspective 37 0.8188 357 | 356 peruvian amazon 21 0.6081 358 | 357 phase 64 0.5029 359 | 358 phlebotomus argentipe 12 1.8954 360 | 359 pig 44 1.2969 361 | 360 pilot study 11 0.8983 362 | 361 pkdl 11 1.6392 363 | 362 placebo 11 1.2504 364 | 363 plasma 17 0.8023 365 | 364 plasmodium 37 1.0145 366 | 365 plasmodium falciparum 75 0.4626 367 | 366 plasmodium falciparum infection 12 0.801 368 | 367 plasmodium falciparum malaria 22 1.005 369 | 368 plasmodium malariae 17 1.5481 370 | 369 plasmodium ovale 12 2.1912 371 | 370 plasmodium species 15 1.2288 372 | 371 plasmodium vivax 27 0.877 373 | 372 policy 62 0.6073 374 | 373 polymerase chain reaction 53 0.3009 375 | 374 polymorphism 34 0.6202 376 | 375 porcine cysticercosis 20 2.209 377 | 376 positive predictive value 14 0.3762 378 | 377 post kala azar dermal leishmaniasis 13 1.6738 379 | 378 poverty 22 0.656 380 | 379 praziquantel 14 1.3558 381 | 380 pregnancy 60 0.9399 382 | 381 pregnant woman 60 0.934 383 | 382 present study 63 0.256 384 | 383 prevention 62 0.6867 385 | 384 principle 18 0.2863 386 | 385 programme 166 0.4473 387 | 386 progression 17 0.3755 388 | 387 proof 12 0.4392 389 | 388 prospective study 14 0.4042 390 | 389 protection 51 0.7418 391 | 390 protein 59 0.5995 392 | 391 provincial hospital 11 1.3132 393 | 392 public health 21 0.5638 394 | 393 pyrethroid 13 2.1532 395 | 394 qpcr 11 0.5278 396 | 395 qualitative study 19 1.341 397 | 396 quantification 19 0.4003 398 | 397 rapid diagnostic test 59 0.5282 399 | 398 rdt 47 0.6766 400 | 399 rdts 33 0.6877 401 | 400 real time pcr 28 0.6338 402 | 401 reference laboratory 10 0.804 403 | 402 reference method 11 1.9641 404 | 403 referral 15 1.253 405 | 404 regulation 11 0.5623 406 | 405 relapse 27 0.5991 407 | 406 reproducibility 21 0.9738 408 | 407 researcher 12 0.9098 409 | 408 residence 16 0.5061 410 | 409 resistance 159 0.3618 411 | 410 resource 65 0.6262 412 | 411 resource limited setting 11 0.9349 413 | 412 resource poor setting 10 2.0335 414 | 413 respondent 16 0.9775 415 | 414 retention 15 1.3042 416 | 415 rodent 18 0.9648 417 | 416 rts 12 1.1688 418 | 417 
rural burkina faso 18 1.814 419 | 418 rural community 19 0.7496 420 | 419 rural district 17 1.5223 421 | 420 rwanda 38 1.1134 422 | 421 s haematobium 24 1.4706 423 | 422 s mansoni 21 1.6004 424 | 423 s mansoni infection 11 1.4071 425 | 424 safety 98 1.2279 426 | 425 sample 202 0.2239 427 | 426 sand fly 14 1.2369 428 | 427 schistosoma haematobium 15 1.3448 429 | 428 schistosoma mansoni 29 1.3119 430 | 429 schistosoma mansoni infection 18 1.6762 431 | 430 schistosomiasis 45 0.915 432 | 431 school 17 1.4877 433 | 432 schoolchild 26 1.0815 434 | 433 semi 13 0.6222 435 | 434 senegal 27 0.7378 436 | 435 sensitivity 155 0.434 437 | 436 september 30 0.5516 438 | 437 sequence 47 0.5255 439 | 438 sequencing 27 0.4027 440 | 439 sera 30 0.7373 441 | 440 serious adverse event 16 1.982 442 | 441 serology 17 0.6214 443 | 442 seropositivity 15 1.0083 444 | 443 seroprevalence 23 1.0039 445 | 444 serum 38 0.7358 446 | 445 serum sample 24 0.9064 447 | 446 service 57 1.2804 448 | 447 severe malaria 19 0.6225 449 | 448 sheep 11 1.534 450 | 449 short report 12 0.7082 451 | 450 sickness 40 1.0169 452 | 451 sickness patient 12 1.7034 453 | 452 sierra leone 15 0.8664 454 | 453 significant association 15 0.4252 455 | 454 significant correlation 10 0.8619 456 | 455 significant reduction 11 0.9272 457 | 456 single dose 17 1.3867 458 | 457 skin 17 1.0023 459 | 458 socio economic status 12 0.8796 460 | 459 soil 44 1.9191 461 | 460 south africa 28 0.6455 462 | 461 south america 13 0.3408 463 | 462 southeast asia 19 0.9771 464 | 463 southern benin 10 1.3566 465 | 464 spatial distribution 13 0.7991 466 | 465 species 170 0.4428 467 | 466 species identification 21 0.6765 468 | 467 specific antibody 17 1.057 469 | 468 specificity 125 0.4021 470 | 469 specimen 43 0.3643 471 | 470 staff 32 0.7186 472 | 471 start 16 0.5357 473 | 472 sth 22 2.3533 474 | 473 sth infection 11 2.3304 475 | 474 stool 35 0.9815 476 | 475 stool sample 31 1.0999 477 | 476 strain 77 0.4459 478 | 477 subset 16 0.5158 479 | 478 sudan 26 0.5484 480 | 479 sulfadoxine pyrimethamine 25 1.5292 481 | 480 sulphadoxine pyrimethamine 26 1.5423 482 | 481 supervision 14 1.5976 483 | 482 supply 21 1.2577 484 | 483 support 38 1.1209 485 | 484 susceptibility 49 0.4468 486 | 485 systematic review 31 0.5215 487 | 486 t b 16 1.4086 488 | 487 t congolense 11 1.7586 489 | 488 t cruzi 16 1.7261 490 | 489 t solium 25 1.7418 491 | 490 tablet 18 1.1446 492 | 491 taenia solium 27 1.7986 493 | 492 taenia solium cysticercosis 23 1.9692 494 | 493 taeniasis 15 1.8607 495 | 494 technique 87 0.2832 496 | 495 test 211 0.2321 497 | 496 test result 27 0.6688 498 | 497 tete 11 1.6207 499 | 498 tick 21 1.2533 500 | 499 time point 12 0.4318 501 | 500 tissue 25 0.5192 502 | 501 titre 20 0.7125 503 | 502 tolerability 18 1.8715 504 | 503 total cost 14 1.1251 505 | 504 training 32 0.9558 506 | 505 transmission dynamic 16 0.7011 507 | 506 trap 18 0.8482 508 | 507 traveler 13 0.9764 509 | 508 treatment failure 38 0.6839 510 | 509 treatment outcome 33 0.4888 511 | 510 trial 142 0.4758 512 | 511 trichuris trichiura 18 2.8185 513 | 512 tropical disease 11 0.6173 514 | 513 tropical medicine 16 0.5709 515 | 514 trypanosoma 50 1.0323 516 | 515 trypanosoma brucei 13 1.5148 517 | 516 trypanosoma brucei gambiense 15 1.4779 518 | 517 trypanosoma congolense 11 1.5734 519 | 518 trypanosoma cruzi 25 1.8604 520 | 519 trypanosome 33 1.1162 521 | 520 trypanosomiasis 21 1.4596 522 | 521 trypanosomosis 17 1.2905 523 | 522 tsetse 14 1.6541 524 | 523 tsetse fly 17 1.971 525 | 524 uncomplicated falciparum malaria 17 
2.1791 526 | 525 uncomplicated malaria 30 1.7924 527 | 526 uncomplicated plasmodium falciparum malaria 19 2.4514 528 | 527 uptake 29 0.7425 529 | 528 urban area 24 0.4335 530 | 529 urine 24 0.8298 531 | 530 urine sample 13 0.9968 532 | 531 vaccination 38 1.0759 533 | 532 vaccine 70 1.0634 534 | 533 validation 17 0.3754 535 | 534 validity 16 0.366 536 | 535 vector 93 0.6492 537 | 536 vector control 34 1.0335 538 | 537 venezuela 10 1.4684 539 | 538 vietnam 61 0.579 540 | 539 virus 48 0.4782 541 | 540 visceral leishmaniasis 129 0.8124 542 | 541 visit 32 0.9114 543 | 542 vitro 28 0.5322 544 | 543 vivo 10 0.6591 545 | 544 vl case 16 1.3056 546 | 545 vl patient 18 0.8485 547 | 546 vl treatment 12 1.2288 548 | 547 volunteer 22 0.6536 549 | 548 weight 33 0.5841 550 | 549 western kenya 11 0.8439 551 | 550 whole blood 12 1.3525 552 | 551 wide range 13 0.4298 553 | 552 woman 80 0.8942 554 | 553 year period 15 0.7455 555 | 554 zoonotic disease 13 0.8331 556 | -------------------------------------------------------------------------------- /02-advanced.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In the previous sessions, you learned how to construct scientometric networks in Python. It was clear that this can be quite challenging. VOSviewer takes care of a lot of the necessary work in creating scientometric networks. You can hence use VOSviewer to create networks, which you could then export and analyse further in Python. We will here take this approach." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## VOSviewer" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "You have previously constructed scientometric networks using VOSviewer. You can import the resulting network for further analysis in `igraph`. In order to import the file in `igraph` you need to have saved both the `map` file and the `network` file in VOSviewer. See the manual of VOSviewer for more explanation. As in the previous Python notebook, we have prepared some files for you, in this case the author collaboration network from the Web of Science files that we analysed previously." 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "We first import the necessary packages. You will presumably recognize these still from the previous Python notebook." 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "import pandas as pd\n", 45 | "import igraph as ig" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "Now let us read the map and network file from VOSviewer." 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "metadata": {}, 58 | "source": [ 59 | "
\n", 60 | " Read the file data-files/vosviewer/vosviewer_map.txt using tabs ('\\t') as a field separator, and call the resulting variable map_df.\n", 61 | "
" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "The network file from VOSviewer has no header, so we set it manually" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "network_df = pd.read_csv('data-files/vosviewer/vosviewer_network.txt', sep='\\t', header=None,\n", 85 | " names=['idA', 'idB', 'weight'])" 86 | ] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "Now we have loaded the data, so we can simply construct a network as before." 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "G_vosviewer = ig.Graph.DictList(\n", 102 | " vertices=map_df.to_dict('records'),\n", 103 | " edges=network_df.to_dict('records'),\n", 104 | " vertex_name_attr='id',\n", 105 | " edge_foreign_keys=('idA', 'idB'),\n", 106 | " directed=False\n", 107 | " )" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "The layout and clustering is also stored by VOSviewer, and we can use that to display the same visualization in `igraph`." 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "layout = ig.Layout(coords=zip(*[G_vosviewer.vs['x'], G_vosviewer.vs['y']]))\n", 124 | "clustering = ig.VertexClustering.FromAttribute(G_vosviewer, 'cluster')\n", 125 | "\n", 126 | "ig.plot(clustering, layout=layout, vertex_size=4, vertex_frame_width=0, vertex_label=None)" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "## Clustering" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "A common phenomenon in many networks is the presence of group structure, where nodes within the same group are densely connected. Such a structure is sometimes called a *modular* structure, and a frequently used measure of group structure is known as *modularity*. You have already encountered this functionality briefly in VOSviewer, which provides clusters. Here we will explore this a bit more in-depth." 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "metadata": {}, 146 | "source": [ 147 | "First, we will import a package called `leidenalg` which is the *Leiden algorithm*, which we will use for clustering. It is built on top of `igraph` so that it easily integrates with all the exisiting methods of `igraph`." 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "import leidenalg" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "metadata": {}, 162 | "source": [ 163 | "Now let us find clusters in the collaboration network from VOSviewer, using the weight of the edges. Because the algorithm is stochastic, it may yield somewhat different results every time you run it. To prevent that from happening, and to always get the same result, we will set the random seed to 0. The result is a `VertexClustering`, which we already briefly encountered when using the clustering results from VOSviewer." 
164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "We will first find clusters using *modularity*." 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "optimiser = leidenalg.Optimiser()\n", 180 | "optimiser.set_rng_seed(0)\n", 181 | "clusters = leidenalg.ModularityVertexPartition(G_vosviewer, weights='weight')\n", 182 | "optimiser.optimise_partition(clusters)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "markdown", 187 | "metadata": {}, 188 | "source": [ 189 | "The length of the `clusters` variable indicates the number of clusters." 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "len(clusters)" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "When accessing the `clusters` variable as a list, each element corresponds to the set of nodes in that cluster." 206 | ] 207 | }, 208 | { 209 | "cell_type": "markdown", 210 | "metadata": {}, 211 | "source": [ 212 | "<div class=\"alert alert-block alert-info\">
\n", 213 | " What are the nodes in cluster 30?\n", 214 | "
" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": null, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "Hence, node `548`, node `1052`, etc... belong to cluster `30`. Another way to look at the clusters is by looking at the `membership` of `clusters`." 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "
\n", 236 | " What is the membership of the first 10 nodes?\n", 237 | "
" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": null, 243 | "metadata": {}, 244 | "outputs": [], 245 | "source": [] 246 | }, 247 | { 248 | "cell_type": "markdown", 249 | "metadata": {}, 250 | "source": [ 251 | "Hence, node `0` belongs to cluster `7`, node `1` belongs to cluster `9`, node `2` belongs to cluster `4`, et cetera." 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "Let us take a closer look at the largest cluster." 259 | ] 260 | }, 261 | { 262 | "cell_type": "code", 263 | "execution_count": null, 264 | "metadata": {}, 265 | "outputs": [], 266 | "source": [ 267 | "H = clusters.giant()\n", 268 | "print(H.summary())" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "We could again detect clusters using modularity in the largest cluster." 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": null, 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "optimiser.set_rng_seed(0)\n", 285 | "subclusters = leidenalg.ModularityVertexPartition(H, weights='weight')\n", 286 | "optimiser.optimise_partition(subclusters)\n", 287 | "ig.plot(subclusters, vertex_size=5, vertex_label=None)" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "In general, modularity will continue to find subclusters in this way. An alternative approach, called CPM, does not suffer from that problem. \n", 295 | "\n", 296 | "Let us detect clusters using CPM. We do have to specify a parameter, called the `resolution_parameter`. As its name suggests, it specifies the resolution of the clusters we would like to find. At a higher resolution we will tend to find smaller clusters, while at a lower resolution we find larger clusters. Let us use the resolution parameter 0.01." 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": null, 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "optimiser.set_rng_seed(0)\n", 306 | "clusters = leidenalg.CPMVertexPartition(G_vosviewer,\n", 307 | " weights='weight',\n", 308 | " resolution_parameter=0.1)\n", 309 | "optimiser.optimise_partition(clusters)\n", 310 | "clusters.giant().vcount()" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "
\n", 318 | "Detect subclusters in the largest cluster using CPM, using the same resolution_parameter. How many subclusters do you find? How does that compare to modularity?\n", 319 | "
" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "
\n", 334 | "Try to find more subclusters by specifying a higher resolution_parameter.\n", 335 | "
" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": {}, 342 | "outputs": [], 343 | "source": [] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "Modularity adapts itself to the network. In a sense that is convenient, because you then do not have to specify any parameters. On the other hand, it makes the definition of what a \"cluster\" is less clear.\n", 350 | "\n", 351 | "CPM does not adapt itself to the network, and maintains the same defintion across different networks. That is convenient, because it brings more clarity to what we mean by a \"cluster\". Whenever you try to find subclusters using the same `resolution_parameter`, CPM should not find any subclusters. In practice, it may happen that CPM still finds some subclusters, in which case the original clusters were actually not the best possible. The Leiden algorithm can be run for multiple iterations, and with each iteration, the chances are smaller that CPM would find such subclusters. Modularity will always find subclusters, independent of the number of iterations." 352 | ] 353 | }, 354 | { 355 | "cell_type": "markdown", 356 | "metadata": {}, 357 | "source": [ 358 | "
\n", 359 | " Try to optimise the partition further. Note that the function optimise_partition returns how much further it managed to improve the function, so that if it returns 0.0, it means it couldn't find any further improvement. Execute the cell repeatedly. Does it return 0.0 after some time?\n", 360 | "
" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "Let us compare the clusters that we detected in Python with the clustering results from VOSviewer.\n", 375 | "\n", 376 | "We can summarize the overall similarity to the partition based on the disciplines using the Normalised Mutual Information (NMI). The NMI varies between 0 and 1 and equals 1 if both are identical." 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": null, 382 | "metadata": {}, 383 | "outputs": [], 384 | "source": [ 385 | "clusters.compare_to(clustering, method='nmi')" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "There are some differences between the clustering from VOSviewer and the clusters we detected in Python. This will of course highly depend on what resolution parameter we have used for both results. One other important difference is that VOSviewer will by default use *normalized* weights. By default, it will divide the weight of a link by the expected weight, assuming that the total link weight of each node would remain the same, which is sometimes referred to as the *association strength*. We also perform this normalization here." 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "G_vosviewer.es['weight_normalized'] = [\n", 402 | " e['weight']/( G_vosviewer.vs[e.source]['weight']*G_vosviewer.vs[e.target]['weight'] / (2*sum(G_vosviewer.es['weight'])) ) \n", 403 | " for e in G_vosviewer.es]" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "By default VOSviewer uses the default resolution of `1` for these normalized weights. If we now detect clusters using these weights, you will see that the result are more closely aligned to the VOSviewer results." 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": {}, 417 | "outputs": [], 418 | "source": [ 419 | "clusters = leidenalg.find_partition(G_vosviewer, leidenalg.CPMVertexPartition, \n", 420 | " weights='weight_normalized', resolution_parameter=1,\n", 421 | " n_iterations=10)\n", 422 | "\n", 423 | "clusters.compare_to(clustering, method='nmi')" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "Finally, the Leiden algorithm is also directly implemented in `igraph` itself nowadays. It is somewhat less elaborate than the `leidenalg` package, but it is also substantially faster. If you are analysing very large networks, it might be better to use the `igraph` Leiden algorithm. Using it is straightforward." 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": null, 436 | "metadata": {}, 437 | "outputs": [], 438 | "source": [ 439 | "clusters = G_vosviewer.community_leiden(objective_function='CPM',weights='weight_normalized', \n", 440 | " resolution_parameter=1.0, n_iterations=10)\n", 441 | "\n", 442 | "clusters.compare_to(clustering, method='nmi')" 443 | ] 444 | }, 445 | { 446 | "cell_type": "markdown", 447 | "metadata": {}, 448 | "source": [ 449 | "Now let us explore cluster detection a bit further." 450 | ] 451 | }, 452 | { 453 | "cell_type": "markdown", 454 | "metadata": {}, 455 | "source": [ 456 | "
\n", 457 | " Vary the resolution_parameter when detecting clusters using the CPM method. What resolution_parameter seems reasonable to you, and why?\n", 458 | "
" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": null, 464 | "metadata": {}, 465 | "outputs": [], 466 | "source": [] 467 | }, 468 | { 469 | "cell_type": "markdown", 470 | "metadata": {}, 471 | "source": [ 472 | "
\n", 473 | " Try to find a resolution_parameter such that the network separates in two large clusters (and some remaining small clusters). What is the cause of these two large clusters? (Hint: examine the author names)\n", 474 | "
" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": null, 480 | "metadata": {}, 481 | "outputs": [], 482 | "source": [] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "
\n", 489 | "Compare the co-authorship network that we created previously in Python to the network created in VOSviewer. What are the differences?\n", 490 | "
" 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "# Document-term clustering" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "metadata": {}, 510 | "source": [ 511 | "We will now use the same type of clustering technique that we used previously in a slightly different way. Instead of clustering a network, we will cluster a specific type of network, namely a bipartite network. This requires a slightly different (and more complicated) approach. More specifically, we will cluster a document-term network, where documents are linked to terms if those terms appear in a document.\n", 512 | "\n", 513 | "We leave the task of extracting terms to VOSviewer, and simply import the resulting document-term network in Python. At the end of the notebook, you will find instructions how to extract the document-term network from VOSviewer yourself.\n", 514 | "\n", 515 | "We read two files: (1) the `terms.txt` file, which simply contains the terms and their `id`; and (2) the `doc-term.txt` file, which contains which term occurs in which document. The `document id` refers to the line number of the WoS files that were read by VOSviewer. We will encounter this later." 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "terms_df = pd.read_csv('data-files/vosviewer/terms.txt', sep='\\t', index_col='id')\n", 525 | "doc_terms_df = pd.read_csv('data-files/vosviewer/doc-term.txt', sep='\\t')" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "In this file, both the documents and the terms are using the same numbers, so that `igraph` cannot distinguish them (e.g. there is both a document `1` and a term `1`). We therefore create separate ids for both the documents and the terms." 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "metadata": {}, 539 | "outputs": [], 540 | "source": [ 541 | "doc_terms_df['document id'] = doc_terms_df['document id'].map(lambda x: str(x) + '-doc');\n", 542 | "doc_terms_df['term id'] = doc_terms_df['term id'].map(lambda x: str(x) + '-term');" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "We can now create the network." 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "G_doc_term = ig.Graph.TupleList(\n", 559 | " edges=doc_terms_df.values,\n", 560 | " vertex_name_attr='id',\n", 561 | " directed=False\n", 562 | " )" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "This is a bipartite network, and we create a specific vertex attribute to indicate what the type is of the node: either a `doc` or a `term`." 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": null, 575 | "metadata": {}, 576 | "outputs": [], 577 | "source": [ 578 | "G_doc_term.vs['type'] = ['doc' if 'doc' in v['id'] else 'term' for v in G_doc_term.vs]" 579 | ] 580 | }, 581 | { 582 | "cell_type": "markdown", 583 | "metadata": {}, 584 | "source": [ 585 | "Similar to the co-authorship network, VOSviewer typically normalizes the weights in a network by using the association strength, and we will also use that here." 
586 | ] 587 | }, 588 | { 589 | "cell_type": "code", 590 | "execution_count": null, 591 | "metadata": {}, 592 | "outputs": [], 593 | "source": [ 594 | "G_doc_term.es['weight'] = [2.0*G_doc_term.ecount()/(G_doc_term.vs[e.source].degree()*G_doc_term.vs[e.target].degree()) \n", 595 | " for e in G_doc_term.es];" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "We now employ a small trick in the `leidenalg` package in order to do clustering in a bipartite network. We will not explain the full details here, please see the [documentation](https://leidenalg.readthedocs.io/en/latest/multiplex.html#bipartite) for a brief explanation of this approach. Please note that this approach is *not* possible using the internal `igraph` Leiden algorithm." 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": null, 608 | "metadata": {}, 609 | "outputs": [], 610 | "source": [ 611 | "partition, partition_docs, partition_terms = leidenalg.CPMVertexPartition.Bipartite(\n", 612 | " G_doc_term, types='type', weights='weight', resolution_parameter_01=1)" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "We are now ready to detect clusters, but we are going to use all three partitions we created. We do so by using the function `optimise_partition_multiplex` instead of the `optimise_partition` function that we used previously. We have to pass a list of partitions to that function. For the trick to work, we also need to pass the argument `layer_weights=[1,-1,-1]`, which assumes that the `partition` is the first element of the list that we pass." 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": null, 625 | "metadata": {}, 626 | "outputs": [], 627 | "source": [ 628 | "optimiser = leidenalg.Optimiser()\n", 629 | "optimiser.set_rng_seed(0)\n", 630 | "optimiser.optimise_partition_multiplex(\n", 631 | " [partition, partition_docs, partition_terms], \n", 632 | " layer_weights=[1,-1,-1], n_iterations=100)" 633 | ] 634 | }, 635 | { 636 | "cell_type": "markdown", 637 | "metadata": {}, 638 | "source": [ 639 | "Now `partition` contains the clustering results (actually, `partition_docs` and `partition_terms` contain the identical clustering results). We extract the cluster membership of each node, and make it a new node attribute." 640 | ] 641 | }, 642 | { 643 | "cell_type": "code", 644 | "execution_count": null, 645 | "metadata": {}, 646 | "outputs": [], 647 | "source": [ 648 | "G_doc_term.vs['cluster'] = partition.membership\n", 649 | "G_doc_term.vs['degree'] = G_doc_term.degree();" 650 | ] 651 | }, 652 | { 653 | "cell_type": "markdown", 654 | "metadata": {}, 655 | "source": [ 656 | "We will now create a so-called *projection* of the bipartite graph, which actually simply refers to the creation of a co-occurrence network." 
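To make the idea of a projection concrete, here is a minimal sketch on a hypothetical toy document-term network (the document and term names are made up). In the projection, two terms are connected if they occur in the same document, and the multiplicity counts in how many documents they co-occur.

```python
import igraph as ig

# Toy bipartite network: two documents and three terms (hypothetical names)
g = ig.Graph.TupleList([('doc1', 'malaria'), ('doc1', 'vaccine'),
                        ('doc2', 'malaria'), ('doc2', 'treatment')],
                       directed=False)
g.vs['type_int'] = [0 if v['name'].startswith('doc') else 1 for v in g.vs]

# Keep only the term side (which=1); 'multiplicity' counts shared documents
toy_terms = g.bipartite_projection(types='type_int', which=1, multiplicity=True)
print(toy_terms.vs['name'])          # ['malaria', 'vaccine', 'treatment']
print(toy_terms.es['multiplicity'])  # [1, 1]: each connected pair of terms shares one document
```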
657 | ] 658 | }, 659 | { 660 | "cell_type": "code", 661 | "execution_count": null, 662 | "metadata": {}, 663 | "outputs": [], 664 | "source": [ 665 | "G_doc_term.vs['type_int'] = [1 if v['type'] == 'term' else 0 for v in G_doc_term.vs];\n", 666 | "G_terms = G_doc_term.bipartite_projection(types='type_int', which=1);\n", 667 | "G_terms.simplify(combine_edges='sum');\n", 668 | "\n", 669 | "G_terms.vs['id'] = [int(v['id'][:-5]) for v in G_terms.vs];\n", 670 | "G_terms.vs['term'] = [terms_df.loc[v['id'],'term'] for v in G_terms.vs];" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "Now `G_terms` contains only terms and the co-occurrence between them. We will export this network to a file format so that we can read it back into VOSviewer. First, let us create the output directory (if necessary)." 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": null, 683 | "metadata": {}, 684 | "outputs": [], 685 | "source": [ 686 | "import os\n", 687 | "output_dir = 'results/'\n", 688 | "if not os.path.exists(output_dir):\n", 689 | " os.makedirs(output_dir)" 690 | ] 691 | }, 692 | { 693 | "cell_type": "markdown", 694 | "metadata": {}, 695 | "source": [ 696 | "Now we export the network `G_terms` in file format which is understandable to VOSviewer." 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": null, 702 | "metadata": {}, 703 | "outputs": [], 704 | "source": [ 705 | "nodes_df = pd.DataFrame.from_dict({attr: G_terms.vs[attr] for attr in G_terms.vs.attributes()});\n", 706 | "nodes_df['label'] = nodes_df['term'];\n", 707 | "nodes_df['cluster'] += 1;\n", 708 | "nodes_df['weight'] = nodes_df['degree'];\n", 709 | "nodes_df = nodes_df.sort_values('id')\n", 710 | "nodes_df[['id', 'label', 'cluster', 'weight']].to_csv(output_dir + 'map_vosviewer.txt', sep='\\t', index=False);\n", 711 | "\n", 712 | "edge_df = pd.DataFrame([(G_terms.vs[e.source]['id'], G_terms.vs[e.target]['id'], e['weight']) for e in G_terms.es],\n", 713 | " columns=['source', 'target', 'weight']);\n", 714 | "edge_df = edge_df.sort_values(['source', 'target']);\n", 715 | "edge_df.to_csv(output_dir + 'network_vosviewer.txt', sep='\\t', index=False, header=False);" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "metadata": {}, 721 | "source": [ 722 | "The great benefit of doing the clustering in Python is that we now also have a clustering of the publications. This is something that is not possible in VOSviewer." 723 | ] 724 | }, 725 | { 726 | "cell_type": "markdown", 727 | "metadata": {}, 728 | "source": [ 729 | "Let us first load the actual publication files which were used by VOSviewer (we have already done this in the previous notebook). As said, the `document id` refers to the line number of the WoS files that were read by VOSviewer, starting from `1`. We therefore also create a `document id` that is the same." 
730 | ] 731 | }, 732 | { 733 | "cell_type": "code", 734 | "execution_count": null, 735 | "metadata": {}, 736 | "outputs": [], 737 | "source": [ 738 | "import glob\n", 739 | "import csv\n", 740 | "files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n", 741 | "publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n", 742 | " usecols=range(68), index_col='UT') for f in files)\n", 743 | "publications_df['document id'] = range(1,publications_df.shape[0]+1)" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "Now let us create a dataframe from `G_doc_term` with all the information from the documents." 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": {}, 757 | "outputs": [], 758 | "source": [ 759 | "nodes_df = pd.DataFrame.from_dict({attr: G_doc_term.vs[attr] for attr in G_doc_term.vs.attributes()});\n", 760 | "nodes_df = nodes_df[nodes_df['type'] == 'doc'];" 761 | ] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "metadata": {}, 766 | "source": [ 767 | "Now we need back the original integer `document id`, instead of the identifiers we created `doc-1`, `doc-2`, etc... We can then use those `document id` to merge back the results with the original information from the publications." 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": null, 773 | "metadata": {}, 774 | "outputs": [], 775 | "source": [ 776 | "nodes_df['document id'] = nodes_df['id'].str[:-4].astype(int);\n", 777 | "publications_df = pd.merge(nodes_df[['document id', 'cluster']], publications_df, \n", 778 | " left_on='document id', right_on='document id')" 779 | ] 780 | }, 781 | { 782 | "cell_type": "markdown", 783 | "metadata": {}, 784 | "source": [ 785 | "Finally, for further inspection, we may want to export our results to a `.csv` file." 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "metadata": {}, 792 | "outputs": [], 793 | "source": [ 794 | "publications_df[['AU', 'PY', 'TI', 'SO', 'cluster']].to_csv(output_dir + 'publications_clustering.csv', \n", 795 | " index=False)" 796 | ] 797 | }, 798 | { 799 | "cell_type": "markdown", 800 | "metadata": {}, 801 | "source": [ 802 | "# Own analysis" 803 | ] 804 | }, 805 | { 806 | "cell_type": "markdown", 807 | "metadata": {}, 808 | "source": [ 809 | "
\n", 810 | "Load your own data in VOSviewer and create a co-citation network of journals.\n", 811 | "
" 812 | ] 813 | }, 814 | { 815 | "cell_type": "code", 816 | "execution_count": null, 817 | "metadata": {}, 818 | "outputs": [], 819 | "source": [] 820 | }, 821 | { 822 | "cell_type": "markdown", 823 | "metadata": {}, 824 | "source": [ 825 | "
\n", 826 | "Detect comunities in the journal co-citation network. What do you think the different clusters mean?\n", 827 | "
" 828 | ] 829 | }, 830 | { 831 | "cell_type": "code", 832 | "execution_count": null, 833 | "metadata": {}, 834 | "outputs": [], 835 | "source": [] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "metadata": {}, 840 | "source": [ 841 | "
\n", 842 | " Load your own data in VOSviewer and create a term-map. Please take the following steps to create the term-map and extract the terms.csv file and the doc-terms.csv file.\n", 843 | " \n", 844 | "
    \n", 845 | "
  1. Open VOSviewer and press the button \"Create...\".
  2. \n", 846 | "
  3. Choose \"Create a map based on text data\" and press \"Next\".
  4. \n", 847 | "
  5. Choose \"Read data from bibliographic database files\" and press \"Next\".
  6. \n", 848 | "
  7. Choose the \"Web of Science\" tab and select the files you have downloaded yourself and press \"Next\".
  8. \n", 849 | "
  9. Choose \"Title and abstract fields\" (the default) and press \"Next\".
  10. \n", 850 | "
  11. Choose \"Binary counting\" (the default) and press \"Next\".
  12. \n", 851 | "
  13. Leave the default threshold of 10 and press \"Next\".
  14. \n", 852 | "
  15. Leave the default number of terms to be selected and press \"Next\".
  16. \n", 853 | "
\n", 854 | "\n", 855 | "VOSviewer will now calculate the \"relevance\" scores. When it is done, you will be shown a list of terms together with the number of their occurrences and the relevance scores. Please follow the following remaining steps.\n", 856 | "\n", 857 | "
    \n", 858 | "
  1. On the list of terms, click-right, and choose \"Export selected terms...\". Choose an appropriate file name (terms.txt) and make sure you choose an appropriate directory and then press \"Export\".
  2. \n", 859 | "
  3. On the list of terms, click-right, and choose \"Export document-term relations...\". Choose an appropriate file name (doc-terms.txt) and make sure you choose an appropriate directory and then press \"Export\".
  4. \n", 860 | "
\n", 861 | "
" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": {}, 867 | "source": [ 868 | "
\n", 869 | " Load the terms.csv file and the doc-terms.csv files. Detect the clusters in this bipartite network, as explained above.\n", 870 | "
" 871 | ] 872 | }, 873 | { 874 | "cell_type": "code", 875 | "execution_count": null, 876 | "metadata": {}, 877 | "outputs": [], 878 | "source": [] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "metadata": {}, 883 | "source": [ 884 | "
\n", 885 | " Compare the results to the clusters you can detect immediately in VOSviewer itself. Are they similar or not?\n", 886 | "
" 887 | ] 888 | }, 889 | { 890 | "cell_type": "code", 891 | "execution_count": null, 892 | "metadata": {}, 893 | "outputs": [], 894 | "source": [] 895 | }, 896 | { 897 | "cell_type": "markdown", 898 | "metadata": {}, 899 | "source": [ 900 | "
\n", 901 | " Try to identify the main topic for the largest few clusters on the basis of the terms in the term map. Does that match well with the publications in the same cluster? Do you see any discrepancies?\n", 902 | "
" 903 | ] 904 | }, 905 | { 906 | "cell_type": "code", 907 | "execution_count": null, 908 | "metadata": {}, 909 | "outputs": [], 910 | "source": [] 911 | } 912 | ], 913 | "metadata": { 914 | "kernelspec": { 915 | "display_name": "Python 3", 916 | "language": "python", 917 | "name": "python3" 918 | }, 919 | "language_info": { 920 | "codemirror_mode": { 921 | "name": "ipython", 922 | "version": 3 923 | }, 924 | "file_extension": ".py", 925 | "mimetype": "text/x-python", 926 | "name": "python", 927 | "nbconvert_exporter": "python", 928 | "pygments_lexer": "ipython3", 929 | "version": "3.8.3" 930 | } 931 | }, 932 | "nbformat": 4, 933 | "nbformat_minor": 2 934 | } 935 | -------------------------------------------------------------------------------- /01-basics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Introduction" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "In this lab exercise, you will learn how to perform scientometric network analysis in Python. We will start with practicalities on some basic data handling and import. We then move on to creating a network and cover some basic analysis. In the next session, we will be using more advanced techniques." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "
\n", 22 | "This Python notebook is intended to be used as an exercise. We have prepared it for you to include many details, but at some parts we will ask you to fill in some of the blanks. Exercises where you are asked to do something, or to think about something, will be indicated like this. If you need to execute and write your own code, we provide empty space below to do so.\n", 23 | "
" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "
\n", 31 | "If you need any help with anything, please don't hesitate to ask your teachers. \n", 32 | "
" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "# Data handling" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "Python is a general purpose programming language and it can be used to handle data in general. In this notebook we will specifically deal with scientometric datasets, but you can also use it for other purposes.\n", 47 | "\n", 48 | "We will start by handling some data from a scientometric data source. There are many different possible data sources, and we discussed some of them earlier this week. In this notebook we will focus on data downloaded from Web of Science. We have already downloaded some data for you to demonstrate Python. At the end of the exercise you will be asked to load your own data. \n", 49 | "\n", 50 | "The data that we provided is a selection of publications from authors from Belgium from Tropical Medicine from 2000-2017." 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "
\n", 58 | " Note: You cannot load your own data when you run this notebook online using Binder.\n", 59 | "
" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "We start by loading the data. In order to read in the data, we first need to make sure that Python is able to read it. A very versatile *package* for handling data in Python is called `pandas`. For those of you familiar with `R`, it is similar to the `data.frame` in `R`." 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "We *import* this package as follows, and we call the `pandas` package `pd`, for easy reference. We also need the `csv` package to indicate some options to the `pandas` package." 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "
\n", 81 | " In order to execute the code you have to press Ctrl-Enter while selecting the code cell below. Alternatively, you can press the \"Play\" button at the top of the screen. This also moves to the next cell at the same time. Using Shift-Enter instead of Ctrl-Enter will also execute the code and move to the next cell at the same time.\n", 82 | "
" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "import pandas as pd\n", 92 | "import csv" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "
\n", 100 | " If you have executed that code cell correctly, it should now be numbered 1. While the code in a cell is being executed it is marked by an asterisk *. Each cell of executed code will be numbered in the order in which you execute it. If you execute it again, it will be numbered 2, et cetera.\n", 101 | "
" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "We are now ready to read in the data that you just downloaded. We have named the `pandas` package `pd`, which will save us some typing." 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "publications_df = pd.read_csv('data-files/wos/tab-delimited/savedrecs_0001_0500.txt', \n", 118 | " sep='\\t', index_col='UT',\n", 119 | " quoting=csv.QUOTE_NONE, usecols=range(68))" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "We called the *function* `read_csv` of the `pandas` package. We provide it with several *arguments*. \n", 127 | "\n", 128 | "1. The location of the file we want to read.\n", 129 | "\n", 130 | "2. The second argument is a *named argument*, we provide both the name of the argument (`sep`) and its value (`'\\t'`). This indicates the *sep*arator between different fields. In this case it is a tab-delimited file, so the fields are separated by tabs, which is indicated by `'\\t'`.\n", 131 | "\n", 132 | "3. The third argument is again a named argument. We indicate that the `UT` field should be the index. This is the unique identifier that WoS uses.\n", 133 | "\n", 134 | "The two subqeuent arguments are needed to correctly handle some peculiarities of WoS files." 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "
\n", 142 | "We downloaded some example files for you, which are located in the folder data_files/wos. At the end of this notebook, you will be asked to download your own data. If you want to load that data instead, use the path to that data.\n", 143 | "
" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "
\n", 151 | " Note: Windows usually uses backslashes \\ to separate directories, in Python you can also use the forward slash /, which is usually more convenient for a number of reasons.\n", 152 | "
" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "The `pandas` package took care of reading the file, and has now stored it in the variable called `publications_df`. You can take a closer look at `publications_df` to see the data that we just read." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "metadata": {}, 166 | "outputs": [], 167 | "source": [ 168 | "publications_df" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "You will see that the data has quite cryptic column headers. Each line contains information about a single publication, and contains various details, such as the title (`TI`), abstract (`AB`), authors (`AU`), journal title (`SO`) and cited references (`CR`). Unfortunately, the documentation of Web of Science is relatively limited, but some explanation can be found here. You can retrieve this information in various ways from the pandas dataframe `publications_df`. For example, you can list the first five titles as follows:" 176 | ] 177 | }, 178 | { 179 | "cell_type": "code", 180 | "execution_count": null, 181 | "metadata": {}, 182 | "outputs": [], 183 | "source": [ 184 | "publications_df.TI[:5]" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "Here, `[:5]` indicates that you want the first elements (starting at 0) until (but excluding) 5, so item 0, 1, 2, 3 and 4. This is called a *slice* of the data. You can also look at authors for rows 5 until 10 as follows:" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": null, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "publications_df.AU[5:10]" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "In order to get the last few elements, you can use negative indices. The last element is indicated by `-1`, the penultimate element is indicated by `-2`, and so on. You can get the journals for the last five sources as follows:" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": null, 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "publications_df.SO[-5:]" 217 | ] 218 | }, 219 | { 220 | "cell_type": "markdown", 221 | "metadata": {}, 222 | "source": [ 223 | "Alternatively, there are various ways to index the dataframe. For example, to get the title and abstract for the first five elements you can do the following." 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "publications_df[0:5][['TI', 'AB']]" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "The notation `['TI', 'AB']` creates a *list* of elements in Python. We now used it to get multiple columns from the dataframe. " 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "The following does exactly the same:" 247 | ] 248 | }, 249 | { 250 | "cell_type": "code", 251 | "execution_count": null, 252 | "metadata": {}, 253 | "outputs": [], 254 | "source": [ 255 | "publications_df[['TI', 'AB']][0:5]" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "The `pandas` package automatically determines whether you try to get columns or rows. Slices are always assumed to refer to rows." 
263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "
\n", 270 | " Show the title (TI), abstract (AB), journal (SO) and publication year (PY) for rows 200-210.\n", 271 | "
" 272 | ] 273 | }, 274 | { 275 | "cell_type": "markdown", 276 | "metadata": {}, 277 | "source": [ 278 | "
\n", 279 | "To start typing in the cell below, select the cell using the mouse, or select it using the arrows on the keyboard and press Enter\n", 280 | "
" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "You can also access a particular `UT` directly by using the `.loc` indexer." 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": null, 300 | "metadata": {}, 301 | "outputs": [], 302 | "source": [ 303 | "publications_df.loc['WOS:000419235100004', ['TI', 'AU', 'SO', 'PY']]" 304 | ] 305 | }, 306 | { 307 | "cell_type": "markdown", 308 | "metadata": {}, 309 | "source": [ 310 | "## Reading multiple files" 311 | ] 312 | }, 313 | { 314 | "cell_type": "markdown", 315 | "metadata": {}, 316 | "source": [ 317 | "Until now we have only loaded one file. But we have of course downloaded more files, and we need to load all of them. We can list all files in a directory using the package `glob`. We first import the package." 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": null, 323 | "metadata": {}, 324 | "outputs": [], 325 | "source": [ 326 | "import glob" 327 | ] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "Now, let us get a list of all files in the directory `data_files/wos/tab-delimited/`." 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n", 343 | "files" 344 | ] 345 | }, 346 | { 347 | "cell_type": "markdown", 348 | "metadata": {}, 349 | "source": [ 350 | "We asked `glob` for a list of files that end with `txt` (`*.txt`) in the directory `data-files/wos/tab-delimited`. We sorted the list to ensure that we read the files in the correct order. We can now simply pass this list of files to read multiple WoS files." 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": null, 356 | "metadata": {}, 357 | "outputs": [], 358 | "source": [ 359 | "publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n", 360 | " usecols=range(68), index_col='UT') for f in files)\n", 361 | "publications_df = publications_df.sort_index()" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "metadata": {}, 367 | "source": [ 368 | "
\n", 369 | " Now checkout the new publications_df data frame, and see how many rows it has.\n", 370 | "
" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "## Data summarisation" 385 | ] 386 | }, 387 | { 388 | "cell_type": "markdown", 389 | "metadata": {}, 390 | "source": [ 391 | "The `pandas` package provides various ways to summarise the data and get a useful overview of the data. For example, you can group by a certain column, and count or sum things. For example, we can count the number of articles in each journal that is included in this dataset:" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "grouped_by_journal = publications_df.groupby('SO')\n", 401 | "grouped_by_journal.size().sort_values(ascending=False)[:10]" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "We could also ask the mean publication year of publications in those journals" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": null, 414 | "metadata": {}, 415 | "outputs": [], 416 | "source": [ 417 | "grouped_by_journal['PY'].mean()" 418 | ] 419 | }, 420 | { 421 | "cell_type": "markdown", 422 | "metadata": {}, 423 | "source": [ 424 | "
\n", 425 | " Group by the year (PY) and count the number of paper from each year.\n", 426 | "
" 427 | ] 428 | }, 429 | { 430 | "cell_type": "markdown", 431 | "metadata": {}, 432 | "source": [ 433 | "
\n", 434 | "Now it is time to introduce you a little trick: you can get a list of all functions and argument of some variable by simply pressing Tab. For example, you can type publications_df., including the . and then press Tab (make sure the cursor is located after the .). If you then start typing the name of the function you are looking for and press Tab again, Python will automatically finish it as much as possible. This is something general: whenever you press Tab Python will try to autocomplete whatever you are typing.\n", 435 | "\n", 436 | "One other trick: if you have selected a function and press Shift-Tab you will get documentation of what this function does. You can press the + to find out more.\n", 437 | "
" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": null, 443 | "metadata": {}, 444 | "outputs": [], 445 | "source": [] 446 | }, 447 | { 448 | "cell_type": "markdown", 449 | "metadata": {}, 450 | "source": [ 451 | "## Network generation" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "Ultimately, we would like to use this data to generation scientometric networks. This is not a trivial task, and we will now show how to construct a co-authorship network and a journal level bibliographic coupling network.\n", 459 | "\n", 460 | "We first load the network analysis package that we will use in the notebook, `igraph`." 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "
\n", 468 | " Import the pacakge igraph and call it ig.\n", 469 | "
" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": null, 475 | "metadata": {}, 476 | "outputs": [], 477 | "source": [] 478 | }, 479 | { 480 | "cell_type": "markdown", 481 | "metadata": {}, 482 | "source": [ 483 | "### Co-authorship" 484 | ] 485 | }, 486 | { 487 | "cell_type": "markdown", 488 | "metadata": {}, 489 | "source": [ 490 | "We first build a co-authorship network. We will do this one publication at the time. All combinations of authors that are involved in a publication are co-authors. Let us look at the authors for publication 0." 491 | ] 492 | }, 493 | { 494 | "cell_type": "code", 495 | "execution_count": null, 496 | "metadata": {}, 497 | "outputs": [], 498 | "source": [ 499 | "publications_df['AU'][0]" 500 | ] 501 | }, 502 | { 503 | "cell_type": "markdown", 504 | "metadata": {}, 505 | "source": [ 506 | "Note that the authors are all listed and separated with a semicolon (`;`). In computer terms, it is now a single *string*. We will split this string of all authors into a list of strings where each string then represents a single author." 507 | ] 508 | }, 509 | { 510 | "cell_type": "code", 511 | "execution_count": null, 512 | "metadata": {}, 513 | "outputs": [], 514 | "source": [ 515 | "publications_df['AU_split'] = publications_df['AU'].fillna('').str.split('; ')" 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": null, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "authors = publications_df['AU_split'][0]\n", 525 | "authors" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "metadata": {}, 531 | "source": [ 532 | "In order to create all possible combinations, we can use a convenient package, called `itertools`. The function `combinations` can generate all possible combinations of the elements of a list." 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": null, 538 | "metadata": {}, 539 | "outputs": [], 540 | "source": [ 541 | "import itertools as itr\n", 542 | "list(itr.combinations(authors, 2))" 543 | ] 544 | }, 545 | { 546 | "cell_type": "markdown", 547 | "metadata": {}, 548 | "source": [ 549 | "Of course, we don't want to do this for a single publication only, but rather, for all publications in our dataset. We can do that using the function `apply`. We can supply it with a small function (called a `lambda` function) that simply takes some input and produces some output. In this case, the input are the `authors`, and the output is the result of `itr.combinations(...)`." 550 | ] 551 | }, 552 | { 553 | "cell_type": "code", 554 | "execution_count": null, 555 | "metadata": {}, 556 | "outputs": [], 557 | "source": [ 558 | "coauthors_per_publication = publications_df['AU_split'].apply(\n", 559 | " lambda authors: list(itr.combinations(authors, 2)))" 560 | ] 561 | }, 562 | { 563 | "cell_type": "markdown", 564 | "metadata": {}, 565 | "source": [ 566 | "The variable `coauthors_per_publication` is now a list of a list of co-authors per publication. That is, each element of `coauthors_per_publication` contains a list of all co-authors for that publication. So, `coauthors_per_publication[0]` contains the coauthors we examined previously." 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": null, 572 | "metadata": {}, 573 | "outputs": [], 574 | "source": [ 575 | "coauthors_per_publication[0]" 576 | ] 577 | }, 578 | { 579 | "cell_type": "markdown", 580 | "metadata": {}, 581 | "source": [ 582 | "Let us turn each element of this list into a separate row. 
This is done by using `explode` in `pandas`. Publications with only one author have no co-authors, which results in an `NA` (Not Available) value. We will drop those using `dropna`." 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "metadata": {}, 589 | "outputs": [], 590 | "source": [ 591 | "coauthors = coauthors_per_publication.explode().dropna()" 592 | ] 593 | }, 594 | { 595 | "cell_type": "markdown", 596 | "metadata": {}, 597 | "source": [ 598 | "Finally, we can create the actual network as follows" 599 | ] 600 | }, 601 | { 602 | "cell_type": "code", 603 | "execution_count": null, 604 | "metadata": {}, 605 | "outputs": [], 606 | "source": [ 607 | "G_coauthorship = ig.Graph.TupleList(\n", 608 | " edges=coauthors.to_list(),\n", 609 | " vertex_name_attr='author',\n", 610 | " directed=False\n", 611 | " )" 612 | ] 613 | }, 614 | { 615 | "cell_type": "markdown", 616 | "metadata": {}, 617 | "source": [ 618 | "Note that this graph will still contain many duplicate edges, because there are multiple edges present. Let us therefore simplify this graph, and simply count the number of multiple edges. We first create a so-called edge attribute `n_joint_papers`. We can create it by using the edge sequence `es` of the graph. We can then simply sum this weight when we simplify the graph." 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": null, 624 | "metadata": {}, 625 | "outputs": [], 626 | "source": [ 627 | "G_coauthorship.es['n_joint_papers'] = 1\n", 628 | "G_coauthorship = G_coauthorship.simplify(combine_edges='sum')" 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": {}, 634 | "source": [ 635 | "Let us see how many authors (i.e. nodes) there are in the network. This is called the `vcount` (vertex count) in `igraph`." 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": null, 641 | "metadata": {}, 642 | "outputs": [], 643 | "source": [ 644 | "G_coauthorship.vcount()" 645 | ] 646 | }, 647 | { 648 | "cell_type": "markdown", 649 | "metadata": {}, 650 | "source": [ 651 | "Similarly, the number of edges is available as the `ecount` of the graph." 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "metadata": {}, 658 | "outputs": [], 659 | "source": [ 660 | "G_coauthorship.ecount()" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "metadata": {}, 666 | "source": [ 667 | "We can do all sorts of analysis on this network. But first, we will create a bibliographic coupling network." 668 | ] 669 | }, 670 | { 671 | "cell_type": "markdown", 672 | "metadata": {}, 673 | "source": [ 674 | "### Bibliographic coupling" 675 | ] 676 | }, 677 | { 678 | "cell_type": "markdown", 679 | "metadata": {}, 680 | "source": [ 681 | "Bibliographic coupling and co-authorship is in a sense very similar. Previously, we computed for each publication a combination of all co-authors. For bibliographic coupling we can compute for each cited reference the combinations of all citing journals. We will first create a dataframe of all journal citations (`SO`) of a certain cited reference (`CR`). Similar to the authors, we need to first split the cited references." 
682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": null, 687 | "metadata": {}, 688 | "outputs": [], 689 | "source": [ 690 | "publication_with_cr_df = publications_df.loc[pd.notnull(publications_df['CR']), ['SO', 'CR']]\n", 691 | "publication_with_cr_df['CR'] = publication_with_cr_df['CR'].str.split('; ')" 692 | ] 693 | }, 694 | { 695 | "cell_type": "markdown", 696 | "metadata": {}, 697 | "source": [ 698 | "We now simply list all citations from a certain journal (`SO`) to a certain cited reference (`CR`)." 699 | ] 700 | }, 701 | { 702 | "cell_type": "code", 703 | "execution_count": null, 704 | "metadata": {}, 705 | "outputs": [], 706 | "source": [ 707 | "journal_cits_df = publication_with_cr_df[['SO', 'CR']].explode('CR')" 708 | ] 709 | }, 710 | { 711 | "cell_type": "markdown", 712 | "metadata": {}, 713 | "source": [ 714 | "We then create all bibliographic couplings per cited reference as follows. We first group by the cited reference (`CR`) and then take all combinations of citing journals." 715 | ] 716 | }, 717 | { 718 | "cell_type": "code", 719 | "execution_count": null, 720 | "metadata": {}, 721 | "outputs": [], 722 | "source": [ 723 | "bibcoupling_per_cr = journal_cits_df.groupby('CR').apply(lambda x: list(itr.combinations(x['SO'], 2)))" 724 | ] 725 | }, 726 | { 727 | "cell_type": "markdown", 728 | "metadata": {}, 729 | "source": [ 730 | "We again `explode` all combinations of two sources citing the same reference." 731 | ] 732 | }, 733 | { 734 | "cell_type": "code", 735 | "execution_count": null, 736 | "metadata": {}, 737 | "outputs": [], 738 | "source": [ 739 | "bibcouplings = bibcoupling_per_cr.explode().dropna()" 740 | ] 741 | }, 742 | { 743 | "cell_type": "markdown", 744 | "metadata": {}, 745 | "source": [ 746 | "We can then create the network." 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": null, 752 | "metadata": {}, 753 | "outputs": [], 754 | "source": [ 755 | "G_coupling = ig.Graph.TupleList(\n", 756 | " edges=bibcouplings,\n", 757 | " vertex_name_attr='SO',\n", 758 | " directed=False\n", 759 | " )" 760 | ] 761 | }, 762 | { 763 | "cell_type": "markdown", 764 | "metadata": {}, 765 | "source": [ 766 | "
\n", 767 | " We again need to simplify this network. Create a new edge attribute called coupling set it to 1 and then sum this attribute when simplifying the network.\n", 768 | "
" 769 | ] 770 | }, 771 | { 772 | "cell_type": "code", 773 | "execution_count": null, 774 | "metadata": {}, 775 | "outputs": [], 776 | "source": [] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": {}, 781 | "source": [ 782 | "This network should be reasonably sized, and you should be able to visualize this network by calling `ig.plot`." 783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": null, 788 | "metadata": {}, 789 | "outputs": [], 790 | "source": [ 791 | "ig.plot(G_coupling, vertex_label=G_coupling.vs['SO'])" 792 | ] 793 | }, 794 | { 795 | "cell_type": "markdown", 796 | "metadata": {}, 797 | "source": [ 798 | "# Network analysis" 799 | ] 800 | }, 801 | { 802 | "cell_type": "markdown", 803 | "metadata": {}, 804 | "source": [ 805 | "Now that we have created some scientometric networks, let us look at some basic analyses of these networks." 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "metadata": {}, 811 | "source": [ 812 | "## Connectivity" 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": {}, 818 | "source": [ 819 | "Let us start with a very simple question. Is the co-authorship network connected?" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": null, 825 | "metadata": {}, 826 | "outputs": [], 827 | "source": [ 828 | "G_coauthorship.is_connected()" 829 | ] 830 | }, 831 | { 832 | "cell_type": "markdown", 833 | "metadata": {}, 834 | "source": [ 835 | "Apparently, not all authors in this dataset are connected via co-authored papers." 836 | ] 837 | }, 838 | { 839 | "cell_type": "markdown", 840 | "metadata": {}, 841 | "source": [ 842 | "
\n", 843 | "How many authors do you think will be connected to each other? 500? 5000? Almost everybody?\n", 844 | "
" 845 | ] 846 | }, 847 | { 848 | "cell_type": "markdown", 849 | "metadata": {}, 850 | "source": [ 851 | "In order to take a closer look, we need to detect the *connected components*. This is easily done, but the function is confusingly called `clusters`." 852 | ] 853 | }, 854 | { 855 | "cell_type": "code", 856 | "execution_count": null, 857 | "metadata": {}, 858 | "outputs": [], 859 | "source": [ 860 | "components = G_coauthorship.clusters()" 861 | ] 862 | }, 863 | { 864 | "cell_type": "markdown", 865 | "metadata": {}, 866 | "source": [ 867 | "We only want the so-called giant component. " 868 | ] 869 | }, 870 | { 871 | "cell_type": "markdown", 872 | "metadata": {}, 873 | "source": [ 874 | "
\n", 875 | "What function do you think returns the giant component?\n", 876 | "
" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "
\n", 884 | " Remember, you can use Tab and Shift-Tab to find out more about possible functions.\n", 885 | "
" 886 | ] 887 | }, 888 | { 889 | "cell_type": "code", 890 | "execution_count": null, 891 | "metadata": {}, 892 | "outputs": [], 893 | "source": [] 894 | }, 895 | { 896 | "cell_type": "markdown", 897 | "metadata": {}, 898 | "source": [ 899 | "Let us only look at the giant component." 900 | ] 901 | }, 902 | { 903 | "cell_type": "code", 904 | "execution_count": null, 905 | "metadata": {}, 906 | "outputs": [], 907 | "source": [ 908 | "H = components.giant()" 909 | ] 910 | }, 911 | { 912 | "cell_type": "markdown", 913 | "metadata": {}, 914 | "source": [ 915 | "Let us check how many nodes are in the giant component. We can call the function `summary`." 916 | ] 917 | }, 918 | { 919 | "cell_type": "code", 920 | "execution_count": null, 921 | "metadata": {}, 922 | "outputs": [], 923 | "source": [ 924 | "print(H.summary())" 925 | ] 926 | }, 927 | { 928 | "cell_type": "markdown", 929 | "metadata": {}, 930 | "source": [ 931 | "The first line indicates that we have an undirected graph (`U`) with 7871 nodes and 69928 links. The next line shows vertex attributes (indicated by the `v` behind the name of the attribute), and edge attributes (indicated by the `e`)." 932 | ] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "metadata": {}, 937 | "source": [ 938 | "
\n", 939 | "
    \n", 940 | "
  1. What is the percentage of nodes that are in the giant component? \n", 941 | "
  2. Double check whether the giant component is connected.\n", 942 | "
\n", 943 | "
" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": {}, 950 | "outputs": [], 951 | "source": [] 952 | }, 953 | { 954 | "cell_type": "markdown", 955 | "metadata": {}, 956 | "source": [ 957 | "Let us take a closer look at how far authors in this data set are apart from one another. Let us simply take a look at node number `0` (remember, the first node has number `0`, not `1`) and node number `355`. " 958 | ] 959 | }, 960 | { 961 | "cell_type": "code", 962 | "execution_count": null, 963 | "metadata": {}, 964 | "outputs": [], 965 | "source": [ 966 | "paths = G_coauthorship.get_shortest_paths(0, 355)\n", 967 | "paths" 968 | ] 969 | }, 970 | { 971 | "cell_type": "markdown", 972 | "metadata": {}, 973 | "source": [ 974 | "This returns a list of all shortests paths of the nodes between node number 0 and node number 355. In fact, there is only one path, so let us select that." 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": null, 980 | "metadata": {}, 981 | "outputs": [], 982 | "source": [ 983 | "path = paths[0]\n", 984 | "path" 985 | ] 986 | }, 987 | { 988 | "cell_type": "markdown", 989 | "metadata": {}, 990 | "source": [ 991 | "
\n", 992 | "How many nodes are in the path? What is the path length?\n", 993 | "
" 994 | ] 995 | }, 996 | { 997 | "cell_type": "markdown", 998 | "metadata": {}, 999 | "source": [ 1000 | "These numbers probably do not mean that much to you. You can find out more about an individual node by looking at the `VertexSequence` of `igraph`, abbreviated as `vs`. This is a sort of list of all vertices, and is indexed by brackets `[ ]`, similar to lists, instead of parentheses `( )` as we do for functions." 1001 | ] 1002 | }, 1003 | { 1004 | "cell_type": "code", 1005 | "execution_count": null, 1006 | "metadata": {}, 1007 | "outputs": [], 1008 | "source": [ 1009 | "G_coauthorship.vs[0]" 1010 | ] 1011 | }, 1012 | { 1013 | "cell_type": "markdown", 1014 | "metadata": {}, 1015 | "source": [ 1016 | "The vertex itself is also a type of list (called a *dictionary*), and you can only return the author name as follows" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": null, 1022 | "metadata": {}, 1023 | "outputs": [], 1024 | "source": [ 1025 | "G_coauthorship.vs[0]['author']" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "You can also list multiple vertices at once." 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": null, 1038 | "metadata": {}, 1039 | "outputs": [], 1040 | "source": [ 1041 | "G_coauthorship.vs[[0, 3, 223, 355]]['author']" 1042 | ] 1043 | }, 1044 | { 1045 | "cell_type": "markdown", 1046 | "metadata": {}, 1047 | "source": [ 1048 | "You can of course also simply pass the variable `path` that we constructed earlier." 1049 | ] 1050 | }, 1051 | { 1052 | "cell_type": "code", 1053 | "execution_count": null, 1054 | "metadata": {}, 1055 | "outputs": [], 1056 | "source": [ 1057 | "G_coauthorship.vs[path]['author']" 1058 | ] 1059 | }, 1060 | { 1061 | "cell_type": "markdown", 1062 | "metadata": {}, 1063 | "source": [ 1064 | "This shows that Osaer collaborated with Geert, who collaborated with Van Mark, who in the end collaborated with Watkins." 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "markdown", 1069 | "metadata": {}, 1070 | "source": [ 1071 | "You can also get the vertex by searching for the author name. For example, if we want to find `'Van Marck, E'` we can use the following." 1072 | ] 1073 | }, 1074 | { 1075 | "cell_type": "code", 1076 | "execution_count": null, 1077 | "metadata": {}, 1078 | "outputs": [], 1079 | "source": [ 1080 | "G_coauthorship.vs.find(author_eq = 'Van Marck, E')" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "markdown", 1085 | "metadata": {}, 1086 | "source": [ 1087 | "Here `author_eq` refers to the condition that the vertex attribute `author` should **eq**ual `'Van Marck, E'`." 1088 | ] 1089 | }, 1090 | { 1091 | "cell_type": "markdown", 1092 | "metadata": {}, 1093 | "source": [ 1094 | "
\n", 1095 | " Find the shortest path from 'Van Marck, E' to 'Migchelsen, S'. Who is in between?\n", 1096 | "
" 1097 | ] 1098 | }, 1099 | { 1100 | "cell_type": "code", 1101 | "execution_count": null, 1102 | "metadata": {}, 1103 | "outputs": [], 1104 | "source": [] 1105 | }, 1106 | { 1107 | "cell_type": "markdown", 1108 | "metadata": {}, 1109 | "source": [ 1110 | "We can let `igraph` also calculate how far apart all nodes are." 1111 | ] 1112 | }, 1113 | { 1114 | "cell_type": "markdown", 1115 | "metadata": {}, 1116 | "source": [ 1117 | "
\n", 1118 | "The following may take some time to run\n", 1119 | "
" 1120 | ] 1121 | }, 1122 | { 1123 | "cell_type": "code", 1124 | "execution_count": null, 1125 | "metadata": {}, 1126 | "outputs": [], 1127 | "source": [ 1128 | "path_lengths = G_coauthorship.path_length_hist()\n", 1129 | "print(path_lengths)" 1130 | ] 1131 | }, 1132 | { 1133 | "cell_type": "markdown", 1134 | "metadata": {}, 1135 | "source": [ 1136 | "
\n", 1137 | "How far apart are most authors? Do you think most authors are close by? Or do you think they are far apart?\n", 1138 | "
" 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "markdown", 1143 | "metadata": {}, 1144 | "source": [ 1145 | "Let us take a closer look at the path between node 0 and node 355 again. Instead of the nodes on the path, we now want to take a closer look at the edges on the path." 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "code", 1150 | "execution_count": null, 1151 | "metadata": {}, 1152 | "outputs": [], 1153 | "source": [ 1154 | "epath = G_coauthorship.get_shortest_paths(0, 355, output='epath')\n", 1155 | "epath" 1156 | ] 1157 | }, 1158 | { 1159 | "cell_type": "markdown", 1160 | "metadata": {}, 1161 | "source": [ 1162 | "There are three edges on this path, but the numbers themselves are not very informative. They refer to the edges, and similar to the `VertexSequence` we encountered earlier, there is also an `EdgeSequence`, abbreviated as `es`. Let us take a closer look to the number of joint papers that the authors had co-authored." 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "code", 1167 | "execution_count": null, 1168 | "metadata": {}, 1169 | "outputs": [], 1170 | "source": [ 1171 | "G_coauthorship.es[epath[0]]['n_joint_papers']" 1172 | ] 1173 | }, 1174 | { 1175 | "cell_type": "markdown", 1176 | "metadata": {}, 1177 | "source": [ 1178 | "Perhaps there are other paths that connect the two authors with more joint papers? Perhaps we could use the number of joint papers as weights?" 1179 | ] 1180 | }, 1181 | { 1182 | "cell_type": "code", 1183 | "execution_count": null, 1184 | "metadata": {}, 1185 | "outputs": [], 1186 | "source": [ 1187 | "epath = G_coauthorship.get_shortest_paths(0, 355, weights='n_joint_papers', output='epath')\n", 1188 | "epath" 1189 | ] 1190 | }, 1191 | { 1192 | "cell_type": "markdown", 1193 | "metadata": {}, 1194 | "source": [ 1195 | "We do get a different path, which it is actually longer. Let us take a look at the number of joint papers." 1196 | ] 1197 | }, 1198 | { 1199 | "cell_type": "code", 1200 | "execution_count": null, 1201 | "metadata": {}, 1202 | "outputs": [], 1203 | "source": [ 1204 | "G_coauthorship.es[epath[0]]['n_joint_papers']" 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "markdown", 1209 | "metadata": {}, 1210 | "source": [ 1211 | "The total number of joint papers is lower! That is because *shortest path* means: the path with the lowest sum of the weights. This is clearly not what we want. You should always be aware of this whenever using the concept of the *shortest path*." 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "markdown", 1216 | "metadata": {}, 1217 | "source": [ 1218 | "
\n", 1219 | "Attention! Weighted shortest paths have the lowest total weight.\n", 1220 | "
" 1221 | ] 1222 | }, 1223 | { 1224 | "cell_type": "markdown", 1225 | "metadata": {}, 1226 | "source": [ 1227 | "## Clustering coefficient" 1228 | ] 1229 | }, 1230 | { 1231 | "cell_type": "markdown", 1232 | "metadata": {}, 1233 | "source": [ 1234 | "Let us look whether co-authors of an author also tend to be co-authors among themselves." 1235 | ] 1236 | }, 1237 | { 1238 | "cell_type": "markdown", 1239 | "metadata": {}, 1240 | "source": [ 1241 | "Let us take a look at the co-authors of of author number 0, which are called the *neighbors* in network terminology." 1242 | ] 1243 | }, 1244 | { 1245 | "cell_type": "code", 1246 | "execution_count": null, 1247 | "metadata": {}, 1248 | "outputs": [], 1249 | "source": [ 1250 | "G_coauthorship.neighborhood(0)" 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "markdown", 1255 | "metadata": {}, 1256 | "source": [ 1257 | "What we actually want to know is whether many of those neighors are connected. That is, we want to take the subgraph of all authors that have co-authored with author number 0." 1258 | ] 1259 | }, 1260 | { 1261 | "cell_type": "code", 1262 | "execution_count": null, 1263 | "metadata": {}, 1264 | "outputs": [], 1265 | "source": [ 1266 | "H = G_coauthorship.induced_subgraph(G_coauthorship.neighborhood(0))\n", 1267 | "print(H.summary())" 1268 | ] 1269 | }, 1270 | { 1271 | "cell_type": "markdown", 1272 | "metadata": {}, 1273 | "source": [ 1274 | "This subgraph only has 4 nodes (including node 0, so it has 3 neighbours) and 6 edges. This is sufficiently small to be easily plotted for visual inspection." 1275 | ] 1276 | }, 1277 | { 1278 | "cell_type": "code", 1279 | "execution_count": null, 1280 | "metadata": {}, 1281 | "outputs": [], 1282 | "source": [ 1283 | "H.vs['color'] = 'red'\n", 1284 | "H.vs[0]['color'] = 'grey'\n", 1285 | "ig.plot(H)" 1286 | ] 1287 | }, 1288 | { 1289 | "cell_type": "markdown", 1290 | "metadata": {}, 1291 | "source": [ 1292 | "
\n", 1293 | "Do many of the co-authors collaborate among themselves as well? Why do you think this happens?\n", 1294 | "
" 1295 | ] 1296 | }, 1297 | { 1298 | "cell_type": "markdown", 1299 | "metadata": {}, 1300 | "source": [ 1301 | "We can also ask `igraph` to calculate the clustering coefficient (which is called *transitivity* in igraph, which is the same concept using different terms) of node 0." 1302 | ] 1303 | }, 1304 | { 1305 | "cell_type": "code", 1306 | "execution_count": null, 1307 | "metadata": {}, 1308 | "outputs": [], 1309 | "source": [ 1310 | "G_coauthorship.transitivity_local_undirected(0)" 1311 | ] 1312 | }, 1313 | { 1314 | "cell_type": "markdown", 1315 | "metadata": {}, 1316 | "source": [ 1317 | "
\n", 1318 | "What percentage of the co-authors of node 0 have also written papers with each other?\n", 1319 | "
" 1320 | ] 1321 | }, 1322 | { 1323 | "cell_type": "code", 1324 | "execution_count": null, 1325 | "metadata": {}, 1326 | "outputs": [], 1327 | "source": [] 1328 | }, 1329 | { 1330 | "cell_type": "markdown", 1331 | "metadata": {}, 1332 | "source": [ 1333 | "You can calculate the average for all nodes using the function `transitivity_avglocal_undirected`." 1334 | ] 1335 | }, 1336 | { 1337 | "cell_type": "markdown", 1338 | "metadata": {}, 1339 | "source": [ 1340 | "
\n", 1341 | "What percentage of the co-authors have also written papers with each other on average? Do you think this is high or not?\n", 1342 | "
" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "code", 1347 | "execution_count": null, 1348 | "metadata": {}, 1349 | "outputs": [], 1350 | "source": [] 1351 | }, 1352 | { 1353 | "cell_type": "markdown", 1354 | "metadata": {}, 1355 | "source": [ 1356 | "## Centrality" 1357 | ] 1358 | }, 1359 | { 1360 | "cell_type": "markdown", 1361 | "metadata": {}, 1362 | "source": [ 1363 | "Often, people want to identify wich nodes seem to be most important in some way in the network. This is often thought of as a type of *centrality* of a node." 1364 | ] 1365 | }, 1366 | { 1367 | "cell_type": "markdown", 1368 | "metadata": {}, 1369 | "source": [ 1370 | "### Degree" 1371 | ] 1372 | }, 1373 | { 1374 | "cell_type": "markdown", 1375 | "metadata": {}, 1376 | "source": [ 1377 | "The simplest type of centrality is the *degree* of a node, which is simply the number of its neighbors. Previously, we saw that node 0 had 3 neighbors, we therefore say its degree is 3." 1378 | ] 1379 | }, 1380 | { 1381 | "cell_type": "code", 1382 | "execution_count": null, 1383 | "metadata": {}, 1384 | "outputs": [], 1385 | "source": [ 1386 | "G_coauthorship.degree(0)" 1387 | ] 1388 | }, 1389 | { 1390 | "cell_type": "markdown", 1391 | "metadata": {}, 1392 | "source": [ 1393 | "We can also simply calculate the degree for everybody and store it in a new vertex attribute called `degree`." 1394 | ] 1395 | }, 1396 | { 1397 | "cell_type": "code", 1398 | "execution_count": null, 1399 | "metadata": {}, 1400 | "outputs": [], 1401 | "source": [ 1402 | "G_coauthorship.vs['degree'] = G_coauthorship.degree()" 1403 | ] 1404 | }, 1405 | { 1406 | "cell_type": "markdown", 1407 | "metadata": {}, 1408 | "source": [ 1409 | "
\n", 1410 | " What is the degree of 'Van Marck, E'?\n", 1411 | "
" 1412 | ] 1413 | }, 1414 | { 1415 | "cell_type": "code", 1416 | "execution_count": null, 1417 | "metadata": {}, 1418 | "outputs": [], 1419 | "source": [] 1420 | }, 1421 | { 1422 | "cell_type": "markdown", 1423 | "metadata": {}, 1424 | "source": [ 1425 | "We can also take a look at the complete degree distribution. To plot it, we load the `matplotlib` package. We import the plotting functionality and name the package `plt`. We also include a statement telling Python to show the plots immediately in this notebook." 1426 | ] 1427 | }, 1428 | { 1429 | "cell_type": "code", 1430 | "execution_count": null, 1431 | "metadata": {}, 1432 | "outputs": [], 1433 | "source": [ 1434 | "import matplotlib.pyplot as plt\n", 1435 | "%matplotlib inline" 1436 | ] 1437 | }, 1438 | { 1439 | "cell_type": "markdown", 1440 | "metadata": {}, 1441 | "source": [ 1442 | "Now let us plot a histogram of the degree, using 50 bins." 1443 | ] 1444 | }, 1445 | { 1446 | "cell_type": "code", 1447 | "execution_count": null, 1448 | "metadata": {}, 1449 | "outputs": [], 1450 | "source": [ 1451 | "plt.hist(G_coauthorship.vs['degree'], 50);\n", 1452 | "plt.yscale('log')" 1453 | ] 1454 | }, 1455 | { 1456 | "cell_type": "markdown", 1457 | "metadata": {}, 1458 | "source": [ 1459 | "This clearly shows that the degree distribution is quite skewed. Most authors have only few collaborators, while a few authors have many collaborators. If the degree distribution is so skewed, it is sometimes referred to as a *scale-free* network, although the exact definition has been a topic of intense discussion recently." 1460 | ] 1461 | }, 1462 | { 1463 | "cell_type": "markdown", 1464 | "metadata": {}, 1465 | "source": [ 1466 | "The code below sorts the nodes in descending order of the degree." 1467 | ] 1468 | }, 1469 | { 1470 | "cell_type": "code", 1471 | "execution_count": null, 1472 | "metadata": {}, 1473 | "outputs": [], 1474 | "source": [ 1475 | "highest_degree = sorted(G_coauthorship.vs, key=lambda v: v['degree'], reverse=True)" 1476 | ] 1477 | }, 1478 | { 1479 | "cell_type": "markdown", 1480 | "metadata": {}, 1481 | "source": [ 1482 | "The `sorted` function takes a list as input, `G_coauthorship.vs` in our case, and sorts it according to a sort key. We indicate the sort key by a small function, called a `lambda` function, that returns the degree. In other words, the `sorted` function will sort the nodes according to the degree. By indicating `reverse=True` we obtain a list that is sorted highest to lowest, instead of the other way around." 1483 | ] 1484 | }, 1485 | { 1486 | "cell_type": "markdown", 1487 | "metadata": {}, 1488 | "source": [ 1489 | "You can look at the first five results in the following way." 1490 | ] 1491 | }, 1492 | { 1493 | "cell_type": "code", 1494 | "execution_count": null, 1495 | "metadata": {}, 1496 | "outputs": [], 1497 | "source": [ 1498 | "highest_degree[:5]" 1499 | ] 1500 | }, 1501 | { 1502 | "cell_type": "markdown", 1503 | "metadata": {}, 1504 | "source": [ 1505 | "So, apparently, U D'Allessandro has collaborated with 715 other authors! This of course only considers the number of co-authors, it does not take into account the number of papers written with somebody else.\n", 1506 | "When specifying such *edge weights* like the number of joint papers, the weighted degree is referred to as the *strength* of a node (which is sometimes a bit confusing term). \n", 1507 | "\n", 1508 | "Let us look at the strength of node 0." 
1509 | ] 1510 | }, 1511 | { 1512 | "cell_type": "code", 1513 | "execution_count": null, 1514 | "metadata": {}, 1515 | "outputs": [], 1516 | "source": [ 1517 | "G_coauthorship.strength(0, weights='n_joint_papers')" 1518 | ] 1519 | }, 1520 | { 1521 | "cell_type": "markdown", 1522 | "metadata": {}, 1523 | "source": [ 1524 | "Apparently, author 0 collaborated with 3 different authors, and has a total strength of 3. But what does this 3 mean? We need to carefully think about this. Suppose that author 0 has co-authored a single publication with three other co-authors, then each of the three co-authors would have an edge weight of `n_joint_papers = 1`. So, the strenght would be 3. Hence, the strength denotes the total number of collaborations that an author had, which depends both on the number of publications and the number of collaborators per paper.\n", 1525 | "\n", 1526 | "Sometimes, we wish to take into account the number of co-authorships when creating a link weight. We can then fractionally count the weight of each collaboration between $n_a$ authors as\n", 1527 | "\n", 1528 | "$$\\frac{1}{n_a - 1}.$$\n", 1529 | "\n", 1530 | "We need to go back to the `publications_df` in order to construct such a *fractional* edge weight." 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "code", 1535 | "execution_count": null, 1536 | "metadata": {}, 1537 | "outputs": [], 1538 | "source": [ 1539 | "import itertools as itr\n", 1540 | "[(coauthor[0], coauthor[1], 1/(len(authors) - 1)) for coauthor in itr.combinations(authors, 2)]" 1541 | ] 1542 | }, 1543 | { 1544 | "cell_type": "markdown", 1545 | "metadata": {}, 1546 | "source": [ 1547 | "We again do this for all publications." 1548 | ] 1549 | }, 1550 | { 1551 | "cell_type": "code", 1552 | "execution_count": null, 1553 | "metadata": {}, 1554 | "outputs": [], 1555 | "source": [ 1556 | "coauthors_per_publication = publications_df['AU_split'].apply(\n", 1557 | " lambda authors: \n", 1558 | " [(coauthor[0], coauthor[1], 1, 1/(len(authors) - 1)) \n", 1559 | " for coauthor in itr.combinations(authors, 2)])" 1560 | ] 1561 | }, 1562 | { 1563 | "cell_type": "markdown", 1564 | "metadata": {}, 1565 | "source": [ 1566 | "The variable `coauthors_per_publication` is now a list of a list of co-authors per publication, but including a full weight of `1` and a fractional weight of `1/(len(authors) - 1)`, where `len(authors)` is the number of authors of the publications. We again `explode` this list." 1567 | ] 1568 | }, 1569 | { 1570 | "cell_type": "code", 1571 | "execution_count": null, 1572 | "metadata": {}, 1573 | "outputs": [], 1574 | "source": [ 1575 | "coauthors = coauthors_per_publication.explode().dropna()" 1576 | ] 1577 | }, 1578 | { 1579 | "cell_type": "markdown", 1580 | "metadata": {}, 1581 | "source": [ 1582 | "We can again create the network, but now we can pass two edge attributes, `n_joint_papers` and `n_joint_papers_frac`. We of course also have to simplify the network again." 1583 | ] 1584 | }, 1585 | { 1586 | "cell_type": "code", 1587 | "execution_count": null, 1588 | "metadata": {}, 1589 | "outputs": [], 1590 | "source": [ 1591 | "G_coauthorship = ig.Graph.TupleList(\n", 1592 | " edges=coauthors.to_list(),\n", 1593 | " vertex_name_attr='author',\n", 1594 | " directed=False,\n", 1595 | " edge_attrs=('n_joint_papers', 'n_joint_papers_frac')\n", 1596 | " )\n", 1597 | "G_coauthorship = G_coauthorship.simplify(loops=False, combine_edges='sum')" 1598 | ] 1599 | }, 1600 | { 1601 | "cell_type": "markdown", 1602 | "metadata": {}, 1603 | "source": [ 1604 | "
\n", 1605 | "What is the sum of n_joint_papers_frac over all co-authors? Then shouldn't the strength sum up to a whole number? Why isn't that the case here? (Hint: look at the authors of publication 'WOS:000242241600004'" 1607 | ] 1608 | }, 1609 | { 1610 | "cell_type": "code", 1611 | "execution_count": null, 1612 | "metadata": {}, 1613 | "outputs": [], 1614 | "source": [ 1615 | "publications_df.loc['WOS:000242241600004', 'AU']" 1616 | ] 1617 | }, 1618 | { 1619 | "cell_type": "markdown", 1620 | "metadata": {}, 1621 | "source": [ 1622 | "### Betweenness centrality" 1623 | ] 1624 | }, 1625 | { 1626 | "cell_type": "markdown", 1627 | "metadata": {}, 1628 | "source": [ 1629 | "Betweenness centrality is much more elaborate, and gives an indication of the number of times a node is on the shortest path from one node to another node.\n", 1630 | "\n", 1631 | "As you can imagine, this can take quite some time to calculate for all nodes. We will therefore use the somewhat smaller bibliographic coupling network of journals." 1632 | ] 1633 | }, 1634 | { 1635 | "cell_type": "markdown", 1636 | "metadata": {}, 1637 | "source": [ 1638 | "
\n", 1639 | " Note: On larger networks, it may take a long time to calculate the betweenness centrality.\n", 1640 | "
" 1641 | ] 1642 | }, 1643 | { 1644 | "cell_type": "code", 1645 | "execution_count": null, 1646 | "metadata": {}, 1647 | "outputs": [], 1648 | "source": [ 1649 | "G_coupling.vs['betweenness'] = G_coupling.betweenness()" 1650 | ] 1651 | }, 1652 | { 1653 | "cell_type": "markdown", 1654 | "metadata": {}, 1655 | "source": [ 1656 | "Now we can look at the journals that have the highest betweenness." 1657 | ] 1658 | }, 1659 | { 1660 | "cell_type": "code", 1661 | "execution_count": null, 1662 | "metadata": {}, 1663 | "outputs": [], 1664 | "source": [ 1665 | "sorted(G_coupling.vs, key=lambda v: v['betweenness'], reverse=True)[:5]" 1666 | ] 1667 | }, 1668 | { 1669 | "cell_type": "markdown", 1670 | "metadata": {}, 1671 | "source": [ 1672 | "As we did previously when dealing with shortest paths, we can also use a weight for determining the shortest paths." 1673 | ] 1674 | }, 1675 | { 1676 | "cell_type": "code", 1677 | "execution_count": null, 1678 | "metadata": {}, 1679 | "outputs": [], 1680 | "source": [ 1681 | "G_coupling.vs['betweenness_weighted'] = G_coupling.betweenness(weights='coupling')" 1682 | ] 1683 | }, 1684 | { 1685 | "cell_type": "markdown", 1686 | "metadata": {}, 1687 | "source": [ 1688 | "
\n", 1689 | "What is journal with the highest weighted betweenness centrality? Does this make sense if you compare it to the unweighted betweenness centrality?\n", 1690 | "
" 1691 | ] 1692 | }, 1693 | { 1694 | "cell_type": "code", 1695 | "execution_count": null, 1696 | "metadata": {}, 1697 | "outputs": [], 1698 | "source": [] 1699 | }, 1700 | { 1701 | "cell_type": "markdown", 1702 | "metadata": {}, 1703 | "source": [ 1704 | "
\n", 1705 | " Attention! Weighted shortest paths have the lowest total weight.\n", 1706 | "
" 1707 | ] 1708 | }, 1709 | { 1710 | "cell_type": "markdown", 1711 | "metadata": {}, 1712 | "source": [ 1713 | "### Pagerank" 1714 | ] 1715 | }, 1716 | { 1717 | "cell_type": "markdown", 1718 | "metadata": {}, 1719 | "source": [ 1720 | "One way of identifying central nodes relies on the idea of a random walk in a network. We will study this in the journal bibliographic coupling network. When performing such a random walk, we simply go from one journal to the next, following the bibliographic coupling links. The journal that is most frequently visited during such a random walk is then seen as most central. This is actually the idea that underlies Google's famous search engine. Luckily, we can compute that a lot faster than betweenness." 1721 | ] 1722 | }, 1723 | { 1724 | "cell_type": "code", 1725 | "execution_count": null, 1726 | "metadata": {}, 1727 | "outputs": [], 1728 | "source": [ 1729 | "G_coupling.vs['pagerank'] = G_coupling.pagerank()" 1730 | ] 1731 | }, 1732 | { 1733 | "cell_type": "markdown", 1734 | "metadata": {}, 1735 | "source": [ 1736 | "
\n", 1737 | "Get the top 5 most central journals according to Pagerank. Who is the most central? Are the results very different from the betweenness?\n", 1738 | "
" 1739 | ] 1740 | }, 1741 | { 1742 | "cell_type": "code", 1743 | "execution_count": null, 1744 | "metadata": {}, 1745 | "outputs": [], 1746 | "source": [] 1747 | }, 1748 | { 1749 | "cell_type": "markdown", 1750 | "metadata": {}, 1751 | "source": [ 1752 | "We can again take into account the weights. In pagerank this means that a journal that is a more closely bibliographically coupled will be more likely to be visited during a random walk. This is actually much more in line with our intuition than the shortest path. Let us see what we get if we do that." 1753 | ] 1754 | }, 1755 | { 1756 | "cell_type": "code", 1757 | "execution_count": null, 1758 | "metadata": {}, 1759 | "outputs": [], 1760 | "source": [ 1761 | "G_coupling.vs['pagerank_weighted'] = G_coupling.pagerank(weights='coupling')" 1762 | ] 1763 | }, 1764 | { 1765 | "cell_type": "markdown", 1766 | "metadata": {}, 1767 | "source": [ 1768 | "
\n", 1769 | "Are the results different for the weighted version of pagerank?\n", 1770 | "
" 1771 | ] 1772 | }, 1773 | { 1774 | "cell_type": "code", 1775 | "execution_count": null, 1776 | "metadata": {}, 1777 | "outputs": [], 1778 | "source": [] 1779 | }, 1780 | { 1781 | "cell_type": "markdown", 1782 | "metadata": {}, 1783 | "source": [ 1784 | "
\n", 1785 | "Pagerank is very similar to the techniques that underly the journal \"Eigenfactor\" and the \"SCImago Journal Rank\", which are seen as indicators of the scientific impact of a journal. Do you think it makes sense to interpret Pagerank on a bibliographic coupling network as the scientific impact of a journal? Why (not)?\n", 1786 | "
" 1787 | ] 1788 | }, 1789 | { 1790 | "cell_type": "markdown", 1791 | "metadata": {}, 1792 | "source": [ 1793 | "## Co-authorship using bipartite projection (optional)" 1794 | ] 1795 | }, 1796 | { 1797 | "cell_type": "markdown", 1798 | "metadata": {}, 1799 | "source": [ 1800 | "We can also create co-authorship using a more theoretical approach from graph theory. We can first construct a network consisting of publications and authors." 1801 | ] 1802 | }, 1803 | { 1804 | "cell_type": "markdown", 1805 | "metadata": {}, 1806 | "source": [ 1807 | "We first again `explode` all authors for each publication, and create a graph out of it." 1808 | ] 1809 | }, 1810 | { 1811 | "cell_type": "code", 1812 | "execution_count": null, 1813 | "metadata": {}, 1814 | "outputs": [], 1815 | "source": [ 1816 | "author_pubs_df = publications_df['AU_split'].explode()\n", 1817 | "\n", 1818 | "G_pub_authors = ig.Graph.TupleList(\n", 1819 | " edges=author_pubs_df.reset_index().values,\n", 1820 | " vertex_name_attr='name',\n", 1821 | " directed=False\n", 1822 | " )" 1823 | ] 1824 | }, 1825 | { 1826 | "cell_type": "markdown", 1827 | "metadata": {}, 1828 | "source": [ 1829 | "This network consists of two types: publications and authors. This is called a *bipartite* graph. We can automatically get the types using `is_bipartite`." 1830 | ] 1831 | }, 1832 | { 1833 | "cell_type": "code", 1834 | "execution_count": null, 1835 | "metadata": {}, 1836 | "outputs": [], 1837 | "source": [ 1838 | "is_bipartite, types = G_pub_authors.is_bipartite(return_types = True)\n", 1839 | "print(is_bipartite)" 1840 | ] 1841 | }, 1842 | { 1843 | "cell_type": "markdown", 1844 | "metadata": {}, 1845 | "source": [ 1846 | "The actual types are simply returned as a list of `True` and `False` values, which are arbitrary labels for publications and authors. Let us see what the first label stands for." 1847 | ] 1848 | }, 1849 | { 1850 | "cell_type": "code", 1851 | "execution_count": null, 1852 | "metadata": {}, 1853 | "outputs": [], 1854 | "source": [ 1855 | "print(types[0])\n", 1856 | "print(G_pub_authors.vs[0])" 1857 | ] 1858 | }, 1859 | { 1860 | "cell_type": "markdown", 1861 | "metadata": {}, 1862 | "source": [ 1863 | "From the `name` of node `0` we can see that it refers to a publication, and so `False` indicates publications, while `True` indicates authors." 1864 | ] 1865 | }, 1866 | { 1867 | "cell_type": "markdown", 1868 | "metadata": {}, 1869 | "source": [ 1870 | "We now would like to perform a so-called *bipartite projection* onto the authors. This is exactly the type of operation that leads to a co-authorship network. If we were to *project* onto the publication, we would end up with a network of publications where each pair of publications is linked if it is authored by the same author." 1871 | ] 1872 | }, 1873 | { 1874 | "cell_type": "code", 1875 | "execution_count": null, 1876 | "metadata": {}, 1877 | "outputs": [], 1878 | "source": [ 1879 | "G_author_projection = G_pub_authors.bipartite_projection(types=types, which=True)" 1880 | ] 1881 | }, 1882 | { 1883 | "cell_type": "markdown", 1884 | "metadata": {}, 1885 | "source": [ 1886 | "By default, it keeps track of the *multiplicity* (i.e. the number of joint papers) in the `weight` edge attribute. Unfortunately, it is not possible to do fractional counting using this approach." 1887 | ] 1888 | }, 1889 | { 1890 | "cell_type": "markdown", 1891 | "metadata": {}, 1892 | "source": [ 1893 | "
\n", 1894 | " Check the number of nodes in the bipartite projection. Why is it different from the number of nodes in the earlier constructed G_coauthorship? (Hint: checkout the degree.)\n", 1895 | "
" 1896 | ] 1897 | }, 1898 | { 1899 | "cell_type": "code", 1900 | "execution_count": null, 1901 | "metadata": {}, 1902 | "outputs": [], 1903 | "source": [] 1904 | }, 1905 | { 1906 | "cell_type": "markdown", 1907 | "metadata": {}, 1908 | "source": [ 1909 | "# Analysis of your own data" 1910 | ] 1911 | }, 1912 | { 1913 | "cell_type": "markdown", 1914 | "metadata": {}, 1915 | "source": [ 1916 | "You have now learned the basics of handling WoS files and transforming them into scientometric networks. Please take some time now to do your own analysis." 1917 | ] 1918 | }, 1919 | { 1920 | "cell_type": "markdown", 1921 | "metadata": {}, 1922 | "source": [ 1923 | "
\n", 1924 | "Go to Web of Science and select a publication set of interest. Make sure that the number of publications is higher than 1000, but lower than 5000. Export the files as follows:\n", 1925 | "
    \n", 1926 | "
  1. Export using \"Save to Other File Formats\".\n", 1927 | "
  2. Select the appropriate records (e.g. 1-500, 501-1000, etc...).\n", 1928 | "
  3. Select the Record Content \"Full Record and Cited References\".\n", 1929 | "
  4. Select the File Format \"Tab delimited (Win, UTF8)\".\n", 1930 | "
  5. Click on Send.\n", 1931 | "
\n", 1932 | "Repeat the above steps for each batch of 500 publications.\n", 1933 | "\n", 1934 | "Load the data from all files you downloaded using pandas\n", 1935 | "
" 1936 | ] 1937 | }, 1938 | { 1939 | "cell_type": "code", 1940 | "execution_count": null, 1941 | "metadata": {}, 1942 | "outputs": [], 1943 | "source": [] 1944 | }, 1945 | { 1946 | "cell_type": "markdown", 1947 | "metadata": {}, 1948 | "source": [ 1949 | "
\n", 1950 | "Create a co-authorship network of your publications. Hint: use the approach you encountered earlier.\n", 1951 | "
" 1952 | ] 1953 | }, 1954 | { 1955 | "cell_type": "code", 1956 | "execution_count": null, 1957 | "metadata": {}, 1958 | "outputs": [], 1959 | "source": [] 1960 | }, 1961 | { 1962 | "cell_type": "markdown", 1963 | "metadata": {}, 1964 | "source": [ 1965 | "
\n", 1966 | "Identify the authors that are most central to the coauthorship network and interpret the results.\n", 1967 | "
" 1968 | ] 1969 | }, 1970 | { 1971 | "cell_type": "code", 1972 | "execution_count": null, 1973 | "metadata": {}, 1974 | "outputs": [], 1975 | "source": [] 1976 | }, 1977 | { 1978 | "cell_type": "markdown", 1979 | "metadata": {}, 1980 | "source": [ 1981 | "
\n", 1982 | "Create a co-citation network of your publications. Hint: use the bibliographic coupling approach, but switch the roles of the source and the target.\n", 1983 | "
" 1984 | ] 1985 | }, 1986 | { 1987 | "cell_type": "code", 1988 | "execution_count": null, 1989 | "metadata": {}, 1990 | "outputs": [], 1991 | "source": [] 1992 | }, 1993 | { 1994 | "cell_type": "markdown", 1995 | "metadata": {}, 1996 | "source": [ 1997 | "
\n", 1998 | "Identify the publications that are most central to the co-citation network and interpret the results. Are they relatively recent publications or not?\n", 1999 | "
" 2000 | ] 2001 | }, 2002 | { 2003 | "cell_type": "code", 2004 | "execution_count": null, 2005 | "metadata": {}, 2006 | "outputs": [], 2007 | "source": [] 2008 | } 2009 | ], 2010 | "metadata": { 2011 | "kernelspec": { 2012 | "display_name": "Python 3", 2013 | "language": "python", 2014 | "name": "python3" 2015 | }, 2016 | "language_info": { 2017 | "codemirror_mode": { 2018 | "name": "ipython", 2019 | "version": 3 2020 | }, 2021 | "file_extension": ".py", 2022 | "mimetype": "text/x-python", 2023 | "name": "python", 2024 | "nbconvert_exporter": "python", 2025 | "pygments_lexer": "ipython3", 2026 | "version": "3.8.3" 2027 | } 2028 | }, 2029 | "nbformat": 4, 2030 | "nbformat_minor": 2 2031 | } 2032 | --------------------------------------------------------------------------------