├── docs
│   ├── .gitignore
│   ├── handson
│   │   ├── .gitignore
│   │   ├── 01
│   │   │   ├── .gitignore
│   │   │   ├── 01-example.tsv
│   │   │   ├── 01-example.csv
│   │   │   ├── 01-venues.csv
│   │   │   ├── 01-publications.csv
│   │   │   └── 01-publications-venues.json
│   │   ├── 02
│   │   │   ├── .gitignore
│   │   │   ├── uml.png
│   │   │   ├── uml2.png
│   │   │   ├── classuse.py
│   │   │   ├── myclasses.py
│   │   │   └── 02-Implementation_of_data_models_via_Python_classes.ipynb
│   │   ├── 03
│   │   │   └── .gitignore
│   │   ├── 04
│   │   │   ├── .gitignore
│   │   │   ├── tables.png
│   │   │   └── 04-Configuring_and_populating_a_relational_database.ipynb
│   │   ├── 05
│   │   │   ├── .gitignore
│   │   │   ├── rdfgraph.png
│   │   │   ├── blazegraph.png
│   │   │   └── 05-Configuring_and_populating_a_graph_database.ipynb
│   │   └── 06
│   │       └── 06-Interacting_with_databases_using_Pandas.ipynb
│   ├── lecture
│   │   ├── .gitignore
│   │   ├── 00
│   │   │   ├── 00.pdf
│   │   │   └── 00.pptx
│   │   ├── 01
│   │   │   ├── 01.pdf
│   │   │   └── 01.pptx
│   │   ├── 02
│   │   │   ├── 02.pdf
│   │   │   └── 02.pptx
│   │   ├── 03
│   │   │   ├── 03.pdf
│   │   │   └── 03.pptx
│   │   ├── 04
│   │   │   ├── 04.pdf
│   │   │   └── 04.pptx
│   │   ├── 05
│   │   │   ├── 05.pdf
│   │   │   └── 05.pptx
│   │   ├── 06
│   │   │   ├── 06.pdf
│   │   │   └── 06.pptx
│   │   ├── 07
│   │   │   ├── 07.pdf
│   │   │   └── 07.pptx
│   │   └── rdfdata.png
│   └── project
│       ├── .gitignore
│       ├── uml.png
│       ├── uml2.png
│       ├── datamodel.png
│       ├── workflow.png
│       └── README.md
├── .gitignore
└── README.md
/docs/handson/.gitignore:
--------------------------------------------------------------------------------
1 | .*/
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | private/
2 | .DS_Store
--------------------------------------------------------------------------------
/docs/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 | .*/
--------------------------------------------------------------------------------
/docs/lecture/.gitignore:
--------------------------------------------------------------------------------
1 | .*/
2 | notes.*
--------------------------------------------------------------------------------
/docs/handson/01/.gitignore:
--------------------------------------------------------------------------------
1 | *-modified*
2 |
--------------------------------------------------------------------------------
/docs/handson/03/.gitignore:
--------------------------------------------------------------------------------
1 | *.csv
2 |
3 |
--------------------------------------------------------------------------------
/docs/project/.gitignore:
--------------------------------------------------------------------------------
1 | *.graphml
2 | private/
--------------------------------------------------------------------------------
/docs/handson/02/.gitignore:
--------------------------------------------------------------------------------
1 | __py*
2 | *.graphml
3 |
--------------------------------------------------------------------------------
/docs/handson/04/.gitignore:
--------------------------------------------------------------------------------
1 | __py*
2 | *.graphml
3 | *.db
4 |
--------------------------------------------------------------------------------
/docs/handson/05/.gitignore:
--------------------------------------------------------------------------------
1 | *.jnl
2 | *.jar
3 | *.graphml
4 | rules.log
--------------------------------------------------------------------------------
/docs/project/uml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/project/uml.png
--------------------------------------------------------------------------------
/docs/project/uml2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/project/uml2.png
--------------------------------------------------------------------------------
/docs/handson/02/uml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/handson/02/uml.png
--------------------------------------------------------------------------------
/docs/handson/02/uml2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/handson/02/uml2.png
--------------------------------------------------------------------------------
/docs/lecture/00/00.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/00/00.pdf
--------------------------------------------------------------------------------
/docs/lecture/00/00.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/00/00.pptx
--------------------------------------------------------------------------------
/docs/lecture/01/01.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/01/01.pdf
--------------------------------------------------------------------------------
/docs/lecture/01/01.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/01/01.pptx
--------------------------------------------------------------------------------
/docs/lecture/02/02.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/02/02.pdf
--------------------------------------------------------------------------------
/docs/lecture/02/02.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/02/02.pptx
--------------------------------------------------------------------------------
/docs/lecture/03/03.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/03/03.pdf
--------------------------------------------------------------------------------
/docs/lecture/03/03.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/03/03.pptx
--------------------------------------------------------------------------------
/docs/lecture/04/04.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/04/04.pdf
--------------------------------------------------------------------------------
/docs/lecture/04/04.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/04/04.pptx
--------------------------------------------------------------------------------
/docs/lecture/05/05.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/05/05.pdf
--------------------------------------------------------------------------------
/docs/lecture/05/05.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/05/05.pptx
--------------------------------------------------------------------------------
/docs/lecture/06/06.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/06/06.pdf
--------------------------------------------------------------------------------
/docs/lecture/06/06.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/06/06.pptx
--------------------------------------------------------------------------------
/docs/lecture/07/07.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/07/07.pdf
--------------------------------------------------------------------------------
/docs/lecture/07/07.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/07/07.pptx
--------------------------------------------------------------------------------
/docs/lecture/rdfdata.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/lecture/rdfdata.png
--------------------------------------------------------------------------------
/docs/handson/04/tables.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/handson/04/tables.png
--------------------------------------------------------------------------------
/docs/project/datamodel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/project/datamodel.png
--------------------------------------------------------------------------------
/docs/project/workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/project/workflow.png
--------------------------------------------------------------------------------
/docs/handson/05/rdfgraph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/handson/05/rdfgraph.png
--------------------------------------------------------------------------------
/docs/handson/05/blazegraph.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/comp-data/2021-2022/HEAD/docs/handson/05/blazegraph.png
--------------------------------------------------------------------------------
/docs/handson/01/01-example.tsv:
--------------------------------------------------------------------------------
1 | column name another name, with a comma
2 | a value a value, with a comma
3 | a quoted "value" a quoted "value", with a comma
--------------------------------------------------------------------------------
/docs/handson/01/01-example.csv:
--------------------------------------------------------------------------------
1 | column name,"another name, with a comma"
2 | a value,"a value, with a comma"
3 | a quoted "value","a quoted ""value"", with a comma"
--------------------------------------------------------------------------------
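The quoting rules this file demonstrates (fields containing commas are wrapped in double quotes, and embedded double quotes are doubled) can be checked with Python's standard `csv` module; a minimal sketch, parsing the same content from an inline string:

```python
import csv
import io

# Same content as 01-example.csv: fields with commas are quoted,
# embedded double quotes are doubled
data = (
    'column name,"another name, with a comma"\n'
    'a value,"a value, with a comma"\n'
    'a quoted "value","a quoted ""value"", with a comma"\n'
)

# csv.reader undoes the quoting: each row becomes a list of two fields
rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(row)
```

Note that in the third row the quotes of the first field are kept literally, because the field does not start with a quote character.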
/docs/handson/01/01-venues.csv:
--------------------------------------------------------------------------------
1 | id,name,type
2 | 1531-6912,Comparative and Functional Genomics,journal
3 | 1367-5931,Current Opinion in Chemical Biology,journal
4 | 9780470291092,Proceedings of the 5th Annual Conference on Composites and Advanced Ceramic Materials: Ceramic Engineering and Science Proceedings,book
5 | 1027-3662,Journal of Theoretical Medicine,journal
--------------------------------------------------------------------------------
/docs/handson/01/01-publications.csv:
--------------------------------------------------------------------------------
1 | doi,title,publication year,publication venue,type,issue,volume
2 | 10.1002/cfg.304,Development of Computational Tools for the Inference of Protein Interaction Specificity Rules and Functional Annotation Using Structural Information,2003,1531-6912,journal article,4,4
3 | 10.1016/s1367-5931(02)00332-0,In vitro selection as a powerful tool for the applied evolution of proteins and peptides,2002,1367-5931,journal article,3,6
4 | 10.1002/9780470291092.ch20,Mechanisms of Toughening in Ceramic Matrix Composites,1981,9780470291092,book chapter,,
--------------------------------------------------------------------------------
/docs/handson/02/classuse.py:
--------------------------------------------------------------------------------
1 | from myclasses import JournalArticle, Journal
2 |
3 | journal_1 = Journal(["1531-6912"], "Comparative and Functional Genomics")
4 |
5 | journal_article_1 = JournalArticle("10.1002/cfg.304",
6 | 2003,
7 | "Development of Computational Tools for the Inference of Protein Interaction Specificity Rules and Functional Annotation Using Structural Information",
8 | journal_1,
9 | "4",
10 | "4")
11 |
12 | print("-- Journal article metadata")
13 | print(" | title:", journal_article_1.getTitle())
14 | print(" | venue name:", journal_article_1.getPublicationVenue().getName())
15 | print(" | issue:", journal_article_1.getIssue())
16 | print(" | volume:", journal_article_1.getVolume())
17 |
--------------------------------------------------------------------------------
/docs/handson/01/01-publications-venues.json:
--------------------------------------------------------------------------------
1 | [
2 | {
3 | "doi": "10.1002/cfg.304",
4 | "issue": "4",
5 | "publication venue": {
6 | "id": [ "1531-6912" ],
7 | "name": "Comparative and Functional Genomics",
8 | "type": "journal"
9 | },
10 | "publication year": 2003,
11 | "title": "Development of Computational Tools for the Inference of Protein Interaction Specificity Rules and Functional Annotation Using Structural Information",
12 | "type": "journal article",
13 | "volume": "4"
14 | },
15 | {
16 | "doi": "10.1016/s1367-5931(02)00332-0",
17 | "issue": "3",
18 | "publication venue": {
19 | "id": [ "1367-5931" ],
20 | "name": "Current Opinion in Chemical Biology",
21 | "type": "journal"
22 | },
23 | "publication year": 2002,
24 | "title": "In vitro selection as a powerful tool for the applied evolution of proteins and peptides",
25 | "type": "journal article",
26 | "volume": "6"
27 | },
28 | {
29 | "doi": "10.1002/9780470291092.ch20",
30 | "publication venue": {
31 | "id": [ "9780470291092" ],
32 | "name": "Proceedings of the 5th Annual Conference on Composites and Advanced Ceramic Materials: Ceramic Engineering and Science Proceedings",
33 | "type": "book"
34 | },
35 | "publication year": 1981,
36 | "title": "Mechanisms of Toughening in Ceramic Matrix Composites",
37 | "type": "book chapter"
38 | }
39 | ]
--------------------------------------------------------------------------------
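The nested structure of this file can be navigated with Python's standard `json` module; a sketch using a trimmed copy of the first record, so that the snippet is self-contained:

```python
import json

# Trimmed copy of the first record of 01-publications-venues.json
text = """
[
  {
    "doi": "10.1002/cfg.304",
    "issue": "4",
    "publication venue": {
      "id": [ "1531-6912" ],
      "name": "Comparative and Functional Genomics",
      "type": "journal"
    },
    "publication year": 2003,
    "type": "journal article",
    "volume": "4"
  }
]
"""

# json.loads maps the JSON array to a list of dicts; the nested
# "publication venue" object becomes a nested dict
publications = json.loads(text)
for pub in publications:
    venue = pub["publication venue"]
    print(pub["doi"], "->", venue["name"], "(" + venue["type"] + ")")
```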
/docs/handson/02/myclasses.py:
--------------------------------------------------------------------------------
1 | class Publication(object):
2 | def __init__(self, doi, publicationYear, title, publicationVenue):
3 | self.doi = doi
4 | self.publicationYear = publicationYear
5 | self.title = title
6 | self.publicationVenue = publicationVenue
7 |
8 | def getDOI(self):
9 | return self.doi
10 |
11 | def getPublicationYear(self):
12 | return self.publicationYear
13 |
14 | def getTitle(self):
15 | return self.title
16 |
17 | def getPublicationVenue(self):
18 | return self.publicationVenue
19 |
20 |
21 | class Venue(object):
22 | def __init__(self, identifiers, name):
23 | self.id = set()
24 | for identifier in identifiers:
25 | self.id.add(identifier)
26 |
27 | self.name = name
28 |
29 | def getIds(self):
30 | result = []
31 | for identifier in self.id:
32 | result.append(identifier)
33 | result.sort()
34 | return result
35 |
36 | def getName(self):
37 | return self.name
38 |
39 | def addId(self, identifier):
40 | result = True
41 | if identifier not in self.id:
42 | self.id.add(identifier)
43 | else:
44 | result = False
45 | return result
46 |
47 | def removeId(self, identifier):
48 | result = True
49 | if identifier in self.id:
50 | self.id.remove(identifier)
51 | else:
52 | result = False
53 | return result
54 |
55 |
56 | class JournalArticle(Publication):
57 | def __init__(self, doi, publicationYear, title, publicationVenue, issue, volume):
58 | self.issue = issue
59 | self.volume = volume
60 |
61 | # Here is where the constructor of the superclass is explicitly recalled, so as
62 | # to handle the input parameters as done in the superclass
63 | super().__init__(doi, publicationYear, title, publicationVenue)
64 |
65 | def getIssue(self):
66 | return self.issue
67 |
68 | def getVolume(self):
69 | return self.volume
70 |
71 |
72 | class BookChapter(Publication):
73 | pass
74 |
75 |
76 | class Journal(Venue):
77 | pass
78 |
79 |
80 | class Book(Venue):
81 | pass
--------------------------------------------------------------------------------
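The subclasses declared with `pass` (e.g. `Journal`) inherit the whole behaviour of their superclass. A condensed, self-contained copy of `Venue` illustrating the semantics of `addId` and `removeId`:

```python
class Venue(object):
    # Condensed copy of the Venue class in myclasses.py,
    # kept inline so this example is self-contained
    def __init__(self, identifiers, name):
        self.id = set(identifiers)
        self.name = name

    def getIds(self):
        return sorted(self.id)

    def getName(self):
        return self.name

    def addId(self, identifier):
        if identifier in self.id:
            return False  # already present: nothing to add
        self.id.add(identifier)
        return True

    def removeId(self, identifier):
        if identifier not in self.id:
            return False  # not present: nothing to remove
        self.id.remove(identifier)
        return True


class Journal(Venue):
    pass  # inherits everything from Venue


journal = Journal(["1531-6912"], "Comparative and Functional Genomics")
print(journal.addId("1531-6913"))  # True: a new identifier
print(journal.addId("1531-6912"))  # False: already present
print(journal.getIds())            # ['1531-6912', '1531-6913']
```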
/README.md:
--------------------------------------------------------------------------------
1 | # Data Science
2 |
3 | This space contains all the material related to the [Data Science course](https://www.unibo.it/en/teaching/course-unit-catalogue/course-unit/2021/467046) of the [Digital Humanities and Digital Knowledge degree](https://corsi.unibo.it/2cycle/DigitalHumanitiesKnowledge) at the [University of Bologna](http://www.unibo.it/en).
4 |
5 | ## Academic year 2021/2022
6 |
7 | ### Table of contents
8 | 
9 | - [Data Science](#data-science)
10 |   - [Academic year 2021/2022](#academic-year-20212022)
11 |     - [Table of contents](#table-of-contents)
12 |     - [Material](#material)
13 |     - [Schedule](#schedule)
14 |     - [Exam sessions](#exam-sessions)
15 |     - [Links](#links)
16 |
17 | ### Material
18 |
19 | **Keys:**
20 |
21 | - _the_ = theoretical lecture
22 | - _hon_ = hands-on session
23 |
24 | 1. [31/01/22, *the*] Introduction to the course and final project specifications
25 | - slides: [PDF](https://comp-data.github.io/2021-2022/lecture/00/00.pdf)
26 |
27 |
28 | 2. [02/02/22, *the*] What is a datum and how it can be represented computationally
29 | - slides: [PDF](https://comp-data.github.io/2021-2022/lecture/01/01.pdf)
30 |
31 |
32 | 3. [04/02/22, *hon*] Data formats and methods for storing data in Python
33 | - material: [GitHub](https://github.com/comp-data/2021-2022/tree/main/docs/handson/01)
34 |
35 |
36 | 4. [07/02/22, *the*] Introduction to data modelling
37 | - slides: [PDF](https://comp-data.github.io/2021-2022/lecture/02/02.pdf)
38 |
39 |
40 | 5. [09/02/22, *hon*] Implementation of data models via Python classes
41 | - material: [GitHub](https://github.com/comp-data/2021-2022/tree/main/docs/handson/02)
42 |
43 |
44 | 6. [11/02/22, *the*] Processing and querying the data
45 | - slides: [PDF](https://comp-data.github.io/2021-2022/lecture/03/03.pdf)
46 |
47 |
48 | 7. [14/02/22, *hon*] Introduction to Pandas
49 | - material: [GitHub](https://github.com/comp-data/2021-2022/tree/main/docs/handson/03)
50 |
51 |
52 | 8. [16/02/22, *the*] Database Management Systems
53 | - slides: [PDF](https://comp-data.github.io/2021-2022/lecture/04/04.pdf)
54 |
55 |
56 | 9. [18/02/22, *hon*] Configuring and populating a relational database
57 | - material: [GitHub](https://github.com/comp-data/2021-2022/tree/main/docs/handson/04)
58 |
59 |
60 | 10. [21/02/22, *the*] SQL, a query language for relational databases
61 | - slides: [PDF](https://comp-data.github.io/2021-2022/lecture/05/05.pdf)
62 |
63 |
64 | 11. [23/02/22, *hon*] Configuring and populating a graph database
65 | - material: [GitHub](https://github.com/comp-data/2021-2022/tree/main/docs/handson/05)
66 |
67 |
68 | 12. [25/02/22, *the*] SPARQL, a query language for RDF databases
69 | - slides: [PDF](https://comp-data.github.io/2021-2022/lecture/06/06.pdf)
70 |
71 |
72 | 13. [28/02/22, *hon*] Interacting with databases using Pandas
73 | - material: [GitHub](https://github.com/comp-data/2021-2022/tree/main/docs/handson/06)
74 |
75 |
76 | 14. [02/03/22, *the*] Describing and visualising data
77 | - slides: [PDF](https://comp-data.github.io/2021-2022/lecture/07/07.pdf)
78 |
79 |
80 | 15. [04/03/22, *hon*] Descriptive statistics and graphs about data using Pandas
81 | - material: [GitHub](https://github.com/comp-data/2021-2022/tree/main/docs/handson/07)
82 |
83 |
84 | ### Schedule
85 | 
86 | | Date | Time | Topic |
87 | | --- | --- | --- |
88 | | 31/01/2022 | 12:30-14:30 | Introduction to the course and final project specifications |
89 | | 02/02/2022 | 12:30-14:30 | What is a datum and how it can be represented computationally |
90 | | 04/02/2022 | 12:30-14:30 | Data formats and methods for storing data in Python |
91 | | 07/02/2022 | 12:30-14:30 | Introduction to data modelling |
92 | | 09/02/2022 | 12:30-14:30 | Implementation of data models via Python classes |
93 | | 11/02/2022 | 12:30-14:30 | Processing and querying the data |
94 | | 14/02/2022 | 12:30-14:30 | Introduction to Pandas |
95 | | 16/02/2022 | 12:30-14:30 | Database Management Systems |
96 | | 18/02/2022 | 12:30-14:30 | Configuring and populating a relational database |
97 | | 21/02/2022 | 12:30-14:30 | SQL, a query language for relational databases |
98 | | 23/02/2022 | 12:30-14:30 | Configuring and populating a graph database |
99 | | 25/02/2022 | 12:30-14:30 | SPARQL, a query language for RDF databases |
100 | | 28/02/2022 | 12:30-14:30 | Interacting with databases using Pandas |
101 | | 02/03/2022 | 12:30-14:30 | Describing and visualising data |
102 | | 04/03/2022 | 12:30-14:30 | Descriptive statistics and graphs about data using Pandas |
103 | 
104 | ### Exam sessions
105 |
106 | - 16 May 2022
107 | - 20 June 2022
108 | - 15 July 2022
109 | - 5 September 2022
110 |
111 | ### Links
112 |
113 | - [Project information](https://github.com/comp-data/2021-2022/tree/main/docs/project)
114 |
--------------------------------------------------------------------------------
/docs/project/README.md:
--------------------------------------------------------------------------------
1 | # Data Science: project
2 |
3 | The goal of the project is to develop a piece of software that enables one to process data stored in different formats, to upload them into two distinct databases,
4 | and to query these databases simultaneously according to predefined operations. The software must be accompanied by a document (i.e. a Jupyter notebook) describing the data to process (their main characteristics and possible issues) and how the software has been organised (the names of the files, where the various Python classes have been defined, etc.).
5 |
6 | ## Data
7 |
8 | Exemplar data for testing the project have been made available. In particular:
9 |
10 | * for creating the relational database, there are two files, a [CSV file](data/relational_publications.csv) containing data about publications and a [JSON file](data/relational_other_data.json) containing additional information including the authors of each publication, the identifiers of the venue of each publication, and the identifier and name of each publisher publishing the venues;
11 |
12 | * for creating the RDF triplestore, there are two files, a [CSV file](data/graph_publications.csv) containing data about publications and a [JSON file](data/graph_other_data.json) containing additional information including the authors of each publication, the identifiers of the venue of each publication, and the identifier and name of each publisher publishing the venues.
13 |
14 | ## Workflow
15 |
16 | 
17 |
18 | ## Data model
19 |
20 | 
21 |
22 | ## UML of data model classes
23 |
24 | 
25 |
26 | All the methods of each class must return the appropriate values that have been specified in the object of that class when it was created. It is up to the implementer to decide how to enable someone to add this information to the object of each class, e.g. by defining a specific constructor. While one can add additional methods to each class if needed, it is crucial that the *get* methods introduced in the UML diagram are all defined.
27 |
28 | ## UML of additional classes
29 |
30 | 
31 |
32 | All the attributes and methods of each class are defined as follows. None of the constructors of the classes introduced in the UML diagram takes any parameter in input. While one can add additional methods to each class if needed, it is crucial that all the methods introduced in the UML diagram are defined.
33 |
34 | ### Class `RelationalProcessor`
35 |
36 | #### Attributes
37 | `dbPath`: the variable containing the path of the database, initially set to an empty string, which will be updated with the method `setDbPath`.
38 |
39 | #### Methods
40 | `getDbPath`: it returns the path of the database.
41 |
42 | `setDbPath`: it enables one to set a new path for the database to handle.
43 |
44 | ### Class `TriplestoreProcessor`
45 |
46 | #### Attributes
47 | `endpointUrl`: the variable containing the URL of the SPARQL endpoint of the triplestore, initially set to an empty string, which will be updated with the method `setEndpointUrl`.
48 |
49 | #### Methods
50 | `getEndpointUrl`: it returns the URL of the SPARQL endpoint of the triplestore.
51 |
52 | `setEndpointUrl`: it enables one to set a new URL for the SPARQL endpoint of the triplestore.
53 |
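A minimal sketch of the two processor base classes described above; only the attribute and method names come from this specification, while the constructor details are assumptions:

```python
class RelationalProcessor(object):
    def __init__(self):
        # Path of the database file, empty until setDbPath is called
        self.dbPath = ""

    def getDbPath(self):
        return self.dbPath

    def setDbPath(self, path):
        self.dbPath = path


class TriplestoreProcessor(object):
    def __init__(self):
        # URL of the SPARQL endpoint, empty until setEndpointUrl is called
        self.endpointUrl = ""

    def getEndpointUrl(self):
        return self.endpointUrl

    def setEndpointUrl(self, url):
        self.endpointUrl = url
```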
54 | ### Classes `RelationalDataProcessor` and `TriplestoreDataProcessor`
55 |
56 | #### Methods
57 | `uploadData`: it enables one to upload the collection of data specified in the input file path (either in CSV or JSON, according to the [formats specified above](#data)) into the database.
58 |
59 | ### Classes `RelationalQueryProcessor` and `TriplestoreQueryProcessor`
60 |
61 | #### Methods
62 | `getPublicationsPublishedInYear`: It returns a data frame with all the publications (i.e. the rows) that have been published in the input year (e.g. `2020`).
63 |
64 | `getPublicationsByAuthorId`: It returns a data frame with all the publications (i.e. the rows) that have been authored by the person having the identifier specified as input (e.g. `"0000-0001-9857-1511"`).
65 |
66 | `getMostCitedPublication`: It returns a data frame with all the publications (i.e. the rows) that have received the highest number of citations from other publications.
67 | 
68 | `getMostCitedVenue`: It returns a data frame with all the venues (i.e. the rows) containing the publications that, overall, have received the highest number of citations from other publications.
69 |
70 | `getVenuesByPublisherId`: It returns a data frame with all the venues (i.e. the rows) that have been published by the organisation having the identifier specified as input (e.g. `"crossref:78"`).
71 |
72 | `getPublicationInVenue`: It returns a data frame with all the publications (i.e. the rows) that have been included in the venue having the identifier specified as input (e.g. `"issn:0944-1344"`).
73 |
74 | `getJournalArticlesInIssue`: It returns a data frame with all the journal articles (i.e. the rows) that have been included in the input issue (e.g. `"9"`) of the input volume (e.g. `"17"`) of the journal having the identifier specified as input (e.g. `"issn:2164-5515"`).
75 |
76 | `getJournalArticlesInVolume`: It returns a data frame with all the journal articles (i.e. the rows) that have been included, independently from the issue, in the input volume (e.g. `"17"`) of the journal having the identifier specified as input (e.g. `"issn:2164-5515"`).
77 |
78 | `getJournalArticlesInJournal`: It returns a data frame with all the journal articles (i.e. the rows) that have been included, independently from the issue and the volume, in the journal having the identifier specified as input (e.g. `"issn:2164-5515"`).
79 |
80 | `getProceedingsByEvent`: It returns a data frame with all the proceedings (i.e. the rows) that refer to the events that match (in lowercase), even partially, with the name specified as input (e.g. `"web"`).
81 |
82 | `getPublicationAuthors`: It returns a data frame with all the authors (i.e. the rows) of the publication with the identifier specified as input (e.g. `"doi:10.1080/21645515.2021.1910000"`).
83 |
84 | `getPublicationsByAuthorName`: It returns a data frame with all the publications (i.e. the rows) that have been authored by the people having their name matching (in lowercase), even partially, with the name specified as input (e.g. `"doe"`).
85 |
86 | `getDistinctPublisherOfPublications`: It returns a data frame with all the distinct publishers (i.e. the rows) that have published the venues of the publications whose identifiers are specified as input (e.g. `[ "doi:10.1080/21645515.2021.1910000", "doi:10.3390/ijfs9030035" ]`).
87 |
88 | ### Class `GenericQueryProcessor`
89 |
90 | #### Attributes
91 | `queryProcessor`: the variable containing the list of `QueryProcessor` objects to involve when one of the *get* methods below is executed. In practice, every time a *get* method is executed, the method will call the related method on all the `QueryProcessor` objects included in the variable `queryProcessor`, before combining the results and returning the requested object.
92 |
93 | #### Methods
94 | `cleanQueryProcessors`: It cleans the list `queryProcessor`, removing all the `QueryProcessor` objects it includes.
95 | 
96 | `addQueryProcessor`: It appends the input `QueryProcessor` object to the list `queryProcessor`.
97 |
98 | `getPublicationsPublishedInYear`: It returns a list of `Publication` objects referring to all the publications that have been published in the input year (e.g. `2020`).
99 |
102 | `getPublicationsByAuthorId`: It returns a list of `Publication` objects referring to all the publications that have been authored by the person having the identifier specified as input (e.g. `"0000-0001-9857-1511"`).
103 |
104 | `getMostCitedPublication`: It returns the `Publication` object that has received the highest number of citations from other publications.
105 | 
106 | `getMostCitedVenue`: It returns the `Venue` object containing the publications that, overall, have received the highest number of citations from other publications.
107 |
108 | `getVenuesByPublisherId`: It returns a list of `Venue` objects referring to all the venues that have been published by the organisation having the identifier specified as input (e.g. `"crossref:78"`).
109 |
110 | `getPublicationInVenue`: It returns a list of `Publication` objects referring to all the publications that have been included in the venue having the identifier specified as input (e.g. `"issn:0944-1344"`).
111 |
112 | `getJournalArticlesInIssue`: It returns a list of `JournalArticle` objects referring to all the journal articles that have been included in the input issue (e.g. `"9"`) of the input volume (e.g. `"17"`) of the journal having the identifier specified as input (e.g. `"issn:2164-5515"`).
113 |
114 | `getJournalArticlesInVolume`: It returns a list of `JournalArticle` objects referring to all the journal articles that have been included, independently from the issue, in the input volume (e.g. `"17"`) of the journal having the identifier specified as input (e.g. `"issn:2164-5515"`).
115 |
116 | `getJournalArticlesInJournal`: It returns a list of `JournalArticle` objects referring to all the journal articles that have been included, independently from the issue and the volume, in the journal having the identifier specified as input (e.g. `"issn:2164-5515"`).
117 |
118 | `getProceedingsByEvent`: It returns a list of `Proceedings` objects referring to all the proceedings that refer to the events that match (in lowercase), even partially, with the name specified as input (e.g. `"web"`).
119 |
120 | `getPublicationAuthors`: It returns a list of `Person` objects referring to all the authors of the publication with the identifier specified as input (e.g. `"doi:10.1080/21645515.2021.1910000"`).
121 |
122 | `getPublicationsByAuthorName`: It returns a list of `Publication` objects referring to all the publications that have been authored by the people having their name matching (in lowercase), even partially, with the name specified as input (e.g. `"doe"`).
123 |
124 | `getDistinctPublisherOfPublications`: It returns a list of `Organization` objects referring to all the distinct publishers that have published the venues of the publications whose identifiers are specified as input (e.g. `[ "doi:10.1080/21645515.2021.1910000", "doi:10.3390/ijfs9030035" ]`).
125 |
126 | ## Uses of the classes
127 |
128 | ```
129 | # Supposing that all the classes developed for the project
130 | # are contained in the file 'impl.py', then:
131 |
132 | # 1) Importing all the classes for handling the relational database
133 | from impl import RelationalDataProcessor, RelationalQueryProcessor
134 |
135 | # 2) Importing all the classes for handling RDF database
136 | from impl import TriplestoreDataProcessor, TriplestoreQueryProcessor
137 |
138 | # 3) Importing the class for dealing with generic queries
139 | from impl import GenericQueryProcessor
140 |
141 | # Once all the classes are imported, first create the relational
142 | # database using the related source data
143 | rel_path = "relational.db"
144 | rel_dp = RelationalDataProcessor()
145 | rel_dp.setDbPath(rel_path)
146 | rel_dp.uploadData("data/relational_publications.csv")
147 | rel_dp.uploadData("data/relational_other_data.json")
148 |
149 | # Then, create the RDF triplestore (remember first to run the
150 | # Blazegraph instance) using the related source data
151 | grp_endpoint = "http://127.0.0.1:9999/blazegraph/sparql"
152 | grp_dp = TriplestoreDataProcessor()
153 | grp_dp.setEndpointUrl(grp_endpoint)
154 | grp_dp.uploadData("data/graph_publications.csv")
155 | grp_dp.uploadData("data/graph_other_data.json")
156 |
157 | # In the next passage, create the query processors for both
158 | # the databases, using the related classes
159 | rel_qp = RelationalQueryProcessor()
160 | rel_qp.setDbPath(rel_path)
161 |
162 | grp_qp = TriplestoreQueryProcessor()
163 | grp_qp.setEndpointUrl(grp_endpoint)
164 |
165 | # Finally, create a generic query processor for asking
166 | # about data
167 | generic = GenericQueryProcessor()
168 | generic.addQueryProcessor(rel_qp)
169 | generic.addQueryProcessor(grp_qp)
170 |
171 | result_q1 = generic.getPublicationsPublishedInYear(2020)
172 | result_q2 = generic.getPublicationsByAuthorId("0000-0001-9857-1511")
173 | # etc...
174 | ```
175 |
176 | ## Submission of the project
177 |
178 | You have to provide all Python files implementing your project, by sharing them in some way (e.g. via OneDrive). You have to send all the files **one week before** the exam session you want to take.
179 |
--------------------------------------------------------------------------------
/docs/handson/05/05-Configuring_and_populating_a_graph_database.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "compatible-disabled",
6 | "metadata": {},
7 | "source": [
8 | "# Configuring and populating a graph database\n",
9 | "\n",
10 | "In this tutorial, we show how to use RDF and Blazegraph to create a graph database using Python."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "id": "unavailable-torture",
16 | "metadata": {},
17 | "source": [
18 | "## What is RDF\n",
19 | "\n",
20 | "The [Resource Description Framework (RDF)](https://en.wikipedia.org/wiki/Resource_Description_Framework) is a high-level data model (sometimes it is improperly called a \"language\") based on triples *subject-predicate-object* called statements. For instance, a simple natural language sentence such as *Umberto Eco is author of The name of the rose* can be expressed through an RDF statement assigning to:\n",
21 | "\n",
22 | "* *Umberto Eco* the role of subject;\n",
23 | "* *is author of* the role of predicate;\n",
24 | "* *The name of the rose* the role of object.\n",
25 | "\n",
26 | "The main entities comprising RDF are listed as follows.\n",
27 | "\n",
28 | "\n",
29 | "### Resources\n",
30 | "\n",
31 | "A *resource* is an object we want to talk about, and it is identified by an [IRI](https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier). IRIs are the most generic class of Internet identifiers for resources, but often [HTTP URLs](https://en.wikipedia.org/wiki/URL) are used instead, which may be considered a subclass of IRIs (e.g. the [URL `http://www.wikidata.org/entity/Q12807`](http://www.wikidata.org/entity/Q12807) identifies Umberto Eco in [Wikidata](https://wikidata.org)).\n",
32 | "\n",
33 | "\n",
34 | "### Properties\n",
35 | "\n",
36 | "A *property* is a special type of resource since it is used to describe relations between resources, and it is identified by an IRI (e.g. the [URL `http://www.wikidata.org/prop/direct/P800`](http://www.wikidata.org/entity/P800) identifies the property *has notable work* - which mimics the *is author of* predicate of the statement above).\n",
37 | "\n",
38 | "\n",
39 | "### Statements\n",
40 | "\n",
41 | "*Statements* enable one to assert properties between resources. Each statement is a triple subject-predicate-object, where the subject is a resource, the predicate is a property, and the object is either a resource or a literal (i.e. a string). \n",
42 | "\n",
43 | "There are different notations that can be used to represent statements in RDF in plain text files. The simplest (and most verbose) one is called [N-Triples](https://en.wikipedia.org/wiki/N-Triples). It allows one to define statements according to the following syntax:\n",
44 | "\n",
45 | "```\n",
46 | "# 1) statement with a resource as an object\n",
47 | "<subject-IRI> <predicate-IRI> <object-IRI> .\n",
48 | "\n",
49 | "# 2) statement with a literal as an object\n",
50 | "<subject-IRI> <predicate-IRI> \"literal value\"^^<value-type-IRI> .\n",
51 | "```\n",
52 | "\n",
53 | "Type (1) statements must be used to state relationships between resources, while type (2) statements are generally used to associate attributes to a specific resource (the IRI defining the type of value is not specified for generic literals, i.e. strings). For instance, in Wikidata, the exemplar sentence above (*Umberto Eco is author of The name of the rose*) is defined by three distinct RDF statements:\n",
54 | "\n",
55 | "```\n",
56 | "<http://www.wikidata.org/entity/Q12807> <http://www.w3.org/2000/01/rdf-schema#label> \"Umberto Eco\" .\n",
57 | "\n",
58 | "<http://www.wikidata.org/entity/Q172850> <http://www.w3.org/2000/01/rdf-schema#label> \"The Name of the Rose\" .\n",
59 | "\n",
60 | "<http://www.wikidata.org/entity/Q12807> <http://www.wikidata.org/prop/direct/P800> <http://www.wikidata.org/entity/Q172850> .\n",
61 | "```\n",
62 | "\n",
63 | "Actually, the relation described by the natural language sentence is defined by the third RDF statement above. However, two additional statements have been added to associate the strings representing the names of the resources referring to *Umberto Eco* and *The name of the rose*. Be aware: literals (i.e. simple values) cannot be subjects in any statement.\n",
64 | "\n",
65 | "\n",
66 | "### A special property\n",
67 | "\n",
68 | "While all the properties you can use in your statements as predicates can be defined in several distinct vocabularies (the [Wikidata data model](https://www.wikidata.org/wiki/Wikidata:List_of_properties), [schema.org data model](https://schema.org/docs/datamodel.html), etc.), RDF defines a special property that is used to associate a resource with its intended type (e.g. another resource representing a class of resources). The IRI of this property is `http://www.w3.org/1999/02/22-rdf-syntax-ns#type`. For instance, we can use this property to assign the appropriate type to the two entities defined in the excerpt above, i.e. the ones referring to *Umberto Eco* and *The name of the rose*, as shown as follows:\n",
69 | "\n",
70 | "```\n",
71 | "<http://www.wikidata.org/entity/Q12807> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Person> .\n",
72 | "\n",
73 | "<http://www.wikidata.org/entity/Q172850> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/Book> .\n",
74 | "```\n",
75 | "\n",
76 | "In the example above, we reuse two existing classes of resources included in [schema.org](https://schema.org) for people and books. It is worth mentioning that an existing resource can be associated via `http://www.w3.org/1999/02/22-rdf-syntax-ns#type` to one or more types, if they apply.\n",
77 | "\n",
78 | "\n",
79 | "### RDF Graphs\n",
80 | "\n",
81 | "An *RDF Graph* is a set of RDF statements. For instance, a file that contains RDF statements represents an RDF graph, and the same IRI occurring in different graphs always refers to the same resource. \n",
82 | "\n",
83 | "We talk about graphs in this context because all the RDF statements, and the resources they include, actually define a directed graph structure, where the directed edges are labelled with the predicates of the statements and the subjects and objects are nodes linked by such edges. For instance, the diagram below represents all the RDF statements introduced above using a visual graph.\n",
84 | "\n",
85 | "<img src=\"rdfgraph.png\" />\n",
86 | "\n",
87 | "\n",
88 | "### Triplestores\n",
89 | "\n",
90 | "A *triplestore* is a database built for storing and retrieving RDF statements, and can contain one or more RDF graphs."
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "id": "drawn-identifier",
96 | "metadata": {},
97 | "source": [
98 | "## Blazegraph, a database for RDF data\n",
99 | "\n",
100 | "[Blazegraph DB](https://blazegraph.com/) is an ultra-high-performance graph database supporting RDF/SPARQL APIs (thus, it is a triplestore). It supports up to 50 billion edges on a single machine. Its code is entirely [open source and available on GitHub](https://github.com/blazegraph/database). \n",
101 | "\n",
102 | "Running this database as a server application is very simple. One has just to [download the .jar application](https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_2_1_6_RC/blazegraph.jar), put it in a directory, and [run it](https://github.com/blazegraph/database/wiki/Quick_Start) from a shell as follows:\n",
103 | "\n",
104 | "```\n",
105 | "java -server -Xmx1g -jar blazegraph.jar\n",
106 | "```\n",
107 | "\n",
108 | "You need at least Java 9 installed on your system. If you do not have it, you can easily download and install it from the [Java webpage](https://www.java.com/it/download/manual.jsp). As you can see from the output of the command above, the database will be exposed via HTTP at a specific IP address:\n",
109 | "\n",
110 | "\n",
111 | "<img src=\"blazegraph.png\" />\n",
112 | "However, from your local machine, you can always contact it at the following URL:\n",
113 | "\n",
114 | "```\n",
115 | "http://127.0.0.1:9999/blazegraph/\n",
116 | "```"
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "id": "affiliated-enough",
122 | "metadata": {},
123 | "source": [
124 | "## From a diagram to a graph\n",
125 | "\n",
126 | "As you can see, the UML diagram introduced in the previous lecture, which I recall below, is already organised as a (directed) graph. Thus, translating such a data model into an RDF graph database is kind of straightforward.\n",
127 | "\n",
128 | "\n",
129 | "\n",
130 | "The important thing to decide, in this context, is what names (i.e. URLs) to use for the classes and properties that will represent data compliant with the data model. In particular:\n",
131 | "\n",
132 | "* supposing that each resource will be assigned to at least one of the types defined in the data model, we need to identify the names of all the most concrete classes (e.g. `JournalArticle`, `BookChapter`, `Journal`, `Book`);\n",
133 | "\n",
134 | "* each attribute of each UML class will be represented by a distinct RDF property which will be involved in statements where the subjects are always resources of the class in consideration and the objects are simple literals (i.e. values). Of course, we have to identify the names of these properties (i.e. the URLs);\n",
135 | "\n",
136 | "* each relation starting from an UML class and ending in another UML class will be represented by a distinct RDF property which will be involved in statements where the subjects are always resources of the source class while the objects are resources of the target class. Of course, we have to identify the names of these properties (i.e. the URLs);\n",
137 | "\n",
138 | "* please, bear in mind that all attributes and relations defined in a class are inherited by (i.e. can be used by) all its subclasses.\n",
139 | "\n",
140 | "You can choose to reuse existing classes and properties (e.g. as defined in [schema.org](https://schema.org)) or create your own. In the latter case, you have to remember to use a URL you are in control of (e.g. your website or GitHub repository). For instance, a possible pattern for defining your own name for the class `Book` could be `https://<your-domain>/Book` (e.g. `https://essepuntato.it/Book`). Of course, there are strategies and guidelines that should be used to implement data models appropriately in RDF-compliant languages. However, these are out of the scope of the present course (and will be clarified in other courses).\n",
141 | "\n",
142 | "The names of all the classes and properties I will use in the examples in this tutorial are as follows:\n",
143 | "\n",
144 | "* UML class `JournalArticle`: `https://schema.org/ScholarlyArticle`;\n",
145 | "* UML class `BookChapter`: `https://schema.org/Chapter`;\n",
146 | "* UML class `Journal`: `https://schema.org/Periodical`;\n",
147 | "* UML class `Book`: `https://schema.org/Book`;\n",
148 | "* UML attribute `doi` of class `Publication`: `https://schema.org/identifier`;\n",
149 | "* UML attribute `publicationYear` of class `Publication`: `https://schema.org/datePublished`;\n",
150 | "* UML attribute `title` of class `Publication`: `https://schema.org/name`;\n",
151 | "* UML attribute `issue` of class `JournalArticle`: `https://schema.org/issueNumber`;\n",
152 | "* UML attribute `volume` of class `JournalArticle`: `https://schema.org/volumeNumber`;\n",
153 | "* UML attribute `id` of class `Venue`: `https://schema.org/identifier`;\n",
154 | "* UML attribute `name` of class `Venue`: `https://schema.org/name`;\n",
155 | "* UML relation `publicationVenue` of class `Publication`: `https://schema.org/isPartOf`."
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "id": "personal-values",
161 | "metadata": {},
162 | "source": [
163 | "## Using RDF in Python\n",
164 | "\n",
165 | "The [library `rdflib`](https://rdflib.readthedocs.io/en/stable/) provides classes and methods that allow one to create RDF graphs and populate them with RDF statements. It can be installed using the `pip` command as follows: \n",
166 | "\n",
167 | "```\n",
168 | "pip install rdflib\n",
169 | "```\n",
170 | "\n",
171 | "The [class `Graph`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.graph.Graph) is used to create an (initially empty) RDF graph, as shown as follows:"
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": 13,
177 | "id": "historic-transcript",
178 | "metadata": {},
179 | "outputs": [],
180 | "source": [
181 | "from rdflib import Graph\n",
182 | "\n",
183 | "my_graph = Graph()"
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "id": "environmental-marker",
189 | "metadata": {},
190 | "source": [
191 | "All the resources (including the properties) are defined using the [class `URIRef`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.term.URIRef). The constructor of this class takes as input a string representing the IRI (or URL) of the resource in consideration. For instance, the code below shows all the resources mentioned above, i.e. those referring to classes, attributes and relations:\n",
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 14,
197 | "id": "scientific-norfolk",
198 | "metadata": {},
199 | "outputs": [],
200 | "source": [
201 | "from rdflib import URIRef\n",
202 | "\n",
203 | "# classes of resources\n",
204 | "JournalArticle = URIRef(\"https://schema.org/ScholarlyArticle\")\n",
205 | "BookChapter = URIRef(\"https://schema.org/Chapter\")\n",
206 | "Journal = URIRef(\"https://schema.org/Periodical\")\n",
207 | "Book = URIRef(\"https://schema.org/Book\")\n",
208 | "\n",
209 | "# attributes related to classes\n",
210 | "doi = URIRef(\"https://schema.org/identifier\")\n",
211 | "publicationYear = URIRef(\"https://schema.org/datePublished\")\n",
212 | "title = URIRef(\"https://schema.org/name\")\n",
213 | "issue = URIRef(\"https://schema.org/issueNumber\")\n",
214 | "volume = URIRef(\"https://schema.org/volumeNumber\")\n",
215 | "identifier = URIRef(\"https://schema.org/identifier\")\n",
216 | "name = URIRef(\"https://schema.org/name\")\n",
217 | "\n",
218 | "# relations among classes\n",
219 | "publicationVenue = URIRef(\"https://schema.org/isPartOf\")"
220 | ]
221 | },
222 | {
223 | "cell_type": "markdown",
224 | "id": "competitive-louis",
225 | "metadata": {},
226 | "source": [
227 | "Instead, literals (i.e. values to specify as objects of RDF statements) can be created using the [class `Literal`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.term.Literal). The constructor of this class takes as input a value (of any basic type: it can be a string, an integer, etc.) and creates the related literal object in RDF, as shown in the next excerpt:\n",
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 15,
233 | "id": "functioning-lemon",
234 | "metadata": {},
235 | "outputs": [],
236 | "source": [
237 | "from rdflib import Literal\n",
238 | "\n",
239 | "a_string = Literal(\"a string with this value\")\n",
240 | "a_number = Literal(42)\n",
241 | "a_boolean = Literal(True)"
242 | ]
243 | },
244 | {
245 | "cell_type": "markdown",
246 | "id": "trying-chaos",
247 | "metadata": {},
248 | "source": [
249 | "Using these classes it is possible to create all the Python objects necessary to define statements describing all the data to be pushed into an RDF graph. We need to use the [method `add`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.Graph.add) to add a new RDF statement to a graph. This method takes as input a tuple of three elements defining the subject (a `URIRef`), the predicate (another `URIRef`) and the object (either a `URIRef` or a `Literal`) of the statement.\n",
250 | "\n",
251 | "The following code shows how to populate the RDF graph defined above using the data obtained by processing the two CSV documents presented in previous tutorials, i.e. [that of the publications](../01/01-publications.csv) and [that of the venues](../01/01-venues.csv). For instance, all the venues are created using the following code:\n",
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 16,
257 | "id": "ultimate-circle",
258 | "metadata": {},
259 | "outputs": [],
260 | "source": [
261 | "from pandas import read_csv, Series\n",
262 | "from rdflib import RDF\n",
263 | "\n",
264 | "# This is the string defining the base URL used to define\n",
265 | "# the URLs of all the resources created from the data\n",
266 | "base_url = \"https://comp-data.github.io/res/\"\n",
267 | "\n",
268 | "venues = read_csv(\"../01/01-venues.csv\", \n",
269 | " keep_default_na=False,\n",
270 | " dtype={\n",
271 | " \"id\": \"string\",\n",
272 | " \"name\": \"string\",\n",
273 | " \"type\": \"string\"\n",
274 | " })\n",
275 | "\n",
276 | "venue_internal_id = {}\n",
277 | "for idx, row in venues.iterrows():\n",
278 | "    local_id = \"venue-\" + str(idx)\n",
279 | "    \n",
280 | "    # The shape of the new resources that are venues is\n",
281 | "    # 'https://comp-data.github.io/res/venue-<idx>'\n",
282 | "    subj = URIRef(base_url + local_id)\n",
283 | "    \n",
284 | "    # We put the new venue resources created here, to use them\n",
285 | "    # when creating publications\n",
286 | "    venue_internal_id[row[\"id\"]] = subj\n",
287 | "    \n",
288 | "    if row[\"type\"] == \"journal\":\n",
289 | "        # RDF.type is the URIRef already provided by rdflib of the property \n",
290 | "        # 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'\n",
291 | "        my_graph.add((subj, RDF.type, Journal))\n",
292 | "    else:\n",
293 | "        my_graph.add((subj, RDF.type, Book))\n",
294 | "    \n",
295 | "    my_graph.add((subj, name, Literal(row[\"name\"])))\n",
296 | "    my_graph.add((subj, identifier, Literal(row[\"id\"])))"
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "id": "increasing-leisure",
302 | "metadata": {},
303 | "source": [
304 | "As you can see, all the RDF triples have been added to the graph, which currently contains the following number of distinct triples (coinciding with the number of cells in the original table):\n",
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": 17,
310 | "id": "attractive-genius",
311 | "metadata": {},
312 | "outputs": [
313 | {
314 | "name": "stdout",
315 | "output_type": "stream",
316 | "text": [
317 | "-- Number of triples added to the graph after processing the venues\n",
318 | "12\n"
319 | ]
320 | }
321 | ],
322 | "source": [
323 | "print(\"-- Number of triples added to the graph after processing the venues\")\n",
324 | "print(len(my_graph))"
325 | ]
326 | },
327 | {
328 | "cell_type": "markdown",
329 | "id": "psychological-marketplace",
330 | "metadata": {},
331 | "source": [
332 | "The same approach can be used to add information about the publications, as shown as follows:"
333 | ]
334 | },
335 | {
336 | "cell_type": "code",
337 | "execution_count": 18,
338 | "id": "flexible-affiliate",
339 | "metadata": {},
340 | "outputs": [],
341 | "source": [
342 | "publications = read_csv(\"../01/01-publications.csv\", \n",
343 | " keep_default_na=False,\n",
344 | " dtype={\n",
345 | " \"doi\": \"string\",\n",
346 | " \"title\": \"string\",\n",
347 | " \"publication year\": \"int\",\n",
348 | " \"publication venue\": \"string\",\n",
349 | " \"type\": \"string\",\n",
350 | " \"issue\": \"string\",\n",
351 | " \"volume\": \"string\"\n",
352 | " })\n",
353 | "\n",
354 | "for idx, row in publications.iterrows():\n",
355 | "    local_id = \"publication-\" + str(idx)\n",
356 | "    \n",
357 | "    # The shape of the new resources that are publications is\n",
358 | "    # 'https://comp-data.github.io/res/publication-<idx>'\n",
359 | "    subj = URIRef(base_url + local_id)\n",
360 | "    \n",
361 | "    if row[\"type\"] == \"journal article\":\n",
362 | "        my_graph.add((subj, RDF.type, JournalArticle))\n",
363 | "\n",
364 | "        # These two statements apply only to journal articles\n",
365 | "        my_graph.add((subj, issue, Literal(row[\"issue\"])))\n",
366 | "        my_graph.add((subj, volume, Literal(row[\"volume\"])))\n",
367 | "    else:\n",
368 | "        my_graph.add((subj, RDF.type, BookChapter))\n",
369 | "    \n",
370 | "    my_graph.add((subj, name, Literal(row[\"title\"])))\n",
371 | "    my_graph.add((subj, identifier, Literal(row[\"doi\"])))\n",
372 | "    \n",
373 | "    # The original value here has been cast to string since the Date type\n",
374 | "    # in schema.org ('https://schema.org/Date') is actually a string-like value\n",
375 | "    my_graph.add((subj, publicationYear, Literal(str(row[\"publication year\"]))))\n",
376 | "    \n",
377 | "    # The URL of the related publication venue is taken from the previous\n",
378 | "    # dictionary defined when processing the venues\n",
379 | "    my_graph.add((subj, publicationVenue, venue_internal_id[row[\"publication venue\"]]))"
380 | ]
381 | },
382 | {
383 | "cell_type": "markdown",
384 | "id": "taken-trace",
385 | "metadata": {},
386 | "source": [
387 | "After the addition of these new statements, the total number of RDF triples in the graph equals all the cells in the venue CSV plus all the non-empty cells in the publication CSV:\n",
388 | ]
389 | },
390 | {
391 | "cell_type": "code",
392 | "execution_count": 19,
393 | "id": "identified-transfer",
394 | "metadata": {},
395 | "outputs": [
396 | {
397 | "name": "stdout",
398 | "output_type": "stream",
399 | "text": [
400 | "-- Number of triples added to the graph after processing venues and publications\n",
401 | "31\n"
402 | ]
403 | }
404 | ],
405 | "source": [
406 | "print(\"-- Number of triples added to the graph after processing venues and publications\")\n",
407 | "print(len(my_graph))"
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "id": "operational-sacrifice",
413 | "metadata": {},
414 | "source": [
415 | "It is worth mentioning that we should not map into RDF the cells of the original table that do not contain any value. Thus, if, for instance, an `issue` cell in the publication CSV is empty (i.e. no information about the issue has been specified), you should not create any RDF statement mapping such a non-information.\n",
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "id": "partial-communication",
421 | "metadata": {},
422 | "source": [
423 | "## How to create and populate a graph database with Python\n",
424 | "\n",
425 | "Once we have created our graph with all the triples we need, we can persistently upload the graph to our triplestore. In order to do that, we have to create an instance of the [class `SPARQLUpdateStore`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.plugins.stores.html#rdflib.plugins.stores.sparqlstore.SPARQLUpdateStore), which acts as a proxy to interact with the triplestore. The important thing is to open the connection with the store passing, as input, a tuple containing the same URL twice, i.e. the SPARQL endpoint of the triplestore where the data will be uploaded.\n",
426 | "\n",
427 | "Then, we can upload triple by triple using a for-each iteration over the list of RDF statements obtained by using the [method `triples`](https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.graph.Graph.triples) of the class `Graph`, passing as input a tuple with three `None` values, as shown as follows:"
428 | ]
429 | },
430 | {
431 | "cell_type": "code",
432 | "execution_count": 20,
433 | "id": "saved-prescription",
434 | "metadata": {},
435 | "outputs": [],
436 | "source": [
437 | "from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore\n",
438 | "\n",
439 | "store = SPARQLUpdateStore()\n",
440 | "\n",
441 | "# The URL of the SPARQL endpoint is the same URL of the Blazegraph\n",
442 | "# instance + '/sparql'\n",
443 | "endpoint = 'http://127.0.0.1:9999/blazegraph/sparql'\n",
444 | "\n",
445 | "# It opens the connection with the SPARQL endpoint instance\n",
446 | "store.open((endpoint, endpoint))\n",
447 | "\n",
448 | "for triple in my_graph.triples((None, None, None)):\n",
449 | "    store.add(triple)\n",
450 | "    \n",
451 | "# Once finished, remember to close the connection\n",
452 | "store.close()"
453 | ]
454 | }
455 | ],
456 | "metadata": {
457 | "kernelspec": {
458 | "display_name": "Python 3",
459 | "language": "python",
460 | "name": "python3"
461 | },
462 | "language_info": {
463 | "codemirror_mode": {
464 | "name": "ipython",
465 | "version": 3
466 | },
467 | "file_extension": ".py",
468 | "mimetype": "text/x-python",
469 | "name": "python",
470 | "nbconvert_exporter": "python",
471 | "pygments_lexer": "ipython3",
472 | "version": "3.9.0"
473 | }
474 | },
475 | "nbformat": 4,
476 | "nbformat_minor": 5
477 | }
478 |
--------------------------------------------------------------------------------
/docs/handson/02/02-Implementation_of_data_models_via_Python_classes.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "modified-haven",
6 | "metadata": {},
7 | "source": [
8 | "# Implementation of data models via Python classes\n",
9 | "\n",
10 | "In this tutorial, we see how to create Python classes to implement a model for the representation of data."
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "id": "universal-extraction",
16 | "metadata": {},
17 | "source": [
18 | "## What is a class in Python\n",
19 | "\n",
20 | "In Python, as in other [object-oriented programming languages](https://en.wikipedia.org/wiki/Object-oriented_programming), a class is an extensible template for creating objects having a specific type. All the basic types of values (e.g. strings, integers, booleans) and the other data structures (e.g. lists, sets, dictionaries) are defined by means of particular classes. \n",
21 | "\n",
22 | "In addition, each class makes available a set of [methods](https://en.wikipedia.org/wiki/Method_(computer_programming)) that allow one to interact with the objects (i.e. the instances) of such a class. A method is a particular function that can be run only if directly called via an object. For instance, the instruction `\"this is a string\".split(\" \")` executes the method `split` passing `\" \"` as the input parameter on the particular string object on which the method is called, i.e. the string `\"this is a string\"` (defined by the [class `str`](https://docs.python.org/3/library/stdtypes.html#str) in Python)."
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "id": "shaped-elevation",
28 | "metadata": {},
29 | "source": [
30 | "## Defining a data model using Python classes\n",
31 | "\n",
32 | "[Python classes](https://docs.python.org/3/tutorial/classes.html), as the name may suggest, can be used to implement a particular data model such as that introduced in the following diagram using the [Unified Modelling Language (UML)](https://en.wikipedia.org/wiki/Unified_Modeling_Language). We will use this example to understand how to implement classes in Python, and to show how they work.\n",
33 | "\n",
34 | "<img src=\"uml.png\" />\n",
35 | "\n",
36 | "As you can see from the diagram above, we defined six distinct classes which are, somehow, related to each other. Let us see how to define this structure in Python."
37 | ]
38 | },
39 | {
40 | "cell_type": "markdown",
41 | "id": "blessed-hardwood",
42 | "metadata": {},
43 | "source": [
44 | "### Defining our first class\n",
45 | "\n",
46 | "For defining classes in Python, one has to use the following signature:\n",
47 | "\n",
48 | "```\n",
49 | "class <class name>(<superclass 1>, <superclass 2>, ...):\n",
50 | "    def __init__(self, <param 1>, <param 2>, ...):\n",
51 | "        ...\n",
52 | "```\n",
53 | "\n",
54 | "In the excerpt above, `<class name>` is the name one wants to assign to a class, while `<superclass 1>`, `<superclass 2>`, etc., indicate the superclasses from which this class is derived. In Python, all new classes must be subclasses of the generic class `object`. Instead, the indented `def __init__` is a special method defining the constructor of an object of that class, and it will be called every time one wants to create a new object (instance) of this type. For instance, when we create a new set in Python using `set()`, we are calling the constructor of the [class `set`](https://docs.python.org/3/library/stdtypes.html#set), defined as shown above.\n",
55 | "\n",
56 | "It is worth mentioning that all the methods of a class, including its constructor, must specify `self` as the first parameter. This special parameter represents the instance of the class in consideration. In practice, every time we instantiate a new object of that class, `self` will be assigned to that object and provides access to its attributes (i.e. variables assigned with particular values for that object) and methods as defined in the related class. In particular, it is used to access all object-related information within the class itself.\n",
57 | "\n",
58 | "For instance, by using such a `self` parameter, it is possible to create variables and associated values that are local to a particular object of that class. In the following excerpt, we use it to define the constructor of the class `Venue` in the data model shown above as a UML diagram:"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": 1,
64 | "id": "processed-physiology",
65 | "metadata": {},
66 | "outputs": [],
67 | "source": [
68 | "class Venue(object):\n",
69 | "    def __init__(self, identifiers, name):\n",
70 | "        self.id = set()\n",
71 | "        for identifier in identifiers:\n",
72 | "            self.id.add(identifier)\n",
73 | "        \n",
74 | "        self.name = name"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "id": "affecting-collectible",
80 | "metadata": {},
81 | "source": [
82 | "As shown in the code above, the class `Venue` is defined as subclass of the top class `object`, and its constructor takes in input three parameters: `self` (as explained above), `identifiers` and `name`. \n",
83 | "\n",
84 | "The parameter `identifiers` is used to take as input a collection of strings that contains all the identifiers of such an object. In the above code, I decided to handle all the items included in the collection using a set, to comply with the declaration in the data model class, which requires a collection of one or more string identifiers (`id : string [1..*]`). Indeed, I have created a new variable `id` related to the particular object of the class `self` (i.e. `self.id`) and I have assigned a new set to it. Then, I added all the identifiers in the input collection to the set using the [set method `add`](https://docs.python.org/3/library/stdtypes.html#frozenset.add) (i.e. via the instruction `self.id.add(identifier)`).\n",
85 | "\n",
86 | "Instead, the parameter `name` is used to specify the name of a particular venue. Thus, I have just assigned it to the variable `name` of the object `self` (i.e. `self.name`) to mimic the data model attribute `name : string [1]`. Of course, I could also use a different structure to store this information - for instance, I could again use a set containing only one value. The important thing here, while trying to map the data model into a Python class, is to be compliant with the data model declaration. I chose to assign it directly to a variable, assuming that the input will be a simple string.\n",
87 | "\n",
88 | "In practice, thanks to the `self` keyword, I can create new independent variables for each new object created using this class."
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "id": "authorized-weekend",
94 | "metadata": {},
95 | "source": [
96 | "### Representing relations in Python\n",
97 | "\n",
98 | "The Python class defined above represents (by means of its constructor) all the attributes associated with the related data model class. However, in data models, there are also relations that may exist between different kinds of objects, such as the relation `publicationVenue` between the data model classes `Publication` and `Venue`. In Python, such relations can be represented like the other attributes, i.e. by assigning some specific values to `self`-declared variables, as shown in the following excerpt:"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 2,
104 | "id": "alpha-reading",
105 | "metadata": {},
106 | "outputs": [],
107 | "source": [
108 | "class Publication(object):\n",
109 | " def __init__(self, doi, publicationYear, title, publicationVenue):\n",
110 | " self.doi = doi\n",
111 | " self.publicationYear = publicationYear\n",
112 | " self.title = title\n",
113 | " self.publicationVenue = publicationVenue"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "id": "considerable-lancaster",
119 | "metadata": {},
120 | "source": [
121 | "As shown in the excerpt above, the constructor of the class `Publication` takes as input not only the attributes of the related data model class but also its relations (i.e. the relations for which the class is the starting point), and considers them as additional parameters of the constructor. Then, they are handled like the others. Of course, the object specified in the parameter `publicationVenue` should be of class `Venue`, defined above."
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "id": "raising-salem",
127 | "metadata": {},
128 | "source": [
129 | "### Instantiating a class\n",
130 | "\n",
131 | "Once classes are defined, we can use them to instantiate objects of that kind. To do so, we call their constructor (using the name of the class), passing the parameters it requires **except** `self`, which is handled implicitly. In practice, to create a new object of class `Venue`, we need to specify only two parameters, i.e. those for `identifiers` (i.e. a collection of strings) and `name` (i.e. a string). As an example, let us consider again the first two items of the [venues CSV file](../01/01-venues.csv) we have introduced in the previous tutorial, i.e.:\n",
132 | "\n",
133 | "| id | name | type |\n",
134 | "|---|---|---|\n",
135 | "| 1531-6912 | Comparative and Functional Genomics | journal |\n",
136 | "| 1367-5931 | Current Opinion in Chemical Biology | journal |\n",
137 | "\n",
138 | "These two entities (i.e. venues) can be defined using the Python class `Venue` as follows:"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": 3,
144 | "id": "dated-contamination",
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "venue_1 = Venue([\"1531-6912\"], \"Comparative and Functional Genomics\")\n",
149 | "venue_2 = Venue([\"1367-5931\"], \"Current Opinion in Chemical Biology\")"
150 | ]
151 | },
152 | {
153 | "cell_type": "markdown",
154 | "id": "solid-example",
155 | "metadata": {},
156 | "source": [
157 | "As shown in the above excerpt, I have created two new objects, assigned to two distinct variables, one for each venue. All the values specified as input of the constructor have been assigned to the `self` variables of each object, which are distinct but share the same structure. Indeed, using the Python built-in [function `id`](https://docs.python.org/3/library/functions.html#id) (that takes an object as input and returns the unique integer identifying it) and [function `type`](https://docs.python.org/3/library/functions.html#type) (that takes an object as input and returns its related type), it is possible to see that `venue_1` and `venue_2` are different objects of the same class:"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 4,
163 | "id": "reported-sixth",
164 | "metadata": {},
165 | "outputs": [
166 | {
167 | "name": "stdout",
168 | "output_type": "stream",
169 | "text": [
170 | "The objects in 'venue_1' and 'venue_2' share the same class --> True\n",
171 | "Indeed, the types of the two objects are both <class '__main__.Venue'>\n",
172 | "\n",
173 | "The objects in 'venue_1' and 'venue_2' are the same object --> False\n",
174 | "Indeed, the integers identifying the two objects are 140587130074496 and 140587130073872 respectively\n"
175 | ]
176 | }
177 | ],
178 | "source": [
179 | "print(\"The objects in 'venue_1' and 'venue_2' share the same class -->\", type(venue_1) == type(venue_2))\n",
180 | "print(\"Indeed, the types of the two objects are both\", type(venue_1))\n",
181 | "\n",
182 | "print(\"\\nThe objects in 'venue_1' and 'venue_2' are the same object -->\", id(venue_1) == id(venue_2))\n",
183 | "print(\"Indeed, the integers identifying the two objects are\", id(venue_1), \"and\", id(venue_2), \"respectively\")"
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "id": "cognitive-property",
189 | "metadata": {},
190 | "source": [
191 | "Similarly, we can also create new objects of other classes, such as `Publication`. In this case, the last parameter of the constructor of `Publication` (i.e. `publicationVenue`) should take as input an object of class `Venue`, as defined above. As another example, let us consider again the first two items of the [publications CSV file](../01/01-publications.csv) we have introduced in the previous tutorial, i.e.:\n",
192 | "\n",
193 | "| doi | title | publication year | publication venue | type | issue | volume |\n",
194 | "|---|---|---|---|---|---|---|\n",
195 | "| 10.1002/cfg.304 | Development of Computational Tools for the Inference of Protein Interaction Specificity Rules and Functional Annotation Using Structural Information | 2003 | 1531-6912 | journal article | 4 | 4 |\n",
196 | "| 10.1016/s1367-5931(02)00332-0 | In vitro selection as a powerful tool for the applied evolution of proteins and peptides | 2002 | 1367-5931 | journal article | 3 | 6 |\n",
197 | "\n",
198 | "These two publications can be defined using the Python class `Publication` as follows:"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": 5,
204 | "id": "outer-steal",
205 | "metadata": {},
206 | "outputs": [],
207 | "source": [
208 | "publication_1 = Publication(\"10.1002/cfg.304\", \n",
209 | " 2003, \n",
210 | " \"Development of Computational Tools for the Inference of Protein Interaction Specificity Rules and Functional Annotation Using Structural Information\", \n",
211 | " venue_1)\n",
212 | "\n",
213 | "publication_2 = Publication(\"10.1016/s1367-5931(02)00332-0\", \n",
214 | " 2002, \n",
215 | " \"In vitro selection as a powerful tool for the applied evolution of proteins and peptides\", \n",
216 | " venue_2)"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "id": "hybrid-vertical",
222 | "metadata": {},
223 | "source": [
224 | "It is worth mentioning that, as shown in the excerpt above, we have not specified the identifier of a particular venue as input, but rather we have provided the `Venue` object representing such a venue, as also defined by the relation `publicationVenue` specified in the data model."
225 | ]
226 | },
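  {
   "cell_type": "markdown",
   "id": "venue-navigation-note",
   "metadata": {},
   "source": [
    "Since the relation is stored as a full `Venue` object, we can navigate from a publication to the attributes of its venue using chained dot notation. The following cell is a minimal sketch assuming the objects `publication_1` and `venue_1` created above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "venue-navigation-example",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Follow the 'publicationVenue' relation to reach the venue's own attributes\n",
    "print(publication_1.publicationVenue.name)\n",
    "\n",
    "# The relation holds the very same object that was passed to the constructor\n",
    "print(publication_1.publicationVenue is venue_1)"
   ]
  },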
227 | {
228 | "cell_type": "markdown",
229 | "id": "miniature-welsh",
230 | "metadata": {},
231 | "source": [
232 | "### Creating subclasses of a given class\n",
233 | "\n",
234 | "As you may have noticed, we did not map all the columns of the CSV documents introduced above into the classes we have defined. Indeed, the data model above actually specifies some of this information (for instance, the concept of publication type and the fields `issue` and `volume`) in subclasses of `Publication` and `Venue`. Python makes available a mechanism to create new classes as subclasses of existing ones, thus inheriting all the attributes and methods that the superclasses already implement, similar to what a data model enables. \n",
235 | "\n",
236 | "We can create subclasses using the same syntax adopted for classes, by specifying the classes to extend in the class definition, as we already did by specifying the class `object` as the top class of `Publication` and `Venue`, as shown as follows:"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 6,
242 | "id": "charming-minister",
243 | "metadata": {},
244 | "outputs": [],
245 | "source": [
246 | "class BookChapter(Publication):\n",
247 | " pass\n",
248 | "\n",
249 | "class Journal(Venue):\n",
250 | " pass\n",
251 | "\n",
252 | "class Book(Venue):\n",
253 | " pass"
254 | ]
255 | },
256 | {
257 | "cell_type": "markdown",
258 | "id": "unable-module",
259 | "metadata": {},
260 | "source": [
261 | "In the code above, the body of each class extending the classes `Publication` and `Venue` is left unspecified. This means that the new subclasses inherit (and can access via `self`) all the attributes and methods (including the constructor) from the superclass. Thus, the only thing they really add in this case is the specification of a new characterising type, which mimics the `type` field of the CSV file presented above.\n",
262 | "\n",
263 | "However, adding such new information is enough for classifying them as distinct classes, even if one (e.g. `Journal`) is a subclass of another (e.g. `Venue`). Indeed, in the following code, I create a new instance of the class `Journal` using the same input values of `venue_1`, specified above. As you can see, the classes of these two objects are indeed different:"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 7,
269 | "id": "alert-conducting",
270 | "metadata": {},
271 | "outputs": [
272 | {
273 | "name": "stdout",
274 | "output_type": "stream",
275 | "text": [
276 | "The objects in 'journal_1' and 'venue_1' share the same class --> False\n",
277 | "Indeed, the types of the two objects are <class '__main__.Journal'> and <class '__main__.Venue'> respectively\n"
278 | ]
279 | }
280 | ],
281 | "source": [
282 | "# An object of class 'Journal' is instantiated using the same parameters\n",
283 | "# of the constructor of its parent class 'Venue' since 'Journal' does not\n",
284 | "# define any explicit constructor\n",
285 | "journal_1 = Journal([\"1531-6912\"], \"Comparative and Functional Genomics\")\n",
286 | "\n",
287 | "print(\"The objects in 'journal_1' and 'venue_1' share the same class -->\", type(journal_1) == type(venue_1))\n",
288 | "print(\"Indeed, the types of the two objects are\", type(journal_1), \"and\", type(venue_1), \"respectively\")"
289 | ]
290 | },
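  {
   "cell_type": "markdown",
   "id": "isinstance-note",
   "metadata": {},
   "source": [
    "While `type` returns different classes for the two objects, the subclass relation is still visible at runtime: the built-in [function `isinstance`](https://docs.python.org/3/library/functions.html#isinstance) checks whether an object belongs to a class or to any of its subclasses. A quick sketch using the objects defined above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "isinstance-example",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 'journal_1' is an instance of its own class and, by inheritance, of 'Venue'\n",
    "print(isinstance(journal_1, Journal))\n",
    "print(isinstance(journal_1, Venue))\n",
    "\n",
    "# 'venue_1' is a 'Venue' but not a 'Journal'\n",
    "print(isinstance(venue_1, Journal))"
   ]
  },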
291 | {
292 | "cell_type": "markdown",
293 | "id": "social-attack",
294 | "metadata": {},
295 | "source": [
296 | "Of course, in some cases, the new subclass may take additional information as input compared to its superclass. In these cases, e.g. for mapping in Python the data model class `JournalArticle` that also introduces the attributes `issue` and `volume`, it would be necessary to define an appropriate constructor extending that of the parent superclass. An implementation of the Python class `JournalArticle` is shown as follows:"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 8,
302 | "id": "affiliated-ridge",
303 | "metadata": {},
304 | "outputs": [],
305 | "source": [
306 | "class JournalArticle(Publication):\n",
307 | " def __init__(self, doi, publicationYear, title, publicationVenue, issue, volume):\n",
308 | " self.issue = issue\n",
309 | " self.volume = volume\n",
310 | " \n",
311 | " # Here is where the constructor of the superclass is explicitly recalled, so as\n",
312 | " # to handle the input parameters as done in the superclass\n",
313 | " super().__init__(doi, publicationYear, title, publicationVenue)"
314 | ]
315 | },
316 | {
317 | "cell_type": "markdown",
318 | "id": "wanted-character",
319 | "metadata": {},
320 | "source": [
321 | "In the code above, the additional parameters `issue` and `volume` are handled as before, while all the others are transferred to the constructor of the superclass, accessed by using the [function `super`](https://docs.python.org/3.5/library/functions.html#super) (which returns a proxy object that delegates method calls to the parent class) and then calling the `__init__` constructor with all the expected parameters **except** `self`. In this case, to instantiate an object of class `JournalArticle`, all the input parameters must be specified:"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": 9,
327 | "id": "accurate-supply",
328 | "metadata": {},
329 | "outputs": [],
330 | "source": [
331 | "journal_article_1 = JournalArticle(\"10.1002/cfg.304\", \n",
332 | " 2003, \n",
333 | " \"Development of Computational Tools for the Inference of Protein Interaction Specificity Rules and Functional Annotation Using Structural Information\", \n",
334 | " journal_1, \n",
335 | " \"4\", \n",
336 | " \"4\")"
337 | ]
338 | },
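  {
   "cell_type": "markdown",
   "id": "super-check-note",
   "metadata": {},
   "source": [
    "To double-check that the call to `super().__init__` has actually initialised the inherited attributes, we can read them back on the new object together with the subclass-specific ones. A minimal sketch using `journal_article_1` created above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "super-check-example",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Attributes initialised by the superclass constructor via super().__init__\n",
    "print(journal_article_1.doi)\n",
    "print(journal_article_1.publicationYear)\n",
    "\n",
    "# Attributes added by the subclass constructor itself\n",
    "print(journal_article_1.issue, journal_article_1.volume)"
   ]
  },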
339 | {
340 | "cell_type": "markdown",
341 | "id": "weird-tragedy",
342 | "metadata": {},
343 | "source": [
344 | "## Extending classes with methods\n",
345 | "\n",
346 | "Once an object of a certain class is created, one can access all its attributes (i.e. those assigned to `self` variables) directly by their name using the following syntax: `