\documentclass{article}
\usepackage[utf8]{inputenc}

\usepackage[a4paper]{geometry}

\usepackage[colorlinks=false]{hyperref}
\usepackage{verbatim}

\title{Annif 4 dissertations\\Experimenting with Annif at KB}
\author{Sara Veldhoen}

\begin{document}

\maketitle

\tableofcontents

\clearpage

\section{Annif}

See \url{https://github.com/NatLibFi/Annif/wiki/Getting-started} for information on how to set things up.


\subsection{Projects}

The file {\tt projects.cfg} (in the top directory of the Annif repository) contains the settings per project (experimental setup). Structure:
\begin{verbatim}
[PROJECT_ID]
name=readable name          # may contain whitespace
language=en                 # I'm not sure what this does yet
backend=tfidf               # what backend to use
analyzer=snowball(english)  # I'm not sure what this does yet
limit=20                    # number of topics to produce
vocab=yso-en                # vocabulary name (can be shared across projects)
\end{verbatim}

Every project has a project identifier that is used in calls to Annif. Every trained model is associated with a single project identifier, so use a different project for every experimental setting (even if the project settings are otherwise the same).

\subsection{Commands}

Have Annif list all available commands by entering {\tt annif --help}. The most relevant ones for this project are listed and explained below, followed by a sketch of a typical workflow.

\begin{itemize}
\item {\bf loadvoc}: load a subject vocabulary (thesaurus) for a specific project

Usage: \verb|annif loadvoc [OPTIONS] PROJECT_ID SUBJECTFILE|

The subject vocabulary ({\tt SUBJECTFILE}) must be formatted as described here: \url{https://github.com/NatLibFi/Annif/wiki/Subject-vocabulary-formats}

\item {\bf train}: train a model on training data

Usage: \verb|annif train [OPTIONS] PROJECT_ID [PATHS]|

The corpus ({\tt PATHS}) can be a single TSV file or a (zipped or regular) directory. See the guidelines here:
\url{https://github.com/NatLibFi/Annif/wiki/Document-corpus-formats}. So far, we use the \emph{Extended subject file format}.

\item {\bf analyzedir}: apply the trained model to all documents in a directory (e.g.\ the test data); {\bf analyze} does the same for a single file

Usage: \verb|annif analyzedir [OPTIONS] PROJECT_ID DIRECTORY|

NB: the results are stored in the same directory as the input, in files with the suffix {\tt .annif} (by default). For every document, the top {\tt N} subjects are listed, together with their score/probability.

\item {\bf eval}: evaluate a model against a gold standard. We might want to use our own evaluation (or extend the built-in evaluation tool to our own needs)

Usage: \verb|annif eval [OPTIONS] PROJECT_ID [PATHS]|

\end{itemize}
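A minimal sketch of a typical run, assuming a project {\tt brinkman-tfidf-en} has been defined in {\tt projects.cfg}; the project identifier and the paths are hypothetical and only meant for illustration:
\begin{verbatim}
# load the Brinkman vocabulary into the project
annif loadvoc brinkman-tfidf-en brinkmanthesaurus_vocab.tsv

# train on the training corpus (TSV file or directory)
annif train brinkman-tfidf-en dissertaties/train/

# write suggestions for every test document (as .annif files next to the input)
annif analyzedir brinkman-tfidf-en dissertaties/test/

# compare the suggestions against the gold standard
annif eval brinkman-tfidf-en dissertaties/test/
\end{verbatim}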
\section{Data}


\subsection{Brinkeys}

The Brinkman Thesaurus ({\em Brinkman onderwerpen}) is a controlled vocabulary of subjects; a lot of knowledge about it is spread throughout the KB. See for instance this Plein post: \url{https://plein.kb.nl/thoughts/4290}.
It is a hierarchical thesaurus, albeit a rather shallow one.

The Brinkman Thesaurus is the most stable subject vocabulary in the KB and the current standard for the cataloguing department. We use the Brinkeys assigned by cataloguers as gold labels to train on (see below).

In order to make Annif digest the thesaurus, it was transformed into a TSV file that can be found here: \url{https://github.com/KBNLresearch/Annif-corpora/blob/master/my-corpora/brinkmanthesaurus_vocab.tsv}.


\subsection{Dissertations}

The data stem from two sources: the repositories of the university libraries, and the metadata as stored in the GGC, the catalogue of the KB. Note that linking documents across those two sources was not trivial. There may be a few faulty links, and presumably quite a few documents could not be linked and are therefore not usable for our experiments. Still, a reasonably sized data set remains.

\paragraph{The harvest} from the university repositories yielded both metadata and resources, which form the input to the model that we train. The nature and quality of the metadata and the format of the resources vary a lot between and even within universities, so obtaining useful data is not as easy as it may seem.

The metadata contain information that is valuable for this case, such as the title and abstract. Experiments can show which parts of the (meta)data are most useful for the task at hand. One thing we want to explore is the benefit of using the full text, compared to just the abstract and other metadata. Using a random sample of the full text might well be a useful compromise between training cost and model performance. The table of contents is also known to be a powerful feature for subject classification, but retrieving it may not be easy in all cases.

Features:
\begin{itemize}
\item Title and subtitle
\item Abstract
\item Department
\item Table of contents
\item Full text
\item Sample from the full text
\end{itemize}

For the experiments, the data can be partitioned by filtering on language: how well does the model perform on a single language versus mixed languages? Note that the amount of data per language differs a lot, so some pruning is needed to obtain comparable results. Another thing that we cannot solve but must keep in mind is that topic and language are presumably correlated: humanities dissertations are often written in Dutch, whereas in the life sciences English is the standard.

Language dimension:
\begin{itemize}
\item English only
\item Dutch only
\item Mixed languages (English, Dutch and other)
\end{itemize}

\subsection{Labels}
The model is trained to predict subjects (Brinkeys) for documents. The gold data, used both for training and evaluation, stem from the KB cataloguing department: for a selection of dissertations, human annotators assigned Brinkeys. Note that this selection is far from random; some universities and fields are better represented than others.

\paragraph{The distribution} over Brinkeys is quite skewed and resembles a Zipfian distribution: a few topics are assigned quite frequently, whereas most are assigned rarely (the `long tail'). This is generally a hard problem for machine learning.

One thing to note is that cataloguers are expected to pick as few labels as possible (preferably one to three) and to make them as specific as possible. Very broad subjects are thus not expected to be assigned, even when they do apply to a document.

\paragraph{Agreement}
Also important to keep in mind is that the Brinkeys are assigned by a single cataloguer. Experiments show that inter- and intra-annotator agreement on this kind of task is quite low: humans tend not to agree on which label should be applied. We can therefore not expect a machine to exactly match the somewhat arbitrary decisions of the annotator.


\subsection{Train/test split}
We worked on the predecessor of these experiments in a workshop, where a train/test split was made that we will maintain (unless we decide otherwise). The document identifiers (both the university identifier and the GGC identifier/PPN) for the train and test sets can be found in the corpora repository, \url{https://github.com/KBNLresearch/Annif-corpora/blob/master/my-corpora/dissertaties/}, in \verb|train_identifiers.csv| and \verb|test_identifiers.csv| respectively.

\section{Experiments}

\subsection{Selection of input features}
\subsection{Inter- and intra-language}
\subsection{Different back-ends}


\section{Evaluation}

Evaluation considerations (the main metrics are spelled out below the list):
\begin{itemize}
\item Macro-/micro-averaged metrics
\item Recall@20 (Annif provides {\tt optimize} to choose the optimal limit)
\item Precision@3
\item Compute recall and precision per label: how much variability? Is there a correlation with the frequency of the labels?
\item Further look into NDCG (a metric that takes rank into account)
\end{itemize}
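For reference, a sketch of the ranked-retrieval metrics mentioned above, using standard textbook definitions (not taken from the Annif implementation). Let $G_d$ be the set of gold Brinkeys for document $d$ and $T_k(d)$ the top-$k$ suggested subjects:
\[
\mathrm{Precision}@k = \frac{|G_d \cap T_k(d)|}{k},
\qquad
\mathrm{Recall}@k = \frac{|G_d \cap T_k(d)|}{|G_d|},
\]
\[
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{\mathrm{rel}_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k},
\]
where $\mathrm{rel}_i$ is $1$ if the suggestion at rank $i$ is a gold label and $0$ otherwise, and $\mathrm{IDCG}@k$ is the $\mathrm{DCG}@k$ of an ideal ranking. Macro-averaging computes a metric per label (or per document) and then averages, whereas micro-averaging pools all decisions before computing the metric.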
\section{Results}

\begin{itemize}
\item Correlation between performance per label and the relative frequency of the label (per language?)
\end{itemize}



\section{Open issues/ future work}

\begin{itemize}
\item How to deal with the fact that the documents are written in multiple languages (which is not always indicated correctly in the metadata)? Should we even focus on this issue, or is it only relevant in this domain (dissertations/scientific publications)?
\item How can we evaluate this task?
\begin{itemize}
\item Quantitative evaluation: also measure the (variance in) performance per class, next to the performance per document.

\item Qualitative evaluation: \\
How does a human judge the suggestions?\\
How useful are the suggestions to a cataloguer?\\
Question: to what extent do we catalogue dissertations at the moment? In principle we do not, but some items have been labelled in recent years (esp.\ Groningen). In other words: how suitable is this case for human evaluation?
\end{itemize}
\item What about other domains/data sources? What metadata (ONIX, blurb ({\em flaptekst}), full text) is available to train on in those cases?
\item The focus has been on Brinkeys so far, as we have data (gold labels) for this vocabulary. To what extent could we look into other subject vocabularies, such as Thema? Can we map Brinkeys to Thema (even if only for a subset of Brinkeys) so that we can extrapolate?
\end{itemize}
\end{document}