├── 1-regexp.md
├── 10-rag.md
├── 2-fts.md
├── 3-levenshtein.md
├── 4-lm.md
├── 5-classification.md
├── 6-ner.md
├── 7-classification-ner-llm.md
├── 8-neural.md
├── 9-qa.md
├── elastic
    ├── Dockerfile
    ├── config
    │   └── elasticsearch.yml
    ├── docker-compose.yaml
    └── query.sh
├── project
    ├── readme.md
    ├── spacex.zip
    └── text_description.txt
└── readme.md

/1-regexp.md:
--------------------------------------------------------------------------------
# Regular expressions (aka regex)

This task focuses on using regular expressions to extract basic information from textual data.
You will become more familiar with the regexp features that are particularly important for natural language processing applications, especially when preparing data for machine learning models.

## Task

The FIQA-PL dataset, containing Polish questions and answers, is available on [Hugging Face](https://huggingface.co/datasets/clarin-knext/fiqa-pl).
In this lab we concentrate only on the `corpus` split of the dataset.

Task objectives (8 points):

1. Devise two regular expressions:
   * extracting times, e.g. recognizing `20:30` as an instance of a time.
   * extracting dates, e.g. recognizing `20 września` as an instance of a date.
2. Search for occurrences of times and dates in the dataset.
3. Plot the results from point 2:
   * for times, create a bar plot for full hours (i.e. 17:35 counts as 17).
   * for dates, create a bar plot for months.
4. Compute the number of occurrences of the word `kwiecień` in any inflectional form. Use a compact form for the query. The only forbidden form is a plain enumeration of the inflected forms, i.e. `"kwiecień|kwietnia|kwietniu..."`.
5. As in 4, but preceded by a number and a space.
6. As in 4, but not preceded by a number and a space. Check whether the results from 5 and 6 sum to the result from 4.
7. Ask an LLM (e.g. [Bielik](https://chat.bielik.ai/)) to complete these tasks for you.
   Compare and criticize the code generated by the LLM.


Answer the following questions (2 points):
1. Are regular expressions good at capturing times?
2. Are regular expressions good at capturing dates?
3. How can one be sure that the expression has matched all and only the correct expressions of a given type?
4. Is an LLM able to generate regular expressions?

## Hints

* Some programming languages allow the use of Unicode classes in regular expressions, e.g.
  * `\p{L}` - letters from any alphabet (e.g. a, ą, ć, ü, カ)
  * `\p{Ll}` - lowercase letters from any alphabet
  * `\p{Lu}` - capital letters from any alphabet
* Not all regular expression engines support Unicode classes, e.g. `re` from Python does not.
  However, you can use the `regex` library (`pip install regex`), which has many more features.
* Regular expressions can include positive and negative lookahead and lookbehind constructions, e.g.
  * *positive lookahead* - `(\w+)(?= has a cat)` will match in the string `Ann has a cat`, but the match itself will be `Ann` only.
  * *negative lookbehind* - `(? "energy"` or `["wind energy", "hydropower", "solar energy"] -> "renewable energy sources"` (the second
example is more difficult to achieve). The algorithm may use any publicly available data or tools, e.g. word embeddings,
language models, ontologies or dictionaries.


## Keyphrase extraction

The extraction algorithm is given a set of documents and should output a set of lists of phrases (one list of phrases per
document) that best describe (summarize) the document content. Those phrases should be present in the document.
Lemmatization of the keyphrases is optional. Additionally, the algorithm should rank those phrases, meaning that a phrase
more important for the document (better describing its content) should be given a higher score. E.g.
in an article about the
Orlen-Lotos merger, the company names would probably score higher than the names of the CEOs, while the names of people
quoted in the article might not be extracted as keywords at all. The algorithm may or may not process one document at a
time. The algorithm may use any publicly available data or tools, e.g. word embeddings, language models, ontologies or
dictionaries.


## Resources

* [sample text corpus](https://github.com/the-ultimate-krol/spacex),
* keyphrases assigned to each document by some algorithm. Please note that those phrases may not be present in the
  document, thus not fulfilling the requirement of being extractive keyphrases,
* a reference algorithm extracting keyphrases from documents in Polish: https://huggingface.co/Voicelab/vlt5-base-keywords

--------------------------------------------------------------------------------
/project/spacex.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/apohllo/nlp/034c26e00321fc45d1b3b8444a3eea5503b27f2d/project/spacex.zip

--------------------------------------------------------------------------------
/project/text_description.txt:
--------------------------------------------------------------------------------
The spacex.zip corpus contains 20 texts about various branches of Elon Musk's activity and his projects.
Each text, saved as .yaml, contains a metadata header:

subject: main topic
title: "Article title written in quotation marks"
author: author or authors
date: year-month-day
publisher: name of the portal
references: media outlets cited by the authors of the texts
source: news agencies cited by the authors of the texts
url: link to the article
id: article identifier
tags:
  - tags assigned by the editorial staff
  -
  ...
keys:
  - manually assigned tags
  -
  ...
content: >- article body

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
# Natural language processing tasks

These tasks accompany the Natural Language Processing classes at the Faculty of Computer Science, AGH University of Science and Technology in Krakow.

1. [Regular expressions](1-regexp.md)
2. [Full text search](2-fts.md)
3. [Levenshtein distance](3-levenshtein.md)
4. [Language modelling](4-lm.md)
5. [Classification](5-classification.md)
6. [Named entity recognition](6-ner.md)
7. [Classification and NER with LLMs](7-classification-ner-llm.md)
8. [Neural Search](8-neural.md)
9. [Question Answering](9-qa.md)
10. [RAG](10-rag.md)

# Staff

* Aleksander Smywiński-Pohl, PhD,
* Magda Król,
* Bartosz Minch, PhD.

--------------------------------------------------------------------------------