├── 2 Basic SQL Query.ipynb ├── Exam Cheat Sheet.md ├── Exam Statistics.md ├── Global Air Quality - Ingestion using SQL.ipynb ├── Hand-On SQL Demonstrations.ipynb ├── Hand-Ons MongoDB Lab ├── LICENSE ├── Lectures ├── 1. Databases - An Introduction.pdf ├── 2. Basic SQL Query.pdf ├── 3. Joining Tables.pdf ├── 4. Aggreation and Grouping.pdf ├── 5 Subqueries.pdf ├── 6. Entity-Relationship Diagram.pdf ├── 7. Design Theory.pdf ├── 8. Normalisation.pdf ├── CM3010 MCQ September 2022.ipynb ├── CM3010 March 2022.ipynb ├── CM3010 March 2023.ipynb ├── CM3010 September 2021.ipynb ├── CM3010 September 2021: MCQ Questions 1(e) and 1(f).ipynb ├── CM3010 September 2022.ipynb ├── CM3010 September 2023.ipynb ├── CM3010_September_2021.ipynb ├── Hand-On SQL Demonstration.ipynb ├── MCQ Solution Sheet September 2022.md ├── MCQ Solution Sheets - September 2021.md ├── MySQL Hand-On Lab with Solutions.ipynb ├── MySQL Hand-On Lab.ipynb ├── SQL: Try It Yourself.ipynb ├── Solution Sheets - March 2022.md ├── Solution Sheets - March 2023.md ├── Solution Sheets - September 2021.md ├── Solution Sheets - September 2022(A).md └── Solution Sheets - September 2023.md ├── MongoDB Hand-On Lab - Solutions.ipynb ├── MongoDB Hand-On Lab.ipynb ├── MongoDB: Selection, Projection and Sorting.ipynb ├── MySQL Hand-On Lab - Solutions.ipynb ├── Nutrition Facts - Ingestion using Python.ipynb ├── Nutrition Facts - Ingestion using SQL.ipynb ├── README.md ├── Revision Note: Evaluation Metrics: Precision, Recall and F1-Measure.md ├── Revision Note: JSON and MongoDB.md ├── Revision Note: Linked Data Model and SPARQL.md ├── Revision Note: Relational Databases.md ├── Revision Note: XML and XPATH.md ├── Rotten Tomatoes - Ingestion using SQL.ipynb ├── SPARQL - Sep 2022.ipynb ├── SPARQL Hand-On Lab - Solutions.ipynb ├── Suicide Records - Ingestion using SQL.ipynb ├── XPath - Sep 2022.ipynb ├── XPath Hand-On Lab - Solutions.ipynb ├── hr-schema-mysql.sql ├── mcdonalds-nutrition-facts.csv ├── precision-recal.md └── web-app ├── Solution Sheets - MCQ September 2022A.md ├── Solution Sheets - March 2022.md ├── Solution Sheets - March 2023.md ├── Solution Sheets - March 2024.md ├── Solution Sheets - September 2022 (B).md ├── Solution Sheets - September 2023.md ├── app.js ├── mcq.md ├── public ├── css │ └── styles.css └── js │ └── scripts.js └── views ├── city-rankings.mustache ├── dominant-pollutants.mustache ├── index.mustache ├── national-urban-aq.mustache ├── pollutant-prevalence.mustache └── urban-centers-profile.mustache /Exam Cheat Sheet.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # Exam Cheat Sheet 5 | 6 | --- 7 | 8 | | **Topic** | **Key Concepts** | **Must-Know Examples/Notes** | 9 | |--------------------------------|------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------| 10 | | **Relational Databases** | **E/R Diagrams:** Entities, Attributes, Relationships (1:1, 1:M, M:M) | **Schema Conversion:** Map entities to tables; use foreign keys for 1:1, associative tables for M:M relationships. | 11 | | | **Normalization:** 1NF, 2NF, 3NF, BCNF | **Essentials:** 1NF removes repeating groups; 2NF removes partial dependencies; 3NF removes transitive dependencies; BCNF ensures determinants are candidate keys. | 12 | | | **Strengths/Weaknesses:** | **Strengths:** Data consistency, integrity, powerful querying with SQL.
**Weaknesses:** Rigid schema, complex joins, difficult horizontal scaling. | 13 | | **SQL Joins** | **Types of Joins:** INNER, LEFT, RIGHT, CROSS, FULL OUTER | Key use cases: INNER JOIN for common rows, LEFT JOIN for left table focus, CROSS JOIN for Cartesian product, FULL OUTER JOIN for full table combinations. | 14 | | **XML** | **Structure:** Elements, Attributes, Hierarchies | `<book><title>XML Fundamentals</title><author>Jane Doe</author></book>`.
**XML Schema (XSD):** Defines element structure, data types. | 15 | | | **Strengths/Weaknesses:** | **Strengths:** Hierarchical structure, validation through XSD.
**Weaknesses:** Verbose syntax, complex for large datasets. | 16 | | **XPath** | **Selecting Nodes:** `//book/title`, `@attribute` | **Common Functions:** `count()`, `contains()`. | 17 | | **RDF & Linked Data** | **RDF Triples:** Subject-Predicate-Object | **Example:** `<ex:Alice> <ex:hasName> "Alice"`. | 18 | | | **Linked Data:** URIs to link data across datasets | **Ontologies:** FOAF, Dublin Core.
**Strengths/Weaknesses:**
**Strengths:** Interoperability, data integration, semantic web.
**Weaknesses:** Complexity, performance overhead, steep learning curve. | 19 | | **SPARQL** | **Basic Queries:** SELECT, WHERE | **Example:** `SELECT ?name WHERE { ?person ex:hasName ?name }`. | 20 | | **JSON** | **Structure:** Key-Value Pairs, Nested Objects/Arrays | **Example:** `{"name": "Alice", "skills": ["JavaScript", "MongoDB"]}`. | 21 | | **MongoDB** | **Queries:** Flexible Schema, CRUD Operations | **Example:** `db.users.find({ "age": { "$gt": 25 } })`. | 22 | | **Precision, Recall, F1-Measure** | **Precision:** \( \frac{TP}{TP + FP} \)
**Recall:** \( \frac{TP}{TP + FN} \) | **F1-Measure:** \( 2 \times \frac{Precision \times Recall}{Precision + Recall} \)
**Common Use:** Precision (Spam detection), Recall (Disease screening). | 23 | 24 | --- 25 | 26 | ### **Key Formulas and Concepts:** 27 | 28 | | **Concept** | **Description** | 29 | |----------------------------------|------------------------------------------------------------------------------------------------------------| 30 | | **Normalization** | 1NF: Remove repeating groups.
2NF: Eliminate partial dependencies.
3NF: Remove transitive dependencies.
BCNF: Determinants must be candidate keys. | 31 | | **SQL Joins** | INNER JOIN: Match common rows.
LEFT JOIN: Include all left rows, match right rows.
RIGHT JOIN: Include all right rows, match left rows.
CROSS JOIN: Cartesian product.
FULL OUTER JOIN: All rows from both tables. | 32 | | **XPath** | Selecting Nodes: `//book/title`, `@attribute`.
Functions: `count()`, `contains()`. | 33 | | **RDF Triples** | Structure: Subject-Predicate-Object.
Example: `<ex:Alice> <ex:hasName> "Alice"`. | 34 | | **SPARQL** | Basic Query: `SELECT ?name WHERE { ?person ex:hasName ?name }`. | 35 | | **Precision, Recall, F1-Measure** | Precision: \( \frac{TP}{TP + FP} \)
Recall: \( \frac{TP}{TP + FN} \)
F1-Measure: \( 2 \times \frac{Precision \times Recall}{Precision + Recall} \) | 36 | 37 | --- 38 | 39 | ### **Important Points to Remember:** 40 | 41 | 1. **Normalization** is key to efficient database design, preventing redundancy. 42 | 2. **Joins** in SQL are essential for combining data across multiple tables. 43 | 3. **XPath** is critical for querying XML data structures. 44 | 4. **Linked Data & RDF** enable semantic web data integration. 45 | 5. **JSON/MongoDB** are flexible for managing semi-structured data. 46 | 6. **Precision, Recall, F1-Measure** are key metrics for evaluating models, especially with imbalanced datasets. 47 | 48 | --- 49 | 50 | ### **Commonly Tested Concepts:** 51 | 52 | - **Normalization:** Tested for converting data to different normal forms. 53 | - **SQL Joins:** Commonly appear in complex query formulation. 54 | - **XPath Queries:** Frequently tested for extracting data from XML. 55 | - **RDF & SPARQL:** Tested in data integration and querying linked data. 56 | - **Precision & Recall:** Important for classifier evaluation. 57 | 58 | --- 59 | -------------------------------------------------------------------------------- /Exam Statistics.md: -------------------------------------------------------------------------------- 1 | ### Topic Frequency Analysis (Across All Exams) 2 | This shows how many times each topic appears across all exams: 3 | 4 | | Topic/Concept | Frequency | 5 | |-----------------------------------------|-----------| 6 | | E/R Diagrams and Relational Models | 9 | 7 | | SQL (Joins, Transactions, Indexing) | 13 | 8 | | Normalization (1NF, 2NF, 3NF, BCNF) | 10 | 9 | | JSON and XML Data Handling | 15 | 10 | | RDF and SPARQL Queries | 5 | 11 | | XML Schema Design and Validation | 9 | 12 | | Information Retrieval (Precision/Recall, F1) | 8 | 13 | | Machine Learning Model Evaluation | 7 | 14 | | MongoDB Query Syntax | 5 | 15 | | Data Modeling and Optimization | 10 | 16 | 17 | ### Mark Distribution by Topic (Average Marks per Exam) 18 | This table shows the average marks allocated to each topic: 19 | 20 | | Topic/Concept | Average Marks | 21 | |-----------------------------------------|---------------| 22 | | E/R Diagrams and Relational Models | 30 | 23 | | SQL (Joins, Transactions, Indexing) | 35 | 24 | | Normalization (1NF, 2NF, 3NF, BCNF) | 25 | 25 | | JSON and XML Data Handling | 40 | 26 | | RDF and SPARQL Queries | 15 | 27 | | XML Schema Design and Validation | 30 | 28 | | Information Retrieval (Precision/Recall, F1) | 20 | 29 | | Machine Learning Model Evaluation | 30 | 30 | | MongoDB Query Syntax | 15 | 31 | | Data Modeling and Optimization | 35 | 32 | 33 | ### Insights for Study Strategy 34 | 35 | 1. **High-Priority Topics (High Frequency and High Marks)** 36 | - **JSON and XML Data Handling**: This topic appears most frequently (15 times) and consistently carries a high mark allocation (40 marks on average). This should be a major focus area in your preparation. 37 | - **SQL (Joins, Transactions, Indexing)**: With 13 occurrences and an average mark allocation of 35, this is another critical area that appears both in MCQs and open-ended questions. 38 | - **E/R Diagrams and Relational Models**: This topic is tested 9 times with a solid mark contribution of 30. Understanding E/R diagrams and how to convert them to relational schemas is essential. 39 | 40 | 2. **Secondary-Priority Topics (Moderate Frequency and Moderate Marks)** 41 | - **Normalization**: Appears 10 times with an average mark contribution of 25. 
Although not as heavily weighted as JSON/XML or SQL, normalization is still frequently tested, especially in relational modeling questions. 42 | - **Data Modeling and Optimization**: Appears 10 times with a strong mark allocation (35 marks). Questions often involve designing or optimizing database schemas. 43 | 44 | 3. **Focused Study Topics (Low Frequency but High Marks)** 45 | - **Machine Learning Model Evaluation**: While only appearing 7 times, this topic often carries substantial marks (30 marks). Questions usually involve evaluating models using metrics like precision, recall, and F1-score. 46 | - **XML Schema Design and Validation**: Appears 9 times with a high average mark allocation (30 marks). These questions test your understanding of creating and validating XML schemas. 47 | 48 | 4. **Supplementary Topics (Low Frequency and Low Marks)** 49 | - **RDF and SPARQL Queries**: Appears 5 times with a low average mark allocation (15 marks). This topic is usually tested in specific MCQs rather than open-ended questions. 50 | - **MongoDB Query Syntax**: Appears 5 times with a low average mark allocation (15 marks). Focus on understanding MongoDB query structures and basic operators. 51 | 52 | ### Study Strategy Recommendations 53 | 54 | 1. **Prioritize JSON, XML, and SQL Concepts:** 55 | - These topics dominate the exams both in frequency and marks. Make sure to practice handling nested JSON/XML data, writing complex SQL queries, and understanding indexing and optimization strategies. 56 | 57 | 2. **Master E/R Diagrams and Normalization:** 58 | - Given the consistent presence of these topics, ensure you can easily design E/R diagrams, convert them into normalized relational schemas, and understand trade-offs between normalization and denormalization. 59 | 60 | 3. **Don’t Ignore Machine Learning Metrics:** 61 | - Even though it’s less frequent, machine learning evaluation questions carry substantial marks. Focus on understanding precision, recall, specificity, and how to interpret model performance. 62 | 63 | 4. **Be Prepared for Specialized Topics:** 64 | - XML Schema Design and RDF/SPARQL may appear less often, but they are still worth studying, especially since they usually contribute to MCQs or specific open-ended questions. 65 | 66 | 5. **Allocate Time Based on Mark Distribution:** 67 | - Focus on topics that carry higher average marks (e.g., JSON/XML Handling, SQL) to maximize your score. 68 | 69 | --- 70 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 sreent 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Lectures/1. Databases - An Introduction.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sreent/data-management-intro/a4e6207c3081686f75baa2ef47c833d92e632781/Lectures/1. Databases - An Introduction.pdf -------------------------------------------------------------------------------- /Lectures/2. Basic SQL Query.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sreent/data-management-intro/a4e6207c3081686f75baa2ef47c833d92e632781/Lectures/2. Basic SQL Query.pdf -------------------------------------------------------------------------------- /Lectures/3. Joining Tables.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sreent/data-management-intro/a4e6207c3081686f75baa2ef47c833d92e632781/Lectures/3. Joining Tables.pdf -------------------------------------------------------------------------------- /Lectures/4. Aggreation and Grouping.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sreent/data-management-intro/a4e6207c3081686f75baa2ef47c833d92e632781/Lectures/4. Aggreation and Grouping.pdf -------------------------------------------------------------------------------- /Lectures/5 Subqueries.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sreent/data-management-intro/a4e6207c3081686f75baa2ef47c833d92e632781/Lectures/5 Subqueries.pdf -------------------------------------------------------------------------------- /Lectures/6. Entity-Relationship Diagram.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sreent/data-management-intro/a4e6207c3081686f75baa2ef47c833d92e632781/Lectures/6. Entity-Relationship Diagram.pdf -------------------------------------------------------------------------------- /Lectures/7. Design Theory.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sreent/data-management-intro/a4e6207c3081686f75baa2ef47c833d92e632781/Lectures/7. Design Theory.pdf -------------------------------------------------------------------------------- /Lectures/8. Normalisation.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/sreent/data-management-intro/a4e6207c3081686f75baa2ef47c833d92e632781/Lectures/8. 
Normalisation.pdf -------------------------------------------------------------------------------- /Lectures/CM3010 September 2021: MCQ Questions 1(e) and 1(f).ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "authorship_tag": "ABX9TyOywevSnj1xl2AK3EgFXpRD", 8 | "include_colab_link": true 9 | }, 10 | "kernelspec": { 11 | "name": "python3", 12 | "display_name": "Python 3" 13 | }, 14 | "language_info": { 15 | "name": "python" 16 | } 17 | }, 18 | "cells": [ 19 | { 20 | "cell_type": "markdown", 21 | "metadata": { 22 | "id": "view-in-github", 23 | "colab_type": "text" 24 | }, 25 | "source": [ 26 | "\"Open" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": null, 32 | "metadata": { 33 | "id": "DDC7Ab2oQjHn" 34 | }, 35 | "outputs": [], 36 | "source": [] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "source": [ 41 | "# Exploring the Movie XML Questions\n", 42 | "\n", 43 | "This notebook demonstrates:\n", 44 | "\n", 45 | "- **Why** the sample `movies.xml` is *not well-formed*.\n", 46 | "- **How** to parse the XML and see the error.\n", 47 | "- **Comparing** well-formedness vs. schema validity (with `movies.xsd`)." 48 | ], 49 | "metadata": { 50 | "id": "Hd6FjpzRQkVa" 51 | } 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "source": [ 56 | "\n", 57 | "## 1) The XML Snippet (movies.xml)\n", 58 | "\n", 59 | "```xml\n", 60 | "\n", 61 | " Citizen Kane\n", 62 | " \n", 63 | " Orson Welles\n", 64 | " Joseph Cotton\n", 65 | "\n", 66 | "```\n", 67 | "\n", 68 | "*Observing the code, we see it is missing ``.*\n", 69 | "\n", 70 | "**Question (e):** “Look at the data and associated XML schema fragments below. The XML below is not well‐formed. Why not?”\n", 71 | "\n", 72 | "**Short Answer:** The `` element is not closed. That alone breaks well‐formedness." 
73 | ], 74 | "metadata": { 75 | "id": "QiC96SxYQ19D" 76 | } 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "source": [ 81 | "## Let’s Try Parsing This XML With `lxml`" 82 | ], 83 | "metadata": { 84 | "id": "sQjt6ik8Q78b" 85 | } 86 | }, 87 | { 88 | "cell_type": "code", 89 | "source": [ 90 | "!pip install lxml\n", 91 | "\n", 92 | "from lxml import etree" 93 | ], 94 | "metadata": { 95 | "colab": { 96 | "base_uri": "https://localhost:8080/" 97 | }, 98 | "id": "OpE5OxPoQrvK", 99 | "outputId": "38770182-62e0-417b-b580-2984fd23a3b4" 100 | }, 101 | "execution_count": 1, 102 | "outputs": [ 103 | { 104 | "output_type": "stream", 105 | "name": "stdout", 106 | "text": [ 107 | "Requirement already satisfied: lxml in /usr/local/lib/python3.11/dist-packages (5.3.0)\n" 108 | ] 109 | } 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "source": [ 115 | "# We'll store the snippet in a variable\n", 116 | "xml_snippet = \"\"\"\n", 117 | "\n", 118 | " Citizen Kane\n", 119 | " \n", 120 | " Orson Welles\n", 121 | " Joseph Cotton\n", 122 | "\n", 123 | "\"\"\"\n", 124 | "\n", 125 | "try:\n", 126 | " root = etree.fromstring(xml_snippet)\n", 127 | " print(\"This should never print, because we expect an error about unclosed .\")\n", 128 | "except etree.XMLSyntaxError as e:\n", 129 | " print(\"XMLSyntaxError caught!\")\n", 130 | " print(\"Reason:\", e)" 131 | ], 132 | "metadata": { 133 | "colab": { 134 | "base_uri": "https://localhost:8080/" 135 | }, 136 | "id": "AsC8z88IRCkU", 137 | "outputId": "6e53843b-2401-4e0c-c3de-15218bd3b9cc" 138 | }, 139 | "execution_count": 2, 140 | "outputs": [ 141 | { 142 | "output_type": "stream", 143 | "name": "stdout", 144 | "text": [ 145 | "XMLSyntaxError caught!\n", 146 | "Reason: Opening and ending tag mismatch: cast line 4 and movie, line 7, column 9 (, line 7)\n" 147 | ] 148 | } 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "source": [ 154 | "**Explanation** \n", 155 | "- The parser immediately complains because `` never has a matching `` tag, violating well‐formedness." 156 | ], 157 | "metadata": { 158 | "id": "YaB_--KtRJIs" 159 | } 160 | }, 161 | { 162 | "cell_type": "markdown", 163 | "source": [ 164 | "### 2) Well-Formedness Explanation\n", 165 | "\n", 166 | "An XML document is well-formed if:\n", 167 | "\n", 168 | "1. Every start-tag has a matching end-tag.\n", 169 | "2. Elements properly nest (no overlapping).\n", 170 | "3. Exactly one root element, etc.\n", 171 | "\n", 172 | "**In our snippet**: \n", 173 | "- `` is never properly closed with ``, hence it is not well-formed." 174 | ], 175 | "metadata": { 176 | "id": "wLmt85DPRV0F" 177 | } 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "source": [ 182 | "## 3) The Provided movies.xsd\n", 183 | "\n", 184 | "```xml\n", 185 | "\n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | "\n", 194 | "\n", 195 | "\n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | "\n", 202 | "\n", 203 | "\n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | "\n", 208 | "\n", 209 | "\n", 210 | "\n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | "\n", 215 | "```\n", 216 | "\n", 217 | "**Observations**:\n", 218 | "- The `` must have a `lang` attribute (use=\"required\"), so if it’s missing, that breaks *validity* (but not necessarily well-formedness).\n", 219 | "- The schema also expects a `<releaseYear>` element, so omitting that also breaks *validity*." 
220 | ], 221 | "metadata": { 222 | "id": "cDqgWruPRZ8P" 223 | } 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "source": [ 228 | "Here we show what *would* happen if the XML was well-formed but still might fail **schema** validation." 229 | ], 230 | "metadata": { 231 | "id": "vVSyJhjrRpQ1" 232 | } 233 | }, 234 | { 235 | "cell_type": "code", 236 | "source": [ 237 | "# Let's define a corrected but incomplete XML:\n", 238 | "corrected_xml = \"\"\"\n", 239 | "<movie>\n", 240 | " <title lang=\"en\">Citizen Kane\n", 241 | " \n", 242 | " Orson Welles\n", 243 | " Joseph Cotton\n", 244 | " \n", 245 | "\n", 246 | "\"\"\"\n", 247 | "\n", 248 | "xsd_content = \"\"\"\n", 249 | "\n", 251 | "\n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | "\n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | "\n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | "\n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | "\n", 283 | "\n", 284 | "\"\"\"" 285 | ], 286 | "metadata": { 287 | "id": "R4DutyTRRK_C" 288 | }, 289 | "execution_count": 3, 290 | "outputs": [] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "source": [ 295 | "from lxml import etree\n", 296 | "\n", 297 | "# Parse the corrected XML\n", 298 | "xml_doc = etree.fromstring(corrected_xml)\n", 299 | "\n", 300 | "# Parse the XSD\n", 301 | "xsd_doc = etree.fromstring(xsd_content)\n", 302 | "schema = etree.XMLSchema(xsd_doc)\n", 303 | "\n", 304 | "# Now let's see if it is valid\n", 305 | "if schema.validate(xml_doc):\n", 306 | " print(\"XML is valid according to movies.xsd!\")\n", 307 | "else:\n", 308 | " print(\"XML is NOT valid according to movies.xsd.\")\n", 309 | " for error in schema.error_log:\n", 310 | " print(error.message)" 311 | ], 312 | "metadata": { 313 | "colab": { 314 | "base_uri": "https://localhost:8080/", 315 | "height": 269 316 | }, 317 | "id": "scdFnSKrRtWM", 318 | "outputId": "4eb3c159-9d26-4b08-b0f4-4acfd3c4cdd6" 319 | }, 320 | "execution_count": 4, 321 | "outputs": [ 322 | { 323 | "output_type": "error", 324 | "ename": "XMLSchemaParseError", 325 | "evalue": "Element '{http://www.w3.org/2001/XMLSchema}all', attribute 'maxOccurs': The value 'unbounded' is not valid. 
Expected is '1'., line 7", 326 | "traceback": [ 327 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 328 | "\u001b[0;31mXMLSchemaParseError\u001b[0m Traceback (most recent call last)", 329 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;31m# Parse the XSD\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mxsd_doc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0metree\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfromstring\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mxsd_content\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0mschema\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0metree\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mXMLSchema\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mxsd_doc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;31m# Now let's see if it is valid\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 330 | "\u001b[0;32msrc/lxml/xmlschema.pxi\u001b[0m in \u001b[0;36mlxml.etree.XMLSchema.__init__\u001b[0;34m()\u001b[0m\n", 331 | "\u001b[0;31mXMLSchemaParseError\u001b[0m: Element '{http://www.w3.org/2001/XMLSchema}all', attribute 'maxOccurs': The value 'unbounded' is not valid. Expected is '1'., line 7" 332 | ] 333 | } 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "source": [ 339 | "xsd_content = \"\"\"\n", 340 | "\n", 342 | "\n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | "\n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | "\n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | "\n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | "\n", 374 | "\n", 375 | "\"\"\"" 376 | ], 377 | "metadata": { 378 | "id": "euPFq4lxTGlI" 379 | }, 380 | "execution_count": 5, 381 | "outputs": [] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "source": [ 386 | "# Parse the XSD\n", 387 | "xsd_doc = etree.fromstring(xsd_content)\n", 388 | "schema = etree.XMLSchema(xsd_doc)\n", 389 | "\n", 390 | "# Now let's see if it is valid\n", 391 | "if schema.validate(xml_doc):\n", 392 | " print(\"XML is valid according to movies.xsd!\")\n", 393 | "else:\n", 394 | " print(\"XML is NOT valid according to movies.xsd.\")\n", 395 | " for error in schema.error_log:\n", 396 | " print(error.message)" 397 | ], 398 | "metadata": { 399 | "colab": { 400 | "base_uri": "https://localhost:8080/" 401 | }, 402 | "id": "b9QfzmfBTSto", 403 | "outputId": "6a1d76af-2657-49a3-e841-91d9c8127264" 404 | }, 405 | "execution_count": 6, 406 | "outputs": [ 407 | { 408 | "output_type": "stream", 409 | "name": "stdout", 410 | "text": [ 411 | "XML is NOT valid according to movies.xsd.\n", 412 | "Element 'title': This element is not expected. Expected is ( cast ).\n" 413 | ] 414 | } 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "source": [ 420 | "**Explanation** \n", 421 | "- We made the XML well-formed by closing `` and adding `title lang=\"en\"`.\n", 422 | "- However, we did *not* include ``. The schema demands it. So we expect the validation to fail, complaining about a missing `releaseYear`." 
423 | ], 424 | "metadata": { 425 | "id": "roKlCrXsRxhH" 426 | } 427 | }, 428 | { 429 | "cell_type": "markdown", 430 | "source": [ 431 | "### Summaries\n", 432 | "\n", 433 | "**(e) Why is the original snippet not well-formed?**\n", 434 | "- Because `` is not closed.\n", 435 | "\n", 436 | "**(f) Why is the XML not valid (excluding the well-formedness problem)?**\n", 437 | "- The schema requires a `` element with a `lang` attribute (which was missing originally).\n", 438 | "- The schema also requires a `<releaseYear>` element.\n", 439 | "- Additional minor points like the presence or order of elements if `<xs:all>` or `<xs:sequence>` is used.\n", 440 | "\n", 441 | "Hence, those are the reasons for:\n", 442 | "\n", 443 | "- Not well-formed: unclosed `<cast>` tag.\n", 444 | "- Not valid: missing required fields (releaseYear, title@lang) or any other rule from `movies.xsd`." 445 | ], 446 | "metadata": { 447 | "id": "zpu7FgODR0os" 448 | } 449 | }, 450 | { 451 | "cell_type": "code", 452 | "source": [], 453 | "metadata": { 454 | "id": "t_3AkEUSRzNZ" 455 | }, 456 | "execution_count": null, 457 | "outputs": [] 458 | } 459 | ] 460 | } -------------------------------------------------------------------------------- /Lectures/MCQ Solution Sheet September 2022.md: -------------------------------------------------------------------------------- 1 | 2 | # **Question 1(a)** 3 | 4 | ### **Context** 5 | We have a set of transactional SQL commands transferring money between two accounts, but something is missing to finalize the transaction. 6 | 7 | ### **Question** 8 | **Which command is missing from the following set?** 9 | 10 | ```sql 11 | START TRANSACTION; 12 | UPDATE Account SET Balance = Balance-100 WHERE AccNo=21430885; 13 | UPDATE Account SET Balance = Balance+100 WHERE AccNo=29584776; 14 | SELECT SUM(Balance) FROM Account; 15 | ``` 16 | 17 | ### **Answer: iv. COMMIT** 18 | 19 | ### **Explanation** 20 | - Without a **COMMIT**, the updates remain uncommitted and could be rolled back or lost. 21 | - `ROLLBACK` (choice i) reverses changes, not the intent here. 22 | - `END TRANSACTION` (choice iii) is not standard SQL syntax (some systems allow `END` but typically you use `COMMIT`). 23 | 24 | ### **Real‐World Example** 25 | In a **bank transfer**, you want the debit/credit to commit atomically. A commit ensures both sub-updates finalize together. 26 | 27 | ### **Common Pitfalls** 28 | - Omitting `COMMIT` leaves changes pending. 29 | - Mistaking `END TRANSACTION` or `ROLLBACK` for finalizing. 30 | 31 | ### **Short Answer Summary** 32 | - A transaction in SQL typically ends with either **COMMIT** (make changes permanent) or **ROLLBACK** (undo them). Here, we need `COMMIT`. 33 | 34 | --- 35 | 36 | # **Question 1(b)** 37 | 38 | ### **Context** 39 | A SPARQL query that attempts to retrieve Cristiano Ronaldo’s birth city from DBpedia fails to return a result. 40 | 41 | ### **Question** 42 | **Why doesn’t the query below return the expected city name?** 43 | 44 | ```sparql 45 | SELECT DISTINCT * 46 | WHERE { 47 | "Cristiano Ronaldo"@en dbo:birthPlace 48 | [ 49 | a dbo:City ; 50 | rdfs:label ?cityName 51 | ] . 52 | FILTER ( LANG(?cityName) = 'en' ) 53 | } 54 | ``` 55 | 56 | ### **Answer: ii. "**Cristiano Ronaldo"@en is a string, not a URL.** It can’t be the subject of a triple. 57 | 58 | ### **Explanation** 59 | - In **RDF**/SPARQL, **subjects** must be **URIs**. A literal string `"Cristiano Ronaldo"@en` cannot act as a subject. 
60 | - The correct approach typically references `<http://dbpedia.org/resource/Cristiano_Ronaldo>` or uses a variable that matches that URI. 61 | 62 | ### **Real‐World Example** 63 | Linked Data queries must reference the **URI** (e.g., `<dbpedia:Cristiano_Ronaldo>`) rather than a raw string. 64 | 65 | ### **Common Pitfalls** 66 | - Confusing literal strings with resource URIs. 67 | - Attempting to treat `"Name"@en` as a subject in RDF. 68 | 69 | ### **Short Answer Summary** 70 | - You need a **URI** for Cristiano Ronaldo as the subject; a string literal won’t function in a triple. 71 | 72 | --- 73 | 74 | # **Question 1(c)** 75 | 76 | ### **Context** 77 | An RDF snippet from Tim Berners‐Lee’s v‐card: 78 | 79 | ```turtle 80 | card:I a :Male; 81 | foaf:family_name "Berners-Lee"; 82 | foaf:givenname "Timothy"; 83 | foaf:title "Sir". 84 | ``` 85 | 86 | ### **Question** 87 | **How many predicates does this contain?** 88 | 89 | ### **Answer: i. 4** 90 | 91 | ### **Explanation** 92 | - The predicates are: 93 | 1. `a` (shorthand for `rdf:type`) 94 | 2. `foaf:family_name` 95 | 3. `foaf:givenname` 96 | 4. `foaf:title` 97 | - Choices like 5 or 7 over‐ or under‐count. 98 | 99 | ### **Real‐World Example** 100 | RDF statements always take the form (subject, predicate, object). Counting the distinct predicates is key in interpreting triples. 101 | 102 | ### **Common Pitfalls** 103 | - Counting objects or subjects as predicates. 104 | - Overlooking the `a` (alias for `rdf:type`). 105 | 106 | ### **Short Answer Summary** 107 | - This snippet has **four** distinct predicates. 108 | 109 | --- 110 | 111 | # **Question 1(d)** 112 | 113 | ### **Context** 114 | We have an XML snippet with a `<disk>` of ID `1847336` having multiple `<track>` elements. The query seeks `<track>` child elements with `duration>150` and selects their children. 115 | 116 | ### **Question** 117 | **How many results does the XPath** `//disk[@xml:id="1847336"]/track[@duration>150]/*` **return?** 118 | 119 | ### **Answer: ii. 4** 120 | 121 | ### **Explanation** 122 | - Tracks with `duration>150` are track #1 (duration=193) and track #2 (duration=167). 123 | - Each track has 2 child elements `<title>` and `<artist>`. So total = **4**. 124 | 125 | ### **Real‐World Example** 126 | In media catalogs, you might filter tracks by length (in seconds) and retrieve subelements like title/artist. 127 | 128 | ### **Common Pitfalls** 129 | - Overlooking child elements or including siblings erroneously. 130 | - Confusing attribute or element conditions. 131 | 132 | ### **Short Answer Summary** 133 | - The two qualifying tracks each have two children, hence 4 matching nodes. 134 | 135 | --- 136 | 137 | # **Question 1(e):** 138 | 139 | **Which parameter setting for the tool is likely to be best (in the sense that I spend the least time on the task)?** 140 | 141 | #### Problem Context: 142 | - Total Archive Size: 50,000 items. 143 | - Relevant Items: 30 items. 144 | - Manual Time to Find Each Relevant Item (if missed): 15 minutes. 145 | - Time Wasted on Each False Positive: 0.5 minutes (30 seconds). 146 | 147 | **Answer: ii. 148 | 149 | --- 150 | 151 | # **Question 1(f)** 152 | 153 | ### **Context** 154 | A table with columns (Chart, Date, Position, Title, Artist, Date of Birth). We suspect it’s only in 1NF. 155 | 156 | ### **Question** 157 | **Which normal forms does the table satisfy?** 158 | 159 | ### **Answer: iv. 1NF** 160 | 161 | ### **Explanation** 162 | - The data is atomic in each column (no repeating groups), so 1NF is met. 
163 | - 2NF/3NF can’t be guaranteed without analyzing a key and dependencies. The partial or transitive dependencies might exist. 164 | 165 | ### **Real‐World Example** 166 | Music chart info, each row is a single chart entry. At least 1NF because no multivalue columns, but no guarantee of higher forms. 167 | 168 | ### **Common Pitfalls** 169 | - Assuming it’s in 2NF or 3NF without actual functional dependency checks. 170 | 171 | ### **Short Answer Summary** 172 | - It’s certainly in **1NF**, but we can’t confirm higher normal forms from the snippet. 173 | 174 | --- 175 | 176 | # **Question 1(g)** 177 | 178 | ### **Context** 179 | An E/R diagram for a plant identification DB is poorly drawn, using weird cardinalities, arrow notations, etc. 180 | 181 | ### **Question** 182 | **Why is the diagram not good? (Select all that apply)** 183 | 184 | ### **Answer: i, ii, iii, iv, vi, viii** 185 | 186 | ### **Explanation** 187 | 1. **(i)** By convention, cardinalities appear between entities, not attributes. 188 | 2. **(ii)** Entities connect with explicit relationships, not arbitrary lines. 189 | 3. **(iii)** The arrow is meaningless in standard E/R notation. 190 | 4. **(iv)** Spaces in attribute names is discouraged. 191 | 5. **(v)** “An attribute can’t be shared” is not always true. (Hence not correct) 192 | 6. **(vi)** Cardinalities “21” aren’t allowed. Should use “1”, “n”, “m”, etc. 193 | 7. **(vii)** Ternary relationship? Not necessarily. So not correct. 194 | 8. **(viii)** ß and x cardinalities are odd; best to use “n” or “m”. 195 | 196 | ### **Real‐World Example** 197 | Proper E/R diagrams clearly indicate relationships, cardinalities (like 1..n), and attribute names without spaces. 198 | 199 | ### **Common Pitfalls** 200 | - Combining notation from different modeling styles incorrectly. 201 | - Using ambiguous cardinalities. 202 | 203 | ### **Short Answer Summary** 204 | - The diagram uses invalid cardinalities, missing relationship lines, arrow notation incorrectly, etc. 205 | 206 | --- 207 | 208 | # **Question 1(h)** 209 | 210 | ### **Context** 211 | A database query to find all staff who’ve interacted with a client named “Shug Avery.” We see partial SQL statements and must choose which ones might be correct. 212 | 213 | ### **Question** 214 | **Which queries likely produce the correct result?** 215 | 216 | ### **Answer: iii and iv** 217 | 218 | ### **Explanation** 219 | - (iii) and (iv) each properly join **Meeting** to **Client** and **Employee**, with correct `WHERE` conditions on the client’s name. 220 | - (i)/(ii)/(v) either fail to join properly or use non‐matching conditions. 221 | 222 | ### **Real‐World Example** 223 | In CRMs, you frequently link employees to meetings to clients for cross references. 224 | 225 | ### **Common Pitfalls** 226 | - Wrong type of join, or omitting a join condition. 227 | 228 | ### **Short Answer Summary** 229 | - The correct approach typically does a multi‐table join: `Client -> Meeting -> Employee`. 230 | 231 | --- 232 | 233 | # **Question 1(i)** 234 | 235 | ### **Context** 236 | MongoDB queries for actors born before 1957. Which is correct? 237 | 238 | ### **Question** 239 | **Which code snippet properly finds actors with `dateOfBirth < 1957-01-01`?** 240 | 241 | ### **Answer: vii** 242 | (They each do `db.actors.find(...)` with `"$lt": ISODate("1957-01-01")`.) 243 | 244 | ### **Explanation** 245 | - (i) `db.actors.findOne({"dateOfBirth": {$lt: ISODate("1957-01-01")}})` is not correct since it returns only one record. 
246 | - (vii) `db.actors.find({"dateOfBirth": {$lt: ISODate("1957-01-01")}})` also correct. 247 | - Others either misuse operators or compare with just an integer. 248 | 249 | ### **Real‐World Example** 250 | In a film/TV database, queries by birth date must use **Mongo’s `$lt`** operator with a proper `ISODate`. 251 | 252 | ### **Common Pitfalls** 253 | - Using `"<"` or a raw integer for date queries in MongoDB. 254 | 255 | ### **Short Answer Summary** 256 | - Use `$lt` with `ISODate(...)` for date comparisons in MongoDB. 257 | 258 | --- 259 | 260 | # **Question 1(j)** 261 | 262 | ### **Context** 263 | Recipe Markup Language (RecipeML) snippet: 264 | ```dtd 265 | <!ELEMENT recipe (head, description*, equipment?, ingredients, directions, nutrition?, diet‐exchanges?)> 266 | <!ATTLIST recipe 267 | %common.att; 268 | %measurement.att; 269 | > 270 | ``` 271 | We want to see which statements are true about `<recipe>` children. 272 | 273 | ### **Question** 274 | **Which statements about `<recipe>` and `<ingredients>` are true?** 275 | 276 | ### **Answer: i, ii* 277 | 278 | ### **Explanation** 279 | - (i) `<recipe>` must have exactly one `<ingredients>` child: True according to `ingredients` (no plus sign). 280 | - (ii) `<ingredients>` must come before `<directions>`: True by the order in the DTD. 281 | - The other options about multiple `<ingredients>` or ignoring order are incorrect. 282 | 283 | ### **Real‐World Example** 284 | DTDs strictly define order and occurrence of child elements, important for standardizing recipe data. 285 | 286 | ### **Common Pitfalls** 287 | - Misreading the occurrence indicators (`?`, `*`, `+`, etc.) in DTDs. 288 | - Overlooking the explicit child order. 289 | 290 | ### **Short Answer Summary** 291 | - Exactly one `<ingredients>` child is required, and it appears in a specific sequence before `<directions>`. 292 | 293 | -------------------------------------------------------------------------------- /Lectures/Solution Sheets - March 2023.md: -------------------------------------------------------------------------------- 1 | 2 | # **Question 2: Analyzing OpenDocument Format (ODF) and RelaxNG Schema** 3 | 4 | ## (a) What language is this encoded in? [1] 5 | 6 | **Answer:** 7 | It is encoded in **XML**. 8 | 9 | **Key Explanation:** 10 | - ODF files (e.g., `.odt` documents) are ZIP containers that include XML files plus any images, styles, metadata, etc. 11 | - The snippet provided shows tags like `<office:text>` and `<text:p>`, which are clearly XML elements. 12 | 13 | --- 14 | 15 | ## (b) What data structure does it use? [1] 16 | 17 | **Answer:** 18 | It uses a **tree** (hierarchical) structure. 19 | 20 | **Key Explanation:** 21 | - XML is inherently a tree: a single root element with nested children. 22 | - Elements appear inside one another, which naturally forms a hierarchy. 23 | 24 | --- 25 | 26 | ## (c) List the two namespaces that this document uses. [2] 27 | 28 | **Answer:** 29 | 1. `urn:oasis:names:tc:opendocument:xmlns:office:1.0` 30 | 2. `urn:oasis:names:tc:opendocument:xmlns:text:1.0` 31 | 32 | **Key Explanation:** 33 | - Namespaces help differentiate element/attribute names. 34 | - In the snippet, `<office:text>` and `<text:p>` map to these URIs. 35 | 36 | --- 37 | 38 | ## (d) What would the XPath expression `//text:list-item/text:p` return? Would it be different from `//text:list//text:p`? [4] 39 | 40 | **Answer (Short Form):** 41 | - `//text:list-item/text:p` → selects `<text:p>` elements that are **direct children** of `<text:list-item>`. 
42 | - `//text:list//text:p` → selects **all** `<text:p>` elements that are descendants of `<text:list>` (not necessarily direct children). 43 | 44 | In **this** example, both expressions return the same three items (`Trees, Graphs, Relations`) because each `<text:p>` is already a direct child of `<text:list-item>`. In a more complex or nested structure, these expressions could yield different results. 45 | 46 | --- 47 | 48 | ## (e) How does this code help us assess if the document above is **well‐formed**? [2] 49 | 50 | **Answer:** 51 | It **does not** directly assess well‐formedness. A RelaxNG schema only checks structure and allowed elements/attributes **after** the document is confirmed well‐formed by an XML parser. 52 | 53 | - **Well‐formedness** rules: correct tag nesting, matching start/end tags, a single root, properly quoted attributes, etc. 54 | - The schema itself cannot override or fix basic syntax errors. 55 | 56 | --- 57 | 58 | ## (f) How does this code help us assess if the document above is **valid**? [2] 59 | 60 | **Answer:** 61 | It checks if the document follows the structural rules defined in the RelaxNG schema—e.g., the correct elements, their sequences, attributes, etc. If the document meets these requirements, it is **valid**; otherwise, it is invalid. 62 | 63 | --- 64 | 65 | ## (g) Which part or parts of the document is this relevant to? [2] 66 | 67 | **Answer:** 68 | It is specifically relevant to `<text:list>` elements and their child elements (`<text:list-header>`, `<text:list-item>`). The provided RelaxNG snippet defines how these list structures must be formed. 69 | 70 | --- 71 | 72 | ## (h) Give an example of an element that would not be valid given this schema code (assume `text-list-attr` only defines attributes). [3] 73 | 74 | **Answer (Example):** 75 | ```xml 76 | <text:list> 77 | <text:list-item>Item Content</text:list-item> 78 | <text:invalid-element>Invalid Content</text:invalid-element> 79 | </text:list> 80 | ``` 81 | `<text:invalid-element>` is **not** part of the schema, so the file fails validation. 82 | 83 | --- 84 | 85 | ## (i) Assess the suitability of this data structure for encoding word processing documents. What advantages or disadvantages would a relational model bring? [13] 86 | 87 | **Answer (Outline):** 88 | 89 | 1. **Using XML / Tree Structures for Word Processing** 90 | - **Advantages**: 91 | - **Natural Hierarchy**: Documents often have nested structures (sections, paragraphs, runs), which XML easily represents. 92 | - **Standards**: Formats like ODF and OOXML are XML-based, well supported by many tools. 93 | - **Flexibility**: Easy to embed metadata or styles within the document structure. 94 | - **Disadvantages**: 95 | - **Verbosity**: XML can be large and repetitive. 96 | - **Complex Queries**: Although XPath/XQuery are powerful, they may be less straightforward for certain tabular queries or large-scale data analysis. 97 | 98 | 2. **Relational Model** 99 | - **Advantages**: 100 | - **Strong Data Integrity**: Primary/foreign keys, constraints, transactions. 101 | - **Efficient SQL**: Well-suited for structured queries, aggregations, and numeric data tasks. 102 | - **Disadvantages**: 103 | - **Poor Fit for Deeply Nested Data**: Many join tables might be needed for complex markup. 104 | - **Rigid Schema**: Word processing documents have variable structures, which can be cumbersome to store relationally. 105 | 106 | 3. **Conclusion**: 107 | - XML is **well-suited** to hierarchical, text-heavy documents. 
108 | - A relational DB is **better** for strongly structured, tabular data and extensive analytical queries. 109 | 110 | --- 111 | 112 | # **Question 3: MusicBrainz / Linked Data** 113 | 114 | *(Based on RDF/Turtle data describing a music group, e.g. BTS, with foundingDate, schema:member, etc.)* 115 | 116 | ## (a) What (approximately) was the type that we put into the `Accept` header? [1] 117 | 118 | **Answer:** 119 | `text/turtle` (or `application/turtle`). 120 | 121 | --- 122 | 123 | ## (b) What is the full URL of the predicate `schema:member` in this context? [1] 124 | 125 | **Answer:** 126 | ``` 127 | http://schema.org/member 128 | ``` 129 | 130 | --- 131 | 132 | ## (c) How many band members of BTS are listed in this snippet? [1] 133 | 134 | **Answer:** 135 | Likely **2** members (based on the provided blank nodes referencing separate individuals). 136 | 137 | --- 138 | 139 | ## (d) Comment on the way the `schema:member` predicate is used in this snippet. [3] 140 | 141 | **Answer (Short Explanation):** 142 | - A **role-based** approach: the band node has `schema:member` → blank node of type `schema:OrganizationRole`. That blank node itself has `schema:member` → the person’s URI. 143 | - This structure allows adding membership attributes like `schema:startDate` to the role object. 144 | 145 | --- 146 | 147 | ## (e) What type(s) are associated with the entity having `schema:name` of “JIN”? [1] 148 | 149 | **Answer:** 150 | He is typed as **`schema:MusicGroup`** **and** **`schema:Person`** (due to how MusicBrainz RDF is auto-generated). 151 | 152 | --- 153 | 154 | ## (f) Consider this SPARQL query: 155 | ```sparql 156 | SELECT ?a ?b WHERE { 157 | mba:9fe8e-ba27-4859-bb8c-2f255f346853 schema:member ?c . 158 | ?c schema:startDate ?b ; 159 | schema:member ?d . 160 | ?d schema:name ?a . 161 | } 162 | ``` 163 | What prefixes need to be defined for this to work (give the full declarations)? [1] 164 | 165 | **Answer:** 166 | At a minimum: 167 | ```sparql 168 | PREFIX mba: <http://musicbrainz.org/artist/> 169 | PREFIX schema: <http://schema.org/> 170 | ``` 171 | (And possibly `rdf:` if using `rdf:type`.) 172 | 173 | --- 174 | 175 | ## (g) What would the query return? [2] 176 | 177 | **Answer:** 178 | It returns pairs of `(?a, ?b)` where `?a` = **member name** and `?b` = **startDate** from the membership role. Essentially, each band member’s name plus when they joined. 179 | 180 | --- 181 | 182 | ## (h) This data represents an export from a relational database. Construct an ER diagram that could accommodate the instance data above. [6] 183 | 184 | Below is a **Mermaid diagram-as-code** showing a possible **Artist**–**Membership** schema **using a composite PK** in the `Membership` table: 185 | 186 | ```mermaid 187 | erDiagram 188 | Artist { 189 | int ArtistID PK 190 | string Name 191 | string Type 192 | %% 'Person' or 'MusicGroup' 193 | date FoundingDate 194 | %% relevant if Type='MusicGroup' 195 | } 196 | 197 | Membership { 198 | int BandID PK, FK 199 | int MemberID PK, FK 200 | date StartDate 201 | string RoleName 202 | } 203 | 204 | %% Relationship lines: 205 | %% - "is_in" and "belongs_to" reflect that (BandID) and (MemberID) 206 | %% both reference rows in Artist (Type = 'MusicGroup' vs 'Person'). 207 | 208 | Artist ||--o{ Membership : is_in 209 | Artist ||--o{ Membership : belongs_to 210 | ``` 211 | 212 | - **Artist**: holds both bands and individuals (differentiated by `Type`). 213 | - **Membership**: a **composite** primary key of $$(\text{BandID}, \text{MemberID})$$. 
214 | - `BandID` references an **Artist** row with `Type='MusicGroup'`. 215 | - `MemberID` references an **Artist** row with `Type='Person'`. 216 | - Additional fields like `StartDate` and `RoleName` capture membership details (when someone joined, their role, etc.). 217 | 218 | --- 219 | 220 | ## (i) Give the CREATE TABLE commands for two tables based on your ER diagram. [4] 221 | 222 | **Answer (using composite PK):** 223 | 224 | ```sql 225 | CREATE TABLE Artist ( 226 | ArtistID INT PRIMARY KEY, 227 | Name VARCHAR(100) NOT NULL, 228 | Type VARCHAR(20) NOT NULL, -- 'Person' or 'MusicGroup' 229 | FoundingDate DATE 230 | ); 231 | 232 | CREATE TABLE Membership ( 233 | BandID INT NOT NULL, 234 | MemberID INT NOT NULL, 235 | StartDate DATE, 236 | RoleName VARCHAR(100), 237 | PRIMARY KEY (BandID, MemberID), 238 | FOREIGN KEY (BandID) REFERENCES Artist(ArtistID), 239 | FOREIGN KEY (MemberID) REFERENCES Artist(ArtistID) 240 | ); 241 | ``` 242 | 243 | ### Key Points 244 | 245 | - **PRIMARY KEY (BandID, MemberID)** enforces that there can be only **one** `Membership` record per `(band, person)` pair. 246 | - If you might need multiple rows for the same `(BandID, MemberID)` (e.g., someone rejoins the band later), consider adding another field (like an auto-increment ID or `StartDate` in the PK). 247 | 248 | --- 249 | 250 | ## (j) Suggest a MySQL query to check whether any band member in the database is recorded as joining before the founding date of their band. [5] 251 | 252 | **Answer (unchanged):** 253 | 254 | ```sql 255 | SELECT aMember.Name AS MemberName, 256 | aBand.Name AS BandName, 257 | m.StartDate, 258 | aBand.FoundingDate 259 | FROM Membership m 260 | JOIN Artist aBand ON m.BandID = aBand.ArtistID 261 | JOIN Artist aMember ON m.MemberID = aMember.ArtistID 262 | WHERE m.StartDate < aBand.FoundingDate; 263 | ``` 264 | 265 | --- 266 | 267 | ## (k) MusicBrainz makes their data available as both a downloadable database dump and as Linked Data. What are the benefits and disadvantages of each approach? [5] 268 | 269 | **Answer (Summary):** 270 | 271 | - **Database Dump** 272 | - **Pros**: Complete offline snapshot for large queries/analytics; independent of network. 273 | - **Cons**: Can become **outdated** quickly; large storage overhead. 274 | 275 | - **Linked Data** 276 | - **Pros**: Always **up-to-date** data; easy to interlink with other semantic sources. 277 | - **Cons**: Dependent on network; can be slower or unavailable if the endpoint is down. 278 | 279 | --- 280 | 281 | # **Question 4: Enhancing an ER Model for 16th-Century European Music Records** 282 | 283 | ## (a) This model doesn't allow storing the order or coordinates for lines of music on a page. How could this be fixed? [3] 284 | 285 | **Answer:** 286 | Add attributes to **Line** such as: 287 | - `LineOrder` (integer) to track the visual or logical order. 288 | - `XCoordinate`, `YCoordinate` (floats/integers) to store layout positions if needed for precise rendering. 289 | 290 | --- 291 | 292 | ## (b) Some books are published in tablebook format, with multiple parts/voices to a piece and page regions with lines in different directions. How to add these aspects? [8] 293 | 294 | **Answer (Outline):** 295 | 296 | 1. **InstrumentOrVoicePart** entity (e.g., “Soprano,” “Alto,” or “Violin Part”). 297 | 2. **Region** entity to define different areas on a page, potentially oriented differently or for different staves. 298 | 3. **Line** references both **Part** and **Region**. 
299 | 300 | Hence, a line belongs to: 301 | - A **Piece** (the composition), 302 | - A **Page** (the physical page it’s on), 303 | - A **Region** (sub-area of that page), 304 | - A **Part** (e.g., soprano or instrumental line). 305 | 306 | --- 307 | 308 | ## (c) List the tables, primary keys, and foreign keys for a relational implementation of your modified model. [7] 309 | 310 | **Answer (Schema Example):** 311 | 312 | 1. **Piece** 313 | - PK: `PieceID` 314 | - Attributes: `Title`, etc. 315 | 316 | 2. **Page** 317 | - PK: `PageID` 318 | - Attributes: `BookID` (FK to `Book(BookID)`), etc. 319 | 320 | 3. **Region** 321 | - PK: `RegionID` 322 | - FK: `PageID → Page(PageID)` 323 | - Attributes: `Description`, orientation/coordinates if needed. 324 | 325 | 4. **InstrumentOrVoicePart** 326 | - PK: `PartID` 327 | - Attributes: `PartName` 328 | 329 | 5. **Line** 330 | - PK: `LineID` 331 | - FKs: `PieceID`, `PageID`, `RegionID`, `PartID` 332 | - Attributes: `LineOrder`, `XCoordinate`, `YCoordinate` 333 | 334 | --- 335 | 336 | ## (d) Give a query to list pieces with the total number of lines of music that they occupy. [5] 337 | 338 | **Answer:** 339 | ```sql 340 | SELECT p.Title, COUNT(*) AS TotalLines 341 | FROM Piece p 342 | JOIN Line l ON p.PieceID = l.PieceID 343 | GROUP BY p.Title; 344 | ``` 345 | - This aggregates how many **Line** rows each piece has. 346 | 347 | --- 348 | 349 | ## (e) Assess the suitability of this data structure for a relational model, and compare it with ONE other database model (XML-based, document-based, or Linked Data graph). [7] 350 | 351 | **Answer (Overview):** 352 | 353 | 1. **Relational Model** 354 | - **Pros**: 355 | - Great for structured queries (counts, filtering). 356 | - Clear constraints and relationships (FK, PK). 357 | - **Cons**: 358 | - Complex hierarchical or layout data might require multiple bridging tables, making it less intuitive. 359 | - Harder to represent flexible or nested markup. 360 | 361 | 2. **XML/Tree Database (Example)** 362 | - **Pros**: 363 | - Naturally handles nested data (like lines, staves, sub-regions). 364 | - Supports metadata and variable structure easily. 365 | - **Cons**: 366 | - More difficult to do set-based queries or large aggregations. 367 | - File size overhead and performance issues for certain large queries. 368 | 369 | 3. **Conclusion**: 370 | - **Relational** is ideal if you often do structured queries (like counting lines, grouping data) with well-defined relationships. 371 | - **XML** (or a document/graph DB) is often better for deeply nested or unstructured layout data that doesn’t map neatly to rows/columns. 372 | 373 | --- 374 | 375 | An ER Diagram that includes `Piece`, `Page`, `Region`, `InstrumentOrVoicePart`, and `Line` might look like this: 376 | 377 | ```mermaid 378 | erDiagram 379 | Piece { 380 | string PieceID PK 381 | string Title 382 | %% Additional piece attributes... 
383 | } 384 | 385 | Page { 386 | string PageID PK 387 | string BookID FK 388 | } 389 | 390 | Region { 391 | string RegionID PK 392 | string PageID FK 393 | string Description 394 | } 395 | 396 | InstrumentOrVoicePart { 397 | string PartID PK 398 | string PartName 399 | } 400 | 401 | Line { 402 | string LineID PK 403 | string PieceID FK 404 | string PageID FK 405 | string RegionID FK 406 | string PartID FK 407 | int LineOrder 408 | float XCoordinate 409 | float YCoordinate 410 | } 411 | 412 | Piece ||--o{ Line : "has" 413 | Page ||--o{ Line : "contains" 414 | Page ||--o{ Region : "has" 415 | Region ||--o{ Line : "includes" 416 | InstrumentOrVoicePart ||--o{ Line : "is_for" 417 | ``` 418 | 419 | - **Piece**: the musical piece. 420 | - **Page**: physical pages from the book. 421 | - **Region**: subdivided areas on the page (especially in tablebook format). 422 | - **InstrumentOrVoicePart**: each voice or instrumental line part. 423 | - **Line**: references which piece, which page, which region, which part, plus ordering and coordinates. 424 | 425 | -------------------------------------------------------------------------------- /Lectures/Solution Sheets - September 2022(A).md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # **Question 2: Database Design and Querying** 5 | 6 | --- 7 | 8 | ### **(a) Which aggregate function is used here?** 9 | **Answer:** 10 | The aggregate function used is **`AVG()`**. 11 | 12 | #### **Detailed Explanation** 13 | - **Role of `AVG()`**: `AVG()` calculates the mean of a set of values (e.g., test scores). 14 | - **Common Mistakes**: Mixing up `AVG()` with `SUM()` or forgetting a `GROUP BY` when also selecting non‐aggregated columns. 15 | 16 | #### **Key Takeaways** 17 | - `AVG()` is a fundamental SQL aggregate. 18 | - In queries that combine `AVG()` with non‐aggregated columns, use `GROUP BY` appropriately. 19 | 20 | --- 21 | 22 | ### **(b) There is a problem with the database design that risks making the aggregation incorrect. What is it, and how could it be resolved?** 23 | **Answer:** 24 | The main issue is storing or using attributes that can become outdated or inconsistent—such as **Age**—rather than deriving them from a stable attribute (e.g., `BirthDate`). Additionally, free‐text fields like **City** or **School** may lead to inconsistent groupings. 25 | 26 | #### **Solution Approach** 27 | 1. **Avoid Storing Dynamic Attributes** 28 | - Compute `Age` at query time (e.g., using `TIMESTAMPDIFF`) to prevent stale data. 29 | 2. **Normalize the Database** 30 | - Instead of storing `City` or `School` as free‐text, reference them by IDs in lookup tables, enforcing consistency and preventing “Birmingham, UK” vs. “Birmingham (UK)” style mismatches. 31 | 32 | #### **Key Takeaways** 33 | - Proper **normalization** and consistently computed attributes ensure accurate aggregates. 34 | - Dynamic attributes (like Age) are best calculated rather than stored directly. 35 | 36 | --- 37 | 38 | ### **(c) For security reasons, the researcher should be given minimal, read‐only access to the database. Give a suitable command that the database administrator should run.** 39 | **Answer:** 40 | ```sql 41 | GRANT SELECT ON database_name.* 42 | TO 'researcher'@'localhost' 43 | IDENTIFIED BY 'password'; 44 | ``` 45 | 46 | #### **Detailed Explanation** 47 | - **Principle of Least Privilege**: The `SELECT` privilege alone grants read‐only access. 
48 | - In modern MySQL, you may need to create the user first with `CREATE USER` before `GRANT`. 49 | 50 | --- 51 | 52 | ### **(d) From the point of view of handling confidential information about minors, it would be better to give access only to aggregated data. How would you achieve that?** 53 | **Answer:** 54 | Create a **VIEW** that exposes only aggregated data (e.g., average scores, counts) rather than individual‐level records, then grant `SELECT` on that view only. 55 | 56 | ```sql 57 | CREATE VIEW AggregatedData AS 58 | SELECT Gender, City, AVG(Score) AS AvgScore 59 | FROM Test 60 | GROUP BY Gender, City; 61 | ``` 62 | 63 | #### **Key Takeaways** 64 | - **Views** can mask underlying sensitive data by surfacing only summaries. 65 | - This aligns with data protection standards (e.g., for minors). 66 | 67 | --- 68 | 69 | ### **(e) What limitation would that create for the researcher?** 70 | **Answer:** 71 | They cannot drill down to individual records. This **prevents** detailed, record‐level analysis (e.g., outlier detection, correlation studies at the individual level). 72 | 73 | --- 74 | 75 | ### **(f) The `Student` table is defined as**: 76 | ```sql 77 | CREATE TABLE Student ( 78 | ID VARCHAR(25) PRIMARY KEY, 79 | GivenName VARCHAR(80) NOT NULL, 80 | FamilyName VARCHAR(80) NOT NULL, 81 | Gender ENUM('M','F') NOT NULL, 82 | BirthDate DATE NOT NULL, 83 | School VARCHAR(130), 84 | City VARCHAR(130) 85 | ); 86 | ``` 87 | **What problems can you see with this table, and how would you resolve them?** 88 | **Answer:** 89 | 1. **Primary Key as `VARCHAR(25)`**: This may be inefficient for indexing and lookups if the ID could instead be an integer. 90 | 2. **Overly Restrictive Gender**: `ENUM('M','F')` might be insufficient for modern contexts. 91 | 3. **Lack of Referential Integrity for City/School**: Storing them as free‐text invites inconsistencies. 92 | 4. **Potential Redundancy**: Each record duplicates school/city info instead of referencing them by an ID. 93 | 94 | #### **Possible Resolutions** 95 | - Use an integer (e.g., `INT AUTO_INCREMENT`) for `ID`, unless there is a strong reason to keep an external string ID. 96 | - Broaden or alter the `Gender` storage if required by real‐life constraints (or allow nulls, additional values, etc.). 97 | - Normalize `City` and `School` into their own tables with proper foreign keys. 98 | 99 | --- 100 | 101 | ### **(g) How well would this data work in an object database like MongoDB? What would be the advantages or disadvantages?** 102 | **Answer:** 103 | - **Advantages**: 104 | 1. Schema flexibility (no strict schema needed). 105 | 2. Document‐oriented model makes embedding nested data easier. 106 | 3. Horizontal scaling is straightforward. 107 | 108 | - **Disadvantages**: 109 | 1. No traditional joins, which complicates cross‐document queries. 110 | 2. Potential duplication (e.g., city info repeated in many documents). 111 | 3. Weaker built‐in referential integrity. 112 | 113 | --- 114 | 115 | # **Question 3: XML, XPath, and Relational Models** 116 | 117 | --- 118 | 119 | ### **(a) What markup language is being used? And what is the root node?** 120 | **Answer:** 121 | - **Markup Language:** XML (specifically TEI, which is an XML application). 122 | - **Root Node:** `<TEI>`. 123 | 124 | --- 125 | 126 | ### **(b) Is this fragment well‐formed? Justify your answer.** 127 | **Answer:** 128 | Likely **not** well‐formed if there are unclosed tags (e.g., `<fileDesc>`, `<teiHeader>`). 
Properly closed tags are essential for well‐formedness under XML rules. 129 | 130 | --- 131 | 132 | ### **(c) What would be selected by evaluating the XPath expression `//fileDesc//title/@type`?** 133 | **Answer:** 134 | It selects the **`type`** attribute value(s) on every `<title>` element nested under `<fileDesc>`. For example, `"collection"`. 135 | 136 | --- 137 | 138 | ### **(d) What would be selected by evaluating `//resp[text()='Cataloguer']/../persName`?** 139 | **Answer:** 140 | All `<persName>` elements whose parent also contains a `<resp>` element with text `'Cataloguer'`. The `../` means “go up one level to the parent, then find `<persName>`.” 141 | 142 | --- 143 | 144 | ### **(e) Why might you choose the expression given in part (d) rather than the simpler `persName`? Give two situations where it would be preferable.** 145 | **Answer:** 146 | 1. **Disambiguation**: If the document has many `<persName>` elements, only those tied to a “Cataloguer” `<resp>` are relevant. 147 | 2. **Context**: You only want `<persName>` that is specifically associated with `<resp>Cataloguer</resp>`. 148 | 149 | --- 150 | 151 | ### **(f) This element refers to the second textual item (such as a story or sermon) that the manuscript contains – hence `n=2`. How well would this way of listing contents work in a relational model? How would you approach the problem?** 152 | **Answer:** 153 | **Storing an order** via `n="2"` is not ideal. Relational tables do not inherently track ordering. A better solution is to have a separate table with columns like `(ManuscriptID, ItemNumber, ItemContent)`, so you can reorder or query items by their explicit sequence. 154 | 155 | --- 156 | 157 | ### **(g) Here is an extract of the file `msdesc.rng`. What is this file, and why is it referenced in the catalogue entry?** 158 | **Answer:** 159 | It is a **Relax NG schema** that defines the structure and constraints of the TEI XML. The catalogue entry references it so that the XML can be validated against the schema’s rules. 160 | 161 | --- 162 | 163 | ### **(h) What is the difference between valid and well‐formed XML?** 164 | **Answer:** 165 | - **Well‐formed XML**: All tags are properly opened/closed and nested; there is exactly one root element, etc. 166 | - **Valid XML**: XML that is **well‐formed** **and** conforms to a defined schema/DTD. 167 | 168 | --- 169 | 170 | ### **(i) If the first extract in this question had omitted the `respStmt` element, would the XML have been legal?** 171 | **Answer:** 172 | No, not if the schema requires `<respStmt>`. It would still be “well‐formed” if all tags are closed, but **not valid** (i.e., “not legal” per the schema). 173 | 174 | --- 175 | 176 | ### **(j) If the first extract had omitted the `title` elements, would the XML have been legal?** 177 | **Answer:** 178 | No, for the same reason. If `title` is mandatory in `<titleStmt>`, omitting it breaks validation. 179 | 180 | --- 181 | 182 | ### **(k) This catalogue entry is converted automatically to HTML whenever it changes. What two technologies would be most likely to be considered for the conversion?** 183 | **Answer:** 184 | 1. **XSLT** (Extensible Stylesheet Language Transformations) for transforming XML to HTML. 185 | 2. An **XSLT processor** (e.g., **Saxon** or **Xalan**) to run those transformations. 
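As a hedged illustration only (the element names follow the TEI extracts in this question, but the TEI namespace binding and the HTML layout are assumptions, not the catalogue's actual stylesheet), a minimal XSLT conversion might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch: turn a TEI catalogue entry into a simple HTML page.
     Element names (fileDesc, title, persName) come from the extracts above;
     the namespace binding and page layout are illustrative assumptions. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:output method="html" indent="yes"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- Main title from the file description -->
        <h1><xsl:value-of select="//tei:fileDesc//tei:title"/></h1>
        <!-- Everyone named in the header, e.g. the cataloguer -->
        <ul>
          <xsl:for-each select="//tei:persName">
            <li><xsl:value-of select="."/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

An XSLT processor (Saxon, Xalan, or similar) would then be run, typically from a save hook or build script, to regenerate the HTML whenever the catalogue entry changes; exact command-line options vary by processor.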
186 | 187 | --- 188 | 189 | # **Question 4: RDF, Ontologies, and Linked Data** 190 | 191 | Below is the revised solution focusing on a **single‐table triple store** approach where `(subject, predicate, object)` is used as a **composite primary key**. 192 | 193 | --- 194 | 195 | ### **(a) (i) What is the model?** 196 | **Answer:** 197 | **RDF** (Resource Description Framework). 198 | 199 | ### **(a) (ii) What is the serialization format?** 200 | **Answer:** 201 | **Turtle**. 202 | 203 | --- 204 | 205 | ### **(b) Name two ontologies used in this document.** 206 | **Answer:** 207 | 1. **Dublin Core** (`dcterms:`) 208 | 2. **FOAF** (`foaf:`) 209 | 210 | *(Sometimes `oa:` for Open Annotation is also mentioned.)* 211 | 212 | --- 213 | 214 | ### **(c) For each ontology named in your previous answer, name all the properties from the ontology that are used in this document.** 215 | **Answer:** 216 | - **Dublin Core**: `dcterms:creator`, `dcterms:created` 217 | - **FOAF**: `foaf:name` 218 | 219 | --- 220 | 221 | ### **(d) This structure is a Web Annotation previously called Open Annotation. The BODY of the annotation contains a comment on the TARGET, which is often part of a SOURCE. The scholar wants a SPARQL query that returns the annotation body and the creator, filtered by `armadale:Chapter3`.** 222 | **Answer (SPARQL):** 223 | ```sparql 224 | SELECT ?body ?creator 225 | WHERE { 226 | ?annotation a oa:Annotation ; 227 | oa:hasBody ?body ; 228 | dcterms:creator ?creator ; 229 | oa:hasTarget ?target . 230 | ?target oa:hasSource armadale:Chapter3 . 231 | } 232 | ``` 233 | 234 | --- 235 | 236 | ### **(e) Some Linked Data systems use a backend database to store the data and for quick retrieval, exporting it as needed. Draw an ER diagram for web annotations like this.** 237 | 238 | #### **Single‐Table Triples Approach with Composite PK** 239 | 240 | Instead of creating separate tables (e.g., Annotations, Persons, Sources), we store **all** RDF data in a single table called `Triples`, each row representing `(subject, predicate, object)`. 241 | 242 | ```mermaid 243 | erDiagram 244 | Triples { 245 | string subject PK 246 | string predicate PK 247 | string object PK 248 | } 249 | ``` 250 | 251 | - **Composite Primary Key** `(subject, predicate, object)` ensures no duplicate triple. 252 | 253 | --- 254 | 255 | ### **(f) Identify the tables that you would need for a relational implementation and list the keys for each.** 256 | 257 | #### **Revised Single‐Table Triple‐Store Answer** 258 | 259 | You only need **one** table: 260 | 261 | 1. 
**`Triples`** 262 | - **Columns**: `subject`, `predicate`, `object` 263 | - **Primary Key**: `(subject, predicate, object)` (composite key) 264 | 265 | *(Optional: add columns for data type, language, or named graph if needed.)* 266 | 267 | --- 268 | 269 | ### **(g) Give a MySQL query equivalent for the scholar’s query you corrected in question (d).** 270 | 271 | In the triple‐store design, we do multiple **self‐joins** on the `Triples` table: 272 | 273 | ```sql 274 | SELECT tBody.object AS body, 275 | tCreator.object AS creator 276 | FROM Triples tAnno 277 | JOIN Triples tType 278 | ON tAnno.subject = tType.subject 279 | JOIN Triples tBody 280 | ON tAnno.subject = tBody.subject 281 | JOIN Triples tCreator 282 | ON tAnno.subject = tCreator.subject 283 | JOIN Triples tTarget 284 | ON tAnno.subject = tTarget.subject 285 | JOIN Triples tSource 286 | ON tTarget.object = tSource.subject 287 | WHERE tType.predicate = 'rdf:type' 288 | AND tType.object = 'oa:Annotation' 289 | AND tBody.predicate = 'oa:hasBody' 290 | AND tCreator.predicate = 'dcterms:creator' 291 | AND tTarget.predicate = 'oa:hasTarget' 292 | AND tSource.predicate = 'oa:hasSource' 293 | AND tSource.object = 'armadale:Chapter3'; 294 | ``` 295 | 296 | #### **Explanation** 297 | - **`tAnno`**: Baseline row from `Triples`, whose `subject` is the annotation resource. 298 | - **`tType`**: Ensures it is `rdf:type = oa:Annotation`. 299 | - **`tBody`**: Finds `oa:hasBody` triple for the same subject. 300 | - **`tCreator`**: Finds `dcterms:creator` triple for that subject. 301 | - **`tTarget`**: Finds `oa:hasTarget` triple for that subject. 302 | - **`tSource`**: Ensures `?target oa:hasSource armadale:Chapter3`. 303 | 304 | --- 305 | 306 | ## **Final Remarks** 307 | 308 | 1. **Design Choice**: 309 | - A single‐table triple store is flexible but forces multiple self‐joins for complex queries. 310 | - A multi‐table approach (Annotations, Persons, Sources, etc.) can be more “traditional” and potentially simpler for certain queries. 311 | 2. **Keys & Constraints**: 312 | - Using `(subject, predicate, object)` as a **composite PK** prevents duplicate statements in the store. 313 | 3. **Relevance**: 314 | - These approaches illustrate both **relational** and **linked data** mindsets. Depending on your environment, you may choose one or hybrid solutions. 315 | -------------------------------------------------------------------------------- /Lectures/Solution Sheets - September 2023.md: -------------------------------------------------------------------------------- 1 | 2 | # **Question 1: Linked Data Question** 3 | 4 | ### (a) 5 | 6 | ```turtle 7 | @prefix bn: <http://babelnet.org/rdf/> . 8 | @prefix lemon: <http://www.lemon-model.net/lemon#> . 9 | @prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> . 10 | 11 | bn:post_n_EN a lemon:LexicalEntry ; 12 | lemon:canonicalForm <http://babelnet.org/rdf/post_n_EN/canonicalForm> ; 13 | lemon:language "EN" ; 14 | lexinfo:partOfSpeech lexinfo:noun . 15 | ``` 16 | 17 | **(i) What is the generic data model?** 18 | **Answer (1 mark):** 19 | This is **RDF** (Resource Description Framework). 20 | 21 | **(ii) What is the serialization format?** 22 | **Answer (1 mark):** 23 | It’s in **Turtle** format (Terse RDF Triple Language), evidenced by the `@prefix` notation and the `;`/`.` structure. 
24 | 25 | --- 26 | 27 | ### (b) 28 | 29 | > *Friend 1 claims it’s impossible to know the actual word or part of speech without more triples; Friend 2 claims it’s obviously the English noun “post.”* 30 | 31 | **Answer (4 marks):** 32 | 33 | 1. **Why it might be “impossible to know”** 34 | - The snippet alone only shows `bn:post_n_EN` is a `lemon:LexicalEntry` with language `"EN"` and part‐of‐speech `noun`. The **actual written representation** (“post”) is hidden behind `<…/canonicalForm>` and not yet shown directly. 35 | 36 | 2. **Why it “might be the English word ‘post’”** 37 | - If you follow the linked resource `<.../canonicalForm>` and see `lemon:writtenRep "post"`, you confirm the word is “post.” So if Friend 2 has that extra triple, they’re correct. 38 | 39 | 3. **Conclusion and further info** 40 | - Both friends have a point. By RDF alone, you need to dereference the `canonicalForm` node to see `writtenRep "post"`. Additional data clarifies that this is indeed the English noun “post.” 41 | 42 | --- 43 | 44 | ### (c) 45 | 46 | When you request `<http://babelnet.org/rdf/post_n_EN/canonicalForm>`, it returns `lemon:writtenRep "post"`. 47 | 48 | **(i)** **SPARQL**: *Find the written representation and language for all nouns.* 49 | **Answer (6 marks):** 50 | 51 | ```sparql 52 | PREFIX lemon: <http://www.lemon-model.net/lemon#> 53 | PREFIX lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> 54 | 55 | SELECT ?writtenRep ?lang 56 | WHERE { 57 | ?lexEntry a lemon:LexicalEntry ; 58 | lemon:canonicalForm ?form ; 59 | lemon:language ?lang ; 60 | lexinfo:partOfSpeech lexinfo:noun . 61 | 62 | ?form lemon:writtenRep ?writtenRep . 63 | } 64 | ``` 65 | 66 | **(ii)** **SPARQL**: *Find the language and part of speech for words whose canonical form is “post.”* 67 | **Answer (4 marks):** 68 | 69 | ```sparql 70 | PREFIX lemon: <http://www.lemon-model.net/lemon#> 71 | PREFIX lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> 72 | 73 | SELECT ?language ?pos 74 | WHERE { 75 | ?lexEntry a lemon:LexicalEntry ; 76 | lemon:canonicalForm ?form ; 77 | lemon:language ?language ; 78 | lexinfo:partOfSpeech ?pos . 79 | 80 | ?form lemon:writtenRep "post" . 81 | } 82 | ``` 83 | 84 | --- 85 | 86 | ### (d) 87 | 88 | We have an excerpt from [lemon‐model.net/lemon\#](http://www.lemon-model.net/lemon#) describing classes like `:LexicalSense`, `:SenseDefinition`, and properties like `:definition`, `:value`. 89 | 90 | **(i) What is the role of this document?** 91 | **Answer (1 mark):** 92 | It’s an **ontology (schema) definition** for the Lemon lexical model, specifying classes and properties for lexical/semantic descriptions. 93 | 94 | **(ii) What format is it in?** 95 | **Answer (1 mark):** 96 | Likely **RDF** (e.g., Turtle or RDF/XML). 97 | 98 | **(iii) To what does the ‘owl’ prefix refer?** 99 | **Answer (1 mark):** 100 | The **OWL** (Web Ontology Language) namespace (`http://www.w3.org/2002/07/owl#`) for more expressive ontology constructs. 101 | 102 | **(iv)** *Provide one definition for the English noun “post” in RDF.* 103 | **Answer (4 marks):** 104 | Example in Turtle: 105 | 106 | ```turtle 107 | @prefix : <http://example.org/lemonDefs#> . 108 | @prefix lemon: <http://www.lemon-model.net/lemon#> . 109 | @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 110 | 111 | :post_n_EN_sense a :LexicalSense ; 112 | :definition :post_n_EN_def . 113 | 114 | :post_n_EN_def a :SenseDefinition ; 115 | :value "A piece of wood or metal set upright to support something."@en . 
116 | ``` 117 | 118 | --- 119 | 120 | ### (e) 121 | 122 | > *Sketch an ER diagram for a relational implementation of this model (Lemon style). Include cardinalities.* 123 | 124 | Below is a **Mermaid ER diagram** in code. It shows how you might map *LexicalEntry*, *Form*, *LexicalSense*, and *SenseDefinition* into a relational schema: 125 | 126 | ```mermaid 127 | erDiagram 128 | %% Entities: 129 | LexicalEntry ||--o{ Form : "has_form" 130 | LexicalEntry ||--|{ LexicalSense : "has_sense" 131 | LexicalSense ||--o{ SenseDefinition : "has_definition" 132 | 133 | %% Now define attributes (Mermaid must have type + name): 134 | LexicalEntry { 135 | int LexicalEntryID 136 | %% (Primary Key) 137 | string language 138 | string partOfSpeech 139 | %% Possibly more columns, e.g. lexicalEntryType, etc. 140 | } 141 | 142 | Form { 143 | int FormID 144 | %% (Primary Key) 145 | string writtenRep 146 | %% Foreign key link to LexicalEntry 147 | } 148 | 149 | LexicalSense { 150 | int LexicalSenseID 151 | %% (Primary Key) 152 | %% Foreign key link to LexicalEntry 153 | } 154 | 155 | SenseDefinition { 156 | int SenseDefinitionID 157 | %% (Primary Key) 158 | string textValue 159 | %% Foreign key link to LexicalSense 160 | } 161 | ``` 162 | 163 | **Explanations**: 164 | - One **LexicalEntry** can have multiple **Forms** (1:M). 165 | - One **LexicalEntry** can have multiple **LexicalSenses** (1:M). 166 | - One **LexicalSense** can have multiple **SenseDefinitions** (1:M). 167 | 168 | --- 169 | 170 | # **Question 2: ER Question (Real‐Estate Agency)** 171 | 172 | An estate agency selling residential houses and flats is building a database to track its business. The ER diagram shows the main elements: 173 | 174 | 1. **Seller** (Name, Address, Phone Number) 175 | 2. **Estate Agent** (Name) 176 | 3. **Property** (Address, #bedrooms, Type, Asking price) 177 | 4. **Buyer** (Name, Address, Phone number) 178 | 5. **Offers** (Offer date, Offer status, Offer value) 179 | 6. **Views** (Date) 180 | 181 | The relationships in the diagram: 182 | - A Seller **owns** Property. 183 | - An Estate Agent **sells** Property. 184 | - A Property **has** Offers and is also **viewed** by Buyers. 185 | - An Offer has (Offer date, status, value) and is associated with one Buyer and one Property. 186 | - A View has (Date) and associates a Buyer with a Property. 187 | 188 | ### (a) Add cardinality indications for this diagram 189 | **(3 marks)** 190 | 191 | From the attached ER diagram, we can interpret the following cardinalities: 192 | 193 | 1. **Seller–Property**: One seller can own **many** properties, but each property is owned by **exactly one** seller. 194 | - **(1 : M)** 195 | 2. **Estate Agent–Property**: One estate agent is responsible for **many** properties, but each property is handled by **exactly one** agent. 196 | - **(1 : M)** 197 | 3. **Property–Offers**: One property can receive **many** offers, but each offer refers to exactly one property. 198 | - **(1 : M)** 199 | 4. **Property–Views**: One property can have **many** viewings, and a viewing is for exactly one property (in the diagram, the “Views” diamond connects property and buyer). 200 | - **(1 : M)** 201 | 5. **Buyer–Offers**: One buyer can make **many** offers, and each offer is from exactly one buyer. 202 | - **(1 : M)** 203 | 6. **Buyer–Views**: One buyer can have **many** viewings, and each viewing is for exactly one buyer. 204 | - **(1 : M)** 205 | 206 | *(If the diagram allowed multiple buyers to attend one viewing, that would be a many‐to‐many. 
But as drawn—one “Views” diamond connecting one buyer and one property—this is effectively 1 : M from each side. In practice, you might interpret it differently, but we will stick to the diagram.)* 207 | 208 | --- 209 | 210 | ### (b) How would you adapt this to a relational model? 211 | **(5 marks)** 212 | 213 | Following the diagram **as is**, **without extra ID attributes**, we can use the attributes from the boxes and ovals as columns. Below is one possible mapping: 214 | 215 | 1. **Seller** 216 | - **Primary key (PK)**: SellerName *(since the diagram does not show a separate ID)* 217 | - Other columns: Address, PhoneNumber 218 | 219 | 2. **EstateAgent** 220 | - **PK**: AgentName 221 | - *(No other attributes shown except “Name” in the diagram.)* 222 | 223 | 3. **Property** 224 | - **PK**: Address *(the diagram shows an Address for the property; we assume it uniquely identifies it)* 225 | - #bedrooms, Type, AskingPrice 226 | - **Foreign Keys**: 227 | - SellerName → references **Seller**(Name) 228 | - AgentName → references **EstateAgent**(Name) 229 | 230 | 4. **Buyer** 231 | - **PK**: BuyerName 232 | - Other columns: Address, PhoneNumber 233 | 234 | 5. **Offers** 235 | - The diagram shows OfferDate, OfferStatus, OfferValue 236 | - **Composite PK**: (PropertyAddress, BuyerName, OfferDate) or a suitable combination 237 | - **FK**: PropertyAddress → references **Property**(Address) 238 | - **FK**: BuyerName → references **Buyer**(Name) 239 | 240 | 6. **Views** 241 | - The diagram shows a date for the viewing 242 | - **Composite PK**: (PropertyAddress, BuyerName, ViewDate) or similarly chosen 243 | - **FK**: PropertyAddress → references **Property**(Address) 244 | - **FK**: BuyerName → references **Buyer**(Name) 245 | 246 | In **real‐world** practice, we often introduce synthetic `PropertyID`, `SellerID`, etc. to avoid using addresses or names as PKs. But since your diagram has no explicit ID attributes, we stay as close as possible to it. 247 | 248 | --- 249 | 250 | ### (c) List the tables, primary keys, and foreign keys 251 | **(6 marks)** 252 | 253 | Below is a concise list reflecting the diagram’s attributes: 254 | 255 | 1. **Seller** 256 | - **Columns**: Name *(PK)*, Address, PhoneNumber 257 | 258 | 2. **EstateAgent** 259 | - **Columns**: Name *(PK)* 260 | 261 | 3. **Property** 262 | - **Columns**: Address *(PK)*, Type, Bedrooms, AskingPrice 263 | - **Foreign Keys**: 264 | - SellerName → **Seller**(Name) 265 | - AgentName → **EstateAgent**(Name) 266 | 267 | 4. **Buyer** 268 | - **Columns**: Name *(PK)*, Address, PhoneNumber 269 | 270 | 5. **Offers** 271 | - **Columns**: OfferDate, OfferStatus, OfferValue, PropertyAddress, BuyerName 272 | - **Composite PK**: (PropertyAddress, BuyerName, OfferDate) (or similar) 273 | - **FKs**: 274 | - PropertyAddress → **Property**(Address) 275 | - BuyerName → **Buyer**(Name) 276 | 277 | 6. 
**Views** 278 | - **Columns**: ViewDate, PropertyAddress, BuyerName 279 | - **Composite PK**: (PropertyAddress, BuyerName, ViewDate) 280 | - **FKs**: 281 | - PropertyAddress → **Property**(Address) 282 | - BuyerName → **Buyer**(Name) 283 | 284 | --- 285 | 286 | ### (d) Give the MySQL command for creating one of those tables 287 | **(3 marks)** 288 | 289 | As an example, let’s create the **`Property`** table using the columns from the diagram (and references to Seller and Agent): 290 | 291 | ```sql 292 | CREATE TABLE Property ( 293 | Address VARCHAR(100) NOT NULL, 294 | Type VARCHAR(50), 295 | Bedrooms INT, 296 | AskingPrice DECIMAL(12, 2), 297 | SellerName VARCHAR(100) NOT NULL, 298 | AgentName VARCHAR(100) NOT NULL, 299 | PRIMARY KEY (Address), 300 | 301 | FOREIGN KEY (SellerName) 302 | REFERENCES Seller(Name), 303 | 304 | FOREIGN KEY (AgentName) 305 | REFERENCES EstateAgent(Name) 306 | ); 307 | ``` 308 | 309 | *(We assume 100 characters is enough for an address or name. Adjust as needed.)* 310 | 311 | --- 312 | 313 | ### (e) Agents are paid a commission on property where the offer gets to ‘sale completed.’ The commission is 1% of **the sale price**. 314 | 315 | #### (i) Write a MySQL query to calculate and list the commission earned **since 1 January 2023** for each Estate Agent. 316 | **(6 marks)** 317 | 318 | We assume the final accepted offer is indicated by `OfferStatus = 'sale completed'` and that the actual final sale price is in `OfferValue`. Also assume `OfferDate` is the date the sale completed: 319 | 320 | ```sql 321 | SELECT 322 | p.AgentName AS EstateAgent, 323 | SUM(o.OfferValue * 0.01) AS TotalCommission 324 | FROM Property p 325 | JOIN Offers o 326 | ON p.Address = o.PropertyAddress 327 | WHERE o.OfferStatus = 'sale completed' 328 | AND o.OfferDate >= '2023-01-01' 329 | GROUP BY p.AgentName; 330 | ``` 331 | 332 | - Multiplies each completed offer’s `OfferValue` by 0.01 (1%) 333 | - Sums them per agent. 334 | 335 | #### (ii) Modify your query to list **just the top‐earning agent** 336 | **(2 marks)** 337 | 338 | ```sql 339 | SELECT 340 | p.AgentName AS EstateAgent, 341 | SUM(o.OfferValue * 0.01) AS TotalCommission 342 | FROM Property p 343 | JOIN Offers o 344 | ON p.Address = o.PropertyAddress 345 | WHERE o.OfferStatus = 'sale completed' 346 | AND o.OfferDate >= '2023-01-01' 347 | GROUP BY p.AgentName 348 | ORDER BY TotalCommission DESC 349 | LIMIT 1; 350 | ``` 351 | 352 | --- 353 | 354 | ### (f) The IT specialist is considering a document database 355 | **(5 marks)** 356 | 357 | We focus on **real‐estate–specific** reasons: 358 | 359 | - **Potential advantages** of a document DB (e.g., MongoDB): 360 | 1. Storing variable or unstructured property details (photos, custom fields, long textual descriptions) is more flexible. 361 | 2. If there is heavy reading of large unstructured property listings, a NoSQL solution can scale horizontally. 362 | 363 | - **Potential disadvantages**: 364 | 1. The data is actually quite relational (Seller, Agent, Buyer, Offers). SQL queries for commissions or statuses are simpler in a relational schema. 365 | 2. Ensuring transaction consistency (e.g., an offer changes from “made” to “sale completed”) is more straightforward in a relational DB. 366 | 3. You might end up duplicating structured data across documents, raising consistency issues. 367 | 368 | Thus, for **highly relational** scenarios—offers, statuses, commission calculations—a relational DB is typically more suitable. 
Document DBs might be beneficial if you have highly variable or semi‐structured listing data. 369 | 370 | --- 371 | 372 | # **Question 3: IR/doc db Question** 373 | 374 | ### (a) If 2,200,000 books are labeled German at 80% precision, how many are truly German? 375 | **Answer (2 marks):** 376 | True positives = 2,200,000 × 0.80 = **1,760,000**. 377 | 378 | --- 379 | 380 | ### (b) How many German books in total (including those missed) if recall is 88%? 381 | **Answer (3 marks):** 382 | 383 | \[ 384 | \text{All German} = \frac{\text{True Positives}}{\text{Recall}} 385 | = \frac{1,760,000}{0.88} 386 | \approx 2,000,000. 387 | \] 388 | 389 | --- 390 | 391 | ### (c) Danish classifier: 100% precision, 76% recall. Why more useful for ML training than German’s 80% precision? 392 | **Answer (5 marks):** 393 | - **Training sets** value extremely high precision, because you don’t want noisy examples (false positives). 394 | - 76% recall means some Danish books are missed, but *every* labeled Danish book is *actually* Danish. That yields a **pure** dataset for ML training, often better than a bigger but contaminated set. 395 | 396 | --- 397 | 398 | ### (d) F1 measure (German & Danish). What is F1? 399 | **Answer (2 marks):** 400 | F1 = Harmonic mean of precision and recall: 401 | \[ 402 | F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. 403 | \] 404 | 405 | --- 406 | 407 | ### (e) `db.books.find({ lang: "German" })` 408 | **(1 mark)** 409 | This **queries all documents** in the `books` collection with `lang` = `"German"`. 410 | 411 | --- 412 | 413 | ### (f) Rewrite to get only 19th‐century volumes 414 | **Answer (5 marks):** 415 | 416 | ```js 417 | db.books.find({ 418 | lang: "German", 419 | year: { $gte: 1800, $lt: 1900 } 420 | }) 421 | ``` 422 | - Finds volumes from 1800 to 1899. 423 | 424 | --- 425 | 426 | ### (g) Single textual field called “text”; filter for “Strudel” 427 | **(2 marks)** 428 | 429 | ```js 430 | db.books.find({ 431 | lang: "German", 432 | year: { $gte: 1800, $lt: 1900 }, 433 | text: /Strudel/ 434 | }) 435 | ``` 436 | - Uses a regex to match documents whose `text` field contains “Strudel.” 437 | 438 | --- 439 | 440 | ### (h) TEI/XML vs. enriching a document DB 441 | **Answer (10 marks):** 442 | 1. **Structured encoding:** TEI is superb for detailed textual markup (chapters, footnotes, critical apparatus). A JSON store is more flexible but less specialized for hierarchical text. 443 | 2. **Query complexity:** XML DB + XQuery can handle fine‐grained queries by XML elements. MongoDB supports simpler doc queries, possibly less powerful for nested text structures. 444 | 3. **Standards & Interoperability:** TEI is widely used in digital humanities, enabling data sharing with other TEI projects. 445 | 4. **Performance & scale:** Large TEI corpora can be stored in specialized XML databases, but a NoSQL doc DB might scale horizontally. 446 | 5. **Long‐term preservation:** TEI is a recognized standard for scholarly text encoding, often important for academic or library contexts. 447 | 448 | The choice depends on how much *structural*, *semantic*, and *scholarly* detail you want to preserve vs. the need for flexible indexing or large‐scale doc management. 
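As a quick numeric cross-check of parts (a), (b) and (d) above (this sketch is a revision aid added here, not part of the model answer), the figures given in the question can be recomputed in a few lines of Python:

```python
# Cross-check of Question 3 (a), (b) and (d) using the figures from the paper:
# 2,200,000 books labelled German at 80% precision and 88% recall;
# the Danish classifier at 100% precision and 76% recall.

def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

labelled_german = 2_200_000
german_precision, german_recall = 0.80, 0.88
danish_precision, danish_recall = 1.00, 0.76

true_positives = labelled_german * german_precision   # (a) 1,760,000
all_german = true_positives / german_recall           # (b) 2,000,000

print(f"(a) truly German among labelled: {true_positives:,.0f}")
print(f"(b) total German books:          {all_german:,.0f}")
print(f"(d) F1 German: {f1(german_precision, german_recall):.3f}")  # ~0.838
print(f"    F1 Danish: {f1(danish_precision, danish_recall):.3f}")  # ~0.864
```

Note that the Danish classifier also has the higher F1 (~0.86 vs ~0.84), even though its recall is lower.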
449 | 450 | -------------------------------------------------------------------------------- /MongoDB: Selection, Projection and Sorting.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "include_colab_link": true 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": "python3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "view-in-github", 20 | "colab_type": "text" 21 | }, 22 | "source": [ 23 | "<a href=\"https://colab.research.google.com/github/sreent/data-management-intro/blob/main/MongoDB%3A%20Selection%2C%20Projection%20and%20Sorting.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" 24 | ] 25 | }, 26 | { 27 | "metadata": { 28 | "id": "vM6ta952S2z2" 29 | }, 30 | "cell_type": "markdown", 31 | "source": [ 32 | "# 1 Setting Up MongoDB Environment" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "source": [ 38 | "# Install MongoDB's dependencies\n", 39 | "!sudo wget http://archive.ubuntu.com/ubuntu/pool/main/o/openssl/libssl1.1_1.1.1f-1ubuntu2_amd64.deb\n", 40 | "!sudo dpkg -i libssl1.1_1.1.1f-1ubuntu2_amd64.deb\n", 41 | "\n", 42 | "# Import the public key used by the package management system\n", 43 | "!wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add -\n", 44 | "\n", 45 | "# Create a list file for MongoDB\n", 46 | "!echo \"deb [ arch=amd64,arm64 ] http://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.4 multiverse\" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list\n", 47 | "\n", 48 | "# Reload the local package database\n", 49 | "!apt-get update > /dev/null\n", 50 | "\n", 51 | "# Install the MongoDB packages\n", 52 | "!apt-get install -y mongodb-org > /dev/null\n", 53 | "\n", 54 | "# Install pymongo\n", 55 | "!pip install -q pymongo\n", 56 | "\n", 57 | "# Create Data Folder\n", 58 | "!mkdir -p /data/db\n", 59 | "\n", 60 | "# Start MongoDB\n", 61 | "!mongod --fork --logpath /var/log/mongodb.log --dbpath /data/db" 62 | ], 63 | "metadata": { 64 | "id": "zgXgWsKqFlWM" 65 | }, 66 | "execution_count": null, 67 | "outputs": [] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "source": [ 72 | "from pymongo import MongoClient\n", 73 | "\n", 74 | "# Establish connection to MongoDB\n", 75 | "try:\n", 76 | " client = MongoClient('localhost', 27017)\n", 77 | " print(\"Connected to MongoDB\")\n", 78 | "except Exception as e:\n", 79 | " print(\"Error connecting to MongoDB: \", e)\n", 80 | " exit()\n", 81 | "\n", 82 | "# List databases to check the connection\n", 83 | "try:\n", 84 | " databases = client.list_database_names()\n", 85 | " print(\"Databases:\", databases)\n", 86 | "except Exception as e:\n", 87 | " print(\"Error listing databases: \", e)\n", 88 | "\n", 89 | "# Retrieve server status\n", 90 | "try:\n", 91 | " server_status = client.admin.command(\"serverStatus\")\n", 92 | " print(\"Server Status:\", server_status)\n", 93 | "except Exception as e:\n", 94 | " print(\"Error retrieving server status: \", e)\n", 95 | "\n", 96 | "# Perform basic database operations (Create, Read)\n", 97 | "try:\n", 98 | " db = client.test_db\n", 99 | " collection = db.test_collection\n", 100 | " # Insert a document\n", 101 | " insert_result = collection.insert_one({\"name\": \"test\", \"value\": 123})\n", 102 | " print(\"Insert operation result:\", 
insert_result.inserted_id)\n", 103 | " # Read a document\n", 104 | " read_result = collection.find_one({\"name\": \"test\"})\n", 105 | " print(\"Read operation result:\", read_result)\n", 106 | "except Exception as e:\n", 107 | " print(\"Error performing database operations: \", e)" 108 | ], 109 | "metadata": { 110 | "id": "f2q4bBmFNQuA" 111 | }, 112 | "execution_count": null, 113 | "outputs": [] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "source": [ 118 | "# 2 Preparations" 119 | ], 120 | "metadata": { 121 | "id": "ZK39BLCHmVUa" 122 | } 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "source": [ 127 | "Databases and collections in MongoDB are created implicitly while data is inserted. In this tutorial, you will create a collection of *films*. There is no collection so far, so create one by inserting a document." 128 | ], 129 | "metadata": { 130 | "id": "qXVCAX1xmz2q" 131 | } 132 | }, 133 | { 134 | "cell_type": "code", 135 | "source": [ 136 | "query = \"\"\"\n", 137 | "db.collection.insertMany([\n", 138 | " {\n", 139 | " \"ISBN\": \"978-0321751041\",\n", 140 | " \"title\": \"The Art of Computer Programming\",\n", 141 | " \"author\": \"Donald E. Knuth\",\n", 142 | " \"publisher\": \"Addison Wesley\",\n", 143 | " \"yearPublished\": 1968,\n", 144 | " \"price\": 200\n", 145 | " },\n", 146 | " {\n", 147 | " \"ISBN\": \"978-0201633610\",\n", 148 | " \"title\": \"Design Patterns: Elements of Reusable Object-Oriented Software\",\n", 149 | " \"author\": \"Erich Gamma et al.\",\n", 150 | " \"publisher\": \"Addison Wesley\",\n", 151 | " \"yearPublished\": 1994,\n", 152 | " \"price\": 45\n", 153 | " },\n", 154 | " {\n", 155 | " \"ISBN\": \"978-0321573513\",\n", 156 | " \"title\": \"Effective Java\",\n", 157 | " \"author\": \"Joshua Bloch\",\n", 158 | " \"publisher\": \"Addison Wesley\",\n", 159 | " \"yearPublished\": 2008,\n", 160 | " \"price\": 50\n", 161 | " },\n", 162 | " {\n", 163 | " \"ISBN\": \"978-0132350884\",\n", 164 | " \"title\": \"Clean Code: A Handbook of Agile Software Craftsmanship\",\n", 165 | " \"author\": \"Robert C. Martin\",\n", 166 | " \"publisher\": \"Addison Wesley\",\n", 167 | " \"yearPublished\": 2008,\n", 168 | " \"price\": 40\n", 169 | " },\n", 170 | " {\n", 171 | " \"ISBN\": \"978-0321127426\",\n", 172 | " \"title\": \"Refactoring: Improving the Design of Existing Code\",\n", 173 | " \"author\": \"Martin Fowler\",\n", 174 | " \"publisher\": \"Pearson\",\n", 175 | " \"yearPublished\": 1999,\n", 176 | " \"price\": 55\n", 177 | " }\n", 178 | "])\n", 179 | "\"\"\"\n", 180 | "\n", 181 | "!mongo --quiet --eval '{query}'" 182 | ], 183 | "metadata": { 184 | "id": "BSsXBUfhaJp9" 185 | }, 186 | "execution_count": null, 187 | "outputs": [] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "source": [ 192 | "You can list the contents of the newly created collection by calling the <code>find()</code> function." 
193 | ], 194 | "metadata": { 195 | "id": "MYRMT54Xn1EB" 196 | } 197 | }, 198 | { 199 | "cell_type": "code", 200 | "source": [ 201 | "query = \"\"\"db.collection.find()\"\"\"\n", 202 | "\n", 203 | "!mongo --quiet --eval '{query}'" 204 | ], 205 | "metadata": { 206 | "id": "W0agPtBpNpc3" 207 | }, 208 | "execution_count": null, 209 | "outputs": [] 210 | }, 211 | { 212 | "cell_type": "markdown", 213 | "source": [ 214 | "# 3 Querying" 215 | ], 216 | "metadata": { 217 | "id": "OtseGmERr9jQ" 218 | } 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "source": [ 223 | "Find books published by \"Addison Wesley\", output only their ISBN, title and price, sort by price in descending order." 224 | ], 225 | "metadata": { 226 | "id": "GPj9LFGrsyyE" 227 | } 228 | }, 229 | { 230 | "cell_type": "code", 231 | "source": [ 232 | "query = \"\"\"\n", 233 | "db.collection.aggregate([\n", 234 | " {\n", 235 | " $match: {\n", 236 | " \"publisher\": \"Addison Wesley\"\n", 237 | " }\n", 238 | " },\n", 239 | " {\n", 240 | " $project: {\n", 241 | " _id: false,\n", 242 | " ISBN: true,\n", 243 | " title: true,\n", 244 | " price: true\n", 245 | " }\n", 246 | " },\n", 247 | " {\n", 248 | " $sort: {\n", 249 | " title: -1\n", 250 | " }\n", 251 | " }\n", 252 | "])\n", 253 | "\"\"\"\n", 254 | "\n", 255 | "!mongo --quiet --eval '{query}'" 256 | ], 257 | "metadata": { 258 | "id": "TWRu81nReEw2" 259 | }, 260 | "execution_count": null, 261 | "outputs": [] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "source": [ 266 | "# 4 Interpretation:\n", 267 | "\n", 268 | "- The `$match` stage filters the documents to include only those where the `publisher` is \"Addison Wesley\".\n", 269 | "- The `$project` stage reshapes each document to include only the `ISBN`, `title`, and `price` fields, excluding the `_id` field.\n", 270 | "- The `$sort` stage sorts the resulting documents by the `title` field in descending order (from Z to A)." 271 | ], 272 | "metadata": { 273 | "id": "wncBlxCctOPr" 274 | } 275 | } 276 | ] 277 | } -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # data-management -------------------------------------------------------------------------------- /Revision Note: Evaluation Metrics: Precision, Recall and F1-Measure.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # Evaluation Metrics: Precision, Recall and F1-Measure 5 | 6 | --- 7 | 8 | ## **1. Introduction and Key Concepts** 9 | 10 | ### **1.1 Overview of Precision, Recall, and F1-Measure** 11 | - **Precision:** Precision measures the proportion of true positive results among all the results predicted as positive. It focuses on the accuracy of positive predictions. 12 | - **Formula:** \( \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} \) 13 | 14 | - **Recall:** Recall measures the proportion of true positive results among all the relevant items. It indicates how well the model identifies relevant items. 15 | - **Formula:** \( \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \) 16 | 17 | - **F1-Measure:** The F1-Measure is the harmonic mean of precision and recall. It is useful when balancing precision and recall is critical, particularly in cases of class imbalance. 
18 | - **Formula:** \( \text{F1-Measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \) 19 | 20 | ### **1.2 Use Cases and Applications** 21 | - **Precision-Oriented Scenarios:** Precision is prioritized in scenarios where false positives carry a high cost, such as spam filters or critical medical diagnoses. 22 | - **Recall-Oriented Scenarios:** Recall is prioritized when missing relevant items is unacceptable, such as disease outbreak detection or legal document retrieval. 23 | - **F1-Measure Applications:** The F1-Measure is most useful when there’s a need to balance precision and recall, especially in imbalanced datasets. 24 | 25 | --- 26 | 27 | ## **2. Detailed Explanations and Examples** 28 | 29 | ### **2.1 Understanding the Trade-offs Between Precision and Recall** 30 | - **Precision-Focused:** Precision is vital in situations where the cost of false positives is high, such as marking a legitimate email as spam. 31 | - **Recall-Focused:** Recall is critical in cases where missing out on relevant items is highly detrimental, such as in emergency alerts. 32 | 33 | ### **2.2 Example Scenario** 34 | Imagine a system designed to identify relevant documents in a large archive: 35 | - **Precision Calculation Example:** If 25 out of 30 selected documents are relevant, precision is \( \frac{25}{30} = 0.83 \) (83%). 36 | - **Recall Calculation Example:** If there are 30 truly relevant documents and the system identifies 25, recall is \( \frac{25}{30} = 0.83 \) (83%). 37 | 38 | ### **2.3 Calculating F1-Measure** 39 | Given a precision of 83% and a recall of 83%, the F1-Measure is: 40 | \[ 41 | \text{F1-Measure} = 2 \times \frac{0.83 \times 0.83}{0.83 + 0.83} = 0.83 42 | \] 43 | 44 | ### **2.4 Worked Examples and Solutions** 45 | - **Sample Question:** A model classifies 100 transactions as fraudulent, with 80 of them being actual fraud (true positives) and 20 being incorrect (false positives). Calculate precision. 46 | - **Solution:** \( \text{Precision} = \frac{80}{80 + 20} = 0.80 \) (80%). 47 | 48 | - **Sample Question:** If the model missed 10 additional fraudulent transactions (false negatives), calculate recall. 49 | - **Solution:** \( \text{Recall} = \frac{80}{80 + 10} = 0.89 \) (89%). 50 | 51 | - **Sample Question:** Calculate the F1-Measure with the given precision (80%) and recall (89%). 52 | - **Solution:** 53 | \[ 54 | \text{F1-Measure} = 2 \times \frac{0.80 \times 0.89}{0.80 + 0.89} \approx 0.84 \text{ (84%)} 55 | \] 56 | 57 | --- 58 | 59 | ## **3. Common Mistakes and How to Avoid Them** 60 | 61 | ### **3.1 Overemphasis on Accuracy** 62 | - Accuracy can be misleading in imbalanced datasets where one class dominates (e.g., 95% accuracy may still be poor if the positive class is underrepresented). 63 | 64 | ### **3.2 Misinterpretation of Precision and Recall** 65 | - Precision measures the correctness of positive predictions, while recall measures the completeness of capturing true positives. Confusing these concepts can lead to incorrect conclusions. 66 | 67 | ### **3.3 Ignoring Context When Choosing Metrics** 68 | - Depending on the application, one metric may be more critical than the other. For instance, in healthcare, recall is often more important than precision, while in marketing campaigns, precision might take priority. 69 | 70 | --- 71 | 72 | ## **4. Must Know: Commonly Tested Concepts** 73 | 74 | ### **4.1 Precision vs. 
Recall Trade-offs** 75 | - Be able to discuss scenarios where precision is prioritized over recall and vice versa. 76 | 77 | ### **4.2 Understanding and Applying F1-Measure** 78 | - Know how to calculate and interpret the F1-Measure, especially in cases where there is a need to balance both precision and recall. 79 | 80 | ### **4.3 Examining Real-World Applications** 81 | - Be familiar with how these metrics apply in practical applications like spam detection, recommendation systems, and fraud detection. 82 | 83 | --- 84 | 85 | ## **5. Strengths and Weaknesses of Each Metric** 86 | 87 | ### **5.1 Strengths** 88 | - **Precision:** Provides a clear view of how many selected items are truly relevant. 89 | - **Recall:** Offers insight into how well relevant items are captured by the system. 90 | - **F1-Measure:** Balances precision and recall, making it ideal for cases where both false positives and false negatives are significant. 91 | 92 | ### **5.2 Weaknesses** 93 | - **Precision:** Can be misleading if recall is low; it doesn’t account for relevant items missed by the system. 94 | - **Recall:** High recall can inflate false positives if precision is low. 95 | - **F1-Measure:** Does not fully reflect the impact of highly skewed precision or recall values. 96 | 97 | --- 98 | 99 | ## **6. Important Points to Remember** 100 | 101 | - Precision is critical when false positives are costly. 102 | - Recall is essential when missing relevant results is unacceptable. 103 | - F1-Measure is useful for balanced evaluation, particularly in imbalanced datasets. 104 | 105 | --- 106 | -------------------------------------------------------------------------------- /Revision Note: JSON and MongoDB.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # JSON and MongoDB 5 | 6 | --- 7 | 8 | ## **1. JSON (JavaScript Object Notation)** 9 | 10 | ### **1.1 Introduction and Key Concepts** 11 | - **What is JSON?** 12 | JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is text-based, human-readable, and used widely in APIs and web applications for transmitting data. 13 | 14 | - **Key Features of JSON:** 15 | - **Data Structure:** JSON uses key-value pairs, arrays, and nested objects. 16 | - **Data Types Supported:** Strings, numbers, booleans, arrays, and objects. 17 | - **Flexibility:** JSON is schema-less, allowing dynamic and nested data structures. 18 | 19 | ### **1.2 JSON Structure and Syntax** 20 | - **Basic Structure:** 21 | - Objects are enclosed in curly braces `{}` and consist of key-value pairs. 22 | - Arrays are enclosed in square brackets `[]` and can contain multiple values or objects. 23 | 24 | **Example JSON Structure:** 25 | ```json 26 | { 27 | "name": "John Doe", 28 | "age": 29, 29 | "address": { 30 | "street": "123 Main St", 31 | "city": "New York" 32 | }, 33 | "phoneNumbers": [ 34 | "555-1234", 35 | "555-5678" 36 | ] 37 | } 38 | ``` 39 | 40 | ### **1.3 Detailed Example** 41 | **Scenario:** Representing a user profile using JSON. 
42 | 43 | **Example JSON:** 44 | ```json 45 | { 46 | "name": "Jane Smith", 47 | "email": "jane.smith@example.com", 48 | "skills": ["JavaScript", "Node.js", "MongoDB"], 49 | "experience": [ 50 | { 51 | "company": "Tech Solutions", 52 | "role": "Software Developer", 53 | "years": 3 54 | }, 55 | { 56 | "company": "Web Innovations", 57 | "role": "Full-Stack Developer", 58 | "years": 2 59 | } 60 | ] 61 | } 62 | ``` 63 | 64 | ### **1.4 Common Mistakes and How to Avoid Them** 65 | - **Improper Nesting:** Ensure nested objects and arrays are correctly structured. 66 | - **Inconsistent Data Types:** Avoid mixing data types within arrays (e.g., mixing strings and objects). 67 | - **Case Sensitivity:** JSON keys are case-sensitive; ensure consistency in key names. 68 | 69 | ### **1.5 Worked Examples and Solutions** 70 | - **Sample Question:** Create a JSON structure representing a library catalog with books containing details like title, author, and publication year. 71 | 72 | **Example Solution:** 73 | ```json 74 | { 75 | "library": [ 76 | { 77 | "title": "The Great Gatsby", 78 | "author": "F. Scott Fitzgerald", 79 | "year": 1925 80 | }, 81 | { 82 | "title": "1984", 83 | "author": "George Orwell", 84 | "year": 1949 85 | } 86 | ] 87 | } 88 | ``` 89 | 90 | ### **1.6 Must Know: Commonly Tested Concepts** 91 | - JSON syntax and structure, including proper nesting and formatting. 92 | - Differentiating between objects and arrays and when to use each. 93 | - Use cases for JSON, especially in API communication and data exchange. 94 | 95 | --- 96 | 97 | ## **2. MongoDB (NoSQL Database)** 98 | 99 | ### **2.1 Introduction and Key Concepts** 100 | - **What is MongoDB?** 101 | MongoDB is a document-oriented NoSQL database that stores data in flexible, JSON-like documents. It is ideal for dynamic and semi-structured data. 102 | 103 | - **Key Features of MongoDB:** 104 | - **Document Model:** Stores data as BSON (Binary JSON) with fields representing data attributes. 105 | - **Schema Flexibility:** Allows for varied document structures within the same collection. 106 | - **Indexing and Aggregation:** Supports advanced querying, indexing, and aggregation pipelines. 107 | 108 | ### **2.2 MongoDB Document Structure** 109 | - **Basic Structure:** 110 | - Documents are stored in collections (similar to tables in relational databases). 111 | - Documents are JSON-like objects containing fields and nested structures. 112 | 113 | **Example MongoDB Document:** 114 | ```json 115 | { 116 | "_id": "507f191e810c19729de860ea", 117 | "name": "John Doe", 118 | "age": 29, 119 | "address": { 120 | "street": "123 Main St", 121 | "city": "New York" 122 | }, 123 | "phoneNumbers": [ 124 | "555-1234", 125 | "555-5678" 126 | ] 127 | } 128 | ``` 129 | 130 | ### **2.3 Detailed Example** 131 | **Scenario:** Storing user profiles in a MongoDB collection. 132 | 133 | **Example Document:** 134 | ```json 135 | { 136 | "_id": "507f1f77bcf86cd799439011", 137 | "name": "Alice Brown", 138 | "email": "alice.brown@example.com", 139 | "skills": ["Python", "Data Science", "MongoDB"], 140 | "projects": [ 141 | { 142 | "name": "Recommendation Engine", 143 | "duration": "6 months" 144 | }, 145 | { 146 | "name": "Chatbot Development", 147 | "duration": "4 months" 148 | } 149 | ] 150 | } 151 | ``` 152 | 153 | ### **2.4 Common Mistakes and How to Avoid Them** 154 | - **Inconsistent Field Naming:** Ensure consistency in field names across documents to avoid data integrity issues. 
155 | - **Over-Reliance on Flexibility:** Although MongoDB allows schema flexibility, enforce structure where necessary to maintain data quality. 156 | - **Improper Indexing:** Without proper indexing, query performance can degrade as collections grow in size. 157 | 158 | ### **2.5 Worked Examples and Solutions** 159 | - **Sample Question:** Write a MongoDB query to find all users with a skill in "MongoDB". 160 | 161 | **Example Solution:** 162 | ```json 163 | db.users.find({ 164 | "skills": "MongoDB" 165 | }) 166 | ``` 167 | 168 | - **Sample Question:** Query documents where the city is "New York" and the age is greater than 25. 169 | 170 | **Example Solution:** 171 | ```json 172 | db.users.find({ 173 | "address.city": "New York", 174 | "age": { "$gt": 25 } 175 | }) 176 | ``` 177 | 178 | ### **2.6 Must Know: Commonly Tested Concepts** 179 | - Understanding MongoDB’s document model and schema flexibility. 180 | - Writing MongoDB queries, including handling nested fields and arrays. 181 | - Recognizing scenarios where MongoDB’s schema-less design is beneficial. 182 | 183 | --- 184 | 185 | ## **3. Strengths and Weaknesses of JSON and MongoDB** 186 | 187 | ### **3.1 Strengths** 188 | - **Flexibility:** JSON and MongoDB’s schema-less design is ideal for handling diverse and dynamic data. 189 | - **Ease of Use:** JSON’s simple structure makes it a preferred format for API communication. 190 | - **Scalability:** MongoDB is well-suited for distributed, large-scale applications with evolving data requirements. 191 | 192 | ### **3.2 Weaknesses** 193 | - **Inconsistent Data:** Schema flexibility can lead to data inconsistencies if not managed properly. 194 | - **Limited Support for Complex Relationships:** MongoDB and JSON are less effective for handling highly relational data with complex joins. 195 | - **Performance Overhead:** Querying deeply nested structures or large arrays in MongoDB can be less efficient compared to structured relational queries. 196 | 197 | ### **3.3 Worked Examples and Solutions** 198 | - **Sample Question:** Discuss the advantages and disadvantages of using MongoDB in an e-commerce application compared to a relational database. 199 | 200 | ### **3.4 Key Takeaways** 201 | - JSON and MongoDB offer significant flexibility but require careful management to avoid data inconsistencies. 202 | - MongoDB’s schema-less model is particularly effective in scenarios where the data structure is expected to change over time. 203 | 204 | ### **3.5 Must Know: Commonly Tested Concepts** 205 | - Knowing when to use MongoDB instead of a relational database, particularly in applications with evolving schemas. 206 | - Understanding JSON’s role in data interchange and how it integrates with modern web services. 207 | 208 | --- 209 | -------------------------------------------------------------------------------- /Revision Note: Linked Data Model and SPARQL.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # Linked Data Model and SPARQL 5 | 6 | --- 7 | 8 | ## **1. RDF (Resource Description Framework) and Data Models** 9 | 10 | ### **1.1 Introduction and Key Concepts** 11 | - **What is RDF?** 12 | RDF (Resource Description Framework) is a framework for representing information about resources in a graph structure using triples: **Subject - Predicate - Object**. 13 | 14 | - **RDF Triples:** 15 | - **Subject:** The resource being described (e.g., `ex:Person1`). 16 | - **Predicate:** The property or characteristic (e.g., `ex:hasName`). 
17 | - **Object:** The value or another resource (e.g., `"John Doe"` or `ex:NewYork`). 18 | 19 | ### **1.2 Serialization Formats** 20 | - **Common Formats:** 21 | - **Turtle:** A compact, human-readable RDF format. 22 | - **RDF/XML:** An XML-based format, useful for integrating RDF with XML workflows. 23 | - **JSON-LD:** A JSON format optimized for Linked Data. 24 | 25 | **Example in Turtle:** 26 | ```turtle 27 | @prefix ex: <http://example.org/> . 28 | 29 | ex:Person1 ex:hasName "John Doe" ; 30 | ex:hasBirthPlace ex:NewYork . 31 | ``` 32 | 33 | ### **1.3 Detailed Example** 34 | **Scenario:** Representing a Person in RDF. 35 | 36 | **Example RDF Data in Turtle:** 37 | ```turtle 38 | @prefix ex: <http://example.org/> . 39 | 40 | ex:Person1 ex:hasName "John Doe" ; 41 | ex:hasBirthPlace ex:NewYork ; 42 | ex:hasOccupation "Musician" . 43 | ``` 44 | 45 | ### **1.4 Common Mistakes and How to Avoid Them** 46 | - **Confusing RDF with XML:** RDF is a data model, while XML is just one possible syntax for RDF serialization. 47 | - **Inconsistent Use of URIs:** Ensure consistent and meaningful URIs across datasets to avoid conflicts and improve interoperability. 48 | 49 | ### **1.5 Worked Examples and Solutions** 50 | - **Sample Question:** Given a set of RDF triples, identify the serialization format and convert between Turtle and RDF/XML. 51 | 52 | ### **1.6 Must Know: Commonly Tested Concepts** 53 | - RDF serialization formats like Turtle, RDF/XML, and JSON-LD are frequently tested. 54 | - Understand how to create and interpret RDF triples with subject-predicate-object structures. 55 | 56 | --- 57 | 58 | ## **2. Linked Data and Ontologies** 59 | 60 | ### **2.1 Introduction to Linked Data and Ontologies** 61 | - **Linked Data Principles:** Linked Data uses URIs to identify resources, making them accessible via HTTP and linking them to other resources. 62 | - **Ontologies Overview:** Ontologies provide structured vocabularies that define classes, properties, and relationships for specific domains. Examples include: 63 | - **Dublin Core:** Used for metadata, with properties like `dcterms:title` and `dcterms:creator`. 64 | - **FOAF (Friend of a Friend):** Used for describing people and their relationships, with properties like `foaf:name` and `foaf:knows`. 65 | 66 | ### **2.2 Detailed Explanation and Examples** 67 | - **Example Ontologies:** 68 | - **FOAF:** Commonly used in social networks to describe relationships between people. 69 | - **Dublin Core:** Widely used for metadata in documents and publications. 70 | 71 | ### **2.3 Worked Examples and Solutions** 72 | - **Scenario:** Using Linked Data to describe a social network with RDF and FOAF. 73 | 74 | **Example RDF using FOAF:** 75 | ```turtle 76 | @prefix foaf: <http://xmlns.com/foaf/0.1/> . 77 | 78 | ex:Person1 a foaf:Person ; 79 | foaf:name "Alice" ; 80 | foaf:knows ex:Person2 . 81 | 82 | ex:Person2 a foaf:Person ; 83 | foaf:name "Bob" . 84 | ``` 85 | 86 | ### **2.4 Common Mistakes and How to Avoid Them** 87 | - **Overcomplicating Ontologies:** Use established vocabularies like FOAF or Dublin Core instead of creating custom terms unnecessarily. 88 | 89 | ### **2.5 Important Points to Remember** 90 | - Standardized ontologies enhance data interoperability by using widely recognized vocabularies, making datasets easier to integrate across platforms. 91 | 92 | ### **2.6 Must Know: Commonly Tested Concepts** 93 | - Applying ontologies like FOAF and Dublin Core to RDF scenarios. 
94 | - Understanding the principles of Linked Data and how they enable interconnected datasets. 95 | 96 | --- 97 | 98 | ## **3. SPARQL Querying** 99 | 100 | ### **3.1 Overview of SPARQL Concepts** 101 | - **What is SPARQL?** 102 | SPARQL is the query language for RDF data, enabling selection, filtering, and manipulation of RDF triples. 103 | 104 | ### **3.2 Triple Pattern Matching and Syntax** 105 | - SPARQL queries involve matching triple patterns against RDF graphs. 106 | - **Basic SPARQL Structure:** 107 | - `SELECT`: Defines what variables to return. 108 | - `WHERE`: Specifies the graph pattern to match. 109 | 110 | **Example SPARQL Query:** 111 | ```sparql 112 | PREFIX ex: <http://example.org/> 113 | 114 | SELECT ?name 115 | WHERE { 116 | ?person ex:hasBirthPlace ex:NewYork ; 117 | ex:hasName ?name . 118 | } 119 | ``` 120 | 121 | ### **3.3 Worked Examples and Solutions** 122 | - **Sample Question:** Write a SPARQL query to retrieve the names of all musicians classified as sopranos. 123 | 124 | **Solution:** 125 | ```sparql 126 | PREFIX ex: <http://example.org/> 127 | 128 | SELECT ?name 129 | WHERE { 130 | ?person a ex:Soprano ; 131 | ex:hasName ?name . 132 | } 133 | ``` 134 | 135 | ### **3.4 Common Mistakes and How to Avoid Them** 136 | - **Incorrect Prefix Usage:** Ensure the correct namespaces are defined for each prefix. 137 | - **Overusing Optional Clauses:** Be cautious with `OPTIONAL`, as it can lead to incomplete results if misused. 138 | 139 | ### **3.5 Key Takeaways** 140 | - SPARQL’s power lies in its ability to navigate RDF graphs and extract specific information based on complex conditions. 141 | 142 | ### **3.6 Must Know: Commonly Tested Concepts** 143 | - Writing SPARQL queries that involve triple pattern matching, filters, and optional clauses. 144 | - Understanding the basics of RDF graphs and how SPARQL navigates them. 145 | 146 | --- 147 | 148 | ## **4. Strengths and Weaknesses of Linked Data Model** 149 | 150 | ### **4.1 Strengths** 151 | - **Flexibility:** RDF’s triple model can represent complex relationships. 152 | - **Interoperability:** Linked Data principles promote consistent and accessible data across diverse domains. 153 | - **Expressiveness:** SPARQL provides powerful querying capabilities for RDF datasets. 154 | 155 | ### **4.2 Weaknesses** 156 | - **Complexity:** RDF and SPARQL can be challenging to learn and implement. 157 | - **Performance Overheads:** Querying large RDF datasets can be less efficient compared to traditional databases. 158 | 159 | ### **4.3 Worked Examples and Solutions** 160 | - **Sample Question:** Discuss the pros and cons of using RDF and Linked Data in a knowledge management system compared to relational databases. 161 | 162 | ### **4.4 Key Takeaways** 163 | - RDF and Linked Data are ideal for representing interconnected data but come with challenges related to complexity and performance. 164 | 165 | ### **4.5 Must Know: Commonly Tested Concepts** 166 | - Recognizing when RDF and Linked Data are preferable over traditional models. 167 | - Understanding the trade-offs between flexibility and performance when using RDF. 168 | 169 | --- 170 | 171 | -------------------------------------------------------------------------------- /Revision Note: Relational Databases.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # Relational Databases 5 | 6 | --- 7 | 8 | ## **1. 
Entity-Relationship (E/R) Diagrams and Relational Models** 9 | 10 | ### **1.1 Introduction and Key Concepts** 11 | - **Entities, Attributes, and Relationships:** Understand the basic building blocks of E/R diagrams, such as entities (e.g., Students, Courses), attributes (e.g., Name, Date of Birth), and relationships (e.g., Enrolls In). 12 | - **Cardinality and Participation:** Be able to express one-to-one, one-to-many, and many-to-many relationships using appropriate notations. 13 | - **Types of Relationships:** Consider both identifying and non-identifying relationships and how they map to foreign keys in relational schemas. 14 | 15 | ### **1.2 Converting E/R Diagrams into Relational Models** 16 | - **Mapping Relationships:** Learn the step-by-step process of converting E/R diagrams into relational schemas, paying special attention to resolving many-to-many relationships using associative entities (join tables). 17 | - **Normalization Considerations:** Ensure that the resulting relational schema adheres to at least 3NF by eliminating redundant attributes and functional dependencies. 18 | 19 | ### **1.3 Detailed Examples** 20 | - **Example 1: E/R Diagram to Relational Model Conversion (Library System):** 21 | - Draw the E/R diagram for a library system with entities like Books, Authors, and Borrowers, and convert it into a relational schema. 22 | - Discuss how to handle many-to-many relationships like "Books written by multiple Authors" using join tables. 23 | 24 | - **Example 2: E/R Diagram for an E-commerce Platform:** 25 | - Create an E/R diagram with entities like Products, Customers, and Orders. 26 | - Convert the E/R diagram into relational tables, ensuring proper mapping of relationships like “Customers place multiple Orders.” 27 | 28 | ### **1.4 Common Mistakes and How to Avoid Them** 29 | - **Misrepresenting Relationships:** Avoid incorrect cardinality mappings that lead to poor relational designs (e.g., failing to recognize that a relationship is many-to-many). 30 | - **Inconsistent Naming Conventions:** Ensure consistent attribute naming across entities to avoid confusion during implementation. 31 | 32 | ### **1.5 Worked Examples and Solutions** 33 | - **Sample Question:** "Draw an E/R diagram for a gym management system and convert it into a relational schema." 34 | - **Solution:** Step-by-step breakdown from identifying entities, attributes, and relationships to designing the final relational schema with primary and foreign keys. 35 | 36 | ### **1.6 Must Know: Commonly Tested Concepts** 37 | - **Converting E/R Diagrams to Relational Schemas:** Be prepared to draw an E/R diagram and convert it into a normalized relational schema with proper primary and foreign keys. 38 | - **Handling Many-to-Many Relationships:** Expect questions that require you to resolve M:M relationships using junction tables. 39 | - **Normalizing the Schema:** Often, you’ll need to ensure that the relational schema is normalized up to at least 3NF. 40 | - **Identifying and Correcting Errors in E/R Diagrams:** Watch out for questions that ask you to fix common mistakes in faulty diagrams. 41 | 42 | --- 43 | 44 | ## **2. Normalization (1NF, 2NF, 3NF, BCNF)** 45 | 46 | ### **2.1 Introduction to Normalization** 47 | - **Purpose of Normalization:** Understand the need for normalization to reduce redundancy and avoid update anomalies. 48 | - **Functional Dependencies:** Learn to identify dependencies between attributes that determine how data should be organized. 
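**Quick Illustration (Functional Dependencies in Practice):** The following is a minimal, hedged sketch; the `OrdersFlat`, `Customer`, and `Orders` tables are invented for this illustration rather than taken from any exam paper. The functional dependency `CustomerID -> CustomerName, CustomerAddress` is what makes the flat design redundant, and splitting on that dependency is exactly the step normalization formalizes:

```sql
-- Unnormalised design: customer details repeat on every order row,
-- because the dependency CustomerID -> CustomerName, CustomerAddress holds.
CREATE TABLE OrdersFlat (
    OrderID         INT PRIMARY KEY,
    OrderDate       DATE,
    CustomerID      INT,
    CustomerName    VARCHAR(100),
    CustomerAddress VARCHAR(200)
);

-- 3NF decomposition: each non-key attribute depends only on the key of its own table.
CREATE TABLE Customer (
    CustomerID      INT PRIMARY KEY,
    CustomerName    VARCHAR(100),
    CustomerAddress VARCHAR(200)
);

CREATE TABLE Orders (
    OrderID    INT PRIMARY KEY,
    OrderDate  DATE,
    CustomerID INT,
    FOREIGN KEY (CustomerID) REFERENCES Customer(CustomerID)
);
```

Changing a customer's address now touches one `Customer` row instead of every order they ever placed, which is precisely the update anomaly that normalization is meant to remove.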
49 | 50 | ### **2.2 Step-by-Step Guide to Normalization** 51 | - **First Normal Form (1NF):** All attributes should contain atomic values, and there should be no repeating groups. 52 | - **Second Normal Form (2NF):** Eliminate partial dependencies by ensuring that all non-key attributes are fully dependent on the primary key. 53 | - **Third Normal Form (3NF):** Remove transitive dependencies so that non-key attributes do not depend on other non-key attributes. 54 | - **Boyce-Codd Normal Form (BCNF):** Strengthens 3NF by requiring that every determinant (the left-hand side of a functional dependency) is a candidate key; it catches the edge cases 3NF misses, typically tables with overlapping composite candidate keys. 55 | 56 | ### **2.3 Worked Examples and Solutions** 57 | - **Example 1: Normalizing a Product Database:** 58 | - Start from an unnormalized table containing product details and work through each normal form, explaining how dependencies are resolved at each stage. 59 | - **Example 2: CRM System Normalization:** 60 | - Normalize a customer database with attributes like CustomerID, Name, Address, and Orders to achieve 3NF. 61 | 62 | ### **2.4 Common Pitfalls and How to Avoid Them** 63 | - **Over-Normalization:** Avoid splitting tables too much, which can lead to performance issues due to excessive joins. 64 | - **Misinterpreting Dependencies:** Be clear on the difference between partial and transitive dependencies. 65 | 66 | ### **2.5 Important Points to Remember** 67 | - **1NF through BCNF:** Know the conditions for each normal form, especially how 3NF removes transitive dependencies and how BCNF tightens the rule on determinants. 68 | - **Redundancy vs. Complexity:** Find the balance between minimizing redundancy and maintaining query efficiency. 69 | 70 | ### **2.6 Must Know: Commonly Tested Concepts** 71 | - **Identifying Partial and Transitive Dependencies:** Questions often involve taking an unnormalized table and breaking it down step-by-step into 1NF, 2NF, and 3NF. 72 | - **Handling Composite Keys:** Be familiar with scenarios where composite keys lead to partial dependencies and how to resolve them. 73 | - **Normalization Justifications:** Be ready to explain, in detail, why a schema is normalized or why it isn’t, with clear reasoning behind each step. 74 | 75 | --- 76 | 77 | ## **3. SQL Queries and Database Operations** 78 | 79 | ### **3.1 Overview of SQL Concepts** 80 | - **Basic Commands (SELECT, INSERT, UPDATE, DELETE):** Get familiar with the structure and syntax of these commands. 81 | - **JOIN Types:** 82 | - **INNER JOIN:** Retrieves records that have matching values in both tables. 83 | - **LEFT JOIN (or LEFT OUTER JOIN):** Retrieves all records from the left table, and matched records from the right table. 84 | - **RIGHT JOIN (or RIGHT OUTER JOIN):** Retrieves all records from the right table, and matched records from the left table. 85 | - **CROSS JOIN:** Produces a Cartesian product of the two tables. 86 | 87 | ### **3.2 Transaction Management and Isolation Levels (Optional)** 88 | - **ACID Properties:** Ensure that transactions are atomic, consistent, isolated, and durable. 89 | - **Isolation Levels:** 90 | - **Read Uncommitted, Read Committed, Repeatable Read, Serializable:** Understand how these levels balance consistency and concurrency. 91 | 92 | ### **3.3 Indexing Strategies (Optional)** 93 | - **Clustered vs. Non-Clustered Indexes:** Understand how indexes can improve query performance, especially for large datasets. 94 | - **Query Optimization:** Learn best practices for optimizing SQL queries, including when to use indexes and how to minimize costly operations like full table scans. 
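**Quick Illustration (Secondary Index):** A hedged sketch in MySQL syntax, reusing the hypothetical `Orders` table from the normalization illustration in section 2.1; the index name is likewise made up:

```sql
-- Secondary (non-clustered) index to support frequent lookups by customer;
-- in MySQL/InnoDB the PRIMARY KEY already serves as the clustered index.
CREATE INDEX idx_orders_customer ON Orders (CustomerID);

-- EXPLAIN reports whether the optimizer uses the index or falls back to a full table scan.
EXPLAIN
SELECT OrderID, OrderDate
FROM Orders
WHERE CustomerID = 42;
```

If the filtered column has no suitable index, the plan typically shows a full scan of `Orders`, which is the costly operation the bullet above warns against.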
95 | 96 | ### **3.4 Common Pitfalls in SQL** 97 | - **Improper Use of Joins:** Be cautious about using the wrong type of join, which can lead to incorrect results. 98 | - **Overlooking NULL Handling:** Be mindful of how SQL handles NULLs in joins and conditions. 99 | 100 | ### **3.5 Worked Examples and Solutions** 101 | - **Example 1: Write an SQL query to retrieve customer orders and their details.** 102 | - Solution: Use JOINs and aggregate functions like `GROUP BY` and `HAVING` to structure your query effectively. 103 | - **Example 2: SQL Query Optimization Case Study:** 104 | - Optimize a slow query by analyzing its execution plan and applying indexing strategies. 105 | 106 | ### **3.6 Key Takeaways** 107 | - **JOIN Mastery:** Most SQL questions focus on your ability to effectively join multiple tables and filter results. 108 | - **Transaction and Indexing Knowledge:** Although less frequently tested, understanding transactions and indexes can help you score in more complex questions. 109 | 110 | ### **3.7 Must Know: Commonly Tested Concepts** 111 | - **JOIN Operations:** Be ready to use INNER JOIN, LEFT JOIN, and RIGHT JOIN, especially in scenarios where you need to include or exclude data based on matching conditions. 112 | - **GROUP BY and Aggregation:** Expect to write queries involving grouping and aggregate calculations like `SUM`, `AVG`, and `COUNT`. 113 | - **Handling NULL Values in Queries:** Be prepared to manage NULL values in filtering conditions and aggregate functions. 114 | - **Query Optimization Considerations:** Understand when and how to apply indexing for performance improvements. 115 | 116 | --- 117 | 118 | ## **4. Data Modeling and Optimization** 119 | 120 | ### **4.1 Introduction to Data Modeling** 121 | - **Conceptual, Logical, and Physical Models:** Understand the progression from high-level conceptual models (e.g., E/R diagrams) to detailed physical implementations (e.g., relational schemas). 122 | - **Data Modeling Tools:** Familiarize yourself with tools like Lucidchart, ERDPlus, or even paper and pencil for designing diagrams. 123 | 124 | ### **4.2 Best Practices for Database Design** 125 | - **Scalability and Maintainability:** Design your schema to accommodate future growth without requiring major overhauls. 126 | - **Referential Integrity:** Use foreign keys and constraints to maintain data consistency across related tables. 127 | 128 | ### **4.3 Optimization Techniques (Optional)** 129 | - **Partitioning and Sharding:** Break down large tables into smaller, manageable pieces for improved performance. 130 | - **Denormalization Considerations:** Sometimes, denormalization is necessary for performance optimization, especially in read-heavy applications. 131 | 132 | ### **4.4 Common Pitfalls in Data Modeling** 133 | - **Overcomplicating the Schema:** Avoid adding unnecessary entities or attributes that don’t serve a clear purpose. 134 | - **Ignoring Business Rules:** Your schema should accurately reflect real-world business rules, like mandatory fields or unique constraints. 135 | 136 | ### **4.5 Worked Examples and Solutions** (Continued) 137 | - **Example 1: Designing a Database for an E-commerce Platform:** 138 | - Create a schema for an online store considering entities like `Customers`, `Products`, `Orders`, and `Payments`. 139 | - Explain how to handle relationships like "Customers place multiple Orders" and "Orders contain multiple Products". 
140 | - Discuss the importance of indexing frequently queried columns like `CustomerID` and `ProductID` for better performance. 141 | 142 | - **Example 2: Optimizing a Data Warehouse Schema:** 143 | - Discuss how to design a star schema for a reporting system, considering facts (e.g., `Sales`) and dimensions (e.g., `Time`, `Product`, `Store`). 144 | - Address the trade-offs between normalization and performance, especially when dealing with large datasets in analytical queries. 145 | 146 | ### **4.6 Key Takeaways** 147 | - **Balance Complexity and Simplicity:** Your schema should be simple enough for easy maintenance but detailed enough to meet all business requirements. 148 | - **Normalization vs. Performance:** While normalization reduces redundancy, sometimes denormalization is necessary for performance reasons, especially in data warehouses. 149 | 150 | ### **4.7 Must Know: Commonly Tested Concepts** 151 | - **Designing Database Schemas for Real-World Scenarios:** Be prepared to design and normalize a schema based on a given business case. Common scenarios include inventory management systems, customer relationship management (CRM), or e-commerce platforms. 152 | - **Balancing Normalization and Performance Needs:** Exams often require you to justify when and why denormalization might be necessary, particularly in high-performance scenarios. 153 | - **Indexing Strategies and Constraints:** While indexing is less frequently tested, it’s still critical to understand when to apply indexes and how to maintain referential integrity using foreign keys and constraints. 154 | 155 | --- 156 | 157 | ## **5. Strengths and Weaknesses of the Relational Model** 158 | 159 | ### **5.1 Strengths of the Relational Model** 160 | 1. **Data Integrity and Consistency:** 161 | - Relational databases enforce integrity constraints (e.g., primary keys, foreign keys) to ensure consistency. 162 | - Example: In a banking application, foreign keys ensure that every `Transaction` is linked to an existing `Account`. 163 | 164 | 2. **Standardization and SQL:** 165 | - SQL is a well-established and widely supported standard, making relational databases compatible with many tools and technologies. 166 | - Example: You can use SQL to query, update, and manage data across different platforms, from MySQL to Oracle. 167 | 168 | 3. **Normalization and Reduced Redundancy:** 169 | - Normalization allows data to be stored efficiently with minimal redundancy, reducing the risk of update anomalies. 170 | - Example: A normalized customer database avoids repeating customer addresses across multiple orders. 171 | 172 | 4. **ACID Compliance for Transaction Reliability:** 173 | - The ACID properties (Atomicity, Consistency, Isolation, Durability) ensure reliable transactions even in the event of system failures. 174 | - Example: In an e-commerce system, a transaction that deducts payment and updates inventory is guaranteed to be either fully completed or fully rolled back. 175 | 176 | ### **5.2 Weaknesses of the Relational Model** 177 | 5. **Overhead of Strict Consistency:** 178 | - The strict consistency model enforced by relational databases (through ACID properties) can introduce latency and limit scalability, especially in distributed environments. 179 | - Example: In globally distributed applications (e.g., international e-commerce platforms), ensuring consistency across geographically dispersed databases can lead to delays. 
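**Quick Illustration (ACID in One Transaction):** The funds-transfer scenario behind the strengths in 5.1 and the weakness in 5.2 reduces to a few lines of SQL. In this minimal sketch the `Account` table, its columns, and the account numbers are assumptions made only for the illustration; it shows the unit of work that ACID makes all-or-nothing, and committing exactly this kind of change consistently across geographically dispersed replicas is where the latency mentioned above comes from:

```sql
START TRANSACTION;

UPDATE Account SET Balance = Balance - 100 WHERE AccNo = 1001;
UPDATE Account SET Balance = Balance + 100 WHERE AccNo = 1002;

-- Atomicity and durability: COMMIT makes both updates permanent together ...
COMMIT;
-- ... whereas a ROLLBACK issued before COMMIT would leave both balances unchanged.
```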
180 | 181 | ### **5.3 Must Know: Commonly Tested Concepts** 182 | - **Justifying the Use of a Relational Model:** Expect questions that ask you to explain why a relational model is suitable for a given application. Focus on data integrity, consistency, and the power of SQL. 183 | - **Discussing the Drawbacks of Relational Models in Big Data Scenarios:** Be prepared to highlight scalability challenges and schema rigidity when discussing scenarios that involve large, dynamic, or semi-structured datasets. 184 | - **Comparing Relational Models with Alternatives (e.g., NoSQL):** Some questions might ask you to compare the relational model with NoSQL options, focusing on when to use each type depending on the application’s requirements. 185 | 186 | --- 187 | -------------------------------------------------------------------------------- /Revision Note: XML and XPATH.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # XML and XPATH 5 | 6 | --- 7 | 8 | ## **1. XML Overview and Key Concepts** 9 | 10 | ### **1.1 Introduction to XML** 11 | - **What is XML?** 12 | XML (Extensible Markup Language) is a flexible, text-based format used for data storage and exchange. It is widely used due to its platform independence and human readability. 13 | 14 | - **Key Features of XML:** 15 | - **Self-Descriptive:** XML uses tags to describe data in a hierarchical structure. 16 | - **Platform-Independent:** It’s supported across various systems and is ideal for data interchange. 17 | - **Extensible:** You can create custom tags to meet specific needs. 18 | 19 | ### **1.2 Structure and Syntax** 20 | - **Elements and Attributes:** 21 | - **Elements:** Define the primary content within XML documents (`<movie>`, `<title>`). 22 | - **Attributes:** Provide additional information about elements (`<actor role="Jebediah Leland">`). 23 | 24 | - **XML Syntax Rules:** 25 | - Tags are case-sensitive. 26 | - Elements must be properly nested and closed. 27 | - The XML document must have a single root element. 28 | 29 | ### **1.3 Detailed Example** 30 | **Scenario:** Representing a Movie Database in XML. 31 | 32 | **Example XML:** 33 | ```xml 34 | <movie> 35 | <title>Citizen Kane 36 | 37 | Orson Welles 38 | Joseph Cotton 39 | 40 | 41 | ``` 42 | 43 | ### **1.4 Common Mistakes and How to Avoid Them** 44 | - **Case Sensitivity Issues:** Ensure consistent use of tag names (`` vs. `<title>`). 45 | - **Improper Nesting:** Avoid errors like closing parent elements before child elements are fully closed. 46 | 47 | ### **1.5 Worked Examples and Solutions** 48 | - **Sample Question:** Create an XML structure for a book library, ensuring it is well-formed and includes attributes for book details like `ISBN`, `author`, and `genre`. 49 | 50 | ### **1.6 Must Know: Commonly Tested Concepts** 51 | - Understanding and applying XML syntax rules. 52 | - Differentiating between elements and attributes and when to use each. 53 | - Recognizing the difference between well-formed and valid XML documents. 54 | 55 | --- 56 | 57 | ## **2. XML Schema and Validation** 58 | 59 | ### **2.1 Introduction to XML Schema** 60 | - **Purpose of XML Schemas:** XML schemas define the structure, data types, and relationships within an XML document, ensuring consistency and validation. 61 | - **Types of XML Schemas:** 62 | - **DTD (Document Type Definition):** Defines structure but lacks support for data types. 63 | - **XSD (XML Schema Definition):** More powerful, supporting data types, constraints, and namespaces. 
64 | - **Relax NG:** A simpler, more flexible schema language often used for non-strict validation. 65 | 66 | ### **2.2 Validating XML Documents** 67 | - **Well-Formed vs. Valid XML:** 68 | - **Well-Formed XML:** Adheres to syntax rules like correct tag nesting and closing. 69 | - **Valid XML:** Must be well-formed and also conform to a specified schema (e.g., XSD or Relax NG). 70 | 71 | ### **2.3 Detailed Example** 72 | **Scenario:** Validating an XML document describing books using XSD. 73 | 74 | **Example XSD:** 75 | ```xml 76 | <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"> 77 | <xs:element name="library"> 78 | <xs:complexType> 79 | <xs:sequence> 80 | <xs:element name="book" maxOccurs="unbounded"> 81 | <xs:complexType> 82 | <xs:sequence> 83 | <xs:element name="title" type="xs:string"/> 84 | <xs:element name="author" type="xs:string"/> 85 | <xs:element name="isbn" type="xs:string"/> 86 | </xs:sequence> 87 | </xs:complexType> 88 | </xs:element> 89 | </xs:sequence> 90 | </xs:complexType> 91 | </xs:element> 92 | </xs:schema> 93 | ``` 94 | 95 | ### **2.4 Common Pitfalls and How to Avoid Them** 96 | - **Confusing Well-Formed with Valid XML:** Ensure that valid XML adheres to both syntax and schema rules. 97 | - **Forgetting Required Elements:** Schemas often define required elements that must be present in a valid XML document. 98 | 99 | ### **2.5 Important Points to Remember** 100 | - XML schemas enforce structure and validation, ensuring consistent data formats. 101 | - XSD is commonly used in exams due to its support for complex data types and constraints. 102 | 103 | ### **2.6 Must Know: Commonly Tested Concepts** 104 | - Differentiating between well-formed and valid XML. 105 | - Writing and interpreting simple XSD schemas. 106 | - Understanding the structure and purpose of Relax NG and how it differs from XSD. 107 | 108 | --- 109 | 110 | ## **3. XPath: Querying XML Documents** 111 | 112 | ### **3.1 Introduction to XPath** 113 | - **What is XPath?** 114 | XPath is a language for navigating and querying XML documents using path expressions to select nodes, attributes, and values. 115 | 116 | ### **3.2 XPath Syntax and Expressions** 117 | - **Basic XPath Syntax:** 118 | - `/` selects the root or direct child elements. 119 | - `//` selects elements anywhere in the document. 120 | - `[@attribute='value']` filters elements based on attribute values. 121 | 122 | - **Detailed Example:** 123 | - Selecting all titles in a movie database: `//title` 124 | - Selecting actors with a specific role: `//actor[@role='Jebediah Leland']` 125 | 126 | ### **3.3 Worked Examples and Solutions** 127 | - **Sample Question:** Write an XPath expression to extract all authors from an XML document representing a library. 128 | - **Solution:** `//author` selects all `<author>` elements regardless of their depth in the document. 129 | 130 | ### **3.4 Common Mistakes and How to Avoid Them** 131 | - **Misinterpreting `/` and `//`:** Remember that `/` selects direct children, while `//` selects any descendant nodes. 132 | - **Overselecting Nodes:** Be specific in queries to avoid retrieving unintended elements. 133 | 134 | ### **3.5 Key Takeaways** 135 | - XPath is essential for precise data selection and filtering in XML documents. 136 | - Learn to navigate hierarchical data using XPath, especially when dealing with complex queries. 137 | 138 | ### **3.6 Must Know: Commonly Tested Concepts** 139 | - Writing XPath queries for selecting elements, attributes, and filtering based on conditions. 
140 | - Understanding how to navigate XML hierarchies using different expressions like `/`, `//`, and `@attribute`. 141 | 142 | --- 143 | 144 | ## **4. XML Data Models and Hierarchical Structure** 145 | 146 | ### **4.1 Understanding XML Data Models** 147 | - **Hierarchical Nature:** XML documents are structured as trees, making them suitable for representing nested data like organizational charts, product categories, and document sections. 148 | 149 | ### **4.2 Strengths and Weaknesses of XML Data Models** 150 | - **Strengths:** 151 | - **Flexibility and Extensibility:** Custom tags and hierarchical representation make XML versatile. 152 | - **Platform Independence:** XML is readable across different platforms and systems. 153 | 154 | - **Weaknesses:** 155 | - **Verbosity:** XML’s text-heavy nature leads to larger file sizes. 156 | - **Complexity in Querying:** XPath and XQuery can be complex for deeply nested structures. 157 | - **Performance Overhead:** XML is less efficient compared to binary formats or relational databases for large-scale applications. 158 | 159 | ### **4.3 Worked Examples and Solutions** 160 | - **Sample Question:** Create an XML structure to represent an organization’s departments and employees, ensuring a logical hierarchy and proper attribute usage. 161 | - **Solution:** 162 | ```xml 163 | <organization> 164 | <department name="Sales"> 165 | <employee> 166 | <name>John Doe</name> 167 | <role>Manager</role> 168 | </employee> 169 | <employee> 170 | <name>Jane Smith</name> 171 | <role>Sales Executive</role> 172 | </employee> 173 | </department> 174 | </organization> 175 | ``` 176 | 177 | ### **4.4 Common Pitfalls and How to Avoid Them** 178 | - **Overcomplicating the Structure:** Avoid excessive nesting unless necessary for data representation. 179 | - **Redundancy in Tags:** Minimize redundancy by using attributes when appropriate instead of repetitive tags. 180 | 181 | ### **4.5 Key Takeaways** 182 | - XML’s hierarchical structure is both its strength and its limitation. It excels at representing nested data but can be inefficient for simple, flat datasets. 183 | - Use XML for scenarios where data needs to be organized hierarchically, and flexibility is important. 184 | 185 | ### **4.6 Must Know: Commonly Tested Concepts** 186 | - Discussing when XML is preferred over other data models (e.g., JSON, relational). 187 | - Recognizing the pros and cons of XML’s hierarchical model in real-world applications. 188 | - Understanding the balance between flexibility and performance in XML-based systems. 
189 | 190 | --- 191 | -------------------------------------------------------------------------------- /SPARQL - Sep 2022.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "include_colab_link": true 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": "python3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "view-in-github", 20 | "colab_type": "text" 21 | }, 22 | "source": [ 23 | "<a href=\"https://colab.research.google.com/github/sreent/data-management-intro/blob/main/SPARQL%20-%20Sep%202022.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" 24 | ] 25 | }, 26 | { 27 | "metadata": { 28 | "id": "vM6ta952S2z2" 29 | }, 30 | "cell_type": "markdown", 31 | "source": [ 32 | "# 1. Introduction to RDF and SPARQL\n", 33 | "RDF (Resource Description Framework) is a standard model for data interchange on the web. SPARQL is a query language for RDF. This lab will introduce the basics of RDF, how to create RDF data in Turtle format, and how to query it using SPARQL.\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "source": [ 39 | "# 2. Setting Up the Environment\n", 40 | "\n", 41 | "First, we need to install the `rdflib` library, which provides tools for working with RDF data in Python.\n" 42 | ], 43 | "metadata": { 44 | "id": "0LGk4rAcB7YK" 45 | } 46 | }, 47 | { 48 | "cell_type": "code", 49 | "source": [ 50 | "# Install rdflib library\n", 51 | "!pip install rdflib" 52 | ], 53 | "metadata": { 54 | "id": "zgXgWsKqFlWM" 55 | }, 56 | "execution_count": null, 57 | "outputs": [] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "source": [ 62 | "We also need to import necessary modules." 63 | ], 64 | "metadata": { 65 | "id": "nxFFaZ5KKWR1" 66 | } 67 | }, 68 | { 69 | "cell_type": "code", 70 | "source": [ 71 | "# Import necessary modules\n", 72 | "from rdflib import Graph, Literal, RDF, URIRef, Namespace\n", 73 | "from rdflib.namespace import FOAF, XSD, DC" 74 | ], 75 | "metadata": { 76 | "id": "2KLaaodNKXZd" 77 | }, 78 | "execution_count": null, 79 | "outputs": [] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "source": [ 84 | "# Use Case 1: Basic RDF and SPARQL\n", 85 | "\n", 86 | "# 3. 
Creating and Saving RDF Turtle Document\n", 87 | "\n", 88 | "We will create a simple RDF Turtle document and save it using the `%%writefile` magic cell.\n" 89 | ], 90 | "metadata": { 91 | "id": "f8wxprLvCP5E" 92 | } 93 | }, 94 | { 95 | "cell_type": "code", 96 | "source": [ 97 | "%%writefile data.ttl\n", 98 | "@prefix dcterms: <http://purl.org/dc/terms/> .\n", 99 | "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n", 100 | "@prefix oa: <http://www.w3.org/ns/oa#> .\n", 101 | "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n", 102 | "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n", 103 | "@prefix myrdf: <http://example.org/> .\n", 104 | "@prefix armadale: <https://literary-greats.com/WCollins/Armadale/> .\n", 105 | "\n", 106 | "myrdf:anno-001 a oa:Annotation ;\n", 107 | " dcterms:created \"2015-10-13T13:00:00+00:00\"^^xsd:dateTime ;\n", 108 | " dcterms:creator myrdf:DL192 ;\n", 109 | " oa:hasBody [\n", 110 | " a oa:TextualBody ;\n", 111 | " rdf:value \"Note the use of visual language here.\"\n", 112 | " ] ;\n", 113 | " oa:hasTarget [\n", 114 | " a oa:SpecificResource ;\n", 115 | " oa:hasSelector [\n", 116 | " a oa:TextPositionSelector ;\n", 117 | " oa:start \"235\"^^xsd:nonNegativeInteger ;\n", 118 | " oa:end \"300\"^^xsd:nonNegativeInteger\n", 119 | " ] ;\n", 120 | " oa:hasSource <https://literary-greats.com/WCollins/Armadale/Chapter3> ;\n", 121 | " oa:motivatedBy oa:commenting\n", 122 | " ] .\n", 123 | "\n", 124 | "myrdf:DL192 a foaf:Person ;\n", 125 | " foaf:name \"David Lewis\" ." 126 | ], 127 | "metadata": { 128 | "id": "J8U76XcPCQTe" 129 | }, 130 | "execution_count": null, 131 | "outputs": [] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "source": [ 136 | "# 4. Loading RDF Turtle Document\n", 137 | "\n", 138 | "We can load RDF data from the Turtle document directly into an RDF graph." 139 | ], 140 | "metadata": { 141 | "id": "ncr8F5pRCbmb" 142 | } 143 | }, 144 | { 145 | "cell_type": "code", 146 | "source": [ 147 | "# Create a new RDF graph\n", 148 | "g = Graph()\n", 149 | "\n", 150 | "# Load the Turtle data from the file\n", 151 | "g.parse('data.ttl', format='turtle')\n", 152 | "\n", 153 | "# Verify the graph contents\n", 154 | "for stmt in g:\n", 155 | " print(stmt)" 156 | ], 157 | "metadata": { 158 | "id": "f2q4bBmFNQuA" 159 | }, 160 | "execution_count": null, 161 | "outputs": [] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "source": [ 166 | "# 5. 
Querying RDF Data with SPARQL\n", 167 | "\n", 168 | "We will use SPARQL to query the RDF data we loaded.\n" 169 | ], 170 | "metadata": { 171 | "id": "zi7YksTPGlBH" 172 | } 173 | }, 174 | { 175 | "cell_type": "code", 176 | "source": [ 177 | "# Define a SPARQL query\n", 178 | "query = \"\"\"\n", 179 | "SELECT ?creatorName ?bodyText\n", 180 | "WHERE {\n", 181 | " ?annotation a oa:Annotation ;\n", 182 | " dcterms:creator ?creator ;\n", 183 | " oa:hasBody ?body ;\n", 184 | " oa:hasTarget [\n", 185 | " oa:hasSource <https://literary-greats.com/WCollins/Armadale/Chapter3>\n", 186 | " ] .\n", 187 | " ?creator foaf:name ?creatorName .\n", 188 | " ?body rdf:value ?bodyText .\n", 189 | "}\n", 190 | "\"\"\"\n", 191 | "\n", 192 | "# Execute the query on the loaded graph\n", 193 | "qres = g.query(query)\n", 194 | "\n", 195 | "# Print the results\n", 196 | "for row in qres:\n", 197 | " print(f\"{row.creatorName}, {row.bodyText}\")" 198 | ], 199 | "metadata": { 200 | "id": "itVBhf6uGm3f" 201 | }, 202 | "execution_count": null, 203 | "outputs": [] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "source": [], 208 | "metadata": { 209 | "id": "KlRYgImOcG0J" 210 | }, 211 | "execution_count": null, 212 | "outputs": [] 213 | } 214 | ] 215 | } -------------------------------------------------------------------------------- /XPath - Sep 2022.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "include_colab_link": true 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": "python3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "view-in-github", 20 | "colab_type": "text" 21 | }, 22 | "source": [ 23 | "<a href=\"https://colab.research.google.com/github/sreent/data-management-intro/blob/main/XPath%20-%20Sep%202022.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" 24 | ] 25 | }, 26 | { 27 | "metadata": { 28 | "id": "vM6ta952S2z2" 29 | }, 30 | "cell_type": "markdown", 31 | "source": [ 32 | "# 1 Introduction to XPath\n", 33 | "\n", 34 | "XPath (XML Path Language) is a query language for selecting nodes from an XML document. It provides a way to navigate through elements and attributes in XML." 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "source": [ 40 | "# 2 Setting Up XPaht Environment\n", 41 | "\n", 42 | "First, we need to install the `lxml` library, which provides a powerful API for XML and HTML parsing." 43 | ], 44 | "metadata": { 45 | "id": "0LGk4rAcB7YK" 46 | } 47 | }, 48 | { 49 | "cell_type": "code", 50 | "source": [ 51 | "# Install lxml library\n", 52 | "!pip install lxml" 53 | ], 54 | "metadata": { 55 | "id": "zgXgWsKqFlWM" 56 | }, 57 | "execution_count": null, 58 | "outputs": [] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "source": [ 63 | "We also need to import the display tools from IPython." 64 | ], 65 | "metadata": { 66 | "id": "nxFFaZ5KKWR1" 67 | } 68 | }, 69 | { 70 | "cell_type": "code", 71 | "source": [ 72 | "# Import display tools\n", 73 | "from IPython.display import display, HTML, Markdown" 74 | ], 75 | "metadata": { 76 | "id": "2KLaaodNKXZd" 77 | }, 78 | "execution_count": null, 79 | "outputs": [] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "source": [ 84 | "# 3. Sample XML Data\n", 85 | "\n", 86 | "Let's start with a sample XML document. 
We will use this XML data for our XPath queries." 87 | ], 88 | "metadata": { 89 | "id": "f8wxprLvCP5E" 90 | } 91 | }, 92 | { 93 | "cell_type": "code", 94 | "source": [ 95 | "from lxml import etree\n", 96 | "\n", 97 | "xml_data = \"\"\"\n", 98 | "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n", 99 | "<TEI xml:id=\"manuscript_3945\" xmlns=\"http://www.tei-c.org/ns/1.0\">\n", 100 | " <teiHeader xmlns:tei=\"http://www.tei-c.org/ns/1.0\">\n", 101 | " <fileDesc>\n", 102 | " <titleStmt>\n", 103 | " <title>Christ Church MS. 341\n", 104 | " Christ Church MSS.\n", 105 | " \n", 106 | " Cataloguer\n", 107 | " Ralph Hanna\n", 108 | " David Rundle\n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | "\n", 114 | "\"\"\"\n", 115 | "\n", 116 | "# Clean the XML data to ensure no unwanted characters are before the declaration\n", 117 | "xml_data = xml_data.strip()" 118 | ], 119 | "metadata": { 120 | "id": "J8U76XcPCQTe" 121 | }, 122 | "execution_count": null, 123 | "outputs": [] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "source": [ 128 | "# 4. Parsing XML Data\n", 129 | "\n", 130 | "We will use the `lxml` library to parse the XML data." 131 | ], 132 | "metadata": { 133 | "id": "ncr8F5pRCbmb" 134 | } 135 | }, 136 | { 137 | "cell_type": "code", 138 | "source": [ 139 | "# Convert the XML string to a byte string\n", 140 | "xml_data_bytes = xml_data.encode('utf-8')\n", 141 | "\n", 142 | "# Parse the XML data\n", 143 | "root = etree.fromstring(xml_data_bytes)\n", 144 | "\n", 145 | "# Display the root tag to verify parsing\n", 146 | "root.tag" 147 | ], 148 | "metadata": { 149 | "id": "f2q4bBmFNQuA" 150 | }, 151 | "execution_count": null, 152 | "outputs": [] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "source": [ 157 | "# 5. Utility Function to Display XML Nodes\n", 158 | "\n", 159 | "Define a utility function to simplify displaying XML content." 160 | ], 161 | "metadata": { 162 | "id": "zi7YksTPGlBH" 163 | } 164 | }, 165 | { 166 | "cell_type": "code", 167 | "source": [ 168 | "# Utility function to display XML attribute values\n", 169 | "def display_values(values):\n", 170 | " for value in values:\n", 171 | " display(Markdown(f'```text\\n{value}\\n```'))" 172 | ], 173 | "metadata": { 174 | "id": "itVBhf6uGm3f" 175 | }, 176 | "execution_count": null, 177 | "outputs": [] 178 | }, 179 | { 180 | "cell_type": "markdown", 181 | "source": [ 182 | "# 6. XPath Queries\n", 183 | "\n", 184 | "Let's start with some basic XPath queries to extract information from the XML document." 
185 | ], 186 | "metadata": { 187 | "id": "ZK39BLCHmVUa" 188 | } 189 | }, 190 | { 191 | "cell_type": "code", 192 | "source": [ 193 | "# Define namespaces (if any)\n", 194 | "namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}\n", 195 | "\n", 196 | "# Adjust XPath query to include namespace\n", 197 | "results = root.xpath('//tei:fileDesc//tei:title/@type', namespaces=namespaces)\n", 198 | "\n", 199 | "# Display the content of title attribute values\n", 200 | "display_values(results)" 201 | ], 202 | "metadata": { 203 | "id": "BSsXBUfhaJp9" 204 | }, 205 | "execution_count": null, 206 | "outputs": [] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "source": [ 211 | "# Adjust XPath query to include namespace\n", 212 | "results = root.xpath('//tei:resp[text()=\"Cataloguer\"]/../tei:persName', namespaces=namespaces)\n", 213 | "\n", 214 | "# Extract the text from each element\n", 215 | "names = [name.text for name in results]\n", 216 | "\n", 217 | "# Display the names\n", 218 | "display_values(names)" 219 | ], 220 | "metadata": { 221 | "id": "b23IMfX5Sgi_" 222 | }, 223 | "execution_count": null, 224 | "outputs": [] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "source": [ 229 | "xml_data = \"\"\"\n", 230 | "\n", 231 | " \n", 232 | " <relationship type=\"marriage\" spouse=\"#ElizabethOfYork\">\n", 233 | " <children>\n", 234 | " <royal name=\"Arthur\" xml:id=\"ArthurTudor\" />\n", 235 | " <royal name=\"Henry\" xml:id=\"HenryVIII\">\n", 236 | " <title rank=\"king\" territory=\"England\" regnal=\"VIII\" from=\"1509-04-22\" to=\"1547-01-28\" />\n", 237 | " <relationship type=\"marriage\" spouse=\"#CatherineOfAragon\" from=\"1509-06-11\" to=\"1533-05-23\">\n", 238 | " <children>\n", 239 | " <royal name=\"Mary\">\n", 240 | " <title rank=\"queen\" territory=\"England\" regnal=\"I\" from=\"1553-07-19\" to=\"1558-11-17\" />\n", 241 | " <relationship type=\"marriage\" spouse=\"#PhilipOfSpain\" from=\"1554-07-25\" />\n", 242 | " </royal>\n", 243 | " </children>\n", 244 | " </relationship>\n", 245 | " <relationship type=\"marriage\" spouse=\"#AnneBoleyn\" from=\"1533-01-25\" to=\"1536-05-17\">\n", 246 | " <children>\n", 247 | " <royal name=\"Elizabeth\">\n", 248 | " <title rank=\"queen\" territory=\"England\" regnal=\"I\" from=\"1558-11-17\" to=\"1603-03-24\" />\n", 249 | " </royal>\n", 250 | " </children>\n", 251 | " </relationship>\n", 252 | " <relationship type=\"marriage\" spouse=\"#JaneSeymour\" from=\"1536-05-30\" to=\"1537-10-24\">\n", 253 | " <children>\n", 254 | " <royal name=\"Edward\">\n", 255 | " <title rank=\"king\" territory=\"England\" regnal=\"VI\" from=\"1547-01-28\" to=\"1553-07-06\" />\n", 256 | " </royal>\n", 257 | " </children>\n", 258 | " </relationship>\n", 259 | " </royal>\n", 260 | " </children>\n", 261 | " </relationship>\n", 262 | "</royal>\n", 263 | "\"\"\"\n", 264 | "\n", 265 | "# Clean the XML data to ensure no unwanted characters are before the declaration\n", 266 | "xml_data = xml_data.strip()" 267 | ], 268 | "metadata": { 269 | "id": "D1JzJQgASsXH" 270 | }, 271 | "execution_count": 35, 272 | "outputs": [] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "source": [ 277 | "# Convert the XML string to a byte string\n", 278 | "xml_data_bytes = xml_data.encode('utf-8')\n", 279 | "\n", 280 | "# Parse the XML data\n", 281 | "root = etree.fromstring(xml_data_bytes)\n", 282 | "\n", 283 | "# Display the root tag to verify parsing\n", 284 | "root.tag" 285 | ], 286 | "metadata": { 287 | "colab": { 288 | "base_uri": "https://localhost:8080/", 289 | "height": 35 290 | }, 291 | "id": 
"J7NxaC_ndbAb", 292 | "outputId": "63c1ff41-3bfa-4a8c-e988-3afed2326129" 293 | }, 294 | "execution_count": 36, 295 | "outputs": [ 296 | { 297 | "output_type": "execute_result", 298 | "data": { 299 | "text/plain": [ 300 | "'royal'" 301 | ], 302 | "application/vnd.google.colaboratory.intrinsic+json": { 303 | "type": "string" 304 | } 305 | }, 306 | "metadata": {}, 307 | "execution_count": 36 308 | } 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "source": [ 314 | "# Let's extract all royal names for those who are titled \"king\" with regnal=\"VIII\"\n", 315 | "results = root.xpath('//royal/title[@rank=\"king\" and @regnal=\"VIII\"]/../@name')\n", 316 | "\n", 317 | "# Display the results\n", 318 | "for name in results:\n", 319 | " print(name)" 320 | ], 321 | "metadata": { 322 | "colab": { 323 | "base_uri": "https://localhost:8080/" 324 | }, 325 | "id": "Ao6M4dHxddcX", 326 | "outputId": "18429452-b45a-4c0d-850f-4009b0f9088a" 327 | }, 328 | "execution_count": 41, 329 | "outputs": [ 330 | { 331 | "output_type": "stream", 332 | "name": "stdout", 333 | "text": [ 334 | "<Element title at 0x7dd90595abc0>\n", 335 | "<Element relationship at 0x7dd8ec37d6c0>\n", 336 | "<Element relationship at 0x7dd8ec37e340>\n", 337 | "<Element relationship at 0x7dd8ec37e1c0>\n" 338 | ] 339 | } 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "source": [], 345 | "metadata": { 346 | "id": "lPSoQ-Qed1u3" 347 | }, 348 | "execution_count": null, 349 | "outputs": [] 350 | } 351 | ] 352 | } -------------------------------------------------------------------------------- /XPath Hand-On Lab - Solutions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "include_colab_link": true 8 | }, 9 | "kernelspec": { 10 | "display_name": "Python 3", 11 | "language": "python", 12 | "name": "python3" 13 | } 14 | }, 15 | "cells": [ 16 | { 17 | "cell_type": "markdown", 18 | "metadata": { 19 | "id": "view-in-github", 20 | "colab_type": "text" 21 | }, 22 | "source": [ 23 | "<a href=\"https://colab.research.google.com/github/sreent/data-management-intro/blob/main/XPath%20Hand-On%20Lab%20-%20Solutions.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" 24 | ] 25 | }, 26 | { 27 | "metadata": { 28 | "id": "vM6ta952S2z2" 29 | }, 30 | "cell_type": "markdown", 31 | "source": [ 32 | "# 1 Introduction to XPath\n", 33 | "\n", 34 | "XPath (XML Path Language) is a query language for selecting nodes from an XML document. It provides a way to navigate through elements and attributes in XML." 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "source": [ 40 | "# 2 Setting Up XPaht Environment\n", 41 | "\n", 42 | "First, we need to install the `lxml` library, which provides a powerful API for XML and HTML parsing." 43 | ], 44 | "metadata": { 45 | "id": "0LGk4rAcB7YK" 46 | } 47 | }, 48 | { 49 | "cell_type": "code", 50 | "source": [ 51 | "# Install lxml library\n", 52 | "!pip install lxml" 53 | ], 54 | "metadata": { 55 | "id": "zgXgWsKqFlWM" 56 | }, 57 | "execution_count": null, 58 | "outputs": [] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "source": [ 63 | "We also need to import the display tools from IPython." 
64 | ], 65 | "metadata": { 66 | "id": "nxFFaZ5KKWR1" 67 | } 68 | }, 69 | { 70 | "cell_type": "code", 71 | "source": [ 72 | "# Import display tools\n", 73 | "from IPython.display import display, HTML, Markdown" 74 | ], 75 | "metadata": { 76 | "id": "2KLaaodNKXZd" 77 | }, 78 | "execution_count": null, 79 | "outputs": [] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "source": [ 84 | "# 3. Sample XML Data\n", 85 | "\n", 86 | "Let's start with a sample XML document. We will use this XML data for our XPath queries." 87 | ], 88 | "metadata": { 89 | "id": "f8wxprLvCP5E" 90 | } 91 | }, 92 | { 93 | "cell_type": "code", 94 | "source": [ 95 | "xml_data = \"\"\"\n", 96 | "<library>\n", 97 | " <book id=\"1\">\n", 98 | " <title>Python Programming\n", 99 | " John Doe\n", 100 | " 2020\n", 101 | " 29.99\n", 102 | " \n", 103 | " \n", 104 | " Learning XPath\n", 105 | " Jane Smith\n", 106 | " 2019\n", 107 | " 19.99\n", 108 | " \n", 109 | " \n", 110 | " Data Science Handbook\n", 111 | " Emily Davis\n", 112 | " 2018\n", 113 | " 39.99\n", 114 | " \n", 115 | "\n", 116 | "\"\"\"" 117 | ], 118 | "metadata": { 119 | "id": "J8U76XcPCQTe" 120 | }, 121 | "execution_count": null, 122 | "outputs": [] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "source": [ 127 | "# 4. Parsing XML Data\n", 128 | "\n", 129 | "We will use the `lxml` library to parse the XML data." 130 | ], 131 | "metadata": { 132 | "id": "ncr8F5pRCbmb" 133 | } 134 | }, 135 | { 136 | "cell_type": "code", 137 | "source": [ 138 | "from lxml import etree\n", 139 | "\n", 140 | "# Parse the XML data\n", 141 | "root = etree.fromstring(xml_data)\n", 142 | "\n", 143 | "# Display the root tag to verify parsing\n", 144 | "root.tag" 145 | ], 146 | "metadata": { 147 | "id": "f2q4bBmFNQuA" 148 | }, 149 | "execution_count": null, 150 | "outputs": [] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "source": [ 155 | "# 5. Utility Function to Display XML Nodes\n", 156 | "\n", 157 | "Define a utility function to simplify displaying XML content." 158 | ], 159 | "metadata": { 160 | "id": "zi7YksTPGlBH" 161 | } 162 | }, 163 | { 164 | "cell_type": "code", 165 | "source": [ 166 | "# Utility function to display XML content without empty lines\n", 167 | "def display_xml(nodes):\n", 168 | " for node in nodes:\n", 169 | " xml_str = etree.tostring(node, pretty_print=True, encoding='unicode').strip()\n", 170 | " display(Markdown(f'```xml\\n{xml_str}\\n```'))" 171 | ], 172 | "metadata": { 173 | "id": "itVBhf6uGm3f" 174 | }, 175 | "execution_count": null, 176 | "outputs": [] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "source": [ 181 | "# 6. Basic XPath Queries\n", 182 | "\n", 183 | "Let's start with some basic XPath queries to extract information from the XML document." 184 | ], 185 | "metadata": { 186 | "id": "ZK39BLCHmVUa" 187 | } 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "source": [ 192 | "**a. Extract all book titles:**" 193 | ], 194 | "metadata": { 195 | "id": "qXVCAX1xmz2q" 196 | } 197 | }, 198 | { 199 | "cell_type": "code", 200 | "source": [ 201 | "# Extract all book title nodes\n", 202 | "title_nodes = root.xpath('//book/title')\n", 203 | "# Display the content of title nodes\n", 204 | "display_xml(title_nodes)" 205 | ], 206 | "metadata": { 207 | "id": "BSsXBUfhaJp9" 208 | }, 209 | "execution_count": null, 210 | "outputs": [] 211 | }, 212 | { 213 | "cell_type": "markdown", 214 | "source": [ 215 | "**b. 
Extract the author of the first book:**" 216 | ], 217 | "metadata": { 218 | "id": "MYRMT54Xn1EB" 219 | } 220 | }, 221 | { 222 | "cell_type": "code", 223 | "source": [ 224 | "# Extract the author node of the first book\n", 225 | "author_first_book = root.xpath('//book[1]/author')\n", 226 | "# Display the content of the author node\n", 227 | "display_xml(author_first_book)" 228 | ], 229 | "metadata": { 230 | "id": "W0agPtBpNpc3" 231 | }, 232 | "execution_count": null, 233 | "outputs": [] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "source": [ 238 | "**c. Extract all prices:**" 239 | ], 240 | "metadata": { 241 | "id": "G1mRdeRFoXaY" 242 | } 243 | }, 244 | { 245 | "cell_type": "code", 246 | "source": [ 247 | "# Extract all price nodes\n", 248 | "price_nodes = root.xpath('//book/price')\n", 249 | "# Display the content of price nodes\n", 250 | "display_xml(price_nodes)" 251 | ], 252 | "metadata": { 253 | "id": "M7lCulsSjqun" 254 | }, 255 | "execution_count": null, 256 | "outputs": [] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "source": [ 261 | "# 7. Advanced XPath Queries\n", 262 | "\n", 263 | "Now, let's move on to some advanced queries." 264 | ], 265 | "metadata": { 266 | "id": "O6kaX0jfHxXa" 267 | } 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "source": [ 272 | "**a. Extract books published after 2018:**" 273 | ], 274 | "metadata": { 275 | "id": "idHUEHFHowWX" 276 | } 277 | }, 278 | { 279 | "cell_type": "code", 280 | "source": [ 281 | "# Extract book nodes published after 2018\n", 282 | "books_after_2018 = root.xpath('//book[year > 2018]')\n", 283 | "# Display the content of the book nodes\n", 284 | "display_xml(books_after_2018)" 285 | ], 286 | "metadata": { 287 | "id": "5nCpwNFMc5g8" 288 | }, 289 | "execution_count": null, 290 | "outputs": [] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "source": [ 295 | "**b. Extract the title and price of books that cost more than $20:**" 296 | ], 297 | "metadata": { 298 | "id": "xUYJCDrXIKBI" 299 | } 300 | }, 301 | { 302 | "cell_type": "code", 303 | "source": [ 304 | "# Extract book nodes with price greater than $20\n", 305 | "expensive_books = root.xpath('//book[price > 20]')\n", 306 | "# Display the content of the book nodes\n", 307 | "display_xml(expensive_books)" 308 | ], 309 | "metadata": { 310 | "id": "RbUMVaNzdOvx" 311 | }, 312 | "execution_count": null, 313 | "outputs": [] 314 | }, 315 | { 316 | "cell_type": "markdown", 317 | "source": [ 318 | "**c. Extract book details with a specific attribute:**" 319 | ], 320 | "metadata": { 321 | "id": "YCNewqotISMZ" 322 | } 323 | }, 324 | { 325 | "cell_type": "code", 326 | "source": [ 327 | "# Extract book node with id=2\n", 328 | "book_id_2 = root.xpath('//book[@id=\"2\"]')\n", 329 | "# Display the content of the book node\n", 330 | "display_xml(book_id_2)" 331 | ], 332 | "metadata": { 333 | "id": "oeYt0rqzpKnr" 334 | }, 335 | "execution_count": null, 336 | "outputs": [] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "source": [ 341 | "# 8. Exploring Lists and Parent Navigation\n", 342 | "\n", 343 | "XPath also allows navigating lists and moving to the parent level using `..`.\n" 344 | ], 345 | "metadata": { 346 | "id": "OtseGmERr9jQ" 347 | } 348 | }, 349 | { 350 | "cell_type": "markdown", 351 | "source": [ 352 | "**a. 
Extract titles of all books (list example):**\n" 353 | ], 354 | "metadata": { 355 | "id": "GPj9LFGrsyyE" 356 | } 357 | }, 358 | { 359 | "cell_type": "code", 360 | "source": [ 361 | "# Extract all book title nodes\n", 362 | "book_titles_nodes = root.xpath('//book/title')\n", 363 | "# Display the content of title nodes\n", 364 | "display_xml(book_titles_nodes)" 365 | ], 366 | "metadata": { 367 | "id": "TWRu81nReEw2" 368 | }, 369 | "execution_count": null, 370 | "outputs": [] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "source": [ 375 | "**b. Navigate to the parent and back down to another child:**" 376 | ], 377 | "metadata": { 378 | "id": "wncBlxCctOPr" 379 | } 380 | }, 381 | { 382 | "cell_type": "code", 383 | "source": [ 384 | "# Navigate to the parent of the first book's title and get the price\n", 385 | "parent_price_node = root.xpath('//book/title[text()=\"Python Programming\"]/../price')\n", 386 | "# Display the content of the price node\n", 387 | "display_xml(parent_price_node)" 388 | ], 389 | "metadata": { 390 | "id": "AZXxnpqEfwc-" 391 | }, 392 | "execution_count": null, 393 | "outputs": [] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "source": [ 398 | "**c. Use `..` to navigate from an element to its parent and then select another sibling:**" 399 | ], 400 | "metadata": { 401 | "id": "b8zsF3bWtb8H" 402 | } 403 | }, 404 | { 405 | "cell_type": "code", 406 | "source": [ 407 | "# Use '..' to navigate from author to title\n", 408 | "titles_from_authors_nodes = root.xpath('//book/author[text()=\"Jane Smith\"]/../title')\n", 409 | "# Display the content of title nodes\n", 410 | "display_xml(titles_from_authors_nodes)" 411 | ], 412 | "metadata": { 413 | "id": "wAPyCIANhTKa" 414 | }, 415 | "execution_count": null, 416 | "outputs": [] 417 | }, 418 | { 419 | "cell_type": "markdown", 420 | "source": [ 421 | "# 9. Using `//` and Wildcard `*` in XPath" 422 | ], 423 | "metadata": { 424 | "id": "tCSPpGWnItwW" 425 | } 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "source": [ 430 | "**a. Using `//` to select nodes regardless of their position in the document:**" 431 | ], 432 | "metadata": { 433 | "id": "TouNO91iIw9S" 434 | } 435 | }, 436 | { 437 | "cell_type": "code", 438 | "source": [ 439 | "# Extract all author nodes regardless of their position in the document\n", 440 | "all_authors_nodes = root.xpath('//author')\n", 441 | "# Display the content of author nodes\n", 442 | "display_xml(all_authors_nodes)" 443 | ], 444 | "metadata": { 445 | "id": "Nlhzy-5iuKOb" 446 | }, 447 | "execution_count": null, 448 | "outputs": [] 449 | }, 450 | { 451 | "cell_type": "markdown", 452 | "source": [ 453 | "**b. Using the wildcard `*` to select any element:**" 454 | ], 455 | "metadata": { 456 | "id": "JfQzY-0_txxA" 457 | } 458 | }, 459 | { 460 | "cell_type": "code", 461 | "source": [ 462 | "# Extract all child nodes of the first book\n", 463 | "first_book_children = root.xpath('//book[1]/*')\n", 464 | "# Display the content of child nodes\n", 465 | "display_xml(first_book_children)" 466 | ], 467 | "metadata": { 468 | "id": "REjidHpai09t" 469 | }, 470 | "execution_count": null, 471 | "outputs": [] 472 | }, 473 | { 474 | "cell_type": "markdown", 475 | "source": [ 476 | "**c. 
Combine `//` and `*` to select all elements:**" 477 | ], 478 | "metadata": { 479 | "id": "0mRGZCvCI9bD" 480 | } 481 | }, 482 | { 483 | "cell_type": "code", 484 | "source": [ 485 | "# Extract all elements in the document\n", 486 | "all_elements = root.xpath('//*')\n", 487 | "# Display the content of all elements\n", 488 | "display_xml(all_elements)" 489 | ], 490 | "metadata": { 491 | "id": "3z_IB5e7uHtx" 492 | }, 493 | "execution_count": null, 494 | "outputs": [] 495 | }, 496 | { 497 | "cell_type": "markdown", 498 | "source": [ 499 | "# 10. Additional XPath Functions and Expressions\n" 500 | ], 501 | "metadata": { 502 | "id": "5UFXSyVBunOA" 503 | } 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "source": [ 508 | "**a. Using `@` to Select Attributes:**" 509 | ], 510 | "metadata": { 511 | "id": "qHmOq4b4JF9j" 512 | } 513 | }, 514 | { 515 | "cell_type": "code", 516 | "source": [ 517 | "# Extract the IDs of all books\n", 518 | "book_ids = root.xpath('//book/@id')\n", 519 | "book_ids" 520 | ], 521 | "metadata": { 522 | "id": "w1O1klukufCL" 523 | }, 524 | "execution_count": null, 525 | "outputs": [] 526 | }, 527 | { 528 | "cell_type": "markdown", 529 | "source": [ 530 | "**b. Using Position Functions:**" 531 | ], 532 | "metadata": { 533 | "id": "sdbYCpZ0JMSS" 534 | } 535 | }, 536 | { 537 | "cell_type": "code", 538 | "source": [ 539 | "# Extract the title of the last book\n", 540 | "last_book_title_node = root.xpath('//book[last()]/title')\n", 541 | "# Display the content of the title node\n", 542 | "display_xml(last_book_title_node)" 543 | ], 544 | "metadata": { 545 | "id": "_9_ogoGTu_4N" 546 | }, 547 | "execution_count": null, 548 | "outputs": [] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "source": [ 553 | "# Extract the titles of the first two books\n", 554 | "first_two_books_title_nodes = root.xpath('//book[position() <= 2]/title')\n", 555 | "# Display the content of the title nodes\n", 556 | "display_xml(first_two_books_title_nodes)" 557 | ], 558 | "metadata": { 559 | "id": "x1ncF0kGJSYk" 560 | }, 561 | "execution_count": null, 562 | "outputs": [] 563 | }, 564 | { 565 | "cell_type": "markdown", 566 | "source": [ 567 | "**c. Using Boolean Functions:**" 568 | ], 569 | "metadata": { 570 | "id": "E-NcrapfvntT" 571 | } 572 | }, 573 | { 574 | "cell_type": "code", 575 | "source": [ 576 | "# Check if there are any books published in 2020\n", 577 | "books_2020 = root.xpath('boolean(//book[year=2020])')\n", 578 | "books_2020" 579 | ], 580 | "metadata": { 581 | "id": "j1_qs4YWvSPR" 582 | }, 583 | "execution_count": null, 584 | "outputs": [] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "source": [ 589 | "**d. Using Aggregation Functions:**" 590 | ], 591 | "metadata": { 592 | "id": "JBFK2hgtJcQW" 593 | } 594 | }, 595 | { 596 | "cell_type": "code", 597 | "source": [ 598 | "# Count the number of books\n", 599 | "book_count = root.xpath('count(//book)')\n", 600 | "book_count" 601 | ], 602 | "metadata": { 603 | "id": "9UgslSuiwQZo" 604 | }, 605 | "execution_count": null, 606 | "outputs": [] 607 | }, 608 | { 609 | "cell_type": "markdown", 610 | "source": [ 611 | "**e. 
Combining Functions:**" 612 | ], 613 | "metadata": { 614 | "id": "YLs5l3pXwfea" 615 | } 616 | }, 617 | { 618 | "cell_type": "code", 619 | "source": [ 620 | "# Extract titles and authors of books costing more than $20\n", 621 | "expensive_books_nodes = root.xpath('//book[price > 20]')\n", 622 | "# Display the content of the book nodes\n", 623 | "display_xml(expensive_books_nodes)" 624 | ], 625 | "metadata": { 626 | "id": "j7411YVlwXFH" 627 | }, 628 | "execution_count": null, 629 | "outputs": [] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "source": [ 634 | "# 11. Conclusion\n", 635 | "\n", 636 | "XPath is a powerful tool for navigating and querying XML documents. In this lab, we've covered basic to advanced XPath queries, explored lists, navigated using `..`, used `//` to select nodes regardless of their position, utilized the wildcard `*`, and explored various XPath functions and expressions without always relying on `text()`. You can further explore XPath to handle more complex XML structures and queries.\n", 637 | "\n" 638 | ], 639 | "metadata": { 640 | "id": "MSqUGlQMJm6Q" 641 | } 642 | }, 643 | { 644 | "cell_type": "code", 645 | "source": [], 646 | "metadata": { 647 | "id": "nBcu4xjWMvPz" 648 | }, 649 | "execution_count": null, 650 | "outputs": [] 651 | } 652 | ] 653 | } -------------------------------------------------------------------------------- /precision-recal.md: -------------------------------------------------------------------------------- 1 | ### Problem Context: 2 | - **Total Archive Size**: 50,000 items 3 | - **Relevant Items**: 30 items 4 | - **Manual Time to Find Each Relevant Item** (if missed): 15 minutes 5 | - **Time Wasted on Each False Positive**: 0.5 minutes (30 seconds) 6 | 7 | ### Relevant Formulas: 8 | 1. **False Negatives** (items missed): 9 | \( \text{False Negatives} = (1 - \text{Recall}) \times \text{Total Relevant Items} \) 10 | 11 | 2. **False Positives** (irrelevant items incorrectly identified as relevant): 12 | \( \text{False Positives} = \frac{\text{Total Identified Items} \times (1 - \text{Precision})}{\text{Precision}} \) 13 | 14 | 3. **Time Spent Finding False Negatives**: 15 | \( \text{False Negatives} \times 15 \text{ minutes} \) 16 | 17 | 4. **Time Spent Dealing with False Positives**: 18 | \( \text{False Positives} \times 0.5 \text{ minutes} \) 19 | 20 | --- 21 | 22 | ### Option i: Just right of the center of the graph, where precision increases again. 23 | - **Estimated Precision**: 80% 24 | - **Estimated Recall**: 90% 25 | 26 | **Breakdown:** 27 | - **False Negatives**: \( (1 - 0.90) \times 30 = 3 \) items. 28 | - **False Positives**: \( \frac{30 \times (1 - 0.80)}{0.80} = 7.5 \approx 8 \) items. 29 | 30 | **Time Calculation:** 31 | - **Time spent finding false negatives**: \( 3 \times 15 = 45 \) minutes. 32 | - **Time spent dealing with false positives**: \( 8 \times 0.5 = 4 \) minutes. 33 | 34 | **Total Time**: **49 minutes** 35 | 36 | --- 37 | 38 | ### Option ii: To the right of the graph, before it drops – with 68% precision and 90% recall. 39 | - **Precision**: 68% 40 | - **Recall**: 90% 41 | 42 | **Breakdown:** 43 | - **False Negatives**: \( (1 - 0.90) \times 30 = 3 \) items. 44 | - **False Positives**: \( \frac{30 \times (1 - 0.68)}{0.68} \approx 14 \) items. 45 | 46 | **Time Calculation:** 47 | - **Time spent finding false negatives**: \( 3 \times 15 = 45 \) minutes. 48 | - **Time spent dealing with false positives**: \( 14 \times 0.5 = 7 \) minutes. 
49 | 50 | **Total Time**: **52 minutes** 51 | 52 | --- 53 | 54 | ### Option iii: Do not use this tool. Find each resource manually. 55 | - **Precision**: 100% 56 | - **Recall**: 100% 57 | 58 | **Scenario**: 59 | You need to manually scan through all 50,000 items to find the 30 relevant ones. 60 | 61 | **Breakdown**: 62 | - You would go through all **49,970 irrelevant items** and all **30 relevant items**. 63 | 64 | **Time Calculation**: 65 | - **Time spent on irrelevant items**: \( 49,970 \times 0.5 = 24,985 \) minutes. 66 | - **Time spent on relevant items**: \( 30 \times 15 = 450 \) minutes. 67 | 68 | **Total Time**: **25,435 minutes** 69 | 70 | --- 71 | 72 | ### Option iv: To the left of the graph – 100% precision with 17% recall. 73 | - **Precision**: 100% 74 | - **Recall**: 17% 75 | 76 | **Breakdown:** 77 | - **False Negatives**: \( (1 - 0.17) \times 30 = 25 \) items. 78 | - **False Positives**: None (100% precision). 79 | 80 | **Time Calculation:** 81 | - **Time spent finding false negatives manually**: \( 25 \times 15 = 375 \) minutes. 82 | - **Time spent finding the 5 items using the tool**: \( 5 \times 15 = 75 \) minutes. 83 | 84 | **Total Time**: **450 minutes** 85 | 86 | --- 87 | 88 | ### Option v: To the left of the graph – 100% precision with 17% recall, spending 30 seconds per irrelevant record. 89 | - **Precision**: 100% 90 | - **Recall**: 17% 91 | 92 | **Breakdown:** 93 | - **False Negatives**: \( (1 - 0.17) \times 30 = 25 \) items. 94 | - **False Positives**: None (100% precision). 95 | 96 | **Time Calculation:** 97 | - **Time spent finding false negatives manually**: \( 25 \times 0.5 = 12.5 \approx 13 \) minutes. 98 | - **Time spent finding the 5 items using the tool**: \( 5 \times 15 = 75 \) minutes. 99 | 100 | **Total Time**: **88 minutes** 101 | 102 | --- 103 | 104 | ### Final Comparison: 105 | 106 | 1. **Option i**: **49 minutes** 107 | (Best choice with balanced precision and recall) 108 | 109 | 2. **Option ii**: **52 minutes** 110 | 111 | 3. **Option iii**: **25,435 minutes** 112 | (Manually scanning everything is extremely time-consuming) 113 | 114 | 4. **Option iv**: **450 minutes** 115 | 116 | 5. **Option v**: **88 minutes** 117 | 118 | -------------------------------------------------------------------------------- /web-app/Solution Sheets - MCQ September 2022A.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | ### Question 1(a): 5 | 6 | **What is missing from the following set of commands?** 7 | 8 | ```sql 9 | START TRANSACTION; 10 | UPDATE Account SET Balance = Balance-100 WHERE AccNo=21430885; 11 | UPDATE Account SET Balance = Balance+100 WHERE AccNo=29584776; 12 | SELECT SUM(Balance) FROM Account; 13 | ``` 14 | 15 | **Answer: iv. COMMIT** 16 | 17 | **Detailed Explanation and Working:** 18 | 19 | - **Choice i. ROLLBACK:** Incorrect. It undoes the changes instead of finalizing them. 20 | - **Choice ii. INSERT INTO Account VALUES (100):** Incorrect. Irrelevant to the current transaction. 21 | - **Choice iii. END TRANSACTION:** Incorrect. Not a valid SQL command. 22 | - **Choice iv. COMMIT:** Correct. Finalizes the transaction and applies all changes permanently. 23 | - **Choice v. UPDATE Account SET Balance = Balance+100 WHERE AccNo=21430885:** Incorrect. Unnecessary and would revert one of the previous updates. 24 | 25 | **Real-World Example:** 26 | In a banking application, when transferring money between accounts, the transaction should either fully succeed or fully fail. 
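As a hedged, runnable sketch of the same commit-or-rollback idea (using Python's built-in `sqlite3` module purely so the example runs standalone; the account numbers are taken from the question, and the course material itself uses MySQL):

```python
# Minimal illustration of why COMMIT matters: either both UPDATEs become
# permanent, or a ROLLBACK undoes them together. sqlite3 is used only so the
# sketch runs without a database server; the SQL mirrors the question.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Account (AccNo INTEGER PRIMARY KEY, Balance REAL)")
cur.executemany("INSERT INTO Account VALUES (?, ?)",
                [(21430885, 500.0), (29584776, 200.0)])
conn.commit()

try:
    # sqlite3 opens a transaction implicitly before the first UPDATE
    cur.execute("UPDATE Account SET Balance = Balance - 100 WHERE AccNo = 21430885")
    cur.execute("UPDATE Account SET Balance = Balance + 100 WHERE AccNo = 29584776")
    conn.commit()      # the missing step from the question: make the changes permanent
except sqlite3.Error:
    conn.rollback()    # on any failure, undo both updates so no money disappears

print(cur.execute("SELECT SUM(Balance) FROM Account").fetchone())  # (700.0,) -- total preserved
```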
A `COMMIT` ensures the transfer completes, while a `ROLLBACK` would revert changes if there’s an issue. 27 | 28 | **Common Pitfalls:** 29 | - Forgetting to `COMMIT` can cause unsaved data. 30 | - Using invalid commands like `END TRANSACTION`. 31 | 32 | **Important Point to Remember:** 33 | - Always use `COMMIT` to finalize a transaction in SQL. Without it, your changes remain temporary and can be lost. 34 | 35 | --- 36 | 37 | ### Question 1(b): 38 | 39 | **The following query should return the name of the city of Cristiano Ronaldo’s birth. Why doesn’t it?** 40 | 41 | SPARQL Query: 42 | 43 | ```sparql 44 | SELECT DISTINCT * 45 | WHERE 46 | { 47 | "Cristiano Ronaldo"@en dbo:birthPlace 48 | [ 49 | a dbo:City ; 50 | rdfs:label ?cityName 51 | ] . 52 | FILTER ( LANG(?cityName) = 'en' ) 53 | } 54 | ``` 55 | 56 | **Answer: ii. "Cristiano Ronaldo"@en is a string, not a URL. It can’t be the subject of a triple.** 57 | 58 | **Detailed Explanation and Working:** 59 | 60 | - **Choice i. The city is not in England, so the filter removes it:** Incorrect. The filter is based on language, not location. 61 | - **Choice ii. "Cristiano Ronaldo"@en is a string, not a URL:** Correct. In RDF, subjects must be URIs, not literal strings like `"Cristiano Ronaldo"@en`. 62 | - **Choice iii. The first part of the WHERE clause is a duple, not a triple:** Incorrect. The syntax is correct. 63 | - **Choice iv. Ronaldo’s place of birth is not in Wikipedia in a way that DBpedia can access:** Incorrect. This might be an issue, but it’s not the main problem here. 64 | 65 | **Real-World Example:** 66 | In linked data systems, using URIs instead of literal strings ensures consistent identification of resources across datasets. 67 | 68 | **Common Pitfalls:** 69 | - Using literals as subjects in RDF triples. 70 | - Failing to differentiate between literals and URIs. 71 | 72 | **Important Point to Remember:** 73 | - In RDF and SPARQL, always use URIs for subjects in triples. Literal strings cannot act as subjects. 74 | 75 | --- 76 | 77 | ### Question 1(c): 78 | 79 | **How many predicates does this extract contain?** 80 | 81 | Extract: 82 | 83 | ```ttl 84 | card:I a :Male 85 | foaf:family_name "Berners-Lee"; 86 | foaf:givenname "Timothy"; 87 | foaf:title "Sir". 88 | ``` 89 | 90 | **Answer: i. 4** 91 | 92 | **Detailed Explanation and Working:** 93 | 94 | - **Choice i. 4:** Correct. The predicates are `a`, `foaf:family_name`, `foaf:givenname`, and `foaf:title`. 95 | - **Choice ii. 7:** Incorrect. Overcounts by including objects or subjects as predicates. 96 | - **Choice iii. 5:** Incorrect. Adds an extra predicate that does not exist. 97 | - **Choice iv. 8:** Incorrect. Overcounts by misinterpreting the structure. 98 | 99 | **Real-World Example:** 100 | In creating digital profiles for people or entities, predicates define properties such as names, titles, and roles. 101 | 102 | **Common Pitfalls:** 103 | - Misinterpreting RDF syntax. 104 | - Confusing literals and predicates. 105 | 106 | **Important Point to Remember:** 107 | - RDF predicates describe relationships and properties. Always count them accurately in Turtle syntax. 108 | 109 | --- 110 | 111 | ### Question 1(d): 112 | 113 | **Given this XML, how many results does the query select?** 114 | 115 | XML: 116 | 117 | ```xml 118 | 119 | 120 | The Greatest Hits Ever: Volume 123 121 | 122 | 123 | What is wrong with parsley? 
124 | Herbal Reasoning 125 | 126 | 127 | Love threw me a googly 128 | Botham and the Fielders 129 | 130 | 131 | Comedy farm 132 | Just weird 133 | 134 | 135 | 136 | 137 | ``` 138 | 139 | XPath Query: 140 | 141 | ```xpath 142 | //disk[@xml:id="1847336"]/track[@duration>150]/* 143 | ``` 144 | 145 | **Answer: ii. 4** 146 | 147 | **Detailed Explanation and Working:** 148 | 149 | - **Choice i. 5:** Incorrect. Overestimates the number of selected elements. 150 | - **Choice ii. 4:** Correct. Tracks 1 and 2 meet the criteria, and each has two child elements (`` and `<artist>`), totaling 4. 151 | - **Choice iii. 1:** Incorrect. Underestimates the count. 152 | - **Choice iv. 6:** Incorrect. Includes elements that don’t match the criteria. 153 | 154 | **Real-World Example:** 155 | In XML-based media catalogs, XPath can be used to filter and retrieve specific metadata, like tracks longer than a certain duration. 156 | 157 | **Common Pitfalls:** 158 | - Misinterpreting XPath logic. 159 | - Confusing element and attribute selection. 160 | 161 | **Important Point to Remember:** 162 | - Use XPath carefully when navigating hierarchical data. Understand how to target specific elements or attributes to get accurate results. 163 | 164 | --- 165 | 166 | ### Question 1(e): 167 | 168 | **Which parameter setting for the tool is likely to be best (in the sense that I spend the least time on the task)?** 169 | 170 | #### Problem Context: 171 | - Total Archive Size: 50,000 items. 172 | - Relevant Items: 30 items. 173 | - Manual Time to Find Each Relevant Item (if missed): 15 minutes. 174 | - Time Wasted on Each False Positive: 0.5 minutes (30 seconds). 175 | 176 | **Answer: i. Just right of the center of the graph, where precision goes up again.** 177 | 178 | **Detailed Explanation and Working:** 179 | 180 | #### Choice-by-Choice Analysis: 181 | 182 | 1. **Option i: Just right of the center (80% precision, 90% recall):** 183 | - **Calculations:** 184 | - False Negatives: \( (1 - 0.90) \times 30 = 3 \) 185 | - False Positives: \( \frac{30 \times (1 - 0.80)}{0.80} = 8 \) 186 | - Total Time: 49 minutes. 187 | 188 | 2. **Option ii: To the right (68% precision, 90% recall):** 189 | - **Calculations:** 190 | - False Negatives: 3 191 | - False Positives: 14 192 | - Total Time: 52 minutes. 193 | 194 | 3. **Option iii: Manual search (100% recall):** 195 | - **Calculations:** 196 | - Irrelevant Items: 49,970 × 0.5 minutes 197 | - Total Time: 25,435 minutes. 198 | 199 | 4. **Option iv: To the left (100% precision, 17% recall):** 200 | - **Calculations:** 201 | - False Negatives: 25 202 | - Total Time: 450 minutes. 203 | 204 | 5. **Option v: To the left (100% precision, 17% recall with 30 seconds/item):** 205 | - **Calculations:** 206 | - False Negatives: 25 207 | - Total Time: 88 minutes. 208 | 209 | **Conclusion:** 210 | - Option (i) has the lowest time (49 minutes) and is the most efficient. 211 | 212 | **Real-World Example:** 213 | In search engines, balancing precision and recall helps users find the most relevant results without wasting time on irrelevant ones. 214 | 215 | **Common Pitfalls:** 216 | - Focusing solely on either precision or recall can lead to inefficiency. 217 | - Misunderstanding how false positives and false negatives affect total time. 218 | 219 | **Important Point to Remember:** 220 | - Always consider the trade-offs between precision and recall when optimizing information retrieval tasks. 
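The same arithmetic can be checked quickly with a short script. This is only a sketch of the approximation used in the working above (false positives estimated from the 30 relevant items as `relevant * (1 - precision) / precision`); the exact decimals differ slightly from the rounded whole-item counts in the tables.

```python
# Estimate total minutes spent for a given precision/recall operating point,
# following the approximation used in the worked answer above.
RELEVANT = 30     # relevant items in the 50,000-item archive
MISS_COST = 15    # minutes to find one missed relevant item manually
FP_COST = 0.5     # minutes wasted on each false positive

def estimated_minutes(precision, recall):
    false_negatives = (1 - recall) * RELEVANT
    false_positives = RELEVANT * (1 - precision) / precision
    return false_negatives * MISS_COST + false_positives * FP_COST

for label, p, r in [("i", 0.80, 0.90), ("ii", 0.68, 0.90)]:
    print(f"Option {label}: ~{estimated_minutes(p, r):.0f} minutes")  # ~49 and ~52
```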
221 | 222 | --- 223 | 224 | ### Question 1(f): 225 | 226 | **Which normal forms does this table satisfy?** 227 | 228 | | Chart | Date | Position | Title | Artist | Date of Birth | 229 | |---------------|------------|----------|----------------|----------------|---------------| 230 | | RIAS | 2022-04-14 | 1 | As It Was | Harry Styles | 1994-02-01 | 231 | | UK Singles | 2012-04-08 | 4 | Starships | Nicki Minaj | 1982-12-08 | 232 | | Billboard Hot | 2022-04-22 | 1 | First Class | Jack Harlow | 1998-03-13 | 233 | | SNEP | 1993-11-20 | 5 | Il me dit ... | Patricia Kaas | 1966-12-05 | 234 | 235 | 236 | 237 | **Answer: iv. 1NF** 238 | 239 | **Detailed Explanation and Working:** 240 | 241 | - **Choice i. 2NF:** Incorrect. Likely has partial dependencies. 242 | - **Choice ii. 3NF:** Incorrect. May have transitive dependencies. 243 | - **Choice iii. 5NF:** Incorrect. Not applicable to this table structure. 244 | - **Choice iv. 1NF:** Correct. The table meets the requirements of 1NF by having atomic values. 245 | 246 | **Real-World Example:** 247 | In customer databases, ensuring each attribute (like address, phone number) is atomic is the first step in normalization. 248 | 249 | **Common Pitfalls:** 250 | - Confusing atomic values with composite attributes. 251 | - Assuming a table automatically satisfies higher normal forms without verifying dependencies. 252 | 253 | **Important Point to Remember:** 254 | - 1NF ensures that all attributes are atomic and contain only single values, laying the foundation for further normalization. 255 | 256 | --- 257 | 258 | ### Question 1(g): 259 | 260 | **Why is this E/R diagram not a good design?** 261 | 262 | **Answer: i, ii, iii, iv, vi, viii** 263 | 264 | **Detailed Explanation and Working:** 265 | 266 | - **Choice i. Cardinality is only given between entities:** Correct. Cardinality should describe relationships between entities. 267 | - **Choice ii. Entities are connected without explicit relationships:** Correct. Entities need relationships like "has" or "belongs to." 268 | - **Choice iii. The arrow is meaningless:** Correct. The diagram uses an invalid arrow notation. 269 | - **Choice iv. Spaces are not permitted in attribute names:** Correct. Attribute names should not include spaces. 270 | - **Choice v. An attribute can’t be shared between entities:** Incorrect. In some cases, shared attributes are valid. 271 | - **Choice vi. Cardinalities like ‘21’ are not allowed:** Correct. Only standard notations like "1", "n", "m" should be used. 272 | - **Choice viii. Cardinalities like ß and x are inadvisable:** Correct. Stick to standard notations. 273 | 274 | **Real-World Example:** 275 | In retail inventory systems, clear E/R diagrams are crucial for tracking relationships between products, categories, and suppliers. 276 | 277 | **Common Pitfalls:** 278 | - Using incorrect notations or connecting entities without relationships. 279 | - Misusing attribute names and cardinalities. 280 | 281 | **Important Point to Remember:** 282 | - Proper E/R diagram conventions are essential for accurate database modeling, ensuring clarity and correctness in the design. 283 | 284 | --- 285 | 286 | ### Question 1(h): 287 | 288 | **How might the query to find all staff members who have had interactions with a client called "Shug Avery" continue?** 289 | 290 | **Answer: iii, iv** 291 | 292 | **Detailed Explanation and Working:** 293 | 294 | - **Choice i. LEFT JOIN on Client:** Incorrect. It doesn’t properly link `Meeting` and `Employee`. 295 | - **Choice ii. 
Complex LEFT JOIN structure:** Incorrect. Overcomplicated with unnecessary `LIKE` clauses. 296 | - **Choice iii. INNER JOIN structure:** Correct. Links `Client`, `Meeting`, and `Employee` logically. 297 | - **Choice iv. WHERE-based JOIN structure:** Correct. An older but still valid join syntax. 298 | 299 | **Real-World Example:** 300 | In CRM systems, querying interactions between clients and staff members is common for tracking service history or performance metrics. 301 | 302 | **Common Pitfalls:** 303 | - Using the wrong type of join (e.g., `LEFT JOIN` instead of `INNER JOIN`). 304 | - Confusing syntax when performing multi-table joins. 305 | 306 | **Important Point to Remember:** 307 | - Choose the correct join type based on the query’s goal. In cases like this, `INNER JOIN` ensures that only relevant records are returned. 308 | 309 | --- 310 | 311 | ### Question 1(i): 312 | 313 | **Which of these queries is likely to represent a successful MongoDB search for actors born before 1957?** 314 | 315 | **Answer: vii** 316 | 317 | **Detailed Explanation and Working:** 318 | 319 | - **Choice i. Using `$lt` with `ISODate`:** Correct. Properly uses MongoDB syntax. 320 | - **Choice ii. Using `$lt` with an integer:** Incorrect. Dates should be in `ISODate` format. 321 | - **Choice iii. Using `"<"` as an operator:** Incorrect. Not valid in MongoDB. 322 | - **Choice iv. Similar misuse of `"<"` operator:** Incorrect. 323 | - **Choice v. Incorrect syntax for date comparison:** Incorrect. 324 | - **Choice vi. Exact year matching with 1957:** Incorrect. Not appropriate for this query. 325 | - **Choice vii. Correct use of `$lt` and `ISODate`:** Correct. 326 | - **Choice viii. Invalid use of `"<"` operator:** Incorrect. 327 | 328 | **Real-World Example:** 329 | In media databases, date-based queries help filter content by release year, birthdate, or historical relevance. 330 | 331 | **Common Pitfalls:** 332 | - Using incorrect operators in MongoDB queries. 333 | - Failing to correctly format dates using `ISODate`. 334 | 335 | **Important Point to Remember:** 336 | - In MongoDB, always use `$lt`, `$gt`, and similar operators for date comparisons, and ensure dates are in `ISODate` format. 337 | 338 | --- 339 | 340 | ### Question 1(j): 341 | 342 | **What are the true statements based on RecipeML's .dtd file?** 343 | 344 | **Answer: i, ii, iv** 345 | 346 | **Detailed Explanation and Working:** 347 | 348 | - **Choice i. `<recipe>` must have one `<ingredients>` element:** Correct. Required based on the DTD structure. 349 | - **Choice ii. The `<ingredients>` element must come before `<directions>`:** Correct. The order is specified in the DTD. 350 | - **Choice iii. The order of children is not important:** Incorrect. Order is important according to the DTD. 351 | - **Choice iv. `<recipe>` can have one `<ingredients>` element:** Correct. The DTD allows for one `<ingredients>` element. 352 | - **Choice v. Multiple `<ingredients>` elements:** Incorrect. Only one `<ingredients>` element is allowed. 353 | 354 | **Real-World Example:** 355 | In applications that store recipes, following DTD rules ensures that the XML structure is consistent and parseable. 356 | 357 | **Common Pitfalls:** 358 | - Misunderstanding the structure and ordering specified in DTDs. 359 | - Incorrectly assuming multiple elements are allowed when they are not. 360 | 361 | **Important Point to Remember:** 362 | - Always follow DTD rules for XML structure to ensure consistency and avoid errors when parsing or validating documents. 
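To make the DTD ordering rules concrete, here is a hedged sketch using `lxml`. The DTD fragment below is a simplified, made-up stand-in for illustration only; it is not the real RecipeML DTD, but it shows how a sequence content model enforces both a single `<ingredients>` element and its position before `<directions>`.

```python
# Validate element order and cardinality against a small illustrative DTD.
from io import StringIO
from lxml import etree

# Simplified, hypothetical content model (NOT the actual RecipeML DTD)
dtd = etree.DTD(StringIO("""
<!ELEMENT recipe (head, ingredients, directions)>
<!ELEMENT head (#PCDATA)>
<!ELEMENT ingredients (#PCDATA)>
<!ELEMENT directions (#PCDATA)>
"""))

in_order = etree.fromstring(
    "<recipe><head>Soup</head><ingredients>stock, leeks</ingredients>"
    "<directions>Simmer.</directions></recipe>")
swapped = etree.fromstring(
    "<recipe><head>Soup</head><directions>Simmer.</directions>"
    "<ingredients>stock, leeks</ingredients></recipe>")

print(dtd.validate(in_order))  # True  -- children follow the declared sequence
print(dtd.validate(swapped))   # False -- <ingredients> must come before <directions>
```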
363 | 364 | --- 365 | -------------------------------------------------------------------------------- /web-app/app.js: -------------------------------------------------------------------------------- 1 | const express = require('express'); 2 | const mysql = require('mysql'); 3 | const bodyParser = require('body-parser'); 4 | const mustacheExpress = require('mustache-express'); 5 | const dotenv = require('dotenv').config(); 6 | 7 | const app = express(); 8 | 9 | // Set up Mustache as the view engine 10 | app.engine('mustache', mustacheExpress()); 11 | app.set('view engine', 'mustache'); 12 | app.set('views', __dirname + '/views'); 13 | 14 | // Serve static files from the "public" directory 15 | app.use(express.static('public')); 16 | 17 | // Use body-parser middleware to parse request bodies 18 | app.use(bodyParser.urlencoded({ extended: true })); 19 | 20 | // Set up database connection 21 | const db = mysql.createConnection({ 22 | host: process.env.DB_HOST, 23 | user: process.env.DB_USER, 24 | password: process.env.DB_PASS, 25 | database: process.env.DB_NAME 26 | }); 27 | 28 | db.connect(err => { 29 | if (err) { 30 | console.error('Error connecting to the database:', err); 31 | return; 32 | } 33 | console.log('Connected to the database.'); 34 | }); 35 | 36 | // Serve the main page 37 | app.get('/', (req, res) => { 38 | res.render('index'); 39 | }); 40 | 41 | // Define routes for each research question 42 | app.get('/city-rankings', (req, res) => { 43 | const query = 'SELECT city_name, AVG(aqi_value) AS average_aqi FROM pollutions JOIN cities ON pollutions.city_id = cities.city_id GROUP BY city_name ORDER BY average_aqi DESC'; 44 | db.query(query, (err, results) => { 45 | if (err) { 46 | console.error('Error fetching city rankings:', err); 47 | res.status(500).send('Error fetching data'); 48 | return; 49 | } 50 | res.render('city-rankings', { rankings: results }); 51 | }); 52 | }); 53 | 54 | app.get('/national-urban-aq', (req, res) => { 55 | const query = 'SELECT country_name, AVG(aqi_value) AS average_aqi FROM pollutions JOIN cities ON pollutions.city_id = cities.city_id JOIN countries ON cities.country_id = countries.country_id GROUP BY country_name ORDER BY average_aqi'; 56 | db.query(query, (err, results) => { 57 | if (err) { 58 | console.error('Error fetching national urban air quality:', err); 59 | res.status(500).send('Error fetching data'); 60 | return; 61 | } 62 | res.render('national-urban-aq', { countries: results }); 63 | }); 64 | }); 65 | 66 | app.get('/dominant-pollutants', (req, res) => { 67 | const query = 'SELECT pollutant_name, AVG(aqi_value) AS average_aqi FROM pollutions JOIN pollutants ON pollutions.pollutant_id = pollutants.pollutant_id GROUP BY pollutant_name ORDER BY average_aqi DESC'; 68 | db.query(query, (err, results) => { 69 | if (err) { 70 | console.error('Error fetching dominant pollutants:', err); 71 | res.status(500).send('Error fetching data'); 72 | return; 73 | } 74 | res.render('dominant-pollutants', { pollutants: results }); 75 | }); 76 | }); 77 | 78 | app.get('/pollutant-prevalence', (req, res) => { 79 | const query = 'SELECT city_name FROM cities WHERE EXISTS (SELECT * FROM pollutions p1 JOIN pollutants pol1 ON p1.pollutant_id = pol1.pollutant_id AND pol1.pollutant_name = "NO2" WHERE p1.city_id = cities.city_id AND p1.aqi_value > (SELECT p2.aqi_value FROM pollutions p2 JOIN pollutants pol2 ON p2.pollutant_id = pol2.pollutant_id AND pol2.pollutant_name = "CO" WHERE p2.city_id = cities.city_id))'; 80 | db.query(query, (err, results) => { 81 | if (err) { 
82 | console.error('Error fetching pollutant prevalence:', err); 83 | res.status(500).send('Error fetching data'); 84 | return; 85 | } 86 | res.render('pollutant-prevalence', { cities: results }); 87 | }); 88 | }); 89 | 90 | app.get('/urban-centers-profile', (req, res) => { 91 | const query = 'SELECT city_name, pollutant_name, AVG(aqi_value) AS average_aqi FROM pollutions JOIN cities ON pollutions.city_id = cities.city_id JOIN pollutants ON pollutions.pollutant_id = pollutants.pollutant_id GROUP BY city_name, pollutant_name HAVING AVG(aqi_value) > (SELECT AVG(aqi_value) FROM pollutions) ORDER BY average_aqi DESC'; 92 | db.query(query, (err, results) => { 93 | if (err) { 94 | console.error('Error fetching pollution profiles of urban centers:', err); 95 | res.status(500).send('Error fetching data'); 96 | return; 97 | } 98 | res.render('urban-centers-profile', { profiles: results }); 99 | }); 100 | }); 101 | 102 | // Start the server 103 | const PORT = process.env.PORT || 3000; 104 | app.listen(PORT, () => { 105 | console.log(`Server running on port ${PORT}`); 106 | }); 107 | -------------------------------------------------------------------------------- /web-app/mcq.md: -------------------------------------------------------------------------------- 1 | 2 | ### Question 1(e): 3 | 4 | **Which parameter setting for the tool is likely to be best (in the sense that I spend the least time on the task)?** 5 | 6 | #### Problem Context: 7 | - Total Archive Size: 50,000 items. 8 | - Relevant Items: 30 items. 9 | - Manual Time to Find Each Relevant Item (if missed): 15 minutes. 10 | - Time Wasted on Each False Positive: 0.5 minutes (30 seconds). 11 | 12 | **Answer: i. Just right of the center of the graph, where precision goes up again.** 13 | 14 | **Detailed Explanation and Working:** 15 | 16 | #### Choice-by-Choice Analysis: 17 | 18 | 1. **Option i: Just right of the center (80% precision, 90% recall):** 19 | - **Calculations:** 20 | - False Negatives: \( (1 - 0.90) \times 30 = 3 \) 21 | - False Positives: \( \frac{30 \times (1 - 0.80)}{0.80} = 8 \) 22 | - Total Time: 49 minutes. 23 | 24 | 2. **Option ii: To the right (68% precision, 90% recall):** 25 | - **Calculations:** 26 | - False Negatives: 3 27 | - False Positives: 14 28 | - Total Time: 52 minutes. 29 | 30 | 3. **Option iii: Manual search (100% recall):** 31 | - **Calculations:** 32 | - Irrelevant Items: 49,970 × 0.5 minutes 33 | - Total Time: 25,435 minutes. 34 | 35 | 4. **Option iv: To the left (100% precision, 17% recall):** 36 | - **Calculations:** 37 | - False Negatives: 25 38 | - Total Time: 450 minutes. 39 | 40 | 5. **Option v: To the left (100% precision, 17% recall with 30 seconds/item):** 41 | - **Calculations:** 42 | - False Negatives: 25 43 | - Total Time: 88 minutes. 44 | 45 | **Conclusion:** 46 | - Option (i) has the lowest time (49 minutes) and is the most efficient. 47 | 48 | **Real-World Example:** 49 | In search engines, balancing precision and recall helps users find the most relevant results without wasting time on irrelevant ones. 50 | 51 | **Common Pitfalls:** 52 | - Focusing solely on either precision or recall can lead to inefficiency. 53 | - Misunderstanding how false positives and false negatives affect total time. 54 | 55 | **Important Point to Remember:** 56 | - Always consider the trade-offs between precision and recall when optimizing information retrieval tasks. 
57 | -------------------------------------------------------------------------------- /web-app/public/css/styles.css: -------------------------------------------------------------------------------- 1 | /* styles.css */ 2 | body { 3 | font-family: 'Arial', sans-serif; 4 | background-color: #f4f4f4; 5 | margin: 0; 6 | padding: 0; 7 | } 8 | 9 | .container { 10 | width: 80%; 11 | margin: auto; 12 | padding: 20px; 13 | background-color: #fff; 14 | box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); 15 | } 16 | 17 | h1 { 18 | color: #333; 19 | text-align: center; 20 | } 21 | 22 | .query-section { 23 | margin-bottom: 40px; 24 | } 25 | 26 | .query-description { 27 | margin-bottom: 20px; 28 | } 29 | 30 | .query-button, .return-button { 31 | display: block; 32 | width: auto; 33 | padding: 10px 20px; 34 | margin: 10px 0; 35 | background-color: #007bff; 36 | color: white; 37 | text-align: center; 38 | text-decoration: none; 39 | border: none; 40 | border-radius: 5px; 41 | cursor: pointer; 42 | transition: background-color 0.3s ease; 43 | } 44 | 45 | .query-button:hover, .return-button:hover { 46 | background-color: #0056b3; 47 | } 48 | 49 | table { 50 | width: 100%; 51 | border-collapse: collapse; 52 | margin-top: 20px; 53 | } 54 | 55 | th, td { 56 | border: 1px solid #ddd; 57 | padding: 10px; 58 | text-align: left; 59 | } 60 | 61 | th { 62 | background-color: #f2f2f2; 63 | } 64 | 65 | tr:nth-child(even) { 66 | background-color: #f9f9f9; 67 | } 68 | -------------------------------------------------------------------------------- /web-app/public/js/scripts.js: -------------------------------------------------------------------------------- 1 | // scripts.js 2 | 3 | // Function to handle the return to main page action 4 | function returnToMainPage() { 5 | window.location.href = '/'; 6 | } 7 | 8 | // Add event listeners when the DOM is fully loaded 9 | document.addEventListener('DOMContentLoaded', function() { 10 | // Attach an event listener to the 'Return to Main Page' button if it exists 11 | var returnButton = document.querySelector('.return-button'); 12 | if (returnButton) { 13 | returnButton.addEventListener('click', returnToMainPage); 14 | } 15 | 16 | // Additional event listeners can be added here 17 | // ... 18 | }); 19 | -------------------------------------------------------------------------------- /web-app/views/city-rankings.mustache: -------------------------------------------------------------------------------- 1 | <!DOCTYPE html> 2 | <html lang="en"> 3 | <head> 4 | <meta charset="UTF-8"> 5 | <meta name="viewport" content="width=device-width, initial-scale=1.0"> 6 | <title>City Rankings by AQI 7 | 8 | 9 | 10 |
11 |

City Rankings by AQI

12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | {{#rankings}} 22 | 23 | 24 | 25 | 26 | 27 | {{/rankings}} 28 | 29 |
RankCityAverage AQI
{{rank}}{{city_name}}{{average_aqi}}
30 | Return to Main Page 31 |
32 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /web-app/views/dominant-pollutants.mustache: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Dominant Pollutants 7 | 8 | 9 | 10 |
11 |

Dominant Pollutants

12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | {{#pollutants}} 21 | 22 | 23 | 24 | 25 | {{/pollutants}} 26 | 27 |
PollutantAverage AQI
{{pollutant_name}}{{average_aqi}}
28 | Return to Main Page 29 |
30 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /web-app/views/index.mustache: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Pollution Data Dashboard 7 | 8 | 9 | 10 |
11 |

Pollution Data Dashboard

12 | 13 | 14 |
15 |

Research Questions

16 |
17 |

1. Which cities rank highest in terms of overall AQI values?

18 | City Rankings by AQI 19 |
20 |
21 |

2. What is the average urban AQI value for each country, and how do they rank globally?

22 | National Urban Air Quality 23 |
24 |
25 |

3. Across all cities, which pollutant shows the highest average AQI value?

26 | Dominant Pollutants 27 |
28 |
29 |

4. In which cities is the AQI value for NO2 consistently higher than that for CO?

30 | Pollutant Prevalence in Cities 31 |
32 |
33 |

5. For the cities with the highest pollution levels, how do the average AQI values for different pollutants compare?

34 | Pollution Profiles of Urban Centers 35 |
36 |
37 |
38 | 39 | 40 | 41 | -------------------------------------------------------------------------------- /web-app/views/national-urban-aq.mustache: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | National Urban Air Quality Rankings 7 | 8 | 9 | 10 |
11 |

National Urban Air Quality Rankings

12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | {{#countries}} 22 | 23 | 24 | 25 | 26 | 27 | {{/countries}} 28 | 29 |
RankCountryAverage Urban AQI
{{rank}}{{country_name}}{{average_aqi}}
30 | Return to Main Page 31 |
32 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /web-app/views/pollutant-prevalence.mustache: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Pollutant Prevalence in Cities 7 | 8 | 9 | 10 |
11 |

Pollutant Prevalence in Cities

12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | {{#cities}} 21 | 22 | 23 | 24 | 25 | {{/cities}} 26 | 27 |
CityNO2 Higher than CO
{{city_name}}{{no2_higher_than_co}}
28 | Return to Main Page 29 |
30 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /web-app/views/urban-centers-profile.mustache: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Pollution Profiles of Urban Centers 7 | 8 | 9 | 10 |
11 |

Pollution Profiles of Urban Centers

12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | {{#profiles}} 22 | 23 | 24 | 25 | 26 | 27 | {{/profiles}} 28 | 29 |
CityPollutantAverage AQI
{{city_name}}{{pollutant_name}}{{average_aqi}}
30 | Return to Main Page 31 |
32 | 33 | 34 | 35 | --------------------------------------------------------------------------------