Study Number 7724 - European Quality of Life Time Series, 2007 and 2011: Open Access


DATA PROCESSING NOTES


Data Archive Processing Standards

The data were processed to the UK Data Archive's A* standard. This is the
Archive's highest standard, and means that an extremely rigorous and
comprehensive series of checks was carried out to ensure the quality of the
data and documentation. Firstly, checks were made that the number of cases
and variables matched the depositor's records. Secondly, checks were made
that all variables had comprehensible variable labels and all nominal
(categorical) variables had comprehensible value labels; where possible,
with reference to the documentation and/or in communication with the
depositor, labels were edited or created accordingly. Thirdly, logical
checks were performed to ensure that nominal (categorical) variables had
values within the range defined (either by value labels or in the
depositor's documentation). Lastly, any data or documentation that breached
confidentiality rules were altered or suppressed to preserve anonymity.
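
For illustration only, the kind of range check described above can be
reproduced by users in a few lines of Python. This is a minimal sketch using
pandas; the file name, variable name and code set below are hypothetical, not
taken from this study:

    import pandas as pd

    # Hypothetical value-label set for one categorical variable
    valid_codes = {"health": {1, 2, 3, 4, 5}}

    df = pd.read_csv("eqls_2007_2011.tab", sep="\t")  # placeholder file name

    for var, valid in valid_codes.items():
        # Flag any observed codes that fall outside the defined range
        undefined = set(df[var].dropna().unique()) - valid
        if undefined:
            print(f"{var}: values outside defined range: {sorted(undefined)}")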

All notable and/or outstanding problems discovered are detailed under the
'Data and documentation problems' heading below.


Data and documentation problems

None encountered.


Data conversion information

From January 2003 onwards, almost all data conversions have been performed
using software developed by the UK Data Archive. This enables standardisation
of the conversion methods and ensures optimal data quality. In addition to
its own data processing/conversion code, this software uses the SPSS and
StatTransfer command processors to perform certain format translations.
Although data conversion is automated, all data files are also subject to
visual inspection by a member of the Archive's Data Services team.

With some format conversions, data, and more especially internal metadata
(i.e. variable labels, value labels, missing value definitions, data type
information), will inevitably be lost or truncated owing to the differing
limits of the proprietary formats. A UK Data Archive Data Dictionary file
(generally in Rich Text Format (RTF)) is usually provided for each data file,
enabling viewing and searching of the internal metadata as it existed in the
originating format. These files are called:

    [data file name]_UKDA_Data_Dictionary.rtf


Important information about the data format supplied

The links below provide important information about the Archive's data supply
formats. Some of this information is specific to the ingest format of the
data, i.e. the format in which the Archive received the data from the
depositor. The ingest format for this study was SPSS.

Please follow the appropriate link below to see information on your chosen
supply (download) format.


SPSS (*.sav)

SPSS files (*.sav files)

If SPSS was not the ingest format, this format will generally have been
created via the SPSS command processor (e.g. if the ingest format is STATA,
SAS, Excel, or dBase). If the ingest format was non-delimited or fixed-width
text, SPSS files will have been created using SPSS command syntax.

Issues: There is very seldom any loss of data or internal metadata when importing data files into SPSS. Any problems will have been listed above in the Data and Documentation Problems section of this file.
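
As an illustration (not part of the Archive's processing), the internal
metadata of a .sav file can be inspected from Python with the pyreadstat
library; the file name below is a placeholder:

    import pyreadstat

    # Read the data together with its internal metadata, keeping
    # user-defined missing values rather than converting them to NaN
    df, meta = pyreadstat.read_sav("eqls_2007_2011.sav", user_missing=True)

    print(meta.column_names_to_labels)  # variable labels
    print(meta.variable_value_labels)   # value labels per variable
    print(meta.missing_user_values)     # discrete user-missing values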


STATA (*.dta)

STATA (*.dta files)

If STATA was not the ingest format, STATA files will generally have been
created from SPSS via the StatTransfer command processor. Importantly,
StatTransfer's optimisation routine is run so that variables with SPSS write
formats narrower than the data (e.g. numeric variables with 10 decimal places
of data formatted to FX.2) are not rounded upon conversion to STATA, because
they are converted to 'doubles' rather than 'floats'. Discrete user missing
values are copied across into STATA (as opposed to being collapsed into a
single system missing code).

Issues: There are a number of data and metadata handling mismatches between
SPSS and STATA. Where any data or internal metadata has been lost or
truncated, it will be logged in the study's SPSS_to_STATA_conversion RTF
file. Note that the complete internal metadata has been supplied in the UKDA
Data Dictionary file(s): [data file name]_UKDA_Data_Dictionary.rtf
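
For users wishing to verify what survived the conversion, pandas' Stata
reader exposes the labels stored in a .dta file. A minimal sketch (the file
name is a placeholder):

    import pandas as pd

    # Keep user-defined missing values as StataMissingValue objects
    # instead of silently converting them to NaN
    df = pd.read_stata("eqls_2007_2011.dta",
                       convert_categoricals=False,
                       convert_missing=True)

    # Variable and value labels carried over from SPSS
    with pd.io.stata.StataReader("eqls_2007_2011.dta") as reader:
        reader.read()  # labels are parsed while the data is read
        print(reader.variable_labels())
        print(reader.value_labels())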


Tab-delimited text (*.tab)

Tab-delimited text (*.tab files)

If tab-delimited text was not the ingest format, tab-delimited files will
generally have been created via the SPSS command processor, or from Excel and
MS Access files. When exporting from Access data tables to tab-delimited
text, the potentially problematic special characters (tabs, carriage returns,
line feeds, etc.) allowed by Access memo and text fields may have been
removed by the Archive if necessary.

Issues: Date formats in SPSS are always exported to mm/dd/yyyy in
tab-delimited text format, so there may be a mismatch with the documentation
on such variables. Variables that include both date and time, such as
dd-mm-yyyy hh:mm:ss (e.g. 18-JUN-2011 13:28:00), will lose the time
information and become mm/dd/yyyy. All users of the data in tab-delimited
format should consult the UK Data Archive Data Dictionary RTF file(s).
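
For example, a user loading the .tab files in Python would need to parse such
dates explicitly. A sketch (the file and variable names are placeholders):

    import pandas as pd

    df = pd.read_csv("eqls_2007_2011.tab", sep="\t")

    # Dates exported from SPSS arrive as mm/dd/yyyy strings, so state the
    # format explicitly rather than letting pandas guess day-first ordering
    df["interview_date"] = pd.to_datetime(df["interview_date"],
                                          format="%m/%d/%Y")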

If the data were exported from MS Access, more limited 'data documenter'
information is generally available in the RTF variable information files.
These files may also contain SQL setup information.


MS Excel (*.xls/*.xlsx)

MS Excel (*.xls/*.xlsx files)

If MS Excel was not the ingest format, Excel files may have been created via
StatTransfer. The date and time issues noted under the tab-delimited format
may also apply here.


SAS (*.sas7bdat and *.sas)

SAS (*.sas7bdat and *.sas files)

If SAS was not the ingest format, SAS files will usually have been created
via StatTransfer or SPSS. SAS is not one of the Archive's standard supply
formats, and the files are likely to have been created in response to a user
request. The usual format is *.sas7bdat files plus a .sas proc formats file.
Note that the complete internal metadata has been supplied in the
accompanying UK Data Archive Data Dictionary file(s).

Issues: The main loss of information when converting from SPSS to SAS is
user-missing value definitions. By editing the .sas file, the user can choose
whether to collapse all user-missing values into system missing, or to
preserve the values and lose the user-missing definitions. To achieve the
latter, the following section of the .sas file should be removed before
running it:

    /* User Missing Value Specifications */

Note that the complete internal metadata has been supplied in the UKDA Data
Dictionary file(s): [data file name]_UKDA_Data_Dictionary.rtf
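
As an aside, the converted data file itself can be read from Python with the
pyreadstat library (a sketch; the file name is a placeholder):

    import pyreadstat

    # Variable labels are embedded in the .sas7bdat; value labels live in
    # the separate .sas proc formats script and are not applied here
    df, meta = pyreadstat.read_sas7bdat("eqls_2007_2011.sas7bdat")

    print(meta.column_names_to_labels)  # variable labels carried over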


MS Access (*.mdb/*.accdb)

MS Access (*.mdb/*.accdb files)

Due to substantial incompatibilities between versions of MS Access, the
Archive will only make data available in MS Access format if this is the
ingest format and/or the database contains important information in addition
to the data tables (coding information, forms, queries, etc.).


Conversion of documentation formats

The documentation supplied with Archive studies is usually converted to Adobe
Portable Document Format (PDF), with documents bookmarked to aid navigation.
The vast majority of PDF files are generated from MS Word, RTF, Excel or
plain text (.txt) source files, though PDF documentation for older studies in
the collection may have been created from scanned paper documents.
Occasionally, some documentation cannot be usefully converted to PDF (e.g. MS
Excel files with wide worksheets) and this is usually supplied in the
original or a more appropriate format.

--------------------------------------------------------------------------------
/coursebook/modules/m4/hands-on.ipynb:
--------------------------------------------------------------------------------
{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Module 4 hands-on session\n", "\n", "## Description\n", "\n", "In this hands-on session participants are divided into groups of 4 or 5\n", "participants and paired with a helper. Each group should choose one of\n", "the proposed tasks (or propose a new one if keen!) and work together on\n", "it. At the end of the session the students will present their work to\n", "the class.\n", "\n", "The timeline for this session is the following:\n", "\n", "- Phase 1: Presentation of tasks and goals of the hands-on session (15\n", "  minutes)\n", "- Phase 2: Groups are formed and students choose a task to develop (15\n", "  minutes)\n", "- Phase 3: Work on the tasks (2.5 hours, with two 15-minute breaks)\n", "- Phase 4: Group discussion of findings (30 minutes)\n", "- Phase 5: Course wrap-up (30 minutes)\n", "\n", "Students are encouraged to use code developed in the hands-on session\n", "for module 3 and/or the data exploration notebook in Sections 4.3 and\n", "4.4. They can use data from the UK or choose another country. This is\n", "your chance to draw mathematical conclusions from the dataset.\n", "\n", "## Proposed tasks\n", "\n", "1. **Improving the models**: We invite you to try to improve/modify the\n", "   models discussed in *Section 4.4*. Some suggestions:\n", "\n", "   - Adding new variables and/or [interaction\n", "     variables](https://en.wikipedia.org/wiki/Interaction_(statistics))\n", "     to the model. Does this new model improve your understanding?\n", "\n", "   - The dataset is very imbalanced (the majority is \u2018good health\u2019).\n", "     We have addressed this in sections 4.3/4.4 by changing the\n", "     threshold of our p(x) classifier. But there are other ways of\n", "     dealing with an imbalanced dataset (for some ideas see\n", "     [here](https://towardsdatascience.com/how-to-deal-with-imbalanced-data-34ab7db9b100)).\n", "     Investigate how some of these change your modelled conclusions.\n", "\n", "2. **Prediction & Simulation**.\n", "\n", "   - Logistic regression predicts the mean of a Bernoulli\n", "     distribution. Essentially, you get a generative model for each\n", "     combination of predictor variable values. Have a play with\n", "     simulating data from this Bernoulli distribution to generate a\n", "     new dataset of N people (we do this a bit in 4.1 and 4.3). Does\n", "     our simulated dataset look anything like our real dataset?\n", "   - Can you visualise how p(x) changes when you change specific\n", "     variables while keeping the others constant?\n", "   - In the above point we have assumed a single point estimate of\n", "     p(x). But there is uncertainty in our coefficients, and\n", "     therefore uncertainty in our p(x). What if we sample from this\n", "     uncertainty when generating p(x)?\n", "\n", "3. **Comparative analysis with another country**: Up to now we have\n", "   only looked at the UK, but what happens in other countries? How good\n", "   is the performance if you use a model trained with UK data in\n", "   another country? How different is the model (coefficients,\n", "   performance, etc.) trained with data from another country (e.g. Poland\n", "   vs the model of the UK shown in *Section 4.4*)? Can you conclude\n", "   that the same factors have a different impact across countries?\n", "   Feel free to compare any countries you\u2019d like.\n", "\n", "4. **Imputation**: In module 3 we explored missingness in the data, and\n", "   touched on different ways of dealing with this. Here we could\n", "   explore the effect of different methods of imputation. For any\n", "   method of imputation, the critical thing is to compare model output\n", "   on the imputed data with the model output on the unimputed data\n", "   to assess how it changes the conclusions. Some suggestions of\n", "   increasing complexity:\n", "\n", "   - Replacing missing values with the average of the variable.\n", "\n", "   - Sample from a variable\u2019s distribution to fill out the missing rows.\n", "     You could:\n", "\n", "     - sample with replacement from the empirical values\n", "     - create a probability estimate of the distribution (e.g.\u00a0KDE) and\n", "       sample from that.\n", "     - something else\u2026\n", "\n", "   - Model the missing variable as dependent on present variables. You\n", "     could apply our generalised regression framework: pick potential\n", "     predictors, select your distribution for the residuals, see if you\n", "     want a link function other than the identity function.\n", "\n", "We expect you to compare and discuss in detail what you have learned\n", "from these new models, and to think about what the answer to the research\n", "question would be.\n", "\n", "## Final discussion session\n", "\n", "Group discussion with the following points. We don\u2019t have \u201cright\n", "answers\u201d for this discussion.\n", "\n", "1. If the research question asks for an overall assessment of all of\n", "   Europe, how do we appropriately combine the models?\n", "\n", "2. Is there a better way of modelling the dataset given the research\n", "   question?\n", "\n", "3. After everything you\u2019ve done in all the hands-on sessions, what\n", "   would be your answer to the research question? What else has to be\n", "   done?"], "id": "37f0d8de-f7e5-435d-a8ec-43c49eeeb6cd"}], "nbformat": 4, "nbformat_minor": 5, "metadata": {}}
--------------------------------------------------------------------------------
/coursebook/modules/m4/overview.ipynb:
--------------------------------------------------------------------------------
{"cells": [{"cell_type": "markdown", "id": "f7ab275c-1750-46cf-9740-10162e827152", "metadata": {}, "source": ["# Overview\n", "\n", "The key goal of research data science is to learn from data. One of the\n", "most powerful methods of learning from data is **statistical\n", "modelling**.\n", "\n", "We demystify the key concepts involved through applying simple models (linear and logistic regression). The intended take-homes can be applied to any modelling problem.\n", "\n", "The module is structured as follows:\n", "\n", "- **The what and why of statistical modelling**. We begin by defining\n", "  what modelling is and motivating the power of modelling.\n", "- **Fitting models**. Here we go through the components of a model,\n", "  including describing how to fit one to data.\n", "- **Building a simple model**. We then carefully build a model based\n", "  on the understanding of our data, taking care to understand the\n", "  model.\n", "- **Evaluating a model**. It is not enough to have a model that is\n", "  fitted to your data. The model has to be useful. The final section\n", "  will cover how to evaluate your model and iteratively improve upon\n", "  it.\n", "\n", "**References:**\n", "\n", "We will include more specific references as we move through the module,\n", "but useful accessible introductions to modelling that have inspired much\n", "of this module\u2019s content are Poldrack\u2019s [Statistical Thinking for the\n", "21st\n", "Century](https://web.stanford.edu/group/poldracklab/statsthinking21/index.html),\n", "Holmes and Huber\u2019s [Modern Statistics for Modern\n", "Biology](https://web.stanford.edu/class/bios221/book/Chap-Models.html),\n", "as well as the introductory sections of Richard McElreath\u2019s wonderfully\n", "readable [Statistical\n", "Rethinking](https://xcelab.net/rm/statistical-rethinking/) and Bishop\u2019s\n", "classic [Pattern Recognition and Machine\n", "Learning](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)\n", "textbook."]}], "metadata": {"kernelspec": {"display_name": "Python 3.10.4 64-bit", "language": "python", "name": "python3"}, "language_info": {"name": "python", "version": "3.10.6"}, "vscode": {"interpreter": {"hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"}}}, "nbformat": 4, "nbformat_minor": 5}
--------------------------------------------------------------------------------
/documentation/delivery_tips.md:
--------------------------------------------------------------------------------
# TODO
--------------------------------------------------------------------------------
/documentation/developer_instructions.md:
--------------------------------------------------------------------------------
# TODO
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
[tool.poetry]
name = "rds-course"
packages = [
    { include = "coursebook" },
]
version = "0.1.0"
description = ""
authors = ["ChristinaLast