├── README.md
├── lec01_boring_stuff
│   └── README.md
├── lec02_open_save
│   ├── README.md
│   ├── lec02_opensave.do
│   └── managers_epl.xlsx
├── lec03_preparation
│   ├── README.md
│   ├── lec03HW.dta
│   └── lec03_preparation.do
├── lec04_reshape
│   ├── README.md
│   └── lec04_reshape.do
├── lec05_eda
│   ├── README.md
│   └── lec05_eda.do
├── lec06_subsamples
│   ├── README.md
│   └── lec06_subsamples.do
├── lec07_graphs
│   ├── README.md
│   └── lec07_graphs.do
├── lec08_moredatasets
│   ├── README.md
│   ├── lec08_hw_data1.xlsx
│   ├── lec08_hw_data2.csv
│   ├── lec08_hw_data3.dta
│   ├── lec08_hw_data4.dta
│   └── lec08_moredatasets.do
├── lec09_datamanipulation
│   ├── README.md
│   └── lec09_datamanipulation.do
├── lec10_macro_loop
│   ├── README.md
│   └── lec10_macro_loop.do
├── lec11_hypothesis_testing
│   ├── README.md
│   └── lec11_hypothesistesting.do
├── lec12_regression_basics
│   ├── README.md
│   └── lec12_regression_basics.do
├── lec13_presenting_regresults
│   ├── README.md
│   └── lec13_presenting_regresults.do
└── lec14_TSdata
    ├── README.md
    └── lect14_TSdata.do

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Coding for Data Analysis with Stata
Introduction to Data Analysis with Stata - lecture materials
by [László Tőkés](https://www.uni-corvinus.hu/elerhetosegek/tokes-laszlo/) (CUB) with [Ágoston Reguly](https://regulyagoston.github.io/) (Georgia Tech) and [Gábor Békés](https://sites.google.com/site/bekesg/) ([CEU](https://people.ceu.edu/gabor_bekes), [KRTK](https://kti.krtk.hu/en/kutatok/gabor-bekes/5896/), [CEPR](https://voxeu.org/users/gaborbekes0))

This course material is a supplement to **[Data Analysis for Business, Economics, and Policy](https://www.cambridge.org/highereducation/books/data-analysis-for-business-economics-and-policy/D67A1B0B56176D6D6A92E27F3F82AA20)** by Gábor Békés (CEU) and Gábor Kézdi (U. Michigan), Cambridge University Press, 2021.

*Textbook* information: see the textbook's website [**gabors-data-analysis.com**](https://gabors-data-analysis.com/) or visit [Cambridge University Press](https://www.cambridge.org/highereducation/books/data-analysis-for-business-economics-and-policy/D67A1B0B56176D6D6A92E27F3F82AA20)

To get a copy: [Inspection copy for instructors](https://www.cambridge.org/highereducation/books/data-analysis-for-business-economics-and-policy/D67A1B0B56176D6D6A92E27F3F82AA20/examination-copy/personal-details) or [buy from Amazon](https://www.amazon.com/Data-Analysis-Business-Economics-Policy-dp-1108716202/dp/1108716202/ref=mt_other?_encoding=UTF8&me=&qid=) or [order online around the globe](https://gabors-data-analysis.com/order)


## Acknowledgments

We thank the [CEU Department of Economics and Business](https://economics.ceu.edu/) for financial support.


## Status

This is version 1.0. (2022-10-03)

Comments are very welcome by email or as a GitHub issue.

## About this lecture series

This series of lectures offers a brief introduction to Stata in 13+1 lectures, including a summary lecture. The course serves as an introduction to the Stata programming language and software environment for data exploration, data wrangling, data analysis, and visualization. The structure tries to follow the structure of the [textbook](https://gabors-data-analysis.com/), although there are of course some differences: the main organizing principle of the lectures is the logic of Stata, not necessarily the logic of the book. After going through the lectures, students will be able to reproduce the results of the first two parts of the textbook (Data Exploration, and Regression Analysis) in Stata. Moreover, they will hopefully also understand the language of Stata well enough to go on in the textbook and do the exercises of the remaining two parts on their own.

Note that in the lectures I use **Stata 14**; however, all the elements discussed here are forward compatible (and in most cases backward compatible) as well.

Lectures 1 to 11 - complementing [Part I: Data Exploration (Chapter 1-6)](https://gabors-data-analysis.com/chapters/#part-i-data-exploration) - focus on the logic of the Stata language, data preparation and wrangling, exploratory data analysis, and hypothesis testing. Please note that the first lecture is boring, but unfortunately unavoidable. I tried to be as brief as possible there.

Lectures 12 to 14 - complementing [PART II: Regression Analysis (Chapter 7-12)](https://gabors-data-analysis.com/chapters/#part-ii-regression-analysis) - focus on the basics of regression analysis, the presentation of regression results, and visualization.


## Teaching philosophy

We believe in learning by doing, so although the lectures offer a detailed introduction to the topic with many explanations and examples, the more important part is the homework assignments that help students practice. We also recommend that students work through the data exercises at the end of the chapters of the textbook.

This is not a hardcore coding course, but a course to supplement the material of the textbook. The lectures focus on the commands that are needed to reproduce the case studies and to solve the data exercises of the textbook.

The structure of the material reflects these principles. On the one hand, the lectures include pre-written code as an introduction to each topic; on the other hand, the homework assignments and the data exercises of the textbook help students gain experience in coding. In most cases, the pre-written code and the homework assignments reproduce case study results that can be found in the textbook.

## How to use

These lectures can serve as a basis for a course on Stata programming for data wrangling and basic regression analysis.
Although the series is structured and comprehensive enough to stand alone, we recommend teaching (and learning) it hand in hand with the textbook, since almost all examples are taken from the textbook.

This series of lectures requires no prior knowledge of Stata programming.

## Sources

The material is based on experience coming from years of teaching coding and empirical courses at [Corvinus University of Budapest](https://uni-corvinus.hu/), being a research assistant and later researcher, and of course advice from many great resources such as
- [Getting Started with Stata for Windows](https://www.stata.com/bookstore/getting-started-windows/) by Stata Press
- [Economics Lesson with Stata](https://datacarpentry.org/stata-economics/index.html) by Data Carpentry
- UCLA's [Stata Learning Modules](https://stats.oarc.ucla.edu/stata/modules/)
- Kurt Schmidheiny's brief intro [document](https://www.schmidheiny.name/teaching/stataguide2up.pdf)
- [Fundamentals of data analysis and visualization](https://geocenter.github.io/StataTraining/) from a group of instructors
- A huge collection of advanced Stata material on the [Medium site](https://medium.com/the-stata-guide)
- A [great online training](https://www.sscc.wisc.edu/statistics/training/) by the SSCC
- A [four-piece tutorial](https://data.princeton.edu/stata) by Germán Rodríguez from Princeton University

and many others, listed in the lectures' READMEs.


## Lectures, contents, and case studies

The following table gives a brief summary of the lectures: the type of each lecture, the expected learning outcomes, and how it relates to the textbook's case studies and datasets.

| Lecture | Content | Case-study (at least partly) covered | Dataset |
| ------- | ----------------- | ---------- | ------- |
| PART I. | | | |
| [lecture01-boring_stuff](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec01_boring_stuff) | Introduction to the Stata interface and communication. Basics of .do files and the logic of Stata syntax. | - | - |
| [lecture02-open_save](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec02_open_save) | Opening and saving datasets. | - | [football](https://gabors-data-analysis.com/datasets/#football), [hotels-vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna), [wms](https://gabors-data-analysis.com/datasets/#wms-management-survey) |
| [lecture03-preparation](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec03_preparation) | Basics of data wrangling. | [Chapter 01, 1.A1: Finding a Good Deal among Hotels: Data Collection](https://gabors-data-analysis.com/casestudies/#ch01a-finding-a-good-deal-among-hotels-data-collection), [Chapter 02, 2.A1: Finding a Good Deal among Hotels: Data Preparation](https://gabors-data-analysis.com/casestudies/#ch02a-finding-a-good-deal-among-hotels-data-preparation) | [hotels-vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna) |
| [lecture04-reshape](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec04_reshape) | Reshaping multi-dimensional data. Wide and long formats. | [Chapter 02, 2.B1: Displaying Immunization Rates across Countries](https://gabors-data-analysis.com/casestudies/#ch02b-displaying-immunization-rates-across-countries) | [worldbank-immunization](https://gabors-data-analysis.com/datasets/#world-bank-immunization) |
| [lecture05-eda](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec05_eda) | Exploratory data analysis. | [Chapter 03, 3.A1 and 3.A2: Finding a Good Deal among Hotels: Data Exploration](https://gabors-data-analysis.com/casestudies/#ch03a-finding-a-good-deal-among-hotels-data-exploration), [Chapter 03, 3.B1: Comparing Hotel Prices in Europe: Vienna vs. London](https://gabors-data-analysis.com/casestudies/#ch03b-comparing-hotel-prices-in-europe-vienna-vs-london) | [hotels-vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna) |
| [lecture06-subsamples](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec06_subsamples) | Dealing with subsamples using the if condition, the in range, and the bysort prefix. | [Chapter 03, 3.A1 and 3.A2: Finding a Good Deal among Hotels: Data Exploration](https://gabors-data-analysis.com/casestudies/#ch03a-finding-a-good-deal-among-hotels-data-exploration), [Chapter 03, 3.B1: Comparing Hotel Prices in Europe: Vienna vs. London](https://gabors-data-analysis.com/casestudies/#ch03b-comparing-hotel-prices-in-europe-vienna-vs-london) | [hotels-vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna) |
| [lecture07-graphs](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec07_graphs) | Making graphs. | [Chapter 07, 7.A1 and 7.A2: Finding a good deal among hotels with simple regression](https://gabors-data-analysis.com/casestudies/#ch07a-finding-a-good-deal-among-hotels-with-simple-regression) | [hotels-vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna) |
| [lecture08-moredatasets](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec08_moredatasets) | Combining datasets: adding observations (append) and variables (merge). | [Chapter 02, 2.C1: Identifying Successful Football Managers](https://gabors-data-analysis.com/casestudies/#ch02c-identifying-successful-football-managers) | [football](https://gabors-data-analysis.com/datasets/#football) |
| [lecture09-datamanipulation](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec09_datamanipulation) | Manipulating data: producing new variables and changing existing ones. Deleting variables and observations. | [Chapter 04, 4.A1: Management Quality and Firm Size: Describing Patterns of Association](https://gabors-data-analysis.com/casestudies/#ch04a-management-quality-and-firm-size-describing-patterns-of-association) | [wms](https://gabors-data-analysis.com/datasets/#wms-management-survey) |
| [lecture10-macro_loop](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec10_macro_loop) | Working with local and global macros, applying loops, and using stored results. | - | [wms](https://gabors-data-analysis.com/datasets/#wms-management-survey), [football](https://gabors-data-analysis.com/datasets/#football) |
| [lecture11-hypothesis_testing](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec11_hypothesis_testing) | Testing hypotheses. | [Chapter 06, 6.A1, 6.A2, and 6.A3: Comparing online and offline prices: testing the difference](https://gabors-data-analysis.com/casestudies/#ch06a-comparing-online-and-offline-prices-testing-the-difference) | [billion-prices](https://gabors-data-analysis.com/datasets/#billion-prices) |
| PART II. | | | |
| [lecture12-regression_basics](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec12_regression_basics) | Basics of regressions: fitting, predicting, dummy variables, and interaction terms. | [Chapter 07, 7.A1, 7.A2, and 7.A3: Finding a good deal among hotels with simple regression](https://gabors-data-analysis.com/casestudies/#ch07a-finding-a-good-deal-among-hotels-with-simple-regression) | [hotels-vienna](https://gabors-data-analysis.com/datasets/#hotels-vienna) |
| [lecture13-presenting_regresults](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec13_presenting_regresults) | Presenting regression results nicely and compactly. | [Chapter 10, 10.A1: Understanding the gender difference in earnings](https://gabors-data-analysis.com/casestudies/#ch10a-understanding-the-gender-difference-in-earnings) | [cps-earnings](https://gabors-data-analysis.com/datasets/#cps-earnings) |
| [lecture14-TSdata](https://github.com/gabors-data-analysis/da-coding-stata/tree/main/lec14_TSdata) | Basics of time series data commands. | [Chapter 12, 12.A1: Returns on a company stock and market returns](https://gabors-data-analysis.com/casestudies/#ch12a-returns-on-a-company-stock-and-market-returns) | [sp500](https://gabors-data-analysis.com/datasets/#sp500) |


## Found an error or have a suggestion?

Awesome, we know there are errors and bugs, or simply much better ways to do a procedure.

To make a suggestion, please open a `GitHub issue` here with a title containing the case study name. You may also [contact us directly](https://gabors-data-analysis.com/contact-us/).

--------------------------------------------------------------------------------
/lec01_boring_stuff/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec01_boring_stuff/README.md
--------------------------------------------------------------------------------
/lec02_open_save/README.md:
--------------------------------------------------------------------------------
# Lecture 02: Opening and saving data

## Motivation

Datasets can be stored in many types of files; our first task is to import them into Stata. Conversely, the dataset we have successfully prepared for analysis, as well as the results of our analysis, should be saved or exported. This is the topic of this lecture.

## This lecture

This lecture introduces students to opening and saving datasets, focusing on the most common file types such as Excel tables, delimited text files, and Stata's own file format, the .dta.

## Learning outcomes

After successfully completing the code in `lec02_opensave.do`, students should be able to:

- Import and export .xls, .xlsx, and .csv files.
- Open and save .dta files.
- Execute some basic commands in order to communicate with their computer via Stata.

## Datasets used

* [managers_epl.xlsx](https://osf.io/pu3vx)
* [hotels-vienna.csv](https://osf.io/y6jvb)
* [wms_da_textbook.csv](https://osf.io/uzpce)

## Lecture Time

Ideal overall time: **30 mins**.

## Homework

*Type*: quick practice, approx. 20 mins

Let us now throw you in at the deep end and give you a homework assignment that may be a bit hard at first. Use the help menu, or the search command if needed. Do the following:
- Define a default folder.
- Create a new subfolder called all_data in your default folder.
- Download into this all_data folder all the datasets from the book. You can find them as a .zip file: [da_data_repo.zip](https://osf.io/9gw4a). It may take some time, since the size of the .zip is about 180 MB.
- Unzip the data.
- Import the wms_da_textbook.csv file (which can be found in the wms-management-survey folder).
- Save it as a .dta file into your default folder.
- Delete the all_data folder with all its content.
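If you get stuck, the sketch below shows one possible route through the main steps. Try the exercise on your own first. The sketch is hedged: the default folder path is a placeholder, the OSF download URL is assumed to follow the pattern used in the lecture's do-file, and the folder layout inside the .zip is an assumption you should verify after unzipping.

```stata
cd "C:/stata_course"                  // placeholder: use your own default folder
mkdir "all_data"                      // new subfolder
* assumed OSF download URL pattern; the file is about 180 MB, so this takes a while
copy "https://osf.io/download/9gw4a/" "all_data/da_data_repo.zip", replace
cd "all_data"
unzipfile "da_data_repo.zip"
cd ..
* the path below assumes the zip unpacks into a folder containing wms-management-survey
import delimited "all_data/wms-management-survey/wms_da_textbook.csv", clear
save "wms_da_textbook.dta", replace
* note: 'rmdir' only removes empty folders, so 'erase' the files first,
* or pass a delete command to your operating system via 'shell'
```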

## Detailed online descriptions of the most important commands from the Stata website

- Importing and exporting text-delimited files: [here](https://www.stata.com/manuals/dimportdelimited.pdf)
- Importing and exporting Excel files: [here](https://www.stata.com/manuals/dimportexcel.pdf)

## Further material

- Stata Press' Getting Started with Stata for Windows discusses the topic in [Chapter 8](https://www.stata.com/manuals/gsw8.pdf).

- The Stata User Guide discusses the importing of datasets in detail in [Chapter 22](https://www.stata.com/manuals/u22.pdf).

- The [relevant part](https://sscc.wisc.edu/sscc/pubs/dws/data_wrangling_stata2.htm) of SSCC's Stata for Students course gives a detailed description of reading in data.
--------------------------------------------------------------------------------
/lec02_open_save/lec02_opensave.do:
--------------------------------------------------------------------------------
***********************************************************************************************************************
***********************************************************************************************************************
***************************************************** Lecture 02 ******************************************************
***********************************************************************************************************************
********************************************** Opening and saving datasets ********************************************
***********************************************************************************************************************
***********************************************************************************************************************

*************** The default folder
*** At the bottom left part of the Stata window you can see the default folder...
*** ... Stata uses this folder to open and save files...
*** ... However, you can define a new default folder using the 'cd' command:
cd "da-coding-stata/lec02_open_save/"


*************** Opening an Excel file (and some practice of the language)
*** You should use the 'use' command to open .dta files, ...
*** ... and you have the 'import' command for all other commonly used formats (.xls or .xlsx, .csv, .txt, etc.).

*** Let us start with importing the managers_epl.xlsx...
*** ... which contains some data about football managers (see case studies 2.C1 and 2.C3 in Chapter 2).

*** We will build this syntax step by step to see the logic of the language of Stata.
import excel using "managers_epl.xlsx" /*'import', filetype, 'using', then the name of the file*/
*** It is important to note that you should be aware whether your file is an .xls or an .xlsx...
*** ... You should define the extension accordingly, otherwise an error message will occur.

*** Let us now check the data table:
browse

*** As you can see, the first sheet of the Excel file is imported (it is the default)...
*** ... Suppose that we would like to import the EFL Championship data, i.e. the second sheet,...
*** ... so modify the syntax accordingly.

*** As you may remember from Lecture 1, the command itself does the main job...
*** ... On the other hand, options of the command can be used to slightly alter the final result...
*** ... Let us use the 'sheet' option to import the second sheet instead of the first one:
import excel using "managers_epl.xlsx", sheet("efl-championship")

*** You may realize that we get an error message: 'no; data in memory would be lost':
*** Stata tries to protect the users from themselves,...
*** ... and if they have a dataset in the memory (now we have the first sheet), Stata forces them to:
* either drop the dataset from the memory (the 'clear' option)
* or save the dataset.

*** Let us now drop it, since we do not need it, so change the syntax again:
*** The main task is still the same: importing data, but we need another alteration, i.e. clearing the memory.
*** So, let us use another option, the -clear- option:
import excel using "managers_epl.xlsx", sheet("efl-championship") clear

*** IMPORTANT note: the order of the options is irrelevant in Stata,...
*** ... so import excel using "managers_epl.xlsx", clear sheet("efl-championship") is also okay.

*** Let us again check the data table:
browse

*** Now we are almost ready, since we have the second sheet open, but there is still a problem: ...
*** ... Stata interpreted the names of the variables (first row of our Excel file)...
*** ... as an observation (they appear in the first row in Stata).

*** As a final step, let us fix this problem using the 'firstrow' option:
import excel using "managers_epl.xlsx", sheet("efl-championship") clear firstrow
browse
*** And now, we are ready. Let us save this data as a .dta file using the 'save' command.


*************** Saving data as .dta
*** We also use the 'replace' option that forces Stata to overwrite the file if it already exists.
save "managers_epl.dta", replace

*** Then let us clear the memory.
clear


*************** Opening a .dta file
*** We can open a .dta file using the 'use' command.
*** Let us open the managers_epl dataset we saved before.
use "managers_epl.dta"

*** Please note that 'clear' can be used as an option as well, so the following is also okay:
use "managers_epl.dta", clear


*************** Copying data from the web
*** It is also possible to copy files from the internet to your computer, and then open them.
copy "https://osf.io/download/y6jvb/" "hotels-vienna.csv", replace

*** After the 'copy' command, you should first define the online path where the file is stored,...
*** ... and then the name to be used for the copied file.

*** Note that for files stored at osf.io, the path can be obtained by...
*** ... clicking the three-dot button in the top right corner...
*** ... In most other cases, you can simply copy-paste the URL from the address bar.


*************** Opening a .csv file
*** Now, let us import the Hotels in Vienna dataset we downloaded before:
import delimited "hotels-vienna.csv", clear

*** Some important notes:
* Stata checks by default whether the file is delimited by tabs or commas, based on the first line of data.
* However, in case of other delimiters, you should use the 'delimiters' option to define the delimiter.
* There is no 'firstrow' option like we had in the case of an .xls or .xlsx...
* ... By default, Stata tries to determine whether the file includes variable names...
* ... Using the 'varnames(#)' option you can specify the row (#) in which variable names can be found,...
* ... and Stata will ignore (not import) the rows before #.


*************** Opening data directly from webpages and online storage
*** Note that data can be imported directly from the web (without downloading it in advance).
*** For example, the .csv we imported in the previous section can be opened directly from gaborsdata's OSF.
import delimited "https://osf.io/y6jvb/download", clear

*** Other files (.dta, .xls, etc.) can also be opened directly from the web.
*** For example, let us open a dataset about cars from Stata's webpage.
use "http://www.stata-press.com/data/r17/auto.dta", clear

*** Datasets from the Stata website can be opened even more easily, simply using the 'webuse' command, without any URLs:
webuse "auto.dta", clear

*** It is worth noting that there are some online databases that can be reached via Stata commands, for example:
* FRED: 'freduse'. Video tutorial: https://www.youtube.com/watch?v=CegDUU-eFGM
* World Bank: 'wbopendata'. Tutorial: https://blogs.worldbank.org/opendata/accessing-world-bank-open-data-stata


*************** Saving and exporting data
*** As we showed before, a .dta file can be saved with the 'save' command.
*** However, if you would like to save data as an Excel or delimited file, you should use the 'export' command...
*** ... Options are similar to the ones we discussed in the case of 'import'.


*************** Communicating with your computer via Stata
*** There are some commands that can be helpful when you open and save data:
* You can create or remove folders: 'mkdir', 'rmdir'.
* You can also create (copy from somewhere else) or delete files: 'copy', 'erase'.
* You can list the names of files in the specified directory: 'dir' or 'ls'.
* You can deal with zip files: 'zipfile', 'unzipfile'.
* You can send commands to your operating system: 'shell'.

*** You are invited to use some of these commands in your homework assignment.
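
*** As a quick, hedged illustration of 'export' and the folder commands above...
*** ... (the file and folder names here are made up for the example; the auto dataset is still in memory):
export excel using "my_export.xlsx", firstrow(variables) replace
export delimited using "my_export.csv", replace
mkdir "my_temp_folder"
dir
erase "my_export.xlsx"
erase "my_export.csv"
rmdir "my_temp_folder"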
141 | 142 | -------------------------------------------------------------------------------- /lec02_open_save/managers_epl.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec02_open_save/managers_epl.xlsx -------------------------------------------------------------------------------- /lec03_preparation/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec03_preparation/README.md -------------------------------------------------------------------------------- /lec03_preparation/lec03HW.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec03_preparation/lec03HW.dta -------------------------------------------------------------------------------- /lec03_preparation/lec03_preparation.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 03 ****************************************************** 4 | *********************************************************************************************************************** 5 | ********************************************** Preparing data for analysis ******************************************** 6 | *********************************************************************************************************************** 7 | 
***********************************************************************************************************************

*************** The default folder
*** Defining the default folder
cd "da-coding-stata/lec03_preparation/"


*************** Logging
*** Preparation (and also the analysis later) should be done in a way that is easy to trace,...
*** ... repeat, and reproduce. Traceability requires documenting our work...
*** ... One possible approach to documentation is logging, which can be done with a log file...
*** ... A log file stores (saves) everything that appears in the main window.

*** Let us launch a log file at this point:
log using "myfirstlogfile.smcl", replace
* The 'replace' option means that we will overwrite the log file if it already exists.

*** At the end of the lecture we will close the log file.


*************** Copying and opening the dataset
copy "https://osf.io/download/dn8je/" "hotels-vienna.dta", replace
use "hotels-vienna.dta", clear


*************** The data table
*** Let us take a look at the data table itself:
browse

*** As you can see, the structure of the data table is the following:
* Observations (here, hotels) are stored in the rows.
* Variables (-hotel_id-, -country-, etc.) are stored in the columns.

*** You can also realize that the values of some variables are red, while some others are black:
* Black color means that the values are numerical (numbers).
* Red color means that the values are stored as strings.
* It can happen that they are in fact numbers, but somehow they are stored as strings.
* For string numbers see for example the -ratingta- variable's values in rows 1, 2, 3, etc.
* In fact a third color can also appear: blue, which means numerical content with some labels.


*************** The variables at a glance
*** After taking a look at the data table, let us also check the variables:
describe
*** As you can see, there are 428 observations (rows) and 24 variables (columns) in our dataset.

*** You can see some pieces of information about the variables. The most important ones are the following:
* name (that you can use to refer to the variable)
* storage type:
* byte, int, long, float, and double are numerical variables with different precision
* str is string content, and the number refers to its length
* a detailed description can be found by typing: 'help data_types'
* labels (variable and value) are also important, we will get back to them later in this lecture
* Under the table you can find a part "Sorted by..."...
* ... It shows which variables your observations are sorted by...
* ... Now, first by -city-; if there are duplicates in -city-, then by -hotel_id-, etc.


*************** Sorting and ordering
*** As we have mentioned before, observations are now sorted by -city-, -hotel_id-, etc.

*** Suppose that we would like to see the hotels sorted by their user ratings, so:
sort rating
*** Check the browse window. Or, you can also type 'describe', and check the bottom of the table.

*** If we would like to sort in descending order, we should type:
gsort -rating
*** Check the browse window again.

*** We are focusing on -rating- now, so why not move it to the first column?
order rating
*** Check the browse window again, and you can see that -rating- is in the first column.

*** Note: We can use more than one variable as well with 'sort', 'gsort', and 'order'.


*************** The identifier
*** As it was mentioned before, observations are stored in the rows...
*** ... In most datasets we also have a variable that uniquely identifies the observations...
*** ... Here we may have a feeling that this unique identifier is the variable called -hotel_id-. Let us check it:
isid hotel_id

*** If the particular variable is not a unique identifier, Stata warns us; otherwise it remains silent.
*** Based on the silence after 'isid hotel_id', -hotel_id- is a unique identifier.

*** Let us check whether -country- is a unique identifier:
isid country
*** As we expected, no, it is not: the variable -country- does not uniquely identify the observations.

*** 'isid' can be used with more than one variable; for example,...
*** ... 'isid country city' will test whether -country- and -city- together can uniquely identify the observations:
isid country city
*** Still not...


*************** Duplicates
*** If a variable (or a group of variables) cannot serve as a unique identifier,...
*** ... it means that we have duplicates based on that variable (or variables)...
*** ... Duplicates mean: multiple rows with the same values of the particular variable(s).

*** Let us analyse the -price- variable:
isid price
*** As you can see, -price- is not a unique identifier; we have more hotels with the same price. Let us investigate it.
duplicates report price
*** We have 73 -price- values which are unique, 31 values that appear twice (the same for two hotels), etc.

*** With 'duplicates examples' you can list one example for each group of duplicated observations.
duplicates examples price

*** We can also tag the duplicated observations:
duplicates tag price, generate(dupl)
browse
*** As you can see, a new variable (called -dupl-) was created,...
*** ... which represents the number of duplicates for each observation.
121 | 122 | *** We can of course remove the duplicated observations (we will not do this now) using the 'duplicates drop' command. 123 | 124 | *** Note that all the 'duplicates' commands can be used with more than one variable (as in the case of 'isid'). 125 | 126 | 127 | *************** Discovering missing values 128 | *** Now that we are done with entity resolution, let us move on to another important aspect: missing values. 129 | 130 | *** Let us still focus on the -price- variable, and check whether it contains missing values. 131 | codebook price 132 | *** 'codebook' is a command to present some descriptive statistics - see Lecture 05 -,... 133 | *** ... here we focus on just one piece of information it generates: the number of missings. 134 | *** As can be seen, -price- contains no missing values. 135 | 136 | *** 'codebook' can be used for more than one variable: 137 | codebook price distance stars rating_count 138 | *** The -rating_count- variable contains 35 missing values, while all the other variables have zero missings. 139 | 140 | *** IMPORTANT note: 141 | * Check the 'browse' window, and you will realize that in the case of numerical variables, ... 142 | * ... a missing value is represented by a dot (.) 143 | * Although here all the values of all string variables are non-missing (check it with 'codebook'),... 144 | * ... they can be missing as well, of course. Missing string cells simply appear empty. 145 | 146 | *** How to deal with missing values? ... 147 | *** ... According to the book, we can either drop the problematic observations or do imputation... 148 | *** ... We will see the programming side of these solutions later. 149 | 150 | 151 | *************** Labeling 152 | *** "Order is the key to all problems," writes Dumas, and yes, labels help you to create order. 153 | 154 | *** To ease handling, it is advisable to give short names to variables in Stata,... 155 | *** ... and store all relevant information somewhere else.
Somewhere else can mean labels. 156 | 157 | *** According to the Excel sheet called VARIABLES (that you can find at OSF), -rating- means "User rating average"... 158 | *** ... Let us attach this piece of information to the variable by creating a label. 159 | label variable rating "User rating average" 160 | 161 | *** 'label variable' is the command for labeling variables: 162 | * First you should define the name of the variable to label, ... 163 | * ... then the label itself between " ". 164 | 165 | *** We can also label the values of variables... 166 | *** ... For that, we should define a value label, then attach it to the values of a variable. 167 | *** Let us see an example: ... 168 | *** ... the variable -offer- is a dummy indicating whether there is an offer for that particular room or not... 169 | *** ...If you check the browse window, you can see that the variable contains 0s and 1s. 170 | 171 | *** Let us first define the value label: 172 | label define yesno 0 "no" 1 "yes" 173 | * 'label define' is the command itself,... 174 | * ... then the name of the value label (yesno) we are creating,... 175 | * ... then a value (0), then its label (no),... 176 | * ... then the next value (1), then its label (yes) etc. 177 | 178 | *** Then attach this to the variable -offer-: 179 | label values offer yesno 180 | 181 | *** Let us now check the browse window:... 182 | *** ... values of the variable -offer- are now presented in blue - we can see the labels (yes or no),... 183 | *** ... but there are numbers behind (try to click in a cell, and check it). 184 | browse 185 | 186 | *** Finally, you can label the dataset itself (up to 80 characters): 187 | label data "Dataset contains information on hotels, hostels, and others from Vienna, 2017." 188 | 189 | *** Where do we see this label? 190 | *** Execute the 'describe' command, and check the top right corner. 191 | 192 | 193 | *************** Notes 194 | *** A great advantage of labels is that they are always visible ...
195 | *** ... in the Variables window, or in the Properties window, or in the Browse window... 196 | *** ... On the other hand, the quantity and length of labels are limited... 197 | *** ... Notes can ease this problem. Let us attach a note to the dataset. 198 | note: Data was collected using a web scraping algorithm. 199 | 200 | *** Let us attach another note. 201 | note: Data refers to the same weekday night in November, 2017. 202 | 203 | *** Now we have two notes attached to the dataset, but how can we see them? Let us type the following: 204 | notes 205 | *** This command shows all the notes we added... 206 | *** ... Alternatively, clicking the small + sign near notes in the Properties window also makes them visible. 207 | 208 | *** We can also add time stamps using 'TS', to make traceability even more efficient: 209 | note: TS I added some labels and notes today. 210 | 211 | *** Notes can be attached to the variables as well. 212 | note ratingta: This is the average user rating via Tripadvisor. 213 | *** Let us check all the notes again. 214 | notes 215 | 216 | *** Notes (that become irrelevant) can also be dropped. Let us drop the first note attached to the dataset. 217 | notes drop _dta in 1 218 | 219 | 220 | *************** Saving 221 | *** Let us save the dataset. 222 | save "hotels-vienna_prep.dta", replace 223 | 224 | 225 | *************** Closing the log 226 | *** Finally, let us close the log file we launched at the beginning. 227 | log close 228 | -------------------------------------------------------------------------------- /lec04_reshape/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 04: Reshaping data 2 | 3 | ## Motivation 4 | 5 | In Lecture 3 we got to know the basics of data wrangling, i.e. getting data prepared for analysis. In this lecture we still focus on data wrangling, dealing briefly with restructuring multi-dimensional data.
Nowadays a great proportion of datasets is panel data (a special kind of multi-dimensional data), so an analyst should know the difference between long format and wide format, how to restructure one into the other, and how to create tidy data. This is the topic of this lecture. 6 | 7 | ## This lecture 8 | 9 | This lecture introduces students to the usage of the reshape command. 10 | 11 | Case studies (at least partly) connected to this lecture: 12 | - [Chapter 02, 2.B1: Displaying Immunization Rates across Countries](https://gabors-data-analysis.com/casestudies/#ch02b-displaying-immunization-rates-across-countries) 13 | 14 | ## Learning outcomes 15 | After successfully completing the code in `lecture04_reshape.do` students should be able to: 16 | 17 | - Reshape datasets from wide form to long form and vice versa. 18 | 19 | ## Datasets used 20 | 21 | * In the lecture: [worldbank-immunization-panel.dta](https://osf.io/ku4fd) 22 | * In the homework: [worldbank-immunization-continents.csv](https://osf.io/58zrj) 23 | 24 | ## Lecture Time 25 | 26 | Ideal overall time: **20 mins**. 27 | 28 | ## Homework 29 | 30 | *Type*: quick practice, approx 15 mins 31 | 32 | Import the [worldbank-immunization-continents.csv](https://osf.io/58zrj) dataset that contains immunization rate and survival rate data for the following regions at a yearly frequency: East Asia and Pacific (EAS), Europe and Central Asia (ECS), Latin America and the Caribbean (LCN), Middle East and North Africa (MEA), North America (NAC), South Asia (SAS), and Sub-Saharan Africa (SSF).
Reshape the data 33 | 34 | ## Detailed online descriptions for the most important commands from the Stata website 35 | 36 | - [reshape](https://www.stata.com/manuals/dreshape.pdf) 37 | 38 | ## Further material 39 | 40 | - The second part of the [3rd lesson](https://datacarpentry.org/stata-economics/03-transform-data/index.html) of Data Carpentry's Economics Lesson with Stata course offers a detailed description of the topic with many exercises. 41 | 42 | - UCLA has two modules with many examples on reshaping data: [wide to long](https://stats.oarc.ucla.edu/stata/modules/reshaping-data-wide-to-long/) and [long to wide](https://stats.oarc.ucla.edu/stata/modules/reshaping-data-long-to-wide/). 43 | -------------------------------------------------------------------------------- /lec04_reshape/lec04_reshape.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 04 ****************************************************** 4 | *********************************************************************************************************************** 5 | *************************************************** Reshaping data **************************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | *** Defining the default folder 11 | cd "da-coding-stata/lec04_reshape" 12 | 13 | *************** Copying and opening data 14 | *** We use the worldbank-immunization-panel.dta file 
here, so first let us copy it to our computer: 15 | copy "https://osf.io/download/ku4fd/" "worldbank-immunization-panel.dta" 16 | 17 | *** Now let us open it: 18 | use "worldbank-immunization-panel.dta", clear 19 | 20 | *** Let us check the data: 21 | browse 22 | *** As you can see, it is now in long format. Next, we will restructure this data into wide format. 23 | 24 | 25 | *************** Reshaping 26 | *** Let us first define the steps of reshaping in general: 27 | 28 | * Step 1: Determining whether our data is now long (then we can go to wide) or wide (then we can go to long). 29 | * You should define the target structure after the reshape command. 30 | * Now we have a long structure, so we can go to wide. 31 | 32 | * Step 2: Determining the logical observation, i.e. the group identifier. 33 | * You should define this variable using option 'i()'. 34 | * Now it is a country identifier, so we can use either -c-, -countryname-, or -countrycode-. 35 | 36 | * Step 3: Determining the subobservation, i.e. the within-group identifier. 37 | * In the case of a wide structure we do not have such a variable; it will be created by the 'reshape' command. 38 | * You should define this variable using option 'j()'. 39 | * Now it is time, so we can use the -year- variable. 40 | 41 | * Step 4: Determining the so-called stubnames, which are the stubs of the variables to be converted (reshaped). 42 | * You should define them after 'reshape' and the target structure. 43 | * Now they are the following ones: -pop-, -mort-, -surv-, -imm-, -gdppc-, and -lngdppc-. 44 | 45 | * Step 5: Execute the reshape based on the previous steps. 46 | 47 | * So, let us reshape our data to wide accordingly: 48 | reshape wide pop mort surv imm gdppc lngdppc, i(countrycode) j(year) 49 | 50 | *** Let us check the data: 51 | browse 52 | *** Now it is in wide format.
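*** As a quick check (a sketch, not part of the original lecture): after 'reshape wide',...
*** ... each stub carries the corresponding -year- value as a suffix (the exact years depend on the data).
*** We can list the new variable names of one stub with a wild card:
describe pop*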
53 | 54 | 55 | *************** Getting back to the original structure 56 | *** If you have done a reshape, and you would like to get back to the original structure, ... 57 | * ... simply type 'reshape' and the original structure. Here for example: 58 | reshape long 59 | 60 | browse 61 | *** Now it is in long format again. 62 | 63 | 64 | *************** Common mistakes 65 | *** Let us now mention the most common mistakes. Reshape will fail if: 66 | * The data are in the wide form, and the 'i()' variable has non-unique values. 67 | * The data are in the long form, and the 'j()' variable has non-unique values within 'i()'. 68 | * The data are in the long form, and a variable that is not listed as a stub is not constant within 'i()'. 69 | 70 | 71 | *************** Two important notes 72 | *** If 'reshape' fails, the 'reshape error' command lists the problematic observations. 73 | 74 | *** 'reshape' also works for string 'j()' values. In your homework you are invited to solve a string case. 75 | -------------------------------------------------------------------------------- /lec05_eda/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 05: Exploratory data analysis 2 | 3 | ## Motivation 4 | 5 | After data wrangling, the next step is exploratory data analysis (EDA), which is done for five main reasons: 6 | - To know whether your data is clean and ready for analysis. 7 | - To guide subsequent analysis. 8 | - To give context. 9 | - To spare further effort on analysis, since sometimes EDA is enough to answer the question. 10 | - To be able to ask more questions. 11 | 12 | 13 | ## This lecture 14 | 15 | This lecture introduces students to the usage of the most important EDA commands. Students will be able to visualize data with histograms, box plots and violin plots, and to calculate the most important summary statistics focusing on measures of central value and spread.
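As a taste of the commands covered (a sketch only; it assumes the hotels-vienna dataset used in the case studies is already loaded):

```stata
* Summary statistics of price: central value and spread
summarize price, detail
tabstat price, statistics(mean median sd iqr range)

* Visualizing the distribution of price
histogram price, percent
graph box price
```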
16 | 17 | Case studies (at least partly) connected to this lecture: 18 | - [Chapter 03, 3.A1 and 3.A2: Finding a Good Deal among Hotels: Data Exploration](https://gabors-data-analysis.com/casestudies/#ch03a-finding-a-good-deal-among-hotels-data-exploration) 19 | - [Chapter 03, 3.B1: Comparing Hotel Prices in Europe: Vienna vs. London](https://gabors-data-analysis.com/casestudies/#ch03b-comparing-hotel-prices-in-europe-vienna-vs-london) - dealing only with Vienna 20 | 21 | This lecture uses the [hotels-vienna.dta](https://osf.io/dn8je) dataset. 22 | 23 | ## Learning outcomes 24 | After successfully completing the code in `lecture05_eda.do` students should be able to: 25 | 26 | - Calculate standard measures of central value and spread (mean, minimum, maximum, median, quantiles, mode, range, interquartile range, standard deviation, variance etc.). 27 | - Make histograms, box plots and violin plots (without scaffolding, since we discuss it in a later lecture). 28 | 29 | ## Datasets used 30 | 31 | * [hotels-vienna.dta](https://osf.io/dn8je) 32 | 33 | ## Lecture Time 34 | 35 | Ideal overall time: **35 mins**. 36 | 37 | ## Homework 38 | 39 | *Type*: quick practice, approx 25 mins 40 | 41 | In this homework you are invited to do some exploratory data analysis on the Hotels in Vienna dataset. Do the following exercises: 42 | - Determine the number of observations and variables. 43 | - Create two histograms of accommodation types: one with frequencies, and another one with percentages. 44 | - Reproduce figures 3.1 (p. 63) and 3.3 (p. 64) for the whole sample (in the textbook we use only the data of hotels, so the number of observations is only 264). 45 | - Calculate the most important descriptive statistics (find them in Table 3.5 on p. 74) for the price and distance variables. Also visualize the results using box plots or violin plots.
46 | 47 | 48 | ## Detailed online descriptions for the most important commands from the Stata website 49 | 50 | - [summarize](https://www.stata.com/manuals/rsummarize.pdf) 51 | - [tabstat](https://www.stata.com/manuals/rtabstat.pdf) 52 | - [tabulate oneway](https://www.stata.com/manuals/rtabulateoneway.pdf) 53 | - [tabulate twoway](https://www.stata.com/manuals/rtabulatetwoway.pdf) 54 | - [tabulate, summarize](https://www.stata.com/manuals/rtabulatesummarize.pdf) 55 | - [histogram](https://www.stata.com/manuals/rhistogram.pdf) 56 | - [density plot](https://www.stata.com/manuals16/rkdensity.pdf) 57 | - [box plot](https://www.stata.com/manuals16/g-2graphbox.pdf) 58 | 59 | 60 | ## Further material 61 | 62 | - The [relevant part](https://sscc.wisc.edu/sscc/pubs/intro_stata/intro_stata5.htm) of SSCC's Stata for Students course presents a description of some summary statistics commands. 63 | -------------------------------------------------------------------------------- /lec05_eda/lec05_eda.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 05 ****************************************************** 4 | *********************************************************************************************************************** 5 | ********************************************** Exploratory data analysis ********************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | ***
Defining the default folder 11 | cd "da-coding-stata/lec05_eda" 12 | 13 | 14 | *************** Copying and opening data 15 | *** We use the Hotels in Vienna dataset that we prepared in Lecture 3. 16 | *** So first, let us copy that file into the folder of the current lecture. 17 | copy "da-coding-stata/lec03_preparation/hotels-vienna_prep.dta" "hotels-vienna_prep.dta", replace 18 | use "hotels-vienna_prep.dta", clear 19 | 20 | 21 | *************** Tidy structure 22 | describe, short 23 | *** We have 428 observations (rows) and 24 variables (columns). 24 | 25 | *** We will focus on variables -rating-, -ratingta-, -price- and -offer-. 26 | describe rating ratingta price offer 27 | *** -rating-: the user rating average (as shown by the label) - numerical. 28 | *** -ratingta-: the average user rating via Tripadvisor - string. 29 | * As you may remember, we attached a note to this variable in Lecture 3... 30 | * ... Can you reveal this note now? 31 | *** -offer-: a dummy variable indicating whether there is an offer for that particular room or not. 32 | *** -price-: the price of the accommodation in EUR. 33 | 34 | 35 | *************** Wild cards for variables 36 | *** Let us take a brief digression here. In many cases we would like to refer to more than one variable,... 37 | *** ... like in our previous example: describe rating ratingta price offer... 38 | *** ...We can use so-called wild cards, i.e. shorthand notations, in such cases. Instead of listing all the variables,... 39 | *** ... we can use other solutions: the * and the - wild cards. 40 | 41 | *** 1) Using the * character: 42 | describe rati* 43 | * We are referring here to all the variables that start with the letters rati and continue in any way. 44 | * Note that * can be used at the beginning, at the end, or anywhere else in a combination of characters, so: 45 | * *x refers to all the variables ending with the letter x. 46 | * x* refers to all variables starting with the letter x.
47 | * *x* refers to all variables that contain the letter x, with anything (even nothing) before and after it in their names. 48 | 49 | *** 2) Using the - character 50 | describe stars-offer_cat 51 | * We are referring here to all the variables from -stars- to -offer_cat- based on the Variable Window: ... 52 | * ... -stars-, -ratingta-, -ratingta_count-, -scarce_room-, -offer-, and -offer_cat-. 53 | 54 | 55 | ********************************************* Topic 1: Summary statistics 56 | 57 | *************** 'codebook' 58 | *** Let us start with the 'codebook' command (that we already used in Lecture 3 in order to find missing values). 59 | codebook ratingta rating offer 60 | *** As you can see, codebook presents: 61 | * the type (string or numeric) 62 | * the number of missings 63 | * the number of unique values (which can be really helpful in the case of panel data) 64 | * for numerical variables: some descriptive statistics 65 | * for string variables: tabulation of the unique values (or some examples if there are more than 9 unique values) 66 | 67 | *** Using the 'notes' option, 'codebook' also displays the notes attached: 68 | codebook ratingta, notes 69 | *** So, 'codebook' can be used for both numerical and string variables. 70 | 71 | 72 | *************** 'summarize' 73 | *** Let us now check the 'summarize' command, which can be used only for numerical variables. 74 | summarize rating offer 75 | 76 | *** To calculate more descriptive statistics, we can use the 'detail' option: 77 | summarize rating offer, detail 78 | *** The 'detail' option produces statistics that can be used to find extreme values - see Chapter 3.4. 79 | 80 | 81 | *************** 'tabstat' 82 | *** Still focusing on numerical variables, we can produce even more descriptive statistics with 'tabstat'.
83 | *** We should list the statistics to be calculated in the 'statistics' option: 84 | tabstat rating, statistics(mean range iqr) 85 | 86 | *** Check the help menu for the entire list of statistics that can be calculated using 'tabstat'. 87 | 88 | 89 | *************** 'tabulate' 90 | *** The 'tabulate' command can be used to create frequency tables, e.g.: 91 | tabulate rating 92 | 93 | *** Note that 'tabulate' does not present missing values by default. The 'missing' option can solve this: 94 | tabulate rating, missing 95 | 96 | *** The 'plot' option creates a "low-budget" histogram: 97 | tabulate rating, missing plot 98 | 99 | *** Two-way frequency tables can also be produced: 100 | tabulate rating offer, missing 101 | 102 | *** In case of labeled values, labels are presented by default, although,... 103 | *** ... with the 'nolabel' option you can order Stata to present the underlying values instead: 104 | tabulate rating offer, missing nolabel 105 | 106 | 107 | *************** 'tabulate, summarize()' 108 | *** Combining 'tabulate' and 'summarize' you can create one- and two-way tables containing summary statistics: 109 | tabulate offer, summarize(rating) 110 | *** As you can see, it creates two subsamples, and calculates the summary statistics for those two subsamples. 111 | 112 | 113 | *************** 'compare' 114 | *** You can compare the values of two variables using the 'compare' command. 115 | compare distance distance_alter 116 | 117 | *************** 'misstable' 118 | * There is a useful command that reports only missing values, ... 119 | * ... and so it is easy to review the results for many variables as well. 120 | misstable summarize 121 | 122 | * As you can see, it displays all the variables which have missings, ... 123 | * ... and also displays the number of missing values. 124 | 125 | 126 | ********************************************* Topic 2: Visualization 127 | 128 | *** Please note that we cover visualization (and scaffolding) in a later lecture, ...
129 | *** ... so this lecture's graphs will not be as fancy and as informative as they should be. 130 | 131 | *************** 1) Visualizing distributions 132 | *** Let us first draw a simple histogram: 133 | histogram price 134 | 135 | *** As you can see, the histogram is scaled to density units by default. 136 | *** However, we can display frequencies using the 'frequency' option: 137 | histogram price, frequency 138 | 139 | *** Or, a percentage scale can also be used: 140 | histogram price, percent 141 | 142 | *** Using the 'normal' option, a normal density plot can be displayed... 143 | *** ... (with the same mean and standard deviation as the plotted distribution's) 144 | histogram price, normal 145 | 146 | *** Although Stata determines the number of bins automatically based on the number of obs, ... 147 | *** ... you can overrule it using... 148 | * either the 'bin()' option to define the number of bins 149 | * or the 'width()' option to define the width of the bins 150 | * For example: 151 | histogram price, bin(30) 152 | histogram price, width(20) 153 | 154 | *** -price- is a continuous variable, but we can make histograms for discrete variables as well: 155 | histogram stars, discrete 156 | 157 | *** Density plots can also be easily created: 158 | kdensity price 159 | 160 | *************** 2) Visualizing summary statistics 161 | *** Box plots can be created using the 'graph box' command: 162 | graph box price 163 | 164 | *** Violin plots can be created using the 'vioplot' command, which should be installed first: 165 | ssc install vioplot 166 | vioplot price 167 | -------------------------------------------------------------------------------- /lec06_subsamples/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 06: Dealing with subsamples 2 | 3 | ## Motivation 4 | 5 | In many cases we would like to analyze only parts of our whole dataset: we would like to deal with subsamples.
For example, we may characterize only male employees in an employee-level dataset, or check only developing countries instead of all countries in the world, etc. 6 | 7 | ## This lecture 8 | 9 | This lecture introduces students to the three most important approaches to defining subsamples. 10 | 11 | Case studies (at least partly) connected to this lecture: 12 | - [Chapter 03, 3.A1 and 3.A2: Finding a Good Deal among Hotels: Data Exploration](https://gabors-data-analysis.com/casestudies/#ch03a-finding-a-good-deal-among-hotels-data-exploration) 13 | - [Chapter 03, 3.B1: Comparing Hotel Prices in Europe: Vienna vs. London](https://gabors-data-analysis.com/casestudies/#ch03b-comparing-hotel-prices-in-europe-vienna-vs-london) - dealing only with Vienna 14 | 15 | This lecture uses the [hotels-vienna.dta](https://osf.io/dn8je) dataset. 16 | 17 | ## Learning outcomes 18 | After successfully completing the code in `lecture06_subsamples.do` students should be able to: 19 | 20 | - Deal with subsamples using the if condition, the by or bysort prefixes, and the in range. 21 | 22 | ## Datasets used 23 | 24 | * [hotels-vienna.dta](https://osf.io/dn8je) 25 | 26 | ## Lecture Time 27 | 28 | Ideal overall time: **30 mins**. 29 | 30 | ## Homework 31 | 32 | *Type*: quick practice, approx 20 mins 33 | 34 | In this homework you are invited to use the Hotels in Vienna dataset again. Do the following exercises: 35 | 36 | - Calculate the most important descriptive statistics (find them in Table 3.5 on p. 74) for the price and distance variables, and also visualize the results using box plots or violin plots. Do it for all types of accommodations. 37 | - Reproduce figures 3.1 (p. 63) and 3.3 (p. 64). Please note that these figures use only a subsample, and you should also use only those observations. The precise definition of the sample narrowing can be found at the top of page 68.
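A possible starting point for the first exercise (a sketch only; it assumes the prepared hotels-vienna dataset is loaded, with the variable names used in the lectures):

```stata
* Descriptive statistics of price and distance for each accommodation type
tabstat price distance, by(accommodation_type) statistics(mean median sd min max)

* Box plots of price by accommodation type
graph box price, over(accommodation_type)
```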
38 | 39 | 40 | ## Detailed online descriptions for the most important commands from the Stata website 41 | 42 | - [by and bysort](https://www.stata.com/manuals/dby.pdf) 43 | 44 | ## Further material 45 | 46 | - UCLA offers a module on the [if condition](https://stats.oarc.ucla.edu/stata/modules/using-if-with-stata-commands/). 47 | 48 | -------------------------------------------------------------------------------- /lec06_subsamples/lec06_subsamples.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 06 ****************************************************** 4 | *********************************************************************************************************************** 5 | *********************************************** Dealing with subsamples *********************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | *** Defining the default folder 11 | cd "da-coding-stata/lec06_subsamples" 12 | 13 | 14 | *************** Copying and opening data 15 | *** We use again the Hotels in Vienna dataset that we prepared in Lecture 3. 16 | *** So first, let us copy that file into the folder of the current lecture. 
17 | copy "da-coding-stata/lec03_preparation/hotels-vienna_prep.dta" "hotels-vienna_prep.dta", replace 18 | use "hotels-vienna_prep.dta", clear 19 | 20 | 21 | *************** Defining and characterising subsamples 22 | *** There are basically three main tools to define subsamples: 23 | * the if condition 24 | * the by or bysort prefix 25 | * the in range. 26 | 27 | 28 | *************** 1) The 'if' condition 29 | * The 'if' condition refers to one (or more) variable(s) and is checked at the level of observations. 30 | * Stata runs through the observations, and the command will only be executed on the data for which... 31 | * ... the 'if' condition is true. 32 | 33 | * Let us see some basic examples: 34 | count if rating < 3 /*Counting the number of obs that have a lower rating than 3.*/ 35 | 36 | count if accommodation_type == "Hotel" /*Displaying the number of obs that are hotels.*/ 37 | * ! Note that two equals signs should be used in the 'if' condition for checking equality. 38 | * '==' means "is equal to" and '=' means 'set this to'. 39 | 40 | summarize rating if stars == 5 /*Some descriptives of the ratings for 5-star accommodations.*/ 41 | 42 | summarize rating if stars == 5, detail /*The if condition should come before the options.*/ 43 | 44 | count if accommodation_type != "Hotel" /*Displaying the number of obs that are not hotels.*/ 45 | * Note that 'not equal to' can be expressed both by != or ~=. 46 | 47 | * More conditions can also be defined using logical operators. 48 | 49 | * Displaying a frequency table of -rating- for 5-star hotels: 50 | tabulate rating if stars == 5 & accommodation_type == "Hotel" 51 | * 'and' (&) means that both conditions should be true. 52 | 53 | * Displaying a frequency table of -rating- for accommodations that either have 5 stars or are hotels, or both: 54 | tabulate rating if stars == 5 | accommodation_type == "Hotel" 55 | * 'or' (|) means that at least one of the conditions should be true (both of them can be true as well).
56 | 57 | * More complex systems of 'if' conditions can also be defined. 58 | 59 | * Counting accommodations that are either 5-star hotels or 3-star apartments: 60 | count if stars == 5 & accommodation_type == "Hotel" | stars == 3 & accommodation_type == "Apartment" 61 | * Please note that 'and' takes precedence over 'or', so arrange your if-systems accordingly using parentheses. 62 | 63 | * The most commonly used operators: 64 | * equal to: == 65 | * not equal to: != or ~= 66 | * less/larger than: < / > 67 | * less/larger than or equal to: <= / >= 68 | * and: & 69 | * or: | 70 | 71 | *** IMPORTANT NOTE about missing values 72 | * Let us check the -rating- variable again with the 'tabulate' command: 73 | tabulate rating, missing 74 | * As you can see, there are 6 observations with value 5. Let us count them in another way: 75 | count if rating > 4.9 76 | * Now the result is 41, which contradicts the result of the 'tabulate' command. 77 | 78 | * What is the solution? In the case of comparison operators, missing values are treated as (positive) infinity. 79 | 80 | * So the right syntax in this case is one of the following: 81 | count if rating > 4.9 & rating < . 82 | count if rating > 4.9 & rating != . 83 | count if rating > 4.9 & rating ~= . 84 | 85 | * In case of string variables we can refer to missing using: "". For example: 86 | count if accommodation_type == "" 87 | * According to the results, there are no missings. You can also check it using the 'tabulate' command: 88 | tabulate accommodation_type, missing 89 | 90 | * So, missing values should be referred to as: 91 | * . in case of numerical variables. 92 | * "" in case of string variables. 93 | 94 | * Let us recall that 'misstable' is a useful command to check whether a variable has missing values. 95 | misstable summarize 96 | 97 | 98 | *************** 2) The 'by' and 'bysort' prefixes 99 | * Both prefixes do the same: repeat the command for each group of observations ... 100 | * ...
for which the values of the variables in varlist are the same. 101 | * 'by' requires that the data be sorted by varlist, while 'bysort' does this sorting automatically. 102 | 103 | * Let us see the following example: creating summary statistics of the -rating- variable ... 104 | * ... for all subsamples that are created according to the unique values of the -offer- variable: 105 | bysort offer: summarize rating 106 | 107 | * More variables can also be used: 108 | bysort offer accommodation_type: summarize rating 109 | 110 | 111 | *************** 3) The in range 112 | * The 'in' range qualifier restricts the scope of the command to a specific observation range. 113 | 114 | * Let us for example summarize the -price- variable for the first ten observations: 115 | summarize price in 1/10 116 | 117 | * Or, list the values of -price- and -rating- for observations 10-20: 118 | list price rating in 10/20 119 | * The list command creates a table that simply contains the values of the particular variables (like in the browse window). 120 | 121 | *** IMPORTANT NOTE: it is best to avoid the use of the 'in' range qualifier, since if you change the order... 122 | *** ... of the observations, the result can also change. For example: 123 | summarize price in 1/10 /*The average is: 213.5*/ 124 | sort hotel_id 125 | summarize price in 1/10 /*The average now is: 112.7*/ 126 | 127 | 128 | *************** Visualization of subsamples 129 | *** Please note that the visualization commands we used in the previous lecture can be used for subsamples, as well. 
130 | 131 | * Using the 'if' condition: 132 | graph box price if accommodation_type == "Apartment" 133 | 134 | * Using the 'over' option: 135 | graph box price, over(offer) 136 | -------------------------------------------------------------------------------- /lec07_graphs/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Lecture 07: Making graphs 5 | 6 | ## Motivation 7 | 8 | It is common wisdom that a picture is worth a thousand words, and this is even more true in data analysis. Graphing is essential for exploring our data, and it is also a powerful way to illustrate the results of an analysis and convey our message. 9 | 10 | ## This lecture 11 | 12 | This lecture introduces students to the basics of making graphs. The graphing possibilities in Stata are almost limitless: even the tiniest element of a graph can be customized. We show here only the basic logic. 13 | 14 | Case studies (at least partly) connected to this lecture: 15 | - [Chapter 7, 7.A1 and 7.A2: Finding a good deal among hotels with simple regression](https://gabors-data-analysis.com/casestudies/#ch07a-finding-a-good-deal-among-hotels-with-simple-regression) 16 | 17 | This lecture uses the [hotels-vienna.dta](https://osf.io/dn8je) dataset. 18 | 19 | ## Learning outcomes 20 | After successfully completing the code in `lec07_graphs.do` students should be able to: 21 | 22 | - Understand the logic of graphing in Stata. 23 | - Create fancy graphs and add the scaffolding (titles, axis labels, legends, notes). 24 | 25 | ## Datasets used 26 | 27 | * [hotels-vienna.dta](https://osf.io/dn8je) 28 | 29 | ## Lecture Time 30 | 31 | Ideal overall time: **40 mins**. 32 | 33 | ## Homework 34 | 35 | *Type*: quick practice, approx 15 mins 36 | 37 | Your homework assignment is to reproduce Figure 8.9 on page 230 from the textbook. 
38 | 39 | ## Further material 40 | 41 | - Asjad Naqvi has a great [post](https://medium.com/the-stata-guide/stata-graph-tips-for-academic-articles-8d962d5e8b75) on medium.com about graph tips for academic articles. 42 | 43 | - There are some interesting intro materials on the Stata webpage that can serve as starting points: [Visual overview for creating graphs](https://www.stata.com/support/faqs/graphics/gph/stata-graphs/), [Publication-quality graphics](https://www.stata.com/features/publication-quality-graphics/), and [Graph styles](https://www.stata.com/features/overview/graph-styles/). 44 | 45 | - The World Bank has a nice DIME Wiki [page](https://dimewiki.worldbank.org/Stata_Coding_Practices:_Visualization) about visualization. 46 | 47 | - SSCC has an [intro module](https://www.ssc.wisc.edu/sscc/pubs/4-24.htm) on visualization, which unfortunately discusses the topic not with code but with the point-and-click method; it is still helpful, though, as it offers an overview of the possibilities and the logic. 48 | 49 | - Kevin Denny has a great [paper](https://www.ucd.ie/geary/static/publications/workingpapers/gearywp202102.pdf) about the basics of Stata graphs. 50 | 51 | - Stata Press has a [744-page-long book](https://www.stata-press.com/manuals/graphics-reference-manual/) on graphing (available online, free of charge). 52 | 53 | - You should also check the two cheat sheets ([1](https://geocenter.github.io/StataTraining/pdf/StataCheatSheet_visualization15_Plots_2016_June-REV.pdf) and [2](https://geocenter.github.io/StataTraining/pdf/StataCheatSheet_visualization15_Syntax_2016_June-REV.pdf)) by the instructors of Geocenter. 54 | 55 | - For advanced graphing, read the [posts](https://medium.com/the-stata-guide) of Asjad Naqvi via medium.com. 
56 | 57 | 58 | -------------------------------------------------------------------------------- /lec07_graphs/lec07_graphs.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 07 ****************************************************** 4 | *********************************************************************************************************************** 5 | **************************************************** Making graphs **************************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** Intro 10 | * There are two kinds of graphs in Stata: 11 | * univariate graphs show the distribution of one variable: bar charts, histograms, pie charts, box plots etc. 12 | * twoway (bivariate) graphs show the relationship between two variables: scatter, line, area etc. 13 | 14 | * Typing 'help graph' gives an overview about graphs in Stata. 15 | help graph 16 | * In this lecture we focus on twoway graphs. The logic of oneway graphs is really similar to twoways. 17 | 18 | 19 | *************** The default folder 20 | *** Defining the default folder 21 | cd "da-coding-stata/lec07_graphs/" 22 | 23 | 24 | *************** Copying and opening data 25 | *** We use the Hotels in Vienna data, introduced in Case Study 1.A1, and... 26 | *** ... we reproduce Figures 7.5 (page 183) and 7.7 (page 186). 
27 | copy "https://osf.io/dn8je/download" "hotels-vienna.dta", replace 28 | use "hotels-vienna.dta", clear 29 | 30 | 31 | *************** Sample selection 32 | *** Let us start with the sample selection in order to get the same results as in the case study. We do the same ... 33 | *** ... as we did at the beginning of Lecture 10: 34 | * Keeping accommodations from Vienna: 35 | keep if city_actual == "Vienna" 36 | * Keeping hotels: 37 | keep if accommodation_type == "Hotel" 38 | * Keeping hotels with 3, 3.5, or 4 stars: 39 | keep if stars == 3 | stars == 3.5 | stars == 4 40 | * Dropping a hotel with extremely high price: 41 | summarize price, detail 42 | drop if price == 1012 43 | count /*207 observations remained*/ 44 | 45 | 46 | *************** Reproducing Figure 7.5 step by step 47 | * Let us start with a basic graph without any scaffolding: 48 | graph twoway (scatter price distance) 49 | * 'graph twoway' is the main command, then, in parenthesis, we should define the type of the chart first (scatter) and then... 50 | * ... the variable of the Y axis, and the the variable of the X axis. 51 | 52 | * Adding the linear regression line: 53 | graph twoway (scatter price distance) (lfit price distance) 54 | * In another parenthesis, we can define another chart, here a linear regression line. 55 | * Now, we are ready with the main graph. From this point we will make only changes in the main graph, and so, we will use options. 56 | * The logic is that the option should be attached to that element that we would like to change. 57 | 58 | * Changing the color of the dots and the regression line: 59 | * The aim is to change the color of the dots (markers), and the color of the regression line,... 60 | * ... so we put the adequate options near to the 'scatter' and the 'lfit' commands. 61 | graph twoway (scatter price distance, mcolor(blue)) (lfit price distance, lcolor(green)) 62 | * mcolor means marker color, while lcolor means line color. 
63 | 64 | * Naming the X and Y axes: 65 | * Now we would like to change the main part (skeleton) of the graph, so we put the options accordingly:... 66 | * ... we attach the options to the skeleton, i.e. to the 'graph twoway'. 67 | graph twoway (scatter price distance, mcolor(blue)) (lfit price distance, lcolor(green)) /// 68 | , xtitle("Distance to city center (miles)") ytitle("Price (US dollars)") 69 | 70 | * Setting values of the X and Y axes, and adding grid lines: 71 | * Again, we are going to alter the skeleton itself (while the 'grid' option alters the axes). 72 | graph twoway (scatter price distance, mcolor(blue)) (lfit price distance, lcolor(green)) /// 73 | , xtitle("Distance to city center (miles)") ytitle("Price (US dollars)") /// 74 | xlabel(0(1)7, grid) ylabel(0(50)400, grid angle(horizontal)) 75 | 76 | * Removing legend: 77 | graph twoway (scatter price distance, mcolor(blue)) (lfit price distance, lcolor(green)) /// 78 | , xtitle("Distance to city center (miles)") ytitle("Price (US dollars)") /// 79 | xlabel(0(1)7, grid) ylabel(0(50)400, grid angle(horizontal)) /// 80 | legend(off) 81 | 82 | * Adding title and subtitle: 83 | graph twoway (scatter price distance, mcolor(blue)) (lfit price distance, lcolor(green)) /// 84 | , xtitle("Distance to city center (miles)") ytitle("Price (US dollars)") /// 85 | xlabel(0(1)7, grid) ylabel(0(50)400, grid angle(horizontal)) /// 86 | legend(off) /// 87 | title("Hotel price by distance to the city center", color(black)) subtitle("linear regression and scatterplot", color(black)) 88 | 89 | * Adding note: 90 | graph twoway (scatter price distance, mcolor(blue)) (lfit price distance, lcolor(green)) /// 91 | , xtitle("Distance to city center (miles)") ytitle("Price (US dollars)") /// 92 | xlabel(0(1)7, grid) ylabel(0(50)400, grid angle(horizontal)) /// 93 | legend(off) /// 94 | title("Hotel price by distance to the city center", color(black)) subtitle("linear regression and scatterplot", color(black)) /// 95 | 
note("Source: hotels-vienna dataset. Vienna, November 2017, weekday. N=207.") 96 | 97 | * Saving as .gph: 98 | graph twoway (scatter price distance, mcolor(blue)) (lfit price distance, lcolor(green)) /// 99 | , xtitle("Distance to city center (miles)") ytitle("Price (US dollars)") /// 100 | xlabel(0(1)7, grid) ylabel(0(50)400, grid angle(horizontal)) /// 101 | legend(off) /// 102 | title("Hotel price by distance to the city center", color(black)) subtitle("linear regression and scatterplot", color(black)) /// 103 | note("Source: hotels-vienna dataset. Vienna, November 2017, weekday. N=207.") /// 104 | saving("figure_7_5.gph", replace) 105 | * Please note that .gph files can be interactively edited using the clicking method in the Graph Editor... 106 | * ... It is useful when you would like to change an element, but you do not know its name. You can look for it in the Graph... 107 | * ... Editor, and after finding it, you can check the help menu or online sources in order to write the adequate code. 108 | 109 | * Exporting graph as a .png file: 110 | graph export "figure_7_5.png", as(png) replace width(3200) height(2400) 111 | 112 | * Hopefully, the logic is clear now, so we can make a step further, and reproduce Figure 7.7. 113 | 114 | 115 | *************** Reproducing Figure 7.7 step by step 116 | * Here we would like to highlight the five most underpriced hotels, so first let us identify them. 117 | regress price distance 118 | predict res, resid 119 | sort res 120 | list res in 1/5 /*We are interested in the five smallest residual.*/ 121 | generate mostunderpriced = 0 122 | replace mostunderpriced = 1 if res < -51.05 123 | tabulate mostunderpriced, missing 124 | * Now we have a dummy variable showing the observations wthat we would like to highlight on the graph, so we can start plotting. 
125 | 126 | * The starting point is the same: we create the main graph: 127 | graph twoway (scatter price distance) (lfit price distance) 128 | 129 | * Now, let us divide the observations into two groups and make two charts of them: ... 130 | * ... one for the highlighted obs, and another for the rest: 131 | graph twoway (scatter price distance if mostunderpriced == 0) /// 132 | (scatter price distance if mostunderpriced == 1) /// 133 | (lfit price distance) 134 | 135 | * Let us now change the colors: 136 | graph twoway (scatter price distance if mostunderpriced == 0, mcolor(blue)) /// 137 | (scatter price distance if mostunderpriced == 1, mcolor(yellow) mlcolor(black)) /// 138 | (lfit price distance, lcolor(green)) 139 | 140 | * Now, add the arrow with the note (as in the figure in the textbook) as a separate chart: 141 | graph twoway (scatter price distance if mostunderpriced == 0, mcolor(blue)) /// 142 | (scatter price distance if mostunderpriced == 1, mcolor(yellow) mlcolor(black)) /// 143 | (lfit price distance, lcolor(green)) /// 144 | (pcarrowi 25 2 50 1.15 "Most underpriced hotels", color(black) mlabcolor(black)) 145 | * 'pcarrowi' draws an arrow with some text: 146 | * The first two numbers are the coordinates (Y and X) of the starting point of the arrow. 147 | * The second two numbers are the coordinates (Y and X) of the arrow head. 148 | * The last element is the text itself. 149 | * 'color()' changes the color of the arrow, while 'mlabcolor()' (= marker label color) changes the color of the text. 
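* As an aside (this is not part of the textbook figure): if we only needed a text label without an arrow,...
* ... the 'text()' option of twoway could place text at given (y, x) coordinates instead of 'pcarrowi':
graph twoway (scatter price distance) (lfit price distance), text(25 2 "Most underpriced hotels")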
150 | 151 | * Finally, we can do the same scaffolding as with Figure 7.5, so we simply copy-paste it here: 152 | graph twoway (scatter price distance if mostunderpriced == 0, mcolor(blue)) /// 153 | (scatter price distance if mostunderpriced == 1, mcolor(yellow) mlcolor(black)) /// 154 | (lfit price distance, lcolor(green)) /// 155 | (pcarrowi 25 2 50 1.15 "Most underpriced hotels", color(black) mlabcolor(black)) /// 156 | , xtitle("Distance to city center (miles)") ytitle("Price (US dollars)") /// 157 | xlabel(0(1)7, grid) ylabel(0(50)400, grid angle(horizontal)) /// 158 | legend(off) /// 159 | title("Hotel price by distance to the city center", color(black)) subtitle("linear regression and scatterplot", color(black)) /// 160 | note("Source: hotels-vienna dataset. Vienna, November 2017, weekday. N=207.") /// 161 | saving("figure_7_7.gph", replace) 162 | 163 | 164 | *************** Two additional notes 165 | *** 1) There are some pre-defined schemes in Stata that can be used to easily format our graphs... 166 | *** ... For example, let us make a scatter in the style of The Economist graphs: 167 | use "hotels-vienna.dta", clear 168 | graph twoway (scatter price distance) (lfit price distance), scheme(economist) 169 | *** The list of pre-installed schemes is available using the following command: 170 | graph query, schemes 171 | 172 | *** 2) Stata graphs (.gph files) can be edited interactively (i.e. using the clicking method) using the Graph Editor. 173 | 174 | 175 | 176 | -------------------------------------------------------------------------------- /lec08_moredatasets/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 08: Combining datasets 2 | 3 | ## Motivation 4 | 5 | In many cases you have to work with more than one dataset. For example, let us consider the case of the game-level data we use in case study 2.C1 on p. 
40, [Identifying Successful Football Managers](https://gabors-data-analysis.com/casestudies/#ch02c-identifying-successful-football-managers). The data covers 11 seasons of EPL games, and originally comes in separate .csv files: one .csv contains the data of one season, so we have 11 files. Moreover, we also have data on managers, and in a later analysis (see case study 24.B1, [Estimating the Impact of Replacing Football Team Managers](https://gabors-data-analysis.com/casestudies/#ch24-estimating-the-impact-of-replacing-football-team-managers)) we would like to use both the game-level and the manager-level data. How can we integrate these datasets? How can we deal with more than one dataset? 6 | 7 | Stata always works with one dataset, so if we would like to use more datasets simultaneously, we should combine them first. This is the topic of this lecture. (In fact, data frames were introduced in Stata 16, and so it became possible to have multiple datasets in memory, but it is still much easier to follow the old logic of working with one dataset. For further details on data frames please check the [website](https://www.stata.com/features/overview/multiple-datasets-in-memory/) of Stata.) 8 | 9 | 10 | ## This lecture 11 | 12 | In this lecture we first discuss some definitions related to combining datasets, and then we review the two most important commands - append and merge - that can be used to combine multiple datasets. 13 | 14 | Case studies (at least partly) connected to this lecture: 15 | - [Chapter 02, 2.C1: Identifying Successful Football Managers](https://gabors-data-analysis.com/casestudies/#ch02c-identifying-successful-football-managers) 16 | 17 | This lecture uses some raw [football](https://osf.io/zqm6c/) data. 18 | 19 | ## Learning outcomes 20 | 21 | After successfully completing the code in lec08_moredatasets.do students should be able to combine datasets both vertically and horizontally, using the append and merge commands. 
22 | 23 | ## Datasets used 24 | 25 | * [raw football data](https://osf.io/zqm6c/) 26 | 27 | ## Lecture Time 28 | 29 | Ideal overall time: **90 mins**. 30 | 31 | ## Homework 32 | 33 | *Type*: quick practice, approx 30 mins 34 | 35 | In this homework you are invited to combine some artificial datasets containing different pieces of information on products and shops selling the products. The following variables are in the datasets: 36 | - id: unique identifier of the product 37 | - price: price of the product 38 | - quantity: quantity of the product in storage 39 | - shop_id: unique identifier of the shop where the product is sold 40 | - region: identifier of the region where the shop can be found 41 | 42 | Do the following exercises step by step: 43 | - Define a default folder. 44 | - Open lec08_hw_data1.xlsx, take a look at it (use the list command), and then save it as hw8_data1.dta. 45 | - Open lec08_hw_data2.csv, take a look at it (use the list command), and then save it as hw8_data2.dta. 46 | - Open hw8_data1.dta and try to combine it with hw8_data2.dta. Combine them in a way that uses as much information as possible (i.e. pay attention to values that are missing in one dataset and non-missing in another). At first, it will not work. Find and solve the problem first (hint: use isid and duplicates), and then try to combine the two datasets again. 47 | - Save the combined dataset as hw8_data12.dta. 48 | - Open lec08_hw_data3.dta and take a look at it (use the list command). 49 | - Combine it with hw8_data12.dta. Then save it as hw8_data123.dta. 50 | - Open lec08_hw_data4.dta and take a look at it (use the list command). This dataset shows the region of all shops. 51 | - Combine it with hw8_data123.dta. You will again face a problem. You can solve it by looking for the appropriate option in the help menu of the combining command you use. 
52 | 53 | 54 | ## Detailed online descriptions for the most important commands from the Stata website 55 | 56 | - [merge](https://www.stata.com/manuals/dmerge.pdf) 57 | - [append](https://www.stata.com/manuals/dappend.pdf) 58 | 59 | ## Further material 60 | 61 | - [Chapter 23](https://www.stata.com/manuals/u23.pdf) of the Stata User's Guide discusses the basic logic of combining datasets. 62 | 63 | - The [4th lesson](https://datacarpentry.org/stata-economics/04-combine-data/index.html) of Data Carpentry's Economics Lesson with Stata course offers a detailed description of the topic with many exercises. 64 | 65 | - UCLA has a [module](https://stats.oarc.ucla.edu/stata/modules/combining-data/) on appending and merging. 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | -------------------------------------------------------------------------------- /lec08_moredatasets/lec08_hw_data1.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec08_moredatasets/lec08_hw_data1.xlsx -------------------------------------------------------------------------------- /lec08_moredatasets/lec08_hw_data2.csv: -------------------------------------------------------------------------------- 1 | id,price,quantity,shop_id 2 | 1,610,20,1 3 | 2,400,50,1 4 | 3,711,20,1 5 | 4,200,15,2 6 | 5,156,50,3 7 | 6,213,50,3 8 | -------------------------------------------------------------------------------- /lec08_moredatasets/lec08_hw_data3.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec08_moredatasets/lec08_hw_data3.dta -------------------------------------------------------------------------------- /lec08_moredatasets/lec08_hw_data4.dta: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec08_moredatasets/lec08_hw_data4.dta -------------------------------------------------------------------------------- /lec08_moredatasets/lec08_moredatasets.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 08 ****************************************************** 4 | *********************************************************************************************************************** 5 | ********************************************* Working with more datasets ********************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** Definitions 10 | *** Master vs. using data 11 | * Master data: the dataset that is currently in the memory. 12 | * Using data: other dataset(s) that you would like to combine with the master data. 13 | * Note that all the files should be .dta files. 14 | 15 | *** Adding rows (observations) vs. adding columns (variables) 16 | * For adding new observations, we should use the 'append' command. 17 | * For adding new variables, we should use the 'merge' command. 
18 | 19 | 20 | *************** The default folder 21 | *** Defining the default folder 22 | cd "da-coding-stata/lec08_moredatasets/" 23 | 24 | 25 | *************** Vertical combining: APPEND 26 | *** Introduction 27 | * The game-level data in our case study "Identifying Successful Football Managers" on p. 40 is... 28 | * ... originally available in many .csv files - one .csv for each season... 29 | * ... In this exercise let us combine years 2016, 2017, and 2018... 30 | * ... In a later lecture we will learn how to combine even more years much more easily - using loops. 31 | 32 | *** Preparing datasets 33 | * Let us first open the .csv files from OSF and save them as .dta files: 34 | import delimited "https://osf.io/ad9zg/download", clear 35 | 36 | browse 37 | * As you can see from the date variable, it is season 2016/17... 38 | * ... As can also be seen, each observation (row) in the data table is a single game... 39 | * ... Let us now save it, and then do the same for the next two seasons. 40 | count /*380 observations*/ 41 | 42 | save "football_season201617.dta", replace 43 | 44 | import delimited "https://osf.io/b5fvd/download", clear 45 | browse 46 | count /*380 observations*/ 47 | save "football_season201718.dta", replace 48 | 49 | import delimited "https://osf.io/4skqc/download", clear 50 | browse 51 | count /*380 observations*/ 52 | save "football_season201819.dta", replace 53 | 54 | *** Appending 55 | * So, as we saw, we have three datasets with the same structure, but different observations. Let us combine them. 56 | * We will always count the number of observations as we go on in order to see the changes. 
57 | 58 | clear 59 | 60 | * First, define the master data by opening it: 61 | use "football_season201617.dta" 62 | count /*380*/ 63 | 64 | * Then let us append data of season 2017-18 (the general syntax is: append using filename): 65 | append using "football_season201718.dta" 66 | count /*760*/ 67 | 68 | * Finally, let us append data of season 2018-19: 69 | append using "football_season201819.dta" 70 | count /*1140*/ 71 | 72 | * Now we have the data of all the three seasons in one dataset. 73 | 74 | *** Important notes 75 | * Let us now make some important technical notes about the append command. 76 | 77 | * 1) The order of variables in the two datasets is irrelevant. Stata always appends variables by name. 78 | 79 | * 2) The variable lists in the two datasets need not be identical. If variable X is present... 80 | * ... only in one of the datasets, values will be missing for all the observations coming from the other dataset... 81 | * ... If you check for example variables -lbh-, -lbd- and -lba- in the previously appended football dataset, they are... 82 | * ... present only in the master data and using data 1,... 83 | * ... so they are missing for all the observations coming from using 2. 84 | 85 | * 3) The master-using relationship: 86 | * The master data takes precedence over the using data, and so overrides the using data's definitions: 87 | * This extends to value labels, variable labels, characteristics, and date–time stamps. 88 | * If there are conflicts in numeric storage types, the more precise storage type will be used... 89 | * ... regardless of whether this storage type was in the master dataset or the using dataset. 90 | * If a variable is stored as a string in one dataset that is longer than in the other, the longer... 91 | * ... str# storage type will prevail. 92 | * If a variable is a string in one dataset and numeric in the other, Stata issues an error message... 93 | * ... unless the 'force' option is specified. 
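* To make notes 1) and 2) tangible, here is a small toy sketch...
* ... (the data and variable names below are made up for this illustration only):
clear
input id x
1 10
2 20
end
tempfile toy_a
save `toy_a'
clear
input id y
3 30
4 40
end
append using `toy_a'
list
* Stata matched the variables by name: -y- is missing for the appended observations (coming from the using data),...
* ... and -x- is missing for the observations of the master data.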
94 | 95 | * 4) Using the 'generate' option we are able to mark the source of the observations. Let us try the following: 96 | use "football_season201617.dta", clear 97 | append using "football_season201718.dta", generate(sourcefile) 98 | browse 99 | * As you can see, now the last column is called -sourcefile-, containing zeros (master) and ones (using). 100 | 101 | * 5) We can append several using files at the same time: 102 | use "football_season201617.dta", clear 103 | append using "football_season201718.dta" "football_season201819.dta" 104 | * As a useful practice, do the same but with the 'generate' option. Take a look at the sourcefile variable. 105 | use "football_season201617.dta", clear 106 | append using "football_season201718.dta" "football_season201819.dta", generate(sourcefile) 107 | browse 108 | 109 | 110 | *************** Horizontal combining: MERGE 111 | *** Introduction 112 | * In Chapter 20 we use some employee-level data to measure the effect on performance of letting... 113 | * ... workers work from home. The original data comes in several datasets. Let us first take a look at them. 114 | 115 | copy "https://osf.io/t7kvp/download" "whm_empdata1.dta", replace 116 | use "whm_empdata1.dta", clear 117 | describe 118 | * As you can see, we have some information about 249 employees in this dataset. 119 | 120 | * Moreover, -personid- is the unique identifier of the employee: 121 | isid personid 122 | 123 | copy "https://osf.io/e8dz7/download" "whm_empdata2.dta", replace 124 | use "whm_empdata2.dta", clear 125 | describe 126 | * We have some more (personal and performance-related) information about the employees in this dataset. 127 | isid personid /*-personid- is a unique identifier here, as well.*/ 128 | 129 | *** Merge 1:1 130 | * So, we have two datasets with (presumably) the same employees, but with different pieces of information (i.e. variables). 
131 | 132 | * Let us combine the two datasets: 133 | use "whm_empdata1.dta", clear 134 | 135 | merge 1:1 personid using "whm_empdata2.dta" 136 | * 1:1 (one-to-one) means here that one observation from the master is merged with one observation from the using. 137 | * We will see other types of merge (1:m and m:1) later. 138 | * -personid- is a variable (or a set of variables) that uniquely identifies observations both in the master and the using. 139 | * using "whm_empdata2.dta" refers to the data on disk that is the using data during the merge process. 140 | 141 | * After execution, 'merge' presents a table that reports on the success of the merging: 142 | * All the observations are matched here, so all the observations appear both in master and using. 143 | * Please note that the table tells only whether an obs appears in both datasets or not, and tells nothing... 144 | * ... about the other variables (and their values). 145 | * If you check the Variables Window, you can see that a new variable, called -_merge-, was generated. 146 | * It can take 3 values (with some options - see later - 5 values): 147 | * _merge=1: obs appears only in the master data 148 | * _merge=2: obs appears only in the using data 149 | * _merge=3: obs appears both in master and using 150 | 151 | * Some other examples for 1:1 merge could be the following: 152 | * In one dataset you have GDP data for countries, while in another dataset population data. 153 | * You observe products, and in one dataset you store their offline prices, in another one their online prices. 154 | * You observe firms and store their balance sheet data and profit and loss statement data in two datasets. 
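* A common pattern after a merge (not specific to this case study) is to inspect -_merge- and...
* ... keep only the matched observations. Here all observations matched, so nothing is actually dropped:
tabulate _merge
keep if _merge == 3
drop _merge
* Dropping -_merge- is also useful because a later 'merge' stops with an error if a -_merge- variable already exists.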
155 | 156 | *** Merge 1:m or m:1 157 | * Let us now take a look at a third dataset from the WHM case study: 158 | copy "https://osf.io/rz4cy/download" "whm_empdata3.dta", replace 159 | use "whm_empdata3.dta", clear 160 | describe 161 | * We have -personid- here as well, but the dataset contains 112279 obs, instead of 249 as before. 162 | 163 | * Let us check the -personid- variable: 164 | isid personid /*variable -personid- does not uniquely identify the observations*/ 165 | 166 | * The observation here is not an employee, but an employee-week (a particular employee on a particular week): 167 | isid personid year_week 168 | 169 | * Let us see the structure: 170 | sort personid year_week 171 | 172 | browse 173 | * So now a 1:1 merge would be wrong; we would like to match one obs with many obs: 174 | * Many master obs (employee i in time t) will be matched to one using obs (employee i), ... 175 | * ... this is a many-to-one, i.e. m:1 merge: 176 | merge m:1 personid using "whm_empdata1.dta" 177 | * As you can see in the feedback table: 178 | * We have 18751 matched obs (employee-weeks), and... 179 | * ... we have 93528 non-matched obs (employee-weeks) - appearing only in the master... 180 | * ... It means that we have employees in the master that are not present in the using. How many? 181 | codebook personid if _merge==1 /*Number of unique values: 1685 employees*/ 182 | 183 | * Finally, we should have 249 employees that are matched. Let us check it: 184 | codebook personid if _merge==3 185 | 186 | * Please note that 1:m and m:1 merge are essentially the same; the difference depends only on the choice (definition) of master and using... 187 | * ... In our previous example, if whm_empdata3.dta is the using and whm_empdata1.dta is the master, then we have a 1:m merge. 188 | 189 | * Some other examples for 1:m or m:1 merge could be the following: 190 | * One dataset contains household-level data, while the other one contains individual-level data on the household members. 
191 | * One dataset contains the continent of countries, while the other one contains yearly GDP data of the countries. 192 | 193 | *** Important notes 194 | * Let us now make some more important technical notes about the 'merge' command. 195 | * 1) It can happen that some variables appear in both master and using. 196 | * In this case the master overrides the using data's definitions (as it was in the case of 'append'). 197 | * However, some options can change this: 198 | * the 'update' option: update missing values of same-named variables in master with values from using. 199 | * the 'replace' option: replace all values of same-named variables in master with nonmissing values from using. 200 | * In this case, codes 4 and 5 can also appear in the -_merge- variable. 201 | 202 | * 2) An m:m merge also exists, but it should be avoided; even the official documentation says that it "is a bad idea". 203 | 204 | 205 | *************** Other commands to combine datasets 206 | *** There are some other commands that produce special combinations... 207 | *** ... Since they are rarely used, we do not discuss them here; check their help files in case of need. 208 | *** 'joinby': forming all pairwise combinations within groups. 209 | *** 'cross': forming pairwise combinations.
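* An illustrative sketch of the 'update' option discussed above (shown as comments only, since the data in memory have already been merged):
* use "whm_empdata1.dta", clear
* merge 1:1 personid using "whm_empdata2.dta", update
* With 'update', missing values of same-named master variables would be filled in from the using data, and...
* ... -_merge- could also take the value 4 (missing updated) or 5 (nonmissing conflict).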
210 | 211 | 212 | *************** Deleting files 213 | * Finally, let us delete the unnecessary files: 214 | erase "football_season201617.dta" 215 | erase "football_season201718.dta" 216 | erase "football_season201819.dta" 217 | erase "whm_empdata1.dta" 218 | erase "whm_empdata2.dta" 219 | erase "whm_empdata3.dta" 220 | -------------------------------------------------------------------------------- /lec09_datamanipulation/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec09_datamanipulation/README.md -------------------------------------------------------------------------------- /lec09_datamanipulation/lec09_datamanipulation.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 09 ****************************************************** 4 | *********************************************************************************************************************** 5 | ************************************************** Data manipulation ************************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | *** Defining the default folder 11 | cd "da-coding-stata/lec09_datamanipulation/" 12 | 13 | 14 | *************** Copying and opening data 15 | *** We use data from the World Management Survey, introduced in 
Case Study 1.C1 and analyzed in 4.A1. 16 | copy "https://osf.io/qx4fn/download" "wms_da_textbook-xsec.dta", replace 17 | use "wms_da_textbook-xsec.dta", clear 18 | 19 | 20 | *************** Note 21 | *** It is extremely useful to check the result in a Browse Window after each step in this lecture. 22 | 23 | 24 | *************** Producing new variables: general notes 25 | *** Basically, there are 2 commands for producing new variables: 'generate' and 'egen'. 26 | *** Unfortunately, there are no strict rules about their usage, but: 27 | * 'generate' is mainly for simple tasks, while... 28 | * ... 'egen' is for more complex tasks, it is a kind of Swiss Army knife... 29 | * ... 'egen' can handle more variables or more observations in a more complex way. 30 | 31 | *** Basic syntax of 'generate' and 'egen': 32 | * generate new_var_name = expression 33 | * egen new_var_name = function() 34 | 35 | *** There are some rules about variable names: 36 | * They cannot be longer than 32 characters, although it is worth giving short names and using labels instead. 37 | * The letters a-z (both lower-case and upper-case), numbers (0-9), and the underscore (_) are valid characters. 38 | * Names cannot start with a number. 39 | * Please also note that Stata is case-sensitive. 40 | 41 | 42 | *************** Producing new variables: the 'generate' command 43 | * Generating the birth year of the firms is a simple mathematical operation: 44 | generate born_year = wave - firmage 45 | 46 | * Generating a new variable that measures the number of employees in thousands: 47 | generate emp_firm_th = emp_firm/1000 48 | 49 | * In the expression part of 'generate', the basic mathematical operators can be used: +, -, *, /, ^, etc.
50 | 51 | * Generating the natural logarithm of -emp_firm-: 52 | generate lnemp = ln(emp_firm) 53 | 54 | * Generating a new variable that contains the closest integer to -management-: 55 | generate roundmanagement = round(management) 56 | 57 | * In the expression part of 'generate', many mathematical functions can also be used, see: 58 | help math_functions 59 | 60 | * There are other functions as well: date and time, random-number, string etc. See the details by typing: 'help functions'. 61 | 62 | 63 | *************** Producing new variables: the 'egen' command 64 | * 'egen' can work across many observations,... 65 | * ... e.g. calculating the average of the management score for the whole sample: 66 | egen mean_management = mean(management) 67 | 68 | * We can also combine several variables,... 69 | * ... e.g. calculating the average of the -aa_1-, -aa_2-, -aa_3- and -aa_4- variables for each observation: 70 | egen mean_aa_1_4 = rowmean(aa_1 aa_2 aa_3 aa_4) 71 | 72 | * Let us calculate the average for all the aa_ variables using a wild card: 73 | egen mean_aa_all = rowmean(aa_*) 74 | 75 | * Let us create a new variable containing the number of nonmissing values in a varlist (this time using a variable range): 76 | egen nrnomiss = rownonmiss(sic - target) 77 | 78 | 79 | *************** Producing new variables: the if condition 80 | * Both 'generate' and 'egen' can be combined with the 'if' condition using the logic we discussed in Lecture 5. 81 | 82 | * Generating a new variable that is equal to 1 if the home country of the firm is the US (and missing otherwise): 83 | generate usa = 1 if country == "United States" 84 | 85 | * Generating a new variable that contains the mode of -country- for the firms that have more than 1000 employees: 86 | egen x = mode(country) if emp_firm>1000 & emp_firm!=.
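* An illustrative side note: instead of generating a 1/missing indicator and recoding it later, ...
* ... a logical expression in 'generate' produces a 0/1 dummy in one step (the expression evaluates to 1 if true, 0 if false):
generate usa_dummy = (country == "United States")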
87 | 88 | 89 | *************** Producing new variables: the bysort prefix 90 | * Both 'generate' and 'egen' can be combined with the 'bysort' prefix using the logic we discussed in Lecture 6. 91 | * However, 'bysort' is more important in the case of 'egen'. 92 | 93 | * Calculating the country-level average management score for all the countries: 94 | bysort country: egen cmean_manag = mean(management) 95 | 96 | * Calculating the standard deviation of the management score for all country-industry subsamples: 97 | bysort country sic: egen sd_manag = sd(management) 98 | 99 | 100 | *************** Changing the content of existing variables 101 | *** Basically, there are 2 commands for changing the values of existing variables: 'replace' and 'recode'. 102 | * 'replace' is a basic command that can be easily used, while... 103 | * ... 'recode' is mainly for categorical variables, but it offers more options. 104 | 105 | *** Let us start with 'replace': 106 | * The syntax and logic of 'replace' are similar to what we discussed in the case of 'generate'. 107 | 108 | * Mathematical operations can be done, e.g.: 109 | replace mean_aa_all = 1 /*All the values of -mean_aa_all- will be 1.*/ 110 | 111 | * The if condition can be used: 112 | replace usa = 0 if usa == . /*Recoding missings to zeros, so we have a dummy now.*/ 113 | 114 | * Functions can be used: 115 | replace mean_management = round(mean_management) 116 | 117 | * It can also be used for string variables: 118 | replace country = "USA" if country == "United States" 119 | 120 | *** 'recode' is only for numerical variables, and it is most frequently used for recoding categorical variables. 121 | 122 | tabulate aa_1, missing /*The variable contains 0s and 1s, and one missing. Let us replace the missing with the number 2.*/ 123 | recode aa_1 (. = 2) /*We should define a rule in parentheses and the variable(s) to apply the rule to.*/ 124 | 125 | * The rule can be applied to more variables: 126 | recode aa_2 aa_3 aa_4 (.
= 2) 127 | 128 | * More rules can also be applied: 129 | recode aa_5 aa_6 (0 = 5) (1 = 6) 130 | 131 | * More complex rules can also be applied: 132 | recode aa_7 aa_8 (0 1 = 10) (. = 999) 133 | 134 | * We can keep the original variable, and apply the rule(s) while generating a new one: 135 | recode sic (. = 999), generate(sic_nomissing) 136 | 137 | * You should avoid defining a lot of rules simultaneously (it could be confusing),... 138 | * ... although the 'test' option can show whether rules are ever invoked and whether rules overlap. 139 | recode aa_10 (0/5 = 100) (1 = 999), test 140 | 141 | 142 | *************** Making numerical variables from string ones 143 | *** Dealing with string content in Stata is a bit harder, so, if possible, one should work with numerical content. 144 | *** Here we present two solutions to convert string variables to numerical ones. 145 | *** If our variable contains numbers, but for some reason they are stored as strings, we should use 'destring'. 146 | 147 | * Let us first generate two variables to convert: 148 | generate uk_str = "1" if country == "Great Britain" 149 | replace uk_str = "0" if uk_str=="" 150 | generate uk_str_other = uk_str 151 | replace uk_str_other = "non-uk" in 1 152 | 153 | * Take a look at these two variables and the -country- variable: 154 | browse country uk_str uk_str_other 155 | 156 | * Now all three variables are strings, but the two we generated should really be numerical ones. 157 | 158 | * Let us convert -uk_str-, while keeping the original variable: 159 | destring uk_str, generate(uk) 160 | 161 | * If we do not need the original variable, we can replace it using the 'replace' option: 162 | destring uk_str, replace 163 | 164 | * Destring works only if all the values can be converted. Remember that uk_str_other contains a true string value... 165 | * ...
in row 1, so Stata cannot convert it: 166 | destring uk_str_other, replace 167 | 168 | * We can use the 'force' option to force the conversion /*non-convertible values are replaced by missings*/: 169 | destring uk_str_other, replace force 170 | 171 | *** So, in the case of string content that is in fact numeric, we should use the 'destring' command. 172 | *** On the other hand, in the case of true string content (e.g. the -competition- variable) 'encode' should be used: 173 | encode competition, generate(compet_numer) 174 | 175 | * 'encode' sorts the distinct values of the variable to encode in alphabetical order, ... 176 | * ... and attaches the number 1 to the first value, 2 to the second, etc. 177 | 178 | tabulate competition, miss /*Four different string values and one missing.*/ 179 | tabulate compet_numer, miss /*The same as the previous one at first sight. Let us see it without the labels.*/ 180 | tabulate compet_numer, miss nolabel /*So, we have numbers 1, 2, 3, and 4, labelled according to the original string values.*/ 181 | 182 | *** Note that using 'encode' on numerical content stored as string can cause serious problems. Check the following: 183 | generate number = "0" in 1/6 184 | replace number = "5" in 7/15 185 | replace number = "99" in 16/20 186 | encode number, gen(num_number) 187 | list number num_number in 1/20 188 | list number num_number in 1/20, nolabel 189 | 190 | 191 | *************** Making string variables from numerical ones 192 | *** Although in most cases we want to go in the string-->numerical direction, ... 193 | *** ... sometimes the other way is needed. Two commands can be used: 194 | * 'tostring' for converting numeric variables to string variables. 195 | * 'decode' for creating a new string variable based on the "encoded" numeric variable. 196 | 197 | 198 | *************** Changing the name of the variables 199 | *** Variables can be renamed using the 'rename' command.
200 | *** Let us rename the industry code variable, -sic-: 201 | rename sic indcode 202 | 203 | 204 | *************** Deleting variables and observations 205 | *** 'drop' and 'keep' can be used to delete/keep variables and observations. 206 | *** 'drop' and 'keep' are substitutes: use the one that is more comfortable. 207 | 208 | *** Variables should be listed after the command in order to drop/keep them: 209 | drop x /*Dropping variable -x-*/ 210 | keep firmid - reliability /*Dropping all the variables from -aa_1- to -compet_numer-*/ 211 | 212 | *** Observations can be dropped/kept with the if condition: 213 | drop if wave == 2015 /*Dropping obs that are from the wave of 2015.*/ 214 | keep if country!="USA" /*Dropping obs from the USA, i.e. keeping the rest.*/ 215 | -------------------------------------------------------------------------------- /lec10_macro_loop/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gabors-data-analysis/da-coding-stata/43e029f823d250c1e02d34803fb30077e3021182/lec10_macro_loop/README.md -------------------------------------------------------------------------------- /lec10_macro_loop/lec10_macro_loop.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 10 ****************************************************** 4 | *********************************************************************************************************************** 5 | ************************************** Macros, loops, branching, and stored results *********************************** 6 | 
*********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | *** Defining the default folder 11 | cd "da-coding-stata/lec10_macro_loop/" 12 | 13 | 14 | *************** Copying and opening data 15 | *** We use data from the World Management Survey, introduced in Case Study 1.C1 and analyzed in 4.A1. 16 | copy "https://osf.io/qx4fn/download" "wms_da_textbook-xsec.dta", replace 17 | use "wms_da_textbook-xsec.dta", clear 18 | 19 | 20 | *************** Topic 1: Macros 21 | *** A macro is a string of characters, called the macroname, that stands for another string of characters, ... 22 | *** ... called the macro contents. Everywhere a macro name appears in a command, ... 23 | *** ... the macro contents are substituted for the macro name. Macros can be local or global. 24 | 25 | *** Local macros: 26 | * Macro names are up to 31 characters long. 27 | * The contents of local macros are defined with the 'local' command. 28 | * Local macros are ephemeral macros: 29 | * Local macros exist solely within the program or .do file in which they are defined. 30 | * Local macros are valid only in a single execution of commands in .do files. 31 | * Syntax of defining local macros: 32 | * Defining string macro content: local macroname "string" 33 | * Defining numerical macro content: local macroname = expression 34 | * Referring to local macros: `macroname' 35 | 36 | *** Global macros: 37 | * Macro names are up to 32 characters long. 38 | * The contents of global macros are defined with the 'global' command. 39 | * Global macros are persisting macros: 40 | * Global macros, once defined, are available anywhere in Stata. 41 | * Global macros persist until you delete them, or the end of the session. 
42 | * Syntax of defining global macros: 43 | * Defining string macro content: global macroname "string" 44 | * Defining numerical macro content: global macroname = expression 45 | * Referring to global macros: $macroname 46 | 47 | *** Some examples for local macros: 48 | * 1) Define a local macro called i, containing the number 30, and then use it in an if condition: 49 | local i = 30 50 | count if sic == `i' 51 | * As a useful practice try to execute the two commands first together, then one by one. 52 | 53 | * 2) Define a local macro called number, containing the number 10, and then use it to produce a new variable: 54 | local number = 10 55 | generate fifteen = `number'+5 56 | 57 | * 3) Define a local macro called varlist, containing some variable names, and then use it to produce some descriptives: 58 | local varlist = "management operations monitor target" 59 | summarize `varlist', detail 60 | 61 | *** Some examples for global macros: 62 | * 1) Define a global macro called j, containing the number 20, and then use it in an if condition: 63 | global j = 20 64 | count if sic == $j 65 | * As a useful practice try to execute the two commands first together, then one by one. 66 | 67 | * 2) Define a global macro called x, containing the word newvar, and then use it to produce a new variable: 68 | global x = "newvar" 69 | generate $x = 100 70 | 71 | *** Macros can be used to define new macros: 72 | local h = $j + 20 73 | generate forty = `h' 74 | 75 | 76 | *************** Topic 2: Loops 77 | *** Loops are for executing repetitive tasks: making similar things many times. 78 | *** A loop repeatedly sets a local macro to each element of the list you specify... 79 | *** ... and executes the commands enclosed in brackets. 80 | *** There are two kinds of loops: 81 | * 'forvalues' loops over a list of numbers that has a clear pattern. 82 | * 'foreach' loops over a list of anything (elements of a local or global macro, variable list, numbers etc.) 
83 | 84 | *** 2a) The 'forvalues' loop 85 | * The general syntax is the following: 86 | * forvalues lname = range { 87 | * Stata command(s) to execute 88 | * } 89 | * Some important notes: 90 | * lname is a local macro containing numbers following a clear pattern, defined in range. 91 | * The open brace must appear on the same line as 'forvalues' itself. 92 | * Nothing may follow the open brace except comments. 93 | * The close brace must appear on a line by itself. 94 | 95 | * Example 1: Calculating some descriptives of -management- for each -wave-: 96 | tabulate wave /*Running from 2004 to 2015, containing each year.*/ 97 | forvalues i = 2004(1)2015 { 98 | summarize management if wave == `i', detail 99 | } 100 | 101 | * Example 2: Renaming the -perf1-, -perf2-, ... variables to -performance_1-, -performance_2-, etc.: 102 | forvalues i = 1(1)10 { 103 | rename perf`i' performance_`i' 104 | } 105 | 106 | *** 2b) The 'foreach' loop 107 | * The general syntax is the following: 108 | * foreach lname in/of list { 109 | * Stata command(s) to execute 110 | * } 111 | * Some important notes: 112 | * lname is a local macro containing a list of something to loop over. 113 | * The open brace must appear on the same line as 'foreach' itself. 114 | * Nothing may follow the open brace except comments. 115 | * The close brace must appear on a line by itself.
116 | 117 | * Example 1: Looping over a list of numbers without a clear pattern: 118 | foreach i of numlist 20 21 25(1)30 36 39 { 119 | summarize management if sic == `i' 120 | } 121 | 122 | * Example 2: Looping over a list of existing variables: 123 | foreach j of varlist firmid wave cty management degree_m degree_nm { 124 | codebook `j' 125 | } 126 | 127 | * Example 3: Looping over the elements of a global macro: 128 | global k = "2005 2006 2010 2012" 129 | foreach m of global k { 130 | tabulate country if wave == `m' 131 | } 132 | 133 | * Example 4: Looping over anything else that is not a list of numbers, existing variables, new variables, ... 134 | * ... or elements of a local or global macro: 135 | foreach n in us ar br ca { 136 | summarize management if cty == "`n'" 137 | } 138 | 139 | * Please note that in the case of lists of variables, numbers, and macros, you should define the type of the list... 140 | * ... (numlist, varlist, newlist, local, global) after the macro name, writing 'of' before the list, while... 141 | * ... in the case of other content you do not define the type of the list, and you write 'in' instead of 'of'. 142 | 143 | *** Useful tips for loops 144 | * 1) If you loop over many elements of a list, it is hard to follow which results refer to which element. E.g.: 145 | forvalues i = 20(1)39 { 146 | summarize management if sic == `i', detail 147 | } 148 | * Can you tell immediately which table refers to -sic- code 31? I guess not. The 'display' command can help. 149 | * 'display' lets us print anything in the main window. Check the following code for example: 150 | forvalues i = 20(1)39 { 151 | display "Summary statistics of -management- for sic code `i':" 152 | summarize management if sic == `i', detail 153 | } 154 | 155 | * 2) Consider the following case: you have a well-written, long do file for an analysis trying to ... 156 | * ... capture the association between management quality and firm size (like the case study of Chapter 4)...
157 | * ... You have data for years between 2004 and 2015. Later you get data for 2016, so you would have to ... 158 | * ... revise your whole do file and make many changes to integrate the new data. Wouldn't it be nice ... 159 | * ... if your original do file were so general and adaptive that it did not need updates? ... 160 | * ... If, for example, your loops looped not over years 2004, 2005, ..., 2015, but over "all years in the current data"? ... 161 | * ... The 'levelsof' command can solve this, since it stores all the values of a variable in a local macro. 162 | levelsof wave, local(years) /*Storing all the unique values of -wave- in a local macro called years*/ 163 | foreach i of local years { /*Order Stata to loop over the values of years*/ 164 | summarize management if wave == `i' 165 | } 166 | 167 | * 3) Loops can be nested. 168 | * Let us for example create some descriptive statistics of the -management- variable for all country - wave pairs: 169 | levelsof country, local(c) 170 | levelsof wave, local(w) 171 | foreach clocal of local c { 172 | foreach wlocal of local w { 173 | display "Country: `clocal', wave: `wlocal'" 174 | summarize management if country == "`clocal'" & wave == `wlocal' 175 | display "" 176 | display "" 177 | } 178 | } 179 | 180 | 181 | *************** Topic 3: Branching 182 | *** In some cases we need to do something if a condition is true, and something else if the condition is not true... 183 | *** ... In these cases, we should use branching, i.e. an 'if' - 'else' condition. 184 | *** Let us consider the following example: we create a new variable called -manag_number- that is equal to... 185 | *** ... 2*-management- in waves 2004-2009, and 4*-management- in the waves after. 186 | 187 | generate manag_number = .
188 | forvalues i = 2004(1)2015 { 189 | if `i' < 2010 { 190 | local number 2 191 | } 192 | else { 193 | local number 4 194 | } 195 | replace manag_number = management*`number' if wave==`i' 196 | } 197 | 198 | 199 | *************** Topic 4: Stored results 200 | *** Stored results are another feature in Stata that helps you generalize and automate your code, i.e. ... 201 | *** ... re-execute your .do file without changing it after your data changes. 202 | 203 | *** Suppose that you would like to count the number of observations that have a higher -management- score... 204 | *** ... than the average -management- score. The thing we would do: 205 | summarize management /*The average is: 2.883423.*/ 206 | count if management > 2.883423 /*5,317 such observations.*/ 207 | 208 | *** The problem here is that after getting more observations (another wave for example) the average changes... 209 | *** ... Fortunately, there is a solution that helps generalize your code: referring to stored results: ... 210 | *** ... Results of calculations are stored by many Stata commands so that they can be easily accessed... 211 | *** ... and substituted into later commands. Let us do this for the previous case: 212 | help summarize /*Check the last part of the help file: the average can be referred to as 'r(mean)'.*/ 213 | 214 | summarize management 215 | count if management > r(mean) /*Now you do not need to rewrite this part if you get a new wave of data.*/ 216 | 217 | *** In the help menu of each command you can check which results are stored and can be substituted into later commands. 218 | *** You can display all the stored results after executing a command as well: 219 | summarize management 220 | return list 221 | *** Note that estimation results can also be stored; we will learn about them later.
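* One more illustrative sketch: r() results are overwritten by the next r-class command, ...
* ... so it is safest to save them into a local macro right away:
summarize management
local manag_mean = r(mean)
summarize emp_firm /*this overwrites r(mean)...*/
count if management > `manag_mean' /*...but the local macro still holds the management average*/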
222 | 223 | 224 | -------------------------------------------------------------------------------- /lec11_hypothesis_testing/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 11: Testing hypotheses 2 | 3 | ## Motivation 4 | 5 | One approach to generalizing the results of our analysis beyond the data we have is hypothesis testing. We can test hypotheses about population means, regression results, etc. 6 | 7 | 8 | ## This lecture 9 | 10 | This lecture introduces students to the basics of hypothesis testing. 11 | 12 | Case studies (at least partly) connected to this lecture: 13 | - [Chapter 06, 6.A1, 6.A2, and 6.A3: Comparing online and offline prices: testing the difference](https://gabors-data-analysis.com/casestudies/#ch06a-comparing-online-and-offline-prices-testing-the-difference) 14 | 15 | This lecture uses the [billion-prices.dta](https://osf.io/wm6ge) dataset. 16 | 17 | ## Learning outcomes 18 | After successfully completing the code in `lec11_hypothesistesting.do` students should be able to: 19 | 20 | - Execute one-sample, paired, and two-sample t-tests. 21 | 22 | ## Datasets used 23 | 24 | * [billion-prices.dta](https://osf.io/wm6ge) 25 | 26 | ## Lecture Time 27 | 28 | Ideal overall time: **20 mins**. 29 | 30 | ## Homework 31 | 32 | *Type*: quick practice, approx 10 mins 33 | 34 | In this homework you are invited to do Exercise 5 on page 166. 35 | 36 | ## Detailed online descriptions for the most important commands from the Stata website 37 | 38 | - [ttest](https://www.stata.com/manuals/rttest.pdf) 39 | 40 | ## Further material 41 | 42 | - The [relevant part](https://www.ssc.wisc.edu/sscc/pubs/sfs/sfs-ttest.htm) of SSCC's Stata for student course gives a detailed description of the topic with many examples.
-------------------------------------------------------------------------------- /lec11_hypothesis_testing/lec11_hypothesistesting.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 11 ****************************************************** 4 | *********************************************************************************************************************** 5 | ************************************************** Testing hypotheses ************************************************* 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | *** Defining the default folder 11 | cd "da-coding-stata/lec11_hypothesis_testing/" 12 | 13 | 14 | *************** Copying and opening data 15 | *** We use data from the Billion Prices project, introduced in Case Study 1.B1 and used for hypothesis testing in Chapter 6. 16 | copy "https://osf.io/wm6ge/download" "billion-prices.dta", replace 17 | use "billion-prices.dta", clear 18 | 19 | 20 | *************** Data preparation 21 | *** We are going to reproduce some results of case studies 6.A1, 6.A2, and 6.A3, so let us start with sample selection. 22 | *** Following the book, let us fix the level of significance at 5%. 
23 | 24 | * Keeping data from the USA: 25 | keep if COUNTRY == "USA" 26 | 27 | * Keeping products with regular prices: 28 | keep if PRICETYPE == "Regular Price" 29 | 30 | * Dropping products where online prices are sales: 31 | drop if sale_online == 1 32 | 33 | * Dropping three products with extreme prices: 34 | drop if price > 590000 35 | 36 | count /*6439 observations remained*/ 37 | 38 | 39 | *************** Descriptives 40 | * Let us calculate the average offline and online prices and their difference using macros and stored results: 41 | summarize price /*Mean: 28.73*/ 42 | local offprice = r(mean) 43 | summarize price_online /*Mean: 28.79*/ 44 | local onprice = r(mean) 45 | display "The difference is:" `onprice' - `offprice' 46 | 47 | * Let us check the price difference in detail: 48 | compare price price_online 49 | 50 | 51 | *************** One-sample t-test 52 | *** It is for testing whether the population average is equal to a particular value. 53 | *** Let us test whether the average offline price is equal to 30: 54 | ttest price == 30 55 | * According to the results, we cannot reject the null that the average offline price is 30. 56 | 57 | 58 | *************** Paired t-test 59 | *** It is for testing whether two variables have the same population average. 60 | *** Let us now test whether offline and online prices are equal to each other: 61 | ttest price_online == price 62 | * According to the results, we cannot reject the null that the average offline and online prices are equal to each other. 63 | 64 | 65 | *************** Two-sample t-test 66 | *** It is for testing whether two groups have the same population average for a particular variable. 67 | *** Let us test whether the online-offline price difference is the same in Japan and Germany.
68 | * Re-opening the data, and keeping Japan and Germany: 69 | use "billion-prices.dta", clear 70 | keep if COUNTRY == "JAPAN" | COUNTRY == "GERMANY" 71 | keep if PRICETYPE == "Regular Price" 72 | drop if sale_online == 1 73 | count /*1044 observations remained*/ 74 | 75 | * Calculating price difference: 76 | generate diff = price_online - price 77 | 78 | * Testing: 79 | ttest diff, by(COUNTRY) 80 | * According to the results, we can reject the null that the average online-offline price difference is equal in Japan and Germany. 81 | 82 | 83 | -------------------------------------------------------------------------------- /lec12_regression_basics/README.md: -------------------------------------------------------------------------------- 1 | # Lecture 12: Regression basics 2 | 3 | ## Motivation 4 | 5 | Economic growth is an important topic in macroeconomics, and is analyzed by thousands of economists, including for example Xavier Sala-i-Martin. One of his articles was published in the American Economic Review in 1997 (87(2), pages 178-183), and its title is [I Just Ran Two Million Regressions](https://www.jstor.org/stable/2950909). This title is funny, but it also shows the importance of regression analysis in economics: it is the Swiss army knife for economists. When economists want to uncover patterns of association between variables, in most cases they use regression analysis. 6 | 7 | ## This lecture 8 | 9 | This lecture introduces students to the basics of regression analysis. We deal with many topics from Chapters 7, 8, 9, and 10. From an econometric point of view, regression analysis covers a huge amount of material; from a Stata point of view, however, one lecture is enough to discuss the basics.
10 | 11 | Case studies (at least partly) connected to this lecture: 12 | - [Chapter 07, 7.A1, 7.A2, and 7.A3: Finding a good deal among hotels with simple regression](https://gabors-data-analysis.com/casestudies/#ch07a-finding-a-good-deal-among-hotels-with-simple-regression) 13 | 14 | This lecture uses the [hotels-vienna.dta](https://osf.io/dn8je) dataset. 15 | 16 | ## Learning outcomes 17 | After successfully completing the code in `lec12_regression_basics.do` students should be able to: 18 | 19 | - Run simple regression models. 20 | - Calculate fitted values and residuals. 21 | - Include qualitative variables in the regression models. 22 | - Include interaction terms in the regression models. 23 | 24 | ## Datasets used 25 | 26 | * [hotels-vienna.dta](https://osf.io/dn8je) 27 | 28 | ## Lecture Time 29 | 30 | Ideal overall time: **60 mins**. 31 | 32 | ## Homework 33 | 34 | *Type*: quick practice, approx 20 mins 35 | 36 | In this homework you are invited to reproduce the regression results of the following case studies: 37 | 38 | - 9.A4: Tables 9.1 and 9.2. 39 | - 10.A1: Table 10.1. 40 | - 10.A3: Table 10.2. 41 | - 10.A4: Table 10.3. 42 | - 10.A5: Table 10.4. 43 | - 10.A6: Table 10.5. 44 | 45 | 46 | ## Detailed online descriptions for the most important commands from the Stata website 47 | 48 | - [regress](https://www.stata.com/manuals/rregress.pdf) 49 | - [lowess](https://www.stata.com/manuals/rlowess.pdf) 50 | 51 | ## Further material 52 | 53 | - [Chapter 1](https://stats.oarc.ucla.edu/stata/webbooks/reg/chapter1/regressionwith-statachapter-1-simple-and-multiple-regression/) of UCLA's Regression with Stata book offers an extensive introduction to the topic. 54 | 55 | - The [paper](https://www.stata.com/why-use-stata/easy-to-grow-with/linear.pdf) by Rose Medeiros is an excellent description of interactions.
56 | 57 | - There is a brief description of the usage of factor variables at the Stata [webpage](https://www.stata.com/features/overview/factor-variables/). 58 | -------------------------------------------------------------------------------- /lec12_regression_basics/lec12_regression_basics.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 12 ****************************************************** 4 | *********************************************************************************************************************** 5 | ************************************************** Regression basics ************************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | *** Defining the default folder 11 | cd "da-coding-stata/lec12_regression_basics/" 12 | 13 | 14 | *************** Copying and opening data 15 | *** We use the Hotels in Vienna data, introduced in Case Study 1.A1 and used in many other case studies later. 16 | copy "https://osf.io/dn8je/download" "hotels-vienna.dta", replace 17 | use "hotels-vienna.dta", clear 18 | 19 | 20 | *************** Research question 21 | *** We analyze hotels, and... 22 | *** ... our research question is whether there is an association between price and distance from the city center. 23 | *** We fix the level of significance at 5%. 
24 | 25 | 26 | *************** The linear regression 27 | *** We are going to reproduce some results of case studies 7.A2 and 7.A3 (and later 7.A1), so let us start with sample selection... 28 | *** ... We follow the steps presented in the second paragraph of case study 7.A1. 29 | 30 | * Keeping accommodations from Vienna: 31 | keep if city_actual == "Vienna" 32 | 33 | * Keeping hotels: 34 | keep if accommodation_type == "Hotel" 35 | 36 | * Keeping hotels with 3, 3.5, or 4 stars: 37 | keep if stars == 3 | stars == 3.5 | stars == 4 38 | 39 | * Dropping a hotel with an extremely high price: 40 | summarize price, detail 41 | drop if price == 1012 42 | 43 | count /*207 observations remained*/ 44 | 45 | 46 | *************** Correlation 47 | * Let us first calculate the correlation coefficient between 'price' and 'distance': 48 | pwcorr price distance /*The coeff is: -0.3963.*/ 49 | 50 | * We can also indicate the significance of the coeff with a star using the 'star' option: 51 | pwcorr price distance, star(0.05) /*It is significant.*/ 52 | 53 | * Or, we can display the significance level itself: 54 | pwcorr price distance, sig 55 | 56 | *** Please note that there is another command for calculating correlation coefficients: 'correlate' 57 | * 'correlate' displays a correlation or covariance matrix. 58 | * 'pwcorr' displays pairwise correlation coefficients. 59 | 60 | 61 | *************** Linear regression 62 | *** We can run a linear regression using the 'regress' command: regress depvar [indepvars] [if] [in] [weight] [, options] 63 | regress price distance 64 | *** As you can see, we have all the important results in the output table: 65 | * The number of observations is 207. 66 | * The R2 is 0.157 (rounded to 0.16 in the book on page 192). 67 | * The coefficient of 'distance' is -14.4 (rounded to -14 in the book on page 183). 68 | * We can also see the results of the t-tests with their p-values, and also the confidence intervals (discussed in Chapter 9).
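* As a quick aside (a minimal sketch, not part of the original case study): the key numbers above...
* ... can also be retrieved from the stored results of 'regress', similarly to the r(mean) stored result...
* ... we used after 'summarize' in Lecture 11. Run this right after the regression:
display "Number of obs:     " e(N)
display "R-squared:         " e(r2)
display "Coeff of distance: " _b[distance]
display "SE of distance:    " _se[distance]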
69 | 70 | 71 | *************** Referring to the regression sample 72 | *** It can happen that not all the observations are included in our regression model... 73 | *** ... For example, Stata automatically omits all the observations that have missing values for any regression variables... 74 | *** ... We can refer to the sample of the last regression we ran using the 'e(sample)' stored result... 75 | *** ... Let us count the number of obs included in the regression we ran before... 76 | *** ... (Of course, we know that it should be 207, as was indicated in the regression table.) 77 | count if e(sample) 78 | 79 | 80 | *************** Obtaining predictions and residuals 81 | *** Let us now predict price based on the model (E(y|x)): 82 | predict price_hat /*A new variable was generated that contains the fitted values.*/ 83 | 84 | *** Let us now calculate the residuals: 85 | * We use 'predict' here as well, but with the 'residuals' option: 86 | predict res, residuals 87 | 88 | * Of course, we can also calculate the residuals manually using the predicted values: 89 | generate res_manual = price - price_hat 90 | 91 | *** Please note that different influential statistics (Section 8.9) can also be calculated using the 'predict' command,... 92 | *** ... check the help for details: 93 | help regress_postestimation 94 | 95 | 96 | *************** Including qualitative variables in the regression models 97 | *** Section 10.9 introduces the usage of qualitative variables as right-hand-side variables. The solution: dummy variables. 98 | *** There are two basic ways to include a set of dummy variables in our regression model: 99 | * 1) Generating the dummies and then including them. 100 | * 2) Including the dummies without generating them.
101 | 102 | *** Let us investigate whether the -rating- matters, but we will use a qualitative rating variable, so first let us generate it: 103 | summarize rating, detail 104 | generate rating_quali = "low" if rating <= 3.5 105 | replace rating_quali = "moderate" if rating > 3.5 & rating < 4.5 106 | replace rating_quali = "high" if rating >= 4.5 & rating < . /*'rating < .' excludes missing, which counts as larger than any number in Stata*/ 107 | tabulate rating_quali, missing 108 | * In most cases we will need a quantified version of this variable, so let us encode it: 109 | encode rating_quali, gen(rating_quali_num) 110 | 111 | * 1) Generating the dummies and then including them in the regression model 112 | * Using the 'tabulate' command with the 'generate' option, we can generate the necessary dummy set: 113 | tabulate rating_quali, gen(drating_quali) 114 | 115 | * As you can see, 3 dummy variables were created. Let us include them (of course, one is dropped automatically): 116 | regress price distance drating_quali* 117 | 118 | * 2) Including the dummies without generating them 119 | * Using the 'i.' operator, we can include dummy variables without generating them (string variables cannot be used): 120 | regress price distance i.rating_quali_num 121 | 122 | * The 'i.' operator gives us more options, so it is worth using. For example: 123 | * a) We can show the reference category in the output table: 124 | set showbaselevels on 125 | regress price distance i.rating_quali_num 126 | 127 | * b) We can set the reference category (let it be the moderate rating, which 'encode' coded as 3, since categories are numbered alphabetically)... 128 | * ... Note that we use 'b.' (~base) instead of 'i.': 129 | regress price distance b3.rating_quali_num 130 | 131 | 132 | *************** Including interaction terms in the regression models 133 | *** Section 10.10 introduces the usage of interaction terms.
Let us see some examples: 134 | 135 | * 1) Categorical by categorical interactions 136 | * We will use 'offer_cat' as well here: 137 | tab offer_cat, missing 138 | 139 | * It is a string variable, so let us encode it: 140 | encode offer_cat, gen(offer_cat_num) 141 | 142 | * The interaction: 143 | regress price distance rating_quali_num##offer_cat_num 144 | 145 | * 2) Categorical by continuous interactions 146 | * We should use the 'c.' operator to refer to continuous variables. Categorical is assumed by default - see our previous example. 147 | regress price distance rating_quali_num##c.stars 148 | 149 | * 3) Continuous by continuous interactions 150 | regress price distance c.rating##c.stars /*We use the original -rating- variable here.*/ 151 | 152 | * Please note that polynomials (Subsection 8.7) can also be added in this way, e.g. the quadratic form of -distance-: 153 | regress price c.distance##c.distance 154 | 155 | 156 | *************** Robust standard errors 157 | * Subsection 9.2 introduced the phenomenon of heteroskedasticity and the definition of robust SEs. 158 | * Robust SEs can be easily calculated using the 'vce(robust)' option: 159 | regress price distance, vce(robust) 160 | 161 | 162 | *************** The lowess non-parametric regression 163 | * Using the 'lowess' command, we can run regressions like the one in Figure 7.3: 164 | lowess price distance, bwidth(0.8) 165 | -------------------------------------------------------------------------------- /lec13_presenting_regresults/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Lecture 13: Presenting regression results 5 | 6 | ## Motivation 7 | 8 | A book is often judged by its cover. Yes, having nice content is important, but a fancy cover is important as well. Running a regression is not enough: you should be able to present its results in a way that is compact and easy to understand.
9 | 10 | ## This lecture 11 | 12 | This lecture introduces students to the basics of presenting regression results. We mention some commands that help visualize the results, and we also discuss other commands that can be used to create compact and fancy tables containing regression results. 13 | 14 | Case studies (at least partly) connected to this lecture: 15 | - [Chapter 10, 10.A1: Understanding the gender difference in earnings](https://gabors-data-analysis.com/casestudies/#ch10a-understanding-the-gender-difference-in-earnings) 16 | 17 | This lecture uses the [morg-2014-emp.dta](https://osf.io/rtmga) dataset. 18 | 19 | ## Learning outcomes 20 | After successfully completing the code in `lec13_presenting_regresults.do` students should be able to: 21 | 22 | - Visualize simple regressions and regression results. 23 | - Compile compact and fancy tables with regression results. 24 | 25 | ## Datasets used 26 | 27 | * [morg-2014-emp.dta](https://osf.io/rtmga) 28 | 29 | ## Lecture Time 30 | 31 | Ideal overall time: **35 mins**. 32 | 33 | ## Homework 34 | 35 | *Type*: quick practice, approx 20 mins 36 | 37 | In your previous homework assignment you reproduced some regression results from Chapters 9 and 10. Now, get back to those results, and try to arrange them into fancy and informative tables. 38 | 39 | ## Further material 40 | 41 | - Ben Jann, the creator of the coefplot command, has great [presentation material](https://www.stata.com/meeting/germany14/abstracts/materials/de14_jann.pdf) on it. 42 | 43 | - Ben Jann published an article in The Stata Journal about [estout](https://journals.sagepub.com/doi/pdf/10.1177/1536867X0500500302), and another one about [eststo](https://journals.sagepub.com/doi/pdf/10.1177/1536867X0700700207). 44 | 45 | - Roy Wada has a [paper](https://www.stata.com/meeting/wcsug07/Rapid_Formation_article.pdf) about the outreg2 command.
46 | 47 | - Asjad Naqvi has a long but comprehensive [description](https://medium.com/the-stata-guide/the-stata-to-latex-guide-6e7ed5622856) about producing LaTeX-ready tables in Stata. 48 | 49 | 50 | 51 | -------------------------------------------------------------------------------- /lec13_presenting_regresults/lec13_presenting_regresults.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 13 ****************************************************** 4 | *********************************************************************************************************************** 5 | ******************************************** Presenting regression results ******************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | *** Defining the default folder 11 | cd "da-coding-stata/lec13_presenting_regresults/" 12 | 13 | 14 | *************** Copying and opening data 15 | *** We use the Current Population Survey data, introduced in Case Study 9.A1, and... 16 | *** we reproduce the results of the three specifications presented in Table 10.1 in case study 10.A1 (+1 own specification), and... 17 | *** present them in several different ways.
18 | copy "https://osf.io/rtmga/download" "cps-earnings.dta", replace 19 | use "cps-earnings.dta", clear 20 | 21 | 22 | *************** Sample selection 23 | *** First we reproduce the sample selection described in case study 10.A1 to get the same results as the ones in Table 10.1. 24 | 25 | * Keeping employees with a graduate degree: 26 | keep if grade92 >= 44 27 | 28 | * Keeping employees of age 24 to 65: 29 | keep if age >= 24 & age <= 65 30 | 31 | * Keeping employees who reported at least 20 hours as their usual weekly time worked: 32 | keep if uhours >= 20 33 | 34 | count /*Number of observations is: 18241*/ 35 | 36 | 37 | *************** Variable production 38 | *** Now we create the necessary variables. 39 | generate hwage = (earnwke/uhours) 40 | label variable hwage "hourly wage" 41 | 42 | generate lnhwage = ln(hwage) 43 | label variable lnhwage "ln(hourly wage)" 44 | 45 | generate female = 0 if sex == 1 46 | replace female = 1 if sex == 2 47 | label define fem 0 "male" 1 "female" 48 | label values female fem 49 | 50 | 51 | *************** Running regressions 52 | * Specification 1: 53 | regress lnhwage female 54 | 55 | * Specification 2: 56 | regress lnhwage female age 57 | 58 | * Specification 3: 59 | regress age female 60 | 61 | * Specification +1: 62 | regress hwage age 63 | 64 | 65 | *************** 1) Graphical representation of regression results 66 | *** First let us discuss some methods for the graphical representation of regression results. 67 | *** (Please note that we only do some basics, without any scaffolding.)
68 | 69 | *** 1a) Plotting a simple regression 70 | * We can plot the observations using a scatter and fit a linear regression line: 71 | graph twoway (scatter hwage age) (lfit hwage age) 72 | 73 | * A confidence interval can also be presented using 'lfitci': 74 | graph twoway (scatter hwage age) (lfitci hwage age) 75 | 76 | *** 1b) Plotting coefficients 77 | * We can also plot the estimated coefficients using a user-written command, which should be installed first: 78 | ssc install coefplot 79 | 80 | regress lnhwage female age 81 | coefplot 82 | 83 | * 'coefplot' shows the confidence interval as well, but it cannot be seen now due to the relative sizes of the coeffs,... 84 | * ... so let us drop the constant and visualize without it: 85 | coefplot, drop(_cons) 86 | 87 | * We can also add a vertical line at 0 to help the quick interpretation of sign and significance: 88 | coefplot, drop(_cons) xline(0) 89 | 90 | 91 | *************** 2) Presenting results in compact tables 92 | *** Now, let us discuss some commands that help to make fancy and compact (and, with a bit more coding, ready-to-publish) tables.
93 | 94 | *** 2a) The 'estout' package 95 | * Step 0: since it is a user-written command, it should be installed first (if it has not been so far): 96 | ssc install estout 97 | 98 | * Step 1: previous stored results (if any) should be cleared from the memory: 99 | eststo clear 100 | 101 | * Step 2: the regressions should be run and stored: 102 | eststo: regress lnhwage female /*stored as est1*/ 103 | eststo: regress lnhwage female age /*stored as est2*/ 104 | eststo: regress age female /*stored as est3*/ 105 | eststo: regress hwage age /*stored as est4*/ 106 | 107 | * Step 3: results should be displayed: 108 | estout est1 est2 est3 est4 109 | 110 | * Some formatting can also be added using options, e.g.: 111 | * Adding cells with some information using 'cells()': 112 | * Displaying stars referring to significance and coeffs (called 'b') with three decimals: 'b(star fmt(3))' 113 | * Displaying SEs also with three decimals in parentheses: 'se(par fmt(3))' 114 | * Displaying R2 and number of observations: 'stats(r2 N)' 115 | * Displaying the meaning of stars: 'legend' 116 | 117 | estout est1 est2 est3 est4, cells(b(star fmt(3)) se(par fmt(3))) stats(r2 N) legend 118 | 119 | * We can also give names (using the 'title' option) to the models and display them (using the 'label' option later): 120 | eststo clear 121 | eststo, title("lnhwage"): regress lnhwage female /*stored as est1*/ 122 | eststo, title("lnhwage"): regress lnhwage female age /*stored as est2*/ 123 | eststo, title("age"): regress age female /*stored as est3*/ 124 | eststo, title("hwage"): regress hwage age /*stored as est4*/ 125 | 126 | estout est1 est2 est3 est4, label /*Check the result!*/ 127 | 128 | * For further formatting options, check the help menu: 129 | help eststo 130 | 131 | *** 2b) The 'outreg2' command 132 | * With 'outreg2' we can create compact tables and export them easily to Excel or to LaTeX. 133 | * 'outreg2' is also a user-written command, so first it should be installed.
134 | ssc install outreg2 135 | 136 | * Let us start with a basic example: running a regression, then exporting the results to a .txt file: 137 | regress lnhwage female 138 | outreg2 using "results" /*Check the new file in your default folder.*/ 139 | 140 | * Now, let us continue with some important options and run more specifications: 141 | regress lnhwage female 142 | outreg2 using "results", replace bdec(3) /*Overwriting the existing file, and displaying coeffs (b) with 3 decimals.*/ 143 | 144 | regress lnhwage female age 145 | outreg2 using "results", append bdec(3) /*Appending new results to the existing file.*/ 146 | 147 | regress age female 148 | outreg2 using "results", append bdec(3) excel /*Creating an Excel file.*/ 149 | 150 | * Check your default folder again, and open the Excel file created. Fancy, isn't it? 151 | -------------------------------------------------------------------------------- /lec14_TSdata/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | # Lecture 14: Working with time series data 5 | 6 | ## Motivation 7 | 8 | Niels Bohr once famously said that "Prediction is very difficult, especially if it's about the future." Nowadays most datasets have a time dimension, so analysts have to face the challenges of using time series or panel data, and of forecasting. Thus, all analysts should know how to deal with time series (TS) data. 9 | 10 | ## This lecture 11 | 12 | This lecture introduces students to the basics of using time series data. 13 | 14 | Case studies (at least partly) connected to this lecture: 15 | - [Chapter 12, 12.A1: Returns on a company stock and market returns](https://gabors-data-analysis.com/casestudies/#ch12a-returns-on-a-company-stock-and-market-returns) 16 | 17 | This lecture uses the [SP500_2006_16_data.dta](https://osf.io/gds5t) dataset.
18 | 19 | ## Learning outcomes 20 | After successfully completing the code in `lect14_TSdata.do` students should be able to: 21 | 22 | - Deal with date variables. 23 | - Define the TS structure of the dataset. 24 | - Fill gaps in TS data. 25 | - Use TS operators. 26 | - Execute some other important TS-specific commands. 27 | 28 | ## Datasets used 29 | 30 | * [SP500_2006_16_data](https://osf.io/gds5t) 31 | 32 | ## Lecture Time 33 | 34 | Ideal overall time: **30 mins**. 35 | 36 | ## Homework 37 | 38 | *Type*: quick practice, approx 30 mins 39 | 40 | Your task is to reproduce the following results of the case study Electricity Consumption and Temperature: 41 | 42 | - Figure 12.7 on page 347. 43 | - Figure 12.8 on page 348. 44 | - Table 12.4 on page 353. 45 | - Table 12.5 on page 355. 46 | - Table 12.6 on page 358. 47 | 48 | ## Further material 49 | 50 | - The SSCC's online course has a comprehensive module on [working with dates](https://sscc.wisc.edu/sscc/pubs/stata_dates.htm). 51 | 52 | - UCLA's online course also discusses the usage of dates in one of its [modules](https://stats.oarc.ucla.edu/stata/modules/using-dates-in-stata/). 53 | 54 | - There is a [great list](https://www.stata.com/features/time-series/) on the Stata website of Stata's capabilities in TS modeling. 55 | 56 | - Stata's [Time-Series Reference Manual](https://www.stata.com/manuals/ts.pdf) is about 1000 pages long, free of charge, and offers a comprehensive description of TS in Stata.
57 | 58 | -------------------------------------------------------------------------------- /lec14_TSdata/lect14_TSdata.do: -------------------------------------------------------------------------------- 1 | *********************************************************************************************************************** 2 | *********************************************************************************************************************** 3 | ***************************************************** Lecture 14 ****************************************************** 4 | *********************************************************************************************************************** 5 | ******************************************** Working with time series data ******************************************** 6 | *********************************************************************************************************************** 7 | *********************************************************************************************************************** 8 | 9 | *************** The default folder 10 | *** Defining the default folder 11 | cd "da-coding-stata/lec14_TSdata/" 12 | 13 | 14 | *************** Copying and opening data 15 | *** We use the stock market data, introduced in Case Study 5.A1,... 16 | *** which is also used in Chapter 12 (not exactly this dataset, but part of the dataset used there). 17 | copy "https://osf.io/gds5t/download" "sp500.dta", replace 18 | use "sp500.dta", clear 19 | 20 | 21 | *************** 1) Dealing with dates 22 | *** Let us start with dealing with dates. We have two variables here that contain dates: -datestring- and -date-... 23 | *** ... Variable -datestring- is a string one, while -date- is recognized as a date variable. In most cases,... 24 | *** ... when we import a .csv or an .xls(x) file, the result is one of these two cases... 25 | *** ... Let us deal with both cases.
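* (A brief aside, sketched here under the assumption that -datestring- holds year-month-day strings...
* ... like "2006-08-29": instead of the manual route shown below, the date() function can do the...
* ... string-to-date conversion in one step; its mask, "YMD" here, gives the order of the components.)
generate date_quick = date(datestring, "YMD")
format date_quick %td
* (We will nevertheless go through the manual route as well, since its steps are instructive.)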
26 | 27 | *** 1a) Starting with the string case: -datestring- is a simple string variable here, it has nothing to do with dates... 28 | *** ... Let us make it usable: 29 | * Step 1: cut it into three pieces (= new variables) using a string function ('split') and defining where to cut (the 'parse' option): 30 | split datestring, parse(-) 31 | 32 | * As you can see, we now have three new variables containing year, month, and day. 33 | * Step 2: destring and rename them: 34 | destring datestring*, replace 35 | rename datestring1 year 36 | rename datestring2 month 37 | rename datestring3 day 38 | 39 | * Step 3: create a date variable using the 'mdy' function: 40 | generate date_new = mdy(month, day, year) 41 | * Let us check it: 42 | browse year month day date_new 43 | 44 | * As you can see, the content of the variable is strange: 17038, 17041 etc., still nothing to do with dates. 45 | * Well, it is in fact a date variable: dates are simply numbers in Stata. In case of daily data their interpretation is: ... 46 | * ... number of days since January 1, 1960. Similarly: 47 | * - in case of yearly data: number of years since 1960 48 | * - in case of monthly data: number of months since January, 1960, etc... 49 | 50 | * Step 4: for a more intuitive way of display, you should change the format: 51 | format date_new %td /*%td means that the variable contains dates at daily frequency*/ 52 | browse year month day date date_new /*Now it is okay, and is the same as our original -date- variable.*/ 53 | 54 | *** 1b) Let us now consider the original -date- variable that is recognized by Stata. 55 | * We can apply date functions to it immediately. E.g.: 56 | * Let us create a day variable: 57 | generate day_new = day(date) 58 | 59 | * Or a month or a year variable: 60 | generate month_new = month(date) 61 | 62 | * Or year: 63 | generate year_new = year(date) 64 | 65 | *** Summing up: If we have a string date variable, first we should convert it to date,... 66 | *** ...
and then we can apply different date and time functions to it... 67 | *** ... Variables stored as dates can be used immediately, without any preparation. 68 | 69 | *** For the comprehensive list of date and time functions, check the help menu: 70 | help datetime_functions 71 | 72 | 73 | *************** 2) Defining time series (TS) 74 | *** Before using any kind of time series-specific commands (commands starting with the letters 'ts'),... 75 | *** ... you should define the time series structure, i.e... 76 | *** ... name a variable that defines time. Now, let it be the original -date- variable: 77 | tsset date 78 | 79 | * Check the return message in the output window! Stata recognized that it is daily data. It may not always work,... 80 | * ... sometimes we should define the frequency with an option for 'tsset'. Check the help for the necessary option. 81 | 82 | *** Please note that the time variable cannot contain duplicates. Do the following and check the error message at the end: 83 | generate x = ym(year, month) 84 | tsset x 85 | 86 | *** By simply typing 'tsset' we can query whether date is defined or not: 87 | tsset /*So we have daily data between 2006 and 2016, with some gaps.*/ 88 | count /*We have 2519 days, i.e. observations.*/ 89 | 90 | 91 | *************** 3) Filling gaps 92 | *** So, as we saw, there are gaps in our data: there are days that are missing from the dataset... 93 | *** ... (note that it is not that values are missing for some days; the days themselves - i.e. the observations - are missing)... 94 | *** ... We can fill those gaps using the 'tsfill' command. 95 | 96 | tsfill 97 | count /*Now we have 3655 days, i.e. observations, and the newly created ones contain missing values for all other variables.*/ 98 | 99 | *** Let us drop some variables that are not needed: 100 | drop datestring date_new day_new month_new year_new x 101 | 102 | 103 | *************** 4) Time series operators 104 | *** In time series analysis we often need lags or leads, etc.
It is easy to generate them in Stata. 105 | *** First, the time series structure should be defined (as we have already done), and then we can create lags, leads etc. 106 | 107 | *** Lag operator: 108 | generate lvalue = l.value /*1st lag*/ 109 | generate llvalue = l2.value /*2nd lag*/ 110 | * We could go on of course... 111 | 112 | *** Lead operator: 113 | generate fvalue = f.value /*1st lead*/ 114 | generate ffvalue = f2.value /*2nd lead*/ 115 | * We could go on of course... 116 | 117 | *** Difference operator 118 | generate dvalue = d.value /* t-(t-1) */ 119 | generate ddvalue = d2.value /* [t-(t-1)]-[(t-1)-(t-2)] */ 120 | * We could go on of course... 121 | 122 | *** Seasonal difference operator 123 | generate svalue = s.value /* t-(t-1) */ 124 | generate ssvalue = s2.value /* t-(t-2) */ 125 | * We could go on of course... 126 | 127 | 128 | *************** 5) Other basic, but important commands 129 | *** Using the 'tsreport' command, we get some elementary description of the TS aspects of our data 130 | tsreport 131 | 132 | *** Using the 'tsappend' command, we can add observations (i.e. extra rows) to our dataset 133 | tsappend, add(10) /*It adds 10 new observations (days) to the end of our current TS*/ 134 | 135 | *** Using the 'tsline' command, we can create a line graph 136 | tsline value 137 | 138 | *** Using the 'corrgram' command, we can calculate serial correlation (subsection 12.6) 139 | corrgram value, lags(3) /*1st-, 2nd-, and 3rd-order serial correlation and partial serial correlation*/ 140 | 141 | *** Using the 'newey' command, we can estimate OLS regression with Newey-West standard errors (subsection 12.7) 142 | newey value l.value, lag(3) /*Just a sketch: regressing -value- on its own lag; the lag length of 3 is an arbitrary choice here.*/ 143 | 144 | *** Using the 'pperron' command, we can perform a Phillips-Perron unit-root test (subsection 12.U1) 145 | pperron value /*We cannot reject the null of unit root.*/ 146 | 147 | 148 | --------------------------------------------------------------------------------