├── images
    ├── README.md
    ├── k-city.jpg
    ├── all-city.jpg
    ├── bin-city.jpg
    ├── final-city.jpg
    ├── k-bin-city.jpg
    ├── k-downtown.jpg
    ├── k-raw-city.jpg
    ├── raw-city.jpg
    ├── all-downtown.jpg
    ├── bin-downtown.jpg
    ├── raw-downtown.jpg
    ├── final-downtown.jpg
    ├── k-bin-downtown.jpg
    ├── k-bin-raw-city.jpg
    ├── k-raw-downtown.jpg
    ├── k-bin-raw-downtown.jpg
    ├── 2-single-trip-fuzzed-location.gif
    ├── 3-single-trip-final-location.gif
    ├── 4-potential-original-locations.gif
    └── 1-single-trip-original-location.gif
├── README.md
└── DocklessOpenData-Sample-Aug2019-Louisville.csv


/images/README.md:
--------------------------------------------------------------------------------
Sample images go in this folder

--------------------------------------------------------------------------------
/images/k-city.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/k-city.jpg
--------------------------------------------------------------------------------
/images/all-city.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/all-city.jpg
--------------------------------------------------------------------------------
/images/bin-city.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/bin-city.jpg
--------------------------------------------------------------------------------
/images/final-city.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/final-city.jpg
--------------------------------------------------------------------------------
/images/k-bin-city.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/k-bin-city.jpg
--------------------------------------------------------------------------------
/images/k-downtown.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/k-downtown.jpg
--------------------------------------------------------------------------------
/images/k-raw-city.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/k-raw-city.jpg
--------------------------------------------------------------------------------
/images/raw-city.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/raw-city.jpg
--------------------------------------------------------------------------------
/images/all-downtown.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/all-downtown.jpg
--------------------------------------------------------------------------------
/images/bin-downtown.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/bin-downtown.jpg
--------------------------------------------------------------------------------
/images/raw-downtown.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/raw-downtown.jpg
--------------------------------------------------------------------------------
/images/final-downtown.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/final-downtown.jpg
--------------------------------------------------------------------------------
/images/k-bin-downtown.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/k-bin-downtown.jpg
--------------------------------------------------------------------------------
/images/k-bin-raw-city.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/k-bin-raw-city.jpg
--------------------------------------------------------------------------------
/images/k-raw-downtown.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/k-raw-downtown.jpg
--------------------------------------------------------------------------------
/images/k-bin-raw-downtown.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/k-bin-raw-downtown.jpg
--------------------------------------------------------------------------------
/images/2-single-trip-fuzzed-location.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/2-single-trip-fuzzed-location.gif
--------------------------------------------------------------------------------
/images/3-single-trip-final-location.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/3-single-trip-final-location.gif
--------------------------------------------------------------------------------
/images/4-potential-original-locations.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/4-potential-original-locations.gif
--------------------------------------------------------------------------------
/images/1-single-trip-original-location.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/louisvillemetro-innovation/Dockless-Open-Data/HEAD/images/1-single-trip-original-location.gif
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Dockless Open Data

This guide shows how and why cities can convert [MDS](https://github.com/OpenMobilityFoundation/mobility-data-specification) trip data to anonymized open data while respecting rider privacy. This method is used in [Louisville's public dockless open trip data](https://data.louisvilleky.gov/dataset/dockless-vehicles) and uses MySQL queries only. It bins data in both space and time.

## Table of Contents

- [Points to Consider](#points-to-consider)
- [Example Geographic Data Outcomes](#example-geographic-data-outcomes)
  + [1. Time Binning](#1-time-binning)
  + [2. Initial Locations - Raw GPS Points](#2-initial-locations---raw-gps-points)
  + [3. Geographic Binning](#3-geographic-binning)
  + [4. Geographic Fuzzing](#4-geographic-fuzzing)
  + [5. End Result](#5-end-result)
    * [Interactive Map](#interactive-map)
- [Data Processing](#data-processing)
  + [1. Obtain secure access to an MDS feed](#1-obtain-secure-access-to-an-mds-feed)
  + [2. Ingest a subset of MDS data into a database table](#2-ingest-a-subset-of-mds-data-into-a-database-table)
  + [3. Run scripts to convert and clean open data](#3-run-scripts-to-convert-and-clean-open-data)
  + [4. Anonymize start and end points with few trips](#4-anonymize-start-and-end-points-with-few-trips)
  + [Variables](#variables)
- [Final format of Open Data](#final-format-of-open-data)
- [Anonymization Deep Dive](#anonymization-deep-dive)
  + [1. Single Trip - Original Location](#1-single-trip---original-location)
  + [2. Single Trip - Fuzzed Location](#2-single-trip---fuzzed-location)
  + [3. Single Trip - Final Location](#3-single-trip---final-location)
  + [4. Potential Original Locations](#4-potential-original-locations)
- [Cities with Dockless Trip Open Data](#cities-with-dockless-trip-open-data)
- [References](#references)
- [Feedback](#feedback)

## Points to Consider

We welcome feedback on this method of publishing. We want to preserve rider privacy while being transparent about our methods and the data we collect.

- Cities need to be transparent about the kinds of data we and private companies collect on residents. Publishing a subset of this data helps with this goal.
- The trip data in its raw form is considered highly sensitive. Although it tracks the use of transportation devices anonymously, it tracks them in space and time, so it could reveal PII if merged with other data sources. This is why we process the data before releasing it. Cities collect information that can include PII as required to provide services to residents, and similar processing is done before releasing it publicly.
Examples include crime report, permit, property, health, transit, fire, salary, HR, fee, tax, violation, ticket, citation, business registration, car crash, bikeshare, financial, 311, and 911 data.
- The data cities receive does not include traditional PII or any other information about the rider, such as name, home/billing address, credit card number, cell phone number, email address, birthdate, sex, driver's license info, height, weight, or trip history. Only the dockless vehicle companies have that information and can connect it to each trip.
- Sharing a subset of this data is required in many jurisdictions by open records laws, local policy, state law, and federal law, so defining the details of what to share is important. There are typically exceptions for personally identifiable information, trade secrets of companies, and sensitive data, which this method should account for.
- When publishing open data in this form, talk to your mobility service provider and third party aggregators to ensure the resulting data is accurate.
- Cities need to balance transparency requirements and open records laws with privacy best practices.

Note this is not legal advice, but considerations from Louisville's perspective.

# Example Geographic Data Outcomes

Starting with the raw location data (red), we will use binning and k-anonymity to fuzz the locations, while still providing useful, granular data (green) to comply with local, state, and federal open records laws.

![Raw to Final](https://raw.githubusercontent.com/louisvillemetro-innovation/dockless-open-data/images/images/k-bin-raw-city.jpg)

This image shows 100,000 dockless vehicle trip starting points (in red) from one provider, selected randomly from raw Louisville data and zoomed into downtown for detail. After we bin the locations to about 100 meters, we use a k-anonymity generalization method to arrive at the final open data (point grid in green).

### 1. Time Binning

The first thing we do to the raw data is bin the start and end timestamps into 15 minute increments. This reduction in temporal resolution helps anonymize the data. Note that we store times in our city's local time. We also use ISO 8601 for clarity; you should account for time zones and Daylight Saving Time in your local area.

### 2. Initial Locations - Raw GPS Points

The raw start/end data comes to us through MDS as GPS points. Note some already have inherent GPS error, as can be seen by points in the Ohio River to the north. We use this data internally for policy compliance, planning, complaint resolution, parking compliance, and equitable distribution checks.

![Start](https://raw.githubusercontent.com/louisvillemetro-innovation/dockless-open-data/images/images/raw-downtown.jpg)

### 3. Geographic Binning

Next, we reduce the latitude and longitude to 3 decimal places (the formatting SQL under Data Processing uses `ROUND`), which bins the starting and ending locations into a grid that is about 100 meters tall and 80 meters wide at Louisville's location on the planet. This effectively creates a spatial histogram of rectangular tessellation across the city. Instead of displaying this as points, you could show the data as weighted rectangles.

![Binning](https://raw.githubusercontent.com/louisvillemetro-innovation/dockless-open-data/images/images/bin-downtown.jpg)

### 4. Geographic Fuzzing

Next, we run those binned locations through a k-anonymity generalization function. If 4 or fewer trips share the same origin/destination pair, then we move both the start and end points further. In the Louisville data, this is about one third of all the trips. We randomly move the locations within a 400 meter radius, which is up to 5 binning locations away in any direction.
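
The binning and fuzzing steps above can be sketched outside the database. The following is a minimal Python illustration, not the MySQL used in this guide. The function names (`bin_time`, `bin_coord`, `fuzz_point`), the meters-per-degree constant, and the choice to re-bin a point after fuzzing are all assumptions made for this example. The square root applied to the radius draw is what keeps fuzzed points evenly spread over the circle instead of clustered near its center.

```python
import math
import random
from datetime import datetime, timedelta

BIN_SECONDS = 15 * 60            # 15 minute time bins, as described above
FUZZ_RADIUS_M = 400.0            # fuzz radius, as described above
METERS_PER_DEG_LAT = 111_320.0   # assumption: simple spherical approximation


def bin_time(ts: datetime) -> datetime:
    """Round a timestamp to the nearest 15 minute mark."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    seconds = (ts - midnight).total_seconds()
    binned = math.floor((seconds + BIN_SECONDS / 2) / BIN_SECONDS) * BIN_SECONDS
    # Note: a time just before midnight rolls over to the next day here.
    return midnight + timedelta(seconds=binned)


def bin_coord(value: float) -> float:
    """Reduce a latitude or longitude to 3 decimal places."""
    return round(value, 3)


def fuzz_point(lat, lon, radius_m=FUZZ_RADIUS_M, rng=random):
    """Move a point to a uniformly random spot inside a disk, then re-bin it."""
    # sqrt() on the radius draw gives an even spread over the disk's area.
    r = radius_m * math.sqrt(rng.random())
    theta = 2 * math.pi * rng.random()
    dlat = (r * math.sin(theta)) / METERS_PER_DEG_LAT
    dlon = (r * math.cos(theta)) / (
        METERS_PER_DEG_LAT * math.cos(math.radians(lat))
    )
    # Assumption: fuzzed points are snapped back onto the 3-decimal grid so
    # they cannot be told apart from points that were only binned.
    return bin_coord(lat + dlat), bin_coord(lon + dlon)


if __name__ == "__main__":
    print(bin_time(datetime(2019, 8, 1, 10, 7, 29)))  # 2019-08-01 10:00:00
    print(bin_coord(38.25678))                         # 38.257
    print(fuzz_point(38.257, -85.759))                 # varies per run
```

Passing a seeded `random.Random` instance as `rng` makes the fuzzing reproducible for testing; a production run would leave it unseeded so the displacement cannot be replayed.
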

![Fuzzing](https://raw.githubusercontent.com/louisvillemetro-innovation/dockless-open-data/images/images/final-downtown.jpg)

Note how the points here are more spread out than with the step 3 binning alone. See this [online code sample](http://jsfiddle.net/7891b51f/) and article about [Disk Point Picking](http://mathworld.wolfram.com/DiskPointPicking.html) for more details.

### 5. End Result

In the end we have a grid of points, and a person looking at the data cannot trace a point back to its original location. Also, there is no way to tell if a point has been both fuzzed and binned, or only binned.

![Final](https://raw.githubusercontent.com/louisvillemetro-innovation/dockless-open-data/images/images/k-bin-raw-downtown.jpg)

Effectively, this means each point could be up to 1,600+ meters away from its actual location, while the integrity of the data is still reasonably maintained for analysis. See the "[Anonymization Deep Dive](https://github.com/louisvillemetro-innovation/dockless-open-data/blob/images/README.md#anonymization-deep-dive)" section below for more details.

*Note image colors have been checked to be accessible to color blind individuals. Please let us know if you experience any difficulties.*

## Interactive Map

Take a look at this 100,000 point data sample and 4 different layers on an [interactive map](https://cdolabs.carto.com/u/cdolabs-admin/viz/fd80e015-4319-4937-b350-545e4095f40c).

Note this only includes the location samples needed to make the example visuals in this document, not the final open data. The raw data layer is not downloadable (only visible on the map) and includes only the start location, not the end location, any time/date information, or any trip information (trip line, end point, distance, duration).

# Data Processing

This methodology will bin origin and destination data in both space and time.
These are the technical steps for processing MDS data into open data using MySQL. Note you can adapt this to MS SQL or PostGIS with changes to some of the function names.

### 1. Obtain secure access to an MDS feed

Using your city's [Dockless Vehicle Policy](https://data.louisvilleky.gov/dataset/dockless-vehicles/resource/541f050d-b868-428e-9601-c48a04eba17c) data sharing and enforcement requirements, obtain authentication with each operator's MDS feed for your city.

### 2. Ingest a subset of MDS data into a database table

Ingestion method from MDS is left as an exercise for the reader. Securely store a subset of the raw MDS data and provide only authorized, audited, secure access to the location. (We do not store trip line data within the city network and only access that from the MDS source APIs when needed.) Open source code, tools, or third party options may be added here at a later date.

This is the table structure for the *DocklessOpenData* open data table that you will be converting your data to:

```
CREATE TABLE `DocklessOpenData` (
  `TripID` varchar(50) NOT NULL,
  `StartDate` varchar(20) DEFAULT NULL,
  `StartTime` varchar(20) DEFAULT NULL,
  `EndDate` varchar(20) DEFAULT NULL,
  `EndTime` varchar(20) DEFAULT NULL,
  `TripDuration` float DEFAULT NULL,
  `TripDistance` float DEFAULT NULL,
  `StartLatitude` float DEFAULT NULL,
  `StartLongitude` float DEFAULT NULL,
  `EndLatitude` float DEFAULT NULL,
  `EndLongitude` float DEFAULT NULL,
  `DayOfWeek` varchar(45) DEFAULT NULL,
  `HourNum` varchar(45) DEFAULT NULL,
  `Fuzzed` tinyint(4) DEFAULT '0',
  `StartLat` float DEFAULT NULL,
  `StartLon` float DEFAULT NULL,
  `EndLat` float DEFAULT NULL,
  `EndLon` float DEFAULT NULL,
  PRIMARY KEY (`TripID`),
  KEY `idx_DocklessOpenData_StartLat` (`StartLat`),
  KEY `idx_DocklessOpenData_StartLon` (`StartLon`),
  KEY `idx_DocklessOpenData_EndLat` (`EndLat`),
  KEY `idx_DocklessOpenData_EndLon` (`EndLon`),
  KEY `idx_DocklessOpenData_Fuzzed` (`Fuzzed`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
```

Note the use of *varchars*, because not all company MDS feeds have reliable/complete data in the right format.

The last 5 columns are used just for anonymizing the data later, in step 4 below.

**Note no trip line/polyline data is being stored, and no provider information**.

When inserting from MDS into *DocklessOpenData*, use the following SQL for formatting values:

```
TripID = insert(insert(insert(insert(md5(sha2(source.OriginalTripID, '256')),9,1,'-'),14,1,'-'),19,1,'-'),24,1,'-')
-- creates a new uniform trip UUID from the original. This is a one-way function based on the source Trip UUID.
-- There may be a better way to do this, but we wanted to not generate a new UUID each time
-- and instead wanted it to be reproducible based on source data.

-- round start and end locations to 3 decimal places
StartLatitude = ROUND(source.OriginalStartLatitude, 3)
StartLongitude = ROUND(source.OriginalStartLongitude, 3)
EndLatitude = ROUND(source.OriginalEndLatitude, 3)
EndLongitude = ROUND(source.OriginalEndLongitude, 3)

StartDate = STR_TO_DATE(source.OriginalStartDateTime, '%Y-%m-%d')

EndDate = STR_TO_DATE(source.OriginalEndDateTime, '%Y-%m-%d')

StartTime = LEFT(SEC_TO_TIME(FLOOR((TIME_TO_SEC(source.OriginalStartDateTime) + 450) / 900) * 900), 5)
-- bins to 15 minute increments (rounds to the nearest quarter hour)

EndTime = LEFT(SEC_TO_TIME(FLOOR((TIME_TO_SEC(source.OriginalEndDateTime) + 450) / 900) * 900), 5)
-- bins to 15 minute increments (rounds to the nearest quarter hour)

TripDuration = Round( ( UNIX_TIMESTAMP(source.OriginalEndDateTime) - UNIX_TIMESTAMP(source.OriginalStartDateTime) ) /60 )
-- rounded to nearest minute
```

### 3. Run scripts to convert and clean open data

This caps distance outliers (negative distances are set to -1 and distances over 100 are set to 100) and populates the day of week and hour of day fields.

```
Update
  DocklessOpenData o
Set
  o.DayOfWeek = dayofweek(o.StartDate),
  o.HourNum = left(o.StartTime, 2),
  -- a WHERE clause cannot appear inside SET, so the caps use CASE
  o.TripDistance = case
    when o.TripDistance < 0 then -1
    when o.TripDistance > 100 then 100
    else o.TripDistance
  end;
```

### 4. Anonymize start and end points with few trips

If there are not many trips between a starting and ending location even after binning to 3 decimal places in step 2, then we anonymize further to protect the privacy of individual riders. This common practice is called "k-anonymity".

In our case, we look for O/D pairs with fewer than 5 trips. That is, where fewer than 5 trips were made between any combination of 2 aggregated start and end trip areas across the city. If there are, then we randomly move those points in a larger radius from the original location.
The radius here is about 400 meters, in a random direction; this is a k-anonymity generalization method.

In the final data there is no way to know which trips have been anonymized in this way, and which trips are only aggregated to the block level without further anonymization.

To do this with only SQL, we use the column called 'Fuzzed', which tracks which O/D pairs need to be fuzzed and which ones have then been fuzzed with a stored procedure. There are also 4 columns for lat/lon start/end coordinates that hold the original values before fuzzing.

**Technical Details**

The formula in the procedure below moves the location to an evenly distributed location (not clustered around the center) within a defined circle. See this [online code sample](http://jsfiddle.net/7891b51f/) and article about [Disk Point Picking](http://mathworld.wolfram.com/DiskPointPicking.html) formulas for more details.


```
-- 1 clear fuzz list
update mobility.DocklessOpenData set Fuzzed = 0;

-- 2 reset initial/fill new lat/lons into fuzzed latitude/longitude columns
Update DocklessOpenData o
Set
o.StartLatitude = o.StartLat, o.StartLongitude = o.StartLon, o.EndLatitude = o.EndLat, o.EndLongitude = o.EndLon;

-- 3 update Fuzzed for OD pairs of 4 or less.
-- The grouped derived table finds the small O/D groups; joining back on the
-- four coordinates flags every trip in each group, not just one TripID per group.
update mobility.DocklessOpenData d
join (
  SELECT o.StartLat, o.StartLon, o.EndLat, o.EndLon
  FROM mobility.DocklessOpenData o
  group by o.StartLat, o.StartLon, o.EndLat, o.EndLon
  having count(o.TripID) <= 4
) tblTmp
  on d.StartLat = tblTmp.StartLat and d.StartLon = tblTmp.StartLon
 and d.EndLat = tblTmp.EndLat and d.EndLon = tblTmp.EndLon
set d.Fuzzed = 1;
```

For the final step, we need to run a stored procedure.
You can create that procedure by running this code one time:

```
DELIMITER $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `FuzzOpenData`()
BEGIN

DECLARE randA FLOAT DEFAULT 0;
DECLARE randB FLOAT DEFAULT 0;

DECLARE n INT DEFAULT 0;
DECLARE i INT DEFAULT 0;

SELECT COUNT(*) FROM DocklessOpenData where Fuzzed = 1 INTO n;

SET i=0;
WHILE i