├── .gitignore ├── .travis.yml ├── LICENSE.md ├── README.md ├── csvdedupe ├── __init__.py ├── csvdedupe.py ├── csvhelpers.py └── csvlink.py ├── docs ├── conf.py └── index.rst ├── examples ├── .gitkeep ├── CPS_Early_Childhood_Portal_Scrape.csv ├── Contracts_after_8_2010.csv ├── IDHS_child_care_provider_list.csv ├── Lobbyists_2012_present.csv ├── config.json.example ├── csv_example_messy_input.csv ├── restaurant-1.csv └── restaurant-2.csv ├── requirements-test.txt ├── setup.cfg ├── setup.py └── tests └── test_command_line.py /.gitignore: -------------------------------------------------------------------------------- 1 | build 2 | docs 3 | *.pyc 4 | logfile 5 | *.*~ 6 | *.o 7 | *.so 8 | *.py.* 9 | *.*gz 10 | *.html 11 | .#* 12 | *.*# 13 | *.json 14 | *.db 15 | .DS_Store 16 | *.egg-info 17 | output.csv 18 | *_cleaned.csv 19 | dist -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | sudo: false 2 | branches: 3 | only: 4 | - master 5 | - "/^v.*$/" 6 | language: python 7 | notifications: 8 | email: 9 | on_failure: change 10 | python: 11 | - '3.6' 12 | install: 13 | - pip install -r requirements-test.txt 14 | - pip install -e . 
15 | script: nosetests 16 | deploy: 17 | provider: pypi 18 | user: datamade.wheelbuilder 19 | skip_cleanup: true 20 | password: 21 | secure: PwqGZJ/EMfpOVl8AirG61wcZnyb760uHD9YhRokDVZYVgohfT7pHk9nsRUEmq6Bzh9eioMn4tqkqxIBrnCbwqzHLFLZMlC2/76oKSPs9NH17o7rup/lX8AyskLRUN49xG7WjU4U791IODwb8eAtN4sMkT9HtUKumVwhJlAyyp0Y= 22 | on: 23 | tags: true 24 | distributions: "sdist bdist_wheel" 25 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | Copyright © 2013 DataMade (http://datamade.us) 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sub-license, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 4 | 5 | The above copyright notice, and every other copyright notice found in this software, and all the attributions in every file, and this permission notice shall be included in all copies or substantial portions of the Software. 6 | 7 | THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
8 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # csvdedupe 2 | 3 | Command line tools for using the [dedupe python library](https://github.com/dedupe.io/dedupe/) for deduplicating CSV files. 4 | 5 | Part of the [Dedupe.io](https://dedupe.io/) cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data. For more details, see the [differences between Dedupe.io and the dedupe library](https://dedupe.io/documentation/should-i-use-dedupeio-or-the-dedupe-python-library.html). 6 | 7 | Two easy commands: 8 | 9 | `csvdedupe` - takes a messy input file or STDIN pipe and identifies duplicates. 10 | 11 | `csvlink` - takes two CSV files and finds matches between them. 12 | 13 | [Read more about csvdedupe on OpenNews Source](http://source.opennews.org/en-US/articles/introducing-cvsdedupe/) 14 | 15 | 16 | [![Build Status](https://travis-ci.org/dedupeio/csvdedupe.png?branch=master)](https://travis-ci.org/dedupeio/csvdedupe) 17 | 18 | ## Installation and dependencies 19 | 20 | ``` 21 | pip install csvdedupe 22 | ``` 23 | 24 | ## Getting Started 25 | 26 | ### csvdedupe 27 | 28 | `csvdedupe` takes a messy input file or STDIN pipe and identifies duplicates. To get started, pick one of three deduping strategies: call `csvdedupe` with arguments, pipe your file using UNIX, or define a config file. 
29 | 30 | Provide an input file, field names, and output file: 31 | ```bash 32 | csvdedupe examples/csv_example_messy_input.csv \ 33 | --field_names "Site name" Address Zip Phone \ 34 | --output_file output.csv 35 | ``` 36 | 37 | __or__ 38 | 39 | Pipe it, UNIX style: 40 | ```bash 41 | cat examples/csv_example_messy_input.csv | csvdedupe --skip_training \ 42 | --field_names "Site name" Address Zip Phone > output.csv 43 | ``` 44 | 45 | __or__ 46 | 47 | Define everything in a config file: 48 | ```bash 49 | csvdedupe examples/csv_example_messy_input.csv \ 50 | --config_file=config.json 51 | ``` 52 | 53 | **Your config file may look like this:** 54 | 55 | ```json 56 | { 57 | "field_names": ["Site name", "Address", "Zip", "Phone"], 58 | "field_definition" : [{"field" : "Site name", "type" : "String"}, 59 | {"field" : "Address", "type" : "String"}, 60 | {"field" : "Zip", "type" : "String", 61 | "Has Missing" : true}, 62 | {"field" : "Phone", "type" : "String", 63 | "Has Missing" : true}], 64 | "output_file": "examples/output.csv", 65 | "skip_training": false, 66 | "training_file": "training.json", 67 | "sample_size": 150000, 68 | "recall_weight": 2 69 | } 70 | ``` 71 | 72 | #### To use `csvdedupe` you absolutely need: 73 | 74 | * `input` a CSV file name or piped CSV file to deduplicate 75 | 76 | Either 77 | * `--config_file` Path to configuration file. 
78 | 79 | Or 80 | * `--field_names` List of column names for dedupe to pay attention to 81 | 82 | #### You may also need: 83 | * `--output_file OUTPUT_FILE` 84 | CSV file to store deduplication results (default: 85 | None) 86 | * `--destructive` Output file will contain unique records only 87 | * `--skip_training` Skip labeling examples by user and read training from 88 | training_file only (default: False) 89 | * `--training_file TRAINING_FILE` 90 | Path to a new or existing file consisting of labeled 91 | training examples (default: training.json) 92 | * `--sample_size SAMPLE_SIZE` 93 | Number of random sample pairs to train off of 94 | (default: 1500) 95 | * `--recall_weight RECALL_WEIGHT` 96 | Threshold that will maximize a weighted average of our 97 | precision and recall (default: 1) 98 | * `-d`, `--delimiter` 99 | Delimiting character of the input CSV file (default: ,) 100 | * `-h`, `--help` show help message and exit 101 | 102 | 103 | --- 104 | ### csvlink 105 | `csvlink` takes two CSV files and finds matches between them. 
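To make "finding matches between two files" concrete, here is a minimal sketch, not csvlink's actual implementation. It links rows whose chosen fields are identical after light normalization, whereas csvlink relies on a model trained by the dedupe library. The names `normalize` and `link_exact` are hypothetical and exist only for this illustration; the `inner_join` parameter is meant to mirror the behavior of the `--inner_join` flag.

```python
import csv
from io import StringIO


def normalize(value):
    """Lowercase and collapse runs of whitespace (illustrative cleanup only)."""
    return " ".join(value.lower().split())


def link_exact(csv_1, csv_2, fields, inner_join=False):
    """Pair rows from two CSV strings whose `fields` agree after normalization.

    Returns a list of (row_1, row_2) tuples. Unless inner_join is set,
    unmatched rows are kept with None on the missing side, in the spirit
    of an outer join.
    """
    rows_1 = list(csv.DictReader(StringIO(csv_1)))
    rows_2 = list(csv.DictReader(StringIO(csv_2)))

    def key(row):
        return tuple(normalize(row[field]) for field in fields)

    # Index the second file by its normalized key for quick lookup.
    index_2 = {}
    for j, row in enumerate(rows_2):
        index_2.setdefault(key(row), []).append(j)

    matched, seen_1, seen_2 = [], set(), set()
    for i, row in enumerate(rows_1):
        for j in index_2.get(key(row), []):
            matched.append((row, rows_2[j]))
            seen_1.add(i)
            seen_2.add(j)

    if inner_join:
        return matched

    # Append the rows that found no partner, first file first.
    unmatched_1 = [(row, None) for i, row in enumerate(rows_1) if i not in seen_1]
    unmatched_2 = [(None, row) for j, row in enumerate(rows_2) if j not in seen_2]
    return matched + unmatched_1 + unmatched_2
```

Calling `link_exact(text_1, text_2, ["name", "city"])` on the contents of two files returns matched pairs followed by the leftovers from each file; passing `inner_join=True` returns the matched pairs only. Exact matching misses near-duplicates such as "3801 s. wabash" versus "3801 s wabash ave", which is the gap the trained model is there to close.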
106 | 107 | Provide an input file, field names, and output file: 108 | ```bash 109 | csvlink examples/restaurant-1.csv examples/restaurant-2.csv \ 110 | --field_names name address city cuisine \ 111 | --output_file output.csv 112 | ``` 113 | 114 | __or__ 115 | 116 | Line up different field names from each file: 117 | ```bash 118 | csvlink examples/restaurant-1.csv examples/restaurant-2.csv \ 119 | --field_names_1 name address city cuisine \ 120 | --field_names_2 restaurant street city type \ 121 | --output_file output.csv 122 | ``` 123 | 124 | __or__ 125 | 126 | Pipe the output to STDOUT: 127 | ```bash 128 | csvlink examples/restaurant-1.csv examples/restaurant-2.csv \ 129 | --field_names name address city cuisine \ 130 | > output.csv 131 | ``` 132 | 133 | __or__ 134 | 135 | Define everything in a config file: 136 | ```bash 137 | csvlink examples/restaurant-1.csv examples/restaurant-2.csv \ 138 | --config_file=config.json 139 | ``` 140 | 141 | **Your config file may look like this:** 142 | 143 | ```json 144 | { 145 | "field_names_1": ["name", "address", "city", "cuisine"], 146 | "field_names_2": ["restaurant", "street", "city", "type"], 147 | "field_definition" : [{"field" : "name", "type" : "String"}, 148 | {"field" : "address", "type" : "String"}, 149 | {"field" : "city", "type" : "String", 150 | "Has Missing" : true}, 151 | {"field" : "cuisine", "type" : "String", 152 | "Has Missing" : true}], 153 | "output_file": "examples/output.csv", 154 | "skip_training": false, 155 | "training_file": "training.json", 156 | "sample_size": 150000, 157 | "recall_weight": 2 158 | } 159 | ``` 160 | 161 | #### To use `csvlink` you absolutely need: 162 | 163 | * `input` two CSV file names to join together 164 | 165 | Either 166 | * `--config_file` Path to configuration file. 
167 | 168 | Or 169 | * `--field_names_1` List of column names in first file for dedupe to pay attention to 170 | * `--field_names_2` List of column names in second file for dedupe to pay attention to 171 | 172 | #### You may also need: 173 | 174 | * `--output_file OUTPUT_FILE` 175 | CSV file to store deduplication results (default: 176 | None) 177 | * `--inner_join` Only return matches between datasets 178 | * `--skip_training` Skip labeling examples by user and read training from 179 | training_file only (default: False) 180 | * `--training_file TRAINING_FILE` 181 | Path to a new or existing file consisting of labeled 182 | training examples (default: training.json) 183 | * `--sample_size SAMPLE_SIZE` 184 | Number of random sample pairs to train off of 185 | (default: 1500) 186 | * `--recall_weight RECALL_WEIGHT` 187 | Threshold that will maximize a weighted average of our 188 | precision and recall (default: 1) 189 | * `-d`, `--delimiter` 190 | Delimiting character of the input CSV file (default: ,) 191 | * `-h`, `--help` show help message and exit 192 | 193 | ## Training 194 | 195 | The _secret sauce_ of csvdedupe is human input. In order to figure out the best rules to deduplicate a set of data, you must give it a set of labeled examples to learn from. 196 | 197 | The more labeled examples you give it, the better the deduplication results will be. At minimum, you should try to provide __10 positive matches__ and __10 negative matches__. 198 | 199 | The results of your training will be saved in a JSON file ( __training.json__, unless specified otherwise with the `--training_file` option) for future runs of csvdedupe. 200 | 201 | Here's an example labeling operation: 202 | 203 | ```bash 204 | Phone : 2850617 205 | Address : 3801 s. wabash 206 | Zip : 207 | Site name : ada s. mckinley st. thomas cdc 208 | 209 | Phone : 2850617 210 | Address : 3801 s wabash ave 211 | Zip : 212 | Site name : ada s. mckinley community services - mckinley - st. 
thomas 213 | 214 | Do these records refer to the same thing? 215 | (y)es / (n)o / (u)nsure / (f)inished 216 | ``` 217 | 218 | ## Output 219 | `csvdedupe` attempts to identify all the rows in the CSV that refer to the same thing. Each group of 220 | such records is called a cluster. `csvdedupe` returns your input file with an additional column called `Cluster ID`, 221 | which holds the numeric id (zero-indexed) of the cluster a record belongs to. A record that csvdedupe believes 222 | doesn't belong to any cluster gets a cluster id of its own. 223 | 224 | `csvlink` operates in much the same way as `csvdedupe`, but will flatten both CSVs into one 225 | output file similar to a SQL [OUTER JOIN](http://stackoverflow.com/questions/38549/difference-between-inner-and-outer-join) statement. You can use the `--inner_join` flag to exclude rows that don't match across the two input files, much like an INNER JOIN. 226 | 227 | 228 | ## Preprocessing 229 | csvdedupe attempts to convert all strings to ASCII, and ignores case, newlines, and padding whitespace. This is all 230 | probably uncontroversial except the conversion to ASCII. Basically, we had to choose between two ways of handling 231 | extended characters. 232 | 233 | ``` 234 | distance("Tomas", "Tomás") = distance("Tomas", "Tomas") 235 | ``` 236 | 237 | __or__ 238 | 239 | ``` 240 | distance("Tomas", "Tomás") = distance("Tomas", "Tomzs") 241 | ``` 242 | 243 | We chose the first option. While it is possible to do something more sophisticated, this option seems to work pretty well 244 | for Latin alphabet languages. 245 | 246 | ## Testing 247 | 248 | Unit tests of core csvdedupe functions 249 | ```bash 250 | pip install -r requirements-test.txt 251 | nosetests 252 | ``` 253 | 254 | ## Community 255 | * [Dedupe Google group](https://groups.google.com/forum/?fromgroups=#!forum/open-source-deduplication) 256 | * IRC channel, #dedupe on irc.freenode.net 257 | 258 | ## Recipes 259 | 260 | ### Combining and deduplicating files from different sources. 
261 | 262 | Let's say we have a few sources of early childhood programs in Chicago and we'd like to get a canonical list. 263 | Let's do it with `csvdedupe`, `csvkit`, and some other common command line tools. 264 | 265 | #### Alignment and stacking 266 | Our first task will be to align the files so that the same data sits in the same columns for stacking. 267 | 268 | First, let's look at the headers of the files. 269 | 270 | File 1 271 | ```console 272 | > head -1 CPS_Early_Childhood_Portal_Scrape.csv 273 | "Site name","Address","Phone","Program Name","Length of Day" 274 | ``` 275 | 276 | File 2 277 | ```console 278 | > head -1 IDHS_child_care_provider_list.csv 279 | "Site name","Address","Zip Code","Phone","Fax","IDHS Provider ID" 280 | ``` 281 | 282 | So, we'll have to add "Zip Code", "Fax", and "IDHS Provider ID" 283 | to ```CPS_Early_Childhood_Portal_Scrape.csv```, and we'll have to add "Program Name" and 284 | "Length of Day" to ```IDHS_child_care_provider_list.csv```. 285 | 286 | ```console 287 | > cd examples 288 | > sed '1 s/$/,"Zip Code","Fax","IDHS Provider ID"/' CPS_Early_Childhood_Portal_Scrape.csv > input_1a.csv 289 | > sed '2,$s/$/,,,/' input_1a.csv > input_1b.csv 290 | ``` 291 | 292 | ```console 293 | > sed '1 s/$/,"Program Name","Length of Day"/' IDHS_child_care_provider_list.csv > input_2a.csv 294 | > sed '2,$s/$/,,/' input_2a.csv > input_2b.csv 295 | ``` 296 | 297 | Now, we reorder the columns in the second file to align to the first. 298 | 299 | ```console 300 | > csvcut -c "Site name","Address","Phone","Program Name","Length of Day","Zip Code","Fax","IDHS Provider ID" \ 301 | input_2b.csv > input_2c.csv 302 | ``` 303 | 304 | And we are finally ready to stack. 305 | 306 | ```console 307 | > csvstack -g CPS_Early_Childhood_Portal_Scrape.csv,IDHS_child_care_provider_list.csv \ 308 | -n source \ 309 | input_1b.csv input_2c.csv > input.csv 310 | ``` 311 | 312 | #### Dedupe it! 
313 | And now we can dedupe 314 | 315 | ```console 316 | > cat input.csv | csvdedupe --field_names "Site name" Address "Zip Code" Phone > output.csv 317 | ``` 318 | 319 | Let's sort the output by duplicate IDs, and we are ready to open it in your favorite spreadsheet program. 320 | 321 | ```console 322 | > csvsort -c "Cluster ID" output.csv > sorted.csv 323 | ``` 324 | 325 | ## Errors and Bugs 326 | 327 | If something is not behaving intuitively, it is a bug, and should be reported. 328 | Report it [here](https://github.com/dedupeio/csvdedupe/issues). 329 | 330 | ## Patches and Pull Requests 331 | We welcome your ideas! You can make suggestions in the form of [github issues](https://github.com/dedupeio/csvdedupe/issues) (bug reports, feature requests, general questions), or you can submit a code contribution via a pull request. 332 | 333 | How to contribute code: 334 | 335 | - Fork the project. 336 | - Make your feature addition or bug fix. 337 | - Send us a pull request with a description of your work! Don't worry if it isn't perfect - think of a PR as a start of a conversation, rather than a finished product. 338 | 339 | ## Copyright and Attribution 340 | 341 | Copyright (c) 2016 DataMade. Released under the [MIT License](https://github.com/dedupeio/csvdedupe/blob/master/LICENSE.md). 342 | -------------------------------------------------------------------------------- /csvdedupe/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dedupeio/csvdedupe/ec1a16303b39a0989f8face96eb5e046f03d168d/csvdedupe/__init__.py -------------------------------------------------------------------------------- /csvdedupe/csvdedupe.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | import future 3 | import os 4 | import codecs 5 | import sys 6 | import locale 7 | import logging 8 | from io import StringIO, open 9 | from . 
import csvhelpers 10 | import dedupe 11 | 12 | import itertools 13 | 14 | 15 | class CSVDedupe(csvhelpers.CSVCommand) : 16 | def __init__(self): 17 | super(CSVDedupe, self).__init__() 18 | 19 | # set defaults 20 | try: 21 | # take in STDIN input or open the file 22 | if hasattr(self.configuration['input'], 'read'): 23 | if not sys.stdin.isatty(): 24 | self.input = self.configuration['input'].read() 25 | # We need to get control of STDIN again. 26 | # This is a UNIX/Mac OSX solution only 27 | # http://stackoverflow.com/questions/7141331/pipe-input-to-python-program-and-later-get-input-from-user 28 | # 29 | # Same question has a Windows solution 30 | sys.stdin = open('/dev/tty') # Unix only solution, 31 | else: 32 | raise self.parser.error("No input file or STDIN specified.") 33 | else: 34 | try: 35 | self.input = open(self.configuration['input'], encoding='utf-8').read() 36 | except IOError: 37 | raise self.parser.error("Could not find the file %s" % 38 | (self.configuration['input'], )) 39 | except KeyError: 40 | raise self.parser.error("No input file or STDIN specified.") 41 | 42 | if self.field_definition is None : 43 | try: 44 | self.field_names = self.configuration['field_names'] 45 | self.field_definition = [{'field': field, 46 | 'type': 'String'} 47 | for field in self.field_names] 48 | except KeyError: 49 | raise self.parser.error("You must provide field_names") 50 | else : 51 | self.field_names = [field_def['field'] 52 | for field_def in self.field_definition] 53 | 54 | self.destructive = self.configuration.get('destructive', False) 55 | 56 | def add_args(self) : 57 | # positional arguments 58 | self.parser.add_argument('input', nargs='?', default=sys.stdin, 59 | help='The CSV file to operate on. 
If omitted, will accept input on STDIN.') 60 | self.parser.add_argument('--destructive', action='store_true', 61 | help='Output file will contain unique records only') 62 | 63 | 64 | def main(self): 65 | 66 | data_d = {} 67 | # import the specified CSV file 68 | 69 | data_d = csvhelpers.readData(self.input, self.field_names, delimiter=self.delimiter) 70 | 71 | logging.info('imported %d rows', len(data_d)) 72 | 73 | # sanity check for provided field names in CSV file 74 | for field in self.field_definition: 75 | if field['type'] != 'Interaction': 76 | if not field['field'] in data_d[0]: 77 | 78 | raise self.parser.error("Could not find field '" + 79 | field['field'] + "' in input") 80 | 81 | logging.info('using fields: %s' % [field['field'] 82 | for field in self.field_definition]) 83 | 84 | # If --skip_training has been selected, and we have a settings cache still 85 | # persisting from the last run, use it in this next run. 86 | # __Note:__ if you want to add more training data, don't use skip training 87 | if self.skip_training and os.path.exists(self.settings_file): 88 | 89 | # Load our deduper from the last training session cache. 90 | logging.info('reading from previous training cache %s' 91 | % self.settings_file) 92 | with open(self.settings_file, 'rb') as f: 93 | deduper = dedupe.StaticDedupe(f) 94 | 95 | fields = {variable.field for variable in deduper.data_model.primary_fields} 96 | unique_d, parents = exact_matches(data_d, fields) 97 | 98 | else: 99 | # # Create a new deduper object and pass our data model to it. 
100 | deduper = dedupe.Dedupe(self.field_definition) 101 | 102 | fields = {variable.field for variable in deduper.data_model.primary_fields} 103 | unique_d, parents = exact_matches(data_d, fields) 104 | 105 | # Set up our data sample 106 | logging.info('taking a sample of %d possible pairs', self.sample_size) 107 | deduper.sample(unique_d, self.sample_size) 108 | 109 | # Perform standard training procedures 110 | self.dedupe_training(deduper) 111 | 112 | # ## Blocking 113 | 114 | logging.info('blocking...') 115 | 116 | # ## Clustering 117 | 118 | # Find the threshold that will maximize a weighted average of our precision and recall. 119 | # When we set the recall weight to 2, we are saying we care twice as much 120 | # about recall as we do precision. 121 | # 122 | # If we had more data, we would not pass in all the blocked data into 123 | # this function but a representative sample. 124 | 125 | logging.info('finding a good threshold with a recall_weight of %s' % 126 | self.recall_weight) 127 | threshold = deduper.threshold(unique_d, recall_weight=self.recall_weight) 128 | 129 | # `duplicateClusters` will return sets of record IDs that dedupe 130 | # believes are all referring to the same entity. 
131 | 132 | logging.info('clustering...') 133 | clustered_dupes = deduper.match(unique_d, threshold) 134 | 135 | expanded_clustered_dupes = [] 136 | for cluster, scores in clustered_dupes: 137 | new_cluster = list(cluster) 138 | new_scores = list(scores) 139 | for row_id, score in zip(cluster, scores): 140 | children = parents.get(row_id, []) 141 | new_cluster.extend(children) 142 | new_scores.extend([score] * len(children)) 143 | expanded_clustered_dupes.append((new_cluster, new_scores)) 144 | 145 | clustered_dupes = expanded_clustered_dupes 146 | 147 | logging.info('# duplicate sets %s' % len(clustered_dupes)) 148 | 149 | write_function = csvhelpers.writeResults 150 | # write out our results 151 | if self.destructive: 152 | write_function = csvhelpers.writeUniqueResults 153 | 154 | if self.output_file: 155 | with open(self.output_file, 'w', encoding='utf-8') as output_file: 156 | write_function(clustered_dupes, self.input, output_file) 157 | else: 158 | if sys.version < '3' : 159 | out = codecs.getwriter(locale.getpreferredencoding())(sys.stdout) 160 | write_function(clustered_dupes, self.input, out) 161 | else : 162 | write_function(clustered_dupes, self.input, sys.stdout) 163 | 164 | def exact_matches(data_d, match_fields): 165 | unique = {} 166 | redundant = {} 167 | for key, record in data_d.items(): 168 | record_hash = hash(tuple(record[f] for f in match_fields)) 169 | if record_hash not in redundant: 170 | unique[key] = record 171 | redundant[record_hash] = (key, []) 172 | else: 173 | redundant[record_hash][1].append(key) 174 | 175 | return unique, {k : v for k, v in redundant.values()} 176 | 177 | 178 | def launch_new_instance(): 179 | d = CSVDedupe() 180 | d.main() 181 | 182 | 183 | if __name__ == "__main__": 184 | launch_new_instance() 185 | -------------------------------------------------------------------------------- /csvdedupe/csvhelpers.py: -------------------------------------------------------------------------------- 1 | import future 2 | from 
future.builtins import next 3 | 4 | import os 5 | import re 6 | import collections 7 | import logging 8 | from io import StringIO, open 9 | import sys 10 | import platform 11 | if sys.version < '3' : 12 | from backports import csv 13 | else : 14 | import csv 15 | 16 | if platform.system() != 'Windows' : 17 | from signal import signal, SIGPIPE, SIG_DFL 18 | signal(SIGPIPE, SIG_DFL) 19 | 20 | import dedupe 21 | import json 22 | import argparse 23 | 24 | def preProcess(column): 25 | """ 26 | Do a little bit of data cleaning. Things like casing, extra spaces, 27 | quotes and new lines are ignored. 28 | """ 29 | column = re.sub(' +', ' ', column) 30 | column = re.sub('\n', ' ', column) 31 | column = column.strip().strip('"').strip("'").lower().strip() 32 | if column == '' : 33 | column = None 34 | return column 35 | 36 | 37 | def readData(input_file, field_names, delimiter=',', prefix=None): 38 | """ 39 | Read in our data from a CSV file and create a dictionary of records, 40 | where the key is a unique record ID and each value is a dict 41 | of the row fields. 42 | 43 | **Currently, dedupe depends upon records' unique ids being integers 44 | with no integers skipped. The smallest valued unique id must be 0 or 45 | 1. Expect this requirement will likely be relaxed in the future.** 46 | """ 47 | 48 | data = {} 49 | 50 | reader = csv.DictReader(StringIO(input_file),delimiter=delimiter) 51 | for i, row in enumerate(reader): 52 | clean_row = {k: preProcess(v) for (k, v) in row.items() if k is not None} 53 | if prefix: 54 | row_id = u"%s|%s" % (prefix, i) 55 | else: 56 | row_id = i 57 | data[row_id] = clean_row 58 | 59 | return data 60 | 61 | 62 | # ## Writing results 63 | def writeResults(clustered_dupes, input_file, output_file): 64 | 65 | # Write our original data back out to a CSV with a new column called 66 | # 'Cluster ID' which indicates which records refer to each other. 
67 | 68 | logging.info('saving results to: %s' % output_file) 69 | 70 | cluster_membership = {} 71 | for cluster_id, (cluster, score) in enumerate(clustered_dupes): 72 | for record_id in cluster: 73 | cluster_membership[record_id] = cluster_id 74 | 75 | unique_record_id = cluster_id + 1 76 | 77 | writer = csv.writer(output_file) 78 | 79 | reader = csv.reader(StringIO(input_file)) 80 | 81 | heading_row = next(reader) 82 | heading_row.insert(0, u'Cluster ID') 83 | writer.writerow(heading_row) 84 | 85 | for row_id, row in enumerate(reader): 86 | if row_id in cluster_membership: 87 | cluster_id = cluster_membership[row_id] 88 | else: 89 | cluster_id = unique_record_id 90 | unique_record_id += 1 91 | row.insert(0, cluster_id) 92 | writer.writerow(row) 93 | 94 | 95 | # ## Writing results 96 | def writeUniqueResults(clustered_dupes, input_file, output_file): 97 | 98 | # Write our original data back out to a CSV with a new column called 99 | # 'Cluster ID' which indicates which records refer to each other. 
100 | 101 | logging.info('saving unique results to: %s' % output_file) 102 | 103 | cluster_membership = {} 104 | for cluster_id, (cluster, score) in enumerate(clustered_dupes): 105 | for record_id in cluster: 106 | cluster_membership[record_id] = cluster_id 107 | 108 | unique_record_id = cluster_id + 1 109 | 110 | writer = csv.writer(output_file) 111 | 112 | reader = csv.reader(StringIO(input_file)) 113 | 114 | heading_row = next(reader) 115 | heading_row.insert(0, u'Cluster ID') 116 | writer.writerow(heading_row) 117 | 118 | seen_clusters = set() 119 | for row_id, row in enumerate(reader): 120 | if row_id in cluster_membership: 121 | cluster_id = cluster_membership[row_id] 122 | if cluster_id not in seen_clusters: 123 | row.insert(0, cluster_id) 124 | writer.writerow(row) 125 | seen_clusters.add(cluster_id) 126 | else: 127 | cluster_id = unique_record_id 128 | unique_record_id += 1 129 | row.insert(0, cluster_id) 130 | writer.writerow(row) 131 | 132 | 133 | def writeLinkedResults(clustered_pairs, input_1, input_2, output_file, 134 | inner_join=False): 135 | logging.info('saving unique results to: %s' % output_file) 136 | 137 | matched_records = [] 138 | seen_1 = set() 139 | seen_2 = set() 140 | 141 | input_1 = [row for row in csv.reader(StringIO(input_1))] 142 | row_header = input_1.pop(0) 143 | length_1 = len(row_header) 144 | 145 | input_2 = [row for row in csv.reader(StringIO(input_2))] 146 | row_header_2 = input_2.pop(0) 147 | length_2 = len(row_header_2) 148 | row_header += row_header_2 149 | 150 | for pair in clustered_pairs: 151 | index_1, index_2 = [int(index.split('|', 1)[1]) for index in pair[0]] 152 | 153 | matched_records.append(input_1[index_1] + input_2[index_2]) 154 | seen_1.add(index_1) 155 | seen_2.add(index_2) 156 | 157 | writer = csv.writer(output_file) 158 | writer.writerow(row_header) 159 | 160 | for matches in matched_records: 161 | writer.writerow(matches) 162 | 163 | if not inner_join: 164 | 165 | for i, row in enumerate(input_1): 166 | if 
i not in seen_1: 167 | writer.writerow(row + [None] * length_2) 168 | 169 | for i, row in enumerate(input_2): 170 | if i not in seen_2: 171 | writer.writerow([None] * length_1 + row) 172 | 173 | class CSVCommand(object) : 174 | def __init__(self) : 175 | self.parser = argparse.ArgumentParser( 176 | formatter_class=argparse.ArgumentDefaultsHelpFormatter) 177 | 178 | self._common_args() 179 | self.add_args() 180 | 181 | self.args = self.parser.parse_known_args()[0] 182 | 183 | self.configuration = {} 184 | 185 | if self.args.config_file: 186 | #read from configuration file 187 | try: 188 | with open(self.args.config_file, 'r') as f: 189 | config = json.load(f) 190 | self.configuration.update(config) 191 | except IOError: 192 | raise self.parser.error( 193 | "Could not find config file %s. Did you name it correctly?" 194 | % self.args.config_file) 195 | 196 | # override if provided from the command line 197 | args_d = vars(self.args) 198 | args_d = dict((k, v) for (k, v) in args_d.items() if v is not None) 199 | self.configuration.update(args_d) 200 | 201 | self.output_file = self.configuration.get('output_file', None) 202 | self.skip_training = self.configuration.get('skip_training', False) 203 | self.training_file = self.configuration.get('training_file', 204 | 'training.json') 205 | self.settings_file = self.configuration.get('settings_file', 206 | 'learned_settings') 207 | self.sample_size = self.configuration.get('sample_size', 1500) 208 | self.recall_weight = self.configuration.get('recall_weight', 1) 209 | 210 | self.delimiter = self.configuration.get('delimiter',',') 211 | 212 | # backports for python version below 3 uses unicode delimiters 213 | if sys.version < '3': 214 | self.delimiter = unicode(self.delimiter) 215 | 216 | if 'field_definition' in self.configuration: 217 | self.field_definition = self.configuration['field_definition'] 218 | else : 219 | self.field_definition = None 220 | 221 | if self.skip_training and not 
os.path.exists(self.training_file): 222 | raise self.parser.error( 223 | "You need to provide an existing training_file or run this script without --skip_training") 224 | 225 | def _common_args(self) : 226 | # optional arguments 227 | self.parser.add_argument('--config_file', type=str, 228 | help='Path to configuration file. Must provide either a config_file or input and field_names.') 229 | self.parser.add_argument('--field_names', type=str, nargs="+", 230 | help='List of column names for dedupe to pay attention to') 231 | self.parser.add_argument('--output_file', type=str, 232 | help='CSV file to store deduplication results') 233 | self.parser.add_argument('--skip_training', action='store_true', 234 | help='Skip labeling examples by user and read training from training_file only') 235 | self.parser.add_argument('--training_file', type=str, 236 | help='Path to a new or existing file consisting of labeled training examples') 237 | self.parser.add_argument('--settings_file', type=str, 238 | help='Path to a new or existing file consisting of learned training settings') 239 | self.parser.add_argument('--sample_size', type=int, 240 | help='Number of random sample pairs to train off of') 241 | self.parser.add_argument('--recall_weight', type=int, 242 | help='Threshold that will maximize a weighted average of our precision and recall') 243 | self.parser.add_argument('-d', '--delimiter', type=str, 244 | help='Delimiting character of the input CSV file', default=',') 245 | self.parser.add_argument('-v', '--verbose', action='count', default=0) 246 | 247 | 248 | # If we have training data saved from a previous run of dedupe, 249 | # look for it and load it in. 
250 | def dedupe_training(self, deduper) : 251 | 252 | # __Note:__ if you want to train from scratch, delete the training_file 253 | if os.path.exists(self.training_file): 254 | logging.info('reading labeled examples from %s' % 255 | self.training_file) 256 | with open(self.training_file) as tf: 257 | deduper.readTraining(tf) 258 | 259 | if not self.skip_training: 260 | logging.info('starting active labeling...') 261 | 262 | dedupe.consoleLabel(deduper) 263 | 264 | # When finished, save our training away to disk 265 | logging.info('saving training data to %s' % self.training_file) 266 | if sys.version < '3' : 267 | with open(self.training_file, 'wb') as tf: 268 | deduper.writeTraining(tf) 269 | else : 270 | with open(self.training_file, 'w') as tf: 271 | deduper.writeTraining(tf) 272 | else: 273 | logging.info('skipping the training step') 274 | 275 | deduper.train() 276 | 277 | # After training settings have been established make a cache file for reuse 278 | logging.info('caching training result set to file %s' % self.settings_file) 279 | with open(self.settings_file, 'wb') as sf: 280 | deduper.writeSettings(sf) 281 | -------------------------------------------------------------------------------- /csvdedupe/csvlink.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | import future 3 | 4 | import logging 5 | import os 6 | import sys 7 | import json 8 | from io import StringIO, open 9 | 10 | from . 
import csvhelpers 11 | import dedupe 12 | 13 | import itertools 14 | 15 | class CSVLink(csvhelpers.CSVCommand): 16 | def __init__(self): 17 | super(CSVLink, self).__init__() 18 | 19 | if len(self.configuration['input']) == 2: 20 | try: 21 | self.input_1 = open(self.configuration['input'][0], encoding='utf-8').read() 22 | except IOError: 23 | raise self.parser.error("Could not find the file %s" % 24 | (self.configuration['input'][0], )) 25 | 26 | try: 27 | self.input_2 = open(self.configuration['input'][1], encoding='utf-8').read() 28 | except IOError: 29 | raise self.parser.error("Could not find the file %s" % 30 | (self.configuration['input'][1], )) 31 | 32 | else: 33 | raise self.parser.error("You must provide two input files.") 34 | 35 | if 'field_names' in self.configuration: 36 | if 'field_names_1' in self.configuration or 'field_names_2' in self.configuration: 37 | raise self.parser.error( 38 | "You should only define field_names or individual dataset fields (field_names_1 and field_names_2") 39 | else: 40 | self.field_names_1 = self.configuration['field_names'] 41 | self.field_names_2 = self.configuration['field_names'] 42 | elif 'field_names_1' in self.configuration and 'field_names_2' in self.configuration: 43 | self.field_names_1 = self.configuration['field_names_1'] 44 | self.field_names_2 = self.configuration['field_names_2'] 45 | else: 46 | raise self.parser.error( 47 | "You must provide field_names of field_names_1 and field_names_2") 48 | 49 | self.inner_join = self.configuration.get('inner_join', False) 50 | 51 | if self.field_definition is None : 52 | self.field_definition = [{'field': field, 53 | 'type': 'String'} 54 | for field in self.field_names_1] 55 | 56 | def add_args(self) : 57 | # positional arguments 58 | self.parser.add_argument('input', nargs="+", type=str, 59 | help='The two CSV files to operate on.') 60 | self.parser.add_argument('--field_names_1', type=str, nargs="+", 61 | help='List of column names for first dataset') 62 | 
self.parser.add_argument('--field_names_2', type=str, nargs="+", 63 | help='List of column names for second dataset') 64 | self.parser.add_argument('--inner_join', action='store_true', 65 | help='Only return matches between datasets') 66 | 67 | def main(self): 68 | 69 | data_1 = {} 70 | data_2 = {} 71 | # import the specified CSV file 72 | 73 | data_1 = csvhelpers.readData(self.input_1, self.field_names_1, 74 | delimiter=self.delimiter, 75 | prefix='input_1') 76 | data_2 = csvhelpers.readData(self.input_2, self.field_names_2, 77 | delimiter=self.delimiter, 78 | prefix='input_2') 79 | 80 | # sanity check for provided field names in CSV file 81 | for field in self.field_names_1: 82 | if field not in list(data_1.values())[0]: 83 | raise self.parser.error( 84 | "Could not find field '" + field + "' in input") 85 | 86 | for field in self.field_names_2: 87 | if field not in list(data_2.values())[0]: 88 | raise self.parser.error( 89 | "Could not find field '" + field + "' in input") 90 | 91 | if self.field_names_1 != self.field_names_2: 92 | for record_id, record in data_2.items(): 93 | remapped_record = {} 94 | for new_field, old_field in zip(self.field_names_1, 95 | self.field_names_2): 96 | remapped_record[new_field] = record[old_field] 97 | data_2[record_id] = remapped_record 98 | 99 | logging.info('imported %d rows from file 1', len(data_1)) 100 | logging.info('imported %d rows from file 2', len(data_2)) 101 | 102 | logging.info('using fields: %s' % [field['field'] 103 | for field in self.field_definition]) 104 | 105 | # If --skip_training has been selected, and we have a settings cache still 106 | # persisting from the last run, use it in this next run. 107 | # __Note:__ if you want to add more training data, don't use skip training 108 | if self.skip_training and os.path.exists(self.settings_file): 109 | 110 | # Load our deduper from the last training session cache. 
111 | logging.info('reading from previous training cache %s' 112 | % self.settings_file) 113 | with open(self.settings_file, 'rb') as f: 114 | deduper = dedupe.StaticRecordLink(f) 115 | 116 | 117 | fields = {variable.field for variable in deduper.data_model.primary_fields} 118 | (nonexact_1, 119 | nonexact_2, 120 | exact_pairs) = exact_matches(data_1, data_2, fields) 121 | 122 | 123 | else: 124 | # # Create a new deduper object and pass our data model to it. 125 | deduper = dedupe.RecordLink(self.field_definition) 126 | 127 | fields = {variable.field for variable in deduper.data_model.primary_fields} 128 | (nonexact_1, 129 | nonexact_2, 130 | exact_pairs) = exact_matches(data_1, data_2, fields) 131 | 132 | # Set up our data sample 133 | logging.info('taking a sample of %d possible pairs', self.sample_size) 134 | deduper.sample(nonexact_1, nonexact_2, self.sample_size) 135 | 136 | # Perform standard training procedures 137 | self.dedupe_training(deduper) 138 | 139 | # ## Blocking 140 | 141 | logging.info('blocking...') 142 | 143 | # ## Clustering 144 | 145 | # Find the threshold that will maximize a weighted average of our precision and recall. 146 | # When we set the recall weight to 2, we are saying we care twice as much 147 | # about recall as we do precision. 148 | # 149 | # If we had more data, we would not pass in all the blocked data into 150 | # this function but a representative sample. 151 | 152 | logging.info('finding a good threshold with a recall_weight of %s' % 153 | self.recall_weight) 154 | threshold = deduper.threshold(data_1, data_2, 155 | recall_weight=self.recall_weight) 156 | 157 | # `duplicateClusters` will return sets of record IDs that dedupe 158 | # believes are all referring to the same entity. 
159 | 160 | logging.info('clustering...') 161 | clustered_dupes = deduper.match(data_1, data_2, threshold) 162 | 163 | clustered_dupes.extend(exact_pairs) 164 | 165 | logging.info('# duplicate sets %s' % len(clustered_dupes)) 166 | 167 | write_function = csvhelpers.writeLinkedResults 168 | # write out our results 169 | 170 | if self.output_file: 171 | if sys.version < '3' : 172 | with open(self.output_file, 'wb', encoding='utf-8') as output_file: 173 | write_function(clustered_dupes, self.input_1, self.input_2, 174 | output_file, self.inner_join) 175 | else : 176 | with open(self.output_file, 'w', encoding='utf-8') as output_file: 177 | write_function(clustered_dupes, self.input_1, self.input_2, 178 | output_file, self.inner_join) 179 | else: 180 | write_function(clustered_dupes, self.input_1, self.input_2, 181 | sys.stdout, self.inner_join) 182 | 183 | 184 | def exact_matches(data_1, data_2, match_fields): 185 | nonexact_1 = {} 186 | nonexact_2 = {} 187 | exact_pairs = [] 188 | redundant = {} 189 | 190 | for key, record in data_1.items(): 191 | record_hash = hash(tuple(record[f] for f in match_fields)) 192 | redundant[record_hash] = key 193 | 194 | for key_2, record in data_2.items(): 195 | record_hash = hash(tuple(record[f] for f in match_fields)) 196 | if record_hash in redundant: 197 | key_1 = redundant[record_hash] 198 | exact_pairs.append(((key_1, key_2), 1.0)) 199 | del redundant[record_hash] 200 | else: 201 | nonexact_2[key_2] = record 202 | 203 | for key_1 in redundant.values(): 204 | nonexact_1[key_1] = data_1[key_1] 205 | 206 | return nonexact_1, nonexact_2, exact_pairs 207 | 208 | def launch_new_instance(): 209 | d = CSVLink() 210 | d.main() 211 | 212 | if __name__ == "__main__": 213 | launch_new_instance() 214 | -------------------------------------------------------------------------------- /docs/conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # csvdedupe documentation 
build configuration file, created by 4 | # sphinx-quickstart on Thu Apr 17 17:34:25 2014. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | import sys 16 | import os 17 | 18 | # If extensions (or modules to document with autodoc) are in another directory, 19 | # add these directories to sys.path here. If the directory is relative to the 20 | # documentation root, use os.path.abspath to make it absolute, like shown here. 21 | #sys.path.insert(0, os.path.abspath('.')) 22 | 23 | # -- General configuration ------------------------------------------------ 24 | 25 | # If your documentation needs a minimal Sphinx version, state it here. 26 | #needs_sphinx = '1.0' 27 | 28 | # Add any Sphinx extension module names here, as strings. They can be 29 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 30 | # ones. 31 | extensions = [ 32 | 'sphinx.ext.mathjax', 33 | ] 34 | 35 | # Add any paths that contain templates here, relative to this directory. 36 | templates_path = ['_templates'] 37 | 38 | # The suffix of source filenames. 39 | source_suffix = '.rst' 40 | 41 | # The encoding of source files. 42 | #source_encoding = 'utf-8-sig' 43 | 44 | # The master toctree document. 45 | master_doc = 'index' 46 | 47 | # General information about the project. 48 | project = u'csvdedupe' 49 | copyright = u'2014, Forest Gregg and Derek Eder' 50 | 51 | # The version info for the project you're documenting, acts as replacement for 52 | # |version| and |release|, also used in various other places throughout the 53 | # built documents. 54 | # 55 | # The short X.Y version. 56 | version = '0.1' 57 | # The full version, including alpha/beta/rc tags. 
58 | release = '0.1' 59 | 60 | # The language for content autogenerated by Sphinx. Refer to documentation 61 | # for a list of supported languages. 62 | #language = None 63 | 64 | # There are two options for replacing |today|: either, you set today to some 65 | # non-false value, then it is used: 66 | #today = '' 67 | # Else, today_fmt is used as the format for a strftime call. 68 | #today_fmt = '%B %d, %Y' 69 | 70 | # List of patterns, relative to source directory, that match files and 71 | # directories to ignore when looking for source files. 72 | exclude_patterns = ['_build'] 73 | 74 | # The reST default role (used for this markup: `text`) to use for all 75 | # documents. 76 | #default_role = None 77 | 78 | # If true, '()' will be appended to :func: etc. cross-reference text. 79 | #add_function_parentheses = True 80 | 81 | # If true, the current module name will be prepended to all description 82 | # unit titles (such as .. function::). 83 | #add_module_names = True 84 | 85 | # If true, sectionauthor and moduleauthor directives will be shown in the 86 | # output. They are ignored by default. 87 | #show_authors = False 88 | 89 | # The name of the Pygments (syntax highlighting) style to use. 90 | pygments_style = 'sphinx' 91 | 92 | # A list of ignored prefixes for module index sorting. 93 | #modindex_common_prefix = [] 94 | 95 | # If true, keep warnings as "system message" paragraphs in the built documents. 96 | #keep_warnings = False 97 | 98 | 99 | # -- Options for HTML output ---------------------------------------------- 100 | 101 | # The theme to use for HTML and HTML Help pages. See the documentation for 102 | # a list of builtin themes. 103 | html_theme = 'default' 104 | 105 | # Theme options are theme-specific and customize the look and feel of a theme 106 | # further. For a list of options available for each theme, see the 107 | # documentation. 
108 | #html_theme_options = {} 109 | 110 | # Add any paths that contain custom themes here, relative to this directory. 111 | #html_theme_path = [] 112 | 113 | # The name for this set of Sphinx documents. If None, it defaults to 114 | # " v documentation". 115 | #html_title = None 116 | 117 | # A shorter title for the navigation bar. Default is the same as html_title. 118 | #html_short_title = None 119 | 120 | # The name of an image file (relative to this directory) to place at the top 121 | # of the sidebar. 122 | #html_logo = None 123 | 124 | # The name of an image file (within the static path) to use as favicon of the 125 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 126 | # pixels large. 127 | #html_favicon = None 128 | 129 | # Add any paths that contain custom static files (such as style sheets) here, 130 | # relative to this directory. They are copied after the builtin static files, 131 | # so a file named "default.css" will overwrite the builtin "default.css". 132 | html_static_path = ['_static'] 133 | 134 | # Add any extra paths that contain custom files (such as robots.txt or 135 | # .htaccess) here, relative to this directory. These files are copied 136 | # directly to the root of the documentation. 137 | #html_extra_path = [] 138 | 139 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 140 | # using the given strftime format. 141 | #html_last_updated_fmt = '%b %d, %Y' 142 | 143 | # If true, SmartyPants will be used to convert quotes and dashes to 144 | # typographically correct entities. 145 | #html_use_smartypants = True 146 | 147 | # Custom sidebar templates, maps document names to template names. 148 | #html_sidebars = {} 149 | 150 | # Additional templates that should be rendered to pages, maps page names to 151 | # template names. 152 | #html_additional_pages = {} 153 | 154 | # If false, no module index is generated. 
155 | #html_domain_indices = True 156 | 157 | # If false, no index is generated. 158 | #html_use_index = True 159 | 160 | # If true, the index is split into individual pages for each letter. 161 | #html_split_index = False 162 | 163 | # If true, links to the reST sources are added to the pages. 164 | #html_show_sourcelink = True 165 | 166 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 167 | #html_show_sphinx = True 168 | 169 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 170 | #html_show_copyright = True 171 | 172 | # If true, an OpenSearch description file will be output, and all pages will 173 | # contain a tag referring to it. The value of this option must be the 174 | # base URL from which the finished HTML is served. 175 | #html_use_opensearch = '' 176 | 177 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 178 | #html_file_suffix = None 179 | 180 | # Output file base name for HTML help builder. 181 | htmlhelp_basename = 'csvdedupedoc' 182 | 183 | 184 | # -- Options for LaTeX output --------------------------------------------- 185 | 186 | latex_elements = { 187 | # The paper size ('letterpaper' or 'a4paper'). 188 | #'papersize': 'letterpaper', 189 | 190 | # The font size ('10pt', '11pt' or '12pt'). 191 | #'pointsize': '10pt', 192 | 193 | # Additional stuff for the LaTeX preamble. 194 | #'preamble': '', 195 | } 196 | 197 | # Grouping the document tree into LaTeX files. List of tuples 198 | # (source start file, target name, title, 199 | # author, documentclass [howto, manual, or own class]). 200 | latex_documents = [ 201 | ('index', 'csvdedupe.tex', u'csvdedupe Documentation', 202 | u'Forest Gregg and Derek Eder', 'manual'), 203 | ] 204 | 205 | # The name of an image file (relative to this directory) to place at the top of 206 | # the title page. 207 | #latex_logo = None 208 | 209 | # For "manual" documents, if this is true, then toplevel headings are parts, 210 | # not chapters. 
211 | #latex_use_parts = False 212 | 213 | # If true, show page references after internal links. 214 | #latex_show_pagerefs = False 215 | 216 | # If true, show URL addresses after external links. 217 | #latex_show_urls = False 218 | 219 | # Documents to append as an appendix to all manuals. 220 | #latex_appendices = [] 221 | 222 | # If false, no module index is generated. 223 | #latex_domain_indices = True 224 | 225 | 226 | # -- Options for manual page output --------------------------------------- 227 | 228 | # One entry per manual page. List of tuples 229 | # (source start file, name, description, authors, manual section). 230 | man_pages = [ 231 | ('index', 'csvdedupe', u'csvdedupe Documentation', 232 | [u'Forest Gregg and Derek Eder'], 1) 233 | ] 234 | 235 | # If true, show URL addresses after external links. 236 | #man_show_urls = False 237 | 238 | 239 | # -- Options for Texinfo output ------------------------------------------- 240 | 241 | # Grouping the document tree into Texinfo files. List of tuples 242 | # (source start file, target name, title, author, 243 | # dir menu entry, description, category) 244 | texinfo_documents = [ 245 | ('index', 'csvdedupe', u'csvdedupe Documentation', 246 | u'Forest Gregg and Derek Eder', 'csvdedupe', 'One line description of project.', 247 | 'Miscellaneous'), 248 | ] 249 | 250 | # Documents to append as an appendix to all manuals. 251 | #texinfo_appendices = [] 252 | 253 | # If false, no module index is generated. 254 | #texinfo_domain_indices = True 255 | 256 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 257 | #texinfo_show_urls = 'footnote' 258 | 259 | # If true, do not generate a @detailmenu in the "Top" node's menu. 260 | #texinfo_no_detailmenu = False 261 | -------------------------------------------------------------------------------- /docs/index.rst: -------------------------------------------------------------------------------- 1 | csvdedupe 2 | ========= 3 | 4 | Contents: 5 | 6 | .. 
toctree:: 7 | :maxdepth: 2 8 | 9 | Command line tools for using the `dedupe python 10 | library `__ for deduplicating CSV 11 | files. 12 | 13 | ``csvdedupe`` takes a messy input file or STDIN pipe and identifies 14 | duplicates 15 | 16 | ``csvlink`` takes two CSV files and finds matches between them 17 | 18 | `Read more about csvdedupe on OpenNews 19 | Source `__ 20 | 21 | |Build Status| 22 | 23 | Installation and dependencies 24 | ----------------------------- 25 | 26 | csvdedupe requires `numpy `__, which can be 27 | complicated to install. If you are installing numpy for the first time, 28 | `follow these 29 | instructions `__. 30 | You'll need version 1.6 of numpy or higher. 31 | 32 | After numpy is set up, install the following: \* 33 | `fastcluster `__ \* 34 | `hcluster `__ \* 35 | `networkx `__ 36 | 37 | .. code:: bash 38 | 39 | git clone git@github.com:datamade/csvdedupe.git 40 | cd csvdedupe 41 | pip install "numpy>=1.6" 42 | pip install -r requirements.txt 43 | python setup.py install 44 | 45 | csvdedupe usage 46 | --------------- 47 | 48 | Take a messy input file or STDIN pipe and identify duplicates 49 | 50 | Provide an input file and field names 51 | ``bash csvdedupe examples/csv_example_messy_input.csv \ --field_names "Site name" Address Zip Phone \ --output_file output.csv`` 52 | 53 | **or** 54 | 55 | Pipe it, UNIX style 56 | ``bash cat examples/csv_example_messy_input.csv | csvdedupe --skip_training \ --field_names "Site name" Address Zip Phone > output.csv`` 57 | 58 | **or** 59 | 60 | Define everything in a config file 61 | ``bash csvdedupe examples/csv_example_messy_input.csv \ --config_file=config.json`` 62 | 63 | Example config file 64 | ~~~~~~~~~~~~~~~~~~~ 65 | 66 | .. 
code:: json 67 | 68 | { 69 | "field_names": ["Site name", "Address", "Zip", "Phone"], 70 | "field_definitions" : {"Site name" : {"type" : "String"}, 71 | "Address" : {"type" : "String"}, 72 | "Zip" : {"type" : "String", 73 | "Has Missing" : true}, 74 | "Phone" : {"type" : "String", 75 | "Has Missing" : true}}, 76 | "output_file": "examples/output.csv", 77 | "skip_training": false, 78 | "training_file": "training.json", 79 | "sample_size": 150000, 80 | "recall_weight": 2 81 | } 82 | 83 | Arguments: 84 | ~~~~~~~~~~ 85 | 86 | Required 87 | ^^^^^^^^ 88 | 89 | - ``input`` a CSV file name or piped CSV file to deduplicate 90 | 91 | Either \* ``--config_file`` Path to configuration file. 92 | 93 | Or \* ``--field_names`` List of column names for dedupe to pay attention 94 | to 95 | 96 | Optional 97 | ^^^^^^^^ 98 | 99 | - ``--output_file OUTPUT_FILE`` CSV file to store deduplication results 100 | (default: None) 101 | - ``--destructive`` Output file will contain unique records only 102 | - ``--skip_training`` Skip labeling examples by user and read training 103 | from training\_file only (default: False) 104 | - ``--training_file TRAINING_FILE`` Path to a new or existing file 105 | consisting of labeled training examples (default: training.json) 106 | - ``--sample_size SAMPLE_SIZE`` Number of random sample pairs to train 107 | off of (default: 150000) 108 | - ``--recall_weight RECALL_WEIGHT`` Threshold that will maximize a 109 | weighted average of our precision and recall (default: 2) 110 | - ``-h``, ``--help`` show help message and exit 111 | 112 | csvlink usage 113 | ------------- 114 | 115 | Take two CSV files and find matches between them 116 | 117 | Provide an input file and field names 118 | ``bash csvlink examples/restaurant-1.csv examples/restaurant-2.csv \ --field_names name address city cuisine \ --output_file output.csv`` 119 | 120 | Line up different field names from each file 121 | ``bash csvlink examples/restaurant-1.csv examples/restaurant-2.csv \ 
--field_names_1 name address city cuisine \ --field_names_2 restaurant street city type \ --output_file output.csv`` 122 | 123 | Pipe the output to STDOUT 124 | ``bash csvlink examples/restaurant-1.csv examples/restaurant-2.csv \ --field_names name address city cuisine \ > output.csv`` 125 | 126 | **or** 127 | 128 | Define everything in a config file 129 | ``bash csvlink examples/restaurant-1.csv examples/restaurant-2.csv \ --config_file=config.json`` 130 | 131 | Example config file 132 | ~~~~~~~~~~~~~~~~~~~ 133 | 134 | .. code:: json 135 | 136 | { 137 | "field_names_1": ["name", "address", "city", "cuisine"], 138 | "field_names_2": ["restaurant", "street", "city", "type"], 139 | "field_definitions" : {"name": {"type" : "String"}, 140 | "address": {"type" : "String"}, 141 | "city": {"type" : "String", 142 | "Has Missing" : true}, 143 | "cuisine": {"type" : "String", 144 | "Has Missing" : true}}, 145 | "output_file": "examples/output.csv", 146 | "skip_training": false, 147 | "training_file": "training.json", 148 | "sample_size": 150000, 149 | "recall_weight": 2 150 | } 151 | 152 | Arguments: 153 | ~~~~~~~~~~ 154 | 155 | Required 156 | ^^^^^^^^ 157 | 158 | - ``input`` two CSV file names to join together 159 | 160 | Either \* ``--config_file`` Path to configuration file. 
161 | 162 | Or \* ``--field_names_1`` List of column names in first file for dedupe 163 | to pay attention to \* ``--field_names_2`` List of column names in 164 | second file for dedupe to pay attention to 165 | 166 | Optional 167 | ^^^^^^^^ 168 | 169 | - ``--output_file OUTPUT_FILE`` CSV file to store deduplication results 170 | (default: None) 171 | - ``--inner_join`` Only return matches between datasets 172 | - ``--skip_training`` Skip labeling examples by user and read training 173 | from training\_file only (default: False) 174 | - ``--training_file TRAINING_FILE`` Path to a new or existing file 175 | consisting of labeled training examples (default: training.json) 176 | - ``--sample_size SAMPLE_SIZE`` Number of random sample pairs to train 177 | off of (default: 150000) 178 | - ``--recall_weight RECALL_WEIGHT`` Threshold that will maximize a 179 | weighted average of our precision and recall (default: 2) 180 | - ``-h``, ``--help`` show help message and exit 181 | 182 | Training 183 | -------- 184 | 185 | The *secret sauce* of csvdedupe is human input. In order to figure out 186 | the best rules to deduplicate a set of data, you must give it a set of 187 | labeled examples to learn from. 188 | 189 | The more labeled examples you give it, the better the deduplication 190 | results will be. At minimum, you should try to provide **10 positive 191 | matches** and **10 negative matches**. 192 | 193 | The results of your training will be saved in a JSON file ( 194 | **training.json**, unless specified otherwise with the 195 | ``--training_file`` option) for future runs of csvdedupe. 196 | 197 | Here's an example labeling operation: 198 | 199 | .. code:: bash 200 | 201 | Phone : 2850617 202 | Address : 3801 s. wabash 203 | Zip : 204 | Site name : ada s. mckinley st. thomas cdc 205 | 206 | Phone : 2850617 207 | Address : 3801 s wabash ave 208 | Zip : 209 | Site name : ada s. mckinley community services - mckinley - st. 
thomas 210 | 211 | Do these records refer to the same thing? 212 | (y)es / (n)o / (u)nsure / (f)inished 213 | 214 | Output 215 | ------ 216 | 217 | ``csvdedupe`` attempts to identify all the rows in the csv that refer to 218 | the same thing. Each group of such records is called a cluster. 219 | ``csvdedupe`` returns your input file with an additional column called 220 | ``Cluster ID``, which is either the numeric id (zero-indexed) of a 221 | cluster of grouped records or an ``x`` if csvdedupe believes the record 222 | doesn't belong to any cluster. 223 | 224 | ``csvlink`` operates in much the same way as ``csvdedupe``, but will 225 | flatten both CSVs into one output file similar to a SQL `OUTER 226 | JOIN `__ 227 | statement. You can use the ``--inner_join`` flag to exclude rows that 228 | don't match across the two input files, much like an INNER JOIN. 229 | 230 | Preprocessing 231 | ------------- 232 | 233 | csvdedupe attempts to convert all strings to ASCII and ignores case, new 234 | lines, and padding whitespace. This is all probably uncontroversial 235 | except the conversion to ASCII. Basically, we had to choose between two 236 | ways of handling extended characters. 237 | 238 | :: 239 | 240 | distance("Tomas", "Tomás") = distance("Tomas", "Tomas") 241 | 242 | **or** 243 | 244 | :: 245 | 246 | distance("Tomas", "Tomás") = distance("Tomas", "Tomzs") 247 | 248 | We chose the first option. While it is possible to do something more 249 | sophisticated, this option seems to work pretty well for Latin alphabet 250 | languages. 251 | 252 | Testing 253 | ------- 254 | 255 | Unit tests of core csvdedupe functions 256 | ``bash pip install -r requirements-test.txt nosetests`` 257 | 258 | Community 259 | --------- 260 | 261 | - `Dedupe Google 262 | group `__ 263 | - IRC channel, #dedupe on irc.freenode.net 264 | 265 | Recipes 266 | ------- 267 | 268 | Combining and deduplicating files from different sources. 
269 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 270 | 271 | Let's say we have a few sources of early childhood programs in Chicago 272 | and we'd like to get a canonical list. Let's do it with ``csvdedupe``, 273 | ``csvkit``, and some other common command line tools. 274 | 275 | Alignment and stacking 276 | ^^^^^^^^^^^^^^^^^^^^^^ 277 | 278 | Our first task will be to align the files and have the same data in the 279 | same columns for stacking. 280 | 281 | First let's look at the headers of the files 282 | 283 | File 1 284 | ``console > head -1 CPS_Early_Childhood_Portal_Scrape.csv "Site name","Address","Phone","Program Name","Length of Day"`` 285 | 286 | File 2 287 | ``console > head -1 IDHS_child_care_provider_list.csv "Site name","Address","Zip Code","Phone","Fax","IDHS Provider ID"`` 288 | 289 | So, we'll have to add "Zip Code", "Fax", and "IDHS Provider ID" to 290 | ``CPS_Early_Childhood_Portal_Scrape.csv``, and we'll have to add 291 | "Program Name", "Length of Day" to 292 | ``IDHS_child_care_provider_list.csv``. 293 | 294 | .. code:: console 295 | 296 | > cd examples 297 | > sed '1 s/$/,"Zip Code","Fax","IDHS Provider ID"/' CPS_Early_Childhood_Portal_Scrape.csv > input_1a.csv 298 | > sed '2,$s/$/,,,/' input_1a.csv > input_1b.csv 299 | 300 | .. code:: console 301 | 302 | > sed '1 s/$/,"Program Name","Length of Day"/' IDHS_child_care_provider_list.csv > input_2a.csv 303 | > sed '2,$s/$/,,/' input_2a.csv > input_2b.csv 304 | 305 | Now, we reorder the columns in the second file to align to the first. 306 | 307 | .. code:: console 308 | 309 | > csvcut -c "Site name","Address","Phone","Program Name","Length of Day","Zip Code","Fax","IDHS Provider ID" \ 310 | input_2b.csv > input_2c.csv 311 | 312 | And we are finally ready to stack. 313 | 314 | .. code:: console 315 | 316 | > csvstack -g CPS_Early_Childhood_Portal_Scrape.csv,IDHS_child_care_provider_list.csv \ 317 | -n source \ 318 | input_1b.csv input_2c.csv > input.csv 319 | 320 | Dedupe it! 
321 | ^^^^^^^^^^ 322 | 323 | And now we can dedupe 324 | 325 | .. code:: console 326 | 327 | > cat input.csv | csvdedupe --field_names "Site name" Address "Zip Code" Phone > output.csv 328 | 329 | Let's sort the output by duplicate IDs, and we are ready to open it in 330 | your favorite spreadsheet program. 331 | 332 | .. code:: console 333 | 334 | > csvsort -c "Cluster ID" output.csv > sorted.csv 335 | 336 | |githalytics.com alpha| 337 | 338 | .. |Build Status| image:: https://travis-ci.org/datamade/csvdedupe.png?branch=master 339 | :target: https://travis-ci.org/datamade/csvdedupe 340 | .. |githalytics.com alpha| image:: https://cruel-carlota.pagodabox.com/88cda639ab635a100d23de5948ffbef5 341 | :target: http://githalytics.com/datamade/csvdedupe 342 | -------------------------------------------------------------------------------- /examples/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dedupeio/csvdedupe/ec1a16303b39a0989f8face96eb5e046f03d168d/examples/.gitkeep -------------------------------------------------------------------------------- /examples/IDHS_child_care_provider_list.csv: -------------------------------------------------------------------------------- 1 | "Site name","Address","Zip Code","Phone","Fax","IDHS Provider ID" 2 | " Bloomington Day Care Center, Inc.","2708 East Lincoln Street, Bloomington",61704,6615600,"(309) 3 | 662-4202","8307-7891-7433-008" 4 | " Carole Robertson Center 5 | for Learning","2020 West Roosevelt Road",60608,2437300,"(312) 6 | 243-1087","6391-8286-5433-002" 7 | " Casa Central","1343 North California Avenue",60622,6452368,"(773) 8 | 645-1432","5672-5795-5433-005" 9 | " Chicago Urban Day 10 | School","1248 West 69th Street Chicago",60636,4833555,"(773) 11 | 483-0150","3398-4665-5433-003" 12 | " ChildServ","8765 West Higgins Road Suite 450",60631,8677308,"(773) 13 | 693-0322","2935-8342-5433-002" 14 | " Child Care Center 15 | of Evanston","1840 
Asbury Avenue Evanston",60201,8692680,"(847) 16 | 869-2687","0305-2980-5433-002" 17 | " Childcare Network 18 | of Evanston","1416 Lake Street, 19 | Evanston",60201,4752661,"(847) 20 | 475-2699","7767-4532-7633-003" 21 | " Children's Center 22 | of Tazewell County","210 North Thorncrest Drive Creve Coeur",61610,6996141,"(309) 23 | 699-5147","1281-8212-7433-003" 24 | " Children's Home & Aid Society of Illinois","125 South Wacker Drive 14th Floor",60606,4240200,"(312) 25 | 424-6200","4277-4401-5433-006" 26 | " Community Coordinated Child Care","155 North 3rd Street 27 | Suite 300 28 | DeKalb",60115,7588149,"(815) 758-5652","3156-6336-5433-009" 29 | " Community Mennonite","3215 West 162nd Street Markham",60428,3331232,"(708) 30 | 333-1248","9237-3517-0633-007" 31 | " East Moline Citizens for Comm. Center","489-27th Street, East Moline",61244,7555031,"(309) 32 | 755-5036","2987-4157-5433-004" 33 | " Educational 34 | Day Care Center","330 West Michigan Avenue Jacksonville",62650,2435720,"(217) 35 | 243-5385","6813-5632-7433-006" 36 | " Ezzard Charles","7946 South Ashland",60620,4870227,"(773) 37 | 487-0044","6525-2857-5433-002" 38 | " First Step 39 | Day¬¨‚ĆCare Center","1300 Pearl Street 40 | Belvidere",6100,5446560,"(815) 41 | 544-6560","6896-1416-5433-006" 42 | " Geneseo Development 43 | & Growth, Inc.","541 East North Street 44 | P.O. Box 172 45 | Geneseo",61254,9445024,"(309) 46 | 945-4103","2596-3557-5433-004" 47 | " Highland Park Comm. Nursery School 48 | & Day Care Center","1850 Green Bay Road 49 | Highland Park",60035,4323301,"(847) 50 | 432-3308","2592-0023-5433-000" 51 | " Human Development Corporation","142 East 154th Street 52 | Harvey",60426,3394449,"(708) 53 | 339-8513","1316-0598-5433-003" 54 | " Improved Child Care 55 | Mgt. Services, Inc.","520 North Halsted Street Suite 412 56 | Chicago",60642,7370231,"(773) 57 | 737-7009","7483-8223-6433-005" 58 | " Just Kids Child Care, Inc.","1800 West¬¨‚Ć1st Street 59 | ¬¨‚ĆP.O. 
Box 410 60 | Mlian",61264,7876303,"(309) 61 | 787-6375","1640-4249-5433-009" 62 | " Kiddie Kollege of Fairfield","2226 Mt. Vernon Road 63 | P.O. Box 362 64 | Fairfield",62837,8477102,"(618) 65 | 847-7212","3125-2563-7433-009" 66 | " Marillac Social Center","212 South Francisco Street",60612,5843232,"(773) 67 | 722-1469","0350-8270-5433-000" 68 | " Mary Crane League","2974 North Clybourn Avenue",60618,9388157,"(773) 69 | 325-2530","9356-4723-5433-005" 70 | " McDonough County Council for Child Development DBA Wee Care Macomb","425 North Prairie Avenue Macomb",61455,8335267,"(309) 71 | 837-5751","3316-0591-7433-001" 72 | " Northwest Suburban 73 | Day Care Center","1755 Howard Street 74 | DesPlaines",60018,2995103,"(847) 75 | 299-1070","4955-5565-5433-000" 76 | " Northwestern University Settlement","1400 West Augusta Blvd.",60642,2787471,"(773) 77 | 278-7536","5701-4911-5433-005" 78 | " Oak Park/River Forest 79 | Day Nursery","1139 Randolph Steet 80 | Oak Park",60302,3838211,"(708) 81 | 383-0692","9300-0813-5433-003" 82 | " One Hope United","P.O. Box 1128 83 | Lake Villa",60046,2456559,"(847) 84 | 245-6715","8581-4703-5433-004" 85 | " Ounce of Prevention Fund, Inc.","33 West Monroe Street, Suite 2400",60603,9223863,"(312) 86 | 922-3337","3985-7887-5433-003" 87 | " Paxton Day Care 88 | Center","200 North Elm Street Paxton",60957,3793865,"(217) 89 | 379-6205","7766-4691-7433-009" 90 | " Pillars","333 North LaGrange 91 | LaGrange Park",60957,9953500,,"7064-6764-9133-063" 92 | " Rockford Day Nursery","2323 South 6th Street Rockford",61104,9620834,"(815) 93 | 962-0838","8253-6253-5433-002" 94 | " Skip-A-Long 95 | Daycare Center, Inc.","4800-60th Street 96 | Moline",61265,7648110,"(309) 97 | 764-8281","7108-7395-5433-001" 98 | " St. 
Vincent DePaul 99 | Center -Halsted","2145 North Halsted Street",60614,9436776,"(312) 100 | 573-0646","7611-0607-4433-007" 101 | " Streator Child 102 | Development Center","405 Chicago Street Streator",61364,6724350,"(815) 103 | 672-4784","0858-7317-0633-006" 104 | " Thornton Township 105 | High School District 106 | 205 - Infant Care","465 East 170th Street South Holland",60473,2254118,"(708) 107 | 225-4088","1946-2627-6433-001" 108 | " Tri-Con Child Care Center, Inc.","425 Laurel Avenue 109 | Highland Park",60035,4331450,"(847) 110 | 433-1749","9795-2875-5433-009" 111 | " YWCA of Elgin","220 East Chicago Street 112 | Elgin",60120,7427930,"(847) 113 | 742-8217","3498-6732-5433-003" 114 | " YWCA of Kankakee","1086 East Court Street Kankakee",60901,9334516,"(815) 115 | 935-0015","3261-3053-5433-005" 116 | -------------------------------------------------------------------------------- /examples/config.json.example: -------------------------------------------------------------------------------- 1 | { 2 | "field_names": ["Site name","Address","Zip","Phone"], 3 | "field_definitions" : {"Site name" : {"type" : "String"}, 4 | "Address" : {"type" : "String"}, 5 | "Zip" : {"type" : "String", 6 | "Has Missing" : true}, 7 | "Phone" : {"type" : "String", 8 | "Has Missing" : true}}, 9 | "output_file": "output.csv", 10 | "skip_training": false, 11 | "training_file": "training.json", 12 | "sample_size": 150000, 13 | "recall_weight": 2 14 | } 15 | -------------------------------------------------------------------------------- /examples/restaurant-1.csv: -------------------------------------------------------------------------------- 1 | name,address,city,cuisine,unique_id 2 | "arnie morton's of chicago", "435 s. 
la cienega blvd.", "los angeles", "steakhouses", '0' 3 | "art's deli", "12224 ventura blvd.", "studio city", "delis", '1' 4 | "bel-air hotel", "701 stone canyon rd.", "bel air", "californian", '2' 5 | "cafe bizou", "14016 ventura blvd.", "sherman oaks", "french bistro", '3' 6 | "campanile", "624 s. la brea ave.", "los angeles", "californian", '4' 7 | "chinois on main", "2709 main st.", "santa monica", "pacific new wave", '5' 8 | "citrus", "6703 melrose ave.", "los angeles", "californian", '6' 9 | "fenix at the argyle", "8358 sunset blvd.", "w. hollywood", "french (new)", '7' 10 | "granita", "23725 w. malibu rd.", "malibu", "californian", '8' 11 | "grill the", "9560 dayton way", "beverly hills", "american (traditional)", '9' 12 | "katsu", "1972 hillhurst ave.", "los feliz", "japanese", '10' 13 | "l'orangerie", "903 n. la cienega blvd.", "w. hollywood", "french (classic)", '11' 14 | "le chardonnay (los angeles)", "8284 melrose ave.", "los angeles", "french bistro", '12' 15 | "locanda veneta", "8638 w. third st.", "los angeles", "italian", '13' 16 | "matsuhisa", "129 n. la cienega blvd.", "beverly hills", "seafood", '14' 17 | "palm the (los angeles)", "9001 santa monica blvd.", "w. hollywood", "steakhouses", '15' 18 | "patina", "5955 melrose ave.", "los angeles", "californian", '16' 19 | "philippe the original", "1001 n. alameda st.", "chinatown", "cafeterias", '17' 20 | "pinot bistro", "12969 ventura blvd.", "studio city", "french bistro", '18' 21 | "rex il ristorante", "617 s. olive st.", "los angeles", "nuova cucina italian", '19' 22 | "spago (los angeles)", "8795 sunset blvd.", "w. hollywood", "californian", '20' 23 | "valentino", "3115 pico blvd.", "santa monica", "italian", '21' 24 | "yujean kang's", "67 n. raymond ave.", "pasadena", "chinese", '22' 25 | "21 club", "21 w. 52nd st.", "new york city", "american (new)", '23' 26 | "aquavit", "13 w. 54th st.", "new york city", "scandinavian", '24' 27 | "aureole", "34 e. 
61st st.", "new york city", "american (new)", '25' 28 | "cafe lalo", "201 w. 83rd st.", "new york city", "coffeehouses", '26' 29 | "cafe des artistes", "1 w. 67th st.", "new york city", "french (classic)", '27' 30 | "carmine's", "2450 broadway", "new york city", "italian", '28' 31 | "carnegie deli", "854 seventh ave.", "new york city", "delis", '29' 32 | "chanterelle", "2 harrison st.", "new york city", "french (new)", '30' 33 | "daniel", "20 e. 76th st.", "new york city", "french (new)", '31' 34 | "dawat", "210 e. 58th st.", "new york city", "indian", '32' 35 | "felidia", "243 e. 58th st.", "new york city", "italian", '33' 36 | "four seasons", "99 e. 52nd st.", "new york city", "american (new)", '34' 37 | "gotham bar & grill", "12 e. 12th st.", "new york city", "american (new)", '35' 38 | "gramercy tavern", "42 e. 20th st.", "new york city", "american (new)", '36' 39 | "island spice", "402 w. 44th st.", "new york city", "caribbean", '37' 40 | "jo jo", "160 e. 64th st.", "new york city", "french bistro", '38' 41 | "la caravelle", "33 w. 55th st.", "new york city", "french (classic)", '39' 42 | "la cote basque", "60 w. 55th st.", "new york city", "french (classic)", '40' 43 | "le bernardin", "155 w. 51st st.", "new york city", "seafood", '41' 44 | "les celebrites", "155 w. 58th st.", "new york city", "french (classic)", '42' 45 | "lespinasse (new york city)", "2 e. 55th st.", "new york city", "asian", '43' 46 | "lutece", "249 e. 50th st.", "new york city", "french (classic)", '44' 47 | "manhattan ocean club", "57 w. 58th st.", "new york city", "seafood", '45' 48 | "march", "405 e. 58th st.", "new york city", "american (new)", '46' 49 | "mesa grill", "102 fifth ave.", "new york city", "southwestern", '47' 50 | "mi cocina", "57 jane st.", "new york city", "mexican", '48' 51 | "montrachet", "239 w. broadway", "new york city", "french bistro", '49' 52 | "oceana", "55 e. 54th st.", "new york city", "seafood", '50' 53 | "park avenue cafe (new york city)", "100 e. 
63rd st.", "new york city", "american (new)", '51' 54 | "petrossian", "182 w. 58th st.", "new york city", "russian", '52' 55 | "picholine", "35 w. 64th st.", "new york city", "mediterranean", '53' 56 | "pisces", "95 ave. a", "new york city", "seafood", '54' 57 | "rainbow room", "30 rockefeller plaza", "new york city", "american (new)", '55' 58 | "river cafe", "1 water st.", "brooklyn", "american (new)", '56' 59 | "san domenico", "240 central park s.", "new york city", "italian", '57' 60 | "second avenue deli", "156 second ave.", "new york city", "delis", '58' 61 | "seryna", "11 e. 53rd st.", "new york city", "japanese", '59' 62 | "shun lee palace", "155 e. 55th st.", "new york city", "chinese", '60' 63 | "sign of the dove", "1110 third ave.", "new york city", "american (new)", '61' 64 | "smith & wollensky", "797 third ave.", "new york city", "steakhouses", '62' 65 | "tavern on the green", "central park west", "new york city", "american (new)", '63' 66 | "uncle nick's", "747 ninth ave.", "new york city", "greek", '64' 67 | "union square cafe", "21 e. 16th st.", "new york city", "american (new)", '65' 68 | "virgil's real bbq", "152 w. 44th st.", "new york city", "bbq", '66' 69 | "chin's", "3200 las vegas blvd. s.", "las vegas", "chinese", '67' 70 | "coyote cafe (las vegas)", "3799 las vegas blvd. s.", "las vegas", "southwestern", '68' 71 | "le montrachet bistro", "3000 paradise rd.", "las vegas", "french bistro", '69' 72 | "palace court", "3570 las vegas blvd. s.", "las vegas", "french (new)", '70' 73 | "second street grill", "200 e. fremont st.", "las vegas", "pacific rim", '71' 74 | "steak house the", "2880 las vegas blvd. s.", "las vegas", "steakhouses", '72' 75 | "tillerman the", "2245 e. flamingo rd.", "las vegas", "steakhouses", '73' 76 | "abruzzi", "2355 peachtree rd. ne", "atlanta", "italian", '74' 77 | "bacchanalia", "3125 piedmont rd.", "atlanta", "californian", '75' 78 | "bone's restaurant", "3130 piedmont rd. 
ne", "atlanta", "steakhouses", '76' 79 | "brasserie le coze", "3393 peachtree rd.", "atlanta", "french bistro", '77' 80 | "buckhead diner", "3073 piedmont rd.", "atlanta", "american (new)", '78' 81 | "ciboulette restaurant", "1529 piedmont ave.", "atlanta", "french (new)", '79' 82 | "delectables", "1 margaret mitchell sq.", "atlanta", "cafeterias", '80' 83 | "georgia grille", "2290 peachtree rd.", "atlanta", "southwestern", '81' 84 | "hedgerose heights inn the", "490 e. paces ferry rd. ne", "atlanta", "continental", '82' 85 | "heera of india", "595 piedmont ave.", "atlanta", "indian", '83' 86 | "indigo coastal grill", "1397 n. highland ave.", "atlanta", "eclectic", '84' 87 | "la grotta", "2637 peachtree rd. ne", "atlanta", "italian", '85' 88 | "mary mac's tea room", "224 ponce de leon ave.", "atlanta", "southern/soul", '86' 89 | "nikolai's roof", "255 courtland st.", "atlanta", "continental", '87' 90 | "pano's & paul's", "1232 w. paces ferry rd.", "atlanta", "american (new)", '88' 91 | "ritz-carlton cafe (buckhead)", "3434 peachtree rd. ne", "atlanta", "american (new)", '89' 92 | "ritz-carlton dining room (buckhead)", "3434 peachtree rd. 
ne", "atlanta", "american (new)", '90' 93 | "ritz-carlton restaurant", "181 peachtree st.", "atlanta", "french (classic)", '91' 94 | "toulouse", "293-b peachtree rd.", "atlanta", "french (new)", '92' 95 | "veni vidi vici", "41 14th st.", "atlanta", "italian", '93' 96 | "alain rondelli", "126 clement st.", "san francisco", "french (new)", '94' 97 | "aqua", "252 california st.", "san francisco", "american (new)", '95' 98 | "boulevard", "1 mission st.", "san francisco", "american (new)", '96' 99 | "cafe claude", "7 claude ln.", "san francisco", "french bistro", '97' 100 | "campton place", "340 stockton st.", "san francisco", "american (new)", '98' 101 | "chez michel", "804 north point st.", "san francisco", "californian", '99' 102 | "fleur de lys", "777 sutter st.", "san francisco", "french (new)", '100' 103 | "fringale", "570 fourth st.", "san francisco", "french bistro", '101' 104 | "hawthorne lane", "22 hawthorne st.", "san francisco", "californian", '102' 105 | "khan toke thai house", "5937 geary blvd.", "san francisco", "thai", '103' 106 | "la folie", "2316 polk st.", "san francisco", "french (new)", '104' 107 | "lulu restaurant-bis-cafe", "816 folsom st.", "san francisco", "mediterranean", '105' 108 | "masa's", "648 bush st.", "san francisco", "french (new)", '106' 109 | "mifune", "1737 post st.", "san francisco", "japanese", '107' 110 | "plumpjack cafe", "3127 fillmore st.", "san francisco", "american (new)", '108' 111 | "postrio", "545 post st.", "san francisco", "californian", '109' 112 | "ritz-carlton dining room (san francisco)", "600 stockton st.", "san francisco", "french (new)", '110' 113 | "rose pistola", "532 columbus ave.", "san francisco", "italian", '111' 114 | -------------------------------------------------------------------------------- /examples/restaurant-2.csv: -------------------------------------------------------------------------------- 1 | name,address,city,cuisine,unique_id 2 | "arnie morton's of chicago", "435 s. 
la cienega blv.", "los angeles", "american", '0' 3 | "art's delicatessen", "12224 ventura blvd.", "studio city", "american", '1' 4 | "hotel bel-air", "701 stone canyon rd.", "bel air", "californian", '2' 5 | "cafe bizou", "14016 ventura blvd.", "sherman oaks", "french", '3' 6 | "campanile", "624 s. la brea ave.", "los angeles", "american", '4' 7 | "chinois on main", "2709 main st.", "santa monica", "french", '5' 8 | "citrus", "6703 melrose ave.", "los angeles", "californian", '6' 9 | "fenix", "8358 sunset blvd. west", "hollywood", "american", '7' 10 | "granita", "23725 w. malibu rd.", "malibu", "californian", '8' 11 | "grill on the alley", "9560 dayton way", "los angeles", "american", '9' 12 | "restaurant katsu", "1972 n. hillhurst ave.", "los angeles", "asian", '10' 13 | "l'orangerie", "903 n. la cienega blvd.", "los angeles", "french", '11' 14 | "le chardonnay", "8284 melrose ave.", "los angeles", "french", '12' 15 | "locanda veneta", "3rd st.", "los angeles", "italian", '13' 16 | "matsuhisa", "129 n. la cienega blvd.", "beverly hills", "asian", '14' 17 | "the palm", "9001 santa monica blvd.", "los angeles", "american", '15' 18 | "patina", "5955 melrose ave.", "los angeles", "californian", '16' 19 | "philippe's the original", "1001 n. alameda st.", "los angeles", "american", '17' 20 | "pinot bistro", "12969 ventura blvd.", "los angeles", "french", '18' 21 | "rex il ristorante", "617 s. olive st.", "los angeles", "italian", '19' 22 | "spago", "1114 horn ave.", "los angeles", "californian", '20' 23 | "valentino", "3115 pico blvd.", "santa monica", "italian", '21' 24 | "yujean kang's gourmet chinese cuisine", "67 n. raymond ave.", "los angeles", "asian", '22' 25 | "21 club", "21 w. 52nd st.", "new york", "american", '23' 26 | "aquavit", "13 w. 54th st.", "new york", "continental", '24' 27 | "aureole", "34 e. 61st st.", "new york", "american", '25' 28 | "cafe lalo", "201 w. 83rd st.", "new york", "coffee bar", '26' 29 | "cafe des artistes", "1 w. 
67th st.", "new york", "continental", '27' 30 | "carmine's", "2450 broadway between 90th and 91st sts.", "new york", "italian", '28' 31 | "carnegie deli", "854 7th ave. between 54th and 55th sts.", "new york", "delicatessen", '29' 32 | "chanterelle", "2 harrison st. near hudson st.", "new york", "american", '30' 33 | "daniel", "20 e. 76th st.", "new york", "french", '31' 34 | "dawat", "210 e. 58th st.", "new york", "asian", '32' 35 | "felidia", "243 e. 58th st.", "new york", "italian", '33' 36 | "four seasons grill room", "99 e. 52nd st.", "new york", "american", '34' 37 | "gotham bar & grill", "12 e. 12th st.", "new york", "american", '35' 38 | "gramercy tavern", "42 e. 20th st. between park ave. s and broadway", "new york", "american", '36' 39 | "island spice", "402 w. 44th st.", "new york", "tel caribbean", '37' 40 | "jo jo", "160 e. 64th st.", "new york", "american", '38' 41 | "la caravelle", "33 w. 55th st.", "new york", "french", '39' 42 | "la cote basque", "60 w. 55th st. between 5th and 6th ave.", "new york", "french", '40' 43 | "le bernardin", "155 w. 51st st.", "new york", "french", '41' 44 | "les celebrites", "160 central park s", "new york", "french", '42' 45 | "lespinasse", "2 e. 55th st.", "new york", "american", '43' 46 | "lutece", "249 e. 50th st.", "new york", "french", '44' 47 | "manhattan ocean club", "57 w. 58th st.", "new york", "seafood", '45' 48 | "march", "405 e. 58th st.", "new york", "american", '46' 49 | "mesa grill", "102 5th ave. between 15th and 16th sts.", "new york", "american", '47' 50 | "mi cocina", "57 jane st. off hudson st.", "new york", "mexican", '48' 51 | "montrachet", "239 w. broadway between walker and white sts.", "new york", "french", '49' 52 | "oceana", "55 e. 54th st.", "new york", "seafood", '50' 53 | "park avenue cafe", "100 e. 63rd st.", "new york", "american", '51' 54 | "petrossian", "182 w. 58th st.", "new york", "french", '52' 55 | "picholine", "35 w. 
64th st.", "new york", "mediterranean", '53' 56 | "pisces", "95 ave. a at 6th st.", "new york", "seafood", '54' 57 | "rainbow room", "30 rockefeller plaza", "new york", "or 212/632-5100 american", '55' 58 | "river cafe", "1 water st. at the east river", "brooklyn", "american", '56' 59 | "san domenico", "240 central park s", "new york", "italian", '57' 60 | "second avenue deli", "156 2nd ave. at 10th st.", "new york", "delicatessen", '58' 61 | "seryna", "11 e. 53rd st.", "new york", "asian", '59' 62 | "shun lee west", "43 w. 65th st.", "new york", "asian", '60' 63 | "sign of the dove", "1110 3rd ave. at 65th st.", "new york", "american", '61' 64 | "smith & wollensky", "201 e. 49th st.", "new york", "american", '62' 65 | "tavern on the green", "in central park at 67th st.", "new york", "american", '63' 66 | "uncle nick's", "747 9th ave. between 50th and 51st sts.", "new york", "mediterranean", '64' 67 | "union square cafe", "21 e. 16th st.", "new york", "american", '65' 68 | "virgil's", "152 w. 44th st.", "new york", "american", '66' 69 | "chin's", "3200 las vegas blvd. s", "las vegas", "asian", '67' 70 | "coyote cafe", "3799 las vegas blvd. s", "las vegas", "southwestern", '68' 71 | "le montrachet", "3000 w. paradise rd.", "las vegas", "continental", '69' 72 | "palace court", "3570 las vegas blvd. s", "las vegas", "continental", '70' 73 | "second street grille", "200 e. fremont st.", "las vegas", "seafood", '71' 74 | "steak house", "2880 las vegas blvd. s", "las vegas", "steak houses", '72' 75 | "tillerman", "2245 e. flamingo rd.", "las vegas", "seafood", '73' 76 | "abruzzi", "2355 peachtree rd. peachtree battle shopping center", "atlanta", "italian", '74' 77 | "bacchanalia", "3125 piedmont rd. near peachtree rd.", "atlanta", "international", '75' 78 | "bone's", "3130 piedmont road", "atlanta", "american", '76' 79 | "brasserie le coze", "3393 peachtree rd. 
lenox square mall near neiman marcus", "atlanta", "french", '77' 80 | "buckhead diner", "3073 piedmont road", "atlanta", "american", '78' 81 | "ciboulette", "1529 piedmont ave.", "atlanta", "french", '79' 82 | "delectables", "1 margaret mitchell sq.", "atlanta", "american", '80' 83 | "georgia grille", "2290 peachtree rd. peachtree square shopping center", "atlanta", "american", '81' 84 | "hedgerose heights inn", "490 e. paces ferry rd.", "atlanta", "international", '82' 85 | "heera of india", "595 piedmont ave. rio shopping mall", "atlanta", "asian", '83' 86 | "indigo coastal grill", "1397 n. highland ave.", "atlanta", "caribbean", '84' 87 | "la grotta", "2637 peachtree rd. peachtree house condominium", "atlanta", "italian", '85' 88 | "mary mac's tea room", "224 ponce de leon ave.", "atlanta", "southern", '86' 89 | "nikolai's roof", "255 courtland st. at harris st.", "atlanta", "continental", '87' 90 | "pano's and paul's", "1232 w. paces ferry rd.", "atlanta", "international", '88' 91 | "cafe ritz-carlton buckhead", "3434 peachtree rd.", "atlanta", "ext 6108 international", '89' 92 | "dining room ritz-carlton buckhead", "3434 peachtree rd.", "atlanta", "international", '90' 93 | "restaurant ritz-carlton atlanta", "181 peachtree st.", "atlanta", "continental", '91' 94 | "toulouse", "b peachtree rd.", "atlanta", "french", '92' 95 | "veni vidi vici", "41 14th st.", "atlanta", "italian", '93' 96 | "alain rondelli", "126 clement st.", "san francisco", "french", '94' 97 | "aqua", "252 california st.", "san francisco", "seafood", '95' 98 | "boulevard", "1 mission st.", "san francisco", "american", '96' 99 | "cafe claude", "7 claude la.", "san francisco", "french", '97' 100 | "campton place", "340 stockton st.", "san francisco", "american", '98' 101 | "chez michel", "804 northpoint", "san francisco", "french", '99' 102 | "fleur de lys", "777 sutter st.", "san francisco", "french", '100' 103 | "fringale", "570 4th st.", "san francisco", "french", '101' 104 | "hawthorne 
lane", "22 hawthorne st.", "san francisco", "american", '102' 105 | "khan toke thai house", "5937 geary blvd.", "san francisco", "asian", '103' 106 | "la folie", "2316 polk st.", "san francisco", "french", '104' 107 | "lulu", "816 folsom st.", "san francisco", "mediterranean", '105' 108 | "masa's", "648 bush st.", "san francisco", "french", '106' 109 | "mifune japan center kintetsu building", "1737 post st.", "san francisco", "asian", '107' 110 | "plumpjack cafe", "3201 fillmore st.", "san francisco", "mediterranean", '108' 111 | "postrio", "545 post st.", "san francisco", "american", '109' 112 | "ritz-carlton restaurant and dining room", "600 stockton st.", "san francisco", "american", '110' 113 | "rose pistola", "532 columbus ave.", "san francisco", "italian", '111' 114 | "adriano's ristorante", "2930 beverly glen circle", "los angeles", "italian", '112' 115 | "barney greengrass", "9570 wilshire blvd.", "beverly hills", "american", '113' 116 | "beaurivage", "26025 pacific coast hwy.", "malibu", "french", '114' 117 | "bistro garden", "176 n. canon dr.", "los angeles", "californian", '115' 118 | "border grill", "4th st.", "los angeles", "mexican", '116' 119 | "broadway deli", "3rd st. promenade", "santa monica", "american", '117' 120 | "ca'brea", "346 s. la brea ave.", "los angeles", "italian", '118' 121 | "ca'del sol", "4100 cahuenga blvd.", "los angeles", "italian", '119' 122 | "cafe pinot", "700 w. fifth st.", "los angeles", "californian", '120' 123 | "california pizza kitchen", "207 s. beverly dr.", "los angeles", "californian", '121' 124 | "canter's", "419 n. fairfax ave.", "los angeles", "american", '122' 125 | "cava", "3rd st.", "los angeles", "mediterranean", '123' 126 | "cha cha cha", "656 n. virgil ave.", "los angeles", "caribbean", '124' 127 | "chan dara", "310 n. larchmont blvd.", "los angeles", "asian", '125' 128 | "clearwater cafe", "168 w. 
colorado blvd.", "los angeles", "health food", '126' 129 | "dining room", "9500 wilshire blvd.", "los angeles", "californian", '127' 130 | "dive!", "10250 santa monica blvd.", "los angeles", "dive american", '128' 131 | "drago", "2628 wilshire blvd.", "santa monica", "italian", '129' 132 | "drai's", "730 n. la cienega blvd.", "los angeles", "french", '130' 133 | "dynasty room", "930 hilgard ave.", "los angeles", "continental", '131' 134 | "eclipse", "8800 melrose ave.", "los angeles", "californian", '132' 135 | "ed debevic's", "134 n. la cienega", "los angeles", "american", '133' 136 | "el cholo", "1121 s. western ave.", "los angeles", "mexican", '134' 137 | "gilliland's", "2424 main st.", "santa monica", "american", '135' 138 | "gladstone's", "4 fish 17300 pacific coast hwy. at sunset blvd.", "pacific palisades", "american", '136' 139 | "hard rock cafe", "8600 beverly blvd.", "los angeles", "american", '137' 140 | "harry's bar & american grill", "2020 ave. of the stars", "los angeles", "italian", '138' 141 | "il fornaio cucina italiana", "301 n. beverly dr.", "los angeles", "italian", '139' 142 | "jack sprat's grill", "10668 w. pico blvd.", "los angeles", "health food", '140' 143 | "jackson's farm", "439 n. beverly drive", "los angeles", "californian", '141' 144 | "jimmy's", "201 moreno dr.", "los angeles", "continental", '142' 145 | "joss", "9255 sunset blvd.", "los angeles", "asian", '143' 146 | "le colonial", "8783 beverly blvd.", "los angeles", "asian", '144' 147 | "le dome", "8720 sunset blvd.", "los angeles", "french", '145' 148 | "louise's trattoria", "4500 los feliz blvd.", "los angeles", "italian", '146' 149 | "mon kee seafood restaurant", "679 n. spring st.", "los angeles", "asian", '147' 150 | "morton's", "8764 melrose ave.", "los angeles", "american", '148' 151 | "nate 'n' al's", "414 n. beverly dr.", "los angeles", "american", '149' 152 | "nicola", "601 s. 
figueroa st.", "los angeles", "american", '150' 153 | "ocean avenue", "1401 ocean ave.", "santa monica", "american", '151' 154 | "orleans", "11705 national blvd.", "los angeles", "cajun", '152' 155 | "pacific dining car", "6th st.", "los angeles", "american", '153' 156 | "paty's", "10001 riverside dr.", "toluca lake", "american", '154' 157 | "pinot hollywood", "1448 n. gower st.", "los angeles", "californian", '155' 158 | "posto", "14928 ventura blvd.", "sherman oaks", "italian", '156' 159 | "prego", "362 n. camden dr.", "los angeles", "italian", '157' 160 | "rj's the rib joint", "252 n. beverly dr.", "los angeles", "american", '158' 161 | "remi", "3rd st. promenade", "santa monica", "italian", '159' 162 | "restaurant horikawa", "111 s. san pedro st.", "los angeles", "asian", '160' 163 | "roscoe's house of chicken 'n' waffles", "1514 n. gower st.", "los angeles", "american", '161' 164 | "schatzi on main", "3110 main st.", "los angeles", "continental", '162' 165 | "sofi", "3rd st.", "los angeles", "mediterranean", '163' 166 | "swingers", "8020 beverly blvd.", "los angeles", "american", '164' 167 | "tavola calda", "7371 melrose ave.", "los angeles", "italian", '165' 168 | "the mandarin", "430 n. camden dr.", "los angeles", "asian", '166' 169 | "tommy tang's", "7313 melrose ave.", "los angeles", "asian", '167' 170 | "tra di noi", "3835 cross creek rd.", "los angeles", "italian", '168' 171 | "trader vic's", "9876 wilshire blvd.", "los angeles", "asian", '169' 172 | "vida", "1930 north hillhurst ave.", "los feliz", "american", '170' 173 | "west beach cafe", "60 n. venice blvd.", "los angeles", "american", '171' 174 | "20 mott", "20 mott st. between bowery and pell st.", "new york", "asian", '172' 175 | "9 jones street", "9 jones st.", "new york", "american", '173' 176 | "adrienne", "700 5th ave. at 55th st.", "new york", "french", '174' 177 | "agrotikon", "322 e. 14 st. 
between 1st and 2nd aves.", "new york", "mediterranean", '175' 178 | "aja", "937 broadway at 22nd st.", "new york", "american", '176' 179 | "alamo", "304 e. 48th st.", "new york", "mexican", '177' 180 | "alley's end", "311 w. 17th st.", "new york", "american", '178' 181 | "ambassador grill", "1 united nations plaza at 44th st.", "new york", "american", '179' 182 | "american place", "2 park ave. at 32nd st.", "new york", "american", '180' 183 | "anche vivolo", "222 e. 58th st. between 2nd and 3rd aves.", "new york", "italian", '181' 184 | "arizona", "206 206 e. 60th st.", "new york", "american", '182' 185 | "arturo's", "106 w. houston st. off thompson st.", "new york", "italian", '183' 186 | "au mandarin", "200-250 vesey st. world financial center", "new york", "asian", '184' 187 | "bar anise", "1022 3rd ave. between 60th and 61st sts.", "new york", "mediterranean", '185' 188 | "barbetta", "321 w. 46th st.", "new york", "italian", '186' 189 | "ben benson's", "123 w. 52nd st.", "new york", "american", '187' 190 | "big cup", "228 8th ave. between 21st and 22nd sts.", "new york", "coffee bar", '188' 191 | "billy's", "948 1st ave. between 52nd and 53rd sts.", "new york", "american", '189' 192 | "boca chica", "13 1st ave. near 1st st.", "new york", "latin american", '190' 193 | "bolo", "23 e. 22nd st.", "new york", "mediterranean", '191' 194 | "boonthai", "1393a 2nd ave. between 72nd and 73rd sts.", "new york", "asian", '192' 195 | "bouterin", "420 e. 59th st. off 1st ave.", "new york", "french", '193' 196 | "brothers bar-b-q", "225 varick st. at clarkston st.", "new york", "american", '194' 197 | "bruno", "240 e. 58th st.", "new york", "italian", '195' 198 | "bryant park grill roof restaurant and bp cafe", "25 w. 40th st. between 5th and 6th aves.", "new york", "american", '196' 199 | "c3", "103 waverly pl. near washington sq.", "new york", "american", '197' 200 | "ct", "111 e. 22nd st. between park ave. 
s and lexington ave.", "new york", "french", '198' 201 | "cafe bianco", "1486 2nd ave. between 77th and 78th sts.", "new york", "coffee bar", '199' 202 | "cafe botanica", "160 central park s", "new york", "french", '200' 203 | "cafe la fortuna", "69 w. 71st st.", "new york", "coffee bar", '201' 204 | "cafe luxembourg", "200 w. 70th st.", "new york", "french", '202' 205 | "cafe pierre", "2 e. 61st st.", "new york", "french", '203' 206 | "cafe centro", "200 park ave. between 45th st. and vanderbilt ave.", "new york", "french", '204' 207 | "cafe fes", "246 w. 4th st. at charles st.", "new york", "mediterranean", '205' 208 | "caffe dante", "81 macdougal st. between houston and bleeker sts.", "new york", "coffee bar", '206' 209 | "caffe dell'artista", "46 greenwich ave.", "new york", "coffee bar", '207' 210 | "caffe lure", "169 sullivan st. between houston and bleecker sts.", "new york", "french", '208' 211 | "caffe reggio", "119 macdougal st. between 3rd and bleecker sts.", "new york", "coffee bar", '209' 212 | "caffe roma", "385 broome st. at mulberry", "new york", "coffee bar", '210' 213 | "caffe vivaldi", "32 jones st. at bleecker st.", "new york", "coffee bar", '211' 214 | "caffe bondi ristorante", "7 w. 20th st.", "new york", "italian", '212' 215 | "capsouto freres", "451 washington st. near watts st.", "new york", "french", '213' 216 | "captain's table", "860 2nd ave. at 46th st.", "new york", "seafood", '214' 217 | "casa la femme", "150 wooster st. between houston and prince sts.", "new york", "middle eastern", '215' 218 | "cendrillon asian grill & marimba bar", "45 mercer st. between broome and grand sts.", "new york", "asian", '216' 219 | "chez jacqueline", "72 macdougal st. between w. houston and bleecker sts.", "new york", "french", '217' 220 | "chiam", "160 e. 48th st.", "new york", "asian", '218' 221 | "china grill", "60 w. 53rd st.", "new york", "american", '219' 222 | "cite", "120 w. 51st st.", "new york", "french", '220' 223 | "coco pazzo", "23 e. 
74th st.", "new york", "italian", '221' 224 | "columbus bakery", "53rd sts.", "new york", "coffee bar", '222' 225 | "corrado cafe", "1013 3rd ave. between 60th and 61st sts.", "new york", "coffee bar", '223' 226 | "cupcake cafe", "522 9th ave. at 39th st.", "new york", "coffee bar", '224' 227 | "da nico", "164 mulberry st. between grand and broome sts.", "new york", "italian", '225' 228 | "dean & deluca", "121 prince st.", "new york", "coffee bar", '226' 229 | "diva", "341 w. broadway near grand st.", "new york", "italian", '227' 230 | "dix et sept", "181 w. 10th st.", "new york", "french", '228' 231 | "docks", "633 3rd ave. at 40th st.", "new york", "seafood", '229' 232 | "duane park cafe", "157 duane st. between w. broadway and hudson st.", "new york", "american", '230' 233 | "el teddy's", "219 w. broadway between franklin and white sts.", "new york", "mexican", '231' 234 | "emily's", "1325 5th ave. at 111th st.", "new york", "american", '232' 235 | "empire korea", "6 e. 32nd st.", "new york", "asian", '233' 236 | "ernie's", "2150 broadway between 75th and 76th sts.", "new york", "american", '234' 237 | "evergreen cafe", "1288 1st ave. at 69th st.", "new york", "asian", '235' 238 | "f. ille ponte ristorante", "39 desbrosses st. near west st.", "new york", "italian", '236' 239 | "felix", "340 w. broadway at grand st.", "new york", "french", '237' 240 | "ferrier", "29 e. 65th st.", "new york", "french", '238' 241 | "fifty seven fifty seven", "57 e. 57th st.", "new york", "american", '239' 242 | "film center cafe", "635 9th ave. between 44th and 45th sts.", "new york", "american", '240' 243 | "fiorello's roman cafe", "1900 broadway between 63rd and 64th sts.", "new york", "italian", '241' 244 | "firehouse", "522 columbus ave. between 85th and 86th sts.", "new york", "american", '242' 245 | "first", "87 1st ave. between 5th and 6th sts.", "new york", "american", '243' 246 | "fishin eddie", "73 w. 71st st.", "new york", "seafood", '244' 247 | "fleur de jour", "348 e. 
62nd st.", "new york", "coffee bar", '245' 248 | "flowers", "21 west 17th st. between 5th and 6th aves.", "new york", "american", '246' 249 | "follonico", "6 w. 24th st.", "new york", "italian", '247' 250 | "fraunces tavern", "54 pearl st. at broad st.", "new york", "american", '248' 251 | "french roast", "458 6th ave. at 11th st.", "new york", "french", '249' 252 | "french roast cafe", "2340 broadway at 85th st.", "new york", "coffee bar", '250' 253 | "frico bar", "402 w. 43rd st. off 9th ave.", "new york", "italian", '251' 254 | "fujiyama mama", "467 columbus ave. between 82nd and 83rd sts.", "new york", "asian", '252' 255 | "gabriela's", "685 amsterdam ave. at 93rd st.", "new york", "mexican", '253' 256 | "gallagher's", "228 w. 52nd st.", "new york", "american", '254' 257 | "gianni's", "15 fulton st.", "new york", "seafood", '255' 258 | "girafe", "208 e. 58th st. between 2nd and 3rd aves.", "new york", "italian", '256' 259 | "global", "33 93 2nd ave. between 5th and 6th sts.", "new york", "american", '257' 260 | "golden unicorn", "18 e. broadway at catherine st.", "new york", "asian", '258' 261 | "grand ticino", "228 thompson st. between w. 3rd and bleecker sts.", "new york", "italian", '259' 262 | "halcyon", "151 w. 54th st. in the rihga royal hotel", "new york", "american", '260' 263 | "hard rock cafe", "221 w. 57th st.", "new york", "american", '261' 264 | "hi-life restaurant and lounge", "1340 1st ave. at 72nd st.", "new york", "american", '262' 265 | "home", "20 cornelia st. between bleecker and w. 4th st.", "new york", "american", '263' 266 | "hudson river club", "4 world financial center", "new york", "american", '264' 267 | "i trulli", "122 e. 27th st. between lexington and park aves.", "new york", "italian", '265' 268 | "il cortile", "125 mulberry st. between canal and hester sts.", "new york", "italian", '266' 269 | "il nido", "251 e. 53rd st.", "new york", "italian", '267' 270 | "inca grill", "492 broome st. near w. 
broadway", "new york", "latin american", '268' 271 | "indochine", "430 lafayette st. between 4th st. and astor pl.", "new york", "asian", '269' 272 | "internet cafe", "82 e. 3rd st. between 1st and 2nd aves.", "new york", "coffee bar", '270' 273 | "ipanema", "13 w. 46th st.", "new york", "latin american", '271' 274 | "jean lafitte", "68 w. 58th st.", "new york", "french", '272' 275 | "jewel of india", "15 w. 44th st.", "new york", "asian", '273' 276 | "jimmy sung's", "219 e. 44th st. between 2nd and 3rd aves.", "new york", "asian", '274' 277 | "joe allen", "326 w. 46th st.", "new york", "american", '275' 278 | "judson grill", "152 w. 52nd st.", "new york", "american", '276' 279 | "l'absinthe", "227 e. 67th st.", "new york", "french", '277' 280 | "l'auberge", "1191 1st ave. between 64th and 65th sts.", "new york", "middle eastern", '278' 281 | "l'auberge du midi", "310 w. 4th st. between w. 12th and bank sts.", "new york", "french", '279' 282 | "l'udo", "432 lafayette st. near astor pl.", "new york", "french", '280' 283 | "la reserve", "4 w. 49th st.", "new york", "french", '281' 284 | "lanza restaurant", "168 1st ave. between 10th and 11th sts.", "new york", "italian", '282' 285 | "lattanzi ristorante", "361 w. 46th st.", "new york", "italian", '283' 286 | "layla", "211 w. broadway at franklin st.", "new york", "middle eastern", '284' 287 | "le chantilly", "106 e. 57th st.", "new york", "french", '285' 288 | "le colonial", "149 e. 57th st.", "new york", "asian", '286' 289 | "le gamin", "50 macdougal st. between houston and prince sts.", "new york", "coffee bar", '287' 290 | "le jardin", "25 cleveland pl. near spring st.", "new york", "french", '288' 291 | "le madri", "168 w. 18th st.", "new york", "italian", '289' 292 | "le marais", "150 w. 46th st.", "new york", "american", '290' 293 | "le perigord", "405 e. 52nd st.", "new york", "french", '291' 294 | "le select", "507 columbus ave. 
between 84th and 85th sts.", "new york", "american", '292' 295 | "les halles", "411 park ave. s between 28th and 29th sts.", "new york", "french", '293' 296 | "lincoln tavern", "51 w. 64th st.", "new york", "american", '294' 297 | "lola", "30 west 22nd st. between 5th and 6th ave.", "new york", "american", '295' 298 | "lucky strike", "59 grand st. between wooster st. and w. broadway", "new york", "or 212/941-0772 american", '296' 299 | "mad fish", "2182 broadway between 77th and 78th sts.", "new york", "seafood", '297' 300 | "main street", "446 columbus ave. between 81st and 82nd sts.", "new york", "american", '298' 301 | "mangia e bevi", "800 9th ave. at 53rd st.", "new york", "italian", '299' 302 | "manhattan cafe", "1161 1st ave. between 63rd and 64th sts.", "new york", "american", '300' 303 | "manila garden", "325 e. 14th st. between 1st and 2nd aves.", "new york", "asian", '301' 304 | "marichu", "342 e. 46th st. between 1st and 2nd aves.", "new york", "french", '302' 305 | "marquet patisserie", "15 e. 12th st. between 5th ave. and university pl.", "new york", "coffee bar", '303' 306 | "match", "160 mercer st. between houston and prince sts.", "new york", "american", '304' 307 | "matthew's", "1030 3rd ave. at 61st st.", "new york", "american", '305' 308 | "mavalli palace", "46 e. 29th st.", "new york", "asian", '306' 309 | "milan cafe and coffee bar", "120 w. 23rd st.", "new york", "coffee bar", '307' 310 | "monkey bar", "60 e. 54th st.", "new york", "american", '308' 311 | "montien", "1134 1st ave. between 62nd and 63rd sts.", "new york", "asian", '309' 312 | "morton's", "551 5th ave. at 45th st.", "new york", "american", '310' 313 | "motown cafe", "104 w. 57th st. near 6th ave.", "new york", "american", '311' 314 | "new york kom tang soot bul house", "32 w. 32nd st.", "new york", "asian", '312' 315 | "new york noodletown", "28 1/2 bowery at bayard st.", "new york", "asian", '313' 316 | "newsbar", "2 w. 
19th st.", "new york", "coffee bar", '314' 317 | "odeon", "145 w. broadway at thomas st.", "new york", "american", '315' 318 | "orso", "322 w. 46th st.", "new york", "italian", '316' 319 | "osteria al droge", "142 w. 44th st.", "new york", "italian", '317' 320 | "otabe", "68 e. 56th st.", "new york", "asian", '318' 321 | "pacifica", "138 lafayette st. between canal and howard sts.", "new york", "asian", '319' 322 | "palio", "151 w. 51st. st.", "new york", "italian", '320' 323 | "pamir", "1065 1st ave. at 58th st.", "new york", "middle eastern", '321' 324 | "parioli romanissimo", "24 e. 81st st.", "new york", "italian", '322' 325 | "patria", "250 park ave. s at 20th st.", "new york", "latin american", '323' 326 | "peacock alley", "301 park ave. between 49th and 50th sts.", "new york", "french", '324' 327 | "pen & pencil", "205 e. 45th st.", "new york", "american", '325' 328 | "penang soho", "109 spring st. between greene and mercer sts.", "new york", "asian", '326' 329 | "persepolis", "1423 2nd ave. between 74th and 75th sts.", "new york", "middle eastern", '327' 330 | "planet hollywood", "140 w. 57th st.", "new york", "american", '328' 331 | "pomaire", "371 w. 46th st. off 9th ave.", "new york", "latin american", '329' 332 | "popover cafe", "551 amsterdam ave. between 86th and 87th sts.", "new york", "american", '330' 333 | "post house", "28 e. 63rd st.", "new york", "american", '331' 334 | "rain", "100 w. 82nd st.", "new york", "asian", '332' 335 | "red tulip", "439 e. 75th st.", "new york", "eastern european", '333' 336 | "remi", "145 w. 53rd st.", "new york", "italian", '334' 337 | "republic", "37a union sq. w between 16th and 17th sts.", "new york", "asian", '335' 338 | "roettelle a. g", "126 e. 7th st. between 1st ave. and ave. a", "new york", "continental", '336' 339 | "rosa mexicano", "1063 1st ave. at 58th st.", "new york", "mexican", '337' 340 | "ruth's chris", "148 w. 51st st.", "new york", "american", '338' 341 | "s.p.q.r", "133 mulberry st. 
between hester and grand sts.", "new york", "italian", '339' 342 | "sal anthony's", "55 irving pl.", "new york", "italian", '340' 343 | "sammy's roumanian steak house", "157 chrystie st. at delancey st.", "new york", "east european", '341' 344 | "san pietro", "18 e. 54th st.", "new york", "italian", '342' 345 | "sant ambroeus", "1000 madison ave. between 77th and 78th sts.", "new york", "coffee bar", '343' 346 | "sarabeth's kitchen", "423 amsterdam ave. between 80th and 81st sts.", "new york", "american", '344' 347 | "sea grill", "19 w. 49th st.", "new york", "seafood", '345' 348 | "serendipity", "3 225 e. 60th st.", "new york", "american", '346' 349 | "seventh regiment mess and bar", "643 park ave. at 66th st.", "new york", "american", '347' 350 | "sfuzzi", "58 w. 65th st.", "new york", "american", '348' 351 | "shaan", "57 w. 48th st.", "new york", "asian", '349' 352 | "sofia fabulous pizza", "1022 madison ave. near 79th st.", "new york", "italian", '350' 353 | "spring street natural restaurant & bar", "62 spring st. at lafayette st.", "new york", "american", '351' 354 | "stage deli", "834 7th ave. between 53rd and 54th sts.", "new york", "delicatessen", '352' 355 | "stingray", "428 amsterdam ave. between 80th and 81st sts.", "new york", "seafood", '353' 356 | "sweet'n'tart cafe", "76 mott st. at canal st.", "new york", "asian", '354' 357 | "t salon", "143 mercer st. at prince st.", "new york", "coffee bar", '355' 358 | "tang pavillion", "65 w. 55th st.", "new york", "asian", '356' 359 | "tapika", "950 8th ave. at 56th st.", "new york", "american", '357' 360 | "teresa's", "103 1st ave. between 6th and 7th sts.", "new york", "east european", '358' 361 | "terrace", "400 w. 119th st. between amsterdam and morningside aves.", "new york", "continental", '359' 362 | "the coffee pot", "350 9th ave. at 49th st.", "new york", "coffee bar", '360' 363 | "the savannah club", "2420 broadway at 89th st.", "new york", "american", '361' 364 | "trattoria dell'arte", "900 7th ave. 
between 56th and 57th sts.", "new york", "italian", '362' 365 | "triangolo", "345 e. 83rd st.", "new york", "italian", '363' 366 | "tribeca grill", "375 greenwich st. near franklin st.", "new york", "american", '364' 367 | "trois jean", "154 e. 79th st. between lexington and 3rd aves.", "new york", "coffee bar", '365' 368 | "tse yang", "34 e. 51st st.", "new york", "asian", '366' 369 | "turkish kitchen", "386 3rd ave. between 27th and 28th sts.", "new york", "middle eastern", '367' 370 | "two two two", "222 w. 79th st.", "new york", "american", '368' 371 | "veniero's pasticceria", "342 e. 11th st. near 1st ave.", "new york", "coffee bar", '369' 372 | "verbena", "54 irving pl. at 17th st.", "new york", "american", '370' 373 | "victor's cafe", "52 236 w. 52nd st.", "new york", "latin american", '371' 374 | "vince & eddie's", "70 w. 68th st.", "new york", "american", '372' 375 | "vong", "200 e. 54th st.", "new york", "american", '373' 376 | "water club", "500 e. 30th st.", "new york", "american", '374' 377 | "west", "63rd street steakhouse 44 w. 63rd st.", "new york", "american", '375' 378 | "xunta", "174 1st ave. between 10th and 11th sts.", "new york", "mediterranean", '376' 379 | "zen palate", "34 union sq. e at 16th st.", "new york", "and 212/614-9345 asian", '377' 380 | "zoe", "90 prince st. between broadway and mercer st.", "new york", "american", '378' 381 | "abbey", "163 ponce de leon ave.", "atlanta", "international", '379' 382 | "aleck's barbecue heaven", "783 martin luther king jr. dr.", "atlanta", "barbecue", '380' 383 | "annie's thai castle", "3195 roswell rd.", "atlanta", "asian", '381' 384 | "anthonys", "3109 piedmont rd. just south of peachtree rd.", "atlanta", "american", '382' 385 | "atlanta fish market", "265 pharr rd.", "atlanta", "american", '383' 386 | "beesley's of buckhead", "260 e. paces ferry road", "atlanta", "continental", '384' 387 | "bertolini's", "3500 peachtree rd. 
phipps plaza", "atlanta", "italian", '385' 388 | "bistango", "1100 peachtree st.", "atlanta", "mediterranean", '386' 389 | "cafe renaissance", "7050 jimmy carter blvd. norcross", "atlanta", "american", '387' 390 | "camille's", "1186 n. highland ave.", "atlanta", "italian", '388' 391 | "cassis", "3300 peachtree rd. grand hyatt", "atlanta", "mediterranean", '389' 392 | "city grill", "50 hurt plaza", "atlanta", "international", '390' 393 | "coco loco", "40 buckhead crossing mall on the sidney marcus blvd.", "atlanta", "caribbean", '391' 394 | "colonnade restaurant", "1879 cheshire bridge rd.", "atlanta", "southern", '392' 395 | "dante's down the hatch buckhead", "3380 peachtree rd.", "atlanta", "continental", '393' 396 | "dante's down the hatch", "underground underground mall underground atlanta", "atlanta", "continental", '394' 397 | "fat matt's rib shack", "1811 piedmont ave. near cheshire bridge rd.", "atlanta", "barbecue", '395' 398 | "french quarter food shop", "923 peachtree st. at 8th st.", "atlanta", "southern", '396' 399 | "holt bros. bar-b-q", "6359 jimmy carter blvd. at buford hwy. norcross", "atlanta", "barbecue", '397' 400 | "horseradish grill", "4320 powers ferry rd.", "atlanta", "southern", '398' 401 | "hsu's gourmet", "192 peachtree center ave. at international blvd.", "atlanta", "asian", '399' 402 | "imperial fez", "2285 peachtree rd. peachtree battle condominium", "atlanta", "mediterranean", '400' 403 | "kamogawa", "3300 peachtree rd. grand hyatt", "atlanta", "asian", '401' 404 | "la grotta at ravinia dunwoody rd.", "holiday inn/crowne plaza at ravinia dunwoody", "atlanta", "italian", '402' 405 | "little szechuan", "c buford hwy. northwoods plaza doraville", "atlanta", "asian", '403' 406 | "lowcountry barbecue", "6301 roswell rd. sandy springs plaza sandy springs", "atlanta", "barbecue", '404' 407 | "luna si", "1931 peachtree rd.", "atlanta", "continental", '405' 408 | "mambo restaurante cubano", "1402 n. 
highland ave.", "atlanta", "caribbean", '406' 409 | "mckinnon's louisiane", "3209 maple dr.", "atlanta", "southern", '407' 410 | "mi spia dunwoody rd.", "park place across from perimeter mall dunwoody", "atlanta", "italian", '408' 411 | "nickiemoto's: a sushi bar", "247 buckhead ave. east village sq.", "atlanta", "fusion", '409' 412 | "palisades", "1829 peachtree rd.", "atlanta", "continental", '410' 413 | "pleasant peasant", "555 peachtree st. at linden ave.", "atlanta", "american", '411' 414 | "pricci", "500 pharr rd.", "atlanta", "italian", '412' 415 | "r.j.'s uptown kitchen & wine bar", "870 n. highland ave.", "atlanta", "american", '413' 416 | "rib ranch", "25 irby ave.", "atlanta", "barbecue", '414' 417 | "sa tsu ki", "3043 buford hwy.", "atlanta", "asian", '415' 418 | "sato sushi and thai", "6050 peachtree pkwy. norcross", "atlanta", "asian", '416' 419 | "south city kitchen", "1144 crescent ave.", "atlanta", "southern", '417' 420 | "south of france", "2345 cheshire bridge rd.", "atlanta", "french", '418' 421 | "stringer's fish camp and oyster bar", "3384 shallowford rd. chamblee", "atlanta", "southern", '419' 422 | "sundown cafe", "2165 cheshire bridge rd.", "atlanta", "american", '420' 423 | "taste of new orleans", "889 w. peachtree st.", "atlanta", "southern", '421' 424 | "tomtom", "3393 peachtree rd.", "atlanta", "continental", '422' 425 | "antonio's", "3700 w. flamingo", "las vegas", "italian", '423' 426 | "bally's big kitchen", "3645 las vegas blvd. s", "las vegas", "buffets", '424' 427 | "bamboo garden", "4850 flamingo rd.", "las vegas", "asian", '425' 428 | "battista's hole in the wall", "4041 audrie st. at flamingo rd.", "las vegas", "italian", '426' 429 | "bertolini's", "3570 las vegas blvd. s", "las vegas", "italian", '427' 430 | "binion's coffee shop", "128 fremont st.", "las vegas", "coffee shops/diners", '428' 431 | "bistro", "3400 las vegas blvd. 
s", "las vegas", "continental", '429' 432 | "broiler", "4111 boulder hwy.", "las vegas", "american", '430' 433 | "bugsy's diner", "3555 las vegas blvd. s", "las vegas", "coffee shops/diners", '431' 434 | "cafe michelle", "1350 e. flamingo rd.", "las vegas", "american", '432' 435 | "cafe roma", "3570 las vegas blvd. s", "las vegas", "coffee shops/diners", '433' 436 | "capozzoli's", "3333 s. maryland pkwy.", "las vegas", "italian", '434' 437 | "carnival world", "3700 w. flamingo rd.", "las vegas", "buffets", '435' 438 | "center stage plaza hotel", "1 main st.", "las vegas", "american", '436' 439 | "circus circus", "2880 las vegas blvd. s", "las vegas", "buffets", '437' 440 | "empress court", "3570 las vegas blvd. s", "las vegas", "asian", '438' 441 | "feast", "2411 w. sahara ave.", "las vegas", "buffets", '439' 442 | "golden nugget hotel", "129 e. fremont st.", "las vegas", "buffets", '440' 443 | "golden steer", "308 w. sahara ave.", "las vegas", "steak houses", '441' 444 | "lillie langtry's", "129 e. fremont st.", "las vegas", "asian", '442' 445 | "mandarin court", "1510 e. flamingo rd.", "las vegas", "asian", '443' 446 | "margarita's mexican cantina", "3120 las vegas blvd. s", "las vegas", "mexican", '444' 447 | "mary's diner", "5111 w. boulder hwy.", "las vegas", "coffee shops/diners", '445' 448 | "mikado", "3400 las vegas blvd. s", "las vegas", "asian", '446' 449 | "pamplemousse", "400 e. sahara ave.", "las vegas", "continental", '447' 450 | "ralph's diner", "3000 las vegas blvd. s", "las vegas", "coffee shops/diners", '448' 451 | "the bacchanal", "3570 las vegas blvd. s", "las vegas", "only in las vegas", '449' 452 | "venetian", "3713 w. sahara ave.", "las vegas", "italian", '450' 453 | "viva mercado's", "6182 w. 
flamingo rd.", "las vegas", "mexican", '451' 454 | "yolie's", "3900 paradise rd.", "las vegas", "steak houses", '452' 455 | "2223", "2223 market st.", "san francisco", "american", '453' 456 | "acquarello", "1722 sacramento st.", "san francisco", "italian", '454' 457 | "bardelli's", "243 o'farrell st.", "san francisco", "old san francisco", '455' 458 | "betelnut", "2030 union st.", "san francisco", "asian", '456' 459 | "bistro roti", "155 steuart st.", "san francisco", "french", '457' 460 | "bix", "56 gold st.", "san francisco", "american", '458' 461 | "bizou", "598 fourth st.", "san francisco", "french", '459' 462 | "buca giovanni", "800 greenwich st.", "san francisco", "italian", '460' 463 | "cafe adriano", "3347 fillmore st.", "san francisco", "italian", '461' 464 | "cafe marimba", "2317 chestnut st.", "san francisco", "mexican/latin american/spanish", '462' 465 | "california culinary academy", "625 polk st.", "san francisco", "french", '463' 466 | "capp's corner", "1600 powell st.", "san francisco", "italian", '464' 467 | "carta", "1772 market st.", "san francisco", "american", '465' 468 | "chevys", "4th and howard sts.", "san francisco", "mexican/latin american/spanish", '466' 469 | "cypress club", "500 jackson st.", "san francisco", "american", '467' 470 | "des alpes", "732 broadway", "san francisco", "french", '468' 471 | "faz", "161 sutter st.", "san francisco", "greek and middle eastern", '469' 472 | "fog city diner", "1300 battery st.", "san francisco", "american", '470' 473 | "garden court", "market and new montgomery sts.", "san francisco", "old san francisco", '471' 474 | "gaylord's", "ghirardelli sq.", "san francisco", "asian", '472' 475 | "grand cafe hotel monaco", "501 geary st.", "san francisco", "american", '473' 476 | "greens", "bldg. 
a fort mason", "san francisco", "vegetarian", '474' 477 | "harbor village", "4 embarcadero center", "san francisco", "asian", '475' 478 | "harris'", "2100 van ness ave.", "san francisco", "steak houses", '476' 479 | "harry denton's", "161 steuart st.", "san francisco", "american", '477' 480 | "hayes street grill", "320 hayes st.", "san francisco", "seafood", '478' 481 | "helmand", "430 broadway", "san francisco", "greek and middle eastern", '479' 482 | "hong kong flower lounge", "5322 geary blvd.", "san francisco", "asian", '480' 483 | "hong kong villa", "2332 clement st.", "san francisco", "asian", '481' 484 | "hyde street bistro", "1521 hyde st.", "san francisco", "italian", '482' 485 | "il fornaio levi's plaza", "1265 battery st.", "san francisco", "italian", '483' 486 | "izzy's steak & chop house", "3345 steiner st.", "san francisco", "steak houses", '484' 487 | "jack's", "615 sacramento st.", "san francisco", "old san francisco", '485' 488 | "kabuto sushi", "5116 geary blvd.", "san francisco", "asian", '486' 489 | "katia's", "600 5th ave.", "san francisco", "", '487' 490 | "kuleto's", "221 powell st.", "san francisco", "italian", '488' 491 | "kyo-ya. sheraton palace hotel", "2 new montgomery st. 
at market st.", "san francisco", "asian", '489' 492 | "l'osteria del forno", "519 columbus ave.", "san francisco", "italian", '490' 493 | "le central", "453 bush st.", "san francisco", "french", '491' 494 | "le soleil", "133 clement st.", "san francisco", "asian", '492' 495 | "macarthur park", "607 front st.", "san francisco", "american", '493' 496 | "manora", "3226 mission st.", "san francisco", "asian", '494' 497 | "maykadeh", "470 green st.", "san francisco", "greek and middle eastern", '495' 498 | "mccormick & kuleto's", "ghirardelli sq.", "san francisco", "seafood", '496' 499 | "millennium", "246 mcallister st.", "san francisco", "vegetarian", '497' 500 | "moose's", "1652 stockton st.", "san francisco", "mediterranean", '498' 501 | "north india", "3131 webster st.", "san francisco", "asian", '499' 502 | "one market", "1 market st.", "san francisco", "american", '500' 503 | "oritalia", "1915 fillmore st.", "san francisco", "italian", '501' 504 | "pacific pan pacific hotel", "500 post st.", "san francisco", "french", '502' 505 | "palio d'asti", "640 sacramento st.", "san francisco", "italian", '503' 506 | "pane e vino", "3011 steiner st.", "san francisco", "italian", '504' 507 | "pastis", "1015 battery st.", "san francisco", "french", '505' 508 | "perry's", "1944 union st.", "san francisco", "american", '506' 509 | "r&g lounge", "631 b kearny st.", "san francisco", "or 415/982-3811 asian", '507' 510 | "rubicon", "558 sacramento st.", "san francisco", "american", '508' 511 | "rumpus", "1 tillman pl.", "san francisco", "american", '509' 512 | "sanppo", "1702 post st.", "san francisco", "asian", '510' 513 | "scala's bistro", "432 powell st.", "san francisco", "italian", '511' 514 | "south park cafe", "108 south park", "san francisco", "french", '512' 515 | "splendido embarcadero", "4", "san francisco", "mediterranean", '513' 516 | "stars", "150 redwood alley", "san francisco", "american", '514' 517 | "stars cafe", "500 van ness ave.", "san francisco", "american", 
'515' 518 | "stoyanof's cafe", "1240 9th ave.", "san francisco", "greek and middle eastern", '516' 519 | "straits cafe", "3300 geary blvd.", "san francisco", "asian", '517' 520 | "suppenkuche", "601 hayes st.", "san francisco", "russian/german", '518' 521 | "tadich grill", "240 california st.", "san francisco", "seafood", '519' 522 | "the heights", "3235 sacramento st.", "san francisco", "french", '520' 523 | "thepin", "298 gough st.", "san francisco", "asian", '521' 524 | "ton kiang", "3148 geary blvd.", "san francisco", "asian", '522' 525 | "vertigo", "600 montgomery st.", "san francisco", "mediterranean", '523' 526 | "vivande porta via", "2125 fillmore st.", "san francisco", "italian", '524' 527 | "vivande ristorante", "670 golden gate ave.", "san francisco", "italian", '525' 528 | "world wrapps", "2257 chestnut st.", "san francisco", "american", '526' 529 | "wu kong", "101 spear st.", "san francisco", "asian", '527' 530 | "yank sing", "427 battery st.", "san francisco", "asian", '528' 531 | "yaya cuisine", "1220 9th ave.", "san francisco", "greek and middle eastern", '529' 532 | "yoyo tsumami bistro", "1611 post st.", "san francisco", "french", '530' 533 | "zarzuela", "2000 hyde st.", "san francisco", "mexican/latin american/spanish", '531' 534 | "zuni cafe & grill", "1658 market st.", "san francisco", "mediterranean", '532' 535 | "apple pan the", "10801 w. pico blvd.", "west la", "american", '534' 536 | "asahi ramen", "2027 sawtelle blvd.", "west la", "noodle shops", '535' 537 | "baja fresh", "3345 kimber dr.", "westlake village", "mexican", '536' 538 | "belvedere the", "9882 little santa monica blvd.", "beverly hills", "pacific new wave", '537' 539 | "benita's frites", "1433 third st. promenade", "santa monica", "fast food", '538' 540 | "bernard's", "515 s. olive st.", "los angeles", "continental", '539' 541 | "bistro 45", "45 s. 
mentor ave.", "pasadena", "californian", '540' 542 | "brent's deli", "19565 parthenia ave.", "northridge", "delis", '541' 543 | "brighton coffee shop", "9600 brighton way", "beverly hills", "coffee shops", '542' 544 | "bristol farms market cafe", "1570 rosecrans ave. s.", "pasadena", "californian", '543' 545 | "bruno's", "3838 centinela ave.", "mar vista", "italian", '544' 546 | "cafe '50s", "838 lincoln blvd.", "venice", "american", '545' 547 | "cafe blanc", "9777 little santa monica blvd.", "beverly hills", "pacific new wave", '546' 548 | "cassell's", "3266 w. sixth st.", "la", "hamburgers", '547' 549 | "chez melange", "1716 pch", "redondo beach", "eclectic", '548' 550 | "diaghilev", "1020 n. san vicente blvd.", "w. hollywood", "russian", '549' 551 | "don antonio's", "1136 westwood blvd.", "westwood", "italian", '550' 552 | "duke's", "8909 sunset blvd.", "w. hollywood", "coffee shops", '551' 553 | "falafel king", "1059 broxton ave.", "westwood", "middle eastern", '552' 554 | "feast from the east", "1949 westwood blvd.", "west la", "chinese", '553' 555 | "gumbo pot the", "6333 w. third st.", "la", "cajun/creole", '554' 556 | "hollywood hills coffee shop", "6145 franklin ave.", "hollywood", "coffee shops", '555' 557 | "indo cafe", "10428 1/2 national blvd.", "la", "indonesian", '556' 558 | "jan's family restaurant", "8424 beverly blvd.", "la", "coffee shops", '557' 559 | "jiraffe", "502 santa monica blvd", "santa monica", "californian", '558' 560 | "jody maroni's sausage kingdom", "2011 ocean front walk", "venice", "hot dogs", '559' 561 | "joe's", "1023 abbot kinney blvd.", "venice", "american (new)", '560' 562 | "john o'groats", "10516 w. pico blvd.", "west la", "coffee shops", '561' 563 | "johnnie's pastrami", "4017 s. 
sepulveda blvd.", "culver city", "delis", '562' 564 | "johnny reb's southern smokehouse", "4663 long beach blvd.", "long beach", "southern/soul", '563' 565 | "johnny rockets (la)", "7507 melrose ave.", "la", "american", '564' 566 | "killer shrimp", "4000 colfax ave.", "studio city", "seafood", '565' 567 | "kokomo cafe", "6333 w. third st.", "la", "american", '566' 568 | "koo koo roo", "8393 w. beverly blvd.", "la", "chicken", '567' 569 | "la cachette", "10506 little santa monica blvd.", "century city", "french (new)", '568' 570 | "la salsa (la)", "22800 pch", "malibu", "mexican", '569' 571 | "la serenata de garibaldi", "1842 e. first", "st. boyle hts.", "mexican/tex-mex", '570' 572 | "langer's", "704 s. alvarado st.", "la", "delis", '571' 573 | "local nochol", "30869 thousand oaks blvd.", "westlake village", "health food", '572' 574 | "main course the", "10509 w. pico blvd.", "rancho park", "american", '573' 575 | "mani's bakery & espresso bar", "519 s. fairfax ave.", "la", "desserts", '574' 576 | "martha's", "22nd street grill 25 22nd", "st. hermosa beach", "american", '575' 577 | "maxwell's cafe", "13329 washington blvd.", "marina del rey", "american", '576' 578 | "michael's (los angeles)", "1147 third st.", "santa monica", "californian", '577' 579 | "mishima", "8474 w. third st.", "la", "noodle shops", '578' 580 | "mo better meatty meat", "7261 melrose ave.", "la", "hamburgers", '579' 581 | "mulberry st.", "17040 ventura blvd.", "encino", "pizza", '580' 582 | "ocean park cafe", "3117 ocean park blvd.", "santa monica", "american", '581' 583 | "ocean star", "145 n. atlantic blvd.", "monterey park", "seafood", '582' 584 | "original pantry bakery", "875 s. figueroa st. downtown", "la", "diners", '583' 585 | "parkway grill", "510 s. arroyo pkwy.", "pasadena", "californian", '584' 586 | "pho hoa", "642 broadway", "chinatown", "vietnamese", '585' 587 | "pink's famous chili dogs", "709 n. la brea ave.", "la", "hot dogs", '586' 588 | "poquito mas", "2635 w. 
olive ave.", "burbank", "mexican", '587' 589 | "r-23", "923 e. third st.", "los angeles", "japanese", '588' 590 | "rae's", "2901 pico blvd.", "santa monica", "diners", '589' 591 | "rubin's red hots", "15322 ventura blvd.", "encino", "hot dogs", '590' 592 | "ruby's (la)", "45 s. fair oaks ave.", "pasadena", "diners", '591' 593 | "russell's burgers", "1198 pch", "seal beach", "hamburgers", '592' 594 | "ruth's chris steak house (los angeles)", "224 s. beverly dr.", "beverly hills", "steakhouses", '593' 595 | "shiro", "1505 mission st. s.", "pasadena", "pacific new wave", '594' 596 | "sushi nozawa", "11288 ventura blvd.", "studio city", "japanese", '595' 597 | "sweet lady jane", "8360 melrose ave.", "la", "desserts", '596' 598 | "taiko", "11677 san vicente blvd.", "brentwood", "noodle shops", '597' 599 | "tommy's", "2575 beverly blvd.", "la", "hamburgers", '598' 600 | "uncle bill's pancake house", "1305 highland ave.", "manhattan beach", "diners", '599' 601 | "water grill", "544 s. grand ave.", "los angeles", "seafood", '600' 602 | "zankou chicken", "1415 e. colorado st.", "glendale", "middle eastern", '601' 603 | "afghan kebab house", "764 ninth ave.", "new york city", "afghan", '602' 604 | "arcadia", "21 e. 62nd st.", "new york city", "american (new)", '603' 605 | "benny's burritos", "93 ave. a", "new york city", "mexican", '604' 606 | "cafe con leche", "424 amsterdam ave.", "new york city", "cuban", '605' 607 | "corner bistro", "331 w. fourth st.", "new york city", "hamburgers", '606' 608 | "cucina della fontana", "368 bleecker st.", "new york city", "italian", '607' 609 | "cucina di pesce", "87 e. fourth st.", "new york city", "seafood", '608' 610 | "darbar", "44 w. 56th st.", "new york city", "indian", '609' 611 | "ej's luncheonette", "432 sixth ave.", "new york city", "diners", '610' 612 | "edison cafe", "228 w. 
47th st.", "new york city", "diners", '611' 613 | "elias corner", "24-02 31st st.", "queens", "greek", '612' 614 | "good enough to eat", "483 amsterdam ave.", "new york city", "american", '613' 615 | "gray's papaya", "2090 broadway", "new york city", "hot dogs", '614' 616 | "il mulino", "86 w. third st.", "new york city", "italian", '615' 617 | "jackson diner", "37-03 74th st.", "queens", "indian", '616' 618 | "joe's shanghai", "9 pell st.", "queens", "chinese", '617' 619 | "john's pizzeria", "48 w. 65th st.", "new york city", "pizza", '618' 620 | "kelley & ping", "127 greene st.", "new york city", "pan-asian", '619' 621 | "kiev", "117 second ave.", "new york city", "ukrainian", '620' 622 | "kuruma zushi", "2nd fl.", "new york city", "japanese", '621' 623 | "la caridad", "2199 broadway", "new york city", "cuban", '622' 624 | "la grenouille", "3 e. 52nd st.", "new york city", "french (classic)", '623' 625 | "lemongrass grill", "61a seventh ave.", "brooklyn", "thai", '624' 626 | "lombardi's", "32 spring st.", "new york city", "pizza", '625' 627 | "marnie's noodle shop", "466 hudson st.", "new york city", "asian", '626' 628 | "menchanko-tei", "39 w. 55th st.", "new york city", "japanese", '627' 629 | "mitali east-west", "296 bleecker st.", "new york city", "indian", '628' 630 | "monsoon (ny)", "435 amsterdam ave.", "new york city", "thai", '629' 631 | "moustache", "405 atlantic ave.", "brooklyn", "middle eastern", '630' 632 | "nobu", "105 hudson st.", "new york city", "japanese", '631' 633 | "one if by land tibs", "17 barrow st.", "new york city", "continental", '632' 634 | "oyster bar", "lower level", "new york city", "seafood", '633' 635 | "palm", "837 second ave.", "new york city", "steakhouses", '634' 636 | "palm too", "840 second ave.", "new york city", "steakhouses", '635' 637 | "patsy's pizza", "19 old fulton st.", "brooklyn", "pizza", '636' 638 | "peter luger steak house", "178 broadway", "brooklyn", "steakhouses", '637' 639 | "rose of india", "308 e. 
sixth st.", "new york city", "indian", '638' 640 | "sam's noodle shop", "411 third ave.", "new york city", "chinese", '639' 641 | "sarabeth's", "1295 madison ave.", "new york city", "american", '640' 642 | "sparks steak house", "210 e. 46th st.", "new york city", "steakhouses", '641' 643 | "stick to your ribs", "5-16 51st ave.", "queens", "bbq", '642' 644 | "sushisay", "38 e. 51st st.", "new york city", "japanese", '643' 645 | "sylvia's", "328 lenox ave.", "new york city", "southern/soul", '644' 646 | "szechuan hunan cottage", "1588 york ave.", "new york city", "chinese", '645' 647 | "szechuan kitchen", "1460 first ave.", "new york city", "chinese", '646' 648 | "teresa's", "80 montague st.", "queens", "polish", '647' 649 | "thai house cafe", "151 hudson st.", "new york city", "thai", '648' 650 | "thailand restaurant", "106 bayard st.", "new york city", "thai", '649' 651 | "veselka", "144 second ave.", "new york city", "ukrainian", '650' 652 | "westside cottage", "689 ninth ave.", "new york city", "chinese", '651' 653 | "windows on the world", "107th fl.", "new york city", "eclectic", '652' 654 | "wollensky's grill", "205 e. 49th st.", "new york city", "steakhouses", '653' 655 | "yama", "122 e. 17th st.", "new york city", "japanese", '654' 656 | "zarela", "953 second ave.", "new york city", "mexican", '655' 657 | "andre's french restaurant", "401 s. 6th st.", "las vegas", "french (classic)", '656' 658 | "buccaneer bay club", "3300 las vegas blvd. s.", "las vegas", "continental", '657' 659 | "buzio's in the rio", "3700 w. flamingo rd.", "las vegas", "seafood", '658' 660 | "emeril's new orleans fish house", "3799 las vegas blvd. s.", "las vegas", "seafood", '659' 661 | "fiore rotisserie & grille", "3700 w. flamingo rd.", "las vegas", "italian", '660' 662 | "hugo's cellar", "202 e. fremont st.", "las vegas", "continental", '661' 663 | "madame ching's", "3300 las vegas blvd. s.", "las vegas", "asian", '662' 664 | "mayflower cuisinier", "4750 w. 
sahara ave.", "las vegas", "chinese", '663' 665 | "michael's (las vegas)", "3595 las vegas blvd. s.", "las vegas", "continental", '664' 666 | "monte carlo", "3145 las vegas blvd. s.", "las vegas", "french (new)", '665' 667 | "moongate", "3400 las vegas blvd. s.", "las vegas", "chinese", '666' 668 | "morton's of chicago (las vegas)", "3200 las vegas blvd. s.", "las vegas", "steakhouses", '667' 669 | "nicky blair's", "3925 paradise rd.", "las vegas", "italian", '668' 670 | "piero's restaurant", "355 convention center dr.", "las vegas", "italian", '669' 671 | "spago (las vegas)", "3500 las vegas blvd. s.", "las vegas", "californian", '670' 672 | "steakhouse the", "128 e. fremont st.", "las vegas", "steakhouses", '671' 673 | "stefano's", "129 fremont st.", "las vegas", "italian", '672' 674 | "sterling brunch", "3645 las vegas blvd. s.", "las vegas", "eclectic", '673' 675 | "tre visi", "3799 las vegas blvd. s.", "las vegas", "italian", '674' 676 | "103 west", "103 w. paces ferry rd.", "atlanta", "continental", '675' 677 | "alon's at the terrace", "659 peachtree st.", "atlanta", "sandwiches", '676' 678 | "baker's cajun cafe", "1134 euclid ave.", "atlanta", "cajun/creole", '677' 679 | "barbecue kitchen", "1437 virginia ave.", "atlanta", "bbq", '678' 680 | "bistro the", "56 e. andrews dr. nw", "atlanta", "french bistro", '679' 681 | "bobby & june's kountry kitchen", "375 14th st.", "atlanta", "southern/soul", '680' 682 | "bradshaw's restaurant", "2911 s. pharr court", "atlanta", "southern/soul", '681' 683 | "brookhaven cafe", "4274 peachtree rd.", "atlanta", "vegetarian", '682' 684 | "cafe sunflower", "5975 roswell rd.", "atlanta", "health food", '683' 685 | "canoe", "4199 paces ferry rd.", "atlanta", "american (new)", '684' 686 | "carey's", "1021 cobb pkwy. se", "marietta", "hamburgers", '685' 687 | "carey's corner", "1215 powers ferry rd.", "marietta", "hamburgers", '686' 688 | "chops", "70 w. 
paces ferry rd.", "atlanta", "steakhouses", '687' 689 | "chopstix", "4279 roswell rd.", "atlanta", "chinese", '688' 690 | "deacon burton's soulfood restaurant", "1029 edgewood ave. se", "atlanta", "southern/soul", '689' 691 | "eats", "600 ponce de leon ave.", "atlanta", "italian", '690' 692 | "flying biscuit the", "1655 mclendon ave.", "atlanta", "eclectic", '691' 693 | "frijoleros", "1031 peachtree st. ne", "atlanta", "tex-mex", '692' 694 | "greenwood's", "1087 green st.", "roswell", "southern/soul", '693' 695 | "harold's barbecue", "171 mcdonough blvd.", "atlanta", "bbq", '694' 696 | "havana sandwich shop", "2905 buford hwy.", "atlanta", "cuban", '695' 697 | "house of chan", "2469 cobb pkwy.", "smyrna", "chinese", '696' 698 | "indian delights", "3675 satellite blvd.", "duluth", "indian", '697' 699 | "java jive", "790 ponce de leon ave.", "atlanta", "coffee shops", '698' 700 | "johnny rockets (at)", "2970 cobb pkwy.", "atlanta", "american", '699' 701 | "kalo's coffee house", "1248 clairmont rd.", "decatur", "coffeehouses", '700' 702 | "la fonda latina", "4427 roswell rd.", "atlanta", "spanish", '701' 703 | "lettuce souprise you (at)", "3525 mall blvd.", "duluth", "cafeterias", '702' 704 | "majestic", "1031 ponce de leon ave.", "atlanta", "diners", '703' 705 | "morton's of chicago (atlanta)", "303 peachtree st. ne", "atlanta", "steakhouses", '704' 706 | "my thai", "1248 clairmont rd.", "atlanta", "thai", '705' 707 | "nava", "3060 peachtree rd.", "atlanta", "southwestern", '706' 708 | "nuevo laredo cantina", "1495 chattahoochee ave. nw", "atlanta", "mexican", '707' 709 | "original pancake house (at)", "4330 peachtree rd.", "atlanta", "american", '708' 710 | "palm the (atlanta)", "3391 peachtree rd. ne", "atlanta", "steakhouses", '709' 711 | "rainbow restaurant", "2118 n. decatur rd.", "decatur", "vegetarian", '710' 712 | "ritz-carlton cafe (atlanta)", "181 peachtree st.", "atlanta", "american (new)", '711' 713 | "riviera", "519 e. 
paces ferry rd.", "atlanta", "mediterranean", '712' 714 | "silver skillet the", "200 14th st. nw", "atlanta", "coffee shops", '713' 715 | "soto", "3330 piedmont rd.", "atlanta", "japanese", '714' 716 | "thelma's kitchen", "764 marietta st. nw", "atlanta", "cafeterias", '715' 717 | "tortillas", "774 ponce de leon ave. ne", "atlanta", "tex-mex", '716' 718 | "van gogh's restaurant & bar", "70 w. crossville rd.", "roswell", "american (new)", '717' 719 | "veggieland", "220 sandy springs circle", "atlanta", "vegetarian", '718' 720 | "white house restaurant", "3172 peachtree rd. ne", "atlanta", "diners", '719' 721 | "zab-e-lee", "4837 old national hwy.", "college park", "thai", '720' 722 | "bill's place", "2315 clement st.", "san francisco", "hamburgers", '721' 723 | "cafe flore", "2298 market st.", "san francisco", "californian", '722' 724 | "caffe greco", "423 columbus ave.", "san francisco", "continental", '723' 725 | "campo santo", "240 columbus ave.", "san francisco", "mexican", '724' 726 | "cha cha cha's", "1805 haight st.", "san francisco", "caribbean", '725' 727 | "doidge's", "2217 union st.", "san francisco", "american", '726' 728 | "dottie's true blue cafe", "522 jones st.", "san francisco", "diners", '727' 729 | "dusit thai", "3221 mission st.", "san francisco", "thai", '728' 730 | "ebisu", "1283 ninth ave.", "san francisco", "japanese", '729' 731 | "emerald garden restaurant", "1550 california st.", "san francisco", "vietnamese", '730' 732 | "eric's chinese restaurant", "1500 church st.", "san francisco", "chinese", '731' 733 | "hamburger mary's", "1582 folsom st.", "san francisco", "hamburgers", '732' 734 | "kelly's on trinity", "333 bush st.", "san francisco", "californian", '733' 735 | "la cumbre", "515 valencia st.", "san francisco", "mexican", '734' 736 | "la mediterranee", "288 noe st.", "san francisco", "mediterranean", '735' 737 | "la taqueria", "2889 mission st.", "san francisco", "mexican", '736' 738 | "mario's bohemian cigar store cafe", "2209 polk 
st.", "san francisco", "italian", '737' 739 | "marnee thai", "2225 irving st.", "san francisco", "thai", '738' 740 | "mel's drive-in", "3355 geary st.", "san francisco", "hamburgers", '739' 741 | "mo's burgers", "1322 grant st.", "san francisco", "hamburgers", '740' 742 | "phnom penh cambodian restaurant", "631 larkin st.", "san francisco", "cambodian", '741' 743 | "roosevelt tamale parlor", "2817 24th st.", "san francisco", "mexican", '742' 744 | "sally's cafe & bakery", "300 de haro st.", "san francisco", "american", '743' 745 | "san francisco bbq", "1328 18th st.", "san francisco", "thai", '744' 746 | "slanted door", "584 valencia st.", "san francisco", "vietnamese", '745' 747 | "swan oyster depot", "1517 polk st.", "san francisco", "seafood", '746' 748 | "thep phanom", "400 waller st.", "san francisco", "thai", '747' 749 | "ti couz", "3108 16th st.", "san francisco", "french", '748' 750 | "trio cafe", "1870 fillmore st.", "san francisco", "american", '749' 751 | "tu lan", "8 sixth st.", "san francisco", "vietnamese", '750' 752 | "vicolo pizzeria", "201 ivy st.", "san francisco", "pizza", '751' 753 | "wa-ha-ka oaxaca mexican grill", "2141 polk st.", "san francisco", "mexican", '752' 754 | -------------------------------------------------------------------------------- /requirements-test.txt: -------------------------------------------------------------------------------- 1 | pexpect 2 | nose 3 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [bdist_wheel] 2 | universal = 1 3 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | import sys 3 | 4 | requirements = ['future>=0.14', 5 | 'dedupe>=1.6,<2'] 6 | 7 | if sys.version_info < (3,): 8 | requirements += ['backports.csv'] 9 | 10 | from io import open 11
| from os import path 12 | this_directory = path.abspath(path.dirname(__file__)) 13 | with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f: 14 | long_description = f.read() 15 | 16 | setup( 17 | name = "csvdedupe", 18 | version = '0.1.20', 19 | description="Command line tools for deduplicating and merging csv files", 20 | author="Forest Gregg, Derek Eder", 21 | license="MIT", 22 | packages=['csvdedupe'], 23 | entry_points ={ 24 | 'console_scripts': [ 25 | 'csvdedupe = csvdedupe.csvdedupe:launch_new_instance', 26 | 'csvlink = csvdedupe.csvlink:launch_new_instance' 27 | ] 28 | }, 29 | install_requires = requirements, 30 | long_description=long_description, 31 | long_description_content_type='text/markdown', 32 | ) 33 | -------------------------------------------------------------------------------- /tests/test_command_line.py: -------------------------------------------------------------------------------- 1 | import unittest 2 | import argparse 3 | import pexpect 4 | import sys 5 | 6 | class TestCSVDedupe(unittest.TestCase) : 7 | 8 | def test_no_parameters(self): 9 | child = pexpect.spawn('csvdedupe') 10 | child.expect('error: No input file or STDIN specified.') 11 | 12 | def test_input_file_and_field_names(self) : 13 | child = pexpect.spawn('csvdedupe foo --field_names bar') 14 | child.expect('error: Could not find the file foo') 15 | 16 | def test_config_file(self) : 17 | child = pexpect.spawn('csvdedupe examples/csv_example_messy_input.csv --config_file foo.json') 18 | child.expect('error: Could not find config file foo.json') 19 | 20 | def test_incorrect_fields(self) : 21 | child = pexpect.spawn('csvdedupe examples/csv_example_messy_input.csv --field_names "Site name" Address Zip foo') 22 | child.expect("error: Could not find field 'foo'") 23 | 24 | def test_no_training(self) : 25 | child = pexpect.spawn('csvdedupe examples/csv_example_messy_input.csv --field_names "Site name" Address Zip Phone --training_file foo.json --skip_training') 26 | 
child.expect("error: You need to provide an existing training_file or run this script without --skip_training") 27 | 28 | class TestCSVLink(unittest.TestCase) : 29 | 30 | def test_no_parameters(self): 31 | child = pexpect.spawn('csvlink') 32 | if sys.version_info < (3,) : 33 | child.expect('error: too few arguments') 34 | else : 35 | child.expect('error: the following arguments are required: input') 36 | 37 | def test_one_argument(self): 38 | child = pexpect.spawn('csvlink examples/restaurant-1.csv') 39 | child.expect('error: You must provide two input files.') 40 | 41 | def test_input_file_1_and_field_names(self) : 42 | child = pexpect.spawn('csvlink foo1 examples/restaurant-1.csv --field_names foo') 43 | child.expect('error: Could not find the file foo1') 44 | 45 | def test_input_file_2_and_field_names(self) : 46 | child = pexpect.spawn('csvlink examples/restaurant-1.csv foo2 --field_names foo') 47 | child.expect('error: Could not find the file foo2') 48 | 49 | def test_incorrect_fields(self) : 50 | child = pexpect.spawn('csvlink examples/restaurant-1.csv examples/restaurant-2.csv --field_names_1 name address city foo --field_names_2 name address city cuisine' ) 51 | child.expect("error: Could not find field 'foo'") 52 | --------------------------------------------------------------------------------