├── README.md
├── csv2parquet
├── csv2parquet.py
└── test
    ├── test-header-mapping.csv
    ├── test-simple.csv
    └── test_csv2parquet.py

/README.md:
--------------------------------------------------------------------------------
# csv2parquet: Create Parquet files from CSV

This simple tool creates Parquet files from CSV input, using a minimal
installation of [Apache Drill](https://drill.apache.org). As a data
format, [Parquet](https://parquet.apache.org) offers strong advantages
over comma-separated values for big data and cloud computing needs;
`csv2parquet` is designed to let you experience those benefits more
easily.

Much credit for this goes to Tugdual "Tug" Grall; `csv2parquet`
essentially automates the process he documents in [Convert a CSV File
to Apache Parquet With
Drill](http://tgrall.github.io/blog/2015/08/17/convert-csv-file-to-apache-parquet-dot-dot-dot-with-drill/).

`csv2parquet` is now in **public beta**. Feedback, comments, bug
reports, and feature requests are all appreciated. See "About and
Contact" below to reach the author.

# Usage

```
csv2parquet CSV_INPUT PARQUET_OUTPUT [--column-map ...] [--types ...]
```

`CSV_INPUT` is a CSV file whose first line defines the column names.
`PARQUET_OUTPUT` is the Parquet output (i.e., a directory in which one
or more Parquet files are written). Note that `csv2parquet` currently
requires the input CSV's first line to define header/column names.

## Customizing Column Names

By default, each Parquet column gets the same name as the
corresponding CSV header. You can specify a different name for each
output column with the `--column-map` option. When used, it must be
followed by an even number of strings, i.e. a sequence of pairs. In
each pair, the first string is the CSV file column name, and the
second is the Parquet column name to use instead:

```
csv2parquet data.csv data.parquet --column-map "First Column" "Primary Column" "Another Column" "Special Name"
```

In this example, two of the CSV columns are named "First Column" and
"Another Column". The created Parquet file will store data from these
columns under "Primary Column" and "Special Name", respectively.

(A perfectly good CSV column name may not be valid as a Parquet column
name - for example, a header name with a period, like
"Min. Investment". In this situation, you *must* use `--column-map`
to provide a column name that Parquet can accept, or edit the source
CSV file.)

## Column Types

By default, `csv2parquet` assumes all columns are of type string, but
you can declare specific columns to be any Drill data type. You do
this using the `--types` option, whose syntax is similar to
`--column-map`. On the command line, you write `--types`, followed by
an even number of strings that encode a sequence of pairs. In each
pair, the first string matches the name of the CSV column (*not* the
Parquet column name, if that is different). The second string is one
of the [Drill data
types](https://drill.apache.org/docs/supported-data-types/), such as
"INT", "FLOAT", "DATE", and so on. For example:

```
csv2parquet data.csv data.parquet --types "First Column" "INT" "Another Column" "FLOAT"
```

Note you can pass both `--types` and `--column-map` to
`csv2parquet` at once:

```
# On one long line:
csv2parquet data.csv data.parquet --column-map "First Column" "Primary Column" "Another Column" "Special Name" --types "First Column" "INT" "Another Column" "FLOAT"

# Split across lines, for readability:
csv2parquet data.csv data.parquet \
    --column-map "First Column" "Primary Column" "Another Column" "Special Name" \
    --types "First Column" "INT" "Another Column" "FLOAT"
```
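
Under the hood, these options become a Drill SQL script (see
`render_drill_script` in the source). As a rough sketch - assuming
"First Column" and "Another Column" are the first two CSV columns, and
with illustrative input and output paths - the combined example above
renders as:

```
alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`/parquet_tmp_output` AS
SELECT
CASE when columns[0]='First Column' then CAST(NULL AS INT) else CAST(columns[0] as INT) end as `Primary Column`,
CASE when columns[1]='Another Column' then CAST(NULL AS FLOAT) else CAST(columns[1] as FLOAT) end as `Special Name`
FROM dfs.`/absolute/path/to/data.csv`
OFFSET 1
```

The `CASE` guard exists because Drill applies casts even to the header
row that `OFFSET 1` skips; see the comments in the source for details.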
## Troubleshooting

If you encounter a bug, run again with the `--debug` option, and note
the directory name printed at startup. Many files, logs, and other
info useful for troubleshooting are stored in a temporary folder;
`--debug` prevents that folder from being deleted after the program
completes. See in particular `script`, `script_stderr` and
`script_stdout` in that folder. To report bugs, see "About and
Contact" below.

# Installation

Your system must have:

* Python 3 (version 3.5 or later).
* A quick-and-easy installation of [Apache Drill](https://drill.apache.org), version 1.4 or 1.5 - see below.

There are no other dependencies. You can simply copy the `csv2parquet` script wherever you'd like, and run it.

If you do not currently have Drill installed, simply
[download the tarball](https://drill.apache.org/download/), uncompress
it, and add its `bin` directory to your `$PATH`. No additional setup is
needed. (`csv2parquet` just uses the `drill-embedded` executable.)
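
For example - the version number and install location below are
illustrative; adjust them to match what you downloaded:

```
tar xzf apache-drill-1.5.0.tar.gz -C "$HOME"
export PATH="$HOME/apache-drill-1.5.0/bin:$PATH"
```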

Currently, `csv2parquet` runs on OS X and Linux. It has not been tested
on Windows, though Windows support is intended, and I appreciate
comments, pull requests, etc. to support Windows users.

Regarding Python versions: note that Python 3 safely installs
alongside Python 2 with no conflict - even the executables are named
differently ("python" for 2.7, and "python3" for 3.x). So you can
[simply install it](https://www.python.org/downloads/) to run
`csv2parquet` today on any system you control.

# Future Work

In order of priority:

* Adding certain important features, including:
  - delimiters other than commas
  - CSV files without header lines
* Running `csv2parquet` on Windows

# About and Contact

Written by [Aaron Maxwell](http://redsymbol.net). Contact him at amax@redsymbol.net.

Licensed under GPLv3.

For bug reports, please run with the `--debug` option (see
"Troubleshooting" above), and email the `script`, `script_stderr` and
`script_stdout` files to the author, along with a description of what
happened, and a CSV file that will reproduce the error.
--------------------------------------------------------------------------------
/csv2parquet:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import argparse
import atexit
import csv
import os
import shutil
import string
import subprocess
import sys
import tempfile

HELP='''
csv_input is a CSV file, whose first line defines the column names.
parquet_output is the Parquet output (i.e., a directory in which one
or more Parquet files are written).

When used, --column-map must be followed by an even number of
strings, constituting the key-value pairs mapping CSV to Parquet
column names:
  csv2parquet data.csv data.parquet --column-map "CSV Column Name" "Parquet Column Name"

To provide types for columns, use --types:
  csv2parquet data.csv data.parquet --types "CSV Column Name" "INT"

See documentation (README.md in the source repo) for more information.
'''.strip()

# True iff we are to preserve temporary files.
# Globally set to true in debug mode.
global preserve
preserve = False

DRILL_OVERRIDE_TEMPLATE = '''
drill.exec: {
  sys.store.provider: {
    # The following section is only required by LocalPStoreProvider
    local: {
      path: "${local_base}",
      write: true
    }
  },
  tmp: {
    directories: ["${local_base}"],
    filesystem: "drill-local:///"
  },
  sort: {
    purge.threshold : 100,
    external: {
      batch.size : 4000,
      spill: {
        batch.size : 4000,
        group.size : 100,
        threshold : 200,
        directories : [ "${local_base}/spill" ],
        fs : "file:///"
      }
    }
  },
}
'''

# exceptions
class CsvSourceError(Exception):
    def __init__(self, message):
        super().__init__(message)
        self.message = message

class DrillScriptError(CsvSourceError):
    def __init__(self, returncode):
        super().__init__(returncode)
        self.returncode = returncode

class InvalidColumnNames(CsvSourceError):
    pass

# classes
class Column:
    def __init__(self, csv, parquet, type):
        self.csv = csv
        self.parquet = parquet
        self.type = type
    def __eq__(self, other):
        return \
            self.csv == other.csv and \
            self.parquet == other.parquet and \
            self.type == other.type
    def line(self, index):
        if self.type is None:
            return 'columns[{}] as `{}`'.format(index, self.parquet)
        # In Drill, if a SELECT query has both an OFFSET and a CAST,
        # Drill will apply that cast even to rows that are
        # skipped. For a headerless CSV file, we could just use
        # something like:
        #
        #   CAST(columns[{index}] as {type}) as `{parquet_name}`
        #
        # But if the header line is present, this causes the entire
        # conversion to fail, because Drill attempts to cast the
        # header (e.g., "Price") to the type (e.g., INT), triggering a
        # fatal error. So instead we must do:
        #
        #   CASE when columns[{index}]='{csv_name}' then CAST(NULL AS {type}) else ...
        #
        # I really don't like this, because it makes it possible for
        # data corruption to hide. If a cell should contain a number,
        # but instead contains a non-numeric string, that should be a
        # loud, noisy error which is impossible to ignore. However, if
        # that happens here, and you are so unlucky that the corrupted
        # value happens to equal the CSV column name, then it is
        # silently nulled out. This is admittedly very unlikely, but
        # that's not the same as impossible. If you are reading this
        # and have an idea for a better solution, please contact the
        # author (see README.md).
        return "CASE when columns[{index}]='{csv_name}' then CAST(NULL AS {type}) else CAST(columns[{index}] as {type}) end as `{parquet_name}`".format(
            index=index, type=self.type, parquet_name=self.parquet, csv_name=self.csv)
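
# Illustrative renderings of Column.line, for orientation (the typed
# form is emitted on a single line; wrapped here for readability):
#
#   Column('Open', 'Open', None).line(1)
#     -> columns[1] as `Open`
#   Column('Price', 'Price', 'INT').line(3)
#     -> CASE when columns[3]='Price' then CAST(NULL AS INT)
#        else CAST(columns[3] as INT) end as `Price`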

class Columns:
    def __init__(self, csv_columns: list, name_map: dict, type_map: dict):
        self.csv_columns = csv_columns
        self.name_map = name_map
        self.type_map = type_map

        self.items = []
        invalid_names = []
        for csv_name in self.csv_columns:
            parquet_name = self.name_map.get(csv_name, csv_name)
            type_name = self.type_map.get(csv_name, None)
            if not is_valid_parquet_column_name(parquet_name):
                invalid_names.append(parquet_name)
            self.items.append(Column(csv_name, parquet_name, type_name))
        if len(invalid_names) > 0:
            raise InvalidColumnNames(invalid_names)

    def __iter__(self):
        return iter(self.items)

class CsvSource:
    def __init__(self, path: str, name_map: dict = None, type_map: dict = None):
        if name_map is None:
            name_map = {}
        if type_map is None:
            type_map = {}
        self.path = os.path.realpath(path)
        self.headers = self._init_headers()
        self.columns = Columns(self.headers, name_map, type_map)
    def _init_headers(self):
        with open(self.path, newline='') as handle:
            csv_data = csv.reader(handle)
            return next(csv_data)

class TempLocation:
    _tempdir = None
    def __init__(self):
        drive, path = os.path.splitdrive(self.tempdir)
        assert drive == '', 'Windows support not provided yet'
        assert path.startswith('/tmp/'), self.tempdir
        self.dfs_tmp_base = path[len('/tmp'):]
    def dfs_tmp_path(self, path: str):
        return os.path.join(self.dfs_tmp_base, path)
    def full_path(self, path: str):
        return os.path.join(self.tempdir, path)
    @property
    def tempdir(self):
        if self._tempdir is None:
            self._tempdir = tempfile.mkdtemp(prefix='/tmp/')
            if preserve:
                print('Preserving logs and intermediate files: ' + self._tempdir)
            else:
                atexit.register(shutil.rmtree, self._tempdir)
        return self._tempdir
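
# Illustrative TempLocation path mapping: if tempdir is '/tmp/abc123',
#   dfs_tmp_path('parquet_tmp_output') -> '/abc123/parquet_tmp_output'
#   full_path('parquet_tmp_output')    -> '/tmp/abc123/parquet_tmp_output'
# The dfs_tmp_path form is relative to Drill's dfs.tmp workspace
# (which maps to /tmp by default); full_path is the same location on
# disk, as seen by this script.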

class DrillInstallation:
    '''Create a temporary, custom Drill installation

    Even in embedded mode, Drill runs in a stateful fashion, storing
    its state in (by default) /tmp/drill. This poses a few problems
    for running csv2parquet in ways that are (a) robust, and (b)
    certain not to affect any other Drill users, human or machine.

    This would be an easy problem to solve if I could create a custom
    conf/drill-override.conf, and pass it to drill-embedded as a
    command line option. However, as of Drill 1.4, the only way to do
    such customization is to modify the actual drill-override.conf
    file itself, before starting drill-embedded.

    This class gets around this through an admittedly unorthodox hack:
    creating a whole parallel Drill installation under a temporary
    directory, with a customized conf/ directory, but reusing (via
    symlinks) everything else in the main installation. This lets us
    safely construct our own drill configuration, using its own
    separate equivalent of /tmp/drill (etc.), and which is all cleaned
    up after the script exits.

    (Feature request to anyone on the Drill team reading this: Please
    give drill-embedded a --with-override-file option!)

    '''
    def __init__(self, reference_executable: str = None):
        self.location = TempLocation()
        if reference_executable is None:
            reference_executable = shutil.which('drill-embedded')
            assert reference_executable is not None
        self.reference_executable = reference_executable
        self.reference_base, self.bindir = os.path.split(os.path.dirname(reference_executable))
        self.install()
    @property
    def base(self):
        return os.path.join(self.location.tempdir, 'drill')
    @property
    def local_base(self):
        return os.path.join(self.location.tempdir, 'drill-local-base')
    @property
    def executable(self):
        return os.path.join(self.base, self.bindir, 'drill-embedded')
    def install(self):
        # create required subdirs
        for dirname in (self.base, self.local_base):
            os.makedirs(dirname)
        # link to reference
        for item in os.scandir(self.reference_base):
            if item.name == 'conf':
                assert item.is_dir(), os.path.realpath(item)
                continue
            os.symlink(item.path, os.path.join(self.base, item.name))
        # install config
        conf_dir = os.path.join(self.base, 'conf')
        os.makedirs(conf_dir)
        with open(os.path.join(conf_dir, 'drill-override.conf'), 'w') as handle:
            handle.write(string.Template(DRILL_OVERRIDE_TEMPLATE).substitute(
                local_base=self.local_base))

    def build_script(self, csv_source: CsvSource, parquet_output: str):
        return DrillScript(self, csv_source, parquet_output)

class DrillScript:
    def __init__(self, drill: DrillInstallation, csv_source: CsvSource, parquet_output: str):
        self.drill = drill
        self.csv_source = csv_source
        self.parquet_output = parquet_output
    def render(self):
        return render_drill_script(
            self.csv_source.columns,
            self.drill.location.dfs_tmp_path('parquet_tmp_output'),
            self.csv_source.path,
        )
    def run(self):
        # execute drill script
        script_path = os.path.join(self.drill.location.tempdir, 'script')
        script_stdout = os.path.join(self.drill.location.tempdir, 'script_stdout')
        script_stderr = os.path.join(self.drill.location.tempdir, 'script_stderr')
        cmd = [
            self.drill.executable,
            '--run={}'.format(script_path),
        ]
        with open(script_path, 'w') as handle:
            handle.write(self.render())
        with open(script_stdout, 'w') as stdout, open(script_stderr, 'w') as stderr:
            proc = subprocess.Popen(cmd, stdout=stdout, stderr=stderr)
            proc.wait()
        if proc.returncode != 0:
            raise DrillScriptError(proc.returncode)

        # publish resulting output parquet file
        os.rename(self.drill.location.full_path('parquet_tmp_output'), self.parquet_output)

# helper functions
def get_args():
    parser = argparse.ArgumentParser(
        description='Create Parquet files from CSV input, using Apache Drill.',
        epilog=HELP,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument('csv_input',
                        help='Path to input CSV file')
    parser.add_argument('parquet_output',
                        help='Path to Parquet output')
    parser.add_argument('--debug', default=False, action='store_true',
                        help='Preserve intermediate files and logs')
    parser.add_argument('--column-map', nargs='*',
                        help='Map CSV header names to Parquet column names')
    parser.add_argument('--types', nargs='*',
                        help='Map CSV header names to Parquet types')
    args = parser.parse_args()
    try:
        args.column_map = list2dict(args.column_map)
    except ValueError:
        parser.error('--column-map requires an even number of arguments, as key-value pairs')
    try:
        args.types = list2dict(args.types)
    except ValueError:
        parser.error('--types requires an even number of arguments, as key-value pairs')
    return args

def list2dict(items):
    '''convert [a, b, c, d] to {a:b, c:d}'''
    if items is None:
        return {}
    if len(items) % 2 != 0:
        raise ValueError
    return dict( (items[n], items[n+1])
                 for n in range(0, len(items)-1, 2) )
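
# A Parquet column name may not contain a period: headers like
# "Adj. Close" must be remapped via --column-map before conversion
# (see "Customizing Column Names" in README.md).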
def is_valid_parquet_column_name(val):
    return '.' not in val

def render_drill_script(columns: Columns, parquet_output: str, csv_input: str):
    script = '''alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`{}` AS
SELECT
'''.format(parquet_output)
    column_lines = [column.line(n) for n, column in enumerate(columns)]
    script += ',\n'.join(column_lines) + '\n'
    script += 'FROM dfs.`{}`\n'.format(csv_input)
    script += 'OFFSET 1\n'
    return script
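
# Illustrative programmatic use (mirrors test_csv2parquet.py; the file
# and output names here are hypothetical):
#   import csv2parquet
#   src = csv2parquet.CsvSource('data.csv', {'Adj. Close': 'Adj Close'})
#   print(csv2parquet.render_drill_script(src.columns, '/out', src.path))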

if __name__ == "__main__":
    args = get_args()
    if args.debug:
        preserve = True
    # Quick pre-check whether destination exists, so user doesn't have
    # to wait long before we abort with a write error. There's a race
    # condition because it can still be created between now and when
    # we eventually try to write it, but this will catch the common case.
    if os.path.exists(args.parquet_output):
        sys.stderr.write('Output location "{}" already exists. Rename or delete before running again.\n'.format(args.parquet_output))
        sys.exit(1)
    csv_source = CsvSource(args.csv_input, args.column_map, args.types)
    drill = DrillInstallation()
    drill_script = drill.build_script(csv_source, args.parquet_output)
    try:
        drill_script.run()
    except DrillScriptError as err:
        sys.stderr.write('''FATAL: Drill script failed with error code {}. To troubleshoot, run
with --debug and inspect files script, script_stderr and script_stdout.
'''.format(err.returncode))
        sys.exit(2)
--------------------------------------------------------------------------------
/csv2parquet.py:
--------------------------------------------------------------------------------
csv2parquet
--------------------------------------------------------------------------------
/test/test-header-mapping.csv:
--------------------------------------------------------------------------------
Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
2016-02-12,38.06,39.3,37.71,39.25,4473052.0,0.0,1.0,38.06,39.3,37.71,39.25,4473052.0
2016-02-11,38.32,39.17,37.7301,37.9,5352149.0,0.0,1.0,38.32,39.17,37.7301,37.9,5352149.0
2016-02-10,39.8,39.97,38.72,38.81,4962605.0,0.0,1.0,39.8,39.97,38.72,38.81,4962605.0
2016-02-09,39.48,39.89,38.79,39.51,4081861.0,0.0,1.0,39.48,39.89,38.79,39.51,4081861.0
2016-02-08,40.0,40.43,39.02,39.65,5638394.0,0.0,1.0,40.0,40.43,39.02,39.65,5638394.0
--------------------------------------------------------------------------------
/test/test-simple.csv:
--------------------------------------------------------------------------------
Date,Open,High,Low,Close,Volume,ExDividend,SplitRatio,AdjOpen,AdjHigh,AdjLow,AdjClose,AdjVolume
2016-02-12,38.06,39.3,37.71,39.25,4473052.0,0.0,1.0,38.06,39.3,37.71,39.25,4473052.0
2016-02-11,38.32,39.17,37.7301,37.9,5352149.0,0.0,1.0,38.32,39.17,37.7301,37.9,5352149.0
2016-02-10,39.8,39.97,38.72,38.81,4962605.0,0.0,1.0,39.8,39.97,38.72,38.81,4962605.0
2016-02-09,39.48,39.89,38.79,39.51,4081861.0,0.0,1.0,39.48,39.89,38.79,39.51,4081861.0
2016-02-08,40.0,40.43,39.02,39.65,5638394.0,0.0,1.0,40.0,40.43,39.02,39.65,5638394.0
--------------------------------------------------------------------------------
/test/test_csv2parquet.py:
--------------------------------------------------------------------------------
import unittest
import os

import csv2parquet
from csv2parquet import Columns, Column

THIS_DIR = os.path.dirname(__file__)
TEST_CSV = os.path.join(THIS_DIR, 'test-simple.csv')
TEST_CSV_MAP = os.path.join(THIS_DIR, 'test-header-mapping.csv')
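
# Fixture notes: test-simple.csv has Parquet-safe headers, while
# test-header-mapping.csv has headers containing periods (e.g.
# "Adj. Close") that must be remapped before conversion can succeed.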

class TestUtil(unittest.TestCase):
    def test_list2dict(self):
        from csv2parquet import list2dict
        with self.assertRaises(ValueError):
            list2dict(["foo"])
        with self.assertRaises(ValueError):
            list2dict(["foo", "bar", "baz"])
        self.assertEqual({}, list2dict([]))
        self.assertEqual({}, list2dict(None))
        self.assertEqual({"a":"b"}, list2dict(["a", "b"]))
        self.assertEqual({"a":"b", "x":"y"}, list2dict(["a", "b", "x", "y"]))
        self.assertEqual({"a":"b", "x":"y"}, list2dict(["x", "y", "a", "b"]))

class TestCsvSource(unittest.TestCase):
    def test_real_path_to_prevent_drill_script_errors(self):
        # Specifying a CSV file path of something like "../something.csv" will confuse Drill.
        # Prevent this by expanding the path.
        csv_src = csv2parquet.CsvSource('./test-simple.csv')
        self.assertEqual(csv_src.path, os.path.realpath(TEST_CSV))
    def test_headers_simple(self):
        csv_src = csv2parquet.CsvSource(TEST_CSV)
        expected_headers = [
            'Date',
            'Open',
            'High',
            'Low',
            'Close',
            'Volume',
            'ExDividend',
            'SplitRatio',
            'AdjOpen',
            'AdjHigh',
            'AdjLow',
            'AdjClose',
            'AdjVolume',
        ]
        self.assertEqual(expected_headers, csv_src.headers)
        # CSV and Parquet column names should be the same.
        expected_columns = [Column(header, header, None) for header in expected_headers]
        self.assertEqual(expected_columns, csv_src.columns.items)

    def test_columns_from_csv_source(self):
        # verify that an exception is raised if we don't override Parquet-invalid column names
        with self.assertRaises(csv2parquet.InvalidColumnNames):
            csv2parquet.CsvSource(TEST_CSV_MAP)
        # now try again, with a mapping
        name_map = {
            'Adj. Open' : 'Adj Open',
            'Adj. High' : 'Adj High',
            'Adj. Low' : 'Adj Low',
            'Adj. Close' : 'Adj Close',
            'Adj. Volume' : 'Adj Volume',
        }
        csv_src = csv2parquet.CsvSource(TEST_CSV_MAP, name_map)
        expected_columns = [
            Column('Date', 'Date', None),
            Column('Open', 'Open', None),
            Column('High', 'High', None),
            Column('Low', 'Low', None),
            Column('Close', 'Close', None),
            Column('Volume', 'Volume', None),
            Column('Ex-Dividend', 'Ex-Dividend', None),
            Column('Split Ratio', 'Split Ratio', None),
            Column('Adj. Open', 'Adj Open', None),
            Column('Adj. High', 'Adj High', None),
            Column('Adj. Low', 'Adj Low', None),
            Column('Adj. Close', 'Adj Close', None),
            Column('Adj. Volume', 'Adj Volume', None),
        ]
        self.assertEqual(expected_columns, csv_src.columns.items)

class TestDrillScript(unittest.TestCase):
    maxDiff = None
    def test_build_script(self):
        # .strip() the actual scripts to ignore leading/trailing whitespace
        expected_script = '''
alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`/path/to/parquet_output/` AS
SELECT
CASE when columns[0]='When' then CAST(NULL AS DATE) else CAST(columns[0] as DATE) end as `Date`,
columns[1] as `Open`,
columns[2] as `High`,
columns[3] as `Low`,
columns[4] as `Close`,
columns[5] as `Volume`,
columns[6] as `Ex-Dividend`,
CASE when columns[7]='Split Ratio' then CAST(NULL AS FLOAT) else CAST(columns[7] as FLOAT) end as `Split Ratio`,
CASE when columns[8]='Adj. Open' then CAST(NULL AS DOUBLE) else CAST(columns[8] as DOUBLE) end as `Adj Open`
FROM dfs.`/path/to/input.csv`
OFFSET 1
'''.strip()
        columns = [
            Column('When', 'Date', 'DATE'),
            Column('Open', 'Open', None),
            Column('Day High', 'High', None),
            Column('Day Low', 'Low', None),
            Column('Close', 'Close', None),
            Column('Volume', 'Volume', None),
            Column('Ex-Dividend', 'Ex-Dividend', None),
            Column('Split Ratio', 'Split Ratio', 'FLOAT'),
            Column('Adj. Open', 'Adj Open', 'DOUBLE'),
        ]
        actual_script = csv2parquet.render_drill_script(columns, '/path/to/parquet_output/', '/path/to/input.csv').strip()
        self.assertEqual(expected_script, actual_script)

class TestColumns(unittest.TestCase):
    def test_main(self):
        columns = Columns([], {}, {})
        self.assertEqual([], columns.items)

        columns = Columns(
            ["abc", "xyz", "foo", "bar", "baz"],
            {"foo": "whee", "baz": "magic"},
            {})
        items = [
            Column("abc", "abc", None),
            Column("xyz", "xyz", None),
            Column("foo", "whee", None),
            Column("bar", "bar", None),
            Column("baz", "magic", None),
        ]
        self.assertEqual(items, columns.items)
        self.assertEqual(items, list(columns))
--------------------------------------------------------------------------------