├── README.md
├── csv2parquet
├── csv2parquet.py
└── test
    ├── test-header-mapping.csv
    ├── test-simple.csv
    └── test_csv2parquet.py

/README.md:
--------------------------------------------------------------------------------
# csv2parquet: Create Parquet files from CSV

This simple tool creates Parquet files from CSV input, using a minimal
installation of [Apache Drill](https://drill.apache.org). As a data
format, [Parquet](https://parquet.apache.org) offers strong advantages
over comma-separated values for big data and cloud computing needs;
`csv2parquet` is designed to let you experience those benefits more
easily.

Much credit for this goes to Tugdual "Tug" Grall; `csv2parquet`
essentially automates the process he documents in [Convert a CSV File
to Apache Parquet With
Drill](http://tgrall.github.io/blog/2015/08/17/convert-csv-file-to-apache-parquet-dot-dot-dot-with-drill/).

`csv2parquet` is now in **public beta**. Feedback, comments, bug
reports, and feature requests are all appreciated. See "About and
Contact" below to reach the author.

# Usage

```
csv2parquet CSV_INPUT PARQUET_OUTPUT [--column-map ...] [--types ...]
```

`CSV_INPUT` is a CSV file whose first line defines the column names.
`PARQUET_OUTPUT` is the Parquet output (i.e., a directory in which one
or more Parquet files are written). Note that `csv2parquet` currently
requires the input CSV's first line to define header/column names.

## Customizing Column Names

By default, each Parquet column gets the same name as the
corresponding CSV header. You can specify a different name for each
output column with the `--column-map` option. When used, it must be
followed by an even number of strings, i.e. a sequence of pairs. In
each pair, the first string is the CSV file column name, and the
second is the Parquet column name to use instead:

```
csv2parquet data.csv data.parquet --column-map "First Column" "Primary Column" "Another Column" "Special Name"
```

In this example, two of the CSV columns are named "First Column" and
"Another Column". The created Parquet file will store data from these
columns under "Primary Column" and "Special Name", respectively.

(A perfectly good CSV column name may not be valid as a Parquet column
name - for example, a header name with a period, like
"Min. Investment". In this situation, you *must* use `--column-map`
to provide a column name that Parquet can accept, or edit the source
CSV file.)

## Column Types

By default, `csv2parquet` assumes all columns are of type string, but
you can declare specific columns to be any Drill data type. You do
this using the `--types` option, whose syntax is similar to
`--column-map`. On the command line, you write `--types`, followed by
an even number of strings that encode a sequence of pairs. In each
pair, the first string matches the name of the CSV column (*not* the
Parquet column name, if that is different). The second string is one
of the [Drill data
types](https://drill.apache.org/docs/supported-data-types/), such as
"INT", "FLOAT", "DATE", and so on. For example:

```
csv2parquet data.csv data.parquet --types "First Column" "INT" "Another Column" "FLOAT"
```

Note you can pass both `--types` and `--column-map` to
`csv2parquet` at once:

```
# On one long line:
csv2parquet data.csv data.parquet --column-map "First Column" "Primary Column" "Another Column" "Special Name" --types "First Column" "INT" "Another Column" "FLOAT"

# Split across lines, for readability:
csv2parquet data.csv data.parquet \
    --column-map "First Column" "Primary Column" "Another Column" "Special Name" \
    --types "First Column" "INT" "Another Column" "FLOAT"
```
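
Under the hood, these options become a Drill SQL script (see
`render_drill_script` in the source). As a rough sketch - assuming
"First Column" and "Another Column" are the first two CSV columns, and
with illustrative input and output paths - the combined example above
renders as:

```
alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`/parquet_tmp_output` AS
SELECT
CASE when columns[0]='First Column' then CAST(NULL AS INT) else CAST(columns[0] as INT) end as `Primary Column`,
CASE when columns[1]='Another Column' then CAST(NULL AS FLOAT) else CAST(columns[1] as FLOAT) end as `Special Name`
FROM dfs.`/absolute/path/to/data.csv`
OFFSET 1
```

The `CASE` guard exists because Drill applies casts even to the header
row that `OFFSET 1` skips; see the comments in the source for details.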
## Troubleshooting

If you encounter a bug, run again with the `--debug` option, and note
the directory name printed at startup. Many files, logs, and other
info useful for troubleshooting are stored in a temporary folder;
`--debug` prevents that folder from being deleted after the program
completes. See in particular `script`, `script_stderr` and
`script_stdout` in that folder. To report bugs, see "About and
Contact" below.

# Installation

Your system must have:

* Python 3 (version 3.5 or later).
* A quick-and-easy installation of [Apache Drill](https://drill.apache.org), version 1.4 or 1.5 - see below.

There are no other dependencies. You can simply copy the `csv2parquet` script wherever you'd like, and run it.

If you do not currently have Drill installed, simply
[download the tarball](https://drill.apache.org/download/), uncompress
it, and add its `bin` directory to your `$PATH`. No additional setup is
needed. (`csv2parquet` just uses the `drill-embedded` executable.)
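
For example - the version number and install location below are
illustrative; adjust them to match what you downloaded:

```
tar xzf apache-drill-1.5.0.tar.gz -C "$HOME"
export PATH="$HOME/apache-drill-1.5.0/bin:$PATH"
```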

Currently, `csv2parquet` runs on OS X and Linux. It has not been tested
on Windows, though Windows support is intended, and I appreciate
comments, pull requests, etc. to support Windows users.

Regarding Python versions: note that Python 3 safely installs
alongside Python 2 with no conflict - even the executables are named
differently ("python" for 2.7, and "python3" for 3.x). So you can
[simply install it](https://www.python.org/downloads/) to run
`csv2parquet` today on any system you control.

# Future Work

In order of priority:

* Adding certain important features, including:
  - delimiters other than commas
  - CSV files without header lines
* Running `csv2parquet` on Windows

# About and Contact

Written by [Aaron Maxwell](http://redsymbol.net). Contact him at amax@redsymbol.net.

Licensed under GPLv3.

For bug reports, please run with the `--debug` option (see
"Troubleshooting" above), and email the `script`, `script_stderr` and
`script_stdout` files to the author, along with a description of what
happened, and a CSV file that will reproduce the error.
--------------------------------------------------------------------------------
/csv2parquet:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import argparse
import atexit
import csv
import os
import shutil
import string
import subprocess
import sys
import tempfile

HELP='''
csv_input is a CSV file, whose first line defines the column names.
parquet_output is the Parquet output (i.e., a directory in which one
or more Parquet files are written).

When used, --column-map must be followed by an even number of
strings, constituting the key-value pairs mapping CSV to Parquet
column names:
  csv2parquet data.csv data.parquet --column-map "CSV Column Name" "Parquet Column Name"

To provide types for columns, use --types:
  csv2parquet data.csv data.parquet --types "CSV Column Name" "INT"

See documentation (README.md in the source repo) for more information.
'''.strip()

# True iff we are to preserve temporary files.
# Globally set to true in debug mode.
global preserve
preserve = False

DRILL_OVERRIDE_TEMPLATE = '''
drill.exec: {
  sys.store.provider: {
    # The following section is only required by LocalPStoreProvider
    local: {
      path: "${local_base}",
      write: true
    }
  },
  tmp: {
    directories: ["${local_base}"],
    filesystem: "drill-local:///"
  },
  sort: {
    purge.threshold : 100,
    external: {
      batch.size : 4000,
      spill: {
        batch.size : 4000,
        group.size : 100,
        threshold : 200,
        directories : [ "${local_base}/spill" ],
        fs : "file:///"
      }
    }
  },
}
'''

# exceptions
class CsvSourceError(Exception):
    def __init__(self, message):
        super().__init__(message)
        self.message = message

class DrillScriptError(CsvSourceError):
    def __init__(self, returncode):
        super().__init__(returncode)
        self.returncode = returncode

class InvalidColumnNames(CsvSourceError):
    pass

# classes
class Column:
    def __init__(self, csv, parquet, type):
        self.csv = csv
        self.parquet = parquet
        self.type = type
    def __eq__(self, other):
        return \
            self.csv == other.csv and \
            self.parquet == other.parquet and \
            self.type == other.type
    def line(self, index):
        if self.type is None:
            return 'columns[{}] as `{}`'.format(index, self.parquet)
        # In Drill, if a SELECT query has both an OFFSET and a CAST,
        # Drill will apply that cast even to rows that are
        # skipped. For a headerless CSV file, we could just use
        # something like:
        #
        #   CAST(columns[{index}] as {type}) as `{parquet_name}`
        #
        # But if the header line is present, this causes the entire
        # conversion to fail, because Drill attempts to cast the
        # header (e.g., "Price") to the type (e.g., INT), triggering a
        # fatal error. So instead we must do:
        #
        #   CASE when columns[{index}]='{csv_name}' then CAST(NULL AS {type}) else ...
        #
        # I really don't like this, because it makes it possible for
        # data corruption to hide. If a cell should contain a number,
        # but instead contains a non-numeric string, that should be a
        # loud, noisy error which is impossible to ignore. However, if
        # that happens here, and you are so unlucky that the corrupted
        # value happens to equal the CSV column name, then it is
        # silently nulled out. This is admittedly very unlikely, but
        # that's not the same as impossible. If you are reading this
        # and have an idea for a better solution, please contact the
        # author (see README.md).
        return "CASE when columns[{index}]='{csv_name}' then CAST(NULL AS {type}) else CAST(columns[{index}] as {type}) end as `{parquet_name}`".format(
            index=index, type=self.type, parquet_name=self.parquet, csv_name=self.csv)
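
# Illustrative renderings of Column.line, for orientation (the typed
# form is emitted on a single line; wrapped here for readability):
#
#   Column('Open', 'Open', None).line(1)
#     -> columns[1] as `Open`
#   Column('Price', 'Price', 'INT').line(3)
#     -> CASE when columns[3]='Price' then CAST(NULL AS INT)
#        else CAST(columns[3] as INT) end as `Price`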

class Columns:
    def __init__(self, csv_columns: list, name_map: dict, type_map: dict):
        self.csv_columns = csv_columns
        self.name_map = name_map
        self.type_map = type_map

        self.items = []
        invalid_names = []
        for csv_name in self.csv_columns:
            parquet_name = self.name_map.get(csv_name, csv_name)
            type_name = self.type_map.get(csv_name, None)
            if not is_valid_parquet_column_name(parquet_name):
                invalid_names.append(parquet_name)
            self.items.append(Column(csv_name, parquet_name, type_name))
        if len(invalid_names) > 0:
            raise InvalidColumnNames(invalid_names)

    def __iter__(self):
        return iter(self.items)

class CsvSource:
    def __init__(self, path: str, name_map: dict = None, type_map: dict = None):
        if name_map is None:
            name_map = {}
        if type_map is None:
            type_map = {}
        self.path = os.path.realpath(path)
        self.headers = self._init_headers()
        self.columns = Columns(self.headers, name_map, type_map)
    def _init_headers(self):
        with open(self.path, newline='') as handle:
            csv_data = csv.reader(handle)
            return next(csv_data)

class TempLocation:
    _tempdir = None
    def __init__(self):
        drive, path = os.path.splitdrive(self.tempdir)
        assert drive == '', 'Windows support not provided yet'
        assert path.startswith('/tmp/'), self.tempdir
        self.dfs_tmp_base = path[len('/tmp'):]
    def dfs_tmp_path(self, path: str):
        return os.path.join(self.dfs_tmp_base, path)
    def full_path(self, path: str):
        return os.path.join(self.tempdir, path)
    @property
    def tempdir(self):
        if self._tempdir is None:
            self._tempdir = tempfile.mkdtemp(prefix='/tmp/')
            if preserve:
                print('Preserving logs and intermediate files: ' + self._tempdir)
            else:
                atexit.register(shutil.rmtree, self._tempdir)
        return self._tempdir
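
# Illustrative TempLocation path mapping: if tempdir is '/tmp/abc123',
#   dfs_tmp_path('parquet_tmp_output') -> '/abc123/parquet_tmp_output'
#   full_path('parquet_tmp_output')    -> '/tmp/abc123/parquet_tmp_output'
# The dfs_tmp_path form is relative to Drill's dfs.tmp workspace
# (which maps to /tmp by default); full_path is the same location on
# disk, as seen by this script.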

class DrillInstallation:
    '''Create a temporary, custom Drill installation

    Even in embedded mode, Drill runs in a stateful fashion, storing
    its state in (by default) /tmp/drill. This poses a few problems
    for running csv2parquet in ways that are (a) robust, and (b)
    certain not to affect any other Drill users, human or machine.

    This would be an easy problem to solve if I could create a custom
    conf/drill-override.conf, and pass it to drill-embedded as a
    command line option. However, as of Drill 1.4, the only way to do
    such customization is to modify the actual drill-override.conf
    file itself, before starting drill-embedded.

    This class gets around this through an admittedly unorthodox hack:
    creating a whole parallel Drill installation under a temporary
    directory, with a customized conf/ directory, but reusing (via
    symlinks) everything else in the main installation. This lets us
    safely construct our own drill configuration, using its own
    separate equivalent of /tmp/drill (etc.), and which is all cleaned
    up after the script exits.

    (Feature request to anyone on the Drill team reading this: Please
    give drill-embedded a --with-override-file option!)

    '''
    def __init__(self, reference_executable: str = None):
        self.location = TempLocation()
        if reference_executable is None:
            reference_executable = shutil.which('drill-embedded')
            assert reference_executable is not None
        self.reference_executable = reference_executable
        self.reference_base, self.bindir = os.path.split(os.path.dirname(reference_executable))
        self.install()
    @property
    def base(self):
        return os.path.join(self.location.tempdir, 'drill')
    @property
    def local_base(self):
        return os.path.join(self.location.tempdir, 'drill-local-base')
    @property
    def executable(self):
        return os.path.join(self.base, self.bindir, 'drill-embedded')
    def install(self):
        # create required subdirs
        for dirname in (self.base, self.local_base):
            os.makedirs(dirname)
        # link to reference
        for item in os.scandir(self.reference_base):
            if item.name == 'conf':
                assert item.is_dir(), os.path.realpath(item)
                continue
            os.symlink(item.path, os.path.join(self.base, item.name))
        # install config
        conf_dir = os.path.join(self.base, 'conf')
        os.makedirs(conf_dir)
        with open(os.path.join(conf_dir, 'drill-override.conf'), 'w') as handle:
            handle.write(string.Template(DRILL_OVERRIDE_TEMPLATE).substitute(
                local_base=self.local_base))

    def build_script(self, csv_source: CsvSource, parquet_output: str):
        return DrillScript(self, csv_source, parquet_output)

class DrillScript:
    def __init__(self, drill: DrillInstallation, csv_source: CsvSource, parquet_output: str):
        self.drill = drill
        self.csv_source = csv_source
        self.parquet_output = parquet_output
    def render(self):
        return render_drill_script(
            self.csv_source.columns,
            self.drill.location.dfs_tmp_path('parquet_tmp_output'),
            self.csv_source.path,
        )
    def run(self):
        # execute drill script
        script_path = os.path.join(self.drill.location.tempdir, 'script')
        script_stdout = os.path.join(self.drill.location.tempdir, 'script_stdout')
        script_stderr = os.path.join(self.drill.location.tempdir, 'script_stderr')
        cmd = [
            self.drill.executable,
            '--run={}'.format(script_path),
        ]
        with open(script_path, 'w') as handle:
            handle.write(self.render())
        with open(script_stdout, 'w') as stdout, open(script_stderr, 'w') as stderr:
            proc = subprocess.Popen(cmd, stdout=stdout, stderr=stderr)
            proc.wait()
        if proc.returncode != 0:
            raise DrillScriptError(proc.returncode)

        # publish resulting output parquet file
        os.rename(self.drill.location.full_path('parquet_tmp_output'), self.parquet_output)

# helper functions
def get_args():
    parser = argparse.ArgumentParser(
        description='Create Parquet files from CSV input, using Apache Drill.',
        epilog=HELP,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument('csv_input',
                        help='Path to input CSV file')
    parser.add_argument('parquet_output',
                        help='Path to Parquet output')
    parser.add_argument('--debug', default=False, action='store_true',
                        help='Preserve intermediate files and logs')
    parser.add_argument('--column-map', nargs='*',
                        help='Map CSV header names to Parquet column names')
    parser.add_argument('--types', nargs='*',
                        help='Map CSV header names to Parquet types')
    args = parser.parse_args()
    try:
        args.column_map = list2dict(args.column_map)
    except ValueError:
        parser.error('--column-map requires an even number of arguments, as key-value pairs')
    try:
        args.types = list2dict(args.types)
    except ValueError:
        parser.error('--types requires an even number of arguments, as key-value pairs')
    return args

def list2dict(items):
    '''convert [a, b, c, d] to {a:b, c:d}'''
    if items is None:
        return {}
    if len(items) % 2 != 0:
        raise ValueError
    return dict( (items[n], items[n+1])
                 for n in range(0, len(items)-1, 2) )
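
# A Parquet column name may not contain a period: headers like
# "Adj. Close" must be remapped via --column-map before conversion
# (see "Customizing Column Names" in README.md).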
def is_valid_parquet_column_name(val):
    return '.' not in val

def render_drill_script(columns: Columns, parquet_output: str, csv_input: str):
    script = '''alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`{}` AS
SELECT
'''.format(parquet_output)
    column_lines = [column.line(n) for n, column in enumerate(columns)]
    script += ',\n'.join(column_lines) + '\n'
    script += 'FROM dfs.`{}`\n'.format(csv_input)
    script += 'OFFSET 1\n'
    return script
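
# Illustrative programmatic use (mirrors test_csv2parquet.py; the file
# and output names here are hypothetical):
#   import csv2parquet
#   src = csv2parquet.CsvSource('data.csv', {'Adj. Close': 'Adj Close'})
#   print(csv2parquet.render_drill_script(src.columns, '/out', src.path))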

if __name__ == "__main__":
    args = get_args()
    if args.debug:
        preserve = True
    # Quick pre-check whether destination exists, so user doesn't have
    # to wait long before we abort with a write error. There's a race
    # condition because it can still be created between now and when
    # we eventually try to write it, but this will catch the common case.
    if os.path.exists(args.parquet_output):
        sys.stderr.write('Output location "{}" already exists. Rename or delete before running again.\n'.format(args.parquet_output))
        sys.exit(1)
    csv_source = CsvSource(args.csv_input, args.column_map, args.types)
    drill = DrillInstallation()
    drill_script = drill.build_script(csv_source, args.parquet_output)
    try:
        drill_script.run()
    except DrillScriptError as err:
        sys.stderr.write('''FATAL: Drill script failed with error code {}. To troubleshoot, run
with --debug and inspect files script, script_stderr and script_stdout.
'''.format(err.returncode))
        sys.exit(2)
--------------------------------------------------------------------------------
/csv2parquet.py:
--------------------------------------------------------------------------------
csv2parquet
--------------------------------------------------------------------------------
/test/test-header-mapping.csv:
--------------------------------------------------------------------------------
Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
2016-02-12,38.06,39.3,37.71,39.25,4473052.0,0.0,1.0,38.06,39.3,37.71,39.25,4473052.0
2016-02-11,38.32,39.17,37.7301,37.9,5352149.0,0.0,1.0,38.32,39.17,37.7301,37.9,5352149.0
2016-02-10,39.8,39.97,38.72,38.81,4962605.0,0.0,1.0,39.8,39.97,38.72,38.81,4962605.0
2016-02-09,39.48,39.89,38.79,39.51,4081861.0,0.0,1.0,39.48,39.89,38.79,39.51,4081861.0
2016-02-08,40.0,40.43,39.02,39.65,5638394.0,0.0,1.0,40.0,40.43,39.02,39.65,5638394.0
--------------------------------------------------------------------------------
/test/test-simple.csv:
--------------------------------------------------------------------------------
Date,Open,High,Low,Close,Volume,ExDividend,SplitRatio,AdjOpen,AdjHigh,AdjLow,AdjClose,AdjVolume
2016-02-12,38.06,39.3,37.71,39.25,4473052.0,0.0,1.0,38.06,39.3,37.71,39.25,4473052.0
2016-02-11,38.32,39.17,37.7301,37.9,5352149.0,0.0,1.0,38.32,39.17,37.7301,37.9,5352149.0
2016-02-10,39.8,39.97,38.72,38.81,4962605.0,0.0,1.0,39.8,39.97,38.72,38.81,4962605.0
2016-02-09,39.48,39.89,38.79,39.51,4081861.0,0.0,1.0,39.48,39.89,38.79,39.51,4081861.0
2016-02-08,40.0,40.43,39.02,39.65,5638394.0,0.0,1.0,40.0,40.43,39.02,39.65,5638394.0
--------------------------------------------------------------------------------
/test/test_csv2parquet.py:
--------------------------------------------------------------------------------
import unittest
import os

import csv2parquet
from csv2parquet import Columns, Column

THIS_DIR = os.path.dirname(__file__)
TEST_CSV = os.path.join(THIS_DIR, 'test-simple.csv')
TEST_CSV_MAP = os.path.join(THIS_DIR, 'test-header-mapping.csv')
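
# Fixture notes: test-simple.csv has Parquet-safe headers, while
# test-header-mapping.csv has headers containing periods (e.g.
# "Adj. Close") that must be remapped before conversion can succeed.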

class TestUtil(unittest.TestCase):
    def test_list2dict(self):
        from csv2parquet import list2dict
        with self.assertRaises(ValueError):
            list2dict(["foo"])
        with self.assertRaises(ValueError):
            list2dict(["foo", "bar", "baz"])
        self.assertEqual({}, list2dict([]))
        self.assertEqual({}, list2dict(None))
        self.assertEqual({"a":"b"}, list2dict(["a", "b"]))
        self.assertEqual({"a":"b", "x":"y"}, list2dict(["a", "b", "x", "y"]))
        self.assertEqual({"a":"b", "x":"y"}, list2dict(["x", "y", "a", "b"]))

class TestCsvSource(unittest.TestCase):
    def test_real_path_to_prevent_drill_script_errors(self):
        # Specifying a CSV file path of something like "../something.csv" will confuse Drill.
        # Prevent this by expanding the path.
        csv_src = csv2parquet.CsvSource('./test-simple.csv')
        self.assertEqual(csv_src.path, os.path.realpath(TEST_CSV))
    def test_headers_simple(self):
        csv_src = csv2parquet.CsvSource(TEST_CSV)
        expected_headers = [
            'Date',
            'Open',
            'High',
            'Low',
            'Close',
            'Volume',
            'ExDividend',
            'SplitRatio',
            'AdjOpen',
            'AdjHigh',
            'AdjLow',
            'AdjClose',
            'AdjVolume',
        ]
        self.assertEqual(expected_headers, csv_src.headers)
        # CSV and Parquet column names should be the same.
        expected_columns = [Column(header, header, None) for header in expected_headers]
        self.assertEqual(expected_columns, csv_src.columns.items)

    def test_columns_from_csv_source(self):
        # verify that an exception is raised if we don't override Parquet-invalid column names
        with self.assertRaises(csv2parquet.InvalidColumnNames):
            csv2parquet.CsvSource(TEST_CSV_MAP)
        # now try again, with a mapping
        name_map = {
            'Adj. Open' : 'Adj Open',
            'Adj. High' : 'Adj High',
            'Adj. Low' : 'Adj Low',
            'Adj. Close' : 'Adj Close',
            'Adj. Volume' : 'Adj Volume',
        }
        csv_src = csv2parquet.CsvSource(TEST_CSV_MAP, name_map)
        expected_columns = [
            Column('Date', 'Date', None),
            Column('Open', 'Open', None),
            Column('High', 'High', None),
            Column('Low', 'Low', None),
            Column('Close', 'Close', None),
            Column('Volume', 'Volume', None),
            Column('Ex-Dividend', 'Ex-Dividend', None),
            Column('Split Ratio', 'Split Ratio', None),
            Column('Adj. Open', 'Adj Open', None),
            Column('Adj. High', 'Adj High', None),
            Column('Adj. Low', 'Adj Low', None),
            Column('Adj. Close', 'Adj Close', None),
            Column('Adj. Volume', 'Adj Volume', None),
        ]
        self.assertEqual(expected_columns, csv_src.columns.items)

class TestDrillScript(unittest.TestCase):
    maxDiff = None
    def test_build_script(self):
        # .strip() the actual scripts to ignore leading/trailing whitespace
        expected_script = '''
alter session set `store.format`='parquet';
CREATE TABLE dfs.tmp.`/path/to/parquet_output/` AS
SELECT
CASE when columns[0]='When' then CAST(NULL AS DATE) else CAST(columns[0] as DATE) end as `Date`,
columns[1] as `Open`,
columns[2] as `High`,
columns[3] as `Low`,
columns[4] as `Close`,
columns[5] as `Volume`,
columns[6] as `Ex-Dividend`,
CASE when columns[7]='Split Ratio' then CAST(NULL AS FLOAT) else CAST(columns[7] as FLOAT) end as `Split Ratio`,
CASE when columns[8]='Adj. Open' then CAST(NULL AS DOUBLE) else CAST(columns[8] as DOUBLE) end as `Adj Open`
FROM dfs.`/path/to/input.csv`
OFFSET 1
'''.strip()
        columns = [
            Column('When', 'Date', 'DATE'),
            Column('Open', 'Open', None),
            Column('Day High', 'High', None),
            Column('Day Low', 'Low', None),
            Column('Close', 'Close', None),
            Column('Volume', 'Volume', None),
            Column('Ex-Dividend', 'Ex-Dividend', None),
            Column('Split Ratio', 'Split Ratio', 'FLOAT'),
            Column('Adj. Open', 'Adj Open', 'DOUBLE'),
        ]
        actual_script = csv2parquet.render_drill_script(columns, '/path/to/parquet_output/', '/path/to/input.csv').strip()
        self.assertEqual(expected_script, actual_script)

class TestColumns(unittest.TestCase):
    def test_main(self):
        columns = Columns([], {}, {})
        self.assertEqual([], columns.items)

        columns = Columns(
            ["abc", "xyz", "foo", "bar", "baz"],
            {"foo": "whee", "baz": "magic"},
            {})
        items = [
            Column("abc", "abc", None),
            Column("xyz", "xyz", None),
            Column("foo", "whee", None),
            Column("bar", "bar", None),
            Column("baz", "magic", None),
        ]
        self.assertEqual(items, columns.items)
        self.assertEqual(items, list(columns))
--------------------------------------------------------------------------------