├── test_requirements.in
├── .gitignore
├── .coveragerc
├── tox.ini
├── HISTORY.rst
├── .travis.yml
├── LICENSE
├── setup.py
├── test_requirements.txt
├── lazyreader.py
├── test_lazyreader.py
└── README.rst


/test_requirements.in:
--------------------------------------------------------------------------------
boto3
coverage
moto
pytest
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
build
dist
*.egg-info
__pycache__
*.pyc
.cache
.coverage
.tox
--------------------------------------------------------------------------------
/.coveragerc:
--------------------------------------------------------------------------------
[run]
branch = True
source = lazyreader

[report]
fail_under = 100
show_missing = True

[paths]
source =
    lazyreader.py
--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
[tox]
envlist = py27, py34, py35, py36

[testenv]
deps = -r{toxinidir}/test_requirements.txt
commands=
    coverage run -m py.test {posargs} test_lazyreader.py
    coverage report
--------------------------------------------------------------------------------
/HISTORY.rst:
--------------------------------------------------------------------------------
Release History
===============

1.0.1 (2017-09-06)
------------------

- Fix a bug where trying to read an S3 object with the `boto3` library would throw an ``IncompleteReadError``.
  (`#2 `_)

1.0.0 (2017-09-02)
------------------

- First production release!
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
language: python
sudo: false

install:
  - pip install -U pip setuptools
  - pip install tox

cache:
  directories:
    - $HOME/.cache/pip

matrix:
  include:
    - python: "2.7"
      env: TOXENV=py27
    - python: "3.4"
      env: TOXENV=py34
    - python: "3.5"
      env: TOXENV=py35
    - python: "3.6"
      env: TOXENV=py36

script:
  - tox

branches:
  only:
    - master
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2017 Alex Chan

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

import codecs
import os
import sys

from setuptools import setup

version = '1.0.1'

# Stolen from Kenneth Reitz
if sys.argv[-1] == 'publish':
    os.system('python setup.py sdist')
    os.system('python setup.py bdist_wheel --universal')
    os.system('twine upload dist/lazyreader-%s*' % version)
    sys.exit()


def read(f):
    return codecs.open(f, encoding='utf-8').read()


setup(
    name='lazyreader',
    version=version,
    description='Lazy reading of file objects for efficient batch processing',
    long_description=read('README.rst'),
    author='Alex Chan',
    author_email='alex@alexwlchan.net',
    url='https://github.com/alexwlchan/lazyreader',
    py_modules=['lazyreader'],
    package_data={'': ['LICENSE']},
    include_package_data=True,
    license='MIT',
    classifiers=(
        'Development Status :: 5 - Production/Stable',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: Implementation :: CPython',
    )
)
--------------------------------------------------------------------------------
/test_requirements.txt:
--------------------------------------------------------------------------------
#
# This file is autogenerated by pip-compile
# To update, run:
#
#    pip-compile --output-file test_requirements.txt test_requirements.in
#
boto3==1.4.7
boto==2.48.0  # via moto
botocore==1.7.4  # via boto3, s3transfer
certifi==2017.7.27.1  # via requests
chardet==3.0.4  # via requests
cookies==2.2.1  # via moto
coverage==4.4.1
dicttoxml==1.7.4  # via moto
docutils==0.14  # via botocore
funcsigs==1.0.2  # via mock
futures==3.1.1  # via s3transfer
idna==2.6  # via requests
jinja2==2.9.6  # via moto
jmespath==0.9.3  # via boto3, botocore
markupsafe==1.0  # via jinja2
mock==2.0.0  # via moto
moto==1.1.1
pbr==3.1.1  # via mock
py==1.4.34  # via pytest
pyaml==17.8.0  # via moto
pytest==3.2.1
python-dateutil==2.6.1  # via botocore, moto
pytz==2017.2  # via moto
pyyaml==3.12  # via pyaml
requests==2.18.4  # via moto
s3transfer==0.1.11  # via boto3
six==1.10.0  # via mock, moto, python-dateutil
urllib3==1.22  # via requests
werkzeug==0.12.2  # via moto
xmltodict==0.11.0  # via moto

# The following packages are considered to be unsafe in a requirements file:
# setuptools  # via pytest
--------------------------------------------------------------------------------
/lazyreader.py:
--------------------------------------------------------------------------------
# -*- encoding: utf-8

def lazyread(f, delimiter):
    """
    Generator which continually reads ``f`` to the next instance
    of ``delimiter``.

    This allows you to do batch processing on the contents of ``f`` without
    loading the entire file into memory.

    :param f: Any file-like object which has a ``.read()`` method.
    :param delimiter: Delimiter on which to split up the file.
    """
    # Get an empty string to start with.  We need to make sure that if the
    # file is opened in binary mode, we're using byte strings, and similar
    # for Unicode.  Otherwise trying to update the running string will
    # hit a TypeError.
    try:
        running = f.read(0)
    except Exception as e:

        # The boto3 APIs don't let you read zero bytes from an S3 object, but
        # they always return bytestrings, so in this case we know what to
        # start with.
        if e.__class__.__name__ == 'IncompleteReadError':
            running = b''
        else:
            raise

    while True:
        new_data = f.read(1024)

        # When a call to read() returns nothing, we're at the end of the file.
        if not new_data:
            yield running
            return

        # Otherwise, update the running stream and look for instances of
        # the delimiter.  Remember we might have read more than one delimiter
        # since the last time we checked
        running += new_data
        while delimiter in running:
            curr, running = running.split(delimiter, 1)
            yield curr + delimiter
--------------------------------------------------------------------------------
/test_lazyreader.py:
--------------------------------------------------------------------------------
# -*- encoding: utf-8

import boto3
import moto
import pytest

from lazyreader import lazyread


class Readable(object):
    def __init__(self, body):
        self.body = body
        self.position = 0

    def read(self, size):
        retval = self.body[self.position:self.position + size]
        self.position += size
        return retval


def test_read_single_line_with_binary_file():
    f = open('README.rst', 'rb')
    assert next(lazyread(f, delimiter=b'\n')) == b'lazyreader\n'


def test_read_single_line_with_unicode_file():
    f = open('README.rst', 'r')
    assert next(lazyread(f, delimiter='\n')) == 'lazyreader\n'


def test_can_read_entire_file():
    expected = [
        'A triplet of lines;',
        'separated by semicolons;',
        'not newlines'
    ]
    f = Readable(''.join(expected))
    assert list(lazyread(f, delimiter=';')) == expected


@moto.mock_s3
def test_can_read_from_s3():
    s3 = boto3.client('s3')
    s3.create_bucket(Bucket='bukkit')
    s3.put_object(Bucket='bukkit', Key='long_file.txt', Body=b'foo\nbar\nbaz')

    f = s3.get_object(Bucket='bukkit', Key='long_file.txt')['Body']
    assert list(lazyread(f, delimiter=b'\n')) == [b'foo\n', b'bar\n', b'baz']


def test_unexpected_error_is_still_raised():
    class BrokenReadable(object):
        def read(self, size):
            raise RuntimeError("Reading from this object isn't allowed!")

    f = BrokenReadable()

    with pytest.raises(RuntimeError):
        next(lazyread(f, delimiter=b'\n'))
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
lazyreader
==========

lazyreader is a Python module for doing lazy reading of file objects.

The Python standard library lets you read a file a line-at-a-time, saving you from loading the entire file into memory.
For example:

.. code-block:: python

   with open('large_file.txt') as f:
       for line in f:
           print(line)

lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
For example:

.. code-block:: python

   from lazyreader import lazyread

   with open('large_file.txt') as f:
       for doc in lazyread(f, delimiter=';'):
           print(doc)

This is a snippet of code I spun out from the `Wellcome Digital Platform `_.
We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
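
As a quick illustration of the behaviour described above, here is a minimal sketch that runs against in-memory streams instead of real files. It inlines a condensed copy of ``lazyread()`` purely so the snippet works without installing the package; the sample strings and the ``text_chunks`` / ``byte_chunks`` names are made up for the example.

```python
import io


def lazyread(f, delimiter):
    # Condensed copy of lazyreader.lazyread, inlined so this runs standalone.
    try:
        running = f.read(0)
    except Exception as e:
        if e.__class__.__name__ == 'IncompleteReadError':
            running = b''
        else:
            raise
    while True:
        new_data = f.read(1024)
        if not new_data:
            # End of input: yield whatever is left over, even if it is empty.
            yield running
            return
        running += new_data
        while delimiter in running:
            curr, running = running.split(delimiter, 1)
            yield curr + delimiter


# A text stream needs a text delimiter; a byte stream needs a bytes delimiter.
text_chunks = list(lazyread(io.StringIO(u'alpha;beta;gamma'), delimiter=u';'))
byte_chunks = list(lazyread(io.BytesIO(b'alpha;beta;'), delimiter=b';'))

print(text_chunks)  # ['alpha;', 'beta;', 'gamma']
print(byte_chunks)  # [b'alpha;', b'beta;', b'']
```

Note the trailing empty chunk in the bytes case: when the input ends with the delimiter, the final yield is whatever is left over, which is an empty string. Callers that care may want to skip empty chunks.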

Installation
************

lazyreader is available from PyPI:

.. code-block:: console

   $ pip install lazyreader

Examples
********

If we have a file stored locally, we can open it and split based on any choice of delimiter.
For example, if we had a text file in which records were separated by commas:

.. code-block:: python

   with open('lots_of_records.txt') as f:
       for doc in lazyread(f, delimiter=','):
           print(doc)

Another example: we have a file stored in Amazon S3, and we'd like to read it line-by-line.
The `boto3 `_ API gives us a file object for reading from S3:

.. code-block:: python

   import boto3

   client = boto3.client('s3')
   s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
   body = s3_object['Body']

   for doc in lazyread(body, delimiter=b'\n'):
       print(doc)

(This is the use case for which this code was originally written.)

One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
Like so:

.. code-block:: python

   import urllib.request

   with urllib.request.urlopen('https://example.org/') as f:
       for doc in lazyread(f, delimiter=b'<br>'):
           print(doc)

Advanced usage
**************

``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators which do processing on the data.

First example: we have a file which contains a list of JSON objects, one per line.
(This is the format of output files from `elasticdump `_.)
What the caller really needs is Python dictionaries, not JSON strings.
We can wrap ``lazyread()`` like so:

.. code-block:: python

   import json

   def lazyjson(f, delimiter=b'\n'):
       for doc in lazyread(f, delimiter=delimiter):

           # Ignore empty lines, e.g. the last line in a file
           if not doc.strip():
               continue

           yield json.loads(doc)

Another example: we want to parse a large XML file, but not load it all into memory at once.
We can write the following wrapper:

.. code-block:: python

   from lxml import etree

   def lazyxmlstrings(f, opening_tag, closing_tag):
       for doc in lazyread(f, delimiter=closing_tag):
           if opening_tag not in doc:
               continue

           # We want complete XML blocks, so look for the opening tag and
           # just return its contents
           block = doc.split(opening_tag)[-1]
           yield opening_tag + block

   def lazyxml(f, opening_tag, closing_tag):
       for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
           yield etree.fromstring(xml_string)

We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.

Isn't this a bit simple to be a module?
***************************************

Maybe.
There are recipes on Stack Overflow that do something very similar, but I find it useful to have in a standalone module.

And it's not completely trivial -- at least, not for me.
I made two mistakes when I first wrote this:

* I was hard-coding the initial running string as

  .. code-block:: python

     running = b''

  That only works if your file object is returning bytestrings.
  If it's returning Unicode strings, you get a ``TypeError`` (`can't concat bytes to str`) when it first tries to read from the file.
  String types are important!

* After I'd read another 1024 characters from the file, I checked for the delimiter like so:

  .. code-block:: python

     running += new_data
     if delimiter in running:
         curr, running = running.split(delimiter)
         yield curr + delimiter

  For my initial use case, individual documents were `much` bigger than 1024 characters, so the new data would never contain multiple delimiters.
  But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
  So now the code correctly checks and handles the case where a single read includes more than one delimiter.

Now it's encoded and tested in a module, I don't have to worry about making the same mistakes again.

License
*******

MIT.
--------------------------------------------------------------------------------
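
The second mistake described in the README can be reproduced without any file at all. The sketch below is only an illustration: the sample string and the variable names (``unbounded_error``, ``docs``) are invented for it, and it mirrors the inner loop of ``lazyread()`` rather than calling it.

```python
running = 'doc1;doc2;doc3'
delimiter = ';'

# An unbounded split returns three pieces here, so unpacking into two
# names raises ValueError. This is the original bug.
try:
    curr, rest = running.split(delimiter)
except ValueError as err:
    unbounded_error = err
else:
    unbounded_error = None

# Capping the split at one always gives exactly two pieces, and looping
# drains every complete document that is already in the buffer.
docs = []
while delimiter in running:
    curr, running = running.split(delimiter, 1)
    docs.append(curr + delimiter)

print(docs)     # ['doc1;', 'doc2;']
print(running)  # doc3
```

What remains in ``running`` after the loop is the incomplete tail, which stays in the buffer until the next read supplies more data or the stream ends.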