├── test_requirements.in
├── .gitignore
├── .coveragerc
├── tox.ini
├── HISTORY.rst
├── .travis.yml
├── LICENSE
├── setup.py
├── test_requirements.txt
├── lazyreader.py
├── test_lazyreader.py
└── README.rst


/test_requirements.in:
--------------------------------------------------------------------------------
1 | boto3
2 | coverage
3 | moto
4 | pytest
5 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | build
2 | dist
3 | *.egg-info
4 | __pycache__
5 | *.pyc
6 | .cache
7 | .coverage
8 | .tox
9 | 


--------------------------------------------------------------------------------
/.coveragerc:
--------------------------------------------------------------------------------
 1 | [run]
 2 | branch = True
 3 | source = lazyreader
 4 | 
 5 | [report]
 6 | fail_under = 100
 7 | show_missing = True
 8 | 
 9 | [paths]
10 | source =
11 |     lazyreader.py
12 | 


--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
1 | [tox]
2 | envlist = py27, py34, py35, py36
3 | 
4 | [testenv]
5 | deps = -r{toxinidir}/test_requirements.txt
6 | commands=
7 |     coverage run -m py.test {posargs} test_lazyreader.py
8 |     coverage report
9 | 


--------------------------------------------------------------------------------
/HISTORY.rst:
--------------------------------------------------------------------------------
 1 | Release History
 2 | ===============
 3 | 
 4 | 1.0.1 (2017-09-06)
 5 | ------------------
 6 | 
 7 | -  Fix a bug where trying to read an S3 object with the ``boto3`` library would throw an ``IncompleteReadError``.
 8 |    (`#2 <https://github.com/alexwlchan/lazyreader/issues/2>`_)
 9 | 
10 | 1.0.0 (2017-09-02)
11 | ------------------
12 | 
13 | -  First production release!
14 | 


--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
 1 | language: python
 2 | sudo: false
 3 | 
 4 | install:
 5 |   - pip install -U pip setuptools
 6 |   - pip install tox
 7 | 
 8 | cache:
 9 |   directories:
10 |     - $HOME/.cache/pip
11 | 
12 | matrix:
13 |   include:
14 |     - python: "2.7"
15 |       env: TOXENV=py27
16 |     - python: "3.4"
17 |       env: TOXENV=py34
18 |     - python: "3.5"
19 |       env: TOXENV=py35
20 |     - python: "3.6"
21 |       env: TOXENV=py36
22 | 
23 | script:
24 |   - tox
25 | 
26 | branches:
27 |   only:
28 |     - master
29 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2017 Alex Chan
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 | 
5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 | 
7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
8 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | import codecs
 4 | import os
 5 | import sys
 6 | 
 7 | from setuptools import setup
 8 | 
 9 | version = '1.0.1'
10 | 
11 | # Stolen from Kenneth Reitz
12 | if sys.argv[-1] == 'publish':
13 |     os.system('python setup.py sdist')
14 |     os.system('python setup.py bdist_wheel --universal')
15 |     os.system('twine upload dist/lazyreader-%s*' % version)
16 |     sys.exit()
17 | 
18 | 
19 | def read(f):
20 |     return codecs.open(f, encoding='utf-8').read()
21 | 
22 | 
23 | setup(
24 |     name='lazyreader',
25 |     version=version,
26 |     description='Lazy reading of file objects for efficient batch processing',
27 |     long_description=read('README.rst'),
28 |     author='Alex Chan',
29 |     author_email='alex@alexwlchan.net',
30 |     url='https://github.com/alexwlchan/lazyreader',
31 |     py_modules=['lazyreader'],
32 |     package_data={'': ['LICENSE']},
33 |     include_package_data=True,
34 |     license='MIT',
35 |     classifiers=[
36 |         'Development Status :: 5 - Production/Stable',
37 |         'Intended Audience :: Developers',
38 |         'License :: OSI Approved :: MIT License',
39 |         'Programming Language :: Python',
40 |         'Programming Language :: Python :: 2.7',
41 |         'Programming Language :: Python :: 3',
42 |         'Programming Language :: Python :: 3.4',
43 |         'Programming Language :: Python :: 3.5',
44 |         'Programming Language :: Python :: 3.6',
45 |         'Programming Language :: Python :: Implementation :: CPython',
46 |     ]
47 | )
48 | 


--------------------------------------------------------------------------------
/test_requirements.txt:
--------------------------------------------------------------------------------
 1 | #
 2 | # This file is autogenerated by pip-compile
 3 | # To update, run:
 4 | #
 5 | #    pip-compile --output-file test_requirements.txt test_requirements.in
 6 | #
 7 | boto3==1.4.7
 8 | boto==2.48.0              # via moto
 9 | botocore==1.7.4           # via boto3, s3transfer
10 | certifi==2017.7.27.1      # via requests
11 | chardet==3.0.4            # via requests
12 | cookies==2.2.1            # via moto
13 | coverage==4.4.1
14 | dicttoxml==1.7.4          # via moto
15 | docutils==0.14            # via botocore
16 | funcsigs==1.0.2           # via mock
17 | futures==3.1.1            # via s3transfer
18 | idna==2.6                 # via requests
19 | jinja2==2.9.6             # via moto
20 | jmespath==0.9.3           # via boto3, botocore
21 | markupsafe==1.0           # via jinja2
22 | mock==2.0.0               # via moto
23 | moto==1.1.1
24 | pbr==3.1.1                # via mock
25 | py==1.4.34                # via pytest
26 | pyaml==17.8.0             # via moto
27 | pytest==3.2.1
28 | python-dateutil==2.6.1    # via botocore, moto
29 | pytz==2017.2              # via moto
30 | pyyaml==3.12              # via pyaml
31 | requests==2.18.4          # via moto
32 | s3transfer==0.1.11        # via boto3
33 | six==1.10.0               # via mock, moto, python-dateutil
34 | urllib3==1.22             # via requests
35 | werkzeug==0.12.2          # via moto
36 | xmltodict==0.11.0         # via moto
37 | 
38 | # The following packages are considered to be unsafe in a requirements file:
39 | # setuptools                # via pytest
40 | 


--------------------------------------------------------------------------------
/lazyreader.py:
--------------------------------------------------------------------------------
 1 | # -*- encoding: utf-8 -*-
 2 | 
 3 | def lazyread(f, delimiter):
 4 |     """
 5 |     Generator which continually reads ``f`` to the next instance
 6 |     of ``delimiter``.
 7 | 
 8 |     This allows you to do batch processing on the contents of ``f`` without
 9 |     loading the entire file into memory.
10 | 
11 |     :param f: Any file-like object which has a ``.read()`` method.
12 |     :param delimiter: Delimiter on which to split up the file.
13 |     """
14 |     # Get an empty string to start with.  We need to make sure that if the
15 |     # file is opened in binary mode, we're using byte strings, and similar
16 |     # for Unicode.  Otherwise trying to update the running string will
17 |     # hit a TypeError.
18 |     try:
19 |         running = f.read(0)
20 |     except Exception as e:
21 | 
22 |         # The boto3 APIs don't let you read zero bytes from an S3 object, but
23 |         # they always return bytestrings, so in this case we know what to
24 |         # start with.
25 |         if e.__class__.__name__ == 'IncompleteReadError':
26 |             running = b''
27 |         else:
28 |             raise
29 | 
30 |     while True:
31 |         new_data = f.read(1024)
32 | 
33 |         # When a call to read() returns nothing, we're at the end of the file.
34 |         if not new_data:
35 |             yield running
36 |             return
37 | 
38 |         # Otherwise, update the running stream and look for instances of
39 |         # the delimiter.  Remember we might have read more than one delimiter
40 |         # since the last time we checked
41 |         running += new_data
42 |         while delimiter in running:
43 |             curr, running = running.split(delimiter, 1)
44 |             yield curr + delimiter
45 | 


--------------------------------------------------------------------------------
/test_lazyreader.py:
--------------------------------------------------------------------------------
 1 | # -*- encoding: utf-8 -*-
 2 | 
 3 | import boto3
 4 | import moto
 5 | import pytest
 6 | 
 7 | from lazyreader import lazyread
 8 | 
 9 | 
10 | class Readable(object):
11 |     def __init__(self, body):
12 |         self.body = body
13 |         self.position = 0
14 | 
15 |     def read(self, size):
16 |         retval = self.body[self.position:self.position + size]
17 |         self.position += size
18 |         return retval
19 | 
20 | 
21 | def test_read_single_line_with_binary_file():
22 |     with open('README.rst', 'rb') as f:
23 |         assert next(lazyread(f, delimiter=b'\n')) == b'lazyreader\n'
24 | 
25 | 
26 | def test_read_single_line_with_unicode_file():
27 |     with open('README.rst', 'r') as f:
28 |         assert next(lazyread(f, delimiter='\n')) == 'lazyreader\n'
29 | 
30 | 
31 | def test_can_read_entire_file():
32 |     expected = [
33 |         'A triplet of lines;',
34 |         'separated by semicolons;',
35 |         'not newlines'
36 |     ]
37 |     f = Readable(''.join(expected))
38 |     assert list(lazyread(f, delimiter=';')) == expected
39 | 
40 | 
41 | @moto.mock_s3
42 | def test_can_read_from_s3():
43 |     s3 = boto3.client('s3')
44 |     s3.create_bucket(Bucket='bukkit')
45 |     s3.put_object(Bucket='bukkit', Key='long_file.txt', Body=b'foo\nbar\nbaz')
46 | 
47 |     f = s3.get_object(Bucket='bukkit', Key='long_file.txt')['Body']
48 |     assert list(lazyread(f, delimiter=b'\n')) == [b'foo\n', b'bar\n', b'baz']
49 | 
50 | 
51 | def test_unexpected_error_is_still_raised():
52 |     class BrokenReadable(object):
53 |         def read(self, size):
54 |             raise RuntimeError("Reading from this object isn't allowed!")
55 | 
56 |     f = BrokenReadable()
57 | 
58 |     with pytest.raises(RuntimeError):
59 |         next(lazyread(f, delimiter=b'\n'))
60 | 


--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
  1 | lazyreader
  2 | ==========
  3 | 
  4 | lazyreader is a Python module for doing lazy reading of file objects.
  5 | 
  6 | The Python standard library lets you read a file one line at a time, saving you from loading the entire file into memory.
  7 | For example:
  8 | 
  9 | .. code-block:: python
 10 | 
 11 |    with open('large_file.txt') as f:
 12 |        for line in f:
 13 |            print(line)
 14 | 
 15 | lazyreader lets you do the same thing, but with an arbitrary delimiter, and for any object that presents a ``.read()`` method.
 16 | For example:
 17 | 
 18 | .. code-block:: python
 19 | 
 20 |    from lazyreader import lazyread
 21 | 
 22 |    with open('large_file.txt') as f:
 23 |        for doc in lazyread(f, delimiter=';'):
 24 |            print(doc)
 25 | 
 26 | This is a snippet of code I spun out from the `Wellcome Digital Platform <https://github.com/wellcometrust/platform-api>`_.
 27 | We have large XML and JSON files stored in S3 -- sometimes multiple GBs -- but each file is really a series of "documents", separated by known delimiters.
 28 | Downloading and parsing the entire file would be prohibitively expensive, but lazyreader allows us to hold just a single document in memory at a time.
 29 | 
 30 | Installation
 31 | ************
 32 | 
 33 | lazyreader is available from PyPI:
 34 | 
 35 | .. code-block:: console
 36 | 
 37 |    $ pip install lazyreader
 38 | 
 39 | Examples
 40 | ********
 41 | 
 42 | If we have a file stored locally, we can open it and split based on any choice of delimiter.
 43 | For example, if we had a text file in which records were separated by commas:
 44 | 
 45 | .. code-block:: python
 46 | 
 47 |    with open('lots_of_records.txt') as f:
 48 |        for doc in lazyread(f, delimiter=','):
 49 |            print(doc)
 50 | 
 51 | Another example: we have a file stored in Amazon S3, and we'd like to read it line-by-line.
 52 | The `boto3 <https://boto3.readthedocs.io/en/stable/>`_ API gives us a file object for reading from S3:
 53 | 
 54 | .. code-block:: python
 55 | 
 56 |    import boto3
 57 | 
 58 |    client = boto3.client('s3')
 59 |    s3_object = client.get_object(Bucket='example-bucket', Key='words.txt')
 60 |    body = s3_object['Body']
 61 | 
 62 |    for doc in lazyread(body, delimiter=b'\n'):
 63 |        print(doc)
 64 | 
 65 | (This is the use case for which this code was originally written.)
 66 | 
 67 | One more example: we're fetching an HTML page, and want to read lines separated by ``<br>`` in the underlying HTML.
 68 | Like so:
 69 | 
 70 | .. code-block:: python
 71 | 
 72 |    import urllib.request
 73 | 
 74 |    with urllib.request.urlopen('https://example.org/') as f:
 75 |        for doc in lazyread(f, delimiter=b'<br>'):
 76 |            print(doc)
 77 | 
 78 | Advanced usage
 79 | **************
 80 | 
 81 | ``lazyread()`` returns a generator, which you can wrap to build a pipeline of generators that process the data.
 82 | 
 83 | First example: we have a file which contains a list of JSON objects, one per line.
 84 | (This is the format of output files from `elasticdump <https://github.com/taskrabbit/elasticsearch-dump>`_.)
 85 | What the caller really needs is Python dictionaries, not JSON strings.
 86 | We can wrap ``lazyread()`` like so:
 87 | 
 88 | .. code-block:: python
 89 | 
 90 |    import json
 91 | 
 92 |    def lazyjson(f, delimiter=b'\n'):
 93 |        for doc in lazyread(f, delimiter=delimiter):
 94 | 
 95 |            # Ignore empty lines, e.g. the last line in a file
 96 |            if not doc.strip():
 97 |                continue
 98 | 
 99 |            yield json.loads(doc)
100 | 
101 | Another example: we want to parse a large XML file, but not load it all into memory at once.
102 | We can write the following wrapper:
103 | 
104 | .. code-block:: python
105 | 
106 |    from lxml import etree
107 | 
108 |    def lazyxmlstrings(f, opening_tag, closing_tag):
109 |        for doc in lazyread(f, delimiter=closing_tag):
110 |            if opening_tag not in doc:
111 |                continue
112 | 
113 |            # We want complete XML blocks, so look for the opening tag and
114 |            # just return its contents
115 |            block = doc.split(opening_tag)[-1]
116 |            yield opening_tag + block
117 | 
118 |    def lazyxml(f, opening_tag, closing_tag):
119 |        for xml_string in lazyxmlstrings(f, opening_tag, closing_tag):
120 |            yield etree.fromstring(xml_string)
121 | 
122 | We use both of these wrappers at Wellcome to do efficient processing of large files that are kept in Amazon S3.
123 | 
124 | Isn't this a bit simple to be a module?
125 | ***************************************
126 | 
127 | Maybe.
128 | There are recipes on Stack Overflow that do something very similar, but I find it useful to have in a standalone module.
129 | 
130 | And it's not completely trivial -- at least, not for me.
131 | I made two mistakes when I first wrote this:
132 | 
133 | *  I was hard-coding the initial running string as
134 | 
135 |    .. code-block:: python
136 | 
137 |       running = b''
138 | 
139 |    That only works if your file object is returning bytestrings.
140 |    If it's returning Unicode strings, you get a ``TypeError`` (``can't concat bytes to str``) when the code first tries to append newly read data to the running string.
141 |    String types are important!
142 | 
143 | *  After I'd read another 1024 characters from the file, I checked for the delimiter like so:
144 | 
145 |    .. code-block:: python
146 | 
147 |       running += new_data
148 |       if delimiter in running:
149 |           curr, running = running.split(delimiter)
150 |           yield curr + delimiter
151 | 
152 |    For my initial use case, individual documents were *much* bigger than 1024 characters, so the new data would never contain multiple delimiters.
153 |    But with smaller documents, you might get multiple delimiters in one read, and then unpacking the result of ``.split()`` would throw a ``ValueError``.
154 |    So now the code correctly checks and handles the case where a single read includes more than one delimiter.
155 | 
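As a rough illustration of both fixes, the sketch below inlines a condensed copy of ``lazyread()`` (minus the boto3 special case) so that it runs on its own, then feeds it short in-memory text and byte streams. The ``io.StringIO`` and ``io.BytesIO`` inputs are assumptions chosen for the demonstration, not part of the module's own tests:

```python
import io


def lazyread(f, delimiter):
    # Condensed copy of lazyreader.lazyread, without the boto3 special case,
    # so that this snippet is self-contained.
    # Reading zero bytes gives an empty value of the stream's own string type.
    running = f.read(0)
    while True:
        new_data = f.read(1024)
        if not new_data:
            # End of stream: emit whatever is left over.
            yield running
            return
        running += new_data
        # A single read may contain several delimiters, so keep splitting.
        while delimiter in running:
            curr, running = running.split(delimiter, 1)
            yield curr + delimiter


# Three short records arrive in a single 1024-character read.
text_docs = list(lazyread(io.StringIO(u'a;b;c'), delimiter=u';'))
byte_docs = list(lazyread(io.BytesIO(b'a;b;c'), delimiter=b';'))

print(text_docs)
print(byte_docs)
```

Each stream yields ``a;``, ``b;`` and a final ``c``, in the string type that matches the stream, even though all three records fit inside one read.
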
156 | Now that it's encoded and tested in a module, I don't have to worry about making the same mistakes again.
157 | 
158 | License
159 | *******
160 | 
161 | MIT.
162 | 


--------------------------------------------------------------------------------