├── LICENSE
├── README
└── unio.py


/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2014 by Armin Ronacher.

Some rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the following
      disclaimer in the documentation and/or other materials provided
      with the distribution.

    * The names of the contributors may not be used to endorse or
      promote products derived from this software without specific
      prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README:
--------------------------------------------------------------------------------
unIO!

unio is a Python module that tries to put an end to all unicode
problems.
It provides explicit access to binary and unicode data on both Python 2
and 3 that works exactly the same on both systems and considerably
better than the builtin defaults.

What it gives you:

  * explicit access to stdin/stdout/stderr in both binary and text mode
  * improved file open functions that work the same on 2.x and 3.x,
    handle all encodings for you and come with sensible defaults
  * helper functions to deal with all of the crazy Python 3 unicode
    encoding edge cases

Basic API:

  unio.TextIO:
    gives the most appropriate in-memory text io (accepts only unicode)
  unio.BytesIO:
    gives the most appropriate in-memory bytes io (accepts only bytes)
  unio.NativeIO:
    gives the most appropriate in-memory IO for the system. That's
    unicode only on Python 3 and bytes + unicode within reason on
    Python 2.

  unio.get_binary_stdin()
  unio.get_binary_stdout()
  unio.get_binary_stderr()
    Does what the name says, on all platforms.

  unio.get_text_stdin()
  unio.get_text_stdout()
  unio.get_text_stderr()
    Returns a standard stream wrapped in a way that it yields unicode data.
    It will do that in the most appropriate encoding and intelligently fix
    some broken environments to utf-8. You can also force the encoding to
    be something of your choice this way.

  unio.capture_stdout()
    Captures stdout (and optionally stderr) in a bytes io and provides some
    fixes for Python 3 limitations on flushing.

  unio.get_binary_argv()
    Returns a copy of sys.argv fixed up to bytes on all versions of Python.

  unio.binary_env
    A byte version of os.environ on all versions of Python.

  unio.open()
    Intelligently opens a file in binary or text mode, with sane default
    encodings and the same behavior on 2.x and 3.x. The default encoding
    for text files is utf-8 when the mode contains 'w' and utf-8-sig
    otherwise.
    No magic environment defaults.

  unio.is_ascii_encoding()
    Checks if a given encoding is ascii or not.

  unio.get_filesystem_encoding()
    Like sys.getfilesystemencoding() but will intelligently assume utf-8 in
    situations where it assumes a broken environment.

--------------------------------------------------------------------------------
/unio.py:
--------------------------------------------------------------------------------
import io
import os
import sys
import codecs
import contextlib


# We do not trust traditional unixes about having reliable file systems.
# In that case we know better than what the env says and declare this to
# be utf-8 always.
has_likely_buggy_unicode_filesystem = \
    sys.platform.startswith('linux') or 'bsd' in sys.platform


def is_ascii_encoding(encoding):
    """Given an encoding this figures out if the encoding is actually ASCII
    (which is something we don't actually want in most cases). This is
    necessary because ASCII comes under many names such as ANSI_X3.4-1968.
    """
    if encoding is None:
        return False
    try:
        codec = codecs.lookup(encoding)
    except LookupError:
        return False
    return codec.name == 'ascii'


def get_filesystem_encoding():
    """Returns the filesystem encoding that should be used. Note that
    this is different from the Python understanding of the filesystem
    encoding which might be deeply flawed. Do not use this value against
    Python's unicode APIs because it might be different.

    The concept of a filesystem encoding in general is not something
    you should rely on. As such, if you ever need to use this function
    for anything other than writing wrapper code, reconsider.
    """
    if has_likely_buggy_unicode_filesystem:
        return 'utf-8'
    rv = sys.getfilesystemencoding()
    if is_ascii_encoding(rv):
        return 'utf-8'
    return rv


def get_file_encoding(for_writing=False):
    """Returns the encoding for text file data. This is always the same
    on all operating systems because this is the only thing that makes
    sense when wanting to make data exchange feasible. This is utf-8 no
    questions asked. The only simplification is that if a file is opened
    for reading then we allow utf-8-sig.
    """
    if for_writing:
        return 'utf-8'
    return 'utf-8-sig'


def get_std_stream_encoding():
    """Returns the default stream encoding if not found."""
    rv = sys.getdefaultencoding()
    if is_ascii_encoding(rv):
        return 'utf-8'
    return rv


class BrokenEnvironment(Exception):
    """This error is raised on Python 3 if the system was misconfigured
    beyond repair.
    """


class _NonClosingTextIOWrapper(io.TextIOWrapper):
    """Subclass of the wrapper that does not close the underlying file
    in the destructor. This is necessary so that our wrapping of the
    standard streams does not accidentally close the original file.
    """

    def __del__(self):
        pass


class _FixupStream(object):
    """The new io interface needs more from streams than streams
    traditionally implement. As such this fixup stuff is necessary in
    some circumstances.
    """

    def __init__(self, stream):
        self._stream = stream

    def __getattr__(self, name):
        return getattr(self._stream, name)

    def readable(self):
        x = getattr(self._stream, 'readable', None)
        if x is not None:
            # Call the underlying method; returning the bound method
            # itself would always be truthy.
            return x()
        try:
            self._stream.read(0)
        except Exception:
            return False
        return True

    def writable(self):
        x = getattr(self._stream, 'writable', None)
        if x is not None:
            return x()
        try:
            self._stream.write('')
        except Exception:
            try:
                self._stream.write(b'')
            except Exception:
                return False
        return True

    def seekable(self):
        x = getattr(self._stream, 'seekable', None)
        if x is not None:
            return x()
        try:
            self._stream.seek(self._stream.tell())
        except Exception:
            return False
        return True


PY2 = sys.version_info[0] == 2
if PY2:
    import StringIO
    text_type = unicode

    TextIO = io.StringIO
    BytesIO = io.BytesIO
    NativeIO = StringIO.StringIO

    def _make_text_stream(stream, encoding, errors):
        if encoding is None:
            encoding = get_std_stream_encoding()
        if errors is None:
            errors = 'replace'
        return _NonClosingTextIOWrapper(_FixupStream(stream), encoding, errors)

    def get_binary_stdin():
        return sys.stdin

    def get_binary_stdout():
        return sys.stdout

    def get_binary_stderr():
        return sys.stderr

    def get_binary_argv():
        return list(sys.argv)

    def get_text_stdin(encoding=None, errors=None):
        return _make_text_stream(sys.stdin, encoding, errors)

    def get_text_stdout(encoding=None, errors=None):
        return _make_text_stream(sys.stdout, encoding, errors)

    def get_text_stderr(encoding=None, errors=None):
        return _make_text_stream(sys.stderr, encoding, errors)

    @contextlib.contextmanager
    def wrap_standard_stream(stream_type, stream):
        if stream_type not in ('stdin', 'stdout', 'stderr'):
            raise TypeError('Invalid stream %s' % stream_type)
        old_stream = getattr(sys, stream_type)
        setattr(sys, stream_type, stream)
        try:
            yield stream
        finally:
            setattr(sys, stream_type, old_stream)

    @contextlib.contextmanager
    def capture_stdout(and_stderr=False):
        stream = StringIO.StringIO()
        old_stdout = sys.stdout
        old_stderr = sys.stderr
        sys.stdout = stream
        if and_stderr:
            sys.stderr = stream
        try:
            yield stream
        finally:
            sys.stdout = old_stdout
            if and_stderr:
                sys.stderr = old_stderr

    binary_env = os.environ
else:
    text_type = str

    TextIO = io.StringIO
    BytesIO = io.BytesIO
    NativeIO = io.StringIO

    def _is_binary_reader(stream, default=False):
        try:
            return isinstance(stream.read(0), bytes)
        except Exception:
            # This happens in some cases where the stream was already
            # closed. In this case we assume the default.
            return default

    def _is_binary_writer(stream, default=False):
        try:
            stream.write(b'')
        except Exception:
            try:
                stream.write('')
                return False
            except Exception:
                pass
            return default
        return True

    def _find_binary_reader(stream):
        # We need to figure out if the given stream is already binary.
        # This can happen because the official docs recommend detaching
        # the streams to get binary streams. Some code might do this, so
        # we need to deal with this case explicitly.
        is_binary = _is_binary_reader(stream, False)

        if is_binary:
            return stream

        buf = getattr(stream, 'buffer', None)
        # Same situation here, this time we assume that the buffer is
        # actually binary in case it's closed.
        if buf is not None and _is_binary_reader(buf, True):
            return buf

    def _find_binary_writer(stream):
        # We need to figure out if the given stream is already binary.
        # This can happen because the official docs recommend detaching
        # the streams to get binary streams. Some code might do this, so
        # we need to deal with this case explicitly.
        if _is_binary_writer(stream, False):
            return stream

        buf = getattr(stream, 'buffer', None)

        # Same situation here, this time we assume that the buffer is
        # actually binary in case it's closed. The buffer has to be
        # probed as a writer, not as a reader.
        if buf is not None and _is_binary_writer(buf, True):
            return buf

    def _stream_is_misconfigured(stream):
        """A stream is misconfigured if its encoding is ASCII."""
        return is_ascii_encoding(getattr(stream, 'encoding', None))

    def _wrap_stream_for_text(stream, encoding, errors):
        if errors is None:
            errors = 'replace'
        if encoding is None:
            encoding = get_std_stream_encoding()
        return _NonClosingTextIOWrapper(_FixupStream(stream), encoding, errors)

    def _is_compatible_text_stream(stream, encoding, errors):
        stream_encoding = getattr(stream, 'encoding', None)
        stream_errors = getattr(stream, 'errors', None)

        # Perfect match.
        if stream_encoding == encoding and stream_errors == errors:
            return True

        # Otherwise it's only a compatible stream if we did not ask for
        # an encoding.
        if encoding is None:
            return stream_encoding is not None

        return False

    def _force_correct_text_reader(text_reader, encoding, errors):
        if _is_binary_reader(text_reader, False):
            binary_reader = text_reader
        else:
            # If there is no target encoding set we need to verify that the
            # reader is actually not misconfigured.
            if encoding is None and not _stream_is_misconfigured(text_reader):
                return text_reader

            if _is_compatible_text_stream(text_reader, encoding, errors):
                return text_reader

            # If the reader has no encoding we try to find the underlying
            # binary reader for it. If that fails because the environment is
            # misconfigured, we silently go with the same reader because this
            # situation is too common to treat as an error. In that case
            # mojibake is better than exceptions.
            binary_reader = _find_binary_reader(text_reader)
            if binary_reader is None:
                return text_reader

        # At this point we default the errors to replace instead of strict
        # because nobody handles those errors anyways and at this point
        # we're so fundamentally fucked that nothing can repair it.
        if errors is None:
            errors = 'replace'
        return _wrap_stream_for_text(binary_reader, encoding, errors)

    def _force_correct_text_writer(text_writer, encoding, errors):
        if _is_binary_writer(text_writer, False):
            binary_writer = text_writer
        else:
            # If there is no target encoding set we need to verify that the
            # writer is actually not misconfigured.
            if encoding is None and not _stream_is_misconfigured(text_writer):
                return text_writer

            if _is_compatible_text_stream(text_writer, encoding, errors):
                return text_writer

            # If the writer has no encoding we try to find the underlying
            # binary writer for it. If that fails because the environment is
            # misconfigured, we silently go with the same writer because this
            # situation is too common to treat as an error. In that case
            # mojibake is better than exceptions.
            binary_writer = _find_binary_writer(text_writer)
            if binary_writer is None:
                return text_writer

        # At this point we default the errors to replace instead of strict
        # because nobody handles those errors anyways and at this point
        # we're so fundamentally fucked that nothing can repair it.
        if errors is None:
            errors = 'replace'
        return _wrap_stream_for_text(binary_writer, encoding, errors)

    def get_binary_stdin():
        reader = _find_binary_reader(sys.stdin)
        if reader is None:
            raise BrokenEnvironment('Was not able to determine binary '
                                    'stream for sys.stdin.')
        return reader

    def get_binary_stdout():
        writer = _find_binary_writer(sys.stdout)
        if writer is None:
            raise BrokenEnvironment('Was not able to determine binary '
                                    'stream for sys.stdout.')
        return writer

    def get_binary_stderr():
        writer = _find_binary_writer(sys.stderr)
        if writer is None:
            raise BrokenEnvironment('Was not able to determine binary '
                                    'stream for sys.stderr.')
        return writer

    def get_text_stdin(encoding=None, errors=None):
        return _force_correct_text_reader(sys.stdin, encoding, errors)

    def get_text_stdout(encoding=None, errors=None):
        return _force_correct_text_writer(sys.stdout, encoding, errors)

    def get_text_stderr(encoding=None, errors=None):
        return _force_correct_text_writer(sys.stderr, encoding, errors)

    def get_binary_argv():
        fs_enc = sys.getfilesystemencoding()
        return [x.encode(fs_enc, 'surrogateescape') for x in sys.argv]

    binary_env = os.environb

    @contextlib.contextmanager
    def wrap_standard_stream(stream_type, stream):
        old_stream = getattr(sys, stream_type, None)
        if stream_type == 'stdin':
            if _is_binary_reader(stream):
                raise TypeError('Standard input stream cannot be set to a '
                                'binary reader directly.')
            if _find_binary_reader(stream) is None:
                raise TypeError('Standard input stream needs to be backed '
                                'by a binary stream.')
        elif stream_type in ('stdout', 'stderr'):
            if _is_binary_writer(stream):
                raise TypeError('Standard output stream cannot be set to a '
                                'binary writer directly.')
            if _find_binary_writer(stream) is None:
                raise TypeError('Standard output and error streams need '
                                'to be backed by a binary stream.')
        else:
            raise TypeError('Invalid stream %s' % stream_type)
        setattr(sys, stream_type, stream)
        try:
            yield old_stream
        finally:
            setattr(sys, stream_type, old_stream)

    class _CapturedStream(object):
        """A helper that flushes before getvalue() to fix a few oddities
        on Python 3.
        """

        def __init__(self, stream):
            self._stream = stream

        def __getattr__(self, name):
            return getattr(self._stream, name)

        def getvalue(self):
            self._stream.flush()
            return self._stream.buffer.getvalue()

        def __repr__(self):
            return repr(self._stream)

    @contextlib.contextmanager
    def capture_stdout(and_stderr=False):
        """Captures stdout and yields the new bytes stream that backs it.
        It also wraps it in a fake object that flushes on getting the
        underlying value.
        """
        ll_stream = io.BytesIO()
        stream = _NonClosingTextIOWrapper(ll_stream, sys.stdout.encoding,
                                          sys.stdout.errors)
        old_stdout = sys.stdout
        sys.stdout = stream

        if and_stderr:
            old_stderr = sys.stderr
            sys.stderr = stream

        try:
            yield _CapturedStream(stream)
        finally:
            stream.flush()
            sys.stdout = old_stdout
            if and_stderr:
                sys.stderr = old_stderr


def _fixup_path(path):
    if has_likely_buggy_unicode_filesystem \
       and isinstance(path, text_type):
        if PY2:
            path = path.encode(get_filesystem_encoding())
        else:
            path = path.encode(get_filesystem_encoding(),
                               'surrogateescape')
    return path


def open(filename, mode='r', encoding=None, errors=None):
    """Opens a file either in text or binary mode. Unless an encoding is
    passed explicitly, text files use the encoding returned by
    get_file_encoding().
    """
    filename = _fixup_path(filename)
    # Only fall back to the default when the caller did not pick an
    # encoding; otherwise an explicit encoding would be silently ignored.
    if encoding is None and 'b' not in mode:
        encoding = get_file_encoding('w' in mode)
    if encoding is not None:
        return io.open(filename, mode, encoding=encoding, errors=errors)
    return io.open(filename, mode)

--------------------------------------------------------------------------------
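
Usage sketch (not part of the repository): the module above relies on two
ideas, detecting ASCII under any of its aliases and falling back to utf-8,
and reading text as utf-8-sig so that a leading BOM is dropped. The snippet
below restates those ideas with only the standard library so it runs without
unio installed. It assumes Python 3. `pick_stream_encoding` and `read_text`
are hypothetical helper names introduced for illustration; only
`is_ascii_encoding` mirrors a function from unio.py.

```python
import codecs
import io


def is_ascii_encoding(encoding):
    # Same check as in unio.py: resolve the codec name so that aliases
    # such as ANSI_X3.4-1968 are recognised as ASCII.
    if encoding is None:
        return False
    try:
        return codecs.lookup(encoding).name == 'ascii'
    except LookupError:
        return False


def pick_stream_encoding(reported):
    # Hypothetical helper: distrust an environment that reports ASCII and
    # use utf-8 instead, otherwise keep what was reported.
    if is_ascii_encoding(reported):
        return 'utf-8'
    return reported


def read_text(data):
    # Hypothetical helper: wrap raw bytes in a text layer the way the
    # module wraps binary streams, using utf-8-sig for reading and
    # 'replace' so undecodable bytes do not raise.
    wrapper = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8-sig',
                               errors='replace')
    return wrapper.read()


if __name__ == '__main__':
    print(pick_stream_encoding('ANSI_X3.4-1968'))
    print(read_text(b'\xef\xbb\xbfhello'))
```

The fallback is deliberately one-directional: a non-ASCII encoding reported
by the environment is passed through untouched, and only ASCII is overridden.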