├── README.md
├── README_CN.md
└── images
    ├── pycharm-type-hinting.png
    └── variable_annotations.png

/README.md:
--------------------------------------------------------------------------------
# We made it!

*Update (Jan 2020)*.
Python 2 is now officially retired. Thanks to everyone for making this hard transition to better code happen!

# Migrating to Python 3 with pleasure
## A short guide on features of Python 3 for data scientists

Python has become a mainstream language for machine learning and other scientific fields that work heavily with data;
it boasts various deep learning frameworks and a well-established set of tools for data processing and visualization.

However, the Python ecosystem is split between Python 2 and Python 3, and Python 2 is still used among data scientists.
By the end of 2019 the scientific stack will [stop supporting Python2](http://www.python3statement.org).
As for numpy, after 2018 any new feature releases will only support [Python3](https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst). *Update (Sep 2018): same story now with pandas, matplotlib, ipython, jupyter notebook and jupyter lab.*

To make the transition less frustrating, I've collected a bunch of Python 3 features that you may find useful.

Image from [Dario Bertini post (toptal)](https://www.toptal.com/python/python-3-is-it-worth-the-switch)

## Better paths handling with `pathlib`

`pathlib` is a standard-library module in Python 3 that helps you avoid tons of `os.path.join`s:

```python
from pathlib import Path

dataset = 'wiki_images'
datasets_root = Path('/path/to/datasets/')

train_path = datasets_root / dataset / 'train'
test_path = datasets_root / dataset / 'test'

for image_path in train_path.iterdir():
    with image_path.open('rb') as f:  # note, open is a method of Path object
        pass  # do something with an image
```

Previously it was always tempting to use string concatenation (concise, but obviously bad);
now with `pathlib` the code is safe, concise, and readable.

Also, `pathlib.Path` has a bunch of methods and properties that every Python novice previously had to google:

```python
p.exists()
p.is_dir()
p.parts
p.with_name('sibling.png')  # only change the name, but keep the folder
p.with_suffix('.jpg')  # only change the extension, but keep the folder and the name
p.chmod(mode)
p.rmdir()
```

`pathlib` should save you lots of time,
please see [docs](https://docs.python.org/3/library/pathlib.html) and [reference](https://pymotw.com/3/pathlib/) for more.


## Type hinting is now part of the language

Example of type hinting in pycharm:

Python is not just a language for small scripts anymore;
data pipelines these days include numerous steps, each involving different frameworks (and sometimes very different logic).

Type hinting was introduced to help with the growing complexity of programs, so machines could help with code verification.
Previously, different modules used custom ways to specify [types in docstrings](https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html#legacy)
(Hint: pycharm can convert old docstrings to fresh type hinting).

As a simple example, the following code may work with different types of data (that's what we like about the python data stack).
```python
import numpy

def repeat_each_entry(data):
    """ Each entry in the data is doubled
    <blah blah nobody reads the documentation till the end>
    """
    index = numpy.repeat(numpy.arange(len(data)), 2)
    return data[index]
```

This code e.g. works for `numpy.array` (incl. multidimensional ones), `astropy.Table` and `astropy.Column`, `bcolz`, `cupy`, `mxnet.ndarray` and others.

This code will work for `pandas.Series`, but in the wrong way:
```python
repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5]))  # returns Series with Nones inside
```

This was two lines of code. Imagine how unpredictable the behavior of a complex system becomes when just one function may misbehave.
Stating explicitly which types a method expects is very helpful in large systems: a checker will warn you if a function is passed unexpected arguments.

```python
from typing import Union

def repeat_each_entry(data: Union[numpy.ndarray, bcolz.carray]):
    ...
```

If you have a significant codebase, hinting tools like [MyPy](http://mypy.readthedocs.io) are likely to become part of your continuous integration pipeline.
A webinar ["Putting Type Hints to Work"](https://www.youtube.com/watch?v=JqBCFfiE11g) by Daniel Pyrathon is good for a brief introduction.
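
To make the point about silent misbehavior concrete without any third-party dependency, here is a minimal sketch. The function `scale` and its arguments are illustrative names, not taken from any library. The second call runs without an error at runtime, while a static checker such as MyPy is expected to report the mismatched argument type before the code is ever executed.

```python
from typing import List


def scale(values: List[float], factor: float) -> List[float]:
    """Multiply every entry by `factor`."""
    return [value * factor for value in values]


print(scale([1.0, 2.5], 2.0))  # [2.0, 5.0]

# Hints are not checked at runtime, so this call runs and silently
# repeats the string instead of scaling a number.
print(scale(['ab'], 2))  # ['abab']
```

Running a checker over a file like this is what turns the annotations from documentation into an automated test of call sites.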

Sidenote: unfortunately, hinting is not yet powerful enough to provide fine-grained typing for ndarrays/tensors, but [maybe we'll have it someday](https://github.com/numpy/numpy/issues/7370), and this will be a great feature for DS.

## Type hinting → type checking at runtime

By default, function annotations do not influence how your code works; they merely document the intent of your code.

However, you can enforce type checking at runtime with tools like ... [enforce](https://github.com/RussBaz/enforce),
this can help you in debugging (there are many cases when type hinting is not working).

```python
from typing import List
import enforce

@enforce.runtime_validation
def foo(text: str) -> None:
    print(text)

foo('Hi')  # ok
foo(5)     # fails


@enforce.runtime_validation
def any2(x: List[bool]) -> bool:
    return any(x)

any ([False, False, True, False])  # True
any2([False, False, True, False])  # True

any (['False'])  # True
any2(['False'])  # fails

any ([False, None, "", 0])  # False
any2([False, None, "", 0])  # fails

```

## Other usages of function annotations

*Update: starting from python 3.7 this behavior was [deprecated](https://www.python.org/dev/peps/pep-0563/#non-typing-usage-of-annotations), and function annotations should be used for type hinting only. Python 4 will not support other usages of annotations.*

As mentioned before, annotations do not influence code execution, but rather provide some meta-information,
and you can use it as you wish.
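
As a rough sketch of how that meta-information can be consumed, the hypothetical decorator below (`coerce_arguments` is an illustrative name, not an existing library function) reads each parameter's annotation through `inspect.signature` and calls it as a converter. Keep the update above in mind: this kind of non-typing use is discouraged in newer Python versions.

```python
import functools
import inspect


def coerce_arguments(func):
    """Hypothetical decorator: apply each parameter's annotation to its argument."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in list(bound.arguments.items()):
            converter = signature.parameters[name].annotation
            # Skip parameters without an annotation or with a non-callable one.
            if converter is not inspect.Parameter.empty and callable(converter):
                bound.arguments[name] = converter(value)
        return func(*bound.args, **bound.kwargs)

    return wrapper


@coerce_arguments
def total_seconds(minutes: float, seconds: float) -> float:
    return 60 * minutes + seconds


print(total_seconds.__annotations__)  # the raw meta-information
print(total_seconds('2', 30))  # 150.0, both inputs were converted to float
```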

For instance, measurement units are a common pain in scientific areas; the `astropy` package [provides a simple decorator](http://docs.astropy.org/en/stable/units/quantity.html#functions-that-accept-quantities) to control units of input quantities and convert output to required units
```python
# Python 3
from astropy import units as u
@u.quantity_input()
def frequency(speed: u.meter / u.s, wavelength: u.nm) -> u.terahertz:
    return speed / wavelength

frequency(speed=300_000 * u.km / u.s, wavelength=555 * u.nm)
# output: 540.5405405405404 THz, frequency of green visible light
```

If you're processing tabular scientific data in python (not necessarily astronomical), you should give `astropy` a shot.

You can also define your application-specific decorators to perform control / conversion of inputs and output in the same manner.

## Matrix multiplication with @

Let's implement one of the simplest ML models: a linear regression with l2 regularization (a.k.a. ridge regression):

```python
# l2-regularized linear regression: || AX - y ||^2 + alpha * ||X||^2 -> min

# Python 2
X = np.linalg.inv(np.dot(A.T, A) + alpha * np.eye(A.shape[1])).dot(A.T.dot(y))
# Python 3
X = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ (A.T @ y)
```

The code with `@` becomes more readable and more translatable between deep learning frameworks: the same code `X @ W + b[None, :]` for a single layer of perceptron works in `numpy`, `cupy`, `pytorch`, `tensorflow` (and other frameworks that operate with tensors).

## Globbing with `**`

Recursive folder globbing is not easy in Python 2, even though the [glob2](https://github.com/miracle2k/python-glob2) custom module exists that overcomes this.
A recursive flag is supported since Python 3.5:

```python
import glob

# Python 2
found_images = (
    glob.glob('/path/*.jpg')
  + glob.glob('/path/*/*.jpg')
  + glob.glob('/path/*/*/*.jpg')
  + glob.glob('/path/*/*/*/*.jpg')
  + glob.glob('/path/*/*/*/*/*.jpg'))

# Python 3
found_images = glob.glob('/path/**/*.jpg', recursive=True)
```

A better option is to use `pathlib` in python3 (minus one import!):
```python
# Python 3
found_images = pathlib.Path('/path/').glob('**/*.jpg')
```
Note: there are [minor differences](https://github.com/arogozhnikov/python3_with_pleasure/issues/16) between `glob.glob`, `Path.glob` and bash globbing.

## Print is a function now

Yes, code now has these annoying parentheses, but there are some advantages:

- simple syntax for using file descriptor:
    ```python
    print >>sys.stderr, "critical error"      # Python 2
    print("critical error", file=sys.stderr)  # Python 3
    ```
- printing tab-aligned tables without `str.join`:
    ```python
    # Python 3
    print(*array, sep='\t')
    print(batch, epoch, loss, accuracy, time, sep='\t')
    ```
- hacky suppressing / redirection of printing output:
    ```python
    # Python 3
    _print = print  # store the original print function
    def print(*args, **kargs):
        pass  # do something useful, e.g. store output to some file
    ```
    In jupyter it is desirable to log each output to a separate file (to track what's happening after you got disconnected), so you can override `print` now.

    Below you can see a context manager that temporarily overrides the behavior of print:
    ```python
    import builtins
    import contextlib

    @contextlib.contextmanager
    def replace_print():
        _print = print  # saving old print function
        # or use some other function here
        builtins.print = lambda *args, **kwargs: _print('new printing', *args, **kwargs)
        try:
            yield
        finally:
            builtins.print = _print  # restore the original even if the body raises

    with replace_print():
        print('hello')  # code here invokes the replaced print function
    ```
    It is *not* a recommended approach, but a small dirty hack that is now possible.
- `print` can participate in list comprehensions and other language constructs
    ```python
    # Python 3
    result = process(x) if is_valid(x) else print('invalid item: ', x)
    ```


## Underscores in Numeric Literal (Thousands Separator)

[PEP-515](https://www.python.org/dev/peps/pep-0515/ "PEP-515") introduced underscores in Numeric Literals.
In Python 3.6+, underscores can be used to group digits visually in integral, floating-point, and complex number literals.

```python
# grouping decimal numbers by thousands
one_million = 1_000_000

# grouping hexadecimal addresses by words
addr = 0xCAFE_F00D

# grouping bits into nibbles in a binary literal
flags = 0b_0011_1111_0100_1110

# same, for string conversions
flags = int('0b_1111_0000', 2)
```

## f-strings for simple and reliable formatting

The default formatting system provides a flexibility that is not required in data experiments.
The resulting code is either too verbose or too fragile towards any changes.

Quite typically, a data scientist outputs some logging information iteratively in a fixed format.
It is common to have code like:

```python
# Python 2
print '{batch:3} {epoch:3} / {total_epochs:3} accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'.format(
    batch=batch, epoch=epoch, total_epochs=total_epochs,
    acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
    avg_time=time / len(data_batch)
)

# Python 2 (too error-prone during fast modifications, please avoid):
print '{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
    batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
    time / len(data_batch)
)
```

Sample output:
```
120 12 / 300 accuracy: 0.8180±0.4649 time: 56.60
```

**f-strings** aka formatted string literals were introduced in Python 3.6:
```python
# Python 3.6+
print(f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')
```


## Explicit difference between 'true division' and 'floor division'

For data science this is definitely a handy change:

```python
data = pandas.read_csv('timing.csv')
velocity = data['distance'] / data['time']
```

Results in Python 2 depend on whether 'time' and 'distance' (e.g. measured in meters and seconds) are stored as integers.
In Python 3, the result is correct in both cases, because the result of division is float.

Another case is floor division, which is now an explicit operation:

```python
n_gifts = money // gift_price  # correct for int and float arguments
```

In a nutshell:

```python
>>> from operator import truediv, floordiv
>>> truediv.__doc__, floordiv.__doc__
('truediv(a, b) -- Same as a / b.', 'floordiv(a, b) -- Same as a // b.')
>>> (3 / 2), (3 // 2), (3.0 // 2.0)
(1.5, 1, 1.0)
```

Note that this applies both to built-in types and to custom types provided by data packages (e.g. `numpy` or `pandas`).


## Strict ordering

```python
# All these comparisons are illegal in Python 3
3 < '3'
2 < None
(3, 4) < (3, None)
(4, 5) < [4, 5]

# False in both Python 2 and Python 3
(4, 5) == [4, 5]
```

- prevents accidental sorting of instances of different types
    ```python
    sorted([2, '1', 3])  # invalid for Python 3, in Python 2 returns [2, 3, '1']
    ```
- helps to spot some problems that arise when processing raw data

Sidenote: proper check for None is (in both Python versions)
```python
if a is not None:
    pass

if a:  # WRONG check for None
    pass
```


## Unicode for NLP

```python
s = '您好'
print(len(s))
print(s[:2])
```
Output:
- Python 2: `6\n��`
- Python 3: `2\n您好`.

```python
x = u'со'
x += 'co'  # ok
x += 'со'  # fail
```
Python 2 fails, Python 3 works as expected (because I've used Russian letters in strings).

In Python 3 `str`s are unicode strings, and it is more convenient for NLP processing of non-English texts.
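
The `6` printed by Python 2 above is the number of UTF-8 bytes rather than the number of characters. A small sketch in Python 3 makes the two views explicit (the variable names are illustrative):

```python
text = '您好'
encoded = text.encode('utf-8')  # bytes, close to what a Python 2 `str` held

print(len(text))     # 2 characters
print(len(encoded))  # 6 bytes, three per character in UTF-8

# Slicing bytes can cut a character in half; slicing str never does.
print(encoded[:2])
print(text[:1])

# Decoding the full byte string gives the original text back.
assert encoded.decode('utf-8') == text
```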

There are other funny things, for instance:
```python
'a' < type < u'a'  # Python 2: True
'a' < u'a'         # Python 2: False
```

```python
from collections import Counter
Counter('Möbelstück')
```

- Python 2: `Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})`
- Python 3: `Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})`

You can handle all of this in Python 2 properly, but Python 3 is more friendly.

## Preserving order of dictionaries and `**kwargs`

In CPython 3.6+ dicts behave like `OrderedDict` by default (and [this is guaranteed in Python 3.7+](https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6)).
This preserves order during dict comprehensions (and other operations, e.g. during json serialization/deserialization)

```python
import json
x = {str(i):i for i in range(5)}
json.loads(json.dumps(x))
# Python 2
{u'1': 1, u'0': 0, u'3': 3, u'2': 2, u'4': 4}
# Python 3
{'0': 0, '1': 1, '2': 2, '3': 3, '4': 4}
```

The same applies to `**kwargs` (in Python 3.6+): they're kept in the same order as they appear in parameters.
Order is crucial when it comes to data pipelines; previously we had to write it in a cumbersome manner:
```python
from collections import OrderedDict
from torch import nn

# Python 2
model = nn.Sequential(OrderedDict([
          ('conv1', nn.Conv2d(1,20,5)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Conv2d(20,64,5)),
          ('relu2', nn.ReLU())
        ]))

# Python 3.6+, how it *can* be done, not supported right now in pytorch
model = nn.Sequential(
    conv1=nn.Conv2d(1,20,5),
    relu1=nn.ReLU(),
    conv2=nn.Conv2d(20,64,5),
    relu2=nn.ReLU()
)
```

Did you notice?
Uniqueness of names is also checked automatically.


## Iterable unpacking

```python
# handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)

# picking two last values from a sequence
*prev, next_to_last, last = values_history

# This also works with any iterables, so if you have a function that yields e.g. qualities,
# below is a simple way to take only last two values from a list
*prev, next_to_last, last = iter_train(args)
```

## Default pickle engine provides better compression for arrays

Pickling is a mechanism to serialize data, e.g. to pass it between processes; in particular it is used inside the `multiprocessing` package.

```python
# Python 2
import cPickle as pickle
import numpy
print len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
# result: 23691675

# Python 3
import pickle
import numpy
len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
# result: 8000162
```

Three times less space. And it is *much* faster.
Actually similar compression (but not speed) is achievable with the `protocol=2` parameter, but developers typically ignore this option (or simply are not aware of it).

Note: pickle is [not safe](https://docs.python.org/3/library/pickle.html) (and not quite portable), so never unpickle data received from an untrusted or unauthenticated source.
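
The effect of the protocol can be reproduced without numpy. The sketch below is an illustration rather than the benchmark above: it swaps the numpy array for the standard-library `array` module and compares the text-based protocol 0 with the newest binary protocol. Exact sizes depend on the Python version, so none are quoted here.

```python
import pickle
import random
from array import array

# 100,000 doubles occupy 800,000 bytes of raw data.
rng = random.Random(0)
values = array('d', (rng.gauss(0.0, 1.0) for _ in range(100_000)))

text_pickle = pickle.dumps(values, protocol=0)  # legacy text protocol
binary_pickle = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)

print(len(text_pickle), len(binary_pickle))

# Both protocols restore exactly the same data.
restored = pickle.loads(binary_pickle)
assert restored == values
assert pickle.loads(text_pickle) == values
```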

## Safer comprehensions

```python
labels = ...  # some initial value
predictions = [model.predict(data) for data, labels in dataset]

# labels are overwritten in Python 2
# labels are not affected by comprehension in Python 3
```

## Super, simply super()

Python 2 `super(...)` was a frequent source of mistakes in code.

```python
# Python 2
class MySubClass(MySuperClass):
    def __init__(self, name, **options):
        super(MySubClass, self).__init__(name='subclass', **options)

# Python 3
class MySubClass(MySuperClass):
    def __init__(self, name, **options):
        super().__init__(name='subclass', **options)
```

More on `super` and method resolution order on [stackoverflow](https://stackoverflow.com/questions/576169/understanding-python-super-with-init-methods).

## Better IDE suggestions with variable annotations

The most enjoyable thing about programming in languages like Java, C# and the like is that the IDE can make very good suggestions,
because the type of each identifier is known before executing a program.

In python this is hard to achieve, but annotations will help you
- write your expectations in a clear form
- and get good suggestions from the IDE

This is an example of PyCharm suggestions with variable annotations.
This works even in situations when functions you use are not annotated (e.g. due to backward compatibility).

## Multiple unpacking

Here is how you merge two dicts now:
```python
x = dict(a=1, b=2)
y = dict(b=3, d=4)
# Python 3.5+
z = {**x, **y}
# z = {'a': 1, 'b': 3, 'd': 4}, note that value for `b` is taken from the latter dict.
```

See [this thread at StackOverflow](https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression) for a comparison with Python 2.

The same approach also works for lists, tuples, and sets (`a`, `b`, `c` are any iterables):
```python
[*a, *b, *c]  # list, concatenating
(*a, *b, *c)  # tuple, concatenating
{*a, *b, *c}  # set, union
```

Functions also [support multiple unpacking](https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-448) for `*args` and `**kwargs`:
```python
# Python 3.5+
do_something(**{**default_settings, **custom_settings})

# Also possible, this code also checks there is no intersection between keys of dictionaries
do_something(**first_args, **second_args)
```

## Future-proof APIs with keyword-only arguments

Let's consider this snippet
```python
model = sklearn.svm.SVC(2, 'poly', 2, 4, 0.5)
```
Obviously, the author of this code hasn't picked up the Python style of coding yet (most probably, they just jumped from cpp or rust).
Unfortunately, this is not just a question of taste, because changing the order of arguments (adding/deleting) in `SVC` will break this code. In particular, `sklearn` from time to time reorders/renames some of its numerous algorithm parameters to provide a consistent API. Each such refactoring may lead to broken code.

In Python 3, library authors may demand explicitly named parameters by using `*`:
```python
class SVC(BaseSVC):
    def __init__(self, *, C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, ... ):
```
- users have to specify names of parameters `sklearn.svm.SVC(C=2, kernel='poly', degree=2, gamma=4, coef0=0.5)` now
- this mechanism provides a great combination of reliability and flexibility of APIs

## Data classes

Python 3.7 introduces data classes, a good replacement for `namedtuple` in most cases.
```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

@dataclass
class Coder(Person):
    preferred_language: str = 'Python 3'
```

The `dataclass` decorator takes the job of implementing routine methods for you (initialization, representation, comparison, and hashing when applicable).
Let's name some features:
- data classes can be both mutable and immutable
- default values for fields are supported
- inheritance
- data classes are still good old classes: you can define new methods and override existing ones
- post-init processing (e.g. to verify consistency)

Geir Arne Hjelle gives a good overview of dataclasses [in his post](https://realpython.com/python-data-classes/).


## Customizing access to module attributes

In Python you can control attribute access and hinting with `__getattr__` and `__dir__` for any object. Since python 3.7 you can do it for modules too.

A natural example is implementing a `random` submodule of tensor libraries, which is typically a shortcut to skip initialization and passing of RandomState objects.
Here's an implementation for numpy:
```python
# nprandom.py
import numpy
__random_state = numpy.random.RandomState()

def __getattr__(name):
    return getattr(__random_state, name)

def __dir__():
    return dir(__random_state)

def seed(seed):
    global __random_state  # rebind the module-level state, not a local name
    __random_state = numpy.random.RandomState(seed=seed)
```

This way one can also mix functionality of different objects/submodules. Compare with tricks in [pytorch](https://github.com/pytorch/pytorch/blob/3ce17bf8f6a2c4239085191ea60d6ee51cd620a5/torch/__init__.py#L253-L256) and [cupy](https://github.com/cupy/cupy/blob/94592ecac8152d5f4a56a129325cc91d184480ad/cupy/random/distributions.py).

Additionally, now one can
- use it for [lazy loading of submodules](https://snarky.ca/lazy-importing-in-python-3-7/). For example, `import tensorflow` takes **~150MB** of RAM because it imports all submodules (and dependencies).
- use this for [deprecations in API](https://www.python.org/dev/peps/pep-0562/)
- introduce runtime routing between submodules

## Built-in breakpoint()

Just write `breakpoint()` in the code to invoke the debugger.
```python
# Python 3.7+, not all IDEs support this at the moment
foo()
breakpoint()
bar()
```

For remote debugging you may want to try [combining breakpoint() with `web-pdb`](https://hackernoon.com/python-3-7s-new-builtin-breakpoint-a-quick-tour-4f1aebc444c)


## Minor: constants in `math` module

```python
# Python 3
import math

math.inf  # Infinite float
math.nan  # not a number

max_quality = -math.inf  # no more magic initial values!

for model in trained_models:
    max_quality = max(max_quality, compute_quality(model, data))
```

## Minor: single integer type

Python 2 provides two basic integer types, which are `int` (a fixed-size signed integer, typically 64-bit) and `long` for arbitrary-precision arithmetic (quite confusing after C++).

Python 3 has a single type `int`, which incorporates arbitrary-precision arithmetic.

Here is how you check that a value is an integer:

```python
import numbers

isinstance(x, numbers.Integral)  # Python 2, the canonical way
isinstance(x, (long, int))       # Python 2
isinstance(x, int)               # Python 3, easier to remember
```

Update: the first check also works for *other integral types*, such as `numpy.int32`, `numpy.int64`, but the others don't. So they're not equivalent.


## Other stuff

- `Enum`s are theoretically useful, but
    - string-typing is already widely adopted in the python data stack
    - `Enum`s don't seem to interplay with numpy and categorical from pandas
- coroutines also *sound* very promising for data pipelining (see [slides](http://www.dabeaz.com/coroutines/Coroutines.pdf) by David Beazley), but I don't see their adoption in the wild.
- Python 3 has [stable ABI](https://www.python.org/dev/peps/pep-0384/)
- Python 3 supports unicode identifiers (so `ω = Δφ / Δt` is ok), but you'd [better use good old ASCII names](https://stackoverflow.com/a/29855176/498892)
- some libraries e.g. [jupyterhub](https://github.com/jupyterhub/jupyterhub) (jupyter in cloud), django and fresh ipython only support Python 3, so features that sound useless to you are useful for libraries you'll probably want to use someday.


### Problems for code migration specific for data science (and how to resolve those)

- support for nested arguments [was dropped](https://www.python.org/dev/peps/pep-3113/)
    ```python
    map(lambda x, (y, z): x, z, dict.items())
    ```

    However, it is still perfectly working with different comprehensions:
    ```python
    {x:z for x, (y, z) in d.items()}
    ```
    In general, comprehensions are also better 'translatable' between Python 2 and 3.

- `map()`, `.keys()`, `.values()`, `.items()`, etc. return iterators or views, not lists. Main problems with them are:
    - no trivial slicing
    - iterators (e.g. the result of `map()`) can't be iterated twice

    Almost all of the problems are resolved by converting the result to a list.

- see [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/) when in trouble

### Main problems for teaching machine learning and data science with python

Course authors should spend time in the first lectures to explain what an iterator is,
why it can't be sliced / concatenated / multiplied / iterated twice like a string (and how to deal with it).

I think most course authors would be happy to avoid these details, but now it is hardly possible.

# Conclusion

Python 2 and Python 3 have co-existed for almost 10 years, but we *should* move to Python 3.

Research and production code should become a bit shorter, more readable, and significantly safer after moving to a Python 3-only codebase.

Right now most libraries support both Python versions.
And I can't wait for the bright moment when packages drop support for Python 2 and enjoy new language features.

Future migrations are promised to be smoother: ["we will never do this kind of backwards-incompatible change again"](https://snarky.ca/why-python-3-exists/)

### Links

- [Key differences between Python 2.7 and Python 3.x](http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html)
- [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/)
- [10 awesome features of Python that you can't use because you refuse to upgrade to Python 3](http://www.asmeurer.com/python3-presentation/slides.html)
- [Trust me, python 3.3 is better than 2.7 (video)](http://pyvideo.org/pycon-us-2013/python-33-trust-me-its-better-than-27.html)
- [Python 3 for scientists](http://python-3-for-scientists.readthedocs.io/en/latest/)

### License

This text was published by [Alex Rogozhnikov](https://arogozhnikov.github.io/about/) and [contributors](https://github.com/arogozhnikov/python3_with_pleasure/graphs/contributors) under [CC BY-SA 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/) (excluding images).

--------------------------------------------------------------------------------
/README_CN.md:
--------------------------------------------------------------------------------
# Migrating to Python 3 with pleasure
## A short guide on features of Python 3 for data scientists

Python has become the mainstream language for machine learning and other scientific fields that work closely with data; it offers a variety of deep learning frameworks and a well-established set of tools for data processing and visualization.

However, Python 2 and Python 3 coexist in the Python ecosystem, and Python 2 is still used among data scientists. By the end of 2019 the (Python) scientific stack will [stop supporting Python 2](http://www.python3statement.org). As for numpy, after 2018 any new feature releases will only support [Python 3](https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst).

To make this transition a little easier, I've collected some Python 3 features that you may find useful.

Image from [Dario Bertini post (toptal)](https://www.toptal.com/python/python-3-is-it-worth-the-switch)

## Better path handling with `pathlib`

`pathlib` is a default module in Python 3 that helps you avoid heavy use of `os.path.join`:

```python
from pathlib import Path

dataset = 'wiki_images'
datasets_root = Path('/path/to/datasets/')

train_path = datasets_root / dataset / 'train'
test_path = datasets_root / dataset / 'test'

for image_path in train_path.iterdir():
    with image_path.open('rb') as f:  # note, open is a method of Path object
        pass  # do something with an image
```

Also `pathlib.Path` has a bunch of methods and properties that every python novice previously had to google:

```python
p.exists()
p.is_dir()
p.parts
p.with_name('sibling.png')  # only change the name, but keep the folder
p.with_suffix('.jpg')  # only change the extension, but keep the folder and the name
p.chmod(mode)
p.rmdir()
```

`pathlib` should save you lots of time,
please see [docs](https://docs.python.org/3/library/pathlib.html) and [reference](https://pymotw.com/3/pathlib/) for more.


## Type hinting is now part of the language

Example of type hinting in pycharm:

Python is not just a language for small scripts anymore,
data pipelines these days include numerous steps, each involving different frameworks (and sometimes very different logic).

Type hinting was introduced to help with the growing complexity of programs, so machines could help with code verification.
Previously different modules used custom ways to specify [types in docstrings](https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html#legacy)
(Hint: pycharm can convert old docstrings to fresh type hinting).

As a simple example, the following code may work with different types of data (that's what we like about the python data stack).

```python
import numpy

def repeat_each_entry(data):
    """ Each entry in the data is doubled

    """
    index = numpy.repeat(numpy.arange(len(data)), 2)
    return data[index]
```

This code e.g. works for `numpy.array` (incl. multidimensional ones), `astropy.Table` and `astropy.Column`, `bcolz`, `cupy`, `mxnet.ndarray` and others.

This code will work for `pandas.Series`, but in the wrong way:

```python
repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5]))  # returns Series with Nones inside
```

This was two lines of code. Imagine how unpredictable the behavior of a complex system becomes when just one function may misbehave.
Stating explicitly which types a method expects is very helpful in large systems: this will warn you if a function was passed unexpected arguments.

```python
def repeat_each_entry(data: Union[numpy.ndarray, bcolz.carray]):
    ...  # same body as before; requires `from typing import Union`
```
If you have a significant codebase, hinting tools like [MyPy](http://mypy.readthedocs.io) are likely to become part of your continuous integration pipeline. A webinar ["Putting Type Hints to Work"](https://www.youtube.com/watch?v=JqBCFfiE11g) by Daniel Pyrathon is good for a brief introduction.

Sidenote: unfortunately, hinting is not yet powerful enough to provide fine-grained typing for ndarrays/tensors, but [maybe we'll have it one day](https://github.com/numpy/numpy/issues/7370), and this will be a great feature for DS.

## Type hinting → type checking in runtime

By default, function annotations do not influence how your code works, but merely help you to state the intent of your code.

However, you can enforce type checking in runtime with tools like ... [enforce](https://github.com/RussBaz/enforce),
this can help you in debugging (there are many cases when type hinting is not working).

```python
from typing import List
import enforce

@enforce.runtime_validation
def foo(text: str) -> None:
    print(text)

foo('Hi')  # ok
foo(5)     # fails


@enforce.runtime_validation
def any2(x: List[bool]) -> bool:
    return any(x)

any ([False, False, True, False])  # True
any2([False, False, True, False])  # True

any (['False'])  # True
any2(['False'])  # fails

any ([False, None, "", 0])  # False
any2([False, None, "", 0])  # fails

```

## Other usages of function annotations

As mentioned before, annotations do not influence code execution, but rather provide some meta-information,
and you can use it as you wish.

For instance, measurement units are a common pain in scientific areas, and the `astropy` package [provides a simple decorator](http://docs.astropy.org/en/stable/units/quantity.html#functions-that-accept-quantities) to control units of input quantities and convert output to required units.

```python
# Python 3
from astropy import units as u
@u.quantity_input()
def frequency(speed: u.meter / u.s, wavelength: u.m) -> u.terahertz:
    return speed / wavelength

frequency(speed=300_000 * u.km / u.s, wavelength=555 * u.nm)
# output: 540.5405405405404 THz, frequency of green visible light
```

If you're processing tabular scientific data in python (not necessarily astronomical), you should give `astropy` a shot.
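
Annotations are ordinary metadata that a decorator can read at call time. The following is a minimal sketch of that mechanism, not how `astropy` implements it: `convert_inputs` is a hypothetical helper that assumes every annotation is a callable converting the argument to the desired type.

```python
import functools
import inspect

def convert_inputs(func):
    """Hypothetical decorator: pass every annotated argument through its annotation."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            annotation = signature.parameters[name].annotation
            if annotation is not inspect.Parameter.empty:
                # the annotation itself is used as the conversion function
                bound.arguments[name] = annotation(value)
        return func(*bound.args, **bound.kwargs)
    return wrapper

@convert_inputs
def mean_velocity(distance: float, time: float):
    return distance / time

result = mean_velocity('10', time=4)  # the string and the int are both converted to float
print(result)  # 2.5
```

A real unit-aware decorator would validate and convert quantities rather than call the annotation blindly, but the way annotations are read is the same.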

You can also define your application-specific decorators to perform control / conversion of inputs and output in the same manner.

## Matrix multiplication with @

Let's implement one of the simplest ML models, a linear regression with l2 regularization (a.k.a. ridge regression):

```python
# l2-regularized linear regression: ||AX - b||^2 + alpha * ||X||^2 -> min
import numpy as np

# Python 2
X = np.linalg.inv(np.dot(A.T, A) + alpha * np.eye(A.shape[1])).dot(A.T.dot(b))
# Python 3
X = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ (A.T @ b)
```

The code with `@` becomes more readable and more translatable between deep learning frameworks: the same code `X @ W + b[None, :]` for a single layer of perceptron works in `numpy`, `cupy`, `pytorch`, `tensorflow` (and other frameworks that operate with tensors).

## Globbing with `**`

Recursive folder globbing is not easy in Python 2, even though the [glob2](https://github.com/miracle2k/python-glob2) custom module exists that overcomes this.
A recursive flag is supported since Python 3.5:

```python
import glob

# Python 2
found_images = \
    glob.glob('/path/*.jpg') \
  + glob.glob('/path/*/*.jpg') \
  + glob.glob('/path/*/*/*.jpg') \
  + glob.glob('/path/*/*/*/*.jpg') \
  + glob.glob('/path/*/*/*/*/*.jpg')

# Python 3
found_images = glob.glob('/path/**/*.jpg', recursive=True)
```

A better option is to use `pathlib` in python3 (minus one import!):

```python
# Python 3
found_images = pathlib.Path('/path/').glob('**/*.jpg')
```

## Print is a function now

Yes, code now has these annoying parentheses, but there are some advantages:

- simple syntax for using file descriptor:

```python
print >>sys.stderr, "critical error"      # Python 2
print("critical error", file=sys.stderr)  # Python 3
```
- printing tab-aligned tables without `str.join`:

```python
# Python 3
print(*array, sep='\t')
print(batch, epoch, loss, accuracy, time, sep='\t')
```
- hacky suppressing / redirection of printing output:

```python
# Python 3
_print = print  # store the original print function
def print(*args, **kwargs):
    pass  # do something useful, e.g. store output to some file
```
In jupyter it is desirable to log each output to a separate file (to track what's happening after you got disconnected), so you can override `print` now.
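
If the goal is only to capture or redirect what `print` writes, the standard library offers `contextlib.redirect_stdout` (Python 3.4+), which avoids redefining `print` at all. A minimal sketch, with an in-memory buffer standing in for a log file and made-up values:

```python
import contextlib
import io

buffer = io.StringIO()  # stands in for an open log file

with contextlib.redirect_stdout(buffer):
    print('epoch', 1, 'loss', 0.25, sep='\t')  # goes to the buffer, not the screen

captured = buffer.getvalue()
print('captured:', repr(captured))
```

Outside the `with` block, `print` writes to the regular standard output again.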

Below you can see a context manager that temporarily overrides behavior of print:

```python
import builtins
import contextlib

@contextlib.contextmanager
def replace_print():
    _print = print  # saving old print function
    # or use some other function here
    builtins.print = lambda *args, **kwargs: _print('new printing', *args, **kwargs)
    try:
        yield
    finally:
        builtins.print = _print  # restore the original even if the body raises

with replace_print():
    print('hello')  # code here invokes the replaced print function
```
It is *not* a recommended approach, but a small dirty hack that is now possible.
- `print` can participate in list comprehensions and other language constructs

```python
# Python 3
result = process(x) if is_valid(x) else print('invalid item: ', x)
```


## Underscores in Numeric Literals (Thousands Separator)

[PEP-515](https://www.python.org/dev/peps/pep-0515/ "PEP-515") introduced underscores in Numeric Literals.
In Python3, underscores can be used to group digits visually in integral, floating-point, and complex number literals.

```python
# grouping decimal numbers by thousands
one_million = 1_000_000

# grouping hexadecimal addresses by words
addr = 0xCAFE_F00D

# grouping bits into nibbles in a binary literal
flags = 0b_0011_1111_0100_1110

# same, for string conversions
flags = int('0b_1111_0000', 2)
```

## f-strings for simple and reliable formatting

The default formatting system provides a flexibility that is not required in data experiments.
The resulting code is either too verbose or too fragile towards any changes.

Quite typically data scientists output some logging information iteratively in a fixed format.
It is common to have code like:

```python
# Python 2
print('{batch:3} {epoch:3} / {total_epochs:3} accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'.format(
    batch=batch, epoch=epoch, total_epochs=total_epochs,
    acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
    avg_time=time / len(data_batch)
))

# Python 2 (too error-prone during fast modifications, please avoid):
print('{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
    batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
    time / len(data_batch)
))
```

Sample output:
```
120 12 / 300 accuracy: 0.8180±0.4649 time: 56.60
```

**f-strings** aka formatted string literals were introduced in Python 3.6:

```python
# Python 3.6+
print(f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')
```


## Explicit difference between 'true division' and 'integer division'

For data science this is definitely a handy change.

```python
data = pandas.read_csv('timing.csv')
velocity = data['distance'] / data['time']
```

Results in Python 2 depend on whether 'distance' and 'time' (e.g. measured in meters and seconds) are stored as integers.
In Python 3, the result is correct in both cases, because the result of division is float.
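
The difference is easy to check with built-in numbers. The values below are arbitrary and stand in for integer-typed columns:

```python
distance, time = 7, 2            # both stored as integers

velocity = distance / time       # true division: always a float in Python 3
whole_steps = distance // time   # floor division is a separate, explicit operator

print(velocity)      # 3.5 (Python 2 would give 3 for two ints)
print(whole_steps)   # 3
print(-7 // 2)       # -4, the result is rounded down, not towards zero
```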

Another case is integer division, which is now an explicit operation:

```python
n_gifts = money // gift_price  # correct for int and float arguments
```

Note that this applies both to built-in types and to custom types provided by data packages (e.g. `numpy` or `pandas`).

## Strict ordering

```python
# All these comparisons are illegal in Python 3
3 < '3'
2 < None
(3, 4) < (3, None)
(4, 5) < [4, 5]

# False in both Python 2 and Python 3
(4, 5) == [4, 5]
```

- prevents accidental sorting of instances of different types

```python
sorted([2, '1', 3])  # invalid for Python 3, in Python 2 returns [2, 3, '1']
```
- helps to spot some problems that arise when processing raw data

Sidenote: proper check for None is (in both Python versions)

```python
if a is not None:
    pass

if a:  # WRONG check for None
    pass
```


## Unicode for NLP

*Translator's note: NLP stands for Natural Language Processing.*

```python
s = '您好'
print(len(s))
print(s[:2])
```
Output:
- Python 2: `6\n��`
- Python 3: `2\n您好`.

```python
x = u'со'
x += 'co'  # ok
x += 'со'  # fail
```
Python 2 fails, Python 3 works as expected (because I've used russian letters in strings).

In Python 3 `str`s are unicode strings, and it is more convenient for NLP processing of non-english texts.
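
The Python 2 numbers above come from counting bytes rather than characters. In Python 3 the two views are separate types, `str` and `bytes`, and conversion between them is explicit. A small sketch, assuming UTF-8 encoding:

```python
s = '您好'

encoded = s.encode('utf-8')           # bytes, roughly what a Python 2 `str` held

print(len(s))                         # 2 characters
print(len(encoded))                   # 6 bytes, three per character in UTF-8
print(encoded.decode('utf-8') == s)   # True, the round trip is lossless
```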

There are other funny things, for instance:

```python
'a' < type < u'a'  # Python 2: True
'a' < u'a'         # Python 2: False
```

```python
from collections import Counter
Counter('Möbelstück')
```

- Python 2: `Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})`
- Python 3: `Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})`

You can handle all of this in Python 2 properly, but Python 3 is more friendly.

## Preserving order of dictionaries and **kwargs

In CPython 3.6+ dicts behave like `OrderedDict` by default (and [this is guaranteed in Python 3.7+](https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6)).
This preserves order during dict comprehensions (and other operations, e.g. during json serialization/deserialization)

```python
import json
x = {str(i):i for i in range(5)}
json.loads(json.dumps(x))
# Python 2
{u'1': 1, u'0': 0, u'3': 3, u'2': 2, u'4': 4}
# Python 3
{'0': 0, '1': 1, '2': 2, '3': 3, '4': 4}
```

Same applies to `**kwargs` (in Python 3.6+), they're kept in the same order as they appear in parameters.
Order is crucial when it comes to data pipelines, previously we had to write it in a cumbersome manner:

```python
from collections import OrderedDict
from torch import nn

# Python 2
model = nn.Sequential(OrderedDict([
          ('conv1', nn.Conv2d(1,20,5)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Conv2d(20,64,5)),
          ('relu2', nn.ReLU())
        ]))

# Python 3.6+, how it *can* be done, not supported right now in pytorch
model = nn.Sequential(
    conv1=nn.Conv2d(1,20,5),
    relu1=nn.ReLU(),
    conv2=nn.Conv2d(20,64,5),
    relu2=nn.ReLU()
)
```

Did you notice? Uniqueness of names is also checked automatically.


## Iterable unpacking

```python
# handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)

# picking two last values from a sequence
*prev, next_to_last, last = values_history

# This also works with any iterables, so if you have a function that yields e.g. qualities,
# below is a simple way to take only last two values from a list
*prev, next_to_last, last = iter_train(args)
```

## Default pickle engine provides better compression for arrays

```python
# Python 2
import cPickle as pickle
import numpy
print len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
# result: 23691675

# Python 3
import pickle
import numpy
len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
# result: 8000162
```

Three times less space. And it is *much* faster.
Actually similar compression (but not speed) is achievable with `protocol=2` parameter, but users typically ignore this option (or simply are not aware of it).


## Safer comprehensions

```python
labels = ...  # some initial value (elided in the source)
predictions = [model.predict(data) for data, labels in dataset]

# labels are overwritten in Python 2
# labels are not affected by comprehension in Python 3
```

## Super simple `super()`

Python 2 `super(...)` was a frequent source of mistakes in code.

```python
# Python 2
class MySubClass(MySuperClass):
    def __init__(self, name, **options):
        super(MySubClass, self).__init__(name='subclass', **options)

# Python 3
class MySubClass(MySuperClass):
    def __init__(self, name, **options):
        super().__init__(name='subclass', **options)
```

More on `super` and method resolution order on [stackoverflow](https://stackoverflow.com/questions/576169/understanding-python-super-with-init-methods).

## Better IDE suggestions with variable annotations

The most enjoyable thing about programming in languages like Java, C# and alike is that the IDE can make very good suggestions,
because the type of each identifier is known before executing a program.

In python this is hard to achieve, but annotations will help you
- write your expectations in a clear form
- and get good suggestions from IDE


This is an example of PyCharm suggestions with variable annotations.
This works even in situations when functions you use are not annotated (e.g. due to backward compatibility).

## Multiple unpacking

Here is how you merge two dicts now:

```python
x = dict(a=1, b=2)
y = dict(b=3, d=4)
# Python 3.5+
z = {**x, **y}
# z = {'a': 1, 'b': 3, 'd': 4}, note that value for `b` is taken from the latter dict.
```

See [this thread at StackOverflow](https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression) for a comparison with Python 2.

The same approach also works for lists, tuples, and sets (`a`, `b`, `c` are any iterables):

```python
[*a, *b, *c]  # list, concatenating
(*a, *b, *c)  # tuple, concatenating
{*a, *b, *c}  # set, union
```

Functions also [support this](https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-448) for `*args` and `**kwargs`:

```python
# Python 3.5+
do_something(**{**default_settings, **custom_settings})

# Also possible, this code also checks there is no intersection between keys of dictionaries
do_something(**first_args, **second_args)
```

## Future-proof APIs with keyword-only arguments

Let's consider this snippet:

```python
model = sklearn.svm.SVC(2, 'poly', 2, 4, 0.5)
```
Obviously, an author of this code didn't get the Python style of coding yet (most probably, just jumped from cpp or rust).
Unfortunately, this is not just a question of taste, because changing the order of arguments (adding/deleting) in `SVC` will break this code. In particular, `sklearn` does some reordering/renaming from time to time of numerous algorithm parameters to provide a consistent API. Each such refactoring may lead to broken code.

In Python 3, library authors may demand explicitly named parameters by using `*`:

```python
class SVC(BaseSVC):
    # the remaining parameters are elided in the source
    def __init__(self, *, C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0):
        ...
```
- users have to specify names of parameters `sklearn.svm.SVC(C=2, kernel='poly', degree=2, gamma=4, coef0=0.5)` now
- this mechanism provides a great combination of reliability and flexibility of APIs

## Minor: constants in `math` module

```python
# Python 3
import math

math.inf  # 'largest' number
math.nan  # not a number

max_quality = -math.inf  # no more magic initial values!

for model in trained_models:
    max_quality = max(max_quality, compute_quality(model, data))
```

## Minor: single integer type

Python 2 provides two basic integer types, which are int (64-bit signed integer) and long for long arithmetics (quite confusing after C++).

Python 3 has a single type `int`, which incorporates long arithmetics.
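
A quick way to see the unified type at work: integers grow beyond 64 bits without changing type. The numbers are arbitrary.

```python
small = 2 ** 10
huge = 2 ** 100   # would be a `long` in Python 2

print(type(small) is type(huge))   # True, both are plain int
print(huge)                        # 1267650600228229401496703205376
print(huge.bit_length())           # 101
```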

Here is how you check that value is integer:

```python
isinstance(x, numbers.Integral)  # Python 2, the canonical way (needs `import numbers`)
isinstance(x, (long, int))       # Python 2
isinstance(x, int)               # Python 3, easier to remember
```

## Other stuff

- `Enum`s are theoretically useful, but
    - string-typing is already widely adopted in the python data stack
    - `Enum`s don't seem to interplay with numpy and categorical from pandas
- coroutines also *sound* very promising for data pipelining (see [slides](http://www.dabeaz.com/coroutines/Coroutines.pdf) by David Beazley), but I don't see their adoption in the wild.
- Python 3 has [stable ABI](https://www.python.org/dev/peps/pep-0384/)
- Python 3 supports unicode identifiers (so `ω = Δφ / Δt` is ok), but you'd [better use good old ASCII names](https://stackoverflow.com/a/29855176/498892)
- some libraries e.g. [jupyterhub](https://github.com/jupyterhub/jupyterhub) (jupyter in cloud), django and fresh ipython only support Python 3, so features that sound useless to you are useful for libraries you'll probably want to use one day.

*Translator's note: an ABI (Application Binary Interface) describes the low-level interface between an application and the operating system, between an application and its libraries, or between components of an application.*

### Code migration problems specific to data science (and how to resolve them)

- support for nested arguments [was dropped](https://www.python.org/dev/peps/pep-3113/)

```python
# Python 2 only: tuple parameters are a syntax error in Python 3
map(lambda x, (y, z): x, z, dict.items())
```

However, it is still perfectly working with different comprehensions:

```python
{x:z for x, (y, z) in d.items()}
```
In general, comprehensions are also better 'translatable' between Python 2 and 3.

- `map()`, `.keys()`, `.values()`, `.items()`, etc. return iterators (or views), not lists. Main problems with iterators are:
    - no trivial slicing
    - can't be iterated twice

Almost all of the problems are resolved by converting result to list.

- see [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/) when in trouble.
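
The iterator pitfalls listed above can be sketched with `map`; the data is made up for illustration.

```python
squares = map(lambda v: v * v, [1, 2, 3])

first_pass = list(squares)    # [1, 4, 9]
second_pass = list(squares)   # [], the iterator is already exhausted

# converting to a list once restores slicing and repeated iteration
squares = list(map(lambda v: v * v, [1, 2, 3]))
tail = squares[1:]                      # [4, 9]
total = sum(squares) + sum(squares)     # 28, iterating twice is fine
```

The price of the conversion is that the whole result is held in memory, which is usually acceptable for small collections.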

### Main problems for teaching machine learning and data science with python

Course authors should spend time in the first lectures to explain what an iterator is,
why it can't be sliced / concatenated / multiplied / iterated twice like a string (and how to deal with it).

I think most course authors would be happy to avoid these details, but now it is hardly possible.

# Conclusion

Python 2 and Python 3 have co-existed for almost 10 years, but we *should* move to Python 3.

Research and production code should become a bit shorter, more readable, and significantly safer after moving to a Python 3-only codebase.

Right now most libraries support both Python versions.
And I can't wait for the bright moment when packages drop support for Python 2 and enjoy new language features.

Future migrations are promised to be smoother: ["we will never do this kind of backwards-incompatible change again"](https://snarky.ca/why-python-3-exists/)

### Links

- [Key differences between Python 2.7 and Python 3.x](http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html)
- [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/)
- [10 awesome features of Python that you can't use because you refuse to upgrade to Python 3](http://www.asmeurer.com/python3-presentation/slides.html)
- [Trust me, python 3.3 is better than 2.7 (video)](http://pyvideo.org/pycon-us-2013/python-33-trust-me-its-better-than-27.html)
- [Python 3 for scientists](http://python-3-for-scientists.readthedocs.io/en/latest/)


### License

This text was published by [Alex Rogozhnikov](https://arogozhnikov.github.io/about/) under [CC BY-SA 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/) (excluding images).

Translated to Chinese by Hunter-liu (@lq920320).

--------------------------------------------------------------------------------
/images/pycharm-type-hinting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arogozhnikov/python3_with_pleasure/e018d177b358f34b0038d886085ebc1211fec82d/images/pycharm-type-hinting.png

--------------------------------------------------------------------------------
/images/variable_annotations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arogozhnikov/python3_with_pleasure/e018d177b358f34b0038d886085ebc1211fec82d/images/variable_annotations.png
--------------------------------------------------------------------------------