├── README.md
├── README_CN.md
└── images
    ├── pycharm-type-hinting.png
    └── variable_annotations.png
/README.md:
--------------------------------------------------------------------------------
1 | # We made it!
2 |
3 | *Update (Jan 2020)*.
4 | Python 2 is now officially retired. Thanks to everyone for making this hard transition to better code happen!
5 |
6 | # Migrating to Python 3 with pleasure
7 | ## A short guide on features of Python 3 for data scientists
8 |
9 |
10 | Python has become a mainstream language for machine learning and other scientific fields that work heavily with data;
11 | it boasts various deep learning frameworks and a well-established set of tools for data processing and visualization.
12 |
13 | However, the Python ecosystem is split between Python 2 and Python 3, and Python 2 is still used among data scientists.
14 | By the end of 2019 the scientific stack will [stop supporting Python2](http://www.python3statement.org).
15 | As for numpy, after 2018 any new feature releases will only support [Python3](https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst). *Update (Sep 2018): same story now with pandas, matplotlib, ipython, jupyter notebook and jupyter lab.*
16 |
17 | To make the transition less frustrating, I've collected a bunch of Python 3 features that you may find useful.
18 |
19 |
20 |
21 | Image from [Dario Bertini post (toptal)](https://www.toptal.com/python/python-3-is-it-worth-the-switch)
22 |
23 | ## Better paths handling with `pathlib`
24 |
25 | `pathlib` is a standard-library module in Python 3 that helps you avoid tons of `os.path.join`s:
26 |
27 | ```python
28 | from pathlib import Path
29 |
30 | dataset = 'wiki_images'
31 | datasets_root = Path('/path/to/datasets/')
32 |
33 | train_path = datasets_root / dataset / 'train'
34 | test_path = datasets_root / dataset / 'test'
35 |
36 | for image_path in train_path.iterdir():
37 |     with image_path.open() as f:  # note, open is a method of Path object
38 |         ...  # do something with an image
39 | ```
40 |
41 | Previously it was always tempting to use string concatenation (concise, but obviously bad);
42 | now with `pathlib` the code is safe, concise, and readable.
43 |
44 | `pathlib.Path` also has a bunch of methods and properties that every Python novice previously had to google:
45 |
46 | ```python
47 | p.exists()
48 | p.is_dir()
49 | p.parts
50 | p.with_name('sibling.png') # only change the name, but keep the folder
51 | p.with_suffix('.jpg') # only change the extension, but keep the folder and the name
52 | p.chmod(mode)
53 | p.rmdir()
54 | ```
55 |
56 | `pathlib` should save you lots of time;
57 | see the [docs](https://docs.python.org/3/library/pathlib.html) and [reference](https://pymotw.com/3/pathlib/) for more.
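
As a small illustration of the properties listed above, here is a sketch that only manipulates path strings and never touches the filesystem. The path is made up for the example, and `PurePosixPath` is used (instead of `Path`) only so that the result is the same on every operating system.

```python
from pathlib import PurePosixPath

# hypothetical path, used only for illustration
p = PurePosixPath('/path/to/datasets/wiki_images/train/cat.png')

name = p.name                      # file name with extension
stem = p.stem                      # file name without extension
suffix = p.suffix                  # extension, including the dot
sibling = p.with_name('dog.png')   # same folder, different file name
as_jpg = p.with_suffix('.jpg')     # same folder and stem, different extension

print(name, stem, suffix)
print(sibling)
print(as_jpg)
print(p.parts[-2:])                # last two components of the path
```

The same attributes and methods are available on a regular `Path`, which additionally offers the filesystem calls such as `exists()` and `iterdir()`.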
58 |
59 |
60 | ## Type hinting is now part of the language
61 |
62 | Example of type hinting in pycharm:
63 | ![pycharm-type-hinting](images/pycharm-type-hinting.png)
64 |
65 | Python is not just a language for small scripts anymore;
66 | data pipelines these days include numerous steps, each involving different frameworks (and sometimes very different logic).
67 |
68 | Type hinting was introduced to help with the growing complexity of programs, so that machines could help with code verification.
69 | Previously, different modules used custom ways to specify [types in docstrings](https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html#legacy)
70 | (hint: PyCharm can convert old docstrings to fresh type hints).
71 |
72 | As a simple example, the following code may work with different types of data (that's what we like about the Python data stack).
73 | ```python
74 | import numpy
75 | def repeat_each_entry(data):
76 |     """Each entry in the data is doubled."""
77 |     # numpy.repeat turns [0, 1, 2] into [0, 0, 1, 1, 2, 2]
78 |     index = numpy.repeat(numpy.arange(len(data)), 2)
79 |     return data[index]
80 | ```
81 |
82 | For example, this code works for `numpy.array` (incl. multidimensional ones), `astropy.Table` and `astropy.Column`, `bcolz`, `cupy`, `mxnet.ndarray` and others.
83 |
84 | This code also accepts a `pandas.Series`, but does not do what you expect:
85 | ```python
86 | repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5])) # lookup goes by label, not position: older pandas returns a Series of missing values, recent versions raise KeyError
87 | ```
88 |
89 | This was just two lines of code. Imagine how unpredictable the behavior of a complex system can be when even one such function may misbehave.
90 | Stating explicitly which types a function expects is very helpful in large systems: tools can then warn you if a function is passed unexpected arguments.
91 |
92 | ```python
93 | def repeat_each_entry(data: Union[numpy.ndarray, bcolz.carray]):
94 | ```
95 |
96 | If you have a significant codebase, hinting tools like [MyPy](http://mypy.readthedocs.io) are likely to become part of your continuous integration pipeline.
97 | A webinar ["Putting Type Hints to Work"](https://www.youtube.com/watch?v=JqBCFfiE11g) by Daniel Pyrathon is good for a brief introduction.
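
To make the idea concrete, below is a minimal sketch of an annotated function. The function name and its logic are invented for this illustration. It also shows that annotations are ordinary metadata stored on the function: the interpreter records them but does not enforce them.

```python
from typing import List, Optional, get_type_hints


def mean_or_none(values: List[float]) -> Optional[float]:
    """Return the arithmetic mean, or None for an empty list."""
    if not values:
        return None
    return sum(values) / len(values)


# annotations are stored on the function and can be inspected at runtime
hints = get_type_hints(mean_or_none)
print(hints)

# the interpreter itself does not check them: integers are accepted silently
result = mean_or_none([1, 2, 3])
print(result)
```

A static checker reads the same annotations without running the code, which is why it can be added to a continuous integration pipeline.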
98 |
99 | Sidenote: unfortunately, hinting is not yet powerful enough to provide fine-grained typing for ndarrays/tensors, but [maybe we'll have it one day](https://github.com/numpy/numpy/issues/7370), and this will be a great feature for DS.
100 |
101 | ## Type hinting → type checking at runtime
102 |
103 | By default, function annotations do not influence how your code runs; they merely document your intentions.
104 |
105 | However, you can enforce type checking at runtime with tools like [enforce](https://github.com/RussBaz/enforce);
106 | this can help with debugging (there are many cases where static type hinting alone does not catch the problem).
107 |
108 | ```python
109 | @enforce.runtime_validation
110 | def foo(text: str) -> None:
111 |     print(text)
112 |
113 | foo('Hi') # ok
114 | foo(5) # fails
115 |
116 |
117 | @enforce.runtime_validation
118 | def any2(x: List[bool]) -> bool:
119 |     return any(x)
120 |
121 | any ([False, False, True, False]) # True
122 | any2([False, False, True, False]) # True
123 |
124 | any (['False']) # True
125 | any2(['False']) # fails
126 |
127 | any ([False, None, "", 0]) # False
128 | any2([False, None, "", 0]) # fails
129 |
130 | ```
131 |
132 | ## Other usages of function annotations
133 |
134 | *Update: starting from python 3.7 this behavior was [deprecated](https://www.python.org/dev/peps/pep-0563/#non-typing-usage-of-annotations), and function annotations should be used for type hinting only. Python 4 will not support other usages of annotations.*
135 |
136 | As mentioned before, annotations do not influence code execution, but rather provide some meta-information,
137 | and you can use it as you wish.
138 |
139 | For instance, measurement units are a common pain in scientific areas; the `astropy` package [provides a simple decorator](http://docs.astropy.org/en/stable/units/quantity.html#functions-that-accept-quantities) to control units of input quantities and convert output to required units:
140 | ```python
141 | # Python 3
142 | from astropy import units as u
143 | @u.quantity_input()
144 | def frequency(speed: u.meter / u.s, wavelength: u.nm) -> u.terahertz:
145 |     return speed / wavelength
146 |
147 | frequency(speed=300_000 * u.km / u.s, wavelength=555 * u.nm)
148 | # output: 540.5405405405404 THz, frequency of green visible light
149 | ```
150 |
151 | If you're processing tabular scientific data in python (not necessarily astronomical), you should give `astropy` a shot.
152 |
153 | You can also define your application-specific decorators to perform control / conversion of inputs and output in the same manner.
154 |
155 | ## Matrix multiplication with @
156 |
157 | Let's implement one of the simplest ML models — a linear regression with l2 regularization (a.k.a. ridge regression):
158 |
159 | ```python
160 | # l2-regularized linear regression: || AX - y ||^2 + alpha * ||X||^2 -> min
161 |
162 | # Python 2
163 | X = np.linalg.inv(np.dot(A.T, A) + alpha * np.eye(A.shape[1])).dot(A.T.dot(y))
164 | # Python 3
165 | X = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ (A.T @ y)
166 | ```
167 |
168 | The code with `@` is more readable and easier to port between deep learning frameworks: the same code `X @ W + b[None, :]` for a single perceptron layer works in `numpy`, `cupy`, `pytorch`, `tensorflow` (and other frameworks that operate with tensors).
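
Under the hood, `@` simply calls the `__matmul__` method of the left operand, which is why every tensor library can support it. The toy `Matrix` class below is a hypothetical illustration written for this guide (it is not part of any library) and depends only on the standard language.

```python
class Matrix:
    """Tiny list-of-lists matrix, just enough to demonstrate `@`."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # zip(*other.rows) iterates over the columns of `other`
        columns = list(zip(*other.rows))
        return Matrix(
            [sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in self.rows
        )

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows

    def __repr__(self):
        return f'Matrix({self.rows})'


A = Matrix([[1, 2], [3, 4]])
identity = Matrix([[1, 0], [0, 1]])

product = A @ A   # dispatches to Matrix.__matmul__
print(product)
print(A @ identity == A)
```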
169 |
170 | ## Globbing with `**`
171 |
172 | Recursive folder globbing is not easy in Python 2, although the third-party [glob2](https://github.com/miracle2k/python-glob2) module works around this. A `recursive` flag is supported since Python 3.5:
173 |
174 | ```python
175 | import glob
176 |
177 | # Python 2
178 | found_images = (
179 |     glob.glob('/path/*.jpg')
180 |     + glob.glob('/path/*/*.jpg')
181 |     + glob.glob('/path/*/*/*.jpg')
182 |     + glob.glob('/path/*/*/*/*.jpg')
183 |     + glob.glob('/path/*/*/*/*/*.jpg'))
184 |
185 | # Python 3
186 | found_images = glob.glob('/path/**/*.jpg', recursive=True)
187 | ```
188 |
189 | A better option is to use `pathlib` in python3 (minus one import!):
190 | ```python
191 | # Python 3
192 | found_images = pathlib.Path('/path/').glob('**/*.jpg')
193 | ```
194 | Note: there are [minor differences](https://github.com/arogozhnikov/python3_with_pleasure/issues/16) between `glob.glob`, `Path.glob` and bash globbing.
195 |
196 | ## Print is a function now
197 |
198 | Yes, code now has these annoying parentheses, but there are some advantages:
199 |
200 | - simple syntax for using file descriptor:
201 | ```python
202 | print >>sys.stderr, "critical error" # Python 2
203 | print("critical error", file=sys.stderr) # Python 3
204 | ```
205 | - printing tab-aligned tables without `str.join`:
206 | ```python
207 | # Python 3
208 | print(*array, sep='\t')
209 | print(batch, epoch, loss, accuracy, time, sep='\t')
210 | ```
211 | - hacky suppressing / redirection of printing output:
212 | ```python
213 | # Python 3
214 | _print = print # store the original print function
215 | def print(*args, **kargs):
216 |     pass  # do something useful, e.g. store output to some file
217 | ```
218 | In jupyter it is desirable to log each output to a separate file (to track what's happening after you got disconnected), so you can override `print` now.
219 |
220 | Below you can see a context manager that temporarily overrides behavior of print:
221 | ```python
222 | import builtins, contextlib
223 | @contextlib.contextmanager
224 | def replace_print():
225 |     _print = print  # saving old print function
226 |     # or use some other function here
227 |     builtins.print = lambda *args, **kwargs: _print('new printing', *args, **kwargs)
228 |     yield
229 |     builtins.print = _print
230 |
231 | with replace_print():
232 |     print('hello')  # goes through the replaced function
233 | ```
234 | It is *not* a recommended approach, but a small dirty hack that is now possible.
235 | - `print` can participate in list comprehensions and other language constructs
236 | ```python
237 | # Python 3
238 | result = process(x) if is_valid(x) else print('invalid item: ', x)
239 | ```
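
If the goal is only to capture or redirect output, the standard library already offers a less hacky route than overriding `print`. The sketch below combines the `file=` and `sep=` arguments shown above with `contextlib.redirect_stdout`. The in-memory buffer stands in for a log file and is chosen just for this example.

```python
import contextlib
import io

buffer = io.StringIO()

# `file=` sends a single call to any object with a write() method
print('batch', 'loss', sep='\t', file=buffer)

# redirect_stdout captures every print inside the block, no monkey-patching needed
with contextlib.redirect_stdout(buffer):
    print(1, 0.25, sep='\t')

captured = buffer.getvalue()
print(captured, end='')
```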
240 |
241 |
242 | ## Underscores in Numeric Literal (Thousands Separator)
243 |
244 | [PEP-515](https://www.python.org/dev/peps/pep-0515/ "PEP-515") introduced underscores in numeric literals.
245 | Since Python 3.6, underscores can be used to group digits visually in integral, floating-point, and complex number literals.
246 |
247 | ```python
248 | # grouping decimal numbers by thousands
249 | one_million = 1_000_000
250 |
251 | # grouping hexadecimal addresses by words
252 | addr = 0xCAFE_F00D
253 |
254 | # grouping bits into nibbles in a binary literal
255 | flags = 0b_0011_1111_0100_1110
256 |
257 | # same, for string conversions
258 | flags = int('0b_1111_0000', 2)
259 | ```
260 |
261 | ## f-strings for simple and reliable formatting
262 |
263 | The default formatting system provides flexibility that is not required in data experiments.
264 | The resulting code is either too verbose or too fragile when anything changes.
265 |
266 | Data scientists typically output some logging information iteratively in a fixed format.
267 | It is common to have code like:
268 |
269 | ```python
270 | # Python 2
271 | print '{batch:3} {epoch:3} / {total_epochs:3} accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'.format(
272 |     batch=batch, epoch=epoch, total_epochs=total_epochs,
273 |     acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
274 |     avg_time=time / len(data_batch)
275 | )
276 |
277 | # Python 2 (too error-prone during fast modifications, please avoid):
278 | print '{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
279 |     batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
280 |     time / len(data_batch)
281 | )
282 | ```
283 |
284 | Sample output:
285 | ```
286 | 120 12 / 300 accuracy: 0.8180±0.4649 time: 56.60
287 | ```
288 |
289 | **f-strings** aka formatted string literals were introduced in Python 3.6:
290 | ```python
291 | # Python 3.6+
292 | print(f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')
293 | ```
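
The format specifiers after the colon are the same mini-language that `str.format` uses, so migrating existing format strings is mostly mechanical. A small self-contained check, with values made up for the example:

```python
batch, epoch, total_epochs = 120, 12, 300
accuracy = 0.818

# f-string: expressions are written in place, next to their format spec
line = f'{batch:3} {epoch:3} / {total_epochs:3}  accuracy: {accuracy:0.4f}'

# the equivalent positional str.format call, for comparison
same_line = '{:3} {:3} / {:3}  accuracy: {:0.4f}'.format(
    batch, epoch, total_epochs, accuracy)

print(line)
print(line == same_line)
```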
294 |
295 |
296 | ## Explicit difference between 'true division' and 'floor division'
297 |
298 | For data science this is definitely a handy change:
299 |
300 | ```python
301 | data = pandas.read_csv('timing.csv')
302 | velocity = data['distance'] / data['time']
303 | ```
304 |
305 | Results in Python 2 depend on whether 'time' and 'distance' (e.g. measured in meters and seconds) are stored as integers.
306 | In Python 3, the result is correct in both cases, because the result of division is float.
307 |
308 | Another case is floor division, which is now an explicit operation:
309 |
310 | ```python
311 | n_gifts = money // gift_price # correct for int and float arguments
312 | ```
313 |
314 | In a nutshell:
315 |
316 | ```python
317 | >>> from operator import truediv, floordiv
318 | >>> truediv.__doc__, floordiv.__doc__
319 | ('truediv(a, b) -- Same as a / b.', 'floordiv(a, b) -- Same as a // b.')
320 | >>> (3 / 2), (3 // 2), (3.0 // 2.0)
321 | (1.5, 1, 1.0)
322 | ```
323 |
324 | Note that this applies both to built-in types and to custom types provided by data packages (e.g. `numpy` or `pandas`).
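
One detail worth keeping in mind when porting: floor division rounds towards negative infinity, so for negative numbers it differs from truncating the result of true division. A short sketch with plain integers:

```python
true_result = 7 / 2            # true division always gives a float
floor_positive = 7 // 2        # floor division of positive ints
floor_negative = -7 // 2       # rounds down, i.e. away from zero here
truncated = int(-7 / 2)        # truncation rounds towards zero instead

# divmod returns the floor quotient and the matching remainder in one call
quotient, remainder = divmod(-7, 2)

print(true_result, floor_positive, floor_negative, truncated)
print(quotient, remainder)
```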
325 |
326 |
327 | ## Strict ordering
328 |
329 | ```python
330 | # All these comparisons are illegal in Python 3
331 | 3 < '3'
332 | 2 < None
333 | (3, 4) < (3, None)
334 | (4, 5) < [4, 5]
335 |
336 | # False in both Python 2 and Python 3
337 | (4, 5) == [4, 5]
338 | ```
339 |
340 | - prevents accidental sorting of instances of different types
341 | ```python
342 | sorted([2, '1', 3]) # invalid for Python 3, in Python 2 returns [2, 3, '1']
343 | ```
344 | - helps to spot some problems that arise when processing raw data
345 |
346 | Sidenote: the proper check for None (in both Python versions) is
347 | ```python
348 | if a is not None:
349 |     pass
350 |
351 | if a: # WRONG check for None
352 |     pass
353 | ```
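
When raw data really does mix types, Python 3 forces you to state the comparison policy explicitly. The sketch below shows one possible policy (drop missing values, compare everything as integers). The sample values are invented for the illustration.

```python
raw = [2, '1', 3, None]

try:
    sorted(raw)
    mixed_sort_failed = False
except TypeError:
    # Python 3 refuses to order int, str and None against each other
    mixed_sort_failed = True

# explicit policy: skip missing values, convert the rest to int
cleaned = sorted(int(value) for value in raw if value is not None)

print(mixed_sort_failed)
print(cleaned)
```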
354 |
355 |
356 | ## Unicode for NLP
357 |
358 | ```python
359 | s = '您好'
360 | print(len(s))
361 | print(s[:2])
362 | ```
363 | Output:
364 | - Python 2: `6\n��`
365 | - Python 3: `2\n您好`.
366 |
367 | ```python
368 | x = u'со'
369 | x += 'co' # ok
370 | x += 'со' # fail
371 | ```
372 | Python 2 fails, Python 3 works as expected (because I've used Russian letters in the strings).
373 |
374 | In Python 3, `str`s are unicode strings, which is more convenient for NLP processing of non-English texts.
375 |
376 | There are other funny things, for instance:
377 | ```python
378 | 'a' < type < u'a' # Python 2: True
379 | 'a' < u'a' # Python 2: False
380 | ```
381 |
382 | ```python
383 | from collections import Counter
384 | Counter('Möbelstück')
385 | ```
386 |
387 | - Python 2: `Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})`
388 | - Python 3: `Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})`
389 |
390 | You can handle all of this in Python 2 properly, but Python 3 is more friendly.
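
The difference comes from Python 3 keeping text (`str`) and raw bytes (`bytes`) as separate types, with explicit conversion between them. A minimal sketch using the same string as above:

```python
s = '您好'

encoded = s.encode('utf-8')      # text -> bytes, explicit encoding
decoded = encoded.decode('utf-8')  # bytes -> text

print(len(s))        # number of characters
print(len(encoded))  # number of bytes in UTF-8
print(decoded == s)
```

The byte length matches what Python 2 reported for `len(s)`, because a Python 2 `str` literal held the encoded bytes rather than the characters.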
391 |
392 | ## Preserving order of dictionaries and **kwargs
393 |
394 | In CPython 3.6+ dicts behave like `OrderedDict` by default (and [this is guaranteed in Python 3.7+](https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6)).
395 | This preserves order during dict comprehensions (and other operations, e.g. during json serialization/deserialization):
396 |
397 | ```python
398 | import json
399 | x = {str(i):i for i in range(5)}
400 | json.loads(json.dumps(x))
401 | # Python 2
402 | {u'1': 1, u'0': 0, u'3': 3, u'2': 2, u'4': 4}
403 | # Python 3
404 | {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4}
405 | ```
406 |
407 | The same applies to `**kwargs` (in Python 3.6+): they're kept in the same order as they appear in the call.
408 | Order is crucial when it comes to data pipelines; previously we had to write it in a cumbersome manner:
409 | ```python
410 | from collections import OrderedDict
411 | from torch import nn
412 | # Python 2
413 | model = nn.Sequential(OrderedDict([
414 |     ('conv1', nn.Conv2d(1,20,5)),
415 |     ('relu1', nn.ReLU()),
416 |     ('conv2', nn.Conv2d(20,64,5)),
417 |     ('relu2', nn.ReLU())
418 | ]))
419 |
420 | # Python 3.6+, how it *can* be done, not supported right now in pytorch
421 | model = nn.Sequential(
422 |     conv1=nn.Conv2d(1,20,5),
423 |     relu1=nn.ReLU(),
424 |     conv2=nn.Conv2d(20,64,5),
425 |     relu2=nn.ReLU(),
426 | )
427 | ```
428 |
429 | Did you notice? Uniqueness of names is also checked automatically.
430 |
431 |
432 | ## Iterable unpacking
433 |
434 | ```python
435 | # handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
436 | model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)
437 |
438 | # picking two last values from a sequence
439 | *prev, next_to_last, last = values_history
440 |
441 | # This also works with any iterable, so if you have a function that yields e.g. qualities,
442 | # below is a simple way to take only the last two values it produces
443 | *prev, next_to_last, last = iter_train(args)
444 | ```
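
A runnable version of the same pattern, with a hypothetical generator standing in for `iter_train`. Note that the starred name always receives a list, and that the whole iterable is consumed.

```python
def iter_qualities():
    """Hypothetical training loop that yields a quality after each epoch."""
    yield from [0.61, 0.70, 0.74, 0.78]


# the starred target collects everything that is not matched explicitly
*prev, next_to_last, last = iter_qualities()

# the star can sit at any position, here at the end
first, *rest = 'abc'

print(prev, next_to_last, last)
print(first, rest)
```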
445 |
446 | ## Default pickle engine provides better compression for arrays
447 |
448 | Pickling is a mechanism to pass data between threads / processes; in particular, it is used inside the `multiprocessing` package.
449 |
450 | ```python
451 | # Python 2
452 | import cPickle as pickle
453 | import numpy
454 | print len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
455 | # result: 23691675
456 |
457 | # Python 3
458 | import pickle
459 | import numpy
460 | len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
461 | # result: 8000162
462 | ```
463 |
464 | Three times less space. And it is *much* faster.
465 | Actually similar compression (but not speed) is achievable with `protocol=2` parameter, but developers typically ignore this option (or simply are not aware of it).
466 |
467 | Note: pickle is [not safe](https://docs.python.org/3/library/pickle.html) (and not quite transferable), so never unpickle data received from an untrusted or unauthenticated source.
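
The effect of the protocol can be checked without numpy. The sketch below pickles a plain list of floats (an assumption made to keep the example dependency-free, so the exact ratio will differ from the array numbers above) with the old text protocol and with the newest binary one.

```python
import pickle
import random

random.seed(0)
values = [random.random() for _ in range(1000)]

# protocol 0 is the original human-readable text format
text_size = len(pickle.dumps(values, protocol=0))
# binary protocols store each float in a fixed number of bytes
binary_payload = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)
binary_size = len(binary_payload)

restored = pickle.loads(binary_payload)

print(text_size, binary_size)
print(restored == values)
```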
468 |
469 | ## Safer comprehensions
470 |
471 | ```python
472 | labels = initial_value  # any pre-existing value
473 | predictions = [model.predict(data) for data, labels in dataset]
474 |
475 | # labels are overwritten in Python 2
476 | # labels are not affected by comprehension in Python 3
477 | ```
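
A self-contained version of the same check, with toy data made up for the illustration:

```python
labels = ['cat', 'dog']                 # variable defined before the comprehension
dataset = [([0.1], 0), ([0.2], 1)]      # (features, label) pairs

# `labels` inside the comprehension is a separate, local name in Python 3
features = [data for data, labels in dataset]

print(features)
print(labels)   # the outer variable is untouched
```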
478 |
479 | ## Super, simply super()
480 |
481 | Python 2 `super(...)` was a frequent source of mistakes in code.
482 |
483 | ```python
484 | # Python 2
485 | class MySubClass(MySuperClass):
486 |     def __init__(self, name, **options):
487 |         super(MySubClass, self).__init__(name='subclass', **options)
488 |
489 | # Python 3
490 | class MySubClass(MySuperClass):
491 |     def __init__(self, name, **options):
492 |         super().__init__(name='subclass', **options)
493 | ```
494 |
495 | More on `super` and method resolution order on [stackoverflow](https://stackoverflow.com/questions/576169/understanding-python-super-with-init-methods).
496 |
497 | ## Better IDE suggestions with variable annotations
498 |
499 | The most enjoyable thing about programming in languages like Java, C# and the like is that the IDE can make very good suggestions,
500 | because the type of each identifier is known before executing the program.
501 |
502 | In Python this is hard to achieve, but annotations will help you
503 | - write your expectations in a clear form
504 | - and get good suggestions from IDE
505 |
506 | 
507 | This is an example of PyCharm suggestions with variable annotations.
508 | This works even in situations when functions you use are not annotated (e.g. due to backward compatibility).
509 |
510 | ## Multiple unpacking
511 |
512 | Here is how you merge two dicts now:
513 | ```python
514 | x = dict(a=1, b=2)
515 | y = dict(b=3, d=4)
516 | # Python 3.5+
517 | z = {**x, **y}
518 | # z = {'a': 1, 'b': 3, 'd': 4}, note that value for `b` is taken from the latter dict.
519 | ```
520 |
521 | See [this thread at StackOverflow](https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression) for a comparison with Python 2.
522 |
523 | The same approach also works for lists, tuples, and sets (`a`, `b`, `c` are any iterables):
524 | ```python
525 | [*a, *b, *c] # list, concatenating
526 | (*a, *b, *c) # tuple, concatenating
527 | {*a, *b, *c} # set, union
528 | ```
529 |
530 | Functions also [support multiple unpacking](https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-448) for `*args` and `**kwargs`:
531 | ```python
532 | # Python 3.5+
533 | do_something(**{**default_settings, **custom_settings})
534 |
535 | # Also possible, this code also checks there is no intersection between keys of dictionaries
536 | do_something(**first_args, **second_args)
537 | ```
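
The difference between the two calls can be seen in a small experiment. `do_something` and the settings below are placeholders invented for this sketch.

```python
def do_something(**settings):
    return settings


default_settings = {'lr': 0.1, 'epochs': 10}
custom_settings = {'lr': 0.01}

# merging first: the later dict silently wins for duplicated keys
merged = do_something(**{**default_settings, **custom_settings})

# unpacking both directly: a duplicated key is reported as an error
try:
    do_something(**default_settings, **custom_settings)
    clash_detected = False
except TypeError:
    clash_detected = True

print(merged)
print(clash_detected)
```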
538 |
539 | ## Future-proof APIs with keyword-only arguments
540 |
541 | Let's consider this snippet
542 | ```python
543 | model = sklearn.svm.SVC(2, 'poly', 2, 4, 0.5)
544 | ```
545 | Obviously, the author of this code hasn't picked up the Python style of coding yet (most probably, they just jumped over from C++ or Rust).
546 | Unfortunately, this is not just a question of taste, because changing the order of arguments (by adding or deleting some) in `SVC` will break this code. In particular, `sklearn` from time to time reorders or renames numerous algorithm parameters to provide a consistent API. Each such refactoring may lead to broken code.
547 |
548 | In Python 3, library authors may demand explicitly named parameters by using `*`:
549 | ```python
550 | class SVC(BaseSVC):
551 |     def __init__(self, *, C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, ... )
552 | ```
553 | - users have to specify names of parameters `sklearn.svm.SVC(C=2, kernel='poly', degree=2, gamma=4, coef0=0.5)` now
554 | - this mechanism provides a great combination of reliability and flexibility of APIs
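
Here is a minimal sketch of the same mechanism on a made-up function (`train` and its parameters are hypothetical): everything after the bare `*` can only be passed by name.

```python
def train(data, *, learning_rate=0.1, epochs=10):
    """Parameters after `*` are keyword-only."""
    return {'n': len(data), 'learning_rate': learning_rate, 'epochs': epochs}


config = train([1, 2, 3], learning_rate=0.01)

try:
    train([1, 2, 3], 0.01)   # positional use of a keyword-only parameter
    positional_rejected = False
except TypeError:
    positional_rejected = True

print(config)
print(positional_rejected)
```

Because callers cannot rely on position, the library author is free to reorder or insert keyword-only parameters later without breaking existing calls.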
555 |
556 | ## Data classes
557 |
558 | Python 3.7 introduces data classes, a good replacement for `namedtuple` in most cases.
559 | ```python
560 | @dataclass
561 | class Person:
562 |     name: str
563 |     age: int
564 |
565 | @dataclass
566 | class Coder(Person):
567 |     preferred_language: str = 'Python 3'
568 | ```
569 |
570 | The `dataclass` decorator takes over the job of implementing routine methods for you (initialization, representation, comparison, and hashing when applicable).
571 | Let's name some features:
572 | - data classes can be both mutable and immutable
573 | - default values for fields are supported
574 | - inheritance
575 | - data classes are still good old classes: you can define new methods and override existing ones
576 | - post-init processing (e.g. to verify consistency)
577 |
578 | Geir Arne Hjelle gives a good overview of dataclasses [in his post](https://realpython.com/python-data-classes/).
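
Several of the features listed above fit into one short sketch. The `Experiment` class and its fields are invented for this example; `frozen=True` makes instances immutable, `order=True` generates comparison methods, and `__post_init__` is the post-init hook.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True, order=True)
class Experiment:
    name: str
    learning_rate: float = 0.1

    def __post_init__(self):
        # post-init hook: verify consistency of the generated fields
        if self.learning_rate <= 0:
            raise ValueError('learning_rate must be positive')


baseline = Experiment('baseline')
same = Experiment('baseline', learning_rate=0.1)
tuned = Experiment('tuned', learning_rate=0.01)

print(baseline)                     # generated __repr__
print(baseline == same)             # generated __eq__ compares field by field
ranked = sorted([tuned, baseline])  # order=True compares fields like a tuple

try:
    baseline.learning_rate = 1.0
    modified = True
except FrozenInstanceError:
    modified = False                # frozen instances cannot be modified
```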
579 |
580 |
581 |
582 |
583 | ## Customizing access to module attributes
584 |
585 | In Python you can control attribute access and hinting with `__getattr__` and `__dir__` for any object. Since Python 3.7 you can do it for modules too.
586 |
587 | A natural example is implementing a `random` submodule of tensor libraries, which is typically a shortcut to skip initialization and passing of RandomState objects. Here's an implementation for numpy:
588 | ```python
589 | # nprandom.py
590 | import numpy
591 | __random_state = numpy.random.RandomState()
592 |
593 | def __getattr__(name):
594 |     return getattr(__random_state, name)
595 |
596 | def __dir__():
597 |     return dir(__random_state)
598 |
599 | def seed(seed):
600 |     __random_state.seed(seed)  # reseed in place; assigning to the name here would only create a local variable
601 | ```
602 |
603 | In the same way one can also mix functionality of different objects/submodules. Compare with tricks in [pytorch](https://github.com/pytorch/pytorch/blob/3ce17bf8f6a2c4239085191ea60d6ee51cd620a5/torch/__init__.py#L253-L256) and [cupy](https://github.com/cupy/cupy/blob/94592ecac8152d5f4a56a129325cc91d184480ad/cupy/random/distributions.py).
604 |
605 | Additionally, now one can
606 | - use it for [lazy loading of submodules](https://snarky.ca/lazy-importing-in-python-3-7/). For example, `import tensorflow` takes **~150MB** of RAM because it imports all submodules (and dependencies).
607 | - use this for [deprecations in API](https://www.python.org/dev/peps/pep-0562/)
608 | - introduce runtime routing between submodules
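
The deprecation use case from the list above can be sketched as follows. To keep the example in a single file, the module is built with `types.ModuleType` instead of living in its own `.py` file. The module name `mylib` and both function names are hypothetical.

```python
import types
import warnings

# throwaway module object, standing in for a real mylib.py (illustration only)
mylib = types.ModuleType('mylib')
mylib.new_name = lambda: 'result'


def _module_getattr(name):
    # called only when normal attribute lookup on the module fails
    if name == 'old_name':
        warnings.warn('old_name is deprecated, use new_name',
                      DeprecationWarning, stacklevel=2)
        return mylib.new_name
    raise AttributeError(f"module 'mylib' has no attribute {name!r}")


# in a real module this would simply be `def __getattr__(name): ...` at top level
mylib.__getattr__ = _module_getattr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    value = mylib.old_name()   # still works, but emits a warning

print(value, len(caught))
```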
609 |
610 | ## Built-in breakpoint()
611 |
612 | Just write `breakpoint()` in the code to invoke debugger.
613 | ```python
614 | # Python 3.7+, not all IDEs support this at the moment
615 | foo()
616 | breakpoint()
617 | bar()
618 | ```
619 |
620 | For remote debugging you may want to try [combining breakpoint() with `web-pdb`](https://hackernoon.com/python-3-7s-new-builtin-breakpoint-a-quick-tour-4f1aebc444c).
621 |
622 |
623 | ## Minor: constants in `math` module
624 |
625 | ```python
626 | # Python 3
627 | math.inf # Infinite float
628 | math.nan # not a number
629 |
630 | max_quality = -math.inf # no more magic initial values!
631 |
632 | for model in trained_models:
633 |     max_quality = max(max_quality, compute_quality(model, data))
634 | ```
635 |
636 | ## Minor: single integer type
637 |
638 | Python 2 provides two basic integer types: `int` (a fixed-size machine integer, typically 64-bit signed) and `long` for arbitrary-precision arithmetic (quite confusing after C++).
639 |
640 | Python 3 has a single type `int`, which incorporates arbitrary-precision arithmetic.
641 |
642 | Here is how you check that a value is an integer:
643 |
644 | ```python
645 | isinstance(x, numbers.Integral) # Python 2, the canonical way
646 | isinstance(x, (long, int)) # Python 2
647 | isinstance(x, int) # Python 3, easier to remember
648 | ```
649 |
650 | Update: the first check also works for *other integral types*, such as `numpy.int32` and `numpy.int64`, but the other two don't. So they're not equivalent.
651 |
652 |
653 | ## Other stuff
654 |
655 | - `Enum`s are theoretically useful, but
656 |   - string-typing is already widely adopted in the python data stack
657 |   - `Enum`s don't seem to interplay with numpy and categorical from pandas
658 | - coroutines also *sound* very promising for data pipelining (see [slides](http://www.dabeaz.com/coroutines/Coroutines.pdf) by David Beazley), but I don't see their adoption in the wild.
659 | - Python 3 has [stable ABI](https://www.python.org/dev/peps/pep-0384/)
660 | - Python 3 supports unicode identifiers (so `ω = Δφ / Δt` is ok), but you'd [better use good old ASCII names](https://stackoverflow.com/a/29855176/498892)
661 | - some libraries e.g. [jupyterhub](https://github.com/jupyterhub/jupyterhub) (jupyter in cloud), django and fresh ipython only support Python 3, so features that sound useless to you are useful for libraries you'll probably want to use one day.
662 |
663 |
664 | ### Problems for code migration specific for data science (and how to resolve those)
665 |
666 | - support for nested arguments [was dropped](https://www.python.org/dev/peps/pep-3113/)
667 | ```python
668 | map(lambda x, (y, z): x, z, dict.items())
669 | ```
670 |
671 | However, nested unpacking still works perfectly in comprehensions:
672 | ```python
673 | {x:z for x, (y, z) in d.items()}
674 | ```
675 | In general, comprehensions are also better 'translatable' between Python 2 and 3.
676 |
677 | - `map()` and similar functions return iterators, and `.keys()`, `.values()`, `.items()` return dict views, not lists. Main problems with these are:
678 |   - no trivial slicing
679 |   - iterators can't be iterated twice
680 |
681 |   Almost all of the problems are resolved by converting the result to a list.
682 |
683 | - see [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/) when in trouble
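
The iterator issues mentioned in the list above, and the usual workarounds, can be reproduced in a few lines. The data is arbitrary and only serves the illustration.

```python
from itertools import islice

squares = map(lambda x: x * x, range(10))   # a one-shot iterator in Python 3

first_three = list(islice(squares, 3))      # islice replaces slicing
remaining = list(squares)                   # continues where islice stopped
exhausted = list(squares)                   # nothing left: no second pass

d = {'a': 1, 'b': 2}
keys = d.keys()                             # a view: iterable, but not indexable
first_key = list(keys)[0]                   # convert to a list to index or slice

print(first_three, remaining, exhausted)
print(first_key)
```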
684 |
685 | ### Main problems for teaching machine learning and data science with python
686 |
687 | Course authors should spend time in the first lectures explaining what an iterator is,
688 | why it can't be sliced / concatenated / multiplied / iterated twice like a string (and how to deal with it).
689 |
690 | I think most course authors would be happy to avoid these details, but now it is hardly possible.
691 |
692 | # Conclusion
693 |
694 | Python 2 and Python 3 have co-existed for almost 10 years, but we *should* move to Python 3.
695 |
696 | Research and production code should become a bit shorter, more readable, and significantly safer after moving to Python 3-only codebase.
697 |
698 | Right now most libraries support both Python versions.
699 | And I can't wait for the bright moment when packages drop support for Python 2 and enjoy new language features.
700 |
701 | Future migrations are promised to be smoother: ["we will never do this kind of backwards-incompatible change again"](https://snarky.ca/why-python-3-exists/)
702 |
703 | ### Links
704 |
705 | - [Key differences between Python 2.7 and Python 3.x](http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html)
706 | - [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/)
707 | - [10 awesome features of Python that you can't use because you refuse to upgrade to Python 3](http://www.asmeurer.com/python3-presentation/slides.html)
708 | - [Trust me, python 3.3 is better than 2.7 (video)](http://pyvideo.org/pycon-us-2013/python-33-trust-me-its-better-than-27.html)
709 | - [Python 3 for scientists](http://python-3-for-scientists.readthedocs.io/en/latest/)
710 |
711 | ### License
712 |
713 | This text was published by [Alex Rogozhnikov](https://arogozhnikov.github.io/about/) and [contributors](https://github.com/arogozhnikov/python3_with_pleasure/graphs/contributors) under [CC BY-SA 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/) (excluding images).
714 |
--------------------------------------------------------------------------------
/README_CN.md:
--------------------------------------------------------------------------------
1 | # Migrating to Python 3 with pleasure
2 | ## A brief introduction to Python 3 features for data scientists
3 |
4 | > Python became a mainstream language for machine learning and other scientific fields that heavily operate with data;
5 | it boasts various deep learning frameworks and well-established set of tools for data processing and visualization.
6 |
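7 | Python has become a mainstream language for machine learning and other scientific fields that are closely tied to data; it provides a variety of deep learning frameworks and a mature set of tools for data processing and visualization.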
8 |
9 | > However, Python ecosystem co-exists in Python 2 and Python 3, and Python 2 is still used among data scientists.
10 | By the end of 2019 the scientific stack will [stop supporting Python2](http://www.python3statement.org).
11 | As for numpy, after 2018 any new feature releases will only support [Python3](https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst).
12 |
13 | However, Python 2 and Python 3 coexist in the Python ecosystem, and some data scientists still use Python 2. By the end of 2019 the (Python) scientific stack will [stop supporting Python 2](http://www.python3statement.org). As for numpy, any new features released after 2018 will only support [Python 3](https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst).
14 |
15 | >To make the transition less frustrating, I've collected a bunch of Python 3 features that you may find useful.
16 |
17 | To make this transition a little easier, I have put together some Python 3 features that you may find useful.
18 |
19 |
20 |
21 | Image source: [Dario Bertini post (toptal)](https://www.toptal.com/python/python-3-is-it-worth-the-switch)
22 |
23 | ## Better path handling with `pathlib`
24 |
25 | > `pathlib` is a default module in python3, that helps you to avoid tons of `os.path.join`s:
26 |
27 | `pathlib` is a default module in Python 3 that helps you avoid heavy use of `os.path.join`:
28 |
29 | ```python
30 | from pathlib import Path
31 |
32 | dataset = 'wiki_images'
33 | datasets_root = Path('/path/to/datasets/')
34 |
35 | train_path = datasets_root / dataset / 'train'
36 | test_path = datasets_root / dataset / 'test'
37 |
38 | for image_path in train_path.iterdir():
39 |     with image_path.open() as f:  # note, open is a method of Path object
40 |         pass  # do something with an image
41 | ```
42 |
43 | > Previously it was always tempting to use string concatenation (concise, but obviously bad),
44 | now with `pathlib` the code is safe, concise, and readable.
45 |
46 | Previously it was tempting to use string concatenation (concise, but obviously bad); now, with `pathlib`, the code is safe, concise, and more readable.
47 |
48 | > Also `pathlib.Path` has a bunch of methods and properties, that every python novice previously had to google:
49 |
50 | In addition, `pathlib.Path` has a bunch of methods and properties that every Python novice previously had to google:
51 |
52 | ```python
53 | p.exists()
54 | p.is_dir()
55 | p.parts
56 | p.with_name('sibling.png') # only change the name, but keep the folder
57 | p.with_suffix('.jpg') # only change the extension, but keep the folder and the name
58 | p.chmod(mode)
59 | p.rmdir()
60 | ```
61 |
62 | > `pathlib` should save you lots of time,
63 | please see [docs](https://docs.python.org/3/library/pathlib.html) and [reference](https://pymotw.com/3/pathlib/) for more.
64 |
65 | `pathlib` should save you lots of time; see the [docs](https://docs.python.org/3/library/pathlib.html) and the [reference](https://pymotw.com/3/pathlib/) for more.
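
A minimal, self-contained sketch of the `pathlib` calls listed above, run against a temporary directory so that it does not touch real data. The folder and file names (`wiki_images`, `cat.png`, `dog.png`) are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / 'wiki_images' / 'train'
    train_dir.mkdir(parents=True)          # create nested folders in one call
    image = train_dir / 'cat.png'
    image.write_bytes(b'')                 # empty placeholder file

    exists = image.exists()                # the file was just created
    is_dir = image.is_dir()                # it is a file, not a folder
    last_parts = image.parts[-3:]          # trailing components of the path
    sibling = image.with_name('dog.png')   # same folder, different name
    as_jpg = image.with_suffix('.jpg')     # same folder and name, new extension
    listing = sorted(p.name for p in train_dir.iterdir())

print(exists, is_dir, last_parts, sibling.name, as_jpg.name, listing)
```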
66 |
67 |
68 | ## Type hinting is now part of the language
69 |
70 | > Example of type hinting in pycharm:
71 |
72 | Example of type hinting in PyCharm:
73 |
74 |
75 |
76 | > Python is not just a language for small scripts anymore,
77 | data pipelines these days include numerous steps each involving different frameworks (and sometimes very different logic).
78 |
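79 | Python is no longer just a language for small scripts; data pipelines these days include numerous steps, each involving different frameworks (and sometimes very different logic).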
80 |
81 | > Type hinting was introduced to help with growing complexity of programs, so machines could help with code verification.
82 | Previously different modules used custom ways to point [types in docstrings](https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html#legacy)
83 | (Hint: pycharm can convert old docstrings to fresh type hinting).
84 |
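85 | Type hinting was introduced to help with the growing complexity of programs, so that machines can help with code verification. Previously, different modules used custom ways to specify [types in docstrings](https://www.jetbrains.com/help/pycharm/type-hinting-in-pycharm.html#legacy) (hint: PyCharm can convert old docstrings to fresh type hints).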
86 |
87 | > As a simple example, the following code may work with different types of data (that's what we like about python data stack).
88 |
89 | As a simple example, the following code may work with different types of data (that's something we like about the Python data stack).
90 |
91 | ```python
92 | def repeat_each_entry(data):
93 |     """ Each entry in the data is doubled
94 |
95 |     """
96 |     index = numpy.repeat(numpy.arange(len(data)), 2)
97 |     return data[index]
98 | ```
99 |
100 | > This code e.g. works for `numpy.array` (incl. multidimensional ones), `astropy.Table` and `astropy.Column`, `bcolz`, `cupy`, `mxnet.ndarray` and others.
101 |
102 | This code works e.g. for `numpy.array` (including multidimensional ones), `astropy.Table` and `astropy.Column`, `bcolz`, `cupy`, `mxnet.ndarray` and others.
103 |
104 | > This code will work for `pandas.Series`, but in the wrong way:
105 |
106 | This code also works for `pandas.Series`, but in the wrong way:
107 |
108 | ```python
109 | repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5])) # returns Series with Nones inside
110 | ```
111 |
112 | > This was two lines of code. Imagine how unpredictable the behavior of a complex system can be when just one function may misbehave.
113 | Stating explicitly which types a method expects is very helpful in large systems: it will warn you if a function is passed unexpected arguments.
114 |
115 | These were just two lines of code. Imagine the unpredictable behavior of a complex system caused by a single misbehaving function. In large systems, stating explicitly which types a method expects is very helpful: you get a warning if a function is passed unexpected arguments.
116 |
117 | ```python
118 | def repeat_each_entry(data: Union[numpy.ndarray, bcolz.carray]):
119 | ```
120 | > If you have a significant codebase, hinting tools like [MyPy](http://mypy.readthedocs.io) are likely to become part of your continuous integration pipeline. A webinar ["Putting Type Hints to Work"](https://www.youtube.com/watch?v=JqBCFfiE11g) by Daniel Pyrathon is good for a brief introduction.
121 |
122 | If you have a significant codebase, hinting tools like [MyPy](http://mypy.readthedocs.io) are likely to become part of your continuous integration pipeline. The webinar ["Putting Type Hints to Work"](https://www.youtube.com/watch?v=JqBCFfiE11g) by Daniel Pyrathon gives a good brief introduction.
123 |
124 | > Sidenote: unfortunately, hinting is not yet powerful enough to provide fine-grained typing for ndarrays/tensors, but [maybe we'll have it someday](https://github.com/numpy/numpy/issues/7370), and this will be a great feature for DS.
125 |
126 | Sidenote: unfortunately, hinting is not yet powerful enough to provide fine-grained typing for ndarrays/tensors, but [maybe we'll have it someday](https://github.com/numpy/numpy/issues/7370), and it would be a great feature for data science.
127 |
128 | ## Type hinting → type checking in runtime
129 |
130 | > By default, function annotations do not influence how your code is working, but merely help you to point code intentions.
131 |
132 | By default, function annotations do not affect how your code runs; they merely help you state the intent of the code.
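
A small sketch of that statement: annotations are stored on the function object, but the interpreter does not check them at call time. The function `scale` is a made-up example.

```python
def scale(values: list, factor: float) -> list:
    """Multiply every entry by factor."""
    return [v * factor for v in values]

# annotations are kept as plain metadata on the function object
print(scale.__annotations__)

# a tuple and an int are passed instead of a list and a float: no error is raised
result = scale((1, 2), 3)
print(result)  # [3, 6]
```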
133 |
134 | > However, you can enforce type checking in runtime with tools like ... [enforce](https://github.com/RussBaz/enforce),
135 | this can help you in debugging (there are many cases when type hinting is not working).
136 |
137 | However, you can enforce type checking at runtime with tools like [enforce](https://github.com/RussBaz/enforce); this can help you when debugging (there are many cases when type hinting does not work).
138 |
139 | ```python
140 | @enforce.runtime_validation
141 | def foo(text: str) -> None:
142 |     print(text)
143 |
144 | foo('Hi') # ok
145 | foo(5) # fails
146 |
147 |
148 | @enforce.runtime_validation
149 | def any2(x: List[bool]) -> bool:
150 |     return any(x)
151 |
152 | any ([False, False, True, False]) # True
153 | any2([False, False, True, False]) # True
154 |
155 | any (['False']) # True
156 | any2(['False']) # fails
157 |
158 | any ([False, None, "", 0]) # False
159 | any2([False, None, "", 0]) # fails
160 |
161 | ```
162 |
163 | ## Other usages of function annotations
164 |
165 | > As mentioned before, annotations do not influence code execution, but rather provide some meta-information,
166 | and you can use it as you wish.
167 |
168 | As mentioned before, annotations do not influence code execution; they only provide some meta-information, which you can use as you wish.
169 |
170 | > For instance, measurement units are a common pain in scientific areas, `astropy` package [provides a simple decorator](http://docs.astropy.org/en/stable/units/quantity.html#functions-that-accept-quantities) to control units of input quantities and convert output to required units.
171 |
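172 | For instance, measurement units are a common pain in scientific areas; the `astropy` package [provides a simple decorator](http://docs.astropy.org/en/stable/units/quantity.html#functions-that-accept-quantities) to control the units of input quantities and convert the output to the required units.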
173 | ```python
174 | # Python 3
175 | from astropy import units as u
176 | @u.quantity_input()
177 | def frequency(speed: u.meter / u.s, wavelength: u.m) -> u.terahertz:
178 |     return speed / wavelength
179 |
180 | frequency(speed=300_000 * u.km / u.s, wavelength=555 * u.nm)
181 | # output: 540.5405405405404 THz, frequency of green visible light
182 | ```
183 |
184 | > If you're processing tabular scientific data in python (not necessarily astronomical), you should give `astropy` a shot.
185 |
186 | If you're processing tabular scientific data in Python (not necessarily astronomical), you should give `astropy` a try.
187 |
188 | > You can also define your application-specific decorators to perform control / conversion of inputs and output in the same manner.
189 |
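190 | You can also define your own application-specific decorators to perform control / conversion of inputs and outputs in the same manner.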
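
One possible shape for such an application-specific decorator, as a sketch only: it reads the annotations through `inspect.signature` and casts each argument with its annotation. The names `convert_inputs` and `mean_speed` are hypothetical, and this is not how `astropy` implements `quantity_input`.

```python
import functools
import inspect

def convert_inputs(func):
    """Cast every argument with its annotation, when the annotation is callable."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            annotation = signature.parameters[name].annotation
            if annotation is not inspect.Parameter.empty and callable(annotation):
                bound.arguments[name] = annotation(value)
        return func(*bound.args, **bound.kwargs)
    return wrapper

@convert_inputs
def mean_speed(distance: float, time: float) -> float:
    return distance / time

print(mean_speed('100', 8))  # the string is converted with float(): 12.5
```

The same pattern can validate instead of convert: replace the cast with an `isinstance` check and raise `TypeError` on a mismatch.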
191 |
192 | ## Matrix multiplication with @
193 |
194 | > Let's implement one of the simplest ML models — a linear regression with l2 regularization (a.k.a. ridge regression):
195 |
196 | Let's implement one of the simplest ML (machine learning) models: a linear regression with l2 regularization (a.k.a. ridge regression):
197 |
198 | ```python
199 | # l2-regularized linear regression: || AX - b ||^2 + alpha * ||X||^2 -> min
200 |
201 | # Python 2
202 | X = np.linalg.inv(np.dot(A.T, A) + alpha * np.eye(A.shape[1])).dot(A.T.dot(b))
203 | # Python 3
204 | X = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ (A.T @ b)
205 | ```
206 |
207 | > The code with `@` becomes more readable and more translatable between deep learning frameworks: same code `X @ W + b[None, :]` for a single layer of perceptron works in `numpy`, `cupy`, `pytorch`, `tensorflow` (and other frameworks that operate with tensors).
208 |
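209 | Code with `@` becomes more readable and more portable between deep learning frameworks: the same code `X @ W + b[None, :]` for a single perceptron layer works in `numpy`, `cupy`, `pytorch`, `tensorflow` (and other frameworks that operate on tensors).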
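
The `@` operator is available to any class through the `__matmul__` special method (PEP 465), which is why different tensor libraries can share one expression. The toy `Matrix` class below is purely illustrative and assumes well-formed rectangular inputs; in real code you would use `numpy` or a similar library.

```python
class Matrix:
    """Tiny list-of-rows matrix, just enough to demonstrate the @ operator."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        columns = list(zip(*other.rows))  # transpose to iterate over columns
        return Matrix([[sum(a * b for a, b in zip(row, column)) for column in columns]
                       for row in self.rows])

A = Matrix([[1, 2], [3, 4]])
B = Matrix([[0, 1], [1, 0]])
product = A @ B          # Python calls A.__matmul__(B)
print(product.rows)      # [[2, 1], [4, 3]]
```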
210 |
211 | ## Globbing with `**`
212 |
213 | > Recursive folder globbing is not easy in Python 2, even though the [glob2](https://github.com/miracle2k/python-glob2) custom module exists that overcomes this. A recursive flag is supported since Python 3.5:
214 |
215 | Recursive folder globbing is not easy in Python 2, even though the custom [glob2](https://github.com/miracle2k/python-glob2) module exists to overcome this. A recursive flag has been supported since Python 3.5:
216 |
217 | ```python
218 | import glob
219 |
220 | # Python 2
221 | found_images = \
222 |     glob.glob('/path/*.jpg') \
223 |     + glob.glob('/path/*/*.jpg') \
224 |     + glob.glob('/path/*/*/*.jpg') \
225 |     + glob.glob('/path/*/*/*/*.jpg') \
226 |     + glob.glob('/path/*/*/*/*/*.jpg')
227 |
228 | # Python 3
229 | found_images = glob.glob('/path/**/*.jpg', recursive=True)
230 | ```
231 |
232 | > A better option is to use `pathlib` in python3 (minus one import!):
233 |
234 | A better option is to use `pathlib` in Python 3 (minus one import!):
235 | ```python
236 | # Python 3
237 | found_images = pathlib.Path('/path/').glob('**/*.jpg')
238 | ```
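
To see that both variants find the same files, here is a sketch that builds a small throwaway directory tree (the file names are invented) and globs it recursively with `glob.glob` and with `Path.glob`.

```python
import glob
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for relative in ['a.jpg', 'x/b.jpg', 'x/y/c.jpg', 'x/y/notes.txt']:
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # '**' matches zero or more nested folders when recursive=True
    pattern = str(root / '**' / '*.jpg')
    via_glob = sorted(Path(p).name for p in glob.glob(pattern, recursive=True))

    # pathlib needs no flag, '**' is always recursive there
    via_pathlib = sorted(p.name for p in root.glob('**/*.jpg'))

print(via_glob, via_pathlib)
```

Note that `Path.glob` returns a lazy generator, so wrap it in `list(...)` or `sorted(...)` if you need to index it or iterate over it twice.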
239 |
240 | ## Print is a function now
241 |
242 | > Yes, code now has these annoying parentheses, but there are some advantages:
243 |
244 | Yes, code now has these annoying parentheses, but there are some advantages:
245 |
246 | > - simple syntax for using file descriptor:
247 | - simple syntax for using a file descriptor:
248 |
249 | ```python
250 | print >>sys.stderr, "critical error" # Python 2
251 | print("critical error", file=sys.stderr) # Python 3
252 | ```
253 | > - printing tab-aligned tables without `str.join`:
254 | - printing tab-aligned tables without `str.join`:
255 |
256 | ```python
257 | # Python 3
258 | print(*array, sep='\t')
259 | print(batch, epoch, loss, accuracy, time, sep='\t')
260 | ```
261 | > - hacky suppressing / redirection of printing output:
262 | - hacky suppressing / redirection of printing output:
263 | ```python
264 | # Python 3
265 | _print = print # store the original print function
266 | def print(*args, **kargs):
267 |     pass  # do something useful, e.g. store output to some file
268 | ```
269 | In jupyter it is desirable to log each output to a separate file (to track what's happening after you got disconnected), so you can override `print` now.
270 |
271 | In Jupyter it is desirable to log each output to a separate file (to track what happens after you get disconnected), so you can now override `print`.
272 |
273 | Below you can see a context manager that temporarily overrides behavior of print:
274 |
275 | Below you can see a context manager that temporarily overrides the behavior of print:
276 | ```python
277 | @contextlib.contextmanager
278 | def replace_print():
279 |     import builtins
280 |     _print = print  # saving old print function
281 |     # or use some other function here
282 |     builtins.print = lambda *args, **kwargs: _print('new printing', *args, **kwargs)
283 |     yield
284 |     builtins.print = _print
285 |
286 | with replace_print():
287 |     print('hello')  # code here invokes the replaced print function
288 | ```
289 | It is *not* a recommended approach, but a small dirty hack that is now possible.
290 |
291 | This is *not* a recommended approach, just a small dirty hack that is now possible.
292 | > - `print` can participate in list comprehensions and other language constructs
293 | - `print` can participate in list comprehensions and other language constructs:
294 |
295 | ```python
296 | # Python 3
297 | result = process(x) if is_valid(x) else print('invalid item: ', x)
298 | ```
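
The context manager from this section can be made a little more robust. The sketch below is one possible variant, not the author's code: it adds the missing imports and restores the original `print` in a `finally` clause, so the override is undone even if the wrapped code raises. The name `prefixed_print` and the prefix text are hypothetical.

```python
import builtins
import contextlib
import io

@contextlib.contextmanager
def prefixed_print(prefix):
    """Temporarily replace print with a version that prepends a prefix."""
    original_print = builtins.print

    def patched(*args, **kwargs):
        original_print(prefix, *args, **kwargs)

    builtins.print = patched
    try:
        yield
    finally:
        builtins.print = original_print  # always restore, even after an exception

buffer = io.StringIO()
with prefixed_print('[run 1]'):
    print('loss', 0.25, file=buffer)   # goes through the patched function
print('done', file=buffer)             # the original print is back
captured = buffer.getvalue()
```

For plain redirection of output, the standard library already offers `contextlib.redirect_stdout`, which avoids touching `builtins` at all.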
299 |
300 |
301 | ## Underscores in numeric literals (thousands separator)
302 |
303 | > [PEP-515](https://www.python.org/dev/peps/pep-0515/ "PEP-515") introduced underscores in Numeric Literals.
304 | In Python3, underscores can be used to group digits visually in integral, floating-point, and complex number literals.
305 |
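306 | [PEP-515](https://www.python.org/dev/peps/pep-0515/ "PEP-515") introduced underscores in numeric literals. In Python 3, underscores can be used to group digits visually in integer, floating-point, and complex number literals.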
307 |
308 | ```python
309 | # grouping decimal numbers by thousands
310 | one_million = 1_000_000
311 |
312 | # grouping hexadecimal addresses by words
313 | addr = 0xCAFE_F00D
314 |
315 | # grouping bits into nibbles in a binary literal
316 | flags = 0b_0011_1111_0100_1110
317 |
318 | # same, for string conversions
319 | flags = int('0b_1111_0000', 2)
320 | ```
321 |
322 | ## f-strings for simple and reliable formatting
323 |
324 | > The default formatting system provides a flexibility that is not required in data experiments.
325 | The resulting code is either too verbose or too fragile towards any changes.
326 |
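327 | The default formatting system provides flexibility that is not needed in data experiments. The resulting code is either too verbose or too fragile with respect to changes.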
328 |
329 | > Quite typically, data scientists repeatedly output some logging information in a fixed format.
330 | It is common to have code like:
331 |
332 | Data scientists typically output some logging information repeatedly in a fixed format. Code like the following is common:
333 |
334 | ```python
335 | # Python 2
336 | print('{batch:3} {epoch:3} / {total_epochs:3} accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'.format(
337 |     batch=batch, epoch=epoch, total_epochs=total_epochs,
338 |     acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
339 |     avg_time=time / len(data_batch)
340 | ))
341 |
342 | # Python 2 (too error-prone during fast modifications, please avoid):
343 | print('{:3} {:3} / {:3} accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
344 |     batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
345 |     time / len(data_batch)
346 | ))
347 | ```
348 |
349 | Sample output:
350 | ```
351 | 120 12 / 300 accuracy: 0.8180±0.4649 time: 56.60
352 | ```
353 |
354 | > **f-strings** aka formatted string literals were introduced in Python 3.6:
355 |
356 | **f-strings**, a.k.a. formatted string literals, were introduced in Python 3.6:
357 | ```python
358 | # Python 3.6+
359 | print(f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')
360 | ```
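
A runnable variant of the f-string line above, as a sketch with invented numbers. It uses the standard `statistics` module instead of `numpy` so that it has no dependencies; `pstdev` is the population standard deviation, which matches the default behavior of `numpy.std`.

```python
import statistics

batch, epoch, total_epochs = 120, 12, 300
accuracies = [0.81, 0.83, 0.82]   # made-up values for illustration

# expressions and format specs live right next to each other inside the braces
line = (f'{batch:3} {epoch:3} / {total_epochs:3} '
        f'accuracy: {statistics.mean(accuracies):0.4f}±{statistics.pstdev(accuracies):0.4f}')
print(line)
```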
361 |
362 |
363 | ## Explicit difference between 'true division' and 'integer division'
364 |
365 | > For data science this is definitely a handy change.
366 |
367 | For data science this is definitely a handy change.
368 |
369 | ```python
370 | data = pandas.read_csv('timing.csv')
371 | velocity = data['distance'] / data['time']
372 | ```
373 |
374 | > Results in Python 2 depend on whether 'time' and 'distance' (e.g. measured in meters and seconds) are stored as integers.
375 | In Python 3, the result is correct in both cases, because the result of division is float.
376 |
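377 | Results in Python 2 depend on whether 'time' and 'distance' (e.g. measured in seconds and meters) are stored as integers, whereas in Python 3 the result is correct in both cases, because the result of division is a float.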
378 |
379 | > Another case is integer division, which is now an explicit operation:
380 |
381 | Another case is integer division, which is now an explicit operation:
382 |
383 | ```python
384 | n_gifts = money // gift_price # correct for int and float arguments
385 | ```
386 |
387 | > Note, that this applies both to built-in types and to custom types provided by data packages (e.g. `numpy` or `pandas`).
388 |
389 | Note that this applies both to built-in types and to custom types provided by data packages (e.g. `numpy` or `pandas`).
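
The two operators side by side, with plain built-in numbers (the variable names are arbitrary):

```python
distance, time = 7, 2

true_division = distance / time       # always a float in Python 3
floor_division = distance // time     # explicit integer (floor) division
float_floor = 7.0 // 2                # floor division also works for floats
negative_floor = -7 // 2              # rounds towards minus infinity, not towards zero

print(true_division, floor_division, float_floor, negative_floor)  # 3.5 3 3.0 -4
```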
390 |
391 | ## Strict ordering
392 |
393 | ```python
394 | # All these comparisons are illegal in Python 3
395 | 3 < '3'
396 | 2 < None
397 | (3, 4) < (3, None)
398 | (4, 5) < [4, 5]
399 |
400 | # False in both Python 2 and Python 3
401 | (4, 5) == [4, 5]
402 | ```
403 |
404 | > - prevents from occasional sorting of instances of different types
405 | - prevents accidental sorting of instances of different types
406 | ```python
407 | sorted([2, '1', 3]) # invalid for Python 3, in Python 2 returns [2, 3, '1']
408 | ```
409 | > - helps to spot some problems that arise when processing raw data
410 | - helps to spot some problems that arise when processing raw data
411 |
412 | > Sidenote: proper check for None is (in both Python versions)
413 |
414 | Sidenote: the proper check for None is (in both Python versions)
415 | ```python
416 | if a is not None:
417 |     pass
418 |
419 | if a:  # WRONG check for None
420 |     pass
421 | ```
422 |
423 |
424 | ## Unicode for NLP
425 |
426 | *Translator's note: NLP stands for Natural Language Processing.*
427 |
428 | ```python
429 | s = '您好'
430 | print(len(s))
431 | print(s[:2])
432 | ```
433 | Output:
434 | - Python 2: `6\n��`
435 | - Python 3: `2\n您好`.
436 |
437 | ```
438 | x = u'со'
439 | x += 'co' # ok
440 | x += 'со' # fail
441 | ```
442 | > Python 2 fails, Python 3 works as expected (because I've used russian letters in strings).
443 |
444 | Python 2 fails, Python 3 works as expected (because I've used Russian letters in the strings).
445 |
446 | > In Python 3 `str`s are unicode strings, and it is more convenient for NLP processing of non-english texts.
447 |
448 | In Python 3, `str`s are unicode strings, which is more convenient for NLP processing of non-English texts.
449 |
450 | > There are other funny things, for instance:
451 |
452 | There are other funny things, for instance:
453 | ```python
454 | 'a' < type < u'a' # Python 2: True
455 | 'a' < u'a' # Python 2: False
456 | ```
457 |
458 | ```python
459 | from collections import Counter
460 | Counter('Möbelstück')
461 | ```
462 |
463 | - Python 2: `Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})`
464 | - Python 3: `Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})`
465 |
466 | > You can handle all of this in Python 2 properly, but Python 3 is more friendly.
467 |
468 | You can handle all of this properly in Python 2, but Python 3 is more friendly.
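
The Python 2 numbers above come from counting bytes rather than characters. In Python 3 the two views are separate types, `str` and `bytes`, and the conversion between them is explicit, as this short sketch shows:

```python
from collections import Counter

s = '您好'
encoded = s.encode('utf-8')           # bytes: what Python 2 str used to hold

print(len(s), len(encoded))           # 2 characters, 6 bytes
first_char = s[:1]                    # slicing a str works per character
round_trip = encoded.decode('utf-8')  # back to the original text

umlauts = Counter('Möbelstück')       # characters are counted, not bytes
print(first_char, round_trip, umlauts['ö'], umlauts['ü'])
```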
469 |
470 | ## Preserving order of dictionaries and **kwargs
471 |
472 | > In CPython 3.6+ dicts behave like `OrderedDict` by default (and [this is guaranteed in Python 3.7+](https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6)).
473 | This preserves order during dict comprehensions (and other operations, e.g. during json serialization/deserialization)
474 |
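475 | In CPython 3.6+ dicts behave like `OrderedDict` by default (and [this is guaranteed in Python 3.7+](https://stackoverflow.com/questions/39980323/are-dictionaries-ordered-in-python-3-6)). This preserves order during dict comprehensions (and other operations, e.g. JSON serialization/deserialization).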
476 | ```python
477 | import json
478 | x = {str(i):i for i in range(5)}
479 | json.loads(json.dumps(x))
480 | # Python 2
481 | {u'1': 1, u'0': 0, u'3': 3, u'2': 2, u'4': 4}
482 | # Python 3
483 | {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4}
484 | ```
485 |
486 | > Same applies to `**kwargs` (in Python 3.6+), they're kept in the same order as they appear in parameters.
487 | Order is crucial when it comes to data pipelines, previously we had to write it in a cumbersome manner:
488 |
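489 | The same applies to `**kwargs` (in Python 3.6+): they are kept in the same order as they appear in the call. Order is crucial when it comes to data pipelines; previously we had to write this in a cumbersome manner: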
490 | ```python
491 | from torch import nn
492 | from collections import OrderedDict
493 | # Python 2
494 | model = nn.Sequential(OrderedDict([
495 |     ('conv1', nn.Conv2d(1,20,5)),
496 |     ('relu1', nn.ReLU()),
497 |     ('conv2', nn.Conv2d(20,64,5)),
498 |     ('relu2', nn.ReLU())
499 | ]))
500 |
501 | # Python 3.6+, how it *can* be done, not supported right now in pytorch
502 | model = nn.Sequential(
503 |     conv1=nn.Conv2d(1,20,5),
504 |     relu1=nn.ReLU(),
505 |     conv2=nn.Conv2d(20,64,5),
506 |     relu2=nn.ReLU()
507 | )
508 | ```
509 |
510 | > Did you notice? Uniqueness of names is also checked automatically.
511 |
512 | Did you notice? Uniqueness of names is also checked automatically.
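
Both properties can be checked without `pytorch`. In the sketch below, `make_pipeline` is a hypothetical function that simply records its keyword arguments; the duplicate-name check is demonstrated with `compile`, because a repeated keyword is rejected before the call is ever executed.

```python
def make_pipeline(**steps):
    """Return the steps in the order they were passed (guaranteed in Python 3.6+)."""
    return list(steps.items())

pipeline = make_pipeline(load='read_csv', clean='dropna', fit='ridge')
names = [name for name, _ in pipeline]
print(names)  # ['load', 'clean', 'fit']

# a repeated keyword is a SyntaxError, so name clashes cannot slip through
duplicate_error = None
try:
    compile("make_pipeline(load=1, load=2)", "<example>", "eval")
except SyntaxError as error:
    duplicate_error = str(error)
print(duplicate_error)
```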
513 |
514 |
515 | ## Iterable unpacking
516 |
517 | ```python
518 | # handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
519 | model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)
520 |
521 | # picking two last values from a sequence
522 | *prev, next_to_last, last = values_history
523 |
524 | # This also works with any iterables, so if you have a function that yields e.g. qualities,
525 | # below is a simple way to take only last two values from a list
526 | *prev, next_to_last, last = iter_train(args)
527 | ```
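
A self-contained version of the same idea. The generator `iter_train` is a stand-in for a real training loop and yields invented quality values.

```python
def iter_train(n_epochs):
    """Hypothetical training loop that yields one quality value per epoch."""
    for epoch in range(1, n_epochs + 1):
        yield epoch / 10

# the starred name collects everything that is not matched explicitly
*prev, next_to_last, last = iter_train(5)
print(prev, next_to_last, last)  # [0.1, 0.2, 0.3] 0.4 0.5

# the starred part may be empty when there is nothing extra to collect
model_parameters, optimizer_parameters, *other_params = ('weights', 'adam_state')
print(other_params)  # []
```

Keep in mind that star-unpacking consumes the whole iterable and stores the collected part as a list, so it is not suitable for very long or infinite generators.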
528 |
529 | ## Default pickle engine provides better compression for arrays
530 |
531 | ```python
532 | # Python 2
533 | import cPickle as pickle
534 | import numpy
535 | print len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
536 | # result: 23691675
537 |
538 | # Python 3
539 | import pickle
540 | import numpy
541 | len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
542 | # result: 8000162
543 | ```
544 |
545 | > Three times less space. And it is *much* faster.
546 | Actually similar compression (but not speed) is achievable with `protocol=2` parameter, but users typically ignore this option (or simply are not aware of it).
547 |
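548 | Three times less space, and *much* faster. In fact, similar compression (but not speed) is achievable with the `protocol=2` parameter, but users typically ignore this option (or are simply not aware of it).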
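
The effect of the protocol can be observed with the standard library alone. This sketch pickles a plain list of floats rather than a `numpy` array, so the exact ratio differs from the numbers above; the point is only that the text-based protocol 0 (the Python 2 default) is much larger than the binary ones.

```python
import pickle

values = [i / 7 for i in range(1000)]

size_text = len(pickle.dumps(values, protocol=0))   # text protocol, Python 2 default
size_v2 = len(pickle.dumps(values, protocol=2))     # binary protocol, available in Python 2
size_default = len(pickle.dumps(values))            # whatever the running Python 3 defaults to

print(size_text, size_v2, size_default)

# the data survives a round trip regardless of the protocol
restored = pickle.loads(pickle.dumps(values, protocol=2))
```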
549 |
550 |
551 | ## Safer comprehensions
552 |
553 | ```python
554 | labels =
555 | predictions = [model.predict(data) for data, labels in dataset]
556 |
557 | # labels are overwritten in Python 2
558 | # labels are not affected by comprehension in Python 3
559 | ```
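
A concrete, runnable version of this snippet, with invented data and a trivial stand-in for `model.predict`:

```python
labels = ['cat', 'dog']                      # outer variable we want to keep
dataset = [('img0', 0), ('img1', 1)]         # (data, label) pairs, made up

# the loop variable `labels` inside the comprehension has its own scope
predictions = [data.upper() for data, labels in dataset]

print(labels)       # still ['cat', 'dog'] in Python 3
print(predictions)  # ['IMG0', 'IMG1']
```

Ordinary `for` loops still leak their loop variable in Python 3; only comprehensions and generator expressions got their own scope.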
560 |
561 | ## Super, simply super()
562 |
563 | > Python 2 `super(...)` was a frequent source of mistakes in code.
564 |
565 | Python 2 `super(...)` was a frequent source of mistakes in code.
566 |
567 | ```python
568 | # Python 2
569 | class MySubClass(MySuperClass):
570 |     def __init__(self, name, **options):
571 |         super(MySubClass, self).__init__(name='subclass', **options)
572 |
573 | # Python 3
574 | class MySubClass(MySuperClass):
575 |     def __init__(self, name, **options):
576 |         super().__init__(name='subclass', **options)
577 | ```
578 |
579 | > More on `super` and method resolution order on [stackoverflow](https://stackoverflow.com/questions/576169/understanding-python-super-with-init-methods).
580 |
581 | More on `super` and method resolution order on [stackoverflow](https://stackoverflow.com/questions/576169/understanding-python-super-with-init-methods).
582 |
583 | ## Better IDE suggestions with variable annotations
584 |
585 | > The most enjoyable thing about programming in languages like Java, C# and alike is that IDE can make very good suggestions,
586 | because type of each identifier is known before executing a program.
587 |
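588 | The most enjoyable thing about programming in languages like Java, C# and the like is that the IDE can make very good suggestions, because the type of each identifier is known before the program is executed.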
589 |
590 | > In python this is hard to achieve, but annotations will help you
591 | > - write your expectations in a clear form
592 | > - and get good suggestions from IDE
593 |
594 | In Python this is hard to achieve, but annotations will help you
595 | - write your expectations in a clear form
596 | - and get good suggestions from the IDE
597 |
598 | 
599 | > This is an example of PyCharm suggestions with variable annotations.
600 | This works even in situations when functions you use are not annotated (e.g. due to backward compatibility).
601 |
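602 | This is an example of PyCharm suggestions with variable annotations. It works even when the functions you use are not annotated (e.g. due to backward compatibility).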
603 |
604 | ## Additional unpacking
605 |
606 | > Here is how you merge two dicts now:
607 |
608 | Here is how you merge two dicts now:
609 | ```python
610 | x = dict(a=1, b=2)
611 | y = dict(b=3, d=4)
612 | # Python 3.5+
613 | z = {**x, **y}
614 | # z = {'a': 1, 'b': 3, 'd': 4}, note that value for `b` is taken from the latter dict.
615 | ```
616 |
617 | > See [this thread at StackOverflow](https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression) for a comparison with Python 2.
618 |
619 | See [this thread at StackOverflow](https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression) for a comparison with Python 2.
620 |
621 | > The same approach also works for lists, tuples, and sets (`a`, `b`, `c` are any iterables):
622 |
623 | The same approach also works for lists, tuples, and sets (`a`, `b`, `c` are any iterables):
624 | ```python
625 | [*a, *b, *c] # list, concatenating
626 | (*a, *b, *c) # tuple, concatenating
627 | {*a, *b, *c} # set, union
628 | ```
629 |
630 | > Functions also [support this](https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-448) for `*args` and `**kwargs`:
631 |
632 | Functions also [support this](https://docs.python.org/3/whatsnew/3.5.html#whatsnew-pep-448) for `*args` and `**kwargs`:
633 | ```python
634 | # Python 3.5+
635 | do_something(**{**default_settings, **custom_settings})
636 |
637 | # Also possible, this code also checks there is no intersection between keys of dictionaries
638 | do_something(**first_args, **second_args)
639 | ```
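
Both call styles from the block above, as a runnable sketch. `do_something` and the settings dictionaries are placeholders that just echo what they receive.

```python
default_settings = {'lr': 0.1, 'epochs': 10}
custom_settings = {'epochs': 50}

def do_something(**settings):
    return settings

# merging first: the later dict wins for repeated keys
merged = do_something(**{**default_settings, **custom_settings})
print(merged)  # {'lr': 0.1, 'epochs': 50}

# unpacking both directly: a repeated key is reported as an error
clash = None
try:
    do_something(**default_settings, **custom_settings)
except TypeError as error:
    clash = str(error)
print(clash)
```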
640 |
641 | ## Future-proof APIs with keyword-only arguments
642 |
643 | Let's consider this snippet:
644 | ```python
645 | model = sklearn.svm.SVC(2, 'poly', 2, 4, 0.5)
646 | ```
647 | > Obviously, the author of this code hasn't picked up the Python style of coding yet (most probably, just jumped from cpp or rust).
648 | Unfortunately, this is not just a question of taste, because changing the order of arguments (adding/deleting) in `SVC` will break this code. In particular, `sklearn` reorders/renames numerous algorithm parameters from time to time to provide a consistent API. Each such refactoring may lead to broken code.
649 |
650 | Clearly, the author of this code hasn't yet grasped the Python coding style (most probably having just moved over from cpp or rust).
651 | Unfortunately, this is not just a matter of taste, because changing the order of arguments in `SVC` (adding/deleting) will break this code. In particular, `sklearn` reorders/renames many algorithm parameters from time to time to provide a consistent API. Each such refactoring may lead to broken code.
652 |
653 | > In Python 3, library authors may demand explicitly named parameters by using `*`:
654 |
655 | In Python 3, library authors may demand explicitly named parameters by using `*`:
656 | ```
657 | class SVC(BaseSVC):
658 |     def __init__(self, *, C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, ... )
659 | ```
660 | > - users have to specify names of parameters `sklearn.svm.SVC(C=2, kernel='poly', degree=2, gamma=4, coef0=0.5)` now
661 | > - this mechanism provides a great combination of reliability and flexibility of APIs
662 |
663 | - users now have to specify parameter names: `sklearn.svm.SVC(C=2, kernel='poly', degree=2, gamma=4, coef0=0.5)`
664 | - this mechanism provides a great combination of reliability and flexibility of APIs
665 |
666 | ## Minor: constants in `math` module
667 |
668 | ```python
669 | # Python 3
670 | math.inf # 'largest' number
671 | math.nan # not a number
672 |
673 | max_quality = -math.inf # no more magic initial values!
674 |
675 | for model in trained_models:
676 |     max_quality = max(max_quality, compute_quality(model, data))
677 | ```
678 |
679 | ## Minor: single integer type
680 |
681 | > Python 2 provides two basic integer types, which are int (64-bit signed integer) and long for long arithmetics (quite confusing after C++).
682 |
683 | Python 2 provides two basic integer types: int (a 64-bit signed integer) and long for long arithmetic (quite confusing after C++).
684 |
685 | > Python 3 has a single type `int`, which incorporates long arithmetics.
686 |
687 | Python 3 has a single type `int`, which incorporates long arithmetic.
688 |
689 | > Here is how you check that value is integer:
690 |
691 | Here is how you check that a value is an integer:
692 |
693 | ```
694 | isinstance(x, numbers.Integral) # Python 2, the canonical way
695 | isinstance(x, (long, int)) # Python 2
696 | isinstance(x, int) # Python 3, easier to remember
697 | ```
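
A few checks you can run in Python 3 to see the single integer type at work. One caveat worth knowing is that `bool` is a subclass of `int`, so `isinstance(True, int)` also holds.

```python
import numbers

big = 2 ** 100                       # far beyond 64 bits, still a plain int
type_name = type(big).__name__

is_int = isinstance(big, int)
is_integral = isinstance(big, numbers.Integral)   # also true for other integer-like types
float_is_int = isinstance(3.0, int)               # floats are not integers, even if whole
bool_is_int = isinstance(True, int)               # bool subclasses int

print(type_name, is_int, is_integral, float_is_int, bool_is_int)
```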
698 |
699 | ## Other stuff
700 |
701 | - `Enum`s are theoretically useful, but
702 |     - string-typing is already widely adopted in the python data stack
703 |     - `Enum`s don't seem to interplay with numpy and categorical from pandas
704 | - coroutines also *sound* very promising for data pipelining (see [slides](http://www.dabeaz.com/coroutines/Coroutines.pdf) by David Beazley), but I don't see their adoption in the wild.
705 | - Python 3 has [stable ABI](https://www.python.org/dev/peps/pep-0384/)
706 | - Python 3 supports unicode identifiers (so `ω = Δφ / Δt` is ok), but you'd [better use good old ASCII names](https://stackoverflow.com/a/29855176/498892)
707 | - some libraries e.g. [jupyterhub](https://github.com/jupyterhub/jupyterhub) (jupyter in cloud), django and fresh ipython only support Python 3, so features that sound useless for you are useful for libraries you'll probably want to use someday.
708 | ----------
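709 | - `Enum`s are theoretically useful, but
710 |     - string-typing is already widely adopted in the Python data stack
711 |     - `Enum`s don't seem to interplay with numpy and pandas categoricals
712 | - coroutines also *sound* very promising for data pipelining (see the [slides](http://www.dabeaz.com/coroutines/Coroutines.pdf) by David Beazley), but I have never seen them adopted in real code.
713 | - Python 3 has a [stable ABI](https://www.python.org/dev/peps/pep-0384/)
714 |
715 | *ABI (Application Binary Interface): describes the low-level interface between an application and the operating system, between an application and its libraries, or between the components of an application.*
716 | - Python 3 supports unicode identifiers (so even `ω = Δφ / Δt` is ok), but you'd [better use good old ASCII names](https://stackoverflow.com/a/29855176/498892).
717 | - some libraries, e.g. [jupyterhub](https://github.com/jupyterhub/jupyterhub) (jupyter in the cloud), django and recent ipython, only support Python 3, so features that sound useless to you are useful for libraries you'll probably want to use.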
718 |
719 | ### Code migration problems specific to data science (and how to resolve them)
720 |
721 | > - support for nested arguments [was dropped](https://www.python.org/dev/peps/pep-3113/)
722 | - support for nested arguments [was dropped](https://www.python.org/dev/peps/pep-3113/)
723 | ```
724 | map(lambda x, (y, z): x, z, dict.items())
725 | ```
726 |
727 | > However, it is still perfectly working with different comprehensions:
728 |
729 | However, it still works perfectly with different comprehensions:
730 | ```python
731 | {x:z for x, (y, z) in d.items()}
732 | ```
733 | > In general, comprehensions are also better 'translatable' between Python 2 and 3.
734 |
735 | In general, comprehensions are also more 'translatable' between Python 2 and 3.
736 |
737 | > - `map()`, `.keys()`, `.values()`, `.items()`, etc. return iterators, not lists. Main problems with iterators are:
738 | - no trivial slicing
739 | - can't be iterated twice
740 |
741 | - `map()`, `.keys()`, `.values()`, `.items()`, etc. return iterators, not lists. The main problems with iterators are:
742 |     - no trivial slicing
743 |     - can't be iterated twice
744 |
745 | > Almost all of the problems are resolved by converting result to list.
746 |
747 | Almost all of these problems are resolved by converting the result to a list.
748 |
749 | > - see [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/) when in trouble.
750 |
751 | - see [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/) when in trouble.
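
The iterator pitfalls mentioned in the list above, and the usual fixes, in a short sketch (the data is arbitrary):

```python
from itertools import islice

squares = map(lambda x: x * x, range(5))
first_pass = list(squares)     # [0, 1, 4, 9, 16]
second_pass = list(squares)    # [] because the iterator is already exhausted

# an iterator cannot be sliced with [:2], but itertools.islice takes a prefix lazily
head = list(islice(map(str.upper, 'abc'), 2))   # ['A', 'B']

# converting to a list restores indexing, slicing and repeated iteration
items = list({'a': 1, 'b': 2}.items())
first_item = items[0]          # ('a', 1)

print(first_pass, second_pass, head, first_item)
```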
752 |
753 | ### Main problems for teaching machine learning and data science with python
754 |
755 | > Course authors should spend time in the first lectures to explain what is an iterator,
756 | why it can't be sliced / concatenated / multiplied / iterated twice like a string (and how to deal with it).
757 |
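758 | Course authors should spend time in the first lectures explaining what an iterator is,
759 | why it can't be sliced / concatenated / multiplied / iterated twice like a string (and how to deal with it).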
760 |
761 | > I think most course authors would be happy to avoid these details, but now it is hardly possible.
762 |
763 | I think most course authors would have been happy to avoid these details, but now it is hardly possible.
764 |
765 | # Conclusion
766 |
767 | > Python 2 and Python 3 have co-existed for almost 10 years, but we *should* move to Python 3.
768 |
769 | Python 2 and Python 3 have co-existed for almost 10 years, but we *should* move to Python 3.
770 |
771 | > Research and production code should become a bit shorter, more readable, and significantly safer after moving to Python 3-only codebase.
772 |
773 | Research and production code should become a bit shorter, more readable, and significantly safer after moving to a Python 3-only codebase.
774 |
775 | > Right now most libraries support both Python versions.
776 | And I can't wait for the bright moment when packages drop support for Python 2 and enjoy new language features.
777 |
778 | 目前大部分类库都会支持两个Python版本,我已等不及要使用新的语言特性了,也同样期待依赖包舍弃对 Python 2 支持这一光明时刻的到来。
779 |
780 | > Following migrations are promised to be smoother: ["we will never do this kind of backwards-incompatible change again"](https://snarky.ca/why-python-3-exists/)
781 |
782 | Future migrations should be smoother: ["we will never do this kind of backwards-incompatible change again"](https://snarky.ca/why-python-3-exists/).
783 |
784 | ### Links
785 |
786 | - [Key differences between Python 2.7 and Python 3.x](http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html)
787 | - [Python FAQ: How do I port to Python 3?](https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/)
788 | - [10 awesome features of Python that you can't use because you refuse to upgrade to Python 3](http://www.asmeurer.com/python3-presentation/slides.html)
789 | - [Trust me, python 3.3 is better than 2.7 (video)](http://pyvideo.org/pycon-us-2013/python-33-trust-me-its-better-than-27.html)
790 | - [Python 3 for scientists](http://python-3-for-scientists.readthedocs.io/en/latest/)
791 |
792 |
793 | ### License
794 |
795 | This text was published by [Alex Rogozhnikov](https://arogozhnikov.github.io/about/) under [CC BY-SA 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/) (excluding images).
796 |
797 | Translated to Chinese by Hunter-liu (@lq920320).
798 |
799 |
800 |
--------------------------------------------------------------------------------
/images/pycharm-type-hinting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arogozhnikov/python3_with_pleasure/e018d177b358f34b0038d886085ebc1211fec82d/images/pycharm-type-hinting.png
--------------------------------------------------------------------------------
/images/variable_annotations.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/arogozhnikov/python3_with_pleasure/e018d177b358f34b0038d886085ebc1211fec82d/images/variable_annotations.png
--------------------------------------------------------------------------------