to clear sys.modules and rerun your
44 | session to test changes to code in a module you're working on.
45 |
46 |
47 | Configuration
48 | See the sample-config file for a list of available options. You should save
49 | your config file as ~/.config/bpython/config (i.e.
50 | $XDG_CONFIG_HOME/bpython/config) or specify one at the command line:
51 | bpython --config /path/to/bpython/config
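
A minimal example of what such a config file might look like (the keys shown
here are illustrative; see the sample-config file for the authoritative list
and defaults):

[general]
color_scheme = default
autocomplete_mode = simple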
52 |
53 |
54 | Dependencies
55 |
56 | Pygments
57 | requests
58 | curtsies >= 0.1.18
59 | greenlet
60 | six >= 1.5
61 | Sphinx != 1.1.2 (optional, for the documentation)
62 | mock (optional, for the testsuite)
63 | babel (optional, for internationalization)
64 | watchdog (optional, for monitoring imported modules for changes)
65 | jedi (optional, for experimental multiline completion)
66 |
67 |
68 | Python 2 before 2.7.7
69 | If you are using Python 2 before 2.7.7, the following dependency is also
70 | required:
71 |
72 | requests[security]
73 |
74 |
75 | cffi
76 | If you have problems installing cffi, which is needed by OpenSSL, please take a
77 | look at the cffi docs.
78 |
79 | bpython-urwid
80 | bpython-urwid requires the following additional packages:
81 |
82 | urwid
83 |
84 |
85 | Known Bugs
86 | For known bugs please see bpython's known issues and FAQ page.
87 |
88 | Contact & Contributing
89 | I hope you find bpython useful. Please feel free to submit any bug reports,
90 | patches, or suggestions to Robert, or place them on the GitHub
91 | issues tracker.
92 | For any other ways of communicating with bpython users and devs, you can find
93 | us at the community page on the project homepage, or in the community.
94 | Hope to see you there!
95 |
96 | CLI Windows Support
97 |
98 | Dependencies
99 | Curses: use the appropriate version compiled by Christoph Gohlke.
100 | pyreadline: use the version in the cheeseshop (PyPI).
101 |
102 | Recommended
103 | Obtain the less program from GnuUtils; this makes the pager work as intended.
104 | It can be obtained from cygwin, GnuWin32, or msys.
105 |
106 | Current version is tested with
107 |
108 | Curses 2.2
109 | pyreadline 1.7
110 |
111 |
112 | Curses Notes
113 | The curses build used has a bug where the colours are displayed incorrectly:
114 |
115 | red is swapped with blue
116 | cyan is swapped with yellow
117 |
118 | To correct this I have provided a windows.theme file.
119 | This curses implementation has 16 colours (dark and light versions of the
120 | colours).
121 |
122 | Alternatives
123 | ptpython
124 | IPython
125 | Feel free to get in touch if you know of any other alternatives that people
126 | may be interested in trying.
127 |
--------------------------------------------------------------------------------
/files/xxx/butterdb.txt:
--------------------------------------------------------------------------------
1 |
2 | butterdb
3 |
6 |
7 | Documentation | butterdb on PyPi
8 | butterdb is a library to help you work with Google Spreadsheet data. It lets you model your data as Python objects, to be easily manipulated or created.
9 |
10 | How do I use it?
11 |
12 | import butterdb
13 | import json
14 |
15 | # For getting OAuth Credential JSON file see http://gspread.readthedocs.org/en/latest/oauth2.html
16 | # Ensure that the client_email has been granted privileges to any workbooks you wish to access.
17 |
18 | with open('SomeGoogleProject-2a31d827b2a9.json') as credentials_file:
19 | json_key = json.load(credentials_file)
20 |
21 | client_email = json_key['client_email']
22 | private_key = str(json_key['private_key']).encode('utf-8')
23 |
24 | database = butterdb.Database(name="MyDatabaseSheet", client_email=client_email, private_key=private_key)
25 |
26 | @butterdb.register(database)
27 | class User(butterdb.Model):
28 | def __init__(self, name, password):
29 | self.name = self.field(name)
30 | self.password = self.field(password)
31 |
32 | users = User.get_instances()
33 |
34 | marianne = users[1]
35 |
36 | print(marianne.password) # rainbow_trout
37 |
38 | marianne.password = "hunter2"
39 | marianne.commit()
40 |
41 |
42 | How do I make instances?
43 | bob = User("bob", "BestPassword!")
44 | bob.commit()
45 |
46 |
47 | Where do I get it?
48 | pip install butterdb
49 |
50 | Simple as that?
51 | Yep! butterdb is a simple interface around gspread. Just .commit() your objects when you want to update the spreadsheet!
52 |
53 | How do I run the tests?
54 | nosetests
55 |
56 | What works?
57 |
58 | Store data in Google Spreadsheets (the cloud!!!)
59 | Models from classes
60 | Fields as attributes (decimals, ints, and strings only, as far as I know)
61 | Commits
62 | Mocked unit tests, mock database
63 | Arbitrary cell execution with =blah() (free stored procedures?)
64 | Auto backup/bad patch control
65 |
66 |
67 | What's missing?
68 |
69 | Spreadsheets must exist before connecting
70 | References
71 | Collections
72 | Customizable fields
73 | Customizable table size (arbitrarily hardcoded)
74 |
75 |
76 | Feedback
77 | Comments, concerns, issues and pull requests welcomed. Reddit /u/Widdershiny or email me at ncwjohnstone@gmail.com.
78 |
79 | License
80 | MIT License. See LICENSE file for full text.
81 |
--------------------------------------------------------------------------------
/files/xxx/cpython.txt:
--------------------------------------------------------------------------------
1 | This is Python version 3.7.0 alpha 1
2 |
3 |
4 |
5 |
6 | Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
7 | 2012, 2013, 2014, 2015, 2016, 2017 Python Software Foundation. All rights
8 | reserved.
9 | See the end of this file for further copyright and license information.
10 |
11 | Contents
12 |
13 | General Information
14 | Contributing to CPython
15 | Using Python
16 | Build Instructions
17 | Profile Guided Optimization
18 | Link Time Optimization
19 |
20 |
21 | What's New
22 | Documentation
23 | Converting From Python 2.x to 3.x
24 | Testing
25 | Installing multiple versions
26 | Issue Tracker and Mailing List
27 | Proposals for enhancement
28 | Release Schedule
29 | Copyright and License Information
30 |
31 |
32 |
33 | General Information
34 |
35 | Website: https://www.python.org
36 | Source code: https://github.com/python/cpython
37 | Issue tracker: https://bugs.python.org
38 | Documentation: https://docs.python.org
39 | Developer's Guide: https://docs.python.org/devguide/
40 |
41 |
42 | Contributing to CPython
43 | For more complete instructions on contributing to CPython development,
44 | see the Developer Guide.
45 |
46 | Using Python
47 | Installable Python kits, and information about using Python, are available at
48 | python.org.
49 |
50 | Build Instructions
51 | On Unix, Linux, BSD, macOS, and Cygwin:
52 | ./configure
53 | make
54 | make test
55 | sudo make install
56 |
57 | This will install Python as python3.
58 | You can pass many options to the configure script; run ./configure --help
59 | to find out more. On macOS and Cygwin, the executable is called python.exe;
60 | elsewhere it's just python.
61 | On macOS, if you have configured Python with --enable-framework, you
62 | should use make frameworkinstall to do the installation. Note that this
63 | installs the Python executable in a place that is not normally on your PATH;
64 | you may want to set up a symlink in /usr/local/bin.
65 | On Windows, see PCbuild/readme.txt .
66 | If you wish, you can create a subdirectory and invoke configure from there.
67 | For example:
68 | mkdir debug
69 | cd debug
70 | ../configure --with-pydebug
71 | make
72 | make test
73 |
74 | (This will fail if you also built at the top-level directory. You should do
75 | a make clean at the top level first.)
76 | To get an optimized build of Python, configure --enable-optimizations
77 | before you run make . This sets the default make targets up to enable
78 | Profile Guided Optimization (PGO) and may be used to auto-enable Link Time
79 | Optimization (LTO) on some platforms. For more details, see the sections
80 | below.
81 |
82 | Profile Guided Optimization
83 | PGO takes advantage of recent versions of the GCC or Clang compilers. When run,
84 | make profile-opt will perform several steps.
85 | First, the entire Python directory is cleaned of temporary files that may have
86 | resulted from a previous compilation.
87 | Then, an instrumented version of the interpreter is built, using suitable
88 | compiler flags for each flavour. Note that this is just an intermediary step
89 | and the binary resulting from this step is not suited to real-life workloads,
90 | as it has profiling instructions embedded inside.
91 | After this instrumented version of the interpreter is built, the Makefile will
92 | automatically run a training workload. This is necessary in order to profile
93 | the interpreter execution. Note also that any output, both stdout and stderr,
94 | that may appear at this step is suppressed.
95 | Finally, the last step is to rebuild the interpreter, using the information
96 | collected in the previous one. The end result will be a Python binary that is
97 | optimized and suitable for distribution or production installation.
98 |
99 | Link Time Optimization
100 | Enabled via configure's --with-lto flag. LTO takes advantage of the
101 | ability of recent compiler toolchains to optimize across the otherwise
102 | arbitrary .o file boundary when building final executables or shared
103 | libraries for additional performance gains.
104 |
105 | What's New
106 | We have a comprehensive overview of the changes in the What's New in Python
107 | 3.7 document. For a more
108 | detailed change log, read Misc/NEWS, but a full
109 | accounting of changes can only be gleaned from the commit history.
110 | If you want to install multiple versions of Python see the section below
111 | entitled "Installing multiple versions".
112 |
113 | Documentation
114 | Documentation for Python 3.7 is online,
115 | updated daily.
116 | It can also be downloaded in many formats for faster access. The documentation
117 | is downloadable in HTML, PDF, and reStructuredText formats; the latter version
118 | is primarily for documentation authors, translators, and people with special
119 | formatting requirements.
120 | For information about building Python's documentation, refer to Doc/README.rst.
121 |
122 | Converting From Python 2.x to 3.x
123 | Significant backward incompatible changes were made for the release of Python
124 | 3.0, which may cause programs written for Python 2 to fail when run with Python
125 | 3. For more information about porting your code from Python 2 to Python 3, see
126 | the Porting HOWTO.
127 |
128 | Testing
129 | To test the interpreter, type make test in the top-level directory. The
130 | test set produces some output. You can generally ignore the messages about
131 | skipped tests due to optional features which can't be imported. If a message
132 | is printed about a failed test or a traceback or core dump is produced,
133 | something is wrong.
134 | By default, tests are prevented from overusing resources like disk space and
135 | memory. To enable these tests, run make testall .
136 | If any tests fail, you can re-run the failing test(s) in verbose mode:
137 | make test TESTOPTS="-v test_that_failed"
138 |
139 | If the failure persists and appears to be a problem with Python rather than
140 | your environment, you can file a bug report and
141 | include relevant output from that command to show the issue.
142 |
143 | Installing multiple versions
144 | On Unix and Mac systems if you intend to install multiple versions of Python
145 | using the same installation prefix ( --prefix argument to the configure
146 | script) you must take care that your primary python executable is not
147 | overwritten by the installation of a different version. All files and
148 | directories installed using make altinstall contain the major and minor
149 | version and can thus live side-by-side. make install also creates
150 | ${prefix}/bin/python3 which refers to ${prefix}/bin/pythonX.Y . If you
151 | intend to install multiple versions using the same prefix you must decide which
152 | version (if any) is your "primary" version. Install that version using
153 | make install. Install all other versions using make altinstall.
154 | For example, if you want to install Python 2.7, 3.6, and 3.7 with 3.7 being the
155 | primary version, you would execute make install in your 3.7 build directory
156 | and make altinstall in the others.
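
For instance, with a shared prefix and 3.7 as the primary version, that
sequence might look like this (build directory paths are illustrative):

cd /path/to/build-3.7 && make install      # primary; also installs python3
cd /path/to/build-3.6 && make altinstall   # versioned binaries only
cd /path/to/build-2.7 && make altinstall   # versioned binaries only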
157 |
158 | Issue Tracker and Mailing List
159 | Bug reports are welcome! You can use the issue tracker to report bugs and/or
160 | submit pull requests on GitHub.
161 | You can also follow development discussion on the python-dev mailing list.
162 |
163 | Proposals for enhancement
164 | If you have a proposal to change Python, you may want to send an email to the
165 | comp.lang.python or python-ideas mailing lists for initial feedback. A
166 | Python Enhancement Proposal (PEP) may be submitted if your idea gains ground.
167 | All current PEPs, as well as guidelines for submitting a new PEP, are listed at
168 | python.org/dev/peps/ .
169 |
170 | Release Schedule
171 | See PEP 537 for Python 3.7 release details.
172 |
173 | Copyright and License Information
174 | Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
175 | 2012, 2013, 2014, 2015, 2016 Python Software Foundation. All rights reserved.
176 | Copyright (c) 2000 BeOpen.com. All rights reserved.
177 | Copyright (c) 1995-2001 Corporation for National Research Initiatives. All
178 | rights reserved.
179 | Copyright (c) 1991-1995 Stichting Mathematisch Centrum. All rights reserved.
180 | See the file "LICENSE" for information on the history of this software, terms &
181 | conditions for usage, and a DISCLAIMER OF ALL WARRANTIES.
182 | This Python distribution contains no GNU General Public License (GPL) code,
183 | so it may be used in proprietary projects. There are interfaces to some GNU
184 | code but these are entirely optional.
185 | All trademarks referenced herein are property of their respective holders.
186 |
--------------------------------------------------------------------------------
/files/xxx/dejavu.txt:
--------------------------------------------------------------------------------
1 | dejavu
2 | Audio fingerprinting and recognition algorithm implemented in Python, see the explanation here:
3 | How it works
4 | Dejavu can memorize audio by listening to it once and fingerprinting it. Then by playing a song and recording microphone input, Dejavu attempts to match the audio against the fingerprints held in the database, returning the song being played.
5 | Note that for voice recognition, Dejavu is not the right tool! Dejavu excels at recognition of exact signals with reasonable amounts of noise.
6 | Installation and Dependencies:
7 | Read INSTALLATION.md
8 | Setup
9 | First, install the above dependencies.
10 | Second, you'll need to create a MySQL database where Dejavu can store fingerprints. For example, on your local setup:
11 | $ mysql -u root -p
12 | Enter password: **********
13 | mysql> CREATE DATABASE IF NOT EXISTS dejavu;
14 |
15 | Now you're ready to start fingerprinting your audio collection!
16 | Quickstart
17 | $ git clone https://github.com/worldveil/dejavu.git ./dejavu
18 | $ cd dejavu
19 | $ python example.py
20 | Fingerprinting
21 | Let's say we want to fingerprint all of July 2013's VA US Top 40 hits.
22 | Start by creating a Dejavu object with your configuration settings (Dejavu takes an ordinary Python dictionary for the settings).
23 | >>> from dejavu import Dejavu
24 | >>> config = {
25 | ...     "database": {
26 | ...         "host": "127.0.0.1",
27 | ...         "user": "root",
28 | ...         "passwd": <password above>,
29 | ...         "db": <name of the database you created above>,
30 | ...     }
31 | ... }
32 | >>> djv = Dejavu(config)
33 | Next, give the fingerprint_directory method three arguments:
34 |
35 | input directory to look for audio files
36 | audio extensions to look for in the input directory
37 | number of processes (optional)
38 |
39 | >>> djv.fingerprint_directory("va_us_top_40/mp3", [".mp3"], 3)
40 | For a large number of files, this will take a while. However, Dejavu is robust enough that you can kill and restart it without affecting progress: Dejavu remembers which songs it has fingerprinted and converted and which it hasn't, so it won't repeat itself.
41 | You'll have a lot of fingerprints once it completes a large folder of mp3s:
42 | >>> print djv.db.get_num_fingerprints()
43 | 5442376
44 | Also, any subsequent calls to fingerprint_file or fingerprint_directory will fingerprint and add those songs to the database as well. It's meant to simulate a system where, as new songs are released, they are fingerprinted and added to the database seamlessly, without stopping the system.
45 | Configuration options
46 | The configuration object to the Dejavu constructor must be a dictionary.
47 | The following keys are mandatory:
48 |
49 | database, whose value is a dictionary of keys that the database you are using will accept. For example, with MySQL the keys can be anything that the MySQLdb.connect() function will accept.
50 |
51 | The following keys are optional:
52 |
53 | fingerprint_limit: allows you to control how many seconds of each audio file to fingerprint. Leaving out this key, or using -1 or None, will cause Dejavu to fingerprint the entire audio file. The default value is None.
54 | database_type: as of now, only mysql (the default value) is supported. If you'd like to subclass Database and add another, please fork and send a pull request!
55 |
56 | An example configuration is as follows:
57 | >>> from dejavu import Dejavu
58 | >>> config = {
59 | ...     "database": {
60 | ...         "host": "127.0.0.1",
61 | ...         "user": "root",
62 | ...         "passwd": "Password123",
63 | ...         "db": "dejavu_db",
64 | ...     },
65 | ...     "database_type": "mysql",
66 | ...     "fingerprint_limit": 10
67 | ... }
68 | >>> djv = Dejavu(config)
69 | Tuning
70 | Inside fingerprint.py, you may want to adjust the following parameters (example values are given below).
71 | FINGERPRINT_REDUCTION = 30
72 | PEAK_SORT = False
73 | DEFAULT_OVERLAP_RATIO = 0.4
74 | DEFAULT_FAN_VALUE = 10
75 | DEFAULT_AMP_MIN = 15
76 | PEAK_NEIGHBORHOOD_SIZE = 30
77 |
78 | These parameters are described in detail in fingerprint.py. Read it to understand the impact of changing these values.
79 | Recognizing
80 | There are two ways to recognize audio using Dejavu. You can recognize by reading and processing files on disk, or through your computer's microphone.
81 | Recognizing: On Disk
82 | Through the terminal:
83 | $ python dejavu.py --recognize file sometrack.wav
84 | {'song_id': 1, 'song_name': 'Taylor Swift - Shake It Off', 'confidence': 3948, 'offset_seconds': 30.00018, 'match_time': 0.7159781455993652, 'offset': 646L}
85 | or from a script, assuming you've already instantiated a Dejavu object:
86 | >>> from dejavu.recognize import FileRecognizer
87 | >>> song = djv.recognize(FileRecognizer, "va_us_top_40/wav/Mirrors - Justin Timberlake.wav")
88 | Recognizing: Through a Microphone
89 | With scripting:
90 | >>> from dejavu.recognize import MicrophoneRecognizer
91 | >>> song = djv.recognize(MicrophoneRecognizer, seconds=10)  # Defaults to 10 seconds.
92 | and with the command line script, you specify the number of seconds to listen:
93 | $ python dejavu.py --recognize mic 10
94 | Testing
95 | Testing out different parameterizations of the fingerprinting algorithm is often useful as the corpus becomes larger and larger, and inevitable tradeoffs between speed and accuracy come into play.
96 |
97 | Test your Dejavu settings on a corpus of audio files on a number of different metrics:
98 |
99 | Confidence of match (number fingerprints aligned)
100 | Offset matching accuracy
101 | Song matching accuracy
102 | Time to match
103 |
104 |
105 | An example script is given in test_dejavu.sh, shown below:
106 | #####################################
107 | ### Dejavu example testing script ###
108 | #####################################
109 | 
110 | ###########
111 | # Clear out previous results
112 | rm -rf ./results ./temp_audio
113 | 
114 | ###########
115 | # Fingerprint files of extension mp3 in the ./mp3 folder
116 | python dejavu.py --fingerprint ./mp3/ mp3
117 | 
118 | ##########
119 | # Run a test suite on the ./mp3 folder by extracting 1, 2, 3, 4, and 5
120 | # second clips sampled randomly from within each song, at least 8 seconds
121 | # away from the start or end, with sampling offsets using random seed = 42,
122 | # and finally, store results in ./results and log to ./results/dejavu-test.log
123 | python run_tests.py \
124 |     --secs 5 \
125 |     --temp ./temp_audio \
126 |     --log-file ./results/dejavu-test.log \
127 |     --padding 8 \
128 |     --seed 42 \
129 |     --results ./results \
130 |     ./mp3
131 | The testing scripts are, as of now, a bit rough and could certainly use some love and attention if you're interested in submitting a PR! For example, underscores in audio filenames currently break the test scripts.
132 | How does it work?
133 | The algorithm works off a fingerprint-based system, much like:
134 |
135 | Shazam
136 | MusicRetrieval
137 | Chromaprint
138 |
139 | The "fingerprints" are locality sensitive hashes that are computed from the spectrogram of the audio. This is done by taking the FFT of the signal over overlapping windows of the song and identifying peaks. A very robust peak finding algorithm is needed, otherwise you'll have a terrible signal to noise ratio.
140 | Here I've taken the spectrogram over the first few seconds of "Blurred Lines". The spectrogram is a 2D plot and shows amplitude as a function of time (a particular window, actually) and frequency, binned logrithmically, just as the human ear percieves it. In the plot below you can see where local maxima occur in the amplitude space:
141 |
142 | Finding these local maxima is a combination of a high pass filter (a threshold in amplitude space) and some image processing techniques to find maxima. A concept of a "neighborhood" is needed - a local maximum with only its directly adjacent pixels is a poor peak - one that will not survive the noise of coming through speakers and through a microphone.
143 | If we zoom in even closer, we can begin to imagine how to bin and discretize these peaks. Finding the peaks itself is the most computationally intensive part, but it's not the end. Peaks are combined using their discrete time and frequency bins to create a unique hash for that particular moment in the song - creating a fingerprint.
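
As an illustration only, here is a minimal, hypothetical sketch of these two
steps (this is not Dejavu's actual code; it assumes numpy, scipy, and
matplotlib are installed, and the neighborhood size and amplitude threshold
are arbitrary):

import hashlib
import numpy as np
from matplotlib import mlab
from scipy.ndimage import maximum_filter

def spectrogram_peaks(samples, rate=44100, nfft=4096, amp_min=15):
    # Spectrogram: amplitude over overlapping FFT windows (frequency x time).
    spec, _, _ = mlab.specgram(samples, NFFT=nfft, Fs=rate, noverlap=nfft // 2)
    spec = 10 * np.log10(spec + 1e-10)  # log amplitude, avoiding log(0)
    # A cell is a peak if it is the maximum of its neighborhood and also
    # clears the amplitude threshold (the "high pass filter" above).
    is_peak = (maximum_filter(spec, size=25) == spec) & (spec > amp_min)
    freq_bins, time_bins = np.nonzero(is_peak)
    return sorted(zip(time_bins, freq_bins))  # peaks ordered by time bin

def fingerprint_hashes(peaks, fan_value=10):
    # Pair each peak with the next few peaks in time; hashing the two
    # frequency bins plus their time delta fingerprints that moment.
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_value]:
            h = hashlib.sha1('{}|{}|{}'.format(f1, f2, t2 - t1).encode())
            yield h.hexdigest()[:20], t1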
144 |
145 | For a more detailed look at the making of Dejavu, see my blog post here.
146 | How well it works
147 | To truly get the benefit of an audio fingerprinting system, it can't take a long time to fingerprint. It's a bad user experience; furthermore, a user may decide to try to match the song with only a few precious seconds of audio left before the radio station goes to a commercial break.
148 | To test Dejavu's speed and accuracy, I fingerprinted a list of 45 songs from the US VA Top 40 from July 2013 (I know, their counting is off somewhere). I tested in three ways:
149 |
150 | Reading from disk the raw mp3 -> wav data, and
151 | Playing the song over the speakers with Dejavu listening on the laptop microphone.
152 | Compressed streamed music played on my iPhone
153 |
154 | Below are the results.
155 | 1. Reading from Disk
156 | Reading from disk was an overwhelming 100% recall - no mistakes were made over the 45 songs I fingerprinted. Since Dejavu gets all of the samples from the song (without noise), it would be a nasty surprise if reading the same file from disk didn't work every time!
157 | 2. Audio over laptop microphone
158 | Here I wrote a script to randomly choose n seconds of audio from the original mp3 file to play and have Dejavu listen over the microphone. To be fair, I only allowed segments of audio that were more than 10 seconds from the start/end of the track to avoid listening to silence.
159 | Additionally my friend was even talking and I was humming along a bit during the whole process, just to throw in some noise.
160 | Here are the results for different values of listening time (n):
161 |
162 | This is pretty rad. For the percentages:
163 | 
164 | Number of Seconds | Number Correct | Percentage Accuracy
165 | ----------------- | -------------- | -------------------
166 | 1                 | 27 / 45        | 60.0%
167 | 2                 | 43 / 45        | 95.6%
168 | 3                 | 44 / 45        | 97.8%
169 | 4                 | 44 / 45        | 97.8%
170 | 5                 | 45 / 45        | 100.0%
171 | 6                 | 45 / 45        | 100.0%
202 | Even with only a single second, randomly chosen from anywhere in the song, Dejavu is getting 60%! Going up to 2 seconds gets us to around 96%, while a perfect score took 5 seconds or more. Honestly, when I was testing this myself, I found Dejavu beat me - identifying a song from only 1-2 seconds heard out of context is pretty hard. I had even been listening to these same songs for two days straight while debugging...
203 | In conclusion, Dejavu works amazingly well, even with next to nothing to work with.
204 | 3. Compressed streamed music played on my iPhone
205 | Just to try it out, I tried playing music from my Spotify account (160 kbit/s compressed) through my iPhone's speakers with Dejavu again listening on my MacBook mic. I saw no degradation in performance; 1-2 seconds was enough to recognize any of the songs.
206 | Performance
207 | Speed
208 | On my MacBook Pro, matching was done at 3x listening speed with a small constant overhead. To test, I tried different recording times and plotted the recording time plus the time to match. Since the speed is largely independent of the particular song, and depends more on the length of the spectrogram created, I tested on a single song, "Get Lucky" by Daft Punk:
209 |
210 | As you can see, the relationship is quite linear. The line you see is a least-squares linear regression fit to the data, with the corresponding line equation:
211 | 1.364757 * record_time - 0.034373 = time_to_match
212 |
213 | Notice, of course, that since the matching itself is single-threaded, the matching time includes the recording time. This makes sense with the 3x speed in purely matching, as:
214 | 1 (recording) + 1/3 (matching) = 4/3 ~= 1.364757
215 |
216 | if we disregard the minuscule constant term.
217 | The overhead of peak finding is the bottleneck - I experimented with multithreading and realtime matching, and alas, it wasn't meant to be in Python. An equivalent Java or C/C++ implementation would most likely have little trouble keeping up, applying FFT and peak finding in realtime.
218 | An important caveat is, of course, the round trip time (RTT) for making matches. Since my MySQL instance was local, I didn't have to deal with the latency penalty of transferring fingerprint matches over the air. This would add RTT to the constant term in the overall calculation, but would not affect the matching process.
219 | Storage
220 | For the 45 songs I fingerprinted, the database used 377 MB of space for 5.4 million fingerprints. In comparison, the disk usage is given below:
221 |
222 | 
223 | Audio Information Type | Storage in MB
224 | ---------------------- | -------------
225 | mp3                    | 339
226 | wav                    | 1885
227 | fingerprints           | 377
241 | There's a pretty direct trade-off between the necessary record time and the amount of storage needed. Adjusting the amplitude threshold for peaks and the fan value for fingerprinting will add more fingerprints and bolster the accuracy at the expense of more space.
242 |
--------------------------------------------------------------------------------
/files/xxx/demiurge.txt:
--------------------------------------------------------------------------------
1 | demiurge
2 | PyQuery-based scraping micro-framework.
3 | Supports Python 2.x and 3.x.
4 |
5 | Documentation: http://demiurge.readthedocs.org
6 | Installing demiurge
7 | $ pip install demiurge
8 |
9 | Quick start
10 | Define items to be scraped using a declarative (Django-inspired) syntax:
11 | import demiurge
12 |
13 | class TorrentDetails(demiurge.Item):
14 |     label = demiurge.TextField(selector='strong')
15 |     value = demiurge.TextField()
16 | 
17 |     def clean_value(self, value):
18 |         unlabel = value[value.find(':') + 1:]
19 |         return unlabel.strip()
20 | 
21 |     class Meta:
22 |         selector = 'div#specifications p'
23 | 
24 | class Torrent(demiurge.Item):
25 |     url = demiurge.AttributeValueField(
26 |         selector='td:eq(2) a:eq(1)', attr='href')
27 |     name = demiurge.TextField(selector='td:eq(2) a:eq(2)')
28 |     size = demiurge.TextField(selector='td:eq(3)')
29 |     details = demiurge.RelatedItem(
30 |         TorrentDetails, selector='td:eq(2) a:eq(2)', attr='href')
31 | 
32 |     class Meta:
33 |         selector = 'table.maintable:gt(0) tr:gt(0)'
34 |         base_url = 'http://www.mininova.org'
35 |
36 |
37 | >>> t = Torrent.one('/search/ubuntu/seeds')
38 | >>> t.name
39 | 'Ubuntu 7.10 Desktop Live CD'
40 | >>> t.size
41 | u'695.81\xa0MB'
42 | >>> t.url
43 | '/get/1053846'
44 | >>> t.html
45 | u'19\xa0Dec\xa007 | Software | ...'
46 |
47 | >>> results = Torrent.all('/search/ubuntu/seeds')
48 | >>> len(results)
49 | 116
50 | >>> for t in results[:3]:
51 | ...     print t.name, t.size
52 | ...
53 | Ubuntu 7.10 Desktop Live CD 695.81 MB
54 | Super Ubuntu 2008.09 - VMware image 871.95 MB
55 | Portable Ubuntu 9.10 for Windows 559.78 MB
56 | ...
57 |
58 | >>> t = Torrent.one('/search/ubuntu/seeds')
59 | >>> for detail in t.details:
60 | ...     print detail.label, detail.value
61 | ...
62 | Category: Software > GNU/Linux
63 | Total size: 695.81 megabyte
64 | Added: 2467 days ago by Distribution
65 | Share ratio: 17 seeds, 2 leechers
66 | Last updated: 35 minutes ago
67 | Downloads: 29,085
68 | See documentation for details: http://demiurge.readthedocs.org
69 | Why demiurge?
70 | Plato, as the speaker Timaeus, refers to the Demiurge frequently in the Socratic
71 | dialogue Timaeus, c. 360 BC. The main character refers to the Demiurge as the
72 | entity who "fashioned and shaped" the material world. Timaeus describes the
73 | Demiurge as unreservedly benevolent, and hence desirous of a world as good as
74 | possible. The world remains imperfect, however, because the Demiurge created
75 | the world out of a chaotic, indeterminate non-being.
76 | http://en.wikipedia.org/wiki/Demiurge
77 | Contributors
78 |
79 | Martín Gaitán (@mgaitan)
80 |
81 |
--------------------------------------------------------------------------------
/files/xxx/django-guardian.txt:
--------------------------------------------------------------------------------
1 | django-guardian
2 |
3 | django-guardian is an implementation of per object permissions [1] on top
4 | of Django's authorization backend
5 |
6 | Documentation
7 | Online documentation is available at https://django-guardian.readthedocs.io/.
8 |
9 | Requirements
10 |
11 | Python 2.7 or 3.4+
12 | A supported version of Django (currently 1.8+)
13 |
14 | Travis CI tests on Django version 1.8, 1.10, and 1.11.
15 |
16 | Installation
17 | To install django-guardian simply run:
18 | pip install django-guardian
19 |
20 |
21 | Configuration
22 | We need to hook django-guardian into our project.
23 |
24 | Put guardian into your INSTALLED_APPS in your settings module:
25 |
26 | INSTALLED_APPS = (
27 |     ...
28 |     'guardian',
29 | )
30 |
31 | Add the extra authorization backend to your settings.py:
32 |
33 | AUTHENTICATION_BACKENDS = (
34 |     'django.contrib.auth.backends.ModelBackend',  # default
35 |     'guardian.backends.ObjectPermissionBackend',
36 | )
37 |
38 | Create guardian database tables by running:
39 | python manage.py migrate
40 |
41 |
42 |
43 |
44 | Usage
45 | After installation and project hooks, we can finally use object permissions
46 | with Django.
47 | Let's start really quickly:
48 | >>> from django.contrib.auth.models import User, Group
49 | >>> jack = User.objects.create_user('jack', 'jack@example.com', 'topsecretagentjack')
50 | >>> admins = Group.objects.create(name='admins')
51 | >>> jack.has_perm('change_group', admins)
52 | False
53 | >>> from guardian.models import UserObjectPermission
54 | >>> UserObjectPermission.objects.assign_perm('change_group', jack, obj=admins)
55 | <UserObjectPermission: admins | jack | change_group>
56 | >>> jack.has_perm('change_group', admins)
57 | True
58 | Of course our agent jack here would not be able to change_group globally:
59 | >>> jack.has_perm('change_group')
60 | False
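
The same assignment can also be written with guardian's shortcuts module (a
small equivalent sketch; see the guardian documentation for the full
shortcuts API):
>>> from guardian.shortcuts import assign_perm
>>> assign_perm('change_group', jack, admins)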
61 |
62 | Admin integration
63 | Replace admin.ModelAdmin with GuardedModelAdmin for those models
64 | which should have object permissions support within the admin panel.
65 | For example:
66 | from django.contrib import admin
67 | from myapp.models import Author
68 | from guardian.admin import GuardedModelAdmin
69 |
70 | # Old way:
71 | # class AuthorAdmin(admin.ModelAdmin):
72 | # pass
73 |
74 | # With object permissions support
75 | class AuthorAdmin(GuardedModelAdmin):
76 |     pass
77 |
78 | admin.site.register(Author, AuthorAdmin)
79 |
80 |
81 | [1] A great article about this feature is available in the djangoadvent articles.
82 |
83 |
84 |
--------------------------------------------------------------------------------
/files/xxx/django-haystack.txt:
--------------------------------------------------------------------------------
1 | Haystack
2 |
3 |
4 | Author:
5 | Daniel Lindsley
6 | Date:
7 | 2013/07/28
8 |
9 |
10 | Haystack provides modular search for Django. It features a unified, familiar
11 | API that allows you to plug in different search backends (such as Solr ,
12 | Elasticsearch , Whoosh , Xapian , etc.) without having to modify your code.
13 | Haystack is BSD licensed, plays nicely with third-party apps without needing
14 | to modify the source, and supports advanced features like faceting, More Like
15 | This, highlighting, spatial search and spelling suggestions.
16 | You can find more information at http://haystacksearch.org/.
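
To give a flavor of the API, a minimal search index might look like the
following sketch (the Note model and its fields are hypothetical, loosely
based on the Haystack tutorial):

from haystack import indexes
from myapp.models import Note

class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    # The primary document field, rendered from a data template.
    text = indexes.CharField(document=True, use_template=True)
    author = indexes.CharField(model_attr='user')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        # Items to index when the whole index is updated.
        return self.get_model().objects.all()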
17 |
18 | Getting Help
19 | There is a mailing list (http://groups.google.com/group/django-haystack/)
20 | available for general discussion and an IRC channel (#haystack on
21 | irc.freenode.net).
22 |
23 | Documentation
24 |
25 | Development version: http://docs.haystacksearch.org/
26 | v2.6.X: https://django-haystack.readthedocs.io/en/v2.6.0/
27 | v2.5.X: https://django-haystack.readthedocs.io/en/v2.5.0/
28 | v2.4.X: https://django-haystack.readthedocs.io/en/v2.4.1/
29 | v2.3.X: https://django-haystack.readthedocs.io/en/v2.3.0/
30 | v2.2.X: https://django-haystack.readthedocs.io/en/v2.2.0/
31 | v2.1.X: https://django-haystack.readthedocs.io/en/v2.1.0/
32 | v2.0.X: https://django-haystack.readthedocs.io/en/v2.0.0/
33 | v1.2.X: https://django-haystack.readthedocs.io/en/v1.2.7/
34 | v1.1.X: https://django-haystack.readthedocs.io/en/v1.1/
35 |
36 | See the changelog
37 |
39 |
40 |
41 | Requirements
42 | Haystack has a relatively easily-met set of requirements.
43 |
44 | Python 2.7+ or Python 3.3+
45 | A supported version of Django: https://www.djangoproject.com/download/#supported-versions
46 |
47 | Additionally, each backend has its own requirements. You should refer to
48 | https://django-haystack.readthedocs.io/en/latest/installing_search_engines.html for more
49 | details.
50 |
--------------------------------------------------------------------------------
/fortest.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zx576/programming_vocabulary/35de38621c03c1385f59008bb8296f67eb35bdf8/fortest.db
--------------------------------------------------------------------------------
/models_exp.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.14
4 | # description
5 | # Expand the existing database: add some necessary columns and reserve spare ones.
6 | # Uses the peewee library to operate on sqlite3.
7 | # Creates two tables: word and book.
8 | # The models are named New* because the tables existed before but the schema
9 | # needed extending, so the data was migrated into newly created tables.
10 |
11 | from settings import DATABASE
12 | from peewee import *
13 |
14 | new_db = SqliteDatabase(DATABASE)
15 |
16 |
17 | class NewBook(Model):
18 |
19 | name = CharField()
20 |     # total vocabulary count
21 | total = IntegerField(default=0)
22 |     # whether this book has been analyzed yet
23 | is_analyzed = BooleanField(default=False)
24 |     # reserved columns
25 |     # spare fields kept for future extension
26 | re1 = CharField(default='')
27 | re2 = CharField(default='')
28 | re3 = IntegerField(default=0)
29 | re4 = IntegerField(default=0)
30 |
31 | class Meta:
32 | database = new_db
33 |
34 | class NewWord(Model):
35 |     # foreign key: which book the word was collected from
36 |     # book = ForeignKeyField(Book)
37 |     # the word itself
38 | name = CharField()
39 |     # explanation / definition
40 | explanation = TextField(default='')
41 |     # word frequency
42 | frequency = IntegerField(default=0)
43 |     # whether the word is valid
44 | is_valid = BooleanField(default=True)
45 |     # phonetic transcription
46 | phonogram = CharField(default='')
47 |     # reserved columns
48 |     # spare fields kept for future extension
49 | re1 = CharField(default='')
50 | re2 = CharField(default='')
51 | re3 = IntegerField(default=0)
52 | re4 = IntegerField(default=0)
53 |
54 | class Meta:
55 | database = new_db
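
# Usage sketch (hypothetical, not part of the original file): create the
# tables once before inserting rows; connect() and create_tables() are
# standard peewee calls.
if __name__ == '__main__':
    new_db.connect()
    new_db.create_tables([NewBook, NewWord], safe=True)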
--------------------------------------------------------------------------------
/python-words.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zx576/programming_vocabulary/35de38621c03c1385f59008bb8296f67eb35bdf8/python-words.xlsx
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | peewee
2 | requests
3 | bs4
4 | lxml
5 |
--------------------------------------------------------------------------------
/settings.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.11
4 | # description
5 | # some settings for this project
6 |
7 | import os
8 |
9 | _BASEDIR = os.path.dirname(__file__)
10 |
11 | # database file name
12 | DATABASE = os.path.join(_BASEDIR, 'fortest.db')
13 |
14 |
15 | # directories to walk through
16 | DIRS = [
17 |     # example: a folder named 'files' under the project directory
18 | os.path.join(_BASEDIR, 'files'),
19 | ]
20 |
21 |
22 | # individual files can also be added
23 | FILES = [
24 |     # example: a file named 'python.txt' under the project directory
25 | # os.path.join(_BASEDIR, 'fortest.txt')
26 |
27 | ]
28 |
29 | # how many words to sample from each book (see the helper sketch after this list)
30 | NUMBERS = [
31 |     (100, 10),    # fewer than 100 words: take 10
32 |     (1000, 100),  # 100 - 1000: take 100
33 | (5000, 300),
34 | (10000, 500),
35 | (50000, 1000),
36 |     (2**31, 1500)  # more than 50000: take 1500 across the board
37 | ]
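
# A hypothetical helper (not in the original) showing how NUMBERS is meant to
# be read: return the sample size for a book containing `total` distinct words.
def sample_size(total):
    for upper_bound, count in NUMBERS:
        if total < upper_bound:
            return count
    return NUMBERS[-1][1]  # fall through: use the largest bucket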
38 |
39 |
40 | # words that should be excluded from collection
41 | exclude_list = [
42 |     # pronouns
43 |     'i', 'you', 'he', 'she', 'it', 'we', 'they',  # subject
44 |     'me', 'him', 'her', 'us', 'them',  # object
45 |     'my', 'your', 'his', 'her', 'its', 'our', 'their',  # possessive adjectives
46 |     'mine', 'yours', 'his', 'hers', 'ours', 'theirs',  # possessive pronouns
47 |     'myself', 'yourself', 'himself', 'herself', 'itself', 'ourselves', 'yourselves', 'themselves',  # reflexive
48 | 'this', 'that', 'such', 'these', 'those', 'some',
49 | 'who', 'whom', 'whose', 'which', 'what', 'whoever', 'whichever', 'whatever', 'when',
50 | 'as', 'self',
51 | 'one', 'some', 'any', 'each', 'every', 'none', 'no', 'many', 'much', 'few', 'little',
52 | 'other', 'another', 'all', 'both', 'neither', 'either',
53 |     # articles
54 | 'a', 'an', 'the',
55 |
56 |     # simple prepositions
57 | 'about', 'with',
58 | 'into', 'out', 'of' , 'without',
59 | 'at', 'in', 'on', 'by', 'to',
60 |
61 |     # simple conjunctions
62 | 'and', 'also', 'too','not', 'but',
63 |
64 |     # simple numerals
65 | 'one', 'two', 'three', 'four', 'five',
66 |     # simple verbs
67 | 'is', 'am', 'are', 'was', 'were', 'be',
68 |     # others
69 | 'or', 'if', 'else', 'for','have', 'must', 'has', 'new', 'time',
70 |
71 | ]
72 |
73 |
--------------------------------------------------------------------------------
/shanbay/README.md:
--------------------------------------------------------------------------------
1 | ## Batch-add words to Shanbay
2 | 
3 | #### Files
4 | 
5 | - shanbeisettings.py: basic settings
6 | - creat_word_list.py: creates word-list chapters on Shanbay
7 | - add_to_shanbay.py: extracts the words and adds them to the chapters one by one
8 | 
9 | 
10 | #### Usage
11 | 
12 | Before using these scripts, it is best to walk through the flow on Shanbay manually once.
13 | 
14 | 1. Edit the settings
15 | 
16 | - set how many chapters to create, plus their names and descriptions
17 | - press F12 in the browser and copy the request headers into HEADER
18 | - change the wordbook id
19 | 
20 | 
21 | 2. Run creat_word_list.py
22 | 
23 | This script creates the configured word-list chapters on Shanbay.
24 | 
25 | 3. Run add_to_shanbay.py
26 | 
27 | This script adds the extracted words to the chapters one by one.
29 |
30 |
31 |
--------------------------------------------------------------------------------
/shanbay/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zx576/programming_vocabulary/35de38621c03c1385f59008bb8296f67eb35bdf8/shanbay/__init__.py
--------------------------------------------------------------------------------
/shanbay/add_to_shanbay.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.14
4 |
5 | # description
6 | # add words to shanbay.com (manual-cookie version)
7 |
8 | import requests
9 | import bs4
10 |
11 | from models_exp import NewWord
12 | from shanbay.shanbeisettings import HEADER,WORKBOOK_PATH, WORKBOOKID
13 |
14 |
15 | class ShanBay:
16 |
17 | def __init__(self):
18 |
19 |         # header carrying the login information (cookie)
20 | self.header = HEADER
21 | self.url = 'https://www.shanbay.com/api/v1/wordlist/vocabulary/'
22 | self.listid = []
23 | self.book_url = 'https://www.shanbay.com/wordbook/{}/'.format(str(WORKBOOKID))
24 | print(self.header)
25 | print(self.book_url)
26 |     # collect and store the ids of all created word-list chapters
27 | def _parse_id(self):
28 |
29 | req = requests.get(self.book_url, headers=self.header, timeout=2)
30 | req.raise_for_status()
31 | soup = bs4.BeautifulSoup(req.text, 'lxml')
32 | soup_a = soup.find_all('a', attrs={'desc': True})
33 |
34 | for a in soup_a:
35 | id = a['unit-id']
36 | print(id)
37 | self._save_id(id)
38 |
39 |     # save a chapter id to the designated file
40 | def _save_id(self, id):
41 | with open(WORKBOOK_PATH, 'a+')as f:
42 | f.write(str(id))
43 | f.write('\n')
44 |
45 |     # read the chapter ids back
46 | def _open_bookid(self):
47 | with open(WORKBOOK_PATH, 'r')as f:
48 | for i in f.readlines():
49 | self.listid.append(int(i))
50 |
51 | # print(self.listid)
52 |
53 |     # add one specific word to the given chapter
54 |
55 | def _add_one(self, word, listid):
56 |
57 | dct = {
58 | 'id': listid,
59 | 'word': word
60 | }
61 | # print(dct)
62 |         # swallow request errors and signal them to the caller
63 | try:
64 | req = requests.post(self.url, dct, headers=self.header)
65 | req.raise_for_status()
66 | res = req.json()
67 |             print('word {}'.format(word), res)
68 | # print(req.status_code)
69 |         except Exception:
70 | return '1'
71 |
72 | return res['msg']
73 |
74 |     # add the words:
75 |     # if shanbay reports no such word - skip it
76 |     # if it reports the chapter is full - switch to the next chapter
77 | def add(self):
78 |
79 | query = NewWord.select().where((NewWord.is_valid == True) & (NewWord.re1 == '')).order_by(-NewWord.frequency)
80 | iter_word = iter(query)
81 | self._open_bookid()
82 | iter_lst = iter(self.listid)
83 | id = next(iter_lst)
84 | while True:
85 |             # all words added: the run is finished
86 | try:
87 | next_word = next(iter_word)
88 |             except StopIteration:
89 | break
90 | res = self._add_one(next_word.name, id)
91 |             # handle request errors
92 | if res == '1':
93 |                 print('request error, try again later')
94 | break
95 |
96 |             # handle words shanbay does not know
97 | elif 'NOT' in res:
98 | next_word.re1 = 'invalid'
99 | next_word.save()
100 | continue
101 |
102 |             # switch to the next chapter ('过上限' means the word limit was exceeded)
103 | elif '过上限' in res:
104 | id = next(iter_lst)
105 | self.list_count = 0
106 | self._add_one(next_word.name, id)
107 |
108 |             # mark this word as added
109 | next_word.re1 = 'added'
110 | next_word.save()
111 |
112 |
113 |     # test whether adding a single word succeeds
114 | def test_add(self):
115 | self._add_one('define', 539857)
116 |
117 | if __name__ == '__main__':
118 |
119 | s = ShanBay()
120 | s._parse_id()
121 | s.add()
122 | # s._open_bookid()
--------------------------------------------------------------------------------
/shanbay/creat_word_list.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.15
4 | # description
5 | # create wordbook chapters on shanbay.com
6 |
7 | import requests
8 | import time
9 |
10 | from shanbay.shanbeisettings import CHAPTER_NAME, HEADER, WORKBOOKID
11 |
12 | class Create_list:
13 |
14 | def __init__(self):
15 | self.chapter = CHAPTER_NAME
16 | # self.cookie = COOKIE
17 | self.bookid = WORKBOOKID
18 | self.header = HEADER
19 | self.url = 'https://www.shanbay.com/api/v1/wordbook/wordlist/'
20 |
21 | def _create_lst(self, name, description):
22 |
23 | keywords = {
24 | 'name': name,
25 | 'description': description,
26 | 'wordbook_id': self.bookid
27 | }
28 | print(keywords)
29 | # self.header['cookie'] = self.cookie
30 | # print(self.header)
31 | try:
32 | req = requests.post(self.url, keywords, headers=self.header)
33 | req.raise_for_status()
34 | print(req.status_code)
35 | print(req.json()['data']['id'])
36 | assert req.json()['data']
37 | except Exception as e:
38 | print(e)
39 | return
40 |
41 | return req.json()['data']['id']
42 |
43 |
44 | def create(self):
45 |
46 | for key in self.chapter:
47 | # if '1' in key or '2' in key:
48 | # continue
49 |             print('creating chapter {0}, description {1}'.format(key, self.chapter[key]))
50 | id = self._create_lst(key, self.chapter[key])
51 |
52 | time.sleep(1)
53 |
54 | if __name__ == '__main__':
55 |
56 | c = Create_list()
57 | c.create()
--------------------------------------------------------------------------------
/shanbay/shanbeisettings.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.15
4 | # description
5 | # batch upload words to shanbay.com
6 | #
7 |
8 | from collections import OrderedDict
9 |
10 |
11 | # names of the word-list chapters to create
12 | def create():
13 |     count = 1
14 |     # how many chapters to create in total
15 |     total = 2
16 |     dct = OrderedDict()
17 |     # chapter name prefix
18 |     fix_key = 'Chapter-'
19 |     # chapter description template ('Unit {}')
20 |     fix_value = '第 {} 单元'
21 |     while True:
22 |         # stop once `total` chapters have been generated
23 |         if count > total:
24 |             break
25 |         dct[fix_key+str(count)] = fix_value.format(str(count))
26 | 
27 |         count += 1
28 |
29 | # dct = tuple(dct)
30 | # print(dct)
31 | # for i in dct:
32 | # print(i, dct[i])
33 | return dct
34 |
35 | CHAPTER_NAME = create()
36 |
37 |
38 | # login credentials
39 | # shanbay account and password login - not implemented
40 | # instead, manually copy the cookie information below
41 |
42 | HEADER = {
43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
44 | # 'Accept-Encoding':'gzip, deflate, br',
45 | # 'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.6',
46 | # 'Cache - Control': 'no-cache',
47 | # 'Connection': 'keep-alive',
48 | # 'Content-Length': '21',
49 | # 'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
50 | 'Cookie': """sessionid=".eJyrVopPLC3JiC8tTi2KT0pMzk7NS1GyUkrOz83Nz9MDS0FFi_WcE5MzUn3zU1JznKAKdZB1ZwI1mpgZmppYmtYCAJS6HyY:1dVqdm:lsde3ncYWWJnay2wgBTkLXjxcTk"; csrftoken=HQDY8b1bHbSyiJVD1RgG2qBTqwQL6VYw; _ga=GA1.2.1254111374.1497746577; __utmt=1; auth_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6Inp4NTc2IiwiZGV2aWNlIjowLCJpc19zdGFmZiI6ZmFsc2UsImlkIjo0NjE1NDk1LCJleHAiOjE1MDExNDkxMDB9.eo3vJ-ylhqCaGs3DKK0pV_ny8H8rq_NSHY3ei1Lfe70; userid=4615495; __utma=183787513.1254111374.1497746577.1500258334.1500285087.13; __utmb=183787513.5.10.1500285087; __utmc=183787513; __utmz=183787513.1499999350.5.2.utmcsr=baidu|utmccn=(organic)|utmcmd=organic""",
51 | # 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36',
52 | # 'X-Requested-With': 'XMLHttpRequest',
53 | # 'Host': 'www.shanbay.com',
54 | # 'Pragma': 'no-cache',
55 | # 'Origin': 'https://www.shanbay.com'
56 | }
57 |
58 |
59 | # id of the wordbook that was created
60 | WORKBOOKID = 187633
61 |
62 | # output file path for the chapter (workbook) ids
63 | WORKBOOK_PATH = 'workbook_id_test.txt'
64 |
65 |
66 |
67 |
--------------------------------------------------------------------------------
/shanbay/workbook_id.txt:
--------------------------------------------------------------------------------
1 | 569452
2 | 569455
3 | 569458
4 | 569461
5 | 569464
6 | 569467
7 | 569470
8 | 569473
9 | 569476
10 | 569479
11 | 569482
12 | 569485
13 | 569488
14 | 569491
15 | 569494
16 | 569497
17 | 569500
18 | 569503
19 | 569506
20 | 569509
21 | 569512
22 | 569515
23 | 569518
24 | 569521
25 | 569524
26 | 569527
27 | 569530
28 | 569533
29 | 569536
30 | 569539
31 | 569542
32 | 569545
33 | 569548
34 | 569551
35 | 569554
36 | 569557
37 | 569560
38 | 569563
39 | 569566
40 | 569569
41 | 569572
42 | 569575
43 | 569578
44 | 569581
45 |
--------------------------------------------------------------------------------
/shanbay/workbook_id_test.txt:
--------------------------------------------------------------------------------
1 | 541513
2 | 541522
3 | 541513
4 | 541522
5 | 541513
6 | 541522
7 |
--------------------------------------------------------------------------------
/spiders/README.md:
--------------------------------------------------------------------------------
1 | ## Spiders for downloading documents and web-page content
2 | 
3 | #### Files
4 | 
5 | - downloadPdf.py: downloads pdf files
6 | - github.py: downloads README documents from github
7 | - onlinedocs.py: downloads online documentation
8 | - stackoverflow.py: downloads selected topics from stackoverflow
9 | - utils.py: common crawler helpers shared by the scripts above (see the sketch below)
10 |
11 |
12 |
13 |
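
A minimal sketch of the interface the other scripts rely on (utils.py itself
is not shown in this dump; the two methods below are inferred from how the
scripts call it):

```python
import os
import requests

class Utils:

    def req(self, url):
        """GET a page and return its text, or None on failure."""
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            return None

    def checkpath(self, path):
        """Create the output directory if it does not exist yet."""
        if not os.path.exists(path):
            os.makedirs(path)
```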
--------------------------------------------------------------------------------
/spiders/downloadPdf.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # description
4 | # download pdf files: add the pdf urls to downlst and run this script
5 |
6 | import requests
7 | from spiders.utils import Utils
8 |
9 | PATH_DIR = 'download/'
10 | util = Utils()
11 |
12 | def download(url):
13 |
14 | util.checkpath(PATH_DIR)
15 |
16 | req = requests.get(url)
17 | c = req.content
18 | name = url.split('/')[-1]
19 | with open(PATH_DIR+name, 'wb')as f:
20 | f.write(c)
21 |
22 |
23 | downlst = [
24 | # 'http://files2.syncfusion.com/Downloads/Ebooks/SciPy_Programming_Succinctly.pdf',
25 | # 'https://docs.google.com/file/d/0B8IUCMSuNpl7MnpaQ3hhN2R0Z1k/edit'
26 | # 'http://stock.ethop.org/pdf/python/Learning%20Python,%205th%20Edition.pdf',
27 | # 'http://slav0nic.org.ua/static/books/python/OReilly%20-%20Core%20Python%20Programming.pdf',
28 | # ///////////
29 | # 'http://www.oreilly.com/programming/free/files/functional-programming-python.pdf',
30 | # 'https://doc.lagout.org/programmation/python/Python%20Pocket%20Reference_%20Python%20in%20Your%20Pocket%20%285th%20ed.%29%20%5BLutz%202014-02-09%5D.pdf',
31 | # 'http://www.oreilly.com/programming/free/files/a-whirlwind-tour-of-python.pdf',
32 | # 'http://www.oreilly.com/programming/free/files/20-python-libraries-you-arent-using-but-should.pdf',
33 | # 'http://www.oreilly.com/programming/free/files/hadoop-with-python.pdf',
34 | # 'http://www.oreilly.com/programming/free/files/how-to-make-mistakes-in-python.pdf',
35 | # 'http://www.oreilly.com/programming/free/files/functional-programming-python.pdf',
36 | # 'http://www.oreilly.com/programming/free/files/python-in-education.pdf',
37 | # 'http://www.oreilly.com/programming/free/files/from-future-import-python.pdf'
38 | # 'http://trickntip.com/wp-content/uploads/2017/01/Head-First-Python-ora-2011.pdf'
39 | # ''''''''''''''''
40 | # 'http://victoria.lviv.ua/html/fl5/NaturalLanguageProcessingWithPython.pdf',
41 | # 'http://www3.canisius.edu/~yany/python/Python4DataAnalysis.pdf',
42 | # 'ftp://ftp.micronet-rostov.ru/linux-support/books/programming/Python/[O%60Reilly]%20-%20Programming%20Python,%204th%20ed.%20-%20[Lutz]/[O%60Reilly]%20-%20Programming%20Python,%204th%20ed.%20-%20[Lutz].pdf
43 | # ..for
44 | # 'https://media.readthedocs.org/pdf/requests/latest/requests.pdf',
45 | # 'http://gsl.mit.edu/media/programs/nigeria-summer-2012/materials/python/django.pdf',
46 | # 'https://media.readthedocs.org/pdf/beautiful-soup-4/latest/beautiful-soup-4.pdf',
47 | # 'https://media.readthedocs.org/pdf/flask/0.7/flask.pdf',
48 |
49 | # 'https://media.readthedocs.org/pdf/jinja2/latest/jinja2.pdf',
50 | # 'http://lxml.de/3.4/lxmldoc-3.4.4.pdf',
51 | # 'https://docs.scipy.org/doc/numpy-1.11.0/numpy-ref-1.11.0.pdf',
52 | # 'https://pandas.pydata.org/pandas-docs/stable/pandas.pdf',
53 | # 'https://media.readthedocs.org/pdf/peewee/latest/peewee.pdf',
54 | # 'https://media.readthedocs.org/pdf/pillow/latest/pillow.pdf',
55 | # 'https://media.readthedocs.org/pdf/scrapy/1.0/scrapy.pdf',
56 | 'https://media.readthedocs.org/pdf/xlwt/latest/xlwt.pdf'
57 | # 'http://1.droppdf.com/files/X06AR/fluent-python-2015-.pdf',
58 | # 'http://files.meetup.com/18552511/Learn%20Python%20The%20Hard%20Way%203rd%20Edition%20V413HAV.pdf',
59 |
60 |
61 | ]
62 |
63 | if __name__ == '__main__':
64 |
65 | for l in downlst:
66 | download(l)
--------------------------------------------------------------------------------
/spiders/github.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.12
4 | # description
5 | # crawl the README text of most projects listed in awesome-python on github.com
6 | # crawls github resource collections, or individual repositories
7 |
8 | import bs4
9 | import re
10 |
11 | from spiders.utils import Utils
12 |
13 |
14 | PATH_DIR = 'github/'
15 | # add some github projects
16 | # return all project urls hosted on github.com
17 | # add github project/repository addresses here; all repository urls are returned
18 | class _Settings:
19 |
20 | def __init__(self):
21 |
22 | # github projects which contain many python directories
23 |         # resource collections
24 | self.projectsPool = [
25 | # 'https://github.com/vinta/awesome-python'
26 | ]
27 |         # independent repositories
28 |         # (standalone repos rather than collections)
29 | self.projectsUrl = [
30 | 'https://github.com/zx576/scancode_backend'
31 | ]
32 | # invoke general class
33 |         # crawler toolbox
34 | self.util = Utils()
35 |
36 |     # parse collection projects (like awesome-python)
37 |     # return all repository urls whose domain is github.com
38 |     # (filters out urls that point off-site)
39 | def _parse_pool(self):
40 |
41 | if not self.projectsPool:
42 | return []
43 |
44 | links = []
45 | for project in self.projectsPool:
46 | page = self.util.req(project)
47 | if not page:
48 | continue
49 | links += self._parse_html_get_links(page)
50 |
51 | return links
52 |
53 | # use bs4 parse html
54 | # return all links
55 | def _parse_html_get_links(self, page):
56 |
57 | soup = bs4.BeautifulSoup(page, 'lxml')
58 | soup_a = soup.find_all('a', href=re.compile('https://github.com/'))
59 | links = []
60 | for a in soup_a:
61 | links.append(a['href'])
62 |
63 | return links
64 |
65 |
66 | def parse(self):
67 |
68 | # deduplicate urls
69 | return list(set(self.projectsUrl+self._parse_pool()))
70 |
71 |
72 | # the spider itself
73 | class GitSpider:
74 |
75 | def __init__(self):
76 | self.links = _Settings().parse()
77 | self.util = Utils()
78 |
79 | def _get_words(self, url):
80 | text = self.util.req(url)
81 | if not text:
82 | return
83 |
84 | soup = bs4.BeautifulSoup(text, 'lxml')
85 | soup_article = soup.find('article')
86 |
87 | return soup_article.get_text(' ') if soup_article else None
88 |
89 |
90 | def _save(self, url, words):
91 |
92 | self.util.checkpath(PATH_DIR)
93 | if not words:
94 | return
95 | title = url.split('/')[-1]
96 | with open(PATH_DIR+'{}.txt'.format(title), 'w')as f:
97 | f.write(words)
98 |
99 | def start(self):
100 |
101 | if not self.links:
102 | return
103 |
104 | for url in self.links:
105 | words = self._get_words(url)
106 | self._save(url, words)
107 | print('successfully get {0} '.format(url))
108 |
109 |
110 | if __name__ == '__main__':
111 | gs = GitSpider()
112 | gs.start()
--------------------------------------------------------------------------------
/spiders/onlinedocs.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.13
4 | # description
5 | # download online books,
6 | # sources mostly from readthedocs.io
7 | # downloads all the text of a given online document
8 |
9 | import bs4
10 | import queue
11 |
12 | from spiders.utils import Utils
13 |
14 | PATH_DIR = 'docs/'
15 |
16 | class _Down:
17 |
18 | def __init__(self):
19 | self.util = Utils()
20 |
21 | def _save(self, title, words):
22 |
23 | self.util.checkpath(PATH_DIR)
24 | if not words:
25 | return
26 | with open(PATH_DIR+title, 'a+')as f:
27 | f.write(words)
28 |
29 |     # recursively crawl every link of the document
30 | def _download(self, qu, domain, title,switch=True):
31 | # print(title)
32 | if qu.empty():
33 | return
34 |
35 | url = qu.get()
36 | text = self.util.req(url)
37 |
38 | if not text:
39 | # qu.put(url)
40 | return self._download(qu,domain, title, False)
41 |
42 | if switch:
43 | res = self._download_links(domain, text)
44 | for i in res:
45 | qu.put(i)
46 |
47 | words = self._download_docs(text)
48 | self._save(title,words)
49 |
50 | return self._download(qu, domain, title,switch=False)
51 |
52 | def _download_docs(self, page):
53 |
54 | soup = bs4.BeautifulSoup(page, 'lxml')
55 | soup_body = soup.find('body')
56 | words = ''
57 | if soup_body:
58 | words += soup_body.get_text(' ')
59 |
60 | return words
61 |
62 | def _download_links(self, domain, page):
63 |
64 | lst = []
65 | soup = bs4.BeautifulSoup(page, 'lxml')
66 |         soup_link = soup.find_all('a', href=True)
67 | for link in soup_link:
68 | lst.append(domain+link['href'])
69 |
70 | return lst
71 |
72 | def download(self, url, domain, title):
73 | # title = 'Problem Solving with Algorithms and Data Structures using Python.pdf'
74 | qu = queue.Queue()
75 | qu.put(url)
76 |
77 | return self._download(qu, domain, title)
78 |
79 |
80 | class Pat1(_Down):
81 |
82 | def __init__(self):
83 |         # super().__init__() would do the same; _Down.__init__ only sets self.util
84 | self.util = Utils()
85 |         # target document info (pick one of the presets below)
86 | # self.url = 'https://interactivepython.org/courselib/static/pythonds/index.html'
87 | # self.domain = 'https://interactivepython.org/courselib/static/pythonds/'
88 | # self.title = 'Problem Solving with Algorithms and Data Structures using Python.txt'
89 | # self.url = 'http://chimera.labs.oreilly.com/books/1230000000393/index.html'
90 | # self.domain = 'http://chimera.labs.oreilly.com/books/1230000000393/'
91 | # self.title = 'Python Cookbook.txt'
92 | self.url = 'http://docs.peewee-orm.com/en/stable/'
93 | self.domain = self.url
94 | self.title = 'peewee.txt'
95 |
96 | def _download_links(self, domain, page):
97 | lst = []
98 | soup = bs4.BeautifulSoup(page, 'lxml')
99 | soup_li = soup.find_all('li', class_="toctree-l1")
100 | for li in soup_li:
101 | lst.append(domain + li.a['href'])
102 | res = list(set(lst))
103 | # print(len(res))
104 | return res
105 |
106 | def get(self):
107 |
108 | return self.download(self.url, self.domain, self.title)
109 |
110 | if __name__ == '__main__':
111 | p1 = Pat1()
112 | p1.get()
113 |
114 |
115 |
116 |
--------------------------------------------------------------------------------
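
Note that _download above calls itself once per queued page, so a table of
contents with more than roughly a thousand entries would hit CPython's
default recursion limit. The same traversal written as a plain loop, as a
sketch only (download_iter is a name invented here, not part of the repo):

    import queue

    def download_iter(down, url, domain, title):
        # roughly equivalent to _Down._download, but iterative
        qu = queue.Queue()
        qu.put(url)
        first = True  # plays the role of the switch flag
        while not qu.empty():
            page_url = qu.get()
            text = down.util.req(page_url)
            if not text:
                continue
            if first:
                # only the entry page is scanned for further links
                for link in down._download_links(domain, text):
                    qu.put(link)
                first = False
            down._save(title, down._download_docs(text))
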
/spiders/stackoverflow.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.12
4 | # description
5 | # crawl stackoverflow documentation topics
6 |
7 | import bs4
8 |
9 | from spiders.utils import Utils
10 |
11 | PATH_DIR = 'stack/'
12 |
13 | class _Settings():
14 |
15 | def __init__(self):
16 |
17 |         # set manually
18 | # topic links
19 | self.topic = [
20 | # python topic
21 |             # 'https://stackoverflow.com/documentation/python/topics',
22 | # 'https://stackoverflow.com/documentation/django/topics',
23 | # 'https://stackoverflow.com/documentation/algorithm/topics',
24 | 'https://stackoverflow.com/documentation/git/topics',
25 | # 'https://stackoverflow.com/documentation/design-patterns/topics',
26 | # 'https://stackoverflow.com/documentation/flask/topics'
27 | ]
28 | # question links
29 | self.res = []
30 | # =======================
31 |         # don't change anything below
32 | self.util = Utils()
33 | self.domain = 'https://stackoverflow.com'
34 |
35 |     # collect every documentation link under the configured topics
36 | def _parse_topic(self):
37 | if not self.topic:
38 | return
39 | for url in self.topic:
40 | self._add_url(url)
41 |
42 | def _add_url(self, url):
43 |
44 | page = self.util.req(url)
45 | if not page:
46 | return
47 | soup = bs4.BeautifulSoup(page, 'lxml')
48 | soup_a = soup.find_all('a', class_='doc-topic-link')
49 |         for a in soup_a:
50 |             last = a.get('href')
51 |             if last:
52 |                 self.res.append(self.domain + last)
53 |
54 | soup_next = soup.find('a', attrs={'rel': 'next'})
55 | # get next page
56 | if soup_next:
57 |
58 |             next_url = self.domain + soup_next['href']
59 | return self._add_url(next_url)
60 |
61 |
62 | def parse(self):
63 |
64 | self._parse_topic()
65 | return self.res
66 |
67 |
68 | class Stspider:
69 |
70 | def __init__(self):
71 | self.links = _Settings().parse()
72 | self.util = Utils()
73 |
74 |     # fetch all text content of a page
75 | def _get_words(self, url):
76 | page = self.util.req(url)
77 | if not page:
78 | return
79 | soup = bs4.BeautifulSoup(page, 'lxml')
80 | body = soup.find('body')
81 |         if not body:
82 |             return
83 |
84 |         words = body.get_text(' ')
85 |
86 | return words
87 |
88 |     # save the text content
89 | def _save(self, url, words):
90 |
91 | self.util.checkpath(PATH_DIR)
92 | if not words:
93 | return
94 | title = url.split('/')[-1]
95 |         with open(PATH_DIR + '{}.txt'.format(title), 'w') as f:
96 | f.write(words)
97 |
98 |     # run the spider
99 | def start(self):
100 |
101 | if not self.links:
102 | return
103 |
104 | for url in self.links:
105 | words = self._get_words(url)
106 | self._save(url, words)
107 |             print('successfully fetched {0}'.format(url))
108 |
109 |
110 | if __name__ == '__main__':
111 | st = Stspider()
112 | st.start()
113 |
114 |
115 |
--------------------------------------------------------------------------------
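
The rel="next" pagination in _add_url above is also recursive. The same idea
written as a loop, as a sketch that reuses the Utils.req helper from
spiders/utils.py (collect_topic_links is an illustrative name, not part of
the repo):

    import bs4

    from spiders.utils import Utils

    def collect_topic_links(start_url, domain='https://stackoverflow.com'):
        # walk a paginated topic listing and gather every doc-topic link
        util = Utils()
        res = []
        url = start_url
        while url:
            page = util.req(url)
            if not page:
                break
            soup = bs4.BeautifulSoup(page, 'lxml')
            for a in soup.find_all('a', class_='doc-topic-link'):
                href = a.get('href')
                if href:
                    res.append(domain + href)
            nxt = soup.find('a', attrs={'rel': 'next'})
            url = domain + nxt['href'] if nxt else None
        return res
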
/spiders/utils.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.12
4 |
5 | # description
6 | # shared helper methods; currently just the page-request helper
7 |
8 | import requests
9 | import os
10 |
11 | class Utils:
12 |
13 | def __init__(self):
14 | # self.pr = _ProxyList()
15 | self.header = {'User-Agent': 'Mozilla/5.0 (Macintosh; U;'
16 | ' Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
17 |
18 | def _req_url(self, url, headers, proxies):
19 |
20 | try:
21 | req = requests.get(url, headers=headers, proxies=proxies, timeout=2)
22 | req.raise_for_status()
23 | return req.text
24 |
25 |         except requests.RequestException:
26 |             return None
27 |
28 | def req(self, url, error=0):
29 |
30 | if error == 5:
31 |             print('failed to fetch page {0}'.format(url))
32 | return None
33 |
34 | # proxies = self.pr.get_proxy()
35 | proxies = None
36 | return self._req_url(url, headers=self.header, proxies=proxies) or self.req(url, error=error + 1)
37 |
38 | def checkpath(self, path):
39 |
40 | created = os.path.exists(path)
41 | if not created:
42 | os.mkdir(path)
43 |
--------------------------------------------------------------------------------
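
Utils.req retries recursively: each failed fetch calls req again with
error + 1, so a URL is attempted at most five times before giving up. For
comparison, the same contract as a flat loop (a sketch; fetch_with_retry is
an illustrative name, not part of the repo):

    import requests

    def fetch_with_retry(url, headers=None, attempts=5, timeout=2):
        # try up to `attempts` times; return the page body, or None on failure
        for _ in range(attempts):
            try:
                resp = requests.get(url, headers=headers, timeout=timeout)
                resp.raise_for_status()
                return resp.text
            except requests.RequestException:
                continue
        print('failed to fetch page {0}'.format(url))
        return None
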
/translate.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author=zhouxin
3 | # date=2017.7.10
4 | # description
5 | # call translation APIs to translate the words stored in the database
6 |
7 | import requests
8 | import time
9 |
10 | from models_exp import NewWord
11 |
12 |
13 | class Translate:
14 |
15 | def __init__(self):
16 | # self.util = Utils()
17 | pass
18 |
19 |     # translation API: translate an English word into Chinese
20 |     # and return the translation result
21 |     # (Baidu translate endpoint)
22 | def _trans(self, word):
23 | # res = self.trans.translate('hello', dest='zh-CN')
24 | url = 'http://fanyi.baidu.com/sug'
25 | dct = {'kw': word}
26 | req = requests.post(url, dct)
27 | req.raise_for_status()
28 | res = req.json().get('data')
29 | if not res:
30 | return None
31 | return res[0].get('v', None)
32 |
33 |     # iciba (金山词霸) API
34 |     # the Baidu API does not include phonetic symbols, so use this one instead
35 | def _trans_ici(self, word):
36 |
37 | url = 'http://www.iciba.com/index.php?a=getWordMean&c=search&word=' + word
38 | try:
39 | req = requests.get(url)
40 | req.raise_for_status()
41 | info = req.json()
42 | data = info['baesInfo']['symbols'][0]
43 | assert info['baesInfo']['symbols'][0]
44 |             # skip words without phonetic symbols
45 | assert data['ph_am'] and data['ph_en']
46 |             # skip words without a part of speech
47 | assert data['parts'][0]['part']
48 |
49 |         except Exception:  # any request failure or malformed payload
50 |             return
51 |
52 |         ph_en = '英 [' + data['ph_en'] + ']'  # UK pronunciation
53 |         ph_am = '美 [' + data['ph_am'] + ']'  # US pronunciation
54 | ex = ''
55 | for part in data['parts']:
56 | ex += part['part'] + ';'.join(part['means']) + ';'
57 |
58 | return ph_en+ph_am, ex
59 |
60 |     # Shanbay (扇贝单词) API
61 | def _trans_shanbay(self, word):
62 | url = 'https://api.shanbay.com/bdc/search/?word=' + word
63 | req = requests.get(url)
64 | print(req.json())
65 |
66 |
67 |     # use the iciba (金山) translation API:
68 |     # the Baidu API lacks phonetic symbols, and the Shanbay API
69 |     # returns less information than the other two
70 | def trans(self):
71 |
72 | query = NewWord.select().where(NewWord.explanation == '')
73 | # print(len(query))
74 | if not query:
75 | return
76 | for word in query:
77 |
78 | res = self._trans_ici(word.name)
79 | # print(res)
80 | if res:
81 | word.phonogram = res[0]
82 |                 # res is (phonogram, explanation)
83 | word.explanation = res[1]
84 |
85 | else:
86 | word.is_valid = False
87 | word.save()
88 | # print('suc save word : {}'.format(word.name))
89 | time.sleep(1)
90 |
91 |
92 | if __name__ == '__main__':
93 |
94 | t = Translate()
95 | # res = t._trans_shanbay('hello')
96 | # print(res)
97 | t.trans()
98 | # res = t._trans_ici('hello')
99 | # print(res)
--------------------------------------------------------------------------------
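
For reference, _trans_ici expects a JSON payload shaped like the sample
below. The field names come straight from the code above ('baesInfo' is the
API's own spelling); the values are invented for illustration:

    sample = {
        'baesInfo': {
            'symbols': [{
                'ph_en': 'həˈləʊ',  # UK phonetic symbols
                'ph_am': 'heˈloʊ',  # US phonetic symbols
                'parts': [{'part': 'int.', 'means': ['你好', '喂']}],
            }]
        }
    }

    data = sample['baesInfo']['symbols'][0]
    phonogram = '英 [' + data['ph_en'] + ']' + '美 [' + data['ph_am'] + ']'
    explanation = ''.join(p['part'] + ';'.join(p['means']) + ';' for p in data['parts'])
    print(phonogram)    # 英 [həˈləʊ]美 [heˈloʊ]
    print(explanation)  # int.你好;喂;
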
/voca.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zx576/programming_vocabulary/35de38621c03c1385f59008bb8296f67eb35bdf8/voca.db
--------------------------------------------------------------------------------
/work.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # author = zhouxin
3 | # date = 2017.7.11
4 | # description
5 | # project entry point
6 |
7 | import os
8 |
9 | from settings import DIRS, FILES, DATABASE
10 | from analysis_book import AnlysisBook
11 | from models_exp import new_db, NewWord, NewBook
12 |
13 | # collect all file paths to analyse
14 | class ParseFile:
15 |
16 |     # walk the directories listed in settings;
17 |     # only .txt files are added to the pending list
18 | def _parse_dirs(self, dirs):
19 |
20 | assert isinstance(dirs, list), 'type(dirs) should be list '
21 | if not dirs:
22 | return dirs
23 |
24 | files = []
25 | for path in dirs:
26 | if not os.path.isdir(path):
27 | continue
28 |             for pathname, _dirnames, filenames in os.walk(path):
29 | for filename in filenames:
30 |                     # keep .txt files only
31 | if '.txt' in filename:
32 | file_path = pathname + os.sep + filename
33 | files.append(file_path)
34 |
35 | return files
36 |
37 |     # validate the individual file paths listed in settings
38 | def _parse_files(self, files):
39 |
40 | assert isinstance(files, list), 'type(files) should be list '
41 | f = []
42 | for path in files:
43 | if not os.path.isfile(path) or '.txt' not in path:
44 | continue
45 | f.append(path)
46 |
47 | return f
48 |
49 | def parse(self, dirs, files):
50 | # print(dirs, files)
51 | f1 = self._parse_dirs(dirs)
52 | f2 = self._parse_files(files)
53 |
54 | return f1 + f2
55 |
56 |
57 | # create the database
58 | class Dt:
59 |
60 | def __init__(self):
61 | self.build()
62 |
63 | def build(self):
64 |
65 | created = os.path.exists(DATABASE)
66 |
67 | if not created:
68 | new_db.connect()
69 | new_db.create_tables([NewBook, NewWord])
70 |
71 |
72 | if __name__ == '__main__':
73 |
74 |     # create the tables
75 | dt = Dt()
76 |     # resolve the file paths
77 | s = ParseFile()
78 | res = s.parse(DIRS, FILES)
79 | # print(len(res))
80 | # extract words from books
81 | ana = AnlysisBook()
82 | ana.analysis(res)
83 | # print(res)
84 |
85 |
86 |
--------------------------------------------------------------------------------
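
work.py imports DIRS, FILES and DATABASE from a settings.py module that is
not shown here. A minimal example of what it might contain (illustrative
values only; DIRS can point at the stack/ and docs/ folders the spiders
write to):

    # settings.py (illustrative values only)
    DIRS = ['stack/', 'docs/']   # directories walked recursively for .txt files
    FILES = []                   # extra standalone .txt files to analyse
    DATABASE = 'voca.db'         # path of the peewee SQLite database file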