├── LICENSE.txt
├── NOTICE.txt
└── README.md


/LICENSE.txt:
--------------------------------------------------------------------------------
Nextword-data is licensed under a Creative Commons Attribution 4.0 International
License (https://creativecommons.org/licenses/by/4.0/).

--------------------------------------------------------------------------------
/NOTICE.txt:
--------------------------------------------------------------------------------
Nextword-data
Copyright 2019 high-moctane

Nextword-data is a remixed data from Google Books Ngram Viewer English Version
20120701.

------------------------------------------------------------
Google Ngram Viewer English Version 20120701
------------------------------------------------------------

Google Ngram Viewer English Version 20120701 provided by Google LLC.
URL: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
License: Creative Commons Attribution 3.0 Unported
(http://creativecommons.org/licenses/by/3.0/)
The data is modified by nwgen toolset (https://github.com/high-moctane/nwgen).

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Nextword-data

## **🎉 NEW PREDICTION ENGINE [MOCWORD](https://github.com/high-moctane/mocword) IS AVAILABLE 🎉**

[Mocword](https://github.com/high-moctane/mocword) is a more advanced engine than Nextword.

- Smaller data files
  - 1.63 GB (Nextword) -> 655 MB (Mocword)
- Uses the latest Google Ngram dataset
  - 2012 data (Nextword) -> 2020 data (Mocword)
- More appropriate predictions
  - Less noisy vocabulary

---

A dataset for nextword.

## Install

0. (Recommended) Star this repository (`・ω・´)★

1. Visit the [releases](https://github.com/high-moctane/nextword-data/releases) page.

2. Download the `zip` or `tar.gz` archive.

   You can choose either the small or the large dataset.

   |       | Zip size | Total size |
   | ----- | -------: | ---------: |
   | Small | 152.2 MB |   493.1 MB |
   | Large | 483.3 MB |    1.63 GB |

3. Decompress the downloaded data.

4. Set the `$NEXTWORD_DATA_PATH` environment variable.

   Example:

   ```bash
   export NEXTWORD_DATA_PATH=/path/to/nextword-data
   ```

## Uninstall

1. Unset the `$NEXTWORD_DATA_PATH` environment variable.

2. Remove the nextword-data directory.

## Format

```
(n-1)gram tab candidates newline
```

Each line consists of an (n-1)-gram, a tab character, and the candidate next words.
Candidates are sorted by likelihood, most likely first.

### Example

You can find the line

```
empty milk	bottles carton bottle cartons cans
```

at line 59349 in file `3gram-e.txt`.

This line says that "bottles" is the most likely word after "empty milk",
and "carton" is the next most likely.

## Recipe

1. Fetch the data.

   ```
   $ mkdir fetch
   $ nwgen-fetch fetch
   ```

2. Run the xonsh script.

   ```xonsh
   dstdir = "dstdir"
   mkdir -p @(dstdir)/format
   mkdir -p @(dstdir)/concat

   # Format each fetched n-gram file; longer n-grams use a lower -min-count.
   ls fetch | grep 1gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 10000 @(dstdir)/format/fname fetch/fname

   ls fetch | grep 2gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 2000 @(dstdir)/format/fname fetch/fname

   ls fetch | grep 3gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 400 @(dstdir)/format/fname fetch/fname

   ls fetch | grep 4gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 300 @(dstdir)/format/fname fetch/fname

   ls fetch | grep 5gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 200 @(dstdir)/format/fname fetch/fname

   # Concatenate the formatted 1-gram files into a single archive.
   nwgen-concat @(dstdir)/concat/1gram.txt.gz @(dstdir)/format/1gram*

   # Concatenate the 2- to 5-gram files, one archive per n and per letter a-z.
   for n in [2,3,4,5]:
       for c in [chr(i) for i in range(97, 97+26)]:
           nwgen-concat @(dstdir)/concat/@(n)gram-@(c).txt.gz @(dstdir)/format/@(n)gram-@(c)*

   # Copy the concatenated archives and decompress the copies.
   cp -R @(dstdir)/concat @(dstdir)/data

   gunzip @(dstdir)/data/*
   ```

## Notice

Nextword-data is based on
[Google Books Ngram Viewer English Version 20120701](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html),
which is distributed under a [Creative Commons Attribution 3.0 Unported](http://creativecommons.org/licenses/by/3.0/) license.
See [NOTICE.txt](NOTICE.txt).

## License

Nextword-data is distributed under a [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license.
See [LICENSE.txt](LICENSE.txt).

--------------------------------------------------------------------------------
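
As a minimal sketch of the line format described in the README's Format section, the snippet below splits one data line into its (n-1)-gram and its candidate list. The function name `parse_line` is hypothetical and not part of nextword or nwgen, and it assumes that candidates are separated by single spaces, as the `3gram-e.txt` example suggests.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split one data line into its (n-1)-gram prefix and its candidates.

    Hypothetical helper, not part of nextword or nwgen.
    """
    # The README gives the layout as "(n-1)gram tab candidates newline".
    # Assumption: candidates are separated from each other by spaces.
    prefix, _, rest = line.rstrip("\n").partition("\t")
    return prefix, rest.split()


prefix, candidates = parse_line("empty milk\tbottles carton bottle cartons cans\n")
print(prefix)         # empty milk
print(candidates[0])  # bottles
```

Because the candidates are stored most likely first, the first element of the list is the top prediction for the given prefix.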