├── LICENSE.txt
├── NOTICE.txt
└── README.md


/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Nextword-data is licensed under a Creative Commons Attribution 4.0 International
2 | License (https://creativecommons.org/licenses/by/4.0/).
3 | 


--------------------------------------------------------------------------------
/NOTICE.txt:
--------------------------------------------------------------------------------
 1 | Nextword-data
 2 | Copyright 2019 high-moctane
 3 | 
 4 | Nextword-data is remixed data from Google Books Ngram Viewer English Version
 5 | 20120701.
 6 | 
 7 | ------------------------------------------------------------
 8 | Google Ngram Viewer English Version 20120701
 9 | ------------------------------------------------------------
10 | 
11 | Google Ngram Viewer English Version 20120701 provided by Google LLC.
12 | URL: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
13 | License: Creative Commons Attribution 3.0 Unported
14 | (http://creativecommons.org/licenses/by/3.0/)
15 | The data was modified with the nwgen toolset (https://github.com/high-moctane/nwgen).
16 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Nextword-data
  2 | 
  3 | ## **🎉 NEW PREDICTION ENGINE [MOCWORD](https://github.com/high-moctane/mocword) IS AVAILABLE 🎉**
  4 | 
  5 | [Mocword](https://github.com/high-moctane/mocword) is a more advanced engine than Nextword.
  6 | 
  7 | - Smaller data files
  8 |   - 1.63GB (Nextword) -> 655MB (Mocword)
  9 | - Uses the latest Google Ngram dataset
 10 |   - 2012 data (Nextword) -> 2020 data (Mocword)
 11 | - More appropriate predictions
 12 | - Less noisy vocabulary
 13 | 
 14 | ---
 15 | 
 16 | A dataset for nextword.
 17 | 
 18 | ## Install
 19 | 
 20 | 0. (Recommended) Star this repository (｀･ω･´)★
 21 | 
 22 | 1. Visit the [releases](https://github.com/high-moctane/nextword-data/releases) page.
 23 | 
 24 | 2. Download the `zip` or `tar.gz` archive.
 25 | 
 26 |    You can choose either the larger or the smaller dataset.
 27 | 
 28 |    |       | Zip size | Total size |
 29 |    | ----- | -------: | ---------: |
 30 |    | Small | 152.2 MB |   493.1 MB |
 31 |    | Large | 483.3 MB |    1.63 GB |
 32 | 
 33 | 3. Decompress the downloaded archive.
 34 | 
 35 | 4. Set the `$NEXTWORD_DATA_PATH` environment variable to the decompressed directory.
 36 | 
 37 |    Example:
 38 | 
 39 |    ```bash
 40 |    export NEXTWORD_DATA_PATH=/path/to/nextword-data
 41 |    ```
 42 | 
 43 | ## Uninstall
 44 | 
 45 | 1. Remove the `$NEXTWORD_DATA_PATH` environment variable.
 46 | 
 47 | 2. Remove the nextword-data directory.
 48 | 
 49 | ## Format
 50 | 
 51 | ```
 52 | (n-1)gram tab candidates newline
 53 | ```
 54 | 
 55 | Candidates are separated by spaces and sorted from the most likely to the least likely.
 56 | 
 57 | ### Example
 58 | 
 59 | You can find the line
 60 | 
 61 | ```
 62 | empty milk	bottles carton bottle cartons cans
 63 | ```
 64 | 
 65 | at line 59349 in file `3gram-e.txt`.
 66 | 
 67 | This line says that "bottles" is the most likely word after "empty milk",
 68 | and "carton" is the next most likely.
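
To make the format concrete, here is a minimal Python sketch that splits one data line into its (n-1)-gram prefix and its candidates. The function names `parse_line` and `lookup` are hypothetical and not part of nextword or nwgen; the sketch only assumes what the example line shows, namely that prefix words and candidates are each separated by single spaces.

```python
from typing import Iterable, List, Sequence, Tuple


def parse_line(line: str) -> Tuple[Tuple[str, ...], List[str]]:
    """Split "(n-1)gram<TAB>candidates<NEWLINE>" into (prefix words, candidates)."""
    # Everything before the first tab is the prefix, everything after is the candidates.
    prefix, _, candidates = line.rstrip("\n").partition("\t")
    # Assumption: words on both sides of the tab are space-separated.
    return tuple(prefix.split()), candidates.split()


def lookup(lines: Iterable[str], words: Sequence[str]) -> List[str]:
    """Return the candidates for the given prefix words, most likely first."""
    target = tuple(words)
    for line in lines:
        prefix, candidates = parse_line(line)
        if prefix == target:
            return candidates
    return []  # prefix not present in the given lines


if __name__ == "__main__":
    sample = ["empty milk\tbottles carton bottle cartons cans\n"]
    print(lookup(sample, ["empty", "milk"])[:2])
```

A linear scan is used only to keep the sketch short; it says nothing about how nextword itself searches the files.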
 69 | 
 70 | ## Recipe
 71 | 
 72 | 1. Fetch data.
 73 | 
 74 |    ```
 75 |    $ mkdir fetch
 76 |    $ nwgen-fetch fetch
 77 |    ```
 78 | 
 79 | 2. Run the following xonsh script.
 80 | 
 81 |    ```xonsh
 82 |    dstdir = "dstdir"
 83 |    mkdir -p @(dstdir)/format
 84 |    mkdir -p @(dstdir)/concat
 85 | 
 86 |    ls fetch | grep 1gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 10000 @(dstdir)/format/fname fetch/fname
 87 | 
 88 |    ls fetch | grep 2gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 2000 @(dstdir)/format/fname fetch/fname
 89 | 
 90 |    ls fetch | grep 3gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 400 @(dstdir)/format/fname fetch/fname
 91 | 
 92 |    ls fetch | grep 4gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 300 @(dstdir)/format/fname fetch/fname
 93 | 
 94 |    ls fetch | grep 5gram | xargs -P 8 -I fname nwgen-format -max-candidates 10000 -min-count 200 @(dstdir)/format/fname fetch/fname
 95 | 
 96 |    nwgen-concat @(dstdir)/concat/1gram.txt.gz @(dstdir)/format/1gram*
 97 | 
 98 |    for n in [2,3,4,5]:
 99 |        for c in [chr(i) for i in range(97, 97+26)]:
100 |            nwgen-concat @(dstdir)/concat/@(n)gram-@(c).txt.gz @(dstdir)/format/@(n)gram-@(c)*
101 | 
102 |    cp -R @(dstdir)/concat @(dstdir)/data
103 | 
104 |    gunzip @(dstdir)/data/*
105 |    ```
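
The script above writes one file for 1-grams and, for n from 2 to 5, one file per initial letter a to z. The following sketch shows how a prefix could be mapped to the file that should contain it. It is an illustration only: `data_file_for` is a hypothetical helper, and the rule that the file is selected by the first letter of the prefix's first word is an assumption inferred from the `3gram-e.txt` example, not something the recipe states.

```python
import string
from pathlib import Path
from typing import Optional, Sequence


def data_file_for(prefix_words: Sequence[str], data_dir: str) -> Optional[Path]:
    """Guess which data file holds the candidates for the given prefix words."""
    n = len(prefix_words) + 1  # an (n-1)-word prefix lives in the n-gram files
    if n == 1:
        return Path(data_dir) / "1gram.txt"
    if n > 5:
        return None  # the recipe only builds 1-gram to 5-gram files
    first = prefix_words[0][:1]
    # The recipe only names files for the letters a to z; how other initials
    # are handled is not stated, so this sketch gives up on them.
    if len(first) != 1 or first not in string.ascii_lowercase:
        return None
    return Path(data_dir) / f"{n}gram-{first}.txt"


if __name__ == "__main__":
    print(data_file_for(["empty", "milk"], "/path/to/nextword-data"))
```

In practice `data_dir` would be the value of `$NEXTWORD_DATA_PATH`.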
106 | 
107 | ## Notice
108 | 
109 | Nextword-data is based on
110 | [Google Books Ngram Viewer English Version 20120701](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
111 | which is distributed under a [Creative Commons Attribution 3.0 Unported](http://creativecommons.org/licenses/by/3.0/) license.
112 | See [NOTICE.txt](NOTICE.txt).
113 | 
114 | ## License
115 | 
116 | Nextword-data is distributed under a [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license.
117 | See [LICENSE.txt](LICENSE.txt).
118 | 


--------------------------------------------------------------------------------