├── .gitignore
├── DEPENDENCIES.md
├── LICENSE
├── README.md
├── data
│   ├── cbow.bin
│   ├── cbow.sh
│   ├── cbow.txt
│   ├── mtgvocab.json
│   └── output.txt
├── decode.py
├── encode.py
├── lib
│   ├── cardlib.py
│   ├── cbow.py
│   ├── config.py
│   ├── datalib.py
│   ├── html_extra_data.py
│   ├── jdecode.py
│   ├── manalib.py
│   ├── namediff.py
│   ├── nltk_model.py
│   ├── nltk_model_api.py
│   ├── transforms.py
│   └── utils.py
├── mtg_sweep1.ipynb
└── scripts
    ├── analysis.py
    ├── autosample.py
    ├── collect_checkpoints.py
    ├── distances.py
    ├── keydiff.py
    ├── mtg_validate.py
    ├── ngrams.py
    ├── pairing.py
    ├── sanity.py
    ├── sortcards.py
    ├── streamcards.py
    ├── sum.py
    └── summarize.py
/.gitignore:
--------------------------------------------------------------------------------
1 | *~
2 | *.pyc
3 | AllSets.json
4 | AllSets-x.json
5 | lib/__init__.py
6 |
--------------------------------------------------------------------------------
/DEPENDENCIES.md:
--------------------------------------------------------------------------------
1 | Dependencies
2 | ======
3 |
4 | ## mtgjson
5 |
6 | First, you'll need the json corpus of Magic the Gathering cards, which can be found at:
7 |
8 | http://mtgjson.com/
9 |
10 | You probably want the file AllSets.json, which you should also be able to download here:
11 |
12 | http://mtgjson.com/json/AllSets.json
13 |
14 | ## Python packages
15 |
16 | mtgencode uses a few additional Python packages which you should be able to install with pip, Python's package manager. They aren't mission critical, but they provide better capitalization of names and text in human-readable output formats. If they aren't installed, mtgencode will silently fall back to less effective workarounds.
17 |
18 | On Ubuntu, you should be able to install the necessary packages with:
19 |
20 | ```
21 | sudo apt-get install python-pip
22 | sudo pip install titlecase
23 | sudo pip install nltk
24 | ```
25 |
26 | nltk requires some additional data files to work, so you'll also have to do:
27 |
28 | ```
29 | mkdir ~/nltk_data
30 | cd ~/nltk_data
31 | python -c "import nltk; nltk.download('punkt')"
32 | cd -
33 | ```
34 |
35 | You don't have to put the files in ~/nltk_data; that's just one of the places nltk will look automatically. If you try to run decode.py with nltk installed but without the additional data files, the error message is pretty helpful.
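
If you want to check that the data files are visible before running decode.py, here's a quick sketch (nltk.data.find raises LookupError when a resource is missing):

```
import nltk

try:
    nltk.data.find('tokenizers/punkt')
    print 'punkt data found'
except LookupError:
    print 'punkt data missing; see the download step above'
```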
36 |
37 | mtgencode can also use numpy to speed up some of the long calculations required to generate the creativity statistics comparing similarity of generated and existing cards. You can install numpy with:
38 |
39 | ```
40 | sudo apt-get install python-dev python-pip
41 | sudo pip install numpy
42 | ```
43 |
44 | This will launch an absolutely massive compilation process for all of the numpy C sources. Go get a cup of coffee, and if it fails, consult Google. You'll probably need at least GCC installed; I'm not sure what else.
45 |
46 | Some additional packages will be needed for multithreading, but that doesn't work yet, so no worries.
47 |
48 | ## word2vec
49 |
50 | The creativity analysis is done using vector models produced by this tool:
51 |
52 | https://code.google.com/p/word2vec/
53 |
54 | You can install it pretty easily with subversion:
55 |
56 | ```
57 | sudo apt-get install subversion
58 | mkdir ~/word2vec
59 | cd ~/word2vec
60 | svn checkout http://word2vec.googlecode.com/svn/trunk/
61 | cd trunk
62 | make
63 | ```
64 |
65 | That should create some files, among them a binary called word2vec. Add this to your path somehow, and you'll be able to invoke cbow.sh from within the data/ subdirectory to recompile the vector model (cbow.bin) from whatever text representation was last produced (cbow.txt).
66 |
67 | ## Rebuilding the data files
68 |
69 | The standard procedure to produce the derived data files from AllSets.json is the following:
70 |
71 | ```
72 | ./encode.py -v data/AllSets.json data/output.txt
73 | ./encode.py -v data/output.txt data/cbow.txt -s -e vec
74 | cd data
75 | ./cbow.sh
76 | ```
77 |
78 | This of course assumes that you have AllSets.json in data/, and that you start from the root of the repo, in the same directory as encode.py.
79 |
80 | ## Magic Set Editor 2
81 |
82 | MSE2 is a tool for creating and viewing custom magic cards:
83 |
84 | http://magicseteditor.sourceforge.net/
85 |
86 | Set files, with the extension .mse-set, can be produced by decode.py using the -mse option and then viewed in MSE2.
87 |
88 | Unfortunately, getting MSE2 to run on Linux can be tricky. Both Wine 1.6 and 1.7 have been reported to work on Ubuntu; instructions for 1.7 can be found here:
89 |
90 | https://www.winehq.org/download/ubuntu
91 |
92 | To install MSE with Wine, download the standard Windows installer and open it with Wine. Everything should just work. You will need some additional card styles:
93 |
94 | http://sourceforge.net/projects/msetemps/files/Magic%20-%20Recently%20Printed%20Styles.mse-installer/download
95 |
96 | And possibly this:
97 |
98 | http://sourceforge.net/projects/msetemps/files/Magic%20-%20M15%20Extra.mse-installer/download
99 |
100 | Once MSE2 is installed with Wine, you should be able to just click on the template installers and MSE2 will know what to do with them.
101 |
102 | Some additional system fonts are required, specifically Beleren Bold, Beleren Small Caps Bold, and Relay Medium. Those can be found here:
103 |
104 | http://www.slightlymagic.net/forum/viewtopic.php?f=15&t=14730
105 |
106 | http://www.azfonts.net/download/relay-medium/ttf.html
107 |
108 | Open them in Font Viewer and click install; you might then have to clear the caches so MSE2 can see them:
109 |
110 | ```
111 | sudo fc-cache -fv
112 | ```
113 |
114 | If you're running a Linux distro other than Ubuntu, then a similar procedure will probably work. If you're on Windows, then it should work fine as is without messing around with Wine. You'll still need the additional styles.
115 |
116 | I tried to build MSE2 from source on 64-bit Ubuntu. After hacking up some of the files, I did get a working binary, but I was unable to set up the data files it needs in such a way that I could actually open a set. If you manage to get this to work, please explain how, and I will be very grateful.
117 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2015 Bill Zorn
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy
4 | of this software and associated documentation files (the "Software"), to deal
5 | in the Software without restriction, including without limitation the rights
6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7 | copies of the Software, and to permit persons to whom the Software is
8 | furnished to do so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in
11 | all copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19 | THE SOFTWARE.
20 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # mtgencode
2 |
3 | Utilities to assist in the process of generating Magic the Gathering cards with neural nets. Inspired by this thread on the mtgsalvation forums:
4 |
5 | http://www.mtgsalvation.com/forums/creativity/custom-card-creation/612057-generating-magic-cards-using-deep-recurrent-neural
6 |
7 | The purpose of this code is mostly to wrangle text between various human and machine readable formats. The original input comes from [mtgjson](http://mtgjson.com); this is filtered and reduced to one of several input formats intended for neural network training, such as the standard encoded format used in [data/output.txt](https://github.com/billzorn/mtgencode/blob/master/data/output.txt). Any json or encoded data, including output from appropriately trained neural nets, can then be interpreted as cards and decoded to a human readable format, such as a text spoiler, [Magic Set Editor 2](http://magicseteditor.sourceforge.net) set file, or a pretty, portable html file that can be viewed in any browser.
8 |
9 | ## Requirements
10 |
11 | I'm running this code on Ubuntu 14.04 with Python 2.7. Unfortunately it does not work with Python 3, though apparently it isn't too hard to use 2to3 to automatically convert it.
12 |
13 | For the most part it should work out of the box, though there are a few optional bonus features that will make it much better. See [DEPENDENCIES.md](https://github.com/billzorn/mtgencode/blob/master/DEPENDENCIES.md#dependencies).
14 |
15 | This code does not have anything to do with neural nets; if you want to generate cards with them, see the [tutorial](https://github.com/billzorn/mtgencode#tutorial).
16 |
17 | ## Usage
18 |
19 | Functionality is provided by two main driver scripts: encode.py and decode.py. Logically, encode.py handles encoding to formats intended to feed into a neural network, while decode.py handles decoding to formats intended to be read by a human.
20 |
21 | ### encode.py
22 |
23 | ```
24 | usage: encode.py [-h] [-e {std,named,noname,rfields,old,norarity,vec,custom}]
25 | [-r] [--nolinetrans] [--nolabel] [-s] [-v]
26 | infile [outfile]
27 |
28 | positional arguments:
29 | infile encoded card file or json corpus to encode
30 | outfile output file, defaults to stdout
31 |
32 | optional arguments:
33 | -h, --help show this help message and exit
34 | -e {std,named,noname,rfields,old,norarity,vec,custom}, --encoding {std,named,noname,rfields,old,norarity,vec,custom}
35 | encoding format to use
36 | -r, --randomize randomize the order of symbols in mana costs
37 | --nolinetrans don't reorder lines of card text
38 | --nolabel don't label fields
39 | -s, --stable don't randomize the order of the cards
40 | -v, --verbose verbose output
41 | ```
42 |
43 | The supported encodings are:
44 |
45 | Argument | Description
46 | -----------|------------
47 | std | Standard format: `|type|supertype|subtype|loyalty|pt|text|cost|rarity|name|`.
48 | named | Name first: `|name|type|supertype|subtype|loyalty|pt|text|cost|rarity|`.
49 | noname | No name field at all: `|type|supertype|subtype|loyalty|pt|text|cost|rarity|`.
50 | rfields | Randomize the order of the fields, using only the label to distinguish which field is which.
51 | old | Legacy format: `|name|supertype|type|loyalty|subtype|rarity|pt|cost|text|`. No field labels.
52 | norarity | Older legacy format: `|name|supertype|type|loyalty|subtype|pt|cost|text|`. No field labels.
53 | vec | Produce a content vector for each card; used with [word2vec](https://code.google.com/p/word2vec/).
54 | custom     | Blank format slot, intended to help users add their own formats to the python source.
55 |
56 | ### decode.py
57 |
58 | ```
59 | usage: decode.py [-h] [-e {std,named,noname,rfields,old,norarity,vec,custom}]
60 | [-g] [-f] [-c] [-d] [-v] [-mse] [-html]
61 | infile [outfile]
62 |
63 | positional arguments:
64 | infile encoded card file or json corpus to encode
65 | outfile output file, defaults to stdout
66 |
67 | optional arguments:
68 | -h, --help show this help message and exit
69 | -e {std,named,noname,rfields,old,norarity,vec,custom}, --encoding {std,named,noname,rfields,old,norarity,vec,custom}
70 | encoding format to use
71 | -g, --gatherer emulate Gatherer visual spoiler
72 | -f, --forum use pretty mana encoding for mtgsalvation forum
73 | -c, --creativity use CBOW fuzzy matching to check creativity of cards
74 | -d, --dump dump out lots of information about invalid cards
75 | -v, --verbose verbose output
76 | -mse, --mse use Magic Set Editor 2 encoding; will output as .mse-
77 | set file
78 | -html, --html create a .html file with pretty forum formatting
79 | ```
80 |
81 | The default output is a text spoiler which modifies the output of the neural net as little as possible while making it human readable. Specifying the -g option will produce a prettier, Gatherer-inspired text spoiler with heavier-weight transformations applied to the text, such as capitalization. The -f option encodes mana symbols in the format used by the mtgsalvation forum; this is useful if you want to cut and paste your spoiler into a post to share it.
82 |
83 | Passing the -mse option will cause decode.py to produce both the hilarious internal MSE text format as well as an actual mse set file, which is really just a renamed zip archive. The -f and -g flags will be respected in the text that is dumped to each card's notes field.
84 |
85 | Finally, the -c and -d options will print out additional data about the quality of the cards. Running with -c is extremely slow due to the massive amount of computation involved, though at least we can do it in parallel over all of your processor cores; -d is probably a good idea to use in general unless you're trying to produce pretty output to show off. Using html mode is especially useful with -c as we can link to visual spoilers from magiccards.info.
86 |
87 | ### Examples
88 |
89 | To generate the standard encoding in data/output.txt, I run:
90 |
91 | ```
92 | ./encode.py -v data/AllSets.json data/output.txt
93 | ```
94 |
95 | Of course, this requires that you've downloaded the mtgjson corpus to data/AllSets.json, and are running from the root of the repo.
96 |
97 | If I wanted to convert that standard output to a Magic Set Editor 2 set, I'd run:
98 |
99 | ```
100 | ./decode.py -v data/output.txt data/allcards -f -g -d
101 | ```
102 |
103 | This will produce a useless text file called data/allcards, and a set file called data/allcards.mse-set that you can open with MSE2. The -f and -g options will cause the text spoiler included in the notes field of each card in the set to be a pretty Gatherer-inspired affair that you could cut and paste onto the mtgsalvation forum. The -d option will dump additional information if any of the cards are invalidly formatted, which probably won't do anything because all existing magic cards are encoded correctly. Specifying the -c option here would be a bad idea; it would probably take several days to run.
104 |
105 | ### Scripts
106 |
107 | A bunch of additional data processing functionality is provided by the files in scripts/. Right now there isn't a whole lot, but more tools might be added in the future, to do things such as convert card dumps into .arff files that could be analyzed in [Weka](http://www.cs.waikato.ac.nz/ml/weka/).
108 |
109 | Currently, scripts/summarize.py will build a bunch of big data mining indices and use them to print out interesting statistics about a dump of cards. If you want to use mtgencode to do your own data analysis, taking a look at it would be a good place to start.
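
For example, a minimal sketch of loading the encoded corpus with mtgencode's own loader (run from the root of the repo; mtg_open_file is the same function the driver scripts use):

```
import sys
sys.path.append('lib')
import jdecode

cards = jdecode.mtg_open_file('data/output.txt', verbose=True)
print str(len(cards)) + ' cards loaded'
```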
110 |
111 |
112 |
113 | ## Tutorial
114 |
115 | This tutorial will cover how to generate cards from scratch using neural nets.
116 |
117 | ### Set up a Linux environment
118 |
119 | If you're already running on Linux, hooray! If not, you have a few options. The easiest is probably to use a virtual machine; the disadvantage of this approach is that it will prevent you from using a graphics card to train the neural net, which speeds things up immensely. For reference, my GTX Titan is about 10x faster than my overclocked 8-core i7-5960X.
120 |
121 | The other option is to dual boot your machine (which is what I do) or otherwise acquire a machine that you can run Linux on natively. How exactly you do this is beyond the scope of this tutorial.
122 |
123 | If you do decide to go the virtual machine route:
124 |
125 | 1. Download some sort of virtual machine software. I recommend [VirtualBox](https://help.ubuntu.com/community/VirtualBox).
126 | 2. Download a Linux operating system. I recommend [Ubuntu](http://www.ubuntu.com/download/desktop).
127 | 3. [Create a virtual machine, and install the operating system on it](https://help.ubuntu.com/community/VirtualBox/FirstVM).
128 |
129 | IMPORTANT NOTE: Training neural nets is extremely CPU intensive, and rather memory intensive as well. If you don't want training to take multiple weeks, it's a very good idea to give your virtual machine as many processor cores and as much memory as you can spare, and to monitor system performance with the 'top' command to make sure you aren't [swapping](https://help.ubuntu.com/community/SwapFaq), as that will degrade performance immensely.
130 |
131 | You should be able to boot up the virtual machine and use whatever operating system you installed. If you're new to Linux, you might want to familiarize yourself with it a little. For my own sanity, I'm going to assume at least basic familiarity. Most of what we'll be doing will be in terminals; if the instructions say to do something and then provide some code in a block quote, it probably means to type that into a terminal, one line at a time.
132 |
133 | ### Set up the neural net code
134 |
135 | We're ultimately going to use the code from the [mtg-rnn repo](https://github.com/billzorn/mtg-rnn); if anything is unclear you can refer to the documentation there as well.
136 |
137 | First, we need to install some dependencies. The primary one is Torch, the scientific computing framework the neural net code is written in. Directions are [here](http://torch.ch/docs/getting-started.html).
138 |
139 | Next, open a terminal and install some additional lua packages:
140 |
141 | ```
142 | luarocks install nngraph
143 | luarocks install optim
144 | ```
145 |
146 | Now we'll clone the git repo with the neural net code. You'll need git; if it isn't installed:
147 |
148 | ```
149 | sudo apt-get install git
150 | ```
151 |
152 | Then go to your home directory (or wherever you want to put the repo, it can be anywhere really) and clone it:
153 |
154 | ```
155 | cd ~
156 | git clone https://github.com/billzorn/mtg-rnn.git
157 | ```
158 |
159 | This should create the folder mtg-rnn, with a bunch of files in it. To check if it works, try:
160 |
161 | ```
162 | cd ~/mtg-rnn
163 | th train.lua --help
164 | ```
165 |
166 | A large usage message should be printed. If you get an error, then check to make sure Torch is working. As always, Google is your best friend when anything goes wrong.
167 |
168 | ### Set up mtgencode
169 |
170 | Go back to your home directory (or wherever) and clone mtgencode as well:
171 |
172 | ```
173 | cd ~
174 | git clone https://github.com/billzorn/mtgencode.git
175 | ```
176 |
177 | This should create the folder mtgencode, also with a bunch of files in it.
178 |
179 | You'll need Python to run it; to get full functionality, consult [DEPENDENCIES.md](https://github.com/billzorn/mtgencode/blob/master/DEPENDENCIES.md#dependencies). But, it should work with just Python. To install Python:
180 |
181 | ```
182 | sudo apt-get install python
183 | ```
184 |
185 | To check if it works:
186 |
187 | ```
188 | cd ~/mtgencode
189 | ./encode.py --help
190 | ```
191 |
192 | Again, you should see a usage message; if you don't, make sure Python is working. mtgencode uses Python 2.7, so if you think your default python is Python 3, you can try:
193 |
194 | ```
195 | python2 encode.py --help
196 | ```
197 |
198 | instead of running the script directly.
199 |
200 | ### Generating an encoded corpus for training
201 |
202 | If you just want to train with the default corpus, you can skip this step, as it already exists in mtg-rnn. Just replace all instances of 'custom_encoding' with 'mtgencode-std'.
203 |
204 | To generate an encoded corpus, you'll first need to download AllSets.json from [mtgjson.com](http://mtgjson.com/) to data/AllSets.json. Then to encode it:
205 |
206 | ```
207 | ./encode.py -v data/AllSets.json data/custom_encoding.txt
208 | ```
209 |
210 | This will create the file data/custom_encoding.txt with your encoding in it. You can add some options to create a different encoding; consult the usage of [encode.py](https://github.com/billzorn/mtgencode#encodepy).
211 |
212 | Now copy this encoded corpus over to mtg-rnn:
213 |
214 | ```
215 | cd ~/mtg-rnn
216 | mkdir data/custom_encoding
217 | cp ~/mtgencode/data/custom_encoding.txt data/custom_encoding/input.txt
218 | ```
219 |
220 | The input file does have to be named input.txt, though you can name the folder that holds it, under mtg-rnn/data/, whatever you want.
221 |
222 | ### Training a neural net
223 |
224 | There are lots of parameters to control training. With a good GPU, I can train a 3-layer, size 512 network in a few hours; on a CPU this will probably take at least a day.
225 |
226 | Most networks we use are about that size. I'd recommend avoiding anything much larger, as they don't seem to produce appreciably better results and take longer to train. The only other parameter you really have to change from the defaults is seq_length, which we usually set somewhere from 120-200. If this causes memory issues you can reduce batch_size slightly to compensate.
227 |
228 | A sample training command might look like this:
229 |
230 | ```
231 | th train.lua -gpuid -1 -rnn_size 256 -num_layers 3 -seq_length 200 -data_dir data/custom_encoding -checkpoint_dir cv/custom_format-256/ -eval_val_every 1000 -seed 7767
232 | ```
233 |
234 | This tells the neural network to train using the corpus in data/custom_encoding/, and to output periodic checkpoints to the directory cv/custom_format-256/. The option "-gpuid -1" means to use the CPU, not a GPU (which won't be possible in VirtualBox anyway). The final options, -eval_val_every and -seed, aren't necessary, but I like to specify them. The seed will be set to a fixed 123 if you don't specify one yourself. If you're generating too many checkpoints and filling up your disk, you can increase the number of iterations between saving them by increasing the argument to -eval_val_every.
235 |
236 | If all goes well, you should see the neural net code do some stuff and then start training, reporting training loss and batch times as it goes:
237 |
238 | ```
239 | 1/112100 (epoch 0.000), train_loss = 4.21492900, grad/param norm = 3.1264e+00, time/batch = 4.73s
240 | 2/112100 (epoch 0.001), train_loss = 4.29372822, grad/param norm = 8.6741e+00, time/batch = 3.62s
241 | 3/112100 (epoch 0.001), train_loss = 4.02817964, grad/param norm = 8.0445e+00, time/batch = 3.57s
242 | ...
243 | ```
244 |
245 | This process can take a while, so go to sleep or something and come back in the morning. The train_loss should eventually start to decrease and settle around 0.5 or so; if it doesn't, then something is wrong and the neural net will probably produce gibberish.
246 |
247 | Every N iterations, where N is the argument to -eval_val_every, the neural net will generate a checkpoint in cv/custom_format-256/. They look like this:
248 |
249 | ```
250 | lm_lstm_epoch2.23_0.5367.t7
251 | ```
252 |
253 | The numbers are important; the first is the epoch, which tells you how many passes the neural network had made over the training data when it saved the checkpoint, and the second is the validation loss of the checkpoint. Validation loss is effectively a measurement of how accurate the checkpoint is at producing text that resembles the encoded format, the lower the better. The two numbers are separated by an underscore, so for the example above, the checkpoint is from epoch 2.23, and it had a validation loss of 0.5367, which isn't great but probably isn't gibberish either.
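
If you want to pull those numbers out of filenames programmatically, say to pick your best checkpoint, here's a small sketch (checkpoint_stats is a hypothetical helper; the pattern is assumed from the example above):

```
import re

def checkpoint_stats(fname):
    # 'lm_lstm_epoch2.23_0.5367.t7' -> (2.23, 0.5367)
    m = re.match(r'lm_lstm_epoch([\d.]+)_([\d.]+)\.t7$', fname)
    return (float(m.group(1)), float(m.group(2))) if m else None

print checkpoint_stats('lm_lstm_epoch2.23_0.5367.t7')
```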
254 |
255 | ### Sampling checkpoints to generate cards
256 |
257 | Once you're done training, or you've got enough checkpoints and you're just impatient, you can sample to generate actual cards. If the network is still training, you'll probably want to pause it by typing Control-Z in the terminal; you can resume it later with the command 'fg'. Training will use all available CPU resources all by itself, so trying to sample at the same time is a recipe for slow.
258 |
259 | Once you're ready, go to the mtg-rnn repo. A typical sampling command might look like this:
260 |
261 | ```
262 | th sample.lua cv/custom_format-256/lm_lstm_epochXX.XX_X.XXXX.t7 -gpuid -1 -temperature 0.9 -length 2000 | tee cards.txt
263 | ```
264 |
265 | Replace the Xs in the checkpoint name with the numbers in the name of an actual checkpoint; tab completion is your friend. This command will sample 2000 characters, which is probably something like 20 cards, and both print them to the terminal and write them to a file called cards.txt. The interesting options here are the temperature and the length. Temperature controls how cautious the network is; lower values produce more probable output, while higher values make it wilder and more creative. Somewhere in the range of 0.7-1.0 usually works best. Length is just how many characters to generate. You can also specify a seed with -seed, exactly as for training, which is a particularly good idea if you just generated a few million characters and would like to see something new. The default seed is fixed at 123, again exactly as for training.
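
To make the temperature knob concrete, here's an illustrative sketch of temperature sampling (this is not mtg-rnn's actual sampler, just the standard idea it implements):

```
import math, random

def sample_index(logits, temperature):
    # divide scores by the temperature, then softmax: low temperatures
    # sharpen the distribution, high temperatures flatten it
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    r = random.uniform(0, sum(exps))
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if acc >= r:
            return i
    return len(exps) - 1
```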
266 |
267 | You can read the output yourself, but it might be painful, especially if you're using randomly ordered fields.
268 |
269 | ### Postprocessing neural net output with mtgencode
270 |
271 | Once you've generated some cards, you can turn them into pretty text spoilers or a set file for MSE2.
272 |
273 | Go back to mtgencode, and run something like:
274 |
275 | ```
276 | ./decode.py -v ~/mtg-rnn/cards.txt cards.pretty.txt -d
277 | ```
278 |
279 | This should create a file called cards.pretty.txt with a text spoiler in it that's actually designed for human consumption. Open it in your favorite text editor and enjoy!
280 |
281 | The -d option ensures you'll still be able to see anything that went wrong with the cards. You can change the formatting with -f and -g, and produce a set file for MSE2 with -mse. The -c option produces some interesting comparisons to existing cards, but it's slow, so be prepared to wait a long time if you use it on a large dump.
282 |
283 | ## Gory details of the format
284 |
285 | Individual cards are separated by two newlines. Multifaced cards (split, flip, etc.) are encoded together, with the castable one first if applicable, and separated by only one newline.
286 |
287 | All decimal numbers are represented in unary, with numbers over 20 special-cased into English. Fun fact: the only numbers over 20 on cards are 25, 30, 40, 50, 100, and 200. The unary representation uses one character to mark the start of the number, and another to count. So 0 is &, 1 is &^, 2 is &^^, 11 is &^^^^^^^^^^^, and so on.
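
In code, the unary scheme is trivial. A sketch of the convention only (the real handling lives in lib/utils.py, not these hypothetical helpers):

```
def to_unary(n, marker='&', counter='^'):
    # 0 -> '&', 1 -> '&^', 11 -> '&^^^^^^^^^^^'
    return marker + counter * n

def from_unary(s, marker='&', counter='^'):
    return s.count(counter)
```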
288 |
289 | Mana costs are specially encoded between braces {}. I use the unary counter to encode the colorless part, and then special two-character symbols for everything else. So, {3}{W}{W} becomes {^^^WWWW}, {U/B}{U/B} becomes {UBUB}, and {X}{X}{X} becomes {XXXXXX}. The details are controlled in lib/utils.py, and handled with the Manacost and Manatext objects in lib/manalib.py.
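
To illustrate the convention (the real parsing is in lib/utils.py and lib/manalib.py; encode_mana here is a hypothetical helper):

```
def encode_mana(colorless, symbols):
    # symbols are the two-character codes described above,
    # e.g. 'WW' for {W} and 'UB' for {U/B}
    return '{' + '^' * colorless + ''.join(symbols) + '}'

# encode_mana(3, ['WW', 'WW']) == '{^^^WWWW}', i.e. {3}{W}{W}
```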
290 |
291 | The name of the card becomes @ in the text. I try to handle all the stupid special cases correctly. For example, Crovax the Cursed is referred to in his text box as simply 'Crovax'. Yuch.
292 |
293 | The names of counters are similarly replaced with %, and then a special line of text is added to tell what kind of counter % refers to. Fun fact: there are more than a hundred different kinds used in real cards.
294 |
295 | Several ambiguous words are resolved. Most directly, the word 'counter' as in 'counter target spell' is replaced with 'uncast'. This should prevent confusion with +&^/+&^ counters and % counters.
296 |
297 | I also reformat cards that choose between multiple things by removing the choice clause itself and instead having a delimited list of options prefixed by a number. If you could choose different numbers of things (one or both, one or more - turns out the latter is valid in all existing cases) then the number is 0, otherwise it's however many things you'd get to choose. So, 'choose one -\= effect x\= effect y' (the \ is a newline) becomes [&^ = effect x = effect y].
298 |
299 | Finally, some postprocessing is done to put the lines of a card's ability text into a standardized, canonical form. Lines with multiple keywords are split, and then we put all of the simple keywords first, followed by things like static or activated abilities. A few things always go first (such as equip and enchant) and a few other things always go last (such as kicker and countertype). There are various reasons for doing this transformation, and some proper science could probably come up with a better specific procedure. One of the primary motivations for putting abilities onto individual lines is that it should simplify the process of adding back in reminder text. It should be noted somewhere that the definition of a simple keyword ability vs. some other line of text is that a simple keyword won't contain a period, and we can split a line with multiple of them by looking for commas and semicolons.
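
That last definition is easy to state in code. A hedged sketch of the rule as just described (not the repo's transforms):

```
import re

def split_simple_keywords(line):
    # a simple keyword line contains no period; multiple keywords on one
    # line are separated by commas or semicolons
    if '.' in line:
        return [line]
    return [kw.strip() for kw in re.split(r'[,;]', line) if kw.strip()]

# split_simple_keywords('flying, first strike') -> ['flying', 'first strike']
```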
300 |
301 | ======
302 |
303 | Here's an attempt at a list of all the things I do:
304 |
305 | * Aggregate split / flip / rotating / etc. cards by their card number (22a with 22b) and put them together
306 |
307 | * Make all text lowercase, so the symbols for mana and X are distinct
308 |
309 | * Remove all reminder text
310 |
311 | * Put @ in for the name of the card
312 |
313 | * Encode the mana costs, and the tap and untap symbols
314 |
315 | * Convert decimal numbers to unary
316 |
317 | * Simplify the syntax of dashes, so that - is only used as a minus sign, and ~ is used elsewhere
318 |
319 | * Make sure that where X is the variable X, it's uppercase
320 |
321 | * Change the names of all counters to % and add a line to identify what kind of counter % refers to
322 |
323 | * Move the equip cost of equipment to the beginning of the text so that it's closer to the type
324 |
325 | * Rename 'counter' in the context of 'counter target spell' to 'uncast'
326 |
327 | * Put choices into [&^ = effect x = effect y] format
328 |
329 | * Replace actual newline characters with \ so that we can use those to separate cards
330 |
331 | * Clean all the unicode junk like accents and unicode minus signs out of the text so there are fewer characters
332 |
333 | * Split composite text lines (i.e. "flying, first strike" -> "flying\first strike") and put the lines into canonical order
334 |
--------------------------------------------------------------------------------
/data/cbow.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/billzorn/mtgencode/ee5f26590dc77cd252fa0ceb00d88b4665e2a9bf/data/cbow.bin
--------------------------------------------------------------------------------
/data/cbow.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | # train a CBOW model (-cbow 1) on cbow.txt: 200-dimensional vectors, window 8,
4 | # negative sampling, binary output written to cbow.bin
5 | word2vec -train cbow.txt -output cbow.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15
6 |
--------------------------------------------------------------------------------
/data/mtgvocab.json:
--------------------------------------------------------------------------------
1 | {"idx_to_token": {"1": "\n", "2": " ", "3": "\"", "4": "%", "5": "&", "6": "'", "7": "*", "8": "+", "9": ",", "10": "-", "11": ".", "12": "/", "13": "0", "14": "1", "15": "2", "16": "3", "17": "4", "18": "5", "19": "6", "20": "7", "21": "8", "22": "9", "23": ":", "24": "=", "25": "@", "26": "A", "27": "B", "28": "C", "29": "E", "30": "G", "31": "L", "32": "N", "33": "O", "34": "P", "35": "Q", "36": "R", "37": "S", "38": "T", "39": "U", "40": "W", "41": "X", "42": "Y", "43": "[", "44": "\\", "45": "]", "46": "^", "47": "a", "48": "b", "49": "c", "50": "d", "51": "e", "52": "f", "53": "g", "54": "h", "55": "i", "56": "j", "57": "k", "58": "l", "59": "m", "60": "n", "61": "o", "62": "p", "63": "q", "64": "r", "65": "s", "66": "t", "67": "u", "68": "v", "69": "w", "70": "x", "71": "y", "72": "z", "73": "{", "74": "|", "75": "}", "76": "~"}, "token_to_idx": {"\n": 1, " ": 2, "\"": 3, "%": 4, "'": 6, "&": 5, "+": 8, "*": 7, "-": 10, ",": 9, "/": 12, ".": 11, "1": 14, "0": 13, "3": 16, "2": 15, "5": 18, "4": 17, "7": 20, "6": 19, "9": 22, "8": 21, ":": 23, "=": 24, "A": 26, "@": 25, "C": 28, "B": 27, "E": 29, "G": 30, "L": 31, "O": 33, "N": 32, "Q": 35, "P": 34, "S": 37, "R": 36, "U": 39, "T": 38, "W": 40, "Y": 42, "X": 41, "[": 43, "]": 45, "\\": 44, "^": 46, "a": 47, "c": 49, "b": 48, "e": 51, "d": 50, "g": 53, "f": 52, "i": 55, "h": 54, "k": 57, "j": 56, "m": 59, "l": 58, "o": 61, "n": 60, "q": 63, "p": 62, "s": 65, "r": 64, "u": 67, "t": 66, "w": 69, "v": 68, "y": 71, "x": 70, "{": 73, "z": 72, "}": 75, "|": 74, "~": 76}}
--------------------------------------------------------------------------------
/decode.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import sys
3 | import os
4 | import zipfile
5 | import shutil
6 |
7 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'lib')
8 | sys.path.append(libdir)
9 | import utils
10 | import jdecode
11 | import cardlib
12 | from cbow import CBOW
13 | from namediff import Namediff
14 |
15 | def main(fname, oname = None, verbose = True, encoding = 'std',
16 | gatherer = False, for_forum = False, for_mse = False,
17 | creativity = False, vdump = False, for_html = False):
18 |
19 | # there is a sane thing to do here (namely, produce both at the same time)
20 | # but we don't support it yet.
21 | if for_mse and for_html:
22 | print 'ERROR - decode.py - incompatible formats "mse" and "html"'
23 | return
24 |
25 | fmt_ordered = cardlib.fmt_ordered_default
26 |
27 | if encoding in ['std']:
28 | pass
29 | elif encoding in ['named']:
30 | fmt_ordered = cardlib.fmt_ordered_named
31 | elif encoding in ['noname']:
32 | fmt_ordered = cardlib.fmt_ordered_noname
33 | elif encoding in ['rfields']:
34 | pass
35 | elif encoding in ['old']:
36 | fmt_ordered = cardlib.fmt_ordered_old
37 | elif encoding in ['norarity']:
38 | fmt_ordered = cardlib.fmt_ordered_norarity
39 | elif encoding in ['vec']:
40 | pass
41 | elif encoding in ['custom']:
42 | ## put custom format decisions here ##########################
43 |
44 | ## end of custom format ######################################
45 | pass
46 | else:
47 |         raise ValueError('decode.py: unknown encoding: ' + encoding)
48 |
49 | cards = jdecode.mtg_open_file(fname, verbose=verbose, fmt_ordered=fmt_ordered)
50 |
51 | if creativity:
52 | namediff = Namediff()
53 | cbow = CBOW()
54 | if verbose:
55 | print 'Computing nearest names...'
56 | nearest_names = namediff.nearest_par(map(lambda c: c.name, cards), n=3)
57 | if verbose:
58 | print 'Computing nearest cards...'
59 | nearest_cards = cbow.nearest_par(cards)
60 | for i in range(0, len(cards)):
61 | cards[i].nearest_names = nearest_names[i]
62 | cards[i].nearest_cards = nearest_cards[i]
63 | if verbose:
64 | print '...Done.'
65 |
66 | def hoverimg(cardname, dist, nd):
67 | truename = nd.names[cardname]
68 | code = nd.codes[cardname]
69 | namestr = ''
70 | if for_html:
71 | if code:
72 |                 namestr = ('<div class="hover_img"><a href="#">' + truename
73 |                            + '<span><img src="http://magiccards.info/scans/en/' + code + '.jpg" alt=""/></span></a>'
74 |                            + ': ' + str(dist) + '</div>\n')
75 |             else:
76 |                 namestr = '<div>' + truename + ': ' + str(dist) + '</div>\n'
77 | elif for_forum:
78 | namestr = '[card]' + truename + '[/card]' + ': ' + str(dist) + '\n'
79 | else:
80 | namestr = truename + ': ' + str(dist) + '\n'
81 | return namestr
82 |
83 | def writecards(writer):
84 | if for_mse:
85 | # have to prepend a massive chunk of formatting info
86 | writer.write(utils.mse_prepend)
87 |
88 | if for_html:
89 |             # have to prepend html info
90 |             writer.write(utils.html_prepend)
91 |             # separate the write function to allow for writing smaller chunks of cards at a time
92 | segments = sort_colors(cards)
93 | for i in range(len(segments)):
94 | # sort color by CMC
95 | segments[i] = sort_type(segments[i])
96 | # this allows card boxes to be colored for each color
97 |                 # to color each box separately, cardlib.Card.format() would have to change non-minimally
98 |                 writer.write('<div>')
99 |                 writehtml(writer, segments[i])
100 |                 writer.write('</div>')
101 | # closing the html file
102 | writer.write(utils.html_append)
103 |             return  # break out of the writecards function to avoid writing cards twice
104 |
105 |
106 | for card in cards:
107 | if for_mse:
108 | writer.write(card.to_mse().encode('utf-8'))
109 | fstring = ''
110 | if card.json:
111 | fstring += 'JSON:\n' + card.json + '\n'
112 | if card.raw:
113 | fstring += 'raw:\n' + card.raw + '\n'
114 | fstring += '\n'
115 | fstring += card.format(gatherer = gatherer, for_forum = for_forum,
116 | vdump = vdump) + '\n'
117 | fstring = fstring.replace('<', '(').replace('>', ')')
118 | writer.write(('\n' + fstring[:-1]).replace('\n', '\n\t\t'))
119 | else:
120 | fstring = card.format(gatherer = gatherer, for_forum = for_forum,
121 | vdump = vdump, for_html = for_html)
122 | writer.write((fstring + '\n').encode('utf-8'))
123 |
124 | if creativity:
125 | cstring = '~~ closest cards ~~\n'
126 | nearest = card.nearest_cards
127 | for dist, cardname in nearest:
128 | cstring += hoverimg(cardname, dist, namediff)
129 | cstring += '~~ closest names ~~\n'
130 | nearest = card.nearest_names
131 | for dist, cardname in nearest:
132 | cstring += hoverimg(cardname, dist, namediff)
133 | if for_mse:
134 | cstring = ('\n\n' + cstring[:-1]).replace('\n', '\n\t\t')
135 | writer.write(cstring.encode('utf-8'))
136 |
137 | writer.write('\n'.encode('utf-8'))
138 |
139 | if for_mse:
140 | # more formatting info
141 | writer.write('version control:\n\ttype: none\napprentice code: ')
142 |
143 |
144 | def writehtml(writer, card_set):
145 | for card in card_set:
146 | fstring = card.format(gatherer = gatherer, for_forum = True,
147 | vdump = vdump, for_html = for_html)
148 | if creativity:
149 |                 fstring = fstring[:-6] # chop off the closing </div> to stick stuff in
150 | writer.write((fstring + '\n').encode('utf-8'))
151 |
152 | if creativity:
153 |                 cstring = '~~ closest cards ~~\n<br>\n'
154 |                 nearest = card.nearest_cards
155 |                 for dist, cardname in nearest:
156 |                     cstring += hoverimg(cardname, dist, namediff)
157 |                 cstring += '<br>\n'
158 |                 cstring += '~~ closest names ~~\n<br>\n'
159 |                 nearest = card.nearest_names
160 |                 for dist, cardname in nearest:
161 |                     cstring += hoverimg(cardname, dist, namediff)
162 |                 cstring = '<hr>' + cstring + '</div>\n'
163 | writer.write(cstring.encode('utf-8'))
164 |
165 | writer.write('\n'.encode('utf-8'))
166 |
167 | # Sorting by colors
168 | def sort_colors(card_set):
169 | # Initialize sections
170 | red_cards = []
171 | blue_cards = []
172 | green_cards = []
173 | black_cards = []
174 | white_cards = []
175 | multi_cards = []
176 | colorless_cards = []
177 | lands = []
178 | for card in card_set:
179 | if len(card.get_colors())>1:
180 | multi_cards += [card]
181 | continue
182 | if 'R' in card.get_colors():
183 | red_cards += [card]
184 | continue
185 | elif 'U' in card.get_colors():
186 | blue_cards += [card]
187 | continue
188 | elif 'B' in card.get_colors():
189 | black_cards += [card]
190 | continue
191 | elif 'G' in card.get_colors():
192 | green_cards += [card]
193 | continue
194 | elif 'W' in card.get_colors():
195 | white_cards += [card]
196 | continue
197 | else:
198 | if "land" in card.get_types():
199 | lands += [card]
200 | continue
201 | colorless_cards += [card]
202 | return[white_cards, blue_cards, black_cards, red_cards, green_cards, multi_cards, colorless_cards, lands]
203 |
204 | def sort_type(card_set):
205 | sorting = ["creature", "enchantment", "instant", "sorcery", "artifact", "planeswalker"]
206 | sorted_cards = [[],[],[],[],[],[],[]]
207 | sorted_set = []
208 | for card in card_set:
209 | types = card.get_types()
210 | for i in range(len(sorting)):
211 | if sorting[i] in types:
212 | sorted_cards[i] += [card]
213 | break
214 | else:
215 | sorted_cards[6] += [card]
216 | for value in sorted_cards:
217 | for card in value:
218 | sorted_set += [card]
219 | return sorted_set
220 |
221 |
222 |
223 | def sort_cmc(card_set):
224 | sorted_cards = []
225 | sorted_set = []
226 | for card in card_set:
227 | # make sure there is an empty set for each CMC
228 | while len(sorted_cards)-1 < card.get_cmc():
229 | sorted_cards += [[]]
230 | # add card to correct set of CMC values
231 | sorted_cards[card.get_cmc()] += [card]
232 | # combine each set of CMC valued cards together
233 | for value in sorted_cards:
234 | for card in value:
235 | sorted_set += [card]
236 | return sorted_set
237 |
238 |
239 | if oname:
240 | if for_html:
241 | print oname
242 | # if ('.html' != oname[-])
243 | # oname += '.html'
244 | if verbose:
245 | print 'Writing output to: ' + oname
246 | with open(oname, 'w') as ofile:
247 | writecards(ofile)
248 | if for_mse:
249 | # Copy whatever output file is produced, name the copy 'set' (yes, no extension).
250 | if os.path.isfile('set'):
251 | print 'ERROR: tried to overwrite existing file "set" - aborting.'
252 | return
253 | shutil.copyfile(oname, 'set')
254 | # Use the freaky mse extension instead of zip.
255 | with zipfile.ZipFile(oname+'.mse-set', mode='w') as zf:
256 | try:
257 | # Zip up the set file into oname.mse-set.
258 | zf.write('set')
259 | finally:
260 | if verbose:
261 | print 'Made an MSE set file called ' + oname + '.mse-set.'
262 | # The set file is useless outside the .mse-set, delete it.
263 | os.remove('set')
264 | else:
265 | writecards(sys.stdout)
266 | sys.stdout.flush()
267 |
268 |
269 | if __name__ == '__main__':
270 | import argparse
271 | parser = argparse.ArgumentParser()
272 |
273 | parser.add_argument('infile', #nargs='?'. default=None,
274 | help='encoded card file or json corpus to encode')
275 | parser.add_argument('outfile', nargs='?', default=None,
276 | help='output file, defaults to stdout')
277 | parser.add_argument('-e', '--encoding', default='std', choices=utils.formats,
278 | #help='{' + ','.join(formats) + '}',
279 | help='encoding format to use',
280 | )
281 | parser.add_argument('-g', '--gatherer', action='store_true',
282 | help='emulate Gatherer visual spoiler')
283 | parser.add_argument('-f', '--forum', action='store_true',
284 | help='use pretty mana encoding for mtgsalvation forum')
285 | parser.add_argument('-c', '--creativity', action='store_true',
286 | help='use CBOW fuzzy matching to check creativity of cards')
287 | parser.add_argument('-d', '--dump', action='store_true',
288 | help='dump out lots of information about invalid cards')
289 | parser.add_argument('-v', '--verbose', action='store_true',
290 | help='verbose output')
291 | parser.add_argument('-mse', '--mse', action='store_true',
292 | help='use Magic Set Editor 2 encoding; will output as .mse-set file')
293 | parser.add_argument('-html', '--html', action='store_true', help='create a .html file with pretty forum formatting')
294 |
295 | args = parser.parse_args()
296 |
297 | main(args.infile, args.outfile, verbose = args.verbose, encoding = args.encoding,
298 | gatherer = args.gatherer, for_forum = args.forum, for_mse = args.mse,
299 | creativity = args.creativity, vdump = args.dump, for_html = args.html)
300 |
301 | exit(0)
302 |
--------------------------------------------------------------------------------
/encode.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import sys
3 | import os
4 |
5 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'lib')
6 | sys.path.append(libdir)
7 | import re
8 | import random
9 | import utils
10 | import jdecode
11 | import cardlib
12 |
13 | def main(fname, oname = None, verbose = True, encoding = 'std',
14 | nolinetrans = False, randomize = False, nolabel = False, stable = False):
15 | fmt_ordered = cardlib.fmt_ordered_default
16 | fmt_labeled = None if nolabel else cardlib.fmt_labeled_default
17 | fieldsep = utils.fieldsep
18 | line_transformations = not nolinetrans
19 | randomize_fields = False
20 | randomize_mana = randomize
21 | initial_sep = True
22 | final_sep = True
23 |
24 | # set the properties of the encoding
25 |
26 | if encoding in ['std']:
27 | pass
28 | elif encoding in ['named']:
29 | fmt_ordered = cardlib.fmt_ordered_named
30 | elif encoding in ['noname']:
31 | fmt_ordered = cardlib.fmt_ordered_noname
32 | elif encoding in ['rfields']:
33 | randomize_fields = True
34 | final_sep = False
35 | elif encoding in ['old']:
36 | fmt_ordered = cardlib.fmt_ordered_old
37 | elif encoding in ['norarity']:
38 | fmt_ordered = cardlib.fmt_ordered_norarity
39 | elif encoding in ['vec']:
40 | pass
41 | elif encoding in ['custom']:
42 | ## put custom format decisions here ##########################
43 |
44 | ## end of custom format ######################################
45 | pass
46 | else:
47 | raise ValueError('encode.py: unknown encoding: ' + encoding)
48 |
49 | if verbose:
50 | print 'Preparing to encode:'
51 | print ' Using encoding ' + repr(encoding)
52 | if stable:
53 | print ' NOT randomizing order of cards.'
54 | if randomize_mana:
55 |         print ' Randomizing order of symbols in mana costs.'
56 | if not fmt_labeled:
57 | print ' NOT labeling fields for this run (may be harder to decode).'
58 | if not line_transformations:
59 | print ' NOT using line reordering transformations'
60 |
61 | cards = jdecode.mtg_open_file(fname, verbose=verbose, linetrans=line_transformations)
62 |
63 | # This should give a random but consistent ordering, to make comparing changes
64 | # between the output of different versions easier.
65 | if not stable:
66 | random.seed(1371367)
67 | random.shuffle(cards)
68 |
69 | def writecards(writer):
70 | for card in cards:
71 | if encoding in ['vec']:
72 | writer.write(card.vectorize() + '\n\n')
73 | else:
74 | writer.write(card.encode(fmt_ordered = fmt_ordered,
75 | fmt_labeled = fmt_labeled,
76 | fieldsep = fieldsep,
77 | randomize_fields = randomize_fields,
78 | randomize_mana = randomize_mana,
79 | initial_sep = initial_sep,
80 | final_sep = final_sep)
81 | + utils.cardsep)
82 |
83 | if oname:
84 | if verbose:
85 | print 'Writing output to: ' + oname
86 | with open(oname, 'w') as ofile:
87 | writecards(ofile)
88 | else:
89 | writecards(sys.stdout)
90 | sys.stdout.flush()
91 |
92 |
93 | if __name__ == '__main__':
94 | import argparse
95 | parser = argparse.ArgumentParser()
96 |
97 | parser.add_argument('infile',
98 | help='encoded card file or json corpus to encode')
99 | parser.add_argument('outfile', nargs='?', default=None,
100 | help='output file, defaults to stdout')
101 | parser.add_argument('-e', '--encoding', default='std', choices=utils.formats,
102 | #help='{' + ','.join(formats) + '}',
103 | help='encoding format to use',
104 | )
105 | parser.add_argument('-r', '--randomize', action='store_true',
106 | help='randomize the order of symbols in mana costs')
107 | parser.add_argument('--nolinetrans', action='store_true',
108 | help="don't reorder lines of card text")
109 | parser.add_argument('--nolabel', action='store_true',
110 | help="don't label fields")
111 | parser.add_argument('-s', '--stable', action='store_true',
112 | help="don't randomize the order of the cards")
113 | parser.add_argument('-v', '--verbose', action='store_true',
114 | help='verbose output')
115 |
116 | args = parser.parse_args()
117 | main(args.infile, args.outfile, verbose = args.verbose, encoding = args.encoding,
118 | nolinetrans = args.nolinetrans, randomize = args.randomize, nolabel = args.nolabel,
119 | stable = args.stable)
120 | exit(0)
121 |
--------------------------------------------------------------------------------
/lib/cbow.py:
--------------------------------------------------------------------------------
1 | # Infinite thanks to Talcos from the mtgsalvation forums, who among
2 | # many, many other things wrote the original version of this code.
3 | # I have merely ported it to fit my needs.
4 |
5 | import re
6 | import sys
7 | import subprocess
8 | import os
9 | import struct
10 | import math
11 | import multiprocessing
12 |
13 | import utils
14 | import cardlib
15 | import transforms
16 | import namediff
17 |
18 | libdir = os.path.dirname(os.path.realpath(__file__))
19 | datadir = os.path.realpath(os.path.join(libdir, '../data'))
20 |
21 | # multithreading control parameters
22 | cores = multiprocessing.cpu_count()
23 |
24 | # max length of vocabulary entries
25 | max_w = 50
26 |
27 |
28 | #### snip! ####
29 |
30 | def read_vector_file(fname):
31 | with open(fname, 'rb') as f:
32 | words = int(f.read(4))
33 | size = int(f.read(4))
34 | vocab = [' '] * (words * max_w)
35 | M = []
36 | for b in range(0,words):
37 | a = 0
38 | while True:
39 | c = f.read(1)
40 |                 vocab[b * max_w + a] = c
41 | if len(c) == 0 or c == ' ':
42 | break
43 | if (a < max_w) and vocab[b * max_w + a] != '\n':
44 | a += 1
45 | tmp = list(struct.unpack('f'*size,f.read(4 * size)))
46 | length = math.sqrt(sum([tmp[i] * tmp[i] for i in range(0,len(tmp))]))
47 | for i in range(0,len(tmp)):
48 | tmp[i] /= length
49 | M.append(tmp)
50 | return ((''.join(vocab)).split(),M)
51 |
52 | def makevector(vocabulary,vecs,sequence):
53 | words = sequence.split()
54 | indices = []
55 | for word in words:
56 | if word not in vocabulary:
57 | #print("Missing word in vocabulary: " + word)
58 | continue
59 | #return [0.0]*len(vecs[0])
60 | indices.append(vocabulary.index(word))
61 | #res = map(sum,[vecs[i] for i in indices])
62 | res = None
63 | for v in [vecs[i] for i in indices]:
64 |         if res is None:
65 | res = v
66 | else:
67 | res = [x + y for x, y in zip(res,v)]
68 |
69 | # bad things happen if we have a vector of only unknown words
70 | if res is None:
71 | return [0.0]*len(vecs[0])
72 |
73 | length = math.sqrt(sum([res[i] * res[i] for i in range(0,len(res))]))
74 | for i in range(0,len(res)):
75 | res[i] /= length
76 | return res
77 |
78 | #### !snip ####
79 |
80 |
81 | try:
82 | import numpy
83 | def cosine_similarity(v1,v2):
84 | A = numpy.array([v1,v2])
85 |
86 | # from http://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat
87 |
88 | # base similarity matrix (all dot products)
89 | # replace this with A.dot(A.T).todense() for sparse representation
90 | similarity = numpy.dot(A, A.T)
91 |
92 | # squared magnitude of preference vectors (number of occurrences)
93 | square_mag = numpy.diag(similarity)
94 |
95 | # inverse squared magnitude
96 | inv_square_mag = 1 / square_mag
97 |
98 | # if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
99 | inv_square_mag[numpy.isinf(inv_square_mag)] = 0
100 |
101 | # inverse of the magnitude
102 | inv_mag = numpy.sqrt(inv_square_mag)
103 |
104 | # cosine similarity (elementwise multiply by inverse magnitudes)
105 | cosine = similarity * inv_mag
106 | cosine = cosine.T * inv_mag
107 |
108 | return cosine[0][1]
109 |
110 | except ImportError:
111 | def cosine_similarity(v1,v2):
112 | #compute cosine similarity of v1 to v2: (v1 dot v1)/{||v1||*||v2||)
113 | sumxx, sumxy, sumyy = 0, 0, 0
114 | for i in range(len(v1)):
115 | x = v1[i]; y = v2[i]
116 | sumxx += x*x
117 | sumyy += y*y
118 | sumxy += x*y
119 | return sumxy/math.sqrt(sumxx*sumyy)
120 |
121 | def cosine_similarity_name(cardvec, v, name):
122 | return (cosine_similarity(cardvec, v), name)
123 |
124 | # we need to put the logic in a regular function (as opposed to a method of an object)
125 | # so that we can pass the function to multiprocessing
126 | def f_nearest(card, vocab, vecs, cardvecs, n):
127 | if isinstance(card, cardlib.Card):
128 | words = card.vectorize().split('\n\n')[0]
129 | else:
130 | # assume it's a string (that's already a vector)
131 | words = card
132 |
133 | if not words:
134 | return []
135 |
136 | cardvec = makevector(vocab, vecs, words)
137 |
138 | comparisons = [cosine_similarity_name(cardvec, v, name) for (name, v) in cardvecs]
139 |
140 | comparisons.sort(reverse = True)
141 | comp_n = comparisons[:n]
142 |
143 | if isinstance(card, cardlib.Card) and card.bside:
144 | comp_n += f_nearest(card.bside, vocab, vecs, cardvecs, n=n)
145 |
146 | return comp_n
147 |
148 | def f_nearest_per_thread(workitem):
149 | (workcards, vocab, vecs, cardvecs, n) = workitem
150 | return map(lambda card: f_nearest(card, vocab, vecs, cardvecs, n), workcards)
151 |
152 | class CBOW:
153 | def __init__(self, verbose = True,
154 | vector_fname = os.path.join(datadir, 'cbow.bin'),
155 | card_fname = os.path.join(datadir, 'output.txt')):
156 | self.verbose = verbose
157 | self.cardvecs = []
158 |
159 | if self.verbose:
160 | print 'Building a cbow model...'
161 |
162 | if self.verbose:
163 | print ' Reading binary vector data from: ' + vector_fname
164 | (vocab, vecs) = read_vector_file(vector_fname)
165 | self.vocab = vocab
166 | self.vecs = vecs
167 |
168 | if self.verbose:
169 | print ' Reading encoded cards from: ' + card_fname
170 | print ' They\'d better be in the same order as the file used to build the vector model!'
171 | with open(card_fname, 'rt') as f:
172 | text = f.read()
173 | for card_src in text.split(utils.cardsep):
174 | if card_src:
175 | card = cardlib.Card(card_src)
176 | name = card.name
177 | self.cardvecs += [(name, makevector(self.vocab,
178 | self.vecs,
179 | card.vectorize()))]
180 |
181 | if self.verbose:
182 | print '... Done.'
183 | print ' vocab size: ' + str(len(self.vocab))
184 | print ' raw vecs: ' + str(len(self.vecs))
185 | print ' card vecs: ' + str(len(self.cardvecs))
186 |
187 | def nearest(self, card, n=5):
188 | return f_nearest(card, self.vocab, self.vecs, self.cardvecs, n)
189 |
190 | def nearest_par(self, cards, n=5, threads=cores):
191 | workpool = multiprocessing.Pool(threads)
192 | proto_worklist = namediff.list_split(cards, threads)
193 | worklist = map(lambda x: (x, self.vocab, self.vecs, self.cardvecs, n), proto_worklist)
194 | donelist = workpool.map(f_nearest_per_thread, worklist)
195 | return namediff.list_flatten(donelist)
196 |
--------------------------------------------------------------------------------
/lib/config.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | # Utilities for handling unicode, unary numbers, mana costs, and special symbols.
4 | # For convenience we redefine everything from utils so that it can all be accessed
5 | # from the utils module.
6 |
7 | # separators
8 | cardsep = '\n\n'
9 | fieldsep = '|'
10 | bsidesep = '\n'
11 | newline = '\\'
12 |
13 | # special indicators
14 | dash_marker = '~'
15 | bullet_marker = '='
16 | this_marker = '@'
17 | counter_marker = '%'
18 | reserved_marker = '\v'
19 | reserved_mana_marker = '$'
20 | choice_open_delimiter = '['
21 | choice_close_delimiter = ']'
22 | x_marker = 'X'
23 | tap_marker = 'T'
24 | untap_marker = 'Q'
25 | # second letter of the word
26 | rarity_common_marker = 'O'
27 | rarity_uncommon_marker = 'N'
28 | rarity_rare_marker = 'A'
29 | rarity_mythic_marker = 'Y'
30 | # with some crazy exceptions
31 | rarity_special_marker = 'E'
32 | rarity_basic_land_marker = 'L'
33 |
34 | # unambiguous synonyms
35 | counter_rename = 'uncast'
36 |
37 | # unary numbers
38 | unary_marker = '&'
39 | unary_counter = '^'
40 | unary_max = 20
41 | unary_exceptions = {
42 | 25 : 'twenty' + dash_marker + 'five',
43 | 30 : 'thirty',
44 | 40 : 'forty',
45 |     50 : 'fifty',
46 | 100: 'one hundred',
47 | 200: 'two hundred',
48 | }
49 |
50 | # field labels, to allow potential reordering of card format
51 | field_label_name = '1'
52 | field_label_rarity = '0' # 2 is part of some mana symbols {2/B} ...
53 | field_label_cost = '3'
54 | field_label_supertypes = '4'
55 | field_label_types = '5'
56 | field_label_subtypes = '6'
57 | field_label_loyalty = '7'
58 | field_label_pt = '8'
59 | field_label_text = '9'
60 |
61 | # additional fields we add to the json cards
62 | json_field_bside = 'bside'
63 | json_field_set_name = 'setName'
64 | json_field_info_code = 'magicCardsInfoCode'
65 |
--------------------------------------------------------------------------------
/lib/datalib.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | import utils
4 | from cardlib import Card
5 |
6 | # Format a list of rows of data into nice columns.
7 | # Note that it's the columns that are nice, not this code.
8 | def padrows(l):
9 | # get length for each field
10 | lens = []
11 | for ll in l:
12 | for i, field in enumerate(ll):
13 | if i < len(lens):
14 | lens[i] = max(len(str(field)), lens[i])
15 | else:
16 | lens += [len(str(field))]
17 | # now pad out to that length
18 | padded = []
19 | for ll in l:
20 | padded += ['']
21 | for i, field in enumerate(ll):
22 | s = str(field)
23 | pad = ' ' * (lens[i] - len(s))
24 | padded[-1] += (s + pad + ' ')
25 | return padded
26 | def printrows(l):
27 | for row in l:
28 | print row
29 |
30 | # index management helpers
31 | def index_size(d):
32 | return sum(map(lambda k: len(d[k]), d))
33 |
34 | def inc(d, k, obj):
35 | if k or k == 0:
36 | if k in d:
37 | d[k] += obj
38 | else:
39 | d[k] = obj
40 |
41 | # thanks gleemax
42 | def plimit(s, mlen = 1000):
43 | if len(s) > mlen:
44 |         return s[:mlen] + '[...]'
45 | else:
46 | return s
47 |
48 | class Datamine:
49 | # build the global indices
50 | def __init__(self, card_srcs):
51 | # global card pools
52 | self.unparsed_cards = []
53 | self.invalid_cards = []
54 | self.cards = []
55 | self.allcards = []
56 |
57 | # global indices
58 | self.by_name = {}
59 | self.by_type = {}
60 | self.by_type_inclusive = {}
61 | self.by_supertype = {}
62 | self.by_supertype_inclusive = {}
63 | self.by_subtype = {}
64 | self.by_subtype_inclusive = {}
65 | self.by_color = {}
66 | self.by_color_inclusive = {}
67 | self.by_color_count = {}
68 | self.by_cmc = {}
69 | self.by_cost = {}
70 | self.by_power = {}
71 | self.by_toughness = {}
72 | self.by_pt = {}
73 | self.by_loyalty = {}
74 | self.by_textlines = {}
75 | self.by_textlen = {}
76 |
77 | self.indices = {
78 | 'by_name' : self.by_name,
79 | 'by_type' : self.by_type,
80 | 'by_type_inclusive' : self.by_type_inclusive,
81 | 'by_supertype' : self.by_supertype,
82 | 'by_supertype_inclusive' : self.by_supertype_inclusive,
83 | 'by_subtype' : self.by_subtype,
84 | 'by_subtype_inclusive' : self.by_subtype_inclusive,
85 | 'by_color' : self.by_color,
86 | 'by_color_inclusive' : self.by_color_inclusive,
87 | 'by_color_count' : self.by_color_count,
88 | 'by_cmc' : self.by_cmc,
89 | 'by_cost' : self.by_cost,
90 | 'by_power' : self.by_power,
91 | 'by_toughness' : self.by_toughness,
92 | 'by_pt' : self.by_pt,
93 | 'by_loyalty' : self.by_loyalty,
94 | 'by_textlines' : self.by_textlines,
95 | 'by_textlen' : self.by_textlen,
96 | }
97 |
98 | for card_src in card_srcs:
99 | # the empty card is not interesting
100 | if not card_src:
101 | continue
102 | card = Card(card_src)
103 | if card.valid:
104 | self.cards += [card]
105 | self.allcards += [card]
106 | elif card.parsed:
107 | self.invalid_cards += [card]
108 | self.allcards += [card]
109 | else:
110 | self.unparsed_cards += [card]
111 |
112 | if card.parsed:
113 | inc(self.by_name, card.name, [card])
114 |
115 | inc(self.by_type, ' '.join(card.types), [card])
116 | for t in card.types:
117 | inc(self.by_type_inclusive, t, [card])
118 | inc(self.by_supertype, ' '.join(card.supertypes), [card])
119 | for t in card.supertypes:
120 | inc(self.by_supertype_inclusive, t, [card])
121 | inc(self.by_subtype, ' '.join(card.subtypes), [card])
122 | for t in card.subtypes:
123 | inc(self.by_subtype_inclusive, t, [card])
124 |
125 | if card.cost.colors:
126 | inc(self.by_color, card.cost.colors, [card])
127 | for c in card.cost.colors:
128 | inc(self.by_color_inclusive, c, [card])
129 | inc(self.by_color_count, len(card.cost.colors), [card])
130 | else:
131 | # colorless, still want to include in these tables
132 | inc(self.by_color, 'A', [card])
133 | inc(self.by_color_inclusive, 'A', [card])
134 | inc(self.by_color_count, 0, [card])
135 |
136 | inc(self.by_cmc, card.cost.cmc, [card])
137 | inc(self.by_cost, card.cost.encode() if card.cost.encode() else 'none', [card])
138 |
139 | inc(self.by_power, card.pt_p, [card])
140 | inc(self.by_toughness, card.pt_t, [card])
141 | inc(self.by_pt, card.pt, [card])
142 |
143 | inc(self.by_loyalty, card.loyalty, [card])
144 |
145 | inc(self.by_textlines, len(card.text_lines), [card])
146 | inc(self.by_textlen, len(card.text.encode()), [card])
147 |
148 | # summarize the indices
149 | # Yes, this printing code is pretty terrible.
150 | def summarize(self, hsize = 10, vsize = 10, cmcsize = 20):
151 | print '===================='
152 | print str(len(self.cards)) + ' valid cards, ' + str(len(self.invalid_cards)) + ' invalid cards.'
153 | print str(len(self.allcards)) + ' cards parsed, ' + str(len(self.unparsed_cards)) + ' failed to parse'
154 | print '--------------------'
155 | print str(len(self.by_name)) + ' unique card names'
156 | print '--------------------'
157 | print (str(len(self.by_color_inclusive)) + ' represented colors (including colorless as \'A\'), '
158 | + str(len(self.by_color)) + ' combinations')
159 | print 'Breakdown by color:'
160 | rows = [self.by_color_inclusive.keys()]
161 | rows += [[len(self.by_color_inclusive[k]) for k in rows[0]]]
162 | printrows(padrows(rows))
163 | print 'Breakdown by number of colors:'
164 | rows = [self.by_color_count.keys()]
165 | rows += [[len(self.by_color_count[k]) for k in rows[0]]]
166 | printrows(padrows(rows))
167 | print '--------------------'
168 | print str(len(self.by_type_inclusive)) + ' unique card types, ' + str(len(self.by_type)) + ' combinations'
169 | print 'Breakdown by type:'
170 | d = sorted(self.by_type_inclusive,
171 | lambda x,y: cmp(len(self.by_type_inclusive[x]), len(self.by_type_inclusive[y])),
172 | reverse = True)
173 | rows = [[k for k in d[:hsize]]]
174 | rows += [[len(self.by_type_inclusive[k]) for k in rows[0]]]
175 | printrows(padrows(rows))
176 | print '--------------------'
177 | print (str(len(self.by_subtype_inclusive)) + ' unique subtypes, '
178 | + str(len(self.by_subtype)) + ' combinations')
179 | print '-- Popular subtypes: --'
180 | d = sorted(self.by_subtype_inclusive,
181 | lambda x,y: cmp(len(self.by_subtype_inclusive[x]), len(self.by_subtype_inclusive[y])),
182 | reverse = True)
183 | rows = []
184 | for k in d[0:vsize]:
185 | rows += [[k, len(self.by_subtype_inclusive[k])]]
186 | printrows(padrows(rows))
187 | print '-- Top combinations: --'
188 | d = sorted(self.by_subtype,
189 | lambda x,y: cmp(len(self.by_subtype[x]), len(self.by_subtype[y])),
190 | reverse = True)
191 | rows = []
192 | for k in d[0:vsize]:
193 | rows += [[k, len(self.by_subtype[k])]]
194 | printrows(padrows(rows))
195 | print '--------------------'
196 | print (str(len(self.by_supertype_inclusive)) + ' unique supertypes, '
197 | + str(len(self.by_supertype)) + ' combinations')
198 | print 'Breakdown by supertype:'
199 | d = sorted(self.by_supertype_inclusive,
200 | lambda x,y: cmp(len(self.by_supertype_inclusive[x]),len(self.by_supertype_inclusive[y])),
201 | reverse = True)
202 | rows = [[k for k in d[:hsize]]]
203 | rows += [[len(self.by_supertype_inclusive[k]) for k in rows[0]]]
204 | printrows(padrows(rows))
205 | print '--------------------'
206 | print str(len(self.by_cmc)) + ' different CMCs, ' + str(len(self.by_cost)) + ' unique mana costs'
207 | print 'Breakdown by CMC:'
208 | d = sorted(self.by_cmc, reverse = False)
209 | rows = [[k for k in d[:cmcsize]]]
210 | rows += [[len(self.by_cmc[k]) for k in rows[0]]]
211 | printrows(padrows(rows))
212 | print '-- Popular mana costs: --'
213 | d = sorted(self.by_cost,
214 | lambda x,y: cmp(len(self.by_cost[x]), len(self.by_cost[y])),
215 | reverse = True)
216 | rows = []
217 | for k in d[0:vsize]:
218 | rows += [[utils.from_mana(k), len(self.by_cost[k])]]
219 | printrows(padrows(rows))
220 | print '--------------------'
221 | print str(len(self.by_pt)) + ' unique p/t combinations'
222 | if len(self.by_power) > 0 and len(self.by_toughness) > 0:
223 | print ('Largest power: ' + str(max(map(len, self.by_power)) - 1) +
224 | ', largest toughness: ' + str(max(map(len, self.by_toughness)) - 1))
225 | print '-- Popular p/t values: --'
226 | d = sorted(self.by_pt,
227 | lambda x,y: cmp(len(self.by_pt[x]), len(self.by_pt[y])),
228 | reverse = True)
229 | rows = []
230 | for k in d[0:vsize]:
231 | rows += [[utils.from_unary(k), len(self.by_pt[k])]]
232 | printrows(padrows(rows))
233 | print '--------------------'
234 | print 'Loyalty values:'
235 | d = sorted(self.by_loyalty,
236 | lambda x,y: cmp(len(self.by_loyalty[x]), len(self.by_loyalty[y])),
237 | reverse = True)
238 | rows = []
239 | for k in d[0:vsize]:
240 | rows += [[utils.from_unary(k), len(self.by_loyalty[k])]]
241 | printrows(padrows(rows))
242 | print '--------------------'
243 | if len(self.by_textlen) > 0 and len(self.by_textlines) > 0:
244 | print('Card text ranges from ' + str(min(self.by_textlen)) + ' to '
245 | + str(max(self.by_textlen)) + ' characters in length')
246 | print('Card text ranges from ' + str(min(self.by_textlines)) + ' to '
247 | + str(max(self.by_textlines)) + ' lines')
248 | print '-- Line counts by frequency: --'
249 | d = sorted(self.by_textlines,
250 | lambda x,y: cmp(len(self.by_textlines[x]), len(self.by_textlines[y])),
251 | reverse = True)
252 | rows = []
253 | for k in d[0:vsize]:
254 | rows += [[k, len(self.by_textlines[k])]]
255 | printrows(padrows(rows))
256 | print '===================='
257 |
258 |
259 | # describe outliers in the indices
260 | def outliers(self, hsize = 10, vsize = 10, dump_invalid = False):
261 | print '********************'
262 | print 'Overview of indices:'
263 | rows = [['Index Name', 'Keys', 'Total Members']]
264 | for index in self.indices:
265 | rows += [[index, len(self.indices[index]), index_size(self.indices[index])]]
266 | printrows(padrows(rows))
267 | print '********************'
268 | if len(self.by_name) > 0:
269 | scardname = sorted(self.by_name,
270 | lambda x,y: cmp(len(x), len(y)),
271 | reverse = False)[0]
272 | print 'Shortest Cardname: (' + str(len(scardname)) + ')'
273 | print ' ' + scardname
274 | lcardname = sorted(self.by_name,
275 | lambda x,y: cmp(len(x), len(y)),
276 | reverse = True)[0]
277 | print 'Longest Cardname: (' + str(len(lcardname)) + ')'
278 | print ' ' + lcardname
279 | d = sorted(self.by_name,
280 | lambda x,y: cmp(len(self.by_name[x]), len(self.by_name[y])),
281 | reverse = True)
282 | rows = []
283 | for k in d[0:vsize]:
284 | if len(self.by_name[k]) > 1:
285 | rows += [[k, len(self.by_name[k])]]
286 | if rows == []:
287 | print('No duplicated cardnames')
288 | else:
289 | print '-- Most duplicated names: --'
290 | printrows(padrows(rows))
291 | else:
292 | print 'No cards indexed by name?'
293 | print '--------------------'
294 | if len(self.by_type) > 0:
295 | ltypes = sorted(self.by_type,
296 | lambda x,y: cmp(len(x), len(y)),
297 | reverse = True)[0]
298 | print 'Longest card type: (' + str(len(ltypes)) + ')'
299 | print ' ' + ltypes
300 | else:
301 | print 'No cards indexed by type?'
302 | if len(self.by_subtype) > 0:
303 | lsubtypes = sorted(self.by_subtype,
304 | lambda x,y: cmp(len(x), len(y)),
305 | reverse = True)[0]
306 | print 'Longest subtype: (' + str(len(lsubtypes)) + ')'
307 | print ' ' + lsubtypes
308 | else:
309 | print 'No cards indexed by subtype?'
310 | if len(self.by_supertype) > 0:
311 | lsupertypes = sorted(self.by_supertype,
312 | lambda x,y: cmp(len(x), len(y)),
313 | reverse = True)[0]
314 | print 'Longest supertype: (' + str(len(lsupertypes)) + ')'
315 | print ' ' + lsupertypes
316 | else:
317 | print 'No cards indexed by supertype?'
318 | print '--------------------'
319 | if len(self.by_cost) > 0:
320 | lcost = sorted(self.by_cost,
321 | lambda x,y: cmp(len(x), len(y)),
322 | reverse = True)[0]
323 | print 'Longest mana cost: (' + str(len(lcost)) + ')'
324 | print ' ' + utils.from_mana(lcost)
325 | print '\n' + plimit(self.by_cost[lcost][0].encode()) + '\n'
326 | else:
327 | print 'No cards indexed by cost?'
328 | if len(self.by_cmc) > 0:
329 | lcmc = sorted(self.by_cmc, reverse = True)[0]
330 | print 'Largest cmc: (' + str(lcmc) + ')'
331 | print ' ' + str(self.by_cmc[lcmc][0].cost)
332 | print '\n' + plimit(self.by_cmc[lcmc][0].encode())
333 | else:
334 | print 'No cards indexed by cmc?'
335 | print '--------------------'
336 | if len(self.by_power) > 0:
337 | lpower = sorted(self.by_power,
338 | lambda x,y: cmp(len(x), len(y)),
339 | reverse = True)[0]
340 | print 'Largest creature power: ' + utils.from_unary(lpower)
341 | print '\n' + plimit(self.by_power[lpower][0].encode()) + '\n'
342 | else:
343 | print 'No cards indexed by power?'
344 | if len(self.by_toughness) > 0:
345 | ltoughness = sorted(self.by_toughness,
346 | lambda x,y: cmp(len(x), len(y)),
347 | reverse = True)[0]
348 | print 'Largest creature toughness: ' + utils.from_unary(ltoughness)
349 | print '\n' + plimit(self.by_toughness[ltoughness][0].encode())
350 | else:
351 | print 'No cards indexed by toughness?'
352 | print '--------------------'
353 | if len(self.by_textlines) > 0:
354 | llines = sorted(self.by_textlines, reverse = True)[0]
355 | print 'Most lines of text in a card: ' + str(llines)
356 | print '\n' + plimit(self.by_textlines[llines][0].encode()) + '\n'
357 | else:
358 | print 'No cards indexed by line count?'
359 | if len(self.by_textlen) > 0:
360 | ltext = sorted(self.by_textlen, reverse = True)[0]
361 | print 'Most chars in a card text: ' + str(ltext)
362 | print '\n' + plimit(self.by_textlen[ltext][0].encode())
363 | else:
364 | print 'No cards indexed by char count?'
365 | print '--------------------'
366 | print 'There were ' + str(len(self.invalid_cards)) + ' invalid cards.'
367 | if dump_invalid:
368 | for card in self.invalid_cards:
369 | print '\n' + repr(card.fields)
370 | elif len(self.invalid_cards) > 0:
371 | print 'Not summarizing.'
372 | print '--------------------'
373 | print 'There were ' + str(len(self.unparsed_cards)) + ' unparsed cards.'
374 | if dump_invalid:
375 | for card in self.unparsed_cards:
376 | print '\n' + repr(card.fields)
377 | elif len(self.unparsed_cards) > 0:
378 | print 'Not summarizing.'
379 | print '===================='
380 |
--------------------------------------------------------------------------------
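As a usage sketch for Datamine (the file name is an assumption based on the repo layout; the corpus is split on utils.cardsep the same way the rest of the repo reads encoded files):

```python
# Hypothetical driver; run with lib/ on sys.path.
import utils
from datalib import Datamine

with open('data/output.txt', 'rt') as f:
    card_srcs = f.read().split(utils.cardsep)

mine = Datamine(card_srcs)
mine.summarize()        # index sizes, color/type/cmc breakdowns
mine.outliers(vsize=5)  # longest names, biggest costs, most text, etc.
```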
/lib/jdecode.py:
--------------------------------------------------------------------------------
1 | import json
2 |
3 | import utils
4 | import cardlib
5 |
6 | def mtg_open_json(fname, verbose = False):
7 |
8 | with open(fname, 'r') as f:
9 | jobj = json.load(f)
10 |
11 | allcards = {}
12 | asides = {}
13 | bsides = {}
14 |
15 | for k_set in jobj:
16 | set = jobj[k_set]
17 | setname = set['name']
18 | if 'magicCardsInfoCode' in set:
19 | codename = set['magicCardsInfoCode']
20 | else:
21 | codename = ''
22 |
23 | for card in set['cards']:
24 | card[utils.json_field_set_name] = setname
25 | card[utils.json_field_info_code] = codename
26 |
27 | cardnumber = None
28 | if 'number' in card:
29 | cardnumber = card['number']
30 |             # the lower() avoids duplication of at least one card (Will-o'-the-Wisp vs. Will-O'-The-Wisp)
31 | cardname = card['name'].lower()
32 |
33 | uid = set['code']
34 | if cardnumber == None:
35 | uid = uid + '_' + cardname + '_'
36 | else:
37 | uid = uid + '_' + cardnumber
38 |
39 | # aggregate by name to avoid duplicates, not counting bsides
40 | if not uid[-1] == 'b':
41 | if cardname in allcards:
42 | allcards[cardname] += [card]
43 | else:
44 | allcards[cardname] = [card]
45 |
46 | # also aggregate aside cards by uid so we can add bsides later
47 | if uid[-1:] == 'a':
48 | asides[uid] = card
49 | if uid[-1:] == 'b':
50 | bsides[uid] = card
51 |
52 | for uid in bsides:
53 | aside_uid = uid[:-1] + 'a'
54 | if aside_uid in asides:
55 | # the second check handles the brothers yamazaki edge case
56 | if not asides[aside_uid]['name'] == bsides[uid]['name']:
57 | asides[aside_uid][utils.json_field_bside] = bsides[uid]
58 | else:
59 | pass
60 | # this exposes some coldsnap theme deck bsides that aren't
61 | # really bsides; shouldn't matter too much
62 | #print aside_uid
63 | #print bsides[uid]
64 |
65 | if verbose:
66 | print 'Opened ' + str(len(allcards)) + ' uniquely named cards.'
67 | return allcards
68 |
69 | # filters to ignore some undesirable cards, only used when opening json
70 | def default_exclude_sets(cardset):
71 | return cardset == 'Unglued' or cardset == 'Unhinged' or cardset == 'Celebration'
72 |
73 | def default_exclude_types(cardtype):
74 | return cardtype in ['conspiracy']
75 |
76 | def default_exclude_layouts(layout):
77 | return layout in ['token', 'plane', 'scheme', 'phenomenon', 'vanguard']
78 |
79 | # centralized logic for opening files of cards, either encoded or json
80 | def mtg_open_file(fname, verbose = False,
81 | linetrans = True, fmt_ordered = cardlib.fmt_ordered_default,
82 | exclude_sets = default_exclude_sets,
83 | exclude_types = default_exclude_types,
84 | exclude_layouts = default_exclude_layouts):
85 |
86 | cards = []
87 | valid = 0
88 | skipped = 0
89 | invalid = 0
90 | unparsed = 0
91 |
92 | if fname[-5:] == '.json':
93 | if verbose:
94 | print 'This looks like a json file: ' + fname
95 | json_srcs = mtg_open_json(fname, verbose)
96 | # sorted for stability
97 | for json_cardname in sorted(json_srcs):
98 | if len(json_srcs[json_cardname]) > 0:
99 | jcards = json_srcs[json_cardname]
100 |
101 | # look for a normal rarity version, in a set we can use
102 | idx = 0
103 | card = cardlib.Card(jcards[idx], linetrans=linetrans)
104 | while (idx < len(jcards)
105 | and (card.rarity == utils.rarity_special_marker
106 | or exclude_sets(jcards[idx][utils.json_field_set_name]))):
107 | idx += 1
108 | if idx < len(jcards):
109 | card = cardlib.Card(jcards[idx], linetrans=linetrans)
110 | # if there isn't one, settle with index 0
111 | if idx >= len(jcards):
112 | idx = 0
113 | card = cardlib.Card(jcards[idx], linetrans=linetrans)
114 | # we could go back and look for a card satisfying one of the criteria,
115 | # but eh
116 |
117 | skip = False
118 | if (exclude_sets(jcards[idx][utils.json_field_set_name])
119 | or exclude_layouts(jcards[idx]['layout'])):
120 | skip = True
121 | for cardtype in card.types:
122 | if exclude_types(cardtype):
123 | skip = True
124 | if skip:
125 | skipped += 1
126 | continue
127 |
128 | if card.valid:
129 | valid += 1
130 | cards += [card]
131 | elif card.parsed:
132 | invalid += 1
133 | if verbose:
134 | print 'Invalid card: ' + json_cardname
135 | else:
136 | unparsed += 1
137 |
138 | # fall back to opening a normal encoded file
139 | else:
140 | if verbose:
141 | print 'Opening encoded card file: ' + fname
142 | with open(fname, 'rt') as f:
143 | text = f.read()
144 | for card_src in text.split(utils.cardsep):
145 | if card_src:
146 | card = cardlib.Card(card_src, fmt_ordered=fmt_ordered)
147 | # unlike opening from json, we still want to return invalid cards
148 | cards += [card]
149 | if card.valid:
150 | valid += 1
151 | elif card.parsed:
152 | invalid += 1
153 | if verbose:
154 |                         print 'Invalid card: ' + card.name
155 | else:
156 | unparsed += 1
157 |
158 | if verbose:
159 | print (str(valid) + ' valid, ' + str(skipped) + ' skipped, '
160 | + str(invalid) + ' invalid, ' + str(unparsed) + ' failed to parse.')
161 |
162 | good_count = 0
163 | bad_count = 0
164 | for card in cards:
165 | if not card.parsed and not card.text.text:
166 | bad_count += 1
167 | elif len(card.name) > 50 or len(card.rarity) > 3:
168 | bad_count += 1
169 | else:
170 | good_count += 1
171 | if good_count + bad_count > 15:
172 | break
173 | # random heuristic
174 | if bad_count > 10:
175 | print 'WARNING: Saw a bunch of unparsed cards:'
176 |         print '  If this is a legacy format, you may need to specify the field order.'
177 |
178 | return cards
179 |
--------------------------------------------------------------------------------
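Both input paths funnel through mtg_open_file, so a minimal driver looks like the following sketch (the file locations are assumptions based on the repo layout and the default in namediff.py):

```python
# Hedged sketch; run with lib/ on sys.path.
import jdecode

# json input: only valid, non-excluded cards come back
cards = jdecode.mtg_open_file('data/AllSets.json', verbose=True)

# encoded input: invalid cards are kept too, so filter if you need clean ones
encoded = jdecode.mtg_open_file('data/output.txt')
good = [card for card in encoded if card.valid]
```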
/lib/manalib.py:
--------------------------------------------------------------------------------
1 | # representation for mana costs and text with embedded mana costs
2 | # data aggregating classes
3 | import re
4 | import random
5 |
6 | import utils
7 |
8 | class Manacost:
9 | '''mana cost representation with data'''
10 |
11 | # hardcoded to be dependent on the symbol structure... ah well
12 | def get_colors(self):
13 | colors = ''
14 | for sym in self.symbols:
15 | if self.symbols[sym] > 0:
16 | symcolors = re.sub(r'2|P|S|X|C', '', sym)
17 | for symcolor in symcolors:
18 | if symcolor not in colors:
19 | colors += symcolor
20 | # sort so the order is always consistent
21 | return ''.join(sorted(colors))
22 |
23 | def check_colors(self, symbolstring):
24 | for sym in symbolstring:
25 | if not sym in self.colors:
26 | return False
27 | return True
28 |
29 | def __init__(self, src, fmt = ''):
30 | # source fields, exactly one will be set
31 | self.raw = None
32 | self.json = None
33 | # flags
34 | self.parsed = True
35 | self.valid = True
36 | self.none = False
37 | # default values for all fields
38 | self.inner = None
39 | self.cmc = 0
40 | self.colorless = 0
41 | self.sequence = []
42 | self.symbols = {sym : 0 for sym in utils.mana_syms}
43 | self.allsymbols = {sym : 0 for sym in utils.mana_symall}
44 | self.colors = ''
45 |
46 | if fmt == 'json':
47 | self.json = src
48 | text = utils.mana_translate(self.json.upper())
49 | else:
50 | self.raw = src
51 | text = self.raw
52 |
53 | if text == '':
54 | self.inner = ''
55 | self.none = True
56 |
57 | elif not (len(text) >= 2 and text[0] == '{' and text[-1] == '}'):
58 | self.parsed = False
59 | self.valid = False
60 |
61 | else:
62 | self.inner = text[1:-1]
63 |
64 | # structure mirrors the decoding in utils, but we pull out different data here
65 | idx = 0
66 | while idx < len(self.inner):
67 | # taking this branch is an infinite loop if unary_marker is empty
68 | if (len(utils.mana_unary_marker) > 0 and
69 | self.inner[idx:idx+len(utils.mana_unary_marker)] == utils.mana_unary_marker):
70 | idx += len(utils.mana_unary_marker)
71 | self.sequence += [utils.mana_unary_marker]
72 | elif self.inner[idx:idx+len(utils.mana_unary_counter)] == utils.mana_unary_counter:
73 | idx += len(utils.mana_unary_counter)
74 | self.sequence += [utils.mana_unary_counter]
75 | self.colorless += 1
76 | self.cmc += 1
77 | else:
78 | old_idx = idx
79 | for symlen in range(utils.mana_symlen_min, utils.mana_symlen_max + 1):
80 | encoded_sym = self.inner[idx:idx+symlen]
81 | if encoded_sym in utils.mana_symall_decode:
82 | idx += symlen
83 | # leave the sequence encoded for convenience
84 | self.sequence += [encoded_sym]
85 | sym = utils.mana_symall_decode[encoded_sym]
86 | self.allsymbols[sym] += 1
87 | if sym in utils.mana_symalt:
88 | self.symbols[utils.mana_alt(sym)] += 1
89 | else:
90 | self.symbols[sym] += 1
91 | if sym == utils.mana_X:
92 | self.cmc += 0
93 | elif utils.mana_2 in sym:
94 | self.cmc += 2
95 | else:
96 | self.cmc += 1
97 | break
98 | # otherwise we'll go into an infinite loop if we see a symbol we don't know
99 | if idx == old_idx:
100 | idx += 1
101 | self.valid = False
102 |
103 | self.colors = self.get_colors()
104 |
105 | def __str__(self):
106 | if self.none:
107 | return '_NOCOST_'
108 | return utils.mana_untranslate(utils.mana_open_delimiter + ''.join(self.sequence)
109 | + utils.mana_close_delimiter)
110 |
111 | def format(self, for_forum = False, for_html = False):
112 | if self.none:
113 | return '_NOCOST_'
114 |
115 | else:
116 | return utils.mana_untranslate(utils.mana_open_delimiter + ''.join(self.sequence)
117 | + utils.mana_close_delimiter, for_forum, for_html)
118 |
119 | def encode(self, randomize = False):
120 | if self.none:
121 | return ''
122 | elif randomize:
123 | # so this won't work very well if mana_unary_marker isn't empty
124 | return (utils.mana_open_delimiter
125 | + ''.join(random.sample(self.sequence, len(self.sequence)))
126 | + utils.mana_close_delimiter)
127 | else:
128 | return utils.mana_open_delimiter + ''.join(self.sequence) + utils.mana_close_delimiter
129 |
130 | def vectorize(self, delimit = False):
131 | if self.none:
132 | return ''
133 | elif delimit:
134 | ld = '('
135 | rd = ')'
136 | else:
137 | ld = ''
138 | rd = ''
139 | return ' '.join(map(lambda s: ld + s + rd, sorted(self.sequence)))
140 |
141 |
142 | class Manatext:
143 | '''text representation with embedded mana costs'''
144 |
145 | def __init__(self, src, fmt = ''):
146 | # source fields
147 | self.raw = None
148 | self.json = None
149 | # flags
150 | self.valid = True
151 | # default values for all fields
152 | self.text = src
153 | self.costs = []
154 |
155 | if fmt == 'json':
156 | self.json = src
157 | manastrs = re.findall(utils.mana_json_regex, src)
158 | else:
159 | self.raw = src
160 | manastrs = re.findall(utils.mana_regex, src)
161 |
162 | for manastr in manastrs:
163 | cost = Manacost(manastr, fmt)
164 | if not cost.valid:
165 | self.valid = False
166 | self.costs += [cost]
167 | self.text = self.text.replace(manastr, utils.reserved_mana_marker, 1)
168 |
169 | if (utils.mana_open_delimiter in self.text
170 | or utils.mana_close_delimiter in self.text
171 | or utils.mana_json_open_delimiter in self.text
172 | or utils.mana_json_close_delimiter in self.text):
173 | self.valid = False
174 |
175 | def __str__(self):
176 | text = self.text
177 | for cost in self.costs:
178 | text = text.replace(utils.reserved_mana_marker, str(cost), 1)
179 | return text
180 |
181 | def format(self, for_forum = False, for_html = False):
182 | text = self.text
183 | for cost in self.costs:
184 | text = text.replace(utils.reserved_mana_marker, cost.format(for_forum=for_forum, for_html=for_html), 1)
185 | if for_html:
186 |             text = text.replace('\n', '<br>\n')
187 | return text
188 |
189 | def encode(self, randomize = False):
190 | text = self.text
191 | for cost in self.costs:
192 | text = text.replace(utils.reserved_mana_marker, cost.encode(randomize = randomize), 1)
193 | return text
194 |
195 | def vectorize(self):
196 | text = self.text
197 | special_chars = [utils.reserved_mana_marker,
198 | utils.dash_marker,
199 | utils.bullet_marker,
200 | utils.this_marker,
201 | utils.counter_marker,
202 | utils.choice_open_delimiter,
203 | utils.choice_close_delimiter,
204 | utils.newline,
205 | #utils.x_marker,
206 | utils.tap_marker,
207 | utils.untap_marker,
209 | ';', ':', '"', ',', '.']
210 | for char in special_chars:
211 | text = text.replace(char, ' ' + char + ' ')
212 | text = text.replace('/', '/ /')
213 | for cost in self.costs:
214 | text = text.replace(utils.reserved_mana_marker, cost.vectorize(), 1)
215 | return ' '.join(text.split())
216 |
--------------------------------------------------------------------------------
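To make the Manacost parsing loop concrete, here is a hedged round trip. The exact encoded string is guesswork about the symbol set defined in utils (assuming '^' is the mana unary counter and 'W' is a one-character encoded white symbol), so treat '{^^WW}' as illustrative only:

```python
# Hypothetical round trip; run with lib/ on sys.path.
from manalib import Manacost

cost = Manacost('{^^WW}')
if cost.valid:
    print(cost.cmc)       # 4: two unary counters plus two colored symbols
    print(cost.colors)    # 'W' (sorted and deduplicated by get_colors)
    print(cost.encode())  # '{^^WW}' again; encode() preserves the sequence
```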
/lib/namediff.py:
--------------------------------------------------------------------------------
1 | # This module is misleadingly named, as it has other utilities as well
2 | # that are generally necessary when trying to postprocess output by
3 | # comparing it against existing cards.
4 |
5 | import difflib
6 | import os
7 | import multiprocessing
8 |
9 | import utils
10 | import jdecode
11 | import cardlib
12 |
13 | libdir = os.path.dirname(os.path.realpath(__file__))
14 | datadir = os.path.realpath(os.path.join(libdir, '../data'))
15 |
16 | # multithreading control parameters
17 | cores = multiprocessing.cpu_count()
18 |
19 | # split a list into n pieces; return a list of these lists
20 | # has slightly interesting behavior, in that if n is large, it can
21 | # run out of elements early and return fewer than n lists
22 | def list_split(l, n):
23 | if n <= 0:
24 | return l
25 | split_size = len(l) / n
26 | if len(l) % n > 0:
27 | split_size += 1
28 | return [l[i:i+split_size] for i in range(0, len(l), split_size)]
29 |
30 | # flatten a list of lists into a single list of all their contents, in order
31 | def list_flatten(l):
32 | return [item for sublist in l for item in sublist]
33 |
34 |
35 | # isolated logic for multiprocessing
36 | def f_nearest(name, matchers, n):
37 | for m in matchers:
38 | m.set_seq1(name)
39 | ratios = [(m.ratio(), m.b) for m in matchers]
40 | ratios.sort(reverse = True)
41 |
42 | if ratios[0][0] >= 1:
43 | return ratios[:1]
44 | else:
45 | return ratios[:n]
46 |
47 | def f_nearest_per_thread(workitem):
48 | (worknames, names, n) = workitem
49 | # each thread (well, process) needs to generate its own matchers
50 | matchers = [difflib.SequenceMatcher(b=name, autojunk=False) for name in names]
51 | return map(lambda name: f_nearest(name, matchers, n), worknames)
52 |
53 | class Namediff:
54 | def __init__(self, verbose = True,
55 | json_fname = os.path.join(datadir, 'AllSets.json')):
56 | self.verbose = verbose
57 | self.names = {}
58 | self.codes = {}
59 | self.cardstrings = {}
60 |
61 | if self.verbose:
62 | print 'Setting up namediff...'
63 |
64 | if self.verbose:
65 | print ' Reading names from: ' + json_fname
66 | json_srcs = jdecode.mtg_open_json(json_fname, verbose)
67 | namecount = 0
68 | for json_cardname in sorted(json_srcs):
69 | if len(json_srcs[json_cardname]) > 0:
70 | jcards = json_srcs[json_cardname]
71 |
72 | # just use the first one
73 | idx = 0
74 | card = cardlib.Card(jcards[idx])
75 | name = card.name
76 | jname = jcards[idx]['name']
77 | jcode = jcards[idx][utils.json_field_info_code]
78 | if 'number' in jcards[idx]:
79 | jnum = jcards[idx]['number']
80 | else:
81 | jnum = ''
82 |
83 | if name in self.names:
84 | print ' Duplicate name ' + name + ', ignoring.'
85 | else:
86 | self.names[name] = jname
87 | self.cardstrings[name] = card.encode()
88 | if jcode and jnum:
89 | self.codes[name] = jcode + '/' + jnum + '.jpg'
90 | else:
91 | self.codes[name] = ''
92 | namecount += 1
93 |
94 | print ' Read ' + str(namecount) + ' unique cardnames'
95 | print ' Building SequenceMatcher objects.'
96 |
97 | self.matchers = [difflib.SequenceMatcher(b=n, autojunk=False) for n in self.names]
98 | self.card_matchers = [difflib.SequenceMatcher(b=self.cardstrings[n], autojunk=False) for n in self.cardstrings]
99 |
100 | print '... Done.'
101 |
102 | def nearest(self, name, n=3):
103 | return f_nearest(name, self.matchers, n)
104 |
105 | def nearest_par(self, names, n=3, threads=cores):
106 | workpool = multiprocessing.Pool(threads)
107 | proto_worklist = list_split(names, threads)
108 | worklist = map(lambda x: (x, self.names, n), proto_worklist)
109 | donelist = workpool.map(f_nearest_per_thread, worklist)
110 | return list_flatten(donelist)
111 |
112 | def nearest_card(self, card, n=5):
113 | return f_nearest(card.encode(), self.card_matchers, n)
114 |
115 | def nearest_card_par(self, cards, n=5, threads=cores):
116 | workpool = multiprocessing.Pool(threads)
117 | proto_worklist = list_split(cards, threads)
118 | worklist = map(lambda x: (map(lambda c: c.encode(), x), self.cardstrings.values(), n), proto_worklist)
119 | donelist = workpool.map(f_nearest_per_thread, worklist)
120 | return list_flatten(donelist)
121 |
--------------------------------------------------------------------------------
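Typical use looks like the sketch below. Construction reads all of data/AllSets.json (the default above), so it takes a while, and remember that card names in this codebase are lowercased; the misspelled query string is made up for illustration:

```python
# Hedged usage sketch; run with lib/ on sys.path.
from namediff import Namediff

nd = Namediff(verbose=False)
for ratio, existing in nd.nearest('shivan dragonn', n=3):
    print(str(ratio) + ' ' + existing)  # best fuzzy matches, highest first
```

The nearest_par variants apply the same list_split / worker-map / list_flatten pattern shown above to spread the SequenceMatcher work across cores.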
/lib/nltk_model.py:
--------------------------------------------------------------------------------
1 | # Natural Language Toolkit: Language Models
2 | #
3 | # Copyright (C) 2001-2014 NLTK Project
4 | # Authors: Steven Bird
5 | # Daniel Blanchard
6 | # Ilia Kurenkov
7 | # URL: <http://nltk.org/>
8 | # For license information, see LICENSE.TXT
9 | #
10 | # adapted for mtgencode Nov. 2015
11 | # an attempt was made to preserve the exact functionality of this code,
12 | # hampered somewhat by its brokenness
13 |
14 | from __future__ import unicode_literals
15 |
16 | from math import log
17 |
18 | from nltk.probability import ConditionalProbDist, ConditionalFreqDist, LidstoneProbDist
19 | from nltk.util import ngrams
20 | from nltk_model_api import ModelI
21 |
22 | from nltk import compat
23 |
24 |
25 | def _estimator(fdist, **estimator_kwargs):
26 | """
27 | Default estimator function using a LidstoneProbDist.
28 | """
29 | # can't be an instance method of NgramModel as they
30 | # can't be pickled either.
31 | return LidstoneProbDist(fdist, 0.001, **estimator_kwargs)
32 |
33 |
34 | @compat.python_2_unicode_compatible
35 | class NgramModel(ModelI):
36 | """
37 | A processing interface for assigning a probability to the next word.
38 | """
39 |
40 | def __init__(self, n, train, pad_left=True, pad_right=False,
41 | estimator=None, **estimator_kwargs):
42 | """
43 | Create an ngram language model to capture patterns in n consecutive
44 | words of training text. An estimator smooths the probabilities derived
45 | from the text and may allow generation of ngrams not seen during
46 | training. See model.doctest for more detailed testing
47 |
48 | >>> from nltk.corpus import brown
49 | >>> lm = NgramModel(3, brown.words(categories='news'))
50 | >>> lm
51 |         <NgramModel with 91603 3-grams>
52 | >>> lm._backoff
53 |         <NgramModel with 62888 2-grams>
54 | >>> lm.entropy(brown.words(categories='humor'))
55 | ... # doctest: +ELLIPSIS
56 | 12.0399...
57 |
58 | :param n: the order of the language model (ngram size)
59 | :type n: int
60 | :param train: the training text
61 | :type train: list(str) or list(list(str))
62 | :param pad_left: whether to pad the left of each sentence with an (n-1)-gram of empty strings
63 | :type pad_left: bool
64 | :param pad_right: whether to pad the right of each sentence with an (n-1)-gram of empty strings
65 | :type pad_right: bool
66 | :param estimator: a function for generating a probability distribution
67 | :type estimator: a function that takes a ConditionalFreqDist and
68 | returns a ConditionalProbDist
69 | :param estimator_kwargs: Extra keyword arguments for the estimator
70 | :type estimator_kwargs: (any)
71 | """
72 |
73 | # protection from cryptic behavior for calling programs
74 | # that use the pre-2.0.2 interface
75 | assert(isinstance(pad_left, bool))
76 | assert(isinstance(pad_right, bool))
77 |
78 | self._lpad = ('',) * (n - 1) if pad_left else ()
79 | self._rpad = ('',) * (n - 1) if pad_right else ()
80 |
81 | # make sure n is greater than zero, otherwise print it
82 | assert (n > 0), n
83 |
84 | # For explicitness save the check whether this is a unigram model
85 | self.is_unigram_model = (n == 1)
86 | # save the ngram order number
87 | self._n = n
88 | # save left and right padding
89 | self._lpad = ('',) * (n - 1) if pad_left else ()
90 | self._rpad = ('',) * (n - 1) if pad_right else ()
91 |
92 | if estimator is None:
93 | estimator = _estimator
94 |
95 | cfd = ConditionalFreqDist()
96 |
97 | # set read-only ngrams set (see property declaration below to reconfigure)
98 | self._ngrams = set()
99 |
100 | # If given a list of strings instead of a list of lists, create enclosing list
101 | if (train is not None) and isinstance(train[0], compat.string_types):
102 | train = [train]
103 |
104 | # we need to keep track of the number of word types we encounter
105 | vocabulary = set()
106 | for sent in train:
107 | raw_ngrams = ngrams(sent, n, pad_left, pad_right, pad_symbol='')
108 | for ngram in raw_ngrams:
109 | self._ngrams.add(ngram)
110 | context = tuple(ngram[:-1])
111 | token = ngram[-1]
112 | cfd[context][token] += 1
113 | vocabulary.add(token)
114 |
115 | # Unless number of bins is explicitly passed, we should use the number
116 | # of word types encountered during training as the bins value.
117 | # If right padding is on, this includes the padding symbol.
118 | if 'bins' not in estimator_kwargs:
119 | estimator_kwargs['bins'] = len(vocabulary)
120 |
121 | self._model = ConditionalProbDist(cfd, estimator, **estimator_kwargs)
122 |
123 | # recursively construct the lower-order models
124 | if not self.is_unigram_model:
125 | self._backoff = NgramModel(n-1, train,
126 | pad_left, pad_right,
127 | estimator,
128 | **estimator_kwargs)
129 |
130 | self._backoff_alphas = dict()
131 | # For each condition (or context)
132 | for ctxt in cfd.conditions():
133 | backoff_ctxt = ctxt[1:]
134 | backoff_total_pr = 0.0
135 | total_observed_pr = 0.0
136 |
137 | # this is the subset of words that we OBSERVED following
138 | # this context.
139 | # i.e. Count(word | context) > 0
140 | for words in self._words_following(ctxt, cfd):
141 |
142 | # so, _words_following as fixed gives back a whole list now...
143 | for word in words:
144 |
145 | total_observed_pr += self.prob(word, ctxt)
146 | # we also need the total (n-1)-gram probability of
147 | # words observed in this n-gram context
148 | backoff_total_pr += self._backoff.prob(word, backoff_ctxt)
149 |
150 | assert (0 <= total_observed_pr <= 1), total_observed_pr
151 | # beta is the remaining probability weight after we factor out
152 | # the probability of observed words.
153 | # As a sanity check, both total_observed_pr and backoff_total_pr
154 | # must be GE 0, since probabilities are never negative
155 | beta = 1.0 - total_observed_pr
156 |
157 | # backoff total has to be less than one, otherwise we get
158 | # an error when we try subtracting it from 1 in the denominator
159 | assert (0 <= backoff_total_pr < 1), backoff_total_pr
160 | alpha_ctxt = beta / (1.0 - backoff_total_pr)
161 |
162 | self._backoff_alphas[ctxt] = alpha_ctxt
163 |
164 | # broken
165 | # def _words_following(self, context, cond_freq_dist):
166 | # for ctxt, word in cond_freq_dist.iterkeys():
167 | # if ctxt == context:
168 | # yield word
169 |
170 | # fixed
171 | def _words_following(self, context, cond_freq_dist):
172 | for ctxt in cond_freq_dist.iterkeys():
173 | if ctxt == context:
174 | yield cond_freq_dist[ctxt].keys()
175 |
176 | def prob(self, word, context):
177 | """
178 | Evaluate the probability of this word in this context using Katz Backoff.
179 |
180 | :param word: the word to get the probability of
181 | :type word: str
182 | :param context: the context the word is in
183 | :type context: list(str)
184 | """
185 | context = tuple(context)
186 | if (context + (word,) in self._ngrams) or (self.is_unigram_model):
187 | return self._model[context].prob(word)
188 | else:
189 | return self._alpha(context) * self._backoff.prob(word, context[1:])
190 |
191 | def _alpha(self, context):
192 | """Get the backoff alpha value for the given context
193 | """
194 | error_message = "Alphas and backoff are not defined for unigram models"
195 | assert not self.is_unigram_model, error_message
196 |
197 | if context in self._backoff_alphas:
198 | return self._backoff_alphas[context]
199 | else:
200 | return 1
201 |
202 | def logprob(self, word, context):
203 | """
204 | Evaluate the (negative) log probability of this word in this context.
205 |
206 | :param word: the word to get the probability of
207 | :type word: str
208 | :param context: the context the word is in
209 | :type context: list(str)
210 | """
211 | return -log(self.prob(word, context), 2)
212 |
213 | @property
214 | def ngrams(self):
215 | return self._ngrams
216 |
217 | @property
218 | def backoff(self):
219 | return self._backoff
220 |
221 | @property
222 | def model(self):
223 | return self._model
224 |
225 | def choose_random_word(self, context):
226 | '''
227 | Randomly select a word that is likely to appear in this context.
228 |
229 | :param context: the context the word is in
230 | :type context: list(str)
231 | '''
232 |
233 | return self.generate(1, context)[-1]
234 |
235 |     # NB, this will always start with the same word if the model
236 | # was trained on a single text
237 | def generate(self, num_words, context=()):
238 | '''
239 | Generate random text based on the language model.
240 |
241 | :param num_words: number of words to generate
242 | :type num_words: int
243 | :param context: initial words in generated string
244 | :type context: list(str)
245 | '''
246 |
247 | text = list(context)
248 | for i in range(num_words):
249 | text.append(self._generate_one(text))
250 | return text
251 |
252 | def _generate_one(self, context):
253 | context = (self._lpad + tuple(context))[-self._n + 1:]
254 | if context in self:
255 | return self[context].generate()
256 | elif self._n > 1:
257 | return self._backoff._generate_one(context[1:])
258 | else:
259 | return '.'
260 |
261 | def entropy(self, text):
262 | """
263 | Calculate the approximate cross-entropy of the n-gram model for a
264 | given evaluation text.
265 | This is the average log probability of each word in the text.
266 |
267 | :param text: words to use for evaluation
268 | :type text: list(str)
269 | """
270 |
271 | H = 0.0 # entropy is conventionally denoted by "H"
272 | text = list(self._lpad) + text + list(self._rpad)
273 | for i in range(self._n - 1, len(text)):
274 | context = tuple(text[(i - self._n + 1):i])
275 | token = text[i]
276 | H += self.logprob(token, context)
277 | return H / float(len(text) - (self._n - 1))
278 |
279 | def perplexity(self, text):
280 | """
281 | Calculates the perplexity of the given text.
282 | This is simply 2 ** cross-entropy for the text.
283 |
284 | :param text: words to calculate perplexity of
285 | :type text: list(str)
286 | """
287 |
288 | return pow(2.0, self.entropy(text))
289 |
290 | def __contains__(self, item):
291 | if not isinstance(item, tuple):
292 | item = (item,)
293 | return item in self._model
294 |
295 | def __getitem__(self, item):
296 | if not isinstance(item, tuple):
297 | item = (item,)
298 | return self._model[item]
299 |
300 | def __repr__(self):
301 |         return '<NgramModel with %d %d-grams>' % (len(self._ngrams), self._n)
302 |
303 | if __name__ == "__main__":
304 | import doctest
305 | doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)
306 |
--------------------------------------------------------------------------------
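Since the Katz backoff logic is the subtle part of this module, a toy run may help. This is a sketch that assumes the Python 2 era nltk this code was adapted against is importable (e.g. ConditionalFreqDist still has iterkeys):

```python
# Toy bigram model over a three-sentence corpus; sketch only.
from nltk_model import NgramModel

corpus = [['the', 'cat', 'sat'], ['the', 'cat', 'ran'], ['the', 'dog', 'sat']]
lm = NgramModel(2, corpus)

# seen bigram: straight smoothed estimate from the conditional distribution
print(lm.prob('cat', ['the']))
# unseen bigram: Katz backoff, i.e.
#   alpha('the') * P_unigram('sat'), where
#   alpha(ctxt) = (1 - sum of observed P(w|ctxt)) / (1 - sum of backoff P(w))
print(lm.prob('sat', ['the']))
print(lm.perplexity(['the', 'cat', 'sat']))
```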
/lib/nltk_model_api.py:
--------------------------------------------------------------------------------
1 | # Natural Language Toolkit: API for Language Models
2 | #
3 | # Copyright (C) 2001-2014 NLTK Project
4 | # Author: Steven Bird
5 | # URL: <http://nltk.org/>
6 | # For license information, see LICENSE.TXT
7 | #
8 | # imported for use in mtgcode Nov. 2015
9 |
10 |
11 | # should this be a subclass of ConditionalProbDistI?
12 |
13 | class ModelI(object):
14 | """
15 | A processing interface for assigning a probability to the next word.
16 | """
17 |
18 | def __init__(self):
19 | '''Create a new language model.'''
20 | raise NotImplementedError()
21 |
22 | def prob(self, word, context):
23 | '''Evaluate the probability of this word in this context.'''
24 | raise NotImplementedError()
25 |
26 | def logprob(self, word, context):
27 | '''Evaluate the (negative) log probability of this word in this context.'''
28 | raise NotImplementedError()
29 |
30 | def choose_random_word(self, context):
31 | '''Randomly select a word that is likely to appear in this context.'''
32 | raise NotImplementedError()
33 |
34 | def generate(self, n):
35 | '''Generate n words of text from the language model.'''
36 | raise NotImplementedError()
37 |
38 | def entropy(self, text):
39 | '''Evaluate the total entropy of a message with respect to the model.
40 | This is the sum of the log probability of each word in the message.'''
41 | raise NotImplementedError()
42 |
43 |
--------------------------------------------------------------------------------
/lib/transforms.py:
--------------------------------------------------------------------------------
1 | # transform passes used to encode / decode cards
2 | import re
3 | import random
4 |
5 | # These could probably use a little love... They tend to hardcode in lots
6 | # of things very specific to the mtgjson format.
7 |
8 | import utils
9 |
10 | cardsep = utils.cardsep
11 | fieldsep = utils.fieldsep
12 | bsidesep = utils.bsidesep
13 | newline = utils.newline
14 | dash_marker = utils.dash_marker
15 | bullet_marker = utils.bullet_marker
16 | this_marker = utils.this_marker
17 | counter_marker = utils.counter_marker
18 | reserved_marker = utils.reserved_marker
19 | choice_open_delimiter = utils.choice_open_delimiter
20 | choice_close_delimiter = utils.choice_close_delimiter
21 | x_marker = utils.x_marker
22 | tap_marker = utils.tap_marker
23 | untap_marker = utils.untap_marker
24 | counter_rename = utils.counter_rename
25 | unary_marker = utils.unary_marker
26 | unary_counter = utils.unary_counter
27 |
28 |
29 | # Name Passes.
30 |
31 |
32 | def name_pass_1_sanitize(s):
33 | s = s.replace('!', '')
34 | s = s.replace('?', '')
35 | s = s.replace('-', dash_marker)
36 | s = s.replace('100,000', 'one hundred thousand')
37 | s = s.replace('1,000', 'one thousand')
38 | s = s.replace('1996', 'nineteen ninety-six')
39 | return s
40 |
41 |
42 | # Name unpasses.
43 |
44 |
45 | # particularly helpful if you want to call text_unpass_8_unicode later
46 | # and NOT have it stick unicode long dashes into names.
47 | def name_unpass_1_dashes(s):
48 | return s.replace(dash_marker, '-')
49 |
50 |
51 | # Text Passes.
52 |
53 |
54 | def text_pass_1_strip_rt(s):
55 | return re.sub(r'\(.*\)', '', s)
56 |
57 |
58 | def text_pass_2_cardname(s, name):
59 | # Here are some fun edge cases, thanks to jml34 on the forum for
60 | # pointing them out.
61 | if name == 'sacrifice':
62 | s = s.replace(name, this_marker, 1)
63 | return s
64 | elif name == 'fear':
65 | return s
66 |
67 | s = s.replace(name, this_marker)
68 |
69 | # So, some legends don't use the full cardname in their text box...
70 | # this check finds about 400 of them.
71 | nameparts = name.split(',')
72 | if len(nameparts) > 1:
73 | mininame = nameparts[0]
74 | new_s = s.replace(mininame, this_marker)
75 | if not new_s == s:
76 | s = new_s
77 |
78 | # A few others don't have a convenient comma to detect their nicknames,
79 | # so we override them here.
80 | overrides = [
81 | # detectable by splitting on 'the', though that might cause other issues
82 | 'crovax',
83 | 'rashka',
84 | 'phage',
85 | 'shimatsu',
86 | # random and arbitrary: they have a last name, 1996 world champion, etc.
87 | 'world champion',
88 | 'axelrod',
89 | 'hazezon',
90 | 'rubinia',
91 | 'rasputin',
92 | 'hivis',
93 | ]
94 |
95 | for override in overrides:
96 | s = s.replace(override, this_marker)
97 |
98 | # stupid planeswalker abilities
99 | s = s.replace('to him.', 'to ' + this_marker + '.')
100 | s = s.replace('to him this', 'to ' + this_marker + ' this')
101 | s = s.replace('to himself', 'to itself')
102 | s = s.replace("he's", this_marker + ' is')
103 |
104 | # sometimes we actually don't want to do this replacement
105 | s = s.replace('named ' + this_marker, 'named ' + name)
106 | s = s.replace('name is still ' + this_marker, 'name is still ' + name)
107 | s = s.replace('named keeper of ' + this_marker, 'named keeper of ' + name)
108 | s = s.replace('named kobolds of ' + this_marker, 'named kobolds of ' + name)
109 | s = s.replace('named sword of kaldra, ' + this_marker, 'named sword of kaldra, ' + name)
110 |
111 | return s
112 |
113 |
114 | def text_pass_3_unary(s):
115 | return utils.to_unary(s)
116 |
117 |
118 | # Run only after doing unary conversion.
119 | def text_pass_4a_dashes(s):
120 | s = s.replace('-' + unary_marker, reserved_marker)
121 | s = s.replace('-', dash_marker)
122 | s = s.replace(reserved_marker, '-' + unary_marker)
123 |
124 | # level up is annoying
125 | levels = re.findall(r'level &\^*\-&', s)
126 | for level in levels:
127 | newlevel = level.replace('-', dash_marker)
128 | s = s.replace(level, newlevel)
129 |
130 | levels = re.findall(r'level &\^*\+', s)
131 | for level in levels:
132 | newlevel = level.replace('+', dash_marker)
133 | s = s.replace(level, newlevel)
134 |
135 | # and we still have the ~x issue
136 | return s
137 |
138 |
139 | # Run this after fixing dashes, because this unbreaks the ~x issue.
140 | # Also probably don't run this on names, there are a few names with x~ in them.
141 | def text_pass_4b_x(s):
142 | s = s.replace(dash_marker + 'x', '-' + x_marker)
143 | s = s.replace('+x', '+' + x_marker)
144 | s = s.replace(' x ', ' ' + x_marker + ' ')
145 | s = s.replace('x:', x_marker + ':')
146 | s = s.replace('x~', x_marker + '~')
147 | s = s.replace(u'x\u2014', x_marker + u'\u2014')
148 | s = s.replace('x.', x_marker + '.')
149 | s = s.replace('x,', x_marker + ',')
150 | s = s.replace('x is', x_marker + ' is')
151 | s = s.replace('x can\'t', x_marker + ' can\'t')
152 | s = s.replace('x/x', x_marker + '/' + x_marker)
153 | s = s.replace('x target', x_marker + ' target')
154 | s = s.replace('si' + x_marker + ' target', 'six target')
155 | s = s.replace('avara' + x_marker, 'avarax')
156 | # there's also some stupid ice age card that wants -x/-y
157 | s = s.replace('/~', '/-')
158 | return s
159 |
160 |
161 | # Call this before replacing newlines.
162 | # This one ends up being really bad because of the confusion
163 | # with 'counter target spell or ability'.
164 | def text_pass_5_counters(s):
165 | # so, big fat old dictionary time!!!!!!!!!
166 | allcounters = [
167 | 'time counter',
168 | 'devotion counter',
169 | 'charge counter',
170 | 'ki counter',
171 | 'matrix counter',
172 | 'spore counter',
173 | 'poison counter',
174 | 'quest counter',
175 | 'hatchling counter',
176 | 'storage counter',
177 | 'growth counter',
178 | 'paralyzation counter',
179 | 'energy counter',
180 | 'study counter',
181 | 'glyph counter',
182 | 'depletion counter',
183 | 'sleight counter',
184 | 'loyalty counter',
185 | 'hoofprint counter',
186 | 'wage counter',
187 | 'echo counter',
188 | 'lore counter',
189 | 'page counter',
190 | 'divinity counter',
191 | 'mannequin counter',
192 | 'ice counter',
193 | 'fade counter',
194 | 'pain counter',
195 | #'age counter',
196 | 'gold counter',
197 | 'muster counter',
198 | 'infection counter',
199 | 'plague counter',
200 | 'fate counter',
201 | 'slime counter',
202 | 'shell counter',
203 | 'credit counter',
204 | 'despair counter',
205 | 'globe counter',
206 | 'currency counter',
207 | 'blood counter',
208 | 'soot counter',
209 | 'carrion counter',
210 | 'fuse counter',
211 | 'filibuster counter',
212 | 'wind counter',
213 | 'hourglass counter',
214 | 'trap counter',
215 | 'corpse counter',
216 | 'awakening counter',
217 | 'verse counter',
218 | 'scream counter',
219 | 'doom counter',
220 | 'luck counter',
221 | 'intervention counter',
222 | 'eyeball counter',
223 | 'flood counter',
224 | 'eon counter',
225 | 'death counter',
226 | 'delay counter',
227 | 'blaze counter',
228 | 'magnet counter',
229 | 'feather counter',
230 | 'shield counter',
231 | 'wish counter',
232 | 'petal counter',
233 | 'music counter',
234 | 'pressure counter',
235 | 'manifestation counter',
236 | #'net counter',
237 | 'velocity counter',
238 | 'vitality counter',
239 | 'treasure counter',
240 | 'pin counter',
241 | 'bounty counter',
242 | 'rust counter',
243 | 'mire counter',
244 | 'tower counter',
245 | #'ore counter',
246 | 'cube counter',
247 | 'strife counter',
248 | 'elixir counter',
249 | 'hunger counter',
250 | 'level counter',
251 | 'winch counter',
252 | 'fungus counter',
253 | 'training counter',
254 | 'theft counter',
255 | 'arrowhead counter',
256 | 'sleep counter',
257 | 'healing counter',
258 | 'mining counter',
259 | 'dream counter',
260 | 'aim counter',
261 | 'arrow counter',
262 | 'javelin counter',
263 | 'gem counter',
264 | 'bribery counter',
265 | 'mine counter',
266 | 'omen counter',
267 | 'phylactery counter',
268 | 'tide counter',
269 | 'polyp counter',
270 | 'petrification counter',
271 | 'shred counter',
272 | 'pupa counter',
273 | 'crystal counter',
274 | ]
275 | usedcounters = []
276 | for countername in allcounters:
277 | if countername in s:
278 | usedcounters += [countername]
279 | s = s.replace(countername, counter_marker + ' counter')
280 |
281 | # oh god some of the counter names are suffixes of others...
282 | shortcounters = [
283 | 'age counter',
284 | 'net counter',
285 | 'ore counter',
286 | ]
287 | for countername in shortcounters:
288 | # SUPER HACKY fix for doubling season
289 | if countername in s and 'more counter' not in s:
290 | usedcounters += [countername]
291 | s = s.replace(countername, counter_marker + ' counter')
292 |
293 | # miraculously this doesn't seem to happen
294 | # if len(usedcounters) > 1:
295 | # print usedcounters
296 |
297 | # we haven't done newline replacement yet, so use actual newlines
298 | if len(usedcounters) == 1:
299 | # and yeah, this line of code can blow up in all kinds of different ways
300 | s = 'countertype ' + counter_marker + ' ' + usedcounters[0].split()[0] + '\n' + s
301 |
302 | return s
303 |
304 |
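Concretely: when exactly one counter type shows up, the pass rewrites the type name into counter_marker ('%' in config.py) and records it on a prepended countertype line. A hedged before/after, with a made-up rules text:

```python
# Sketch; run with lib/ on sys.path.
from transforms import text_pass_5_counters

before = 'at the beginning of your upkeep, put a charge counter on @.'
print(text_pass_5_counters(before))
# countertype % charge
# at the beginning of your upkeep, put a % counter on @.
```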
305 | # The word 'counter' is confusing when used to refer to what we do to spells
306 | # and sometimes abilities to make them not happen. Let's rename that.
307 | # Call this after doing the counter replacement to simplify the regexes.
308 | counter_rename = 'uncast'
309 | def text_pass_6_uncast(s):
310 | # pre-checks to make sure we aren't doing anything dumb
311 | # if '% counter target ' in s or '^ counter target ' in s or '& counter target ' in s:
312 | # print s + '\n'
313 | # if '% counter a ' in s or '^ counter a ' in s or '& counter a ' in s:
314 | # print s + '\n'
315 | # if '% counter all ' in s or '^ counter all ' in s or '& counter all ' in s:
316 | # print s + '\n'
317 | # if '% counter a ' in s or '^ counter a ' in s or '& counter a ' in s:
318 | # print s + '\n'
319 | # if '% counter that ' in s or '^ counter that ' in s or '& counter that ' in s:
320 | # print s + '\n'
321 | # if '% counter @' in s or '^ counter @' in s or '& counter @' in s:
322 | # print s + '\n'
323 | # if '% counter the ' in s or '^ counter the ' in s or '& counter the ' in s:
324 | # print s + '\n'
325 |
326 | # counter target
327 | s = s.replace('counter target ', counter_rename + ' target ')
328 | # counter a
329 | s = s.replace('counter a ', counter_rename + ' a ')
330 | # counter all
331 | s = s.replace('counter all ', counter_rename + ' all ')
332 | # counters a
333 | s = s.replace('counters a ', counter_rename + 's a ')
334 |     # countered (this could get weird in terms of englishing the word; let's just go for hilarious)
335 | s = s.replace('countered', counter_rename + 'ed')
336 | # counter that
337 | s = s.replace('counter that ', counter_rename + ' that ')
338 | # counter @
339 | s = s.replace('counter @', counter_rename + ' @')
340 |     # counter it (this is tricky)
341 | s = s.replace(', counter it', ', ' + counter_rename + ' it')
342 | # counter the (it happens at least once, thanks wizards!)
343 | s = s.replace('counter the ', counter_rename + ' the ')
344 | # counter up to
345 | s = s.replace('counter up to ', counter_rename + ' up to ')
346 |
347 | # check if the word exists in any other context
348 | # if 'counter' in (s.replace('% counter', '').replace('countertype', '')
349 | # .replace('^ counter', '').replace('& counter', ''):
350 | # print s + '\n'
351 |
352 | # whew! by manual inspection of a few dozen texts, it looks like this about covers it.
353 | return s
354 |
355 |
356 | # Run after fixing dashes, it makes the regexes better, but before replacing newlines.
357 | def text_pass_7_choice(s):
358 | # the idea is to take 'choose n ~\n=ability\n=ability\n'
359 | # to '[n = ability = ability]\n'
360 |
361 | def choice_formatting_helper(s_helper, prefix, count, suffix = ''):
362 | single_choices = re.findall(ur'(' + prefix + ur'\n?(\u2022.*(\n|$))+)', s_helper)
363 | for choice in single_choices:
364 | newchoice = choice[0]
365 | newchoice = newchoice.replace(prefix, unary_marker + (unary_counter * count) + suffix)
366 | newchoice = newchoice.replace('\n', ' ')
367 | if newchoice[-1:] == ' ':
368 | newchoice = choice_open_delimiter + newchoice[:-1] + choice_close_delimiter + '\n'
369 | else:
370 | newchoice = choice_open_delimiter + newchoice + choice_close_delimiter
371 | s_helper = s_helper.replace(choice[0], newchoice)
372 | return s_helper
373 |
374 | s = choice_formatting_helper(s, ur'choose one \u2014', 1)
375 | s = choice_formatting_helper(s, ur'choose one \u2014 ', 1) # ty Promise of Power
376 | s = choice_formatting_helper(s, ur'choose two \u2014', 2)
377 | s = choice_formatting_helper(s, ur'choose two \u2014 ', 2) # ty Profane Command
378 | s = choice_formatting_helper(s, ur'choose one or both \u2014', 0)
379 | s = choice_formatting_helper(s, ur'choose one or more \u2014', 0)
380 | s = choice_formatting_helper(s, ur'choose khans or dragons.', 1)
381 | # this is for 'an opponent chooses one', which will be a bit weird but still work out
382 | s = choice_formatting_helper(s, ur'chooses one \u2014', 1)
383 | # Demonic Pact has 'choose one that hasn't been chosen'...
384 | s = choice_formatting_helper(s, ur"choose one that hasn't been chosen \u2014", 1,
385 | suffix=" that hasn't been chosen")
386 | # 'choose n. you may choose the same mode more than once.'
387 | s = choice_formatting_helper(s, ur'choose three. you may choose the same mode more than once.', 3,
388 | suffix='. you may choose the same mode more than once.')
389 |
390 | return s
391 |
392 |
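A hedged example of the rewrite, with a made-up modal text. Note that the bullets are still u'\u2022' at this point; turning them into the '=' bullet_marker is presumably left to a later symbol pass:

```python
# Sketch; run with lib/ on sys.path.
from transforms import text_pass_7_choice

s = u'choose one \u2014\n\u2022draw a card\n\u2022discard a card\n'
print(text_pass_7_choice(s))
# [&^ \u2022draw a card \u2022discard a card]
```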
393 | # do before removing newlines
394 | # might as well do this after countertype because we probably care more about
395 | # the location of the equip cost
396 | def text_pass_8_equip(s):
397 | equips = re.findall(r'equip ' + utils.mana_json_regex + r'.?$', s)
398 | # there don't seem to be any cases with more than one
399 | if len(equips) == 1:
400 | equip = equips[0]
401 | s = s.replace('\n' + equip, '')
402 | s = s.replace(equip, '')
403 |
404 | if equip[-1:] == ' ':
405 | equip = equip[0:-1]
406 |
407 | if s == '':
408 | s = equip
409 | else:
410 | s = equip + '\n' + s
411 |
412 | nonmana = re.findall(ur'(equip\u2014.*(\n|$))', s)
413 | if len(nonmana) == 1:
414 | equip = nonmana[0][0]
415 | s = s.replace('\n' + equip, '')
416 | s = s.replace(equip, '')
417 |
418 | if equip[-1:] == ' ':
419 | equip = equip[0:-1]
420 |
421 | if s == '':
422 | s = equip
423 | else:
424 | s = equip + '\n' + s
425 |
426 | return s
427 |
428 |
429 | def text_pass_9_newlines(s):
430 | return s.replace('\n', utils.newline)
431 |
432 |
433 | def text_pass_10_symbols(s):
434 | return utils.to_symbols(s)
435 |
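    | # (passes 9 and 10 are simple rewrites: swap literal '\n' for the compact newline
    | # marker, then let utils.to_symbols translate mana symbols into the internal
    | # encoding; see utils for the exact mapping.)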
436 |
437 | # reorder the lines of text into a canonical form:
438 | # first enchant and equip
439 | # then other keywords, one per line (things with no period on the end)
440 | # then other abilities
441 | # then kicker and countertype last of all
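    | # sketch (with '\n' standing in for utils.newline and <cost> for encoded costs):
    | # 'destroy target creature.\nflying\nequip <cost>\nkicker <cost>'
    | # -> 'equip <cost>\nflying\ndestroy target creature.\nkicker <cost>'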
442 | def text_pass_11_linetrans(s):
443 | # let's just not deal with level up
444 | if 'level up' in s:
445 | return s
446 |
447 | prelines = []
448 | keylines = []
449 | mainlines = []
450 | postlines = []
451 |
452 | lines = s.split(utils.newline)
453 | for line in lines:
454 | line = line.strip()
455 | if line == '':
456 | continue
457 | if not '.' in line:
458 | # comma vs. semicolon separation of keywords is inconsistent, so normalize first
459 | line = line.replace(',', ';')
460 | line = line.replace('; where', ', where') # Thromok the Insatiable
461 | line = line.replace('; and', ', and') # wonky protection
462 | line = line.replace('; from', ', from') # wonky protection
463 | line = line.replace('upkeep;', 'upkeep,') # wonky protection
464 | sublines = line.split(';')
465 | for subline in sublines:
466 | subline = subline.strip()
467 | if 'equip' in subline or 'enchant' in subline:
468 | prelines += [subline]
469 | elif 'countertype' in subline or 'kicker' in subline:
470 | postlines += [subline]
471 | else:
472 | keylines += [subline]
473 | elif u'\u2014' in line and not u' \u2014 ' in line:
474 | if 'equip' in line or 'enchant' in line:
475 | prelines += [line]
476 | elif 'countertype' in line or 'kicker' in line:
477 | postlines += [line]
478 | else:
479 | keylines += [line]
480 | else:
481 | mainlines += [line]
482 |
483 | alllines = prelines + keylines + mainlines + postlines
484 | return utils.newline.join(alllines)
485 |
486 |
487 | # randomize the order of the lines
488 | # not a text pass, intended to be invoked dynamically when encoding a card
489 | # call this on fully encoded text, with mana symbols expanded
490 | def separate_lines(text):
491 | # forget about level up, ignore empty text too while we're at it
492 | if text == '' or 'level up' in text:
493 | return [],[],[],[],[]
494 |
495 | preline_search = ['equip', 'fortify', 'enchant ', 'bestow']
496 | # could probably be optimized with a regex
497 | costline_search = [
498 | 'multikicker', 'kicker', 'suspend', 'echo', 'awaken',
499 | 'buyback', 'dash', 'entwine', 'evoke', 'flashback',
500 | 'madness', 'megamorph', 'morph', 'miracle', 'ninjutsu', 'overload',
501 | 'prowl', 'recover', 'reinforce', 'replicate', 'scavenge', 'splice',
502 | 'surge', 'unearth', 'transmute', 'transfigure',
503 | ]
504 | # cycling is checked separately below (substring match) to catch variants like plainscycling
505 | postline_search = ['countertype']
506 | keyline_search = ['cumulative']
507 |
508 | prelines = []
509 | keylines = []
510 | mainlines = []
511 | costlines = []
512 | postlines = []
513 |
514 | lines = text.split(utils.newline)
515 | # we've already done linetrans once, so some of the irregularities have been simplified
516 | for line in lines:
517 | if not '.' in line:
518 | if any(line.startswith(s) for s in preline_search):
519 | prelines.append(line)
520 | elif any(line.startswith(s) for s in postline_search):
521 | postlines.append(line)
522 | elif any(line.startswith(s) for s in costline_search) or 'cycling' in line:
523 | costlines.append(line)
524 | else:
525 | keylines.append(line)
526 | elif (utils.dash_marker in line and not
527 | (' '+utils.dash_marker+' ' in line or 'non'+utils.dash_marker in line)):
528 | if any(line.startswith(s) for s in preline_search):
529 | prelines.append(line)
530 | elif any(line.startswith(s) for s in costline_search) or 'cycling' in line:
531 | costlines.append(line)
532 | elif any(line.startswith(s) for s in keyline_search):
533 | keylines.append(line)
534 | else:
535 | mainlines.append(line)
536 | elif ': monstrosity' in line:
537 | costlines.append(line)
538 | else:
539 | mainlines.append(line)
540 |
541 | return prelines, keylines, mainlines, costlines, postlines
542 |
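    | # usage sketch: pre, key, main, cost, post = separate_lines(text); randomize_lines
    | # below shuffles within each group and rejoins them with utils.newline.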
543 | choice_re = re.compile(re.escape(utils.choice_open_delimiter) + r'.*' +
544 | re.escape(utils.choice_close_delimiter))
545 | choice_divider = ' ' + utils.bullet_marker + ' '
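    | # randomize_choice, sketched: in '[&^ = a = b = c]' the count fragment '&^' stays
    | # first and the options 'a', 'b', 'c' are shuffled; blocks with fewer than two
    | # options are left alone (assuming the default markers).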
546 | def randomize_choice(line):
547 | choices = re.findall(choice_re, line)
548 | if len(choices) < 1:
549 | return line
550 | new_line = line
551 | for choice in choices:
552 | parts = choice[1:-1].split(choice_divider)
553 | if len(parts) < 3:
554 | continue
555 | choiceparts = parts[1:]
556 | random.shuffle(choiceparts)
557 | new_line = new_line.replace(choice,
558 | utils.choice_open_delimiter +
559 | choice_divider.join(parts[:1] + choiceparts) +
560 | utils.choice_close_delimiter,
561 | 1)
562 | return new_line
563 |
564 | def randomize_lines(text):
565 | if text == '' or 'level up' in text:
566 | return text
567 |
568 | prelines, keylines, mainlines, costlines, postlines = separate_lines(text)
569 |
570 | new_mainlines = []
571 | for line in mainlines:
572 | if line.endswith(utils.choice_close_delimiter):
573 | new_mainlines.append(randomize_choice(line))
574 | # elif utils.choice_open_delimiter in line or utils.choice_close_delimiter in line:
575 | # print(line)
576 | else:
577 | new_mainlines.append(line)
578 |
579 | if False: # TODO: make this an option
580 | lines = prelines + keylines + new_mainlines + costlines + postlines
581 | random.shuffle(lines)
582 | return utils.newline.join(lines)
583 | else:
584 | random.shuffle(prelines)
585 | random.shuffle(keylines)
586 | random.shuffle(new_mainlines)
587 | random.shuffle(costlines)
588 | #random.shuffle(postlines) # only one kind ever (countertype)
589 | return utils.newline.join(prelines+keylines+new_mainlines+costlines+postlines)
590 |
591 |
592 | # Text unpasses, for decoding. All assume the text is inside a Manatext, so they
593 | # don't do anything weird with the mana cost symbol.
594 |
595 |
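    | # inverse of the choice encoding: '[&^ = a = b]' decodes back to 'choose one '
    | # plus a dash, with one bullet line per option (a sketch; markers per utils defaults).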
596 | def text_unpass_1_choice(s, delimit = False):
597 | choice_regex = (re.escape(choice_open_delimiter) + re.escape(unary_marker)
598 | + r'.*' + re.escape(bullet_marker) + r'.*' + re.escape(choice_close_delimiter))
599 | choices = re.findall(choice_regex, s)
600 | for choice in sorted(choices, key = len, reverse = True):
601 | fragments = choice[1:-1].split(bullet_marker)
602 | countfrag = fragments[0]
603 | optfrags = fragments[1:]
604 | choicecount = int(utils.from_unary(re.findall(utils.number_unary_regex, countfrag)[0]))
605 | newchoice = ''
606 |
607 | if choicecount == 0:
608 | if len(countfrag) == 2:
609 | newchoice += 'choose one or both '
610 | else:
611 | newchoice += 'choose one or more '
612 | elif choicecount == 1:
613 | newchoice += 'choose one '
614 | elif choicecount == 2:
615 | newchoice += 'choose two '
616 | else:
617 | newchoice += 'choose ' + utils.to_unary(str(choicecount)) + ' '
618 | newchoice += dash_marker
619 |
620 | for option in optfrags:
621 | option = option.strip()
622 | if option:
623 | newchoice += newline + bullet_marker + ' ' + option
624 |
625 | if delimit:
626 | s = s.replace(choice, choice_open_delimiter + newchoice + choice_close_delimiter)
627 | s = s.replace('an opponent ' + choice_open_delimiter + 'choose ',
628 | 'an opponent ' + choice_open_delimiter + 'chooses ')
629 | else:
630 | s = s.replace(choice, newchoice)
631 | s = s.replace('an opponent choose ', 'an opponent chooses ')
632 |
633 | return s
634 |
635 |
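    | # e.g. 'countertype % charge' plus 'put a % counter on ~' in the body decodes to
    | # 'put a charge counter on ~', and the countertype line itself is removed
    | # (a sketch, with '%' as the default counter_marker).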
636 | def text_unpass_2_counters(s):
637 | countertypes = re.findall(r'countertype ' + re.escape(counter_marker)
638 | + r'[^' + re.escape(newline) + r']*' + re.escape(newline), s)
639 | # lazier than using groups in the regex
640 | countertypes += re.findall(r'countertype ' + re.escape(counter_marker)
641 | + r'[^' + re.escape(newline) + r']*$', s)
642 | if len(countertypes) > 0:
643 | countertype = countertypes[0].replace('countertype ' + counter_marker, '')
644 | countertype = countertype.replace(newline, '\n').strip()
645 | s = s.replace(countertypes[0], '')
646 | s = s.replace(counter_marker, countertype)
647 |
648 | return s
649 |
650 |
651 | def text_unpass_3_uncast(s):
652 | return s.replace(counter_rename, 'counter')
653 |
654 |
655 | def text_unpass_4_unary(s):
656 | return utils.from_unary(s)
657 |
658 |
659 | def text_unpass_5_symbols(s, for_forum, for_html):
660 | return utils.from_symbols(s, for_forum = for_forum, for_html = for_html)
661 |
662 |
663 | def text_unpass_6_cardname(s, name):
664 | return s.replace(this_marker, name)
665 |
666 |
667 | def text_unpass_7_newlines(s):
668 | return s.replace(newline, '\n')
669 |
670 |
671 | def text_unpass_8_unicode(s):
672 | s = s.replace(dash_marker, u'\u2014')
673 | s = s.replace(bullet_marker, u'\u2022')
674 | return s
675 |
--------------------------------------------------------------------------------
/lib/utils.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | # Utilities for handling unicode, unary numbers, mana costs, and special symbols.
4 | # For convenience we redefine everything from config so that it can all be accessed
5 | # from the utils module.
6 |
7 | import config
8 |
9 | # special chunk of text that Magic Set Editor 2 requires at the start of all set files.
10 | mse_prepend = 'mse version: 0.3.8\ngame: magic\nstylesheet: m15\nset info:\n\tsymbol:\nstyling:\n\tmagic-m15:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay:\n\tmagic-m15-clear:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-extra-improved:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\tpt box symbols: magic-pt-symbols-extra.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-planeswalker:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-planeswalker-promo-black:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-promo-dka:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-token-clear:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-new-planeswalker:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-new-planeswalker-4abil:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-new-planeswalker-clear:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-new-planeswalker-promo-black:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n'
11 |
12 | # special chunk of text to start an HTML document.
13 | import html_extra_data
14 | segment_ids = html_extra_data.id_lables
15 | html_prepend = html_extra_data.html_prepend
16 | html_append = "\n