├── .gitignore
├── DEPENDENCIES.md
├── LICENSE
├── README.md
├── data
│   ├── cbow.bin
│   ├── cbow.sh
│   ├── cbow.txt
│   ├── mtgvocab.json
│   └── output.txt
├── decode.py
├── encode.py
├── lib
│   ├── cardlib.py
│   ├── cbow.py
│   ├── config.py
│   ├── datalib.py
│   ├── html_extra_data.py
│   ├── jdecode.py
│   ├── manalib.py
│   ├── namediff.py
│   ├── nltk_model.py
│   ├── nltk_model_api.py
│   ├── transforms.py
│   └── utils.py
├── mtg_sweep1.ipynb
├── scripts
│   ├── analysis.py
│   ├── autosample.py
│   ├── collect_checkpoints.py
│   ├── distances.py
│   ├── keydiff.py
│   ├── mtg_validate.py
│   ├── ngrams.py
│   ├── pairing.py
│   ├── sanity.py
│   ├── streamcards.py
│   ├── sum.py
│   └── summarize.py
└── sortcards.py
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *~
2 | *.pyc
3 | AllSets.json
4 | AllSets-x.json
5 | lib/__init__.py
6 |
--------------------------------------------------------------------------------
/DEPENDENCIES.md:
--------------------------------------------------------------------------------
1 | Dependencies
2 | ======
3 |
4 | ## mtgjson
5 |
6 | First, you'll need the json corpus of Magic the Gathering cards, which can be found at:
7 |
8 | http://mtgjson.com/
9 |
10 | You probably want the file AllSets.json, which you should also be able to download here:
11 |
12 | http://mtgjson.com/json/AllSets.json
13 |
14 | ## Python packages
15 |
16 | mtgencode uses a few additional Python packages which you should be able to install with pip, Python's package manager. They aren't mission critical, but they provide better capitalization of names and text in human-readable output formats. If they aren't installed, mtgencode will silently fall back to less effective workarounds.
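The fallback works via the usual optional-import pattern. A minimal sketch of the idea (illustrative only; the crude capitalizer below is an assumption, not mtgencode's exact workaround):

```
# degrade gracefully if titlecase isn't installed
try:
    from titlecase import titlecase
except ImportError:
    def titlecase(s):
        # cruder workaround: just capitalize each word
        return ' '.join(w.capitalize() for w in s.split())

print titlecase('crovax the cursed')
```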
17 |
18 | On Ubuntu, you should be able to install the necessary packages with:
19 |
20 | ```
21 | sudo apt-get install python-pip
22 | sudo pip install titlecase
23 | sudo pip install nltk
24 | ```
25 |
26 | nltk requires some additional data files to work, so you'll also have to do:
27 |
28 | ```
29 | mkdir ~/nltk_data
30 | cd ~/nltk_data
31 | python -c "import nltk; nltk.download('punkt')"
32 | cd -
33 | ```
34 |
35 | You don't have to put the files in ~/nltk_data; that's just one of the places nltk will look automatically. If you try to run decode.py with nltk but without the additional files, the error message is pretty helpful.
36 |
37 | mtgencode can also use numpy to speed up some of the long calculations required to generate the creativity statistics comparing similarity of generated and existing cards. You can install numpy with:
38 |
39 | ```
40 | sudo apt-get install python-dev python-pip
41 | sudo pip install numpy
42 | ```
43 |
44 | This will launch an absolutely massive compilation process for all of the numpy C sources. Go get a cup of coffee, and if it fails, consult Google. You'll probably need to at least have GCC installed; I'm not sure what else.
45 |
46 | Some additional packages will be needed for multithreading, but that doesn't work yet, so no worries.
47 |
48 | ## word2vec
49 |
50 | The creativity analysis is done using vector models produced by this tool:
51 |
52 | https://code.google.com/p/word2vec/
53 |
54 | You can install it pretty easily with subversion:
55 |
56 | ```
57 | sudo apt-get install subversion
58 | mkdir ~/word2vec
59 | cd ~/word2vec
60 | svn checkout http://word2vec.googlecode.com/svn/trunk/
61 | cd trunk
62 | make
63 | ```
64 |
65 | That should create some files, among them a binary called word2vec. Add this to your path somehow, and you'll be able to invoke cbow.sh from within the data/ subdirectory to recompile the vector model (cbow.bin) from whatever text representation was last produced (cbow.txt).
66 |
67 | ## Rebuilding the data files
68 |
69 | The standard procedure to produce the derived data files from AllSets.json is the following:
70 |
71 | ```
72 | ./encode.py -v data/AllSets.json data/output.txt
73 | ./encode.py -v data/output.txt data/cbow.txt -s -e vec
74 | cd data
75 | ./cbow.sh
76 | ```
77 |
78 | This of course assumes that you have AllSets.json in data/, and that you start from the root of the repo, in the same directory as encode.py.
79 |
80 | ## Magic Set Editor 2
81 |
82 | MSE2 is a tool for creating and viewing custom magic cards:
83 |
84 | http://magicseteditor.sourceforge.net/
85 |
86 | Set files, with the extension .mse-set, can be produced by decode.py using the -mse option and then viewed in MSE2.
87 |
88 | Unfortunately, getting MSE2 to run on Linux can be tricky. Both Wine 1.6 and 1.7 have been reported to work on Ubuntu; instructions for 1.7 can be found here:
89 |
90 | https://www.winehq.org/download/ubuntu
91 |
92 | To install MSE with Wine, download the standard Windows installer and open it with Wine. Everything should just work. You will need some additional card styles:
93 |
94 | http://sourceforge.net/projects/msetemps/files/Magic%20-%20Recently%20Printed%20Styles.mse-installer/download
95 |
96 | And possibly this:
97 |
98 | http://sourceforge.net/projects/msetemps/files/Magic%20-%20M15%20Extra.mse-installer/download
99 |
100 | Once MSE2 is installed with Wine, you should be able to just click on the template installers and MSE2 will know what to do with them.
101 |
102 | Some additional system fonts are required, specifically Beleren Bold, Beleren Small Caps Bold, and Relay Medium. Those can be found here:
103 |
104 | http://www.slightlymagic.net/forum/viewtopic.php?f=15&t=14730
105 |
106 | http://www.azfonts.net/download/relay-medium/ttf.html
107 |
108 | Open them in Font Viewer and click install; you might then have to clear the font caches so MSE2 can see them:
109 |
110 | ```
111 | sudo fc-cache -fv
112 | ```
113 |
114 | If you're running a Linux distro other than Ubuntu, then a similar procedure will probably work. If you're on Windows, then it should work fine as is without messing around with Wine. You'll still need the additional styles.
115 |
116 | I tried to build MSE2 from source on 64-bit Ubuntu. After hacking up some of the files, I did get a working binary, but I was unable to set up the data files it needs in such a way that I could actually open a set. If you manage to get this to work, please explain how, and I will be very grateful.
117 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright (c) 2015 Bill Zorn
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy
4 | of this software and associated documentation files (the "Software"), to deal
5 | in the Software without restriction, including without limitation the rights
6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7 | copies of the Software, and to permit persons to whom the Software is
8 | furnished to do so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in
11 | all copies or substantial portions of the Software.
12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 19 | THE SOFTWARE. 20 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # mtgencode 2 | 3 | Utilities to assist in the process of generating Magic the Gathering cards with neural nets. Inspired by this thread on the mtgsalvation forums: 4 | 5 | http://www.mtgsalvation.com/forums/creativity/custom-card-creation/612057-generating-magic-cards-using-deep-recurrent-neural 6 | 7 | The purpose of this code is mostly to wrangle text between various human and machine readable formats. The original input comes from [mtgjson](http://mtgjson.com); this is filtered and reduced to one of several input formats intended for neural network training, such as the standard encoded format used in [data/output.txt](https://github.com/billzorn/mtgencode/blob/master/data/output.txt). Any json or encoded data, including output from appropriately trained neural nets, can then be interpreted as cards and decoded to a human readable format, such as a text spoiler, [Magic Set Editor 2](http://magicseteditor.sourceforge.net) set file, or a pretty, portable html file that can be viewed in any browser. 8 | 9 | ## Requirements 10 | 11 | I'm running this code on Ubuntu 14.04 with Python 2.7. Unfortunately it does not work with Python 3, though apparently it isn't too hard to use 2to3 to automatically convert it. 12 | 13 | For the most part it should work out of the box, though there are a few optional bonus features that will make it much better. See [DEPENDENCIES.md](https://github.com/billzorn/mtgencode/blob/master/DEPENDENCIES.md#dependencies). 14 | 15 | This code does not have anything to do with neural nets; if you want to generate cards with them, see the [tutorial](https://github.com/billzorn/mtgencode#tutorial). 16 | 17 | ## Usage 18 | 19 | Functionality is provided by two main driver scripts: encode.py and decode.py. Logically, encode.py handles encoding to formats intended to feed into a neural network, while decode.py handles decoding to formats intended to be read by a human. 
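Both drivers also expose their logic as an importable main() function whose keyword arguments mirror the command-line flags, so you can drive them from Python instead of the shell. A minimal sketch (assuming you run from the root of the repo, so the scripts can find lib/):

```
# encode the json corpus, then decode it back to a text spoiler
import encode, decode

encode.main('data/AllSets.json', 'data/output.txt', verbose=True)
decode.main('data/output.txt', 'data/output.pretty.txt', vdump=True)
```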
20 |
21 | ### encode.py
22 |
23 | ```
24 | usage: encode.py [-h] [-e {std,named,noname,rfields,old,norarity,vec,custom}]
25 | [-r] [--nolinetrans] [--nolabel] [-s] [-v]
26 | infile [outfile]
27 |
28 | positional arguments:
29 | infile encoded card file or json corpus to encode
30 | outfile output file, defaults to stdout
31 |
32 | optional arguments:
33 | -h, --help show this help message and exit
34 | -e {std,named,noname,rfields,old,norarity,vec,custom}, --encoding {std,named,noname,rfields,old,norarity,vec,custom}
35 | encoding format to use
36 | -r, --randomize randomize the order of symbols in mana costs
37 | --nolinetrans don't reorder lines of card text
38 | --nolabel don't label fields
39 | -s, --stable don't randomize the order of the cards
40 | -v, --verbose verbose output
41 | ```
42 |
43 | The supported encodings are:
44 |
45 | Argument | Description
46 | -----------|------------
47 | std | Standard format: `|type|supertype|subtype|loyalty|pt|text|cost|rarity|name|`.
48 | named | Name first: `|name|type|supertype|subtype|loyalty|pt|text|cost|rarity|`.
49 | noname | No name field at all: `|type|supertype|subtype|loyalty|pt|text|cost|rarity|`.
50 | rfields | Randomize the order of the fields, using only the label to distinguish which field is which.
51 | old | Legacy format: `|name|supertype|type|loyalty|subtype|rarity|pt|cost|text|`. No field labels.
52 | norarity | Older legacy format: `|name|supertype|type|loyalty|subtype|pt|cost|text|`. No field labels.
53 | vec | Produce a content vector for each card; used with [word2vec](https://code.google.com/p/word2vec/).
54 | custom | Blank format slot, intended to help users add their own formats to the python source.
55 |
56 | ### decode.py
57 |
58 | ```
59 | usage: decode.py [-h] [-e {std,named,noname,rfields,old,norarity,vec,custom}]
60 | [-g] [-f] [-c] [-d] [-v] [-mse] [-html]
61 | infile [outfile]
62 |
63 | positional arguments:
64 | infile encoded card file or json corpus to decode
65 | outfile output file, defaults to stdout
66 |
67 | optional arguments:
68 | -h, --help show this help message and exit
69 | -e {std,named,noname,rfields,old,norarity,vec,custom}, --encoding {std,named,noname,rfields,old,norarity,vec,custom}
70 | encoding format to use
71 | -g, --gatherer emulate Gatherer visual spoiler
72 | -f, --forum use pretty mana encoding for mtgsalvation forum
73 | -c, --creativity use CBOW fuzzy matching to check creativity of cards
74 | -d, --dump dump out lots of information about invalid cards
75 | -v, --verbose verbose output
76 | -mse, --mse use Magic Set Editor 2 encoding; will output as .mse-
77 | set file
78 | -html, --html create a .html file with pretty forum formatting
79 | ```
80 |
81 | The default output is a text spoiler which modifies the output of the neural net as little as possible while making it human readable. Specifying the -g option will produce a prettier, Gatherer-inspired text spoiler with heavier-weight transformations applied to the text, such as capitalization. The -f option encodes mana symbols in the format used by the mtgsalvation forum; this is useful if you want to cut and paste your spoiler into a post to share it.
82 |
83 | Passing the -mse option will cause decode.py to produce both the hilarious internal MSE text format as well as an actual mse set file, which is really just a renamed zip archive. The -f and -g flags will be respected in the text that is dumped to each card's notes field.
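Since the .mse-set file is just a renamed zip archive holding a single file named 'set', you can peek at what decode.py produced without opening MSE2. A small sketch (the path here is an example):

```
import zipfile

with zipfile.ZipFile('data/allcards.mse-set') as zf:
    print zf.namelist()         # ['set']
    print zf.read('set')[:300]  # start of the internal MSE text format
```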
84 |
85 | Finally, the -c and -d options will print out additional data about the quality of the cards. Running with -c is extremely slow due to the massive amount of computation involved, though at least we can do it in parallel over all of your processor cores; -d is probably a good idea to use in general unless you're trying to produce pretty output to show off. Using html mode is especially useful with -c as we can link to visual spoilers from magiccards.info.
86 |
87 | ### Examples
88 |
89 | To generate the standard encoding in data/output.txt, I run:
90 |
91 | ```
92 | ./encode.py -v data/AllSets.json data/output.txt
93 | ```
94 |
95 | Of course, this requires that you've downloaded the mtgjson corpus to data/AllSets.json, and are running from the root of the repo.
96 |
97 | If I wanted to convert that standard output to a Magic Set Editor 2 set, I'd run:
98 |
99 | ```
100 | ./decode.py -v data/output.txt data/allcards -f -g -d -mse
101 | ```
102 |
103 | This will produce a useless text file called data/allcards, and a set file called data/allcards.mse-set that you can open with MSE2. The -f and -g options will cause the text spoiler included in the notes field of each card in the set to be a pretty Gatherer-inspired affair that you could cut and paste onto the mtgsalvation forum. The -d option will dump additional information if any of the cards are invalidly formatted, which probably won't do anything because all existing magic cards are encoded correctly. Specifying the -c option here would be a bad idea; it would probably take several days to run.
104 |
105 | ### Scripts
106 |
107 | A bunch of additional data processing functionality is provided by the files in scripts/. Right now there isn't a whole lot, but more tools might be added in the future, to do things such as convert card dumps into .arff files that could be analyzed in [Weka](http://www.cs.waikato.ac.nz/ml/weka/).
108 |
109 | Currently, scripts/summarize.py will build a bunch of big data mining indices and use them to print out interesting statistics about a dump of cards. If you want to use mtgencode to do your own data analysis, taking a look at it would be a good place to start.
110 |
111 |
112 |
113 | ## Tutorial
114 |
115 | This tutorial will cover how to generate cards from scratch using neural nets.
116 |
117 | ### Set up a Linux environment
118 |
119 | If you're already running on Linux, hooray! If not, you have a few options. The easiest is probably to use a virtual machine; the disadvantage of this approach is that it will prevent you from using a graphics card to train the neural net, which speeds things up immensely. For reference, my GTX Titan is about 10x faster than my overclocked 8-core i7-5960X.
120 |
121 | The other option is to dual boot your machine (which is what I do) or otherwise acquire a machine that you can run Linux on natively. How exactly you do this is beyond the scope of this tutorial.
122 |
123 | If you do decide to go the virtual machine route:
124 |
125 | 1. Download some sort of virtual machine software. I recommend [VirtualBox](https://help.ubuntu.com/community/VirtualBox).
126 | 2. Download a Linux operating system. I recommend [Ubuntu](http://www.ubuntu.com/download/desktop).
127 | 3. [Create a virtual machine, and install the operating system on it](https://help.ubuntu.com/community/VirtualBox/FirstVM).
128 |
129 | IMPORTANT NOTE: Training neural nets is extremely CPU intensive, and rather memory intensive as well.
If you don't want training to take multiple weeks, it's a very good idea to give your virtual machine as many processor cores and as much memory as you can spare, and to monitor system performance with the 'top' command to make sure you aren't [swapping](https://help.ubuntu.com/community/SwapFaq), as that will degrade performance immensely.
130 |
131 | You should be able to boot up the virtual machine and use whatever operating system you installed. If you're new to Linux, you might want to familiarize yourself with it a little. For my own sanity, I'm going to assume at least basic familiarity. Most of what we'll be doing will be in terminals; if the instructions say to do something and then provide some code in a block quote, it probably means to type that into a terminal, one line at a time.
132 |
133 | ### Set up the neural net code
134 |
135 | We're ultimately going to use the code from the [mtg-rnn repo](https://github.com/billzorn/mtg-rnn); if anything is unclear you can refer to the documentation there as well.
136 |
137 | First, we need to install some dependencies. The primary one is Torch, the scientific computing framework the neural net code is written in. Directions are [here](http://torch.ch/docs/getting-started.html).
138 |
139 | Next, open a terminal and install some additional lua packages:
140 |
141 | ```
142 | luarocks install nngraph
143 | luarocks install optim
144 | ```
145 |
146 | Now we'll clone the git repo with the neural net code. You'll need git installed; if it isn't:
147 |
148 | ```
149 | sudo apt-get install git
150 | ```
151 |
152 | Then go to your home directory (or wherever you want to put the repo, it can be anywhere really) and clone it:
153 |
154 | ```
155 | cd ~
156 | git clone https://github.com/billzorn/mtg-rnn.git
157 | ```
158 |
159 | This should create the folder mtg-rnn, with a bunch of files in it. To check if it works, try:
160 |
161 | ```
162 | cd ~/mtg-rnn
163 | th train.lua --help
164 | ```
165 |
166 | A large usage message should be printed. If you get an error, then check to make sure Torch is working. As always, Google is your best friend when anything goes wrong.
167 |
168 | ### Set up mtgencode
169 |
170 | Go back to your home directory (or wherever) and clone mtgencode as well:
171 |
172 | ```
173 | cd ~
174 | git clone https://github.com/billzorn/mtgencode.git
175 | ```
176 |
177 | This should create the folder mtgencode, also with a bunch of files in it.
178 |
179 | You'll need Python to run it; to get full functionality, consult [DEPENDENCIES.md](https://github.com/billzorn/mtgencode/blob/master/DEPENDENCIES.md#dependencies). But, it should work with just Python. To install Python:
180 |
181 | ```
182 | sudo apt-get install python
183 | ```
184 |
185 | To check if it works:
186 |
187 | ```
188 | cd ~/mtgencode
189 | ./encode.py --help
190 | ```
191 |
192 | Again, you should see a usage message; if you don't, make sure Python is working. mtgencode uses Python 2.7, so if you think your default python is Python 3, you can try:
193 |
194 | ```
195 | python2 encode.py --help
196 | ```
197 |
198 | instead of running the script directly.
199 |
200 | ### Generating an encoded corpus for training
201 |
202 | If you just want to train with the default corpus, you can skip this step, as it already exists in mtg-rnn. Just replace all instances of 'custom_encoding' with 'mtgencode-std'.
203 |
204 | To generate an encoded corpus, you'll first need to download AllSets.json from [mtgjson.com](http://mtgjson.com/) to data/AllSets.json. Then to encode it:
205 |
206 | ```
207 | ./encode.py -v data/AllSets.json data/custom_encoding.txt
208 | ```
209 |
210 | This will create the file data/custom_encoding.txt with your encoding in it. You can add some options to create a different encoding; consult the usage of [encode.py](https://github.com/billzorn/mtgencode#encodepy).
211 |
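As a quick sanity check, cards in the encoded corpus are separated by blank lines (two newlines; see the format notes at the end of this README), so you can count them with a couple of lines of Python (sketch):

```
# count the cards in the encoded corpus
with open('data/custom_encoding.txt') as f:
    cards = [c for c in f.read().split('\n\n') if c.strip()]
print len(cards)
```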
212 | Now copy this encoded corpus over to mtg-rnn:
213 |
214 | ```
215 | cd ~/mtg-rnn
216 | mkdir data/custom_encoding
217 | cp ~/mtgencode/data/custom_encoding.txt data/custom_encoding/input.txt
218 | ```
219 |
220 | The input file does have to be named input.txt, though you can name the folder that holds it, under mtg-rnn/data/, whatever you want.
221 |
222 | ### Training a neural net
223 |
224 | There are lots of parameters to control training. With a good GPU, I can train a 3-layer, size 512 network in a few hours; on a CPU this will probably take at least a day.
225 |
226 | Most networks we use are about that size. I'd recommend avoiding anything much larger, as they don't seem to produce appreciably better results and take longer to train. The only other parameter you really have to change from the defaults is seq_length, which we usually set somewhere from 120-200. If this causes memory issues you can reduce batch_size slightly to compensate.
227 |
228 | A sample training command might look like this:
229 |
230 | ```
231 | th train.lua -gpuid -1 -rnn_size 256 -num_layers 3 -seq_length 200 -data_dir data/custom_encoding -checkpoint_dir cv/custom_format-256/ -eval_val_every 1000 -seed 7767
232 | ```
233 |
234 | This tells the neural network to train using the corpus in data/custom_encoding/, and to output periodic checkpoints to the directory cv/custom_format-256/. The option "-gpuid -1" means to use the CPU, not a GPU (which won't be possible in VirtualBox anyway). The final options, -eval_val_every and -seed, aren't necessary, but I like to specify them. The seed will be set to a fixed 123 if you don't specify one yourself. If you're generating too many checkpoints and filling up your disk, you can increase the number of iterations between saving them by increasing the argument to -eval_val_every.
235 |
236 | If all goes well, you should see the neural net code do some stuff and then start training, reporting training loss and batch times as it goes:
237 |
238 | ```
239 | 1/112100 (epoch 0.000), train_loss = 4.21492900, grad/param norm = 3.1264e+00, time/batch = 4.73s
240 | 2/112100 (epoch 0.001), train_loss = 4.29372822, grad/param norm = 8.6741e+00, time/batch = 3.62s
241 | 3/112100 (epoch 0.001), train_loss = 4.02817964, grad/param norm = 8.0445e+00, time/batch = 3.57s
242 | ...
243 | ```
244 |
245 | This process can take a while, so go to sleep or something and come back in the morning. The train_loss should eventually start to decrease and settle around 0.5 or so; if it doesn't, then something is wrong and the neural net will probably produce gibberish.
246 |
247 | Every N iterations, where N is the argument to -eval_val_every, the neural net will generate a checkpoint in cv/custom_format-256/. They look like this:
248 |
249 | ```
250 | lm_lstm_epoch2.23_0.5367.t7
251 | ```
252 |
253 | The numbers are important; the first is the epoch, which tells you how many passes the neural network had made over the training data when it saved the checkpoint, and the second is the validation loss of the checkpoint. Validation loss is effectively a measurement of how accurate the checkpoint is at producing text that resembles the encoded format, the lower the better. The two numbers are separated by an underscore, so for the example above, the checkpoint is from epoch 2.23, and it had a validation loss of 0.5367, which isn't great but probably isn't gibberish either.
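If you accumulate a directory full of checkpoints, the two numbers are easy to pull back out of the filenames (a sketch based on the naming convention above):

```
import re

fname = 'lm_lstm_epoch2.23_0.5367.t7'
m = re.match(r'lm_lstm_epoch([0-9.]+)_([0-9.]+)\.t7$', fname)
epoch, val_loss = float(m.group(1)), float(m.group(2))
print epoch, val_loss   # 2.23 0.5367
```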
254 |
255 | ### Sampling checkpoints to generate cards
256 |
257 | Once you're done training, or you've got enough checkpoints and you're just impatient, you can sample to generate actual cards. If the network is still training, you'll probably want to pause it by typing Control-Z in the terminal; you can resume it later with the command 'fg'. Training will use all available CPU resources all by itself, so trying to sample at the same time is a recipe for slow.
258 |
259 | Once you're ready, go to the mtg-rnn repo. A typical sampling command might look like this:
260 |
261 | ```
262 | th sample.lua cv/custom_format-256/lm_lstm_epochXX.XX_X.XXXX.t7 -gpuid -1 -temperature 0.9 -length 2000 | tee cards.txt
263 | ```
264 |
265 | Replace the Xs in the checkpoint name with the numbers in the name of an actual checkpoint; tab completion is your friend. This command will sample 2000 characters, which is probably something like 20 cards, and both print them to the terminal and write them to a file called cards.txt. The interesting options here are the temperature and the length. Temperature controls how cautious the network is; lower values produce more probable output, while higher values make it wilder and more creative. Somewhere in the range of 0.7-1.0 usually works best. Length is just how many characters to generate. You can also specify a seed with -seed, exactly as for training, which is a particularly good idea if you just generated a few million characters and would like to see something new. The default seed is fixed at 123, again exactly as for training.
266 |
267 | You can read the output yourself, but it might be painful, especially if you're using randomly ordered fields.
268 |
269 | ### Postprocessing neural net output with mtgencode
270 |
271 | Once you've generated some cards, you can turn them into pretty text spoilers or a set file for MSE2.
272 |
273 | Go back to mtgencode, and run something like:
274 |
275 | ```
276 | ./decode.py -v ~/mtg-rnn/cards.txt cards.pretty.txt -d
277 | ```
278 |
279 | This should create a file called cards.pretty.txt with a text spoiler in it that's actually designed for human consumption. Open it in your favorite text editor and enjoy!
280 |
281 | The -d option ensures you'll still be able to see anything that went wrong with the cards. You can change the formatting with -f and -g, and produce a set file for MSE2 with -mse. The -c option produces some interesting comparisons to existing cards, but it's slow, so be prepared to wait a long time if you use it on a large dump.
282 |
283 | ## Gory details of the format
284 |
285 | Individual cards are separated by two newlines. Multifaced cards (split, flip, etc.) are encoded together, with the castable one first if applicable, and separated by only one newline.
286 |
287 | All decimal numbers are represented in unary, with numbers over 20 special-cased into english. Fun fact: the only numbers over 20 on cards are 25, 30, 40, 50, 100, and 200. The unary representation uses one character to mark the start of the number, and another to count. So 0 is &, 1 is &^, 2 is &^^, 11 is &^^^^^^^^^^^, and so on.
288 |
289 | Mana costs are specially encoded between braces {}. I use the unary counter to encode the colorless part, and then special two-character symbols for everything else. So, {3}{W}{W} becomes {^^^WWWW}, {U/B}{U/B} becomes {UBUB}, and {X}{X}{X} becomes {XXXXXX}. The details are controlled in lib/utils.py, and handled with the Manacost and Manatext objects in lib/manalib.py.
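A toy version of both encodings, using the markers defined in lib/config.py (the real conversions live in lib/utils.py and lib/manalib.py; this sketch only illustrates the idea):

```
unary_marker, unary_counter = '&', '^'

def to_unary(n):
    # 0 -> '&', 2 -> '&^^', 11 -> '&^^^^^^^^^^^'
    return unary_marker + unary_counter * n

print to_unary(11)
# inside braces there's no '&'; the colorless part is just counters, and
# every other symbol is two characters, so {3}{W}{W} encodes as {^^^WWWW}
print '{' + unary_counter * 3 + 'WW' * 2 + '}'   # {^^^WWWW}
```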
290 |
291 | The name of the card becomes @ in the text. I try to handle all the stupid special cases correctly. For example, Crovax the Cursed is referred to in his text box as simply 'Crovax'. Yuch.
292 |
293 | The names of counters are similarly replaced with %, and then a special line of text is added to tell what kind of counter % refers to. Fun fact: there are more than a hundred different kinds used in real cards.
294 |
295 | Several ambiguous words are resolved. Most directly, the word 'counter' as in 'counter target spell' is replaced with 'uncast'. This should prevent confusion with +&^/+&^ counters and % counters.
296 |
297 | I also reformat cards that choose between multiple things by removing the choice clause itself and instead having a delimited list of options prefixed by a number. If you could choose different numbers of things (one or both, one or more - turns out the latter is valid in all existing cases) then the number is 0, otherwise it's however many things you'd get to choose. So, 'choose one -\= effect x\= effect y' (the \ is a newline) becomes [&^ = effect x = effect y].
298 |
299 | Finally, some postprocessing is done to put the lines of a card's ability text into a standardized, canonical form. Lines with multiple keywords are split, and then we put all of the simple keywords first, followed by things like static or activated abilities. A few things always go first (such as equip and enchant) and a few other things always go last (such as kicker and countertype). There are various reasons for doing this transformation, and some proper science could probably come up with a better specific procedure. One of the primary motivations for putting abilities onto individual lines is that it should simplify the process of adding back in reminder text. It should be noted somewhere that the definition of a simple keyword ability vs. some other line of text is that a simple keyword won't contain a period, and we can split a line with multiple of them by looking for commas and semicolons.
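The splitting heuristic itself is simple; a sketch (the real logic lives with the other text transformations, presumably in lib/transforms.py):

```
import re

def split_simple_keywords(line):
    # a simple keyword line contains no period, so it can be split on
    # commas and semicolons into one keyword per line
    if '.' in line:
        return [line]
    return [w.strip() for w in re.split(r'[,;]', line)]

print split_simple_keywords('flying, first strike')   # ['flying', 'first strike']
```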
300 |
301 | ======
302 |
303 | Here's an attempt at a list of all the things I do:
304 |
305 | * Aggregate split / flip / rotating / etc. cards by their card number (22a with 22b) and put them together
306 |
307 | * Make all text lowercase, so the symbols for mana and X are distinct
308 |
309 | * Remove all reminder text
310 |
311 | * Put @ in for the name of the card
312 |
313 | * Encode the mana costs, and the tap and untap symbols
314 |
315 | * Convert decimal numbers to unary
316 |
317 | * Simplify the syntax of dashes, so that - is only used as a minus sign, and ~ is used elsewhere
318 |
319 | * Make sure that where X is the variable X, it's uppercase
320 |
321 | * Change the names of all counters to % and add a line to identify what kind of counter % refers to
322 |
323 | * Move the equip cost of equipment to the beginning of the text so that it's closer to the type
324 |
325 | * Rename 'counter' in the context of 'counter target spell' to 'uncast'
326 |
327 | * Put choices into [&^ = effect x = effect y] format
328 |
329 | * Replace actual newline characters with \ so that we can use those to separate cards
330 |
331 | * Clean all the unicode junk like accents and unicode minus signs out of the text so there are fewer characters
332 |
333 | * Split composite text lines (e.g. "flying, first strike" -> "flying\first strike") and put the lines into canonical order
334 |
--------------------------------------------------------------------------------
/data/cbow.bin:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/billzorn/mtgencode/ee5f26590dc77cd252fa0ceb00d88b4665e2a9bf/data/cbow.bin
--------------------------------------------------------------------------------
/data/cbow.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | word2vec -train cbow.txt -output cbow.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 8 -binary 1 -iter 15
4 |
--------------------------------------------------------------------------------
/data/mtgvocab.json:
--------------------------------------------------------------------------------
1 | {"idx_to_token": {"1": "\n", "2": " ", "3": "\"", "4": "%", "5": "&", "6": "'", "7": "*", "8": "+", "9": ",", "10": "-", "11": ".", "12": "/", "13": "0", "14": "1", "15": "2", "16": "3", "17": "4", "18": "5", "19": "6", "20": "7", "21": "8", "22": "9", "23": ":", "24": "=", "25": "@", "26": "A", "27": "B", "28": "C", "29": "E", "30": "G", "31": "L", "32": "N", "33": "O", "34": "P", "35": "Q", "36": "R", "37": "S", "38": "T", "39": "U", "40": "W", "41": "X", "42": "Y", "43": "[", "44": "\\", "45": "]", "46": "^", "47": "a", "48": "b", "49": "c", "50": "d", "51": "e", "52": "f", "53": "g", "54": "h", "55": "i", "56": "j", "57": "k", "58": "l", "59": "m", "60": "n", "61": "o", "62": "p", "63": "q", "64": "r", "65": "s", "66": "t", "67": "u", "68": "v", "69": "w", "70": "x", "71": "y", "72": "z", "73": "{", "74": "|", "75": "}", "76": "~"}, "token_to_idx": {"\n": 1, " ": 2, "\"": 3, "%": 4, "'": 6, "&": 5, "+": 8, "*": 7, "-": 10, ",": 9, "/": 12, ".": 11, "1": 14, "0": 13, "3": 16, "2": 15, "5": 18, "4": 17, "7": 20, "6": 19, "9": 22, "8": 21, ":": 23, "=": 24, "A": 26, "@": 25, "C": 28, "B": 27, "E": 29, "G": 30, "L": 31, "O": 33, "N": 32, "Q": 35, "P": 34, "S": 37, "R": 36, "U": 39, "T": 38, "W": 40, "Y": 42, "X": 41, "[": 43, "]": 45, "\\": 44, "^": 46, "a": 47, "c": 49, "b": 48, "e": 51, "d": 50, "g": 53, "f": 52, "i": 55, "h": 54, "k": 57, "j": 56, "m": 59, "l": 58, "o": 61, "n": 60, "q": 63, "p": 62, "s": 65, "r": 64, "u": 67, "t": 66, "w": 69, "v": 68, "y": 
71, "x": 70, "{": 73, "z": 72, "}": 75, "|": 74, "~": 76}} -------------------------------------------------------------------------------- /decode.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | import zipfile 5 | import shutil 6 | 7 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'lib') 8 | sys.path.append(libdir) 9 | import utils 10 | import jdecode 11 | import cardlib 12 | from cbow import CBOW 13 | from namediff import Namediff 14 | 15 | def main(fname, oname = None, verbose = True, encoding = 'std', 16 | gatherer = False, for_forum = False, for_mse = False, 17 | creativity = False, vdump = False, for_html = False): 18 | 19 | # there is a sane thing to do here (namely, produce both at the same time) 20 | # but we don't support it yet. 21 | if for_mse and for_html: 22 | print 'ERROR - decode.py - incompatible formats "mse" and "html"' 23 | return 24 | 25 | fmt_ordered = cardlib.fmt_ordered_default 26 | 27 | if encoding in ['std']: 28 | pass 29 | elif encoding in ['named']: 30 | fmt_ordered = cardlib.fmt_ordered_named 31 | elif encoding in ['noname']: 32 | fmt_ordered = cardlib.fmt_ordered_noname 33 | elif encoding in ['rfields']: 34 | pass 35 | elif encoding in ['old']: 36 | fmt_ordered = cardlib.fmt_ordered_old 37 | elif encoding in ['norarity']: 38 | fmt_ordered = cardlib.fmt_ordered_norarity 39 | elif encoding in ['vec']: 40 | pass 41 | elif encoding in ['custom']: 42 | ## put custom format decisions here ########################## 43 | 44 | ## end of custom format ###################################### 45 | pass 46 | else: 47 | raise ValueError('encode.py: unknown encoding: ' + encoding) 48 | 49 | cards = jdecode.mtg_open_file(fname, verbose=verbose, fmt_ordered=fmt_ordered) 50 | 51 | if creativity: 52 | namediff = Namediff() 53 | cbow = CBOW() 54 | if verbose: 55 | print 'Computing nearest names...' 56 | nearest_names = namediff.nearest_par(map(lambda c: c.name, cards), n=3) 57 | if verbose: 58 | print 'Computing nearest cards...' 59 | nearest_cards = cbow.nearest_par(cards) 60 | for i in range(0, len(cards)): 61 | cards[i].nearest_names = nearest_names[i] 62 | cards[i].nearest_cards = nearest_cards[i] 63 | if verbose: 64 | print '...Done.' 65 | 66 | def hoverimg(cardname, dist, nd): 67 | truename = nd.names[cardname] 68 | code = nd.codes[cardname] 69 | namestr = '' 70 | if for_html: 71 | if code: 72 | namestr = ('
<a class="hover_img" href="#">' + truename
73 | + '<span><img src="http://magiccards.info/scans/en/' + code + '.jpg" alt="image"/></span></a>'
74 | + ': ' + str(dist) + '\n<hr>\n') # note: literal html tags in this file are plausible reconstructions; the originals were stripped when this dump was rendered
75 | else:
76 | namestr = '<div>' + truename + ': ' + str(dist) + '</div>'
77 | elif for_forum:
78 | namestr = '[card]' + truename + '[/card]' + ': ' + str(dist) + '\n'
79 | else:
80 | namestr = truename + ': ' + str(dist) + '\n'
81 | return namestr
82 |
83 | def writecards(writer):
84 | if for_mse:
85 | # have to prepend a massive chunk of formatting info
86 | writer.write(utils.mse_prepend)
87 |
88 | if for_html:
89 | # have to prepend html info
90 | writer.write(utils.html_prepend)
91 | # separate the write function to allow for writing smaller chunks of cards at a time
92 | segments = sort_colors(cards)
93 | for i in range(len(segments)):
94 | # sort each color section by type
95 | segments[i] = sort_type(segments[i])
96 | # this allows card boxes to be colored for each color
97 | # for coloring of each box separately cardlib.Card.format() must change non-minimally
98 | writer.write('<div>') # (reconstructed tag)
99 | writehtml(writer, segments[i])
100 | writer.write("</div><hr>") # (reconstructed tags)
101 | # closing the html file
102 | writer.write(utils.html_append)
103 | return # break out of the writecards function to avoid writing cards twice
104 |
105 |
106 | for card in cards:
107 | if for_mse:
108 | writer.write(card.to_mse().encode('utf-8'))
109 | fstring = ''
110 | if card.json:
111 | fstring += 'JSON:\n' + card.json + '\n'
112 | if card.raw:
113 | fstring += 'raw:\n' + card.raw + '\n'
114 | fstring += '\n'
115 | fstring += card.format(gatherer = gatherer, for_forum = for_forum,
116 | vdump = vdump) + '\n'
117 | fstring = fstring.replace('<', '(').replace('>', ')')
118 | writer.write(('\n' + fstring[:-1]).replace('\n', '\n\t\t'))
119 | else:
120 | fstring = card.format(gatherer = gatherer, for_forum = for_forum,
121 | vdump = vdump, for_html = for_html)
122 | writer.write((fstring + '\n').encode('utf-8'))
123 |
124 | if creativity:
125 | cstring = '~~ closest cards ~~\n'
126 | nearest = card.nearest_cards
127 | for dist, cardname in nearest:
128 | cstring += hoverimg(cardname, dist, namediff)
129 | cstring += '~~ closest names ~~\n'
130 | nearest = card.nearest_names
131 | for dist, cardname in nearest:
132 | cstring += hoverimg(cardname, dist, namediff)
133 | if for_mse:
134 | cstring = ('\n\n' + cstring[:-1]).replace('\n', '\n\t\t')
135 | writer.write(cstring.encode('utf-8'))
136 |
137 | writer.write('\n'.encode('utf-8'))
138 |
139 | if for_mse:
140 | # more formatting info
141 | writer.write('version control:\n\ttype: none\napprentice code: ')
142 |
143 |
144 | def writehtml(writer, card_set):
145 | for card in card_set:
146 | fstring = card.format(gatherer = gatherer, for_forum = True,
147 | vdump = vdump, for_html = for_html)
148 | if creativity:
149 | fstring = fstring[:-6] # chop off the closing </div> to stick stuff in
150 | writer.write((fstring + '\n').encode('utf-8'))
151 |
152 | if creativity:
153 | cstring = '~~ closest cards ~~\n<br>\n'
154 | nearest = card.nearest_cards
155 | for dist, cardname in nearest:
156 | cstring += hoverimg(cardname, dist, namediff)
157 | cstring += "<hr>\n"
158 | cstring += '~~ closest names ~~\n<br>\n'
159 | nearest = card.nearest_names
160 | for dist, cardname in nearest:
161 | cstring += hoverimg(cardname, dist, namediff)
162 | cstring = '<div>' + cstring + '</div>
\n' 163 | writer.write(cstring.encode('utf-8')) 164 | 165 | writer.write('\n'.encode('utf-8')) 166 | 167 | # Sorting by colors 168 | def sort_colors(card_set): 169 | # Initialize sections 170 | red_cards = [] 171 | blue_cards = [] 172 | green_cards = [] 173 | black_cards = [] 174 | white_cards = [] 175 | multi_cards = [] 176 | colorless_cards = [] 177 | lands = [] 178 | for card in card_set: 179 | if len(card.get_colors())>1: 180 | multi_cards += [card] 181 | continue 182 | if 'R' in card.get_colors(): 183 | red_cards += [card] 184 | continue 185 | elif 'U' in card.get_colors(): 186 | blue_cards += [card] 187 | continue 188 | elif 'B' in card.get_colors(): 189 | black_cards += [card] 190 | continue 191 | elif 'G' in card.get_colors(): 192 | green_cards += [card] 193 | continue 194 | elif 'W' in card.get_colors(): 195 | white_cards += [card] 196 | continue 197 | else: 198 | if "land" in card.get_types(): 199 | lands += [card] 200 | continue 201 | colorless_cards += [card] 202 | return[white_cards, blue_cards, black_cards, red_cards, green_cards, multi_cards, colorless_cards, lands] 203 | 204 | def sort_type(card_set): 205 | sorting = ["creature", "enchantment", "instant", "sorcery", "artifact", "planeswalker"] 206 | sorted_cards = [[],[],[],[],[],[],[]] 207 | sorted_set = [] 208 | for card in card_set: 209 | types = card.get_types() 210 | for i in range(len(sorting)): 211 | if sorting[i] in types: 212 | sorted_cards[i] += [card] 213 | break 214 | else: 215 | sorted_cards[6] += [card] 216 | for value in sorted_cards: 217 | for card in value: 218 | sorted_set += [card] 219 | return sorted_set 220 | 221 | 222 | 223 | def sort_cmc(card_set): 224 | sorted_cards = [] 225 | sorted_set = [] 226 | for card in card_set: 227 | # make sure there is an empty set for each CMC 228 | while len(sorted_cards)-1 < card.get_cmc(): 229 | sorted_cards += [[]] 230 | # add card to correct set of CMC values 231 | sorted_cards[card.get_cmc()] += [card] 232 | # combine each set of CMC valued cards together 233 | for value in sorted_cards: 234 | for card in value: 235 | sorted_set += [card] 236 | return sorted_set 237 | 238 | 239 | if oname: 240 | if for_html: 241 | print oname 242 | # if ('.html' != oname[-]) 243 | # oname += '.html' 244 | if verbose: 245 | print 'Writing output to: ' + oname 246 | with open(oname, 'w') as ofile: 247 | writecards(ofile) 248 | if for_mse: 249 | # Copy whatever output file is produced, name the copy 'set' (yes, no extension). 250 | if os.path.isfile('set'): 251 | print 'ERROR: tried to overwrite existing file "set" - aborting.' 252 | return 253 | shutil.copyfile(oname, 'set') 254 | # Use the freaky mse extension instead of zip. 255 | with zipfile.ZipFile(oname+'.mse-set', mode='w') as zf: 256 | try: 257 | # Zip up the set file into oname.mse-set. 258 | zf.write('set') 259 | finally: 260 | if verbose: 261 | print 'Made an MSE set file called ' + oname + '.mse-set.' 262 | # The set file is useless outside the .mse-set, delete it. 263 | os.remove('set') 264 | else: 265 | writecards(sys.stdout) 266 | sys.stdout.flush() 267 | 268 | 269 | if __name__ == '__main__': 270 | import argparse 271 | parser = argparse.ArgumentParser() 272 | 273 | parser.add_argument('infile', #nargs='?'. 
default=None,
274 | help='encoded card file or json corpus to decode')
275 | parser.add_argument('outfile', nargs='?', default=None,
276 | help='output file, defaults to stdout')
277 | parser.add_argument('-e', '--encoding', default='std', choices=utils.formats,
278 | #help='{' + ','.join(formats) + '}',
279 | help='encoding format to use',
280 | )
281 | parser.add_argument('-g', '--gatherer', action='store_true',
282 | help='emulate Gatherer visual spoiler')
283 | parser.add_argument('-f', '--forum', action='store_true',
284 | help='use pretty mana encoding for mtgsalvation forum')
285 | parser.add_argument('-c', '--creativity', action='store_true',
286 | help='use CBOW fuzzy matching to check creativity of cards')
287 | parser.add_argument('-d', '--dump', action='store_true',
288 | help='dump out lots of information about invalid cards')
289 | parser.add_argument('-v', '--verbose', action='store_true',
290 | help='verbose output')
291 | parser.add_argument('-mse', '--mse', action='store_true',
292 | help='use Magic Set Editor 2 encoding; will output as .mse-set file')
293 | parser.add_argument('-html', '--html', action='store_true', help='create a .html file with pretty forum formatting')
294 |
295 | args = parser.parse_args()
296 |
297 | main(args.infile, args.outfile, verbose = args.verbose, encoding = args.encoding,
298 | gatherer = args.gatherer, for_forum = args.forum, for_mse = args.mse,
299 | creativity = args.creativity, vdump = args.dump, for_html = args.html)
300 |
301 | exit(0)
302 |
--------------------------------------------------------------------------------
/encode.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import sys
3 | import os
4 |
5 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'lib')
6 | sys.path.append(libdir)
7 | import re
8 | import random
9 | import utils
10 | import jdecode
11 | import cardlib
12 |
13 | def main(fname, oname = None, verbose = True, encoding = 'std',
14 | nolinetrans = False, randomize = False, nolabel = False, stable = False):
15 | fmt_ordered = cardlib.fmt_ordered_default
16 | fmt_labeled = None if nolabel else cardlib.fmt_labeled_default
17 | fieldsep = utils.fieldsep
18 | line_transformations = not nolinetrans
19 | randomize_fields = False
20 | randomize_mana = randomize
21 | initial_sep = True
22 | final_sep = True
23 |
24 | # set the properties of the encoding
25 |
26 | if encoding in ['std']:
27 | pass
28 | elif encoding in ['named']:
29 | fmt_ordered = cardlib.fmt_ordered_named
30 | elif encoding in ['noname']:
31 | fmt_ordered = cardlib.fmt_ordered_noname
32 | elif encoding in ['rfields']:
33 | randomize_fields = True
34 | final_sep = False
35 | elif encoding in ['old']:
36 | fmt_ordered = cardlib.fmt_ordered_old
37 | elif encoding in ['norarity']:
38 | fmt_ordered = cardlib.fmt_ordered_norarity
39 | elif encoding in ['vec']:
40 | pass
41 | elif encoding in ['custom']:
42 | ## put custom format decisions here ##########################
43 |
44 | ## end of custom format ######################################
45 | pass
46 | else:
47 | raise ValueError('encode.py: unknown encoding: ' + encoding)
48 |
49 | if verbose:
50 | print 'Preparing to encode:'
51 | print ' Using encoding ' + repr(encoding)
52 | if stable:
53 | print ' NOT randomizing order of cards.'
54 | if randomize_mana:
55 | print ' Randomizing order of symbols in mana costs.'
56 | if not fmt_labeled:
57 | print ' NOT labeling fields for this run (may be harder to decode).'
58 | if not line_transformations: 59 | print ' NOT using line reordering transformations' 60 | 61 | cards = jdecode.mtg_open_file(fname, verbose=verbose, linetrans=line_transformations) 62 | 63 | # This should give a random but consistent ordering, to make comparing changes 64 | # between the output of different versions easier. 65 | if not stable: 66 | random.seed(1371367) 67 | random.shuffle(cards) 68 | 69 | def writecards(writer): 70 | for card in cards: 71 | if encoding in ['vec']: 72 | writer.write(card.vectorize() + '\n\n') 73 | else: 74 | writer.write(card.encode(fmt_ordered = fmt_ordered, 75 | fmt_labeled = fmt_labeled, 76 | fieldsep = fieldsep, 77 | randomize_fields = randomize_fields, 78 | randomize_mana = randomize_mana, 79 | initial_sep = initial_sep, 80 | final_sep = final_sep) 81 | + utils.cardsep) 82 | 83 | if oname: 84 | if verbose: 85 | print 'Writing output to: ' + oname 86 | with open(oname, 'w') as ofile: 87 | writecards(ofile) 88 | else: 89 | writecards(sys.stdout) 90 | sys.stdout.flush() 91 | 92 | 93 | if __name__ == '__main__': 94 | import argparse 95 | parser = argparse.ArgumentParser() 96 | 97 | parser.add_argument('infile', 98 | help='encoded card file or json corpus to encode') 99 | parser.add_argument('outfile', nargs='?', default=None, 100 | help='output file, defaults to stdout') 101 | parser.add_argument('-e', '--encoding', default='std', choices=utils.formats, 102 | #help='{' + ','.join(formats) + '}', 103 | help='encoding format to use', 104 | ) 105 | parser.add_argument('-r', '--randomize', action='store_true', 106 | help='randomize the order of symbols in mana costs') 107 | parser.add_argument('--nolinetrans', action='store_true', 108 | help="don't reorder lines of card text") 109 | parser.add_argument('--nolabel', action='store_true', 110 | help="don't label fields") 111 | parser.add_argument('-s', '--stable', action='store_true', 112 | help="don't randomize the order of the cards") 113 | parser.add_argument('-v', '--verbose', action='store_true', 114 | help='verbose output') 115 | 116 | args = parser.parse_args() 117 | main(args.infile, args.outfile, verbose = args.verbose, encoding = args.encoding, 118 | nolinetrans = args.nolinetrans, randomize = args.randomize, nolabel = args.nolabel, 119 | stable = args.stable) 120 | exit(0) 121 | -------------------------------------------------------------------------------- /lib/cbow.py: -------------------------------------------------------------------------------- 1 | # Infinite thanks to Talcos from the mtgsalvation forums, who among 2 | # many, many other things wrote the original version of this code. 3 | # I have merely ported it to fit my needs. 4 | 5 | import re 6 | import sys 7 | import subprocess 8 | import os 9 | import struct 10 | import math 11 | import multiprocessing 12 | 13 | import utils 14 | import cardlib 15 | import transforms 16 | import namediff 17 | 18 | libdir = os.path.dirname(os.path.realpath(__file__)) 19 | datadir = os.path.realpath(os.path.join(libdir, '../data')) 20 | 21 | # multithreading control parameters 22 | cores = multiprocessing.cpu_count() 23 | 24 | # max length of vocabulary entries 25 | max_w = 50 26 | 27 | 28 | #### snip! 
#### 29 | 30 | def read_vector_file(fname): 31 | with open(fname, 'rb') as f: 32 | words = int(f.read(4)) 33 | size = int(f.read(4)) 34 | vocab = [' '] * (words * max_w) 35 | M = [] 36 | for b in range(0,words): 37 | a = 0 38 | while True: 39 | c = f.read(1) 40 | vocab[b * max_w + a] = c; 41 | if len(c) == 0 or c == ' ': 42 | break 43 | if (a < max_w) and vocab[b * max_w + a] != '\n': 44 | a += 1 45 | tmp = list(struct.unpack('f'*size,f.read(4 * size))) 46 | length = math.sqrt(sum([tmp[i] * tmp[i] for i in range(0,len(tmp))])) 47 | for i in range(0,len(tmp)): 48 | tmp[i] /= length 49 | M.append(tmp) 50 | return ((''.join(vocab)).split(),M) 51 | 52 | def makevector(vocabulary,vecs,sequence): 53 | words = sequence.split() 54 | indices = [] 55 | for word in words: 56 | if word not in vocabulary: 57 | #print("Missing word in vocabulary: " + word) 58 | continue 59 | #return [0.0]*len(vecs[0]) 60 | indices.append(vocabulary.index(word)) 61 | #res = map(sum,[vecs[i] for i in indices]) 62 | res = None 63 | for v in [vecs[i] for i in indices]: 64 | if res == None: 65 | res = v 66 | else: 67 | res = [x + y for x, y in zip(res,v)] 68 | 69 | # bad things happen if we have a vector of only unknown words 70 | if res is None: 71 | return [0.0]*len(vecs[0]) 72 | 73 | length = math.sqrt(sum([res[i] * res[i] for i in range(0,len(res))])) 74 | for i in range(0,len(res)): 75 | res[i] /= length 76 | return res 77 | 78 | #### !snip #### 79 | 80 | 81 | try: 82 | import numpy 83 | def cosine_similarity(v1,v2): 84 | A = numpy.array([v1,v2]) 85 | 86 | # from http://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat 87 | 88 | # base similarity matrix (all dot products) 89 | # replace this with A.dot(A.T).todense() for sparse representation 90 | similarity = numpy.dot(A, A.T) 91 | 92 | # squared magnitude of preference vectors (number of occurrences) 93 | square_mag = numpy.diag(similarity) 94 | 95 | # inverse squared magnitude 96 | inv_square_mag = 1 / square_mag 97 | 98 | # if it doesn't occur, set it's inverse magnitude to zero (instead of inf) 99 | inv_square_mag[numpy.isinf(inv_square_mag)] = 0 100 | 101 | # inverse of the magnitude 102 | inv_mag = numpy.sqrt(inv_square_mag) 103 | 104 | # cosine similarity (elementwise multiply by inverse magnitudes) 105 | cosine = similarity * inv_mag 106 | cosine = cosine.T * inv_mag 107 | 108 | return cosine[0][1] 109 | 110 | except ImportError: 111 | def cosine_similarity(v1,v2): 112 | #compute cosine similarity of v1 to v2: (v1 dot v1)/{||v1||*||v2||) 113 | sumxx, sumxy, sumyy = 0, 0, 0 114 | for i in range(len(v1)): 115 | x = v1[i]; y = v2[i] 116 | sumxx += x*x 117 | sumyy += y*y 118 | sumxy += x*y 119 | return sumxy/math.sqrt(sumxx*sumyy) 120 | 121 | def cosine_similarity_name(cardvec, v, name): 122 | return (cosine_similarity(cardvec, v), name) 123 | 124 | # we need to put the logic in a regular function (as opposed to a method of an object) 125 | # so that we can pass the function to multiprocessing 126 | def f_nearest(card, vocab, vecs, cardvecs, n): 127 | if isinstance(card, cardlib.Card): 128 | words = card.vectorize().split('\n\n')[0] 129 | else: 130 | # assume it's a string (that's already a vector) 131 | words = card 132 | 133 | if not words: 134 | return [] 135 | 136 | cardvec = makevector(vocab, vecs, words) 137 | 138 | comparisons = [cosine_similarity_name(cardvec, v, name) for (name, v) in cardvecs] 139 | 140 | comparisons.sort(reverse = True) 141 | comp_n = comparisons[:n] 142 | 143 | if 
isinstance(card, cardlib.Card) and card.bside: 144 | comp_n += f_nearest(card.bside, vocab, vecs, cardvecs, n=n) 145 | 146 | return comp_n 147 | 148 | def f_nearest_per_thread(workitem): 149 | (workcards, vocab, vecs, cardvecs, n) = workitem 150 | return map(lambda card: f_nearest(card, vocab, vecs, cardvecs, n), workcards) 151 | 152 | class CBOW: 153 | def __init__(self, verbose = True, 154 | vector_fname = os.path.join(datadir, 'cbow.bin'), 155 | card_fname = os.path.join(datadir, 'output.txt')): 156 | self.verbose = verbose 157 | self.cardvecs = [] 158 | 159 | if self.verbose: 160 | print 'Building a cbow model...' 161 | 162 | if self.verbose: 163 | print ' Reading binary vector data from: ' + vector_fname 164 | (vocab, vecs) = read_vector_file(vector_fname) 165 | self.vocab = vocab 166 | self.vecs = vecs 167 | 168 | if self.verbose: 169 | print ' Reading encoded cards from: ' + card_fname 170 | print ' They\'d better be in the same order as the file used to build the vector model!' 171 | with open(card_fname, 'rt') as f: 172 | text = f.read() 173 | for card_src in text.split(utils.cardsep): 174 | if card_src: 175 | card = cardlib.Card(card_src) 176 | name = card.name 177 | self.cardvecs += [(name, makevector(self.vocab, 178 | self.vecs, 179 | card.vectorize()))] 180 | 181 | if self.verbose: 182 | print '... Done.' 183 | print ' vocab size: ' + str(len(self.vocab)) 184 | print ' raw vecs: ' + str(len(self.vecs)) 185 | print ' card vecs: ' + str(len(self.cardvecs)) 186 | 187 | def nearest(self, card, n=5): 188 | return f_nearest(card, self.vocab, self.vecs, self.cardvecs, n) 189 | 190 | def nearest_par(self, cards, n=5, threads=cores): 191 | workpool = multiprocessing.Pool(threads) 192 | proto_worklist = namediff.list_split(cards, threads) 193 | worklist = map(lambda x: (x, self.vocab, self.vecs, self.cardvecs, n), proto_worklist) 194 | donelist = workpool.map(f_nearest_per_thread, worklist) 195 | return namediff.list_flatten(donelist) 196 | -------------------------------------------------------------------------------- /lib/config.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | # Utilities for handling unicode, unary numbers, mana costs, and special symbols. 4 | # For convenience we redefine everything from utils so that it can all be accessed 5 | # from the utils module. 
6 |
7 | # separators
8 | cardsep = '\n\n'
9 | fieldsep = '|'
10 | bsidesep = '\n'
11 | newline = '\\'
12 |
13 | # special indicators
14 | dash_marker = '~'
15 | bullet_marker = '='
16 | this_marker = '@'
17 | counter_marker = '%'
18 | reserved_marker = '\v'
19 | reserved_mana_marker = '$'
20 | choice_open_delimiter = '['
21 | choice_close_delimiter = ']'
22 | x_marker = 'X'
23 | tap_marker = 'T'
24 | untap_marker = 'Q'
25 | # second letter of the word
26 | rarity_common_marker = 'O'
27 | rarity_uncommon_marker = 'N'
28 | rarity_rare_marker = 'A'
29 | rarity_mythic_marker = 'Y'
30 | # with some crazy exceptions
31 | rarity_special_marker = 'E'
32 | rarity_basic_land_marker = 'L'
33 |
34 | # unambiguous synonyms
35 | counter_rename = 'uncast'
36 |
37 | # unary numbers
38 | unary_marker = '&'
39 | unary_counter = '^'
40 | unary_max = 20
41 | unary_exceptions = {
42 | 25 : 'twenty' + dash_marker + 'five',
43 | 30 : 'thirty',
44 | 40 : 'forty',
45 | 50 : 'fifty',
46 | 100: 'one hundred',
47 | 200: 'two hundred',
48 | }
49 |
50 | # field labels, to allow potential reordering of card format
51 | field_label_name = '1'
52 | field_label_rarity = '0' # 2 is part of some mana symbols {2/B} ...
53 | field_label_cost = '3'
54 | field_label_supertypes = '4'
55 | field_label_types = '5'
56 | field_label_subtypes = '6'
57 | field_label_loyalty = '7'
58 | field_label_pt = '8'
59 | field_label_text = '9'
60 |
61 | # additional fields we add to the json cards
62 | json_field_bside = 'bside'
63 | json_field_set_name = 'setName'
64 | json_field_info_code = 'magicCardsInfoCode'
65 |
--------------------------------------------------------------------------------
/lib/datalib.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | import utils
4 | from cardlib import Card
5 |
6 | # Format a list of rows of data into nice columns.
7 | # Note that it's the columns that are nice, not this code.
8 | def padrows(l):
9 | # get length for each field
10 | lens = []
11 | for ll in l:
12 | for i, field in enumerate(ll):
13 | if i < len(lens):
14 | lens[i] = max(len(str(field)), lens[i])
15 | else:
16 | lens += [len(str(field))]
17 | # now pad out to that length
18 | padded = []
19 | for ll in l:
20 | padded += ['']
21 | for i, field in enumerate(ll):
22 | s = str(field)
23 | pad = ' ' * (lens[i] - len(s))
24 | padded[-1] += (s + pad + ' ')
25 | return padded
26 | def printrows(l):
27 | for row in l:
28 | print row
29 |
30 | # index management helpers
31 | def index_size(d):
32 | return sum(map(lambda k: len(d[k]), d))
33 |
34 | def inc(d, k, obj):
35 | if k or k == 0:
36 | if k in d:
37 | d[k] += obj
38 | else:
39 | d[k] = obj
40 |
41 | # thanks gleemax
42 | def plimit(s, mlen = 1000):
43 | if len(s) > mlen:
44 | return s[:mlen] + '[...]'
45 | else:
46 | return s
47 |
48 | class Datamine:
49 | # build the global indices
50 | def __init__(self, card_srcs):
51 | # global card pools
52 | self.unparsed_cards = []
53 | self.invalid_cards = []
54 | self.cards = []
55 | self.allcards = []
56 |
57 | # global indices
58 | self.by_name = {}
59 | self.by_type = {}
60 | self.by_type_inclusive = {}
61 | self.by_supertype = {}
62 | self.by_supertype_inclusive = {}
63 | self.by_subtype = {}
64 | self.by_subtype_inclusive = {}
65 | self.by_color = {}
66 | self.by_color_inclusive = {}
67 | self.by_color_count = {}
68 | self.by_cmc = {}
69 | self.by_cost = {}
70 | self.by_power = {}
71 | self.by_toughness = {}
72 | self.by_pt = {}
73 | self.by_loyalty = {}
74 | self.by_textlines = {}
75 | self.by_textlen = {}
76 |
77 | self.indices = {
78 | 'by_name' : self.by_name,
79 | 'by_type' : self.by_type,
80 | 'by_type_inclusive' : self.by_type_inclusive,
81 | 'by_supertype' : self.by_supertype,
82 | 'by_supertype_inclusive' : self.by_supertype_inclusive,
83 | 'by_subtype' : self.by_subtype,
84 | 'by_subtype_inclusive' : self.by_subtype_inclusive,
85 | 'by_color' : self.by_color,
86 | 'by_color_inclusive' : self.by_color_inclusive,
87 | 'by_color_count' : self.by_color_count,
88 | 'by_cmc' : self.by_cmc,
89 | 'by_cost' : self.by_cost,
90 | 'by_power' : self.by_power,
91 | 'by_toughness' : self.by_toughness,
92 | 'by_pt' : self.by_pt,
93 | 'by_loyalty' : self.by_loyalty,
94 | 'by_textlines' : self.by_textlines,
95 | 'by_textlen' : self.by_textlen,
96 | }
97 |
98 | for card_src in card_srcs:
99 | # the empty card is not interesting
100 | if not card_src:
101 | continue
102 | card = Card(card_src)
103 | if card.valid:
104 | self.cards += [card]
105 | self.allcards += [card]
106 | elif card.parsed:
107 | self.invalid_cards += [card]
108 | self.allcards += [card]
109 | else:
110 | self.unparsed_cards += [card]
111 |
112 | if card.parsed:
113 | inc(self.by_name, card.name, [card])
114 |
115 | inc(self.by_type, ' '.join(card.types), [card])
116 | for t in card.types:
117 | inc(self.by_type_inclusive, t, [card])
118 | inc(self.by_supertype, ' '.join(card.supertypes), [card])
119 | for t in card.supertypes:
120 | inc(self.by_supertype_inclusive, t, [card])
121 | inc(self.by_subtype, ' '.join(card.subtypes), [card])
122 | for t in card.subtypes:
123 | inc(self.by_subtype_inclusive, t, [card])
124 |
125 | if card.cost.colors:
126 | inc(self.by_color, card.cost.colors, [card])
127 | for c in card.cost.colors:
128 | inc(self.by_color_inclusive, c, [card])
129 | inc(self.by_color_count, len(card.cost.colors), [card])
130 | else:
131 | # colorless, still want to include in these tables
132 | inc(self.by_color, 'A', [card])
133
| inc(self.by_color_inclusive, 'A', [card]) 134 | inc(self.by_color_count, 0, [card]) 135 | 136 | inc(self.by_cmc, card.cost.cmc, [card]) 137 | inc(self.by_cost, card.cost.encode() if card.cost.encode() else 'none', [card]) 138 | 139 | inc(self.by_power, card.pt_p, [card]) 140 | inc(self.by_toughness, card.pt_t, [card]) 141 | inc(self.by_pt, card.pt, [card]) 142 | 143 | inc(self.by_loyalty, card.loyalty, [card]) 144 | 145 | inc(self.by_textlines, len(card.text_lines), [card]) 146 | inc(self.by_textlen, len(card.text.encode()), [card]) 147 | 148 | # summarize the indices 149 | # Yes, this printing code is pretty terrible. 150 | def summarize(self, hsize = 10, vsize = 10, cmcsize = 20): 151 | print '====================' 152 | print str(len(self.cards)) + ' valid cards, ' + str(len(self.invalid_cards)) + ' invalid cards.' 153 | print str(len(self.allcards)) + ' cards parsed, ' + str(len(self.unparsed_cards)) + ' failed to parse' 154 | print '--------------------' 155 | print str(len(self.by_name)) + ' unique card names' 156 | print '--------------------' 157 | print (str(len(self.by_color_inclusive)) + ' represented colors (including colorless as \'A\'), ' 158 | + str(len(self.by_color)) + ' combinations') 159 | print 'Breakdown by color:' 160 | rows = [self.by_color_inclusive.keys()] 161 | rows += [[len(self.by_color_inclusive[k]) for k in rows[0]]] 162 | printrows(padrows(rows)) 163 | print 'Breakdown by number of colors:' 164 | rows = [self.by_color_count.keys()] 165 | rows += [[len(self.by_color_count[k]) for k in rows[0]]] 166 | printrows(padrows(rows)) 167 | print '--------------------' 168 | print str(len(self.by_type_inclusive)) + ' unique card types, ' + str(len(self.by_type)) + ' combinations' 169 | print 'Breakdown by type:' 170 | d = sorted(self.by_type_inclusive, 171 | lambda x,y: cmp(len(self.by_type_inclusive[x]), len(self.by_type_inclusive[y])), 172 | reverse = True) 173 | rows = [[k for k in d[:hsize]]] 174 | rows += [[len(self.by_type_inclusive[k]) for k in rows[0]]] 175 | printrows(padrows(rows)) 176 | print '--------------------' 177 | print (str(len(self.by_subtype_inclusive)) + ' unique subtypes, ' 178 | + str(len(self.by_subtype)) + ' combinations') 179 | print '-- Popular subtypes: --' 180 | d = sorted(self.by_subtype_inclusive, 181 | lambda x,y: cmp(len(self.by_subtype_inclusive[x]), len(self.by_subtype_inclusive[y])), 182 | reverse = True) 183 | rows = [] 184 | for k in d[0:vsize]: 185 | rows += [[k, len(self.by_subtype_inclusive[k])]] 186 | printrows(padrows(rows)) 187 | print '-- Top combinations: --' 188 | d = sorted(self.by_subtype, 189 | lambda x,y: cmp(len(self.by_subtype[x]), len(self.by_subtype[y])), 190 | reverse = True) 191 | rows = [] 192 | for k in d[0:vsize]: 193 | rows += [[k, len(self.by_subtype[k])]] 194 | printrows(padrows(rows)) 195 | print '--------------------' 196 | print (str(len(self.by_supertype_inclusive)) + ' unique supertypes, ' 197 | + str(len(self.by_supertype)) + ' combinations') 198 | print 'Breakdown by supertype:' 199 | d = sorted(self.by_supertype_inclusive, 200 | lambda x,y: cmp(len(self.by_supertype_inclusive[x]),len(self.by_supertype_inclusive[y])), 201 | reverse = True) 202 | rows = [[k for k in d[:hsize]]] 203 | rows += [[len(self.by_supertype_inclusive[k]) for k in rows[0]]] 204 | printrows(padrows(rows)) 205 | print '--------------------' 206 | print str(len(self.by_cmc)) + ' different CMCs, ' + str(len(self.by_cost)) + ' unique mana costs' 207 | print 'Breakdown by CMC:' 208 | d = sorted(self.by_cmc, reverse = False) 209 | 
rows = [[k for k in d[:cmcsize]]] 210 | rows += [[len(self.by_cmc[k]) for k in rows[0]]] 211 | printrows(padrows(rows)) 212 | print '-- Popular mana costs: --' 213 | d = sorted(self.by_cost, 214 | lambda x,y: cmp(len(self.by_cost[x]), len(self.by_cost[y])), 215 | reverse = True) 216 | rows = [] 217 | for k in d[0:vsize]: 218 | rows += [[utils.from_mana(k), len(self.by_cost[k])]] 219 | printrows(padrows(rows)) 220 | print '--------------------' 221 | print str(len(self.by_pt)) + ' unique p/t combinations' 222 | if len(self.by_power) > 0 and len(self.by_toughness) > 0: 223 | print ('Largest power: ' + str(max(map(len, self.by_power)) - 1) + 224 | ', largest toughness: ' + str(max(map(len, self.by_toughness)) - 1)) 225 | print '-- Popular p/t values: --' 226 | d = sorted(self.by_pt, 227 | lambda x,y: cmp(len(self.by_pt[x]), len(self.by_pt[y])), 228 | reverse = True) 229 | rows = [] 230 | for k in d[0:vsize]: 231 | rows += [[utils.from_unary(k), len(self.by_pt[k])]] 232 | printrows(padrows(rows)) 233 | print '--------------------' 234 | print 'Loyalty values:' 235 | d = sorted(self.by_loyalty, 236 | lambda x,y: cmp(len(self.by_loyalty[x]), len(self.by_loyalty[y])), 237 | reverse = True) 238 | rows = [] 239 | for k in d[0:vsize]: 240 | rows += [[utils.from_unary(k), len(self.by_loyalty[k])]] 241 | printrows(padrows(rows)) 242 | print '--------------------' 243 | if len(self.by_textlen) > 0 and len(self.by_textlines) > 0: 244 | print('Card text ranges from ' + str(min(self.by_textlen)) + ' to ' 245 | + str(max(self.by_textlen)) + ' characters in length') 246 | print('Card text ranges from ' + str(min(self.by_textlines)) + ' to ' 247 | + str(max(self.by_textlines)) + ' lines') 248 | print '-- Line counts by frequency: --' 249 | d = sorted(self.by_textlines, 250 | lambda x,y: cmp(len(self.by_textlines[x]), len(self.by_textlines[y])), 251 | reverse = True) 252 | rows = [] 253 | for k in d[0:vsize]: 254 | rows += [[k, len(self.by_textlines[k])]] 255 | printrows(padrows(rows)) 256 | print '====================' 257 | 258 | 259 | # describe outliers in the indices 260 | def outliers(self, hsize = 10, vsize = 10, dump_invalid = False): 261 | print '********************' 262 | print 'Overview of indices:' 263 | rows = [['Index Name', 'Keys', 'Total Members']] 264 | for index in self.indices: 265 | rows += [[index, len(self.indices[index]), index_size(self.indices[index])]] 266 | printrows(padrows(rows)) 267 | print '********************' 268 | if len(self.by_name) > 0: 269 | scardname = sorted(self.by_name, 270 | lambda x,y: cmp(len(x), len(y)), 271 | reverse = False)[0] 272 | print 'Shortest Cardname: (' + str(len(scardname)) + ')' 273 | print ' ' + scardname 274 | lcardname = sorted(self.by_name, 275 | lambda x,y: cmp(len(x), len(y)), 276 | reverse = True)[0] 277 | print 'Longest Cardname: (' + str(len(lcardname)) + ')' 278 | print ' ' + lcardname 279 | d = sorted(self.by_name, 280 | lambda x,y: cmp(len(self.by_name[x]), len(self.by_name[y])), 281 | reverse = True) 282 | rows = [] 283 | for k in d[0:vsize]: 284 | if len(self.by_name[k]) > 1: 285 | rows += [[k, len(self.by_name[k])]] 286 | if rows == []: 287 | print('No duplicated cardnames') 288 | else: 289 | print '-- Most duplicated names: --' 290 | printrows(padrows(rows)) 291 | else: 292 | print 'No cards indexed by name?' 
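        # [editor's note: clarifying aside, not part of the original source]
        # the frequency sorts in this file all use the python 2 cmp-function
        # form seen above; an equivalent, simpler spelling of the same sort is:
        #   d = sorted(self.by_name, key = lambda k: len(self.by_name[k]), reverse = True)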
293 | print '--------------------' 294 | if len(self.by_type) > 0: 295 | ltypes = sorted(self.by_type, 296 | lambda x,y: cmp(len(x), len(y)), 297 | reverse = True)[0] 298 | print 'Longest card type: (' + str(len(ltypes)) + ')' 299 | print ' ' + ltypes 300 | else: 301 | print 'No cards indexed by type?' 302 | if len(self.by_subtype) > 0: 303 | lsubtypes = sorted(self.by_subtype, 304 | lambda x,y: cmp(len(x), len(y)), 305 | reverse = True)[0] 306 | print 'Longest subtype: (' + str(len(lsubtypes)) + ')' 307 | print ' ' + lsubtypes 308 | else: 309 | print 'No cards indexed by subtype?' 310 | if len(self.by_supertype) > 0: 311 | lsupertypes = sorted(self.by_supertype, 312 | lambda x,y: cmp(len(x), len(y)), 313 | reverse = True)[0] 314 | print 'Longest supertype: (' + str(len(lsupertypes)) + ')' 315 | print ' ' + lsupertypes 316 | else: 317 | print 'No cards indexed by supertype?' 318 | print '--------------------' 319 | if len(self.by_cost) > 0: 320 | lcost = sorted(self.by_cost, 321 | lambda x,y: cmp(len(x), len(y)), 322 | reverse = True)[0] 323 | print 'Longest mana cost: (' + str(len(lcost)) + ')' 324 | print ' ' + utils.from_mana(lcost) 325 | print '\n' + plimit(self.by_cost[lcost][0].encode()) + '\n' 326 | else: 327 | print 'No cards indexed by cost?' 328 | if len(self.by_cmc) > 0: 329 | lcmc = sorted(self.by_cmc, reverse = True)[0] 330 | print 'Largest cmc: (' + str(lcmc) + ')' 331 | print ' ' + str(self.by_cmc[lcmc][0].cost) 332 | print '\n' + plimit(self.by_cmc[lcmc][0].encode()) 333 | else: 334 | print 'No cards indexed by cmc?' 335 | print '--------------------' 336 | if len(self.by_power) > 0: 337 | lpower = sorted(self.by_power, 338 | lambda x,y: cmp(len(x), len(y)), 339 | reverse = True)[0] 340 | print 'Largest creature power: ' + utils.from_unary(lpower) 341 | print '\n' + plimit(self.by_power[lpower][0].encode()) + '\n' 342 | else: 343 | print 'No cards indexed by power?' 344 | if len(self.by_toughness) > 0: 345 | ltoughness = sorted(self.by_toughness, 346 | lambda x,y: cmp(len(x), len(y)), 347 | reverse = True)[0] 348 | print 'Largest creature toughness: ' + utils.from_unary(ltoughness) 349 | print '\n' + plimit(self.by_toughness[ltoughness][0].encode()) 350 | else: 351 | print 'No cards indexed by toughness?' 352 | print '--------------------' 353 | if len(self.by_textlines) > 0: 354 | llines = sorted(self.by_textlines, reverse = True)[0] 355 | print 'Most lines of text in a card: ' + str(llines) 356 | print '\n' + plimit(self.by_textlines[llines][0].encode()) + '\n' 357 | else: 358 | print 'No cards indexed by line count?' 359 | if len(self.by_textlen) > 0: 360 | ltext = sorted(self.by_textlen, reverse = True)[0] 361 | print 'Most chars in a card text: ' + str(ltext) 362 | print '\n' + plimit(self.by_textlen[ltext][0].encode()) 363 | else: 364 | print 'No cards indexed by char count?' 365 | print '--------------------' 366 | print 'There were ' + str(len(self.invalid_cards)) + ' invalid cards.' 367 | if dump_invalid: 368 | for card in self.invalid_cards: 369 | print '\n' + repr(card.fields) 370 | elif len(self.invalid_cards) > 0: 371 | print 'Not summarizing.' 372 | print '--------------------' 373 | print 'There were ' + str(len(self.unparsed_cards)) + ' unparsed cards.' 374 | if dump_invalid: 375 | for card in self.unparsed_cards: 376 | print '\n' + repr(card.fields) 377 | elif len(self.unparsed_cards) > 0: 378 | print 'Not summarizing.' 
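        # [editor's note: illustrative usage sketch, not part of the original
        #  source; assumes data/output.txt is an encoded dump made by encode.py]
        #   import utils
        #   with open('data/output.txt', 'rt') as f:
        #       mine = Datamine(f.read().split(utils.cardsep))
        #   mine.summarize()
        #   mine.outliers(dump_invalid = False)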
379 | print '====================' 380 | -------------------------------------------------------------------------------- /lib/jdecode.py: -------------------------------------------------------------------------------- 1 | import json 2 | 3 | import utils 4 | import cardlib 5 | 6 | def mtg_open_json(fname, verbose = False): 7 | 8 | with open(fname, 'r') as f: 9 | jobj = json.load(f) 10 | 11 | allcards = {} 12 | asides = {} 13 | bsides = {} 14 | 15 | for k_set in jobj: 16 | set = jobj[k_set] 17 | setname = set['name'] 18 | if 'magicCardsInfoCode' in set: 19 | codename = set['magicCardsInfoCode'] 20 | else: 21 | codename = '' 22 | 23 | for card in set['cards']: 24 | card[utils.json_field_set_name] = setname 25 | card[utils.json_field_info_code] = codename 26 | 27 | cardnumber = None 28 | if 'number' in card: 29 | cardnumber = card['number'] 30 | # the lower avoids duplication of at least one card (Will-o/O'-the-Wisp) 31 | cardname = card['name'].lower() 32 | 33 | uid = set['code'] 34 | if cardnumber == None: 35 | uid = uid + '_' + cardname + '_' 36 | else: 37 | uid = uid + '_' + cardnumber 38 | 39 | # aggregate by name to avoid duplicates, not counting bsides 40 | if not uid[-1] == 'b': 41 | if cardname in allcards: 42 | allcards[cardname] += [card] 43 | else: 44 | allcards[cardname] = [card] 45 | 46 | # also aggregate aside cards by uid so we can add bsides later 47 | if uid[-1:] == 'a': 48 | asides[uid] = card 49 | if uid[-1:] == 'b': 50 | bsides[uid] = card 51 | 52 | for uid in bsides: 53 | aside_uid = uid[:-1] + 'a' 54 | if aside_uid in asides: 55 | # the second check handles the brothers yamazaki edge case 56 | if not asides[aside_uid]['name'] == bsides[uid]['name']: 57 | asides[aside_uid][utils.json_field_bside] = bsides[uid] 58 | else: 59 | pass 60 | # this exposes some coldsnap theme deck bsides that aren't 61 | # really bsides; shouldn't matter too much 62 | #print aside_uid 63 | #print bsides[uid] 64 | 65 | if verbose: 66 | print 'Opened ' + str(len(allcards)) + ' uniquely named cards.' 
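    # [editor's note: worked example, not part of the original source]
    # the uid scheme above is set code plus collector number when one exists,
    # else set code plus lowercased name; so a hypothetical flip card printed
    # as '42a'/'42b' in set 'XYZ' gets uids 'XYZ_42a' and 'XYZ_42b', and the
    # loop above hangs the b-side off its a-side under utils.json_field_bside.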
67 |     return allcards
68 | 
69 | # filters to ignore some undesirable cards, only used when opening json
70 | def default_exclude_sets(cardset):
71 |     return cardset == 'Unglued' or cardset == 'Unhinged' or cardset == 'Celebration'
72 | 
73 | def default_exclude_types(cardtype):
74 |     return cardtype in ['conspiracy']
75 | 
76 | def default_exclude_layouts(layout):
77 |     return layout in ['token', 'plane', 'scheme', 'phenomenon', 'vanguard']
78 | 
79 | # centralized logic for opening files of cards, either encoded or json
80 | def mtg_open_file(fname, verbose = False,
81 |                   linetrans = True, fmt_ordered = cardlib.fmt_ordered_default,
82 |                   exclude_sets = default_exclude_sets,
83 |                   exclude_types = default_exclude_types,
84 |                   exclude_layouts = default_exclude_layouts):
85 | 
86 |     cards = []
87 |     valid = 0
88 |     skipped = 0
89 |     invalid = 0
90 |     unparsed = 0
91 | 
92 |     if fname[-5:] == '.json':
93 |         if verbose:
94 |             print 'This looks like a json file: ' + fname
95 |         json_srcs = mtg_open_json(fname, verbose)
96 |         # sorted for stability
97 |         for json_cardname in sorted(json_srcs):
98 |             if len(json_srcs[json_cardname]) > 0:
99 |                 jcards = json_srcs[json_cardname]
100 | 
101 |                 # look for a normal rarity version, in a set we can use
102 |                 idx = 0
103 |                 card = cardlib.Card(jcards[idx], linetrans=linetrans)
104 |                 while (idx < len(jcards)
105 |                        and (card.rarity == utils.rarity_special_marker
106 |                             or exclude_sets(jcards[idx][utils.json_field_set_name]))):
107 |                     idx += 1
108 |                     if idx < len(jcards):
109 |                         card = cardlib.Card(jcards[idx], linetrans=linetrans)
110 |                 # if there isn't one, settle with index 0
111 |                 if idx >= len(jcards):
112 |                     idx = 0
113 |                     card = cardlib.Card(jcards[idx], linetrans=linetrans)
114 |                 # we could go back and look for a card satisfying one of the criteria,
115 |                 # but eh
116 | 
117 |                 skip = False
118 |                 if (exclude_sets(jcards[idx][utils.json_field_set_name])
119 |                     or exclude_layouts(jcards[idx]['layout'])):
120 |                     skip = True
121 |                 for cardtype in card.types:
122 |                     if exclude_types(cardtype):
123 |                         skip = True
124 |                 if skip:
125 |                     skipped += 1
126 |                     continue
127 | 
128 |                 if card.valid:
129 |                     valid += 1
130 |                     cards += [card]
131 |                 elif card.parsed:
132 |                     invalid += 1
133 |                     if verbose:
134 |                         print 'Invalid card: ' + json_cardname
135 |                 else:
136 |                     unparsed += 1
137 | 
138 |     # fall back to opening a normal encoded file
139 |     else:
140 |         if verbose:
141 |             print 'Opening encoded card file: ' + fname
142 |         with open(fname, 'rt') as f:
143 |             text = f.read()
144 |         for card_src in text.split(utils.cardsep):
145 |             if card_src:
146 |                 card = cardlib.Card(card_src, fmt_ordered=fmt_ordered)
147 |                 # unlike opening from json, we still want to return invalid cards
148 |                 cards += [card]
149 |                 if card.valid:
150 |                     valid += 1
151 |                 elif card.parsed:
152 |                     invalid += 1
153 |                     if verbose:
154 |                         print 'Invalid card: ' + card.name
155 |                 else:
156 |                     unparsed += 1
157 | 
158 |     if verbose:
159 |         print (str(valid) + ' valid, ' + str(skipped) + ' skipped, '
160 |                + str(invalid) + ' invalid, ' + str(unparsed) + ' failed to parse.')
161 | 
162 |     good_count = 0
163 |     bad_count = 0
164 |     for card in cards:
165 |         if not card.parsed and not card.text.text:
166 |             bad_count += 1
167 |         elif len(card.name) > 50 or len(card.rarity) > 3:
168 |             bad_count += 1
169 |         else:
170 |             good_count += 1
171 |         if good_count + bad_count > 15:
172 |             break
173 |     # random heuristic
174 |     if bad_count > 10:
175 |         print 'WARNING: Saw a bunch of unparsed cards:'
176 |         print '  If this is a legacy format, you may need to specify the field order.'
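    # [editor's note: illustrative usage, not part of the original source]
    #   cards = mtg_open_file('data/AllSets.json', verbose = True)
    #   cards = mtg_open_file('data/output.txt', fmt_ordered = cardlib.fmt_ordered_default)
    # the json path skips excluded sets / types / layouts entirely, while the
    # encoded path returns invalid cards too, as noted in the comments above.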
177 | 178 | return cards 179 | -------------------------------------------------------------------------------- /lib/manalib.py: -------------------------------------------------------------------------------- 1 | # representation for mana costs and text with embedded mana costs 2 | # data aggregating classes 3 | import re 4 | import random 5 | 6 | import utils 7 | 8 | class Manacost: 9 | '''mana cost representation with data''' 10 | 11 | # hardcoded to be dependent on the symbol structure... ah well 12 | def get_colors(self): 13 | colors = '' 14 | for sym in self.symbols: 15 | if self.symbols[sym] > 0: 16 | symcolors = re.sub(r'2|P|S|X|C', '', sym) 17 | for symcolor in symcolors: 18 | if symcolor not in colors: 19 | colors += symcolor 20 | # sort so the order is always consistent 21 | return ''.join(sorted(colors)) 22 | 23 | def check_colors(self, symbolstring): 24 | for sym in symbolstring: 25 | if not sym in self.colors: 26 | return False 27 | return True 28 | 29 | def __init__(self, src, fmt = ''): 30 | # source fields, exactly one will be set 31 | self.raw = None 32 | self.json = None 33 | # flags 34 | self.parsed = True 35 | self.valid = True 36 | self.none = False 37 | # default values for all fields 38 | self.inner = None 39 | self.cmc = 0 40 | self.colorless = 0 41 | self.sequence = [] 42 | self.symbols = {sym : 0 for sym in utils.mana_syms} 43 | self.allsymbols = {sym : 0 for sym in utils.mana_symall} 44 | self.colors = '' 45 | 46 | if fmt == 'json': 47 | self.json = src 48 | text = utils.mana_translate(self.json.upper()) 49 | else: 50 | self.raw = src 51 | text = self.raw 52 | 53 | if text == '': 54 | self.inner = '' 55 | self.none = True 56 | 57 | elif not (len(text) >= 2 and text[0] == '{' and text[-1] == '}'): 58 | self.parsed = False 59 | self.valid = False 60 | 61 | else: 62 | self.inner = text[1:-1] 63 | 64 | # structure mirrors the decoding in utils, but we pull out different data here 65 | idx = 0 66 | while idx < len(self.inner): 67 | # taking this branch is an infinite loop if unary_marker is empty 68 | if (len(utils.mana_unary_marker) > 0 and 69 | self.inner[idx:idx+len(utils.mana_unary_marker)] == utils.mana_unary_marker): 70 | idx += len(utils.mana_unary_marker) 71 | self.sequence += [utils.mana_unary_marker] 72 | elif self.inner[idx:idx+len(utils.mana_unary_counter)] == utils.mana_unary_counter: 73 | idx += len(utils.mana_unary_counter) 74 | self.sequence += [utils.mana_unary_counter] 75 | self.colorless += 1 76 | self.cmc += 1 77 | else: 78 | old_idx = idx 79 | for symlen in range(utils.mana_symlen_min, utils.mana_symlen_max + 1): 80 | encoded_sym = self.inner[idx:idx+symlen] 81 | if encoded_sym in utils.mana_symall_decode: 82 | idx += symlen 83 | # leave the sequence encoded for convenience 84 | self.sequence += [encoded_sym] 85 | sym = utils.mana_symall_decode[encoded_sym] 86 | self.allsymbols[sym] += 1 87 | if sym in utils.mana_symalt: 88 | self.symbols[utils.mana_alt(sym)] += 1 89 | else: 90 | self.symbols[sym] += 1 91 | if sym == utils.mana_X: 92 | self.cmc += 0 93 | elif utils.mana_2 in sym: 94 | self.cmc += 2 95 | else: 96 | self.cmc += 1 97 | break 98 | # otherwise we'll go into an infinite loop if we see a symbol we don't know 99 | if idx == old_idx: 100 | idx += 1 101 | self.valid = False 102 | 103 | self.colors = self.get_colors() 104 | 105 | def __str__(self): 106 | if self.none: 107 | return '_NOCOST_' 108 | return utils.mana_untranslate(utils.mana_open_delimiter + ''.join(self.sequence) 109 | + utils.mana_close_delimiter) 110 | 111 | def format(self, 
for_forum = False, for_html = False): 112 | if self.none: 113 | return '_NOCOST_' 114 | 115 | else: 116 | return utils.mana_untranslate(utils.mana_open_delimiter + ''.join(self.sequence) 117 | + utils.mana_close_delimiter, for_forum, for_html) 118 | 119 | def encode(self, randomize = False): 120 | if self.none: 121 | return '' 122 | elif randomize: 123 | # so this won't work very well if mana_unary_marker isn't empty 124 | return (utils.mana_open_delimiter 125 | + ''.join(random.sample(self.sequence, len(self.sequence))) 126 | + utils.mana_close_delimiter) 127 | else: 128 | return utils.mana_open_delimiter + ''.join(self.sequence) + utils.mana_close_delimiter 129 | 130 | def vectorize(self, delimit = False): 131 | if self.none: 132 | return '' 133 | elif delimit: 134 | ld = '(' 135 | rd = ')' 136 | else: 137 | ld = '' 138 | rd = '' 139 | return ' '.join(map(lambda s: ld + s + rd, sorted(self.sequence))) 140 | 141 | 142 | class Manatext: 143 | '''text representation with embedded mana costs''' 144 | 145 | def __init__(self, src, fmt = ''): 146 | # source fields 147 | self.raw = None 148 | self.json = None 149 | # flags 150 | self.valid = True 151 | # default values for all fields 152 | self.text = src 153 | self.costs = [] 154 | 155 | if fmt == 'json': 156 | self.json = src 157 | manastrs = re.findall(utils.mana_json_regex, src) 158 | else: 159 | self.raw = src 160 | manastrs = re.findall(utils.mana_regex, src) 161 | 162 | for manastr in manastrs: 163 | cost = Manacost(manastr, fmt) 164 | if not cost.valid: 165 | self.valid = False 166 | self.costs += [cost] 167 | self.text = self.text.replace(manastr, utils.reserved_mana_marker, 1) 168 | 169 | if (utils.mana_open_delimiter in self.text 170 | or utils.mana_close_delimiter in self.text 171 | or utils.mana_json_open_delimiter in self.text 172 | or utils.mana_json_close_delimiter in self.text): 173 | self.valid = False 174 | 175 | def __str__(self): 176 | text = self.text 177 | for cost in self.costs: 178 | text = text.replace(utils.reserved_mana_marker, str(cost), 1) 179 | return text 180 | 181 | def format(self, for_forum = False, for_html = False): 182 | text = self.text 183 | for cost in self.costs: 184 | text = text.replace(utils.reserved_mana_marker, cost.format(for_forum=for_forum, for_html=for_html), 1) 185 | if for_html: 186 | text = text.replace('\n', '
\n') 187 | return text 188 | 189 | def encode(self, randomize = False): 190 | text = self.text 191 | for cost in self.costs: 192 | text = text.replace(utils.reserved_mana_marker, cost.encode(randomize = randomize), 1) 193 | return text 194 | 195 | def vectorize(self): 196 | text = self.text 197 | special_chars = [utils.reserved_mana_marker, 198 | utils.dash_marker, 199 | utils.bullet_marker, 200 | utils.this_marker, 201 | utils.counter_marker, 202 | utils.choice_open_delimiter, 203 | utils.choice_close_delimiter, 204 | utils.newline, 205 | #utils.x_marker, 206 | utils.tap_marker, 207 | utils.untap_marker, 208 | utils.newline, 209 | ';', ':', '"', ',', '.'] 210 | for char in special_chars: 211 | text = text.replace(char, ' ' + char + ' ') 212 | text = text.replace('/', '/ /') 213 | for cost in self.costs: 214 | text = text.replace(utils.reserved_mana_marker, cost.vectorize(), 1) 215 | return ' '.join(text.split()) 216 | -------------------------------------------------------------------------------- /lib/namediff.py: -------------------------------------------------------------------------------- 1 | # This module is misleadingly named, as it has other utilities as well 2 | # that are generally necessary when trying to postprocess output by 3 | # comparing it against existing cards. 4 | 5 | import difflib 6 | import os 7 | import multiprocessing 8 | 9 | import utils 10 | import jdecode 11 | import cardlib 12 | 13 | libdir = os.path.dirname(os.path.realpath(__file__)) 14 | datadir = os.path.realpath(os.path.join(libdir, '../data')) 15 | 16 | # multithreading control parameters 17 | cores = multiprocessing.cpu_count() 18 | 19 | # split a list into n pieces; return a list of these lists 20 | # has slightly interesting behavior, in that if n is large, it can 21 | # run out of elements early and return less than n lists 22 | def list_split(l, n): 23 | if n <= 0: 24 | return l 25 | split_size = len(l) / n 26 | if len(l) % n > 0: 27 | split_size += 1 28 | return [l[i:i+split_size] for i in range(0, len(l), split_size)] 29 | 30 | # flatten a list of lists into a single list of all their contents, in order 31 | def list_flatten(l): 32 | return [item for sublist in l for item in sublist] 33 | 34 | 35 | # isolated logic for multiprocessing 36 | def f_nearest(name, matchers, n): 37 | for m in matchers: 38 | m.set_seq1(name) 39 | ratios = [(m.ratio(), m.b) for m in matchers] 40 | ratios.sort(reverse = True) 41 | 42 | if ratios[0][0] >= 1: 43 | return ratios[:1] 44 | else: 45 | return ratios[:n] 46 | 47 | def f_nearest_per_thread(workitem): 48 | (worknames, names, n) = workitem 49 | # each thread (well, process) needs to generate its own matchers 50 | matchers = [difflib.SequenceMatcher(b=name, autojunk=False) for name in names] 51 | return map(lambda name: f_nearest(name, matchers, n), worknames) 52 | 53 | class Namediff: 54 | def __init__(self, verbose = True, 55 | json_fname = os.path.join(datadir, 'AllSets.json')): 56 | self.verbose = verbose 57 | self.names = {} 58 | self.codes = {} 59 | self.cardstrings = {} 60 | 61 | if self.verbose: 62 | print 'Setting up namediff...' 
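        # [editor's note: illustrative usage, not part of the original source]
        #   nd = Namediff()
        #   nd.nearest('storm crow', n = 3)           # [(ratio, name), ...], best match first
        #   nd.nearest_par(names, threads = cores)    # same results, one list per query name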
63 | 64 | if self.verbose: 65 | print ' Reading names from: ' + json_fname 66 | json_srcs = jdecode.mtg_open_json(json_fname, verbose) 67 | namecount = 0 68 | for json_cardname in sorted(json_srcs): 69 | if len(json_srcs[json_cardname]) > 0: 70 | jcards = json_srcs[json_cardname] 71 | 72 | # just use the first one 73 | idx = 0 74 | card = cardlib.Card(jcards[idx]) 75 | name = card.name 76 | jname = jcards[idx]['name'] 77 | jcode = jcards[idx][utils.json_field_info_code] 78 | if 'number' in jcards[idx]: 79 | jnum = jcards[idx]['number'] 80 | else: 81 | jnum = '' 82 | 83 | if name in self.names: 84 | print ' Duplicate name ' + name + ', ignoring.' 85 | else: 86 | self.names[name] = jname 87 | self.cardstrings[name] = card.encode() 88 | if jcode and jnum: 89 | self.codes[name] = jcode + '/' + jnum + '.jpg' 90 | else: 91 | self.codes[name] = '' 92 | namecount += 1 93 | 94 | print ' Read ' + str(namecount) + ' unique cardnames' 95 | print ' Building SequenceMatcher objects.' 96 | 97 | self.matchers = [difflib.SequenceMatcher(b=n, autojunk=False) for n in self.names] 98 | self.card_matchers = [difflib.SequenceMatcher(b=self.cardstrings[n], autojunk=False) for n in self.cardstrings] 99 | 100 | print '... Done.' 101 | 102 | def nearest(self, name, n=3): 103 | return f_nearest(name, self.matchers, n) 104 | 105 | def nearest_par(self, names, n=3, threads=cores): 106 | workpool = multiprocessing.Pool(threads) 107 | proto_worklist = list_split(names, threads) 108 | worklist = map(lambda x: (x, self.names, n), proto_worklist) 109 | donelist = workpool.map(f_nearest_per_thread, worklist) 110 | return list_flatten(donelist) 111 | 112 | def nearest_card(self, card, n=5): 113 | return f_nearest(card.encode(), self.card_matchers, n) 114 | 115 | def nearest_card_par(self, cards, n=5, threads=cores): 116 | workpool = multiprocessing.Pool(threads) 117 | proto_worklist = list_split(cards, threads) 118 | worklist = map(lambda x: (map(lambda c: c.encode(), x), self.cardstrings.values(), n), proto_worklist) 119 | donelist = workpool.map(f_nearest_per_thread, worklist) 120 | return list_flatten(donelist) 121 | -------------------------------------------------------------------------------- /lib/nltk_model.py: -------------------------------------------------------------------------------- 1 | # Natural Language Toolkit: Language Models 2 | # 3 | # Copyright (C) 2001-2014 NLTK Project 4 | # Authors: Steven Bird 5 | # Daniel Blanchard 6 | # Ilia Kurenkov 7 | # URL: 8 | # For license information, see LICENSE.TXT 9 | # 10 | # adapted for mtgencode Nov. 2015 11 | # an attempt was made to preserve the exact functionality of this code, 12 | # hampered somewhat by its brokenness 13 | 14 | from __future__ import unicode_literals 15 | 16 | from math import log 17 | 18 | from nltk.probability import ConditionalProbDist, ConditionalFreqDist, LidstoneProbDist 19 | from nltk.util import ngrams 20 | from nltk_model_api import ModelI 21 | 22 | from nltk import compat 23 | 24 | 25 | def _estimator(fdist, **estimator_kwargs): 26 | """ 27 | Default estimator function using a LidstoneProbDist. 28 | """ 29 | # can't be an instance method of NgramModel as they 30 | # can't be pickled either. 31 | return LidstoneProbDist(fdist, 0.001, **estimator_kwargs) 32 | 33 | 34 | @compat.python_2_unicode_compatible 35 | class NgramModel(ModelI): 36 | """ 37 | A processing interface for assigning a probability to the next word. 
38 | """ 39 | 40 | def __init__(self, n, train, pad_left=True, pad_right=False, 41 | estimator=None, **estimator_kwargs): 42 | """ 43 | Create an ngram language model to capture patterns in n consecutive 44 | words of training text. An estimator smooths the probabilities derived 45 | from the text and may allow generation of ngrams not seen during 46 | training. See model.doctest for more detailed testing 47 | 48 | >>> from nltk.corpus import brown 49 | >>> lm = NgramModel(3, brown.words(categories='news')) 50 | >>> lm 51 | 52 | >>> lm._backoff 53 | 54 | >>> lm.entropy(brown.words(categories='humor')) 55 | ... # doctest: +ELLIPSIS 56 | 12.0399... 57 | 58 | :param n: the order of the language model (ngram size) 59 | :type n: int 60 | :param train: the training text 61 | :type train: list(str) or list(list(str)) 62 | :param pad_left: whether to pad the left of each sentence with an (n-1)-gram of empty strings 63 | :type pad_left: bool 64 | :param pad_right: whether to pad the right of each sentence with an (n-1)-gram of empty strings 65 | :type pad_right: bool 66 | :param estimator: a function for generating a probability distribution 67 | :type estimator: a function that takes a ConditionalFreqDist and 68 | returns a ConditionalProbDist 69 | :param estimator_kwargs: Extra keyword arguments for the estimator 70 | :type estimator_kwargs: (any) 71 | """ 72 | 73 | # protection from cryptic behavior for calling programs 74 | # that use the pre-2.0.2 interface 75 | assert(isinstance(pad_left, bool)) 76 | assert(isinstance(pad_right, bool)) 77 | 78 | self._lpad = ('',) * (n - 1) if pad_left else () 79 | self._rpad = ('',) * (n - 1) if pad_right else () 80 | 81 | # make sure n is greater than zero, otherwise print it 82 | assert (n > 0), n 83 | 84 | # For explicitness save the check whether this is a unigram model 85 | self.is_unigram_model = (n == 1) 86 | # save the ngram order number 87 | self._n = n 88 | # save left and right padding 89 | self._lpad = ('',) * (n - 1) if pad_left else () 90 | self._rpad = ('',) * (n - 1) if pad_right else () 91 | 92 | if estimator is None: 93 | estimator = _estimator 94 | 95 | cfd = ConditionalFreqDist() 96 | 97 | # set read-only ngrams set (see property declaration below to reconfigure) 98 | self._ngrams = set() 99 | 100 | # If given a list of strings instead of a list of lists, create enclosing list 101 | if (train is not None) and isinstance(train[0], compat.string_types): 102 | train = [train] 103 | 104 | # we need to keep track of the number of word types we encounter 105 | vocabulary = set() 106 | for sent in train: 107 | raw_ngrams = ngrams(sent, n, pad_left, pad_right, pad_symbol='') 108 | for ngram in raw_ngrams: 109 | self._ngrams.add(ngram) 110 | context = tuple(ngram[:-1]) 111 | token = ngram[-1] 112 | cfd[context][token] += 1 113 | vocabulary.add(token) 114 | 115 | # Unless number of bins is explicitly passed, we should use the number 116 | # of word types encountered during training as the bins value. 117 | # If right padding is on, this includes the padding symbol. 
118 | if 'bins' not in estimator_kwargs: 119 | estimator_kwargs['bins'] = len(vocabulary) 120 | 121 | self._model = ConditionalProbDist(cfd, estimator, **estimator_kwargs) 122 | 123 | # recursively construct the lower-order models 124 | if not self.is_unigram_model: 125 | self._backoff = NgramModel(n-1, train, 126 | pad_left, pad_right, 127 | estimator, 128 | **estimator_kwargs) 129 | 130 | self._backoff_alphas = dict() 131 | # For each condition (or context) 132 | for ctxt in cfd.conditions(): 133 | backoff_ctxt = ctxt[1:] 134 | backoff_total_pr = 0.0 135 | total_observed_pr = 0.0 136 | 137 | # this is the subset of words that we OBSERVED following 138 | # this context. 139 | # i.e. Count(word | context) > 0 140 | for words in self._words_following(ctxt, cfd): 141 | 142 | # so, _words_following as fixed gives back a whole list now... 143 | for word in words: 144 | 145 | total_observed_pr += self.prob(word, ctxt) 146 | # we also need the total (n-1)-gram probability of 147 | # words observed in this n-gram context 148 | backoff_total_pr += self._backoff.prob(word, backoff_ctxt) 149 | 150 | assert (0 <= total_observed_pr <= 1), total_observed_pr 151 | # beta is the remaining probability weight after we factor out 152 | # the probability of observed words. 153 | # As a sanity check, both total_observed_pr and backoff_total_pr 154 | # must be GE 0, since probabilities are never negative 155 | beta = 1.0 - total_observed_pr 156 | 157 | # backoff total has to be less than one, otherwise we get 158 | # an error when we try subtracting it from 1 in the denominator 159 | assert (0 <= backoff_total_pr < 1), backoff_total_pr 160 | alpha_ctxt = beta / (1.0 - backoff_total_pr) 161 | 162 | self._backoff_alphas[ctxt] = alpha_ctxt 163 | 164 | # broken 165 | # def _words_following(self, context, cond_freq_dist): 166 | # for ctxt, word in cond_freq_dist.iterkeys(): 167 | # if ctxt == context: 168 | # yield word 169 | 170 | # fixed 171 | def _words_following(self, context, cond_freq_dist): 172 | for ctxt in cond_freq_dist.iterkeys(): 173 | if ctxt == context: 174 | yield cond_freq_dist[ctxt].keys() 175 | 176 | def prob(self, word, context): 177 | """ 178 | Evaluate the probability of this word in this context using Katz Backoff. 179 | 180 | :param word: the word to get the probability of 181 | :type word: str 182 | :param context: the context the word is in 183 | :type context: list(str) 184 | """ 185 | context = tuple(context) 186 | if (context + (word,) in self._ngrams) or (self.is_unigram_model): 187 | return self._model[context].prob(word) 188 | else: 189 | return self._alpha(context) * self._backoff.prob(word, context[1:]) 190 | 191 | def _alpha(self, context): 192 | """Get the backoff alpha value for the given context 193 | """ 194 | error_message = "Alphas and backoff are not defined for unigram models" 195 | assert not self.is_unigram_model, error_message 196 | 197 | if context in self._backoff_alphas: 198 | return self._backoff_alphas[context] 199 | else: 200 | return 1 201 | 202 | def logprob(self, word, context): 203 | """ 204 | Evaluate the (negative) log probability of this word in this context. 
205 | 
206 |         :param word: the word to get the probability of
207 |         :type word: str
208 |         :param context: the context the word is in
209 |         :type context: list(str)
210 |         """
211 |         return -log(self.prob(word, context), 2)
212 | 
213 |     @property
214 |     def ngrams(self):
215 |         return self._ngrams
216 | 
217 |     @property
218 |     def backoff(self):
219 |         return self._backoff
220 | 
221 |     @property
222 |     def model(self):
223 |         return self._model
224 | 
225 |     def choose_random_word(self, context):
226 |         '''
227 |         Randomly select a word that is likely to appear in this context.
228 | 
229 |         :param context: the context the word is in
230 |         :type context: list(str)
231 |         '''
232 | 
233 |         return self.generate(1, context)[-1]
234 | 
235 |     # NB, this will always start with the same word if the model
236 |     # was trained on a single text
237 |     def generate(self, num_words, context=()):
238 |         '''
239 |         Generate random text based on the language model.
240 | 
241 |         :param num_words: number of words to generate
242 |         :type num_words: int
243 |         :param context: initial words in generated string
244 |         :type context: list(str)
245 |         '''
246 | 
247 |         text = list(context)
248 |         for i in range(num_words):
249 |             text.append(self._generate_one(text))
250 |         return text
251 | 
252 |     def _generate_one(self, context):
253 |         context = (self._lpad + tuple(context))[-self._n + 1:]
254 |         if context in self:
255 |             return self[context].generate()
256 |         elif self._n > 1:
257 |             return self._backoff._generate_one(context[1:])
258 |         else:
259 |             return '.'
260 | 
261 |     def entropy(self, text):
262 |         """
263 |         Calculate the approximate cross-entropy of the n-gram model for a
264 |         given evaluation text.
265 |         This is the average log probability of each word in the text.
266 | 
267 |         :param text: words to use for evaluation
268 |         :type text: list(str)
269 |         """
270 | 
271 |         H = 0.0     # entropy is conventionally denoted by "H"
272 |         text = list(self._lpad) + text + list(self._rpad)
273 |         for i in range(self._n - 1, len(text)):
274 |             context = tuple(text[(i - self._n + 1):i])
275 |             token = text[i]
276 |             H += self.logprob(token, context)
277 |         return H / float(len(text) - (self._n - 1))
278 | 
279 |     def perplexity(self, text):
280 |         """
281 |         Calculates the perplexity of the given text.
282 |         This is simply 2 ** cross-entropy for the text.
283 | 
284 |         :param text: words to calculate perplexity of
285 |         :type text: list(str)
286 |         """
287 | 
288 |         return pow(2.0, self.entropy(text))
289 | 
290 |     def __contains__(self, item):
291 |         if not isinstance(item, tuple):
292 |             item = (item,)
293 |         return item in self._model
294 | 
295 |     def __getitem__(self, item):
296 |         if not isinstance(item, tuple):
297 |             item = (item,)
298 |         return self._model[item]
299 | 
300 |     def __repr__(self):
301 |         return '<NgramModel with %d %d-grams>' % (len(self._ngrams), self._n)
302 | 
303 | if __name__ == "__main__":
304 |     import doctest
305 |     doctest.testmod(optionflags=doctest.NORMALIZE_WHITESPACE)
306 | 
--------------------------------------------------------------------------------
/lib/nltk_model_api.py:
--------------------------------------------------------------------------------
1 | # Natural Language Toolkit: API for Language Models
2 | #
3 | # Copyright (C) 2001-2014 NLTK Project
4 | # Author: Steven Bird
5 | # URL: <http://nltk.org/>
6 | # For license information, see LICENSE.TXT
7 | #
8 | # imported for use in mtgencode Nov. 2015
9 | 
10 | 
11 | # should this be a subclass of ConditionalProbDistI?
12 | 
13 | class ModelI(object):
14 |     """
15 |     A processing interface for assigning a probability to the next word.
16 | """ 17 | 18 | def __init__(self): 19 | '''Create a new language model.''' 20 | raise NotImplementedError() 21 | 22 | def prob(self, word, context): 23 | '''Evaluate the probability of this word in this context.''' 24 | raise NotImplementedError() 25 | 26 | def logprob(self, word, context): 27 | '''Evaluate the (negative) log probability of this word in this context.''' 28 | raise NotImplementedError() 29 | 30 | def choose_random_word(self, context): 31 | '''Randomly select a word that is likely to appear in this context.''' 32 | raise NotImplementedError() 33 | 34 | def generate(self, n): 35 | '''Generate n words of text from the language model.''' 36 | raise NotImplementedError() 37 | 38 | def entropy(self, text): 39 | '''Evaluate the total entropy of a message with respect to the model. 40 | This is the sum of the log probability of each word in the message.''' 41 | raise NotImplementedError() 42 | 43 | -------------------------------------------------------------------------------- /lib/transforms.py: -------------------------------------------------------------------------------- 1 | # transform passes used to encode / decode cards 2 | import re 3 | import random 4 | 5 | # These could probably use a little love... They tend to hardcode in lots 6 | # of things very specific to the mtgjson format. 7 | 8 | import utils 9 | 10 | cardsep = utils.cardsep 11 | fieldsep = utils.fieldsep 12 | bsidesep = utils.bsidesep 13 | newline = utils.newline 14 | dash_marker = utils.dash_marker 15 | bullet_marker = utils.bullet_marker 16 | this_marker = utils.this_marker 17 | counter_marker = utils.counter_marker 18 | reserved_marker = utils.reserved_marker 19 | choice_open_delimiter = utils.choice_open_delimiter 20 | choice_close_delimiter = utils.choice_close_delimiter 21 | x_marker = utils.x_marker 22 | tap_marker = utils.tap_marker 23 | untap_marker = utils.untap_marker 24 | counter_rename = utils.counter_rename 25 | unary_marker = utils.unary_marker 26 | unary_counter = utils.unary_counter 27 | 28 | 29 | # Name Passes. 30 | 31 | 32 | def name_pass_1_sanitize(s): 33 | s = s.replace('!', '') 34 | s = s.replace('?', '') 35 | s = s.replace('-', dash_marker) 36 | s = s.replace('100,000', 'one hundred thousand') 37 | s = s.replace('1,000', 'one thousand') 38 | s = s.replace('1996', 'nineteen ninety-six') 39 | return s 40 | 41 | 42 | # Name unpasses. 43 | 44 | 45 | # particularly helpful if you want to call text_unpass_8_unicode later 46 | # and NOT have it stick unicode long dashes into names. 47 | def name_unpass_1_dashes(s): 48 | return s.replace(dash_marker, '-') 49 | 50 | 51 | # Text Passes. 52 | 53 | 54 | def text_pass_1_strip_rt(s): 55 | return re.sub(r'\(.*\)', '', s) 56 | 57 | 58 | def text_pass_2_cardname(s, name): 59 | # Here are some fun edge cases, thanks to jml34 on the forum for 60 | # pointing them out. 61 | if name == 'sacrifice': 62 | s = s.replace(name, this_marker, 1) 63 | return s 64 | elif name == 'fear': 65 | return s 66 | 67 | s = s.replace(name, this_marker) 68 | 69 | # So, some legends don't use the full cardname in their text box... 70 | # this check finds about 400 of them. 71 | nameparts = name.split(',') 72 | if len(nameparts) > 1: 73 | mininame = nameparts[0] 74 | new_s = s.replace(mininame, this_marker) 75 | if not new_s == s: 76 | s = new_s 77 | 78 | # A few others don't have a convenient comma to detect their nicknames, 79 | # so we override them here. 
80 | overrides = [ 81 | # detectable by splitting on 'the', though that might cause other issues 82 | 'crovax', 83 | 'rashka', 84 | 'phage', 85 | 'shimatsu', 86 | # random and arbitrary: they have a last name, 1996 world champion, etc. 87 | 'world champion', 88 | 'axelrod', 89 | 'hazezon', 90 | 'rubinia', 91 | 'rasputin', 92 | 'hivis', 93 | ] 94 | 95 | for override in overrides: 96 | s = s.replace(override, this_marker) 97 | 98 | # stupid planeswalker abilities 99 | s = s.replace('to him.', 'to ' + this_marker + '.') 100 | s = s.replace('to him this', 'to ' + this_marker + ' this') 101 | s = s.replace('to himself', 'to itself') 102 | s = s.replace("he's", this_marker + ' is') 103 | 104 | # sometimes we actually don't want to do this replacement 105 | s = s.replace('named ' + this_marker, 'named ' + name) 106 | s = s.replace('name is still ' + this_marker, 'name is still ' + name) 107 | s = s.replace('named keeper of ' + this_marker, 'named keeper of ' + name) 108 | s = s.replace('named kobolds of ' + this_marker, 'named kobolds of ' + name) 109 | s = s.replace('named sword of kaldra, ' + this_marker, 'named sword of kaldra, ' + name) 110 | 111 | return s 112 | 113 | 114 | def text_pass_3_unary(s): 115 | return utils.to_unary(s) 116 | 117 | 118 | # Run only after doing unary conversion. 119 | def text_pass_4a_dashes(s): 120 | s = s.replace('-' + unary_marker, reserved_marker) 121 | s = s.replace('-', dash_marker) 122 | s = s.replace(reserved_marker, '-' + unary_marker) 123 | 124 | # level up is annoying 125 | levels = re.findall(r'level &\^*\-&', s) 126 | for level in levels: 127 | newlevel = level.replace('-', dash_marker) 128 | s = s.replace(level, newlevel) 129 | 130 | levels = re.findall(r'level &\^*\+', s) 131 | for level in levels: 132 | newlevel = level.replace('+', dash_marker) 133 | s = s.replace(level, newlevel) 134 | 135 | # and we still have the ~x issue 136 | return s 137 | 138 | 139 | # Run this after fixing dashes, because this unbreaks the ~x issue. 140 | # Also probably don't run this on names, there are a few names with x~ in them. 141 | def text_pass_4b_x(s): 142 | s = s.replace(dash_marker + 'x', '-' + x_marker) 143 | s = s.replace('+x', '+' + x_marker) 144 | s = s.replace(' x ', ' ' + x_marker + ' ') 145 | s = s.replace('x:', x_marker + ':') 146 | s = s.replace('x~', x_marker + '~') 147 | s = s.replace(u'x\u2014', x_marker + u'\u2014') 148 | s = s.replace('x.', x_marker + '.') 149 | s = s.replace('x,', x_marker + ',') 150 | s = s.replace('x is', x_marker + ' is') 151 | s = s.replace('x can\'t', x_marker + ' can\'t') 152 | s = s.replace('x/x', x_marker + '/' + x_marker) 153 | s = s.replace('x target', x_marker + ' target') 154 | s = s.replace('si' + x_marker + ' target', 'six target') 155 | s = s.replace('avara' + x_marker, 'avarax') 156 | # there's also some stupid ice age card that wants -x/-y 157 | s = s.replace('/~', '/-') 158 | return s 159 | 160 | 161 | # Call this before replacing newlines. 162 | # This one ends up being really bad because of the confusion 163 | # with 'counter target spell or ability'. 164 | def text_pass_5_counters(s): 165 | # so, big fat old dictionary time!!!!!!!!! 
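    # [editor's note: worked example, not part of the original source; '%' is
    #  the counter_marker, as the commented-out checks further down suggest]
    # 'put a charge counter on it' becomes
    #   'countertype % charge' + '\n' + 'put a % counter on it'
    # via the replacement loop and the prepended countertype line below.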
166 | allcounters = [ 167 | 'time counter', 168 | 'devotion counter', 169 | 'charge counter', 170 | 'ki counter', 171 | 'matrix counter', 172 | 'spore counter', 173 | 'poison counter', 174 | 'quest counter', 175 | 'hatchling counter', 176 | 'storage counter', 177 | 'growth counter', 178 | 'paralyzation counter', 179 | 'energy counter', 180 | 'study counter', 181 | 'glyph counter', 182 | 'depletion counter', 183 | 'sleight counter', 184 | 'loyalty counter', 185 | 'hoofprint counter', 186 | 'wage counter', 187 | 'echo counter', 188 | 'lore counter', 189 | 'page counter', 190 | 'divinity counter', 191 | 'mannequin counter', 192 | 'ice counter', 193 | 'fade counter', 194 | 'pain counter', 195 | #'age counter', 196 | 'gold counter', 197 | 'muster counter', 198 | 'infection counter', 199 | 'plague counter', 200 | 'fate counter', 201 | 'slime counter', 202 | 'shell counter', 203 | 'credit counter', 204 | 'despair counter', 205 | 'globe counter', 206 | 'currency counter', 207 | 'blood counter', 208 | 'soot counter', 209 | 'carrion counter', 210 | 'fuse counter', 211 | 'filibuster counter', 212 | 'wind counter', 213 | 'hourglass counter', 214 | 'trap counter', 215 | 'corpse counter', 216 | 'awakening counter', 217 | 'verse counter', 218 | 'scream counter', 219 | 'doom counter', 220 | 'luck counter', 221 | 'intervention counter', 222 | 'eyeball counter', 223 | 'flood counter', 224 | 'eon counter', 225 | 'death counter', 226 | 'delay counter', 227 | 'blaze counter', 228 | 'magnet counter', 229 | 'feather counter', 230 | 'shield counter', 231 | 'wish counter', 232 | 'petal counter', 233 | 'music counter', 234 | 'pressure counter', 235 | 'manifestation counter', 236 | #'net counter', 237 | 'velocity counter', 238 | 'vitality counter', 239 | 'treasure counter', 240 | 'pin counter', 241 | 'bounty counter', 242 | 'rust counter', 243 | 'mire counter', 244 | 'tower counter', 245 | #'ore counter', 246 | 'cube counter', 247 | 'strife counter', 248 | 'elixir counter', 249 | 'hunger counter', 250 | 'level counter', 251 | 'winch counter', 252 | 'fungus counter', 253 | 'training counter', 254 | 'theft counter', 255 | 'arrowhead counter', 256 | 'sleep counter', 257 | 'healing counter', 258 | 'mining counter', 259 | 'dream counter', 260 | 'aim counter', 261 | 'arrow counter', 262 | 'javelin counter', 263 | 'gem counter', 264 | 'bribery counter', 265 | 'mine counter', 266 | 'omen counter', 267 | 'phylactery counter', 268 | 'tide counter', 269 | 'polyp counter', 270 | 'petrification counter', 271 | 'shred counter', 272 | 'pupa counter', 273 | 'crystal counter', 274 | ] 275 | usedcounters = [] 276 | for countername in allcounters: 277 | if countername in s: 278 | usedcounters += [countername] 279 | s = s.replace(countername, counter_marker + ' counter') 280 | 281 | # oh god some of the counter names are suffixes of others... 
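    # [editor's note: clarifying example, not part of the original source]
    # 'age counter' is a suffix of 'storage counter', and 'ore counter' hides
    # inside 'one or more counters' (doubling season), which is exactly what
    # the 'more counter' guard below protects against.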
282 | shortcounters = [ 283 | 'age counter', 284 | 'net counter', 285 | 'ore counter', 286 | ] 287 | for countername in shortcounters: 288 | # SUPER HACKY fix for doubling season 289 | if countername in s and 'more counter' not in s: 290 | usedcounters += [countername] 291 | s = s.replace(countername, counter_marker + ' counter') 292 | 293 | # miraculously this doesn't seem to happen 294 | # if len(usedcounters) > 1: 295 | # print usedcounters 296 | 297 | # we haven't done newline replacement yet, so use actual newlines 298 | if len(usedcounters) == 1: 299 | # and yeah, this line of code can blow up in all kinds of different ways 300 | s = 'countertype ' + counter_marker + ' ' + usedcounters[0].split()[0] + '\n' + s 301 | 302 | return s 303 | 304 | 305 | # The word 'counter' is confusing when used to refer to what we do to spells 306 | # and sometimes abilities to make them not happen. Let's rename that. 307 | # Call this after doing the counter replacement to simplify the regexes. 308 | counter_rename = 'uncast' 309 | def text_pass_6_uncast(s): 310 | # pre-checks to make sure we aren't doing anything dumb 311 | # if '% counter target ' in s or '^ counter target ' in s or '& counter target ' in s: 312 | # print s + '\n' 313 | # if '% counter a ' in s or '^ counter a ' in s or '& counter a ' in s: 314 | # print s + '\n' 315 | # if '% counter all ' in s or '^ counter all ' in s or '& counter all ' in s: 316 | # print s + '\n' 317 | # if '% counter a ' in s or '^ counter a ' in s or '& counter a ' in s: 318 | # print s + '\n' 319 | # if '% counter that ' in s or '^ counter that ' in s or '& counter that ' in s: 320 | # print s + '\n' 321 | # if '% counter @' in s or '^ counter @' in s or '& counter @' in s: 322 | # print s + '\n' 323 | # if '% counter the ' in s or '^ counter the ' in s or '& counter the ' in s: 324 | # print s + '\n' 325 | 326 | # counter target 327 | s = s.replace('counter target ', counter_rename + ' target ') 328 | # counter a 329 | s = s.replace('counter a ', counter_rename + ' a ') 330 | # counter all 331 | s = s.replace('counter all ', counter_rename + ' all ') 332 | # counters a 333 | s = s.replace('counters a ', counter_rename + 's a ') 334 | # countered (this could get weird in terms of englishing the word; lets just go for hilarious) 335 | s = s.replace('countered', counter_rename + 'ed') 336 | # counter that 337 | s = s.replace('counter that ', counter_rename + ' that ') 338 | # counter @ 339 | s = s.replace('counter @', counter_rename + ' @') 340 | # counter it (this is tricky 341 | s = s.replace(', counter it', ', ' + counter_rename + ' it') 342 | # counter the (it happens at least once, thanks wizards!) 343 | s = s.replace('counter the ', counter_rename + ' the ') 344 | # counter up to 345 | s = s.replace('counter up to ', counter_rename + ' up to ') 346 | 347 | # check if the word exists in any other context 348 | # if 'counter' in (s.replace('% counter', '').replace('countertype', '') 349 | # .replace('^ counter', '').replace('& counter', ''): 350 | # print s + '\n' 351 | 352 | # whew! by manual inspection of a few dozen texts, it looks like this about covers it. 353 | return s 354 | 355 | 356 | # Run after fixing dashes, it makes the regexes better, but before replacing newlines. 
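# [editor's note: worked example, not part of the original source; '&' and '^'
#  are the unary marker and counter, as in the level-up regexes above]
# 'choose two \u2014\n\u2022 foo\n\u2022 bar\n' is rewritten by the helper below into
# '[&^^ \u2022 foo \u2022 bar]\n': the count in unary, the options joined onto
# one delimited line.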
357 | def text_pass_7_choice(s): 358 | # the idea is to take 'choose n ~\n=ability\n=ability\n' 359 | # to '[n = ability = ability]\n' 360 | 361 | def choice_formatting_helper(s_helper, prefix, count, suffix = ''): 362 | single_choices = re.findall(ur'(' + prefix + ur'\n?(\u2022.*(\n|$))+)', s_helper) 363 | for choice in single_choices: 364 | newchoice = choice[0] 365 | newchoice = newchoice.replace(prefix, unary_marker + (unary_counter * count) + suffix) 366 | newchoice = newchoice.replace('\n', ' ') 367 | if newchoice[-1:] == ' ': 368 | newchoice = choice_open_delimiter + newchoice[:-1] + choice_close_delimiter + '\n' 369 | else: 370 | newchoice = choice_open_delimiter + newchoice + choice_close_delimiter 371 | s_helper = s_helper.replace(choice[0], newchoice) 372 | return s_helper 373 | 374 | s = choice_formatting_helper(s, ur'choose one \u2014', 1) 375 | s = choice_formatting_helper(s, ur'choose one \u2014 ', 1) # ty Promise of Power 376 | s = choice_formatting_helper(s, ur'choose two \u2014', 2) 377 | s = choice_formatting_helper(s, ur'choose two \u2014 ', 2) # ty Profane Command 378 | s = choice_formatting_helper(s, ur'choose one or both \u2014', 0) 379 | s = choice_formatting_helper(s, ur'choose one or more \u2014', 0) 380 | s = choice_formatting_helper(s, ur'choose khans or dragons.', 1) 381 | # this is for 'an opponent chooses one', which will be a bit weird but still work out 382 | s = choice_formatting_helper(s, ur'chooses one \u2014', 1) 383 | # Demonic Pact has 'choose one that hasn't been chosen'... 384 | s = choice_formatting_helper(s, ur"choose one that hasn't been chosen \u2014", 1, 385 | suffix=" that hasn't been chosen") 386 | # 'choose n. you may choose the same mode more than once.' 387 | s = choice_formatting_helper(s, ur'choose three. you may choose the same mode more than once.', 3, 388 | suffix='. 
you may choose the same mode more than once.') 389 | 390 | return s 391 | 392 | 393 | # do before removing newlines 394 | # might as well do this after countertype because we probably care more about 395 | # the location of the equip cost 396 | def text_pass_8_equip(s): 397 | equips = re.findall(r'equip ' + utils.mana_json_regex + r'.?$', s) 398 | # there don't seem to be any cases with more than one 399 | if len(equips) == 1: 400 | equip = equips[0] 401 | s = s.replace('\n' + equip, '') 402 | s = s.replace(equip, '') 403 | 404 | if equip[-1:] == ' ': 405 | equip = equip[0:-1] 406 | 407 | if s == '': 408 | s = equip 409 | else: 410 | s = equip + '\n' + s 411 | 412 | nonmana = re.findall(ur'(equip\u2014.*(\n|$))', s) 413 | if len(nonmana) == 1: 414 | equip = nonmana[0][0] 415 | s = s.replace('\n' + equip, '') 416 | s = s.replace(equip, '') 417 | 418 | if equip[-1:] == ' ': 419 | equip = equip[0:-1] 420 | 421 | if s == '': 422 | s = equip 423 | else: 424 | s = equip + '\n' + s 425 | 426 | return s 427 | 428 | 429 | def text_pass_9_newlines(s): 430 | return s.replace('\n', utils.newline) 431 | 432 | 433 | def text_pass_10_symbols(s): 434 | return utils.to_symbols(s) 435 | 436 | 437 | # reorder the lines of text into a canonical form: 438 | # first enchant and equip 439 | # then other keywords, one per line (things with no period on the end) 440 | # then other abilities 441 | # then kicker and countertype last of all 442 | def text_pass_11_linetrans(s): 443 | # let's just not deal with level up 444 | if 'level up' in s: 445 | return s 446 | 447 | prelines = [] 448 | keylines = [] 449 | mainlines = [] 450 | postlines = [] 451 | 452 | lines = s.split(utils.newline) 453 | for line in lines: 454 | line = line.strip() 455 | if line == '': 456 | continue 457 | if not '.' 
in line: 458 | # because this is inconsistent 459 | line = line.replace(',', ';') 460 | line = line.replace('; where', ', where') # Thromok the Insatiable 461 | line = line.replace('; and', ', and') # wonky protection 462 | line = line.replace('; from', ', from') # wonky protection 463 | line = line.replace('upkeep;', 'upkeep,') # wonky protection 464 | sublines = line.split(';') 465 | for subline in sublines: 466 | subline = subline.strip() 467 | if 'equip' in subline or 'enchant' in subline: 468 | prelines += [subline] 469 | elif 'countertype' in subline or 'kicker' in subline: 470 | postlines += [subline] 471 | else: 472 | keylines += [subline] 473 | elif u'\u2014' in line and not u' \u2014 ' in line: 474 | if 'equip' in line or 'enchant' in line: 475 | prelines += [line] 476 | elif 'countertype' in line or 'kicker' in line: 477 | postlines += [line] 478 | else: 479 | keylines += [line] 480 | else: 481 | mainlines += [line] 482 | 483 | alllines = prelines + keylines + mainlines + postlines 484 | return utils.newline.join(alllines) 485 | 486 | 487 | # randomize the order of the lines 488 | # not a text pass, intended to be invoked dynamically when encoding a card 489 | # call this on fully encoded text, with mana symbols expanded 490 | def separate_lines(text): 491 | # forget about level up, ignore empty text too while we're at it 492 | if text == '' or 'level up' in text: 493 | return [],[],[],[],[] 494 | 495 | preline_search = ['equip', 'fortify', 'enchant ', 'bestow'] 496 | # probably could use optimization with a regex 497 | costline_search = [ 498 | 'multikicker', 'kicker', 'suspend', 'echo', 'awaken', 499 | 'buyback', 'dash', 'entwine', 'evoke', 'flashback', 500 | 'madness', 'megamorph', 'morph', 'miracle', 'ninjutsu', 'overload', 501 | 'prowl', 'recover', 'reinforce', 'replicate', 'scavenge', 'splice', 502 | 'surge', 'unearth', 'transmute', 'transfigure', 503 | ] 504 | # cycling is a special case to handle the variants 505 | postline_search = ['countertype'] 506 | keyline_search = ['cumulative'] 507 | 508 | prelines = [] 509 | keylines = [] 510 | mainlines = [] 511 | costlines = [] 512 | postlines = [] 513 | 514 | lines = text.split(utils.newline) 515 | # we've already done linetrans once, so some of the irregularities have been simplified 516 | for line in lines: 517 | if not '.' 
in line: 518 | if any(line.startswith(s) for s in preline_search): 519 | prelines.append(line) 520 | elif any(line.startswith(s) for s in postline_search): 521 | postlines.append(line) 522 | elif any(line.startswith(s) for s in costline_search) or 'cycling' in line: 523 | costlines.append(line) 524 | else: 525 | keylines.append(line) 526 | elif (utils.dash_marker in line and not 527 | (' '+utils.dash_marker+' ' in line or 'non'+utils.dash_marker in line)): 528 | if any(line.startswith(s) for s in preline_search): 529 | prelines.append(line) 530 | elif any(line.startswith(s) for s in costline_search) or 'cycling' in line: 531 | costlines.append(line) 532 | elif any(line.startswith(s) for s in keyline_search): 533 | keylines.append(line) 534 | else: 535 | mainlines.append(line) 536 | elif ': monstrosity' in line: 537 | costlines.append(line) 538 | else: 539 | mainlines.append(line) 540 | 541 | return prelines, keylines, mainlines, costlines, postlines 542 | 543 | choice_re = re.compile(re.escape(utils.choice_open_delimiter) + r'.*' + 544 | re.escape(utils.choice_close_delimiter)) 545 | choice_divider = ' ' + utils.bullet_marker + ' ' 546 | def randomize_choice(line): 547 | choices = re.findall(choice_re, line) 548 | if len(choices) < 1: 549 | return line 550 | new_line = line 551 | for choice in choices: 552 | parts = choice[1:-1].split(choice_divider) 553 | if len(parts) < 3: 554 | continue 555 | choiceparts = parts[1:] 556 | random.shuffle(choiceparts) 557 | new_line = new_line.replace(choice, 558 | utils.choice_open_delimiter + 559 | choice_divider.join(parts[:1] + choiceparts) + 560 | utils.choice_close_delimiter, 561 | 1) 562 | return new_line 563 | 564 | def randomize_lines(text): 565 | if text == '' or 'level up' in text: 566 | return text 567 | 568 | prelines, keylines, mainlines, costlines, postlines = separate_lines(text) 569 | 570 | new_mainlines = [] 571 | for line in mainlines: 572 | if line.endswith(utils.choice_close_delimiter): 573 | new_mainlines.append(randomize_choice(line)) 574 | # elif utils.choice_open_delimiter in line or utils.choice_close_delimiter in line: 575 | # print(line) 576 | else: 577 | new_mainlines.append(line) 578 | 579 | if False: # TODO: make this an option 580 | lines = prelines + keylines + new_mainlines + costlines + postlines 581 | random.shuffle(lines) 582 | return utils.newline.join(lines) 583 | else: 584 | random.shuffle(prelines) 585 | random.shuffle(keylines) 586 | random.shuffle(new_mainlines) 587 | random.shuffle(costlines) 588 | #random.shuffle(postlines) # only one kind ever (countertype) 589 | return utils.newline.join(prelines+keylines+new_mainlines+costlines+postlines) 590 | 591 | 592 | # Text unpasses, for decoding. All assume the text inside a Manatext, so don't do anything 593 | # weird with the mana cost symbol. 
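# [editor's note: illustrative sketch, not part of the original source; the
#  real call order lives in cardlib.py, so treat this ordering as an assumption]
# decoding a text field roughly applies the unpasses below in numbered order:
#   s = text_unpass_1_choice(s, delimit = False)
#   s = text_unpass_2_counters(s)
#   s = text_unpass_3_uncast(s)
#   s = text_unpass_4_unary(s)
#   s = text_unpass_5_symbols(s, for_forum, for_html)
#   s = text_unpass_6_cardname(s, name)
#   s = text_unpass_7_newlines(s)
#   s = text_unpass_8_unicode(s)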
594 | 595 | 596 | def text_unpass_1_choice(s, delimit = False): 597 | choice_regex = (re.escape(choice_open_delimiter) + re.escape(unary_marker) 598 | + r'.*' + re.escape(bullet_marker) + r'.*' + re.escape(choice_close_delimiter)) 599 | choices = re.findall(choice_regex, s) 600 | for choice in sorted(choices, lambda x,y: cmp(len(x), len(y)), reverse = True): 601 | fragments = choice[1:-1].split(bullet_marker) 602 | countfrag = fragments[0] 603 | optfrags = fragments[1:] 604 | choicecount = int(utils.from_unary(re.findall(utils.number_unary_regex, countfrag)[0])) 605 | newchoice = '' 606 | 607 | if choicecount == 0: 608 | if len(countfrag) == 2: 609 | newchoice += 'choose one or both ' 610 | else: 611 | newchoice += 'choose one or more ' 612 | elif choicecount == 1: 613 | newchoice += 'choose one ' 614 | elif choicecount == 2: 615 | newchoice += 'choose two ' 616 | else: 617 | newchoice += 'choose ' + utils.to_unary(str(choicecount)) + ' ' 618 | newchoice += dash_marker 619 | 620 | for option in optfrags: 621 | option = option.strip() 622 | if option: 623 | newchoice += newline + bullet_marker + ' ' + option 624 | 625 | if delimit: 626 | s = s.replace(choice, choice_open_delimiter + newchoice + choice_close_delimiter) 627 | s = s.replace('an opponent ' + choice_open_delimiter + 'choose ', 628 | 'an opponent ' + choice_open_delimiter + 'chooses ') 629 | else: 630 | s = s.replace(choice, newchoice) 631 | s = s.replace('an opponent choose ', 'an opponent chooses ') 632 | 633 | return s 634 | 635 | 636 | def text_unpass_2_counters(s): 637 | countertypes = re.findall(r'countertype ' + re.escape(counter_marker) 638 | + r'[^' + re.escape(newline) + r']*' + re.escape(newline), s) 639 | # lazier than using groups in the regex 640 | countertypes += re.findall(r'countertype ' + re.escape(counter_marker) 641 | + r'[^' + re.escape(newline) + r']*$', s) 642 | if len(countertypes) > 0: 643 | countertype = countertypes[0].replace('countertype ' + counter_marker, '') 644 | countertype = countertype.replace(newline, '\n').strip() 645 | s = s.replace(countertypes[0], '') 646 | s = s.replace(counter_marker, countertype) 647 | 648 | return s 649 | 650 | 651 | def text_unpass_3_uncast(s): 652 | return s.replace(counter_rename, 'counter') 653 | 654 | 655 | def text_unpass_4_unary(s): 656 | return utils.from_unary(s) 657 | 658 | 659 | def text_unpass_5_symbols(s, for_forum, for_html): 660 | return utils.from_symbols(s, for_forum = for_forum, for_html = for_html) 661 | 662 | 663 | def text_unpass_6_cardname(s, name): 664 | return s.replace(this_marker, name) 665 | 666 | 667 | def text_unpass_7_newlines(s): 668 | return s.replace(newline, '\n') 669 | 670 | 671 | def text_unpass_8_unicode(s): 672 | s = s.replace(dash_marker, u'\u2014') 673 | s = s.replace(bullet_marker, u'\u2022') 674 | return s 675 | -------------------------------------------------------------------------------- /lib/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | 3 | # Utilities for handling unicode, unary numbers, mana costs, and special symbols. 4 | # For convenience we redefine everything from config so that it can all be accessed 5 | # from the utils module. 6 | 7 | import config 8 | 9 | # special chunk of text that Magic Set Editor 2 requires at the start of all set files. 
10 | mse_prepend = 'mse version: 0.3.8\ngame: magic\nstylesheet: m15\nset info:\n\tsymbol:\nstyling:\n\tmagic-m15:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay:\n\tmagic-m15-clear:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-extra-improved:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\tpt box symbols: magic-pt-symbols-extra.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-planeswalker:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-planeswalker-promo-black:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-promo-dka:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-m15-token-clear:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-new-planeswalker:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-new-planeswalker-4abil:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-new-planeswalker-clear:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n\tmagic-new-planeswalker-promo-black:\n\t\ttext box mana symbols: magic-mana-small.mse-symbol-font\n\t\toverlay: \n'
11 | 
12 | # special chunk of text to start an HTML document.
13 | import html_extra_data
14 | segment_ids = html_extra_data.id_lables
15 | html_prepend = html_extra_data.html_prepend
16 | html_append = "\n</body></html>\n"
17 | 
18 | # encoding formats we know about
19 | formats = [
20 |     'std',
21 |     'named',
22 |     'noname',
23 |     'rfields',
24 |     'old',
25 |     'norarity',
26 |     'vec',
27 |     'custom',
28 | ]
29 | 
30 | # separators
31 | cardsep = config.cardsep
32 | fieldsep = config.fieldsep
33 | bsidesep = config.bsidesep
34 | newline = config.newline
35 | 
36 | # special indicators
37 | dash_marker = config.dash_marker
38 | bullet_marker = config.bullet_marker
39 | this_marker = config.this_marker
40 | counter_marker = config.counter_marker
41 | reserved_marker = config.reserved_marker
42 | reserved_mana_marker = config.reserved_mana_marker
43 | choice_open_delimiter = config.choice_open_delimiter
44 | choice_close_delimiter = config.choice_close_delimiter
45 | x_marker = config.x_marker
46 | tap_marker = config.tap_marker
47 | untap_marker = config.untap_marker
48 | rarity_common_marker = config.rarity_common_marker
49 | rarity_uncommon_marker = config.rarity_uncommon_marker
50 | rarity_rare_marker = config.rarity_rare_marker
51 | rarity_mythic_marker = config.rarity_mythic_marker
52 | rarity_special_marker = config.rarity_special_marker
53 | rarity_basic_land_marker = config.rarity_basic_land_marker
54 | 
55 | json_rarity_map = {
56 |     'Common' : rarity_common_marker,
57 |     'Uncommon' : rarity_uncommon_marker,
58 |     'Rare' : rarity_rare_marker,
59 |     'Mythic Rare' : rarity_mythic_marker,
60 |     'Special' : rarity_special_marker,
61 |     'Basic Land' : rarity_basic_land_marker,
62 | }
63 | json_rarity_unmap = {json_rarity_map[k] : k for k in json_rarity_map}
64 | 
65 | # unambiguous synonyms
66 | counter_rename = config.counter_rename
67 | 
68 | # field labels
69 | field_label_name = config.field_label_name
70 | field_label_rarity = config.field_label_rarity
71 | field_label_cost = config.field_label_cost
72 | field_label_supertypes = config.field_label_supertypes
73 | field_label_types = config.field_label_types
74 | field_label_subtypes = config.field_label_subtypes
75 | field_label_loyalty = config.field_label_loyalty
76 | field_label_pt = config.field_label_pt
77 | field_label_text = config.field_label_text
78 | 
79 | # additional fields we add to the json cards
80 | json_field_bside = config.json_field_bside
81 | json_field_set_name = config.json_field_set_name
82 | json_field_info_code = config.json_field_info_code
83 | 
84 | # unicode / ascii conversion
85 | unicode_trans = {
86 |     u'\u2014' : dash_marker, # unicode long dash
87 |     u'\u2022' : bullet_marker, # unicode bullet
88 |     u'\u2019' : '"', # single quote
89 |     u'\u2018' : '"', # single quote
90 |     u'\u2212' : '-', # minus sign
91 |     u'\xe6' : 'ae', # ae symbol
92 |     u'\xfb' : 'u', # u with caret
93 |     u'\xfa' : 'u', # u with accent
94 |     u'\xe9' : 'e', # e with accent
95 |     u'\xe1' : 'a', # a with accent
96 |     u'\xe0' : 'a', # a with accent going the other way
97 |     u'\xe2' : 'a', # a with caret
98 |     u'\xf6' : 'o', # o with umlaut
99 |     u'\xed' : 'i', # i with accent
100 | }
101 | 
102 | # this one is one-way only
103 | def to_ascii(s):
104 |     for uchar in unicode_trans:
105 |         s = s.replace(uchar, unicode_trans[uchar])
106 |     return s
107 | 
108 | # unary numbers
109 | unary_marker = config.unary_marker
110 | unary_counter = config.unary_counter
111 | unary_max = config.unary_max
112 | unary_exceptions = config.unary_exceptions
113 | 
114 | def to_unary(s, warn = False):
115 |     numbers = re.findall(r'[0123456789]+', s)
116 |     # replace largest first to avoid accidentally replacing shared substrings
117 |     for n in sorted(numbers, cmp = lambda x,y: cmp(int(x), int(y)), reverse = True):
118 |         i = int(n)
119 |         if i in unary_exceptions:
120 |             s = s.replace(n, unary_exceptions[i])
121 |         elif i > unary_max:
122 |             i = unary_max
123 |             if warn:
124 |                 print s
125 |             s = s.replace(n, unary_marker + unary_counter * i)
126 |         else:
127 |             s = s.replace(n, unary_marker + unary_counter * i)
128 |     return s
129 | 
130 | def from_unary(s):
131 |     numbers = re.findall(re.escape(unary_marker + unary_counter) + '*', s)
132 |     # again, largest first so we don't replace substrings and break everything
133 |     for n in sorted(numbers, cmp = lambda x,y: cmp(len(x), len(y)), reverse = True):
134 |         i = (len(n) - len(unary_marker)) / len(unary_counter)
135 |         s = s.replace(n, str(i))
136 |     return s
137 | 
138 | # mana syntax
139 | mana_open_delimiter = '{'
140 | mana_close_delimiter = '}'
141 | mana_json_open_delimiter = mana_open_delimiter
142 | mana_json_close_delimiter = mana_close_delimiter
143 | mana_json_hybrid_delimiter = '/'
144 | mana_forum_open_delimiter = '[mana]'
145 | mana_forum_close_delimiter = '[/mana]'
146 | mana_html_open_delimiter = "<img class='mana-"
147 | mana_html_close_delimiter = "'>"
148 | mana_html_hybrid_delimiter = '-'
149 | mana_unary_marker = '' # if the same as unary_marker, from_unary WILL replace numbers in mana costs
150 | mana_unary_counter = unary_counter
151 | 
152 | # The decoding from mtgjson format is dependent on the specific structure of
153 | # these internally used mana symbol strings, so if you want to change them you'll
154 | # also have to change the json decoding functions.
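As a concrete illustration of the unary scheme above, before moving on to the mana symbol tables: a minimal round-trip sketch. The '&' marker and '^' counter it assumes are the usual config.py defaults, but treat them as assumptions and substitute whatever your config defines.

```python
# Round-trip a number through the unary encoding.
# The '&'/'^' markers in the comments are assumed defaults from config.py.
import sys
sys.path.append('lib')  # assumes the repo root is the working directory

import utils

encoded = utils.to_unary('draw 2 cards')  # 'draw &^^ cards' with the assumed markers
decoded = utils.from_unary(encoded)       # 'draw 2 cards'
assert decoded == 'draw 2 cards'
```

Both functions substitute the largest matches first for a reason: '1' is a substring of '10', so a smallest-first pass would corrupt multi-digit numbers on the way in, and the same overlap hazard applies to the tally strings on the way back.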
155 | 156 | # standard mana symbol set 157 | mana_W = 'W' # single color 158 | mana_U = 'U' 159 | mana_B = 'B' 160 | mana_R = 'R' 161 | mana_G = 'G' 162 | mana_P = 'P' # colorless phyrexian 163 | mana_S = 'S' # snow 164 | mana_X = 'X' # colorless X 165 | mana_C = 'C' # colorless only 'eldrazi' 166 | mana_E = 'E' # energy counter 167 | mana_WP = 'WP' # single color phyrexian 168 | mana_UP = 'UP' 169 | mana_BP = 'BP' 170 | mana_RP = 'RP' 171 | mana_GP = 'GP' 172 | mana_2W = '2W' # single color hybrid 173 | mana_2U = '2U' 174 | mana_2B = '2B' 175 | mana_2R = '2R' 176 | mana_2G = '2G' 177 | mana_WU = 'WU' # dual color hybrid 178 | mana_WB = 'WB' 179 | mana_RW = 'RW' 180 | mana_GW = 'GW' 181 | mana_UB = 'UB' 182 | mana_UR = 'UR' 183 | mana_GU = 'GU' 184 | mana_BR = 'BR' 185 | mana_BG = 'BG' 186 | mana_RG = 'RG' 187 | # alternative order symbols 188 | mana_WP_alt = 'PW' # single color phyrexian 189 | mana_UP_alt = 'PU' 190 | mana_BP_alt = 'PB' 191 | mana_RP_alt = 'PR' 192 | mana_GP_alt = 'PG' 193 | mana_2W_alt = 'W2' # single color hybrid 194 | mana_2U_alt = 'U2' 195 | mana_2B_alt = 'B2' 196 | mana_2R_alt = 'R2' 197 | mana_2G_alt = 'G2' 198 | mana_WU_alt = 'UW' # dual color hybrid 199 | mana_WB_alt = 'BW' 200 | mana_RW_alt = 'WR' 201 | mana_GW_alt = 'WG' 202 | mana_UB_alt = 'BU' 203 | mana_UR_alt = 'RU' 204 | mana_GU_alt = 'UG' 205 | mana_BR_alt = 'RB' 206 | mana_BG_alt = 'GB' 207 | mana_RG_alt = 'GR' 208 | # special 209 | mana_2 = '2' # use with 'in' to identify single color hybrid 210 | 211 | # master symbol lists 212 | mana_syms = [ 213 | mana_W, 214 | mana_U, 215 | mana_B, 216 | mana_R, 217 | mana_G, 218 | mana_P, 219 | mana_S, 220 | mana_X, 221 | mana_C, 222 | mana_E, 223 | mana_WP, 224 | mana_UP, 225 | mana_BP, 226 | mana_RP, 227 | mana_GP, 228 | mana_2W, 229 | mana_2U, 230 | mana_2B, 231 | mana_2R, 232 | mana_2G, 233 | mana_WU, 234 | mana_WB, 235 | mana_RW, 236 | mana_GW, 237 | mana_UB, 238 | mana_UR, 239 | mana_GU, 240 | mana_BR, 241 | mana_BG, 242 | mana_RG, 243 | ] 244 | mana_symalt = [ 245 | mana_WP_alt, 246 | mana_UP_alt, 247 | mana_BP_alt, 248 | mana_RP_alt, 249 | mana_GP_alt, 250 | mana_2W_alt, 251 | mana_2U_alt, 252 | mana_2B_alt, 253 | mana_2R_alt, 254 | mana_2G_alt, 255 | mana_WU_alt, 256 | mana_WB_alt, 257 | mana_RW_alt, 258 | mana_GW_alt, 259 | mana_UB_alt, 260 | mana_UR_alt, 261 | mana_GU_alt, 262 | mana_BR_alt, 263 | mana_BG_alt, 264 | mana_RG_alt, 265 | ] 266 | mana_symall = mana_syms + mana_symalt 267 | 268 | # alt symbol conversion 269 | def mana_alt(sym): 270 | if not sym in mana_symall: 271 | raise ValueError('invalid mana symbol for mana_alt(): ' + repr(sym)) 272 | if len(sym) < 2: 273 | return sym 274 | else: 275 | return sym[::-1] 276 | 277 | # produce intended neural net output format 278 | def mana_sym_to_encoding(sym): 279 | if not sym in mana_symall: 280 | raise ValueError('invalid mana symbol for mana_sym_to_encoding(): ' + repr(sym)) 281 | if len(sym) < 2: 282 | return sym * 2 283 | else: 284 | return sym 285 | 286 | # produce json formatting used in mtgjson 287 | def mana_sym_to_json(sym): 288 | if not sym in mana_symall: 289 | raise ValueError('invalid mana symbol for mana_sym_to_json(): ' + repr(sym)) 290 | if len(sym) < 2: 291 | return mana_json_open_delimiter + sym + mana_json_close_delimiter 292 | else: 293 | return (mana_json_open_delimiter + sym[0] + mana_json_hybrid_delimiter 294 | + sym[1] + mana_json_close_delimiter) 295 | 296 | # produce pretty formatting that renders on mtgsalvation forum 297 | # converts individual symbols; surrounding 
[mana][/mana] tags are added elsewhere 298 | def mana_sym_to_forum(sym): 299 | if not sym in mana_symall: 300 | raise ValueError('invalid mana symbol for mana_sym_to_forum(): ' + repr(sym)) 301 | if sym in mana_symalt: 302 | sym = mana_alt(sym) 303 | if len(sym) < 2: 304 | return sym 305 | else: 306 | return mana_json_open_delimiter + sym + mana_json_close_delimiter 307 | 308 | # forward symbol tables for encoding 309 | mana_syms_encode = {sym : mana_sym_to_encoding(sym) for sym in mana_syms} 310 | mana_symalt_encode = {sym : mana_sym_to_encoding(sym) for sym in mana_symalt} 311 | mana_symall_encode = {sym : mana_sym_to_encoding(sym) for sym in mana_symall} 312 | mana_syms_jencode = {sym : mana_sym_to_json(sym) for sym in mana_syms} 313 | mana_symalt_jencode = {sym : mana_sym_to_json(sym) for sym in mana_symalt} 314 | mana_symall_jencode = {sym : mana_sym_to_json(sym) for sym in mana_symall} 315 | 316 | # reverse symbol tables for decoding 317 | mana_syms_decode = {mana_sym_to_encoding(sym) : sym for sym in mana_syms} 318 | mana_symalt_decode = {mana_sym_to_encoding(sym) : sym for sym in mana_symalt} 319 | mana_symall_decode = {mana_sym_to_encoding(sym) : sym for sym in mana_symall} 320 | mana_syms_jdecode = {mana_sym_to_json(sym) : sym for sym in mana_syms} 321 | mana_symalt_jdecode = {mana_sym_to_json(sym) : sym for sym in mana_symalt} 322 | mana_symall_jdecode = {mana_sym_to_json(sym) : sym for sym in mana_symall} 323 | 324 | # going straight from json to encoding and vice versa 325 | def mana_encode_direct(jsym): 326 | if not jsym in mana_symall_jdecode: 327 | raise ValueError('json string not found in decode table for mana_encode_direct(): ' 328 | + repr(jsym)) 329 | else: 330 | return mana_symall_encode[mana_symall_jdecode[jsym]] 331 | 332 | def mana_decode_direct(sym): 333 | if not sym in mana_symall_decode: 334 | raise ValueError('mana symbol not found in decode table for mana_decode_direct(): ' 335 | + repr(sym)) 336 | else: 337 | return mana_symall_jencode[mana_symall_decode[sym]] 338 | 339 | # hacked in support for mtgsalvation forum 340 | def mana_decode_direct_forum(sym): 341 | if not sym in mana_symall_decode: 342 | raise ValueError('mana symbol not found in decode table for mana_decode_direct_forum(): ' 343 | + repr(sym)) 344 | else: 345 | return mana_sym_to_forum(mana_symall_decode[sym]) 346 | 347 | # processing entire strings 348 | def unique_string(s): 349 | return ''.join(set(s)) 350 | 351 | mana_charset_special = mana_unary_marker + mana_unary_counter 352 | mana_charset_strict = unique_string(''.join(mana_symall) + mana_charset_special) 353 | mana_charset = unique_string(mana_charset_strict + mana_charset_strict.lower()) 354 | 355 | mana_regex_strict = (re.escape(mana_open_delimiter) + '[' 356 | + re.escape(mana_charset_strict) 357 | + ']*' + re.escape(mana_close_delimiter)) 358 | mana_regex = (re.escape(mana_open_delimiter) + '[' 359 | + re.escape(mana_charset) 360 | + ']*' + re.escape(mana_close_delimiter)) 361 | 362 | # as a special case, we let unary or decimal numbers exist in json mana strings 363 | mana_json_charset_special = ('0123456789' + unary_marker + unary_counter) 364 | mana_json_charset_strict = unique_string(''.join(mana_symall_jdecode) + mana_json_charset_special) 365 | mana_json_charset = unique_string(mana_json_charset_strict + mana_json_charset_strict.lower()) 366 | 367 | # note that json mana strings can't be empty between the delimiters 368 | mana_json_regex_strict = (re.escape(mana_json_open_delimiter) + '[' 369 | + 
re.escape(mana_json_charset_strict) 370 | + ']+' + re.escape(mana_json_close_delimiter)) 371 | mana_json_regex = (re.escape(mana_json_open_delimiter) + '[' 372 | + re.escape(mana_json_charset) 373 | + ']+' + re.escape(mana_json_close_delimiter)) 374 | 375 | number_decimal_regex = r'[0123456789]+' 376 | number_unary_regex = re.escape(unary_marker) + re.escape(unary_counter) + '*' 377 | mana_decimal_regex = (re.escape(mana_json_open_delimiter) + number_decimal_regex 378 | + re.escape(mana_json_close_delimiter)) 379 | mana_unary_regex = (re.escape(mana_json_open_delimiter) + number_unary_regex 380 | + re.escape(mana_json_close_delimiter)) 381 | 382 | # convert a json mana string to the proper encoding 383 | def mana_translate(jmanastr): 384 | manastr = jmanastr 385 | for n in sorted(re.findall(mana_unary_regex, manastr), 386 | lambda x,y: cmp(len(x), len(y)), reverse = True): 387 | ns = re.findall(number_unary_regex, n) 388 | i = (len(ns[0]) - len(unary_marker)) / len(unary_counter) 389 | manastr = manastr.replace(n, mana_unary_marker + mana_unary_counter * i) 390 | for n in sorted(re.findall(mana_decimal_regex, manastr), 391 | lambda x,y: cmp(len(x), len(y)), reverse = True): 392 | ns = re.findall(number_decimal_regex, n) 393 | i = int(ns[0]) 394 | manastr = manastr.replace(n, mana_unary_marker + mana_unary_counter * i) 395 | for jsym in sorted(mana_symall_jdecode, lambda x,y: cmp(len(x), len(y)), reverse = True): 396 | if jsym in manastr: 397 | manastr = manastr.replace(jsym, mana_encode_direct(jsym)) 398 | return mana_open_delimiter + manastr + mana_close_delimiter 399 | 400 | # convert an encoded mana string back to json 401 | mana_symlen_min = min([len(sym) for sym in mana_symall_decode]) 402 | mana_symlen_max = max([len(sym) for sym in mana_symall_decode]) 403 | def mana_untranslate(manastr, for_forum = False, for_html = False): 404 | inner = manastr[1:-1] 405 | jmanastr = '' 406 | colorless_total = 0 407 | idx = 0 408 | while idx < len(inner): 409 | # taking this branch is an infinite loop if unary_marker is empty 410 | if len(mana_unary_marker) > 0 and inner[idx:idx+len(mana_unary_marker)] == mana_unary_marker: 411 | idx += len(mana_unary_marker) 412 | elif inner[idx:idx+len(mana_unary_counter)] == mana_unary_counter: 413 | idx += len(mana_unary_counter) 414 | colorless_total += 1 415 | else: 416 | old_idx = idx 417 | for symlen in range(mana_symlen_min, mana_symlen_max + 1): 418 | sym = inner[idx:idx+symlen] 419 | if sym in mana_symall_decode: 420 | idx += symlen 421 | if for_html: 422 | jmanastr = jmanastr + mana_decode_direct(sym) 423 | jmanastr = jmanastr.replace(mana_open_delimiter, mana_html_open_delimiter) 424 | jmanastr = jmanastr.replace(mana_close_delimiter, mana_html_close_delimiter) 425 | jmanastr = jmanastr.replace(mana_open_delimiter, mana_html_open_delimiter) 426 | jmanastr = jmanastr.replace(mana_json_hybrid_delimiter, mana_html_hybrid_delimiter) 427 | elif for_forum: 428 | jmanastr = jmanastr + mana_decode_direct_forum(sym) 429 | else: 430 | jmanastr = jmanastr + mana_decode_direct(sym) 431 | break 432 | # otherwise we'll go into an infinite loop if we see a symbol we don't know 433 | if idx == old_idx: 434 | idx += 1 435 | 436 | if for_html: 437 | if jmanastr == '': 438 | return mana_html_open_delimiter + str(colorless_total) + mana_html_close_delimiter 439 | else: 440 | return (('' if colorless_total == 0 441 | else mana_html_open_delimiter + str(colorless_total) + mana_html_close_delimiter) 442 | + jmanastr) 443 | 444 | elif for_forum: 445 | if jmanastr == '': 446 
| return mana_forum_open_delimiter + str(colorless_total) + mana_forum_close_delimiter 447 | else: 448 | return (mana_forum_open_delimiter + ('' if colorless_total == 0 449 | else str(colorless_total)) 450 | + jmanastr + mana_forum_close_delimiter) 451 | else: 452 | if jmanastr == '': 453 | return mana_json_open_delimiter + str(colorless_total) + mana_json_close_delimiter 454 | else: 455 | return (('' if colorless_total == 0 else 456 | mana_json_open_delimiter + str(colorless_total) + mana_json_close_delimiter) 457 | + jmanastr) 458 | 459 | # finally, replacing all instances in a string 460 | # notice the calls to .upper(), this way we recognize lowercase symbols as well just in case 461 | def to_mana(s): 462 | jmanastrs = re.findall(mana_json_regex, s) 463 | for jmanastr in sorted(jmanastrs, lambda x,y: cmp(len(x), len(y)), reverse = True): 464 | s = s.replace(jmanastr, mana_translate(jmanastr.upper())) 465 | return s 466 | 467 | def from_mana(s, for_forum = False): 468 | manastrs = re.findall(mana_regex, s) 469 | for manastr in sorted(manastrs, lambda x,y: cmp(len(x), len(y)), reverse = True): 470 | s = s.replace(manastr, mana_untranslate(manastr.upper(), for_forum = for_forum)) 471 | return s 472 | 473 | # Translation could also be accomplished using the datamine.Manacost object's 474 | # display methods, but these direct string transformations are retained for 475 | # quick scripting and convenience (and used under the hood by that class to 476 | # do its formatting). 477 | 478 | # more convenience features for formatting tap / untap symbols 479 | json_symbol_tap = tap_marker 480 | json_symbol_untap = untap_marker 481 | 482 | json_symbol_trans = { 483 | mana_json_open_delimiter + json_symbol_tap + mana_json_close_delimiter : tap_marker, 484 | mana_json_open_delimiter + json_symbol_tap.lower() + mana_json_close_delimiter : tap_marker, 485 | mana_json_open_delimiter + json_symbol_untap + mana_json_close_delimiter : untap_marker, 486 | mana_json_open_delimiter + json_symbol_untap.lower() + mana_json_close_delimiter : untap_marker, 487 | } 488 | symbol_trans = { 489 | tap_marker : mana_json_open_delimiter + json_symbol_tap + mana_json_close_delimiter, 490 | untap_marker : mana_json_open_delimiter + json_symbol_untap + mana_json_close_delimiter, 491 | } 492 | symbol_forum_trans = { 493 | tap_marker : mana_forum_open_delimiter + json_symbol_tap + mana_forum_close_delimiter, 494 | untap_marker : mana_forum_open_delimiter + json_symbol_untap + mana_forum_close_delimiter, 495 | } 496 | symbol_html_trans = { 497 | tap_marker : mana_html_open_delimiter + json_symbol_tap + mana_html_close_delimiter, 498 | untap_marker : mana_html_open_delimiter + json_symbol_untap + mana_html_close_delimiter, 499 | } 500 | 501 | json_symbol_regex = (re.escape(mana_json_open_delimiter) + '[' 502 | + json_symbol_tap + json_symbol_tap.lower() 503 | + json_symbol_untap + json_symbol_untap.lower() 504 | + ']' + re.escape(mana_json_close_delimiter)) 505 | symbol_regex = '[' + tap_marker + untap_marker + ']' 506 | 507 | def to_symbols(s): 508 | jsymstrs = re.findall(json_symbol_regex, s) 509 | for jsymstr in sorted(jsymstrs, lambda x,y: cmp(len(x), len(y)), reverse = True): 510 | s = s.replace(jsymstr, json_symbol_trans[jsymstr]) 511 | return s 512 | 513 | def from_symbols(s, for_forum = False, for_html = False): 514 | symstrs = re.findall(symbol_regex, s) 515 | #for symstr in sorted(symstrs, lambda x,y: cmp(len(x), len(y)), reverse = True): 516 | # We have to do the right thing here, because the thing we replace exists 
in the thing 517 | # we replace it with... 518 | for symstr in set(symstrs): 519 | if for_html: 520 | s = s.replace(symstr, symbol_html_trans[symstr]) 521 | elif for_forum: 522 | s = s.replace(symstr, symbol_forum_trans[symstr]) 523 | else: 524 | s = s.replace(symstr, symbol_trans[symstr]) 525 | return s 526 | 527 | unletters_regex = r"[^abcdefghijklmnopqrstuvwxyz']" 528 | -------------------------------------------------------------------------------- /scripts/analysis.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | import re 5 | from collections import OrderedDict 6 | 7 | # scipy is kinda necessary 8 | import scipy 9 | import scipy.stats 10 | import numpy as np 11 | import math 12 | 13 | def mean_nonan(l): 14 | filtered = [x for x in l if not math.isnan(x)] 15 | return np.mean(filtered) 16 | 17 | def gmean_nonzero(l): 18 | filtered = [x for x in l if x != 0 and not math.isnan(x)] 19 | return scipy.stats.gmean(filtered) 20 | 21 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib') 22 | sys.path.append(libdir) 23 | datadir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../data') 24 | import jdecode 25 | 26 | import mtg_validate 27 | import ngrams 28 | 29 | def annotate_values(values): 30 | for k in values: 31 | (total, good, bad) = values[k] 32 | values[k] = OrderedDict([('total', total), ('good', good), ('bad', bad)]) 33 | return values 34 | 35 | def print_statistics(stats, ident = 0): 36 | for k in stats: 37 | if isinstance(stats[k], OrderedDict): 38 | print(' ' * ident + str(k) + ':') 39 | print_statistics(stats[k], ident=ident+2) 40 | elif isinstance(stats[k], dict): 41 | print(' ' * ident + str(k) + ': ') 42 | elif isinstance(stats[k], list): 43 | print(' ' * ident + str(k) + ': ') 44 | else: 45 | print(' ' * ident + str(k) + ': ' + str(stats[k])) 46 | 47 | def get_statistics(fname, lm = None, sep = False, verbose=False): 48 | stats = OrderedDict() 49 | cards = jdecode.mtg_open_file(fname, verbose=verbose) 50 | stats['cards'] = cards 51 | 52 | # unpack the name of the checkpoint - terrible and hacky 53 | try: 54 | final_name = os.path.basename(fname) 55 | halves = final_name.split('_epoch') 56 | cp_name = halves[0] 57 | cp_info = halves[1][:-4] 58 | info_halves = cp_info.split('_') 59 | cp_epoch = float(info_halves[0]) 60 | fragments = info_halves[1].split('.') 61 | cp_vloss = float('.'.join(fragments[:2])) 62 | cp_temp = float('.'.join(fragments[-2:])) 63 | cp_ident = '.'.join(fragments[2:-2]) 64 | stats['cp'] = OrderedDict([('name', cp_name), 65 | ('epoch', cp_epoch), 66 | ('vloss', cp_vloss), 67 | ('temp', cp_temp), 68 | ('ident', cp_ident)]) 69 | except Exception as e: 70 | pass 71 | 72 | # validate 73 | ((total_all, total_good, total_bad, total_uncovered), 74 | values) = mtg_validate.process_props(cards) 75 | 76 | stats['props'] = annotate_values(values) 77 | stats['props']['overall'] = OrderedDict([('total', total_all), 78 | ('good', total_good), 79 | ('bad', total_bad), 80 | ('uncovered', total_uncovered)]) 81 | 82 | # distances 83 | distfname = fname + '.dist' 84 | if os.path.isfile(distfname): 85 | name_dupes = 0 86 | card_dupes = 0 87 | with open(distfname, 'rt') as f: 88 | distlines = f.read().split('\n') 89 | dists = OrderedDict([('name', []), ('cbow', [])]) 90 | for line in distlines: 91 | fields = line.split('|') 92 | if len(fields) < 4: 93 | continue 94 | idx = int(fields[0]) 95 | name = str(fields[1]) 96 | ndist = float(fields[2]) 97 | 
cdist = float(fields[3])
98 |                 dists['name'] += [ndist]
99 |                 dists['cbow'] += [cdist]
100 |                 if ndist == 1.0:
101 |                     name_dupes += 1
102 |                 if cdist == 1.0:
103 |                     card_dupes += 1
104 | 
105 |         dists['name_mean'] = mean_nonan(dists['name'])
106 |         dists['cbow_mean'] = mean_nonan(dists['cbow'])
107 |         dists['name_geomean'] = gmean_nonzero(dists['name'])
108 |         dists['cbow_geomean'] = gmean_nonzero(dists['cbow'])
109 |         stats['dists'] = dists
110 | 
111 |     # n-grams
112 |     if not lm is None:
113 |         ngram = OrderedDict([('perp', []), ('perp_per', []),
114 |                              ('perp_max', []), ('perp_per_max', [])])
115 |         for card in cards:
116 |             if len(card.text.text) == 0:
117 |                 perp = perp_max = 0.0
118 |                 perp_per = perp_per_max = 0.0
119 |             elif sep:
120 |                 vtexts = [line.vectorize().split() for line in card.text_lines
121 |                           if len(line.vectorize().split()) > 0]
122 |                 perps = [lm.perplexity(vtext) for vtext in vtexts]
123 |                 perps_per = [perps[i] / float(len(vtexts[i])) for i in range(0, len(vtexts))]
124 |                 perp = gmean_nonzero(perps)
125 |                 perp_per = gmean_nonzero(perps_per)
126 |                 perp_max = max(perps)
127 |                 perp_per_max = max(perps_per)
128 |             else:
129 |                 vtext = card.text.vectorize().split()
130 |                 perp = lm.perplexity(vtext)
131 |                 perp_per = perp / float(len(vtext))
132 |                 perp_max = perp
133 |                 perp_per_max = perp_per
134 | 
135 |             ngram['perp'] += [perp]
136 |             ngram['perp_per'] += [perp_per]
137 |             ngram['perp_max'] += [perp_max]
138 |             ngram['perp_per_max'] += [perp_per_max]
139 | 
140 |         ngram['perp_mean'] = mean_nonan(ngram['perp'])
141 |         ngram['perp_per_mean'] = mean_nonan(ngram['perp_per'])
142 |         ngram['perp_geomean'] = gmean_nonzero(ngram['perp'])
143 |         ngram['perp_per_geomean'] = gmean_nonzero(ngram['perp_per'])
144 |         stats['ngram'] = ngram
145 | 
146 |     return stats
147 | 
148 | 
149 | def main(infile, verbose = False):
150 |     lm = ngrams.build_ngram_model(jdecode.mtg_open_file(str(os.path.join(datadir, 'output.txt'))),
151 |                                   3, separate_lines=True, verbose=True)
152 |     stats = get_statistics(infile, lm=lm, sep=True, verbose=verbose)
153 |     print_statistics(stats)
154 | 
155 | if __name__ == '__main__':
156 | 
157 |     import argparse
158 |     parser = argparse.ArgumentParser()
159 | 
160 |     parser.add_argument('infile', #nargs='?'. default=None,
161 |                         help='encoded card file or json corpus to process')
162 |     parser.add_argument('-v', '--verbose', action='store_true',
163 |                         help='verbose output')
164 | 
165 |     args = parser.parse_args()
166 |     main(args.infile, verbose=args.verbose)
167 |     exit(0)
168 | 
--------------------------------------------------------------------------------
/scripts/autosample.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import sys
3 | import os
4 | import subprocess
5 | import random
6 | 
7 | def extract_cp_name(name):
8 |     # "lm_lstm_epoch50.00_0.1870.t7"
9 |     if not (name[:13] == 'lm_lstm_epoch' and name[-3:] == '.t7'):
10 |         return None
11 |     name = name[13:-3]
12 |     (epoch, vloss) = tuple(name.split('_'))
13 |     return (float(epoch), float(vloss))
14 | 
15 | def sample(cp, temp, count, seed = None, ident = 'output'):
16 |     if seed is None:
17 |         seed = random.randint(-1000000000, 1000000000)
18 |     outfile = cp + '.' + ident + '.'
+ str(temp) + '.txt' 19 | cmd = ('th sample.lua ' + cp 20 | + ' -temperature ' + str(temp) 21 | + ' -length ' + str(count) 22 | + ' -seed ' + str(seed) 23 | + ' >> ' + outfile) 24 | if os.path.exists(outfile): 25 | print(outfile + ' already exists, skipping') 26 | return False 27 | else: 28 | # UNSAFE SHELL=TRUE FOR CONVENIENCE 29 | subprocess.call('echo "' + cmd + '" | tee ' + outfile, shell=True) 30 | subprocess.call(cmd, shell=True) 31 | 32 | def find_best_cp(cpdir): 33 | best = None 34 | best_cp = None 35 | for path in os.listdir(cpdir): 36 | fullpath = os.path.join(cpdir, path) 37 | if os.path.isfile(fullpath): 38 | extracted = extract_cp_name(path) 39 | if not extracted is None: 40 | (epoch, vloss) = extracted 41 | if best is None or vloss < best: 42 | best = vloss 43 | best_cp = fullpath 44 | return best_cp 45 | 46 | def process_dir(cpdir, temp, count, seed = None, ident = 'output', verbose = False): 47 | if verbose: 48 | print('processing ' + cpdir) 49 | best_cp = find_best_cp(cpdir) 50 | if not best_cp is None: 51 | sample(best_cp, temp, count, seed=seed, ident=ident) 52 | for path in os.listdir(cpdir): 53 | fullpath = os.path.join(cpdir, path) 54 | if os.path.isdir(fullpath): 55 | process_dir(fullpath, temp, count, seed=seed, ident=ident, verbose=verbose) 56 | 57 | def main(rnndir, cpdir, temp, count, seed = None, ident = 'output', verbose = False): 58 | if not os.path.isdir(rnndir): 59 | raise ValueError('bad rnndir: ' + rnndir) 60 | if not os.path.isdir(cpdir): 61 | raise ValueError('bad cpdir: ' + cpdir) 62 | os.chdir(rnndir) 63 | process_dir(cpdir, temp, count, seed=seed, ident=ident, verbose=verbose) 64 | 65 | if __name__ == '__main__': 66 | import argparse 67 | parser = argparse.ArgumentParser() 68 | 69 | parser.add_argument('rnndir', #nargs='?'. 
default=None, 70 | help='base rnn directory, must contain sample.lua') 71 | parser.add_argument('cpdir', #nargs='?', default=None, 72 | help='checkpoint directory, all subdirectories will be processed') 73 | parser.add_argument('-t', '--temperature', action='store', default='1.0', 74 | help='sampling temperature') 75 | parser.add_argument('-c', '--count', action='store', default='1000000', 76 | help='number of characters to sample each time') 77 | parser.add_argument('-s', '--seed', action='store', default=None, 78 | help='fixed seed; if not present, a random seed will be used') 79 | parser.add_argument('-i', '--ident', action='store', default='output', 80 | help='identifier to include in the output filenames') 81 | parser.add_argument('-v', '--verbose', action='store_true', 82 | help='verbose output') 83 | 84 | args = parser.parse_args() 85 | if args.seed is None: 86 | seed = None 87 | else: 88 | seed = int(args.seed) 89 | main(args.rnndir, args.cpdir, float(args.temperature), int(args.count), 90 | seed=seed, ident=args.ident, verbose = args.verbose) 91 | exit(0) 92 | -------------------------------------------------------------------------------- /scripts/collect_checkpoints.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | import shutil 5 | 6 | def cleanup_dump(dumpstr): 7 | cardfrags = dumpstr.split('\n\n') 8 | if len(cardfrags) < 4: 9 | return '' 10 | else: 11 | return '\n\n'.join(cardfrags[2:-1]) + '\n\n' 12 | 13 | def identify_checkpoints(basedir, ident): 14 | cp_infos = [] 15 | for path in os.listdir(basedir): 16 | fullpath = os.path.join(basedir, path) 17 | if not os.path.isfile(fullpath): 18 | continue 19 | if not (path[:13] == 'lm_lstm_epoch' and path[-4:] == '.txt'): 20 | continue 21 | if not ident in path: 22 | continue 23 | # attempt super hacky parsing 24 | inner = path[13:-4] 25 | halves = inner.split('_') 26 | if not len(halves) == 2: 27 | continue 28 | parts = halves[1].split('.') 29 | if not len(parts) == 6: 30 | continue 31 | # lm_lstm_epoch[25.00_0.3859.t7.output.1.0].txt 32 | if not parts[3] == ident: 33 | continue 34 | epoch = halves[0] 35 | vloss = '.'.join([parts[0], parts[1]]) 36 | temp = '.'.join([parts[4], parts[5]]) 37 | cpname = 'lm_lstm_epoch' + epoch + '_' + vloss + '.t7' 38 | cp_infos += [(fullpath, os.path.join(basedir, cpname), 39 | (epoch, vloss, temp))] 40 | return cp_infos 41 | 42 | def process_dir(basedir, targetdir, ident, copy_cp = False, verbose = False): 43 | (basepath, basedirname) = os.path.split(basedir) 44 | if basedirname == '': 45 | (basepath, basedirname) = os.path.split(basepath) 46 | 47 | cp_infos = identify_checkpoints(basedir, ident) 48 | for (dpath, cpath, (epoch, vloss, temp)) in cp_infos: 49 | if verbose: 50 | print('found dumpfile ' + dpath) 51 | dname = basedirname + '_epoch' + epoch + '_' + vloss + '.' + ident + '.' 
+ temp + '.txt'
52 |         cname = basedirname + '_epoch' + epoch + '_' + vloss + '.t7'
53 |         tdpath = os.path.join(targetdir, dname)
54 |         tcpath = os.path.join(targetdir, cname)
55 |         if verbose:
56 |             print(' cpx ' + dpath + ' ' + tdpath)
57 |         with open(dpath, 'rt') as infile:
58 |             with open(tdpath, 'wt') as outfile:
59 |                 outfile.write(cleanup_dump(infile.read()))
60 |         if copy_cp:
61 |             if os.path.isfile(cpath):
62 |                 if verbose:
63 |                     print(' cp ' + cpath + ' ' + tcpath)
64 |                 shutil.copy(cpath, tcpath)
65 | 
66 |     if copy_cp and len(cp_infos) > 0:
67 |         cmdpath = os.path.join(basedir, 'command.txt')
68 |         tcmdpath = os.path.join(targetdir, basedirname + '.command')
69 |         if os.path.isfile(cmdpath):
70 |             if verbose:
71 |                 print(' cp ' + cmdpath + ' ' + tcmdpath)
72 |             shutil.copy(cmdpath, tcmdpath)
73 | 
74 |     for path in os.listdir(basedir):
75 |         fullpath = os.path.join(basedir, path)
76 |         if os.path.isdir(fullpath):
77 |             process_dir(fullpath, targetdir, ident, copy_cp=copy_cp, verbose=verbose)
78 | 
79 | def main(basedir, targetdir, ident = 'output', copy_cp = False, verbose = False):
80 |     process_dir(basedir, targetdir, ident, copy_cp=copy_cp, verbose=verbose)
81 | 
82 | if __name__ == '__main__':
83 |     import argparse
84 |     parser = argparse.ArgumentParser()
85 | 
86 |     parser.add_argument('basedir', #nargs='?'. default=None,
87 |                         help='directory tree of checkpoint dumps to process, all subdirectories will be searched')
88 |     parser.add_argument('targetdir', #nargs='?', default=None,
89 |                         help='directory to collect the renamed dump files (and checkpoints) into')
90 |     parser.add_argument('-c', '--copy_cp', action='store_true',
91 |                         help='copy checkpoints used to generate the output files')
92 |     parser.add_argument('-i', '--ident', action='store', default='output',
93 |                         help='identifier to look for to determine checkpoints')
94 |     parser.add_argument('-v', '--verbose', action='store_true',
95 |                         help='verbose output')
96 | 
97 |     args = parser.parse_args()
98 |     main(args.basedir, args.targetdir, ident=args.ident, copy_cp=args.copy_cp, verbose=args.verbose)
99 |     exit(0)
100 | 
--------------------------------------------------------------------------------
/scripts/distances.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | import sys
3 | import os
4 | 
5 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib')
6 | sys.path.append(libdir)
7 | import utils
8 | import jdecode
9 | from namediff import Namediff
10 | from cbow import CBOW
11 | 
12 | def main(fname, oname, verbose = True, parallel = True):
13 |     # may need to set special arguments here
14 |     cards = jdecode.mtg_open_file(fname, verbose=verbose)
15 | 
16 |     # this could reasonably be some separate function
17 |     # might make sense to merge cbow and namediff and have this be the main interface
18 |     namediff = Namediff()
19 |     cbow = CBOW()
20 | 
21 |     if verbose:
22 |         print 'Computing nearest names...'
23 |     if parallel:
24 |         nearest_names = namediff.nearest_par(map(lambda c: c.name, cards), n=1)
25 |     else:
26 |         nearest_names = [namediff.nearest(c.name, n=1) for c in cards]
27 | 
28 |     if verbose:
29 |         print 'Computing nearest cards...'
30 | if parallel: 31 | nearest_cards = cbow.nearest_par(cards, n=1) 32 | else: 33 | nearest_cards = [cbow.nearest(c, n=1) for c in cards] 34 | 35 | for i in range(0, len(cards)): 36 | cards[i].nearest_names = nearest_names[i] 37 | cards[i].nearest_cards = nearest_cards[i] 38 | 39 | # # unfortunately this takes ~30 hours on 8 cores for a 10MB dump 40 | # if verbose: 41 | # print 'Computing nearest encodings by text edit distance...' 42 | # if parallel: 43 | # nearest_cards_text = namediff.nearest_card_par(cards, n=1) 44 | # else: 45 | # nearest_cards_text = [namediff.nearest_card(c, n=1) for c in cards] 46 | 47 | if verbose: 48 | print '...Done.' 49 | 50 | # write to a file to store the data, this is a terribly long computation 51 | # we could also just store this same info in the cards themselves as more fields... 52 | sep = '|' 53 | with open(oname, 'w') as ofile: 54 | for i in range(0, len(cards)): 55 | card = cards[i] 56 | ostr = str(i) + sep + card.name + sep 57 | ndist, _ = card.nearest_names[0] 58 | ostr += str(ndist) + sep 59 | cdist, _ = card.nearest_cards[0] 60 | ostr += str(cdist) + '\n' 61 | # tdist, _ = nearest_cards_text[i][0] 62 | # ostr += str(tdist) + '\n' 63 | ofile.write(ostr.encode('utf-8')) 64 | 65 | if __name__ == '__main__': 66 | 67 | import argparse 68 | parser = argparse.ArgumentParser() 69 | 70 | parser.add_argument('infile', #nargs='?'. default=None, 71 | help='encoded card file or json corpus to process') 72 | parser.add_argument('outfile', #nargs='?', default=None, 73 | help='name of output file, will be overwritten') 74 | parser.add_argument('-v', '--verbose', action='store_true', 75 | help='verbose output') 76 | parser.add_argument('-p', '--parallel', action='store_true', 77 | help='run in parallel on all cores') 78 | 79 | args = parser.parse_args() 80 | main(args.infile, args.outfile, verbose=args.verbose, parallel=args.parallel) 81 | exit(0) 82 | -------------------------------------------------------------------------------- /scripts/keydiff.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | def parse_keyfile(f, d, constructor = lambda x: x): 4 | for line in f: 5 | kv = map(lambda s: s.strip(), line.split(':')) 6 | if not len(kv) == 2: 7 | continue 8 | d[kv[0]] = constructor(kv[1]) 9 | 10 | def merge_dicts(d1, d2): 11 | d = {} 12 | for k in d1: 13 | d[k] = (d1[k], d2[k] if k in d2 else None) 14 | for k in d2: 15 | if not k in d: 16 | d[k] = (None, d2[k]) 17 | return d 18 | 19 | def main(fname1, fname2, verbose = True): 20 | if verbose: 21 | print 'opening ' + fname1 + ' as base key/value store' 22 | print 'opening ' + fname2 + ' as target key/value store' 23 | 24 | d1 = {} 25 | d2 = {} 26 | with open(fname1, 'rt') as f1: 27 | parse_keyfile(f1, d1, int) 28 | with open(fname2, 'rt') as f2: 29 | parse_keyfile(f2, d2, int) 30 | 31 | tot1 = sum(d1.values()) 32 | tot2 = sum(d2.values()) 33 | 34 | if verbose: 35 | print ' ' + fname1 + ': ' + str(len(d1)) + ', total ' + str(tot1) 36 | print ' ' + fname2 + ': ' + str(len(d2)) + ', total ' + str(tot2) 37 | 38 | d_merged = merge_dicts(d1, d2) 39 | 40 | ratios = {} 41 | only_1 = {} 42 | only_2 = {} 43 | for k in d_merged: 44 | (v1, v2) = d_merged[k] 45 | if v1 is None: 46 | only_2[k] = v2 47 | elif v2 is None: 48 | only_1[k] = v1 49 | else: 50 | ratios[k] = float(v2 * tot1) / float(v1 * tot2) 51 | 52 | print 'shared: ' + str(len(ratios)) 53 | for k in sorted(ratios, lambda x,y: cmp(d2[x], d2[y]), reverse=True): 54 | print ' ' + k + ': ' + str(d2[k]) + 
'/' + str(d1[k]) + ' (' + str(ratios[k]) + ')' 55 | print '' 56 | 57 | print '1 only: ' + str(len(only_1)) 58 | for k in sorted(only_1, lambda x,y: cmp(d1[x], d1[y]), reverse=True): 59 | print ' ' + k + ': ' + str(d1[k]) 60 | print '' 61 | 62 | print '2 only: ' + str(len(only_2)) 63 | for k in sorted(only_2, lambda x,y: cmp(d2[x], d2[y]), reverse=True): 64 | print ' ' + k + ': ' + str(d2[k]) 65 | print '' 66 | 67 | if __name__ == '__main__': 68 | 69 | import argparse 70 | parser = argparse.ArgumentParser() 71 | 72 | parser.add_argument('file1', #nargs='?'. default=None, 73 | help='base key file to diff against') 74 | parser.add_argument('file2', nargs='?', default=None, 75 | help='other file to compare against the baseline') 76 | parser.add_argument('-v', '--verbose', action='store_true', 77 | help='verbose output') 78 | 79 | args = parser.parse_args() 80 | main(args.file1, args.file2, verbose=args.verbose) 81 | exit(0) 82 | -------------------------------------------------------------------------------- /scripts/mtg_validate.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | import re 5 | from collections import OrderedDict 6 | 7 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib') 8 | sys.path.append(libdir) 9 | import utils 10 | import jdecode 11 | 12 | datadir = os.path.realpath(os.path.join(libdir, '../data')) 13 | gramdir = os.path.join(datadir, 'ngrams') 14 | compute_ngrams = False 15 | gramdicts = {} 16 | if os.path.isdir(gramdir): 17 | import keydiff 18 | compute_ngrams = True 19 | for fname in os.listdir(gramdir): 20 | suffixes = re.findall(r'\.[0-9]*g$', fname) 21 | if suffixes: 22 | grams = int(suffixes[0][1:-1]) 23 | d = {} 24 | with open(os.path.join(gramdir, fname), 'rt') as f: 25 | keydiff.parse_keyfile(f, d, int) 26 | gramdicts[grams] = d 27 | 28 | def rare_grams(card, thresh = 2, grams = 2): 29 | if not grams in gramdicts: 30 | return None 31 | rares = 0 32 | gramdict = gramdicts[grams] 33 | for line in card.text_lines_words: 34 | for i in range(0, len(line) - (grams - 1)): 35 | ngram = ' '.join([line[i + j] for j in range(0, grams)]) 36 | if ngram in gramdict: 37 | if gramdict[ngram] < thresh: 38 | rares += 1 39 | else: 40 | rares += 1 41 | return rares 42 | 43 | def list_only(l, items): 44 | for e in l: 45 | if not e in items: 46 | return False 47 | return True 48 | 49 | def pct(x, total): 50 | pctstr = 100.0 * float(x) / float(total) 51 | return '(' + str(pctstr)[:5] + '%)' 52 | 53 | def check_types(card): 54 | if 'instant' in card.types: 55 | return list_only(card.types, ['tribal', 'instant']) 56 | if 'sorcery' in card.types: 57 | return list_only(card.types, ['tribal', 'sorcery']) 58 | if 'creature' in card.types: 59 | return list_only(card.types, ['tribal', 'creature', 'artifact', 'land', 'enchantment']) 60 | if 'planeswalker' in card.types: 61 | return list_only(card.types, ['tribal', 'planeswalker', 'artifact', 'land', 'enchantment']) 62 | else: 63 | return list_only(card.types, ['tribal', 'artifact', 'land', 'enchantment']) 64 | 65 | def check_pt(card): 66 | if ('creature' in card.types or 'vehicle' in card.subtypes) or card.pt: 67 | return ((('creature' in card.types or 'vehicle' in card.subtypes) and len(re.findall(re.escape('/'), card.pt)) == 1) 68 | and not card.loyalty) 69 | if 'planeswalker' in card.types or card.loyalty: 70 | return (('planeswalker' in card.types and card.loyalty) 71 | and not card.pt) 72 | return None 73 | 74 | def 
check_lands(card):
75 |     if 'land' in card.types:
76 |         return card.cost.format() == '_NOCOST_'
77 |     else:
78 |         return None
79 | 
80 | # doesn't handle granted activated abilities in ""
81 | def check_X(card):
82 |     correct = None
83 |     incost = 'X' in card.cost.encode()
84 |     extra_cost_lines = 0
85 |     cost_lines = 0
86 |     use_lines = 0
87 |     for mt in card.text_lines:
88 |         sides = mt.text.split(':')
89 |         if len(sides) == 2:
90 |             actcosts = len(re.findall(re.escape(utils.reserved_mana_marker), sides[0]))
91 |             lcosts = mt.costs[:actcosts]
92 |             rcosts = mt.costs[actcosts:]
93 |             if 'X' in sides[0] or (utils.reserved_mana_marker in sides[0] and
94 |                                    'X' in ''.join(map(lambda c: c.encode(), lcosts))):
95 | 
96 |                 if incost:
97 |                     return False # bad, duplicated Xs in costs
98 | 
99 |                 if 'X' in sides[1] or (utils.reserved_mana_marker in sides[1] and
100 |                                        'X' in ''.join(map(lambda c: c.encode(), rcosts))):
101 |                     correct = True # good, defined X is either specified or used
102 |                     if 'monstrosity' in sides[1]:
103 |                         extra_cost_lines += 1
104 |                     continue
105 |                 elif 'remove X % counters' in sides[0] and 'each counter removed' in sides[1]:
106 |                     correct = True # Blademane Baku
107 |                     continue
108 |                 elif 'note' in sides[1]:
109 |                     correct = True # Ice Cauldron
110 |                     continue
111 |                 else:
112 |                     return False # bad, defined X is unused
113 | 
114 |         # we've checked all cases where an X occurs in an activation cost
115 |         linetext = mt.encode()
116 |         intext = len(re.findall(r'X', linetext))
117 |         defs = (len(re.findall(r'X is', linetext))
118 |                 + len(re.findall(re.escape('pay {X'), linetext))
119 |                 + len(re.findall(re.escape('pay X'), linetext))
120 |                 + len(re.findall(re.escape('reveal X'), linetext))
121 |                 + len(re.findall(re.escape('may tap X'), linetext)))
122 | 
123 |         if incost:
124 |             if intext:
125 |                 correct = True # defined and used or specified in some way
126 |         elif intext > 0:
127 |             if intext > 1 and defs > 0:
128 |                 correct = True # look for multiples
129 |             elif 'suspend' in linetext or 'bloodthirst' in linetext:
130 |                 correct = True # special case keywords
131 |             elif 'reinforce' in linetext and intext > 2:
132 |                 correct = True # this should work
133 |             elif 'contain {X' in linetext or 'with {X' in linetext:
134 |                 correct = True
135 | 
136 |             elif ('additional cost' in linetext
137 |                   or 'morph' in linetext
138 |                   or 'kicker' in linetext):
139 |                 cost_lines += 1
140 |             else:
141 |                 use_lines += 1
142 | 
143 |     if incost and not correct:
144 |         if 'sunburst' in card.text.text or 'spent to cast' in card.text.text:
145 |             return True # Engineered Explosives, Skyrider Elf
146 |         return False # otherwise we should have seen X somewhere if it was in the cost
147 | 
148 |     elif cost_lines > 0 or use_lines > 0:
149 |         if (cost_lines + extra_cost_lines) == 1 and use_lines > 0:
150 |             return True # dreams, etc.
151 |     else:
152 |         return False
153 | 
154 |     return correct
155 | 
156 | def check_kicker(card):
157 |     # also lazy and simple
158 |     if 'kicker' in card.text.text or 'kicked' in card.text.text:
159 |         # could also check for costs, at least make 'it's $ kicker,' not count as a kicker ability
160 |         newtext = card.text.text.replace(utils.reserved_mana_marker + ' kicker', '')
161 |         return 'kicker' in newtext and 'kicked' in newtext
162 |     else:
163 |         return None
164 | 
165 | def check_counters(card):
166 |     uses = len(re.findall(re.escape(utils.counter_marker), card.text.text))
167 |     if uses > 0:
168 |         return uses > 1 and 'countertype ' + utils.counter_marker in card.text.text
169 |     else:
170 |         return None
171 | 
172 | def check_choices(card):
173 |     bullets = len(re.findall(re.escape(utils.bullet_marker), card.text.text))
174 |     obracks = len(re.findall(re.escape(utils.choice_open_delimiter), card.text.text))
175 |     cbracks = len(re.findall(re.escape(utils.choice_close_delimiter), card.text.text))
176 |     if bullets + obracks + cbracks > 0:
177 |         if not (obracks == cbracks and bullets > 0):
178 |             return False
179 |         # could compile ahead of time
180 |         choice_regex = (re.escape(utils.choice_open_delimiter) + re.escape(utils.unary_marker)
181 |                         + r'.*' + re.escape(utils.bullet_marker) + r'.*'
182 |                         + re.escape(utils.choice_close_delimiter))
183 |         nochoices = re.sub(choice_regex, '', card.text.text)
184 |         nobullets = len(re.findall(re.escape(utils.bullet_marker), nochoices))
185 |         noobracks = len(re.findall(re.escape(utils.choice_open_delimiter), nochoices))
186 |         nocbracks = len(re.findall(re.escape(utils.choice_close_delimiter), nochoices))
187 |         return nobullets + noobracks + nocbracks == 0
188 |     else:
189 |         return None
190 | 
191 | def check_auras(card):
192 |     # a bit loose
193 |     if 'enchantment' in card.types or 'aura' in card.subtypes or 'enchant' in card.text.text:
194 |         return 'enchantment' in card.types or 'aura' in card.subtypes or 'enchant' in card.text.text
195 |     else:
196 |         return None
197 | 
198 | def check_equipment(card):
199 |     # probably even looser, could check for actual equip abilities and noncreatureness
200 |     if 'equipment' in card.subtypes:
201 |         return 'equip' in card.text.text
202 |     else:
203 |         return None
204 | 
205 | def check_vehicles(card):
206 |     if 'vehicle' in card.subtypes:
207 |         return 'crew' in card.text.text
208 |     else:
209 |         return None
210 | 
211 | def check_planeswalkers(card):
212 |     if 'planeswalker' in card.types:
213 |         good_lines = 0
214 |         bad_lines = 0
215 |         initial_re = r'^[+-]?'
+ re.escape(utils.unary_marker) + re.escape(utils.unary_counter) + '*:' 216 | initial_re_X = r'^[-+]' + re.escape(utils.x_marker) + '+:' 217 | for line in card.text_lines: 218 | if len(re.findall(initial_re, line.text)) == 1: 219 | good_lines += 1 220 | elif len(re.findall(initial_re_X, line.text)) == 1: 221 | good_lines += 1 222 | elif 'can be your commander' in line.text: 223 | pass 224 | elif 'countertype' in line.text or 'transform' in line.text: 225 | pass 226 | else: 227 | bad_lines += 1 228 | return good_lines > 1 and bad_lines == 0 229 | else: 230 | return None 231 | 232 | def check_levelup(card): 233 | if 'level' in card.text.text: 234 | uplines = 0 235 | llines = 0 236 | for line in card.text_lines: 237 | if 'countertype ' + utils.counter_marker + ' level' in line.text: 238 | uplines += 1 239 | llines += 1 240 | elif 'with level up' in line.text: 241 | llines += 1 242 | elif 'level up' in line.text: 243 | uplines += 1 244 | elif 'level' in line.text: 245 | llines += 1 246 | return uplines == 1 and llines > 0 247 | else: 248 | return None 249 | 250 | def check_activated(card): 251 | activated = 0 252 | for line in card.text_lines: 253 | if '.' in line.text: 254 | subtext = re.sub(r'"[^"]*"', '', line.text) 255 | if 'forecast' in subtext: 256 | pass 257 | elif 'return ' + utils.this_marker + ' from your graveyard' in subtext: 258 | pass 259 | elif 'on the stack' in subtext: 260 | pass 261 | elif ':' in subtext: 262 | activated += 1 263 | if activated > 0: 264 | return list_only(card.types, ['creature', 'land', 'artifact', 'enchantment', 'planeswalker', 'tribal']) 265 | else: 266 | return None 267 | 268 | def check_triggered(card): 269 | triggered = 0 270 | triggered_2 = 0 271 | for line in card.text_lines: 272 | if 'when ' + utils.this_marker + ' enters the battlefield' in line.text: 273 | triggered += 1 274 | if 'when ' + utils.this_marker + ' leaves the battlefield' in line.text: 275 | triggered += 1 276 | if 'when ' + utils.this_marker + ' dies' in line.text: 277 | triggered += 1 278 | elif 'at the beginning' == line.text[:16] or 'when' == line.text[:4]: 279 | if 'from your graveyard' in line.text: 280 | triggered_2 += 1 281 | elif 'in your graveyard' in line.text: 282 | triggered_2 += 1 283 | elif 'if ' + utils.this_marker + ' is suspended' in line.text: 284 | triggered_2 += 1 285 | elif 'if that card is exiled' in line.text or 'if ' + utils.this_marker + ' is exiled' in line.text: 286 | triggered_2 += 1 287 | elif 'when the creature ' + utils.this_marker + ' haunts' in line.text: 288 | triggered_2 += 1 289 | elif 'when you cycle ' + utils.this_marker in line.text or 'when you cast ' + utils.this_marker in line.text: 290 | triggered_2 += 1 291 | elif 'this turn' in line.text or 'this combat' in line.text or 'your next upkeep' in line.text: 292 | triggered_2 += 1 293 | elif 'from your library' in line.text: 294 | triggered_2 += 1 295 | elif 'you discard ' + utils.this_marker in line.text or 'you to discard ' + utils.this_marker in line.text: 296 | triggered_2 += 1 297 | else: 298 | triggered += 1 299 | 300 | if triggered > 0: 301 | return list_only(card.types, ['creature', 'land', 'artifact', 'enchantment', 'planeswalker', 'tribal']) 302 | elif triggered_2: 303 | return True 304 | else: 305 | return None 306 | 307 | def check_chosen(card): 308 | if 'chosen' in card.text.text: 309 | return ('choose' in card.text.text 310 | or 'chosen at random' in card.text.text 311 | or 'name' in card.text.text 312 | or 'is chosen' in card.text.text 313 | or 'search' in card.text.text) 314 | 
else: 315 | return None 316 | 317 | def check_shuffle(card): 318 | retval = None 319 | # sadly, this does not detect spurious shuffling 320 | for line in card.text_lines: 321 | if 'search' in line.text and 'library' in line.text: 322 | thisval = ('shuffle' in line.text 323 | or 'searches' in line.text 324 | or 'searched' in line.text 325 | or 'searching' in line.text 326 | or 'rest' in line.text 327 | or 'instead' in line.text) 328 | if retval is None: 329 | retval = thisval 330 | else: 331 | retval = retval and thisval 332 | return retval 333 | 334 | def check_quotes(card): 335 | retval = None 336 | for line in card.text_lines: 337 | quotes = len(re.findall(re.escape('"'), line.text)) 338 | # HACK: the '" pattern in the training set is actually incorrect 339 | quotes += len(re.findall(re.escape('\'"'), line.text)) 340 | if quotes > 0: 341 | thisval = quotes % 2 == 0 342 | if retval is None: 343 | retval = thisval 344 | else: 345 | retval = retval and thisval 346 | return retval 347 | 348 | props = OrderedDict([ 349 | ('types', check_types), 350 | ('pt', check_pt), 351 | ('lands', check_lands), 352 | ('X', check_X), 353 | ('kicker', check_kicker), 354 | ('counters', check_counters), 355 | ('choices', check_choices), 356 | ('quotes', check_quotes), 357 | ('auras', check_auras), 358 | ('equipment', check_equipment), 359 | ('vehicles', check_vehicles), 360 | ('planeswalkers', check_planeswalkers), 361 | ('levelup', check_levelup), 362 | ('chosen', check_chosen), 363 | ('shuffle', check_shuffle), 364 | ('activated', check_activated), 365 | ('triggered', check_triggered), 366 | ]) 367 | 368 | def process_props(cards, dump = False, uncovered = False): 369 | total_all = 0 370 | total_good = 0 371 | total_bad = 0 372 | total_uncovered = 0 373 | values = OrderedDict([(k, (0,0,0)) for k in props]) 374 | 375 | for card in cards: 376 | total_all += 1 377 | overall = True 378 | any_prop = False 379 | for prop in props: 380 | (total, good, bad) = values[prop] 381 | this_prop = props[prop](card) 382 | if not this_prop is None: 383 | total += 1 384 | if not prop == 'types': 385 | any_prop = True 386 | if this_prop: 387 | good += 1 388 | else: 389 | bad += 1 390 | overall = False 391 | if card.name not in ['demonic pact', 'lavaclaw reaches', 392 | "ertai's trickery", 'rumbling aftershocks', # i hate these 393 | ] and dump: 394 | print('---- ' + prop + ' ----') 395 | print(card.encode()) 396 | print(card.format()) 397 | values[prop] = (total, good, bad) 398 | if overall: 399 | total_good += 1 400 | else: 401 | total_bad += 1 402 | if not any_prop: 403 | total_uncovered += 1 404 | if uncovered: 405 | print('---- uncovered ----') 406 | print(card.encode()) 407 | print(card.format()) 408 | 409 | return ((total_all, total_good, total_bad, total_uncovered), 410 | values) 411 | 412 | def main(fname, oname = None, verbose = False, dump = False): 413 | # may need to set special arguments here 414 | cards = jdecode.mtg_open_file(fname, verbose=verbose) 415 | 416 | do_grams = False 417 | 418 | if do_grams: 419 | rg = {} 420 | for card in cards: 421 | g = rare_grams(card, thresh=2, grams=2) 422 | if len(card.text_words) > 0: 423 | g = int(1.0 + (float(g) * 100.0 / float(len(card.text_words)))) 424 | if g in rg: 425 | rg[g] += 1 426 | else: 427 | rg[g] = 1 428 | if g >= 60: 429 | print g 430 | print card.format() 431 | 432 | tot = 0 433 | vmax = sum(rg.values()) 434 | pct90 = None 435 | pct95 = None 436 | pct99 = None 437 | for i in sorted(rg): 438 | print str(i) + ' rare ngrams: ' + str(rg[i]) 439 | tot += rg[i] 440 | 
if pct90 is None and tot >= vmax * 0.90: 441 | pct90 = i 442 | if pct95 is None and tot >= vmax * 0.95: 443 | pct95 = i 444 | if pct99 is None and tot >= vmax * 0.99: 445 | pct99 = i 446 | 447 | print '90% - ' + str(pct90) 448 | print '95% - ' + str(pct95) 449 | print '99% - ' + str(pct99) 450 | 451 | else: 452 | ((total_all, total_good, total_bad, total_uncovered), 453 | values) = process_props(cards, dump=dump) 454 | 455 | # summary 456 | print('-- overall --') 457 | print(' total : ' + str(total_all)) 458 | print(' good : ' + str(total_good) + ' ' + pct(total_good, total_all)) 459 | print(' bad : ' + str(total_bad) + ' ' + pct(total_bad, total_all)) 460 | print(' uncocoverd: ' + str(total_uncovered) + ' ' + pct(total_uncovered, total_all)) 461 | print('----') 462 | 463 | # breakdown 464 | for prop in props: 465 | (total, good, bad) = values[prop] 466 | print(prop + ':') 467 | print(' total: ' + str(total) + ' ' + pct(total, total_all)) 468 | print(' good : ' + str(good) + ' ' + pct(good, total_all)) 469 | print(' bad : ' + str(bad) + ' ' + pct(bad, total_all)) 470 | 471 | 472 | if __name__ == '__main__': 473 | 474 | import argparse 475 | parser = argparse.ArgumentParser() 476 | 477 | parser.add_argument('infile', #nargs='?'. default=None, 478 | help='encoded card file or json corpus to process') 479 | parser.add_argument('outfile', nargs='?', default=None, 480 | help='name of output file, will be overwritten') 481 | parser.add_argument('-v', '--verbose', action='store_true', 482 | help='verbose output') 483 | parser.add_argument('-d', '--dump', action='store_true', 484 | help='print invalid cards') 485 | 486 | args = parser.parse_args() 487 | main(args.infile, args.outfile, verbose=args.verbose, dump=args.dump) 488 | exit(0) 489 | 490 | -------------------------------------------------------------------------------- /scripts/ngrams.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | import pickle 5 | 6 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib') 7 | sys.path.append(libdir) 8 | import jdecode 9 | import nltk_model as model 10 | 11 | def update_ngrams(lines, gramdict, grams): 12 | for line in lines: 13 | for i in range(0, len(line) - (grams - 1)): 14 | ngram = ' '.join([line[i + j] for j in range(0, grams)]) 15 | if ngram in gramdict: 16 | gramdict[ngram] += 1 17 | else: 18 | gramdict[ngram] = 1 19 | 20 | def describe_bins(gramdict, bins): 21 | bins = sorted(bins) 22 | counts = [0 for _ in range(0, len(bins) + 1)] 23 | 24 | for ngram in gramdict: 25 | for i in range(0, len(bins) + 1): 26 | if i < len(bins): 27 | if gramdict[ngram] <= bins[i]: 28 | counts[i] += 1 29 | break 30 | else: 31 | # didn't fit into any of the smaller bins, stick in on the end 32 | counts[-1] += 1 33 | 34 | for i in range(0, len(counts)): 35 | if counts[i] > 0: 36 | print (' ' + (str(bins[i]) if i < len(bins) else str(bins[-1]) + '+') 37 | + ': ' + str(counts[i])) 38 | 39 | def extract_language(cards, separate_lines = True): 40 | if separate_lines: 41 | lang = [line.vectorize() for card in cards for line in card.text_lines] 42 | else: 43 | lang = [card.text.vectorize() for card in cards] 44 | return map(lambda s: s.split(), lang) 45 | 46 | def build_ngram_model(cards, n, separate_lines = True, verbose = False): 47 | if verbose: 48 | print('generating ' + str(n) + '-gram model') 49 | lang = extract_language(cards, separate_lines=separate_lines) 50 | if verbose: 51 | print('found ' + 
str(len(lang)) + ' sentences') 52 | lm = model.NgramModel(n, lang, pad_left=True, pad_right=True) 53 | if verbose: 54 | print(lm) 55 | return lm 56 | 57 | def main(fname, oname, gmin = 2, gmax = 8, nltk = False, sep = False, verbose = False): 58 | # may need to set special arguments here 59 | cards = jdecode.mtg_open_file(fname, verbose=verbose) 60 | gmin = int(gmin) 61 | gmax = int(gmax) 62 | 63 | if nltk: 64 | n = gmin 65 | lm = build_ngram_model(cards, n, separate_lines=sep, verbose=verbose) 66 | if verbose: 67 | teststr = 'when @ enters the battlefield' 68 | print('litmus test: perplexity of ' + repr(teststr)) 69 | print(' ' + str(lm.perplexity(teststr.split()))) 70 | if verbose: 71 | print('pickling module to ' + oname) 72 | with open(oname, 'wb') as f: 73 | pickle.dump(lm, f) 74 | 75 | else: 76 | bins = [1, 2, 3, 10, 30, 100, 300, 1000] 77 | if gmin < 2 or gmax < gmin: 78 | print 'invalid gram sizes: ' + str(gmin) + '-' + str(gmax) 79 | exit(1) 80 | 81 | for grams in range(gmin, gmax+1): 82 | if verbose: 83 | print 'generating ' + str(grams) + '-grams...' 84 | gramdict = {} 85 | for card in cards: 86 | update_ngrams(card.text_lines_words, gramdict, grams) 87 | 88 | oname_full = oname + '.' + str(grams) + 'g' 89 | if verbose: 90 | print(' writing ' + str(len(gramdict)) + ' unique ' + str(grams) 91 | + '-grams to ' + oname_full) 92 | describe_bins(gramdict, bins) 93 | 94 | with open(oname_full, 'wt') as f: 95 | for ngram in sorted(gramdict, 96 | lambda x,y: cmp(gramdict[x], gramdict[y]), 97 | reverse = True): 98 | f.write((ngram + ': ' + str(gramdict[ngram]) + '\n').encode('utf-8')) 99 | 100 | if __name__ == '__main__': 101 | 102 | import argparse 103 | parser = argparse.ArgumentParser() 104 | 105 | parser.add_argument('infile', #nargs='?'. default=None, 106 | help='encoded card file or json corpus to process') 107 | parser.add_argument('outfile', #nargs='?', default=None, 108 | help='base name of output file, outputs ending in .2g, .3g etc. 
will be produced') 109 | parser.add_argument('-min', '--min', action='store', default='2', 110 | help='minimum gram size to compute') 111 | parser.add_argument('-max', '--max', action='store', default='8', 112 | help='maximum gram size to compute') 113 | parser.add_argument('-nltk', '--nltk', action='store_true', 114 | help='use nltk model.NgramModel, with n = min') 115 | parser.add_argument('-s', '--separate', action='store_true', 116 | help='separate card text into lines when constructing nltk model') 117 | parser.add_argument('-v', '--verbose', action='store_true', 118 | help='verbose output') 119 | 120 | args = parser.parse_args() 121 | main(args.infile, args.outfile, gmin=args.min, gmax=args.max, nltk=args.nltk, 122 | sep=args.separate, verbose=args.verbose) 123 | exit(0) 124 | -------------------------------------------------------------------------------- /scripts/pairing.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | import random 5 | import zipfile 6 | import shutil 7 | 8 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib') 9 | sys.path.append(libdir) 10 | datadir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../data') 11 | import utils 12 | import jdecode 13 | import ngrams 14 | import analysis 15 | import mtg_validate 16 | 17 | from cbow import CBOW 18 | 19 | separate_lines=True 20 | 21 | def select_card(cards, stats, i): 22 | card = cards[i] 23 | nearest = stats['dists']['cbow'][i] 24 | perp = stats['ngram']['perp'][i] 25 | perp_per = stats['ngram']['perp_per'][i] 26 | perp_max = stats['ngram']['perp_max'][i] 27 | 28 | if nearest > 0.9 or perp_per > 2.0 or perp_max > 10.0: 29 | return None 30 | 31 | ((_, total_good, _, _), _) = mtg_validate.process_props([card]) 32 | if not total_good == 1: 33 | return False 34 | 35 | # print '====' 36 | # print nearest 37 | # print perp 38 | # print perp_per 39 | # print perp_max 40 | # print '----' 41 | # print card.format() 42 | 43 | return True 44 | 45 | def compare_to_real(card, realcard): 46 | ctypes = ' '.join(sorted(card.types)) 47 | rtypes = ' '.join(sorted(realcard.types)) 48 | return ctypes == rtypes and realcard.cost.check_colors(card.cost.get_colors()) 49 | 50 | def writecard(card, name, writer): 51 | gatherer = False 52 | for_forum = True 53 | vdump = True 54 | fmt = card.format(gatherer = gatherer, for_forum = for_forum, vdump = vdump) 55 | oldname = card.name 56 | # alter name used in image 57 | card.name = name 58 | writer.write(card.to_mse().encode('utf-8')) 59 | card.name = oldname 60 | fstring = '' 61 | if card.json: 62 | fstring += 'JSON:\n' + card.json + '\n' 63 | if card.raw: 64 | fstring += 'raw:\n' + card.raw + '\n' 65 | fstring += '\n' 66 | fstring += fmt + '\n' 67 | fstring = fstring.replace('<', '(').replace('>', ')') 68 | writer.write(('\n' + fstring[:-1]).replace('\n', '\n\t\t').encode('utf-8')) 69 | writer.write('\n'.encode('utf-8')) 70 | 71 | def main(fname, oname, n=20, verbose=False): 72 | cbow = CBOW() 73 | realcards = jdecode.mtg_open_file(str(os.path.join(datadir, 'output.txt')), verbose=verbose) 74 | real_by_name = {c.name: c for c in realcards} 75 | lm = ngrams.build_ngram_model(realcards, 3, separate_lines=separate_lines, verbose=verbose) 76 | cards = jdecode.mtg_open_file(fname, verbose=verbose) 77 | stats = analysis.get_statistics(fname, lm=lm, sep=separate_lines, verbose=verbose) 78 | 79 | selected = [] 80 | for i in range(0, len(cards)): 81 | if select_card(cards, 
stats, i): 82 | selected += [(i, cards[i])] 83 | 84 | limit = 3000 85 | 86 | random.shuffle(selected) 87 | #selected = selected[:limit] 88 | 89 | if verbose: 90 | print('computing nearest cards for ' + str(len(selected)) + ' candindates...') 91 | cbow_nearest = cbow.nearest_par(map(lambda (i, c): c, selected)) 92 | for i in range(0, len(selected)): 93 | (j, card) = selected[i] 94 | selected[i] = (j, card, cbow_nearest[i]) 95 | if verbose: 96 | print('...done') 97 | 98 | final = [] 99 | for (i, card, nearest) in selected: 100 | for dist, rname in nearest: 101 | realcard = real_by_name[rname] 102 | if compare_to_real(card, realcard): 103 | final += [(i, card, realcard, dist)] 104 | break 105 | 106 | for (i, card, realcard, dist) in final: 107 | print '-- real --' 108 | print realcard.format() 109 | print '-- fake --' 110 | print card.format() 111 | print '-- stats --' 112 | perp_per = stats['ngram']['perp_per'][i] 113 | perp_max = stats['ngram']['perp_max'][i] 114 | print dist 115 | print perp_per 116 | print perp_max 117 | print '----' 118 | 119 | if not oname is None: 120 | with open(oname, 'wt') as ofile: 121 | ofile.write(utils.mse_prepend) 122 | for (i, card, realcard, dist) in final: 123 | name = realcard.name 124 | writecard(realcard, name, ofile) 125 | writecard(card, name, ofile) 126 | ofile.write('version control:\n\ttype: none\napprentice code: ') 127 | # Copy whatever output file is produced, name the copy 'set' (yes, no extension). 128 | if os.path.isfile('set'): 129 | print 'ERROR: tried to overwrite existing file "set" - aborting.' 130 | return 131 | shutil.copyfile(oname, 'set') 132 | # Use the freaky mse extension instead of zip. 133 | with zipfile.ZipFile(oname+'.mse-set', mode='w') as zf: 134 | try: 135 | # Zip up the set file into oname.mse-set. 136 | zf.write('set') 137 | finally: 138 | if verbose: 139 | print 'Made an MSE set file called ' + oname + '.mse-set.' 140 | # The set file is useless outside the .mse-set, delete it. 141 | os.remove('set') 142 | 143 | if __name__ == '__main__': 144 | 145 | import argparse 146 | parser = argparse.ArgumentParser() 147 | 148 | parser.add_argument('infile', #nargs='?'. 
default=None, 149 | help='encoded card file or json corpus to process') 150 | parser.add_argument('outfile', nargs='?', default=None, 151 | help='output file, defaults to none') 152 | parser.add_argument('-n', '--n', action='store', 153 | help='number of cards to consider for each pairing') 154 | parser.add_argument('-v', '--verbose', action='store_true', 155 | help='verbose output') 156 | 157 | args = parser.parse_args() 158 | main(args.infile, args.outfile, n=args.n, verbose=args.verbose) 159 | exit(0) 160 | -------------------------------------------------------------------------------- /scripts/sanity.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | import re 5 | import json 6 | 7 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib') 8 | sys.path.append(libdir) 9 | import utils 10 | import jdecode 11 | import cardlib 12 | import transforms 13 | 14 | def check_lines(fname): 15 | cards = jdecode.mtg_open_file(fname, verbose=True, linetrans=True) 16 | 17 | prelines = set() 18 | keylines = set() 19 | mainlines = set() 20 | costlines = set() 21 | postlines = set() 22 | 23 | known = ['enchant ', 'equip', 'countertype', 'multikicker', 'kicker', 24 | 'suspend', 'echo', 'awaken', 'bestow', 'buyback', 25 | 'cumulative', 'dash', 'entwine', 'evoke', 'fortify', 26 | 'flashback', 'madness', 'morph', 'megamorph', 'miracle', 'ninjutsu', 27 | 'overload', 'prowl', 'recover', 'reinforce', 'replicate', 'scavenge', 28 | 'splice', 'surge', 'unearth', 'transfigure', 'transmute', 29 | ] 30 | known = [] 31 | 32 | for card in cards: 33 | prel, keyl, mainl, costl, postl = transforms.separate_lines(card.text.encode(randomize=False)) 34 | if card.bside: 35 | prel2, keyl2, mainl2, costl2, postl2 = transforms.separate_lines(card.bside.text.encode(randomize=False)) 36 | prel += prel2 37 | keyl += keyl2 38 | mainl += mainl2 39 | costl += costl2 40 | postl += postl2 41 | 42 | for line in prel: 43 | if line.strip() == '': 44 | print(card.name, card.text.text) 45 | if any(line.startswith(s) for s in known): 46 | line = 'known' 47 | prelines.add(line) 48 | for line in postl: 49 | if line.strip() == '': 50 | print(card.name, card.text.text) 51 | if any(line.startswith(s) for s in known): 52 | line = 'known' 53 | postlines.add(line) 54 | for line in keyl: 55 | if line.strip() == '': 56 | print(card.name, card.text.text) 57 | if any(line.startswith(s) for s in known): 58 | line = 'known' 59 | keylines.add(line) 60 | for line in mainl: 61 | if line.strip() == '': 62 | print(card.name, card.text.text) 63 | # if any(line.startswith(s) for s in known): 64 | # line = 'known' 65 | mainlines.add(line) 66 | for line in costl: 67 | if line.strip() == '': 68 | print(card.name, card.text.text) 69 | # if any(line.startswith(s) for s in known) or 'cycling' in line or 'monstrosity' in line: 70 | # line = 'known' 71 | costlines.add(line) 72 | 73 | print('prel: {:d}, keyl: {:d}, mainl: {:d}, postl {:d}' 74 | .format(len(prelines), len(keylines), len(mainlines), len(postlines))) 75 | 76 | print('\nprelines') 77 | for line in sorted(prelines): 78 | print(line) 79 | 80 | print('\npostlines') 81 | for line in sorted(postlines): 82 | print(line) 83 | 84 | print('\ncostlines') 85 | for line in sorted(costlines): 86 | print(line) 87 | 88 | print('\nkeylines') 89 | for line in sorted(keylines): 90 | print(line) 91 | 92 | print('\nmainlines') 93 | for line in sorted(mainlines): 94 | #if any(s in line for s in ['champion', 'devour', 
'tribute']): 95 | print(line) 96 | 97 | def check_vocab(fname): 98 | cards = jdecode.mtg_open_file(fname, verbose=True, linetrans=True) 99 | 100 | vocab = {} 101 | for card in cards: 102 | words = card.text.vectorize().split() 103 | if card.bside: 104 | words += card.bside.text.vectorize().split() 105 | for word in words: 106 | if not word in vocab: 107 | vocab[word] = 1 108 | else: 109 | vocab[word] += 1 110 | 111 | for word in sorted(vocab, lambda x,y: cmp(vocab[x], vocab[y]), reverse = True): 112 | print('{:8d} : {:s}'.format(vocab[word], word)) 113 | 114 | n = 3 115 | 116 | for card in cards: 117 | words = card.text.vectorize().split() 118 | if card.bside: 119 | words += card.bside.text.vectorize().split() 120 | for word in words: 121 | if vocab[word] <= n: 122 | #if 'name' in word: 123 | print('\n{:8d} : {:s}'.format(vocab[word], word)) 124 | print(card.encode()) 125 | break 126 | 127 | def check_characters(fname, vname): 128 | cards = jdecode.mtg_open_file(fname, verbose=True, linetrans=True) 129 | 130 | tokens = {c for c in utils.cardsep} 131 | for card in cards: 132 | for c in card.encode(): 133 | tokens.add(c) 134 | 135 | token_to_idx = {tok:i+1 for i, tok in enumerate(sorted(tokens))} 136 | idx_to_token = {i+1:tok for i, tok in enumerate(sorted(tokens))} 137 | 138 | print('Vocabulary: ({:d} symbols)'.format(len(token_to_idx))) 139 | for token in sorted(token_to_idx): 140 | print('{:8s} : {:4d}'.format(repr(token), token_to_idx[token])) 141 | 142 | # compliant with torch-rnn 143 | if vname: 144 | json_data = {'token_to_idx':token_to_idx, 'idx_to_token':idx_to_token} 145 | print('writing vocabulary to {:s}'.format(vname)) 146 | with open(vname, 'w') as f: 147 | json.dump(json_data, f) 148 | 149 | if __name__ == '__main__': 150 | import argparse 151 | parser = argparse.ArgumentParser() 152 | 153 | parser.add_argument('infile', nargs='?', default=os.path.join(libdir, '../data/output.txt'), 154 | help='encoded card file or json corpus to process') 155 | parser.add_argument('-lines', action='store_true', 156 | help='show behavior of line separation') 157 | parser.add_argument('-vocab', action='store_true', 158 | help='show vocabulary counts from encoded card text') 159 | parser.add_argument('-chars', action='store_true', 160 | help='generate and display vocabulary of characters used in encoding') 161 | parser.add_argument('--vocab_name', default=None, 162 | help='json file to write vocabulary to') 163 | args = parser.parse_args() 164 | 165 | if args.lines: 166 | check_lines(args.infile) 167 | if args.vocab: 168 | check_vocab(args.infile) 169 | if args.chars: 170 | check_characters(args.infile, args.vocab_name) 171 | 172 | exit(0) 173 | -------------------------------------------------------------------------------- /scripts/streamcards.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # -- STOLEN FROM torch-rnn/scripts/streamfile.py -- # 4 | 5 | import os 6 | import threading 7 | import time 8 | import signal 9 | import traceback 10 | import psutil 11 | 12 | # correctly setting up a stream that won't get orphaned and left clutting the operating 13 | # system proceeds in 3 parts: 14 | # 1) invoke install_suicide_handlers() to ensure correct behavior on interrupt 15 | # 2) get threads by invoking spawn_stream_threads 16 | # 3) invoke wait_and_kill_self_noreturn(threads) 17 | # or, use the handy wrapper that does it for you 18 | 19 | def spawn_stream_threads(fds, runthread, mkargs): 20 | threads = [] 21 | for i, fd in 
enumerate(fds): 22 | stream_thread = threading.Thread(target=runthread, args=mkargs(i, fd)) 23 | stream_thread.daemon = True 24 | stream_thread.start() 25 | threads.append(stream_thread) 26 | return threads 27 | 28 | def force_kill_self_noreturn(): 29 | # We have a strange issue here, which is that our threads will refuse to die 30 | # to a normal exit() or sys.exit() because they're all blocked in write() calls 31 | # on full pipes; the simplest workaround seems to be to ask the OS to terminate us. 32 | # This kinda works, but... 33 | #os.kill(os.getpid(), signal.SIGTERM) 34 | # psutil might have useful features like checking if the pid has been reused before killing it. 35 | # Also we might have child processes like l2e luajits to think about. 36 | me = psutil.Process(os.getpid()) 37 | for child in me.children(recursive=True): 38 | child.terminate() 39 | me.terminate() 40 | 41 | def handler_kill_self(signum, frame): 42 | if signum != signal.SIGQUIT: 43 | traceback.print_stack(frame) 44 | print('caught signal {:d} - streamer sending SIGTERM to self'.format(signum)) 45 | force_kill_self_noreturn() 46 | 47 | def install_suicide_handlers(): 48 | for sig in [signal.SIGHUP, signal.SIGINT, signal.SIGQUIT]: 49 | signal.signal(sig, handler_kill_self) 50 | 51 | def wait_and_kill_self_noreturn(threads): 52 | running = True 53 | while running: 54 | running = False 55 | for thread in threads: 56 | if thread.is_alive(): 57 | running = True 58 | if(os.getppid() <= 1): 59 | # exit if parent process died (and we were reparented to init) 60 | break 61 | time.sleep(1) 62 | force_kill_self_noreturn() 63 | 64 | def streaming_noreturn(fds, write_stream, mkargs): 65 | install_suicide_handlers() 66 | threads = spawn_stream_threads(fds, write_stream, mkargs) 67 | wait_and_kill_self_noreturn(threads) 68 | assert False, 'should not return from streaming' 69 | 70 | # -- END STOLEN FROM torch-rnn/scripts/streamfile.py -- # 71 | 72 | import sys 73 | import random 74 | 75 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib') 76 | sys.path.append(libdir) 77 | import utils 78 | import jdecode 79 | import transforms 80 | 81 | def main(args): 82 | fds = args.fds 83 | fname = args.fname 84 | block_size = args.block_size 85 | main_seed = args.seed if args.seed != 0 else None 86 | 87 | # simple default encoding for now, will add more options with the curriculum 88 | # learning feature 89 | 90 | cards = jdecode.mtg_open_file(fname, verbose=True, linetrans=True) 91 | 92 | def write_stream(i, fd): 93 | local_random = random.Random(main_seed) 94 | local_random.jumpahead(i) 95 | local_cards = [card for card in cards] 96 | with open('/proc/self/fd/'+str(fd), 'wt') as f: 97 | while True: 98 | local_random.shuffle(local_cards) 99 | for card in local_cards: 100 | f.write(card.encode(randomize_mana=True, randomize_lines=True)) 101 | f.write(utils.cardsep) 102 | 103 | def mkargs(i, fd): 104 | return i, fd 105 | 106 | streaming_noreturn(fds, write_stream, mkargs) 107 | 108 | if __name__ == '__main__': 109 | import argparse 110 | 111 | parser = argparse.ArgumentParser() 112 | parser.add_argument('fds', type=int, nargs='+', 113 | help='file descriptors to write streams to') 114 | parser.add_argument('-f', '--fname', default=os.path.join(libdir, '../data/output.txt'), 115 | help='file to read cards from') 116 | parser.add_argument('-n', '--block_size', type=int, default=10000, 117 | help='number of characters each stream should read/write at a time') 118 | parser.add_argument('-s', '--seed', type=int, default=0, 
119 | help='random seed') 120 | args = parser.parse_args() 121 | 122 | main(args) 123 | -------------------------------------------------------------------------------- /scripts/sum.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | 5 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib') 6 | sys.path.append(libdir) 7 | 8 | def main(fname): 9 | with open(fname, 'rt') as f: 10 | text = f.read() 11 | 12 | cardstats = text.split('\n') 13 | nonempty = 0 14 | name_avg = 0 15 | name_dupes = 0 16 | card_avg = 0 17 | card_dupes = 0 18 | 19 | for c in cardstats: 20 | fields = c.split('|') 21 | if len(fields) < 4: 22 | continue 23 | nonempty += 1 24 | idx = int(fields[0]) 25 | name = str(fields[1]) 26 | ndist = float(fields[2]) 27 | cdist = float(fields[3]) 28 | 29 | name_avg += ndist 30 | if ndist == 1.0: 31 | name_dupes += 1 32 | card_avg += cdist 33 | if cdist == 1.0: 34 | card_dupes += 1 35 | 36 | name_avg = name_avg / float(nonempty) 37 | card_avg = card_avg / float(nonempty) 38 | 39 | print str(nonempty) + ' cards' 40 | print '-- names --' 41 | print 'avg distance: ' + str(name_avg) 42 | print 'num duplicates: ' + str(name_dupes) 43 | print '-- cards --' 44 | print 'avg distance: ' + str(card_avg) 45 | print 'num duplicates: ' + str(card_dupes) 46 | print '----' 47 | 48 | if __name__ == '__main__': 49 | 50 | import argparse 51 | parser = argparse.ArgumentParser() 52 | 53 | parser.add_argument('infile', #nargs='?'. default=None, 54 | help='data file to process') 55 | 56 | args = parser.parse_args() 57 | main(args.infile) 58 | exit(0) 59 | -------------------------------------------------------------------------------- /scripts/summarize.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import sys 3 | import os 4 | 5 | libdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), '../lib') 6 | sys.path.append(libdir) 7 | import utils 8 | import jdecode 9 | from datalib import Datamine 10 | 11 | def main(fname, verbose = True, outliers = False, dump_all = False): 12 | if fname[-5:] == '.json': 13 | if verbose: 14 | print 'This looks like a json file: ' + fname 15 | json_srcs = jdecode.mtg_open_json(fname, verbose) 16 | card_srcs = [] 17 | for json_cardname in sorted(json_srcs): 18 | if len(json_srcs[json_cardname]) > 0: 19 | card_srcs += [json_srcs[json_cardname][0]] 20 | else: 21 | if verbose: 22 | print 'Opening encoded card file: ' + fname 23 | with open(fname, 'rt') as f: 24 | text = f.read() 25 | card_srcs = text.split(utils.cardsep) 26 | 27 | mine = Datamine(card_srcs) 28 | mine.summarize() 29 | if outliers or dump_all: 30 | mine.outliers(dump_invalid = dump_all) 31 | 32 | 33 | if __name__ == '__main__': 34 | import argparse 35 | parser = argparse.ArgumentParser() 36 | 37 | parser.add_argument('infile', 38 | help='encoded card file or json corpus to process') 39 | parser.add_argument('-x', '--outliers', action='store_true', 40 | help='show additional diagnostics and edge cases') 41 | parser.add_argument('-a', '--all', action='store_true', 42 | help='show all information and dump invalid cards') 43 | parser.add_argument('-v', '--verbose', action='store_true', 44 | help='verbose output') 45 | 46 | args = parser.parse_args() 47 | main(args.infile, verbose = args.verbose, outliers = args.outliers, dump_all = args.all) 48 | exit(0) 49 | 
-------------------------------------------------------------------------------- /sortcards.py: -------------------------------------------------------------------------------- 1 | import re 2 | import codecs 3 | import sys 4 | from collections import OrderedDict 5 | 6 | # returns back a dictionary mapping the names of classes of cards 7 | # to lists of cards in those classes 8 | def sortcards(cards): 9 | classes = OrderedDict([ 10 | ('Special classes:', None), 11 | ('multicards', []), 12 | ('Inclusive classes:', None), 13 | ('X cards', []), 14 | ('kicker cards', []), 15 | ('counter cards', []), 16 | ('uncast cards', []), 17 | ('choice cards', []), 18 | ('equipment', []), 19 | ('levelers', []), 20 | ('legendary', []), 21 | ('Exclusive classes:', None), 22 | ('planeswalkers', []), 23 | ('lands', []), 24 | ('instants', []), 25 | ('sorceries', []), 26 | ('enchantments', []), 27 | ('noncreature artifacts', []), 28 | ('creatures', []), 29 | ('other', []), 30 | ('By color:', None), 31 | ('white', []), 32 | ('blue', []), 33 | ('black', []), 34 | ('red', []), 35 | ('green', []), 36 | ('colorless nonland', []), 37 | ('colorless land', []), 38 | ('unknown color', []), 39 | ('By number of colors:', None), 40 | ('zero colors', []), 41 | ('one color', []), 42 | ('two colors', []), 43 | ('three colors', []), 44 | ('four colors', []), 45 | ('five colors', []), 46 | ('more colors?', []), 47 | ]) 48 | 49 | for card in cards: 50 | # special classes 51 | if '|\n|' in card: 52 | # better formatting pls??? 53 | classes['multicards'] += [card.replace('|\n|', '|\n~~~~~~~~~~~~~~~~\n|')] 54 | continue 55 | 56 | # inclusive classes 57 | if 'X' in card: 58 | classes['X cards'] += [card] 59 | if 'kick' in card: 60 | classes['kicker cards'] += [card] 61 | if '%' in card or '#' in card: 62 | classes['counter cards'] += [card] 63 | if 'uncast' in card: 64 | classes['uncast cards'] += [card] 65 | if '[' in card or ']' in card or '=' in card: 66 | classes['choice cards'] += [card] 67 | if '|equipment|' in card or 'equip {' in card: 68 | classes['equipment'] += [card] 69 | if 'level up' in card or 'level &' in card: 70 | classes['levelers'] += [card] 71 | if '|legendary|' in card: 72 | classes['legendary'] += [card] 73 | 74 | # exclusive classes 75 | if '|planeswalker|' in card: 76 | classes['planeswalkers'] += [card] 77 | elif '|land|' in card: 78 | classes['lands'] += [card] 79 | elif '|instant|' in card: 80 | classes['instants'] += [card] 81 | elif '|sorcery|' in card: 82 | classes['sorceries'] += [card] 83 | elif '|enchantment|' in card: 84 | classes['enchantments'] += [card] 85 | elif '|artifact|' in card: 86 | classes['noncreature artifacts'] += [card] 87 | elif '|creature|' in card or 'artifact creature' in card: 88 | classes['creatures'] += [card] 89 | else: 90 | classes['other'] += [card] 91 | 92 | # color classes need to find the mana cost 93 | fields = card.split('|') 94 | if len(fields) != 11: 95 | classes['unknown color'] += [card] 96 | else: 97 | cost = fields[8] 98 | color_count = 0 99 | if 'W' in cost or 'U' in cost or 'B' in cost or 'R' in cost or 'G' in cost: 100 | if 'W' in cost: 101 | classes['white'] += [card] 102 | color_count += 1 103 | if 'U' in cost: 104 | classes['blue'] += [card] 105 | color_count += 1 106 | if 'B' in cost: 107 | classes['black'] += [card] 108 | color_count += 1 109 | if 'R' in cost: 110 | classes['red'] += [card] 111 | color_count += 1 112 | if 'G' in cost: 113 | classes['green'] += [card] 114 | color_count += 1 115 | # should be unreachable 116 | if color_count == 0: 117 | 
classes['unknown color'] += [card] 118 | else: 119 | if '|land|' in card: 120 | classes['colorless land'] += [card] 121 | else: 122 | classes['colorless nonland'] += [card] 123 | 124 | if color_count == 0: 125 | classes['zero colors'] += [card] 126 | elif color_count == 1: 127 | classes['one color'] += [card] 128 | elif color_count == 2: 129 | classes['two colors'] += [card] 130 | elif color_count == 3: 131 | classes['three colors'] += [card] 132 | elif color_count == 4: 133 | classes['four colors'] += [card] 134 | elif color_count == 5: 135 | classes['five colors'] += [card] 136 | else: 137 | classes['more colors?'] += [card] 138 | 139 | return classes 140 | 141 | 142 | def main(fname, oname = None, verbose = True): 143 | if verbose: 144 | print 'Opening encoded card file: ' + fname 145 | 146 | f = open(fname, 'r') 147 | text = f.read() 148 | f.close() 149 | 150 | # we get rid of the first and last because they are probably partial 151 | cards = text.split('\n\n')[1:-1] 152 | classes = sortcards(cards) 153 | 154 | if not oname == None: 155 | if verbose: 156 | print 'Writing output to: ' + oname 157 | ofile = codecs.open(oname, 'w', 'utf-8') 158 | 159 | for cardclass in classes: 160 | if classes[cardclass] == None: 161 | print cardclass 162 | else: 163 | print ' ' + cardclass + ': ' + str(len(classes[cardclass])) 164 | 165 | if oname == None: 166 | outputter = sys.stdout 167 | else: 168 | outputter = ofile 169 | 170 | for cardclass in classes: 171 | if classes[cardclass] == None: 172 | outputter.write(cardclass + '\n') 173 | else: 174 | classlen = len(classes[cardclass]) 175 | if classlen > 0: 176 | outputter.write('[spoiler=' + cardclass + ': ' + str(classlen) + ' cards]\n') 177 | for card in classes[cardclass]: 178 | outputter.write(card + '\n\n') 179 | outputter.write('[/spoiler]\n') 180 | 181 | if not oname == None: 182 | ofile.close() 183 | 184 | 185 | if __name__ == '__main__': 186 | import sys 187 | if len(sys.argv) == 2: 188 | main(sys.argv[1]) 189 | elif len(sys.argv) == 3: 190 | main(sys.argv[1], oname = sys.argv[2]) 191 | else: 192 | print 'Usage: ' + sys.argv[0] + ' ' + ' [output filename]' 193 | exit(1) 194 | 195 | --------------------------------------------------------------------------------