├── .gitignore ├── LICENSE ├── MANIFEST ├── README.md ├── __init__.py ├── benchmarks ├── README.md ├── iris.csv └── iris_orange.csv ├── clean.sh ├── doc ├── documentation.ipynb └── documentation.py ├── pyrulelearn ├── __init__.py ├── cplex_wrap.py ├── imli.py ├── maxsat_wrap.py └── utils.py ├── requirements.txt ├── setup.cfg └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.aux 2 | *.log 3 | *.synctex.* 4 | .DS_Store 5 | *.out 6 | *.toc 7 | *.blg 8 | *.bbl 9 | *.pyc 10 | *.iml 11 | *.xml 12 | dist/* 13 | pyrulelearn.egg-info/* 14 | .vscode/* 15 | */__pycache__/* 16 | build/* 17 | temp/* 18 | test.py 19 | data/* 20 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | IMLI --Copyright (c) 2020 2 | Bishwamittra Ghosh 3 | Kuldeep S. Meel. 4 | 5 | 6 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: 7 | 8 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 9 | 10 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -------------------------------------------------------------------------------- /MANIFEST: -------------------------------------------------------------------------------- 1 | # file GENERATED by distutils, do NOT edit 2 | setup.cfg 3 | setup.py 4 | pyrulelearn/IMLI.py 5 | pyrulelearn/__init__.py 6 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 2 | 3 | # IMLI 4 | 5 | IMLI is an interpretable classification rule learning framework based on incremental mini-batch learning. This tool can be used to learn classification rules expressible in propositional logic, in particular in [CNF, DNF](https://bishwamittra.github.io/publication/imli-ghosh.pdf), and [relaxed CNF](https://bishwamittra.github.io/publication/ecai_2020/paper.pdf). 6 | 7 | This tool is based on our [CP-2018](https://arxiv.org/abs/1812.01843), [AIES-2019](https://bishwamittra.github.io/publication/imli-ghosh.pdf), and [ECAI-2020](https://bishwamittra.github.io/publication/ecai_2020/paper.pdf) papers. 8 | 9 | 10 | 11 | 12 | 13 | 14 | # Install 15 | - Install the PIP library. 16 | ``` 17 | pip install pyrulelearn 18 | ``` 19 | 20 | - Run `pip install -r requirements.txt` to install all necessary python packages available from pip. 21 | 22 | This framework requires installing an off-the-shelf MaxSAT solver to learn CNF/DNF rules. 
Additionally, to learn relaxed-CNF rules, an LP (Linear Programming) solver is required. 23 | 24 | ### Install MaxSAT solvers 25 | 26 | To install Open-wbo, follow the instructions from [here](http://sat.inesc-id.pt/open-wbo/). 27 | After the installation is complete, add the path of the binary to the PATH variable. 28 | ``` 29 | export PATH=$PATH:'/path/to/open-wbo/' 30 | ``` 31 | Other off-the-shelf MaxSAT solvers can also be used for this framework. 32 | 33 | ### Install CPLEX 34 | 35 | To install the linear programming solver, i.e., CPLEX, download and install it from [IBM](https://www.ibm.com/support/pages/downloading-ibm-ilog-cplex-optimization-studio-v1290). To setup the Python API of CPLEX, follow the instructions from [here](https://www.ibm.com/support/knowledgecenter/SSSA5P_12.7.0/ilog.odms.cplex.help/CPLEX/GettingStarted/topics/set_up/Python_setup.html). 36 | 37 | # Documentation 38 | 39 | See the documentation in the [notebook](doc/documentation.ipynb). 40 | 41 | ## Issues, questions, bugs, etc. 42 | Please click on "issues" at the top and [create a new issue](https://github.com/meelgroup/MLIC/issues). All issues are responded to promptly. 43 | 44 | ## Contact 45 | [Bishwamittra Ghosh](https://bishwamittra.github.io/) (bghosh@u.nus.edu) 46 | 47 | ## Citations 48 | 49 | 50 | @inproceedings{GMM20,
51 | author={Ghosh, Bishwamittra and Malioutov, Dmitry and Meel, Kuldeep S.},
52 | title={Classification Rules in Relaxed Logical Form},
53 | booktitle={Proc. of ECAI},
54 | year={2020},} 55 | 56 | @inproceedings{GM19,
57 | author={Ghosh, Bishwamittra and Meel, Kuldeep S.},
58 | title={{IMLI}: An Incremental Framework for MaxSAT-Based Learning of Interpretable Classification Rules},
59 | booktitle={Proc. of AIES},
60 | year={2019},} 61 | 62 | @inproceedings{MM18,
63 | author={Malioutov, Dmitry and Meel, Kuldeep S.},
64 | title={{MLIC}: A MaxSAT-Based Framework for Learning Interpretable Classification Rules},
65 | booktitle={Proceedings of the International Conference on Constraint Programming (CP)},
66 | month={08},
67 | year={2018},}
68 | 
69 | ## Old Versions
70 | The old version, MLIC (a non-incremental framework), is available under the branch "MLIC". Please read the README of that release for instructions on how to compile the code.
71 | 
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/meelgroup/MLIC/30edc0f41fca65eec48e4c1fc16a3c752cb97618/__init__.py
--------------------------------------------------------------------------------
/benchmarks/README.md:
--------------------------------------------------------------------------------
1 | # Datasets description
2 | 
3 | Find the full set of datasets at this [link](https://drive.google.com/drive/folders/1HFAxx1jM9mvnXscXXso5OR9OdoJL0ZWs?usp=sharing).
4 | This link contains two folders: `converted_for_orange_library` and `quantile_based_discretization`.
5 | 
6 | ## Prepare datasets for entropy-based discretization
7 | 
8 | `converted_for_orange_library` contains datasets that can be passed to the subroutine `pyrulelearn.utils.discretize_orange()` (demonstrated in `doc/documentation.ipynb`). This subroutine is based on the [entropy-based feature discretization](https://www.ijcai.org/Proceedings/93-2/Papers/022.pdf) library of [Orange](https://docs.biolab.si//3/data-mining-library/reference/preprocess.html#discretization). To prepare a dataset for `discretize_orange()`, modify the feature names as follows:
9 | 
10 | 1. For a categorical/discrete feature, prepend `D#` to the feature name. For example, if `Gender={female, male, others}` is a categorical feature in the dataset, the modified feature name is `D#Gender`.
11 | 2. For a continuous-valued feature, prepend `C#`. For example, `income` is modified as `C#income`.
12 | 3. For the target (discrete) column, prepend `cD#`. For example, the target column `defaulted` is modified as `cD#defaulted`.
13 | 4. To ignore a feature, prepend `i#` to the feature name.
14 | 
15 | For more details, review the instructions in the Orange [documentation](https://docs.biolab.si//3/data-mining-library/reference/data.io.html). 
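For example, the header row of `iris_orange.csv` in this folder already follows this convention (`C#` for the four continuous measurements, `cD#` for the class column):

```
C#sepal length,C#sepal width,C#petal length,C#petal width,cD#iris species
```

A file prepared this way can then be discretized with `pyrulelearn.utils.discretize_orange`. The snippet below is a minimal sketch based on the example in `doc/documentation.ipynb`; the path is assumed to be relative to the repository root.

```python
from pyrulelearn import utils

# Apply entropy-based discretization to the continuous columns.
# Returns the binarized feature matrix X, the label vector y, and the list of
# human-readable feature names that is later passed to model.get_rule(features).
X, y, features = utils.discretize_orange("benchmarks/iris_orange.csv")
```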
16 | -------------------------------------------------------------------------------- /benchmarks/iris.csv: -------------------------------------------------------------------------------- 1 | sepal length,sepal width,petal length,petal width,iris species 2 | 5.1,3.5,1.4,0.2,0 3 | 4.9,3,1.4,0.2,0 4 | 4.7,3.2,1.3,0.2,0 5 | 4.6,3.1,1.5,0.2,0 6 | 5,3.6,1.4,0.2,0 7 | 5.4,3.9,1.7,0.4,0 8 | 4.6,3.4,1.4,0.3,0 9 | 5,3.4,1.5,0.2,0 10 | 4.4,2.9,1.4,0.2,0 11 | 4.9,3.1,1.5,0.1,0 12 | 5.4,3.7,1.5,0.2,0 13 | 4.8,3.4,1.6,0.2,0 14 | 4.8,3,1.4,0.1,0 15 | 4.3,3,1.1,0.1,0 16 | 5.8,4,1.2,0.2,0 17 | 5.7,4.4,1.5,0.4,0 18 | 5.4,3.9,1.3,0.4,0 19 | 5.1,3.5,1.4,0.3,0 20 | 5.7,3.8,1.7,0.3,0 21 | 5.1,3.8,1.5,0.3,0 22 | 5.4,3.4,1.7,0.2,0 23 | 5.1,3.7,1.5,0.4,0 24 | 4.6,3.6,1,0.2,0 25 | 5.1,3.3,1.7,0.5,0 26 | 4.8,3.4,1.9,0.2,0 27 | 5,3,1.6,0.2,0 28 | 5,3.4,1.6,0.4,0 29 | 5.2,3.5,1.5,0.2,0 30 | 5.2,3.4,1.4,0.2,0 31 | 4.7,3.2,1.6,0.2,0 32 | 4.8,3.1,1.6,0.2,0 33 | 5.4,3.4,1.5,0.4,0 34 | 5.2,4.1,1.5,0.1,0 35 | 5.5,4.2,1.4,0.2,0 36 | 4.9,3.1,1.5,0.1,0 37 | 5,3.2,1.2,0.2,0 38 | 5.5,3.5,1.3,0.2,0 39 | 4.9,3.1,1.5,0.1,0 40 | 4.4,3,1.3,0.2,0 41 | 5.1,3.4,1.5,0.2,0 42 | 5,3.5,1.3,0.3,0 43 | 4.5,2.3,1.3,0.3,0 44 | 4.4,3.2,1.3,0.2,0 45 | 5,3.5,1.6,0.6,0 46 | 5.1,3.8,1.9,0.4,0 47 | 4.8,3,1.4,0.3,0 48 | 5.1,3.8,1.6,0.2,0 49 | 4.6,3.2,1.4,0.2,0 50 | 5.3,3.7,1.5,0.2,0 51 | 5,3.3,1.4,0.2,0 52 | 7,3.2,4.7,1.4,1 53 | 6.4,3.2,4.5,1.5,1 54 | 6.9,3.1,4.9,1.5,1 55 | 5.5,2.3,4,1.3,1 56 | 6.5,2.8,4.6,1.5,1 57 | 5.7,2.8,4.5,1.3,1 58 | 6.3,3.3,4.7,1.6,1 59 | 4.9,2.4,3.3,1,1 60 | 6.6,2.9,4.6,1.3,1 61 | 5.2,2.7,3.9,1.4,1 62 | 5,2,3.5,1,1 63 | 5.9,3,4.2,1.5,1 64 | 6,2.2,4,1,1 65 | 6.1,2.9,4.7,1.4,1 66 | 5.6,2.9,3.6,1.3,1 67 | 6.7,3.1,4.4,1.4,1 68 | 5.6,3,4.5,1.5,1 69 | 5.8,2.7,4.1,1,1 70 | 6.2,2.2,4.5,1.5,1 71 | 5.6,2.5,3.9,1.1,1 72 | 5.9,3.2,4.8,1.8,1 73 | 6.1,2.8,4,1.3,1 74 | 6.3,2.5,4.9,1.5,1 75 | 6.1,2.8,4.7,1.2,1 76 | 6.4,2.9,4.3,1.3,1 77 | 6.6,3,4.4,1.4,1 78 | 6.8,2.8,4.8,1.4,1 79 | 6.7,3,5,1.7,1 80 | 6,2.9,4.5,1.5,1 81 | 5.7,2.6,3.5,1,1 82 | 5.5,2.4,3.8,1.1,1 83 | 5.5,2.4,3.7,1,1 84 | 5.8,2.7,3.9,1.2,1 85 | 6,2.7,5.1,1.6,1 86 | 5.4,3,4.5,1.5,1 87 | 6,3.4,4.5,1.6,1 88 | 6.7,3.1,4.7,1.5,1 89 | 6.3,2.3,4.4,1.3,1 90 | 5.6,3,4.1,1.3,1 91 | 5.5,2.5,4,1.3,1 92 | 5.5,2.6,4.4,1.2,1 93 | 6.1,3,4.6,1.4,1 94 | 5.8,2.6,4,1.2,1 95 | 5,2.3,3.3,1,1 96 | 5.6,2.7,4.2,1.3,1 97 | 5.7,3,4.2,1.2,1 98 | 5.7,2.9,4.2,1.3,1 99 | 6.2,2.9,4.3,1.3,1 100 | 5.1,2.5,3,1.1,1 101 | 5.7,2.8,4.1,1.3,1 102 | 6.3,3.3,6,2.5,0 103 | 5.8,2.7,5.1,1.9,0 104 | 7.1,3,5.9,2.1,0 105 | 6.3,2.9,5.6,1.8,0 106 | 6.5,3,5.8,2.2,0 107 | 7.6,3,6.6,2.1,0 108 | 4.9,2.5,4.5,1.7,0 109 | 7.3,2.9,6.3,1.8,0 110 | 6.7,2.5,5.8,1.8,0 111 | 7.2,3.6,6.1,2.5,0 112 | 6.5,3.2,5.1,2,0 113 | 6.4,2.7,5.3,1.9,0 114 | 6.8,3,5.5,2.1,0 115 | 5.7,2.5,5,2,0 116 | 5.8,2.8,5.1,2.4,0 117 | 6.4,3.2,5.3,2.3,0 118 | 6.5,3,5.5,1.8,0 119 | 7.7,3.8,6.7,2.2,0 120 | 7.7,2.6,6.9,2.3,0 121 | 6,2.2,5,1.5,0 122 | 6.9,3.2,5.7,2.3,0 123 | 5.6,2.8,4.9,2,0 124 | 7.7,2.8,6.7,2,0 125 | 6.3,2.7,4.9,1.8,0 126 | 6.7,3.3,5.7,2.1,0 127 | 7.2,3.2,6,1.8,0 128 | 6.2,2.8,4.8,1.8,0 129 | 6.1,3,4.9,1.8,0 130 | 6.4,2.8,5.6,2.1,0 131 | 7.2,3,5.8,1.6,0 132 | 7.4,2.8,6.1,1.9,0 133 | 7.9,3.8,6.4,2,0 134 | 6.4,2.8,5.6,2.2,0 135 | 6.3,2.8,5.1,1.5,0 136 | 6.1,2.6,5.6,1.4,0 137 | 7.7,3,6.1,2.3,0 138 | 6.3,3.4,5.6,2.4,0 139 | 6.4,3.1,5.5,1.8,0 140 | 6,3,4.8,1.8,0 141 | 6.9,3.1,5.4,2.1,0 142 | 6.7,3.1,5.6,2.4,0 143 | 6.9,3.1,5.1,2.3,0 144 | 5.8,2.7,5.1,1.9,0 145 | 6.8,3.2,5.9,2.3,0 146 | 6.7,3.3,5.7,2.5,0 147 | 6.7,3,5.2,2.3,0 148 | 6.3,2.5,5,1.9,0 149 | 
6.5,3,5.2,2,0 150 | 6.2,3.4,5.4,2.3,0 151 | 5.9,3,5.1,1.8,0 -------------------------------------------------------------------------------- /benchmarks/iris_orange.csv: -------------------------------------------------------------------------------- 1 | C#sepal length,C#sepal width,C#petal length,C#petal width,cD#iris species 2 | 5.1,3.5,1.4,0.2,0 3 | 4.9,3.0,1.4,0.2,0 4 | 4.7,3.2,1.3,0.2,0 5 | 4.6,3.1,1.5,0.2,0 6 | 5.0,3.6,1.4,0.2,0 7 | 5.4,3.9,1.7,0.4,0 8 | 4.6,3.4,1.4,0.3,0 9 | 5.0,3.4,1.5,0.2,0 10 | 4.4,2.9,1.4,0.2,0 11 | 4.9,3.1,1.5,0.1,0 12 | 5.4,3.7,1.5,0.2,0 13 | 4.8,3.4,1.6,0.2,0 14 | 4.8,3.0,1.4,0.1,0 15 | 4.3,3.0,1.1,0.1,0 16 | 5.8,4.0,1.2,0.2,0 17 | 5.7,4.4,1.5,0.4,0 18 | 5.4,3.9,1.3,0.4,0 19 | 5.1,3.5,1.4,0.3,0 20 | 5.7,3.8,1.7,0.3,0 21 | 5.1,3.8,1.5,0.3,0 22 | 5.4,3.4,1.7,0.2,0 23 | 5.1,3.7,1.5,0.4,0 24 | 4.6,3.6,1.0,0.2,0 25 | 5.1,3.3,1.7,0.5,0 26 | 4.8,3.4,1.9,0.2,0 27 | 5.0,3.0,1.6,0.2,0 28 | 5.0,3.4,1.6,0.4,0 29 | 5.2,3.5,1.5,0.2,0 30 | 5.2,3.4,1.4,0.2,0 31 | 4.7,3.2,1.6,0.2,0 32 | 4.8,3.1,1.6,0.2,0 33 | 5.4,3.4,1.5,0.4,0 34 | 5.2,4.1,1.5,0.1,0 35 | 5.5,4.2,1.4,0.2,0 36 | 4.9,3.1,1.5,0.1,0 37 | 5.0,3.2,1.2,0.2,0 38 | 5.5,3.5,1.3,0.2,0 39 | 4.9,3.1,1.5,0.1,0 40 | 4.4,3.0,1.3,0.2,0 41 | 5.1,3.4,1.5,0.2,0 42 | 5.0,3.5,1.3,0.3,0 43 | 4.5,2.3,1.3,0.3,0 44 | 4.4,3.2,1.3,0.2,0 45 | 5.0,3.5,1.6,0.6,0 46 | 5.1,3.8,1.9,0.4,0 47 | 4.8,3.0,1.4,0.3,0 48 | 5.1,3.8,1.6,0.2,0 49 | 4.6,3.2,1.4,0.2,0 50 | 5.3,3.7,1.5,0.2,0 51 | 5.0,3.3,1.4,0.2,0 52 | 7.0,3.2,4.7,1.4,1 53 | 6.4,3.2,4.5,1.5,1 54 | 6.9,3.1,4.9,1.5,1 55 | 5.5,2.3,4.0,1.3,1 56 | 6.5,2.8,4.6,1.5,1 57 | 5.7,2.8,4.5,1.3,1 58 | 6.3,3.3,4.7,1.6,1 59 | 4.9,2.4,3.3,1.0,1 60 | 6.6,2.9,4.6,1.3,1 61 | 5.2,2.7,3.9,1.4,1 62 | 5.0,2.0,3.5,1.0,1 63 | 5.9,3.0,4.2,1.5,1 64 | 6.0,2.2,4.0,1.0,1 65 | 6.1,2.9,4.7,1.4,1 66 | 5.6,2.9,3.6,1.3,1 67 | 6.7,3.1,4.4,1.4,1 68 | 5.6,3.0,4.5,1.5,1 69 | 5.8,2.7,4.1,1.0,1 70 | 6.2,2.2,4.5,1.5,1 71 | 5.6,2.5,3.9,1.1,1 72 | 5.9,3.2,4.8,1.8,1 73 | 6.1,2.8,4.0,1.3,1 74 | 6.3,2.5,4.9,1.5,1 75 | 6.1,2.8,4.7,1.2,1 76 | 6.4,2.9,4.3,1.3,1 77 | 6.6,3.0,4.4,1.4,1 78 | 6.8,2.8,4.8,1.4,1 79 | 6.7,3.0,5.0,1.7,1 80 | 6.0,2.9,4.5,1.5,1 81 | 5.7,2.6,3.5,1.0,1 82 | 5.5,2.4,3.8,1.1,1 83 | 5.5,2.4,3.7,1.0,1 84 | 5.8,2.7,3.9,1.2,1 85 | 6.0,2.7,5.1,1.6,1 86 | 5.4,3.0,4.5,1.5,1 87 | 6.0,3.4,4.5,1.6,1 88 | 6.7,3.1,4.7,1.5,1 89 | 6.3,2.3,4.4,1.3,1 90 | 5.6,3.0,4.1,1.3,1 91 | 5.5,2.5,4.0,1.3,1 92 | 5.5,2.6,4.4,1.2,1 93 | 6.1,3.0,4.6,1.4,1 94 | 5.8,2.6,4.0,1.2,1 95 | 5.0,2.3,3.3,1.0,1 96 | 5.6,2.7,4.2,1.3,1 97 | 5.7,3.0,4.2,1.2,1 98 | 5.7,2.9,4.2,1.3,1 99 | 6.2,2.9,4.3,1.3,1 100 | 5.1,2.5,3.0,1.1,1 101 | 5.7,2.8,4.1,1.3,1 102 | 6.3,3.3,6.0,2.5,0 103 | 5.8,2.7,5.1,1.9,0 104 | 7.1,3.0,5.9,2.1,0 105 | 6.3,2.9,5.6,1.8,0 106 | 6.5,3.0,5.8,2.2,0 107 | 7.6,3.0,6.6,2.1,0 108 | 4.9,2.5,4.5,1.7,0 109 | 7.3,2.9,6.3,1.8,0 110 | 6.7,2.5,5.8,1.8,0 111 | 7.2,3.6,6.1,2.5,0 112 | 6.5,3.2,5.1,2.0,0 113 | 6.4,2.7,5.3,1.9,0 114 | 6.8,3.0,5.5,2.1,0 115 | 5.7,2.5,5.0,2.0,0 116 | 5.8,2.8,5.1,2.4,0 117 | 6.4,3.2,5.3,2.3,0 118 | 6.5,3.0,5.5,1.8,0 119 | 7.7,3.8,6.7,2.2,0 120 | 7.7,2.6,6.9,2.3,0 121 | 6.0,2.2,5.0,1.5,0 122 | 6.9,3.2,5.7,2.3,0 123 | 5.6,2.8,4.9,2.0,0 124 | 7.7,2.8,6.7,2.0,0 125 | 6.3,2.7,4.9,1.8,0 126 | 6.7,3.3,5.7,2.1,0 127 | 7.2,3.2,6.0,1.8,0 128 | 6.2,2.8,4.8,1.8,0 129 | 6.1,3.0,4.9,1.8,0 130 | 6.4,2.8,5.6,2.1,0 131 | 7.2,3.0,5.8,1.6,0 132 | 7.4,2.8,6.1,1.9,0 133 | 7.9,3.8,6.4,2.0,0 134 | 6.4,2.8,5.6,2.2,0 135 | 6.3,2.8,5.1,1.5,0 136 | 6.1,2.6,5.6,1.4,0 137 | 7.7,3.0,6.1,2.3,0 138 | 6.3,3.4,5.6,2.4,0 139 | 6.4,3.1,5.5,1.8,0 140 | 
6.0,3.0,4.8,1.8,0 141 | 6.9,3.1,5.4,2.1,0 142 | 6.7,3.1,5.6,2.4,0 143 | 6.9,3.1,5.1,2.3,0 144 | 5.8,2.7,5.1,1.9,0 145 | 6.8,3.2,5.9,2.3,0 146 | 6.7,3.3,5.7,2.5,0 147 | 6.7,3.0,5.2,2.3,0 148 | 6.3,2.5,5.0,1.9,0 149 | 6.5,3.0,5.2,2.0,0 150 | 6.2,3.4,5.4,2.3,0 151 | 5.9,3.0,5.1,1.8,0 152 | -------------------------------------------------------------------------------- /clean.sh: -------------------------------------------------------------------------------- 1 | rm -r build dist *egg-info 2 | find . -type d -name "__pycache__" -exec rm -r {} + 3 | find . -type d -name ".ipynb_checkpoints" -exec rm -r {} + 4 | rm */*model*.* 5 | rm model*.* 6 | -------------------------------------------------------------------------------- /doc/documentation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 1. Learn Binary classification rules\n", 8 | "\n", 9 | "This tutorial shows how to learn classification rules using MaxSAT-based incremental learning framework, IMLI. We show how to learn five popular classification rules under the same framework. \n", 10 | "\n", 11 | "- CNF rules (Conjunctive Normal Form)\n", 12 | "- DNF rules (Disjunctive Normal Form)\n", 13 | "- Decision sets\n", 14 | "- Decision lists\n", 15 | "- relaxed-CNF rules" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": { 22 | "tags": [] 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "import sys\n", 27 | "# sys.path.append(\"../\")\n", 28 | "\n", 29 | "from pyrulelearn.imli import imli\n", 30 | "from pyrulelearn import utils\n", 31 | "from sklearn.metrics import confusion_matrix\n", 32 | "from sklearn.model_selection import train_test_split\n", 33 | "from sklearn.metrics import classification_report" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": 7, 39 | "metadata": {}, 40 | "outputs": [ 41 | { 42 | "name": "stdout", 43 | "output_type": "stream", 44 | "text": [ 45 | "MaxHS is not installed\n" 46 | ] 47 | } 48 | ], 49 | "source": [ 50 | "# Check if MaxSAT solver such as Open-WBO, MaxHS and MILP solver such as cplex are installed\n", 51 | "import os\n", 52 | "if(os.system(\"which open-wbo\") != 0):\n", 53 | " print(\"Open-WBO is not installed\")\n", 54 | "if(os.system(\"which maxhs\") != 0):\n", 55 | " print(\"MaxHS is not installed\")\n", 56 | "try:\n", 57 | " import cplex\n", 58 | "except Exception as e:\n", 59 | " print(e)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "### Model Configuration\n", 67 | "\n", 68 | "Our first objective is to learn a classification rule in CNF, where the decision rule is ANDs of ORs of input features. For that, we specify `rule_type = CNF` inside the classification model `imli`. In this example, we learn a 2-clause rule with following hyper-parameters.\n", 69 | "\n", 70 | "- `rule_type` sets the type of classification rule. Other possible options are DNF, decision sets, decision lists, relaxed_CNF,\n", 71 | "- `num_clause` decides the number of clauses in the classfication rule,\n", 72 | "- `data_fidelity` decides the weight on classification error during training,\n", 73 | "- `weight_feature` decides the weight of rule-complexity, that is, the cost of introducing a Boolean feature in the classifier rule,\n", 74 | "\n", 75 | "\n", 76 | "We require a MaxSAT solver to learn the Boolean rule. In this example, we use `open-wbo` as the MaxSAT solver. 
To install a MaxSAT solver, we refer to instructions in [README](../README.md)." 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 2, 82 | "metadata": { 83 | "tags": [] 84 | }, 85 | "outputs": [], 86 | "source": [ 87 | "model = imli(rule_type=\"CNF\", num_clause=2, data_fidelity=10, weight_feature=1, timeout=100, solver=\"open-wbo\", work_dir=\".\", verbose=False)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "### Load dataset\n", 95 | "In this example, we learn a decision rule on `Iris` dataset. While the original dataset is used for multiclass classification, we modify it for binary classification. Our objective is to learn a decision rule that separates `Iris Versicolour` from other two classes of Iris: `Iris Setosa` and `Iris Virginica`. \n", 96 | "\n", 97 | "Our framework requires the training set to be discretized. In the following, we apply entropy-based discretization on the dataset. Alternatively, one can use already discretized dataset as a numpy object (or 2D list). To get the classification rule, `features` list has to be provided." 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 3, 103 | "metadata": { 104 | "tags": [] 105 | }, 106 | "outputs": [ 107 | { 108 | "data": { 109 | "text/plain": [ 110 | "(['sepal length < 5.45',\n", 111 | " 'sepal length = (5.45 - 7.05)',\n", 112 | " 'sepal length >= 7.05',\n", 113 | " 'sepal width < 2.95',\n", 114 | " 'sepal width >= 2.95',\n", 115 | " 'petal length < 2.45',\n", 116 | " 'petal length = (2.45 - 4.75)',\n", 117 | " 'petal length >= 4.75',\n", 118 | " 'petal width < 0.8',\n", 119 | " 'petal width = (0.8 - 1.75)',\n", 120 | " 'petal width >= 1.75'],\n", 121 | " array([[1., 0., 0., ..., 1., 0., 0.],\n", 122 | " [1., 0., 0., ..., 1., 0., 0.],\n", 123 | " [1., 0., 0., ..., 1., 0., 0.],\n", 124 | " ...,\n", 125 | " [0., 1., 0., ..., 0., 0., 1.],\n", 126 | " [0., 1., 0., ..., 0., 0., 1.],\n", 127 | " [0., 1., 0., ..., 0., 0., 1.]]))" 128 | ] 129 | }, 130 | "execution_count": 3, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "X, y, features = utils.discretize_orange(\"../benchmarks/iris_orange.csv\")\n", 137 | "features, X" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "### Split dataset into train and test set" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 4, 150 | "metadata": {}, 151 | "outputs": [], 152 | "source": [ 153 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "### Train the model" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 5, 166 | "metadata": { 167 | "tags": [] 168 | }, 169 | "outputs": [], 170 | "source": [ 171 | "model.fit(X_train,y_train)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### Report performance of the learned rule" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 6, 184 | "metadata": { 185 | "tags": [] 186 | }, 187 | "outputs": [ 188 | { 189 | "name": "stdout", 190 | "output_type": "stream", 191 | "text": [ 192 | "training report: \n", 193 | " precision recall f1-score support\n", 194 | "\n", 195 | " 0 0.98 0.95 0.97 65\n", 196 | " 1 0.92 0.97 0.94 35\n", 197 | "\n", 198 | " accuracy 0.96 100\n", 199 | " macro avg 
0.95 0.96 0.96 100\n", 200 | "weighted avg 0.96 0.96 0.96 100\n", 201 | "\n", 202 | "\n", 203 | "test report: \n", 204 | " precision recall f1-score support\n", 205 | "\n", 206 | " 0 1.00 0.97 0.99 35\n", 207 | " 1 0.94 1.00 0.97 15\n", 208 | "\n", 209 | " accuracy 0.98 50\n", 210 | " macro avg 0.97 0.99 0.98 50\n", 211 | "weighted avg 0.98 0.98 0.98 50\n", 212 | "\n" 213 | ] 214 | } 215 | ], 216 | "source": [ 217 | "print(\"training report: \")\n", 218 | "print(classification_report(y_train, model.predict(X_train), target_names=['0','1']))\n", 219 | "print()\n", 220 | "print(\"test report: \")\n", 221 | "print(classification_report(y_test, model.predict(X_test), target_names=['0','1']))\n" 222 | ] 223 | }, 224 | { 225 | "cell_type": "markdown", 226 | "metadata": {}, 227 | "source": [ 228 | "### Show the learned rule" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 7, 234 | "metadata": { 235 | "tags": [] 236 | }, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "Learned rule is: \n", 243 | "\n", 244 | "An Iris flower is predicted as Iris Versicolor if\n", 245 | "petal width = (0.8 - 1.75) AND\n", 246 | "not sepal length >= 7.05\n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "rule = model.get_rule(features)\n", 252 | "print(\"Learned rule is: \\n\")\n", 253 | "print(\"An Iris flower is predicted as Iris Versicolor if\")\n", 254 | "print(rule)" 255 | ] 256 | }, 257 | { 258 | "cell_type": "markdown", 259 | "metadata": {}, 260 | "source": [ 261 | "# 2. Learn decision rules as DNF\n", 262 | "\n", 263 | "To learn a decision rule as a DNF (ORs of ANDs of input features), we specify `rule_type=DNF` in the hyper-parameters of the model. In the following, we learn a 2-clause DNF decision rule. 
" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 8, 269 | "metadata": { 270 | "tags": [] 271 | }, 272 | "outputs": [ 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | "training report: \n", 278 | " precision recall f1-score support\n", 279 | "\n", 280 | " 0 0.98 0.95 0.97 65\n", 281 | " 1 0.92 0.97 0.94 35\n", 282 | "\n", 283 | " accuracy 0.96 100\n", 284 | " macro avg 0.95 0.96 0.96 100\n", 285 | "weighted avg 0.96 0.96 0.96 100\n", 286 | "\n", 287 | "\n", 288 | "test report: \n", 289 | " precision recall f1-score support\n", 290 | "\n", 291 | " 0 1.00 0.97 0.99 35\n", 292 | " 1 0.94 1.00 0.97 15\n", 293 | "\n", 294 | " accuracy 0.98 50\n", 295 | " macro avg 0.97 0.99 0.98 50\n", 296 | "weighted avg 0.98 0.98 0.98 50\n", 297 | "\n", 298 | "\n", 299 | "Rule:->\n", 300 | "petal width = (0.8 - 1.75) AND not sepal length >= 7.05 OR\n", 301 | "petal length = (2.45 - 4.75)\n", 302 | "\n", 303 | "Original features:\n", 304 | "['sepal length < 5.45', 'sepal length = (5.45 - 7.05)', 'sepal length >= 7.05', 'sepal width < 2.95', 'sepal width >= 2.95', 'petal length < 2.45', 'petal length = (2.45 - 4.75)', 'petal length >= 4.75', 'petal width < 0.8', 'petal width = (0.8 - 1.75)', 'petal width >= 1.75']\n", 305 | "\n", 306 | "In the learned rule, show original index in the feature list with phase (1: original, -1: complemented)\n", 307 | "[[(2, 1), (9, -1)], [(6, -1)]]\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "model = imli(rule_type=\"DNF\", num_clause=2, data_fidelity=10, solver=\"open-wbo\", work_dir=\".\", verbose=False)\n", 313 | "model.fit(X_train,y_train)\n", 314 | "print(\"training report: \")\n", 315 | "print(classification_report(y_train, model.predict(X_train), target_names=['0','1']))\n", 316 | "print()\n", 317 | "print(\"test report: \")\n", 318 | "print(classification_report(y_test, model.predict(X_test), target_names=['0','1']))\n", 319 | "\n", 320 | "print(\"\\nRule:->\")\n", 321 | "print(model.get_rule(features))\n", 322 | "print(\"\\nOriginal features:\")\n", 323 | "print(features)\n", 324 | "print(\"\\nIn the learned rule, show original index in the feature list with phase (1: original, -1: complemented)\")\n", 325 | "print(model.get_selected_column_index())" 326 | ] 327 | }, 328 | { 329 | "cell_type": "markdown", 330 | "metadata": {}, 331 | "source": [ 332 | "# 3. Learn more expressible decision rules: Relaxed-CNF rules\n", 333 | "\n", 334 | "Our framework allows one to learn more expressible decision rules, which we call relaxed_CNF rules. This rule allows thresholds on satisfaction of clauses and literals and can learn more complex decision boundaries. See the [ECAI-2020](https://bishwamittra.github.io/publication/ecai_2020/paper.pdf) paper for more details. \n", 335 | "\n", 336 | "\n", 337 | "In our framework, set the `rule_type=relaxed_CNF` to learn relaxed-CNF rules." 
338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 9, 343 | "metadata": { 344 | "tags": [] 345 | }, 346 | "outputs": [ 347 | { 348 | "name": "stdout", 349 | "output_type": "stream", 350 | "text": [ 351 | "training report: \n", 352 | " precision recall f1-score support\n", 353 | "\n", 354 | " 0 0.98 0.95 0.97 65\n", 355 | " 1 0.92 0.97 0.94 35\n", 356 | "\n", 357 | " accuracy 0.96 100\n", 358 | " macro avg 0.95 0.96 0.96 100\n", 359 | "weighted avg 0.96 0.96 0.96 100\n", 360 | "\n", 361 | "\n", 362 | "test report: \n", 363 | " precision recall f1-score support\n", 364 | "\n", 365 | " 0 1.00 0.97 0.99 35\n", 366 | " 1 0.94 1.00 0.97 15\n", 367 | "\n", 368 | " accuracy 0.98 50\n", 369 | " macro avg 0.97 0.99 0.98 50\n", 370 | "weighted avg 0.98 0.98 0.98 50\n", 371 | "\n" 372 | ] 373 | } 374 | ], 375 | "source": [ 376 | "model = imli(rule_type=\"relaxed_CNF\", num_clause=2, data_fidelity=10, solver=\"cplex\", work_dir=\".\", verbose=False)\n", 377 | "model.fit(X_train,y_train)\n", 378 | "print(\"training report: \")\n", 379 | "print(classification_report(y_train, model.predict(X_train), target_names=['0','1']))\n", 380 | "print()\n", 381 | "print(\"test report: \")\n", 382 | "print(classification_report(y_test, model.predict(X_test), target_names=['0','1']))" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "### Understanding the decision rule\n", 390 | "\n", 391 | "In this example, we ask the framework to learn a 2 clause rule. During training, we learn the thresholds on clauses and literals while fitting the dataset. The learned rule operates in two levels. In the first level, a clause is satisfied if the literals in the clause satisfy the learned threshold on literals. In the second level, the formula is satisfied when the threshold on clauses is satisfied." 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 10, 397 | "metadata": { 398 | "tags": [] 399 | }, 400 | "outputs": [ 401 | { 402 | "name": "stdout", 403 | "output_type": "stream", 404 | "text": [ 405 | "Learned rule is: \n", 406 | "\n", 407 | "An Iris flower is predicted as Iris Versicolor if\n", 408 | "[ ( petal width = (0.8 - 1.75) )>= 1 ] +\n", 409 | "[ ( not sepal length >= 7.05 )>= 1 ] >= 2\n", 410 | "\n", 411 | "Threhosld on clause: 2\n", 412 | "Threshold on literals: (this is a list where entries denote threholds on literals on all clauses)\n", 413 | "[1, 1]\n" 414 | ] 415 | } 416 | ], 417 | "source": [ 418 | "rule = model.get_rule(features)\n", 419 | "print(\"Learned rule is: \\n\")\n", 420 | "print(\"An Iris flower is predicted as Iris Versicolor if\")\n", 421 | "print(rule)\n", 422 | "print(\"\\nThrehosld on clause:\", model.get_threshold_clause())\n", 423 | "print(\"Threshold on literals: (this is a list where entries denote threholds on literals on all clauses)\")\n", 424 | "print(model.get_threshold_literal())" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "# 4. 
Learn decision rules as decision sets and lists\n", 432 | "\n", 433 | "### Decision sets" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 11, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "name": "stdout", 443 | "output_type": "stream", 444 | "text": [ 445 | "\n", 446 | "Rule:->\n", 447 | "If not petal width = (0.8 - 1.75): class = 0\n", 448 | "If petal width = (0.8 - 1.75) AND not sepal length >= 7.05: class = 1\n", 449 | "If sepal length >= 7.05 AND petal width = (0.8 - 1.75): class = 0\n", 450 | "If sepal length >= 7.05 AND petal width < 0.8: class = 0\n", 451 | "Else : class = 0\n", 452 | "\n", 453 | "training report: \n", 454 | " precision recall f1-score support\n", 455 | "\n", 456 | " 0 0.98 0.95 0.97 65\n", 457 | " 1 0.92 0.97 0.94 35\n", 458 | "\n", 459 | " accuracy 0.96 100\n", 460 | " macro avg 0.95 0.96 0.96 100\n", 461 | "weighted avg 0.96 0.96 0.96 100\n", 462 | "\n", 463 | "\n", 464 | "test report: \n", 465 | " precision recall f1-score support\n", 466 | "\n", 467 | " 0 1.00 0.97 0.99 35\n", 468 | " 1 0.94 1.00 0.97 15\n", 469 | "\n", 470 | " accuracy 0.98 50\n", 471 | " macro avg 0.97 0.99 0.98 50\n", 472 | "weighted avg 0.98 0.98 0.98 50\n", 473 | "\n" 474 | ] 475 | } 476 | ], 477 | "source": [ 478 | "model = imli(rule_type=\"decision sets\", num_clause=5, data_fidelity=10, solver=\"open-wbo\", work_dir=\".\", verbose=False)\n", 479 | "model.fit(X_train,y_train)\n", 480 | "\n", 481 | "print(\"\\nRule:->\")\n", 482 | "print(model.get_rule(features))\n", 483 | "\n", 484 | "\n", 485 | "print(\"\\ntraining report: \")\n", 486 | "print(classification_report(y_train, model.predict(X_train), target_names=['0','1']))\n", 487 | "print()\n", 488 | "print(\"test report: \")\n", 489 | "print(classification_report(y_test, model.predict(X_test), target_names=['0','1']))\n", 490 | "\n" 491 | ] 492 | }, 493 | { 494 | "cell_type": "markdown", 495 | "metadata": {}, 496 | "source": [ 497 | "### Decision lists" 498 | ] 499 | }, 500 | { 501 | "cell_type": "code", 502 | "execution_count": 12, 503 | "metadata": {}, 504 | "outputs": [ 505 | { 506 | "name": "stdout", 507 | "output_type": "stream", 508 | "text": [ 509 | "\n", 510 | "Rule:->\n", 511 | "If not petal width = (0.8 - 1.75): class = 0\n", 512 | "Else if not sepal length >= 7.05: class = 1\n", 513 | "Else: class = 0\n", 514 | "\n", 515 | "training report: \n", 516 | " precision recall f1-score support\n", 517 | "\n", 518 | " 0 0.98 0.95 0.97 65\n", 519 | " 1 0.92 0.97 0.94 35\n", 520 | "\n", 521 | " accuracy 0.96 100\n", 522 | " macro avg 0.95 0.96 0.96 100\n", 523 | "weighted avg 0.96 0.96 0.96 100\n", 524 | "\n", 525 | "\n", 526 | "test report: \n", 527 | " precision recall f1-score support\n", 528 | "\n", 529 | " 0 1.00 0.97 0.99 35\n", 530 | " 1 0.94 1.00 0.97 15\n", 531 | "\n", 532 | " accuracy 0.98 50\n", 533 | " macro avg 0.97 0.99 0.98 50\n", 534 | "weighted avg 0.98 0.98 0.98 50\n", 535 | "\n" 536 | ] 537 | } 538 | ], 539 | "source": [ 540 | "model = imli(rule_type=\"decision lists\", num_clause=5, data_fidelity=10, solver=\"open-wbo\", work_dir=\".\", verbose=False)\n", 541 | "model.fit(X_train,y_train)\n", 542 | "\n", 543 | "print(\"\\nRule:->\")\n", 544 | "print(model.get_rule(features))\n", 545 | "\n", 546 | "\n", 547 | "print(\"\\ntraining report: \")\n", 548 | "print(classification_report(y_train, model.predict(X_train), target_names=['0','1']))\n", 549 | "print()\n", 550 | "print(\"test report: \")\n", 551 | "print(classification_report(y_test, model.predict(X_test), 
target_names=['0','1']))\n", 552 | "\n" 553 | ] 554 | } 555 | ], 556 | "metadata": { 557 | "kernelspec": { 558 | "display_name": "Python 3", 559 | "language": "python", 560 | "name": "python3" 561 | }, 562 | "language_info": { 563 | "codemirror_mode": { 564 | "name": "ipython", 565 | "version": 3 566 | }, 567 | "file_extension": ".py", 568 | "mimetype": "text/x-python", 569 | "name": "python", 570 | "nbconvert_exporter": "python", 571 | "pygments_lexer": "ipython3", 572 | "version": "3.7.6" 573 | }, 574 | "orig_nbformat": 2 575 | }, 576 | "nbformat": 4, 577 | "nbformat_minor": 2 578 | } 579 | -------------------------------------------------------------------------------- /doc/documentation.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | # # 1. Learn Binary classification rules 5 | # 6 | # This tutorial shows how to learn classification rules using MaxSAT-based incremental learning framework, IMLI. We show how to learn five popular classification rules under the same framework. 7 | # 8 | # - CNF rules (Conjunctive Normal Form) 9 | # - DNF rules (Disjunctive Normal Form) 10 | # - Decision sets 11 | # - Decision lists 12 | # - relaxed-CNF rules 13 | 14 | # In[1]: 15 | 16 | 17 | import sys 18 | # sys.path.append("../") 19 | 20 | from pyrulelearn.imli import imli 21 | from pyrulelearn import utils 22 | from sklearn.metrics import confusion_matrix 23 | from sklearn.model_selection import train_test_split 24 | from sklearn.metrics import classification_report 25 | 26 | 27 | # In[7]: 28 | 29 | 30 | # Check if MaxSAT solver such as Open-WBO, MaxHS and MILP solver such as cplex is installed 31 | import os 32 | if(os.system("which open-wbo") != 0): 33 | print("Open-WBO is not installed") 34 | if(os.system("which maxhs") != 0): 35 | print("MaxHS is not installed") 36 | try: 37 | import cplex 38 | except Exception as e: 39 | print(e) 40 | 41 | 42 | # In[ ]: 43 | 44 | 45 | 46 | 47 | 48 | # ### Model Configuration 49 | # 50 | # Our first objective is to learn a classification rule in CNF, where the decision rule is ANDs of ORs of input features. For that, we specify `rule_type = CNF` inside the classification model `imli`. In this example, we learn a 2-clause rule with following hyper-parameters. 51 | # 52 | # - `rule_type` sets the type of classification rule. Other possible options are DNF, decision sets, decision lists, relaxed_CNF, 53 | # - `num_clause` decides the number of clauses in the classfication rule, 54 | # - `data_fidelity` decides the weight on classification error during training, 55 | # - `weight_feature` decides the weight of rule-complexity, that is, the cost of introducing a Boolean feature in the classifier rule, 56 | # 57 | # 58 | # We require a MaxSAT solver to learn the Boolean rule. In this example, we use `open-wbo` as the MaxSAT solver. To install a MaxSAT solver, we refer to instructions in [README](../README.md). 59 | 60 | # In[2]: 61 | 62 | 63 | model = imli(rule_type="CNF", num_clause=2, data_fidelity=10, weight_feature=1, timeout=100, solver="open-wbo", work_dir=".", verbose=False) 64 | 65 | 66 | # ### Load dataset 67 | # In this example, we learn a decision rule on `Iris` dataset. While the original dataset is used for multiclass classification, we modify it for binary classification. Our objective is to learn a decision rule that separates `Iris Versicolour` from other two classes of Iris: `Iris Setosa` and `Iris Virginica`. 
68 | # 69 | # Our framework requires the training set to be discretized. In the following, we apply entropy-based discretization on the dataset. Alternatively, one can use already discretized dataset as a numpy object (or 2D list). To get the classification rule, `features` list has to be provided. 70 | 71 | # In[3]: 72 | 73 | 74 | X, y, features = utils.discretize_orange("../benchmarks/iris_orange.csv") 75 | features, X 76 | 77 | 78 | # ### Split dataset into train and test set 79 | 80 | # In[4]: 81 | 82 | 83 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) 84 | 85 | 86 | # ### Train the model 87 | 88 | # In[5]: 89 | 90 | 91 | model.fit(X_train,y_train) 92 | 93 | 94 | # ### Report performance of the learned rule 95 | 96 | # In[6]: 97 | 98 | 99 | print("training report: ") 100 | print(classification_report(y_train, model.predict(X_train), target_names=['0','1'])) 101 | print() 102 | print("test report: ") 103 | print(classification_report(y_test, model.predict(X_test), target_names=['0','1'])) 104 | 105 | 106 | # ### Show the learned rule 107 | 108 | # In[7]: 109 | 110 | 111 | rule = model.get_rule(features) 112 | print("Learned rule is: \n") 113 | print("An Iris flower is predicted as Iris Versicolor if") 114 | print(rule) 115 | 116 | 117 | # # 2. Learn decision rules as DNF 118 | # 119 | # To learn a decision rule as a DNF (ORs of ANDs of input features), we specify `rule_type=DNF` in the hyper-parameters of the model. In the following, we learn a 2-clause DNF decision rule. 120 | 121 | # In[8]: 122 | 123 | 124 | model = imli(rule_type="DNF", num_clause=2, data_fidelity=10, solver="open-wbo", work_dir=".", verbose=False) 125 | model.fit(X_train,y_train) 126 | print("training report: ") 127 | print(classification_report(y_train, model.predict(X_train), target_names=['0','1'])) 128 | print() 129 | print("test report: ") 130 | print(classification_report(y_test, model.predict(X_test), target_names=['0','1'])) 131 | 132 | print("\nRule:->") 133 | print(model.get_rule(features)) 134 | print("\nOriginal features:") 135 | print(features) 136 | print("\nIn the learned rule, show original index in the feature list with phase (1: original, -1: complemented)") 137 | print(model.get_selected_column_index()) 138 | 139 | 140 | # # 3. Learn more expressible decision rules: Relaxed-CNF rules 141 | # 142 | # Our framework allows one to learn more expressible decision rules, which we call relaxed_CNF rules. This rule allows thresholds on satisfaction of clauses and literals and can learn more complex decision boundaries. See the [ECAI-2020](https://bishwamittra.github.io/publication/ecai_2020/paper.pdf) paper for more details. 143 | # 144 | # 145 | # In our framework, set the `rule_type=relaxed_CNF` to learn relaxed-CNF rules. 146 | 147 | # In[9]: 148 | 149 | 150 | model = imli(rule_type="relaxed_CNF", num_clause=2, data_fidelity=10, solver="cplex", work_dir=".", verbose=False) 151 | model.fit(X_train,y_train) 152 | print("training report: ") 153 | print(classification_report(y_train, model.predict(X_train), target_names=['0','1'])) 154 | print() 155 | print("test report: ") 156 | print(classification_report(y_test, model.predict(X_test), target_names=['0','1'])) 157 | 158 | 159 | # ### Understanding the decision rule 160 | # 161 | # In this example, we ask the framework to learn a 2 clause rule. During training, we learn the thresholds on clauses and literals while fitting the dataset. The learned rule operates in two levels. 
In the first level, a clause is satisfied if the literals in the clause satisfy the learned threshold on literals. In the second level, the formula is satisfied when the threshold on clauses is satisfied. 162 | 163 | # In[10]: 164 | 165 | 166 | rule = model.get_rule(features) 167 | print("Learned rule is: \n") 168 | print("An Iris flower is predicted as Iris Versicolor if") 169 | print(rule) 170 | print("\nThrehosld on clause:", model.get_threshold_clause()) 171 | print("Threshold on literals: (this is a list where entries denote threholds on literals on all clauses)") 172 | print(model.get_threshold_literal()) 173 | 174 | 175 | # # 4. Learn decision rules as decision sets and lists 176 | # 177 | # ### Decision sets 178 | 179 | # In[11]: 180 | 181 | 182 | model = imli(rule_type="decision sets", num_clause=5, data_fidelity=10, solver="open-wbo", work_dir=".", verbose=False) 183 | model.fit(X_train,y_train) 184 | 185 | print("\nRule:->") 186 | print(model.get_rule(features)) 187 | 188 | 189 | print("\ntraining report: ") 190 | print(classification_report(y_train, model.predict(X_train), target_names=['0','1'])) 191 | print() 192 | print("test report: ") 193 | print(classification_report(y_test, model.predict(X_test), target_names=['0','1'])) 194 | 195 | 196 | # ### Decision lists 197 | 198 | # In[12]: 199 | 200 | 201 | model = imli(rule_type="decision lists", num_clause=5, data_fidelity=10, solver="open-wbo", work_dir=".", verbose=False) 202 | model.fit(X_train,y_train) 203 | 204 | print("\nRule:->") 205 | print(model.get_rule(features)) 206 | 207 | 208 | print("\ntraining report: ") 209 | print(classification_report(y_train, model.predict(X_train), target_names=['0','1'])) 210 | print() 211 | print("test report: ") 212 | print(classification_report(y_test, model.predict(X_test), target_names=['0','1'])) 213 | 214 | -------------------------------------------------------------------------------- /pyrulelearn/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/meelgroup/MLIC/30edc0f41fca65eec48e4c1fc16a3c752cb97618/pyrulelearn/__init__.py -------------------------------------------------------------------------------- /pyrulelearn/cplex_wrap.py: -------------------------------------------------------------------------------- 1 | import cplex 2 | from time import time 3 | import pyrulelearn.utils 4 | 5 | def _call_cplex(imli, A, y): 6 | # A = pyrulelearn.utils._add_dummy_columns(A) 7 | 8 | no_features = -1 9 | no_samples = len(y) 10 | if(no_samples > 0): 11 | no_features = len(A[0]) 12 | else: 13 | print("- error: the dataset is corrupted, does not have sufficient samples") 14 | 15 | if (imli.verbose): 16 | print("- no of features: ", no_features) 17 | print("- no of samples : ", no_samples) 18 | 19 | # Establish the Linear Programming Model 20 | myProblem = cplex.Cplex() 21 | 22 | feature_variable = [] 23 | variable_list = [] 24 | objective_coefficient = [] 25 | variable_count = 0 26 | 27 | for eachLevel in range(imli.numClause): 28 | for i in range(no_features): 29 | feature_variable.append( 30 | "b_" + str(i + 1) + str("_") + str(eachLevel + 1)) 31 | 32 | variable_list = variable_list + feature_variable 33 | 34 | slack_variable = [] 35 | for i in range(no_samples): 36 | slack_variable.append("s_" + str(i + 1)) 37 | 38 | variable_list = variable_list + slack_variable 39 | 40 | if (imli.learn_threshold_clause): 41 | variable_list.append("eta_clause") 42 | 43 | if (imli.learn_threshold_literal): 44 | # consider 
different threshold when learning mode is on 45 | for eachLevel in range(imli.numClause): 46 | variable_list.append("eta_clit_"+str(eachLevel)) 47 | 48 | for i in range(len(y)): 49 | for eachLevel in range(imli.numClause): 50 | variable_list.append("ax_" + str(i + 1) + 51 | str("_") + str(eachLevel + 1)) 52 | 53 | myProblem.variables.add(names=variable_list) 54 | 55 | # encode the objective function: 56 | 57 | if(imli.verbose): 58 | print("- weight feature: ", imli.weightFeature) 59 | print("- weight error: ", imli.dataFidelity) 60 | 61 | if(imli.iterations == 1 or len(imli._assignList) == 0): # is called in the first iteration 62 | for eachLevel in range(imli.numClause): 63 | for i in range(no_features): 64 | objective_coefficient.append(imli.weightFeature) 65 | myProblem.variables.set_lower_bounds(variable_count, 0) 66 | myProblem.variables.set_upper_bounds(variable_count, 1) 67 | myProblem.variables.set_types( 68 | variable_count, myProblem.variables.type.continuous) 69 | myProblem.objective.set_linear( 70 | [(variable_count, objective_coefficient[variable_count])]) 71 | variable_count += 1 72 | else: 73 | for eachLevel in range(imli.numClause): 74 | for i in range(no_features): 75 | if (imli._assignList[eachLevel * no_features + i] > 0): 76 | objective_coefficient.append(-imli.weightFeature) 77 | else: 78 | objective_coefficient.append(imli.weightFeature) 79 | 80 | myProblem.variables.set_lower_bounds(variable_count, 0) 81 | myProblem.variables.set_upper_bounds(variable_count, 1) 82 | myProblem.variables.set_types(variable_count, myProblem.variables.type.continuous) 83 | myProblem.objective.set_linear([(variable_count, objective_coefficient[variable_count])]) 84 | variable_count += 1 85 | 86 | # slack_variable = [] 87 | for i in range(no_samples): 88 | objective_coefficient.append(imli.dataFidelity) 89 | myProblem.variables.set_types( 90 | variable_count, myProblem.variables.type.continuous) 91 | myProblem.variables.set_lower_bounds(variable_count, 0) 92 | myProblem.variables.set_upper_bounds(variable_count, 1) 93 | myProblem.objective.set_linear( 94 | [(variable_count, objective_coefficient[variable_count])]) 95 | variable_count += 1 96 | 97 | myProblem.objective.set_sense(myProblem.objective.sense.minimize) 98 | 99 | var_eta_clause = -1 100 | 101 | if (imli.learn_threshold_clause): 102 | myProblem.variables.set_types( 103 | variable_count, myProblem.variables.type.integer) 104 | myProblem.variables.set_lower_bounds(variable_count, 0) 105 | myProblem.variables.set_upper_bounds(variable_count, imli.numClause) 106 | var_eta_clause = variable_count 107 | variable_count += 1 108 | 109 | var_eta_literal = [-1 for eachLevel in range(imli.numClause)] 110 | constraint_count = 0 111 | 112 | if (imli.learn_threshold_literal): 113 | 114 | for eachLevel in range(imli.numClause): 115 | myProblem.variables.set_types( 116 | variable_count, myProblem.variables.type.integer) 117 | myProblem.variables.set_lower_bounds(variable_count, 0) 118 | myProblem.variables.set_upper_bounds(variable_count, no_features) 119 | var_eta_literal[eachLevel] = variable_count 120 | variable_count += 1 121 | 122 | constraint = [] 123 | 124 | for j in range(no_features): 125 | constraint.append(1) 126 | 127 | constraint.append(-1) 128 | 129 | myProblem.linear_constraints.add( 130 | lin_expr=[ 131 | cplex.SparsePair(ind=[eachLevel * no_features + j for j in range(no_features)] + [var_eta_literal[eachLevel]], 132 | val=constraint)], 133 | rhs=[0], 134 | names=["c" + str(constraint_count)], 135 | senses=["G"] 136 | ) 137 | 
constraint_count += 1 138 | 139 | for i in range(len(y)): 140 | if (y[i] == 1): 141 | 142 | auxiliary_index = [] 143 | 144 | for eachLevel in range(imli.numClause): 145 | constraint = [int(feature) for feature in A[i]] 146 | 147 | myProblem.variables.set_types( 148 | variable_count, myProblem.variables.type.integer) 149 | myProblem.variables.set_lower_bounds(variable_count, 0) 150 | myProblem.variables.set_upper_bounds(variable_count, 1) 151 | 152 | constraint.append(no_features) 153 | 154 | auxiliary_index.append(variable_count) 155 | 156 | if (imli.learn_threshold_literal): 157 | 158 | constraint.append(-1) 159 | 160 | myProblem.linear_constraints.add( 161 | lin_expr=[cplex.SparsePair( 162 | ind=[eachLevel * no_features + j for j in range(no_features)] + [variable_count, 163 | var_eta_literal[eachLevel]], 164 | val=constraint)], 165 | rhs=[0], 166 | names=["c" + str(constraint_count)], 167 | senses=["G"] 168 | ) 169 | 170 | constraint_count += 1 171 | 172 | else: 173 | 174 | myProblem.linear_constraints.add( 175 | lin_expr=[cplex.SparsePair( 176 | ind=[eachLevel * no_features + 177 | j for j in range(no_features)] + [variable_count], 178 | val=constraint)], 179 | rhs=[imli.threshold_literal], 180 | names=["c" + str(constraint_count)], 181 | senses=["G"] 182 | ) 183 | 184 | constraint_count += 1 185 | 186 | variable_count += 1 187 | 188 | if (imli.learn_threshold_clause): 189 | 190 | myProblem.linear_constraints.add( 191 | lin_expr=[cplex.SparsePair( 192 | ind=[i + imli.numClause * no_features, 193 | var_eta_clause] + auxiliary_index, 194 | # 1st slack variable = level * no_features 195 | val=[imli.numClause, -1] + [-1 for j in range(imli.numClause)])], 196 | rhs=[- imli.numClause], 197 | names=["c" + str(constraint_count)], 198 | senses=["G"] 199 | ) 200 | 201 | constraint_count += 1 202 | 203 | else: 204 | 205 | myProblem.linear_constraints.add( 206 | lin_expr=[cplex.SparsePair( 207 | # 1st slack variable = level * no_features 208 | ind=[i + imli.numClause * no_features] + auxiliary_index, 209 | val=[imli.numClause] + [-1 for j in range(imli.numClause)])], 210 | rhs=[- imli.numClause + imli.threshold_clause], 211 | names=["c" + str(constraint_count)], 212 | senses=["G"] 213 | ) 214 | 215 | constraint_count += 1 216 | 217 | else: 218 | 219 | auxiliary_index = [] 220 | 221 | for eachLevel in range(imli.numClause): 222 | constraint = [int(feature) for feature in A[i]] 223 | myProblem.variables.set_types( 224 | variable_count, myProblem.variables.type.integer) 225 | myProblem.variables.set_lower_bounds(variable_count, 0) 226 | myProblem.variables.set_upper_bounds(variable_count, 1) 227 | 228 | constraint.append(- no_features) 229 | 230 | auxiliary_index.append(variable_count) 231 | 232 | if (imli.learn_threshold_literal): 233 | 234 | constraint.append(-1) 235 | 236 | myProblem.linear_constraints.add( 237 | lin_expr=[cplex.SparsePair( 238 | ind=[eachLevel * no_features + j for j in range(no_features)] + [variable_count, 239 | var_eta_literal[eachLevel]], 240 | val=constraint)], 241 | rhs=[-1], 242 | names=["c" + str(constraint_count)], 243 | senses=["L"] 244 | ) 245 | 246 | constraint_count += 1 247 | else: 248 | 249 | myProblem.linear_constraints.add( 250 | lin_expr=[cplex.SparsePair( 251 | ind=[eachLevel * no_features + 252 | j for j in range(no_features)] + [variable_count], 253 | val=constraint)], 254 | rhs=[imli.threshold_literal - 1], 255 | names=["c" + str(constraint_count)], 256 | senses=["L"] 257 | ) 258 | 259 | constraint_count += 1 260 | 261 | variable_count += 1 262 | 263 | if 
(imli.learn_threshold_clause): 264 | 265 | myProblem.linear_constraints.add( 266 | lin_expr=[cplex.SparsePair( 267 | ind=[i + imli.numClause * no_features, 268 | var_eta_clause] + auxiliary_index, 269 | # 1st slack variable = level * no_features 270 | val=[imli.numClause, 1] + [-1 for j in range(imli.numClause)])], 271 | rhs=[1], 272 | names=["c" + str(constraint_count)], 273 | senses=["G"] 274 | ) 275 | 276 | constraint_count += 1 277 | 278 | else: 279 | 280 | myProblem.linear_constraints.add( 281 | lin_expr=[cplex.SparsePair( 282 | # 1st slack variable = level * no_features 283 | ind=[i + imli.numClause * no_features] + auxiliary_index, 284 | val=[imli.numClause] + [-1 for j in range(imli.numClause)])], 285 | rhs=[- imli.threshold_clause + 1], 286 | names=["c" + str(constraint_count)], 287 | senses=["G"] 288 | ) 289 | 290 | constraint_count += 1 291 | 292 | # set parameters 293 | if(imli.verbose): 294 | print("- timelimit for solver: ", imli.timeOut - time() + imli._fit_start_time) 295 | myProblem.parameters.clocktype.set(1) # cpu time (exact time) 296 | myProblem.parameters.timelimit.set(imli.timeOut - time() + imli._fit_start_time) 297 | myProblem.parameters.workmem.set(imli.memlimit) 298 | myProblem.set_log_stream(None) 299 | myProblem.set_error_stream(None) 300 | myProblem.set_warning_stream(None) 301 | myProblem.set_results_stream(None) 302 | # myProblem.parameters.mip.tolerances.mipgap.set(0.2) 303 | myProblem.parameters.mip.limits.treememory.set(imli.memlimit) 304 | myProblem.parameters.workdir.set(imli.workDir) 305 | myProblem.parameters.mip.strategy.file.set(2) 306 | myProblem.parameters.threads.set(1) 307 | 308 | # Solve the model and print the answer 309 | start_time = myProblem.get_time() 310 | start_det_time = myProblem.get_dettime() 311 | myProblem.solve() 312 | # solution.get_status() returns an integer code 313 | status = myProblem.solution.get_status() 314 | 315 | end_det_time = myProblem.get_dettime() 316 | 317 | end_time = myProblem.get_time() 318 | if (imli.verbose): 319 | print("- Total solve time (sec.):", end_time - start_time) 320 | print("- Total solve dettime (sec.):", end_det_time - start_det_time) 321 | 322 | print("- Solution status = ", myProblem.solution.status[status]) 323 | print("- Objective value = ", myProblem.solution.get_objective_value()) 324 | print("- mip relative gap (should be zero):", myProblem.solution.MIP.get_mip_relative_gap()) 325 | 326 | # retrieve solution: do rounding 327 | 328 | imli._assignList = [] 329 | imli._selectedFeatureIndex = [] 330 | # if(imli.verbose): 331 | # print(" - selected feature index") 332 | for i in range(len(feature_variable)): 333 | if(myProblem.solution.get_values(feature_variable[i]) > 0): 334 | imli._assignList.append(1) 335 | imli._selectedFeatureIndex.append(i+1) 336 | else: 337 | imli._assignList.append(0) 338 | # imli._selectedFeatureIndex.append(i+1) 339 | # print(imli._selectedFeatureIndex) 340 | 341 | # imli._assignList.append(myProblem.solution.get_values(feature_variable[i])) 342 | 343 | for i in range(len(slack_variable)): 344 | imli._assignList.append(myProblem.solution.get_values(slack_variable[i])) 345 | 346 | # update parameters 347 | if (imli.learn_threshold_clause and imli.learn_threshold_literal): 348 | 349 | imli.threshold_literal_learned = [int(myProblem.solution.get_values(var_eta_literal[eachLevel])) for eachLevel in range(imli.numClause)] 350 | imli.threshold_clause_learned = int(myProblem.solution.get_values(var_eta_clause)) 351 | 352 | elif (imli.learn_threshold_clause): 353 | 
imli.threshold_literal_learned = [imli.threshold_literal for eachLevel in range(imli.numClause)] 354 | imli.threshold_clause_learned = int(myProblem.solution.get_values(var_eta_clause)) 355 | 356 | elif (imli.learn_threshold_literal): 357 | imli.threshold_literal_learned = [int(myProblem.solution.get_values(var_eta_literal[eachLevel])) for eachLevel in range(imli.numClause)] 358 | imli.threshold_clause_learned = imli.threshold_clause 359 | 360 | if(imli.verbose): 361 | print("- cplex returned the solution") 362 | -------------------------------------------------------------------------------- /pyrulelearn/imli.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Contact: Bishwamittra Ghosh [email: bghosh@u.nus.edu] 4 | 5 | import numpy as np 6 | import pandas as pd 7 | import warnings 8 | import math 9 | import random 10 | from tqdm import tqdm 11 | from time import time 12 | # warnings.simplefilter(action='ignore', category=FutureWarning) 13 | from sklearn.metrics import classification_report, accuracy_score 14 | 15 | 16 | 17 | 18 | 19 | # from pyrulelearn 20 | import pyrulelearn.utils 21 | import pyrulelearn.cplex_wrap 22 | import pyrulelearn.maxsat_wrap 23 | 24 | 25 | 26 | class imli(): 27 | def __init__(self, num_clause=5, data_fidelity=1, weight_feature=1, threshold_literal=-1, threshold_clause=-1, 28 | solver="open-wbo", rule_type="CNF", batchsize=400, 29 | work_dir=".", timeout=100, verbose=False): 30 | ''' 31 | 32 | :param numBatch: no of Batchs of training dataset 33 | :param numClause: no of clause in the formula 34 | :param dataFidelity: weight corresponding to accuracy 35 | :param weightFeature: weight corresponding to selected features 36 | :param solver: specify the (name of the) bin of the solver; bin must be in the path 37 | :param ruleType: type of rule {CNF,DNF} 38 | :param workDir: working directory 39 | :param verbose: True for debug 40 | 41 | --- more are added later 42 | 43 | ''' 44 | 45 | # assert 0 <= batchsize and batchsize <= 1 46 | assert isinstance(batchsize, int) 47 | assert isinstance(data_fidelity, int) 48 | assert isinstance(weight_feature, int) 49 | assert isinstance(num_clause, int) 50 | assert isinstance(threshold_clause, int) 51 | assert isinstance(threshold_clause, int) 52 | 53 | 54 | 55 | self.numClause = num_clause 56 | self.dataFidelity = data_fidelity 57 | self.weightFeature = weight_feature 58 | self.solver = solver 59 | self.ruleType = rule_type 60 | self.workDir = work_dir 61 | self.verbose = verbose 62 | self._selectedFeatureIndex = [] 63 | self.timeOut = timeout 64 | self.memlimit = 1000*16 65 | self.learn_threshold_literal = False 66 | self.learn_threshold_clause = False 67 | self.threshold_literal = threshold_literal 68 | self.threshold_clause = threshold_clause 69 | self.batchsize = batchsize 70 | self._solver_time = 0 71 | self._prediction_time = 0 72 | self._wcnf_generation_time = 0 73 | self._demo_time = 0 74 | 75 | 76 | 77 | 78 | 79 | 80 | if(self.ruleType == "relaxed_CNF"): 81 | self.solver = "cplex" # this is the default solver for learning rules in relaxed_CNFs 82 | 83 | 84 | def __repr__(self): 85 | print("\n\nIMLI:->") 86 | return '\n'.join(" - %s: %s" % (item, value) for (item, value) in vars(self).items() if "_" not in item) 87 | 88 | def _get_selected_column_index(self): 89 | return_list = [[] for i in range(self.numClause)] 90 | 91 | for elem in self._selectedFeatureIndex: 92 | new_index = int(elem)-1 93 | return_list[int(new_index/self.numFeatures)].append(new_index % 
self.numFeatures) 94 | return return_list 95 | 96 | def get_selected_column_index(self): 97 | temp = self._get_selected_column_index() 98 | result = [] 99 | for index_list in temp: 100 | each_level_index = [] 101 | for index in index_list: 102 | phase = 1 103 | actual_feature_len = int(self.numFeatures/2) 104 | if(index >= actual_feature_len): 105 | index = index - actual_feature_len 106 | phase = -1 107 | each_level_index.append((index, phase)) 108 | result.append(each_level_index) 109 | 110 | return result 111 | 112 | 113 | 114 | 115 | def get_num_of_iterations(self): 116 | return self.iterations 117 | 118 | def get_num_of_clause(self): 119 | return self.numClause 120 | 121 | def get_weight_feature(self): 122 | return self.weightFeature 123 | 124 | def get_rule_type(self): 125 | return self.ruleType 126 | 127 | 128 | def get_work_dir(self): 129 | return self.workDir 130 | 131 | def get_weight_data_fidelity(self): 132 | return self.dataFidelity 133 | 134 | def get_solver(self): 135 | return self.solver 136 | 137 | def get_threshold_literal(self): 138 | return self.threshold_literal_learned 139 | 140 | def get_threshold_clause(self): 141 | return self.threshold_clause_learned 142 | 143 | def _fit_relaxed_CNF_old(self, XTrain, yTrain): 144 | 145 | 146 | if (self.threshold_clause == -1): 147 | self.learn_threshold_clause = True 148 | if (self.threshold_literal == -1): 149 | self.learn_threshold_literal = True 150 | 151 | self.iterations = int(math.ceil(XTrain.shape[0]/self.batchsize)) 152 | self.trainingSize = len(XTrain) 153 | self._assignList = [] 154 | 155 | # define weight (use usual regularization, nothing) 156 | # self.weight_feature = (1-self.lamda)/(self.level*len(self.column_names)) 157 | # self.weight_datafidelity = self.lamda/(self.trainingSize) 158 | 159 | # reorder X, y based on target class, when sampling is allowed 160 | # if(self.sampling): 161 | XTrain_pos = [] 162 | yTrain_pos = [] 163 | XTrain_neg = [] 164 | yTrain_neg = [] 165 | for i in range(self.trainingSize): 166 | if(yTrain[i] == 1): 167 | XTrain_pos.append(XTrain[i]) 168 | yTrain_pos.append(yTrain[i]) 169 | else: 170 | XTrain_neg.append(XTrain[i]) 171 | yTrain_neg.append(yTrain[i]) 172 | 173 | Xtrain = XTrain_pos + XTrain_neg 174 | ytrain = yTrain_pos + yTrain_neg 175 | 176 | for i in range(self.iterations): 177 | if(self.verbose): 178 | print("\n\n") 179 | print("sampling-based minibatch method called") 180 | 181 | print("iteration", i+1) 182 | XTrain_sampled, yTrain_sampled = pyrulelearn.utils._generateSamples(self, XTrain, yTrain) 183 | 184 | assert len(XTrain[0]) == len(XTrain_sampled[0]) 185 | 186 | pyrulelearn.cplex_wrap._call_cplex(self, np.array(XTrain_sampled), np.array(yTrain_sampled)) 187 | 188 | 189 | def _fit_relaxed_CNF(self, XTrain, yTrain): 190 | 191 | 192 | if (self.threshold_clause == -1): 193 | self.learn_threshold_clause = True 194 | if (self.threshold_literal == -1): 195 | self.learn_threshold_literal = True 196 | 197 | self.iterations = int(math.ceil(XTrain.shape[0]/self.batchsize)) 198 | 199 | best_loss = self.dataFidelity * XTrain.shape[0] + self.numFeatures * self.weightFeature * self.numClause 200 | self._assignList = [] 201 | best_loss_attribute = None 202 | num_outer_idx = 2 203 | for outer_idx in range(num_outer_idx): 204 | 205 | # time check 206 | if(time() - self._fit_start_time > self.timeOut): 207 | continue 208 | 209 | 210 | XTrains, yTrains = pyrulelearn.utils._numpy_partition(self, XTrain, yTrain) 211 | batch_order = None 212 | random_shuffle_batch = False 213 | 
if(random_shuffle_batch): 214 | batch_order = random.sample(range(self.iterations), self.iterations) 215 | else: 216 | batch_order = range(self.iterations) 217 | 218 | for each_batch in tqdm(batch_order, disable = not self.verbose): 219 | # time check 220 | if(time() - self._fit_start_time > self.timeOut): 221 | continue 222 | 223 | if(self.verbose): 224 | print("\nTraining started for batch: ", each_batch+1) 225 | pyrulelearn.cplex_wrap._call_cplex(self, XTrains[each_batch], yTrains[each_batch]) 226 | 227 | 228 | # performance 229 | yhat = self.predict(XTrain) 230 | acc = accuracy_score(yTrain, yhat) 231 | def _loss(acc, num_sample, rule_size): 232 | return (1-acc) * self.dataFidelity * num_sample + rule_size * self.weightFeature 233 | loss = _loss(acc, XTrain.shape[0], len(self._selectedFeatureIndex)) 234 | if(loss <= best_loss): 235 | # print() 236 | # print(acc, len(self._selectedFeatureIndex)) 237 | # print(loss) 238 | best_loss = loss 239 | best_loss_attribute = (self._xhat, self._selectedFeatureIndex, self._assignList, self.threshold_literal_learned, self.threshold_clause_learned) 240 | 241 | else: 242 | if(best_loss_attribute is not None): 243 | (self._xhat, self._selectedFeatureIndex, self._assignList, self.threshold_literal_learned, self.threshold_clause_learned) = best_loss_attribute 244 | 245 | 246 | if(self.iterations == 1): 247 | # When iteration = 1, training accuracy is optimized. So there is no point to iterate again 248 | break 249 | 250 | 251 | assert best_loss_attribute is not None 252 | self._xhat, self._selectedFeatureIndex, self._assignList, self.threshold_literal_learned, self.threshold_clause_learned = best_loss_attribute 253 | # print("Finally", self.threshold_literal_learned, self.threshold_clause_learned) 254 | # self._learn_parameter() # Not required for relaxed_CNF 255 | return 256 | 257 | 258 | def _fit_decision_sets(self, XTrain, yTrain): 259 | 260 | """ 261 | The idea is to learn a decision sets using an iterative appraoch. 262 | A decision set is a set of rule, class label pair where each rule is a conjunction of Boolean predicates (i.e., 263 | single clause DNF) 264 | In a decision set, each rule is independent (similar to If-then statements). 265 | The classification of new input is decided based on following three rules: 266 | 1. If the input satisfies only one rule, it is predicted the corresponding class label 267 | 2. If the input satisfies more than one rules, we consider a voting function. One simple voting function 268 | is to consider the majority class label out of all satisfied rules. 269 | 3. When the input does not satisfy any rule, there is a default class at the end of the decision sets. 270 | 271 | 272 | In the iterative approach, we want to learn a rule for a target class label in each iteration. 273 | Once a rule is learned, we will separate the training set into covered and uncovered. 274 | 275 | Covered set: These samples satisfy the rule. A sample can either be correctly covered or incorrectly covered. 276 | We consider following two cases: 277 | 278 | A. For a correctly covered sample, we want to specify constraints such that no other rule covers this sample. 279 | Because, in the best case, another rule(s) with same class label may cover this sample, which is desired. 280 | In the worst case, a rule(s) with different class label can cover this, which is not desired!!!! 281 | 282 | In both cases, the overlap between rules increases, but we want to decrease overlap. 283 | 284 | B. 
For an incorrectly covered sample, we want it to be correctly covered by another rule(s) and ask for 285 | the voting function to finally (!!) output the correct class. 286 | 287 | In this case, we want to increase overlap between rules but carefully. 288 | 289 | Uncovered set: These samples do not satisy the rule. Hence we initiate another iteration to learn a new rule that 290 | will hopefully cover more samples. 291 | 292 | 293 | As we have defined covered and uncovered samples, we next define how we choose the target class. 294 | 295 | Target class: this is a drawback(?) of IMLI that can learn a DNF formula for a fixed target class label. 296 | For now, we choose the majority class in the uncovered samples as the target class. 297 | 298 | We next discuss modifying the class labels of already covered samples, which constitutes the most critical contribution 299 | of this algorithm. 300 | 301 | ************************* 302 | ** Critical discussion ** 303 | ************************* 304 | 305 | As clear from point A and B, we have opposing goal of both increasing and decreasing overlap in order to increase 306 | the overall training accuracy 307 | 308 | ## First we tackle A (correctly covered). Let assume the target class for the current iteration is t (other choice is 1-t for binary classification) 309 | For a correctly covered sample with original class t, we modify the class as 1-t because we want to 310 | decrease overlap 311 | And for a correctly covered sample with original class 1-t, no modification is required 312 | 313 | Similar argument applies when target class is 1-t 314 | 315 | ## To tackle B (incorrectly covered), no modification of class labels is required. Because this samples are incorectly 316 | covered at least once. So we want to learn new rules that can cover them correctly. 317 | 318 | 319 | 320 | 321 | """ 322 | num_outer_idx = 2 323 | 324 | # know which class is majority 325 | majority = np.argmax(np.bincount(yTrain)) 326 | all_classes = np.unique(yTrain) # all classes in y 327 | 328 | # sample size is variable. 
Therefore we will always maintain this sample size as maximum sample size for all iterations 329 | sample_size = self.batchsize 330 | 331 | 332 | # Use MaxSAT-based rule learner 333 | ruleType_orig = self.ruleType 334 | self.ruleType = "DNF" 335 | # self.timeOut = int(self.timeOut/(num_outer_idx * self.numClause)) 336 | self.timeOut = float(self.timeOut/self.numClause) 337 | k = self.numClause 338 | self.numClause = 1 339 | self.clause_target = [] 340 | xhat_computed = [] 341 | selectedFeatureIndex_computed = [] 342 | verbose = self.verbose 343 | self.verbose = False 344 | 345 | XTrain_covered = np.zeros(shape=(0,XTrain.shape[1]), dtype=bool) 346 | yTrain_covered = np.zeros(shape=(0,), dtype=bool) 347 | 348 | time_statistics = [] 349 | # iteratively learn a DNF clause for 1, ..., k 350 | for idx in range(k): 351 | self._fit_start_time = time() 352 | 353 | # Trivial termination when there is no sample to classify 354 | if(len(yTrain) == 0): 355 | if(verbose): 356 | print("\nTerminating because training set is empty\n") 357 | break 358 | 359 | yTrain_orig = yTrain.copy() 360 | 361 | if(verbose): 362 | print("\n\n\n") 363 | print(idx) 364 | print("total samples:", len(yTrain)) 365 | print("positive samples:", yTrain.sum()) 366 | 367 | 368 | # decide target class, at this point, the problem reduces to binary classification 369 | target_class = np.argmax(np.bincount(yTrain)) 370 | self.clause_target.append(target_class) 371 | yTrain = (yTrain == target_class).astype(bool) 372 | yTrain_working = np.concatenate((yTrain, np.zeros(shape=yTrain_covered.shape, dtype=bool))) 373 | XTrain_working = np.concatenate((XTrain, XTrain_covered)) 374 | 375 | if(verbose): 376 | print("\nTarget class:", target_class) 377 | print("Including covered samples") 378 | print("total samples:", len(yTrain_working)) 379 | print("target samples:", int(yTrain_working.sum())) 380 | print("Time left:", self.timeOut - time() + self._fit_start_time) 381 | 382 | 383 | 384 | 385 | 386 | 387 | 388 | self.iterations = max(2**math.floor(math.log2(len(XTrain_working)/sample_size)),1) 389 | if(verbose): 390 | print("Iterations:", self.iterations) 391 | 392 | 393 | 394 | best_loss = self.dataFidelity * XTrain.shape[0] + self.numFeatures * self.weightFeature 395 | best_loss_attribute = None 396 | self._assignList = [] 397 | for outer_idx in range(num_outer_idx): 398 | 399 | # time check 400 | if(time() - self._fit_start_time > self.timeOut): 401 | continue 402 | 403 | 404 | """ 405 | Two heuristics: 406 | 1. random shuffle on batch (typically better performing) 407 | 2. 
without randomness 408 | """ 409 | XTrains, yTrains = pyrulelearn.utils._numpy_partition(self, XTrain_working, yTrain_working) 410 | 411 | batch_order = None 412 | random_shuffle_batch = False 413 | if(random_shuffle_batch): 414 | batch_order = random.sample(range(self.iterations), self.iterations) 415 | else: 416 | batch_order = range(self.iterations) 417 | 418 | for each_batch in tqdm(batch_order, disable = not verbose): 419 | 420 | # time check 421 | if(time() - self._fit_start_time > self.timeOut): 422 | continue 423 | 424 | 425 | if(self.verbose): 426 | print("\nTraining started for batch: ", each_batch+1) 427 | 428 | pyrulelearn.maxsat_wrap._learnModel(self, XTrains[each_batch], yTrains[each_batch], isTest=False) 429 | 430 | 431 | # performance 432 | self._learn_parameter() 433 | yhat = self.predict(XTrain) 434 | acc = accuracy_score(yTrain, yhat) 435 | def _loss(acc, num_sample, rule_size): 436 | return (1-acc) * self.dataFidelity * num_sample + rule_size * self.weightFeature 437 | loss = _loss(acc, XTrain.shape[0], len(self._selectedFeatureIndex)) 438 | if(loss <= best_loss): 439 | # print() 440 | # print(acc, len(self._selectedFeatureIndex), self.dataFidelity, self.weightFeature, XTrain.shape[0]) 441 | # print(loss) 442 | best_loss = loss 443 | best_loss_attribute = (self._xhat, self._selectedFeatureIndex, self._assignList) 444 | else: 445 | if(best_loss_attribute is not None): 446 | self._assignList = best_loss_attribute[2] 447 | 448 | # best_loss_attribute = (self._xhat, self._selectedFeatureIndex, self._assignList) 449 | 450 | 451 | 452 | if(self.iterations == 1): 453 | # When iteration = 1, training accuracy is optimized. So there is no point to iterate again 454 | break 455 | 456 | # print() 457 | if(verbose): 458 | print("Max loss:", best_loss) 459 | assert best_loss_attribute is not None 460 | 461 | 462 | 463 | self._xhat, self._selectedFeatureIndex, self._assignList = best_loss_attribute 464 | # print("Best:", best_loss) 465 | 466 | 467 | self._learn_parameter() 468 | 469 | 470 | 471 | yhat = self.predict(XTrain) 472 | 473 | """ 474 | Decision sets is a list of independent itemsets ( or list of DNF clauses). 475 | If yhat matches both the clause_target and ytrain_orig, then the sample is covered by the DNF clause 476 | and is also perfectly classified. So the rest of the samples are considered in the next iteration. 
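To make the coverage bookkeeping described above concrete, here is a minimal sketch (toy arrays, not the class's internal state) of how the mask computed a few lines below, `mask = (yhat == 0) | (yhat != yTrain)`, separates correctly covered samples from the ones passed on to the next iteration.

```
import numpy as np

# Toy illustration of the coverage mask; all values are made up for the example.
X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=bool)
y = np.array([1, 0, 1, 0], dtype=bool)       # labels after binarising against the target class
yhat = np.array([1, 1, 0, 0], dtype=bool)    # predictions of the current single-clause rule

# uncovered (yhat == 0) or incorrectly covered (yhat != y) samples move on to the next rule
mask = (yhat == 0) | (yhat != y)

X_covered, y_covered = X[~mask], y[~mask]    # correctly covered: only sample 0
X_next, y_next = X[mask], y[mask]            # samples 1, 2, 3 remain for the next iteration
```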
477 | """ 478 | 479 | # Find incorrectly covered or uncovered samples 480 | mask = (yhat == 0) | (yhat != yTrain) 481 | 482 | # include covered samples 483 | XTrain_covered = np.concatenate((XTrain_covered, XTrain[~mask])) 484 | yTrain_covered = np.concatenate((yTrain_covered, yTrain_orig[~mask])) 485 | 486 | # extract uncovered and incorrectly covered samples 487 | XTrain = XTrain[mask] 488 | yTrain = yTrain_orig[mask] 489 | 490 | 491 | if(verbose): 492 | print("Coverage:", len(yTrain_orig[~mask]) , "samples") 493 | print("Of which, positive samples in original:", yTrain_orig[~mask].sum()) 494 | 495 | 496 | 497 | # If learned rule is empty, it can be discarded 498 | if(self._xhat[0].sum() == 0 or any(np.array_equal(np.array(x), self._xhat[0]) for x in xhat_computed)): 499 | if(len(self.clause_target) > 0): 500 | self.clause_target = self.clause_target[:-1] 501 | if(verbose): 502 | print("Terminating becuase current rule is empty or repeated") 503 | break 504 | # If no sample is removed, next iteration will generate same hypothesis, hence the process is terminated 505 | elif(len(yTrain_orig[~mask]) == 0): 506 | if(len(self.clause_target) > 0): 507 | self.clause_target = self.clause_target[:-1] 508 | if(verbose): 509 | print("Terminating becuase no new sample is removed by current rule") 510 | break 511 | else: 512 | xhat_computed.append(self._xhat[0]) 513 | selectedFeatureIndex_computed += [val + idx * self.numFeatures for val in self._selectedFeatureIndex] 514 | 515 | 516 | 517 | 518 | 519 | 520 | 521 | """ 522 | Default rule 523 | """ 524 | xhat_computed.append(np.zeros(self.numFeatures)) 525 | reach_once = False 526 | for each_class in all_classes: 527 | if(each_class not in self.clause_target): 528 | self.clause_target.append(each_class) 529 | reach_once = True 530 | break 531 | if(not reach_once): 532 | self.clause_target.append(majority) 533 | 534 | 535 | # Get back to initial values 536 | self.numClause = len(self.clause_target) 537 | self._xhat = xhat_computed 538 | self._selectedFeatureIndex = selectedFeatureIndex_computed 539 | self.ruleType = ruleType_orig 540 | self.verbose = verbose 541 | 542 | 543 | # parameters learned for rule 544 | self.threshold_literal_learned = [selected_columns.sum() for selected_columns in self._xhat] 545 | self.threshold_clause_learned = None 546 | 547 | # print(self.clause_target) 548 | # print(self._xhat) 549 | # print(self.threshold_clause_learned, self.threshold_literal_learned) 550 | 551 | def _fit_CNF_DNF_recursive(self, XTrain, yTrain): 552 | 553 | num_outer_idx = 2 554 | # sample size is variable. 
Therefore we will always maintain this sample size as maximum sample size for all iterations 555 | sample_size = self.batchsize 556 | 557 | 558 | # Use MaxSAT-based rule learner 559 | ruleType_orig = self.ruleType 560 | # self.ruleType = "DNF" 561 | # self.timeOut = int(self.timeOut/(num_outer_idx * self.numClause)) 562 | self.timeOut = float(self.timeOut/self.numClause) 563 | k = self.numClause 564 | self.numClause = 1 565 | self.clause_target = [] 566 | xhat_computed = [] 567 | selectedFeatureIndex_computed = [] 568 | verbose = self.verbose 569 | self.verbose = False 570 | 571 | 572 | # iteratively learn a DNF clause for 1, ..., k iterations 573 | for idx in range(k): 574 | self._fit_start_time = time() 575 | 576 | 577 | 578 | # yTrain_orig = yTrain.copy() 579 | 580 | # Trivial termination when there is no sample to classify 581 | if(len(yTrain) == 0): 582 | if(verbose): 583 | print("\nTerminating because training set is empty\n") 584 | break 585 | 586 | 587 | if(verbose): 588 | print("\n\n\n") 589 | print(idx) 590 | print("total samples:", len(yTrain)) 591 | print("Time left:", self.timeOut - time() + self._fit_start_time) 592 | 593 | 594 | 595 | 596 | 597 | # # decide target class, at this point, the problem reduces to binary classification 598 | # target_class = np.argmax(np.bincount(yTrain)) 599 | # self.clause_target.append(target_class) 600 | # yTrain = (yTrain == target_class).astype(int) 601 | 602 | 603 | self.iterations = max(2**math.floor(math.log2(len(XTrain)/sample_size)),1) 604 | if(verbose): 605 | print("Iterations:", self.iterations) 606 | 607 | 608 | 609 | best_loss = self.dataFidelity * XTrain.shape[0] + self.numFeatures * self.weightFeature 610 | best_loss_attribute = None 611 | self._assignList = [] 612 | for outer_idx in range(num_outer_idx): 613 | 614 | # time check 615 | if(time() - self._fit_start_time > self.timeOut): 616 | continue 617 | 618 | 619 | """ 620 | Two heuristics: 621 | 1. random shuffle on batch (typically better performing) 622 | 2. without randomness 623 | """ 624 | XTrains, yTrains = pyrulelearn.utils._numpy_partition(self, XTrain, yTrain) 625 | batch_order = None 626 | random_shuffle_batch = False 627 | if(random_shuffle_batch): 628 | batch_order = random.sample(range(self.iterations), self.iterations) 629 | else: 630 | batch_order = range(self.iterations) 631 | 632 | for each_batch in tqdm(batch_order, disable = not verbose): 633 | # time check 634 | if(time() - self._fit_start_time > self.timeOut): 635 | continue 636 | 637 | if(self.verbose): 638 | print("\nTraining started for batch: ", each_batch+1) 639 | pyrulelearn.maxsat_wrap._learnModel(self, XTrains[each_batch], yTrains[each_batch], isTest=False) 640 | 641 | # performance 642 | self._learn_parameter() 643 | yhat = self.predict(XTrain) 644 | acc = accuracy_score(yTrain, yhat) 645 | def _loss(acc, num_sample, rule_size): 646 | return (1-acc) * self.dataFidelity * num_sample + rule_size * self.weightFeature 647 | loss = _loss(acc, XTrain.shape[0], len(self._selectedFeatureIndex)) 648 | if(loss <= best_loss): 649 | # print() 650 | # print(acc, len(self._selectedFeatureIndex)) 651 | # print(loss) 652 | best_loss = loss 653 | best_loss_attribute = (self._xhat, self._selectedFeatureIndex, self._assignList) 654 | else: 655 | if(best_loss_attribute is not None): 656 | self._assignList = best_loss_attribute[2] 657 | 658 | if(self.iterations == 1): 659 | # When iteration = 1, training accuracy is optimized. 
So there is no point to iterate again 660 | break 661 | 662 | # print() 663 | 664 | assert best_loss_attribute is not None 665 | # print("Best accuracy:", best_loss*len(XTrain)) 666 | 667 | 668 | 669 | self._xhat, self._selectedFeatureIndex, self._assignList = best_loss_attribute 670 | # print("Best:", best_loss) 671 | 672 | 673 | 674 | 675 | # print(classification_report(yTrain, yhat, target_names=np.unique(yTrain).astype("str"))) 676 | # print(accuracy_score(yTrain, yhat)) 677 | 678 | 679 | # remove samples that are covered by current clause. 680 | # Depending on CNF/DNF, definition of coverage is different 681 | """ 682 | When yhat is different than 0 (i.e., does not satisfy the IF condition), it is still considered in the next iteration because 683 | the final rule is nested if-else. 684 | """ 685 | self._learn_parameter() 686 | yhat = self.predict(XTrain) 687 | if(self.ruleType == "CNF"): 688 | mask = (yhat == 1) 689 | else: 690 | mask = (yhat == 0) 691 | 692 | if(verbose): 693 | print("Coverage:", len(yTrain[~mask]) , "samples") 694 | 695 | 696 | 697 | 698 | # If learned rule is empty, it can be discarded, except this is the first clause 699 | if(self._xhat[0].sum() == 0 and len(xhat_computed) != 0): 700 | if(verbose): 701 | print(len(xhat_computed)) 702 | print("Terminating becuase current rule is empty") 703 | break 704 | else: 705 | # TODO reorder self_xhat 706 | xhat_computed.append(self._xhat[0]) 707 | selectedFeatureIndex_computed += [val + idx * self.numFeatures for val in self._selectedFeatureIndex] 708 | 709 | 710 | # If no sample is removed, next iteration will generate the same hypothesis, hence the process is terminated 711 | if(len(yTrain[~mask]) == 0): 712 | if(verbose): 713 | print("Terminating becuase no new sample is removed by current rule") 714 | break 715 | 716 | XTrain = XTrain[mask] 717 | yTrain = yTrain[mask] 718 | 719 | 720 | 721 | 722 | 723 | # Get back to initial configuration 724 | self.numClause = len(xhat_computed) 725 | self._xhat = np.array(xhat_computed) 726 | self._selectedFeatureIndex = selectedFeatureIndex_computed 727 | self.verbose = verbose 728 | 729 | 730 | # parameters learned for rule 731 | self._learn_parameter() 732 | 733 | 734 | 735 | def _fit_decision_lists(self, XTrain, yTrain): 736 | 737 | num_outer_idx = 2 738 | 739 | # know which class is majority 740 | majority = np.argmax(np.bincount(yTrain)) 741 | all_classes = np.unique(yTrain) # all classes in y 742 | 743 | # sample size is variable. 
Therefore we will always maintain this sample size as maximum sample size for all iterations 744 | sample_size = self.batchsize 745 | 746 | 747 | # Use MaxSAT-based rule learner 748 | ruleType_orig = self.ruleType 749 | self.ruleType = "DNF" 750 | # self.timeOut = int(self.timeOut/(num_outer_idx * self.numClause)) 751 | self.timeOut = float(self.timeOut/self.numClause) 752 | k = self.numClause 753 | self.numClause = 1 754 | self.clause_target = [] 755 | xhat_computed = [] 756 | selectedFeatureIndex_computed = [] 757 | verbose = self.verbose 758 | self.verbose = False 759 | 760 | 761 | # iteratively learn a DNF clause for 1, ..., k iterations 762 | for idx in range(k): 763 | self._fit_start_time = time() 764 | 765 | # Trivial termination when there is no sample to classify 766 | if(len(yTrain) == 0): 767 | if(verbose): 768 | print("\nTerminating because training set is empty\n") 769 | break 770 | 771 | 772 | yTrain_orig = yTrain.copy() 773 | 774 | if(verbose): 775 | print("\n\n\n") 776 | print(idx) 777 | print("total samples:", len(yTrain)) 778 | print("Time left:", self.timeOut - time() + self._fit_start_time) 779 | 780 | 781 | 782 | 783 | 784 | # decide target class, at this point, the problem reduces to binary classification 785 | target_class = np.argmax(np.bincount(yTrain)) 786 | self.clause_target.append(target_class) 787 | yTrain = (yTrain == target_class).astype(int) 788 | 789 | 790 | self.iterations = max(2**math.floor(math.log2(len(XTrain)/sample_size)),1) 791 | if(verbose): 792 | print("Iterations:", self.iterations) 793 | 794 | 795 | 796 | best_loss = self.dataFidelity * XTrain.shape[0] + self.numFeatures * self.weightFeature 797 | best_loss_attribute = None 798 | self._assignList = [] 799 | for outer_idx in range(num_outer_idx): 800 | 801 | # time check 802 | if(time() - self._fit_start_time > self.timeOut): 803 | continue 804 | 805 | 806 | """ 807 | Two heuristics: 808 | 1. random shuffle on batch (typically better performing) 809 | 2. without randomness 810 | """ 811 | XTrains, yTrains = pyrulelearn.utils._numpy_partition(self, XTrain, yTrain) 812 | batch_order = None 813 | random_shuffle_batch = False 814 | if(random_shuffle_batch): 815 | batch_order = random.sample(range(self.iterations), self.iterations) 816 | else: 817 | batch_order = range(self.iterations) 818 | 819 | for each_batch in tqdm(batch_order, disable = not verbose): 820 | # time check 821 | if(time() - self._fit_start_time > self.timeOut): 822 | continue 823 | 824 | if(self.verbose): 825 | print("\nTraining started for batch: ", each_batch+1) 826 | pyrulelearn.maxsat_wrap._learnModel(self, XTrains[each_batch], yTrains[each_batch], isTest=False) 827 | 828 | 829 | # performance 830 | self._learn_parameter() 831 | yhat = self.predict(XTrain) 832 | acc = accuracy_score(yTrain, yhat) 833 | def _loss(acc, num_sample, rule_size): 834 | return (1-acc) * self.dataFidelity * num_sample + rule_size * self.weightFeature 835 | loss = _loss(acc, XTrain.shape[0], len(self._selectedFeatureIndex)) 836 | if(loss <= best_loss): 837 | # print() 838 | # print(acc, len(self._selectedFeatureIndex)) 839 | # print(loss) 840 | best_loss = loss 841 | best_loss_attribute = (self._xhat, self._selectedFeatureIndex, self._assignList) 842 | else: 843 | if(best_loss_attribute is not None): 844 | self._assignList = best_loss_attribute[2] 845 | 846 | 847 | if(self.iterations == 1): 848 | # When iteration = 1, training accuracy is optimized. 
So there is no point to iterate again 849 | break 850 | 851 | # print() 852 | 853 | assert best_loss_attribute is not None 854 | # print("Best accuracy:", best_loss*len(XTrain)) 855 | 856 | 857 | 858 | self._xhat, self._selectedFeatureIndex, self._assignList = best_loss_attribute 859 | # print("Best:", best_loss) 860 | 861 | 862 | 863 | 864 | 865 | 866 | # print(classification_report(yTrain, yhat, target_names=np.unique(yTrain).astype("str"))) 867 | # print(accuracy_score(yTrain, yhat)) 868 | 869 | 870 | # remove samples that are covered by current DNF clause 871 | """ 872 | When yhat is different than 0 (i.e., does not satisfy the IF condition), it is still considered in the next iteration because 873 | the final rule is nested if-else. 874 | """ 875 | self._learn_parameter() 876 | yhat = self.predict(XTrain) 877 | mask = (yhat == 0) 878 | XTrain = XTrain[mask] 879 | yTrain = yTrain_orig[mask] 880 | if(verbose): 881 | print("Coverage:", len(yTrain_orig[~mask]) , "samples") 882 | 883 | # If learned rule is empty, it can be discarded 884 | if(self._xhat[0].sum() == 0 or any(np.array_equal(np.array(x), self._xhat[0]) for x in xhat_computed)): 885 | if(len(self.clause_target) > 0): 886 | self.clause_target = self.clause_target[:-1] 887 | if(verbose): 888 | print("Terminating becuase current rule is empty or repeated") 889 | break 890 | # If no sample is removed, next iteration will generate same hypothesis, hence the process is terminated 891 | elif(len(yTrain_orig[~mask]) == 0): 892 | if(len(self.clause_target) > 0): 893 | self.clause_target = self.clause_target[:-1] 894 | if(verbose): 895 | print("Terminating becuase no new sample is removed by current rule") 896 | break 897 | else: 898 | xhat_computed.append(self._xhat[0]) 899 | selectedFeatureIndex_computed += [val + idx * self.numFeatures for val in self._selectedFeatureIndex] 900 | 901 | 902 | 903 | """ 904 | Default rule 905 | """ 906 | xhat_computed.append(np.zeros(self.numFeatures)) 907 | reach_once = False 908 | for each_class in all_classes: 909 | if(each_class not in self.clause_target): 910 | self.clause_target.append(each_class) 911 | reach_once = True 912 | break 913 | if(not reach_once): 914 | self.clause_target.append(majority) 915 | 916 | 917 | # Get back to initial configuration 918 | self.numClause = len(self.clause_target) 919 | self._xhat = xhat_computed 920 | self.ruleType = ruleType_orig 921 | self.verbose = verbose 922 | self._selectedFeatureIndex = selectedFeatureIndex_computed 923 | 924 | 925 | # parameters learned for rule 926 | self.threshold_literal_learned = [selected_columns.sum() for selected_columns in self._xhat] 927 | self.threshold_clause_learned = None 928 | 929 | # print(self.clause_target) 930 | # print(self._xhat) 931 | # print(self.threshold_clause_learned, self.threshold_literal_learned) 932 | 933 | 934 | 935 | def fit(self, XTrain, yTrain, recursive=True): 936 | 937 | 938 | self._fit_mode = True 939 | 940 | self._fit_start_time = time() 941 | XTrain = pyrulelearn.utils._transform_binary_matrix(XTrain) 942 | yTrain = np.array(yTrain, dtype=bool) 943 | 944 | 945 | 946 | 947 | 948 | if(self.ruleType not in ["CNF", "DNF", "relaxed_CNF", "decision lists", "decision sets"]): 949 | raise ValueError(self.ruleType) 950 | 951 | self.trainingSize = XTrain.shape[0] 952 | if(self.trainingSize > 0): 953 | self.numFeatures = len(XTrain[0]) 954 | if(self.trainingSize < self.batchsize): 955 | self.batchsize = self.trainingSize 956 | 957 | 958 | 959 | if(self.ruleType == "relaxed_CNF"): 960 | 
self._fit_relaxed_CNF(XTrain, yTrain) 961 | self._fit_mode = False 962 | return 963 | 964 | if(self.ruleType == "decision lists"): 965 | self._fit_decision_lists(XTrain, yTrain) 966 | self._fit_mode = False 967 | return 968 | 969 | if(self.ruleType == "decision sets"): 970 | self._fit_decision_sets(XTrain, yTrain) 971 | self._fit_mode = False 972 | return 973 | 974 | 975 | if(recursive): 976 | self._fit_CNF_DNF_recursive(XTrain, yTrain) 977 | self._fit_mode = False 978 | return 979 | 980 | 981 | 982 | 983 | 984 | 985 | self.iterations = 2**math.floor(math.log2(XTrain.shape[0]/self.batchsize)) 986 | 987 | 988 | best_loss = self.dataFidelity * XTrain.shape[0] + self.numFeatures * self.weightFeature * self.numClause 989 | best_loss_attribute = None 990 | num_outer_idx = 2 991 | cnt = 0 992 | self._assignList = [] 993 | for outer_idx in range(num_outer_idx): 994 | 995 | # time check 996 | if(time() - self._fit_start_time > self.timeOut): 997 | continue 998 | 999 | 1000 | XTrains, yTrains = pyrulelearn.utils._numpy_partition(self, XTrain, yTrain) 1001 | batch_order = None 1002 | random_shuffle_batch = False 1003 | if(random_shuffle_batch): 1004 | batch_order = random.sample(range(self.iterations), self.iterations) 1005 | else: 1006 | batch_order = range(self.iterations) 1007 | 1008 | for each_batch in tqdm(batch_order, disable = not self.verbose): 1009 | # time check 1010 | if(time() - self._fit_start_time > self.timeOut): 1011 | continue 1012 | 1013 | if(self.verbose): 1014 | print("\nTraining started for batch: ", each_batch+1) 1015 | pyrulelearn.maxsat_wrap._learnModel(self, XTrains[each_batch], yTrains[each_batch], isTest=False) 1016 | 1017 | 1018 | 1019 | # performance 1020 | cnt += 1 1021 | self._learn_parameter() 1022 | yhat = self.predict(XTrain) 1023 | acc = accuracy_score(yTrain, yhat) 1024 | def _loss(acc, num_sample, rule_size): 1025 | assert rule_size <= self.numFeatures 1026 | return (1-acc) * self.dataFidelity * num_sample + rule_size * self.weightFeature 1027 | loss = _loss(acc, XTrain.shape[0], len(self._selectedFeatureIndex)) 1028 | if(loss <= best_loss): 1029 | # print() 1030 | # print(acc, len(self._selectedFeatureIndex)) 1031 | # print(loss) 1032 | best_loss = loss 1033 | best_loss_attribute = (self._xhat, self._selectedFeatureIndex, self._assignList) 1034 | else: 1035 | if(best_loss_attribute is not None): 1036 | self._assignList = best_loss_attribute[2] 1037 | 1038 | 1039 | 1040 | if(self.iterations == 1): 1041 | # When iteration = 1, training accuracy is optimized. 
So there is no point to iterate again 1042 | break 1043 | 1044 | 1045 | assert best_loss_attribute is not None 1046 | self._xhat, self._selectedFeatureIndex, self._assignList = best_loss_attribute 1047 | self._learn_parameter() 1048 | self._fit_mode = False 1049 | return 1050 | 1051 | 1052 | 1053 | 1054 | 1055 | 1056 | def predict(self, XTest): 1057 | 1058 | if(not self._fit_mode): 1059 | XTest = pyrulelearn.utils._transform_binary_matrix(XTest) 1060 | # XTest = pyrulelearn.utils._add_dummy_columns(XTest) 1061 | assert self.numFeatures == XTest.shape[1], str(self.numFeatures) + " " + str(XTest.shape[1]) 1062 | 1063 | 1064 | y_hat = [] 1065 | self.coverage = [] 1066 | if(self.ruleType in ["CNF", "DNF", "relaxed_CNF"]): 1067 | 1068 | """ 1069 | Expensive 1070 | """ 1071 | if(False): 1072 | yTest = [1 for _ in XTest] 1073 | for i in range(len(yTest)): 1074 | dot_value = [0 for eachLevel in range(self.numClause)] 1075 | for eachLevel in range(self.numClause): 1076 | if(self.ruleType == "relaxed_CNF"): 1077 | dot_value[eachLevel] = np.dot(XTest[i], np.array(self._assignList[eachLevel * self.numFeatures: (eachLevel + 1) * self.numFeatures ])) 1078 | elif(self.ruleType in ['CNF', 'DNF']): 1079 | dot_value[eachLevel] = np.dot(XTest[i], self._xhat[eachLevel]) 1080 | else: 1081 | raise ValueError 1082 | 1083 | if (yTest[i] == 1): 1084 | correctClauseCount = 0 1085 | for eachLevel in range(self.numClause): 1086 | if (dot_value[eachLevel] >= self.threshold_literal_learned[eachLevel]): 1087 | correctClauseCount += 1 1088 | if (correctClauseCount >= self.threshold_clause_learned): 1089 | y_hat.append(1) 1090 | else: 1091 | y_hat.append(0) 1092 | 1093 | else: 1094 | correctClauseCount = 0 1095 | for eachLevel in range(self.numClause): 1096 | if (dot_value[eachLevel] < self.threshold_literal_learned[eachLevel]): 1097 | correctClauseCount += 1 1098 | if (correctClauseCount > self.numClause - self.threshold_clause_learned): 1099 | y_hat.append(0) 1100 | else: 1101 | y_hat.append(1) 1102 | 1103 | 1104 | # Matrix multiplication 1105 | if(True): 1106 | if(self.ruleType == "relaxed_CNF"): 1107 | self._xhat = np.array(self._assignList[:self.numClause * self.numFeatures]).reshape(self.numClause, self.numFeatures) 1108 | 1109 | 1110 | # dot_matrix = XTest.dot(self._xhat.T) 1111 | # y_hat = ((dot_matrix >= np.array(self.threshold_literal_learned)).sum(axis=1) >= self.threshold_clause_learned).astype(int) 1112 | 1113 | # considers non zero columns only 1114 | start_prediction_time = time() 1115 | nonzero_columns = np.nonzero(np.any(self._xhat, axis=0))[0] 1116 | dot_matrix = XTest[:, nonzero_columns].dot(self._xhat[:, nonzero_columns].T) 1117 | y_hat = ((dot_matrix >= np.array(self.threshold_literal_learned)).sum(axis=1) >= self.threshold_clause_learned).astype(int) 1118 | self._prediction_time += time() - start_prediction_time 1119 | 1120 | 1121 | # assert np.array_equal(y_hat_, y_hat) 1122 | 1123 | 1124 | 1125 | elif(self.ruleType in ["decision lists", "decision sets"]): 1126 | 1127 | cnt_voting_function = 0 1128 | cnt_reach_default_rule = 0 1129 | 1130 | self.coverage = [0 for _ in range(self.numClause)] 1131 | 1132 | assert len(self.clause_target) == self.numClause 1133 | for example in XTest: 1134 | reached_verdict = False 1135 | possible_outcome = [] 1136 | for eachLevel in range(self.numClause): 1137 | dot_value = np.dot(example, self._xhat[eachLevel]) 1138 | assert dot_value <= self.threshold_literal_learned[eachLevel] 1139 | if(dot_value == self.threshold_literal_learned[eachLevel]): 1140 | reached_verdict 
= True 1141 | self.coverage[eachLevel] += 1 1142 | if(self.ruleType == "decision lists"): 1143 | y_hat.append(self.clause_target[eachLevel]) 1144 | if(eachLevel == self.numClause - 1): 1145 | cnt_reach_default_rule += 1 1146 | 1147 | break 1148 | possible_outcome.append(self.clause_target[eachLevel]) 1149 | 1150 | 1151 | assert reached_verdict 1152 | if(self.ruleType == "decision sets"): 1153 | # refine possible outcomes 1154 | default_outcome = possible_outcome[-1] 1155 | possible_outcome = possible_outcome[:-1] 1156 | if(len(possible_outcome) > 0): 1157 | # most frequent 1158 | cnt_voting_function += 1 1159 | y_hat.append(max(set(possible_outcome), key = possible_outcome.count)) 1160 | else: 1161 | cnt_reach_default_rule += 1 1162 | y_hat.append(default_outcome) 1163 | 1164 | if(self.verbose): 1165 | print("\n") 1166 | print("Voting function:", cnt_voting_function) 1167 | print("Default rule:", cnt_reach_default_rule) 1168 | print("Coverage:", self.coverage) 1169 | 1170 | 1171 | 1172 | else: 1173 | raise ValueError(self.ruleType) 1174 | 1175 | # y_hat = np.array(y_hat) 1176 | return y_hat 1177 | 1178 | 1179 | 1180 | # if(self.verbose): 1181 | # print("\nPrediction through MaxSAT formulation") 1182 | # predictions = self.__learnModel(XTest, yTest, isTest=True) 1183 | # yhat = [] 1184 | # for i in range(len(predictions)): 1185 | # if (int(predictions[i]) > 0): 1186 | # yhat.append(1 - yTest[i]) 1187 | # else: 1188 | # yhat.append(yTest[i]) 1189 | # return yhat 1190 | 1191 | 1192 | def _learn_parameter(self): 1193 | # parameters learned for rule 1194 | if(self.ruleType=="CNF"): 1195 | self.threshold_literal_learned = [1 for i in range(self.numClause)] 1196 | elif(self.ruleType=="DNF"): 1197 | self.threshold_literal_learned = [len(selected_columns) for selected_columns in self._get_selected_column_index()] 1198 | else: 1199 | raise ValueError 1200 | 1201 | if(self.ruleType=="CNF"): 1202 | self.threshold_clause_learned = self.numClause 1203 | elif(self.ruleType=="DNF"): 1204 | self.threshold_clause_learned = 1 1205 | else: 1206 | raise ValueError 1207 | 1208 | 1209 | def get_rule(self, features, show_decision_lists=False): 1210 | 1211 | if(2 * len(features) == self.numFeatures): 1212 | features = [str(feature) for feature in features] 1213 | features += ["not " + str(feature) for feature in features] 1214 | 1215 | assert len(features) == self.numFeatures 1216 | 1217 | if(self.ruleType == "relaxed_CNF"): # naive copy paste 1218 | no_features = len(features) 1219 | # self.rule_size = 0 1220 | rule = '[ ( ' 1221 | for eachLevel in range(self.numClause): 1222 | 1223 | for literal_index in range(no_features): 1224 | if (self._assignList[eachLevel * no_features + literal_index] >= 1): 1225 | rule += " " + features[literal_index] + " +" 1226 | rule = rule[:-1] 1227 | rule += ' )>= ' + str(self.threshold_literal_learned[eachLevel]) + " ]" 1228 | 1229 | if (eachLevel < self.numClause - 1): 1230 | rule += ' +\n[ ( ' 1231 | rule += " >= " + str(self.threshold_clause_learned) 1232 | 1233 | 1234 | return rule 1235 | else: 1236 | 1237 | self.names = [] 1238 | for i in range(self.numClause): 1239 | xHatElem = self._xhat[i] 1240 | inds_nnz = np.where(abs(xHatElem) > 1e-4)[0] 1241 | self.names.append([features[ind] for ind in inds_nnz]) 1242 | 1243 | 1244 | if(self.ruleType == "CNF"): 1245 | return " AND\n".join([" OR ".join(name) for name in self.names]) 1246 | elif(self.ruleType == "DNF"): 1247 | if(not show_decision_lists): 1248 | return " OR\n".join([" AND ".join(name) for name in self.names]) 1249 | 
else: 1250 | return "\n".join([("If " if idx == 0 else ("Else if " if len(name) > 0 else "Else")) + " AND ".join(name) + ": class = 1" for idx, name in enumerate(self.names)]) 1251 | elif(self.ruleType == "decision lists"): 1252 | 1253 | #TODO Can intermediate rule be empty? 1254 | 1255 | return "\n".join([("If " if idx == 0 else ("Else if " if len(name) > 0 else "Else")) + " AND ".join(name) + ": class = " + str(self.clause_target[idx]) for idx, name in enumerate(self.names)]) 1256 | elif(self.ruleType == "decision sets"): 1257 | return "\n".join([("If " if idx < len(self.names) - 1 else "Else ") + " AND ".join(name) + ": class = " + str(self.clause_target[idx]) for idx, name in enumerate(self.names)]) 1258 | else: 1259 | raise ValueError 1260 | 1261 | -------------------------------------------------------------------------------- /pyrulelearn/maxsat_wrap.py: -------------------------------------------------------------------------------- 1 | import subprocess 2 | import math 3 | import os 4 | import numpy as np 5 | from time import time 6 | 7 | # from pyrulelearn 8 | import pyrulelearn.utils 9 | 10 | 11 | def _generateWcnfFile(imli, AMatrix, yVector, xSize, WCNFFile, 12 | isTestPhase): 13 | 14 | # learn soft clauses associated with feature variables and noise variables 15 | topWeight, formula_builder = _learnSoftClauses(imli, isTestPhase, xSize, 16 | yVector) 17 | 18 | # learn hard clauses, 19 | additionalVariable = 0 20 | y_len = len(yVector) 21 | 22 | precomputed_vars = [each_level * xSize for each_level in range(imli.numClause)] 23 | variable_head = y_len + imli.numClause * xSize + 1 24 | 25 | for i in range(y_len): 26 | noise = imli.numClause * xSize + i + 1 27 | 28 | # implementation of tseitin encoding 29 | if (yVector[i] == 0): 30 | 31 | new_clause = str(topWeight) + " " + str(noise) 32 | 33 | # for each_level in range(imli.numClause): 34 | # new_clause += " " + str(additionalVariable + each_level + len(yVector) + imli.numClause * xSize + 1) 35 | formula_builder.append((" ").join(map(str, [topWeight, noise] + [additionalVariable + each_level + variable_head for each_level in range(imli.numClause)] + [0]))) 36 | # new_clause += " 0\n" 37 | # cnfClauses += new_clause 38 | # numClauses += 1 39 | 40 | 41 | mask = AMatrix[i] == 1 42 | dummy = np.arange(1, xSize+1)[mask] 43 | for j in dummy: 44 | for each_level in range(imli.numClause): 45 | # numClauses += 1 46 | # new_clause = str(topWeight) + " -" + str(additionalVariable + variable_head + each_level) + " -" + str(j + precomputed_vars[each_level]) 47 | formula_builder.append((" ").join(map(str, [topWeight, -1 * (additionalVariable + variable_head + each_level), -1 * (j + precomputed_vars[each_level]), 0]))) 48 | # cnfClauses += new_clause + " 0\n" 49 | 50 | additionalVariable += imli.numClause 51 | 52 | else: 53 | 54 | mask = AMatrix[i] == 1 55 | dummy = np.arange(1, xSize+1)[mask] 56 | for each_level in range(imli.numClause): 57 | # cnfClauses += str(topWeight) + " " + str(noise) + " " + (" ").join(map(str, dummy + each_level * xSize)) + " 0\n" 58 | formula_builder.append((" ").join(map(str, [topWeight, noise] + list(dummy + each_level * xSize) + [0]))) 59 | # numClauses += 1 60 | 61 | # cnfClauses = ("\n").join([(" ").join(map(str, each_clause)) for each_clause in formula_builder]) 62 | 63 | 64 | # write in wcnf format 65 | start_demo_time = time() 66 | num_clauses = len(formula_builder) 67 | header = 'p wcnf ' + str(additionalVariable + variable_head - 1) + ' ' + str(num_clauses) + ' ' + str(topWeight) + "\n" 68 | 69 | with 
open(WCNFFile, 'w') as file: 70 | file.write(header) 71 | # write in chunck of 500 clauses 72 | # chunck_size = 500 73 | # for i in range(0, num_clauses, chunck_size): 74 | # file.writelines(' '.join(str(var) for var in clause) + '\n' for clause in formula_builder[i:i + chunck_size]) 75 | file.write("\n".join(formula_builder)) 76 | 77 | imli._demo_time += time() - start_demo_time 78 | 79 | 80 | if(imli.verbose): 81 | print("- number of Boolean variables:", additionalVariable + xSize * imli.numClause + (len(yVector))) 82 | 83 | 84 | 85 | 86 | def _learnSoftClauses(imli, isTestPhase, xSize, yVector): 87 | # cnfClauses = '' 88 | # numClauses = 0 89 | 90 | 91 | formula_builder = [] 92 | 93 | if (isTestPhase): 94 | topWeight = imli.dataFidelity * len(yVector) + 1 + imli.weightFeature * xSize * imli.numClause 95 | # numClauses = 0 96 | for i in range(1, imli.numClause * xSize + 1): 97 | # numClauses += 1 98 | # cnfClauses += str(imli.weightFeature) + ' ' + str(-i) + ' 0\n' 99 | formula_builder.append((" ").join(map(str, [imli.weightFeature, -i, 0]))) 100 | for i in range(imli.numClause * xSize + 1, imli.numClause * xSize + len(yVector) + 1): 101 | # numClauses += 1 102 | # cnfClauses += str(imli.dataFidelity) + ' ' + str(-i) + ' 0\n' 103 | formula_builder.append((" ").join(map(str, [imli.dataFidelity, -i, 0]))) 104 | 105 | # for testing, the positive assigned feature variables are converted to hard clauses 106 | # so that their assignment is kept consistent and only noise variables are considered soft, 107 | for each_assign in imli._assignList: 108 | # numClauses += 1 109 | # cnfClauses += str(topWeight) + ' ' + str(each_assign) + ' 0\n' 110 | formula_builder.append((" ").join(map(str, [topWeight, each_assign, 0]))) 111 | else: 112 | # applicable for the 1st Batch 113 | isEmptyAssignList = True 114 | 115 | total_additional_weight = 0 116 | positiveLiteralWeight = imli.weightFeature 117 | for each_assign in imli._assignList: 118 | isEmptyAssignList = False 119 | # numClauses += 1 120 | if (each_assign > 0): 121 | 122 | # cnfClauses += str(positiveLiteralWeight) + ' ' + str(each_assign) + ' 0\n' 123 | formula_builder.append((" ").join(map(str, [positiveLiteralWeight, each_assign, 0]))) 124 | total_additional_weight += positiveLiteralWeight 125 | 126 | else: 127 | # cnfClauses += str(imli.weightFeature) + ' ' + str(each_assign) + ' 0\n' 128 | formula_builder.append((" ").join(map(str, [imli.weightFeature, each_assign, 0]))) 129 | total_additional_weight += imli.weightFeature 130 | 131 | # noise variables are to be kept consisitent (not necessary though) 132 | for i in range(imli.numClause * xSize + 1, 133 | imli.numClause * xSize + len(yVector) + 1): 134 | # numClauses += 1 135 | # cnfClauses += str(imli.dataFidelity) + ' ' + str(-i) + ' 0\n' 136 | formula_builder.append((" ").join(map(str, [imli.dataFidelity, -i, 0]))) 137 | 138 | # for the first step 139 | if (isEmptyAssignList): 140 | for i in range(1, imli.numClause * xSize + 1): 141 | # numClauses += 1 142 | # cnfClauses += str(imli.weightFeature) + ' ' + str(-i) + ' 0\n' 143 | formula_builder.append((" ").join(map(str, [imli.weightFeature, -i, 0]))) 144 | total_additional_weight += imli.weightFeature 145 | 146 | topWeight = int(imli.dataFidelity * len(yVector) + 1 + total_additional_weight) 147 | 148 | # print(formula_builder) 149 | # cnfClauses = ("\n").join([(" ").join(map(str, each_clause)) for each_clause in formula_builder]) 150 | # assert dummy == cnfClauses[:-1] 151 | # quit() 152 | if(imli.verbose): 153 | print("- number of soft 
clauses: ", len(formula_builder)) 154 | 155 | return topWeight, formula_builder 156 | 157 | 158 | 159 | def _pruneRules(imli, fields, xSize): 160 | # algorithm 1 in paper 161 | 162 | new_fileds = fields 163 | end_of_column_list = [imli.__columnInfo[i][-1] for i in range(len(imli.__columnInfo))] 164 | freq_end_of_column_list = [[[0, 0] for i in range(len(end_of_column_list))] for j in range(imli.numClause)] 165 | variable_contained_list = [[[] for i in range(len(end_of_column_list))] for j in range(imli.numClause)] 166 | 167 | for i in range(imli.numClause * xSize): 168 | if ((int(fields[i])) > 0): 169 | variable = (int(fields[i]) - 1) % xSize + 1 170 | clause_position = int((int(fields[i]) - 1) / xSize) 171 | for j in range(len(end_of_column_list)): 172 | if (variable <= end_of_column_list[j]): 173 | variable_contained_list[clause_position][j].append(clause_position * xSize + variable) 174 | freq_end_of_column_list[clause_position][j][0] += 1 175 | freq_end_of_column_list[clause_position][j][1] = imli.__columnInfo[j][0] 176 | break 177 | for l in range(imli.numClause): 178 | 179 | for i in range(len(freq_end_of_column_list[l])): 180 | if (freq_end_of_column_list[l][i][0] > 1): 181 | if (freq_end_of_column_list[l][i][1] == 3): 182 | variable_contained_list[l][i] = variable_contained_list[l][i][:-1] 183 | for j in range(len(variable_contained_list[l][i])): 184 | new_fileds[variable_contained_list[l][i][j] - 1] = "-" + str( 185 | variable_contained_list[l][i][j]) 186 | elif (freq_end_of_column_list[l][i][1] == 4): 187 | variable_contained_list[l][i] = variable_contained_list[l][i][1:] 188 | for j in range(len(variable_contained_list[l][i])): 189 | new_fileds[variable_contained_list[l][i][j] - 1] = "-" + str( 190 | variable_contained_list[l][i][j]) 191 | return new_fileds 192 | 193 | 194 | 195 | def _cmd_exists(imli, cmd): 196 | return subprocess.call("type " + cmd, shell=True, 197 | stdout=subprocess.PIPE, stderr=subprocess.PIPE) == 0 198 | 199 | def _learnModel(imli, X, y, isTest): 200 | # X = pyrulelearn.utils._add_dummy_columns(X) 201 | 202 | # temp files to save maxsat query in wcnf format 203 | WCNFFile = imli.workDir + "/" + "model.wcnf" 204 | outputFileMaxsat = imli.workDir + "/" + "model_out.txt" 205 | num_features = len(X[0]) 206 | num_samples = len(y) 207 | 208 | start_wcnf_generation = time() 209 | # generate maxsat query for dataset 210 | if (imli.ruleType == 'DNF'): 211 | # negate yVector for DNF rules 212 | _generateWcnfFile(imli, X, [1 - int(y[each_y]) for each_y in 213 | range(num_samples)], 214 | num_features, WCNFFile, 215 | isTest) 216 | 217 | elif(imli.ruleType == "CNF"): 218 | _generateWcnfFile(imli, X, y, num_features, 219 | WCNFFile, 220 | isTest) 221 | else: 222 | print("\n\nError rule type") 223 | 224 | imli._wcnf_generation_time += time() - start_wcnf_generation 225 | 226 | 227 | solver_start_time = time() 228 | # call a maxsat solver 229 | if(imli.solver in ["open-wbo", "maxhs", 'satlike-cw', 'uwrmaxsat', 'tt-open-wbo-inc', 'open-wbo-inc']): # solver has timeout and experimented with open-wbo only 230 | # if(_cmd_exists(imli, imli.solver)): 231 | if(True): 232 | # timeout_ = None 233 | 234 | # if(imli.iterations == -1): 235 | # timeout_ = imli.timeOut 236 | # else: 237 | # if(int(math.ceil(imli.timeOut/imli.iterations)) < 5): # give at lest 1 second as cpu-lim 238 | # timeout_ = 5 239 | # else: 240 | # timeout_ = int(math.ceil(imli.timeOut/imli.iterations)) 241 | 242 | # assert timeout_ != None 243 | 244 | # left time is allocated for the solver 245 | timeout_ = 
max(int(imli.timeOut - time() + imli._fit_start_time), 5) 246 | 247 | 248 | if(imli.solver in ['open-wbo', 'maxhs', 'uwrmaxsat']): 249 | cmd = imli.solver + ' ' + WCNFFile + ' -cpu-lim=' + str(timeout_) + ' > ' + outputFileMaxsat 250 | # incomplete solvers 251 | elif(imli.solver in ['satlike-cw', 'tt-open-wbo-inc', 'open-wbo-inc']): 252 | cmd = "timeout " + str(timeout_) + " " + imli.solver + ' ' + WCNFFile + ' > ' + outputFileMaxsat 253 | else: 254 | raise ValueError 255 | 256 | else: 257 | raise Exception("Solver not found") 258 | else: 259 | raise Warning(imli.solver + " not configured as a MaxSAT solver in this implementation") 260 | cmd = imli.solver + ' ' + WCNFFile + ' > ' + outputFileMaxsat 261 | 262 | # print(cmd) 263 | 264 | os.system(cmd) 265 | imli._solver_time += time() - solver_start_time 266 | 267 | 268 | # delete temp files 269 | # cmd = "rm " + WCNFFile 270 | # os.system(cmd) 271 | 272 | 273 | 274 | 275 | solution = '' 276 | 277 | # # parse result of maxsat solving 278 | # f = open(outputFileMaxsat, 'r') 279 | # lines = f.readlines() 280 | # f.close() 281 | # # Always consider the last solution 282 | # for line in lines: 283 | # if (line.strip().startswith('v')): 284 | # solution = line.strip().strip('v ') 285 | 286 | # read line by line 287 | with open(outputFileMaxsat) as f: 288 | line = f.readline() 289 | while line: 290 | if (line.strip().startswith('v')): 291 | solution = line.strip().strip('v ') 292 | line = f.readline() 293 | 294 | 295 | if(imli.solver in ['satlike-cw', 'tt-open-wbo-inc']): 296 | solution = (" ").join([str(idx+1) if(val == "1") else "-" + str(idx+1) for idx,val in enumerate(solution)]) 297 | 298 | 299 | 300 | fields = [int(field) for field in solution.split()] 301 | TrueRules = [] 302 | TrueErrors = [] 303 | zeroOneSolution = [] 304 | 305 | 306 | for field in fields: 307 | if (field > 0): 308 | zeroOneSolution.append(1.0) 309 | else: 310 | zeroOneSolution.append(0.0) 311 | 312 | if (field > 0): 313 | 314 | if (abs(field) <= imli.numClause * num_features): 315 | TrueRules.append(field) 316 | elif (imli.numClause * num_features < abs(field) <= imli.numClause * num_features + num_samples): 317 | TrueErrors.append(field) 318 | 319 | if (imli.verbose and isTest == False): 320 | print("\n\nBatch training complete") 321 | print("- number of literals in the rule: " + str(len(TrueRules))) 322 | print("- number of training errors: " + str(len(TrueErrors)) + " out of " + str(num_samples)) 323 | imli._xhat = [] 324 | 325 | for i in range(imli.numClause): 326 | imli._xhat.append(np.array( 327 | zeroOneSolution[i * num_features:(i + 1) * num_features])) 328 | err = np.array(zeroOneSolution[num_features * imli.numClause: len( 329 | X[0]) * imli.numClause + num_samples]) 330 | 331 | 332 | if(imli.ruleType == "DNF"): 333 | actual_feature_len = int(imli.numFeatures/2) 334 | imli._xhat = np.array([np.concatenate((each_xhat[actual_feature_len:], each_xhat[:actual_feature_len])) for each_xhat in imli._xhat]) 335 | if(imli.ruleType == "CNF"): 336 | imli._xhat = np.array(imli._xhat) 337 | 338 | 339 | 340 | # delete temp files 341 | # cmd = "rm " + outputFileMaxsat 342 | # os.system(cmd) 343 | 344 | if (not isTest): 345 | imli._assignList = fields[:imli.numClause * num_features] 346 | imli._selectedFeatureIndex = TrueRules 347 | 348 | # print(imli._selectedFeatureIndex) 349 | 350 | 351 | 352 | return fields[imli.numClause * num_features:num_samples + imli.numClause * num_features] 353 | 354 | 355 | 
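The parsing in `_learnModel` above relies on a fixed variable layout in the MaxSAT query: variables `1 .. numClause * num_features` encode per-clause feature selection (clause-major), and the next `num_samples` variables encode per-sample noise (training errors). The following standalone sketch, using a made-up solver assignment, illustrates that decoding step.

```
import numpy as np

# Hypothetical "v"-line assignment from a MaxSAT solver (values are made up).
num_clause, num_features, num_samples = 2, 3, 4
solution = "1 -2 3 -4 5 -6 -7 8 -9 -10"

fields = [int(f) for f in solution.split()]
zero_one = np.array([1.0 if f > 0 else 0.0 for f in fields])

# First num_clause * num_features variables: feature-selection matrix, one row per clause.
xhat = zero_one[: num_clause * num_features].reshape(num_clause, num_features)

# Next num_samples variables: noise literals; a positive value marks a misclassified sample.
noise = fields[num_clause * num_features : num_clause * num_features + num_samples]
errors = [v for v in noise if v > 0]

print(xhat)    # rows [1, 0, 1] and [0, 1, 0]
print(errors)  # [8] -> the sample encoded by variable 8 is misclassified
```

In the wrapper itself, these per-clause rows become `imli._xhat`, and the positive feature literals populate `imli._selectedFeatureIndex`.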
-------------------------------------------------------------------------------- /pyrulelearn/utils.py: -------------------------------------------------------------------------------- 1 | import Orange 2 | import numpy as np 3 | import pandas as pd 4 | import math 5 | from sklearn.model_selection import train_test_split 6 | import random 7 | from feature_engine import discretisers as dsc 8 | from sklearn.preprocessing import StandardScaler 9 | from sklearn.preprocessing import MinMaxScaler 10 | 11 | 12 | 13 | def discretize_orange(csv_file, verbose=False): 14 | data = Orange.data.Table(csv_file) 15 | # Run impute operation for handling missing values 16 | imputer = Orange.preprocess.Impute() 17 | data = imputer(data) 18 | # Discretize datasets 19 | discretizer = Orange.preprocess.Discretize() 20 | discretizer.method = Orange.preprocess.discretize.EntropyMDL( 21 | force=False) 22 | discetized_data = discretizer(data) 23 | categorical_columns = [elem.name for elem in discetized_data.domain[:-1]] 24 | # Apply one hot encoding on X (Using continuizer of Orange) 25 | continuizer = Orange.preprocess.Continuize() 26 | binarized_data = continuizer(discetized_data) 27 | 28 | X=[] 29 | # # make another level of binarization 30 | # for sample in binarized_data.X: 31 | # X.append([int(feature) for feature in sample]+ [int(1-feature) for feature in sample]) 32 | X = binarized_data.X 33 | 34 | 35 | columns = [] 36 | for i in range(len(binarized_data.domain)-1): 37 | column = binarized_data.domain[i].name 38 | if("<" in column): 39 | column = column.replace("=<", ' < ') 40 | elif("≥" in column): 41 | column = column.replace("=≥", ' >= ') 42 | elif("=" in column): 43 | if("-" in column): 44 | column = column.replace("=", " = (") 45 | column = column+")" 46 | else: 47 | column = column.replace("=", " = ") 48 | column = column 49 | columns.append(column) 50 | 51 | 52 | # make negated columns 53 | # num_features=len(columns) 54 | # for index in range(num_features): 55 | # columns.append("not "+columns[index]) 56 | 57 | 58 | if(verbose): 59 | print("Applying entropy based discretization using Orange library") 60 | print("- file name: ", csv_file) 61 | print("- the number of discretized features:", len(columns)) 62 | 63 | return np.array(X), np.array([int(value) for value in binarized_data.Y]), columns 64 | 65 | 66 | 67 | def get_scaled_df(X): 68 | # scale the feature values 69 | sc = StandardScaler() 70 | X = sc.fit_transform(X) 71 | return X 72 | 73 | 74 | 75 | def process(csv_file, verbose=False): 76 | df = pd.read_csv(csv_file) 77 | prev_columns = list(df.columns) 78 | 79 | target = None 80 | real_valued_columns = [] 81 | categorical_columns = [] 82 | 83 | 84 | for idx, column in enumerate(prev_columns): 85 | 86 | if(column.startswith("i#")): 87 | del df[column] 88 | continue 89 | 90 | 91 | 92 | if(column.startswith("C#")): 93 | column = column[2:] 94 | real_valued_columns.append(column) 95 | elif(column.startswith("D#")): 96 | column = column[2:] 97 | categorical_columns.append(column) 98 | elif(column.startswith("cD#")): 99 | column = column[3:] 100 | target = column 101 | else: 102 | raise ValueError(str(column) + " is not recognized") 103 | 104 | df.rename({prev_columns[idx] : column}, axis=1, inplace=True) 105 | 106 | assert len([target] + categorical_columns + real_valued_columns) == len(df.columns) 107 | 108 | # scale 109 | scaler = MinMaxScaler() 110 | if(len(real_valued_columns) > 0): 111 | df[real_valued_columns] = scaler.fit_transform(df[real_valued_columns]) 112 | 113 | 114 | # drop rows with 
null values 115 | df = df.dropna() 116 | 117 | X_orig = df.drop([target], axis=1) 118 | y_orig = df[target] 119 | 120 | X = get_one_hot_encoded_df(X_orig, columns_to_one_hot=categorical_columns) 121 | 122 | 123 | X_discretized, binner_dict = get_discretized_df(X_orig, columns_to_discretize=real_valued_columns, verbose=False) 124 | # print(binner_dict) 125 | X_discretized = get_one_hot_encoded_df(X_discretized, columns_to_one_hot=list(X_discretized.columns), good_name=binner_dict) 126 | 127 | 128 | return X.values, y_orig.values, list(X.columns), X_discretized.values, y_orig.values, list(X_discretized.columns) 129 | 130 | 131 | 132 | 133 | 134 | def get_discretized_df(data, columns_to_discretize = None, verbose=False): 135 | """ 136 | returns train_test_splitted and discretized df 137 | """ 138 | 139 | binner_dict_ = {} 140 | 141 | if(columns_to_discretize is None): 142 | columns_to_discretize = list(data.columns) 143 | 144 | if(verbose): 145 | print("Applying discretization\nAttribute bins") 146 | for variable in columns_to_discretize: 147 | bins = min(10, len(data[variable].unique())) 148 | if(verbose): 149 | print(variable, bins) 150 | 151 | # set up the discretisation transformer 152 | disc = dsc.EqualWidthDiscretiser(bins=bins, variables = [variable]) 153 | 154 | # fit the transformer 155 | disc.fit(data) 156 | 157 | if(verbose): 158 | print(disc.binner_dict_) 159 | 160 | for key in disc.binner_dict_: 161 | assert key not in binner_dict_ 162 | binner_dict_[key] = disc.binner_dict_[key] 163 | 164 | # transform the data 165 | data = disc.transform(data) 166 | if(verbose): 167 | print(data[variable].unique()) 168 | 169 | 170 | return data, binner_dict_ 171 | 172 | 173 | def get_one_hot_encoded_df(df, columns_to_one_hot, good_name = {}, verbose = False): 174 | """ 175 | Apply one-hot encoding on categircal df and return the df 176 | """ 177 | if(verbose): 178 | print("\n\nApply one-hot encoding on categircal attributes") 179 | for column in columns_to_one_hot: 180 | if(column not in df.columns): 181 | if(verbose): 182 | print(column, " is not considered in classification") 183 | continue 184 | 185 | # Apply when there are more than two categories or the binary categories are string objects. 
186 | unique_categories = df[column].unique() 187 | if(len(unique_categories) > 2): 188 | one_hot = pd.get_dummies(df[column]) 189 | if(verbose): 190 | print(column, " has more than two unique categories", list(one_hot.columns)) 191 | 192 | if(len(one_hot.columns)>1): 193 | if(column not in good_name): 194 | one_hot.columns = [column + " = " + str(c) for c in one_hot.columns] 195 | else: 196 | # print(column, one_hot.columns) 197 | one_hot.columns = [str(good_name[column][idx]) + " <= " + column + " < " + str(good_name[column][idx + 1]) for idx in one_hot.columns] 198 | else: 199 | one_hot.columns = [column for c in one_hot.columns] 200 | df = df.drop(column,axis = 1) 201 | df = df.join(one_hot) 202 | else: 203 | # print(column, unique_categories) 204 | if(0 in unique_categories and 1 in unique_categories): 205 | if(verbose): 206 | print(column, " has categories 1 and 0") 207 | 208 | continue 209 | if(len(unique_categories) == 2): 210 | df[column] = df[column].map({unique_categories[0]: 0, unique_categories[1]: 1}) 211 | else: 212 | assert len(unique_categories) == 1 213 | df[column] = df[column].map({unique_categories[0]: 0}) 214 | if(verbose): 215 | print("Applying following mapping on attribute", column, "=>", unique_categories[0], ":", 0, "|", unique_categories[1], ":", 1) 216 | if(verbose): 217 | print("\n") 218 | return df 219 | 220 | 221 | 222 | def _discretize(imli, file, categorical_column_index=[], column_seperator=",", frac_present=0.9, num_thresholds=4, verbose=False): 223 | 224 | # Quantile probabilities 225 | quantProb = np.linspace(1. / (num_thresholds + 1.), num_thresholds / (num_thresholds + 1.), num_thresholds) 226 | # List of categorical columns 227 | if type(categorical_column_index) is pd.Series: 228 | categorical_column_index = categorical_column_index.tolist() 229 | elif type(categorical_column_index) is not list: 230 | categorical_column_index = [categorical_column_index] 231 | data = pd.read_csv(file, sep=column_seperator, header=0, error_bad_lines=False) 232 | 233 | columns = data.columns 234 | if (verbose): 235 | print("\n\nApplying quantile based discretization") 236 | print("- file name: ", file) 237 | print("- categorical features index: ", categorical_column_index) 238 | print("- number of bins: ", num_thresholds) 239 | # print("- features: ", columns) 240 | print("- number of features:", len(columns)) 241 | 242 | 243 | columnY = columns[-1] 244 | 245 | data.dropna(axis=1, thresh=frac_present * len(data), inplace=True) 246 | data.dropna(axis=0, how='any', inplace=True) 247 | 248 | y = data.pop(columnY).copy() 249 | 250 | # Initialize dataframe and thresholds 251 | X = pd.DataFrame(columns=pd.MultiIndex.from_arrays([[], [], []], names=['feature', 'operation', 'value'])) 252 | thresh = {} 253 | column_counter = 1 254 | imli.__columnInfo = [] 255 | # Iterate over columns 256 | count = 0 257 | for c in data: 258 | # number of unique values 259 | valUniq = data[c].nunique() 260 | 261 | # Constant column --- discard 262 | if valUniq < 2: 263 | continue 264 | 265 | # Binary column 266 | elif valUniq == 2: 267 | # Rename values to 0, 1 268 | X[('is', c, '')] = data[c].replace(np.sort(data[c].unique()), [0, 1]) 269 | X[('is not', c, '')] = data[c].replace(np.sort(data[c].unique()), [1, 0]) 270 | 271 | temp = [1, column_counter, column_counter + 1] 272 | imli.__columnInfo.append(temp) 273 | column_counter += 2 274 | 275 | # Categorical column 276 | elif (count in categorical_column_index) or (data[c].dtype == 'object'): 277 | # if (imli.verbose): 278 | # print(c) 279 | 
# print(c in categorical_column_index) 280 | # print(data[c].dtype) 281 | # Dummy-code values 282 | Anew = pd.get_dummies(data[c]).astype(int) 283 | Anew.columns = Anew.columns.astype(str) 284 | # Append negations 285 | Anew = pd.concat([Anew, 1 - Anew], axis=1, keys=[(c, '=='), (c, '!=')]) 286 | # Concatenate 287 | X = pd.concat([X, Anew], axis=1) 288 | 289 | temp = [2, column_counter, column_counter + 1] 290 | imli.__columnInfo.append(temp) 291 | column_counter += 2 292 | 293 | # Ordinal column 294 | elif np.issubdtype(data[c].dtype, int) | np.issubdtype(data[c].dtype, float): 295 | # Few unique values 296 | # if (imli.verbose): 297 | # print(data[c].dtype) 298 | if valUniq <= num_thresholds + 1: 299 | # Thresholds are sorted unique values excluding maximum 300 | thresh[c] = np.sort(data[c].unique())[:-1] 301 | # Many unique values 302 | else: 303 | # Thresholds are quantiles excluding repetitions 304 | thresh[c] = data[c].quantile(q=quantProb).unique() 305 | # Threshold values to produce binary arrays 306 | Anew = (data[c].values[:, np.newaxis] <= thresh[c]).astype(int) 307 | Anew = np.concatenate((Anew, 1 - Anew), axis=1) 308 | # Convert to dataframe with column labels 309 | Anew = pd.DataFrame(Anew, 310 | columns=pd.MultiIndex.from_product([[c], ['<=', '>'], thresh[c].astype(str)])) 311 | # Concatenate 312 | # print(A.shape) 313 | # print(Anew.shape) 314 | X = pd.concat([X, Anew], axis=1) 315 | 316 | addedColumn = len(Anew.columns) 317 | addedColumn = int(addedColumn / 2) 318 | temp = [3] 319 | temp = temp + [column_counter + nc for nc in range(addedColumn)] 320 | column_counter += addedColumn 321 | imli.__columnInfo.append(temp) 322 | temp = [4] 323 | temp = temp + [column_counter + nc for nc in range(addedColumn)] 324 | column_counter += addedColumn 325 | imli.__columnInfo.append(temp) 326 | else: 327 | # print(("Skipping column '" + c + "': data type cannot be handled")) 328 | continue 329 | count += 1 330 | 331 | if(verbose): 332 | print("\n\nAfter applying discretization") 333 | print("- number of discretized features: ", len(X.columns)) 334 | return X.values, y.values.ravel(), X.columns 335 | 336 | 337 | def _transform_binary_matrix(X): 338 | X = np.array(X) 339 | assert np.array_equal(X, X.astype(bool)), "Feature array is not binary. 
Try imli.discretize or imli.discretize_orange" 340 | X_complement = 1 - X 341 | return np.hstack((X,X_complement)).astype(bool) 342 | 343 | 344 | 345 | def _generateSamples(imli, XTrain, yTrain): 346 | 347 | num_pos_samples = sum(y > 0 for y in yTrain) 348 | relative_batch_size = float(imli.batchsize/len(yTrain)) 349 | 350 | list_of_random_index = random.sample( 351 | [i for i in range(num_pos_samples)], int(num_pos_samples * relative_batch_size)) + random.sample( 352 | [i for i in range(num_pos_samples, imli.trainingSize)], int((imli.trainingSize - num_pos_samples) * relative_batch_size)) 353 | 354 | # print(int(imli.trainingSize * imli.batchsize)) 355 | XTrain_sampled = [XTrain[i] for i in list_of_random_index] 356 | yTrain_sampled = [yTrain[i] for i in list_of_random_index] 357 | 358 | assert len(list_of_random_index) == len(set(list_of_random_index)), "sampling is not uniform" 359 | 360 | return XTrain_sampled, yTrain_sampled 361 | 362 | def _numpy_partition(imli, X, y): 363 | y = y.copy() 364 | # based on numpy split 365 | result = np.hstack((X,y.reshape(-1,1))) 366 | # np.random.seed(22) 367 | # np.random.shuffle(result) 368 | result = np.array_split(result, imli.iterations) 369 | return [np.delete(batch,-1, axis=1) for batch in result], [batch[:,-1] for batch in result] 370 | 371 | 372 | def _getBatchWithEqualProbability(imli, X, y): 373 | ''' 374 | Steps: 375 | 1. seperate data based on class value 376 | 2. Batch each seperate data into Batch_count batches using test_train_split method with 50% part in each 377 | 3. merge one seperate batche from each class and save 378 | :param X: 379 | :param y: 380 | :param Batch_count: 381 | :param location: 382 | :param file_name_header: 383 | :param column_set_list: uses for incremental approach 384 | :return: 385 | ''' 386 | Batch_count = imli.iterations 387 | # y = y.values.ravel() 388 | max_y = int(y.max()) 389 | min_y = int(y.min()) 390 | 391 | X_list = [[] for i in range(max_y - min_y + 1)] 392 | y_list = [[] for i in range(max_y - min_y + 1)] 393 | level = int(math.log(Batch_count, 2.0)) 394 | for i in range(len(y)): 395 | inserting_index = int(y[i]) 396 | y_list[inserting_index - min_y].append(y[i]) 397 | X_list[inserting_index - min_y].append(X[i]) 398 | 399 | final_Batch_X_train = [[] for i in range(Batch_count)] 400 | final_Batch_y_train = [[] for i in range(Batch_count)] 401 | for each_class in range(len(X_list)): 402 | Batch_list_X_train = [X_list[each_class]] 403 | Batch_list_y_train = [y_list[each_class]] 404 | 405 | for i in range(level): 406 | for j in range(int(math.pow(2, i))): 407 | A_train_1, A_train_2, y_train_1, y_train_2 = train_test_split( 408 | Batch_list_X_train[int(math.pow(2, i)) + j - 1], 409 | Batch_list_y_train[int(math.pow(2, i)) + j - 1], 410 | test_size=0.5, 411 | random_state = 22) # random state for keeping consistency between lp and maxsat approach 412 | Batch_list_X_train.append(A_train_1) 413 | Batch_list_X_train.append(A_train_2) 414 | Batch_list_y_train.append(y_train_1) 415 | Batch_list_y_train.append(y_train_2) 416 | 417 | Batch_list_y_train = Batch_list_y_train[Batch_count - 1:] 418 | Batch_list_X_train = Batch_list_X_train[Batch_count - 1:] 419 | 420 | for i in range(Batch_count): 421 | final_Batch_y_train[i] = final_Batch_y_train[i] + Batch_list_y_train[i] 422 | final_Batch_X_train[i] = final_Batch_X_train[i] + Batch_list_X_train[i] 423 | 424 | # # to numpy 425 | # final_Batch_X_train[i] = np.array(final_Batch_X_train[i]) 426 | # final_Batch_y_train[i] = np.array(final_Batch_y_train[i]) 427 | 428 
| 429 | return final_Batch_X_train[:Batch_count], final_Batch_y_train[:Batch_count] 430 | 431 | 432 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | anyio==3.6.1 2 | AnyQt==0.1.1 3 | appnope==0.1.3 4 | asteval==0.9.18 5 | backcall==0.2.0 6 | baycomp==1.0.2 7 | Bottleneck==1.3.4 8 | CacheControl==0.12.11 9 | certifi==2022.5.18.1 10 | chardet==4.0.0 11 | charset-normalizer==2.0.12 12 | codegen==1.0 13 | commonmark==0.9.1 14 | cplex==22.1.0.0 15 | cycler==0.11.0 16 | debugpy==1.6.0 17 | decorator==5.1.1 18 | dictdiffer==0.9.0 19 | docutils==0.18.1 20 | entrypoints==0.4 21 | et-xmlfile==1.1.0 22 | feature-engine==0.4.31 23 | fonttools==4.33.3 24 | h11==0.12.0 25 | httpcore==0.15.0 26 | httpx==0.23.0 27 | idna==3.3 28 | importlib-metadata==4.11.4 29 | ipykernel==6.15.0 30 | ipython==7.34.0 31 | ipython-genutils==0.2.0 32 | jedi==0.18.1 33 | joblib==1.1.0 34 | jupyter-client==7.3.4 35 | jupyter-core==4.10.0 36 | keyring==23.6.0 37 | keyrings.alt==4.1.0 38 | kiwisolver==1.4.3 39 | lockfile==0.12.2 40 | matplotlib==3.5.2 41 | matplotlib-inline==0.1.3 42 | msgpack==1.0.4 43 | nest-asyncio==1.5.5 44 | networkx==2.6.3 45 | numpy==1.21.6 46 | openpyxl==3.0.10 47 | openTSNE==0.6.2 48 | orange-canvas-core==0.1.26 49 | orange-widget-base==4.17.0 50 | Orange3==3.32.0 51 | packaging==21.3 52 | pandas==1.3.5 53 | parso==0.8.3 54 | patsy==0.5.2 55 | pexpect==4.8.0 56 | pickleshare==0.7.5 57 | Pillow==9.1.1 58 | prompt-toolkit==3.0.29 59 | psutil==5.9.1 60 | ptyprocess==0.7.0 61 | Pygments==2.12.0 62 | pyparsing==3.0.9 63 | PyQt5==5.15.7 64 | PyQt5-Qt5==5.15.2 65 | PyQt5-sip==12.11.0 66 | pyqtgraph==0.12.3 67 | PyQtWebEngine==5.15.6 68 | PyQtWebEngine-Qt5==5.15.2 69 | python-dateutil==2.8.2 70 | python-louvain==0.16 71 | pytz==2022.1 72 | PyYAML==6.0 73 | pyzmq==23.2.0 74 | qasync==0.23.0 75 | qtconsole==5.3.1 76 | QtPy==2.1.0 77 | requests==2.28.0 78 | rfc3986==1.5.0 79 | scikit-learn==1.0.2 80 | scipy==1.7.3 81 | serverfiles==0.3.1 82 | six==1.16.0 83 | sklearn==0.0 84 | sniffio==1.2.0 85 | statsmodels==0.13.2 86 | threadpoolctl==3.1.0 87 | tornado==6.1 88 | tqdm==4.64.0 89 | traitlets==5.3.0 90 | typing_extensions==4.2.0 91 | urllib3==1.26.9 92 | wcwidth==0.2.5 93 | xlrd==2.0.1 94 | XlsxWriter==3.0.3 95 | zipp==3.8.0 96 | -------------------------------------------------------------------------------- /setup.cfg: -------------------------------------------------------------------------------- 1 | [metadata] 2 | description-file = README.md -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup 2 | 3 | # read the contents of your README file 4 | from os import path 5 | this_directory = path.abspath(path.dirname(__file__)) 6 | with open(path.join(this_directory, 'README.md'), encoding='utf-8') as f: 7 | long_description = f.read() 8 | 9 | 10 | setup( 11 | name = 'pyrulelearn', 12 | packages = ['pyrulelearn'], 13 | version = 'v1.1.1', 14 | license='MIT', 15 | description = 'This library can be used to generate interpretable classification rules expressed as CNF/DNF and relaxed-CNF', 16 | long_description=long_description, 17 | long_description_content_type='text/markdown', 18 | author = 'Bishwamittra Ghosh', 19 | author_email = 'bishwamittra.ghosh@gmail.com', 20 | url = 'https://github.com/meelgroup/MLIC', 21 | download_url = 
'https://github.com/meelgroup/MLIC/archive/v1.1.1.tar.gz', 22 | keywords = ['Classification Rules', 'Interpretable Rules', 'CNF Classification Rules', 'DNF Classification Rules','MaxSAT-based Rule Learning'], # Keywords that define your package best 23 | classifiers=[ 24 | 'Development Status :: 3 - Alpha', 25 | 'Intended Audience :: Developers', 26 | 'Topic :: Software Development :: Build Tools', 27 | 'License :: OSI Approved :: MIT License', 28 | 'Programming Language :: Python :: 3', 29 | 'Programming Language :: Python :: 3.4', 30 | 'Programming Language :: Python :: 3.5', 31 | 'Programming Language :: Python :: 3.6', 32 | ], 33 | ) --------------------------------------------------------------------------------
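
A minimal usage sketch for the preprocessing helpers in `pyrulelearn/utils.py`, under stated assumptions rather than as documented API: `benchmarks/iris_orange.csv` is the repository's sample file and is presumably in a format `Orange.data.Table` can parse, while `data/my_dataset.csv` is a hypothetical CSV whose header follows the `C#`/`D#`/`i#`/`cD#` prefix convention that `process()` checks.

```python
import numpy as np
from pyrulelearn.utils import discretize_orange, process

# Entropy-based (MDL) discretization plus one-hot encoding via Orange.
# Returns a binary feature matrix, integer labels, and readable feature names.
X, y, features = discretize_orange("benchmarks/iris_orange.csv", verbose=True)
print(X.shape, np.unique(y), features[:3])

# process() instead reads a plain CSV whose header encodes each column's role:
#   "C#<name>"  -> continuous feature (min-max scaled)
#   "D#<name>"  -> categorical feature (one-hot encoded)
#   "i#<name>"  -> ignored column
#   "cD#<name>" -> the class column
# Any other prefix raises ValueError. Two views of the data are returned:
# the scaled/one-hot view and a discretized view (equal-width bins, then one-hot).
X_raw, y_raw, cols_raw, X_disc, y_disc, cols_disc = process("data/my_dataset.csv")
print(len(cols_raw), "raw features |", len(cols_disc), "discretized features")
```

Either binary matrix can then be fed to the rule learner described in `doc/documentation.ipynb`.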
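
Internally, the learner operates on a matrix that pairs each binarized feature with its complement, which is what allows a learned CNF/DNF clause to contain negated literals. The snippet below illustrates the effect of the `_transform_binary_matrix` helper defined above; it is an internal function, so calling it directly is for illustration only.

```python
import numpy as np
from pyrulelearn.utils import _transform_binary_matrix

X = np.array([[1, 0, 1],
              [0, 1, 1]])

# Each feature is paired with its complement: columns 0-2 are the original
# features, columns 3-5 their negations.
X_aug = _transform_binary_matrix(X)
print(X_aug.astype(int))
# [[1 0 1 0 1 0]
#  [0 1 1 1 0 0]]

# Non-binary input trips the assertion, which points back to the
# discretization entry points (imli.discretize / imli.discretize_orange).
```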