├── LICENSE
├── README.md
├── missingpy
│   ├── __init__.py
│   ├── knnimpute.py
│   ├── missforest.py
│   ├── pairwise_external.py
│   ├── tests
│   │   ├── __init__.py
│   │   ├── test_knnimpute.py
│   │   └── test_missforest.py
│   └── utils.py
├── requirements.txt
└── setup.py
/LICENSE:
--------------------------------------------------------------------------------
1 | GNU GENERAL PUBLIC LICENSE
2 | Version 3, 29 June 2007
3 |
4 | Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
5 | Everyone is permitted to copy and distribute verbatim copies
6 | of this license document, but changing it is not allowed.
7 |
8 | Preamble
9 |
10 | The GNU General Public License is a free, copyleft license for
11 | software and other kinds of works.
12 |
13 | The licenses for most software and other practical works are designed
14 | to take away your freedom to share and change the works. By contrast,
15 | the GNU General Public License is intended to guarantee your freedom to
16 | share and change all versions of a program--to make sure it remains free
17 | software for all its users. We, the Free Software Foundation, use the
18 | GNU General Public License for most of our software; it applies also to
19 | any other work released this way by its authors. You can apply it to
20 | your programs, too.
21 |
22 | When we speak of free software, we are referring to freedom, not
23 | price. Our General Public Licenses are designed to make sure that you
24 | have the freedom to distribute copies of free software (and charge for
25 | them if you wish), that you receive source code or can get it if you
26 | want it, that you can change the software or use pieces of it in new
27 | free programs, and that you know you can do these things.
28 |
29 | To protect your rights, we need to prevent others from denying you
30 | these rights or asking you to surrender the rights. Therefore, you have
31 | certain responsibilities if you distribute copies of the software, or if
32 | you modify it: responsibilities to respect the freedom of others.
33 |
34 | For example, if you distribute copies of such a program, whether
35 | gratis or for a fee, you must pass on to the recipients the same
36 | freedoms that you received. You must make sure that they, too, receive
37 | or can get the source code. And you must show them these terms so they
38 | know their rights.
39 |
40 | Developers that use the GNU GPL protect your rights with two steps:
41 | (1) assert copyright on the software, and (2) offer you this License
42 | giving you legal permission to copy, distribute and/or modify it.
43 |
44 | For the developers' and authors' protection, the GPL clearly explains
45 | that there is no warranty for this free software. For both users' and
46 | authors' sake, the GPL requires that modified versions be marked as
47 | changed, so that their problems will not be attributed erroneously to
48 | authors of previous versions.
49 |
50 | Some devices are designed to deny users access to install or run
51 | modified versions of the software inside them, although the manufacturer
52 | can do so. This is fundamentally incompatible with the aim of
53 | protecting users' freedom to change the software. The systematic
54 | pattern of such abuse occurs in the area of products for individuals to
55 | use, which is precisely where it is most unacceptable. Therefore, we
56 | have designed this version of the GPL to prohibit the practice for those
57 | products. If such problems arise substantially in other domains, we
58 | stand ready to extend this provision to those domains in future versions
59 | of the GPL, as needed to protect the freedom of users.
60 |
61 | Finally, every program is threatened constantly by software patents.
62 | States should not allow patents to restrict development and use of
63 | software on general-purpose computers, but in those that do, we wish to
64 | avoid the special danger that patents applied to a free program could
65 | make it effectively proprietary. To prevent this, the GPL assures that
66 | patents cannot be used to render the program non-free.
67 |
68 | The precise terms and conditions for copying, distribution and
69 | modification follow.
70 |
71 | TERMS AND CONDITIONS
72 |
73 | 0. Definitions.
74 |
75 | "This License" refers to version 3 of the GNU General Public License.
76 |
77 | "Copyright" also means copyright-like laws that apply to other kinds of
78 | works, such as semiconductor masks.
79 |
80 | "The Program" refers to any copyrightable work licensed under this
81 | License. Each licensee is addressed as "you". "Licensees" and
82 | "recipients" may be individuals or organizations.
83 |
84 | To "modify" a work means to copy from or adapt all or part of the work
85 | in a fashion requiring copyright permission, other than the making of an
86 | exact copy. The resulting work is called a "modified version" of the
87 | earlier work or a work "based on" the earlier work.
88 |
89 | A "covered work" means either the unmodified Program or a work based
90 | on the Program.
91 |
92 | To "propagate" a work means to do anything with it that, without
93 | permission, would make you directly or secondarily liable for
94 | infringement under applicable copyright law, except executing it on a
95 | computer or modifying a private copy. Propagation includes copying,
96 | distribution (with or without modification), making available to the
97 | public, and in some countries other activities as well.
98 |
99 | To "convey" a work means any kind of propagation that enables other
100 | parties to make or receive copies. Mere interaction with a user through
101 | a computer network, with no transfer of a copy, is not conveying.
102 |
103 | An interactive user interface displays "Appropriate Legal Notices"
104 | to the extent that it includes a convenient and prominently visible
105 | feature that (1) displays an appropriate copyright notice, and (2)
106 | tells the user that there is no warranty for the work (except to the
107 | extent that warranties are provided), that licensees may convey the
108 | work under this License, and how to view a copy of this License. If
109 | the interface presents a list of user commands or options, such as a
110 | menu, a prominent item in the list meets this criterion.
111 |
112 | 1. Source Code.
113 |
114 | The "source code" for a work means the preferred form of the work
115 | for making modifications to it. "Object code" means any non-source
116 | form of a work.
117 |
118 | A "Standard Interface" means an interface that either is an official
119 | standard defined by a recognized standards body, or, in the case of
120 | interfaces specified for a particular programming language, one that
121 | is widely used among developers working in that language.
122 |
123 | The "System Libraries" of an executable work include anything, other
124 | than the work as a whole, that (a) is included in the normal form of
125 | packaging a Major Component, but which is not part of that Major
126 | Component, and (b) serves only to enable use of the work with that
127 | Major Component, or to implement a Standard Interface for which an
128 | implementation is available to the public in source code form. A
129 | "Major Component", in this context, means a major essential component
130 | (kernel, window system, and so on) of the specific operating system
131 | (if any) on which the executable work runs, or a compiler used to
132 | produce the work, or an object code interpreter used to run it.
133 |
134 | The "Corresponding Source" for a work in object code form means all
135 | the source code needed to generate, install, and (for an executable
136 | work) run the object code and to modify the work, including scripts to
137 | control those activities. However, it does not include the work's
138 | System Libraries, or general-purpose tools or generally available free
139 | programs which are used unmodified in performing those activities but
140 | which are not part of the work. For example, Corresponding Source
141 | includes interface definition files associated with source files for
142 | the work, and the source code for shared libraries and dynamically
143 | linked subprograms that the work is specifically designed to require,
144 | such as by intimate data communication or control flow between those
145 | subprograms and other parts of the work.
146 |
147 | The Corresponding Source need not include anything that users
148 | can regenerate automatically from other parts of the Corresponding
149 | Source.
150 |
151 | The Corresponding Source for a work in source code form is that
152 | same work.
153 |
154 | 2. Basic Permissions.
155 |
156 | All rights granted under this License are granted for the term of
157 | copyright on the Program, and are irrevocable provided the stated
158 | conditions are met. This License explicitly affirms your unlimited
159 | permission to run the unmodified Program. The output from running a
160 | covered work is covered by this License only if the output, given its
161 | content, constitutes a covered work. This License acknowledges your
162 | rights of fair use or other equivalent, as provided by copyright law.
163 |
164 | You may make, run and propagate covered works that you do not
165 | convey, without conditions so long as your license otherwise remains
166 | in force. You may convey covered works to others for the sole purpose
167 | of having them make modifications exclusively for you, or provide you
168 | with facilities for running those works, provided that you comply with
169 | the terms of this License in conveying all material for which you do
170 | not control copyright. Those thus making or running the covered works
171 | for you must do so exclusively on your behalf, under your direction
172 | and control, on terms that prohibit them from making any copies of
173 | your copyrighted material outside their relationship with you.
174 |
175 | Conveying under any other circumstances is permitted solely under
176 | the conditions stated below. Sublicensing is not allowed; section 10
177 | makes it unnecessary.
178 |
179 | 3. Protecting Users' Legal Rights From Anti-Circumvention Law.
180 |
181 | No covered work shall be deemed part of an effective technological
182 | measure under any applicable law fulfilling obligations under article
183 | 11 of the WIPO copyright treaty adopted on 20 December 1996, or
184 | similar laws prohibiting or restricting circumvention of such
185 | measures.
186 |
187 | When you convey a covered work, you waive any legal power to forbid
188 | circumvention of technological measures to the extent such circumvention
189 | is effected by exercising rights under this License with respect to
190 | the covered work, and you disclaim any intention to limit operation or
191 | modification of the work as a means of enforcing, against the work's
192 | users, your or third parties' legal rights to forbid circumvention of
193 | technological measures.
194 |
195 | 4. Conveying Verbatim Copies.
196 |
197 | You may convey verbatim copies of the Program's source code as you
198 | receive it, in any medium, provided that you conspicuously and
199 | appropriately publish on each copy an appropriate copyright notice;
200 | keep intact all notices stating that this License and any
201 | non-permissive terms added in accord with section 7 apply to the code;
202 | keep intact all notices of the absence of any warranty; and give all
203 | recipients a copy of this License along with the Program.
204 |
205 | You may charge any price or no price for each copy that you convey,
206 | and you may offer support or warranty protection for a fee.
207 |
208 | 5. Conveying Modified Source Versions.
209 |
210 | You may convey a work based on the Program, or the modifications to
211 | produce it from the Program, in the form of source code under the
212 | terms of section 4, provided that you also meet all of these conditions:
213 |
214 | a) The work must carry prominent notices stating that you modified
215 | it, and giving a relevant date.
216 |
217 | b) The work must carry prominent notices stating that it is
218 | released under this License and any conditions added under section
219 | 7. This requirement modifies the requirement in section 4 to
220 | "keep intact all notices".
221 |
222 | c) You must license the entire work, as a whole, under this
223 | License to anyone who comes into possession of a copy. This
224 | License will therefore apply, along with any applicable section 7
225 | additional terms, to the whole of the work, and all its parts,
226 | regardless of how they are packaged. This License gives no
227 | permission to license the work in any other way, but it does not
228 | invalidate such permission if you have separately received it.
229 |
230 | d) If the work has interactive user interfaces, each must display
231 | Appropriate Legal Notices; however, if the Program has interactive
232 | interfaces that do not display Appropriate Legal Notices, your
233 | work need not make them do so.
234 |
235 | A compilation of a covered work with other separate and independent
236 | works, which are not by their nature extensions of the covered work,
237 | and which are not combined with it such as to form a larger program,
238 | in or on a volume of a storage or distribution medium, is called an
239 | "aggregate" if the compilation and its resulting copyright are not
240 | used to limit the access or legal rights of the compilation's users
241 | beyond what the individual works permit. Inclusion of a covered work
242 | in an aggregate does not cause this License to apply to the other
243 | parts of the aggregate.
244 |
245 | 6. Conveying Non-Source Forms.
246 |
247 | You may convey a covered work in object code form under the terms
248 | of sections 4 and 5, provided that you also convey the
249 | machine-readable Corresponding Source under the terms of this License,
250 | in one of these ways:
251 |
252 | a) Convey the object code in, or embodied in, a physical product
253 | (including a physical distribution medium), accompanied by the
254 | Corresponding Source fixed on a durable physical medium
255 | customarily used for software interchange.
256 |
257 | b) Convey the object code in, or embodied in, a physical product
258 | (including a physical distribution medium), accompanied by a
259 | written offer, valid for at least three years and valid for as
260 | long as you offer spare parts or customer support for that product
261 | model, to give anyone who possesses the object code either (1) a
262 | copy of the Corresponding Source for all the software in the
263 | product that is covered by this License, on a durable physical
264 | medium customarily used for software interchange, for a price no
265 | more than your reasonable cost of physically performing this
266 | conveying of source, or (2) access to copy the
267 | Corresponding Source from a network server at no charge.
268 |
269 | c) Convey individual copies of the object code with a copy of the
270 | written offer to provide the Corresponding Source. This
271 | alternative is allowed only occasionally and noncommercially, and
272 | only if you received the object code with such an offer, in accord
273 | with subsection 6b.
274 |
275 | d) Convey the object code by offering access from a designated
276 | place (gratis or for a charge), and offer equivalent access to the
277 | Corresponding Source in the same way through the same place at no
278 | further charge. You need not require recipients to copy the
279 | Corresponding Source along with the object code. If the place to
280 | copy the object code is a network server, the Corresponding Source
281 | may be on a different server (operated by you or a third party)
282 | that supports equivalent copying facilities, provided you maintain
283 | clear directions next to the object code saying where to find the
284 | Corresponding Source. Regardless of what server hosts the
285 | Corresponding Source, you remain obligated to ensure that it is
286 | available for as long as needed to satisfy these requirements.
287 |
288 | e) Convey the object code using peer-to-peer transmission, provided
289 | you inform other peers where the object code and Corresponding
290 | Source of the work are being offered to the general public at no
291 | charge under subsection 6d.
292 |
293 | A separable portion of the object code, whose source code is excluded
294 | from the Corresponding Source as a System Library, need not be
295 | included in conveying the object code work.
296 |
297 | A "User Product" is either (1) a "consumer product", which means any
298 | tangible personal property which is normally used for personal, family,
299 | or household purposes, or (2) anything designed or sold for incorporation
300 | into a dwelling. In determining whether a product is a consumer product,
301 | doubtful cases shall be resolved in favor of coverage. For a particular
302 | product received by a particular user, "normally used" refers to a
303 | typical or common use of that class of product, regardless of the status
304 | of the particular user or of the way in which the particular user
305 | actually uses, or expects or is expected to use, the product. A product
306 | is a consumer product regardless of whether the product has substantial
307 | commercial, industrial or non-consumer uses, unless such uses represent
308 | the only significant mode of use of the product.
309 |
310 | "Installation Information" for a User Product means any methods,
311 | procedures, authorization keys, or other information required to install
312 | and execute modified versions of a covered work in that User Product from
313 | a modified version of its Corresponding Source. The information must
314 | suffice to ensure that the continued functioning of the modified object
315 | code is in no case prevented or interfered with solely because
316 | modification has been made.
317 |
318 | If you convey an object code work under this section in, or with, or
319 | specifically for use in, a User Product, and the conveying occurs as
320 | part of a transaction in which the right of possession and use of the
321 | User Product is transferred to the recipient in perpetuity or for a
322 | fixed term (regardless of how the transaction is characterized), the
323 | Corresponding Source conveyed under this section must be accompanied
324 | by the Installation Information. But this requirement does not apply
325 | if neither you nor any third party retains the ability to install
326 | modified object code on the User Product (for example, the work has
327 | been installed in ROM).
328 |
329 | The requirement to provide Installation Information does not include a
330 | requirement to continue to provide support service, warranty, or updates
331 | for a work that has been modified or installed by the recipient, or for
332 | the User Product in which it has been modified or installed. Access to a
333 | network may be denied when the modification itself materially and
334 | adversely affects the operation of the network or violates the rules and
335 | protocols for communication across the network.
336 |
337 | Corresponding Source conveyed, and Installation Information provided,
338 | in accord with this section must be in a format that is publicly
339 | documented (and with an implementation available to the public in
340 | source code form), and must require no special password or key for
341 | unpacking, reading or copying.
342 |
343 | 7. Additional Terms.
344 |
345 | "Additional permissions" are terms that supplement the terms of this
346 | License by making exceptions from one or more of its conditions.
347 | Additional permissions that are applicable to the entire Program shall
348 | be treated as though they were included in this License, to the extent
349 | that they are valid under applicable law. If additional permissions
350 | apply only to part of the Program, that part may be used separately
351 | under those permissions, but the entire Program remains governed by
352 | this License without regard to the additional permissions.
353 |
354 | When you convey a copy of a covered work, you may at your option
355 | remove any additional permissions from that copy, or from any part of
356 | it. (Additional permissions may be written to require their own
357 | removal in certain cases when you modify the work.) You may place
358 | additional permissions on material, added by you to a covered work,
359 | for which you have or can give appropriate copyright permission.
360 |
361 | Notwithstanding any other provision of this License, for material you
362 | add to a covered work, you may (if authorized by the copyright holders of
363 | that material) supplement the terms of this License with terms:
364 |
365 | a) Disclaiming warranty or limiting liability differently from the
366 | terms of sections 15 and 16 of this License; or
367 |
368 | b) Requiring preservation of specified reasonable legal notices or
369 | author attributions in that material or in the Appropriate Legal
370 | Notices displayed by works containing it; or
371 |
372 | c) Prohibiting misrepresentation of the origin of that material, or
373 | requiring that modified versions of such material be marked in
374 | reasonable ways as different from the original version; or
375 |
376 | d) Limiting the use for publicity purposes of names of licensors or
377 | authors of the material; or
378 |
379 | e) Declining to grant rights under trademark law for use of some
380 | trade names, trademarks, or service marks; or
381 |
382 | f) Requiring indemnification of licensors and authors of that
383 | material by anyone who conveys the material (or modified versions of
384 | it) with contractual assumptions of liability to the recipient, for
385 | any liability that these contractual assumptions directly impose on
386 | those licensors and authors.
387 |
388 | All other non-permissive additional terms are considered "further
389 | restrictions" within the meaning of section 10. If the Program as you
390 | received it, or any part of it, contains a notice stating that it is
391 | governed by this License along with a term that is a further
392 | restriction, you may remove that term. If a license document contains
393 | a further restriction but permits relicensing or conveying under this
394 | License, you may add to a covered work material governed by the terms
395 | of that license document, provided that the further restriction does
396 | not survive such relicensing or conveying.
397 |
398 | If you add terms to a covered work in accord with this section, you
399 | must place, in the relevant source files, a statement of the
400 | additional terms that apply to those files, or a notice indicating
401 | where to find the applicable terms.
402 |
403 | Additional terms, permissive or non-permissive, may be stated in the
404 | form of a separately written license, or stated as exceptions;
405 | the above requirements apply either way.
406 |
407 | 8. Termination.
408 |
409 | You may not propagate or modify a covered work except as expressly
410 | provided under this License. Any attempt otherwise to propagate or
411 | modify it is void, and will automatically terminate your rights under
412 | this License (including any patent licenses granted under the third
413 | paragraph of section 11).
414 |
415 | However, if you cease all violation of this License, then your
416 | license from a particular copyright holder is reinstated (a)
417 | provisionally, unless and until the copyright holder explicitly and
418 | finally terminates your license, and (b) permanently, if the copyright
419 | holder fails to notify you of the violation by some reasonable means
420 | prior to 60 days after the cessation.
421 |
422 | Moreover, your license from a particular copyright holder is
423 | reinstated permanently if the copyright holder notifies you of the
424 | violation by some reasonable means, this is the first time you have
425 | received notice of violation of this License (for any work) from that
426 | copyright holder, and you cure the violation prior to 30 days after
427 | your receipt of the notice.
428 |
429 | Termination of your rights under this section does not terminate the
430 | licenses of parties who have received copies or rights from you under
431 | this License. If your rights have been terminated and not permanently
432 | reinstated, you do not qualify to receive new licenses for the same
433 | material under section 10.
434 |
435 | 9. Acceptance Not Required for Having Copies.
436 |
437 | You are not required to accept this License in order to receive or
438 | run a copy of the Program. Ancillary propagation of a covered work
439 | occurring solely as a consequence of using peer-to-peer transmission
440 | to receive a copy likewise does not require acceptance. However,
441 | nothing other than this License grants you permission to propagate or
442 | modify any covered work. These actions infringe copyright if you do
443 | not accept this License. Therefore, by modifying or propagating a
444 | covered work, you indicate your acceptance of this License to do so.
445 |
446 | 10. Automatic Licensing of Downstream Recipients.
447 |
448 | Each time you convey a covered work, the recipient automatically
449 | receives a license from the original licensors, to run, modify and
450 | propagate that work, subject to this License. You are not responsible
451 | for enforcing compliance by third parties with this License.
452 |
453 | An "entity transaction" is a transaction transferring control of an
454 | organization, or substantially all assets of one, or subdividing an
455 | organization, or merging organizations. If propagation of a covered
456 | work results from an entity transaction, each party to that
457 | transaction who receives a copy of the work also receives whatever
458 | licenses to the work the party's predecessor in interest had or could
459 | give under the previous paragraph, plus a right to possession of the
460 | Corresponding Source of the work from the predecessor in interest, if
461 | the predecessor has it or can get it with reasonable efforts.
462 |
463 | You may not impose any further restrictions on the exercise of the
464 | rights granted or affirmed under this License. For example, you may
465 | not impose a license fee, royalty, or other charge for exercise of
466 | rights granted under this License, and you may not initiate litigation
467 | (including a cross-claim or counterclaim in a lawsuit) alleging that
468 | any patent claim is infringed by making, using, selling, offering for
469 | sale, or importing the Program or any portion of it.
470 |
471 | 11. Patents.
472 |
473 | A "contributor" is a copyright holder who authorizes use under this
474 | License of the Program or a work on which the Program is based. The
475 | work thus licensed is called the contributor's "contributor version".
476 |
477 | A contributor's "essential patent claims" are all patent claims
478 | owned or controlled by the contributor, whether already acquired or
479 | hereafter acquired, that would be infringed by some manner, permitted
480 | by this License, of making, using, or selling its contributor version,
481 | but do not include claims that would be infringed only as a
482 | consequence of further modification of the contributor version. For
483 | purposes of this definition, "control" includes the right to grant
484 | patent sublicenses in a manner consistent with the requirements of
485 | this License.
486 |
487 | Each contributor grants you a non-exclusive, worldwide, royalty-free
488 | patent license under the contributor's essential patent claims, to
489 | make, use, sell, offer for sale, import and otherwise run, modify and
490 | propagate the contents of its contributor version.
491 |
492 | In the following three paragraphs, a "patent license" is any express
493 | agreement or commitment, however denominated, not to enforce a patent
494 | (such as an express permission to practice a patent or covenant not to
495 | sue for patent infringement). To "grant" such a patent license to a
496 | party means to make such an agreement or commitment not to enforce a
497 | patent against the party.
498 |
499 | If you convey a covered work, knowingly relying on a patent license,
500 | and the Corresponding Source of the work is not available for anyone
501 | to copy, free of charge and under the terms of this License, through a
502 | publicly available network server or other readily accessible means,
503 | then you must either (1) cause the Corresponding Source to be so
504 | available, or (2) arrange to deprive yourself of the benefit of the
505 | patent license for this particular work, or (3) arrange, in a manner
506 | consistent with the requirements of this License, to extend the patent
507 | license to downstream recipients. "Knowingly relying" means you have
508 | actual knowledge that, but for the patent license, your conveying the
509 | covered work in a country, or your recipient's use of the covered work
510 | in a country, would infringe one or more identifiable patents in that
511 | country that you have reason to believe are valid.
512 |
513 | If, pursuant to or in connection with a single transaction or
514 | arrangement, you convey, or propagate by procuring conveyance of, a
515 | covered work, and grant a patent license to some of the parties
516 | receiving the covered work authorizing them to use, propagate, modify
517 | or convey a specific copy of the covered work, then the patent license
518 | you grant is automatically extended to all recipients of the covered
519 | work and works based on it.
520 |
521 | A patent license is "discriminatory" if it does not include within
522 | the scope of its coverage, prohibits the exercise of, or is
523 | conditioned on the non-exercise of one or more of the rights that are
524 | specifically granted under this License. You may not convey a covered
525 | work if you are a party to an arrangement with a third party that is
526 | in the business of distributing software, under which you make payment
527 | to the third party based on the extent of your activity of conveying
528 | the work, and under which the third party grants, to any of the
529 | parties who would receive the covered work from you, a discriminatory
530 | patent license (a) in connection with copies of the covered work
531 | conveyed by you (or copies made from those copies), or (b) primarily
532 | for and in connection with specific products or compilations that
533 | contain the covered work, unless you entered into that arrangement,
534 | or that patent license was granted, prior to 28 March 2007.
535 |
536 | Nothing in this License shall be construed as excluding or limiting
537 | any implied license or other defenses to infringement that may
538 | otherwise be available to you under applicable patent law.
539 |
540 | 12. No Surrender of Others' Freedom.
541 |
542 | If conditions are imposed on you (whether by court order, agreement or
543 | otherwise) that contradict the conditions of this License, they do not
544 | excuse you from the conditions of this License. If you cannot convey a
545 | covered work so as to satisfy simultaneously your obligations under this
546 | License and any other pertinent obligations, then as a consequence you may
547 | not convey it at all. For example, if you agree to terms that obligate you
548 | to collect a royalty for further conveying from those to whom you convey
549 | the Program, the only way you could satisfy both those terms and this
550 | License would be to refrain entirely from conveying the Program.
551 |
552 | 13. Use with the GNU Affero General Public License.
553 |
554 | Notwithstanding any other provision of this License, you have
555 | permission to link or combine any covered work with a work licensed
556 | under version 3 of the GNU Affero General Public License into a single
557 | combined work, and to convey the resulting work. The terms of this
558 | License will continue to apply to the part which is the covered work,
559 | but the special requirements of the GNU Affero General Public License,
560 | section 13, concerning interaction through a network will apply to the
561 | combination as such.
562 |
563 | 14. Revised Versions of this License.
564 |
565 | The Free Software Foundation may publish revised and/or new versions of
566 | the GNU General Public License from time to time. Such new versions will
567 | be similar in spirit to the present version, but may differ in detail to
568 | address new problems or concerns.
569 |
570 | Each version is given a distinguishing version number. If the
571 | Program specifies that a certain numbered version of the GNU General
572 | Public License "or any later version" applies to it, you have the
573 | option of following the terms and conditions either of that numbered
574 | version or of any later version published by the Free Software
575 | Foundation. If the Program does not specify a version number of the
576 | GNU General Public License, you may choose any version ever published
577 | by the Free Software Foundation.
578 |
579 | If the Program specifies that a proxy can decide which future
580 | versions of the GNU General Public License can be used, that proxy's
581 | public statement of acceptance of a version permanently authorizes you
582 | to choose that version for the Program.
583 |
584 | Later license versions may give you additional or different
585 | permissions. However, no additional obligations are imposed on any
586 | author or copyright holder as a result of your choosing to follow a
587 | later version.
588 |
589 | 15. Disclaimer of Warranty.
590 |
591 | THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
592 | APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
593 | HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
594 | OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
595 | THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
596 | PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
597 | IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
598 | ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
599 |
600 | 16. Limitation of Liability.
601 |
602 | IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
603 | WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
604 | THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
605 | GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
606 | USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
607 | DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
608 | PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
609 | EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
610 | SUCH DAMAGES.
611 |
612 | 17. Interpretation of Sections 15 and 16.
613 |
614 | If the disclaimer of warranty and limitation of liability provided
615 | above cannot be given local legal effect according to their terms,
616 | reviewing courts shall apply local law that most closely approximates
617 | an absolute waiver of all civil liability in connection with the
618 | Program, unless a warranty or assumption of liability accompanies a
619 | copy of the Program in return for a fee.
620 |
621 | END OF TERMS AND CONDITIONS
622 |
623 | How to Apply These Terms to Your New Programs
624 |
625 | If you develop a new program, and you want it to be of the greatest
626 | possible use to the public, the best way to achieve this is to make it
627 | free software which everyone can redistribute and change under these terms.
628 |
629 | To do so, attach the following notices to the program. It is safest
630 | to attach them to the start of each source file to most effectively
631 | state the exclusion of warranty; and each file should have at least
632 | the "copyright" line and a pointer to where the full notice is found.
633 |
634 | <one line to give the program's name and a brief idea of what it does.>
635 | Copyright (C) <year>  <name of author>
636 |
637 | This program is free software: you can redistribute it and/or modify
638 | it under the terms of the GNU General Public License as published by
639 | the Free Software Foundation, either version 3 of the License, or
640 | (at your option) any later version.
641 |
642 | This program is distributed in the hope that it will be useful,
643 | but WITHOUT ANY WARRANTY; without even the implied warranty of
644 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
645 | GNU General Public License for more details.
646 |
647 | You should have received a copy of the GNU General Public License
648 | along with this program. If not, see <https://www.gnu.org/licenses/>.
649 |
650 | Also add information on how to contact you by electronic and paper mail.
651 |
652 | If the program does terminal interaction, make it output a short
653 | notice like this when it starts in an interactive mode:
654 |
655 | <program>  Copyright (C) <year>  <name of author>
656 | This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
657 | This is free software, and you are welcome to redistribute it
658 | under certain conditions; type `show c' for details.
659 |
660 | The hypothetical commands `show w' and `show c' should show the appropriate
661 | parts of the General Public License. Of course, your program's commands
662 | might be different; for a GUI interface, you would use an "about box".
663 |
664 | You should also get your employer (if you work as a programmer) or school,
665 | if any, to sign a "copyright disclaimer" for the program, if necessary.
666 | For more information on this, and how to apply and follow the GNU GPL, see
667 | <https://www.gnu.org/licenses/>.
668 |
669 | The GNU General Public License does not permit incorporating your program
670 | into proprietary programs. If your program is a subroutine library, you
671 | may consider it more useful to permit linking proprietary applications with
672 | the library. If this is what you want to do, use the GNU Lesser General
673 | Public License instead of this License. But first, please read
674 | <https://www.gnu.org/licenses/why-not-lgpl.html>.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## missingpy
2 |
3 | `missingpy` is a library for missing data imputation in Python. It has an
4 | API consistent with [scikit-learn](http://scikit-learn.org/stable/), so users
5 | already comfortable with that interface will find themselves in familiar
6 | terrain. Currently, the library supports the following algorithms:
7 | 1. k-Nearest Neighbors imputation
8 | 2. Random Forest imputation (MissForest)
9 |
10 | We plan to add other imputation tools in the future so please stay tuned!
11 |
12 | ## Installation
13 |
14 | `pip install missingpy`
15 |
16 | ## 1. k-Nearest Neighbors (kNN) Imputation
17 |
18 | ### Example
19 | ```python
20 | # Let X be an array containing missing values
21 | from missingpy import KNNImputer
22 | imputer = KNNImputer()
23 | X_imputed = imputer.fit_transform(X)
24 | ```
25 |
26 | ### Description
27 | The `KNNImputer` class provides imputation for completing missing
28 | values using the k-Nearest Neighbors approach. Each sample's missing values
29 | are imputed using values from `n_neighbors` nearest neighbors found in the
30 | training set. Note that if a sample has more than one feature missing, then
31 | the sample can potentially have multiple sets of `n_neighbors`
32 | donors depending on the particular feature being imputed.
33 |
34 | Each missing feature is then imputed as the average, either weighted or
35 | unweighted, of these neighbors. Where the number of donor neighbors is less
36 | than `n_neighbors`, the training set average for that feature is used
37 | for imputation. The number of donor neighbors actually available for a
38 | given sample depends on both the overall training set size and the
39 | number of samples excluded from the nearest neighbor search because
40 | they have too many missing features (as controlled by
41 | `row_max_missing`).
42 | For more information on the methodology, see [1].
43 |
44 | The following snippet demonstrates how to replace missing values,
45 | encoded as `np.nan`, using the mean feature value of the two nearest
46 | neighbors of the rows that contain the missing values:
47 |
48 | >>> import numpy as np
49 | >>> from missingpy import KNNImputer
50 | >>> nan = np.nan
51 | >>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
52 | >>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
53 | >>> imputer.fit_transform(X)
54 | array([[1. , 2. , 4. ],
55 | [3. , 4. , 3. ],
56 | [5.5, 6. , 5. ],
57 | [8. , 8. , 7. ]])
58 |
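To make the donor mechanics concrete, here is a minimal NumPy sketch of the
same computation (an illustration, not the library's implementation; the
helper below mimics the behavior of the "masked_euclidean" metric):

```python
import numpy as np

def masked_euclidean(a, b):
    # Distance over mutually observed coordinates, scaled up by the
    # fraction of coordinates that could actually be compared.
    mask = ~np.isnan(a) & ~np.isnan(b)
    diff = a[mask] - b[mask]
    return np.sqrt(len(a) / mask.sum() * np.dot(diff, diff))

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
X_imp = X.copy()
for i, j in zip(*np.where(np.isnan(X))):
    # Candidate donors: every other row that observes feature j.
    donors = [k for k in range(len(X)) if k != i and not np.isnan(X[k, j])]
    donors.sort(key=lambda k: masked_euclidean(X[i], X[k]))
    X_imp[i, j] = X[donors[:2], j].mean()  # unweighted, n_neighbors=2

print(X_imp)  # matches the fit_transform() output above
```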
59 | ### API
60 | KNNImputer(missing_values="NaN", n_neighbors=5, weights="uniform",
61 | metric="masked_euclidean", row_max_missing=0.5,
62 | col_max_missing=0.8, copy=True)
63 |
64 | Parameters
65 | ----------
66 | missing_values : integer or "NaN", optional (default = "NaN")
67 | The placeholder for the missing values. All occurrences of
68 | `missing_values` will be imputed. For missing values encoded as
69 | ``np.nan``, use the string value "NaN".
70 |
71 | n_neighbors : int, optional (default = 5)
72 | Number of neighboring samples to use for imputation.
73 |
74 | weights : str or callable, optional (default = "uniform")
75 | Weight function used in prediction. Possible values:
76 |
77 | - 'uniform' : uniform weights. All points in each neighborhood
78 | are weighted equally.
79 | - 'distance' : weight points by the inverse of their distance.
80 | In this case, closer neighbors of a query point will have a
81 | greater influence than neighbors which are further away.
82 | - [callable] : a user-defined function which accepts an
83 | array of distances, and returns an array of the same shape
84 | containing the weights.
85 |
86 | metric : str or callable, optional (default = "masked_euclidean")
87 | Distance metric for searching neighbors. Possible values:
88 | - 'masked_euclidean'
89 | - [callable] : a user-defined function which conforms to the
90 | definition of _pairwise_callable(X, Y, metric, **kwds). In other
91 | words, the function accepts two arrays, X and Y, and a
92 | ``missing_values`` keyword in **kwds and returns a scalar distance
93 | value.
94 |
95 | row_max_missing : float, optional (default = 0.5)
96 | The maximum fraction of columns (i.e. features) that can be missing
97 | before the sample is excluded from nearest neighbor imputation. It
98 | means that such rows will not be considered potential donors in
99 | ``fit()``, and in ``transform()`` their missing feature values will be
100 | imputed to be the column mean for the entire dataset.
101 |
102 | col_max_missing : float, optional (default = 0.8)
103 | The maximum fraction of rows (i.e. samples) that can be missing
104 | for any feature; beyond this threshold an error is raised.
105 |
106 | copy : boolean, optional (default = True)
107 | If True, a copy of X will be created. If False, imputation will
108 | be done in-place whenever possible. Note that, if metric is
109 | "masked_euclidean" and copy=False then missing_values in the
110 | input matrix X will be overwritten with zeros.
111 |
112 | Attributes
113 | ----------
114 | statistics_ : 1-D array of length {n_features}
115 | The 1-D array contains the mean of each feature calculated using
116 | observed (i.e. non-missing) values. This is used for imputing
117 | missing values in samples that are either excluded from nearest
118 | neighbors search because they have too many ( > row_max_missing)
119 | missing features or because all of the sample's k-nearest neighbors
120 | (i.e., the potential donors) also have the relevant feature value
121 | missing.
122 |
123 | Methods
124 | -------
125 | fit(X, y=None):
126 | Fit the imputer on X.
127 |
128 | Parameters
129 | ----------
130 | X : {array-like}, shape (n_samples, n_features)
131 | Input data, where ``n_samples`` is the number of samples and
132 | ``n_features`` is the number of features.
133 |
134 | Returns
135 | -------
136 | self : object
137 | Returns self.
138 |
139 |
140 | transform(X):
141 | Impute all missing values in X.
142 |
143 | Parameters
144 | ----------
145 | X : {array-like}, shape = [n_samples, n_features]
146 | The input data to complete.
147 |
148 | Returns
149 | -------
150 | X : {array-like}, shape = [n_samples, n_features]
151 | The imputed dataset.
152 |
153 |
154 | fit_transform(X, y=None, **fit_params):
155 | Fit KNNImputer and impute all missing values in X.
156 |
157 | Parameters
158 | ----------
159 | X : {array-like}, shape (n_samples, n_features)
160 | Input data, where ``n_samples`` is the number of samples and
161 | ``n_features`` is the number of features.
162 |
163 | Returns
164 | -------
165 | X : {array-like}, shape (n_samples, n_features)
166 | Returns imputed dataset.
167 |
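The imputer can also be fit on training data and reused on new samples, so
that imputation at transform time draws on the training donors and column
means. A brief sketch (the arrays here are illustrative):

```python
import numpy as np
from missingpy import KNNImputer

X_train = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
X_new = np.array([[np.nan, 4, 6]])

imputer = KNNImputer(n_neighbors=2, weights="distance")
imputer.fit(X_train)                   # learns donors and column means
X_new_imputed = imputer.transform(X_new)
```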
168 | ### References
169 | 1. Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor
170 | Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value
171 | estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001
172 | Pages 520-525.
173 |
174 | ## 2. Random Forest Imputation (MissForest)
175 |
176 | ### Example
177 | ```python
178 | # Let X be an array containing missing values
179 | from missingpy import MissForest
180 | imputer = MissForest()
181 | X_imputed = imputer.fit_transform(X)
182 | ```
183 |
184 | ### Description
185 | MissForest imputes missing values using Random Forests in an iterative
186 | fashion [1]. By default, the imputer begins with the column (i.e., the
187 | variable) that has the smallest number of missing values -- let's call
188 | this the candidate column.
189 | The first step involves filling any missing values of the remaining,
190 | non-candidate, columns with an initial guess, which is the column mean for
191 | columns representing numerical variables and the column mode for columns
192 | representing categorical variables. Note that the categorical variables
193 | need to be explicitly identified during the imputer's `fit()` method call
194 | (see API for more information). After that, the imputer fits a random
195 | forest model with the candidate column as the outcome variable and the
196 | remaining columns as the predictors over all rows where the candidate
197 | column values are not missing.
198 | After the fit, the missing entries of the candidate column are
199 | imputed using predictions from the fitted Random Forest; the
200 | corresponding rows of the non-candidate columns act as the input data
201 | for the fitted model.
202 | Following this, the imputer moves on to the next candidate column with the
203 | second smallest number of missing values from among the non-candidate
204 | columns in the first round. The process repeats itself for each column
205 | with a missing value, possibly over multiple iterations or epochs for
206 | each column, until the stopping criterion is met.
207 | The stopping criterion is governed by the "difference" between the imputed
208 | arrays over successive iterations. For numerical variables (`num_vars_`),
209 | the difference is defined as follows:
210 |
211 | sum((X_new[:, num_vars_] - X_old[:, num_vars_]) ** 2) /
212 | sum((X_new[:, num_vars_]) ** 2)
213 |
214 | For categorical variables (`cat_vars_`), the difference is defined as follows:
215 |
216 | sum(X_new[:, cat_vars_] != X_old[:, cat_vars_]) / n_cat_missing
217 |
218 | where `X_new` is the newly imputed array, `X_old` is the array imputed in the
219 | previous round, `n_cat_missing` is the total number of categorical
220 | values that are missing, and the `sum()` is performed both across rows
221 | and columns. Following [1], the stopping criterion is considered to have
222 | been met when the difference between `X_new` and `X_old` increases for the
223 | first time for both types of variables (if available).
224 |
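A short sketch of the two convergence measures (the function names are
illustrative, not part of the library's API):

```python
import numpy as np

def num_diff(X_new, X_old, num_vars_):
    # Relative squared change over the numerical columns.
    return (np.sum((X_new[:, num_vars_] - X_old[:, num_vars_]) ** 2)
            / np.sum(X_new[:, num_vars_] ** 2))

def cat_diff(X_new, X_old, cat_vars_, n_cat_missing):
    # Share of originally missing categorical cells whose label changed.
    return np.sum(X_new[:, cat_vars_] != X_old[:, cat_vars_]) / n_cat_missing
```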
225 | **Note: The categorical variables need to be one-hot-encoded (also known as
226 | dummy encoded) and they need to be explicitly identified during the
227 | imputer's fit() method call. See the API section for more information.**
228 |
229 | >>> from missingpy import MissForest
230 | >>> nan = float("NaN")
231 | >>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
232 | >>> imputer = MissForest(random_state=1337)
233 | >>> imputer.fit_transform(X)
234 | Iteration: 0
235 | Iteration: 1
236 | Iteration: 2
237 | array([[1. , 2. , 3.92 ],
238 | [3. , 4. , 3. ],
239 | [2.71, 6. , 5. ],
240 | [8. , 8. , 7. ]])
241 |
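For mixed-type data, the categorical columns are identified via `cat_vars`
at fit time. A hedged sketch, assuming `fit_transform` forwards `cat_vars`
to `fit()` through `**fit_params` (toy data; column 2 holds encoded
categories):

```python
import numpy as np
from missingpy import MissForest

X = np.array([[1.0, 2.0, 0],
              [3.0, np.nan, 1],
              [np.nan, 6.0, 1],
              [8.0, 8.0, np.nan]])

imputer = MissForest(random_state=1337)
X_imputed = imputer.fit_transform(X, cat_vars=[2])  # column 2 is categorical
```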
242 | ### API
243 | MissForest(max_iter=10, decreasing=False, missing_values=np.nan,
244 | copy=True, n_estimators=100, criterion=('mse', 'gini'),
245 | max_depth=None, min_samples_split=2, min_samples_leaf=1,
246 | min_weight_fraction_leaf=0.0, max_features='auto',
247 | max_leaf_nodes=None, min_impurity_decrease=0.0,
248 | bootstrap=True, oob_score=False, n_jobs=-1, random_state=None,
249 | verbose=0, warm_start=False, class_weight=None)
250 |
251 | Parameters
252 | ----------
253 | NOTE: Most parameter definitions below are taken verbatim from the
254 | Scikit-Learn documentation at [2] and [3].
255 |
256 | max_iter : int, optional (default = 10)
257 | The maximum iterations of the imputation process. Each column with a
258 | missing value is imputed exactly once in a given iteration.
259 |
260 | decreasing : boolean, optional (default = False)
261 | If set to True, columns are sorted according to decreasing number of
262 | missing values. In other words, imputation will move from imputing
263 | columns with the largest number of missing values to columns with the
264 | fewest missing values.
265 |
266 | missing_values : np.nan, integer, optional (default = np.nan)
267 | The placeholder for the missing values. All occurrences of
268 | `missing_values` will be imputed.
269 |
270 | copy : boolean, optional (default = True)
271 | If True, a copy of X will be created. If False, imputation will
272 | be done in-place whenever possible.
273 |
274 | criterion : tuple, optional (default = ('mse', 'gini'))
275 | The function to measure the quality of a split. The first element of
276 | the tuple is for the Random Forest Regressor (for imputing numerical
277 | variables) while the second element is for the Random Forest
278 | Classifier (for imputing categorical variables).
279 |
280 | n_estimators : integer, optional (default=100)
281 | The number of trees in the forest.
282 |
283 | max_depth : integer or None, optional (default=None)
284 | The maximum depth of the tree. If None, then nodes are expanded until
285 | all leaves are pure or until all leaves contain less than
286 | min_samples_split samples.
287 |
288 | min_samples_split : int, float, optional (default=2)
289 | The minimum number of samples required to split an internal node:
290 | - If int, then consider `min_samples_split` as the minimum number.
291 | - If float, then `min_samples_split` is a fraction and
292 | `ceil(min_samples_split * n_samples)` are the minimum
293 | number of samples for each split.
294 |
295 | min_samples_leaf : int, float, optional (default=1)
296 | The minimum number of samples required to be at a leaf node.
297 | A split point at any depth will only be considered if it leaves at
298 | least ``min_samples_leaf`` training samples in each of the left and
299 | right branches. This may have the effect of smoothing the model,
300 | especially in regression.
301 | - If int, then consider `min_samples_leaf` as the minimum number.
302 | - If float, then `min_samples_leaf` is a fraction and
303 | `ceil(min_samples_leaf * n_samples)` are the minimum
304 | number of samples for each node.
305 |
306 | min_weight_fraction_leaf : float, optional (default=0.)
307 | The minimum weighted fraction of the sum total of weights (of all
308 | the input samples) required to be at a leaf node. Samples have
309 | equal weight when sample_weight is not provided.
310 |
311 | max_features : int, float, string or None, optional (default="auto")
312 | The number of features to consider when looking for the best split:
313 | - If int, then consider `max_features` features at each split.
314 | - If float, then `max_features` is a fraction and
315 | `int(max_features * n_features)` features are considered at each
316 | split.
317 | - If "auto", then `max_features=sqrt(n_features)`.
318 | - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
319 | - If "log2", then `max_features=log2(n_features)`.
320 | - If None, then `max_features=n_features`.
321 | Note: the search for a split does not stop until at least one
322 | valid partition of the node samples is found, even if it requires to
323 | effectively inspect more than ``max_features`` features.
324 |
325 | max_leaf_nodes : int or None, optional (default=None)
326 | Grow trees with ``max_leaf_nodes`` in best-first fashion.
327 | Best nodes are defined as relative reduction in impurity.
328 | If None then unlimited number of leaf nodes.
329 |
330 | min_impurity_decrease : float, optional (default=0.)
331 | A node will be split if this split induces a decrease of the impurity
332 | greater than or equal to this value.
333 | The weighted impurity decrease equation is the following::
334 | N_t / N * (impurity - N_t_R / N_t * right_impurity
335 | - N_t_L / N_t * left_impurity)
336 | where ``N`` is the total number of samples, ``N_t`` is the number of
337 | samples at the current node, ``N_t_L`` is the number of samples in the
338 | left child, and ``N_t_R`` is the number of samples in the right child.
339 | ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
340 | if ``sample_weight`` is passed.
341 |
342 | bootstrap : boolean, optional (default=True)
343 | Whether bootstrap samples are used when building trees.
344 |
345 | oob_score : bool (default=False)
346 | Whether to use out-of-bag samples to estimate
347 | the generalization accuracy.
348 |
349 | n_jobs : int or None, optional (default=-1)
350 | The number of jobs to run in parallel for both `fit` and `predict`.
351 | ``None`` means 1 unless in a ``joblib.parallel_backend`` context.
352 | ``-1`` means using all processors.
353 |
354 | random_state : int, RandomState instance or None, optional (default=None)
355 | If int, random_state is the seed used by the random number generator;
356 | If RandomState instance, random_state is the random number generator;
357 | If None, the random number generator is the RandomState instance used
358 | by `np.random`.
359 |
360 | verbose : int, optional (default=0)
361 | Controls the verbosity when fitting and predicting.
362 |
363 | warm_start : bool, optional (default=False)
364 | When set to ``True``, reuse the solution of the previous call to fit
365 | and add more estimators to the ensemble, otherwise, just fit a whole
366 | new forest. See the scikit-learn Glossary entry for ``warm_start``.
367 |
368 | class_weight : dict, list of dicts, "balanced", "balanced_subsample" or \
369 | None, optional (default=None)
370 | Weights associated with classes in the form ``{class_label: weight}``.
371 | If not given, all classes are supposed to have weight one. For
372 | multi-output problems, a list of dicts can be provided in the same
373 | order as the columns of y.
374 | Note that for multioutput (including multilabel) weights should be
375 | defined for each class of every column in its own dict. For example,
376 | for four-class multilabel classification weights should be
377 | [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
378 | [{1:1}, {2:5}, {3:1}, {4:1}].
379 | The "balanced" mode uses the values of y to automatically adjust
380 | weights inversely proportional to class frequencies in the input data
381 | as ``n_samples / (n_classes * np.bincount(y))``
382 | The "balanced_subsample" mode is the same as "balanced" except that
383 | weights are computed based on the bootstrap sample for every tree
384 | grown.
385 | For multi-output, the weights of each column of y will be multiplied.
386 | Note that these weights will be multiplied with sample_weight (passed
387 | through the fit method) if sample_weight is specified.
388 | NOTE: This parameter is only applicable for Random Forest Classifier
389 | objects (i.e., for categorical variables).
390 |
391 | Attributes
392 | ----------
393 | statistics_ : Dictionary of length two
394 | The first element is an array with the mean of each numerical feature
395 | being imputed while the second element is an array of modes of
396 | categorical features being imputed (if available, otherwise it
397 | will be None).
398 |
399 | Methods
400 | -------
401 | fit(X, y=None, cat_vars=None):
402 | Fit the imputer on X.
403 |
404 | Parameters
405 | ----------
406 | X : {array-like}, shape (n_samples, n_features)
407 | Input data, where ``n_samples`` is the number of samples and
408 | ``n_features`` is the number of features.
409 |
410 | cat_vars : int or array of ints, optional (default = None)
411 | An int or an array containing column indices of categorical
412 | variable(s)/feature(s) present in the dataset X.
413 | ``None`` if there are no categorical variables in the dataset.
414 |
415 | Returns
416 | -------
417 | self : object
418 | Returns self.
419 |
420 |
421 | transform(X):
422 | Impute all missing values in X.
423 |
424 | Parameters
425 | ----------
426 | X : {array-like}, shape = [n_samples, n_features]
427 | The input data to complete.
428 |
429 | Returns
430 | -------
431 | X : {array-like}, shape = [n_samples, n_features]
432 | The imputed dataset.
433 |
434 |
435 | fit_transform(X, y=None, **fit_params):
436 | Fit MissForest and impute all missing values in X.
437 |
438 | Parameters
439 | ----------
440 | X : {array-like}, shape (n_samples, n_features)
441 | Input data, where ``n_samples`` is the number of samples and
442 | ``n_features`` is the number of features.
443 |
444 | Returns
445 | -------
446 | X : {array-like}, shape (n_samples, n_features)
447 | Returns imputed dataset.
448 |
449 | ### References
450 |
451 | * [1] Stekhoven, Daniel J., and Peter Bühlmann. "MissForest—non-parametric
452 | missing value imputation for mixed-type data." Bioinformatics 28.1
453 | (2011): 112-118.
454 | * [2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
455 | * [3] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
--------------------------------------------------------------------------------
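A minimal usage sketch of the MissForest API documented in the README above.
The toy matrix and the choice of column 2 as the categorical variable are
illustrative assumptions, not values prescribed by the library:

    import numpy as np
    from missingpy import MissForest

    # Toy data: columns 0 and 1 are numerical, column 2 holds 0/1 category
    # codes. One value is missing in a numerical column and one in the
    # categorical column.
    X = np.array([[1.0,    2.0, 0.0],
                  [3.0,    4.0, 1.0],
                  [np.nan, 6.0, 1.0],
                  [8.0,    8.0, np.nan]])

    # cat_vars marks column 2 as categorical, so its missing entries are
    # predicted with the RandomForestClassifier; the numerical columns use
    # the RandomForestRegressor.
    imputer = MissForest(random_state=0)
    X_imputed = imputer.fit_transform(X, cat_vars=[2])

Passing ``cat_vars`` through ``fit_transform`` works because
``fit_transform`` forwards its keyword arguments on to ``fit``.
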
/missingpy/__init__.py:
--------------------------------------------------------------------------------
1 | from .knnimpute import KNNImputer
2 | from .missforest import MissForest
3 |
4 | __all__ = ['KNNImputer', 'MissForest']
5 |
--------------------------------------------------------------------------------
/missingpy/knnimpute.py:
--------------------------------------------------------------------------------
1 | """KNN Imputer for Missing Data"""
2 | # Author: Ashim Bhattarai
3 | # License: GNU General Public License v3 (GPLv3)
4 |
5 | import warnings
6 |
7 | import numpy as np
8 |
9 | from sklearn.base import BaseEstimator, TransformerMixin
10 | from sklearn.utils import check_array
11 | from sklearn.utils.validation import check_is_fitted
12 | from sklearn.utils.validation import FLOAT_DTYPES
13 | from sklearn.neighbors.base import _check_weights
14 | from sklearn.neighbors.base import _get_weights
15 |
16 | from .pairwise_external import pairwise_distances
17 | from .pairwise_external import _get_mask
18 | from .pairwise_external import _MASKED_METRICS
19 |
20 | __all__ = [
21 | 'KNNImputer',
22 | ]
23 |
24 |
25 | class KNNImputer(BaseEstimator, TransformerMixin):
26 | """Imputation for completing missing values using k-Nearest Neighbors.
27 |
28 | Each sample's missing values are imputed using values from ``n_neighbors``
29 | nearest neighbors found in the training set. Each missing feature is then
30 | imputed as the average, either weighted or unweighted, of these neighbors.
31 | Note that if a sample has more than one feature missing, then the
32 | neighbors for that sample can be different depending on the particular
33 | feature being imputed. Finally, where the number of donor neighbors is
34 | less than ``n_neighbors``, the training set average for that feature is
35 | used during imputation.
36 |
37 | Parameters
38 | ----------
39 | missing_values : integer or "NaN", optional (default = "NaN")
40 | The placeholder for the missing values. All occurrences of
41 | `missing_values` will be imputed. For missing values encoded as
42 | ``np.nan``, use the string value "NaN".
43 |
44 | n_neighbors : int, optional (default = 5)
45 | Number of neighboring samples to use for imputation.
46 |
47 | weights : str or callable, optional (default = "uniform")
48 | Weight function used in prediction. Possible values:
49 |
50 | - 'uniform' : uniform weights. All points in each neighborhood
51 | are weighted equally.
52 | - 'distance' : weight points by the inverse of their distance.
53 | In this case, closer neighbors of a query point will have a
54 | greater influence than neighbors which are further away.
55 | - [callable] : a user-defined function which accepts an
56 | array of distances, and returns an array of the same shape
57 | containing the weights.
58 |
59 | metric : str or callable, optional (default = "masked_euclidean")
60 | Distance metric for searching neighbors. Possible values:
61 | - 'masked_euclidean'
62 | - [callable] : a user-defined function which conforms to the
63 | definition of _pairwise_callable(X, Y, metric, **kwds). In other
64 | words, the function accepts two arrays, X and Y, and a
65 | ``missing_values`` keyword in **kwds and returns a scalar distance
66 | value.
67 |
68 | row_max_missing : float, optional (default = 0.5)
69 | The maximum fraction of columns (i.e. features) that can be missing
70 | before the sample is excluded from nearest neighbor imputation. It
71 | means that such rows will not be considered a potential donor in
72 | ``fit()``, and in ``transform()`` their missing feature values will be
73 | imputed to be the column mean for the entire dataset.
74 |
75 | col_max_missing : float, optional (default = 0.8)
76 | The maximum fraction of rows (i.e. samples) that can be missing for
77 | any feature; beyond this threshold an error is raised.
78 |
79 | copy : boolean, optional (default = True)
80 | If True, a copy of X will be created. If False, imputation will
81 | be done in-place whenever possible. Note that, if metric is
82 | "masked_euclidean" and copy=False then missing_values in the
83 | input matrix X will be overwritten with zeros.
84 |
85 | Attributes
86 | ----------
87 | statistics_ : 1-D array of length {n_features}
88 | The 1-D array contains the mean of each feature calculated using
89 | observed (i.e. non-missing) values. This is used for imputing
90 | missing values in samples that are either excluded from nearest
91 | neighbors search because they have too many (> row_max_missing)
92 | missing features or because all of the sample's k-nearest neighbors
93 | (i.e., the potential donors) also have the relevant feature value
94 | missing.
95 |
96 | References
97 | ----------
98 | * Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor
99 | Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing
100 | value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17
101 | no. 6, 2001 Pages 520-525.
102 |
103 | Examples
104 | --------
105 | >>> from missingpy import KNNImputer
106 | >>> nan = float("NaN")
107 | >>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
108 | >>> imputer = KNNImputer(n_neighbors=2, weights="uniform")
109 | >>> imputer.fit_transform(X)
110 | array([[1. , 2. , 4. ],
111 | [3. , 4. , 3. ],
112 | [5.5, 6. , 5. ],
113 | [8. , 8. , 7. ]])
114 | """
115 |
116 | def __init__(self, missing_values="NaN", n_neighbors=5,
117 | weights="uniform", metric="masked_euclidean",
118 | row_max_missing=0.5, col_max_missing=0.8, copy=True):
119 |
120 | self.missing_values = missing_values
121 | self.n_neighbors = n_neighbors
122 | self.weights = weights
123 | self.metric = metric
124 | self.row_max_missing = row_max_missing
125 | self.col_max_missing = col_max_missing
126 | self.copy = copy
127 |
128 | def _impute(self, dist, X, fitted_X, mask, mask_fx):
129 | """Helper function to find and impute missing values"""
130 |
131 | # For each column, find and impute
132 | n_rows_X, n_cols_X = X.shape
133 | for c in range(n_cols_X):
134 | if not np.any(mask[:, c], axis=0):
135 | continue
136 |
137 | # Row index for receivers and potential donors (pdonors)
138 | receivers_row_idx = np.where(mask[:, c])[0]
139 | pdonors_row_idx = np.where(~mask_fx[:, c])[0]
140 |
141 | # Impute using column mean if n_neighbors are not available
142 | if len(pdonors_row_idx) < self.n_neighbors:
143 | warnings.warn("Insufficient number of neighbors! "
144 | "Filling in column mean.")
145 | X[receivers_row_idx, c] = self.statistics_[c]
146 | continue
147 |
148 | # Get distance from potential donors
149 | dist_pdonors = dist[receivers_row_idx][:, pdonors_row_idx]
150 | dist_pdonors = dist_pdonors.reshape(-1,
151 | len(pdonors_row_idx))
152 |
153 | # Argpartition to separate actual donors from the rest
154 | pdonors_idx = np.argpartition(
155 | dist_pdonors, self.n_neighbors - 1, axis=1)
156 |
157 | # Get final donors row index from pdonors
158 | donors_idx = pdonors_idx[:, :self.n_neighbors]
159 |
160 | # Get weights or None
161 | dist_pdonors_rows = np.arange(len(donors_idx))[:, None]
162 | weight_matrix = _get_weights(
163 | dist_pdonors[
164 | dist_pdonors_rows, donors_idx], self.weights)
165 | donor_row_idx_ravel = donors_idx.ravel()
166 |
167 | # Retrieve donor values and calculate kNN score
168 | fitted_X_temp = fitted_X[pdonors_row_idx]
169 | donors = fitted_X_temp[donor_row_idx_ravel, c].reshape(
170 | (-1, self.n_neighbors))
171 | donors_mask = _get_mask(donors, self.missing_values)
172 | donors = np.ma.array(donors, mask=donors_mask)
173 |
174 | # Final imputation
175 | imputed = np.ma.average(donors, axis=1,
176 | weights=weight_matrix)
177 | X[receivers_row_idx, c] = imputed.data
178 | return X
179 |
180 | def fit(self, X, y=None):
181 | """Fit the imputer on X.
182 |
183 | Parameters
184 | ----------
185 | X : {array-like}, shape (n_samples, n_features)
186 | Input data, where ``n_samples`` is the number of samples and
187 | ``n_features`` is the number of features.
188 |
189 | Returns
190 | -------
191 | self : object
192 | Returns self.
193 | """
194 |
195 | # Check data integrity and calling arguments
196 | force_all_finite = False if self.missing_values in ["NaN",
197 | np.nan] else True
198 | if not force_all_finite:
199 | if self.metric not in _MASKED_METRICS and not callable(
200 | self.metric):
201 | raise ValueError(
202 | "The selected metric does not support NaN values.")
203 | X = check_array(X, accept_sparse=False, dtype=np.float64,
204 | force_all_finite=force_all_finite, copy=self.copy)
205 | self.weights = _check_weights(self.weights)
206 |
207 | # Check for +/- inf
208 | if np.any(np.isinf(X)):
209 | raise ValueError("+/- inf values are not allowed.")
210 |
211 | # Check if % missing in any column > col_max_missing
212 | mask = _get_mask(X, self.missing_values)
213 | if np.any(mask.sum(axis=0) > (X.shape[0] * self.col_max_missing)):
214 | raise ValueError("Some column(s) have more than {}% missing values"
215 | .format(self.col_max_missing * 100))
216 | X_col_means = np.ma.array(X, mask=mask).mean(axis=0).data
217 |
218 | # Check if % missing in any row > row_max_missing
219 | bad_rows = mask.sum(axis=1) > (mask.shape[1] * self.row_max_missing)
220 | if np.any(bad_rows):
221 | warnings.warn(
222 | "There are rows with more than {0}% missing values. These "
223 | "rows are not included as donor neighbors."
224 | .format(self.row_max_missing * 100))
225 |
226 | # Remove rows that have more than row_max_missing % missing
227 | X = X[~bad_rows, :]
228 |
229 | # Check if sufficient neighboring samples available
230 | if X.shape[0] < self.n_neighbors:
231 | raise ValueError("There are only %d samples, but n_neighbors=%d."
232 | % (X.shape[0], self.n_neighbors))
233 | self.fitted_X_ = X
234 | self.statistics_ = X_col_means
235 |
236 | return self
237 |
238 | def transform(self, X):
239 | """Impute all missing values in X.
240 |
241 | Parameters
242 | ----------
243 | X : {array-like}, shape = [n_samples, n_features]
244 | The input data to complete.
245 |
246 | Returns
247 | -------
248 | X : {array-like}, shape = [n_samples, n_features]
249 | The imputed dataset.
250 | """
251 |
252 | check_is_fitted(self, ["fitted_X_", "statistics_"])
253 | force_all_finite = False if self.missing_values in ["NaN",
254 | np.nan] else True
255 | X = check_array(X, accept_sparse=False, dtype=FLOAT_DTYPES,
256 | force_all_finite=force_all_finite, copy=self.copy)
257 |
258 | # Check for +/- inf
259 | if np.any(np.isinf(X)):
260 | raise ValueError("+/- inf values are not allowed in data to be "
261 | "transformed.")
262 |
263 | # Get fitted data and ensure correct dimension
264 | n_rows_fit_X, n_cols_fit_X = self.fitted_X_.shape
265 | n_rows_X, n_cols_X = X.shape
266 |
267 | if n_cols_X != n_cols_fit_X:
268 | raise ValueError("Incompatible dimension between the fitted "
269 | "dataset and the one to be transformed.")
270 | mask = _get_mask(X, self.missing_values)
271 |
272 | row_total_missing = mask.sum(axis=1)
273 | if not np.any(row_total_missing):
274 | return X
275 |
276 | # Check for excessive missingness in rows
277 | bad_rows = row_total_missing > (mask.shape[1] * self.row_max_missing)
278 | if np.any(bad_rows):
279 | warnings.warn(
280 | "There are rows with more than {0}% missing values. The "
281 | "missing features in these rows are imputed with column means."
282 | .format(self.row_max_missing * 100))
283 | X_bad = X[bad_rows, :]
284 | X = X[~bad_rows, :]
285 | mask = mask[~bad_rows]
286 | row_total_missing = mask.sum(axis=1)
287 | row_has_missing = row_total_missing.astype(bool)
288 |
289 | if np.any(row_has_missing):
290 |
291 | # Mask for fitted_X
292 | mask_fx = _get_mask(self.fitted_X_, self.missing_values)
293 |
294 | # Pairwise distances between receivers and fitted samples
295 | dist = np.empty((len(X), len(self.fitted_X_)))
296 | dist[row_has_missing] = pairwise_distances(
297 | X[row_has_missing], self.fitted_X_, metric=self.metric,
298 | squared=False, missing_values=self.missing_values)
299 |
300 | # Find and impute missing
301 | X = self._impute(dist, X, self.fitted_X_, mask, mask_fx)
302 |
303 | # Merge bad rows to X and mean impute their missing values
304 | if np.any(bad_rows):
305 | bad_missing_index = np.where(_get_mask(X_bad, self.missing_values))
306 | X_bad[bad_missing_index] = np.take(self.statistics_,
307 | bad_missing_index[1])
308 | X_merged = np.empty((n_rows_X, n_cols_X))
309 | X_merged[bad_rows, :] = X_bad
310 | X_merged[~bad_rows, :] = X
311 | X = X_merged
312 | return X
313 |
314 | def fit_transform(self, X, y=None, **fit_params):
315 | """Fit KNNImputer and impute all missing values in X.
316 |
317 | Parameters
318 | ----------
319 | X : {array-like}, shape (n_samples, n_features)
320 | Input data, where ``n_samples`` is the number of samples and
321 | ``n_features`` is the number of features.
322 |
323 | Returns
324 | -------
325 | X : {array-like}, shape (n_samples, n_features)
326 | Returns imputed dataset.
327 | """
328 | return self.fit(X).transform(X)
329 |
--------------------------------------------------------------------------------
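As a complement to the doctest in the KNNImputer docstring above, a short
sketch of the fit/transform split together with the "distance" weighting
option; the numbers are arbitrary illustrative data:

    import numpy as np
    from missingpy import KNNImputer

    # Fit on complete training data; these rows become potential donors.
    X_train = np.array([[1.0, 2.0, 3.0],
                        [3.0, 4.0, 3.0],
                        [5.0, 6.0, 5.0],
                        [8.0, 8.0, 7.0]])

    # A new sample with one missing feature, imputed from X_train donors.
    X_new = np.array([[2.0, np.nan, 3.0]])

    # With weights="distance", closer donors contribute more to the
    # weighted average used for the imputed value.
    imputer = KNNImputer(n_neighbors=2, weights="distance")
    X_filled = imputer.fit(X_train).transform(X_new)
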
/missingpy/missforest.py:
--------------------------------------------------------------------------------
1 | """MissForest Imputer for Missing Data"""
2 | # Author: Ashim Bhattarai
3 | # License: GNU General Public License v3 (GPLv3)
4 |
5 | import warnings
6 |
7 | import numpy as np
8 | from scipy.stats import mode
9 |
10 | from sklearn.base import BaseEstimator, TransformerMixin
11 | from sklearn.utils.validation import check_is_fitted, check_array
12 | from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
13 |
14 | from .pairwise_external import _get_mask
15 |
16 | __all__ = [
17 | 'MissForest',
18 | ]
19 |
20 |
21 | class MissForest(BaseEstimator, TransformerMixin):
22 | """Missing value imputation using Random Forests.
23 |
24 | MissForest imputes missing values using Random Forests in an iterative
25 | fashion. By default, the imputer begins with the column that has the
26 | smallest number of missing values -- let's call this the candidate
27 | column.
28 | The first step involves filling any missing values of the remaining,
29 | non-candidate, columns with an initial guess, which is the column mean for
30 | columns representing numerical variables and the column mode for columns
31 | representing categorical variables. After that, the imputer fits a random
32 | forest model with the candidate column as the outcome variable and the
33 | remaining columns as the predictors over all rows where the candidate
34 | column values are not missing.
35 | After the fit, the missing rows of the candidate column are
36 | imputed using the prediction from the fitted Random Forest. The
37 | rows of the non-candidate columns act as the input data for the fitted
38 | model.
39 | Following this, the imputer moves on to the next candidate column with the
40 | second smallest number of missing values from among the non-candidate
41 | columns in the first round. The process repeats itself for each column
42 | with a missing value, possibly over multiple iterations or epochs for
43 | each column, until the stopping criterion is met.
44 | The stopping criterion is governed by the "difference" between the imputed
45 | arrays over successive iterations. For numerical variables (num_vars_),
46 | the difference is defined as follows:
47 |
48 | sum((X_new[:, num_vars_] - X_old[:, num_vars_]) ** 2) /
49 | sum((X_new[:, num_vars_]) ** 2)
50 |
51 | For categorical variables(cat_vars_), the difference is defined as follows:
52 |
53 | sum(X_new[:, cat_vars_] != X_old[:, cat_vars_]) / n_cat_missing
54 |
55 | where X_new is the newly imputed array, X_old is the array imputed in the
56 | previous round, n_cat_missing is the total number of categorical
57 | values that are missing, and the sum() is performed both across rows
58 | and columns. Following [1], the stopping criterion is considered to have
59 | been met when the difference between X_new and X_old increases for the
60 | first time for both types of variables (if available).
61 |
62 | Parameters
63 | ----------
64 | NOTE: Most parameter definitions below are taken verbatim from the
65 | Scikit-Learn documentation at [2] and [3].
66 |
67 | max_iter : int, optional (default = 10)
68 | The maximum number of iterations of the imputation process. Each column
69 | with a missing value is imputed exactly once in a given iteration.
70 |
71 | decreasing : boolean, optional (default = False)
72 | If set to True, columns are sorted according to decreasing number of
73 | missing values. In other words, imputation will move from imputing
74 | columns with the largest number of missing values to columns with the
75 | fewest missing values.
76 |
77 | missing_values : np.nan, integer, optional (default = np.nan)
78 | The placeholder for the missing values. All occurrences of
79 | `missing_values` will be imputed.
80 |
81 | copy : boolean, optional (default = True)
82 | If True, a copy of X will be created. If False, imputation will
83 | be done in-place whenever possible.
84 |
85 | criterion : tuple, optional (default = ('mse', 'gini'))
86 | The function to measure the quality of a split. The first element of
87 | the tuple is for the Random Forest Regressor (for imputing numerical
88 | variables) while the second element is for the Random Forest
89 | Classifier (for imputing categorical variables).
90 |
91 | n_estimators : integer, optional (default=100)
92 | The number of trees in the forest.
93 |
94 | max_depth : integer or None, optional (default=None)
95 | The maximum depth of the tree. If None, then nodes are expanded until
96 | all leaves are pure or until all leaves contain less than
97 | min_samples_split samples.
98 |
99 | min_samples_split : int, float, optional (default=2)
100 | The minimum number of samples required to split an internal node:
101 | - If int, then consider `min_samples_split` as the minimum number.
102 | - If float, then `min_samples_split` is a fraction and
103 | `ceil(min_samples_split * n_samples)` are the minimum
104 | number of samples for each split.
105 |
106 | min_samples_leaf : int, float, optional (default=1)
107 | The minimum number of samples required to be at a leaf node.
108 | A split point at any depth will only be considered if it leaves at
109 | least ``min_samples_leaf`` training samples in each of the left and
110 | right branches. This may have the effect of smoothing the model,
111 | especially in regression.
112 | - If int, then consider `min_samples_leaf` as the minimum number.
113 | - If float, then `min_samples_leaf` is a fraction and
114 | `ceil(min_samples_leaf * n_samples)` are the minimum
115 | number of samples for each node.
116 |
117 | min_weight_fraction_leaf : float, optional (default=0.)
118 | The minimum weighted fraction of the sum total of weights (of all
119 | the input samples) required to be at a leaf node. Samples have
120 | equal weight when sample_weight is not provided.
121 |
122 | max_features : int, float, string or None, optional (default="auto")
123 | The number of features to consider when looking for the best split:
124 | - If int, then consider `max_features` features at each split.
125 | - If float, then `max_features` is a fraction and
126 | `int(max_features * n_features)` features are considered at each
127 | split.
128 | - If "auto", then `max_features=sqrt(n_features)`.
129 | - If "sqrt", then `max_features=sqrt(n_features)` (same as "auto").
130 | - If "log2", then `max_features=log2(n_features)`.
131 | - If None, then `max_features=n_features`.
132 | Note: the search for a split does not stop until at least one
133 | valid partition of the node samples is found, even if it requires to
134 | effectively inspect more than ``max_features`` features.
135 |
136 | max_leaf_nodes : int or None, optional (default=None)
137 | Grow trees with ``max_leaf_nodes`` in best-first fashion.
138 | Best nodes are defined as relative reduction in impurity.
139 | If None then unlimited number of leaf nodes.
140 |
141 | min_impurity_decrease : float, optional (default=0.)
142 | A node will be split if this split induces a decrease of the impurity
143 | greater than or equal to this value.
144 | The weighted impurity decrease equation is the following::
145 | N_t / N * (impurity - N_t_R / N_t * right_impurity
146 | - N_t_L / N_t * left_impurity)
147 | where ``N`` is the total number of samples, ``N_t`` is the number of
148 | samples at the current node, ``N_t_L`` is the number of samples in the
149 | left child, and ``N_t_R`` is the number of samples in the right child.
150 | ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum,
151 | if ``sample_weight`` is passed.
152 |
153 | bootstrap : boolean, optional (default=True)
154 | Whether bootstrap samples are used when building trees.
155 |
156 | oob_score : bool (default=False)
157 | Whether to use out-of-bag samples to estimate
158 | the generalization accuracy.
159 |
160 | n_jobs : int or None, optional (default=-1)
161 | The number of jobs to run in parallel for both `fit` and `predict`.
162 | ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
163 | ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
164 | for more details.
165 |
166 | random_state : int, RandomState instance or None, optional (default=None)
167 | If int, random_state is the seed used by the random number generator;
168 | If RandomState instance, random_state is the random number generator;
169 | If None, the random number generator is the RandomState instance used
170 | by `np.random`.
171 |
172 | verbose : int, optional (default=0)
173 | Controls the verbosity when fitting and predicting.
174 |
175 | warm_start : bool, optional (default=False)
176 | When set to ``True``, reuse the solution of the previous call to fit
177 | and add more estimators to the ensemble, otherwise, just fit a whole
178 | new forest. See :term:`the Glossary <warm_start>`.
179 |
180 | class_weight : dict, list of dicts, "balanced", "balanced_subsample" or \
181 | None, optional (default=None)
182 | Weights associated with classes in the form ``{class_label: weight}``.
183 | If not given, all classes are supposed to have weight one. For
184 | multi-output problems, a list of dicts can be provided in the same
185 | order as the columns of y.
186 | Note that for multioutput (including multilabel) weights should be
187 | defined for each class of every column in its own dict. For example,
188 | for four-class multilabel classification weights should be
189 | [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of
190 | [{1:1}, {2:5}, {3:1}, {4:1}].
191 | The "balanced" mode uses the values of y to automatically adjust
192 | weights inversely proportional to class frequencies in the input data
193 | as ``n_samples / (n_classes * np.bincount(y))``
194 | The "balanced_subsample" mode is the same as "balanced" except that
195 | weights are computed based on the bootstrap sample for every tree
196 | grown.
197 | For multi-output, the weights of each column of y will be multiplied.
198 | Note that these weights will be multiplied with sample_weight (passed
199 | through the fit method) if sample_weight is specified.
200 | NOTE: This parameter is only applicable for Random Forest Classifier
201 | objects (i.e., for categorical variables).
202 |
203 | Attributes
204 | ----------
205 | statistics_ : Dictionary of length two
206 | The entry ``"col_means"`` is an array with the mean of each numerical
207 | feature being imputed, and the entry ``"col_modes"`` is an array of
208 | modes of the categorical features being imputed (if available,
209 | otherwise it will be None).
210 |
211 | References
212 | ----------
213 | * [1] Stekhoven, Daniel J., and Peter Bühlmann. "MissForest—non-parametric
214 | missing value imputation for mixed-type data." Bioinformatics 28.1
215 | (2011): 112-118.
216 | * [2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.
217 | RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
218 | * [3] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.
219 | RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
220 |
221 | Examples
222 | --------
223 | >>> from missingpy import MissForest
224 | >>> nan = float("NaN")
225 | >>> X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
226 | >>> imputer = MissForest(random_state=1337)
227 | >>> imputer.fit_transform(X)
228 | Iteration: 0
229 | Iteration: 1
230 | Iteration: 2
231 | array([[1. , 2. , 3.92 ],
232 | [3. , 4. , 3. ],
233 | [2.71, 6. , 5. ],
234 | [8. , 8. , 7. ]])
235 | """
236 |
237 | def __init__(self, max_iter=10, decreasing=False, missing_values=np.nan,
238 | copy=True, n_estimators=100, criterion=('mse', 'gini'),
239 | max_depth=None, min_samples_split=2, min_samples_leaf=1,
240 | min_weight_fraction_leaf=0.0, max_features='auto',
241 | max_leaf_nodes=None, min_impurity_decrease=0.0,
242 | bootstrap=True, oob_score=False, n_jobs=-1, random_state=None,
243 | verbose=0, warm_start=False, class_weight=None):
244 |
245 | self.max_iter = max_iter
246 | self.decreasing = decreasing
247 | self.missing_values = missing_values
248 | self.copy = copy
249 | self.n_estimators = n_estimators
250 | self.criterion = criterion
251 | self.max_depth = max_depth
252 | self.min_samples_split = min_samples_split
253 | self.min_samples_leaf = min_samples_leaf
254 | self.min_weight_fraction_leaf = min_weight_fraction_leaf
255 | self.max_features = max_features
256 | self.max_leaf_nodes = max_leaf_nodes
257 | self.min_impurity_decrease = min_impurity_decrease
258 | self.bootstrap = bootstrap
259 | self.oob_score = oob_score
260 | self.n_jobs = n_jobs
261 | self.random_state = random_state
262 | self.verbose = verbose
263 | self.warm_start = warm_start
264 | self.class_weight = class_weight
265 |
266 | def _miss_forest(self, Ximp, mask):
267 | """The missForest algorithm"""
268 |
269 | # Count missing per column
270 | col_missing_count = mask.sum(axis=0)
271 |
272 | # Get row and column indices for missing values
273 | missing_rows, missing_cols = np.where(mask)
274 |
275 | if self.num_vars_ is not None:
276 | # Only keep indices for numerical vars
277 | keep_idx_num = np.in1d(missing_cols, self.num_vars_)
278 | missing_num_rows = missing_rows[keep_idx_num]
279 | missing_num_cols = missing_cols[keep_idx_num]
280 |
281 | # Make initial guess for missing values
282 | col_means = np.full(Ximp.shape[1], fill_value=np.nan)
283 | col_means[self.num_vars_] = self.statistics_.get('col_means')
284 | Ximp[missing_num_rows, missing_num_cols] = np.take(
285 | col_means, missing_num_cols)
286 |
287 | # Reg criterion
288 | reg_criterion = self.criterion if type(self.criterion) == str \
289 | else self.criterion[0]
290 |
291 | # Instantiate regression model
292 | rf_regressor = RandomForestRegressor(
293 | n_estimators=self.n_estimators,
294 | criterion=reg_criterion,
295 | max_depth=self.max_depth,
296 | min_samples_split=self.min_samples_split,
297 | min_samples_leaf=self.min_samples_leaf,
298 | min_weight_fraction_leaf=self.min_weight_fraction_leaf,
299 | max_features=self.max_features,
300 | max_leaf_nodes=self.max_leaf_nodes,
301 | min_impurity_decrease=self.min_impurity_decrease,
302 | bootstrap=self.bootstrap,
303 | oob_score=self.oob_score,
304 | n_jobs=self.n_jobs,
305 | random_state=self.random_state,
306 | verbose=self.verbose,
307 | warm_start=self.warm_start)
308 |
309 | # If needed, repeat for categorical variables
310 | if self.cat_vars_ is not None:
311 | # Calculate total number of missing categorical values (used later)
312 | n_catmissing = np.sum(mask[:, self.cat_vars_])
313 |
314 | # Only keep indices for categorical vars
315 | keep_idx_cat = np.in1d(missing_cols, self.cat_vars_)
316 | missing_cat_rows = missing_rows[keep_idx_cat]
317 | missing_cat_cols = missing_cols[keep_idx_cat]
318 |
319 | # Make initial guess for missing values
320 | col_modes = np.full(Ximp.shape[1], fill_value=np.nan)
321 | col_modes[self.cat_vars_] = self.statistics_.get('col_modes')
322 | Ximp[missing_cat_rows, missing_cat_cols] = np.take(col_modes, missing_cat_cols)
323 |
324 | # Classification criterion
325 | clf_criterion = self.criterion if type(self.criterion) == str \
326 | else self.criterion[1]
327 |
328 | # Instantiate classification model
329 | rf_classifier = RandomForestClassifier(
330 | n_estimators=self.n_estimators,
331 | criterion=clf_criterion,
332 | max_depth=self.max_depth,
333 | min_samples_split=self.min_samples_split,
334 | min_samples_leaf=self.min_samples_leaf,
335 | min_weight_fraction_leaf=self.min_weight_fraction_leaf,
336 | max_features=self.max_features,
337 | max_leaf_nodes=self.max_leaf_nodes,
338 | min_impurity_decrease=self.min_impurity_decrease,
339 | bootstrap=self.bootstrap,
340 | oob_score=self.oob_score,
341 | n_jobs=self.n_jobs,
342 | random_state=self.random_state,
343 | verbose=self.verbose,
344 | warm_start=self.warm_start,
345 | class_weight=self.class_weight)
346 |
347 | # 2. misscount_idx: sorted indices of cols in X based on missing count
348 | misscount_idx = np.argsort(col_missing_count)
349 | # Reverse order if decreasing is set to True
350 | if self.decreasing is True:
351 | misscount_idx = misscount_idx[::-1]
352 |
353 | # 3. While new_gammas < old_gammas & self.iter_count_ < max_iter loop:
354 | self.iter_count_ = 0
355 | gamma_new = 0
356 | gamma_old = np.inf
357 | gamma_newcat = 0
358 | gamma_oldcat = np.inf
359 | col_index = np.arange(Ximp.shape[1])
360 |
361 | while (
362 | gamma_new < gamma_old or gamma_newcat < gamma_oldcat) and \
363 | self.iter_count_ < self.max_iter:
364 |
365 | # 4. store previously imputed matrix
366 | Ximp_old = np.copy(Ximp)
367 | if self.iter_count_ != 0:
368 | gamma_old = gamma_new
369 | gamma_oldcat = gamma_newcat
370 | # 5. loop
371 | for s in misscount_idx:
372 | # Column indices other than the one being imputed
373 | s_prime = np.delete(col_index, s)
374 |
375 | # Get indices of rows where 's' is observed and missing
376 | obs_rows = np.where(~mask[:, s])[0]
377 | mis_rows = np.where(mask[:, s])[0]
378 |
379 | # If no missing, then skip
380 | if len(mis_rows) == 0:
381 | continue
382 |
383 | # Get observed values of 's'
384 | yobs = Ximp[obs_rows, s]
385 |
386 | # Get 'X' for both observed and missing 's' column
387 | xobs = Ximp[np.ix_(obs_rows, s_prime)]
388 | xmis = Ximp[np.ix_(mis_rows, s_prime)]
389 |
390 | # 6. Fit a random forest over observed and predict the missing
391 | if self.cat_vars_ is not None and s in self.cat_vars_:
392 | rf_classifier.fit(X=xobs, y=yobs)
393 | # 7. predict ymis(s) using xmis(x)
394 | ymis = rf_classifier.predict(xmis)
395 | # 8. update imputed matrix using predicted matrix ymis(s)
396 | Ximp[mis_rows, s] = ymis
397 | else:
398 | rf_regressor.fit(X=xobs, y=yobs)
399 | # 7. predict ymis(s) using xmis(x)
400 | ymis = rf_regressor.predict(xmis)
401 | # 8. update imputed matrix using predicted matrix ymis(s)
402 | Ximp[mis_rows, s] = ymis
403 |
404 | # 9. Update gamma (stopping criterion)
405 | if self.cat_vars_ is not None:
406 | gamma_newcat = np.sum(
407 | (Ximp[:, self.cat_vars_] != Ximp_old[:, self.cat_vars_])) / n_catmissing
408 | if self.num_vars_ is not None:
409 | gamma_new = np.sum((Ximp[:, self.num_vars_] - Ximp_old[:, self.num_vars_]) ** 2) / np.sum((Ximp[:, self.num_vars_]) ** 2)
410 |
411 | print("Iteration:", self.iter_count_)
412 | self.iter_count_ += 1
413 |
414 | return Ximp_old
415 |
416 | def fit(self, X, y=None, cat_vars=None):
417 | """Fit the imputer on X.
418 |
419 | Parameters
420 | ----------
421 | X : {array-like}, shape (n_samples, n_features)
422 | Input data, where ``n_samples`` is the number of samples and
423 | ``n_features`` is the number of features.
424 |
425 | cat_vars : int or array of ints, optional (default = None)
426 | An int or an array containing column indices of categorical
427 | variable(s)/feature(s) present in the dataset X.
428 | ``None`` if there are no categorical variables in the dataset.
429 |
430 | Returns
431 | -------
432 | self : object
433 | Returns self.
434 | """
435 |
436 | # Check data integrity and calling arguments
437 | force_all_finite = False if self.missing_values in ["NaN",
438 | np.nan] else True
439 |
440 | X = check_array(X, accept_sparse=False, dtype=np.float64,
441 | force_all_finite=force_all_finite, copy=self.copy)
442 |
443 | # Check for +/- inf
444 | if np.any(np.isinf(X)):
445 | raise ValueError("+/- inf values are not supported.")
446 |
447 | # Check if any column has all missing
448 | mask = _get_mask(X, self.missing_values)
449 | if np.any(mask.sum(axis=0) >= (X.shape[0])):
450 | raise ValueError("One or more columns have all rows missing.")
451 |
452 | # Check cat_vars type and convert if necessary
453 | if cat_vars is not None:
454 | if type(cat_vars) == int:
455 | cat_vars = [cat_vars]
456 | elif type(cat_vars) == list or type(cat_vars) == np.ndarray:
457 | if np.array(cat_vars).dtype != int:
458 | raise ValueError(
459 | "cat_vars needs to be either an int or an array "
460 | "of ints.")
461 | else:
462 | raise ValueError("cat_vars needs to be either an int or an array "
463 | "of ints.")
464 |
465 | # Identify numerical variables
466 | num_vars = np.setdiff1d(np.arange(X.shape[1]), cat_vars)
467 | num_vars = num_vars if len(num_vars) > 0 else None
468 |
469 | # First replace missing values with NaN if it is something else
470 | if self.missing_values not in ['NaN', np.nan]:
471 | X[np.where(X == self.missing_values)] = np.nan
472 |
473 | # Now, make initial guess for missing values
474 | col_means = np.nanmean(X[:, num_vars], axis=0) if num_vars is not None else None
475 | col_modes = mode(
476 | X[:, cat_vars], axis=0, nan_policy='omit')[0] if cat_vars is not \
477 | None else None
478 |
479 | self.cat_vars_ = cat_vars
480 | self.num_vars_ = num_vars
481 | self.statistics_ = {"col_means": col_means, "col_modes": col_modes}
482 |
483 | return self
484 |
485 | def transform(self, X):
486 | """Impute all missing values in X.
487 |
488 | Parameters
489 | ----------
490 | X : {array-like}, shape = [n_samples, n_features]
491 | The input data to complete.
492 |
493 | Returns
494 | -------
495 | X : {array-like}, shape = [n_samples, n_features]
496 | The imputed dataset.
497 | """
498 | # Confirm whether fit() has been called
499 | check_is_fitted(self, ["cat_vars_", "num_vars_", "statistics_"])
500 |
501 | # Check data integrity
502 | force_all_finite = False if self.missing_values in ["NaN",
503 | np.nan] else True
504 | X = check_array(X, accept_sparse=False, dtype=np.float64,
505 | force_all_finite=force_all_finite, copy=self.copy)
506 |
507 | # Check for +/- inf
508 | if np.any(np.isinf(X)):
509 | raise ValueError("+/- inf values are not supported.")
510 |
511 | # Check if any column has all missing
512 | mask = _get_mask(X, self.missing_values)
513 | if np.any(mask.sum(axis=0) >= (X.shape[0])):
514 | raise ValueError("One or more columns have all rows missing.")
515 |
516 | # Get fitted X col count and ensure correct dimension
517 | n_cols_fit_X = (0 if self.num_vars_ is None else len(self.num_vars_)) \
518 | + (0 if self.cat_vars_ is None else len(self.cat_vars_))
519 | _, n_cols_X = X.shape
520 |
521 | if n_cols_X != n_cols_fit_X:
522 | raise ValueError("Incompatible dimension between the fitted "
523 | "dataset and the one to be transformed.")
524 |
525 | # Check if anything is actually missing and if not return original X
526 | mask = _get_mask(X, self.missing_values)
527 | if not mask.sum() > 0:
528 | warnings.warn("No missing value located; returning original "
529 | "dataset.")
530 | return X
531 |
532 | # row_total_missing = mask.sum(axis=1)
533 | # if not np.any(row_total_missing):
534 | # return X
535 |
536 | # Call missForest function to impute missing
537 | X = self._miss_forest(X, mask)
538 |
539 | # Return imputed dataset
540 | return X
541 |
542 | def fit_transform(self, X, y=None, **fit_params):
543 | """Fit MissForest and impute all missing values in X.
544 |
545 | Parameters
546 | ----------
547 | X : {array-like}, shape (n_samples, n_features)
548 | Input data, where ``n_samples`` is the number of samples and
549 | ``n_features`` is the number of features.
550 |
551 | Returns
552 | -------
553 | X : {array-like}, shape (n_samples, n_features)
554 | Returns imputed dataset.
555 | """
556 | return self.fit(X, **fit_params).transform(X)
557 |
--------------------------------------------------------------------------------
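The two "difference" quantities that drive the stopping criterion described
in the MissForest docstring above can be written as standalone helpers. This
is a sketch under the docstring's definitions; the names gamma_num and
gamma_cat are ours, mirroring gamma_new and gamma_newcat in _miss_forest:

    import numpy as np

    def gamma_num(X_new, X_old, num_vars):
        # Normalized squared change over the numerical columns:
        # sum((X_new - X_old) ** 2) / sum(X_new ** 2), summed over both
        # rows and columns.
        diff = X_new[:, num_vars] - X_old[:, num_vars]
        return np.sum(diff ** 2) / np.sum(X_new[:, num_vars] ** 2)

    def gamma_cat(X_new, X_old, cat_vars, n_cat_missing):
        # Fraction of categorical entries that changed between successive
        # iterations, relative to the count of originally missing
        # categorical values.
        changed = X_new[:, cat_vars] != X_old[:, cat_vars]
        return np.sum(changed) / n_cat_missing

Iteration stops the first time neither quantity decreases, and the
previously imputed matrix (Ximp_old in the code) is returned.
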
/missingpy/pairwise_external.py:
--------------------------------------------------------------------------------
1 | # This file is a modification of sklearn.metrics.pairwise
2 | # Modifications by Ashim Bhattarai
3 | """
4 | New BSD License
5 |
6 | Copyright (c) 2007–2018 The scikit-learn developers.
7 | All rights reserved.
8 |
9 |
10 | Redistribution and use in source and binary forms, with or without
11 | modification, are permitted provided that the following conditions are met:
12 |
13 | a. Redistributions of source code must retain the above copyright notice,
14 | this list of conditions and the following disclaimer.
15 | b. Redistributions in binary form must reproduce the above copyright
16 | notice, this list of conditions and the following disclaimer in the
17 | documentation and/or other materials provided with the distribution.
18 | c. Neither the name of the Scikit-learn Developers nor the names of
19 | its contributors may be used to endorse or promote products
20 | derived from this software without specific prior written
21 | permission.
22 |
23 |
24 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
25 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
26 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
27 | ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR
28 | ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
29 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
30 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
31 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
32 | LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
33 | OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
34 | DAMAGE.
35 | """
36 |
37 | from __future__ import division
38 | from functools import partial
39 | import itertools
40 |
41 | import numpy as np
42 | from scipy.spatial import distance
43 | from scipy.sparse import issparse
44 |
45 | from sklearn.metrics.pairwise import _VALID_METRICS, _return_float_dtype
46 | from sklearn.metrics.pairwise import PAIRWISE_BOOLEAN_FUNCTIONS
47 | from sklearn.metrics.pairwise import PAIRWISE_DISTANCE_FUNCTIONS
48 | from sklearn.metrics.pairwise import _parallel_pairwise
49 | from sklearn.utils import check_array
50 |
51 | from .utils import masked_euclidean_distances
52 |
53 | _MASKED_METRICS = ['masked_euclidean']
54 | _VALID_METRICS += ['masked_euclidean']
55 |
56 |
57 | def _get_mask(X, value_to_mask):
58 | """Compute the boolean mask X == missing_values."""
59 | if value_to_mask == "NaN" or np.isnan(value_to_mask):
60 | return np.isnan(X)
61 | else:
62 | return X == value_to_mask
63 |
64 |
65 | def check_pairwise_arrays(X, Y, precomputed=False, dtype=None,
66 | accept_sparse='csr', force_all_finite=True,
67 | copy=False):
68 | """ Set X and Y appropriately and checks inputs
69 |
70 | If Y is None, it is set as a pointer to X (i.e. not a copy).
71 | If Y is given, this does not happen.
72 | All distance metrics should use this function first to assert that the
73 | given parameters are correct and safe to use.
74 |
75 | Specifically, this function first ensures that both X and Y are arrays,
76 | then checks that they are at least two dimensional while ensuring that
77 | their elements are floats (or dtype if provided). Finally, the function
78 | checks that the size of the second dimension of the two arrays is equal, or
79 | the equivalent check for a precomputed distance matrix.
80 |
81 | Parameters
82 | ----------
83 | X : {array-like, sparse matrix}, shape (n_samples_a, n_features)
84 |
85 | Y : {array-like, sparse matrix}, shape (n_samples_b, n_features)
86 |
87 | precomputed : bool
88 | True if X is to be treated as precomputed distances to the samples in
89 | Y.
90 |
91 | dtype : string, type, list of types or None (default=None)
92 | Data type required for X and Y. If None, the dtype will be an
93 | appropriate float type selected by _return_float_dtype.
94 |
95 | .. versionadded:: 0.18
96 |
97 | accept_sparse : string, boolean or list/tuple of strings
98 | String[s] representing allowed sparse matrix formats, such as 'csc',
99 | 'csr', etc. If the input is sparse but not in the allowed format,
100 | it will be converted to the first listed format. True allows the input
101 | to be any format. False means that a sparse matrix input will
102 | raise an error.
103 |
104 | force_all_finite : bool
105 | Whether to raise an error on np.inf and np.nan in X (or Y if it exists)
106 |
107 | copy : bool
108 | Whether a forced copy will be triggered. If copy=False, a copy might
109 | be triggered by a conversion.
110 |
111 | Returns
112 | -------
113 | safe_X : {array-like, sparse matrix}, shape (n_samples_a, n_features)
114 | An array equal to X, guaranteed to be a numpy array.
115 |
116 | safe_Y : {array-like, sparse matrix}, shape (n_samples_b, n_features)
117 | An array equal to Y if Y was not None, guaranteed to be a numpy array.
118 | If Y was None, safe_Y will be a pointer to X.
119 |
120 | """
121 | X, Y, dtype_float = _return_float_dtype(X, Y)
122 |
123 | warn_on_dtype = dtype is not None
124 | estimator = 'check_pairwise_arrays'
125 | if dtype is None:
126 | dtype = dtype_float
127 |
128 | if Y is X or Y is None:
129 | X = Y = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
130 | copy=copy, force_all_finite=force_all_finite,
131 | warn_on_dtype=warn_on_dtype, estimator=estimator)
132 | else:
133 | X = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
134 | copy=copy, force_all_finite=force_all_finite,
135 | warn_on_dtype=warn_on_dtype, estimator=estimator)
136 | Y = check_array(Y, accept_sparse=accept_sparse, dtype=dtype,
137 | copy=copy, force_all_finite=force_all_finite,
138 | warn_on_dtype=warn_on_dtype, estimator=estimator)
139 |
140 | if precomputed:
141 | if X.shape[1] != Y.shape[0]:
142 | raise ValueError("Precomputed metric requires shape "
143 | "(n_queries, n_indexed). Got (%d, %d) "
144 | "for %d indexed." %
145 | (X.shape[0], X.shape[1], Y.shape[0]))
146 | elif X.shape[1] != Y.shape[1]:
147 | raise ValueError("Incompatible dimension for X and Y matrices: "
148 | "X.shape[1] == %d while Y.shape[1] == %d" % (
149 | X.shape[1], Y.shape[1]))
150 |
151 | return X, Y
152 |
153 |
154 | def _pairwise_callable(X, Y, metric, **kwds):
155 | """Handle the callable case for pairwise_{distances,kernels}
156 | """
157 | force_all_finite = False if callable(metric) else True
158 | X, Y = check_pairwise_arrays(X, Y, force_all_finite=force_all_finite)
159 |
160 | if X is Y:
161 | # Only calculate metric for upper triangle
162 | out = np.zeros((X.shape[0], Y.shape[0]), dtype='float')
163 | iterator = itertools.combinations(range(X.shape[0]), 2)
164 | for i, j in iterator:
165 | out[i, j] = metric(X[i], Y[j], **kwds)
166 |
167 | # Make symmetric
168 | # NB: out += out.T will produce incorrect results
169 | out = out + out.T
170 |
171 | # Calculate diagonal
172 | # NB: nonzero diagonals are allowed for both metrics and kernels
173 | for i in range(X.shape[0]):
174 | x = X[i]
175 | out[i, i] = metric(x, x, **kwds)
176 |
177 | else:
178 | # Calculate all cells
179 | out = np.empty((X.shape[0], Y.shape[0]), dtype='float')
180 | iterator = itertools.product(range(X.shape[0]), range(Y.shape[0]))
181 | for i, j in iterator:
182 | out[i, j] = metric(X[i], Y[j], **kwds)
183 |
184 | return out
185 |
186 |
187 | # Helper functions - distance
188 | PAIRWISE_DISTANCE_FUNCTIONS['masked_euclidean'] = masked_euclidean_distances
189 |
190 |
191 | def pairwise_distances(X, Y=None, metric="euclidean", n_jobs=1, **kwds):
192 | """ Compute the distance matrix from a vector array X and optional Y.
193 |
194 | This method takes either a vector array or a distance matrix, and returns
195 | a distance matrix. If the input is a vector array, the distances are
196 | computed. If the input is a distances matrix, it is returned instead.
197 |
198 | This method provides a safe way to take a distance matrix as input, while
199 | preserving compatibility with many other algorithms that take a vector
200 | array.
201 |
202 | If Y is given (default is None), then the returned matrix is the pairwise
203 | distance between the arrays from both X and Y.
204 |
205 | Valid values for metric are:
206 |
207 | - From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2',
208 | 'manhattan']. These metrics support sparse matrix
209 | inputs.
210 | Also, ['masked_euclidean'] but it does not yet support sparse matrices.
211 |
212 | - From scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev',
213 | 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis',
214 | 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean',
215 | 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
216 | See the documentation for scipy.spatial.distance for details on these
217 | metrics. These metrics do not support sparse matrix inputs.
218 |
219 | Note that in the case of 'cityblock', 'cosine' and 'euclidean' (which are
220 | valid scipy.spatial.distance metrics), the scikit-learn implementation
221 | will be used, which is faster and has support for sparse matrices (except
222 | for 'cityblock'). For a verbose description of the metrics from
223 | scikit-learn, see the __doc__ of the sklearn.pairwise.distance_metrics
224 | function.
225 |
226 | Read more in the :ref:`User Guide <metrics>`.
227 |
228 | Parameters
229 | ----------
230 | X : array [n_samples_a, n_samples_a] if metric == "precomputed", or, \
231 | [n_samples_a, n_features] otherwise
232 | Array of pairwise distances between samples, or a feature array.
233 |
234 | Y : array [n_samples_b, n_features], optional
235 | An optional second feature array. Only allowed if
236 | metric != "precomputed".
237 |
238 | metric : string, or callable
239 | The metric to use when calculating distance between instances in a
240 | feature array. If metric is a string, it must be one of the options
241 | allowed by scipy.spatial.distance.pdist for its metric parameter, or
242 | a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS.
243 | If metric is "precomputed", X is assumed to be a distance matrix.
244 | Alternatively, if metric is a callable function, it is called on each
245 | pair of instances (rows) and the resulting value recorded. The callable
246 | should take two arrays from X as input and return a value indicating
247 | the distance between them.
248 |
249 | n_jobs : int
250 | The number of jobs to use for the computation. This works by breaking
251 | down the pairwise matrix into n_jobs even slices and computing them in
252 | parallel.
253 |
254 | If -1 all CPUs are used. If 1 is given, no parallel computing code is
255 | used at all, which is useful for debugging. For n_jobs below -1,
256 | (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one
257 | are used.
258 |
259 | **kwds : optional keyword parameters
260 | Any further parameters are passed directly to the distance function.
261 | If using a scipy.spatial.distance metric, the parameters are still
262 | metric dependent. See the scipy docs for usage examples.
263 |
264 | Returns
265 | -------
266 | D : array [n_samples_a, n_samples_a] or [n_samples_a, n_samples_b]
267 | A distance matrix D such that D_{i, j} is the distance between the
268 | ith and jth vectors of the given matrix X, if Y is None.
269 | If Y is not None, then D_{i, j} is the distance between the ith array
270 | from X and the jth array from Y.
271 |
272 | See also
273 | --------
274 | pairwise_distances_chunked : performs the same calculation as this function,
275 | but returns a generator of chunks of the distance matrix, in order to
276 | limit memory usage.
277 | paired_distances : Computes the distances between corresponding
278 | elements of two arrays
279 | """
280 | if (metric not in _VALID_METRICS and
281 | not callable(metric) and metric != "precomputed"):
282 | raise ValueError("Unknown metric %s. "
283 | "Valid metrics are %s, or 'precomputed', or a "
284 | "callable" % (metric, _VALID_METRICS))
285 |
286 | if metric in _MASKED_METRICS or callable(metric):
287 | missing_values = kwds.get("missing_values") if kwds.get(
288 | "missing_values") is not None else np.nan
289 |
290 | if np.all(_get_mask(X.data if issparse(X) else X, missing_values)):
291 | raise ValueError(
292 | "One or more samples(s) only have missing values.")
293 |
294 | if metric == "precomputed":
295 | X, _ = check_pairwise_arrays(X, Y, precomputed=True)
296 | return X
297 | elif metric in PAIRWISE_DISTANCE_FUNCTIONS:
298 | func = PAIRWISE_DISTANCE_FUNCTIONS[metric]
299 | elif callable(metric):
300 | func = partial(_pairwise_callable, metric=metric, **kwds)
301 | else:
302 | if issparse(X) or issparse(Y):
303 | raise TypeError("scipy distance metrics do not"
304 | " support sparse matrices.")
305 |
306 | dtype = bool if metric in PAIRWISE_BOOLEAN_FUNCTIONS else None
307 |
308 | X, Y = check_pairwise_arrays(X, Y, dtype=dtype)
309 |
310 | if n_jobs == 1 and X is Y:
311 | return distance.squareform(distance.pdist(X, metric=metric,
312 | **kwds))
313 | func = partial(distance.cdist, metric=metric, **kwds)
314 |
315 | return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
316 |
--------------------------------------------------------------------------------
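The KNNImputer docstring notes that ``metric`` may be a callable that
conforms to _pairwise_callable and accepts a ``missing_values`` keyword. A
sketch of such a callable follows; nan_manhattan is a hypothetical helper
written for illustration, not part of the library:

    import numpy as np
    from missingpy.pairwise_external import pairwise_distances

    def nan_manhattan(x, y, missing_values=np.nan, **kwds):
        # Toy masked L1 distance: compare only the coordinates observed in
        # both rows. x and y are single samples handed over one pair at a
        # time by _pairwise_callable.
        observed = ~(np.isnan(x) | np.isnan(y))
        return np.sum(np.abs(x[observed] - y[observed]))

    X = np.array([[1.0, np.nan, 3.0],
                  [2.0, 2.0,    np.nan],
                  [4.0, 1.0,    0.0]])

    # Each pair of rows is scored by the callable; the missing_values
    # keyword is forwarded through **kwds as the docstring describes.
    D = pairwise_distances(X, metric=nan_manhattan, missing_values=np.nan)
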
/missingpy/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/epsilon-machine/missingpy/49fb1f61647e5399d1164a63b44e2fbfbc4ed8ad/missingpy/tests/__init__.py
--------------------------------------------------------------------------------
/missingpy/tests/test_knnimpute.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 |
3 | from sklearn.utils.testing import assert_array_equal
4 | from sklearn.utils.testing import assert_array_almost_equal
5 | from sklearn.utils.testing import assert_raise_message
6 | from sklearn.utils.testing import assert_equal
7 |
8 | from missingpy import KNNImputer
9 | from missingpy.pairwise_external import masked_euclidean_distances
10 | from missingpy.pairwise_external import pairwise_distances
11 |
12 |
13 | def test_knn_imputation_shape():
14 | # Verify the shapes of the imputed matrix for different weights and
15 | # number of neighbors.
16 | n_rows = 10
17 | n_cols = 2
18 | X = np.random.rand(n_rows, n_cols)
19 | X[0, 0] = np.nan
20 |
21 | for weights in ['uniform', 'distance']:
22 | for n_neighbors in range(1, 6):
23 | imputer = KNNImputer(n_neighbors=n_neighbors, weights=weights)
24 | X_imputed = imputer.fit_transform(X)
25 | assert_equal(X_imputed.shape, (n_rows, n_cols))
26 |
27 |
28 | def test_knn_imputation_zero():
29 | # Test imputation when missing_values == 0
30 | missing_values = 0
31 | n_neighbors = 2
32 | imputer = KNNImputer(missing_values=missing_values,
33 | n_neighbors=n_neighbors,
34 | weights="uniform")
35 |
36 | # Test with missing_values=0 when NaN present
37 | X = np.array([
38 | [np.nan, 0, 0, 0, 5],
39 | [np.nan, 1, 0, np.nan, 3],
40 | [np.nan, 2, 0, 0, 0],
41 | [np.nan, 6, 0, 5, 13],
42 | ])
43 | msg = "Input contains NaN, infinity or a value too large for %r." % X.dtype
44 | assert_raise_message(ValueError, msg, imputer.fit, X)
45 |
46 | # Test with % zeros in column > col_max_missing
47 | X = np.array([
48 | [1, 0, 0, 0, 5],
49 | [2, 1, 0, 2, 3],
50 | [3, 2, 0, 0, 0],
51 | [4, 6, 0, 5, 13],
52 | ])
53 | msg = "Some column(s) have more than {}% missing values".format(
54 | imputer.col_max_missing * 100)
55 | assert_raise_message(ValueError, msg, imputer.fit, X)
56 |
57 |
58 | def test_knn_imputation_zero_p2():
59 | # Test with an imputable matrix and also compare with missing_values="NaN"
60 | X_zero = np.array([
61 | [1, 0, 1, 1, 1.],
62 | [2, 2, 2, 2, 2],
63 | [3, 3, 3, 3, 0],
64 | [6, 6, 0, 6, 6],
65 | ])
66 |
67 | X_nan = np.array([
68 | [1, np.nan, 1, 1, 1.],
69 | [2, 2, 2, 2, 2],
70 | [3, 3, 3, 3, np.nan],
71 | [6, 6, np.nan, 6, 6],
72 | ])
73 | statistics_mean = np.nanmean(X_nan, axis=0)
74 |
75 | X_imputed = np.array([
76 | [1, 2.5, 1, 1, 1.],
77 | [2, 2, 2, 2, 2],
78 | [3, 3, 3, 3, 1.5],
79 | [6, 6, 2.5, 6, 6],
80 | ])
81 |
82 | imputer_zero = KNNImputer(missing_values=0, n_neighbors=2,
83 | weights="uniform")
84 |
85 | imputer_nan = KNNImputer(missing_values="NaN",
86 | n_neighbors=2,
87 | weights="uniform")
88 |
89 | assert_array_equal(imputer_zero.fit_transform(X_zero), X_imputed)
90 | assert_array_equal(imputer_zero.statistics_, statistics_mean)
91 | assert_array_equal(imputer_zero.fit_transform(X_zero),
92 | imputer_nan.fit_transform(X_nan))
93 |
94 |
95 | def test_knn_imputation_default():
96 | # Test imputation with default parameter values
97 |
98 | # Test with an imputable matrix
99 | X = np.array([
100 | [1, 0, 0, 1],
101 | [2, 1, 2, np.nan],
102 | [3, 2, 3, np.nan],
103 | [np.nan, 4, 5, 5],
104 | [6, np.nan, 6, 7],
105 | [8, 8, 8, 8],
106 | [16, 15, 18, 19],
107 | ])
108 | statistics_mean = np.nanmean(X, axis=0)
109 |
110 | X_imputed = np.array([
111 | [1, 0, 0, 1],
112 | [2, 1, 2, 8],
113 | [3, 2, 3, 8],
114 | [4, 4, 5, 5],
115 | [6, 3, 6, 7],
116 | [8, 8, 8, 8],
117 | [16, 15, 18, 19],
118 | ])
119 |
120 | imputer = KNNImputer()
121 | assert_array_equal(imputer.fit_transform(X), X_imputed)
122 | assert_array_equal(imputer.statistics_, statistics_mean)
123 |
124 | # Test with % missing in row > row_max_missing
125 | X = np.array([
126 | [1, 0, 0, 1],
127 | [2, 1, 2, np.nan],
128 | [3, 2, 3, np.nan],
129 | [np.nan, 4, 5, 5],
130 | [6, np.nan, 6, 7],
131 | [8, 8, 8, 8],
132 | [19, 19, 19, 19],
133 | [np.nan, np.nan, np.nan, 19],
134 | ])
135 | statistics_mean = np.nanmean(X, axis=0)
136 | r7c0, r7c1, r7c2, _ = statistics_mean
137 |
138 | X_imputed = np.array([
139 | [1, 0, 0, 1],
140 | [2, 1, 2, 8],
141 | [3, 2, 3, 8],
142 | [4, 4, 5, 5],
143 | [6, 3, 6, 7],
144 | [8, 8, 8, 8],
145 | [19, 19, 19, 19],
146 | [r7c0, r7c1, r7c2, 19],
147 | ])
148 |
149 | imputer = KNNImputer()
150 | assert_array_almost_equal(imputer.fit_transform(X), X_imputed, decimal=6)
151 | assert_array_almost_equal(imputer.statistics_, statistics_mean, decimal=6)
152 |
153 | # Test with all neighboring donors also having missing feature values
154 | X = np.array([
155 | [1, 0, 0, np.nan],
156 | [2, 1, 2, np.nan],
157 | [3, 2, 3, np.nan],
158 | [4, 4, 5, np.nan],
159 | [6, 7, 6, np.nan],
160 | [8, 8, 8, np.nan],
161 | [20, 20, 20, 20],
162 | [22, 22, 22, 22]
163 | ])
164 | statistics_mean = np.nanmean(X, axis=0)
165 |
166 | X_imputed = np.array([
167 | [1, 0, 0, 21],
168 | [2, 1, 2, 21],
169 | [3, 2, 3, 21],
170 | [4, 4, 5, 21],
171 | [6, 7, 6, 21],
172 | [8, 8, 8, 21],
173 | [20, 20, 20, 20],
174 | [22, 22, 22, 22]
175 | ])
176 |
177 | imputer = KNNImputer()
178 | assert_array_equal(imputer.fit_transform(X), X_imputed)
179 | assert_array_equal(imputer.statistics_, statistics_mean)
180 |
181 | # Test when data in fit() and transform() are different
182 | X = np.array([
183 | [0, 0],
184 | [np.nan, 2],
185 | [4, 3],
186 | [5, 6],
187 | [7, 7],
188 | [9, 8],
189 | [11, 16]
190 | ])
191 | statistics_mean = np.nanmean(X, axis=0)
192 |
193 | Y = np.array([
194 | [1, 0],
195 | [3, 2],
196 | [4, np.nan]
197 | ])
198 |
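   | # Y[2, 1] is imputed from the training data in X: its five nearest
   | # donors there have column-1 values 3, 6, 7, 0 and 8, whose mean is
   | # 24 / 5 = 4.8.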
199 | Y_imputed = np.array([
200 | [1, 0],
201 | [3, 2],
202 | [4, 4.8]
203 | ])
204 |
205 | imputer = KNNImputer()
206 | assert_array_equal(imputer.fit(X).transform(Y), Y_imputed)
207 | assert_array_equal(imputer.statistics_, statistics_mean)
208 |
209 |
210 | def test_default_with_invalid_input():
211 | # Test imputation with default values and invalid input
212 |
213 | # Test with % missing in a column > col_max_missing
214 | X = np.array([
215 | [np.nan, 0, 0, 0, 5],
216 | [np.nan, 1, 0, np.nan, 3],
217 | [np.nan, 2, 0, 0, 0],
218 | [np.nan, 6, 0, 5, 13],
219 | [np.nan, 7, 0, 7, 8],
220 | [np.nan, 8, 0, 8, 9],
221 | ])
222 | imputer = KNNImputer()
223 | msg = "Some column(s) have more than {}% missing values".format(
224 | imputer.col_max_missing * 100)
225 | assert_raise_message(ValueError, msg, imputer.fit, X)
226 |
227 |     # Test with an insufficient number of neighbors
228 | X = np.array([
229 | [1, 1, 1, 2, np.nan],
230 | [2, 1, 2, 2, 3],
231 | [3, 2, 3, 3, 8],
232 | [6, 6, 2, 5, 13],
233 | ])
234 | msg = "There are only %d samples, but n_neighbors=%d." % \
235 | (X.shape[0], imputer.n_neighbors)
236 | assert_raise_message(ValueError, msg, imputer.fit, X)
237 |
238 | # Test with inf present
239 | X = np.array([
240 | [np.inf, 1, 1, 2, np.nan],
241 | [2, 1, 2, 2, 3],
242 | [3, 2, 3, 3, 8],
243 | [np.nan, 6, 0, 5, 13],
244 | [np.nan, 7, 0, 7, 8],
245 | [6, 6, 2, 5, 7],
246 | ])
247 | msg = "+/- inf values are not allowed."
248 | assert_raise_message(ValueError, msg, KNNImputer().fit, X)
249 |
250 | # Test with inf present in matrix passed in transform()
251 | X = np.array([
252 | [np.inf, 1, 1, 2, np.nan],
253 | [2, 1, 2, 2, 3],
254 | [3, 2, 3, 3, 8],
255 | [np.nan, 6, 0, 5, 13],
256 | [np.nan, 7, 0, 7, 8],
257 | [6, 6, 2, 5, 7],
258 | ])
259 |
260 | X_fit = np.array([
261 | [0, 1, 1, 2, np.nan],
262 | [2, 1, 2, 2, 3],
263 | [3, 2, 3, 3, 8],
264 | [np.nan, 6, 0, 5, 13],
265 | [np.nan, 7, 0, 7, 8],
266 | [6, 6, 2, 5, 7],
267 | ])
268 | msg = "+/- inf values are not allowed in data to be transformed."
269 | assert_raise_message(ValueError, msg, KNNImputer().fit(X_fit).transform, X)
270 |
271 |
272 | def test_knn_n_neighbors():
273 |
274 | X = np.array([
275 | [0, 0],
276 | [np.nan, 2],
277 | [4, 3],
278 | [5, np.nan],
279 | [7, 7],
280 | [np.nan, 8],
281 | [14, 13]
282 | ])
283 | statistics_mean = np.nanmean(X, axis=0)
284 |
285 | # Test with 1 neighbor
286 | X_imputed_1NN = np.array([
287 | [0, 0],
288 | [4, 2],
289 | [4, 3],
290 | [5, 3],
291 | [7, 7],
292 | [7, 8],
293 | [14, 13]
294 | ])
295 |
296 | n_neighbors = 1
297 | imputer = KNNImputer(n_neighbors=n_neighbors)
298 |
299 | assert_array_equal(imputer.fit_transform(X), X_imputed_1NN)
300 | assert_array_equal(imputer.statistics_, statistics_mean)
301 |
302 | # Test with 6 neighbors
303 | X = np.array([
304 | [0, 0],
305 | [np.nan, 2],
306 | [4, 3],
307 | [5, np.nan],
308 | [7, 7],
309 | [np.nan, 8],
310 | [14, 13]
311 | ])
312 |
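   | # With 6 neighbors every other row is a donor: X[1, 0] becomes
   | # (0 + 4 + 5 + 7 + 14) / 5 = 6 (only five donors have column 0
   | # present) and X[3, 1] becomes (0 + 2 + 3 + 7 + 8 + 13) / 6 = 5.5.
   | # With just seven samples, n_neighbors=7 leaves the same six donors,
   | # which is why the n_neighbors + 1 fit below must agree.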
313 | X_imputed_6NN = np.array([
314 | [0, 0],
315 | [6, 2],
316 | [4, 3],
317 | [5, 5.5],
318 | [7, 7],
319 | [6, 8],
320 | [14, 13]
321 | ])
322 |
323 | n_neighbors = 6
324 |     imputer = KNNImputer(n_neighbors=n_neighbors)
325 | imputer_plus1 = KNNImputer(n_neighbors=n_neighbors + 1)
326 |
327 | assert_array_equal(imputer.fit_transform(X), X_imputed_6NN)
328 | assert_array_equal(imputer.statistics_, statistics_mean)
329 | assert_array_equal(imputer.fit_transform(X), imputer_plus1.fit(
330 | X).transform(X))
331 |
332 |
333 | def test_weight_uniform():
334 | X = np.array([
335 | [0, 0],
336 | [np.nan, 2],
337 | [4, 3],
338 | [5, 6],
339 | [7, 7],
340 | [9, 8],
341 | [11, 10]
342 | ])
343 |
344 | # Test with "uniform" weight (or unweighted)
345 | X_imputed_uniform = np.array([
346 | [0, 0],
347 | [5, 2],
348 | [4, 3],
349 | [5, 6],
350 | [7, 7],
351 | [9, 8],
352 | [11, 10]
353 | ])
354 |
355 | imputer = KNNImputer(weights="uniform")
356 | assert_array_equal(imputer.fit_transform(X), X_imputed_uniform)
357 |
358 | # Test with "callable" weight
359 | def no_weight(dist=None):
360 | return None
361 |
362 | imputer = KNNImputer(weights=no_weight)
363 | assert_array_equal(imputer.fit_transform(X), X_imputed_uniform)
364 |
365 |
366 | def test_weight_distance():
367 | X = np.array([
368 | [0, 0],
369 | [np.nan, 2],
370 | [4, 3],
371 | [5, 6],
372 | [7, 7],
373 | [9, 8],
374 | [11, 10]
375 | ])
376 |
377 | # Test with "distance" weight
378 |
379 |     # Get distances to the "n_neighbors" nearest neighbors of row 1
380 | dist_matrix = pairwise_distances(X, metric="masked_euclidean")
381 |
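   | # In the argsort of row 1, position 0 is row 1 itself (distance 0),
   | # so the 1:6 slice picks its five nearest donors, matching the
   | # imputer's default neighbor count.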
382 | index = np.argsort(dist_matrix)[1, 1:6]
383 | dist = dist_matrix[1, index]
384 | weights = 1 / dist
385 | values = X[index, 0]
386 | imputed = np.dot(values, weights) / np.sum(weights)
387 |
388 | # Manual calculation
389 | X_imputed_distance1 = np.array([
390 | [0, 0],
391 | [3.850394, 2],
392 | [4, 3],
393 | [5, 6],
394 | [7, 7],
395 | [9, 8],
396 | [11, 10]
397 | ])
398 |
399 | # NearestNeighbor calculation
400 | X_imputed_distance2 = np.array([
401 | [0, 0],
402 | [imputed, 2],
403 | [4, 3],
404 | [5, 6],
405 | [7, 7],
406 | [9, 8],
407 | [11, 10]
408 | ])
409 |
410 | imputer = KNNImputer(weights="distance")
411 | assert_array_almost_equal(imputer.fit_transform(X), X_imputed_distance1,
412 | decimal=6)
413 | assert_array_almost_equal(imputer.fit_transform(X), X_imputed_distance2,
414 | decimal=6)
415 |
416 | # Test with weights = "distance" and n_neighbors=2
417 | X = np.array([
418 | [np.nan, 0, 0],
419 | [2, 1, 2],
420 | [3, 2, 3],
421 | [4, 5, 5],
422 | ])
423 | statistics_mean = np.nanmean(X, axis=0)
424 |
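   | # Row 0's two nearest donors are rows 1 and 2, at masked-euclidean
   | # distances sqrt(7.5) and sqrt(19.5); the 1/distance-weighted mean of
   | # their column-0 values (2 and 3) is approximately 2.3828.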
425 | X_imputed = np.array([
426 | [2.3828, 0, 0],
427 | [2, 1, 2],
428 | [3, 2, 3],
429 | [4, 5, 5],
430 | ])
431 |
432 | imputer = KNNImputer(n_neighbors=2, weights="distance")
433 | assert_array_almost_equal(imputer.fit_transform(X), X_imputed,
434 | decimal=4)
435 | assert_array_equal(imputer.statistics_, statistics_mean)
436 |
437 | # Test with varying missingness patterns
438 | X = np.array([
439 | [1, 0, 0, 1],
440 | [0, np.nan, 1, np.nan],
441 | [1, 1, 1, np.nan],
442 | [0, 1, 0, 0],
443 | [0, 0, 0, 0],
444 | [1, 0, 1, 1],
445 | [10, 10, 10, 10],
446 | ])
447 | statistics_mean = np.nanmean(X, axis=0)
448 |
449 | # Get weights of donor neighbors
450 | dist = masked_euclidean_distances(X)
451 | r1c1_nbor_dists = dist[1, [0, 2, 3, 4, 5]]
452 | r1c3_nbor_dists = dist[1, [0, 3, 4, 5, 6]]
453 | r1c1_nbor_wt = (1/r1c1_nbor_dists)
454 | r1c3_nbor_wt = (1 / r1c3_nbor_dists)
455 |
456 | r2c3_nbor_dists = dist[2, [0, 3, 4, 5, 6]]
457 | r2c3_nbor_wt = 1/r2c3_nbor_dists
458 |
459 | # Collect donor values
460 | col1_donor_values = np.ma.masked_invalid(X[[0, 2, 3, 4, 5], 1]).copy()
461 | col3_donor_values = np.ma.masked_invalid(X[[0, 3, 4, 5, 6], 3]).copy()
462 |
463 | # Final imputed values
464 | r1c1_imp = np.ma.average(col1_donor_values, weights=r1c1_nbor_wt)
465 | r1c3_imp = np.ma.average(col3_donor_values, weights=r1c3_nbor_wt)
466 | r2c3_imp = np.ma.average(col3_donor_values, weights=r2c3_nbor_wt)
467 |
468 |
469 | X_imputed = np.array([
470 | [1, 0, 0, 1],
471 | [0, r1c1_imp, 1, r1c3_imp],
472 | [1, 1, 1, r2c3_imp],
473 | [0, 1, 0, 0],
474 | [0, 0, 0, 0],
475 | [1, 0, 1, 1],
476 | [10, 10, 10, 10],
477 | ])
478 |
479 | imputer = KNNImputer(weights="distance")
480 | assert_array_almost_equal(imputer.fit_transform(X), X_imputed, decimal=6)
481 | assert_array_equal(imputer.statistics_, statistics_mean)
482 |
483 |
484 | def test_metric_type():
485 | X = np.array([
486 | [0, 0],
487 | [np.nan, 2],
488 | [4, 3],
489 | [5, 6],
490 | [7, 7],
491 | [9, 8],
492 | [11, 10]
493 | ])
494 |
495 | # Test with a metric type without NaN support
496 | imputer = KNNImputer(metric="euclidean")
497 | bad_metric_msg = "The selected metric does not support NaN values."
498 | assert_raise_message(ValueError, bad_metric_msg, imputer.fit, X)
499 |
500 |
501 | def test_callable_metric():
502 |
503 |     # Define a callable metric that returns the l1 norm over non-missing coordinates:
504 | def custom_callable(x, y, missing_values="NaN", squared=False):
505 | x = np.ma.array(x, mask=np.isnan(x))
506 | y = np.ma.array(y, mask=np.isnan(y))
507 | dist = np.nansum(np.abs(x-y))
508 | return dist
509 |
510 | X = np.array([
511 | [4, 3, 3, np.nan],
512 | [6, 9, 6, 9],
513 | [4, 8, 6, 9],
514 | [np.nan, 9, 11, 10.]
515 | ])
516 |
517 | X_imputed = np.array([
518 | [4, 3, 3, 9],
519 | [6, 9, 6, 9],
520 | [4, 8, 6, 9],
521 | [5, 9, 11, 10.]
522 | ])
523 |
524 | imputer = KNNImputer(n_neighbors=2, metric=custom_callable)
525 | assert_array_equal(imputer.fit_transform(X), X_imputed)
526 |
527 |
528 | def test_complete_features():
529 |
530 | # Test with use_complete=True
531 | X = np.array([
532 | [0, np.nan, 0, np.nan],
533 | [1, 1, 1, np.nan],
534 | [2, 2, np.nan, 2],
535 | [3, 3, 3, 3],
536 | [4, 4, 4, 4],
537 | [5, 5, 5, 5],
538 | [6, 6, 6, 6],
539 | [np.nan, 7, 7, 7]
540 | ])
541 |
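   | # Each missing entry should equal the unweighted mean of its five
   | # nearest donors' values in that column; the slices below pick those
   | # donor rows by hand.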
542 | r0c1 = np.mean(X[1:6, 1])
543 | r0c3 = np.mean(X[2:-1, -1])
544 | r1c3 = np.mean(X[2:-1, -1])
545 | r2c2 = np.nanmean(X[:6, 2])
546 | r7c0 = np.mean(X[2:-1, 0])
547 |
548 | X_imputed = np.array([
549 | [0, r0c1, 0, r0c3],
550 | [1, 1, 1, r1c3],
551 | [2, 2, r2c2, 2],
552 | [3, 3, 3, 3],
553 | [4, 4, 4, 4],
554 | [5, 5, 5, 5],
555 | [6, 6, 6, 6],
556 | [r7c0, 7, 7, 7]
557 | ])
558 |
559 | imputer_comp = KNNImputer()
560 | assert_array_almost_equal(imputer_comp.fit_transform(X), X_imputed)
561 |
562 |
563 | def test_complete_features_weighted():
564 |
565 | # Test with use_complete=True
566 | X = np.array([
567 | [0, 0, 0, np.nan],
568 | [1, 1, 1, np.nan],
569 | [2, 2, np.nan, 2],
570 | [3, 3, 3, 3],
571 | [4, 4, 4, 4],
572 | [5, 5, 5, 5],
573 | [6, 6, 6, 6],
574 | [np.nan, 7, 7, 7]
575 | ])
576 |
577 | dist = pairwise_distances(X,
578 | metric="masked_euclidean",
579 | squared=False)
580 |
581 | # Calculate weights
582 | r0c3_w = 1.0 / dist[0, 2:-1]
583 | r1c3_w = 1.0 / dist[1, 2:-1]
584 | r2c2_w = 1.0 / dist[2, (0, 1, 3, 4, 5)]
585 | r7c0_w = 1.0 / dist[7, 2:7]
586 |
587 | # Calculate weighted averages
588 | r0c3 = np.average(X[2:-1, -1], weights=r0c3_w)
589 | r1c3 = np.average(X[2:-1, -1], weights=r1c3_w)
590 | r2c2 = np.average(X[(0, 1, 3, 4, 5), 2], weights=r2c2_w)
591 | r7c0 = np.average(X[2:7, 0], weights=r7c0_w)
592 |
593 | X_imputed = np.array([
594 | [0, 0, 0, r0c3],
595 | [1, 1, 1, r1c3],
596 | [2, 2, r2c2, 2],
597 | [3, 3, 3, 3],
598 | [4, 4, 4, 4],
599 | [5, 5, 5, 5],
600 | [6, 6, 6, 6],
601 | [r7c0, 7, 7, 7]
602 | ])
603 |
604 | imputer_comp_wt = KNNImputer(weights="distance")
605 | assert_array_almost_equal(imputer_comp_wt.fit_transform(X), X_imputed)
606 |
--------------------------------------------------------------------------------
/missingpy/tests/test_missforest.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from scipy.stats import mode
3 |
4 | from sklearn.utils.testing import assert_array_equal
5 | from sklearn.utils.testing import assert_raise_message
6 | from sklearn.utils.testing import assert_equal
7 | from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
8 |
9 | from missingpy import MissForest
10 |
11 | def gen_array(n_rows=20, n_cols=5, missingness=0.2, min_val=0, max_val=10,
12 | missing_values=np.nan, rand_seed=1337):
13 | """Generate an array with NaNs"""
14 |
15 | rand_gen = np.random.RandomState(seed=rand_seed)
16 | X = rand_gen.randint(
17 | min_val, max_val, n_rows * n_cols).reshape(n_rows, n_cols).astype(
18 |         float)
19 |
20 | # Introduce NaNs if missingness > 0
21 | if missingness > 0:
22 |         # If missingness >= 1 then use it as the approximate count (see NOTE below)
23 | if missingness >= 1:
24 |             n_missing = int(missingness)
25 | else:
26 |             # If missingness is in (0, 1), use it as the approximate
27 |             # fraction of total cells that are NaNs
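   | # e.g. the defaults (20 rows x 5 cols, missingness=0.2) give
   | # n_missing = ceil(0.2 * 100) = 20 candidate cells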
28 | n_missing = int(np.ceil(missingness * n_rows * n_cols))
29 |
30 | # Generate row, col index pairs and introduce NaNs
31 | # NOTE: Below does not account for repeated index pairs so NaN
32 | # count/percentage might be less than specified in function call
33 | nan_row_idx = rand_gen.randint(0, n_rows, n_missing)
34 | nan_col_idx = rand_gen.randint(0, n_cols, n_missing)
35 | X[nan_row_idx, nan_col_idx] = missing_values
36 |
37 | return X
38 |
39 |
40 | def test_missforest_imputation_shape():
41 |     # Verify the shape of the imputed matrix
42 | n_rows = 10
43 | n_cols = 2
44 | X = gen_array(n_rows, n_cols)
45 | imputer = MissForest()
46 | X_imputed = imputer.fit_transform(X)
47 | assert_equal(X_imputed.shape, (n_rows, n_cols))
48 |
49 |
50 | def test_missforest_zero():
51 | # Test imputation when missing_values == 0
52 | missing_values = 0
53 | imputer = MissForest(missing_values=missing_values,
54 | random_state=0)
55 |
56 | # Test with missing_values=0 when NaN present
57 | X = gen_array(min_val=0)
58 | msg = "Input contains NaN, infinity or a value too large for %r." % X.dtype
59 | assert_raise_message(ValueError, msg, imputer.fit, X)
60 |
61 | # Test with all zeroes in a column
62 | X = np.array([
63 | [1, 0, 0, 0, 5],
64 | [2, 1, 0, 2, 3],
65 | [3, 2, 0, 0, 0],
66 | [4, 6, 0, 5, 13],
67 | ])
68 | msg = "One or more columns have all rows missing."
69 | assert_raise_message(ValueError, msg, imputer.fit, X)
70 |
71 |
72 | def test_missforest_zero_part2():
73 | # Test with an imputable matrix and compare with missing_values="NaN"
74 | X_zero = gen_array(min_val=1, missing_values=0)
75 | X_nan = gen_array(min_val=1, missing_values=np.nan)
76 | statistics_mean = np.nanmean(X_nan, axis=0)
77 |
78 | imputer_zero = MissForest(missing_values=0, random_state=1337)
79 | imputer_nan = MissForest(missing_values=np.nan, random_state=1337)
80 |
81 | assert_array_equal(imputer_zero.fit_transform(X_zero),
82 | imputer_nan.fit_transform(X_nan))
83 | assert_array_equal(imputer_zero.statistics_.get("col_means"),
84 | statistics_mean)
85 |
86 |
87 | def test_missforest_numerical_single():
88 | # Test imputation with default parameter values
89 |
90 | # Test with a single missing value
91 | df = np.array([
92 | [1, 0, 0, 1],
93 | [2, 1, 2, 2],
94 | [3, 2, 3, 2],
95 | [np.nan, 4, 5, 5],
96 | [6, 7, 6, 7],
97 | [8, 8, 8, 8],
98 | [16, 15, 18, 19],
99 | ])
100 | statistics_mean = np.nanmean(df, axis=0)
101 |
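   | # With a single missing cell, one MissForest round amounts to fitting
   | # a regressor on the complete rows of column 0 and predicting the
   | # missing row; the block below reproduces that by hand.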
102 | y = df[:, 0]
103 | X = df[:, 1:]
104 | good_rows = np.where(~np.isnan(y))[0]
105 | bad_rows = np.where(np.isnan(y))[0]
106 |
107 | rf = RandomForestRegressor(n_estimators=10, random_state=1337)
108 | rf.fit(X=X[good_rows], y=y[good_rows])
109 | pred_val = rf.predict(X[bad_rows])
110 |
111 | df_imputed = np.array([
112 | [1, 0, 0, 1],
113 | [2, 1, 2, 2],
114 | [3, 2, 3, 2],
115 | [pred_val, 4, 5, 5],
116 | [6, 7, 6, 7],
117 | [8, 8, 8, 8],
118 | [16, 15, 18, 19],
119 | ])
120 |
121 | imputer = MissForest(n_estimators=10, random_state=1337)
122 | assert_array_equal(imputer.fit_transform(df), df_imputed)
123 | assert_array_equal(imputer.statistics_.get('col_means'), statistics_mean)
124 |
125 |
126 | def test_missforest_numerical_multiple():
127 | # Test with two missing values for multiple iterations
128 | df = np.array([
129 | [1, 0, np.nan, 1],
130 | [2, 1, 2, 2],
131 | [3, 2, 3, 2],
132 | [np.nan, 4, 5, 5],
133 | [6, 7, 6, 7],
134 | [8, 8, 8, 8],
135 | [16, 15, 18, 19],
136 | ])
137 | statistics_mean = np.nanmean(df, axis=0)
138 | n_rows, n_cols = df.shape
139 |
140 | # Fit missforest and transform
141 | imputer = MissForest(random_state=1337)
142 | df_imp1 = imputer.fit_transform(df)
143 |
144 | # Get iterations used by missforest above
145 | max_iter = imputer.iter_count_
146 |
147 | # Get NaN mask
148 | nan_mask = np.isnan(df)
149 | nan_rows, nan_cols = np.where(nan_mask)
150 |
151 | # Make initial guess for missing values
152 | df_imp2 = df.copy()
153 | df_imp2[nan_rows, nan_cols] = np.take(statistics_mean, nan_cols)
154 |
155 |     # Loop max_iter times over the columns with NaNs
156 | for _ in range(max_iter):
157 | for c in nan_cols:
158 | # Identify all other columns (i.e. predictors)
159 | not_c = np.setdiff1d(np.arange(n_cols), c)
160 |             # Identify rows with and without NaN in column 'c'
161 | y = df_imp2[:, c]
162 | X = df_imp2[:, not_c]
163 | good_rows = np.where(~nan_mask[:, c])[0]
164 | bad_rows = np.where(nan_mask[:, c])[0]
165 |
166 | # Fit model and predict
167 | rf = RandomForestRegressor(n_estimators=100, random_state=1337)
168 | rf.fit(X=X[good_rows], y=y[good_rows])
169 | pred_val = rf.predict(X[bad_rows])
170 |
171 | # Fill in values
172 | df_imp2[bad_rows, c] = pred_val
173 |
174 | assert_array_equal(df_imp1, df_imp2)
175 | assert_array_equal(imputer.statistics_.get('col_means'), statistics_mean)
176 |
177 |
178 | def test_missforest_categorical_single():
179 | # Test imputation with default parameter values
180 |
181 | # Test with a single missing value
182 | df = np.array([
183 | [0, 0, 0, 1],
184 | [0, 1, 2, 2],
185 | [0, 2, 3, 2],
186 | [np.nan, 4, 5, 5],
187 | [1, 7, 6, 7],
188 | [1, 8, 8, 8],
189 | [1, 15, 18, 19],
190 | ])
191 |
192 | y = df[:, 0]
193 | X = df[:, 1:]
194 | good_rows = np.where(~np.isnan(y))[0]
195 | bad_rows = np.where(np.isnan(y))[0]
196 |
197 | rf = RandomForestClassifier(n_estimators=10, random_state=1337)
198 | rf.fit(X=X[good_rows], y=y[good_rows])
199 | pred_val = rf.predict(X[bad_rows])
200 |
201 | df_imputed = np.array([
202 | [0, 0, 0, 1],
203 | [0, 1, 2, 2],
204 | [0, 2, 3, 2],
205 | [pred_val, 4, 5, 5],
206 | [1, 7, 6, 7],
207 | [1, 8, 8, 8],
208 | [1, 15, 18, 19],
209 | ])
210 |
211 | imputer = MissForest(n_estimators=10, random_state=1337)
212 | assert_array_equal(imputer.fit_transform(df, cat_vars=0), df_imputed)
213 | assert_array_equal(imputer.fit_transform(df, cat_vars=[0]), df_imputed)
214 |
215 |
216 | def test_missforest_categorical_multiple():
217 | # Test with two missing values for multiple iterations
218 | df = np.array([
219 | [0, 0, np.nan, 1],
220 | [0, 1, 1, 2],
221 | [0, 2, 1, 2],
222 | [np.nan, 4, 1, 5],
223 | [1, 7, 0, 7],
224 | [1, 8, 0, 8],
225 | [1, 15, 0, 19],
226 | [1, 18, 0, 17],
227 | ])
228 | cat_vars = [0, 2]
229 | statistics_mode = mode(df, axis=0, nan_policy='omit').mode[0]
230 | n_rows, n_cols = df.shape
231 |
232 | # Fit missforest and transform
233 | imputer = MissForest(random_state=1337)
234 | df_imp1 = imputer.fit_transform(df, cat_vars=cat_vars)
235 |
236 | # Get iterations used by missforest above
237 | max_iter = imputer.iter_count_
238 |
239 | # Get NaN mask
240 | nan_mask = np.isnan(df)
241 | nan_rows, nan_cols = np.where(nan_mask)
242 |
243 | # Make initial guess for missing values
244 | df_imp2 = df.copy()
245 | df_imp2[nan_rows, nan_cols] = np.take(statistics_mode, nan_cols)
246 |
247 |     # Loop max_iter times over the columns with NaNs
248 | for _ in range(max_iter):
249 | for c in nan_cols:
250 | # Identify all other columns (i.e. predictors)
251 | not_c = np.setdiff1d(np.arange(n_cols), c)
252 |             # Identify rows with and without NaN in column 'c'
253 | y = df_imp2[:, c]
254 | X = df_imp2[:, not_c]
255 | good_rows = np.where(~nan_mask[:, c])[0]
256 | bad_rows = np.where(nan_mask[:, c])[0]
257 |
258 | # Fit model and predict
259 | rf = RandomForestClassifier(n_estimators=100, random_state=1337)
260 | rf.fit(X=X[good_rows], y=y[good_rows])
261 | pred_val = rf.predict(X[bad_rows])
262 |
263 | # Fill in values
264 | df_imp2[bad_rows, c] = pred_val
265 |
266 | assert_array_equal(df_imp1, df_imp2)
267 | assert_array_equal(imputer.statistics_.get('col_modes')[0],
268 | statistics_mode[cat_vars])
269 |
270 |
271 | def test_missforest_mixed_multiple():
272 | # Test with mixed data type
273 | df = np.array([
274 | [np.nan, 0, 0, 1],
275 | [0, 1, 2, 2],
276 | [0, 2, 3, 2],
277 | [1, 4, 5, 5],
278 | [1, 7, 6, 7],
279 | [1, 8, 8, 8],
280 | [1, 15, 18, np.nan],
281 | ])
282 |
283 | n_rows, n_cols = df.shape
284 | cat_vars = [0]
285 | num_vars = np.setdiff1d(range(n_cols), cat_vars)
286 | statistics_mode = mode(df, axis=0, nan_policy='omit').mode[0]
287 | statistics_mean = np.nanmean(df, axis=0)
288 |
289 | # Fit missforest and transform
290 | imputer = MissForest(random_state=1337)
291 | df_imp1 = imputer.fit_transform(df, cat_vars=cat_vars)
292 |
293 | # Get iterations used by missforest above
294 | max_iter = imputer.iter_count_
295 |
296 | # Get NaN mask
297 | nan_mask = np.isnan(df)
298 | nan_rows, nan_cols = np.where(nan_mask)
299 |
300 | # Make initial guess for missing values
301 | df_imp2 = df.copy()
302 | df_imp2[0, 0] = statistics_mode[0]
303 | df_imp2[6, 3] = statistics_mean[3]
304 |
305 |     # Loop max_iter times over the columns with NaNs
306 | for _ in range(max_iter):
307 | for c in nan_cols:
308 | # Identify all other columns (i.e. predictors)
309 | not_c = np.setdiff1d(np.arange(n_cols), c)
310 |             # Identify rows with and without NaN in column 'c'
311 | y = df_imp2[:, c]
312 | X = df_imp2[:, not_c]
313 | good_rows = np.where(~nan_mask[:, c])[0]
314 | bad_rows = np.where(nan_mask[:, c])[0]
315 |
316 | # Fit model and predict
317 | if c in cat_vars:
318 | rf = RandomForestClassifier(n_estimators=100,
319 | random_state=1337)
320 | else:
321 | rf = RandomForestRegressor(n_estimators=100,
322 | random_state=1337)
323 | rf.fit(X=X[good_rows], y=y[good_rows])
324 | pred_val = rf.predict(X[bad_rows])
325 |
326 | # Fill in values
327 | df_imp2[bad_rows, c] = pred_val
328 |
329 | assert_array_equal(df_imp1, df_imp2)
330 | assert_array_equal(imputer.statistics_.get('col_means'),
331 | statistics_mean[num_vars])
332 | assert_array_equal(imputer.statistics_.get('col_modes')[0],
333 | statistics_mode[cat_vars])
334 |
335 |
336 | def test_statistics_fit_transform():
337 | # Test statistics_ when data in fit() and transform() are different
338 | X = np.array([
339 | [1, 0, 0, 1],
340 | [2, 1, 2, 2],
341 | [3, 2, 3, 2],
342 | [np.nan, 4, 5, 5],
343 | [6, 7, 6, 7],
344 | [8, 8, 8, 8],
345 | [16, 15, 18, 19],
346 | ])
347 | statistics_mean = np.nanmean(X, axis=0)
348 |
349 | Y = np.array([
350 | [0, 0, 0, 0],
351 | [2, 2, 2, 1],
352 | [3, 2, 3, 2],
353 | [np.nan, 4, 5, 5],
354 | [6, 7, 6, 7],
355 | [9, 9, 8, 8],
356 | [16, 15, 18, 19],
357 | ])
358 |
359 | imputer = MissForest()
360 | imputer.fit(X).transform(Y)
361 | assert_array_equal(imputer.statistics_.get('col_means'), statistics_mean)
362 |
363 |
364 | def test_default_with_invalid_input():
365 | # Test imputation with default values and invalid input
366 |
367 | # Test with all rows missing in a column
368 | X = np.array([
369 | [np.nan, 0, 0, 1],
370 | [np.nan, 1, 2, np.nan],
371 | [np.nan, 2, 3, np.nan],
372 | [np.nan, 4, 5, 5],
373 | ])
374 | imputer = MissForest(random_state=1337)
375 | msg = "One or more columns have all rows missing."
376 | assert_raise_message(ValueError, msg, imputer.fit, X)
377 |
378 | # Test with inf present
379 | X = np.array([
380 | [np.inf, 1, 1, 2, np.nan],
381 | [2, 1, 2, 2, 3],
382 | [3, 2, 3, 3, 8],
383 | [np.nan, 6, 0, 5, 13],
384 | [np.nan, 7, 0, 7, 8],
385 | [6, 6, 2, 5, 7],
386 | ])
387 | msg = "+/- inf values are not supported."
388 | assert_raise_message(ValueError, msg, MissForest().fit, X)
389 |
390 | # Test with inf present in matrix passed in transform()
391 | X = np.array([
392 | [np.inf, 1, 1, 2, np.nan],
393 | [2, 1, 2, 2, 3],
394 | [3, 2, 3, 3, 8],
395 | [np.nan, 6, 0, 5, 13],
396 | [np.nan, 7, 0, 7, 8],
397 | [6, 6, 2, 5, 7],
398 | ])
399 |
400 | X_fit = np.array([
401 | [0, 1, 1, 2, np.nan],
402 | [2, 1, 2, 2, 3],
403 | [3, 2, 3, 3, 8],
404 | [np.nan, 6, 0, 5, 13],
405 | [np.nan, 7, 0, 7, 8],
406 | [6, 6, 2, 5, 7],
407 | ])
408 | msg = "+/- inf values are not supported."
409 | assert_raise_message(ValueError, msg, MissForest().fit(X_fit).transform, X)
410 |
--------------------------------------------------------------------------------
/missingpy/utils.py:
--------------------------------------------------------------------------------
1 | """Utility Functions"""
2 | # Author: Ashim Bhattarai
3 | # License: BSD 3 clause
4 |
5 | import numpy as np
6 |
7 |
8 | def masked_euclidean_distances(X, Y=None, squared=False,
9 | missing_values="NaN", copy=True):
10 | """Calculates euclidean distances in the presence of missing values
11 |
12 |     Computes the Euclidean distance between each pair of samples (rows) in X
13 | and Y, where Y=X is assumed if Y=None.
14 | When calculating the distance between a pair of samples, this formulation
15 | essentially zero-weights feature coordinates with a missing value in either
16 | sample and scales up the weight of the remaining coordinates:
17 |
18 | dist(x,y) = sqrt(weight * sq. distance from non-missing coordinates)
19 | where,
20 | weight = Total # of coordinates / # of non-missing coordinates
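   |
   | For example, with x = [0, 1] and y = [1, NaN], only the first
   | coordinate is shared, so weight = 2 / 1 = 2 and
   | dist(x, y) = sqrt(2 * (0 - 1)^2) = sqrt(2) ~ 1.4142, which matches
   | the first doctest below.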
21 |
22 | Note that if all the coordinates are missing or if there are no common
23 | non-missing coordinates then NaN is returned for that pair.
24 |
25 | Read more in the :ref:`User Guide `.
26 |
27 | Parameters
28 | ----------
29 | X : {array-like, sparse matrix}, shape (n_samples_1, n_features)
30 |
31 | Y : {array-like, sparse matrix}, shape (n_samples_2, n_features)
32 |
33 | squared : boolean, optional
34 | Return squared Euclidean distances.
35 |
36 | missing_values : "NaN" or integer, optional
37 | Representation of missing value
38 |
39 | copy : boolean, optional
40 | Make and use a deep copy of X and Y (if Y exists)
41 |
42 | Returns
43 | -------
44 | distances : {array}, shape (n_samples_1, n_samples_2)
45 |
46 | Examples
47 | --------
48 | >>> from missingpy.utils import masked_euclidean_distances
49 | >>> nan = float("NaN")
50 | >>> X = [[0, 1], [1, nan]]
51 | >>> # distance between rows of X
52 | >>> masked_euclidean_distances(X, X)
53 | array([[0. , 1.41421356],
54 | [1.41421356, 0. ]])
55 |
56 | >>> # get distance to origin
57 | >>> masked_euclidean_distances(X, [[0, 0]])
58 | array([[1. ],
59 | [1.41421356]])
60 |
61 | References
62 | ----------
63 | * John K. Dixon, "Pattern Recognition with Partly Missing Data",
64 | IEEE Transactions on Systems, Man, and Cybernetics, Volume: 9, Issue:
65 | 10, pp. 617 - 621, Oct. 1979.
66 | http://ieeexplore.ieee.org/abstract/document/4310090/
67 |
68 | See also
69 | --------
70 |     paired_distances : distances between pairs of elements of X and Y.
71 | """
72 | # Import here to prevent circular import
73 | from .pairwise_external import _get_mask, check_pairwise_arrays
74 |
75 | # NOTE: force_all_finite=False allows not only NaN but also +/- inf
76 | X, Y = check_pairwise_arrays(X, Y, accept_sparse=False,
77 | force_all_finite=False, copy=copy)
78 | if (np.any(np.isinf(X)) or
79 | (Y is not X and np.any(np.isinf(Y)))):
80 | raise ValueError(
81 | "+/- Infinite values are not allowed.")
82 |
83 | # Get missing mask for X and Y.T
84 | mask_X = _get_mask(X, missing_values)
85 |
86 | YT = Y.T
87 | mask_YT = mask_X.T if Y is X else _get_mask(YT, missing_values)
88 |
89 | # Check if any rows have only missing value
90 | if np.any(mask_X.sum(axis=1) == X.shape[1])\
91 | or (Y is not X and np.any(mask_YT.sum(axis=0) == Y.shape[1])):
92 | raise ValueError("One or more rows only contain missing values.")
93 |
94 |
95 | if missing_values not in ["NaN", np.nan] and (
96 | np.any(np.isnan(X)) or (Y is not X and np.any(np.isnan(Y)))):
97 | raise ValueError(
98 | "NaN values present but missing_value = {0}".format(
99 | missing_values))
100 |
101 |     # Get mask of Y.T's non-missing values and set Y.T's missing to zero.
102 |     # Further, cast the mask to int for use in the formula below.
103 | not_YT = (~mask_YT).astype(np.int32)
104 | YT[mask_YT] = 0
105 |
106 | # Get X's mask of non-missing values and set X's missing to zero
107 | not_X = (~mask_X).astype(np.int32)
108 | X[mask_X] = 0
109 |
110 | # Calculate distances
111 | # The following formula derived by:
112 | # Shreya Bhattarai
113 |
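   | # Per sample pair this computes weight * sum((x_i - y_i)^2) over the
   | # shared coordinates. Missing entries were zeroed above, so
   | # np.dot(X * X, not_YT) sums x_i**2 where y_i is present,
   | # np.dot(X, YT) sums x_i * y_i, and np.dot(not_X, YT * YT) sums
   | # y_i**2 where x_i is present; the leading factor is the weight
   | # n_features / n_shared_coordinates.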
114 | distances = (
115 | (X.shape[1] / (np.dot(not_X, not_YT))) *
116 | (np.dot(X * X, not_YT) - 2 * (np.dot(X, YT)) +
117 | np.dot(not_X, YT * YT)))
118 |
119 | if X is Y:
120 | # Ensure that distances between vectors and themselves are set to 0.0.
121 | # This may not be the case due to floating point rounding errors.
122 | distances.flat[::distances.shape[0] + 1] = 0.0
123 |
124 | return distances if squared else np.sqrt(distances, out=distances)
125 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.15.4
2 | scipy==1.1.0
3 | scikit-learn==0.20.1
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | import setuptools
2 |
3 | with open("README.md", "r") as fh:
4 | long_description = fh.read()
5 |
6 | setuptools.setup(
7 | name="missingpy",
8 | version="0.2.0",
9 | author="Ashim Bhattarai",
10 | description="Missing Data Imputation for Python",
11 | long_description=long_description,
12 | long_description_content_type="text/markdown",
13 | url="https://github.com/epsilon-machine/missingpy",
14 | packages=setuptools.find_packages(),
15 |     classifiers=[
16 | "Programming Language :: Python :: 3",
17 | "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
18 | "Operating System :: OS Independent",
19 |     ],
20 | )
21 |
--------------------------------------------------------------------------------