├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── data-frame.asd ├── description.text ├── docs ├── data-frame.epub ├── data-frame.info ├── data-frame.pdf ├── data-frame.texi └── index.html ├── src ├── conditions.lisp ├── data-frame.lisp ├── defdf.lisp ├── filter.lisp ├── formatted-output.lisp ├── missing.lisp ├── pkgdcl.lisp ├── plist-aops.lisp ├── pprint.lisp ├── properties.lisp ├── summary.lisp └── utils.lisp └── tests └── data-frame-tests.lisp /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to Contribute 2 | 3 | We'd love to accept your patches and contributions to this project. There are 4 | just a few small guidelines you need to follow. 5 | 6 | ## Contributor License Agreement 7 | 8 | Contributions to this project must be accompanied by a Contributor License 9 | Agreement. You (or your employer) retain the copyright to your contribution; 10 | this simply gives us permission to use and redistribute your contributions as 11 | part of the project. 12 | 13 | You generally only need to submit a CLA once, so if you've already submitted one 14 | (even if it was for a different project), you probably don't need to do it 15 | again. 16 | 17 | ## The Contribution Process 18 | 19 | The basic workflow is: 20 | 21 | 1. Fork the Project 22 | 2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`) 23 | 3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`) 24 | 4. Push to the Branch (`git push origin feature/AmazingFeature`) 25 | 5. Open a Pull Request 26 | 27 | With multiple contributors and the desire to maintain high quality 28 | code, we need a small bit of process. For example all submissions, 29 | including submissions by project members, require review. We use 30 | GitHub pull requests for this purpose. Consult [GitHub 31 | Help](https://help.github.com/articles/about-pull-requests/) for more 32 | information on using pull requests, and the [contributing 33 | code](https://lisp-stat.dev/docs/contributing/code/) page for more 34 | details. 35 | 36 | ## Community Guidelines 37 | 38 | This project follows a code of conduct that can be found on the 39 | [contributing](https://lisp-stat.dev/docs/contributing/) page. 40 | 41 | ## How to contribute 42 | 43 | See the [contribution 44 | guidelines](https://lisp-stat.dev/docs/contributing/) 45 | in the Lisp-Stat user guide. 46 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Microsoft Public License (MS-PL) 2 | 3 | This license governs use of the accompanying software. If you use the 4 | software, you accept this license. If you do not accept the license, do not 5 | use the software. 6 | 7 | 1. Definitions 8 | The terms "reproduce," "reproduction," "derivative works," and "distribution" 9 | have the same meaning here as under U.S. copyright law. A "contribution" is 10 | the original software, or any additions or changes to the software. A 11 | "contributor" is any person that distributes its contribution under this 12 | license. "Licensed patents" are a contributor's patent claims that read 13 | directly on its contribution. 14 | 15 | 2. Grant of Rights 16 | (A) Copyright Grant- Subject to the terms of this license, including the 17 | license conditions and limitations in section 3, each contributor grants 18 | you a non-exclusive, worldwide, royalty-free copyright license to 19 | reproduce its contribution, prepare derivative works of its contribution, 20 | and distribute its contribution or any derivative works that you create. 21 | 22 | (B) Patent Grant- Subject to the terms of this license, including the 23 | license conditions and limitations in section 3, each contributor grants 24 | you a non-exclusive, worldwide, royalty-free license under its licensed 25 | patents to make, have made, use, sell, offer for sale, import, and/or 26 | otherwise dispose of its contribution in the software or derivative works 27 | of the contribution in the software. 28 | 29 | 3. Conditions and Limitations 30 | (A) No Trademark License- This license does not grant you rights to use 31 | any contributors' name, logo, or trademarks. 32 | 33 | (B) If you bring a patent claim against any contributor over patents that 34 | you claim are infringed by the software, your patent license from such 35 | contributor to the software ends automatically. 36 | 37 | (C) If you distribute any portion of the software, you must retain all 38 | copyright, patent, trademark, and attribution notices that are present in 39 | the software. 40 | 41 | (D) If you distribute any portion of the software in source code form, 42 | you may do so only under this license by including a complete copy of 43 | this license with your distribution. If you distribute any portion of the 44 | software in compiled or object code form, you may only do so under a 45 | license that complies with this license. 46 | 47 | (E) The software is licensed "as-is." You bear the risk of using it. The 48 | contributors give no express warranties, guarantees, or conditions. You 49 | may have additional consumer rights under your local laws which this 50 | license cannot change. To the extent permitted under your local laws, the 51 | contributors exclude the implied warranties of merchantability, fitness 52 | for a particular purpose and non-infringement. 53 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | [![Contributors][contributors-shield]][contributors-url] 5 | [![Forks][forks-shield]][forks-url] 6 | [![Stargazers][stars-shield]][stars-url] 7 | [![Issues][issues-shield]][issues-url] 8 | [![MS-PL License][license-shield]][license-url] 9 | [![LinkedIn][linkedin-shield]][linkedin-url] 10 | 11 | 12 | 13 | 14 |
15 |

16 | 17 | Logo 18 | 19 | 20 |

Data Frame

21 | 22 |

23 | Data frames for Common Lisp. A two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column 24 |
25 | Explore the docs » 26 |
27 |
28 | Report Bug 29 | · 30 | Request Feature 31 | · 32 | Reference Manual 33 |

34 |

35 | 36 | 37 | 38 | 39 |
40 |

Table of Contents

41 |
    42 |
  1. 43 | About The Project 44 | 47 |
  2. 48 |
  3. 49 | Getting Started 50 | 54 |
  4. 55 |
  5. Usage
  6. 56 |
  7. Roadmap
  8. 57 |
  9. Resources
  10. 58 |
  11. Contributing
  12. 59 |
  13. License
  14. 60 |
  15. Contact
  16. 61 |
62 |
63 | 64 | 65 | 66 | 67 | ## About the Project 68 | 69 | A data frame is a two dimensional data structure structure whose 70 | columns may be of differing types. It is similar to, and may be 71 | manipulated as, a Common Lisp array. Data frames hold tightly 72 | coupled collections of variables that all belong to one experiment. 73 | 74 | ### Built With 75 | 76 | * [anaphora](https://github.com/tokenrove/anaphora) 77 | * [alexandria](https://gitlab.common-lisp.net/alexandria/alexandria) 78 | * [array-operations](https://github.com/lisp-stat/array-operations) 79 | * [select](https://github.com/lisp-stat/select) 80 | * [let-plus](https://github.com/sharplispers/let-plus) 81 | * [numerical-utilities](https://github.com/lisp-stat/numerical-utilities) 82 | 83 | 84 | 85 | ## Getting Started 86 | 87 | To get a local copy up and running follow these steps: 88 | 89 | ### Prerequisites 90 | 91 | An ANSI Common Lisp implementation. Developed and tested with [SBCL](https://www.sbcl.org/). 92 | 93 | ### Installation 94 | 95 | Lisp-Stat is composed of several system that are designed to be 96 | independently useful. So you can, for example, use `data-frame` for 97 | any project needing to manipulate two dimensional data in a machine 98 | learning or statistical setting. 99 | 100 | #### Getting the source 101 | 102 | To make the system accessible to [ASDF](https://common-lisp.net/project/asdf/) (a build facility, similar to `make` in the C world), clone the repository in a directory ASDF knows about. By default the `common-lisp` directory in your home directory is known. Create this if it doesn't already exist and then: 103 | 104 | 1. Clone the repositories 105 | ```sh 106 | cd ~/common-lisp && \ 107 | git clone https://github.com/Lisp-Stat/data-frame.git && \ 108 | git clone https://github.com/Lisp-Stat/dfio.git 109 | ``` 110 | 2. Reset the ASDF source-registry to find the new system (from the REPL) 111 | ```lisp 112 | (asdf:clear-source-registry) 113 | ``` 114 | 3. Load the system 115 | ```lisp 116 | (asdf:load-system :data-frame) 117 | ``` 118 | 119 | If you have installed the slime ASDF extensions, you can invoke this 120 | with a comma (',') from the slime REPL in emacs. 121 | 122 | #### Getting dependencies 123 | 124 | To get the third party systems that Lisp-Stat depends on you can use a dependency manager, such as [Quicklisp](https://www.quicklisp.org/beta/) or [CLPM](https://www.clpm.dev/) Once installed, get the dependencies with either of: 125 | 126 | ```lisp 127 | (clpm-client:sync :sources "clpi") ;sources may vary 128 | ``` 129 | 130 | ```lisp 131 | (ql:quickload :data-frame) 132 | ``` 133 | 134 | You need do this only once. After obtaining the dependencies, you can 135 | load the system with `ASDF` as described above without first syncing 136 | sources. 137 | 138 | 139 | ## Usage 140 | 141 | Create a data frame: 142 | 143 | ```lisp 144 | (make-df '(:a :b) '(#(1 2 3) #(10 20 30))) 145 | 146 | ``` 147 | 148 | For more examples, please refer to the [Documentation](https://lisp-stat.dev/docs/manuals/data-frame). 149 | 150 | 151 | 152 | ## Roadmap 153 | 154 | See the [open issues](https://github.com/lisp-stat/data-frame/issues) for a list of proposed features (and known issues). 155 | 156 | ## Resources 157 | 158 | This system is part of the [Lisp-Stat](https://lisp-stat.dev/) 159 | project; that should be your first stop for information. Also see the 160 | [resources](https://lisp-stat.dev/resources) and 161 | [community](https://lisp-stat.dev/community) page for more 162 | information. 163 | 164 | 165 | ## Contributing 166 | 167 | Contributions are what make the open source community such an amazing place to be learn, inspire, and create. Any contributions you make are greatly appreciated. Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on the code of conduct, and the process for submitting pull requests. 168 | 169 | 170 | ## License 171 | 172 | Distributed under the MS-PL License. See [LICENSE](LICENSE) for more information. 173 | 174 | 175 | 176 | 177 | ## Contact 178 | 179 | Project Link: [https://github.com/lisp-stat/data-frame](https://github.com/lisp-stat/data-frame) 180 | 181 | 182 | 183 | 184 | 185 | [contributors-shield]: https://img.shields.io/github/contributors/lisp-stat/data-frame.svg?style=for-the-badge 186 | [contributors-url]: https://github.com/lisp-stat/data-frame/graphs/contributors 187 | [forks-shield]: https://img.shields.io/github/forks/lisp-stat/data-frame.svg?style=for-the-badge 188 | [forks-url]: https://github.com/lisp-stat/data-frame/network/members 189 | [stars-shield]: https://img.shields.io/github/stars/lisp-stat/data-frame.svg?style=for-the-badge 190 | [stars-url]: https://github.com/lisp-stat/data-frame/stargazers 191 | [issues-shield]: https://img.shields.io/github/issues/lisp-stat/data-frame.svg?style=for-the-badge 192 | [issues-url]: https://github.com/lisp-stat/data-frame/issues 193 | [license-shield]: https://img.shields.io/github/license/lisp-stat/data-frame.svg?style=for-the-badge 194 | [license-url]: https://github.com/lisp-stat/data-frame/blob/master/LICENSE 195 | [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555 196 | [linkedin-url]: https://www.linkedin.com/company/symbolics/ 197 | -------------------------------------------------------------------------------- /data-frame.asd: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: ASDF -*- 2 | ;;; Copyright (c) 2020-2024 by Symbolics Pte. Ltd. All rights reserved. 3 | ;;; SPDX-License-identifier: MS-PL 4 | 5 | (defsystem "data-frame" 6 | :version "1.3.3" 7 | :licence :MS-PL 8 | :author "Steve Nunez " 9 | :long-name "Data frames for Common Lisp" 10 | :description "A data manipulation library for statistical computing" 11 | :long-description #.(uiop:read-file-string 12 | (uiop:subpathname *load-pathname* "description.text")) 13 | :homepage "https://lisp-stat.dev/docs/manuals/data-frame" 14 | :source-control (:git "https://github.com/Lisp-Stat/data-frame.git") 15 | :bug-tracker "https://github.com/Lisp-Stat/data-frame/issues" 16 | 17 | :depends-on ("alexandria" 18 | "alexandria+" 19 | "anaphora" 20 | "array-operations" 21 | "num-utils" 22 | "select" 23 | "statistics" 24 | "let-plus" 25 | "duologue" 26 | #+sbcl "sb-cltl2") 27 | :serial t 28 | :pathname "src/" 29 | :components ((:file "pkgdcl") 30 | (:file "utils") 31 | (:file "data-frame") 32 | (:file "pprint") 33 | (:file "formatted-output") 34 | (:file "summary") 35 | (:file "defdf") 36 | (:file "conditions") 37 | (:file "properties") 38 | (:file "missing") 39 | (:file "filter") 40 | (:file "plist-aops")) 41 | :in-order-to ((test-op (test-op "data-frame/tests")))) 42 | 43 | (defsystem "data-frame/tests" 44 | :version "1.0.0" 45 | :description "Unit tests for DATA-FRAME." 46 | :author "Steve Nunez " 47 | :licence :MS-PL 48 | :depends-on ("data-frame" 49 | "clunit2") 50 | :serial t 51 | :pathname "tests/" 52 | :components ((:file "data-frame-tests")) 53 | :perform (test-op (o s) 54 | (let ((*print-pretty* t)) ;work around clunit issue #9 55 | (symbol-call :clunit :run-suite 56 | (find-symbol* :data-frame 57 | :data-frame-tests) 58 | :use-debugger nil)))) 59 | -------------------------------------------------------------------------------- /description.text: -------------------------------------------------------------------------------- 1 | A data frame is a common way of storing data for statistical analysis. Under the hood, a data frame is a vector of equal-length vectors. Each element of the vector can be thought of as a column and the length of each element of the vector is the number of rows. As a result, data frames can store different classes of objects in each column (i.e. numeric, character, factor). In essence, the easiest way to think of a data frame is as an Excel worksheet that contains columns of different types of data but are all of equal length rows. 2 | 3 | From a design perspective, Lisp-Stat's data frame is conceptually most similar to the 'tibble' from the tidyverse, but using Common Lisp idioms, style and syntax. 4 | -------------------------------------------------------------------------------- /docs/data-frame.epub: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lisp-Stat/data-frame/1fae0f2d55a8228cf9e1018e7bc11686a23ff673/docs/data-frame.epub -------------------------------------------------------------------------------- /docs/data-frame.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Lisp-Stat/data-frame/1fae0f2d55a8228cf9e1018e7bc11686a23ff673/docs/data-frame.pdf -------------------------------------------------------------------------------- /src/conditions.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2021-2024 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | (define-condition duplicate-key (error) 6 | ((key :initarg :key)) 7 | (:report (lambda (condition stream) 8 | (format stream "Duplicate key ~A." (slot-value condition 'key)))) 9 | (:documentation "An operation attempted to use a key that already exists in ORDERED-KEYS")) 10 | 11 | (define-condition key-not-found (error) 12 | ((key :initarg :key) 13 | (keys :initarg :keys)) 14 | (:report (lambda (condition stream) 15 | (format stream "Key ~A not found, valid keys are ~A." 16 | (slot-value condition 'key) 17 | (slot-value condition 'keys)))) 18 | (:documentation "An operation was attempted on a non-existant key.")) 19 | 20 | (define-condition missing-data (error) 21 | ((name :initarg :name) 22 | (data :initarg :data)) 23 | (:documentation "A variable has missing data, e.g. :na, nil") 24 | (:report (lambda (condition stream) 25 | (format stream "~A contains missing data" 26 | (slot-value condition 'name))))) 27 | 28 | (define-condition large-data (warning) 29 | ((data-size :initarg :data-size 30 | :reader data-size)) 31 | (:report (lambda (condition stream) 32 | (format stream 33 | "You are attempting to embed a large number of data points (~D); the recommended maximum is ~D." 34 | (data-size condition) *large-data*))) 35 | (:documentation "A operation was requested on a data set large enough to potentially cause problems.")) 36 | 37 | (define-condition data-frame-exists (error) 38 | ((data-frame :initarg :data-frame 39 | :reader data-frame)) 40 | (:report (lambda (condition stream) 41 | (format stream 42 | "You are attempting to redefine ~A and *ask-on-redefine is ~A" 43 | (slot-value condition 'name) 44 | *ask-on-redefine*))) 45 | (:documentation "An attempt to redefine an existing data frame. Triggered if either the symbol is bound or the package exists.")) 46 | 47 | ;;; Validation functions for duologue:prompt, which is used to get user input within restarts 48 | (defun df-exists-p (s) 49 | (if (or (find-package s) 50 | (member (find-symbol (string-upcase s)) df::*data-frames*)) 51 | nil 52 | t)) 53 | 54 | (defun invalid-df-name (s) 55 | "A user prompt, using DUOLOGUE, to select a valid data frame name." 56 | (duologue:say "~A names an existing data frame. Please choose another name" s)) 57 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /src/data-frame.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2021-2023 by Symbolics Pte. Ltd. All rights reserved. 3 | ;;; SPDX-License-identifier: MS-PL 4 | (in-package #:data-frame) 5 | 6 | ;;; Note: tpapp never mentions the difference between a data-vector 7 | ;;; and a data-frame. As near as I can tell, a data-frame must have 8 | ;;; the same number of rows for each variable. 9 | 10 | (defparameter *large-data* most-positive-fixnum ;4611686018427387903 11 | "An indication that the data set is large for a particular use case. 12 | This should be bound by a user to the maximum number of data points they consider to be 'normal'. The function can then signal a large-data warning if it is exceeded. 13 | 14 | E.g. (let ((df:*large-data* 50000)) 15 | (handler-bind ((large-data ... 16 | (some-data-operation ; this will signal if the data is too large 17 | (restart-bind ...") 18 | 19 | (deftype data-type () 20 | "A statistical type for a data variable. All data columns must be one of these types if they are to be intepreted properly by Lisp-Stat" 21 | '(member :string :double-float :single-float :categorical :temporal :integer :bit)) 22 | 23 | 24 | ;;; Ordered keys provide a mapping from column keys (symbols) to nonnegative 25 | ;;; integers. They are used internally and the corresponding interface is 26 | ;;; NOT EXPORTED. 27 | 28 | (defstruct (ordered-keys (:copier nil)) 29 | "Representation of ordered keys 30 | Ordered keys provide a mapping from column keys (symbols) to nonnegative 31 | integers. They are used internally and the corresponding interface is 32 | NOT EXPORTED. 33 | 34 | TABLE maps keys to indexes, starting from zero." 35 | (table (make-hash-table :test #'eq) :type hash-table :read-only t)) 36 | 37 | (defun keys-count (ordered-keys) 38 | "Number of keys." 39 | (hash-table-count (ordered-keys-table ordered-keys))) 40 | 41 | (defun keys-vector (ordered-keys) 42 | "Vector of all keys." 43 | (map 'vector #'car 44 | (sort (hash-table-alist (ordered-keys-table ordered-keys)) 45 | #'<= 46 | :key #'cdr))) 47 | 48 | (defun key-index (ordered-keys key) 49 | "Return the index for KEY." 50 | (let+ (((&values index present?) (gethash key (ordered-keys-table ordered-keys)))) 51 | (unless present? 52 | (error 'key-not-found :key key :keys (keys-vector ordered-keys))) 53 | index)) 54 | 55 | (defun add-key! (ordered-keys key) 56 | "Modify ORDERED-KEYS by adding KEY." 57 | (check-type key symbol) 58 | (let+ (((&structure ordered-keys- table) ordered-keys) 59 | ((&values &ign present?) (gethash key table))) 60 | (when present? 61 | (error 'duplicate-key :key key)) 62 | (setf (gethash key table) (hash-table-count table))) 63 | ordered-keys) 64 | 65 | (defun ordered-keys (keys) 66 | "Create an ORDERED-KEYS object from KEYS (a sequence)." 67 | (aprog1 (make-ordered-keys) 68 | (map nil (curry #'add-key! it) keys))) 69 | 70 | (defun copy-ordered-keys (ordered-keys) 71 | "Return a copy of ORDERED-KEYS" 72 | (let+ (((&structure ordered-keys- table) ordered-keys)) 73 | (make-ordered-keys :table (copy-hash-table table)))) 74 | 75 | (defun add-keys (ordered-keys &rest keys) 76 | "Add KEYS to ORDERED-KEYS" 77 | (aprog1 (copy-ordered-keys ordered-keys) 78 | (mapc (curry #'add-key! it) keys))) 79 | 80 | (defun remove-key! (ordered-keys key) 81 | "Modify ORDERED-KEYS by removing KEY." 82 | (check-type key symbol) 83 | (let+ (((&structure ordered-keys- table) ordered-keys) 84 | ((&values &ign present?) (gethash key table)) 85 | (kv)) 86 | (unless present? 87 | (error 'key-not-found :key key)) 88 | 89 | (remhash key table) 90 | (setf kv (remove key (keys-vector ordered-keys))) 91 | (loop for key across kv 92 | for i = 0 then (1+ i) 93 | do (setf (gethash key table) i))) 94 | ordered-keys) 95 | 96 | ;;; 97 | ;;; Implementation of SELECT for ORDERED-KEYS 98 | ;;; 99 | (defmethod axis-dimension ((axis ordered-keys)) 100 | (hash-table-count (ordered-keys-table axis))) 101 | 102 | (defmethod canonical-representation ((axis ordered-keys) (slice symbol)) 103 | (if (select-reserved-symbol? slice) 104 | (call-next-method) 105 | (key-index axis slice))) 106 | 107 | (defmethod select ((ordered-keys ordered-keys) &rest selections) 108 | (let+ (((slice) selections)) 109 | (ordered-keys 110 | (select (keys-vector ordered-keys) 111 | (canonical-representation ordered-keys slice))))) 112 | 113 | 114 | ;;; generic implementation -- the class is not exported, only the functionality 115 | 116 | (defclass data () 117 | ((name ;same as the symbol-name 118 | :initarg nil 119 | :type string 120 | :accessor name 121 | :documentation "The name of the data frame. MUST be the same as the symbol whose value cell points to this data frame. This slot essentially allows us to go 'backwards' and get the symbol that names the data frame.") 122 | (ordered-keys 123 | :initarg :ordered-keys 124 | :type ordered-keys) 125 | (columns 126 | :initarg :columns 127 | :type vector) 128 | #+nil 129 | (doc-string ;I'd like this to be 'documentation', but that conflicts with the CL version 130 | :initarg :doc-string 131 | :initform "" 132 | :type string 133 | :accessor doc-string)) 134 | (:documentation "This class is used for implementing both data-vector and data-frame, and represents an ordered collection of key-column pairs. Columns are not assumed to have any specific attributes. This class is not exported.")) 135 | 136 | (defmethod aops:element-type ((data data)) ;should this be doing something? 137 | t) 138 | 139 | (defun make-data (class keys columns) 140 | "Create a DATA object from KEYS and COLUMNS. FOR INTERNAL USE. Always creates a copy of COLUMNS in order to ensure that it is an adjustable array with a fill pointer. KEYS are converted to ORDERED-KEYS if necessary." 141 | (let ((n-columns (length columns)) 142 | (ordered-keys (atypecase keys 143 | (ordered-keys it) 144 | (t (ordered-keys it))))) 145 | (assert (= n-columns (keys-count ordered-keys))) 146 | (assert (subtypep class 'data)) 147 | (make-instance class 148 | :ordered-keys ordered-keys 149 | :columns (make-array n-columns 150 | :adjustable t 151 | :fill-pointer n-columns 152 | :initial-contents columns)))) 153 | 154 | (defgeneric check-column-compatibility (data column) 155 | (:documentation "Check if COLUMN is compatible with DATA.") 156 | (:method ((data data) column) ;no-op. Was Tamas going to implement it later? 157 | (declare (ignore column)))) 158 | 159 | (defun ensure-arguments-alist (rest) 160 | "Recognizes the following and converts them to an alist: 161 | 162 | plist 163 | alist 164 | (plist) 165 | (alist) 166 | (data-frame)" 167 | (let+ (((&flet error% (&optional (list rest)) 168 | (error "Could not interpret ~A as a plist or alist." list))) 169 | ((&flet ensure-alist (list) 170 | (typecase (car list) 171 | (cons rest) 172 | (symbol (plist-alist rest)) 173 | (t (error% list)))))) 174 | (if (cdr rest) 175 | (ensure-alist rest) 176 | (let ((first (car rest))) 177 | (typecase first 178 | (data (as-alist first)) 179 | (cons (if (consp (cdr first)) 180 | (ensure-alist first) 181 | rest)) ; first element of an alist 182 | (t (error%))))))) 183 | 184 | 185 | ;;; These two functions, alist-data & plist-data can be used to create 186 | ;;; DATA-VECTOR classes, that is a class works just like DATA-FRAME, 187 | ;;; but permits unequal length variables. To date, I haven't found 188 | ;;; much (any) need for a DATA-VECTOR and haven't done any 189 | ;;; improvements to that class. 190 | (defun alist-data (class alist) 191 | "Create an object of CLASS (subclass of DATA) from ALIST which contains key-column pairs." 192 | (assert alist () "Can't create an empty data frame.") 193 | (make-data class (mapcar #'car alist) (mapcar #'cdr alist))) 194 | 195 | (defun plist-data (class plist) 196 | "Create an object of CLASS (subclass of DATA) from PLIST which contains keys and columns, interleaved." 197 | (alist-data class (plist-alist plist))) 198 | 199 | (defun keys (data) 200 | "Return a vector of keys." 201 | (check-type data data) 202 | (copy-seq (keys-vector (slot-value data 'ordered-keys)))) 203 | 204 | (defmethod as-alist ((data data)) 205 | "Key-column pairs as an alist." 206 | (map 'list #'cons (keys data) (columns data))) 207 | 208 | (defun copy (data &key (key #'identity)) 209 | "Copy data frame or vector. Keys are copied (and thus can be modified), columns or elements are copied using KEY, making the default give a shallow copy." 210 | (check-type data data) 211 | (let+ (((&slots-r/o ordered-keys columns) data) 212 | (new-data (make-data (class-of data) 213 | (copy-ordered-keys ordered-keys) 214 | (map 'vector key columns)))) 215 | new-data)) 216 | 217 | (defun column (data key) 218 | "Return column corresponding to key." 219 | (check-type data data) 220 | (let+ (((&slots-r/o ordered-keys columns) data)) 221 | (aref columns (key-index ordered-keys key)))) 222 | 223 | (defun (setf column) (column data key) 224 | "Set column corresponding to key." 225 | (check-column-compatibility data column) 226 | (let+ (((&slots-r/o ordered-keys columns) data)) 227 | (setf (aref columns (key-index ordered-keys key)) column))) 228 | 229 | (defun columns (data &optional (slice t)) 230 | "Return the columns of DATA as a vector, or a selection if given (keys are resolved)." 231 | (check-type data data) 232 | (let+ (((&slots-r/o ordered-keys columns) data)) 233 | (select columns (canonical-representation ordered-keys slice)))) 234 | 235 | (defun column-names (df) 236 | "Return a list of column names in DF, as strings" 237 | (map 'list #'symbol-name (keys df))) 238 | 239 | (defun rows (data) 240 | "Return the rows of DATA as a vector" 241 | (loop for index below (aops:nrow data) 242 | collecting (columns (select:select data index t)) into rows 243 | finally (return (coerce rows 'vector )))) 244 | 245 | (defun map-columns (data function &optional (result-class (class-of data))) 246 | "Map columns of DATA-FRAME or DATA-VECTOR using FUNCTION. The result is a new DATA-FRAME with the same keys." 247 | (make-data result-class (keys data) (map 'vector function (columns data)))) 248 | 249 | 250 | (defun df-env-p (df) 251 | "Returns T if there is environment set-up for the data frame, or NIL if there isn't one." 252 | (if (and (slot-boundp df 'name) 253 | (find-package (string-upcase (slot-value df 'name)))) 254 | t 255 | nil)) 256 | 257 | 258 | (defun add-column! (data key column &optional update-env) 259 | "Modify DATA (a data-frame or data-vector) by adding COLUMN with KEY. Return DATA." 260 | (check-column-compatibility data column) 261 | (let+ (((&slots ordered-keys columns) data)) 262 | (add-key! ordered-keys key) 263 | (vector-push-extend column columns)) 264 | (when (and (df-env-p data) 265 | update-env) 266 | (defdf-env (find-symbol (string-upcase (name data))) nil)) 267 | data) 268 | 269 | (defun add-columns! (data &rest keys-and-columns) 270 | "Modify DATA (a data-frame or data-vector) by adding columns with keys. 271 | If a data-frame environment exists, add columns to it as well." 272 | (mapc (lambda+ ((key . column)) 273 | (add-column! data key column t)) 274 | (ensure-arguments-alist keys-and-columns)) 275 | data) 276 | 277 | (defun add-columns (data &rest keys-and-columns) 278 | "Return a new data-frame or data-vector with keys and columns added. Does not modify DATA." 279 | (aprog1 (copy data) 280 | (apply #'add-columns! it keys-and-columns))) 281 | 282 | 283 | (defun remove-columns (data keys) 284 | "Return a new data-frame or data-vector with keys and columns removed. Does not modify DATA. 285 | ARGS: DATA data frame 286 | KEYS list of keys (variables) to be removed" 287 | (select data t (reverse (set-difference (coerce (keys data) 'list) keys)))) 288 | 289 | (defun remove-column! (data key) 290 | "Modify DATA (a data-frame or data-vector) by removing COLUMN with KEY. Return DATA." 291 | (check-type key symbol) 292 | (let+ (((&slots ordered-keys columns) data) 293 | (index (key-index ordered-keys key))) 294 | (remove-key! ordered-keys key) 295 | (delete-nth* columns index)) 296 | (when (df-env-p data) 297 | (defdf-env (find-symbol (string-upcase (name data))) '(key))) 298 | data) 299 | 300 | ;;; TODO document me! 301 | (defun remove-columns! (data &rest keys) 302 | "Modify DATA (a data-frame or data-vector) by removing columns with keys. 303 | If a data-frame environment exists, add columns to it as well." 304 | (mapc (lambda (key) 305 | (remove-column! data key)) 306 | (car keys)) 307 | data) 308 | 309 | 310 | 311 | ;;; TODO take a plist for new & old and process it so we can rename 312 | ;;; multiple variables in one function call. See ensure-arguments-alist. 313 | (defmethod rename-column! (data new old) ;generic so will work on data-frame subclasses 314 | "Substitute NEW, a SYMBOL, for OLD in DF 315 | 316 | Useful when reading data files that have an empty or generated column name. 317 | 318 | Example: (rename-column! cars 'name :||) will replace an empty symbol with 'name" 319 | (let+ ((old-keys (coerce (keys data) 'list)) 320 | (present? (member old old-keys))) 321 | (unless present? 322 | (error 'key-not-found :key old)) 323 | (setf (slot-value data 'ordered-keys) (ordered-keys (substitute new old (keys data)))) 324 | (when (df-env-p data) 325 | (defdf-env (find-symbol (string-upcase (name data))) old-keys)) 326 | data)) 327 | 328 | 329 | 330 | (defmacro define-data-subclass (class abbreviation) 331 | (check-type class symbol) 332 | (check-type abbreviation symbol) 333 | (let+ (((&flet fname (prefix) 334 | (symbolicate prefix '#:- abbreviation))) 335 | (alist-fn (fname '#:alist)) 336 | (plist-fn (fname '#:plist))) 337 | `(progn 338 | (defclass ,class (data) 339 | ()) 340 | (defun ,(fname '#:make) (keys columns) 341 | (make-data ',class keys columns)) 342 | (defun ,alist-fn (alist) 343 | (alist-data ',class alist)) 344 | (defun ,plist-fn (plist) 345 | (plist-data ',class plist)) 346 | (defun ,abbreviation (&rest plist-or-alist) 347 | (if (alistp plist-or-alist) 348 | (,alist-fn plist-or-alist) 349 | (,plist-fn plist-or-alist)))))) 350 | 351 | (define-data-subclass data-vector dv) 352 | 353 | (defmethod aops:dims ((data-vector data-vector)) 354 | (list (length (columns data-vector)))) 355 | 356 | (defmethod aops:as-array ((data-vector data-vector)) 357 | (columns data-vector)) 358 | 359 | (defmethod select ((data-vector data-vector) &rest slices) 360 | (let+ (((column-slice) slices) 361 | ((&slots-r/o ordered-keys columns) data-vector) 362 | (column-slice (canonical-representation ordered-keys column-slice))) 363 | (if (singleton-representation? column-slice) 364 | (aref columns column-slice) 365 | (make-dv (select ordered-keys column-slice) 366 | (select columns column-slice))))) 367 | 368 | (define-data-subclass data-frame df) 369 | 370 | (defmethod initialize-instance :after ((data-frame data-frame) &rest initargs) 371 | (declare (ignore initargs)) 372 | (let+ (((first . rest) (coerce (columns data-frame) 'list)) 373 | (length (column-length first))) 374 | (assert (every (lambda (column) 375 | (= length (column-length column))) 376 | rest) 377 | () "Columns don't have the same length."))) 378 | 379 | (defmethod aops:nrow ((data-frame data-frame)) 380 | (column-length (aref (columns data-frame) 0))) 381 | 382 | (defmethod aops:ncol ((data-frame data-frame)) 383 | (length (columns data-frame))) 384 | 385 | (defmethod aops:dims ((data-frame data-frame)) 386 | (list (aops:nrow data-frame) (aops:ncol data-frame))) 387 | 388 | (defmethod aops:as-array ((data-frame data-frame)) 389 | ;; Return contents of DATA-FRAME as a matrix. 390 | (nu:transpose (aops:combine (columns data-frame)))) 391 | 392 | (defmethod check-column-compatibility ((data data-frame) column) 393 | (assert (= (column-length column) (aops:nrow data)))) 394 | 395 | (defun matrix-df (keys matrix) 396 | "Convert a matrix to a data-frame with the given keys." 397 | (let+ ((columns (aops:split (nu:transpose matrix) 1))) 398 | (assert (length= columns keys)) 399 | (alist-df (map 'list #'cons keys columns)))) 400 | 401 | 402 | ;;; implementation of SELECT for DATA-FRAME 403 | 404 | (defmethod select ((data-frame data-frame) &rest slices) 405 | (let+ (((row-slice &optional (column-slice t)) slices) 406 | ((&slots-r/o ordered-keys columns) data-frame) 407 | (row-slice (canonical-representation (aops:nrow data-frame) row-slice)) 408 | (column-slice (canonical-representation ordered-keys column-slice)) 409 | (columns (select columns column-slice)) 410 | ((&flet slice-column (column) 411 | (select column row-slice)))) 412 | (if (singleton-representation? column-slice) 413 | (slice-column columns) 414 | (let ((keys (select ordered-keys column-slice)) 415 | (columns (map 'vector #'slice-column columns))) 416 | (if (singleton-representation? row-slice) 417 | (make-dv keys columns) 418 | (make-df keys columns)))))) 419 | 420 | ;;; TODO: (setfs election) 421 | 422 | 423 | ;;; mapping rows and adding columns 424 | 425 | (defun map-rows (data-frame keys function &key (element-type t)) 426 | "Map rows using FUNCTION, on the columns corresponding to KEYS. Return the result with the given ELEMENT-TYPE." 427 | (let ((columns (map 'list (curry #'column data-frame) (ensure-list keys))) 428 | (nrow (aops:nrow data-frame))) 429 | (aprog1 (make-array nrow :element-type element-type) 430 | (dotimes (index nrow) 431 | (setf (aref it index) 432 | (apply function 433 | (mapcar (lambda (column) 434 | (ref column index)) 435 | columns))))))) 436 | 437 | (defun do-rows (data-frame keys function) 438 | "Traverse rows from first to last, calling FUNCTION on the columns corresponding to KEYS. Return no values." 439 | (let ((columns (map 'list (curry #'column data-frame) (ensure-list keys))) 440 | (nrow (aops:nrow data-frame))) 441 | (dotimes (index nrow (values)) 442 | (apply function 443 | (mapcar (lambda (column) 444 | (ref column index)) 445 | columns))))) 446 | 447 | (defun map-df (data-frame keys function result-keys) 448 | "Map DATA-FRAME to another one by rows. Function is called on the columns corresponding to KEYS, and should return a sequence with the same length as RESULT-KEYS, which give the keys of the resulting data frame. RESULT-KETS should be either symbols, or of the format (symbol &optional (element-type t))." 449 | (let* ((columns (map 'list (curry #'column data-frame) keys)) 450 | (nrow (aops:nrow data-frame)) 451 | (result-keys-and-element-types 452 | (mapcar (lambda (key-and-element-type) 453 | (let+ (((key &optional (element-type t)) 454 | (ensure-list key-and-element-type))) 455 | (cons key element-type))) 456 | result-keys)) 457 | (result-columns (map 'vector 458 | (lambda (key-and-element-type) 459 | (make-array nrow 460 | :element-type (cdr key-and-element-type))) 461 | result-keys-and-element-types))) 462 | (dotimes (index nrow) 463 | (let ((result-row (apply function 464 | (mapcar (lambda (column) 465 | (ref column index)) 466 | columns)))) 467 | (assert (length= result-row result-columns)) 468 | (map nil (lambda (result-column result-element) 469 | (setf (aref result-column index) result-element)) 470 | result-columns result-row))) 471 | (make-df (mapcar #'car result-keys-and-element-types) result-columns))) 472 | 473 | (defun mask-rows (data-frame keys predicate) 474 | "Return a bit-vector containing the result of calling PREDICATE on rows of the columns corresponding to KEYS (0 for NIL, 1 otherwise)." 475 | (map-rows data-frame keys (compose (lambda (flag) ;translate nil/non-nil to 0 or 1 476 | (if flag 1 0)) 477 | predicate) 478 | :element-type 'bit)) 479 | 480 | (defun count-rows (data-frame keys predicate) 481 | "Count the number of rows for which PREDICATE called on the columns corresponding to KEYS returns non-NIL." 482 | (let ((columns (map 'list (curry #'column data-frame) (ensure-list keys)))) 483 | (loop for index below (aops:nrow data-frame) 484 | count (apply predicate 485 | (mapcar (lambda (column) 486 | (ref column index)) 487 | columns))))) 488 | 489 | (defun replace-column! (data key function-or-column &key (element-type t)) 490 | "Modify column KEY of data-frame DATA by replacing it either with the given column, or applying the function to the current values (ELEMENT-TYPE is used.)" 491 | (let+ (((&slots ordered-keys columns) data) 492 | (index (key-index ordered-keys key))) 493 | (setf (aref columns index) 494 | (if (functionp function-or-column) 495 | (map-rows data key function-or-column :element-type element-type) 496 | (prog1 function-or-column 497 | (check-column-compatibility data function-or-column))))) 498 | data) 499 | 500 | (defun replace-column (data key function-or-column &key (element-type t)) 501 | "Create a new data frame with new column KEY from data-frame DATA by replacing it either with the given column, or applying the function to the current values (ELEMENT-TYPE is used.)" 502 | (replace-column! (copy data) key function-or-column :element-type element-type)) 503 | 504 | ;; We give this a df- prefix to avoid symbol clash with the CL 505 | ;; function (which, sadly, is not generic). After adding 506 | ;; df:delete-duplicates, shadow both in the package declaration. 507 | (defun df-remove-duplicates (data) 508 | "Return a modified copy of DATA from which any element (row, if a DATA-FRAME) that matches another element has been removed" 509 | (etypecase data 510 | (alexandria:proper-sequence (cl:remove-duplicates data)) ; Eventually shadow the CL version 511 | (df:data-frame (let* ((new-rows (cl:remove-duplicates (rows data) :test #'equalp)) 512 | (new-array (make-array (list (length new-rows) 513 | (length (svref new-rows 0))) 514 | :initial-contents new-rows))) 515 | (matrix-df (keys data) new-array))))) 516 | 517 | 518 | ;; TODO 519 | #+nil 520 | (defun delete-duplicates (data) 521 | "Like REMOVE-DUPLICATES, but may modify DATA" 522 | ...) 523 | 524 | (defmethod print-object ((ordered-keys ordered-keys) stream) 525 | (print-unreadable-object (ordered-keys stream :type t) 526 | (format stream "~{~a~^, ~}" (coerce (keys-vector ordered-keys) 'list)))) 527 | 528 | (defmethod print-object ((data-vector data-vector) stream) 529 | (let ((alist (as-alist data-vector))) 530 | (print-unreadable-object (data-vector stream :type t) 531 | (format stream "of ~d variables" (length alist))))) 532 | 533 | (defmethod print-object ((df data-frame) stream) 534 | "Print DATA-FRAME dimensions and type 535 | After defining this method it is permanently associated with data-frame objects" 536 | (print-unreadable-object (df stream :type t) 537 | (let ((description (and (slot-boundp df 'name) 538 | (documentation (find-symbol (name df)) 'variable)))) 539 | (format stream 540 | "(~d observations of ~d variables)" 541 | (aops:nrow df) 542 | (aops:ncol df)) 543 | (when description 544 | (format stream "~&~A" (short-string description)))))) 545 | 546 | (defmethod describe-object ((df data-frame) stream) 547 | (let ((name (when (slot-boundp df 'name) (name df)))) 548 | (format stream "~A~%" name) 549 | (format stream " A data-frame with ~D observations of ~D variables~2%" (aops:nrow df) (aops:ncol df)) 550 | (when name 551 | (let ((rows (loop for key across (keys df) 552 | for sym = (find-symbol (string-upcase (symbol-name key)) (find-package name)) 553 | collect (list (symbol-name key) 554 | (get sym :type) 555 | (get sym :unit) 556 | (get sym :label))))) 557 | (push '("--------" "----" "----" "-----------") rows) 558 | (push '("Variable" "Type" "Unit" "Label") rows) 559 | (print-table rows))))) 560 | 561 | 562 | ;;; KLUDGE ALERT 563 | ;;; This violates the spec. It's not easy at all to get good 564 | ;;; behaviour from describe. See code and comments in describe.lisp. 565 | #+allegro (setf excl:*enable-package-locked-errors* nil) 566 | (defmethod describe-object :after ((s symbol) stream) 567 | (unless (boundp s) (return-from describe-object)) 568 | (unless (eq #+sbcl (SB-CLTL2:variable-information s) 569 | #+ccl (ccl:variable-information s) 570 | #+allegro (system:variable-information s) 571 | :symbol-macro) 572 | (let ((*print-pretty* t) 573 | (df (symbol-value s)) 574 | (name (symbol-name s))) 575 | 576 | (pprint-logical-block (stream nil) 577 | (pprint-logical-block (stream nil) 578 | (pprint-indent :block 2 stream) 579 | (when-let ((pkg (find-package name))) 580 | (format stream "~@:_Variables: ~@:_") 581 | (pprint-logical-block (stream nil :per-line-prefix " ") 582 | (let ((rows (loop for key across (keys df) 583 | for sym = (find-symbol (string-upcase (symbol-name key)) (find-package name)) 584 | collect (list (symbol-name key) 585 | (get sym :type) 586 | (get sym :unit) 587 | (get sym :label))))) 588 | (push '("--------" "----" "----" "-----------") rows) 589 | (push '("Variable" "Type" "Unit" "Label") rows) 590 | (print-table rows stream))))))))) 591 | #+allegro (setf excl:*enable-package-locked-errors* t) 592 | 593 | (defmethod sample ((df data-frame) n &key 594 | with-replacement 595 | skip-unselected) 596 | "Return N rows of DF taken at random. 597 | 598 | If WITH-REPLACEMENT is true, return a random sample with 599 | replacement (a \"draw\"). 600 | 601 | If WITH-REPLACEMENT is false, return a random sample without 602 | replacement (a \"deal\"). 603 | 604 | If SKIP-UNSELECTED is non-NIL, do not return the elements of DF that we not part of the selection. Non-NIL by default, as the typical use case is to split a data set into training and test data sets." 605 | (declare (data-frame df)) 606 | (let+ (((&dims nrow &ign) df)) 607 | (cond ((> n nrow) (error "Requested number of rows N is greater than rows in the data frame.")) 608 | ((= n 0) nil) 609 | ((= n 1) (select df (random nrow) t)) ;return unselected row? This seems an edge case. 610 | ((= n nrow) df) 611 | (t (let+ ((indices (linspace 0 (1- nrow) nrow)) 612 | (selected (sample indices n :with-replacement with-replacement)) 613 | (not-selected (set-difference* indices selected))) 614 | (if skip-unselected 615 | (select df selected t) 616 | (values (select df selected t) 617 | (select df not-selected t)))))))) 618 | -------------------------------------------------------------------------------- /src/defdf.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2021-2022 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | ;;; Functions for defining data frames in the statistical environment. 6 | 7 | (defvar *data-frames* nil 8 | "Global list of all data frames") 9 | 10 | (defvar *ask-on-redefine* t 11 | "If non-nil, the system will ask the user for confirmation before redefining a data frame") 12 | 13 | 14 | (defmacro defdf (name data &optional (documentation nil documentation-p)) 15 | "Define a data-frame and package by the same name. 16 | Also defines symbol-macros for variable access, e.g. mtcars:mpg" 17 | `(progn 18 | (when (and ,documentation 19 | (not (stringp ,documentation))) 20 | (error "Data frame documentation is not a STRING")) 21 | (when (and (boundp ',name) 22 | (not (typep (symbol-value ',name) 'data-frame))) 23 | (error "Data is not of type DATA-FRAME")) 24 | 25 | (declaim (special ,name)) 26 | (unless (and *ask-on-redefine* 27 | (boundp ',name) 28 | (not (y-or-n-p "Variable has a value. Redefine?"))) 29 | ,(when documentation-p 30 | `(setf (documentation ',name 'variable) ',documentation)) 31 | 32 | (setf (symbol-value ',name) ,data) 33 | (pushnew ',name *data-frames*) 34 | (defdf-env ',name nil)))) 35 | 36 | 37 | (defun defdf-env (data-frame old-keys) 38 | "Create a package with the same name as DATA-FRAME. Within it, create a symbol-macro for each column that will return the columns value. 39 | Can also be used to remove and update the environment as the DATA-FRAME changes in destructive operations" 40 | (let* ((df (symbol-value data-frame)) 41 | (df-name (symbol-name data-frame)) 42 | (pkg (find-package df-name)) 43 | (rem-keys (set-difference old-keys (coerce (keys df) 'list))) ;the keys that were replaced in an operation 44 | (add-keys (set-difference (coerce (keys df) 'list) old-keys)));the keys that were added in an operation 45 | 46 | (check-type data-frame symbol "a symbol") 47 | (check-type df data-frame "a data frame") 48 | 49 | (alexandria+:unlessf pkg (make-package df-name :use '())) 50 | (unless (slot-boundp df 'name) 51 | (setf (name df) df-name)) 52 | (when add-keys 53 | (maphash #'(lambda (key index) 54 | (declare (ignore index)) 55 | (when (member key add-keys) 56 | (let ((col (intern (symbol-name key) pkg))) 57 | (export col pkg) 58 | ;; (eval `(cl:define-symbol-macro ,col (aref (columns ,df) ,index)))))) ;remove me 59 | (eval `(cl:define-symbol-macro ,col (column ,df ',key)))))) ;there appears no other way than 'eval' 60 | (ordered-keys-table (slot-value df 'ordered-keys)))) 61 | 62 | ;; rename-column! is a special case that requires us to copy over the symbol plist 63 | (when rem-keys 64 | (mapcar #'(lambda (key) 65 | (let ((old-key (find-symbol (string key) pkg))) 66 | (when (= 1 (length rem-keys) (length add-keys)) ;rename! special case 67 | (setf (symbol-plist (find-symbol (string (first add-keys)) pkg)) 68 | (symbol-plist old-key))) 69 | (unintern old-key pkg))) 70 | rem-keys)) 71 | df)) 72 | 73 | (defun undef (&rest params) 74 | "Remove one or more data frames from the environment 75 | PARAMS: a list of DATA-FRAMEs 76 | 77 | Essentially reverses what DEFDF does. Returns the data frames that were removed. Don't use this if you have a data frame bound via DEFPARAMETER. 78 | Examples: 79 | (undef mtcars vlcars)" 80 | (dolist (df params) 81 | (check-type df data-frame "a data-frame") 82 | (assert (slot-boundp df 'name) () "name is not bound in the data-frame") 83 | 84 | (let* ((pkg (find-package (name df))) ;package for symbol-macros 85 | (df-sym (find-symbol (name df)))) 86 | (assert (member df-sym *data-frames*) 87 | () 88 | "~A is not known in the environment. It may have been defined without defdf" (name df)) 89 | 90 | ;; Remove the symbol macros 91 | (loop for var being the symbols in pkg 92 | do (unintern var)) 93 | (delete-package pkg) 94 | 95 | ;; Remove the data frame 96 | (setf *data-frames* (delete df-sym *data-frames*)) 97 | (makunbound df-sym))) 98 | params) 99 | 100 | 101 | ;;; In order to show data frame consistently with different settings 102 | ;;; for print-object, we need to control printing here. 103 | ;;; TODO move to data-frame.lisp 104 | (defun show-data-frames (&key (head nil) (stream *standard-output*)) 105 | "Print all data frames in the current environment in reverse order of creation, i.e. most recently created first. 106 | If HEAD is not NIL, print the first six rows, similar to the (head) function" 107 | (let ((*print-pretty* nil)) 108 | (if head 109 | (loop for df-sym in *data-frames* do 110 | ;; (loop for df in (sort (copy-list *data-frames*) #'string<=) do ;alphabetical order 111 | (let ((df (symbol-value df-sym))) 112 | (let* ((*print-lines* 6) 113 | (*print-pretty* t)) 114 | (format stream "~2&~A" (symbol-name df-sym)) 115 | (df:print-data df stream nil)))) 116 | (pprint-logical-block (stream nil) 117 | (pprint-logical-block (stream nil) 118 | (pprint-indent :block 2 stream) 119 | (loop for df-sym in *data-frames* 120 | do (progn 121 | (format stream "~@:_~A:~@:_" (symbol-name df-sym)) 122 | (fresh-line stream) 123 | (pprint-logical-block (stream nil :per-line-prefix " ") 124 | (format stream "~A" (symbol-value df-sym))) 125 | (fresh-line stream) 126 | (terpri stream)))))))) 127 | 128 | 129 | ;; Unexported. For debugging 130 | (defun show-symbols (pkg) 131 | "Print all symbols in PKG 132 | Example: (show-symbols 'mtcars)" 133 | (do-symbols (s (find-package (symbol-name pkg))) (print s))) 134 | 135 | 136 | 137 | 138 | #| This works, but we're not using at the moment 139 | (defun redefinitionp (df) 140 | ;;; Give the user some restarts in the event the data frame exists 141 | (handler-bind ((data-frame-exists 142 | #'(lambda (c) 143 | (invoke-debugger c)))) 144 | (cond ((and (or pkg 145 | (member df df::*data-frames*)) 146 | *ask-on-redefine*) 147 | (restart-case (signal 'data-frame-exists :data-frame df) 148 | (redefine () 149 | :report (lambda (s) 150 | (format s "Redefine ~A, losing all package symbols" df-name)) 151 | (undef df)) 152 | (new-name (n) 153 | :report "Use a different name" 154 | :interactive (lambda () 155 | (list 156 | (duologue:prompt "Enter a new name for the data frame: " 157 | :type 'string :validator #'df-exists-p :if-invalid #'invalid-df-name))) 158 | ;; (setf df n) 159 | (setf df-name n)) 160 | (dont-ask () 161 | :report (lambda (s) 162 | (format s "Redefine ~A and don't ask about data frame redefinitions again for this session" df-name)) 163 | :interactive (lambda () 164 | (list 165 | (duologue:prompt "Enter a new name for the data frame: " 166 | :type 'string :validator #'df-exists-p :if-invalid #'invalid-df-name))) 167 | (undef df) 168 | (setf df:*ask-on-redefine* nil)))) 169 | ((and (or (find-package df-name) 170 | (boundp df)) 171 | (not *ask-on-redefine*)) 172 | (warn "Redefining ~A" df-name) 173 | (undef df))))) 174 | |# 175 | -------------------------------------------------------------------------------- /src/filter.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2022 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | ;;; Apply predicates to the rows of a data frame 6 | 7 | (defun key-list (data form) 8 | "Return a list of keys used in REST, a form" 9 | (loop for key in (coerce (keys data) 'list) 10 | for variables = (flatten form) 11 | when (member key variables) 12 | collect key into columns 13 | finally (return columns))) 14 | 15 | (defun filter-rows (data body) 16 | "Filter DATA by a predicate given in BODY 17 | 18 | Example 19 | (data :mtcars) ; load a data set 20 | (head mtcars) ; view first 6 rows 21 | 22 | ;; MODEL MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB 23 | ;; 0 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 24 | ;; 1 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 25 | ;; 2 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26 | ;; 3 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 27 | ;; 4 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 28 | ;; 5 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 29 | 30 | (filter-rows mtcars '(< mpg 17)) 31 | # 32 | 33 | (head *) ; view first 6 rows of filtered data frame 34 | 35 | ;; MODEL MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB 36 | ;; 0 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 37 | ;; 1 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 38 | ;; 2 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 39 | ;; 3 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 40 | ;; 4 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 41 | ;; 5 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4" 42 | (let* ((variables (key-list data body)) 43 | (predicate (eval `(lambda ,variables ,body)))) 44 | (select data 45 | (mask-rows data variables predicate) 46 | t))) 47 | -------------------------------------------------------------------------------- /src/formatted-output.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2021 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | ;;; Formatted output of data frame structures without using the pretty printing subsystem 6 | 7 | ;;; These functions do not use the pretty printer facility, and can be 8 | ;;; used anywhere, i.e. Genera and other CL implementations. They are 9 | ;;; all prefaced by DF- to distinguish them from presentation functions. 10 | 11 | ;;; This example looks like it would map onto data frames easily: 12 | ;;; https://stackoverflow.com/questions/26894147/pretty-print-values-in-fixed-width-fields 13 | 14 | ;;; Table printing 15 | ;;; Taken from: https://github.com/vindarel/print-licenses/blob/master/print-licenses.lisp 16 | (defun aesthetic-string (thing) 17 | "Return the string used to represent `thing` when printing aesthetically." 18 | (format nil "~A" thing)) 19 | 20 | (defun weave (&rest lists) 21 | "Return a list whose elements alternate between each of the lists 22 | `lists`. Weaving stops when any of the lists has been exhausted." 23 | (apply #'mapcan #'list lists)) 24 | 25 | ;;; TODO: Make print-table take a vector-of-vectors instead of a list-of-lists 26 | ;;; This way we could simply pass in (rows df) for processing 27 | (defun print-table (rows &optional (stream *standard-output*)) 28 | "Print ROWS as a nicely-formatted table. 29 | Each row should have the same number of colums. 30 | Columns will be justified properly to fit the longest item in each one. 31 | Example: 32 | (print-table '((1 :red something) 33 | (2 :green more))) 34 | => 35 | 1 | RED | SOMETHING 36 | 2 | GREEN | MORE 37 | " 38 | (when rows 39 | (let ((column-sizes (reduce (curry #'mapcar #'max) 40 | (map 'list 41 | (curry #'mapcar (compose #'length #'aesthetic-string)) 42 | rows)))) 43 | (loop for row in rows do 44 | (format stream "~{~vA~^ | ~}~%" (weave column-sizes row))))) 45 | (values)) 46 | 47 | ;;; Data structure printing 48 | 49 | (defun df-print (df) 50 | "Print DF to *standard-output* in table format" 51 | (when df 52 | (let ((rows (loop for row across (rows df) 53 | collect (coerce row 'list)))) 54 | (print-table (push (column-names df) rows))))) 55 | 56 | 57 | ;;; TODO Summarise a data frame 58 | ;;; See https://github.com/Lisp-Stat/data-frame/issues/4 59 | 60 | 61 | 62 | ;;; 63 | ;;; Markdown 64 | ;;; 65 | 66 | (defun print-markdown (df &key (stream *standard-output*) (row-numbers nil)) 67 | "Print data frame DF, in markdown format, to STREAM 68 | If ROW-NUMBERS is true, also print row numbers as the first column" 69 | (let* ((array (aops:as-array df)) 70 | (col-types (aops:margin #'column-type-format array 0)) 71 | (*print-pprint-dispatch* (copy-pprint-dispatch)) 72 | (*print-pretty* t)) 73 | 74 | ;; For notebook printing, we only need four digits of accuracy 75 | (set-pprint-dispatch 'float (lambda (s f) (format s "~4f" f))) 76 | 77 | ;; Print column names 78 | (if row-numbers (format stream "| ")) 79 | (map nil #'(lambda (x) 80 | (format stream "| ~A " x)) 81 | (keys df)) 82 | (write-char #\| stream) 83 | (write-char #\Newline stream) 84 | 85 | ;; Print alignment 86 | (if row-numbers (format stream "| ---: ")) 87 | (map nil #'(lambda (x) 88 | (alexandria:switch (x :test #'string=) 89 | ("F" (format stream "| ---: ")) 90 | ("D" (format stream "| ---: ")) 91 | ("A" (format stream "| :--- ")))) 92 | col-types) 93 | (write-char #\| stream) 94 | (write-char #\Newline stream) 95 | 96 | ;; Print data 97 | (aops:each-index i 98 | (if row-numbers (format stream "| ~A " i)) 99 | (aops:each-index j 100 | (format stream "| ~A " (aref array i j))) 101 | (format stream " |~%")) 102 | (values))) 103 | 104 | -------------------------------------------------------------------------------- /src/missing.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2021, 2023 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | ;;; Handle missing data 6 | 7 | ;;; We're using :na as a sentinel value for missing data because we 8 | ;;; want to be able to have boolean data columns. 9 | 10 | ;;; TODO: consider a solution that includes true, false, NaN, etc. JSON 11 | ;;; libraries have varied ways of importing this data. 12 | 13 | (defgeneric missingp (data) 14 | (:method (data) 15 | (declare (ignore data)) 16 | nil) 17 | ;; (:method ((data null)) ;nil sentinel for missing value 18 | ;; t) 19 | (:method ((data (eql :na))) 20 | t) 21 | (:method ((data (eql :missing))) 22 | t) 23 | (:method ((data string)) 24 | nil) 25 | (:method ((data sequence)) 26 | (map 'vector #'missingp data)) 27 | (:method ((data array)) 28 | (aops:map-array data #'missingp)) 29 | (:method ((data data-frame)) 30 | (map-columns data #'missingp)) 31 | (:documentation "Return a vector indicating the position of any missing value indicators. They currently are :na and :missing")) 32 | 33 | (defmethod drop-missing ((df data-frame) &optional (predicate #'missingp)) 34 | "Remove all rows from DF that are missing values according to PREDICATE" 35 | (select df 36 | (bit-not (reduce #'bit-ior (map 'vector 37 | #'(lambda (x) 38 | (mask-rows df x predicate)) 39 | (keys df)))) 40 | t)) 41 | 42 | (defmethod drop-missing ((var vector) &optional (predicate #'missingp)) 43 | "Remove all values from VAR that are missing according to PREDICATE. 44 | Returns values: 45 | 1. the vector with missing values removed 46 | 2. the number of elements removed" 47 | (let ((len (length var)) 48 | (without-missing (remove-if predicate var))) 49 | (values without-missing (- len (length without-missing))))) 50 | 51 | (defun drop-na (df) 52 | "Remove all rows from DF that are missing values. Convenience R-like function." 53 | (drop-missing df)) 54 | 55 | (defmethod replace-missing ((df data-frame) map-alist) 56 | "Replace missing values with the values specified 57 | The alist consists of a column name in the CAR and the replacement value in the CDR 58 | Example: (replace-missing mtcarsm '((mpg . foo)))" 59 | (loop for (column . value) in map-alist 60 | do (setf df (replace-column df column (substitute value :na (column df column)))) 61 | (setf df (replace-column df column (substitute value :missing (column df column)))) 62 | finally (return df))) 63 | 64 | (defun ignore-missing (function &key (warn-user nil) (provide-restart nil)) 65 | "Wrap FUNCTION in a closure that removes missing values and applys FUNCTION in case any of the arguments are :MISSING, :NA or NIL to arguments. Intended for functions accepting vectors." 66 | (lambda (&rest arguments) 67 | ;; (check-type arguments array "a vector") 68 | (let ((missing? (notevery #'null (apply #'missingp arguments)))) 69 | (cond ((not missing?) (apply function arguments)) 70 | ((and missing? 71 | provide-restart) (handler-bind ((missing-data 72 | #'(lambda (c) 73 | (invoke-debugger c)))) 74 | (restart-case (signal 'missing-data :name "Input argument") 75 | (remove-missing () 76 | :report (lambda (s) 77 | (format s "Compute with missing data removed")) 78 | (let+ (((&values clean-vec num) (apply #'drop-missing arguments))) 79 | (when warn-user 80 | (warn (format nil "Removed ~D (~D%) missing values" 81 | num 82 | (round (* 100 (float (/ (length (car arguments))))))))) 83 | (funcall function clean-vec)))))) 84 | (missing? (let+ (((&values clean-vec num) (apply #'drop-missing arguments))) 85 | (when warn-user 86 | (warn (format nil "Removed ~D (~D%) missing values" 87 | num 88 | (round (* 100 (float (/ (length (car arguments))))))))) 89 | (funcall function clean-vec))))))) 90 | 91 | -------------------------------------------------------------------------------- /src/pkgdcl.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: CL-USER -*- 2 | ;;; Copyright (c) 2020-2024 by Symbolics Pte. Ltd. All rights reserved. 3 | ;;; SPDX-License-identifier: MS-PL 4 | 5 | (uiop:define-package #:data-frame 6 | (:nicknames #:df) ;we should remove this in favour of package-local-nicknames 7 | (:use 8 | #:cl 9 | #:alexandria 10 | #:alexandria+ 11 | #:anaphora 12 | #:let-plus 13 | #:select 14 | #:select-dev) 15 | (:import-from #:nu #:as-alist) 16 | (:import-from #:statistics-1 #:add #:make-sparse-counter #:weighted-quantiles) 17 | (:import-from #:aops #:&dims #:linspace) 18 | (:export 19 | 20 | ;; errors & conditions 21 | #:*large-data* ;maximum data size for a particular use case 22 | #:large-data 23 | #:duplicate-key 24 | #:key-not-found 25 | 26 | ;; generic - both data-vector and data-frame 27 | #:data-type 28 | #:columns 29 | #:map-columns 30 | #:column 31 | #:column-type 32 | #:column-names 33 | #:keys 34 | #:copy 35 | #:add-columns 36 | #:add-column! 37 | #:add-columns! 38 | 39 | ;; data-vector 40 | #:data-vector 41 | #:make-dv 42 | #:alist-dv 43 | #:plist-dv 44 | #:dv 45 | 46 | ;; data-frame 47 | #:data-frame 48 | #:doc-string ;same as CL:documentation, but for data-frames 49 | #:source ;return the source of the data 50 | #:make-df 51 | #:alist-df 52 | #:plist-df 53 | #:df 54 | #:matrix-df 55 | #:rows 56 | #:defdf 57 | #:undef 58 | #:defdf-env ;define package/symbol macros for environment 59 | #:show-data-frames 60 | #:random-sample 61 | #:*ask-on-redefine* ;if non-nil, ask user if a data frame will be overwritten 62 | 63 | ;; transformations for data-frames 64 | #:map-rows 65 | #:do-rows 66 | #:mask-rows 67 | #:count-rows 68 | #:filter-rows 69 | #:map-df 70 | ;; #:split ; don't export until we make 'split' generic and merge with aops 'split' 71 | #:replace-column! 72 | #:replace-column 73 | #:remove-column! 74 | #:remove-columns 75 | #:rename-column! 76 | #:replace-key! 77 | #:df-remove-duplicates 78 | 79 | ;; missing values 80 | #:missingp 81 | #:drop-missing 82 | #:replace-missing 83 | #:ignore-missing 84 | 85 | ;; Pretty printing 86 | #:print-data 87 | #:print-markdown 88 | #:print-array 89 | #:head 90 | #:tail 91 | #:short-string ;shorten a long doc-string by returning up to the first newline 92 | 93 | ;; Data properties 94 | #:heuristicate-types 95 | #:set-properties 96 | #:get-property 97 | #:set-property 98 | 99 | ;; Formatted output 100 | #:df-print 101 | #:df-summary 102 | #:name 103 | #:doc-string 104 | 105 | ;; Sequence utilities -- these should be in array-operations 106 | #:delete-nth ;delete the nth item from a sequence 107 | #:delete-nth* ;modify macro for delete-nth 108 | 109 | ;; Subsets of data 110 | #:filter-rows 111 | 112 | ;; Summary methods 113 | #:summary ;summarize a data frame 114 | #:summarize-column ;summarize a variable 115 | #:get-summaries ;return a list of variable summaries 116 | #:bit-variable-summary 117 | #:real-variable-summary 118 | #:factor-variable-summary 119 | #:generic-variable-summary 120 | #:*summary-minimum-length* ;columns are only summarised when longer than this 121 | #:*quantile-threshold* ;if the number of unique reals exceeds this threshold, they will be summarized with quantiles 122 | #:*distinct-threshold* ;if an integer variable has <= discrete values, consider it a categorical variable 123 | #:*distinct-maximum*)) ;if a string/factor variable has > *distinct-maximum* values, exclude it 124 | 125 | 126 | -------------------------------------------------------------------------------- /src/plist-aops.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2020-2021 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | ;;; Plists are convenient data structures for ad-hoc data 6 | ;;; analysis. These array operations definitions for plists make them 7 | ;;; easier to work with as arrays. A plist used as a data frame has 8 | ;;; the following format: 9 | 10 | ;;; '(:a #(1 2 3 4) 11 | ;;; :b #(foo bar baz quux)) 12 | 13 | ;;; Like a data frame, all columns must be of equal length, and the 14 | ;;; CAR of VALUE must point to a VECTOR. 15 | 16 | (defmacro ensure-plist (pl) 17 | `(progn 18 | (assert (plistp ,pl :allow-symbol-keys t) () "Argument is not a PLIST") 19 | (assert (notany #'null (map 'list #'vectorp (plist-values ,pl))) () "All values of PLIST must be vectors") 20 | (assert (apply #'= (map 'list #'length (plist-values ,pl))) () "All values of PLIST must be of equal length"))) 21 | 22 | (defmethod aops:as-array ((plist cons)) 23 | "Return the data values of PLIST as an array. The second VALUE is the keys. 24 | This method assumes that the plist is of the form (:col-name #(... ))" 25 | (ensure-plist plist) 26 | (values 27 | (nu:transpose (aops:combine (coerce (plist-values plist) 'vector))) 28 | (plist-keys plist))) 29 | 30 | (defmethod aops:nrow ((plist cons)) 31 | (ensure-plist plist) 32 | (length (cadr plist))) 33 | 34 | (defmethod aops:ncol ((plist cons)) 35 | (ensure-plist plist) 36 | (length (plist-keys plist))) 37 | 38 | (defmethod aops:dims ((plist cons)) 39 | (ensure-plist plist) 40 | (list (aops:nrow plist) (aops:ncol plist))) 41 | -------------------------------------------------------------------------------- /src/pprint.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2021-2022 by Symbolics Pte. Ltd. All rights reserved. 3 | ;;; SPDX-License-identifier: MS-PL 4 | (in-package #:data-frame) 5 | 6 | #+genera (eval-when (eval load compile) (xp::install)) 7 | 8 | ;;; NOTE: If using emacs/slime, set slime-repl-auto-right-margin to T 9 | ;;; so that the printing system will be able to figure out the right 10 | ;;; margins and format printing properly. This is done via an emacs 11 | ;;; customization. 12 | 13 | 14 | 15 | ;;; Pretty print data-frames and 2D arrays 16 | 17 | ;;; This is an extension of the original work by Tamas that used the 18 | ;;; Common Lisp pretty printing system to format data-frames for 19 | ;;; printing. As an experiment, I have to conclude that using the 20 | ;;; pretty printer this way is a bad idea. It was intended to format 21 | ;;; lisp code, not more complicated output. Future work should be done 22 | ;;; in formatted-output.lisp and these functions are considered 23 | ;;; deprecated. 24 | 25 | ;;; It is not easy to line up columns dynamically when printing them. 26 | ;;; Common lisp does have good control over justification and digits 27 | ;;; but, unlike, say, markdown, it's up to the programmer to 28 | ;;; explicitly specify the paddings. If you knew a priori the widths, 29 | ;;; this would be easy, but if you want to compute them at run-time 30 | ;;; things get a bit ugly. 31 | 32 | ;;; There are two patterns here for this: 33 | ;;; 1. create a new structure that contains the strings in their printed format (pprint-data-frame) 34 | ;;; 2. compute the formatting and apply each format to the original value as you loop through the structure (pprint-array) 35 | 36 | ;;; Neither is particularly efficient. Option 2 is better than option 37 | ;;; 1 in that regard, at the expense of some ugliness in the code and 38 | ;;; keeping track of formatting strings for each column. 39 | 40 | 41 | ;;; The following global variables control aspects of printing: 42 | ;; *print-length* - controls how many elements at a given level are printed (rows) 43 | ;; *print-lines* - controls how many output lines are printed (columns) 44 | (defparameter *max-digits* 4) ;max digits after decimal 45 | (defparameter *row-numbers-p* t) ;print row numbers 46 | 47 | 48 | ;;; 49 | ;;; Utility functions 50 | ;;; 51 | 52 | ;; Consider exporting this if it turns out to be generally useful 53 | (defun reverse-df (df) 54 | "Return DF with columns in reverse order" 55 | (make-df (reverse (keys df)) (reverse (columns df)))) 56 | 57 | (defun printer-status () 58 | "Print values of all the printer variables" 59 | (format t ";; *print-array* = ~a~%" *print-array*) 60 | (format t ";; *print-base* = ~a~%" *print-base*) 61 | (format t ";; *print-case* = ~a~%" *print-case*) 62 | (format t ";; *print-circle* = ~a~%" *print-circle*) 63 | (format t ";; *print-escape* = ~a~%" *print-escape*) 64 | (format t ";; *print-gensym* = ~a~%" *print-gensym*) 65 | (format t ";; *print-length* = ~a~%" *print-length*) 66 | (format t ";; *print-level* = ~a~%" *print-level*) 67 | (format t ";; *print-lines* = ~a~%" *print-lines*) 68 | (format t ";; *print-miser-width* = ~a~%" *print-miser-width*) 69 | (format t ";; *print-pprint-dispatch* = ~a~%" *print-pprint-dispatch*) 70 | (format t ";; *print-pretty* = ~a~%" *print-pretty*) 71 | (format t ";; *print-radix* = ~a~%" *print-radix*) 72 | (format t ";; *print-readably* = ~a~%" *print-readably*) 73 | (format t ";; *print-right-margin* = ~a~%" *print-right-margin*)) 74 | 75 | (defun 2d-array-to-list (array) 76 | "Convert an array to a list of lists" ; make flet? 77 | (loop for i below (array-dimension array 0) 78 | collect (loop for j below (array-dimension array 1) 79 | collect (aref array i j)))) 80 | 81 | ;;; 82 | ;;; Computing column formats 83 | ;;; 84 | 85 | ;;; To properly print float values we need to use the ~F directive to 86 | ;;; control the number of digits so that the decimal points will line 87 | ;;; up. 88 | 89 | ;;; There are three things we need to know to print a data frame: 90 | ;;; - width of the column 91 | ;;; - type of the column 92 | ;;; - maximum number of digits in any value in the column 93 | 94 | ;;; The three utility functions below give us this information. 95 | 96 | (defun max-width (sequence &optional (max-width nil)) 97 | "Return the largest printed string size of the elements of SEQUENCE, equal to or less than MAX-WIDTH" 98 | (let ((actual-width (apply #'max (map 'list #'(lambda (x) 99 | (length 100 | (typecase x 101 | (float (format nil "~F" x)) 102 | (t (format nil "~A" x))))) 103 | sequence)))) 104 | (if max-width 105 | (min actual-width max-width) 106 | actual-width))) 107 | 108 | (defun column-type-format (sequence) 109 | "Return a format string for the most specific type found in sequence 110 | Use this for sequences of type T to determine how to format the column." 111 | (when (bit-vector-p sequence) (return-from column-type-format "D")) 112 | (case (column-type sequence) 113 | (:single-float "F") 114 | (:double-float "F") 115 | (:integer "D") 116 | (:bit "D") 117 | (:symbol "S") 118 | (:catagorical "A") 119 | (t "A"))) 120 | 121 | (defun max-decimal (sequence &optional (max-digits nil)) 122 | "Return the maximum number of digits to the right of the decimal point in the numbers of SEQUENCE, equal to or less than MAX-DIGITS" 123 | (let ((actual-digits (apply #'max 124 | (map 'list 125 | #'(lambda (x) 126 | (typecase x 127 | (float (let* ((str (format nil "~F" x)) 128 | (pos (position #\. str))) 129 | (length (subseq str (1+ pos))))) 130 | (t 0))) 131 | sequence)))) 132 | (if max-digits 133 | (min actual-digits max-digits) 134 | actual-digits))) 135 | 136 | ;;; 137 | ;;; Formatters 138 | ;;; 139 | 140 | #+nil 141 | (defmethod default-column-formats (#-genera (array simple-array) 142 | #+genera (array 'simple-array)) 143 | "Return a list of formatting strings for ARRAY 144 | The method returns a set of default formatting strings using heuristics." 145 | ;; TODO: Delete once ACL is full working. 146 | (let ((col-widths (aops:margin #'max-width array 0)) 147 | (col-types (aops:margin #'column-type-format array 0)) 148 | (col-digits (aops:margin #'max-decimal array 0))) 149 | (map 'list #'(lambda (type width digits) 150 | (alexandria:switch (type :test #'string=) 151 | ("F" (format nil "~~~A,~AF" width digits)) 152 | ("D" (format nil "~~~AD" width)) 153 | ("A" (format nil "~~~AA" width)))) 154 | col-types col-widths col-digits))) 155 | 156 | (defun default-column-formats (array) 157 | "Return a list of formatting strings for ARRAY 158 | The method returns a set of default formatting strings using heuristics." 159 | (let ((col-widths (aops:margin #'max-width array 0)) 160 | (col-types (aops:margin #'column-type-format array 0)) 161 | (col-digits (aops:margin #'max-decimal array 0))) 162 | (map 'list #'(lambda (type width digits) 163 | (alexandria:switch (type :test #'string=) 164 | ("F" (format nil "~~~A,~AF" width digits)) 165 | ("D" (format nil "~~~AD" width)) 166 | ("A" (format nil "~~~AA" width)))) 167 | col-types col-widths col-digits))) 168 | 169 | 170 | 171 | ;;; 172 | ;;; Pretty printers 173 | ;;; 174 | 175 | (defun print-data (data-frame 176 | &optional 177 | (stream *standard-output*) 178 | (row-numbers-p *row-numbers-p*) 179 | (max-digits *max-digits*)) 180 | "Print DATA-FRAME to STREAM using the pretty printer" 181 | (check-type data-frame data) 182 | (let* ((col-names '()) 183 | (df (copy data-frame :key #'copy-array)) 184 | (*print-pretty* t)) 185 | (when row-numbers-p 186 | (setf df (reverse-df 187 | (add-columns (reverse-df df) ;add to end of DF 188 | '|| 189 | (if (> (aops:nrow df) 1) 190 | (aops:linspace 0 (1- (aops:nrow df)) (aops:nrow df)) 191 | #(0)))))) ;special case of 1 row 192 | (flet ((format-column (c) 193 | (let* ((width (max (max-width (column df c)) 194 | (length (symbol-name c)))) 195 | (type (column-type-format (column df c))) 196 | (digits (min (max-decimal (column df c)) 197 | max-digits)) 198 | (data-fmt (alexandria:switch (type :test #'string=) 199 | ("F" (format nil "~~~A,~AF" width digits)) 200 | ("D" (format nil "~~~AD" width)) 201 | ("A" (format nil "~~~AA" width)) 202 | ("S" (format nil "~~~A@A" width)))) 203 | (var-fmt (alexandria:switch (type :test #'string=) ; Why does SBCL complain about this, and only here? 204 | ("F" (format nil "~~~A@A" width)) 205 | ("D" (format nil "~~~A@A" width)) 206 | ("A" (format nil "~~~AA" width)) 207 | ("S" (format nil "~~~A@A" width))))) 208 | 209 | (replace-column! df c #'(lambda (cell) 210 | (if (eq cell :na) ;should take same justification as column 211 | (format nil (format nil "~~~A<~~A~~>" width) cell) 212 | (format nil data-fmt cell)))) 213 | (setf col-names (cons (format nil var-fmt (symbol-name c)) col-names))))) 214 | (map nil #'format-column (keys df))) 215 | 216 | (write-char #\Newline stream) 217 | (pprint-logical-block (stream (reverse col-names) :per-line-prefix ";;") 218 | (loop (pprint-exit-if-list-exhausted) 219 | (write-char #\Space stream) 220 | (write-string (pprint-pop) stream))) 221 | (write-char #\Newline stream) 222 | (pprint-logical-block (stream (2d-array-to-list (aops:as-array df))) 223 | (loop (pprint-exit-if-list-exhausted) 224 | (let ((row (pprint-pop))) 225 | (pprint-logical-block (stream row :per-line-prefix ";; ") 226 | (loop (pprint-exit-if-list-exhausted) 227 | (write-string (pprint-pop) stream) 228 | (write-char #\Space stream)))) 229 | (pprint-newline :mandatory stream))))) 230 | 231 | ;;; TODO: refactor this using the pattern in pprint-data-frame, 232 | ;;; incorporating the code in default-column-formats and adding an 233 | ;;; optional max-digits parameter 234 | (defun print-array (arr &optional (stream *standard-output*) (row-numbers-p *row-numbers-p*)) 235 | "Print an array to STREAM, defaulting to *standard-output*, in a tabular format. If ROW-NUMBERS-P, print row numbers." 236 | (let* ((array (cond (row-numbers-p (aops:stack-cols (aops:linspace 0 (1- (aops:nrow arr)) (aops:nrow arr)) arr)) 237 | (t arr))) 238 | (df-lists (2d-array-to-list array)) 239 | (data-fmt (default-column-formats array)) 240 | (f 0)) 241 | (pprint-logical-block (stream df-lists) 242 | (loop (pprint-exit-if-list-exhausted) 243 | (let ((row (pprint-pop))) 244 | (pprint-logical-block (stream row :per-line-prefix ";;") 245 | (loop (pprint-exit-if-list-exhausted) 246 | (write-char #\Space stream) 247 | (format stream (nth f data-fmt) (pprint-pop)) 248 | (incf f)))) 249 | (setf f 0) 250 | (write-char #\Newline stream))))) 251 | 252 | 253 | ;;; Not pretty printing per-se, but related 254 | 255 | (defmethod head ((df data-frame) &optional (n 6)) 256 | "Return the first N rows of DF; N defaults to 6" 257 | (let ((*print-pretty* t)) 258 | (if (< (aops:nrow df) 6) (setf n (aops:nrow df))) 259 | (print-data (select df (select:range 0 n) t)))) 260 | 261 | (defmethod tail ((df data-frame) &optional (n 6)) 262 | "Return the last N rows of DF; N defaults to 6" 263 | (let ((*print-pretty* t)) 264 | (if (< (aops:nrow df) 6) (setf n (aops:nrow df))) 265 | (print-data (select df (select:range (- n) nil) t)))) 266 | 267 | (defun short-string (str) 268 | "Return up to the first newline 269 | This is useful when docstrings are multi-line. By convention, the first line is the title." 270 | (subseq str 271 | 0 272 | (position #\newline str))) 273 | 274 | 275 | ;;; 276 | ;;; Pretty printer system configuration 277 | ;;; 278 | 279 | ;; After setting the dispatch tables below and *print-pretty* nil, we 280 | ;; get the following behaviour: 281 | ;; (print df) => one line summary of number of rows and columns 282 | ;; (pprint df) => print full table, subject to *print-lines* and *print-length* 283 | ;; print-object is equal to 'print' 284 | ;; Unfortunately, it also means that any CL function that sets 285 | ;; *print-pretty* to T, like 'describe' will print the entire 286 | ;; data-frame. 287 | 288 | ;; we can't use this and still have nice 'describe' for data frames 289 | #+nil 290 | (set-pprint-dispatch 'df:data-frame 291 | #'(lambda (s df) (print-data df s))) 292 | 293 | (set-pprint-dispatch '(array * 2) 294 | #'(lambda (s array) (print-array array s))) 295 | 296 | ;;; These only work when *print-pretty* is T 297 | #+nil 298 | (set-pprint-dispatch 'float 299 | #'(lambda (s obj) 300 | (format s "~,4F" obj))) 301 | #+nil 302 | (set-pprint-dispatch 'double-float 303 | #'(lambda (s obj) 304 | (format s "~,4F" obj))) 305 | 306 | ;;; 307 | ;;; Reference 308 | ;;; 309 | 310 | ;;; Use this to duplicate the behaviour in the documentation 311 | #+nil 312 | (defmethod print-object ((df data-frame) stream) 313 | "Print the first six rows of DATA-FRAME" 314 | (let* ((*print-lines* 6) 315 | (*print-pretty* t)) 316 | (df:pprint-data df stream nil))) 317 | 318 | -------------------------------------------------------------------------------- /src/properties.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2021-2022 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | (defun heuristicate-types (df) 6 | "Coerce each element of the column vectors to the most specific type in the column 7 | Often when reading in a data set, the types will be inconsistent in a variable. For example one observation might be 5.1, and another 5. Whilst mathmatically equivalent, we want our variable vectors to have identical types. The COLUMN-TYPE function returns the most specific numeric type in the column, then coerces all the vector elements to this type" 8 | (check-type df data-frame) 9 | (assert (slot-boundp df 'name) () "name is not bound in the data-frame") 10 | (let ((name (name df))) 11 | (map nil #'(lambda (key) 12 | (let* ((data (column df key)) 13 | (col-type (column-type data)) 14 | (sym (find-symbol (symbol-name key) (find-package name)))) 15 | (setf (get sym :type) col-type))) 16 | (keys df)))) 17 | 18 | (defun set-properties (df property prop-values) 19 | "Set the PROPERTY of each variable in DF to a value. The value is specified in the plist PROP-VALUES. 20 | Example: 21 | To give the variables in the mtcars dataset a unit, use: 22 | (set-properties mtcars :unit '(:mpg m/g 23 | :cyl :NA 24 | :disp in³ 25 | :hp hp 26 | :drat :NA 27 | :wt lb 28 | :qsec s 29 | :vs :NA 30 | :am :NA 31 | :gear :NA 32 | :carb :NA))" 33 | (check-type df data-frame) 34 | (assert (slot-boundp df 'name) () "name is not bound in the data-frame") 35 | (let ((name (name df))) 36 | (loop for (key value) on prop-values by #'cddr 37 | for sym = (find-symbol (symbol-name key) (find-package name)) 38 | do (assert (member property '(:type :label :unit)) 39 | (property) 40 | "A property must be one of: :type, :label or :unit") 41 | do (if (eq property :type) 42 | (check-type value df:data-type "a valid data variable type")) 43 | when sym do (setf (get sym property) value)))) 44 | 45 | 46 | ;;; User convenience functions 47 | 48 | (defun get-property (variable property) 49 | "Return the PROPERTY of data VARIABLE" 50 | (assert (member property '(:type :label :unit)) 51 | (property) 52 | "Property ~A is not one of: :type, :label or :unit" 53 | property) 54 | (get variable property)) 55 | 56 | (defun set-property (symbol value property) 57 | "Set the PROPERTY of SYMBOL to VALUE" 58 | (assert (member property '(:type :label :unit)) 59 | (property) 60 | "Property ~A is not one of: :type, :label or :unit" 61 | property) 62 | (if (eq property :type) 63 | (check-type value df:data-type "a valid data variable type")) 64 | (setf (get symbol property) value)) 65 | 66 | 67 | ;; Not exported 68 | (defun show-properties (df) 69 | "Show the standard properties of the variables of the data frame DF 70 | Standard properties are 'label', 'type' and 'unit'" 71 | (let* ((rows (loop for key across (keys df) 72 | collect (list (symbol-name key) 73 | (get key :type) 74 | (get key :unit) 75 | (get key :label))))) 76 | (push '("--------" "----" "----" "-----------") rows) 77 | (push '("Variable" "Type" "Unit" "Label") rows) 78 | (print-table rows))) 79 | 80 | 81 | -------------------------------------------------------------------------------- /src/summary.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2020-2022 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | ;;; Summary control variables 6 | (defparameter *summary-minimum-length* 10 7 | "Columns are only summarised when longer than this, otherwise they are returned as is.") 8 | 9 | (defparameter *quantile-threshold* 20 10 | "If the number of unique reals exceeds this threshold, they will be summarized with quantiles, otherwise print frequency table") 11 | 12 | (defparameter *distinct-threshold* 10 13 | "If an integer variable has <= discrete values, consider it a factor") 14 | 15 | (defparameter *distinct-maximum* 20 16 | "If a string/factor variable has > *distinct-maximum* values, exclude it") 17 | 18 | 19 | 20 | (defstruct variable-summary% 21 | "Base class for summarizing variables. Summary functions take SYMBOLs, rather than values, because the symbol property lists naming the variables have meta-data, e.g. type, label, that we want to print. Not exported." 22 | (length 0 :type array-index :read-only t) 23 | (missing 0 :type fixnum :read-only t) 24 | (name "" :type string :read-only t) 25 | (desc "" :type string :read-only t)) 26 | 27 | (defstruct (bit-variable-summary (:include variable-summary%) 28 | (:print-function 29 | (lambda (summary stream depth) 30 | (declare (ignore depth)) 31 | (let+ (((&structure-r/o bit-variable-summary- length count name desc) summary)) 32 | (format stream "~%~%~A (~A)~%" name desc) 33 | (princ "ones: " stream) 34 | (print-count-and-percentage stream count length))))) 35 | "Summary of a bit vector." 36 | (count 0 :type array-index :read-only t)) 37 | 38 | (defstruct (real-variable-summary (:include variable-summary%) 39 | (:print-function 40 | (lambda (summary stream depth) 41 | (declare (ignore depth)) 42 | (let+ (((&structure-r/o real-variable-summary- name desc length missing min q25 q50 mean q75 max) summary) 43 | (*print-pprint-dispatch* (copy-pprint-dispatch))) 44 | (set-pprint-dispatch 'float (lambda (s f) (format s "~,2f" f))) 45 | (format stream 46 | "~%~%~A (~A)~& n: ~W~& ~:_missing: ~W~& ~:_min=~:W~& ~:_q25=~:W~& ~:_q50=~:W~& ~:_mean=~:W~& ~:_q75=~:W~& ~:_max=~:W" 47 | name desc length missing min (ensure-not-ratio q25) (ensure-not-ratio q50) (ensure-not-ratio mean) 48 | (ensure-not-ratio q75) max))))) 49 | "Summary of a real elements (using quantiles)." 50 | ;; (count 0 :type array-index :read-only t) 51 | (min 0 :type real :read-only t) 52 | (q25 0 :type real :read-only t) 53 | (q50 0 :type real :read-only t) 54 | (mean 0 :type real :read-only t) 55 | (q75 0 :type real :read-only t) 56 | (max 0 :type real :read-only t)) 57 | 58 | (defstruct (factor-variable-summary (:include variable-summary%) 59 | (:print-function 60 | (lambda (summary stream depth) 61 | (declare (ignore depth)) 62 | (let+ (((&structure-r/o factor-variable-summary- name desc length element-count-alist) summary)) 63 | (format stream "~%~%~A (~A)~%" name desc) 64 | (loop for (element . count) in element-count-alist do 65 | (print-count-and-percentage stream count length) 66 | (format stream " x ~W, " element)))))) 67 | "Summary for factor variables" 68 | (element-count-alist nil :type list :read-only t)) 69 | 70 | (defstruct (generic-variable-summary (:include variable-summary%) 71 | (:print-function 72 | (lambda (summary stream depth) 73 | (declare (ignore depth)) 74 | (let+ (((&structure-r/o generic-variable-summary- length name quantiles element-count-alist) summary)) 75 | (unless (string= name "") 76 | (format stream "~%~%~:W: " name)) 77 | (when quantiles 78 | (let+ (((&structure-r/o real-variable-summary- length min q25 q50 q75 max) quantiles)) 79 | (format stream 80 | "~:W reals, ~:_min=~:W, ~:_q25=~:W, ~:_q50=~:W, ~:_q75=~:W, ~:_max=~:W" 81 | length min (ensure-not-ratio q25) (ensure-not-ratio q50) 82 | (ensure-not-ratio q75) max))) 83 | (when element-count-alist 84 | (loop for (element . count) in element-count-alist do 85 | (print-count-and-percentage stream count length) 86 | (format stream " x ~:W, " element))))))) 87 | "Summary for generic variables, i.e. those with mixed types." 88 | (quantiles nil :type (or null real-variable-summary) :read-only t) 89 | (element-count-alist nil :type list :read-only t)) 90 | 91 | 92 | 93 | 94 | 95 | (defgeneric column-length (column) 96 | (:documentation "Return the length of column.") 97 | (:method ((column vector)) 98 | (length column))) 99 | 100 | (defun print-count-and-percentage (stream count length) 101 | "Print COUNT as is and also as a rounded percentage" 102 | (format stream "~D (~D%)" count (round (/ count length) 1/100))) 103 | 104 | (defun ensure-not-ratio (real) 105 | "When REAL is a RATIO, convert it to a float, otherwise return as is. Used for printing." 106 | (if (typep real 'ratio) 107 | (float real 1.0) 108 | real)) 109 | 110 | 111 | ;;; TODO figure out where distinct and monotonic should reside. Probably num-utils. 112 | (defun distinct (column) 113 | "Returns the number of distinct elements in COLUMN, a symbol naming a variable. 114 | Useful for formatting columns for human output." 115 | (let+ ((data (eval column)) 116 | (table (aprog1 (make-sparse-counter :test #'equal) 117 | (map nil (curry #'add it) data))) 118 | (alist (as-alist table))) 119 | (length alist))) 120 | 121 | (defun monotonicp (column) 122 | "Returns T if all elements of COLUMN, a SYMBOL, are increasing monotonically 123 | Useful for detecting row numbers in imported data." 124 | (let ((data (eval column))) 125 | (if (not (every #'numberp data)) 126 | nil 127 | (loop for x across data 128 | for i = (1+ x) then (1+ i) 129 | always (= (1+ x) i))))) 130 | 131 | 132 | 133 | (defun summarize-real-variable (column) 134 | "Return a summary for a float variable" 135 | (let+ ((data (eval column)) 136 | (table (aprog1 (make-sparse-counter :test #'equal) 137 | (map nil (curry #'add it) data))) 138 | (alist (as-alist table)) 139 | ((&flet real? (item) (realp (car item)))) 140 | (reals-alist (remove-if (complement #'real?) alist)) 141 | (#(min q25 q50 q75 max) 142 | (weighted-quantiles 143 | (mapcar #'car reals-alist) 144 | (mapcar #'cdr reals-alist) 145 | #(0 1/4 1/2 3/4 1))) 146 | (mean-ignore-missing (ignore-missing #'mean :warn-user t))) 147 | 148 | (make-real-variable-summary 149 | :name (symbol-name column) 150 | :desc (if (get column :label) 151 | (get column :label) 152 | "") 153 | :length (length data) 154 | :missing (length (which data :predicate #'missingp)) 155 | :min min :q25 q25 :q50 q50 :mean (funcall mean-ignore-missing data) :q75 q75 :max max))) 156 | 157 | (defun summarize-factor-variable (column) 158 | "Return an alist of factor/count pairs" 159 | (let+ ((data (eval column)) 160 | (table (aprog1 (make-sparse-counter :test #'equal) 161 | (map nil (curry #'add it) data))) 162 | (alist (as-alist table))) 163 | 164 | (make-factor-variable-summary 165 | ;; :name (df::var-name (symbol-name column)) ; for df$variable style 166 | :name (symbol-name column) 167 | :desc (if (get column :label) (get column :label) "") 168 | :length (length data) 169 | :missing (length (which data :predicate #'missingp)) 170 | :element-count-alist (stable-sort alist #'>= :key #'cdr)))) 171 | 172 | (defun summarize-generic-variable (column &optional name) 173 | "Return an object that summarizes COLUMN of a DATA-FRAME. Primarily intended for printing, not analysis, returned values should print nicely. This function can be used on any type of column, even one with mixed types" 174 | (let+ ((data (eval column)) 175 | ;; (data column) 176 | (length (length data)) 177 | (table (aprog1 (make-sparse-counter :test #'equal) 178 | (map nil (curry #'add it) data))) 179 | (alist (as-alist table)) 180 | ((&flet real? (item) (realp (car item)))) 181 | (reals-alist (remove-if (complement #'real?) alist)) 182 | (quantiles (when (< *quantile-threshold* 183 | (length reals-alist)) 184 | (let+ ((#(min q25 q50 q75 max) 185 | (weighted-quantiles 186 | (mapcar #'car reals-alist) 187 | (mapcar #'cdr reals-alist) 188 | #(0 1/4 1/2 3/4 1)))) 189 | (make-real-variable-summary 190 | :length (reduce #'+ reals-alist :key #'cdr) 191 | :min min :q25 q25 :q50 q50 :q75 q75 :max max)))) 192 | (alist (stable-sort (if quantiles 193 | (remove-if #'real? alist) 194 | (copy-list alist)) 195 | #'>= :key #'cdr))) 196 | (make-generic-variable-summary :length length 197 | :name (if name 198 | name 199 | "") 200 | :quantiles quantiles 201 | :element-count-alist alist))) 202 | 203 | 204 | 205 | (defun summarize-column (column &optional name) 206 | "Return a summary struct for COLUMN" 207 | (let ((data (eval column)) 208 | (label (get column :label))) 209 | (case (get column :type) 210 | 211 | ;; Implementation types 212 | (:double-float (summarize-real-variable column)) 213 | (:single-float (summarize-real-variable column)) 214 | (:integer (summarize-real-variable column)) 215 | (:string (summarize-factor-variable column)) ;we really should remove this at some point. 216 | (:catagorical (summarize-factor-variable column)) 217 | (:bit (make-bit-variable-summary 218 | :name (if name 219 | name 220 | (symbol-name column)) 221 | :desc (if label 222 | label 223 | "") 224 | :length (length data) 225 | :count (count 1 data))) 226 | 227 | ;; Statistical types, note keyword in case key 228 | (:factor (summarize-factor-variable column)) 229 | (t (summarize-generic-variable column))))) 230 | 231 | (defun get-summaries (df) 232 | "Return a list of summaries of the variables in DF" 233 | (loop for key across (keys df) 234 | collect (summarize-column key))) 235 | 236 | #+nil 237 | (defmacro summary (df &optional (stream *standard-output*)) 238 | `(summarize-dataframe ,df ,stream)) 239 | 240 | ;; TODO add :remove-missing parameter so we can summarize in the early stages of data exploration. 241 | ;; Perhaps, if set, have it use summarize-generic-variable 242 | (defun summary (df &optional (stream *standard-output*)) 243 | "Print a summary of DF to STREAM, using heuristics for better formatting" 244 | (let* ((name (when (slot-boundp df 'name) (name df))) 245 | (pkg (find-package name))) 246 | (if pkg 247 | (loop for key across (keys df) 248 | for column = (find-symbol (string-upcase (symbol-name key)) pkg) 249 | for data = (column df key) 250 | for length = (length data) 251 | 252 | unless (or (= (length data) 253 | (distinct data)) ;exclude row names 254 | (monotonicp data)) ;exclude row numbers 255 | collect (case (get column :type) ;special cases 256 | (:double-float (if (< (distinct data) *quantile-threshold*) ;summarise as a factor 257 | (summarize-factor-variable column) 258 | (summarize-real-variable column))) 259 | (:single-float (if (< (distinct data) *quantile-threshold*) ;summarise as a factor 260 | (summarize-factor-variable column) 261 | (summarize-real-variable column))) 262 | (:integer (if (< (distinct data) *distinct-threshold*) ;summarise as a factor 263 | (summarize-factor-variable column) 264 | (summarize-real-variable column))) 265 | (t (summarize-column column)))) 266 | 267 | (loop for key across (keys df) ;no data-frame environment, use generic summary functions 268 | for data = (column df key) 269 | unless (or (= (length data) 270 | (distinct data)) ;exclude row names 271 | (monotonicp data) ;exclude row numbers 272 | (and (< *distinct-maximum* ;exclude row names with a few repeats 273 | (distinct data)) 274 | (equal :string (column-type (column df key))))) 275 | do (format stream "~%~A: ~A" key (summarize-generic-variable data)))))) 276 | 277 | -------------------------------------------------------------------------------- /src/utils.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*- 2 | ;;; Copyright (c) 2021-2022 by Symbolics Pte. Ltd. All rights reserved. 3 | (in-package #:data-frame) 4 | 5 | ;;; 6 | ;;; Insert & remove items from vectors and arrays -- TODO move to alexandria+ 7 | ;;; 8 | 9 | (defun delete-nth (sequence n) 10 | "Return SEQUENCE with the Nth item removed. 11 | Note: DELETE-IF makes no guarantee of being destructive, so you cannot rely on this side-effect. You must SETF the original sequence to the values returned from this function, or use the modify-macro DELETE-NTH*" 12 | (check-type sequence sequence) 13 | (delete-if (constantly t) sequence :start n :count 1)) 14 | 15 | (define-modify-macro delete-nth* (n) 16 | delete-nth 17 | "Destructively modifies N, a SEQUENCE by removing the Nth item. 18 | Example: 19 | LS-USER> (defparameter *v* #(a b c d)) 20 | *V* 21 | LS-USER> (delete-nth* *v* 1) 22 | #(A C D) 23 | LS-USER> *v* 24 | #(A C D)") 25 | 26 | 27 | 28 | ;;; 29 | ;;; Augment the type system 30 | ;;; 31 | 32 | ;;; cl:type-of is under-constrained and returns implementation 33 | ;;; specific results, so we use our own version 34 | ;;; See: https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node53.html 35 | (defun get-type (x) 36 | "Return the most specific type symbol for x" 37 | (typecase x 38 | (bit 'bit) 39 | (single-float 'single-float) 40 | (double-float 'double-float) 41 | ;; (fixnum 'fixnum) 42 | (integer 'integer) 43 | (ratio 'ratio) 44 | (complex 'complex) 45 | ;; (rational 'cl:rational) ;SBCL doesn't recognise this return type 46 | ;; (real 'real) ;SBCL doesn't recognise this return type 47 | 48 | (simple-string 'simple-string) 49 | (string 'string) 50 | 51 | (list 'list) 52 | (symbol 'symbol) 53 | (bit-vector 'bit-vector) 54 | (simple-vector 'simple-vector) 55 | (simple-array 'simple-array) 56 | (vector 'vector) 57 | (array 'array) 58 | (sequence 'sequence) 59 | 60 | (function 'function) 61 | (package 'package))) 62 | 63 | (defun types-in-column (seq) 64 | "Return a list of the types found in SEQ" 65 | (let (types) 66 | (loop for i across seq 67 | for type = (get-type i) 68 | when (not (member type types)) 69 | do (push type types) 70 | finally (return types)))) 71 | 72 | (defun column-type (col) 73 | "Return the most specific type found in COL" 74 | (let ((type-list (types-in-column col))) 75 | (cond ((member 'single-float type-list) :single-float) 76 | ((member 'double-float type-list) :double-float) 77 | ((member 'fixnum type-list) :fixnum) 78 | ((member 'integer type-list) :integer) 79 | ((member 'string type-list) :string) 80 | ((member 'bit type-list) :bit) 81 | ((member 'symbol type-list) :symbol) 82 | (t 'string )))) 83 | 84 | -------------------------------------------------------------------------------- /tests/data-frame-tests.lisp: -------------------------------------------------------------------------------- 1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME-TESTS -*- 2 | ;;; Copyright (c) 2020-2022 by Symbolics Pte. Ltd. All rights reserved. 3 | 4 | (uiop:define-package #:data-frame-tests 5 | (:use 6 | #:cl 7 | #:alexandria 8 | #:anaphora 9 | #:clunit 10 | #:let-plus 11 | #:select 12 | #:data-frame) 13 | (:import-from #:nu #:as-alist #:as-plist) 14 | (:export #:run)) 15 | 16 | (in-package :data-frame-tests) 17 | 18 | (defsuite data-frame ()) 19 | 20 | (defun run (&optional interactive?) 21 | (run-suite 'data-frame :use-debugger interactive?)) 22 | 23 | (defsuite data-vector (data-frame)) 24 | 25 | (deftest data-vector-basics (data-vector) 26 | (let ((dv (dv :a 1 :b 2 :c 3))) 27 | (assert-equalp '(:a 1 :b 2 :c 3) (as-plist dv)) 28 | (assert-equalp #(1 2 3) (columns dv)) 29 | (assert-equalp #(:a :b :c) (keys dv)) 30 | (assert-equalp '((:a . 1) (:b . 2) (:c . 3)) (as-alist dv)) 31 | (assert-equalp '(:a 1 :b 2) (as-plist (select dv #(:a :b)))) 32 | (assert-equalp 3 (select dv :c)) 33 | (let ((dv2 (map-columns dv #'1+))) 34 | (assert-equalp '(:a 2 :b 3 :c 4) (as-plist dv2)) 35 | (assert-true (typep dv2 'data-vector))))) 36 | 37 | 38 | (defsuite data-frame-basics (data-frame)) 39 | 40 | (deffixture data-frame-basics (@body) 41 | (let ((v #(1 2 3 4)) 42 | (b #*0110) 43 | (s #(a b c d))) 44 | @body)) 45 | 46 | (deftest data-frame-creation (data-frame-basics) 47 | (let* ((plist `(:vector ,v :symbols ,s :bits ,b)) 48 | (df (apply #'df plist)) 49 | (df-plist (plist-df plist)) 50 | (df-alist (alist-df (plist-alist plist)))) 51 | (assert-equalp #(:vector :symbols :bits) (keys df)) 52 | (assert-equalp (vector v s b) (columns df)) 53 | (assert-equalp (vector v s b) (columns df t)) 54 | (assert-equalp (vector v) (columns df #(:vector))) 55 | (assert-equalp v (columns df :vector)) 56 | (assert-equalp v (columns df -3)) 57 | (assert-equalp `(:vector ,v :symbols ,s :bits ,b) (as-plist df)) 58 | (assert-equalp `((:vector . ,v) (:symbols . ,s) (:bits . ,b)) (as-alist df)) 59 | (assert-equalp (as-alist df) (as-alist df-plist)) 60 | (assert-equalp (as-alist df) (as-alist df-alist)))) 61 | 62 | (deftest data-frame-select (data-frame-basics) 63 | (let ((df (df :vector v :symbols s))) 64 | (assert-equalp `(:vector ,v) (as-plist (select df t #(:vector)))) 65 | (assert-equalp `(:vector ,(select v b)) (as-plist (select df b #(0)))) 66 | (assert-equalp (select v b) (select df b :vector)) 67 | (assert-equalp '(:vector 3 :symbols c) (as-plist (select df 2 t))) 68 | (assert-equalp `(:vector #(2 4)) (as-plist 69 | (select df 70 | (mask-rows df :vector #'evenp) 71 | #(:vector)))) 72 | (assert-equalp #(2 4) (select df (mask-rows df :vector #'evenp) :vector)))) 73 | 74 | 75 | 76 | (defsuite data-frame-operations (data-frame)) 77 | 78 | (deftest data-frame-map (data-frame-operations) 79 | (let+ ((df (df :a #(2 3 5) 80 | :b #(7 11 13))) 81 | (product #(14 33 65)) 82 | ((&flet predicate (a b) (<= 30 (* a b)))) 83 | ((&flet predicate-bit (a b) (if (predicate a b) 1 0))) 84 | (mask #*011)) 85 | (assert-equalp product 86 | (map-rows df '(:a :b) #'*)) 87 | (assert-equalp `(:p ,product :m ,mask) 88 | (as-plist (map-df df '(:a :b) 89 | (lambda (a b) 90 | (vector (* a b) (predicate-bit a b))) 91 | '((:p fixnum) (:m bit))))) 92 | (let ((mask-rows (mask-rows df '(:a :b) #'predicate))) 93 | (assert-equal mask mask-rows) 94 | (assert-eq 'bit (array-element-type mask-rows))) 95 | (assert-equalp (count 1 mask) 96 | (count-rows df '(:a :b) #'predicate)))) 97 | 98 | #| 99 | ;This test is a bit of a challenge because the operation that determines the symbols for columns compares symbols in the DF package, but here 'a is created in the data-frame-tests package. Frequent using of filter-rows is enough to convince me that it's working properly, but if someone were to make this test work I'd be grateful. 100 | (deftest filter-rows (data-frame-operations) 101 | (let+ ((df (df :a #(2 3 5) 102 | :b #(7 11 13))) 103 | (plst '(:a #(2 3) 104 | :b #(7 11)))) 105 | (assert-equal plst (as-plist (filter-rows df (< a 4)))))) 106 | |# 107 | (deftest rename! (data-frame-operations) 108 | (let+ ((df (df :a #(2 3 5) 109 | :b #(7 11 13)))) 110 | (assert-equalp #(:a :b) (keys df)) 111 | (rename-column! df :c :a) 112 | (assert-equalp #(:c :b) (keys df) "Rename failed"))) 113 | 114 | 115 | 116 | 117 | (defsuite data-frame-add (data-frame)) 118 | 119 | (deffixture data-frame-add (@body) 120 | (let* ((plist1 '(:a #(1 2 3))) 121 | (plist2 '(:b #(4 5 6))) 122 | (plist12 (append plist1 plist2))) 123 | @body)) 124 | 125 | (defmacro test-add (add-function plist1 plist2 append?) 126 | "Macro for generating the following test: 127 | 128 | 1. create a data frame using plist1, 129 | 130 | 2. add plist2 using add-function to get a second data frame, 131 | 132 | 3. test that the first data frame is uncorrupted if append? is nil, or 133 | equivalent the concatenated plist otherwise, 134 | 135 | 4. test that the second data frame is equivalent to the concatenated plist. 136 | 137 | This is a comprehensive test of the add-column family of functions, 138 | destructive or non-destructive." 139 | (with-unique-names (df df2 plist12) 140 | (once-only (plist1 plist2) 141 | `(let* ((,df (plist-df ,plist1)) 142 | (,df2 (apply ,add-function ,df ,plist2)) 143 | (,plist12 (append ,plist1 ,plist2))) 144 | (assert-equal (if ,append? 145 | ,plist12 146 | ,plist1) 147 | (as-plist ,df)) 148 | (assert-equal ,plist12 149 | (as-plist ,df2)))))) 150 | 151 | (deftest add-column (data-frame-add) 152 | (test-add #'add-columns plist1 plist2 nil) 153 | (test-add #'add-column! plist1 plist2 t) 154 | (test-add #'add-columns! plist1 plist2 t) 155 | (assert-equalp '(:a #(1 2 3) :b #(4 5 6)) plist12)) ;this test is only here to quiet the compiler 156 | 157 | (deftest add-map (data-frame-add) 158 | (let* ((plist3 '(:c #(4 10 18))) 159 | (plist123 (append plist12 plist3))) 160 | ;; non-destructive 161 | (let* ((df (plist-df plist12)) 162 | (df2 (add-columns df :c (map-rows df '(:a :b) #'*)))) 163 | (assert-equalp plist12 (as-plist df)) 164 | (assert-equalp plist123 (as-plist df2))) 165 | ;; destructive, function 166 | (let* ((df (plist-df plist12)) 167 | (df2 (add-column! df :c (map-rows df '(:a :b) #'*)))) 168 | (assert-equalp plist123 (as-plist df)) 169 | (assert-equalp plist123 (as-plist df2))))) 170 | 171 | 172 | ;;; replace-column 173 | 174 | (defsuite replace-column (data-frame)) 175 | 176 | (deftest replace-column1 (replace-column) 177 | (let* ((plist '(:a #(1 2 3) :b #(5 7 11))) 178 | (df (plist-df plist)) 179 | ;; (df-copy (copy df)) 180 | (df1 (replace-column df :a #'1+)) 181 | (df2 (replace-column df :a #(2 3 4))) 182 | (expected-plist '(:a #(2 3 4) :b #(5 7 11)))) 183 | (assert-equalp expected-plist (as-plist df1)) 184 | (assert-equalp expected-plist (as-plist df2)) 185 | (assert-equalp plist (as-plist df)) 186 | ;; modify destructively 187 | (replace-column! df :a #'1+) 188 | (assert-false (equalp plist (as-plist df))) 189 | (assert-equalp expected-plist (as-plist df)))) 190 | 191 | 192 | (defsuite remove-columns (data-frame)) 193 | 194 | (deftest remove-columns1 (remove-columns) 195 | (let* ((plist '(:a #(1 2 3) :b #(5 7 11) :c #(100 200 300))) 196 | (df (plist-df plist)) 197 | ;; (df-copy (copy df)) 198 | (df1 (remove-columns df '(:a :b))) 199 | (expected-plist '(:c #(100 200 300)))) 200 | (assert-equalp expected-plist (as-plist df1)) 201 | (assert-equalp plist (as-plist df)) 202 | 203 | (remove-column! df :a) 204 | (assert-false (equalp plist (as-plist df))) 205 | (assert-equalp '(:b #(5 7 11) :c #(100 200 300)) (as-plist df)))) 206 | 207 | 208 | (defsuite replace-columns (data-frame)) 209 | (deftest replace-columns1 (remove-columns) 210 | (let* ((plist '(:a #(1 2 3) :b #(5 7 11) :c #(100 200 300))) 211 | (df (plist-df plist))) 212 | (replace-column! df :a #'1+) 213 | (assert-false (equalp plist (as-plist df))) 214 | (assert-equalp '(:a #(2 3 4) :b #(5 7 11) :c #(100 200 300)) (as-plist df)))) 215 | 216 | 217 | (defsuite remove-duplicates (data-frame)) 218 | 219 | (deftest remove-duplicates1 (remove-duplicates) 220 | (let* ((dup (make-df '(:a :b :c) '(#(a a 3) #(a a 3) #(a a 333)))) 221 | (df1 (df-remove-duplicates dup)) 222 | (expected-plist '(:a #(a 3) :b #(a 3) :c #(a 333)))) 223 | (assert-equalp expected-plist (as-plist df1)))) 224 | 225 | 226 | 227 | (defsuite pretty-print (data-frame)) 228 | 229 | (deftest print-df (pretty-print) 230 | (let* ((df1 (make-df '(:a :b :c) 231 | '(#(a a a) 232 | #(b b b) 233 | #(3 33 333)))) 234 | (*print-pretty* t) 235 | (expected-string " 236 | ;; A B C 237 | ;; 0 A B 3 238 | ;; 1 A B 33 239 | ;; 2 A B 333 240 | ") 241 | (actual-string (make-array '(0) :element-type 'base-char :fill-pointer 0 :adjustable t))) 242 | 243 | (with-output-to-string (s actual-string) 244 | (print-data df1 s)) 245 | (assert-true (string= expected-string actual-string)))) 246 | 247 | (deftest print-array (pretty-print) 248 | (let* ((array1 #2A(#(a a a) 249 | #(b b b) 250 | #(3 33 333))) 251 | (*print-pretty* t) 252 | (expected-string ";; 0 A A A 253 | ;; 1 B B B 254 | ;; 2 3 33 333 255 | ") 256 | (actual-string (make-array '(0) :element-type 'base-char :fill-pointer 0 :adjustable t))) 257 | 258 | (with-output-to-string (s actual-string) 259 | (print-array array1 s)) 260 | (assert-true (string= expected-string actual-string)))) 261 | 262 | 263 | (defsuite missing (data-frame)) 264 | 265 | (deftest d-frame (missing) 266 | (let ((df (matrix-df #(:a :b :c) #2A((1.7 2.1 :na) 267 | (5.4 :na 6.1) 268 | (:na 8.3 9.5))))) 269 | (assert-equalp #2A((nil nil t) 270 | (nil t nil) 271 | (t nil nil)) 272 | (aops:as-array (missingp df))))) 273 | 274 | (deftest array (missing) 275 | (let ((arr #2A((bar 1 2 3 4 :na 6) 276 | (foo 7 8 9 :na 10 11)))) 277 | (assert-equalp #2A((nil nil nil nil nil t nil) 278 | (nil nil nil nil t nil nil)) 279 | (missingp arr)))) 280 | 281 | (deftest vector (missing) 282 | (let ((vec #(0 1 2 3 4 :na 6))) 283 | (assert-equalp #(nil nil nil nil nil t nil) (missingp vec)))) 284 | 285 | (deftest ignore-missing (missing) 286 | (let ((vec #(0 1 2 3 4 :na 6))) 287 | (assert-true (nu:num= (funcall (ignore-missing #'mean) vec) 288 | 2.6666667 289 | nu:*num=-tolerance*)))) 290 | 291 | 292 | 293 | ;;; plist-aops 294 | 295 | (defsuite plist-aops (data-frame)) 296 | 297 | (deftest as-array (plist-aops) 298 | (let ((arr #2A((1 4) (2 5) (3 6))) 299 | (pl '(:a #(1 2 3) :b #(4 5 6)))) 300 | (assert-equalp arr (nth-value 0 (aops:as-array pl))))) 301 | 302 | (deftest dims (plist-aops) 303 | (let ((pl '(:a #(1 2 3) :b #(4 5 6)))) 304 | (assert-equalp '(3 2) (aops:dims pl)) 305 | (assert-equalp 3 (aops:nrow pl)) 306 | (assert-equalp 2 (aops:ncol pl)))) 307 | 308 | 309 | ;;; Data frame environment 310 | (defsuite define-data-frame (data-frame)) 311 | 312 | (deffixture define-data-frame (@body) 313 | (let* ((v #(1 2 3 4)) 314 | (b #*0110) 315 | (s #(a b c d))) 316 | @body)) 317 | 318 | (deftest define-data-frame (define-data-frame) 319 | (let* ((plist `(vector ,v symbols ,s bits ,b)) 320 | (df (apply #'df plist))) 321 | 322 | ;; Define an environment 323 | (df::defdf new-df df) 324 | (assert-true (boundp 'new-df) "The data frame was not bound") 325 | (assert-equalp (type-of (symbol-value 'new-df)) 'data-frame "new-df is not bound to a data-frame") 326 | 327 | ;; Ensure variables, package and macros were created 328 | (assert-true (find-package "NEW-DF") "Data frame package not found") 329 | (assert-equalp #(1 2 3 4) (eval (find-symbol "VECTOR" (find-package "NEW-DF")))) 330 | (assert-equalp #*0110 (eval (find-symbol "BITS" (find-package "NEW-DF")))) 331 | (assert-equalp #(a b c d) (eval (find-symbol "SYMBOLS" (find-package "NEW-DF")))) 332 | 333 | ;; ;; Remove symbol and package 334 | (let ((*package* (find-package "DATA-FRAME-TESTS"))) 335 | (df::undef new-df)) 336 | 337 | (assert-false (boundp (find-symbol "NEW-DF" (find-package "DATA-FRAME-TESTS"))) "The data frame was not removed") 338 | (assert-false (find-package "NEW-DF")))) 339 | 340 | 341 | --------------------------------------------------------------------------------