├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── data-frame.asd
├── description.text
├── docs
├── data-frame.epub
├── data-frame.info
├── data-frame.pdf
├── data-frame.texi
└── index.html
├── src
├── conditions.lisp
├── data-frame.lisp
├── defdf.lisp
├── filter.lisp
├── formatted-output.lisp
├── missing.lisp
├── pkgdcl.lisp
├── plist-aops.lisp
├── pprint.lisp
├── properties.lisp
├── summary.lisp
└── utils.lisp
└── tests
└── data-frame-tests.lisp
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # How to Contribute
2 |
3 | We'd love to accept your patches and contributions to this project. There are
4 | just a few small guidelines you need to follow.
5 |
6 | ## Contributor License Agreement
7 |
8 | Contributions to this project must be accompanied by a Contributor License
9 | Agreement. You (or your employer) retain the copyright to your contribution;
10 | this simply gives us permission to use and redistribute your contributions as
11 | part of the project.
12 |
13 | You generally only need to submit a CLA once, so if you've already submitted one
14 | (even if it was for a different project), you probably don't need to do it
15 | again.
16 |
17 | ## The Contribution Process
18 |
19 | The basic workflow is:
20 |
21 | 1. Fork the Project
22 | 2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
23 | 3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
24 | 4. Push to the Branch (`git push origin feature/AmazingFeature`)
25 | 5. Open a Pull Request
26 |
27 | With multiple contributors and the desire to maintain high quality
28 | code, we need a small bit of process. For example all submissions,
29 | including submissions by project members, require review. We use
30 | GitHub pull requests for this purpose. Consult [GitHub
31 | Help](https://help.github.com/articles/about-pull-requests/) for more
32 | information on using pull requests, and the [contributing
33 | code](https://lisp-stat.dev/docs/contributing/code/) page for more
34 | details.
35 |
36 | ## Community Guidelines
37 |
38 | This project follows a code of conduct that can be found on the
39 | [contributing](https://lisp-stat.dev/docs/contributing/) page.
40 |
41 | ## How to contribute
42 |
43 | See the [contribution
44 | guidelines](https://lisp-stat.dev/docs/contributing/)
45 | in the Lisp-Stat user guide.
46 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Microsoft Public License (MS-PL)
2 |
3 | This license governs use of the accompanying software. If you use the
4 | software, you accept this license. If you do not accept the license, do not
5 | use the software.
6 |
7 | 1. Definitions
8 | The terms "reproduce," "reproduction," "derivative works," and "distribution"
9 | have the same meaning here as under U.S. copyright law. A "contribution" is
10 | the original software, or any additions or changes to the software. A
11 | "contributor" is any person that distributes its contribution under this
12 | license. "Licensed patents" are a contributor's patent claims that read
13 | directly on its contribution.
14 |
15 | 2. Grant of Rights
16 | (A) Copyright Grant- Subject to the terms of this license, including the
17 | license conditions and limitations in section 3, each contributor grants
18 | you a non-exclusive, worldwide, royalty-free copyright license to
19 | reproduce its contribution, prepare derivative works of its contribution,
20 | and distribute its contribution or any derivative works that you create.
21 |
22 | (B) Patent Grant- Subject to the terms of this license, including the
23 | license conditions and limitations in section 3, each contributor grants
24 | you a non-exclusive, worldwide, royalty-free license under its licensed
25 | patents to make, have made, use, sell, offer for sale, import, and/or
26 | otherwise dispose of its contribution in the software or derivative works
27 | of the contribution in the software.
28 |
29 | 3. Conditions and Limitations
30 | (A) No Trademark License- This license does not grant you rights to use
31 | any contributors' name, logo, or trademarks.
32 |
33 | (B) If you bring a patent claim against any contributor over patents that
34 | you claim are infringed by the software, your patent license from such
35 | contributor to the software ends automatically.
36 |
37 | (C) If you distribute any portion of the software, you must retain all
38 | copyright, patent, trademark, and attribution notices that are present in
39 | the software.
40 |
41 | (D) If you distribute any portion of the software in source code form,
42 | you may do so only under this license by including a complete copy of
43 | this license with your distribution. If you distribute any portion of the
44 | software in compiled or object code form, you may only do so under a
45 | license that complies with this license.
46 |
47 | (E) The software is licensed "as-is." You bear the risk of using it. The
48 | contributors give no express warranties, guarantees, or conditions. You
49 | may have additional consumer rights under your local laws which this
50 | license cannot change. To the extent permitted under your local laws, the
51 | contributors exclude the implied warranties of merchantability, fitness
52 | for a particular purpose and non-infringement.
53 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | [![Contributors][contributors-shield]][contributors-url]
5 | [![Forks][forks-shield]][forks-url]
6 | [![Stargazers][stars-shield]][stars-url]
7 | [![Issues][issues-shield]][issues-url]
8 | [![MS-PL License][license-shield]][license-url]
9 | [![LinkedIn][linkedin-shield]][linkedin-url]
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
Data Frame
21 |
22 |
23 | Data frames for Common Lisp. A two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column
24 |
25 | Explore the docs »
26 |
27 |
28 | Report Bug
29 | ·
30 | Request Feature
31 | ·
32 | Reference Manual
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 | Table of Contents
41 |
42 | -
43 | About The Project
44 |
47 |
48 | -
49 | Getting Started
50 |
54 |
55 | - Usage
56 | - Roadmap
57 | - Resources
58 | - Contributing
59 | - License
60 | - Contact
61 |
62 |
63 |
64 |
65 |
66 |
67 | ## About the Project
68 |
69 | A data frame is a two dimensional data structure structure whose
70 | columns may be of differing types. It is similar to, and may be
71 | manipulated as, a Common Lisp array. Data frames hold tightly
72 | coupled collections of variables that all belong to one experiment.
73 |
74 | ### Built With
75 |
76 | * [anaphora](https://github.com/tokenrove/anaphora)
77 | * [alexandria](https://gitlab.common-lisp.net/alexandria/alexandria)
78 | * [array-operations](https://github.com/lisp-stat/array-operations)
79 | * [select](https://github.com/lisp-stat/select)
80 | * [let-plus](https://github.com/sharplispers/let-plus)
81 | * [numerical-utilities](https://github.com/lisp-stat/numerical-utilities)
82 |
83 |
84 |
85 | ## Getting Started
86 |
87 | To get a local copy up and running follow these steps:
88 |
89 | ### Prerequisites
90 |
91 | An ANSI Common Lisp implementation. Developed and tested with [SBCL](https://www.sbcl.org/).
92 |
93 | ### Installation
94 |
95 | Lisp-Stat is composed of several system that are designed to be
96 | independently useful. So you can, for example, use `data-frame` for
97 | any project needing to manipulate two dimensional data in a machine
98 | learning or statistical setting.
99 |
100 | #### Getting the source
101 |
102 | To make the system accessible to [ASDF](https://common-lisp.net/project/asdf/) (a build facility, similar to `make` in the C world), clone the repository in a directory ASDF knows about. By default the `common-lisp` directory in your home directory is known. Create this if it doesn't already exist and then:
103 |
104 | 1. Clone the repositories
105 | ```sh
106 | cd ~/common-lisp && \
107 | git clone https://github.com/Lisp-Stat/data-frame.git && \
108 | git clone https://github.com/Lisp-Stat/dfio.git
109 | ```
110 | 2. Reset the ASDF source-registry to find the new system (from the REPL)
111 | ```lisp
112 | (asdf:clear-source-registry)
113 | ```
114 | 3. Load the system
115 | ```lisp
116 | (asdf:load-system :data-frame)
117 | ```
118 |
119 | If you have installed the slime ASDF extensions, you can invoke this
120 | with a comma (',') from the slime REPL in emacs.
121 |
122 | #### Getting dependencies
123 |
124 | To get the third party systems that Lisp-Stat depends on you can use a dependency manager, such as [Quicklisp](https://www.quicklisp.org/beta/) or [CLPM](https://www.clpm.dev/) Once installed, get the dependencies with either of:
125 |
126 | ```lisp
127 | (clpm-client:sync :sources "clpi") ;sources may vary
128 | ```
129 |
130 | ```lisp
131 | (ql:quickload :data-frame)
132 | ```
133 |
134 | You need do this only once. After obtaining the dependencies, you can
135 | load the system with `ASDF` as described above without first syncing
136 | sources.
137 |
138 |
139 | ## Usage
140 |
141 | Create a data frame:
142 |
143 | ```lisp
144 | (make-df '(:a :b) '(#(1 2 3) #(10 20 30)))
145 |
146 | ```
147 |
148 | For more examples, please refer to the [Documentation](https://lisp-stat.dev/docs/manuals/data-frame).
149 |
150 |
151 |
152 | ## Roadmap
153 |
154 | See the [open issues](https://github.com/lisp-stat/data-frame/issues) for a list of proposed features (and known issues).
155 |
156 | ## Resources
157 |
158 | This system is part of the [Lisp-Stat](https://lisp-stat.dev/)
159 | project; that should be your first stop for information. Also see the
160 | [resources](https://lisp-stat.dev/resources) and
161 | [community](https://lisp-stat.dev/community) page for more
162 | information.
163 |
164 |
165 | ## Contributing
166 |
167 | Contributions are what make the open source community such an amazing place to be learn, inspire, and create. Any contributions you make are greatly appreciated. Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on the code of conduct, and the process for submitting pull requests.
168 |
169 |
170 | ## License
171 |
172 | Distributed under the MS-PL License. See [LICENSE](LICENSE) for more information.
173 |
174 |
175 |
176 |
177 | ## Contact
178 |
179 | Project Link: [https://github.com/lisp-stat/data-frame](https://github.com/lisp-stat/data-frame)
180 |
181 |
182 |
183 |
184 |
185 | [contributors-shield]: https://img.shields.io/github/contributors/lisp-stat/data-frame.svg?style=for-the-badge
186 | [contributors-url]: https://github.com/lisp-stat/data-frame/graphs/contributors
187 | [forks-shield]: https://img.shields.io/github/forks/lisp-stat/data-frame.svg?style=for-the-badge
188 | [forks-url]: https://github.com/lisp-stat/data-frame/network/members
189 | [stars-shield]: https://img.shields.io/github/stars/lisp-stat/data-frame.svg?style=for-the-badge
190 | [stars-url]: https://github.com/lisp-stat/data-frame/stargazers
191 | [issues-shield]: https://img.shields.io/github/issues/lisp-stat/data-frame.svg?style=for-the-badge
192 | [issues-url]: https://github.com/lisp-stat/data-frame/issues
193 | [license-shield]: https://img.shields.io/github/license/lisp-stat/data-frame.svg?style=for-the-badge
194 | [license-url]: https://github.com/lisp-stat/data-frame/blob/master/LICENSE
195 | [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555
196 | [linkedin-url]: https://www.linkedin.com/company/symbolics/
197 |
--------------------------------------------------------------------------------
/data-frame.asd:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: ASDF -*-
2 | ;;; Copyright (c) 2020-2024 by Symbolics Pte. Ltd. All rights reserved.
3 | ;;; SPDX-License-identifier: MS-PL
4 |
5 | (defsystem "data-frame"
6 | :version "1.3.3"
7 | :licence :MS-PL
8 | :author "Steve Nunez "
9 | :long-name "Data frames for Common Lisp"
10 | :description "A data manipulation library for statistical computing"
11 | :long-description #.(uiop:read-file-string
12 | (uiop:subpathname *load-pathname* "description.text"))
13 | :homepage "https://lisp-stat.dev/docs/manuals/data-frame"
14 | :source-control (:git "https://github.com/Lisp-Stat/data-frame.git")
15 | :bug-tracker "https://github.com/Lisp-Stat/data-frame/issues"
16 |
17 | :depends-on ("alexandria"
18 | "alexandria+"
19 | "anaphora"
20 | "array-operations"
21 | "num-utils"
22 | "select"
23 | "statistics"
24 | "let-plus"
25 | "duologue"
26 | #+sbcl "sb-cltl2")
27 | :serial t
28 | :pathname "src/"
29 | :components ((:file "pkgdcl")
30 | (:file "utils")
31 | (:file "data-frame")
32 | (:file "pprint")
33 | (:file "formatted-output")
34 | (:file "summary")
35 | (:file "defdf")
36 | (:file "conditions")
37 | (:file "properties")
38 | (:file "missing")
39 | (:file "filter")
40 | (:file "plist-aops"))
41 | :in-order-to ((test-op (test-op "data-frame/tests"))))
42 |
43 | (defsystem "data-frame/tests"
44 | :version "1.0.0"
45 | :description "Unit tests for DATA-FRAME."
46 | :author "Steve Nunez "
47 | :licence :MS-PL
48 | :depends-on ("data-frame"
49 | "clunit2")
50 | :serial t
51 | :pathname "tests/"
52 | :components ((:file "data-frame-tests"))
53 | :perform (test-op (o s)
54 | (let ((*print-pretty* t)) ;work around clunit issue #9
55 | (symbol-call :clunit :run-suite
56 | (find-symbol* :data-frame
57 | :data-frame-tests)
58 | :use-debugger nil))))
59 |
--------------------------------------------------------------------------------
/description.text:
--------------------------------------------------------------------------------
1 | A data frame is a common way of storing data for statistical analysis. Under the hood, a data frame is a vector of equal-length vectors. Each element of the vector can be thought of as a column and the length of each element of the vector is the number of rows. As a result, data frames can store different classes of objects in each column (i.e. numeric, character, factor). In essence, the easiest way to think of a data frame is as an Excel worksheet that contains columns of different types of data but are all of equal length rows.
2 |
3 | From a design perspective, Lisp-Stat's data frame is conceptually most similar to the 'tibble' from the tidyverse, but using Common Lisp idioms, style and syntax.
4 |
--------------------------------------------------------------------------------
/docs/data-frame.epub:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lisp-Stat/data-frame/1fae0f2d55a8228cf9e1018e7bc11686a23ff673/docs/data-frame.epub
--------------------------------------------------------------------------------
/docs/data-frame.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lisp-Stat/data-frame/1fae0f2d55a8228cf9e1018e7bc11686a23ff673/docs/data-frame.pdf
--------------------------------------------------------------------------------
/src/conditions.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2021-2024 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | (define-condition duplicate-key (error)
6 | ((key :initarg :key))
7 | (:report (lambda (condition stream)
8 | (format stream "Duplicate key ~A." (slot-value condition 'key))))
9 | (:documentation "An operation attempted to use a key that already exists in ORDERED-KEYS"))
10 |
11 | (define-condition key-not-found (error)
12 | ((key :initarg :key)
13 | (keys :initarg :keys))
14 | (:report (lambda (condition stream)
15 | (format stream "Key ~A not found, valid keys are ~A."
16 | (slot-value condition 'key)
17 | (slot-value condition 'keys))))
18 | (:documentation "An operation was attempted on a non-existant key."))
19 |
20 | (define-condition missing-data (error)
21 | ((name :initarg :name)
22 | (data :initarg :data))
23 | (:documentation "A variable has missing data, e.g. :na, nil")
24 | (:report (lambda (condition stream)
25 | (format stream "~A contains missing data"
26 | (slot-value condition 'name)))))
27 |
28 | (define-condition large-data (warning)
29 | ((data-size :initarg :data-size
30 | :reader data-size))
31 | (:report (lambda (condition stream)
32 | (format stream
33 | "You are attempting to embed a large number of data points (~D); the recommended maximum is ~D."
34 | (data-size condition) *large-data*)))
35 | (:documentation "A operation was requested on a data set large enough to potentially cause problems."))
36 |
37 | (define-condition data-frame-exists (error)
38 | ((data-frame :initarg :data-frame
39 | :reader data-frame))
40 | (:report (lambda (condition stream)
41 | (format stream
42 | "You are attempting to redefine ~A and *ask-on-redefine is ~A"
43 | (slot-value condition 'name)
44 | *ask-on-redefine*)))
45 | (:documentation "An attempt to redefine an existing data frame. Triggered if either the symbol is bound or the package exists."))
46 |
47 | ;;; Validation functions for duologue:prompt, which is used to get user input within restarts
48 | (defun df-exists-p (s)
49 | (if (or (find-package s)
50 | (member (find-symbol (string-upcase s)) df::*data-frames*))
51 | nil
52 | t))
53 |
54 | (defun invalid-df-name (s)
55 | "A user prompt, using DUOLOGUE, to select a valid data frame name."
56 | (duologue:say "~A names an existing data frame. Please choose another name" s))
57 |
58 |
59 |
60 |
--------------------------------------------------------------------------------
/src/data-frame.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2021-2023 by Symbolics Pte. Ltd. All rights reserved.
3 | ;;; SPDX-License-identifier: MS-PL
4 | (in-package #:data-frame)
5 |
6 | ;;; Note: tpapp never mentions the difference between a data-vector
7 | ;;; and a data-frame. As near as I can tell, a data-frame must have
8 | ;;; the same number of rows for each variable.
9 |
10 | (defparameter *large-data* most-positive-fixnum ;4611686018427387903
11 | "An indication that the data set is large for a particular use case.
12 | This should be bound by a user to the maximum number of data points they consider to be 'normal'. The function can then signal a large-data warning if it is exceeded.
13 |
14 | E.g. (let ((df:*large-data* 50000))
15 | (handler-bind ((large-data ...
16 | (some-data-operation ; this will signal if the data is too large
17 | (restart-bind ...")
18 |
19 | (deftype data-type ()
20 | "A statistical type for a data variable. All data columns must be one of these types if they are to be intepreted properly by Lisp-Stat"
21 | '(member :string :double-float :single-float :categorical :temporal :integer :bit))
22 |
23 |
24 | ;;; Ordered keys provide a mapping from column keys (symbols) to nonnegative
25 | ;;; integers. They are used internally and the corresponding interface is
26 | ;;; NOT EXPORTED.
27 |
28 | (defstruct (ordered-keys (:copier nil))
29 | "Representation of ordered keys
30 | Ordered keys provide a mapping from column keys (symbols) to nonnegative
31 | integers. They are used internally and the corresponding interface is
32 | NOT EXPORTED.
33 |
34 | TABLE maps keys to indexes, starting from zero."
35 | (table (make-hash-table :test #'eq) :type hash-table :read-only t))
36 |
37 | (defun keys-count (ordered-keys)
38 | "Number of keys."
39 | (hash-table-count (ordered-keys-table ordered-keys)))
40 |
41 | (defun keys-vector (ordered-keys)
42 | "Vector of all keys."
43 | (map 'vector #'car
44 | (sort (hash-table-alist (ordered-keys-table ordered-keys))
45 | #'<=
46 | :key #'cdr)))
47 |
48 | (defun key-index (ordered-keys key)
49 | "Return the index for KEY."
50 | (let+ (((&values index present?) (gethash key (ordered-keys-table ordered-keys))))
51 | (unless present?
52 | (error 'key-not-found :key key :keys (keys-vector ordered-keys)))
53 | index))
54 |
55 | (defun add-key! (ordered-keys key)
56 | "Modify ORDERED-KEYS by adding KEY."
57 | (check-type key symbol)
58 | (let+ (((&structure ordered-keys- table) ordered-keys)
59 | ((&values &ign present?) (gethash key table)))
60 | (when present?
61 | (error 'duplicate-key :key key))
62 | (setf (gethash key table) (hash-table-count table)))
63 | ordered-keys)
64 |
65 | (defun ordered-keys (keys)
66 | "Create an ORDERED-KEYS object from KEYS (a sequence)."
67 | (aprog1 (make-ordered-keys)
68 | (map nil (curry #'add-key! it) keys)))
69 |
70 | (defun copy-ordered-keys (ordered-keys)
71 | "Return a copy of ORDERED-KEYS"
72 | (let+ (((&structure ordered-keys- table) ordered-keys))
73 | (make-ordered-keys :table (copy-hash-table table))))
74 |
75 | (defun add-keys (ordered-keys &rest keys)
76 | "Add KEYS to ORDERED-KEYS"
77 | (aprog1 (copy-ordered-keys ordered-keys)
78 | (mapc (curry #'add-key! it) keys)))
79 |
80 | (defun remove-key! (ordered-keys key)
81 | "Modify ORDERED-KEYS by removing KEY."
82 | (check-type key symbol)
83 | (let+ (((&structure ordered-keys- table) ordered-keys)
84 | ((&values &ign present?) (gethash key table))
85 | (kv))
86 | (unless present?
87 | (error 'key-not-found :key key))
88 |
89 | (remhash key table)
90 | (setf kv (remove key (keys-vector ordered-keys)))
91 | (loop for key across kv
92 | for i = 0 then (1+ i)
93 | do (setf (gethash key table) i)))
94 | ordered-keys)
95 |
96 | ;;;
97 | ;;; Implementation of SELECT for ORDERED-KEYS
98 | ;;;
99 | (defmethod axis-dimension ((axis ordered-keys))
100 | (hash-table-count (ordered-keys-table axis)))
101 |
102 | (defmethod canonical-representation ((axis ordered-keys) (slice symbol))
103 | (if (select-reserved-symbol? slice)
104 | (call-next-method)
105 | (key-index axis slice)))
106 |
107 | (defmethod select ((ordered-keys ordered-keys) &rest selections)
108 | (let+ (((slice) selections))
109 | (ordered-keys
110 | (select (keys-vector ordered-keys)
111 | (canonical-representation ordered-keys slice)))))
112 |
113 |
114 | ;;; generic implementation -- the class is not exported, only the functionality
115 |
116 | (defclass data ()
117 | ((name ;same as the symbol-name
118 | :initarg nil
119 | :type string
120 | :accessor name
121 | :documentation "The name of the data frame. MUST be the same as the symbol whose value cell points to this data frame. This slot essentially allows us to go 'backwards' and get the symbol that names the data frame.")
122 | (ordered-keys
123 | :initarg :ordered-keys
124 | :type ordered-keys)
125 | (columns
126 | :initarg :columns
127 | :type vector)
128 | #+nil
129 | (doc-string ;I'd like this to be 'documentation', but that conflicts with the CL version
130 | :initarg :doc-string
131 | :initform ""
132 | :type string
133 | :accessor doc-string))
134 | (:documentation "This class is used for implementing both data-vector and data-frame, and represents an ordered collection of key-column pairs. Columns are not assumed to have any specific attributes. This class is not exported."))
135 |
136 | (defmethod aops:element-type ((data data)) ;should this be doing something?
137 | t)
138 |
139 | (defun make-data (class keys columns)
140 | "Create a DATA object from KEYS and COLUMNS. FOR INTERNAL USE. Always creates a copy of COLUMNS in order to ensure that it is an adjustable array with a fill pointer. KEYS are converted to ORDERED-KEYS if necessary."
141 | (let ((n-columns (length columns))
142 | (ordered-keys (atypecase keys
143 | (ordered-keys it)
144 | (t (ordered-keys it)))))
145 | (assert (= n-columns (keys-count ordered-keys)))
146 | (assert (subtypep class 'data))
147 | (make-instance class
148 | :ordered-keys ordered-keys
149 | :columns (make-array n-columns
150 | :adjustable t
151 | :fill-pointer n-columns
152 | :initial-contents columns))))
153 |
154 | (defgeneric check-column-compatibility (data column)
155 | (:documentation "Check if COLUMN is compatible with DATA.")
156 | (:method ((data data) column) ;no-op. Was Tamas going to implement it later?
157 | (declare (ignore column))))
158 |
159 | (defun ensure-arguments-alist (rest)
160 | "Recognizes the following and converts them to an alist:
161 |
162 | plist
163 | alist
164 | (plist)
165 | (alist)
166 | (data-frame)"
167 | (let+ (((&flet error% (&optional (list rest))
168 | (error "Could not interpret ~A as a plist or alist." list)))
169 | ((&flet ensure-alist (list)
170 | (typecase (car list)
171 | (cons rest)
172 | (symbol (plist-alist rest))
173 | (t (error% list))))))
174 | (if (cdr rest)
175 | (ensure-alist rest)
176 | (let ((first (car rest)))
177 | (typecase first
178 | (data (as-alist first))
179 | (cons (if (consp (cdr first))
180 | (ensure-alist first)
181 | rest)) ; first element of an alist
182 | (t (error%)))))))
183 |
184 |
185 | ;;; These two functions, alist-data & plist-data can be used to create
186 | ;;; DATA-VECTOR classes, that is a class works just like DATA-FRAME,
187 | ;;; but permits unequal length variables. To date, I haven't found
188 | ;;; much (any) need for a DATA-VECTOR and haven't done any
189 | ;;; improvements to that class.
190 | (defun alist-data (class alist)
191 | "Create an object of CLASS (subclass of DATA) from ALIST which contains key-column pairs."
192 | (assert alist () "Can't create an empty data frame.")
193 | (make-data class (mapcar #'car alist) (mapcar #'cdr alist)))
194 |
195 | (defun plist-data (class plist)
196 | "Create an object of CLASS (subclass of DATA) from PLIST which contains keys and columns, interleaved."
197 | (alist-data class (plist-alist plist)))
198 |
199 | (defun keys (data)
200 | "Return a vector of keys."
201 | (check-type data data)
202 | (copy-seq (keys-vector (slot-value data 'ordered-keys))))
203 |
204 | (defmethod as-alist ((data data))
205 | "Key-column pairs as an alist."
206 | (map 'list #'cons (keys data) (columns data)))
207 |
208 | (defun copy (data &key (key #'identity))
209 | "Copy data frame or vector. Keys are copied (and thus can be modified), columns or elements are copied using KEY, making the default give a shallow copy."
210 | (check-type data data)
211 | (let+ (((&slots-r/o ordered-keys columns) data)
212 | (new-data (make-data (class-of data)
213 | (copy-ordered-keys ordered-keys)
214 | (map 'vector key columns))))
215 | new-data))
216 |
217 | (defun column (data key)
218 | "Return column corresponding to key."
219 | (check-type data data)
220 | (let+ (((&slots-r/o ordered-keys columns) data))
221 | (aref columns (key-index ordered-keys key))))
222 |
223 | (defun (setf column) (column data key)
224 | "Set column corresponding to key."
225 | (check-column-compatibility data column)
226 | (let+ (((&slots-r/o ordered-keys columns) data))
227 | (setf (aref columns (key-index ordered-keys key)) column)))
228 |
229 | (defun columns (data &optional (slice t))
230 | "Return the columns of DATA as a vector, or a selection if given (keys are resolved)."
231 | (check-type data data)
232 | (let+ (((&slots-r/o ordered-keys columns) data))
233 | (select columns (canonical-representation ordered-keys slice))))
234 |
235 | (defun column-names (df)
236 | "Return a list of column names in DF, as strings"
237 | (map 'list #'symbol-name (keys df)))
238 |
239 | (defun rows (data)
240 | "Return the rows of DATA as a vector"
241 | (loop for index below (aops:nrow data)
242 | collecting (columns (select:select data index t)) into rows
243 | finally (return (coerce rows 'vector ))))
244 |
245 | (defun map-columns (data function &optional (result-class (class-of data)))
246 | "Map columns of DATA-FRAME or DATA-VECTOR using FUNCTION. The result is a new DATA-FRAME with the same keys."
247 | (make-data result-class (keys data) (map 'vector function (columns data))))
248 |
249 |
250 | (defun df-env-p (df)
251 | "Returns T if there is environment set-up for the data frame, or NIL if there isn't one."
252 | (if (and (slot-boundp df 'name)
253 | (find-package (string-upcase (slot-value df 'name))))
254 | t
255 | nil))
256 |
257 |
258 | (defun add-column! (data key column &optional update-env)
259 | "Modify DATA (a data-frame or data-vector) by adding COLUMN with KEY. Return DATA."
260 | (check-column-compatibility data column)
261 | (let+ (((&slots ordered-keys columns) data))
262 | (add-key! ordered-keys key)
263 | (vector-push-extend column columns))
264 | (when (and (df-env-p data)
265 | update-env)
266 | (defdf-env (find-symbol (string-upcase (name data))) nil))
267 | data)
268 |
269 | (defun add-columns! (data &rest keys-and-columns)
270 | "Modify DATA (a data-frame or data-vector) by adding columns with keys.
271 | If a data-frame environment exists, add columns to it as well."
272 | (mapc (lambda+ ((key . column))
273 | (add-column! data key column t))
274 | (ensure-arguments-alist keys-and-columns))
275 | data)
276 |
277 | (defun add-columns (data &rest keys-and-columns)
278 | "Return a new data-frame or data-vector with keys and columns added. Does not modify DATA."
279 | (aprog1 (copy data)
280 | (apply #'add-columns! it keys-and-columns)))
281 |
282 |
283 | (defun remove-columns (data keys)
284 | "Return a new data-frame or data-vector with keys and columns removed. Does not modify DATA.
285 | ARGS: DATA data frame
286 | KEYS list of keys (variables) to be removed"
287 | (select data t (reverse (set-difference (coerce (keys data) 'list) keys))))
288 |
289 | (defun remove-column! (data key)
290 | "Modify DATA (a data-frame or data-vector) by removing COLUMN with KEY. Return DATA."
291 | (check-type key symbol)
292 | (let+ (((&slots ordered-keys columns) data)
293 | (index (key-index ordered-keys key)))
294 | (remove-key! ordered-keys key)
295 | (delete-nth* columns index))
296 | (when (df-env-p data)
297 | (defdf-env (find-symbol (string-upcase (name data))) '(key)))
298 | data)
299 |
300 | ;;; TODO document me!
301 | (defun remove-columns! (data &rest keys)
302 | "Modify DATA (a data-frame or data-vector) by removing columns with keys.
303 | If a data-frame environment exists, add columns to it as well."
304 | (mapc (lambda (key)
305 | (remove-column! data key))
306 | (car keys))
307 | data)
308 |
309 |
310 |
311 | ;;; TODO take a plist for new & old and process it so we can rename
312 | ;;; multiple variables in one function call. See ensure-arguments-alist.
313 | (defmethod rename-column! (data new old) ;generic so will work on data-frame subclasses
314 | "Substitute NEW, a SYMBOL, for OLD in DF
315 |
316 | Useful when reading data files that have an empty or generated column name.
317 |
318 | Example: (rename-column! cars 'name :||) will replace an empty symbol with 'name"
319 | (let+ ((old-keys (coerce (keys data) 'list))
320 | (present? (member old old-keys)))
321 | (unless present?
322 | (error 'key-not-found :key old))
323 | (setf (slot-value data 'ordered-keys) (ordered-keys (substitute new old (keys data))))
324 | (when (df-env-p data)
325 | (defdf-env (find-symbol (string-upcase (name data))) old-keys))
326 | data))
327 |
328 |
329 |
330 | (defmacro define-data-subclass (class abbreviation)
331 | (check-type class symbol)
332 | (check-type abbreviation symbol)
333 | (let+ (((&flet fname (prefix)
334 | (symbolicate prefix '#:- abbreviation)))
335 | (alist-fn (fname '#:alist))
336 | (plist-fn (fname '#:plist)))
337 | `(progn
338 | (defclass ,class (data)
339 | ())
340 | (defun ,(fname '#:make) (keys columns)
341 | (make-data ',class keys columns))
342 | (defun ,alist-fn (alist)
343 | (alist-data ',class alist))
344 | (defun ,plist-fn (plist)
345 | (plist-data ',class plist))
346 | (defun ,abbreviation (&rest plist-or-alist)
347 | (if (alistp plist-or-alist)
348 | (,alist-fn plist-or-alist)
349 | (,plist-fn plist-or-alist))))))
350 |
351 | (define-data-subclass data-vector dv)
352 |
353 | (defmethod aops:dims ((data-vector data-vector))
354 | (list (length (columns data-vector))))
355 |
356 | (defmethod aops:as-array ((data-vector data-vector))
357 | (columns data-vector))
358 |
359 | (defmethod select ((data-vector data-vector) &rest slices)
360 | (let+ (((column-slice) slices)
361 | ((&slots-r/o ordered-keys columns) data-vector)
362 | (column-slice (canonical-representation ordered-keys column-slice)))
363 | (if (singleton-representation? column-slice)
364 | (aref columns column-slice)
365 | (make-dv (select ordered-keys column-slice)
366 | (select columns column-slice)))))
367 |
368 | (define-data-subclass data-frame df)
369 |
370 | (defmethod initialize-instance :after ((data-frame data-frame) &rest initargs)
371 | (declare (ignore initargs))
372 | (let+ (((first . rest) (coerce (columns data-frame) 'list))
373 | (length (column-length first)))
374 | (assert (every (lambda (column)
375 | (= length (column-length column)))
376 | rest)
377 | () "Columns don't have the same length.")))
378 |
379 | (defmethod aops:nrow ((data-frame data-frame))
380 | (column-length (aref (columns data-frame) 0)))
381 |
382 | (defmethod aops:ncol ((data-frame data-frame))
383 | (length (columns data-frame)))
384 |
385 | (defmethod aops:dims ((data-frame data-frame))
386 | (list (aops:nrow data-frame) (aops:ncol data-frame)))
387 |
388 | (defmethod aops:as-array ((data-frame data-frame))
389 | ;; Return contents of DATA-FRAME as a matrix.
390 | (nu:transpose (aops:combine (columns data-frame))))
391 |
392 | (defmethod check-column-compatibility ((data data-frame) column)
393 | (assert (= (column-length column) (aops:nrow data))))
394 |
395 | (defun matrix-df (keys matrix)
396 | "Convert a matrix to a data-frame with the given keys."
397 | (let+ ((columns (aops:split (nu:transpose matrix) 1)))
398 | (assert (length= columns keys))
399 | (alist-df (map 'list #'cons keys columns))))
400 |
401 |
402 | ;;; implementation of SELECT for DATA-FRAME
403 |
404 | (defmethod select ((data-frame data-frame) &rest slices)
405 | (let+ (((row-slice &optional (column-slice t)) slices)
406 | ((&slots-r/o ordered-keys columns) data-frame)
407 | (row-slice (canonical-representation (aops:nrow data-frame) row-slice))
408 | (column-slice (canonical-representation ordered-keys column-slice))
409 | (columns (select columns column-slice))
410 | ((&flet slice-column (column)
411 | (select column row-slice))))
412 | (if (singleton-representation? column-slice)
413 | (slice-column columns)
414 | (let ((keys (select ordered-keys column-slice))
415 | (columns (map 'vector #'slice-column columns)))
416 | (if (singleton-representation? row-slice)
417 | (make-dv keys columns)
418 | (make-df keys columns))))))
419 |
420 | ;;; TODO: (setfs election)
421 |
422 |
423 | ;;; mapping rows and adding columns
424 |
425 | (defun map-rows (data-frame keys function &key (element-type t))
426 | "Map rows using FUNCTION, on the columns corresponding to KEYS. Return the result with the given ELEMENT-TYPE."
427 | (let ((columns (map 'list (curry #'column data-frame) (ensure-list keys)))
428 | (nrow (aops:nrow data-frame)))
429 | (aprog1 (make-array nrow :element-type element-type)
430 | (dotimes (index nrow)
431 | (setf (aref it index)
432 | (apply function
433 | (mapcar (lambda (column)
434 | (ref column index))
435 | columns)))))))
436 |
437 | (defun do-rows (data-frame keys function)
438 | "Traverse rows from first to last, calling FUNCTION on the columns corresponding to KEYS. Return no values."
439 | (let ((columns (map 'list (curry #'column data-frame) (ensure-list keys)))
440 | (nrow (aops:nrow data-frame)))
441 | (dotimes (index nrow (values))
442 | (apply function
443 | (mapcar (lambda (column)
444 | (ref column index))
445 | columns)))))
446 |
447 | (defun map-df (data-frame keys function result-keys)
448 | "Map DATA-FRAME to another one by rows. Function is called on the columns corresponding to KEYS, and should return a sequence with the same length as RESULT-KEYS, which give the keys of the resulting data frame. RESULT-KETS should be either symbols, or of the format (symbol &optional (element-type t))."
449 | (let* ((columns (map 'list (curry #'column data-frame) keys))
450 | (nrow (aops:nrow data-frame))
451 | (result-keys-and-element-types
452 | (mapcar (lambda (key-and-element-type)
453 | (let+ (((key &optional (element-type t))
454 | (ensure-list key-and-element-type)))
455 | (cons key element-type)))
456 | result-keys))
457 | (result-columns (map 'vector
458 | (lambda (key-and-element-type)
459 | (make-array nrow
460 | :element-type (cdr key-and-element-type)))
461 | result-keys-and-element-types)))
462 | (dotimes (index nrow)
463 | (let ((result-row (apply function
464 | (mapcar (lambda (column)
465 | (ref column index))
466 | columns))))
467 | (assert (length= result-row result-columns))
468 | (map nil (lambda (result-column result-element)
469 | (setf (aref result-column index) result-element))
470 | result-columns result-row)))
471 | (make-df (mapcar #'car result-keys-and-element-types) result-columns)))
472 |
473 | (defun mask-rows (data-frame keys predicate)
474 | "Return a bit-vector containing the result of calling PREDICATE on rows of the columns corresponding to KEYS (0 for NIL, 1 otherwise)."
475 | (map-rows data-frame keys (compose (lambda (flag) ;translate nil/non-nil to 0 or 1
476 | (if flag 1 0))
477 | predicate)
478 | :element-type 'bit))
479 |
480 | (defun count-rows (data-frame keys predicate)
481 | "Count the number of rows for which PREDICATE called on the columns corresponding to KEYS returns non-NIL."
482 | (let ((columns (map 'list (curry #'column data-frame) (ensure-list keys))))
483 | (loop for index below (aops:nrow data-frame)
484 | count (apply predicate
485 | (mapcar (lambda (column)
486 | (ref column index))
487 | columns)))))
488 |
489 | (defun replace-column! (data key function-or-column &key (element-type t))
490 | "Modify column KEY of data-frame DATA by replacing it either with the given column, or applying the function to the current values (ELEMENT-TYPE is used.)"
491 | (let+ (((&slots ordered-keys columns) data)
492 | (index (key-index ordered-keys key)))
493 | (setf (aref columns index)
494 | (if (functionp function-or-column)
495 | (map-rows data key function-or-column :element-type element-type)
496 | (prog1 function-or-column
497 | (check-column-compatibility data function-or-column)))))
498 | data)
499 |
500 | (defun replace-column (data key function-or-column &key (element-type t))
501 | "Create a new data frame with new column KEY from data-frame DATA by replacing it either with the given column, or applying the function to the current values (ELEMENT-TYPE is used.)"
502 | (replace-column! (copy data) key function-or-column :element-type element-type))
503 |
504 | ;; We give this a df- prefix to avoid symbol clash with the CL
505 | ;; function (which, sadly, is not generic). After adding
506 | ;; df:delete-duplicates, shadow both in the package declaration.
507 | (defun df-remove-duplicates (data)
508 | "Return a modified copy of DATA from which any element (row, if a DATA-FRAME) that matches another element has been removed"
509 | (etypecase data
510 | (alexandria:proper-sequence (cl:remove-duplicates data)) ; Eventually shadow the CL version
511 | (df:data-frame (let* ((new-rows (cl:remove-duplicates (rows data) :test #'equalp))
512 | (new-array (make-array (list (length new-rows)
513 | (length (svref new-rows 0)))
514 | :initial-contents new-rows)))
515 | (matrix-df (keys data) new-array)))))
516 |
517 |
518 | ;; TODO
519 | #+nil
520 | (defun delete-duplicates (data)
521 | "Like REMOVE-DUPLICATES, but may modify DATA"
522 | ...)
523 |
524 | (defmethod print-object ((ordered-keys ordered-keys) stream)
525 | (print-unreadable-object (ordered-keys stream :type t)
526 | (format stream "~{~a~^, ~}" (coerce (keys-vector ordered-keys) 'list))))
527 |
528 | (defmethod print-object ((data-vector data-vector) stream)
529 | (let ((alist (as-alist data-vector)))
530 | (print-unreadable-object (data-vector stream :type t)
531 | (format stream "of ~d variables" (length alist)))))
532 |
533 | (defmethod print-object ((df data-frame) stream)
534 | "Print DATA-FRAME dimensions and type
535 | After defining this method it is permanently associated with data-frame objects"
536 | (print-unreadable-object (df stream :type t)
537 | (let ((description (and (slot-boundp df 'name)
538 | (documentation (find-symbol (name df)) 'variable))))
539 | (format stream
540 | "(~d observations of ~d variables)"
541 | (aops:nrow df)
542 | (aops:ncol df))
543 | (when description
544 | (format stream "~&~A" (short-string description))))))
545 |
546 | (defmethod describe-object ((df data-frame) stream)
547 | (let ((name (when (slot-boundp df 'name) (name df))))
548 | (format stream "~A~%" name)
549 | (format stream " A data-frame with ~D observations of ~D variables~2%" (aops:nrow df) (aops:ncol df))
550 | (when name
551 | (let ((rows (loop for key across (keys df)
552 | for sym = (find-symbol (string-upcase (symbol-name key)) (find-package name))
553 | collect (list (symbol-name key)
554 | (get sym :type)
555 | (get sym :unit)
556 | (get sym :label)))))
557 | (push '("--------" "----" "----" "-----------") rows)
558 | (push '("Variable" "Type" "Unit" "Label") rows)
559 | (print-table rows)))))
560 |
561 |
562 | ;;; KLUDGE ALERT
563 | ;;; This violates the spec. It's not easy at all to get good
564 | ;;; behaviour from describe. See code and comments in describe.lisp.
565 | #+allegro (setf excl:*enable-package-locked-errors* nil)
566 | (defmethod describe-object :after ((s symbol) stream)
567 | (unless (boundp s) (return-from describe-object))
568 | (unless (eq #+sbcl (SB-CLTL2:variable-information s)
569 | #+ccl (ccl:variable-information s)
570 | #+allegro (system:variable-information s)
571 | :symbol-macro)
572 | (let ((*print-pretty* t)
573 | (df (symbol-value s))
574 | (name (symbol-name s)))
575 |
576 | (pprint-logical-block (stream nil)
577 | (pprint-logical-block (stream nil)
578 | (pprint-indent :block 2 stream)
579 | (when-let ((pkg (find-package name)))
580 | (format stream "~@:_Variables: ~@:_")
581 | (pprint-logical-block (stream nil :per-line-prefix " ")
582 | (let ((rows (loop for key across (keys df)
583 | for sym = (find-symbol (string-upcase (symbol-name key)) (find-package name))
584 | collect (list (symbol-name key)
585 | (get sym :type)
586 | (get sym :unit)
587 | (get sym :label)))))
588 | (push '("--------" "----" "----" "-----------") rows)
589 | (push '("Variable" "Type" "Unit" "Label") rows)
590 | (print-table rows stream)))))))))
591 | #+allegro (setf excl:*enable-package-locked-errors* t)
592 |
593 | (defmethod sample ((df data-frame) n &key
594 | with-replacement
595 | skip-unselected)
596 | "Return N rows of DF taken at random.
597 |
598 | If WITH-REPLACEMENT is true, return a random sample with
599 | replacement (a \"draw\").
600 |
601 | If WITH-REPLACEMENT is false, return a random sample without
602 | replacement (a \"deal\").
603 |
604 | If SKIP-UNSELECTED is non-NIL, do not return the elements of DF that we not part of the selection. Non-NIL by default, as the typical use case is to split a data set into training and test data sets."
605 | (declare (data-frame df))
606 | (let+ (((&dims nrow &ign) df))
607 | (cond ((> n nrow) (error "Requested number of rows N is greater than rows in the data frame."))
608 | ((= n 0) nil)
609 | ((= n 1) (select df (random nrow) t)) ;return unselected row? This seems an edge case.
610 | ((= n nrow) df)
611 | (t (let+ ((indices (linspace 0 (1- nrow) nrow))
612 | (selected (sample indices n :with-replacement with-replacement))
613 | (not-selected (set-difference* indices selected)))
614 | (if skip-unselected
615 | (select df selected t)
616 | (values (select df selected t)
617 | (select df not-selected t))))))))
618 |
--------------------------------------------------------------------------------
/src/defdf.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2021-2022 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | ;;; Functions for defining data frames in the statistical environment.
6 |
7 | (defvar *data-frames* nil
8 | "Global list of all data frames")
9 |
10 | (defvar *ask-on-redefine* t
11 | "If non-nil, the system will ask the user for confirmation before redefining a data frame")
12 |
13 |
14 | (defmacro defdf (name data &optional (documentation nil documentation-p))
15 | "Define a data-frame and package by the same name.
16 | Also defines symbol-macros for variable access, e.g. mtcars:mpg"
17 | `(progn
18 | (when (and ,documentation
19 | (not (stringp ,documentation)))
20 | (error "Data frame documentation is not a STRING"))
21 | (when (and (boundp ',name)
22 | (not (typep (symbol-value ',name) 'data-frame)))
23 | (error "Data is not of type DATA-FRAME"))
24 |
25 | (declaim (special ,name))
26 | (unless (and *ask-on-redefine*
27 | (boundp ',name)
28 | (not (y-or-n-p "Variable has a value. Redefine?")))
29 | ,(when documentation-p
30 | `(setf (documentation ',name 'variable) ',documentation))
31 |
32 | (setf (symbol-value ',name) ,data)
33 | (pushnew ',name *data-frames*)
34 | (defdf-env ',name nil))))
35 |
36 |
37 | (defun defdf-env (data-frame old-keys)
38 | "Create a package with the same name as DATA-FRAME. Within it, create a symbol-macro for each column that will return the columns value.
39 | Can also be used to remove and update the environment as the DATA-FRAME changes in destructive operations"
40 | (let* ((df (symbol-value data-frame))
41 | (df-name (symbol-name data-frame))
42 | (pkg (find-package df-name))
43 | (rem-keys (set-difference old-keys (coerce (keys df) 'list))) ;the keys that were replaced in an operation
44 | (add-keys (set-difference (coerce (keys df) 'list) old-keys)));the keys that were added in an operation
45 |
46 | (check-type data-frame symbol "a symbol")
47 | (check-type df data-frame "a data frame")
48 |
49 | (alexandria+:unlessf pkg (make-package df-name :use '()))
50 | (unless (slot-boundp df 'name)
51 | (setf (name df) df-name))
52 | (when add-keys
53 | (maphash #'(lambda (key index)
54 | (declare (ignore index))
55 | (when (member key add-keys)
56 | (let ((col (intern (symbol-name key) pkg)))
57 | (export col pkg)
58 | ;; (eval `(cl:define-symbol-macro ,col (aref (columns ,df) ,index)))))) ;remove me
59 | (eval `(cl:define-symbol-macro ,col (column ,df ',key)))))) ;there appears no other way than 'eval'
60 | (ordered-keys-table (slot-value df 'ordered-keys))))
61 |
62 | ;; rename-column! is a special case that requires us to copy over the symbol plist
63 | (when rem-keys
64 | (mapcar #'(lambda (key)
65 | (let ((old-key (find-symbol (string key) pkg)))
66 | (when (= 1 (length rem-keys) (length add-keys)) ;rename! special case
67 | (setf (symbol-plist (find-symbol (string (first add-keys)) pkg))
68 | (symbol-plist old-key)))
69 | (unintern old-key pkg)))
70 | rem-keys))
71 | df))
72 |
73 | (defun undef (&rest params)
74 | "Remove one or more data frames from the environment
75 | PARAMS: a list of DATA-FRAMEs
76 |
77 | Essentially reverses what DEFDF does. Returns the data frames that were removed. Don't use this if you have a data frame bound via DEFPARAMETER.
78 | Examples:
79 | (undef mtcars vlcars)"
80 | (dolist (df params)
81 | (check-type df data-frame "a data-frame")
82 | (assert (slot-boundp df 'name) () "name is not bound in the data-frame")
83 |
84 | (let* ((pkg (find-package (name df))) ;package for symbol-macros
85 | (df-sym (find-symbol (name df))))
86 | (assert (member df-sym *data-frames*)
87 | ()
88 | "~A is not known in the environment. It may have been defined without defdf" (name df))
89 |
90 | ;; Remove the symbol macros
91 | (loop for var being the symbols in pkg
92 | do (unintern var))
93 | (delete-package pkg)
94 |
95 | ;; Remove the data frame
96 | (setf *data-frames* (delete df-sym *data-frames*))
97 | (makunbound df-sym)))
98 | params)
99 |
100 |
101 | ;;; In order to show data frame consistently with different settings
102 | ;;; for print-object, we need to control printing here.
103 | ;;; TODO move to data-frame.lisp
104 | (defun show-data-frames (&key (head nil) (stream *standard-output*))
105 | "Print all data frames in the current environment in reverse order of creation, i.e. most recently created first.
106 | If HEAD is not NIL, print the first six rows, similar to the (head) function"
107 | (let ((*print-pretty* nil))
108 | (if head
109 | (loop for df-sym in *data-frames* do
110 | ;; (loop for df in (sort (copy-list *data-frames*) #'string<=) do ;alphabetical order
111 | (let ((df (symbol-value df-sym)))
112 | (let* ((*print-lines* 6)
113 | (*print-pretty* t))
114 | (format stream "~2&~A" (symbol-name df-sym))
115 | (df:print-data df stream nil))))
116 | (pprint-logical-block (stream nil)
117 | (pprint-logical-block (stream nil)
118 | (pprint-indent :block 2 stream)
119 | (loop for df-sym in *data-frames*
120 | do (progn
121 | (format stream "~@:_~A:~@:_" (symbol-name df-sym))
122 | (fresh-line stream)
123 | (pprint-logical-block (stream nil :per-line-prefix " ")
124 | (format stream "~A" (symbol-value df-sym)))
125 | (fresh-line stream)
126 | (terpri stream))))))))
127 |
128 |
129 | ;; Unexported. For debugging
130 | (defun show-symbols (pkg)
131 | "Print all symbols in PKG
132 | Example: (show-symbols 'mtcars)"
133 | (do-symbols (s (find-package (symbol-name pkg))) (print s)))
134 |
135 |
136 |
137 |
138 | #| This works, but we're not using at the moment
139 | (defun redefinitionp (df)
140 | ;;; Give the user some restarts in the event the data frame exists
141 | (handler-bind ((data-frame-exists
142 | #'(lambda (c)
143 | (invoke-debugger c))))
144 | (cond ((and (or pkg
145 | (member df df::*data-frames*))
146 | *ask-on-redefine*)
147 | (restart-case (signal 'data-frame-exists :data-frame df)
148 | (redefine ()
149 | :report (lambda (s)
150 | (format s "Redefine ~A, losing all package symbols" df-name))
151 | (undef df))
152 | (new-name (n)
153 | :report "Use a different name"
154 | :interactive (lambda ()
155 | (list
156 | (duologue:prompt "Enter a new name for the data frame: "
157 | :type 'string :validator #'df-exists-p :if-invalid #'invalid-df-name)))
158 | ;; (setf df n)
159 | (setf df-name n))
160 | (dont-ask ()
161 | :report (lambda (s)
162 | (format s "Redefine ~A and don't ask about data frame redefinitions again for this session" df-name))
163 | :interactive (lambda ()
164 | (list
165 | (duologue:prompt "Enter a new name for the data frame: "
166 | :type 'string :validator #'df-exists-p :if-invalid #'invalid-df-name)))
167 | (undef df)
168 | (setf df:*ask-on-redefine* nil))))
169 | ((and (or (find-package df-name)
170 | (boundp df))
171 | (not *ask-on-redefine*))
172 | (warn "Redefining ~A" df-name)
173 | (undef df)))))
174 | |#
175 |
--------------------------------------------------------------------------------
/src/filter.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2022 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | ;;; Apply predicates to the rows of a data frame
6 |
7 | (defun key-list (data form)
8 | "Return a list of keys used in REST, a form"
9 | (loop for key in (coerce (keys data) 'list)
10 | for variables = (flatten form)
11 | when (member key variables)
12 | collect key into columns
13 | finally (return columns)))
14 |
15 | (defun filter-rows (data body)
16 | "Filter DATA by a predicate given in BODY
17 |
18 | Example
19 | (data :mtcars) ; load a data set
20 | (head mtcars) ; view first 6 rows
21 |
22 | ;; MODEL MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB
23 | ;; 0 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
24 | ;; 1 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
25 | ;; 2 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
26 | ;; 3 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
27 | ;; 4 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
28 | ;; 5 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
29 |
30 | (filter-rows mtcars '(< mpg 17))
31 | #
32 |
33 | (head *) ; view first 6 rows of filtered data frame
34 |
35 | ;; MODEL MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB
36 | ;; 0 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
37 | ;; 1 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
38 | ;; 2 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
39 | ;; 3 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
40 | ;; 4 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
41 | ;; 5 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4"
42 | (let* ((variables (key-list data body))
43 | (predicate (eval `(lambda ,variables ,body))))
44 | (select data
45 | (mask-rows data variables predicate)
46 | t)))
47 |
--------------------------------------------------------------------------------
/src/formatted-output.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2021 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | ;;; Formatted output of data frame structures without using the pretty printing subsystem
6 |
7 | ;;; These functions do not use the pretty printer facility, and can be
8 | ;;; used anywhere, i.e. Genera and other CL implementations. They are
9 | ;;; all prefaced by DF- to distinguish them from presentation functions.
10 |
11 | ;;; This example looks like it would map onto data frames easily:
12 | ;;; https://stackoverflow.com/questions/26894147/pretty-print-values-in-fixed-width-fields
13 |
14 | ;;; Table printing
15 | ;;; Taken from: https://github.com/vindarel/print-licenses/blob/master/print-licenses.lisp
16 | (defun aesthetic-string (thing)
17 | "Return the string used to represent `thing` when printing aesthetically."
18 | (format nil "~A" thing))
19 |
20 | (defun weave (&rest lists)
21 | "Return a list whose elements alternate between each of the lists
22 | `lists`. Weaving stops when any of the lists has been exhausted."
23 | (apply #'mapcan #'list lists))
24 |
25 | ;;; TODO: Make print-table take a vector-of-vectors instead of a list-of-lists
26 | ;;; This way we could simply pass in (rows df) for processing
27 | (defun print-table (rows &optional (stream *standard-output*))
28 | "Print ROWS as a nicely-formatted table.
29 | Each row should have the same number of colums.
30 | Columns will be justified properly to fit the longest item in each one.
31 | Example:
32 | (print-table '((1 :red something)
33 | (2 :green more)))
34 | =>
35 | 1 | RED | SOMETHING
36 | 2 | GREEN | MORE
37 | "
38 | (when rows
39 | (let ((column-sizes (reduce (curry #'mapcar #'max)
40 | (map 'list
41 | (curry #'mapcar (compose #'length #'aesthetic-string))
42 | rows))))
43 | (loop for row in rows do
44 | (format stream "~{~vA~^ | ~}~%" (weave column-sizes row)))))
45 | (values))
46 |
47 | ;;; Data structure printing
48 |
49 | (defun df-print (df)
50 | "Print DF to *standard-output* in table format"
51 | (when df
52 | (let ((rows (loop for row across (rows df)
53 | collect (coerce row 'list))))
54 | (print-table (push (column-names df) rows)))))
55 |
56 |
57 | ;;; TODO Summarise a data frame
58 | ;;; See https://github.com/Lisp-Stat/data-frame/issues/4
59 |
60 |
61 |
62 | ;;;
63 | ;;; Markdown
64 | ;;;
65 |
66 | (defun print-markdown (df &key (stream *standard-output*) (row-numbers nil))
67 | "Print data frame DF, in markdown format, to STREAM
68 | If ROW-NUMBERS is true, also print row numbers as the first column"
69 | (let* ((array (aops:as-array df))
70 | (col-types (aops:margin #'column-type-format array 0))
71 | (*print-pprint-dispatch* (copy-pprint-dispatch))
72 | (*print-pretty* t))
73 |
74 | ;; For notebook printing, we only need four digits of accuracy
75 | (set-pprint-dispatch 'float (lambda (s f) (format s "~4f" f)))
76 |
77 | ;; Print column names
78 | (if row-numbers (format stream "| "))
79 | (map nil #'(lambda (x)
80 | (format stream "| ~A " x))
81 | (keys df))
82 | (write-char #\| stream)
83 | (write-char #\Newline stream)
84 |
85 | ;; Print alignment
86 | (if row-numbers (format stream "| ---: "))
87 | (map nil #'(lambda (x)
88 | (alexandria:switch (x :test #'string=)
89 | ("F" (format stream "| ---: "))
90 | ("D" (format stream "| ---: "))
91 | ("A" (format stream "| :--- "))))
92 | col-types)
93 | (write-char #\| stream)
94 | (write-char #\Newline stream)
95 |
96 | ;; Print data
97 | (aops:each-index i
98 | (if row-numbers (format stream "| ~A " i))
99 | (aops:each-index j
100 | (format stream "| ~A " (aref array i j)))
101 | (format stream " |~%"))
102 | (values)))
103 |
104 |
--------------------------------------------------------------------------------
/src/missing.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2021, 2023 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | ;;; Handle missing data
6 |
7 | ;;; We're using :na as a sentinel value for missing data because we
8 | ;;; want to be able to have boolean data columns.
9 |
10 | ;;; TODO: consider a solution that includes true, false, NaN, etc. JSON
11 | ;;; libraries have varied ways of importing this data.
12 |
13 | (defgeneric missingp (data)
14 | (:method (data)
15 | (declare (ignore data))
16 | nil)
17 | ;; (:method ((data null)) ;nil sentinel for missing value
18 | ;; t)
19 | (:method ((data (eql :na)))
20 | t)
21 | (:method ((data (eql :missing)))
22 | t)
23 | (:method ((data string))
24 | nil)
25 | (:method ((data sequence))
26 | (map 'vector #'missingp data))
27 | (:method ((data array))
28 | (aops:map-array data #'missingp))
29 | (:method ((data data-frame))
30 | (map-columns data #'missingp))
31 | (:documentation "Return a vector indicating the position of any missing value indicators. They currently are :na and :missing"))
32 |
33 | (defmethod drop-missing ((df data-frame) &optional (predicate #'missingp))
34 | "Remove all rows from DF that are missing values according to PREDICATE"
35 | (select df
36 | (bit-not (reduce #'bit-ior (map 'vector
37 | #'(lambda (x)
38 | (mask-rows df x predicate))
39 | (keys df))))
40 | t))
41 |
42 | (defmethod drop-missing ((var vector) &optional (predicate #'missingp))
43 | "Remove all values from VAR that are missing according to PREDICATE.
44 | Returns values:
45 | 1. the vector with missing values removed
46 | 2. the number of elements removed"
47 | (let ((len (length var))
48 | (without-missing (remove-if predicate var)))
49 | (values without-missing (- len (length without-missing)))))
50 |
51 | (defun drop-na (df)
52 | "Remove all rows from DF that are missing values. Convenience R-like function."
53 | (drop-missing df))
54 |
55 | (defmethod replace-missing ((df data-frame) map-alist)
56 | "Replace missing values with the values specified
57 | The alist consists of a column name in the CAR and the replacement value in the CDR
58 | Example: (replace-missing mtcarsm '((mpg . foo)))"
59 | (loop for (column . value) in map-alist
60 | do (setf df (replace-column df column (substitute value :na (column df column))))
61 | (setf df (replace-column df column (substitute value :missing (column df column))))
62 | finally (return df)))
63 |
64 | (defun ignore-missing (function &key (warn-user nil) (provide-restart nil))
65 | "Wrap FUNCTION in a closure that removes missing values and applys FUNCTION in case any of the arguments are :MISSING, :NA or NIL to arguments. Intended for functions accepting vectors."
66 | (lambda (&rest arguments)
67 | ;; (check-type arguments array "a vector")
68 | (let ((missing? (notevery #'null (apply #'missingp arguments))))
69 | (cond ((not missing?) (apply function arguments))
70 | ((and missing?
71 | provide-restart) (handler-bind ((missing-data
72 | #'(lambda (c)
73 | (invoke-debugger c))))
74 | (restart-case (signal 'missing-data :name "Input argument")
75 | (remove-missing ()
76 | :report (lambda (s)
77 | (format s "Compute with missing data removed"))
78 | (let+ (((&values clean-vec num) (apply #'drop-missing arguments)))
79 | (when warn-user
80 | (warn (format nil "Removed ~D (~D%) missing values"
81 | num
82 | (round (* 100 (float (/ (length (car arguments)))))))))
83 | (funcall function clean-vec))))))
84 | (missing? (let+ (((&values clean-vec num) (apply #'drop-missing arguments)))
85 | (when warn-user
86 | (warn (format nil "Removed ~D (~D%) missing values"
87 | num
88 | (round (* 100 (float (/ (length (car arguments)))))))))
89 | (funcall function clean-vec)))))))
90 |
91 |
--------------------------------------------------------------------------------
/src/pkgdcl.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: CL-USER -*-
2 | ;;; Copyright (c) 2020-2024 by Symbolics Pte. Ltd. All rights reserved.
3 | ;;; SPDX-License-identifier: MS-PL
4 |
5 | (uiop:define-package #:data-frame
6 | (:nicknames #:df) ;we should remove this in favour of package-local-nicknames
7 | (:use
8 | #:cl
9 | #:alexandria
10 | #:alexandria+
11 | #:anaphora
12 | #:let-plus
13 | #:select
14 | #:select-dev)
15 | (:import-from #:nu #:as-alist)
16 | (:import-from #:statistics-1 #:add #:make-sparse-counter #:weighted-quantiles)
17 | (:import-from #:aops #:&dims #:linspace)
18 | (:export
19 |
20 | ;; errors & conditions
21 | #:*large-data* ;maximum data size for a particular use case
22 | #:large-data
23 | #:duplicate-key
24 | #:key-not-found
25 |
26 | ;; generic - both data-vector and data-frame
27 | #:data-type
28 | #:columns
29 | #:map-columns
30 | #:column
31 | #:column-type
32 | #:column-names
33 | #:keys
34 | #:copy
35 | #:add-columns
36 | #:add-column!
37 | #:add-columns!
38 |
39 | ;; data-vector
40 | #:data-vector
41 | #:make-dv
42 | #:alist-dv
43 | #:plist-dv
44 | #:dv
45 |
46 | ;; data-frame
47 | #:data-frame
48 | #:doc-string ;same as CL:documentation, but for data-frames
49 | #:source ;return the source of the data
50 | #:make-df
51 | #:alist-df
52 | #:plist-df
53 | #:df
54 | #:matrix-df
55 | #:rows
56 | #:defdf
57 | #:undef
58 | #:defdf-env ;define package/symbol macros for environment
59 | #:show-data-frames
60 | #:random-sample
61 | #:*ask-on-redefine* ;if non-nil, ask user if a data frame will be overwritten
62 |
63 | ;; transformations for data-frames
64 | #:map-rows
65 | #:do-rows
66 | #:mask-rows
67 | #:count-rows
68 | #:filter-rows
69 | #:map-df
70 | ;; #:split ; don't export until we make 'split' generic and merge with aops 'split'
71 | #:replace-column!
72 | #:replace-column
73 | #:remove-column!
74 | #:remove-columns
75 | #:rename-column!
76 | #:replace-key!
77 | #:df-remove-duplicates
78 |
79 | ;; missing values
80 | #:missingp
81 | #:drop-missing
82 | #:replace-missing
83 | #:ignore-missing
84 |
85 | ;; Pretty printing
86 | #:print-data
87 | #:print-markdown
88 | #:print-array
89 | #:head
90 | #:tail
91 | #:short-string ;shorten a long doc-string by returning up to the first newline
92 |
93 | ;; Data properties
94 | #:heuristicate-types
95 | #:set-properties
96 | #:get-property
97 | #:set-property
98 |
99 | ;; Formatted output
100 | #:df-print
101 | #:df-summary
102 | #:name
103 | #:doc-string
104 |
105 | ;; Sequence utilities -- these should be in array-operations
106 | #:delete-nth ;delete the nth item from a sequence
107 | #:delete-nth* ;modify macro for delete-nth
108 |
109 | ;; Subsets of data
110 | #:filter-rows
111 |
112 | ;; Summary methods
113 | #:summary ;summarize a data frame
114 | #:summarize-column ;summarize a variable
115 | #:get-summaries ;return a list of variable summaries
116 | #:bit-variable-summary
117 | #:real-variable-summary
118 | #:factor-variable-summary
119 | #:generic-variable-summary
120 | #:*summary-minimum-length* ;columns are only summarised when longer than this
121 | #:*quantile-threshold* ;if the number of unique reals exceeds this threshold, they will be summarized with quantiles
122 | #:*distinct-threshold* ;if an integer variable has <= discrete values, consider it a categorical variable
123 | #:*distinct-maximum*)) ;if a string/factor variable has > *distinct-maximum* values, exclude it
124 |
125 |
126 |
--------------------------------------------------------------------------------
/src/plist-aops.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2020-2021 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | ;;; Plists are convenient data structures for ad-hoc data
6 | ;;; analysis. These array operations definitions for plists make them
7 | ;;; easier to work with as arrays. A plist used as a data frame has
8 | ;;; the following format:
9 |
10 | ;;; '(:a #(1 2 3 4)
11 | ;;; :b #(foo bar baz quux))
12 |
13 | ;;; Like a data frame, all columns must be of equal length, and the
14 | ;;; CAR of VALUE must point to a VECTOR.
15 |
16 | (defmacro ensure-plist (pl)
17 | `(progn
18 | (assert (plistp ,pl :allow-symbol-keys t) () "Argument is not a PLIST")
19 | (assert (notany #'null (map 'list #'vectorp (plist-values ,pl))) () "All values of PLIST must be vectors")
20 | (assert (apply #'= (map 'list #'length (plist-values ,pl))) () "All values of PLIST must be of equal length")))
21 |
22 | (defmethod aops:as-array ((plist cons))
23 | "Return the data values of PLIST as an array. The second VALUE is the keys.
24 | This method assumes that the plist is of the form (:col-name #(... ))"
25 | (ensure-plist plist)
26 | (values
27 | (nu:transpose (aops:combine (coerce (plist-values plist) 'vector)))
28 | (plist-keys plist)))
29 |
30 | (defmethod aops:nrow ((plist cons))
31 | (ensure-plist plist)
32 | (length (cadr plist)))
33 |
34 | (defmethod aops:ncol ((plist cons))
35 | (ensure-plist plist)
36 | (length (plist-keys plist)))
37 |
38 | (defmethod aops:dims ((plist cons))
39 | (ensure-plist plist)
40 | (list (aops:nrow plist) (aops:ncol plist)))
41 |
--------------------------------------------------------------------------------
/src/pprint.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2021-2022 by Symbolics Pte. Ltd. All rights reserved.
3 | ;;; SPDX-License-identifier: MS-PL
4 | (in-package #:data-frame)
5 |
6 | #+genera (eval-when (eval load compile) (xp::install))
7 |
8 | ;;; NOTE: If using emacs/slime, set slime-repl-auto-right-margin to T
9 | ;;; so that the printing system will be able to figure out the right
10 | ;;; margins and format printing properly. This is done via an emacs
11 | ;;; customization.
12 |
13 |
14 |
15 | ;;; Pretty print data-frames and 2D arrays
16 |
17 | ;;; This is an extension of the original work by Tamas that used the
18 | ;;; Common Lisp pretty printing system to format data-frames for
19 | ;;; printing. As an experiment, I have to conclude that using the
20 | ;;; pretty printer this way is a bad idea. It was intended to format
21 | ;;; lisp code, not more complicated output. Future work should be done
22 | ;;; in formatted-output.lisp and these functions are considered
23 | ;;; deprecated.
24 |
25 | ;;; It is not easy to line up columns dynamically when printing them.
26 | ;;; Common lisp does have good control over justification and digits
27 | ;;; but, unlike, say, markdown, it's up to the programmer to
28 | ;;; explicitly specify the paddings. If you knew a priori the widths,
29 | ;;; this would be easy, but if you want to compute them at run-time
30 | ;;; things get a bit ugly.
31 |
32 | ;;; There are two patterns here for this:
33 | ;;; 1. create a new structure that contains the strings in their printed format (pprint-data-frame)
34 | ;;; 2. compute the formatting and apply each format to the original value as you loop through the structure (pprint-array)
35 |
36 | ;;; Neither is particularly efficient. Option 2 is better than option
37 | ;;; 1 in that regard, at the expense of some ugliness in the code and
38 | ;;; keeping track of formatting strings for each column.
39 |
40 |
41 | ;;; The following global variables control aspects of printing:
42 | ;; *print-length* - controls how many elements at a given level are printed (rows)
43 | ;; *print-lines* - controls how many output lines are printed (columns)
44 | (defparameter *max-digits* 4) ;max digits after decimal
45 | (defparameter *row-numbers-p* t) ;print row numbers
46 |
47 |
48 | ;;;
49 | ;;; Utility functions
50 | ;;;
51 |
52 | ;; Consider exporting this if it turns out to be generally useful
53 | (defun reverse-df (df)
54 | "Return DF with columns in reverse order"
55 | (make-df (reverse (keys df)) (reverse (columns df))))
56 |
57 | (defun printer-status ()
58 | "Print values of all the printer variables"
59 | (format t ";; *print-array* = ~a~%" *print-array*)
60 | (format t ";; *print-base* = ~a~%" *print-base*)
61 | (format t ";; *print-case* = ~a~%" *print-case*)
62 | (format t ";; *print-circle* = ~a~%" *print-circle*)
63 | (format t ";; *print-escape* = ~a~%" *print-escape*)
64 | (format t ";; *print-gensym* = ~a~%" *print-gensym*)
65 | (format t ";; *print-length* = ~a~%" *print-length*)
66 | (format t ";; *print-level* = ~a~%" *print-level*)
67 | (format t ";; *print-lines* = ~a~%" *print-lines*)
68 | (format t ";; *print-miser-width* = ~a~%" *print-miser-width*)
69 | (format t ";; *print-pprint-dispatch* = ~a~%" *print-pprint-dispatch*)
70 | (format t ";; *print-pretty* = ~a~%" *print-pretty*)
71 | (format t ";; *print-radix* = ~a~%" *print-radix*)
72 | (format t ";; *print-readably* = ~a~%" *print-readably*)
73 | (format t ";; *print-right-margin* = ~a~%" *print-right-margin*))
74 |
75 | (defun 2d-array-to-list (array)
76 | "Convert an array to a list of lists" ; make flet?
77 | (loop for i below (array-dimension array 0)
78 | collect (loop for j below (array-dimension array 1)
79 | collect (aref array i j))))
80 |
81 | ;;;
82 | ;;; Computing column formats
83 | ;;;
84 |
85 | ;;; To properly print float values we need to use the ~F directive to
86 | ;;; control the number of digits so that the decimal points will line
87 | ;;; up.
88 |
89 | ;;; There are three things we need to know to print a data frame:
90 | ;;; - width of the column
91 | ;;; - type of the column
92 | ;;; - maximum number of digits in any value in the column
93 |
94 | ;;; The three utility functions below give us this information.
95 |
96 | (defun max-width (sequence &optional (max-width nil))
97 | "Return the largest printed string size of the elements of SEQUENCE, equal to or less than MAX-WIDTH"
98 | (let ((actual-width (apply #'max (map 'list #'(lambda (x)
99 | (length
100 | (typecase x
101 | (float (format nil "~F" x))
102 | (t (format nil "~A" x)))))
103 | sequence))))
104 | (if max-width
105 | (min actual-width max-width)
106 | actual-width)))
107 |
108 | (defun column-type-format (sequence)
109 | "Return a format string for the most specific type found in sequence
110 | Use this for sequences of type T to determine how to format the column."
111 | (when (bit-vector-p sequence) (return-from column-type-format "D"))
112 | (case (column-type sequence)
113 | (:single-float "F")
114 | (:double-float "F")
115 | (:integer "D")
116 | (:bit "D")
117 | (:symbol "S")
118 | (:catagorical "A")
119 | (t "A")))
120 |
121 | (defun max-decimal (sequence &optional (max-digits nil))
122 | "Return the maximum number of digits to the right of the decimal point in the numbers of SEQUENCE, equal to or less than MAX-DIGITS"
123 | (let ((actual-digits (apply #'max
124 | (map 'list
125 | #'(lambda (x)
126 | (typecase x
127 | (float (let* ((str (format nil "~F" x))
128 | (pos (position #\. str)))
129 | (length (subseq str (1+ pos)))))
130 | (t 0)))
131 | sequence))))
132 | (if max-digits
133 | (min actual-digits max-digits)
134 | actual-digits)))
135 |
136 | ;;;
137 | ;;; Formatters
138 | ;;;
139 |
140 | #+nil
141 | (defmethod default-column-formats (#-genera (array simple-array)
142 | #+genera (array 'simple-array))
143 | "Return a list of formatting strings for ARRAY
144 | The method returns a set of default formatting strings using heuristics."
145 | ;; TODO: Delete once ACL is full working.
146 | (let ((col-widths (aops:margin #'max-width array 0))
147 | (col-types (aops:margin #'column-type-format array 0))
148 | (col-digits (aops:margin #'max-decimal array 0)))
149 | (map 'list #'(lambda (type width digits)
150 | (alexandria:switch (type :test #'string=)
151 | ("F" (format nil "~~~A,~AF" width digits))
152 | ("D" (format nil "~~~AD" width))
153 | ("A" (format nil "~~~AA" width))))
154 | col-types col-widths col-digits)))
155 |
156 | (defun default-column-formats (array)
157 | "Return a list of formatting strings for ARRAY
158 | The method returns a set of default formatting strings using heuristics."
159 | (let ((col-widths (aops:margin #'max-width array 0))
160 | (col-types (aops:margin #'column-type-format array 0))
161 | (col-digits (aops:margin #'max-decimal array 0)))
162 | (map 'list #'(lambda (type width digits)
163 | (alexandria:switch (type :test #'string=)
164 | ("F" (format nil "~~~A,~AF" width digits))
165 | ("D" (format nil "~~~AD" width))
166 | ("A" (format nil "~~~AA" width))))
167 | col-types col-widths col-digits)))
168 |
169 |
170 |
171 | ;;;
172 | ;;; Pretty printers
173 | ;;;
174 |
175 | (defun print-data (data-frame
176 | &optional
177 | (stream *standard-output*)
178 | (row-numbers-p *row-numbers-p*)
179 | (max-digits *max-digits*))
180 | "Print DATA-FRAME to STREAM using the pretty printer"
181 | (check-type data-frame data)
182 | (let* ((col-names '())
183 | (df (copy data-frame :key #'copy-array))
184 | (*print-pretty* t))
185 | (when row-numbers-p
186 | (setf df (reverse-df
187 | (add-columns (reverse-df df) ;add to end of DF
188 | '||
189 | (if (> (aops:nrow df) 1)
190 | (aops:linspace 0 (1- (aops:nrow df)) (aops:nrow df))
191 | #(0)))))) ;special case of 1 row
192 | (flet ((format-column (c)
193 | (let* ((width (max (max-width (column df c))
194 | (length (symbol-name c))))
195 | (type (column-type-format (column df c)))
196 | (digits (min (max-decimal (column df c))
197 | max-digits))
198 | (data-fmt (alexandria:switch (type :test #'string=)
199 | ("F" (format nil "~~~A,~AF" width digits))
200 | ("D" (format nil "~~~AD" width))
201 | ("A" (format nil "~~~AA" width))
202 | ("S" (format nil "~~~A@A" width))))
203 | (var-fmt (alexandria:switch (type :test #'string=) ; Why does SBCL complain about this, and only here?
204 | ("F" (format nil "~~~A@A" width))
205 | ("D" (format nil "~~~A@A" width))
206 | ("A" (format nil "~~~AA" width))
207 | ("S" (format nil "~~~A@A" width)))))
208 |
209 | (replace-column! df c #'(lambda (cell)
210 | (if (eq cell :na) ;should take same justification as column
211 | (format nil (format nil "~~~A<~~A~~>" width) cell)
212 | (format nil data-fmt cell))))
213 | (setf col-names (cons (format nil var-fmt (symbol-name c)) col-names)))))
214 | (map nil #'format-column (keys df)))
215 |
216 | (write-char #\Newline stream)
217 | (pprint-logical-block (stream (reverse col-names) :per-line-prefix ";;")
218 | (loop (pprint-exit-if-list-exhausted)
219 | (write-char #\Space stream)
220 | (write-string (pprint-pop) stream)))
221 | (write-char #\Newline stream)
222 | (pprint-logical-block (stream (2d-array-to-list (aops:as-array df)))
223 | (loop (pprint-exit-if-list-exhausted)
224 | (let ((row (pprint-pop)))
225 | (pprint-logical-block (stream row :per-line-prefix ";; ")
226 | (loop (pprint-exit-if-list-exhausted)
227 | (write-string (pprint-pop) stream)
228 | (write-char #\Space stream))))
229 | (pprint-newline :mandatory stream)))))
230 |
231 | ;;; TODO: refactor this using the pattern in pprint-data-frame,
232 | ;;; incorporating the code in default-column-formats and adding an
233 | ;;; optional max-digits parameter
234 | (defun print-array (arr &optional (stream *standard-output*) (row-numbers-p *row-numbers-p*))
235 | "Print an array to STREAM, defaulting to *standard-output*, in a tabular format. If ROW-NUMBERS-P, print row numbers."
236 | (let* ((array (cond (row-numbers-p (aops:stack-cols (aops:linspace 0 (1- (aops:nrow arr)) (aops:nrow arr)) arr))
237 | (t arr)))
238 | (df-lists (2d-array-to-list array))
239 | (data-fmt (default-column-formats array))
240 | (f 0))
241 | (pprint-logical-block (stream df-lists)
242 | (loop (pprint-exit-if-list-exhausted)
243 | (let ((row (pprint-pop)))
244 | (pprint-logical-block (stream row :per-line-prefix ";;")
245 | (loop (pprint-exit-if-list-exhausted)
246 | (write-char #\Space stream)
247 | (format stream (nth f data-fmt) (pprint-pop))
248 | (incf f))))
249 | (setf f 0)
250 | (write-char #\Newline stream)))))
251 |
252 |
253 | ;;; Not pretty printing per-se, but related
254 |
255 | (defmethod head ((df data-frame) &optional (n 6))
256 | "Return the first N rows of DF; N defaults to 6"
257 | (let ((*print-pretty* t))
258 | (if (< (aops:nrow df) 6) (setf n (aops:nrow df)))
259 | (print-data (select df (select:range 0 n) t))))
260 |
261 | (defmethod tail ((df data-frame) &optional (n 6))
262 | "Return the last N rows of DF; N defaults to 6"
263 | (let ((*print-pretty* t))
264 | (if (< (aops:nrow df) 6) (setf n (aops:nrow df)))
265 | (print-data (select df (select:range (- n) nil) t))))
266 |
267 | (defun short-string (str)
268 | "Return up to the first newline
269 | This is useful when docstrings are multi-line. By convention, the first line is the title."
270 | (subseq str
271 | 0
272 | (position #\newline str)))
273 |
274 |
275 | ;;;
276 | ;;; Pretty printer system configuration
277 | ;;;
278 |
279 | ;; After setting the dispatch tables below and *print-pretty* nil, we
280 | ;; get the following behaviour:
281 | ;; (print df) => one line summary of number of rows and columns
282 | ;; (pprint df) => print full table, subject to *print-lines* and *print-length*
283 | ;; print-object is equal to 'print'
284 | ;; Unfortunately, it also means that any CL function that sets
285 | ;; *print-pretty* to T, like 'describe' will print the entire
286 | ;; data-frame.
287 |
288 | ;; we can't use this and still have nice 'describe' for data frames
289 | #+nil
290 | (set-pprint-dispatch 'df:data-frame
291 | #'(lambda (s df) (print-data df s)))
292 |
293 | (set-pprint-dispatch '(array * 2)
294 | #'(lambda (s array) (print-array array s)))
295 |
296 | ;;; These only work when *print-pretty* is T
297 | #+nil
298 | (set-pprint-dispatch 'float
299 | #'(lambda (s obj)
300 | (format s "~,4F" obj)))
301 | #+nil
302 | (set-pprint-dispatch 'double-float
303 | #'(lambda (s obj)
304 | (format s "~,4F" obj)))
305 |
306 | ;;;
307 | ;;; Reference
308 | ;;;
309 |
310 | ;;; Use this to duplicate the behaviour in the documentation
311 | #+nil
312 | (defmethod print-object ((df data-frame) stream)
313 | "Print the first six rows of DATA-FRAME"
314 | (let* ((*print-lines* 6)
315 | (*print-pretty* t))
316 | (df:pprint-data df stream nil)))
317 |
318 |
--------------------------------------------------------------------------------
/src/properties.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2021-2022 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | (defun heuristicate-types (df)
6 | "Coerce each element of the column vectors to the most specific type in the column
7 | Often when reading in a data set, the types will be inconsistent in a variable. For example one observation might be 5.1, and another 5. Whilst mathmatically equivalent, we want our variable vectors to have identical types. The COLUMN-TYPE function returns the most specific numeric type in the column, then coerces all the vector elements to this type"
8 | (check-type df data-frame)
9 | (assert (slot-boundp df 'name) () "name is not bound in the data-frame")
10 | (let ((name (name df)))
11 | (map nil #'(lambda (key)
12 | (let* ((data (column df key))
13 | (col-type (column-type data))
14 | (sym (find-symbol (symbol-name key) (find-package name))))
15 | (setf (get sym :type) col-type)))
16 | (keys df))))
17 |
18 | (defun set-properties (df property prop-values)
19 | "Set the PROPERTY of each variable in DF to a value. The value is specified in the plist PROP-VALUES.
20 | Example:
21 | To give the variables in the mtcars dataset a unit, use:
22 | (set-properties mtcars :unit '(:mpg m/g
23 | :cyl :NA
24 | :disp in³
25 | :hp hp
26 | :drat :NA
27 | :wt lb
28 | :qsec s
29 | :vs :NA
30 | :am :NA
31 | :gear :NA
32 | :carb :NA))"
33 | (check-type df data-frame)
34 | (assert (slot-boundp df 'name) () "name is not bound in the data-frame")
35 | (let ((name (name df)))
36 | (loop for (key value) on prop-values by #'cddr
37 | for sym = (find-symbol (symbol-name key) (find-package name))
38 | do (assert (member property '(:type :label :unit))
39 | (property)
40 | "A property must be one of: :type, :label or :unit")
41 | do (if (eq property :type)
42 | (check-type value df:data-type "a valid data variable type"))
43 | when sym do (setf (get sym property) value))))
44 |
45 |
46 | ;;; User convenience functions
47 |
48 | (defun get-property (variable property)
49 | "Return the PROPERTY of data VARIABLE"
50 | (assert (member property '(:type :label :unit))
51 | (property)
52 | "Property ~A is not one of: :type, :label or :unit"
53 | property)
54 | (get variable property))
55 |
56 | (defun set-property (symbol value property)
57 | "Set the PROPERTY of SYMBOL to VALUE"
58 | (assert (member property '(:type :label :unit))
59 | (property)
60 | "Property ~A is not one of: :type, :label or :unit"
61 | property)
62 | (if (eq property :type)
63 | (check-type value df:data-type "a valid data variable type"))
64 | (setf (get symbol property) value))
65 |
66 |
67 | ;; Not exported
68 | (defun show-properties (df)
69 | "Show the standard properties of the variables of the data frame DF
70 | Standard properties are 'label', 'type' and 'unit'"
71 | (let* ((rows (loop for key across (keys df)
72 | collect (list (symbol-name key)
73 | (get key :type)
74 | (get key :unit)
75 | (get key :label)))))
76 | (push '("--------" "----" "----" "-----------") rows)
77 | (push '("Variable" "Type" "Unit" "Label") rows)
78 | (print-table rows)))
79 |
80 |
81 |
--------------------------------------------------------------------------------
/src/summary.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2020-2022 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | ;;; Summary control variables
6 | (defparameter *summary-minimum-length* 10
7 | "Columns are only summarised when longer than this, otherwise they are returned as is.")
8 |
9 | (defparameter *quantile-threshold* 20
10 | "If the number of unique reals exceeds this threshold, they will be summarized with quantiles, otherwise print frequency table")
11 |
12 | (defparameter *distinct-threshold* 10
13 | "If an integer variable has <= discrete values, consider it a factor")
14 |
15 | (defparameter *distinct-maximum* 20
16 | "If a string/factor variable has > *distinct-maximum* values, exclude it")
17 |
18 |
19 |
20 | (defstruct variable-summary%
21 | "Base class for summarizing variables. Summary functions take SYMBOLs, rather than values, because the symbol property lists naming the variables have meta-data, e.g. type, label, that we want to print. Not exported."
22 | (length 0 :type array-index :read-only t)
23 | (missing 0 :type fixnum :read-only t)
24 | (name "" :type string :read-only t)
25 | (desc "" :type string :read-only t))
26 |
27 | (defstruct (bit-variable-summary (:include variable-summary%)
28 | (:print-function
29 | (lambda (summary stream depth)
30 | (declare (ignore depth))
31 | (let+ (((&structure-r/o bit-variable-summary- length count name desc) summary))
32 | (format stream "~%~%~A (~A)~%" name desc)
33 | (princ "ones: " stream)
34 | (print-count-and-percentage stream count length)))))
35 | "Summary of a bit vector."
36 | (count 0 :type array-index :read-only t))
37 |
38 | (defstruct (real-variable-summary (:include variable-summary%)
39 | (:print-function
40 | (lambda (summary stream depth)
41 | (declare (ignore depth))
42 | (let+ (((&structure-r/o real-variable-summary- name desc length missing min q25 q50 mean q75 max) summary)
43 | (*print-pprint-dispatch* (copy-pprint-dispatch)))
44 | (set-pprint-dispatch 'float (lambda (s f) (format s "~,2f" f)))
45 | (format stream
46 | "~%~%~A (~A)~& n: ~W~& ~:_missing: ~W~& ~:_min=~:W~& ~:_q25=~:W~& ~:_q50=~:W~& ~:_mean=~:W~& ~:_q75=~:W~& ~:_max=~:W"
47 | name desc length missing min (ensure-not-ratio q25) (ensure-not-ratio q50) (ensure-not-ratio mean)
48 | (ensure-not-ratio q75) max)))))
49 | "Summary of a real elements (using quantiles)."
50 | ;; (count 0 :type array-index :read-only t)
51 | (min 0 :type real :read-only t)
52 | (q25 0 :type real :read-only t)
53 | (q50 0 :type real :read-only t)
54 | (mean 0 :type real :read-only t)
55 | (q75 0 :type real :read-only t)
56 | (max 0 :type real :read-only t))
57 |
58 | (defstruct (factor-variable-summary (:include variable-summary%)
59 | (:print-function
60 | (lambda (summary stream depth)
61 | (declare (ignore depth))
62 | (let+ (((&structure-r/o factor-variable-summary- name desc length element-count-alist) summary))
63 | (format stream "~%~%~A (~A)~%" name desc)
64 | (loop for (element . count) in element-count-alist do
65 | (print-count-and-percentage stream count length)
66 | (format stream " x ~W, " element))))))
67 | "Summary for factor variables"
68 | (element-count-alist nil :type list :read-only t))
69 |
70 | (defstruct (generic-variable-summary (:include variable-summary%)
71 | (:print-function
72 | (lambda (summary stream depth)
73 | (declare (ignore depth))
74 | (let+ (((&structure-r/o generic-variable-summary- length name quantiles element-count-alist) summary))
75 | (unless (string= name "")
76 | (format stream "~%~%~:W: " name))
77 | (when quantiles
78 | (let+ (((&structure-r/o real-variable-summary- length min q25 q50 q75 max) quantiles))
79 | (format stream
80 | "~:W reals, ~:_min=~:W, ~:_q25=~:W, ~:_q50=~:W, ~:_q75=~:W, ~:_max=~:W"
81 | length min (ensure-not-ratio q25) (ensure-not-ratio q50)
82 | (ensure-not-ratio q75) max)))
83 | (when element-count-alist
84 | (loop for (element . count) in element-count-alist do
85 | (print-count-and-percentage stream count length)
86 | (format stream " x ~:W, " element)))))))
87 | "Summary for generic variables, i.e. those with mixed types."
88 | (quantiles nil :type (or null real-variable-summary) :read-only t)
89 | (element-count-alist nil :type list :read-only t))
90 |
91 |
92 |
93 |
94 |
95 | (defgeneric column-length (column)
96 | (:documentation "Return the length of column.")
97 | (:method ((column vector))
98 | (length column)))
99 |
100 | (defun print-count-and-percentage (stream count length)
101 | "Print COUNT as is and also as a rounded percentage"
102 | (format stream "~D (~D%)" count (round (/ count length) 1/100)))
103 |
104 | (defun ensure-not-ratio (real)
105 | "When REAL is a RATIO, convert it to a float, otherwise return as is. Used for printing."
106 | (if (typep real 'ratio)
107 | (float real 1.0)
108 | real))
109 |
110 |
111 | ;;; TODO figure out where distinct and monotonic should reside. Probably num-utils.
112 | (defun distinct (column)
113 | "Returns the number of distinct elements in COLUMN, a symbol naming a variable.
114 | Useful for formatting columns for human output."
115 | (let+ ((data (eval column))
116 | (table (aprog1 (make-sparse-counter :test #'equal)
117 | (map nil (curry #'add it) data)))
118 | (alist (as-alist table)))
119 | (length alist)))
120 |
121 | (defun monotonicp (column)
122 | "Returns T if all elements of COLUMN, a SYMBOL, are increasing monotonically
123 | Useful for detecting row numbers in imported data."
124 | (let ((data (eval column)))
125 | (if (not (every #'numberp data))
126 | nil
127 | (loop for x across data
128 | for i = (1+ x) then (1+ i)
129 | always (= (1+ x) i)))))
130 |
131 |
132 |
133 | (defun summarize-real-variable (column)
134 | "Return a summary for a float variable"
135 | (let+ ((data (eval column))
136 | (table (aprog1 (make-sparse-counter :test #'equal)
137 | (map nil (curry #'add it) data)))
138 | (alist (as-alist table))
139 | ((&flet real? (item) (realp (car item))))
140 | (reals-alist (remove-if (complement #'real?) alist))
141 | (#(min q25 q50 q75 max)
142 | (weighted-quantiles
143 | (mapcar #'car reals-alist)
144 | (mapcar #'cdr reals-alist)
145 | #(0 1/4 1/2 3/4 1)))
146 | (mean-ignore-missing (ignore-missing #'mean :warn-user t)))
147 |
148 | (make-real-variable-summary
149 | :name (symbol-name column)
150 | :desc (if (get column :label)
151 | (get column :label)
152 | "")
153 | :length (length data)
154 | :missing (length (which data :predicate #'missingp))
155 | :min min :q25 q25 :q50 q50 :mean (funcall mean-ignore-missing data) :q75 q75 :max max)))
156 |
157 | (defun summarize-factor-variable (column)
158 | "Return an alist of factor/count pairs"
159 | (let+ ((data (eval column))
160 | (table (aprog1 (make-sparse-counter :test #'equal)
161 | (map nil (curry #'add it) data)))
162 | (alist (as-alist table)))
163 |
164 | (make-factor-variable-summary
165 | ;; :name (df::var-name (symbol-name column)) ; for df$variable style
166 | :name (symbol-name column)
167 | :desc (if (get column :label) (get column :label) "")
168 | :length (length data)
169 | :missing (length (which data :predicate #'missingp))
170 | :element-count-alist (stable-sort alist #'>= :key #'cdr))))
171 |
172 | (defun summarize-generic-variable (column &optional name)
173 | "Return an object that summarizes COLUMN of a DATA-FRAME. Primarily intended for printing, not analysis, returned values should print nicely. This function can be used on any type of column, even one with mixed types"
174 | (let+ ((data (eval column))
175 | ;; (data column)
176 | (length (length data))
177 | (table (aprog1 (make-sparse-counter :test #'equal)
178 | (map nil (curry #'add it) data)))
179 | (alist (as-alist table))
180 | ((&flet real? (item) (realp (car item))))
181 | (reals-alist (remove-if (complement #'real?) alist))
182 | (quantiles (when (< *quantile-threshold*
183 | (length reals-alist))
184 | (let+ ((#(min q25 q50 q75 max)
185 | (weighted-quantiles
186 | (mapcar #'car reals-alist)
187 | (mapcar #'cdr reals-alist)
188 | #(0 1/4 1/2 3/4 1))))
189 | (make-real-variable-summary
190 | :length (reduce #'+ reals-alist :key #'cdr)
191 | :min min :q25 q25 :q50 q50 :q75 q75 :max max))))
192 | (alist (stable-sort (if quantiles
193 | (remove-if #'real? alist)
194 | (copy-list alist))
195 | #'>= :key #'cdr)))
196 | (make-generic-variable-summary :length length
197 | :name (if name
198 | name
199 | "")
200 | :quantiles quantiles
201 | :element-count-alist alist)))
202 |
203 |
204 |
205 | (defun summarize-column (column &optional name)
206 | "Return a summary struct for COLUMN"
207 | (let ((data (eval column))
208 | (label (get column :label)))
209 | (case (get column :type)
210 |
211 | ;; Implementation types
212 | (:double-float (summarize-real-variable column))
213 | (:single-float (summarize-real-variable column))
214 | (:integer (summarize-real-variable column))
215 | (:string (summarize-factor-variable column)) ;we really should remove this at some point.
216 | (:catagorical (summarize-factor-variable column))
217 | (:bit (make-bit-variable-summary
218 | :name (if name
219 | name
220 | (symbol-name column))
221 | :desc (if label
222 | label
223 | "")
224 | :length (length data)
225 | :count (count 1 data)))
226 |
227 | ;; Statistical types, note keyword in case key
228 | (:factor (summarize-factor-variable column))
229 | (t (summarize-generic-variable column)))))
230 |
231 | (defun get-summaries (df)
232 | "Return a list of summaries of the variables in DF"
233 | (loop for key across (keys df)
234 | collect (summarize-column key)))
235 |
236 | #+nil
237 | (defmacro summary (df &optional (stream *standard-output*))
238 | `(summarize-dataframe ,df ,stream))
239 |
240 | ;; TODO add :remove-missing parameter so we can summarize in the early stages of data exploration.
241 | ;; Perhaps, if set, have it use summarize-generic-variable
242 | (defun summary (df &optional (stream *standard-output*))
243 | "Print a summary of DF to STREAM, using heuristics for better formatting"
244 | (let* ((name (when (slot-boundp df 'name) (name df)))
245 | (pkg (find-package name)))
246 | (if pkg
247 | (loop for key across (keys df)
248 | for column = (find-symbol (string-upcase (symbol-name key)) pkg)
249 | for data = (column df key)
250 | for length = (length data)
251 |
252 | unless (or (= (length data)
253 | (distinct data)) ;exclude row names
254 | (monotonicp data)) ;exclude row numbers
255 | collect (case (get column :type) ;special cases
256 | (:double-float (if (< (distinct data) *quantile-threshold*) ;summarise as a factor
257 | (summarize-factor-variable column)
258 | (summarize-real-variable column)))
259 | (:single-float (if (< (distinct data) *quantile-threshold*) ;summarise as a factor
260 | (summarize-factor-variable column)
261 | (summarize-real-variable column)))
262 | (:integer (if (< (distinct data) *distinct-threshold*) ;summarise as a factor
263 | (summarize-factor-variable column)
264 | (summarize-real-variable column)))
265 | (t (summarize-column column))))
266 |
267 | (loop for key across (keys df) ;no data-frame environment, use generic summary functions
268 | for data = (column df key)
269 | unless (or (= (length data)
270 | (distinct data)) ;exclude row names
271 | (monotonicp data) ;exclude row numbers
272 | (and (< *distinct-maximum* ;exclude row names with a few repeats
273 | (distinct data))
274 | (equal :string (column-type (column df key)))))
275 | do (format stream "~%~A: ~A" key (summarize-generic-variable data))))))
276 |
277 |
--------------------------------------------------------------------------------
/src/utils.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME -*-
2 | ;;; Copyright (c) 2021-2022 by Symbolics Pte. Ltd. All rights reserved.
3 | (in-package #:data-frame)
4 |
5 | ;;;
6 | ;;; Insert & remove items from vectors and arrays -- TODO move to alexandria+
7 | ;;;
8 |
9 | (defun delete-nth (sequence n)
10 | "Return SEQUENCE with the Nth item removed.
11 | Note: DELETE-IF makes no guarantee of being destructive, so you cannot rely on this side-effect. You must SETF the original sequence to the values returned from this function, or use the modify-macro DELETE-NTH*"
12 | (check-type sequence sequence)
13 | (delete-if (constantly t) sequence :start n :count 1))
14 |
15 | (define-modify-macro delete-nth* (n)
16 | delete-nth
17 | "Destructively modifies N, a SEQUENCE by removing the Nth item.
18 | Example:
19 | LS-USER> (defparameter *v* #(a b c d))
20 | *V*
21 | LS-USER> (delete-nth* *v* 1)
22 | #(A C D)
23 | LS-USER> *v*
24 | #(A C D)")
25 |
26 |
27 |
28 | ;;;
29 | ;;; Augment the type system
30 | ;;;
31 |
32 | ;;; cl:type-of is under-constrained and returns implementation
33 | ;;; specific results, so we use our own version
34 | ;;; See: https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node53.html
35 | (defun get-type (x)
36 | "Return the most specific type symbol for x"
37 | (typecase x
38 | (bit 'bit)
39 | (single-float 'single-float)
40 | (double-float 'double-float)
41 | ;; (fixnum 'fixnum)
42 | (integer 'integer)
43 | (ratio 'ratio)
44 | (complex 'complex)
45 | ;; (rational 'cl:rational) ;SBCL doesn't recognise this return type
46 | ;; (real 'real) ;SBCL doesn't recognise this return type
47 |
48 | (simple-string 'simple-string)
49 | (string 'string)
50 |
51 | (list 'list)
52 | (symbol 'symbol)
53 | (bit-vector 'bit-vector)
54 | (simple-vector 'simple-vector)
55 | (simple-array 'simple-array)
56 | (vector 'vector)
57 | (array 'array)
58 | (sequence 'sequence)
59 |
60 | (function 'function)
61 | (package 'package)))
62 |
63 | (defun types-in-column (seq)
64 | "Return a list of the types found in SEQ"
65 | (let (types)
66 | (loop for i across seq
67 | for type = (get-type i)
68 | when (not (member type types))
69 | do (push type types)
70 | finally (return types))))
71 |
72 | (defun column-type (col)
73 | "Return the most specific type found in COL"
74 | (let ((type-list (types-in-column col)))
75 | (cond ((member 'single-float type-list) :single-float)
76 | ((member 'double-float type-list) :double-float)
77 | ((member 'fixnum type-list) :fixnum)
78 | ((member 'integer type-list) :integer)
79 | ((member 'string type-list) :string)
80 | ((member 'bit type-list) :bit)
81 | ((member 'symbol type-list) :symbol)
82 | (t 'string ))))
83 |
84 |
--------------------------------------------------------------------------------
/tests/data-frame-tests.lisp:
--------------------------------------------------------------------------------
1 | ;;; -*- Mode: LISP; Base: 10; Syntax: ANSI-Common-Lisp; Package: DATA-FRAME-TESTS -*-
2 | ;;; Copyright (c) 2020-2022 by Symbolics Pte. Ltd. All rights reserved.
3 |
4 | (uiop:define-package #:data-frame-tests
5 | (:use
6 | #:cl
7 | #:alexandria
8 | #:anaphora
9 | #:clunit
10 | #:let-plus
11 | #:select
12 | #:data-frame)
13 | (:import-from #:nu #:as-alist #:as-plist)
14 | (:export #:run))
15 |
16 | (in-package :data-frame-tests)
17 |
18 | (defsuite data-frame ())
19 |
20 | (defun run (&optional interactive?)
21 | (run-suite 'data-frame :use-debugger interactive?))
22 |
23 | (defsuite data-vector (data-frame))
24 |
25 | (deftest data-vector-basics (data-vector)
26 | (let ((dv (dv :a 1 :b 2 :c 3)))
27 | (assert-equalp '(:a 1 :b 2 :c 3) (as-plist dv))
28 | (assert-equalp #(1 2 3) (columns dv))
29 | (assert-equalp #(:a :b :c) (keys dv))
30 | (assert-equalp '((:a . 1) (:b . 2) (:c . 3)) (as-alist dv))
31 | (assert-equalp '(:a 1 :b 2) (as-plist (select dv #(:a :b))))
32 | (assert-equalp 3 (select dv :c))
33 | (let ((dv2 (map-columns dv #'1+)))
34 | (assert-equalp '(:a 2 :b 3 :c 4) (as-plist dv2))
35 | (assert-true (typep dv2 'data-vector)))))
36 |
37 |
38 | (defsuite data-frame-basics (data-frame))
39 |
40 | (deffixture data-frame-basics (@body)
41 | (let ((v #(1 2 3 4))
42 | (b #*0110)
43 | (s #(a b c d)))
44 | @body))
45 |
46 | (deftest data-frame-creation (data-frame-basics)
47 | (let* ((plist `(:vector ,v :symbols ,s :bits ,b))
48 | (df (apply #'df plist))
49 | (df-plist (plist-df plist))
50 | (df-alist (alist-df (plist-alist plist))))
51 | (assert-equalp #(:vector :symbols :bits) (keys df))
52 | (assert-equalp (vector v s b) (columns df))
53 | (assert-equalp (vector v s b) (columns df t))
54 | (assert-equalp (vector v) (columns df #(:vector)))
55 | (assert-equalp v (columns df :vector))
56 | (assert-equalp v (columns df -3))
57 | (assert-equalp `(:vector ,v :symbols ,s :bits ,b) (as-plist df))
58 | (assert-equalp `((:vector . ,v) (:symbols . ,s) (:bits . ,b)) (as-alist df))
59 | (assert-equalp (as-alist df) (as-alist df-plist))
60 | (assert-equalp (as-alist df) (as-alist df-alist))))
61 |
62 | (deftest data-frame-select (data-frame-basics)
63 | (let ((df (df :vector v :symbols s)))
64 | (assert-equalp `(:vector ,v) (as-plist (select df t #(:vector))))
65 | (assert-equalp `(:vector ,(select v b)) (as-plist (select df b #(0))))
66 | (assert-equalp (select v b) (select df b :vector))
67 | (assert-equalp '(:vector 3 :symbols c) (as-plist (select df 2 t)))
68 | (assert-equalp `(:vector #(2 4)) (as-plist
69 | (select df
70 | (mask-rows df :vector #'evenp)
71 | #(:vector))))
72 | (assert-equalp #(2 4) (select df (mask-rows df :vector #'evenp) :vector))))
73 |
74 |
75 |
76 | (defsuite data-frame-operations (data-frame))
77 |
78 | (deftest data-frame-map (data-frame-operations)
79 | (let+ ((df (df :a #(2 3 5)
80 | :b #(7 11 13)))
81 | (product #(14 33 65))
82 | ((&flet predicate (a b) (<= 30 (* a b))))
83 | ((&flet predicate-bit (a b) (if (predicate a b) 1 0)))
84 | (mask #*011))
85 | (assert-equalp product
86 | (map-rows df '(:a :b) #'*))
87 | (assert-equalp `(:p ,product :m ,mask)
88 | (as-plist (map-df df '(:a :b)
89 | (lambda (a b)
90 | (vector (* a b) (predicate-bit a b)))
91 | '((:p fixnum) (:m bit)))))
92 | (let ((mask-rows (mask-rows df '(:a :b) #'predicate)))
93 | (assert-equal mask mask-rows)
94 | (assert-eq 'bit (array-element-type mask-rows)))
95 | (assert-equalp (count 1 mask)
96 | (count-rows df '(:a :b) #'predicate))))
97 |
98 | #|
99 | ;This test is a bit of a challenge because the operation that determines the symbols for columns compares symbols in the DF package, but here 'a is created in the data-frame-tests package. Frequent using of filter-rows is enough to convince me that it's working properly, but if someone were to make this test work I'd be grateful.
100 | (deftest filter-rows (data-frame-operations)
101 | (let+ ((df (df :a #(2 3 5)
102 | :b #(7 11 13)))
103 | (plst '(:a #(2 3)
104 | :b #(7 11))))
105 | (assert-equal plst (as-plist (filter-rows df (< a 4))))))
106 | |#
107 | (deftest rename! (data-frame-operations)
108 | (let+ ((df (df :a #(2 3 5)
109 | :b #(7 11 13))))
110 | (assert-equalp #(:a :b) (keys df))
111 | (rename-column! df :c :a)
112 | (assert-equalp #(:c :b) (keys df) "Rename failed")))
113 |
114 |
115 |
116 |
117 | (defsuite data-frame-add (data-frame))
118 |
119 | (deffixture data-frame-add (@body)
120 | (let* ((plist1 '(:a #(1 2 3)))
121 | (plist2 '(:b #(4 5 6)))
122 | (plist12 (append plist1 plist2)))
123 | @body))
124 |
125 | (defmacro test-add (add-function plist1 plist2 append?)
126 | "Macro for generating the following test:
127 |
128 | 1. create a data frame using plist1,
129 |
130 | 2. add plist2 using add-function to get a second data frame,
131 |
132 | 3. test that the first data frame is uncorrupted if append? is nil, or
133 | equivalent the concatenated plist otherwise,
134 |
135 | 4. test that the second data frame is equivalent to the concatenated plist.
136 |
137 | This is a comprehensive test of the add-column family of functions,
138 | destructive or non-destructive."
139 | (with-unique-names (df df2 plist12)
140 | (once-only (plist1 plist2)
141 | `(let* ((,df (plist-df ,plist1))
142 | (,df2 (apply ,add-function ,df ,plist2))
143 | (,plist12 (append ,plist1 ,plist2)))
144 | (assert-equal (if ,append?
145 | ,plist12
146 | ,plist1)
147 | (as-plist ,df))
148 | (assert-equal ,plist12
149 | (as-plist ,df2))))))
150 |
151 | (deftest add-column (data-frame-add)
152 | (test-add #'add-columns plist1 plist2 nil)
153 | (test-add #'add-column! plist1 plist2 t)
154 | (test-add #'add-columns! plist1 plist2 t)
155 | (assert-equalp '(:a #(1 2 3) :b #(4 5 6)) plist12)) ;this test is only here to quiet the compiler
156 |
157 | (deftest add-map (data-frame-add)
158 | (let* ((plist3 '(:c #(4 10 18)))
159 | (plist123 (append plist12 plist3)))
160 | ;; non-destructive
161 | (let* ((df (plist-df plist12))
162 | (df2 (add-columns df :c (map-rows df '(:a :b) #'*))))
163 | (assert-equalp plist12 (as-plist df))
164 | (assert-equalp plist123 (as-plist df2)))
165 | ;; destructive, function
166 | (let* ((df (plist-df plist12))
167 | (df2 (add-column! df :c (map-rows df '(:a :b) #'*))))
168 | (assert-equalp plist123 (as-plist df))
169 | (assert-equalp plist123 (as-plist df2)))))
170 |
171 |
172 | ;;; replace-column
173 |
174 | (defsuite replace-column (data-frame))
175 |
176 | (deftest replace-column1 (replace-column)
177 | (let* ((plist '(:a #(1 2 3) :b #(5 7 11)))
178 | (df (plist-df plist))
179 | ;; (df-copy (copy df))
180 | (df1 (replace-column df :a #'1+))
181 | (df2 (replace-column df :a #(2 3 4)))
182 | (expected-plist '(:a #(2 3 4) :b #(5 7 11))))
183 | (assert-equalp expected-plist (as-plist df1))
184 | (assert-equalp expected-plist (as-plist df2))
185 | (assert-equalp plist (as-plist df))
186 | ;; modify destructively
187 | (replace-column! df :a #'1+)
188 | (assert-false (equalp plist (as-plist df)))
189 | (assert-equalp expected-plist (as-plist df))))
190 |
191 |
192 | (defsuite remove-columns (data-frame))
193 |
194 | (deftest remove-columns1 (remove-columns)
195 | (let* ((plist '(:a #(1 2 3) :b #(5 7 11) :c #(100 200 300)))
196 | (df (plist-df plist))
197 | ;; (df-copy (copy df))
198 | (df1 (remove-columns df '(:a :b)))
199 | (expected-plist '(:c #(100 200 300))))
200 | (assert-equalp expected-plist (as-plist df1))
201 | (assert-equalp plist (as-plist df))
202 |
203 | (remove-column! df :a)
204 | (assert-false (equalp plist (as-plist df)))
205 | (assert-equalp '(:b #(5 7 11) :c #(100 200 300)) (as-plist df))))
206 |
207 |
208 | (defsuite replace-columns (data-frame))
209 | (deftest replace-columns1 (remove-columns)
210 | (let* ((plist '(:a #(1 2 3) :b #(5 7 11) :c #(100 200 300)))
211 | (df (plist-df plist)))
212 | (replace-column! df :a #'1+)
213 | (assert-false (equalp plist (as-plist df)))
214 | (assert-equalp '(:a #(2 3 4) :b #(5 7 11) :c #(100 200 300)) (as-plist df))))
215 |
216 |
217 | (defsuite remove-duplicates (data-frame))
218 |
219 | (deftest remove-duplicates1 (remove-duplicates)
220 | (let* ((dup (make-df '(:a :b :c) '(#(a a 3) #(a a 3) #(a a 333))))
221 | (df1 (df-remove-duplicates dup))
222 | (expected-plist '(:a #(a 3) :b #(a 3) :c #(a 333))))
223 | (assert-equalp expected-plist (as-plist df1))))
224 |
225 |
226 |
227 | (defsuite pretty-print (data-frame))
228 |
229 | (deftest print-df (pretty-print)
230 | (let* ((df1 (make-df '(:a :b :c)
231 | '(#(a a a)
232 | #(b b b)
233 | #(3 33 333))))
234 | (*print-pretty* t)
235 | (expected-string "
236 | ;; A B C
237 | ;; 0 A B 3
238 | ;; 1 A B 33
239 | ;; 2 A B 333
240 | ")
241 | (actual-string (make-array '(0) :element-type 'base-char :fill-pointer 0 :adjustable t)))
242 |
243 | (with-output-to-string (s actual-string)
244 | (print-data df1 s))
245 | (assert-true (string= expected-string actual-string))))
246 |
247 | (deftest print-array (pretty-print)
248 | (let* ((array1 #2A(#(a a a)
249 | #(b b b)
250 | #(3 33 333)))
251 | (*print-pretty* t)
252 | (expected-string ";; 0 A A A
253 | ;; 1 B B B
254 | ;; 2 3 33 333
255 | ")
256 | (actual-string (make-array '(0) :element-type 'base-char :fill-pointer 0 :adjustable t)))
257 |
258 | (with-output-to-string (s actual-string)
259 | (print-array array1 s))
260 | (assert-true (string= expected-string actual-string))))
261 |
262 |
263 | (defsuite missing (data-frame))
264 |
265 | (deftest d-frame (missing)
266 | (let ((df (matrix-df #(:a :b :c) #2A((1.7 2.1 :na)
267 | (5.4 :na 6.1)
268 | (:na 8.3 9.5)))))
269 | (assert-equalp #2A((nil nil t)
270 | (nil t nil)
271 | (t nil nil))
272 | (aops:as-array (missingp df)))))
273 |
274 | (deftest array (missing)
275 | (let ((arr #2A((bar 1 2 3 4 :na 6)
276 | (foo 7 8 9 :na 10 11))))
277 | (assert-equalp #2A((nil nil nil nil nil t nil)
278 | (nil nil nil nil t nil nil))
279 | (missingp arr))))
280 |
281 | (deftest vector (missing)
282 | (let ((vec #(0 1 2 3 4 :na 6)))
283 | (assert-equalp #(nil nil nil nil nil t nil) (missingp vec))))
284 |
285 | (deftest ignore-missing (missing)
286 | (let ((vec #(0 1 2 3 4 :na 6)))
287 | (assert-true (nu:num= (funcall (ignore-missing #'mean) vec)
288 | 2.6666667
289 | nu:*num=-tolerance*))))
290 |
291 |
292 |
293 | ;;; plist-aops
294 |
295 | (defsuite plist-aops (data-frame))
296 |
297 | (deftest as-array (plist-aops)
298 | (let ((arr #2A((1 4) (2 5) (3 6)))
299 | (pl '(:a #(1 2 3) :b #(4 5 6))))
300 | (assert-equalp arr (nth-value 0 (aops:as-array pl)))))
301 |
302 | (deftest dims (plist-aops)
303 | (let ((pl '(:a #(1 2 3) :b #(4 5 6))))
304 | (assert-equalp '(3 2) (aops:dims pl))
305 | (assert-equalp 3 (aops:nrow pl))
306 | (assert-equalp 2 (aops:ncol pl))))
307 |
308 |
309 | ;;; Data frame environment
310 | (defsuite define-data-frame (data-frame))
311 |
312 | (deffixture define-data-frame (@body)
313 | (let* ((v #(1 2 3 4))
314 | (b #*0110)
315 | (s #(a b c d)))
316 | @body))
317 |
318 | (deftest define-data-frame (define-data-frame)
319 | (let* ((plist `(vector ,v symbols ,s bits ,b))
320 | (df (apply #'df plist)))
321 |
322 | ;; Define an environment
323 | (df::defdf new-df df)
324 | (assert-true (boundp 'new-df) "The data frame was not bound")
325 | (assert-equalp (type-of (symbol-value 'new-df)) 'data-frame "new-df is not bound to a data-frame")
326 |
327 | ;; Ensure variables, package and macros were created
328 | (assert-true (find-package "NEW-DF") "Data frame package not found")
329 | (assert-equalp #(1 2 3 4) (eval (find-symbol "VECTOR" (find-package "NEW-DF"))))
330 | (assert-equalp #*0110 (eval (find-symbol "BITS" (find-package "NEW-DF"))))
331 | (assert-equalp #(a b c d) (eval (find-symbol "SYMBOLS" (find-package "NEW-DF"))))
332 |
333 | ;; ;; Remove symbol and package
334 | (let ((*package* (find-package "DATA-FRAME-TESTS")))
335 | (df::undef new-df))
336 |
337 | (assert-false (boundp (find-symbol "NEW-DF" (find-package "DATA-FRAME-TESTS"))) "The data frame was not removed")
338 | (assert-false (find-package "NEW-DF"))))
339 |
340 |
341 |
--------------------------------------------------------------------------------