├── .gitignore ├── CHANGES.md ├── LICENSE ├── README.md ├── pandas_sets ├── __init__.py └── sets.py ├── requirements.txt ├── setup.py └── tests └── test_sets.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | *egg-info/ 3 | build/ 4 | dist/ 5 | *.pyc 6 | -------------------------------------------------------------------------------- /CHANGES.md: -------------------------------------------------------------------------------- 1 | ### 0.2.1 (2020-05-27) 2 | - Added support for frozensets [#3](https://github.com/Florents-Tselai/pandas-sets/pull/3). Thanks [knaveofdiamonds](https://github.com/knaveofdiamonds)! 3 | - Fixed namespace warnings for recent pandas releases (about `testing` and `Index` / `MultiIndex`) 4 | 5 | ### 0.1.1 (2018-12-27) 6 | - Bug Fix: returned NAs 7 | 8 | ### 0.1.0 (2018-12-26) 9 | - Initial release -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2018, Florents Tselai 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | * Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | * Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | * Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 
19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Pandas Sets: Set-oriented Operations in Pandas 2 | 3 | If you store standard Python `set`s or `frozenset`s in your `Series` or `DataFrame` objects, you'll find this useful. 4 | 5 | The `pandas_sets` package adds a `.set` accessor to any pandas `Series` object; 6 | it's like `.dt` for `datetime` or `.str` for `string`, but for [`set`](https://docs.python.org/3.7/library/stdtypes.html#set). 7 | 8 | It exposes all public methods available in the standard [`set`](https://docs.python.org/3.7/library/stdtypes.html#set). 9 | 10 | ## Installation 11 | ```bash 12 | pip install pandas-sets 13 | ``` 14 | Just import the `pandas_sets` package and it will register a `.set` accessor to any `Series` object. 
15 | 16 | ```python 17 | import pandas_sets 18 | ``` 19 | 20 | ## Examples 21 | ```python 22 | import pandas_sets 23 | import pandas as pd 24 | df = pd.DataFrame({'post': [1, 2, 3, 4], 25 | 'tags': [{'python', 'pandas'}, {'philosophy', 'strategy'}, {'scikit-learn'}, {'pandas'}] 26 | }) 27 | 28 | pandas_posts = df[df.tags.set.contains('pandas')] 29 | 30 | pandas_posts.tags.set.add('data') 31 | 32 | pandas_posts.tags.set.update({'data', 'analysis'}) 33 | 34 | pandas_posts.tags.set.len() 35 | ``` 36 | 37 | ## Notes 38 | * The implementation is primitive for now. It is based heavily on pandas' core [`StringMethods`](https://github.com/pandas-dev/pandas/blob/52a2bb490556a86c5f756465320c18977dbe1c36/pandas/core/strings.py#L1783) implementation. 39 | * The public API has been tested for most expected scenarios. 40 | * The API will need to be extended to handle `NA` values appropriately. 41 | -------------------------------------------------------------------------------- /pandas_sets/__init__.py: -------------------------------------------------------------------------------- 1 | from pandas_sets.sets import SetMethods 2 | -------------------------------------------------------------------------------- /pandas_sets/sets.py: -------------------------------------------------------------------------------- 1 | from pandas.api.extensions import register_series_accessor 2 | from pandas.core import strings 3 | from pandas.core.base import NoNewAttributesMixin 4 | from pandas.core.common import ABCSeries 5 | from pandas.core.dtypes.common import is_bool_dtype 6 | from pandas.core.dtypes.common import is_list_like 7 | from pandas.core.dtypes.common import is_object_dtype 8 | import numpy as np 9 | 10 | copy = strings.copy 11 | 12 | 13 | def is_set_type(data): 14 | return isinstance(data, (set, frozenset)) 15 | 16 | 17 | #def _map(*args, **kwargs): 18 | # return strings._map(*args, **kwargs) 19 | 20 | 21 | def _na_map(*args, **kwargs): 22 | return 
strings._na_map(*args, **kwargs) 23 | 24 | 25 | def set_contains(arr, elem): 26 | pass 27 | 28 | 29 | def set_isdisjoint(arr, other): 30 | pass 31 | 32 | 33 | def set_issubset(arr, other): 34 | pass 35 | 36 | 37 | def set_issuperset(arr, other): 38 | pass 39 | 40 | 41 | def set_union(arr, *others): 42 | pass 43 | 44 | 45 | def set_intersection(arr, *others): 46 | pass 47 | 48 | 49 | def set_difference(arr, *others): 50 | pass 51 | 52 | 53 | def set_symmetric_difference(arr, other): 54 | pass 55 | 56 | 57 | def set_copy(arr): 58 | pass 59 | 60 | 61 | def set_update(arr, *others): 62 | pass 63 | 64 | 65 | def set_intersection_update(arr, *others): 66 | pass 67 | 68 | 69 | def set_difference_update(arr, *others): 70 | pass 71 | 72 | 73 | def set_symmetric_difference_update(arr, other): 74 | pass 75 | 76 | 77 | def set_add(arr, elem): 78 | def f(x): 79 | x.add(elem) 80 | return x 81 | 82 | return _na_map(f, arr) 83 | 84 | 85 | def set_remove(arr, elem): 86 | pass 87 | 88 | 89 | def set_discard(arr, elem): 90 | pass 91 | 92 | 93 | def set_pop(arr): 94 | pass 95 | 96 | 97 | def set_clear(arr): 98 | pass 99 | 100 | 101 | @register_series_accessor("set") 102 | class SetMethods(NoNewAttributesMixin): 103 | """ 104 | Intended to mirror the implementation of `pandas.core.strings.StringMethods` 105 | 106 | Vectorized set functions for Series. NAs stay NA unless 107 | handled otherwise by a particular method. Patterned after Python's set 108 | methods. 
109 | 110 | Examples 111 | -------- 112 | >>> s.set.union({1, 2, 3}) 113 | >>> s.set.intersection(set()) 114 | """ 115 | 116 | def __init__(self, data): 117 | self._validate(data) 118 | self._data = data 119 | self._orig = data 120 | 121 | def _wrap_result(self, result, use_codes=True, 122 | name=None, expand=None): 123 | 124 | # TODO: this was blindly copied from `strings.StringMethods._wrap_result` for now 125 | from pandas import Index, MultiIndex 126 | 127 | # for category, we do the stuff on the categories, so blow it up 128 | # to the full series again 129 | # But for some operations, we have to do the stuff on the full values, 130 | # so make it possible to skip this step as the method already did this 131 | # before the transformation... 132 | # if use_codes and self._is_categorical: 133 | # result = take_1d(result, self._orig.cat.codes) 134 | 135 | if not hasattr(result, 'ndim') or not hasattr(result, 'dtype'): 136 | return result 137 | assert result.ndim < 3 138 | 139 | if expand is None: 140 | # infer from ndim if expand is not specified 141 | expand = False if result.ndim == 1 else True 142 | 143 | elif expand is True and not isinstance(self._orig, Index): 144 | # required when expand=True is explicitly specified 145 | # not needed when inferred 146 | 147 | def cons_row(x): 148 | if is_list_like(x): 149 | return x 150 | else: 151 | return [x] 152 | 153 | result = [cons_row(x) for x in result] 154 | if result: 155 | # propagate nan values to match longest sequence (GH 18450) 156 | max_len = max(len(x) for x in result) 157 | result = [x * max_len if len(x) == 0 or x[0] is np.nan 158 | else x for x in result] 159 | 160 | if not isinstance(expand, bool): 161 | raise ValueError("expand must be True or False") 162 | 163 | if expand is False: 164 | # if expand is False, result should have the same name 165 | # as the original unless otherwise specified 166 | if name is None: 167 | name = getattr(result, 'name', None) 168 | if name is None: 169 | # do not use logical 
or, _orig may be a DataFrame 170 | # which has "name" column 171 | name = self._orig.name 172 | 173 | # Wait until we are sure result is a Series or Index before 174 | # checking attributes (GH 12180) 175 | if isinstance(self._orig, Index): 176 | # if result is a boolean np.array, return the np.array 177 | # instead of wrapping it into a boolean Index (GH 8875) 178 | if is_bool_dtype(result): 179 | return result 180 | 181 | if expand: 182 | result = list(result) 183 | out = MultiIndex.from_tuples(result, names=name) 184 | if out.nlevels == 1: 185 | # We had all tuples of length-one, which are 186 | # better represented as a regular Index. 187 | out = out.get_level_values(0) 188 | return out 189 | else: 190 | return Index(result, name=name) 191 | else: 192 | index = self._orig.index 193 | if expand: 194 | cons = self._orig._constructor_expanddim 195 | return cons(result, columns=name, index=index) 196 | else: 197 | # Must be a Series 198 | cons = self._orig._constructor 199 | return cons(result, name=name, index=index) 200 | 201 | @staticmethod 202 | def _validate(data): 203 | """ 204 | For now we assume that the dtype is already a `set`. 205 | Remains to be decided if list-like structures should be implicitly converted to sets 206 | """ 207 | if not (isinstance(data, ABCSeries) 208 | and is_object_dtype(data) 209 | and data.map(is_list_like).all() 210 | and data.map(is_set_type).all() 211 | ): 212 | raise AttributeError("Can only use .set accessor with object dtype. " 213 | "All values must be of `set` or `frozenset` type too. 
" 214 | "Null values are rejected, " 215 | "so replace them with empty sets first.") 216 | 217 | def len(self): 218 | # TODO make it use _no_args_wrapper like the StringMethods equivalent does 219 | # return self._data.map(set).map(len) 220 | return self._wrap_result(_na_map(len, self._data)) 221 | 222 | def contains(self, elem): 223 | f = lambda x: elem in x 224 | result = _na_map(f, self._data) 225 | 226 | return self._wrap_result(result) 227 | 228 | def isdisjoint(self, other): 229 | f = lambda x: x.isdisjoint(other) 230 | result = _na_map(f, self._data) 231 | 232 | return self._wrap_result(result) 233 | 234 | def issubset(self, other): 235 | f = lambda x: x.issubset(other) 236 | result = _na_map(f, self._data) 237 | 238 | return self._wrap_result(result) 239 | 240 | def issuperset(self, other): 241 | f = lambda x: x.issuperset(other) 242 | result = _na_map(f, self._data) 243 | 244 | return self._wrap_result(result) 245 | 246 | def union(self, *others): 247 | def f(x): 248 | return x.union(*others) 249 | 250 | result = _na_map(f, self._data) 251 | 252 | return self._wrap_result(result) 253 | 254 | def intersection(self, *others): 255 | def f(x): 256 | return x.intersection(*others) 257 | 258 | result = _na_map(f, self._data) 259 | 260 | return self._wrap_result(result) 261 | 262 | def difference(self, *others): 263 | def f(x): 264 | return x.difference(*others) 265 | 266 | result = _na_map(f, self._data) 267 | 268 | return self._wrap_result(result) 269 | 270 | def symmetric_difference(self, other): 271 | def f(x): 272 | return x.symmetric_difference(other) 273 | 274 | result = _na_map(f, self._data) 275 | 276 | return self._wrap_result(result) 277 | 278 | def copy(self): 279 | # TODO make it use _no_args_wrapper like the StringMethods equivalent does 280 | return self._wrap_result(_na_map(lambda x: x.copy(), self._data))  # x.copy() also works for frozenset; set.copy does not 281 | 282 | def update(self, *others): 283 | def f(x): 284 | x.update(*others) 285 | return x 286 | 287 | result = _na_map(f, self._data) 288 | 289 | 
return self._wrap_result(result) 290 | 291 | def intersection_update(self, *others): 292 | def f(x): 293 | x.intersection_update(*others) 294 | return x 295 | 296 | result = _na_map(f, self._data) 297 | 298 | return self._wrap_result(result) 299 | 300 | def difference_update(self, *others): 301 | def f(x): 302 | x.difference_update(*others)  # unpack so each argument is treated as its own iterable of elements 303 | return x 304 | 305 | result = _na_map(f, self._data) 306 | 307 | return self._wrap_result(result) 308 | 309 | def symmetric_difference_update(self, other): 310 | def f(x): 311 | x.symmetric_difference_update(other) 312 | return x 313 | 314 | result = _na_map(f, self._data) 315 | 316 | return self._wrap_result(result) 317 | 318 | def add(self, elem): 319 | result = set_add(self._data, elem) 320 | return self._wrap_result(result) 321 | 322 | def remove(self, elem): 323 | def f(x): 324 | x.remove(elem) 325 | return x 326 | 327 | result = _na_map(f, self._data) 328 | 329 | return self._wrap_result(result) 330 | 331 | def discard(self, elem): 332 | def f(x): 333 | x.discard(elem) 334 | return x 335 | 336 | result = _na_map(f, self._data) 337 | 338 | return self._wrap_result(result) 339 | 340 | def pop(self): 341 | # TODO make it use _no_args_wrapper like the StringMethods equivalent does 342 | def f(x): 343 | x.pop() 344 | return x 345 | 346 | result = _na_map(f, self._data) 347 | 348 | return self._wrap_result(result) 349 | 350 | def clear(self): 351 | # TODO make it use _no_args_wrapper like the StringMethods equivalent does 352 | def f(x): 353 | x.clear() 354 | return x 355 | 356 | result = _na_map(f, self._data) 357 | 358 | return self._wrap_result(result) 359 | 360 | @classmethod 361 | def _make_accessor(cls, data): 362 | cls._validate(data) 363 | return cls(data) 364 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas>=0.24.0 
-------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- encoding: utf-8 -*- 3 | 4 | from setuptools import setup, find_packages 5 | 6 | with open("README.md", "r") as fh: 7 | long_description = fh.read() 8 | 9 | setup( 10 | name='pandas-sets', 11 | version='v0.2.1', 12 | packages=find_packages(), 13 | license='BSD', 14 | description='Pandas - Sets: Set-oriented Operations in Pandas', 15 | long_description=long_description, 16 | long_description_content_type="text/markdown", 17 | author='Florents Tselai', 18 | author_email='florents@tselai.com', 19 | url='https://github.com/Florents-Tselai/pandas-sets', 20 | install_requires=open('requirements.txt').read().splitlines() 21 | ) 22 | -------------------------------------------------------------------------------- /tests/test_sets.py: -------------------------------------------------------------------------------- 1 | from unittest import TestCase 2 | import pandas_sets 3 | from pandas.testing import assert_series_equal 4 | from pandas import Series, DataFrame 5 | 6 | """ 7 | Currently testing "ideal-world scenarios" 8 | 9 | TODO 10 | * Test with nulls etc. Default values there? E.g. with set.pop / set.discard 11 | * Decide what to do when series are of iterable types etc. 
12 | """ 13 | 14 | 15 | class APITestCase(TestCase): 16 | def setUp(self): 17 | pass 18 | 19 | @property 20 | def simple_case_no_na_with_empty(self): 21 | return Series({ 22 | 'a': set([1]), 23 | 'b': set([3, 4, 5]), 24 | 'c': set([]) 25 | }) 26 | 27 | @property 28 | def frozenset_no_na_with_empty(self): 29 | return Series({ 30 | 'a': frozenset([1]), 31 | 'b': frozenset([3, 4, 5]), 32 | 'c': frozenset([]) 33 | }) 34 | 35 | @property 36 | def simple_case_no_na_without_empty(self): 37 | return Series({ 38 | 'a': set([1]), 39 | 'b': set([3, 4, 5]) 40 | }) 41 | 42 | def test_validate(self): 43 | # TODO 44 | pass 45 | 46 | def test_len(self): 47 | assert_series_equal(self.simple_case_no_na_with_empty.set.len(), Series({ 48 | 'a': 1, 49 | 'b': 3, 50 | 'c': 0 51 | })) 52 | 53 | def test_add(self): 54 | assert_series_equal(self.simple_case_no_na_with_empty.set.add(1), Series({ 55 | 'a': {1}, 56 | 'b': {1, 3, 4, 5}, 57 | 'c': {1} 58 | })) 59 | 60 | def test_contains(self): 61 | assert_series_equal(self.simple_case_no_na_with_empty.set.contains(1), Series({ 62 | 'a': True, 63 | 'b': False, 64 | 'c': False 65 | })) 66 | 67 | def test_isdisjoint(self): 68 | assert_series_equal(self.simple_case_no_na_with_empty.set.isdisjoint({3, 4}), 69 | Series({ 70 | 'a': True, 71 | 'b': False, 72 | 'c': True 73 | })) 74 | 75 | def test_issubset(self): 76 | pass 77 | 78 | def test_issuperset(self): 79 | pass 80 | 81 | def test_union(self): 82 | pass 83 | 84 | def test_pop(self): 85 | 86 | # Assert raises KeyError on empty set 87 | with self.assertRaises(KeyError): 88 | self.simple_case_no_na_with_empty.set.pop() 89 | 90 | s = self.simple_case_no_na_without_empty.set.pop() 91 | assert_series_equal(s.set.len(), 92 | Series({ 93 | 'a': 0, 94 | 'b': 2 95 | })) 96 | 97 | def test_frozensets_are_allowed(self): 98 | assert_series_equal(self.frozenset_no_na_with_empty.set.contains(1), Series({ 99 | 'a': True, 100 | 'b': False, 101 | 'c': False 102 | })) 103 | 
--------------------------------------------------------------------------------
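The mutating methods in `pandas_sets/sets.py` (`add`, `update`, `remove`, `pop`, and so on) all follow the pattern `x.add(elem); return x`, so they modify the set objects stored in the original `Series` and cannot work on `frozenset` values, which have no mutating methods. The sketch below illustrates that behavior in plain Python, without pandas. The helpers `na_map`, `add_in_place`, and `add_copy` are hypothetical names introduced only for this illustration; `na_map` is a simplified stand-in for `strings._na_map`, with `None` playing the role of NA.

```python
def na_map(func, values):
    """Apply func to every value, passing None (our NA stand-in) through unchanged."""
    return [None if v is None else func(v) for v in values]


def add_in_place(values, elem):
    """Same pattern as set_add in sets.py: mutate each set, then return it."""
    def f(x):
        x.add(elem)
        return x
    return na_map(f, values)


def add_copy(values, elem):
    """Non-mutating alternative (an assumption, not the package's behavior):
    build a new set of the same type as the input value."""
    return na_map(lambda x: x | {elem}, values)


original = [{1}, {3, 4, 5}, set()]
result = add_in_place(original, 1)
# The "result" holds the very same objects, so the input was modified too.
print(result[2] is original[2])  # True

frozen = [frozenset({1}), frozenset()]
copied = add_copy(frozen, 2)      # works: frozenset | set yields a new frozenset
print(copied)

try:
    add_in_place(frozen, 2)       # frozenset has no .add method
except AttributeError as exc:
    print("frozenset cannot be mutated:", exc)
```

If side effects on the source `Series` are not wanted, a copy-based variant like `add_copy` is one option; whether the package should adopt it is a design decision the source leaves open.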