├── .gitignore
├── CHANGES.md
├── LICENSE
├── README.md
├── pandas_sets
    ├── __init__.py
    └── sets.py
├── requirements.txt
├── setup.py
└── tests
    └── test_sets.py


/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | *egg-info/
3 | build/
4 | dist/
5 | *.pyc
6 | 


--------------------------------------------------------------------------------
/CHANGES.md:
--------------------------------------------------------------------------------
1 | ### 0.2.1 (2020-05-27)
2 | - Added support for frozensets [#3](https://github.com/Florents-Tselai/pandas-sets/pull/3). Thanks [knaveofdiamonds](https://github.com/knaveofdiamonds)!
3 | - Fixed namespace warnings for recent Pandas releases (about `testing` and `Index` / `MultiIndex`)
4 | 
5 | ### 0.1.1 (2018-12-27)
6 | - Bug Fix: returned NAs
7 | 
8 | ### 0.1.0 (2018-12-26)
9 | - Initially Released Version


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | BSD 3-Clause License
 2 | 
 3 | Copyright (c) 2018, Florents Tselai
 4 | All rights reserved.
 5 | 
 6 | Redistribution and use in source and binary forms, with or without
 7 | modification, are permitted provided that the following conditions are met:
 8 | 
 9 | * Redistributions of source code must retain the above copyright notice, this
10 |   list of conditions and the following disclaimer.
11 | 
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 |   this list of conditions and the following disclaimer in the documentation
14 |   and/or other materials provided with the distribution.
15 | 
16 | * Neither the name of the copyright holder nor the names of its
17 |   contributors may be used to endorse or promote products derived from
18 |   this software without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Pandas Sets: Set-oriented Operations in Pandas
 2 | 
 3 | If you store standard Python `set`s or `frozenset`s in your `Series` or `DataFrame` objects, you'll find this useful.
 4 | 
 5 | The `pandas_sets` package adds a `.set` accessor to any pandas `Series` object;
 6 | it's like `.dt` for `datetime` or `.str` for `string`, but for [`set`](https://docs.python.org/3.7/library/stdtypes.html#set).
 7 | 
 8 | It exposes all public methods available in the standard [`set`](https://docs.python.org/3.7/library/stdtypes.html#set).
 9 | 
10 | ## Installation
11 | ```bash
12 | pip install pandas-sets
13 | ```
14 | Just import the `pandas_sets` package and it will register a `.set` accessor to any `Series` object.
15 | 
16 | ```python
17 | import pandas_sets
18 | ```
19 | 
20 | ## Examples
21 | ```python
22 | import pandas_sets
23 | import pandas as pd
24 | df = pd.DataFrame({'post': [1, 2, 3, 4],
25 |                     'tags': [{'python', 'pandas'}, {'philosophy', 'strategy'}, {'scikit-learn'}, {'pandas'}]
26 |                    })
27 | 
28 | pandas_posts = df[df.tags.set.contains('pandas')]
29 | 
30 | pandas_posts.tags.set.add('data')
31 | 
32 | pandas_posts.tags.set.update({'data', 'analysis'})
33 | 
34 | pandas_posts.tags.set.len()
35 | ```
36 | 
37 | ## Notes
 38 | * The implementation is primitive for now. It's based heavily on pandas' core [`StringMethods`](https://github.com/pandas-dev/pandas/blob/52a2bb490556a86c5f756465320c18977dbe1c36/pandas/core/strings.py#L1783) implementation.
39 | * The public API has been tested for most expected scenarios.
40 | * The API will need to be extended to handle `NA` values appropriately.
41 | 
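
The mutating calls in the Examples section (`add`, `update`) appear to modify the stored sets in place: `set_add` in `sets.py` calls `x.add(elem)` and returns the same object. A filtered frame such as `pandas_posts` holds references to the same `set` objects as `df`, so the change would be visible through both. The following is a minimal plain-Python sketch of that aliasing, with a list standing in for a `Series`; the names `tags`, `pandas_tags`, and `tagged_copy` are illustrative only and not part of the package.

```python
# A list of sets stands in for the `tags` column (illustrative only).
tags = [{'python', 'pandas'}, {'philosophy', 'strategy'}, {'scikit-learn'}, {'pandas'}]

# Filtering builds a new container, but it still points at the same set objects.
pandas_tags = [t for t in tags if 'pandas' in t]

# In-place mutation, analogous to what `.set.add('data')` does per element.
for t in pandas_tags:
    t.add('data')

# The original container sees the change because the sets are shared.
assert 'data' in tags[0] and 'data' in tags[3]
assert 'data' not in tags[1]

# A non-mutating alternative: build new sets, analogous to `.set.union({...})`.
tagged_copy = [t | {'analysis'} for t in pandas_tags]
assert all('analysis' in t for t in tagged_copy)
assert all('analysis' not in t for t in tags)
```

If the original frame must stay untouched, `union` (which returns new sets) is likely the safer choice over `add` or `update`.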


--------------------------------------------------------------------------------
/pandas_sets/__init__.py:
--------------------------------------------------------------------------------
1 | from pandas_sets.sets import SetMethods
2 | 


--------------------------------------------------------------------------------
/pandas_sets/sets.py:
--------------------------------------------------------------------------------
  1 | from pandas.api.extensions import register_series_accessor
  2 | from pandas.core import strings
  3 | from pandas.core.base import NoNewAttributesMixin
  4 | from pandas.core.dtypes.generic import ABCSeries
  5 | from pandas.core.dtypes.common import is_bool_dtype
  6 | from pandas.core.dtypes.common import is_list_like
  7 | from pandas.core.dtypes.common import is_object_dtype
  8 | import numpy as np
  9 | 
 10 | copy = strings.copy
 11 | 
 12 | 
 13 | def is_set_type(data):
 14 |     return isinstance(data, set) or isinstance(data, frozenset)
 15 | 
 16 | 
 17 | #def _map(*args, **kwargs):
 18 | #    return strings._map(*args, **kwargs)
 19 | 
 20 | 
 21 | def _na_map(*args, **kwargs):
 22 |     return strings._na_map(*args, **kwargs)
 23 | 
 24 | 
 25 | def set_contains(arr, elem):
 26 |     pass
 27 | 
 28 | 
 29 | def set_isdisjoint(arr, other):
 30 |     pass
 31 | 
 32 | 
 33 | def set_issubset(arr, other):
 34 |     pass
 35 | 
 36 | 
 37 | def set_issuperset(arr, other):
 38 |     pass
 39 | 
 40 | 
 41 | def set_union(arr, *others):
 42 |     pass
 43 | 
 44 | 
 45 | def set_intersection(arr, *others):
 46 |     pass
 47 | 
 48 | 
 49 | def set_difference(arr, *others):
 50 |     pass
 51 | 
 52 | 
 53 | def set_symmetric_difference(arr, other):
 54 |     pass
 55 | 
 56 | 
 57 | def set_copy(arr):
 58 |     pass
 59 | 
 60 | 
 61 | def set_update(arr, *others):
 62 |     pass
 63 | 
 64 | 
 65 | def set_intersection_update(arr, *others):
 66 |     pass
 67 | 
 68 | 
 69 | def set_difference_update(arr, *others):
 70 |     pass
 71 | 
 72 | 
 73 | def set_symmetric_difference_update(arr, other):
 74 |     pass
 75 | 
 76 | 
 77 | def set_add(arr, elem):
 78 |     def f(x):
 79 |         x.add(elem)
 80 |         return x
 81 | 
 82 |     return _na_map(f, arr)
 83 | 
 84 | 
 85 | def set_remove(arr, elem):
 86 |     pass
 87 | 
 88 | 
 89 | def set_discard(arr, elem):
 90 |     pass
 91 | 
 92 | 
 93 | def set_pop(arr):
 94 |     pass
 95 | 
 96 | 
 97 | def set_clear(arr):
 98 |     pass
 99 | 
100 | 
101 | @register_series_accessor("set")
102 | class SetMethods(NoNewAttributesMixin):
103 |     """
104 |     Intends to have an implementation similar to `pandas.core.strings.StringMethods`
105 | 
106 |     Vectorized set functions for Series. NAs stay NA unless
107 |     handled otherwise by a particular method. Patterned after Python's set
108 |     methods.
109 | 
110 |     Examples
111 |     --------
112 |     >>> s.set.union({1, 2, 3})
113 |     >>> s.set.intersection(set())
114 |     """
115 | 
116 |     def __init__(self, data):
117 |         self._validate(data)
118 |         self._data = data
119 |         self._orig = data
120 | 
121 |     def _wrap_result(self, result, use_codes=True,
122 |                      name=None, expand=None):
123 | 
124 |         # TODO: this was blindly copied from `strings.StringMethods._wrap_result` for now
125 |         from pandas import Index, MultiIndex
126 | 
127 |         # for category, we do the stuff on the categories, so blow it up
128 |         # to the full series again
129 |         # But for some operations, we have to do the stuff on the full values,
130 |         # so make it possible to skip this step as the method already did this
131 |         # before the transformation...
132 |         # if use_codes and self._is_categorical:
133 |         #    result = take_1d(result, self._orig.cat.codes)
134 | 
135 |         if not hasattr(result, 'ndim') or not hasattr(result, 'dtype'):
136 |             return result
137 |         assert result.ndim < 3
138 | 
139 |         if expand is None:
140 |             # infer from ndim if expand is not specified
141 |             expand = False if result.ndim == 1 else True
142 | 
143 |         elif expand is True and not isinstance(self._orig, Index):
144 |             # required when expand=True is explicitly specified
145 |             # not needed when inferred
146 | 
147 |             def cons_row(x):
148 |                 if is_list_like(x):
149 |                     return x
150 |                 else:
151 |                     return [x]
152 | 
153 |             result = [cons_row(x) for x in result]
154 |             if result:
155 |                 # propagate nan values to match longest sequence (GH 18450)
156 |                 max_len = max(len(x) for x in result)
157 |                 result = [x * max_len if len(x) == 0 or x[0] is np.nan
158 |                           else x for x in result]
159 | 
160 |         if not isinstance(expand, bool):
161 |             raise ValueError("expand must be True or False")
162 | 
163 |         if expand is False:
164 |             # if expand is False, result should have the same name
165 |             # as the original otherwise specified
166 |             if name is None:
167 |                 name = getattr(result, 'name', None)
168 |             if name is None:
169 |                 # do not use logical or, _orig may be a DataFrame
170 |                 # which has "name" column
171 |                 name = self._orig.name
172 | 
173 |         # Wait until we are sure result is a Series or Index before
174 |         # checking attributes (GH 12180)
175 |         if isinstance(self._orig, Index):
176 |             # if result is a boolean np.array, return the np.array
177 |             # instead of wrapping it into a boolean Index (GH 8875)
178 |             if is_bool_dtype(result):
179 |                 return result
180 | 
181 |             if expand:
182 |                 result = list(result)
183 |                 out = MultiIndex.from_tuples(result, names=name)
184 |                 if out.nlevels == 1:
185 |                     # We had all tuples of length-one, which are
186 |                     # better represented as a regular Index.
187 |                     out = out.get_level_values(0)
188 |                 return out
189 |             else:
190 |                 return Index(result, name=name)
191 |         else:
192 |             index = self._orig.index
193 |             if expand:
194 |                 cons = self._orig._constructor_expanddim
195 |                 return cons(result, columns=name, index=index)
196 |             else:
197 |                 # Must be a Series
198 |                 cons = self._orig._constructor
199 |                 return cons(result, name=name, index=index)
200 | 
201 |     @staticmethod
202 |     def _validate(data):
203 |         """
204 |         For now we assume that the dtype is already a `set`.
205 |         Remains to be decided if list-like structures should be implicitly converted to sets
206 |         """
207 |         if not (isinstance(data, ABCSeries)
208 |                 and is_object_dtype(data)
209 |                 and data.map(is_list_like).all()
210 |                 and data.map(is_set_type).all()
211 |         ):
212 |             raise AttributeError("Can only use .set accessor with object dtype. "
213 |                                  "All values must be of `set` or `frozenset` type too. "
214 |                                  "Null values are rejected, "
215 |                                  "so replace them with empty sets first.")
216 | 
217 |     def len(self):
218 |         # TODO make it use _no_args_wrapper like the StringMethods equivalent does
219 |         # return self._data.map(set).map(len)
220 |         return self._wrap_result(_na_map(len, self._data))
221 | 
222 |     def contains(self, elem):
223 |         f = lambda x: elem in x
224 |         result = _na_map(f, self._data)
225 | 
226 |         return self._wrap_result(result)
227 | 
228 |     def isdisjoint(self, other):
229 |         f = lambda x: x.isdisjoint(other)
230 |         result = _na_map(f, self._data)
231 | 
232 |         return self._wrap_result(result)
233 | 
234 |     def issubset(self, other):
235 |         f = lambda x: x.issubset(other)
236 |         result = _na_map(f, self._data)
237 | 
238 |         return self._wrap_result(result)
239 | 
240 |     def issuperset(self, other):
241 |         f = lambda x: x.issuperset(other)
242 |         result = _na_map(f, self._data)
243 | 
244 |         return self._wrap_result(result)
245 | 
246 |     def union(self, *others):
247 |         def f(x):
248 |             return x.union(*others)
249 | 
250 |         result = _na_map(f, self._data)
251 | 
252 |         return self._wrap_result(result)
253 | 
254 |     def intersection(self, *others):
255 |         def f(x):
256 |             return x.intersection(*others)
257 | 
258 |         result = _na_map(f, self._data)
259 | 
260 |         return self._wrap_result(result)
261 | 
262 |     def difference(self, *others):
263 |         def f(x):
264 |             return x.difference(*others)
265 | 
266 |         result = _na_map(f, self._data)
267 | 
268 |         return self._wrap_result(result)
269 | 
270 |     def symmetric_difference(self, other):
271 |         def f(x):
272 |             return x.symmetric_difference(other)
273 | 
274 |         result = _na_map(f, self._data)
275 | 
276 |         return self._wrap_result(result)
277 | 
278 |     def copy(self):
279 |         # TODO make it use _no_args_wrapper like the StringMethods equivalent does
280 |         return self._wrap_result(_na_map(lambda x: x.copy(), self._data))
281 | 
282 |     def update(self, *others):
283 |         def f(x):
284 |             x.update(*others)
285 |             return x
286 | 
287 |         result = _na_map(f, self._data)
288 | 
289 |         return self._wrap_result(result)
290 | 
291 |     def intersection_update(self, *others):
292 |         def f(x):
293 |             x.intersection_update(*others)
294 |             return x
295 | 
296 |         result = _na_map(f, self._data)
297 | 
298 |         return self._wrap_result(result)
299 | 
300 |     def difference_update(self, *others):
301 |         def f(x):
302 |             x.difference_update(*others)
303 |             return x
304 | 
305 |         result = _na_map(f, self._data)
306 | 
307 |         return self._wrap_result(result)
308 | 
309 |     def symmetric_difference_update(self, other):
310 |         def f(x):
311 |             x.symmetric_difference_update(other)
312 |             return x
313 | 
314 |         result = _na_map(f, self._data)
315 | 
316 |         return self._wrap_result(result)
317 | 
318 |     def add(self, elem):
319 |         result = set_add(self._data, elem)
320 |         return self._wrap_result(result)
321 | 
322 |     def remove(self, elem):
323 |         def f(x):
324 |             x.remove(elem)
325 |             return x
326 | 
327 |         result = _na_map(f, self._data)
328 | 
329 |         return self._wrap_result(result)
330 | 
331 |     def discard(self, elem):
332 |         def f(x):
333 |             x.discard(elem)
334 |             return x
335 | 
336 |         result = _na_map(f, self._data)
337 | 
338 |         return self._wrap_result(result)
339 | 
340 |     def pop(self):
341 |         # TODO make it use _no_args_wrapper like the StringMethods equivalent does
342 |         def f(x):
343 |             x.pop()
344 |             return x
345 | 
346 |         result = _na_map(f, self._data)
347 | 
348 |         return self._wrap_result(result)
349 | 
350 |     def clear(self):
351 |         # TODO make it use _no_args_wrapper like the StringMethods equivalent does
352 |         def f(x):
353 |             x.clear()
354 |             return x
355 | 
356 |         result = _na_map(f, self._data)
357 | 
358 |         return self._wrap_result(result)
359 | 
360 |     @classmethod
361 |     def _make_accessor(cls, data):
362 |         cls._validate(data)
363 |         return cls(data)
364 | 
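
Several accessor methods in this module take `*others` and forward it to the matching built-in `set` method. Whether the tuple is unpacked at the call site changes the result, and calling through the `set` type rather than through the instance matters once `frozenset` values are allowed. A small plain-Python sketch of both points follows; the variable names are illustrative and not part of the package.

```python
others = (frozenset({1}), frozenset({2}))

# Unpacked: each argument is treated as an iterable of elements to remove.
unpacked = {1, 2, 3}
unpacked.difference_update(*others)
assert unpacked == {3}

# Not unpacked: the tuple itself is the iterable, so its items (two
# frozensets) are the candidate elements; none is in the set, nothing changes.
packed = {1, 2, 3}
packed.difference_update(others)
assert packed == {1, 2, 3}

# Calling the method on the instance works for both set and frozenset ...
assert frozenset({1}).copy() == frozenset({1})

# ... while the unbound `set.copy` rejects a frozenset with a TypeError.
try:
    set.copy(frozenset({1}))
    rejected = False
except TypeError:
    rejected = True
assert rejected
```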


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pandas>=0.24.0


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/python
 2 | # -*- encoding: utf-8 -*-
 3 | 
 4 | from setuptools import setup, find_packages
 5 | 
 6 | with open("README.md", "r") as fh:
 7 |     long_description = fh.read()
 8 | 
 9 | setup(
10 |     name='pandas-sets',
11 |     version='v0.2.1',
12 |     packages=find_packages(),
13 |     license='BSD',
14 |     description='Pandas - Sets:  Set-oriented Operations in Pandas',
15 |     long_description=long_description,
16 |     long_description_content_type="text/markdown",
17 |     author='Florents Tselai',
18 |     author_email='florents@tselai.com',
19 |     url='https://github.com/Florents-Tselai/pandas-sets',
20 |     install_requires=open('requirements.txt').read().splitlines()
21 | )
22 | 


--------------------------------------------------------------------------------
/tests/test_sets.py:
--------------------------------------------------------------------------------
  1 | from unittest import TestCase
  2 | import pandas_sets
  3 | from pandas.testing import assert_series_equal
  4 | from pandas import Series, DataFrame
  5 | 
  6 | """
  7 | Currently testing "ideal-world scenarios"
  8 | 
  9 | TODO
 10 | * Test with nulls etc. Default values there? E.g. with set.pop / set.discard
 11 | * Decide what to do when series are of iterable types etc.
 12 | """
 13 | 
 14 | 
 15 | class APITestCase(TestCase):
 16 |     def setUp(self):
 17 |         pass
 18 | 
 19 |     @property
 20 |     def simple_case_no_na_with_empty(self):
 21 |         return Series({
 22 |             'a': set([1]),
 23 |             'b': set([3, 4, 5]),
 24 |             'c': set([])
 25 |         })
 26 | 
 27 |     @property
 28 |     def frozenset_no_na_with_empty(self):
 29 |         return Series({
 30 |             'a': frozenset([1]),
 31 |             'b': frozenset([3, 4, 5]),
 32 |             'c': frozenset([])
 33 |         })
 34 | 
 35 |     @property
 36 |     def simple_case_no_na_without_empty(self):
 37 |         return Series({
 38 |             'a': set([1]),
 39 |             'b': set([3, 4, 5])
 40 |         })
 41 | 
 42 |     def test_validate(self):
 43 |         # TODO
 44 |         pass
 45 | 
 46 |     def test_len(self):
 47 |         assert_series_equal(self.simple_case_no_na_with_empty.set.len(), Series({
 48 |             'a': 1,
 49 |             'b': 3,
 50 |             'c': 0
 51 |         }))
 52 | 
 53 |     def test_add(self):
 54 |         assert_series_equal(self.simple_case_no_na_with_empty.set.add(1), Series({
 55 |             'a': {1},
 56 |             'b': {1, 3, 4, 5},
 57 |             'c': {1}
 58 |         }))
 59 | 
 60 |     def test_contains(self):
 61 |         assert_series_equal(self.simple_case_no_na_with_empty.set.contains(1), Series({
 62 |             'a': True,
 63 |             'b': False,
 64 |             'c': False
 65 |         }))
 66 | 
 67 |     def test_isdisjoint(self):
 68 |         assert_series_equal(self.simple_case_no_na_with_empty.set.isdisjoint({3, 4}),
 69 |                                Series({
 70 |                                    'a': True,
 71 |                                    'b': False,
 72 |                                    'c': True
 73 |                                }))
 74 | 
 75 |     def test_issubset(self):
 76 |         pass
 77 | 
 78 |     def test_issuperset(self):
 79 |         pass
 80 | 
 81 |     def test_union(self):
 82 |         pass
 83 | 
 84 |     def test_pop(self):
 85 | 
 86 |         # Assert raises KeyError on empty set
 87 |         with self.assertRaises(KeyError):
 88 |             self.simple_case_no_na_with_empty.set.pop()
 89 | 
 90 |         s = self.simple_case_no_na_without_empty.set.pop()
 91 |         assert_series_equal(s.set.len(),
 92 |                                Series({
 93 |                                    'a': 0,
 94 |                                    'b': 2
 95 |                                }))
 96 | 
 97 |     def test_frozensets_are_allowed(self):
 98 |         assert_series_equal(self.frozenset_no_na_with_empty.set.contains(1), Series({
 99 |             'a': True,
100 |             'b': False,
101 |             'c': False
102 |         }))
103 | 


--------------------------------------------------------------------------------