├── .gitignore ├── .travis.yml ├── CHANGELOG.md ├── LICENSE.txt ├── MANIFEST.in ├── README.rst ├── docs └── source │ ├── conf.py │ ├── index.rst │ └── pandas-validation │ ├── api.rst │ ├── installation.rst │ └── quickstart.rst ├── pandasvalidation.py ├── release-checklist.rst ├── requirements.txt ├── setup.py └── test_pandasvalidation.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | 5 | # C extensions 6 | *.so 7 | 8 | # Distribution / packaging 9 | bin/ 10 | build/ 11 | develop-eggs/ 12 | dist/ 13 | eggs/ 14 | lib/ 15 | lib64/ 16 | parts/ 17 | sdist/ 18 | var/ 19 | *.egg-info/ 20 | .installed.cfg 21 | *.egg 22 | 23 | # Installer logs 24 | pip-log.txt 25 | pip-delete-this-directory.txt 26 | 27 | # Unit test / coverage reports 28 | .tox/ 29 | .coverage 30 | .cache 31 | nosetests.xml 32 | coverage.xml 33 | 34 | # Translations 35 | *.mo 36 | 37 | # Mr Developer 38 | .mr.developer.cfg 39 | .project 40 | .pydevproject 41 | 42 | # Rope 43 | .ropeproject 44 | 45 | # Django stuff: 46 | *.log 47 | *.pot 48 | 49 | # Sphinx documentation 50 | docs/_build/ 51 | 52 | # Apple OSX files 53 | .DS_Store 54 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: python 2 | 3 | python: 4 | - '3.5' 5 | - '3.6' 6 | 7 | # whitelist 8 | branches: 9 | only: 10 | - master 11 | 12 | install: 13 | - pip install . 
14 | - pip install pycodestyle 15 | - pip install pytest 16 | - pip install coverage 17 | - pip install codecov 18 | 19 | script: 20 | - pycodestyle pandasvalidation.py test_pandasvalidation.py setup.py 21 | - coverage run -m pytest test_pandasvalidation.py 22 | - coverage report -m pandasvalidation.py 23 | 24 | after_success: 25 | - codecov 26 | -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | # Changelog # 2 | 3 | Tracking changes in pandas-validation between versions. 4 | See also https://github.com/jmenglund/pandas-validation/releases. 5 | 6 | 7 | ## 0.5.0 ## 8 | 9 | This is a minor release with the following changes: 10 | 11 | * The function `validate_datetime()` is deprecated and replaced by the functions 12 | `validate_date()` and `validate_timestamp()`. The new functions validate 13 | values of types [datetime.date](https://docs.python.org/3/library/datetime.html#datetime.date) and [pandas.Timestamp](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.html), respectively. 14 | * Type conversion is no longer carried out by the functions `validate_numeric()` 15 | and `validate_string()`. It is now up to the user to ensure that the data types 16 | are correct before the validation. 17 | * Documentation has been updated with two new quickstart examples. 18 | 19 | Released: 2019-06-13 20 | 21 | [View commits](https://github.com/jmenglund/pandas-validation/compare/v0.4.0...v0.5.0) 22 | 23 | 24 | ## 0.4.0 ## 25 | 26 | This is a minor release with the following changes: 27 | 28 | * Non-NumPy numeric dtypes should now be supported. 29 | * The ValidationWarning is now issued at stack level 2. This makes it possible to 30 | trace the line of code that called the function that raised the warning. 
31 | 32 | Released: 2019-05-27 33 | 34 | [View commits](https://github.com/jmenglund/pandas-validation/compare/v0.3.2...v0.4.0) 35 | 36 | 37 | ## 0.3.2 ## 38 | 39 | This is a patch release that fixes an issue with validating numbers with `min_value=0` 40 | or `max_value=0`. 41 | 42 | Released: 2019-02-02 43 | 44 | [View commits](https://github.com/jmenglund/pandas-validation/compare/v0.3.1...v0.3.2) 45 | 46 | 47 | ## 0.3.1 ## 48 | 49 | This is a patch release with a few fixes to the documentation. 50 | 51 | Released: 2018-10-18 52 | 53 | [View commits](https://github.com/jmenglund/pandas-validation/compare/v0.3.0...v0.3.1) 54 | 55 | 56 | ## 0.3.0 ## 57 | 58 | This minor release contains the following changes: 59 | 60 | * The validation functions now have a `return_type` argument that gives 61 | the user control over the output. This replaces the `return_values` argument. 62 | * When returning values, the validation functions now filter out all invalid 63 | values. 64 | * A few tests have been added to `test_pandasvalidation.py`. The test coverage 65 | is now complete. 66 | * Documentation is up to date. 67 | * Removed use of the deprecated `pandas.tslib` 68 | 69 | Released: 2018-01-03 70 | 71 | [View commits](https://github.com/jmenglund/pandas-validation/compare/v0.2.0...v0.3.0) 72 | 73 | 74 | ## 0.2.0 ## 75 | 76 | This minor release contains the following changes: 77 | 78 | * Function `not_convertible()` renamed `mask_nonconvertible()` 79 | * Updated text in `README.rst` 80 | * Updated instructions in `release-checklist.rst` 81 | * Small fixes to the documentation 82 | 83 | Released: 2017-09-17 84 | 85 | [View commits](https://github.com/jmenglund/pandas-validation/compare/v0.1.1...v0.2.0) 86 | 87 | 88 | ## 0.1.1 ## 89 | 90 | This patch release contains a number of small fixes. 
91 | 92 | * Updated text in `README.rst` 93 | * Updated instructions for Travis-CI (`.travis.yml`) 94 | * Added `release-checklist.rst` 95 | * Libraries for testing removed from `requirements.txt` 96 | 97 | Released: 2017-09-15 98 | 99 | [View commits](https://github.com/jmenglund/pandas-validation/compare/v0.1.0...v0.1.1) 100 | 101 | 102 | ## 0.1.0 ## 103 | 104 | Initial release. 105 | 106 | Released: 2016-03-16 107 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Markus Englund 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /MANIFEST.in: -------------------------------------------------------------------------------- 1 | include *.rst 2 | include CHANGELOG.md 3 | include LICENSE.txt 4 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | pandas-validation 2 | ================= 3 | 4 | |Build-Status| |Coverage-Status| |PyPI-Status| |Doc-Status| |License| 5 | 6 | pandas-validation is a small Python library for validating data 7 | with the Python package `pandas `_. 8 | 9 | Source repository: ``_ 10 | 11 | Documentation at ``_ 12 | 13 | 14 | Installation 15 | ------------ 16 | 17 | For most users, the easiest way is probably to install the latest version 18 | hosted on `PyPI `_: 19 | 20 | .. code-block:: 21 | 22 | $ pip install pandas-validation 23 | 24 | The project is hosted at https://github.com/jmenglund/pandas-validation and 25 | can also be installed using git: 26 | 27 | .. code-block:: 28 | 29 | $ git clone https://github.com/jmenglund/pandas-validation.git 30 | $ cd pandas-validation 31 | $ python setup.py install 32 | 33 | 34 | Running the tests 35 | ----------------- 36 | 37 | Testing is carried out with `pytest `_: 38 | 39 | .. code-block:: 40 | 41 | $ pytest -v test_pandasvalidation.py 42 | 43 | Test coverage can be calculated with `Coverage.py 44 | `_ using the following commands: 45 | 46 | .. code-block:: 47 | 48 | $ coverage run -m pytest 49 | $ coverage report -m pandasvalidation.py 50 | 51 | The code follows the style conventions in `PEP8 52 | `_, which can be checked 53 | with `pycodestyle `_: 54 | 55 | .. 
code-block:: 56 | 57 | $ pycodestyle pandasvalidation.py test_pandasvalidation.py setup.py 58 | 59 | 60 | Building the documentation 61 | -------------------------- 62 | 63 | The documentation can be built with `Sphinx `_ 64 | and the `Read the Docs Sphinx Theme 65 | `_: 66 | 67 | .. code-block:: 68 | 69 | $ cd pandas-validation 70 | $ sphinx-build -b html ./docs/source ./docs/_build/html 71 | 72 | 73 | License 74 | ------- 75 | 76 | pandas-validation is distributed under the `MIT license 77 | `_. 78 | 79 | 80 | Author 81 | ------ 82 | 83 | Markus Englund, `orcid.org/0000-0003-1688-7112 84 | `_ 85 | 86 | 87 | .. |Build-Status| image:: https://api.travis-ci.org/jmenglund/pandas-validation.svg?branch=master 88 | :target: https://travis-ci.org/jmenglund/pandas-validation 89 | :alt: Build status 90 | .. |Coverage-Status| image:: https://codecov.io/gh/jmenglund/pandas-validation/branch/master/graph/badge.svg 91 | :target: https://codecov.io/gh/jmenglund/pandas-validation 92 | :alt: Code coverage 93 | .. |PyPI-Status| image:: https://img.shields.io/pypi/v/pandas-validation.svg 94 | :target: https://pypi.python.org/pypi/pandas-validation 95 | :alt: PyPI status 96 | .. |Doc-Status| image:: https://readthedocs.org/projects/pandas-validation/badge/?version=latest 97 | :target: http://pandas-validation.readthedocs.io/en/latest/?badge=latest 98 | :alt: Documentation status 99 | .. |License| image:: https://img.shields.io/pypi/l/pandas-validation.svg 100 | :target: https://raw.githubusercontent.com/jmenglund/pandas-validation/master/LICENSE.txt 101 | :alt: License 102 | -------------------------------------------------------------------------------- /docs/source/conf.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- coding: utf-8 -*- 3 | # 4 | # pandas-validation documentation build configuration file, created by 5 | # sphinx-quickstart on Tue Jan 19 09:45:10 2016. 
6 | # 7 | # This file is execfile()d with the current directory set to its 8 | # containing dir. 9 | # 10 | # Note that not all possible configuration values are present in this 11 | # autogenerated file. 12 | # 13 | # All configuration values have a default; values that are commented out 14 | # serve to show the default. 15 | 16 | import sys 17 | import os 18 | import sphinx_rtd_theme 19 | from re import match 20 | 21 | # If extensions (or modules to document with autodoc) are in another directory, 22 | # add these directories to sys.path here. If the directory is relative to the 23 | # documentation root, use os.path.abspath to make it absolute, like shown here. 24 | #sys.path.insert(0, os.path.abspath('.')) 25 | 26 | # Add project directory to sys.path 27 | sys.path.insert(0, os.path.abspath("../..")) 28 | 29 | # -- General configuration ------------------------------------------------ 30 | 31 | # If your documentation needs a minimal Sphinx version, state it here. 32 | #needs_sphinx = '1.0' 33 | 34 | # Add any Sphinx extension module names here, as strings. They can be 35 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 36 | # ones. 37 | extensions = [ 38 | 'sphinx.ext.autodoc', 39 | 'sphinx.ext.intersphinx', 40 | 'sphinx.ext.ifconfig', 41 | 'sphinx.ext.napoleon', 42 | ] 43 | 44 | # Napoleon settings 45 | napoleon_google_docstring = True 46 | napoleon_numpy_docstring = True 47 | napoleon_include_private_with_doc = False 48 | napoleon_include_special_with_doc = True 49 | napoleon_use_admonition_for_examples = False 50 | napoleon_use_admonition_for_notes = False 51 | napoleon_use_admonition_for_references = False 52 | napoleon_use_ivar = False 53 | napoleon_use_param = True 54 | napoleon_use_rtype = True 55 | 56 | # Add any paths that contain templates here, relative to this directory. 57 | templates_path = ['_templates'] 58 | 59 | # The suffix(es) of source filenames. 
60 | # You can specify multiple suffix as a list of string: 61 | # source_suffix = ['.rst', '.md'] 62 | source_suffix = '.rst' 63 | 64 | # The encoding of source files. 65 | #source_encoding = 'utf-8-sig' 66 | 67 | # The master toctree document. 68 | master_doc = 'index' 69 | 70 | # General information about the project. 71 | project = 'pandas-validation' 72 | copyright = '2016–2019, Markus Englund' 73 | author = 'Markus Englund' 74 | 75 | # The version info for the project you're documenting, acts as replacement for 76 | # |version| and |release|, also used in various other places throughout the 77 | # built documents. 78 | # 79 | version = match( 80 | r'(^\d+\.\d+)', __import__('pandasvalidation').__version__).group(0) 81 | # The full version, including alpha/beta/rc tags. 82 | release = __import__('pandasvalidation').__version__ 83 | 84 | # The language for content autogenerated by Sphinx. Refer to documentation 85 | # for a list of supported languages. 86 | # 87 | # This is also used if you do content translation via gettext catalogs. 88 | # Usually you set "language" from the command line for these cases. 89 | language = None 90 | 91 | # There are two options for replacing |today|: either, you set today to some 92 | # non-false value, then it is used: 93 | #today = '' 94 | # Else, today_fmt is used as the format for a strftime call. 95 | #today_fmt = '%B %d, %Y' 96 | 97 | # List of patterns, relative to source directory, that match files and 98 | # directories to ignore when looking for source files. 99 | exclude_patterns = ['_test*.py', 'setup.py' ] 100 | 101 | # The reST default role (used for this markup: `text`) to use for all 102 | # documents. 103 | #default_role = None 104 | 105 | # If true, '()' will be appended to :func: etc. cross-reference text. 106 | #add_function_parentheses = True 107 | 108 | # If true, the current module name will be prepended to all description 109 | # unit titles (such as .. function::). 
110 | #add_module_names = True 111 | 112 | # If true, sectionauthor and moduleauthor directives will be shown in the 113 | # output. They are ignored by default. 114 | #show_authors = False 115 | 116 | # The name of the Pygments (syntax highlighting) style to use. 117 | pygments_style = 'sphinx' 118 | 119 | # A list of ignored prefixes for module index sorting. 120 | #modindex_common_prefix = [] 121 | 122 | # If true, keep warnings as "system message" paragraphs in the built documents. 123 | #keep_warnings = False 124 | 125 | # If true, `todo` and `todoList` produce output, else they produce nothing. 126 | todo_include_todos = False 127 | 128 | 129 | # -- Options for HTML output ---------------------------------------------- 130 | 131 | # The theme to use for HTML and HTML Help pages. See the documentation for 132 | # a list of builtin themes. 133 | html_theme = 'sphinx_rtd_theme' 134 | 135 | # Theme options are theme-specific and customize the look and feel of a theme 136 | # further. For a list of options available for each theme, see the 137 | # documentation. 138 | #html_theme_options = {} 139 | 140 | # Add any paths that contain custom themes here, relative to this directory. 141 | html_theme_path = [sphinx_rtd_theme.get_html_theme_path()] 142 | 143 | # The name for this set of Sphinx documents. If None, it defaults to 144 | # "<project> v<release> documentation". 145 | #html_title = None 146 | 147 | # A shorter title for the navigation bar. Default is the same as html_title. 148 | #html_short_title = None 149 | 150 | # The name of an image file (relative to this directory) to place at the top 151 | # of the sidebar. 152 | #html_logo = None 153 | 154 | # The name of an image file (within the static path) to use as favicon of the 155 | # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 156 | # pixels large. 
157 | #html_favicon = None 158 | 159 | # Add any paths that contain custom static files (such as style sheets) here, 160 | # relative to this directory. They are copied after the builtin static files, 161 | # so a file named "default.css" will overwrite the builtin "default.css". 162 | html_static_path = ['_static'] 163 | 164 | # Add any extra paths that contain custom files (such as robots.txt or 165 | # .htaccess) here, relative to this directory. These files are copied 166 | # directly to the root of the documentation. 167 | #html_extra_path = [] 168 | 169 | # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, 170 | # using the given strftime format. 171 | html_last_updated_fmt = '%b %d, %Y' 172 | 173 | # If true, SmartyPants will be used to convert quotes and dashes to 174 | # typographically correct entities. 175 | #html_use_smartypants = True 176 | 177 | # Custom sidebar templates, maps document names to template names. 178 | #html_sidebars = {} 179 | 180 | # Additional templates that should be rendered to pages, maps page names to 181 | # template names. 182 | #html_additional_pages = {} 183 | 184 | # If false, no module index is generated. 185 | #html_domain_indices = True 186 | 187 | # If false, no index is generated. 188 | #html_use_index = True 189 | 190 | # If true, the index is split into individual pages for each letter. 191 | #html_split_index = False 192 | 193 | # If true, links to the reST sources are added to the pages. 194 | #html_show_sourcelink = True 195 | 196 | # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. 197 | #html_show_sphinx = True 198 | 199 | # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. 200 | #html_show_copyright = True 201 | 202 | # If true, an OpenSearch description file will be output, and all pages will 203 | # contain a <link> tag referring to it. The value of this option must be the 204 | # base URL from which the finished HTML is served. 
205 | #html_use_opensearch = '' 206 | 207 | # This is the file name suffix for HTML files (e.g. ".xhtml"). 208 | #html_file_suffix = None 209 | 210 | # Language to be used for generating the HTML full-text search index. 211 | # Sphinx supports the following languages: 212 | # 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja' 213 | # 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr' 214 | #html_search_language = 'en' 215 | 216 | # A dictionary with options for the search language support, empty by default. 217 | # Now only 'ja' uses this config value 218 | #html_search_options = {'type': 'default'} 219 | 220 | # The name of a javascript file (relative to the configuration directory) that 221 | # implements a search results scorer. If empty, the default will be used. 222 | #html_search_scorer = 'scorer.js' 223 | 224 | # Output file base name for HTML help builder. 225 | htmlhelp_basename = 'pandas-validationdoc' 226 | 227 | # -- Options for LaTeX output --------------------------------------------- 228 | 229 | latex_elements = { 230 | # The paper size ('letterpaper' or 'a4paper'). 231 | #'papersize': 'letterpaper', 232 | 233 | # The font size ('10pt', '11pt' or '12pt'). 234 | #'pointsize': '10pt', 235 | 236 | # Additional stuff for the LaTeX preamble. 237 | #'preamble': '', 238 | 239 | # Latex figure (float) alignment 240 | #'figure_align': 'htbp', 241 | } 242 | 243 | # Grouping the document tree into LaTeX files. List of tuples 244 | # (source start file, target name, title, 245 | # author, documentclass [howto, manual, or own class]). 246 | latex_documents = [ 247 | (master_doc, 'pandas-validation.tex', 'pandas-validation Documentation', 248 | 'Markus Englund', 'manual'), 249 | ] 250 | 251 | # The name of an image file (relative to this directory) to place at the top of 252 | # the title page. 253 | #latex_logo = None 254 | 255 | # For "manual" documents, if this is true, then toplevel headings are parts, 256 | # not chapters. 
257 | #latex_use_parts = False 258 | 259 | # If true, show page references after internal links. 260 | #latex_show_pagerefs = False 261 | 262 | # If true, show URL addresses after external links. 263 | #latex_show_urls = False 264 | 265 | # Documents to append as an appendix to all manuals. 266 | #latex_appendices = [] 267 | 268 | # If false, no module index is generated. 269 | #latex_domain_indices = True 270 | 271 | 272 | # -- Options for manual page output --------------------------------------- 273 | 274 | # One entry per manual page. List of tuples 275 | # (source start file, name, description, authors, manual section). 276 | man_pages = [ 277 | (master_doc, 'pandas-validation', 'pandas-validation Documentation', 278 | [author], 1) 279 | ] 280 | 281 | # If true, show URL addresses after external links. 282 | #man_show_urls = False 283 | 284 | 285 | # -- Options for Texinfo output ------------------------------------------- 286 | 287 | # Grouping the document tree into Texinfo files. List of tuples 288 | # (source start file, target name, title, author, 289 | # dir menu entry, description, category) 290 | texinfo_documents = [ 291 | (master_doc, 'pandas-validation', 'pandas-validation Documentation', 292 | author, 'pandas-validation', 'Validation of data in pandas.', 293 | 'Miscellaneous'), 294 | ] 295 | 296 | # Documents to append as an appendix to all manuals. 297 | #texinfo_appendices = [] 298 | 299 | # If false, no module index is generated. 300 | #texinfo_domain_indices = True 301 | 302 | # How to display URL addresses: 'footnote', 'no', or 'inline'. 303 | #texinfo_show_urls = 'footnote' 304 | 305 | # If true, do not generate a @detailmenu in the "Top" node's menu. 306 | #texinfo_no_detailmenu = False 307 | 308 | 309 | # Example configuration for intersphinx: refer to the Python standard library. 
310 | intersphinx_mapping = { 311 | 'python': ('http://docs.python.org/', None), 312 | 'pandas': ('http://pandas.pydata.org/pandas-docs/stable/', None) 313 | } 314 | 315 | nitpick_ignore = [('py:class', 'Warning')] 316 | -------------------------------------------------------------------------------- /docs/source/index.rst: -------------------------------------------------------------------------------- 1 | .. pandas-validation documentation master file, created by 2 | sphinx-quickstart on Tue Jan 19 09:45:10 2016. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | Welcome to pandas-validation's documentation! 7 | ============================================= 8 | 9 | pandas-validation is a small Python package for casting and validating data 10 | handled with `pandas `_. 11 | 12 | .. toctree:: 13 | :caption: Contents 14 | :name: mastertoc 15 | :maxdepth: 2 16 | :glob: 17 | 18 | pandas-validation/installation 19 | pandas-validation/quickstart 20 | pandas-validation/api 21 | 22 | 23 | Indices and tables 24 | ================== 25 | 26 | * :ref:`genindex` 27 | * :ref:`modindex` 28 | * :ref:`search` 29 | -------------------------------------------------------------------------------- /docs/source/pandas-validation/api.rst: -------------------------------------------------------------------------------- 1 | .. py:currentmodule:: pandasvalidation 2 | 3 | .. _api: 4 | 5 | API Reference 6 | ============= 7 | 8 | This document describes the API of the :ref:`pandasvalidation ` module. 9 | 10 | .. automodule:: pandasvalidation 11 | :members: 12 | :undoc-members: 13 | :show-inheritance: 14 | -------------------------------------------------------------------------------- /docs/source/pandas-validation/installation.rst: -------------------------------------------------------------------------------- 1 | .. py:currentmodule:: pandasvalidation 2 | 3 | .. 
_installation: 4 | 5 | Installing pandas-validation 6 | ============================ 7 | 8 | For most users, the easiest way is probably to install the latest version 9 | hosted on `PyPI `_: 10 | 11 | .. code-block:: none 12 | 13 | $ pip install pandas-validation 14 | 15 | The project is hosted at https://github.com/jmenglund/pandas-validation and 16 | can also be installed using git: 17 | 18 | .. code-block:: none 19 | 20 | $ git clone https://github.com/jmenglund/pandas-validation.git 21 | $ cd pandas-validation 22 | $ python setup.py install 23 | 24 | .. tip:: 25 | You may consider installing pandas-validation and its required Python 26 | packages within a virtual environment in order to avoid cluttering your 27 | system's Python path. See for example the environment management system 28 | `conda `_ or the package 29 | `virtualenv `_. 30 | -------------------------------------------------------------------------------- /docs/source/pandas-validation/quickstart.rst: -------------------------------------------------------------------------------- 1 | .. py:currentmodule:: pandasvalidation 2 | 3 | .. _quickstart: 4 | 5 | Quickstart 6 | ========== 7 | 8 | This guide gives you a brief introduction on how to use 9 | pandas-validation. The library contains four core functions that let 10 | you validate values in a pandas Series (or a DataFrame column). The 11 | examples below will help you get started. If you want to know more, I suggest that you have a look at the :ref:`API reference`. 12 | 13 | * :ref:`validate-dates` 14 | * :ref:`validate-timestamps` 15 | * :ref:`validate-numbers` 16 | * :ref:`validate-strings` 17 | 18 | 19 | The code examples below assume that you first do the following imports: 20 | 21 | .. code-block:: pycon 22 | 23 | >>> import numpy as np 24 | >>> import pandas as pd 25 | >>> import pandasvalidation as pv 26 | 27 | 28 | .. 
_validate-dates: 29 | 30 | Validate dates 31 | -------------- 32 | 33 | Our first example shows how to validate a pandas Series with a few 34 | dates specified with Python's `datetime.date` data type. Values of 35 | other types are replaced with ``NaT`` ("not a time") prior to the 36 | validation. Warnings then inform the user if any of the values are 37 | invalid. If `return_type` is set to ``values``, a pandas Series will 38 | be returned with only the valid dates. 39 | 40 | 41 | .. code-block:: pycon 42 | 43 | >>> from datetime import date 44 | >>> s1 = pd.Series( 45 | ... [ 46 | ... date(2010, 10, 7), 47 | ... date(2018, 3, 15), 48 | ... date(2018, 3, 15), 49 | ... np.nan 50 | ... ], name='My dates') 51 | >>> pv.validate_date( 52 | ... s1, 53 | ... nullable=False, 54 | ... unique=True, 55 | ... min_date=date(2014, 1, 5), 56 | ... max_date=date(2015, 2, 15), 57 | ... return_type=None) 58 | ValidationWarning: 'My dates': NaT value(s); duplicates; date(s) too early; date(s) too late. 59 | 60 | 61 | .. _validate-timestamps: 62 | 63 | Validate timestamps 64 | ------------------- 65 | 66 | Validation of timestamps works in the same way as date validation. 67 | The major difference is that only values of type `pandas.Timestamp` 68 | are taken into account. Values of other types are replaced by ``NaT``. 69 | If `return_type` is set to ``values``, a pandas Series will 70 | be returned with only the valid timestamps. 71 | 72 | 73 | .. code-block:: pycon 74 | 75 | >>> s2 = pd.Series( 76 | ... [ 77 | ... pd.Timestamp(2018, 2, 7, 12, 31, 0), 78 | ... pd.Timestamp(2018, 2, 7, 13, 6, 0), 79 | ... pd.Timestamp(2018, 2, 7, 13, 6, 0), 80 | ... np.nan 81 | ... ], name='My timestamps') 82 | >>> pv.validate_timestamp( 83 | ... s2, 84 | ... nullable=False, 85 | ... unique=True, 86 | ... min_timestamp=pd.Timestamp(2014, 1, 5, 0, 0, 0), 87 | ... max_timestamp=pd.Timestamp(2018, 2, 7, 13, 0, 0), 88 | ... 
return_type=None) 89 | ValidationWarning: 'My timestamps': NaT value(s); duplicates; timestamp(s) too late. 90 | 91 | 92 | .. _validate-numbers: 93 | 94 | Validate numeric values 95 | ----------------------- 96 | 97 | Validation of numeric values (e.g. floats and integers) follows the 98 | same general principles as the validation of dates and timestamps. 99 | Non-numeric values are treated as ``NaN``, and warnings are issued to 100 | indicate invalid values to the user. If `return_type` is set to 101 | ``values``, a pandas Series will be returned with only the valid 102 | numeric values. 103 | 104 | .. note:: 105 | Prior to version 0.5.0, some non-numeric data types were 106 | automatically converted to numeric types before the validation. 107 | This was often convenient but could also lead to unexpected 108 | behaviour. The current implementation is cleaner and gives the 109 | user more control over the data types. 110 | 111 | .. code-block:: pycon 112 | 113 | >>> s3 = pd.Series( 114 | ... [1, 1, 2.3, np.nan], 115 | ... name='My numeric values') 116 | >>> pv.validate_numeric( 117 | ... s3, 118 | ... nullable=False, 119 | ... unique=True, 120 | ... integer=True, 121 | ... min_value=2, 122 | ... max_value=2, 123 | ... return_type=None) 124 | ValidationWarning: 'My numeric values': NaN value(s); duplicates; non-integer(s); value(s) too low; values(s) too high. 125 | 126 | 127 | .. _validate-strings: 128 | 129 | Validate strings 130 | ---------------- 131 | 132 | String validation works in the same way as the other validations, but 133 | concerns only strings. Values of other types, like numbers and 134 | timestamps, are simply replaced with ``NaN`` values before the 135 | validation takes place. If `return_type` is set to ``values``, a 136 | pandas Series will be returned with only the valid strings. 137 | 138 | .. note:: 139 | Prior to version 0.5.0, some non-string data types were 140 | automatically converted to strings before the validation. 
This 141 | was often convenient but could also lead to unexpected behaviour. 142 | The current implementation is cleaner and gives the user more 143 | control over the data types. 144 | 145 | 146 | .. code-block:: pycon 147 | 148 | >>> s4 = pd.Series( 149 | ... ['1', 'ab\n', 'Ab', 'AB', np.nan], 150 | ... name='My strings') 151 | >>> pv.validate_string( 152 | ... s4, 153 | ... nullable=False, 154 | ... unique=True, 155 | ... min_length=2, 156 | ... max_length=2, 157 | ... case='lower', 158 | ... newlines=False, 159 | ... whitespace=False, 160 | ... return_type=None) 161 | ValidationWarning: 'My strings': NaN value(s); string(s) too short; string(s) too long; wrong case letter(s); newline character(s); whitespace. 162 | -------------------------------------------------------------------------------- /pandasvalidation.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | """Module for validating data with the library pandas.""" 5 | 6 | import os 7 | import warnings 8 | import datetime 9 | 10 | import numpy 11 | import pandas 12 | 13 | 14 | __author__ = 'Markus Englund' 15 | __license__ = 'MIT' 16 | __version__ = '0.5.0' 17 | 18 | 19 | warnings.filterwarnings('default', category=DeprecationWarning) 20 | 21 | 22 | class ValidationWarning(Warning): 23 | pass 24 | 25 | 26 | def _datetime_to_string(series, format='%Y-%m-%d'): 27 | """ 28 | Convert datetime values in a pandas Series to strings. 29 | Other values are left as they are. 30 | 31 | Parameters 32 | ---------- 33 | series : pandas.Series 34 | Values to convert. 35 | format : str 36 | Format string for datetime type. Default: '%Y-%m-%d'. 
37 | 38 | Returns 39 | ------- 40 | converted : pandas.Series 41 | """ 42 | converted = series.copy() 43 | datetime_mask = series.apply(type).isin( 44 | [datetime.datetime, pandas.Timestamp]) 45 | if datetime_mask.any(): 46 | converted[datetime_mask] = ( 47 | series[datetime_mask].apply(lambda x: x.strftime(format))) 48 | return converted.where(datetime_mask, series) 49 | 50 | 51 | def _numeric_to_string(series, float_format='%g'): 52 | """ 53 | Convert numeric values in a pandas Series to strings. 54 | Other values are left as they are. 55 | 56 | Parameters 57 | ---------- 58 | series : pandas.Series 59 | Values to convert. 60 | float_format : str 61 | Format string for floating point number. Default: '%g'. 62 | 63 | Returns 64 | ------- 65 | converted : pandas.Series 66 | """ 67 | converted = series.copy() 68 | numeric_mask = ( 69 | series.apply(lambda x: numpy.issubdtype(type(x), numpy.number)) & 70 | series.notnull()) 71 | if numeric_mask.any(): 72 | converted[numeric_mask] = ( 73 | series[numeric_mask].apply(lambda x: float_format % x)) 74 | return converted.where(numeric_mask, series) 75 | 76 | 77 | def _get_error_messages(masks, error_info): 78 | """ 79 | Get list of error messages. 80 | 81 | Parameters 82 | ---------- 83 | masks : dict 84 | Dictionary of pandas.Series with masked errors, keyed by error type. 85 | error_info : dict 86 | Dictionary with error messages corresponding to different 87 | validation errors. 
88 | """ 89 | msg_list = [] 90 | for key, value in masks.items(): 91 | if value.any(): 92 | msg_list.append(error_info[key]) 93 | return msg_list 94 | 95 | 96 | def _get_return_object(masks, values, return_type): 97 | mask_frame = pandas.concat(masks, axis='columns') 98 | if return_type == 'mask_frame': 99 | return mask_frame 100 | elif return_type == 'mask_series': 101 | return mask_frame.any(axis=1) 102 | elif return_type == 'values': 103 | return values.where(~mask_frame.any(axis=1)) 104 | else: 105 | raise ValueError('Invalid return_type') 106 | 107 | 108 | def mask_nonconvertible( 109 | series, to_datatype, datetime_format=None, exact_date=True): 110 | """ 111 | Return a boolean same-sized object indicating whether values 112 | cannot be converted. 113 | 114 | Parameters 115 | ---------- 116 | series : pandas.Series 117 | Values to check. 118 | to_datatype : str 119 | Datatype to which values should be converted. Available values 120 | are 'numeric' and 'datetime'. 121 | datetime_format : str 122 | strftime to parse time, eg '%d/%m/%Y', note that '%f' will parse 123 | all the way up to nanoseconds. Optional. 124 | exact_date : bool 125 | - If True (default), require an exact format match. 126 | - If False, allow the format to match anywhere in the target string. 
127 | """ 128 | if to_datatype == 'numeric': 129 | converted = pandas.to_numeric(series, errors='coerce') 130 | elif to_datatype == 'datetime': 131 | converted = pandas.to_datetime( 132 | series, errors='coerce', format=datetime_format, exact=exact_date) 133 | else: 134 | raise ValueError( 135 | 'Invalid \'to_datatype\': {}' 136 | .format(to_datatype)) # pragma: no cover 137 | notnull = series.copy().notnull() 138 | mask = notnull & converted.isnull() 139 | return mask 140 | 141 | 142 | def to_datetime( 143 | arg, dayfirst=False, yearfirst=False, utc=None, box=True, 144 | format=None, exact=True, coerce=None, unit='ns', 145 | infer_datetime_format=False): 146 | """ 147 | Convert argument to datetime and set nonconvertible values to NaT. 148 | 149 | This function calls :func:`~pandas.to_datetime` with ``errors='coerce'`` 150 | and issues a warning if values cannot be converted. 151 | """ 152 | try: 153 | converted = pandas.to_datetime( 154 | arg, errors='raise', dayfirst=dayfirst, yearfirst=yearfirst, 155 | utc=utc, box=box, format=format, exact=exact) 156 | except ValueError: 157 | converted = pandas.to_datetime( 158 | arg, errors='coerce', dayfirst=dayfirst, yearfirst=yearfirst, 159 | utc=utc, box=box, format=format, exact=exact) 160 | if isinstance(arg, pandas.Series): 161 | warnings.warn( 162 | '{}: value(s) not converted to datetime set as NaT' 163 | .format(repr(arg.name)), ValidationWarning, stacklevel=2) 164 | else: # pragma: no cover 165 | warnings.warn( 166 | 'Value(s) not converted to datetime set as NaT', 167 | ValidationWarning, stacklevel=2) 168 | return converted 169 | 170 | 171 | def to_numeric(arg): 172 | """ 173 | Convert argument to numeric type and set nonconvertible values 174 | to NaN. 175 | 176 | This function calls :func:`~pandas.to_numeric` with ``errors='coerce'`` 177 | and issues a warning if values cannot be converted. 
178 | """ 179 | try: 180 | converted = pandas.to_numeric(arg, errors='raise') 181 | except ValueError: 182 | converted = pandas.to_numeric(arg, errors='coerce') 183 | if isinstance(arg, pandas.Series): 184 | warnings.warn( 185 | '{}: value(s) not converted to numeric set as NaN' 186 | .format(repr(arg.name)), ValidationWarning, stacklevel=2) 187 | else: # pragma: no cover 188 | warnings.warn( 189 | 'Value(s) not converted to numeric set as NaN', 190 | ValidationWarning, stacklevel=2) 191 | return converted 192 | 193 | 194 | def to_string(series, float_format='%g', datetime_format='%Y-%m-%d'): 195 | """ 196 | Convert values in a pandas Series to strings. 197 | 198 | Parameters 199 | ---------- 200 | series : pandas.Series 201 | Values to convert. 202 | float_format : str 203 | Format string for floating point number. Default: '%g'. 204 | datetime_format : str 205 | Format string for datetime type. Default: '%Y-%m-%d' 206 | 207 | Returns 208 | ------- 209 | converted : pandas.Series 210 | """ 211 | converted = _numeric_to_string(series, float_format) 212 | converted = _datetime_to_string(converted, format=datetime_format) 213 | converted = converted.astype(str) 214 | converted = converted.where(series.notnull(), numpy.nan) # missing as NaN 215 | return converted 216 | 217 | 218 | def validate_date( 219 | series, nullable=True, unique=False, min_date=None, 220 | max_date=None, return_type=None): 221 | """ 222 | Validate a pandas Series with values of type `datetime.date`. 223 | Values of a different data type will be replaced with NaN prior to 224 | the validation. 225 | 226 | Parameters 227 | ---------- 228 | series : pandas.Series 229 | Values to validate. 230 | nullable : bool 231 | If False, check for NaN values. Default: True. 232 | unique : bool 233 | If True, check that values are unique. Default: False 234 | min_date : datetime.date 235 | If defined, check for values before min_date. Optional.
236 | max_date : datetime.date 237 | If defined, check for values later than max_date. Optional. 238 | return_type : str 239 | Kind of data object to return; 'mask_series', 'mask_frame' 240 | or 'values'. Default: None. 241 | """ 242 | error_info = { 243 | 'invalid_type': 'Value(s) not of type datetime.date set as NaT', 244 | 'isnull': 'NaT value(s)', 245 | 'nonunique': 'duplicates', 246 | 'too_low': 'date(s) too early', 247 | 'too_high': 'date(s) too late'} 248 | 249 | is_date = series.apply(lambda x: isinstance(x, datetime.date)) 250 | masks = {} 251 | masks['invalid_type'] = ~is_date & series.notnull() 252 | to_validate = series.where(is_date) 253 | if nullable is not True: 254 | masks['isnull'] = to_validate.isnull() 255 | if unique: 256 | masks['nonunique'] = to_validate.duplicated() & to_validate.notnull() 257 | if min_date is not None: 258 | masks['too_low'] = to_validate.dropna() < min_date 259 | if max_date is not None: 260 | masks['too_high'] = to_validate.dropna() > max_date 261 | 262 | msg_list = _get_error_messages(masks, error_info) 263 | 264 | if len(msg_list) > 0: 265 | msg = repr(series.name) + ': ' + '; '.join(msg_list) + '.' 266 | warnings.warn(msg, ValidationWarning, stacklevel=2) 267 | 268 | if return_type is not None: 269 | return _get_return_object(masks, to_validate, return_type) 270 | 271 | 272 | def validate_timestamp( 273 | series, nullable=True, unique=False, min_timestamp=None, 274 | max_timestamp=None, return_type=None): 275 | """ 276 | Validate a pandas Series with values of type `pandas.Timestamp`. 277 | Values of a different data type will be replaced with NaT prior to 278 | the validation. 279 | 280 | Parameters 281 | ---------- 282 | series : pandas.Series 283 | Values to validate. 284 | nullable : bool 285 | If False, check for NaN values. Default: True. 286 | unique : bool 287 | If True, check that values are unique. Default: False 288 | min_timestamp : pandas.Timestamp 289 | If defined, check for values before min_timestamp.
Optional. 290 | max_timestamp : pandas.Timestamp 291 | If defined, check for value later than max_timestamp. Optional. 292 | return_type : str 293 | Kind of data object to return; 'mask_series', 'mask_frame' 294 | or 'values'. Default: None. 295 | """ 296 | error_info = { 297 | 'invalid_type': 'Value(s) not of type pandas.Timestamp set as NaT', 298 | 'isnull': 'NaT value(s)', 299 | 'nonunique': 'duplicates', 300 | 'too_low': 'timestamp(s) too early', 301 | 'too_high': 'timestamp(s) too late'} 302 | 303 | is_timestamp = series.apply(lambda x: isinstance(x, pandas.Timestamp)) 304 | masks = {} 305 | masks['invalid_type'] = ~is_timestamp & series.notnull() 306 | to_validate = pandas.to_datetime(series.where(is_timestamp, pandas.NaT)) 307 | if nullable is not True: 308 | masks['isnull'] = to_validate.isnull() 309 | if unique: 310 | masks['nonunique'] = to_validate.duplicated() & to_validate.notnull() 311 | if min_timestamp is not None: 312 | masks['too_low'] = to_validate.dropna() < min_timestamp 313 | if max_timestamp is not None: 314 | masks['too_high'] = to_validate.dropna() > max_timestamp 315 | 316 | msg_list = _get_error_messages(masks, error_info) 317 | 318 | if len(msg_list) > 0: 319 | msg = repr(series.name) + ': ' + '; '.join(msg_list) + '.' 320 | warnings.warn(msg, ValidationWarning, stacklevel=2) 321 | 322 | if return_type is not None: 323 | return _get_return_object(masks, to_validate, return_type) 324 | 325 | 326 | def validate_datetime( 327 | series, nullable=True, unique=False, min_datetime=None, 328 | max_datetime=None, return_type=None): 329 | """ 330 | Validate a pandas Series containing datetimes. 331 | 332 | .. deprecated:: 0.5.0 333 | `validate_datetime()` will be removed in version 0.7.0. 334 | Use `validate_date()` or `validate_timestamp()` instead. 335 | 336 | Parameters 337 | ---------- 338 | series : pandas.Series 339 | Values to validate. 340 | nullable : bool 341 | If False, check for NaN values. Default: True. 
342 | unique : bool 343 | If True, check that values are unique. Default: False 344 | min_datetime : str 345 | If defined, check for values before min_datetime. Optional. 346 | max_datetime : str 347 | If defined, check for value later than max_datetime. Optional. 348 | return_type : str 349 | Kind of data object to return; 'mask_series', 'mask_frame' 350 | or 'values'. Default: None. 351 | """ 352 | 353 | warnings.warn( 354 | 'validate_datetime() is deprecated, use validate_date() or ' 355 | 'validate_timestamp() instead.', DeprecationWarning) 356 | 357 | error_info = { 358 | 'nonconvertible': 'Value(s) not converted to datetime set as NaT', 359 | 'isnull': 'NaT value(s)', 360 | 'nonunique': 'duplicates', 361 | 'too_low': 'date(s) too early', 362 | 'too_high': 'date(s) too late'} 363 | 364 | if not series.dtype.type == numpy.datetime64: 365 | converted = pandas.to_datetime(series, errors='coerce') 366 | else: 367 | converted = series.copy() 368 | masks = {} 369 | masks['nonconvertible'] = series.notnull() & converted.isnull() 370 | if not nullable: 371 | masks['isnull'] = converted.isnull() 372 | if unique: 373 | masks['nonunique'] = converted.duplicated() & converted.notnull() 374 | if min_datetime is not None: 375 | masks['too_low'] = converted.dropna() < min_datetime 376 | if max_datetime is not None: 377 | masks['too_high'] = converted.dropna() > max_datetime 378 | 379 | msg_list = _get_error_messages(masks, error_info) 380 | 381 | if len(msg_list) > 0: 382 | msg = repr(series.name) + ': ' + '; '.join(msg_list) + '.' 383 | warnings.warn(msg, ValidationWarning, stacklevel=2) 384 | 385 | if return_type is not None: 386 | return _get_return_object(masks, converted, return_type) 387 | 388 | 389 | def validate_numeric( 390 | series, nullable=True, unique=False, integer=False, 391 | min_value=None, max_value=None, return_type=None): 392 | """ 393 | Validate a pandas Series containing numeric values. 
394 | 395 | Parameters 396 | ---------- 397 | series : pandas.Series 398 | Values to validate. 399 | nullable : bool 400 | If False, check for NaN values. Default: True 401 | unique : bool 402 | If True, check that values are unique. Default: False 403 | integer : bool 404 | If True, check that values are integers. Default: False 405 | min_value : int 406 | If defined, check for values below minimum. Optional. 407 | max_value : int 408 | If defined, check for values above maximum. Optional. 409 | return_type : str 410 | Kind of data object to return; 'mask_series', 'mask_frame' 411 | or 'values'. Default: None. 412 | """ 413 | error_info = { 414 | 'invalid_type': 'Non-numeric value(s) set as NaN', 415 | 'isnull': 'NaN value(s)', 416 | 'nonunique': 'duplicates', 417 | 'noninteger': 'non-integer(s)', 418 | 'too_low': 'value(s) too low', 419 | 'too_high': 'value(s) too high'} 420 | 421 | is_numeric = series.apply(pandas.api.types.is_number) 422 | 423 | masks = {} 424 | masks['invalid_type'] = ~is_numeric & series.notnull() 425 | 426 | to_validate = pandas.to_numeric(series.where(is_numeric)) 427 | if not nullable: 428 | masks['isnull'] = to_validate.isnull() 429 | if unique: 430 | masks['nonunique'] = to_validate.duplicated() & to_validate.notnull() 431 | if integer: 432 | noninteger_dropped = ( 433 | to_validate.dropna() != to_validate.dropna().apply(int)) 434 | masks['noninteger'] = pandas.Series(noninteger_dropped, series.index) 435 | if min_value is not None: 436 | masks['too_low'] = to_validate.dropna() < min_value 437 | if max_value is not None: 438 | masks['too_high'] = to_validate.dropna() > max_value 439 | 440 | msg_list = _get_error_messages(masks, error_info) 441 | 442 | if len(msg_list) > 0: 443 | msg = repr(series.name) + ': ' + '; '.join(msg_list) + '.'
444 | warnings.warn(msg, ValidationWarning, stacklevel=2) 445 | 446 | if return_type is not None: 447 | return _get_return_object(masks, to_validate, return_type) 448 | 449 | 450 | def validate_string( 451 | series, nullable=True, unique=False, 452 | min_length=None, max_length=None, case=None, newlines=True, 453 | trailing_whitespace=True, whitespace=True, matching_regex=None, 454 | non_matching_regex=None, whitelist=None, blacklist=None, 455 | return_type=None): 456 | """ 457 | Validate a pandas Series with strings. Non-string values 458 | will be replaced with NaN prior to validation. 459 | 460 | Parameters 461 | ---------- 462 | series : pandas.Series 463 | Values to validate. 464 | nullable : bool 465 | If False, check for NaN values. Default: True. 466 | unique : bool 467 | If True, check that values are unique. Default: False. 468 | min_length : int 469 | If defined, check for strings shorter than 470 | minimum length. Optional. 471 | max_length : int 472 | If defined, check for strings longer than 473 | maximum length. Optional. 474 | case : str 475 | Check for a character case constraint. Available values 476 | are 'lower', 'upper' and 'title'. Optional. 477 | newlines : bool 478 | If False, check for newline characters. Default: True. 479 | trailing_whitespace : bool 480 | If False, check for trailing whitespace. Default: True. 481 | whitespace : bool 482 | If False, check for whitespace. Default: True. 483 | matching_regex : str 484 | Check that strings match some regular expression. Optional. 485 | non_matching_regex : str 486 | Check that strings do not match some regular expression. Optional. 487 | whitelist : list 488 | Check that values are in `whitelist`. Optional. 489 | blacklist : list 490 | Check that values are not in `blacklist`. Optional. 491 | return_type : str 492 | Kind of data object to return; 'mask_series', 'mask_frame' 493 | or 'values'. Default: None.
494 | """ 495 | 496 | error_info = { 497 | 'invalid_type': 'Non-string value(s) set as NaN', 498 | 'isnull': 'NaN value(s)', 499 | 'nonunique': 'duplicates', 500 | 'too_short': 'string(s) too short', 501 | 'too_long': 'string(s) too long', 502 | 'wrong_case': 'wrong case letter(s)', 503 | 'newlines': 'newline character(s)', 504 | 'trailing_space': 'trailing whitespace', 505 | 'whitespace': 'whitespace', 506 | 'regex_mismatch': 'mismatch(es) for "matching regular expression"', 507 | 'regex_match': 'match(es) for "non-matching regular expression"', 508 | 'not_in_whitelist': 'string(s) not in whitelist', 509 | 'in_blacklist': 'string(s) in blacklist'} 510 | 511 | is_string = series.apply(lambda x: isinstance(x, str)) 512 | 513 | masks = {} 514 | masks['invalid_type'] = ~is_string & series.notnull() 515 | 516 | to_validate = series.where(is_string) 517 | 518 | if not nullable: 519 | masks['isnull'] = to_validate.isnull() 520 | if unique: 521 | masks['nonunique'] = to_validate.duplicated() & to_validate.notnull() 522 | if min_length is not None: 523 | too_short_dropped = to_validate.dropna().apply(len) < min_length 524 | masks['too_short'] = pandas.Series(too_short_dropped, series.index) 525 | if max_length is not None: 526 | too_long_dropped = to_validate.dropna().apply(len) > max_length 527 | masks['too_long'] = pandas.Series(too_long_dropped, series.index) 528 | if case: 529 | altered_case = getattr(to_validate.str, case)() 530 | wrong_case_dropped = ( 531 | altered_case.dropna() != to_validate[altered_case.notnull()]) 532 | masks['wrong_case'] = pandas.Series(wrong_case_dropped, series.index) 533 | if newlines is False: 534 | masks['newlines'] = to_validate.str.contains(os.linesep) 535 | if trailing_whitespace is False: 536 | masks['trailing_space'] = to_validate.str.contains( 537 | r'^\s|\s$', regex=True) 538 | if whitespace is False: 539 | masks['whitespace'] = to_validate.str.contains(r'\s', regex=True) 540 | if matching_regex: 541 | masks['regex_mismatch'] = ( 
542 | to_validate.str.contains(matching_regex, regex=True) 543 | .apply(lambda x: x is False) & to_validate.notnull()) 544 | if non_matching_regex: 545 | masks['regex_match'] = to_validate.str.contains( 546 | non_matching_regex, regex=True) 547 | if whitelist is not None: 548 | masks['not_in_whitelist'] = ( 549 | to_validate.notnull() & ~to_validate.isin(whitelist)) 550 | if blacklist is not None: 551 | masks['in_blacklist'] = to_validate.isin(blacklist) 552 | 553 | msg_list = _get_error_messages(masks, error_info) 554 | 555 | if len(msg_list) > 0: 556 | msg = repr(series.name) + ': ' + '; '.join(msg_list) + '.' 557 | warnings.warn(msg, ValidationWarning, stacklevel=2) 558 | 559 | if return_type is not None: 560 | return _get_return_object(masks, to_validate, return_type) 561 | -------------------------------------------------------------------------------- /release-checklist.rst: -------------------------------------------------------------------------------- 1 | Release checklist 2 | ================= 3 | 4 | Things to remember when making a new release of pandas-validation. 5 | 6 | #. Changes should be made to some branch other than master (a pull request 7 | should then be created before making the release). 8 | 9 | #. Make the desired changes to the code. 10 | 11 | #. Check coding style against some of the conventions in PEP8: 12 | 13 | .. code-block:: none 14 | 15 | $ pycodestyle *.py 16 | 17 | #. Run tests and report coverage: 18 | 19 | .. code-block:: none 20 | 21 | $ pytest -v test_pandasvalidation.py 22 | $ coverage run -m pytest test_pandasvalidation.py 23 | $ coverage report -m pandasvalidation.py 24 | 25 | #. Update ``README.rst`` and the documentation (in ``docs/``). 26 | 27 | .. code-block:: none 28 | 29 | $ sphinx-build -b html ./docs/source ./docs/_build/html 30 | 31 | #. Update ``CHANGELOG.md`` and add a release date. 32 | 33 | #. Update the release (version) number in ``setup.py`` and 34 | ``pandasvalidation.py``. Use `Semantic Versioning `_.
35 | 36 | #. Create pull request(s) with changes for the new release. 37 | 38 | #. Create distributions and upload the files to 39 | `PyPI `_ with 40 | `twine `_. 41 | 42 | .. code-block:: none 43 | 44 | $ python setup.py sdist bdist_wheel --universal 45 | $ twine upload dist/* 46 | 47 | #. Create the new release in GitHub. 48 | 49 | #. Trigger a new build (latest version) of the documentation on 50 | ``_. 51 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pandas>=0.22 2 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | from os.path import join, dirname 3 | 4 | 5 | setup( 6 | name='pandas-validation', 7 | version='0.5.0', 8 | description=( 9 | 'A Python package for validating data with pandas'), 10 | long_description=open( 11 | join(dirname(__file__), 'README.rst'), encoding='utf-8').read(), 12 | packages=find_packages(exclude=['docs', 'tests*']), 13 | py_modules=['pandasvalidation'], 14 | install_requires=['pandas>=0.22'], 15 | author='Markus Englund', 16 | author_email='jan.markus.englund@gmail.com', 17 | url='https://github.com/jmenglund/pandas-validation', 18 | license='MIT', 19 | classifiers=[ 20 | 'Development Status :: 5 - Production/Stable', 21 | 'Intended Audience :: Developers', 22 | 'License :: OSI Approved :: MIT License', 23 | 'Operating System :: OS Independent', 24 | 'Programming Language :: Python', 25 | 'Programming Language :: Python :: 3', 26 | 'Programming Language :: Python :: 3.5', 27 | 'Programming Language :: Python :: 3.6'], 28 | keywords=['pandas', 'validation'], 29 | ) 30 | -------------------------------------------------------------------------------- /test_pandasvalidation.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import datetime 5 | import warnings 6 | 7 | import pytest 8 | import numpy 9 | import pandas 10 | 11 | from pandas.util.testing import assert_series_equal, assert_frame_equal 12 | 13 | from pandasvalidation import ( 14 | ValidationWarning, 15 | _datetime_to_string, 16 | _numeric_to_string, 17 | _get_return_object, 18 | mask_nonconvertible, 19 | to_datetime, 20 | to_numeric, 21 | to_string, 22 | validate_datetime, 23 | validate_date, 24 | validate_timestamp, 25 | validate_numeric, 26 | validate_string) 27 | 28 | 29 | class TestReturnTypes(): 30 | 31 | strings = pandas.Series(['1', '1', 'ab\n', 'a b', 'Ab', 'AB', numpy.nan]) 32 | masks = [ 33 | pandas.Series([False, False, False, True, True, False, False]), 34 | pandas.Series([True, True, False, True, True, False, True])] 35 | 36 | def test_return_mask_series(self): 37 | assert_series_equal( 38 | _get_return_object(self.masks, self.strings, 'mask_series'), 39 | pandas.Series([True, True, False, True, True, False, True])) 40 | 41 | def test_return_mask_frame(self): 42 | assert_frame_equal( 43 | _get_return_object(self.masks, self.strings, 'mask_frame'), 44 | pandas.concat(self.masks, axis='columns')) 45 | 46 | def test_return_values(self): 47 | assert_series_equal( 48 | _get_return_object(self.masks, self.strings, 'values'), 49 | pandas.Series([ 50 | numpy.nan, numpy.nan, 'ab\n', numpy.nan, 51 | numpy.nan, 'AB', numpy.nan])) 52 | 53 | def test_wrong_return_type(self): 54 | with pytest.raises(ValueError): 55 | _get_return_object(self.masks, self.strings, 'wrong return type') 56 | 57 | 58 | class TestMaskNonconvertible(): 59 | 60 | mixed = pandas.Series([ 61 | 1, 2.3, numpy.nan, 'abc', pandas.datetime(2014, 1, 7), '2014']) 62 | 63 | inconvertible_numeric = pandas.Series( 64 | [False, False, False, True, True, False]) 65 | 66 | inconvertible_exact_dates = pandas.Series( 67 | 
[True, True, False, True, True, False]) 68 | 69 | inconvertible_inexact_dates = pandas.Series( 70 | [True, True, False, True, False, False]) 71 | 72 | def test_numeric(self): 73 | assert_series_equal( 74 | mask_nonconvertible(self.mixed, 'numeric'), 75 | self.inconvertible_numeric) 76 | 77 | def test_datetime_exact_date(self): 78 | assert_series_equal( 79 | mask_nonconvertible( 80 | self.mixed, 'datetime', datetime_format='%Y', exact_date=True), 81 | self.inconvertible_exact_dates) 82 | 83 | assert_series_equal( 84 | mask_nonconvertible( 85 | self.mixed, 'datetime', datetime_format='%Y', 86 | exact_date=False), self.inconvertible_inexact_dates) 87 | 88 | 89 | class TestToDatetime(): 90 | 91 | mixed = pandas.Series([ 92 | 1, 2.3, numpy.nan, 93 | 'abc', pandas.datetime(2014, 1, 7), '2014']) 94 | 95 | def test_exact(self): 96 | assert ( 97 | to_datetime(self.mixed, format='%Y', exact=True).tolist() == [ 98 | pandas.NaT, pandas.NaT, pandas.NaT, pandas.NaT, 99 | pandas.NaT, pandas.Timestamp('2014-01-01 00:00:00')]) 100 | assert ( 101 | to_datetime(self.mixed, format='%Y', exact=False).tolist() == [ 102 | pandas.NaT, pandas.NaT, pandas.NaT, pandas.NaT, 103 | pandas.Timestamp('2014-01-01 00:00:00'), 104 | pandas.Timestamp('2014-01-01 00:00:00')]) 105 | 106 | 107 | class TestToNumeric(): 108 | 109 | mixed = pandas.Series([ 110 | 1, 2.3, numpy.nan, 'abc', pandas.datetime(2014, 1, 7), '2014']) 111 | 112 | def test_conversion(self): 113 | assert ( 114 | to_numeric(self.mixed).sum() == 2017.3) 115 | 116 | pytest.warns(ValidationWarning, to_numeric, self.mixed) 117 | 118 | 119 | class TestToString(): 120 | 121 | mixed = pandas.Series( 122 | [1, 2.3, numpy.nan, 'abc', pandas.datetime(2014, 1, 7)]) 123 | 124 | numeric_as_strings = pandas.Series( 125 | ['1', '2.3', numpy.nan, 'abc', pandas.datetime(2014, 1, 7)]) 126 | 127 | datetimes_as_strings = pandas.Series( 128 | [1, 2.3, numpy.nan, 'abc', '2014-01-07']) 129 | 130 | all_values_as_strings = pandas.Series( 131 | ['1', '2.3', 
numpy.nan, 'abc', '2014-01-07']) 132 | 133 | def test_numeric_to_string(self): 134 | assert_series_equal( 135 | _numeric_to_string(self.mixed), self.numeric_as_strings) 136 | 137 | def test_datetime_to_string(self): 138 | assert_series_equal( 139 | _datetime_to_string(self.mixed, format='%Y-%m-%d'), 140 | self.datetimes_as_strings) 141 | 142 | def test_to_string(self): 143 | assert_series_equal( 144 | to_string( 145 | self.mixed, float_format='%g', datetime_format='%Y-%m-%d'), 146 | self.all_values_as_strings) 147 | 148 | 149 | class TestValidateDatetime(): 150 | 151 | dates_as_strings = pandas.Series([ 152 | '2014-01-07', '2014-01-07', '2014-02-28', numpy.nan]) 153 | 154 | dates = pandas.Series([ 155 | datetime.datetime(2014, 1, 7), datetime.datetime(2014, 1, 7), 156 | datetime.datetime(2014, 2, 28), numpy.nan]) 157 | 158 | def test_validation(self): 159 | 160 | assert_series_equal( 161 | validate_datetime(self.dates_as_strings, return_type='values'), 162 | validate_datetime(self.dates, return_type='values')) 163 | 164 | pytest.warns( 165 | ValidationWarning, validate_datetime, self.dates, nullable=False) 166 | 167 | pytest.warns( 168 | ValidationWarning, validate_datetime, self.dates, unique=True) 169 | 170 | pytest.warns( 171 | ValidationWarning, validate_datetime, self.dates, 172 | min_datetime='2014-01-08') 173 | 174 | pytest.warns( 175 | ValidationWarning, validate_datetime, self.dates, 176 | max_datetime='2014-01-08') 177 | 178 | 179 | class TestValidateDate(): 180 | 181 | dates = pandas.Series([ 182 | datetime.datetime(2014, 1, 7), 183 | datetime.datetime(2014, 1, 7), 184 | datetime.datetime(2014, 2, 28), 185 | pandas.NaT]) 186 | 187 | def test_validation(self): 188 | 189 | assert_series_equal( 190 | validate_date(self.dates, return_type='values'), 191 | self.dates) 192 | 193 | pytest.warns( 194 | ValidationWarning, validate_date, self.dates, nullable=False) 195 | 196 | pytest.warns( 197 | ValidationWarning, validate_date, self.dates, unique=True) 198 | 199 
| pytest.warns( 200 | ValidationWarning, validate_date, self.dates, 201 | min_date=datetime.date(2014, 1, 8)) 202 | 203 | pytest.warns( 204 | ValidationWarning, validate_date, self.dates, 205 | max_date=datetime.date(2014, 1, 8)) 206 | 207 | 208 | class TestValidateTimestamp(): 209 | 210 | timestamps = pandas.Series([ 211 | pandas.Timestamp(2014, 1, 7, 12, 0, 5), 212 | pandas.Timestamp(2014, 1, 7, 12, 0, 5), 213 | pandas.Timestamp(2014, 2, 28, 0, 0, 0), 214 | pandas.NaT]) 215 | 216 | def test_validation(self): 217 | 218 | assert_series_equal( 219 | validate_timestamp(self.timestamps, return_type='values'), 220 | self.timestamps) 221 | 222 | pytest.warns( 223 | ValidationWarning, validate_timestamp, self.timestamps, 224 | nullable=False) 225 | 226 | pytest.warns( 227 | ValidationWarning, validate_timestamp, self.timestamps, 228 | unique=True) 229 | 230 | pytest.warns( 231 | ValidationWarning, validate_timestamp, self.timestamps, 232 | min_timestamp=pandas.Timestamp(2014, 1, 8)) 233 | 234 | pytest.warns( 235 | ValidationWarning, validate_timestamp, self.timestamps, 236 | max_timestamp=pandas.Timestamp(2014, 1, 8)) 237 | 238 | 239 | class TestValidateNumber(): 240 | 241 | numeric_with_string = pandas.Series([-1, -1, 2.3, '1']) 242 | numeric = pandas.Series([-1, -1, 2.3, numpy.nan]) 243 | 244 | def test_validation(self): 245 | 246 | assert_series_equal( 247 | validate_numeric(self.numeric_with_string, return_type='values'), 248 | self.numeric) 249 | 250 | pytest.warns( 251 | ValidationWarning, validate_numeric, self.numeric, nullable=False) 252 | 253 | pytest.warns( 254 | ValidationWarning, validate_numeric, self.numeric, unique=True) 255 | 256 | pytest.warns( 257 | ValidationWarning, validate_numeric, self.numeric, integer=True) 258 | 259 | pytest.warns( 260 | ValidationWarning, validate_numeric, self.numeric, min_value=0) 261 | 262 | pytest.warns( 263 | ValidationWarning, validate_numeric, self.numeric, max_value=0) 264 | 265 | 266 | class TestValidateString(): 267 | 
268 | mixed = pandas.Series(['ab\n', 'a b', 'Ab', 'Ab', 'AB', 1, numpy.nan]) 269 | strings = pandas.Series( 270 | ['ab\n', 'a b', 'Ab', 'Ab', 'AB', numpy.nan, numpy.nan]) 271 | 272 | def test_validation(self): 273 | 274 | assert_series_equal( 275 | validate_string(self.mixed, return_type='values'), 276 | self.strings) 277 | 278 | pytest.warns( 279 | ValidationWarning, validate_string, self.strings, nullable=False) 280 | 281 | pytest.warns( 282 | ValidationWarning, validate_string, self.strings, unique=True) 283 | 284 | pytest.warns( 285 | ValidationWarning, validate_string, self.strings, min_length=3) 286 | 287 | pytest.warns( 288 | ValidationWarning, validate_string, self.strings, max_length=2) 289 | 290 | pytest.warns( 291 | ValidationWarning, validate_string, self.strings[3:], case='lower') 292 | 293 | pytest.warns( 294 | ValidationWarning, validate_string, self.strings[3:], case='upper') 295 | 296 | pytest.warns( 297 | ValidationWarning, validate_string, self.strings[3:], case='title') 298 | 299 | pytest.warns( 300 | ValidationWarning, validate_string, self.strings, newlines=False) 301 | 302 | pytest.warns( 303 | ValidationWarning, validate_string, self.strings, 304 | trailing_whitespace=False) 305 | 306 | pytest.warns( 307 | ValidationWarning, validate_string, self.strings, whitespace=False) 308 | 309 | pytest.warns( 310 | ValidationWarning, validate_string, self.strings, 311 | matching_regex=r'\d') 312 | 313 | pytest.warns( 314 | ValidationWarning, validate_string, self.strings, 315 | non_matching_regex=r'[\d\s\w]') 316 | 317 | pytest.warns( 318 | ValidationWarning, validate_string, self.strings, 319 | whitelist=self.strings[:4]) 320 | 321 | pytest.warns( 322 | ValidationWarning, validate_string, self.strings, 323 | blacklist=['a', 'Ab']) 324 | --------------------------------------------------------------------------------
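The ``validate_*`` functions in ``pandasvalidation.py`` share one pattern: build a dictionary of boolean masks, collect the messages for every mask that flags at least one value, and issue a single ``ValidationWarning`` with ``stacklevel=2``. The sketch below illustrates that pattern with plain Python lists instead of ``pandas.Series``, so it runs without pandas. It is not the library's implementation: ``collect_messages`` and ``check_lengths`` are hypothetical names introduced only for this example, and the ``ValidationWarning`` class is redefined locally.

```python
import warnings


class ValidationWarning(Warning):
    """Local stand-in for the warning class in pandasvalidation.py."""


def collect_messages(masks, error_info):
    """Return the messages whose mask flags at least one value.

    Hypothetical helper; mirrors the idea of _get_error_messages()
    but works on plain lists of booleans.
    """
    return [error_info[key] for key, mask in masks.items() if any(mask)]


def check_lengths(values, name, min_length=None, max_length=None):
    """Hypothetical validator: warn about strings outside length limits."""
    error_info = {
        'too_short': 'string(s) too short',
        'too_long': 'string(s) too long'}
    masks = {}
    if min_length is not None:
        masks['too_short'] = [len(v) < min_length for v in values]
    if max_length is not None:
        masks['too_long'] = [len(v) > max_length for v in values]
    messages = collect_messages(masks, error_info)
    if messages:
        msg = repr(name) + ': ' + '; '.join(messages) + '.'
        # stacklevel=2 attributes the warning to the caller's line.
        warnings.warn(msg, ValidationWarning, stacklevel=2)
    return masks


# Record the warning instead of printing it to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    result = check_lengths(['1', 'ab', 'abc'], 'My strings', 2, 2)

print(caught[0].message)
```

Returning the masks alongside the warning loosely corresponds to ``return_type='mask_frame'`` in the real functions; the warning summarizes which checks failed, while the masks show where.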