├── .gitignore
├── LICENSE
├── README.md
├── index.html
├── requirements.txt
├── sample
    ├── DeduplicationSampleCodes.md
    ├── IntegrityConstraintSampleCodes.md
    └── OutlierSampleCodes.md
├── sparkclean
    ├── __init__.py
    ├── df_deduplicator.py
    ├── df_ic.py
    ├── df_outliers.py
    ├── df_transformer.py
    └── utilities.py
└── test
    ├── __init__.py
    └── load.py


/.gitignore:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/.gitignore


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/LICENSE


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/README.md


--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/index.html


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
textdistance
unidecode
tqdm


--------------------------------------------------------------------------------
/sample/DeduplicationSampleCodes.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sample/DeduplicationSampleCodes.md


--------------------------------------------------------------------------------
/sample/IntegrityConstraintSampleCodes.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sample/IntegrityConstraintSampleCodes.md


--------------------------------------------------------------------------------
/sample/OutlierSampleCodes.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sample/OutlierSampleCodes.md


--------------------------------------------------------------------------------
/sparkclean/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sparkclean/__init__.py


--------------------------------------------------------------------------------
/sparkclean/df_deduplicator.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sparkclean/df_deduplicator.py


--------------------------------------------------------------------------------
/sparkclean/df_ic.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sparkclean/df_ic.py


--------------------------------------------------------------------------------
/sparkclean/df_outliers.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sparkclean/df_outliers.py


--------------------------------------------------------------------------------
/sparkclean/df_transformer.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sparkclean/df_transformer.py


--------------------------------------------------------------------------------
/sparkclean/utilities.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/sparkclean/utilities.py


--------------------------------------------------------------------------------
/test/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/test/__init__.py


--------------------------------------------------------------------------------
/test/load.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/NYUBigDataProject/SparkClean/HEAD/test/load.py


--------------------------------------------------------------------------------