└── README.md

/README.md:
--------------------------------------------------------------------------------
# DataQuality
Tutorials and examples of data quality in big data systems.

Data Quality metrics:
- completeness
  - commission
  - omission
- thematic accuracy
  - thematic classification correctness
  - non-quantitative attribute correctness
  - quantitative attribute accuracy
- logical consistency
  - conceptual consistency
  - domain consistency
  - format consistency
  - topological consistency
- temporal quality
  - accuracy of a time measurement
  - temporal consistency
  - temporal validity
- positional accuracy
  - absolute external positional accuracy
  - relative internal positional accuracy
  - gridded data positional accuracy
- usability

Your contributions are always welcome!

## [Data Quality](#data-quality)
* [Griffin](https://github.com/eBay/griffin) - A data quality solution for distributed data systems at any scale, in both streaming and batch contexts. It covers accuracy, completeness, validity, timeliness, anomaly detection, and data profiling. (Recommended)
* [drunken-data-quality](https://github.com/FRosner/drunken-data-quality) - Provides data quality reports using Spark, Elasticsearch, Logstash, and Kibana (ELK); see the [demo](https://github.com/FRosner/ddq-demo-elk).
* [DataQuality for BigData](https://github.com/agile-lab-dev/DataQuality) - A framework for building parallel and distributed quality checks in big data environments. It can be used to calculate metrics and perform checks to assure quality on structured or unstructured data. It relies entirely on Spark.
* [TopNotch](https://github.com/blackrock/TopNotch) - A system for quality controlling large-scale data sets. It addresses three problems: how to define and measure data quality, how to efficiently ensure data quality across many data sets, and how to institutionalize existing knowledge of data sets.
* [Phasor Data Quality Tracker](https://github.com/GridProtectionAlliance/pdqtracker) - The PDQ Tracker, administered by the Grid Protection Alliance (GPA), is a high-performance, real-time data processing engine designed to raise alarms, track states, store statistics, and generate reports on both the availability and accuracy of streaming synchrophasor data. [doc](http://www.gridprotectionalliance.org/docs/products/PDQTracker/highlevelrequirements.pdf)
* [DataCleaner](https://github.com/datacleaner/DataCleaner) - The premier open source Data Quality solution. [Documentation](https://datacleaner.org/resources/docs/5.1/pdf/datacleaner-reference.pdf)
* [data-quality](https://github.com/Talend/data-quality) - Talend Open Studio for Data Quality can be downloaded from the Talend website.
--------------------------------------------------------------------------------