└── README.md


/README.md:
--------------------------------------------------------------------------------
 1 | # DataQuality
 2 | Tutorials and examples of data quality in big data systems.
 3 | 
 4 | Data Quality metrics:
 5 | - completeness
 6 |   - commission
 7 |   - omission
 8 | - thematic accuracy
 9 |   - thematic classification correctness
10 |   - non-quantitative attribute correctness
11 |   - quantitative attribute accuracy
12 | - logical consistency
13 |   - conceptual consistency
14 |   - domain consistency
15 |   - format consistency
16 |   - topological consistency
17 | - temporal quality
18 |   - accuracy of a time measurement
19 |   - temporal consistency
20 |   - temporal validity
21 | - positional accuracy
22 |   - absolute external positional accuracy
23 |   - relative internal positional accuracy
24 |   - gridded data positional accuracy
25 | - usability
26 | 
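As a rough illustration of the completeness metric above, the sketch below compares a dataset's record IDs against a reference set. It assumes that omission means records expected but missing, and commission means records present but not expected. The function name `completeness`, the input lists, and the choice of denominators for the two rates are all hypothetical and are not taken from any tool listed in this README.

```python
def completeness(dataset_ids, reference_ids):
    """Compare a dataset's record IDs against a reference set (hypothetical inputs)."""
    dataset = set(dataset_ids)
    reference = set(reference_ids)

    omitted = reference - dataset    # expected records that are missing
    committed = dataset - reference  # records present that are not expected

    return {
        # Share of expected records that are missing from the dataset.
        "omission_rate": len(omitted) / len(reference) if reference else 0.0,
        # Share of dataset records that are not in the reference.
        "commission_rate": len(committed) / len(dataset) if dataset else 0.0,
        "omitted": sorted(omitted),
        "committed": sorted(committed),
    }


# Example: two expected records are missing, one unexpected record is present.
report = completeness(["a", "b", "c", "x"], ["a", "b", "c", "d", "e"])
print(report)
```

Using sets keeps the comparison independent of record order and ignores duplicate IDs. Duplicates would need a separate check if they matter for a given dataset.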
27 | 
28 | 
29 | Your contributions are always welcome!
30 | 
31 | ## [Data Quality](#data-quality)
32 |    * [Griffin](https://github.com/eBay/griffin) - A data quality solution for distributed data systems at any scale, in both streaming and batch contexts. It measures accuracy, completeness, validity, and timeliness, and supports anomaly detection and data profiling. (Recommended)
33 |    * [drunken-data-quality](https://github.com/FRosner/drunken-data-quality) - Provides data quality reports using Spark and Elasticsearch, Logstash, and Kibana (ELK). [Demo](https://github.com/FRosner/ddq-demo-elk)
34 |    * [DataQuality for BigData](https://github.com/agile-lab-dev/DataQuality) - A framework for building parallel and distributed quality checks in big data environments. It can be used to calculate metrics and perform checks to assure quality on structured or unstructured data. It relies entirely on Spark.
35 |    * [TopNotch](https://github.com/blackrock/TopNotch) - A system for quality controlling large-scale data sets. It addresses three problems: how to define and measure data quality, how to efficiently ensure data quality across many data sets, and how to institutionalize existing knowledge of data sets.
36 |    * [Phasor Data Quality Tracker](https://github.com/GridProtectionAlliance/pdqtracker) - PDQ Tracker, administered by the Grid Protection Alliance (GPA), is a high-performance, real-time data processing engine designed to raise alarms, track states, store statistics, and generate reports on both the availability and accuracy of streaming synchrophasor data. [doc](http://www.gridprotectionalliance.org/docs/products/PDQTracker/highlevelrequirements.pdf)
37 |    * [DataCleaner](https://github.com/datacleaner/DataCleaner) - The premier open source Data Quality solution. [Documentation](https://datacleaner.org/resources/docs/5.1/pdf/datacleaner-reference.pdf)
38 |    * [data-quality](https://github.com/Talend/data-quality) - Talend Open Studio for Data Quality can be downloaded from the Talend website.
39 | 


--------------------------------------------------------------------------------