├── LICENSE
├── README.md
└── images
    └── target_architecture.PNG

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2018 mhlabs

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Overview
DataHem is a serverless real-time end-to-end ML pipeline built entirely on Google Cloud Platform services - App Engine, PubSub, Dataflow, BigQuery and Cloud ML.

# Benefits
When building ML/Data products, your most valuable asset is your data. Hence, the purpose of DataHem is to give you:

1. full control and ownership of your data
2. unsampled data
3. data in real time
4. the ability to replay/reprocess your data an unlimited number of times
5. data synergies: collect once and use for multiple purposes (reporting, analytics and building data/ML products)
6. low cost of operations and maintenance
7. scalability
8. data as a stream and at rest
9. activation of data
10. the ability to delete data on a row-by-row basis

# Target architecture

![Target architecture](https://github.com/mhlabs/datahem/raw/master/images/target_architecture.PNG)

# Use cases

## 1. Digital Analytics
The first use case is to leverage your implementation of Google Analytics / Measurement Protocol. Google Analytics is awesome, but it has some limitations worth addressing in order to take reporting, analytics and machine learning to the next level. By adding a custom task to your Google Analytics tracker, DataHem eliminates many of the limitations of both the free and the premium version of Google Analytics and gives you:
- Unsampled data
- Real-time data
- Unlimited custom dimensions and metrics
- Unlimited data volume
- Enriched data as a stream
- Unlimited reprocessing of data
- No licensing fees (open source)

# License
DataHem is licensed under [AGPL 3.0 or later](https://opensource.org/licenses/AGPL-3.0)

# DataHem ecosystem
The architecture of DataHem consists of loosely coupled parts, so that individual parts can be replaced or extended in the future.

* **[tracker](https://github.com/mhlabs/datahem.tracker):** Send data to the collector, currently supporting the Google Analytics JavaScript tracker
* **[collector](https://github.com/mhlabs/datahem.collector):** Collect data sent from trackers and publish the data on PubSub, currently running on Google App Engine Standard (Java)
* **[processor](https://github.com/mhlabs/datahem.processor):** Process bounded and unbounded data and write to PubSub and BigQuery, currently using Google Dataflow (Apache Beam) and supporting processing of Google Analytics hits and AWS Kinesis events
* **[serializer](https://github.com/mhlabs/datahem.serializer):** Serialize structured data, currently using protocol buffers
* **[infrastructor](https://github.com/mhlabs/datahem.infrastructor):** Infrastructure as code to easily set up the required APIs and services, currently using Google Deployment Manager
* **predictor** (backlog) predictions made on streaming data
* **pseudonymizor** (backlog) pseudonymizing personal and/or sensitive data
* **ruler** (backlog) processing rules for personal data
* **activator** (backlog) serving predictions via REST/gRPC
* **orchestrator** (backlog) workflow management DAGs using Google Cloud Composer

# Setup
[Follow the instructions in the wiki to set up the various parts of DataHem](https://github.com/mhlabs/datahem/wiki/Setup)

# Background
DataHem was started in June 2017 by [robertsahlin](https://github.com/robertsahlin) / [ML-engineer](https://github.com/ML-engineer). It was open sourced, officially brought under MatHem's mhlabs GitHub account and announced in May 2018.

The name DataHem is a play on words to resemble MatHem, the Swedish online grocery store where DataHem is developed. "Data" = "data". "Hem" = the Swedish word for "Home".
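
As a rough illustration of the tracker-to-collector step mentioned under Digital Analytics ("adding a custom task to your Google Analytics tracker"), the sketch below wraps the analytics.js `sendHitTask` so that each hit is still sent to Google Analytics and its payload is also forwarded elsewhere. This is not the code from the tracker repository. `customTask`, `sendHitTask` and `hitPayload` are analytics.js field names, while `makeCustomTask`, `Forward`, `GaModel` and the collector URL are hypothetical names chosen for this example.

```typescript
// Minimal subset of the analytics.js model interface that tasks receive.
interface GaModel {
  get(field: string): unknown;
  set(field: string, value: unknown): void;
}

// Transport used to forward the payload. It is injected so the sketch can
// also run outside a browser (hypothetical helper type).
type Forward = (url: string, payload: string) => void;

// Hypothetical collector endpoint: replace with the URL of your own collector.
const COLLECTOR_URL = "https://collector.example.com/collect";

// Builds a customTask that replaces sendHitTask with a wrapper. The wrapper
// first runs the original task (hit goes to Google Analytics as usual) and
// then forwards the same hit payload to the collector.
function makeCustomTask(collectorUrl: string, forward: Forward) {
  return (model: GaModel): void => {
    const originalSendHitTask = model.get("sendHitTask") as (m: GaModel) => void;
    model.set("sendHitTask", (sendModel: GaModel): void => {
      originalSendHitTask(sendModel);
      const payload = String(sendModel.get("hitPayload"));
      forward(collectorUrl, payload);
    });
  };
}

// In a browser with analytics.js loaded, registration could look like:
//   ga("set", "customTask", makeCustomTask(COLLECTOR_URL,
//     (url, body) => { navigator.sendBeacon(url, body); }));
```

Injecting the transport keeps the sketch independent of any particular browser API. The actual request format expected by the collector is defined by the collector repository and is not assumed here.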

--------------------------------------------------------------------------------
/images/target_architecture.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mhlabs/datahem/8b4857a5b347f52e899dcf207facec0e8186a04c/images/target_architecture.PNG
--------------------------------------------------------------------------------