├── LICENSE
├── README.md
└── images
    └── target_architecture.PNG


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2018 mhlabs
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Overview
 2 | DataHem is a serverless, real-time, end-to-end ML pipeline built entirely on Google Cloud Platform services: App Engine, PubSub, Dataflow, BigQuery and Cloud ML.
 3 | 
 4 | # Benefits
 5 | When building ML/Data products, your most valuable asset is your data. Hence, the purpose of DataHem is to give you:
 6 | 
 7 | 1. full control and ownership of your data
 8 | 2. unsampled data
 9 | 3. data in real time
10 | 4. the ability to replay/reprocess your data unlimited times
11 | 5. data synergies -> collect once and use for multiple purposes (reporting, analytics and building data/ML products)
12 | 6. low cost of operations and maintenance
13 | 7. scalability
14 | 8. data as a stream and at rest
15 | 9. activation of data
16 | 10. ability to delete data on a row by row basis
17 | 
18 | # Target architecture
19 | 
20 | ![Target architecture](https://github.com/mhlabs/datahem/raw/master/images/target_architecture.PNG)
21 | 
22 | # Use cases
23 | 
24 | ## 1. Digital Analytics
25 | The first use case is to leverage your implementation of Google Analytics / Measurement Protocol. Google Analytics is awesome, but it has some limitations worth addressing in order to take reporting, analytics and machine learning to the next level. By adding a custom task to your Google Analytics tracker, DataHem eliminates many of the limitations of both the free and the premium version of Google Analytics and gives you:
26 | - Unsampled data
27 | - Real-time data
28 | - Unlimited custom dimensions and metrics
29 | - Unlimited data volume
30 | - Enriched data as a stream
31 | - Unlimited reprocessing of data
32 | - No licensing fees (open source)
33 | 
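The custom task mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in datahem.tracker: it assumes the analytics.js `customTask` hook, and the names `makeCollectorTask`, `TrackerModel`, `Sender` and the collector URL are hypothetical. The idea is to keep the regular hit to Google Analytics and forward a copy of the same hit payload to the collector.

```typescript
// Minimal shape of the analytics.js tracker model (only what this sketch uses).
interface TrackerModel {
  get(field: string): unknown;
  set(field: string, value: unknown): void;
}

type HitTask = (model: TrackerModel) => void;

// Hypothetical transport, injected so the sketch does not depend on the browser.
type Sender = (url: string, payload: string) => void;

// Builds a customTask that wraps the tracker's sendHitTask: the original hit
// is still sent to Google Analytics, and the same payload is also forwarded
// to the collector endpoint.
function makeCollectorTask(collectorUrl: string, send: Sender): HitTask {
  return (model: TrackerModel): void => {
    const originalSendHitTask = model.get("sendHitTask") as HitTask;
    model.set("sendHitTask", (sendModel: TrackerModel): void => {
      originalSendHitTask(sendModel); // regular hit to Google Analytics
      const payload = String(sendModel.get("hitPayload"));
      send(collectorUrl, payload); // copy of the hit to the collector
    });
  };
}

// Browser usage (the endpoint is a placeholder, not a real DataHem URL):
// ga("set", "customTask", makeCollectorTask(
//   "https://collector.example.com/collect",
//   (url, body) => { navigator.sendBeacon(url, body); }
// ));
```

Injecting the transport keeps the sketch independent of any particular browser API; in practice `navigator.sendBeacon` or an `XMLHttpRequest` could play that role.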
34 | # License
35 | DataHem is licensed under [AGPL 3.0 or later](https://opensource.org/licenses/AGPL-3.0)
36 | 
37 | # DataHem ecosystem
38 | The architecture of DataHem consists of loosely coupled parts, so that individual parts can be replaced or extended in the future.
39 | 
40 | * **[tracker](https://github.com/mhlabs/datahem.tracker):** Send data to the collector, currently supporting the Google Analytics JavaScript tracker
41 | * **[collector](https://github.com/mhlabs/datahem.collector):** Collect data sent from trackers and publish the data on PubSub, currently running on Google App Engine Standard (Java)
42 | * **[processor](https://github.com/mhlabs/datahem.processor):** Process bounded and unbounded data and write to PubSub and BigQuery, currently using Google Dataflow (Apache Beam) and supporting processing of Google Analytics hits and AWS Kinesis events
43 | * **[serializer](https://github.com/mhlabs/datahem.serializer):** Serialize structured data, currently using protocol buffers
44 | * **[infrastructor](https://github.com/mhlabs/datahem.infrastructor):** Infrastructure as code to easily set up the APIs and services required, currently using Google Deployment Manager
45 | * **predictor** (backlog) predictions made on streaming data
46 | * **pseudonymizor** (backlog) pseudonymizing personal and/or sensitive data
47 | * **ruler** (backlog) processing rules for personal data
48 | * **activator** (backlog) serving predictions via REST/gRPC
49 | * **orchestrator** (backlog) workflow management DAGs using Google Cloud Composer
50 | 
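To make the hand-off between the parts more concrete, the sketch below shows the kind of transformation that sits between the collector and BigQuery: turning a raw Measurement Protocol payload into a structured record. It is an illustration only. The actual processor runs on Dataflow (Apache Beam), and the `Hit` type and `parseHit` function are hypothetical names; only the Measurement Protocol parameters `v`, `t` and `cid` are taken from Google's specification.

```typescript
// Hypothetical structured record for one Measurement Protocol hit.
interface Hit {
  version: string;  // "v" parameter
  type: string;     // "t" parameter, e.g. "pageview"
  clientId: string; // "cid" parameter
  params: Record<string, string>; // every decoded parameter
}

// Parses a URL-encoded hit payload such as "v=1&t=pageview&cid=555".
function parseHit(payload: string): Hit {
  const params: Record<string, string> = {};
  for (const pair of payload.split("&")) {
    if (pair === "") continue; // skip empty segments like "a=1&&b=2"
    const eq = pair.indexOf("=");
    const rawKey = eq === -1 ? pair : pair.slice(0, eq);
    const rawValue = eq === -1 ? "" : pair.slice(eq + 1);
    // "+" encodes a space in form-encoded payloads.
    params[decodeURIComponent(rawKey)] = decodeURIComponent(rawValue.replace(/\+/g, " "));
  }
  return {
    version: params["v"] ?? "",
    type: params["t"] ?? "",
    clientId: params["cid"] ?? "",
    params,
  };
}

// Example: a pageview hit for the path "/home".
const example = parseHit("v=1&t=pageview&cid=555&dp=%2Fhome");
console.log(example.type, example.params["dp"]);
```

A structured record like this is what a serializer (protocol buffers, in DataHem's case) would encode before the data is written to PubSub or BigQuery.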
51 | # Setup
52 | [Follow the instructions in the wiki to set up the various parts of DataHem](https://github.com/mhlabs/datahem/wiki/Setup)
53 | 
54 | # Background
55 | DataHem was started in June 2017 by [robertsahlin](https://github.com/robertsahlin) / [ML-engineer](https://github.com/ML-engineer). It was open sourced, officially brought under MatHem's mhlabs GitHub account, and announced in May 2018.
56 | 
57 | The name DataHem is a play on words that echoes MatHem, the Swedish online grocery store where DataHem is developed. "Data" = "data". "Hem" = the Swedish word for "home".
58 | 


--------------------------------------------------------------------------------
/images/target_architecture.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/mhlabs/datahem/8b4857a5b347f52e899dcf207facec0e8186a04c/images/target_architecture.PNG


--------------------------------------------------------------------------------