├── DataOpsSoftware2019.md ├── README.md └── books ├── DataKitchen_dataops_cookbook.pdf ├── Importance of metadata in data warehousing.pdf └── managing-data-in-motion.pdf /DataOpsSoftware2019.md: -------------------------------------------------------------------------------- 1 | ## Data Pipeline Orchestration 2 | 3 | - [Airflow](https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8) 4 | an open-source platform to programmatically author, schedule and monitor data pipelines. 5 | - [Apache Oozie](http://oozie.apache.org/) 6 | an open-source workflow scheduler system to manage Apache Hadoop jobs. 7 | - [DBT (Data Build Tool)](https://www.getdbt.com/) 8 | is a command line tool that enables data analysts and engineers to transform data in their warehouse more effectively. 9 | - [BMC Control-M](http://www.bmc.com/it-solutions/control-m.html) 10 | a digital business automation solution that simplifies and automates diverse batch application workloads. 11 | - [DataKitchen](https://www.datakitchen.io/) 12 | a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics. 13 | - [Reflow](https://github.com/grailbio/reflow) 14 | Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs. 15 | - [ElementL](https://github.com/elementl) 16 | A current stealth company founded by ex-facebook director and graphQL co-creator Nick Schrock. Dagster Open Source. 17 | - [Astronomer.io](https://www.astronomer.io/) 18 | Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows. 
19 | - [Piperr.io](http://piperr.io/) 20 | Use Piperr’s pre-built data pipelines across enterprise stakeholders: From IT to Analytics, From Tech, Data Science to LoBs. 21 | - [Prefect Technologies](https://www.prefect.io/) 22 | Open-source data engineering platform that builds, tests, and runs data workflows. 23 | - [Genie](https://netflix.github.io/genie/) 24 | Distributed Big Data Orchestration Service by Netflix 25 | 26 | ## Testing and Production Quality 27 | - [ICEDQ](https://icedq.com/) 28 | software used to automate the testing of ETL/Data Warehouse and Data Migration. 29 | - [Naveego](http://www.naveego.com/) 30 | A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management. 31 | - [DataKitchen](https://www.datakitchen.io/) 32 | a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data. 33 | - [FirstEigen](http://firsteigen.com/) 34 | Automatic Data Quality Rule Discovery and Continuous Data Monitoring 35 | - [Great Expectations](https://github.com/great-expectations/great_expectations) 36 | Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compiling or deploy time). 37 | - [Enterprise Data Foundation](https://enterprise-data.org/) 38 | Open-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment. 39 | 40 | ## Deployment Automation and Development Sandbox Creation 41 | - [Jenkins](https://jenkins-ci.org/) 42 | a ‘CI/CD’ tool used by software development teams to deploy code from development into production 43 | - [DataKitchen](https://www.datakitchen.io/) 44 | a DataOps Platform that supports the deployment of all data analytics code and configuration. 
45 | - [Amaterasu](http://shinto.io/index.html) 46 | is a deployment tool for data pipelines. Amaterasu allows developers to write and easily deploy data pipelines, and clusters manage their configuration and dependencies. 47 | - [Meltano](https://about.gitlab.com/2018/08/01/hey-data-teams-we-are-working-on-a-tool-just-for-you/) 48 | aims to be a complete solution for data teams. The name stands for model, extract, load, transform, analyze, notebook, orchestrate: in other words, the data science lifecycle. 49 | 50 | ## Data Science Model Deployment 51 | - [Domino](https://www.dominodatalab.com/) 52 | accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility. 53 | - [Hydrosphere.io](https://hydrosphere.io/) 54 | deploys batch Spark functions and machine-learning models, and assures the quality of end-to-end pipelines. 55 | - [Open Data Group](https://www.opendatagroup.com/) 56 | a software solution that facilitates the deployment of analytics using models. 57 | - [ParallelM](http://www.parallelm.com/) 58 | moves machine learning into production, automates orchestration, and manages the ML pipeline. 59 | - [Seldon](https://www.seldon.io/) 60 | streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment. 61 | - [Metis Machine](https://metismachine.com/) 62 | Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications. 63 | - [Datatron](http://www.datatron.com/) 64 | Automate deployment and monitoring of AI Models. 65 | - [DSFlow](http://dsflow.io/) Go from data extraction to business value in days, not months. Build on top of open source tech, using Silicon Valley’s best practices. 66 | - [DataMo-Datmo](https://datmo.com/) 67 | tools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way. 
68 | - [MLflow](https://www.mlflow.org/) 69 | An open source platform for the complete machine learning lifecycle, originally developed by Databricks. 70 | - [Studio.ML](https://www.studio.ml/) 71 | Studio is a model management framework written in Python to help simplify and expedite your model building experience. 72 | - [Comet.ML](https://www.comet.ml/) 73 | Comet.ml allows data science teams and individuals to automagically track their datasets, code changes, experimentation history and production models, creating efficiency, transparency, and reproducibility. 74 | - [Polyaxon](https://polyaxon.com/) 75 | An open source platform for reproducible machine learning at scale. 76 | - [Missinglink.ai](https://missinglink.ai/) 77 | MissingLink helps data engineers streamline and automate the entire deep learning lifecycle. 78 | - [Kubeflow](https://www.kubeflow.org/) 79 | The Machine Learning Toolkit for Kubernetes 80 | - [Verta.ai](https://www.verta.ai/) 81 | Models are the new code! 82 | 83 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # My Awesome Data Ops Resources [![Awesome](https://awesome.re/badge.svg)](https://awesome.re) 2 | 3 | > A curated list of data operations resources, focused on use by cultural heritage organizations. 4 | 5 | 6 | ## Books 7 | 8 | - [The DataOps Cookbook](https://www.datakitchen.io/dataops-cookbook-main.html) A 135-page book that describes the step-by-step implementation of DataOps. 
9 | 10 | 11 | 12 | 13 | ## Papers and Blogs 14 | ### ETL 15 | - [Managing Data in Motion](https://www.progress.com/docs/default-source/default-document-library/Progress/Documents/book-club/Managing-Data-in-Motion.p) 16 | 17 | 18 | ### Data Quality 19 | - [A Deep Dive Into Data Quality](https://towardsdatascience.com/a-deep-dive-into-data-quality-c1d1ee576046) 20 | 21 | ### Metadata 22 | - [Importance of Metadata in Data Warehousing](http://sdsu-dspace.calstate.edu/bitstream/handle/10211.10/2354/Dhiman_Abhinav.pdf;sequence=1) 23 | 24 | ### Pipeline Engineering 25 | - [Smart pipelining — reactive approach to computation scheduling](https://medium.com/casumotech/smart-pipelining-reactive-approach-to-computation-scheduling-5a7e39658df5) 26 | 27 | 28 | 29 | ## Data Ops Software 30 | 31 | ### Data Pipeline Orchestration 32 | 33 | - [Airflow](https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8) 34 | an open-source platform to programmatically author, schedule and monitor data pipelines. 35 | - [Apache Oozie](http://oozie.apache.org/) 36 | an open-source workflow scheduler system to manage Apache Hadoop jobs. 37 | - [DBT (Data Build Tool)](https://www.getdbt.com/) 38 | is a command line tool that enables data analysts and engineers to transform data in their warehouse more effectively. 39 | - [BMC Control-M](http://www.bmc.com/it-solutions/control-m.html) 40 | a digital business automation solution that simplifies and automates diverse batch application workloads. 41 | - [DataKitchen](https://www.datakitchen.io/) 42 | a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics. 43 | - [Reflow](https://github.com/grailbio/reflow) 44 | Reflow is a system for incremental data processing in the cloud. 
Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs. 45 | - [ElementL](https://github.com/elementl) 46 | A current stealth company founded by ex-facebook director and graphQL co-creator Nick Schrock. Dagster Open Source. 47 | - [Astronomer.io](https://www.astronomer.io/) 48 | Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows. 49 | - [Piperr.io](http://piperr.io/) 50 | Use Piperr’s pre-built data pipelines across enterprise stakeholders: From IT to Analytics, From Tech, Data Science to LoBs. 51 | - [Prefect Technologies](https://www.prefect.io/) 52 | Open-source data engineering platform that builds, tests, and runs data workflows. 53 | - [Genie](https://netflix.github.io/genie/) 54 | Distributed Big Data Orchestration Service by Netflix 55 | 56 | ### Testing and Production Quality 57 | - [ICEDQ](https://icedq.com/) 58 | software used to automate the testing of ETL/Data Warehouse and Data Migration. 59 | - [Naveego](http://www.naveego.com/) 60 | A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management. 61 | - [DataKitchen](https://www.datakitchen.io/) 62 | a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data. 63 | - [FirstEigen](http://firsteigen.com/) 64 | Automatic Data Quality Rule Discovery and Continuous Data Monitoring 65 | - [Great Expectations](https://github.com/great-expectations/great_expectations) 66 | Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compiling or deploy time). 
67 | - [Enterprise Data Foundation](https://enterprise-data.org/) 68 | Open-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment. 69 | 70 | ### Deployment Automation and Development Sandbox Creation 71 | - [Jenkins](https://jenkins-ci.org/) 72 | a ‘CI/CD’ tool used by software development teams to deploy code from development into production 73 | - [DataKitchen](https://www.datakitchen.io/) 74 | a DataOps Platform that supports the deployment of all data analytics code and configuration. 75 | - [Amaterasu](http://shinto.io/index.html) 76 | is a deployment tool for data pipelines. Amaterasu allows developers to write and easily deploy data pipelines, and clusters manage their configuration and dependencies. 77 | - [Meltano](https://about.gitlab.com/2018/08/01/hey-data-teams-we-are-working-on-a-tool-just-for-you/) 78 | aims to be a complete solution for data teams — the name stands for model, extract, load, transform, analyze, notebook, orchestrate — in other words, the data science lifecycle. 79 | 80 | ### Data Science Model Deployment 81 | - [Domino](https://www.dominodatalab.com/) 82 | accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility. 83 | - [Hydrosphere.io](https://hydrosphere.io/) 84 | deploys batch Spark functions, machine-learning models, and assures the quality of end-to-end pipelines. 85 | - [Open Data Group](https://www.opendatagroup.com/) 86 | a software solution that facilitates the deployment of analytics using models. 87 | - [ParallelM](http://www.parallelm.com/) 88 | moves machine learning into production, automates orchestration, and manages the ML pipeline. 89 | - [Seldon](https://www.seldon.io/) 90 | streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment. 
91 | - [Metis Machine](https://metismachine.com/) 92 | Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications. 93 | - [Datatron](http://www.datatron.com/) 94 | Automate deployment and monitoring of AI Models. 95 | - [DSFlow](http://dsflow.io/) Go from data extraction to business value in days, not months. Build on top of open source tech, using Silicon Valley’s best practices. 96 | - [DataMo-Datmo](https://datmo.com/) 97 | tools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way. 98 | - [MLflow](https://www.mlflow.org/) 99 | An open source platform for the complete machine learning lifecycle, originally developed by Databricks. 100 | - [Studio.ML](https://www.studio.ml/) 101 | Studio is a model management framework written in Python to help simplify and expedite your model building experience. 102 | - [Comet.ML](https://www.comet.ml/) 103 | Comet.ml allows data science teams and individuals to automagically track their datasets, code changes, experimentation history and production models, creating efficiency, transparency, and reproducibility. 104 | - [Polyaxon](https://polyaxon.com/) 105 | An open source platform for reproducible machine learning at scale. 106 | - [Missinglink.ai](https://missinglink.ai/) 107 | MissingLink helps data engineers streamline and automate the entire deep learning lifecycle. 108 | - [Kubeflow](https://www.kubeflow.org/) 109 | The Machine Learning Toolkit for Kubernetes 110 | - [Verta.ai](https://www.verta.ai/) 111 | Models are the new code! 
112 | 113 | 114 | 115 | 116 | ## License 117 | 118 | [![CC0](http://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](http://creativecommons.org/publicdomain/zero/1.0) 119 | -------------------------------------------------------------------------------- /books/DataKitchen_dataops_cookbook.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen1649chenli/dataOpsResource/7c36f251243475b89cf5546beed64946543daf67/books/DataKitchen_dataops_cookbook.pdf -------------------------------------------------------------------------------- /books/Importance of metadata in data warehousing.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen1649chenli/dataOpsResource/7c36f251243475b89cf5546beed64946543daf67/books/Importance of metadata in data warehousing.pdf -------------------------------------------------------------------------------- /books/managing-data-in-motion.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chen1649chenli/dataOpsResource/7c36f251243475b89cf5546beed64946543daf67/books/managing-data-in-motion.pdf --------------------------------------------------------------------------------