├── .gitignore ├── README.md ├── dbt ├── .gitignore ├── requirments.txt └── stackoverflowsurvey │ ├── .gitignore │ ├── .user.yml │ ├── README.md │ ├── analyses │ └── .gitkeep │ ├── dbt_project.yml │ ├── macros │ └── .gitkeep │ ├── models │ ├── schema.yml │ ├── source.yml │ └── survey_results.sql │ ├── profiles.yml │ ├── seeds │ └── .gitkeep │ ├── snapshots │ └── .gitkeep │ └── tests │ └── .gitkeep ├── duckdb └── README.md ├── images ├── architecture.png ├── superset_dashboard.png ├── superset_duckdb_connection.png └── superset_duckdb_connection_advanced_config.png └── superset ├── docker-compose.yml └── docker ├── .env ├── README.md ├── docker-bootstrap.sh ├── docker-ci.sh ├── docker-entrypoint-initdb.d └── examples-init.sh ├── docker-frontend.sh ├── docker-init.sh ├── frontend-mem-nag.sh ├── pythonpath_dev ├── .gitignore ├── superset_config.py └── superset_config_local.example ├── requirements-local.txt └── run-server.sh /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | data -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Building a robust yet simple data analytics platform with DuckDB, dbt, Iceberg, and Superset 2 | Modern analytics platforms require robust data storage, transformation, and management tools. DuckDB provides a simple, high-performance, columnar analytical database. DBT simplifies data transformation and modeling, and Iceberg offers scalable data lake management capabilities. Combining these tools can create a powerful and flexible analytics platform. 3 | 4 | ![architecture.png](images%2Farchitecture.png) 5 | 6 | # Understanding the tools 7 | ## DuckDB 8 | DuckDB is an in-memory, columnar analytical database that stands out for its speed, efficiency, and compatibility with SQL standard. 
Here is a more in-depth look at its features:
9 | - **High-performance Analytics**: DuckDB is optimized for analytical queries, making it an ideal choice for data warehousing and analytics workloads. Its in-memory storage and columnar data layout significantly boost query performance.
10 | - **SQL Compatibility**: DuckDB supports SQL, making it accessible to analysts and data professionals who are already familiar with SQL syntax. This compatibility allows you to leverage your existing SQL knowledge and tools.
11 | - **Integration with BI Tools**: DuckDB integrates seamlessly with popular business intelligence (BI) tools like Tableau, Power BI, and Looker. This compatibility ensures that you can visualize and report on your data effectively.
12 | 
13 | ## dbt
14 | dbt, which stands for Data Build Tool, is a command-line tool that revolutionizes the way data transformations and modeling are done. Here's a deeper dive into dbt's capabilities:
15 | - **Modular Data Transformations**: dbt uses SQL and YAML files to define data transformations and models. This modular approach allows you to break down complex transformations into smaller, more manageable pieces, enhancing maintainability and version control.
16 | - **Data Testing**: dbt facilitates data testing by allowing you to define expectations about your data. It helps ensure data quality by automatically running tests against your transformed data.
17 | - **Version Control**: dbt projects can be version controlled with tools like Git, enabling collaboration among data professionals while keeping a history of changes.
18 | - **Incremental Builds**: dbt supports incremental builds, meaning it only processes data that has changed since the last run. This feature saves time and resources when working with large datasets.
19 | - **Orchestration**: While dbt focuses on data transformations and modeling, it can be integrated with orchestration tools like Apache Airflow or dbt Cloud to create automated data pipelines.
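As a small illustration of how modular models and incremental builds look in practice, here is a generic sketch of an incremental dbt model. It is not part of this project's code: the `app.events` source and its columns are hypothetical.

```sql
-- models/events_incremental.sql: a sketch of an incremental dbt model
{{ config(materialized='incremental', unique_key='event_id') }}

SELECT event_id, event_type, created_at
FROM {{ source('app', 'events') }}
{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than what is already built
  WHERE created_at > (SELECT max(created_at) FROM {{ this }})
{% endif %}
```

On the first run dbt builds the full table; on subsequent runs only the filtered rows are processed and merged on the `unique_key`.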
20 | 
21 | ## Iceberg
22 | Iceberg is a table format designed for managing data lakes, offering several key features to ensure data quality and scalability:
23 | - **Schema Evolution**: One of Iceberg's standout features is its support for schema evolution. You can add, delete, or modify columns in your datasets without breaking existing queries or data integrity. This makes it suitable for rapidly evolving data lakes.
24 | - **ACID Transactions**: Iceberg provides ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring data consistency and reliability in multi-user and multi-write environments.
25 | - **Time-Travel Capabilities**: Iceberg allows you to query historical versions of your data, making it possible to recover from data errors or analyze changes over time.
26 | - **Optimized File Storage**: Iceberg optimizes file storage by using techniques like metadata management, partitioning, and file pruning. This results in efficient data storage and retrieval.
27 | - **Connectivity**: Iceberg supports various storage connectors, including Apache Hadoop HDFS, Amazon S3, and Azure Data Lake Storage, making it versatile and compatible with different data lake platforms.
28 | 
29 | > NOTE: *Iceberg is not currently utilized in this showcase, but it will be added soon.*
30 | ## Apache Superset
31 | Apache Superset is a modern, open-source BI tool that enables data exploration, visualization, and interactive dashboards. It connects to various data sources and is designed to empower users to explore data and create dynamic reports.
32 | - **Data Visualization**: Apache Superset allows users to create interactive visualizations, including charts, graphs, and geographic maps, to explore and understand data.
33 | - **Dashboard Creation**: Users can build dynamic dashboards by combining multiple visualizations and applying filters for real-time data exploration.
34 | - **Connectivity**: Apache Superset can connect to various data sources, including SQL databases, data lakes, and cloud storage, making it adaptable to diverse data ecosystems.
35 | - **Security**: It offers robust security features, including role-based access control and integration with authentication providers, ensuring data is accessed securely.
36 | - **Community and Extensibility**: As an open-source project, Apache Superset benefits from a vibrant community that contributes plugins, connectors, and additional features, enhancing its capabilities.
37 | - **SQL Support**: Superset supports SQL queries, allowing users to execute custom queries and create complex calculated fields.
38 | 
39 | # Setting up DuckDB, dbt, and Superset with Docker Compose
40 | ## Setting up DuckDB
41 | DuckDB will be installed as a library with dbt and Superset in the next section.
42 | 
43 | ## Setting up dbt
44 | First, we need to install the *dbt-core* and *dbt-duckdb* libraries, then initialize a dbt project.
45 | ```bash
46 | # create a virtual environment
47 | cd dbt
48 | python -m venv .env
49 | source .env/bin/activate
50 | 
51 | # install libraries: dbt-core and dbt-duckdb
52 | pip install -r requirements.txt
53 | 
54 | # check version
55 | dbt --version
56 | ```
57 | 
58 | Then we initialize a dbt project named *stackoverflowsurvey* (via `dbt init stackoverflowsurvey`) and create a *profiles.yml* inside it with the following content:
59 | ```yaml
60 | stackoverflow:
61 |   target: dev
62 |   outputs:
63 |     dev:
64 |       type: duckdb
65 |       path: '/data/duckdb/stackoverflow.duckdb' # path to the local DuckDB database file
66 | ```
67 | 
68 | Run the following command to verify the configuration; once the checks pass, the models can be built with `dbt run --profiles-dir .`:
69 | ```bash
70 | # We must specify the directory of the 'profiles.yml' file since we are not using the default location.
71 | dbt debug --profiles-dir .
72 | ```
73 | 
74 | ## Setting up Superset
75 | Run the following commands to set up the Superset services. Note that the compose file expects an external Docker network named *osmds_internal*, so create it first with `docker network create osmds_internal` if it does not already exist:
76 | ```bash
77 | cd superset
78 | # run docker compose to start the Superset services;
79 | # the libraries declared in the 'requirements-local.txt' file (including duckdb-engine) will also be installed
80 | docker-compose up --detach
81 | ```
82 | 
83 | Visit *http://localhost:8088* to access the Superset UI and enter **admin** as both the username and password. Choose **DuckDB** from the supported databases drop-down, then set up a connection to the DuckDB database.
84 | 
85 | 
86 | ![superset_duckdb_connection.png](images%2Fsuperset_duckdb_connection.png)
87 | 
88 | ![superset_duckdb_connection_advanced_config.png](images%2Fsuperset_duckdb_connection_advanced_config.png)
89 | 
90 | 
91 | 
92 |
93 | 
94 | > **NOTE**: Provide the path to a DuckDB database file on disk in the URL, e.g., *duckdb:////Users/whoever/path/to/duck.db*.
95 | 
96 | We combine the DuckDB volume mapping exposed in the *superset/docker-compose.yml* file
97 | ```yaml
98 | x-superset-volumes:
99 |   &superset-volumes
100 |   - /data/duckdb:/app/duckdb
101 | ```
102 | with the DuckDB database path defined in *dbt/stackoverflowsurvey/profiles.yml*.
103 | ```yaml
104 | path: '/data/duckdb/stackoverflow.duckdb'
105 | ```
106 | This gives us the final URI to establish a connection between Superset and DuckDB:
107 | ```
108 | duckdb:///duckdb/stackoverflow.duckdb
109 | ```
110 | 
111 | In Superset, the engine should be configured to open DuckDB in "read-only" mode. Otherwise, only one connection can use the database file at a time (simultaneous queries will cause lock errors), and the Superset dashboard cannot be refreshed while the dbt pipeline is running.
112 | 
113 | # Loading source
114 | In this showcase, we are using the [Stack Overflow Annual Developer Survey](https://insights.stackoverflow.com/survey) dataset. To simplify matters, we will focus solely on the [2023](https://cdn.stackoverflow.co/files/jo7n4k8s/production/49915bfd46d0902c3564fd9a06b509d08a20488c.zip/stack-overflow-developer-survey-2023.zip) dataset, which needs to be manually downloaded and extracted into the *PROJECT_HOME/data* directory.
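Before building the models, it is worth double-checking the connection string derived in the previous section. The mapping can be sketched in a few lines of shell; the values are taken from this project's compose file and dbt profile, and it assumes relative `duckdb:///` paths resolve against the container's */app* working directory, which matches the URI shown above.

```shell
# Host path where dbt writes the database (from profiles.yml)
HOST_PATH="/data/duckdb/stackoverflow.duckdb"
# The compose volume mounts /data/duckdb at /app/duckdb inside the container
CONTAINER_PATH="/app/duckdb/$(basename "$HOST_PATH")"
# Superset resolves duckdb:/// paths relative to its /app working directory
SUPERSET_URI="duckdb:///${CONTAINER_PATH#/app/}"
echo "$SUPERSET_URI"   # duckdb:///duckdb/stackoverflow.duckdb
```

If the volume mapping or the profile path changes, the same two substitutions give the new URI to paste into the Superset connection dialog.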
115 | 
116 | # Building models with dbt
117 | ## Defining data source
118 | We declare the data source in the *stackoverflowsurvey/models/source.yml* file with the following content:
119 | ```yaml
120 | sources:
121 |   - name: stackoverflow_survey_source
122 |     tables:
123 |       - name: surveys
124 |         meta:
125 |           external_location: "read_csv('../../data/survey_results_public.csv', AUTO_DETECT=TRUE)" # automatically parse and detect the schema
126 |           formatter: oldstyle
127 | ```
128 | ## Building models
129 | For demonstration purposes only, we have created a very simple model with the following content:
130 | ```sql
131 | {{ config(materialized='table') }}
132 | 
133 | SELECT *
134 | FROM {{ source('stackoverflow_survey_source', 'surveys') }}
135 | ```
136 | # Connecting Superset
137 | Once the dbt models are built, the data visualization can begin. Log in to Superset with the admin account mentioned above to start building charts and dashboards.
138 | 
139 | ![superset_dashboard.png](images%2Fsuperset_dashboard.png)
140 | 
141 | # Conclusion
142 | In this comprehensive guide, we've demonstrated how to construct a sophisticated analytics platform that leverages the combined power of DuckDB, dbt, Iceberg, and Apache Superset. This platform empowers organizations to seamlessly ingest, transform, manage, visualize, and analyze data to extract actionable insights.
143 | Key Components:
144 | - **DuckDB**: Our high-performance, SQL-compatible, in-memory database serves as the foundation for efficient data storage and retrieval, enabling lightning-fast analytical queries.
145 | - **dbt**: dbt simplifies data transformation and modeling, allowing for the creation of modular, version-controlled data pipelines that enhance data quality and maintainability.
146 | - **Iceberg**: Iceberg manages data lakes with ease, offering schema evolution, ACID transactions, and time-travel capabilities, ensuring data integrity and scalability in large-scale analytics environments.
147 | - **Apache Superset**: Apache Superset enhances the platform by providing a modern, open-source BI tool for data exploration, visualization, and interactive dashboard creation. Its connectivity options, security features, and SQL support empower users to gain insights from data with ease. 148 | 149 | Together, these tools create a powerful and flexible analytics platform, enabling organizations to navigate the data landscape with confidence, derive valuable insights, and make informed decisions. Whether you're dealing with structured or unstructured data, this platform equips you with the tools needed to turn raw data into actionable intelligence, driving business success and innovation. 150 | 151 | ## Supporting Links 152 | * Stack Overflow Annual Developer Survey 153 | * Modern Data Stack in a Box with DuckDB 154 | * dbt adapter for DuckDB 155 | 156 | 157 | 158 | 159 | 160 | 161 | -------------------------------------------------------------------------------- /dbt/.gitignore: -------------------------------------------------------------------------------- 1 | logs 2 | .env -------------------------------------------------------------------------------- /dbt/requirments.txt: -------------------------------------------------------------------------------- 1 | dbt-core==1.6.0 2 | dbt-duckdb==1.6.0 -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/.gitignore: -------------------------------------------------------------------------------- 1 | 2 | target/ 3 | dbt_packages/ 4 | logs/ 5 | stackoverflow.* 6 | -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/.user.yml: -------------------------------------------------------------------------------- 1 | id: 1ea6d26b-1f7f-44a4-897d-43d71b80b184 2 | -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/README.md: 
-------------------------------------------------------------------------------- 1 | Welcome to your new dbt project! 2 | 3 | ### Using the starter project 4 | 5 | Try running the following commands: 6 | - dbt run 7 | - dbt test 8 | 9 | 10 | ### Resources: 11 | - Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction) 12 | - Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers 13 | - Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support 14 | - Find [dbt events](https://events.getdbt.com) near you 15 | - Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices 16 | -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/analyses/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/analyses/.gitkeep -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/dbt_project.yml: -------------------------------------------------------------------------------- 1 | 2 | # Name your project! Project names should contain only lowercase characters 3 | # and underscores. A good package name should reflect your organization's 4 | # name or the intended use of these models 5 | name: 'stackoverflow' 6 | version: '1.0.0' 7 | config-version: 2 8 | 9 | # This setting configures which "profile" dbt uses for this project. 10 | profile: 'stackoverflow' 11 | 12 | # These configurations specify where dbt should look for different types of files. 13 | # The `model-paths` config, for example, states that models in this project can be 14 | # found in the "models/" directory. You probably won't need to change these! 
15 | model-paths: ["models"] 16 | analysis-paths: ["analyses"] 17 | test-paths: ["tests"] 18 | seed-paths: ["seeds"] 19 | macro-paths: ["macros"] 20 | snapshot-paths: ["snapshots"] 21 | 22 | clean-targets: # directories to be removed by `dbt clean` 23 | - "target" 24 | - "dbt_packages" 25 | 26 | 27 | # Configuring models 28 | # Full documentation: https://docs.getdbt.com/docs/configuring-models 29 | 30 | # In this example config, we tell dbt to build all models in the example/ 31 | # directory as views. These settings can be overridden in the individual model 32 | # files using the `{{ config(...) }}` macro. 33 | models: 34 | stackoverflow: 35 | # Config indicated by + and applies to all files under models/example/ 36 | example: 37 | +materialized: view 38 | -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/macros/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/macros/.gitkeep -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/models/schema.yml: -------------------------------------------------------------------------------- 1 | version: 2 2 | 3 | models: 4 | - name: survey_results 5 | description: "The result of StackOverflow's survey in 2023." 
6 | columns: 7 | - name: ResponseId 8 | description: "The response Id" 9 | tests: 10 | - unique 11 | - not_null 12 | - name: Q120 13 | - name: MainBranch 14 | - name: Age 15 | - name: Employment 16 | - name: RemoteWork 17 | - name: CodingActivities 18 | - name: EdLevel 19 | - name: LearnCode 20 | - name: LearnCodeOnline 21 | - name: LearnCodeCoursesCert 22 | - name: YearsCode 23 | - name: YearsCodePro 24 | - name: DevType 25 | - name: OrgSize 26 | - name: PurchaseInfluence 27 | - name: TechList 28 | - name: BuyNewTool 29 | - name: Country 30 | - name: Currency 31 | - name: CompTotal 32 | - name: LanguageHaveWorkedWith 33 | - name: LanguageWantToWorkWith 34 | - name: DatabaseHaveWorkedWith 35 | - name: DatabaseWantToWorkWith 36 | - name: PlatformHaveWorkedWith 37 | - name: PlatformWantToWorkWith 38 | - name: WebframeHaveWorkedWith 39 | - name: WebframeWantToWorkWith 40 | - name: MiscTechHaveWorkedWith 41 | - name: MiscTechWantToWorkWith 42 | - name: ToolsTechHaveWorkedWith 43 | - name: ToolsTechWantToWorkWith 44 | - name: NEWCollabToolsHaveWorkedWith 45 | - name: NEWCollabToolsWantToWorkWith 46 | - name: OpSysPersonal use 47 | - name: OpSysProfessional use 48 | - name: OfficeStackAsyncHaveWorkedWith 49 | - name: OfficeStackAsyncWantToWorkWith 50 | - name: OfficeStackSyncHaveWorkedWith 51 | - name: OfficeStackSyncWantToWorkWith 52 | - name: AISearchHaveWorkedWith 53 | - name: AISearchWantToWorkWith 54 | - name: AIDevHaveWorkedWith 55 | - name: AIDevWantToWorkWith 56 | - name: NEWSOSites 57 | - name: SOVisitFreq 58 | - name: SOAccount 59 | - name: SOPartFreq 60 | - name: SOComm 61 | - name: SOAI 62 | - name: AISelect 63 | - name: AISent 64 | - name: AIAcc 65 | - name: AIBen 66 | - name: AIToolInterested in Using 67 | - name: AIToolCurrently Using 68 | - name: AIToolNot interested in Using 69 | - name: AINextVery different 70 | - name: AINextNeither different nor similar 71 | - name: AINextSomewhat similar 72 | - name: AINextVery similar 73 | - name: AINextSomewhat 
different 74 | - name: TBranch 75 | - name: ICorPM 76 | - name: WorkExp 77 | - name: Knowledge_1 78 | - name: Knowledge_2 79 | - name: Knowledge_3 80 | - name: Knowledge_4 81 | - name: Knowledge_5 82 | - name: Knowledge_6 83 | - name: Knowledge_7 84 | - name: Knowledge_8 85 | - name: Frequency_1 86 | - name: Frequency_2 87 | - name: Frequency_3 88 | - name: TimeSearching 89 | - name: TimeAnswering 90 | - name: ProfessionalTech 91 | - name: Industry 92 | - name: SurveyLength 93 | - name: SurveyEase 94 | - name: ConvertedCompYearly 95 | -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/models/source.yml: -------------------------------------------------------------------------------- 1 | sources: 2 | - name: stackoverflow_survey_source 3 | tables: 4 | - name: surveys 5 | meta: 6 | external_location: "read_csv('../../data/survey_results_public.csv', AUTO_DETECT=TRUE)" # automatically parser and detect schema 7 | formatter: oldstyle -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/models/survey_results.sql: -------------------------------------------------------------------------------- 1 | {{ config(materialized='table') }} 2 | 3 | SELECT * 4 | FROM {{ source('stackoverflow_survey_source', 'surveys')}} -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/profiles.yml: -------------------------------------------------------------------------------- 1 | stackoverflow: 2 | target: dev 3 | outputs: 4 | dev: 5 | type: duckdb 6 | path: '/data/duckdb/stackoverflow.duckdb' # path to local DuckDB database file -------------------------------------------------------------------------------- /dbt/stackoverflowsurvey/seeds/.gitkeep: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/seeds/.gitkeep
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/snapshots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/snapshots/.gitkeep
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/tests/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/tests/.gitkeep
--------------------------------------------------------------------------------
/duckdb/README.md:
--------------------------------------------------------------------------------
1 | > **NOTE**: DuckDB will be used as an embedded library installed with the dbt project and Superset.
-------------------------------------------------------------------------------- /images/architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/images/architecture.png -------------------------------------------------------------------------------- /images/superset_dashboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/images/superset_dashboard.png -------------------------------------------------------------------------------- /images/superset_duckdb_connection.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/images/superset_duckdb_connection.png -------------------------------------------------------------------------------- /images/superset_duckdb_connection_advanced_config.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/images/superset_duckdb_connection_advanced_config.png -------------------------------------------------------------------------------- /superset/docker-compose.yml: -------------------------------------------------------------------------------- 1 | # 2 | # Licensed to the Apache Software Foundation (ASF) under one or more 3 | # contributor license agreements. See the NOTICE file distributed with 4 | # this work for additional information regarding copyright ownership. 
5 | # The ASF licenses this file to You under the Apache License, Version 2.0 6 | # (the "License"); you may not use this file except in compliance with 7 | # the License. You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | x-superset-image: &superset-image apachesuperset.docker.scarf.sh/apache/superset:latest 18 | x-superset-depends-on: &superset-depends-on 19 | - db 20 | - redis 21 | x-superset-volumes: 22 | &superset-volumes # /app/pythonpath_docker will be appended to the PYTHONPATH in the final container 23 | - ./docker:/app/docker 24 | - superset_home:/app/superset_home 25 | - /data/duckdb:/app/duckdb 26 | 27 | version: "3.7" 28 | services: 29 | redis: 30 | image: redis:7 31 | container_name: superset_cache 32 | restart: unless-stopped 33 | volumes: 34 | - redis:/data 35 | networks: 36 | - osmds_internal 37 | 38 | db: 39 | env_file: docker/.env 40 | image: postgres:15 41 | container_name: superset_db 42 | restart: unless-stopped 43 | volumes: 44 | - db_home:/var/lib/postgresql/data 45 | - ./docker/docker-entrypoint-initdb.d:/docker-entrypoint-initdb.d 46 | networks: 47 | - osmds_internal 48 | 49 | superset: 50 | env_file: docker/.env 51 | image: *superset-image 52 | container_name: superset_app 53 | command: ["/app/docker/docker-bootstrap.sh", "app-gunicorn"] 54 | user: "root" 55 | restart: unless-stopped 56 | ports: 57 | - 8088:8088 58 | depends_on: *superset-depends-on 59 | volumes: *superset-volumes 60 | networks: 61 | - osmds_internal 62 | 63 | superset-init: 64 | image: *superset-image 65 | container_name: superset_init 66 | command: 
["/app/docker/docker-init.sh"] 67 | env_file: docker/.env 68 | depends_on: *superset-depends-on 69 | user: "root" 70 | volumes: *superset-volumes 71 | healthcheck: 72 | disable: true 73 | networks: 74 | - osmds_internal 75 | 76 | superset-worker: 77 | image: *superset-image 78 | container_name: superset_worker 79 | command: ["/app/docker/docker-bootstrap.sh", "worker"] 80 | env_file: docker/.env 81 | restart: unless-stopped 82 | depends_on: *superset-depends-on 83 | user: "root" 84 | volumes: *superset-volumes 85 | healthcheck: 86 | test: 87 | [ 88 | "CMD-SHELL", 89 | "celery -A superset.tasks.celery_app:app inspect ping -d celery@$$HOSTNAME", 90 | ] 91 | networks: 92 | - osmds_internal 93 | 94 | superset-worker-beat: 95 | image: *superset-image 96 | container_name: superset_worker_beat 97 | command: ["/app/docker/docker-bootstrap.sh", "beat"] 98 | env_file: docker/.env 99 | restart: unless-stopped 100 | depends_on: *superset-depends-on 101 | user: "root" 102 | volumes: *superset-volumes 103 | healthcheck: 104 | disable: true 105 | networks: 106 | - osmds_internal 107 | 108 | volumes: 109 | superset_home: 110 | external: false 111 | db_home: 112 | external: false 113 | redis: 114 | external: false 115 | 116 | networks: 117 | osmds_internal: 118 | external: true -------------------------------------------------------------------------------- /superset/docker/.env: -------------------------------------------------------------------------------- 1 | # 2 | # Licensed to the Apache Software Foundation (ASF) under one or more 3 | # contributor license agreements. See the NOTICE file distributed with 4 | # this work for additional information regarding copyright ownership. 5 | # The ASF licenses this file to You under the Apache License, Version 2.0 6 | # (the "License"); you may not use this file except in compliance with 7 | # the License. 
You may obtain a copy of the License at 8 | # 9 | # http://www.apache.org/licenses/LICENSE-2.0 10 | # 11 | # Unless required by applicable law or agreed to in writing, software 12 | # distributed under the License is distributed on an "AS IS" BASIS, 13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 14 | # See the License for the specific language governing permissions and 15 | # limitations under the License. 16 | # 17 | COMPOSE_PROJECT_NAME=superset 18 | 19 | # database configurations (do not modify) 20 | DATABASE_DB=superset 21 | DATABASE_HOST=db 22 | DATABASE_PASSWORD=superset 23 | DATABASE_USER=superset 24 | DATABASE_PORT=5432 25 | DATABASE_DIALECT=postgresql 26 | 27 | EXAMPLES_DB=examples 28 | EXAMPLES_HOST=db 29 | EXAMPLES_USER=examples 30 | EXAMPLES_PASSWORD=examples 31 | EXAMPLES_PORT=5432 32 | 33 | # database engine specific environment variables 34 | # change the below if you prefer another database engine 35 | POSTGRES_DB=superset 36 | POSTGRES_USER=superset 37 | POSTGRES_PASSWORD=superset 38 | #MYSQL_DATABASE=superset 39 | #MYSQL_USER=superset 40 | #MYSQL_PASSWORD=superset 41 | #MYSQL_RANDOM_ROOT_PASSWORD=yes 42 | 43 | # Add the mapped in /app/pythonpath_docker which allows devs to override stuff 44 | PYTHONPATH=/app/pythonpath:/app/docker/pythonpath_dev 45 | REDIS_HOST=redis 46 | REDIS_PORT=6379 47 | 48 | SUPERSET_ENV=production 49 | SUPERSET_LOAD_EXAMPLES=yes 50 | SUPERSET_SECRET_KEY=TEST_NON_DEV_SECRET 51 | CYPRESS_CONFIG=false 52 | SUPERSET_PORT=8088 53 | MAPBOX_API_KEY='' 54 | -------------------------------------------------------------------------------- /superset/docker/README.md: -------------------------------------------------------------------------------- 1 | 19 | 20 | # Getting Started with Superset using Docker 21 | 22 | Docker is an easy way to get started with Superset. 23 | 24 | ## Prerequisites 25 | 26 | 1. [Docker](https://www.docker.com/get-started) 27 | 2. 
[Docker Compose](https://docs.docker.com/compose/install/) 28 | 29 | ## Configuration 30 | 31 | The `/app/pythonpath` folder is mounted from [`./docker/pythonpath_dev`](./pythonpath_dev) 32 | which contains a base configuration [`./docker/pythonpath_dev/superset_config.py`](./pythonpath_dev/superset_config.py) 33 | intended for use with local development. 34 | 35 | ### Local overrides 36 | 37 | In order to override configuration settings locally, simply make a copy of [`./docker/pythonpath_dev/superset_config_local.example`](./pythonpath_dev/superset_config_local.example) 38 | into `./docker/pythonpath_dev/superset_config_docker.py` (git ignored) and fill in your overrides. 39 | 40 | ### Local packages 41 | 42 | If you want to add Python packages in order to test things like databases locally, you can simply add a local requirements.txt (`./docker/requirements-local.txt`) 43 | and rebuild your Docker stack. 44 | 45 | Steps: 46 | 47 | 1. Create `./docker/requirements-local.txt` 48 | 2. Add your new packages 49 | 3. Rebuild docker-compose 50 | 1. `docker-compose down -v` 51 | 2. `docker-compose up` 52 | 53 | ## Initializing Database 54 | 55 | The database will initialize itself upon startup via the init container ([`superset-init`](./docker-init.sh)). This may take a minute. 56 | 57 | ## Normal Operation 58 | 59 | To run the container, simply run: `docker-compose up` 60 | 61 | After waiting several minutes for Superset initialization to finish, you can open a browser and view [`http://localhost:8088`](http://localhost:8088) 62 | to start your journey. 63 | 64 | ## Developing 65 | 66 | While running, the container server will reload on modification of the Superset Python and JavaScript source code. 67 | Don't forget to reload the page to take the new frontend into account though. 68 | 69 | ## Production 70 | 71 | It is possible to run Superset in non-development mode by using [`docker-compose-non-dev.yml`](../docker-compose-non-dev.yml). 
This file excludes the volumes needed for development and uses [`./docker/.env-non-dev`](./.env-non-dev) which sets the variable `SUPERSET_ENV` to `production`. 72 | 73 | ## Resource Constraints 74 | 75 | If you are attempting to build on macOS and it exits with 137 you need to increase your Docker resources. See instructions [here](https://docs.docker.com/docker-for-mac/#advanced) (search for memory) 76 | -------------------------------------------------------------------------------- /superset/docker/docker-bootstrap.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # 3 | # Licensed to the Apache Software Foundation (ASF) under one or more 4 | # contributor license agreements. See the NOTICE file distributed with 5 | # this work for additional information regarding copyright ownership. 6 | # The ASF licenses this file to You under the Apache License, Version 2.0 7 | # (the "License"); you may not use this file except in compliance with 8 | # the License. You may obtain a copy of the License at 9 | # 10 | # http://www.apache.org/licenses/LICENSE-2.0 11 | # 12 | # Unless required by applicable law or agreed to in writing, software 13 | # distributed under the License is distributed on an "AS IS" BASIS, 14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | # See the License for the specific language governing permissions and 16 | # limitations under the License. 
17 | # 18 | 19 | set -eo pipefail 20 | 21 | REQUIREMENTS_LOCAL="/app/docker/requirements-local.txt" 22 | # If running under Cypress, use the test config and export test env variables 23 | if [ "$CYPRESS_CONFIG" == "true" ]; then 24 | export SUPERSET_CONFIG=tests.integration_tests.superset_test_config 25 | export SUPERSET_TESTENV=true 26 | export SUPERSET__SQLALCHEMY_DATABASE_URI=postgresql+psycopg2://superset:superset@db:5432/superset 27 | fi 28 | # 29 | # Make sure we have dev requirements installed 30 | # 31 | if [ -f "${REQUIREMENTS_LOCAL}" ]; then 32 | echo "Installing local overrides at ${REQUIREMENTS_LOCAL}" 33 | pip install --no-cache-dir -r "${REQUIREMENTS_LOCAL}" 34 | else 35 | echo "Skipping local overrides" 36 | fi 37 | 38 | case "${1}" in 39 | worker) 40 | echo "Starting Celery worker..." 41 | celery --app=superset.tasks.celery_app:app worker -O fair -l INFO 42 | ;; 43 | beat) 44 | echo "Starting Celery beat..." 45 | rm -f /tmp/celerybeat.pid 46 | celery --app=superset.tasks.celery_app:app beat --pidfile /tmp/celerybeat.pid -l INFO -s "${SUPERSET_HOME}"/celerybeat-schedule 47 | ;; 48 | app) 49 | echo "Starting web app (using development server)..." 50 | flask run -p 8088 --with-threads --reload --debugger --host=0.0.0.0 51 | ;; 52 | app-gunicorn) 53 | echo "Starting web app..." 54 | /usr/bin/run-server.sh 55 | ;; 56 | *) 57 | echo "Unknown operation: ${1}" 58 | ;; 59 | esac 60 | -------------------------------------------------------------------------------- /superset/docker/docker-ci.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # 3 | # Licensed to the Apache Software Foundation (ASF) under one or more 4 | # contributor license agreements. See the NOTICE file distributed with 5 | # this work for additional information regarding copyright ownership. 
6 | # The ASF licenses this file to You under the Apache License, Version 2.0 7 | # (the "License"); you may not use this file except in compliance with 8 | # the License. You may obtain a copy of the License at 9 | # 10 | # http://www.apache.org/licenses/LICENSE-2.0 11 | # 12 | # Unless required by applicable law or agreed to in writing, software 13 | # distributed under the License is distributed on an "AS IS" BASIS, 14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | # See the License for the specific language governing permissions and 16 | # limitations under the License. 17 | # 18 | /app/docker/docker-init.sh 19 | 20 | # TODO: copy config overrides from ENV vars 21 | 22 | # TODO: run celery in detached state 23 | export SERVER_THREADS_AMOUNT=8 24 | # start up the web server 25 | 26 | /usr/bin/run-server.sh 27 | -------------------------------------------------------------------------------- /superset/docker/docker-entrypoint-initdb.d/examples-init.sh: -------------------------------------------------------------------------------- 1 | # ------------------------------------------------------------------------ 2 | # Creates the examples database and respective user. 
The database location 3 | # and access credentials are defined in environment variables 4 | # ------------------------------------------------------------------------ 5 | set -e 6 | 7 | psql -v ON_ERROR_STOP=1 --username "${POSTGRES_USER}" <<-EOSQL 8 |     CREATE USER ${EXAMPLES_USER} WITH PASSWORD '${EXAMPLES_PASSWORD}'; 9 |    CREATE DATABASE ${EXAMPLES_DB}; 10 |    GRANT ALL PRIVILEGES ON DATABASE ${EXAMPLES_DB} TO ${EXAMPLES_USER}; 11 | EOSQL 12 | 13 | psql -v ON_ERROR_STOP=1 --username "${POSTGRES_USER}" -d "${EXAMPLES_DB}" <<-EOSQL 14 |    GRANT ALL ON SCHEMA public TO ${EXAMPLES_USER}; 15 | EOSQL 16 | -------------------------------------------------------------------------------- /superset/docker/docker-frontend.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # 3 | # Licensed to the Apache Software Foundation (ASF) under one or more 4 | # contributor license agreements. See the NOTICE file distributed with 5 | # this work for additional information regarding copyright ownership. 6 | # The ASF licenses this file to You under the Apache License, Version 2.0 7 | # (the "License"); you may not use this file except in compliance with 8 | # the License. You may obtain a copy of the License at 9 | # 10 | # http://www.apache.org/licenses/LICENSE-2.0 11 | # 12 | # Unless required by applicable law or agreed to in writing, software 13 | # distributed under the License is distributed on an "AS IS" BASIS, 14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | # See the License for the specific language governing permissions and 16 | # limitations under the License. 
17 | # 18 | set -e 19 | 20 | # Packages needed for puppeteer: 21 | apt update 22 | apt install -y chromium 23 | 24 | cd /app/superset-frontend 25 | npm install -f --no-optional --global webpack webpack-cli 26 | npm install -f --no-optional 27 | 28 | echo "Running frontend" 29 | npm run dev 30 | -------------------------------------------------------------------------------- /superset/docker/docker-init.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | # 3 | # Licensed to the Apache Software Foundation (ASF) under one or more 4 | # contributor license agreements. See the NOTICE file distributed with 5 | # this work for additional information regarding copyright ownership. 6 | # The ASF licenses this file to You under the Apache License, Version 2.0 7 | # (the "License"); you may not use this file except in compliance with 8 | # the License. You may obtain a copy of the License at 9 | # 10 | # http://www.apache.org/licenses/LICENSE-2.0 11 | # 12 | # Unless required by applicable law or agreed to in writing, software 13 | # distributed under the License is distributed on an "AS IS" BASIS, 14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | # See the License for the specific language governing permissions and 16 | # limitations under the License. 17 | # 18 | set -e 19 | 20 | # 21 | # Always install local overrides first 22 | # 23 | /app/docker/docker-bootstrap.sh 24 | 25 | STEP_CNT=4 26 | 27 | echo_step() { 28 | cat <