├── .gitignore
├── README.md
├── dbt
├── .gitignore
├── requirments.txt
└── stackoverflowsurvey
│ ├── .gitignore
│ ├── .user.yml
│ ├── README.md
│ ├── analyses
│ └── .gitkeep
│ ├── dbt_project.yml
│ ├── macros
│ └── .gitkeep
│ ├── models
│ ├── schema.yml
│ ├── source.yml
│ └── survey_results.sql
│ ├── profiles.yml
│ ├── seeds
│ └── .gitkeep
│ ├── snapshots
│ └── .gitkeep
│ └── tests
│ └── .gitkeep
├── duckdb
└── README.md
├── images
├── architecture.png
├── superset_dashboard.png
├── superset_duckdb_connection.png
└── superset_duckdb_connection_advanced_config.png
└── superset
├── docker-compose.yml
└── docker
├── .env
├── README.md
├── docker-bootstrap.sh
├── docker-ci.sh
├── docker-entrypoint-initdb.d
└── examples-init.sh
├── docker-frontend.sh
├── docker-init.sh
├── frontend-mem-nag.sh
├── pythonpath_dev
├── .gitignore
├── superset_config.py
└── superset_config_local.example
├── requirements-local.txt
└── run-server.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | .idea
2 | data
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Building a robust yet simple data analytics platform with DuckDB, dbt, Iceberg, and Superset
2 | Modern analytics platforms require robust data storage, transformation, and management tools. DuckDB provides a simple, high-performance, columnar analytical database. DBT simplifies data transformation and modeling, and Iceberg offers scalable data lake management capabilities. Combining these tools can create a powerful and flexible analytics platform.
3 |
4 | 
5 |
6 | # Understanding the tools
7 | ## DuckDB
8 | DuckDB is an in-process, columnar analytical database that stands out for its speed, efficiency, and compatibility with the SQL standard. Here is a more in-depth look at its features:
9 | - **High-performance Analytics**: DuckDB is optimized for analytical queries, making it an ideal choice for data warehousing and analytics workloads. Its columnar data layout and vectorized execution significantly boost query performance.
10 | - **SQL Compatibility**: DuckDB supports SQL, making it accessible to analysts and data professionals who are already familiar with SQL syntax. This compatibility allows you to leverage your existing SQL knowledge and tools.
11 | - **Integration with BI Tools**: DuckDB integrates seamlessly with popular business intelligence (BI) tools like Tableau, Power BI, and Looker. This compatibility ensures that you can visualize and report on your data effectively.
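As a quick illustration of the points above, DuckDB can query a CSV file in place with plain SQL, no load step required. This is a hypothetical sketch (the file name matches the survey data used later in this guide):

```sql
-- Run in the DuckDB CLI or any DuckDB client; the CSV path is an assumption.
SELECT Country, COUNT(*) AS responses
FROM read_csv('data/survey_results_public.csv', AUTO_DETECT=TRUE)
GROUP BY Country
ORDER BY responses DESC
LIMIT 5;
```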
12 |
13 | ## dbt
14 | dbt, which stands for Data Build Tool, is a command-line tool that revolutionizes the way data transformations and modeling are done. Here's a deeper dive into dbt's capabilities:
15 | - **Modular Data Transformations**: dbt uses SQL and YAML files to define data transformations and models. This modular approach allows you to break down complex transformations into smaller, more manageable pieces, enhancing maintainability and version control.
16 | - **Data Testing**: dbt facilitates data testing by allowing you to define expectations about your data. It helps ensure data quality by automatically running tests against your transformed data.
17 | - **Version Control**: dbt projects can be version controlled with tools like Git, enabling collaboration among data professionals while keeping a history of changes.
18 | - **Incremental Builds**: dbt supports incremental builds, meaning it only processes data that has changed since the last run. This feature saves time and resources when working with large datasets.
19 | - **Orchestration**: While dbt focuses on data transformations and modeling, it can be integrated with orchestration tools like Apache Airflow or dbt Cloud to create automated data pipelines.
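For example, the incremental-build feature mentioned above is driven by a model-level config. Here is a minimal, hypothetical dbt model sketch (the model name and filter column are illustrative, not part of this project):

```sql
-- models/survey_results_incremental.sql (hypothetical)
{{ config(materialized='incremental', unique_key='ResponseId') }}

SELECT *
FROM {{ source('stackoverflow_survey_source', 'surveys') }}
{% if is_incremental() %}
  -- only process rows not already present in the target table
  WHERE ResponseId > (SELECT MAX(ResponseId) FROM {{ this }})
{% endif %}
```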
20 |
21 | ## Iceberg
22 | Iceberg is a table format designed for managing data lakes, offering several key features to ensure data quality and scalability:
23 | - **Schema Evolution**: One of Iceberg's standout features is its support for schema evolution. You can add, delete, or modify columns in your datasets without breaking existing queries or data integrity. This makes it suitable for rapidly evolving data lakes.
24 | - **ACID Transactions**: Iceberg provides ACID (Atomicity, Consistency, Isolation, Durability) transactions, ensuring data consistency and reliability in multi-user and multi-writer environments.
25 | - **Time-Travel Capabilities**: Iceberg allows you to query historical versions of your data, making it possible to recover from data errors or analyze changes over time.
26 | - **Optimized File Storage**: Iceberg optimizes file storage by using techniques like metadata management, partitioning, and file pruning. This results in efficient data storage and retrieval.
27 | - **Connectivity**: Iceberg supports various storage connectors, including Apache Hadoop HDFS, Amazon S3, and Azure Data Lake Storage, making it versatile and compatible with different data lake platforms.
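To make the time-travel capability concrete, here is a hypothetical Spark SQL sketch (Iceberg is not wired into this showcase yet, so none of these table or snapshot names exist in the repo):

```sql
-- Query an Iceberg table as of an earlier snapshot id (Spark SQL syntax)
SELECT COUNT(*) FROM demo.surveys VERSION AS OF 4348509104520155000;

-- Or as of a wall-clock timestamp
SELECT COUNT(*) FROM demo.surveys TIMESTAMP AS OF '2023-09-01 00:00:00';
```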
28 |
29 | > NOTE: *Iceberg is not currently utilized in this showcase, but it will be added soon.*
30 | ## Apache Superset
31 | Apache Superset is a modern, open-source BI tool that enables data exploration, visualization, and interactive dashboards. It connects to various data sources and is designed to empower users to explore data and create dynamic reports.
32 | - **Data Visualization**: Apache Superset allows users to create interactive visualizations, including charts, graphs, and geographic maps, to explore and understand data.
33 | - **Dashboard Creation**: Users can build dynamic dashboards by combining multiple visualizations and applying filters for real-time data exploration.
34 | - **Connectivity**: Apache Superset can connect to various data sources, including SQL databases, data lakes, and cloud storage, making it adaptable to diverse data ecosystems.
35 | - **Security**: It offers robust security features, including role-based access control and integration with authentication providers, ensuring data is accessed securely.
36 | - **Community and Extensibility**: As an open-source project, Apache Superset benefits from a vibrant community that contributes plugins, connectors, and additional features, enhancing its capabilities.
37 | - **SQL Support**: Superset supports SQL queries, allowing users to execute custom queries and create complex calculated fields.
38 |
39 | # Setting up DuckDB, dbt, and Superset with Docker Compose
40 | ## Setting up DuckDB
41 | DuckDB will be installed as a library alongside dbt and Superset in the next section.
42 |
43 | ## Setting up dbt
44 | First, we need to install the *dbt-core* and *dbt-duckdb* libraries, then initialize a dbt project.
45 | ```bash
46 | # create a virtual environment
47 | cd dbt
48 | python -m venv .env
49 | source .env/bin/activate
50 |
51 | # install libraries: dbt-core and dbt-duckdb
52 | pip install -r requirements.txt
53 |
54 | # check version
55 | dbt --version
56 | ```
57 |
58 | Then we initialize a dbt project named *stackoverflowsurvey* and create a *profiles.yml* file with the following content:
59 | ```yaml
60 | stackoverflow:
61 |   target: dev
62 |   outputs:
63 |     dev:
64 |       type: duckdb
65 |       path: '/data/duckdb/stackoverflow.duckdb' # path to local DuckDB database file
66 | ```
67 |
68 | Run the following command to verify the configuration:
69 | ```bash
70 | # We must specify the directory of the 'profiles.yml' file since we are not using the default location.
71 | dbt debug --profiles-dir .
72 | ```
73 |
74 | ## Setting up Superset
75 | Run the following commands to set up the Superset service:
76 | ```bash
77 | cd superset
78 | # run Docker Compose to start the Superset services
79 | # the libraries declared in the 'requirements-local.txt' file (including duckdb-engine) will also be installed
80 | docker-compose up --detach
81 | ```
82 |
83 | Visit *http://localhost:8088* to access the Superset UI. Enter **admin** as both the username and password. Choose **DuckDB** from the supported databases drop-down, then set up a connection to the DuckDB database.
84 |
85 |
86 |
87 |
88 | ![superset_duckdb_connection](images/superset_duckdb_connection.png)
89 | ![superset_duckdb_connection_advanced_config](images/superset_duckdb_connection_advanced_config.png)
90 |
91 |
92 |
93 |
94 | > **NOTE**: Provide the path to a DuckDB database file on disk in the URL, e.g., *duckdb:////Users/whoever/path/to/duck.db*.
95 |
96 | We combine the DuckDB volume mapping exposed in the *superset/docker-compose.yml* file
97 | ```yaml
98 | x-superset-volumes:
99 |   &superset-volumes
100 |   - /data/duckdb:/app/duckdb
101 | ```
102 | with the DuckDB database path defined in *dbt/stackoverflowsurvey/profiles.yml*:
103 | ```yaml
104 | path: '/data/duckdb/stackoverflow.duckdb'
105 | ```
106 | Combining the two gives the final URI used to establish a connection between Superset and DuckDB:
107 | ```
108 | duckdb:///duckdb/stackoverflow.duckdb
109 | ```
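The mapping from host path to container URI can be sketched in a few lines of Python (the helper name and constants are illustrative, not part of the repo):

```python
import posixpath

# Hypothetical helper showing how the compose volume mapping
# '/data/duckdb:/app/duckdb' plus the dbt profile path produce the URI above.
HOST_MOUNT = "/data/duckdb"        # host side of the compose volume
CONTAINER_MOUNT = "/app/duckdb"    # container side of the compose volume
CONTAINER_WORKDIR = "/app"         # Superset's working directory in the container

def duckdb_uri(host_db_path: str) -> str:
    """Translate a host DuckDB file path into a Superset connection URI."""
    # Swap the host mount prefix for the container mount prefix.
    inside = posixpath.join(CONTAINER_MOUNT,
                            posixpath.relpath(host_db_path, HOST_MOUNT))
    # The repo's URI is relative to the container working directory (/app).
    return "duckdb:///" + posixpath.relpath(inside, CONTAINER_WORKDIR)

print(duckdb_uri("/data/duckdb/stackoverflow.duckdb"))
# -> duckdb:///duckdb/stackoverflow.duckdb
```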
110 |
111 | In Superset, the engine needs to be configured to open DuckDB in "read-only" mode. Otherwise, only one connection can hold the database file at a time, so simultaneous queries cause lock errors, and the Superset dashboard cannot be refreshed while the dbt pipeline is running.
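One way to achieve this, assuming the duckdb-engine driver's support for `connect_args` (this is not configured anywhere in this repo), is to paste the following JSON into the **ENGINE PARAMETERS** field under *Advanced → Other* when editing the database connection in Superset:

```json
{
  "connect_args": {
    "read_only": true
  }
}
```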
112 |
113 | # Loading source
114 | In this showcase, we are using the [Stack Overflow Annual Developer Survey](https://insights.stackoverflow.com/survey) data set. To simplify matters, we will focus solely on the [2023](https://cdn.stackoverflow.co/files/jo7n4k8s/production/49915bfd46d0902c3564fd9a06b509d08a20488c.zip/stack-overflow-developer-survey-2023.zip) data set, which needs to be manually downloaded and extracted into the *PROJECT_HOME/data* directory.
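The download step can be scripted roughly as follows (the URL is the 2023 link above; the commands assume you start from the repo root):

```bash
mkdir -p data
cd data
curl -LO https://cdn.stackoverflow.co/files/jo7n4k8s/production/49915bfd46d0902c3564fd9a06b509d08a20488c.zip/stack-overflow-developer-survey-2023.zip
unzip stack-overflow-developer-survey-2023.zip   # yields survey_results_public.csv, among others
```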
115 |
116 | # Building models with dbt
117 | ## Defining data source
118 | We declare the data source in the *stackoverflowsurvey/models/source.yml* file with the following content:
119 | ```yaml
120 | sources:
121 |   - name: stackoverflow_survey_source
122 |     tables:
123 |       - name: surveys
124 |         meta:
125 |           external_location: "read_csv('../../data/survey_results_public.csv', AUTO_DETECT=TRUE)" # automatically parse and detect the schema
126 |           formatter: oldstyle
127 | ```
128 | ## Building models
129 | For demonstration purposes, we have created a very simple model, *survey_results.sql*, with the following content:
130 | ```sql
131 | {{ config(materialized='table') }}
132 |
133 | SELECT *
134 | FROM {{ source('stackoverflow_survey_source', 'surveys')}}
135 | ```
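With the source and model in place, the models can be built and tested from the project directory (a sketch assuming the virtual environment created earlier is active):

```bash
cd dbt/stackoverflowsurvey
dbt run --profiles-dir .    # materializes survey_results into stackoverflow.duckdb
dbt test --profiles-dir .   # runs the unique/not_null tests from schema.yml
```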
136 | # Connecting Superset
137 | Once the dbt models are built, data visualization can begin. An admin user must be created in Superset in order to log in.
138 |
139 | 
140 |
141 | # Conclusion
142 | In this guide, we've demonstrated how to construct an analytics platform that leverages the combined power of DuckDB, dbt, Iceberg, and Apache Superset. This platform empowers organizations to ingest, transform, manage, visualize, and analyze data to extract actionable insights.
143 | Key Components:
144 | - **DuckDB**: Our high-performance, SQL-compatible, in-process database serves as the foundation for efficient data storage and retrieval, enabling fast analytical queries.
145 | - **dbt**: dbt simplifies data transformation and modeling, allowing for the creation of modular, version-controlled data pipelines that enhance data quality and maintainability.
146 | - **Iceberg**: Iceberg manages data lakes with ease, offering schema evolution, ACID transactions, and time-travel capabilities, ensuring data integrity and scalability in large-scale analytics environments.
147 | - **Apache Superset**: Apache Superset enhances the platform by providing a modern, open-source BI tool for data exploration, visualization, and interactive dashboard creation. Its connectivity options, security features, and SQL support empower users to gain insights from data with ease.
148 |
149 | Together, these tools create a powerful and flexible analytics platform, enabling organizations to derive valuable insights and make informed decisions. Whether you're dealing with structured or unstructured data, this platform equips you with the tools needed to turn raw data into actionable intelligence.
150 |
151 | ## Supporting Links
152 | * Stack Overflow Annual Developer Survey
153 | * Modern Data Stack in a Box with DuckDB
154 | * dbt adapter for DuckDB
155 |
156 |
157 |
158 |
159 |
160 |
161 |
--------------------------------------------------------------------------------
/dbt/.gitignore:
--------------------------------------------------------------------------------
1 | logs
2 | .env
--------------------------------------------------------------------------------
/dbt/requirments.txt:
--------------------------------------------------------------------------------
1 | dbt-core==1.6.0
2 | dbt-duckdb==1.6.0
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | target/
3 | dbt_packages/
4 | logs/
5 | stackoverflow.*
6 |
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/.user.yml:
--------------------------------------------------------------------------------
1 | id: 1ea6d26b-1f7f-44a4-897d-43d71b80b184
2 |
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/README.md:
--------------------------------------------------------------------------------
1 | Welcome to your new dbt project!
2 |
3 | ### Using the starter project
4 |
5 | Try running the following commands:
6 | - dbt run
7 | - dbt test
8 |
9 |
10 | ### Resources:
11 | - Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
12 | - Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
13 | - Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support
14 | - Find [dbt events](https://events.getdbt.com) near you
15 | - Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices
16 |
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/analyses/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/analyses/.gitkeep
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/dbt_project.yml:
--------------------------------------------------------------------------------
1 |
2 | # Name your project! Project names should contain only lowercase characters
3 | # and underscores. A good package name should reflect your organization's
4 | # name or the intended use of these models
5 | name: 'stackoverflow'
6 | version: '1.0.0'
7 | config-version: 2
8 |
9 | # This setting configures which "profile" dbt uses for this project.
10 | profile: 'stackoverflow'
11 |
12 | # These configurations specify where dbt should look for different types of files.
13 | # The `model-paths` config, for example, states that models in this project can be
14 | # found in the "models/" directory. You probably won't need to change these!
15 | model-paths: ["models"]
16 | analysis-paths: ["analyses"]
17 | test-paths: ["tests"]
18 | seed-paths: ["seeds"]
19 | macro-paths: ["macros"]
20 | snapshot-paths: ["snapshots"]
21 |
22 | clean-targets: # directories to be removed by `dbt clean`
23 |   - "target"
24 |   - "dbt_packages"
25 |
26 |
27 | # Configuring models
28 | # Full documentation: https://docs.getdbt.com/docs/configuring-models
29 |
30 | # In this example config, we tell dbt to build all models in the example/
31 | # directory as views. These settings can be overridden in the individual model
32 | # files using the `{{ config(...) }}` macro.
33 | models:
34 |   stackoverflow:
35 |     # Config indicated by + and applies to all files under models/example/
36 |     example:
37 |       +materialized: view
38 |
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/macros/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/macros/.gitkeep
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/models/schema.yml:
--------------------------------------------------------------------------------
1 | version: 2
2 |
3 | models:
4 |   - name: survey_results
5 |     description: "The result of StackOverflow's survey in 2023."
6 |     columns:
7 |       - name: ResponseId
8 |         description: "The response Id"
9 |         tests:
10 |           - unique
11 |           - not_null
12 |       - name: Q120
13 |       - name: MainBranch
14 |       - name: Age
15 |       - name: Employment
16 |       - name: RemoteWork
17 |       - name: CodingActivities
18 |       - name: EdLevel
19 |       - name: LearnCode
20 |       - name: LearnCodeOnline
21 |       - name: LearnCodeCoursesCert
22 |       - name: YearsCode
23 |       - name: YearsCodePro
24 |       - name: DevType
25 |       - name: OrgSize
26 |       - name: PurchaseInfluence
27 |       - name: TechList
28 |       - name: BuyNewTool
29 |       - name: Country
30 |       - name: Currency
31 |       - name: CompTotal
32 |       - name: LanguageHaveWorkedWith
33 |       - name: LanguageWantToWorkWith
34 |       - name: DatabaseHaveWorkedWith
35 |       - name: DatabaseWantToWorkWith
36 |       - name: PlatformHaveWorkedWith
37 |       - name: PlatformWantToWorkWith
38 |       - name: WebframeHaveWorkedWith
39 |       - name: WebframeWantToWorkWith
40 |       - name: MiscTechHaveWorkedWith
41 |       - name: MiscTechWantToWorkWith
42 |       - name: ToolsTechHaveWorkedWith
43 |       - name: ToolsTechWantToWorkWith
44 |       - name: NEWCollabToolsHaveWorkedWith
45 |       - name: NEWCollabToolsWantToWorkWith
46 |       - name: OpSysPersonal use
47 |       - name: OpSysProfessional use
48 |       - name: OfficeStackAsyncHaveWorkedWith
49 |       - name: OfficeStackAsyncWantToWorkWith
50 |       - name: OfficeStackSyncHaveWorkedWith
51 |       - name: OfficeStackSyncWantToWorkWith
52 |       - name: AISearchHaveWorkedWith
53 |       - name: AISearchWantToWorkWith
54 |       - name: AIDevHaveWorkedWith
55 |       - name: AIDevWantToWorkWith
56 |       - name: NEWSOSites
57 |       - name: SOVisitFreq
58 |       - name: SOAccount
59 |       - name: SOPartFreq
60 |       - name: SOComm
61 |       - name: SOAI
62 |       - name: AISelect
63 |       - name: AISent
64 |       - name: AIAcc
65 |       - name: AIBen
66 |       - name: AIToolInterested in Using
67 |       - name: AIToolCurrently Using
68 |       - name: AIToolNot interested in Using
69 |       - name: AINextVery different
70 |       - name: AINextNeither different nor similar
71 |       - name: AINextSomewhat similar
72 |       - name: AINextVery similar
73 |       - name: AINextSomewhat different
74 |       - name: TBranch
75 |       - name: ICorPM
76 |       - name: WorkExp
77 |       - name: Knowledge_1
78 |       - name: Knowledge_2
79 |       - name: Knowledge_3
80 |       - name: Knowledge_4
81 |       - name: Knowledge_5
82 |       - name: Knowledge_6
83 |       - name: Knowledge_7
84 |       - name: Knowledge_8
85 |       - name: Frequency_1
86 |       - name: Frequency_2
87 |       - name: Frequency_3
88 |       - name: TimeSearching
89 |       - name: TimeAnswering
90 |       - name: ProfessionalTech
91 |       - name: Industry
92 |       - name: SurveyLength
93 |       - name: SurveyEase
94 |       - name: ConvertedCompYearly
95 |
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/models/source.yml:
--------------------------------------------------------------------------------
1 | sources:
2 |   - name: stackoverflow_survey_source
3 |     tables:
4 |       - name: surveys
5 |         meta:
6 |           external_location: "read_csv('../../data/survey_results_public.csv', AUTO_DETECT=TRUE)" # automatically parse and detect the schema
7 |           formatter: oldstyle
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/models/survey_results.sql:
--------------------------------------------------------------------------------
1 | {{ config(materialized='table') }}
2 |
3 | SELECT *
4 | FROM {{ source('stackoverflow_survey_source', 'surveys')}}
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/profiles.yml:
--------------------------------------------------------------------------------
1 | stackoverflow:
2 |   target: dev
3 |   outputs:
4 |     dev:
5 |       type: duckdb
6 |       path: '/data/duckdb/stackoverflow.duckdb' # path to local DuckDB database file
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/seeds/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/seeds/.gitkeep
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/snapshots/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/snapshots/.gitkeep
--------------------------------------------------------------------------------
/dbt/stackoverflowsurvey/tests/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/dbt/stackoverflowsurvey/tests/.gitkeep
--------------------------------------------------------------------------------
/duckdb/README.md:
--------------------------------------------------------------------------------
1 | > **NOTE**: DuckDB will be used as an embedded library installed with dbt project and Superset.
--------------------------------------------------------------------------------
/images/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/images/architecture.png
--------------------------------------------------------------------------------
/images/superset_dashboard.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/images/superset_dashboard.png
--------------------------------------------------------------------------------
/images/superset_duckdb_connection.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/images/superset_duckdb_connection.png
--------------------------------------------------------------------------------
/images/superset_duckdb_connection_advanced_config.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/luatnc87/robust-data-analytics-platform-with-duckdb-dbt-iceberg/9e1ec87291ac3b7193bdc5b8193c01491e14bfdf/images/superset_duckdb_connection_advanced_config.png
--------------------------------------------------------------------------------
/superset/docker-compose.yml:
--------------------------------------------------------------------------------
1 | #
2 | # Licensed to the Apache Software Foundation (ASF) under one or more
3 | # contributor license agreements. See the NOTICE file distributed with
4 | # this work for additional information regarding copyright ownership.
5 | # The ASF licenses this file to You under the Apache License, Version 2.0
6 | # (the "License"); you may not use this file except in compliance with
7 | # the License. You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | x-superset-image: &superset-image apachesuperset.docker.scarf.sh/apache/superset:latest
18 | x-superset-depends-on: &superset-depends-on
19 |   - db
20 |   - redis
21 | x-superset-volumes:
22 |   &superset-volumes # /app/pythonpath_docker will be appended to the PYTHONPATH in the final container
23 |   - ./docker:/app/docker
24 |   - superset_home:/app/superset_home
25 |   - /data/duckdb:/app/duckdb
26 |
27 | version: "3.7"
28 | services:
29 |   redis:
30 |     image: redis:7
31 |     container_name: superset_cache
32 |     restart: unless-stopped
33 |     volumes:
34 |       - redis:/data
35 |     networks:
36 |       - osmds_internal
37 |
38 |   db:
39 |     env_file: docker/.env
40 |     image: postgres:15
41 |     container_name: superset_db
42 |     restart: unless-stopped
43 |     volumes:
44 |       - db_home:/var/lib/postgresql/data
45 |       - ./docker/docker-entrypoint-initdb.d:/docker-entrypoint-initdb.d
46 |     networks:
47 |       - osmds_internal
48 |
49 |   superset:
50 |     env_file: docker/.env
51 |     image: *superset-image
52 |     container_name: superset_app
53 |     command: ["/app/docker/docker-bootstrap.sh", "app-gunicorn"]
54 |     user: "root"
55 |     restart: unless-stopped
56 |     ports:
57 |       - 8088:8088
58 |     depends_on: *superset-depends-on
59 |     volumes: *superset-volumes
60 |     networks:
61 |       - osmds_internal
62 |
63 |   superset-init:
64 |     image: *superset-image
65 |     container_name: superset_init
66 |     command: ["/app/docker/docker-init.sh"]
67 |     env_file: docker/.env
68 |     depends_on: *superset-depends-on
69 |     user: "root"
70 |     volumes: *superset-volumes
71 |     healthcheck:
72 |       disable: true
73 |     networks:
74 |       - osmds_internal
75 |
76 |   superset-worker:
77 |     image: *superset-image
78 |     container_name: superset_worker
79 |     command: ["/app/docker/docker-bootstrap.sh", "worker"]
80 |     env_file: docker/.env
81 |     restart: unless-stopped
82 |     depends_on: *superset-depends-on
83 |     user: "root"
84 |     volumes: *superset-volumes
85 |     healthcheck:
86 |       test:
87 |         [
88 |           "CMD-SHELL",
89 |           "celery -A superset.tasks.celery_app:app inspect ping -d celery@$$HOSTNAME",
90 |         ]
91 |     networks:
92 |       - osmds_internal
93 |
94 |   superset-worker-beat:
95 |     image: *superset-image
96 |     container_name: superset_worker_beat
97 |     command: ["/app/docker/docker-bootstrap.sh", "beat"]
98 |     env_file: docker/.env
99 |     restart: unless-stopped
100 |     depends_on: *superset-depends-on
101 |     user: "root"
102 |     volumes: *superset-volumes
103 |     healthcheck:
104 |       disable: true
105 |     networks:
106 |       - osmds_internal
107 |
108 | volumes:
109 |   superset_home:
110 |     external: false
111 |   db_home:
112 |     external: false
113 |   redis:
114 |     external: false
115 |
116 | networks:
117 |   osmds_internal:
118 |     external: true
--------------------------------------------------------------------------------
/superset/docker/.env:
--------------------------------------------------------------------------------
1 | #
2 | # Licensed to the Apache Software Foundation (ASF) under one or more
3 | # contributor license agreements. See the NOTICE file distributed with
4 | # this work for additional information regarding copyright ownership.
5 | # The ASF licenses this file to You under the Apache License, Version 2.0
6 | # (the "License"); you may not use this file except in compliance with
7 | # the License. You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | COMPOSE_PROJECT_NAME=superset
18 |
19 | # database configurations (do not modify)
20 | DATABASE_DB=superset
21 | DATABASE_HOST=db
22 | DATABASE_PASSWORD=superset
23 | DATABASE_USER=superset
24 | DATABASE_PORT=5432
25 | DATABASE_DIALECT=postgresql
26 |
27 | EXAMPLES_DB=examples
28 | EXAMPLES_HOST=db
29 | EXAMPLES_USER=examples
30 | EXAMPLES_PASSWORD=examples
31 | EXAMPLES_PORT=5432
32 |
33 | # database engine specific environment variables
34 | # change the below if you prefer another database engine
35 | POSTGRES_DB=superset
36 | POSTGRES_USER=superset
37 | POSTGRES_PASSWORD=superset
38 | #MYSQL_DATABASE=superset
39 | #MYSQL_USER=superset
40 | #MYSQL_PASSWORD=superset
41 | #MYSQL_RANDOM_ROOT_PASSWORD=yes
42 |
43 | # Add the mapped in /app/pythonpath_docker which allows devs to override stuff
44 | PYTHONPATH=/app/pythonpath:/app/docker/pythonpath_dev
45 | REDIS_HOST=redis
46 | REDIS_PORT=6379
47 |
48 | SUPERSET_ENV=production
49 | SUPERSET_LOAD_EXAMPLES=yes
50 | SUPERSET_SECRET_KEY=TEST_NON_DEV_SECRET
51 | CYPRESS_CONFIG=false
52 | SUPERSET_PORT=8088
53 | MAPBOX_API_KEY=''
54 |
--------------------------------------------------------------------------------
/superset/docker/README.md:
--------------------------------------------------------------------------------
1 |
19 |
20 | # Getting Started with Superset using Docker
21 |
22 | Docker is an easy way to get started with Superset.
23 |
24 | ## Prerequisites
25 |
26 | 1. [Docker](https://www.docker.com/get-started)
27 | 2. [Docker Compose](https://docs.docker.com/compose/install/)
28 |
29 | ## Configuration
30 |
31 | The `/app/pythonpath` folder is mounted from [`./docker/pythonpath_dev`](./pythonpath_dev)
32 | which contains a base configuration [`./docker/pythonpath_dev/superset_config.py`](./pythonpath_dev/superset_config.py)
33 | intended for use with local development.
34 |
35 | ### Local overrides
36 |
37 | In order to override configuration settings locally, simply make a copy of [`./docker/pythonpath_dev/superset_config_local.example`](./pythonpath_dev/superset_config_local.example)
38 | into `./docker/pythonpath_dev/superset_config_docker.py` (git ignored) and fill in your overrides.
39 |
40 | ### Local packages
41 |
42 | If you want to add Python packages in order to test things like databases locally, you can simply add a local requirements.txt (`./docker/requirements-local.txt`)
43 | and rebuild your Docker stack.
44 |
45 | Steps:
46 |
47 | 1. Create `./docker/requirements-local.txt`
48 | 2. Add your new packages
49 | 3. Rebuild docker-compose
50 | 1. `docker-compose down -v`
51 | 2. `docker-compose up`
52 |
53 | ## Initializing Database
54 |
55 | The database will initialize itself upon startup via the init container ([`superset-init`](./docker-init.sh)). This may take a minute.
56 |
57 | ## Normal Operation
58 |
59 | To run the container, simply run: `docker-compose up`
60 |
61 | After waiting several minutes for Superset initialization to finish, you can open a browser and view [`http://localhost:8088`](http://localhost:8088)
62 | to start your journey.
63 |
64 | ## Developing
65 |
66 | While running, the container server will reload on modification of the Superset Python and JavaScript source code.
67 | Don't forget to reload the page to take the new frontend into account though.
68 |
69 | ## Production
70 |
71 | It is possible to run Superset in non-development mode by using [`docker-compose-non-dev.yml`](../docker-compose-non-dev.yml). This file excludes the volumes needed for development and uses [`./docker/.env-non-dev`](./.env-non-dev) which sets the variable `SUPERSET_ENV` to `production`.
72 |
73 | ## Resource Constraints
74 |
75 | If you are attempting to build on macOS and it exits with 137 you need to increase your Docker resources. See instructions [here](https://docs.docker.com/docker-for-mac/#advanced) (search for memory)
76 |
--------------------------------------------------------------------------------
/superset/docker/docker-bootstrap.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | #
3 | # Licensed to the Apache Software Foundation (ASF) under one or more
4 | # contributor license agreements. See the NOTICE file distributed with
5 | # this work for additional information regarding copyright ownership.
6 | # The ASF licenses this file to You under the Apache License, Version 2.0
7 | # (the "License"); you may not use this file except in compliance with
8 | # the License. You may obtain a copy of the License at
9 | #
10 | # http://www.apache.org/licenses/LICENSE-2.0
11 | #
12 | # Unless required by applicable law or agreed to in writing, software
13 | # distributed under the License is distributed on an "AS IS" BASIS,
14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 | # See the License for the specific language governing permissions and
16 | # limitations under the License.
17 | #
18 |
19 | set -eo pipefail
20 |
21 | REQUIREMENTS_LOCAL="/app/docker/requirements-local.txt"
22 | # If Cypress run – overwrite the password for admin and export env variables
23 | if [ "$CYPRESS_CONFIG" == "true" ]; then
24 | export SUPERSET_CONFIG=tests.integration_tests.superset_test_config
25 | export SUPERSET_TESTENV=true
26 | export SUPERSET__SQLALCHEMY_DATABASE_URI=postgresql+psycopg2://superset:superset@db:5432/superset
27 | fi
28 | #
29 | # Make sure we have dev requirements installed
30 | #
31 | if [ -f "${REQUIREMENTS_LOCAL}" ]; then
32 | echo "Installing local overrides at ${REQUIREMENTS_LOCAL}"
33 | pip install --no-cache-dir -r "${REQUIREMENTS_LOCAL}"
34 | else
35 | echo "Skipping local overrides"
36 | fi
37 |
38 | case "${1}" in
39 | worker)
40 | echo "Starting Celery worker..."
41 | celery --app=superset.tasks.celery_app:app worker -O fair -l INFO
42 | ;;
43 | beat)
44 | echo "Starting Celery beat..."
45 | rm -f /tmp/celerybeat.pid
46 | celery --app=superset.tasks.celery_app:app beat --pidfile /tmp/celerybeat.pid -l INFO -s "${SUPERSET_HOME}"/celerybeat-schedule
47 | ;;
48 | app)
49 | echo "Starting web app (using development server)..."
50 | flask run -p 8088 --with-threads --reload --debugger --host=0.0.0.0
51 | ;;
52 | app-gunicorn)
53 | echo "Starting web app..."
54 | /usr/bin/run-server.sh
55 | ;;
56 | *)
57 | echo "Unknown Operation!!!"
58 | ;;
59 | esac
60 |
--------------------------------------------------------------------------------
/superset/docker/docker-ci.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | #
3 | # Licensed to the Apache Software Foundation (ASF) under one or more
4 | # contributor license agreements. See the NOTICE file distributed with
5 | # this work for additional information regarding copyright ownership.
6 | # The ASF licenses this file to You under the Apache License, Version 2.0
7 | # (the "License"); you may not use this file except in compliance with
8 | # the License. You may obtain a copy of the License at
9 | #
10 | # http://www.apache.org/licenses/LICENSE-2.0
11 | #
12 | # Unless required by applicable law or agreed to in writing, software
13 | # distributed under the License is distributed on an "AS IS" BASIS,
14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 | # See the License for the specific language governing permissions and
16 | # limitations under the License.
17 | #
18 | /app/docker/docker-init.sh
19 |
20 | # TODO: copy config overrides from ENV vars
21 |
22 | # TODO: run celery in detached state
23 | export SERVER_THREADS_AMOUNT=8
24 | # start up the web server
25 |
26 | /usr/bin/run-server.sh
27 |
--------------------------------------------------------------------------------
/superset/docker/docker-entrypoint-initdb.d/examples-init.sh:
--------------------------------------------------------------------------------
1 | # ------------------------------------------------------------------------
2 | # Creates the examples database and respective user. The database location
3 | # and access credentials are defined by environment variables.
4 | # ------------------------------------------------------------------------
5 | set -e
6 |
7 | psql -v ON_ERROR_STOP=1 --username "${POSTGRES_USER}" <<-EOSQL
8 | CREATE USER ${EXAMPLES_USER} WITH PASSWORD '${EXAMPLES_PASSWORD}';
9 | CREATE DATABASE ${EXAMPLES_DB};
10 | GRANT ALL PRIVILEGES ON DATABASE ${EXAMPLES_DB} TO ${EXAMPLES_USER};
11 | EOSQL
12 |
13 | psql -v ON_ERROR_STOP=1 --username "${POSTGRES_USER}" -d "${EXAMPLES_DB}" <<-EOSQL
14 | GRANT ALL ON SCHEMA public TO ${EXAMPLES_USER};
15 | EOSQL
16 |
--------------------------------------------------------------------------------
/superset/docker/docker-frontend.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | #
3 | # Licensed to the Apache Software Foundation (ASF) under one or more
4 | # contributor license agreements. See the NOTICE file distributed with
5 | # this work for additional information regarding copyright ownership.
6 | # The ASF licenses this file to You under the Apache License, Version 2.0
7 | # (the "License"); you may not use this file except in compliance with
8 | # the License. You may obtain a copy of the License at
9 | #
10 | # http://www.apache.org/licenses/LICENSE-2.0
11 | #
12 | # Unless required by applicable law or agreed to in writing, software
13 | # distributed under the License is distributed on an "AS IS" BASIS,
14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 | # See the License for the specific language governing permissions and
16 | # limitations under the License.
17 | #
18 | set -e
19 |
20 | # Packages needed for puppeteer:
21 | apt update
22 | apt install -y chromium
23 |
24 | cd /app/superset-frontend
25 | npm install -f --no-optional --global webpack webpack-cli
26 | npm install -f --no-optional
27 |
28 | echo "Running frontend"
29 | npm run dev
30 |
--------------------------------------------------------------------------------
/superset/docker/docker-init.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | #
3 | # Licensed to the Apache Software Foundation (ASF) under one or more
4 | # contributor license agreements. See the NOTICE file distributed with
5 | # this work for additional information regarding copyright ownership.
6 | # The ASF licenses this file to You under the Apache License, Version 2.0
7 | # (the "License"); you may not use this file except in compliance with
8 | # the License. You may obtain a copy of the License at
9 | #
10 | # http://www.apache.org/licenses/LICENSE-2.0
11 | #
12 | # Unless required by applicable law or agreed to in writing, software
13 | # distributed under the License is distributed on an "AS IS" BASIS,
14 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 | # See the License for the specific language governing permissions and
16 | # limitations under the License.
17 | #
18 | set -e
19 |
20 | #
21 | # Always install local overrides first
22 | #
23 | /app/docker/docker-bootstrap.sh
24 |
25 | STEP_CNT=4
26 |
27 | echo_step() {
28 | cat <