├── .gitignore ├── LICENSE ├── README.md ├── app.py ├── config.yaml ├── data └── .gitkeep └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.png 2 | *.jpg 3 | *.csv 4 | *.zip 5 | venv 6 | 7 | # Byte-compiled / optimized / DLL files 8 | __pycache__/ 9 | *.py[cod] 10 | *$py.class 11 | 12 | # C extensions 13 | *.so 14 | 15 | # Distribution / packaging 16 | .Python 17 | build/ 18 | develop-eggs/ 19 | dist/ 20 | downloads/ 21 | eggs/ 22 | .eggs/ 23 | lib/ 24 | lib64/ 25 | parts/ 26 | sdist/ 27 | var/ 28 | wheels/ 29 | pip-wheel-metadata/ 30 | share/python-wheels/ 31 | *.egg-info/ 32 | .installed.cfg 33 | *.egg 34 | MANIFEST 35 | 36 | # PyInstaller 37 | # Usually these files are written by a python script from a template 38 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 39 | *.manifest 40 | *.spec 41 | 42 | # Installer logs 43 | pip-log.txt 44 | pip-delete-this-directory.txt 45 | 46 | # Unit test / coverage reports 47 | htmlcov/ 48 | .tox/ 49 | .nox/ 50 | .coverage 51 | .coverage.* 52 | .cache 53 | nosetests.xml 54 | coverage.xml 55 | *.cover 56 | *.py,cover 57 | .hypothesis/ 58 | .pytest_cache/ 59 | 60 | # Translations 61 | *.mo 62 | *.pot 63 | 64 | # Django stuff: 65 | *.log 66 | local_settings.py 67 | db.sqlite3 68 | db.sqlite3-journal 69 | 70 | # Flask stuff: 71 | instance/ 72 | .webassets-cache 73 | 74 | # Scrapy stuff: 75 | .scrapy 76 | 77 | # Sphinx documentation 78 | docs/_build/ 79 | 80 | # PyBuilder 81 | target/ 82 | 83 | # Jupyter Notebook 84 | .ipynb_checkpoints 85 | 86 | # IPython 87 | profile_default/ 88 | ipython_config.py 89 | 90 | # pyenv 91 | .python-version 92 | 93 | # pipenv 94 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
95 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 96 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 97 | # install all needed dependencies. 98 | #Pipfile.lock 99 | 100 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 101 | __pypackages__/ 102 | 103 | # Celery stuff 104 | celerybeat-schedule 105 | celerybeat.pid 106 | 107 | # SageMath parsed files 108 | *.sage.py 109 | 110 | # Environments 111 | .env 112 | .venv 113 | env/ 114 | venv/ 115 | ENV/ 116 | env.bak/ 117 | venv.bak/ 118 | 119 | # Spyder project settings 120 | .spyderproject 121 | .spyproject 122 | 123 | # Rope project settings 124 | .ropeproject 125 | 126 | # mkdocs documentation 127 | /site 128 | 129 | # mypy 130 | .mypy_cache/ 131 | .dmypy.json 132 | dmypy.json 133 | 134 | # Pyre type checker 135 | .pyre/ 136 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Kyohei Uto 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # handm_data_visualize_app 2 | Data visualization app built with [Streamlit](https://streamlit.io/) for the H&M competition on Kaggle. 3 | 4 | Competition page: https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/ 5 | 6 | ## Features 7 | ### Customers 8 | You can check the following information by selecting a customer: 9 | - Customer information 10 | - Customer transactions 11 | - Frequently purchased article images 12 | - Recently purchased article images 13 | 14 | ### Articles 15 | You can check the following information by selecting an article: 16 | - Article information 17 | - Article image 18 | 19 | https://user-images.githubusercontent.com/43205304/161410926-0dc5929f-cc9c-4e93-ba77-5ca420e73f5c.mov 20 | 21 | 22 | ## Directory 23 | ``` 24 | . 25 | ├── README.md 26 | ├── app.py 27 | ├── config.yaml 28 | ├── data 29 | │ ├── images 30 | │ ├── articles.csv 31 | │ ├── customers.csv 32 | │ └── transactions_train.csv 33 | └── requirements.txt 34 | ``` 35 | 36 | 37 | ## Environment setup 38 | This section shows how to set up an environment. 39 | If you already have a project for the competition, simply copy `app.py` and `config.yaml` into it and set the data paths in `config.yaml`. 40 | 41 | 1. Clone this repository 42 | ``` 43 | git clone https://github.com/kuto5046/handm_data_visualize_app.git 44 | ``` 45 | 46 | 2. Set up the environment 47 | 48 | First, create a virtual development environment with your preferred tool (Docker, venv, Poetry, etc.). 
49 | 50 | Then, run this command to install the necessary libraries in your environment. 51 | ```shell 52 | pip install -r requirements.txt 53 | ``` 54 | 55 | 3. Prepare data 56 | 57 | Place the competition data in the `data` folder. 58 | The example below uses the Kaggle API, but you can also download the data manually. 59 | ``` 60 | cd data 61 | kaggle competitions download -c h-and-m-personalized-fashion-recommendations 62 | unzip h-and-m-personalized-fashion-recommendations.zip 63 | ``` 64 | 65 | ## Usage 66 | Run this command in your terminal: 67 | ```shell 68 | streamlit run app.py 69 | ``` 70 | Then open the URL printed in the terminal, or `localhost:8501`. 71 | 72 | To change the app settings, edit `config.yaml`: 73 | ```yaml 74 | common: 75 | data_dir: ./data/ 76 | image_dir: ./data/images/ 77 | 78 | customers: 79 | min_purchased_count: 20 # Random selection only picks customers with more purchases than this 80 | num_sample: 10 # Number of images to show 81 | max_display_per_col: 5 # Number of columns in the image grid (images per row) 82 | ``` 83 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import cv2 3 | import os 4 | import pandas as pd 5 | import random 6 | import plotly.express as px 7 | import yaml 8 | 9 | 10 | @st.cache 11 | def read_data(input_dir): 12 | df = pd.read_csv(input_dir + "transactions_train.csv", dtype={'article_id': str, 'sales_channel_id': str}) 13 | df['t_dat'] = pd.to_datetime(df['t_dat']) 14 | # Bucket by week; add 1 so that week 0 is left for the test period 15 | df['n_weeks_ago'] = ((df['t_dat'].max() - df['t_dat']).dt.days // 7) + 1 16 | articles = pd.read_csv(input_dir + "articles.csv", dtype={'article_id': str}) 17 | customers = pd.read_csv(input_dir + "customers.csv") 18 | return df, articles, customers 19 | 20 | @st.cache 21 | def get_sub_data(df, customers, articles, min_purchased_count): 22 | unique_customers = 
customers['customer_id'].unique() 23 | active_unique_customers = df.loc[df.groupby(['customer_id'])['article_id'].transform("count")>min_purchased_count, 'customer_id'].unique() 24 | unique_articles = articles['article_id'].unique() 25 | return unique_customers, active_unique_customers, unique_articles 26 | 27 | 28 | def show_article_transactions(df, target_article_id): 29 | st.markdown("### Article Transactions") 30 | _df = df.loc[df['article_id']==target_article_id] 31 | st.dataframe(_df) 32 | fig = px.bar( 33 | _df.groupby(['n_weeks_ago', 'sales_channel_id'])['customer_id'].count().reset_index().rename(columns={'customer_id':'purchase count'}), 34 | x='n_weeks_ago', y='purchase count', color='sales_channel_id', 35 | range_x=[df['n_weeks_ago'].max(), df['n_weeks_ago'].min()], 36 | color_discrete_map={'1':'blue', '2':'red'} 37 | ) 38 | fig.update_traces(width=1) 39 | st.plotly_chart(fig) 40 | 41 | 42 | def visualize_article(df, articles, unique_articles, image_dir): 43 | target_article_id = select_target_article(unique_articles) 44 | st.markdown("### Article Information") 45 | col1, col2 = st.columns([1, 2]) 46 | show_article_image(target_article_id, image_dir, col1) 47 | show_article_info(articles, target_article_id, col2) 48 | show_article_transactions(df, target_article_id) 49 | 50 | 51 | def select_target_article(unique_articles): 52 | target_article_id = st.selectbox("Target Article ID", unique_articles) 53 | if target_article_id != "" and target_article_id not in unique_articles: 54 | st.error(f"{target_article_id} is not in the dataset. 
Please check the id is correct.") 55 | if st.button("Random Choice"): 56 | target_article_id = random.choice(unique_articles) 57 | return target_article_id 58 | 59 | 60 | def show_article_info(articles, target_article_id, col): 61 | target_article_info = articles[articles['article_id']==target_article_id].T.astype(str) 62 | col.dataframe(target_article_info) 63 | 64 | 65 | def show_article_image(target_article_id, image_dir, col): 66 | filename = str(image_dir + f'{target_article_id[:3]}/{target_article_id}.jpg') 67 | img = cv2.imread(filename)[:,:,::-1] 68 | col.image(img, use_column_width=True) 69 | 70 | 71 | def visualize_customer(df, customers, unique_customers, active_unique_customers, num_sample, max_display_per_col, image_dir): 72 | target_customer_id = select_target_customer(unique_customers, active_unique_customers) 73 | show_customer_info(customers, target_customer_id) 74 | show_customer_transactions(df, target_customer_id) 75 | show_frequently_purchased_articles(df, target_customer_id, num_sample, max_display_per_col, image_dir) 76 | show_recently_purchased_articles(df, target_customer_id, num_sample, max_display_per_col, image_dir) 77 | 78 | def select_target_customer(unique_customers, active_unique_customers): 79 | target_customer_id = st.text_input( 80 | "Target Customer ID", 81 | # value='e805d4c5a1f5b03312e4b98f29b8a61519ecac5eb01435013ad96413856c02dd', 82 | placeholder='Paste the target customer id' 83 | ) 84 | # if not target_customer_id in unique_customers: 85 | # st.error(f"{target_customer_id} is not in the dataset. 
Please check the id is correct.") 86 | if st.button("Random Choice"): 87 | target_customer_id = random.choice(active_unique_customers) 88 | return target_customer_id 89 | 90 | 91 | def show_customer_info(customers, target_customer_id): 92 | target_customer_info = customers[customers['customer_id']==target_customer_id].T.astype(str) 93 | st.markdown("### Customer Information") 94 | st.dataframe(target_customer_info) 95 | 96 | 97 | def show_customer_transactions(df, target_customer_id): 98 | st.markdown("### Customer Transactions") 99 | _df = df.loc[df['customer_id']==target_customer_id] 100 | st.dataframe(_df) 101 | fig = px.bar( 102 | _df.groupby(['n_weeks_ago', 'sales_channel_id'])['article_id'].count().reset_index().rename(columns={'article_id':'purchase count'}), 103 | x='n_weeks_ago', y='purchase count', color='sales_channel_id', 104 | range_x=[df['n_weeks_ago'].max(), df['n_weeks_ago'].min()], 105 | color_discrete_map={'1':'blue', '2':'red'} 106 | ) 107 | fig.update_traces(width=1) 108 | st.plotly_chart(fig) 109 | 110 | 111 | def show_frequently_purchased_articles(df, target_customer_id, num_sample, max_display_per_col, image_dir): 112 | st.markdown("### Frequently Purchased Articles") 113 | purchased_sample = df.loc[df['customer_id']==target_customer_id, 'article_id'].value_counts().head(num_sample) 114 | purchased_articles = purchased_sample.index 115 | purchased_count = purchased_sample.values 116 | 117 | col = st.columns(max_display_per_col) 118 | for i,article_id in enumerate(purchased_articles): 119 | j = i % max_display_per_col 120 | with col[j]: 121 | st.write(f"id: {article_id}") 122 | filename = str(image_dir + f'{article_id[:3]}/{article_id}.jpg') 123 | if os.path.exists(filename): 124 | img = cv2.imread(filename)[:,:,::-1] 125 | st.image(img, use_column_width=True) 126 | st.write(f"count: {purchased_count[i]}") 127 | else: 128 | st.error(f'Skip image because there is no file ({filename})') 129 | 130 | def show_recently_purchased_articles(df, 
target_customer_id, num_sample, max_display_per_col, image_dir): 131 | st.markdown("### Recently Purchased Articles") 132 | recently_purchased_sample = df.loc[df['customer_id']==target_customer_id, ['t_dat', 'article_id']].drop_duplicates().sort_values('t_dat',ascending=False).head(num_sample) 133 | recently_purchased_articles = recently_purchased_sample['article_id'].to_numpy() 134 | recently_purchased_date = recently_purchased_sample['t_dat'].dt.strftime("%Y-%m-%d").to_numpy() 135 | col = st.columns(max_display_per_col) 136 | for i,article_id in enumerate(recently_purchased_articles): 137 | j = i % max_display_per_col 138 | with col[j]: 139 | st.write(f"id:{article_id}") 140 | filename = str(image_dir + f'{article_id[:3]}/{article_id}.jpg') 141 | if os.path.exists(filename): 142 | img = cv2.imread(filename)[:,:,::-1] 143 | st.image(img, use_column_width=True) 144 | st.write(f"date: {recently_purchased_date[i]}") 145 | else: 146 | st.error(f'Skip image because there is no file ({filename})') 147 | 148 | 149 | def main(): 150 | # config 151 | with open('./config.yaml', 'r') as yml: 152 | config = yaml.safe_load(yml) 153 | 154 | data_dir = config["common"]['data_dir'] 155 | image_dir = config["common"]['image_dir'] 156 | min_purchased_count = config["customers"]['min_purchased_count'] 157 | num_sample = config["customers"]['num_sample'] 158 | max_display_per_col = config["customers"]['max_display_per_col'] 159 | 160 | # read data(use cache to reduce loading time) 161 | df, articles, customers = read_data(data_dir) 162 | unique_customers, active_unique_customers, unique_articles = get_sub_data(df, customers, articles, min_purchased_count) 163 | 164 | # select type 165 | analysis_type = st.sidebar.radio("Select analysis type", ["Customers", "Articles"]) 166 | 167 | # visualize 168 | if analysis_type == "Customers": 169 | visualize_customer(df, customers, unique_customers, active_unique_customers, num_sample, max_display_per_col, image_dir) 170 | elif 
analysis_type == "Articles": 171 | visualize_article(df, articles, unique_articles, image_dir) 172 | else: 173 | raise NotImplementedError(f"Unknown analysis type: {analysis_type}") 174 | 175 | 176 | if __name__ == "__main__": 177 | main() -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | common: 2 | data_dir: ./data/ 3 | image_dir: ./data/images/ 4 | 5 | customers: 6 | min_purchased_count: 20 # Random selection only picks customers with more purchases than this 7 | num_sample: 10 # Number of images to show 8 | max_display_per_col: 5 # Number of columns in the image grid (images per row) -------------------------------------------------------------------------------- /data/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/kuto5046/handm_data_visualize_app/55a1b6b9fde31df8597c0bf9d437a09280ac7646/data/.gitkeep -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | streamlit==1.8.1 2 | opencv-python 3 | pyyaml 4 | plotly 5 | pandas --------------------------------------------------------------------------------
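
Two small conventions in `app.py` are easy to misread: the `n_weeks_ago` bucketing and the layout of the image folder. The sketch below restates both with the standard library only, so it can be run without the competition data. The helper names `n_weeks_ago` and `article_image_path`, the example date, and the example article id are illustrative assumptions, not part of the repository.

```python
from datetime import date
from pathlib import Path


def n_weeks_ago(purchase_date: date, last_date: date) -> int:
    """Week bucket counted back from the last date in the data.

    Mirrors ((t_dat.max() - t_dat).dt.days // 7) + 1 from app.py:
    the most recent seven days map to 1, which leaves 0 free for
    the (unseen) test week.
    """
    return (last_date - purchase_date).days // 7 + 1


def article_image_path(image_dir: str, article_id: str) -> Path:
    """Build the image path the same way app.py does.

    Images are grouped in sub-folders named after the first three
    characters of the article id.
    """
    return Path(image_dir) / article_id[:3] / f"{article_id}.jpg"


# Assumed example values, not read from the dataset.
last = date(2020, 9, 22)
print(n_weeks_ago(date(2020, 9, 22), last))  # same day      -> 1
print(n_weeks_ago(date(2020, 9, 16), last))  # 6 days before -> 1
print(n_weeks_ago(date(2020, 9, 15), last))  # 7 days before -> 2

path = article_image_path("./data/images/", "0108775015")
print(path, path.exists())  # exists() is False unless the data is downloaded
```

Checking `path.exists()` before loading matters because `cv2.imread` returns `None` for a missing file, so slicing its result directly, as `show_article_image` does, fails for articles without an image. The two `show_*_purchased_articles` functions already guard against this with `os.path.exists`.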