├── .gitignore
├── LICENSE
├── README.md
├── app.py
├── config.yaml
├── data
    └── .gitkeep
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
  1 | *.png
  2 | *.jpg
  3 | *.csv
  4 | *.zip
  5 | venv 
  6 | 
  7 | # Byte-compiled / optimized / DLL files
  8 | __pycache__/
  9 | *.py[cod]
 10 | *$py.class
 11 | 
 12 | # C extensions
 13 | *.so
 14 | 
 15 | # Distribution / packaging
 16 | .Python
 17 | build/
 18 | develop-eggs/
 19 | dist/
 20 | downloads/
 21 | eggs/
 22 | .eggs/
 23 | lib/
 24 | lib64/
 25 | parts/
 26 | sdist/
 27 | var/
 28 | wheels/
 29 | pip-wheel-metadata/
 30 | share/python-wheels/
 31 | *.egg-info/
 32 | .installed.cfg
 33 | *.egg
 34 | MANIFEST
 35 | 
 36 | # PyInstaller
 37 | #  Usually these files are written by a python script from a template
 38 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 39 | *.manifest
 40 | *.spec
 41 | 
 42 | # Installer logs
 43 | pip-log.txt
 44 | pip-delete-this-directory.txt
 45 | 
 46 | # Unit test / coverage reports
 47 | htmlcov/
 48 | .tox/
 49 | .nox/
 50 | .coverage
 51 | .coverage.*
 52 | .cache
 53 | nosetests.xml
 54 | coverage.xml
 55 | *.cover
 56 | *.py,cover
 57 | .hypothesis/
 58 | .pytest_cache/
 59 | 
 60 | # Translations
 61 | *.mo
 62 | *.pot
 63 | 
 64 | # Django stuff:
 65 | *.log
 66 | local_settings.py
 67 | db.sqlite3
 68 | db.sqlite3-journal
 69 | 
 70 | # Flask stuff:
 71 | instance/
 72 | .webassets-cache
 73 | 
 74 | # Scrapy stuff:
 75 | .scrapy
 76 | 
 77 | # Sphinx documentation
 78 | docs/_build/
 79 | 
 80 | # PyBuilder
 81 | target/
 82 | 
 83 | # Jupyter Notebook
 84 | .ipynb_checkpoints
 85 | 
 86 | # IPython
 87 | profile_default/
 88 | ipython_config.py
 89 | 
 90 | # pyenv
 91 | .python-version
 92 | 
 93 | # pipenv
 94 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 95 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 96 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 97 | #   install all needed dependencies.
 98 | #Pipfile.lock
 99 | 
100 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
101 | __pypackages__/
102 | 
103 | # Celery stuff
104 | celerybeat-schedule
105 | celerybeat.pid
106 | 
107 | # SageMath parsed files
108 | *.sage.py
109 | 
110 | # Environments
111 | .env
112 | .venv
113 | env/
114 | venv/
115 | ENV/
116 | env.bak/
117 | venv.bak/
118 | 
119 | # Spyder project settings
120 | .spyderproject
121 | .spyproject
122 | 
123 | # Rope project settings
124 | .ropeproject
125 | 
126 | # mkdocs documentation
127 | /site
128 | 
129 | # mypy
130 | .mypy_cache/
131 | .dmypy.json
132 | dmypy.json
133 | 
134 | # Pyre type checker
135 | .pyre/
136 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2022 Kyohei Uto
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # handm_data_visualize_app
 2 | Data visualization app built with [Streamlit](https://streamlit.io/) for the H&M competition on Kaggle.
 3 | 
 4 | Competition page: https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/
 5 | 
 6 | ## Features 
 7 | ### Customers
 8 | Select a customer to see the following information:
 9 | - Customer information
10 | - Customer transactions
11 | - Images of frequently purchased articles
12 | - Images of recently purchased articles
13 | 
14 | ### Articles
15 | Select an article to see the following information:
16 | - Article information
17 | - Article image
18 | 
19 | https://user-images.githubusercontent.com/43205304/161410926-0dc5929f-cc9c-4e93-ba77-5ca420e73f5c.mov
20 | 
21 | 
22 | ## Directory
23 | ```
24 | .
25 | ├── README.md
26 | ├── app.py
27 | ├── config.yaml
28 | ├── data
29 | │   ├── images
30 | │   ├── articles.csv
31 | │   ├── customers.csv
32 | │   └── transactions_train.csv
33 | └── requirements.txt
34 | ```
35 | 
36 | 
37 | ## Environment setup
38 | This section shows how to set up an environment.
39 | If you already have a project for the competition, simply copy `app.py` and `config.yaml` into it and set the data paths in `config.yaml`.
40 | 
41 | 1. Clone this repository
42 | ```
43 | git clone https://github.com/kuto5046/handm_data_visualize_app.git
44 | ```
45 | 
46 | 2. Create an environment
47 | 
48 | First, create a virtual environment with the tool of your choice (Docker, venv, Poetry, etc.).
49 | 
50 | Then run this command to install the required libraries in that environment.
51 | ```shell
52 | pip install -r requirements.txt
53 | ```
54 | 
55 | 3. Prepare data 
56 | 
57 | Place the competition data in the `data` folder.
58 | The example below uses the Kaggle API, but you can also download the data manually.
59 | ```
60 | cd data
61 | kaggle competitions download -c h-and-m-personalized-fashion-recommendations
62 | unzip h-and-m-personalized-fashion-recommendations.zip
63 | ```
64 | 
65 | ## Usage
66 | Run this command in your terminal:
67 | ```shell
68 | streamlit run app.py
69 | ```
70 | Then open the URL printed in the terminal, or `localhost:8501`.
71 | 
72 | To change the app settings, edit `config.yaml`:
73 | ```yaml
74 | common:
75 |   data_dir: ./data/
76 |   image_dir: ./data/images/
77 | 
78 | customers:
79 |   min_purchased_count: 20  # Random customer selection only picks customers with more purchases than this
80 |   num_sample: 10  # Number of images to show
81 |   max_display_per_col: 5  # Number of columns in the image grid (maximum images per row)
82 | ```
83 | 
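The settings above feed directly into `app.py`, which builds file paths by plain string concatenation (`input_dir + "transactions_train.csv"`), so the trailing slash on `data_dir` and `image_dir` matters. Below is a minimal sketch of a sanity check that could run right after the file is loaded. `check_config` is a hypothetical helper, not part of this repository, and the checks themselves are assumptions; it takes the already-parsed dictionary so it needs no third-party imports.

```python
def check_config(config):
    """Return a list of human-readable problems found in the parsed config.

    Hypothetical helper: key names mirror config.yaml, the checks are assumptions.
    """
    problems = []
    required = {
        "common": ["data_dir", "image_dir"],
        "customers": ["min_purchased_count", "num_sample", "max_display_per_col"],
    }
    # Every section and key listed in config.yaml must be present.
    for section, keys in required.items():
        values = config.get(section)
        if not isinstance(values, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            if key not in values:
                problems.append(f"missing key: {section}.{key}")

    # app.py concatenates directory and file name, so directories need a trailing '/'.
    common = config.get("common")
    if isinstance(common, dict):
        for key in ("data_dir", "image_dir"):
            path = common.get(key)
            if isinstance(path, str) and not path.endswith("/"):
                problems.append(f"common.{key} should end with '/'")
    return problems
```

Calling `check_config(yaml.safe_load(...))` before reading the CSV files would surface a mistyped path early instead of as a file-not-found error inside pandas.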


--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
  1 | import streamlit as st
  2 | import cv2
  3 | import os 
  4 | import pandas as pd 
  5 | import random
  6 | import plotly.express as px
  7 | import yaml
  8 | 
  9 | 
 10 | @st.cache
 11 | def read_data(input_dir):
 12 |     df = pd.read_csv(input_dir + "transactions_train.csv", dtype={'article_id': str, 'sales_channel_id': str})
 13 |     df['t_dat'] = pd.to_datetime(df['t_dat'])
 14 |     # Convert to weeks; add 1 so that the test period corresponds to week 0
 15 |     df['n_weeks_ago'] = ((df['t_dat'].max() - df['t_dat']).dt.days // 7) + 1
 16 |     articles = pd.read_csv(input_dir + "articles.csv", dtype={'article_id': str})
 17 |     customers = pd.read_csv(input_dir + "customers.csv")
 18 |     return df, articles, customers 
 19 | 
 20 | @st.cache
 21 | def get_sub_data(df, customers, articles, min_purchased_count):
 22 |     unique_customers = customers['customer_id'].unique()
 23 |     active_unique_customers = df.loc[df.groupby(['customer_id'])['article_id'].transform("count")>min_purchased_count, 'customer_id'].unique()
 24 |     unique_articles = articles['article_id'].unique()
 25 |     return unique_customers, active_unique_customers, unique_articles
 26 | 
 27 | 
 28 | def show_article_transactions(df, target_article_id):
 29 |     st.markdown("### Article Transactions")
 30 |     _df = df.loc[df['article_id']==target_article_id]
 31 |     st.dataframe(_df)
 32 |     fig = px.bar(
 33 |         _df.groupby(['n_weeks_ago', 'sales_channel_id'])['customer_id'].count().reset_index().rename(columns={'customer_id':'purchase count'}), 
 34 |         x='n_weeks_ago', y='purchase count', color='sales_channel_id', 
 35 |         range_x=[df['n_weeks_ago'].max(), df['n_weeks_ago'].min()],
 36 |         color_discrete_map={'1':'blue', '2':'red'}
 37 |         )
 38 |     fig.update_traces(width=1)
 39 |     st.plotly_chart(fig)
 40 | 
 41 | 
 42 | def visualize_article(df, articles, unique_articles, image_dir):
 43 |     target_article_id = select_target_article(unique_articles)
 44 |     st.markdown("### Article Information")
 45 |     col1, col2 = st.columns([1, 2])
 46 |     show_article_image(target_article_id, image_dir, col1)
 47 |     show_article_info(articles, target_article_id, col2)
 48 |     show_article_transactions(df, target_article_id)
 49 | 
 50 | 
 51 | def select_target_article(unique_articles):
 52 |     target_article_id = st.selectbox("Target Article ID", unique_articles)
 53 |     if target_article_id != "" and target_article_id not in unique_articles:
 54 |         st.error(f"{target_article_id} is not in the dataset. Please check the id is correct.")
 55 |     if st.button("Random Choice"):
 56 |         target_article_id = random.choice(unique_articles)
 57 |     return target_article_id
 58 | 
 59 | 
 60 | def show_article_info(articles, target_article_id, col):
 61 |     target_article_info = articles[articles['article_id']==target_article_id].T.astype(str)
 62 |     col.dataframe(target_article_info)
 63 | 
 64 | 
 65 | def show_article_image(target_article_id, image_dir, col):
 66 |     filename = str(image_dir + f'{target_article_id[:3]}/{target_article_id}.jpg')
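 67 |     if not os.path.exists(filename):  # guard against a missing image file, as the customer views do
 68 |         return col.error(f'Skip image because there is no file ({filename})')
 69 |     col.image(cv2.imread(filename)[:,:,::-1], use_column_width=True)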
 70 | 
 71 | def visualize_customer(df, customers, unique_customers, active_unique_customers, num_sample, max_display_per_col, image_dir):
 72 |     target_customer_id = select_target_customer(unique_customers, active_unique_customers)
 73 |     show_customer_info(customers, target_customer_id)
 74 |     show_customer_transactions(df, target_customer_id)
 75 |     show_frequently_purchased_articles(df, target_customer_id, num_sample, max_display_per_col, image_dir)
 76 |     show_recently_purchased_articles(df, target_customer_id, num_sample, max_display_per_col, image_dir)
 77 | 
 78 | def select_target_customer(unique_customers, active_unique_customers):
 79 |     target_customer_id = st.text_input(
 80 |         "Target Customer ID", 
 81 |         # value='e805d4c5a1f5b03312e4b98f29b8a61519ecac5eb01435013ad96413856c02dd',
 82 |         placeholder='Paste the target customer id'
 83 |         )
 84 |     # if not target_customer_id in unique_customers:
 85 |     #     st.error(f"{target_customer_id} is not in the dataset. Please check the id is correct.")
 86 |     if st.button("Random Choice"):
 87 |         target_customer_id = random.choice(active_unique_customers)
 88 |     return target_customer_id
 89 | 
 90 | 
 91 | def show_customer_info(customers, target_customer_id):
 92 |     target_customer_info = customers[customers['customer_id']==target_customer_id].T.astype(str)
 93 |     st.markdown("### Customer Information")
 94 |     st.dataframe(target_customer_info)
 95 | 
 96 | 
 97 | def show_customer_transactions(df, target_customer_id):
 98 |     st.markdown("### Customer Transactions")
 99 |     _df = df.loc[df['customer_id']==target_customer_id]
100 |     st.dataframe(_df)
101 |     fig = px.bar(
102 |         _df.groupby(['n_weeks_ago', 'sales_channel_id'])['article_id'].count().reset_index().rename(columns={'article_id':'purchase count'}), 
103 |         x='n_weeks_ago', y='purchase count', color='sales_channel_id', 
104 |         range_x=[df['n_weeks_ago'].max(), df['n_weeks_ago'].min()],
105 |         color_discrete_map={'1':'blue', '2':'red'}
106 |         )
107 |     fig.update_traces(width=1)
108 |     st.plotly_chart(fig)
109 | 
110 | 
111 | def show_frequently_purchased_articles(df, target_customer_id, num_sample, max_display_per_col, image_dir):
112 |     st.markdown("### Frequently Purchased Articles")
113 |     purchased_sample = df.loc[df['customer_id']==target_customer_id, 'article_id'].value_counts().head(num_sample)
114 |     purchased_articles = purchased_sample.index
115 |     purchased_count = purchased_sample.values
116 | 
117 |     col = st.columns(max_display_per_col)
118 |     for i,article_id in enumerate(purchased_articles):
119 |         j = i % max_display_per_col
120 |         with col[j]:
121 |             st.write(f"id: {article_id}")
122 |             filename = str(image_dir + f'{article_id[:3]}/{article_id}.jpg')
123 |             if os.path.exists(filename):
124 |                 img = cv2.imread(filename)[:,:,::-1]
125 |                 st.image(img, use_column_width=True)
126 |                 st.write(f"count: {purchased_count[i]}")
127 |             else:
128 |                 st.error(f'Skip image because there is no file ({filename})')
129 | 
130 | def show_recently_purchased_articles(df, target_customer_id, num_sample, max_display_per_col, image_dir):
131 |     st.markdown("### Recently Purchased Articles")
132 |     recently_purchased_sample = df.loc[df['customer_id']==target_customer_id, ['t_dat', 'article_id']].drop_duplicates().sort_values('t_dat',ascending=False).head(num_sample)
133 |     recently_purchased_articles = recently_purchased_sample['article_id'].to_numpy()
134 |     recently_purchased_date = recently_purchased_sample['t_dat'].dt.strftime("%Y-%m-%d").to_numpy()
135 |     col = st.columns(max_display_per_col)
136 |     for i,article_id in enumerate(recently_purchased_articles):
137 |         j = i % max_display_per_col
138 |         with col[j]:
139 |             st.write(f"id:{article_id}")
140 |             filename = str(image_dir + f'{article_id[:3]}/{article_id}.jpg')
141 |             if os.path.exists(filename):
142 |                 img = cv2.imread(filename)[:,:,::-1]
143 |                 st.image(img, use_column_width=True)
144 |                 st.write(f"date: {recently_purchased_date[i]}")
145 |             else:
146 |                 st.error(f'Skip image because there is no file ({filename})')
147 | 
148 |  
149 | def main():
150 |     # config
151 |     with open('./config.yaml', 'r') as yml:
152 |         config = yaml.safe_load(yml)
153 | 
154 |     data_dir = config["common"]['data_dir'] 
155 |     image_dir = config["common"]['image_dir']
156 |     min_purchased_count = config["customers"]['min_purchased_count']
157 |     num_sample = config["customers"]['num_sample']
158 |     max_display_per_col = config["customers"]['max_display_per_col']
159 | 
160 |     # read data(use cache to reduce loading time)
161 |     df, articles, customers = read_data(data_dir)
162 |     unique_customers, active_unique_customers, unique_articles = get_sub_data(df, customers, articles, min_purchased_count)
163 | 
164 |     # select type
165 |     analysis_type = st.sidebar.radio("Select analysis type", ["Customers", "Articles"])
166 | 
167 |     # visualize
168 |     if analysis_type == "Customers":
169 |         visualize_customer(df, customers, unique_customers, active_unique_customers, num_sample, max_display_per_col, image_dir)
170 |     elif analysis_type=="Articles":
171 |         visualize_article(df, articles, unique_articles, image_dir)
172 |     else:
173 |         raise NotImplementedError(f"Unknown analysis type: {analysis_type}")
174 | 
175 | 
176 | if __name__ == "__main__":
177 |     main()
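
`read_data` buckets every transaction into weeks counted back from the latest date in `transactions_train.csv`; the `+ 1` keeps week 0 free for the test period. Below is a stdlib-only sketch of the same arithmetic. The function name `n_weeks_ago` and the dates are illustrative assumptions, not values read from the dataset.

```python
from datetime import date


def n_weeks_ago(t_dat, latest):
    """Weeks between t_dat and the latest transaction date, counted from 1."""
    # Same arithmetic as the pandas expression in read_data:
    # whole days difference, floor-divided into weeks, shifted by one.
    return (latest - t_dat).days // 7 + 1


latest = date(2020, 9, 22)  # illustrative "latest transaction" date
same_day = n_weeks_ago(date(2020, 9, 22), latest)
six_days_before = n_weeks_ago(date(2020, 9, 16), latest)
seven_days_before = n_weeks_ago(date(2020, 9, 15), latest)
```

Transactions from the final seven days all land in week 1, and the bucket number increases by one for each further seven days back.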


--------------------------------------------------------------------------------
/config.yaml:
--------------------------------------------------------------------------------
1 | common:
2 |   data_dir: ./data/
3 |   image_dir: ./data/images/
4 | 
5 | customers:
6 |   min_purchased_count: 20  # Random customer selection only picks customers with more purchases than this
7 |   num_sample: 10  # Number of images to show
8 |   max_display_per_col: 5  # Number of columns in the image grid (maximum images per row)
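
`num_sample` and `max_display_per_col` interact through the `i % max_display_per_col` indexing in `app.py`: `st.columns(max_display_per_col)` creates that many columns, and images are dealt out across them round-robin. Below is a small sketch of the resulting layout without Streamlit; `assign_columns` is a hypothetical name used only for illustration.

```python
def assign_columns(items, max_display_per_col):
    """Deal items out round-robin, mirroring `col[i % max_display_per_col]`."""
    columns = [[] for _ in range(max_display_per_col)]
    for i, item in enumerate(items):
        columns[i % max_display_per_col].append(item)
    return columns


# With the default settings: 10 samples spread over 5 columns.
layout = assign_columns(list(range(10)), 5)
```

With these defaults each of the five columns holds two images, so the grid is five images wide and two rows deep.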


--------------------------------------------------------------------------------
/data/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kuto5046/handm_data_visualize_app/55a1b6b9fde31df8597c0bf9d437a09280ac7646/data/.gitkeep


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | streamlit==1.8.1
2 | opencv-python
3 | pyyaml
4 | plotly
5 | pandas


--------------------------------------------------------------------------------