├── README.md
├── data
│   ├── README.md
│   ├── processed
│   │   └── README.md
│   └── raw
│       └── README.md
├── model
│   └── README.md
├── notebook
│   ├── README.md
│   ├── eda
│   │   └── README.md
│   ├── evaluation
│   │   └── README.md
│   ├── modeling
│   │   └── README.md
│   └── poc
│       └── README.md
├── src
│   ├── README.md
│   ├── modeling
│   │   └── README.md
│   ├── preparation
│   │   └── README.md
│   └── processing
│       └── README.md
└── test
    └── README.md

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | After working on data science projects, I found that project structure is important for team communication and project delivery. You may visit [my blog](https://towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600) for more detail.
2 | 
3 | Here is an explanation of the folder structure:
4 | - src: Stores source code (Python, R, etc.) that serves multiple scenarios. During data exploration and model training, we transform data for a particular purpose, and we have to apply the same transformation during online prediction as well. It is therefore better to keep this code separate from notebooks so it can serve both purposes.
5 | - test: In R&D, data science focuses on building models rather than on making sure everything works in unexpected scenarios. That becomes a problem once the model is deployed behind an API. Test cases also guard against backward-compatibility issues, although they take time to implement.
6 | - model: Folder for storing binary model files (JSON or other formats) for local use.
7 | - data: Folder for storing subset data for experiments. It includes both raw data and processed data for temporary use.
8 | - notebook: Stores all notebooks, including those for the EDA and modeling stages.

--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------
1 | - raw: Stores the raw result generated by the code in the "preparation" folder.
My practice is keeping a local subset copy rather than retrieving data from the remote data store again and again. It guarantees that you have a static dataset for the rest of the work. Furthermore, it isolates you from data platform instability and network latency issues.
2 | - processed: To shorten model training time, it is a good idea to persist processed data. It should be generated by the code in the "processing" folder.

--------------------------------------------------------------------------------
/data/processed/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing the processed data

--------------------------------------------------------------------------------
/data/raw/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing the raw data

--------------------------------------------------------------------------------
/model/README.md:
--------------------------------------------------------------------------------
1 | Store intermediate results here only. For the long term, models should be stored separately in a model repository. Besides the binary model, you should also store model metadata, such as the training date and the size of the training data.

--------------------------------------------------------------------------------
/notebook/README.md:
--------------------------------------------------------------------------------
1 | - eda: Exploratory Data Analysis (aka Data Exploration) is a step for exploring what you have for later steps. In the short term, these notebooks should show what you explored; a typical example is showing data distributions. In the long term, the findings should be stored in a centralized place.
2 | - poc: Sometimes you have to do a PoC (Proof of Concept). It can be kept here for temporary purposes.
3 | - modeling: Notebooks containing the core part of the work, including model building and training.
4 | - evaluation: Besides modeling, evaluation is another important step that many people are not aware of. To earn the product team's trust, we have to demonstrate how good the model is.

--------------------------------------------------------------------------------
/notebook/eda/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing the exploratory data analysis

--------------------------------------------------------------------------------
/notebook/evaluation/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing the model evaluation

--------------------------------------------------------------------------------
/notebook/modeling/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing how the model is built

--------------------------------------------------------------------------------
/notebook/poc/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing the PoC

--------------------------------------------------------------------------------
/src/README.md:
--------------------------------------------------------------------------------
1 | - preparation: Data ingestion, such as retrieving data from CSV files, relational databases, NoSQL stores, Hadoop, etc. Since we have to retrieve data from multiple sources all the time, it is better to have dedicated functions for data retrieval.
2 | - processing: Data transformation, as the source data does not always fit what the model needs. Ideally we would receive clean data, but I never do. You may say that the data engineering team should help with data transformation; however, we may not know what we need until we have studied the data. One important requirement is that both offline training and online prediction use the same pipeline, to reduce misalignment.
3 | - modeling: Model building, such as tackling a classification problem. It should include not only the model training part but also the evaluation part. We also have to think about multi-model scenarios; a typical use case is an ensemble model, such as combining a Logistic Regression model and a Neural Network model.

--------------------------------------------------------------------------------
/src/modeling/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing the flow of building the model

--------------------------------------------------------------------------------
/src/preparation/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing the flow of data preparation

--------------------------------------------------------------------------------
/src/processing/README.md:
--------------------------------------------------------------------------------
1 | Placeholder for describing the flow of data processing

--------------------------------------------------------------------------------
/test/README.md:
--------------------------------------------------------------------------------
1 | Test cases for asserting the Python source code. They make sure no bugs are introduced when the code changes. Rather than relying on manual testing, automated testing is an essential piece of a successful project. Teammates will have the confidence to modify code, knowing that the test cases validate that a change does not break previous usage.
--------------------------------------------------------------------------------
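
The kind of automated test described in /test/README.md could look like the minimal sketch below. It assumes a hypothetical `scale_to_unit_range` transform living under the "processing" folder; the function name and behavior are illustrative only, not part of this repository.

```python
# Minimal sketch of an automated test for a hypothetical processing
# function (names are illustrative, not from this repository).

def scale_to_unit_range(values):
    """Scale a list of numbers linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant input: avoid division by zero, map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def test_scale_to_unit_range():
    # Min maps to 0.0, max maps to 1.0, midpoint to 0.5.
    assert scale_to_unit_range([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_constant_input_does_not_divide_by_zero():
    assert scale_to_unit_range([5, 5]) == [0.0, 0.0]


if __name__ == "__main__":
    test_scale_to_unit_range()
    test_constant_input_does_not_divide_by_zero()
    print("all tests passed")
```

With `test_`-prefixed functions like these, a runner such as pytest can discover and execute them automatically, so the same checks run on every code change.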