├── README.md
├── input
│   └── README.md
├── insight_testsuite
│   ├── run_tests.sh
│   └── tests
│       └── test_1
│           ├── README.md
│           ├── input
│           │   ├── order_products.csv
│           │   └── products.csv
│           └── output
│               └── report.csv
├── output
│   └── README.md
├── run.sh
└── src
    └── README.md


/README.md:
--------------------------------------------------------------------------------
  1 | # Purchase-Analytics
  2 | 
  3 | ## Table of Contents
  4 | 1. [Problem](README.md#problem)
  5 | 1. [Steps to submit your solution](README.md#steps-to-submit-your-solution)
  6 | 1. [Input Datasets](README.md#input-datasets)
  7 | 1. [Expected Output](README.md#expected-output)
  8 | 1. [Instructions](README.md#instructions)
  9 | 1. [Tips on getting an interview](README.md#tips-on-getting-an-interview)
 10 | 1. [Questions?](README.md#questions)
 11 | 
 12 | ## Problem
 13 | 
 14 | Instacart has published a [dataset](https://www.instacart.com/datasets/grocery-shopping-2017) containing 3 million Instacart orders.
 15 | 
 16 | **For this challenge, we want you to calculate, for each department, the number of times a product was requested, number of times a product was requested for the first time and a ratio of those two numbers.**
 17 | 
 18 | 
 19 | ## Steps to submit your solution
 20 | * To submit your entry please use the link you received in your coding challenge invite email
 21 | * You will only be able to submit through the link one time
 22 | * Do NOT attach a file - we will not admit solutions which are attached files
 23 | * Do NOT send your solution over an email - We are unable to accept coding challenges that way
 24 | 
 25 | ### Creating private repositories
 26 | To avoid plagiarism and any wrongdoing, we ask you to submit your code in a private repository. Both GitHub and Bitbucket offer unlimited private repositories for free.
 27 | * Create a private repository on GitHub or Bitbucket with the given repository structure. When you are ready to submit, share it with us as follows.
 28 | * Add "insight-cc-bot" as a collaborator in your project.
 29 |   * [How to add collaborators on GitHub?](https://help.github.com/articles/inviting-collaborators-to-a-personal-repository/)
 30 |   * [How to add users and groups as collaborators in Bitbucket?](https://confluence.atlassian.com/bitbucket/grant-repository-access-to-users-and-groups-221449716.html)
 31 | * **We will NOT be grading submissions we do not have access to.**
 32 | 
 33 | ### Submitting a link to your repository
 34 | * Use the submission box to enter the link to your GitHub or Bitbucket repo ONLY
 35 | * Link to the specific repo for this project, not your general profile
 36 | * Put any comments in the README inside your project repo, not in the submission box
 37 | 
 38 | 
 39 | ## Input Datasets
 40 | 
 41 | For this challenge, we have two separate input data sources, `order_products.csv` and `products.csv`.
 42 | 
 43 | You can assume each line of the file `order_products.csv` holds data on one request. The file contains data of the form
 44 | 
 45 | ```
 46 | order_id,product_id,add_to_cart_order,reordered
 47 | 2,33120,1,1
 48 | 2,28985,2,1
 49 | 2,9327,3,0
 50 | 2,45918,4,1
 51 | 3,17668,1,1
 52 | 3,46667,2,1
 53 | 3,17461,4,1
 54 | 3,32665,3,1
 55 | 4,46842,1,0
 56 | ```
 57 | 
 58 | where
 59 | 
 60 | * `order_id`: unique identifier of order
 61 | * `product_id`: unique identifier of product
 62 | * `add_to_cart_order`: sequence order in which each product was added to shopping cart
 63 | * `reordered`: flag indicating if the product has been ordered by this user at some point in the past. The field is `1` if the user has ordered it in the past and `0` if the user has not. While data engineers should validate their data, for the purposes of this challenge, you can take the `reordered` flag at face value and assume it accurately reflects whether the product has been ordered by the user before.
 64 | 
 65 | The file `products.csv` holds data on every product, and looks something like this:
 66 | 
 67 | ```
 68 | product_id,product_name,aisle_id,department_id
 69 | 9327,Garlic Powder,104,13
 70 | 17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
 71 | 17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
 72 | 28985,Michigan Organic Kale,83,4
 73 | 32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
 74 | 33120,Organic Egg Whites,86,16
 75 | 45918,Coconut Butter,19,13
 76 | 46667,Organic Ginger Root,83,4
 77 | 46842,Plain Pre-Sliced Bagels,93,3
 78 | ```
 79 | where
 80 | 
 81 | * `product_id`: unique identifier of the product
 82 | * `product_name`: name of the product
 83 | * `aisle_id`: identifier of aisle in which product is located
 84 | * `department_id`: identifier of department
 85 | 
 86 | 
 87 | ## Expected Output
 88 | 
 89 | Given the two input files in the input directory, your program should create an output file, `report.csv`, in the output directory that, for each department, surfaces the following statistics:
 90 | 
 91 | `number_of_orders`. How many times was a product requested from this department? (If the same product was ordered multiple times, we count it as multiple requests)
 92 | 
 93 | `number_of_first_orders`. How many of those requests contain products ordered for the first time?
 94 | 
 95 | `percentage`. What is the percentage of requests containing products ordered for the first time compared with the total number of requests for products from that department? (e.g., `number_of_first_orders` divided by `number_of_orders`)
 96 | 
 97 | For example, with the input files given above, the correct output file is
 98 | 
 99 | ```
100 | department_id,number_of_orders,number_of_first_orders,percentage
101 | 3,2,1,0.50
102 | 4,2,0,0.00
103 | 12,1,0,0.00
104 | 13,2,1,0.50
105 | 16,2,0,0.00
106 | ```
107 | 
108 | *The output file should adhere to the following rules*
109 | 
110 | - It is listed in ascending order by `department_id`
111 | - A `department_id` should be listed only if `number_of_orders` is greater than `0`
112 | - `percentage` should be rounded to the second decimal
113 | 
114 | The example input and output files are provided in the `insight_testsuite/tests/test_1/input` and `insight_testsuite/tests/test_1/output` folders, respectively.
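
The counting rules above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not a reference solution: the function name `build_report` is hypothetical, requests whose `product_id` does not appear in `products.csv` are skipped (the challenge does not say how to treat them), and `percentage` is formatted with Python's `.2f` format specifier.

```python
import csv
import io
from collections import defaultdict


def build_report(order_rows, product_rows):
    """Return (department_id, number_of_orders, number_of_first_orders, percentage) tuples."""
    # Map each product_id to its department_id.
    department_of = {row["product_id"]: int(row["department_id"]) for row in product_rows}

    totals = defaultdict(int)
    first_orders = defaultdict(int)
    for row in order_rows:
        department = department_of.get(row["product_id"])
        if department is None:
            continue  # assumption: ignore requests for unknown products
        totals[department] += 1
        if row["reordered"] == "0":
            first_orders[department] += 1

    # Only departments with at least one request appear in `totals`,
    # and sorting the integer ids gives ascending department_id order.
    return [
        (dept, totals[dept], first_orders[dept], f"{first_orders[dept] / totals[dept]:.2f}")
        for dept in sorted(totals)
    ]


# Tiny in-memory example; a real run would open the files in the input directory.
products = csv.DictReader(io.StringIO(
    "product_id,product_name,aisle_id,department_id\n"
    "9327,Garlic Powder,104,13\n"
    "45918,Coconut Butter,19,13\n"
))
orders = csv.DictReader(io.StringIO(
    "order_id,product_id,add_to_cart_order,reordered\n"
    "2,9327,3,0\n"
    "2,45918,4,1\n"
))
print(build_report(orders, products))  # [(13, 2, 1, '0.50')]
```

Reading with the `csv` module rather than splitting on commas matters for the full dataset, where a quoted product name may itself contain a comma.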
115 | 
116 | ## Instructions
117 | 
118 | We designed this coding challenge to assess your coding skills and your understanding of computer science fundamentals. Both are prerequisites for becoming a data engineer. To solve this challenge you may pick a programming language of your choice (preferably Python, Scala, Java, or C/C++ because they are commonly used and will help us better assess you), but you are only allowed to use the default data structures that come with that programming language (you may use I/O libraries). For example, you can code in Python, but you should not use Pandas or any other external libraries.
119 | 
120 | ***The objective here is to see if you can implement the solution using basic data structure building blocks and software engineering best practices (by writing clean, modular, and well-tested code).***
121 | 
122 | 
123 | # Tips on getting an interview
124 | 
125 | ## Writing clean, scalable and well-tested code
126 | 
127 | As a data engineer, it’s important that you write clean, well-documented code that scales for a large amount of data. For this reason, it’s important to ensure that your solution works well for a large number of records, rather than just the above example.
128 | 
129 | [Here](https://www.instacart.com/datasets/grocery-shopping-2017) you can find large datasets to test your code (see [here](https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b) for its data dictionary).
130 | You can test your code using the files `order_products_train.csv` and `order_products_prior.csv` together with the file `products.csv`.
131 | Note that we will use these datasets to test the full functionality of your code, along with other tests.
132 | 
133 | It's also important to use software engineering best practices like unit tests, especially since data is not always clean and predictable.
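
As one way to act on this, the sketch below pairs a small row validator with `unittest` cases that cover both clean and malformed input. The helper name `parse_order_row` and the choice to return `None` for bad rows are assumptions made for illustration; the challenge does not prescribe how invalid rows should be handled.

```python
import unittest


def parse_order_row(row):
    """Return (product_id, is_first_order), or None if the row is malformed."""
    if len(row) != 4:
        return None
    product_id, reordered = row[1].strip(), row[3].strip()
    if not product_id.isdigit() or reordered not in ("0", "1"):
        return None
    return product_id, reordered == "0"


class ParseOrderRowTest(unittest.TestCase):
    def test_first_order(self):
        self.assertEqual(parse_order_row(["2", "9327", "3", "0"]), ("9327", True))

    def test_reorder(self):
        self.assertEqual(parse_order_row(["2", "33120", "1", "1"]), ("33120", False))

    def test_short_row_is_rejected(self):
        self.assertIsNone(parse_order_row(["2", "33120", "1"]))

    def test_unexpected_flag_is_rejected(self):
        self.assertIsNone(parse_order_row(["2", "33120", "1", "yes"]))
```

Saved as a module, these tests can be run with `python -m unittest`.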
134 | 
135 | Before submitting your solution you should summarize your approach and run instructions (if any) in your `README`.
136 | 
137 | You may write your solution in any mainstream programming language, such as C, C++, C#, Go, Java, Python, Ruby, or Scala. Once completed, submit a link to your GitHub or Bitbucket repo with your source code.
138 | 
139 | In addition to the source code, the top-most directory of your repo must include the `input` and `output` directories, and a shell script named `run.sh` that compiles and runs the program(s) that implement(s) the required features.
140 | 
141 | If your solution requires additional libraries, environments, or dependencies, you must specify these in your `README` documentation. See the figure below for the required structure of the top-most directory in your repo, or simply clone this repo.
142 | 
143 | ## Repo directory structure
144 | 
145 | The directory structure for your repo should look like this:
146 | 
147 |     ├── README.md
148 |     ├── run.sh
149 |     ├── src
150 |     │   └── purchase_analytics.py
151 |     ├── input
152 |     │   ├── products.csv
153 |     │   └── order_products.csv
154 |     ├── output
155 |     │   └── report.csv
156 |     └── insight_testsuite
157 |         ├── run_tests.sh
158 |         └── tests
159 |             ├── test_1
160 |             │   ├── input
161 |             │   │   ├── products.csv
162 |             │   │   └── order_products.csv
163 |             │   └── output
164 |             │       └── report.csv
165 |             └── your-own-test_1
166 |                 ├── input
167 |                 │   ├── your-own-products.csv
168 |                 │   └── your-own-order_products.csv
169 |                 └── output
170 |                     └── report.csv
171 | 
172 | **Don't fork this repo** and don't use this `README` instead of your own. The content of `src` does not need to be a single file called `purchase_analytics.py`, which is only an example. Instead, you should include your own source files and give them expressive names.
173 | 
174 | ## Testing your directory structure and output format
175 | 
176 | To make sure that your code has the correct directory structure and that the format of the output files is correct, we have included a test script called `run_tests.sh` in the `insight_testsuite` folder.
177 | 
178 | The tests are stored simply as text files under the `insight_testsuite/tests` folder. Each test should have a separate folder with an `input` folder for `products.csv` and `order_products.csv` and an `output` folder for `report.csv`.
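
Adding your own test therefore amounts to creating one more folder with that layout. The helper below is a hypothetical convenience, not part of the provided test suite; the name `create_test_case` and its parameters are assumptions made for illustration.

```python
from pathlib import Path


def create_test_case(tests_dir, name, order_products, products, expected_report):
    """Write one test folder in the layout described above and return its path."""
    case_dir = Path(tests_dir) / name
    files = {
        case_dir / "input" / "order_products.csv": order_products,
        case_dir / "input" / "products.csv": products,
        case_dir / "output" / "report.csv": expected_report,
    }
    for path, text in files.items():
        # Create input/ and output/ on demand, then write the CSV text.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)
    return case_dir
```

For example, `create_test_case("insight_testsuite/tests", "test_2", orders_csv, products_csv, report_csv)` would add a `test_2` folder next to `test_1`, where the three CSV arguments are strings you supply.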
179 | 
180 | You can run the test with the following command from within the `insight_testsuite` folder:
181 | 
182 |     insight_testsuite~$ ./run_tests.sh
183 | 
184 | On a failed test, the output of `run_tests.sh` should look like:
185 | 
186 |     [FAIL]: test_1
187 |     [Thu Mar 30 16:28:01 PDT 2017] 0 of 1 tests passed
188 | 
189 | On success:
190 | 
191 |     [PASS]: test_1
192 |     [Thu Mar 30 16:25:57 PDT 2017] 1 of 1 tests passed
193 | 
194 | 
195 | 
196 | One test has been provided as a way to check your formatting and simulate how we will be running tests when you submit your solution. We urge you to write your own additional tests. `test_1` is only intended to alert you if the directory structure or the output for this test is incorrect.
197 | 
198 | Your submission must pass at least the provided test in order to pass the coding challenge.
199 | 
200 | We previously offered a website (no longer available) that let you simulate the environment in which we test your code. It was primarily tested on Python code but could also be used for Java and C++ repos. Keep in mind that if you need to compile your code (e.g., javac, make), that compilation needs to happen in the `run.sh` file of your code repository. Python programmers may use Python 2 or Python 3, but if you use the latter, specify `python3` in your `run.sh` script.
201 | 
202 | # Questions?
203 | Email us at cc@insightdataengineering.com
204 | 


--------------------------------------------------------------------------------
/input/README.md:
--------------------------------------------------------------------------------
1 | This is the directory where your program would find any test input files.
2 | 


--------------------------------------------------------------------------------
/insight_testsuite/run_tests.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | 
  3 | declare -r color_start="\033["
  4 | declare -r color_red="${color_start}0;31m"
  5 | declare -r color_green="${color_start}0;32m"
  6 | declare -r color_blue="${color_start}0;34m"
  7 | declare -r color_norm="${color_start}0m"
  8 | 
  9 | GRADER_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)  # absolute path, so the script also works when invoked from another directory
 10 | 
 11 | PROJECT_PATH=${GRADER_ROOT}/..
 12 | 
 13 | function print_dir_contents {
 14 |   local proj_path=$1
 15 |   echo "Project contents:"
 16 |   echo -e "${color_blue}$(ls ${proj_path})${color_norm}"
 17 | }
 18 | 
 19 | function find_file_or_dir_in_project {
 20 |   local proj_path=$1
 21 |   local file_or_dir_name=$2
 22 |   if [[ ! -e "${proj_path}/${file_or_dir_name}" ]]; then
 23 |     echo -e "[${color_red}FAIL${color_norm}]: no ${file_or_dir_name} found"
 24 |     print_dir_contents ${proj_path}
 25 |     echo -e "${color_red}${file_or_dir_name} [MISSING]${color_norm}"
 26 |     exit 1
 27 |   fi
 28 | }
 29 | 
 30 | # check project directory structure
 31 | function check_project_struct {
 32 |   find_file_or_dir_in_project ${PROJECT_PATH} run.sh
 33 |   find_file_or_dir_in_project ${PROJECT_PATH} src
 34 |   find_file_or_dir_in_project ${PROJECT_PATH} input
 35 |   find_file_or_dir_in_project ${PROJECT_PATH} output
 36 | }
 37 | 
 38 | # setup testing output folder
 39 | function setup_testing_input_output {
 40 |   TEST_OUTPUT_PATH=${GRADER_ROOT}/temp
 41 |   if [ -d ${TEST_OUTPUT_PATH} ]; then
 42 |     rm -rf ${TEST_OUTPUT_PATH}
 43 |   fi
 44 | 
 45 |   mkdir -p ${TEST_OUTPUT_PATH}
 46 | 
 47 |   cp -r ${PROJECT_PATH}/src ${TEST_OUTPUT_PATH}
 48 |   cp -r ${PROJECT_PATH}/run.sh ${TEST_OUTPUT_PATH}
 49 |   cp -r ${PROJECT_PATH}/input ${TEST_OUTPUT_PATH}
 50 |   cp -r ${PROJECT_PATH}/output ${TEST_OUTPUT_PATH}
 51 | 
 52 |   rm -rf ${TEST_OUTPUT_PATH}/input/*   # -f: do not complain if the folder is already empty
 53 |   rm -rf ${TEST_OUTPUT_PATH}/output/*
 54 |   cp -r ${GRADER_ROOT}/tests/${test_folder}/input/order_products.csv ${TEST_OUTPUT_PATH}/input/order_products.csv
 55 |   cp -r ${GRADER_ROOT}/tests/${test_folder}/input/products.csv ${TEST_OUTPUT_PATH}/input/products.csv
 56 | }
 57 | 
 58 | function compare_outputs {
 59 |   NUM_OUTPUT_FILES_PASSED=0
 60 |   OUTPUT_FILENAME=report.csv
 61 |   PROJECT_ANSWER_PATH1=${GRADER_ROOT}/temp/output/${OUTPUT_FILENAME}
 62 |   TEST_ANSWER_PATH1=${GRADER_ROOT}/tests/${test_folder}/output/${OUTPUT_FILENAME}
 63 | 
 64 |   DIFF_RESULT1=$(diff -bB ${PROJECT_ANSWER_PATH1} ${TEST_ANSWER_PATH1} | wc -l)
 65 |   if [ "${DIFF_RESULT1}" -eq "0" ] && [ -f ${PROJECT_ANSWER_PATH1} ]; then
 66 |     echo -e "[${color_green}PASS${color_norm}]: ${test_folder} ${OUTPUT_FILENAME}"
 67 |     NUM_OUTPUT_FILES_PASSED=$(($NUM_OUTPUT_FILES_PASSED+1))
 68 |   else
 69 |     echo -e "[${color_red}FAIL${color_norm}]: ${test_folder}"
 70 |     diff ${PROJECT_ANSWER_PATH1} ${TEST_ANSWER_PATH1}
 71 |   fi
 72 | 
 73 |   if [ "${NUM_OUTPUT_FILES_PASSED}" -eq "1" ]; then
 74 |     PASS_CNT=$(($PASS_CNT+1))
 75 |   fi
 76 | 
 77 | }
 78 | 
 79 | function run_all_tests {
 80 |   TEST_FOLDERS=$(ls ${GRADER_ROOT}/tests)
 81 |   NUM_TESTS=$(( $(echo ${TEST_FOLDERS} | wc -w) ))  # arithmetic expansion strips any padding from wc
 82 |   PASS_CNT=0
 83 | 
 84 |   # Loop through all tests
 85 |   for test_folder in ${TEST_FOLDERS}; do
 86 | 
 87 |     setup_testing_input_output
 88 | 
 89 |     cd ${GRADER_ROOT}/temp
 90 |     bash run.sh 2>&1
 91 |     cd ../
 92 | 
 93 |     compare_outputs
 94 |   done
 95 | 
 96 |   echo "[$(date)] ${PASS_CNT} of ${NUM_TESTS} tests passed"
 97 |   echo "[$(date)] ${PASS_CNT} of ${NUM_TESTS} tests passed" >> ${GRADER_ROOT}/results.txt
 98 | }
 99 | 
100 | check_project_struct
101 | run_all_tests
102 | 


--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/README.md:
--------------------------------------------------------------------------------
1 | This test has been provided so that you can see one example. However, you should create your own tests to check that your code runs as expected.
2 | 


--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/input/order_products.csv:
--------------------------------------------------------------------------------
 1 | order_id,product_id,add_to_cart_order,reordered
 2 | 2,33120,1,1
 3 | 2,28985,2,1
 4 | 2,9327,3,0
 5 | 2,45918,4,1
 6 | 3,17668,1,1
 7 | 3,46667,2,1
 8 | 3,17461,4,1
 9 | 3,32665,3,1
10 | 4,46842,1,0
11 | 


--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/input/products.csv:
--------------------------------------------------------------------------------
 1 | product_id,product_name,aisle_id,department_id
 2 | 9327,Garlic Powder,104,13
 3 | 17461,Air Chilled Organic Boneless Skinless Chicken Breasts,35,12
 4 | 17668,Unsweetened Chocolate Almond Breeze Almond Milk,91,16
 5 | 28985,Michigan Organic Kale,83,4
 6 | 32665,Organic Ezekiel 49 Bread Cinnamon Raisin,112,3
 7 | 33120,Organic Egg Whites,86,16
 8 | 45918,Coconut Butter,19,13
 9 | 46667,Organic Ginger Root,83,4
10 | 46842,Plain Pre-Sliced Bagels,93,3
11 | 


--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/output/report.csv:
--------------------------------------------------------------------------------
1 | department_id,number_of_orders,number_of_first_orders,percentage
2 | 3,2,1,0.50
3 | 4,2,0,0.00
4 | 12,1,0,0.00
5 | 13,2,1,0.50
6 | 16,2,0,0.00
7 | 


--------------------------------------------------------------------------------
/output/README.md:
--------------------------------------------------------------------------------
1 | This directory is where we would expect your program to write the requested output files.
2 | 


--------------------------------------------------------------------------------
/run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | #
3 | # Use this shell script to compile (if necessary) your code and then execute it. Below is an example of what might be found in this file if your program was written in Python
4 | #
5 | #python ./src/purchase_analytics.py ./input/order_products.csv ./input/products.csv ./output/report.csv
6 | 


--------------------------------------------------------------------------------
/src/README.md:
--------------------------------------------------------------------------------
1 | This is the directory where your source code would reside.
2 | 


--------------------------------------------------------------------------------