├── src
│   └── README.md
├── input
│   └── README.md
├── output
│   └── README.md
├── insight_testsuite
│   ├── .DS_Store
│   ├── tests
│   │   ├── .DS_Store
│   │   └── test_1
│   │       ├── .DS_Store
│   │       ├── input
│   │       │   ├── .DS_Store
│   │       │   └── itcont.txt
│   │       ├── output
│   │       │   ├── .DS_Store
│   │       │   └── top_cost_drug.txt
│   │       └── README.md
│   └── run_tests.sh
├── run.sh
└── README.md


/src/README.md:
--------------------------------------------------------------------------------
This is the directory where your source code would reside.

--------------------------------------------------------------------------------
/input/README.md:
--------------------------------------------------------------------------------
This is the directory where your program would find any test input files.

--------------------------------------------------------------------------------
/output/README.md:
--------------------------------------------------------------------------------
This directory is where we would expect your program to write the requested output files.
--------------------------------------------------------------------------------
/insight_testsuite/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/pharmacy_counting/master/insight_testsuite/.DS_Store
--------------------------------------------------------------------------------
/insight_testsuite/tests/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/pharmacy_counting/master/insight_testsuite/tests/.DS_Store
--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/pharmacy_counting/master/insight_testsuite/tests/test_1/.DS_Store
--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/input/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/pharmacy_counting/master/insight_testsuite/tests/test_1/input/.DS_Store
--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/output/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/InsightDataScience/pharmacy_counting/master/insight_testsuite/tests/test_1/output/.DS_Store
--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/output/top_cost_drug.txt:
--------------------------------------------------------------------------------
drug_name,num_prescriber,total_cost
CHLORPROMAZINE,2,3000
BENZTROPINE MESYLATE,1,1500
AMBIEN,2,300
--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/README.md:
--------------------------------------------------------------------------------
This test has been provided so that you can see one example. However, you should create your own tests to check that your code runs as expected.

--------------------------------------------------------------------------------
/run.sh:
--------------------------------------------------------------------------------
#!/bin/bash
#
# Use this shell script to compile (if necessary) your code and then execute it. Below is an example of what might be found in this file if your program was written in Python
#
#python ./src/pharmacy_counting.py ./input/itcont.txt ./output/top_cost_drug.txt

--------------------------------------------------------------------------------
/insight_testsuite/tests/test_1/input/itcont.txt:
--------------------------------------------------------------------------------
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
--------------------------------------------------------------------------------
/insight_testsuite/run_tests.sh:
--------------------------------------------------------------------------------
#!/bin/bash

declare -r color_start="\033["
declare -r color_red="${color_start}0;31m"
declare -r color_green="${color_start}0;32m"
declare -r color_blue="${color_start}0;34m"
declare -r color_norm="${color_start}0m"

GRADER_ROOT=$(dirname ${BASH_SOURCE})

PROJECT_PATH=${GRADER_ROOT}/..

function print_dir_contents {
  local proj_path=$1
  echo "Project contents:"
  echo -e "${color_blue}$(ls ${proj_path})${color_norm}"
}

function find_file_or_dir_in_project {
  local proj_path=$1
  local file_or_dir_name=$2
  if [[ ! -e "${proj_path}/${file_or_dir_name}" ]]; then
    echo -e "[${color_red}FAIL${color_norm}]: no ${file_or_dir_name} found"
    print_dir_contents ${proj_path}
    echo -e "${color_red}${file_or_dir_name} [MISSING]${color_norm}"
    exit 1
  fi
}

# check project directory structure
function check_project_struct {
  find_file_or_dir_in_project ${PROJECT_PATH} run.sh
  find_file_or_dir_in_project ${PROJECT_PATH} src
  find_file_or_dir_in_project ${PROJECT_PATH} input
  find_file_or_dir_in_project ${PROJECT_PATH} output
}

# setup testing output folder
function setup_testing_input_output {
  TEST_OUTPUT_PATH=${GRADER_ROOT}/temp
  if [ -d ${TEST_OUTPUT_PATH} ]; then
    rm -rf ${TEST_OUTPUT_PATH}
  fi

  mkdir -p ${TEST_OUTPUT_PATH}

  cp -r ${PROJECT_PATH}/src ${TEST_OUTPUT_PATH}
  cp -r ${PROJECT_PATH}/run.sh ${TEST_OUTPUT_PATH}
  cp -r ${PROJECT_PATH}/input ${TEST_OUTPUT_PATH}
  cp -r ${PROJECT_PATH}/output ${TEST_OUTPUT_PATH}

  rm -r ${TEST_OUTPUT_PATH}/input/*
  rm -r ${TEST_OUTPUT_PATH}/output/*
  cp -r ${GRADER_ROOT}/tests/${test_folder}/input/itcont.txt ${TEST_OUTPUT_PATH}/input/itcont.txt
}

function compare_outputs {
  NUM_OUTPUT_FILES_PASSED=0
  OUTPUT_FILENAME=top_cost_drug.txt
  PROJECT_ANSWER_PATH1=${GRADER_ROOT}/temp/output/${OUTPUT_FILENAME}
  TEST_ANSWER_PATH1=${GRADER_ROOT}/tests/${test_folder}/output/${OUTPUT_FILENAME}

  DIFF_RESULT1=$(diff -bB ${PROJECT_ANSWER_PATH1} ${TEST_ANSWER_PATH1} | wc -l)
  if [ "${DIFF_RESULT1}" -eq "0" ] && [ -f ${PROJECT_ANSWER_PATH1} ]; then
    echo -e "[${color_green}PASS${color_norm}]: ${test_folder} ${OUTPUT_FILENAME}"
    NUM_OUTPUT_FILES_PASSED=$(($NUM_OUTPUT_FILES_PASSED+1))
  else
    echo -e "[${color_red}FAIL${color_norm}]: ${test_folder}"
    diff ${PROJECT_ANSWER_PATH1} ${TEST_ANSWER_PATH1}
  fi

  if [ "${NUM_OUTPUT_FILES_PASSED}" -eq "1" ]; then
    PASS_CNT=$(($PASS_CNT+1))
  fi

}

function run_all_tests {
  TEST_FOLDERS=$(ls ${GRADER_ROOT}/tests)
  NUM_TESTS=$(($(echo $(echo ${TEST_FOLDERS} | wc -w))))
  PASS_CNT=0

  # Loop through all tests
  for test_folder in ${TEST_FOLDERS}; do

    setup_testing_input_output

    cd ${GRADER_ROOT}/temp
    bash run.sh 2>&1
    cd ../

    compare_outputs
  done

  echo "[$(date)] ${PASS_CNT} of ${NUM_TESTS} tests passed"
  echo "[$(date)] ${PASS_CNT} of ${NUM_TESTS} tests passed" >> ${GRADER_ROOT}/results.txt
}

check_project_struct
run_all_tests

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Table of Contents
1. [Problem](README.md#problem)
1. [Steps to submit your solution](README.md#steps-to-submit-your-solution)
1. [Input Dataset](README.md#input-dataset)
1. [Instructions](README.md#instructions)
1. [Output](README.md#output)
1. [Tips on getting an interview](README.md#tips-on-getting-an-interview)
1. [Questions?](README.md#questions)

# Problem

Imagine you are a data engineer working for an online pharmacy. You are asked to generate a list of all drugs, together with the total number of UNIQUE individuals who prescribed each drug and the total cost of that drug. The list must be sorted in descending order of total drug cost; if there is a tie, sort by drug name in ascending order.
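
One possible way to sketch the aggregation described above, using only Python's standard library and assuming the five-column CSV layout shown in the Output section of this README. The function names `top_cost_drugs` and `format_rows` are hypothetical, and using `Decimal` for costs is an assumption, since this README does not say how non-integer costs should be printed. Treat it as an illustration of the unique-prescriber count and the sort order, not as a reference solution:

```python
import csv
from collections import defaultdict
from decimal import Decimal


def top_cost_drugs(lines):
    """Return (drug_name, num_prescriber, total_cost) tuples in ranked order.

    `lines` is any iterable of CSV text lines, header included.
    """
    prescribers = defaultdict(set)  # drug -> {(last_name, first_name), ...}
    totals = defaultdict(Decimal)   # drug -> summed cost, starting at 0

    for row in csv.DictReader(lines):
        drug = row["drug_name"]
        # A prescriber is identified by last and first name, not by id.
        prescribers[drug].add(
            (row["prescriber_last_name"], row["prescriber_first_name"])
        )
        totals[drug] += Decimal(row["drug_cost"])

    # Descending total cost; ties broken by drug name in ascending order.
    ranked = sorted(totals, key=lambda d: (-totals[d], d))
    return [(d, len(prescribers[d]), totals[d]) for d in ranked]


def format_rows(rows):
    """Render the ranked tuples as output lines, header first."""
    header = "drug_name,num_prescriber,total_cost"
    return [header] + [f"{drug},{count},{cost}" for drug, count, cost in rows]
```

A dictionary keyed by drug name keeps the pass over the input linear, and a set per drug removes duplicate prescribers. Note that `format_rows` joins fields with plain commas, so it would need quoting logic if a drug name itself contained a comma.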

Disclosure: The projects that Insight Data Engineering Fellows work on during the program are much more complicated and interesting than this coding challenge. This challenge only tests you on the basics.

# Steps to Submit your solution
* To submit your entry please use the link you received in your coding challenge invite email
* You will only be able to submit through the link one time
* Do NOT attach a file - we will not accept solutions submitted as attached files
* Do NOT send your solution over email - we are unable to accept coding challenges that way

### Creating private repositories
To avoid plagiarism and any wrongdoing, we ask that you submit a private repository of your code. Both GitHub and Bitbucket offer unlimited private repositories at no extra cost.
* Create a private repository on GitHub or Bitbucket with the given repository structure. Here is how you will be sharing your private repositories for us to see once you are ready to submit.
* Add "insight-cc-bot" as a collaborator in your project.
  * [How to add collaborators on GitHub?](https://help.github.com/articles/inviting-collaborators-to-a-personal-repository/)
  * [How to add users and groups as collaborators in Bitbucket?](https://confluence.atlassian.com/bitbucket/grant-repository-access-to-users-and-groups-221449716.html)
* **We will NOT be grading submissions we do not have access to.**

### Submitting a link to your repository
* Use the submission box to enter the link to your GitHub or Bitbucket repo ONLY
* Link to the specific repo for this project, not your general profile
* Put any comments in the README inside your project repo, not in the submission box


# Input Dataset

The original dataset was obtained from the Centers for Medicare & Medicaid Services but has been cleaned and simplified to match the scope of the coding challenge. It provides information on prescription drugs prescribed by individual physicians and other health care providers. The dataset identifies prescribers by their ID, last name, and first name. It also describes the specific prescriptions that were dispensed at their direction, listed by drug name and the cost of the medication.

# Instructions

We designed this coding challenge to assess your coding skills and your understanding of computer science fundamentals. Both are prerequisites for becoming a data engineer. To solve this challenge you may pick a programming language of your choice (preferably Python, Scala, Java, or C/C++ because they are commonly used and will help us better assess you), but you are only allowed to use the default data structures that come with that programming language (you may use I/O libraries). For example, you can code in Python, but you should not use Pandas or any other external libraries.

***The objective here is to see if you can implement the solution using basic data structure building blocks and software engineering best practices (by writing clean, modular, and well-tested code).***

# Output

Your program needs to create the output file, `top_cost_drug.txt`, that contains comma (`,`) separated fields in each line.

Each line of this file should contain these fields:
* drug_name: the exact drug name as shown in the input dataset
* num_prescriber: the number of unique prescribers who prescribed the drug. For the purposes of this challenge, a prescriber is considered the same person if two lines share the same prescriber first and last names
* total_cost: total cost of the drug across all prescribers

For example:

If your input data, **`itcont.txt`**, is
```
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
```

then your output file, **`top_cost_drug.txt`**, would contain the following lines
```
drug_name,num_prescriber,total_cost
CHLORPROMAZINE,2,3000
BENZTROPINE MESYLATE,1,1500
AMBIEN,2,300
```

These files are provided in the `insight_testsuite/tests/test_1/input` and `insight_testsuite/tests/test_1/output` folders, respectively.


# Tips on getting an interview

## Writing clean, scalable and well-tested code

As a data engineer, it’s important that you write clean, well-documented code that scales for a large amount of data. For this reason, it’s important to ensure that your solution works well for a large number of records, rather than just the above example.

Here you can find a large dataset containing over 24 million records. Note that we will use it to test the full functionality of your code, along with other tests.

It's also important to use software engineering best practices like unit tests, especially since data is not always clean and predictable.

Before submitting your solution you should summarize your approach and run instructions (if any) in your `README`.

You may write your solution in any mainstream programming language, such as C, C++, C#, Go, Java, Python, Ruby, or Scala. Once completed, submit a link to your GitHub or Bitbucket repo with your source code.

In addition to the source code, the top-most directory of your repo must include the `input` and `output` directories, and a shell script named `run.sh` that compiles and runs the program(s) that implement(s) the required features.

If your solution requires additional libraries, environments, or dependencies, you must specify these in your `README` documentation. See the figure below for the required structure of the top-most directory in your repo, or simply clone this repo.

## Repo directory structure

The directory structure for your repo should look like this:

    ├── README.md
    ├── run.sh
    ├── src
    │   └── pharmacy-counting.py
    ├── input
    │   └── itcont.txt
    ├── output
    │   └── top_cost_drug.txt
    └── insight_testsuite
        ├── run_tests.sh
        └── tests
            ├── test_1
            │   ├── input
            │   │   └── itcont.txt
            │   └── output
            │       └── top_cost_drug.txt
            └── your-own-test_1
                ├── input
                │   └── your-own-input-for-itcont.txt
                └── output
                    └── top_cost_drug.txt

**Don't fork this repo** and don't use this `README` instead of your own. The content of `src` does not need to be a single file called `pharmacy-counting.py`, which is only an example. Instead, you should include your own source files and give them expressive names.

## Testing your directory structure and output format

To make sure that your code has the correct directory structure and that the format of the output files is correct, we have included a test script called `run_tests.sh` in the `insight_testsuite` folder.

The tests are stored simply as text files under the `insight_testsuite/tests` folder. Each test should have a separate folder with an `input` folder for `itcont.txt` and an `output` folder for `top_cost_drug.txt`.
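
If you would rather generate additional test folders programmatically, the layout above can be scaffolded with a short helper. This is a minimal sketch: `add_test_case` and its parameters are hypothetical names, not part of the provided test suite. The fixed file names mirror what `run_tests.sh` copies and compares (`input/itcont.txt` and `output/top_cost_drug.txt`):

```python
from pathlib import Path


def add_test_case(tests_root, name, input_text, expected_text):
    """Create <tests_root>/<name>/input/itcont.txt and output/top_cost_drug.txt.

    Returns the path of the new test folder.
    """
    case = Path(tests_root) / name
    # One folder per test, each with its own input and output subfolder.
    (case / "input").mkdir(parents=True, exist_ok=True)
    (case / "output").mkdir(parents=True, exist_ok=True)
    (case / "input" / "itcont.txt").write_text(input_text)
    (case / "output" / "top_cost_drug.txt").write_text(expected_text)
    return case
```

For example, `add_test_case("insight_testsuite/tests", "test_tie", input_text, expected_text)` would create a folder that `run_tests.sh` picks up on its next run, since the script loops over every entry in `tests`.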

You can run the test with the following command from within the `insight_testsuite` folder:

    insight_testsuite~$ ./run_tests.sh

On a failed test, the output of `run_tests.sh` should look like:

    [FAIL]: test_1
    [Thu Mar 30 16:28:01 PDT 2017] 0 of 1 tests passed

On success:

    [PASS]: test_1 top_cost_drug.txt
    [Thu Mar 30 16:25:57 PDT 2017] 1 of 1 tests passed



One test has been provided as a way to check your formatting and simulate how we will be running tests when you submit your solution. We urge you to write your own additional tests. `test_1` is only intended to alert you if the directory structure or the output for this test is incorrect.

Your submission must pass at least the provided test in order to pass the coding challenge.

For a limited time we are also making available a website that will allow you to simulate the environment in which we will test your code. It has been primarily tested on Python code but could be used for Java and C++ repos. Keep in mind that if you need to compile your code (e.g., javac, make), that compilation needs to happen in the `run.sh` file of your code repository. Python programmers can use Python2 or Python3, but if you use the latter, specify `python3` in your `run.sh` script.

# Questions?
Email us at cc@insightdataengineering.com

--------------------------------------------------------------------------------