├── insight_testsuite ├── tests │ └── test_1 │ │ ├── input │ │ ├── percentile.txt │ │ └── itcont.txt │ │ ├── output │ │ └── repeat_donors.txt │ │ └── README.md └── run_tests.sh ├── src ├── README.md └── .donation-analytics.py.swp ├── input └── README.md ├── output └── README.md ├── run.sh └── README.md /insight_testsuite/tests/test_1/input/percentile.txt: -------------------------------------------------------------------------------- 1 | 30 2 | -------------------------------------------------------------------------------- /src/README.md: -------------------------------------------------------------------------------- 1 | This is the directory where your source code would reside. 2 | -------------------------------------------------------------------------------- /input/README.md: -------------------------------------------------------------------------------- 1 | This is the directory where your program would find any test input files. 2 | -------------------------------------------------------------------------------- /output/README.md: -------------------------------------------------------------------------------- 1 | This directory is where we would expect your program to write the requested output files. 
2 | -------------------------------------------------------------------------------- /insight_testsuite/tests/test_1/output/repeat_donors.txt: -------------------------------------------------------------------------------- 1 | C00384516|02895|2018|333|333|1 2 | C00384516|02895|2018|333|717|2 3 | -------------------------------------------------------------------------------- /src/.donation-analytics.py.swp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/InsightDataScience/donation-analytics/master/src/.donation-analytics.py.swp -------------------------------------------------------------------------------- /insight_testsuite/tests/test_1/README.md: -------------------------------------------------------------------------------- 1 | This test has been provided for you so that you can see one example, however, you should be creating your own tests to check that your code runs as expected. 2 | -------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # 3 | # Use this shell script to compile (if necessary) your code and then execute it. 
Below is an example of what might be found in this file if your program was written in Python 4 | # 5 | #python ./src/donation-analytics.py ./input/itcont.txt ./input/percentile.txt ./output/repeat_donors.txt 6 | 7 | -------------------------------------------------------------------------------- /insight_testsuite/tests/test_1/input/itcont.txt: -------------------------------------------------------------------------------- 1 | C00629618|N|TER|P|201701230300133512|15C|IND|PEREZ, JOHN A|LOS ANGELES|CA|90017|PRINCIPAL|DOUBLE NICKEL ADVISORS|01032017|40|H6CA34245|SA01251735122|1141239|||2012520171368850783 2 | C00177436|N|M2|P|201702039042410894|15|IND|DEEHAN, WILLIAM N|ALPHARETTA|GA|300047357|UNUM|SVP, SALES, CL|01312017|384||PR2283873845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029337 3 | C00384818|N|M2|P|201702039042412112|15|IND|ABBOTT, JOSEPH|WOONSOCKET|RI|028956146|CVS HEALTH|VP, RETAIL PHARMACY OPS|01122017|250||2017020211435-887|1147467|||4020820171370030285 4 | C00384516|N|M2|P|201702039042410893|15|IND|SABOURIN, JAMES|LOOKOUT MOUNTAIN|GA|028956146|UNUM|SVP, CORPORATE COMMUNICATIONS|01312017|230||PR1890575345050|1147350||P/R DEDUCTION ($115.00 BI-WEEKLY)|4020820171370029335 5 | C00177436|N|M2|P|201702039042410895|15|IND|JEROME, CHRISTOPHER|LOOKOUT MOUNTAIN|GA|307502818|UNUM|EVP, GLOBAL SERVICES|10312017|384||PR2283905245050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029342 6 | C00384516|N|M2|P|201702039042412112|15|IND|ABBOTT, JOSEPH|WOONSOCKET|RI|028956146|CVS HEALTH|EVP, HEAD OF RETAIL OPERATIONS|01122018|333||2017020211435-910|1147467|||4020820171370030287 7 | C00384516|N|M2|P|201702039042410894|15|IND|SABOURIN, JAMES|LOOKOUT MOUNTAIN|GA|028956146|UNUM|SVP, CORPORATE COMMUNICATIONS|01312018|384||PR2283904845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029339 8 | -------------------------------------------------------------------------------- /insight_testsuite/run_tests.sh: 
-------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | declare -r color_start="\033[" 4 | declare -r color_red="${color_start}0;31m" 5 | declare -r color_green="${color_start}0;32m" 6 | declare -r color_blue="${color_start}0;34m" 7 | declare -r color_norm="${color_start}0m" 8 | 9 | GRADER_ROOT=$(dirname ${BASH_SOURCE}) 10 | 11 | PROJECT_PATH=${GRADER_ROOT}/.. 12 | 13 | function print_dir_contents { 14 | local proj_path=$1 15 | echo "Project contents:" 16 | echo -e "${color_blue}$(ls ${proj_path})${color_norm}" 17 | } 18 | 19 | function find_file_or_dir_in_project { 20 | local proj_path=$1 21 | local file_or_dir_name=$2 22 | if [[ ! -e "${proj_path}/${file_or_dir_name}" ]]; then 23 | echo -e "[${color_red}FAIL${color_norm}]: no ${file_or_dir_name} found" 24 | print_dir_contents ${proj_path} 25 | echo -e "${color_red}${file_or_dir_name} [MISSING]${color_norm}" 26 | exit 1 27 | fi 28 | } 29 | 30 | # check project directory structure 31 | function check_project_struct { 32 | find_file_or_dir_in_project ${PROJECT_PATH} run.sh 33 | find_file_or_dir_in_project ${PROJECT_PATH} src 34 | find_file_or_dir_in_project ${PROJECT_PATH} input 35 | find_file_or_dir_in_project ${PROJECT_PATH} output 36 | } 37 | 38 | # setup testing output folder 39 | function setup_testing_input_output { 40 | TEST_OUTPUT_PATH=${GRADER_ROOT}/temp 41 | if [ -d ${TEST_OUTPUT_PATH} ]; then 42 | rm -rf ${TEST_OUTPUT_PATH} 43 | fi 44 | 45 | mkdir -p ${TEST_OUTPUT_PATH} 46 | 47 | cp -r ${PROJECT_PATH}/src ${TEST_OUTPUT_PATH} 48 | cp -r ${PROJECT_PATH}/run.sh ${TEST_OUTPUT_PATH} 49 | cp -r ${PROJECT_PATH}/input ${TEST_OUTPUT_PATH} 50 | cp -r ${PROJECT_PATH}/output ${TEST_OUTPUT_PATH} 51 | 52 | rm -r ${TEST_OUTPUT_PATH}/input/* 53 | rm -r ${TEST_OUTPUT_PATH}/output/* 54 | cp -r ${GRADER_ROOT}/tests/${test_folder}/input/itcont.txt ${TEST_OUTPUT_PATH}/input/itcont.txt 55 | cp -r ${GRADER_ROOT}/tests/${test_folder}/input/percentile.txt 
${TEST_OUTPUT_PATH}/input/percentile.txt 56 | } 57 | 58 | function compare_outputs { 59 | NUM_OUTPUT_FILES_PASSED=0 60 | OUTPUT_FILENAME=repeat_donors.txt 61 | PROJECT_ANSWER_PATH1=${GRADER_ROOT}/temp/output/${OUTPUT_FILENAME} 62 | TEST_ANSWER_PATH1=${GRADER_ROOT}/tests/${test_folder}/output/${OUTPUT_FILENAME} 63 | 64 | DIFF_RESULT1=$(diff -bB ${PROJECT_ANSWER_PATH1} ${TEST_ANSWER_PATH1} | wc -l) 65 | if [ "${DIFF_RESULT1}" -eq "0" ] && [ -f ${PROJECT_ANSWER_PATH1} ]; then 66 | echo -e "[${color_green}PASS${color_norm}]: ${test_folder} ${OUTPUT_FILENAME}" 67 | NUM_OUTPUT_FILES_PASSED=$(($NUM_OUTPUT_FILES_PASSED+1)) 68 | else 69 | echo -e "[${color_red}FAIL${color_norm}]: ${test_folder}" 70 | diff ${PROJECT_ANSWER_PATH1} ${TEST_ANSWER_PATH1} 71 | fi 72 | 73 | if [ "${NUM_OUTPUT_FILES_PASSED}" -eq "1" ]; then 74 | PASS_CNT=$(($PASS_CNT+1)) 75 | fi 76 | 77 | } 78 | 79 | function run_all_tests { 80 | TEST_FOLDERS=$(ls ${GRADER_ROOT}/tests) 81 | NUM_TESTS=$(($(echo $(echo ${TEST_FOLDERS} | wc -w)))) 82 | PASS_CNT=0 83 | 84 | # Loop through all tests 85 | for test_folder in ${TEST_FOLDERS}; do 86 | 87 | setup_testing_input_output 88 | 89 | cd ${GRADER_ROOT}/temp 90 | bash run.sh 2>&1 91 | cd ../ 92 | 93 | compare_outputs 94 | done 95 | 96 | echo "[$(date)] ${PASS_CNT} of ${NUM_TESTS} tests passed" 97 | echo "[$(date)] ${PASS_CNT} of ${NUM_TESTS} tests passed" >> ${GRADER_ROOT}/results.txt 98 | } 99 | 100 | check_project_struct 101 | run_all_tests 102 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Table of Contents 2 | 1. [Introduction](README.md#introduction) 3 | 2. [Challenge summary](README.md#challenge-summary) 4 | 3. [Details of challenge](README.md#details-of-challenge) 5 | 4. [Input files](README.md#input-files) 6 | 5. [Output file](README.md#output-file) 7 | 6. [Percentile computation](README.md#percentile-computation) 8 | 7. 
[Example](README.md#example) 9 | 8. [Writing clean, scalable and well-tested code](README.md#writing-clean-scalable-and-well-tested-code) 10 | 9. [Repo directory structure](README.md#repo-directory-structure) 11 | 10. [Testing your directory structure and output format](README.md#testing-your-directory-structure-and-output-format) 12 | 11. [Instructions to submit your solution](README.md#instructions-to-submit-your-solution) 13 | 12. [FAQ](README.md#faq) 14 | 15 | # Introduction 16 | You’re a data engineer working for political consultants whose clients are cash-strapped political candidates. They've asked for help analyzing loyalty trends in campaign contributions, namely identifying areas of repeat donors and calculating how much they're spending. 17 | 18 | The Federal Election Commission regularly publishes campaign contributions, and while you don’t want to pull specific donors from those files — because using that information for fundraising or commercial purposes is illegal — you want to identify areas (zip codes) that could be sources of repeat campaign contributions. 19 | 20 | # Challenge summary 21 | 22 | For this challenge, we're asking you to take a file listing individual campaign contributions for multiple years, determine which ones came from repeat donors, calculate a few values and distill the results into a single output file, `repeat_donors.txt`. 23 | 24 | For each recipient, zip code and calendar year, calculate these three values for contributions coming from repeat donors: 25 | 26 | * total dollars received 27 | * total number of contributions received 28 | * donation amount in a given percentile 29 | 30 | The political consultants, who are primarily interested in donors who have contributed in multiple years, are concerned about possible outliers in the data. So they have asked that your program allow for a variable percentile. 
That way the program could calculate the median (or the 50th percentile) in one run and the 99th percentile in another. 31 | 32 | Another developer has been placed in charge of building the graphical user 33 | interface with a dashboard showing the latest metrics on repeat donors, among other things. 34 | 35 | Your role on the project is to work on the data pipeline that will hand off the information to the front-end. As the backend data engineer, you do **not** need to display the data or work on the dashboard but you do need to provide the information. 36 | 37 | You can assume there is another process that takes what is written to the output file and sends it to the front-end. If we were building this pipeline in real life, we’d probably have another mechanism to send the output to the GUI rather than writing to a file. However, for the purposes of grading this challenge, we just want you to write the output to files. 38 | 39 | # Details of challenge 40 | 41 | You’re given two input files. 42 | 43 | 1. `percentile.txt`, holds a single value -- the percentile value (1-100) that your program will be asked to calculate. 44 | 45 | 2. `itcont.txt`, has a line for each campaign contribution that was made on a particular date from a donor to a political campaign, committee or other similar entity. 46 | 47 | Out of the many fields listed on the pipe-delimited lines of `itcont.txt` file, you’re primarily interested in the contributor's name, zip code associated with the donor, amount contributed, date of the transaction and ID of the recipient. 48 | 49 | #### Identifying repeat donors 50 | For the purposes of this challenge, if a donor had previously contributed to any recipient listed in the `itcont.txt` file in any prior calendar year, that donor is considered a repeat donor. Also, for the purposes of this challenge, you can assume two contributions are from the same donor if the names and zip codes are identical. 
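
As a minimal sketch of this rule, assuming Python and the hypothetical helper names `donor_key` and `is_repeat_donor` (neither is part of the challenge), a donor can be keyed by name plus the first five characters of the zip code. A contribution then counts as coming from a repeat donor when that key has already been seen with an earlier calendar year:

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

DonorKey = Tuple[str, str]  # (NAME, first five characters of ZIP_CODE)


def donor_key(name: str, zip_code: str) -> DonorKey:
    """Identity used to decide that two contributions share a donor."""
    return (name, zip_code[:5])


def is_repeat_donor(years_seen: DefaultDict[DonorKey, Set[int]],
                    key: DonorKey, year: int) -> bool:
    """True if `key` already has a contribution in a calendar year before `year`.

    The current contribution is recorded afterwards, so later records see it.
    """
    repeat = any(previous < year for previous in years_seen[key])
    years_seen[key].add(year)
    return repeat


# Two contributions that share a name and a five-character zip code.
years_seen: DefaultDict[DonorKey, Set[int]] = defaultdict(set)
abbott = donor_key("ABBOTT, JOSEPH", "028956146")
print(is_repeat_donor(years_seen, abbott, 2017))  # False: first time seen
print(is_repeat_donor(years_seen, abbott, 2018))  # True: contributed in 2017
```

Keeping every year per donor mirrors the wording of the rule directly; storing only the earliest year seen would answer the same question with less memory.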
51 | 52 | #### Calculations 53 | Each line of `itcont.txt` should be treated as a record. Your code should process each line as if that record was sequentially streaming into your program. In other words, your program processes every line of `itcont.txt` in the same order as it is listed in the file. 54 | 55 | For each record that you identify as coming from a donor who has contributed to a campaign in a prior calendar year, calculate the running percentile of contributions from repeat donors, total number of transactions from repeat donors and total amount of donations streaming in from repeat donors so far for that calendar year, recipient and zip code. 56 | 57 | Write the calculated fields out onto a pipe-delimited line and then print it to an output file named `repeat_donors.txt` in the same order as the donation appeared in the input file. 58 | 59 | ## Input files 60 | 61 | The Federal Election Commission provides data files stretching back years and is [regularly updated](http://classic.fec.gov/finance/disclosure/ftpdet.shtml). 62 | 63 | For the purposes of this challenge, we’re interested in individual contributions. While you're welcome to run your program using the data files found at the FEC's website, you should not assume that we'll be testing your program on any of those data files or that the lines will be in the same order as what can be found in those files. Our test data files, however, will conform to the data dictionary [as described by the FEC](http://classic.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividuals.shtml). 
64 | 65 | Also, while there are many fields in the file that may be interesting, below are the ones that you’ll need to complete this challenge: 66 | 67 | * `CMTE_ID`: identifies the filer, which for our purposes is the recipient of this contribution 68 | * `NAME`: name of the donor 69 | * `ZIP_CODE`: zip code of the contributor (we only want the first five digits/characters) 70 | * `TRANSACTION_DT`: date of the transaction 71 | * `TRANSACTION_AMT`: amount of the transaction 72 | * `OTHER_ID`: a field that denotes whether the contribution came from a person or an entity 73 | 74 | ### Input file considerations 75 | 76 | Here are some considerations to keep in mind: 77 | 78 | 1. While the data dictionary has the `ZIP_CODE` occupying nine characters, for the purposes of the challenge, we only consider the first five characters of the field as the zip code 79 | 2. Because the data set doesn't contain a unique donor id, you should use the combination of `NAME` and `ZIP_CODE` (again, first five digits) to identify a unique donor 80 | 3. For the purposes of this challenge, you can assume the input file follows the data dictionary noted by the FEC for the 2015-current election years, although you should not assume the year field holds any particular value 81 | 4. The transactions noted in the input file are not in any particular order, and in fact, can be out of order chronologically 82 | 5. Because we are only interested in individual contributions, we only want records that have the field, `OTHER_ID`, set to empty. If the `OTHER_ID` field contains any other value, you should completely ignore and skip the entire record 83 | 6. 
Other situations in which you can completely ignore and skip an entire record: 84 | 85 | * If `TRANSACTION_DT` is an invalid date (e.g., empty, malformed) 86 | * If `ZIP_CODE` is an invalid zip code (i.e., empty, fewer than five digits) 87 | * If the `NAME` is an invalid name (e.g., empty, malformed) 88 | * If any line in the input file contains empty cells in the `CMTE_ID` or `TRANSACTION_AMT` fields 89 | 90 | Except for the considerations noted above with respect to `CMTE_ID`, `NAME`, `ZIP_CODE`, `TRANSACTION_DT`, `TRANSACTION_AMT`, `OTHER_ID`, data in any of the other fields (whether the data is valid, malformed, or empty) should not affect your processing. That is, as long as the previously noted considerations apply, you should process the record as if it was a valid, newly arriving transaction. (For instance, campaigns sometimes retransmit transactions as amendments; however, for the purposes of this challenge, you can ignore that distinction and treat all of the lines as if they were new.) 91 | 92 | 93 | ## Output file 94 | 95 | For the output file that your program will create, `repeat_donors.txt`, the fields on each line should be separated by a `|` 96 | 97 | The output should contain the same number of lines or records as the input data file, `itcont.txt`, minus any records that were ignored as a result of the 'Input file considerations' and any records you determine did not originate from a repeat donor. 98 | 99 | Each line of this file should contain these fields: 100 | 101 | * recipient of the contribution (or `CMTE_ID` from the input file) 102 | * 5-digit zip code of the contributor (or the first five characters of the `ZIP_CODE` field from the input file) 103 | * 4-digit year of the contribution 104 | * running percentile of contributions received from repeat donors to a recipient streamed in so far for this zip code and calendar year. 
Percentile calculations should be rounded to the nearest whole dollar (drop anything below $.50 and round anything from $.50 and up to the next dollar) 105 | * total amount of contributions received by recipient from the contributor's zip code streamed in so far in this calendar year from repeat donors 106 | * total number of transactions received by recipient from the contributor's zip code streamed in so far this calendar year from repeat donors 107 | 108 | ## Percentile computation 109 | 110 | The first line of `percentile.txt` contains the percentile you should compute for the given pair of input files. For the percentile computation use the **nearest-rank method** [as described by Wikipedia](https://en.wikipedia.org/wiki/Percentile). 111 | 112 | # Example 113 | 114 | Suppose your input files contained only the following few lines. Note that the fields we are interested in are in **bold** below but will not be bold in the input file. There's also an extra newline between records below, but the input file won't have that. 
115 | 116 | **`percentile.txt`** 117 | > **30** 118 | 119 | **`itcont.txt`** 120 | 121 | > **C00629618**|N|TER|P|201701230300133512|15C|IND|**PEREZ, JOHN A**|LOS ANGELES|CA|**90017**|PRINCIPAL|DOUBLE NICKEL ADVISORS|**01032017**|**40**|**H6CA34245**|SA01251735122|1141239|||2012520171368850783 122 | 123 | > **C00177436**|N|M2|P|201702039042410894|15|IND|**DEEHAN, WILLIAM N**|ALPHARETTA|GA|**300047357**|UNUM|SVP, SALES, CL|**01312017**|**384**||PR2283873845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029337 124 | 125 | > **C00384818**|N|M2|P|201702039042412112|15|IND|**ABBOTT, JOSEPH**|WOONSOCKET|RI|**028956146**|CVS HEALTH|VP, RETAIL PHARMACY OPS|**01122017**|**250**||2017020211435-887|1147467|||4020820171370030285 126 | 127 | > **C00384516**|N|M2|P|201702039042410893|15|IND|**SABOURIN, JAMES**|LOOKOUT MOUNTAIN|GA|**028956146**|UNUM|SVP, CORPORATE COMMUNICATIONS|**01312017**|**230**||PR1890575345050|1147350||P/R DEDUCTION ($115.00 BI-WEEKLY)|4020820171370029335 128 | 129 | > **C00177436**|N|M2|P|201702039042410895|15|IND|**JEROME, CHRISTOPHER**|LOOKOUT MOUNTAIN|GA|**307502818**|UNUM|EVP, GLOBAL SERVICES|**10312017**|**384**||PR2283905245050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029342 130 | 131 | > **C00384516**|N|M2|P|201702039042412112|15|IND|**ABBOTT, JOSEPH**|WOONSOCKET|RI|**028956146**|CVS HEALTH|EVP, HEAD OF RETAIL OPERATIONS|**01122018**|**333**||2017020211435-910|1147467|||4020820171370030287 132 | 133 | > **C00384516**|N|M2|P|201702039042410894|15|IND|**SABOURIN, JAMES**|LOOKOUT MOUNTAIN|GA|**028956146**|UNUM|SVP, CORPORATE COMMUNICATIONS|**01312018**|**384**||PR2283904845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029339 134 | 135 | The single line on `percentile.txt` tells us that we need to compute the 30th percentile for the stream in `itcont.txt`. If we were to pick the relevant fields from each line, here is what we would record for each line. 136 | 137 | **`itcont.txt`** 138 | 139 | 1. 
140 | CMTE_ID: C00629618 141 | NAME: PEREZ, JOHN A 142 | ZIP_CODE: 90017 143 | TRANSACTION_DT: 01032017 144 | TRANSACTION_AMT: 40 145 | OTHER_ID: H6CA34245 146 | 147 | 2. 148 | CMTE_ID: C00177436 149 | NAME: DEEHAN, WILLIAM N 150 | ZIP_CODE: 30004 151 | TRANSACTION_DT: 01312017 152 | TRANSACTION_AMT: 384 153 | OTHER_ID: empty 154 | 155 | 3. 156 | CMTE_ID: C00384818 157 | NAME: ABBOTT, JOSEPH 158 | ZIP_CODE: 02895 159 | TRANSACTION_DT: 01122017 160 | TRANSACTION_AMT: 250 161 | OTHER_ID: empty 162 | 163 | 4. 164 | CMTE_ID: C00384516 165 | NAME: SABOURIN, JAMES 166 | ZIP_CODE: 02895 167 | TRANSACTION_DT: 01312017 168 | TRANSACTION_AMT: 230 169 | OTHER_ID: empty 170 | 171 | 5. 172 | CMTE_ID: C00177436 173 | NAME: JEROME, CHRISTOPHER 174 | ZIP_CODE: 30750 175 | TRANSACTION_DT: 10312017 176 | TRANSACTION_AMT: 384 177 | OTHER_ID: empty 178 | 179 | 6. 180 | CMTE_ID: C00384516 181 | NAME: ABBOTT, JOSEPH 182 | ZIP_CODE: 02895 183 | TRANSACTION_DT: 01122018 184 | TRANSACTION_AMT: 333 185 | OTHER_ID: empty 186 | 187 | 7. 188 | CMTE_ID: C00384516 189 | NAME: SABOURIN, JAMES 190 | ZIP_CODE: 02895 191 | TRANSACTION_DT: 01312018 192 | TRANSACTION_AMT: 384 193 | OTHER_ID: empty 194 | 195 | 196 | In processing the `itcont.txt` file line by line, we would ignore the first record because the `OTHER_ID` field contains data and is not empty. 197 | 198 | The next four records don't include any contributions from repeat donors so we ignore them. 199 | 200 | But the sixth record includes a donation from `ABBOTT, JOSEPH` with a `ZIP_CODE` of `02895` on Jan. 12, 2018. That same donor contributed on Jan. 12, 2017. That means this contributor is a repeat donor. 201 | 202 | So now, we would look for any contributions from repeat donors for recipient `C00384516` and zip of `02895` for the year `2018`. We would then find that the sixth record would be the only one that would qualify. 
So we would emit 203 | 204 | * the total number of contributions from repeat donors is `1` 205 | * the total dollar amount of contributions is `333` 206 | * the 30th percentile contribution is `333` 207 | 208 | The seventh record also is for a repeat donor because `SABOURIN, JAMES`, who contributed Jan. 31, 2018, also contributed Jan. 31, 2017. 209 | 210 | When we look for any contributions from repeat donors for recipient, `C00384516`, zip of `02895` for the year `2018`, we would find that the sixth and seventh records qualify. So we would emit 211 | 212 | * the total number of contributions from repeat donors is `2` 213 | * the total dollar amount of contributions is `333` + `384` or `717` 214 | * the 30th percentile contribution is `333` 215 | 216 | Processing all of the input lines in `itcont.txt`, the entire contents of `repeat_donors.txt` would be: 217 | 218 | C00384516|02895|2018|333|333|1 219 | C00384516|02895|2018|333|717|2 220 | 221 | 222 | ## Writing clean, scalable and well-tested code 223 | 224 | As a data engineer, it’s important that you write clean, well-documented code that scales for large amounts of data. For this reason, it’s important to ensure that your solution works well for a large number of records, rather than just the above example. 225 | 226 | It's also important to use software engineering best practices like unit tests, especially since data is not always clean and predictable. For more details about the implementation, please refer to the FAQ below. If further clarification is necessary, email us at but please do so only after you have read through the Readme and FAQ one more time and cannot find the answer to your question. 227 | 228 | Before submitting your solution you should summarize your approach, dependencies and run instructions (if any) in your `README`. 229 | 230 | You may write your solution in any mainstream programming language such as C, C++, C#, Clojure, Erlang, Go, Haskell, Java, Python, Ruby, or Scala. 
Once completed, submit a link to a Github repo with your source code. 231 | 232 | In addition to the source code, the top-most directory of your repo must include the `input` and `output` directories, and a shell script named `run.sh` that compiles and runs the program(s) that implement the required features. 233 | 234 | If your solution requires additional libraries, environments, or dependencies, you must specify these in your `README` documentation. See the figure below for the required structure of the top-most directory in your repo, or simply clone this repo. 235 | 236 | ## Repo directory structure 237 | 238 | The directory structure for your repo should look like this: 239 | 240 | ├── README.md 241 | ├── run.sh 242 | ├── src 243 | │ └── donation-analytics.py 244 | ├── input 245 | │ └── percentile.txt 246 | │ └── itcont.txt 247 | ├── output 248 | | └── repeat_donors.txt 249 | ├── insight_testsuite 250 | └── run_tests.sh 251 | └── tests 252 | └── test_1 253 | | ├── input 254 | | │ └── percentile.txt 255 | | │ └── itcont.txt 256 | | |__ output 257 | | │ └── repeat_donors.txt 258 | ├── your-own-test_1 259 | ├── input 260 | │ └── your-own-input-for-itcont.txt 261 | |── output 262 | └── repeat_donors.txt 263 | 264 | **Don't fork this repo** and don't use this `README` instead of your own. The content of `src` does not need to be a single file called `donation-analytics.py`, which is only an example. Instead, you should include your own source files and give them expressive names. 265 | 266 | ## Testing your directory structure and output format 267 | 268 | To make sure that your code has the correct directory structure and the format of the output files are correct, we have included a test script called `run_tests.sh` in the `insight_testsuite` folder. 269 | 270 | The tests are stored simply as text files under the `insight_testsuite/tests` folder. 
Each test should have a separate folder with an `input` folder for `percentile.txt` and `itcont.txt` and an `output` folder for `repeat_donors.txt`. 271 | 272 | You can run the test with the following command from within the `insight_testsuite` folder: 273 | 274 | insight_testsuite~$ ./run_tests.sh 275 | 276 | On a failed test, the output of `run_tests.sh` should look like: 277 | 278 | [FAIL]: test_1 279 | [Thu Mar 30 16:28:01 PDT 2017] 0 of 1 tests passed 280 | 281 | On success: 282 | 283 | [PASS]: test_1 284 | [Thu Mar 30 16:25:57 PDT 2017] 1 of 1 tests passed 285 | 286 | 287 | 288 | One test has been provided as a way to check your formatting and simulate how we will be running tests when you submit your solution. We urge you to write your own additional tests. `test_1` is only intended to alert you if the directory structure or the output for this test is incorrect. 289 | 290 | Your submission must pass at least the provided test in order to pass the coding challenge. 291 | 292 | ## Instructions to submit your solution 293 | * To submit your entry please use the link you received in your coding challenge invite email 294 | * You will only be able to submit through the link one time 295 | * Do NOT attach a file - we will not admit solutions which are attached files 296 | * Use the submission box to enter the link to your GitHub repo or Bitbucket ONLY 297 | * Link to the specific repo for this project, not your general profile 298 | * Put any comments in the README inside your project repo, not in the submission box 299 | * We are unable to accept coding challenges that are emailed to us 300 | 301 | # FAQ 302 | 303 | Here are some common questions we've received. If you have additional questions, please email us at `cc@insightdataengineering.com` and we'll answer your questions as quickly as we can (during PST business hours), and update this FAQ. 
Again, only contact us after you have read through the Readme and FAQ one more time and cannot find the answer to your question. 304 | 305 | ### Why are you asking us to assume the data is streaming in? 306 | As a data engineer, you may want to take into consideration future needs. For instance, the team working on the dashboard may want to re-use the streaming functionality used to create `repeat_donors.txt` file in the future to show a running percentile value and total dollar amount of contributions as they arrive in real-time. It might prove useful in assessing the success of a candidate's fundraising efforts at any moment in time. 307 | 308 | ### What do I do when the data is listed out of order? 309 | Because donations could appear in any order in the input file, there could be a case where you don't know a contributor is a repeat donor until you encounter the second donation. 310 | 311 | In some cases, the second donation that came later in the file may have a transaction date that is for a previous calendar year. In that case, you should only identify the later donation as coming from a repeat donor and output the requested calculations for that calendar year, zip code and recipient. In this case, there would be no need to revise any lines you may have already outputted earlier. 
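
The rule above can be sketched as a single pass over the records. This is only an illustration under stated assumptions, not a reference solution: the function names (`nearest_rank`, `round_half_up`, `process`) are hypothetical, the field positions (0, 7, 10, 13, 14, 15) are read off the sample records in this README, all validation except the `OTHER_ID` check is omitted, and the running total is written as accumulated because the README only specifies rounding for the percentile.

```python
import bisect
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP


def nearest_rank(sorted_amounts, percentile):
    """Nearest-rank percentile: the value at ordinal rank ceil(P / 100 * N)."""
    rank = (percentile * len(sorted_amounts) + 99) // 100  # integer ceiling
    return sorted_amounts[max(rank, 1) - 1]


def round_half_up(value):
    """Round a Decimal to whole dollars, with $.50 and above going up."""
    return int(value.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def process(lines, percentile):
    """Yield one output line for every record that comes from a repeat donor."""
    earliest_year = {}             # (name, zip5) -> earliest year seen so far
    amounts = defaultdict(list)    # (cmte_id, zip5, year) -> sorted amounts
    totals = defaultdict(Decimal)  # (cmte_id, zip5, year) -> running total

    for line in lines:
        fields = line.rstrip("\n").split("|")
        if fields[15]:             # OTHER_ID set: not an individual, skip
            continue
        cmte_id, name, zip5 = fields[0], fields[7], fields[10][:5]
        year, amount = int(fields[13][4:]), Decimal(fields[14])

        donor = (name, zip5)
        first = earliest_year.get(donor)
        if first is None or year < first:
            earliest_year[donor] = year
        if first is None or first >= year:
            continue               # nothing from a prior calendar year yet

        group = (cmte_id, zip5, year)
        bisect.insort(amounts[group], amount)
        totals[group] += amount
        pct = round_half_up(nearest_rank(amounts[group], percentile))
        yield f"{cmte_id}|{zip5}|{year}|{pct}|{totals[group]}|{len(amounts[group])}"


# The three out-of-order records from this FAQ entry (2016, then 2015, then 2017).
faq_records = [
    "C00384516|N|M2|P|201702039042410894|15|IND|SABOURIN, JOE|LOOKOUT MOUNTAIN|GA|028956146|UNUM|SVP, CORPORATE COMMUNICATIONS|01312016|484||PR2283904845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029339",
    "C00384516|N|M2|P|201702039042410894|15|IND|SABOURIN, JOE|LOOKOUT MOUNTAIN|GA|028956146|UNUM|SVP, CORPORATE COMMUNICATIONS|01312015|384||PR2283904845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029339",
    "C00384516|N|M2|P|201702039042410893|15|IND|SABOURIN, JOE|LOOKOUT MOUNTAIN|GA|028956146|UNUM|SVP, CORPORATE COMMUNICATIONS|01312017|230||PR1890575345050|1147350||P/R DEDUCTION ($115.00 BI-WEEKLY)|4020820171370029335",
]
for out in process(faq_records, 30):
    print(out)  # C00384516|02895|2017|230|230|1
```

Tracking only the earliest year per donor is enough here, since a record is flagged exactly when some earlier line in the file carries an earlier year. The sorted list keeps the sketch short; for very large groups, a structure with cheaper insertion (two heaps, for example) might scale better.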
312 | 313 | ##### Example 314 | 315 | **`percentile.txt`** 316 | > **30** 317 | 318 | **`itcont.txt`** 319 | 320 | > **C00384516**|N|M2|P|201702039042410894|15|IND|**SABOURIN, JOE**|LOOKOUT MOUNTAIN|GA|**028956146**|UNUM|SVP, CORPORATE COMMUNICATIONS|**01312016**|**484**||PR2283904845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029339 321 | 322 | > **C00384516**|N|M2|P|201702039042410894|15|IND|**SABOURIN, JOE**|LOOKOUT MOUNTAIN|GA|**028956146**|UNUM|SVP, CORPORATE COMMUNICATIONS|**01312015**|**384**||PR2283904845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029339 323 | 324 | > **C00384516**|N|M2|P|201702039042410893|15|IND|**SABOURIN, JOE**|LOOKOUT MOUNTAIN|GA|**028956146**|UNUM|SVP, CORPORATE COMMUNICATIONS|**01312017**|**230**||PR1890575345050|1147350||P/R DEDUCTION ($115.00 BI-WEEKLY)|4020820171370029335 325 | 326 | **`repeat_donors.txt`** 327 | 328 | C00384516|02895|2017|230|230|1 329 | 330 | ### The FEC website describes the TRANSACTION_AMT field as NUMBER(14, 2). What does that mean? 331 | 332 | NUMBER(14,2) means the field is capable of holding a number with a maximum precision of 14 (total digits) and maximum scale of 2 (digits after the decimal point). For instance, both 10000.99 and 10000 would be valid transaction amounts. 333 | 334 | ### Which Github link should I submit? 335 | You should submit the URL for the top-level root of your repository. For example, this repo would be submitted by copying the URL `https://github.com/InsightDataScience/donation-analytics` into the appropriate field on the application. **Do NOT try to submit your coding challenge using a pull request**, which would make your source code publicly available. 336 | 337 | ### Do I need a private Github repo? 338 | No, you may use a public repo; there is no need to purchase a private repo. You may also submit a link to a Bitbucket repo if you prefer. 339 | 340 | ### May I use R, Matlab, or other analytics programming languages to solve the challenge? 
341 | It's important that your implementation scales to handle large amounts of data. While many of our Fellows have experience with R and Matlab, applicants have found that these languages are unable to process data in a scalable fashion, so you must consider another language. 342 | 343 | ### May I use distributed technologies like Hadoop or Spark? 344 | Your code will be tested on a single machine, so using these technologies will negatively impact your solution. We're not testing your knowledge of distributed computing, but rather of computer science fundamentals and software engineering best practices. 345 | 346 | ### What sort of system should I use to run my program on (Windows, Linux, Mac)? 347 | You may write your solution on any system, but your source code should be portable and work on all systems. Additionally, your `run.sh` must be able to run on either Unix or Linux, as that's the system that will be used for testing. Linux machines are the industry standard for most data engineering teams, so it is helpful to be familiar with them. If you're currently using Windows, we recommend installing a virtual Unix environment, such as VirtualBox or VMware, and using that to develop your code. Otherwise, you also could use tools, such as Cygwin or Docker, or a free online IDE such as Cloud9. 348 | 349 | ### How fast should my program run? 350 | While there are no strict performance guidelines to this coding challenge, we will consider the amount of time your program takes when grading the challenge. Therefore, you should design and develop your program in the optimal way (i.e. think about time and space complexity instead of trying to hit a specific run time value). 351 | 352 | ### Can I use pre-built packages, modules, or libraries? 353 | This coding challenge can be completed without any "exotic" packages. While you may use publicly available packages, modules, or libraries, you must document any dependencies in your accompanying README file. 
When we review your submission, we will download these libraries and attempt to run your program. If you do use a package, you should always ensure that the module you're using works efficiently for the specific use case in the challenge, since many libraries are not designed for large amounts of data. 354 | 355 | ### Will you email me if my code doesn't run? 356 | Unfortunately, we receive hundreds of submissions in a very short time and are unable to email individuals if their code doesn't compile or run. This is why it's so important to document any dependencies you have, as described in the previous question. We will do everything we can to properly test your code, but this requires good documentation. Moreover, we have provided a test suite so you can confirm that your directory structure and format are correct. 357 | 358 | ### Can I use a database engine? 359 | This coding challenge can be completed without the use of a database. However, if you use one, it must be a publicly available one that can be easily installed with minimal configuration. 360 | 361 | ### Do I need to use multi-threading? 362 | No, your solution doesn't necessarily need to include multi-threading - there are many solutions that don't require multiple threads/cores or any distributed systems, but instead use efficient data structures. 363 | 364 | ### What should the format of the output be? 365 | In order to be tested correctly, you must use the format described above. You can ensure that you have the correct format by using the testing suite we've included. 366 | 367 | ### Should I check if the files in the input directory are text files or non-text (binary) files? 368 | No, for simplicity you may assume that all of the files in the input directory are text files, with the format as described above. 369 | 370 | ### Can I use an IDE like Eclipse or IntelliJ to write my program? 
371 | Yes, you can use whatever tools you want - as long as your `run.sh` script correctly runs the relevant target files and creates the `repeat_donors.txt` file in the `output` directory. 372 | 373 | ### What should be in the input directory? 374 | You can put any text file you want in the directory since our testing suite will replace it. Indeed, using your own input files would be quite useful for testing. The file size limit on GitHub is 100 MB so you won't be able to include the larger sample input files in your `input` directory. 375 | 376 | ### How will the coding challenge be evaluated? 377 | Generally, we will evaluate your coding challenge with a testing suite that provides a variety of inputs and checks the corresponding output. This suite will attempt to use your `run.sh` and is fairly tolerant of different runtime environments. Of course, there are many aspects (e.g. clean code, documentation) that cannot be tested by our suite, so each submission will also be reviewed manually by a data engineer. 378 | 379 | ### How long will it take for me to hear back from you about my submission? 380 | We receive hundreds of submissions and try to evaluate them all in a timely manner. We try to get back to all applicants **within two or three weeks** of submission, but if you have a specific deadline that requires expedited review, please email us at `cc@insightdataengineering.com`. 381 | --------------------------------------------------------------------------------