├── .gitignore ├── CHANGELOG.md ├── CONTRIBUTING.md ├── README.md ├── documentation ├── ABOUT.md ├── DATA.md ├── README.md ├── SETUP.md └── TERMS.md └── scripts ├── README.md ├── api.py ├── download.sh ├── error.py ├── makecsv.py ├── master.py ├── requirements.txt ├── rxnorm.py ├── unzip.sh └── xpath.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | /tmp -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | ## 0.1.0 2 | 3 | - Initial development release 4 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | ## Contributing to Pillbox 2 | 3 | Welcome to the Pillbox Data Process. Below is a guide for getting started participating in the project. Whether you are a user of the data, interested in how the data is generated, or want to improve the quality of the data, the sections below outline how to get started contributing. 4 | 5 | ### How to contribute 6 | 7 | There are two main ways to contribute to the project: 8 | 9 | 1. Report on data quality or errors 10 | 2. Improve the data generation process 11 | 12 | ### Report on Data Quality or Errors 13 | 14 | The first way to get involved in the Pillbox project is to participate in flagging errors or sending reports on data quality. The project uses the [Issue tracker](https://github.com/HHS/pillbox-data-process/issues) to manage and discuss reports and questions about the project. 15 | 16 | Here's a short list of items to consider before submitting an issue: 17 | 18 | - Please search for your issue before filing it; errors or quality issues may have already been reported. 19 | - If you encountered an error in the data process, describe specifically what the error was and how to replicate it. 20 | - Please keep issues professional and straightforward. This is an evolving effort and we look to the community to help improve the quality and the process. 21 | - Please use the Labels to mark the type of issue. We've pre-categorized labels we think are currently appropriate; these help provide context. 22 | 23 | ### Improve Data Generation Process 24 | 25 | The second way to get involved in the Pillbox project is to contribute to improving the data process. The data generation process is a series of Python scripts that download, unzip, and parse FDA SPL XML data. The goal is for this process to continually improve. We see this happening in two ways: 26 | 27 | #### 1. Improving error handling 28 | 29 | The FDA SPL data consistently experiences quality issues. Occasionally these issues can break the data process. If you find that the data process breaks for you, please contribute back in the following ways: 30 | 31 | 1. Debug the error and submit a Pull Request with the new code that handles the error. 32 | 2. Submit an issue reporting the new error and any suggestions for how to handle it. 33 | 34 | #### 2. Improving secondary data products 35 | 36 | The main Pillbox product is the `spl_data.csv` file. This is the master dataset that is made available. In addition, secondary data products are being made available. These include individual JSON files for each product and growing static API access to the data.
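As a hypothetical illustration of what a secondary data product can look like, the short sketch below reads the processed `spl_data.csv` and writes a JSON index of products grouped by shape — the kind of "new slice of the data" described next. The column names (`setid`, `SPLSHAPE`) are the ones written by `makecsv.py`; the output filename is invented for this example.

```python
#!/usr/bin/python
# Hypothetical secondary data product: index setids by shape from spl_data.csv.
import csv
import simplejson as json

shapes = {}

with open('../tmp/processed/csv/spl_data.csv', 'rb') as f:
    for row in csv.DictReader(f):
        # Group each product's setid under its SPLSHAPE value
        shapes.setdefault(row['SPLSHAPE'], []).append(row['setid'])

with open('shape_index.json', 'wb') as f_out:
    f_out.write(json.dumps(shapes, sort_keys=True, separators=(',', ':')))

print "Wrote shape index for %d shapes." % len(shapes)
```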
37 | 38 | If you want to contribute to creating new secondary data products, you can help us in two ways: 39 | 40 | 1. Contribute code to `api.py` or create a new Python script and submit a Pull Request. 41 | - This is the best way to add to the process by creating new slices of the data or generating a unique analysis of the data. 42 | 2. Recommend improvements in the Issue tracker. 43 | - Submit a recommendation by creating a new ticket for discussion. Another developer may then be able to implement and contribute code based on the recommendation. 44 | 45 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Pillbox for Developers 2 | ### Pillbox Data Process 3 | 4 | Pillbox, an initiative of the National Library of Medicine at the National Institutes of Health, provides data and images for prescription, over-the-counter, homeopathic, and veterinary oral solid dosage medications (pills) marketed in the United States of America. This data set contains information about pills such as how they look, their active and inactive ingredients, and many other criteria. 5 | 6 | Pillbox's primary data source (FDA drug labels) is complex and does not organize information based on individual pills. Additionally, there are very few pill images available in the source data. The Pillbox initiative has focused on restructuring the source data, incorporating data from other related data sets, and creating a library of pill images. 7 | 8 | A major function of the initiative was the development of a data process which ingests the source data and produces an easy-to-use, "pill-focused" dataset. 9 | 10 | This repository contains the code for that process. It is intended to give developers greater flexibility in using this data, as well as to expand the scope of and refine the data process. This repository will continue to grow, and access to the data will continue to improve. 11 | 12 | ### Get started using or contributing to the code 13 | 14 | - [Read more](https://github.com/HHS/pillbox-data-process/blob/master/documentation/SETUP.md) about setting up your local environment 15 | - Start [contributing](https://github.com/HHS/pillbox-data-process/blob/master/CONTRIBUTING.md) to the development 16 | 17 | ### Uses of this data 18 | 19 | - Identify unknown pills based on their physical appearance 20 | - Assist in development of electronic health records, medication information systems, and adherence/reminder/tracking tools 21 | - Support research in areas such as informatics and image processing 22 | 23 | ### Warning 24 | 25 | Pillbox's source data is known to have errors and inconsistencies. Read [this document](https://github.com/HHS/pillbox-data-process/blob/master/documentation/DATA.md) before working with Pillbox's API, data, and images. 26 | 27 | [Read more about Pillbox](https://github.com/HHS/pillbox-data-process/blob/master/documentation/ABOUT.md). 28 | -------------------------------------------------------------------------------- /documentation/ABOUT.md: -------------------------------------------------------------------------------- 1 | ## About Pillbox 2 | 3 | Pillbox is a resource of the U.S. National Library of Medicine, part of the National Institutes of Health, U.S. Department of Health and Human Services. Pillbox is a United States government resource.
4 | 5 | ### Overview 6 | Pillbox is a database of human prescription, over-the-counter, homeopathic, and veterinary oral solid dosage medications (pills) marketed in the United States of America. This data set contains information about pills such as how they look, their ingredients, and other criteria. This data can be used to identify unknown pills based on their physical appearance. It also answers questions like "What pills contain acetaminophen?" or "What pills contain lactose as an inactive ingredient?" The data contain unique identifiers, such as the RXCUI and FDA product code. 7 | 8 | ### Notice: Non-verified Data 9 | 10 | Pillbox's data is created by combining drug information resources from the Food and Drug Administration (FDA) and National Library of Medicine (NLM) at the National Institutes of Health. This information has been reformatted to make it easier to work with but has not been verified by FDA or NLM. The information available for download may not be the labeling on currently distributed products or identical to the labeling that is approved. NLM makes no warranty that the data is error free. 11 | 12 | #### Disclaimer 13 | 14 | The pill images and accompanying data available here were obtained from products acquired from a licensed pharmacy or the product manufacturer. Manufacturers may alter the appearance (e.g., shape, color, size, markings) of medications over time. 15 | 16 | The same medication may have been issued with a different appearance and/or different accompanying data before or after the date NLM acquired it. NLM would like to hear about any changes in medication appearance or possible errors in accompanying information. Please contact Pillbox by posting an issue in the Issue queue or sending an email to pillbox@mail.nih.gov if you notice any discrepancies in the information provided here. 17 | 18 | Reference in this data to any specific commercial product, process, service, manufacturer, or company does not constitute its endorsement or recommendation by the U.S. government or the U.S. Department of Health and Human Services or any of its agencies. 19 | 20 | Neither the U.S. government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information disclosed. 21 | 22 | ### Terms of Use 23 | 24 | Use of this data is subject to the [Terms of Service document](https://github.com/HHS/pillbox-data-process/blob/master/documentation/TERMS.md) made available in this repository. 25 | -------------------------------------------------------------------------------- /documentation/DATA.md: -------------------------------------------------------------------------------- 1 | ## Overview 2 | 3 | Many developers new to working with medication-related data are unaware of potential errors and discrepancies within the data. This document provides an overview of many issues present in the Pillbox data set. It is intended as an educational resource to 1) help developers better understand the data, and 2) illustrate the scope and limitations of working with this data, which is critical when creating applications that identify unknown medications or provide information to clinicians, patients, and others. 4 | 5 | It is also intended to be a starting point for discussion and further exploration of this data, with the goal of improving downstream utilization by a growing community of innovative individuals and groups seeking to solve challenges not only in medication identification and reference, but across health care.
6 | 7 | This report is not intended to definitively confirm the presence of an error in Structured Product Labeling data supplied by a submitting firm. Rather, it highlights potential errors, discrepancies, and data of interest based on analysis of data that falls outside expected parameters. It is not intended to be comprehensive. 8 | 9 | Pillbox's data is derived from the Structured Product Labeling, obtained via NLM's DailyMed, and NLM's RxNorm, a normalized naming system for generic and branded drugs. 10 | 11 | ### Data changes and transparency 12 | 13 | Because changes are now being made to the data based on comparison of pill images to the physical characteristics data, every effort is being made to be as transparent as possible about this process. This includes making a table of all changes made to the data available for download, web pages which list every change made along with the image used for comparison, and this document which defines the scope and methodology used to make these changes. 14 | 15 | The scope of these changes is not intended to be complete. In situations where ambiguity exists, the original data is not modified. For records where there is no image available, it is not possible to verify the physical characteristics data. 16 | 17 | Data changes in Pillbox do not affect the master data derived from the drug labels, available via DailyMed. 18 | 19 | ### Identification of data errors and discrepancies 20 | 21 | Errors or discrepancies in Pillbox's data can be identified either visually (comparing pill images to data) or algorithmically (querying data based on logical assumptions about the data). 22 | 23 | To the best of the Pillbox team's knowledge, no federal agency reviews the physical characteristics data for pills. It is our hope that Pillbox (and the pill images produced through the project) will be a catalyst for the development of manual or automated validation systems for these data and improvement in the overall accuracy of these data. 24 | 25 | ### Three unique problems 26 | 27 | #### Re-labeling and re-distribution 28 | 29 | A unique type of discrepancy can occur when a pill is marketed by more than one company. Company A may manufacture and market a drug. They may also allow other companies to distribute and market that same drug. Each company must submit a separate drug label, and that same pill will have a different National Drug Code (NDC) in each label. Some distribution chains may be quite large. The small, brown, 200 mg ibuprofen tablet with an imprint of "I2", for example, is distributed under almost 80 different NDCs and labels. 30 | 31 | An upcoming data release of Pillbox will group pills by physical characteristics, ingredients and strength, and other criteria so that each unique pill has only one record. In practice this may not be possible, as some label authors have changed one or more of the physical characteristics or other data from the source label. 32 | 33 | The benefits of organizing pills by original manufactured products extend beyond simplifying identification and improving user experience. If an issue, such as contamination, should ever occur with a medication, identifying all distribution points for that medication is critical for public safety. Developers should have easy access to that information. 34 | 35 | #### Manufacturers changing the appearance of a pill 36 | 37 | FDA guidance requires (citation needed) that if the physical characteristics of a pill change, then that product requires a new NDC.
The most common change made to a pill is the imprint. When a manufacturer changes the physical characteristics of a pill and does not apply for a new NDC, it creates a conflict in presenting the data, especially if there is an image for the pill. 38 | 39 | For example, Company A has a pill with an imprint "123 10". They then change the imprint to "A 10". For a certain time, both pills will be available. If the imprint in the image differs from the data, it may be difficult to determine whether 1) the imprint data is incorrect or 2) the imprint has changed but the original NDC was kept. Also, if both pills are present in the market, there should be two separate records, as users could be trying to identify either pill. 40 | 41 | #### Identifying pills that are no longer marketed 42 | 43 | Pillbox was not designed to be an archival resource. It was intended to reflect the current information available via its sources. The data process which creates Pillbox takes current data from DailyMed and RxNorm and parses individual products (pills). Cases exist, however, where a user is trying to identify an older medication, stored for years in a medicine cabinet. In disaster response situations, medications which are past the expiration date may be used if certain criteria are met and tests show the medications have retained their potency. 44 | 45 | This issue will also be addressed by the upcoming data release in which pills will be grouped by physical characteristics. It has yet to be determined how far back in time to go when looking for unique pills. Also, without images it will be difficult to verify the accuracy of the physical characteristics used to group the pills. This will result in a greater number of groups, with some groups being created based on inaccurate data. Groupings based on accurate data will be unaffected. 46 | 47 | ### Visually identifying data issues 48 | 49 | FDA publishes guidance for coding the physical characteristics (imprint, color, shape, size, score) of pills. Based on a review of the 2,159 images available via Pillbox as of July 2013, changes were made to the physical characteristics data of approximately 17% (359) of records for which there was an image. 50 | 51 | As Pillbox increases the number of standardized, high quality images available, those images will be compared to the data for each product to ensure the physical characteristics (imprint, color, shape, size, score) data match the images. While many of these errors are more easily identified than others (a round pill listed as square), some criteria are more subjective or nuanced. 52 | 53 | Where data is modified, the goals are to improve search results without introducing ambiguity and to accurately represent the text that appears on a pill. All changes made to the data are listed in the trade_dress_change_log table. 54 | 55 | #### Imprint 56 | 57 | Before continuing, you should read the FDA form and submission requirements for [imprint](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071810.htm). 58 | 59 | Imprint is perhaps the single best identifier for a pill and presents challenges when developing search logic. While relatively few errors of commission (typographic errors) have been found in the data based on a comparison to available pill images, a number of other factors are present. 60 | 61 | Issues encountered: 62 | * Company or drug names are sometimes omitted from the imprint data.
63 | * Descriptive text is included in the imprint value (ex: "A;10;company logo") 64 | * Imprint has been changed by the manufacturer 65 | * Some imprint values are formatted inconsistently in a way that may affect search results 66 | 67 | Additional rules (several of these are illustrated in the short sketch after the Color section below): 68 | * Trailing semi-colons in the data are removed (ex: "A;10;" changed to "A;10") 69 | * Dashes, decimal points, slashes and spaces are replaced with semi-colons 70 | * Stylized single letters (which often appear on pills) are entered as part of the imprint value 71 | * Text that appears on a separate line or is separated by a score line is separated in the imprint value by semi-colons 72 | * If text is repeated on a pill (ex: a scored pill with the number 10 on both sides or a pill with the letter A appearing multiple times around the edge of a pill) it is entered as separate text. This will increase the likelihood of an exact match while not interfering with search strings that only include one iteration of repeated text, and errs on the side of an accurate representation of the look of the pill. 73 | * Text that crosses itself (ex: BAYER written vertically and horizontally, crossing at the Y) is entered as separate values, separated by a semi-colon. 74 | 75 | One additional area of concern related to imprint values is characters which look similar. Text on a pill is often small and the various imprinting processes may render text that is difficult to read. Users may not be able to accurately identify a character in situations like these. 76 | 77 | * lower-case L vs the number one (1) 78 | * Upper-case O vs the number zero (0) 79 | 80 | #### Color 81 | 82 | Before continuing, you should read the FDA form and submission requirements for [color](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071794.htm). 83 | 84 | Color is the most subjective of the physical characteristics; however, it is one of the most likely to be used by an individual describing a pill. Existing guidance specifies RGB values for each of the 12 colors. Color perception varies greatly among individuals and is subject to ambient lighting conditions (indoor fluorescent and incandescent light sources, sunlight, reflected light, etc.). As such, similarly colored pills may be listed as a variety of similar colors, such as red/orange/brown or blue/turquoise/green. 85 | 86 | Issues encountered: 87 | * Pills listed as a color that is obviously different from the predominant color present in the image. In these situations the color value was changed to that of the predominant color. The new color is subject to the subjectivity described previously. 88 | 89 | Additional rules: 90 | * Though the guidance specifies that there should be only one value present for color, it is common to see two colors listed. This practice is upheld in Pillbox's data. 91 | * For pills that have more than one distinct color (a capsule with a pink cap and white base), the secondary color is added. 92 | * If the labeler lists a single color and the pill could also be described by a second color, that color may be added. 93 | * When more than one color is listed, if there is a predominant color (such as a yellow pill with a small white section in the middle) the predominant color is listed first. This provides the potential to enhance search and more accurately describe the pill without negatively affecting search results. 94 | * Double color listings (ex: white/white) are changed to a single value of that color.
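To make these conventions concrete, here is a minimal sketch — not part of the Pillbox data process itself — of how a few of the imprint and color normalization rules above might be applied. The function names are invented for illustration, and the `white/white` example mirrors the one given in the color rules:

```python
#!/usr/bin/python
# Sketch of a few imprint/color normalization rules described above.
import re

def normalize_imprint(imprint):
    # Dashes, decimal points, slashes and spaces are replaced with semi-colons
    value = re.sub(r'[-./ ]+', ';', imprint.strip())
    # Trailing semi-colons are removed (ex: "A;10;" becomes "A;10")
    return value.rstrip(';')

def normalize_color(color):
    # Double color listings (ex: "white/white") collapse to a single value
    parts = color.split('/')
    if len(parts) == 2 and parts[0] == parts[1]:
        return parts[0]
    return color

if __name__ == "__main__":
    print normalize_imprint("A-10 ")      # A;10
    print normalize_imprint("A;10;")      # A;10
    print normalize_color("white/white")  # white
```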
95 | 96 | The NLM Pillbox SPLIMAGE pill image specification creates images under standardized lighting conditions. It is hoped that these images will lead to development of an automated system to accurately define the predominant color of a pill and create a palette of colors that is representative of the colors present. 97 | 98 | #### Shape 99 | 100 | Before continuing, you should read the FDA form and submission requirements for [shape](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071802.htm). 101 | 102 | The guidance for shape has two specific nuances that may not be obvious. First, when dealing with multi-sided shapes, such as pentagons (5-sided) or hexagons (6-sided), the sides do not need to be equal in length. Also, sides do not need to be straight. Thus a multi-sided shape with unequal or slightly curved sides should still be coded as that shape rather than as freeform. 103 | 104 | Issues encountered: 105 | * Shapes listed as freeform that are actually multi-sided shapes 106 | * Capsules (two-part capsules) listed as oval. 107 | * Oval tablets listed as capsule. Even if the medication name includes the word "capsule", if it is not a two-part capsule or a banded two-part capsule it should be listed as oval. 108 | 109 | #### Size 110 | 111 | Before continuing, you should read the FDA form and submission requirements for [size](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071800.htm). 112 | 113 | Without having a pill to measure, it is difficult to identify inaccuracies in the size values. The NLM Pillbox SPLIMAGE pill image specification includes a ruler in the image, however, so some size discrepancies can still be identified. 114 | 115 | Issues encountered: 116 | * Pills whose size value differs from the ruler present in the image. 117 | 118 | Additional rules: 119 | * If the size as measured in an SPLIMAGE spec image varies by more than 2 mm from the size value provided, it is changed to match the value as measured in the image. 120 | 121 | #### Score 122 | 123 | Before continuing, you should read the FDA form and submission requirements for [score](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071805.htm). 124 | 125 | Issues encountered: 126 | * Incorrect score values entered 127 | * Pills with different score values on either side 128 | * Though the guidance states score refers to the pill being broken into "equal sized pieces", there exists a pill that is scored to be broken into unequal sized pieces (different dosages) 129 | * Some score lines are faint or only present near the edges of the pill (not a continuous line across the pill) 130 | 131 | Additional rules: 132 | * When a pill is scored differently on either side (one side can be divided into two pieces, the other side can be divided into three pieces), the higher score value is entered. This is consistent with the one known example. 133 | 134 | ### Algorithmically identifying data issues 135 | 136 | Logical assumptions about the data and relationships between data can identify potential issues. These issues are not addressed in Pillbox's data unless there is an image present. 137 | 138 | #### NULL values 139 | 140 | On occasion, certain values in the data are NULL.
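Several of the checks in this section — the NULL values noted here and the size outliers listed below — can be expressed as simple queries over `spl_data.csv`. The following rough sketch is not part of the actual data process; the column names are those written by `makecsv.py`, and missing values are assumed to appear as empty strings in the CSV:

```python
#!/usr/bin/python
# Sketch of algorithmic checks: NULL physical characteristics and size outliers.
import csv

CHECK_FIELDS = ['SPLIMPRINT', 'SPLCOLOR', 'SPLSHAPE', 'SPLSIZE', 'SPLSCORE']

with open('../tmp/processed/csv/spl_data.csv', 'rb') as f:
    for row in csv.DictReader(f):
        # Flag records with NULL (empty) physical characteristics values
        missing = [h for h in CHECK_FIELDS if row[h] in ('', 'None')]
        if missing:
            print "%s missing: %s" % (row['setid'], ", ".join(missing))
        # Flag size outliers: extremely large (30+ mm) or small (0-1 mm) values
        try:
            size = float(row['SPLSIZE'])
            if size >= 30 or size <= 1:
                print "%s size outlier: %s mm" % (row['setid'], row['SPLSIZE'])
        except ValueError:
            pass
```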
141 | 142 | #### Capsules with score = 1 143 | 144 | #### Size outliers 145 | 146 | * Pills with extremely large (30+ mm) or small (0 or 1 mm) size values 147 | 148 | #### Size with a decimal point 149 | 150 | #### Pills listed as REMAINDER 151 | 152 | #### DEA schedule is NULL 153 | 154 | NULL is a valid value for DEA schedule. It implies that the drug is not scheduled. However, when you see a NULL value you cannot definitively know if the drug is not scheduled or the labeler forgot to list the value. There are records for drugs that are on the DEA schedule, but some of those records list the DEA schedule as NULL. 155 | 156 | #### Duplicate records 157 | 158 | ### Other techniques for identifying issues in the data 159 | 160 | #### Data which is not normalized (inactive ingredients, author) 161 | 162 | #### Comparison of records believed to be redistributed products 163 | 164 | #### Faceting via OpenRefine -------------------------------------------------------------------------------- /documentation/README.md: -------------------------------------------------------------------------------- 1 | ## Pillbox Project Documentation 2 | 3 | This folder contains information about the Pillbox project and about contributing to and using the project data. 4 | 5 | - [About Pillbox](https://github.com/HHS/pillbox-data-process/blob/master/documentation/ABOUT.md) 6 | - [Pillbox Data Overview](https://github.com/HHS/pillbox-data-process/blob/master/documentation/DATA.md) 7 | - [Data process setup](https://github.com/HHS/pillbox-data-process/blob/master/documentation/SETUP.md) 8 | - [Terms](https://github.com/HHS/pillbox-data-process/blob/master/documentation/TERMS.md) 9 | -------------------------------------------------------------------------------- /documentation/SETUP.md: -------------------------------------------------------------------------------- 1 | ## Setting up your local environment 2 | 3 | #### Setting up on Ubuntu 4 | 5 | 1. To set up on a clean Ubuntu installation, run the following commands to install the necessary requirements: 6 | 7 | ``` 8 | apt-get update 9 | apt-get install git -y 10 | apt-get install unzip -y 11 | sudo apt-get install python-setuptools python-dev build-essential -y 12 | apt-get install libxml2-dev -y 13 | apt-get install libxslt1-dev -y 14 | easy_install -U setuptools 15 | apt-get install python-pip 16 | pip install lxml 17 | pip install requests simplejson 18 | ``` 19 | 20 | 2. Clone this repo locally using the `git clone` command. This requires a Github account. 21 | ``` 22 | git clone git@github.com:HHS/pillbox-data-process.git 23 | ``` 24 | 25 | 3. Follow [steps for data process](https://github.com/HHS/pillbox-data-process/tree/master/scripts#pillbox-data-process). 26 | 27 | #### Setting up on Mac OSX 28 | 29 | Latest versions of OSX come with Python 2.7 installed. To run the Pillbox process, additional packages need to be installed. This assumes [Xcode](https://developer.apple.com/xcode/downloads/) & command line tools are installed. If not, install Xcode first. 30 | 31 | 1. Install pip 32 | ``` 33 | sudo easy_install pip 34 | ``` 35 | 36 | 2. Clone this repo locally using `git clone`. This requires a Github account. 37 | 38 | ``` 39 | git clone git@github.com:HHS/pillbox-data-process.git 40 | ``` 41 | 42 | 3. Install Python requirements for Pillbox 43 | ``` 44 | cd pillbox-data-process 45 | cd scripts 46 | sudo pip install -r requirements.txt 47 | ``` 48 | 49 | 4. Follow [steps for data process](https://github.com/HHS/pillbox-data-process/tree/master/scripts#pillbox-data-process).
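On either platform, you can optionally confirm that the required Python packages (the ones listed in `requirements.txt`) import correctly before running the data process — this quick check is not part of the official setup steps:

```
python -c "import lxml, requests, simplejson; print 'requirements OK'"
```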
50 | -------------------------------------------------------------------------------- /documentation/TERMS.md: -------------------------------------------------------------------------------- 1 | ## Terms of Service 2 | 3 | The U.S. National Library of Medicine ("NLM") offers some of its public data in machine readable format via an Application Programming Interface ("API"). This service is offered subject to your acceptance of the terms and conditions contained herein as well as any relevant sections of the www.nlm.nih.gov Website Policies and Privacy Policy (collectively, the "Agreements"). 4 | 5 | ### Scope 6 | 7 | All of the content, documentation, code and related materials made available to you through the API is subject to these terms. Access to or use of the API or its content constitutes acceptance of this Agreement. 8 | 9 | ### Use 10 | 11 | You may use the Pillbox API to develop a service or services to search, display, analyze, retrieve, view and otherwise 'get' information from NLM Pillbox data. 12 | 13 | ### Attribution 14 | 15 | All services which utilize or access the API should display the following notice prominently within the application: "This product uses publicly available data from the U.S. National Library of Medicine (NLM), National Institutes of Health, Department of Health and Human Services; NLM is not responsible for the product and does not endorse or recommend this or any other product." You may use the NLM name in order to identify the source of API content subject to these rules. You may not use the NLM name, logo, or the like to imply endorsement of any product, service, or entity, not-for-profit, commercial or otherwise. 16 | 17 | ### Modification or False Representation of Content 18 | 19 | You may not modify or falsely represent content accessed through the API and still claim the source is NLM. 20 | 21 | ### Right to Limit 22 | 23 | Your use of the API may be subject to certain limitations on access, calls, or use as set forth within this Agreement or otherwise provided by NLM. If NLM reasonably believes that you have attempted to exceed or circumvent these limits, your ability to use the API may be permanently or temporarily blocked. NLM may monitor your use of the API to improve the service or to ensure compliance with this Agreement. 24 | 25 | ### Service Termination 26 | 27 | If you wish to terminate this Agreement, you may do so by refraining from further use of the API. NLM reserves the right (though not the obligation) to (1) refuse to provide the API to you, if it is NLM's opinion that your use violates any NLM policy, or (2) terminate or deny you access to and use of all or part of the API at any time for any other reason in its sole discretion. All provisions of this Agreement which by their nature should survive termination shall survive termination including, without limitation, warranty disclaimers, indemnity, and limitations of liability. 28 | 29 | ### Changes 30 | 31 | NLM reserves the right, at its sole discretion, to modify or replace this Agreement, in whole or in part. Your continued use of or access to the API following posting of any changes to this Agreement constitutes acceptance of those modified terms. NLM may, in the future, offer new services and/or features through the API. Such new features and/or services shall be subject to the terms and conditions of this Agreement. 32 | 33 | ### Disclaimer of Warranties 34 | 35 | The API is provided "as is" and on an "as-available" basis.
NLM hereby disclaims all warranties of any kind, express or implied, including without limitation the warranties of merchantability, fitness for a particular purpose, and non-infringement. NLM makes no warranty that the API will be error free or that access thereto will be continuous or uninterrupted. 36 | 37 | ### Limitations on Liability 38 | 39 | In no event will NLM be liable with respect to any subject matter of this Agreement under any contract, negligence, strict liability or other legal or equitable theory for: 40 | 41 | any special, incidental, or consequential damages; the cost of procurement of substitute products or services; or interruption of use or loss or corruption of data. 42 | 43 | ### General Representations 44 | 45 | You hereby warrant that (1) your use of the API will be in strict accordance with the NLM privacy policy, this Agreement, and all applicable laws and regulations, and (2) your use of the API will not infringe or misappropriate the intellectual property rights of any third party. 46 | 47 | ### Indemnification 48 | 49 | You agree to indemnify and hold harmless NLM, its contractors, employees, agents, and the like from and against any and all claims and expenses, including attorney's fees, arising out of your use of the API, including but not limited to violation of this Agreement. 50 | 51 | ### Miscellaneous 52 | 53 | This Agreement constitutes the entire Agreement between NLM and you concerning the subject matter hereof, and may only be modified by the posting of a revised version on this page by NLM. 54 | 55 | ### Disputes 56 | 57 | Any disputes arising out of this Agreement and access to or use of the API shall be governed by federal law. 58 | 59 | ### No Waiver of Rights 60 | 61 | NLM's failure to exercise or enforce any right or provision of this Agreement shall not constitute a waiver of such right or provision. -------------------------------------------------------------------------------- /scripts/README.md: -------------------------------------------------------------------------------- 1 | ## Pillbox Data Process 2 | 3 | The Pillbox data process uses a series of Python and Shell scripts to download, unzip, and parse the XML data provided by DailyMed. The process is broken into three phases: 4 | 5 | 1. Download (`download.sh`) and unzip (`unzip.sh`) 6 | 2. Process XML (`master.py`, `xpath.py`, `rxnorm.py`, `error.py`, `makecsv.py`) 7 | 3. Post-processing for static API and other outputs 8 | 9 | The scripts generate a series of directories under a `tmp` folder at the root of the repository. In addition, the `master.py` data process generates two main intermediate Pillbox outputs: `/processed/csv/spl_data.csv` & `/processed/json/`. 10 | 11 | #### Requirements 12 | 13 | - Python 2.6+ 14 | - [PIP](http://www.pip-installer.org/en/latest/installing.html#install-or-upgrade-pip) 15 | - `pip install -r requirements.txt` 16 | - unzip (if not on OSX) 17 | - wget (if not on Ubuntu) 18 | - 30+GB of free space 19 | 20 | ### Using the scripts 21 | 22 | #### 1. Download and Unzip 23 | The download and unzip scripts take a release date argument (ex: `2014-02-24`), download roughly 16GB of files from DailyMed, and unzip them into temporary folders. 24 | 25 | To run: 26 | 27 | ``` 28 | ./download.sh 2014-02-24 29 | ./unzip.sh 2014-02-24 30 | ``` 31 | 32 | #### 2. Process XML 33 | 34 | After the downloading is finished, to process the unzipped XML files, run `master.py`. This script will use the `xpath.py`, `rxnorm.py`, `error.py`, and `makecsv.py` modules. 35 | 36 | To run: 37 | 38 | ``` 39 | ./master.py 40 | ``` 41 | 42 | #### 3.
Post-processing 43 | 44 | To run any post-processing on the generated CSV or json files, run `api.py` or generate an additional script. 45 | 46 | To run: 47 | 48 | ``` 49 | ./api.py 50 | ``` 51 | -------------------------------------------------------------------------------- /scripts/api.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Import Python Modules 4 | import os 5 | import sys 6 | import time 7 | import traceback 8 | import csv 9 | import shutil 10 | import simplejson as json 11 | from datetime import datetime 12 | 13 | print "Processing data..." 14 | 15 | # Global variables 16 | authorList = [] 17 | author = {} 18 | 19 | colorList = [] 20 | color = {} 21 | 22 | def createIndex(): 23 | 24 | def authorIndex(data): 25 | global authorList 26 | global author 27 | 28 | if data['data']['author'] not in authorList: 29 | authorList.append(data['data']['author']) 30 | author[data['data']['author']] = [] 31 | 32 | # build author objects 33 | if data['data']['author'] != "": 34 | author[data['data']['author']].append(data['setid_product']) 35 | 36 | def colorIndex(data): 37 | global colorList 38 | global color 39 | 40 | if data['data']['SPLCOLOR'] not in colorList: 41 | colorList.append(data['data']['SPLCOLOR']) 42 | color[data['data']['SPLCOLOR']] = [] 43 | 44 | # build color objects 45 | # {'C48325': ['3CAF3F19-96B4-DAE6-35AA-05643BD531D2-58177-001']} 46 | if data['data']['SPLCOLOR'] != "": 47 | color[data['data']['SPLCOLOR']].append(data['setid_product']) 48 | 49 | os.chdir("../tmp/processed/json/") 50 | for fn in os.listdir('.'): 51 | if fn.endswith(".json"): 52 | data_file = open(fn, "rb").read() 53 | data = json.loads(data_file) 54 | authorIndex(data) 55 | colorIndex(data) 56 | 57 | def indexAPI(): 58 | global color 59 | global author 60 | 61 | authorIndex = [] 62 | for a,n in author.items(): 63 | authorJSON = { 64 | "author": "", 65 | "spl-id": [], 66 | "count": len(n) 67 | } 68 | authorJSON['author'] = a 69 | authorJSON['spl-id'] = n 70 | authorIndex.append(authorJSON) 71 | 72 | writeIndex(authorIndex, 'author') 73 | 74 | colorIndex = [] 75 | for c,n in color.items(): 76 | colorJSON = { 77 | "color": "", 78 | "spl-id": [], 79 | "count": len(n) 80 | } 81 | colorJSON['color'] = c 82 | colorJSON['spl-id'] = n 83 | colorIndex.append(colorJSON) 84 | 85 | writeIndex(colorIndex, 'color') 86 | 87 | def writeIndex(output, file_name): 88 | writeout = json.dumps(output, sort_keys=True, separators=(',',':')) 89 | f_out = open('../../../api/index/%s.json' % file_name, 'wb') 90 | f_out.writelines(writeout) 91 | f_out.close() 92 | print "%s index files created..." % file_name 93 | 94 | def copyProcessed(): 95 | os.chdir('../csv/') 96 | splSRC = ('spl_data.csv') 97 | ingredientSRC = ('spl_ingredients.csv') 98 | DST = ('../../../api/') 99 | shutil.copy(splSRC,DST) 100 | print "spl_data.csv copied." 101 | shutil.copy(ingredientSRC,DST) 102 | print "spl_ingredients.csv copied." 103 | 104 | if __name__ == "__main__": 105 | createIndex() 106 | indexAPI() 107 | copyProcessed() 108 | -------------------------------------------------------------------------------- /scripts/download.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | set -e 4 | 5 | STARTTIME=$(date +%s) 6 | date=$1 7 | if [ $date ]; then 8 | echo "processing..." 
9 | else 10 | echo "Error: no date entered with command (ex: 2014-02-24)" 11 | exit 1 12 | fi 13 | 14 | # make temp directories if do not exist 15 | mkdir -p ../tmp 16 | mkdir -p ../tmp/download 17 | mkdir -p ../tmp/download/$date 18 | 19 | tmpDIR=../tmp/ 20 | cd $tmpDIR 21 | 22 | # Download all Dailymed files 23 | wget -O download/$date/dm_spl_release_human_rx_part1.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_rx_part1.zip 24 | wget -O download/$date/dm_spl_release_human_rx_part2.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_rx_part2.zip 25 | wget -O download/$date/dm_spl_release_human_otc_part1.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_otc_part1.zip 26 | wget -O download/$date/dm_spl_release_human_otc_part2.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_otc_part2.zip 27 | wget -O download/$date/dm_spl_release_human_otc_part3.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_otc_part3.zip 28 | wget -O download/$date/dm_spl_release_homeopathic.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_homeopathic.zip 29 | wget -O download/$date/dm_spl_release_animal.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_animal.zip 30 | wget -O download/$date/dm_spl_release_remainder.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_remainder.zip 31 | 32 | echo "Dailymed files downloaded." 33 | echo "Downloading complete." 34 | ENDTIME=$(date +%s) 35 | TOTALTIME=$((($ENDTIME-$STARTTIME)/60)) 36 | echo "Processing took $TOTALTIME minutes to complete." 37 | -------------------------------------------------------------------------------- /scripts/error.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Import Python Modules 4 | import os 5 | import sys 6 | import time 7 | import traceback 8 | import simplejson as json 9 | from datetime import datetime 10 | 11 | now = datetime.now() 12 | 13 | error_count = 0 14 | 15 | errorFile = {"updated": now.strftime("%Y-%m-%d %H:%M"), "errors_total": "", "files": []} 16 | 17 | def xmlError(fn): 18 | global error_count 19 | exc_type, exc_value, exc_traceback = sys.exc_info() 20 | error = traceback.format_exc().splitlines() 21 | if error[-1] != "SystemExit: Not OSDF": 22 | error_count = error_count + 1 23 | errorPrint = {"file": fn, "error": error} 24 | errorFile['files'].append(errorPrint) 25 | 26 | def errorWrite(): 27 | global error_count 28 | errorFile['errors_total'] = error_count 29 | print error_count, "errors" 30 | writeout = json.dumps(errorFile, sort_keys=True, separators=(',',':')) 31 | f_out = open('../errors/errors.json', 'wb') 32 | f_out.writelines(writeout) 33 | f_out.close() 34 | -------------------------------------------------------------------------------- /scripts/makecsv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import os 4 | import sys 5 | import csv 6 | import simplejson as json 7 | from itertools import chain 8 | import datetime 9 | 10 | now = datetime.datetime.now() 11 | today = now.strftime("%Y-%m-%d %H:%M") 12 | 13 | dataHeader = [ 14 | "setid", 15 | "file_name", 16 | "medicine_name", 17 | "part_medicine_name", 18 | "product_code", 19 | "part_num", 20 | "ndc9", 21 | "author", 22 | "author_type", 23 | "effective_time", 24 | "DEA_SCHEDULE_CODE", 25 | "DEA_SCHEDULE_NAME", 26 | "MARKETING_ACT_CODE", 27 | "NDC", 28 | "SPLCOLOR", 29 | "SPLIMAGE", 30 | "SPLIMPRINT", 31 | 
"SPLSCORE", 32 | "SPLSHAPE", 33 | "SPLSIZE", 34 | "SPL_INACTIVE_ING", 35 | "SPL_INGREDIENTS", 36 | "SPL_STRENGTH", 37 | "document_type", 38 | "dosage_form", 39 | "rxcui", 40 | "rxstring", 41 | "rxtty", 42 | "source", 43 | "equal_product_code", 44 | "approval_code" 45 | ] 46 | 47 | ingredientsHeader = [ 48 | "product_code", 49 | "setid", 50 | "part_num", 51 | "numerator_value", 52 | "numerator_unit", 53 | "denominator_value", 54 | "denominator_unit", 55 | "base_of_strength" 56 | ] 57 | 58 | dataOutput = open('../tmp/processed/csv/spl_data.csv', 'wb') 59 | ingredientsOutput = open('../tmp/processed/csv/spl_ingredients.csv', 'wb') 60 | dataWriter = csv.writer(dataOutput, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC, lineterminator='\n') 61 | dataWriter.writerow(dataHeader) 62 | 63 | ingredientsWriter = csv.writer(ingredientsOutput, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC, lineterminator='\n') 64 | ingredientsWriter.writerow(ingredientsHeader) 65 | 66 | def makeCSV(xmlData): 67 | 68 | for x in xmlData: 69 | dataRow = [] 70 | for h in dataHeader: 71 | if h == 'SPL_INACTIVE_ING': 72 | if x['data'][h] == None: 73 | dataRow.append(x['data'][h]) 74 | else: 75 | dataRow.append(";".join(x['data'][h]).encode('ascii','ignore')) 76 | elif h == 'NDC': 77 | dataRow.append(";".join(x['data'][h]).encode('ascii','ignore')) 78 | elif h == 'SPL_INGREDIENTS': 79 | if x['data'][h] == None: 80 | dataRow.append(x['data'][h]) 81 | else: 82 | dataRow.append(";".join(x['data'][h]).encode('ascii','ignore')) 83 | elif h == 'SPL_STRENGTH': 84 | if x['data'][h] == None: 85 | dataRow.append(x['data'][h]) 86 | else: 87 | dataRow.append(";".join(x['data'][h]).encode('ascii','ignore')) 88 | elif h == 'part_num': 89 | dataRow.append(x['data'][h]) 90 | elif h == 'product_name': 91 | dataRow.append(x['data'][h].encode('ascii','ignore')) 92 | else: 93 | if x['data'][h] == None: 94 | dataRow.append(x['data'][h]) 95 | else: 96 | dataRow.append(x['data'][h].encode('ascii','ignore')) 97 | dataWriter.writerow(dataRow) 98 | if x['ingredients']: 99 | for a in x['ingredients']: 100 | try: 101 | if a['active_moiety_names']: 102 | ingredientsRow = [] 103 | for i in ingredientsHeader: 104 | idCodes = x['setid_product'].split("-") 105 | setid = "-".join(idCodes[:-3]) 106 | product_code = idCodes[-3] + "-" + idCodes[-2] 107 | part_num = idCodes[-1] 108 | if i == 'product_code': 109 | ingredientsRow.append(product_code) 110 | elif i == 'setid': 111 | ingredientsRow.append(setid) 112 | elif i == 'part_num': 113 | ingredientsRow.append(part_num) 114 | elif i == 'base_of_strength': 115 | try: 116 | ingredientsRow.append(";".join(a['active_moiety_names']).encode('ascii','ignore')) 117 | except: 118 | ingredientsRow.append(None) 119 | else: 120 | try: 121 | ingredientsRow.append(a[i].encode('ascii','ignore')) 122 | except: 123 | ingredientsRow.append(None) 124 | ingredientsWriter.writerow(ingredientsRow) 125 | except: 126 | pass 127 | 128 | def closeCSV(): 129 | dataOutput.close() 130 | ingredientsOutput.close() 131 | 132 | def makeDataPackage(): 133 | datapackage = { 134 | "name": "pillbox", 135 | "title": "Pillbox SPL Data", 136 | "date_updated": today, 137 | "resources": [ 138 | { 139 | "path": "spl_data.csv", 140 | "schema": { 141 | "fields": [ 142 | {"name": "setid","type": "string"}, 143 | {"name": "file_name","type": "string"}, 144 | {"name": "medicine_name","type": "string"}, 145 | {"name": "product_code","type": "string"}, 146 | {"name": "part_num","type": "integer"}, 147 | {"name": "ndc9","type": 
"string"}, 148 | {"name": "author","type": "string"}, 149 | {"name": "author_type","type": "string"}, 150 | {"name": "date_created","type": "string"}, 151 | {"name": "effective_time","type": "integer"}, 152 | {"name": "DEA_SCHEDULE_CODE","type": "string"}, 153 | {"name": "DEA_SCHEDULE_NAME","type": "string"}, 154 | {"name": "MARKETING_ACT_CODE","type": "string"}, 155 | {"name": "NDC","type": "string"}, 156 | {"name": "SPLCOLOR","type": "string"}, 157 | {"name": "SPLIMAGE","type": "string"}, 158 | {"name": "SPLIMPRINT","type": "string"}, 159 | {"name": "SPLCOLOR","type": "string"}, 160 | {"name": "SPLSCORE","type": "integer"}, 161 | {"name": "SPLSHAPE","type": "string"}, 162 | {"name": "SPLSIZE","type": "integer"}, 163 | {"name": "SPL_INACTIVE_ING","type": "string"}, 164 | {"name": "SPL_INGREDIENTS","type": "string"}, 165 | {"name": "SPL_STRENGTH","type": "string"}, 166 | {"name": "SPLSHAPE","type": "string"}, 167 | {"name": "document_type","type": "string"}, 168 | {"name": "dosage_form","type": "string"}, 169 | {"name": "rxcui","type": "string"}, 170 | {"name": "rxstring","type": "string"}, 171 | {"name": "rxtty","type": "string"}, 172 | {"name": "source","type": "string"}, 173 | {"name": "equal_product_code","type": "string"}, 174 | {"name": "approval_code","type": "string"} 175 | ] 176 | } 177 | }, 178 | { 179 | "path": "spl_ingredients.csv", 180 | "schema": { 181 | "fields": [ 182 | {"name": "product_code","type": "string"}, 183 | {"name": "setid","type": "string"}, 184 | {"name": "part_num","type": "string"}, 185 | {"name": "numerator_value","type": "string"}, 186 | {"name": "numerator_unit","type": "string"}, 187 | {"name": "denominator_value","type": "string"}, 188 | {"name": "denominator_unit","type": "string"}, 189 | {"name": "base_of_strength","type": "string"} 190 | ] 191 | } 192 | } 193 | ] 194 | } 195 | 196 | writeout = json.dumps(datapackage, sort_keys=True, separators=(',',':'), indent=4 * ' ') 197 | f_out = open('../../api/datapackage.json', 'wb') 198 | f_out.writelines(writeout) 199 | f_out.close() 200 | print "Datapackage.json created..." 201 | 202 | if __name__ == "__main__": 203 | makeDataPackage() 204 | -------------------------------------------------------------------------------- /scripts/master.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Import Python Modules 4 | import os 5 | import sys 6 | import time 7 | import traceback 8 | import csv 9 | import glob 10 | import simplejson as json 11 | from datetime import datetime 12 | # Import other files 13 | from xpath import parseData 14 | import rxnorm 15 | import error 16 | import makecsv 17 | import Queue 18 | import threading 19 | import requests 20 | 21 | # Variables, directories, initializations 22 | print "Starting data processing..." 23 | master_t0 = time.time() 24 | file_count = 0 25 | os.chdir("../tmp/tmp-unzipped/") 26 | queue = Queue.Queue() 27 | 28 | # Check internet connection for RXNorm requests 29 | print "Checking RXNorm connection..." 30 | try: 31 | check = rxnorm.connectionCheck() 32 | print check 33 | except: 34 | sys.exit("RXNorm check failed. 
Check your internet connection.") 35 | 36 | def xmlProcess(fn): 37 | 38 | try: 39 | # run xpath.py on each file 40 | xmlData = parseData(fn) 41 | 42 | for x in xmlData: 43 | rxnormData = rxnorm.rxNorm(x['ndc_codes']) 44 | x['data']['rxcui'] = rxnormData['rxcui'] 45 | x['data']['rxtty'] = rxnormData['rxtty'] 46 | x['data']['rxstring'] = rxnormData['rxstring'] 47 | try: 48 | ndc9 = x['data']['product_code'].split("-") 49 | if len(ndc9[0]) < 5: 50 | ndc9[0] = "0%s" % ndc9[0] 51 | if len(ndc9[1]) < 4: 52 | ndc9[1] = "0%s" % ndc9[1] 53 | x['data']['ndc9'] = "".join(ndc9) 54 | except: 55 | x['data']['ndc9'] = "" 56 | # Make indidivdual json files per SETID-NDC code 57 | writeout = json.dumps(x, sort_keys=True, separators=(',',':')) 58 | f_out = open('../processed/json/%s.json' % x['setid_product'], 'wb') 59 | f_out.writelines(writeout) 60 | f_out.close() 61 | # Make CSV file for output, one row per SETID-NDC code 62 | makecsv.makeCSV(xmlData) 63 | except: 64 | error.xmlError(fn) 65 | 66 | class ThreadXML(threading.Thread): 67 | def __init__(self, queue): 68 | threading.Thread.__init__(self) 69 | self.queue = queue 70 | 71 | def run(self): 72 | while True: 73 | #grabs file from queue 74 | fn = self.queue.get() 75 | #grabs file and processes 76 | xmlProcess(fn) 77 | #signals to queue job is done 78 | self.queue.task_done() 79 | 80 | def main(): 81 | global file_count 82 | 83 | #spawn a pool of threads, and pass them queue instance 84 | for i in range(20): 85 | t = ThreadXML(queue) 86 | t.daemon = True 87 | t.start() 88 | 89 | #populate queue with data 90 | for d in os.listdir('.'): 91 | files = glob.glob('%s/*.xml' % d) 92 | for fn in files: 93 | file_count = file_count + 1 94 | queue.put(fn) 95 | print "Processing XML with XPATH..." 96 | 97 | #wait on the queue until everything has been processed 98 | queue.join() 99 | 100 | main() 101 | 102 | # Calculate the total time and print to console. 103 | master_t1 = time.time() 104 | total_time = (master_t1-master_t0)/60 105 | print file_count, "XML files processed." 106 | error.errorWrite() 107 | makecsv.closeCSV() 108 | makecsv.makeDataPackage() 109 | print "Processing complete. Total Processing time = %d minutes" % total_time 110 | -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | lxml 2 | requests 3 | simplejson 4 | -------------------------------------------------------------------------------- /scripts/rxnorm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import os 3 | import sys 4 | import requests 5 | import simplejson as json 6 | 7 | def connectionCheck(): 8 | url = 'http://rxnav.nlm.nih.gov/REST/version' 9 | header = {'Accept': 'application/json'} 10 | getCheck = requests.get(url, headers=header) 11 | if getCheck.status_code != requests.codes.ok: 12 | response = "RXNorm server response error. Response code: %s" % getCheck.status_code 13 | else: 14 | response = "Connection check complete. RXNorm online. 
Response code: %s" % getCheck.status_code 15 | return response 16 | 17 | def rxNorm(ndc): 18 | # ndc value coming from master.py 19 | # ndc = [array of ndc values] 20 | if ndc[0] is None: 21 | return {"rxcui": "", "rxtty": "", "rxstring": ""} 22 | else: 23 | # if internet or request throws an error, print out to check connection and exit 24 | try: 25 | baseurl = 'http://rxnav.nlm.nih.gov/REST/' 26 | 27 | # Searching RXNorm API, Search by identifier to find RxNorm concepts 28 | # http://rxnav.nlm.nih.gov/REST/rxcui?idtype=NDC&id=0591-2234-10 29 | # Set url parameters for searching RXNorm for SETID 30 | ndcSearch = 'rxcui?idtype=NDC&id=' 31 | 32 | # Search RXNorm API, Return all properties for a concept 33 | rxPropSearch = 'rxcui/' 34 | rxttySearch = '/property?propName=TTY' 35 | rxstringSearch = '/property?propName=RxNorm%20Name' 36 | 37 | # Request RXNorm API to return json 38 | header = {'Accept': 'application/json'} 39 | def getTTY(rxCUI): 40 | # Search RXNorm again using RXCUI to return RXTTY & RXSTRING 41 | getTTY = requests.get(baseurl+rxPropSearch+rxCUI+rxttySearch, headers=header) 42 | 43 | ttyJSON = json.loads(getTTY.text, encoding="utf-8") 44 | 45 | return ttyJSON['propConceptGroup']['propConcept'][0]['propValue'] 46 | 47 | def getSTRING(rxCUI): 48 | # Search RXNorm again using RXCUI to return RXTTY & RXSTRING 49 | getString = requests.get(baseurl+rxPropSearch+rxCUI+rxstringSearch, headers=header) 50 | stringJSON = json.loads(getString.text, encoding="utf-8") 51 | 52 | return stringJSON['propConceptGroup']['propConcept'][0]['propValue'] 53 | 54 | # Search RXNorm using NDC code, return RXCUI id 55 | # ndc = [ndc1, ndc2, ... ] 56 | for item in ndc: 57 | getRXCUI = requests.get(baseurl+ndcSearch+item, headers=header) 58 | if getRXCUI.status_code != requests.codes.ok: 59 | print "RXNorm server response error. Response code: %s" % getRXCUI.status_code 60 | rxcuiJSON = json.loads(getRXCUI.text, encoding="utf-8") 61 | # Check if first value in list returns a RXCUI, if not go to next value 62 | try: 63 | if rxcuiJSON['idGroup']['rxnormId']: 64 | rxCUI = rxcuiJSON['idGroup']['rxnormId'][0] 65 | rxTTY = getTTY(rxCUI) 66 | rxSTRING = getSTRING(rxCUI) 67 | return {"rxcui": rxCUI, "rxtty": rxTTY, "rxstring": rxSTRING} 68 | except: 69 | # if last item return null values 70 | if item == ndc[-1]: 71 | return {"rxcui": "", "rxtty": "", "rxstring": ""} 72 | pass 73 | except: 74 | sys.exit("RXNorm connection") 75 | 76 | if __name__ == "__main__": 77 | # Test with sample NDC codes, one works, one doesn't 78 | dataTest = rxNorm(['66435-101-42', '66435-101-56', '66435-101-70', '66435-101-84', '66435-101-14', '66435-101-16', '66435-101-18']) 79 | print dataTest 80 | -------------------------------------------------------------------------------- /scripts/unzip.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | set -e 4 | 5 | STARTTIME=$(date +%s) 6 | date=$1 7 | if [ $date ]; then 8 | echo "processing..." 
9 | else 10 | echo "Error: no date entered with command (ex: 2014-02-24)" 11 | exit 1 12 | fi 13 | 14 | # make temp directories if do not exist 15 | mkdir -p ../tmp 16 | mkdir -p ../tmp/tmp-original 17 | mkdir -p ../tmp/tmp-unzipped 18 | mkdir -p ../tmp/tmp-images 19 | mkdir -p ../tmp/processed/json 20 | mkdir -p ../tmp/processed/csv 21 | mkdir -p ../tmp/errors 22 | 23 | tmpDIR=../tmp/ 24 | cd $tmpDIR 25 | 26 | # Removes old files 27 | tmpOriginal=tmp-original/ 28 | cd $tmpOriginal 29 | for FOLDER in HRX HOTC HOMEO ANIMAL REMAIN; 30 | do 31 | if [ -d $FOLDER ]; then 32 | rm -r $FOLDER 33 | fi 34 | done 35 | echo "removed /tmp-original files" 36 | 37 | # Removes old files 38 | tmpUnzipped=../tmp-unzipped/ 39 | cd $tmpUnzipped 40 | for FOLDER in HRX HOTC HOMEO ANIMAL REMAIN; 41 | do 42 | if [ -d $FOLDER ]; then 43 | rm -r $FOLDER 44 | fi 45 | done 46 | echo "removed /tmp-unzipped files" 47 | 48 | # Removes old files 49 | tmpImages=../tmp-images/ 50 | cd $tmpImages 51 | for FOLDER in HRX HOTC HOMEO ANIMAL REMAIN; 52 | do 53 | if [ -d $FOLDER ]; then 54 | rm -r $FOLDER 55 | fi 56 | done 57 | echo "removed /tmp-images files" 58 | 59 | # create tmp subfolders 60 | mkdir -p ../tmp-original/HRX 61 | mkdir -p ../tmp-original/HOTC 62 | mkdir -p ../tmp-original/HOMEO 63 | mkdir -p ../tmp-original/ANIMAL 64 | mkdir -p ../tmp-original/REMAIN 65 | 66 | mkdir -p ../tmp-unzipped/HRX 67 | mkdir -p ../tmp-unzipped/HOTC 68 | mkdir -p ../tmp-unzipped/HOMEO 69 | mkdir -p ../tmp-unzipped/ANIMAL 70 | mkdir -p ../tmp-unzipped/REMAIN 71 | 72 | mkdir -p ../tmp-images/HRX 73 | mkdir -p ../tmp-images/HOTC 74 | mkdir -p ../tmp-images/HOMEO 75 | mkdir -p ../tmp-images/ANIMAL 76 | mkdir -p ../tmp-images/REMAIN 77 | 78 | # unzips main files to get individual zipped files 79 | ORIGNDATA=../tmp-original/ 80 | cd $ORIGNDATA 81 | 82 | unzip -qj ../download/$date/dm_spl_release_human_rx_part1.zip -d HRX/ 83 | unzip -qj ../download/$date/dm_spl_release_human_rx_part2.zip -d HRX/ 84 | unzip -qj ../download/$date/dm_spl_release_human_otc_part1.zip -d HOTC/ 85 | unzip -qj ../download/$date/dm_spl_release_human_otc_part2.zip -d HOTC/ 86 | unzip -qj ../download/$date/dm_spl_release_human_otc_part3.zip -d HOTC/ 87 | unzip -qj ../download/$date/dm_spl_release_homeopathic.zip -d HOMEO/ 88 | unzip -qj ../download/$date/dm_spl_release_animal.zip -d ANIMAL/ 89 | unzip -qj ../download/$date/dm_spl_release_remainder.zip -d REMAIN/ 90 | 91 | echo "original files unzipped to /tmp-original" 92 | 93 | # loop through all individual zipped files to unzip 94 | for FOLDER in HRX HOTC HOMEO ANIMAL REMAIN; 95 | do 96 | if [ -d $FOLDER ]; then 97 | for f in `ls "$FOLDER/"`; 98 | do 99 | ID=$(zipinfo -1 "$FOLDER/$f" "*.xml" | sed 's/....$//') 100 | unzip -Cqo "$FOLDER/$f" "*.xml" -d ../tmp-unzipped/$FOLDER/ 101 | unzip -Cqo "$FOLDER/$f" -d ../tmp-images/$FOLDER/${ID} 102 | rm ../tmp-images/$FOLDER/${ID}/${ID}.xml 103 | done 104 | fi 105 | done 106 | 107 | echo "all files unzipped." 108 | echo "processing complete." 109 | ENDTIME=$(date +%s) 110 | TOTALTIME=$((($ENDTIME-$STARTTIME)/60)) 111 | echo "Processing took $TOTALTIME minutes to complete." 112 | -------------------------------------------------------------------------------- /scripts/xpath.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # ------------------ 3 | # Pillbox Xpath script that extracts raw data from XML to yield: 4 | # 1. Rows array, with one row object per product code. 5 | # 2. 
Ingredients array, with one object per ingredient 6 | # ------------------ 7 | # Requirements: Python 2.6 or greater 8 | 9 | import os, sys, time 10 | import StringIO 11 | import atexit 12 | from lxml import etree 13 | from itertools import groupby 14 | 15 | # Check all XMLs against form codes, discard all XMLs that don't match 16 | codeChecks = [ 17 | "C25158", "C42895", "C42896", 18 | "C42917", "C42902", "C42904", 19 | "C42916", "C42928", "C42936", 20 | "C42954", "C42998", "C42893", 21 | "C42897", "C60997", "C42905", 22 | "C42997", "C42910", "C42927", 23 | "C42931", "C42930", "C61004", 24 | "C61005", "C42964", "C42963", 25 | "C42999", "C61006", "C42985", 26 | "C42992" 27 | ] 28 | 29 | 30 | def parseData(name): 31 | # Iterparse function that clears the memory each time it finishes running 32 | def getelements(filename, tag, medicineCheck): 33 | context = iter(etree.iterparse(filename)) 34 | _, root = next(context) # get root element 35 | for event, elem in context: 36 | # if event == 'start': 37 | # If we pass "yes" via medicineCheck, then we need to return instead of 38 | if medicineCheck == 'yes': 39 | if elem.tag == "{urn:hl7-org:v3}manufacturedMedicine": 40 | yield elem 41 | elif elem.tag == '{urn:hl7-org:v3}manufacturedProduct' or elem.tag =='manufacturedProduct': 42 | yield elem 43 | else: 44 | if tag.find('{') >= 0: 45 | tag = tag[16:] 46 | if elem.tag.find('{') >= 0: 47 | elem.tag = elem.tag[16:] 48 | 49 | if elem.tag == tag: 50 | yield elem 51 | root.clear() # preserve memory 52 | 53 | # ------------------ 54 | # Build SetInfo array 55 | # ------------------ 56 | setInfo = {} 57 | filename = name.split('/') 58 | setInfo['file_name'] = filename[1] 59 | setInfo['source'] = filename[0] 60 | 61 | def getInfo(): 62 | # Get information at parent level 63 | tree = etree.parse(name) 64 | root = tree.getroot() 65 | for child in root.xpath("./*[local-name() = 'id']"): 66 | setInfo['id_root'] = child.get('root') 67 | for child in root.xpath("./*[local-name() = 'setId']"): 68 | setInfo['setid'] = child.get('root') 69 | for child in root.xpath("./*[local-name() = 'effectiveTime']"): 70 | setInfo['effective_time'] = child.get('value') 71 | for child in root.xpath("./*[local-name() = 'code']"): 72 | setInfo['document_type'] = child.get('code') 73 | 74 | # -------------------- 75 | # Build Sponsors Array 76 | # -------------------- 77 | sponsors = {} 78 | for parent in getelements(name, "{urn:hl7-org:v3}author", 'no'): 79 | for child in parent.xpath(".//*[local-name() = 'representedOrganization']"): 80 | for grandChild in child.xpath("./*[local-name() = 'name']"): 81 | sponsors['author'] = grandChild.text.strip() 82 | sponsors['author_type'] = 'labeler' 83 | grandChild.clear() 84 | 85 | for parent in getelements(name, "{urn:hl7-org:v3}legalAuthenticator", 'no'): 86 | for child in parent.xpath(".//*[local-name() = 'representedOrganization']"): 87 | for grandChild in child.xpath("./*[local-name() = 'representedOrganization']"): 88 | sponsors['author'] = grandChild.text.strip() 89 | sponsors['author_type'] = 'legal' 90 | grandChild.clear() 91 | 92 | # ----------------------------------------- 93 | # Build ProdMedicine and Ingredients arrays 94 | # ----------------------------------------- 95 | prodMedicines = [] 96 | ingredients = {} 97 | formCodes = [] 98 | names = [] 99 | partnames = [] 100 | # info object, which will later be appended to prodMedicines array 101 | info = {} 102 | info['SPLCOLOR'] = [] 103 | info['SPLIMPRINT'] = [] 104 | info['SPLSHAPE'] = [] 105 | info['SPLSIZE'] = [] 106 | 
info['SPLSCORE'] = [] 107 | info['SPLCOATING'] = [] 108 | info['SPLSYMBOL'] = [] 109 | info['SPLFLAVOR'] = [] 110 | info['SPLIMAGE'] = [] 111 | info['IMAGE_SOURCE'] = [] 112 | info['SPL_INGREDIENTS'] = [] 113 | info['SPL_INACTIVE_ING'] = [] 114 | info['SPL_STRENGTH'] = [] 115 | info['SPLCONTAINS'] = [] 116 | info['approval_code'] = [] 117 | info['MARKETING_ACT_CODE'] = [] 118 | info['DEA_SCHEDULE_CODE'] = [] 119 | info['DEA_SCHEDULE_NAME'] = [] 120 | info['equal_product_code'] = [] 121 | info['NDC'] = [] 122 | info['SPLUSE'] = [] 123 | 124 | # substanceCodes will be filled with ingredient codes to check for duplicate ingredients 125 | substanceCodes = [] 126 | # doses will be filled with ingredient numerator values to check for duplicate ingredients 127 | doses = [] 128 | # codes stores product_codes, to determine how many unique products to output with len(codes) 129 | codes = [] 130 | productCodes = [] 131 | partNumbers = [] 132 | eqDup = '' 133 | 134 | for parent in getelements(name, "{urn:hl7-org:v3}manufacturedProduct", 'yes'): 135 | def proceed(partCode, partChild, index): 136 | # There are elements that have no content and would result 137 | # in empty objects being appended to ingredients array. So use ingredientTrue to test. 138 | ingredientTrue = 0 139 | 140 | if partCode == 'zero': 141 | getInfo() 142 | for productCode in parent.xpath("./*[local-name() = 'code']"): 143 | uniqueCode = productCode.get('code') + '-0' 144 | # set ingredients array for uniqueCode 145 | ingredients[uniqueCode] = [] 146 | if uniqueCode not in codes: 147 | codes.append(uniqueCode) 148 | productCodes.append(productCode.get('code')) 149 | partNumbers.append('0') 150 | else: 151 | # This applies only to <part> products 152 | for productCode in parent.xpath("./*[local-name() = 'code']"): 153 | uniqueCode = productCode.get('code') + '-'+str(index) 154 | formCodes.append(partCode) 155 | # set ingredients array for uniqueCode 156 | ingredients[uniqueCode] = [] 157 | if uniqueCode not in codes: 158 | codes.append(uniqueCode) 159 | productCodes.append(productCode.get('code')) 160 | partNumbers.append(str(index)) 161 | 162 | # Get packaging (NDC) information from <asContent> 163 | packageProducts = [] 164 | for child in parent.xpath("./*[local-name() = 'asContent']"): 165 | # Check if we're working with <containerPackagedMedicine> or <containerPackagedProduct> 166 | checkMedicine = child.xpath("./*[local-name() = 'containerPackagedMedicine']") 167 | checkProduct = child.xpath("./*[local-name() = 'containerPackagedProduct']") 168 | if checkProduct: 169 | productType = 'containerPackagedProduct' 170 | else: 171 | productType = 'containerPackagedMedicine' 172 | # Send the product <code> to packageProducts 173 | for grandChild in child.xpath("./*[local-name() = '"+productType+"']"): 174 | value = grandChild.xpath("./*[local-name() = 'code']") 175 | form = grandChild.xpath("./*[local-name() = 'formCode']") 176 | # For when there is another <asContent> nested under another <asContent> 177 | if value[0].get('code') == None: 178 | subElement = grandChild.xpath(".//*[local-name() = 'asContent']") 179 | # subValues is an array of all <code> tags under the second instance of <asContent> 180 | if subElement: 181 | subValues = subElement[0].xpath(".//*[local-name() = 'code']") 182 | tempCodes = [] 183 | # Loop through returned values, which come from multiple levels of <asContent> 184 | for v in subValues: 185 | if v.get('code') != None: 186 | packageProducts.append(v.get('code')) 187 | # Else just append the value from the first level 188 | else: 189 | packageProducts.append(value[0].get('code')) 190 | 191 | # The getelements() function captures <manufacturedProduct> and <manufacturedMedicine>, which is what 192 | # we're seeing when packageProducts
has length zero 193 | if packageProducts: 194 | info['NDC'].append(packageProducts) 195 | 196 | # Arrays for ingredients 197 | active = [] 198 | inactive = [] 199 | splStrength = [] 200 | # If partCode is zero, we can find the ingredients directly below the parent 201 | # else we need to iterate through the <partProduct> of the <part>, from proceed() function 202 | 203 | if partCode == 'zero': 204 | level = parent 205 | for child in parent.xpath("./*[local-name() = 'name']"): 206 | names.append(child.text.strip()) 207 | partnames.append('') 208 | else: 209 | for child in parent.iterchildren('{urn:hl7-org:v3}name'): 210 | names.append(child.text.strip()) 211 | 212 | partProduct = partChild.xpath("./*[local-name() = 'partProduct']") 213 | if not partProduct: 214 | partProduct = partChild.xpath("./*[local-name() = 'partMedicine']") 215 | 216 | level = partProduct[0] 217 | for child in partProduct[0].xpath("./*[local-name() = 'name']"): 218 | partnames.append(child.text.strip()) 219 | for child in level.xpath("./*[local-name() = 'ingredient']"): 220 | # Create temporary object for each ingredient 221 | ingredientTemp = {} 222 | ingredientTemp['ingredient_type'] = {} 223 | ingredientTemp['substance_code'] = {} 224 | 225 | # If statement to find active ingredients 226 | if child.get('classCode') == 'ACTIB' or child.get('classCode') == 'ACTIM' or child.get('classCode') == 'ACTIR': 227 | ingredientTrue = 1 228 | ingredientTemp['active_moiety_names'] = [] 229 | 230 | for grandChild in child.xpath("./*[local-name() = 'ingredientSubstance']"): 231 | for c in grandChild.iterchildren(): 232 | ingredientTemp['ingredient_type'] = 'active' 233 | if c.tag == '{urn:hl7-org:v3}name' or c.tag == 'name': 234 | active.append(c.text.strip()) 235 | splStrengthItem = c.text.strip() 236 | ingredientTemp['substance_name'] = c.text.strip() 237 | if c.tag == '{urn:hl7-org:v3}code' or c.tag == 'code': 238 | ingredientTemp['substance_code'] = c.get('code') 239 | if c.tag =='{urn:hl7-org:v3}activeMoiety' or c.tag == 'activeMoiety': 240 | name = c.xpath(".//*[local-name() = 'name']") 241 | 242 | # Send active moiety to ingredientTemp 243 | try: 244 | ingredientTemp['active_moiety_names'].append(name[0].text.strip()) 245 | except: 246 | ingredientTemp['active_moiety_names'].append('') 247 | 248 | for grandChild in child.xpath("./*[local-name() = 'quantity']"): 249 | numerator = grandChild.xpath("./*[local-name() = 'numerator']") 250 | denominator = grandChild.xpath("./*[local-name() = 'denominator']") 251 | 252 | ingredientTemp['numerator_unit'] = numerator[0].get('unit') 253 | ingredientTemp['numerator_value'] = numerator[0].get('value') 254 | ingredientTemp['dominator_unit'] = denominator[0].get('unit') 255 | ingredientTemp['dominator_value'] = denominator[0].get('value') 256 | splStrengthValue = float(ingredientTemp['numerator_value']) / float(ingredientTemp['dominator_value']) 257 | if str(splStrengthValue)[-1] == '0': 258 | splStrengthValue = int(splStrengthValue) 259 | splStrengthItem = "%s %s %s" % (splStrengthItem, splStrengthValue, ingredientTemp['numerator_unit']) 260 | splStrength.append(splStrengthItem) 261 | 262 | # If statement to find inactive ingredients 263 | if child.get('classCode') == 'IACT': 264 | ingredientTrue = 1 265 | # Create object for each inactive ingredient 266 | for grandChild in child.xpath("./*[local-name() = 'ingredientSubstance']"): 267 | for c in grandChild.iterchildren(): 268 | ingredientTemp['ingredient_type'] = 'inactive' 269 | if c.tag == '{urn:hl7-org:v3}name' or c.tag =='name': 270 |
inactive.append(c.text.strip()) 271 | ingredientTemp['substance_name'] = c.text.strip() 272 | if c.tag == '{urn:hl7-org:v3}code' or c.tag == 'code': 273 | ingredientTemp['substance_code'] = c.get('code') 274 | try: 275 | ingredients[uniqueCode].append(ingredientTemp) 276 | except: 277 | # this is passed because of no uniqeCode assigned when not OSDF 278 | continue 279 | 280 | # For XML files that have different ingredient element syntax 281 | # check for inactive ingredients 282 | for child in level.xpath("./*[local-name() = 'inactiveIngredient']"): 283 | # Create temporary object for each ingredient 284 | ingredientTemp = {} 285 | ingredientTemp['ingredient_type'] = {} 286 | ingredientTemp['substance_code'] = {} 287 | ingredientTrue = 1 288 | 289 | for grandChild in child.xpath("./*[local-name() = 'inactiveIngredientSubstance']"): 290 | for c in grandChild.iterchildren(): 291 | ingredientTemp['ingredient_type'] = 'inactive' 292 | if c.tag == '{urn:hl7-org:v3}name' or c.tag =='name': 293 | inactive.append(c.text.strip()) 294 | ingredientTemp['substance_name'] = c.text.strip() 295 | if c.tag == '{urn:hl7-org:v3}code' or c.tag == 'code': 296 | ingredientTemp['substance_code'] = c.get('code') 297 | try: 298 | ingredients[uniqueCode].append(ingredientTemp) 299 | except: 300 | # this is passed because of no uniqeCode assigned when not OSDF 301 | continue 302 | 303 | # For XML files that have different ingredient element syntax 304 | # check for active ingredients 305 | for child in level.xpath("./*[local-name() = 'activeIngredient']"): 306 | ingredientTemp = {} 307 | ingredientTemp['ingredient_type'] = {} 308 | ingredientTemp['substance_code'] = {} 309 | ingredientTemp['active_moiety_names'] = [] 310 | ingredientTrue = 1 311 | 312 | for grandChild in child.xpath("./*[local-name() = 'activeIngredientSubstance']"): 313 | for c in grandChild.iterchildren(): 314 | ingredientTemp['ingredient_type'] = 'active' 315 | if c.tag == '{urn:hl7-org:v3}name' or c.tag == 'name': 316 | active.append(c.text.strip()) 317 | splStrengthItem = c.text.strip() 318 | ingredientTemp['substance_name'] = c.text.strip() 319 | if c.tag == '{urn:hl7-org:v3}code' or c.tag == 'code': 320 | ingredientTemp['substance_code'] = c.get('code') 321 | if c.tag =='{urn:hl7-org:v3}activeMoiety' or c.tag == 'activeMoiety': 322 | name = c.xpath(".//*[local-name() = 'name']") 323 | 324 | # Send active moiety to ingredientTemp 325 | try: 326 | ingredientTemp['active_moiety_names'].append(name[0].text.strip()) 327 | except: 328 | ingredientTemp['active_moiety_names'].append('') 329 | 330 | for grandChild in child.xpath("./*[local-name() = 'quantity']"): 331 | numerator = grandChild.xpath("./*[local-name() = 'numerator']") 332 | denominator = grandChild.xpath("./*[local-name() = 'denominator']") 333 | 334 | ingredientTemp['numerator_unit'] = numerator[0].get('unit') 335 | ingredientTemp['numerator_value'] = numerator[0].get('value') 336 | ingredientTemp['dominator_unit'] = denominator[0].get('unit') 337 | ingredientTemp['dominator_value'] = denominator[0].get('value') 338 | splStrengthValue = float(ingredientTemp['numerator_value']) / float(ingredientTemp['dominator_value']) 339 | if str(splStrengthValue)[-1] == '0': 340 | splStrengthValue = int(splStrengthValue) 341 | splStrengthItem = "%s %s %s" % (splStrengthItem, splStrengthValue, ingredientTemp['numerator_unit']) 342 | splStrength.append(splStrengthItem) 343 | 344 | # Send code, name and formCode to info = {} 345 | info['product_code'] = productCodes 346 | info['part_num'] = partNumbers 
347 | info['medicine_name'] = names 348 | info['part_medicine_name'] = partnames 349 | info['dosage_form'] = formCodes 350 | 351 | # If ingredientTrue was set to 1 above, we know we have ingredient information to append 352 | if ingredientTrue != 0: 353 | info['SPL_INGREDIENTS'].append(active) 354 | info['SPL_INACTIVE_ING'].append(inactive) 355 | info['SPL_STRENGTH'].append(splStrength) 356 | 357 | # Second set of child elements in <subjectOf> used for ProdMedicines array 358 | def checkForValues(ctype, grandChild, dup, idx): 359 | value = grandChild.xpath("./*[local-name() = 'value']") 360 | reference = grandChild.xpath(".//*[local-name() = 'reference']") 361 | if ctype == 'SPLIMPRINT': 362 | value = value[0].text.strip() 363 | else: 364 | value = value[0].attrib 365 | kind = grandChild.find("./{urn:hl7-org:v3}code[@code='"+ctype+"']") 366 | if kind == None: 367 | kind = grandChild.find("./code[@code='"+ctype+"']") 368 | if kind != None: 369 | if ctype == 'SPLCOLOR': 370 | if dup == '1': 371 | color1 = info[ctype][idx] 372 | if value.get('code') == None: 373 | color2 = '' 374 | else: 375 | color2 = value.get('code') 376 | info[ctype][idx] = "%s;%s" % (color1,color2) 377 | else: 378 | info[ctype].append(value.get('code')) 379 | elif ctype == 'SPLIMPRINT': 380 | info[ctype].append(value) 381 | elif ctype == 'SPLSCORE': 382 | if value.get('value') == None: 383 | info[ctype].append('') 384 | else: 385 | info[ctype].append(value.get('code') or value.get('value')) 386 | elif ctype == 'SPLIMAGE': 387 | if reference[0].get('value'): 388 | splfile = reference[0].get('value').split() 389 | info[ctype].append(splfile) 390 | else: 391 | info[ctype].append('') 392 | else: 393 | info[ctype].append(value.get('code') or value.get('value')) 394 | 395 | # If partCode is zero, we can find the <subjectOf> elements directly below the parent 396 | # else we need to iterate through the <subjectOf> of the <part>, from proceed() function 397 | 398 | 399 | 400 | if partCode == 'zero': 401 | level = parent 402 | else: 403 | level = partChild 404 | if level.xpath("./*[local-name() = 'subjectOf']"): 405 | previous = [] 406 | subjectOfcheck = [] 407 | for child in level.xpath("./*[local-name() = 'subjectOf']"): 408 | for c in child: 409 | subjectOfcheck.append(c.tag) 410 | # Get approval code 411 | for grandChild in child.xpath(".//*[local-name() = 'approval']"): 412 | statusCode = grandChild.xpath("./*[local-name() = 'code']") 413 | info['approval_code'].append(statusCode[0].get('code')) 414 | # Get marketing act code 415 | for grandChild in child.xpath("./*[local-name() = 'marketingAct']"): 416 | statusCode = grandChild.xpath("./*[local-name() = 'statusCode']") 417 | info['MARKETING_ACT_CODE'].append(statusCode[0].get('code')) 418 | 419 | # Get policy code 420 | for grandChild in child.xpath("./*[local-name() = 'policy']"): 421 | for each in grandChild.xpath("./*[local-name() = 'code']"): 422 | info['DEA_SCHEDULE_CODE'].append(each.get('code')) 423 | info['DEA_SCHEDULE_NAME'].append(each.get('displayName')) 424 | 425 | for grandChild in child.xpath("./*[local-name() = 'characteristic']"): 426 | for each in grandChild.xpath("./*[local-name() = 'code']"): 427 | # Run each type through the checkForValues() function above 428 | ctype = each.get('code') 429 | # checks for duplicate SPL types, SPLCOLOR can happen twice 430 | if ctype in previous: 431 | idx = len(info[ctype]) - 1 432 | checkForValues(ctype, grandChild, '1', idx) 433 | else: 434 | idx = len(info[ctype]) 435 | diff = len(codes) - 1 436 | if idx < diff: 437 | for i in range(diff - idx): 438 |
info[ctype].append('') 439 | checkForValues(ctype, grandChild, '0', 0) 440 | previous.append(ctype) 441 | each.clear() # clear memory 442 | grandChild.clear() # clear memory 443 | # check if DEA policy code is in product or not 444 | policy = '{urn:hl7-org:v3}policy' 445 | if policy not in subjectOfcheck: 446 | info['DEA_SCHEDULE_CODE'].append('') 447 | info['DEA_SCHEDULE_NAME'].append('') 448 | approval = '{urn:hl7-org:v3}approval' 449 | if approval not in subjectOfcheck: 450 | info['approval_code'].append('') 451 | 452 | # Check if there are <part> elements in the manufactured product, if not, partCode = 'zero' 453 | parts = parent.xpath("./*[local-name() = 'part']") 454 | if not parts: 455 | # Check for a child <manufacturedProduct>, if it exists, check its formCode 456 | childProduct = parent.xpath(".//*[local-name() = 'manufacturedProduct']") 457 | if childProduct: 458 | childFormCode = childProduct[0].xpath("./*[local-name() = 'formCode']") 459 | if childFormCode: 460 | if childFormCode[0].get('code') not in codeChecks: 461 | continue # skip to next manufacturedProduct 462 | # test current level formCode against codeChecks 463 | formCode = parent.xpath("./*[local-name() = 'formCode']") 464 | if formCode: 465 | if formCode[0].get('code') not in codeChecks: 466 | continue 467 | else: 468 | formCodes.append(formCode[0].get('code')) 469 | 470 | # Get equal product code from <asEquivalentEntity> 471 | equiv = parent.xpath(".//*[local-name() = 'asEquivalentEntity']") 472 | if equiv: 473 | for child in parent.xpath(".//*[local-name() = 'asEquivalentEntity']"): 474 | # duplicate check 475 | if child != eqDup: 476 | equalProdParent = parent.xpath(".//*[local-name() = 'definingMaterialKind']") 477 | code = equalProdParent[0].xpath("./*[local-name() = 'code']") 478 | equalProdCodes = code[0].get('code') 479 | info['equal_product_code'].append(equalProdCodes) 480 | eqDup = child 481 | else: 482 | equalProdCodes = '' 483 | info['equal_product_code'].append(equalProdCodes) 484 | # No parts found, so part number is zero, send to proceed() function 485 | proceed('zero','','') 486 | else: 487 | # Set up an index to pass to proceed() function to determine part number 488 | index = 1 489 | for child in parts: 490 | # Get equal product code from <asEquivalentEntity> 491 | equiv = child.xpath(".//*[local-name() = 'asEquivalentEntity']") 492 | if equiv: 493 | for c in child.xpath(".//*[local-name() = 'asEquivalentEntity']"): 494 | equalProdParent = c.xpath(".//*[local-name() = 'definingMaterialKind']") 495 | code = equalProdParent[0].xpath("./*[local-name() = 'code']") 496 | equalProdCodes = code[0].get('code') 497 | info['equal_product_code'].append(equalProdCodes) 498 | else: 499 | equalProdCodes = '' 500 | info['equal_product_code'].append(equalProdCodes) 501 | 502 | formCode = child.xpath(".//*[local-name() = 'formCode']") 503 | 504 | # Check if formCode is in codeChecks 505 | if formCode[0].get('code') not in codeChecks: 506 | # If formCode is not in codeChecks, move on to the next <part> 507 | continue 508 | # else send to proceed() function with index 509 | else: 510 | getInfo() 511 | proceed(formCode[0].get('code'), child, index) 512 | index = index + 1 513 | prodMedicines.append(info) 514 | 515 | prodMedNames = [ 516 | 'SPLCOLOR','SPLIMAGE','SPLIMPRINT','medicine_name','SPLSHAPE', 517 | 'SPL_INGREDIENTS','SPL_INACTIVE_ING','SPLSCORE','SPLSIZE', 518 | 'product_code','part_num','dosage_form','MARKETING_ACT_CODE', 519 | 'DEA_SCHEDULE_CODE','DEA_SCHEDULE_NAME','NDC','equal_product_code', 520 | 'SPL_STRENGTH','part_medicine_name','approval_code' 521 | ] 522 | setInfoNames =
['file_name','effective_time','id_root','setid','document_type','source'] 523 | sponsorNames = ['author','author_type'] 524 | 525 | # Loop through prodMedicines as many times as there are unique product codes + part codes combinations, which is len(codes) 526 | products = [] 527 | 528 | if prodMedicines[0]['NDC']: 529 | for i in range(0, len(codes)): 530 | uniqueID = setInfo['setid'] + '-' + codes[i] 531 | product = {} 532 | product['setid_product'] = uniqueID 533 | product['ndc_codes'] = prodMedicines[0]['NDC'][i] 534 | tempProduct = {} 535 | for name in prodMedNames: 536 | # Get information at the correct index 537 | try: 538 | if name == 'SPLIMAGE': 539 | if prodMedicines[0][name][i] != '': 540 | image_file = setInfo['setid'] + '_' + prodMedicines[0]['product_code'][i] + '_' + prodMedicines[0]['part_num'][i] + '_' + "_".join(prodMedicines[0][name][i]) 541 | tempProduct[name] = image_file 542 | else: 543 | tempProduct[name] = '' 544 | else: 545 | tempProduct[name] = prodMedicines[0][name][i] 546 | except: 547 | tempProduct[name] = '' 548 | for name in setInfoNames: 549 | tempProduct[name] = setInfo[name] 550 | for name in sponsorNames: 551 | try: 552 | tempProduct[name] = sponsors[name] 553 | except: 554 | tempProduct[name] = '' 555 | product['data'] = tempProduct 556 | # Ingredients are showing duplicates again leaving out while fixing. 557 | product['ingredients'] = ingredients[codes[i]] 558 | products.append(product) 559 | return products 560 | else: 561 | sys.exit("Not OSDF") 562 | 563 | #Use this code to run xpath on the tmp-unzipped files without other scripts 564 | if __name__ == "__main__": 565 | test = parseData("../tmp/tmp-unzipped/HRX/6dc74857-7d8d-4102-a56a-a014934b91b2.xml") 566 | print test 567 | --------------------------------------------------------------------------------