├── .gitignore ├── CHANGELOG.md ├── CONTRIBUTING.md ├── README.md ├── documentation ├── ABOUT.md ├── DATA.md ├── README.md ├── SETUP.md └── TERMS.md └── scripts ├── README.md ├── api.py ├── download.sh ├── error.py ├── makecsv.py ├── master.py ├── requirements.txt ├── rxnorm.py ├── unzip.sh └── xpath.py /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | /tmp -------------------------------------------------------------------------------- /CHANGELOG.md: -------------------------------------------------------------------------------- 1 | ## 0.1.0 2 | 3 | - Initial development release 4 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | ## Contributing to Pillbox 2 | 3 | Welcome to the Pillbox Data Process. Below is a guide for getting started participating in the project. Whether you are a user of the data, interested in how the data is generated, or want to improve the quality of the data, the sections below outline how to get started contributing. 4 | 5 | ### How to contribute 6 | 7 | There are two main ways to contribute to the project: 8 | 9 | 1. Report on data quality or errors 10 | 2. Improve the data generation process 11 | 12 | ### Report on Data Quality or Errors 13 | 14 | The first way to get involved in the Pillbox project is to participate in flagging errors or sending reports on data quality. The project uses the [Issue tracker](https://github.com/HHS/pillbox-data-process/issues) to manage and discuss reports and questions about the project. 15 | 16 | Here's a short list of items to consider before submitting an issue: 17 | 18 | - Please search for your issue before filing it; errors or quality issues may have already been reported. 19 | - If you encountered an error in the data process, describe specifically what the error was and how to replicate it. 20 | - Please keep issues professional and straightforward. This is an evolving effort and we look to the community to help improve the quality and the process. 21 | - Please use the Labels to mark the type of issue. We've pre-categorized labels we think are currently appropriate; these help provide context. 22 | 23 | ### Improve Data Generation Process 24 | 25 | The second way to get involved in the Pillbox project is to contribute to improving the data process. The data generation process is a series of Python scripts that download, unzip, and parse FDA SPL XML data. The goal is for this process to continually improve. We see this happening in two ways: 26 | 27 | #### 1. Improving error handling 28 | 29 | The FDA SPL data consistently experiences quality issues. Occasionally these issues can break the data process. If you find that the data process breaks for you, please contribute back in the following ways: 30 | 31 | 1. Debug the error and submit a Pull Request with the new code that handles the error. 32 | 2. Submit an issue reporting the new error and any suggestions for how to handle it. 33 | 34 | #### 2. Improving secondary data products 35 | 36 | The main Pillbox product is the `spl_data.csv` file. This is the master dataset that is made available. In addition, secondary data products are being made available. These include individual JSON files for each product and growing static API access to the data.
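As a hypothetical illustration of what a secondary data product can look like, the short sketch below reads the processed `spl_data.csv` and writes a JSON index of products grouped by shape — the kind of "new slice of the data" described next. The column names (`setid`, `SPLSHAPE`) are the ones written by `makecsv.py`; the output filename is invented for this example.

```python
#!/usr/bin/python
# Hypothetical secondary data product: index setids by shape from spl_data.csv.
import csv
import simplejson as json

shapes = {}

with open('../tmp/processed/csv/spl_data.csv', 'rb') as f:
    for row in csv.DictReader(f):
        # Group each product's setid under its SPLSHAPE value
        shapes.setdefault(row['SPLSHAPE'], []).append(row['setid'])

with open('shape_index.json', 'wb') as f_out:
    f_out.write(json.dumps(shapes, sort_keys=True, separators=(',', ':')))

print "Wrote shape index for %d shapes." % len(shapes)
```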
37 | 38 | If you want to contribute to creating new secondary data products, you can help us in two ways: 39 | 40 | 1. Contribute code to `api.py` or create a new Python script and submit a Pull Request. 41 | - This is the best way to add to the process by creating new slices of the data or generating a unique analysis of the data. 42 | 2. Recommend improvements in the Issue tracker. 43 | - Submit a recommendation by creating a new ticket for discussion. Another developer may then be able to implement and contribute code based on the recommendation. 44 | 45 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Pillbox for Developers 2 | ### Pillbox Data Process 3 | 4 | Pillbox, an initiative of the National Library of Medicine at the National Institutes of Health, provides data and images for prescription, over-the-counter, homeopathic, and veterinary oral solid dosage medications (pills) marketed in the United States of America. This data set contains information about pills such as how they look, their active and inactive ingredients, and many other criteria. 5 | 6 | Pillbox's primary data source (FDA drug labels) is complex and does not organize information based on individual pills. Additionally, there are very few pill images available in the source data. The Pillbox initiative has focused on restructuring the source data, incorporating data from other related data sets, and creating a library of pill images. 7 | 8 | A major function of the initiative was the development of a data process which ingests the source data and produces an easy-to-use, "pill-focused" dataset. 9 | 10 | This repository contains the code for that process. It is intended to give developers greater flexibility in using this data, as well as to expand the scope of and refine the data process. This repository will continue to grow, and access to the data will continue to improve. 11 | 12 | ### Get started using or contributing to the code 13 | 14 | - [Read more](https://github.com/HHS/pillbox-data-process/blob/master/documentation/SETUP.md) about setting up your local environment 15 | - Start [contributing](https://github.com/HHS/pillbox-data-process/blob/master/CONTRIBUTING.md) to the development 16 | 17 | ### Uses of this data 18 | 19 | - Identify unknown pills based on their physical appearance 20 | - Assist in development of electronic health records, medication information systems, and adherence/reminder/tracking tools 21 | - Support research in areas such as informatics and image processing 22 | 23 | ### Warning 24 | 25 | Pillbox's source data is known to have errors and inconsistencies. Read [this document](https://github.com/HHS/pillbox-data-process/blob/master/documentation/DATA.md) before working with Pillbox's API, data, and images. 26 | 27 | [Read more about Pillbox](https://github.com/HHS/pillbox-data-process/blob/master/documentation/ABOUT.md). 28 | -------------------------------------------------------------------------------- /documentation/ABOUT.md: -------------------------------------------------------------------------------- 1 | ## About Pillbox 2 | 3 | Pillbox is a resource of the U.S. National Library of Medicine, part of the National Institutes of Health, U.S. Department of Health and Human Services. Pillbox is a United States government resource.
4 | 5 | ### Overview 6 | Pillbox is a database of human prescription, over-the-counter, homeopathic, and veterinary oral solid dosage medications (pills) marketed in the United States of America. This data set contains information about pills such as how they look, their ingredients, and other criteria. This data can be used to identify unknown pills based on their physical appearance. It also answers questions like "What pills contain acetaminophen?" or "What pills contain lactose as an inactive ingredient?" The data contain unique identifiers, such as the RXCUI and FDA product code. 7 | 8 | ### Notice: Non-verified Data 9 | 10 | Pillbox's data is created by combining drug information resources from the Food and Drug Administration (FDA) and National Library of Medicine (NLM) at the National Institutes of Health. This information has been reformatted to make it easier to work with but has not been verified by FDA or NLM. The information available for download may not be the labeling on currently distributed products or identical to the labeling that is approved. NLM makes no warranty that the data is error free. 11 | 12 | #### Disclaimer 13 | 14 | The pill images and accompanying data available here were obtained from products acquired from a licensed pharmacy or the product manufacturer. Manufacturers may alter the appearance (e.g., shape, color, size, markings) of medications over time. 15 | 16 | The same medication may have been issued with a different appearance and/or different accompanying data before or after the date NLM acquired it. NLM would like to hear about any changes in medication appearance or possible errors in accompanying information. Please contact Pillbox by posting an issue in the Issue queue or sending an email to pillbox@mail.nih.gov if you notice any discrepancies in the information provided here. 17 | 18 | Reference in this data to any specific commercial product, process, service, manufacturer, or company does not constitute its endorsement or recommendation by the U.S. government or the U.S. Department of Health and Human Services or any of its agencies. 19 | 20 | Neither the U.S. government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal responsibility for the accuracy, completeness, or usefulness of any information disclosed. 21 | 22 | ### Terms of Use 23 | 24 | Use of this data is subject to the [Terms of Service document](https://github.com/HHS/pillbox-data-process/blob/master/documentation/TERMS.md) made available in this repository. 25 | -------------------------------------------------------------------------------- /documentation/DATA.md: -------------------------------------------------------------------------------- 1 | ## Overview 2 | 3 | Many developers new to working with medication-related data are unaware of potential errors and discrepancies within the data. This document provides an overview of many issues present in the Pillbox data set. It is intended as an educational resource to 1) help developers better understand the data, and 2) illustrate the scope and limitations of working with this data, which is critical when creating applications that identify unknown medications or provide information to clinicians, patients, and others. 4 | 5 | It is also intended to be a starting point for discussion and further exploration of this data, with the goal of improving downstream utilization by a growing community of innovative individuals and groups seeking to solve challenges not only in medication identification and reference, but across health care.
6 | 7 | This report is not intended to definitively confirm the presence of an error in Structured Product Labeling data supplied by a submitting firm. Rather, it highlights potential errors, discrepancies, and data of interest based on analysis of data that falls outside expected parameters. It is not intended to be comprehensive. 8 | 9 | Pillbox's data is derived from the Structured Product Labeling, obtained via NLM's DailyMed, and NLM's RxNorm, a normalized naming system for generic and branded drugs. 10 | 11 | ### Data changes and transparency 12 | 13 | Because changes are now being made to the data based on comparison of pill images to the physical characteristics data, every effort is being made to be as transparent as possible about this process. This includes making a table of all changes made to the data available for download, web pages which list every change made along with the image used for comparison, and this document which defines the scope and methodology used to make these changes. 14 | 15 | The scope of these changes is not intended to be complete. In situations where ambiguity exists, the original data is not modified. For records where there is no image available, it is not possible to verify the physical characteristics data. 16 | 17 | Data changes in Pillbox do not affect the master data derived from the drug labels, available via DailyMed. 18 | 19 | ### Identification of data errors and discrepancies 20 | 21 | Errors or discrepancies in Pillbox's data can be identified either visually (comparing pill images to data) or algorithmically (querying data based on logical assumptions about the data). 22 | 23 | To the best of the Pillbox team's knowledge, no federal agency reviews the physical characteristics data for pills. It is our hope that Pillbox (and the pill images produced through the project) will be a catalyst for the development of manual or automated validation systems for these data and improvement in the overall accuracy of these data. 24 | 25 | ### Three unique problems 26 | 27 | #### Re-labeling and re-distribution 28 | 29 | A unique type of discrepancy can occur when a pill is marketed by more than one company. Company A may manufacture and market a drug. They may also allow other companies to distribute and market that same drug. Each company must submit a separate drug label, and that same pill will have a different National Drug Code (NDC) in each label. Some distribution chains may be quite large. The small, brown, 200 mg ibuprofen tablet with an imprint of "I2", for example, is distributed under almost 80 different NDCs and labels. 30 | 31 | An upcoming data release of Pillbox will group pills by physical characteristics, ingredients and strength, and other criteria so that each unique pill has only one record. In practice this may not be possible, as some label authors have changed one or more of the physical characteristics or other data from the source label. 32 | 33 | The benefits of organizing pills by original manufactured products extend beyond simplifying identification and improving user experience. If an issue, such as contamination, should ever occur with a medication, identifying all distribution points for that medication is critical for public safety. Developers should have easy access to that information. 34 | 35 | #### Manufacturers changing the appearance of a pill 36 | 37 | FDA guidance requires (citation needed) that if the physical characteristics of a pill change, then that product requires a new NDC.
The most common change made to a pill is the imprint. When a manufacturer changes the physical characteristics of a pill and does not apply for a new NDC, it creates a conflict in presenting the data, especially if there is an image for the pill. 38 | 39 | For example, Company A has a pill with an imprint "123 10". They then change the imprint to "A 10". For a certain time, both pills will be available. If the imprint in the image differs from the data, it may be difficult to determine whether 1) the imprint data is incorrect or 2) the imprint has changed but the original NDC was kept. Also, if both pills are present in the market, there should be two separate records, as users could be trying to identify either pill. 40 | 41 | #### Identifying pills that are no longer marketed 42 | 43 | Pillbox was not designed to be an archival resource. It was intended to reflect the current information available via its sources. The data process which creates Pillbox takes current data from DailyMed and RxNorm and parses individual products (pills). Cases exist, however, where a user is trying to identify an older medication, stored for years in a medicine cabinet. In disaster response situations, medications which are past the expiration date may be used if certain criteria are met and tests show the medications have retained their potency. 44 | 45 | This issue will also be addressed by the upcoming data release in which pills will be grouped by physical characteristics. It has yet to be determined how far back in time to go when looking for unique pills. Also, without images it will be difficult to verify the accuracy of the physical characteristics used to group the pills. This will result in a greater number of groups, with some groups being created based on inaccurate data. Groupings based on accurate data will be unaffected. 46 | 47 | ### Visually identifying data issues 48 | 49 | FDA publishes guidance for coding the physical characteristics (imprint, color, shape, size, score) of pills. Based on a review of the 2,159 images available via Pillbox as of July 2013, changes were made to the physical characteristics data of approximately 17% (359) of records for which there was an image. 50 | 51 | As Pillbox increases the number of standardized, high quality images available, those images will be compared to the data for each product to ensure the physical characteristics (imprint, color, shape, size, score) data match the images. While many of these errors are more easily identified than others (a round pill listed as square), some criteria are more subjective or nuanced. 52 | 53 | Where data is modified, the goals are to improve search results without introducing ambiguity and to accurately represent the text that appears on a pill. All changes made to the data are listed in the trade_dress_change_log table. 54 | 55 | #### Imprint 56 | 57 | Before continuing, you should read the FDA form and submission requirements for [imprint](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071810.htm). 58 | 59 | Imprint is perhaps the single best identifier for a pill and presents challenges when developing search logic. While relatively few errors of commission (typographic errors) have been found in the data based on a comparison to available pill images, a number of other factors are present. 60 | 61 | Issues encountered: 62 | * Company or drug names are sometimes omitted from the imprint data.
63 | * Descriptive text is included in the imprint value (ex: "A;10;company logo") 64 | * Imprint has been changed by the manufacturer 65 | * Some imprint values are formatted inconsistently in a way that may affect search results 66 | 67 | Additional rules (several of these are illustrated in the short sketch after the Color section below): 68 | * Trailing semi-colons in the data are removed (ex: "A;10;" changed to "A;10") 69 | * Dashes, decimal points, slashes and spaces are replaced with semi-colons 70 | * Stylized single letters (which often appear on pills) are entered as part of the imprint value 71 | * Text that appears on a separate line or is separated by a score line is separated in the imprint value by semi-colons 72 | * If text is repeated on a pill (ex: a scored pill with the number 10 on both sides or a pill with the letter A appearing multiple times around the edge of a pill) it is entered as separate text. This will increase the likelihood of an exact match while not interfering with search strings that only include one iteration of repeated text, and errs on the side of an accurate representation of the look of the pill. 73 | * Text that crosses itself (ex: BAYER written vertically and horizontally, crossing at the Y) is entered as separate values, separated by a semi-colon. 74 | 75 | One additional area of concern related to imprint values is characters which look similar. Text on a pill is often small and the various imprinting processes may render text that is difficult to read. Users may not be able to accurately identify a character in situations like these. 76 | 77 | * lower-case L vs the number one (1) 78 | * Upper-case O vs the number zero (0) 79 | 80 | #### Color 81 | 82 | Before continuing, you should read the FDA form and submission requirements for [color](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071794.htm). 83 | 84 | Color is the most subjective of the physical characteristics; however, it is one of the most likely to be used by an individual describing a pill. Existing guidance specifies RGB values for each of the 12 colors. Color perception varies greatly among individuals and is subject to ambient lighting conditions (indoor fluorescent and incandescent light sources, sunlight, reflected light, etc.). As such, similarly colored pills may be listed as a variety of similar colors, such as red/orange/brown or blue/turquoise/green. 85 | 86 | Issues encountered: 87 | * Pills listed as a color that is obviously different from the predominant color present in the image. In these situations the color value was changed to that of the predominant color. The new color is subject to the subjectivity described previously. 88 | 89 | Additional rules: 90 | * Though the guidance specifies that there should be only one value present for color, it is common to see two colors listed. This practice is upheld in Pillbox's data. 91 | * For pills that have more than one distinct color (a capsule with a pink cap and white base), the secondary color is added. 92 | * If the labeler lists a single color and the pill could also be described by a second color, that color may be added. 93 | * When more than one color is listed, if there is a predominant color (such as a yellow pill with a small white section in the middle) the predominant color is listed first. This provides the potential to enhance search and more accurately describe the pill without negatively affecting search results. 94 | * Double color listings (ex: white/white) are changed to a single value of that color.
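To make these conventions concrete, here is a minimal sketch — not part of the Pillbox data process itself — of how a few of the imprint and color normalization rules above might be applied. The function names are invented for illustration, and the `white/white` example mirrors the one given in the color rules:

```python
#!/usr/bin/python
# Sketch of a few imprint/color normalization rules described above.
import re

def normalize_imprint(imprint):
    # Dashes, decimal points, slashes and spaces are replaced with semi-colons
    value = re.sub(r'[-./ ]+', ';', imprint.strip())
    # Trailing semi-colons are removed (ex: "A;10;" becomes "A;10")
    return value.rstrip(';')

def normalize_color(color):
    # Double color listings (ex: "white/white") collapse to a single value
    parts = color.split('/')
    if len(parts) == 2 and parts[0] == parts[1]:
        return parts[0]
    return color

if __name__ == "__main__":
    print normalize_imprint("A-10 ")      # A;10
    print normalize_imprint("A;10;")      # A;10
    print normalize_color("white/white")  # white
```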
95 | 96 | The NLM Pillbox SPLIMAGE pill image specification creates images under standardized lighting conditions. It is hoped that these images will lead to development of an automated system to accurately define the predominant color of a pill and create a palette of colors that is representative of the colors present. 97 | 98 | #### Shape 99 | 100 | Before continuing, you should read the FDA form and submission requirements for [shape](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071802.htm). 101 | 102 | The guidance for shape has two specific nuances that may not be obvious. First, when dealing with multi-sided shapes, such as pentagons (5-sided) or hexagons (6-sided), the sides do not need to be equal in length. Also, sides do not need to be straight. Thus a multi-sided shape with unequal or slightly curved sides should still be coded as that shape rather than as freeform. 103 | 104 | Issues encountered: 105 | * Shapes listed as freeform that are actually multi-sided shapes 106 | * Capsules (two-part capsules) listed as oval. 107 | * Oval tablets listed as capsule. Even if the medication name includes the word "capsule", if it is not a two-part capsule or a banded two-part capsule it should be listed as oval. 108 | 109 | #### Size 110 | 111 | Before continuing, you should read the FDA form and submission requirements for [size](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071800.htm). 112 | 113 | Without having a pill to measure, it is difficult to identify inaccuracies in the size values. The NLM Pillbox SPLIMAGE pill image specification includes a ruler in the image, however, so some size discrepancies can still be identified. 114 | 115 | Issues encountered: 116 | * Pills whose size value differs from the ruler present in the image. 117 | 118 | Additional rules: 119 | * If the size as measured in an SPLIMAGE spec image varies by more than 2 mm from the size value provided, it is changed to match the value as measured in the image. 120 | 121 | #### Score 122 | 123 | Before continuing, you should read the FDA form and submission requirements for [score](http://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071805.htm). 124 | 125 | Issues encountered: 126 | * Incorrect score values entered 127 | * Pills with different score values on either side 128 | * Though the guidance states score refers to the pill being broken into "equal sized pieces", there exists a pill that is scored to be broken into unequal sized pieces (different dosages) 129 | * Some score lines are faint or only present near the edges of the pill (not a continuous line across the pill) 130 | 131 | Additional rules: 132 | * When a pill is scored differently on either side (one side can be divided into two pieces, the other side can be divided into three pieces), the higher score value is entered. This is consistent with the one known example. 133 | 134 | ### Algorithmically identifying data issues 135 | 136 | Logical assumptions about the data and relationships between data can identify potential issues. These issues are not addressed in Pillbox's data unless there is an image present. 137 | 138 | #### NULL values 139 | 140 | On occasion, certain values in the data are NULL.
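Several of the checks in this section — the NULL values noted here and the size outliers listed below — can be expressed as simple queries over `spl_data.csv`. The following rough sketch is not part of the actual data process; the column names are those written by `makecsv.py`, and missing values are assumed to appear as empty strings in the CSV:

```python
#!/usr/bin/python
# Sketch of algorithmic checks: NULL physical characteristics and size outliers.
import csv

CHECK_FIELDS = ['SPLIMPRINT', 'SPLCOLOR', 'SPLSHAPE', 'SPLSIZE', 'SPLSCORE']

with open('../tmp/processed/csv/spl_data.csv', 'rb') as f:
    for row in csv.DictReader(f):
        # Flag records with NULL (empty) physical characteristics values
        missing = [h for h in CHECK_FIELDS if row[h] in ('', 'None')]
        if missing:
            print "%s missing: %s" % (row['setid'], ", ".join(missing))
        # Flag size outliers: extremely large (30+ mm) or small (0-1 mm) values
        try:
            size = float(row['SPLSIZE'])
            if size >= 30 or size <= 1:
                print "%s size outlier: %s mm" % (row['setid'], row['SPLSIZE'])
        except ValueError:
            pass
```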
141 | 142 | #### Capsules with score = 1 143 | 144 | #### Size outliers 145 | 146 | * Pills with extremely large (30+ mm) or small (0 or 1 mm) size values 147 | 148 | #### Size with a decimal point 149 | 150 | #### Pills listed as REMAINDER 151 | 152 | #### DEA schedule is NULL 153 | 154 | NULL is a valid value for DEA schedule. It implies that the drug is not scheduled. However, when you see a NULL value you cannot definitively know if the drug is not scheduled or the labeler forgot to list the value. There are records for drugs that are on the DEA schedule, but some of those records list the DEA schedule as NULL. 155 | 156 | #### Duplicate records 157 | 158 | ### Other techniques for identifying issues in the data 159 | 160 | #### Data which is not normalized (inactive ingredients, author) 161 | 162 | #### Comparison of records believed to be redistributed products 163 | 164 | #### Faceting via OpenRefine -------------------------------------------------------------------------------- /documentation/README.md: -------------------------------------------------------------------------------- 1 | ## Pillbox Project Documentation 2 | 3 | This folder contains information about the Pillbox project and about contributing to and using the project data. 4 | 5 | - [About Pillbox](https://github.com/HHS/pillbox-data-process/blob/master/documentation/ABOUT.md) 6 | - [Pillbox Data Overview](https://github.com/HHS/pillbox-data-process/blob/master/documentation/DATA.md) 7 | - [Data process setup](https://github.com/HHS/pillbox-data-process/blob/master/documentation/SETUP.md) 8 | - [Terms](https://github.com/HHS/pillbox-data-process/blob/master/documentation/TERMS.md) 9 | -------------------------------------------------------------------------------- /documentation/SETUP.md: -------------------------------------------------------------------------------- 1 | ## Setting up your local environment 2 | 3 | #### Setting up on Ubuntu 4 | 5 | 1. To set up on a clean Ubuntu installation, run the following commands to install the necessary requirements: 6 | 7 | ``` 8 | apt-get update 9 | apt-get install git -y 10 | apt-get install unzip -y 11 | sudo apt-get install python-setuptools python-dev build-essential -y 12 | apt-get install libxml2-dev -y 13 | apt-get install libxslt1-dev -y 14 | easy_install -U setuptools 15 | apt-get install python-pip 16 | pip install lxml 17 | pip install requests simplejson 18 | ``` 19 | 20 | 2. Clone this repo locally using the `git clone` command. This requires a Github account. 21 | ``` 22 | git clone git@github.com:HHS/pillbox-data-process.git 23 | ``` 24 | 25 | 3. Follow [steps for data process](https://github.com/HHS/pillbox-data-process/tree/master/scripts#pillbox-data-process). 26 | 27 | #### Setting up on Mac OSX 28 | 29 | Latest versions of OSX come with Python 2.7 installed. To run the Pillbox process, additional packages need to be installed. This assumes [Xcode](https://developer.apple.com/xcode/downloads/) & command line tools are installed. If not, install Xcode first. 30 | 31 | 1. Install pip 32 | ``` 33 | sudo easy_install pip 34 | ``` 35 | 36 | 2. Clone this repo locally using `git clone`. This requires a Github account. 37 | 38 | ``` 39 | git clone git@github.com:HHS/pillbox-data-process.git 40 | ``` 41 | 42 | 3. Install Python requirements for Pillbox 43 | ``` 44 | cd pillbox-data-process 45 | cd scripts 46 | sudo pip install -r requirements.txt 47 | ``` 48 | 49 | 4. Follow [steps for data process](https://github.com/HHS/pillbox-data-process/tree/master/scripts#pillbox-data-process).
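On either platform, you can optionally confirm that the required Python packages (the ones listed in `requirements.txt`) import correctly before running the data process — this quick check is not part of the official setup steps:

```
python -c "import lxml, requests, simplejson; print 'requirements OK'"
```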
50 | -------------------------------------------------------------------------------- /documentation/TERMS.md: -------------------------------------------------------------------------------- 1 | ## Terms of Service 2 | 3 | The U.S. National Library of Medicine ("NLM") offers some of its public data in machine readable format via an Application Programming Interface ("API"). This service is offered subject to your acceptance of the terms and conditions contained herein as well as any relevant sections of the www.nlm.nih.gov Website Policies and Privacy Policy (collectively, the "Agreements"). 4 | 5 | ### Scope 6 | 7 | All of the content, documentation, code and related materials made available to you through the API is subject to these terms. Access to or use of the API or its content constitutes acceptance of this Agreement. 8 | 9 | ### Use 10 | 11 | You may use the Pillbox API to develop a service or services to search, display, analyze, retrieve, view and otherwise 'get' information from NLM Pillbox data. 12 | 13 | ### Attribution 14 | 15 | All services which utilize or access the API should display the following notice prominently within the application: "This product uses publicly available data from the U.S. National Library of Medicine (NLM), National Institutes of Health, Department of Health and Human Services; NLM is not responsible for the product and does not endorse or recommend this or any other product." You may use the NLM name in order to identify the source of API content subject to these rules. You may not use the NLM name, logo, or the like to imply endorsement of any product, service, or entity, not-for-profit, commercial or otherwise. 16 | 17 | ### Modification or False Representation of Content 18 | 19 | You may not modify or falsely represent content accessed through the API and still claim the source is NLM. 20 | 21 | ### Right to Limit 22 | 23 | Your use of the API may be subject to certain limitations on access, calls, or use as set forth within this Agreement or otherwise provided by NLM. If NLM reasonably believes that you have attempted to exceed or circumvent these limits, your ability to use the API may be permanently or temporarily blocked. NLM may monitor your use of the API to improve the service or to ensure compliance with this Agreement. 24 | 25 | ### Service Termination 26 | 27 | If you wish to terminate this Agreement, you may do so by refraining from further use of the API. NLM reserves the right (though not the obligation) to (1) refuse to provide the API to you, if it is NLM's opinion that your use violates any NLM policy, or (2) terminate or deny you access to and use of all or part of the API at any time for any other reason in its sole discretion. All provisions of this Agreement which by their nature should survive termination shall survive termination including, without limitation, warranty disclaimers, indemnity, and limitations of liability. 28 | 29 | ### Changes 30 | 31 | NLM reserves the right, at its sole discretion, to modify or replace this Agreement, in whole or in part. Your continued use of or access to the API following posting of any changes to this Agreement constitutes acceptance of those modified terms. NLM may, in the future, offer new services and/or features through the API. Such new features and/or services shall be subject to the terms and conditions of this Agreement. 32 | 33 | ### Disclaimer of Warranties 34 | 35 | The API is provided "as is" and on an "as-available" basis.
NLM hereby disclaims all warranties of any kind, express or implied, including without limitation the warranties of merchantability, fitness for a particular purpose, and non-infringement. NLM makes no warranty that the API will be error free or that access thereto will be continuous or uninterrupted. 36 | 37 | ### Limitations on Liability 38 | 39 | In no event will NLM be liable with respect to any subject matter of this Agreement under any contract, negligence, strict liability or other legal or equitable theory for: 40 | 41 | any special, incidental, or consequential damages; the cost of procurement of substitute products or services; or interruption of use or loss or corruption of data. 42 | 43 | ### General Representations 44 | 45 | You hereby warrant that (1) your use of the API will be in strict accordance with the NLM privacy policy, this Agreement, and all applicable laws and regulations, and (2) your use of the API will not infringe or misappropriate the intellectual property rights of any third party. 46 | 47 | ### Indemnification 48 | 49 | You agree to indemnify and hold harmless NLM, its contractors, employees, agents, and the like from and against any and all claims and expenses, including attorney's fees, arising out of your use of the API, including but not limited to violation of this Agreement. 50 | 51 | ### Miscellaneous 52 | 53 | This Agreement constitutes the entire Agreement between NLM and you concerning the subject matter hereof, and may only be modified by the posting of a revised version on this page by NLM. 54 | 55 | ### Disputes 56 | 57 | Any disputes arising out of this Agreement and access to or use of the API shall be governed by federal law. 58 | 59 | ### No Waiver of Rights 60 | 61 | NLM's failure to exercise or enforce any right or provision of this Agreement shall not constitute a waiver of such right or provision. -------------------------------------------------------------------------------- /scripts/README.md: -------------------------------------------------------------------------------- 1 | ## Pillbox Data Process 2 | 3 | The Pillbox data process uses a series of Python and Shell scripts to download, unzip, and parse the XML data provided by DailyMed. The process is broken into three phases: 4 | 5 | 1. Download (`download.sh`) and unzip (`unzip.sh`) 6 | 2. Process XML (`master.py`, `xpath.py`, `rxnorm.py`, `error.py`, `makecsv.py`) 7 | 3. Post-processing for static API and other outputs 8 | 9 | The scripts generate a series of directories under a `tmp` folder at the root of the repository. In addition, the `master.py` data process generates two main intermediate Pillbox outputs: `/processed/csv/spl_data.csv` & `/processed/json/`. 10 | 11 | #### Requirements 12 | 13 | - Python 2.6+ 14 | - [PIP](http://www.pip-installer.org/en/latest/installing.html#install-or-upgrade-pip) 15 | - `pip install -r requirements.txt` 16 | - unzip (if not on OSX) 17 | - wget (if not on Ubuntu) 18 | - 30+GB of free space 19 | 20 | ### Using the scripts 21 | 22 | #### 1. Download and Unzip 23 | The download and unzip scripts take a release date argument (ex: `2014-02-24`), download roughly 16GB of files from DailyMed, and unzip them into temporary folders. 24 | 25 | To run: 26 | 27 | ``` 28 | ./download.sh 2014-02-24 29 | ./unzip.sh 2014-02-24 30 | ``` 31 | 32 | #### 2. Process XML 33 | 34 | After the downloading is finished, to process the unzipped XML files, run `master.py`. This script will use the `xpath.py`, `rxnorm.py`, `error.py`, and `makecsv.py` modules. 35 | 36 | To run: 37 | 38 | ``` 39 | ./master.py 40 | ``` 41 | 42 | #### 3.
Post-processing 43 | 44 | To run any post-processing on the generated CSV or json files, run `api.py` or generate an additional script. 45 | 46 | To run: 47 | 48 | ``` 49 | ./api.py 50 | ``` 51 | -------------------------------------------------------------------------------- /scripts/api.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Import Python Modules 4 | import os 5 | import sys 6 | import time 7 | import traceback 8 | import csv 9 | import shutil 10 | import simplejson as json 11 | from datetime import datetime 12 | 13 | print "Processing data..." 14 | 15 | # Global variables 16 | authorList = [] 17 | author = {} 18 | 19 | colorList = [] 20 | color = {} 21 | 22 | def createIndex(): 23 | 24 | def authorIndex(data): 25 | global authorList 26 | global author 27 | 28 | if data['data']['author'] not in authorList: 29 | authorList.append(data['data']['author']) 30 | author[data['data']['author']] = [] 31 | 32 | # build author objects 33 | if data['data']['author'] != "": 34 | author[data['data']['author']].append(data['setid_product']) 35 | 36 | def colorIndex(data): 37 | global colorList 38 | global color 39 | 40 | if data['data']['SPLCOLOR'] not in colorList: 41 | colorList.append(data['data']['SPLCOLOR']) 42 | color[data['data']['SPLCOLOR']] = [] 43 | 44 | # build color objects 45 | # {'C48325': ['3CAF3F19-96B4-DAE6-35AA-05643BD531D2-58177-001']} 46 | if data['data']['SPLCOLOR'] != "": 47 | color[data['data']['SPLCOLOR']].append(data['setid_product']) 48 | 49 | os.chdir("../tmp/processed/json/") 50 | for fn in os.listdir('.'): 51 | if fn.endswith(".json"): 52 | data_file = open(fn, "rb").read() 53 | data = json.loads(data_file) 54 | authorIndex(data) 55 | colorIndex(data) 56 | 57 | def indexAPI(): 58 | global color 59 | global author 60 | 61 | authorIndex = [] 62 | for a,n in author.items(): 63 | authorJSON = { 64 | "author": "", 65 | "spl-id": [], 66 | "count": len(n) 67 | } 68 | authorJSON['author'] = a 69 | authorJSON['spl-id'] = n 70 | authorIndex.append(authorJSON) 71 | 72 | writeIndex(authorIndex, 'author') 73 | 74 | colorIndex = [] 75 | for c,n in color.items(): 76 | colorJSON = { 77 | "color": "", 78 | "spl-id": [], 79 | "count": len(n) 80 | } 81 | colorJSON['color'] = c 82 | colorJSON['spl-id'] = n 83 | colorIndex.append(colorJSON) 84 | 85 | writeIndex(colorIndex, 'color') 86 | 87 | def writeIndex(output, file_name): 88 | writeout = json.dumps(output, sort_keys=True, separators=(',',':')) 89 | f_out = open('../../../api/index/%s.json' % file_name, 'wb') 90 | f_out.writelines(writeout) 91 | f_out.close() 92 | print "%s index files created..." % file_name 93 | 94 | def copyProcessed(): 95 | os.chdir('../csv/') 96 | splSRC = ('spl_data.csv') 97 | ingredientSRC = ('spl_ingredients.csv') 98 | DST = ('../../../api/') 99 | shutil.copy(splSRC,DST) 100 | print "spl_data.csv copied." 101 | shutil.copy(ingredientSRC,DST) 102 | print "spl_ingredients.csv copied." 103 | 104 | if __name__ == "__main__": 105 | createIndex() 106 | indexAPI() 107 | copyProcessed() 108 | -------------------------------------------------------------------------------- /scripts/download.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | set -e 4 | 5 | STARTTIME=$(date +%s) 6 | date=$1 7 | if [ $date ]; then 8 | echo "processing..." 
9 | else 10 | echo "Error: no date entered with command (ex: 2014-02-24)" 11 | exit 1 12 | fi 13 | 14 | # make temp directories if do not exist 15 | mkdir -p ../tmp 16 | mkdir -p ../tmp/download 17 | mkdir -p ../tmp/download/$date 18 | 19 | tmpDIR=../tmp/ 20 | cd $tmpDIR 21 | 22 | # Download all Dailymed files 23 | wget -O download/$date/dm_spl_release_human_rx_part1.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_rx_part1.zip 24 | wget -O download/$date/dm_spl_release_human_rx_part2.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_rx_part2.zip 25 | wget -O download/$date/dm_spl_release_human_otc_part1.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_otc_part1.zip 26 | wget -O download/$date/dm_spl_release_human_otc_part2.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_otc_part2.zip 27 | wget -O download/$date/dm_spl_release_human_otc_part3.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_human_otc_part3.zip 28 | wget -O download/$date/dm_spl_release_homeopathic.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_homeopathic.zip 29 | wget -O download/$date/dm_spl_release_animal.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_animal.zip 30 | wget -O download/$date/dm_spl_release_remainder.zip ftp://public.nlm.nih.gov/nlmdata/.dailymed/dm_spl_release_remainder.zip 31 | 32 | echo "Dailymed files downloaded." 33 | echo "Downloading complete." 34 | ENDTIME=$(date +%s) 35 | TOTALTIME=$((($ENDTIME-$STARTTIME)/60)) 36 | echo "Processing took $TOTALTIME minutes to complete." 37 | -------------------------------------------------------------------------------- /scripts/error.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Import Python Modules 4 | import os 5 | import sys 6 | import time 7 | import traceback 8 | import simplejson as json 9 | from datetime import datetime 10 | 11 | now = datetime.now() 12 | 13 | error_count = 0 14 | 15 | errorFile = {"updated": now.strftime("%Y-%m-%d %H:%M"), "errors_total": "", "files": []} 16 | 17 | def xmlError(fn): 18 | global error_count 19 | exc_type, exc_value, exc_traceback = sys.exc_info() 20 | error = traceback.format_exc().splitlines() 21 | if error[-1] != "SystemExit: Not OSDF": 22 | error_count = error_count + 1 23 | errorPrint = {"file": fn, "error": error} 24 | errorFile['files'].append(errorPrint) 25 | 26 | def errorWrite(): 27 | global error_count 28 | errorFile['errors_total'] = error_count 29 | print error_count, "errors" 30 | writeout = json.dumps(errorFile, sort_keys=True, separators=(',',':')) 31 | f_out = open('../errors/errors.json', 'wb') 32 | f_out.writelines(writeout) 33 | f_out.close() 34 | -------------------------------------------------------------------------------- /scripts/makecsv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | import os 4 | import sys 5 | import csv 6 | import simplejson as json 7 | from itertools import chain 8 | import datetime 9 | 10 | now = datetime.datetime.now() 11 | today = now.strftime("%Y-%m-%d %H:%M") 12 | 13 | dataHeader = [ 14 | "setid", 15 | "file_name", 16 | "medicine_name", 17 | "part_medicine_name", 18 | "product_code", 19 | "part_num", 20 | "ndc9", 21 | "author", 22 | "author_type", 23 | "effective_time", 24 | "DEA_SCHEDULE_CODE", 25 | "DEA_SCHEDULE_NAME", 26 | "MARKETING_ACT_CODE", 27 | "NDC", 28 | "SPLCOLOR", 29 | "SPLIMAGE", 30 | "SPLIMPRINT", 31 | 
"SPLSCORE", 32 | "SPLSHAPE", 33 | "SPLSIZE", 34 | "SPL_INACTIVE_ING", 35 | "SPL_INGREDIENTS", 36 | "SPL_STRENGTH", 37 | "document_type", 38 | "dosage_form", 39 | "rxcui", 40 | "rxstring", 41 | "rxtty", 42 | "source", 43 | "equal_product_code", 44 | "approval_code" 45 | ] 46 | 47 | ingredientsHeader = [ 48 | "product_code", 49 | "setid", 50 | "part_num", 51 | "numerator_value", 52 | "numerator_unit", 53 | "denominator_value", 54 | "denominator_unit", 55 | "base_of_strength" 56 | ] 57 | 58 | dataOutput = open('../tmp/processed/csv/spl_data.csv', 'wb') 59 | ingredientsOutput = open('../tmp/processed/csv/spl_ingredients.csv', 'wb') 60 | dataWriter = csv.writer(dataOutput, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC, lineterminator='\n') 61 | dataWriter.writerow(dataHeader) 62 | 63 | ingredientsWriter = csv.writer(ingredientsOutput, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC, lineterminator='\n') 64 | ingredientsWriter.writerow(ingredientsHeader) 65 | 66 | def makeCSV(xmlData): 67 | 68 | for x in xmlData: 69 | dataRow = [] 70 | for h in dataHeader: 71 | if h == 'SPL_INACTIVE_ING': 72 | if x['data'][h] == None: 73 | dataRow.append(x['data'][h]) 74 | else: 75 | dataRow.append(";".join(x['data'][h]).encode('ascii','ignore')) 76 | elif h == 'NDC': 77 | dataRow.append(";".join(x['data'][h]).encode('ascii','ignore')) 78 | elif h == 'SPL_INGREDIENTS': 79 | if x['data'][h] == None: 80 | dataRow.append(x['data'][h]) 81 | else: 82 | dataRow.append(";".join(x['data'][h]).encode('ascii','ignore')) 83 | elif h == 'SPL_STRENGTH': 84 | if x['data'][h] == None: 85 | dataRow.append(x['data'][h]) 86 | else: 87 | dataRow.append(";".join(x['data'][h]).encode('ascii','ignore')) 88 | elif h == 'part_num': 89 | dataRow.append(x['data'][h]) 90 | elif h == 'product_name': 91 | dataRow.append(x['data'][h].encode('ascii','ignore')) 92 | else: 93 | if x['data'][h] == None: 94 | dataRow.append(x['data'][h]) 95 | else: 96 | dataRow.append(x['data'][h].encode('ascii','ignore')) 97 | dataWriter.writerow(dataRow) 98 | if x['ingredients']: 99 | for a in x['ingredients']: 100 | try: 101 | if a['active_moiety_names']: 102 | ingredientsRow = [] 103 | for i in ingredientsHeader: 104 | idCodes = x['setid_product'].split("-") 105 | setid = "-".join(idCodes[:-3]) 106 | product_code = idCodes[-3] + "-" + idCodes[-2] 107 | part_num = idCodes[-1] 108 | if i == 'product_code': 109 | ingredientsRow.append(product_code) 110 | elif i == 'setid': 111 | ingredientsRow.append(setid) 112 | elif i == 'part_num': 113 | ingredientsRow.append(part_num) 114 | elif i == 'base_of_strength': 115 | try: 116 | ingredientsRow.append(";".join(a['active_moiety_names']).encode('ascii','ignore')) 117 | except: 118 | ingredientsRow.append(None) 119 | else: 120 | try: 121 | ingredientsRow.append(a[i].encode('ascii','ignore')) 122 | except: 123 | ingredientsRow.append(None) 124 | ingredientsWriter.writerow(ingredientsRow) 125 | except: 126 | pass 127 | 128 | def closeCSV(): 129 | dataOutput.close() 130 | ingredientsOutput.close() 131 | 132 | def makeDataPackage(): 133 | datapackage = { 134 | "name": "pillbox", 135 | "title": "Pillbox SPL Data", 136 | "date_updated": today, 137 | "resources": [ 138 | { 139 | "path": "spl_data.csv", 140 | "schema": { 141 | "fields": [ 142 | {"name": "setid","type": "string"}, 143 | {"name": "file_name","type": "string"}, 144 | {"name": "medicine_name","type": "string"}, 145 | {"name": "product_code","type": "string"}, 146 | {"name": "part_num","type": "integer"}, 147 | {"name": "ndc9","type": 
"string"}, 148 | {"name": "author","type": "string"}, 149 | {"name": "author_type","type": "string"}, 150 | {"name": "date_created","type": "string"}, 151 | {"name": "effective_time","type": "integer"}, 152 | {"name": "DEA_SCHEDULE_CODE","type": "string"}, 153 | {"name": "DEA_SCHEDULE_NAME","type": "string"}, 154 | {"name": "MARKETING_ACT_CODE","type": "string"}, 155 | {"name": "NDC","type": "string"}, 156 | {"name": "SPLCOLOR","type": "string"}, 157 | {"name": "SPLIMAGE","type": "string"}, 158 | {"name": "SPLIMPRINT","type": "string"}, 159 | {"name": "SPLCOLOR","type": "string"}, 160 | {"name": "SPLSCORE","type": "integer"}, 161 | {"name": "SPLSHAPE","type": "string"}, 162 | {"name": "SPLSIZE","type": "integer"}, 163 | {"name": "SPL_INACTIVE_ING","type": "string"}, 164 | {"name": "SPL_INGREDIENTS","type": "string"}, 165 | {"name": "SPL_STRENGTH","type": "string"}, 166 | {"name": "SPLSHAPE","type": "string"}, 167 | {"name": "document_type","type": "string"}, 168 | {"name": "dosage_form","type": "string"}, 169 | {"name": "rxcui","type": "string"}, 170 | {"name": "rxstring","type": "string"}, 171 | {"name": "rxtty","type": "string"}, 172 | {"name": "source","type": "string"}, 173 | {"name": "equal_product_code","type": "string"}, 174 | {"name": "approval_code","type": "string"} 175 | ] 176 | } 177 | }, 178 | { 179 | "path": "spl_ingredients.csv", 180 | "schema": { 181 | "fields": [ 182 | {"name": "product_code","type": "string"}, 183 | {"name": "setid","type": "string"}, 184 | {"name": "part_num","type": "string"}, 185 | {"name": "numerator_value","type": "string"}, 186 | {"name": "numerator_unit","type": "string"}, 187 | {"name": "denominator_value","type": "string"}, 188 | {"name": "denominator_unit","type": "string"}, 189 | {"name": "base_of_strength","type": "string"} 190 | ] 191 | } 192 | } 193 | ] 194 | } 195 | 196 | writeout = json.dumps(datapackage, sort_keys=True, separators=(',',':'), indent=4 * ' ') 197 | f_out = open('../../api/datapackage.json', 'wb') 198 | f_out.writelines(writeout) 199 | f_out.close() 200 | print "Datapackage.json created..." 201 | 202 | if __name__ == "__main__": 203 | makeDataPackage() 204 | -------------------------------------------------------------------------------- /scripts/master.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Import Python Modules 4 | import os 5 | import sys 6 | import time 7 | import traceback 8 | import csv 9 | import glob 10 | import simplejson as json 11 | from datetime import datetime 12 | # Import other files 13 | from xpath import parseData 14 | import rxnorm 15 | import error 16 | import makecsv 17 | import Queue 18 | import threading 19 | import requests 20 | 21 | # Variables, directories, initializations 22 | print "Starting data processing..." 23 | master_t0 = time.time() 24 | file_count = 0 25 | os.chdir("../tmp/tmp-unzipped/") 26 | queue = Queue.Queue() 27 | 28 | # Check internet connection for RXNorm requests 29 | print "Checking RXNorm connection..." 30 | try: 31 | check = rxnorm.connectionCheck() 32 | print check 33 | except: 34 | sys.exit("RXNorm check failed. 
Check your internet connection.") 35 | 36 | def xmlProcess(fn): 37 | 38 | try: 39 | # run xpath.py on each file 40 | xmlData = parseData(fn) 41 | 42 | for x in xmlData: 43 | rxnormData = rxnorm.rxNorm(x['ndc_codes']) 44 | x['data']['rxcui'] = rxnormData['rxcui'] 45 | x['data']['rxtty'] = rxnormData['rxtty'] 46 | x['data']['rxstring'] = rxnormData['rxstring'] 47 | try: 48 | ndc9 = x['data']['product_code'].split("-") 49 | if len(ndc9[0]) < 5: 50 | ndc9[0] = "0%s" % ndc9[0] 51 | if len(ndc9[1]) < 4: 52 | ndc9[1] = "0%s" % ndc9[1] 53 | x['data']['ndc9'] = "".join(ndc9) 54 | except: 55 | x['data']['ndc9'] = "" 56 | # Make indidivdual json files per SETID-NDC code 57 | writeout = json.dumps(x, sort_keys=True, separators=(',',':')) 58 | f_out = open('../processed/json/%s.json' % x['setid_product'], 'wb') 59 | f_out.writelines(writeout) 60 | f_out.close() 61 | # Make CSV file for output, one row per SETID-NDC code 62 | makecsv.makeCSV(xmlData) 63 | except: 64 | error.xmlError(fn) 65 | 66 | class ThreadXML(threading.Thread): 67 | def __init__(self, queue): 68 | threading.Thread.__init__(self) 69 | self.queue = queue 70 | 71 | def run(self): 72 | while True: 73 | #grabs file from queue 74 | fn = self.queue.get() 75 | #grabs file and processes 76 | xmlProcess(fn) 77 | #signals to queue job is done 78 | self.queue.task_done() 79 | 80 | def main(): 81 | global file_count 82 | 83 | #spawn a pool of threads, and pass them queue instance 84 | for i in range(20): 85 | t = ThreadXML(queue) 86 | t.daemon = True 87 | t.start() 88 | 89 | #populate queue with data 90 | for d in os.listdir('.'): 91 | files = glob.glob('%s/*.xml' % d) 92 | for fn in files: 93 | file_count = file_count + 1 94 | queue.put(fn) 95 | print "Processing XML with XPATH..." 96 | 97 | #wait on the queue until everything has been processed 98 | queue.join() 99 | 100 | main() 101 | 102 | # Calculate the total time and print to console. 103 | master_t1 = time.time() 104 | total_time = (master_t1-master_t0)/60 105 | print file_count, "XML files processed." 106 | error.errorWrite() 107 | makecsv.closeCSV() 108 | makecsv.makeDataPackage() 109 | print "Processing complete. Total Processing time = %d minutes" % total_time 110 | -------------------------------------------------------------------------------- /scripts/requirements.txt: -------------------------------------------------------------------------------- 1 | lxml 2 | requests 3 | simplejson 4 | -------------------------------------------------------------------------------- /scripts/rxnorm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import os 3 | import sys 4 | import requests 5 | import simplejson as json 6 | 7 | def connectionCheck(): 8 | url = 'http://rxnav.nlm.nih.gov/REST/version' 9 | header = {'Accept': 'application/json'} 10 | getCheck = requests.get(url, headers=header) 11 | if getCheck.status_code != requests.codes.ok: 12 | response = "RXNorm server response error. Response code: %s" % getCheck.status_code 13 | else: 14 | response = "Connection check complete. RXNorm online. 
Response code: %s" % getCheck.status_code 15 | return response 16 | 17 | def rxNorm(ndc): 18 | # ndc value coming from master.py 19 | # ndc = [array of ndc values] 20 | if ndc[0] is None: 21 | return {"rxcui": "", "rxtty": "", "rxstring": ""} 22 | else: 23 | # if internet or request throws an error, print out to check connection and exit 24 | try: 25 | baseurl = 'http://rxnav.nlm.nih.gov/REST/' 26 | 27 | # Searching RXNorm API, Search by identifier to find RxNorm concepts 28 | # http://rxnav.nlm.nih.gov/REST/rxcui?idtype=NDC&id=0591-2234-10 29 | # Set url parameters for searching RXNorm for SETID 30 | ndcSearch = 'rxcui?idtype=NDC&id=' 31 | 32 | # Search RXNorm API, Return all properties for a concept 33 | rxPropSearch = 'rxcui/' 34 | rxttySearch = '/property?propName=TTY' 35 | rxstringSearch = '/property?propName=RxNorm%20Name' 36 | 37 | # Request RXNorm API to return json 38 | header = {'Accept': 'application/json'} 39 | def getTTY(rxCUI): 40 | # Search RXNorm again using RXCUI to return RXTTY & RXSTRING 41 | getTTY = requests.get(baseurl+rxPropSearch+rxCUI+rxttySearch, headers=header) 42 | 43 | ttyJSON = json.loads(getTTY.text, encoding="utf-8") 44 | 45 | return ttyJSON['propConceptGroup']['propConcept'][0]['propValue'] 46 | 47 | def getSTRING(rxCUI): 48 | # Search RXNorm again using RXCUI to return RXTTY & RXSTRING 49 | getString = requests.get(baseurl+rxPropSearch+rxCUI+rxstringSearch, headers=header) 50 | stringJSON = json.loads(getString.text, encoding="utf-8") 51 | 52 | return stringJSON['propConceptGroup']['propConcept'][0]['propValue'] 53 | 54 | # Search RXNorm using NDC code, return RXCUI id 55 | # ndc = [ndc1, ndc2, ... ] 56 | for item in ndc: 57 | getRXCUI = requests.get(baseurl+ndcSearch+item, headers=header) 58 | if getRXCUI.status_code != requests.codes.ok: 59 | print "RXNorm server response error. Response code: %s" % getRXCUI.status_code 60 | rxcuiJSON = json.loads(getRXCUI.text, encoding="utf-8") 61 | # Check if first value in list returns a RXCUI, if not go to next value 62 | try: 63 | if rxcuiJSON['idGroup']['rxnormId']: 64 | rxCUI = rxcuiJSON['idGroup']['rxnormId'][0] 65 | rxTTY = getTTY(rxCUI) 66 | rxSTRING = getSTRING(rxCUI) 67 | return {"rxcui": rxCUI, "rxtty": rxTTY, "rxstring": rxSTRING} 68 | except: 69 | # if last item return null values 70 | if item == ndc[-1]: 71 | return {"rxcui": "", "rxtty": "", "rxstring": ""} 72 | pass 73 | except: 74 | sys.exit("RXNorm connection") 75 | 76 | if __name__ == "__main__": 77 | # Test with sample NDC codes, one works, one doesn't 78 | dataTest = rxNorm(['66435-101-42', '66435-101-56', '66435-101-70', '66435-101-84', '66435-101-14', '66435-101-16', '66435-101-18']) 79 | print dataTest 80 | -------------------------------------------------------------------------------- /scripts/unzip.sh: -------------------------------------------------------------------------------- 1 | #! /bin/bash 2 | 3 | set -e 4 | 5 | STARTTIME=$(date +%s) 6 | date=$1 7 | if [ $date ]; then 8 | echo "processing..." 
9 | else 10 | echo "Error: no date entered with command (ex: 2014-02-24)" 11 | exit 1 12 | fi 13 | 14 | # make temp directories if do not exist 15 | mkdir -p ../tmp 16 | mkdir -p ../tmp/tmp-original 17 | mkdir -p ../tmp/tmp-unzipped 18 | mkdir -p ../tmp/tmp-images 19 | mkdir -p ../tmp/processed/json 20 | mkdir -p ../tmp/processed/csv 21 | mkdir -p ../tmp/errors 22 | 23 | tmpDIR=../tmp/ 24 | cd $tmpDIR 25 | 26 | # Removes old files 27 | tmpOriginal=tmp-original/ 28 | cd $tmpOriginal 29 | for FOLDER in HRX HOTC HOMEO ANIMAL REMAIN; 30 | do 31 | if [ -d $FOLDER ]; then 32 | rm -r $FOLDER 33 | fi 34 | done 35 | echo "removed /tmp-original files" 36 | 37 | # Removes old files 38 | tmpUnzipped=../tmp-unzipped/ 39 | cd $tmpUnzipped 40 | for FOLDER in HRX HOTC HOMEO ANIMAL REMAIN; 41 | do 42 | if [ -d $FOLDER ]; then 43 | rm -r $FOLDER 44 | fi 45 | done 46 | echo "removed /tmp-unzipped files" 47 | 48 | # Removes old files 49 | tmpImages=../tmp-images/ 50 | cd $tmpImages 51 | for FOLDER in HRX HOTC HOMEO ANIMAL REMAIN; 52 | do 53 | if [ -d $FOLDER ]; then 54 | rm -r $FOLDER 55 | fi 56 | done 57 | echo "removed /tmp-images files" 58 | 59 | # create tmp subfolders 60 | mkdir -p ../tmp-original/HRX 61 | mkdir -p ../tmp-original/HOTC 62 | mkdir -p ../tmp-original/HOMEO 63 | mkdir -p ../tmp-original/ANIMAL 64 | mkdir -p ../tmp-original/REMAIN 65 | 66 | mkdir -p ../tmp-unzipped/HRX 67 | mkdir -p ../tmp-unzipped/HOTC 68 | mkdir -p ../tmp-unzipped/HOMEO 69 | mkdir -p ../tmp-unzipped/ANIMAL 70 | mkdir -p ../tmp-unzipped/REMAIN 71 | 72 | mkdir -p ../tmp-images/HRX 73 | mkdir -p ../tmp-images/HOTC 74 | mkdir -p ../tmp-images/HOMEO 75 | mkdir -p ../tmp-images/ANIMAL 76 | mkdir -p ../tmp-images/REMAIN 77 | 78 | # unzips main files to get individual zipped files 79 | ORIGNDATA=../tmp-original/ 80 | cd $ORIGNDATA 81 | 82 | unzip -qj ../download/$date/dm_spl_release_human_rx_part1.zip -d HRX/ 83 | unzip -qj ../download/$date/dm_spl_release_human_rx_part2.zip -d HRX/ 84 | unzip -qj ../download/$date/dm_spl_release_human_otc_part1.zip -d HOTC/ 85 | unzip -qj ../download/$date/dm_spl_release_human_otc_part2.zip -d HOTC/ 86 | unzip -qj ../download/$date/dm_spl_release_human_otc_part3.zip -d HOTC/ 87 | unzip -qj ../download/$date/dm_spl_release_homeopathic.zip -d HOMEO/ 88 | unzip -qj ../download/$date/dm_spl_release_animal.zip -d ANIMAL/ 89 | unzip -qj ../download/$date/dm_spl_release_remainder.zip -d REMAIN/ 90 | 91 | echo "original files unzipped to /tmp-original" 92 | 93 | # loop through all individual zipped files to unzip 94 | for FOLDER in HRX HOTC HOMEO ANIMAL REMAIN; 95 | do 96 | if [ -d $FOLDER ]; then 97 | for f in `ls "$FOLDER/"`; 98 | do 99 | ID=$(zipinfo -1 "$FOLDER/$f" "*.xml" | sed 's/....$//') 100 | unzip -Cqo "$FOLDER/$f" "*.xml" -d ../tmp-unzipped/$FOLDER/ 101 | unzip -Cqo "$FOLDER/$f" -d ../tmp-images/$FOLDER/${ID} 102 | rm ../tmp-images/$FOLDER/${ID}/${ID}.xml 103 | done 104 | fi 105 | done 106 | 107 | echo "all files unzipped." 108 | echo "processing complete." 109 | ENDTIME=$(date +%s) 110 | TOTALTIME=$((($ENDTIME-$STARTTIME)/60)) 111 | echo "Processing took $TOTALTIME minutes to complete." 112 | -------------------------------------------------------------------------------- /scripts/xpath.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # ------------------ 3 | # Pillbox Xpath script that extracts raw data from XML to yield: 4 | # 1. Rows array, with one row object per product code. 5 | # 2. 
Ingredients array, with one object per ingredient 6 | # ------------------ 7 | # Requirements: Python 2.6 or greater 8 | 9 | import os, sys, time 10 | import StringIO 11 | import atexit 12 | from lxml import etree 13 | from itertools import groupby 14 | 15 | # Check all XMLs against form codes, discard all XMLs that don't match 16 | codeChecks = [ 17 | "C25158", "C42895", "C42896", 18 | "C42917", "C42902", "C42904", 19 | "C42916", "C42928", "C42936", 20 | "C42954", "C42998", "C42893", 21 | "C42897", "C60997", "C42905", 22 | "C42997", "C42910", "C42927", 23 | "C42931", "C42930", "C61004", 24 | "C61005", "C42964", "C42963", 25 | "C42999", "C61006", "C42985", 26 | "C42992" 27 | ] 28 | 29 | 30 | def parseData(name): 31 | # Iterparse function that clears the memory each time it finishes running 32 | def getelements(filename, tag, medicineCheck): 33 | context = iter(etree.iterparse(filename)) 34 | _, root = next(context) # get root element 35 | for event, elem in context: 36 | # if event == 'start': 37 | # If we pass "yes" via medicineCheck, then we need to return instead of 38 | if medicineCheck == 'yes': 39 | if elem.tag == "{urn:hl7-org:v3}manufacturedMedicine": 40 | yield elem 41 | elif elem.tag == '{urn:hl7-org:v3}manufacturedProduct' or elem.tag =='manufacturedProduct': 42 | yield elem 43 | else: 44 | if tag.find('{') >= 0: 45 | tag = tag[16:] 46 | if elem.tag.find('{') >= 0: 47 | elem.tag = elem.tag[16:] 48 | 49 | if elem.tag == tag: 50 | yield elem 51 | root.clear() # preserve memory 52 | 53 | # ------------------ 54 | # Build SetInfo array 55 | # ------------------ 56 | setInfo = {} 57 | filename = name.split('/') 58 | setInfo['file_name'] = filename[1] 59 | setInfo['source'] = filename[0] 60 | 61 | def getInfo(): 62 | # Get information at parent level 63 | tree = etree.parse(name) 64 | root = tree.getroot() 65 | for child in root.xpath("./*[local-name() = 'id']"): 66 | setInfo['id_root'] = child.get('root') 67 | for child in root.xpath("./*[local-name() = 'setId']"): 68 | setInfo['setid'] = child.get('root') 69 | for child in root.xpath("./*[local-name() = 'effectiveTime']"): 70 | setInfo['effective_time'] = child.get('value') 71 | for child in root.xpath("./*[local-name() = 'code']"): 72 | setInfo['document_type'] = child.get('code') 73 | 74 | # -------------------- 75 | # Build Sponsors Array 76 | # -------------------- 77 | sponsors = {} 78 | for parent in getelements(name, "{urn:hl7-org:v3}author", 'no'): 79 | for child in parent.xpath(".//*[local-name() = 'representedOrganization']"): 80 | for grandChild in child.xpath("./*[local-name() = 'name']"): 81 | sponsors['author'] = grandChild.text.strip() 82 | sponsors['author_type'] = 'labeler' 83 | grandChild.clear() 84 | 85 | for parent in getelements(name, "{urn:hl7-org:v3}legalAuthenticator", 'no'): 86 | for child in parent.xpath(".//*[local-name() = 'representedOrganization']"): 87 | for grandChild in child.xpath("./*[local-name() = 'representedOrganization']"): 88 | sponsors['author'] = grandChild.text.strip() 89 | sponsors['author_type'] = 'legal' 90 | grandChild.clear() 91 | 92 | # ----------------------------------------- 93 | # Build ProdMedicine and Ingredients arrays 94 | # ----------------------------------------- 95 | prodMedicines = [] 96 | ingredients = {} 97 | formCodes = [] 98 | names = [] 99 | partnames = [] 100 | # info object, which will later be appended to prodMedicines array 101 | info = {} 102 | info['SPLCOLOR'] = [] 103 | info['SPLIMPRINT'] = [] 104 | info['SPLSHAPE'] = [] 105 | info['SPLSIZE'] = [] 106 | 
info['SPLSCORE'] = [] 107 | info['SPLCOATING'] = [] 108 | info['SPLSYMBOL'] = [] 109 | info['SPLFLAVOR'] = [] 110 | info['SPLIMAGE'] = [] 111 | info['IMAGE_SOURCE'] = [] 112 | info['SPL_INGREDIENTS'] = [] 113 | info['SPL_INACTIVE_ING'] = [] 114 | info['SPL_STRENGTH'] = [] 115 | info['SPLCONTAINS'] = [] 116 | info['approval_code'] = [] 117 | info['MARKETING_ACT_CODE'] = [] 118 | info['DEA_SCHEDULE_CODE'] = [] 119 | info['DEA_SCHEDULE_NAME'] = [] 120 | info['equal_product_code'] = [] 121 | info['NDC'] = [] 122 | info['SPLUSE'] = [] 123 | 124 | # substanceCodes will be filled with ingredient codes to check for duplicate ingredients 125 | substanceCodes = [] 126 | # doses will be filled with ingredient numerator values to check for duplicate ingredients 127 | doses = [] 128 | # codes stores product_codes, to determine how many unique products to output with len(codes) 129 | codes = [] 130 | productCodes = [] 131 | partNumbers = [] 132 | eqDup = '' 133 | 134 | for parent in getelements(name, "{urn:hl7-org:v3}manufacturedProduct", 'yes'): 135 | def proceed(partCode, partChild, index): 136 | # There are elements that have no content and would result 137 | # in empty objects being appended to ingredients array. So use ingredientTrue to test. 138 | ingredientTrue = 0 139 | 140 | if partCode == 'zero': 141 | getInfo() 142 | for productCode in parent.xpath("./*[local-name() = 'code']"): 143 | uniqueCode = productCode.get('code') + '-0' 144 | # set ingredients array for uniqueCode 145 | ingredients[uniqueCode] = [] 146 | if uniqueCode not in codes: 147 | codes.append(uniqueCode) 148 | productCodes.append(productCode.get('code')) 149 | partNumbers.append('0') 150 | else: 151 | # This applies only to <part> products 152 | for productCode in parent.xpath("./*[local-name() = 'code']"): 153 | uniqueCode = productCode.get('code') + '-'+str(index) 154 | formCodes.append(partCode) 155 | # set ingredients array for uniqueCode 156 | ingredients[uniqueCode] = [] 157 | if uniqueCode not in codes: 158 | codes.append(uniqueCode) 159 | productCodes.append(productCode.get('code')) 160 | partNumbers.append(str(index)) 161 | 162 | # Get packaging (NDC) information from <asContent> 163 | packageProducts = [] 164 | for child in parent.xpath("./*[local-name() = 'asContent']"): 165 | # Check if we're working with <containerPackagedMedicine> or <containerPackagedProduct> 166 | checkMedicine = child.xpath("./*[local-name() = 'containerPackagedMedicine']") 167 | checkProduct = child.xpath("./*[local-name() = 'containerPackagedProduct']") 168 | if checkProduct: 169 | productType = 'containerPackagedProduct' 170 | else: 171 | productType = 'containerPackagedMedicine' 172 | # Send the product <code> to packageProducts 173 | for grandChild in child.xpath("./*[local-name() = '"+productType+"']"): 174 | value = grandChild.xpath("./*[local-name() = 'code']") 175 | form = grandChild.xpath("./*[local-name() = 'formCode']") 176 | # For when there is another <asContent> nested under another <asContent> 177 | if value[0].get('code') == None: 178 | subElement = grandChild.xpath(".//*[local-name() = 'asContent']") 179 | # subValues is an array of all <code> tags under the second instance of <asContent> 180 | if subElement: 181 | subValues = subElement[0].xpath(".//*[local-name() = 'code']") 182 | tempCodes = [] 183 | # Loop through returned values, which come from multiple levels of <asContent> 184 | for v in subValues: 185 | if v.get('code') != None: 186 | packageProducts.append(v.get('code')) 187 | # Else just append the value from the first level 188 | else: 189 | packageProducts.append(value[0].get('code')) 190 | 191 | # The getelements() function captures <manufacturedProduct> and <manufacturedMedicine>, which is what 192 | # we're seeing when packageProducts
has length zero 193 | if packageProducts: 194 | info['NDC'].append(packageProducts) 195 | 196 | # Arrays for ingredients 197 | active = [] 198 | inactive = [] 199 | splStrength = [] 200 | # If partCode is zero, we can find the ingredients directly below the parent 201 | # else we need to iterate through the <partProduct> of the <part>, from proceed() function 202 | 203 | if partCode == 'zero': 204 | level = parent 205 | for child in parent.xpath("./*[local-name() = 'name']"): 206 | names.append(child.text.strip()) 207 | partnames.append('') 208 | else: 209 | for child in parent.iterchildren('{urn:hl7-org:v3}name'): 210 | names.append(child.text.strip()) 211 | 212 | partProduct = partChild.xpath("./*[local-name() = 'partProduct']") 213 | if not partProduct: 214 | partProduct = partChild.xpath("./*[local-name() = 'partMedicine']") 215 | 216 | level = partProduct[0] 217 | for child in partProduct[0].xpath("./*[local-name() = 'name']"): 218 | partnames.append(child.text.strip()) 219 | for child in level.xpath("./*[local-name() = 'ingredient']"): 220 | # Create temporary object for each ingredient 221 | ingredientTemp = {} 222 | ingredientTemp['ingredient_type'] = {} 223 | ingredientTemp['substance_code'] = {} 224 | 225 | # If statement to find active ingredients 226 | if child.get('classCode') == 'ACTIB' or child.get('classCode') == 'ACTIM' or child.get('classCode') == 'ACTIR': 227 | ingredientTrue = 1 228 | ingredientTemp['active_moiety_names'] = [] 229 | 230 | for grandChild in child.xpath("./*[local-name() = 'ingredientSubstance']"): 231 | for c in grandChild.iterchildren(): 232 | ingredientTemp['ingredient_type'] = 'active' 233 | if c.tag == '{urn:hl7-org:v3}name' or c.tag == 'name': 234 | active.append(c.text.strip()) 235 | splStrengthItem = c.text.strip() 236 | ingredientTemp['substance_name'] = c.text.strip() 237 | if c.tag == '{urn:hl7-org:v3}code' or c.tag == 'code': 238 | ingredientTemp['substance_code'] = c.get('code') 239 | if c.tag =='{urn:hl7-org:v3}activeMoiety' or c.tag == 'activeMoiety': 240 | name = c.xpath(".//*[local-name() = 'name']") 241 | 242 | # Send active moiety to ingredientTemp 243 | try: 244 | ingredientTemp['active_moiety_names'].append(name[0].text.strip()) 245 | except: 246 | ingredientTemp['active_moiety_names'].append('') 247 | 248 | for grandChild in child.xpath("./*[local-name() = 'quantity']"): 249 | numerator = grandChild.xpath("./*[local-name() = 'numerator']") 250 | denominator = grandChild.xpath("./*[local-name() = 'denominator']") 251 | 252 | ingredientTemp['numerator_unit'] = numerator[0].get('unit') 253 | ingredientTemp['numerator_value'] = numerator[0].get('value') 254 | ingredientTemp['dominator_unit'] = denominator[0].get('unit') 255 | ingredientTemp['dominator_value'] = denominator[0].get('value') 256 | splStrengthValue = float(ingredientTemp['numerator_value']) / float(ingredientTemp['dominator_value']) 257 | if str(splStrengthValue)[-1] == '0': 258 | splStrengthValue = int(splStrengthValue) 259 | splStrengthItem = "%s %s %s" % (splStrengthItem, splStrengthValue, ingredientTemp['numerator_unit']) 260 | splStrength.append(splStrengthItem) 261 | 262 | # If statement to find inactive ingredients 263 | if child.get('classCode') == 'IACT': 264 | ingredientTrue = 1 265 | # Create object for each inactive ingredient 266 | for grandChild in child.xpath("./*[local-name() = 'ingredientSubstance']"): 267 | for c in grandChild.iterchildren(): 268 | ingredientTemp['ingredient_type'] = 'inactive' 269 | if c.tag == '{urn:hl7-org:v3}name' or c.tag =='name': 270 |
inactive.append(c.text.strip()) 271 | ingredientTemp['substance_name'] = c.text.strip() 272 | if c.tag == '{urn:hl7-org:v3}code' or c.tag == 'code': 273 | ingredientTemp['substance_code'] = c.get('code') 274 | try: 275 | ingredients[uniqueCode].append(ingredientTemp) 276 | except: 277 | # this is passed because of no uniqeCode assigned when not OSDF 278 | continue 279 | 280 | # For XML files that have different ingredient element syntax 281 | # check for inactive ingredients 282 | for child in level.xpath("./*[local-name() = 'inactiveIngredient']"): 283 | # Create temporary object for each ingredient 284 | ingredientTemp = {} 285 | ingredientTemp['ingredient_type'] = {} 286 | ingredientTemp['substance_code'] = {} 287 | ingredientTrue = 1 288 | 289 | for grandChild in child.xpath("./*[local-name() = 'inactiveIngredientSubstance']"): 290 | for c in grandChild.iterchildren(): 291 | ingredientTemp['ingredient_type'] = 'inactive' 292 | if c.tag == '{urn:hl7-org:v3}name' or c.tag =='name': 293 | inactive.append(c.text.strip()) 294 | ingredientTemp['substance_name'] = c.text.strip() 295 | if c.tag == '{urn:hl7-org:v3}code' or c.tag == 'code': 296 | ingredientTemp['substance_code'] = c.get('code') 297 | try: 298 | ingredients[uniqueCode].append(ingredientTemp) 299 | except: 300 | # this is passed because of no uniqeCode assigned when not OSDF 301 | continue 302 | 303 | # For XML files that have different ingredient element syntax 304 | # check for active ingredients 305 | for child in level.xpath("./*[local-name() = 'activeIngredient']"): 306 | ingredientTemp = {} 307 | ingredientTemp['ingredient_type'] = {} 308 | ingredientTemp['substance_code'] = {} 309 | ingredientTemp['active_moiety_names'] = [] 310 | ingredientTrue = 1 311 | 312 | for grandChild in child.xpath("./*[local-name() = 'activeIngredientSubstance']"): 313 | for c in grandChild.iterchildren(): 314 | ingredientTemp['ingredient_type'] = 'active' 315 | if c.tag == '{urn:hl7-org:v3}name' or c.tag == 'name': 316 | active.append(c.text.strip()) 317 | splStrengthItem = c.text.strip() 318 | ingredientTemp['substance_name'] = c.text.strip() 319 | if c.tag == '{urn:hl7-org:v3}code' or c.tag == 'code': 320 | ingredientTemp['substance_code'] = c.get('code') 321 | if c.tag =='{urn:hl7-org:v3}activeMoiety' or c.tag == 'activeMoiety': 322 | name = c.xpath(".//*[local-name() = 'name']") 323 | 324 | # Send active moiety to ingredientTemp 325 | try: 326 | ingredientTemp['active_moiety_names'].append(name[0].text.strip()) 327 | except: 328 | ingredientTemp['active_moiety_names'].append('') 329 | 330 | for grandChild in child.xpath("./*[local-name() = 'quantity']"): 331 | numerator = grandChild.xpath("./*[local-name() = 'numerator']") 332 | denominator = grandChild.xpath("./*[local-name() = 'denominator']") 333 | 334 | ingredientTemp['numerator_unit'] = numerator[0].get('unit') 335 | ingredientTemp['numerator_value'] = numerator[0].get('value') 336 | ingredientTemp['dominator_unit'] = denominator[0].get('unit') 337 | ingredientTemp['dominator_value'] = denominator[0].get('value') 338 | splStrengthValue = float(ingredientTemp['numerator_value']) / float(ingredientTemp['dominator_value']) 339 | if str(splStrengthValue)[-1] == '0': 340 | splStrengthValue = int(splStrengthValue) 341 | splStrengthItem = "%s %s %s" % (splStrengthItem, splStrengthValue, ingredientTemp['numerator_unit']) 342 | splStrength.append(splStrengthItem) 343 | 344 | # Send code, name and formCode to info = {} 345 | info['product_code'] = productCodes 346 | info['part_num'] = partNumbers 
347 | info['medicine_name'] = names 348 | info['part_medicine_name'] = partnames 349 | info['dosage_form'] = formCodes 350 | 351 | # If ingredientTrue was set to 1 above, we know we have ingredient information to append 352 | if ingredientTrue != 0: 353 | info['SPL_INGREDIENTS'].append(active) 354 | info['SPL_INACTIVE_ING'].append(inactive) 355 | info['SPL_STRENGTH'].append(splStrength) 356 | 357 | # Second set of child elements in <subjectOf> used for ProdMedicines array 358 | def checkForValues(ctype, grandChild, dup, idx): 359 | value = grandChild.xpath("./*[local-name() = 'value']") 360 | reference = grandChild.xpath(".//*[local-name() = 'reference']") 361 | if ctype == 'SPLIMPRINT': 362 | value = value[0].text.strip() 363 | else: 364 | value = value[0].attrib 365 | kind = grandChild.find("./{urn:hl7-org:v3}code[@code='"+ctype+"']") 366 | if kind == None: 367 | kind = grandChild.find("./code[@code='"+ctype+"']") 368 | if kind != None: 369 | if ctype == 'SPLCOLOR': 370 | if dup == '1': 371 | color1 = info[ctype][idx] 372 | if value.get('code') == None: 373 | color2 = '' 374 | else: 375 | color2 = value.get('code') 376 | info[ctype][idx] = "%s;%s" % (color1,color2) 377 | else: 378 | info[ctype].append(value.get('code')) 379 | elif ctype == 'SPLIMPRINT': 380 | info[ctype].append(value) 381 | elif ctype == 'SPLSCORE': 382 | if value.get('value') == None: 383 | info[ctype].append('') 384 | else: 385 | info[ctype].append(value.get('code') or value.get('value')) 386 | elif ctype == 'SPLIMAGE': 387 | if reference[0].get('value'): 388 | splfile = reference[0].get('value').split() 389 | info[ctype].append(splfile) 390 | else: 391 | info[ctype].append('') 392 | else: 393 | info[ctype].append(value.get('code') or value.get('value')) 394 | 395 | # If partCode is zero, we can find the <subjectOf> elements directly below the parent 396 | # else we need to iterate through the <subjectOf> of the <part>, from proceed() function 397 | 398 | 399 | 400 | if partCode == 'zero': 401 | level = parent 402 | else: 403 | level = partChild 404 | if level.xpath("./*[local-name() = 'subjectOf']"): 405 | previous = [] 406 | subjectOfcheck = [] 407 | for child in level.xpath("./*[local-name() = 'subjectOf']"): 408 | for c in child: 409 | subjectOfcheck.append(c.tag) 410 | # Get approval code 411 | for grandChild in child.xpath(".//*[local-name() = 'approval']"): 412 | statusCode = grandChild.xpath("./*[local-name() = 'code']") 413 | info['approval_code'].append(statusCode[0].get('code')) 414 | # Get marketing act code 415 | for grandChild in child.xpath("./*[local-name() = 'marketingAct']"): 416 | statusCode = grandChild.xpath("./*[local-name() = 'statusCode']") 417 | info['MARKETING_ACT_CODE'].append(statusCode[0].get('code')) 418 | 419 | # Get policy code 420 | for grandChild in child.xpath("./*[local-name() = 'policy']"): 421 | for each in grandChild.xpath("./*[local-name() = 'code']"): 422 | info['DEA_SCHEDULE_CODE'].append(each.get('code')) 423 | info['DEA_SCHEDULE_NAME'].append(each.get('displayName')) 424 | 425 | for grandChild in child.xpath("./*[local-name() = 'characteristic']"): 426 | for each in grandChild.xpath("./*[local-name() = 'code']"): 427 | # Run each type through the checkForValues() function above 428 | ctype = each.get('code') 429 | # checks for duplicate SPL types, SPLCOLOR can happen twice 430 | if ctype in previous: 431 | idx = len(info[ctype]) - 1 432 | checkForValues(ctype, grandChild, '1', idx) 433 | else: 434 | idx = len(info[ctype]) 435 | diff = len(codes) - 1 436 | if idx < diff: 437 | for i in range(diff - idx): 438 |
info[ctype].append('') 439 | checkForValues(ctype, grandChild, '0', 0) 440 | previous.append(ctype) 441 | each.clear() # clear memory 442 | grandChild.clear() # clear memory 443 | # check if DEA policy code is in product or not 444 | policy = '{urn:hl7-org:v3}policy' 445 | if policy not in subjectOfcheck: 446 | info['DEA_SCHEDULE_CODE'].append('') 447 | info['DEA_SCHEDULE_NAME'].append('') 448 | approval = '{urn:hl7-org:v3}approval' 449 | if approval not in subjectOfcheck: 450 | info['approval_code'].append('') 451 | 452 | # Check if there are <part> elements in the manufactured product, if not, partCode = 'zero' 453 | parts = parent.xpath("./*[local-name() = 'part']") 454 | if not parts: 455 | # Check for a child <manufacturedProduct>, if it exists, check its formCode 456 | childProduct = parent.xpath(".//*[local-name() = 'manufacturedProduct']") 457 | if childProduct: 458 | childFormCode = childProduct[0].xpath("./*[local-name() = 'formCode']") 459 | if childFormCode: 460 | if childFormCode[0].get('code') not in codeChecks: 461 | continue # skip to next manufacturedProduct 462 | # test current level formCode against codeChecks 463 | formCode = parent.xpath("./*[local-name() = 'formCode']") 464 | if formCode: 465 | if formCode[0].get('code') not in codeChecks: 466 | continue 467 | else: 468 | formCodes.append(formCode[0].get('code')) 469 | 470 | # Get equal product code from <asEquivalentEntity> 471 | equiv = parent.xpath(".//*[local-name() = 'asEquivalentEntity']") 472 | if equiv: 473 | for child in parent.xpath(".//*[local-name() = 'asEquivalentEntity']"): 474 | # duplicate check 475 | if child != eqDup: 476 | equalProdParent = parent.xpath(".//*[local-name() = 'definingMaterialKind']") 477 | code = equalProdParent[0].xpath("./*[local-name() = 'code']") 478 | equalProdCodes = code[0].get('code') 479 | info['equal_product_code'].append(equalProdCodes) 480 | eqDup = child 481 | else: 482 | equalProdCodes = '' 483 | info['equal_product_code'].append(equalProdCodes) 484 | # No parts found, so part number is zero, send to proceed() function 485 | proceed('zero','','') 486 | else: 487 | # Set up an index to pass to proceed() function to determine part number 488 | index = 1 489 | for child in parts: 490 | # Get equal product code from <asEquivalentEntity> 491 | equiv = child.xpath(".//*[local-name() = 'asEquivalentEntity']") 492 | if equiv: 493 | for c in child.xpath(".//*[local-name() = 'asEquivalentEntity']"): 494 | equalProdParent = c.xpath(".//*[local-name() = 'definingMaterialKind']") 495 | code = equalProdParent[0].xpath("./*[local-name() = 'code']") 496 | equalProdCodes = code[0].get('code') 497 | info['equal_product_code'].append(equalProdCodes) 498 | else: 499 | equalProdCodes = '' 500 | info['equal_product_code'].append(equalProdCodes) 501 | 502 | formCode = child.xpath(".//*[local-name() = 'formCode']") 503 | 504 | # Check if formCode is in codeChecks 505 | if formCode[0].get('code') not in codeChecks: 506 | # If formCode is not in codeChecks, move on to the next <part> 507 | continue 508 | # else send to proceed() function with index 509 | else: 510 | getInfo() 511 | proceed(formCode[0].get('code'), child, index) 512 | index = index + 1 513 | prodMedicines.append(info) 514 | 515 | prodMedNames = [ 516 | 'SPLCOLOR','SPLIMAGE','SPLIMPRINT','medicine_name','SPLSHAPE', 517 | 'SPL_INGREDIENTS','SPL_INACTIVE_ING','SPLSCORE','SPLSIZE', 518 | 'product_code','part_num','dosage_form','MARKETING_ACT_CODE', 519 | 'DEA_SCHEDULE_CODE','DEA_SCHEDULE_NAME','NDC','equal_product_code', 520 | 'SPL_STRENGTH','part_medicine_name','approval_code' 521 | ] 522 | setInfoNames =
['file_name','effective_time','id_root','setid','document_type','source'] 523 | sponsorNames = ['author','author_type'] 524 | 525 | # Loop through prodMedicines as many times as there are unique product codes + part codes combinations, which is len(codes) 526 | products = [] 527 | 528 | if prodMedicines[0]['NDC']: 529 | for i in range(0, len(codes)): 530 | uniqueID = setInfo['setid'] + '-' + codes[i] 531 | product = {} 532 | product['setid_product'] = uniqueID 533 | product['ndc_codes'] = prodMedicines[0]['NDC'][i] 534 | tempProduct = {} 535 | for name in prodMedNames: 536 | # Get information at the correct index 537 | try: 538 | if name == 'SPLIMAGE': 539 | if prodMedicines[0][name][i] != '': 540 | image_file = setInfo['setid'] + '_' + prodMedicines[0]['product_code'][i] + '_' + prodMedicines[0]['part_num'][i] + '_' + "_".join(prodMedicines[0][name][i]) 541 | tempProduct[name] = image_file 542 | else: 543 | tempProduct[name] = '' 544 | else: 545 | tempProduct[name] = prodMedicines[0][name][i] 546 | except: 547 | tempProduct[name] = '' 548 | for name in setInfoNames: 549 | tempProduct[name] = setInfo[name] 550 | for name in sponsorNames: 551 | try: 552 | tempProduct[name] = sponsors[name] 553 | except: 554 | tempProduct[name] = '' 555 | product['data'] = tempProduct 556 | # Ingredients are showing duplicates again leaving out while fixing. 557 | product['ingredients'] = ingredients[codes[i]] 558 | products.append(product) 559 | return products 560 | else: 561 | sys.exit("Not OSDF") 562 | 563 | #Use this code to run xpath on the tmp-unzipped files without other scripts 564 | if __name__ == "__main__": 565 | test = parseData("../tmp/tmp-unzipped/HRX/6dc74857-7d8d-4102-a56a-a014934b91b2.xml") 566 | print test 567 | --------------------------------------------------------------------------------