├── .gitignore ├── LICENSE.md ├── README.md ├── docs ├── phase_1 │ ├── background.md │ └── research.md └── phase_2 │ ├── FISMAticDecember2019.md │ ├── FISMAticMay2020.md │ └── media │ ├── image1.png │ ├── image10.png │ ├── image11.png │ ├── image12.jpeg │ ├── image13.jpeg │ ├── image2.png │ ├── image3.png │ ├── image4.png │ ├── image5.png │ ├── image6.png │ ├── image7.png │ ├── image8.png │ └── image9.png └── src └── phase_1 ├── analysis.ipynb ├── demo.ipynb ├── environment.yml ├── exploration.ipynb ├── fismatic ├── __init__.py ├── control.py ├── control_set.py ├── core.py ├── demo.py ├── docx_parser.py ├── helpers.py ├── parser.py ├── similarity.py └── test │ ├── __init__.py │ ├── common.py │ ├── test_control.py │ ├── test_control_set.py │ ├── test_core.py │ ├── test_demo.py │ ├── test_docx_parser.py │ ├── test_notebook.py │ ├── test_parser.py │ └── test_similarity.py ├── jupyter_notebook_config.py ├── pytest.ini └── usage.md /.gitignore: -------------------------------------------------------------------------------- 1 | *.docx 2 | *.doc 3 | *.xlsx 4 | *.xls 5 | 6 | *.py[cod] 7 | .vscode 8 | .0 9 | output.json 10 | *.index 11 | .DS_Store 12 | matrix.csv 13 | out/ 14 | .ipynb_checkpoints/ 15 | *.sqlite 16 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | As a work of the United States government, this project is in the 2 | public domain within the United States. 3 | 4 | Additionally, we waive copyright and related rights in the work 5 | worldwide through the CC0 1.0 Universal public domain dedication. 6 | 7 | ## CC0 1.0 Universal Summary 8 | 9 | This is a human-readable summary of the 10 | [Legal Code (read the full text)](https://creativecommons.org/publicdomain/zero/1.0/legalcode). 11 | 12 | ### No Copyright 13 | 14 | The person who associated a work with this deed has dedicated the work to 15 | the public domain by waiving all of his or her rights to the work worldwide 16 | under copyright law, including all related and neighboring rights, to the 17 | extent allowed by law. 18 | 19 | You can copy, modify, distribute and perform the work, even for commercial 20 | purposes, all without asking permission. 21 | 22 | ### Other Information 23 | 24 | In no way are the patent or trademark rights of any person affected by CC0, 25 | nor are the rights that other persons may have in the work or in how the 26 | work is used, such as publicity or privacy rights. 27 | 28 | Unless expressly stated otherwise, the person who associated a work with 29 | this deed makes no warranties about the work, and disclaims liability for 30 | all uses of the work, to the fullest extent permitted by applicable law. 31 | When using or citing the work, you should not imply endorsement by the 32 | author or the affirmer. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # FISMAtic :zap: 2 | 3 | _Pronounced like "automatic", but starting with "fizz"._ 4 | 5 | The goal of FISMAtic is to reduce the amount of time spent authoring, reviewing, and editing the security compliance documentation leading up to an Authority to Operate (ATO). 
We plan to build prototype(s) that: 6 | 7 | - Give feedback on security compliance documentation 8 | 9 | - Help compliance teams select security controls that are appropriate to a system (tailored baselines) 10 | - This can cut out time spent around irrelevant controls in all other steps of the compliance lifecycle 11 | 12 | ## Table of contents 13 | 14 | ### Phase 1 15 | 16 | - [Video demo - 6/21/19](https://census.webex.com/census/ldr.php?RCID=9486aa77f0aeb069cd681c9ad6a5f1ee) 17 | - [Interview with Nextgov](https://www.nextgov.com/emerging-tech/2019/05/census-thinks-clippy-style-ai-assistant-could-speed-security-authorizations/157339/) 18 | - [Background information](docs/phase_1/background.md) 19 | - [Research summary](docs/phase_1/research.md) 20 | - [Code information](src/phase_1/usage.md) 21 | 22 | ### Phase 2 23 | - [Updates December 2019](docs/phase_2/FISMAticDecember2019.md) 24 | - [Updates May 2020](docs/phase_2/FISMAticMay2020.md) 25 | 26 | ## Call for collaborators 27 | 28 | If you’ve worked in this space or are interested in collaborating, please reach out in an issue or by email at cat@census.gov. 29 | 30 | Thanks! 31 | -------------------------------------------------------------------------------- /docs/phase_1/background.md: -------------------------------------------------------------------------------- 1 | # Background 2 | 3 | "The ATO process", as it's commonly called, is formally defined in the National Institute of Standards & Technology (NIST)'s [Risk Management Framework (RMF)](): 4 | 5 | NIST Risk Management Framework diagram 6 | 7 | This process was developed as a result of the [Federal Information Security Management Act (FISMA)](https://www.nist.gov/programs-projects/federal-information-security-management-act-fisma-implementation-project). See [Introduction to ATOs](https://atos.open-control.org/) for more information. 8 | 9 | Security compliance is time-consuming (and therefore expensive) for most organizations in and around the federal government. Two particular pain points were identified: 10 | 11 | - Select[ing] Controls that are appropriate for a given system 12 | - The back-and-forth between delivery teams Implement[ing] and reviewers Assess[ing] Controls 13 | 14 | Delivery teams, who may or may not have experience writing System Security Plans (SSPs), spend a lot of time working on the language for security controls. This is then sent to the assessor, who may point out common mistakes. Each of these back-and-forths can take days or weeks, costing staff hours on both sides and stretching out the time before the project can actually deliver value to users. 15 | 16 | **Our hypothesis is that we can reduce the time spent on the Select, Implement, and Assess Controls steps of the RMF through tooling.** 17 | -------------------------------------------------------------------------------- /docs/phase_1/research.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # FISMAtic Research Round One Summary 4 | 5 | May 13, 2019 6 | 7 | # Background 8 | 9 | The Census Innovation & Operational Efficiency Program (IOE) awarded funding to the FISMAtic project to: 10 | 11 | > ...improve the [Authority to Operate (ATO)] process by (1) reducing costs, (2) speeding time to completion, and (3) improving consistency with machine learning, natural language processing and computer vision.
12 | 13 | _FISMAtic Business Case_ 14 | 15 | See also: [overview of ATOs](https://atos.open-control.org), particularly the [terminology](https://atos.open-control.org/overview/#definitions). (Acronyms used in this doc are defined there.) See the [Appendix](#appendix) for details about the interview process. 16 | 17 | # Major findings 18 | 19 | ## A) Program teams are not well-equipped to complete ATOs. 20 | 21 | > The templates have 'NIST language' and you have no idea what that really means. 22 | 23 | Program teams often lack compliance experience. The templates are long and hard to understand, if they are provided at all. 24 | 25 | > There is no other domain in all my work in government where plentiful examples are not available … makes me furious … can't emphasize this enough. 26 | 27 | ATO packages are considered sensitive, so they are not easily accessible to program teams going through ATOs. The lack of access to examples came up in the interviews more than any other topic. 28 | 29 | ## B) Prioritization of security/compliance work is critical. 30 | 31 | > Compliance is a byproduct of good security. 32 | 33 | For the program teams, systems need to be designed with security in mind. This means having security experience on the team, if not involving the assessment teams themselves early on (rather than only at the end). Unfortunately, compliance teams are generally overloaded as is, so there is a tension here. 34 | 35 | > Distill [compliance work] down to the points and facts that matter. 36 | > 37 | > Help [program teams] climb ladders, rather than pole vault. 38 | 39 | Instead of treating all security controls as equal, a number of interviewees on both sides got value out of prioritizing controls or parts of the systems that had the largest security implications. This can happen at the time of control selection, and/or during assessment. 40 | 41 | ## C) Low (perceived) value of compliance work by program teams. 42 | 43 | > [Compliance] feels like a bureaucratic checkbox...I want to build secure systems. 44 | > 45 | > How much of all of this is putting lipstick on a pig? 46 | 47 | Program teams interviewed cited caring about security and privacy, but there was a perceived lack of value in compliance work, particularly the documentation and the assessors themselves. They saw compliance work as being distinct from the security of the systems they were building. 48 | 49 | > I've heard of good assessors...but my experience [in three agencies] has been really poor. 50 | 51 | The security experience of assessors can vary greatly. Program teams had positive experiences when the assessors had a technical background and could therefore talk through potential solutions with the program teams. 52 | 53 | > [Assessments are often] veering into art of security rather than the science of it. 54 | > 55 | > What's the interpretation of the control of the system you're building in the organization you're building it? That's the root of all evil with SSPs. 56 | 57 | Despite the thousands of pages of NIST documentation that surround ATO processes, there is a high degree of subjectivity in SSPs and assessments in terms of what is considered "good". 58 | 59 | > I don't think [assessors] look at 95% of what I give them...I find mistakes later that they should have caught. 60 | 61 | Assessment teams often hand program teams the compliance documentation requirements/templates, and only re-engage once the first draft is complete.
There can be many months of turnaround time before and after this point, as the effort to complete (especially without experience – see [A](#a-program-teams-are-not-well-equipped-to-complete-atos)) and review the documentation is high. 62 | 63 | ## D) Compliance processes work best when there is close collaboration. 64 | 65 | About half of the interviews mentioned the value of close collaboration and trust between the program teams and the assessors. The following terms were used about assessors in these cases: 66 | 67 | > - Mutual understanding 68 | > - Ground-truthing 69 | > - Your SSP Sherpa 70 | > - Here on a fact-finding mission, not to convict 71 | > - Emotional support 72 | > - Hand-holding 73 | 74 | Unfortunately, a program team getting their assessor's time can be challenging, as noted in [B](#b-prioritization-of-securitycompliance-work-is-critical). Multiple interviews also cited the importance of both sides being up front about their experience and shortcomings, rather than seeing the other side as an opponent. 75 | 76 | ## E) There is power and confusion in inheritance. 77 | 78 | > Using OpenControl took authoring time from nine months for the first system, to four months for the second, then down to two months for the third. I have a goal of two weeks. 79 | 80 | _Paraphrased._ 81 | 82 | SSPs are generally Word documents that can be hundreds of pages in length. Interviewees estimated that only 10-40% of the content in those documents is unique to the system it corresponds to; the rest is either boilerplate in the template, or is copied-and-pasted. There were two reasons cited for the latter: 83 | 84 | - Many controls are fulfilled by the agency and underlying platforms/services (enterprise hosting, single sign-on, etc.) 85 | - It's easier to find a similar system that's been approved and copy from their SSP than to write from scratch (see [A](#a-program-teams-are-not-well-equipped-to-complete-atos)) 86 | 87 | As noted in the quote above, leveraging this inheritance in authoring an SSP (through [OpenControl](https://github.com/opencontrol/schemas#why-opencontrol) or another tool) can result in a massive reduction in time to ATO. The downsides are: 88 | 89 | - SSP authors can copy-and-paste too much (to save themselves time/headache and try to get approved), misrepresenting their systems and causing confusion and delays. 90 | - It's hard for reviewers to tease out the information specific to that system from the inherited systems that are out of scope. 91 | 92 | ## F) There are compliance tools, but they are under-leveraged. 93 | 94 | To complete ATOs, program team members and assessors in many agencies write in Word documents, then versions of the document ricochet back and forth with comments and edits. Interviewees cited difficulty keeping track of who was editing and merging those edits together. 95 | 96 | > Everybody's [compliance] tool had 20% of reality. 97 | 98 | There are various commercial and open source tools available to help with compliance documentation workflows, including inheritance (see [E](#e-there-is-power-and-confusion-in-inheritance)). Even so, the majority of agencies aren't using them to their full extent, if at all. It's unclear how much of this is due to each of the following: 99 | 100 | - Awareness 101 | - Cost 102 | - Missing features (as the quote above refers to) 103 | - Compliance team management not thinking in a user-centered way 104 | 105 | # Potential paths forward 106 | 107 | These are not mutually exclusive.
108 | 109 | ## Automated feedback 110 | 111 | This has been referred to as "Clippy for ATOs", and was the proposal in the original Business Case. This could be anything from a [linter](https://stackoverflow.com/questions/8503559/what-is-linting) with simple business rules up through natural language processing to give feedback on the quality of the content. 112 | 113 | Pros: 114 | 115 | - This was the Solution funded by IOE 116 | - Reduces the feedback cycles 117 | 118 | Cons (we specifically asked for feedback on this idea, hence the longest list for this one): 119 | 120 | - Requires access to a large set of SSPs in a machine-readable form 121 | - _"To succeed, you need a corpus of SSPs, [you need to be able to extract the content], and you need to be able to bring in powerful analytical tools."_ 122 | - GSA (at least twice) and DHS worked on text extraction from SSPs, which was laborious. 123 | - Likely challenging to give substantive feedback 124 | - _"Seems like it will be a lot of work for little reward."_ 125 | - See subjectivity in [C](#c-low-perceived-value-of-compliance-work-by-program-teams). 126 | - Even from human reviewers, there is often more noise than signal 127 | - Only possible for a machine learning-driven system to give feedback that's "you're like the other SSPs out there", or not 128 | - High effort for low value, relative to other options 129 | 130 | ## Surface inheritable systems/controls 131 | 132 | Agencies have platforms and other General Support Systems (GSSs) that can be leveraged by program teams to reduce their technical and compliance burden. These providers, along with their compliance information, can be surfaced and more strongly encouraged. 133 | 134 | Pros: 135 | 136 | - [Makes the right thing the easy thing](https://www.youtube.com/watch?v=xqT8e6_yzLg) 137 | - Reduces surface area (and thus security risk, operational burden, etc.) of downstream systems 138 | - MVP possible without writing any code 139 | 140 | Cons: 141 | 142 | - Only useful if those platforms are in place 143 | - Only desirable to program teams if those platforms will make their lives better 144 | 145 | ## Surface SSP/control examples 146 | 147 | The highest-cited need from interviewees was having approved/good examples to draw from for inspiration/clarification, if not copying from directly. These examples can be hand-picked, purpose-written, or drawn dynamically from a larger set (perhaps from similar systems?). 148 | 149 | Pros: 150 | 151 | - The highest-cited request from interviewees 152 | - A step along the way to [automated feedback](#automated-feedback) and [surfacing inheritable systems](#surface-inheritable-systemscontrols) 153 | - MVP possible without writing any code 154 | 155 | Cons: 156 | 157 | - SSPs being considered sensitive is likely an obstacle to making their controls broadly accessible 158 | - May require manual control selection/review/scrubbing 159 | 160 | ## Get security and compliance experience into program teams 161 | 162 | This can happen through training existing program team members, adding new staff, or embedding existing assessors (though more may be needed).
163 | 164 | Pros: 165 | 166 | - Directly addresses a top finding from the research 167 | - Results in better security 168 | 169 | Cons: 170 | 171 | - Difficult to do this across all projects comprehensively 172 | - It's expensive 173 | 174 | ## Reduce the documentation burden 175 | 176 | This can happen through tailoring/prioritizing controls, either agency-wide, per-system, or over time (example: [GSA LATO](https://gsablogs.gsa.gov/innovation/2014/12/10/it-security-security-in-an-agile-development-cloud-world-by-kurt-garbars/)). 177 | 178 | Pros: 179 | 180 | - Less work for program teams and assessors 181 | - High-priority controls can still be covered 182 | - Census's Office of Information Security (OIS) requested a tool to help with control selection when FISMAtic was first pitched, so meeting a known internal need 183 | - Goes hand in hand with / can be made possible through [greater inheritance](#surface-inheritable-systemscontrols) 184 | 185 | Cons: 186 | 187 | - Harder if common platforms aren't being utilized 188 | - Reduces the number of compliance checks, which could let issues slip through the cracks 189 | 190 | # Appendix 191 | 192 | Over 3-4 weeks, the xD team conducted thirteen interviews with people from the following organizations: 193 | 194 | - [Centers for Medicare and Medicaid (CMS)](https://www.cms.gov/) 195 | - [CivicActions](https://civicactions.com/) 196 | - [Epigen Technology](http://epigentechnology.com/) 197 | - [General Services Administration (GSA)](https://gsa.gov) 198 | - [GovReady](https://govready.com/) 199 | - [National Geospatial-Intelligence Agency (NGA)](https://www.nga.mil/) 200 | - [Onyx Point](https://www.onyxpoint.com/) 201 | - [Telos](https://www.telos.com/) 202 | - [United States Air Force](https://www.airforce.com/) 203 | 204 | Interviewees represented various stakeholder positions in the ATO process, with current and former roles including: 205 | 206 | - Chief Data Officer (CDO) 207 | - Chief Information Security Officer (CISO) 208 | - Information Systems Security Manager (ISSM) 209 | - Lead Architect 210 | - Startup Founder/CEO 211 | - System Owner 212 | 213 | They had each been involved with a number of ATOs, ranging anywhere from three to 200. We had the following base questions: 214 | 215 | 1. Intro 216 | 1. What’s your role? 217 | 1. How many ATOs have you been involved with? 218 | 1. What role(s) have you played (formally or informally) as part of ATOs? 219 | 1. For system owners: What value have you gotten from people on the assessment side? 220 | 1. What are your challenges around the SSP? 221 | 1. What are the most common problems that cause back-and-forth delays? 222 | 1. What could make these interactions easier? 223 | 1. If you had access to a large number of SSPs in a machine-readable format and the time/skills/tools to analyze them, what would be the top five things you’d want to know from them? 224 | 1. Do you know of tools that give feedback? 225 | 1. Who else should I talk to? 226 | 227 | All quotes in this document are taken directly from interviews. 
228 | -------------------------------------------------------------------------------- /docs/phase_2/FISMAticDecember2019.md: -------------------------------------------------------------------------------- 1 | ![](.//media/image1.png) 2 | 3 | # FISMATIC 4 | 5 | Encryption Methods and Implementation 6 | 7 | December 2019 Updates 8 | 9 | ## Understanding ATOs: Glossary 10 | 11 | ![](.//media/image2.png) 12 | 13 | ATOs were created by NIST in accordance with FISMA (2002). 14 | SSPs are the lengthy documents created to satisfy all the 15 | requirements of the ATO; they list out the required security features, 16 | control steps, and standards. 17 | 18 | ![](.//media/image3.png) 19 | 20 | ## Understanding ATOs: Time, Money and Other Issues 21 | 22 | ![](.//media/image4.png) ![](.//media/image5.png) 23 | ![](.//media/image6.png) ![](.//media/image7.png) 24 | ![](.//media/image8.png) 25 | 26 | ## Solution 27 | 28 | ![](.//media/image9.png) 29 | 30 | ## FISMAtic Proof of Concept Objectives 31 | 32 | While the full development of FISMAtic is important to the bureau, based 33 | on an assessment of tools available at the Census Bureau and the 34 | resources provided for the PoC development, there are a few key 35 | objectives that will need to be vetted and implemented first to advance 36 | the development of the tool. 37 | 38 | The two main objectives that the FISMAtic tool will address are: 39 | 40 | - Development of User Responses 41 | 42 | - Control Selection 43 | 44 | ## Creation 45 | 46 | ### Understanding the Dataset 47 | 48 | ![](.//media/image10.png) 49 | 50 | - About 2 million lines of spreadsheet data 51 | 52 | - About 347 Systems 53 | 54 | - About 922 Components 55 | 56 | ### Development Process 57 | 58 | ![](.//media/image11.png) 59 | 60 | ### User Interface 61 | 62 | #### Navigation Page 63 | 64 | ![](.//media/image12.jpeg) 65 | 66 | #### Control Input 67 | 68 | ![](.//media/image13.jpeg) 69 | -------------------------------------------------------------------------------- /docs/phase_2/FISMAticMay2020.md: -------------------------------------------------------------------------------- 1 | # ![](.//media/image1.png) 2 | 3 | # FISMATIC 4 | 5 | Encryption Methods and Implementation 6 | 7 | May 2020 Updates 8 | 9 | ## Encryption Background 10 | 11 | ![Symmetric and Asymmetric Encryption - By Rafael 12 | Almeida](.//media/image2.png) 13 | 14 | ### Symmetric Encryption 15 | 16 | - Definition: algorithms that utilize only one key for both encryption 17 | and decryption. 18 | 19 | - Advanced Encryption Standard (AES) algorithm (NIST approved). 20 | 21 | ### Python Cryptography Library 22 | 23 | Fernet 24 | 25 | - Fernet keys use 128-bit AES with Cipher Block Chaining (CBC). 26 | 27 | ### Current Set-Up 28 | 29 | A key is sent to users so they can use FISMAtic. 30 | 31 | - An email, separate from the FISMAtic files, will be sent to the new 32 | user containing a file with the key as an attachment, using Secret 33 | Agent. 34 | 35 | - The user is prompted to upload the key file when FISMAtic is started; the 36 | program uses the key to decrypt the data and open FISMAtic. 37 | 38 | - This prevents the error of a user downloading the key and FISMAtic files to 39 | different locations, and also allows a user to safely keep the key 40 | separate. 41 | 42 | - The set-up leaves room to further protect the key file down the 43 | road (see the sketch below).
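For reference, here is a minimal sketch of what the key handling described above could look like with the `cryptography` package's Fernet recipe. The file names (`fismatic.key`, `controls.csv.enc`) and function names are illustrative assumptions, not taken from the FISMAtic source.

```python
# Hedged sketch of the setup described above, using the `cryptography`
# package's Fernet recipe (AES-128 in CBC mode plus an HMAC signature).
# File and function names are hypothetical, not from the FISMAtic code.
from cryptography.fernet import Fernet, InvalidToken


def generate_key(key_path="fismatic.key"):
    """One-time step: create the key file that is emailed to the user."""
    key = Fernet.generate_key()  # url-safe base64-encoded key material
    with open(key_path, "wb") as key_file:
        key_file.write(key)
    return key


def decrypt_data(key_path, encrypted_path="controls.csv.enc"):
    """At startup: use the uploaded key file to decrypt a FISMAtic data file."""
    with open(key_path, "rb") as key_file:
        fernet = Fernet(key_file.read())
    with open(encrypted_path, "rb") as data_file:
        try:
            # Plaintext is returned as bytes and kept in memory only.
            return fernet.decrypt(data_file.read())
        except InvalidToken:
            raise SystemExit("Key does not match the encrypted FISMAtic files.")
```

Encrypting a data file before distribution is the mirror image: read the plaintext, call `fernet.encrypt(...)`, and write the resulting token back out.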
44 | 45 | ## Limitations & Security Issues 46 | 47 | **There are current ways to bypass this encryption as it is, but it 48 | should be improved in the future.** 49 | 50 | - **A user could very easily write a program that utilizes the key to 51 | decrypt the data files, since they have access to the key and 52 | files.** 53 | 54 | - **This could be fixed by changing the key utilization to the 55 | back-end and packaging FISMAtic into an executable file.** 56 | 57 | - **A user could potentially view the files while FISMAtic is 58 | running.** 59 | 60 | - **This could be fixed by re-encrypting the files immediately 61 | after reading them into memory, instead of upon close of the 62 | program.** 63 | -------------------------------------------------------------------------------- /docs/phase_2/media/image1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image1.png -------------------------------------------------------------------------------- /docs/phase_2/media/image10.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image10.png -------------------------------------------------------------------------------- /docs/phase_2/media/image11.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image11.png -------------------------------------------------------------------------------- /docs/phase_2/media/image12.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image12.jpeg -------------------------------------------------------------------------------- /docs/phase_2/media/image13.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image13.jpeg -------------------------------------------------------------------------------- /docs/phase_2/media/image2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image2.png -------------------------------------------------------------------------------- /docs/phase_2/media/image3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image3.png -------------------------------------------------------------------------------- /docs/phase_2/media/image4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image4.png -------------------------------------------------------------------------------- /docs/phase_2/media/image5.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image5.png -------------------------------------------------------------------------------- /docs/phase_2/media/image6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image6.png -------------------------------------------------------------------------------- /docs/phase_2/media/image7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image7.png -------------------------------------------------------------------------------- /docs/phase_2/media/image8.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image8.png -------------------------------------------------------------------------------- /docs/phase_2/media/image9.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/docs/phase_2/media/image9.png -------------------------------------------------------------------------------- /src/phase_1/analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Analysis\n", 8 | "\n", 9 | "Do analysis across a number of files." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# ignore whitespace warnings\n", 19 | "%env SPACY_WARNING_IGNORE=W008\n", 20 | "\n", 21 | "import ipywidgets as widgets\n", 22 | "import itertools\n", 23 | "import pandas as pd\n", 24 | "import plotly.offline as py\n", 25 | "import plotly.graph_objs as go\n", 26 | "\n", 27 | "# offline mode\n", 28 | "py.init_notebook_mode(connected=False)" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Re-run this cell when Python code in the repository changes." 
36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "import importlib\n", 45 | "import fismatic.core as fismatic\n", 46 | "import fismatic.helpers as helpers\n", 47 | "importlib.reload(fismatic)\n", 48 | "importlib.reload(helpers);" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## Load files" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": {}, 62 | "outputs": [], 63 | "source": [ 64 | "path_widget = widgets.Text(description=\"Path:\", value=\".\")\n", 65 | "display(path_widget)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "files = fismatic.get_files(path_widget.value)\n", 75 | "control_sets = [fismatic.control_set_for(f) for f in files]" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## Compare files" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "stats = [fismatic.stats_for(cs) for cs in control_sets]\n", 92 | "df = pd.DataFrame(stats)\n", 93 | "df.set_index(\"Filename\", inplace=True)\n", 94 | "df" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": null, 100 | "metadata": {}, 101 | "outputs": [], 102 | "source": [ 103 | "control_token_counts = helpers.flatten([cs.implementation_token_counts() for cs in control_sets])\n", 104 | "\n", 105 | "data = [go.Histogram(x=control_token_counts)]\n", 106 | "layout = go.Layout(\n", 107 | " title=\"Control token counts\",\n", 108 | " xaxis={\n", 109 | " \"title\": \"Number of tokens\"\n", 110 | " },\n", 111 | " yaxis={\n", 112 | " \"title\": \"Number of controls\"\n", 113 | " }\n", 114 | ")\n", 115 | "fig = go.Figure(data=data, layout=layout)\n", 116 | "py.iplot(fig, filename='basic histogram')" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "from collections import Counter\n", 126 | "control_names = helpers.flatten([cs.control_names() for cs in control_sets])\n", 127 | "counter = Counter(control_names)\n", 128 | "top_controls = counter.most_common(20)\n", 129 | "pd.DataFrame(top_controls, columns=[\"Control\", \"# occurrences\"])" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [] 138 | } 139 | ], 140 | "metadata": { 141 | "kernelspec": { 142 | "display_name": "Python 3", 143 | "language": "python", 144 | "name": "python3" 145 | }, 146 | "language_info": { 147 | "codemirror_mode": { 148 | "name": "ipython", 149 | "version": 3 150 | }, 151 | "file_extension": ".py", 152 | "mimetype": "text/x-python", 153 | "name": "python", 154 | "nbconvert_exporter": "python", 155 | "pygments_lexer": "ipython3", 156 | "version": "3.7.3" 157 | } 158 | }, 159 | "nbformat": 4, 160 | "nbformat_minor": 2 161 | } 162 | -------------------------------------------------------------------------------- /src/phase_1/demo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Demo\n", 8 | "\n", 9 | "This notebook shows control implementations from other SSPs that are most similar to the provided text.\n", 10 | "\n", 
11 | "Re-run this cell when Python code in the repository changes." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": null, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# ignore whitespace warnings\n", 21 | "%env SPACY_WARNING_IGNORE=W008\n", 22 | "\n", 23 | "import importlib\n", 24 | "import fismatic.core as fismatic\n", 25 | "import fismatic.demo as demo\n", 26 | "import ipywidgets as widgets\n", 27 | "\n", 28 | "importlib.reload(demo)\n", 29 | "importlib.reload(fismatic);" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "## Load files" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "path_widget = widgets.Text(description=\"Path:\", value=\".\")\n", 46 | "display(path_widget)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "files = fismatic.get_files(path_widget.value)\n", 56 | "control_sets = [fismatic.control_set_for(f) for f in files]" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "Set up interactive widgets." 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "control_name = widgets.Text(description=\"Control:\", value=\"AC-2\")\n", 73 | "part = widgets.Text(description=\"Part:\", value=\"Part a\")\n", 74 | "implementation = widgets.Textarea(description=\"Implementation:\", value=\"This is a system using AWS EC2.\")\n", 75 | "inputs = [control_name, part, implementation]\n", 76 | "out = widgets.Output()\n", 77 | "\n", 78 | "\n", 79 | "def on_input_change(change):\n", 80 | " similar_implementations = demo.similar_implementations(control_sets, control_name.value, part.value, implementation.value)\n", 81 | " similar_imp_txt = [imp.text for imp in similar_implementations]\n", 82 | " \n", 83 | " out.clear_output()\n", 84 | " out.append_stdout(\"\\n\\n---------------\\n\\n\".join(similar_imp_txt[0:2]))\n", 85 | "\n", 86 | "\n", 87 | "for widget in inputs:\n", 88 | " widget.observe(on_input_change, names='value')" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "## Interactive area\n", 96 | "\n", 97 | "You can modify the text in the fields below. You will then see a couple similar implementations for the same control part." 
98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "scrolled": false 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "for widget in inputs:\n", 109 | " display(widget)\n", 110 | "\n", 111 | "display(out)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [] 120 | } 121 | ], 122 | "metadata": { 123 | "kernelspec": { 124 | "display_name": "Python 3", 125 | "language": "python", 126 | "name": "python3" 127 | }, 128 | "language_info": { 129 | "codemirror_mode": { 130 | "name": "ipython", 131 | "version": 3 132 | }, 133 | "file_extension": ".py", 134 | "mimetype": "text/x-python", 135 | "name": "python", 136 | "nbconvert_exporter": "python", 137 | "pygments_lexer": "ipython3", 138 | "version": "3.7.3" 139 | } 140 | }, 141 | "nbformat": 4, 142 | "nbformat_minor": 2 143 | } 144 | -------------------------------------------------------------------------------- /src/phase_1/environment.yml: -------------------------------------------------------------------------------- 1 | name: fismatic 2 | dependencies: 3 | - python>=3.5,<4 4 | - ipywidgets 5 | - jupyter 6 | - pandas 7 | - pip 8 | - pip: 9 | - autoflake 10 | - black 11 | - python-docx>0.8,<1 12 | - plotly=3 13 | - pytest 14 | - rope 15 | - spacy=2 16 | -------------------------------------------------------------------------------- /src/phase_1/exploration.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Exploration\n", 8 | "\n", 9 | "Look within a single SSP." 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# ignore whitespace warnings\n", 19 | "%env SPACY_WARNING_IGNORE=W008\n", 20 | "\n", 21 | "import ipywidgets as widgets\n", 22 | "import itertools\n", 23 | "import pandas as pd" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "Re-run this cell when Python code in the repository changes." 
31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "import importlib\n", 40 | "import fismatic.core as fismatic\n", 41 | "importlib.reload(fismatic);" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Load file" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "filename_widget = widgets.Text(description=\"Filename:\", value=\"./Azure Security and Compliance Blueprint - FedRAMP High SSP.docx\")\n", 58 | "display(filename_widget)" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "filename = filename_widget.value\n", 68 | "control_set = fismatic.control_set_for(filename)" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "## Control similarity matrix" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": null, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "control_set.similarity_matrix().head()" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "## Top entities" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [ 100 | "top_entities = control_set.top_entities()\n", 101 | "pd.DataFrame(top_entities, columns=[\"Entity\", \"# occurrences\"])" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [ 110 | "top_chunks = control_set.top_proper_noun_chunks()\n", 111 | "pd.DataFrame(top_chunks, columns=[\"Chunk\", \"# occurrences\"])" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": {}, 117 | "source": [ 118 | "## Example implementation" 119 | ] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "execution_count": null, 124 | "metadata": {}, 125 | "outputs": [], 126 | "source": [ 127 | "implementations = control_set.get_implementations()\n", 128 | "implementation = list(implementations)[0]\n", 129 | "implementation" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "### Noun chunks" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "scrolled": true 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "list(implementation.noun_chunks)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "### Entities" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": {}, 161 | "outputs": [], 162 | "source": [ 163 | "implementation.ents" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [] 172 | } 173 | ], 174 | "metadata": { 175 | "kernelspec": { 176 | "display_name": "Python 3", 177 | "language": "python", 178 | "name": "python3" 179 | }, 180 | "language_info": { 181 | "codemirror_mode": { 182 | "name": "ipython", 183 | "version": 3 184 | }, 185 | "file_extension": ".py", 186 | "mimetype": "text/x-python", 187 | "name": "python", 188 | "nbconvert_exporter": "python", 189 | "pygments_lexer": "ipython3", 190 | "version": "3.7.3" 191 | } 192 | }, 193 | "nbformat": 4, 194 | "nbformat_minor": 
2 195 | } 196 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/src/phase_1/fismatic/__init__.py -------------------------------------------------------------------------------- /src/phase_1/fismatic/control.py: -------------------------------------------------------------------------------- 1 | import re 2 | from .similarity import nlp 3 | 4 | 5 | class Control: 6 | def __init__(self, name=None): 7 | self.name = name 8 | """Specific to FedRAMP templates""" 9 | self.responsible_role = None 10 | self.imp_status = None 11 | self.origination = None 12 | self.implementation = {} 13 | 14 | @property 15 | def implementation(self): 16 | return self._implementation 17 | 18 | @implementation.setter 19 | def implementation(self, value): 20 | self._implementation = {part: nlp(imp) for part, imp in value.items()} 21 | 22 | def normalized_name(self): 23 | # remove any spaces 24 | name = self.name.replace(" ", "") 25 | parsed = re.match("([A-Z]{2})-(\d+)(\((\d+)\))?", name) 26 | family = parsed[1] 27 | num = parsed[2] 28 | enhancement = parsed[4] 29 | 30 | result = "{}-{}".format(family, num) 31 | if enhancement: 32 | result = "{} ({})".format(result, enhancement) 33 | return result 34 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/control_set.py: -------------------------------------------------------------------------------- 1 | from collections import Counter 2 | from . import helpers 3 | from . import similarity 4 | 5 | 6 | class ControlSet: 7 | def __init__(self, controls, source=""): 8 | self._controls = controls 9 | self._source = source 10 | 11 | @property 12 | def source(self): 13 | """Where the control set came from. 
This will generally be the file path.""" 14 | return self._source 15 | 16 | def get_implementations_by_id(self): 17 | """The ID (key) is the control name + part.""" 18 | result = {} 19 | 20 | for control in self._controls: 21 | for part, imp in control.implementation.items(): 22 | key = ": ".join([control.name, part]) 23 | result[key] = imp 24 | 25 | return result 26 | 27 | def get_control(self, name): 28 | for control in self._controls: 29 | if control.name == name: 30 | return control 31 | return None 32 | 33 | def get_implementation_for(self, control_name, part): 34 | control = self.get_control(control_name) 35 | if not control: 36 | return None 37 | return control.implementation.get(part) 38 | 39 | def get_implementations(self): 40 | """Returns a list of strings.""" 41 | return self.get_implementations_by_id().values() 42 | 43 | def num_controls(self): 44 | return len(self._controls) 45 | 46 | def num_implementations(self): 47 | return len(self.get_implementations()) 48 | 49 | def num_unique_implementations(self): 50 | implementations = self.get_implementations() 51 | imp_texts = [imp.text for imp in implementations] 52 | return len(set(imp_texts)) 53 | 54 | def num_identical_implementations(self): 55 | return self.num_implementations() - self.num_unique_implementations() 56 | 57 | def implementation_token_counts(self): 58 | implementations = self.get_implementations() 59 | return [ 60 | # based on 61 | # https://stackoverflow.com/a/41425016/358804 62 | sum(1 if not (token.is_stop or token.is_punct) else 0 for token in imp) 63 | for imp in implementations 64 | ] 65 | 66 | def num_tokens(self): 67 | return sum(self.implementation_token_counts()) 68 | 69 | def similarity_matrix(self): 70 | implementations_by_id = self.get_implementations_by_id() 71 | return similarity.generate_diffs_with_labels(implementations_by_id) 72 | 73 | def entities(self): 74 | """Returns a list of spaCy Entities.""" 75 | implementations = self.get_implementations() 76 | return helpers.flatten(imp.ents for imp in implementations) 77 | 78 | def top_entities(self, top=20): 79 | """Returns the most common entities across the controls.""" 80 | entities = [e.text for e in self.entities()] 81 | # https://www.youtube.com/watch?v=YrFOAhT4Azk 82 | counter = Counter(entities) 83 | return counter.most_common(top) 84 | 85 | def _is_proper_noun(self, span): 86 | # https://spacy.io/api/annotation#pos-tagging 87 | return span.root.tag_ in ["NNP", "NNPS"] 88 | 89 | def proper_noun_chunks(self): 90 | implementations = self.get_implementations() 91 | return helpers.flatten( 92 | [ 93 | [chunk.text for chunk in imp.noun_chunks if self._is_proper_noun(chunk)] 94 | for imp in implementations 95 | ] 96 | ) 97 | 98 | def top_proper_noun_chunks(self, top=20): 99 | """Returns the most common proper noun chunks across the controls.""" 100 | chunks = self.proper_noun_chunks() 101 | counter = Counter(chunks) 102 | return counter.most_common(top) 103 | 104 | def control_names(self): 105 | return [control.normalized_name() for control in self._controls] 106 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/core.py: -------------------------------------------------------------------------------- 1 | import glob 2 | import os.path 3 | import sys 4 | from .docx_parser import DocxParser 5 | 6 | 7 | def get_files(input_path): 8 | """`input_path` can be a specific file, or a directory.""" 9 | if os.path.isdir(input_path): 10 | # only process docx 11 | pattern = os.path.join(input_path, "**", 
"*.docx") 12 | files = glob.glob(pattern, recursive=True) 13 | if not files: 14 | print("No docx files found.", file=sys.stderr) 15 | return files 16 | else: 17 | return [input_path] 18 | 19 | 20 | def control_set_for(input_file): 21 | parser = DocxParser(input_file) 22 | return parser.get_control_set() 23 | 24 | 25 | def stats_for(control_set): 26 | filename = os.path.basename(control_set.source) 27 | return { 28 | "Filename": filename, 29 | "# controls": control_set.num_controls(), 30 | "# identical implementations": control_set.num_identical_implementations(), 31 | "# implementations": control_set.num_implementations(), 32 | "# unique implementations": control_set.num_unique_implementations(), 33 | "# tokens": control_set.num_tokens(), 34 | } 35 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/demo.py: -------------------------------------------------------------------------------- 1 | from .helpers import present 2 | from .similarity import nlp 3 | 4 | 5 | def similar_implementations(control_sets, control_name, part, implementation): 6 | user_implementation = nlp(implementation) 7 | 8 | implementations = [ 9 | cs.get_implementation_for(control_name, part) for cs in control_sets 10 | ] 11 | 12 | # exclude SSPs that don't have that control+part 13 | implementations = filter(present, implementations) 14 | 15 | # get the most similar 16 | return sorted( 17 | implementations, 18 | key=lambda imp: imp.similarity(user_implementation), 19 | reverse=True, 20 | ) 21 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/docx_parser.py: -------------------------------------------------------------------------------- 1 | from docx import Document 2 | from . 
import parser 3 | from .control_set import ControlSet 4 | 5 | 6 | class DocxParser: 7 | def __init__(self, doc_path): 8 | self.doc = Document(docx=doc_path) 9 | self.doc_path = doc_path 10 | 11 | def get_tables(self): 12 | return parser.get_tables(self.doc) 13 | 14 | def get_controls(self): 15 | # Control details are in tables, skip the rest 16 | tables = self.get_tables() 17 | return parser.get_controls(tables) 18 | 19 | def get_control_set(self): 20 | controls = list(self.get_controls().values()) 21 | return ControlSet(controls, source=self.doc_path) 22 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/helpers.py: -------------------------------------------------------------------------------- 1 | import itertools 2 | 3 | 4 | def flatten(list_of_lists): 5 | # https://stackoverflow.com/a/13498063/358804 6 | return list(itertools.chain(*list_of_lists)) 7 | 8 | 9 | def present(span): 10 | """Accepts a spaCy Span and returns True if the text contains non-whitespace characters.""" 11 | return span and span.text.strip() 12 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/parser.py: -------------------------------------------------------------------------------- 1 | from docx.document import Document as _Document 2 | from docx.oxml.table import CT_Tbl 3 | from docx.oxml.text.paragraph import CT_P 4 | from docx.table import Table, _Cell 5 | from docx.text.paragraph import Paragraph 6 | import re 7 | from .control import Control 8 | from .similarity import nlp 9 | 10 | 11 | def iter_block_items(parent): 12 | """ 13 | Parsing table structure from .docx 14 | 15 | Generate a reference to each paragraph and table child within *parent*, 16 | in document order. Each returned value is an instance of either Table or 17 | Paragraph. *parent* would most commonly be a reference to a main 18 | Document object, but also works for a _Cell object, which itself can 19 | contain paragraphs and tables. 
20 | """ 21 | if isinstance(parent, _Document): 22 | parent_elm = parent.element.body 23 | elif isinstance(parent, _Cell): 24 | parent_elm = parent._tc 25 | else: 26 | raise ValueError("something's not right") 27 | 28 | for child in parent_elm.iterchildren(): 29 | if isinstance(child, CT_P): 30 | yield Paragraph(child, parent) 31 | elif isinstance(child, CT_Tbl): 32 | yield Table(child, parent) 33 | 34 | 35 | def get_tables(doc): 36 | return [ 37 | block for block in iter_block_items(doc) if not isinstance(block, Paragraph) 38 | ] 39 | 40 | 41 | def get_control_summary_for(control="AC-1"): 42 | """TODO""" 43 | 44 | 45 | def is_control_heading(text): 46 | actual = nlp(text) 47 | expected = nlp("Control Summary Information") 48 | return actual.similarity(expected) > 0.9 49 | 50 | 51 | def get_control_summary(block): 52 | """True if table contains control summary information.""" 53 | if not isinstance(block, Table): 54 | return False 55 | first_row = block.rows[0] 56 | if len(first_row.cells) < 2: 57 | return False 58 | 59 | second_cell_text = first_row.cells[1].text 60 | if is_control_heading(second_cell_text): 61 | first_cell_text = first_row.cells[0].text 62 | return first_cell_text 63 | 64 | return False 65 | 66 | 67 | def get_responsible_roles(cell): 68 | roles_str = re.sub("responsible role:", "", cell.text, flags=re.IGNORECASE) 69 | responsible_roles = roles_str.split(",") 70 | return [role.strip() for role in responsible_roles] 71 | 72 | 73 | def parse_control_table(table): 74 | """ 75 | TODO 76 | extract data from control summary table 77 | Not sure how to detect checked boxes 78 | """ 79 | result = {"responsible_role": None, "imp_status": None, "origination": None} 80 | for row in table.rows[1:]: 81 | for c in row.cells: 82 | text = c.text.strip().lower() 83 | if text.startswith("responsible role:"): 84 | result["responsible_role"] = get_responsible_roles(c) 85 | elif text.startswith("implementation status"): 86 | # return which box is checked 87 | pass 88 | elif text.startswith("control origination"): 89 | # return which box is checked 90 | pass 91 | else: 92 | pass 93 | raise 94 | 95 | 96 | def parse_implementation_table(table): 97 | """Extract implementation narratives""" 98 | implementations = {} 99 | for row in table.rows[1:]: 100 | try: 101 | implementations.update({row.cells[0].text: row.cells[1].text}) 102 | except IndexError: 103 | pass 104 | return implementations 105 | 106 | 107 | def get_controls(tables): 108 | """Returns a dictionary of controls by name.""" 109 | 110 | # Loop through all tables parsed from docx 111 | # If its a control summary table, add that control to our list 112 | # Then grab the implementation narratives from the following table 113 | check_next = False 114 | controls = {} 115 | check_control = None 116 | 117 | for t in tables: 118 | name = get_control_summary(t) 119 | if name: 120 | control = Control(name=name) 121 | name = control.normalized_name() 122 | # print('Found %s...' 
% control) 123 | # TODO - parse_control_table(t) 124 | 125 | # Next table in the doc is implementation narratives 126 | check_next = True 127 | # Need to keep control reference until we grab implementation 128 | check_control = name 129 | 130 | controls[name] = control 131 | elif check_next and check_control: 132 | control = controls[check_control] 133 | control.implementation = parse_implementation_table(t) 134 | check_next = False 135 | check_control = None 136 | else: 137 | check_next = False 138 | check_control = None 139 | 140 | return controls 141 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/similarity.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import spacy 4 | 5 | # lazy-load the model 6 | # https://stackoverflow.com/a/7152065/358804 7 | class LazyNLP: 8 | def __init__(self): 9 | self.nlp = None 10 | 11 | def __call__(self, *args, **kwargs): 12 | if not self.nlp: 13 | self.nlp = spacy.load("en_core_web_lg") 14 | return self.nlp(*args, **kwargs) 15 | 16 | 17 | nlp = LazyNLP() 18 | 19 | 20 | def generate_diffs(docs): 21 | """Expects an iterable of spaCy Docs. Returns the similarity scores between controls.""" 22 | results = [[doc1.similarity(doc2) for doc2 in docs] for doc1 in docs] 23 | return np.array(results) 24 | 25 | 26 | def generate_diffs_with_labels(implementations_by_id): 27 | """Returns a Pandas DataFrame of the similarity scores between controls.""" 28 | implementations = implementations_by_id.values() 29 | desc_lkup = implementations_by_id.keys() 30 | matrix = generate_diffs(implementations) 31 | return pd.DataFrame(matrix, index=desc_lkup, columns=desc_lkup) 32 | 33 | 34 | def similar_controls(diffs, threshold=0.9): 35 | """Find all control narratives which are identical or very similar (greater than the provided number).""" 36 | # exclude the controls matching themselves 37 | np.fill_diagonal(diffs.values, np.nan) 38 | 39 | return { 40 | control_name: similarities[similarities > threshold].to_dict() 41 | for control_name, similarities in diffs.iteritems() 42 | } 43 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/test/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/uscensusbureau/fismatic/37d6e7d912831e1ef08107ab217ef2c6cd8ae7a9/src/phase_1/fismatic/test/__init__.py -------------------------------------------------------------------------------- /src/phase_1/fismatic/test/common.py: -------------------------------------------------------------------------------- 1 | SOURCE_DOC = "Azure Security and Compliance Blueprint - FedRAMP High SSP.docx" 2 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/test/test_control.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | from ..control import Control 3 | 4 | 5 | @pytest.mark.parametrize( 6 | "test_input,expected", 7 | [("AC-2", "AC-2"), ("AC-2 (1)", "AC-2 (1)"), ("AC-2(1)", "AC-2 (1)")], 8 | ) 9 | def test_normalized_name(test_input, expected): 10 | control = Control(test_input) 11 | assert control.normalized_name() == expected 12 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/test/test_control_set.py: 
-------------------------------------------------------------------------------- 1 | from ..control import Control 2 | from ..control_set import ControlSet 3 | 4 | 5 | def test_get_implementation_for(): 6 | control1 = Control("AC-1") 7 | control1.implementation = {"A": "foo"} 8 | 9 | control2 = Control("AC-2") 10 | control2.implementation = {"A": "bar"} 11 | 12 | control_set = ControlSet([control1, control2]) 13 | assert control_set.get_implementation_for("AC-2", "A").text == "bar" 14 | assert control_set.get_implementation_for("baz", "A") == None 15 | assert control_set.get_implementation_for("AC-2", "Z") == None 16 | 17 | 18 | def test_num_unique_implementations(): 19 | control1 = Control("AC-1") 20 | control1.implementation = {"A": "foo"} 21 | 22 | control2 = Control("AC-2") 23 | control2.implementation = {"A": "foo"} 24 | 25 | control3 = Control("AC-3") 26 | control3.implementation = {"A": "bar"} 27 | 28 | control_set = ControlSet([control1, control2, control3]) 29 | assert control_set.num_unique_implementations() == 2 30 | 31 | 32 | def test_num_tokens(): 33 | control1 = Control("foo") 34 | control1.implementation = {"A": "foo - does things"} 35 | 36 | control2 = Control("bar") 37 | control2.implementation = {"A": "bar's a thinger do-er"} 38 | 39 | control_set = ControlSet([control1, control2]) 40 | assert control_set.num_tokens() == 9 41 | 42 | 43 | def test_top_entities(): 44 | control1 = Control("foo") 45 | control1.implementation = { 46 | "A": "FISMATic is the greatest thing to happen to the United States since sliced bread." 47 | } 48 | 49 | control2 = Control("bar") 50 | control2.implementation = {"A": "Have I told you how great FISMAtic is?"} 51 | 52 | control_set = ControlSet([control1, control2]) 53 | 54 | # TODO this should have captured "FISMAtic" 55 | assert control_set.top_entities() == [("the United States", 1)] 56 | 57 | 58 | def test_top_proper_noun_chunks(): 59 | control1 = Control("foo") 60 | control1.implementation = { 61 | "A": "FISMATic is the greatest thing to happen to the United States since sliced bread." 62 | } 63 | 64 | control2 = Control("bar") 65 | control2.implementation = {"A": "Have I told you how great FISMAtic is?"} 66 | 67 | control_set = ControlSet([control1, control2]) 68 | 69 | # TODO this should have captured "FISMAtic" 70 | assert control_set.top_proper_noun_chunks() == [("the United States", 1)] 71 | 72 | 73 | def test_control_names(): 74 | control1 = Control("AC-2") 75 | control1.implementation = {"": ""} 76 | control2 = Control("AU-6(1)") 77 | control2.implementation = {"": ""} 78 | control_set = ControlSet([control1, control2]) 79 | 80 | assert control_set.control_names() == ["AC-2", "AU-6 (1)"] 81 | -------------------------------------------------------------------------------- /src/phase_1/fismatic/test/test_core.py: -------------------------------------------------------------------------------- 1 | import os.path 2 | import pytest 3 | from . import common 4 | from .. 
5 |
6 |
7 | @pytest.mark.parametrize(
8 |     "test_input,expected",
9 |     [
10 |         (common.SOURCE_DOC, [common.SOURCE_DOC]),  # single file
11 |         (".", [common.SOURCE_DOC]),  # directory with docx
12 |         ("fismatic", []),  # directory without docx
13 |     ],
14 | )
15 | def test_get_files(test_input, expected):
16 |     result = fismatic.get_files(test_input)
17 |
18 |     # absolute-ize the paths so that it doesn't fail due to leading `./`s
19 |     result = [os.path.abspath(p) for p in result]
20 |     expected = [os.path.abspath(p) for p in expected]
21 |
22 |     assert result == expected
23 |
--------------------------------------------------------------------------------
/src/phase_1/fismatic/test/test_demo.py:
--------------------------------------------------------------------------------
1 | from ..control import Control
2 | from ..control_set import ControlSet
3 | from ..demo import similar_implementations
4 |
5 |
6 | def test_similar_implementations_empty():
7 |     results = similar_implementations([], "foo", "a", "bar")
8 |     assert results == []
9 |
10 |
11 | def test_similar_implementations_excludes_blank():
12 |     control = Control("AC-2")
13 |     control.implementation = {"A": " "}
14 |     control_set = ControlSet([control])
15 |
16 |     results = similar_implementations([control_set], "AC-2", "A", "foo")
17 |     assert results == []
18 |
19 |
20 | def test_similar_implementations():
21 |     control1 = Control("AC-1")
22 |     control1.implementation = {"A": "Something else."}
23 |     control2 = Control("AC-2")
24 |     imp1 = "This is about computers."
25 |     control2.implementation = {"A": imp1}
26 |     control_set1 = ControlSet([control1, control2])
27 |
28 |     control3 = Control("AC-2")
29 |     imp2 = "This is also about computers. Computers do a lot."
30 |     control3.implementation = {"A": imp2}
31 |     control_set2 = ControlSet([control3])
32 |
33 |     control4 = Control("AC-2")
34 |     imp3 = "Irrelevant."
35 |     control4.implementation = {"A": imp3}
36 |     control_set3 = ControlSet([control4])
37 |
38 |     control_sets = [control_set1, control_set2, control_set3]
39 |
40 |     results = similar_implementations(control_sets, "AC-2", "A", "computers")
41 |     result_txt = [imp.text for imp in results]
42 |     assert result_txt == [imp2, imp1, imp3]
43 |
44 |
--------------------------------------------------------------------------------
/src/phase_1/fismatic/test/test_docx_parser.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from . import common
3 | from ..docx_parser import DocxParser
4 |
5 |
6 | @pytest.fixture(scope="session")
7 | def controls():
8 |     parser = DocxParser(common.SOURCE_DOC)
9 |     return parser.get_controls()
10 |
11 |
12 | def test_get_controls(controls):
13 |     control_ids = list(controls.keys())
14 |     assert control_ids[0:4] == ["AC-1", "AC-2", "AC-2 (1)", "AC-2 (2)"]
15 |
16 |
17 | def test_implementation(controls):
18 |     control = controls["AC-1"]
19 |     implementation = control.implementation["Part a"]
20 |     assert "Microsoft Azure" in implementation.text
21 |     assert "customer is responsible" in implementation.text
22 |
--------------------------------------------------------------------------------
/src/phase_1/fismatic/test/test_notebook.py:
--------------------------------------------------------------------------------
1 | import glob
2 | import nbformat
3 | from nbconvert.preprocessors import ExecutePreprocessor
4 |
5 |
6 | def test_notebooks():
7 |     notebooks = glob.glob("./*.ipynb")
8 |     # safety check
9 |     assert len(notebooks) > 2
10 |
11 |     for notebook in notebooks:
12 |         # https://nbconvert.readthedocs.io/en/latest/execute_api.html#executing-notebooks-using-the-python-api-interface
13 |         with open(notebook) as f:
14 |             nb = nbformat.read(f, as_version=4)
15 |         ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
16 |         ep.preprocess(nb, {})
17 |
--------------------------------------------------------------------------------
/src/phase_1/fismatic/test/test_parser.py:
--------------------------------------------------------------------------------
1 | import pytest
2 | from .. import parser
3 |
4 |
5 | @pytest.mark.parametrize(
6 |     "test_input,expected",
7 |     [
8 |         ("control summary information", True),
9 |         ("Control Summary Information", True),
10 |         ("Control Enhancement Summary Information", True),
11 |         ("something else", False),
12 |     ],
13 | )
14 | def test_is_control_heading(test_input, expected):
15 |     assert parser.is_control_heading(test_input) == expected
16 |
--------------------------------------------------------------------------------
/src/phase_1/fismatic/test/test_similarity.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 | from .. import similarity
4 |
5 |
6 | def test_generate_diffs():
7 |     sources = ["foo", "Foo", "bar"]
8 |     docs = [similarity.nlp(source) for source in sources]
9 |     diffs = similarity.generate_diffs(docs)
10 |     expected = [
11 |         # foo, Foo, bar
12 |         [1.0, 1.0, 0.2],  # foo
13 |         [1.0, 1.0, 0.2],  # Foo
14 |         [0.2, 0.2, 1.0],  # bar
15 |     ]
16 |     np.testing.assert_array_almost_equal(diffs, expected, decimal=2)
17 |
18 |
19 | def test_similar_controls():
20 |     desc_lkup = ["AC-1", "AC-2", "AC-2 (1)"]
21 |     diffs = pd.DataFrame(
22 |         [
23 |             # AC-1, AC-2, AC-2 (1)
24 |             [1.00, 0.95, 0.00],  # AC-1
25 |             [0.95, 1.00, 0.00],  # AC-2
26 |             [0.00, 0.00, 1.00],  # AC-2 (1)
27 |         ],
28 |         index=desc_lkup,
29 |         columns=desc_lkup,
30 |     )
31 |
32 |     very_similar = similarity.similar_controls(diffs)
33 |     assert very_similar == {
34 |         "AC-1": {"AC-2": 0.95},
35 |         "AC-2": {"AC-1": 0.95},
36 |         "AC-2 (1)": {},
37 |     }
38 |
--------------------------------------------------------------------------------
/src/phase_1/jupyter_notebook_config.py:
--------------------------------------------------------------------------------
1 | # https://gist.github.com/binaryfunt/f31a7ecc8d698dd0edbec1489f2ab055
2 | def scrub_output_pre_save(model, **kwargs):
3 |     """scrub output before saving notebooks"""
4 |     # only run on notebooks
5 |     if model["type"] != "notebook":
6 |         return
7 |     # only run on nbformat v4
8 |     if model["content"]["nbformat"] != 4:
9 |         return
10 |
11 |     for cell in model["content"]["cells"]:
12 |         if cell["cell_type"] != "code":
13 |             continue
14 |         cell["outputs"] = []
15 |         cell["execution_count"] = None
16 |         if "collapsed" in cell["metadata"]:
17 |             cell["metadata"].pop("collapsed", 0)
18 |
19 |
20 | c.ContentsManager.pre_save_hook = scrub_output_pre_save
21 |
--------------------------------------------------------------------------------
/src/phase_1/pytest.ini:
--------------------------------------------------------------------------------
1 | [pytest]
2 | filterwarnings =
3 |     ignore::DeprecationWarning
4 |     ignore::PendingDeprecationWarning
5 |
--------------------------------------------------------------------------------
/src/phase_1/usage.md:
--------------------------------------------------------------------------------
1 | # FISMAtic prototype
2 |
3 | Current state: Jupyter Notebooks that perform linguistic and other analysis of System Security Plans (SSPs) written as Word documents.
4 |
5 | ## Technical overview
6 |
7 | The primary technologies in use:
8 |
9 | - Python - programming language
10 | - [spaCy](https://spacy.io/) - Natural Language Processing (NLP)
11 | - [Jupyter Notebooks](https://jupyter.org/) - display of results
12 | - [Pandas](https://pandas.pydata.org/) - everything quantitative
13 |
14 | The meat of the project is in Python files under [`fismatic/`](../fismatic), and is `import`ed into Jupyter Notebooks for display. This code could be leveraged in a web application or as an API instead.
15 |
16 | How it works: `.docx` files are read from the filesystem into memory as instances of [`ControlSet`s](../fismatic/control_set.py). These could be populated a different way: from a database connection, or [from OpenControl files on GitHub, as this example does](https://github.com/uscensusbureau/fismatic/pull/42). The processing takes place in spaCy/Pandas, and then the results are formatted and displayed in Jupyter.
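
To make that flow concrete, here is a minimal sketch using the modules in this directory (run from this directory after completing the setup below). It is illustrative rather than canonical: the SSP filename is a placeholder, and keying narratives by `"Part a"` assumes a FedRAMP-template-style document like the one used in the tests.

```python
from fismatic.docx_parser import DocxParser
from fismatic import similarity

# Parse a FedRAMP-style SSP into Control objects keyed by control ID
# ("My-System-SSP.docx" is a placeholder filename).
controls = DocxParser("My-System-SSP.docx").get_controls()

# Build a spaCy Doc for each control's "Part a" narrative...
narratives = {
    name: similarity.nlp(control.implementation["Part a"].text)
    for name, control in controls.items()
    if control.implementation and "Part a" in control.implementation
}

# ...then score every narrative against every other one and flag near-duplicates.
diffs = similarity.generate_diffs_with_labels(narratives)
print(similarity.similar_controls(diffs, threshold=0.9))
```

The notebooks listed under "Running" below build on these same modules to present the results interactively.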
17 |
18 | Project tasks are tracked in [GitHub issues](https://github.com/uscensusbureau/fismatic/issues) and status is reflected in [the Kanban board](https://github.com/uscensusbureau/fismatic/projects/1).
19 |
20 | ## Setup
21 |
22 | 1. [Install Conda.](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)
23 | 1. Set up Conda environment.
24 |
25 |     ```sh
26 |     conda env create -f environment.yml
27 |     conda activate fismatic
28 |     jupyter nbextension enable --py widgetsnbextension
29 |     ```
30 |
31 | 1. Download the language model.
32 |
33 |     ```sh
34 |     python -m spacy download en_core_web_lg --user
35 |     ```
36 |
37 | ## Running
38 |
39 | 1. Download an SSP as a `.docx` based on the [FedRAMP template](https://www.fedramp.gov/templates/).
40 |     - The [Azure Blueprint FedRAMP High SSP](https://www.microsoft.com/en-us/trustcenter/compliance/fedramp) is a good one to test with.
41 | 1. Start the Jupyter Notebook.
42 |
43 |     ```sh
44 |     jupyter notebook --config=jupyter_notebook_config.py
45 |     ```
46 |
47 | 1. Play with the following notebooks:
48 |     - [`exploration.ipynb`](http://localhost:8888/notebooks/exploration.ipynb) - single SSPs
49 |     - [`analysis.ipynb`](http://localhost:8888/notebooks/analysis.ipynb) - across SSPs
50 |     - [`demo.ipynb`](http://localhost:8888/notebooks/demo.ipynb) - a proof-of-concept showing relevant control implementations being displayed to the user interactively
51 |
52 | ## Development
53 |
54 | To run tests:
55 |
56 | 1. Download the [Azure Blueprint FedRAMP High SSP](https://www.microsoft.com/en-us/trustcenter/compliance/fedramp), and place the file in this directory.
57 | 1. From this directory, run:
58 |
59 |     ```sh
60 |     pytest
61 |     ```
62 |
--------------------------------------------------------------------------------