├── LICENSE ├── README.md ├── cis_benchmark_converter.py └── requirements.txt /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Maxime Beauchamp 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # CIS Benchmark Converter 2 | 3 | **Author:** Octomany 4 | **LinkedIn:** [LinkedIn](https://www.linkedin.com/in/maxbeauchamp/) 5 | 6 | **Date Created:** 2024-11-06 7 | **Last Update:** 2025-14-03 8 | 9 | ## Description 10 | 11 | `CIS Benchmark Converter` is a Python script that extracts recommendations from CIS Benchmark PDF documents and exports them into CSV, Excel, or JSON formats. The script converts unstructured PDF content into a structured table, simplifying compliance reviews and audits. 12 | 13 | ## Features 14 | 15 | - **Configurable Extraction:** 16 | Set the start page to skip tables of contents or disclaimers, and adjust the logging level via command-line options. 17 | 18 | - **Multiple Output Formats:** 19 | Export the extracted data as CSV (using a pipe `|` delimiter), Excel (with styled headers, dropdowns, and conditional formatting), or JSON for easy data integration. 20 | 21 | - **Robust and Maintainable:** 22 | Uses `pathlib` for modern file path management, extended type annotations for static type checking, enhanced exception handling, and a progress bar (via `tqdm`) for user feedback. 23 | 24 | ## Installation 25 | 26 | 1. Clone this repository. 27 | 2. Install dependencies using the provided `requirements.txt`: 28 | 29 | ```bash 30 | pip install -r requirements.txt 31 | ``` 32 | 33 | **requirements.txt:** 34 | ``` 35 | pdfplumber 36 | openpyxl 37 | tqdm 38 | ``` 39 | 40 | ## Usage 41 | 42 | Run the script from the command line as follows: 43 | 44 | ```bash 45 | python cis_benchmark_converter.py \ 46 | -i path/to/input_file.pdf \ 47 | -o path/to/output_file \ 48 | -f [csv|excel|json] \ 49 | --start_page 10 \ 50 | --log_level INFO 51 | ``` 52 | 53 | ### Arguments 54 | 55 | - `-i, --input` : Path to the input CIS Benchmark PDF file (required). 56 | - `-o, --output` : Path to the output file (defaults to the input file name with a `.csv`, `.xlsx`, or `.json` extension). 57 | - `-f, --format` : Output file format: `csv`, `excel`, or `json` (default: `excel`). 58 | - `--start_page` : Page number to start extraction (default: 10). 59 | - `--log_level` : Logging level (`DEBUG`, `INFO`, `WARNING`, or `ERROR`; default: `INFO`). 60 | 61 | ## Example 62 | 63 | ```bash 64 | python cis_benchmark_converter.py -i ./CIS_AWS_Benchmark.pdf -o ./CIS_AWS_Benchmark.json -f json --start_page 10 --log_level INFO 65 | ``` 66 | 67 | For JSON output, the data is structured as a list of dictionaries, with each dictionary representing a recommendation and its associated sections. 68 | 69 | > **Note:** The script automatically excludes any sections labeled "CIS Controls" to focus solely on the core recommendations. 70 | 71 | ## Acknowledgements 72 | 73 | Special thanks to [Flavien Fouqueray (UnBonWhisky)](https://www.linkedin.com/in/ffouqueray/) for his valuable bug fixes and contributions in earlier versions of this script. 74 | 75 | ## License 76 | 77 | This project is licensed under the MIT License. Please respect the copyrights 78 | of the CIS Benchmark documents when using and sharing this tool. -------------------------------------------------------------------------------- /cis_benchmark_converter.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | """ 4 | File: cis_benchmark_converter.py 5 | Author: Maxime Beauchamp 6 | LinkedIn: https://www.linkedin.com/in/maxbeauchamp/ 7 | Created: 2024-11-06 8 | Last Update: 2025-28-04 9 | 10 | Description: 11 | This script extracts recommendations from CIS Benchmark PDF documents and exports 12 | them to CSV, Excel, or JSON format. The extraction starts at a configurable page number 13 | (to skip table of contents or disclaimers). Logging level and other parameters 14 | are configurable via command-line options. 15 | 16 | Usage: 17 | python cis_benchmark_converter.py \ 18 | -i path/to/input_file.pdf \ 19 | -o path/to/output_file \ 20 | -f [csv|excel|json] \ 21 | --start_page 10 \ 22 | --log_level INFO 23 | 24 | Dependencies: 25 | - pdfplumber : for text extraction from PDF files. 26 | - openpyxl : for creating and handling Excel files. 27 | - tqdm : for displaying a progress bar. 28 | - logging : part of the standard Python library. 29 | - pathlib : part of the standard Python library. 30 | - json : part of the standard Python library. 31 | 32 | Installation: 33 | pip install pdfplumber openpyxl tqdm 34 | 35 | License: 36 | This script is provided under the MIT License. 37 | Please respect the copyright of the CIS Benchmarks 38 | documents when using and sharing this script. 39 | """ 40 | 41 | import argparse 42 | import csv 43 | import json 44 | import re 45 | import logging 46 | from pathlib import Path 47 | from typing import Tuple, List, Dict 48 | 49 | import pdfplumber 50 | from tqdm import tqdm 51 | from openpyxl import Workbook 52 | from openpyxl.styles import PatternFill, Font 53 | from openpyxl.worksheet.datavalidation import DataValidation 54 | from openpyxl.worksheet.table import Table, TableStyleInfo 55 | from openpyxl.formatting.rule import FormulaRule 56 | from openpyxl.utils import get_column_letter 57 | 58 | # Disable warnings from pdfminer and pdfplumber 59 | # to avoid cluttering the output with unnecessary messages. 60 | logging.getLogger("pdfminer").setLevel(logging.ERROR) 61 | logging.getLogger("pdfplumber").setLevel(logging.ERROR) 62 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') 63 | 64 | # ----------------------------------------------------------------------------------- 65 | # Global Constants and Regular Expressions 66 | # ----------------------------------------------------------------------------------- 67 | 68 | # Matches recommendation titles (e.g., "1.1.1 (L1) Title of Recommendation") 69 | TITLE_PATTERN: re.Pattern = re.compile(r'^(\d+\.\d+(?:\.\d+)*)\s*(\(L\d+\))?\s*(.*)') 70 | 71 | # Matches page number strings (e.g., "Page 123") 72 | PAGE_NUMBER_PATTERN: re.Pattern = re.compile(r'\bPage\s+\d+\b', re.IGNORECASE) 73 | 74 | # List of section headers to extract 75 | SECTIONS_WITHOUT_CIS: List[str] = [ 76 | 'Profile Applicability:', 77 | 'Description:', 78 | 'Rationale:', 79 | 'Impact:', 80 | 'Audit:', 81 | 'Remediation:', 82 | 'Default Value:', 83 | 'References:', 84 | 'Additional Information:' 85 | ] 86 | 87 | # ----------------------------------------------------------------------------------- 88 | # Utility Functions 89 | # ----------------------------------------------------------------------------------- 90 | 91 | def remove_page_numbers(text: str) -> str: 92 | """ 93 | Remove mentions of page numbers (e.g. "Page 123") from the provided text. 94 | """ 95 | return PAGE_NUMBER_PATTERN.sub('', text) 96 | 97 | def generate_unique_filename(base_name: str, extension: str) -> str: 98 | """ 99 | Generate a unique filename by appending a numeric suffix if the file already exists. 100 | Uses pathlib for robust path handling. 101 | """ 102 | file_path = Path(f"{base_name}.{extension}") 103 | counter = 1 104 | while file_path.exists(): 105 | file_path = Path(f"{base_name}({counter}).{extension}") 106 | counter += 1 107 | return str(file_path) 108 | 109 | # ----------------------------------------------------------------------------------- 110 | # PDF Extraction Functions 111 | # ----------------------------------------------------------------------------------- 112 | 113 | def extract_title_and_version(input_file: Path) -> Tuple[str, str]: 114 | """ 115 | Extract the document title and version from the first page of the PDF. 116 | 117 | Returns: 118 | (title, version) as strings. Version may be empty if no version line is found. 119 | """ 120 | try: 121 | with pdfplumber.open(str(input_file)) as pdf: 122 | first_page_text = pdf.pages[0].extract_text().splitlines() 123 | except Exception as e: 124 | logging.error(f"Error opening PDF for title extraction: {e}") 125 | raise 126 | 127 | title_lines: List[str] = [] 128 | version: str = "" 129 | for line in first_page_text: 130 | # Example: "v1.2 - 2024" 131 | if line.lower().startswith("v") and "-" in line: 132 | version = line.strip() 133 | break 134 | else: 135 | title_lines.append(line.strip()) 136 | 137 | title = " ".join(title_lines) if title_lines else "CIS Benchmark Document" 138 | return title, version 139 | 140 | def read_pdf(input_file: Path, start_page: int = 10) -> str: 141 | """ 142 | Reads text from the PDF file starting at 'start_page'. 143 | Uses tqdm to display a progress bar for the pages processed. 144 | 145 | Raises: 146 | ValueError: if 'start_page' is out of range. 147 | Exception: if reading fails for another reason. 148 | 149 | Returns: 150 | A single string containing the concatenated text of the pages read. 151 | """ 152 | logging.info(f"Reading PDF from page {start_page} onwards...") 153 | 154 | if start_page < 1: 155 | raise ValueError("start_page must be >= 1.") 156 | 157 | try: 158 | with pdfplumber.open(str(input_file)) as pdf: 159 | total_pages = len(pdf.pages) 160 | if start_page > total_pages: 161 | raise ValueError(f"Start page {start_page} exceeds total page count ({total_pages}).") 162 | 163 | # Extract text from each page starting at 'start_page' 164 | text_pages = [] 165 | for page in tqdm(pdf.pages[start_page - 1:], 166 | desc="Extracting pages", 167 | unit="page", 168 | total=(total_pages - start_page + 1)): 169 | page_text = page.extract_text() or "" 170 | text_pages.append(page_text) 171 | 172 | except Exception as e: 173 | logging.error(f"Failed to read PDF: {e}") 174 | raise 175 | 176 | # Filter out any empty strings and join with newlines 177 | return "\n".join(filter(None, text_pages)) 178 | 179 | def find_profile_applicability(lines: List[str], start_index: int, max_depth: int = 10) -> bool: 180 | """ 181 | Checks if 'Profile Applicability:' appears within 'max_depth' lines 182 | after 'start_index', indicating a valid recommendation start. 183 | 184 | Returns: 185 | True if found, otherwise False. 186 | """ 187 | for i in range(start_index + 1, min(start_index + max_depth, len(lines))): 188 | line: str = lines[i].strip() 189 | if line.startswith("Profile Applicability:"): 190 | return True 191 | if TITLE_PATTERN.match(line) or any(line.startswith(sec) for sec in SECTIONS_WITHOUT_CIS): 192 | return False 193 | return False 194 | 195 | def extract_section(lines: List[str], start_index: int, section_name: str) -> Tuple[str, int]: 196 | """ 197 | Extract the content of a section until a new section header, a new recommendation title, 198 | or a mention of "CIS Controls" is encountered. 199 | 200 | Returns: 201 | (content, next_index) 202 | content : the extracted text 203 | next_index : the position in 'lines' after extraction 204 | """ 205 | content: List[str] = [] 206 | current_index: int = start_index + 1 207 | 208 | while current_index < len(lines): 209 | line: str = lines[current_index].strip() 210 | line = remove_page_numbers(line) 211 | 212 | # End of this section if: 213 | # - We reach another known section header 214 | # - We detect a new recommendation title 215 | # - We see "CIS Controls" 216 | if any(line.startswith(sec) for sec in SECTIONS_WITHOUT_CIS) \ 217 | or TITLE_PATTERN.match(line) \ 218 | or line.lower().startswith("cis controls"): 219 | break 220 | 221 | content.append(line) 222 | current_index += 1 223 | 224 | return ' '.join(content).strip(), current_index 225 | 226 | def extract_recommendations(full_text: str) -> List[Dict[str, str]]: 227 | """ 228 | Parse the concatenated PDF text to extract recommendations. 229 | 230 | Returns: 231 | A list of dictionaries, each representing a recommendation with keys like 232 | "Number", "Level", "Title", and the extracted sections (Profile Applicability, etc.). 233 | """ 234 | recommendations: List[Dict[str, str]] = [] 235 | lines: List[str] = full_text.splitlines() 236 | current_recommendation: Dict[str, str] = {} 237 | current_index: int = 0 238 | 239 | while current_index < len(lines): 240 | line: str = lines[current_index].strip() 241 | line = remove_page_numbers(line) 242 | 243 | # Detect a recommendation title line 244 | title_match = TITLE_PATTERN.match(line) 245 | if title_match: 246 | # Check if next lines contain 'Profile Applicability:' => indicates a valid rec 247 | if find_profile_applicability(lines, current_index): 248 | # Save previous recommendation if it exists 249 | if current_recommendation: 250 | recommendations.append(current_recommendation) 251 | current_recommendation = { 252 | 'Number': title_match.group(1), 253 | 'Level': title_match.group(2) or '', 254 | 'Title': title_match.group(3), 255 | } 256 | # Capture multi-line titles 257 | while (current_index + 1 < len(lines) 258 | and not any(lines[current_index + 1].strip().startswith(sec) for sec in SECTIONS_WITHOUT_CIS) 259 | and not TITLE_PATTERN.match(lines[current_index + 1].strip())): 260 | current_index += 1 261 | current_recommendation['Title'] += " " + lines[current_index].strip() 262 | 263 | # Extract standard sections (Profile Applicability, Description, etc.) 264 | for sec in SECTIONS_WITHOUT_CIS: 265 | if line.startswith(sec): 266 | content, next_index = extract_section(lines, current_index, sec) 267 | # e.g. "Additional Information:" -> key = "Additional Information" 268 | current_recommendation[sec[:-1]] = content 269 | current_index = next_index - 1 270 | break 271 | 272 | current_index += 1 273 | 274 | # Add the last recommendation if any 275 | if current_recommendation: 276 | recommendations.append(current_recommendation) 277 | 278 | # Remove duplicates based on (Number, Title) in case of accidental repeats 279 | unique_recommendations = {(rec['Number'], rec['Title']): rec for rec in recommendations} 280 | return list(unique_recommendations.values()) 281 | 282 | # ----------------------------------------------------------------------------------- 283 | # Output Generation (CSV/Excel/JSON) 284 | # ----------------------------------------------------------------------------------- 285 | 286 | def write_output( 287 | recommendations: List[Dict[str, str]], 288 | output_file: Path, 289 | output_format: str, 290 | title: str, 291 | version: str 292 | ) -> None: 293 | """ 294 | Writes the extracted recommendations to CSV, Excel, or JSON format. 295 | 296 | Args: 297 | recommendations : List of recommendation dicts. 298 | output_file : Output file path. 299 | output_format : "csv", "excel", or "json". 300 | title : Document title (extracted from PDF). 301 | version : Document version (extracted from PDF). 302 | """ 303 | logging.info(f"Writing output to {output_file} in {output_format.upper()} format...") 304 | headers: List[str] = ['Compliance Status', 'Number', 'Level', 'Title'] 305 | headers += [sec[:-1] for sec in SECTIONS_WITHOUT_CIS] # remove trailing colon 306 | 307 | if output_format == 'csv': 308 | try: 309 | with output_file.open(mode='w', newline='', encoding='utf-8') as file: 310 | writer = csv.writer(file, delimiter='|') 311 | writer.writerow([title if title else "CIS Benchmark Document"]) 312 | writer.writerow([version if version else ""]) 313 | writer.writerow([]) # Empty line for spacing 314 | writer.writerow(headers) 315 | for recommendation in recommendations: 316 | recommendation['Compliance Status'] = 'To Review' 317 | row = [recommendation.get(header, '') for header in headers] 318 | writer.writerow(row) 319 | except Exception as e: 320 | logging.error(f"Error writing CSV output: {e}") 321 | raise 322 | 323 | elif output_format == 'excel': 324 | workbook = Workbook() 325 | sheet = workbook.active 326 | sheet.title = "Recommendations" 327 | sheet["A1"] = title if title else "CIS Benchmark Document" 328 | sheet["A1"].font = Font(size=14, bold=True) 329 | sheet["A2"] = version if version else "" 330 | sheet["A2"].font = Font(size=12, italic=True) 331 | sheet.append([""] * len(headers)) 332 | sheet.append(headers) 333 | for recommendation in recommendations: 334 | recommendation['Compliance Status'] = 'To Review' 335 | row = [recommendation.get(header, '') for header in headers] 336 | sheet.append(row) 337 | dv = DataValidation(type="list", formula1='"Compliant,Non-Compliant,To Review"', showDropDown=False) 338 | sheet.add_data_validation(dv) 339 | start_row = 5 340 | end_row = len(recommendations) + start_row 341 | for row_idx in range(start_row, end_row): 342 | dv.add(sheet[f"A{row_idx}"]) 343 | compliant_fill = PatternFill(start_color="C6EFCE", end_color="C6EFCE", fill_type="solid") 344 | non_compliant_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid") 345 | to_review_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid") 346 | compliant_rule = FormulaRule(formula=[f'$A{start_row}="Compliant"'], fill=compliant_fill) 347 | non_compliant_rule = FormulaRule(formula=[f'$A{start_row}="Non-Compliant"'], fill=non_compliant_fill) 348 | to_review_rule = FormulaRule(formula=[f'$A{start_row}="To Review"'], fill=to_review_fill) 349 | sheet.conditional_formatting.add(f"A{start_row}:A{end_row}", compliant_rule) 350 | sheet.conditional_formatting.add(f"A{start_row}:A{end_row}", non_compliant_rule) 351 | sheet.conditional_formatting.add(f"A{start_row}:A{end_row}", to_review_rule) 352 | last_column = get_column_letter(len(headers)) 353 | table_range = f"A4:{last_column}{end_row - 1}" 354 | table = Table(displayName="CISRecommendations", ref=table_range) 355 | style = TableStyleInfo(name="TableStyleMedium9", 356 | showFirstColumn=False, 357 | showLastColumn=False, 358 | showRowStripes=True, 359 | showColumnStripes=True) 360 | table.tableStyleInfo = style 361 | sheet.add_table(table) 362 | sheet.column_dimensions['A'].width = 10 363 | sheet.column_dimensions['B'].width = 8 364 | sheet.column_dimensions['C'].width = 8 365 | sheet.column_dimensions['D'].width = 50 366 | for col in range(5, len(headers) + 1): 367 | col_letter = get_column_letter(col) 368 | sheet.column_dimensions[col_letter].width = 10 369 | try: 370 | workbook.save(str(output_file)) 371 | except Exception as e: 372 | logging.error(f"Error saving Excel file: {e}") 373 | raise 374 | 375 | elif output_format == 'json': 376 | try: 377 | # Create a JSON object with document information and recommendations 378 | data = { 379 | "document_title": title if title else "CIS Benchmark Document", 380 | "document_version": version, 381 | "recommendations": recommendations 382 | } 383 | with output_file.open("w", encoding="utf-8") as f: 384 | json.dump(data, f, ensure_ascii=False, indent=4) 385 | except Exception as e: 386 | logging.error(f"Error writing JSON output: {e}") 387 | raise 388 | 389 | logging.info(f"Finished writing {len(recommendations)} recommendations to {output_file}.") 390 | 391 | # ----------------------------------------------------------------------------------- 392 | # Main Function 393 | # ----------------------------------------------------------------------------------- 394 | 395 | def main() -> None: 396 | """ 397 | Main entry point. 398 | Parses command-line arguments, extracts recommendations from the PDF, 399 | and writes the results to CSV, Excel, or JSON. 400 | """ 401 | parser = argparse.ArgumentParser(description="Extract and format recommendations from a CIS Benchmark PDF.") 402 | parser.add_argument("-i", "--input", required=True, type=Path, help="Input PDF file.") 403 | parser.add_argument("-o", "--output", type=Path, 404 | help="Output file (default: same as input file name with .csv, .xlsx, or .json).") 405 | parser.add_argument("-f", "--format", choices=['csv', 'excel', 'json'], default='excel', 406 | help="Output format (csv, excel, or json).") 407 | parser.add_argument("--start_page", type=int, default=10, 408 | help="Page number to start extraction (default: 10).") 409 | parser.add_argument("--log_level", type=str, default="INFO", 410 | choices=["DEBUG", "INFO", "WARNING", "ERROR"], 411 | help="Logging level (default: INFO).") 412 | args = parser.parse_args() 413 | 414 | # Configure logging level 415 | logging.getLogger().setLevel(args.log_level.upper()) 416 | 417 | # Prepare file paths 418 | input_file: Path = args.input 419 | output_format: str = args.format 420 | base_name: str = input_file.stem 421 | extension: str = "csv" if output_format == "csv" else ("xlsx" if output_format == "excel" else "json") 422 | output_file: Path = args.output if args.output else Path(generate_unique_filename(base_name, extension)) 423 | 424 | # Extract title and version from the PDF 425 | title, version = extract_title_and_version(input_file) 426 | 427 | # Read text from PDF, starting at the user-specified page 428 | pdf_text = read_pdf(input_file, start_page=args.start_page) 429 | 430 | # Extract recommendations from the raw PDF text 431 | recommendations = extract_recommendations(pdf_text) 432 | 433 | # Write the extracted data to CSV, Excel, or JSON 434 | write_output(recommendations, output_file, output_format, title, version) 435 | 436 | if __name__ == "__main__": 437 | main() -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | pdfplumber 2 | openpyxl 3 | tqdm 4 | --------------------------------------------------------------------------------