├── LICENSE
├── README.md
├── cis_benchmark_converter.py
└── requirements.txt


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 Maxime Beauchamp
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # CIS Benchmark Converter
 2 | 
 3 | **Author:** Octomany  
 4 | **LinkedIn:** [LinkedIn](https://www.linkedin.com/in/maxbeauchamp/)  
 5 | 
 6 | **Date Created:** 2024-11-06  
 7 | **Last Update:** 2025-03-14
 8 | 
 9 | ## Description
10 | 
11 | `CIS Benchmark Converter` is a Python script that extracts recommendations from CIS Benchmark PDF documents and exports them into CSV, Excel, or JSON formats. The script converts unstructured PDF content into a structured table, simplifying compliance reviews and audits.
12 | 
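As a rough illustration of the extraction step, the script recognises recommendation headings with a regular expression (`TITLE_PATTERN` in `cis_benchmark_converter.py`). The sketch below copies that pattern and splits a heading into number, level, and title. The helper name `split_title` and the sample headings are made up for illustration.

```python
import re
from typing import Optional, Tuple

# Pattern copied from cis_benchmark_converter.py:
# dotted number, optional "(L1)"-style level, then the rest of the line as title.
TITLE_PATTERN = re.compile(r'^(\d+\.\d+(?:\.\d+)*)\s*(\(L\d+\))?\s*(.*)')


def split_title(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (number, level, title) for a heading line, or None if it is not one."""
    match = TITLE_PATTERN.match(line.strip())
    if match is None:
        return None
    # The level group is optional, so fall back to an empty string.
    return match.group(1), match.group(2) or "", match.group(3)


print(split_title("1.1.1 (L1) Ensure MFA is enabled"))
print(split_title("2.3 Ensure logging is configured"))
print(split_title("Description:"))
```

Ordinary body text that starts with a dotted number can match this pattern too, which is why the script only accepts a heading when `Profile Applicability:` follows within a few lines.
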
13 | ## Features
14 | 
15 | - **Configurable Extraction:**  
16 |   Set the start page to skip tables of contents or disclaimers, and adjust the logging level via command-line options.
17 | 
18 | - **Multiple Output Formats:**  
19 |   Export the extracted data as CSV (using a pipe `|` delimiter), Excel (with styled headers, dropdowns, and conditional formatting), or JSON for easy data integration.
20 | 
21 | - **Robust and Maintainable:**  
22 |   Uses `pathlib` for modern file path management, extended type annotations for static type checking, enhanced exception handling, and a progress bar (via `tqdm`) for user feedback.
23 | 
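The CSV export is not a plain header-plus-rows file: judging from `write_output`, it starts with a title row, a version row, and a blank row before the column headers. A minimal sketch of reading it back, assuming that layout, could look like this. The function name `read_converter_csv` and the sample values are hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def read_converter_csv(stream: TextIO) -> Tuple[str, str, List[Dict[str, str]]]:
    """Parse the assumed layout: title row, version row, blank row, header row, data rows."""
    rows = list(csv.reader(stream, delimiter="|"))
    title = rows[0][0] if rows[0] else ""
    version = rows[1][0] if rows[1] else ""
    header = rows[3]
    # Pair each data row with the header names; skip any blank rows.
    records = [dict(zip(header, row)) for row in rows[4:] if row]
    return title, version, records


# Build a tiny sample in memory the same way the script writes it (made-up values).
sample = io.StringIO(newline="")
writer = csv.writer(sample, delimiter="|")
writer.writerow(["Example Benchmark"])
writer.writerow(["v1.0.0 - 01-01-2024"])
writer.writerow([])
writer.writerow(["Compliance Status", "Number", "Level", "Title"])
writer.writerow(["To Review", "1.1", "(L1)", "Ensure something is set"])
sample.seek(0)

title, version, records = read_converter_csv(sample)
print(title, version, records[0]["Number"])
```

For a real export, open the file with `open(path, newline="", encoding="utf-8")` and pass the handle to the function.
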
24 | ## Installation
25 | 
26 | 1. Clone this repository.
27 | 2. Install dependencies using the provided `requirements.txt`:
28 | 
29 |    ```bash
30 |    pip install -r requirements.txt
31 |    ```
32 | 
33 |    **requirements.txt:**
34 |    ```
35 |    pdfplumber
36 |    openpyxl
37 |    tqdm
38 |    ```
39 | 
40 | ## Usage
41 | 
42 | Run the script from the command line as follows:
43 | 
44 | ```bash
45 | python cis_benchmark_converter.py \
46 |   -i path/to/input_file.pdf \
47 |   -o path/to/output_file \
48 |   -f [csv|excel|json] \
49 |   --start_page 10 \
50 |   --log_level INFO
51 | ```
52 | 
53 | ### Arguments
54 | 
55 | - `-i, --input` : Path to the input CIS Benchmark PDF file (required).
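 56 | - `-o, --output` : Path to the output file. If omitted, the file is written to the current directory under the input file's base name with a `.csv`, `.xlsx`, or `.json` extension, and a numeric suffix such as `(1)` is added if that name already exists.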
57 | - `-f, --format` : Output file format: `csv`, `excel`, or `json` (default: `excel`).
58 | - `--start_page` : Page number to start extraction (default: 10).
59 | - `--log_level` : Logging level (`DEBUG`, `INFO`, `WARNING`, or `ERROR`; default: `INFO`).
60 | 
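When `-o` is omitted, the script derives the output name from the input file's stem and avoids overwriting an existing file (see `generate_unique_filename`). The sketch below reproduces that naming rule. The `directory` parameter and the name `unique_output_path` are additions for the demo; the script's own helper takes only a base name and an extension and resolves them against the current working directory.

```python
import tempfile
from pathlib import Path


def unique_output_path(directory: Path, base_name: str, extension: str) -> Path:
    """Return base_name.extension in directory, or base_name(1), base_name(2), ... if taken."""
    candidate = directory / f"{base_name}.{extension}"
    counter = 1
    while candidate.exists():
        candidate = directory / f"{base_name}({counter}).{extension}"
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    first = unique_output_path(folder, "CIS_AWS_Benchmark", "xlsx")
    first.touch()  # pretend a previous run already wrote this file
    second = unique_output_path(folder, "CIS_AWS_Benchmark", "xlsx")
    names = (first.name, second.name)

print(names)
```
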
61 | ## Example
62 | 
63 | ```bash
64 | python cis_benchmark_converter.py -i ./CIS_AWS_Benchmark.pdf -o ./CIS_AWS_Benchmark.json -f json --start_page 10 --log_level INFO
65 | ```
66 | 
 67 | For JSON output, the data is a single object with `document_title`, `document_version`, and `recommendations` keys; `recommendations` is a list of dictionaries, each representing one recommendation and its associated sections.
68 | 
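Based on `write_output` in `cis_benchmark_converter.py`, the JSON file holds a top-level object with `document_title`, `document_version`, and a `recommendations` list. The following sketch filters such a file by level. The sample document and the helper `titles_by_level` are made up for illustration.

```python
import json
from typing import List

# Hypothetical sample shaped like the script's JSON output (values are made up).
sample_json = """
{
    "document_title": "Example Benchmark",
    "document_version": "v1.0.0 - 01-01-2024",
    "recommendations": [
        {"Number": "1.1", "Level": "(L1)", "Title": "Ensure something is set",
         "Description": "Example text.", "Audit": "Example check."},
        {"Number": "1.2", "Level": "(L2)", "Title": "Ensure something else is set",
         "Description": "Example text."}
    ]
}
"""


def titles_by_level(data: dict, level: str) -> List[str]:
    """List 'Number Title' strings for recommendations at the given level, e.g. '(L1)'."""
    return [
        f"{rec['Number']} {rec['Title']}"
        for rec in data["recommendations"]
        if rec.get("Level") == level
    ]


data = json.loads(sample_json)
print(data["document_title"], titles_by_level(data, "(L1)"))
```

The level is stored with its parentheses, as captured from the heading. A section that was not found in the PDF is simply absent from that recommendation's dictionary, so `rec.get(...)` is safer than indexing for section keys.
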
69 | > **Note:** The script automatically excludes any sections labeled "CIS Controls" to focus solely on the core recommendations.
70 | 
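The exclusion works by cutting a section short as soon as a line starting with "CIS Controls" appears. The sketch below is a simplified version of the script's `extract_section` that shows the stopping rule only; it omits the page-number cleanup, and the name `collect_section` and the sample lines are hypothetical.

```python
import re
from typing import List, Tuple

# Copied from cis_benchmark_converter.py.
TITLE_PATTERN = re.compile(r'^(\d+\.\d+(?:\.\d+)*)\s*(\(L\d+\))?\s*(.*)')
SECTION_HEADERS = (
    "Profile Applicability:", "Description:", "Rationale:", "Impact:", "Audit:",
    "Remediation:", "Default Value:", "References:", "Additional Information:",
)


def collect_section(lines: List[str], start: int) -> Tuple[str, int]:
    """Join the lines after lines[start] until a header, a title, or 'CIS Controls'."""
    content: List[str] = []
    index = start + 1
    while index < len(lines):
        line = lines[index].strip()
        if (line.startswith(SECTION_HEADERS)
                or TITLE_PATTERN.match(line)
                or line.lower().startswith("cis controls")):
            break
        content.append(line)
        index += 1
    return " ".join(content).strip(), index


sample_lines = [
    "References:",
    "See the vendor hardening guide",
    "for details.",
    "CIS Controls:",
    "Controls Version Control",
]
content, next_index = collect_section(sample_lines, 0)
print(content, next_index)
```
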
71 | ## Acknowledgements
72 | 
73 | Special thanks to [Flavien Fouqueray (UnBonWhisky)](https://www.linkedin.com/in/ffouqueray/) for his valuable bug fixes and contributions in earlier versions of this script.
74 | 
75 | ## License
76 | 
77 | This project is licensed under the MIT License. Please respect the copyrights
78 | of the CIS Benchmark documents when using and sharing this tool.


--------------------------------------------------------------------------------
/cis_benchmark_converter.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: utf-8 -*-
  3 | """
  4 | File: cis_benchmark_converter.py
  5 | Author: Maxime Beauchamp
  6 | LinkedIn: https://www.linkedin.com/in/maxbeauchamp/
  7 | Created: 2024-11-06
  8 | Last Update: 2025-04-28
  9 | 
 10 | Description:
 11 |     This script extracts recommendations from CIS Benchmark PDF documents and exports
 12 |     them to CSV, Excel, or JSON format. The extraction starts at a configurable page number 
 13 |     (to skip table of contents or disclaimers). Logging level and other parameters 
 14 |     are configurable via command-line options.
 15 | 
 16 | Usage:
 17 |     python cis_benchmark_converter.py \
 18 |         -i path/to/input_file.pdf \
 19 |         -o path/to/output_file \
 20 |         -f [csv|excel|json] \
 21 |         --start_page 10 \
 22 |         --log_level INFO
 23 | 
 24 | Dependencies:
 25 |     - pdfplumber : for text extraction from PDF files.
 26 |     - openpyxl   : for creating and handling Excel files.
 27 |     - tqdm       : for displaying a progress bar.
 28 |     - logging    : part of the standard Python library.
 29 |     - pathlib    : part of the standard Python library.
 30 |     - json       : part of the standard Python library.
 31 | 
 32 | Installation:
 33 |     pip install pdfplumber openpyxl tqdm
 34 | 
 35 | License:
 36 |     This script is provided under the MIT License.
 37 |     Please respect the copyright of the CIS Benchmarks
 38 |     documents when using and sharing this script.
 39 | """
 40 | 
 41 | import argparse
 42 | import csv
 43 | import json
 44 | import re
 45 | import logging
 46 | from pathlib import Path
 47 | from typing import Tuple, List, Dict
 48 | 
 49 | import pdfplumber
 50 | from tqdm import tqdm
 51 | from openpyxl import Workbook
 52 | from openpyxl.styles import PatternFill, Font
 53 | from openpyxl.worksheet.datavalidation import DataValidation
 54 | from openpyxl.worksheet.table import Table, TableStyleInfo
 55 | from openpyxl.formatting.rule import FormulaRule
 56 | from openpyxl.utils import get_column_letter
 57 | 
 58 | # Disable warnings from pdfminer and pdfplumber
 59 | # to avoid cluttering the output with unnecessary messages.
 60 | logging.getLogger("pdfminer").setLevel(logging.ERROR)
 61 | logging.getLogger("pdfplumber").setLevel(logging.ERROR)
 62 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
 63 | 
 64 | # -----------------------------------------------------------------------------------
 65 | # Global Constants and Regular Expressions
 66 | # -----------------------------------------------------------------------------------
 67 | 
 68 | # Matches recommendation titles (e.g., "1.1.1 (L1) Title of Recommendation")
 69 | TITLE_PATTERN: re.Pattern = re.compile(r'^(\d+\.\d+(?:\.\d+)*)\s*(\(L\d+\))?\s*(.*)')
 70 | 
 71 | # Matches page number strings (e.g., "Page 123")
 72 | PAGE_NUMBER_PATTERN: re.Pattern = re.compile(r'\bPage\s+\d+\b', re.IGNORECASE)
 73 | 
 74 | # List of section headers to extract
 75 | SECTIONS_WITHOUT_CIS: List[str] = [
 76 |     'Profile Applicability:',
 77 |     'Description:',
 78 |     'Rationale:',
 79 |     'Impact:',
 80 |     'Audit:',
 81 |     'Remediation:',
 82 |     'Default Value:',
 83 |     'References:',
 84 |     'Additional Information:'
 85 | ]
 86 | 
 87 | # -----------------------------------------------------------------------------------
 88 | # Utility Functions
 89 | # -----------------------------------------------------------------------------------
 90 | 
 91 | def remove_page_numbers(text: str) -> str:
 92 |     """
 93 |     Remove mentions of page numbers (e.g. "Page 123") from the provided text.
 94 |     """
 95 |     return PAGE_NUMBER_PATTERN.sub('', text)
 96 | 
 97 | def generate_unique_filename(base_name: str, extension: str) -> str:
 98 |     """
 99 |     Generate a unique filename by appending a numeric suffix if the file already exists.
100 |     Uses pathlib for robust path handling.
101 |     """
102 |     file_path = Path(f"{base_name}.{extension}")
103 |     counter = 1
104 |     while file_path.exists():
105 |         file_path = Path(f"{base_name}({counter}).{extension}")
106 |         counter += 1
107 |     return str(file_path)
108 | 
109 | # -----------------------------------------------------------------------------------
110 | # PDF Extraction Functions
111 | # -----------------------------------------------------------------------------------
112 | 
113 | def extract_title_and_version(input_file: Path) -> Tuple[str, str]:
114 |     """
115 |     Extract the document title and version from the first page of the PDF.
116 |     
117 |     Returns:
118 |         (title, version) as strings. Version may be empty if no version line is found.
119 |     """
120 |     try:
121 |         with pdfplumber.open(str(input_file)) as pdf:
122 |             first_page_text = (pdf.pages[0].extract_text() or "").splitlines()
123 |     except Exception as e:
124 |         logging.error(f"Error opening PDF for title extraction: {e}")
125 |         raise
126 | 
127 |     title_lines: List[str] = []
128 |     version: str = ""
129 |     for line in first_page_text:
130 |         # Example: "v1.2 - 2024"
131 |         if line.lower().startswith("v") and "-" in line:
132 |             version = line.strip()
133 |             break
134 |         else:
135 |             title_lines.append(line.strip())
136 | 
137 |     title = " ".join(title_lines) if title_lines else "CIS Benchmark Document"
138 |     return title, version
139 | 
140 | def read_pdf(input_file: Path, start_page: int = 10) -> str:
141 |     """
142 |     Reads text from the PDF file starting at 'start_page'. 
143 |     Uses tqdm to display a progress bar for the pages processed.
144 | 
145 |     Raises:
146 |         ValueError: if 'start_page' is out of range.
147 |         Exception:  if reading fails for another reason.
148 | 
149 |     Returns:
150 |         A single string containing the concatenated text of the pages read.
151 |     """
152 |     logging.info(f"Reading PDF from page {start_page} onwards...")
153 | 
154 |     if start_page < 1:
155 |         raise ValueError("start_page must be >= 1.")
156 | 
157 |     try:
158 |         with pdfplumber.open(str(input_file)) as pdf:
159 |             total_pages = len(pdf.pages)
160 |             if start_page > total_pages:
161 |                 raise ValueError(f"Start page {start_page} exceeds total page count ({total_pages}).")
162 | 
163 |             # Extract text from each page starting at 'start_page'
164 |             text_pages = []
165 |             for page in tqdm(pdf.pages[start_page - 1:], 
166 |                              desc="Extracting pages", 
167 |                              unit="page", 
168 |                              total=(total_pages - start_page + 1)):
169 |                 page_text = page.extract_text() or ""
170 |                 text_pages.append(page_text)
171 | 
172 |     except Exception as e:
173 |         logging.error(f"Failed to read PDF: {e}")
174 |         raise
175 | 
176 |     # Filter out any empty strings and join with newlines
177 |     return "\n".join(filter(None, text_pages))
178 | 
179 | def find_profile_applicability(lines: List[str], start_index: int, max_depth: int = 10) -> bool:
180 |     """
181 |     Checks if 'Profile Applicability:' appears within 'max_depth' lines 
182 |     after 'start_index', indicating a valid recommendation start.
183 |     
184 |     Returns:
185 |         True if found, otherwise False.
186 |     """
187 |     for i in range(start_index + 1, min(start_index + max_depth + 1, len(lines))):
188 |         line: str = lines[i].strip()
189 |         if line.startswith("Profile Applicability:"):
190 |             return True
191 |         if TITLE_PATTERN.match(line) or any(line.startswith(sec) for sec in SECTIONS_WITHOUT_CIS):
192 |             return False
193 |     return False
194 | 
195 | def extract_section(lines: List[str], start_index: int, section_name: str) -> Tuple[str, int]:
196 |     """
197 |     Extract the content of a section until a new section header, a new recommendation title,
198 |     or a mention of "CIS Controls" is encountered.
199 |     
200 |     Returns:
201 |         (content, next_index) 
202 |         content     : the extracted text
203 |         next_index  : the position in 'lines' after extraction
204 |     """
205 |     content: List[str] = []
206 |     current_index: int = start_index + 1
207 | 
208 |     while current_index < len(lines):
209 |         line: str = lines[current_index].strip()
210 |         line = remove_page_numbers(line)
211 | 
212 |         # End of this section if:
213 |         #   - We reach another known section header
214 |         #   - We detect a new recommendation title
215 |         #   - We see "CIS Controls"
216 |         if any(line.startswith(sec) for sec in SECTIONS_WITHOUT_CIS) \
217 |            or TITLE_PATTERN.match(line) \
218 |            or line.lower().startswith("cis controls"):
219 |             break
220 | 
221 |         content.append(line)
222 |         current_index += 1
223 | 
224 |     return ' '.join(content).strip(), current_index
225 | 
226 | def extract_recommendations(full_text: str) -> List[Dict[str, str]]:
227 |     """
228 |     Parse the concatenated PDF text to extract recommendations.
229 | 
230 |     Returns:
231 |         A list of dictionaries, each representing a recommendation with keys like
232 |         "Number", "Level", "Title", and the extracted sections (Profile Applicability, etc.).
233 |     """
234 |     recommendations: List[Dict[str, str]] = []
235 |     lines: List[str] = full_text.splitlines()
236 |     current_recommendation: Dict[str, str] = {}
237 |     current_index: int = 0
238 | 
239 |     while current_index < len(lines):
240 |         line: str = lines[current_index].strip()
241 |         line = remove_page_numbers(line)
242 | 
243 |         # Detect a recommendation title line
244 |         title_match = TITLE_PATTERN.match(line)
245 |         if title_match:
246 |             # Check if next lines contain 'Profile Applicability:' => indicates a valid rec
247 |             if find_profile_applicability(lines, current_index):
248 |                 # Save previous recommendation if it exists
249 |                 if current_recommendation:
250 |                     recommendations.append(current_recommendation)
251 |                 current_recommendation = {
252 |                     'Number': title_match.group(1),
253 |                     'Level': title_match.group(2) or '',
254 |                     'Title': title_match.group(3),
255 |                 }
256 |                 # Capture multi-line titles
257 |                 while (current_index + 1 < len(lines)
258 |                        and not any(lines[current_index + 1].strip().startswith(sec) for sec in SECTIONS_WITHOUT_CIS)
259 |                        and not TITLE_PATTERN.match(lines[current_index + 1].strip())):
260 |                     current_index += 1
261 |                     current_recommendation['Title'] += " " + lines[current_index].strip()
262 | 
263 |         # Extract standard sections; skip headers seen before the first recommendation
264 |         for sec in SECTIONS_WITHOUT_CIS:
265 |             if current_recommendation and line.startswith(sec):
266 |                 content, next_index = extract_section(lines, current_index, sec)
267 |                 # e.g. "Additional Information:" -> key = "Additional Information"
268 |                 current_recommendation[sec[:-1]] = content
269 |                 current_index = next_index - 1
270 |                 break
271 | 
272 |         current_index += 1
273 | 
274 |     # Add the last recommendation if any
275 |     if current_recommendation:
276 |         recommendations.append(current_recommendation)
277 | 
278 |     # Remove duplicates based on (Number, Title) in case of accidental repeats
279 |     unique_recommendations = {(rec['Number'], rec['Title']): rec for rec in recommendations}
280 |     return list(unique_recommendations.values())
281 | 
282 | # -----------------------------------------------------------------------------------
283 | # Output Generation (CSV/Excel/JSON)
284 | # -----------------------------------------------------------------------------------
285 | 
286 | def write_output(
287 |     recommendations: List[Dict[str, str]],
288 |     output_file: Path,
289 |     output_format: str,
290 |     title: str,
291 |     version: str
292 | ) -> None:
293 |     """
294 |     Writes the extracted recommendations to CSV, Excel, or JSON format.
295 |     
296 |     Args:
297 |         recommendations : List of recommendation dicts.
298 |         output_file     : Output file path.
299 |         output_format   : "csv", "excel", or "json".
300 |         title           : Document title (extracted from PDF).
301 |         version         : Document version (extracted from PDF).
302 |     """
303 |     logging.info(f"Writing output to {output_file} in {output_format.upper()} format...")
304 |     headers: List[str] = ['Compliance Status', 'Number', 'Level', 'Title']
305 |     headers += [sec[:-1] for sec in SECTIONS_WITHOUT_CIS]  # remove trailing colon
306 | 
307 |     if output_format == 'csv':
308 |         try:
309 |             with output_file.open(mode='w', newline='', encoding='utf-8') as file:
310 |                 writer = csv.writer(file, delimiter='|')
311 |                 writer.writerow([title if title else "CIS Benchmark Document"])
312 |                 writer.writerow([version if version else ""])
313 |                 writer.writerow([])  # Empty line for spacing
314 |                 writer.writerow(headers)
315 |                 for recommendation in recommendations:
316 |                     recommendation['Compliance Status'] = 'To Review'
317 |                     row = [recommendation.get(header, '') for header in headers]
318 |                     writer.writerow(row)
319 |         except Exception as e:
320 |             logging.error(f"Error writing CSV output: {e}")
321 |             raise
322 | 
323 |     elif output_format == 'excel':
324 |         workbook = Workbook()
325 |         sheet = workbook.active
326 |         sheet.title = "Recommendations"
327 |         sheet["A1"] = title if title else "CIS Benchmark Document"
328 |         sheet["A1"].font = Font(size=14, bold=True)
329 |         sheet["A2"] = version if version else ""
330 |         sheet["A2"].font = Font(size=12, italic=True)
331 |         sheet.append([""] * len(headers))
332 |         sheet.append(headers)
333 |         for recommendation in recommendations:
334 |             recommendation['Compliance Status'] = 'To Review'
335 |             row = [recommendation.get(header, '') for header in headers]
336 |             sheet.append(row)
337 |         dv = DataValidation(type="list", formula1='"Compliant,Non-Compliant,To Review"', showDropDown=False)
338 |         sheet.add_data_validation(dv)
339 |         start_row = 5
340 |         end_row = len(recommendations) + start_row
341 |         for row_idx in range(start_row, end_row):
342 |             dv.add(sheet[f"A{row_idx}"])
343 |         compliant_fill = PatternFill(start_color="C6EFCE", end_color="C6EFCE", fill_type="solid")
344 |         non_compliant_fill = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")
345 |         to_review_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")
346 |         compliant_rule = FormulaRule(formula=[f'$A{start_row}="Compliant"'], fill=compliant_fill)
347 |         non_compliant_rule = FormulaRule(formula=[f'$A{start_row}="Non-Compliant"'], fill=non_compliant_fill)
348 |         to_review_rule = FormulaRule(formula=[f'$A{start_row}="To Review"'], fill=to_review_fill)
349 |         sheet.conditional_formatting.add(f"A{start_row}:A{end_row}", compliant_rule)
350 |         sheet.conditional_formatting.add(f"A{start_row}:A{end_row}", non_compliant_rule)
351 |         sheet.conditional_formatting.add(f"A{start_row}:A{end_row}", to_review_rule)
352 |         last_column = get_column_letter(len(headers))
353 |         table_range = f"A4:{last_column}{end_row - 1}"
354 |         table = Table(displayName="CISRecommendations", ref=table_range)
355 |         style = TableStyleInfo(name="TableStyleMedium9",
356 |                                showFirstColumn=False,
357 |                                showLastColumn=False,
358 |                                showRowStripes=True,
359 |                                showColumnStripes=True)
360 |         table.tableStyleInfo = style
361 |         sheet.add_table(table)
362 |         sheet.column_dimensions['A'].width = 10
363 |         sheet.column_dimensions['B'].width = 8
364 |         sheet.column_dimensions['C'].width = 8
365 |         sheet.column_dimensions['D'].width = 50
366 |         for col in range(5, len(headers) + 1):
367 |             col_letter = get_column_letter(col)
368 |             sheet.column_dimensions[col_letter].width = 10
369 |         try:
370 |             workbook.save(str(output_file))
371 |         except Exception as e:
372 |             logging.error(f"Error saving Excel file: {e}")
373 |             raise
374 | 
375 |     elif output_format == 'json':
376 |         try:
377 |             # Create a JSON object with document information and recommendations
378 |             data = {
379 |                 "document_title": title if title else "CIS Benchmark Document",
380 |                 "document_version": version,
381 |                 "recommendations": recommendations
382 |             }
383 |             with output_file.open("w", encoding="utf-8") as f:
384 |                 json.dump(data, f, ensure_ascii=False, indent=4)
385 |         except Exception as e:
386 |             logging.error(f"Error writing JSON output: {e}")
387 |             raise
388 | 
389 |     logging.info(f"Finished writing {len(recommendations)} recommendations to {output_file}.")
390 | 
391 | # -----------------------------------------------------------------------------------
392 | # Main Function
393 | # -----------------------------------------------------------------------------------
394 | 
395 | def main() -> None:
396 |     """
397 |     Main entry point.
398 |     Parses command-line arguments, extracts recommendations from the PDF,
399 |     and writes the results to CSV, Excel, or JSON.
400 |     """
401 |     parser = argparse.ArgumentParser(description="Extract and format recommendations from a CIS Benchmark PDF.")
402 |     parser.add_argument("-i", "--input", required=True, type=Path, help="Input PDF file.")
403 |     parser.add_argument("-o", "--output", type=Path,
404 |                         help="Output file (default: same as input file name with .csv, .xlsx, or .json).")
405 |     parser.add_argument("-f", "--format", choices=['csv', 'excel', 'json'], default='excel',
406 |                         help="Output format (csv, excel, or json).")
407 |     parser.add_argument("--start_page", type=int, default=10,
408 |                         help="Page number to start extraction (default: 10).")
409 |     parser.add_argument("--log_level", type=str, default="INFO",
410 |                         choices=["DEBUG", "INFO", "WARNING", "ERROR"],
411 |                         help="Logging level (default: INFO).")
412 |     args = parser.parse_args()
413 | 
414 |     # Configure logging level
415 |     logging.getLogger().setLevel(args.log_level.upper())
416 | 
417 |     # Prepare file paths
418 |     input_file: Path = args.input
419 |     output_format: str = args.format
420 |     base_name: str = input_file.stem
421 |     extension: str = "csv" if output_format == "csv" else ("xlsx" if output_format == "excel" else "json")
422 |     output_file: Path = args.output if args.output else Path(generate_unique_filename(base_name, extension))
423 | 
424 |     # Extract title and version from the PDF
425 |     title, version = extract_title_and_version(input_file)
426 | 
427 |     # Read text from PDF, starting at the user-specified page
428 |     pdf_text = read_pdf(input_file, start_page=args.start_page)
429 | 
430 |     # Extract recommendations from the raw PDF text
431 |     recommendations = extract_recommendations(pdf_text)
432 | 
433 |     # Write the extracted data to CSV, Excel, or JSON
434 |     write_output(recommendations, output_file, output_format, title, version)
435 | 
436 | if __name__ == "__main__":
437 |     main()


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | pdfplumber
2 | openpyxl
3 | tqdm
4 | 


--------------------------------------------------------------------------------