├── .idea
    └── workspace.xml
├── README.md
├── Test File
    ├── Copy of each version
    │   ├── PDF versions test - v1.pdf
    │   ├── PDF versions test - v2.pdf
    │   ├── PDF versions test - v3.pdf
    │   └── PDF versions test - v4.pdf
    └── PDF versions test.pdf
├── hash_all_images.sh
├── pdf-metadata.sh
├── pdf-processing.sh
├── pdf-triage.sh
└── recursive-pdf-processing.sh


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# recursive-pdf-processing.sh
Script to process a single PDF file, all PDF files in a folder, or all PDF files in a folder and its subfolders (recursively).

The script is still being worked on. I need to add an output option. Currently, it writes its output to a new directory ("report_DD-MM-YYYYTHH:mm:ss") created in the folder from which the script is run. This can result in an unintended loop if the output folder resides within the folder you have chosen to process recursively. To avoid this, navigate to the location where you want the output folder to be created (its name is unique because it includes the timestamp), and use -d to select a directory located elsewhere. The ultimate purpose of this script is to let you mount an image file with the image mounter of your choice and point the script somewhere within that hierarchy (e.g., a user folder) to parse all PDFs within it.

This script was created out of the necessity to extract prior versions from multiple PDFs. Rather than running the other script (pdf-processing.sh) multiple times, pointing to a new file each time, this script lets me process those PDFs in a single command.
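
Until an output option exists, a quick check before launching can catch the loop described above. The following is only a sketch and is not part of the script: `is_inside` is a hypothetical helper name, the target path is just the example directory used in this README, and it assumes GNU `realpath` (with its `-m` option) is available, as it normally is on Kali.

```bash
#!/bin/bash
# Hypothetical helper (not part of recursive-pdf-processing.sh):
# returns 0 if directory $2 is the same as, or nested inside, directory $1.
is_inside() {
    local parent child
    # -m resolves the path even if parts of it do not exist (GNU realpath assumed)
    parent="$(realpath -m -- "$1")" || return 2
    child="$(realpath -m -- "$2")" || return 2
    [ "$parent" = "/" ] && return 0   # everything lives under the root folder
    case "$child/" in
        "$parent"/*) return 0 ;;      # child is the parent itself or below it
        *) return 1 ;;
    esac
}

target="/mnt/n/my_pdfs_are_here"      # the directory you plan to pass to -d
if is_inside "$target" "$PWD"; then
    echo "The current folder is inside $target; run the script from elsewhere." >&2
fi
```

If the warning prints, the report folder would be created inside the tree being processed, so change to another folder before running the script.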

If you are going to use this script, keep in mind that it is still being tested. I hope to add an output command line parameter in the coming days, and I'll update this README at that time with examples of the syntax you can use.

For now, an example would be to navigate to your Documents folder on your C drive (with Kali WSL, for example: cd /mnt/c/Users/{username}/Documents) and run the following (assuming the script is in your home folder and is executable):

~/recursive-pdf-processing.sh -d /mnt/n/my_pdfs_are_here -r -p True

The above uses the following switches:
-d to specify the root directory to start processing. /mnt/n/my_pdfs_are_here
-r means to recursively parse
-p True tells the script to parse prior versions if any are found. If you don't specify -p, it will prompt you when it finds a prior version. This wasn't a big deal when processing a single PDF, as you would be at the keyboard while it was processing. But if you are recursively processing 1000 PDFs, for example, you don't want to have to answer whether or not to process prior versions each time it encounters one. So you can specify -p True (automatically process prior versions) or -p False (don't process prior versions). Anything else will result in being prompted for each.

You can still use the -f option, as with the other script, to parse a single PDF. Once this script is properly tested, the other will be retired.

Currently, you can't specify an output folder. The script creates one under the folder from which it is run. But I will be adding -o {output_folder} soon. That way you could, if you preferred, navigate to where you want to start processing PDFs and do the following:
~/recursive-pdf-processing.sh -d . -r -p True -o /mnt/c/Users/{username}/Desktop
The above will (once I add the -o option) process all PDFs from your current folder (because you navigated to where you want to start), as denoted by the period ".".
It will also parse subfolders recursively and process prior versions.

I will be updating pdf-triage and pdf-metadata to also allow you to specify the folder to process and where to save the output. I may even incorporate those into this main script so that you only need to run one script to get everything, or select certain processing options.

# pdf-processing.sh
Script to process a PDF file

Command line options:

-f # this option is required. You must provide the filename of the PDF using the -f switch.
-v # this option prints the version and exits
-p true/false

The -p switch allows you to tell the script not to attempt to extract prior versions. This is most likely going to be used if you extract prior versions of a PDF and then want to run the script against those versions. You won't need to re-extract the prior versions of those earlier versions, as that would be redundant. You can set this to false so that the script skips that part. If you do not specify this flag and prior versions exist, the script will alert you to that and ask if you want to recover them.

Tested on Kali Linux 2023.1 and Kali Linux on WSL.

If running on Kali Linux WSL, you will need to run the following:

```
sudo apt update
sudo apt upgrade
sudo apt install exiftool
sudo apt install xpdf
sudo apt install pdf-parser
sudo apt install poppler-utils
sudo apt install pdfid
```
The script will execute the following processes against the PDF:
1. pdfinfo
2. exiftool
3. pdfimages
4. pdfsig
5. pdfid
6. pdf-parser
7. pdffonts
8. pdfdetach

The script will also attempt to carve out prior versions of the PDF by looking for %%EOF markers in the PDF. When you edit a PDF and the editor saves the change as an incremental update, the edits are added after the existing %%EOF, and a new %%EOF is added at the new end of the file. This means there is an opportunity to extract prior versions of a PDF.
This is not guaranteed, as several factors can cause a prior version to be an invalid PDF. It depends on whether the tool used to edit the PDF complies with the PDF standard, and on whether some compressing (cleaning up) was done that removed an earlier edit (in which case the %%EOF may remain, at least letting you know that a prior version existed).

# pdf-triage.sh
Note:
For the pdf-triage.sh script, you only need exiftool and xpdf from the above, plus the "file" command. If you've already installed exiftool and xpdf for the above script, you only need to install the file command here.
```
sudo apt install exiftool
sudo apt install xpdf
sudo apt install file
```
# pdf-metadata.sh
The pdf-metadata.sh script is really just packaging the exiftool command for the convenience of those who are unfamiliar with exiftool and its switches. You can run the exiftool command from the script on its own and get the same results (of course, you need to provide a proper output file in that case rather than the variable name used in the script).

For the pdf-metadata.sh script, you only need exiftool installed. If you already installed it for either of the above two scripts, you don't need to install it again here.
```
sudo apt install exiftool
```
# hash_all_images.sh
This script is useful to extract the images from the PDFs in a folder, hash them, and save the hashes to a text file. Images are deleted after they are hashed, as the purpose of this script is not to extract all images and preserve them. The other scripts are used for that. This is a way to look for embedded images from various PDFs that share the same hash value.

Images in different PDFs with the same hash can be normal (e.g., an image of a stamp or signature that is added to a document).

But if several PDFs supposedly carry a physical stamp that was applied to a printed document which was then scanned to PDF, the stamp images would not have the same hash value. If they do, that contradicts the claim that they are unique scans.

The script could be modified to use the find command, which searches recursively, instead of a simple "ls" command. Change the following
$(ls *.pdf)
to
$(find . -iname "*.pdf")

Doing the above would run a recursive search for PDF files from the location where you run the script. So you'd need to navigate to the correct path before running it.

Also, if you want to retain images as well as hash them, you could comment out (or delete) the two "rm -f" commands.

In a future release, I might add command line arguments to allow you to select that without needing to edit the script. But for now, this does what I need (minimum viable product).


--------------------------------------------------------------------------------
/Test File/Copy of each version/PDF versions test - v1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/Copy of each version/PDF versions test - v1.pdf


--------------------------------------------------------------------------------
/Test File/Copy of each version/PDF versions test - v2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/Copy of each version/PDF versions test - v2.pdf


--------------------------------------------------------------------------------
/Test File/Copy of each version/PDF versions test - v3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/Copy of each version/PDF versions test - v3.pdf


--------------------------------------------------------------------------------
/Test File/Copy of each version/PDF versions test - v4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/Copy of each version/PDF versions test - v4.pdf


--------------------------------------------------------------------------------
/Test File/PDF versions test.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/PDF versions test.pdf


--------------------------------------------------------------------------------
/hash_all_images.sh:
--------------------------------------------------------------------------------
OIFS="$IFS" # Original Field Separator
IFS=$'\n' # New field separator = new line
ORIGINAL_GLOB=$(shopt -p nocaseglob)
shopt -s nocaseglob # to make the *.pdf match case insensitive

hashfile="hashes($(date -u +%a_%d-%b-%y_%kh%Mm%Ss_UTC)).csv" # create the hash file using the UTC date to make it unique

for pdf in $(ls *.pdf); do # for all PDFs in this folder
    echo "extracting images from $pdf"
    images=$(pdfimages "$pdf" "$pdf-image" -all -print-filenames)
    for image in $images; do # for all images within the current PDF in the parent FOR loop
        echo "hashing $image" # hash the file
        md5sum "$image" >>"$hashfile" # append the hash to the hash file
        echo "Deleting $image" # clean up after itself
        rm -f "$image" # deleting the image
        param_file="${image%.*}"
        param_file="${param_file##*/}"
        rm -f "$param_file.params" # deleting the associated .params file.
    done
done

# reset values
IFS="$OIFS"
$ORIGINAL_GLOB


--------------------------------------------------------------------------------
/pdf-metadata.sh:
--------------------------------------------------------------------------------
# Written by Jacques Boucher
# jjrboucher@gmail.com
#
# Triage script to review multiple PDFs in folders/subfolders and extract
# the metadata from each and output to a CSV file.

output_file="pdf-metadata.csv" # name of the file where results are saved.

if test -f "$output_file"; then
    echo "$output_file already exists. Rename or move and re-run the script."
    exit
fi

exiftool -a -G1 -s -ee -csv -r . -ext pdf >>"$output_file" # append results to csv file in current directory

echo "Results in $output_file."


--------------------------------------------------------------------------------
/pdf-processing.sh:
--------------------------------------------------------------------------------
#!/bin/bash
###########################
# Written by Jacques Boucher
# jboucher@unicef.org

scriptVersion="8 March 2024"

# Tested on Kali Linux 2023.1 and Kali Linux on WSL.
##############################
# Installing required binaries
##############################
# If running on Kali Linux WSL, you will need to run the following:
# sudo apt update
# sudo apt upgrade
# sudo apt install exiftool
# sudo apt install xpdf
# sudo apt install pdf-parser
# sudo apt install poppler-utils
# sudo apt install pdfid

#################
# Troubleshooting
#################
# If the script gives you a warning that one of the above binaries is missing,
# you can search which package you need to install as follows:
# sudo apt search pdfinfo
# The above would search the packages and return that it's part of poppler-utils.
# You would then install that package with the command: sudo apt install poppler-utils.
#
# As best as the author could test, the installation of the required binaries above should be all that's needed.

####################
# Processing summary
####################
# This script will run the following parsing tools against a PDF that you provide as a command line argument.
# The script will check if a command is present. If it is not, it will note same in the log, alert you on screen, and skip that processing.
#
# 1 - pdfinfo
# 2 - exiftool
# 3 - pdfimages
# 4 - pdfsig
# 5 - pdfid
# 6 - pdf-parser
# 7 - pdffonts
# 8 - pdfdetach
#
# The script will also extract versions of the PDF using the grep and dd commands, looking for the %%EOF string in the PDF.
# Each time you edit a PDF within Adobe, it adds the edits after the existing %%EOF and appends a new %%EOF at the new end of the file.
#
######################
# Command line options
######################
# -f PDF being processed (required option)
# -v Prints the version # of the script and exits.
# -p TRUE/FALSE optional switch if you want to process prior versions of a PDF if present. If you don't include this option on the command line and prior versions are detected,
#    the script will prompt you, letting you know it found prior versions and ask if you want to process them.
#    This option is especially practical if you extracted prior versions of a PDF, and now want to run the script against one of those PDFs.
#    In that scenario, you likely don't need to extract prior versions yet again. So you can use the option -p FALSE.
#
# Exit Codes
commandsExecuted=0 # if no commands are missing, will have an exit code of 0. If any commands are missing, it will add 10**(command #) to the exit code.
# example, if pdfinfo is missing, it will add 10**1, or 10 to the exit value.
# if pdfsig is missing, it will add 10**4, or 10000 to the exit value.
# this sort of mimics bit-wise values in that an exit value of 1010 means commands #1 and #3 did not run.
# thus an exit value of 0 means all commands were executed.
# note: the shell only reports exit statuses 0-255, so larger totals wrap around (modulo 256).
missingArg=1 # missing an argument after the switch
tooManyArgs=2 # too many arguments
invalidSwitch=3 # invalid switch
missingSwitch=4 # switch not provided
invalidSyntax=5 # invalid syntax
fileDoesNotExist=6 # file does not exist
emptyFile=7 # 0 byte file provided as argument
badpdf=8 # if a bad PDF is passed to the script and the user chooses to exit without processing it.

# Other variables
investigator=""
caseNumber=""
filename="" # initialize filename to process to blank
filenamenoext="" # filename without the extension
extension="" # extension for the filename
newfile="" # variable to hold new filename when parsing through PDF for versions.
allimages="" # variable for file name where all image hashes are saved for easy comparison.
v=1 # counter for versions of a PDF (applicable when a PDF has been edited with Adobe).
offsets=() # array variable to hold offsets of the %%EOF markers in a pdf denoting the end of each version.
versions=1 # variable to hold # of versions found in a PDF (i.e., number of %%EOF markers).
switch="" # initialize command line switch to blank
RED='\033[0;31m' # red font
YELLOW='\033[0;1;33m' # yellow font
GREEN='\033[32;1;1m' # green font
NOCOLOUR='\033[0;m' # no colour
usage="Usage: $0 {-p true/false} -f <filename>\nor: $0 -v"
priorVersion="" # This variable is the flag for deciding if the script should attempt to extract prior versions.


hashMark() { # function to write section header to the log file.
    echo -e "######################### $1 #############################" >>"$logfile"
}

blankLine() { # inserts a blank line in the log file and on screen.
    echo ""
    echo -e "" >>"$logfile"
}

commandNotFound () { #command not found. Logging same.
    echo ""
    echo -e "################ ${RED}WARNING!${NOCOLOUR} ################"
    echo -e "${RED}$1 ${YELLOW}not found!${NOCOLOUR} Skipping this step."
    echo -e "##########################################"
    echo ""
    blankLine
    echo "######## WARNING! ########" >> "$logfile"
    echo "$1 not found. Skipping this step." >> "$logfile"
    echo "##########################" >> "$logfile"
    blankLine
}

pdfImages() {
    #pdfimages

    which pdfimages >/dev/null #checks for command
    if [ $? -eq 0 ] # exit status 0 = command found
    then
        hashMark "$2 - $(pdfimages -v 2>&1 | head -n1)"
        echo "Extracting images from $1."
        echo "Extracting images from $1." \
            >>"$logfile"

        blankLine
        pdfimagesRoot="$1-pdfimages"
        echo "executing: pdfimages -all \"$1\" \"$pdfimagesRoot\"" | tee -a "$logfile"
        echo "Which will extract the following images:" | tee -a "$logfile"
        pdfimages -list "$1" | tee -a "$logfile"
        pdfimages -all "$1" "$pdfimagesRoot"
        pdfimages -png "$1" "$pdfimagesRoot-png-format" # also export as PNG for ease of viewing, as .ccitt is not easily viewable
        echo -e "\npdfimages finished execution at $(date).\nExtracted images saved to '$pdfimagesRoot-###.{extension}'.">>"$logfile"
        echo "executing: sha256sum \"$pdfimagesRoot\"-*.*" | tee -a "$logfile"
        sha256sum "$pdfimagesRoot"-*.* | tee -a "$allimages-unsorted.txt" >> "$logfile"
        blankLine
    else
        commandNotFound "pdfimages"
        commandsExecuted=$((commandsExecuted+1000)) # command #3 missing
    fi
}

checkPDF() {
    testPDF="$(pdfinfo "$1" 2>/dev/null)"
    if [ "$testPDF" == "" ]
    then
        pdfValidation="False"
    else
        pdfValidation="True"
    fi
}

while getopts ":f:p:v" opt; do

    case $opt in
        f)
            switch=$opt
            filename="$OPTARG"
            ;;
        v)
            echo "$0 version: $scriptVersion"
            exit
            ;;
        p)
            switch=$opt
            priorVersion=$(echo "$OPTARG" | tr '[:upper:]' '[:lower:]')
            ;;
        :)
            echo "You must supply an argument to -$OPTARG">&2
            echo -e "$usage"
            exit $missingArg
            ;;

        \?)
            if [ $# -gt 2 ]; then
                echo "Too many arguments."
                echo -e "$usage"
                exit $tooManyArgs
            fi
            echo "Invalid switch."
            echo -e "$usage"
            exit $invalidSwitch
            ;;
    esac
done

if [ -z "$switch" ]; then #switch is still blank, thus not provided
    echo "Did not provide the required switch."
    echo -e "$usage"
    exit $missingSwitch
elif [ -z "$filename" ]; then #filename is still blank, thus invalid syntax
    echo "Invalid syntax."
    echo -e "$usage"
    exit $invalidSyntax
elif [ ! -f "$filename" ]; then
    echo "File does not exist."
    echo -e "$usage"
    exit $fileDoesNotExist
elif [ ! -s "$filename" ]; then
    echo "The file $filename is a 0 byte file."
    echo "Nothing to process."
    echo -e "$usage"
    exit $emptyFile
fi

checkPDF "$filename"

if [ "$pdfValidation" == "False" ]
then
    echo -e "${YELLOW}Warning!${NOCOLOUR}\nThe PDF $filename does not appear to be a valid PDF."
    read -p "Do you still wish to proceed (y/n)? " continue
    if [ "$continue" == "n" ]
    then
        exit $badpdf
    fi
fi

logfile="$filename.log"
allimages="$filename-hashes of all images"

hashMark "Tombstone Info"
read -p "Investigator: " investigator
read -p "Case number: " caseNumber

echo "Executed by user $(whoami) at $(date)." >> "$logfile"
echo "Investigator: $investigator" >> "$logfile"
echo "Case number: $caseNumber" >> "$logfile"
echo "Processing: $filename" >> "$logfile"
echo "sha256 hash: $(sha256sum "$filename" | cut -d " " -f1)" >> "$logfile"
echo "Current folder: $(pwd)" >> "$logfile"
echo "Script version: $scriptVersion" >> "$logfile"
blankLine

# pdfinfo

which pdfinfo >/dev/null #checks for command
if [ $? -eq 0 ] # exit status 0 = command found
then
    pdfInfoFile="$filename-pdfinfo.txt"
    hashMark "1 - $(pdfinfo -v 2>&1 | head -n 1)"
    echo "executing pdfinfo \"$filename\"" | tee -a "$logfile"
    pdfinfo "$filename">"$pdfInfoFile"
    echo -e "pdfinfo finished execution at $(date).\nResults written to '$pdfInfoFile'." \
        >> "$logfile"
    echo "executing: sha256sum \"$pdfInfoFile\"" | tee -a "$logfile"
    sha256sum "$pdfInfoFile" >> "$logfile"

    blankLine
else
    commandNotFound "pdfinfo"
    commandsExecuted=$((commandsExecuted+10)) # command #1 missing
fi

# exiftool

which exiftool >/dev/null #checks for command
if [ $? -eq 0 ] # exit status 0 = command found
then
    exifFile="$filename-exif.csv"
    hashMark "2 - exiftool version $(exiftool -ver)"
    echo "executing: exiftool -a -G1 -s -ee -csv \"$filename\" > \"$exifFile\"" | tee -a "$logfile"
    exiftool -a -G1 -s -ee -csv "$filename">"$exifFile"
    echo -e "exiftool finished execution at $(date).\nResults written to '$exifFile'.">>"$logfile"
    echo "executing: sha256sum \"$exifFile\"" | tee -a "$logfile"
    sha256sum "$exifFile" >> "$logfile"
    blankLine
else
    commandNotFound "exiftool"
    commandsExecuted=$((commandsExecuted+100)) # command #2 missing
fi

#pdfimages
pdfImages "$filename" "3"

#pdfsig

which pdfsig >/dev/null #checks for command
if [ $? -eq 0 ] # exit status 0 = command found
then
    hashMark "4 - $(pdfsig -v 2>&1 | head -n1)"
    echo "executing: pdfsig -nocert -dump \"$filename\" >>\"$logfile\"" | tee -a "$logfile"
    pdfsig -nocert -dump "$filename" >>"$logfile"
    echo -e "pdfsig finished execution at $(date).\nResults appended to '$logfile'.">>"$logfile"
    echo -e "Signature(s), if present, is/are dumped to the current folder, $(pwd).">>"$logfile"
    blankLine
else
    commandNotFound "pdfsig"
    commandsExecuted=$((commandsExecuted+10000)) # command #4 missing
fi

#pdfid

which pdfid >/dev/null #checks for command
if [ $? \
    -eq 0 ] # exit status 0 = command found
then
    hashMark "5 - pdfid version $(pdfid --version | cut -d " " -f2)"
    pdfidFilename="$filename.pdfid.txt"
    echo "executing: pdfid -l \"$filename\">\"$pdfidFilename\"" | tee -a "$logfile"
    pdfid -l "$filename">"$pdfidFilename"
    echo -e "pdfid finished execution at $(date).\nResults written to '$pdfidFilename'.">>"$logfile"
    echo "executing: sha256sum \"$pdfidFilename\"" | tee -a "$logfile"
    sha256sum "$pdfidFilename" >> "$logfile"
    blankLine
else
    commandNotFound "pdfid"
    commandsExecuted=$((commandsExecuted+100000)) # command #5 missing
fi

#pdfparser

which pdf-parser >/dev/null #checks for command
if [ $? -eq 0 ] # exit status 0 = command found
then
    hashMark "6 - pdf-parser version $(pdf-parser --version | grep "pdf-parser" | cut -d " " -f2)"
    pdfparserFilename="$filename.pdfparser.txt"
    echo "executing: pdf-parser \"$filename\">\"$pdfparserFilename\"" | tee -a "$logfile"
    pdf-parser "$filename">"$pdfparserFilename"
    echo -e "pdf-parser finished execution at $(date).\nResults written to '$pdfparserFilename'.">>"$logfile"
    echo "executing: sha256sum \"$pdfparserFilename\"" | tee -a "$logfile"
    sha256sum "$pdfparserFilename" >> "$logfile"
    blankLine
else
    commandNotFound "pdf-parser"
    commandsExecuted=$((commandsExecuted+1000000)) # command #6 missing
fi

#pdffonts

which pdffonts >/dev/null #checks for command
if [ $? \
    -eq 0 ] # exit status 0 = command found
then
    hashMark "7 - $(pdffonts -v 2>&1 | head -n1)"
    pdffontsFilename="$filename.pdffonts.txt"
    echo "executing: pdffonts \"$filename\">\"$pdffontsFilename\" 2>&1" | tee -a "$logfile"
    pdffonts "$filename" >"$pdffontsFilename" 2>&1 # errors go to the same file as the results
    echo -e "pdffonts finished execution at $(date).\nResults written to '$pdffontsFilename'.">>"$logfile"
    echo "executing: sha256sum \"$pdffontsFilename\"" | tee -a "$logfile"
    sha256sum "$pdffontsFilename" >> "$logfile"
    blankLine
else
    commandNotFound "pdffonts"
    commandsExecuted=$((commandsExecuted+10000000)) # command #7 missing
fi

#pdfdetach

which pdfdetach >/dev/null #checks for command
if [ $? -eq 0 ] # exit status 0 = command found
then
    hashMark "8 - $(pdfdetach -v 2>&1 | head -n1)"
    echo "executing: pdfdetach -saveall \"$filename\"" | tee -a "$logfile"
    echo "Which will extract the following files (if applicable):" | tee -a "$logfile"
    embeddedItemsCount=$(pdfdetach -list "$filename"|wc -l)
    embeddedItemsCount=$((embeddedItemsCount-1))
    pdfdetach -list "$filename" | tee -a "$logfile"
    pdfdetach -saveall "$filename"
    echo -e "pdfdetach finished execution at $(date).">>"$logfile"

    if [[ "$(pdfdetach -list "$filename")" != "0 embedded files" && "$(pdfdetach -list "$filename")" != "" ]]
    # if there are no embedded files and you do the sha256sum command, it waits for input. This avoids hanging the script in such cases.
    then
        pdfdetach -list "$filename"|tail -n $embeddedItemsCount | cut -d: -f2
        # hash each extracted attachment, reading one file name per line
        pdfdetach -list "$filename" 2>/dev/null | tail -n "$embeddedItemsCount" | cut -d: -f2 | while IFS= read -r f; do
            f="${f#"${f%%[![:space:]]*}"}" # strip leading whitespace from the file name
            echo "executing: sha256sum \"$f\"" | tee -a "$logfile"
            sha256sum "$f" | tee -a "$logfile"
        done
    fi
else
    commandNotFound "pdfdetach"
    commandsExecuted=$((commandsExecuted+100000000)) # command #8 missing
fi

blankLine

#extract versions of the PDF using grep and dd, commands commonly available on any Linux distro

hashMark "9 - Extracting prior versions of the PDF"

filenamenoext="${filename%.*}"
extension="${filename##*.}"
v=1

offsets=($(grep --only-matching --byte-offset --text "%%EOF" "$filename"| cut -d : -f 1))

if [ ${#offsets[@]} -gt 0 ] && [ "${offsets[0]}" -lt 600 ]; then
    unset 'offsets[0]' # removes the first element in the array, as it's a false positive.
fi

priorVersions=${#offsets[@]}
priorVersions=$((priorVersions-1)) # reduces the count by 1, as the current version does not count as a prior version, but will be in the array.

if [ $priorVersions -lt 1 ]; then
    echo "There are no previous versions of the PDF embedded in this pdf." | tee -a "$logfile"
else
    if ! [[ "$priorVersion" = "true" || "$priorVersion" = "false" ]]; then # if the user did not provide a valid option for -p (or did not specify it)
        echo -e "There are ${GREEN}$priorVersions prior versions${NOCOLOUR} of this PDF based on the number of %%EOF signatures in it.\n"
        echo "The script can attempt to extract them with the caveat that a prior version may or may not be a properly formed PDF."
        read -p "Do you want the script to attempt to extract all versions of this PDF (Y/N)? \
[Y] " priorVersion # default response is Y if user just hits ENTER

        if [[ "$priorVersion" == "Y" || "$priorVersion" == "y" || "$priorVersion" == "" ]]; then
            priorVersion="true"
        else
            priorVersion="false"
        fi
    fi
    if [ "$priorVersion" == "true" ];then # process prior versions
        echo "Excluding the current version, there are $priorVersions prior versions in this PDF." | tee -a "$logfile"
        echo "The script will extract each of them, assigning them a version number. Version 1 being the oldest version, and version $priorVersions being the version prior to the current version." | tee -a "$logfile"

        #unset offsets[0] # removes the first element in the array, as it's a false positive.

        for size in ${offsets[@]}; do

            if [ $v -le $priorVersions ]; then # if it's not the last version. Last version is redundant, as it's the original PDF passed to the script.

                newfile="$filenamenoext version $v.$extension"

                blocksize=$((size+7))

                blankLine
                echo "executing: dd if=\"$filename\" of=\"$newfile\" bs=$blocksize count=1 status=noxfer 2> /dev/null" | tee -a "$logfile"
                echo "This will extract version $v of the pdf $filename, assigning it the new filename $newfile" | tee -a "$logfile"
                echo ""

                dd if="$filename" of="$newfile" bs=$blocksize count=1 status=noxfer 2> /dev/null

                echo "executing: sha256sum \"$newfile\" >>\"$logfile\"" | tee -a "$logfile"
                sha256sum "$newfile" >>"$logfile"

                blankLine

                checkPDF "$newfile"

                if [ "$pdfValidation" == "True" ] # Valid PDF
                then
                    echo -e "Prior ${GREEN}version $v${NOCOLOUR} of '$filename' appears to be a ${GREEN}valid PDF.${NOCOLOUR}"
                    echo -e "Prior version $v of '$filename' appears to be a valid PDF." \
                        >> "$logfile"
                else # Not a valid PDF
                    echo -e "Prior ${YELLOW}version $v${NOCOLOUR} of '$filename' ${RED}does not appear to be a valid PDF.${NOCOLOUR}"
                    echo -e "Prior version $v of '$filename' does not appear to be a valid PDF." >> "$logfile"
                fi
                blankLine
                pdfImages "$newfile" "9.$v" 2>/dev/null # Attempt to extract images from the version. Even if not a valid PDF, attempting regardless.

                v=$((v+1))
            fi
        done

        echo -e "\nExtracting prior versions of the PDF finished execution at $(date).">>"$logfile"

        echo "executing: exiftool -a -G1 -s -ee -csv \"$filenamenoext\"*\".$extension\" 2> /dev/null >> \"$filename - all versions - exif.csv\"" | tee -a "$logfile"

        exiftool -a -G1 -s -ee -csv "$filenamenoext"*".$extension" 2> /dev/null >> "$filename - all versions - exif.csv"
    fi
fi

# sorting all image hashes for ease of identifying matching images

blankLine

echo "executing: sort \"$allimages-unsorted.txt\" > \"$allimages.txt\"" | tee -a "$logfile"
echo -e "All image hashes are also found in: ${GREEN}'$allimages.txt'.${NOCOLOUR}"
echo "All image hashes are also found in: $allimages.txt.">>"$logfile"


sort "$allimages-unsorted.txt" > "$allimages.txt"


echo -e "\n###############################################################"
echo -e "Log file written to: ${GREEN}'$logfile'.${NOCOLOUR}"
echo -e "Script finished at $(date)." \
>> "$logfile" 474 | exit $commandsExecuted 475 | -------------------------------------------------------------------------------- /pdf-triage.sh: -------------------------------------------------------------------------------- 1 | # Written by Jacques Boucher 2 | # jjrboucher@gmail.com 3 | # 4 | # Triage script to review multiple PDFs in folders/subfolders and extract 5 | # a few data points that can help identify possible PDFs warranting further 6 | # review for possible manipulation by a subject. 7 | # 8 | # You can modify the script to extract other fields if they are of interest to you. 9 | # Updated 11 June 2024 10 | 11 | output_file="pdf-triage.tsv" 12 | 13 | if test -f "$output_file"; then 14 | echo "$output_file already exists. Rename or move and re-run the script." 15 | exit 1 16 | fi 17 | 18 | OIFS="$IFS" # Original Field Separator 19 | IFS=$'\n' # New field separator = new line 20 | 21 | a=$(find . -iname "*.pdf") # Find all PDFs recursively from current folder 22 | printf 'File\tCreate Date\tModify Date\t# of images\tAuthor\tProducer\t# of fonts\t# of versions\thash\n' >"$output_file" # write tab-separated headers to the tsv 23 | for i in $a # loop through each item 24 | do 25 | image_count=$(pdfimages -list "$i" | wc -l) # get only # of lines in output (# of images) 26 | image_count=$((image_count-2)) # subtract 2 lines - headers - to get actual # of images 27 | author=$(exiftool -S -s -author "$i") # get author of the document without the tag name 28 | create_date=$(exiftool -S -s -CreateDate "$i") 29 | modify_date=$(exiftool -S -s -ModifyDate "$i") 30 | producer=$(exiftool -S -s -Producer "$i") 31 | font_count=$(pdffonts "$i" | wc -l) # get the # of fonts - but includes 2 additional lines for header.
32 | font_count=$((font_count-2)) # remove the headers 33 | hash=$(md5sum "$i" | cut -d " " -f1) 34 | 35 | offsets=($(grep --only-matching --byte-offset --text "%%EOF" "$i"| cut -d : -f 1)) # find all %%EOF instances in the PDF 36 | if [ ${#offsets[@]} -gt 0 ] && [ "${offsets[0]}" -lt 600 ]; then # guard against files with no %%EOF marker at all 37 | unset 'offsets[0]' # removes the first element in the array, as it's a false positive. 38 | fi 39 | version_count=${#offsets[@]} # Number of instances of %%EOF 40 | 41 | printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$i" "$create_date" "$modify_date" "$image_count" "$author" "$producer" "$font_count" "$version_count" "$hash" >>"$output_file" # append tab-separated results to the tsv file in current directory 42 | done 43 | 44 | echo "Results can be found in $output_file." 45 | IFS="$OIFS" # restore IFS to original -------------------------------------------------------------------------------- /recursive-pdf-processing.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | ########################### 3 | # Written by Jacques Boucher 4 | # jboucher@unicef.org 5 | scriptVersion="3 October 2024" 6 | # Tested on Kali Linux 2023.1 and Kali Linux on WSL. 7 | ############################## 8 | # Installing required binaries 9 | ############################## 10 | # If running on Kali Linux WSL, you will need to run the following: 11 | # sudo apt update 12 | # sudo apt upgrade 13 | # sudo apt install exiftool 14 | # sudo apt install xpdf 15 | # sudo apt install pdf-parser 16 | # sudo apt install poppler-utils 17 | # sudo apt install pdfid 18 | 19 | ################# 20 | # Troubleshooting 21 | ################# 22 | # If the script gives you a warning that one of the above binaries is missing, 23 | # you can search which package you need to install as follows: 24 | # sudo apt search pdfinfo 25 | # The above would search the packages and return that it's part of poppler-utils. You would then install that 26 | # package with the command: sudo apt install poppler-utils.
27 | # 28 | # As best as the author could test, the installation of the required binaries above should be all that's needed. 29 | 30 | #################### 31 | # Processing summary 32 | #################### 33 | # This script will run the following parsing tools against a PDF that you provide as a command line argument. 34 | # The script will check if a command is present. If it is not, it will note this in the log, alert you on screen, and skip that processing. 35 | # 36 | # 1 - pdfinfo 37 | # 2 - exiftool 38 | # 3 - pdfimages 39 | # 4 - pdfsig 40 | # 5 - pdfid 41 | # 6 - pdf-parser 42 | # 7 - pdffonts 43 | # 8 - pdfdetach 44 | # 45 | # The script also extracts versions of the PDF using the grep and dd commands, looking for the %%EOF string in the PDF. 46 | # Each time you edit a PDF within Adobe, it adds 47 | # 48 | ###################### 49 | # Command line options 50 | ###################### 51 | # -f PDF being processed (required unless -d is used; -d selects a folder and -r makes the folder search recursive) 52 | # -v Prints the version # of the script and exits. 53 | # -p TRUE/FALSE optional switch if you want to process prior versions of a PDF if present. If you don't include this option on the command line and prior versions are detected, 54 | # the script will prompt you, letting you know it found prior versions and ask if you want to process them. 55 | # This option is especially practical if you extracted prior versions of a PDF, and now want to run the script against one of those PDFs. 56 | # In that scenario, you likely don't need to extract prior versions yet again. So you can use the option -p FALSE. 57 | # 58 | # Exit Codes 59 | commandsExecuted=0 # if no commands are missing, will have an exit code of 0. If any commands are missing, it will add 10**(command #) to the exit code. 60 | # For example, if pdfinfo is missing, it will add 10**1, or 10 to the exit value. 61 | # if pdfsig is missing, it will add 10**4, or 10000 to the exit value.
62 | # this sort of mimics bit-wise values in that an exit value of 1010 means commands #1 and #3 did not run. 63 | # thus an exit value of 0 means all commands were executed. 64 | missingArg=1 # missing an argument after the switch 65 | tooManyArgs=2 # too many arguments 66 | invalidSwitch=3 # invalid switch 67 | missingSwitch=4 # switch not provided 68 | invalidSyntax=5 # invalid syntax 69 | fileDoesNotExist=6 # file does not exist 70 | emptyFile=7 # 0 byte file provided as argument 71 | badpdf=8 # if a bad PDF is passed to the script and the user chooses to exit without processing it. 72 | f_and_d=9 # provided both -f and -d options 73 | noFolder=10 # did not provide a folder parameter with the -d option 74 | 75 | # Other variables 76 | investigator="" 77 | caseNumber="" 78 | currentDateTime=$(date +%d-%m-%YT%H%M%S) 79 | directory=0 # set default to false, not parsing a directory 80 | executionFolder=$(pwd) 81 | filename="" # initialize filename to process to blank 82 | filenamenoext="" # filename without the extension 83 | folder="" # folder to parse 84 | extension="" # extension for the filename 85 | newfile="" # variable to hold new filename when parsing through PDF for versions. 86 | allimages="" # variable for file name where all image hashes are saved for easy comparison. 87 | v=1 # counter for versions of a PDF (applicable when a PDF has been edited with Adobe). 88 | offsets=() # array variable to hold offsets of the %%EOF markers in a pdf denoting the end of each version. 89 | versions=0 # variable to hold # of versions found in a PDF (i.e., number of %%EOF markers). 90 | switch="" # initialize command line switch to blank 91 | recursive=0 # set default to false, not recursive - ignored if -d not used.
92 | outputFolder="" 93 | RED='\033[0;31m' # red font 94 | YELLOW='\033[0;1;33m' # yellow font 95 | GREEN='\033[32;1;1m' # green font 96 | NOCOLOUR='\033[0;m' # no colour 97 | usage="Usage: $0 {-f } {-p TRUE/FALSE} {-d } {-r}\nor: $0 -v" 98 | priorVersion="" #This variable is the flag for deciding if the script should attempt to extract prior versions. 99 | OIFS="$IFS" # Original Field Separator 100 | IFS=$'\n' # New field separator = new line 101 | 102 | 103 | hashMark() { # function to write section header to the log file. 104 | echo -e "######################### $1 #############################" >>"$logfile" 105 | } 106 | 107 | fileheader() { # function to write the file header to the log file. 108 | echo -e "*************************************************************************************************" >>"$logfile" 109 | echo -e " Processing $1" >>"$logfile" 110 | echo -e "*************************************************************************************************" >>"$logfile" 111 | blankLine 112 | } 113 | 114 | blankLine() { # inserts a blank line in the log file and on screen. 115 | echo "" 116 | echo -e "" >>"$logfile" 117 | } 118 | 119 | commandNotFound () { #command not found. Logging same. 120 | echo "" 121 | echo -e "################ ${RED}WARNING!${NOCOLOUR} ################" 122 | echo -e "${RED}$1 ${YELLOW}not found!${NOCOLOUR} Skipping this step." 123 | echo -e "##########################################" 124 | echo "" 125 | blankLine 126 | echo "######## WARNING! ########" >> "$logfile" 127 | echo "$1 not found. Skipping this step." >> "$logfile" 128 | echo "##########################" >> "$logfile" 129 | blankLine 130 | } 131 | 132 | pdfImages() { 133 | #pdfimages 134 | 135 | which pdfimages >/dev/null #checks for command 136 | if [ $? -eq 0 ] # exit status 0 = command found 137 | then 138 | hashMark "$2 - $(pdfimages -v 2>&1 | head -n1)" 139 | echo "Extracting images from $1." 140 | echo "Extracting images from $1." 
>>"$logfile" 141 | 142 | blankLine 143 | pdfimagesRoot="$3" 144 | echo "executing: pdfimages -all \"$1\" \"$pdfimagesRoot\"" | tee -a "$logfile" 145 | echo "Which will extract the following images:" | tee -a "$logfile" 146 | pdfimages -list "$1" | tee -a "$logfile" 147 | pdfimages -all "$1" "$pdfimagesRoot" 148 | pdfimages -png "$1" "$pdfimagesRoot-png-format" # also export as PNG for ease of viewing, as .ccitt not easily viewable 149 | echo -e "\npdfimages finished execution at $(date).\nExtracted images saved to '$pdfimagesRoot-###.{extension}'.">>"$logfile" 150 | echo "executing: sha256sum \"$pdfimagesRoot\-*.*\"" | tee -a "$logfile" 151 | sha256sum "$pdfimagesRoot"-*.* | tee -a "$allimages"-unsorted.txt >> "$logfile" 152 | blankLine 153 | else 154 | commandNotFound "pdfimages" 155 | commandsExecuted=$((commandsExecuted+1000)) # arithmetic expansion, so missing-command values accumulate 156 | fi 157 | } 158 | 159 | checkPDF() { 160 | processThisPDF="y" # defaults to yes 161 | testPDF="$(pdfinfo "$1" 2>/dev/null)" 162 | if [ "$testPDF" == "" ]; then 163 | pdfValidation="False" 164 | else 165 | pdfValidation="True" 166 | fi 167 | 168 | if [ "$pdfValidation" == "False" ]; then 169 | echo -e "${YELLOW}Warning!${NOCOLOUR}\nThe PDF $1 does not appear to be a valid PDF." 170 | echo -e "According to pdfinfo, ${1} does not appear to be a valid PDF." >> "$logfile" 171 | read -p "Do you still wish to proceed (y/n)? " processThisPDF 172 | processThisPDF=$(echo $processThisPDF | tr '[:upper:]' '[:lower:]') 173 | fi 174 | } 175 | 176 | while getopts ":d:f:p:rv" opt; do 177 | case $opt in 178 | d) 179 | switch=$opt 180 | folder="$OPTARG" 181 | if [ -z "$folder" ]; then 182 | echo -e "You ${RED}did not${NOCOLOUR} provide a ${RED}folder${NOCOLOUR} with the -d switch." 183 | elif [ ! -d "$folder" ]; then 184 | echo -e "You ${RED}did not${NOCOLOUR} provide a valid ${RED}folder${NOCOLOUR} with the -d switch."
185 | IFS="$OIFS" # restore IFS to original 186 | exit $noFolder 187 | fi 188 | ;; 189 | f) 190 | switch=$opt 191 | filename="$OPTARG" 192 | if [ ! -z "$filename" ] && [ ! -z "$folder" ]; then 193 | echo -e "You provided ${RED}both${NOCOLOUR} the file (-f) and directory (-d) switches." 194 | echo -e "Please provide ${GREEN}one or the other${NOCOLOUR}, but ${RED}not both.${NOCOLOUR}" 195 | IFS="$OIFS" # restore IFS to original 196 | exit $f_and_d 197 | elif [ -z "$filename" ]; then #filename is still blank, thus invalid syntax 198 | echo "Invalid syntax." 199 | echo -e $usage 200 | IFS="$OIFS" # restore IFS to original 201 | exit $invalidSyntax 202 | elif [ ! -f "$filename" ]; then 203 | echo "File does not exist." 204 | echo -e $usage 205 | IFS="$OIFS" # restore IFS to original 206 | exit $fileDoesNotExist 207 | elif [ ! -s "$filename" ]; then 208 | echo "The file $filename is a 0 byte file." 209 | echo "Nothing to process." 210 | echo -e $usage 211 | IFS="$OIFS" # restore IFS to original 212 | exit $emptyFile 213 | fi 214 | ;; 215 | v) 216 | echo "$0 version: $scriptVersion" 217 | IFS="$OIFS" # restore IFS to original 218 | exit 219 | ;; 220 | p) # attempt to extract prior versions 221 | switch=$opt 222 | priorVersion="$OPTARG" 223 | priorVersion=$(echo $priorVersion | tr '[:upper:]' '[:lower:]') 224 | ;; 225 | r) 226 | recursive=1 # user selected option to recursively search for files. 227 | ;; 228 | :) 229 | echo "You must supply an argument to -$OPTARG">&2 230 | echo -e $usage 231 | IFS="$OIFS" # restore IFS to original 232 | exit $missingArg 233 | ;; 234 | 235 | \?) 236 | echo "Invalid switch." 237 | echo -e $usage 238 | IFS="$OIFS" # restore IFS to original 239 | exit $invalidSwitch 240 | ;; 241 | esac 242 | done 243 | 244 | outputFolder="report_$currentDateTime" 245 | 246 | if [ -z "$switch" ]; then #switch is still blank, thus not provided 247 | echo "Did not provide the required switch." 
248 | echo -e $usage 249 | IFS="$OIFS" # restore IFS to original 250 | exit $missingSwitch 251 | fi 252 | 253 | # if a folder is passed, assign output of "find" command to files. 254 | 255 | if [ ! -z $folder ]; then 256 | if [ $recursive -eq 1 ]; then 257 | filename=$(find $folder -iname "*.pdf") 258 | else 259 | filename=$(find $folder -maxdepth 1 -iname "*.pdf") 260 | fi 261 | fi 262 | 263 | mkdir $outputFolder 264 | 265 | logfile="$outputFolder/processing_results.log" 266 | 267 | hashMark "Tombstone Info" 268 | read -p "Investigator: " investigator 269 | read -p "Case number: " caseNumber 270 | 271 | echo "Executed by user $(whoami) at $(date)." >> "$logfile" 272 | echo "Investigator: $investigator" >> "$logfile" 273 | echo "Case number: $caseNumber" >> "$logfile" 274 | echo "Current folder: $executionFolder" >> "$logfile" 275 | echo "Script version: "$scriptVersion >> "$logfile" 276 | echo "Command executed:$0 $@" >> "$logfile" 277 | # echo output folder 278 | blankLine 279 | 280 | fileCount=0 281 | 282 | for fileToProcess in $filename # loop through each file 283 | do 284 | fileCount=$((fileCount+1)) 285 | filenamenoext="${fileToProcess%.*}" 286 | filenamenoext="${filenamenoext##*/}" 287 | extension="${fileToProcess##*.}" 288 | blankLine 289 | fileheader $fileToProcess 290 | blankLine 291 | echo "Creating output folder $outputFolder/$fileCount-$filenamenoext for this file." >> "$logfile" 292 | blankLine 293 | fileFolder="$outputFolder/$fileCount-$filenamenoext" 294 | mkdir $fileFolder 295 | echo "sha256 hash: $(sha256sum "$fileToProcess" | cut -d " " -f1)" >> "$logfile" 296 | blankLine 297 | 298 | checkPDF "$fileToProcess" 299 | 300 | if [ "$pdfValidation" == "False" ]; then 301 | if [ "$processThisPDF" != "y" ]; then 302 | echo "User opted to not process \"$fileToProcess\" as it does not appear to be a valid PDF according to pdfinfo." 
| tee -a "$logfile" 303 | blankLine 304 | continue # skip to the next file 305 | else 306 | echo "User opted to process \"$fileToProcess\" despite appearing to not be a valid PDF according to pdfinfo." | tee -a "$logfile" 307 | blankLine 308 | fi 309 | fi 310 | 311 | imagesFileName=$(basename "${fileToProcess}") 312 | allimages="$fileFolder/${imagesFileName%.*}-hashes of all images" 313 | 314 | # pdfinfo 315 | 316 | which pdfinfo >/dev/null #checks for command 317 | if [ $? -eq 0 ] # exit status 0 = command found 318 | then 319 | pdfInfoFile="$fileFolder/${fileToProcess##*/}-pdfinfo.txt" 320 | hashMark "1 - $(pdfinfo -v 2>&1 | head -n 1)" 321 | echo "executing: pdfinfo \"$fileToProcess\"" | tee -a "$logfile" 322 | pdfinfo "$fileToProcess">"$pdfInfoFile" 323 | echo -e "pdfinfo finished execution at $(date).\nResults written to '$pdfInfoFile'." >> "$logfile" 324 | echo "executing: sha256sum \"$pdfInfoFile\"" | tee -a "$logfile" 325 | sha256sum "$pdfInfoFile" >> "$logfile" 326 | 327 | blankLine 328 | else 329 | commandNotFound "pdfinfo" 330 | commandsExecuted=$((commandsExecuted+10)) # arithmetic expansion, so missing-command values accumulate 331 | fi 332 | 333 | # exiftool 334 | 335 | which exiftool >/dev/null #checks for command 336 | if [ $?
-eq 0 ] # exit status 0 = command found 337 | then 338 | exifFile="$fileFolder/${fileToProcess##*/}-exif.csv" 339 | hashMark "2 - exiftool version $(exiftool -ver)" 340 | echo "executing: exiftool -a -G1 -s -ee -csv \"$fileToProcess\" > \"$exifFile\"" | tee -a "$logfile" 341 | exiftool -a -G1 -s -ee -csv "$fileToProcess">"$exifFile" 342 | echo -e "exiftool finished execution at $(date).\nResults written to '$exifFile'.">>"$logfile" 343 | echo "executing: sha256sum \"$exifFile\"" | tee -a "$logfile" 344 | sha256sum "$exifFile" >> "$logfile" 345 | blankLine 346 | else 347 | commandNotFound "exiftool" 348 | commandsExecuted=$((commandsExecuted+100)) # arithmetic expansion, so missing-command values accumulate 349 | fi 350 | 351 | #pdfimages 352 | pdfImages "$fileToProcess" "3" "$fileFolder/${fileToProcess##*/}-pdfimages" 353 | 354 | #pdfsig 355 | 356 | which pdfsig >/dev/null #checks for command 357 | if [ $? = 0 ] # exit status 0 = command found 358 | then 359 | hashMark "4 - $(pdfsig -v 2>&1 | head -n1)" 360 | pdfsigFilename="$fileFolder/$filenamenoext.pdfsig.txt" 361 | echo "executing: pdfsig -nocert -dump \"$fileToProcess\"" | tee -a "$logfile" 362 | pdfsig -nocert -dump "$fileToProcess" >>"$logfile" 363 | 364 | if [ -e "$executionFolder/${fileToProcess##*/}.sig0" ]; then # if it extracted a signature file. 365 | mv "$executionFolder/${fileToProcess##*/}".sig* "$fileFolder" # move the file(s) to the correct folder 366 | echo "executing: sha256sum \"$fileFolder/${fileToProcess##*/}.sig*\"" | tee -a "$logfile" 367 | sha256sum "$fileFolder/${fileToProcess##*/}".sig* | tee -a "$logfile" 368 | fi 369 | 370 | echo -e "pdfsig finished execution at $(date).\nResults written to '$logfile'.">>"$logfile" 371 | echo -e "Signature(s), if present, is/are dumped to $fileFolder.">>"$logfile" 372 | blankLine 373 | else 374 | commandNotFound "pdfsig" 375 | commandsExecuted=$((commandsExecuted+10000)) # arithmetic expansion, so missing-command values accumulate 376 | fi 377 | 378 | #pdfid 379 | 380 | which pdfid >/dev/null #checks for command 381 | if [ $?
-eq 0 ] # exit status 0 = command found 382 | then 383 | hashMark "5 - pdfid version $(pdfid --version | cut -d " " -f2)" 384 | pdfidFilename="$fileFolder/$filenamenoext.pdfid.txt" 385 | echo "executing: pdfid -l \"$fileToProcess\">\"$pdfidFilename\"" | tee -a "$logfile" 386 | pdfid -l "$fileToProcess">"$pdfidFilename" 387 | echo -e "pdfid finished execution at $(date).\nResults written to '$pdfidFilename'.">>"$logfile" 388 | echo "executing: sha256sum \"$pdfidFilename\"" | tee -a "$logfile" 389 | sha256sum "$pdfidFilename" >> "$logfile" 390 | blankLine 391 | else 392 | commandNotFound "pdfid" 393 | commandsExecuted=$((commandsExecuted+100000)) # arithmetic expansion, so missing-command values accumulate 394 | fi 395 | 396 | #pdfparser 397 | 398 | which pdf-parser >/dev/null #checks for command 399 | if [ $? -eq 0 ] # exit status 0 = command found 400 | then 401 | hashMark "6 - pdf-parser version $(pdf-parser --version | grep "pdf-parser" | cut -d " " -f2)" 402 | pdfparserFilename="$fileFolder/$filenamenoext.pdfparser.txt" 403 | echo "executing: pdf-parser \"$fileToProcess\">\"$pdfparserFilename\"" | tee -a "$logfile" 404 | pdf-parser "$fileToProcess">"$pdfparserFilename" 405 | echo -e "pdf-parser finished execution at $(date).\nResults written to '$pdfparserFilename'.">>"$logfile" 406 | echo "executing: sha256sum \"$pdfparserFilename\"" | tee -a "$logfile" 407 | sha256sum "$pdfparserFilename" >> "$logfile" 408 | blankLine 409 | else 410 | commandNotFound "pdf-parser" 411 | commandsExecuted=$((commandsExecuted+1000000)) # arithmetic expansion, so missing-command values accumulate 412 | fi 413 | 414 | #pdffonts 415 | 416 | which pdffonts >/dev/null #checks for command 417 | if [ $?
-eq 0 ] # exit status 0 = command found 418 | then 419 | hashMark "7 - $(pdffonts -v 2>&1 | head -n1)" 420 | pdffontsFilename="$fileFolder/$filenamenoext.pdffonts.txt" 421 | echo "executing: pdffonts \"$fileToProcess\">\"$pdffontsFilename\" 2>>\"$pdffontsFilename\"" | tee -a "$logfile" 422 | pdffonts "$fileToProcess" >"$pdffontsFilename" 2>>"$pdffontsFilename" 423 | echo -e "pdffonts finished execution at $(date).\nResults written to '$pdffontsFilename'.">>"$logfile" 424 | echo "executing: sha256sum \"$pdffontsFilename\"" | tee -a "$logfile" 425 | sha256sum "$pdffontsFilename" >> "$logfile" 426 | blankLine 427 | else 428 | commandNotFound "pdffonts" 429 | commandsExecuted=$((commandsExecuted+10000000)) # arithmetic expansion, so missing-command values accumulate 430 | fi 431 | 432 | #pdfdetach 433 | 434 | which pdfdetach >/dev/null #checks for command 435 | if [ $? -eq 0 ] # exit status 0 = command found 436 | then 437 | hashMark "8 - $(pdfdetach -v 2>&1 | head -n1)" 438 | echo "executing: pdfdetach -saveall \"$fileToProcess\"" | tee -a "$logfile" 439 | echo "Which will extract the following files (if applicable):" | tee -a "$logfile" 440 | embeddedItemsCount=$(pdfdetach -list "$fileToProcess"|wc -l) 441 | embeddedItemsCount=$((embeddedItemsCount-1)) 442 | pdfdetach -list "$fileToProcess" | tee -a "$logfile" 443 | pdfdetach -saveall "$fileToProcess" 444 | echo -e "pdfdetach finished execution at $(date).">>"$logfile" 445 | 446 | if [[ "$(pdfdetach -list "$fileToProcess")" != "0 embedded files" && "$(pdfdetach -list "$fileToProcess")" != "" ]] 447 | # if there are no embedded files and you do the sha256sum command, it waits for input. This avoids hanging the script in such cases.
448 | then 449 | pdfdetach -list "$fileToProcess"|tail -n $embeddedItemsCount | cut -d: -f2- 450 | for f in $(pdfdetach -list "$fileToProcess" 2>/dev/null | tail -n $embeddedItemsCount | cut -d: -f2-); do # unquoted so IFS (newline) yields one embedded file name per iteration 451 | echo "executing: sha256sum \"${f#\"${f%%[![:space:]]*}\"}\"" | tee -a "$logfile" 452 | sha256sum "${f#"${f%%[![:space:]]*}"}" |tee -a "$logfile" 453 | done 454 | fi 455 | else 456 | commandNotFound "pdfdetach" 457 | commandsExecuted=$((commandsExecuted+100000000)) # 10**8 for command #8; arithmetic expansion so values accumulate 458 | fi 459 | 460 | blankLine 461 | 462 | #extract versions of the PDF using grep and dd, commands commonly available on any Linux distro 463 | 464 | hashMark "9 - Extracting prior versions of the PDF" 465 | 466 | v=1 467 | 468 | offsets=($(grep --only-matching --byte-offset --text "%%EOF" "$fileToProcess"| cut -d : -f 1)) 469 | 470 | if [ ${#offsets[@]} -gt 0 ] && [ "${offsets[0]}" -lt 600 ]; then # guard against files with no %%EOF marker at all 471 | unset 'offsets[0]' # removes the first element in the array, as it's a false positive. 472 | fi 473 | 474 | priorVersions=${#offsets[@]} 475 | priorVersions=$((priorVersions-1)) 476 | 477 | if [ $priorVersions -lt 1 ]; then 478 | echo "There are no previous versions of the PDF embedded in this pdf." | tee -a "$logfile" 479 | else 480 | if ! [[ "$priorVersion" = "true" || "$priorVersion" = "false" ]]; then # if the user did not provide a valid option for -p (or did not specify it) 481 | echo -e "There are ${GREEN}$priorVersions prior versions${NOCOLOUR} of this PDF based on the number of %%EOF signatures in it.\n" 482 | echo "The script can attempt to extract them with the caveat that a prior version may or may not be a properly formed PDF." 483 | read -p "Do you want the script to attempt to extract all versions of this PDF (Y/N)? 
[Y] " priorVersion # default response is Y if user just hits ENTER 484 | 485 | if [[ "$priorVersion" == "Y" || "$priorVersion" == "y" || "$priorVersion" == "" ]]; then 486 | priorVersion="true" 487 | else 488 | priorVersion="false" 489 | fi 490 | fi 491 | if [ "$priorVersion" == "true" ];then # process prior versions 492 | echo "Excluding the current version, there are $priorVersions prior versions in this PDF." | tee -a "$logfile" 493 | echo "The script will extract each of them, assigning them a version number. Version 1 is the oldest version, and version $priorVersions is the version prior to the current version." | tee -a "$logfile" 494 | 495 | #unset offsets[0] # removes the first element in the array, as it's a false positive. 496 | 497 | for size in ${offsets[@]}; do 498 | 499 | 500 | if [ $v -le $priorVersions ]; then # if it's not the last version. Last version is redundant, as it's the original PDF passed to the script. 501 | 502 | newfile="$fileFolder/$filenamenoext version $v.$extension" 503 | 504 | blocksize=$((size+7)) 505 | 506 | blankLine 507 | echo "executing: dd if=\"$fileToProcess\" of=\"$newfile\" bs=$blocksize count=1 status=noxfer 2\> \/dev\/null" | tee -a "$logfile" 508 | echo "This will extract version $v of the pdf $fileToProcess, assigning it the new filename $newfile" | tee -a "$logfile" 509 | echo "" 510 | 511 | dd if="$fileToProcess" of="$newfile" bs=$blocksize count=1 status=noxfer 2> /dev/null 512 | 513 | echo "executing: sha256sum \"$newfile\" >>\"$logfile\"" | tee -a "$logfile" 514 | sha256sum "$newfile" >>"$logfile" 515 | 516 | blankLine 517 | 518 | checkPDF "$newfile" 519 | 520 | if [ "$pdfValidation" == "False" ] && [ "$processThisPDF" != "y" ]; then v=$((v+1)) # advance the version counter before skipping this version 521 | echo "User opted to not process \"$newfile\" as it does not appear to be a valid PDF."
>> "$logfile" 522 | continue # skip to the next version 523 | fi 524 | 525 | if [ "$pdfValidation" == "True" ] # Valid PDF 526 | then 527 | echo -e "Prior ${GREEN}version $v${NOCOLOUR} of '$fileToProcess' appears to be a ${GREEN}valid PDF.${NOCOLOUR}" 528 | echo -e "Prior version $v of '$fileToProcess' appears to be a valid PDF." >> "$logfile" 529 | else # Not a valid PDF 530 | echo -e "Prior ${YELLOW}version $v${NOCOLOUR} of '$fileToProcess' ${RED}does not appear to be a valid PDF.${NOCOLOUR}" 531 | echo -e "Prior version $v of '$fileToProcess' does not appear to be a valid PDF." >> "$logfile" 532 | fi 533 | blankLine 534 | pdfImages "$newfile" "9.$v" "$fileFolder/${newfile##*/}-pdfimages" 2>/dev/null # Attempt to extract images from the version. Even if not a valid PDF, attempting regardless. 535 | 536 | v=$((v+1)) 537 | fi 538 | done 539 | 540 | echo -e "\nExtracting prior versions of the PDF finished execution at $(date).">>"$logfile" 541 | 542 | echo "executing: exiftool -a -G1 -s -ee -csv \"$fileToProcess\" \"$fileFolder/$filenamenoext version\"*.$extension 2> /dev/null >> \"$fileFolder/${fileToProcess##*/} - all versions - exif.csv\"" | tee -a "$logfile" 543 | echo "filetoprocess: $fileToProcess" 544 | allPriorVersions="$fileFolder/$filenamenoext version" 545 | echo "allPriorVersions: $allPriorVersions" 546 | exiftool -a -G1 -s -ee -csv "$fileToProcess" "$allPriorVersions"*.$extension 2> /dev/null >> "$fileFolder/${fileToProcess##*/} - all versions - exif.csv" 547 | 548 | fi 549 | fi 550 | 551 | # sorting all image hashes after extracting all prior versions of PDF for ease of identifying matching images 552 | 553 | blankLine 554 | 555 | echo "executing: sort \"$allimages-unsorted.txt\" > \"$allimages-sorted.txt\"" | tee -a "$logfile" 556 | echo -e "All image hashes are also found in: ${GREEN}'$allimages-sorted.txt'.${NOCOLOUR}" 557 | echo "All image hashes are also found in: $allimages-sorted.txt.">>"$logfile" 558 | 559 | 
sort "$allimages-unsorted.txt" > "$allimages"-sorted.txt 560 | 561 | blankLine 562 | 563 | done 564 | 565 | echo -e "\n###############################################################" 566 | echo -e "Log file written to: ${GREEN}'$logfile'.${NOCOLOUR}" 567 | echo -e "Script finished at $(date)." >> "$logfile" 568 | IFS="$OIFS" # restore IFS to original 569 | exit $commandsExecuted --------------------------------------------------------------------------------
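Note on the exit codes: the comments in recursive-pdf-processing.sh describe adding 10**(command #) to the exit value for each missing tool. A shell exit status is a single byte (0 to 255), so a value such as 10000 is reduced modulo 256 before the caller sees it. The following is only a sketch of one possible alternative, not part of the scripts above: it assigns one bit per tool, which fits the eight tools exactly. The names tools, missing_tools_mask and decode_mask are hypothetical, and the sketch ignores the fact that the script also uses the values 1 to 10 for argument errors.

```shell
#!/bin/bash
# Hypothetical sketch: one bit per tool instead of one decimal digit per tool.
# Tool order follows the numbered list in the script's processing summary.
tools=(pdfinfo exiftool pdfimages pdfsig pdfid pdf-parser pdffonts pdfdetach)

# Print a bitmask in which bit N-1 is set when tool #N is not found on the PATH.
missing_tools_mask() {
    local mask=0 i
    for i in "${!tools[@]}"; do
        command -v "${tools[i]}" >/dev/null 2>&1 || mask=$((mask | (1 << i)))
    done
    echo "$mask"
}

# Print the name of every tool whose bit is set in the mask passed as $1.
decode_mask() {
    local mask=$1 i
    for i in "${!tools[@]}"; do
        if (( mask & (1 << i) )); then
            echo "${tools[i]}"
        fi
    done
}
```

With this layout, a mask of 5 (bits 0 and 2) would mean that tools #1 and #3, pdfinfo and pdfimages, were missing, and a caller could pass the exit status of the run to decode_mask to list them.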