├── .idea
│   └── workspace.xml
├── README.md
├── Test File
│   ├── Copy of each version
│   │   ├── PDF versions test - v1.pdf
│   │   ├── PDF versions test - v2.pdf
│   │   ├── PDF versions test - v3.pdf
│   │   └── PDF versions test - v4.pdf
│   └── PDF versions test.pdf
├── hash_all_images.sh
├── pdf-metadata.sh
├── pdf-processing.sh
├── pdf-triage.sh
└── recursive-pdf-processing.sh
/README.md:
--------------------------------------------------------------------------------
1 | # recursive-pdf-processing.sh
2 | Script to process a single PDF file, all PDF files in a folder, or all PDF files in a folder and its subfolders (recursively).
3 |
4 | The script is still being worked on; I need to add an output option. Currently, it writes to a new directory ("report_DD-MM-YYYYTHH:mm:ss") created in the directory from which the script is run. This can cause an unintended loop if that output folder sits inside the folder tree you have chosen to process recursively. To avoid this, navigate to the location where you want the output folder created (its name is unique because it includes the timestamp), and use -d to select a directory located elsewhere. The ultimate purpose of this script is to let you mount an image file with the image mounter of your choice and point the script somewhere within that hierarchy (e.g., a user folder) to parse all PDFs within it.
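One way to guard against that loop is to check, before processing, whether the output location resolves to somewhere inside the folder being processed. The sketch below is not part of the script: `is_inside` is a hypothetical helper name, and it assumes both directories already exist.

```shell
#!/bin/sh
# is_inside ROOT CANDIDATE
# Hypothetical helper (not part of recursive-pdf-processing.sh).
# Returns 0 if CANDIDATE is ROOT itself or lies beneath it, 1 if it does not,
# and 2 if either directory cannot be entered.
is_inside() {
    root=$(cd "$1" 2>/dev/null && pwd -P) || return 2
    cand=$(cd "$2" 2>/dev/null && pwd -P) || return 2
    [ "$root" = "/" ] && return 0   # everything lives under the filesystem root
    case "$cand/" in
        "$root"/*) return 0 ;;      # same directory or a descendant
        *) return 1 ;;
    esac
}
```

For example, `is_inside /mnt/n/my_pdfs_are_here "$PWD"` succeeding would mean the report folder is about to be created inside the tree being parsed, so it would be safer to run the script from elsewhere.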
5 |
6 | This script was created out of the necessity to extract prior versions out of multiple PDFs. Rather than running the other script multiple times, pointing to a new file each time, this script will allow me to process those multiple PDFs in a single command.
7 |
8 | If you are going to use this script, keep in mind that it is still being tested. I hope to add an output command line parameter in the coming days, and I'll update this README at that time with examples of the syntax you can use.
9 |
10 | For now, an example would be to navigate to your Documents folder on your C drive, with Kali WSL for example (cd /mnt/c/Users/{username}/Documents), and run the following (assuming the script is in your user home folder and is executable):
11 |
12 | ~/recursive-pdf-processing.sh -d /mnt/n/my_pdfs_are_here -r -p True
13 |
14 | The above uses the following switches:
15 | -d to specify the root directory to start processing. /mnt/n/my_pdfs_are_here
16 | -r means to recursively parse
17 | -p True tells the script to parse prior versions if any found. If you don't specify -p, it will prompt you when it finds a prior version. This wasn't a big deal when processing a single PDF, as you would be at the keyboard while it was processing. But if recursively processing 1000 PDFs for example, you don't want to have to answer whether or not to process prior versions each time it encounters one. So you can specify -p True (automatically process prior versions), or -p False (don't process prior versions). Anything else will result in being prompted for each.
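The three outcomes described for -p (process, skip, or prompt) could be normalised along the following lines. This is only a sketch: `normalize_prior_flag` is a hypothetical helper name, and the case-insensitive matching is an assumption modelled on how pdf-processing.sh lowercases the value of -p.

```shell
#!/bin/sh
# normalize_prior_flag VALUE
# Hypothetical helper: prints "true" or "false" for a recognised value
# (in any letter case), and "prompt" for anything else, including no value.
normalize_prior_flag() {
    value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$value" in
        true|false) printf '%s\n' "$value" ;;
        *) printf '%s\n' "prompt" ;;
    esac
}
```

With this approach, `-p True`, `-p TRUE` and `-p true` would all be treated the same way, and only an unrecognised or missing value would lead to a prompt for each PDF.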
18 |
19 | You can still use the -f option as with the other script to parse a single PDF. Once this script is properly tested, the other will be retired.
20 |
21 | Currently, you don't specify an output folder. The script creates one under the folder from which it is run. But I will be adding -o {output_folder} soon. That way you could, if you preferred, navigate to where you want to start processing PDFs and do the following:
22 | ~/recursive-pdf-processing.sh -d . -r -p True -o /mnt/c/Users/{username}/Desktop
23 | The above will (once I add the -o option) process all PDFs from your current folder (because you navigated to where you want to start), as denoted by the period ".". It will parse recursively, and parse prior versions.
24 |
25 | I will be updating pdf-triage and pdf-metadata to also allow you to specify the folder to process, and where to save the output. I may even incorporate those into this main script so that you only need to run one script to get everything, or to select certain processing options.
26 |
27 | # pdf-processing.sh
28 | Script to process a PDF file
29 |
30 | Command line options:
31 |
32 | -f # this option is required. You must provide the filename of the PDF using the -f switch.
33 | -v # this option prints the version and exits
34 | -p true/false
35 |
36 | The -p switch allows you to tell the script not to attempt to extract prior versions. This is most likely going to be used if you extract prior versions of a PDF, and then want to run the script against those versions. You won't need to re-extract the prior versions of those earlier versions; that would be redundant. You can set this to false so that the script skips that part. If you do not specify this flag and prior versions exist, the script will alert you to that and ask if you want to recover them.
37 |
38 | Tested on Kali Linux 2023.1 and Kali Linux on WSL.
39 |
40 | If running on Kali Linux WSL, you will need to run the following:
41 |
42 | ```
43 | sudo apt update
44 | sudo apt upgrade
45 | sudo apt install exiftool
46 | sudo apt install xpdf
47 | sudo apt install pdf-parser
48 | sudo apt install poppler-utils
49 | sudo apt install pdfid
50 |
51 | ```
52 | The script will execute the following processes against the PDF:
53 | 1. pdfinfo
54 | 2. exiftool
55 | 3. pdfimages
56 | 4. pdfsig
57 | 5. pdfid
58 | 6. pdf-parser
59 | 7. pdffonts
60 | 8. pdfdetach
61 |
62 | The script will also attempt to carve out prior versions of the PDF by looking for %%EOF markers in the PDF. When you edit a PDF, the edits are added after the %%EOF, and a new %%EOF is added at the new end of the file. This means there is an opportunity to extract prior versions of a PDF. It's not guaranteed, as several factors can cause prior versions to be invalid PDFs: whether the tool used to edit the PDF complies with the PDF standard, and whether some compressing (cleaning up) removed an earlier edit (in which case the %%EOF may remain, so you at least know a prior version existed).
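The carving idea can be sketched in a few lines. This is an illustration rather than the script's exact code: the function names are hypothetical, and it assumes GNU grep (for --byte-offset) and a head command that supports -c.

```shell
#!/bin/sh
# Sketch only: assumes GNU grep and a head that supports -c.

# list_eof_offsets FILE: print the byte offset of each %%EOF marker, one per line.
list_eof_offsets() {
    grep --only-matching --byte-offset --text '%%EOF' "$1" | cut -d: -f1
}

# carve_version FILE OFFSET OUT: copy FILE from its start through the %%EOF
# marker that begins at OFFSET (the marker itself is 5 bytes long) into OUT.
carve_version() {
    head -c "$(( $2 + 5 ))" "$1" > "$3"
}
```

Every offset except the last one marks the end of a candidate prior version. pdf-processing.sh itself carves with dd and a block size of the offset plus 7, which also picks up the bytes that immediately follow the marker (typically the end-of-line characters).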
63 |
64 | # pdf-triage.sh
65 | Note:
66 | For the pdf-triage.sh script, you only need exiftool and xpdf from the above, plus the "file" command. If you've already installed exiftool and xpdf for the above script, you only need to install the file command here.
67 | ```
68 | sudo apt install exiftool
69 | sudo apt install xpdf
70 | sudo apt install file
71 | ```
72 | # pdf-metadata.sh
73 | The pdf-metadata.sh script really just packages the exiftool command for the convenience of those who are unfamiliar with exiftool and its switches. You can run the exiftool command from the script on its own and get the same results (in that case you need to provide a proper output file rather than the variable name used in the script).
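As a rough illustration of what the script wraps, the sketch below runs the same exiftool switches with an output file of your choosing. The function name `export_pdf_metadata` and its return codes are hypothetical; only the exiftool switches are taken from pdf-metadata.sh.

```shell
#!/bin/sh
# export_pdf_metadata OUTPUT_FILE
# Hypothetical wrapper around the exiftool call used by pdf-metadata.sh.
# Returns 1 if OUTPUT_FILE already exists, 2 if exiftool is not installed.
export_pdf_metadata() {
    out=$1
    if [ -e "$out" ]; then
        echo "$out already exists. Rename or move it and re-run." >&2
        return 1
    fi
    if ! command -v exiftool >/dev/null 2>&1; then
        echo "exiftool not found." >&2
        return 2
    fi
    # Same switches as the script: recurse from the current directory, PDFs only.
    exiftool -a -G1 -s -ee -csv -r . -ext pdf > "$out"
}
```

Calling `export_pdf_metadata my-case-metadata.csv` from the top of the folder tree you want to review would then write the CSV to that name instead of the one hard-coded in the script.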
74 |
75 | For the pdf-metadata.sh script, you only need exiftool installed. If you already installed it for either of the above two scripts, you don't need to install it again here.
76 | ```
77 | sudo apt install exiftool
78 | ```
79 | # hash_all_images.sh
80 | This script extracts the images from the PDFs in a folder, hashes them, and saves the hashes to a text file. Images are deleted after they are hashed, as the purpose of this script is not to extract and preserve all images; the other scripts are used for that. This is a way to look for embedded images from various PDFs that share the same hash value.
81 |
82 | Images in different PDFs with the same hash can be normal (e.g., an image of a stamp or signature is used to add to a document).
83 |
84 | But if you have several PDFs where the stamp is a physical stamp that is stamped on a printed document and then scanned to PDF, then they would not have the same hash value. If they do, that contradicts the statement that they are unique scans.
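Once the hash file exists, the images that share a hash can be pulled out with standard tools. The sketch below is not part of the script: `shared_hashes` is a hypothetical helper, and it assumes md5sum-style lines (a 32-character hash followed by the file name) and GNU uniq.

```shell
#!/bin/sh
# shared_hashes HASHFILE
# Sketch only: assumes md5sum-style lines ("<32 hex chars>  <file name>") and
# GNU uniq, whose -w option limits the comparison to the first N characters
# and whose -D option prints every line of each duplicated group.
shared_hashes() {
    sort "$1" | uniq -w 32 -D
}
```

Any lines printed are images whose hash appears more than once, grouped together, which makes it easy to see which PDFs they came from.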
85 |
86 | The script could be modified to use the find command, which searches recursively, instead of a simple "ls" command. Change the following
87 | $(ls *.pdf)
88 | to
89 | $(find . -iname "*.pdf")
90 |
91 | Doing the above would run a recursive search for PDF files from the location where you run the find command. So you'd need to navigate to the correct path before running it.
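If navigating first is inconvenient, find can also be given the starting folder explicitly. The sketch below only illustrates the recursive walk (it hashes the PDFs themselves rather than their images); the function name is hypothetical and it assumes a find that supports -iname, as GNU find does.

```shell
#!/bin/sh
# hash_pdfs_recursively DIR
# Sketch only: walks DIR and its subfolders, matching *.pdf in any letter case,
# and prints a sha256sum line for each file found. Letting find run the command
# (-exec ... {} +) avoids word-splitting problems with spaces in file names.
hash_pdfs_recursively() {
    find "$1" -type f -iname '*.pdf' -exec sha256sum {} +
}
```

For example, `hash_pdfs_recursively /mnt/n/my_pdfs_are_here` would cover that folder and everything beneath it, regardless of the directory you run it from.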
92 |
93 | Also, if you want to retain the images as well as hash them, you could comment out (or delete) the two "rm -f" commands.
94 |
95 | In a future release, I might add command line arguments to allow you to select that without needing to edit the script. But for now, this does what I need (minimum viable product).
96 |
--------------------------------------------------------------------------------
/Test File/Copy of each version/PDF versions test - v1.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/Copy of each version/PDF versions test - v1.pdf
--------------------------------------------------------------------------------
/Test File/Copy of each version/PDF versions test - v2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/Copy of each version/PDF versions test - v2.pdf
--------------------------------------------------------------------------------
/Test File/Copy of each version/PDF versions test - v3.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/Copy of each version/PDF versions test - v3.pdf
--------------------------------------------------------------------------------
/Test File/Copy of each version/PDF versions test - v4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/Copy of each version/PDF versions test - v4.pdf
--------------------------------------------------------------------------------
/Test File/PDF versions test.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jjrboucher/PDF-Processing/7aeabbf56ff3812853728e99c0a2fac5ff97847f/Test File/PDF versions test.pdf
--------------------------------------------------------------------------------
/hash_all_images.sh:
--------------------------------------------------------------------------------
1 | OIFS="$IFS" # Original Field Separator
2 | IFS=$'\n' # New field separator = new line
3 | ORIGINAL_GLOB=$(shopt -p nocaseglob)
4 | shopt -s nocaseglob # to make the ls command case insensitive
5 |
6 | hashfile="hashes($(date -u +%a_%d-%b-%y_%kh%Mm%Ss_UTC)).csv" # create the hash file using the UTC date to make it unique
7 |
8 | for pdf in $(ls *.pdf); do # for all PDFs in this folder
9 | echo "extracting images from $pdf"
10 | images=$(pdfimages "$pdf" "$pdf-image" -all -print-filenames)
11 | for image in $images; do # for all images within the current PDF in the parent FOR loop
12 | echo "hashing $image" # hash the file
13 | md5sum "$image" >>"$hashfile" # append the hash to the hash file
14 | echo "Deleting $image" # clean up after itself
15 | rm -f "$image" # deleting the image
16 | param_file="${image%.*}"
17 | param_file="${param_file##*/}"
18 | rm -f "$param_file.params" # deleting the associated .params file.
19 | done
20 | done
21 |
22 | # reset values
23 | IFS=$OIFS
24 | $ORIGINAL_GLOB
25 |
--------------------------------------------------------------------------------
/pdf-metadata.sh:
--------------------------------------------------------------------------------
1 | # Written by Jacques Boucher
2 | # jjrboucher@gmail.com
3 | #
4 | # Triage script to review multiple PDFs in folders/subfolders and extract
5 | # the metadata from each and output to a CSV file.
6 |
7 | output_file="pdf-metadata.csv" # name of the file where results are saved.
8 |
9 | if test -f "$output_file"; then
10 | echo "$output_file already exists. Rename or move and re-run the script."
11 | exit
12 | fi
13 |
14 | exiftool -a -G1 -s -ee -csv -r . -ext pdf >>"$output_file" # append results to csv file in current directory
15 |
16 | echo "Results in $output_file."
--------------------------------------------------------------------------------
/pdf-processing.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | ###########################
3 | # Written by Jacques Boucher
4 | # jboucher@unicef.org
5 |
6 | scriptVersion="8 March 2024"
7 |
8 | # Tested on Kali Linux 2023.1 and Kali Linux on WSL.
9 | ##############################
10 | # Installing required binaries
11 | ##############################
12 | # If running on Kali Linux WSL, you will need to run the following:
13 | # sudo apt update
14 | # sudo apt upgrade
15 | # sudo apt install exiftool
16 | # sudo apt install xpdf
17 | # sudo apt install pdf-parser
18 | # sudo apt install poppler-utils
19 | # sudo apt install pdfid
20 |
21 | #################
22 | # Troubleshooting
23 | #################
24 | # If the script gives you a warning that one of the above binaries is missing,
25 | # you can search which package you need to install as follows:
26 | # sudo apt search pdfinfo
27 | # The above would search the packages and return that it's part of poppler-utils. You would then install that
28 | # package with the command: sudo apt install poppler-utils.
29 | #
30 | # As best as the author could test, the installation of the required binaries above should be all that's needed.
31 |
32 | ####################
33 | # Processing summary
34 | ####################
35 | # This script will run the following parsing tools against a PDF that you provide as a command line argument.
36 | # The script will check if a command is present. If it is not, it will note same in the log, alert you on screen, and skip that processing.
37 | #
38 | # 1 - pdfinfo
39 | # 2 - exiftool
40 | # 3 - pdfimages
41 | # 4 - pdfsig
42 | # 5 - pdfid
43 | # 6 - pdf-parser
44 | # 7 - pdffonts
45 | # 8 - pdfdetach
46 | #
47 | # The script will also extract versions of the PDF using the grep and dd commands, looking for the %%EOF string in the PDF.
48 | # Each time you edit a PDF within Adobe, it appends the edits after the existing %%EOF and adds a new %%EOF at the new end of the file.
49 | #
50 | ######################
51 | # Command line options
52 | ######################
53 | # -f PDF being processed (required option)
54 | # -v Prints the version # of the script and exits.
55 | # -p TRUE/FALSE optional switch if you want to process prior versions of a PDF if present. If you don't include this option on the command line and prior versions are detected,
56 | # the script will prompt you, letting you know it found prior versions and ask if you want to process them.
57 | # This option is especially practical if you extracted prior versions of a PDF, and now want to run the script against one of those PDFs.
58 | # In that scenario, you likely don't need to extract prior versions yet again. So you can use the option -p FALSE.
59 | #
60 | # Exit Codes
61 | commandsExecuted=0 # if no commands are missing, the exit code is 0. If any commands are missing, 10**(command #) is added to the exit code.
62 | # example, if pdfinfo is missing, it will add 10**1, or 10 to the exit value.
63 | # if pdfsig is missing, it will add 10**4, or 10000 to the exit value.
64 | # this sort of mimics bit-wise values in that an exit value of 1010 means commands #1 and #3 did not run.
65 | # thus an exit value of 0 means all commands were executed. Caveat: the shell only reports an exit status modulo 256, so values above 255 are truncated.
66 | missingArg=1 # missing an argument after the switch
67 | tooManyArgs=2 # too many arguments
68 | invalidSwitch=3 # invalid switch
69 | missingSwitch=4 # switch not provided
70 | invalidSyntax=5 # invalid syntax
71 | fileDoesNotExist=6 # file does not exist
72 | emptyFile=7 # 0 byte file provided as argument
73 | badpdf=8 # if a bad PDF is passed to the script and the user chooses to exit without processing it.
74 |
75 | # Other variables
76 | investigator=""
77 | caseNumber=""
78 | filename="" # initialize filename to process to blank
79 | filenamenoext="" # filename without the extension
80 | extension="" # extension for the filename
81 | newfile="" # variable to hold new filename when parsing through PDF for versions.
82 | allimages="" # variable for file name where all image hashes are saved for ease of comparison.
83 | v=1 # counter for versions of a PDF (applicable when a PDF has been edited with Adobe).
84 | offsets=() # array variable to hold offsets of the %%EOF markers in a pdf denoting the end of each version.
85 | versions=1 # variable to hold # of versions found in a PDF (i.e., number of %%EOF markers).
86 | switch="" # initialize command line switch to blank
87 | RED='\033[0;31m' # red font
88 | YELLOW='\033[0;1;33m' # yellow font
89 | GREEN='\033[32;1;1m' # green font
90 | NOCOLOUR='\033[0;m' # no colour
91 | usage="Usage: $0 {-p true/false} -f <filename>\nor: $0 -v"
92 | priorVersion="" #This variable is the flag for deciding if the script should attempt to extract prior versions.
93 |
94 |
95 | hashMark() { # function to write section header to the log file.
96 | echo -e "######################### $1 #############################" >>"$logfile"
97 | }
98 |
99 | blankLine() { # inserts a blank line in the log file and on screen.
100 | echo ""
101 | echo -e "" >>"$logfile"
102 | }
103 |
104 | commandNotFound () { #command not found. Logging same.
105 | echo ""
106 | echo -e "################ ${RED}WARNING!${NOCOLOUR} ################"
107 | echo -e "${RED}$1 ${YELLOW}not found!${NOCOLOUR} Skipping this step."
108 | echo -e "##########################################"
109 | echo ""
110 | blankLine
111 | echo "######## WARNING! ########" >> "$logfile"
112 | echo "$1 not found. Skipping this step." >> "$logfile"
113 | echo "##########################" >> "$logfile"
114 | blankLine
115 | }
116 |
117 | pdfImages() {
118 | #pdfimages
119 |
120 | which pdfimages >/dev/null #checks for command
121 | if [ $? -eq 0 ] # exit status 0 = command found
122 | then
123 | hashMark "$2 - $(pdfimages -v 2>&1 | head -n1)"
124 | echo "Extracting images from $1."
125 | echo "Extracting images from $1." >>"$logfile"
126 |
127 | blankLine
128 | pdfimagesRoot="$1-pdfimages"
129 | echo "executing: pdfimages -all \"$1\" \"$pdfimagesRoot\"" | tee -a "$logfile"
130 | echo "Which will extract the following images:" | tee -a "$logfile"
131 | pdfimages -list "$1" | tee -a "$logfile"
132 | pdfimages -all "$1" "$pdfimagesRoot"
133 | pdfimages -png "$1" "$pdfimagesRoot-png-format" # also export as PNG for ease of viewing, as .ccitt is not easily viewable
134 | echo -e "\npdfimages finished execution at $(date).\nExtracted images saved to '$pdfimagesRoot-###.{extension}'.">>"$logfile"
135 | echo "executing: sha256sum \"$pdfimagesRoot\"-*.*" | tee -a "$logfile"
136 | sha256sum "$pdfimagesRoot"-*.* | tee -a "$allimages-unsorted.txt" >> "$logfile"
137 | blankLine
138 | else
139 | commandNotFound "pdfimages"
140 | commandsExecuted=$((commandsExecuted+1000))
141 | fi
142 | }
143 |
144 | checkPDF() {
145 | testPDF="$(pdfinfo "$1" 2>/dev/null)"
146 | if [ "$testPDF" == "" ]
147 | then
148 | pdfValidation="False"
149 | else
150 | pdfValidation="True"
151 | fi
152 | }
153 |
154 | while getopts ":f:p:v" opt; do
155 |
156 | case $opt in
157 | f)
158 | switch=$opt
159 | filename="$OPTARG"
160 | ;;
161 | v)
162 | echo "$0 version: $scriptVersion"
163 | exit
164 | ;;
165 | p)
166 | switch=$opt
167 | priorVersion=$(echo "$OPTARG" | tr '[:upper:]' '[:lower:]')
168 | ;;
169 | :)
170 | echo "You must supply an argument to -$OPTARG">&2
171 | echo -e $usage
172 | exit $missingArg
173 | ;;
174 |
175 | \?)
176 | if [ $# -gt 2 ]; then
177 | echo "Too many arguments."
178 | echo -e $usage
179 | exit $tooManyArgs
180 | fi
181 | echo "Invalid switch."
182 | echo -e $usage
183 | exit $invalidSwitch
184 | ;;
185 | esac
186 | done
187 |
188 | if [ -z "$switch" ]; then #switch is still blank, thus not provided
189 | echo "Did not provide the required switch."
190 | echo -e $usage
191 | exit $missingSwitch
192 | elif [ -z "$filename" ]; then #filename is still blank, thus invalid syntax
193 | echo "Invalid syntax."
194 | echo -e $usage
195 | exit $invalidSyntax
196 | elif [ ! -f "$filename" ]; then
197 | echo "File does not exist."
198 | echo -e $usage
199 | exit $fileDoesNotExist
200 | elif [ ! -s "$filename" ]; then
201 | echo "The file $filename is a 0 byte file."
202 | echo "Nothing to process."
203 | echo -e $usage
204 | exit $emptyFile
205 | fi
206 |
207 | checkPDF "$filename"
208 |
209 | if [ "$pdfValidation" == "False" ]
210 | then
211 | echo -e "${YELLOW}Warning!${NOCOLOUR}\nThe PDF $filename does not appear to be a valid PDF."
212 | read -p "Do you still wish to proceed (y/n)? " continue
213 | if [ "$continue" == "n" ]
214 | then
215 | exit $badpdf
216 | fi
217 | fi
218 |
219 | logfile="$filename.log"
220 | allimages="$filename-hashes of all images"
221 |
222 | hashMark "Tombstone Info"
223 | read -p "Investigator: " investigator
224 | read -p "Case number: " caseNumber
225 |
226 | echo "Executed by user $(whoami) at $(date)." >> "$logfile"
227 | echo "Investigator: $investigator" >> "$logfile"
228 | echo "Case number: $caseNumber" >> "$logfile"
229 | echo "Processing: $filename" >> "$logfile"
230 | echo "sha256 hash: $(sha256sum "$filename" | cut -d " " -f1)" >> "$logfile"
231 | echo "Current folder: $(pwd)" >> "$logfile"
232 | echo "Script version: $scriptVersion" >> "$logfile"
233 | blankLine
234 |
235 | # pdfinfo
236 |
237 | which pdfinfo >/dev/null #checks for command
238 | if [ $? -eq 0 ] # exit status 0 = command found
239 | then
240 | pdfInfoFile="$filename-pdfinfo.txt"
241 | hashMark "1 - $(pdfinfo -v 2>&1 | head -n 1)"
242 | echo "executing pdfinfo \"$filename\"" | tee -a "$logfile"
243 | pdfinfo "$filename">"$pdfInfoFile"
244 | echo -e "pdfinfo finished execution at $(date).\nResults written to '$pdfInfoFile'." >> "$logfile"
245 | echo "executing: sha256sum \"$pdfInfoFile\"" | tee -a "$logfile"
246 | sha256sum "$pdfInfoFile" >> "$logfile"
247 |
248 | blankLine
249 | else
250 | commandNotFound "pdfinfo"
251 | commandsExecuted=$((commandsExecuted+10))
252 | fi
253 |
254 | # exiftool
255 |
256 | which exiftool >/dev/null #checks for command
257 | if [ $? -eq 0 ] # exit status 0 = command found
258 | then
259 | exifFile="$filename-exif.csv"
260 | hashMark "2 - exiftool version $(exiftool -ver)"
261 | echo "executing: exiftool -a -G1 -s -ee -csv \"$filename\" > \"$exifFile\"" | tee -a "$logfile"
262 | exiftool -a -G1 -s -ee -csv "$filename">"$exifFile"
263 | echo -e "exiftool finished execution at $(date).\nResults written to '$exifFile'.">>"$logfile"
264 | echo "executing: sha256sum \"$exifFile\"" | tee -a "$logfile"
265 | sha256sum "$exifFile" >> "$logfile"
266 | blankLine
267 | else
268 | commandNotFound "exiftool"
269 | commandsExecuted=$((commandsExecuted+100))
270 | fi
271 |
272 | #pdfimages
273 | pdfImages "$filename" "3"
274 |
275 | #pdfsig
276 |
277 | which pdfsig >/dev/null #checks for command
278 | if [ $? = 0 ] # exit status 0 = command found
279 | then
280 | hashMark "4 - $(pdfsig -v 2>&1 | head -n1)"
281 | pdfsigFilename="$filename.pdfsig.txt"
282 | echo "executing: pdfsig -nocert -dump \"$filename\" >>\"$logfile\"" | tee -a "$logfile"
283 | pdfsig -nocert -dump "$filename" >>"$logfile"
284 | echo -e "pdfsig finished execution at $(date).\nResults appended to '$logfile'.">>"$logfile"
285 | echo -e "Signature(s), if present, is/are dumped to the current folder, $(pwd).">>"$logfile"
286 | blankLine
287 | else
288 | commandNotFound "pdfsig"
289 | commandsExecuted=$((commandsExecuted+10000))
290 | fi
291 |
292 | #pdfid
293 |
294 | which pdfid >/dev/null #checks for command
295 | if [ $? -eq 0 ] # exit status 0 = command found
296 | then
297 | hashMark "5 - pdfid version $(pdfid --version | cut -d " " -f2)"
298 | pdfidFilename="$filename.pdfid.txt"
299 | echo "executing: pdfid -l \"$filename\">\"$pdfidFilename\"" | tee -a "$logfile"
300 | pdfid -l "$filename">"$pdfidFilename"
301 | echo -e "pdfid finished execution at $(date).\nResults written to '$pdfidFilename'.">>"$logfile"
302 | echo "executing: sha256sum \"$pdfidFilename\"" | tee -a "$logfile"
303 | sha256sum "$pdfidFilename" >> "$logfile"
304 | blankLine
305 | else
306 | commandNotFound "pdfid"
307 | commandsExecuted=$((commandsExecuted+100000))
308 | fi
309 |
310 | #pdfparser
311 |
312 | which pdf-parser >/dev/null #checks for command
313 | if [ $? -eq 0 ] # exit status 0 = command found
314 | then
315 | hashMark "6 - pdf-parser version $(pdf-parser --version | grep "pdf-parser" | cut -d " " -f2)"
316 | pdfparserFilename="$filename.pdfparser.txt"
317 | echo "executing: pdf-parser \"$filename\">\"$pdfparserFilename\"" | tee -a "$logfile"
318 | pdf-parser "$filename">"$pdfparserFilename"
319 | echo -e "pdf-parser finished execution at $(date).\nResults written to '$pdfparserFilename'.">>"$logfile"
320 | echo "executing: sha256sum \"$pdfparserFilename\"" | tee -a "$logfile"
321 | sha256sum "$pdfparserFilename" >> "$logfile"
322 | blankLine
323 | else
324 | commandNotFound "pdf-parser"
325 | commandsExecuted=$((commandsExecuted+1000000))
326 | fi
327 |
328 | #pdffonts
329 |
330 | which pdffonts >/dev/null #checks for command
331 | if [ $? -eq 0 ] # exit status 0 = command found
332 | then
333 | hashMark "7 - $(pdffonts -v 2>&1 | head -n1)"
334 | pdffontsFilename="$filename.pdffonts.txt"
335 | echo "executing: pdffonts \"$filename\">\"$pdffontsFilename\" 2>>\"$pdffontsFilename\"" | tee -a "$logfile"
336 | pdffonts "$filename" >"$pdffontsFilename" 2>>"$pdffontsFilename"
337 | echo -e "pdffonts finished execution at $(date).\nResults written to '$pdffontsFilename'.">>"$logfile"
338 | echo "executing: sha256sum \"$pdffontsFilename\"" | tee -a "$logfile"
339 | sha256sum "$pdffontsFilename" >> "$logfile"
340 | blankLine
341 | else
342 | commandNotFound "pdffonts"
343 | commandsExecuted=$((commandsExecuted+10000000))
344 | fi
345 |
346 | #pdfdetach
347 |
348 | which pdfdetach >/dev/null #checks for command
349 | if [ $? -eq 0 ] # exit status 0 = command found
350 | then
351 | hashMark "8 - $(pdfdetach -v 2>&1 | head -n1)"
352 | echo "executing: pdfdetach -saveall \"$filename\"" | tee -a "$logfile"
353 | echo "Which will extract the following files (if applicable):" | tee -a "$logfile"
354 | embeddedItemsCount=$(pdfdetach -list "$filename"|wc -l)
355 | embeddedItemsCount=$((embeddedItemsCount-1))
356 | pdfdetach -list "$filename" | tee -a "$logfile"
357 | pdfdetach -saveall "$filename"
358 | echo -e "pdfdetach finished execution at $(date).">>"$logfile"
359 |
360 | if [[ "$(pdfdetach -list "$filename")" != "0 embedded files" && "$(pdfdetach -list "$filename")" != "" ]]
361 | # if there are no embedded files and you do the sha256sum command, it waits for input. This avoids hanging the script in such cases.
362 | then
363 | # read one attachment name per line so that names with spaces survive and each file is hashed separately
364 | pdfdetach -list "$filename" | tail -n "$embeddedItemsCount" | cut -d: -f2- | while IFS= read -r f; do
365 | f="${f#"${f%%[![:space:]]*}"}" # trim the leading whitespace left over from the "N: name" listing
366 | echo "executing: sha256sum \"$f\"" | tee -a "$logfile"
367 | sha256sum "$f" | tee -a "$logfile"
368 | done; fi
369 | else
370 | commandNotFound "pdfdetach"
371 | commandsExecuted=$((commandsExecuted+100000000))
372 | fi
373 |
374 | blankLine
375 |
376 | #extract versions of the PDF using grep and dd, commands commonly available on any Linux distro
377 |
378 | hashMark "9 - Extracting prior versions of the PDF"
379 |
380 | filenamenoext="${filename%.*}"
381 | extension="${filename##*.}"
382 | v=1
383 |
384 | offsets=($(grep --only-matching --byte-offset --text "%%EOF" "$filename"| cut -d : -f 1))
385 |
386 | if [ "${#offsets[@]}" -gt 0 ] && [ "${offsets[0]}" -lt 600 ]; then
387 | unset 'offsets[0]' # removes the first element in the array, as it's a false positive.
388 | fi
389 |
390 | priorVersions=${#offsets[@]}
391 | priorVersions=$((priorVersions-1)) # reduces the count by 1, as the current version does not count as a prior version, but will be in the array.
392 |
393 | if [ $priorVersions -lt 1 ]; then
394 | echo "There are no previous versions of the PDF embedded in this pdf." | tee -a "$logfile"
395 | else
396 | if ! [[ "$priorVersion" = "true" || "$priorVersion" = "false" ]]; then # if the user did not provide a valid option for -p (or did not specify it)
397 | echo -e "There are ${GREEN}$priorVersions prior versions${NOCOLOUR} of this PDF based on the number of %%EOF signatures in it.\n"
398 | echo "The script can attempt to extract them with the caveat that a prior version may or may not be a properly formed PDF."
399 | read -p "Do you want the script to attempt to extract all versions of this PDF (Y/N)? [Y] " priorVersion # default response is Y if user just hits ENTER
400 |
401 | if [[ "$priorVersion" == "Y" || "$priorVersion" == "y" || "$priorVersion" == "" ]]; then
402 | priorVersion="true"
403 | else
404 | priorVersion="false"
405 | fi
406 | fi
407 | if [ "$priorVersion" == "true" ];then # process prior versions
408 | echo "Excluding the current version, there are $priorVersions prior versions in this PDF." | tee -a "$logfile"
409 | echo "The script will extract each of them, assigning them a version number. Version 1 being the oldest version, and version $priorVersions being the version prior to the current version." | tee -a "$logfile"
410 |
411 | #unset offsets[0] # removes the first element in the array, as it's a false positive.
412 |
413 | for size in ${offsets[@]}; do
414 |
415 |
416 | if [ $v -le $priorVersions ]; then # if it's not the last version. Last version is redundant, as it's the original PDF passed to the script.
417 |
418 | newfile="$filenamenoext version $v.$extension"
419 |
420 | blocksize=$((size+7))
421 |
422 | blankLine
423 | echo "executing: dd if=\"$filename\" of=\"$newfile\" bs=$blocksize count=1 status=noxfer 2> /dev/null" | tee -a "$logfile"
424 | echo "This will extract version $v of the pdf $filename, assigning it the new filename $newfile" | tee -a "$logfile"
425 | echo ""
426 |
427 | dd if="$filename" of="$newfile" bs=$blocksize count=1 status=noxfer 2> /dev/null
428 |
429 | echo "executing: sha256sum \"$newfile\" >>\"$logfile\"" | tee -a "$logfile"
430 | sha256sum "$newfile" >>"$logfile"
431 |
432 | blankLine
433 |
434 | checkPDF "$newfile"
435 |
436 | if [ "$pdfValidation" == "True" ] # Valid PDF
437 | then
438 | echo -e "Prior ${GREEN}version $v${NOCOLOUR} of '$filename' appears to be a ${GREEN}valid PDF.${NOCOLOUR}"
439 | echo -e "Prior version $v of '$filename' appears to be a valid PDF." >> "$logfile"
440 | else # Not a valid PDF
441 | echo -e "Prior ${YELLOW}version $v${NOCOLOUR} of '$filename' ${RED}does not appear to be a valid PDF.${NOCOLOUR}"
442 | echo -e "Prior version $v of '$filename' does not appear to be a valid PDF." >> "$logfile"
443 | fi
444 | blankLine
445 | pdfImages "$newfile" "9.$v" 2>/dev/null # Attempt to extract images from the version. Even if not a valid PDF, attempting regardless.
446 |
447 | v=$((v+1))
448 | fi
449 | done
450 |
451 | echo -e "\nExtracting prior versions of the PDF finished execution at $(date).">>"$logfile"
452 |
453 | echo "executing: exiftool -a -G1 -s -ee -csv \"$filenamenoext\"*\".$extension\" 2> /dev/null >> \"$filename - all versions - exif.csv\"" | tee -a "$logfile"
454 |
455 | exiftool -a -G1 -s -ee -csv "$filenamenoext"*".$extension" 2> /dev/null >> "$filename - all versions - exif.csv"
456 | fi
457 | fi
458 |
459 | # sorting all image hashes for ease of identifying matching images
460 |
461 | blankLine
462 |
463 | echo "executing: sort \"$allimages-unsorted.txt\" > \"$allimages.txt\"" | tee -a "$logfile"
464 | echo -e "All image hashes are also found in: ${GREEN}'$allimages.txt'.${NOCOLOUR}"
465 | echo "All image hashes are also found in: $allimages.txt.">>"$logfile"
466 |
467 |
468 | sort "$allimages-unsorted.txt" > "$allimages.txt"
469 |
470 |
471 | echo -e "\n###############################################################"
472 | echo -e "Log file written to: ${GREEN}'$logfile'.${NOCOLOUR}"
473 | echo -e "Script finished at $(date)." >> "$logfile"
474 | exit $commandsExecuted
475 |
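The version-extraction steps above (locate each %%EOF marker, then copy everything up to and including it) can be sketched in isolation. This is a minimal sketch under stated assumptions, not the script's own code: `revision_lengths` and `carve_prefix` are hypothetical helper names, and `head -c` is swapped in for `dd` as the copy step. The default padding of 7 bytes mirrors the script's `size+7`, which is assumed here to cover the 5 marker bytes plus a 2-byte line ending.

```shell
#!/bin/bash
# Hypothetical helper (not part of the script): print the byte length of each
# revision, i.e. the offset of every %%EOF marker plus a padding value.
# The padding defaults to 7, mirroring blocksize=$((size+7)) in the script.
revision_lengths() {
    local pdf="$1" pad="${2:-7}" offset
    # grep prints "offset:match" for every marker; cut keeps only the offset.
    grep --only-matching --byte-offset --text '%%EOF' "$pdf" | cut -d: -f1 |
    while read -r offset; do
        echo $((offset + pad))
    done
}

# Hypothetical helper: copy the first $2 bytes of file $1 into file $3.
# head -c is used here instead of dd; both copy a fixed-size prefix.
carve_prefix() {
    head -c "$2" "$1" > "$3"
}
```

Each length printed by `revision_lengths` can be passed to `carve_prefix` to write one candidate prior version. As in the script, a marker near the start of the file may be a false positive and would need to be filtered out by the caller.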
--------------------------------------------------------------------------------
/pdf-triage.sh:
--------------------------------------------------------------------------------
1 | # Written by Jacques Boucher
2 | # jjrboucher@gmail.com
3 | #
4 | # Triage script to review multiple PDFs in folders/subfolders and extract
5 | # a few data points that can help identify possible PDFs warranting further
6 | # review for possible manipulation by a subject.
7 | #
8 | # You can modify the script to extract other fields if they are of interest to you.
9 | # Updated 11 June 2024
10 |
11 | output_file="pdf-triage.tsv"
12 |
13 | if test -f $output_file; then
14 | echo "$output_file already exists. Rename or move and re-run the script."
15 | exit
16 | fi
17 |
18 | OIFS="$IFS" # Original Field Separator
19 | IFS=$'\n' # New field separator = new line
20 |
21 | a=$(find . -iname "*.pdf") # Find all PDFs recursively from current folder
22 | printf 'File\tCreate Date\tModify Date\t# of images\tAuthor\tProducer\t# of fonts\t# of versions\thash\n' >"$output_file" # write tab-separated headers to the TSV file
23 | for i in $a # loop through each item
24 | do
25 | image_count=$(pdfimages -list "$i" | wc -l) # get only # of lines in output (# of images)
26 | image_count=$((image_count-2)) # subtract 2 lines - headers - to get actual # of images
27 | author=$(exiftool -S -s -author "$i") # get author of the document without the tag name
28 | create_date=$(exiftool -S -s -CreateDate "$i")
29 | modify_date=$(exiftool -S -s -ModifyDate "$i")
30 | producer=$(exiftool -S -s -Producer "$i")
31 | font_count=$(pdffonts "$i" | wc -l) # get the # of fonts - but includes 2 additional lines for header.
32 | font_count=$((font_count-2)) # remove the headers
33 | hash=$(md5sum "$i" | cut -d " " -f1)
34 |
35 | offsets=($(grep --only-matching --byte-offset --text "%%EOF" "$i"| cut -d : -f 1)) # find all %%EOF instances in the PDF
36 | if [ ${#offsets[@]} -gt 0 ] && [ "${offsets[0]}" -lt 600 ]; then # skip the check when no %%EOF marker was found
37 | unset offsets[0] # removes the first element in the array, as it's a false positive.
38 | fi
39 | version_count=${#offsets[@]} # Number of instances of %%EOF
40 |
41 | printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$i" "$create_date" "$modify_date" "$image_count" "$author" "$producer" "$font_count" "$version_count" "$hash" >>"$output_file" # append results to the TSV file in the current directory
42 | done
43 |
44 | echo "Results can be found in $output_file."
45 | IFS="$OIFS" # restore IFS to original
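The triage loop above depends on temporarily setting IFS to a newline so that file names with spaces survive word splitting. A minimal alternative sketch, assuming bash and a `find` that supports `-print0`, reads NUL-delimited paths instead and leaves IFS untouched. `count_eof_markers` and `triage_versions` are hypothetical names, only the version-count column is reproduced, and the "first offset below 600" false-positive filter from the script is deliberately left out to keep the example short.

```shell
#!/bin/bash
# Hypothetical helper: count the %%EOF markers in one file.
count_eof_markers() {
    grep --only-matching --text '%%EOF' "$1" | wc -l | tr -d ' '
}

# Hypothetical helper: print "path<TAB>marker count" for every PDF under a
# folder (defaults to the current folder), searching recursively.
triage_versions() {
    local dir="${1:-.}" pdf
    find "$dir" -iname '*.pdf' -print0 |
    while IFS= read -r -d '' pdf; do   # NUL-delimited read keeps odd file names intact
        printf '%s\t%s\n' "$pdf" "$(count_eof_markers "$pdf")"
    done
}
```

Running `triage_versions /path/to/folder > versions.tsv` would give a two-column subset of the report; the remaining columns could be added to the `printf` call in the same way.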
--------------------------------------------------------------------------------
/recursive-pdf-processing.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 | ###########################
3 | # Written by Jacques Boucher
4 | # jboucher@unicef.org
5 | scriptVersion="3 October 2024"
6 | # Tested on Kali Linux 2023.1 and Kali Linux on WSL.
7 | ##############################
8 | # Installing required binaries
9 | ##############################
10 | # If running on Kali Linux WSL, you will need to run the following:
11 | # sudo apt update
12 | # sudo apt upgrade
13 | # sudo apt install exiftool
14 | # sudo apt install xpdf
15 | # sudo apt install pdf-parser
16 | # sudo apt install poppler-utils
17 | # sudo apt install pdfid
18 |
19 | #################
20 | # Troubleshooting
21 | #################
22 | # If the script gives you a warning that one of the above binaries is missing,
23 | # you can search which package you need to install as follows:
24 | # sudo apt search pdfinfo
25 | # The above would search the packages and return that it's part of poppler-utils. You would then install that
26 | # package with the command: sudo apt install poppler-utils.
27 | #
28 | # As best as the author could test, the installation of the required binaries above should be all that's needed.
29 |
30 | ####################
31 | # Processing summary
32 | ####################
33 | # This script will run the following parsing tools against a PDF that you provide as a command line argument.
34 | # The script will check if a command is present. If it is not, it will note same in the log, alert you on screen, and skip that processing.
35 | #
36 | # 1 - pdfinfo
37 | # 2 - exiftool
38 | # 3 - pdfimages
39 | # 4 - pdfsig
40 | # 5 - pdfid
41 | # 6 - pdf-parser
42 | # 7 - pdffonts
43 | # 8 - pdfdetach
44 | #
45 | # The script will also extract prior versions of the PDF using the grep and dd commands, looking for the %%EOF string in the PDF.
46 | # Each time you edit a PDF within Adobe, it appends the changes to the end of the file, followed by a new %%EOF marker.
47 | #
48 | ######################
49 | # Command line options
50 | ######################
51 | # -f PDF being processed (required unless -d is used)
52 | # -v Prints the version # of the script and exits.
53 | # -p TRUE/FALSE optional switch if you want to process prior versions of a PDF if present. If you don't include this option on the command line and prior versions are detected,
54 | # the script will prompt you, letting you know it found prior versions and ask if you want to process them.
55 | # This option is especially practical if you extracted prior versions of a PDF, and now want to run the script against one of those PDFs.
56 | # In that scenario, you likely don't need to extract prior versions yet again. So you can use the option -p FALSE.
57 | #
58 | # Exit Codes
59 | commandsExecuted=0 # if no commands are missing, the exit code is 0. If any command is missing, 10**(command #) is added to the exit code.
60 | # For example, if pdfinfo is missing, it adds 10**1, or 10, to the exit value.
61 | # If pdfsig is missing, it adds 10**4, or 10000, to the exit value.
62 | # This loosely mimics bit-wise flags in that an exit value of 1010 means commands #1 and #3 did not run.
63 | # Thus an exit value of 0 means all commands were executed. Note: the shell only reports the low 8 bits (0-255) of an exit value, so larger totals wrap around.
64 | missingArg=1 # missing an argument after the switch
65 | tooManyArgs=2 # too many arguments
66 | invalidSwitch=3 # invalid switch
67 | missingSwitch=4 # switch not provided
68 | invalidSyntax=5 # invalid syntax
69 | fileDoesNotExist=6 # file does not exist
70 | emptyFile=7 # 0 byte file provided as argument
71 | badpdf=8 # if a bad PDF is passed to the script and the user chooses to exit without processing it.
72 | f_and_d=9 # provided both -f and -d options
73 | noFolder=10 # did not provide a folder parameter with the -d option
74 |
75 | # Other variables
76 | investigator=""
77 | caseNumber=""
78 | currentDateTime=$(date +%d-%m-%YT%H%M%S)
79 | directory=0 # set default to false, not parsing a directory
80 | executionFolder=$(pwd)
81 | filename="" # initialize filename to process to blank
82 | filenamenoext="" # filename without the extension
83 | folder="" # folder to parse
84 | extension="" # extension for the filename
85 | newfile="" # variable to hold new filename when parsing through PDF for versions.
86 | allimages="" # variable for file name where all image hashes are saved for ease of comparison.
87 | v=1 # counter for versions of a PDF (applicable when a PDF has been edited with Adobe).
88 | offsets=() # array variable to hold offsets of the %%EOF markers in a pdf denoting the end of each version.
89 | versions=0 # variable to hold # of versions found in a PDF (i.e., number of %%EOF markers).
90 | switch="" # initialize command line switch to blank
91 | recursive=0 # set default to false, not recursive - ignored if -d not used.
92 | outputFolder=""
93 | RED='\033[0;31m' # red font
94 | YELLOW='\033[0;1;33m' # yellow font
95 | GREEN='\033[32;1;1m' # green font
96 | NOCOLOUR='\033[0;m' # no colour
97 | usage="Usage: $0 {-f <file>} {-p TRUE/FALSE} {-d <folder>} {-r}\nor: $0 -v"
98 | priorVersion="" #This variable is the flag for deciding if the script should attempt to extract prior versions.
99 | OIFS="$IFS" # Original Field Separator
100 | IFS=$'\n' # New field separator = new line
101 |
102 |
103 | hashMark() { # function to write section header to the log file.
104 | echo -e "######################### $1 #############################" >>"$logfile"
105 | }
106 |
107 | fileheader() { # function to write the file header to the log file.
108 | echo -e "*************************************************************************************************" >>"$logfile"
109 | echo -e " Processing $1" >>"$logfile"
110 | echo -e "*************************************************************************************************" >>"$logfile"
111 | blankLine
112 | }
113 |
114 | blankLine() { # inserts a blank line in the log file and on screen.
115 | echo ""
116 | echo -e "" >>"$logfile"
117 | }
118 |
119 | commandNotFound () { #command not found. Logging same.
120 | echo ""
121 | echo -e "################ ${RED}WARNING!${NOCOLOUR} ################"
122 | echo -e "${RED}$1 ${YELLOW}not found!${NOCOLOUR} Skipping this step."
123 | echo -e "##########################################"
124 | echo ""
125 | blankLine
126 | echo "######## WARNING! ########" >> "$logfile"
127 | echo "$1 not found. Skipping this step." >> "$logfile"
128 | echo "##########################" >> "$logfile"
129 | blankLine
130 | }
131 |
132 | pdfImages() {
133 | #pdfimages
134 |
135 | which pdfimages >/dev/null #checks for command
136 | if [ $? -eq 0 ] # exit status 0 = command found
137 | then
138 | hashMark "$2 - $(pdfimages -v 2>&1 | head -n1)"
139 | echo "Extracting images from $1."
140 | echo "Extracting images from $1." >>"$logfile"
141 |
142 | blankLine
143 | pdfimagesRoot="$3"
144 | echo "executing: pdfimages -all \"$1\" \"$pdfimagesRoot\"" | tee -a "$logfile"
145 | echo "Which will extract the following images:" | tee -a "$logfile"
146 | pdfimages -list "$1" | tee -a "$logfile"
147 | pdfimages -all "$1" "$pdfimagesRoot"
148 | pdfimages -png "$1" "$pdfimagesRoot-png-format" # also export as PNG for ease of viewing, as .ccitt is not easily viewable
149 | echo -e "\npdfimages finished execution at $(date).\nExtracted images saved to '$pdfimagesRoot-###.{extension}'.">>"$logfile"
150 | echo "executing: sha256sum \"$pdfimagesRoot\"-*.*" | tee -a "$logfile"
151 | sha256sum "$pdfimagesRoot"-*.* | tee -a "$allimages"-unsorted.txt >> "$logfile"
152 | blankLine
153 | else
154 | commandNotFound "pdfimages"
155 | commandsExecuted=$((commandsExecuted+1000)) # 10**3: pdfimages is command #3
156 | fi
157 | }
158 |
159 | checkPDF() {
160 | processThisPDF="y" # defaults to yes
161 | testPDF="$(pdfinfo "$1" 2>/dev/null)"
162 | if [ "$testPDF" == "" ]; then
163 | pdfValidation="False"
164 | else
165 | pdfValidation="True"
166 | fi
167 |
168 | if [ "$pdfValidation" == "False" ]; then
169 | echo -e "${YELLOW}Warning!${NOCOLOUR}\nThe PDF $1 does not appear to be a valid PDF."
170 | echo -e "According to pdfinfo, ${1} does not appear to be a valid PDF." >> "$logfile"
171 | read -p "Do you still wish to proceed (y/n)? " processThisPDF
172 | processThisPDF=$(echo $processThisPDF | tr '[:upper:]' '[:lower:]')
173 | fi
174 | }
175 |
176 | while getopts ":d:f:p:rv" opt; do
177 | case $opt in
178 | d)
179 | switch=$opt
180 | folder="$OPTARG"
181 | if [ -z "$folder" ]; then
182 | echo -e "You ${RED}did not${NOCOLOUR} provide a ${RED}folder${NOCOLOUR} with the -d switch."
183 | elif [ ! -d "$folder" ]; then
184 | echo -e "You ${RED}did not${NOCOLOUR} provide a valid ${RED}folder${NOCOLOUR} with the -d switch."
185 | IFS="$OIFS" # restore IFS to original
186 | exit $noFolder
187 | fi
188 | ;;
189 | f)
190 | switch=$opt
191 | filename="$OPTARG"
192 | if [ ! -z "$filename" ] && [ ! -z "$folder" ]; then
193 | echo -e "You provided ${RED}both${NOCOLOUR} the file (-f) and directory (-d) switches."
194 | echo -e "Please provide ${GREEN}one or the other${NOCOLOUR}, but ${RED}not both.${NOCOLOUR}"
195 | IFS="$OIFS" # restore IFS to original
196 | exit $f_and_d
197 | elif [ -z "$filename" ]; then #filename is still blank, thus invalid syntax
198 | echo "Invalid syntax."
199 | echo -e $usage
200 | IFS="$OIFS" # restore IFS to original
201 | exit $invalidSyntax
202 | elif [ ! -f "$filename" ]; then
203 | echo "File does not exist."
204 | echo -e $usage
205 | IFS="$OIFS" # restore IFS to original
206 | exit $fileDoesNotExist
207 | elif [ ! -s "$filename" ]; then
208 | echo "The file $filename is a 0 byte file."
209 | echo "Nothing to process."
210 | echo -e $usage
211 | IFS="$OIFS" # restore IFS to original
212 | exit $emptyFile
213 | fi
214 | ;;
215 | v)
216 | echo "$0 version: $scriptVersion"
217 | IFS="$OIFS" # restore IFS to original
218 | exit
219 | ;;
220 | p) # attempt to extract prior versions
221 | switch=$opt
222 | priorVersion="$OPTARG"
223 | priorVersion=$(echo $priorVersion | tr '[:upper:]' '[:lower:]')
224 | ;;
225 | r)
226 | recursive=1 # user selected option to recursively search for files.
227 | ;;
228 | :)
229 | echo "You must supply an argument to -$OPTARG">&2
230 | echo -e $usage
231 | IFS="$OIFS" # restore IFS to original
232 | exit $missingArg
233 | ;;
234 |
235 | \?)
236 | echo "Invalid switch."
237 | echo -e $usage
238 | IFS="$OIFS" # restore IFS to original
239 | exit $invalidSwitch
240 | ;;
241 | esac
242 | done
243 |
244 | outputFolder="report_$currentDateTime"
245 |
246 | if [ -z "$switch" ]; then #switch is still blank, thus not provided
247 | echo "Did not provide the required switch."
248 | echo -e $usage
249 | IFS="$OIFS" # restore IFS to original
250 | exit $missingSwitch
251 | fi
252 |
253 | # if a folder is passed, assign output of "find" command to files.
254 |
255 | if [ -n "$folder" ]; then
256 | if [ $recursive -eq 1 ]; then
257 | filename=$(find "$folder" -iname "*.pdf")
258 | else
259 | filename=$(find "$folder" -maxdepth 1 -iname "*.pdf")
260 | fi
261 | fi
262 |
263 | mkdir $outputFolder
264 |
265 | logfile="$outputFolder/processing_results.log"
266 |
267 | hashMark "Tombstone Info"
268 | read -p "Investigator: " investigator
269 | read -p "Case number: " caseNumber
270 |
271 | echo "Executed by user $(whoami) at $(date)." >> "$logfile"
272 | echo "Investigator: $investigator" >> "$logfile"
273 | echo "Case number: $caseNumber" >> "$logfile"
274 | echo "Current folder: $executionFolder" >> "$logfile"
275 | echo "Script version: "$scriptVersion >> "$logfile"
276 | echo "Command executed: $0 $*" >> "$logfile"
277 | # echo output folder
278 | blankLine
279 |
280 | fileCount=0
281 |
282 | for fileToProcess in $filename # loop through each file
283 | do
284 | fileCount=$((fileCount+1))
285 | filenamenoext="${fileToProcess%.*}"
286 | filenamenoext="${filenamenoext##*/}"
287 | extension="${fileToProcess##*.}"
288 | blankLine
289 | fileheader "$fileToProcess"
290 | blankLine
291 | echo "Creating output folder $outputFolder/$fileCount-$filenamenoext for this file." >> "$logfile"
292 | blankLine
293 | fileFolder="$outputFolder/$fileCount-$filenamenoext"
294 | mkdir "$fileFolder"
295 | echo "sha256 hash: $(sha256sum "$fileToProcess" | cut -d " " -f1)" >> "$logfile"
296 | blankLine
297 |
298 | checkPDF "$fileToProcess"
299 |
300 | if [ "$pdfValidation" == "False" ]; then
301 | if [ "$processThisPDF" != "y" ]; then
302 | echo "User opted to not process \"$fileToProcess\" as it does not appear to be a valid PDF according to pdfinfo." | tee -a "$logfile"
303 | blankLine
304 | continue # skip out of the loop
305 | else
306 | echo "User opted to process \"$fileToProcess\" despite it not appearing to be a valid PDF according to pdfinfo." | tee -a "$logfile"
307 | blankLine
308 | fi
309 | fi
310 |
311 | imagesFileName=$(basename ${fileToProcess})
312 | allimages="$fileFolder/"${imagesFileName%.*}"-hashes of all images"
313 |
314 | # pdfinfo
315 |
316 | which pdfinfo >/dev/null #checks for command
317 | if [ $? -eq 0 ] # exit status 0 = command found
318 | then
319 | pdfInfoFile="$fileFolder/${fileToProcess##*/}-pdfinfo.txt"
320 | hashMark "1 - $(pdfinfo -v 2>&1 | head -n 1)"
321 | echo "executing pdfinfo \"$fileToProcess\"" | tee -a "$logfile"
322 | pdfinfo "$fileToProcess">"$pdfInfoFile"
323 | echo -e "pdfinfo finished execution at $(date).\nResults written to '$pdfInfoFile'." >> "$logfile"
324 | echo "executing: sha256sum \"$pdfInfoFile\"" | tee -a "$logfile"
325 | sha256sum "$pdfInfoFile" >> "$logfile"
326 |
327 | blankLine
328 | else
329 | commandNotFound "pdfinfo"
330 | commandsExecuted=$((commandsExecuted+10)) # 10**1: pdfinfo is command #1
331 | fi
332 |
333 | # exiftool
334 |
335 | which exiftool >/dev/null #checks for command
336 | if [ $? -eq 0 ] # exit status 0 = command found
337 | then
338 | exifFile="$fileFolder/${fileToProcess##*/}-exif.csv"
339 | hashMark "2 - exiftool version $(exiftool -ver)"
340 | echo "executing: exiftool -a -G1 -s -ee -csv \"$fileToProcess\" > \"$exifFile\"" | tee -a "$logfile"
341 | exiftool -a -G1 -s -ee -csv "$fileToProcess">"$exifFile"
342 | echo -e "exiftool finished execution at $(date).\nResults written to '$exifFile'.">>"$logfile"
343 | echo "executing: sha256sum \"$exifFile\"" | tee -a "$logfile"
344 | sha256sum "$exifFile" >> "$logfile"
345 | blankLine
346 | else
347 | commandNotFound "exiftool"
348 | commandsExecuted=$((commandsExecuted+100)) # 10**2: exiftool is command #2
349 | fi
350 |
351 | #pdfimages
352 | pdfImages "$fileToProcess" "3" "$fileFolder/${fileToProcess##*/}-pdfimages"
353 |
354 | #pdfsig
355 |
356 | which pdfsig >/dev/null #checks for command
357 | if [ $? = 0 ] # exit status 0 = command found
358 | then
359 | hashMark "4 - $(pdfsig -v 2>&1 | head -n1)"
360 | pdfsigFilename="$fileFolder/$filenamenoext.pdfsig.txt"
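361 | echo "executing: pdfsig -nocert -dump \"$fileToProcess\" | tee \"$pdfsigFilename\" >> \"$logfile\"" | tee -a "$logfile"
362 | pdfsig -nocert -dump "$fileToProcess" | tee "$pdfsigFilename" >>"$logfile" # write the results to the file named in the log message as well as to the log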
363 |
364 | if [ -e "$executionFolder/${fileToProcess##*/}.sig0" ]; then # if it extracted a signature file.
365 | mv $executionFolder/${fileToProcess##*/}.sig* "$fileFolder" # move the file(s) to the correct folder
366 | echo "executing: sha256sum \"$fileFolder/${fileToProcess##*/}.sig*\"" | tee -a "$logfile"
367 | sha256sum $fileFolder/${fileToProcess##*/}.sig* | tee -a "$logfile"
368 | fi
369 |
370 | echo -e "pdfsig finished execution at $(date).\nResults written to '$pdfsigFilename'.">>"$logfile"
371 | echo -e "Signature(s), if present, is/are dumped to $fileFolder.">>"$logfile"
372 | blankLine
373 | else
374 | commandNotFound "pdfsig"
375 | commandsExecuted=$((commandsExecuted+10000)) # 10**4: pdfsig is command #4
376 | fi
377 |
378 | #pdfid
379 |
380 | which pdfid >/dev/null #checks for command
381 | if [ $? -eq 0 ] # exit status 0 = command found
382 | then
383 | hashMark "5 - pdfid version $(pdfid --version | cut -d " " -f2)"
384 | pdfidFilename="$fileFolder/$filenamenoext.pdfid.txt"
385 | echo "executing: pdfid -l \"$fileToProcess\">\"$pdfidFilename\"" | tee -a "$logfile"
386 | pdfid -l "$fileToProcess">"$pdfidFilename"
387 | echo -e "pdfid finished execution at $(date).\nResults written to '$pdfidFilename'.">>"$logfile"
388 | echo "executing: sha256sum \"$pdfidFilename\"" | tee -a "$logfile"
389 | sha256sum "$pdfidFilename" >> "$logfile"
390 | blankLine
391 | else
392 | commandNotFound "pdfid"
393 | commandsExecuted=$((commandsExecuted+100000)) # 10**5: pdfid is command #5
394 | fi
395 |
396 | #pdfparser
397 |
398 | which pdf-parser >/dev/null #checks for command
399 | if [ $? -eq 0 ] # exit status 0 = command found
400 | then
401 | hashMark "6 - pdf-parser version $(pdf-parser --version | grep "pdf-parser" | cut -d " " -f2)"
402 | pdfparserFilename="$fileFolder/$filenamenoext.pdfparser.txt"
403 | echo "executing: pdf-parser \"$fileToProcess\">\"$pdfparserFilename\"" | tee -a "$logfile"
404 | pdf-parser "$fileToProcess">"$pdfparserFilename"
405 | echo -e "pdf-parser finished execution at $(date).\nResults written to '$pdfparserFilename'.">>"$logfile"
406 | echo "executing: sha256sum \"$pdfparserFilename\"" | tee -a "$logfile"
407 | sha256sum "$pdfparserFilename" >> "$logfile"
408 | blankLine
409 | else
410 | commandNotFound "pdf-parser"
411 | commandsExecuted=$((commandsExecuted+1000000)) # 10**6: pdf-parser is command #6
412 | fi
413 |
414 | #pdffonts
415 |
416 | which pdffonts >/dev/null #checks for command
417 | if [ $? -eq 0 ] # exit status 0 = command found
418 | then
419 | hashMark "7 - $(pdffonts -v 2>&1 | head -n1)"
420 | pdffontsFilename="$fileFolder/$filenamenoext.pdffonts.txt"
421 | echo "executing: pdffonts \"$fileToProcess\" > \"$pdffontsFilename\" 2>&1" | tee -a "$logfile"
422 | pdffonts "$fileToProcess" >"$pdffontsFilename" 2>&1
423 | echo -e "pdffonts finished execution at $(date).\nResults written to '$pdffontsFilename'.">>"$logfile"
424 | echo "executing: sha256sum \"$pdffontsFilename\"" | tee -a "$logfile"
425 | sha256sum "$pdffontsFilename" >> "$logfile"
426 | blankLine
427 | else
428 | commandNotFound "pdffonts"
429 | commandsExecuted=$((commandsExecuted+10000000)) # 10**7: pdffonts is command #7
430 | fi
431 |
432 | #pdfdetach
433 |
434 | which pdfdetach >/dev/null #checks for command
435 | if [ $? -eq 0 ] # exit status 0 = command found
436 | then
437 | hashMark "8 - $(pdfdetach -v 2>&1 | head -n1)"
438 | echo "executing: pdfdetach -saveall \"$fileToProcess\"" | tee -a "$logfile"
439 | echo "Which will extract the following files (if applicable):" | tee -a "$logfile"
440 | embeddedItemsCount=$(pdfdetach -list "$fileToProcess"|wc -l)
441 | embeddedItemsCount=$((embeddedItemsCount-1))
442 | pdfdetach -list "$fileToProcess" | tee -a "$logfile"
443 | pdfdetach -saveall "$fileToProcess"
444 | echo -e "pdfdetach finished execution at $(date).">>"$logfile"
445 |
446 | if [[ "$(pdfdetach -list "$fileToProcess")" != "0 embedded files" && "$(pdfdetach -list "$fileToProcess")" != "" ]]
447 | # if there are no embedded files and you do the sha256sum command, it waits for input. This avoids hanging the script in such cases.
448 | then
449 | pdfdetach -list "$fileToProcess"|tail -n $embeddedItemsCount | cut -d: -f2
450 | for f in $(pdfdetach -list "$fileToProcess" | tail -n "$embeddedItemsCount" | cut -d: -f2- 2>/dev/null); do # IFS is a newline, so each embedded file name is one iteration
451 | echo "executing: sha256sum \"${f#\"${f%%[![:space:]]*}\"}\"" | tee -a "$logfile"
452 | sha256sum "${f#"${f%%[![:space:]]*}"}" |tee -a "$logfile"
453 | done
454 | fi
455 | else
456 | commandNotFound "pdfdetach"
457 | commandsExecuted=$((commandsExecuted+100000000)) # 10**8: pdfdetach is command #8
458 | fi
459 |
460 | blankLine
461 |
462 | #extract versions of the PDF using grep and dd, commands commonly available on any Linux distro
463 |
464 | hashMark "9 - Extracting prior versions of the PDF"
465 |
466 | v=1
467 |
468 | offsets=($(grep --only-matching --byte-offset --text "%%EOF" "$fileToProcess"| cut -d : -f 1))
469 |
470 | if [ ${#offsets[@]} -gt 0 ] && [ "${offsets[0]}" -lt 600 ]; then # skip the check when no %%EOF marker was found
471 | unset offsets[0] # removes the first element in the array, as it's a false positive.
472 | fi
473 |
474 | priorVersions=${#offsets[@]}
475 | priorVersions=$((priorVersions-1))
476 |
477 | if [ $priorVersions -lt 1 ]; then
478 | echo "There are no previous versions of the PDF embedded in this pdf." | tee -a "$logfile"
479 | else
480 | if ! [[ "$priorVersion" = "true" || "$priorVersion" = "false" ]]; then # if the user did not provide a valid option for -p (or did not specify it)
481 | echo -e "There are ${GREEN}$priorVersions prior versions${NOCOLOUR} of this PDF based on the number of %%EOF signatures in it.\n"
482 | echo "The script can attempt to extract them with the caveat that a prior version may or may not be a properly formed PDF."
483 | read -p "Do you want the script to attempt to extract all versions of this PDF (Y/N)? [Y] " priorVersion # default response is Y if user just hits ENTER
484 |
485 | if [[ "$priorVersion" == "Y" || "$priorVersion" == "y" || "$priorVersion" == "" ]]; then
486 | priorVersion="true"
487 | else
488 | priorVersion="false"
489 | fi
490 | fi
491 | if [ "$priorVersion" == "true" ];then # process prior versions
492 | echo "Excluding the current version, there are $priorVersions prior versions in this PDF." | tee -a "$logfile"
493 | echo "The script will extract each of them, assigning them a version number. Version 1 is the oldest version, and version $priorVersions is the version prior to the current version." | tee -a "$logfile"
494 |
495 | #unset offsets[0] # removes the first element in the array, as it's a false positive.
496 |
497 | for size in ${offsets[@]}; do
498 |
499 |
500 | if [ $v -le $priorVersions ]; then # if it's not the last version. Last version is redundant, as it's the original PDF passed to the script.
501 |
502 | newfile="$fileFolder/$filenamenoext version $v.$extension"
503 |
504 | blocksize=$((size+7))
505 |
506 | blankLine
507 | echo "executing: dd if=\"$fileToProcess\" of=\"$newfile\" bs=$blocksize count=1 status=noxfer 2> /dev/null" | tee -a "$logfile"
508 | echo "This will extract version $v of the pdf $fileToProcess, assigning it the new filename $newfile" | tee -a "$logfile"
509 | echo ""
510 |
511 | dd if="$fileToProcess" of="$newfile" bs=$blocksize count=1 status=noxfer 2> /dev/null
512 |
513 | echo "executing: sha256sum \"$newfile\" >>\"$logfile\"" | tee -a "$logfile"
514 | sha256sum "$newfile" >>"$logfile"
515 |
516 | blankLine
517 |
518 | checkPDF "$newfile"
519 |
520 | if [ "$pdfValidation" == "False" ] && [ "$processThisPDF" != "y" ]; then
521 | echo "User opted to not process \"$newfile\" as it does not appear to be a valid PDF." >> "$logfile"
522 | v=$((v+1)); continue # advance the version counter, then skip the rest of this iteration
523 | fi
524 |
525 | if [ "$pdfValidation" == "True" ] # Valid PDF
526 | then
527 | echo -e "Prior ${GREEN}version $v${NOCOLOUR} of '$fileToProcess' appears to be a ${GREEN}valid PDF.${NOCOLOUR}"
528 | echo -e "Prior version $v of '$fileToProcess' appears to be a valid PDF." >> "$logfile"
529 | else # Not a valid PDF
530 | echo -e "Prior ${YELLOW}version $v${NOCOLOUR} of '$fileToProcess' ${RED}does not appear to be a valid PDF.${NOCOLOUR}"
531 | echo -e "Prior version $v of '$fileToProcess' does not appear to be a valid PDF." >> "$logfile"
532 | fi
533 | blankLine
534 | pdfImages "$newfile" "9.$v" "$fileFolder/${newfile##*/}-pdfimages" 2>/dev/null # Attempt to extract images from the version. Even if not a valid PDF, attempting regardless.
535 |
536 | v=$((v+1))
537 | fi
538 | done
539 |
540 | echo -e "\nExtracting prior versions of the PDF finished execution at $(date).">>"$logfile"
541 |
542 | echo "executing: exiftool -a -G1 -s -ee -csv \"$fileToProcess\" \"$fileFolder/$filenamenoext version\"*.$extension 2> /dev/null >> \"$fileFolder/${fileToProcess##*/} - all versions - exif.csv\"" | tee -a "$logfile"
543 | echo "filetoprocess: $fileToProcess"
544 | allPriorVersions="$fileFolder/$filenamenoext version"
545 | echo "allPriorVersions: $allPriorVersions"
546 | exiftool -a -G1 -s -ee -csv "$fileToProcess" "$allPriorVersions"*.$extension 2> /dev/null >> "$fileFolder/${fileToProcess##*/} - all versions - exif.csv"
547 |
548 | fi
549 | fi
550 |
551 | # sorting all image hashes after extracting all prior versions of PDF for ease of identifying matching images
552 |
553 | blankLine
554 |
555 | echo "executing: sort \"$allimages-unsorted.txt\" > \"$allimages-sorted.txt\"" | tee -a "$logfile"
556 | echo -e "All image hashes are also found in: ${GREEN}'$allimages-sorted.txt'.${NOCOLOUR}"
557 | echo "All image hashes are also found in: $allimages-sorted.txt.">>"$logfile"
558 |
559 | sort "$allimages-unsorted.txt" > "$allimages"-sorted.txt
560 |
561 | blankLine
562 |
563 | done
564 |
565 | echo -e "\n###############################################################"
566 | echo -e "Log file written to: ${GREEN}'$logfile'.${NOCOLOUR}"
567 | echo -e "Script finished at $(date)." >> "$logfile"
568 | IFS="$OIFS" # restore IFS to original
569 | exit $commandsExecuted
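The exit value above accumulates powers of ten, but a shell exit status only carries 8 bits (0-255), so most of those totals cannot be read back by the caller. One possible alternative, offered as a sketch rather than as the script's method, is a true bitmask with one bit per tool, which fits the eight tools into a single status byte. `missing_tools_mask` and `decode_mask` are hypothetical helper names, and this scheme would overlap with the script's argument-error codes (1 to 10), so those would need to be reported some other way.

```shell
#!/bin/bash
# The eight tools, in the same order as the script's processing summary.
tools=(pdfinfo exiftool pdfimages pdfsig pdfid pdf-parser pdffonts pdfdetach)

# Hypothetical helper: echo a bitmask with bit N set when tools[N] is not on PATH.
missing_tools_mask() {
    local mask=0 i
    for i in "${!tools[@]}"; do
        command -v "${tools[$i]}" >/dev/null 2>&1 || mask=$((mask | (1 << i)))
    done
    echo "$mask"
}

# Hypothetical helper: print the name of every tool whose bit is set in mask $1.
decode_mask() {
    local mask=$1 i
    for i in "${!tools[@]}"; do
        (( mask & (1 << i) )) && echo "${tools[$i]}"
    done
    return 0
}
```

A wrapper could end with `exit "$(missing_tools_mask)"`, and the caller could run `decode_mask "$?"` to list the steps that were skipped.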
--------------------------------------------------------------------------------