National Cyber Warfare Foundation (NCWF)

Digital Forensics: Extracting PDF Metadata


2025-08-25 14:08:57
milo
Red Team (CNA)


The post Digital Forensics: Extracting PDF Metadata first appeared on Hackers Arise.



Welcome back, cyberwarrior novitiates!






PDF files often store metadata that can reveal valuable information such as the document author, creation and modification dates, software used, and even embedded scripts or hidden content that may be leveraged during OSINT investigations, legal investigations, cyber operations, or penetration tests.





In this article, I’d like to show you how to extract metadata from PDFs using the tools pdf-parser and exiftool.





What is PDF Metadata?





PDF metadata is data that describes the document itself rather than its visible content. It can include:






  • Document properties (title, author, subject)




  • Creation and modification timestamps




  • Software used to generate or edit the PDF




  • Embedded scripts (e.g., JavaScript exploits)




  • Annotations, hidden objects, or attachments




  • Revision history and sometimes geolocation or device info





While metadata helps with document organization and forensic tracing, it also poses security risks. Attackers can mine metadata to gather intelligence and uncover sensitive or exploitable information.
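Timestamps are a good example of metadata that needs a little decoding before it is useful. PDF date strings follow the pattern D:YYYYMMDDHHmmSS plus an optional timezone offset. The sketch below is a minimal parser for that format; the function name parse_pdf_date and the sample value are illustrative assumptions, not output from any particular file.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:20250825140857+02'00' (every field after the year is optional).
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+\-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(raw: str) -> datetime:
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {raw!r}")
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()

    tz = None  # no offset given: return a naive datetime
    if sign in ("Z", "z"):
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tz = timezone(offset if sign == "+" else -offset)

    # Missing month/day default to 1, missing time fields default to 0.
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )


if __name__ == "__main__":
    # Made-up sample value for illustration only.
    print(parse_pdf_date("D:20250825140857+02'00'").isoformat())
```

A timezone offset in a creation date can hint at where a document was produced, which is why it is worth keeping rather than discarding during parsing.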





$20,000 RCE Vulnerability in ExifTool During Metadata Extraction





To demonstrate that metadata can be not only a source of intelligence but also a point of system compromise, consider CVE-2021-22204, a flaw in ExifTool’s handling of DjVu files that allowed crafted metadata to trigger arbitrary code execution. It affected ExifTool versions 7.44 and up and was fixed in 12.24. Because GitLab passed uploaded images to ExifTool, it inherited the bug as CVE-2021-22205, a critical vulnerability with a severity rating of 10.0.
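Before pointing ExifTool at untrusted files, it is worth checking which version you are running (exiftool -ver prints it). The sketch below compares a version string against the affected range for CVE-2021-22204, which the advisory gives as 7.44 and up, fixed in 12.24. The helper names are hypothetical, and the parsing assumes ExifTool's usual major.minor numbering with a two-digit minor part.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Split a version such as '12.16' into (12, 16). Assumes 'major.minor' form."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def affected_by_cve_2021_22204(version: str) -> bool:
    """True when the version lies in the affected range: 7.44 up to, but not including, 12.24."""
    return (7, 44) <= parse_version(version) < (12, 24)


if __name__ == "__main__":
    for candidate in ("7.43", "12.16", "12.24"):
        print(candidate, affected_by_cve_2021_22204(candidate))
```

Distribution packages sometimes backport fixes without changing the upstream version number, so treat a positive result as a prompt to check your package changelog rather than as proof.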





Source: HackerOne




Step #1: Setting Up Kali Linux for PDF Metadata Extraction





Kali Linux comes with several tools designed for PDF analysis:






  • pdf-parser (part of Didier Stevens’ PDF tools): A command-line utility to parse and extract metadata, scripts, and objects from PDFs (often pre-installed in Kali Linux).




  • exiftool: A command-line tool for extracting, editing, and managing metadata across many file types, including PDFs (also often pre-installed).





Step #2: Metadata Extraction with pdf-parser





To get started, it’s enough to specify the name of the file to pdf-parser:





kali> pdf-parser sample.pdf









It will reveal things like:






  • Objects in the PDF (numbered elements that make up the file).




  • Streams inside objects (can contain images, fonts, JavaScript, etc.).




  • Dictionaries describing properties (like /Type /Page, /Font, /XObject).




  • Embedded JavaScript or suspicious actions (/OpenAction, /AA).




  • Metadata (author, producer, creation date).




  • File structure issues (like malformed objects, which might indicate exploits).
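When a file is large, it helps to triage it before reading every object. The sketch below counts a few PDF names of interest in the raw bytes of a file. The list of names and the function count_pdf_names are assumptions chosen for illustration; the scan does not decompress streams or resolve hex-escaped names such as /J#53, so a zero count is not proof that a file is clean.

```python
import re
from typing import Dict

# Names that often deserve a closer look (illustrative selection, not exhaustive).
NAMES_OF_INTEREST = (
    "/JS", "/JavaScript", "/OpenAction", "/AA", "/URI", "/Launch", "/EmbeddedFile",
)

# A PDF name ends at whitespace or a delimiter, so reject matches that are
# followed by another regular character (this keeps /AA from matching /AAB).
_NAME_END = rb"(?![^\s()<>\[\]{}/%])"


def count_pdf_names(data: bytes) -> Dict[str, int]:
    """Count occurrences of each name of interest in raw PDF bytes."""
    counts = {}
    for name in NAMES_OF_INTEREST:
        pattern = re.compile(re.escape(name.encode("ascii")) + _NAME_END)
        counts[name] = len(pattern.findall(data))
    return counts


if __name__ == "__main__":
    # Tiny hand-written fragment, not a complete PDF.
    sample = (
        b"1 0 obj\n<< /Type /Catalog /OpenAction 5 0 R /AA << >> >>\nendobj\n"
        b"5 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    )
    for name, hits in count_pdf_names(sample).items():
        print(f"{name:15} {hits}")
```

Any name with a nonzero count is a good candidate for a follow-up keyword search with pdf-parser.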





To specifically extract metadata objects, such as the Author, we need to use a keyword search:





kali> pdf-parser --search Author sample.pdf









This command revealed UTF-16 encoded text. By decoding it, we can determine the author and title of the file.
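PDF text strings that begin with the bytes FE FF are encoded as UTF-16 big-endian. They can appear either as hex strings in angle brackets or as literal strings with octal escapes. The sketch below decodes both forms; the helper names and the sample value "Test" are illustrative assumptions, and strings without the FE FF marker are treated as Latin-1, which only approximates PDFDocEncoding.

```python
import re

_SIMPLE_ESCAPES = {
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"b": b"\b", b"f": b"\f",
    b"(": b"(", b")": b")", b"\\": b"\\",
}


def unescape_literal(body: bytes) -> bytes:
    """Resolve backslash escapes in a PDF literal string (outer parentheses removed)."""
    def replace(match):
        esc = match.group(1)
        if esc[:1] in b"01234567":            # octal escape such as \376
            return bytes([int(esc.decode("ascii"), 8) & 0xFF])
        return _SIMPLE_ESCAPES.get(esc, esc)  # unknown escape: keep the character
    return re.sub(rb"\\([0-7]{1,3}|.)", replace, body, flags=re.DOTALL)


def decode_pdf_text(raw: bytes) -> str:
    """Decode a PDF text string; FE FF at the start marks UTF-16BE."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")  # rough stand-in for PDFDocEncoding


def decode_hex_string(token: str) -> str:
    """Decode a hex string token such as <FEFF0054...>."""
    digits = re.sub(r"\s+", "", token.strip("<>"))
    if len(digits) % 2:
        digits += "0"  # an odd final digit is padded with 0
    return decode_pdf_text(bytes.fromhex(digits))


if __name__ == "__main__":
    print(decode_hex_string("<FEFF0054006500730074>"))
    print(decode_pdf_text(unescape_literal(rb"\376\377\000T\000e\000s\000t")))
```

Both calls in the usage section decode the same four characters, once from the hex form and once from the escaped literal form.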

Now, let’s say we found a suspicious metadata dictionary and want to explore it in more detail. To do that, we can display the raw contents of the object that holds it:





kali> pdf-parser --object 551 --raw sample.pdf









In this case, we can see that object 551 is a hidden clickable link inside the PDF. If a user clicks it, their PDF reader will attempt to open the URL, which points to MalwareBazaar, a platform that shares malware samples with researchers. (To learn more about MalwareBazaar, check out this article.)





This is exactly how attackers hide phishing and malware links in PDFs: they embed link annotations that look harmless or are invisible but open dangerous URLs.
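To list every link target at once instead of inspecting objects one by one, you can scan the raw bytes for /URI entries. This is a minimal sketch under a few assumptions: the function find_uri_targets and the sample object are hypothetical, only uncompressed objects are searched, and a URL containing unescaped parentheses would be cut short.

```python
import re
from typing import List

# /URI followed by a literal string; escaped characters inside the string are allowed.
_URI = re.compile(rb"/URI\s*\(((?:\\.|[^\\)])*)\)")


def find_uri_targets(data: bytes) -> List[str]:
    """Return the targets of URI actions found in raw, uncompressed PDF bytes."""
    return [match.group(1).decode("latin-1") for match in _URI.finditer(data)]


if __name__ == "__main__":
    # Hand-written link annotation for illustration; the URL is a placeholder.
    sample = (
        b"7 0 obj\n<< /Type /Annot /Subtype /Link /Rect [0 0 612 792]\n"
        b"/A << /S /URI /URI (https://example.com/payload) >> >>\nendobj\n"
    )
    for url in find_uri_targets(sample):
        print(url)
```

A link annotation whose /Rect covers the whole page, as in the sample, is worth extra suspicion because any click on the page would trigger it.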





Step #3: Metadata Extraction with exiftool





Apart from pdf-parser, exiftool is a versatile utility for quickly viewing or stripping metadata from files, including PDFs. For example, to display all metadata in a PDF:





kali> exiftool sample.pdf









In addition to extraction and removal, exiftool supports verbose output with the -v flag for deeper troubleshooting or forensic investigation:





kali> exiftool -v sample.pdf
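For scripted workflows, exiftool can emit JSON with the -j flag, which is easier to process than the default text output. The sketch below wraps that call and keeps only a handful of fields. The function names, the choice of fields, and the sample record are assumptions for illustration; run it only with a patched exiftool, given the CVE discussed earlier.

```python
import json
import subprocess
from typing import Dict, List

# Tags worth a first look (illustrative selection).
FIELDS = ("Title", "Author", "Creator", "Producer", "CreateDate", "ModifyDate")


def run_exiftool_json(path: str) -> str:
    """Run 'exiftool -j <path>' and return its JSON output. Requires exiftool on PATH."""
    result = subprocess.run(
        ["exiftool", "-j", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def summarize(exiftool_json: str) -> List[Dict[str, str]]:
    """Keep only the fields of interest from exiftool's JSON (one record per file)."""
    records = json.loads(exiftool_json)
    return [{key: str(rec[key]) for key in FIELDS if key in rec} for rec in records]


if __name__ == "__main__":
    # Made-up record so the example runs without exiftool installed.
    sample = (
        '[{"SourceFile": "sample.pdf", "Author": "Jane Doe", '
        '"Producer": "ExampleWriter 1.0", "PageCount": 3}]'
    )
    print(summarize(sample))
    # With exiftool installed: print(summarize(run_exiftool_json("sample.pdf")))
```

Separating the subprocess call from the parsing keeps the parsing step easy to test and lets you reuse it on JSON saved from an earlier run.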









Summary





In this article, we looked at the tools pdf-parser and exiftool for extracting metadata from PDFs. Hidden within metadata may be clues about a document’s life cycle, software vulnerabilities, or embedded code designed to exploit a victim’s PDF reader. Carefully examining these fields can reveal potential attack surfaces and avenues for social engineering, which makes metadata extraction a basic skill for both hacking and OSINT.







Source: HackersArise
Source Link: https://hackers-arise.com/digital-forensics-extracting-pdf-metadata/





Copyright 2012 through 2025 - National Cyber Warfare Foundation - All rights reserved worldwide.