National Cyber Warfare Foundation (NCWF)

The post Digital Forensics: Extracting PDF Metadata first appeared on Hackers Arise.

Welcome back, cyberwarrior novitiates!

PDF files often store metadata that can reveal valuable information such as the document author, creation and modification dates, software used, and even embedded scripts or hidden content that may be leveraged during OSINT investigations, legal investigations, cyber operations, or penetration tests.

In this article, I’d like to show you how to extract metadata from PDFs using the tools pdf-parser and exiftool.

What is PDF Metadata?

PDF metadata is data about the document itself rather than its visible content. It can include:

Document properties (title, author, subject)

Creation and modification timestamps

Software used to generate or edit the PDF

Embedded scripts (e.g., JavaScript exploits)

Annotations, hidden objects, or attachments

Revision history and sometimes geolocation or device info

While metadata helps with document organization and forensic tracing, it also poses security risks. Attackers can mine it for intelligence and uncover sensitive or exploitable information, such as author names and the software versions used to produce a file.
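The creation and modification timestamps mentioned above are stored in a compact PDF date format such as D:20240115103000+02'00'. As a minimal sketch, assuming the value has already been pulled out of the document, the function below turns such a string into a Python datetime. The name parse_pdf_date and the sample date are illustrative and do not come from any particular tool.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF dates look like D:YYYYMMDDHHmmSS followed by an optional offset
# (+HH'mm', -HH'mm', or Z). Every field after the year may be missing.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tz_hour>\d{2})'?)?(?:(?P<tz_minute>\d{2})'?)?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string into a datetime (timezone-aware if an offset is given)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()

    tzinfo = None
    sign = parts["sign"]
    if sign == "Z":
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0),
            minutes=int(parts["tz_minute"] or 0),
        )
        tzinfo = timezone(offset if sign == "+" else -offset)

    # Missing month/day default to 1, missing time fields default to 0.
    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tzinfo,
    )


if __name__ == "__main__":
    # Hypothetical value, as it might appear in a /CreationDate entry.
    print(parse_pdf_date("D:20240115103000+02'00'").isoformat())
```

The timezone offset is worth keeping during an investigation, since it can hint at the region where the document was produced.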

$20,000 RCE Vulnerability in ExifTool During Metadata Extraction

To demonstrate that metadata can be not only a source of intelligence but also a potential point of system compromise, consider CVE-2021-22204, a flaw in ExifTool’s handling of DjVu files that allowed crafted metadata to trigger arbitrary code execution. It affects ExifTool versions 7.44 through 12.23 and was fixed in 12.24. The flaw also led to CVE-2021-22205, a critical vulnerability in GitLab, whose image processing passed uploaded files to ExifTool and so inherited the bug. That issue carries a severity rating of 10.0 (Critical).
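Before running ExifTool against untrusted files, it is worth checking which version is installed. The sketch below compares a version string against the range published for CVE-2021-22204 (7.44 up to, but not including, 12.24). The function names are illustrative, and the check is only a first pass: some distributions backport the fix without changing the version number.

```python
import subprocess
from typing import Tuple

# Range published for CVE-2021-22204: introduced in 7.44, fixed in 12.24.
FIRST_VULNERABLE = (7, 44)
FIRST_FIXED = (12, 24)


def parse_version(text: str) -> Tuple[int, int]:
    """Split an ExifTool version such as '12.23' into (12, 23).

    Assumes the usual two-part MAJOR.MINOR form with a two-digit minor.
    """
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def is_affected_by_cve_2021_22204(version: str) -> bool:
    """Return True if the version falls inside the published vulnerable range."""
    return FIRST_VULNERABLE <= parse_version(version) < FIRST_FIXED


def installed_exiftool_version() -> str:
    """Ask the local exiftool binary for its version number (exiftool -ver)."""
    result = subprocess.run(
        ["exiftool", "-ver"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = installed_exiftool_version()
    status = "in the vulnerable range" if is_affected_by_cve_2021_22204(version) else "outside the vulnerable range"
    print(f"ExifTool {version} is {status}")
```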

Step #1: Setting Up Kali Linux for PDF Metadata Extraction

Kali Linux comes with several tools designed for PDF analysis:

pdf-parser (part of Didier Stevens’s PDF Tools, alongside pdfid): A command-line utility to parse and extract metadata, scripts, and objects from PDFs (often pre-installed in Kali Linux).

exiftool: A command-line tool for extracting, editing, and managing metadata across many file types, including PDFs (also often pre-installed).

Step #2: Metadata Extraction with pdf-parser

To get started, simply pass the name of the file to pdf-parser:

kali> pdf-parser sample.pdf

It will reveal things like:

Objects in the PDF (numbered elements that make up the file).

Streams inside objects (can contain images, fonts, JavaScript, etc.).

Dictionaries describing properties (like /Type /Page, /Font, /XObject).

Embedded JavaScript or suspicious actions (/OpenAction, /AA).

Metadata (author, producer, creation date).

File structure issues (like malformed objects, which might indicate exploits).
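For a quick triage before reading the full pdf-parser output, the raw bytes of a file can be scanned for the action-related names mentioned above. The following is a rough sketch rather than a replacement for pdf-parser: it only sees names stored in plain form, so anything hidden inside compressed object streams or obfuscated with hex escapes will be missed. The list of names is an assumption chosen for illustration.

```python
import re
from collections import Counter

# Names that often deserve a closer look. /OpenAction and /AA trigger actions
# automatically; /JavaScript and /JS carry scripts; /URI points to a link target.
SUSPICIOUS_NAMES = (b"/OpenAction", b"/AA", b"/JavaScript", b"/JS", b"/URI")


def count_suspicious_names(pdf_bytes: bytes) -> Counter:
    """Count plain-text occurrences of each suspicious name in the raw file."""
    counts = Counter()
    for name in SUSPICIOUS_NAMES:
        # Require a non-alphanumeric character (or end of data) after the name
        # so that /AA does not match a longer name such as /AAPL.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, pdf_bytes))
    return counts


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for name, count in count_suspicious_names(handle.read()).items():
            print(f"{name:<12} {count}")
```

A non-zero count is not proof of malice, since legitimate forms and links use the same names, but it tells you which objects to inspect first.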

To specifically extract metadata objects, such as the Author, we need to use a keyword search:

kali> pdf-parser --search Author sample.pdf

This command revealed UTF-16 encoded text. By decoding it, we can determine the author and title of the file.
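Decoding that text by hand is tedious, so here is a small sketch of how it can be done. PDF text strings that begin with the byte order mark FE FF are UTF-16BE; strings without it use PDFDocEncoding, which this sketch approximates with Latin-1. The function names and the sample value are hypothetical.

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF text string into readable text."""
    if raw.startswith(b"\xfe\xff"):
        # Byte order mark FE FF marks UTF-16 big-endian text.
        return raw[2:].decode("utf-16-be")
    if raw.startswith(b"\xef\xbb\xbf"):
        # PDF 2.0 also allows UTF-8 text marked with this prefix.
        return raw[3:].decode("utf-8")
    # Approximation: PDFDocEncoding matches Latin-1 for most common characters.
    return raw.decode("latin-1")


def decode_pdf_hex_string(token: str) -> str:
    """Decode a hex string token such as <FEFF0041> as shown in parser output."""
    digits = "".join(token.strip().strip("<>").split())
    if len(digits) % 2:
        # An odd number of digits is read as if a trailing 0 were present.
        digits += "0"
    return decode_pdf_text_string(bytes.fromhex(digits))


if __name__ == "__main__":
    # Hypothetical /Author value copied from parser output.
    print(decode_pdf_hex_string("<FEFF004A0061006E006500200044006F0065>"))
```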

Now, let’s say we found a suspicious metadata dictionary and want to explore it in more detail. For this task, we can display the contents of the dictionary:

kali> pdf-parser --object 551 --raw sample.pdf

In this case, we can see that object 551 is a hidden clickable link inside the PDF. If a user clicks it, their PDF reader will attempt to open the URL, which points to MalwareBazaar, a platform that shares malware samples with researchers. (To learn more about MalwareBazaar, check out this article).

This is exactly how attackers hide “phishing/malware links” in PDFs: by embedding link annotations that look harmless or are invisible but open dangerous URLs.
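To list every link target at once instead of inspecting objects one by one, the raw file can be searched for /URI entries. This is a simplified sketch under a few assumptions: the link objects are stored uncompressed, the URL is written as a literal string in parentheses, and any parentheses inside the URL are backslash-escaped. The function name and the sample URL are illustrative.

```python
import re
from typing import List

# Matches "/URI (target)". The "\\." branch skips escaped characters such as "\)"
# so that an escaped parenthesis does not end the match early.
_URI_ACTION = re.compile(rb"/URI\s*\(((?:\\.|[^\\)])*)\)")


def extract_link_targets(pdf_bytes: bytes) -> List[str]:
    """Return the targets of all plain-text /URI actions found in the raw file."""
    targets = []
    for match in _URI_ACTION.finditer(pdf_bytes):
        # Simplified unescaping: drop the backslash and keep the next character.
        raw = re.sub(rb"\\(.)", rb"\1", match.group(1))
        targets.append(raw.decode("latin-1"))
    return targets


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        for target in extract_link_targets(handle.read()):
            print(target)
```

Any URL that appears here but is not visible on the rendered page deserves a closer look with pdf-parser.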

Step #3: Metadata Extraction with exiftool

Apart from pdf-parser, exiftool is a versatile utility for quickly viewing or stripping metadata from files, including PDFs. For example, to display all metadata in a PDF:

kali> exiftool sample.pdf
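When many files need to be processed, exiftool’s JSON output (the -j option) is easier to work with than the human-readable listing. The sketch below runs exiftool on a file and keeps a handful of fields. The list of fields is an assumption based on the tag names exiftool commonly reports for PDFs, and the function names are illustrative.

```python
import json
import subprocess

# Tag names exiftool commonly reports for PDFs; adjust to the case at hand.
FIELDS_OF_INTEREST = ("Title", "Author", "Creator", "Producer", "CreateDate", "ModifyDate")


def run_exiftool_json(path: str) -> str:
    """Run exiftool with -j, which prints a JSON array with one object per file."""
    result = subprocess.run(
        ["exiftool", "-j", path], capture_output=True, text=True, check=True
    )
    return result.stdout


def summarize(exiftool_json: str) -> dict:
    """Keep only the fields of interest from the first record in the output."""
    records = json.loads(exiftool_json)
    record = records[0] if records else {}
    return {name: record[name] for name in FIELDS_OF_INTEREST if name in record}


if __name__ == "__main__":
    import sys

    for name, value in summarize(run_exiftool_json(sys.argv[1])).items():
        print(f"{name}: {value}")
```

Keeping the subprocess call separate from the parsing step makes it possible to rerun the analysis later on saved JSON without touching the original evidence file.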

In addition to extraction and removal, exiftool supports verbose output with the -v flag for deeper troubleshooting or forensic investigation:

kali> exiftool -v sample.pdf

Summary

In this article, we looked at the tools pdf-parser and exiftool for extracting metadata from PDFs. Hidden within metadata may be clues about a document’s life cycle, software vulnerabilities, or embedded code designed to exploit a victim’s PDF reader. Carefully examining these fields can reveal potential attack surfaces and avenues for social engineering. The importance of extracting metadata for hacking and OSINT cannot be overstated.