Document Metadata Extractor & Analyzer

PDF and Office metadata analysis, author history, and hidden content detection.

The Document Metadata Extractor analyzes uploaded PDF, DOCX, XLSX, and PPTX files for embedded metadata. It surfaces author names, revision history, creation and modification timestamps, editing software, company fields, and risk indicators such as embedded objects or macros.

How to Use

Work through these steps in order. Use this tool for educational and ethical purposes only.

  1. Select Document Metadata Extractor from the tool navigation.
  2. Click Upload Document and select a PDF, DOCX, XLSX, or PPTX file.
  3. Click Analyze Metadata. The tool reads metadata without executing any document content.
  4. Review the Core Properties panel: Author, Last Modified By, Company, Title, Subject, and Keywords.
  5. Check the Timestamp panel: Created, Modified, and Last Printed dates with timezone data where available.
  6. Review the Revision History panel showing total edit count and version number.
  7. Check the Application Fingerprint for the exact software version used to create and last edit the file.
  8. Review the Risk Indicators panel: embedded objects, macro presence, and external link references.
  9. For DOCX files, check the Hidden Content section for tracked changes and hidden text runs.
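
The risk and hidden-content checks in the last two steps can be approximated with a few presence tests on the Office ZIP container. The sketch below is an illustration, not the tool's actual implementation: the function name scan_ooxml and the returned keys are hypothetical, and the checks only confirm that something is present without analyzing it.

```python
import io
import re
import zipfile


def scan_ooxml(data: bytes) -> dict:
    """Return simple presence flags for an Office Open XML file given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        # Relationship parts (*.rels) declare external targets such as linked files or URLs.
        rels = b"".join(zf.read(n) for n in names if n.endswith(".rels"))
        # word/document.xml exists only in DOCX files; other formats get an empty body.
        body = zf.read("word/document.xml") if "word/document.xml" in names else b""
    return {
        # VBA macro projects are stored as a vbaProject.bin part.
        "macros": any(n.endswith("vbaProject.bin") for n in names),
        # Embedded OLE objects and packages live under an embeddings/ folder.
        "embedded_objects": any("/embeddings/" in n for n in names),
        # Note: ordinary hyperlinks are also external relationships.
        "external_links": b'TargetMode="External"' in rels,
        # Tracked insertions and deletions appear as w:ins / w:del elements.
        "tracked_changes": re.search(rb"<w:(ins|del)[ >]", body) is not None,
        # Hidden text runs carry a w:vanish run property.
        "hidden_text": re.search(rb"<w:vanish\s*/?>", body) is not None,
    }
```

Reading the file with open(path, "rb").read() and passing the bytes to scan_ooxml is enough to try it. Because the archive is only listed and read as bytes, no document content is executed.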

What Is Document Metadata?

Document metadata is structured data embedded in a file that describes the file itself rather than its visible content. It lives at the binary level inside the file’s internal structure, invisible to anyone reading the document normally.

Office applications and document creation tools write metadata automatically to support version tracking, document management, and workflow automation. When a PDF is generated by a reporting engine like JasperReports, the software stamps its identity into the file’s metadata. When you save a Word document, the application records your registered username, organization name, and revision history. These fields exist to help software manage documents, not for every eventual recipient to see.

How it gets embedded: In PDF files, metadata lives in two places: the Document Information Dictionary (a key-value structure referenced from the file’s trailer) and an XMP metadata stream (an XML packet referenced from the document catalog). The Creator field records the application that originated the content. In the example result, this is JasperReports (WelcomeLetter), flagged as an Identity Leak because it reveals the internal reporting system that generated the document. The Producer field records the PDF library used for rendering, here iText 2.1.0 (by lowagie.com), which identifies a specific software version for Software ID analysis.
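
To make the XMP half of this concrete, the sketch below scans raw PDF bytes for XMP packets and pulls out the xmp:CreatorTool and pdf:Producer properties. It is a deliberately naive illustration under stated assumptions: it only finds packets stored uncompressed and unencrypted, the function name read_xmp_fields and the output keys are hypothetical, and a real extractor should use a proper PDF parser rather than a byte scan.

```python
import re
import xml.etree.ElementTree as ET

XMP_NS = "http://ns.adobe.com/xap/1.0/"
PDF_NS = "http://ns.adobe.com/pdf/1.3/"
# Map fully qualified XMP property names to the labels used in the result.
WANTED = {
    f"{{{XMP_NS}}}CreatorTool": "creator_tool",
    f"{{{PDF_NS}}}Producer": "producer",
}
# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
PACKET = re.compile(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL)


def read_xmp_fields(pdf_bytes: bytes) -> dict:
    """Collect selected XMP properties from every readable packet in the file."""
    found = {}
    for match in PACKET.finditer(pdf_bytes):
        try:
            root = ET.fromstring(match.group(1).strip())
        except ET.ParseError:
            continue  # Skip packets that are not well-formed XML.
        for element in root.iter():
            # Properties may be written as child elements...
            if element.tag in WANTED and element.text:
                found[WANTED[element.tag]] = element.text.strip()
            # ...or as attributes on rdf:Description.
            for key, value in element.attrib.items():
                if key in WANTED:
                    found[WANTED[key]] = value
    return found
```

An empty result does not mean the file is clean: the packet may be compressed or encrypted, or the same values may exist only in the Information Dictionary.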

In Word (DOCX) files, metadata is stored in docProps/core.xml and docProps/app.xml inside the ZIP archive that forms the DOCX container. These record the author’s name, company, last modified by, revision count, and total editing time. Excel (XLSX) files use the same structure.
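
Because the container is an ordinary ZIP archive, the core properties can be read with the Python standard library alone. The following is a minimal sketch rather than the tool's own code; the function name read_core_properties and the output labels are assumptions chosen for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output label -> element path inside core.xml.
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "revision": "cp:revision",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(data: bytes) -> dict:
    """Read selected core properties from a DOCX, XLSX, or PPTX given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    # Fields that are absent from the file come back as None.
    return {
        label: root.findtext(path, default=None, namespaces=NS)
        for label, path in FIELDS.items()
    }
```

A sanitized file typically still contains core.xml, but with these elements empty or missing.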

The CreationDate and ModDate fields (both showing D:20250131124553+05'00' in the example) record when the file was created and last modified, including the local offset from UTC (here +05:00). This constitutes Timeline Data and can narrow down the geographic region of the document’s origin when cross-referenced with other signals.
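
PDF date strings follow the pattern D:YYYYMMDDHHmmSS followed by an optional offset, so they need a small parser before they can be compared with other timestamps. This is a sketch under the assumption that the value has already been extracted as text; parse_pdf_date is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta, timezone

# Everything after the year is optional; the offset is Z or +HH'mm' / -HH'mm'.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz])|([+-])(\d{2})'?(\d{2})?'?)?"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string into a datetime, keeping the UTC offset if present."""
    # Normalise typographic apostrophes that can replace the ASCII ones in copied text.
    m = _PDF_DATE.match(value.strip().replace("\u2019", "'"))
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year = int(m.group(1))
    # Truncated dates are allowed; missing parts default to the earliest valid value.
    month, day = int(m.group(2) or 1), int(m.group(3) or 1)
    hour, minute, second = (int(m.group(i) or 0) for i in (4, 5, 6))
    zulu, sign, off_h, off_m = m.group(7, 8, 9, 10)
    if sign:
        delta = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    elif zulu:
        tz = timezone.utc
    else:
        tz = None  # No offset recorded, so the local time zone is unknown.
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


stamp = parse_pdf_date("D:20250131124553+05'00'")
print(stamp.isoformat())                           # 2025-01-31T12:45:53+05:00
print(stamp.astimezone(timezone.utc).isoformat())  # 2025-01-31T07:45:53+00:00
```

The offset narrows the origin to a band of time zones, not a location, which is why it should be treated as one signal among several.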

OSINT and Journalism Cases Where Document Metadata Exposed Sources

The UK Iraq Dossier (2003)

The UK government published a dossier on Iraqi weapons capabilities as a Microsoft Word document. Cambridge academic Glen Rangwala showed that the document had been substantially copied from an earlier source, and privacy researcher Richard M. Smith examined the revision log embedded in the Word file, which named specific individuals within the British government who had worked on it. The analysis helped establish that the dossier misrepresented its sourcing, a finding with significant political consequences. Publishing Word documents without stripping metadata is equivalent to publishing an internal version history.

Dennis Rader / BTK Killer Identification (2005)

Dennis Rader sent police a floppy disk containing a Word document. Investigators extracted the metadata and found that the file had last been saved by a user named “Dennis” and was associated with a church in Wichita, Kansas. That combination linked directly to Rader’s identity and location, providing the lead that broke a case that had been cold for decades.

Leaked Corporate and Legal Documents

When a law firm publishes a PDF brief or a corporation releases a financial report, the Creator field often reveals the internal software stack, such as JasperReports, Crystal Reports, or SAP, identifying the organization’s systems and sometimes specific employees. The Producer field version number (e.g., iText 2.1.0) can expose outdated libraries with known CVEs, giving attackers a targeted angle against that software.

Source Protection in Journalism

A document exported from a government system by a whistleblower carries authorship metadata tied to their account. The Last Modified By field in a DOCX can name the specific employee who last edited it, even when they intended to share it anonymously. Do not assume that a submission platform such as SecureDrop has stripped metadata for you. Journalists receiving files through any channel, including email or direct transfer, should sanitize them before publication.

How to Sanitize Documents Before Sharing

Method 1: Microsoft Word (DOCX)

Word includes a built-in Document Inspector for one-pass metadata removal.

  1. Go to File → Info → Check for Issues → Inspect Document.
  2. Check all categories: Comments and Revisions, Document Properties and Personal Information, Hidden Text, and Custom XML Data.
  3. Click Inspect, then Remove All next to each category with results.
  4. Save the file.

Note: The Inspector cleans the current file but does not touch backup files or AutoRecover copies stored locally. Delete those separately.

Method 2: PDF Files

Adobe Acrobat Pro: Tools → Redact → Sanitize Document. Full sanitization pass removing metadata, embedded content, scripts, and hidden layers. The most thorough option for high-risk documents.

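ExifTool (command line): exiftool -all= -XMP:all= document.pdf clears both the Information Dictionary and XMP metadata as seen by readers. Note that ExifTool writes PDF changes as an incremental update, so the original values remain recoverable inside the file until it is rewritten, for example by linearizing it with qpdf.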

PDF/A conversion: Converts the file to a standards-compliant format that strips non-conformant metadata. Verify the output with this tool before publishing, as PDF/A retains some metadata by specification.

Method 3: LibreOffice

Go to File → Properties and clear the fields in the Description and Custom Properties tabs. For batch processing, use a LibreOffice macro or the headless command-line mode to clear properties on load.

Method 4: Bulk Sanitization with mat2

For organizations handling large document volumes, mat2 (Metadata Anonymisation Toolkit 2) is the standard command-line solution. It handles PDF, DOCX, XLSX, ODP, and other formats in batch mode.

pip install mat2

mat2 --inplace document.pdf

mat2 is used by Tails OS as the default metadata removal tool and is the recommended option for journalists and security researchers.

Safe Publishing Workflow

  1. Complete the document in your preferred application.
  2. Export to PDF rather than sharing the native DOCX/XLSX, which retains more sensitive metadata.
  3. Run mat2 or ExifTool on the PDF output.
  4. Upload the sanitized PDF to this tool and verify no Identity Leak or Attribution fields remain before distributing.
  5. For legal or journalistic contexts, retain the sanitized output as a record of the sanitization step.

Technical Details & Use Cases

Office Open XML formats (DOCX, XLSX, PPTX) are ZIP archives containing XML files. The tool reads core.xml (Dublin Core properties), app.xml (application properties), and custom.xml (organization-defined fields) to build a full metadata profile.

The Author field is usually set at creation time from the user name configured in the authoring application, which often defaults to the operating system account name. It frequently contains a real name even in documents intended for anonymous publication. Last Modified By shows who last saved the file and may differ from the original author.

Application fingerprinting reads the Application and AppVersion fields from app.xml. This returns the exact Office version string, such as Microsoft Office Word 2019 16.0.xxxxx. Security teams use this to identify documents targeting specific software versions in phishing campaigns.
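
The same ZIP-and-XML approach covers the application fingerprint. The sketch below reads a handful of extended properties from docProps/app.xml; read_app_fingerprint is a hypothetical name, and which fields are present depends on the application that wrote the file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended (application) properties part.
EP_NS = {
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
}


def read_app_fingerprint(data: bytes) -> dict:
    """Read application-level properties from an Office Open XML file given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("docProps/app.xml"))
    # TotalTime is the accumulated editing time in minutes.
    fields = ("Application", "AppVersion", "Company", "TotalTime")
    # Missing elements come back as None.
    return {name: root.findtext(f"ep:{name}", namespaces=EP_NS) for name in fields}
```

Comparing Application and AppVersion across a set of documents is a quick way to see whether they plausibly came from the same environment.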

PDF metadata extraction uses pdfinfo from poppler-utils for structured property reading. This handles both standard and linearized PDFs that regex-only parsers fail on. Key fields include Producer (the PDF library that wrote the file), Creator (the authoring application), and whether the file is encrypted.
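
One plausible way to consume pdfinfo output from Python is to run the binary and split each line at the first colon. This is a sketch, not the tool's actual integration: parse_pdfinfo and run_pdfinfo are hypothetical names, and run_pdfinfo assumes the pdfinfo binary is installed and on PATH.

```python
import subprocess


def parse_pdfinfo(text: str) -> dict:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    info = {}
    for line in text.splitlines():
        # Split at the first colon only, since values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def run_pdfinfo(path: str) -> dict:
    """Run pdfinfo on a file and return its parsed output."""
    # Passing an argument list avoids shell interpretation of the file name.
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

Keeping the parser separate from the subprocess call makes it easy to test against saved output without having poppler-utils installed.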

Typical use cases: OSINT document research, legal discovery, pre-publication privacy review, and malware document triage.

Pros & Cons

Pros

  ✓ Author and company fields often contain real names that creators forget to sanitize before sharing
  ✓ Application version fingerprinting identifies the creation environment for targeted analysis
  ✓ Hidden content detection flags tracked changes and embedded objects in Office files

Cons

  ✗ Encrypted PDFs return minimal metadata since the tool does not decrypt protected documents
  ✗ Documents sanitized with File > Inspect Document in Office return little to no useful data
  ✗ Embedded object detection confirms presence only and does not analyze or execute the content

Frequently Asked Questions

Yes, if the device had GPS enabled at capture. The tool extracts those coordinates and plots them on a map. Modern smartphones typically record location within 5–10 meters of the actual position.

Yes. Stripping EXIF removes GPS coordinates from the file permanently and irreversibly. However, if backups of the original exist, such as cloud storage, email copies, or device backups, those retain the original metadata.

Most major platforms including Instagram, Twitter/X, and Facebook strip EXIF when images are uploaded. Platforms that allow direct file downloads like Google Drive, Dropbox, and email preserve the original metadata. Telegram’s “Send as File” option also preserves EXIF while compressed sends strip it.

A device fingerprint is the combination of Camera Make, Camera Model, Lens Model, Serial Number, and Software fields that identifies the specific hardware and software that captured the image. In forensic and OSINT work, matching device fingerprints across multiple images can link separate files to the same physical device regardless of who posted them.

Yes. Tools like ExifTool, Photoshop, or dedicated metadata editors can modify GPS coordinates, timestamps, camera model, and any other EXIF field. For this reason, EXIF serves as one evidence point in an investigation, not conclusive proof, and should be corroborated with other signals such as image content, shadow direction, and linguistic context.
