Free Document Metadata Extractor - PDF & Office File Analyzer

PDF and Office metadata analysis, author history, and hidden content detection.

Upload any PDF, DOCX, XLSX, or PPTX file below to extract its hidden metadata - revealing author names, revision history, creation timestamps, software fingerprints, and embedded risk indicators. Free, no login required.

Document Metadata Extractor

How to Use

Work through these steps in order. Use this tool for educational and ethical purposes only.

  1. Select Document Metadata Extractor from the tool navigation.
  2. Click Upload Document and select a PDF, DOCX, XLSX, or PPTX file.
  3. Click Analyze Metadata. The tool reads metadata without executing any document content.
  4. Review the Core Properties panel: Author, Last Modified By, Company, Title, Subject, and Keywords.
  5. Check the Timestamp panel: Created, Modified, and Last Printed dates with timezone data where available.
  6. Review the Revision History panel showing total edit count and version number.
  7. Check the Application Fingerprint for the exact software version used to create and last edit the file.
  8. Review the Risk Indicators panel: embedded objects, macro presence, and external link references.
  9. For DOCX files, check the Hidden Content section for tracked changes and hidden text runs.
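
For readers curious how presence checks like the Risk Indicators and Hidden Content panels can work, here is a minimal Python sketch. It is not this tool's implementation. The function name scan_ooxml and the returned keys are hypothetical, and the checks are deliberately coarse: they only confirm presence, and the external-link flag also fires on ordinary hyperlinks.

```python
import io
import re
import zipfile


def scan_ooxml(data: bytes) -> dict:
    """Return presence flags for a DOCX/XLSX/PPTX file given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        # Relationship parts list link targets; TargetMode="External" marks remote ones.
        rels = b"".join(zf.read(n) for n in names if n.endswith(".rels"))
        # word/document.xml only exists in DOCX files.
        body = zf.read("word/document.xml") if "word/document.xml" in names else b""
    return {
        "macros": any(n.endswith("vbaProject.bin") for n in names),
        "embedded_objects": any("/embeddings/" in n for n in names),
        "external_links": b'TargetMode="External"' in rels,
        # <w:ins> and <w:del> wrap tracked insertions and deletions.
        "tracked_changes": re.search(rb"<w:(ins|del)[ >]", body) is not None,
        # <w:vanish/> is the run property for hidden text.
        "hidden_text": re.search(rb"<w:vanish\s*/?>", body) is not None,
    }
```

Nothing inside the archive is executed. The sketch only lists part names and searches the XML as bytes, which matches the read-only approach described in step 3.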

What Is Document Metadata?

Document metadata is structured data embedded in a file that describes the file itself rather than its visible content. It is stored inside the file’s internal structure, separate from the page content, and is invisible to anyone reading the document normally.

Office applications and document creation tools write metadata automatically to support version tracking, document management, and workflow automation. When a PDF is generated by a reporting engine like JasperReports, the software stamps its identity into the file’s metadata. When you save a Word document, the application records your registered username, organization name, and revision history. These fields exist to help software manage documents, not for every eventual recipient to see.

How it gets embedded: In PDF files, metadata lives in two places: the Document Information Dictionary (a key-value structure referenced from the file’s trailer) and an XMP metadata stream (an XML packet referenced from the document catalog). The Creator field records the application that originated the content. In the example result, this is JasperReports (WelcomeLetter), flagged as an Identity Leak because it reveals the internal reporting system that generated the document. The Producer field records the PDF library used for rendering, here iText 2.1.0 (by lowagie.com), which identifies a specific software version for Software ID analysis.

In Word (DOCX) files, metadata is stored in docProps/core.xml and docProps/app.xml inside the ZIP archive that forms the DOCX container. These record the author’s name, company, last modified by, revision count, and total editing time. Excel (XLSX) files use the same structure.
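
As a rough illustration of the container layout described above, the following Python sketch opens the ZIP archive and reads a few fields from docProps/core.xml using only the standard library. The function name read_core_properties and the output keys are hypothetical. The namespace URIs are the standard core-properties and Dublin Core ones.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used inside docProps/core.xml.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Output key -> element path (hypothetical key names chosen for this sketch).
FIELDS = {
    "author": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "revision": "cp:revision",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(data: bytes) -> dict:
    """Read core properties from a DOCX/XLSX/PPTX file given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    # Missing elements come back as None instead of raising.
    return {key: root.findtext(path, default=None, namespaces=NS)
            for key, path in FIELDS.items()}
```

Calling read_core_properties(open("report.docx", "rb").read()) on a real file (report.docx is a placeholder name) would return a dictionary of whichever of these fields the application wrote.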

The CreationDate and ModDate fields (both showing D:20250131124553+05'00' in the example) record when the file was created and last modified, including the local clock’s offset from UTC. This constitutes Timeline Data and can narrow down the geographic region of the document’s origin when cross-referenced with other signals.
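
To make the date format concrete, here is a small Python sketch that parses a PDF date string of the form D:YYYYMMDDHHmmSS+HH'mm' into a timezone-aware datetime. The function name parse_pdf_date is hypothetical, and the pattern covers only the common variants (optional trailing fields, Z, or a signed offset). Real-world files sometimes deviate from the format, so treat this as a starting point rather than a complete parser.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset.
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string; returns a naive datetime if no offset is present."""
    m = PDF_DATE.match(value.strip())
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    # Omitted trailing fields default to January, day 1, and midnight.
    year, month, day, hour, minute, second = (
        int(part) if part else default
        for part, default in zip(m.groups()[:6], (0, 1, 1, 0, 0, 0))
    )
    tz = None
    if m.group(7):
        tz = timezone.utc
    elif m.group(8):
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(offset if m.group(8) == "+" else -offset)
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)
```

For the example value, parse_pdf_date("D:20250131124553+05'00'") yields a datetime whose utcoffset() is five hours, which is the regional signal discussed above.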

OSINT and Journalism Cases Where Document Metadata Exposed Sources

The UK Iraq Dossier (2003)

The UK government published a dossier on Iraq as a Microsoft Word document. Cambridge academic Glen Rangwala showed that much of the text had been copied from earlier published sources, and privacy researcher Richard M. Smith examined the file’s revision log, which listed the usernames of the people who had most recently saved it. Those entries pointed to specific named individuals within the British government. The analysis helped establish that the dossier misrepresented its sourcing, a finding with significant political consequences. Publishing Word documents without stripping metadata is equivalent to publishing an internal version history.

Dennis Rader / BTK Killer Identification (2005)

Dennis Rader sent police a floppy disk containing a Word document. Investigators examined the disk and recovered metadata showing that the file had last been saved by a user named “Dennis” and was associated with Christ Lutheran Church in Wichita, Kansas. That combination linked directly to Rader’s identity and location, providing the lead that broke a case that had been cold for decades.

Leaked Corporate and Legal Documents

When a law firm publishes a PDF brief or a corporation releases a financial report, the Creator field often reveals the internal software stack, such as JasperReports, Crystal Reports, or SAP, identifying the organization’s systems and sometimes specific employees. The Producer field version number (e.g., iText 2.1.0) can expose outdated libraries with known CVEs, giving attackers a targeted angle against that software.

Source Protection in Journalism

A document exported from a government system by a whistleblower carries authorship metadata tied to their account. The Last Modified By field in a DOCX can name the specific employee who last edited it, even when they intended to share it anonymously. Secure submission platforms such as SecureDrop build metadata removal into the journalist workflow, with files scrubbed using a tool such as mat2 before they leave the secure viewing environment. Journalists receiving files via email or direct transfer must sanitize them manually before publication.

How to Sanitize Documents Before Sharing

Method 1: Microsoft Word (DOCX)

Word includes a built-in Document Inspector for one-pass metadata removal.

  1. Go to File → Info → Check for Issues → Inspect Document.
  2. Check all categories: Comments and Revisions, Document Properties and Personal Information, Hidden Text, and Custom XML Data.
  3. Click Inspect, then Remove All next to each category with results.
  4. Save the file.

Note: The Inspector cleans the current file but does not touch backup files or AutoRecover copies stored locally. Delete those separately.

Method 2: PDF Files

Adobe Acrobat Pro: Tools → Redact → Sanitize Document. Full sanitization pass removing metadata, embedded content, scripts, and hidden layers. The most thorough option for high-risk documents.

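ExifTool (command line): exiftool -all= -XMP:all= document.pdf removes the Information Dictionary and XMP metadata from view. ExifTool writes PDF changes as an incremental update, so the original values remain recoverable from the file until it is rewritten. Follow up with a full rewrite, for example qpdf --linearize, to discard the old objects.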

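PDF/A conversion: Converts the file to a standards-compliant archival format. PDF/A requires an XMP metadata packet, so conversion normalizes metadata rather than removing it. Verify the output with this tool before publishing, and treat PDF/A conversion as a supplement to the methods above, not a substitute.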

Method 3: LibreOffice

Go to File → Properties and clear the fields in the Description and Custom Properties tabs. For batch processing, use a LibreOffice macro or the headless command-line mode to clear properties on load.

Method 4: Bulk Sanitization with mat2

For organizations handling large document volumes, mat2 (Metadata Anonymisation Toolkit 2) is the standard command-line solution. It handles PDF, DOCX, XLSX, ODP, and other formats in batch mode.

pip install mat2

mat2 --inplace document.pdf

mat2 is used by Tails OS as the default metadata removal tool and is the recommended option for journalists and security researchers.

Safe Publishing Workflow

  1. Complete the document in your preferred application.
  2. Export to PDF rather than sharing the native DOCX/XLSX, which retains more sensitive metadata.
  3. Run mat2 or ExifTool on the PDF output.
  4. Upload the sanitized PDF to this tool and verify no Identity Leak or Attribution fields remain before distributing.
  5. For legal or journalistic contexts, retain the sanitized output as a record of the sanitization step.
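
As an extra check alongside step 4, a raw byte scan can catch names that survive in a file when a tool has rewritten it incrementally and left old objects in place. The sketch below is a simple illustration, not a substitute for a full parser. The function name find_residual_strings is hypothetical, and the scan cannot see text inside compressed or encrypted streams.

```python
def find_residual_strings(data: bytes, needles: list[str]) -> list[str]:
    """Return the needles that still appear in the raw bytes of a file."""
    found = []
    for needle in needles:
        # PDF strings are commonly stored as plain bytes or as UTF-16BE text.
        encodings = (needle.encode("utf-8"), needle.encode("utf-16-be"))
        if any(encoded in data for encoded in encodings):
            found.append(needle)
    return found
```

A typical use would be to read the sanitized file with open(path, "rb").read() and pass in the author names, usernames, and internal system names you expect to be gone. An empty list is reassuring but not conclusive, for the reasons given above.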

Technical Details & Use Cases

Office Open XML formats (DOCX, XLSX, PPTX) are ZIP archives containing XML files. The tool reads core.xml (Dublin Core properties), app.xml (application properties), and custom.xml (organization-defined fields) to build a full metadata profile.

The Author field is set from the user name configured in the application at the time of creation, which typically defaults to the operating system or account name. It frequently contains a real name even in documents intended for anonymous publication. Last Modified By shows who last saved the file and may differ from the original author.

Application fingerprinting reads the Application and AppVersion fields from app.xml. This returns the exact Office version string, such as Microsoft Office Word 2019 16.0.xxxxx. Security teams use this to identify documents targeting specific software versions in phishing campaigns.
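
The following Python sketch shows one way the Application and AppVersion fields could be read from docProps/app.xml with the standard library. The function name read_app_fingerprint is hypothetical, and Company and TotalTime are included only as examples of neighbouring fields in the same part.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml).
EP = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def read_app_fingerprint(data: bytes) -> dict:
    """Read application fields from a DOCX/XLSX/PPTX file given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("docProps/app.xml"))
    # findtext returns None for any element the application did not write.
    return {tag: root.findtext(EP + tag)
            for tag in ("Application", "AppVersion", "Company", "TotalTime")}
```

Combining the Application and AppVersion values gives the version string described above. Mapping that string to a specific Office release is left out of this sketch.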

PDF metadata extraction uses pdfinfo from poppler-utils for structured property reading. This handles both standard and linearized PDFs that regex-only parsers fail on. Key fields include Producer (the PDF library that wrote the file), Creator (the authoring application), and the file’s encryption status.
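
A minimal sketch of wrapping pdfinfo from Python might look like the following. It assumes pdfinfo is installed and on the PATH. The helper names parse_pdfinfo and run_pdfinfo are hypothetical and do not describe how this tool is actually built.

```python
import subprocess


def parse_pdfinfo(output: str) -> dict:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps in the value stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def run_pdfinfo(path: str) -> dict:
    """Run pdfinfo on a file and return its parsed output."""
    # An argument list (not a shell string) keeps odd file names from being interpreted.
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing logic against saved pdfinfo output without having poppler-utils installed.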

Typical use cases: OSINT document research, legal discovery, pre-publication privacy review, and malware document triage.

Pros & Cons

Pros

  ✓ Author and company fields often contain real names that creators forget to sanitize before sharing
  ✓ Application version fingerprinting identifies the creation environment for targeted analysis
  ✓ Hidden content detection flags tracked changes and embedded objects in Office files

Cons

  ✗ Encrypted PDFs return minimal metadata since the tool does not decrypt protected documents
  ✗ Documents sanitized with File > Inspect Document in Office return little to no useful data
  ✗ Embedded object detection confirms presence only and does not analyze or execute the content


Frequently Asked Questions

Can document metadata reveal who wrote a file?

Yes. The Author field in Office documents is set from the user name configured in the application at the time of file creation, which typically defaults to the operating system or account name. It frequently contains a real name even in documents intended for anonymous publication. The Last Modified By field shows who last saved the file and may differ from the original author, revealing multiple contributors.

Can PDF metadata reveal the software used to create a document?

Yes. The Creator field records the application that originated the content, such as JasperReports, Microsoft Word, or LibreOffice. The Producer field records the PDF rendering library and version. These fields can identify internal systems, expose outdated software versions with known CVEs, and reveal whether a document was generated by automated reporting infrastructure.

What does the timezone offset in a PDF timestamp reveal?

The timezone offset in CreationDate and ModDate fields records the UTC offset of the system clock at the time of creation. For example, +05'00' is consistent with Pakistan Standard Time, among other UTC+5 zones. This can narrow down the geographic region of the document’s origin when cross-referenced with other contextual signals, even if the document’s claimed origin differs.

How do I remove metadata from a document before sharing it?

For PDFs: use Adobe Acrobat Pro’s Sanitize Document function, or run ExifTool via command line: exiftool -all= -XMP:all= document.pdf. ExifTool’s PDF edits remain reversible until the file is fully rewritten, for example with qpdf --linearize. For DOCX files: use File → Info → Check for Issues → Inspect Document in Microsoft Word and remove all flagged categories. For bulk sanitization, mat2 (Metadata Anonymisation Toolkit 2) handles multiple formats in batch mode.

Is document metadata admissible as evidence in court?

When properly collected and documented with chain of custody records, document metadata analysis is generally admissible in legal proceedings. Admissibility depends on establishing that the metadata was extracted without alteration and using validated forensic methods. Metadata has supplied key investigative leads in cases such as the BTK Killer investigation and has been used as evidence in corporate litigation. Metadata alone rarely constitutes conclusive proof and should be corroborated with additional evidence.

