File Handling

Process and pseudonymize content in PDFs, Word documents, spreadsheets, and images.

Overview

Many LLM workflows involve uploading documents — contracts, invoices, medical reports, spreadsheets. These files often contain personal data that should not reach the model in plain text.

Noirdoc can analyze and pseudonymize file content before it is forwarded to the LLM provider, applying the same detection and replacement pipeline used for text messages.

File handling must be explicitly enabled via the allow_file_content setting (disabled by default).

Supported formats

Noirdoc processes the following file types:

| Format | Extensions | Notes |
| --- | --- | --- |
| PDF | .pdf | Text extraction and optional OCR for scanned pages |
| Microsoft Word | .docx | Full text and table extraction |
| Microsoft Excel | .xlsx | Cell-level text extraction across all sheets |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | OCR-based text extraction |
| CSV | .csv | Row and cell level processing |
| Markdown | .md | Parsed as plain text |
| HTML | .html, .htm | Text extracted from markup |
| Plain text | .txt | Direct processing |

File content is extracted, scanned for personal data, and then processed according to the configured analysis mode.
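
As an illustration, a client could check whether a file has a supported extension before uploading it. The sketch below is not part of Noirdoc: the function name `supported_format` is hypothetical, the mapping is copied from the table above, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extension-to-format mapping copied from the supported formats table.
SUPPORTED_FORMATS = {
    ".pdf": "PDF",
    ".docx": "Microsoft Word",
    ".xlsx": "Microsoft Excel",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".tiff": "Images",
    ".bmp": "Images",
    ".csv": "CSV",
    ".md": "Markdown",
    ".html": "HTML",
    ".htm": "HTML",
    ".txt": "Plain text",
}


def supported_format(filename: str) -> Optional[str]:
    """Return the format name for a filename, or None if it is not listed.

    Hypothetical helper: lowercasing the extension is an assumption,
    not documented Noirdoc behavior.
    """
    return SUPPORTED_FORMATS.get(Path(filename).suffix.lower())
```

A pre-flight check like this only avoids sending files that would not be processed; it does not replace the server-side handling.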

Analysis modes

The file_analysis_mode setting controls how Noirdoc handles detected personal data in files. There are four modes:

Passthrough

file_analysis_mode: passthrough

Files are forwarded to the LLM provider without any analysis. Noirdoc does not inspect or modify file content. Use this mode when files are known to contain no personal data or when privacy scanning is not required.

Detect only

file_analysis_mode: detect_only

Noirdoc scans the file content and logs any detected personal data entities, but does not modify the file. The original file is forwarded to the provider unchanged. This mode is useful for auditing — you can review detection results without impacting the LLM workflow.

Block

file_analysis_mode: block

If Noirdoc detects any personal data in the file, the entire request is rejected with an error response. The file is never forwarded to the provider. This mode enforces a strict policy: files containing PII are not allowed through the proxy.

The error response includes details about which entity types were detected, so the caller can take corrective action.
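
The block policy can be pictured with a small sketch. Everything here is an assumption made for illustration: the two regular expressions are toy detectors standing in for the real detection pipeline, and `FileBlockedError` and `enforce_block_mode` are hypothetical names, not the actual error format returned by the proxy.

```python
import re

# Toy detectors standing in for the real detection pipeline (hypothetical).
TOY_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d ()-]{7,}\d"),
}


class FileBlockedError(Exception):
    """Hypothetical error carrying the detected entity types."""

    def __init__(self, entity_types):
        self.entity_types = sorted(entity_types)
        super().__init__(
            "file rejected: detected " + ", ".join(self.entity_types)
        )


def detect_entity_types(text):
    """Return the set of entity type names with at least one match."""
    return {name for name, pattern in TOY_PATTERNS.items() if pattern.search(text)}


def enforce_block_mode(extracted_text):
    """Reject the text if any entity is detected, otherwise return it unchanged."""
    found = detect_entity_types(extracted_text)
    if found:
        raise FileBlockedError(found)
    return extracted_text
```

A caller would catch the error, read `entity_types`, and decide whether to redact the file and retry.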

Pseudonymize

file_analysis_mode: pseudonymize

This is the default mode. Noirdoc extracts the text content from the file, applies pseudonymization to all detected entities, and forwards the modified content to the provider. The pseudonymized content uses the same <<TYPE_N>> format as regular text messages.

When the response references pseudonymized values, Noirdoc restores the originals before returning the response to your application.
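
The round trip of replacing values with placeholders and restoring them afterwards can be sketched as follows. This is a minimal illustration, not Noirdoc's implementation: it detects only email addresses with a toy regular expression, and the `EMAIL` type name, the numbering starting at 1, and the function names `pseudonymize` and `restore` are all assumptions. Only the `<<TYPE_N>>` placeholder shape comes from the text above.

```python
import re

# Toy email detector (assumption); the real pipeline covers many entity types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(text):
    """Replace each distinct email with a <<EMAIL_N>> placeholder.

    Returns the masked text and a placeholder-to-original mapping.
    """
    mapping = {}  # placeholder -> original value
    seen = {}     # original value -> placeholder

    def replace(match):
        original = match.group(0)
        if original not in seen:
            placeholder = f"<<EMAIL_{len(seen) + 1}>>"
            seen[original] = placeholder
            mapping[placeholder] = original
        return seen[original]

    return EMAIL.sub(replace, text), mapping


def restore(text, mapping):
    """Put the original values back wherever a known placeholder appears."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Reusing the same placeholder for a repeated value keeps references consistent, so the model can still tell that two mentions refer to the same entity.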

OCR for images and scanned documents

When file_ocr_enabled is set to true (the default), Noirdoc applies optical character recognition to:

  • Image files (PNG, JPEG, TIFF, BMP)
  • Scanned PDF pages that contain no extractable text layer

OCR converts the visual content to text, which is then processed through the standard detection and pseudonymization pipeline.

This is particularly important for scanned contracts, handwritten notes, or photographed documents that may contain personal data embedded in the image rather than in a text layer.
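
The rule for when OCR applies can be expressed as a small decision function. This is a sketch under stated assumptions: `needs_ocr` is a hypothetical name, `page_text` stands for whatever text was extracted from a PDF page's text layer, and treating a whitespace-only result as "no extractable text layer" is an assumption.

```python
from pathlib import Path

# Image extensions taken from the supported formats list.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def needs_ocr(filename, page_text, file_ocr_enabled=True):
    """Decide whether OCR would apply to a file or PDF page.

    page_text is the text already extracted from the text layer
    (empty for images). Hypothetical helper for illustration only.
    """
    if not file_ocr_enabled:
        return False
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return True
    if suffix == ".pdf":
        # Assumption: a page with only whitespace counts as having no text layer.
        return not page_text.strip()
    return False
```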

OCR limitations

  • OCR accuracy depends on image quality, resolution, and language
  • Handwritten text may produce lower accuracy than printed text
  • Very low-resolution images (below 150 DPI) may yield poor results
  • OCR adds processing time — expect slightly higher latency for image-heavy requests

You can disable OCR by setting file_ocr_enabled to false if your workflow only involves digital-native documents.

File size limits

The file_max_size_mb setting controls the maximum allowed file size (default: 10 MB). Files exceeding this limit are rejected with an error before any processing occurs.

Adjust this value based on your use case:

  • For chat-based workflows with small attachments, the default 10 MB is typically sufficient
  • For document processing pipelines with large PDFs or spreadsheets, consider increasing the limit
  • Keep in mind that larger files require more processing time and memory
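
A client can apply the same limit locally to fail fast before uploading. The sketch below is illustrative only: `check_file_size` and `FileTooLargeError` are hypothetical names, and the conversion of 1 MB to 1024 * 1024 bytes is an assumption, since the unit definition is not stated here.

```python
class FileTooLargeError(ValueError):
    """Hypothetical error raised when a file exceeds the configured limit."""


def check_file_size(size_bytes, file_max_size_mb=10):
    """Raise FileTooLargeError if size_bytes exceeds the limit.

    Assumption: 1 MB is treated as 1024 * 1024 bytes.
    """
    limit_bytes = file_max_size_mb * 1024 * 1024
    if size_bytes > limit_bytes:
        raise FileTooLargeError(
            f"file is {size_bytes} bytes, limit is {limit_bytes} bytes"
        )
```

With the default of 10, a 5 MB attachment passes and an 11 MB attachment is rejected; passing `file_max_size_mb=25` would accept both.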

Enabling file processing

File handling requires two settings to be configured:

  1. Enable file content processing:

    Set allow_file_content to true in your tenant settings. This is the master switch — when disabled, file content is never analyzed regardless of other settings.

  2. Choose an analysis mode:

    Set file_analysis_mode to your preferred mode. The default is pseudonymize, which replaces detected personal data before the file content is forwarded to the provider.
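
How the two settings interact can be modeled in a few lines. This sketch is not a Noirdoc API: the `FileSettings` class and `will_analyze` function are hypothetical, and only the setting names, their defaults, and the four mode names are taken from this page.

```python
from dataclasses import dataclass

# Mode names as listed under "Analysis modes".
VALID_MODES = {"passthrough", "detect_only", "block", "pseudonymize"}


@dataclass(frozen=True)
class FileSettings:
    """Hypothetical container holding the documented defaults."""

    allow_file_content: bool = False
    file_analysis_mode: str = "pseudonymize"
    file_max_size_mb: int = 10
    file_ocr_enabled: bool = True

    def __post_init__(self):
        if self.file_analysis_mode not in VALID_MODES:
            raise ValueError(f"unknown file_analysis_mode: {self.file_analysis_mode}")


def will_analyze(settings):
    """Return True only if file content would actually be scanned.

    The master switch must be on, and passthrough skips analysis entirely.
    """
    return settings.allow_file_content and settings.file_analysis_mode != "passthrough"
```

With the defaults, `will_analyze(FileSettings())` is false because the master switch is off; turning on `allow_file_content` alone is enough to activate the default pseudonymize mode.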

Example configuration

A typical privacy-focused configuration:

| Setting | Value |
| --- | --- |
| allow_file_content | true |
| file_analysis_mode | pseudonymize |
| file_max_size_mb | 10 |
| file_ocr_enabled | true |

A strict compliance configuration that rejects any file containing PII:

| Setting | Value |
| --- | --- |
| allow_file_content | true |
| file_analysis_mode | block |
| file_max_size_mb | 25 |
| file_ocr_enabled | true |