File Handling
Process and pseudonymize content in PDFs, Word documents, spreadsheets, and images.
Overview
Many LLM workflows involve uploading documents such as contracts, invoices, medical reports, and spreadsheets. These files often contain personal data that should not reach the model in plain text.
Noirdoc can analyze and pseudonymize file content before it is forwarded to the LLM provider, applying the same detection and replacement pipeline used for text messages.
File handling must be explicitly enabled via the allow_file_content setting (disabled by default).
Supported formats
Noirdoc processes the following file types:
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text extraction and optional OCR for scanned pages |
| Microsoft Word | .docx | Full text and table extraction |
| Microsoft Excel | .xlsx | Cell-level text extraction across all sheets |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | OCR-based text extraction |
| CSV | .csv | Row and cell level processing |
| Markdown | .md | Parsed as plain text |
| HTML | .html, .htm | Text extracted from markup |
| Plain text | .txt | Direct processing |
File content is extracted, scanned for personal data, and then processed according to the configured analysis mode.
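As a rough illustration of the table above, a client could check a filename against the supported extensions before uploading. This is a minimal sketch, not Noirdoc's own code: the `SUPPORTED_EXTENSIONS` mapping and the `format_for` helper are hypothetical names, and case-insensitive matching of extensions is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the supported-formats table.
# The mapping name and the helper below are hypothetical.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Microsoft Word",
    ".xlsx": "Microsoft Excel",
    ".png": "Images",
    ".jpg": "Images",
    ".jpeg": "Images",
    ".tiff": "Images",
    ".bmp": "Images",
    ".csv": "CSV",
    ".md": "Markdown",
    ".html": "HTML",
    ".htm": "HTML",
    ".txt": "Plain text",
}


def format_for(filename: str) -> Optional[str]:
    """Return the format name for a filename, or None if it is not listed.

    Assumption: extensions are matched case-insensitively.
    """
    return SUPPORTED_EXTENSIONS.get(Path(filename).suffix.lower())
```

A caller might use `format_for("invoice.pdf")` to decide whether to attach a file at all, or to convert it to a listed format first.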
Analysis modes
The file_analysis_mode setting controls how Noirdoc handles detected personal data in files. There are four modes:
Passthrough
file_analysis_mode: passthrough
Files are forwarded to the LLM provider without any analysis. Noirdoc does not inspect or modify file content. Use this mode when files are known to contain no personal data or when privacy scanning is not required.
Detect only
file_analysis_mode: detect_only
Noirdoc scans the file content and logs any detected personal data entities, but does not modify the file. The original file is forwarded to the provider unchanged. This mode is useful for auditing — you can review detection results without impacting the LLM workflow.
Block
file_analysis_mode: block
If Noirdoc detects any personal data in the file, the entire request is rejected with an error response. The file is never forwarded to the provider. This mode enforces a strict policy: files containing PII are not allowed through the proxy.
The error response includes details about which entity types were detected, so the caller can take corrective action.
Pseudonymize
file_analysis_mode: pseudonymize
This is the default mode. Noirdoc extracts the text content from the file, applies pseudonymization to all detected entities, and forwards the modified content to the provider. The pseudonymized content uses the same <<TYPE_N>> format as regular text messages.
When the response references pseudonymized values, Noirdoc restores the originals before returning the response to your application.
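The four modes and the placeholder round trip can be sketched in a few lines. This is an illustrative model only, under stated assumptions: the detector is a toy regex that finds email addresses, placeholder numbering starts at 1, and the names `pseudonymize`, `restore`, `handle_file_text`, and `FileBlocked` are hypothetical rather than part of Noirdoc's API.

```python
import logging
import re

log = logging.getLogger("file_analysis_sketch")

# Toy detector (assumption): matches email addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class FileBlocked(Exception):
    """Hypothetical error raised in block mode when entities are found."""


def pseudonymize(text):
    """Replace each distinct email with a <<EMAIL_N>> placeholder."""
    mapping = {}  # placeholder -> original value
    seen = {}     # original value -> placeholder

    def replace(match):
        value = match.group(0)
        if value not in seen:
            placeholder = f"<<EMAIL_{len(seen) + 1}>>"
            seen[value] = placeholder
            mapping[placeholder] = value
        return seen[value]

    return EMAIL_RE.sub(replace, text), mapping


def restore(text, mapping):
    """Put the original values back into a response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


def handle_file_text(text, mode):
    """Return (text_to_forward, mapping) for the given analysis mode."""
    if mode == "passthrough":
        return text, {}  # no inspection at all
    masked, mapping = pseudonymize(text)
    if mode == "detect_only":
        log.info("detected %d entities", len(mapping))
        return text, {}  # original content is forwarded unchanged
    if mode == "block":
        if mapping:
            raise FileBlocked("detected entity types: EMAIL")
        return text, {}
    if mode == "pseudonymize":
        return masked, mapping
    raise ValueError(f"unknown file_analysis_mode: {mode!r}")
```

In this sketch the mapping returned in pseudonymize mode is what allows `restore` to reinsert the original values once the provider's response comes back.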
OCR for images and scanned documents
When file_ocr_enabled is set to true (the default), Noirdoc applies optical character recognition to:
- Image files (PNG, JPEG, TIFF, BMP)
- Scanned PDF pages that contain no extractable text layer
OCR converts the visual content to text, which is then processed through the standard detection and pseudonymization pipeline.
This is particularly important for scanned contracts, handwritten notes, or photographed documents that may contain personal data embedded in the image rather than in a text layer.
OCR limitations
- OCR accuracy depends on image quality, resolution, and language
- Handwritten text may produce lower accuracy than printed text
- Very low-resolution images (below 150 DPI) may yield poor results
- OCR adds processing time, so expect somewhat higher latency for image-heavy requests
You can disable OCR by setting file_ocr_enabled to false if your workflow only involves digital-native documents.
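The OCR rule described in this section can be expressed as a small decision function. This is a sketch under assumptions: `needs_ocr` is a hypothetical helper, and the `has_text_layer` flag stands in for whatever text-layer check the extraction step performs on a PDF page.

```python
from pathlib import Path

# Image extensions from the supported-formats table.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def needs_ocr(filename, has_text_layer, file_ocr_enabled=True):
    """Decide whether a file (or a single PDF page) should go through OCR.

    has_text_layer is only meaningful for PDF pages; it is ignored for images.
    """
    if not file_ocr_enabled:
        return False  # OCR switched off: nothing is recognized
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return True   # images always need OCR to yield text
    if suffix == ".pdf":
        return not has_text_layer  # only scanned pages without a text layer
    return False      # other formats are read directly
```

Note that with `file_ocr_enabled` set to false, this model yields no text at all for images and scanned PDF pages, so nothing in them can be detected.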
File size limits
The file_max_size_mb setting controls the maximum allowed file size (default: 10 MB). Files exceeding this limit are rejected with an error before any processing occurs.
Adjust this value based on your use case:
- For chat-based workflows with small attachments, the default 10 MB is typically sufficient
- For document processing pipelines with large PDFs or spreadsheets, consider increasing the limit
- Keep in mind that larger files require more processing time and memory
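A client can apply the same limit before uploading to avoid a rejected request. This is a minimal sketch: `check_file_size` and `FileTooLarge` are hypothetical names, and treating 1 MB as 1024 × 1024 bytes is an assumption, since this section does not say how the limit is measured.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


class FileTooLarge(Exception):
    """Hypothetical error for files above file_max_size_mb."""


def check_file_size(size_bytes, file_max_size_mb=10):
    """Raise FileTooLarge if the file exceeds the configured limit."""
    limit = file_max_size_mb * BYTES_PER_MB
    if size_bytes > limit:
        raise FileTooLarge(
            f"file is {size_bytes} bytes; limit is {limit} bytes "
            f"({file_max_size_mb} MB)"
        )
```

For example, `check_file_size(os.path.getsize(path))` could run before the request is built, so oversized files fail fast on the client side.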
Enabling file processing
File handling is controlled by two settings:
- Enable file content processing: Set allow_file_content to true in your tenant settings. This is the master switch: when disabled, file content is never analyzed regardless of other settings.
- Choose an analysis mode: Set file_analysis_mode to your preferred mode. The default is pseudonymize, which provides full privacy protection.
Example configuration
A typical privacy-focused configuration:
| Setting | Value |
|---|---|
| allow_file_content | true |
| file_analysis_mode | pseudonymize |
| file_max_size_mb | 10 |
| file_ocr_enabled | true |
A strict compliance configuration that rejects any file containing PII:
| Setting | Value |
|---|---|
| allow_file_content | true |
| file_analysis_mode | block |
| file_max_size_mb | 25 |
| file_ocr_enabled | true |
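For teams that keep these settings in code, the four values and their documented defaults could be modeled and validated as below. This is a sketch only: the `FileSettings` class and its `will_analyze_files` method are hypothetical and say nothing about how Noirdoc stores tenant settings.

```python
from dataclasses import dataclass

VALID_MODES = {"passthrough", "detect_only", "block", "pseudonymize"}


@dataclass(frozen=True)
class FileSettings:
    """Hypothetical container; defaults mirror the documented ones."""
    allow_file_content: bool = False          # disabled by default
    file_analysis_mode: str = "pseudonymize"  # default mode
    file_max_size_mb: int = 10
    file_ocr_enabled: bool = True

    def __post_init__(self):
        if self.file_analysis_mode not in VALID_MODES:
            raise ValueError(
                f"unknown file_analysis_mode: {self.file_analysis_mode!r}"
            )
        if self.file_max_size_mb <= 0:
            raise ValueError("file_max_size_mb must be positive")

    def will_analyze_files(self):
        """True only if the master switch is on and the mode inspects content."""
        return self.allow_file_content and self.file_analysis_mode != "passthrough"


# The two example configurations from the tables above.
privacy_focused = FileSettings(allow_file_content=True)
strict_compliance = FileSettings(
    allow_file_content=True,
    file_analysis_mode="block",
    file_max_size_mb=25,
)
```

Validating the mode name up front catches typos such as `pseudonymise` before they reach a deployment.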