Pseudonymization — Noirdoc Docs

Entity types

Noirdoc detects and pseudonymizes the following entity types. Each type has a dedicated pseudonym prefix and is handled by one or both detection layers.

Entity type	Description	Example	Pseudonym
`PERSON`	Full or partial person names	Max Mustermann	`<<PERSON_1>>`
`EMAIL`	Email addresses	anna@example.com	`<<EMAIL_1>>`
`PHONE`	Phone numbers in various formats	+49 170 1234567	`<<PHONE_1>>`
`IBAN`	International bank account numbers	DE89 3704 0044 0532 0130 00	`<<IBAN_1>>`
`CREDIT_CARD`	Credit card numbers	4111 1111 1111 1111	`<<CREDIT_CARD_1>>`
`LOCATION`	Addresses, cities, regions, countries	Berliner Str. 42, 10115 Berlin	`<<LOCATION_1>>`
`DATE`	Dates of birth, appointment dates, etc.	15.03.1985	`<<DATE_1>>`
`ORGANIZATION`	Company and organization names	Deutsche Bank AG	`<<ORGANIZATION_1>>`
`IP_ADDRESS`	IPv4 and IPv6 addresses	192.168.1.1	`<<IP_ADDRESS_1>>`
`URL`	Web addresses and URIs	https://example.com/profile	`<<URL_1>>`
`MEDICAL_LICENSE`	Medical license numbers	110 2345 678 9	`<<MEDICAL_LICENSE_1>>`
`SVNR`	German social insurance numbers	1234 150385	`<<SVNR_1>>`
`STEUER_ID`	German tax identification numbers	12 345 678 901	`<<STEUER_ID_1>>`

Pseudonym format

Every pseudonym follows the pattern <<TYPE_N>> where:

TYPE is the entity type in uppercase (e.g., PERSON, EMAIL, IBAN)
N is a sequential integer starting at 1, incremented for each new distinct value of that type within a session

For example, if a conversation mentions three different people, they become <<PERSON_1>>, <<PERSON_2>>, and <<PERSON_3>>. The same person always maps to the same pseudonym within a session, regardless of how many times they appear.
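The numbering rule can be illustrated with a minimal sketch. The class and method names below (`PseudonymMapper`, `pseudonym_for`) are hypothetical and chosen for illustration only; they are not part of the Noirdoc API, and the sketch assumes the default `<<TYPE_N>>` format.

```python
from collections import defaultdict


class PseudonymMapper:
    """Hypothetical helper: assigns <<TYPE_N>> pseudonyms, one counter per type."""

    def __init__(self):
        self._forward = {}                 # (TYPE, real value) -> pseudonym
        self._counters = defaultdict(int)  # TYPE -> last number handed out

    def pseudonym_for(self, entity_type, value):
        entity_type = entity_type.upper()
        key = (entity_type, value)
        if key not in self._forward:
            # First time this distinct value is seen: take the next number.
            self._counters[entity_type] += 1
            self._forward[key] = f"<<{entity_type}_{self._counters[entity_type]}>>"
        return self._forward[key]


mapper = PseudonymMapper()
first = mapper.pseudonym_for("PERSON", "Max Mustermann")
second = mapper.pseudonym_for("PERSON", "Erika Musterfrau")
repeat = mapper.pseudonym_for("PERSON", "Max Mustermann")   # same value again
email = mapper.pseudonym_for("EMAIL", "anna@example.com")   # separate counter
```

Here `first` and `repeat` are both `<<PERSON_1>>`, `second` is `<<PERSON_2>>`, and `email` is `<<EMAIL_1>>`, because each entity type keeps its own counter.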

The pseudonym format can be customized via the pseudonym_label setting in the Portal.

Mapping persistence

When Noirdoc pseudonymizes an entity, it stores the bidirectional mapping between the real value and its pseudonym. This mapping is scoped to a tenant and session.
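To show why the mapping has to be bidirectional, the following sketch restores real values in a model response. The function `restore` and the in-memory dictionaries are assumptions made for this example, not Noirdoc's actual storage or API; the regular expression assumes the default `<<TYPE_N>>` format.

```python
import re

# Matches the default pseudonym shape, e.g. <<PERSON_1>> or <<CREDIT_CARD_12>>.
PSEUDONYM = re.compile(r"<<[A-Z_]+_\d+>>")


def restore(text, reverse):
    """Replace every known pseudonym in text with its real value."""
    # Unknown pseudonyms are left untouched rather than guessed.
    return PSEUDONYM.sub(lambda m: reverse.get(m.group(0), m.group(0)), text)


# Forward direction: real value -> pseudonym (used before the request is sent).
forward = {"Max Mustermann": "<<PERSON_1>>", "anna@example.com": "<<EMAIL_1>>"}
# Reverse direction: pseudonym -> real value (used on the response).
reverse = {pseudonym: value for value, pseudonym in forward.items()}

reply = "Please contact <<PERSON_1>> at <<EMAIL_1>> or <<PERSON_2>>."
restored = restore(reply, reverse)
```

In this example `restored` reads "Please contact Max Mustermann at anna@example.com or <<PERSON_2>>.", since `<<PERSON_2>>` has no entry in the reverse mapping.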

How TTL works

The mapping_ttl_days setting controls how long mappings are retained:

Default (30 days): Mappings persist for 30 days after last use. If the same person appears in a new conversation within that window, they receive the same pseudonym they had before.
Custom TTL: Set any positive integer to change the retention period. Longer TTLs provide more consistency across conversations; shorter TTLs reduce the window of stored data.
TTL = 0: Disables persistence entirely. Mappings exist only for the duration of a single API request. The same real value may receive a different pseudonym in a subsequent request. This offers the highest privacy at the cost of cross-conversation consistency.
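The three TTL behaviors can be sketched with a small in-memory store. This is an illustration under stated assumptions: `TtlMappingStore` is a hypothetical name, the clock is passed in explicitly to keep the example deterministic, and a mapping is treated as expired once more than `ttl_days` have passed since its last use.

```python
from datetime import datetime, timedelta
from typing import Optional


class TtlMappingStore:
    """Hypothetical store: keeps value -> pseudonym until the TTL lapses."""

    def __init__(self, ttl_days):
        self.ttl = timedelta(days=ttl_days)
        self._entries = {}  # real value -> (pseudonym, last_used)

    def put(self, value, pseudonym, now):
        # TTL = 0 disables persistence: nothing is stored beyond the request.
        if self.ttl > timedelta(0):
            self._entries[value] = (pseudonym, now)

    def get(self, value, now) -> Optional[str]:
        entry = self._entries.get(value)
        if entry is None:
            return None
        pseudonym, last_used = entry
        if now - last_used > self.ttl:
            del self._entries[value]  # expired: drop the stored mapping
            return None
        self._entries[value] = (pseudonym, now)  # each use restarts the window
        return pseudonym


t0 = datetime(2024, 1, 1)
store = TtlMappingStore(ttl_days=30)
store.put("Max Mustermann", "<<PERSON_1>>", t0)
after_20_days = store.get("Max Mustermann", t0 + timedelta(days=20))
after_45_days = store.get("Max Mustermann", t0 + timedelta(days=45))  # 25 days since last use
after_80_days = store.get("Max Mustermann", t0 + timedelta(days=80))  # 35 days since last use

ephemeral = TtlMappingStore(ttl_days=0)
ephemeral.put("Max Mustermann", "<<PERSON_1>>", t0)
not_persisted = ephemeral.get("Max Mustermann", t0)
```

The first two lookups return `<<PERSON_1>>` because each use restarts the 30-day window; the third returns `None` because 35 days have passed since the last use. With a TTL of 0 nothing is stored, so `not_persisted` is `None`.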

Session scope

Within a single API request (one chat completion call), mappings are always consistent: the same value always produces the same pseudonym. The TTL setting only determines whether that mapping is available to later requests.

Detection system

Noirdoc runs two detection layers in parallel to improve both precision and recall.

Pattern-based detection

The first layer detects structured entities with predictable formats:

Email addresses — pattern matching
Phone numbers — international and local format recognition
IBANs — format and checksum validation
Credit card numbers — format validation
IP addresses — IPv4 and IPv6 patterns
URLs — standard URI pattern matching
Tax IDs and insurance numbers — German-specific format validation

This layer is fast and deterministic, producing very few false positives on structured data.
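As an illustration of "format and checksum validation", the sketch below checks an IBAN with the standard ISO 13616 mod-97 procedure. It is not Noirdoc's detector: the function name `is_valid_iban` is hypothetical, and the format check is deliberately generic (it does not enforce country-specific lengths).

```python
import re

# Generic IBAN shape: country code, two check digits, 11 to 30 further characters.
IBAN_SHAPE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")


def is_valid_iban(candidate):
    """Return True if candidate has an IBAN shape and passes the mod-97 check."""
    compact = re.sub(r"\s+", "", candidate).upper()
    if not IBAN_SHAPE.fullmatch(compact):
        return False
    # Move the first four characters to the end, then map letters to numbers
    # (A=10 ... Z=35); int(ch, 36) yields exactly that mapping.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


valid = is_valid_iban("DE89 3704 0044 0532 0130 00")     # example from the table
corrupted = is_valid_iban("DE89 3704 0044 0532 0130 01")  # last digit altered
too_short = is_valid_iban("DE89")
```

The checksum step is what keeps false positives low: a random digit sequence that merely looks like an IBAN will usually fail the mod-97 test, as the altered example does here.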

Context-sensitive detection

The second layer understands the meaning of surrounding text and handles entities that do not follow fixed patterns:

Person names — recognizes names even in unusual formats or multilingual text
Organizations — identifies companies, institutions, and government bodies
Locations — detects cities, addresses, and regions from context
Dates — catches date references written in natural language (e.g., “last Tuesday” or “im Mai letzten Jahres”)

This layer can distinguish between “Schwarz” as a surname and “schwarz” as a color based on sentence structure and capitalization cues.

Confidence scores and thresholds

Each detection carries a confidence score between 0 and 1. Noirdoc applies a configurable threshold to filter out low-confidence detections and reduce false positives. Only entities that meet or exceed the threshold are pseudonymized.
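A minimal sketch of the threshold filter follows. The `Detection` structure, the function name, the scores, and the threshold value of 0.6 are all made up for illustration; the actual default threshold and data model are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record: type, matched text, confidence in [0, 1]."""
    entity_type: str
    text: str
    score: float


def filter_detections(detections, threshold):
    """Keep detections whose score meets or exceeds the threshold."""
    return [d for d in detections if d.score >= threshold]


detections = [
    Detection("PERSON", "Schwarz", 0.92),         # surname, high confidence
    Detection("PERSON", "schwarz", 0.31),         # likely the color, low confidence
    Detection("EMAIL", "anna@example.com", 1.0),  # pattern match
    Detection("LOCATION", "Berlin", 0.6),         # exactly at the threshold
]
kept = filter_detections(detections, threshold=0.6)
kept_texts = [d.text for d in kept]
```

With these illustrative scores, `kept_texts` is `["Schwarz", "anna@example.com", "Berlin"]`: the low-confidence "schwarz" is dropped, and the detection that exactly equals the threshold is kept. Raising the threshold trades recall for fewer false positives.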

Pseudonymization is a specific technical measure recognized under GDPR (Article 4(5) and Recital 26). By replacing personal data with pseudonyms before it leaves your infrastructure boundary:

The LLM provider processes only pseudonymized data, which reduces the amount of directly identifying personal data it handles as a data processor
Data minimization requirements (Article 5(1)(c)) are addressed: only the minimum necessary data reaches the model
In the event of a breach at the LLM provider, the exposed data contains no directly identifying personal information
Mappings remain under your control, stored in Noirdoc infrastructure managed in Germany or in your self-hosted deployment

Pseudonymization is not anonymization. The data can be re-identified using the mapping table, which is why Noirdoc treats mappings as sensitive and encrypts them at rest. The distinction matters for your data processing agreements and records of processing activities.
