Pseudonymization
Entity types, mapping persistence, and pseudonym formats.
Entity types
Noirdoc detects and pseudonymizes the following entity types. Each type has a dedicated pseudonym prefix and is handled by one or both detection layers.
| Entity type | Description | Example | Pseudonym |
|---|---|---|---|
| PERSON | Full or partial person names | Max Mustermann | <<PERSON_1>> |
| EMAIL | Email addresses | anna@example.com | <<EMAIL_1>> |
| PHONE | Phone numbers in various formats | +49 170 1234567 | <<PHONE_1>> |
| IBAN | International bank account numbers | DE89 3704 0044 0532 0130 00 | <<IBAN_1>> |
| CREDIT_CARD | Credit card numbers | 4111 1111 1111 1111 | <<CREDIT_CARD_1>> |
| LOCATION | Addresses, cities, regions, countries | Berliner Str. 42, 10115 Berlin | <<LOCATION_1>> |
| DATE | Dates of birth, appointment dates, etc. | 15.03.1985 | <<DATE_1>> |
| ORGANIZATION | Company and organization names | Deutsche Bank AG | <<ORGANIZATION_1>> |
| IP_ADDRESS | IPv4 and IPv6 addresses | 192.168.1.1 | <<IP_ADDRESS_1>> |
| URL | Web addresses and URIs | https://example.com/profile | <<URL_1>> |
| MEDICAL_LICENSE | Medical license numbers | 110 2345 678 9 | <<MEDICAL_LICENSE_1>> |
| SVNR | German social insurance numbers | 1234 150385 | <<SVNR_1>> |
| STEUER_ID | German tax identification numbers | 12 345 678 901 | <<STEUER_ID_1>> |
Pseudonym format
Every pseudonym follows the pattern <<TYPE_N>> where:
- TYPE is the entity type in uppercase (e.g., PERSON, EMAIL, IBAN)
- N is a sequential integer starting at 1, incremented for each new distinct value of that type within a session
For example, if a conversation mentions three different people, they become <<PERSON_1>>, <<PERSON_2>>, and <<PERSON_3>>. The same person always maps to the same pseudonym within a session, regardless of how many times they appear.
The pseudonym format can be customized via the pseudonym_label setting in the Portal.
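The numbering scheme can be illustrated with a minimal Python sketch. The `PseudonymAllocator` class and its method names are hypothetical and are not part of Noirdoc's API; the sketch only assumes the default `<<TYPE_N>>` format described above.

```python
from collections import defaultdict


class PseudonymAllocator:
    """Hypothetical helper that hands out <<TYPE_N>> labels per entity type."""

    def __init__(self):
        self._counters = defaultdict(int)  # last index used per entity type
        self._forward = {}                 # (type, real value) -> pseudonym

    def pseudonym_for(self, entity_type, value):
        entity_type = entity_type.upper()
        key = (entity_type, value)
        if key not in self._forward:
            # First time this value is seen: take the next index for its type.
            self._counters[entity_type] += 1
            self._forward[key] = f"<<{entity_type}_{self._counters[entity_type]}>>"
        return self._forward[key]


alloc = PseudonymAllocator()
print(alloc.pseudonym_for("PERSON", "Max Mustermann"))  # <<PERSON_1>>
print(alloc.pseudonym_for("PERSON", "Anna Schmidt"))    # <<PERSON_2>>
print(alloc.pseudonym_for("PERSON", "Max Mustermann"))  # <<PERSON_1>> again
print(alloc.pseudonym_for("EMAIL", "anna@example.com")) # <<EMAIL_1>>
```

Each entity type keeps its own counter, so the first email address is `<<EMAIL_1>>` even when several people have already been numbered.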
Mapping persistence
When Noirdoc pseudonymizes an entity, it stores the bidirectional mapping between the real value and its pseudonym. This mapping is scoped to a tenant and session.
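As a rough illustration of why the mapping is bidirectional, the Python sketch below uses the reverse direction to put real values back into a text that contains pseudonyms. The `restore` function and the sample mapping are assumptions made for this example and do not describe Noirdoc's internal implementation.

```python
import re

# Matches the default <<TYPE_N>> format, e.g. <<PERSON_1>> or <<CREDIT_CARD_2>>.
PSEUDONYM_RE = re.compile(r"<<[A-Z_]+_\d+>>")


def restore(text, reverse_mapping):
    """Replace known pseudonyms with real values; leave unknown ones untouched."""
    return PSEUDONYM_RE.sub(
        lambda m: reverse_mapping.get(m.group(0), m.group(0)), text
    )


# Forward direction: real value -> pseudonym (illustrative data).
forward = {"Max Mustermann": "<<PERSON_1>>", "anna@example.com": "<<EMAIL_1>>"}
# Reverse direction: pseudonym -> real value.
reverse = {pseudonym: value for value, pseudonym in forward.items()}

reply = "Please contact <<PERSON_1>> at <<EMAIL_1>>."
print(restore(reply, reverse))  # Please contact Max Mustermann at anna@example.com.
```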
How TTL works
The mapping_ttl_days setting controls how long mappings are retained:
- Default (30 days): Mappings persist for 30 days after last use. If the same person appears in a new conversation within that window, they receive the same pseudonym they had before.
- Custom TTL: Set any positive integer to change the retention period. Longer TTLs provide more consistency across conversations; shorter TTLs reduce the window of stored data.
- TTL = 0: Disables persistence entirely. Mappings exist only for the duration of a single API request. The same real value may receive a different pseudonym in a subsequent request. This offers the highest privacy at the cost of cross-conversation consistency.
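The retention rules above can be sketched as a small in-memory store. This is a simplified model under stated assumptions: the `MappingStore` class, its methods, and the injectable `clock` parameter are hypothetical, and only the `mapping_ttl_days` name comes from the setting described here.

```python
import time

SECONDS_PER_DAY = 86400


class MappingStore:
    """Hypothetical store in which expiry counts from the last use of a mapping."""

    def __init__(self, mapping_ttl_days=30, clock=time.time):
        self.ttl_seconds = mapping_ttl_days * SECONDS_PER_DAY
        self._clock = clock
        self._entries = {}  # real value -> (pseudonym, last-used timestamp)

    def put(self, value, pseudonym):
        if self.ttl_seconds == 0:
            return  # TTL = 0: nothing is kept beyond the current request
        self._entries[value] = (pseudonym, self._clock())

    def get(self, value):
        entry = self._entries.get(value)
        if entry is None:
            return None
        pseudonym, last_used = entry
        now = self._clock()
        if now - last_used > self.ttl_seconds:
            del self._entries[value]  # retention window has passed
            return None
        self._entries[value] = (pseudonym, now)  # a lookup counts as use
        return pseudonym


store = MappingStore(mapping_ttl_days=30)
store.put("Max Mustermann", "<<PERSON_1>>")
print(store.get("Max Mustermann"))  # <<PERSON_1>>
```

Because every successful lookup refreshes the timestamp, a value that keeps appearing in conversations retains its pseudonym indefinitely, while a value that is not seen for longer than the TTL is dropped.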
Session scope
Within a single API request (a single chat completion call), mappings are always consistent — the same value always produces the same pseudonym. The TTL setting only affects whether that mapping is available in future requests.
Detection system
Noirdoc runs two detection layers in parallel, one pattern-based and one context-sensitive, to maximize both precision and recall.
Pattern-based detection
The first layer detects structured entities with predictable formats:
- Email addresses — pattern matching
- Phone numbers — international and local format recognition
- IBANs — format and checksum validation
- Credit card numbers — format validation
- IP addresses — IPv4 and IPv6 patterns
- URLs — standard URI pattern matching
- Tax IDs and insurance numbers — German-specific format validation
This layer is fast and deterministic, producing very few false positives on structured data.
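To give a sense of what "format and checksum validation" can look like for IBANs, here is a minimal Python sketch that combines a simplified candidate pattern with the standard ISO 13616 mod-97 check. The regular expression and function names are assumptions made for this example; Noirdoc's actual patterns are not documented here and are likely stricter (for instance, per-country lengths).

```python
import re

# Simplified candidate pattern: country code, two check digits, then
# 11 to 30 alphanumeric characters, optionally separated by single spaces.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def iban_checksum_ok(candidate):
    """Standard mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and expect a remainder of 1."""
    compact = candidate.replace(" ", "").upper()
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text):
    """Return candidates that match the pattern and pass the checksum."""
    return [
        m.group(0)
        for m in IBAN_CANDIDATE.finditer(text)
        if iban_checksum_ok(m.group(0))
    ]


print(find_ibans("Please pay to DE89 3704 0044 0532 0130 00 today."))
# ['DE89 3704 0044 0532 0130 00']
print(find_ibans("Please pay to DE89 3704 0044 0532 0130 01 today."))
# []
```

The checksum step is what keeps false positives low: a string that merely looks like an IBAN is rejected unless its check digits are consistent with the rest of the number.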
Context-sensitive detection
The second layer understands the meaning of surrounding text and handles entities that do not follow fixed patterns:
- Person names — recognizes names even in unusual formats or multilingual text
- Organizations — identifies companies, institutions, and government bodies
- Locations — detects cities, addresses, and regions from context
- Dates — catches date references written in natural language (e.g., “last Tuesday” or “im Mai letzten Jahres”)
This layer can distinguish between “Schwarz” as a surname and “schwarz” as a color based on sentence structure and capitalization cues.
Confidence scores and thresholds
Each detection carries a confidence score between 0 and 1. Noirdoc applies a configurable threshold to filter out low-confidence detections and reduce false positives. Only entities that meet or exceed the threshold are pseudonymized.
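The filtering step can be sketched in a few lines of Python. The `Detection` record, the `apply_threshold` function, and the scores below are illustrative assumptions; the threshold value is passed in explicitly because no default is stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str
    text: str
    score: float  # confidence between 0 and 1


def apply_threshold(detections, threshold):
    """Keep only detections whose score meets or exceeds the threshold."""
    return [d for d in detections if d.score >= threshold]


candidates = [
    Detection("PERSON", "Schwarz", 0.92),
    Detection("PERSON", "schwarz", 0.18),   # likely the color, not a surname
    Detection("LOCATION", "Berlin", 0.60),
]

for d in apply_threshold(candidates, threshold=0.6):
    print(d.entity_type, d.text)
# PERSON Schwarz
# LOCATION Berlin
```

Note that a score exactly equal to the threshold is kept, matching the "meet or exceed" rule. Raising the threshold trades recall for precision: fewer false positives, but more real entities may pass through unpseudonymized.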
GDPR implications
Pseudonymization is a specific technical measure recognized under GDPR (Article 4(5) and Recital 26). By replacing personal data with pseudonyms before it leaves your infrastructure boundary:
- The LLM provider receives only pseudonymized data, which significantly reduces the personal data it handles as a data processor
- Data minimization requirements (Article 5(1)(c)) are addressed — only the minimum necessary data reaches the model
- In the event of a breach at the LLM provider, exposed data contains no directly identifiable personal information
- Mappings remain under your control, stored in Noirdoc infrastructure managed in Germany or in your self-hosted deployment
Pseudonymization is not anonymization. The data can be re-identified using the mapping table, which is why Noirdoc treats mappings as sensitive and encrypts them at rest. The distinction matters for your data processing agreements and records of processing activities.