Pseudonymization

Entity types, mapping persistence, and pseudonym formats.

Entity types

Noirdoc detects and pseudonymizes the following entity types. Each type has a dedicated pseudonym prefix and is handled by one or both detection layers.

Entity type | Description | Example | Pseudonym
--- | --- | --- | ---
PERSON | Full or partial person names | Max Mustermann | <<PERSON_1>>
EMAIL | Email addresses | anna@example.com | <<EMAIL_1>>
PHONE | Phone numbers in various formats | +49 170 1234567 | <<PHONE_1>>
IBAN | International bank account numbers | DE89 3704 0044 0532 0130 00 | <<IBAN_1>>
CREDIT_CARD | Credit card numbers | 4111 1111 1111 1111 | <<CREDIT_CARD_1>>
LOCATION | Addresses, cities, regions, countries | Berliner Str. 42, 10115 Berlin | <<LOCATION_1>>
DATE | Dates of birth, appointment dates, etc. | 15.03.1985 | <<DATE_1>>
ORGANIZATION | Company and organization names | Deutsche Bank AG | <<ORGANIZATION_1>>
IP_ADDRESS | IPv4 and IPv6 addresses | 192.168.1.1 | <<IP_ADDRESS_1>>
URL | Web addresses and URIs | https://example.com/profile | <<URL_1>>
MEDICAL_LICENSE | Medical license numbers | 110 2345 678 9 | <<MEDICAL_LICENSE_1>>
SVNR | German social insurance numbers | 1234 150385 | <<SVNR_1>>
STEUER_ID | German tax identification numbers | 12 345 678 901 | <<STEUER_ID_1>>

Pseudonym format

Every pseudonym follows the pattern <<TYPE_N>> where:

  • TYPE is the entity type in uppercase (e.g., PERSON, EMAIL, IBAN)
  • N is a sequential integer starting at 1, incremented for each new distinct value of that type within a session

For example, if a conversation mentions three different people, they become <<PERSON_1>>, <<PERSON_2>>, and <<PERSON_3>>. The same person always maps to the same pseudonym within a session, regardless of how many times they appear.

The pseudonym format can be customized via the pseudonym_label setting in the Portal.
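
The numbering rule can be illustrated with a minimal Python sketch. It is not Noirdoc's implementation: the class name PseudonymAllocator and the method pseudonym_for are hypothetical, and the sketch assumes the default <<TYPE_N>> format with one counter per entity type.

```python
from collections import defaultdict


class PseudonymAllocator:
    """Hypothetical helper that hands out <<TYPE_N>> pseudonyms."""

    def __init__(self) -> None:
        self._counters = defaultdict(int)  # entity type -> last N issued
        self._assigned = {}                # (entity type, real value) -> pseudonym

    def pseudonym_for(self, entity_type: str, value: str) -> str:
        type_name = entity_type.upper()
        key = (type_name, value)
        if key not in self._assigned:
            # A new distinct value of this type gets the next integer.
            self._counters[type_name] += 1
            self._assigned[key] = f"<<{type_name}_{self._counters[type_name]}>>"
        # A value seen before keeps the pseudonym it already has.
        return self._assigned[key]


allocator = PseudonymAllocator()
first = allocator.pseudonym_for("PERSON", "Max Mustermann")
second = allocator.pseudonym_for("PERSON", "Erika Musterfrau")
repeat = allocator.pseudonym_for("PERSON", "Max Mustermann")
email = allocator.pseudonym_for("EMAIL", "anna@example.com")
print(first, second, repeat, email)
# <<PERSON_1>> <<PERSON_2>> <<PERSON_1>> <<EMAIL_1>>
```

Keeping a separate counter per type is what lets <<PERSON_2>> and <<EMAIL_1>> coexist in the same session.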

Mapping persistence

When Noirdoc pseudonymizes an entity, it stores the bidirectional mapping between the real value and its pseudonym. This mapping is scoped to a tenant and session.
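
A bidirectional, scoped mapping might look like the following Python sketch. Everything in it is an assumption made for illustration: the names BidirectionalMapping, mapping_for, and restore are hypothetical, the placeholder regex assumes the default <<TYPE_N>> format, and a plain in-memory dictionary stands in for whatever storage Noirdoc actually uses.

```python
import re

# Assumes the default <<TYPE_N>> pseudonym format.
PLACEHOLDER = re.compile(r"<<[A-Z_]+_\d+>>")


class BidirectionalMapping:
    """Hypothetical two-way lookup between real values and pseudonyms."""

    def __init__(self) -> None:
        self.to_pseudonym = {}  # real value -> pseudonym
        self.to_real = {}       # pseudonym -> real value

    def add(self, real: str, pseudonym: str) -> None:
        self.to_pseudonym[real] = pseudonym
        self.to_real[pseudonym] = real

    def restore(self, text: str) -> str:
        # Placeholders without a known mapping are left untouched.
        return PLACEHOLDER.sub(
            lambda m: self.to_real.get(m.group(0), m.group(0)), text
        )


_stores = {}  # (tenant_id, session_id) -> BidirectionalMapping


def mapping_for(tenant_id: str, session_id: str) -> BidirectionalMapping:
    """Return the mapping for one tenant and session, creating it if needed."""
    return _stores.setdefault((tenant_id, session_id), BidirectionalMapping())


mapping = mapping_for("tenant-a", "session-1")
mapping.add("Max Mustermann", "<<PERSON_1>>")
restored = mapping.restore("Dear <<PERSON_1>>, your contact is <<PERSON_2>>.")
other_scope = mapping_for("tenant-b", "session-1").restore("Dear <<PERSON_1>>")
print(restored)     # Dear Max Mustermann, your contact is <<PERSON_2>>.
print(other_scope)  # Dear <<PERSON_1>>
```

Because the store is keyed by tenant and session, a pseudonym issued in one scope cannot be resolved from another.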

How TTL works

The mapping_ttl_days setting controls how long mappings are retained:

  • Default (30 days): Mappings persist for 30 days after last use. If the same person appears in a new conversation within that window, they receive the same pseudonym they had before.
  • Custom TTL: Set any positive integer to change the retention period. Longer TTLs provide more consistency across conversations; shorter TTLs reduce the window of stored data.
  • TTL = 0: Disables persistence entirely. Mappings exist only for the duration of a single API request. The same real value may receive a different pseudonym in a subsequent request. This offers the highest privacy at the cost of cross-conversation consistency.
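
The three TTL behaviors above can be sketched in Python as follows. This is a simplified model under stated assumptions, not Noirdoc's storage code: TtlMappingStore, remember, and lookup are hypothetical names, "after last use" is modeled as a sliding expiry that every successful lookup refreshes, and a mapping is treated as expired once more than the TTL has elapsed since its last use.

```python
from datetime import datetime, timedelta


class TtlMappingStore:
    """Hypothetical store whose entries expire mapping_ttl_days after last use."""

    def __init__(self, mapping_ttl_days: int = 30) -> None:
        self.ttl = timedelta(days=mapping_ttl_days)
        self._entries = {}  # real value -> (pseudonym, last_used)

    def remember(self, value: str, pseudonym: str, now: datetime) -> None:
        # TTL = 0 disables persistence, so nothing is stored.
        if self.ttl > timedelta(0):
            self._entries[value] = (pseudonym, now)

    def lookup(self, value: str, now: datetime):
        entry = self._entries.get(value)
        if entry is None:
            return None
        pseudonym, last_used = entry
        if now - last_used > self.ttl:
            del self._entries[value]  # expired: drop the mapping
            return None
        self._entries[value] = (pseudonym, now)  # refresh last use
        return pseudonym


start = datetime(2024, 1, 1)
store = TtlMappingStore()  # default of 30 days
store.remember("Max Mustermann", "<<PERSON_1>>", start)
day_20 = store.lookup("Max Mustermann", start + timedelta(days=20))
day_45 = store.lookup("Max Mustermann", start + timedelta(days=45))
day_80 = store.lookup("Max Mustermann", start + timedelta(days=80))

ephemeral = TtlMappingStore(mapping_ttl_days=0)
ephemeral.remember("Max Mustermann", "<<PERSON_1>>", start)
not_kept = ephemeral.lookup("Max Mustermann", start)
print(day_20, day_45, day_80, not_kept)
# <<PERSON_1>> <<PERSON_1>> None None
```

The lookup on day 45 still succeeds because the lookup on day 20 reset the clock. The lookup on day 80 comes 35 days after the last use and therefore finds nothing.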

Session scope

Within a single API request (a single chat completion call), mappings are always consistent — the same value always produces the same pseudonym. The TTL setting only affects whether that mapping is available in future requests.

Detection system

Noirdoc runs two detection layers in parallel to maximize both precision and recall.

Pattern-based detection

The first layer detects structured entities with predictable formats:

  • Email addresses — pattern matching
  • Phone numbers — international and local format recognition
  • IBANs — format and checksum validation
  • Credit card numbers — format validation
  • IP addresses — IPv4 and IPv6 patterns
  • URLs — standard URI pattern matching
  • Tax IDs and insurance numbers — German-specific format validation

This layer is fast and deterministic, producing very few false positives on structured data.
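
As an illustration of format and checksum validation, the Python sketch below checks an IBAN candidate with the standard ISO 13616 mod-97 test. It shows the general technique only and makes no claim about how Noirdoc implements it. The function name is_valid_iban is hypothetical, and the shape check is deliberately generic (it does not enforce country-specific lengths).

```python
import re

# Generic IBAN shape: country code, two check digits, 11 to 30 further characters.
IBAN_SHAPE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")


def is_valid_iban(candidate: str) -> bool:
    """Return True if the candidate has IBAN shape and passes the mod-97 check."""
    compact = re.sub(r"\s+", "", candidate).upper()
    if not IBAN_SHAPE.fullmatch(compact):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Map letters to numbers (A=10 ... Z=35); digits stay as they are.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(is_valid_iban("DE89 3704 0044 0532 0130 00"))  # True
print(is_valid_iban("DE89 3704 0044 0532 0130 01"))  # False
print(is_valid_iban("not an iban"))                  # False
```

The checksum step is what keeps the false-positive rate low: a random string of digits with the right shape is unlikely to pass it.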

Context-sensitive detection

The second layer understands the meaning of surrounding text and handles entities that do not follow fixed patterns:

  • Person names — recognizes names even in unusual formats or multilingual text
  • Organizations — identifies companies, institutions, and government bodies
  • Locations — detects cities, addresses, and regions from context
  • Dates — catches date references written in natural language (e.g., “last Tuesday” or “im Mai letzten Jahres”)

This layer can distinguish between “Schwarz” as a surname and “schwarz” as a color based on sentence structure and capitalization cues.

Confidence scores and thresholds

Each detection carries a confidence score between 0 and 1. Noirdoc applies a configurable threshold to filter out low-confidence detections and reduce false positives. Only entities that meet or exceed the threshold are pseudonymized.
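
A threshold filter of this kind can be sketched in a few lines of Python. The Detection record, the apply_threshold function, and the threshold value of 0.5 are all hypothetical and chosen only for illustration. The one behavior taken from the text is that a detection is kept when its score meets or exceeds the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record with a confidence score in [0, 1]."""
    entity_type: str
    text: str
    confidence: float


def apply_threshold(detections, threshold):
    """Keep only detections whose confidence meets or exceeds the threshold."""
    return [d for d in detections if d.confidence >= threshold]


candidates = [
    Detection("PERSON", "Schwarz", 0.92),
    Detection("PERSON", "schwarz", 0.18),
    Detection("ORGANIZATION", "Deutsche Bank AG", 0.50),
]
kept = apply_threshold(candidates, threshold=0.5)
print([d.text for d in kept])
# ['Schwarz', 'Deutsche Bank AG']
```

Raising the threshold trades recall for precision: fewer false positives are pseudonymized, but more borderline entities pass through unchanged.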

GDPR implications

Pseudonymization is a specific technical measure recognized under GDPR (Article 4(5) and Recital 26). By replacing personal data with pseudonyms before it leaves your infrastructure boundary:

  • The LLM provider processes only pseudonymized data, which limits the personal data it handles in its role as a data processor
  • Data minimization requirements (Article 5(1)(c)) are addressed — only the minimum necessary data reaches the model
  • In the event of a breach at the LLM provider, exposed data contains no directly identifiable personal information
  • Mappings remain under your control, stored in Noirdoc infrastructure managed in Germany or in your self-hosted deployment

Pseudonymization is not anonymization. The data can be re-identified using the mapping table, which is why Noirdoc treats mappings as sensitive and encrypts them at rest. The distinction matters for your data processing agreements and records of processing activities.