Data Protection & Privacy

About 5 minutes

Target audience: Readers who understand backend development or data engineering fundamentals
Prerequisites: Enterprise Systems Overview

Data protection refers to the set of technical and organizational measures that guard personal and sensitive information against unauthorized access, leakage, and tampering. AI systems carry unique risks not found in traditional systems because they process data at a large scale.

Data protection starts with classifying data by its sensitivity level.

| Level | Classification | Examples | Protective measures |
|-------|----------------|----------|---------------------|
| 1 | Public | Press releases, public FAQs | No special protection required |
| 2 | Internal | Internal emails, meeting materials | Access restrictions, authentication |
| 3 | Confidential | Customer records, contracts, financial data | Encryption, access logging |
| 4 | Restricted | API keys, passwords, personal information | Maximum protection, auditing |
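
The classification table can be encoded so that code can check which measures a dataset still lacks. The following is a minimal sketch, not a prescribed implementation: the `Sensitivity` enum, the control names, and `missing_controls` are hypothetical names introduced here, and the controls are listed per level exactly as in the table (whether higher levels also inherit lower-level measures is not stated in the source).

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Sensitivity levels from the classification table (1 = lowest)."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Hypothetical control identifiers; the table names them only informally.
REQUIRED_CONTROLS = {
    Sensitivity.PUBLIC: set(),
    Sensitivity.INTERNAL: {"access_restriction", "authentication"},
    Sensitivity.CONFIDENTIAL: {"encryption", "access_logging"},
    Sensitivity.RESTRICTED: {"maximum_protection", "auditing"},
}


def missing_controls(level: Sensitivity, applied: set[str]) -> set[str]:
    """Return the controls listed for this level that are not yet applied."""
    return REQUIRED_CONTROLS[level] - applied


# A Confidential dataset that is encrypted but has no access logging yet.
gaps = missing_controls(Sensitivity.CONFIDENTIAL, {"encryption"})
print(gaps)
```

A check like this could run in a deployment pipeline so that a dataset tagged at a given level is flagged when one of its listed measures is absent.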

PII (Personally Identifiable Information) is information that identifies a specific individual, or that can identify an individual when combined with other information.

Common examples of PII:

  • Full name, address, phone number, email address
  • National ID number, passport number, driver’s license number
  • IP address, cookie ID (treated as personal data under GDPR when they can be linked to an individual)
  • Biometric data (fingerprints, facial images)

PII is regulated data subject to strict management under GDPR, local data protection laws, and similar regulations.

AI systems have unique data protection risks not present in traditional systems.

Large language models can statistically memorize training data and may reproduce PII from training in their responses. Removing PII from training data is mandatory before fine-tuning.

In a RAG (Retrieval-Augmented Generation) system, sensitive documents included in the search index may be returned to general users without adequate access control.
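
One way to limit this risk is to store access metadata with each indexed chunk and filter the retrieval results before they reach the model. The sketch below assumes a hypothetical `Chunk` type with an `allowed_roles` field and a hypothetical `filter_by_access` helper; real vector databases expose metadata filtering through their own APIs, which are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A retrieved passage plus the access metadata stored with it."""
    text: str
    allowed_roles: frozenset[str] = field(default_factory=frozenset)


def filter_by_access(chunks: list[Chunk], user_roles: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one role with the user.

    Chunks with no allowed_roles are dropped (deny by default).
    """
    return [c for c in chunks if c.allowed_roles & user_roles]


retrieved = [
    Chunk("Public FAQ answer", frozenset({"employee", "customer"})),
    Chunk("Contract terms for a named client", frozenset({"legal"})),
    Chunk("Unlabeled document"),
]
visible = filter_by_access(retrieved, {"customer"})
print([c.text for c in visible])
```

Denying unlabeled chunks by default means a document that was indexed without metadata is never shown, rather than shown to everyone.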

Recording user prompts verbatim creates a store of customer names, contract details, and personal information contained in those prompts.

Sending requests containing sensitive data to an external LLM API means that data leaves the organization’s infrastructure. When those requests contain personal data covered by GDPR, a Data Processing Agreement (DPA) with the provider is required.

Encryption at rest protects data stored in databases and file systems.

  • Cloud databases (AWS RDS, Cloud SQL, etc.) can be encrypted by enabling a single setting
  • AES-256 is the standard encryption algorithm
  • Use a KMS (Key Management Service) to manage encryption keys

Encryption in transit protects data as it moves across networks.

  • Enforce HTTPS (TLS 1.2 or later) on all communications
  • Apply TLS to internal service-to-service communication as well (zero-trust principle)
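
As an illustration of the "TLS 1.2 or later" rule on the client side, the sketch below uses Python's standard `ssl` module. The function name `make_client_context` is an assumption made for this example; the `ssl` calls themselves are standard library APIs.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything below TLS 1.2."""
    # create_default_context() turns on certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Reject handshakes that would negotiate TLS 1.0 or 1.1.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_context()
print(ctx.minimum_version)
```

The resulting context can be passed to standard clients, for example `urllib.request.urlopen(url, context=ctx)`, for both external and internal service calls.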

End-to-end encryption means that not even intermediate servers can decrypt the data, as in messaging apps such as Signal. Consider applying it to particularly sensitive medical or legal data.

Masking and anonymization are techniques for processing and analyzing personal information without using the original values directly.

flowchart LR
    A[User data input] --> B[PII detection engine]
    B --> C{PII detected?}
    C -->|Yes| D[Masking / pseudonymization]
    C -->|No| E[Pass through]
    D --> F[RAG / LLM pipeline]
    E --> F
    F --> G[Response generation]

| Technique | Description | Reversible | Use case |
|-----------|-------------|------------|----------|
| Masking | Replace field with **** | No | Log display, on-screen masking |
| Pseudonymization | Replace with a consistent fake ID (mapping kept) | Yes | Analytics, test data |
| Anonymization | Irreversibly transform to prevent identification | No | Statistical analysis, model training |
| Tokenization | Replace sensitive value with random token (stored externally) | Yes (via lookup) | PCI DSS compliance for payment data |

import re

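# Patterns are compiled once at import time and reused on every call.
EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
# US format only; the digit groups must be separated by "-", "." or whitespace.
PHONE_RE = re.compile(r'\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}')

def mask_pii(text: str) -> str:
    """Replace email addresses and US-format phone numbers with placeholders."""
    text = EMAIL_RE.sub('[EMAIL]', text)
    text = PHONE_RE.sub('[PHONE]', text)
    return text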

# Example
raw_prompt = "Please contact Alice (alice@example.com) at (555) 123-4567"
safe_prompt = mask_pii(raw_prompt)
# → "Please contact Alice ([EMAIL]) at [PHONE]"
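
The masking example above is irreversible. The pseudonymization row of the techniques table describes a different behavior: a consistent replacement ID with a mapping that is kept. The following is a minimal sketch of that idea, assuming a hypothetical `Pseudonymizer` class and an in-memory mapping; a real deployment would keep the mapping in a separate, access-controlled store.

```python
import secrets


class Pseudonymizer:
    """Replace real identifiers with stable random pseudonyms.

    The mapping is what makes this reversible, so it should be stored
    apart from the pseudonymized data (an in-memory dict is used here).
    """

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}
        self._reverse: dict[str, str] = {}

    def pseudonymize(self, value: str) -> str:
        """Return the same pseudonym every time the same value is seen."""
        if value not in self._forward:
            token = "user_" + secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reidentify(self, token: str) -> str:
        """Look up the original value; raises KeyError for unknown tokens."""
        return self._reverse[token]


p = Pseudonymizer()
first = p.pseudonymize("alice@example.com")
second = p.pseudonymize("alice@example.com")
print(first == second)
```

Because the pseudonym is random rather than derived from the value, it reveals nothing about the original on its own; re-identification is possible only through the mapping.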

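GDPR (General Data Protection Regulation) applies to organizations established in the EU and to organizations outside the EU that offer goods or services to, or monitor the behavior of, people in the EU. Its scope depends on where the individuals are located, not on their citizenship. Organizations in Japan that serve users in Europe can therefore also be subject to GDPR.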

| Obligation | Description | Impact on AI systems |
|------------|-------------|----------------------|
| Purpose Limitation | Data may not be used beyond its stated purpose | Using customer support data for model training requires explicit consent |
| Data Minimization | Collect only what is necessary | Design prompts to not include unnecessary PII |
| Right to Erasure | Delete data upon request | Vector database deletion functionality must be implemented |
| Data Processing Agreement (DPA) | Contract with external processors | A DPA must be signed with external LLM API vendors |
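
The Right to Erasure row implies that every stored embedding must be traceable to the person it was derived from. The sketch below shows that flow with a toy in-memory store; `VectorRecord`, `InMemoryVectorStore`, and `erase_subject` are hypothetical names for illustration and do not correspond to any particular vector database API.

```python
from dataclasses import dataclass


@dataclass
class VectorRecord:
    record_id: str
    subject_id: str  # the person this embedding was derived from
    embedding: list[float]


class InMemoryVectorStore:
    """Toy stand-in for a vector database, used only to show the flow."""

    def __init__(self) -> None:
        self._records: dict[str, VectorRecord] = {}

    def upsert(self, record: VectorRecord) -> None:
        self._records[record.record_id] = record

    def erase_subject(self, subject_id: str) -> int:
        """Delete every record tied to subject_id; return how many were removed."""
        doomed = [rid for rid, rec in self._records.items()
                  if rec.subject_id == subject_id]
        for rid in doomed:
            del self._records[rid]
        return len(doomed)

    def count(self) -> int:
        return len(self._records)


store = InMemoryVectorStore()
store.upsert(VectorRecord("doc-1", "customer-42", [0.1, 0.2]))
store.upsert(VectorRecord("doc-2", "customer-42", [0.3, 0.4]))
store.upsert(VectorRecord("doc-3", "customer-7", [0.5, 0.6]))
removed = store.erase_subject("customer-42")
print(removed, store.count())
```

The key design point is the `subject_id` stored at ingestion time: without it, an erasure request cannot be matched to the embeddings it covers.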

Privacy by Design is the principle of building privacy protection into a system from the earliest stages of design.

Practical checklist for AI systems:

  • Do not send customer PII to external LLM APIs (mask in advance)
  • Detect and remove PII from context before it enters the RAG / LLM pipeline
  • Store only PII hashes or masked versions in user prompt logs
  • Attach access control metadata to documents in the vector database
  • Design and implement data retention periods and deletion flows
  • Sign a DPA with the external LLM provider
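
The checklist item on prompt logs can be sketched as follows: each email address is replaced with a short keyed hash, so the same address can still be correlated across log lines without being stored. The names `LOG_HASH_KEY`, `email_fingerprint`, and `redact_for_log` are assumptions for this example, and only email addresses are handled. A keyed hash (HMAC) is used instead of a plain hash because plain hashes of email addresses can be reversed by guessing likely addresses.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Assumption: a real system would load this key from a secrets manager or KMS.
LOG_HASH_KEY = b"replace-with-a-secret-key"


def email_fingerprint(email: str) -> str:
    """Keyed hash of the lowercased address, truncated to 16 hex characters."""
    digest = hmac.new(LOG_HASH_KEY, email.lower().encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def redact_for_log(prompt: str) -> str:
    """Replace each email address with a keyed-hash placeholder."""
    return EMAIL_RE.sub(
        lambda m: f"[EMAIL:{email_fingerprint(m.group(0))}]", prompt
    )


entry = redact_for_log("Refund for alice@example.com, cc Alice@Example.com")
print(entry)
```

Both spellings of the address produce the same placeholder because the address is lowercased before hashing. Fingerprints like these still count as pseudonymized data, so the log needs access control and a retention period of its own.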

Q: Can I use customer data to fine-tune a model?

Without explicit customer consent and an appropriate DPA, this is not recommended. Under GDPR's purpose limitation principle and comparable data protection laws, using data collected for one purpose (e.g., customer support) for another purpose (model training) generally requires fresh consent or another valid legal basis. Anonymized or aggregated data may be usable in some cases, but consulting legal counsel is recommended.

Q: What should I do if the LLM outputs a response containing PII?

Activate the incident response flow. Identify the source of the leak (training data, RAG documents, or PII mixed into prompts) and fix the root cause. If the incident qualifies as a personal data breach, GDPR requires notifying the supervisory authority within 72 hours of becoming aware of it, unless the breach is unlikely to result in a risk to individuals. Monitoring prompts and responses with PII pattern detection alerts enables early discovery.
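
The monitoring idea in this answer can be sketched as a scan over each response before it is returned or logged. The patterns reuse the email and US phone expressions from the masking example; `PII_PATTERNS`, `scan_for_pii`, and `should_alert` are hypothetical names, and the alert rule (any match triggers an alert) is an assumption chosen for simplicity.

```python
import re

# Same expressions as the masking example; real detectors cover far more types.
PII_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "us_phone": re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"),
}


def scan_for_pii(text: str) -> dict[str, int]:
    """Count matches per PII type; an empty dict means nothing was found."""
    counts = {name: len(p.findall(text)) for name, p in PII_PATTERNS.items()}
    return {name: n for name, n in counts.items() if n > 0}


def should_alert(response: str) -> bool:
    """Assumed alert rule: any detected PII in a response triggers an alert."""
    return bool(scan_for_pii(response))


findings = scan_for_pii("You can reach Bob at bob@example.org or 555-010-9999.")
print(findings)
```

Regular expressions catch only well-structured identifiers; names and addresses typically need a dedicated detection engine, as in the flowchart earlier on this page.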

Q: Is data safe if it is encrypted?

Encryption is a necessary condition but not a sufficient one. Encrypted data is still exposed if an attacker obtains legitimate user credentials, because the system decrypts it for any authenticated account. Encryption must be combined with proper access control (RBAC), audit logging, the principle of least privilege, and periodic security reviews to achieve defense in depth.

See the references for the external specifications and background sources used on this page.[1][2][3][4]

  1. GDPR Full Text
  2. NIST Privacy Framework
  3. Anthropic Privacy Policy
  4. OWASP Top 10 Privacy Risks