Data Protection & Privacy

Reading time: about 5 minutes

Intended audience: readers who understand backend development or data engineering fundamentals

Data protection refers to the set of technical and organizational measures that guard personal and sensitive information against unauthorized access, leakage, and tampering. AI systems add risks that traditional systems do not have, because they train on, index, and log large volumes of data that may contain personal information.

Data Classification

Data protection starts with classifying data by its sensitivity level.

Level	Classification	Examples	Protective measures
1	Public	Press releases, public FAQs	No special protection required
2	Internal	Internal emails, meeting materials	Access restrictions, authentication
3	Confidential	Customer records, contracts, financial data	Encryption, access logging
4	Restricted	API keys, passwords, personal information	Maximum protection, auditing
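
The table can be encoded so that pipeline code looks up the required controls instead of hard-coding them. The following is a minimal sketch: the control names and the `required_controls` helper are hypothetical, and treating the levels as cumulative (a higher level inherits the measures of lower ones) is an assumption that the table itself does not state.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Sensitivity levels from the classification table."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Hypothetical control names; each level lists the measures from the table.
CONTROLS = {
    Sensitivity.PUBLIC: set(),
    Sensitivity.INTERNAL: {"access_restriction", "authentication"},
    Sensitivity.CONFIDENTIAL: {"encryption", "access_logging"},
    Sensitivity.RESTRICTED: {"maximum_protection", "auditing"},
}


def required_controls(level: Sensitivity) -> set[str]:
    """Return the controls for a level plus those of every lower level.

    Cumulative levels are an assumption of this sketch.
    """
    needed: set[str] = set()
    for lower in Sensitivity:
        if lower <= level:
            needed |= CONTROLS[lower]
    return needed


print(sorted(required_controls(Sensitivity.CONFIDENTIAL)))
```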

What Is PII?

PII (Personally Identifiable Information) is information that identifies a specific individual, or that can identify an individual when combined with other information.

Common examples of PII:

Full name, address, phone number, email address
National ID number, passport number, driver’s license number
IP address, cookie ID (can qualify as personal data under GDPR)
Biometric data (fingerprints, facial images)

PII is regulated data subject to strict management under GDPR, local data protection laws, and similar regulations.

AI-Specific Risks

AI systems have unique data protection risks not present in traditional systems.

Training Data Memorization

Large language models can memorize portions of their training data and may reproduce PII from that data in their responses. Remove PII from training data before fine-tuning.

Inappropriate Exposure via RAG

In a RAG (Retrieval-Augmented Generation) system, sensitive documents in the search index may be returned to general users if retrieval does not enforce access control.
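
One way to reduce this risk is to filter retrieved chunks against the caller's roles before they reach the prompt. The sketch below is illustrative only: the `Chunk` type, the `allowed_roles` field, and `filter_by_access` are hypothetical names, and a real vector database would normally apply this as a metadata filter inside the query itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_roles: frozenset[str]  # access-control metadata stored with the chunk


def filter_by_access(chunks: list[Chunk], user_roles: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one role with the user.

    A chunk with no allowed roles is never returned (default deny).
    """
    return [c for c in chunks if c.allowed_roles & user_roles]


retrieved = [
    Chunk("Public FAQ answer", frozenset({"employee", "customer"})),
    Chunk("Contract terms for a client", frozenset({"legal"})),
    Chunk("Untagged document", frozenset()),
]
visible = filter_by_access(retrieved, {"customer"})
print([c.text for c in visible])
```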

PII Accumulation in Prompt Logs

Recording user prompts verbatim creates a store of customer names, contract details, and personal information contained in those prompts.
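
A log entry can keep its debugging value without storing raw identifiers. This minimal sketch replaces the user ID with a keyed hash and masks email addresses before the prompt is written. The `build_log_entry` function and its field names are hypothetical, only one PII pattern is covered, and the secret is assumed to come from a key management system rather than source code.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def build_log_entry(user_id: str, prompt: str, secret: bytes) -> dict[str, str]:
    """Return a log record with a keyed hash of the user ID and a masked prompt."""
    # HMAC rather than a bare hash, so IDs cannot be recovered by guessing inputs
    # without the secret.
    user_ref = hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"user_ref": user_ref, "prompt": EMAIL.sub("[EMAIL]", prompt)}


entry = build_log_entry("user-42", "Reach me at bob@example.com", b"demo-secret")
print(entry["prompt"])
```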

Data Transmission to Third-Party LLM APIs

Sending requests containing sensitive data to an external LLM API means data leaves the organization’s infrastructure. A Data Processing Agreement (DPA) with the provider is required.

Encryption

Encryption at Rest

Encrypts data stored in databases and file systems.

Cloud databases (AWS RDS, Cloud SQL, etc.) can be encrypted by enabling a single setting
AES-256 is the standard encryption algorithm
Use a KMS (Key Management Service) to manage encryption keys

Encryption in Transit

Encrypts data as it moves across networks.

Enforce HTTPS (TLS 1.2 or later) on all communications
Apply TLS to internal service-to-service communication as well (zero-trust principle)
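
On the client side, the TLS 1.2 floor can be enforced in code as well as in server configuration. The sketch below uses Python's standard `ssl` module; the `make_client_context` helper name is an assumption of this example.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything older than TLS 1.2."""
    # The default context verifies server certificates and hostnames.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_client_context()
print(ctx.minimum_version.name)
```

The same context can be passed to standard-library clients, for example as the `context` argument of `urllib.request.urlopen`.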

End-to-End Encryption

Encryption where not even intermediate servers can decrypt the data, as used by messaging apps such as Signal. Consider applying this for particularly sensitive medical or legal data.

Data Masking and Anonymization

Techniques for processing and analyzing personal information without using it directly.

flowchart LR
    A[User data input] --> B[PII detection engine]
    B --> C{PII detected?}
    C -->|Yes| D[Masking / pseudonymization]
    C -->|No| E[Pass through]
    D --> F[RAG / LLM pipeline]
    E --> F
    F --> G[Response generation]

Technique Comparison

Technique	Description	Reversible	Use case
Masking	Replace field with `****`	No	Log display, on-screen masking
Pseudonymization	Replace with a consistent fake ID (mapping kept)	Yes	Analytics, test data
Anonymization	Irreversibly transform to prevent identification	No	Statistical analysis, model training
Tokenization	Replace sensitive value with random token (stored externally)	Yes (via lookup)	PCI DSS compliance for payment data
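
The reversible and irreversible rows of the table can be contrasted in a few lines. This is a sketch under stated assumptions: `Pseudonymizer` and `TokenVault` are hypothetical in-memory stand-ins, whereas a production system would keep the mapping and the token store in separately secured storage.

```python
import secrets


def mask(value: str) -> str:
    """Masking: the original value cannot be recovered from the output."""
    return "****"


class Pseudonymizer:
    """Pseudonymization: consistent fake IDs, reversible through the kept mapping."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}
        self._reverse: dict[str, str] = {}

    def pseudonymize(self, value: str) -> str:
        if value not in self._forward:
            pseudo = f"person_{len(self._forward) + 1:04d}"
            self._forward[value] = pseudo
            self._reverse[pseudo] = value
        return self._forward[value]

    def reidentify(self, pseudo: str) -> str:
        return self._reverse[pseudo]


class TokenVault:
    """Tokenization: random tokens, reversible only by lookup in the vault."""

    def __init__(self) -> None:
        self._values: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        token = f"tok_{secrets.token_hex(16)}"  # random, not derived from the value
        self._values[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._values[token]


p = Pseudonymizer()
vault = TokenVault()
card_token = vault.tokenize("4111 1111 1111 1111")
print(p.pseudonymize("alice@example.com"), mask("secret"), card_token[:4])
```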

Masking Applied to Prompts

import re

def mask_pii(text: str) -> str:
    # Mask email addresses
    text = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
                  '[EMAIL]', text)
    # Mask phone numbers (US format)
    text = re.sub(r'\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}', '[PHONE]', text)
    return text

# Example
raw_prompt = "Please contact Alice (alice@example.com) at (555) 123-4567"
safe_prompt = mask_pii(raw_prompt)
# → "Please contact Alice ([EMAIL]) at [PHONE]"

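GDPR (General Data Protection Regulation) applies to organizations established in the EU and to organizations outside the EU that offer goods or services to, or monitor the behavior of, individuals in the EU. Its scope depends on where the individual is located, not on citizenship, so an organization in Japan that serves users in Europe can also be subject to GDPR.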

Obligation	Description	Impact on AI systems
Purpose Limitation	Data may not be used beyond its stated purpose	Using customer support data for model training requires a legal basis for the new purpose, typically explicit consent
Data Minimization	Collect only what is necessary	Design prompts to not include unnecessary PII
Right to Erasure	Delete data upon request	Vector database deletion functionality must be implemented
Data Processing Agreement (DPA)	Contract with external processors	A DPA must be signed with external LLM API vendors
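
The right to erasure is much easier to honor when every embedding records which data subject it came from. The sketch below uses a hypothetical in-memory store to show the idea; `VectorRecord`, `InMemoryVectorStore`, and `erase_subject` are illustrative names, not the API of any particular vector database.

```python
from dataclasses import dataclass


@dataclass
class VectorRecord:
    record_id: str
    subject_id: str          # the data subject this embedding was derived from
    embedding: list[float]


class InMemoryVectorStore:
    """Stand-in for a vector database that supports deletion by subject."""

    def __init__(self) -> None:
        self._records: dict[str, VectorRecord] = {}

    def upsert(self, record: VectorRecord) -> None:
        self._records[record.record_id] = record

    def erase_subject(self, subject_id: str) -> int:
        """Delete every record for one data subject and return how many were removed."""
        doomed = [rid for rid, r in self._records.items() if r.subject_id == subject_id]
        for rid in doomed:
            del self._records[rid]
        return len(doomed)

    def __len__(self) -> int:
        return len(self._records)


store = InMemoryVectorStore()
store.upsert(VectorRecord("r1", "customer-1", [0.1, 0.2]))
store.upsert(VectorRecord("r2", "customer-1", [0.3, 0.4]))
store.upsert(VectorRecord("r3", "customer-2", [0.5, 0.6]))
removed = store.erase_subject("customer-1")
print(removed, len(store))
```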

Privacy by Design

Privacy by Design is the principle of building privacy protection into a system from the earliest stages of design.

Practical checklist for AI systems:

Do not send customer PII to external LLM APIs (mask in advance)
Detect and remove PII from user input and context before it enters the RAG pipeline
Store only PII hashes or masked versions in user prompt logs
Attach access control metadata to documents in the vector database
Design and implement data retention periods and deletion flows
Sign a DPA with the external LLM provider
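
The retention item in the checklist can start as a small scheduled job that separates expired records from those still within the retention period. This is a minimal sketch: the `split_by_retention` function, the `created_at` field, and the 30-day period are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def split_by_retention(
    entries: list[dict],
    retention: timedelta,
    now: Optional[datetime] = None,
) -> tuple[list[dict], list[dict]]:
    """Return (kept, expired) based on each entry's timezone-aware 'created_at'."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention
    kept = [e for e in entries if e["created_at"] >= cutoff]
    expired = [e for e in entries if e["created_at"] < cutoff]
    return kept, expired


logs = [
    {"id": "old", "created_at": datetime(2023, 12, 1, tzinfo=timezone.utc)},
    {"id": "new", "created_at": datetime(2024, 1, 20, tzinfo=timezone.utc)},
]
kept, expired = split_by_retention(
    logs, timedelta(days=30), now=datetime(2024, 1, 31, tzinfo=timezone.utc)
)
print([e["id"] for e in kept], [e["id"] for e in expired])
```

Returning the expired entries instead of silently dropping them lets the deletion flow remove them from every downstream store and record the deletion in an audit log.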

Frequently Asked Questions

Q: Can I use customer data to fine-tune a model?

Without explicit customer consent and an appropriate DPA, this is not recommended. Under GDPR and similar data protection laws, using data collected for one purpose (e.g., customer support) for another purpose (model training) generally requires a new legal basis, which in practice usually means fresh consent. Anonymized or aggregated data may be usable in some cases, but consulting legal counsel is recommended.

Q: What should I do if the LLM outputs a response containing PII?

Activate the incident response flow. Identify the source of the leak (training data, RAG documents, or PII mixed into prompts) and fix the root cause. If the incident qualifies as a personal data breach, GDPR requires notifying the supervisory authority within 72 hours of becoming aware of it, unless the breach is unlikely to pose a risk to the individuals concerned. Monitoring prompts and responses with PII pattern detection alerts enables early discovery.
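
Response monitoring can begin with simple pattern checks that raise an alert whenever a response contains something that looks like PII. The sketch below is illustrative: `find_pii` and the detector names are hypothetical, and the two regular expressions (email and US-format phone numbers with separators) cover only a small fraction of real PII.

```python
import re

# Illustrative detectors only; production systems need much broader coverage.
DETECTORS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "us_phone": re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"),
}


def find_pii(text: str) -> dict[str, int]:
    """Count matches per detector; an empty dict means nothing was flagged."""
    counts: dict[str, int] = {}
    for name, pattern in DETECTORS.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[name] = hits
    return counts


response = "You can mail carol@example.org or call 555-123-4567."
findings = find_pii(response)
if findings:
    print(f"ALERT: possible PII in response: {findings}")
```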

Q: Is data safe if it is encrypted?

Encryption is a necessary condition but not a sufficient one. If a legitimate user's credentials are compromised, an attacker can read encrypted data through the normal, authorized decryption path. Combine encryption with proper access control (RBAC), audit logging, the principle of least privilege, and periodic security reviews to achieve defense in depth.

See the references for the external specifications and background sources used on this page.[1][2][3][4]

References
