Data Protection & Privacy
Data protection refers to the set of technical and organizational measures that guard personal and sensitive information against unauthorized access, leakage, and tampering. AI systems add risks beyond those of traditional systems, because large volumes of data flow through training sets, retrieval indexes, prompts, and third-party APIs.
Data Classification
Data protection starts with classifying data by its sensitivity level.
| Level | Classification | Examples | Protective measures |
|---|---|---|---|
| 1 | Public | Press releases, public FAQs | No special protection required |
| 2 | Internal | Internal emails, meeting materials | Access restrictions, authentication |
| 3 | Confidential | Customer records, contracts, financial data | Encryption, access logging |
| 4 | Restricted | API keys, passwords, personal information | Maximum protection, auditing |
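The table can be expressed in code so that services look up the controls for a given level instead of hard-coding them. The following is a minimal sketch, not a prescribed policy: the names `Sensitivity`, `CONTROL_THRESHOLDS`, and `required_controls` are hypothetical, and it assumes that controls accumulate as the level rises, which the table does not state explicitly.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Classification levels from the table above."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Hypothetical policy: the lowest level at which each control becomes mandatory.
CONTROL_THRESHOLDS = {
    "authentication": Sensitivity.INTERNAL,
    "encryption": Sensitivity.CONFIDENTIAL,
    "access_logging": Sensitivity.CONFIDENTIAL,
    "auditing": Sensitivity.RESTRICTED,
}


def required_controls(level: Sensitivity) -> list[str]:
    """Return the controls that apply at the given level (assumed cumulative)."""
    return sorted(
        name for name, minimum in CONTROL_THRESHOLDS.items() if level >= minimum
    )
```

Calling `required_controls(Sensitivity.PUBLIC)` returns an empty list, while higher levels return progressively longer lists.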
What Is PII?
PII (Personally Identifiable Information) is information that identifies a specific individual, or that can identify an individual when combined with other information.
Common examples of PII:
- Full name, address, phone number, email address
- National ID number, passport number, driver’s license number
- IP address, cookie ID (online identifiers that can count as personal data under GDPR)
- Biometric data (fingerprints, facial images)
PII is regulated data subject to strict management under GDPR, local data protection laws, and similar regulations.
AI-Specific Risks
AI systems have unique data protection risks not present in traditional systems.
Training Data Memorization
Large language models can memorize parts of their training data and may reproduce PII from it in their responses. Remove PII from training data before fine-tuning.
Inappropriate Exposure via RAG
In a RAG (Retrieval-Augmented Generation) system without adequate access control, sensitive documents included in the search index may be returned to general users.
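One way to reduce this risk is to filter retrieved documents against the caller's roles before they reach the LLM. The sketch below is illustrative only: `Document`, `allowed_roles`, and `filter_by_role` are hypothetical names, and the retrieval step itself is assumed to have already happened.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    # Roles permitted to read this document; empty means nobody (deny by default).
    allowed_roles: set[str] = field(default_factory=set)


def filter_by_role(results: list[Document], user_roles: set[str]) -> list[Document]:
    """Drop retrieved documents that share no role with the caller."""
    return [doc for doc in results if doc.allowed_roles & user_roles]


retrieved = [
    Document("faq-1", "How to reset a password", {"employee", "customer"}),
    Document("contract-7", "Terms agreed with a client", {"legal"}),
]
visible = filter_by_role(retrieved, {"customer"})  # only "faq-1" remains
```

Filtering after retrieval is the simplest form; many vector databases can apply an equivalent metadata filter inside the query itself, which also avoids wasting result slots on documents the user cannot see.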
PII Accumulation in Prompt Logs
Recording user prompts verbatim creates a store of customer names, contract details, and personal information contained in those prompts.
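A log entry can be built so that it holds a keyed hash of the user identifier and a redacted prompt instead of the raw values. This is a minimal sketch under stated assumptions: `LOG_HASH_KEY` and `log_record` are hypothetical, the key would come from a secret manager in practice, and only email addresses are redacted here.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Hypothetical key; assumed to be loaded from a secret manager, not source code.
LOG_HASH_KEY = b"replace-with-a-managed-secret"


def log_record(user_id: str, prompt: str) -> dict[str, str]:
    """Build a log entry with no raw user identifier and no raw email address."""
    # A keyed hash (HMAC) keeps entries for one user linkable without storing the ID.
    user_ref = hmac.new(LOG_HASH_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"user_ref": user_ref, "prompt": EMAIL_RE.sub("[EMAIL]", prompt)}
```

Names, contract numbers, and other free-text PII would still pass through this sketch unchanged, so a production pipeline needs broader detection than a single pattern.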
Data Transmission to Third-Party LLM APIs
Sending requests containing sensitive data to an external LLM API means the data leaves the organization’s infrastructure. When those requests include personal data covered by GDPR, a Data Processing Agreement (DPA) with the provider is required.
Encryption
Encryption at Rest
Encryption at rest protects data stored in databases and file systems.
- Managed cloud databases (AWS RDS, Cloud SQL, etc.) provide built-in encryption at rest: Cloud SQL encrypts by default, and RDS enables it with a setting chosen when the instance is created
- AES-256 is the de facto standard algorithm for encryption at rest
- Use a KMS (Key Management Service) to manage encryption keys
Encryption in Transit
Encryption in transit protects data as it moves across networks.
- Enforce HTTPS (TLS 1.2 or later) on all communications
- Apply TLS to internal service-to-service communication as well (zero-trust principle)
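On the client side, the TLS 1.2 floor can be enforced in code as well as in server configuration. The sketch below uses Python's standard `ssl` module; the function name `strict_client_context` is hypothetical.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Client-side TLS context that verifies certificates and refuses TLS below 1.2."""
    # create_default_context() turns on certificate and hostname verification.
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The returned context can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=strict_client_context())`, for both external and internal service calls.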
End-to-End Encryption
End-to-end encryption means that only the communicating endpoints can decrypt the data; intermediate servers cannot. Messaging apps such as Signal use it. Consider applying it to particularly sensitive medical or legal data.
Data Masking and Anonymization
Techniques for processing and analyzing personal information without using it directly.
```mermaid
flowchart LR
    A[User data input] --> B[PII detection engine]
    B --> C{PII detected?}
    C -->|Yes| D[Masking / pseudonymization]
    C -->|No| E[Pass through]
    D --> F[RAG / LLM pipeline]
    E --> F
    F --> G[Response generation]
```
Technique Comparison
| Technique | Description | Reversible | Use case |
|---|---|---|---|
| Masking | Replace field with `****` | No | Log display, on-screen masking |
| Pseudonymization | Replace with a consistent fake ID (mapping kept) | Yes | Analytics, test data |
| Anonymization | Irreversibly transform to prevent identification | No | Statistical analysis, model training |
| Tokenization | Replace sensitive value with random token (stored externally) | Yes (via lookup) | PCI DSS compliance for payment data |
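The difference between the two reversible techniques is easier to see in code. The following is a toy sketch with in-memory dictionaries standing in for the mapping table and the external token store; `Pseudonymizer` and `TokenVault` are hypothetical names, not a real library.

```python
import secrets


class Pseudonymizer:
    """Consistent fake IDs, with the mapping kept so the process can be reversed."""

    def __init__(self) -> None:
        self._forward: dict[str, str] = {}
        self._reverse: dict[str, str] = {}

    def pseudonymize(self, value: str) -> str:
        # The same input always maps to the same fake ID.
        if value not in self._forward:
            fake_id = f"user_{len(self._forward) + 1:04d}"
            self._forward[value] = fake_id
            self._reverse[fake_id] = value
        return self._forward[value]

    def reidentify(self, fake_id: str) -> str:
        return self._reverse[fake_id]


class TokenVault:
    """In-memory stand-in for an external token store."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        token = secrets.token_hex(8)  # random, so it reveals nothing about the value
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]
```

In both cases the protection depends on who can reach the stored mapping: anyone with access to it can recover the original values, so it needs the same safeguards as the data itself.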
Masking Applied to Prompts
```python
import re


def mask_pii(text: str) -> str:
    """Replace email addresses and US-format phone numbers with placeholders."""
    # Mask email addresses
    text = re.sub(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}',
                  '[EMAIL]', text)
    # Mask phone numbers (US format)
    text = re.sub(r'\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}', '[PHONE]', text)
    return text


# Example (note that the name "Alice" is not covered by these patterns)
raw_prompt = "Please contact Alice (alice@example.com) at (555) 123-4567"
safe_prompt = mask_pii(raw_prompt)
# → "Please contact Alice ([EMAIL]) at [PHONE]"
```
Key GDPR Obligations
GDPR (General Data Protection Regulation) applies to organizations established in the EU and to organizations elsewhere that offer goods or services to, or monitor the behavior of, individuals in the EU, regardless of those individuals’ citizenship. An organization in Japan with users in the EU can therefore be subject to GDPR.
| Obligation | Description | Impact on AI systems |
|---|---|---|
| Purpose Limitation | Data may not be used beyond its stated purpose | Using customer support data for model training requires explicit consent |
| Data Minimization | Collect only what is necessary | Design prompts to not include unnecessary PII |
| Right to Erasure | Delete data upon request | Vector database deletion functionality must be implemented |
| Data Processing Agreement (DPA) | Contract with external processors | A DPA must be signed with external LLM API vendors |
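For the right to erasure, each stored vector needs to be traceable to the person it concerns so that it can be deleted on request. The sketch below uses a toy in-memory store to show the idea; `VectorRecord`, `InMemoryVectorStore`, and `erase_subject` are hypothetical, and a real vector database would expose its own delete call.

```python
from dataclasses import dataclass


@dataclass
class VectorRecord:
    record_id: str
    subject_id: str  # the person the source text is about
    embedding: list[float]


class InMemoryVectorStore:
    """Toy store used only to illustrate deletion by data subject."""

    def __init__(self) -> None:
        self._records: dict[str, VectorRecord] = {}

    def upsert(self, record: VectorRecord) -> None:
        self._records[record.record_id] = record

    def erase_subject(self, subject_id: str) -> int:
        """Delete every record tied to one data subject; return how many were removed."""
        doomed = [rid for rid, rec in self._records.items() if rec.subject_id == subject_id]
        for rid in doomed:
            del self._records[rid]
        return len(doomed)

    def __len__(self) -> int:
        return len(self._records)
```

Storing `subject_id` as metadata at indexing time is the design choice that makes erasure practical; without it, finding a person's chunks after the fact requires re-scanning the source documents.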
Privacy by Design
Privacy by Design is the principle of building privacy protection into a system from the earliest stages of design.
Practical checklist for AI systems:
- Do not send customer PII to external LLM APIs (mask in advance)
- Detect and remove PII from the context before passing it into the RAG pipeline
- Store only PII hashes or masked versions in user prompt logs
- Attach access control metadata to documents in the vector database
- Design and implement data retention periods and deletion flows
- Sign a DPA with the external LLM provider
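The retention item in the checklist can start as a scheduled job that drops entries older than a fixed period. This is a minimal sketch: the 30-day `RETENTION` value and the `created_at` field are assumptions for illustration, not recommended settings.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # hypothetical retention period


def purge_expired(entries: list[dict], now: datetime) -> list[dict]:
    """Return only the entries created within the retention period."""
    cutoff = now - RETENTION
    return [entry for entry in entries if entry["created_at"] >= cutoff]


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
logs = [
    {"id": 1, "created_at": datetime(2024, 1, 30, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2023, 12, 1, tzinfo=timezone.utc)},
]
kept = purge_expired(logs, now)  # entry 2 is older than 30 days and is dropped
```

Passing `now` as a parameter rather than reading the clock inside the function keeps the deletion rule easy to test.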
Frequently Asked Questions
Q: Can I use customer data to fine-tune a model?
Without explicit customer consent and an appropriate DPA, this is not recommended. Under GDPR and data protection laws, using data collected for one purpose (e.g., customer support) for another purpose (model training) requires fresh consent. Anonymized or aggregated data may be usable in some cases, but consulting legal counsel is recommended.
Q: What should I do if the LLM outputs a response containing PII?
Activate the incident response flow. Identify the source of the leak (training data, RAG documents, or PII mixed into prompts) and fix the root cause. If the incident is a personal data breach, GDPR requires notifying the supervisory authority within 72 hours of becoming aware of it, unless the breach is unlikely to result in a risk to the individuals concerned. Monitoring prompts and responses with PII pattern detection alerts enables early discovery.
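The monitoring step can be sketched as a function that scans each response and reports which PII patterns it matched. The pattern set and the name `detect_pii` are assumptions for illustration; the two patterns cover only email addresses and US-format phone numbers.

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "us_phone": re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"),
}


def detect_pii(text: str) -> dict[str, int]:
    """Count matches per PII pattern; an empty dict means nothing was found."""
    counts = {name: len(pattern.findall(text)) for name, pattern in PII_PATTERNS.items()}
    return {name: count for name, count in counts.items() if count}
```

A non-empty result would be the trigger for an alert; logging only the counts, not the matched text, avoids copying the leaked PII into the monitoring system.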
Q: Is data safe if it is encrypted?
Encryption is a necessary condition but not a sufficient one. An attacker who compromises legitimate user credentials can still read encrypted data, because the system decrypts it for that user. Encryption must be combined with proper access control such as role-based access control (RBAC), audit logging, the principle of least privilege, and periodic security reviews to achieve defense in depth.
See the references for the external specifications and background sources used on this page.[1][2][3][4]