Generative AI and Privacy: Risks and Design Principles

About 10 minutes

AI system designers and engineers, practitioners implementing Privacy by Design, business professionals seeking a comprehensive view of generative AI privacy risks

AI Governance Overview and Generative AI and Personal Information

Complying with data protection laws such as Japan’s Act on the Protection of Personal Information (APPI) and the GDPR is only one dimension of the privacy challenge that generative AI presents. Beyond legal compliance, AI systems carry structural privacy risks that statutory frameworks do not fully address.

This article systematically examines the privacy risks specific to generative AI from technical and conceptual perspectives, then covers the privacy-protective approaches that should be built into system design from the outset.

Privacy Risks Inherent to Generative AI

1. Memorization

Large language models (LLMs) tend to memorize parts of their training data. Text that appears repeatedly in the training set, or that relates to specific individuals, can be reproduced verbatim in model outputs.

Concrete risks include:

- Reproduction of specific individuals’ names, addresses, phone numbers, or email addresses
- Verbatim reproduction of medical records, legal documents, or other sensitive content
- Leakage of private code or API keys

This means that if personal information was present in the training data, any user who interacts with the model can potentially extract it.[3]
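One rough way to screen for verbatim reproduction is to compare model outputs against known sensitive records and flag long shared word sequences. The sketch below assumes the reference text is available for comparison. The function name `shared_ngrams`, the 5-word window, and the sample strings (which use a made-up name) are illustrative choices, not part of any standard tool.

```python
def shared_ngrams(output: str, reference: str, n: int = 5) -> set:
    """Return the word n-grams that appear in both texts."""
    def ngrams(text: str) -> set:
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return ngrams(output) & ngrams(reference)

# Fictitious record and model output, for illustration only.
training_record = "Patient Taro Yamada was admitted on 3 March with acute appendicitis"
model_output = "The note says Patient Taro Yamada was admitted on 3 March for observation"

overlap = shared_ngrams(model_output, training_record, n=5)
if overlap:
    print(f"Possible memorization: {len(overlap)} shared 5-word sequences")
```

A check like this only catches exact matches against records you already know about, so it works as a screening step rather than as proof that a model has not memorized something.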

2. Inference Attacks

Inference attacks are techniques for indirectly extracting information about training data by querying a model. Common types include:

Attack Type	Description
Membership inference	Determine whether a specific data point was included in training data
Model inversion	Attempt to reconstruct training data content from model outputs
Attribute inference	Infer sensitive attributes (race, health status, etc.) from partial information

These attacks are particularly relevant when a model is exposed as an API.
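To make membership inference concrete, here is a minimal sketch of a simple loss-threshold heuristic: models often assign higher confidence (lower loss) to examples they were trained on, so an attacker with query access can guess membership from the returned probabilities. The threshold value and the probabilities are hypothetical, and a real attack would calibrate the threshold on data with known membership.

```python
import math

def cross_entropy(prob_of_true_label: float) -> float:
    """Loss the model incurs on one example, given the probability it assigns to the true label."""
    return -math.log(prob_of_true_label)

def guess_membership(prob_of_true_label: float, threshold: float) -> bool:
    """Guess 'was in the training data' when the loss falls below the threshold."""
    return cross_entropy(prob_of_true_label) < threshold

# Hypothetical probabilities returned by a model API for two candidate records.
THRESHOLD = 0.1
for name, prob in [("record_a", 0.99), ("record_b", 0.60)]:
    print(name, "likely member" if guess_membership(prob, THRESHOLD) else "likely non-member")
```

This is also why limiting the detail of what an API returns, for example by withholding raw probabilities, can reduce exposure to this class of attack.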

3. Violation of Contextual Integrity

The concept of “Contextual Integrity,” developed by philosopher Helen Nissenbaum, defines privacy not as secrecy but as information flowing in ways appropriate to the norms of the context in which it was originally shared.[2]

Example: Health information shared with a physician flows appropriately when it is passed to other treating physicians for care purposes. When the same information reaches an insurance company or employer, the flow violates the norms of the original context (medical care) and is experienced as a privacy violation.

Generative AI creates contextual integrity violations in scenarios such as:

- Using public social media posts (shared in a personal context, with friends) to train models used in enterprise analysis
- Using input data from a specific-purpose service to improve unrelated services
- Combining public information to construct profiles that were never intended when information was originally disclosed
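One way to reason about contextual integrity in a system design is to write down each information flow and check it against the norms of the context where the data originated. The sketch below is a toy model under that assumption. The `Flow` fields and the `NORMS` table are hypothetical and would need to be defined per organization.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    info_type: str       # what kind of information is moving
    origin_context: str  # context in which it was originally shared
    recipient: str       # who would receive it

# Hypothetical norms: recipients considered appropriate for each
# (information type, original context) pair.
NORMS = {
    ("health", "medical care"): {"treating physician"},
    ("social post", "personal"): {"friends"},
}

def is_appropriate(flow: Flow) -> bool:
    """True when the recipient is allowed by the norms of the original context."""
    return flow.recipient in NORMS.get((flow.info_type, flow.origin_context), set())

print(is_appropriate(Flow("health", "medical care", "treating physician")))  # True
print(is_appropriate(Flow("health", "medical care", "employer")))            # False
```

Flows that are not listed are treated as inappropriate by default, which mirrors the idea that a new use of data has to be justified against the original context.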

4. Discriminatory Inference and Profiling

Generative AI can infer sensitive attributes — gender, race, health status, political views — from input features without explicitly storing personal information. When such inferences are used in consequential decisions like hiring, lending, or insurance pricing, they can enable discriminatory treatment.

Privacy by Design in Practice

Privacy by Design, developed by Ann Cavoukian (former Information and Privacy Commissioner of Ontario), is a framework for embedding privacy protection into system design from the beginning rather than treating it as an afterthought. It consists of seven foundational principles:[1]

Principle	Description
Proactive, not reactive	Prevent privacy risks in design rather than responding after the fact
Privacy as the default	The default configuration provides maximum privacy protection
Privacy embedded into design	Privacy protection is a core element of the system, not an add-on
Full functionality (positive-sum)	Privacy and functionality are not a zero-sum trade-off
End-to-end security	Protection throughout the full data lifecycle
Visibility and transparency	Data processing is transparent and subject to independent verification
Respect for user privacy	The individual’s privacy interests take priority

Applying These Principles to AI Systems

Data Minimization

Collect and process only the minimum data necessary for the stated purpose. In AI systems, there is a natural incentive to accumulate more data for accuracy improvements, but unnecessary data collection increases privacy risk.
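In code, data minimization can be enforced with an allow-list that strips every field a given purpose does not need before the data reaches a model or a log. The sketch below assumes records are plain dictionaries. The purposes and field names in `REQUIRED_FIELDS` are hypothetical.

```python
# Hypothetical mapping from processing purpose to the fields it needs.
REQUIRED_FIELDS = {
    "support_chat": {"ticket_id", "message"},
    "billing": {"customer_id", "plan", "amount"},
}

def minimize(record: dict, purpose: str) -> dict:
    """Keep only the fields registered for this purpose; unknown purposes get nothing."""
    allowed = REQUIRED_FIELDS.get(purpose, set())
    return {key: value for key, value in record.items() if key in allowed}

raw = {
    "ticket_id": 42,
    "message": "Reset my password",
    "email": "user@example.com",
    "birthdate": "1990-01-01",
}
print(minimize(raw, "support_chat"))
```

Using an allow-list rather than a block-list means newly added fields are excluded until someone explicitly justifies them.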

Purpose Limitation

Do not use data for purposes other than those for which it was collected. When using user data for model improvement or new feature development, the relationship to the original collection purpose must be made explicit.

Storage Limitation

Delete data promptly once the purpose has been achieved. AI systems often benefit from long-term data retention for ongoing improvement, but storage periods should be defined, and data that is no longer needed should be deleted.
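A defined storage period is easier to enforce when it is written as configuration and checked automatically. The following sketch assumes each record carries a category and a collection timestamp. The categories and periods in `RETENTION` are placeholders, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data category.
RETENTION = {
    "chat_log": timedelta(days=30),
    "training_feedback": timedelta(days=180),
}

def is_expired(category: str, collected_at: datetime, now: datetime) -> bool:
    """True when the record has outlived the retention period for its category."""
    return now - collected_at > RETENTION[category]

def purge(records: list, now: datetime) -> list:
    """Return only the records that are still within their retention period."""
    return [r for r in records if not is_expired(r["category"], r["collected_at"], now)]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"category": "chat_log", "collected_at": datetime(2024, 4, 1, tzinfo=timezone.utc)},
    {"category": "training_feedback", "collected_at": datetime(2024, 4, 1, tzinfo=timezone.utc)},
]
print(len(purge(records, now)))  # 1
```

In practice a job like this would also need to cover backups and any derived datasets built from the expired records.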

Privacy-Preserving Techniques

Differential Privacy

Differential privacy adds mathematically calibrated noise to statistical query results, providing a formal guarantee that the results change very little whether or not any one individual’s data is included.[4] This limits what can be inferred about a specific individual while preserving aggregate patterns in the data.

Apple uses differential privacy for health data aggregation; Google applies it in Chrome analytics. It is recognized as a technique for achieving both privacy protection and statistical utility.
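As an illustration of calibrated noise, the sketch below applies the Laplace mechanism to a counting query. A count changes by at most 1 when one person is added or removed (sensitivity 1), so Laplace noise with scale 1/ε gives ε-differential privacy for that single query. The function names and sample data are illustrative, and the noise is drawn with the standard library only.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of matching values; a counting query has sensitivity 1."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random()
ages = [34, 51, 29, 62, 45]  # made-up data
print(private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng))
```

Smaller values of ε add more noise and give stronger protection. Repeating the query consumes additional privacy budget, which this sketch does not track.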

Applied to AI model training: Techniques such as DP-SGD (Differentially Private Stochastic Gradient Descent) integrate differential privacy into the training process. Model accuracy is reduced to some degree, but memorization risk and vulnerability to inference attacks are substantially reduced.
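The core of a DP-SGD update can be sketched in a few lines: clip each example's gradient to a maximum norm so that no single record dominates, sum the clipped gradients, add Gaussian noise scaled to that norm, and then take an ordinary gradient step. The code below is a simplified illustration using plain Python lists. Parameter names such as `noise_multiplier` are assumptions for this sketch, and the privacy accounting needed to report an actual ε is omitted.

```python
import math
import random

def clip(grad: list, max_norm: float) -> list:
    """Scale the gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_sgd_step(weights, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One update: clip per example, sum, add Gaussian noise, average, step."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]

rng = random.Random()
grads = [[3.0, 4.0], [0.1, -0.2]]  # made-up per-example gradients
print(dp_sgd_step([0.0, 0.0], grads, lr=0.1, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds each individual's influence on the update, and the noise hides whatever influence remains. The accuracy cost mentioned above comes from both steps.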

Federated Learning

Federated learning trains models without centralizing raw data.[5] Training happens locally on each device or server, and only model updates (gradients) are aggregated. The original personal data never leaves the local environment, reducing privacy exposure.

Google uses federated learning in Gboard (the smartphone keyboard) to improve autocorrect predictions without sending individual typing data to a central server.
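The aggregation step on the server can be illustrated with a weighted average in the style of the federated averaging scheme described in [5]: each client sends a weight vector along with the number of local examples it trained on, and the server combines them without seeing the underlying data. The function name and the numbers below are illustrative.

```python
def federated_average(client_updates: list) -> list:
    """Average client weight vectors, weighted by each client's number of local examples.

    client_updates: list of (num_local_examples, weight_vector) pairs.
    """
    total_examples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    return [
        sum(n * weights[i] for n, weights in client_updates) / total_examples
        for i in range(dim)
    ]

# Two hypothetical clients: one with 1 local example, one with 3.
updates = [(1, [0.0, 2.0]), (3, [4.0, 2.0])]
print(federated_average(updates))  # [3.0, 2.0]
```

Only the vectors in `updates` cross the network in this scheme. The raw typing data or documents that produced them stay on each client.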

Anonymization and Pseudonymization

Anonymization removes identifying information; pseudonymization replaces it with substitute identifiers. However, simple anonymization is often vulnerable to re-identification attacks, in which multiple datasets are combined to identify individuals. Where robust protection is required, stronger privacy models such as k-anonymity, l-diversity, and t-closeness are recommended.
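To show what k-anonymity measures, the sketch below groups records by their quasi-identifiers (attributes that are not names but can identify someone in combination) and reports the size of the smallest group. A dataset is k-anonymous when every group has at least k records. The column names and rows are made up for illustration.

```python
from collections import Counter

def k_anonymity(records: list, quasi_identifiers: list) -> int:
    """Size of the smallest group sharing the same quasi-identifier values (0 if no records)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values(), default=0)

rows = [
    {"zip": "100-00**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "100-00**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "101-00**", "age": "40-49", "diagnosis": "flu"},
]
print(k_anonymity(rows, ["zip", "age"]))  # 1: the third row is unique
```

A result of 1 means at least one person is uniquely identifiable from the quasi-identifiers alone, so the data would need further generalization or suppression before release.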

Practical Steps for Organizations

Privacy Evaluation Checklist for AI System Design

When developing or deploying an AI system, conduct a privacy evaluation covering these areas:

Data lifecycle

- What data is collected, and for what purpose?
- Are retention periods and deletion policies defined?
- Are access controls appropriately scoped?

Model risk assessment

- Does the training data contain personal information?
- Has the risk of memorization been evaluated?
- Is there a monitoring process for model outputs?

Transparency to users

- Is it clearly disclosed what data the AI system uses?
- Do users have a way to understand and control how their data is used?

Internal Use Policy

When employees use external generative AI tools for work, establish a privacy policy that addresses:[6]

- Which categories of information must not be included in prompts (customer personal information, special care-required personal information, etc.)
- A list of approved AI services and the conditions for their use
- A verification step to check that AI-generated content does not contain personal information before it is shared externally
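Part of such a policy can be backed by tooling. The sketch below shows a simple filter that masks email addresses and hyphenated phone numbers, which could run on prompts before they are sent or on generated text before it is shared. The two regular expressions are deliberately simple assumptions: they catch only obvious patterns and will miss names, postal addresses, and other personal information.

```python
import re

# Simplified patterns, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{2,4}-\d{2,4}-\d{3,4}\b")  # e.g. 03-1234-5678

def redact(text: str) -> str:
    """Replace email addresses and hyphenated phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(redact("Contact taro@example.com or 03-1234-5678."))
# Contact [EMAIL] or [PHONE].
```

A filter like this supports the human verification step rather than replacing it, since pattern matching cannot judge whether free text identifies a person.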

Summary

The privacy risks of generative AI extend beyond legal compliance to include structural challenges: memorization, inference attacks, contextual integrity violations, and discriminatory profiling.

Addressing these requires a Privacy by Design approach: embedding privacy protection into the architecture of systems from the start, not just meeting minimum legal requirements. Combining techniques such as differential privacy, federated learning, and appropriate anonymization with the principles of data minimization, purpose limitation, and transparency builds the foundation for AI systems that users and regulators can trust.

For specific implementation choices or legal judgment, consulting a privacy specialist or information security professional with expertise in AI systems is recommended.

References

[1] Cavoukian, Ann, Privacy by Design: The 7 Foundational Principles, Information and Privacy Commissioner of Ontario, 2011
[2] Nissenbaum, Helen, Privacy in Context: Technology, Policy, and the Integrity of Social Life, Stanford University Press, 2010
[3] Carlini, Nicholas et al., Extracting Training Data from Large Language Models, USENIX Security 2021
[4] Dwork, Cynthia, Differential Privacy: A Survey of Results, Theory and Applications of Models of Computation, 2008
[5] McMahan, H. Brendan et al., Communication-Efficient Learning of Deep Networks from Decentralized Data, AISTATS 2017
[6] Personal Information Protection Commission, Notice regarding generative AI services, June 2, 2023
