
Generative AI and Privacy: Risks and Design Principles

Reading time: about 10 minutes

Target audience: AI system designers and engineers, practitioners implementing Privacy by Design, business professionals seeking a comprehensive view of generative AI privacy risks

Complying with data protection laws such as Japan’s APPI and the GDPR is only one dimension of the privacy challenge that generative AI presents. Beyond legal compliance, AI systems carry structural privacy risks that statutory frameworks do not fully address.

This article systematically examines the privacy risks specific to generative AI from technical and conceptual perspectives, then covers the privacy-protective approaches that should be built into system design from the outset.


Large language models (LLMs) can memorize parts of their training data. Text that appears repeatedly in the training set, or that relates to specific individuals, can be reproduced verbatim in model outputs. This phenomenon is known as memorization.

Concrete risks include:

  • Reproduction of specific individuals’ names, addresses, phone numbers, or email addresses
  • Verbatim reproduction of medical records, legal documents, or other sensitive content
  • Leakage of private code or API keys

This means that if personal information was present in the training data, any user who interacts with the model can potentially extract it with suitable prompts.[3]
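One way to make this risk measurable is a prefix-completion probe: feed the model the start of a known training record and check whether it completes the rest verbatim. The sketch below is a minimal illustration under stated assumptions, not the method of the cited paper. `generate` is a hypothetical stand-in for a call to the model under test, `toy_generate` is a fake model used only so the example runs, and the `prefix_len` and `min_overlap` values are arbitrary.

```python
from typing import Callable, Iterable


def verbatim_leak_rate(
    generate: Callable[[str], str],
    records: Iterable[str],
    prefix_len: int = 20,
    min_overlap: int = 15,
) -> float:
    """Fraction of tested records whose held-back suffix reappears verbatim.

    `generate` is a hypothetical callable wrapping the model under test.
    """
    tested = 0
    leaked = 0
    for record in records:
        prefix, suffix = record[:prefix_len], record[prefix_len:]
        if len(suffix) < min_overlap:
            continue  # too short to tell memorization from coincidence
        tested += 1
        completion = generate(prefix)
        # Count a leak when the start of the true suffix is reproduced exactly.
        if suffix[:min_overlap] in completion:
            leaked += 1
    return leaked / tested if tested else 0.0


# Toy stand-in model that has "memorized" exactly one record (assumption).
MEMORIZED = "Contact Taro Yamada at taro.yamada@example.com for access"


def toy_generate(prompt: str) -> str:
    if MEMORIZED.startswith(prompt):
        return MEMORIZED[len(prompt):]
    return "I cannot help with that."


records = [
    MEMORIZED,
    "Patient record 0042: diagnosis pending further laboratory tests",
]
rate = verbatim_leak_rate(toy_generate, records)
print(f"verbatim leak rate: {rate:.2f}")
```

A probe like this only detects exact reproduction. A low rate does not show that the model is free of memorized content, since paraphrased or partially altered leaks would pass unnoticed.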

Inference attacks are techniques for indirectly extracting information about training data by querying a model. Common types include:

| Attack Type | Description |
| --- | --- |
| Membership inference | Determine whether a specific data point was included in training data |
| Model inversion | Attempt to reconstruct training data content from model outputs |
| Attribute inference | Infer sensitive attributes (race, health status, etc.) from partial information |

These attacks are particularly relevant when a model is exposed as an API.
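To make membership inference concrete, the sketch below shows a simple loss-threshold baseline: models tend to have lower loss on examples they were trained on, so an attacker who can obtain a per-example loss (or a confidence score that approximates it) guesses "member" when the loss is low. The loss values and the threshold are invented for illustration, and real attacks are usually more sophisticated than this.

```python
from typing import List, Sequence


def infer_membership(losses: Sequence[float], threshold: float) -> List[bool]:
    """Guess 'member' for every example whose loss falls below the threshold."""
    return [loss < threshold for loss in losses]


def attack_accuracy(guesses: Sequence[bool], truth: Sequence[bool]) -> float:
    """Share of examples for which the attacker's guess matches reality."""
    correct = sum(g == t for g, t in zip(guesses, truth))
    return correct / len(truth)


# Hypothetical per-example losses obtained by querying a model (assumption).
member_losses = [0.05, 0.10, 0.20, 0.90]      # examples that were in training
non_member_losses = [0.60, 1.10, 1.40, 0.15]  # examples that were not

losses = member_losses + non_member_losses
truth = [True] * len(member_losses) + [False] * len(non_member_losses)

guesses = infer_membership(losses, threshold=0.5)
accuracy = attack_accuracy(guesses, truth)
print(f"attack accuracy: {accuracy:.2f}")
```

An accuracy clearly above 0.5 on a balanced set suggests the model behaves differently on training data, which is the signal this class of attack exploits.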

The concept of “Contextual Integrity,” developed by philosopher Helen Nissenbaum, defines privacy not as secrecy but as information flowing in ways appropriate to the norms of the context in which it was originally shared.[2]

Example: Health information shared with a physician flows appropriately when shared with other treating physicians for care purposes. The same information shared with an insurance company or employer violates the original context (medical care), and is experienced as a privacy violation.

Generative AI creates contextual integrity violations in scenarios such as:

  • Using public social media posts (shared in a personal context, with friends) to train models used in enterprise analysis
  • Using input data from a specific-purpose service to improve unrelated services
  • Combining public information to construct profiles that were never intended when information was originally disclosed

Generative AI can infer sensitive attributes — gender, race, health status, political views — from input features without explicitly storing personal information. When such inferences are used in consequential decisions like hiring, lending, or insurance pricing, they can enable discriminatory treatment.


Privacy by Design, developed by Ann Cavoukian (former Information and Privacy Commissioner of Ontario), is a framework for embedding privacy protection into system design from the beginning rather than treating it as an afterthought. It consists of seven foundational principles:[1]

| Principle | Description |
| --- | --- |
| Proactive, not reactive | Prevent privacy risks in design rather than responding after the fact |
| Privacy as the default | The default configuration provides maximum privacy protection |
| Privacy embedded into design | Privacy protection is a core element of the system, not an add-on |
| Full functionality (positive-sum) | Privacy and functionality are not a zero-sum trade-off |
| End-to-end security | Protection throughout the full data lifecycle |
| Visibility and transparency | Data processing is transparent and subject to independent verification |
| Respect for user privacy | The individual’s privacy interests take priority |

Data Minimization

Collect and process only the minimum data necessary for the stated purpose. In AI systems, there is a natural incentive to accumulate more data for accuracy improvements, but unnecessary data collection increases privacy risk.
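In code, data minimization often comes down to an allow-list applied at the point of ingestion, so that fields not needed for the stated purpose are never stored. The following is a minimal sketch. The purpose name `support_chat` and all field names are hypothetical.

```python
from typing import Dict, Set

# Fields each purpose is allowed to keep (hypothetical purposes and fields).
ALLOWED_FIELDS: Dict[str, Set[str]] = {
    "support_chat": {"ticket_id", "message", "product"},
}


def minimize(record: dict, purpose: str) -> dict:
    """Keep only the fields allow-listed for the given purpose.

    An unknown purpose keeps nothing, so the default is the most restrictive.
    """
    allowed = ALLOWED_FIELDS.get(purpose, set())
    return {key: value for key, value in record.items() if key in allowed}


raw = {
    "ticket_id": "T-1001",
    "message": "The export button does not respond.",
    "product": "reporting",
    "email": "hanako@example.com",
    "birth_date": "1990-04-01",
}
stored = minimize(raw, "support_chat")
print(stored)
```

An allow-list is preferable to a block-list here because a newly added field is dropped by default until someone decides it is needed.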

Purpose Limitation

Do not use data for purposes other than those for which it was collected. When using user data for model improvement or new feature development, the relationship to the original collection purpose must be made explicit.
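Purpose limitation can be enforced mechanically if each dataset carries the purposes it was collected for and every pipeline declares its purpose before reading. The sketch below assumes such tagging exists. `Dataset`, `PurposeError`, `require_purpose`, and the purpose strings are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


class PurposeError(Exception):
    """Raised when data is requested for a purpose it was not collected for."""


@dataclass(frozen=True)
class Dataset:
    name: str
    collected_for: FrozenSet[str]


def require_purpose(dataset: Dataset, purpose: str) -> None:
    """Refuse access unless the purpose matches a recorded collection purpose."""
    if purpose not in dataset.collected_for:
        raise PurposeError(
            f"{dataset.name!r} was not collected for {purpose!r}"
        )


chat_logs = Dataset("chat_logs", frozenset({"customer_support"}))

require_purpose(chat_logs, "customer_support")  # allowed, returns silently

try:
    require_purpose(chat_logs, "model_training")
    blocked = False
except PurposeError as err:
    blocked = True
    print(f"blocked: {err}")
```

A check like this does not decide whether a new purpose is compatible with the original one. It only forces that decision to be made explicitly and recorded before the data is reused.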

Storage Limitation

Delete data promptly once the purpose has been achieved. AI systems often benefit from long-term data retention for ongoing improvement, but storage periods should be defined, and data that is no longer needed should be deleted.
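A defined storage period is easiest to enforce when it is written down as configuration and checked by a scheduled job. The sketch below selects records that have outlived their retention period. The record kinds and the 30-day and 365-day periods are illustrative assumptions, not recommendations.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Retention period per record kind (hypothetical kinds and durations).
RETENTION: Dict[str, timedelta] = {
    "chat_log": timedelta(days=30),
    "training_sample": timedelta(days=365),
}


def find_expired(records: List[dict], now: datetime) -> List[dict]:
    """Return records older than the retention period for their kind."""
    return [
        record
        for record in records
        if now - record["created_at"] > RETENTION[record["kind"]]
    ]


utc = timezone.utc
records = [
    {"id": 1, "kind": "chat_log", "created_at": datetime(2024, 4, 1, tzinfo=utc)},
    {"id": 2, "kind": "chat_log", "created_at": datetime(2024, 5, 20, tzinfo=utc)},
    {"id": 3, "kind": "training_sample", "created_at": datetime(2023, 1, 1, tzinfo=utc)},
]
now = datetime(2024, 6, 1, tzinfo=utc)
expired_ids = [record["id"] for record in find_expired(records, now)]
print(expired_ids)
```

Deleting a stored record does not remove its influence from a model that was already trained on it, so retention rules for raw data and for trained models need to be considered separately.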


Differential privacy adds mathematically calibrated noise to statistical query results. It provides a formal guarantee that the output changes only slightly whether or not any single individual's data is included, which bounds what the results can reveal about that individual.[4] It preserves aggregate patterns in the data while protecting individuals.

Apple applies differential privacy to on-device usage statistics, including Health data type usage; Google has applied it to Chrome usage statistics. It is widely recognized as a technique that balances privacy protection with statistical utility.
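As a concrete illustration, the Laplace mechanism is a standard way to release a count under differential privacy: because adding or removing one person changes a count by at most 1, noise drawn from a Laplace distribution with scale 1/ε is added to the true count. The sketch below is a teaching example only. The data and the ε value are invented, and a production system should use a vetted library, since naive floating-point noise sampling has known weaknesses.

```python
import random
from typing import Callable, Iterable


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(
    values: Iterable[int],
    predicate: Callable[[int], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Release a noisy count of the values that satisfy the predicate."""
    true_count = sum(1 for value in values if predicate(value))
    sensitivity = 1.0  # one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 52, 67, 29, 73, 38]  # invented data
rng = random.Random(0)
noisy = dp_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(f"true count = 4, noisy count = {noisy:.2f}")
```

A smaller ε means more noise and stronger protection. Repeating the same query consumes additional privacy budget, because averaging many noisy answers would recover the true count.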

Applied to AI model training: Techniques such as DP-SGD (Differentially Private Stochastic Gradient Descent) integrate differential privacy into the training process. Model accuracy is reduced to some degree, but memorization risk and vulnerability to inference attacks are substantially reduced.
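The core of DP-SGD can be sketched in a few lines: clip each example's gradient to a maximum norm so that no single record can dominate the update, sum the clipped gradients, add Gaussian noise scaled to that norm, and then average. The version below uses plain Python lists and hand-written gradients for illustration. The parameter names are assumptions, and the privacy accounting that converts the noise level into an (ε, δ) guarantee is deliberately left out.

```python
import math
import random
from typing import List, Sequence


def clip(grad: Sequence[float], max_norm: float) -> List[float]:
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_step(
    weights: Sequence[float],
    per_example_grads: Sequence[Sequence[float]],
    lr: float,
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """One update: clip per example, sum, add Gaussian noise, average."""
    clipped = [clip(grad, max_norm) for grad in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * g / batch_size for w, g in zip(weights, noisy)]


# Invented gradients: the first is large and gets clipped, the second is not.
grads = [[3.0, 4.0], [0.3, 0.4]]
new_weights = dp_sgd_step(
    weights=[0.0, 0.0],
    per_example_grads=grads,
    lr=0.1,
    max_norm=1.0,
    noise_multiplier=1.1,
    rng=random.Random(0),
)
print(new_weights)
```

Clipping is what bounds each individual's influence, and the noise is what hides it. The accuracy cost mentioned above comes from both steps.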

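Federated learning trains models without centralizing raw data.[5] Training happens locally on each device or server, and only model updates (such as gradients or updated weights) are sent for aggregation. The original personal data never leaves the local environment, which reduces privacy exposure, although the updates themselves can still leak some information about the underlying data.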

Google uses federated learning in Gboard (the smartphone keyboard) to improve autocorrect predictions without sending individual typing data to a central server.
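The sketch below illustrates the idea with a toy version of federated averaging: each client fits a one-parameter model y = w·x on its own data, and the server only ever sees the resulting weight and the number of examples behind it. The client data, learning rate, and round count are invented for illustration, and real deployments add client sampling, secure aggregation, and much larger models.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y)


def local_train(w: float, data: List[Example], lr: float = 0.1, epochs: int = 5) -> float:
    """Run a few gradient steps on one client's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(updates: List[Tuple[float, int]]) -> float:
    """Average client weights, weighted by each client's example count."""
    total = sum(count for _, count in updates)
    return sum(weight * count for weight, count in updates) / total


# Each client's raw data stays in its own list (invented data with y = 2x).
clients: List[List[Example]] = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
]

global_w = 0.0
for _ in range(10):
    # Only (weight, example count) pairs are sent to the server.
    updates = [(local_train(global_w, data), len(data)) for data in clients]
    global_w = federated_average(updates)

print(f"learned weight: {global_w:.4f}")
```

The server in this sketch never touches an (x, y) pair, which is the property that reduces exposure compared with pooling the data centrally.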

Anonymization removes or replaces identifying information. However, simple anonymization is often vulnerable to re-identification attacks, in which multiple datasets are combined to identify individuals. Stronger anonymization techniques such as k-anonymity, l-diversity, and t-closeness are recommended where robust protection is required.
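To show what k-anonymity measures, the sketch below groups records by their quasi-identifiers (attributes that are not names but can identify someone in combination) and reports the size of the smallest group. A dataset is k-anonymous when every group has at least k records. The column names and rows are invented for illustration.

```python
from collections import Counter
from typing import List, Sequence


def k_anonymity(rows: List[dict], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group sharing the same quasi-identifiers."""
    groups = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return min(groups.values()) if groups else 0


rows = [
    {"age_band": "30-39", "zip3": "100", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip3": "100", "diagnosis": "diabetes"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "asthma"},
]
k_before = k_anonymity(rows, ["age_band", "zip3"])

# Generalize the age band so the lone 40-49 record is no longer unique.
generalized = [{**row, "age_band": "30-49"} for row in rows]
k_after = k_anonymity(generalized, ["age_band", "zip3"])

print(f"k before: {k_before}, k after generalization: {k_after}")
```

k-anonymity alone does not protect the sensitive column: if everyone in a group shared the same diagnosis, group membership would reveal it. That gap is what l-diversity and t-closeness are designed to address.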


Privacy Evaluation Checklist for AI System Design

When developing or deploying an AI system, conduct a privacy evaluation covering these areas:

Data lifecycle

  • What data is collected, and for what purpose?
  • Are retention periods and deletion policies defined?
  • Are access controls appropriately scoped?

Model risk assessment

  • Does the training data contain personal information?
  • Has the risk of memorization been evaluated?
  • Is there a monitoring process for model outputs?

Transparency to users

  • Is it clearly disclosed what data the AI system uses?
  • Do users have a way to understand and control how their data is used?

When employees use external generative AI tools for work, establish a privacy policy that addresses:[6]

  • Which categories of information must not be included in prompts (customer personal information, special care-required personal information, etc.)
  • A list of approved AI services and the conditions for their use
  • A verification step to check that AI-generated content does not contain personal information before it is shared externally
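Parts of such a policy can be backed by an automated check that runs before a prompt leaves the organization. The sketch below flags prompts sent to services outside an approved list and prompts that match simple patterns for email addresses and Japanese-style phone numbers. The service name, the patterns, and the function name are hypothetical. Pattern matching catches only well-formed identifiers, so it cannot recognize special care-required personal information in free text and does not replace employee training or human review.

```python
import re
from typing import Dict, List, Pattern, Set

# Services employees may use (hypothetical name).
APPROVED_SERVICES: Set[str] = {"internal-llm-gateway"}

# Simple patterns for information that must not appear in prompts (illustrative).
BLOCKED_PATTERNS: Dict[str, Pattern[str]] = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\b0\d{1,4}-\d{1,4}-\d{4}\b"),
}


def check_prompt(service: str, prompt: str) -> List[str]:
    """Return a list of policy findings; an empty list means no issue was found."""
    findings: List[str] = []
    if service not in APPROVED_SERVICES:
        findings.append("unapproved_service")
    for name, pattern in BLOCKED_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(name)
    return findings


clean = check_prompt("internal-llm-gateway", "Summarize the Q3 roadmap.")
flagged = check_prompt(
    "random-chatbot",
    "Draft a reply to hanako@example.com or call 03-1234-5678",
)
print(clean)
print(flagged)
```

The same function could be applied to AI-generated text before it is shared externally, which corresponds to the verification step in the list above.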

The privacy risks of generative AI extend beyond legal compliance to include structural challenges: memorization, inference attacks, contextual integrity violations, and discriminatory profiling.

Addressing these requires a Privacy by Design approach — embedding privacy protection into the architecture of systems from the start, not just meeting minimum legal requirements. Combining technical techniques such as differential privacy, federated learning, and appropriate anonymization with the principles of data minimization, purpose limitation, and transparency builds the foundation for AI systems that users and regulators can trust.

For specific implementation choices or legal judgment, consulting a privacy specialist or information security professional with expertise in AI systems is recommended.

Related topics: Generative AI and Personal Information | AI and Copyright | AI Governance Overview


  1. Cavoukian, Ann, Privacy by Design: The 7 Foundational Principles, Information and Privacy Commissioner of Ontario, 2011
  2. Nissenbaum, Helen, Privacy in Context: Technology, Policy, and the Integrity of Social Life, Stanford University Press, 2010
  3. Carlini, Nicholas et al., Extracting Training Data from Large Language Models, USENIX Security 2021
  4. Dwork, Cynthia, Differential Privacy: A Survey of Results, Theory and Applications of Models of Computation, 2008
  5. McMahan, H. Brendan et al., Communication-Efficient Learning of Deep Networks from Decentralized Data, AISTATS 2017
  6. Personal Information Protection Commission, Notice regarding generative AI services, June 2, 2023