Training Data Preparation


Prerequisite: the basics are covered in Fine-tuning Introduction.

The quality of a fine-tuned model is often determined more by data quality than by the choice of model architecture or training method. “Quality over quantity” is a foundational principle in fine-tuning: a small, well-curated set (say, 1,000 examples) can outperform a much larger noisy one (say, 10,000).

Why Data Quality Comes First

LLM fine-tuning is essentially imitation learning from correct examples. The model learns the patterns present in the provided data. If the data contains contradictions, noise, or misalignment with the goal, the model learns those flaws too.

With a small amount of high-quality data:

The model learns consistent style and behavior
The model is less likely to fit noise or spurious patterns
Iteration cycles for evaluation and improvement become more manageable

Common Data Formats

1. Instruction-following Format

Each example has three fields: instruction, input, and output. This format was popularized by the Alpaca dataset.[1]

{
  "instruction": "Rewrite the following sentence in formal English.",
  "input": "hey can u come to the meeting tmrw",
  "output": "Could you please attend the meeting tomorrow?"
}

When no input is needed, use an empty string:

{
  "instruction": "What is the capital of Japan?",
  "input": "",
  "output": "The capital of Japan is Tokyo."
}
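Before training, each record is usually flattened into a single prompt string plus a target. The sketch below shows one way to do that; the template text and the function name `render_example` are illustrative assumptions modeled on Alpaca-style prompts, not a format required by any particular library.

```python
# Hypothetical prompt templates (assumption): adjust to whatever your training code expects.
TEMPLATE_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)
TEMPLATE_NO_INPUT = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def render_example(record: dict) -> tuple[str, str]:
    """Return (prompt, target) for one instruction-format record."""
    if record.get("input", "").strip():
        prompt = TEMPLATE_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        # An empty "input" field means the Input section is omitted entirely.
        prompt = TEMPLATE_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]


record = {
    "instruction": "What is the capital of Japan?",
    "input": "",
    "output": "The capital of Japan is Tokyo.",
}
prompt, target = render_example(record)
print(prompt + target)
```

Keeping the template in one place makes it easier to apply the same formatting at training time and at inference time.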

2. Chat Format (Messages Array)

Multi-turn conversation data following the OpenAI API format. Used when fine-tuning a ChatGPT-style assistant.[5]

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that answers in clear, concise English."
    },
    {
      "role": "user",
      "content": "How do I remove duplicates from a Python list?"
    },
    {
      "role": "assistant",
      "content": "The simplest way is to use `set()`:\n\n```python\noriginal = [1, 2, 2, 3, 3, 4]\nunique = list(set(original))\nprint(unique)  # [1, 2, 3, 4] (order not guaranteed)\n```\n\nTo preserve insertion order, use `dict.fromkeys()`:\n\n```python\nunique_ordered = list(dict.fromkeys(original))\nprint(unique_ordered)  # [1, 2, 3, 4]\n```"
    }
  ]
}
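A quick structural check can catch malformed chat examples before they reach training. The following is a minimal sketch; the function name `validate_chat_example` and the specific rules (system message first, assistant message last) are assumptions chosen for illustration, so check them against the requirements of the platform you train on.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_example(example: dict) -> list[str]:
    """Return a list of problems found in one chat-format example (empty if none)."""
    problems = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {i}: empty or non-string content")
        if msg.get("role") == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")

    # The example should end with the reply the model is meant to learn.
    if messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How do I remove duplicates from a Python list?"},
        {"role": "assistant", "content": "Use list(dict.fromkeys(items)) to keep order."},
    ]
}
print(validate_chat_example(example))
```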

3. Completion Format

A prompt paired with its continuation. Useful for text continuation tasks or filling in fixed templates.

{
  "prompt": "Product: AI Assistant Pro. Three key features include:",
  "completion": "\n1. 24/7 automated responses\n2. Multilingual support (English, Japanese, Chinese)\n3. API integration with existing systems"
}
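Whichever of the three formats is used, a common storage choice is JSON Lines (one JSON object per line). The sketch below assumes that layout; the helper names and the file name `train.jsonl` are hypothetical, and your training tool may expect something different.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records: list[dict], path: Path) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


examples = [
    {
        "prompt": "Product: AI Assistant Pro. Three key features include:",
        "completion": "\n1. 24/7 automated responses\n2. Multilingual support",
    },
    {
        "instruction": "What is the capital of Japan?",
        "input": "",
        "output": "The capital of Japan is Tokyo.",
    },
]

# Round-trip through a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    write_jsonl(examples, path)
    loaded = read_jsonl(path)

print(len(loaded))
```

In a real dataset every line would use the same format; the two shapes are mixed here only to show that the helpers are format-agnostic.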

Rough Guidelines on Data Volume

The amount of data needed varies by task type and complexity.

Task	Approximate examples needed	Notes
Response style / tone adjustment	100–500	Even small amounts can be effective if quality is high
Specific format compliance	200–1,000	Tasks with clear output formats such as JSON
Domain knowledge adaptation	1,000–10,000	Medical, legal, manufacturing, etc.
Acquiring a new capability	10,000+	When adding an ability from scratch

These are rough starting points. An iterative approach — starting small, evaluating, then adding data — is generally more effective than trying to collect the “right” amount upfront.

Data Sources

Manual Curation

Domain experts or people with relevant knowledge create question-answer pairs by hand. This is the highest-cost approach, but it typically yields the highest quality.

Best for: High-expertise domains (medical, legal, internal regulations), production systems where errors are unacceptable

Existing Public Datasets

Hugging Face Hub hosts many public datasets that can serve as starting points for fine-tuning.[3]

Alpaca: 52,000 instruction-following examples (Stanford)[1]
OpenHermes 2.5: 1,001,551 rows of synthetic instruction / chat data[2]

Note: Always verify the license before commercial use.

Synthetic Data Generation (Using GPT-4 etc.)

Automatically generating Q&A pairs from internal documents using a powerful model is another option. Self-Instruct is a research example where a language model generates new instruction data itself.[4]

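# Conceptual example of synthetic data generation
system_prompt = """
You are a dataset creation expert.
Given the document below, generate Q&A pairs suitable for fine-tuning.
Output a JSON array of objects, each with "instruction" and "output" fields.
"""

document = "(content from internal manuals or domain documents)"

# Chat-style request payload: the system prompt sets the task, the document is the user turn
messages = [
    {"role": "system", "content": system_prompt.strip()},
    {"role": "user", "content": document},
]
# Send `messages` to GPT-4 (or another model) through your provider's API client,
# then parse and validate the returned JSON before adding it to the dataset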
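The reply to a prompt like the one above is free text, so it should be parsed and validated rather than trusted. Here is a minimal sketch of that step; `parse_generated_pairs` and `sample_reply` are hypothetical names, and the sample text is a stand-in written for this example, not real model output.

```python
import json

REQUIRED_FIELDS = ("instruction", "output")


def parse_generated_pairs(raw_reply: str) -> list[dict]:
    """Parse a reply that should be JSON Q&A objects; keep only well-formed ones."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return []  # not valid JSON; the caller may retry or log the reply
    if isinstance(data, dict):
        data = [data]  # tolerate a single object instead of a list
    if not isinstance(data, list):
        return []

    pairs = []
    for item in data:
        if not isinstance(item, dict):
            continue
        # Every required field must be a non-empty string.
        if all(isinstance(item.get(k), str) and item[k].strip() for k in REQUIRED_FIELDS):
            pairs.append({k: item[k].strip() for k in REQUIRED_FIELDS})
    return pairs


# Hypothetical reply text (assumption): the second object has an empty instruction.
sample_reply = (
    '[{"instruction": "What is the refund window?", "output": "30 days."},'
    ' {"instruction": "", "output": "x"}]'
)
print(parse_generated_pairs(sample_reply))
```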

Legal note: OpenAI’s Terms of Use prohibit using Output to develop models that compete with OpenAI.[6] Review the latest terms for each provider before using synthetic data for training.

Data Quality Checklist

Items to verify before using collected data for training:

Diversity: No repetitive patterns; varied expressions and structures are present
Accuracy: Outputs are objectively correct and free of contradictions
Consistency: Consistent style and format across examples for the same task type
Appropriate length: Exclude extremely short (insufficient information) or very long (redundant) samples
Format consistency: For JSON output, key names and structure are uniform
Task relevance: Each example directly relates to the target behavior
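Some checklist items can be partly automated. The sketch below covers two of them under simple assumptions: contradictions are detected only when the normalized prompt text is identical, and format consistency is reduced to comparing key sets. The function names are illustrative.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def find_contradictions(records: list[dict]) -> dict[str, set[str]]:
    """Map each repeated prompt to its distinct outputs when there is more than one."""
    outputs_by_prompt = defaultdict(set)
    for r in records:
        key = normalize(r["instruction"] + " " + r.get("input", ""))
        outputs_by_prompt[key].add(normalize(r["output"]))
    return {k: v for k, v in outputs_by_prompt.items() if len(v) > 1}


def find_key_mismatches(records: list[dict]) -> list[int]:
    """Return indices of records whose key set differs from the first record's."""
    if not records:
        return []
    expected = set(records[0])
    return [i for i, r in enumerate(records) if set(r) != expected]


records = [
    {"instruction": "Capital of Japan?", "input": "", "output": "Tokyo"},
    {"instruction": "capital of japan?", "input": "", "output": "Kyoto"},
    {"instruction": "Q", "output": "A"},  # missing the "input" key
]
print(find_contradictions(records))
print(find_key_mismatches(records))
```

Accuracy and task relevance still need human judgment; checks like these only narrow down where reviewers should look.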

Data Cleaning Steps

graph TD
    A["Collect raw data"] --> B["Deduplication"]
    B --> C["Length filtering"]
    C --> D["Quality scoring"]
    D --> E["Human review (sampled)"]
    E --> F["Train / Validation split"]
    F --> G["Ready for training"]

Deduplication

Remove not just exact duplicates but also near-duplicates and semantically similar samples. Excessive repetition causes the model to overfit to the repeated pattern.
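As a rough illustration of this step, the sketch below removes exact duplicates after normalization and then drops near-duplicates by character-level similarity using the standard library's `difflib.SequenceMatcher`. This is a lexical approximation only: true semantic deduplication would typically compare embeddings, which is not shown here. The 0.9 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def deduplicate(texts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each text; drop exact and near duplicates."""
    kept, kept_norm = [], []
    seen = set()
    for text in texts:
        norm = normalize(text)
        if norm in seen:
            continue  # exact duplicate after normalization
        # Pairwise comparison is O(n^2); acceptable for small datasets only.
        if any(SequenceMatcher(None, norm, k).ratio() >= threshold for k in kept_norm):
            continue  # near duplicate of something already kept
        seen.add(norm)
        kept.append(text)
        kept_norm.append(norm)
    return kept


texts = [
    "How do I reset my password?",
    "how do I  reset my password?",
    "How do I reset my password",
    "What is the refund policy?",
]
print(deduplicate(texts))
```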

Length Filtering

Very short samples (e.g. output under 10 tokens) may contain too little information
Very long samples may exceed the model’s context length limit
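Both length rules can be applied in one pass. The sketch below counts whitespace-separated words as a crude stand-in for tokens; in practice you would count with the target model's own tokenizer. The limits `min_output_tokens=10` and `max_total_tokens=2048` are placeholder assumptions.

```python
def approx_token_count(text: str) -> int:
    # Whitespace split is a rough proxy (assumption); real tokenizers differ.
    return len(text.split())


def filter_by_length(
    records: list[dict],
    min_output_tokens: int = 10,
    max_total_tokens: int = 2048,
) -> list[dict]:
    """Drop records whose output is too short or whose total length is too long."""
    kept = []
    for r in records:
        out_len = approx_token_count(r["output"])
        total = (
            out_len
            + approx_token_count(r["instruction"])
            + approx_token_count(r.get("input", ""))
        )
        if out_len >= min_output_tokens and total <= max_total_tokens:
            kept.append(r)
    return kept


too_short = {"instruction": "Q", "input": "", "output": "Too short."}
fine = {"instruction": "Explain X", "input": "", "output": "word " * 12}
too_long = {"instruction": "Q", "input": "", "output": "word " * 3000}
print(len(filter_by_length([too_short, fine, too_long])))
```

A fixed minimum is not right for every task: short factual answers can be perfectly valid, so tune the threshold to the outputs you actually expect.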

Train / Validation Split

Use 80–90% of the data for training and hold out 10–20% for validation. The validation set, which the model never sees during training, is used to detect overfitting.
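A seeded shuffle makes the split reproducible across runs. This is a minimal sketch using only the standard library; the function name and the default of 10% validation are assumptions.

```python
import random


def train_validation_split(records: list, validation_fraction: float = 0.1, seed: int = 42):
    """Shuffle with a fixed seed, then return (train, validation) lists."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


records = list(range(100))  # stand-in for real examples
train, validation = train_validation_split(records, validation_fraction=0.1)
print(len(train), len(validation))
```

Splitting after deduplication matters: if near-identical examples land on both sides, the validation loss will look better than it should.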

Red Flags in Training Data

The following types of data carry high risk and should be excluded or handled carefully:

Issue	Examples	Action
Personally identifiable information (PII)	Names, email addresses, phone numbers	Remove or anonymize with automated tools
Copyrighted content	Full text of books, papers, or paid content	Verify license or obtain permission
Harmful or discriminatory content	Hate speech, violent content	Filter out — mandatory
Contradictory answers	Same question with different “correct” answers	Unify through quality review
Outdated information	Deprecated APIs, changed specifications	Track last-updated dates and refresh or remove stale examples
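For the PII row, a simple pattern-based pass can serve as a first filter. The sketch below masks email addresses and phone-like digit sequences with regular expressions; the patterns are deliberately loose assumptions and will miss cases (personal names, for example, are not handled), so treat it as a starting point and not as a complete anonymization tool.

```python
import re

# Loose illustrative patterns (assumption): tune or replace for production use.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like sequences with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(redact_pii("Contact jane.doe@example.com or +1 (555) 010-9999."))
```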

Data Source Comparison

Data source	Quality	Cost	Effort	Legal risk
Expert manual creation	Highest	High	High	Low
Existing public datasets	Medium–high	Low	Low	Verify license
Synthetic generation (GPT-4 etc.)	Medium–high	Medium	Medium	Medium (check ToS)
Web scraping	Low–medium	Low	Medium	High (copyright)
Internal logs / history	Medium	Low	Medium	Medium (PII handling needed)

FAQ

Q: Can I use GPT-4 outputs to train another model?

A: OpenAI’s Terms of Use prohibit using Output to develop models that compete with OpenAI.[6] Review the latest terms before proceeding. Anthropic (Claude), Mistral, and other providers have their own terms. For synthetic data generation, it is safest to use models whose terms explicitly permit this use.

Q: How do I know if my data is good enough?

A: Start with a small dataset (100–500 examples), establish a baseline, and measure validation loss and actual task accuracy. Then add more data and observe the improvement. When adding more data stops improving results, shift focus from quantity to quality — review the data for noise, inconsistencies, and coverage gaps.
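One way to make "stops improving" concrete is to record the task metric at each dataset size and look for the first step where the gain falls below a chosen threshold. The sketch below assumes you already have such measurements; the function name, the 0.01 threshold, and the numbers in the usage example are illustrative placeholders, not measured results.

```python
def find_plateau(results: list[tuple[int, float]], min_gain: float = 0.01):
    """Given (num_examples, accuracy) pairs sorted by size, return the size
    after which the next increase in data gained less than min_gain, or None."""
    for (n_prev, acc_prev), (_, acc_next) in zip(results, results[1:]):
        if acc_next - acc_prev < min_gain:
            return n_prev
    return None


# Placeholder values for illustration only (assumption).
measurements = [(100, 0.60), (200, 0.70), (400, 0.705)]
print(find_plateau(measurements))
```

If a plateau is found, that is the signal to stop collecting and start reviewing the existing examples for noise and coverage gaps.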

References

[1] Alpaca: A Strong, Replicable Instruction-Following Model (Stanford’s instruction dataset)
[2] OpenHermes Dataset (high-quality synthetic instruction data)
[3] Hugging Face Datasets Documentation (dataset management library)
[4] Self-Instruct: Aligning Language Models with Self-Generated Instructions (foundational research on synthetic data generation)
[5] OpenAI Fine-tuning Guide (OpenAI API fine-tuning specifications)
[6] OpenAI Terms of Use (OpenAI service terms)
