Training Data Preparation

About 5 minutes

Prerequisites: Basics covered in Fine-tuning Introduction

How well a fine-tuned model performs often depends more on data quality than on the choice of model architecture or training method. “Quality over quantity” is a foundational principle in fine-tuning. In practice, 1,000 well-curated examples often outperform 10,000 noisy ones.

LLM fine-tuning is essentially imitation learning from correct examples. The model learns the patterns present in the provided data. If the data contains contradictions, noise, or misalignment with the goal, the model learns those flaws too.

With a small amount of high-quality data:

  • The model learns consistent style and behavior
  • The risk of overfitting to noise or spurious patterns decreases
  • Iteration cycles for evaluation and improvement become more manageable

In the instruction format, each example has three fields: instruction, input, and output. This is the format used by the Alpaca dataset.

{
  "instruction": "Rewrite the following sentence in formal English.",
  "input": "hey can u come to the meeting tmrw",
  "output": "Could you please attend the meeting tomorrow?"
}

When no input is needed, use an empty string:

{
  "instruction": "What is the capital of Japan?",
  "input": "",
  "output": "The capital of Japan is Tokyo."
}
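Before training, each record is usually flattened into a single text string. The sketch below shows one way to do that and how an empty input can be skipped. The section headers (`### Instruction:` and so on) and the function name are assumptions chosen for illustration; a given training framework may expect its own template.

```python
def render_instruction_example(record: dict) -> str:
    """Turn one instruction-format record into a single training string."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    output = record["output"].strip()

    parts = [f"### Instruction:\n{instruction}"]
    if context:  # omit the input section entirely when "input" is empty
        parts.append(f"### Input:\n{context}")
    parts.append(f"### Response:\n{output}")
    return "\n\n".join(parts)


with_input = {
    "instruction": "Rewrite the following sentence in formal English.",
    "input": "hey can u come to the meeting tmrw",
    "output": "Could you please attend the meeting tomorrow?",
}
without_input = {
    "instruction": "What is the capital of Japan?",
    "input": "",
    "output": "The capital of Japan is Tokyo.",
}

print(render_instruction_example(with_input))
print("---")
print(render_instruction_example(without_input))
```

Whatever template is chosen, applying the same one at training time and at inference time keeps the model's inputs consistent.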

The chat format stores multi-turn conversation data as a list of messages, following the OpenAI API format. It is used when fine-tuning a ChatGPT-style assistant.[5]

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that answers in clear, concise English."
    },
    {
      "role": "user",
      "content": "How do I remove duplicates from a Python list?"
    },
    {
      "role": "assistant",
      "content": "The simplest way is to use `set()`:\n\n```python\noriginal = [1, 2, 2, 3, 3, 4]\nunique = list(set(original))\nprint(unique)  # [1, 2, 3, 4] (order not guaranteed)\n```\n\nTo preserve insertion order, use `dict.fromkeys()`:\n\n```python\nunique_ordered = list(dict.fromkeys(original))\nprint(unique_ordered)  # [1, 2, 3, 4]\n```"
    }
  ]
}
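Chat-format records are easy to get subtly wrong (a misspelled role, an empty message, a conversation that ends on the user turn). A minimal structural check might look like the sketch below. The rules it enforces, including the allowed role set, are assumptions for a simple assistant dataset and not a statement of any provider's specification; extend them if your data contains other message types.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set


def validate_chat_example(example: dict) -> list[str]:
    """Return a list of problems found in one chat-format record (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How do I remove duplicates from a Python list?"},
        {"role": "assistant", "content": "Use list(dict.fromkeys(items))."},
    ]
}
print(validate_chat_example(example))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a large file in a single pass.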

The completion format pairs a prompt with its continuation. It is useful for text continuation tasks or filling in fixed templates.

{
  "prompt": "Product: AI Assistant Pro. Three key features include:",
  "completion": "\n1. 24/7 automated responses\n2. Multilingual support (English, Japanese, Chinese)\n3. API integration with existing systems"
}
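The three formats carry largely the same information, so data collected in one can often be converted to another. The helper functions below are a hypothetical sketch of such a conversion from instruction-format records; joining instruction and input with a blank line is an assumption, not a fixed convention.

```python
def instruction_to_messages(record: dict) -> dict:
    """Convert an instruction-format record into a chat-format record."""
    user_text = record["instruction"]
    if record.get("input"):
        # Assumed convention: instruction and input separated by a blank line.
        user_text += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": record["output"]},
        ]
    }


def instruction_to_completion(record: dict) -> dict:
    """Convert an instruction-format record into a prompt/completion pair."""
    prompt = instruction_to_messages(record)["messages"][0]["content"]
    return {"prompt": prompt, "completion": record["output"]}


record = {
    "instruction": "Rewrite the following sentence in formal English.",
    "input": "hey can u come to the meeting tmrw",
    "output": "Could you please attend the meeting tomorrow?",
}
print(instruction_to_messages(record))
print(instruction_to_completion(record))
```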

The amount of data needed varies by task type and complexity.

| Task | Approximate examples needed | Notes |
| --- | --- | --- |
| Response style / tone adjustment | 100–500 | Even small amounts can be effective if quality is high |
| Specific format compliance | 200–1,000 | Tasks with clear output formats such as JSON |
| Domain knowledge adaptation | 1,000–10,000 | Medical, legal, manufacturing, etc. |
| Acquiring a new capability | 10,000+ | When adding an ability from scratch |

These are rough starting points. An iterative approach — starting small, evaluating, then adding data — is generally more effective than trying to collect the “right” amount upfront.

In manual creation, domain experts or people with relevant knowledge write question-answer pairs by hand. This is the highest-cost approach, but it typically yields the highest quality.

Best for: High-expertise domains (medical, legal, internal regulations), production systems where errors are unacceptable

Hugging Face Hub hosts many public datasets that can be candidates for fine-tuning workflows.[3]

  • Alpaca: 52,000 instruction-following examples (Stanford)[1]
  • OpenHermes: 1,001,551 rows of synthetic instruction / chat data[2]

Note: Always verify the license before commercial use.

Synthetic Data Generation (Using GPT-4 etc.)

Automatically generating Q&A pairs from internal documents using a powerful model is another option. Self-Instruct is a research example where a language model generates new instruction data itself.[4]

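# Conceptual example of synthetic data generation
system_prompt = """
You are a dataset creation expert.
Given the document below, generate Q&A pairs suitable for fine-tuning.
Output as JSON with "instruction" and "output" fields.
"""

document = "(content from internal manuals or domain documents)"

# Chat-style request payload: the instructions go in the system message,
# the source document in the user message.
messages = [
    {"role": "system", "content": system_prompt.strip()},
    {"role": "user", "content": document},
]
# Send `messages` to the GPT-4 API (or another model) to generate Q&A pairs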

Legal note: OpenAI’s Terms of Use prohibit using Output to develop models that compete with OpenAI.[6] Review the latest terms for each provider before using synthetic data for training.
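Whichever model produces the pairs, its raw response generally needs to be parsed and checked before it joins the dataset, because generated JSON can be malformed or incomplete. The sketch below assumes the model was asked to return a JSON list of objects with "instruction" and "output" fields; the function name and the sample response are hypothetical.

```python
import json


def parse_qa_pairs(raw_response: str) -> list[dict]:
    """Parse a response expected to hold a JSON list of Q&A objects.

    Returns only well-formed pairs; returns [] if the text is not valid JSON.
    """
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):  # tolerate a single object instead of a list
        data = [data]
    if not isinstance(data, list):
        return []

    pairs = []
    for item in data:
        if not isinstance(item, dict):
            continue
        instruction = item.get("instruction")
        output = item.get("output")
        if not (isinstance(instruction, str) and isinstance(output, str)):
            continue
        if instruction.strip() and output.strip():
            pairs.append(
                {"instruction": instruction.strip(), "input": "", "output": output.strip()}
            )
    return pairs


# Made-up response text standing in for what a generator model might return.
sample = (
    '[{"instruction": "What is the return window?", "output": "30 days."},'
    ' {"instruction": "", "output": "orphan answer"}]'
)
print(parse_qa_pairs(sample))
```

Parsing only confirms structure. Whether the generated answers are actually correct still has to be checked against the source document.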

Items to verify before using collected data for training:

  • Diversity: No repetitive patterns; varied expressions and structures are present
  • Accuracy: Outputs are objectively correct and free of contradictions
  • Consistency: Consistent style and format across examples for the same task type
  • Appropriate length: Exclude extremely short (insufficient information) or very long (redundant) samples
  • Format consistency: For JSON output, key names and structure are uniform
  • Task relevance: Each example directly relates to the target behavior
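Some items on this checklist can be partly automated. As a rough sketch, the two hypothetical helpers below cover format consistency (do all records use the same keys?) and one narrow form of contradiction (the same prompt paired with different outputs). Matching prompts by lowercased, stripped text is an assumption; paraphrased questions with conflicting answers would not be caught this way.

```python
from collections import Counter, defaultdict


def check_key_consistency(records: list[dict]) -> Counter:
    """Count how many records use each distinct set of keys."""
    return Counter(tuple(sorted(r.keys())) for r in records)


def find_contradictions(records: list[dict]) -> dict[tuple[str, str], set[str]]:
    """Map each (instruction, input) prompt to its outputs when there is more than one."""
    outputs = defaultdict(set)
    for r in records:
        key = (
            r.get("instruction", "").strip().lower(),
            r.get("input", "").strip().lower(),
        )
        outputs[key].add(r.get("output", "").strip())
    return {k: v for k, v in outputs.items() if len(v) > 1}


records = [
    {"instruction": "What is the capital of Japan?", "input": "", "output": "Tokyo."},
    {"instruction": "what is the capital of Japan? ", "input": "", "output": "Kyoto."},
    {"instruction": "Say hi", "output": "Hi!"},  # missing the "input" key
]

print(check_key_consistency(records))
print(find_contradictions(records))
```

More than one entry in the key counter signals a format inconsistency, and any entry returned by `find_contradictions` is a candidate for human review.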
graph TD
    A["Collect raw data"] --> B["Deduplication"]
    B --> C["Length filtering"]
    C --> D["Quality scoring"]
    D --> E["Human review (sampled)"]
    E --> F["Train / Validation split"]
    F --> G["Ready for training"]

Remove not just exact duplicates but also semantically similar samples. Excessive repetition causes the model to over-learn that pattern.
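A minimal sketch of deduplication follows. Note that it uses word-overlap (Jaccard) similarity as a cheap lexical stand-in, so it catches reworded near-copies but not true semantic duplicates; for those, an embedding-based comparison would be the usual choice. The 0.8 threshold and the function names are assumptions to tune on your own data.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial variants compare equal."""
    return " ".join(re.findall(r"\w+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def record_tokens(record: dict) -> set:
    """Word set over all three fields, so differing inputs or outputs count."""
    text = " ".join(record.get(k, "") for k in ("instruction", "input", "output"))
    return set(normalize(text).split())


def deduplicate(records: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep the first record of each group of exact or near-duplicates."""
    kept, kept_tokens = [], []
    for record in records:
        tokens = record_tokens(record)
        if any(jaccard(tokens, seen) >= threshold for seen in kept_tokens):
            continue  # too similar to something already kept
        kept.append(record)
        kept_tokens.append(tokens)
    return kept


records = [
    {"instruction": "How do I remove duplicates from a Python list?", "input": "",
     "output": "Use list(dict.fromkeys(items))."},
    {"instruction": "how do I remove duplicates from a python list", "input": "",
     "output": "Use list(dict.fromkeys(items))."},
    {"instruction": "What is the capital of Japan?", "input": "",
     "output": "The capital of Japan is Tokyo."},
]
print(len(deduplicate(records)))
```

The pairwise comparison is quadratic in the number of records, which is acceptable for a few thousand examples; larger datasets typically need an approximate method such as MinHash.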

  • Very short samples (e.g. output under 10 tokens) may contain too little information
  • Very long samples may exceed the model’s context length limit
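These two length checks could be sketched as below. The whitespace word count is only a rough stand-in for a real tokenizer, and the upper limit of 2,048 is a hypothetical context budget; in practice, count tokens with the tokenizer of the model being fine-tuned and use that model's actual limit.

```python
def approx_token_count(text: str) -> int:
    """Rough token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def length_filter(
    records: list[dict],
    min_output_tokens: int = 10,   # threshold from the example above
    max_total_tokens: int = 2048,  # hypothetical context budget
) -> list[dict]:
    """Drop records whose output is too short or whose total length is too long."""
    kept = []
    for r in records:
        out_len = approx_token_count(r["output"])
        total = (
            approx_token_count(r["instruction"])
            + approx_token_count(r.get("input", ""))
            + out_len
        )
        if out_len < min_output_tokens or total > max_total_tokens:
            continue
        kept.append(r)
    return kept


too_short = {"instruction": "Is water wet?", "input": "", "output": "Yes."}
fine = {"instruction": "Explain overfitting.", "input": "", "output": " ".join(["word"] * 20)}
too_long = {"instruction": "Summarize.", "input": "", "output": " ".join(["word"] * 5000)}

print(len(length_filter([too_short, fine, too_long])))
```

A minimum length suits explanatory tasks but not every dataset: for short factual answers or classification labels, a low threshold (or none) is more appropriate.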

Reserve 80–90% of the data for training and 10–20% for validation. The validation set, which the model never sees during training, is used to detect overfitting.
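A simple way to make such a split reproducible is to shuffle with a fixed seed and slice, as in the sketch below. The function name, the 10% default, and the seed value are illustrative choices.

```python
import random


def train_validation_split(
    records: list, validation_fraction: float = 0.1, seed: int = 42
) -> tuple[list, list]:
    """Shuffle deterministically and return (train, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


records = [{"id": i} for i in range(100)]
train, validation = train_validation_split(records, validation_fraction=0.1)
print(len(train), len(validation))  # 90 10
```

Running deduplication before the split matters here: if near-identical samples land on both sides, the validation score will look better than it should.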

The following types of data carry high risk and should be excluded or handled carefully:

| Issue | Examples | Action |
| --- | --- | --- |
| Personally identifiable information (PII) | Names, email addresses, phone numbers | Remove or anonymize with automated tools |
| Copyrighted content | Full text of books, papers, or paid content | Verify license or obtain permission |
| Harmful or discriminatory content | Hate speech, violent content | Filter out (mandatory) |
| Contradictory answers | Same question with different “correct” answers | Unify through quality review |
| Outdated information | Deprecated APIs, changed specifications | Manage content with update dates |

| Data source | Quality | Cost | Effort | Legal risk |
| --- | --- | --- | --- | --- |
| Expert manual creation | Highest | High | High | Low |
| Existing public datasets | Medium–high | Low | Low | Verify license |
| Synthetic generation (GPT-4 etc.) | Medium–high | Medium | Medium | Medium (check ToS) |
| Web scraping | Low–medium | Low | Medium | High (copyright) |
| Internal logs / history | Medium | Low | Medium | Medium (PII handling needed) |
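For the PII handling mentioned above, which matters most for internal logs, a first automated pass might mask email addresses and phone numbers with regular expressions. The patterns below are deliberately simple assumptions: they do not detect names at all, and the phone pattern can over-match other digit sequences such as dates. Treat this as a pre-filter ahead of a dedicated PII tool and human review, not a complete solution.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(scrub_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
# Contact [EMAIL] or [PHONE].
```

Replacing values with typed placeholders, instead of deleting them, keeps sentences grammatical so the surrounding text remains usable as training data.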

Q: Can I use GPT-4 outputs to train another model?

A: OpenAI’s Terms of Use prohibit using Output to develop models that compete with OpenAI.[6] Review the latest terms before proceeding. Anthropic (Claude), Mistral, and other providers have their own terms. For synthetic data generation, it is safest to use models whose terms explicitly permit this use.

Q: How do I know if my data is good enough?

A: Start with a small dataset (100–500 examples), establish a baseline, and measure validation loss and actual task accuracy. Then add more data and observe the improvement. When adding more data stops improving results, shift focus from quantity to quality — review the data for noise, inconsistencies, and coverage gaps.
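The "add data until it stops helping" rule can be made concrete by recording a validation score at each dataset size and looking for the first step whose gain falls below a threshold. The sketch below assumes a score where higher is better; the 0.01 minimum gain and the example numbers are made up for illustration and are not measured results.

```python
def first_plateau(results: list[tuple[int, float]], min_gain: float = 0.01):
    """Return the dataset size after which adding data stopped helping, or None.

    results: (num_examples, validation_score) pairs, where higher is better.
    """
    ordered = sorted(results)
    for (prev_n, prev_score), (_, next_score) in zip(ordered, ordered[1:]):
        if next_score - prev_score < min_gain:
            return prev_n
    return None


# Hypothetical scores from four training runs at increasing dataset sizes.
runs = [(100, 0.62), (250, 0.71), (500, 0.78), (1000, 0.785)]
print(first_plateau(runs))  # 500
```

A single noisy run can trigger or hide a plateau, so repeating runs with different seeds before drawing a conclusion is a sensible precaution.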

  1. Alpaca: A Strong, Replicable Instruction-Following Model — Stanford’s instruction dataset
  2. OpenHermes Dataset — High-quality synthetic instruction data
  3. Hugging Face Datasets Documentation — Dataset management library
  4. Self-Instruct: Aligning Language Models with Self-Generated Instructions — Foundational research on synthetic data generation
  5. OpenAI Fine-tuning Guide — OpenAI API fine-tuning specifications
  6. OpenAI Terms of Use — OpenAI service terms