The quality of a fine-tuned model is often determined more by data quality than by the choice of model architecture or training method. “Quality over quantity” is a foundational principle in fine-tuning. In practice, 1,000 well-curated examples often outperform 10,000 noisy ones.
Why Data Quality Comes First
LLM fine-tuning is essentially imitation learning from correct examples. The model learns the patterns present in the provided data. If the data contains contradictions, noise, or misalignment with the goal, the model learns those flaws too.
With a small amount of high-quality data:
- The model learns consistent style and behavior
- The risk of the model fitting noise or spurious patterns decreases
- Iteration cycles for evaluation and improvement become more manageable
Common Data Formats
1. Instruction-following Format
Structured as three fields: instruction, input, and output. This is the format widely used in the Alpaca dataset.
{
"instruction": "Rewrite the following sentence in formal English.",
"input": "hey can u come to the meeting tmrw",
"output": "Could you please attend the meeting tomorrow?"
}
When no input is needed, use an empty string:
{
"instruction": "What is the capital of Japan?",
"input": "",
"output": "The capital of Japan is Tokyo."
}
2. Chat Format (Messages Array)
Multi-turn conversation data following the OpenAI API format. Used when fine-tuning a ChatGPT-style assistant.[5]
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant that answers in clear, concise English."
},
{
"role": "user",
"content": "How do I remove duplicates from a Python list?"
},
{
"role": "assistant",
"content": "The simplest way is to use `set()`:\n\n```python\noriginal = [1, 2, 2, 3, 3, 4]\nunique = list(set(original))\nprint(unique) # [1, 2, 3, 4] (order not guaranteed)\n```\n\nTo preserve insertion order, use `dict.fromkeys()`:\n\n```python\nunique_ordered = list(dict.fromkeys(original))\nprint(unique_ordered) # [1, 2, 3, 4]\n```"
}
]
}
3. Completion Format
A prompt paired with its continuation. Useful for text continuation tasks or filling in fixed templates.
{
"prompt": "Product: AI Assistant Pro. Three key features include:",
"completion": "\n1. 24/7 automated responses\n2. Multilingual support (English, Japanese, Chinese)\n3. API integration with existing systems"
}
Rough Guidelines on Data Volume
The amount of data needed varies by task type and complexity.
| Task | Approximate examples needed | Notes |
|---|---|---|
| Response style / tone adjustment | 100–500 | Even small amounts can be effective if quality is high |
| Specific format compliance | 200–1,000 | Tasks with clear output formats such as JSON |
| Domain knowledge adaptation | 1,000–10,000 | Medical, legal, manufacturing, etc. |
| Acquiring a new capability | 10,000+ | When adding an ability from scratch |
These are rough starting points. An iterative approach — starting small, evaluating, then adding data — is generally more effective than trying to collect the “right” amount upfront.
Data Sources
Manual Curation
Domain experts or people with relevant knowledge create question-answer pairs by hand. This is the highest-cost approach, but it typically yields the highest quality.
Best for: High-expertise domains (medical, legal, internal regulations), production systems where errors are unacceptable
Existing Public Datasets
Hugging Face Hub hosts many public datasets that can serve as candidates for fine-tuning data.[3]
- Alpaca: 52,000 instruction-following examples (Stanford)[1]
- OpenHermes 2.5: 1,001,551 rows of synthetic instruction / chat data[2]
Note: Always verify the license before commercial use.
Synthetic Data Generation (Using GPT-4 etc.)
Automatically generating Q&A pairs from internal documents using a powerful model is another option. Self-Instruct is a research example where a language model generates new instruction data itself.[4]
# Conceptual example of synthetic data generation
system_prompt = """
You are a dataset creation expert.
Given the document below, generate Q&A pairs suitable for fine-tuning.
Output as JSON with "instruction" and "output" fields.
"""
document = "(content from internal manuals or domain documents)"
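# Pair the system prompt with the document as chat messages
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": document},
]
# Send `messages` to the GPT-4 API to generate Q&A pairs (API call omitted)
Legal note: OpenAI’s Terms of Use prohibit using Output to develop models that compete with OpenAI.[6] Review the latest terms for each provider before using synthetic data for training.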
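Generated pairs rarely arrive perfectly formed, so the response from the generation step above usually needs parsing and filtering before it becomes training data. The following is a minimal sketch, assuming the model was asked to return a JSON list of objects with "instruction" and "output" fields. The function name `parse_generated_pairs` and the hard-coded sample response are hypothetical and stand in for a real API call.

```python
import json


def parse_generated_pairs(raw_text):
    """Parse a model response expected to hold a JSON list of Q&A objects.

    Returns only well-formed records in instruction format.
    Unparseable output yields an empty list.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return []
    if isinstance(data, dict):
        data = [data]  # tolerate a single object instead of a list
    if not isinstance(data, list):
        return []

    records = []
    for item in data:
        if not isinstance(item, dict):
            continue
        instruction = item.get("instruction")
        output = item.get("output")
        # Keep only pairs where both fields are non-empty strings
        if not (isinstance(instruction, str) and isinstance(output, str)):
            continue
        if not instruction.strip() or not output.strip():
            continue
        records.append(
            {"instruction": instruction.strip(), "input": "", "output": output.strip()}
        )
    return records


# Hypothetical model response, hard-coded so the sketch runs without an API call
sample_response = (
    '[{"instruction": "What is the return policy?", '
    '"output": "Items can be returned within 30 days."}, '
    '{"instruction": "", "output": "orphan answer"}]'
)
pairs = parse_generated_pairs(sample_response)
print(len(pairs), "usable pair(s)")
```

Dropping malformed items silently is a design choice for this sketch. In a real pipeline it may be more useful to log them, since a high rejection rate often points to a prompt that needs tightening.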
Data Quality Checklist
Items to verify before using collected data for training:
- Diversity: No repetitive patterns; varied expressions and structures are present
- Accuracy: Outputs are objectively correct and free of contradictions
- Consistency: Consistent style and format across examples for the same task type
- Appropriate length: Exclude extremely short (insufficient information) or very long (redundant) samples
- Format consistency: For JSON output, key names and structure are uniform
- Task relevance: Each example directly relates to the target behavior
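Some of the checklist items above can be partly automated. The sketch below covers only the mechanical ones (format consistency and length) for instruction-format records. Accuracy, diversity, and task relevance still need human or model-assisted review. The function name `check_record` and the character thresholds are illustrative assumptions, not recommended values.

```python
REQUIRED_KEYS = {"instruction", "input", "output"}


def check_record(record, min_output_chars=10, max_output_chars=4000):
    """Return a list of problems found in one instruction-format record."""
    problems = []
    # Format consistency: exactly the expected keys, all string values
    if set(record) != REQUIRED_KEYS:
        return ["unexpected or missing keys"]
    if not all(isinstance(record[key], str) for key in REQUIRED_KEYS):
        return ["non-string field"]
    if not record["instruction"].strip():
        problems.append("empty instruction")
    # Appropriate length: flag extremely short or very long outputs
    output_length = len(record["output"].strip())
    if output_length < min_output_chars:
        problems.append("output too short")
    elif output_length > max_output_chars:
        problems.append("output too long")
    return problems


records = [
    {"instruction": "What is the capital of Japan?", "input": "",
     "output": "The capital of Japan is Tokyo."},
    {"instruction": "Summarize the text.", "input": "Some text.", "output": "OK"},
    {"prompt": "Wrong schema", "completion": "Different format"},
]
for index, record in enumerate(records):
    issues = check_record(record)
    if issues:
        print(f"record {index}: {', '.join(issues)}")
```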
Data Cleaning Steps
graph TD
A["Collect raw data"] --> B["Deduplication"]
B --> C["Length filtering"]
C --> D["Quality scoring"]
D --> E["Human review (sampled)"]
E --> F["Train / Validation split"]
F --> G["Ready for training"]
Deduplication
Remove not just exact duplicates but also semantically similar samples. Excessive repetition causes the model to over-learn that pattern.
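As a rough illustration, the sketch below removes exact duplicates (after normalizing case and whitespace) and near-duplicates. Note the substitution: it uses word-level Jaccard overlap, which is a lexical measure and only a stand-in for semantic similarity. Catching paraphrases with different wording would need an embedding-based comparison instead. The function names and the 0.8 threshold are assumptions chosen for the example.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def jaccard(a, b):
    """Word-level Jaccard similarity between two normalized strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(records, threshold=0.8):
    """Keep the first record of each group of exact or near-duplicate prompts."""
    kept, kept_keys = [], []
    for record in records:
        key = normalize(record["instruction"] + " " + record["input"])
        # Exact duplicates score 1.0, so one comparison covers both cases
        if any(jaccard(key, other) >= threshold for other in kept_keys):
            continue
        kept.append(record)
        kept_keys.append(key)
    return kept


records = [
    {"instruction": "What is the capital of Japan?", "input": "", "output": "Tokyo."},
    {"instruction": "what is the  capital of japan?", "input": "", "output": "Tokyo"},
    {"instruction": "What is the capital city of Japan?", "input": "", "output": "Tokyo."},
    {"instruction": "Rewrite the following sentence in formal English.",
     "input": "hey can u come to the meeting tmrw",
     "output": "Could you please attend the meeting tomorrow?"},
]
unique_records = deduplicate(records)
print(len(unique_records), "of", len(records), "records kept")
```

The pairwise comparison is quadratic in the number of records, which is fine for a few thousand examples. Larger datasets typically call for hashing-based approaches such as MinHash.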
Length Filtering
- Very short samples (e.g. output under 10 tokens) may contain too little information
- Very long samples may exceed the model’s context length limit
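The two rules above can be sketched as a simple filter. This is an approximation: it counts whitespace-separated words rather than real tokens, so in practice the count should come from the target model's own tokenizer. The thresholds (10 output tokens, 2,048 total tokens) and the function names are assumptions for illustration and should be tuned per task and model.

```python
def approx_token_count(text):
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def length_filter(records, min_output_tokens=10, max_total_tokens=2048):
    """Drop records whose output is too short or whose total length is too long."""
    kept = []
    for record in records:
        output_tokens = approx_token_count(record["output"])
        total_tokens = (
            approx_token_count(record["instruction"])
            + approx_token_count(record["input"])
            + output_tokens
        )
        if output_tokens < min_output_tokens:
            continue  # too little information in the answer
        if total_tokens > max_total_tokens:
            continue  # likely to exceed the context length limit
        kept.append(record)
    return kept


too_short = {"instruction": "Is water wet?", "input": "", "output": "Yes."}
acceptable = {
    "instruction": "Explain overfitting.",
    "input": "",
    "output": "Overfitting happens when a model memorizes its training data "
              "and fails to generalize to new inputs.",
}
too_long = {"instruction": "Repeat a word.", "input": "", "output": "word " * 3000}

filtered = length_filter([too_short, acceptable, too_long])
print(len(filtered), "record(s) kept")
```

A fixed minimum is not appropriate for every task. Short factual answers or classification labels are legitimately brief, so the lower bound should reflect what a good output looks like for the target behavior.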
Train / Validation Split
Reserve 80–90% of the data for training and 10–20% for validation. The validation set, which the model never sees during training, is used to detect overfitting.
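A minimal sketch of such a split, using only the standard library, might look like the following. The function name, the default 10% validation fraction, and the fixed seed are assumptions for the example. A fixed seed makes the split reproducible, so later evaluation runs compare against the same validation set.

```python
import random


def train_validation_split(records, validation_fraction=0.1, seed=42):
    """Shuffle records reproducibly and split them into (train, validation)."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Always hold out at least one example for validation
    validation_size = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[validation_size:], shuffled[:validation_size]


records = [{"id": i} for i in range(100)]
train, validation = train_validation_split(records, validation_fraction=0.1)
print(len(train), "training /", len(validation), "validation")
```

Splitting should come after deduplication. Otherwise near-identical examples can land on both sides of the split, and the validation loss will look better than it really is.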
Red Flags in Training Data
The following types of data carry high risk and should be excluded or handled carefully:
| Issue | Examples | Action |
|---|---|---|
| Personally identifiable information (PII) | Names, email addresses, phone numbers | Remove or anonymize with automated tools |
| Copyrighted content | Full text of books, papers, or paid content | Verify license or obtain permission |
| Harmful or discriminatory content | Hate speech, violent content | Filter out — mandatory |
| Contradictory answers | Same question with different “correct” answers | Unify through quality review |
| Outdated information | Deprecated APIs, changed specifications | Manage content with update dates |
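For the PII row in the table above, a first automated pass can be sketched with regular expressions. This is a deliberately crude illustration: the two patterns and the `redact_pii` helper are assumptions, they cover only email addresses and phone-like digit sequences, and they cannot detect names. The phone pattern may also match other long digit runs such as timestamps, so production use generally calls for a dedicated PII detection tool plus sampled human review.

```python
import re

# Simplified patterns for illustration only, not exhaustive validators
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d(?!\w)")


def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


sample = "Contact jane.doe@example.com or call +1 (555) 010-9999."
print(redact_pii(sample))
```

Replacing matches with placeholders rather than deleting them keeps sentences grammatical, which matters when the redacted text is used as training data.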
Data Source Comparison
| Data source | Quality | Cost | Effort | Legal risk |
|---|---|---|---|---|
| Expert manual creation | Highest | High | High | Low |
| Existing public datasets | Medium–high | Low | Low | Verify license |
| Synthetic generation (GPT-4 etc.) | Medium–high | Medium | Medium | Medium (check ToS) |
| Web scraping | Low–medium | Low | Medium | High (copyright) |
| Internal logs / history | Medium | Low | Medium | Medium (PII handling needed) |
Q: Can I use GPT-4 outputs to train another model?
A: OpenAI’s Terms of Use prohibit using Output to develop models that compete with OpenAI.[6] Review the latest terms before proceeding. Anthropic (Claude), Mistral, and other providers have their own terms. For synthetic data generation, it is safest to use models whose terms explicitly permit this use.
Q: How do I know if my data is good enough?
A: Start with a small dataset (100–500 examples), establish a baseline, and measure validation loss and actual task accuracy. Then add more data and observe the improvement. When adding more data stops improving results, shift focus from quantity to quality — review the data for noise, inconsistencies, and coverage gaps.
References
[1] Alpaca: A Strong, Replicable Instruction-Following Model (Stanford’s instruction dataset)
[2] OpenHermes Dataset (high-quality synthetic instruction data)
[3] Hugging Face Datasets Documentation (dataset management library)
[4] Self-Instruct: Aligning Language Models with Self-Generated Instructions (foundational research on synthetic data generation)
[5] OpenAI Fine-tuning Guide (OpenAI API fine-tuning specifications)
[6] OpenAI Terms of Use (OpenAI service terms)