Why Data Cleaning Is the Highest-Leverage LLM Investment
There is a persistent misconception in enterprise AI projects: that model selection and prompt engineering are the primary levers for output quality. They are not. The dominant factor is almost always the quality of the training data.
The intuition is straightforward. An LLM learns statistical patterns from examples. If those examples are duplicated, the model overfits those patterns. If they contain noise, the model learns the noise. If they contain contradictions, the model learns to be inconsistent. If they contain PII, the model can memorize and regurgitate that sensitive information.
Published research on this is unambiguous. The Dolma dataset paper demonstrated that targeted quality filtering of a 3T token corpus — removing roughly 30% of the data — improved downstream benchmark performance more than scaling the model size by 2×. The FineWeb paper from Hugging Face showed that aggressive deduplication of Common Crawl improved LLM accuracy on MMLU by 4.2 percentage points with no other changes.
Key insight: You will get better results from 10K high-quality fine-tuning examples than from 100K noisy ones. The goal is not volume — it is signal density.
Step 1 — Ingestion and Parsing
Before any cleaning can happen, you need to extract plain text from whatever format your data lives in. Enterprise data is heterogeneous: PDFs, DOCX files, email archives (PST/MBOX), SQL tables, SharePoint exports, Confluence XML dumps, and raw log files are all common sources.
PDF Extraction
PDF text extraction is deceptively difficult. PDFs have no native concept of reading order — the text layer is a soup of positioned text fragments. Use pdfplumber over PyPDF2 for better layout reconstruction, especially for multi-column documents. For scanned PDFs, you need OCR (Tesseract 5.x or a commercial alternative). Always validate extraction quality on a sample before processing the full corpus.
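A minimal extraction sketch using pdfplumber. The `needs_ocr` heuristic and its 200-characters-per-page threshold are illustrative assumptions for spotting scanned PDFs, not a standard:

```python
def extract_pdf_text(path: str) -> str:
    """Extract text page by page; pdfplumber reconstructs reading order."""
    import pdfplumber  # pip install pdfplumber
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            pages.append(page.extract_text() or "")
    return "\n\n".join(pages)

def needs_ocr(text: str, min_chars_per_page: int = 200, n_pages: int = 1) -> bool:
    """Heuristic: a text layer this sparse usually means a scanned PDF."""
    return len(text.strip()) < min_chars_per_page * n_pages
```

Run `needs_ocr` on the extracted text of a sample before deciding whether the corpus needs a Tesseract pass.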
Encoding Normalization
Always enforce UTF-8 at ingestion time. Use chardet to detect encoding on files that aren't declared. Replace Windows-1252 curly quotes, em-dashes, and other common encoding artifacts with their Unicode equivalents before any downstream processing — these artifacts cause tokenization inconsistencies.
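A sketch of that repair step. The mapping below covers the most common Windows-1252 punctuation that leaks into nominally Latin-1/UTF-8 text; extend it as your corpus demands:

```python
# cp1252 "smart punctuation" that survives a wrong Latin-1 decode
CP1252_FIXES = {
    "\x91": "\u2018", "\x92": "\u2019",  # single curly quotes
    "\x93": "\u201c", "\x94": "\u201d",  # double curly quotes
    "\x96": "\u2013", "\x97": "\u2014",  # en dash, em dash
    "\x85": "\u2026",                    # ellipsis
}

def normalize_encoding(text: str) -> str:
    """Replace stray cp1252 control-range characters with their Unicode equivalents."""
    for bad, good in CP1252_FIXES.items():
        text = text.replace(bad, good)
    return text

def read_as_utf8(path: str) -> str:
    """Sniff the encoding with chardet, then decode and repair."""
    import chardet  # pip install chardet
    raw = open(path, "rb").read()
    enc = chardet.detect(raw)["encoding"] or "utf-8"
    return normalize_encoding(raw.decode(enc, errors="replace"))
```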
Step 2 — Exact and Near-Duplicate Removal
Duplicate content in training data causes two problems: it wastes token budget (and compute during training) and it causes the model to overweight those examples, degrading generalization. Deduplication should always happen before quality filtering — there's no point scoring a document you're going to remove anyway.
Exact Deduplication
Hash the normalized text content of each document. SHA-256 is standard. Store hashes in a set and drop any document whose hash already exists. This handles verbatim copies and is O(n) in both time and memory.
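In code, the whole step is a few lines. The strip-and-lowercase normalization before hashing is an assumption — use whatever normalization you applied at ingestion:

```python
import hashlib

def exact_dedup(docs):
    """Drop documents whose normalized text has already been seen."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)  # keep the first occurrence verbatim
    return unique
```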
Near-Duplicate Removal with MinHash LSH
Exact deduplication misses paraphrases, reformatted versions, and documents that share 90% of their content. For these, use MinHash with Locality-Sensitive Hashing (LSH). The datasketch library provides a production-ready implementation.
Configure your Jaccard similarity threshold based on your use case. For instruction-tuning datasets, 0.80 is a reasonable default — it catches near-copies while preserving legitimate topic repetition. For RAG corpora, you may want a tighter threshold of 0.90 to preserve topically similar but distinct documents.
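A sketch with datasketch's `MinHash` and `MinHashLSH`. The choice of word 5-grams as the shingle unit is an assumption; character shingles also work:

```python
def shingles(text, k=5):
    """Word k-grams ('shingles') as the unit of Jaccard similarity."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def near_dedup(docs, threshold=0.80, num_perm=128):
    """Keep a document only if no near-duplicate has already been kept."""
    from datasketch import MinHash, MinHashLSH  # pip install datasketch
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        m = MinHash(num_perm=num_perm)
        for s in shingles(doc):
            m.update(s.encode("utf-8"))
        if not lsh.query(m):  # no collision with anything kept so far
            lsh.insert(str(i), m)
            kept.append(doc)
    return kept
```

Raising `threshold` from 0.80 to 0.90 is exactly the RAG-corpus adjustment described above.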
In practice, near-duplicate removal typically eliminates 15–40% of enterprise corpora after exact deduplication. This is expected and healthy — it means your original data had significant redundancy.
Step 3 — Quality Filtering
Quality filtering removes documents that don't contain useful signal for your target task. The exact filters depend on your use case, but the following apply broadly to most enterprise fine-tuning projects.
Length Filtering
Remove documents with fewer than 100 tokens (not enough context to be useful) or more than your model's context window (needs chunking, not filtering). Also remove documents where the ratio of alphabetic characters to total characters is below 0.6 — this catches log files, encoded blobs, and heavily formatted tables that don't contain natural language.
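Both checks fit in a small predicate. Note the token count should come from your model's actual tokenizer, so it is passed in rather than approximated here:

```python
def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic."""
    return sum(c.isalpha() for c in text) / max(len(text), 1)

def passes_length_filter(text, n_tokens, ctx_window,
                         min_tokens=100, min_alpha=0.6):
    """n_tokens must be computed with the target model's tokenizer."""
    if n_tokens < min_tokens or n_tokens > ctx_window:
        return False
    return alpha_ratio(text) >= min_alpha
```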
Perplexity-Based Filtering
Train a small n-gram language model (KenLM 5-gram is the standard) on a representative sample of your target domain. Score each document. Documents with very low perplexity are likely boilerplate — legal disclaimers, email signatures, repeated headers. Documents with very high perplexity are likely garbled OCR output or encoding artifacts. Filter both tails.
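A sketch of the two halves: scoring with the KenLM Python bindings, and a percentile-based tail filter. The 5%/95% cutoffs are illustrative defaults — tune them by inspecting documents near each boundary:

```python
def score_perplexity(texts, model_path):
    """Score each document with a pre-trained KenLM n-gram model."""
    import kenlm  # requires the KenLM Python bindings
    model = kenlm.Model(model_path)
    return [model.perplexity(t) for t in texts]

def filter_tails(docs, ppls, low_pct=0.05, high_pct=0.95):
    """Drop both perplexity tails: boilerplate (low) and garbage (high)."""
    ranked = sorted(ppls)
    lo = ranked[int(low_pct * (len(ranked) - 1))]
    hi = ranked[int(high_pct * (len(ranked) - 1))]
    return [d for d, p in zip(docs, ppls) if lo <= p <= hi]
```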
Language Detection
Use fastText's language identification model to filter to your target language(s). This is especially important for datasets sourced from internal knowledge bases and email archives, which routinely contain fragments of other languages that degrade fine-tuned models intended for a single-language use case.
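A sketch of the fastText pass. The 0.80 confidence floor is an assumption; fastText requires single-line input, hence the newline replacement:

```python
def parse_label(raw_label: str) -> str:
    """fastText returns labels like '__label__en'; strip the prefix."""
    return raw_label.replace("__label__", "")

def detect_languages(texts, model_path="lid.176.bin", min_conf=0.80):
    """Return (language, confidence, keep?) per document."""
    import fasttext  # pip install fasttext
    model = fasttext.load_model(model_path)
    out = []
    for t in texts:
        labels, probs = model.predict(t.replace("\n", " "), k=1)
        lang, conf = parse_label(labels[0]), float(probs[0])
        out.append((lang, conf, lang == "en" and conf >= min_conf))
    return out
```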
Important: Log everything you filter and why. You will need this audit trail to explain model behavior and to revisit filtering decisions when performance is unexpectedly poor on certain input types.
Step 4 — PII Detection and Redaction
This step is non-negotiable for enterprise data. Training on data containing unredacted PII creates three distinct risks: regulatory liability (HIPAA, GDPR, CCPA violations), model memorization (the trained model can regurgitate real PII at inference time), and data breach liability (the training dataset itself becomes a compliance risk).
What to Detect
At minimum, your PII detection pipeline must cover: full names, email addresses, phone numbers, physical addresses, Social Security Numbers, passport and driver's license numbers, credit card and bank account numbers, IP addresses, dates of birth, and medical record identifiers. For healthcare data, extend this to ICD codes associated with specific patients and all 18 HIPAA Safe Harbor identifiers.
Implementation with spaCy + Custom Rules
spaCy's NER models provide a strong baseline for person, organization, and location detection. Augment them with regex patterns for structured identifiers (SSNs, phone numbers, credit cards) and a custom entity ruler for domain-specific patterns.
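A sketch of that layered approach. The regex patterns below are simplified (US-style SSNs and phone numbers only) and would need hardening for production:

```python
import re

# Structured identifiers go to regexes; spaCy NER handles names/orgs/places.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_structured(text: str) -> str:
    """Replace structured identifiers with semantic placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def redact_entities(text: str) -> str:
    """Layer NER-based redaction of names/orgs/places on top of the regex pass."""
    import spacy  # python -m spacy download en_core_web_trf
    nlp = spacy.load("en_core_web_trf")
    doc = nlp(redact_structured(text))
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "ORG", "GPE"}:
            out.append(doc.text[last:ent.start_char])
            out.append(f"[{ent.label_}]")
            last = ent.end_char
    out.append(doc.text[last:])
    return "".join(out)
```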
Replace detected entities with semantically consistent synthetic tokens rather than blank redactions. [PERSON] is better than ████ because it preserves grammatical structure and helps the model learn that person names are valid inputs — just not these specific ones.
Step 5 — Normalization
Normalization makes your dataset consistent. Inconsistent formatting teaches the model inconsistent patterns, which surfaces as unpredictable output format at inference time.
Apply the following in order: (1) Unicode normalization to NFC form, (2) whitespace normalization — collapse multiple spaces, standardize line endings to \n, strip leading/trailing whitespace per paragraph, (3) remove zero-width characters and other invisible Unicode, (4) standardize quotation marks to straight quotes unless your task specifically involves typographic quotes, (5) strip HTML/XML tags if your source data contains them.
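The five steps above, in order, as one function (the zero-width character list is a common subset, not exhaustive; HTML stripping is omitted since it is source-dependent):

```python
import re
import unicodedata

ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"), None)
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'",
                        "\u201c": '"', "\u201d": '"'})

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)              # (1) NFC form
    text = text.translate(ZERO_WIDTH)                      # (3) invisible chars
    text = text.translate(QUOTES)                          # (4) straight quotes
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # (2) line endings
    # (2) collapse runs of spaces/tabs and strip each paragraph
    paragraphs = [re.sub(r"[ \t]+", " ", p).strip() for p in text.split("\n")]
    return "\n".join(paragraphs)
```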
For instruction-tuning datasets, also normalize the instruction format to your target template at this stage. Decide on your system prompt convention and apply it consistently — mixing Alpaca format with ChatML format in the same dataset causes format confusion during fine-tuning.
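For example, converting everything to one chat-messages convention might look like this. The `instruction`/`input`/`output` field names follow the Alpaca convention; adapt them to your records:

```python
def alpaca_to_messages(rec, system_prompt="You are a helpful assistant."):
    """Convert one Alpaca-style record to a single chat-messages format."""
    user = rec["instruction"]
    if rec.get("input"):  # Alpaca's optional context field
        user += "\n\n" + rec["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user},
            {"role": "assistant", "content": rec["output"]},
        ]
    }
```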
Step 6 — Schema Structuring and Validation
The final stage converts cleaned plain text into the structured format your fine-tuning framework requires. This is where a JSONL record is born.
After generating all records, run a validation pass using the actual tokenizer and data loader for your target framework. Load a 1% sample through Axolotl, LLaMA-Factory, or Hugging Face datasets.load_dataset() and confirm zero errors before processing the full corpus. A single malformed record can silently truncate a training run.
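A sketch of both halves — per-record validation while writing, then a loader round-trip. The chat-messages schema assumed by `is_valid_record` is one convention; match it to your framework's expected format:

```python
import json

def is_valid_record(rec) -> bool:
    """Check one record against an assumed chat-messages schema."""
    msgs = rec.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return False
    return all(
        m.get("role") in {"system", "user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in msgs
    )

def write_jsonl(records, path):
    """Write only valid records; return the reject count for the audit log."""
    rejected = 0
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            if not is_valid_record(rec):
                rejected += 1
                continue
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return rejected

def smoke_test(path, sample_size=100):
    """Round-trip a sample through the actual loader before the full run."""
    from datasets import load_dataset  # pip install datasets
    ds = load_dataset("json", data_files=path, split="train")
    return ds.select(range(min(sample_size, len(ds))))
```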
Recommended Open-Source Tooling
- text-dedup — production-grade deduplication (MinHash, SimHash, suffix array) for large corpora
- datasketch — MinHash LSH implementation, well-maintained
- spaCy + en_core_web_trf — transformer-based NER for PII detection
- presidio (Microsoft) — purpose-built PII detection and anonymization
- pdfplumber — best open-source PDF text extraction
- KenLM — fast n-gram language model for perplexity scoring
- fastText — language identification (lid.176.bin model)
- datatrove (Hugging Face) — full pipeline toolkit used to build FineWeb
The full pipeline described here is exactly what VaultData runs on-premise inside your infrastructure. If you'd rather not build and maintain it yourself, request a free data audit and we'll assess your data sources and deliver a cleaned sample dataset within 48 hours.