The Evolution of Data Classification Algorithms
Companies often face regulatory compliance requirements that call for inventorying user-sensitive data and other business data to support audits. They also need to routinely scan for and protect trade secrets. To meet these needs, they must classify different data types and build a data asset inventory.
The most common type of data that needs protection is PII (Personally Identifiable Information) — things like names, phone numbers, addresses, and national ID numbers. PII is the top priority for regulatory compliance. Then there’s commercially sensitive data, such as product pricing, GMV (Gross Merchandise Value), and sales volumes — key metrics that keep a business running.
Traditional Data Classification Methods
Keyword Matching (Low Recall)
- How it works: Match against a simple, hardcoded list of keywords (e.g., “phone,” “mobile,” “address”).
- Strengths: This is the most basic approach. It’s blazing fast and easy to implement.
- Limitations (low recall): It can only catch literal matches. If PII shows up in a non-standard format — say, “Tel:” instead of “phone,” or a phone number written out as words — it’ll miss it entirely. That’s why its recall is low: it misses a lot of PII that’s actually there.
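A minimal sketch of keyword matching (the keyword list and sample sentences are illustrative, not from any real system) makes the recall gap concrete:

```python
# Minimal keyword-matching classifier (illustrative keyword list).
PII_KEYWORDS = {"phone", "mobile", "address", "email"}

def contains_pii_keyword(text: str) -> bool:
    """Flag text if any hardcoded keyword appears (case-insensitive)."""
    lowered = text.lower()
    return any(kw in lowered for kw in PII_KEYWORDS)

print(contains_pii_keyword("Customer phone: 13812345678"))  # True
print(contains_pii_keyword("Tel: 13812345678"))             # False: non-standard wording is missed
```

The second call is exactly the failure mode above: the PII is present, but the literal keyword is not, so the scan walks right past it.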
Regex Validation (High False Positives)
- How it works: Use regular expressions to match specific PII patterns (e.g., an 11-digit number pattern like `^\d{11}$`, or a national ID format).
- Strengths: It can catch more diverse formats than simple keyword matching.
- Limitations (high false positives): It has zero context awareness. For example, the pattern `^\d{11}$` might flag an 11-digit order number or product serial number as a phone number. That’s why its precision is low: it incorrectly labels a lot of non-PII as PII.
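The false-positive problem is easy to demonstrate with the pattern above (the order-number value here is made up for illustration):

```python
import re

# The 11-digit pattern discussed above: it matches ANY standalone
# 11-digit string, with no notion of what the digits mean.
PHONE_PATTERN = re.compile(r"^\d{11}$")

def looks_like_phone(value: str) -> bool:
    return PHONE_PATTERN.match(value) is not None

print(looks_like_phone("13812345678"))  # True: a real phone number
print(looks_like_phone("20240100042"))  # True: an 11-digit order number, i.e. a false positive
```

Both values satisfy the pattern equally well; nothing in the regex can tell a phone number from an order ID.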
Machine Learning Classification Methods
BERT & NER (Boosting Precision)
- How it works: Bring in deep learning models based on the Transformer architecture, like BERT, combined with Named Entity Recognition (NER) tasks.
- Strengths: These models can actually “understand” the context of words within a sentence. Instead of relying purely on patterns, they can figure out whether “Li Hua” is a person’s name or just a common phrase, or that “13812345678” following “my phone number is…” is indeed a phone number.
- Impact: Dramatically improves precision, cutting down the false positives that regex produces.
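A production system would use a fine-tuned BERT NER model here; as a toy stand-in, the sketch below caricatures what such a model learns, by refusing to accept a regex hit as a phone number unless the surrounding context supports it (the cue list is illustrative):

```python
import re

# Toy stand-in for a context-aware model: require supporting context
# before accepting a regex hit as a phone number. A real system would
# learn these cues from data rather than hardcode them.
ELEVEN_DIGITS = re.compile(r"\b\d{11}\b")
PHONE_CONTEXT = ("phone", "mobile", "tel", "call me")

def classify_number(sentence: str) -> str:
    if not ELEVEN_DIGITS.search(sentence):
        return "no-11-digit-number"
    if any(cue in sentence.lower() for cue in PHONE_CONTEXT):
        return "phone-number"   # pattern and context agree
    return "unknown-number"     # pattern alone is not enough evidence

print(classify_number("my phone number is 13812345678"))  # phone-number
print(classify_number("order 20240100042 has shipped"))   # unknown-number
```

The order number that fooled the bare regex is now held back as "unknown," which is the precision gain this section describes, achieved here by the crudest possible means.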
LLM (Expanding Context to Boost Recall)
- How it works: Use Large Language Models (LLMs) to fill in contextual information and improve the model’s classification recall.
- Strengths: LLMs have powerful language generation and comprehension capabilities.
- Impact (boosting recall): They can “expand” or “rewrite” the context around PII-containing sentences, or generate synonyms and variations for specific PII types. By providing more diverse and subtle expressions as training data, they help BERT/NER models learn to recognize non-standard PII expressions that were previously missed. This primarily boosts the system’s recall.
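In production you would prompt an actual LLM for these paraphrases; in the sketch below, fixed templates stand in for the LLM's output so the augmentation step itself is visible:

```python
# LLM-style data augmentation, sketched with fixed templates standing in
# for real LLM-generated paraphrases of a PII-bearing sentence.
VARIATION_TEMPLATES = [
    "my phone number is {number}",
    "Tel: {number}",
    "reach me at {number}",
    "you can text {number} anytime",
]

def augment_phone_examples(number: str) -> list[str]:
    """Generate varied PII-bearing sentences to enrich NER training data."""
    return [t.format(number=number) for t in VARIATION_TEMPLATES]

for example in augment_phone_examples("13812345678"):
    print(example)
```

Feeding variants like `"Tel: 13812345678"` into the BERT/NER training set is what teaches the model the non-standard phrasings that keyword matching missed, which is where the recall gain comes from.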
Transfer Learning Classification Methods
LLM Classification Skills
- How it works: Have large-parameter models perform classification reasoning on long-form text, using Skills to codify and maintain classification prompts.
- Strengths: Makes it easy to deploy and maintain classification strategies. Particularly effective for tricky scenarios — like determining whether a sentence contains employee-sensitive information versus user-sensitive information.
- Impact: Improves classification precision, though this fully black-box approach is heavily dependent on the base model’s reasoning capabilities.
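A Skill of this kind boils down to a versioned prompt template plus response handling; in the sketch below the labels and template are illustrative, and the LLM call itself is left out (any `call_llm`-style client would slot in where the raw response comes from):

```python
# Sketch of a classification "Skill": the prompt is maintained as code,
# and the model's free-form reply is normalized to a fixed label set.
CLASSIFY_PROMPT = """You are a data-classification assistant.
Decide whether the text below contains employee-sensitive information,
user-sensitive information, or neither. Answer with exactly one label:
EMPLOYEE, USER, or NONE.

Text: {text}
Label:"""

VALID_LABELS = {"EMPLOYEE", "USER", "NONE"}

def build_prompt(text: str) -> str:
    return CLASSIFY_PROMPT.format(text=text)

def parse_label(raw_response: str) -> str:
    """Map the model's reply onto the label set; unparseable -> NONE."""
    label = raw_response.strip().upper()
    return label if label in VALID_LABELS else "NONE"

print(parse_label(" user "))   # USER
print(parse_label("maybe?"))   # NONE: unparseable replies fail closed
```

Keeping the prompt and the parser together in one maintained unit is what makes the strategy easy to deploy and update, though the classification itself remains black-box, exactly as noted above.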
Distilled Small Model Classification
- How it works: Classifying all of your data directly with an LLM racks up token costs that don’t pencil out at scale. Instead, use a powerful large model to label a smaller dataset, then train a lightweight model dedicated solely to classification.
- Strengths: Leverages a large model for data labeling to produce a low-cost, high-efficiency small model for day-to-day classification.
- Impact: Strikes a balance between cost and efficiency, and the small model’s decisions are simple enough to interpret and validate statistically.
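The pipeline can be sketched end to end with toy data: the teacher-labeled examples below stand in for output from a real large model, and a bag-of-words counter stands in for the distilled student (a real one would be a small fine-tuned classifier):

```python
from collections import Counter

# Distillation sketch: a (hypothetical) large "teacher" model has labeled
# a small dataset; a tiny bag-of-words "student" is fit on those labels
# and then handles day-to-day classification cheaply.
teacher_labeled = [
    ("my phone number is 13812345678", "PII"),
    ("call me at 13900001111", "PII"),
    ("Q3 GMV reached record highs", "BUSINESS"),
    ("sales volume dropped in Q2", "BUSINESS"),
]

def train_student(examples):
    """Count word frequencies per label: a minimal 'student' model."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model

def predict(model, text):
    words = text.lower().split()
    # Score each label by how many training-word hits the text gets.
    scores = {label: sum(counts[w] for w in words)
              for label, counts in model.items()}
    return max(scores, key=scores.get)

student = train_student(teacher_labeled)
print(predict(student, "phone number on file"))  # PII
print(predict(student, "GMV and sales volume"))  # BUSINESS
```

The expensive teacher runs once, over a small sample; every subsequent record is classified by the cheap student, which is the cost/efficiency trade the section describes.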