Look What You Made AI Do: Data Classification with LLMs
I’m a Swiftie and a Techie, so of course, I used Taylor Swift’s discography to explore how LLMs can be leveraged to classify data. I’ll get into that soon enough, but first, let me explain why I’m excited about data classification, what it is, and some of the research that’s happening.
Data classification isn't just about organizing data—it's about unleashing its untapped potential.
Classifying data creates a richer dataset, one that you can use in more, and sometimes extremely impactful, ways. For instance, the International Olympic Committee (IOC) recently announced that it will use AI to protect athletes' mental health by identifying abusive social media posts. The AI will flag posts so they can be reported and removed before athletes are distracted or harassed by them.
Data classification is one of those tasks that come up regularly. In every company I’ve worked for in 15+ years, there has been a need to categorize data to solve a business problem—removing ads that violate policies, limiting search results for a nearby restaurant to those that are currently open, detecting fraud, and more. In each case, a solution involved a long and arduous process of gathering examples, cleaning data, identifying patterns in the data, and training a model.
The idea that LLMs could be used to classify data without building a custom model, that is, more quickly and cheaply, is intriguing. I get excited about the technology being democratized, and if people who can’t code or train models can classify data, it lowers the cost and time involved and enables more value from data that has been sitting around collecting dust.
I’ve seen examples online of using ChatGPT to determine whether a text's sentiment is positive or negative and even examples of bucketing sentences and paragraphs into groups to extract themes. Beyond these basics, I wanted to know more about the boundaries being explored in research.
Research
The three papers I read each explore a different way to improve the quality of LLM-based classification or to reduce the time it takes to solve a classification problem.
Title and Link | My Takeaway |
---|---|
TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision | Good F1 scores require tuning a supervised model; you can't use pre-built LLMs to classify data into a hierarchical taxonomy out of the box. However, the proposed framework does use LLMs and reduces the level of effort (LOE). |
Enhancing Low-Resource LLMs Classification with PEFT and Synthetic Data | Fine-tuning LLMs using LoRA with synthetic data shows comparable performance to the more expensive and resource-intensive n-shot inference. |
When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes | FastFit significantly improves multi-class classification performance in speed and accuracy across FewMany, a newly curated diverse multi-lingual benchmarking dataset. FastFit demonstrates a 3-20x improvement in training speed, completing training in just a few seconds. |
These advancements demonstrate that while out-of-the-box LLM solutions may not always suffice for complex tasks, strategic fine-tuning and novel techniques can significantly improve classification performance, speed, and applicability across various domains and devices.
Using LLMs to classify data into a hierarchy may be a good initial solution for use cases where having some directional results is more helpful than having no data at all and where false positives and false negatives are tolerable.
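To decide whether false positives and false negatives are tolerable for your use case, it helps to measure them. Here is a minimal Python sketch of per-class precision, recall, and F1, the metric the papers above report. The function names and the sample labels are my own illustration, not something taken from the papers.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that appears."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # false positive for the label that was predicted
            fn[truth] += 1  # false negative for the label that was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    if not scores:
        return 0.0
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical labels for a post-moderation task.
y_true = ["abusive", "ok", "ok", "abusive", "ok"]
y_pred = ["abusive", "abusive", "ok", "ok", "ok"]
for label, (p, r, f1) in per_class_f1(y_true, y_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

If the per-class numbers on a small hand-labeled sample look acceptable, an out-of-the-box LLM may be enough for directional results. If not, that's a signal to look at the fine-tuning approaches above.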
My Experience
I set out to explore the capabilities of text classification using LLMs. In my initial experiment, I tasked ChatGPT with categorizing Taylor Swift's discography by associating each song with the type of pen she used to write it: quill pen, fountain pen, or gel pen. I also gave it a definition of each pen type. While I found a helpful Reddit thread for the definitions, comprehensive data on this topic was scarce, which made it an interesting project for testing the LLM. I used prompting techniques to encourage reasoning and provided three examples of each style, a 3-shot approach that often improves results. I received a pretty decent list, particularly for a subjective topic. Here’s an example of what it came up with for Midnights.
Song | Pen Type |
---|---|
Anti-Hero | Quill Pen |
Sweet Nothing | Quill Pen |
Lavender Haze | Fountain Pen |
Snow on the Beach | Fountain Pen |
Midnight Rain | Fountain Pen |
Bejeweled | Gel Pen |
Karma | Gel Pen |
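For anyone who wants to try the same setup in code rather than in the ChatGPT window, here is a rough Python sketch of how a 3-shot prompt like mine could be assembled and how the answer could be read back. The function names, the prompt wording, and the placeholder definitions and songs are all assumptions for illustration. This is not the exact prompt I used, and the sketch stops short of calling any model API.

```python
def build_few_shot_prompt(definitions, examples, item):
    """Assemble a classification prompt.

    definitions: {label: definition text}
    examples:    {label: [labeled example items]} (three per label = 3-shot)
    item:        the unlabeled song to classify
    """
    lines = ["Classify the song into exactly one category.", "", "Definitions:"]
    for label, text in definitions.items():
        lines.append(f"- {label}: {text}")
    lines += ["", "Examples:"]
    for label, items in examples.items():
        for example in items:
            lines.append(f"Song: {example}\nCategory: {label}")
    lines += [
        "",
        "Explain your reasoning in one sentence, then give the answer "
        "on a final line that starts with 'Category:'.",
        f"Song: {item}",
    ]
    return "\n".join(lines)


def parse_category(response, labels):
    """Read the label from the last 'Category:' line; None if missing or unknown."""
    for line in reversed(response.strip().splitlines()):
        if line.strip().lower().startswith("category:"):
            answer = line.split(":", 1)[1].strip().lower()
            for label in labels:
                if answer == label.lower():
                    return label
            return None
    return None


# Placeholder inputs: swap in real definitions and three real songs per pen.
definitions = {
    "Quill Pen": "<definition>",
    "Fountain Pen": "<definition>",
    "Gel Pen": "<definition>",
}
examples = {label: ["<song 1>", "<song 2>", "<song 3>"] for label in definitions}
print(build_few_shot_prompt(definitions, examples, "Anti-Hero"))
```

Asking for the answer on a fixed final line keeps the model's reasoning in the response while still making the label easy to extract, and `parse_category` returns `None` instead of guessing when the model answers with something outside the three pen types.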
From there, I decided to see if an LLM, when given examples, could learn a pattern for otherwise meaningless codes like "aKey82v." This idea stemmed from a customer's use case of adding project codes to tasks. I created synthetic data for task types, example tasks, and codes. In my initial testing phase, the classification results were not very accurate, which was expected. I had not explicitly included any keywords, such as a project name or other task patterns, that would make it easy to determine which types of tasks correspond to which projects. Additionally, I only provided three examples for each project.
While LLMs excel at sentiment analysis and at categorizing data based on natural language, in my testing they struggled with patterns that aren't expressed in the language itself, such as arbitrary project codes. For those cases, training a classification model with supervised learning is still important.
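To make "supervised learning" concrete, here is a small pure-Python sketch of a multinomial Naive Bayes classifier that learns task-to-code mappings from labeled examples. Naive Bayes is my stand-in for a simple baseline, not the model I'd necessarily pick for production, where a library such as scikit-learn would be the usual choice. The tasks and the second code ("bQz19Lm") are made up, and the toy data is deliberately easy because each project shares keywords, which my synthetic test data did not.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior for this label
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: (task description, project code).
train = [
    ("update invoice template for billing", "aKey82v"),
    ("fix billing export for invoice totals", "aKey82v"),
    ("reconcile invoice payments", "aKey82v"),
    ("design onboarding email for new users", "bQz19Lm"),
    ("write onboarding checklist", "bQz19Lm"),
    ("record onboarding welcome video", "bQz19Lm"),
]
model = NaiveBayes().fit([t for t, _ in train], [c for _, c in train])
print(model.predict("send invoice reminder"))
print(model.predict("onboarding video script"))
```

With real data, the same idea applies at a larger scale: many more than three labeled tasks per project, and a held-out set to check accuracy before trusting the predictions.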
The takeaway: when patterns are intricate, traditional machine learning techniques still hold their ground. It's not about choosing one over the other but knowing when to leverage each tool's strengths.
Need help implementing your AI strategy? Whether you’re looking for additional engineering resources to build a custom solution or someone to fine-tune a set of prompts to enable your team to do more with the AI tools you already have, Mellonhead can help. Email me or book a consult.