10 Best Practices for Storing Labeled Data
- 1. Define the Problem: Is it a good problem for machine learning?
- 2. Gather at least 5,000 datapoints for each outcome.
- 3. Store data at the sentence level.
- 4. Classify and label data in well defined categories.
- 5. Store all representative data.
- 6. Store background data.
- 7. Store the raw text of labeled data (practice redundancy).
- 8. Map your data from start to finish (index values).
- 9. Backup your data.
- 10. Build and think for the future.
You just had your big idea. You read a lot, and you thought it would be interesting to build a classifier that labels a speaker’s tone and determines their political affiliation. How would you begin to break down the problem so that you can use machine learning to make this prediction? For our example, we used this Pew survey of Democratic and Republican voters’ responses about the newspapers they trusted.
Before you can even think about building a production-ready machine learning model, you need to think about your data pipeline. This is the foundation on which an ML model runs, and without a strong foundation you can’t expect your model to perform successfully. The experts at Skim AI have put together the 10 best practices for storing labeled data that will set you up for success.
1. Define the Problem: Is it a good problem for machine learning?
For a machine learning model to be applicable to a problem, the problem must be definable in terms a computer can answer:
- Does this set of words fit a pattern that is more like one category of text or another?
- Is there a database with enough representative data for a machine to extract patterns?
In the example we’re referring to, there are two outcomes: Democratic-leaning speech or Republican-leaning speech. The problem is clearly more complex than this, as there are many groups that make up Democrats and Republicans, and there are also independents and lots of gradations. But for this example, we are going to simplify to those two categories.
2. Gather at least 5,000 datapoints for each outcome.
Collect at least 5,000 data points in your database for each category of information you want to classify. In our example, we are storing labeled data points from articles, speeches, books, or show transcripts. Since we want to build a binary classifier, we want 5,000 examples of Democratic writing samples and 5,000 examples of Republican writing samples, for a total of 10,000 samples. While 5,000 points per outcome is the recommended minimum, accuracy will improve with more data, so don’t hold back.
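A quick way to keep yourself honest about this minimum is to count examples per label as you collect. Here is a minimal sketch, assuming your labeled samples live in a CSV with a `label` column; the file and column names are just illustrative:

```python
import pandas as pd

# Hypothetical dataset: one row per labeled writing sample, with a "label"
# column containing "democratic" or "republican".
samples = pd.read_csv("labeled_samples.csv")

# Count examples per outcome and flag any class below the 5,000 minimum.
counts = samples["label"].value_counts()
print(counts)

MIN_PER_CLASS = 5000
for label, count in counts.items():
    if count < MIN_PER_CLASS:
        print(f"Need {MIN_PER_CLASS - count} more '{label}' samples")
```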
3. Store data at the sentence level.
In our case, the goal is to classify entire articles as either Democratic or Republican, but you will want to future-proof your efforts by storing each resource at the sentence level instead of at the article level. That way, if you want to classify more specific entities, such as paragraphs, or run analytics around certain keywords or entities (people, places, and organizations), you will be able to use your data with less cleaning effort in the future.
In general, 50-65% of the time spent on any ML project is dedicated to cleaning and transforming data into a format that ML algorithms can read. Most classifiers work at both the sentence and whole-document level; a minimal storage sketch follows the tips below.
Practical Implementation Tips for sentence and paragraph level classification:
- Keep your classification needs to a single sentence, single paragraph, or single document (article) to start.
- Non-standard units (a few words or a few sentences) introduce the much harder problem of building a second ML model to predict which span is important.
- Simplify the classification problem as much as possible to start, build complexity in over time as more data becomes available.
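To make the sentence-level approach concrete, here is a minimal sketch of what storage could look like, assuming a simple SQLite table. The table name, column names, and the naive sentence splitter are all illustrative rather than a required schema:

```python
import re
import sqlite3

# A sketch of sentence-level storage: one row per sentence, keyed by the
# article it came from and its position, so the article-level view can
# always be reconstructed.
conn = sqlite3.connect("labeled_data.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sentences (
        article_id  TEXT,
        sentence_ix INTEGER,
        text        TEXT,
        label       TEXT
    )
""")

def split_sentences(article_text):
    # Naive splitter for illustration only; a real pipeline would use a
    # proper sentence tokenizer (e.g. NLTK or spaCy).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", article_text) if s.strip()]

def store_article(article_id, article_text, label):
    rows = [
        (article_id, ix, sentence, label)
        for ix, sentence in enumerate(split_sentences(article_text))
    ]
    conn.executemany("INSERT INTO sentences VALUES (?, ?, ?, ?)", rows)
    conn.commit()

store_article(
    "article-001",
    "Taxes were cut again. The senator praised the bill.",
    "republican",
)
```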
4. Classify and label data in well defined categories.
This is a bit about methodology. It’s important to get as pure a signal as possible, which means stripping out noisy and nuanced resources and information. For example, labeled data from centrist sources carries a less clear signal, and adding articles from a centrist source to either the Republican or Democratic data set would decrease the accuracy and usefulness of the Republican/Democrat speech classifier.
In our example, this is especially tough as people are much more complicated in their political beliefs than a simple party line. Furthermore, various writers, speakers and newspapers are going to have different opinions than the official party line. In this example, there’s likely to be a lot of noise that needs to be suppressed, for example:
- Papers will vary in the degree to which they lean conservative or liberal on certain issues.
- Specific journalists will have differing views on a particular issue, even among other journalists at the same publication.
- Shareholders or owners may preach a dogma about a particular issue that is important to them, and instruct the editorial team to cover issues a certain way.
One could spend hours defining a methodology to account for all possible variables. We recommend gathering and storing as much data as possible. Look for clean data at the sentence level and create fields to track the author, the publication, and anything else that can be captured.
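As one illustration of what those fields might look like, here is a sketch of a per-sentence record that carries the source metadata alongside the text and label; all field names and values are hypothetical:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Alongside the text and label, capture every source field you can, so that
# noisy sources (centrist outlets, off-message journalists) can be filtered
# or re-labeled later without re-collecting the data.
@dataclass
class LabeledSentence:
    text: str
    label: str                            # "democratic" or "republican"
    article_id: str
    sentence_ix: int
    publication: Optional[str] = None
    author: Optional[str] = None
    published_date: Optional[str] = None  # ISO 8601, e.g. "2019-06-14"

record = LabeledSentence(
    text="The senator praised the bill.",
    label="republican",
    article_id="article-001",
    sentence_ix=1,
    publication="Example Tribune",
    author="Jane Doe",
    published_date="2019-06-14",
)
print(asdict(record))
```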
5. Store all representative data.
Can you get access to enough data? In our case it is relatively easy to get access to old articles from these publications to gather a dataset of articles and enough datapoints for each classification category.
If not, you can consider using Amazon Mechanical Turk to label data or if your methodology requires training, you can train and pay people in India or Macedonia $1,000 a month to build a dataset.
6. Store background data.
Storing labeled data that is tangentially related to what you want to classify allows you to build more robust models. It exposes the classification model to new vocabulary, topics, people, places, and entities, and helps it learn the inherent relationships between words. This makes the model better able to handle data outside the initial dataset you started with.
Maybe you want to gather books written by congressmen and congresswomen, tweets, interview transcripts, transcripts of cable news shows, transcripts of debate on the floor of Congress, and bills and laws written or sponsored by particular members of Congress.
The point of machine learning is that you don’t have to test all the variables yourself, just get enough data for ML to work, and define your problem well.
7. Store the raw text of labeled data (practice redundancy).
To be safe, always store the raw text of your labeled data. For example, if you have a sentence within an article that is representative of data you want to label, make sure you store the raw text of that sentence and the label. Even if you just store this data as a redundancy, please take this action. Your machine learning engineer or data scientist will thank you.
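As a small illustration of that redundancy, a stored record might carry both the reference into the source and a verbatim copy of the labeled text; the field names and offsets below are hypothetical:

```python
# Even when a labeled sentence is referenced by character offsets into its
# source article, keep a verbatim copy of the raw text in the same record so
# the label still makes sense if the source file is lost or re-formatted.
labeled_example = {
    "article_id": "article-001",
    "char_start": 22,
    "char_end": 51,
    "raw_text": "The senator praised the bill.",  # redundant copy on purpose
    "label": "republican",
}
```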
8. Map your data from start to finish (index values).
If you do use index values to reference labeled data, map that data and understand the mapping well. For example, if you are storing a sentence or paragraph from an article, make sure the database values for where that sentence or paragraph starts and ends match the values in the source you are storing the data from. To be safe, test the first sentence, the beginning and end values, and the last sentence.
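A lightweight check like the following sketch can catch mapping drift early. It assumes each record stores character offsets plus a redundant copy of the raw text, as suggested in practice 7; the names are illustrative:

```python
# Verify that stored (char_start, char_end) offsets reproduce the stored raw
# text when applied to the source article. Test at least the first and last
# labeled sentences of each article.
def verify_offsets(article_text, records):
    for rec in records:
        sliced = article_text[rec["char_start"]:rec["char_end"]]
        if sliced != rec["raw_text"]:
            raise ValueError(
                f"Offset mismatch in {rec['article_id']}: "
                f"{sliced!r} != {rec['raw_text']!r}"
            )

article = "Taxes were cut again. The senator praised the bill."
records = [
    {"article_id": "article-001", "char_start": 0, "char_end": 21,
     "raw_text": "Taxes were cut again."},
    {"article_id": "article-001", "char_start": 22, "char_end": 51,
     "raw_text": "The senator praised the bill."},
]
verify_offsets(article, records)  # raises if any mapping has drifted
```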
9. Backup your data.
This should be self-explanatory. Backup your data regularly.
10. Build and think for the future.
It can take years to gather enough labeled data in some circumstances. If you know that you want to solve a problem in a specific area, start collecting as much unlabeled and labeled data related to the problem you want to solve, along with domain-specific data, as early as possible.
Ready to get started? Check out our other pieces on machine learning.