You just had your big idea. You read a lot, and you thought it would be interesting to build a classifier that labels a speaker’s tone and determines their political affiliation. How would you begin to break down the problem so that you can use machine learning to make this prediction? We used this Pew Research survey, which collected Democratic and Republican voters’ responses about the newspapers they trust.
Before you can even think about building a production-ready machine learning model, you need to think about your data pipeline. This is the foundation on which an ML model runs, and without a strong foundation, you can’t expect your model to perform successfully. The experts at Skim AI have put together the 10 best practices for storing labeled data that will set you up for success.
For a machine learning model to be applicable to a problem, the possible outcomes must be clearly definable for a computer.
In the example we’re referring to, there are two outcomes: Democratic-leaning speech or Republican-leaning speech. The real problem is clearly more complex than this, since many different groups make up Democrats and Republicans, and there are also independents and plenty of gradations in between. But for this example, we are going to simplify to those two categories.
Collect at least 5,000 data points in your database for each category of information you want to classify. In our example, we are storing labeled data points from articles, speeches, books, or show transcripts. Since we want to build a binary classifier, we need 5,000 examples of Democratic writing samples and 5,000 examples of Republican writing samples, for a total of 10,000 samples. While 5,000 points per outcome is the recommended minimum, accuracy will improve with more data, so don’t hold back.
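As a rough illustration, here is a minimal sketch of how you might check class balance against that 5,000-per-class minimum. It assumes your labeled samples live in a CSV with `text` and `label` columns; the file and column names here are hypothetical.

```python
# Minimal sketch: verify each class meets the 5,000-sample minimum.
# The CSV file and column names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("labeled_samples.csv")  # columns: text, label
counts = df["label"].value_counts()
print(counts)

MIN_PER_CLASS = 5000
for label in ("democratic", "republican"):
    n = int(counts.get(label, 0))
    if n < MIN_PER_CLASS:
        print(f"Need {MIN_PER_CLASS - n} more '{label}' samples")
```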
In our case, the goal is to classify entire articles as either Democratic or Republican, but you will want to future-proof your efforts by storing each resource at the sentence level instead of at the article level. That way, if you later want to classify more specific units, such as paragraphs, or run analytics around certain keywords or entities (people, places, and organizations), you will be able to use your data with less cleaning effort.
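To make the sentence-level idea concrete, here is a minimal sketch of turning one labeled article into sentence-level records. The naive regex splitter stands in for a real sentence tokenizer (NLTK or spaCy would be better in practice), and the field names are illustrative, not a prescribed schema.

```python
# Minimal sketch: store labeled text at the sentence level rather than
# at the article level. The regex splitter is a stand-in for a proper
# sentence tokenizer; field names are illustrative.
import re

def article_to_sentence_records(article_text, label, publication, author):
    sentences = re.split(r"(?<=[.!?])\s+", article_text.strip())
    return [
        {
            "raw_text": sentence,
            "sentence_index": i,        # position within the article
            "label": label,             # e.g. "democratic" or "republican"
            "publication": publication,
            "author": author,
        }
        for i, sentence in enumerate(sentences)
    ]

records = article_to_sentence_records(
    "First sentence of the article. Second sentence.",
    label="democratic", publication="ExamplePaper", author="Jane Doe",
)
```

Storing at this granularity costs little extra up front and keeps paragraph- and keyword-level analyses on the table later.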
In general, 50-65% of the time spent on any ML project is dedicated to cleaning and transforming data into a format that can be read by ML algorithms. Most classifiers work at both the sentence and the document level.
Practical Implementation Tips for sentence- and paragraph-level classification:
This is a bit about methodology. It’s important to get as many pure signals as possible. That means stripping out noisy and nuanced resources and information. For example, labeled data from centrist sources carries a less clear signal, and adding articles from a centrist source to either the Republican or the Democratic data set would decrease the accuracy and usefulness of the Republican/Democrat speech classifier.
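One simple way to enforce that kind of signal purity is to keep a curated list of sources whose lean you trust and drop everything else. The sketch below assumes sentence-level records like the ones above; the outlet names are hypothetical placeholders, not editorial judgments.

```python
# Minimal sketch: keep only records from sources with a clear partisan
# signal. The source-to-label mapping is a hypothetical placeholder.
CLEAR_SIGNAL_SOURCES = {
    "ExampleLeftOutlet": "democratic",
    "ExampleRightOutlet": "republican",
    # centrist or mixed outlets are intentionally left off the list
}

def keep_record(record):
    """Drop records whose publication isn't on the curated list, or whose
    stored label disagrees with that publication's expected label."""
    expected = CLEAR_SIGNAL_SOURCES.get(record["publication"])
    return expected is not None and expected == record["label"]

records = [
    {"raw_text": "…", "label": "democratic", "publication": "ExampleLeftOutlet"},
    {"raw_text": "…", "label": "democratic", "publication": "ExampleCentristOutlet"},
]
clean_records = [r for r in records if keep_record(r)]  # the centrist record is dropped
```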
In our example, this is especially tough, as people’s political beliefs are much more complicated than a simple party line. Furthermore, individual writers, speakers, and newspapers will hold opinions that differ from the official party line. In this example, there’s likely to be a lot of noise that needs to be suppressed, for example:
One could spend hours defining a methodology to account for all possible variables. We recommend gathering and storing as much data as possible. Look for clean data at the sentence level, and create fields to track the author, the publication, and any other metadata that can be captured.
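If you keep the data in a relational database, those fields might look something like the table below. The SQLite choice, table name, and column names are all illustrative; any relational or document store would do.

```python
# Minimal sketch: a sentence-level table with metadata fields such as
# author and publication. Names and the SQLite choice are illustrative.
import sqlite3

conn = sqlite3.connect("labeled_data.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS labeled_sentences (
        id              INTEGER PRIMARY KEY,
        raw_text        TEXT NOT NULL,   -- the sentence itself, stored verbatim
        label           TEXT NOT NULL,   -- 'democratic' or 'republican'
        author          TEXT,
        publication     TEXT,
        published_date  TEXT,
        source_url      TEXT,
        sentence_index  INTEGER          -- position of the sentence in the article
    )
""")
conn.commit()
```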
Can you get access to enough data? In our case it is relatively easy to access old articles from these publications and gather enough data points for each classification category.
If not, you can consider using Amazon Mechanical Turk to label data, or, if your methodology requires training, you can train and pay annotators, for example in India or Macedonia at around $1,000 a month, to build a dataset.
Storing labeled data that is only tangentially related to what you want to classify allows you to build more robust models. It exposes the classification model to additional vocabulary, topics, people, places, and entities, and helps it learn the inherent relationships between words. This makes the model better able to handle data outside the initial dataset you started with.
Maybe you want to gather books written by congressmen and congresswomen, tweets, interview transcripts, transcripts of cable news shows, transcripts of debate on the floor of Congress, and bills and laws written or sponsored by particular members of Congress.
The point of machine learning is that you don’t have to account for every variable yourself; just gather enough data for ML to work and define your problem well.
To be safe, always store the raw text of your labeled data. For example, if a sentence within an article is representative of data you want to label, make sure you store the raw text of that sentence along with its label. Even if you only keep this copy as a redundancy, do it. Your machine learning engineer or data scientist will thank you.
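A minimal way to keep that redundant raw copy, assuming nothing about your main database, is an append-only JSON Lines file; the path and field names here are illustrative.

```python
# Minimal sketch: keep a plain, append-only copy of each raw text and its
# label, even if the data also lives in a database. Names are illustrative.
import json

def append_raw_record(path, raw_text, label):
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"raw_text": raw_text, "label": label}) + "\n")

append_raw_record(
    "raw_labeled_backup.jsonl",
    "An example sentence pulled from an article.",
    "democratic",
)
```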
If you do use index values to reference labeled data, map that data and understand the mapping well. For example, if you are storing a sentence or paragraph from an article, make sure the database values for where that sentence or paragraph starts and ends match the values in the source you are storing the data from. To be safe, test the first sentence, the beginning and end offsets, and the last sentence.
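Here is a minimal sketch of that check, assuming you store character offsets alongside the raw text (field names are illustrative): slicing the source with the stored offsets should reproduce the stored sentence exactly.

```python
# Minimal sketch: verify that stored character offsets reproduce the
# stored text when applied to the source. Field names are illustrative.
def offsets_match(source_text, record):
    start, end = record["start_char"], record["end_char"]
    return source_text[start:end] == record["raw_text"]

article = "First sentence of the article. Second sentence."
record = {"raw_text": "Second sentence.", "start_char": 31, "end_char": 47}
assert offsets_match(article, record)  # also test the first and last sentences
```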
This should be self-explanatory. Back up your data regularly.
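A bare-bones version of this, assuming the SQLite file from the earlier sketch, is just a timestamped copy; in practice you would schedule it and ship copies offsite.

```python
# Minimal sketch: a timestamped copy of the labeled-data store. Paths are
# illustrative; schedule this (cron, Airflow, etc.) and store copies offsite.
import os
import shutil
from datetime import datetime, timezone

os.makedirs("backups", exist_ok=True)
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
shutil.copy2("labeled_data.db", f"backups/labeled_data_{stamp}.db")
```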
In some circumstances it takes years to gather enough labeled data. If you know you want to solve a problem in a specific area, start collecting as much labeled and unlabeled data related to that problem, along with domain-specific data, as early as possible.
Ready to get started? Check out our other pieces on machine learning.
Automatically generate summaries of news articles, research papers, PDFs and more with Skim AI’s summarization tool.