Over 80% of data science projects fail to go beyond testing and into production. If everyone is starting a machine learning project, where is it going wrong? Undoubtedly, ML solutions increase efficiency for those in the business of gathering or analyzing large swaths of data. But the daunting question of how to implement such a project often keeps us from starting one.
So how do you even begin to approach such a task? The same way you’d eat an elephant: bite by bite. Through my experience leading my team in building a standard AI platform (Skim AI Chrome Toolbar) and custom solutions, I’ve identified 10 questions to ask before starting a machine learning project. With these 10 questions answered, you’ll have a clear foundation for how to approach the project.
There are several valid responses to this question, so let’s break it down. First, identify the overall goal: do you need to extract information or classify information?
Next, identify the level of detail at which this should be executed. For example, should the model analyze text at the sentence level or at the level of the entire document? Or do you need something custom, such as a subset of sentences in a paragraph, which may be difficult to implement with high accuracy?
Determine the desired quantitative results. Perhaps you want to increase the amount of data that’s classified with automated data extraction. In this case, you must indicate by how much. Or maybe you want to increase the amount of data that you label collectively as a firm, or to be able to make a prediction with a certain level of accuracy. Whatever the goal is, make it clear and establish measurable metrics.
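To make "measurable metrics" concrete, here is a minimal sketch of how accuracy and per-category precision and recall could be computed from a small hand-labeled evaluation set. The function names and the example labels are illustrative assumptions, not part of any particular platform.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Share of items where the predicted label matches the human label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_category_metrics(gold, predicted):
    """Return {category: (precision, recall)} for every category seen."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be equal in length")
    gold_count = Counter(gold)            # how often each category truly occurs
    predicted_count = Counter(predicted)  # how often the model predicts it
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    metrics = {}
    for category in sorted(set(gold) | set(predicted)):
        precision = (correct[category] / predicted_count[category]
                     if predicted_count[category] else 0.0)
        recall = (correct[category] / gold_count[category]
                  if gold_count[category] else 0.0)
        metrics[category] = (precision, recall)
    return metrics


# Hypothetical evaluation data: human labels vs. model output.
gold_labels = ["pos", "neg", "pos", "neg"]
model_labels = ["pos", "pos", "pos", "neg"]
print(accuracy(gold_labels, model_labels))
print(per_category_metrics(gold_labels, model_labels))
```

Agreeing on numbers like these before the project starts gives you a target to compare each model iteration against.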
Ideally, you want to have two to five thousand data points to start for each classification category. It’s beneficial to have another fifty to one hundred thousand pieces of unlabeled raw text, articles, or equivalent to use as a layer in your model. If you were building a sentiment or other classifier for mentions of a product in news data, it would still be good to have a few hundred thousand pieces of news that mention products and the industry you are building the model for, even if those articles aren’t labelled.
As mentioned in question 3, the very minimum number of data points required is 5,000 per category to develop a model that provides results that are close to human accuracy. In order to create a realistic timeline, you should consider how long it would take to accomplish labeling that first set manually.
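A rough timeline estimate is simple arithmetic. The sketch below assumes 5,000 labels per category as stated above; the seconds per label, number of labelers, and productive hours per day are hypothetical placeholders that you should replace with figures measured on your own data.

```python
import math


def labeling_days(categories, per_category=5000, seconds_per_label=30,
                  labelers=1, hours_per_day=6):
    """Estimate working days needed to hand-label the first training set.

    seconds_per_label, labelers and hours_per_day are assumed values;
    time a small pilot batch to get realistic numbers.
    """
    total_labels = categories * per_category
    total_hours = total_labels * seconds_per_label / 3600
    return math.ceil(total_hours / (labelers * hours_per_day))


# Three categories: 15,000 labels at 30 seconds each is 125 hours of labeling.
print(labeling_days(categories=3))              # one labeler
print(labeling_days(categories=3, labelers=2))  # two labelers
```

Even a crude estimate like this shows quickly whether the first labeled set is a matter of days or months.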
Sometimes you have to simplify your system to deploy a useful machine learning model. Oftentimes, the reason is that you simply won’t have enough data to build a model that distinguishes between categories with human-level accuracy.
Is your methodology easy for a human to understand? Are your classification categories distinct, with some vocabulary differences between them, or are they hard to tell apart because of subjectivity and the inability to define rules with certainty?
If a human can’t make a decision within a second or two, a machine is going to have a problem. Some teams alleviate this by creating a “mixed” classification category and flagging anything in it for review by an analyst, in the same way that a Tesla asks the driver to take control of the steering wheel when it isn’t sure what to do in a confusing situation.
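One common way to implement a "mixed" category is a confidence threshold: if the model's top score falls below it, the item goes to an analyst instead of being labeled automatically. This is a minimal sketch under that assumption; the function name, the 0.8 threshold, and the example scores are hypothetical.

```python
def route_prediction(probabilities, threshold=0.8):
    """Return (label, needs_review) for one document.

    probabilities maps each category to the model's score for it.
    The 0.8 threshold is an assumed starting point, not a recommendation;
    tune it against analyst feedback.
    """
    if not probabilities:
        raise ValueError("probabilities must contain at least one category")
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        # The model is unsure: hand the document to a human reviewer.
        return "mixed", True
    return label, False


print(route_prediction({"positive": 0.93, "negative": 0.07}))
print(route_prediction({"positive": 0.55, "negative": 0.45}))
```

Lowering the threshold sends fewer items to review at the cost of more automatic mistakes, so the right value depends on how much analyst time you have.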
It’s helpful to determine the pieces of information you’ll be able to collect from each document in your database. These can include things such as author, date, time, newspaper section, location, source, category, or entities involved (among many other things).
Most data extraction projects want to easily extract the entities (people, places, and things) in a piece of text. Many companies want to map that data to a particular client or to display entity level analytics to an end user (likely a client or customer). If you need to match an entity in the text to one in your database, it is helpful to write out those desired matches.
Additionally, if you already have entities labelled in text, a model built to extract entities from new text will perform at a high level of accuracy. This is especially the case if the goal is to label all of the variations of a single, unified entity (e.g. matching “Facebook”, “WhatsApp”, and “Instagram” to their shared stock symbol “FB”). Creating a master list of entities is also helpful if an entity is mentioned in text in various ways and you then need to display it in a client-facing frontend interface such as a BI dashboard.
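A master list of entities can start as a plain mapping from a canonical identifier to the names that appear in text. The sketch below uses the FB example from above with simple case-insensitive, whole-word matching; the helper names are hypothetical, and a trained entity extraction model would replace the regular expression in practice.

```python
import re

# Canonical identifier -> name variations found in text (from the example above).
ENTITY_ALIASES = {
    "FB": ["Facebook", "WhatsApp", "Instagram"],
}


def build_matcher(master_list):
    """Build an alias lookup table and one compiled pattern covering all aliases."""
    lookup = {alias.lower(): canonical
              for canonical, aliases in master_list.items()
              for alias in aliases}
    if not lookup:
        raise ValueError("master_list must contain at least one alias")
    # Try longer aliases first so a short alias never shadows a longer one.
    alternation = "|".join(re.escape(alias)
                           for alias in sorted(lookup, key=len, reverse=True))
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return lookup, pattern


def find_entities(text, lookup, pattern):
    """Return the set of canonical identifiers mentioned in the text."""
    return {lookup[match.group(0).lower()] for match in pattern.finditer(text)}


lookup, pattern = build_matcher(ENTITY_ALIASES)
print(find_entities("Instagram and WhatsApp both shipped new features.", lookup, pattern))
```

Because every variation resolves to one identifier, a dashboard can aggregate mentions per entity rather than per spelling.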
Data is a competitive advantage that enables you to build models. You should consider building out this capability in-house or in an outsourced capacity to enable your future projects.
If you don’t have an in-house team, consider outsourcing your data gathering needs to companies in India or Eastern Europe. They have very competitive rates that should range from 500-1000 per month for a data gatherer depending on how sophisticated your labeling system is.
For example, is there a lot of data in the database that is still to be labelled, either for this particular problem or among other domain-specific documents that you will build models for in the future?
If so, you can build or tailor various language models to increase the performance of most solutions. Even unlabelled data is helpful, because machine learning models can extract meaning from the relationships that already exist in the unlabelled text.
For the same reason stated above, domain-specific data is very useful for ML/NLP. Oftentimes, someone or some data provider will have what you need to get started, sometimes for free. Many research projects will consider sharing their datasets, often for non-commercial use. Just email them. Find out what access would cost and whether APIs are available.
Google, Facebook, governments, market data providers, research projects, and others can help you seed your initial dataset with data they make available to the community. Often, having such a large dataset increases your ability to get more out of your models even if you have less labelled data of your own.