What You Should Know Before You Select a Sentiment Analysis Dataset

Why do you need a sentiment analysis dataset for training?

Sentiment models are a type of natural language processing (NLP) algorithm that determines the polarity of a piece of text. That is, a sentiment model predicts whether the opinion expressed in a piece of text is positive, negative, or neutral. These models are a powerful tool for gaining insight into large sets of opinion-based data, such as social media posts and product reviews. For example, a seller on the Amazon marketplace could use a sentiment model to quickly assess thousands of reviews and gauge customer satisfaction with their goods. Sentiment models can also be used to predict the reviews for a new product by comparing its metadata to that of similar products and analyzing those products’ reviews.


Like all machine learning algorithms, sentiment models require a large set of labeled training data, here called a sentiment analysis dataset, to develop and tune. The first step in model development requires a sentiment analysis dataset of tens of thousands of statements that are already labeled as positive, negative, or neutral. Building such a dataset from scratch is difficult, because a human expert must determine and label the polarity of each statement. A ready-made training dataset that is already labeled greatly reduces the time and effort needed to develop a sentiment model. Two such datasets frequently used for training are the Internet Movie Database (IMDB) and Amazon review databases.

Primary Training Datasets: IMDB and Amazon Review Databases

The IMDB and Amazon review databases are almost ideal for training sentiment models (more on their limitations to follow), as they are ready-made datasets of easily labeled sentiments. The polarity of a review can be determined by segmenting reviews by score. For the IMDB database, reviews of 1-3 stars are typically considered negative, 4-6 stars neutral, and 7-10 stars positive. Similarly, for Amazon reviews, 1-2 stars is negative, 3 stars is neutral, and 4-5 stars is positive. However, the Amazon review database is not as popular: a 1-to-5 rating does not have the granularity of a 1-to-10 system, and the Amazon dataset is more complex and therefore more challenging to use.
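The score-based labeling described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the function name `label_from_rating`, the `scale` keys, and the assumption that ratings start at 1 on both scales are hypothetical, and the cutoffs are simply the ones quoted in the text.

```python
# Hypothetical helper: maps a star rating to a polarity label.
# Cutoffs follow the conventions quoted in the text; they are a
# common choice, not a fixed standard.
CUTOFFS = {
    # scale: (highest "negative" rating, highest "neutral" rating)
    "imdb": (3, 6),    # 3 or below negative, 4-6 neutral, 7-10 positive
    "amazon": (2, 3),  # 2 or below negative, 3 neutral, 4-5 positive
}

# Assumed valid rating ranges (both scales assumed to start at 1).
RANGES = {"imdb": (1, 10), "amazon": (1, 5)}


def label_from_rating(rating, scale):
    """Return 'negative', 'neutral', or 'positive' for a star rating."""
    if scale not in CUTOFFS:
        raise ValueError(f"unknown scale: {scale!r}")
    low, high = RANGES[scale]
    if not low <= rating <= high:
        raise ValueError(f"rating {rating} outside {low}-{high} for {scale}")
    negative_max, neutral_max = CUTOFFS[scale]
    if rating <= negative_max:
        return "negative"
    if rating <= neutral_max:
        return "neutral"
    return "positive"


if __name__ == "__main__":
    for rating, scale in [(2, "imdb"), (5, "imdb"), (9, "imdb"), (3, "amazon")]:
        print(scale, rating, label_from_rating(rating, scale))
```

Keeping the cutoffs in a table rather than in the branching logic makes it easy to try other thresholds, for example dropping the neutral band entirely for a two-class model.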


The IMDB database has been used in a wealth of academic studies, tutorials, and open-source code. The standard IMDB dataset contains 50,000 reviews, split evenly between positive and negative. In general, the IMDB database is more popular than the Amazon database because it is smaller and easier to manipulate. The IMDB dataset is a powerful tool for building the skills needed to go on to develop more advanced sentiment models.


The Amazon review dataset has the advantages of size and complexity. Amazon has compiled reviews for over 20 years and offers a dataset of over 130 million labeled sentiments. The Amazon dataset also offers the additional benefit of containing reviews in multiple languages, and it further provides labeled “fake” or biased reviews. Due to its size and complexity, the Amazon dataset supports the development of more sophisticated sentiment models. It also offers more practical utility, given that predicting product performance via sentiment modeling is a critical component of modern product releases.

Limitations in Applicability of the IMDB and Amazon Sentiment Analysis Datasets

As much time and effort as these databases save when training sentiment models, they are not without limitations. Because reviews are tied to a quantitative score, applying models trained on these databases to qualitative opinions, such as tweets, leads to a loss in accuracy. Also, for the IMDB database, reviews are highly subject to the viewers’ preferences, which can skew results. Similarly, for the Amazon database, biased or “fake” reviews are common. A further complication with any sentiment database is the innate inability of the model to recognize sarcasm, which can be common in reviews.

Furthermore, the key words (features) found during the training process are limited when working with reviews. Reviews tend to be repetitive, containing a limited subset of key terms. Moreover, reviews contain some terms that are uncommon in regular opinion statements, such as “weak soundtrack.” Because some key terms are specific to reviews and key term diversity is low, applying sentiment models trained on these databases to other kinds of text can lead to suboptimal results. For example, if a company wants to use a sentiment model to predict the reaction to a change in policy, a model trained on a review database would struggle with this prediction, given that the reaction will not be a quantitative assessment of a product.
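One rough way to gauge this vocabulary mismatch before committing to a dataset is to measure how many words in the target text never appear in the training reviews. The sketch below is only an illustration under simple assumptions: the helper names (`tokenize`, `vocabulary`, `oov_rate`) are hypothetical, the tokenizer is a naive regular expression, and the sample sentences are made up.

```python
import re


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary(texts):
    """Collect the set of distinct tokens across a list of texts."""
    vocab = set()
    for text in texts:
        vocab.update(tokenize(text))
    return vocab


def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never occur in the training texts."""
    train_vocab = vocabulary(train_texts)
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in train_vocab)
    return unseen / len(target_tokens)


if __name__ == "__main__":
    # Made-up examples: movie-review phrasing vs. a policy reaction.
    reviews = ["Weak soundtrack but great acting", "Great plot, weak ending"]
    reactions = ["The new remote work policy is unfair"]
    print(f"out-of-vocabulary rate: {oov_rate(reviews, reactions):.2f}")
```

A high out-of-vocabulary rate does not prove a model will fail on the target text, but it is an inexpensive warning sign that the features learned from reviews may not transfer.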

In summary, sentiment models are a powerful tool for modern businesses, and these models require a large sentiment analysis dataset for training. The IMDB and Amazon review databases are two common, readily accessible sentiment databases that are popular for training sentiment models. While providing a useful tool for sentiment model training, these datasets come with caveats that must be taken into account.



