Are We Really Making Progress on Neural Recommendation Approaches?

A summary of Maurizio Ferrari Dacrema et al.'s recent article at RecSys 2019

Neural Recommendation Algorithms

Recommendation algorithms have become ubiquitous across commercial fields, from Amazon's "yourstore" splash page to Netflix's percentage match scores. In essence, recommendation algorithms filter large sets of data, e.g. song or movie databases, using a variety of methods to suss out the items most relevant to a user. The algorithm does so by looking at a user's past behavior and using the knowledge gained from these observations to recommend the products and media the user is most likely to buy, watch, or listen to. Many attempts have been made to leverage machine learning, especially neural networks, for recommendation systems. While there is a wealth of research claiming improvements in recommendations for various algorithms, Dacrema et al. have written an enlightening article asking: are we really improving over traditional techniques? According to their article, "indications exist … that the achieved progress—measured in terms of accuracy improvements over existing models—is not always as strong as expected." So, if progress isn't being accurately captured, how are researchers currently measuring progress, what are the faults in these methods, and have we actually improved recommendation algorithms by adding machine learning techniques?

How Progress is Measured

Progress in algorithm performance is measured by comparing a new algorithm's performance against the baseline performance of extant algorithms. In particular, the most commonly used metrics are:

  • Precision: The ability of a classification model to identify only the relevant data points.
  • Recall: The ability of a model to find all the relevant data points within a dataset.
  • Normalized Discounted Cumulative Gain (NDCG): A measure of ranking quality that compares the algorithm's ranked list against an ideal ranking, where relevant items that appear lower in the list contribute less (discounted) gain.
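To make these metrics concrete, here is a minimal sketch of how they are typically computed for a single user's top-k recommendation list, assuming binary relevance; the function names and sample data are illustrative, not taken from the paper:

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """DCG of the top-k list, normalized by the ideal DCG.

    A relevant item at rank r (0-based) contributes 1 / log2(r + 2),
    so hits near the top of the list count for more.
    """
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k])
              if item in relevant)
    ideal_dcg = sum(1.0 / math.log2(rank + 2)
                    for rank in range(min(len(relevant), k)))
    return dcg / ideal_dcg

# Hypothetical example: items "a" and "c" in the top 5 are relevant.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}
print(precision_at_k(recommended, relevant, 5))  # 2 of 5 -> 0.4
print(recall_at_k(recommended, relevant, 5))     # 2 of 3 relevant found
```

Note that precision and recall ignore *where* in the list the relevant items appear, which is why ranking-aware measures like NDCG are reported alongside them.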

Why Are These Methods Failing?

While several factors contribute to the failure of current progress assessment methods, Dacrema et al. point to three key factors:

  1. Weak baseline datasets for training and evaluation
  2. Weak methods used for new baselines (using previously published but unverified algorithms for performance comparison)
  3. Inability to compare and reproduce results across papers

In particular, the authors point out the extreme lack of repeatability of published algorithms. They are quick to note that in the modern research environment, in which source code and data sets are made readily available, published results should be trivial to recreate. However, "In reality, there are … tiny details regarding the implementation of the algorithms and the evaluation procedure … that can have an impact on the experiment outcomes." In fact, of the 18 relevant papers the authors examined, they found only seven whose source code and data sets were sufficient to reproduce the results with reasonable effort.

Neural Recommendation: Have We Improved?

Dacrema et al. tested those seven reproducible algorithms in their paper. Using the data sets from the respective studies, they compared the results of these algorithms against those of traditional, much simpler methods. They found only one algorithm that outperformed the traditional methods: Variational Autoencoders for Collaborative Filtering (Mult-VAE), presented by Liang et al. in 2018. Dacrema et al. argue that Mult-VAE provides the following performance improvements:

  • The obtained accuracy results were between 10% and 20% better than those of the Sparse Linear Method (SLIM), presented by Xia Ning and George Karypis at ICDM '11, which was the best-performing baseline algorithm.
  • Results could be reproduced with improvements over SLIM of up to 5% on all performance measures.
  • Recall improvements for Mult-VAE over SLIM “seem solid.”

Dacrema et al. conclude by stating, "Thus, with Mult-VAE, we found one example in the examined literature where a more complex method was better … than any of our baseline techniques in all configurations."
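For context, the "conceptually and computationally simpler" baselines in this line of work include non-neural heuristics such as TopPopular, which simply recommends the most frequently interacted-with items to every user. A minimal sketch follows; the function and variable names are ours, not from the paper's code:

```python
from collections import Counter

def top_popular(interactions, seen_by_user, k):
    """Recommend the k globally most popular unseen items to each user.

    interactions: list of (user, item) pairs from the training data.
    seen_by_user: dict mapping each user to the set of items they
                  already interacted with (these are filtered out).
    """
    # Rank items by raw interaction count, most popular first.
    popularity = Counter(item for _, item in interactions)
    ranked = [item for item, _ in popularity.most_common()]
    return {
        user: [item for item in ranked if item not in seen][:k]
        for user, seen in seen_by_user.items()
    }
```

Despite requiring no training beyond counting, this kind of popularity baseline, together with nearest-neighbor methods and SLIM, is exactly the sort of simple technique the neural approaches struggled to beat.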


As tempting as it is to declare success and publish novel algorithms and results, Dacrema's team has shown that we really aren't improving, or at least not by much. Their article concludes: "Our analysis indicates that … most of the reviewed works can be outperformed at least on some datasets by conceptually and computationally simpler algorithms." Therefore, as tempting as it is to apply machine learning to every data analysis application, recommendation systems have so far proven to be a domain in which machine learning has not improved algorithm performance; at least, not yet.