Recommendation algorithms have become ubiquitous across commercial fields, from the Amazon “yourstore” splash page to Netflix’s match percentage scores. In essence, recommendation algorithms filter large sets of data, e.g. song or movie databases, using a variety of methods to surface the items most relevant to a user. The algorithm does so by observing the user’s past behavior and using the knowledge gained from these observations to recommend products and media that the user is most likely to buy, watch, or listen to. Many attempts have been made to leverage machine learning, especially neural networks, for recommendation systems. While there is a wealth of research claiming improvements in recommendations for various algorithms, Dacrema et al. have written an enlightening article asking: are we really improving over traditional techniques? According to their article, “indications exist … that the achieved progress”, measured in terms of accuracy improvements over existing models, “is not always as strong as expected.” So, if progress isn’t being accurately captured, how are researchers currently measuring progress, what are the faults in these methods, and have we actually improved recommendation algorithms by adding machine learning techniques?
Progress in algorithm performance is measured by comparing the performance of a new algorithm to the baseline performance of other extant algorithms. In particular, the metrics most commonly used are:
While several factors contribute to the failure of current progress assessment methods, Dacrema et al. point to three key factors:
In particular, the authors highlight the extreme lack of reproducibility of published algorithms. They note that in the modern research environment, wherein source code and data sets are made readily available, published results should be trivial to recreate. However, “In reality, there are … tiny details regarding the implementation of the algorithms and the evaluation procedure … that can have an impact on the experiment outcomes.” In fact, out of dozens of papers examined, the authors found only seven with source code and data sets sufficient for reproduction.
Dacrema et al. tested seven published algorithms in their paper. They compared the results of these algorithms, on the data sets used in the respective studies, to the results of traditional, much simpler algorithms. In their study, they found only one algorithm that outperformed traditional methods: Variational Autoencoders for Collaborative Filtering (Mult-VAE), presented by Liang et al. in 2018. Dacrema et al. argue that Mult-VAE provides the following performance improvements:
Dacrema et al. conclude by stating, “Thus, with Mult-VAE, we found one example in the examined literature where a more complex method was better … than any of our baseline techniques in all configurations.”
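To make the idea of comparing against a simple baseline concrete, here is a minimal Python sketch. It is not the authors' evaluation code: the toy interaction data, the function names (`recall_at_k`, `top_popular`, `mean_recall_at_k`), and the choice of recall@k as the metric are all assumptions made purely for illustration. The baseline simply recommends the most popular items a user has not yet interacted with.

```python
from collections import Counter


def recall_at_k(recommended, relevant, k):
    """Fraction of a user's held-out items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)


def top_popular(train, user, k):
    """Hypothetical baseline: the k most popular items the user has not seen."""
    counts = Counter(item for items in train.values() for item in items)
    seen = train.get(user, set())
    # Sort by descending popularity; break ties by item name for determinism.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return [item for item in ranked if item not in seen][:k]


def mean_recall_at_k(recommend, train, test, k):
    """Average recall@k over all users that have held-out items."""
    scores = [
        recall_at_k(recommend(train, user, k), held_out, k)
        for user, held_out in test.items()
        if held_out
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy interaction data, invented for illustration only.
train = {
    "u1": {"a", "b"},
    "u2": {"a", "c"},
    "u3": {"b", "c", "d"},
}
# Held-out interactions that the recommender should recover.
test = {"u1": {"c", "d"}, "u2": {"b"}, "u3": {"a"}}

baseline_score = mean_recall_at_k(top_popular, train, test, k=1)
print(f"TopPopular recall@1: {baseline_score:.3f}")
```

A more complex model would be plugged in as another function with the same `(train, user, k)` signature and scored with the same split, the same `k`, and the same metric. Holding those details fixed is what makes the comparison with the baseline meaningful.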
As tempting as it is to declare success and publish novel algorithms and results, Dacrema’s team has shown that we really aren’t improving, or at least not by much. Their article concludes by stating, “Our analysis indicates that … most of the reviewed works can be outperformed at least on some datasets by conceptually and computationally simpler algorithms.” Therefore, however appealing it may be to apply machine learning to every data analysis application, recommendation systems have so far proven to be an application in which machine learning has largely not improved algorithm performance; at least, not yet.