Music recommendation systems at work


In this article we look at how algorithms can help personalize recommendations for various types of listeners. Although a recommendation system is a triptych of device, UI and algorithm, we focus here on the issues raised by algorithms and consider the following features: user-to-items (personalized recommendation of songs, artists, albums, genres, playlists, contexts), item-to-items (similar artists, similar songs), and context-to-items (context playlists).
A first level of recommendation uses collaborative filtering (user preferences): listeners who like this song also tend to like these songs. It is powerful because it finds which artists and songs the people in a social group tend to like and recommends them to the other listeners belonging to that group. It is widely used as the default item-to-items recommendation system for all types of goods, not only in the creative industries.
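As a rough illustration of the "listeners who like this song also tend to like these songs" idea, here is a minimal Python sketch of item-to-item collaborative filtering. It assumes binary "like" data and cosine similarity over co-occurrence counts. The listener names, song names and the functions `item_similarities` and `similar_songs` are made up for the example; real services work with far larger data and more elaborate models.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(likes):
    """likes: dict listener -> set of liked songs (hypothetical toy data).

    Returns dict song -> dict other_song -> cosine similarity, computed
    from how often two songs are liked by the same listener.
    """
    item_count = defaultdict(int)   # number of listeners who like each song
    pair_count = defaultdict(int)   # number of listeners who like both songs
    for songs in likes.values():
        for song in songs:
            item_count[song] += 1
        for a, b in combinations(sorted(songs), 2):
            pair_count[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), n in pair_count.items():
        score = n / sqrt(item_count[a] * item_count[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def similar_songs(sims, song, k=3):
    """Return up to k songs most similar to `song` (ties broken by name)."""
    ranked = sorted(sims.get(song, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


likes = {
    "ana": {"song_a", "song_b", "song_c"},
    "ben": {"song_a", "song_b"},
    "cleo": {"song_b", "song_c"},
    "dan": {"song_a", "song_d"},
}
sims = item_similarities(likes)
print(similar_songs(sims, "song_a"))  # ['song_b', 'song_d', 'song_c']
```

In this toy data `song_d` ranks above `song_c` on the strength of a single listener, which shows how fragile the scores are when few preferences are available.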
But collaborative filtering has well-documented drawbacks: popularity bias (highly popular songs take up a disproportionate share of recommendations) and the cold start issue (for new content, there are no user preferences from which to infer recommendations).
To address the cold start problem, a content-based approach (describing the content and measuring similarity between content items) is a remedy: for an isolated song, attributing a genre or style helps include it in certain playlists.
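To make the content-based idea concrete, here is a small Python sketch that places a brand-new song, with no listening history, into playlists using descriptive tags only. It assumes content is described by simple tag sets and uses Jaccard similarity, which is one choice among many; the tags, playlist names and the function `closest_playlists` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two tag collections, in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_playlists(song_tags, playlists, k=2):
    """Rank playlists by tag overlap with the song; drop those with no overlap."""
    scored = [(jaccard(song_tags, tags), name) for name, tags in playlists.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, round(score, 2)) for score, name in scored[:k] if score > 0]


# Hypothetical playlist descriptions and a new, never-played song.
playlists = {
    "late night jazz": {"jazz", "mellow", "instrumental"},
    "workout": {"electronic", "energetic"},
    "focus": {"ambient", "instrumental", "mellow"},
}
new_song = {"jazz", "instrumental", "piano"}

print(closest_playlists(new_song, playlists))
# [('late night jazz', 0.5), ('focus', 0.2)]
```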
These two approaches (collaborative filtering and content-based) address what is required to select a set of songs or artists for the listener in a static mode. More complex recommendation models take the listener's behaviour into account (goal, context, interactions with the UI).

Image source: Music recommendation - ISMIR 2018 -  Markus Schedl, Peter Knees, Fabien Gouyon 

The objective is to provide recommendation systems that fit each listener's profile in terms of music universe, content popularity, familiarity, new releases, appropriation cycle (discovery, repetition, pleasure, saturation), diversity of genres, surprise, continuation of past exploration (including outside the music platform), and so on.
For an in-depth review of issues relating to music recommendation systems, we invite you to read this presentation ("Overview and new challenges of music recommendation research in 2018", by Markus Schedl, Peter Knees, Fabien Gouyon). This other article (by Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, Mehdi Elahi) puts particular emphasis on remedies for current challenges.
One key issue is the listener's need for perceived consistency, which encourages recommenders to stay close to the object of the recommendation (for instance, very similar artists around one of the listener's favorite artists). High similarity does not necessarily produce good recommendations: a clone of the Beatles is not a good recommendation, and a playlist that stays in the same mood for a long time triggers boredom. Some listeners like diversity (they are often found in the "enthusiasts" group) and would welcome more open recommendations: more exploration rather than exploitation, more novelty rather than familiarity, more diversity rather than accuracy, more serendipity rather than focus. Personalization does not mean systematically high focus on listener preferences.
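The trade-off between accuracy and diversity can be sketched in code. The article does not name a technique, so the example below swaps in one well-known option, maximal marginal relevance (MMR) re-ranking: each pick balances predicted relevance against similarity to what has already been selected. The relevance scores, the genre labels and the `trade_off` parameter are assumptions made for the illustration.

```python
def mmr_rerank(relevance, similarity, k, trade_off=0.5):
    """Greedy MMR: pick k items, penalising similarity to earlier picks.

    relevance: dict item -> predicted relevance in [0, 1]
    similarity: function (item, item) -> similarity in [0, 1]
    trade_off: 1.0 means pure relevance, lower values favour diversity
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def mmr_score(item):
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return trade_off * relevance[item] - (1 - trade_off) * redundancy

        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical candidates for a Beatles fan.
relevance = {
    "beatles_clone_1": 0.95,
    "beatles_clone_2": 0.90,
    "soul_classic": 0.70,
    "afrobeat_gem": 0.60,
}
genre = {
    "beatles_clone_1": "british_pop",
    "beatles_clone_2": "british_pop",
    "soul_classic": "soul",
    "afrobeat_gem": "afrobeat",
}


def same_genre(a, b):
    """Crude similarity: 1.0 if both items share a genre, else 0.0."""
    return 1.0 if genre[a] == genre[b] else 0.0


print(mmr_rerank(relevance, same_genre, k=3, trade_off=1.0))
# ['beatles_clone_1', 'beatles_clone_2', 'soul_classic']
print(mmr_rerank(relevance, same_genre, k=3, trade_off=0.5))
# ['beatles_clone_1', 'soul_classic', 'afrobeat_gem']
```

A lower `trade_off` could suit "enthusiast" listeners, while lean-back listeners may prefer a value closer to 1.0; how to set it per listener is left open here.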

The further the recommender takes the listener outside their song and artist preferences, the more interpretability and explainability it requires; otherwise the listener may not understand what the recommender is doing.
There are plenty of angles from which to further personalize listeners' navigation and the filtering process. We could not resist mentioning this piece of research ("Evidence for pianist-specific rubato style in Chopin nocturnes", by Miguel Molina-Solana, Maarten Grachten, Gerhard Widmer) on how to categorize the rubato of pianists performing Chopin nocturnes. That would, for instance, help streaming platforms dedicated to the classical genre give their listeners more explicit cues on how to navigate between the various interpretations and performers.

A playlist is a specific type of recommendation: it is a selection of songs presented in a specific order, which makes the recommendation process more complex. This article digs a bit further into the challenges raised by playlist generation.

The question of whether recommendations are made by humans or by computers helps clarify what is at stake with algorithmic recommendations. We have assumed so far that recommendations are generated automatically, because human-created playlists are not personalized, and personalization is the subject of this post.
The question applies both to the selection of songs (playlists) and to the description of the content (genre and mood of songs, genre of artists). Human curation seems to bring more engaging playlists (there is an art to naming a playlist), more trust (though this paper would argue the opposite), more consistency, and fewer songs that are irrelevant in a given context. Machines bring more empathy for listeners (they can generate a personalized music universe and context all together, and adapt to each listener's behaviour) and extend recommendations to the long tail.
The difference between an evaluation by a human being (listening with their ears and interpreting with their brain) and by a machine (retrieving descriptors from the audio signal and generating recommendations with algorithms) is referred to as the "semantic gap". Although this gap is far from being closed for content description, machine-generated recommendations are effective and bring a decisive advantage: simplifying the UI.
Pandora relies on machines for playlist generation but on humans for describing the content, and by offering a very simple UI it targets lean-back listeners. Other music platforms use more or less hybrid methods.

To further optimise and personalize a recommendation system, one needs to measure listener satisfaction and decide which metrics are appropriate: session duration, additions to library or playlists, likes and dislikes, skips, volume changes, and so on. Another approach, automatic personalized playlist continuation, aims at reproducing the patterns implied in the way a listener creates playlists. But this assumes that the listener does it correctly in the first place, which is debatable.
Most listener feedback is implicit rather than explicit, which makes it strenuous to exploit. A listener can skip a song not because he doesn’t like it, but because he has listened to it too many times, or because it is not the right context for it.
For a “savant” listener, skipping songs is part of the exploration experience. The skip rate can be high and is not strictly a metric of dissatisfaction.
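One simple way to reflect this in code is to read each skip against the listener's own habits rather than at face value. The sketch below is an illustrative heuristic of our own, not a documented industry method: a skip counts for less when the listener skips most of what he hears. The functions `skip_baseline` and `skip_weight` and the play histories are hypothetical.

```python
def skip_baseline(history):
    """Share of past plays that the listener skipped.

    history: list of booleans, True meaning the play was skipped.
    """
    return sum(history) / len(history) if history else 0.0


def skip_weight(history):
    """Weight of one new skip as a dislike signal, in [0, 1].

    Assumption for this sketch: the more a listener skips in general,
    the less a single skip says about the song itself.
    """
    return 1.0 - skip_baseline(history)


casual = [False] * 9 + [True]        # skips 1 play in 10
savant = [True] * 8 + [False] * 2    # skips 8 plays in 10

print(round(skip_weight(casual), 2), round(skip_weight(savant), 2))  # 0.9 0.2
```

This ignores the other reasons for skipping mentioned above (saturation, wrong context), which would need signals of their own.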
When a recommender needs input on a critical variable of a user profile, what works best is to ask the user to disambiguate his preference. Entering into a dialogue with the listener is a very efficient way to increase satisfaction.
A lack of personalization can cause collateral damage. For instance, interpreting the skip rate at an aggregate level can actually raise A&R issues. Here is an article illustrating the problem of not personalizing exposure for new releases, which can lead to an indiscriminate dictatorship of the skip rate.

Now, how are personalization algorithms implemented in major streaming services? We have started to run a benchmark to analyse this. The benchmark includes Spotify, Apple, Pandora, Google Play Music, Deezer, YouTube, Amazon, Napster, and Qobuz. From an analysis of listener classification we created 3 personas, representative of the listener's level of engagement (casual, enthusiast, savant) and of various aspects of their behaviour (types and frequency of interactions), and tested each service with the same sequence of actions. Here are some of the findings:
Spotify, Pandora, Google Play Music and YouTube provide substantial personalization.
Some services have no user-to-items recommendation system.
There is little difference between recommendations provided to the 3 personas we created.
The popularity bias of collaborative filtering is often not mitigated.
Song skips are hardly taken into account: the same songs, often very popular ones, keep being recommended even if they are skipped after a short time.
Favorite songs keep being included in recommendations, irrespective of listener profile.
Context playlists are not personalized.
There is little difference in recommendations depending on the time of day and the day of the week.
Recommenders are too focused on user preferences; even Google's "I'm feeling lucky" is rather conservative.
Personalization is rather slow and does not capitalize on the behavioural data collected.
There is almost no conversation with the listener to confirm implicit choices or key computed dimensions of the listener's profile.
There is a lack of consistency between apparent market positioning and recommendation strategy (for instance, a premium brand making mainstream recommendations to all listeners).
There are bad recommendations because of poor-quality metadata.
The type of similarity (user-based, content-based, context-based) is not adapted to listeners and contents.

The goal of this benchmark is to identify best practices and blind spots, and to help align the recommendation system of a music service provider with its market positioning.
In the next post we will look at how recommendation fits within the overall strategy of music service providers.

While we are on the subject of algorithms, it is an opportunity to investigate their impact on filter bubbles, diversity and concentration of content consumption.
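Defining diversity is no easy task. The best metric designed so far is the Stirling model, which includes 3 dimensions: variety (the number of categories), balance (how evenly consumption is spread across them) and disparity (how different the categories are from one another). This interesting study shows how the model is applied to the video/cinema sector.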

This study ("Measuring cultural diversity: a review of existing definitions", by Heritiana Ranaivoson) provides a good understanding of the theoretical framework of the Stirling model.
But in practice, it is hard to implement. No study provides a convincing measurement of disparity. What kind of taxonomy should be used for the various categories? With what granularity (15 main genres or hundreds of sub-genres)? How should one cope with changes in time and space, and with the increasing hybridization of genres? What about the share of new releases and recent creations versus older ones? There would be a solution, though: using a fixed number of categories (the X largest genres, say 15, plus a couple of descriptors for new/recent/old) and measuring disparity through distances between genres derived from a factor analysis applied to a large base of user preferences. No doubt researchers will produce more sophisticated approaches in the future, at least by fully applying the Stirling model.
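To show how the three dimensions combine, here is a minimal Python sketch of the Stirling diversity index: the sum, over each pair of distinct genres, of their distance (disparity) raised to `alpha` times the product of their consumption shares (variety and balance) raised to `beta`. The sum runs over unordered pairs here; some formulations count ordered pairs, which doubles the value. The genre shares and distances are invented for the example; in the approach suggested above, the distances would come from a factor analysis of user preferences.

```python
from itertools import combinations


def stirling_diversity(shares, distance, alpha=1.0, beta=1.0):
    """Sum of distance(a, b)**alpha * (share_a * share_b)**beta over genre pairs.

    shares: dict genre -> share of consumption (shares sum to 1)
    distance: function (genre, genre) -> disparity in [0, 1]
    """
    total = 0.0
    for a, b in combinations(sorted(shares), 2):
        total += distance(a, b) ** alpha * (shares[a] * shares[b]) ** beta
    return total


# Hypothetical disparities between three genres.
DISTANCES = {
    frozenset(("pop", "urban")): 0.3,
    frozenset(("pop", "classical")): 0.9,
    frozenset(("urban", "classical")): 1.0,
}


def genre_distance(a, b):
    """Look up the (symmetric) distance between two genres."""
    return DISTANCES[frozenset((a, b))]


balanced = {"pop": 0.5, "urban": 0.3, "classical": 0.2}
concentrated = {"pop": 0.9, "urban": 0.1, "classical": 0.0}

print(round(stirling_diversity(balanced, genre_distance), 3))      # 0.195
print(round(stirling_diversity(concentrated, genre_distance), 3))  # 0.027
```

The second profile scores much lower: consumption sits almost entirely in one genre, and the only other genre consumed is a close neighbour.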
This study ("Evolution de la diversité consommée sur le marché de la musique enregistrée, 2007-2016", by Olivier Donnat, in French) provides an analysis of diversity in the music industry in France over the period 2007-2016. It shows that in France in 2016 the concentration of consumption was much lower for streaming services than for other formats: the Top 1000 amounted to 23% of total consumption for streaming, 36% for single downloads, 44% for album downloads, and 61% for CD purchases. With regard to genre diversity (measured in terms of balance), streaming does not seem to prompt a better balance than other formats: there is merely a redistribution between genres (from pop to urban), if we set aside one casualty (the classical genre has been wiped out).
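The concentration figures quoted above are shares of total consumption captured by the top N titles. A minimal Python sketch of that computation follows; the play counts are invented and the function name `top_n_share` is ours, not taken from the study.

```python
def top_n_share(play_counts, n):
    """Share of total consumption captured by the n most-played titles."""
    counts = sorted(play_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:n]) / total


# Hypothetical play counts for a six-title catalogue.
plays = [500, 300, 100, 50, 30, 20]

print(top_n_share(plays, 2))  # 0.8
```

On real data the same function would be called with `n=1000` on the full list of titles for each format.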

This trend towards lower concentration is also observed in the US. Measured in terms of top 50 tracks and top 5, top 15 and top 25 artists, the share of total streams decreased from 2016 to 2018, as this article shows.

In this article ("Niche is the new mainstream"), Midia Research argues that in moving from the old model (mainstream linear broadcast) to the new one (hypersegmentation), "a few big hits for everyone are being replaced by many, smaller hits for individuals". This is indeed confirmed by the decrease in concentration shown by the figures above.

This study ("The effects of music recommendation engines on the filter bubble phenomenon", by David Allen, Jeremy Campo, Evin Ugur, Henry Wheeler-Mackta) offers an insight into diversity in the US. It compares how people use a smart radio (Pandora) with an online FM/AM radio aggregator (TuneIn). It shows that music discoveries were more frequent on Pandora, that music discoveries were more diverse on TuneIn, and that discoveries on Pandora were much closer to participants’ initial taste. It concludes that smart radios create more filter bubbles, though we would be cautious about the metric used to measure diversity (a mood variable with only 5 classes).

The streaming format is characterized by:
a high number of entry points to start a music experience
a higher frequency of playlist renewal
context-based playlists ushering in more diverse genres
people not listening to the same playlists
personalization generating singular navigations in music catalogues
As a result, we believe that streaming has a positive impact on diversity: less concentration on the top titles, higher variety and disparity, and probably, with more daring recommendation systems, higher balance. This ought of course to be confirmed by proper research.