Explainability in Music Recommender Systems

The most common way to listen to recorded music nowadays is via streaming platforms, which provide access to tens of millions of tracks. To assist users in effectively browsing these large catalogs, the integration of Music Recommender Systems (MRSs) has become essential. Current real-world MRSs are often quite complex and optimized for recommendation accuracy. They combine several building blocks based on collaborative filtering and content-based recommendation. This complexity can hinder the ability to explain recommendations to end users, which is particularly important for recommendations perceived as unexpected or inappropriate. While pure recommendation performance often correlates with user satisfaction, explainability has a positive impact on other factors such as trust and forgiveness, which are ultimately essential to maintain user loyalty. In this article, we discuss how explainability can be addressed in the context of MRSs. We provide perspectives on how explainability could improve music recommendation algorithms and enhance user experience. First, we review common dimensions and goals of explainability in recommender systems and, more generally, in eXplainable Artificial Intelligence (XAI), and elaborate on the extent to which these apply -- or need to be adapted -- to the specific characteristics of music consumption and recommendation. Then, we show how explainability components can be integrated within an MRS and in what form explanations can be provided. Since the evaluation of explanation quality is decoupled from pure accuracy-based evaluation criteria, we also discuss requirements and strategies for evaluating explanations of music recommendations. Finally, we describe the current challenges for introducing explainability within a large-scale industrial music recommender system and provide research perspectives.

maintaining users' trust in the system. Therefore, equipping MRSs with capabilities to provide explanations to their users is of mutual interest.

Characteristics of music consumption and music recommender systems
While music recommendation shares some properties with other media recommendation tasks, such as videos or movies, there are also pronounced differences. Among those identified in the literature (e.g., [99]), the following characteristics are relevant for explainability in MRSs, as we will elaborate in the subsequent sections:
• The duration of item consumption is commonly much shorter than in other domains: songs typically last a few minutes, whereas watching a movie, reading a book, or spending a holiday takes much longer.
• Music data comes in manifold representations, including audio, MIDI, and textual metadata (e.g., editorial metadata but also user-generated tags). Furthermore, music-related data that can be leveraged in MRSs is highly multimodal and includes images (e.g., album covers) and videos (e.g., music video clips) in addition to audio and textual metadata.
Finally, user feedback is collected from various activities (e.g., likes, favorites, song skips).
• The listening context strongly affects music preferences [48]. For instance, the listener's mood, location (e.g., consumption at home vs. while commuting), social situation (e.g., alone vs. with friends), and other aspects have been shown to influence musical needs and demands [35,91].
• Music is often consumed sequentially, i.e., as tracks in a listening session or playlist. Therefore, for music, we often focus on sequential recommendation tasks, such as automatic playlist creation or continuation [13,117], that leverage both long-term and short-term user preferences.

Common music recommendation tasks and methods
Various use cases of MRSs exist, centered around different tasks. Among these, the most important ones are front page recommendation (recommending content for thematic collections of music, also known as shelves or channels, presented to the user on the front page of the platform's user interface) [11], music exploration/discovery (e.g., based on item similarity in terms of melody, rhythm, or lyrics) [41,60], automatic playlist generation (commonly based on the user profile, but possibly only based on a seed description such as "music to relax"), and automatic playlist continuation (based on a sequence of seed tracks) [50,117].
To create a music recommendation engine, a variety of methods are adopted, depending on the use case. These include latent factor models (e.g., singular value decomposition [56] or factorization machines [68]), graph mining techniques (e.g., random walks [40] or graph embeddings [84]), and deep learning-based techniques (e.g., convolutional neural networks [109], recurrent neural networks [47], or autoencoders [66]). Furthermore, techniques from audio signal processing and natural language processing are often used to create vector representations of music items or to annotate music items with relevant tags [33,49,82].
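To make the latent factor idea concrete, the following minimal sketch factorizes a synthetic user-item play-count matrix with a truncated SVD and scores unobserved tracks; all numbers are invented for illustration, and production systems would rather use implicit-feedback formulations with explicit handling of missing entries.

```python
import numpy as np

# Toy user-item play-count matrix (4 users x 5 tracks); zeros are unobserved.
R = np.array([
    [4., 0., 0., 5., 1.],
    [5., 5., 4., 0., 0.],
    [0., 0., 0., 2., 4.],
    [3., 3., 0., 0., 5.],
])

# Truncated SVD: keep k latent factors to obtain user and item embeddings.
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_emb = U[:, :k] * s[:k]          # (n_users, k)
item_emb = Vt[:k, :].T               # (n_items, k)

# Predicted affinity of every user for every item.
scores = user_emb @ item_emb.T

# Recommend, for user 0, the unobserved item with the highest predicted score.
unobserved = np.where(R[0] == 0)[0]
best = unobserved[np.argmax(scores[0, unobserved])]
print(f"Recommend track {best} to user 0 (score {scores[0, best]:.2f})")
```

The low-rank factors double as the "latent taste" representations that later sections try to explain.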
In this article, we discuss how explainability can be approached in MRSs, and we provide perspectives and outline challenges in this context. More precisely, we first review definitions and goals of explainability commonly adopted in RS research, and investigate to which extent they are applicable or need adaptation in the music domain (Section 2).
Subsequently, Section 3 reviews existing explanation types and describes the means through which explanations can be provided to the users, and the methods to integrate explainability capabilities into MRSs. How to evaluate the offered explanations in a music recommendation context is discussed in Section 4. Finally, taking an industry perspective, Section 5 describes the challenges MRS providers face when integrating explainability functionality into their real-world systems.

GOALS AND DIMENSIONS OF EXPLAINABILITY FOR MUSIC RECOMMENDER SYSTEMS
Recent years have seen an upsurge of interest in explainable recommendations, even though the concept already emerged in the 2000s [97]. This evolution of explainable RSs has been accompanied by an increasing popularity of eXplainable Artificial Intelligence (XAI), with which it shares roots, approaches, and terminology. XAI represents the convergence of many research disciplines, including computer science, human-computer interaction, philosophy, and psychology. Coherent and stable XAI definitions and terms have started to appear only recently [8,43,67]. Meanwhile, RS research has developed explanation-related concepts that are unknown to general XAI, some of which, however, rest upon elusive descriptions. This specificity is probably due to the nature of RSs themselves, which differ w.r.t. their tasks, inputs, and results from general trends in XAI. Linking these two explainability realms would not only result in a more standardized approach to explanations in RSs but also in a direct application of methods from XAI to MRSs.
In this section, we review definitions and concepts of explainability in RSs. Subsequently, we compare and connect them with the ones of XAI. Note that this is not a survey of XAI or explainable RSs as other valuable resources exist on this matter [8,38,43,81,105,118].

Definitions and goals of explainability for MRS
What does it mean to explain a recommendation? Within the RS field, Tintarev et al. [106] address this question with "to make clear by giving a detailed description", and Zhang et al. [118] with "an explainable recommendation aims to answer the question of why". We can thus discern a role of explanations as complementary information to the recommendation. But these definitions are limited; for instance, ensuring fair recommendations involves tracing the "why" of a recommendation, but only regarding certain critical aspects (e.g., potential gender biases), and it does not tell us how to act upon them. As we develop next, "complementary information" and "fair recommendations" shape two of the many facets of explainability.
Borrowing general ideas from recent harmonization efforts of XAI terms, it is more convenient to distinguish between explanation objects and goals. In particular, explanations designate the result of an explanation system; they form an "interface between the system to explain and a target audience" [43]. Quite interchangeably with explainability, we will use the term interpretability, with a more passive characteristic: a system can be interpretable (e.g., decision trees are often interpretable, neural networks are not). The opposite notion is often referred to as blackboxness. We stress that automatically concluding that trees and linear regressions are interpretable and that neural networks are not is questionable. As we will see next, this depends on a precise formulation of explanation tasks that do not admit one-size-fits-all rules.
The previous mention of "audience" is essential, since a given explanation type may only convey meaningful information to specific people. In RS research, the target audience of explanations is usually end users, as they are the targets of the recommendation decision and could be skeptical about it. Nevertheless, other stakeholders may be interested in receiving explanations; e.g., system designers and data scientists may inquire whether their system bases its decisions on discriminatory biases from the data.
We shall continue with a cautionary tale: the disparate notions of explainability have led to many misuses of XAI [67]. Because we usually do not have access to ground-truth explanations in the wild, and realistically will not in industrial contexts, many XAI works have relied on intuitive notions of what their target explanations should be. This first makes evaluation difficult. As Doshi-Velez [31] highlights, the relevance of explanations is often suggested in a "you'll know it when you see it" fashion, which paves the way to many confirmation biases. Second, several counterintuitive results have been unveiled. For instance, the widely agreed-upon idea that an interpretable model is more desirable than a blackbox one has been challenged: produced explanations -- similarly to model predictions -- may be misleading or biased [3,28,29,54,94]. In addressing these questions, the concept of incompleteness was proposed. Its purpose is to characterize the "missing piece" justifying the use of an explanation system [31]. Here, the literature of explainable RSs and general XAI diverges.
A distinction of the goals of RS explanations is proposed in [106], which delineates seven of them. We can enrich this discussion with goals identified in general XAI by Arrieta et al. [8]. Both sets of goals are displayed in Figure 1 with short definitions. We find that neither of the two may solely account for all MRS purposes: explainable RS goals mostly fall into the informativeness category, which has a broader scope than RS transparency, a notion that feels too focused on the decomposition of models' inner mechanisms. Furthermore, RS goals have been found to be intercorrelated [9], with satisfaction, in particular, being arguably a desired byproduct of any explanation method. That said, persuasiveness is a strong dimension of RSs that is absent from general XAI [32]; when aiming at transparency, creating a persuasive system may appear contradictory.
Identifying those goals is crucial because explanations may simply not be needed if incompleteness is not an issue.
Evaluation should then be conducted with regard to each targeted incompleteness, to avoid mismatched objectives.
Lastly, the concept of understandability simply bridges the gap between a chosen XAI system and this new notion of goal/incompleteness being addressed for a target audience. All these notions are illustrated and placed accordingly in Figure 2. We discuss additional taxonomic axes for XAI in subsequent paragraphs.
As a final note, explainability can be framed through an interesting take from Michael Jordan on the future of machine learning: the goal of XAI is not only for decision-makers to understand model predictions, but to allow a back and forth interaction between the two.
Fig. 2. Overview of XAI notions. In the upper part, an ML model is trained on data and used to make predictions. Beyond prediction, if the model alone is insufficient w.r.t. an underlying human-grounded application, the use of an XAI method will be justified. The specification of the target audience delineates the incompleteness to be addressed through explanations, along different explanation axes (lower part).

Local/Global scope
As MRSs commonly provide numerous recommendations, there is a focal distinction to make in the explanation scope: local vs. global [31]. Local or instance-wise explanations target the decision of the model for a specific input-recommendation pair, e.g., explaining that a track was recommended to an end user because some of its features matched the user's profile. Local explanations must be tailored to each individual prediction. This type of explanation is aligned with the European General Data Protection Regulation (GDPR) "Right to explanation" [85], which entitles users to inquire about the reasoning behind the outcome of an algorithm, hence supporting informativeness as an explainability goal.
In contrast, global explanations provide a big picture of the model logic, covering multiple model decisions. For instance, estimating clusters of machine-learned user embeddings may help rationalize the behavior of the MRS within several general communities. This broad view of the model is necessary to detect systematic biases of the model (addressing fairness goals) and to examine whether a model is suitable for deployment (addressing informativeness, trustworthiness, and confidence). Lastly, note that the two types may be linked: it is sometimes relevant to craft a global explanation by providing multiple local explanations [92].
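As a toy illustration of this global view, one might cluster user embeddings and then summarize the system's behavior per community. The sketch below uses synthetic two-dimensional embeddings and a minimal k-means; the two "communities" are an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume machine-learned user embeddings (here: two synthetic communities).
users_a = rng.normal(loc=[0., 0.], scale=0.3, size=(50, 2))   # e.g. one listener community
users_b = rng.normal(loc=[3., 3.], scale=0.3, size=(50, 2))   # e.g. another community
X = np.vstack([users_a, users_b])

# Minimal k-means (k=2) to recover global communities from the embeddings.
centroids = X[[0, -1]].copy()
for _ in range(10):
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(2)])

# A global explanation could then describe the MRS behaviour per community,
# e.g. by inspecting the items most often recommended within each cluster.
print("community sizes:", np.bincount(labels))
```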

Intrinsic/Post-hoc interpretability
We can also distinguish explanation systems w.r.t. whether interpretability should be an inherent part of the RS (intrinsic interpretability) or should be provided as an addition to an already working RS (post-hoc interpretability).
Intrinsic interpretability refers to the ability of the RS to provide sufficient information to make its inner functioning clear to a specific audience [8]. In this case, the explanations coincide with the model. Being inherent in the model, intrinsic interpretability has to be planned in advance, making it a component of the model design. For instance, an Item-k-Nearest Neighbours model recommends artists because they are similar to the ones the user listened to, thus allowing explanations such as "We recommend you <artist> because it is similar to <artist(s)>".
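A minimal sketch of such an intrinsically interpretable item-kNN recommender follows; the artist names and embeddings are invented for illustration.

```python
import numpy as np

artists = ["Radiohead", "Portishead", "Massive Attack", "Slayer", "Muse"]

# Hypothetical item embeddings (e.g. learned from co-listening data).
emb = np.array([
    [0.9, 0.1], [0.8, 0.3], [0.7, 0.4], [0.0, 1.0], [0.85, 0.2],
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

def recommend_with_reason(liked):
    """Item-kNN: recommend the nearest neighbour of the liked artists, with the
    neighbourhood itself serving as the (intrinsic) explanation."""
    sims = emb @ emb[liked].T            # cosine similarity to liked artists
    sims[liked] = -np.inf                # do not re-recommend liked artists
    best = int(np.argmax(sims.max(axis=1)))
    anchor = liked[int(np.argmax(sims[best]))]
    return f"We recommend you {artists[best]} because it is similar to {artists[anchor]}."

print(recommend_with_reason(liked=[0]))
```

The explanation falls out of the model itself: no separate explainer is needed.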
Post-hoc or extrinsic interpretability refers to the use of external XAI to yield knowledge from a blackbox model. It can be considered as reverse engineering the model [43]. For example, the recommendations of a blackbox model can be explained by making a post-hoc selection of the relevant features that lead to the recommendation; they offer explanations such as "We recommend you this because it has <feature(s)> you may like". Both intrinsic and post-hoc views are affiliated with the concept of transparency, thus supporting informativeness, causality, and confidence.
However, post-hoc explanations are hampered by their externalness and require an additional check of their faithfulness to the studied model. Yet, compared to intrinsic interpretability, they disentangle model design from explanation design, making it possible to consider XAI systems at a later stage, or to apply them to already working models.

Un/supervised explanations
We often think of XAI methods as being unsupervised. Particularly on the end-user side, it is arduous to guess what the ground-truth explanations for the user could be, since their judgment of what a good explanation is may be biased [75]. Nevertheless, target explanations are sometimes available [9]. But this does not simply turn explanation into a supervised prediction task: our goal is not only to make explanation predictions but to address an incompleteness; the relevance of the target predictions thus has to be questioned. Do these really address our needs w.r.t. incompleteness? Or are they a proxy for it? In the latter case, how do we assert/evaluate their understandability w.r.t. our goal? We present two ideas from XAI for supervised explanations in the image domain that could be applied to MRSs.
In the field of image classification, some datasets gather images with textual descriptions. Each set of words can be matched against corresponding visual aspects in the images, enabling the generation of visual explanations for class predictions of unseen instances through RNN-generated texts [45]. The explanations are evaluated against held-out test descriptions. Here, the concept of explanation is driven by two desiderata: first, as a way to link different modalities of the same object (image and text), and second, as a rationale that conveys useful information by yielding class-specific information that differentiates it from other classes. Obtaining this informative, discriminative quality is tricky in an unsupervised setting. The multimodality of music data (e.g., audio, lyrics, users, playlists) makes it a good candidate for this paradigm.
We can identify another line of supervised explanations as linking different conceptual levels. The TCAV method [55], for instance, allows checking predictions against human-understandable concepts, e.g., how much the model prediction for an image of a zebra is sensitive to "stripeness". Again, there is an interesting link to music: there is a known and unresolved semantic gap between low-level data (i.e., the audio signal) and its correspondence to high-level descriptions (e.g., genre, mood) [17].
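A rough numerical sketch of the TCAV idea, transposed to music, might test a hypothetical "folk" tag prediction for sensitivity to an "acoustic" concept. Everything below is a synthetic assumption: TCAV fits a linear classifier to obtain the concept direction, whereas this sketch uses a crude mean-difference stand-in, and the model is reduced to a linear prediction head.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical internal activations of a music tagger (dimension 4).
# Concept examples: activations of tracks annotated "acoustic" (assumed data).
concept_acts = rng.normal([1., 0., 0., 0.], 0.1, size=(30, 4))
random_acts = rng.normal([0., 0., 0., 0.], 0.1, size=(30, 4))

# Concept direction separating concept from random activations
# (TCAV fits a linear classifier; the mean difference is a crude stand-in).
cav = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
cav /= np.linalg.norm(cav)

# Toy linear prediction head for the tag "folk": logit = w . h,
# so its gradient w.r.t. the activations is simply w.
w = np.array([0.8, 0.1, -0.2, 0.05])
sensitivity = float(w @ cav)   # directional derivative along the concept

print(f"'folk' sensitivity to the 'acoustic' concept: {sensitivity:.2f}")
```

A large positive sensitivity would suggest the tag prediction relies on the concept, which is exactly the kind of bridge across the semantic gap mentioned above.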

Model/Data
We conclude this section with a paramount yet subtle distinction that is prone to be overlooked: are the explanations related to the RS model processing or to the data it represents?
Model explanations, on one side, focus on the learned representation and parameters and aim at making sense out of them. With a mild exaggeration, to the question "why is this track recommended by the MRS given my history?" a model-focused answer of an RS might be "it maximizes the probability of being co-listened with your history, considering all other users' listening histories". Data explanations, on the other side, would rather focus on "why are those items co-listened in the first place?". The trained model by itself is less interesting than the goal of uncovering "natural mechanism[s] in the world" [19]. In practice, in the first case, the model inspection may expose irregularities and lead to adjusting its architecture and regularization (e.g., balancing fairness trade-off parameters); in the second case, the model plays the role of a proxy representation of the data, detected errors would more suitably be attributed to a misrepresentation of the input data (e.g., feature engineering for a better matrix factorization), and the ultimate goal is to find a structure that is credible given prior knowledge of the problem.
These aspects are often entangled. Explaining the model provides little information with noisy data, and explaining the data may be misleading if the model assumptions do not capture salient aspects (e.g., correlation instead of causation).
It is a widespread fallacy to explain a model (which is often easier, particularly when using transparent models) when the true underlying objective is to explain data. As a corollary, critics of XAI often oscillate between "the method is unreasonable for explaining the model" (e.g., randomizing the model's weights does not change the explanation [3]) and "the produced explanations, though relevant for the model, do not make sense for humans", without explicitly mentioning this duality [90].

MAKING MUSIC RECOMMENDER SYSTEMS EXPLAINABLE
In the previous section, we have drawn links between explainability in RSs and XAI, and presented different definitions.
Bearing these definitions in mind, we now study different ways MRSs can be made more explainable. We start with a general overview of possible explanation methods for MRSs, then discuss the adaptability of three relevant explanation paradigms to MRSs.

Overview of explanation methods for MRSs
We want to provide the reader with a short background on existing explanation methods for RSs, and then discuss how the latter are particularized for MRSs.
Explanations of RSs. Zhang and Chen [118] characterize six RS explanation types. First, relevant item or user explanations, also called example-based explanations, are closely tied to item-based or user-based collaborative filtering [2].
Thus, a recommendation is motivated either by the similarity of the item to other items previously liked by the user, or by the affinity that similar users have towards the recommended item.
Second, there are feature-based explanations, which are associated with content-based recommendation algorithms.
Explanations are commonly shown as tags relevant to a user or an item [118]. Opinion-based explanations focus on relevant aspects of the recommended item [112,119], which can be enriched with a sentiment [119]. In contrast to feature-based explanations leveraging item metadata or user profiles, opinion-based aspects are mined from reviews or social media posts.
Further, we also distinguish sentence, visual, and social explanations. Sentence explanations can be predefined templates with placeholders regarding features or aspects/opinions, filled on-the-fly depending on the recommendation or specific user (e.g., "We recommend this item because its [good/excellent] [feature] matches with your …") [112]. Alternatively, sentence explanations can be generated from scratch using language models trained on reviews [25]. Visual explanations appear as images or visual elements often accompanied by text [7,21]. Image regions or caption words that explain the recommendation could be highlighted [21]. Social explanations mention either the user's friends who liked the recommended item [102] or their overall number.
Extension to MRSs. Explanations in MRSs have multiple specificities. First, they can be based on audio. As voice assistants are becoming increasingly popular in music consumption [100], researchers have been looking into how to augment recommendations with audio music explanations. One line of work proposes listenable explanations [10], inspired by radio shows in which hosts provide information about played tracks to create transitions. Alternatively, item parts such as track snippets focusing on a particular audio source (e.g., an instrument or voice [72]) can be emphasized as reasons for recommendation.
Second, whenever recommendations are provided as collections of items (e.g., playlists), explanation generation can be modeled as playlist captioning (i.e., the automatic generation of a title and/or a description of the playlist) [23] or playlist story generation [10]. Existing work usually relies on predefined textual templates [10].
Third, music explanations are rarely informed by a unique data source. Knowledge Graphs (KGs) are constructed from external sources and used for explanations [83]. Information sources leveraged in existing work are: user-generated text such as music descriptions [120], existing knowledge bases like MusicBrainz or Wikipedia [77], tags describing items or users [62,120], social information such as users' friends [62,102], audio features [7], and pre-trained tag embeddings [7]. We next discuss in detail feature-based explanations (Section 3.2), example-based explanations (Section 3.3), and graph-based explanations (Section 3.4). We refer to Figure 3 for examples of each explanation type.

Feature-based explanations
MRSs rely on multi-modal information (or features) in order to provide personalized recommendations to users. It is therefore legitimate to ask which features are most responsible for the generated recommendation. For instance, such an explanation may be "We recommend you this song because it is '90s rock, a combo of era and genre you enjoy listening to.", where the genre and era represent the relevant features.
Relevance. Feature-based explanations are only relevant if the features are themselves interpretable. Furthermore, feature selection is an NP-hard problem [80], and real-world applications necessarily rely on feature assumptions, e.g., a limited number of interacting features [20,69], group or structure coherence [4,121], feature independence [92], or first-order approximations [103].
Applications. Frequently, the considered features are selected and ranked through a relevance score. More than just the top-contributing features, displaying or visualizing all the scores is a common practice among data scientists, acting as an encompassing explanation [54], though the information overload may be misleading [89]. Note that "relevance" for a feature is a polysemous term that inherently depends on the selection method used. As an illustration, both SHAP [70] and L2X [20] assign relevance scores to single features; however, while SHAP expresses relevance in terms of marginalized contributions of features across all possible subsets, L2X encodes relevance as a notion of informativeness on the response variable through maximizing mutual information. We refer to [12,26,104] for surveys and further details on selection methods.
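As a simpler stand-in for such methods, permutation-based relevance conveys the general idea of scoring features by their effect on the model output. The scoring function and features below are synthetic, chosen so that the first feature dominates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy content-based scorer: affinity depends mostly on the first feature
# (say, "tempo"), weakly on the second ("acousticness"); the third is ignored.
def score(items):
    return 2.0 * items[:, 0] + 0.3 * items[:, 1] + 0.0 * items[:, 2]

items = rng.normal(size=(500, 3))
baseline = score(items)

# Permutation relevance: how much does shuffling a feature change the scores?
relevance = []
for j in range(items.shape[1]):
    shuffled = items.copy()
    shuffled[:, j] = rng.permutation(shuffled[:, j])
    relevance.append(float(np.mean((score(shuffled) - baseline) ** 2)))

print("relevance per feature:", [round(r, 3) for r in relevance])
```

Note that this, like the methods above, measures relevance for the model, not causal relevance in the data, echoing the caveats raised in the Evaluation paragraph.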
Applied to MRSs, feature explanations may be related to users, to items, to the context, or to a combination of these.
User features stretch from reasonably static characteristics (e.g., country of origin, age group, personality) to constantly-changing traits (e.g., tastes, recent interests, mood). These features offer a fertile ground for tailored recommendations, and thus tailored explanations such as "We recommend you this track because it suits your current emotional state" or "... because of your country of origin". However, the effectiveness of these explanations may be hindered by unreliable estimates of some user variables, notably dynamic ones. Fairness-wise, societal biases in RSs often stem from the usage of sensitive user features, and an analysis of their impact on recommendations is crucial to be able to mitigate them.
Explanations involving item features (e.g., tempo, genre) are strongly tied to content-based recommendation [118], as such features directly match the user's preference profile (e.g., "We recommend you this because it has the tempo/genre you like").
Lastly, miscellaneous context features may be suited for generating personalized explanations. Time and location, being the most popular ones, provide a sound contextualization for the recommendation, such as "This techno masterpiece is perfect for tonight's Friday party!" or "Since you are doing home office these days, we recommend you this 'Work from Home' playlist".
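Filling such sentence templates from feature relevance scores can be sketched in a few lines; the feature names, scores, and template wording are illustrative assumptions.

```python
# On-the-fly template filling: the placeholder is replaced by the feature
# with the highest (hypothetical) relevance score for this user-item pair.
feature_scores = {"genre: rock": 0.62, "era: 1990s": 0.25, "mood: energetic": 0.13}

TEMPLATE = "We recommend you this track because its {feature} matches your profile."

top_feature = max(feature_scores, key=feature_scores.get)
message = TEMPLATE.format(feature=top_feature)
print(message)
```

In practice, the relevance scores would come from a feature selection method such as those discussed above, and the templates would be curated per locale and surface.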
Evaluation. Feature-based explanations are tied to a chosen definition of relevance. Their approximations can be compared to full computation results, when affordable. However, it may be tricky to evaluate whether the relevance scores themselves translate to true relevance. What is relevant for a trained model indeed reveals correlated events in the data, with the risk of returning spurious relations instead of causal truths about the data. As another pitfall, many feature selection methods do not handle intercorrelated features well [116], which are, however, common in MRSs.

Example-based explanations
In MRSs, example-based explanations are a very common explanation type that can be reduced to the use of the sentence template "We recommend you <this new item> because of <its similarity> to <meaningful item(s)>".
They are conceptually tied to case-based RSs [15].
Relevance. Similarly to feature-based explanations, they are only relevant if the given examples are themselves interpretable for the target audience. This includes returning items that are known to the user, e.g., from a set of liked or previously interacted-with items, or from broadly known items.
Applications. Regarding example types, it is common to see artist examples, as they convey a general sense of genre, temporal period, or style. Relevant-user examples were popular in past decades, as they have the interesting social twist of fostering users' curiosity to find peers with similar tastes. They have gradually vanished from most music and video streaming platforms since they were found less convincing and accurate than item-based explanations, and in turn may have a negative impact on trustworthiness [46]. Nevertheless, limiting social explanations to close circles was found more relevant (e.g., "recommended tracks recently discovered by your friends"). Beyond textual modalities, explanations in MRSs include displaying album covers, which may convey information about the style or even allow users to recognize record labels (e.g., Deutsche Grammophon, Blue Note). Short audio thumbnails are also a promising way to provide explanations that cannot otherwise be expressed with words [72].
As for similarity relations, we note they may not be explicitly stated in the explanation, or in some cases cannot even be stated. This is particularly true for RSs basing their recommendations on co-listening data, with the same causality counterpoints as before: a co-listening may be coincidental, confounded by external factors, or, more pragmatically, may result from noisy metadata and inattentive users. With deep learning models that compute non-linear similarity metrics (e.g., the NeuCF method [44]), it gets trickier, as we are faced with an added blackboxness issue.
This can lead to explanation examples that feel cryptic to the user. Recent works in KG-based recommendations are a way to alleviate this issue; we discuss them in Section 3.4. Another lead lies in the disentanglement of the embeddings' latent dimensions, which helps rationalize proximity according to explicit concepts (e.g., audio features, genre, instrumentation) [65]. Attention-based mechanisms are also a promising way of providing recommendations based on the selection of a reasonably small and contextual subset of neighbors [53], though claims on the interpretability of attention are disputed [101].
Evaluation. It is useful to evaluate the discriminativeness of examples. Indeed, example-based explanations are affected by popularity biases, which hampers informativeness. As an illustration, "The Beatles" are streamed by many users with diverse profiles, thus appearing in many co-listening relations, and are likely to emerge as similar neighbors of many items, which makes them a poorly discriminative example. Then, one should distinguish items coming from an explicit elicitation (e.g., liked artists) from implicit preferences.
The former are often meaningful to users but may make them feel trapped in a recommendation bubble, while the latter are more diverse but potentially lack a direct connection to users, affecting trust in explanations.
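One simple way to act on the discriminativeness issue is to discount co-listening counts by item popularity, in the spirit of IDF weighting; the counts and artist pairing below are invented for illustration.

```python
import numpy as np

# Hypothetical co-listening counts between a seed track and candidate
# example artists, plus each artist's overall stream count (popularity).
artists    = ["The Beatles", "The Kinks", "Small Faces"]
co_listens = np.array([300., 250., 80.])
popularity = np.array([1e6, 4e4, 1e4])

# Raw co-listening favours the globally popular artist...
raw_pick = artists[int(np.argmax(co_listens))]

# ...while an IDF-style popularity discount favours a more discriminative example.
discounted = co_listens / np.log1p(popularity)
disc_pick = artists[int(np.argmax(discounted))]

print(f"raw: {raw_pick}, discounted: {disc_pick}")
```

The discounted pick is less universally popular but says more about the seed track, which is what a discriminative explanation should do.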
Examples can also be useful for persuasiveness goals. It may be interesting, for instance, to provide a set of examples that include an item that is well-known to the user (acting as a hook) and unknown or weakly interacted ones (acting as discoveries). This principle is quite common in radio "clock" programming, where alternating power songs and discoveries has been shown to be a powerful tool to keep users engaged.

Graph-based explanations
Canonically, RSs match users and items. It is therefore not surprising that graph-based approaches on bipartite graphs can be used, with users on one side and items on the other. The recommendation task may indeed be framed as link prediction: given the observed interaction links between users and items, which unseen links are probable? Likewise, similar item recommendation can be formulated as the task of finding probable nearest neighbors in a graph of items.
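The link-prediction framing can be sketched with a common-neighbours heuristic on a toy bipartite interaction matrix, a deliberately minimal stand-in for the learned models discussed in this section.

```python
import numpy as np

# Toy bipartite interaction matrix: rows are users, columns are tracks.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# item_cooc[i, j] counts users who listened to both tracks i and j;
# scores aggregates these counts over each user's listening history,
# i.e. it counts short paths from a user to a candidate track.
item_cooc = A.T @ A
scores = A @ item_cooc
scores[A > 0] = -np.inf   # mask links that are already observed

pred = int(np.argmax(scores[0]))
print("predicted next track for user 0:", pred)
```

An appealing side effect is that the paths behind a predicted link are themselves explanation material ("listeners of your tracks also stream this one").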
Relevance. This framing can seem cumbersome, a first strong hindrance being that graph methods quickly get computationally expensive -- though some works have demonstrated industrial-scale applicability [115]. Second, notions of repeated music consumption, preference decay, and accounting for a temporal dimension in sequential recommendations are tricky to incorporate into graphs. Nevertheless, graph methods possess an outstanding expressive power, especially for multi-relational data, enabling abundant new RS applications.
Further justification lies in the graph structure naturally found in MRS data. Vertically, there is a natural hierarchy of musical items: tracks are organized into albums, which are themselves children of artists, which can in turn be regrouped into genres, styles, and time periods, or any complex multi-leveled music ontology. Users exhibit a similar hierarchy: we can often assign them to several clusters of interest, themselves linked to a given culture, country, or age category.
Horizontally, music item clusters act as islands of connected components, with central nodes being representative of a given style and having influence on surrounding artists. Weakly connected nodes denote niche artists, and nodes in-between clusters fuse several influences. The same reasoning may apply to users' communities and hierarchies.
Applications. Graph analysis tools can be used to analyze node and edge structure. Detecting cliques and using degeneracy can help represent communities. Tripartite and, more generally, n-partite formulations generalize canonical recommendation by handling more actors than just items and users, for example artists and context. Directional edges can be leveraged to create graphs with asymmetrical relations and avoid recommending niche artists as similar to very popular ones [95]. Graph-specific embedding techniques may also be applied, e.g., using random walks to train embeddings on more diverse sequences than the observed data [42]. Other approaches are promising, such as analyzing graph structure for domain transfer, applying the traveling salesman problem to find fluid playlist track orderings, or continual learning by framing the addition of new users and items as new nodes that should not perturb far-away regions of the graph. All these tools address the transparency issue, leading to more interpretable models.
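The random-walk idea can be sketched as follows; this is a bare-bones uniform-walk illustration (in the spirit of DeepWalk-style approaches) over a made-up toy graph, and the resulting sequences would then feed a skip-gram model to train embeddings.

```python
import random

def random_walks(adj, walk_len=4, walks_per_node=2, seed=0):
    """Generate uniform random walks over an item graph; the resulting
    sequences can be fed to a skip-gram model to train embeddings on
    more diverse orderings than the observed listening sessions."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_len - 1):
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}  # toy item graph
for walk in random_walks(adj):
    print(" -> ".join(walk))
```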
Interpreting structure is mostly useful for an audience of researchers, but recent advances in the field of knowledge-graph-based recommendations show additional promising applications for end-users. The term knowledge graph (KG) refers to the use of external expert knowledge to better understand the entities at hand within a RS task and how they relate to one another [30]. In the context of MRSs, available knowledge may include intra-music relations (e.g., "is sung by", "has music label", "belongs to genre") and collaborative information (e.g., "often streamed with", "user taste belongs to cluster").
KGs can be applied to enhance the representation of items before recommendation. For instance, the latent space that is usually learned to compactly represent items can be structured to align with each item's relations in the graph [14,114]. However, RSs may still fail to leverage the full power of KGs when relying solely on these enhanced representations.
Instead, another approach is to directly incorporate KGs into the recommendation computation, which allows multi-hop reasoning. For that, all paths (up to a fixed maximum length) between a user-item pair can be extracted and their relevance estimated [113]. This makes it possible to produce explanations corresponding to paths of high probability (e.g., "this track is recommended to you because it is similar to A, which you listened to before, and is sung by an artist belonging to the same indie music label as B"). For a complete survey of KG methods, we refer to [51].
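A minimal illustration of path extraction between a user and a candidate item in a toy KG; the entities, relations, and depth-first enumeration are illustrative assumptions, and a real system would additionally learn to score each path's relevance.

```python
def kg_paths(edges, source, target, max_len=3):
    """Enumerate labeled paths of up to max_len hops between a user and a
    candidate item in a small knowledge graph; each path is a candidate
    explanation whose relevance would be estimated by a learned model."""
    adj = {}
    for head, rel, tail in edges:
        adj.setdefault(head, []).append((rel, tail))
    paths, stack = [], [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target and len(path) > 1:
            paths.append(path)
            continue
        if (len(path) - 1) // 2 >= max_len:  # hops taken so far
            continue
        for rel, nxt in adj.get(node, []):
            if nxt not in path[::2]:  # do not revisit entities
                stack.append((nxt, path + [rel, nxt]))
    return paths

# Toy KG mirroring the indie-label example above.
edges = [
    ("user", "listened", "trackA"),
    ("trackA", "sung_by", "artist1"),
    ("artist1", "signed_to", "indie_label"),
    ("indie_label", "signs", "artist2"),
    ("artist2", "sings", "trackB"),
]
for path in kg_paths(edges, "user", "trackB", max_len=5):
    print(" -> ".join(path))
```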
Evaluation. The efficiency of these techniques is conditioned on a good modeling of the involved entities (i.e., nodes and links), deep knowledge engineering, and an accurate estimation of the paths' relevance while ensuring their interpretability. As a counter-example, a generic "similar to" relation in a KG does nothing for informativeness, as it remains black-box information no matter how transparent the relations before and after it in the path.
Multi-hop reasoning, as permitted by graphs, is a great opportunity to enhance discovery, which is known to impact the effectiveness of and satisfaction with RSs [16]. But this requires crafting new metrics for relevance evaluation, which is still an open research topic [37].
KGs are also a promising lead for causality, as they make it possible to model and estimate causal structures in the data.

Perspectives
Drawing inspiration from the recent success of GANs [39], we could consider generative explanations in MRSs. In particular, assuming the audio content is available, a GAN-generated explanation may provide a listenable account of what the user's tastes are like according to the model. Indeed, the explanation may be conditioned on some priors [76], e.g., what the user likes about metal or jazz, to provide reasonable explanations. However, these types of explanations are hampered by the demanding resources required to generate audio [27].
Another interesting direction is exploiting human concepts of musical understanding [22,55]: for example, understanding how much the concept of 'rock' or 'happy' matters for the recommendation to a specific user. Beyond informativeness, this may also uncover biases in the datasets (e.g., how much the concept of male artist matters for the recommendation).
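One simple way to probe such concepts, sketched below under strong simplifying assumptions (hand-made two-dimensional embeddings and a linear recommender), is a concept-activation-style direction computed as a difference of mean embeddings.

```python
def concept_sensitivity(user_vec, pos_items, neg_items):
    """Toy concept probe: the concept direction is the difference of mean
    embeddings of items with/without the concept (e.g., 'rock'); its dot
    product with the user vector indicates how much the concept drives
    this user's (linear) recommendation scores."""
    dim = len(user_vec)
    mean = lambda vecs, d: sum(v[d] for v in vecs) / len(vecs)
    cav = [mean(pos_items, d) - mean(neg_items, d) for d in range(dim)]
    norm = sum(c * c for c in cav) ** 0.5
    return sum(u * c / norm for u, c in zip(user_vec, cav))

# Hand-made 2-D embeddings: 'rock' items cluster on the right.
rock = [[1.0, 0.1], [0.9, -0.1], [1.1, 0.0]]
other = [[-1.0, 0.0], [-0.9, 0.2], [-1.1, -0.2]]
rock_fan, pop_fan = [0.8, 0.1], [-0.7, 0.3]
print(concept_sensitivity(rock_fan, rock, other) > 0)  # True: 'rock' matters
print(concept_sensitivity(pop_fan, rock, other) > 0)   # False
```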
Lastly, counterfactual or contrastive explanations not only pinpoint the causes of a model decision but also provide users with actionable levers to change the recommendation [75,108,111]. Among the explanation types, counterfactual explanations may be considered the most compliant with the GDPR [85], as they can provide a refined framework for fairness [64].
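A toy counterfactual search over a deliberately simplistic co-occurrence recommender (all names and data are hypothetical) illustrates the idea of finding a past interaction whose removal flips the recommendation.

```python
def recommend(history, cooccur):
    """Toy recommender: score candidates by co-listening counts with history."""
    scores = {}
    for item in history:
        for cand, w in cooccur.get(item, {}).items():
            if cand not in history:
                scores[cand] = scores.get(cand, 0) + w
    return max(scores, key=scores.get) if scores else None

def counterfactual(history, cooccur):
    """Find a single past interaction whose removal changes the top
    recommendation: 'had you not listened to X, we would not have
    recommended Y' -- an actionable, contrastive explanation."""
    top = recommend(history, cooccur)
    for item in history:
        reduced = [h for h in history if h != item]
        if recommend(reduced, cooccur) != top:
            return item, top
    return None, top

cooccur = {"metal1": {"metal2": 5}, "jazz1": {"jazz2": 2}}
cause, rec = counterfactual(["metal1", "jazz1"], cooccur)
print(f"Recommended {rec}; without {cause} the recommendation would change")
```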

EVALUATING EXPLANATIONS
Evaluating MRS explanations is paramount to assess whether the explanation goals (Section 2) are met by the explanation methods (Section 3). This is an inherently hard task, since it involves a multitude of factors, including the targeted goals of the explanation (e.g., assessing understandability), the type of explanation (e.g., whether the XAI method works as intended), and the underlying RS model (e.g., checking whether we are trying to explain meaningful recommendations in the first place). We have discussed some evaluation aspects in previous sections, specific to particular explanation dimensions and categories of methods. While there exists no one-size-fits-all evaluation strategy, in the following we provide some general guidelines, tailored to the target audience of the explanation.

Evaluating explanations from the end-user's perspective
Since RS explanations mostly target end-consumers, it is legitimate to involve them in the evaluation procedure. One straightforward way to evaluate such explanations is to conduct user studies [61,71] and assess whether the explanations address the targeted goals. We argued in Section 2.4 that an explanation ground-truth is an elusive concept. Nevertheless, user studies can provide cues about which explanation types are best suited to specific domains, investigate research questions (e.g., should we use explanations in visual or textual form?), and detect practical misuses [54].
In the context of MRSs, user studies showed that visual explanations increase understandability [7], while social or sentence explanations are more persuasive [102]. However, providing too many details results in cognitive overload and is negatively perceived [62]. Also, persuasiveness does not necessarily correlate with the value recommendations have for the user: for instance, a user following an artist recommendation because a friend likes it will not necessarily end up liking the artist. One suggestion to overcome this is to corroborate different types of explanations (e.g., social with feature-based explanations) [102]. Another solution is to enable conversations between user and system, so that recommendations can be gradually improved through the system's explanations and the user's feedback [120].
User studies in MRSs are typically either between-subject or within-subject. Studies of the first type split users into two groups: one receives the explanation, the other does not [74]. Hence, we can naturally quantify the effect of the explanation by comparing results between groups. The prominent A/B testing frequently used in industry belongs to this study type, where a large pool of users is available and different interfaces can be tested simultaneously.
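For such between-subject comparisons, a standard two-proportion z-test can quantify whether, say, click-through rates differ between the explanation and control groups; the numbers below are hypothetical.

```python
from math import sqrt, erf

def ab_ztest(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test comparing click-through rates of an
    'explanation' group (A) against a control group (B)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    p = (clicks_a + clicks_b) / (n_a + n_b)           # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))      # standard error
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Hypothetical study: group A saw explanations, group B did not.
z, p = ab_ztest(clicks_a=240, n_a=1000, clicks_b=200, n_b=1000)
print(f"z={z:.2f}, p={p:.4f}")  # significant at the 5% level
```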
In contrast, within-subject experiments are used when only a few users are available, especially outside the industrial context. In these studies, each user is presented with all explanation interfaces [18,46,62,74,83,107,110], plus one containing no explanation. Such within-subject studies need to account for possible confounding factors emerging from the successive interaction with different interfaces (e.g., a user may feel lost interacting with a complex interface after seeing a very simple one).
Another fundamental aspect of user studies is the type of measurements they employ [61]: usually either behavioral, such as click-through rates and time spent interacting [7,120], or attitudinal, for instance surveys and semi-structured interviews [10,62]. Generally, the measurement should be carefully tailored to the explanation goal(s). For example, if persuasiveness and trustworthiness are the most relevant explanation goals, we can assess the first via click-through rate and the second through specific questions (e.g., "Do you trust the recommendation?"). In an industrial context, these measurements may be used as key performance indicators of the explanations, though little research has been carried out here beyond general user satisfaction (e.g., streaming time and weekly active user count).
Lastly, music consumption is influenced by the user's personal characteristics and context (see Section 1), which also affect the reception of the explanations. It is, therefore, necessary to take them into account by ensuring a representative population sample. Research has considered different demographics (e. g., gender, age group, and country) [10,102], musical sophistication [74], listening habits [7,83], and psychological traits such as personality [62] and need for cognition [74].

Evaluating explanations from the technical stakeholders' perspective
Methods to evaluate explanations can also serve the technical stakeholders' side of MRSs, e.g., engineers and data analysts. Technical, offline evaluations, though more convenient to conduct than user studies, are prone to the adoption of sketchy intuitive metrics, which can result in confirmation biases [31,67]. Fortunately, some metrics for explainability are widely agreed upon and seldom lead to misinterpretations. For instance, the stability of an explanation between re-estimations [79], its robustness to small data changes [6,58], and its consistency across several similar models [34] appear to be reasonable minimal requirements for XAI. Similarly, sparsity is often desirable, since fewer parameters in the explanation translate to better cognitive handling [94]. Discriminativeness is already a not-so-trivial requirement, as some popular feature-based explanation methods were shown to produce the same explanations across several class predictions [3]. Other subtle sanity checks are necessary: e.g., some ML models tend to leverage out-of-distribution artifacts and thus provide nonsensical explanations [63], which must be avoided.
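A minimal sketch of one such agreed-upon check, stability, measured here as the Jaccard overlap of the top-k features across two re-estimations; feature names and scores are made up.

```python
def topk_jaccard(expl_a, expl_b, k=3):
    """Stability check: overlap of the top-k features of two explanations
    produced for the same prediction (e.g., across re-estimations).
    Values near 1 indicate a stable explanation."""
    top = lambda e: set(sorted(e, key=e.get, reverse=True)[:k])
    a, b = top(expl_a), top(expl_b)
    return len(a & b) / len(a | b)

# Two re-estimations of a feature-based explanation for one recommendation.
run1 = {"tempo": 0.9, "genre": 0.8, "energy": 0.5, "year": 0.1}
run2 = {"genre": 0.85, "tempo": 0.7, "energy": 0.6, "year": 0.4}
print(topk_jaccard(run1, run2))  # 1.0: the top-3 feature sets coincide
```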
Somewhat encouragingly, some XAI goals seem harder to achieve than to check. For instance, fairness objectives often stem from measured biases (e.g., disparity), so the impact of a fairness-inducing system can be quantified [36]. Note that this gets trickier for less tractable objectives (e.g., minimizing environmental impact) or when a complete measurement is unavailable, costly, or requires time to show a significant change. The same holds for interactivity, for instance by tracking the variety of tracks a user listens to after adopting the system.
Not every method can generate explanations for all items or users of an MRS. Thus, it is useful to measure the coverage of a method, e.g., how many explainable items are recommended in the top-k list of each user [1,87]. Likewise, the computational efficiency of explanation generation should be taken into account [20], particularly for time-sensitive use cases.
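Coverage can be computed straightforwardly; the sketch below (with hypothetical users and items) counts the fraction of recommended items for which an explanation exists.

```python
def explanation_coverage(topk_lists, explainable):
    """Fraction of recommended items, over all users' top-k lists, for
    which the explanation method can produce an explanation."""
    total = sum(len(lst) for lst in topk_lists.values())
    covered = sum(1 for lst in topk_lists.values() for item in lst
                  if item in explainable)
    return covered / total if total else 0.0

topk = {"u1": ["a", "b", "c"], "u2": ["b", "d", "e"]}
print(explanation_coverage(topk, explainable={"a", "b", "d"}))  # 4 of 6 items
```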

EXPLAINABILITY CHALLENGES IN AN INDUSTRIAL CONTEXT
In previous sections, we discussed different ways to make MRSs more explainable and to evaluate explanations. We now focus on the inherent challenges that arise in a real industrial context when trying to implement these methods to explain recommendations to end-users.

Explanations in real MRS
Many providers of commercial music streaming services design their recommendation interface as swipeable carousels [11], namely sequences of sections that users can scroll. These carousels have titles that convey information to end-users, such as:
• Self-explanatory titles: e.g., "Top 10", "Popular in your area", "Trending content" or "Recommended for you", which merely indicate the content selection process (Figure 4, top).
• Feature-based explanations: e.g., "70's soul" or "Rock music" (Figure 4, middle).
• Example-based explanations: e.g., "Because you like artist X", "Because you listened to album Y" (Figure 4, bottom).

Fig. 4. Real-world recommendations with explanations.

Certainly, these simple and crude explanations contrast with the advanced explanation capabilities we have presented earlier. Graph-based explanations, for instance, do not easily fit the headline formatting constraint, due to their length and complexity. They are therefore quite uncommon in industrial systems, though they represent a promising aspect of conversational MRSs. In the following, we further analyze this discrepancy between the scientific state of the art and the industrial realm.

Overview of an industrial MRS
A simplistic view of an industrial MRS is given in Figure 5. Central to it is the Core Recommendation Engine that models user-item affinities. Usually trained offline on a vast amount of user-item interactions, the system is then used online to generate item recommendations for each user accessing the service. This core MRS is complemented by heuristic filters and pre-/post-processing.
To train and query the Core module, only a fraction of all available information about items and users is eventually used. For instance, users' metadata such as location, context or declared age can be used as-is, transformed (e.g., quantized into broad areas or age buckets), or discarded. Items' data can be even more heavily processed: the audio signal can be subsampled, compressed, bounded, or normalized. Contextual information about the device, time, and location may be collected or inferred. Additionally, some systems leverage continuous user feedback within a session for online adaptation.
Symmetrically, the direct output of the core RS is not what the final user will be confronted with. Heuristics may be added, for instance to remove items that were already presented recently. In some contexts, enforcing contractual or legal obligations (such as the Digital Millennium Copyright Act rules for internet broadcasters [24]) can also be necessary. Finally, product constraints in terms of display space on the device, connectivity status, or content availability issues can impact recommendations.
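The kind of heuristic post-processing described above might be sketched as a simple filter pipeline; the specific rules and names are illustrative assumptions, not those of any real system.

```python
def postprocess(candidates, recently_shown, unavailable, max_per_artist=2):
    """Illustrative post-filters applied after the core engine: drop
    recently shown or unavailable items and cap per-artist repetition."""
    out, per_artist = [], {}
    for track, artist in candidates:
        if track in recently_shown or track in unavailable:
            continue  # already presented recently, or rights/availability issue
        if per_artist.get(artist, 0) >= max_per_artist:
            continue  # redundancy cap
        per_artist[artist] = per_artist.get(artist, 0) + 1
        out.append((track, artist))
    return out

candidates = [("t1", "A"), ("t2", "A"), ("t3", "A"), ("t4", "B"), ("t5", "C")]
print(postprocess(candidates, recently_shown={"t4"}, unavailable={"t5"}))
# -> [('t1', 'A'), ('t2', 'A')]
```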
"We recommended the song A by artist B to you because: (1) We considered your recent history (e. g., 3 months) and that older interaction may no longer be relevant. Also, considering a longer time period would have been too computationally costly. (2) We also considered the recent history of many more users (not all of them, some were excluded because, for instance they had too few interactions, or peculiar activity patterns) to learn a representation space encoding similarity between artists with a machine learning system. (3) The machine learning system learned to give close representations to artists that are co-listened by the same set of users and distant representations to artists that are not. (4) We saw that you listened to songs by artists similar to artist B and you did not skip them which we interpreted as positive feedback. (5) Eventually, you also explicitly liked artists similar to B or songs similar to A. (6) Some other songs that could have been very relevant in this context were discarded because you skipped them in a previous session. (7) We sampled items in our representation space that are close to items to which you gave positive feedback and far from those with negative feedback. (8) Song A and artist B also passed other heuristic filters (e. g., regarding redundancy of recommended content, or a user personal blacklist). " Table 1. Honest recommendations explanation.

Issues with explainability in industrial MRS
If we were to provide a detailed description of the internals of an MRS, intended for end-users and written in natural language, it would probably look like the explanation provided in Table 1. While it may seem too detailed and almost provocative, it highlights a set of issues that we may face when trying to include explanations in an industrial MRS.
Issues with engineering assumptions and design choices. MRSs largely rely on implicit feedback and on the engineering assumptions that come with its processing. For instance, most music services collect user feedback through basic interactions, namely skips, likes, dislikes, listening history, and navigation outside what the MRS provided (such as music retrieved through the search engine). While dislikes are rather self-explanatory, the intention behind a user liking a recommended item may not be as clear, since users may use likes to bookmark songs. The intention behind a skip is even harder to interpret [4], yet skips remain the most basic and common interactions. Thus, MRS designers usually want to take advantage of them and enforce heuristic rules, e.g., negatively weighting skips and positively considering full-song listens (even though the music may have been played without anyone actually listening to it).
Some design choices are also made to keep the system computationally efficient, notably by limiting the amount of data: for instance, in item (1) of Table 1, the system needs to explain that old interactions were not taken into account, otherwise a user may not understand why some recurrent skips of an artist they dislike were ignored. It is worth noting that such design choices are usually optimized in the industrial context (e.g., through A/B testing), but are rarely considered in academic research.
It is also common, in large catalogs, to encounter metadata ambiguities such as homonymous artist profiles or polysemous musical genres. The impact of such ambiguities on the system can be large and puts explanations at risk of being deceptive, for instance if an example-based explanation "Because you listened to artist X" is displayed to a user who listened to a different artist named X.
In the artificial explanation presented in Table 1, items (1), (4), and (6) rely on pragmatic assumptions. This makes the explanation quite complex and may decrease user satisfaction, especially if some assumption is invalid, e.g., when a song was skipped because it was played in an inappropriate context and not because it was disliked. Furthermore, providing such a detailed explanation may change user behavior w.r.t. these interactions: e.g., users may avoid skipping songs they like in order to prevent them from being discarded in future recommendations, which may cause dissatisfaction.
Trade-off between the simplicity of the explanation and the complexity of the RS. Among industrial actors, the less-is-more design pattern is widely adopted as a general good practice, supported by the theory of cognitive load applied to user interface design [122]. The latter suggests that unnecessary display of information goes against ergonomics principles [96] and can thus be detrimental to user satisfaction. Following these guidelines, end-user explanations should be carefully crafted to remain simple and concise, making cognitive overload less likely.
Additionally, industrial incentives are primarily driven toward highly accurate systems. This often requires complex MRS components, making explanations not simple enough to be provided to the user. For instance, some building blocks may be based on black-box methods, such as the latent factor models or deep embeddings widely used in MRSs. These models embed users and items as multidimensional vectors in a latent space and represent affinity as their relative distance in this space. While they usually provide good results in a large variety of recommendation tasks, the latent factors are very difficult to understand: for instance, in item (2) of Table 1, artist similarity is computed by a black-box system which is barely explainable to the user.
Besides, several MRS processing blocks rely on parameter choices. For instance, one may not consider all past user-song interactions to train the user-item affinity model, but only those that are significant (e. g., only consider interactions when the user listened to at least half of a song). But this threshold is arbitrary and may exclude interactions that are important to a user: e. g., users only listening to the intro of a song many times because they like it a lot.
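Such a significance threshold might look like the following sketch (the field names and the 50% ratio are illustrative), which indeed discards the intro-replay pattern mentioned above.

```python
def significant_interactions(events, min_ratio=0.5):
    """Keep only interactions where the user listened to at least
    `min_ratio` of the track: an arbitrary threshold that discards,
    e.g., a beloved intro replayed many times."""
    return [e for e in events if e["played_s"] / e["duration_s"] >= min_ratio]

events = [
    {"track": "long_intro", "played_s": 30, "duration_s": 240},   # 12.5% played
    {"track": "full_listen", "played_s": 180, "duration_s": 200}, # 90% played
]
kept = significant_interactions(events)
print([e["track"] for e in kept])  # -> ['full_listen']
```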
Arguably, these parameters should be optimized, but in practice they are so numerous that optimization becomes intractable.
Finally, industrial MRSs are built upon several sub-blocks that are glued together and that rely on various sources of data: user modeling, content modeling, user-item affinity modeling, etc. The recommendation is made on top of all those blocks that may each influence the final recommendation. The impact of each block on this final recommendation is quite difficult to assess and, consequently, it is hard to generate a simple explanation on top of these unclear impacts.
Feature selection may appear as a solution, but as long as several features are significantly impacting the prediction, the explanation would need to be either complex or incomplete. The overall complexity of explanations in Table 1 illustrates this issue.
Issues of transparency with respect to company competition. One of the main goals of explanations for RSs is to increase transparency. While transparency can boost user satisfaction, it can also disclose critical aspects of the system. Therefore, it may be necessary to ensure that explanations do not reveal insights about the system's internals. For instance, disclosing that the MRS uses artist embeddings (item (3) of Table 1) or a specific hyperparameter of the system, such as the time-frame considered in the history (item (1)), can be sensitive information that a private company may be reluctant to make public to competitors.

Perspectives for explainable MRSs
Improving the level of explanation of MRSs while keeping strong simplicity constraints for the user remains a challenge.
However, the end-user is not the only stakeholder impacted by MRSs. For instance, the revenue of music producers is impacted, too. Global explanations may thus be relevant for such an audience, in terms of fairness and transparency (explanations would not be about single recommendations but rather about why an artist was recommended to a particular group of people). As there are no simplicity constraints for this kind of stakeholder, explanations could be much more elaborate.
Another aspect is that keeping a system explainable is important for constantly improving its performance. For instance, receiving user complaints or feedback about bad recommendations can only be leveraged for improving the system if the RS engineers can understand the reason for these mis-recommendations. A RS that relies on black-box blocks prevents understanding bad recommendations and, therefore, hinders improving the system.
Finally, advanced users may want more control, and simplicity constraints may matter less to them: for instance, [52] argues that, as opposed to the less-is-more design pattern, giving users additional control over the RS does increase cognitive load, but also increases satisfaction for users who have a deep understanding of how the RS works. Controls that enable interacting with the MRS make a positive feedback loop possible: explanations can be explicitly leveraged by the user to act on the RS and mitigate future spurious recommendations.
Interestingly, the increasing usage of voice-controlled devices to pilot music streaming services creates a promising new playground for deploying explainable MRSs and, beyond that, fully interactive experiences where recommendations can be challenged and eventually improved.

ACKNOWLEDGMENTS
This work received support from the Austrian Science Fund (FWF): P33526 and DFH-23.