Published 2026-01-06

How to Build a Song Recommender Using Create ML MLRecommender

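You can find this post and more on my website.

Objective

By the end of this post, we'll know how to use the Create ML MLRecommender to recommend songs to a user based on their listening history. We'll also learn how to parse and prepare data for an MLDataTable using Python and data from a third party.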

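Introduction to MLRecommender

A personalized recommender can be used in many settings, such as a music player, a video player, or a social media site. A machine learning recommender compares a user's past activity against a large library of activity from many other users. For example, if Spotify wanted to recommend you a new daily mix, its ML recommender might look at your listening history over the past few weeks and compare it to your friends' listening histories. Our goal today is to create an MLRecommender that recommends songs to a user based on their listening history.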
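
To make the idea of comparing one listener's history against everyone else's more concrete, here is a minimal sketch of item-to-item similarity based on the Jaccard overlap of listener sets. It is an illustration only: the sample histories and the `recommend` helper are hypothetical, and nothing here is a claim about the algorithm MLRecommender uses internally.

```python
from collections import defaultdict

# Hypothetical listening histories: user -> set of songs they have played
histories = {
    "alice": {"Stronger", "The Cove", "Posters"},
    "bob": {"The Cove", "Posters", "Meadowlarks"},
    "carol": {"Stronger", "Meadowlarks"},
}

def recommend(user, histories, k=2):
    """Rank songs the user has not heard by Jaccard similarity to songs they have."""
    # Invert the histories: song -> set of users who played it
    listeners = defaultdict(set)
    for name, songs in histories.items():
        for song in songs:
            listeners[song].add(name)

    heard = histories[user]
    scores = {}
    for candidate in listeners:
        if candidate in heard:
            continue
        # Average Jaccard overlap between the candidate's listeners and
        # the listeners of each song the user already knows
        total = 0.0
        for song in heard:
            a, b = listeners[candidate], listeners[song]
            total += len(a & b) / len(a | b)
        scores[candidate] = total / len(heard)

    # Highest score first; ties broken alphabetically for determinism
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

print(recommend("carol", histories))
```

Songs that share many listeners with what a user already plays score higher, which is the same intuition behind comparing your history with your friends' histories.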

The constructor for MLRecommender is:

init(trainingData: MLDataTable, userColumn: String, itemColumn: String, ratingColumn: String? = nil, parameters: MLRecommender.ModelParameters = ModelParameters()) throws

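Creating the Data Table

The first step is to create the trainingData. In this case, our training data is an MLDataTable holding the listening histories of many different users, taken from the Million Song Dataset, which contains metadata for over a million songs along with ratings provided by users.

We'll use two files from the dataset. The first is 10000.txt, which contains the user ID, song ID, and listen count for 10,000 records. We'll refer to it as history.txt from now on. The second is song_data.csv, which contains the song ID, title, release, artist name, and year. We'll refer to it as songs.csv from now on. Complete versions of every file needed for this tutorial can be found at the end of the post.

Here are the contents of our input files. Note that songs.csv has a header row, while history.txt does not: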

# history.txt

b80344d063b5ccb3212f76538f3d9e43d87dca9e    SOAKIMP12A8C130995  1
b80344d063b5ccb3212f76538f3d9e43d87dca9e    SOBBMDR12A8C13253B  2
b80344d063b5ccb3212f76538f3d9e43d87dca9e    SOBXHDL12A81C204C0  1
...
# songs.csv

song_id,title,release,artist_name,year
SOQMMHC12AB0180CB8,"Silent Night","Monster Ballads X-Mas","Faster Pussy cat",2003
SOVFVAK12A8C1350D9,"Tanssi vaan","Karkuteillä",Karkkiautomaatti,1995
SOGTUKN12AB017F4F1,"No One Could Ever",Butter,"Hudson Mohawke",2006
...

We'll use the pandas Python library to work with the CSV data. First, download the files above and name them history.txt and songs.csv respectively, then load them:

import csv
import pandas as pd

history_file = 'history.txt' # 'https://static.turi.com/datasets/millionsong/10000.txt'
songs_metadata_file = 'songs.csv' # 'https://static.turi.com/datasets/millionsong/song_data.csv'

# Import the files
history_df = pd.read_table(history_file, header=None)
history_df.columns = ['user_id', 'song_id', 'listen_count']
metadata_df =  pd.read_csv(songs_metadata_file)

songs.csv already includes column headers, so we don't need to add them manually as we did for history_df. Here's what the data frames look like now:

# history_df

                                    user_id             song_id  listen_count
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995             1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B             2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBXHDL12A81C204C0             1
...
# metadata_df
# (The '\' means that the row continues onto the next lines)

              song_id              title                release  \
0  SOQMMHC12AB0180CB8       Silent Night  Monster Ballads X-Mas
1  SOVFVAK12A8C1350D9        Tanssi vaan            Karkuteillä
2  SOGTUKN12AB017F4F1  No One Could Ever                 Butter

        artist_name  year
0  Faster Pussy cat  2003
1  Karkkiautomaatti  1995
2    Hudson Mohawke  2006
...
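
As a side note on the headerless file, pandas can label the columns while reading instead of assigning them afterwards. The sketch below assumes the same tab-separated layout as history.txt and reads two sample rows from an in-memory string; the `sample_history` and `sample_df` names are made up for the example.

```python
import io
import pandas as pd

# Two rows in the same tab-separated, headerless layout as history.txt
sample_history = (
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1\n"
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2\n"
)

# header=None tells pandas the first line is data, not a header;
# names= supplies the column labels in the same call
sample_df = pd.read_table(
    io.StringIO(sample_history),
    header=None,
    names=["user_id", "song_id", "listen_count"],
)

print(sample_df.columns.tolist())
print(sample_df["listen_count"].sum())
```

Either approach gives the same data frame; passing `names=` simply keeps the column labels next to the read call.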

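Next, to create a single listening history covering all users, we need to merge the song data in metadata_df into the listening history in history_df and generate a CSV file that we can use from Swift. We also add a column that combines the song title and artist name so that we can see both when printing results from MLRecommender: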

# Merge the files into a single csv
song_df = pd.merge(history_df, metadata_df.drop_duplicates(['song_id']), on="song_id", how="left")
song_df.to_csv('merged_listen_data.csv', quoting=csv.QUOTE_NONNUMERIC)

# Add a "Title - Name" column for easier printing later
song_df['song'] = song_df['title'] + ' - ' + song_df['artist_name']

Here is our merged song data frame:

# song_df

                                    user_id             song_id  listen_count  \
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995             1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B             2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBXHDL12A81C204C0             1

             title              release    artist_name  year  \
0         The Cove   Thicker Than Water   Jack Johnson     0
1  Entre Dos Aguas  Flamenco Para Niños  Paco De Lucia  1976
2         Stronger           Graduation     Kanye West  2007

                              song
0          The Cove - Jack Johnson
1  Entre Dos Aguas - Paco De Lucia
2            Stronger - Kanye West
...
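
The merge above calls drop_duplicates on the metadata before joining. A small sketch of why that matters, using hypothetical miniature frames rather than the real dataset: when a song_id appears more than once in the metadata, a left merge emits one output row per match, so a single listen would be counted several times.

```python
import pandas as pd

# Hypothetical miniature versions of history_df and metadata_df
history = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "song_id": ["S1", "S2"],
    "listen_count": [1, 3],
})
metadata = pd.DataFrame({
    "song_id": ["S1", "S1", "S2"],  # S1 is listed twice
    "title": ["The Cove", "The Cove (Live)", "Stronger"],
})

# Without de-duplication, the u1/S1 row is emitted once per metadata match
naive = pd.merge(history, metadata, on="song_id", how="left")

# Keeping the first metadata row per song_id preserves one row per listen
deduped = pd.merge(
    history, metadata.drop_duplicates(["song_id"]), on="song_id", how="left"
)

print(len(history), len(naive), len(deduped))
```

Comparing the row count before and after a merge is a cheap way to catch this kind of accidental duplication.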

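As of this writing, MLRecommender requires the item ID column in trainingData to hold integer values that count up from 0 to the number of items. In other words, if our trainingData contained only three songs, the song IDs in merged_listen_data.csv would be SOQMMHC12AB0180CB8, SOVFVAK12A8C1350D9, and SOGTUKN12AB017F4F1, but we need them to be 0, 1, and 2. Let's add a new column to our CSV that uses incremental song IDs from 0 to N: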

# Find the unique song ids
song_ids = metadata_df.song_id.unique()

# Create a new dataframe of the unique song ids and a new incremental
# id for each one
incremental_id_df = pd.DataFrame({'song_id': song_ids})
incremental_id_df['incremental_song_id'] = incremental_id_df.index

# Merge the original song metadata with the incremental ids
new_song_id_df = pd.merge(metadata_df, incremental_id_df, on='song_id', how='left')
new_song_id_df.to_csv('songs_incremental_id.csv', quoting=csv.QUOTE_NONNUMERIC)

# Create a new merged history and song metadata CSV with incremental ids
new_history_df = pd.merge(history_df, incremental_id_df, on='song_id', how='inner')
new_history_df.to_csv('merged_listen_data_incremental_song_id.csv', quoting=csv.QUOTE_NONNUMERIC)

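Here are the contents of our new songs CSV. Note the new incremental_song_id column at the end, with song IDs running from 0 to 999999 (pandas also writes its row index as the unnamed first column):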

# songs_incremental_id.csv

"","song_id","title","release","artist_name","year","incremental_song_id"
0,"SOQMMHC12AB0180CB8","Silent Night","Monster Ballads X-Mas","Faster Pussy cat",2003,0
1,"SOVFVAK12A8C1350D9","Tanssi vaan","Karkuteillä","Karkkiautomaatti",1995,1
2,"SOGTUKN12AB017F4F1","No One Could Ever","Butter","Hudson Mohawke",2006,2
...

Here is the final merged listening data with incremental IDs, ready to be read by MLRecommender:

# merged_listen_data_incremental_song_id.csv

"","Unnamed: 0","user_id","song_id","listen_count","title","release","artist_name","year","incremental_song_id"
0,0,"b80344d063b5ccb3212f76538f3d9e43d87dca9e","SOAKIMP12A8C130995",1,"The Cove","Thicker Than Water","Jack Johnson",0,397069
1,18887,"7c86176941718984fed11b7c0674ff04c029b480","SOAKIMP12A8C130995",1,"The Cove","Thicker Than Water","Jack Johnson",0,397069
2,21627,"76235885b32c4e8c82760c340dc54f9b608d7d7e","SOAKIMP12A8C130995",3,"The Cove","Thicker Than Water","Jack Johnson",0,397069
...
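
The remapping idea itself does not depend on pandas. As a rough, dependency-free sketch (the `song_ids` list here is a made-up sample), each distinct string ID is assigned the next integer in first-seen order, and a final check confirms the integers have no gaps:

```python
# Hypothetical song IDs in listen order, with repeats
song_ids = [
    "SOAKIMP12A8C130995",
    "SOBBMDR12A8C13253B",
    "SOAKIMP12A8C130995",
    "SOBXHDL12A81C204C0",
]

# dict.fromkeys keeps first-seen order and drops repeats, so enumerate
# hands each distinct song the next integer starting at 0
incremental_id = {sid: i for i, sid in enumerate(dict.fromkeys(song_ids))}

# Replace every string ID with its integer ID
remapped = [incremental_id[sid] for sid in song_ids]
print(remapped)

# Sanity check: the IDs in use are exactly 0..N-1 with no gaps
assert sorted(set(remapped)) == list(range(len(incremental_id)))
```

Keeping the `incremental_id` mapping around is what later lets the integer IDs in the recommendations be translated back to song metadata.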

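Now we can load it into the recommender!

Using MLRecommender

Create a new Swift Playground and add the two CSV files, merged_listen_data_incremental_song_id.csv and songs_incremental_id.csv, to the playground as resources. For help with adding resources to a Swift Playground, see this post. Make sure your Swift Playground is a blank macOS playground, not an iOS playground. Because our MLRecommender only provides user IDs and song IDs when it generates recommendations, we'll use the second CSV file to look up song titles.

First, let's load the merged listening history with incremental IDs: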

import Foundation
import CreateML

// Create an MLDataTable from the merged CSV data
let history_csv = Bundle.main.url(forResource: "merged_listen_data_incremental_song_id", withExtension: "csv")!
let history_table = try MLDataTable(contentsOf: history_csv)
print(history_table)
Columns:
    X1  string
    Unnamed: 0  integer
    user_id string
    song_id string
    listen_count    integer
    title   string
    release string
    artist_name string
    year    integer
    incremental_song_id integer
Rows: 2000000
Data:
+----------------+----------------+----------------+----------------+----------------+
| X1             | Unnamed: 0     | user_id        | song_id        | listen_count   |
+----------------+----------------+----------------+----------------+----------------+
| 0              | 0              | b80344d063b5...| SOAKIMP12A8C...| 1              |
| 1              | 18887          | 7c8617694171...| SOAKIMP12A8C...| 1              |
| 2              | 21627          | 76235885b32c...| SOAKIMP12A8C...| 3              |
| 3              | 27714          | 250c0fa2a77b...| SOAKIMP12A8C...| 1              |
| 4              | 34428          | 3f73f44560e8...| SOAKIMP12A8C...| 6              |
| 5              | 34715          | 7a4b8e7d2905...| SOAKIMP12A8C...| 6              |
| 6              | 55885          | b4a678fb729b...| SOAKIMP12A8C...| 2              |
| 7              | 65683          | 33280fc74b16...| SOAKIMP12A8C...| 1              |
| 8              | 75029          | be21ec120193...| SOAKIMP12A8C...| 1              |
| 9              | 105313         | 6fbb9ff93663...| SOAKIMP12A8C...| 2              |
+----------------+----------------+----------------+----------------+----------------+
+----------------+----------------+----------------+----------------+---------------------+
| title          | release        | artist_name    | year           | incremental_song_id |
+----------------+----------------+----------------+----------------+---------------------+
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
| The Cove       | Thicker Than...| Jack Johnson   | 0              | 397069              |
+----------------+----------------+----------------+----------------+---------------------+
[2000000 rows x 10 columns]

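From this, we can create an MLRecommender. Our trainingData is the data table built from the merged listening history CSV, userColumn is the user_id column name, and itemColumn is the incremental_song_id column name. The user_id b80344d063b5ccb3212f76538f3d9e43d87dca9e was chosen at random from the merged CSV data.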

// Generate recommendations
let recommender = try MLRecommender(trainingData: history_table, userColumn: "user_id", itemColumn: "incremental_song_id")
let recs = try recommender.recommendations(fromUsers: ["b80344d063b5ccb3212f76538f3d9e43d87dca9e"])
print(recs)
Columns:
    user_id string
    incremental_song_id integer
    score   float
    rank    integer
Rows: 10
Data:
+----------------+---------------------+----------------+----------------+
| user_id        | incremental_song_id | score          | rank           |
+----------------+---------------------+----------------+----------------+
| b80344d063b5...| 114557              | 0.0461493      | 1              |
| b80344d063b5...| 834311              | 0.0436045      | 2              |
| b80344d063b5...| 939015              | 0.043068       | 3              |
| b80344d063b5...| 955047              | 0.0427589      | 4              |
| b80344d063b5...| 563380              | 0.0426116      | 5              |
| b80344d063b5...| 677759              | 0.0423951      | 6              |
| b80344d063b5...| 689170              | 0.0418951      | 7              |
| b80344d063b5...| 333053              | 0.041788       | 8              |
| b80344d063b5...| 381319              | 0.0403042      | 9              |
| b80344d063b5...| 117491              | 0.0400819      | 10             |
+----------------+---------------------+----------------+----------------+
[10 rows x 4 columns]

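But we want to see the song metadata associated with each recommended incremental_song_id. Let's load the song metadata table and join the recommendations to it using the incremental ID: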

// Use the songs data CSV to print the recommended song titles
let songs_csv = Bundle.main.url(forResource: "songs_incremental_id", withExtension: "csv")!
let songs_table = try MLDataTable(contentsOf: songs_csv)
print(songs_table)

let song_title_recs = recs.join(with: songs_table, on: "incremental_song_id")
print(song_title_recs)
Columns:
    X1  string
    song_id string
    title   undefined
    release string
    artist_name string
    year    integer
    incremental_song_id integer
Rows: 1000000
Data:
+----------------+----------------+----------------+----------------+----------------+
| X1             | song_id        | title          | release        | artist_name    |
+----------------+----------------+----------------+----------------+----------------+
| 0              | SOQMMHC12AB0...| Silent Night   | Monster Ball...| Faster Pussy...|
| 1              | SOVFVAK12A8C...| Tanssi vaan    | Karkuteillä   | Karkkiautoma...|
| 2              | SOGTUKN12AB0...| No One Could...| Butter         | Hudson Mohawke |
| 3              | SOBNYVR12A8C...| Si Vos Querés | De Culo        | Yerba Brava    |
| 4              | SOHSBXH12A8C...| Tangle Of As...| Rene Ablaze ...| Der Mystic     |
| 5              | SOZVAPQ12A8C...| Symphony No....| Berwald: Sym...| David Montgo...|
| 6              | SOQVRHI12A6D...| We Have Got ...| Strictly The...| Sasha / Turb...|
| 7              | SOEYRFT12AB0...| 2 Da Beat Ch...| Da Bomb        | Kris Kross     |
| 8              | SOPMIYT12A6D...| Goodbye        | Danny Boy      | Joseph Locke   |
| 9              | SOJCFMH12A8C...| Mama_ mama c...| March to cad...| The Sun Harb...|
+----------------+----------------+----------------+----------------+----------------+
+----------------+---------------------+
| year           | incremental_song_id |
+----------------+---------------------+
| 2003           | 0                   |
| 1995           | 1                   |
| 2006           | 2                   |
| 2003           | 3                   |
| 0              | 4                   |
| 0              | 5                   |
| 0              | 6                   |
| 1993           | 7                   |
| 0              | 8                   |
| 0              | 9                   |
+----------------+---------------------+
[1000000 rows x 7 columns]


Columns:
    user_id string
    incremental_song_id integer
    score   float
    rank    integer
    X1  string
    song_id string
    title   undefined
    release string
    artist_name string
    year    integer
Rows: 11
Data:
+----------------+---------------------+----------------+----------------+----------------+
| user_id        | incremental_song_id | score          | rank           | X1             |
+----------------+---------------------+----------------+----------------+----------------+
| b80344d063b5...| 114557              | 0.0461493      | 1              | 114578         |
| b80344d063b5...| 117491              | 0.0400819      | 10             | 117512         |
| b80344d063b5...| 333053              | 0.041788       | 8              | 333174         |
| b80344d063b5...| 381319              | 0.0403042      | 9              | 381465         |
| b80344d063b5...| 381319              | 0.0403042      | 9              | 444615         |
| b80344d063b5...| 563380              | 0.0426116      | 5              | 563705         |
| b80344d063b5...| 677759              | 0.0423951      | 6              | 678222         |
| b80344d063b5...| 689170              | 0.0418951      | 7              | 689654         |
| b80344d063b5...| 834311              | 0.0436045      | 2              | 834983         |
| b80344d063b5...| 939015              | 0.043068       | 3              | 939863         |
+----------------+---------------------+----------------+----------------+----------------+
+----------------+----------------+----------------+----------------+----------------+
| song_id        | title          | release        | artist_name    | year           |
+----------------+----------------+----------------+----------------+----------------+
| SOHENSJ12AAF...| Great Indoors  | Room For Squ...| John Mayer     | 0              |
| SOOGZYY12A67...| Crying Shame   | In Between D...| Jack Johnson   | 2005           |
| SOGFKJE12A8C...| Sun It Rises   | Fleet Foxes    | Fleet Foxes    | 2008           |
| SOECLAD12AAF...| St. Patrick'...| Room For Squ...| John Mayer     | 0              |
| SOECLAD12AAF...| St. Patrick'...| Room For Squ...| John Mayer     | 0              |
| SOAYTRA12A8C...| All At Once    | Sleep Throug...| Jack Johnson   | 2008           |
| SOKLVUI12A67...| If I Could     | In Between D...| Jack Johnson   | 2005           |
| SOYIJIL12A67...| Posters        | Brushfire Fa...| Jack Johnson   | 2000           |
| SORKFWO12A8C...| Quiet Houses   | Fleet Foxes    | Fleet Foxes    | 2008           |
| SOJAMXH12A8C...| Meadowlarks    | Fleet Foxes    | Fleet Foxes    | 2008           |
+----------------+----------------+----------------+----------------+----------------+
[11 rows x 10 columns]

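The last table printed holds our recommended songs, and the top-ranked one (rank 1) is "Great Indoors"! The join returns 11 rows for 10 recommendations because incremental_song_id 381319 matches two rows in the songs table. Now we can use the MLRecommender for other user IDs.

Wrap Up

First, we looked at the MLRecommender constructor. Then we gathered song data from the Million Song Dataset. We modified the dataset to make it easier to read and added incremental IDs to the song metadata. We loaded the song metadata and listening history into a Swift Playground, created an MLRecommender from the listening history, and generated song recommendations. Finally, we used the song metadata to match each recommended song with its title and artist.

Source Files

All of the files mentioned in this tutorial can be found here, including:

  • songs.csv: metadata for one million songs
  • history.txt: song listening history for multiple users
  • data-parser.py: Python code for manipulating the Million Song Dataset
  • merged_listed_data.csv: merged dataset of song metadata and listening history
  • merged_listed_data_incremental_song_id.csv: merged_listed_data.csv with incremental IDs added
  • songs_incremental_id.csv: songs.csv with incremental IDs added
  • MusicRecommender.playground: Swift Playground that creates the MLRecommender

This post was inspired by Eric Le's article "How to build a simple song recommender system".

Source: https://dev.to/nickymarino/how-to-build-a-song-recommender-using-create-ml-mlrecommender-45h1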