How to Build a Song Recommender Using Create ML MLRecommender
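Objective
By the end of this post, we'll know how to use the Create ML MLRecommender to recommend songs to a user based on their listening history. We'll also learn how to parse and prepare data for an MLDataTable using Python and data from a third party.
Introduction to MLRecommender
A personalized recommendation system can be used in many settings, such as a music player, a video player, or a social media site. A machine learning recommender compares a user's past activity against a large library of activity from many other users. For example, if Spotify wanted to recommend a new daily mix to you, its recommender might look at your listening history over the past few weeks and compare it with your friends' histories. Our goal today is to create an MLRecommender that recommends songs to a user based on their listening history.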
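To make the idea of "comparing one user's history with everyone else's" concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the user names, song names, and the Jaccard-overlap scoring are all hypothetical, and nothing here is claimed to be the algorithm that MLRecommender runs internally.

```python
from collections import defaultdict

# Hypothetical toy listening histories: user -> set of songs played.
histories = {
    "alice": {"song_a", "song_b", "song_c"},
    "bob": {"song_a", "song_b", "song_d"},
    "carol": {"song_b", "song_d"},
}


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b)


def recommend(user, histories, k=3):
    """Score songs the user has not heard by how similar their listeners are."""
    heard = histories[user]
    scores = defaultdict(float)
    for other, songs in histories.items():
        if other == user:
            continue
        similarity = jaccard(heard, songs)
        # Every unheard song played by a similar user earns that similarity.
        for song in songs - heard:
            scores[song] += similarity
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [song for song, _ in ranked[:k]]


print(recommend("carol", histories))  # ['song_a', 'song_c']
```

Carol shares two of three songs with Bob, so Bob's unheard song outranks the one that only Alice played. A real recommender works on the same kind of user/item pairs, just at a much larger scale.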
The constructor for MLRecommender is:
init(trainingData: MLDataTable, userColumn: String, itemColumn: String, ratingColumn: String? = nil, parameters: MLRecommender.ModelParameters = ModelParameters()) throws
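Creating the Data Table
The first step is to create trainingData. In this case, our training data is an MLDataTable holding the listening history of many different users from the Million Song Dataset, which contains metadata for over a million songs along with user-provided ratings.
We'll use two files from the dataset. The first is 10000.txt, which contains one listening record per line: a user ID, a song ID, and a listen count. We'll refer to it as history.txt from here on. The second is song_data.csv, which contains the song ID, title, release, artist name, and year for each song. We'll refer to it as songs.csv from here on. All of the complete files needed for this tutorial can be found at the end of the post.
Here are the contents of our input files. Note that songs.csv has a header row, while history.txt does not: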
# history.txt
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOAKIMP12A8C130995 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBBMDR12A8C13253B 2
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1
...
# songs.csv
song_id,title,release,artist_name,year
SOQMMHC12AB0180CB8,"Silent Night","Monster Ballads X-Mas","Faster Pussy cat",2003
SOVFVAK12A8C1350D9,"Tanssi vaan","Karkuteillä",Karkkiautomaatti,1995
SOGTUKN12AB017F4F1,"No One Could Ever",Butter,"Hudson Mohawke",2006
...
We'll use the pandas Python library to handle the CSV data. First, download the files above and name them history.txt and songs.csv respectively, then load them:
import csv
import pandas as pd
history_file = 'history.txt' # 'https://static.turi.com/datasets/millionsong/10000.txt'
songs_metadata_file = 'songs.csv' # 'https://static.turi.com/datasets/millionsong/song_data.csv'
# Import the files
history_df = pd.read_table(history_file, header=None)
history_df.columns = ['user_id', 'song_id', 'listen_count']
metadata_df = pd.read_csv(songs_metadata_file)
The songs.csv file already includes column headers, so we don't need to add them manually as we did for history_df. Here is what the data frames look like now:
# history_df
                                    user_id             song_id  listen_count
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995             1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B             2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBXHDL12A81C204C0             1
...
# metadata_df
# (The '\' means that the row continues onto the next lines)
              song_id              title                release  \
0  SOQMMHC12AB0180CB8       Silent Night  Monster Ballads X-Mas
1  SOVFVAK12A8C1350D9        Tanssi vaan            Karkuteillä
2  SOGTUKN12AB017F4F1  No One Could Ever                 Butter

        artist_name  year
0  Faster Pussy cat  2003
1  Karkkiautomaatti  1995
2    Hudson Mohawke  2006
...
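The difference between the two loading calls is easy to miss, so here is a small self-contained sketch that reads the sample lines shown earlier from in-memory strings instead of files. It assumes history.txt is tab-separated (the default separator for read_table); the variable names wrong_df and right_df are made up for this illustration.

```python
import io

import pandas as pd

# Two sample lines in the same shape as history.txt.
# Assumption: the real file is tab-separated, which read_table expects by default.
history_text = (
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1\n"
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2\n"
)
# Header row plus one record in the same shape as songs.csv.
songs_text = (
    "song_id,title,release,artist_name,year\n"
    'SOQMMHC12AB0180CB8,"Silent Night","Monster Ballads X-Mas","Faster Pussy cat",2003\n'
)

# Without header=None, pandas consumes the first record as column names.
wrong_df = pd.read_table(io.StringIO(history_text))

# With header=None, every line is data and we name the columns ourselves.
right_df = pd.read_table(io.StringIO(history_text), header=None)
right_df.columns = ['user_id', 'song_id', 'listen_count']

# songs.csv carries its own header, so the defaults are fine.
songs_df = pd.read_csv(io.StringIO(songs_text))

print(len(wrong_df), len(right_df))  # 1 2
print(list(songs_df.columns))
```

Forgetting header=None silently drops the first listening record, which is why the history file is loaded differently from the songs file.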
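Next, to create a single listening history for all users, we need to merge the song data in metadata_df into the listening history in history_df and produce a CSV file that can be used from Swift. We'll also add a column that combines the song title and artist name, so that we can see both at once when we list what MLRecommender suggests: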
# Merge the files into a single csv
song_df = pd.merge(history_df, metadata_df.drop_duplicates(['song_id']), on="song_id", how="left")
song_df.to_csv('merged_listen_data.csv', quoting=csv.QUOTE_NONNUMERIC)
# Add a "Title - Name" column for easier printing later
song_df['song'] = song_df['title'] + ' - ' + song_df['artist_name']
Here is our merged song data frame:
# song_df
                                    user_id             song_id  listen_count  \
0  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995             1
1  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B             2
2  b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBXHDL12A81C204C0             1

             title              release    artist_name  year  \
0         The Cove   Thicker Than Water   Jack Johnson     0
1  Entre Dos Aguas  Flamenco Para Niños  Paco De Lucia  1976
2         Stronger           Graduation     Kanye West  2007

                              song
0          The Cove - Jack Johnson
1  Entre Dos Aguas - Paco De Lucia
2            Stronger - Kanye West
...
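The drop_duplicates call in the merge matters more than it looks. The sketch below uses tiny hypothetical stand-ins for history_df and metadata_df to show what happens when a song ID appears twice in the metadata, and how the validate argument of pd.merge can guard against it. The sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of history_df and metadata_df.
history = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'song_id': ['S1', 'S2', 'S1'],
    'listen_count': [1, 2, 3],
})
metadata = pd.DataFrame({
    'song_id': ['S1', 'S1', 'S2'],  # S1 is listed twice
    'title': ['The Cove', 'The Cove', 'Stronger'],
    'artist_name': ['Jack Johnson', 'Jack Johnson', 'Kanye West'],
})

# Merging against the raw metadata repeats every listen of S1.
naive = pd.merge(history, metadata, on='song_id', how='left')

# Dropping duplicate song IDs first keeps exactly one row per listen.
# validate='many_to_one' makes pandas raise a MergeError if any
# duplicate song_id is still present on the right-hand side.
deduped = pd.merge(
    history,
    metadata.drop_duplicates(['song_id']),
    on='song_id',
    how='left',
    validate='many_to_one',
)

print(len(history), len(naive), len(deduped))  # 3 5 3
```

A left merge keeps every row of the listening history, so after de-duplication the merged frame has the same number of rows as the history itself.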
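At the time of writing, MLRecommender requires the values in the item ID column of trainingData to run from 0 to the number of items. In other words, if our trainingData had only three songs in merged_listen_data.csv, the song IDs would be SOQMMHC12AB0180CB8, SOVFVAK12A8C1350D9, and SOGTUKN12AB017F4F1, but we need them to be 0, 1, and 2. Let's add a new column to our CSV that uses incremental song IDs from 0 to N: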
# Find the unique song ids
song_ids = metadata_df.song_id.unique()
# Create a new dataframe of the unique song ids and a new incremental
# id for each one
incremental_id_df = pd.DataFrame({'song_id': song_ids})
incremental_id_df['incremental_song_id'] = incremental_id_df.index
# Merge the original song metadata with the incremental ids
new_song_id_df = pd.merge(metadata_df, incremental_id_df, on='song_id', how='left')
new_song_id_df.to_csv('songs_incremental_id.csv', quoting=csv.QUOTE_NONNUMERIC)
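# Create a new merged history and song metadata CSV with incremental ids
# (history_df alone is enough for MLRecommender; merge song_df here instead
# if you also want the title and artist columns in this CSV)
new_history_df = pd.merge(history_df, incremental_id_df, on='song_id', how='inner')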
new_history_df.to_csv('merged_listen_data_incremental_song_id.csv', quoting=csv.QUOTE_NONNUMERIC)
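Here are the contents of our new songs CSV file. Note the new incremental_song_id column at the end, with IDs counting up from 0. The unnamed first column is the row index that pandas writes by default: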
# songs_incremental_id.csv
"","song_id","title","release","artist_name","year","incremental_song_id"
0,"SOQMMHC12AB0180CB8","Silent Night","Monster Ballads X-Mas","Faster Pussy cat",2003,0
1,"SOVFVAK12A8C1350D9","Tanssi vaan","Karkuteillä","Karkkiautomaatti",1995,1
2,"SOGTUKN12AB017F4F1","No One Could Ever","Butter","Hudson Mohawke",2006,2
...
Here is the final merged listening data with incremental IDs, ready to be read by MLRecommender:
# merged_listen_data_incremental_song_id.csv
"","Unnamed: 0","user_id","song_id","listen_count","title","release","artist_name","year","incremental_song_id"
0,0,"b80344d063b5ccb3212f76538f3d9e43d87dca9e","SOAKIMP12A8C130995",1,"The Cove","Thicker Than Water","Jack Johnson",0,397069
1,18887,"7c86176941718984fed11b7c0674ff04c029b480","SOAKIMP12A8C130995",1,"The Cove","Thicker Than Water","Jack Johnson",0,397069
2,21627,"76235885b32c4e8c82760c340dc54f9b608d7d7e","SOAKIMP12A8C130995",3,"The Cove","Thicker Than Water","Jack Johnson",0,397069
...
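As a cross-check on the incremental IDs, the same mapping can be produced with pd.factorize, which returns one integer code per row plus the unique values in order of first appearance. This is only an alternative sketch, not the approach the tutorial's script takes; the short song_ids series is a hypothetical sample built from IDs shown earlier.

```python
import pandas as pd

# Hypothetical sample: four rows, with the first song ID repeated.
song_ids = pd.Series([
    'SOQMMHC12AB0180CB8',
    'SOVFVAK12A8C1350D9',
    'SOQMMHC12AB0180CB8',
    'SOGTUKN12AB017F4F1',
])

# Approach used above: take the unique IDs, then reuse the new frame's index.
incremental_id_df = pd.DataFrame({'song_id': song_ids.unique()})
incremental_id_df['incremental_song_id'] = incremental_id_df.index

# Alternative: factorize assigns 0, 1, 2, ... in order of first appearance.
codes, uniques = pd.factorize(song_ids)

print(incremental_id_df)
print(codes.tolist())  # [0, 1, 0, 2]
```

Both approaches give each distinct song ID the same integer, so either can feed the item column, provided the same mapping is applied to the songs file and to the listening history.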
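Now we can load it into the recommender!
Using MLRecommender
Create a new Swift Playground and add the two CSV files, merged_listen_data_incremental_song_id.csv and songs_incremental_id.csv, to it as resources. For help with adding resources to a Swift Playground, see this post. Make sure your Swift Playground is a blank macOS Playground, not an iOS Playground. Because MLRecommender only returns user IDs and song IDs when it generates recommendations, we'll use the second CSV file to look up the song titles.
First, let's load the merged listening history with incremental IDs: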
import Foundation
import CreateML
// Create an MLDataTable from the merged CSV data
let history_csv = Bundle.main.url(forResource: "merged_listen_data_incremental_song_id", withExtension: "csv")!
let history_table = try MLDataTable(contentsOf: history_csv)
print(history_table)
Columns:
X1 string
Unnamed: 0 integer
user_id string
song_id string
listen_count integer
title string
release string
artist_name string
year integer
incremental_song_id integer
Rows: 2000000
Data:
+----------------+----------------+----------------+----------------+----------------+
| X1 | Unnamed: 0 | user_id | song_id | listen_count |
+----------------+----------------+----------------+----------------+----------------+
| 0 | 0 | b80344d063b5...| SOAKIMP12A8C...| 1 |
| 1 | 18887 | 7c8617694171...| SOAKIMP12A8C...| 1 |
| 2 | 21627 | 76235885b32c...| SOAKIMP12A8C...| 3 |
| 3 | 27714 | 250c0fa2a77b...| SOAKIMP12A8C...| 1 |
| 4 | 34428 | 3f73f44560e8...| SOAKIMP12A8C...| 6 |
| 5 | 34715 | 7a4b8e7d2905...| SOAKIMP12A8C...| 6 |
| 6 | 55885 | b4a678fb729b...| SOAKIMP12A8C...| 2 |
| 7 | 65683 | 33280fc74b16...| SOAKIMP12A8C...| 1 |
| 8 | 75029 | be21ec120193...| SOAKIMP12A8C...| 1 |
| 9 | 105313 | 6fbb9ff93663...| SOAKIMP12A8C...| 2 |
+----------------+----------------+----------------+----------------+----------------+
+----------------+----------------+----------------+----------------+---------------------+
| title | release | artist_name | year | incremental_song_id |
+----------------+----------------+----------------+----------------+---------------------+
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
| The Cove | Thicker Than...| Jack Johnson | 0 | 397069 |
+----------------+----------------+----------------+----------------+---------------------+
[2000000 rows x 10 columns]
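From this, we can create an MLRecommender. Our trainingData is the data table built from the merged listening history CSV, userColumn is the name of the user_id column, and itemColumn is the name of the incremental_song_id column. The user_id b80344d063b5ccb3212f76538f3d9e43d87dca9e was picked at random from the merged CSV data.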
// Generate recommendations
let recommender = try MLRecommender(trainingData: history_table, userColumn: "user_id", itemColumn: "incremental_song_id")
let recs = try recommender.recommendations(fromUsers: ["b80344d063b5ccb3212f76538f3d9e43d87dca9e"])
print(recs)
Columns:
user_id string
incremental_song_id integer
score float
rank integer
Rows: 10
Data:
+----------------+---------------------+----------------+----------------+
| user_id | incremental_song_id | score | rank |
+----------------+---------------------+----------------+----------------+
| b80344d063b5...| 114557 | 0.0461493 | 1 |
| b80344d063b5...| 834311 | 0.0436045 | 2 |
| b80344d063b5...| 939015 | 0.043068 | 3 |
| b80344d063b5...| 955047 | 0.0427589 | 4 |
| b80344d063b5...| 563380 | 0.0426116 | 5 |
| b80344d063b5...| 677759 | 0.0423951 | 6 |
| b80344d063b5...| 689170 | 0.0418951 | 7 |
| b80344d063b5...| 333053 | 0.041788 | 8 |
| b80344d063b5...| 381319 | 0.0403042 | 9 |
| b80344d063b5...| 117491 | 0.0400819 | 10 |
+----------------+---------------------+----------------+----------------+
[10 rows x 4 columns]
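But we want to see the song metadata associated with each recommended incremental_song_id. Let's load the songs metadata table and join the recommendations to it using the incremental ID: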
// Use the songs data CSV to print the recommended song titles
let songs_csv = Bundle.main.url(forResource: "songs_incremental_id", withExtension: "csv")!
let songs_table = try MLDataTable(contentsOf: songs_csv)
print(songs_table)
let song_title_recs = recs.join(with: songs_table, on: "incremental_song_id")
print(song_title_recs)
Columns:
X1 string
song_id string
title undefined
release string
artist_name string
year integer
incremental_song_id integer
Rows: 1000000
Data:
+----------------+----------------+----------------+----------------+----------------+
| X1 | song_id | title | release | artist_name |
+----------------+----------------+----------------+----------------+----------------+
| 0 | SOQMMHC12AB0...| Silent Night | Monster Ball...| Faster Pussy...|
| 1 | SOVFVAK12A8C...| Tanssi vaan | Karkuteillä | Karkkiautoma...|
| 2 | SOGTUKN12AB0...| No One Could...| Butter | Hudson Mohawke |
| 3 | SOBNYVR12A8C...| Si Vos Querés | De Culo | Yerba Brava |
| 4 | SOHSBXH12A8C...| Tangle Of As...| Rene Ablaze ...| Der Mystic |
| 5 | SOZVAPQ12A8C...| Symphony No....| Berwald: Sym...| David Montgo...|
| 6 | SOQVRHI12A6D...| We Have Got ...| Strictly The...| Sasha / Turb...|
| 7 | SOEYRFT12AB0...| 2 Da Beat Ch...| Da Bomb | Kris Kross |
| 8 | SOPMIYT12A6D...| Goodbye | Danny Boy | Joseph Locke |
| 9 | SOJCFMH12A8C...| Mama_ mama c...| March to cad...| The Sun Harb...|
+----------------+----------------+----------------+----------------+----------------+
+----------------+---------------------+
| year | incremental_song_id |
+----------------+---------------------+
| 2003 | 0 |
| 1995 | 1 |
| 2006 | 2 |
| 2003 | 3 |
| 0 | 4 |
| 0 | 5 |
| 0 | 6 |
| 1993 | 7 |
| 0 | 8 |
| 0 | 9 |
+----------------+---------------------+
[1000000 rows x 7 columns]
Columns:
user_id string
incremental_song_id integer
score float
rank integer
X1 string
song_id string
title undefined
release string
artist_name string
year integer
Rows: 11
Data:
+----------------+---------------------+----------------+----------------+----------------+
| user_id | incremental_song_id | score | rank | X1 |
+----------------+---------------------+----------------+----------------+----------------+
| b80344d063b5...| 114557 | 0.0461493 | 1 | 114578 |
| b80344d063b5...| 117491 | 0.0400819 | 10 | 117512 |
| b80344d063b5...| 333053 | 0.041788 | 8 | 333174 |
| b80344d063b5...| 381319 | 0.0403042 | 9 | 381465 |
| b80344d063b5...| 381319 | 0.0403042 | 9 | 444615 |
| b80344d063b5...| 563380 | 0.0426116 | 5 | 563705 |
| b80344d063b5...| 677759 | 0.0423951 | 6 | 678222 |
| b80344d063b5...| 689170 | 0.0418951 | 7 | 689654 |
| b80344d063b5...| 834311 | 0.0436045 | 2 | 834983 |
| b80344d063b5...| 939015 | 0.043068 | 3 | 939863 |
+----------------+---------------------+----------------+----------------+----------------+
+----------------+----------------+----------------+----------------+----------------+
| song_id | title | release | artist_name | year |
+----------------+----------------+----------------+----------------+----------------+
| SOHENSJ12AAF...| Great Indoors | Room For Squ...| John Mayer | 0 |
| SOOGZYY12A67...| Crying Shame | In Between D...| Jack Johnson | 2005 |
| SOGFKJE12A8C...| Sun It Rises | Fleet Foxes | Fleet Foxes | 2008 |
| SOECLAD12AAF...| St. Patrick'...| Room For Squ...| John Mayer | 0 |
| SOECLAD12AAF...| St. Patrick'...| Room For Squ...| John Mayer | 0 |
| SOAYTRA12A8C...| All At Once | Sleep Throug...| Jack Johnson | 2008 |
| SOKLVUI12A67...| If I Could | In Between D...| Jack Johnson | 2005 |
| SOYIJIL12A67...| Posters | Brushfire Fa...| Jack Johnson | 2000 |
| SORKFWO12A8C...| Quiet Houses | Fleet Foxes | Fleet Foxes | 2008 |
| SOJAMXH12A8C...| Meadowlarks | Fleet Foxes | Fleet Foxes | 2008 |
+----------------+----------------+----------------+----------------+----------------+
[11 rows x 10 columns]
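The last table printed is our list of recommended songs, and the top-ranked one is "Great Indoors"! The joined table has 11 rows rather than 10 because songs.csv contains duplicate song IDs, so the recommendation with incremental_song_id 381319 matches two metadata rows. Now we can use the MLRecommender for other user IDs as well.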
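The joined table is ordered by incremental_song_id rather than by rank, and one recommendation appears twice. If the results were exported for further processing, a clean-up step could look like the sketch below. It is a hypothetical post-processing helper in plain Python, not part of Create ML; the rows are copied from the joined table, with truncated titles left exactly as printed.

```python
# A few rows copied from the joined table: (incremental_song_id, rank, title).
joined_rows = [
    (114557, 1, "Great Indoors"),
    (117491, 10, "Crying Shame"),
    (381319, 9, "St. Patrick'..."),
    (381319, 9, "St. Patrick'..."),  # duplicate produced by the join
    (834311, 2, "Quiet Houses"),
]


def top_recommendations(rows):
    """Return (rank, title) pairs sorted by rank, one per song ID."""
    seen = set()
    unique_rows = []
    for song_id, rank, title in sorted(rows, key=lambda row: row[1]):
        if song_id in seen:
            continue  # skip repeated metadata matches for the same song
        seen.add(song_id)
        unique_rows.append((rank, title))
    return unique_rows


for rank, title in top_recommendations(joined_rows):
    print(rank, title)
```

Sorting by rank first means the best recommendation is always printed at the top, regardless of the order in which the join returned the rows.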
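Wrapping Up
First, we looked at the MLRecommender constructor. Then we gathered song data from the Million Song Dataset, modified it to make it easier to read, and added incremental IDs to the song metadata. We loaded the song metadata and listening history into a Swift Playground, created an MLRecommender from the listening history, and generated song recommendations. Finally, we used the song metadata to match the recommended songs with their titles and artists.
Source Files
All of the files mentioned in this tutorial can be found here, including: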
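songs.csv: metadata for one million songs
history.txt: song listening history for many users
data-parser.py: Python code for manipulating the Million Song Dataset
merged_listed_data.csv: the merged data set of song metadata and listening history
merged_listed_data_incremental_song_id.csv: merged_listed_data.csv with incremental IDs added
songs_incremental_id.csv: songs.csv with incremental IDs added
MusicRecommender.playground: the Swift Playground used to create the MLRecommender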
This post was inspired by Eric Le's article "How to build a simple song recommender system".
Source: https://dev.to/nickymarino/how-to-build-a-song-recommender-using-create-ml-mlrecommender-45h1