Deep Models for CTR Prediction

Dataset

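The Electronics Dataset comes from Amazon review data covering 1996 to 2014; Electronics is only the electronics product category within that data. It consists of two files: one holds the reviews and the other holds the product metadata. A sample record from each file: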

{"reviewerID": "A3BY5KCNQZXV5U", "asin": "0594451647", "reviewerName": "Matenai", "helpful": [3, 3], "reviewText": "This product really works great but I found the following items you need to keep in mind:- You must have your power adapter connected for it to work...it plugs in the the bottom. It appears it needs power from the nook power adapter to operate.- The plug fits in loosely and you cannot move the Nook around much without holding the adapter in place.- On initial plugin it seems you need to rock it around to get the connection but then it seems solid.- It works with a 25ft high quality HDMI cable so you can put the NOOK across the room with you. Not tested with cheap cables.Warning...I found that my LG SmartTV 3D from a few years back does not work with this adapter but it does not seem to work with many things...bad software. This adapter works fine with other HDMI devices I have used like monitors and I am sure other TVs.Gave it five stars because it really is nice to extend the screen and use your Nook as a streaming server to your TV. Nice they made such a device.", "overall": 5.0, "summary": "This works great but read the details...", "unixReviewTime": 1390176000, "reviewTime": "01 20, 2014"}
{'asin': '0594287995', 'imUrl': 'http://g-ecx.images-amazon.com/images/G/01/x-site/icons/no-img-sm._CB192198896_.gif', 'categories': [['Electronics', 'eBook Readers & Accessories', 'Covers']], 'title': 'Kate Spade Rain or Shine Magazine Cover for Nook Simple Touch'}
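
Note that the two sample records use different quoting: the review record is valid JSON, while the metadata record is a Python dict literal with single quotes. A minimal sketch of reading each kind of line follows, assuming the whole metadata file uses that literal format. The helper names `parse_review` and `parse_meta` are hypothetical, the sample lines are shortened copies of the records above, and reading "the last category" as the last element of the last category path is an interpretation.

```python
import ast
import json

# Shortened copies of the two sample records shown above.
review_line = ('{"reviewerID": "A3BY5KCNQZXV5U", "asin": "0594451647", '
               '"overall": 5.0, "unixReviewTime": 1390176000}')
meta_line = ("{'asin': '0594287995', 'categories': [['Electronics', "
             "'eBook Readers & Accessories', 'Covers']], "
             "'title': 'Kate Spade Rain or Shine Magazine Cover for Nook Simple Touch'}")

def parse_review(line):
    """Parse one JSON review line into (reviewerID, asin, unixReviewTime)."""
    record = json.loads(line)
    return record["reviewerID"], record["asin"], record["unixReviewTime"]

def parse_meta(line):
    """Parse one metadata line (a Python dict literal) into (asin, category)."""
    # literal_eval only evaluates literals, so it is safer than eval.
    record = ast.literal_eval(line)
    # Assumption: "last category" = last element of the last category path.
    return record["asin"], record["categories"][-1][-1]

review = parse_review(review_line)
meta = parse_meta(meta_line)
```

Here `meta` holds the product ID together with `'Covers'`, the deepest category of the sample product.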

Data Preprocessing

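This experiment uses only four fields: reviewerID, asin, unixReviewTime, and categories. For simplicity, only the last category in categories is kept. For each user, the reviewed products are sorted by time. Each reviewed product is a positive sample, and the list of products reviewed before it is its history. An equal number of other products (ones that appear neither in the user's history nor in the user's future reviews) are randomly sampled as negative samples. Finally, each user's last review goes into the test set, built the same way as the training samples: the product actually reviewed is the positive sample and a sampled other product is the negative sample. This guarantees that both the positive and the negative samples cover every user.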
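
The sampling procedure described above can be sketched as follows. This is a simplified illustration, not the reference code: the function name `build_samples`, its parameters, and the toy input are hypothetical. It assumes that user and item IDs have already been mapped to integers in `range(item_count)`, that no user has reviewed every item, and that a user's first review is skipped because it has no history.

```python
import random
from collections import defaultdict

def build_samples(reviews, item_count, seed=0):
    """Build train/test samples from (user_id, item_id, unix_time) triples."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item, ts in reviews:
        by_user[user].append((ts, item))

    train_set, test_set = [], []
    for user, events in by_user.items():
        events.sort()  # chronological order (ties broken by item id)
        pos_list = [item for _, item in events]
        reviewed = set(pos_list)

        def sample_negative():
            # Reject every item the user reviewed, past or future.
            while True:
                neg = rng.randrange(item_count)
                if neg not in reviewed:
                    return neg

        # Assumption: the first review has no history, so start at index 1.
        for i in range(1, len(pos_list)):
            hist = pos_list[:i]
            neg = sample_negative()
            if i < len(pos_list) - 1:
                train_set.append((user, hist, pos_list[i], 1))
                train_set.append((user, hist, neg, 0))
            else:
                # The last review becomes the test sample: (positive, negative).
                test_set.append((user, hist, (pos_list[i], neg)))
    return train_set, test_set

# Toy usage: one user with four reviews, given out of order on purpose.
toy_reviews = [(0, 12, 300), (0, 10, 100), (0, 11, 200), (0, 13, 400)]
toy_train, toy_test = build_samples(toy_reviews, item_count=50)
```

With the toy input, the user contributes two positive and two negative training samples, and the final review (item 13) lands in the test set with the history `[10, 11, 12]`.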

For the preprocessing pipeline, see the reference code; I reused the code from the DIN paper: https://github.com/zhougr1993/DeepInterestNetwork

Basic statistics: user_count: 192403, item_count: 63001, cate_count: 801, example_count: 1689188

# Training set
# reviewerID, hist,    asin,  label
104760, [3737, 19450], 18486, 1

# Test set
# reviewerID,  hist,       (reviewed asin,  non-reviewed asin)
91788, [16942, 42346, 38112, 36550, 45547, 31289, 48828], (57905, 11716)

# asin-to-category mapping table cate_list: indexed by asin ID, each value is the cate_id
[738 157 571 ...  63 674 351]
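
As a small illustration of how such a mapping table is used, the sketch below looks up the category of every item in a history list. The six-element `cate_list` is a hypothetical toy stand-in (the real table has one entry per item), and `categories_for` is a hypothetical helper name.

```python
# Toy stand-in for cate_list: position = asin ID, value = cate_id.
cate_list = [738, 157, 571, 63, 674, 351]  # hypothetical, 6 items only

def categories_for(item_ids, cate_list):
    """Return the cate_id of each item ID, in the same order."""
    return [cate_list[item_id] for item_id in item_ids]

hist = [3, 0, 5]  # a toy history of asin IDs
hist_cates = categories_for(hist, cate_list)  # [63, 738, 351]
```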