├── README.md
├── faiss_model
│   ├── config.yml
│   ├── data
│   │   └── embedding.txt
│   ├── embedding_recsys.py
│   ├── faiss_api.py
│   ├── faiss_fastapi.py
│   ├── model
│   │   └── faiss.model
│   ├── process.png
│   ├── shoppingweb.png
│   ├── utils.py
│   ├── uwsgi.ini
│   └── uwsgi.pid
└── lsh_model
    ├── config.Properities
    ├── content_similar.scala
    ├── library
    │   ├── ambiguity.dic
    │   ├── default.dic
    │   └── stop.dic
    ├── pom.xml
    └── submit.bash
/README.md:
--------------------------------------------------------------------------------
1 | # recsys_faiss
2 | A content-based related-item recommendation API for products, built on fastText + faiss. The RESTful service is provided two ways: nginx + uwsgi + flask, and gunicorn + uvicorn + fastapi
3 | 
4 | A Spark implementation of the content-based recommendations is also included: Ansj + Word2vec + LSH + Phoenix
5 | 
6 | #### Product detail page screenshot
7 | The deployed model serving the product page
8 | ![](shoppingweb.png)
9 | 
10 | #### Model and API workflow
11 | ![](process.png)
12 | 
13 | #### Train feature vectors from the item attributes and add the item vectors to faiss
14 | ```
15 | python embedding_recsys.py
16 | ```
17 | 
18 | #### Flask wraps the faiss index: given an item id, its stored vector is reconstructed and used for cosine-similarity retrieval
19 | #### Start uwsgi
20 | ```
21 | uwsgi uwsgi.ini
22 | ```
23 | 
24 | #### nginx configuration
25 | ```
26 | server {
27 |     listen 8089;  # port to listen on
28 |     charset utf-8;
29 | 
30 |     server_name localhost;  # host name / IP
31 |     location / {
32 |         include uwsgi_params;
33 |         uwsgi_pass 127.0.0.1:8088;
34 |         uwsgi_param UWSGI_CHDIR /Users/PycharmProjects/recsys_faiss;
35 |         uwsgi_param UWSGI_SCRIPT recsys_faiss.faiss_api.py;
36 |     }
37 | }
38 | ```
39 | 
40 | #### API test
41 | GET request; the spu parameter is the item ID and n_items is the number of similar items to recall
42 | ```
43 | python
44 | >>> import requests
45 | >>> res = requests.get("http://127.0.0.1:8089/faiss/similar_items/?spu=3&n_items=10")
46 | >>> res.json()
47 | {'code': '200', 'msg': '处理成功', 'result': {'56482': 1.0, '92237': 1.0, '56483': 1.0, '56481': 1.0, '56484': 1.0, '56485': 1.0, '56486': 1.0, '4': 1.0, '18': 0.9981815814971924, '19': 0.9981815814971924}}
48 | ```
49 | 
50 | #### Recommendation result check
51 | spu = 3
52 | ```
53 | +-------------+--------------------------------------+
54 | | ITEM_NUM_ID | ITEM_NAME                            |
55 | +-------------+--------------------------------------+
56 | | 3           | 卓德优格乳杏口味含乳饮品             |
57 | +-------------+--------------------------------------+
58 | ```
59 | 
60 | #### Recommendation results
61 | ```
62 | +-------------+---------------------------------------------------------------------------+
63 | | ITEM_NUM_ID | ITEM_NAME                                                                 |
64 | +-------------+---------------------------------------------------------------------------+
65 | | 19          | 卓德低脂热处理风味发酵乳(森林水果口味)120g                                |
66 | | 8221        | 爱乐薇蓝莓味含乳饮品125克                                                 |
67 | | 56481       | 卓德风味发酵乳(草莓鲜酪口味)120g                                          |
68 | | 8           | 卓德脱脂含乳饮品(覆盆子口味)                                              |
69 | | 56483       | 卓德风味发酵乳(香草口味)120g                                              |
70 | | 20          | 卓德低脂热处理风味发酵乳(草莓口味)120g                                    |
71 | | 56484       | 卓德脱脂含乳饮品水蜜桃口味+覆盆子口味4*115g                               |
72 | | 56486       | 卓德热处理风味发酵乳(原味)4*115g                                          |
73 | | 56482       | 卓德风味发酵乳(焗苹果口味)120g                                            |
74 | | 8229        | 爱乐薇菠萝味含乳饮品125克                                                 |
75 | | 18          | 卓德低脂热处理风味发酵乳(水蜜桃、西番莲口味)120g                          |
76 | | 4           | 卓德优格乳草莓口味含乳饮品                                                |
77 | | 92237       | 卓德含乳饮品(草莓口味)460克(4*115克)                                      |
78 | | 6           | 卓德脱脂含乳饮品(水蜜桃口味)                                              |
79 | +-------------+---------------------------------------------------------------------------+
80 | ```
81 | 
82 | #### API load test
83 | ```
84 | siege -c 100 -t 10s -b "http://127.0.0.1:8089/faiss/similar_items/?spu=3&n_items=50"
85 | 
86 | Transactions:              41011 hits
87 | Availability:             100.00 %
88 | Elapsed time:               9.17 secs
89 | Data transferred:          12.24 MB
90 | Response time:              0.02 secs
91 | Transaction rate:        4472.30 trans/sec
92 | Throughput:                 1.33 MB/sec
93 | Concurrency:               99.57
94 | Successful transactions:   41011
95 | Failed transactions:           0
96 | Longest transaction:        0.07
97 | Shortest transaction:       0.00
98 | ```
99 | 
100 | #### FastAPI
101 | ```
102 | gunicorn faiss_fastapi:app -w 4 -k uvicorn.workers.UvicornWorker -D
103 | ```
104 | ```
105 | python
106 | >>> import requests
107 | >>> res = requests.get("http://127.0.0.1:8000/faiss/similar_items/?spu_id=3&n_items=10")
108 | >>> res.json()
109 | {'code': 200, 'msg': 'success', 'res': [4, 56486, 92237, 56484, 56485, 56481, 56482, 56483, 18, 20]}
110 | ```
111 | 
112 | #### Spark implementation
113 | Create the table RECSYS_SIMILAR_LSH in Phoenix
114 | ```
115 | 0: jdbc:phoenix:> create table RECSYS_SIMILAR_LSH (id varchar not null primary key, recommend varchar) salt_buckets=8;
116 | ```
117 | Submit the Spark job
118 | ```
119 | bash submit.bash
120 | ```
121 | Check the results
122 | ```
123 | 0: jdbc:phoenix:> select * from RECSYS_SIMILAR_LSH limit 5;
124 | +---------+-----------------------------------------------------------------------------------------------------------------------------------------------+
125 | | ID      | RECOMMEND                                                                                                                                     |
126 | +---------+-----------------------------------------------------------------------------------------------------------------------------------------------+
127 | | 100880  | 4407,78608,753,88585,99289,17360,42159,43082,8636,43403,109828,2409,214619,202489,43125,14123,97192,9408,73847,48269,20587,209262,76913,78394 |
128 | | 102431  | 100034,100280,98687,118912,114140,29619,106257,118940,100065,30217,49843,49891,41759,28874,109745,29915,20059,29191,238333,90415,51839,48266, |
129 | | 104213  | 237497,21255,12543,98798,90771,117289,21262,20042,75753,212108,29915,50095,50537,39070,20059,101172,53475,18816,29859,109745,41840,29619,1886 |
130 | | 105577  | 9681,91428,62392,41219,117776,13191,120160,97337,112055,78196,202915,202899,227439,39411,94532,102624,102618,235521,105425,120167,58650,85126 |
131 | | 106655  | 233605,42025,120616,59829,203421,209948,99844,94505,752,39665,93387,80632,232698,57406,102814,43438,42975,8926,91368,73961,210979,92327,94477 |
132 | +---------+-----------------------------------------------------------------------------------------------------------------------------------------------+
133 | ```
134 | 
135 | 
136 | 
137 | 
138 | 
--------------------------------------------------------------------------------
/faiss_model/config.yml:
--------------------------------------------------------------------------------
1 | model:
2 |   fasttext_model_path: ./model/fasttext.model
3 |   embedding_size: 100
4 |   fasttext_epochs: 5
5 |   fasttext_min_count: 1
6 |   fasttext_sg: 1
7 |   fasttext_mode: train
8 |   faiss_mode: train
9 |   faiss_model_path: ./model/faiss.model
10 |   faiss_topn: 50
11 |   faiss_add_index: True
12 |   add_random: False
13 | 
14 | data:
15 |   embedding_path: ./data/embedding.txt
16 |   top_spu_path: ./data/top_spu.pkl
17 |   mysql_host: 127.0.0.1
18 |   mysql_port: 3306
19 |   mysql_db: recsys
20 |   mysql_user: root
21 |   mysql_password: password
22 | 
23 | weight:
24 |   PTY_NUM_1_weight: 0.1
25 |   PTY_NUM_2_weight: 0.2
26 |   PTY_NUM_3_weight: 0.3
27 |   PRODUCT_ORIGIN_weight: 0.1
28 |   BRAND_weight: 0.2
29 | 
--------------------------------------------------------------------------------
/faiss_model/embedding_recsys.py:
--------------------------------------------------------------------------------
1 | import faiss
2 | import numpy as np
3 | from gensim.models import FastText
4 | from sklearn.preprocessing import normalize
5 | 
6 | from utils import load_yaml_config, connect_mysql
7 | 
8 | 
9 | def create_fasttext_model(sentence_list, model_path, embedding_path, mode="train", min_count=1,
10 |                           size=128, sg=1, epochs=5):
11 |     if mode == "train":
12 |         model = FastText(min_count=min_count, size=size, sg=sg)
13 |         model.build_vocab(sentence_list)
14 |         model.train(sentence_list, total_examples=model.corpus_count, epochs=epochs)
15 |     elif mode == "update":
mode == "update": 16 | model = FastText.load(model_path) 17 | model.build_vocab(sentence_list, update=True) 18 | model.train(sentence_list, total_examples=model.corpus_count, epochs=5) 19 | model.save(model_path) 20 | model.wv.save_word2vec_format(embedding_path, binary=False) 21 | 22 | return model 23 | 24 | 25 | def create_item_embedding(model, sentence, size, weight_dict, add_random): 26 | raw = np.zeros(size) 27 | for word in sentence: 28 | if word in list(model.wv.vocab.keys()): 29 | raw += model.wv[word] * weight_dict[word.split("|")[0]] 30 | if add_random: 31 | raw = raw * 100 + np.random.random(size) / 100 32 | 33 | return normalize(raw.reshape(1, -1))[0] 34 | 35 | 36 | def create_faiss_model(item_embedding, item_list, faiss_path, size=128, mode="train"): 37 | item_embedding = np.array(item_embedding, dtype=np.float32) 38 | ids = np.array(item_list).astype("int") 39 | if mode == "train": 40 | index = faiss.index_factory(size, "IVF100,Flat", faiss.METRIC_INNER_PRODUCT) 41 | index.nprobe = 20 42 | index.train(item_embedding) 43 | # 初始化make_direct_map,reconstruct 重建向量 44 | index.make_direct_map() 45 | index_id = faiss.IndexIDMap2(index) 46 | elif mode == "update": 47 | index_id = faiss.read_index(faiss_path) 48 | index_id.add_with_ids(item_embedding, ids) 49 | # index保存 50 | faiss.write_index(index_id, faiss_path) 51 | 52 | return index 53 | 54 | 55 | if __name__ == "__main__": 56 | config = load_yaml_config("./config.yml") 57 | # model 58 | fasttext_model_path = config["model"]["fasttext_model_path"] 59 | faiss_model_path = config["model"]["faiss_model_path"] 60 | fasttext_mode = config["model"]["fasttext_mode"] 61 | faiss_mode = config["model"]["faiss_mode"] 62 | embedding_size = config["model"]["embedding_size"] 63 | fasttext_epochs = config["model"]["fasttext_epochs"] 64 | fasttext_min_count = config["model"]["fasttext_min_count"] 65 | fasttext_sg = config["model"]["fasttext_sg"] 66 | faiss_topn = config["model"]["faiss_topn"] 67 | add_random = config["model"]["add_random"] 68 | # data 69 | embedding_path = config["data"]["embedding_path"] 70 | hbase_host = config["data"]["hbase_host"] 71 | hbase_port = config["data"]["hbase_port"] 72 | hbase_table = config["data"]["hbase_table"] 73 | mysql_host = config["data"]["mysql_host"] 74 | mysql_port = config["data"]["mysql_port"] 75 | mysql_db = config["data"]["mysql_db"] 76 | mysql_user = config["data"]["mysql_user"] 77 | mysql_password = config["data"]["mysql_password"] 78 | # weight 79 | PTY_NUM_1_weight = config["weight"]["PTY_NUM_1_weight"] 80 | PTY_NUM_2_weight = config["weight"]["PTY_NUM_2_weight"] 81 | PTY_NUM_3_weight = config["weight"]["PTY_NUM_3_weight"] 82 | PRODUCT_ORIGIN_weight = config["weight"]["PRODUCT_ORIGIN_weight"] 83 | BRAND_weight = config["weight"]["BRAND_weight"] 84 | # 创建权重字典 85 | weight_dict = { 86 | "PTY_NUM_1": PTY_NUM_1_weight, 87 | "PTY_NUM_2": PTY_NUM_2_weight, 88 | "PTY_NUM_3": PTY_NUM_3_weight, 89 | "PRODUCT_ORIGIN_NUM_ID": PRODUCT_ORIGIN_weight, 90 | "BRAND_ID": BRAND_weight 91 | } 92 | 93 | query = """select distinct ITEM_NUM_ID, PTY_NUM_1, PTY_NUM_2, PTY_NUM_3, PRODUCT_ORIGIN_NUM_ID, BRAND_ID 94 | from goods_data where ITEM_NUM_ID is not null""" 95 | 96 | # 读取商品数据 97 | conn = connect_mysql(mysql_host, mysql_port, mysql_db, mysql_user, mysql_password) 98 | cursor = conn.cursor() 99 | cursor.execute(query) 100 | columns = ['ITEM_NUM_ID', 'PTY_NUM_1', 'PTY_NUM_2', 'PTY_NUM_3', 'PRODUCT_ORIGIN_NUM_ID', 'BRAND_ID'] 101 | 102 | # 获得spu和spu的特征 103 | item_sentence_dict = {} 104 | for line in cursor.fetchall(): 
105 |         item = line[0]
106 |         sentence = [column + "|" + str(value) for column, value in zip(columns[1:], line[1:]) if value is not None]
107 |         item_sentence_dict.setdefault(item, [])
108 |         item_sentence_dict[item].append(sentence)
109 |     # keep the longest feature list per item: the same spu can appear in several rows and the features cannot simply be de-duplicated
110 |     item_sentence_dict = {item: sorted(sentences, key=lambda x: len(x), reverse=True)[0] for item, sentences in
111 |                           item_sentence_dict.items()}
112 | 
113 |     cursor.close()
114 |     conn.close()
115 | 
116 |     # build item_list and sentence_list
117 |     item_list, sentence_list = list(item_sentence_dict.keys()), list(item_sentence_dict.values())
118 | 
119 |     # train the word vectors
120 |     fasttext_model = create_fasttext_model(sentence_list, fasttext_model_path, embedding_path,
121 |                                            mode=fasttext_mode, min_count=fasttext_min_count,
122 |                                            size=embedding_size, sg=fasttext_sg, epochs=fasttext_epochs)
123 | 
124 |     # build the sentence (item) vectors
125 |     item_embedding = [create_item_embedding(fasttext_model, sentence, embedding_size, weight_dict, add_random) for
126 |                       sentence in sentence_list]
127 |     item_embedding = np.array(item_embedding, dtype=np.float32)
128 |     # train the faiss index
129 |     index = create_faiss_model(item_embedding, item_list, faiss_path=faiss_model_path, size=embedding_size,
130 |                                mode=faiss_mode)
131 | 
132 | 
133 | 
134 | 
135 | 
--------------------------------------------------------------------------------
/faiss_model/faiss_api.py:
--------------------------------------------------------------------------------
1 | import traceback
2 | import datetime
3 | 
4 | from flask import Flask, request, jsonify
5 | import faiss
6 | 
7 | from utils import load_yaml_config
8 | 
9 | config = load_yaml_config("./config.yml")
10 | faiss_model_path = config["model"]["faiss_model_path"]
11 | index = faiss.read_index(faiss_model_path)
12 | 
13 | model_update_time = ""
14 | 
15 | app = Flask(__name__)
16 | 
17 | 
18 | @app.route("/faiss/similar_items/", methods=["GET"])
19 | def check():
20 |     try:
21 |         get_data = request.args.to_dict()
22 |         spu = int(get_data.get('spu'))
23 |         n_items = int(get_data.get('n_items'))
24 | 
25 |     except Exception as e:
26 |         return jsonify({'code': 400, 'msg': '无效请求', 'trace': traceback.format_exc()}), 400
27 | 
28 |     result = faiss_search(spu, n_items)
29 |     return jsonify({'code': 200, 'msg': '处理成功', 'result': result}), 200
30 | 
31 | 
32 | # search helper
33 | def faiss_search(spu, n_items):
34 |     tim_str = datetime.datetime.now().strftime("%Y%m%d")
35 |     global model_update_time, index
36 | 
37 |     try:
38 |         if tim_str != model_update_time:
39 |             print("reloading the index file from disk")
40 |             model_update_time = tim_str
41 | 
42 |             index = faiss.read_index(faiss_model_path)
43 |             D, I = index.search(index.reconstruct(spu).reshape(1, -1), n_items + 1)
44 |             result = {spu_index: round(score, 4) for spu_index, score in zip(I.tolist()[0], D.tolist()[0])
45 |                       if spu_index != spu}
46 |         else:
47 |             print("using the cached index")
48 |             D, I = index.search(index.reconstruct(spu).reshape(1, -1), n_items + 1)
49 |             result = {spu_index: round(score, 4) for spu_index, score in zip(I.tolist()[0], D.tolist()[0])
50 |                       if spu_index != spu}
51 |     except Exception as e:
52 |         result = []
53 | 
54 |     return result
55 | 
56 | 
57 | if __name__ == "__main__":
58 |     app.run(host='0.0.0.0', port=5000)
59 | 
--------------------------------------------------------------------------------
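Both API modules follow the same lookup pattern as the index built in embedding_recsys.py: reconstruct the stored vector for a spu id, then run an inner-product search against the IVF index (on normalized vectors this is cosine similarity). The snippet below is a minimal, self-contained sketch of that pattern; the dimension and index type mirror config.yml and embedding_recsys.py, while the vectors and spu ids are random stand-ins used only for illustration.

```python
import faiss
import numpy as np

d = 100                                          # embedding_size in config.yml
vectors = np.random.random((1000, d)).astype("float32")
faiss.normalize_L2(vectors)                      # unit vectors, so inner product equals cosine similarity
spu_ids = np.arange(1, 1001).astype("int64")     # stand-in spu ids

index = faiss.index_factory(d, "IVF100,Flat", faiss.METRIC_INNER_PRODUCT)
index.nprobe = 20
index.train(vectors)
index.make_direct_map()                          # lets the IVF index reconstruct stored vectors
index_id = faiss.IndexIDMap2(index)              # maps external spu ids so reconstruct(spu) works
index_id.add_with_ids(vectors, spu_ids)

spu = 3
query = index_id.reconstruct(spu).reshape(1, -1)   # rebuild the stored vector for this spu
D, I = index_id.search(query, 10 + 1)              # top-10 neighbours plus the item itself
print({i: round(s, 4) for i, s in zip(I[0].tolist(), D[0].tolist()) if i != spu})
```

Because faiss.write_index persists the IndexIDMap2 wrapper, the index reloaded in faiss_api.py and faiss_fastapi.py supports reconstruct(spu) directly by spu id.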
load_yaml_config("./config.yml") 9 | faiss_model_path = config["model"]["faiss_model_path"] 10 | index = faiss.read_index(faiss_model_path) 11 | 12 | model_update_time = None 13 | 14 | app = FastAPI() 15 | 16 | 17 | @app.get("/faiss/similar_items/") 18 | async def get_single_item_similar( 19 | spu_id: int = Query( 20 | ..., 21 | title="spu_id", 22 | description="item spu id", 23 | gt=0), 24 | n_items: int = Query( 25 | 9, 26 | titel="n_items", 27 | description="topN of similar items", 28 | ge=1) 29 | ): 30 | tim_str = datetime.datetime.now().strftime("%Y%m%d") 31 | global model_update_time, index 32 | item_list = [] 33 | 34 | try: 35 | if tim_str != model_update_time: 36 | model_update_time = tim_str 37 | # 重新读取模型文件 38 | index = faiss.read_index(faiss_model_path) 39 | distance, item_index = index.search(index.reconstruct(spu_id).reshape(1, -1), n_items + 1) 40 | item_list = [spu for spu in item_index.tolist()[0] if spu != spu_id] 41 | result = {"code": 200, "msg": "success", "res": item_list} 42 | except Exception as e: 43 | result = {"code": 400, "msg": e.args, "res": item_list} 44 | 45 | return result 46 | -------------------------------------------------------------------------------- /faiss_model/model/faiss.model: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaogp/recsys_faiss/8295b3c556ea3388af73a430f1a3086fa336e2a0/faiss_model/model/faiss.model -------------------------------------------------------------------------------- /faiss_model/process.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaogp/recsys_faiss/8295b3c556ea3388af73a430f1a3086fa336e2a0/faiss_model/process.png -------------------------------------------------------------------------------- /faiss_model/shoppingweb.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaogp/recsys_faiss/8295b3c556ea3388af73a430f1a3086fa336e2a0/faiss_model/shoppingweb.png -------------------------------------------------------------------------------- /faiss_model/utils.py: -------------------------------------------------------------------------------- 1 | import yaml 2 | 3 | import happybase 4 | import pymysql 5 | 6 | 7 | def load_yaml_config(yaml_file): 8 | with open(yaml_file) as f: 9 | config = yaml.load(f, Loader=yaml.FullLoader) 10 | return config 11 | 12 | 13 | def connect_hbase(tablename, host, port): 14 | connection = happybase.Connection(host, port) 15 | table = connection.table(tablename) 16 | return table 17 | 18 | 19 | def connect_mysql(host, port, db_name, user, password): 20 | db_config = {"host": host, "port": port, "database": db_name, 21 | "user": user, "password": password} 22 | conn = pymysql.connect(**db_config) 23 | return conn 24 | 25 | -------------------------------------------------------------------------------- /faiss_model/uwsgi.ini: -------------------------------------------------------------------------------- 1 | [uwsgi] 2 | socket = :8088 3 | callable = app 4 | chdir = /Users/PycharmProjects/recsys_faiss 5 | wsgi-file = faiss_api.py 6 | pythonpath = /Users/opt/anaconda3/bin 7 | processes = 4 8 | threads = 3 9 | master = True 10 | buffer-size = 65536 11 | daemonize = ./faiss_api_uwsgi.log 12 | log-maxsize = 5000000 13 | pidfile = ./uwsgi.pid 14 | reload-mercy = 1 15 | worker-reload-mercy = 1 16 | -------------------------------------------------------------------------------- 
/faiss_model/uwsgi.pid: -------------------------------------------------------------------------------- 1 | 68721 2 | -------------------------------------------------------------------------------- /lsh_model/config.Properities: -------------------------------------------------------------------------------- 1 | stopPath: /path/library/stop.dic 2 | w2v_window_size: 5 3 | w2v_iter: 3 4 | w2v_vector_size: 100 5 | w2v_min_count: 0 6 | lsh_bucket_len: 4 7 | lsh_num_hash: 10 8 | lsh_threshold: 1 9 | topn: 100 10 | -------------------------------------------------------------------------------- /lsh_model/content_similar.scala: -------------------------------------------------------------------------------- 1 | package com.mycom.recsys 2 | 3 | import scala.io.Source 4 | import java.util.Properties 5 | import java.io.FileInputStream 6 | 7 | import org.apache.log4j.{Level, Logger} 8 | import org.apache.spark.sql.functions._ 9 | import org.apache.spark.sql.SparkSession 10 | import org.apache.spark.ml.feature.{Word2Vec, BucketedRandomProjectionLSH, Normalizer} 11 | import org.apache.spark.sql.Row 12 | import org.apache.spark.sql.expressions.Window 13 | import breeze.linalg.{DenseVector => BDV} 14 | import org.apache.spark.ml.linalg.{Vector, Vectors} 15 | 16 | import org.ansj.recognition.impl.StopRecognition 17 | import org.ansj.splitWord.analysis.ToAnalysis 18 | 19 | 20 | object content_similar { 21 | 22 | def getStopWord(stopPath: String): StopRecognition = { 23 | val stop = new StopRecognition() 24 | stop.insertStopNatures("w") // 过滤掉标点 25 | stop.insertStopNatures("null") // 过滤null词性 26 | stop.insertStopNatures("m") // 过滤量词 27 | stop.insertStopRegexes("^[a-zA-Z]{1,}") // 过滤英文字母 28 | val source = Source.fromFile(stopPath) 29 | for (line <- source.getLines()) { 30 | stop.insertStopWords(line) 31 | } 32 | stop 33 | } 34 | 35 | def getProPerties(proPath: String) = { 36 | val properties: Properties = new Properties() 37 | properties.load(new FileInputStream(proPath)) 38 | properties 39 | } 40 | 41 | def main(args: Array[String]): Unit = { 42 | Logger.getRootLogger.setLevel(Level.WARN) 43 | 44 | val properties = getProPerties(args(0)) 45 | val stopPath = properties.getProperty("stopPath") // 停止用词路径 46 | val w2v_window_size = properties.getProperty("w2v_window_size").toInt 47 | val w2v_iter = properties.getProperty("w2v_iter").toInt 48 | val w2v_vector_size = properties.getProperty("w2v_vector_size").toInt 49 | val w2v_min_count = properties.getProperty("w2v_min_count").toInt 50 | val lsh_bucket_len = properties.getProperty("lsh_bucket_len").toInt 51 | val lsh_num_hash = properties.getProperty("lsh_num_hash").toInt 52 | val lsh_threshold = properties.getProperty("lsh_threshold").toDouble 53 | val topn = properties.getProperty("topn").toInt 54 | 55 | val spark = SparkSession.builder() 56 | .appName("content_similar") 57 | .master("local[*]") 58 | .getOrCreate() 59 | import spark.implicits._ 60 | 61 | val df = spark.read.format("jdbc") 62 | .option("url", "jdbc:mysql://localhost:3306/recsys?useUnicode=true&characterEncoding=utf8&autoReconnect=true&useSSL=false") 63 | .option("user", "root") 64 | .option("password", "password") 65 | .option("dbtable", s"(select ITEM_NUM_ID, ITEM_NAME, PTY1_NAME, PTY2_NAME, PTY3_NAME, BRAND_NAME from goodsdata) as t1") 66 | .load() 67 | 68 | val filter = getStopWord(stopPath) 69 | val seg_udf = udf((text: String) => (ToAnalysis.parse(text).recognition(filter).toStringWithOutNature(" "))) 70 | 71 | // 分词 72 | val df2 = df.withColumn("seg", concat_ws(" ", 
seg_udf($"ITEM_NAME"), $"PTY1_NAME", $"PTY2_NAME", $"PTY3_NAME")) 73 | .select($"ITEM_NUM_ID", $"seg") 74 | .withColumn("seg", split($"seg", " ")) 75 | .cache() 76 | 77 | // 训练word2vec 78 | val word2Vec = new Word2Vec() 79 | .setInputCol("seg") 80 | .setOutputCol("vec_avg") 81 | .setWindowSize(w2v_window_size) 82 | .setMaxIter(w2v_iter) 83 | .setVectorSize(w2v_vector_size) 84 | .setMinCount(w2v_min_count) 85 | 86 | val model = word2Vec.fit(df2) 87 | 88 | // transform转化为向量的平均值 89 | val vec_avg_df = model.transform(df2) 90 | 91 | // 向量归一化 92 | val normalizer = new Normalizer() 93 | .setInputCol("vec_avg") 94 | .setOutputCol("vec") 95 | .setP(1.0) 96 | 97 | val vector_df = normalizer.transform(vec_avg_df).drop($"vec_avg") 98 | 99 | // explode所有商品的词 100 | // val df3 = df2.withColumn("word", explode($"seg")) 101 | // 102 | // // join 103 | // val df4 = df3.join(vector_df, Seq("word"), "inner").select($"ITEM_NUM_ID", $"normal_vector") 104 | // 105 | // // 词向量相加 106 | // val vec_sum_df = df4.rdd 107 | // .map { case Row(k: Long, v: Vector) => (k, BDV(v.toDense.values)) } 108 | // .foldByKey(BDV.zeros[Double](w2v_vector_size))(_ += _) 109 | // .mapValues(v => Vectors.dense(v.toArray)) 110 | // .toDF("id", "vec") 111 | 112 | // LSH 113 | val brp = new BucketedRandomProjectionLSH() 114 | .setBucketLength(lsh_bucket_len) 115 | .setNumHashTables(lsh_num_hash) 116 | .setInputCol("vec") 117 | .setOutputCol("hashes") 118 | val brpModel = brp.fit(vector_df) 119 | 120 | val brpDf = brpModel.approxSimilarityJoin(vector_df, vector_df, lsh_threshold, "EuclideanDistance") 121 | 122 | // 结果整理 123 | val getid_df = brpDf 124 | .withColumn("datasetA", udf((input: Row) => {input(0).toString}).apply($"datasetA")) 125 | .withColumn("datasetB", udf((input: Row) => {input(0).toString}).apply($"datasetB")) 126 | .withColumnRenamed("datasetA", "id_i") 127 | .withColumnRenamed("datasetB", "id_j") 128 | .filter("id_i != id_j") 129 | .withColumn("rank", row_number().over(Window.partitionBy("id_i").orderBy($"EuclideanDistance"))) 130 | .filter($"rank" <= topn) 131 | .groupBy($"id_i") 132 | .agg(concat_ws(",", collect_list($"id_j"))) 133 | .toDF("id", "recommend") 134 | 135 | getid_df.write.format("org.apache.phoenix.spark") 136 | .option("zkurl", "localhost:2181") 137 | .option("table", "RECSYS_SIMILAR_LSH") 138 | .mode("overwrite") 139 | .save() 140 | 141 | df2.unpersist() 142 | spark.close() 143 | 144 | } 145 | } 146 | 147 | -------------------------------------------------------------------------------- /lsh_model/library/ambiguity.dic: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaogp/recsys_faiss/8295b3c556ea3388af73a430f1a3086fa336e2a0/lsh_model/library/ambiguity.dic -------------------------------------------------------------------------------- /lsh_model/library/default.dic: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xiaogp/recsys_faiss/8295b3c556ea3388af73a430f1a3086fa336e2a0/lsh_model/library/default.dic -------------------------------------------------------------------------------- /lsh_model/library/stop.dic: -------------------------------------------------------------------------------- 1 | ——— 2 | 》), 3 | )÷(1- 4 | ”, 5 | )、 6 | =( 7 | : 8 | → 9 | ℃ 10 | & 11 | * 12 | 一一 13 | ~~~~ 14 | ’ 15 | . 
16 | 『 17 | .一 18 | ./ 19 | -- 20 | 』 21 | =″ 22 | 【 23 | [*] 24 | }> 25 | [⑤]] 26 | [①D] 27 | c] 28 | ng昉 29 | * 30 | // 31 | [ 32 | ] 33 | [②e] 34 | [②g] 35 | ={ 36 | } 37 | ,也 38 | ‘ 39 | A 40 | [①⑥] 41 | [②B] 42 | [①a] 43 | [④a] 44 | [①③] 45 | [③h] 46 | ③] 47 | 1. 48 | -- 49 | [②b] 50 | ’‘ 51 | ××× 52 | [①⑧] 53 | 0:2 54 | =[ 55 | [⑤b] 56 | [②c] 57 | [④b] 58 | [②③] 59 | [③a] 60 | [④c] 61 | [①⑤] 62 | [①⑦] 63 | [①g] 64 | ∈[ 65 | [①⑨] 66 | [①④] 67 | [①c] 68 | [②f] 69 | [②⑧] 70 | [②①] 71 | [①C] 72 | [③c] 73 | [③g] 74 | [②⑤] 75 | [②②] 76 | 一. 77 | [①h] 78 | .数 79 | [] 80 | [①B] 81 | 数/ 82 | [①i] 83 | [③e] 84 | [①①] 85 | [④d] 86 | [④e] 87 | [③b] 88 | [⑤a] 89 | [①A] 90 | [②⑧] 91 | [②⑦] 92 | [①d] 93 | [②j] 94 | 〕〔 95 | ][ 96 | :// 97 | ′∈ 98 | [②④ 99 | [⑤e] 100 | 12% 101 | b] 102 | ... 103 | ................... 104 | …………………………………………………③ 105 | ZXFITL 106 | [③F] 107 | 」 108 | [①o] 109 | ]∧′=[ 110 | ∪φ∈ 111 | ′| 112 | {- 113 | ②c 114 | } 115 | [③①] 116 | R.L. 117 | [①E] 118 | Ψ 119 | -[*]- 120 | ↑ 121 | .日 122 | [②d] 123 | [② 124 | [②⑦] 125 | [②②] 126 | [③e] 127 | [①i] 128 | [①B] 129 | [①h] 130 | [①d] 131 | [①g] 132 | [①②] 133 | [②a] 134 | f] 135 | [⑩] 136 | a] 137 | [①e] 138 | [②h] 139 | [②⑥] 140 | [③d] 141 | [②⑩] 142 | e] 143 | 〉 144 | 】 145 | 元/吨 146 | [②⑩] 147 | 2.3% 148 | 5:0 149 | [①] 150 | :: 151 | [②] 152 | [③] 153 | [④] 154 | [⑤] 155 | [⑥] 156 | [⑦] 157 | [⑧] 158 | [⑨] 159 | …… 160 | —— 161 | ? 162 | 、 163 | 。 164 | “ 165 | ” 166 | 《 167 | 》 168 | ! 169 | , 170 | : 171 | ; 172 | ? 173 | . 174 | , 175 | . 176 | ' 177 | ? 178 | · 179 | ——— 180 | ── 181 | ? 182 | — 183 | < 184 | > 185 | ( 186 | ) 187 | 〔 188 | 〕 189 | [ 190 | ] 191 | ( 192 | ) 193 | - 194 | + 195 | ~ 196 | × 197 | / 198 | / 199 | ① 200 | ② 201 | ③ 202 | ④ 203 | ⑤ 204 | ⑥ 205 | ⑦ 206 | ⑧ 207 | ⑨ 208 | ⑩ 209 | Ⅲ 210 | В 211 | " 212 | ; 213 | # 214 | @ 215 | γ 216 | μ 217 | φ 218 | φ. 219 | × 220 | Δ 221 | ■ 222 | ▲ 223 | sub 224 | exp 225 | sup 226 | sub 227 | Lex 228 | # 229 | % 230 | & 231 | ' 232 | + 233 | +ξ 234 | ++ 235 | - 236 | -β 237 | < 238 | <± 239 | <Δ 240 | <λ 241 | <φ 242 | << 243 | = 244 | = 245 | =☆ 246 | =- 247 | > 248 | >λ 249 | _ 250 | ~± 251 | ~+ 252 | [⑤f] 253 | [⑤d] 254 | [②i] 255 | ≈ 256 | [②G] 257 | [①f] 258 | LI 259 | ㈧ 260 | [- 261 | ...... 
262 | 〉 263 | [③⑩] 264 | 第二 265 | 一番 266 | 一直 267 | 一个 268 | 一些 269 | 许多 270 | 种 271 | 有的是 272 | 也就是说 273 | 末##末 274 | 啊 275 | 阿 276 | 哎 277 | 哎呀 278 | 哎哟 279 | 唉 280 | 俺 281 | 俺们 282 | 按 283 | 按照 284 | 吧 285 | 吧哒 286 | 把 287 | 罢了 288 | 被 289 | 本 290 | 本着 291 | 比 292 | 比方 293 | 比如 294 | 鄙人 295 | 彼 296 | 彼此 297 | 边 298 | 别 299 | 别的 300 | 别说 301 | 并 302 | 并且 303 | 不比 304 | 不成 305 | 不单 306 | 不但 307 | 不独 308 | 不管 309 | 不光 310 | 不过 311 | 不仅 312 | 不拘 313 | 不论 314 | 不怕 315 | 不然 316 | 不如 317 | 不特 318 | 不惟 319 | 不问 320 | 不只 321 | 朝 322 | 朝着 323 | 趁 324 | 趁着 325 | 乘 326 | 冲 327 | 除 328 | 除此之外 329 | 除非 330 | 除了 331 | 此 332 | 此间 333 | 此外 334 | 从 335 | 从而 336 | 打 337 | 待 338 | 但 339 | 但是 340 | 当 341 | 当着 342 | 到 343 | 得 344 | 的 345 | 的话 346 | 等 347 | 等等 348 | 地 349 | 第 350 | 叮咚 351 | 对 352 | 对于 353 | 多 354 | 多少 355 | 而 356 | 而况 357 | 而且 358 | 而是 359 | 而外 360 | 而言 361 | 而已 362 | 尔后 363 | 反过来 364 | 反过来说 365 | 反之 366 | 非但 367 | 非徒 368 | 否则 369 | 嘎 370 | 嘎登 371 | 该 372 | 赶 373 | 个 374 | 各 375 | 各个 376 | 各位 377 | 各种 378 | 各自 379 | 给 380 | 根据 381 | 跟 382 | 故 383 | 故此 384 | 固然 385 | 关于 386 | 管 387 | 归 388 | 果然 389 | 果真 390 | 过 391 | 哈 392 | 哈哈 393 | 呵 394 | 和 395 | 何 396 | 何处 397 | 何况 398 | 何时 399 | 嘿 400 | 哼 401 | 哼唷 402 | 呼哧 403 | 乎 404 | 哗 405 | 还是 406 | 还有 407 | 换句话说 408 | 换言之 409 | 或 410 | 或是 411 | 或者 412 | 极了 413 | 及 414 | 及其 415 | 及至 416 | 即 417 | 即便 418 | 即或 419 | 即令 420 | 即若 421 | 即使 422 | 几 423 | 几时 424 | 己 425 | 既 426 | 既然 427 | 既是 428 | 继而 429 | 加之 430 | 假如 431 | 假若 432 | 假使 433 | 鉴于 434 | 将 435 | 较 436 | 较之 437 | 叫 438 | 接着 439 | 结果 440 | 借 441 | 紧接着 442 | 进而 443 | 尽 444 | 尽管 445 | 经 446 | 经过 447 | 就 448 | 就是 449 | 就是说 450 | 据 451 | 具体地说 452 | 具体说来 453 | 开始 454 | 开外 455 | 靠 456 | 咳 457 | 可 458 | 可见 459 | 可是 460 | 可以 461 | 况且 462 | 啦 463 | 来 464 | 来着 465 | 离 466 | 例如 467 | 哩 468 | 连 469 | 连同 470 | 两者 471 | 了 472 | 临 473 | 另 474 | 另外 475 | 另一方面 476 | 论 477 | 嘛 478 | 吗 479 | 慢说 480 | 漫说 481 | 冒 482 | 么 483 | 每 484 | 每当 485 | 们 486 | 莫若 487 | 某 488 | 某个 489 | 某些 490 | 拿 491 | 哪 492 | 哪边 493 | 哪儿 494 | 哪个 495 | 哪里 496 | 哪年 497 | 哪怕 498 | 哪天 499 | 哪些 500 | 哪样 501 | 那 502 | 那边 503 | 那儿 504 | 那个 505 | 那会儿 506 | 那里 507 | 那么 508 | 那么些 509 | 那么样 510 | 那时 511 | 那些 512 | 那样 513 | 乃 514 | 乃至 515 | 呢 516 | 能 517 | 你 518 | 你们 519 | 您 520 | 宁 521 | 宁可 522 | 宁肯 523 | 宁愿 524 | 哦 525 | 呕 526 | 啪达 527 | 旁人 528 | 呸 529 | 凭 530 | 凭借 531 | 其 532 | 其次 533 | 其二 534 | 其他 535 | 其它 536 | 其一 537 | 其余 538 | 其中 539 | 起 540 | 起见 541 | 起见 542 | 岂但 543 | 恰恰相反 544 | 前后 545 | 前者 546 | 且 547 | 然而 548 | 然后 549 | 然则 550 | 让 551 | 人家 552 | 任 553 | 任何 554 | 任凭 555 | 如 556 | 如此 557 | 如果 558 | 如何 559 | 如其 560 | 如若 561 | 如上所述 562 | 若 563 | 若非 564 | 若是 565 | 啥 566 | 上下 567 | 尚且 568 | 设若 569 | 设使 570 | 甚而 571 | 甚么 572 | 甚至 573 | 省得 574 | 时候 575 | 什么 576 | 什么样 577 | 使得 578 | 是 579 | 是的 580 | 首先 581 | 谁 582 | 谁知 583 | 顺 584 | 顺着 585 | 似的 586 | 虽 587 | 虽然 588 | 虽说 589 | 虽则 590 | 随 591 | 随着 592 | 所 593 | 所以 594 | 他 595 | 他们 596 | 他人 597 | 它 598 | 它们 599 | 她 600 | 她们 601 | 倘 602 | 倘或 603 | 倘然 604 | 倘若 605 | 倘使 606 | 腾 607 | 替 608 | 通过 609 | 同 610 | 同时 611 | 哇 612 | 万一 613 | 往 614 | 望 615 | 为 616 | 为何 617 | 为了 618 | 为什么 619 | 为着 620 | 喂 621 | 嗡嗡 622 | 我 623 | 我们 624 | 呜 625 | 呜呼 626 | 乌乎 627 | 无论 628 | 无宁 629 | 毋宁 630 | 嘻 631 | 吓 632 | 相对而言 633 | 像 634 | 向 635 | 向着 636 | 嘘 637 | 呀 638 | 焉 639 | 沿 640 | 沿着 641 | 要 642 | 要不 643 | 要不然 644 | 要不是 645 | 要么 646 | 要是 647 | 也 648 | 也罢 649 | 也好 650 | 一 651 | 一般 652 | 一旦 653 | 一方面 654 | 一来 655 | 一切 656 | 一样 657 | 一则 658 | 依 659 | 依照 660 | 矣 661 | 以 662 | 以便 663 | 以及 664 | 以免 665 | 以至 666 | 以至于 667 | 以致 668 | 
抑或 669 | 因 670 | 因此 671 | 因而 672 | 因为 673 | 哟 674 | 用 675 | 由 676 | 由此可见 677 | 由于 678 | 有 679 | 有的 680 | 有关 681 | 有些 682 | 又 683 | 于 684 | 于是 685 | 于是乎 686 | 与 687 | 与此同时 688 | 与否 689 | 与其 690 | 越是 691 | 云云 692 | 哉 693 | 再说 694 | 再者 695 | 在 696 | 在下 697 | 咱 698 | 咱们 699 | 则 700 | 怎 701 | 怎么 702 | 怎么办 703 | 怎么样 704 | 怎样 705 | 咋 706 | 照 707 | 照着 708 | 者 709 | 这 710 | 这边 711 | 这儿 712 | 这个 713 | 这会儿 714 | 这就是说 715 | 这里 716 | 这么 717 | 这么点儿 718 | 这么些 719 | 这么样 720 | 这时 721 | 这些 722 | 这样 723 | 正如 724 | 吱 725 | 之 726 | 之类 727 | 之所以 728 | 之一 729 | 只是 730 | 只限 731 | 只要 732 | 只有 733 | 至 734 | 至于 735 | 诸位 736 | 着 737 | 着呢 738 | 自 739 | 自从 740 | 自个儿 741 | 自各儿 742 | 自己 743 | 自家 744 | 自身 745 | 综上所述 746 | 总的来看 747 | 总的来说 748 | 总的说来 749 | 总而言之 750 | 总之 751 | 纵 752 | 纵令 753 | 纵然 754 | 纵使 755 | 遵照 756 | 作为 757 | 兮 758 | 呃 759 | 呗 760 | 咚 761 | 咦 762 | 喏 763 | 啐 764 | 喔唷 765 | 嗬 766 | 嗯 767 | 嗳 768 | -------------------------------------------------------------------------------- /lsh_model/pom.xml: -------------------------------------------------------------------------------- 1 | 3 | 4.0.0 4 | com.mycom.recsys 5 | recsys 6 | 1.0-SNAPSHOT 7 | 2019 8 | 9 | UTF-8 10 | 2.4.4 11 | 2.11.12 12 | 2.6.0 13 | 2.0.2 14 | 15 | 16 | 17 | 18 | scala-tools.org 19 | Scala-Tools Maven2 Repository 20 | http://scala-tools.org/repo-releases 21 | 22 | 23 | 24 | 25 | 26 | scala-tools.org 27 | Scala-Tools Maven2 Repository 28 | http://scala-tools.org/repo-releases 29 | 30 | 31 | 32 | 33 | 34 | 35 | org.apache.spark 36 | spark-core_2.11 37 | 2.3.0 38 | provided 39 | 40 | 41 | 42 | org.apache.hbase 43 | hbase-client 44 | 2.0.0 45 | provided 46 | 47 | 48 | org.slf4j 49 | slf4j-log4j12 50 | 51 | 52 | 53 | 54 | 55 | org.apache.hbase 56 | hbase-server 57 | 1.2.3 58 | provided 59 | 60 | 61 | 62 | org.apache.hadoop 63 | hadoop-client 64 | 2.6.0 65 | provided 66 | 67 | 68 | 69 | org.codehaus.jackson 70 | jackson-mapper-asl 71 | 1.9.13 72 | 73 | 74 | 75 | org.apache.spark 76 | spark-sql_2.11 77 | 2.4.4 78 | provided 79 | 80 | 81 | 82 | org.apache.spark 83 | spark-hive_2.11 84 | 2.3.0 85 | provided 86 | 87 | 88 | 89 | 90 | org.apache.spark 91 | spark-mllib_2.11 92 | 2.3.0 93 | provided 94 | 95 | 96 | 97 | org.apache.spark 98 | spark-streaming-kafka-0-8_2.11 99 | ${spark.version} 100 | provided 101 | 102 | 103 | 104 | org.postgresql 105 | postgresql 106 | 42.2.8 107 | 108 | 109 | 110 | org.apache.spark 111 | spark-sql-kafka-0-10_2.11 112 | 2.4.4 113 | provided 114 | 115 | 116 | 117 | org.apache.kafka 118 | kafka_2.13 119 | 2.4.0 120 | 121 | 122 | 123 | org.apache.kafka 124 | kafka-clients 125 | 2.4.0 126 | 127 | 128 | 129 | com.redislabs 130 | spark-redis 131 | 2.4.0 132 | 133 | 134 | 135 | org.apache.commons 136 | commons-pool2 137 | 2.0 138 | 139 | 140 | 141 | redis.clients 142 | jedis 143 | 3.2.0 144 | 145 | 146 | 147 | ml.dmlc 148 | xgboost4j-spark 149 | 0.90 150 | 151 | 152 | 153 | org.apache.phoenix 154 | phoenix-spark 155 | 4.14.0-HBase-1.3 156 | provided 157 | 158 | 159 | 160 | org.ansj 161 | ansj_seg 162 | 5.1.3 163 | 164 | 165 | 166 | 167 | 168 | 169 | src/main/scala 170 | src/test/scala 171 | 172 | 173 | 174 | net.alchim31.maven 175 | scala-maven-plugin 176 | 3.3.1 177 | 178 | 179 | 180 | compile 181 | testCompile 182 | 183 | 184 | 185 | -dependencyfile 186 | ${project.build.directory}/.scala_dependencies 187 | 188 | 189 | 190 | 191 | 192 | 193 | org.apache.maven.plugins 194 | maven-surefire-plugin 195 | 2.18.1 196 | 197 | false 198 | true 199 | 200 | 201 | 202 | **/*Test.* 203 | **/*Suite.* 204 | 205 | 206 | 207 | 208 | 
org.apache.maven.plugins 209 | maven-compiler-plugin 210 | 3.1 211 | 212 | 8 213 | 8 214 | 215 | 216 | 217 | 218 | 219 | 220 | -------------------------------------------------------------------------------- /lsh_model/submit.bash: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | cd /path 3 | spark-submit --class com.mycom.recsys.content_similar \ 4 | --master local[*] \ 5 | --conf spark.sql.shuffle.partitions=150 \ 6 | --conf spark.default.parallelism=150 \ 7 | --executor-cores 3 \ 8 | --num-executors 3 \ 9 | --executor-memory 1g \ 10 | recsys-1.0-SNAPSHOT.jar "/path/config.Properities" 11 | --------------------------------------------------------------------------------
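The Spark job writes each item's precomputed neighbour list into the RECSYS_SIMILAR_LSH table, so serving a request is a single primary-key lookup. The sketch below is illustrative only: it assumes a Phoenix Query Server is reachable on its default port 8765 and uses the third-party phoenixdb package, neither of which ships with this repository (the Spark job itself writes to Phoenix through ZooKeeper via the phoenix-spark connector).

```python
import phoenixdb


def get_similar_items(item_id: str, topn: int = 10):
    # Phoenix Query Server endpoint: assumed address, adjust to your cluster
    conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
    try:
        cursor = conn.cursor()
        # primary-key lookup on the table created in the README (id varchar, recommend varchar)
        cursor.execute("SELECT RECOMMEND FROM RECSYS_SIMILAR_LSH WHERE ID = ?", (item_id,))
        row = cursor.fetchone()
    finally:
        conn.close()
    if row is None or not row[0]:
        return []
    # recommend is a comma-separated id list, already ordered by euclidean distance
    return row[0].split(",")[:topn]


print(get_similar_items("100880", topn=10))
```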