├── README.md
├── folder-hierarchy.png
├── lei
│   ├── 0.format.py
│   ├── 1.pre
│   ├── 2.pos
│   ├── 3.validate
│   ├── 4.lexicon.linux
│   ├── 5.profile
│   ├── 6.transform.py
│   ├── 7.match.py
│   ├── input
│   │   └── reviews_Musical_Instruments_5.json.gz
│   ├── intermediate
│   │   └── .gitkeep
│   └── output
│       └── .gitkeep
└── run_lei.sh
/README.md:
--------------------------------------------------------------------------------
# Sentires-Guide
A quick guide to Sentires: Phrase-level Sentiment Analysis toolkit
- Zhang, Yongfeng, et al. [Do users rate or review? Boost phrase-level sentiment labeling with review-level sentiment classification](http://yongfeng.me/attach/bps-zhang.pdf). SIGIR'14.

To obtain the tool, please check [Sentires](https://github.com/evison/Sentires).

## Motivation
The tool is valuable to the IR community, as many research works are built on top of its results. However, obtaining (feature, opinion, sentence, sentiment) quadruples from user reviews with it can be difficult, since the tool does not offer such a function out of the box. Moreover, most people nowadays are more familiar with Python than with Java, in which the tool was developed. We therefore document the data processing steps (mostly in Python) used in the following paper, to help researchers quickly obtain the aforementioned quadruples from user reviews.
- Lei Li, Yongfeng Zhang, Li Chen. Generate Neural Template Explanations for Recommendation. CIKM'20. \[[Paper](https://lileipisces.github.io/files/CIKM20-NETE-paper.pdf)\] \[[Code](https://github.com/lileipisces/NETE)\]

**A small ecosystem for Recommender Systems-based Natural Language Generation is available at [NLG4RS](https://github.com/lileipisces/NLG4RS)!**

## Creation Steps
![](folder-hierarchy.png)
- Place the folder [lei](lei/) in the tool's folder "English-Jar" as shown above.
- Modify [0.format.py](lei/0.format.py), including the data path (line 5), the keys (lines 12, 13, 16, 20, 21, 23), and the way the data are loaded and iterated over (line 9), so that your dataset is converted into the input format the tool expects. Lines 12-15 are meant to include as much textual data as possible; remove them if your dataset has no summary or tip.
- Update the absolute paths (lines 65, 78, 94, 95) in [4.lexicon.linux](lei/4.lexicon.linux) accordingly.
- Run the commands in [run_lei.sh](run_lei.sh) one by one (do not run the script as a whole, or it may throw errors).

## Friendly reminder
- Out-of-memory errors can usually be fixed by enlarging the memory allocation of the JVM, e.g., ```java -jar -Xmx200G thuir-sentires.jar -t lexicon -c lei/4.lexicon.linux >> log.txt```
- The POS tagging step can take a very long time, e.g., days.

## Results
You will find a file "reviews.pickle" in [lei/output/](lei/output/), which is a Python list, where each element is a Python dict with the following keys:
```
'user',
'item',
'rating',
'text',
'sentence' # a list of tuples of the form (feature, adjective, sentence, score); this key may be absent when the tool fails to identify any feature-opinion pair in the review
```
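For example, the quadruples can be read back with a few lines of Python (a minimal sketch; it assumes only the keys listed above, and that it is run from the "English-Jar" folder where lei/output/reviews.pickle is produced):
```
import pickle

# load the list of review dicts produced by lei/7.match.py
with open('lei/output/reviews.pickle', 'rb') as f:
    reviews = pickle.load(f)

for review in reviews:
    # 'sentence' is absent when no feature-opinion pair was identified
    for (feature, adjective, sentence, score) in review.get('sentence', []):
        print(review['user'], review['item'], feature, adjective, score)
```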

## Datasets to [download](https://drive.google.com/drive/folders/1yB-EFuApAOJ0RzTI0VfZ0pignytguU0_?usp=sharing)
- TripAdvisor Hong Kong
- Amazon Movies & TV
- Yelp 2019

## Citations
```
@inproceedings{CIKM20-NETE,
  title={Generate Neural Template Explanations for Recommendation},
  author={Li, Lei and Zhang, Yongfeng and Chen, Li},
  booktitle={CIKM},
  year={2020}
}
@inproceedings{WWW20-NETE,
  title={Towards Controllable Explanation Generation for Recommender Systems via Neural Template},
  author={Li, Lei and Chen, Li and Zhang, Yongfeng},
  booktitle={WWW Demo},
  year={2020}
}
@inproceedings{SIGIR14-Sentires,
  title={Do users rate or review? Boost phrase-level sentiment labeling with review-level sentiment classification},
  author={Zhang, Yongfeng and Zhang, Haochen and Zhang, Min and Liu, Yiqun and Ma, Shaoping},
  booktitle={SIGIR},
  year={2014}
}
```
--------------------------------------------------------------------------------
/folder-hierarchy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lileipisces/Sentires-Guide/e04cabd18bb09d8d2e5bd6eef05f08738cfbb198/folder-hierarchy.png
--------------------------------------------------------------------------------
/lei/0.format.py:
--------------------------------------------------------------------------------
 1 | import pickle
 2 | import gzip
 3 | import re
 4 | 
 5 | raw_path = 'input/reviews_Musical_Instruments_5.json.gz' # path to load raw reviews
 6 | writer_1 = open('input/record.per.row.txt', 'w', encoding='utf-8')
 7 | product2text_list = {}
 8 | product2json = {}
 9 | for line in gzip.open(raw_path, 'r'):
10 |     review = eval(line)
11 |     text = ''
12 |     if 'summary' in review:
13 |         summary = review['summary']
14 |         if summary != '':
15 |             text += summary + '\n'
16 |     text += review['reviewText']
17 | 
18 |     writer_1.write('<DOC>\n{}\n</DOC>\n'.format(text))
19 | 
20 |     item_id = review['asin']
21 |     json_doc = {'user': review['reviewerID'],
22 |                 'item': item_id,
23 |                 'rating': int(review['overall']),
24 |                 'text': text}
25 | 
26 |     if item_id in product2json:
27 |         product2json[item_id].append(json_doc)
28 |     else:
29 |         product2json[item_id] = [json_doc]
30 | 
31 |     if item_id in product2text_list:
32 |         product2text_list[item_id].append('<DOC>\n{}\n</DOC>\n'.format(text))
33 |     else:
34 |         product2text_list[item_id] = ['<DOC>\n{}\n</DOC>\n'.format(text)]
35 | 
36 | with open('input/records.per.product.txt', 'w', encoding='utf-8') as f:
37 |     for (product, text_list) in product2text_list.items():
38 |         f.write(product + '\t' + str(len(text_list)) + '\tfake_URL')
39 |         text = '\n\t' + re.sub('\n', '\n\t', ''.join(text_list).strip()) + '\n'
40 |         f.write(text)
41 | 
42 | pickle.dump(product2json, open('input/product2json.pickle', 'wb'))
--------------------------------------------------------------------------------
/lei/1.pre:
--------------------------------------------------------------------------------
preprocessing.language=english
preprocessing.raw.path=lei/input/record.per.row.txt
preprocessing.raw.charset=UTF8
preprocessing.selected.path=lei/intermediate/pre.txt
preprocessing.selected.charset=UTF8
--------------------------------------------------------------------------------
/lei/2.pos:
--------------------------------------------------------------------------------
postagging.language=english
postagging.select.path=lei/intermediate/pre.txt
postagging.select.charset=UTF8
postagging.pos.path=lei/intermediate/pos.1.txt
postagging.pos.charset=UTF8
postagging.model.file=res/english-bidirectional-distsim.tagger
--------------------------------------------------------------------------------
/lei/3.validate:
--------------------------------------------------------------------------------
validation.A.mapping.class=edu.thuir.sentires.buildin.STANPOSMapping
validation.B.mapping.class=edu.thuir.sentires.buildin.STANPOSMapping
validation.source.charset=UTF8
validation.output.charset=UTF8

validation.A.result=lei/intermediate/pos.1.txt
validation.B.result=lei/intermediate/pos.2.txt
validation.source=lei/intermediate/pre.txt
validation.output=lei/intermediate/validate.txt
--------------------------------------------------------------------------------
/lei/4.lexicon.linux:
--------------------------------------------------------------------------------
 1 | ##############################################################################
 2 | ##
 3 | ## author : zhanghc@thuir
 4 | ##
 5 | ## The '.task' file contains all necessary context of the specific sentiment
 6 | ## resources extraction task.
 7 | ##
 8 | ## To drive a build-sentiment-resource task, the '.task' file should always
 9 | ## contain the following parts:
10 | ## I/O :
11 | ##   * POS-tagged corpus as input and its encoding charset.
12 | ##   * tuple formatted with "[L]\t${Feature}|${Opinion}" as output
13 | ##     and its encoding charset.
14 | ## Resources :
15 | ##   * background corpus used to filter background noises, which should be
16 | ##     generated by the given tool of the project.
17 | ##   * public semantic resources, which provide the initial polarity of a given
18 | ##     opinion word.
19 | ##   * public linguistic resources, which provide some naive linguistic
20 | ##     resources generated manually.
21 | ##   * public stop word lists, which help to filter some already-known but
22 | ##     ambiguous noises.
23 | ## Morphology & POS information:
24 | ##   * custom POS parser should be specified in this part, and
25 | ##   * specific patterns that are used to extract features.
26 | ##   * manual of the POS parser and morphology component can be found in
27 | ##     conf/morphology.properties.
28 | ## Metrics & Thresholds:
29 | ##   * all metrics and thresholds that are used in the approach.
30 | ##   * please refer to the Approach Document and learn about these metrics.
31 | ##
32 | ## It is strongly recommended that all related configurations of the 'Resources'
33 | ## and 'Morphology & POS information' parts be recorded in individual
34 | ## property files to share with all other tasks.
35 | ##
36 | ##
37 | ##############################################################################
38 | ##
39 | ## I/O:
40 | ## In this part, all information of the input and output should be specified.
41 | ##
42 | ## Input:
43 | task.io.corpus.file=lei/intermediate/validate.txt
44 | task.io.corpus.charset=UTF8
45 | task.io.feature.file=data/feature/manual.feature
46 | task.io.feature.charset=UTF8
47 | task.io.stop.feature.file=data/feature/stop.feature
48 | task.io.stop.feature.charset=UTF8
49 | task.io.stop.opinion.file=data/opinion/stop.opinion
50 | task.io.stop.opinion.charset=UTF8
51 | 
52 | ## Output:
53 | task.io.lexicon.file=lei/intermediate/lexicon.txt
54 | task.io.lexicon.charset=UTF8
55 | lowest.frequency=4
56 | #lowest.frequency=1
57 | 
58 | ##
59 | ##############################################################################
60 | ##
61 | ## Metrics & Thresholds:
62 | ## In this part, all metrics and thresholds that are used in the approach
63 | ## should be specified.
64 | ##
65 | include=/home/comp/csleili/0/English-Jar/preset/relax.threshold
66 | 
67 | ##
68 | ##############################################################################
69 | ##
70 | ## Resources:
71 | ## Specify all necessary resources location and other properties.
72 | ##
73 | ## It is strongly recommended to gather all configurations and organize
74 | ## them in one particular property file as it can be shared with other tasks.
75 | ##
76 | ## integrated property file:
77 | #include=${path.preset}/default.resource
78 | include=/home/comp/csleili/0/English-Jar/preset/english.resource
79 | 
80 | ##
81 | # resource.stopword.path=stopword.res
82 | 
83 | ##
84 | ##############################################################################
85 | ##
86 | ## Morphology & POS:
87 | ## Specify custom POS parser and extracting patterns.
88 | ##
89 | ## It is strongly recommended to gather all configurations and organize
90 | ## them in one particular property file as it can be shared with other tasks.
91 | ##
92 | ## integrated property file:
93 | #include=${path.preset}/default.pattern
94 | include=/home/comp/csleili/0/English-Jar/preset/english.pattern
95 | include=/home/comp/csleili/0/English-Jar/preset/default.mapping
96 | 
97 | ##
98 | ##############################################################################
--------------------------------------------------------------------------------
/lei/5.profile:
--------------------------------------------------------------------------------
# input
profile.product = lei/input/records.per.product.txt
profile.lexicon = lei/intermediate/lexicon.txt
profile.sbc2dbc = data/sbc2dbc/sbc2dbc.dict

# output
profile.posprofile = lei/output/pos.sentence.txt
profile.negprofile = lei/output/neg.sentence.txt
profile.indicatorfile = lei/intermediate/indicator.txt
profile.index = lei/intermediate/index.txt
profile.rerank.index = lei/intermediate/rerank.index.txt

# not used
profile.feature = lei/2014.nus.utf.label.feature
profile.opinion = lei/2014.nus.utf.label.opinion

profile.product.charset = UTF8
profile.lexicon.charset = UTF8
profile.profile.charset = UTF8
profile.indicator.charset = UTF8
profile.feature.charset = UTF8
profile.opinion.charset = UTF8
profile.productname.toshiba.charset = UTF8
profile.productname.other.charset = UTF8
profile.index.charset = UTF8
profile.sbc2dbc.charset = UTF8
--------------------------------------------------------------------------------
/lei/6.transform.py:
--------------------------------------------------------------------------------
import pickle
import re


def data_format(path):
    # parse a pos/neg sentence profile produced by the 'profile' step
    product2feature = {}
    with open(path, 'r', errors='ignore') as f:
        line = f.readline()
        while line != '':
            product = line.split('\t')[2]
            line = f.readline()
            if line == '' or line.startswith('e'):
                continue

            feature2adj = {}
            while line != '' and not line.startswith('e'):
                feature = line[1:].strip()

                adj2sen = {}
                line = f.readline()
                while line.startswith('o'):
                    content = line[1:].strip().split('\t')
                    adj_phrase = content[0]
                    phrase_count = int(content[1])
                    sentence_c = []
                    while phrase_count > 0:
                        line = f.readline()[1:].strip()
                        sentence_count = re.findall(r'\([0-9]+\)', line)[-1]
                        sentence = line[:-(len(sentence_count) + 1)]
                        count = int(sentence_count[1:-1])
                        sentence_c.append((sentence, count))

                        phrase_count -= count

                    adj2sen[adj_phrase] = sentence_c

                    line = f.readline()

                feature2adj[feature] = adj2sen

            product2feature[product] = feature2adj

    return product2feature


product2feature_pos = data_format('lei/output/pos.sentence.txt')
product2feature_neg = data_format('lei/output/neg.sentence.txt')
pickle.dump(product2feature_pos, open('lei/intermediate/product2feature_pos.pickle', 'wb'))
pickle.dump(product2feature_neg, open('lei/intermediate/product2feature_neg.pickle', 'wb'))


'''
an example:

{'The Ritz-Carlton, Hong Kong':
    {'gluten':
        {'and the': [('My Wife is gluten free and they were able to provide gluten free bread at short notice', 1)],
         'free': [('I have special dietary requirements - gluten free - which the Concierge Club Staff went out of the way to accommodate', 1),
                  ('My Wife is gluten free and they were able to provide gluten free bread at short notice', 1),
                  ('They were even able to accommodate my Wife with gluten free pasta which she thoroughly enjoyed', 1)]},
     'smiles': {'warm': [('The Ritz-Carlton welcomed my family and me with the warmest of smiles in the fifth tallest building in the world and most luxurious from Hong Kong', 1)],
                'superb': [('We were greeted by different staff with smiles and superb services', 1)]}},
 'Korean Hostel': {'stay': {'on': [("If you've stayed on a backpackers before you'll be fine", 1)]}},
 'Gloria Guesthouse': {'staff': {'helpful': [('staff is helpful', 1)]},
                       'room': {'basic': [('room is spacious enough for 2 people with good bathroom and all basic amenities', 1)]}}}
'''
--------------------------------------------------------------------------------
/lei/7.match.py:
--------------------------------------------------------------------------------
import pickle


product2json = pickle.load(open('lei/input/product2json.pickle', 'rb'))
product2feature_pos = pickle.load(open('lei/intermediate/product2feature_pos.pickle', 'rb'))
product2feature_neg = pickle.load(open('lei/intermediate/product2feature_neg.pickle', 'rb'))


reviews = []
for (product_id, review_list) in product2json.items():
    if product_id in product2feature_pos:
        feature2adj = product2feature_pos[product_id]
        for (feature, adj2sent_list) in feature2adj.items():
            for (adj, sent_list) in adj2sent_list.items():
                for (sent, count) in sent_list:
                    for idx, review in enumerate(review_list):
                        text = review['text']
                        if text.find(sent) != -1:
                            if 'sentence' in review:
                                review['sentence'].append((feature, adj, sent, 1))  # (feature, adj, sent, score); 1 = positive
                            else:
                                review['sentence'] = [(feature, adj, sent, 1)]
                            review_list[idx] = review
                            count -= 1
                            if count == 0:
                                break

    if product_id in product2feature_neg:
        feature2adj = product2feature_neg[product_id]
        for (feature, adj2sent_list) in feature2adj.items():
            for (adj, sent_list) in adj2sent_list.items():
                for (sent, count) in sent_list:
                    for idx, review in enumerate(review_list):
                        text = review['text']
                        if text.find(sent) != -1:
                            if 'sentence' in review:
                                review['sentence'].append((feature, adj, sent, -1))  # (feature, adj, sent, score); -1 = negative
                            else:
                                review['sentence'] = [(feature, adj, sent, -1)]
                            review_list[idx] = review
                            count -= 1
                            if count == 0:
                                break

    reviews.extend(review_list)

pickle.dump(reviews, open('lei/output/reviews.pickle', 'wb'))

'''
keys:
'user',
'item',
'rating',
'text',
'sentence' # a list of (feature, adj, sent, score); absent when no pair was matched
'''
--------------------------------------------------------------------------------
/lei/input/reviews_Musical_Instruments_5.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lileipisces/Sentires-Guide/e04cabd18bb09d8d2e5bd6eef05f08738cfbb198/lei/input/reviews_Musical_Instruments_5.json.gz
--------------------------------------------------------------------------------
/lei/intermediate/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/lei/output/.gitkeep:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/run_lei.sh:
--------------------------------------------------------------------------------
# run these commands one by one (see the README); executing the whole script at once may throw errors
python lei/0.format.py

java -jar thuir-sentires.jar -t pre -c lei/1.pre

java -jar thuir-sentires.jar -t pos -c lei/2.pos

cp lei/intermediate/pos.1.txt lei/intermediate/pos.2.txt

java -jar thuir-sentires.jar -t validate -c lei/3.validate

java -jar thuir-sentires.jar -t lexicon -c lei/4.lexicon.linux

java -jar thuir-sentires.jar -t profile -c lei/5.profile

python lei/6.transform.py

python lei/7.match.py
--------------------------------------------------------------------------------