├── README.md
├── folder-hierarchy.png
├── lei
│   ├── 0.format.py
│   ├── 1.pre
│   ├── 2.pos
│   ├── 3.validate
│   ├── 4.lexicon.linux
│   ├── 5.profile
│   ├── 6.transform.py
│   ├── 7.match.py
│   ├── input
│   │   └── reviews_Musical_Instruments_5.json.gz
│   ├── intermediate
│   │   └── .gitkeep
│   └── output
│       └── .gitkeep
└── run_lei.sh
/README.md:
--------------------------------------------------------------------------------
# Sentires-Guide
A quick guide to Sentires, a phrase-level sentiment analysis toolkit:
- Zhang, Yongfeng, et al. [Do users rate or review? Boost phrase-level sentiment labeling with review-level sentiment classification](http://yongfeng.me/attach/bps-zhang.pdf). SIGIR'14.

To obtain the tool, please check [Sentires](https://github.com/evison/Sentires).

## Motivation
The tool is valuable to the IR community, as many research works are built on top of its results. However, it can be difficult to obtain (feature, opinion, sentence, sentiment) quadruples from user reviews with it, since the toolkit does not provide such a function out of the box. Moreover, people nowadays are usually more familiar with Python than with Java, in which the toolkit is implemented. Therefore, we present the data processing steps (mostly in Python) that we used in the following paper, to help researchers quickly obtain the aforementioned quadruples from user reviews.
- Lei Li, Yongfeng Zhang, Li Chen. Generate Neural Template Explanations for Recommendation. CIKM'20. \[[Paper](https://lileipisces.github.io/files/CIKM20-NETE-paper.pdf)\] \[[Code](https://github.com/lileipisces/NETE)\]

**A small ecosystem for Recommender Systems-based Natural Language Generation is available at [NLG4RS](https://github.com/lileipisces/NLG4RS)!**
## Creation Steps

- Place the folder [lei](lei/) in the tool's folder "English-Jar", as shown in the hierarchy above.
- Modify [0.format.py](lei/0.format.py) so that your dataset is converted into the input format the tool expects: update the data path (line 5), the dictionary keys (lines 12, 13, 16, 20, 21, 23), and how the data is loaded and iterated over (line 9). Lines 12-15 fold the review summary into the text, so as to include as much textual data as possible; remove them if your dataset has no summary or tip. A minimal sketch for a hypothetical dataset follows this list.
- Update the absolute paths (lines 65, 78, 94, 95) in [4.lexicon.linux](lei/4.lexicon.linux) accordingly.
- Run the commands in [run_lei.sh](run_lei.sh) one by one (do not run the script in one go, otherwise it may throw errors).
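For concreteness, here is a minimal sketch of how the loading loop in [0.format.py](lei/0.format.py) could be adapted to a hypothetical plain JSON-lines dataset whose records carry the keys `user_id`, `item_id`, `stars` and `review_text` (all of these names are illustrative, not part of this repository; substitute your own):

```python
import json

raw_path = 'input/my_reviews.jsonl'  # hypothetical path, replacing line 5

# replaces the gzip/eval loop starting at line 9
for line in open(raw_path, 'r', encoding='utf-8'):
    review = json.loads(line)
    text = review['review_text']                 # cf. the keys on lines 12-16
    item_id = review['item_id']                  # cf. 'asin' on line 20
    json_doc = {'user': review['user_id'],       # cf. 'reviewerID' on line 21
                'item': item_id,
                'rating': int(review['stars']),  # cf. 'overall' on line 23
                'text': text}
```

The rest of the script (writing record.per.row.txt, records.per.product.txt and product2json.pickle) stays unchanged.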
## Friendly reminder
- Out-of-memory errors can usually be fixed by enlarging the JVM heap, e.g., ```java -jar -Xmx200G thuir-sentires.jar -t lexicon -c lei/4.lexicon.linux >> log.txt```
- The POS step can take a long time (e.g., days) on large datasets.
## Results
You will find a file "reviews.pickle" in [lei/output/](lei/output/), which is a Python list, where each element is a Python dict with the following keys:
```
'user',
'item',
'rating',
'text',
'sentence' # a list of tuples of the form (feature, adjective, sentence, score). This key may be missing, because the tool can fail to identify any feature-opinion pair in a review.
```
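For example, loading the output and iterating over the quadruples could look like this (a minimal sketch; the loop variables are illustrative):

```python
import pickle

with open('lei/output/reviews.pickle', 'rb') as f:
    reviews = pickle.load(f)

for review in reviews:
    # 'sentence' is absent when no feature-opinion pair was identified
    for (feature, adjective, sentence, score) in review.get('sentence', []):
        print(review['user'], review['item'], feature, adjective, score)
```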
## Datasets to [download](https://drive.google.com/drive/folders/1yB-EFuApAOJ0RzTI0VfZ0pignytguU0_?usp=sharing)
- TripAdvisor Hong Kong
- Amazon Movies & TV
- Yelp 2019
## Citations
```
@inproceedings{CIKM20-NETE,
    title={Generate Neural Template Explanations for Recommendation},
    author={Li, Lei and Zhang, Yongfeng and Chen, Li},
    booktitle={CIKM},
    year={2020}
}
@inproceedings{WWW20-NETE,
    title={Towards Controllable Explanation Generation for Recommender Systems via Neural Template},
    author={Li, Lei and Chen, Li and Zhang, Yongfeng},
    booktitle={WWW Demo},
    year={2020}
}
@inproceedings{SIGIR14-Sentires,
    title={Do users rate or review? Boost phrase-level sentiment labeling with review-level sentiment classification},
    author={Zhang, Yongfeng and Zhang, Haochen and Zhang, Min and Liu, Yiqun and Ma, Shaoping},
    booktitle={SIGIR},
    year={2014}
}
```
--------------------------------------------------------------------------------
/folder-hierarchy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lileipisces/Sentires-Guide/e04cabd18bb09d8d2e5bd6eef05f08738cfbb198/folder-hierarchy.png
--------------------------------------------------------------------------------
/lei/0.format.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | import gzip
3 | import re
4 |
5 | raw_path = 'input/reviews_Musical_Instruments_5.json.gz'  # path to load raw reviews
6 | writer_1 = open('input/record.per.row.txt', 'w', encoding='utf-8')
7 | product2text_list = {}
8 | product2json = {}
9 | for line in gzip.open(raw_path, 'r'):
10 |     review = eval(line)  # the raw Amazon reviews are Python dict literals rather than strict JSON
11 |     text = ''
12 |     if 'summary' in review:
13 |         summary = review['summary']
14 |         if summary != '':
15 |             text += summary + '\n'
16 |     text += review['reviewText']
17 |
18 |     writer_1.write('\n{}\n\n'.format(text))
19 |
20 |     item_id = review['asin']
21 |     json_doc = {'user': review['reviewerID'],
22 |                 'item': item_id,
23 |                 'rating': int(review['overall']),
24 |                 'text': text}
25 |
26 |     if item_id in product2json:
27 |         product2json[item_id].append(json_doc)
28 |     else:
29 |         product2json[item_id] = [json_doc]
30 |
31 |     if item_id in product2text_list:
32 |         product2text_list[item_id].append('\n{}\n\n'.format(text))
33 |     else:
34 |         product2text_list[item_id] = ['\n{}\n\n'.format(text)]
35 |
36 | with open('input/records.per.product.txt', 'w', encoding='utf-8') as f:
37 |     for (product, text_list) in product2text_list.items():
38 |         f.write(product + '\t' + str(len(text_list)) + '\tfake_URL')
39 |         text = '\n\t' + re.sub('\n', '\n\t', ''.join(text_list).strip()) + '\n'
40 |         f.write(text)
41 |
42 | pickle.dump(product2json, open('input/product2json.pickle', 'wb'))
43 |
--------------------------------------------------------------------------------
/lei/1.pre:
--------------------------------------------------------------------------------
preprocessing.language=english
preprocessing.raw.path=lei/input/record.per.row.txt
preprocessing.raw.charset=UTF8
preprocessing.selected.path=lei/intermediate/pre.txt
preprocessing.selected.charset=UTF8
--------------------------------------------------------------------------------
/lei/2.pos:
--------------------------------------------------------------------------------
postagging.language=english
postagging.select.path=lei/intermediate/pre.txt
postagging.select.charset=UTF8
postagging.pos.path=lei/intermediate/pos.1.txt
postagging.pos.charset=UTF8
postagging.model.file=res/english-bidirectional-distsim.tagger
--------------------------------------------------------------------------------
/lei/3.validate:
--------------------------------------------------------------------------------
validation.A.mapping.class=edu.thuir.sentires.buildin.STANPOSMapping
validation.B.mapping.class=edu.thuir.sentires.buildin.STANPOSMapping
validation.source.charset=UTF8
validation.output.charset=UTF8

validation.A.result=lei/intermediate/pos.1.txt
validation.B.result=lei/intermediate/pos.2.txt
validation.source=lei/intermediate/pre.txt
validation.output=lei/intermediate/validate.txt
--------------------------------------------------------------------------------
/lei/4.lexicon.linux:
--------------------------------------------------------------------------------
1 | ##############################################################################
2 | ##
3 | ## author : zhanghc@thuir
4 | ##
5 | ## The '.task' file contains all necessary context of the specific sentiment
6 | ## resources extraction task.
7 | ##
8 | ## To drive a build-sentiment-resource task, the '.task' file should always
9 | ## contain the following parts:
10 | ## I/O :
11 | ## * POS-tagged corpus as input and its encoding charset.
12 | ## * tuple formatted with "[L]\t${Feature}|${Opinion}" as output
13 | ## and its encoding charset.
14 | ## Resources :
15 | ## * background corpus used to filter background noises, which should be
16 | ## generated by the given tool of the project.
17 | ## * public semantic resources, which provide the initial polarity of a given
18 | ## opinion word.
19 | ## * public linguistic resources, which provide some naive linguistic
20 | ## resources generated manually.
21 | ## * public stop word lists, which help to filter some already-known but
22 | ## ambiguous noises.
23 | ## Morphology & POS information:
24 | ## * custom POS parser should be specified in this part, and
25 | ## * specific patterns that are used to extract features.
26 | ## * manual of the POS parser and morphology component can be found in
27 | ## conf/morphology.properties.
28 | ## Metrics & Thresholds:
29 | ## * all metrics and thresholds that are used in the approach.
30 | ## * please refer to the Approach Document and learn about these metrics.
31 | ##
32 | ## It is strongly recommended that all related configurations of 'Resources'
33 | ## and 'Morphology & POS information' parts should be recorded in an individual
34 | ## property file to share with all other tasks.
35 | ##
36 | ##
37 | ##############################################################################
38 | ##
39 | ## I/O:
40 | ## In this part, all information of the input and output should be specified,
41 | ##
42 | ## Input:
43 | task.io.corpus.file=lei/intermediate/validate.txt
44 | task.io.corpus.charset=UTF8
45 | task.io.feature.file=data/feature/manual.feature
46 | task.io.feature.charset=UTF8
47 | task.io.stop.feature.file=data/feature/stop.feature
48 | task.io.stop.feature.charset=UTF8
49 | task.io.stop.opinion.file=data/opinion/stop.opinion
50 | task.io.stop.opinion.charset=UTF8
51 |
52 | ## Output:
53 | task.io.lexicon.file=lei/intermediate/lexicon.txt
54 | task.io.lexicon.charset=UTF8
55 | lowest.frequency=4
56 | #lowest.frequency=1
57 |
58 | ##
59 | ##############################################################################
60 | ##
61 | ## Metrics & Thresholds:
62 | ## In this part, all metrics and thresholds that are used in the approach
63 | ## should be specified.
64 | ##
65 | include=/home/comp/csleili/0/English-Jar/preset/relax.threshold
66 |
67 | ##
68 | ##############################################################################
69 | ##
70 | ## Resources:
71 | ## Specify all necessary resources location and other properties.
72 | ##
73 | ## It is strongly recommended to gather all configurations and organize
74 | ## them in one particular property file as it can be shared with other tasks.
75 | ##
76 | ## integrated property file:
77 | #include=${path.preset}/default.resource
78 | include=/home/comp/csleili/0/English-Jar/preset/english.resource
79 |
80 | ##
81 | # resource.stopword.path=stopword.res
82 |
83 | ##
84 | ##############################################################################
85 | ##
86 | ## Morphology & POS:
87 | ## Specify custom POS parser and extracting patterns.
88 | ##
89 | ## It is strongly recommended to gather all configurations and organize
90 | ## them in one particular property file as it can be shared with other tasks.
91 | ##
92 | ## integrated property file:
93 | #include=${path.preset}/default.pattern
94 | include=/home/comp/csleili/0/English-Jar/preset/english.pattern
95 | include=/home/comp/csleili/0/English-Jar/preset/default.mapping
96 |
97 | ##
98 | ##############################################################################
99 |
--------------------------------------------------------------------------------
/lei/5.profile:
--------------------------------------------------------------------------------
# input
profile.product = lei/input/records.per.product.txt
profile.lexicon = lei/intermediate/lexicon.txt
profile.sbc2dbc = data/sbc2dbc/sbc2dbc.dict

# output
profile.posprofile = lei/output/pos.sentence.txt
profile.negprofile = lei/output/neg.sentence.txt
profile.indicatorfile = lei/intermediate/indicator.txt
profile.index = lei/intermediate/index.txt
profile.rerank.index = lei/intermediate/rerank.index.txt

# not used
profile.feature = lei/2014.nus.utf.label.feature
profile.opinion = lei/2014.nus.utf.label.opinion

profile.product.charset = UTF8
profile.lexicon.charset = UTF8
profile.profile.charset = UTF8
profile.indicator.charset = UTF8
profile.feature.charset = UTF8
profile.opinion.charset = UTF8
profile.productname.toshiba.charset = UTF8
profile.productname.other.charset = UTF8
profile.index.charset = UTF8
profile.sbc2dbc.charset = UTF8
--------------------------------------------------------------------------------
/lei/6.transform.py:
--------------------------------------------------------------------------------
import pickle
import re


def data_format(path):
    """Parse a positive/negative sentence profile produced by the profile step
    into {product: {feature: {adjective: [(sentence, count), ...]}}}."""
    product2feature = {}
    with open(path, 'r', errors='ignore') as f:
        line = f.readline()
        while line != '':
            product = line.split('\t')[2]
            line = f.readline()
            if line == '' or line.startswith('e'):
                continue

            feature2adj = {}
            while line != '' and not line.startswith('e'):
                feature = line[1:].strip()

                adj2sen = {}
                line = f.readline()
                while line.startswith('o'):
                    content = line[1:].strip().split('\t')
                    adj_phrase = content[0]
                    phrase_count = int(content[1])
                    sentence_c = []
                    while phrase_count > 0:
                        line = f.readline()[1:].strip()
                        sentence_count = re.findall(r'\([0-9]+\)', line)[-1]  # trailing occurrence count, e.g. '(2)'
                        sentence = line[:-(len(sentence_count) + 1)]  # drop the count and the space before it
                        count = int(sentence_count[1:-1])
                        sentence_c.append((sentence, count))

                        phrase_count -= count

                    adj2sen[adj_phrase] = sentence_c

                    line = f.readline()

                feature2adj[feature] = adj2sen

            product2feature[product] = feature2adj

    return product2feature


product2feature_pos = data_format('lei/output/pos.sentence.txt')
product2feature_neg = data_format('lei/output/neg.sentence.txt')
pickle.dump(product2feature_pos, open('lei/intermediate/product2feature_pos.pickle', 'wb'))
pickle.dump(product2feature_neg, open('lei/intermediate/product2feature_neg.pickle', 'wb'))


'''
an example:

{'The Ritz-Carlton, Hong Kong':
    {'gluten':
        {'and the': [('My Wife is gluten free and they were able to provide gluten free bread at short notice', 1)],
         'free': [('I have special dietary requirements - gluten free - which the Concierge Club Staff went out of the way to accommodate', 1),
                  ('My Wife is gluten free and they were able to provide gluten free bread at short notice', 1),
                  ('They were even able to accommodate my Wife with gluten free pasta which she thoroughly enjoyed', 1)]},
     'smiles': {'warm': [('The Ritz-Carlton welcomed my family and me with the warmest of smiles in the fifth tallest building in the world and most luxurious from Hong Kong', 1)],
                'superb': [('We were greeted by different staff with smiles and superb services', 1)]}},
 'Korean Hostel': {'stay': {'on': [("If you've stayed on a backpackers before you'll be fine", 1)]}},
 'Gloria Guesthouse': {'staff': {'helpful': [('staff is helpful', 1)]}, 'room': {'basic': [('room is spacious enough for 2 people with good bathroom and all basic amenities', 1)]}}}
'''
--------------------------------------------------------------------------------
/lei/7.match.py:
--------------------------------------------------------------------------------
import pickle


product2json = pickle.load(open('lei/input/product2json.pickle', 'rb'))
product2feature_pos = pickle.load(open('lei/intermediate/product2feature_pos.pickle', 'rb'))
product2feature_neg = pickle.load(open('lei/intermediate/product2feature_neg.pickle', 'rb'))


def match(review_list, feature2adj, score):
    # attach a (feature, adjective, sentence, score) tuple to each review whose text contains the sentence
    for (feature, adj2sent_list) in feature2adj.items():
        for (adj, sent_list) in adj2sent_list.items():
            for (sent, count) in sent_list:
                for review in review_list:
                    if review['text'].find(sent) != -1:
                        if 'sentence' in review:
                            review['sentence'].append((feature, adj, sent, score))
                        else:
                            review['sentence'] = [(feature, adj, sent, score)]
                        count -= 1
                        if count == 0:
                            break


reviews = []
for (product_id, review_list) in product2json.items():
    if product_id in product2feature_pos:
        match(review_list, product2feature_pos[product_id], 1)   # positive sentiment
    if product_id in product2feature_neg:
        match(review_list, product2feature_neg[product_id], -1)  # negative sentiment

    reviews.extend(review_list)

pickle.dump(reviews, open('lei/output/reviews.pickle', 'wb'))

'''
keys:
'user',
'item',
'rating',
'text',
'sentence' # a list of (feature, adj, sent, score) tuples; absent when nothing matched
'''
--------------------------------------------------------------------------------
/lei/input/reviews_Musical_Instruments_5.json.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lileipisces/Sentires-Guide/e04cabd18bb09d8d2e5bd6eef05f08738cfbb198/lei/input/reviews_Musical_Instruments_5.json.gz
--------------------------------------------------------------------------------
/lei/intermediate/.gitkeep:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/lei/output/.gitkeep:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/run_lei.sh:
--------------------------------------------------------------------------------
# Run these commands one at a time (see README); running the whole script in one go may throw errors.

python lei/0.format.py

java -jar thuir-sentires.jar -t pre -c lei/1.pre

java -jar thuir-sentires.jar -t pos -c lei/2.pos

# the validate step expects two POS results, so duplicate the single one
cp lei/intermediate/pos.1.txt lei/intermediate/pos.2.txt

java -jar thuir-sentires.jar -t validate -c lei/3.validate

java -jar thuir-sentires.jar -t lexicon -c lei/4.lexicon.linux

java -jar thuir-sentires.jar -t profile -c lei/5.profile

python lei/6.transform.py

python lei/7.match.py
--------------------------------------------------------------------------------