--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Amazon reviews downloader and parser
2 |
3 | Amazon no longer offers a simple public API to access its reviews. If one wants to quickly build a dataset of Amazon reviews, the only practical option is to crawl the HTML pages.
4 |
5 | I wrote a Perl script that, given a top-level domain (e.g., "com", "it") and a list of Amazon product IDs, automatically downloads, from the Amazon server dedicated to that domain, all and only the HTML pages that contain the reviews of those products.
6 |
7 | Then I wrote another Perl script that, given a list of downloaded HTML files, extracts all the reviews they contain, outputting for each review a record with the following information:
8 |
9 | * A counter of the extracted reviews so far (can be used as a unique ID for the dataset).
10 | * Date of the review in YYYYMMDD format (note: this does not work out of the box on non-English domains; edit the script to set the month names in the desired language).
11 | * ID of the reviewed product.
12 | * Star rating assigned by the reviewer.
13 | * Count of "yes" helpfulness votes.
14 | * Count of total helpfulness votes ("yes"+"no").
15 | * Date of the review in human readable format (will be in the language used by the specified domain).
16 | * ID of the author of the review.
17 | * Title of the review.
18 | * Content of the review.
19 |
20 | ## Example
21 |
22 | Given the products with IDs, e.g., _B0040JHVCC_ and _B00004ZDB1_, the reviews from the ".com" domain are downloaded with the command:
23 |
24 | ./downloadAmazonReviews.pl com B0040JHVCC B00004ZDB1
25 |
26 | Reviews are automatically downloaded into the _./amazonreviews/com/B0040JHVCC_ and _./amazonreviews/com/B00004ZDB1_ directories. The script automatically adapts the delay between download requests in order to be polite to Amazon, and retries failed downloads (503 errors) until every page is downloaded.
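The adaptive delay can be sketched as a small standalone function (a simplified model of the script's behavior, with a hypothetical `adapt_delay` helper; the actual script adjusts its `$sleepTime` variable inline):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the downloader's politeness policy: a 503 response
# lengthens the delay by one second and the same page is retried;
# a successful download shortens the delay again, never below zero.
sub adapt_delay {
    my ($delay, $http_status) = @_;
    return $delay + 1 if $http_status == 503;   # throttled: back off
    return $delay > 0 ? $delay - 1 : 0;         # success: speed up
}

my $delay = 1;
$delay = adapt_delay($delay, 503);   # throttled once
$delay = adapt_delay($delay, 200);   # next page downloaded fine
print "delay is now $delay second(s)\n";   # prints "delay is now 1 second(s)"
```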
27 |
28 | Then the reviews are extracted from the HTML files by issuing the command:
29 |
30 | ./extractAmazonReviews-DivLayout.pl ./amazonreviews/com/B0040JHVCC/ ./amazonreviews/com/B00004ZDB1/
31 |
32 | The script outputs one review per line on the standard output, in a CSV format:
33 |
34 | "0","20120118","B0040JHVCC","1.0","1","2","January 18, 2012","A1E5SQ7VA3I8OI","Not worth the price","I purchased the... (removed for brevity) ...and definitely no."
35 | "1","20120116","B0040JHVCC","5.0","0","0","January 16, 2012","A34FBZLFAU88UI","Compact version of the 7D, and neck and neck with D7000","This camera is... (removed for brevity) ...focus hunting issues."
36 | "2",...
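Each record can be split back into its ten fields with a few lines of Perl (a minimal sketch with a hypothetical `parse_record` helper; it assumes the `","` separator never appears inside a field, which holds for this output format but not for arbitrary CSV, where a module like Text::CSV should be used instead):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one output record into its ten fields by stripping the outer
# quotes and splitting on the "," separator.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    $line =~ s/^"//;
    $line =~ s/"$//;
    return split /","/, $line;
}

my @fields = parse_record(
    '"0","20120118","B0040JHVCC","1.0","1","2","January 18, 2012",'
  . '"A1E5SQ7VA3I8OI","Not worth the price","I purchased the..."' . "\n");
print "rating: $fields[3], title: $fields[8]\n";  # rating: 1.0, title: Not worth the price
```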
37 |
38 | Redirection can be used to save the output to a file:
39 |
40 | ./extractAmazonReviews-DivLayout.pl ./amazonreviews/com/B0040JHVCC/ ./amazonreviews/com/B00004ZDB1/ > reviews.csv
41 |
42 | will save the CSV output into the "reviews.csv" file.
43 |
44 | Note that there are two versions of the extract script: "extractAmazonReviews-DivLayout.pl" and "extractAmazonReviews-TableLayout.pl".
45 | The DivLayout version works with the latest (as of April 2015) Amazon layout, which ditched tables for divs. This layout is currently (again, as of April 2015) used on the ".com" domain. The other domains still use the table-based layout, for which the TableLayout version is designed.
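A quick way to tell which extractor fits a downloaded page is to look for the table marker that the TableLayout script matches (a heuristic sketch with a hypothetical `layout_of` helper; `productReviews` is the table id used by the table-based layout):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Heuristic: the table-based layout wraps the reviews in a table whose
# id starts with "productReviews"; the div-based layout does not.
sub layout_of {
    my ($html) = @_;
    return $html =~ /table id="productReviews/ ? 'table' : 'div';
}

print layout_of('<table id="productReviews">...</table>'), "\n";  # table
print layout_of('<div id="cm_cr-review_list">...</div>'), "\n";   # div
```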
46 |
47 | ## Disclaimer
48 |
49 | I provide you with the tool to download the reviews, not the right to download them. You must respect Amazon's rights over its own data. Do not release the data you download without Amazon's consent.
50 |
51 | ## History
52 |
53 | April, 2015: added different versions of extract script to work with both table- and div-based layouts.
54 |
55 | January, 2015: moved to GitHub
56 |
57 | November, 2013: (version 2.0) added the extraction of helpfulness scores. Major version update due to the change in output format.
58 | November, 2013: (version 1.1) the extractAmazonReviews.pl script can now process all the files in a directory by just passing the directory name. This is motivated by the fact that the * character on the Windows command line is not expanded in the same way as in Unix shells. The previous syntax, with *, continues to work.
59 |
60 | June, 2013: (version 1.0) moved to fossil scm.
61 |
62 | November, 2012: added the possibility to download reviews from many Amazon domains, not just the _com_ one (when last tested, it worked on European-language domains and Chinese, but not on Japanese).
63 |
--------------------------------------------------------------------------------
/downloadAmazonReviews.pl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl
2 |
3 | # Amazon reviews downloader
4 | # Copyright (C) 2015 Andrea Esuli
5 | #
6 | # This program is free software: you can redistribute it and/or modify
7 | # it under the terms of the GNU General Public License as published by
8 | # the Free Software Foundation, either version 3 of the License, or
9 | # (at your option) any later version.
10 | #
11 | # This program is distributed in the hope that it will be useful,
12 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | # GNU General Public License for more details.
15 | #
16 | # You should have received a copy of the GNU General Public License
17 | # along with this program. If not, see <http://www.gnu.org/licenses/>.
18 |
19 | # usage: ./downloadAmazonReviews.pl <domain> <productID> [<productID> ...]
20 | # example: ./downloadAmazonReviews.pl com B0040JHVC2 B004CG4CN4
21 | # output: a directory ./amazonreviews/<domain>/<productID> is created for each product ID; HTML files containing reviews are downloaded and saved in each directory.
22 |
23 | use strict;
24 | use LWP::UserAgent;
25 | use HTTP::Request;
26 | use WWW::Mechanize::PhantomJS;
27 | binmode(STDIN, ':encoding(utf8)');
28 | binmode(STDOUT, ':encoding(utf8)');
29 | binmode(STDERR, ':encoding(utf8)');
30 |
31 | $| = 1; #autoflush
32 |
33 | my $ua = LWP::UserAgent->new;
34 | $ua->timeout(10);
35 | $ua->env_proxy;
36 | $ua->agent('Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.30 (KHTML, like Gecko) Ubuntu/11.04 Chromium/12.0.742.91 Chrome/12.0.742.91 Safari/534.30');
37 |
38 | mkdir "amazonreviews";
39 |
40 | my $sleepTime = 1;
41 |
42 | my $domain = shift;
43 | mkdir "amazonreviews/$domain";
44 |
45 | my $id = "";
46 | while($id = shift) {
47 |
48 | my $dir = "amazonreviews/$domain/$id";
49 | mkdir $dir;
50 |
51 | my $urlPart1 = "http://www.amazon.".$domain."/product-reviews/";
52 | my $urlPart2 = "/ref=cm_cr_getr_d_paging_btm_";
53 | my $urlPart3 = "?ie=UTF8&showViewpoints=1&sortBy=recent&pageNumber=";
54 |
55 | my $referer = $urlPart1.$id.$urlPart2."1".$urlPart3."1";
56 |
57 | my $page = 1;
58 | my $lastPage = 1;
59 | while($page<=$lastPage) {
60 |
61 | my $url = $urlPart1.$id.$urlPart2.$page.$urlPart3.$page;
62 |
63 | print $url;
64 |
65 | my $matched = 0; # reset for each page; set when a usable review count is found (required under "use strict")
66 | my $mech = WWW::Mechanize::PhantomJS->new(
67 | launch_arg => ['ghostdriver/src/main.js' ],
68 | );
69 |
70 | $mech->get($url);
71 |
72 | my $response = $mech->response(headers => 0);
73 | if($response->is_success) {
74 | print " GOTIT\n";
75 | my $content = $mech->content( format => 'html' );
76 | # Scan for the total review count (thousands separators allowed) to derive the number of review pages (10 reviews per page).
77 | while($content =~ m#(([1-9]*[0-9],)*[1-9]?[1-9]?[0-9])#gs ) {
78 | my $temp = $1;
79 | $temp =~ s/,//g;
80 | my $val = int ($temp/10) + 1;
81 | if($val>$lastPage) {
82 | $lastPage = $val;
83 | }
84 |
85 | $matched = 1;
86 | }
87 |
88 | # Try again if no usable content was returned
89 | if(!$matched) {
90 | print "Unusable results, trying again\n";
91 | next;
92 | }
93 |
94 | if(open(CONTENTFILE, ">./$dir/$page")) {
95 | binmode(CONTENTFILE, ":utf8");
96 | print CONTENTFILE $content;
97 | close(CONTENTFILE);
98 | print "ok\t$domain\t$id\t$page\t$lastPage\n";
99 | }
100 | else {
101 | print "failed\t$domain\t$id\t$page\t$lastPage\n";
102 | }
103 |
104 | if($sleepTime>0) {
105 | --$sleepTime;
106 | }
107 | }
108 | else {
109 | if($mech->status()==503) {
110 | --$page;
111 | ++$sleepTime;
112 | print " TIMEOUT ".$response->code." retrying (new timeout $sleepTime)\n";
113 | }
114 | else {
115 | print " Downloaded ". ($page-1). " pages for product id $id (end code:".$response->code.")\n";
116 | last;
117 | }
118 | }
119 | ++$page;
120 | sleep($sleepTime);
121 | }
122 | }
123 |
124 |
--------------------------------------------------------------------------------
/extractAmazonReviews-DivLayout.pl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl
2 |
3 | # Amazon reviews extractor
4 | # Copyright (C) 2015 Andrea Esuli
5 | #
6 | # This program is free software: you can redistribute it and/or modify
7 | # it under the terms of the GNU General Public License as published by
8 | # the Free Software Foundation, either version 3 of the License, or
9 | # (at your option) any later version.
10 | #
11 | # This program is distributed in the hope that it will be useful,
12 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | # GNU General Public License for more details.
15 | #
16 | # You should have received a copy of the GNU General Public License
17 | # along with this program. If not, see <http://www.gnu.org/licenses/>.
18 |
19 | # usage: ./extractAmazonReviews.pl <filename or directoryname> [<filename or directoryname> ...]
20 | # note: when a directory name is specified, only the files residing in that directory are processed, not files possibly contained in subdirectories, i.e., there is no recursive visit of subdirectories.
21 | # example: ./extractAmazonReviews.pl amazonreviews/com/B0040JHVC2
22 | # example: ./extractAmazonReviews.pl amazonreviews/com/B0040JHVC2/1 amazonreviews/com/B0040JHVC2/2 amazonreviews/com/B0040JHVC2/3
23 | # output: a simple CSV to standard output
24 | # note: use redirect to save output to a file
25 | # example: ./extractAmazonReviews.pl amazonreviews/com/B0040JHVC2 > B0040JHVC2.com.csv
26 |
27 | use strict;
28 | use File::Spec;
29 |
30 | my $filename ="";
31 | my $count = 0;
32 |
33 | while($filename= shift) {
34 | if(-f $filename) {
35 | extract($filename);
36 | }
37 | elsif(-d $filename) {
38 | opendir(DIR, $filename) or next;
39 | while(my $subfilename = readdir(DIR)) {
40 | extract(File::Spec->catfile($filename,$subfilename));
41 | }
42 | closedir(DIR);
43 | }
44 | }
45 |
46 | sub extract {
47 | my($filename) = $_[0];
48 | open (FILE, "<", $filename) or return;
49 | my $whole_file;
50 | {
51 | local $/;
52 | $whole_file = <FILE>;
53 | }
54 | close(FILE);
55 |
56 | $whole_file =~ m#product\-reviews/([A-Z0-9]+)/ref\=cm_cr_pr_hist#gs;
57 | my $model = $1;
58 |
59 | $whole_file =~ m#cm_cr-review_list.*?>(.*?)$#gs; # note: the pattern's closing HTML anchor was lost in this copy; matching to end of file instead
60 | $whole_file = $1;
61 |
62 | while($whole_file =~ m#(.*?)report-abuse-link#gs) { # note: the block's opening HTML anchor was lost in this copy
63 | my $block = $1;
64 |
65 | $block =~ m#star-(.) review-rating#gs;
66 | my $rating = $1;
67 |
68 | $block =~ m#review-title.*?>(.*?)</a>#gs; # closing </a> assumed; the original anchor was lost in this copy
69 | my $title = $1;
70 |
71 | $block =~ m#review-date">(.*?)</span>#gs; # closing </span> assumed; the original anchor was lost in this copy
72 | my $date = $1;
73 |
74 | $date =~ m/on ([A-Za-z]+) ([0-9]+), ([0-9]+)/;
75 | my $month = $1;
76 |
77 | if($month eq "January") {
78 | $month = "01";
79 | }
80 | elsif($month eq "February") {
81 | $month = "02";
82 | }
83 | elsif($month eq "March") {
84 | $month = "03";
85 | }
86 | elsif($month eq "April") {
87 | $month = "04";
88 | }
89 | elsif($month eq "May") {
90 | $month = "05";
91 | }
92 | elsif($month eq "June") {
93 | $month = "06";
94 | }
95 | elsif($month eq "July") {
96 | $month = "07";
97 | }
98 | elsif($month eq "August") {
99 | $month = "08";
100 | }
101 | elsif($month eq "September") {
102 | $month = "09";
103 | }
104 | elsif($month eq "October") {
105 | $month = "10";
106 | }
107 | elsif($month eq "November") {
108 | $month = "11";
109 | }
110 | elsif($month eq "December") {
111 | $month = "12";
112 | }
113 | else {
114 | $month = "XX";
115 | }
116 |
117 | my $newDate = "XX";
118 | if($month ne "XX") {
119 | $newDate = sprintf ( "$3$month%02d",$2);
120 | }
121 |
122 | my $helpfulTotal = 0;
123 | my $helpfulYes = 0;
124 | if($block =~ m#review-votes.*?([0-9]+).*?([0-9]+)#) {
125 | $helpfulTotal = ($1, $2)[$1 < $2];
126 | $helpfulYes = ($1, $2)[$1 > $2];
127 | }
128 | my $userId = "ANONYMOUS";
129 | if($block =~ /profile\/(.*?)\//) {
130 | $userId = $1;
131 | }
132 |
133 |
134 | $block =~ m#base review-text">(.*?)</span>#gs; # closing </span> assumed; the original anchor was lost in this copy
135 | my $review = $1;
136 |
137 | # The original cleanup of $review was lost in this copy; as a minimal
138 | # substitute, collapse whitespace and double any quotes to keep the CSV well formed.
139 | $review =~ s/\s+/ /gs;
140 | $review =~ s/"/""/g;
141 | if(length($review) > 0) {
142 | print "\"$count\",\"$newDate\",\"$model\",\"$rating\",\"$helpfulYes\",\"$helpfulTotal\",\"$date\",\"$userId\",\"$title\",\"$review\"\n";
143 | }
144 | ++$count;
145 | }
146 | }
147 |
--------------------------------------------------------------------------------
/extractAmazonReviews-TableLayout.pl:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl
2 |
3 | # Amazon reviews extractor
4 | # Copyright (C) 2015 Andrea Esuli
5 | #
6 | # This program is free software: you can redistribute it and/or modify
7 | # it under the terms of the GNU General Public License as published by
8 | # the Free Software Foundation, either version 3 of the License, or
9 | # (at your option) any later version.
10 | #
11 | # This program is distributed in the hope that it will be useful,
12 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | # GNU General Public License for more details.
15 | #
16 | # You should have received a copy of the GNU General Public License
17 | # along with this program. If not, see <http://www.gnu.org/licenses/>.
18 |
19 | # usage: ./extractAmazonReviews.pl <filename or directoryname> [<filename or directoryname> ...]
20 | # note: when a directory name is specified, only the files residing in that directory are processed, not files possibly contained in subdirectories, i.e., there is no recursive visit of subdirectories.
21 | # example: ./extractAmazonReviews.pl amazonreviews/com/B0040JHVC2
22 | # example: ./extractAmazonReviews.pl amazonreviews/com/B0040JHVC2/1 amazonreviews/com/B0040JHVC2/2 amazonreviews/com/B0040JHVC2/3
23 | # output: a simple CSV to standard output
24 | # note: use redirect to save output to a file
25 | # example: ./extractAmazonReviews.pl amazonreviews/com/B0040JHVC2 > B0040JHVC2.com.csv
26 |
27 | use strict;
28 | use File::Spec;
29 |
30 | my $filename ="";
31 | my $count = 0;
32 |
33 | while($filename= shift) {
34 | if(-f $filename) {
35 | extract($filename);
36 | }
37 | elsif(-d $filename) {
38 | opendir(DIR, $filename) or next;
39 | while(my $subfilename = readdir(DIR)) {
40 | extract(File::Spec->catfile($filename,$subfilename));
41 | }
42 | closedir(DIR);
43 | }
44 | }
45 |
46 | sub extract {
47 | my($filename) = $_[0];
48 | open (FILE, "<", $filename) or return;
49 | my $whole_file;
50 | {
51 | local $/;
52 | $whole_file = <FILE>;
53 | }
54 | close(FILE);
55 |
56 | $whole_file =~ m#product\-reviews/([A-Z0-9]+)/ref\=cm_cr_pr_hist#gs;
57 | my $model = $1;
58 |
59 | $whole_file =~ m#table id="productReviews.*?>(.*?)$#gs; # note: the pattern's closing HTML anchor was lost in this copy; matching to end of file instead
60 | $whole_file = $1;
61 |
62 | while ($whole_file =~ m#-->(.*?)(