├── README.md
├── classify.py
└── python_scikit_airbnb.ipynb

/README.md:
--------------------------------------------------------------------------------
# spark-sklearn-airbnb-predict
Code example to predict prices of Airbnb vacation rentals, using scikit-learn on Spark.

The [Jupyter notebook in this repo](https://github.com/mapr-demos/spark-sklearn-airbnb-predict/blob/master/python_scikit_airbnb.ipynb) contains examples that run regression estimators on the [Inside Airbnb](http://insideairbnb.com/get-the-data.html) listings dataset from San Francisco. The target variable is the price of the listing. To speed up the hyperparameter search, the notebook shows examples that use the spark-sklearn package to distribute GridSearchCV across the nodes of a Spark cluster. This makes the search much faster and, because a larger parameter grid can be explored in the same time, can lead to better results (the core pattern is sketched at the end of this README).

To run the scikit-learn examples (without Spark), the following packages are required:
* Python 2
* Pandas
* NumPy
* scikit-learn (0.17 or later)

These can be installed on the [MapR Sandbox](https://www.mapr.com/products/mapr-sandbox-hadoop).

To run the scikit-learn examples with Spark, the following packages are required on each machine:
* All of the above packages
* Spark (1.5 or later)
* [spark-sklearn](https://github.com/databricks/spark-sklearn) -- follow the installation instructions there

You can run this on a MapR cluster using one of these methods:
* Use the [MapR Sandbox](https://www.mapr.com/products/mapr-sandbox-hadoop), which comes with Spark pre-installed. You must install Pandas, NumPy, and scikit-learn yourself.
* If you have multiple machines available, use the [MapR Community Edition](https://www.mapr.com/products/hadoop-download) and install the mapr-spark package on each machine, following the [Spark on YARN documentation](http://maprdocs.mapr.com/51/#Spark/SparkonYARN.html).

Run the script with:

```
MASTER=yarn-client /opt/mapr/spark/spark-1.5.2/bin/spark-submit --num-executors=4 --executor-cores=8 python_scikit_airbnb.py
```

(set num-executors and executor-cores to suit your environment)

The file ```classify.py``` in this repo contains an example of classification on the same dataset, using ```reviews.csv``` and text analysis.
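For reference, the distributed grid search in both the notebook and ```classify.py``` boils down to the pattern below. This is a minimal sketch rather than the notebook's actual code: the random feature matrix, target vector, and parameter grid are placeholders, but the spark-sklearn ```GridSearchCV``` call itself -- a SparkContext as the first argument, otherwise the familiar scikit-learn interface -- is the same one used in ```classify.py```.

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from pyspark import SparkConf, SparkContext
from spark_sklearn import GridSearchCV   # drop-in replacement for sklearn's GridSearchCV

sc = SparkContext(conf=SparkConf())

# placeholder data standing in for the engineered Airbnb features and listing prices
X = np.random.rand(500, 10)
y = np.random.rand(500) * 200

# each point in this grid is evaluated as a Spark task instead of a local job
param_grid = {"n_estimators": [10, 100], "max_depth": [3, None]}
clf = GridSearchCV(sc, RandomForestRegressor(random_state=1),
                   param_grid=param_grid, cv=3)
clf.fit(X, y)

print clf.best_estimator_
```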
and of course... have fun!

--------------------------------------------------------------------------------
/classify.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np

from sklearn import neighbors
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
import sklearn.metrics as metrics

# scikit-learn only (no Spark):
# from sklearn.grid_search import GridSearchCV

# spark-sklearn: drop-in GridSearchCV that distributes the search over a Spark cluster
from pyspark import SparkContext, SparkConf
from spark_sklearn import GridSearchCV

def top_tfidf_feats(row, features, top_n=25):
    ''' Top tfidf features in a single row of the tfidf matrix '''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

LISTINGSFILE = '/mapr/tmclust1/user/mapr/pyspark-learn/airbnb/listings.csv'
REVIEWSFILE = '/mapr/tmclust1/user/mapr/pyspark-learn/airbnb/reviews.csv'

cols = ['id',
        'neighbourhood_cleansed',
        ]

rcols = ['listing_id', 'comments']

nbhs = ['Mission', 'South of Market', 'Western Addition']

# read the listings and reviews into dataframes
df = pd.read_csv(LISTINGSFILE, usecols=cols, index_col='id')
rdf = pd.read_csv(REVIEWSFILE, usecols=rcols)

# combine all reviews for each listing into a single text field,
# indexed by listing ID, so they can be joined with the listings
rdf = rdf.groupby(['listing_id'])['comments']. \
    apply(lambda x: ' '.join(x.astype(str))).reset_index()
rdf = rdf.set_index(rdf['listing_id'].astype(float))
df = pd.concat([df, rdf], axis=1)

# drop listings with missing data and keep only the target neighbourhoods
print "before filtering: %d" % len(df.index)
df = df.dropna(axis=0)
df = df[df.neighbourhood_cleansed.isin(nbhs)]
print "after filtering: %d" % len(df.index)

# encode the neighbourhood names as integer class labels
le = preprocessing.LabelEncoder().fit(df.neighbourhood_cleansed)
df['nbh'] = le.transform(df.neighbourhood_cleansed)

# turn the combined review text into tf-idf features
tfid = TfidfVectorizer()
ttext = tfid.fit_transform(df['comments'])

# uncomment to inspect the top tf-idf features for one listing
# print top_feats_in_doc(ttext, tfid.get_feature_names(), 1, 10)

print "%d %d" % (ttext.shape[0], len(df['nbh']))
X_train, X_test, y_train, y_test = \
    train_test_split(ttext, df['nbh'],
                     test_size=0.2, random_state=1)

rs = 1
ests = [neighbors.KNeighborsClassifier(3),
        RandomForestClassifier(random_state=rs)]

ests_labels = np.array(['KNeighbors', 'RandomForest'])

# baseline: fit each classifier with default settings and report its accuracy
for i, e in enumerate(ests):
    e.fit(X_train, y_train)
    this_score = metrics.accuracy_score(y_test, e.predict(X_test))
    scorestr = "%s: Accuracy Score %0.2f" % (ests_labels[i],
                                             this_score)
    print
    print scorestr
    print "-" * len(scorestr)
    print metrics.classification_report(y_test,
                                        e.predict(X_test), target_names=le.classes_)

tuned_parameters = {"max_depth": [3, None],
                    "max_features": [1, 'auto'],
                    "min_samples_split": [1, 20],
                    "n_estimators": [10, 300, 500]}
rf = RandomForestClassifier(random_state=rs)

# spark-sklearn: distribute the grid search across the Spark cluster
conf = SparkConf()
sc = SparkContext(conf=conf)
clf = GridSearchCV(sc, rf, cv=3,
                   param_grid=tuned_parameters,
                   scoring='accuracy')

# plain scikit-learn equivalent (single machine):
# clf = GridSearchCV(rf, cv=2, scoring='accuracy',
#                    param_grid=tuned_parameters,
#                    verbose=True)

clf.fit(X_train, y_train)
best = clf.best_estimator_
this_score = metrics.accuracy_score(y_test, best.predict(X_test))
scorestr = "RF / GridSearchCV: Accuracy Score %0.2f" % this_score
print
print scorestr
print "-" * len(scorestr)
print metrics.classification_report(y_test,
                                    best.predict(X_test), target_names=le.classes_)
--------------------------------------------------------------------------------