├── README.md
├── classify.py
└── python_scikit_airbnb.ipynb

/README.md:
--------------------------------------------------------------------------------
# spark-sklearn-airbnb-predict
Code example to predict prices of Airbnb vacation rentals, using scikit-learn on Spark.

The [Jupyter notebook in this repo](https://github.com/mapr-demos/spark-sklearn-airbnb-predict/blob/master/python_scikit_airbnb.ipynb) contains examples that run regression estimators on the [Inside Airbnb](http://insideairbnb.com/get-the-data.html) listings dataset from San Francisco. The target variable is the price of the listing. To speed up the hyperparameter search, the notebook shows examples that use the spark-sklearn package to distribute GridSearchCV across the nodes of a Spark cluster. This makes the search much faster and, because a larger parameter grid can be explored in the same time, can lead to better results (the core pattern is sketched at the end of this README).

To run the scikit-learn examples (without Spark), the following packages are required:
* Python 2
* Pandas
* NumPy
* scikit-learn (0.17 or later)

These can be installed on the [MapR Sandbox](https://www.mapr.com/products/mapr-sandbox-hadoop).

To run the scikit-learn examples with Spark, the following packages are required on each machine:
* All of the above packages
* Spark (1.5 or later)
* [spark-sklearn](https://github.com/databricks/spark-sklearn) -- follow the installation instructions there

You can run this on a MapR cluster using one of these methods:
* Use the [MapR Sandbox](https://www.mapr.com/products/mapr-sandbox-hadoop), which comes with Spark pre-installed. You must install Pandas, NumPy, and scikit-learn yourself.
* If you have multiple machines available, use the [MapR Community Edition](https://www.mapr.com/products/hadoop-download) and install the mapr-spark package on each machine, following the [Spark on YARN documentation](http://maprdocs.mapr.com/51/#Spark/SparkonYARN.html).

Run the script with:

```
MASTER=yarn-client /opt/mapr/spark/spark-1.5.2/bin/spark-submit --num-executors=4 --executor-cores=8 python_scikit_airbnb.py
```

(set num-executors and executor-cores to suit your environment)

The file ```classify.py``` in this repo contains an example of classification on the same dataset, using ```reviews.csv``` and text analysis.
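For reference, the distributed grid search in both the notebook and ```classify.py``` boils down to the pattern below. This is a minimal sketch rather than the notebook's actual code: the random feature matrix, target vector, and parameter grid are placeholders, but the spark-sklearn ```GridSearchCV``` call itself -- a SparkContext as the first argument, otherwise the familiar scikit-learn interface -- is the same one used in ```classify.py```.

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from pyspark import SparkConf, SparkContext
from spark_sklearn import GridSearchCV   # drop-in replacement for sklearn's GridSearchCV

sc = SparkContext(conf=SparkConf())

# placeholder data standing in for the engineered Airbnb features and listing prices
X = np.random.rand(500, 10)
y = np.random.rand(500) * 200

# each point in this grid is evaluated as a Spark task instead of a local job
param_grid = {"n_estimators": [10, 100], "max_depth": [3, None]}
clf = GridSearchCV(sc, RandomForestRegressor(random_state=1),
                   param_grid=param_grid, cv=3)
clf.fit(X, y)

print clf.best_estimator_
```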
and of course... have fun!

--------------------------------------------------------------------------------
/classify.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np

from sklearn import neighbors
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
import sklearn.metrics as metrics

# scikit-learn only (no Spark):
# from sklearn.grid_search import GridSearchCV

# spark-sklearn: drop-in GridSearchCV that distributes the search over a Spark cluster
from pyspark import SparkContext, SparkConf
from spark_sklearn import GridSearchCV

def top_tfidf_feats(row, features, top_n=25):
    ''' Top tfidf features in a single row of the tfidf matrix '''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

LISTINGSFILE = '/mapr/tmclust1/user/mapr/pyspark-learn/airbnb/listings.csv'
REVIEWSFILE = '/mapr/tmclust1/user/mapr/pyspark-learn/airbnb/reviews.csv'

cols = ['id',
        'neighbourhood_cleansed',
        ]

rcols = ['listing_id', 'comments']

nbhs = ['Mission', 'South of Market', 'Western Addition']

# read the listings and reviews into dataframes
df = pd.read_csv(LISTINGSFILE, usecols=cols, index_col='id')
rdf = pd.read_csv(REVIEWSFILE, usecols=rcols)

# combine all reviews for each listing into a single text field,
# indexed by listing ID, so they can be joined with the listings
rdf = rdf.groupby(['listing_id'])['comments']. \
    apply(lambda x: ' '.join(x.astype(str))).reset_index()
rdf = rdf.set_index(rdf['listing_id'].astype(float))
df = pd.concat([df, rdf], axis=1)

# drop listings with missing data and keep only the target neighbourhoods
print "before filtering: %d" % len(df.index)
df = df.dropna(axis=0)
df = df[df.neighbourhood_cleansed.isin(nbhs)]
print "after filtering: %d" % len(df.index)

# encode the neighbourhood names as integer class labels
le = preprocessing.LabelEncoder().fit(df.neighbourhood_cleansed)
df['nbh'] = le.transform(df.neighbourhood_cleansed)

# turn the combined review text into tf-idf features
tfid = TfidfVectorizer()
ttext = tfid.fit_transform(df['comments'])

# uncomment to inspect the top tf-idf features for one listing
# print top_feats_in_doc(ttext, tfid.get_feature_names(), 1, 10)

print "%d %d" % (ttext.shape[0], len(df['nbh']))
X_train, X_test, y_train, y_test = \
    train_test_split(ttext, df['nbh'],
                     test_size=0.2, random_state=1)

rs = 1
ests = [neighbors.KNeighborsClassifier(3),
        RandomForestClassifier(random_state=rs)]

ests_labels = np.array(['KNeighbors', 'RandomForest'])

# baseline: fit each classifier with default settings and report its accuracy
for i, e in enumerate(ests):
    e.fit(X_train, y_train)
    this_score = metrics.accuracy_score(y_test, e.predict(X_test))
    scorestr = "%s: Accuracy Score %0.2f" % (ests_labels[i],
                                             this_score)
    print
    print scorestr
    print "-" * len(scorestr)
    print metrics.classification_report(y_test,
                                        e.predict(X_test), target_names=le.classes_)

tuned_parameters = {"max_depth": [3, None],
                    "max_features": [1, 'auto'],
                    "min_samples_split": [1, 20],
                    "n_estimators": [10, 300, 500]}
rf = RandomForestClassifier(random_state=rs)

# spark-sklearn: distribute the grid search across the Spark cluster
conf = SparkConf()
sc = SparkContext(conf=conf)
clf = GridSearchCV(sc, rf, cv=3,
                   param_grid=tuned_parameters,
                   scoring='accuracy')

# plain scikit-learn equivalent (single machine):
# clf = GridSearchCV(rf, cv=2, scoring='accuracy',
#                    param_grid=tuned_parameters,
#                    verbose=True)

clf.fit(X_train, y_train)
best = clf.best_estimator_
this_score = metrics.accuracy_score(y_test, best.predict(X_test))
scorestr = "RF / GridSearchCV: Accuracy Score %0.2f" % this_score
print
print scorestr
print "-" * len(scorestr)
print metrics.classification_report(y_test,
                                    best.predict(X_test), target_names=le.classes_)
--------------------------------------------------------------------------------