├── .gitignore ├── README.md ├── data └── ml-100k │ ├── README │ ├── allbut.pl │ ├── mku.sh │ ├── u.data │ ├── u.genre │ ├── u.info │ ├── u.item │ ├── u.occupation │ ├── u.user │ ├── u1.base │ ├── u1.test │ ├── u2.base │ ├── u2.test │ ├── u3.base │ ├── u3.test │ ├── u4.base │ ├── u4.test │ ├── u5.base │ ├── u5.test │ ├── ua.base │ ├── ua.test │ ├── ub.base │ └── ub.test ├── julia-df.jl ├── python-df.ipynb └── r-df.R /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | I have been playing a lot lately with [Julia](http://julialang.org/). 2 | In the past I have used Python and R a lot for my data analysis tasks. I always tend to have some cheatsheet in front of me for the basic dataframe operations in pandas and R. It's hard to make that context switch in the syntax from one language to another. 3 | With Julia, my brain was just not ready to accept another set of syntax. 4 | So I thought a quick reference comparing the basic dataframe manipulation syntax for all 3 languages would be nice. 5 | Most of my data tasks start with some ETL tasks on dataframes. So this is more of a reference for my own future use, but it can be useful to others who face the same situation. 6 | 7 | I took this awesome tutorial by [Greg Reda](http://www.gregreda.com/) on ["Working with DataFrames"](http://www.gregreda.com/2013/10/26/working-with-pandas-dataframes/) and tried to port its examples to R and Julia. 8 | 9 | All the code from this post can be found at this [GitHub repo dataframes-compare](https://github.com/ajkl/dataframes-cheatsheet) 10 | 11 | We start with the [MovieLens 100k dataset](http://grouplens.org/datasets/movielens/). 
12 | From the README you can see: 13 | 14 | ``` 15 | This data set consists of: 16 | * 100,000 ratings (1-5) from 943 users on 1682 movies. 17 | * Each user has rated at least 20 movies. 18 | * Simple demographic info for the users (age, gender, occupation, zip) 19 | ``` 20 | 21 | ## Data import 22 | Let's start with loading the data into our dataframe. The methods to read the data from a flat file are pretty simple and straightforward in all 3 languages; the only tricky part is adding column names in Julia, which expects them as symbols rather than strings. There are two ways to build them, shown below. 23 | 24 | ######Julia 25 | 26 | ``` 27 | using DataFrames 28 | u_col_names=[symbol("user_id"), symbol("age"), symbol("sex"), symbol("occupation"), symbol("zip_code")] 29 | ### 30 | #another way to do the same without adding each entry as a symbol. Thanks to [Jubobs](http://stackoverflow.com/users/2541573/jubobs) from this [stackoverflow post](http://stackoverflow.com/questions/27629206/why-does-this-list-comprehension-return-an-arrayany-1-instead-of-an-arraysymb/27629609#27629609) for the suggestion. 
31 | col_names=["user_id", "age", "sex", "occupation", "zip_code"] 32 | u_col_names=map(symbol, col_names) 33 | ### 34 | users = DataFrames.readtable("data/ml-100k/u.user", separator='|', header=false, names=u_col_names) 35 | ``` 36 | 37 | ######Python 38 | 39 | ``` 40 | import pandas as pd 41 | 42 | u_col_names = ['user_id', 'age', 'sex', 'occupation', 'zip_code'] 43 | users = pd.read_csv('data/ml-100k/u.user', sep='|', names=u_col_names) 44 | 45 | r_col_names = ['user_id', 'movie_id', 'rating', 'unix_timestamp'] 46 | ratings = pd.read_csv('data/ml-100k/u.data', sep='\t', names=r_col_names) 47 | 48 | # let's only load the first five columns of the file with usecols; u.item is not UTF-8, so the encoding is given explicitly 49 | m_col_names = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url'] 50 | movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_col_names, usecols=range(5), encoding='latin-1') 51 | ``` 52 | 53 | ######R 54 | 55 | ``` 56 | u_col_names <- c('user_id', 'age', 'sex', 'occupation', 'zip_code') 57 | users <- read.csv('data/ml-100k/u.user', sep='|', header=FALSE, col.names=u_col_names) # header=FALSE: the files have no header row 58 | 59 | r_col_names = c('user_id', 'movie_id', 'rating', 'unix_timestamp') 60 | ratings = read.csv('data/ml-100k/u.data', sep='\t', header=FALSE, col.names=r_col_names) 61 | 62 | # let's only load the first five columns for movies; a colClasses entry of "NULL" drops a column 63 | m_col_names = c('movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url') 64 | movies = read.table('data/ml-100k/u.item', sep='|', colClasses=c("integer", "character", "factor", "factor", "character", rep("NULL", 19)), quote="") 65 | #http://stackoverflow.com/questions/5788117/only-read-limited-number-of-columns-in-r also quotes in strings cause importing errors so you need quote="" 66 | #cannot specify col.names in read.table as we are skipping 19 columns. 
R will complain with "more columns than column names" 67 | colnames(movies) <- m_col_names 68 | ``` 69 | 70 | If you look closely, R and Python accept string literals in single quotes ('), but Julia treats a single-quoted value as a character literal, which is why the Julia call passes the separator as '|' and the column names in double quotes. I personally feel that's how it should be, given how C/C++ distinguishes character and string literals. 71 | 72 | ## Sanity check 73 | Now that we have the data loaded in our dataframes, it's usually good practice to look at the class/type of each column. We will also get the summary statistics for the columns of our dataframe. 74 | 75 | ######Julia 76 | ``` 77 | eltypes(users) 78 | 79 | 5-element Array{Type{T<:Top},1}: 80 | Int64 81 | Int64 82 | UTF8String 83 | UTF8String 84 | UTF8String 85 | ``` 86 | ``` 87 | describe(users) 88 | 89 | user_id 90 | Min 1.0 91 | 1st Qu. 236.5 92 | Median 472.0 93 | Mean 472.0 94 | 3rd Qu. 707.5 95 | Max 943.0 96 | NAs 0 97 | NA% 0.0% 98 | 99 | age 100 | Min 7.0 101 | 1st Qu. 25.0 102 | Median 31.0 103 | Mean 34.05196182396607 104 | 3rd Qu. 
43.0 105 | Max 73.0 106 | NAs 0 107 | NA% 0.0% 108 | 109 | sex 110 | Length 943 111 | Type UTF8String 112 | NAs 0 113 | NA% 0.0% 114 | Unique 2 115 | 116 | occupation 117 | Length 943 118 | Type UTF8String 119 | NAs 0 120 | NA% 0.0% 121 | Unique 21 122 | 123 | zip_code 124 | Length 943 125 | Type UTF8String 126 | NAs 0 127 | NA% 0.0% 128 | Unique 795 129 | ``` 130 | 131 | ######Python 132 | ``` 133 | type(users) 134 | 135 | pandas.core.frame.DataFrame 136 | ``` 137 | ``` 138 | users.dtypes 139 | 140 | user_id int64 141 | age int64 142 | sex object 143 | occupation object 144 | zip_code object 145 | dtype: object 146 | ``` 147 | ``` 148 | users.describe() 149 | 150 | user_id age 151 | count 943.000000 943.000000 152 | mean 472.000000 34.051962 153 | std 272.364951 12.192740 154 | min 1.000000 7.000000 155 | 25% 236.500000 25.000000 156 | 50% 472.000000 31.000000 157 | 75% 707.500000 43.000000 158 | max 943.000000 73.000000 159 | ``` 160 | 161 | ######R 162 | ``` 163 | class(users) 164 | 165 | [1] "data.frame" 166 | ``` 167 | ``` 168 | lapply(users, class) 169 | 170 | $user_id 171 | [1] "integer" 172 | 173 | $age 174 | [1] "integer" 175 | 176 | $sex 177 | [1] "factor" 178 | 179 | $occupation 180 | [1] "factor" 181 | 182 | $zip_code 183 | [1] "factor" 184 | ``` 185 | ``` 186 | summary(users) 187 | 188 | user_id age sex occupation zip_code 189 | Min. : 2.0 Min. : 7.00 F:273 student :196 55414 : 9 190 | 1st Qu.:237.2 1st Qu.:25.00 M:669 other :105 55105 : 6 191 | Median :472.5 Median :31.00 educator : 95 10003 : 5 192 | Mean :472.5 Mean :34.06 administrator: 79 20009 : 5 193 | 3rd Qu.:707.8 3rd Qu.:43.00 engineer : 67 55337 : 5 194 | Max. :943.0 Max. :73.00 programmer : 66 27514 : 4 195 | (Other) :334 (Other):908 
204 | ``` 205 | 206 | We can see that the ratio is around 1:2.5 between female and male users, and that students form the largest occupation group. The minimum age is 7 and the maximum is 73. Note that the R summary covers only 942 users and starts at user_id 2: it was produced by a read.csv call without header=FALSE, which consumes the first record as a header row, so its counts and quartiles differ slightly from the Julia and pandas output. 207 | 208 | ## Subsetting 209 | 210 | Let's try to query our data now and filter it with some conditions. 211 | 212 | #### Head/Tail 213 | We will warm up with some top/bottom values to get a feel for the data we have at hand. 214 | 215 | 216 | ######Julia 217 | ``` 218 | head(users) 219 | 6x5 DataFrame 220 | | Row | user_id | age | sex | occupation | zip_code | 221 | |-----|---------|-----|-----|--------------|----------| 222 | | 1 | 1 | 24 | "M" | "technician" | "85711" | 223 | | 2 | 2 | 53 | "F" | "other" | "94043" | 224 | | 3 | 3 | 23 | "M" | "writer" | "32067" | 225 | | 4 | 4 | 24 | "M" | "technician" | "43537" | 226 | | 5 | 5 | 33 | "F" | "other" | "15213" | 227 | | 6 | 6 | 42 | "M" | "executive" | "98101" | 228 | ``` 229 | ``` 230 | tail(users) 231 | 6x5 DataFrame 232 | | Row | user_id | age | sex | occupation | zip_code | 233 | |-----|---------|-----|-----|-----------------|----------| 234 | | 1 | 938 | 38 | "F" | "technician" | "55038" | 235 | | 2 | 939 | 26 | "F" | "student" | "33319" | 236 | | 3 | 940 | 32 | "M" | "administrator" | "02215" | 237 | | 4 | 941 | 20 | "M" | "student" | "97229" | 238 | | 5 | 942 | 48 | "F" | "librarian" | "78209" | 239 | | 6 | 943 | 22 | "M" | "student" | "77841" | 240 | ``` 241 | If you want a custom number of rows from head or tail, you can pass it as a second parameter 242 | ``` 243 | head(users, 3) 244 | 3x5 DataFrame 245 | | Row | user_id | age | sex | occupation | zip_code | 246 | 
|-----|---------|-----|-----|--------------|----------| 247 | | 1 | 1 | 24 | "M" | "technician" | "85711" | 248 | | 2 | 2 | 53 | "F" | "other" | "94043" | 249 | | 3 | 3 | 23 | "M" | "writer" | "32067" | 250 | ``` 251 | 252 | I won't paste the output of these for Python and R as they are similar to what Julia shows (pandas returns 5 rows by default rather than 6). 253 | 254 | ######Python 255 | ``` 256 | users.head() 257 | users.tail() 258 | users.head(3) 259 | ``` 260 | 261 | ######R 262 | ``` 263 | head(users) 264 | tail(users) 265 | head(users, n=3) 266 | ``` 267 | 268 | #### Row subset 269 | 270 | Let's try to get all the rows from the 50th row to the 55th row 271 | 272 | ######Julia 273 | ``` 274 | users[50:55,:] 275 | 6x5 DataFrame 276 | | Row | user_id | age | sex | occupation | zip_code | 277 | |-----|---------|-----|-----|--------------|----------| 278 | | 1 | 50 | 21 | "M" | "writer" | "52245" | 279 | | 2 | 51 | 28 | "M" | "educator" | "16509" | 280 | | 3 | 52 | 18 | "F" | "student" | "55105" | 281 | | 4 | 53 | 26 | "M" | "programmer" | "55414" | 282 | | 5 | 54 | 22 | "M" | "executive" | "66315" | 283 | | 6 | 55 | 37 | "M" | "programmer" | "01331" | 284 | ``` 285 | ######Python 286 | ``` 287 | users[49:55]  # pandas slices are 0-based and exclude the end, so positions 49..54 are the 50th..55th rows 288 | ``` 289 | ######R 290 | ``` 291 | users[50:55,] 292 | ``` 293 | 294 | #### Column subset 295 | 296 | You can select a single column by column name. 
297 | 298 | ######Julia 299 | ``` 300 | head(users[:occupation]) 301 | 6-element DataArray{UTF8String,1}: 302 | "technician" 303 | "other" 304 | "writer" 305 | "technician" 306 | "other" 307 | "executive" 308 | ``` 309 | ######Python 310 | ``` 311 | users['occupation'].head() 312 | ``` 313 | ######R 314 | ``` 315 | head(users$occupation) 316 | ``` 317 | 318 | Multiple column selection works by passing a vector of column names 319 | 320 | ######Julia 321 | ``` 322 | head(users[:,[:occupation, :sex, :age]]) 323 | 6x3 DataFrame 324 | | Row | occupation | sex | age | 325 | |-----|--------------|-----|-----| 326 | | 1 | "technician" | "M" | 24 | 327 | | 2 | "other" | "F" | 53 | 328 | | 3 | "writer" | "M" | 23 | 329 | | 4 | "technician" | "M" | 24 | 330 | | 5 | "other" | "F" | 33 | 331 | | 6 | "executive" | "M" | 42 | 332 | ``` 333 | 334 | ######Python 335 | ``` 336 | users[['occupation', 'sex', 'age']].head() 337 | ``` 338 | 339 | ######R 340 | ``` 341 | head(users[,c('occupation', 'sex', 'age')]) 342 | ``` 343 | 344 | #### Query / Conditional subset 345 | 346 | Subsetting a dataframe based on querying a column for a condition can be achieved like this - 347 | 348 | ######Julia 349 | ``` 350 | users[users[:occupation] .== "writer", :] 351 | 45x5 DataFrame 352 | | Row | user_id | age | sex | occupation | zip_code | 353 | |-----|---------|-----|-----|------------|----------| 354 | | 1 | 3 | 23 | "M" | "writer" | "32067" | 355 | | 2 | 21 | 26 | "M" | "writer" | "30068" | 356 | | 3 | 22 | 25 | "M" | "writer" | "40206" | 357 | | 4 | 28 | 32 | "M" | "writer" | "55369" | 358 | | 5 | 50 | 21 | "M" | "writer" | "52245" | 359 | ⋮ 360 | | 43 | 853 | 49 | "M" | "writer" | "40515" | 361 | | 44 | 896 | 28 | "M" | "writer" | "91505" | 362 | | 45 | 911 | 37 | "F" | "writer" | "53210" | 363 | ``` 364 | ######Python 365 | ``` 366 | users[users.occupation == 'writer'] 367 | ``` 368 | There is another way in python, by using the query construct of the dataframe. 
369 | ``` 370 | users.query('occupation=="writer"') 371 | ``` 372 | ######R 373 | ``` 374 | users[users$occupation == 'writer',] 375 | ``` 376 | Notice the subtle differences between these examples. For instance, in the last conditional subset, Python does not need an explicit "select all columns" marker, whereas Julia needs `, :` and R needs a trailing `,` after the row condition. 377 | 378 | I will try to write a follow-up post on joining dataframes in these 3 languages. 379 | -------------------------------------------------------------------------------- /data/ml-100k/README: -------------------------------------------------------------------------------- 1 | SUMMARY & USAGE LICENSE 2 | ============================================= 3 | 4 | MovieLens data sets were collected by the GroupLens Research Project 5 | at the University of Minnesota. 6 | 7 | This data set consists of: 8 | * 100,000 ratings (1-5) from 943 users on 1682 movies. 9 | * Each user has rated at least 20 movies. 10 | * Simple demographic info for the users (age, gender, occupation, zip) 11 | 12 | The data was collected through the MovieLens web site 13 | (movielens.umn.edu) during the seven-month period from September 19th, 14 | 1997 through April 22nd, 1998. This data has been cleaned up - users 15 | who had less than 20 ratings or did not have complete demographic 16 | information were removed from this data set. Detailed descriptions of 17 | the data file can be found at the end of this file. 18 | 19 | Neither the University of Minnesota nor any of the researchers 20 | involved can guarantee the correctness of the data, its suitability 21 | for any particular purpose, or the validity of results based on the 22 | use of the data set. The data set may be used for any research 23 | purposes under the following conditions: 24 | 25 | * The user may not state or imply any endorsement from the 26 | University of Minnesota or the GroupLens Research Group. 
27 | 28 | * The user must acknowledge the use of the data set in 29 | publications resulting from the use of the data set, and must 30 | send us an electronic or paper copy of those publications. 31 | 32 | * The user may not redistribute the data without separate 33 | permission. 34 | 35 | * The user may not use this information for any commercial or 36 | revenue-bearing purposes without first obtaining permission 37 | from a faculty member of the GroupLens Research Project at the 38 | University of Minnesota. 39 | 40 | If you have any further questions or comments, please contact Jon Herlocker 41 | . 42 | 43 | ACKNOWLEDGEMENTS 44 | ============================================== 45 | 46 | Thanks to Al Borchers for cleaning up this data and writing the 47 | accompanying scripts. 48 | 49 | PUBLISHED WORK THAT HAS USED THIS DATASET 50 | ============================================== 51 | 52 | Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic 53 | Framework for Performing Collaborative Filtering. Proceedings of the 54 | 1999 Conference on Research and Development in Information 55 | Retrieval. Aug. 1999. 56 | 57 | FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT 58 | ============================================== 59 | 60 | The GroupLens Research Project is a research group in the Department 61 | of Computer Science and Engineering at the University of Minnesota. 62 | Members of the GroupLens Research Project are involved in many 63 | research projects related to the fields of information filtering, 64 | collaborative filtering, and recommender systems. The project is lead 65 | by professors John Riedl and Joseph Konstan. The project began to 66 | explore automated collaborative filtering in 1992, but is most well 67 | known for its world wide trial of an automated collaborative filtering 68 | system for Usenet news in 1996. 
The technology developed in the 69 | Usenet trial formed the base for the formation of Net Perceptions, 70 | Inc., which was founded by members of GroupLens Research. Since then 71 | the project has expanded its scope to research overall information 72 | filtering solutions, integrating in content-based methods as well as 73 | improving current collaborative filtering technology. 74 | 75 | Further information on the GroupLens Research project, including 76 | research publications, can be found at the following web site: 77 | 78 | http://www.grouplens.org/ 79 | 80 | GroupLens Research currently operates a movie recommender based on 81 | collaborative filtering: 82 | 83 | http://www.movielens.org/ 84 | 85 | DETAILED DESCRIPTIONS OF DATA FILES 86 | ============================================== 87 | 88 | Here are brief descriptions of the data. 89 | 90 | ml-data.tar.gz -- Compressed tar file. To rebuild the u data files do this: 91 | gunzip ml-data.tar.gz 92 | tar xvf ml-data.tar 93 | mku.sh 94 | 95 | u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. 96 | Each user has rated at least 20 movies. Users and items are 97 | numbered consecutively from 1. The data is randomly 98 | ordered. This is a tab separated list of 99 | user id | item id | rating | timestamp. 100 | The time stamps are unix seconds since 1/1/1970 UTC 101 | 102 | u.info -- The number of users, items, and ratings in the u data set. 103 | 104 | u.item -- Information about the items (movies); this is a pipe (|) separated 105 | list of 106 | movie id | movie title | release date | video release date | 107 | IMDb URL | unknown | Action | Adventure | Animation | 108 | Children's | Comedy | Crime | Documentary | Drama | Fantasy | 109 | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | 110 | Thriller | War | Western | 111 | The last 19 fields are the genres, a 1 indicates the movie 112 | is of that genre, a 0 indicates it is not; movies can be in 113 | several genres at once. 
114 | The movie ids are the ones used in the u.data data set. 115 | 116 | u.genre -- A list of the genres. 117 | 118 | u.user -- Demographic information about the users; this is a pipe (|) 119 | separated list of 120 | user id | age | gender | occupation | zip code 121 | The user ids are the ones used in the u.data data set. 122 | 123 | u.occupation -- A list of the occupations. 124 | 125 | u1.base -- The data sets u1.base and u1.test through u5.base and u5.test 126 | u1.test are 80%/20% splits of the u data into training and test data. 127 | u2.base Each of u1, ..., u5 have disjoint test sets; this is for 128 | u2.test 5 fold cross validation (where you repeat your experiment 129 | u3.base with each training and test set and average the results). 130 | u3.test These data sets can be generated from u.data by mku.sh. 131 | u4.base 132 | u4.test 133 | u5.base 134 | u5.test 135 | 136 | ua.base -- The data sets ua.base, ua.test, ub.base, and ub.test 137 | ua.test split the u data into a training set and a test set with 138 | ub.base exactly 10 ratings per user in the test set. The sets 139 | ub.test ua.test and ub.test are disjoint. These data sets can 140 | be generated from u.data by mku.sh. 141 | 142 | allbut.pl -- The script that generates training and test sets where 143 | all but n of a user's ratings are in the training data. 144 | 145 | mku.sh -- A shell script to generate all the u data sets from u.data. 
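The 80%/20% splits described above (u1 through u5) can be sketched in Python. This is only an illustration of the idea, not the script shipped with the data set: the function name `five_fold_splits`, the tuple layout `(user_id, item_id, rating, timestamp)` and the toy rows are assumptions made for the example. It assumes the rows arrive in file order and that the row count is a multiple of the fold count, as it is for the 100000 ratings in u.data.

```python
# Sketch of a five-fold split over rating rows kept in file order.
# Assumption: each row is a (user_id, item_id, rating, timestamp) tuple and
# the row count is a multiple of n_folds (100000 rows and 5 folds in u.data).

def five_fold_splits(rows, n_folds=5):
    """Yield (base, test) pairs: chunk i is the test set, the rest is the base set."""
    fold_size = len(rows) // n_folds
    by_user_then_item = lambda row: (row[0], row[1])
    for i in range(n_folds):
        start, stop = i * fold_size, (i + 1) * fold_size
        # Each output is sorted by user id, then item id.
        test = sorted(rows[start:stop], key=by_user_then_item)
        base = sorted(rows[:start] + rows[stop:], key=by_user_then_item)
        yield base, test


# Tiny synthetic example: 10 ratings, so every test set holds 2 of them (20%).
toy_rows = [(u, i, 3, 0) for u in range(1, 6) for i in range(1, 3)]
for base, test in five_fold_splits(toy_rows):
    print(len(base), len(test))  # prints "8 2" five times
```

Because the test sets are consecutive, non-overlapping chunks, every rating lands in exactly one test set, which is what makes the five pairs usable for cross validation.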
146 | -------------------------------------------------------------------------------- /data/ml-100k/allbut.pl: -------------------------------------------------------------------------------- 1 | #!/usr/local/bin/perl 2 | 3 | # get args 4 | if (@ARGV < 3) { 5 | print STDERR "Usage: $0 base_name start stop max_test [ratings ...]\n"; 6 | exit 1; 7 | } 8 | $basename = shift; 9 | $start = shift; 10 | $stop = shift; 11 | $maxtest = shift; 12 | 13 | # open files 14 | open( TESTFILE, ">$basename.test" ) or die "Cannot open $basename.test for writing\n"; 15 | open( BASEFILE, ">$basename.base" ) or die "Cannot open $basename.base for writing\n"; 16 | 17 | # init variables 18 | $testcnt = 0; 19 | 20 | while (<>) { 21 | ($user) = split; 22 | if (! defined $ratingcnt{$user}) { 23 | $ratingcnt{$user} = 0; 24 | } 25 | ++$ratingcnt{$user}; 26 | if (($testcnt < $maxtest || $maxtest <= 0) 27 | && $ratingcnt{$user} >= $start && $ratingcnt{$user} <= $stop) { 28 | ++$testcnt; 29 | print TESTFILE; 30 | } 31 | else { 32 | print BASEFILE; 33 | } 34 | } 35 | -------------------------------------------------------------------------------- /data/ml-100k/mku.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | trap `rm -f tmp.$$; exit 1` 1 2 15 4 | 5 | for i in 1 2 3 4 5 6 | do 7 | head -`expr $i \* 20000` u.data | tail -20000 > tmp.$$ 8 | sort -t" " -k 1,1n -k 2,2n tmp.$$ > u$i.test 9 | head -`expr \( $i - 1 \) \* 20000` u.data > tmp.$$ 10 | tail -`expr \( 5 - $i \) \* 20000` u.data >> tmp.$$ 11 | sort -t" " -k 1,1n -k 2,2n tmp.$$ > u$i.base 12 | done 13 | 14 | allbut.pl ua 1 10 100000 u.data 15 | sort -t" " -k 1,1n -k 2,2n ua.base > tmp.$$ 16 | mv tmp.$$ ua.base 17 | sort -t" " -k 1,1n -k 2,2n ua.test > tmp.$$ 18 | mv tmp.$$ ua.test 19 | 20 | allbut.pl ub 11 20 100000 u.data 21 | sort -t" " -k 1,1n -k 2,2n ub.base > tmp.$$ 22 | mv tmp.$$ ub.base 23 | sort -t" " -k 1,1n -k 2,2n ub.test > tmp.$$ 24 | mv tmp.$$ ub.test 25 | 26 | 
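For readers more comfortable with Python than Perl, the hold-out rule in allbut.pl above can be sketched as follows. This is a hedged re-expression for illustration, not part of the data set: the function name `allbut_split` and the toy rows are assumptions, and rows are assumed to be tuples whose first field is the user id.

```python
from collections import defaultdict

# Sketch of the hold-out rule used by allbut.pl: a user's start-th through
# stop-th ratings (counted in input order) go to the test set, everything
# else goes to the base set. A max_test of 0 or less means "no cap".

def allbut_split(rows, start, stop, max_test):
    """Return (base, test) lists built from rows whose first field is the user id."""
    rating_count = defaultdict(int)
    base, test = [], []
    for row in rows:
        user = row[0]
        rating_count[user] += 1
        cap_not_reached = max_test <= 0 or len(test) < max_test
        if cap_not_reached and start <= rating_count[user] <= stop:
            test.append(row)
        else:
            base.append(row)
    return base, test


# Toy example: two users with three ratings each, holding out the first two.
toy_rows = [(1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (2, 12)]
base, test = allbut_split(toy_rows, start=1, stop=2, max_test=100)
print(test)  # [(1, 10), (1, 11), (2, 10), (2, 11)]
print(base)  # [(1, 12), (2, 12)]
```

With `start=1, stop=10` this corresponds to the ua split and with `start=11, stop=20` to the ub split, which is why the two test sets are disjoint.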
-------------------------------------------------------------------------------- /data/ml-100k/u.genre: -------------------------------------------------------------------------------- 1 | unknown|0 2 | Action|1 3 | Adventure|2 4 | Animation|3 5 | Children's|4 6 | Comedy|5 7 | Crime|6 8 | Documentary|7 9 | Drama|8 10 | Fantasy|9 11 | Film-Noir|10 12 | Horror|11 13 | Musical|12 14 | Mystery|13 15 | Romance|14 16 | Sci-Fi|15 17 | Thriller|16 18 | War|17 19 | Western|18 20 | 21 | -------------------------------------------------------------------------------- /data/ml-100k/u.info: -------------------------------------------------------------------------------- 1 | 943 users 2 | 1682 items 3 | 100000 ratings 4 | -------------------------------------------------------------------------------- /data/ml-100k/u.item: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ajkl/dataframes-cheatsheet/d2e8566abfde327fa8145f65050957b10a82b5b8/data/ml-100k/u.item -------------------------------------------------------------------------------- /data/ml-100k/u.occupation: -------------------------------------------------------------------------------- 1 | administrator 2 | artist 3 | doctor 4 | educator 5 | engineer 6 | entertainment 7 | executive 8 | healthcare 9 | homemaker 10 | lawyer 11 | librarian 12 | marketing 13 | none 14 | other 15 | programmer 16 | retired 17 | salesman 18 | scientist 19 | student 20 | technician 21 | writer 22 | -------------------------------------------------------------------------------- /data/ml-100k/u.user: -------------------------------------------------------------------------------- 1 | 1|24|M|technician|85711 2 | 2|53|F|other|94043 3 | 3|23|M|writer|32067 4 | 4|24|M|technician|43537 5 | 5|33|F|other|15213 6 | 6|42|M|executive|98101 7 | 7|57|M|administrator|91344 8 | 8|36|M|administrator|05201 9 | 9|29|M|student|01002 10 | 10|53|M|lawyer|90703 11 | 11|39|F|other|30329 12 | 
12|28|F|other|06405 13 | 13|47|M|educator|29206 14 | 14|45|M|scientist|55106 15 | 15|49|F|educator|97301 16 | 16|21|M|entertainment|10309 17 | 17|30|M|programmer|06355 18 | 18|35|F|other|37212 19 | 19|40|M|librarian|02138 20 | 20|42|F|homemaker|95660 21 | 21|26|M|writer|30068 22 | 22|25|M|writer|40206 23 | 23|30|F|artist|48197 24 | 24|21|F|artist|94533 25 | 25|39|M|engineer|55107 26 | 26|49|M|engineer|21044 27 | 27|40|F|librarian|30030 28 | 28|32|M|writer|55369 29 | 29|41|M|programmer|94043 30 | 30|7|M|student|55436 31 | 31|24|M|artist|10003 32 | 32|28|F|student|78741 33 | 33|23|M|student|27510 34 | 34|38|F|administrator|42141 35 | 35|20|F|homemaker|42459 36 | 36|19|F|student|93117 37 | 37|23|M|student|55105 38 | 38|28|F|other|54467 39 | 39|41|M|entertainment|01040 40 | 40|38|M|scientist|27514 41 | 41|33|M|engineer|80525 42 | 42|30|M|administrator|17870 43 | 43|29|F|librarian|20854 44 | 44|26|M|technician|46260 45 | 45|29|M|programmer|50233 46 | 46|27|F|marketing|46538 47 | 47|53|M|marketing|07102 48 | 48|45|M|administrator|12550 49 | 49|23|F|student|76111 50 | 50|21|M|writer|52245 51 | 51|28|M|educator|16509 52 | 52|18|F|student|55105 53 | 53|26|M|programmer|55414 54 | 54|22|M|executive|66315 55 | 55|37|M|programmer|01331 56 | 56|25|M|librarian|46260 57 | 57|16|M|none|84010 58 | 58|27|M|programmer|52246 59 | 59|49|M|educator|08403 60 | 60|50|M|healthcare|06472 61 | 61|36|M|engineer|30040 62 | 62|27|F|administrator|97214 63 | 63|31|M|marketing|75240 64 | 64|32|M|educator|43202 65 | 65|51|F|educator|48118 66 | 66|23|M|student|80521 67 | 67|17|M|student|60402 68 | 68|19|M|student|22904 69 | 69|24|M|engineer|55337 70 | 70|27|M|engineer|60067 71 | 71|39|M|scientist|98034 72 | 72|48|F|administrator|73034 73 | 73|24|M|student|41850 74 | 74|39|M|scientist|T8H1N 75 | 75|24|M|entertainment|08816 76 | 76|20|M|student|02215 77 | 77|30|M|technician|29379 78 | 78|26|M|administrator|61801 79 | 79|39|F|administrator|03755 80 | 80|34|F|administrator|52241 81 | 
81|21|M|student|21218 82 | 82|50|M|programmer|22902 83 | 83|40|M|other|44133 84 | 84|32|M|executive|55369 85 | 85|51|M|educator|20003 86 | 86|26|M|administrator|46005 87 | 87|47|M|administrator|89503 88 | 88|49|F|librarian|11701 89 | 89|43|F|administrator|68106 90 | 90|60|M|educator|78155 91 | 91|55|M|marketing|01913 92 | 92|32|M|entertainment|80525 93 | 93|48|M|executive|23112 94 | 94|26|M|student|71457 95 | 95|31|M|administrator|10707 96 | 96|25|F|artist|75206 97 | 97|43|M|artist|98006 98 | 98|49|F|executive|90291 99 | 99|20|M|student|63129 100 | 100|36|M|executive|90254 101 | 101|15|M|student|05146 102 | 102|38|M|programmer|30220 103 | 103|26|M|student|55108 104 | 104|27|M|student|55108 105 | 105|24|M|engineer|94043 106 | 106|61|M|retired|55125 107 | 107|39|M|scientist|60466 108 | 108|44|M|educator|63130 109 | 109|29|M|other|55423 110 | 110|19|M|student|77840 111 | 111|57|M|engineer|90630 112 | 112|30|M|salesman|60613 113 | 113|47|M|executive|95032 114 | 114|27|M|programmer|75013 115 | 115|31|M|engineer|17110 116 | 116|40|M|healthcare|97232 117 | 117|20|M|student|16125 118 | 118|21|M|administrator|90210 119 | 119|32|M|programmer|67401 120 | 120|47|F|other|06260 121 | 121|54|M|librarian|99603 122 | 122|32|F|writer|22206 123 | 123|48|F|artist|20008 124 | 124|34|M|student|60615 125 | 125|30|M|lawyer|22202 126 | 126|28|F|lawyer|20015 127 | 127|33|M|none|73439 128 | 128|24|F|marketing|20009 129 | 129|36|F|marketing|07039 130 | 130|20|M|none|60115 131 | 131|59|F|administrator|15237 132 | 132|24|M|other|94612 133 | 133|53|M|engineer|78602 134 | 134|31|M|programmer|80236 135 | 135|23|M|student|38401 136 | 136|51|M|other|97365 137 | 137|50|M|educator|84408 138 | 138|46|M|doctor|53211 139 | 139|20|M|student|08904 140 | 140|30|F|student|32250 141 | 141|49|M|programmer|36117 142 | 142|13|M|other|48118 143 | 143|42|M|technician|08832 144 | 144|53|M|programmer|20910 145 | 145|31|M|entertainment|V3N4P 146 | 146|45|M|artist|83814 147 | 147|40|F|librarian|02143 148 | 
148|33|M|engineer|97006 149 | 149|35|F|marketing|17325 150 | 150|20|F|artist|02139 151 | 151|38|F|administrator|48103 152 | 152|33|F|educator|68767 153 | 153|25|M|student|60641 154 | 154|25|M|student|53703 155 | 155|32|F|other|11217 156 | 156|25|M|educator|08360 157 | 157|57|M|engineer|70808 158 | 158|50|M|educator|27606 159 | 159|23|F|student|55346 160 | 160|27|M|programmer|66215 161 | 161|50|M|lawyer|55104 162 | 162|25|M|artist|15610 163 | 163|49|M|administrator|97212 164 | 164|47|M|healthcare|80123 165 | 165|20|F|other|53715 166 | 166|47|M|educator|55113 167 | 167|37|M|other|L9G2B 168 | 168|48|M|other|80127 169 | 169|52|F|other|53705 170 | 170|53|F|healthcare|30067 171 | 171|48|F|educator|78750 172 | 172|55|M|marketing|22207 173 | 173|56|M|other|22306 174 | 174|30|F|administrator|52302 175 | 175|26|F|scientist|21911 176 | 176|28|M|scientist|07030 177 | 177|20|M|programmer|19104 178 | 178|26|M|other|49512 179 | 179|15|M|entertainment|20755 180 | 180|22|F|administrator|60202 181 | 181|26|M|executive|21218 182 | 182|36|M|programmer|33884 183 | 183|33|M|scientist|27708 184 | 184|37|M|librarian|76013 185 | 185|53|F|librarian|97403 186 | 186|39|F|executive|00000 187 | 187|26|M|educator|16801 188 | 188|42|M|student|29440 189 | 189|32|M|artist|95014 190 | 190|30|M|administrator|95938 191 | 191|33|M|administrator|95161 192 | 192|42|M|educator|90840 193 | 193|29|M|student|49931 194 | 194|38|M|administrator|02154 195 | 195|42|M|scientist|93555 196 | 196|49|M|writer|55105 197 | 197|55|M|technician|75094 198 | 198|21|F|student|55414 199 | 199|30|M|writer|17604 200 | 200|40|M|programmer|93402 201 | 201|27|M|writer|E2A4H 202 | 202|41|F|educator|60201 203 | 203|25|F|student|32301 204 | 204|52|F|librarian|10960 205 | 205|47|M|lawyer|06371 206 | 206|14|F|student|53115 207 | 207|39|M|marketing|92037 208 | 208|43|M|engineer|01720 209 | 209|33|F|educator|85710 210 | 210|39|M|engineer|03060 211 | 211|66|M|salesman|32605 212 | 212|49|F|educator|61401 213 | 213|33|M|executive|55345 214 
| 214|26|F|librarian|11231 215 | 215|35|M|programmer|63033 216 | 216|22|M|engineer|02215 217 | 217|22|M|other|11727 218 | 218|37|M|administrator|06513 219 | 219|32|M|programmer|43212 220 | 220|30|M|librarian|78205 221 | 221|19|M|student|20685 222 | 222|29|M|programmer|27502 223 | 223|19|F|student|47906 224 | 224|31|F|educator|43512 225 | 225|51|F|administrator|58202 226 | 226|28|M|student|92103 227 | 227|46|M|executive|60659 228 | 228|21|F|student|22003 229 | 229|29|F|librarian|22903 230 | 230|28|F|student|14476 231 | 231|48|M|librarian|01080 232 | 232|45|M|scientist|99709 233 | 233|38|M|engineer|98682 234 | 234|60|M|retired|94702 235 | 235|37|M|educator|22973 236 | 236|44|F|writer|53214 237 | 237|49|M|administrator|63146 238 | 238|42|F|administrator|44124 239 | 239|39|M|artist|95628 240 | 240|23|F|educator|20784 241 | 241|26|F|student|20001 242 | 242|33|M|educator|31404 243 | 243|33|M|educator|60201 244 | 244|28|M|technician|80525 245 | 245|22|M|student|55109 246 | 246|19|M|student|28734 247 | 247|28|M|engineer|20770 248 | 248|25|M|student|37235 249 | 249|25|M|student|84103 250 | 250|29|M|executive|95110 251 | 251|28|M|doctor|85032 252 | 252|42|M|engineer|07733 253 | 253|26|F|librarian|22903 254 | 254|44|M|educator|42647 255 | 255|23|M|entertainment|07029 256 | 256|35|F|none|39042 257 | 257|17|M|student|77005 258 | 258|19|F|student|77801 259 | 259|21|M|student|48823 260 | 260|40|F|artist|89801 261 | 261|28|M|administrator|85202 262 | 262|19|F|student|78264 263 | 263|41|M|programmer|55346 264 | 264|36|F|writer|90064 265 | 265|26|M|executive|84601 266 | 266|62|F|administrator|78756 267 | 267|23|M|engineer|83716 268 | 268|24|M|engineer|19422 269 | 269|31|F|librarian|43201 270 | 270|18|F|student|63119 271 | 271|51|M|engineer|22932 272 | 272|33|M|scientist|53706 273 | 273|50|F|other|10016 274 | 274|20|F|student|55414 275 | 275|38|M|engineer|92064 276 | 276|21|M|student|95064 277 | 277|35|F|administrator|55406 278 | 278|37|F|librarian|30033 279 | 
279|33|M|programmer|85251 280 | 280|30|F|librarian|22903 281 | 281|15|F|student|06059 282 | 282|22|M|administrator|20057 283 | 283|28|M|programmer|55305 284 | 284|40|M|executive|92629 285 | 285|25|M|programmer|53713 286 | 286|27|M|student|15217 287 | 287|21|M|salesman|31211 288 | 288|34|M|marketing|23226 289 | 289|11|M|none|94619 290 | 290|40|M|engineer|93550 291 | 291|19|M|student|44106 292 | 292|35|F|programmer|94703 293 | 293|24|M|writer|60804 294 | 294|34|M|technician|92110 295 | 295|31|M|educator|50325 296 | 296|43|F|administrator|16803 297 | 297|29|F|educator|98103 298 | 298|44|M|executive|01581 299 | 299|29|M|doctor|63108 300 | 300|26|F|programmer|55106 301 | 301|24|M|student|55439 302 | 302|42|M|educator|77904 303 | 303|19|M|student|14853 304 | 304|22|F|student|71701 305 | 305|23|M|programmer|94086 306 | 306|45|M|other|73132 307 | 307|25|M|student|55454 308 | 308|60|M|retired|95076 309 | 309|40|M|scientist|70802 310 | 310|37|M|educator|91711 311 | 311|32|M|technician|73071 312 | 312|48|M|other|02110 313 | 313|41|M|marketing|60035 314 | 314|20|F|student|08043 315 | 315|31|M|educator|18301 316 | 316|43|F|other|77009 317 | 317|22|M|administrator|13210 318 | 318|65|M|retired|06518 319 | 319|38|M|programmer|22030 320 | 320|19|M|student|24060 321 | 321|49|F|educator|55413 322 | 322|20|M|student|50613 323 | 323|21|M|student|19149 324 | 324|21|F|student|02176 325 | 325|48|M|technician|02139 326 | 326|41|M|administrator|15235 327 | 327|22|M|student|11101 328 | 328|51|M|administrator|06779 329 | 329|48|M|educator|01720 330 | 330|35|F|educator|33884 331 | 331|33|M|entertainment|91344 332 | 332|20|M|student|40504 333 | 333|47|M|other|V0R2M 334 | 334|32|M|librarian|30002 335 | 335|45|M|executive|33775 336 | 336|23|M|salesman|42101 337 | 337|37|M|scientist|10522 338 | 338|39|F|librarian|59717 339 | 339|35|M|lawyer|37901 340 | 340|46|M|engineer|80123 341 | 341|17|F|student|44405 342 | 342|25|F|other|98006 343 | 343|43|M|engineer|30093 344 | 344|30|F|librarian|94117 345 | 
345|28|F|librarian|94143 346 | 346|34|M|other|76059 347 | 347|18|M|student|90210 348 | 348|24|F|student|45660 349 | 349|68|M|retired|61455 350 | 350|32|M|student|97301 351 | 351|61|M|educator|49938 352 | 352|37|F|programmer|55105 353 | 353|25|M|scientist|28480 354 | 354|29|F|librarian|48197 355 | 355|25|M|student|60135 356 | 356|32|F|homemaker|92688 357 | 357|26|M|executive|98133 358 | 358|40|M|educator|10022 359 | 359|22|M|student|61801 360 | 360|51|M|other|98027 361 | 361|22|M|student|44074 362 | 362|35|F|homemaker|85233 363 | 363|20|M|student|87501 364 | 364|63|M|engineer|01810 365 | 365|29|M|lawyer|20009 366 | 366|20|F|student|50670 367 | 367|17|M|student|37411 368 | 368|18|M|student|92113 369 | 369|24|M|student|91335 370 | 370|52|M|writer|08534 371 | 371|36|M|engineer|99206 372 | 372|25|F|student|66046 373 | 373|24|F|other|55116 374 | 374|36|M|executive|78746 375 | 375|17|M|entertainment|37777 376 | 376|28|F|other|10010 377 | 377|22|M|student|18015 378 | 378|35|M|student|02859 379 | 379|44|M|programmer|98117 380 | 380|32|M|engineer|55117 381 | 381|33|M|artist|94608 382 | 382|45|M|engineer|01824 383 | 383|42|M|administrator|75204 384 | 384|52|M|programmer|45218 385 | 385|36|M|writer|10003 386 | 386|36|M|salesman|43221 387 | 387|33|M|entertainment|37412 388 | 388|31|M|other|36106 389 | 389|44|F|writer|83702 390 | 390|42|F|writer|85016 391 | 391|23|M|student|84604 392 | 392|52|M|writer|59801 393 | 393|19|M|student|83686 394 | 394|25|M|administrator|96819 395 | 395|43|M|other|44092 396 | 396|57|M|engineer|94551 397 | 397|17|M|student|27514 398 | 398|40|M|other|60008 399 | 399|25|M|other|92374 400 | 400|33|F|administrator|78213 401 | 401|46|F|healthcare|84107 402 | 402|30|M|engineer|95129 403 | 403|37|M|other|06811 404 | 404|29|F|programmer|55108 405 | 405|22|F|healthcare|10019 406 | 406|52|M|educator|93109 407 | 407|29|M|engineer|03261 408 | 408|23|M|student|61755 409 | 409|48|M|administrator|98225 410 | 410|30|F|artist|94025 411 | 411|34|M|educator|44691 412 | 
412|25|M|educator|15222 413 | 413|55|M|educator|78212 414 | 414|24|M|programmer|38115 415 | 415|39|M|educator|85711 416 | 416|20|F|student|92626 417 | 417|27|F|other|48103 418 | 418|55|F|none|21206 419 | 419|37|M|lawyer|43215 420 | 420|53|M|educator|02140 421 | 421|38|F|programmer|55105 422 | 422|26|M|entertainment|94533 423 | 423|64|M|other|91606 424 | 424|36|F|marketing|55422 425 | 425|19|M|student|58644 426 | 426|55|M|educator|01602 427 | 427|51|M|doctor|85258 428 | 428|28|M|student|55414 429 | 429|27|M|student|29205 430 | 430|38|M|scientist|98199 431 | 431|24|M|marketing|92629 432 | 432|22|M|entertainment|50311 433 | 433|27|M|artist|11211 434 | 434|16|F|student|49705 435 | 435|24|M|engineer|60007 436 | 436|30|F|administrator|17345 437 | 437|27|F|other|20009 438 | 438|51|F|administrator|43204 439 | 439|23|F|administrator|20817 440 | 440|30|M|other|48076 441 | 441|50|M|technician|55013 442 | 442|22|M|student|85282 443 | 443|35|M|salesman|33308 444 | 444|51|F|lawyer|53202 445 | 445|21|M|writer|92653 446 | 446|57|M|educator|60201 447 | 447|30|M|administrator|55113 448 | 448|23|M|entertainment|10021 449 | 449|23|M|librarian|55021 450 | 450|35|F|educator|11758 451 | 451|16|M|student|48446 452 | 452|35|M|administrator|28018 453 | 453|18|M|student|06333 454 | 454|57|M|other|97330 455 | 455|48|M|administrator|83709 456 | 456|24|M|technician|31820 457 | 457|33|F|salesman|30011 458 | 458|47|M|technician|Y1A6B 459 | 459|22|M|student|29201 460 | 460|44|F|other|60630 461 | 461|15|M|student|98102 462 | 462|19|F|student|02918 463 | 463|48|F|healthcare|75218 464 | 464|60|M|writer|94583 465 | 465|32|M|other|05001 466 | 466|22|M|student|90804 467 | 467|29|M|engineer|91201 468 | 468|28|M|engineer|02341 469 | 469|60|M|educator|78628 470 | 470|24|M|programmer|10021 471 | 471|10|M|student|77459 472 | 472|24|M|student|87544 473 | 473|29|M|student|94708 474 | 474|51|M|executive|93711 475 | 475|30|M|programmer|75230 476 | 476|28|M|student|60440 477 | 477|23|F|student|02125 478 | 
478|29|M|other|10019 479 | 479|30|M|educator|55409 480 | 480|57|M|retired|98257 481 | 481|73|M|retired|37771 482 | 482|18|F|student|40256 483 | 483|29|M|scientist|43212 484 | 484|27|M|student|21208 485 | 485|44|F|educator|95821 486 | 486|39|M|educator|93101 487 | 487|22|M|engineer|92121 488 | 488|48|M|technician|21012 489 | 489|55|M|other|45218 490 | 490|29|F|artist|V5A2B 491 | 491|43|F|writer|53711 492 | 492|57|M|educator|94618 493 | 493|22|M|engineer|60090 494 | 494|38|F|administrator|49428 495 | 495|29|M|engineer|03052 496 | 496|21|F|student|55414 497 | 497|20|M|student|50112 498 | 498|26|M|writer|55408 499 | 499|42|M|programmer|75006 500 | 500|28|M|administrator|94305 501 | 501|22|M|student|10025 502 | 502|22|M|student|23092 503 | 503|50|F|writer|27514 504 | 504|40|F|writer|92115 505 | 505|27|F|other|20657 506 | 506|46|M|programmer|03869 507 | 507|18|F|writer|28450 508 | 508|27|M|marketing|19382 509 | 509|23|M|administrator|10011 510 | 510|34|M|other|98038 511 | 511|22|M|student|21250 512 | 512|29|M|other|20090 513 | 513|43|M|administrator|26241 514 | 514|27|M|programmer|20707 515 | 515|53|M|marketing|49508 516 | 516|53|F|librarian|10021 517 | 517|24|M|student|55454 518 | 518|49|F|writer|99709 519 | 519|22|M|other|55320 520 | 520|62|M|healthcare|12603 521 | 521|19|M|student|02146 522 | 522|36|M|engineer|55443 523 | 523|50|F|administrator|04102 524 | 524|56|M|educator|02159 525 | 525|27|F|administrator|19711 526 | 526|30|M|marketing|97124 527 | 527|33|M|librarian|12180 528 | 528|18|M|student|55104 529 | 529|47|F|administrator|44224 530 | 530|29|M|engineer|94040 531 | 531|30|F|salesman|97408 532 | 532|20|M|student|92705 533 | 533|43|M|librarian|02324 534 | 534|20|M|student|05464 535 | 535|45|F|educator|80302 536 | 536|38|M|engineer|30078 537 | 537|36|M|engineer|22902 538 | 538|31|M|scientist|21010 539 | 539|53|F|administrator|80303 540 | 540|28|M|engineer|91201 541 | 541|19|F|student|84302 542 | 542|21|M|student|60515 543 | 543|33|M|scientist|95123 544 | 
544|44|F|other|29464 545 | 545|27|M|technician|08052 546 | 546|36|M|executive|22911 547 | 547|50|M|educator|14534 548 | 548|51|M|writer|95468 549 | 549|42|M|scientist|45680 550 | 550|16|F|student|95453 551 | 551|25|M|programmer|55414 552 | 552|45|M|other|68147 553 | 553|58|M|educator|62901 554 | 554|32|M|scientist|62901 555 | 555|29|F|educator|23227 556 | 556|35|F|educator|30606 557 | 557|30|F|writer|11217 558 | 558|56|F|writer|63132 559 | 559|69|M|executive|10022 560 | 560|32|M|student|10003 561 | 561|23|M|engineer|60005 562 | 562|54|F|administrator|20879 563 | 563|39|F|librarian|32707 564 | 564|65|M|retired|94591 565 | 565|40|M|student|55422 566 | 566|20|M|student|14627 567 | 567|24|M|entertainment|10003 568 | 568|39|M|educator|01915 569 | 569|34|M|educator|91903 570 | 570|26|M|educator|14627 571 | 571|34|M|artist|01945 572 | 572|51|M|educator|20003 573 | 573|68|M|retired|48911 574 | 574|56|M|educator|53188 575 | 575|33|M|marketing|46032 576 | 576|48|M|executive|98281 577 | 577|36|F|student|77845 578 | 578|31|M|administrator|M7A1A 579 | 579|32|M|educator|48103 580 | 580|16|M|student|17961 581 | 581|37|M|other|94131 582 | 582|17|M|student|93003 583 | 583|44|M|engineer|29631 584 | 584|25|M|student|27511 585 | 585|69|M|librarian|98501 586 | 586|20|M|student|79508 587 | 587|26|M|other|14216 588 | 588|18|F|student|93063 589 | 589|21|M|lawyer|90034 590 | 590|50|M|educator|82435 591 | 591|57|F|librarian|92093 592 | 592|18|M|student|97520 593 | 593|31|F|educator|68767 594 | 594|46|M|educator|M4J2K 595 | 595|25|M|programmer|31909 596 | 596|20|M|artist|77073 597 | 597|23|M|other|84116 598 | 598|40|F|marketing|43085 599 | 599|22|F|student|R3T5K 600 | 600|34|M|programmer|02320 601 | 601|19|F|artist|99687 602 | 602|47|F|other|34656 603 | 603|21|M|programmer|47905 604 | 604|39|M|educator|11787 605 | 605|33|M|engineer|33716 606 | 606|28|M|programmer|63044 607 | 607|49|F|healthcare|02154 608 | 608|22|M|other|10003 609 | 609|13|F|student|55106 610 | 610|22|M|student|21227 611 | 
611|46|M|librarian|77008 612 | 612|36|M|educator|79070 613 | 613|37|F|marketing|29678 614 | 614|54|M|educator|80227 615 | 615|38|M|educator|27705 616 | 616|55|M|scientist|50613 617 | 617|27|F|writer|11201 618 | 618|15|F|student|44212 619 | 619|17|M|student|44134 620 | 620|18|F|writer|81648 621 | 621|17|M|student|60402 622 | 622|25|M|programmer|14850 623 | 623|50|F|educator|60187 624 | 624|19|M|student|30067 625 | 625|27|M|programmer|20723 626 | 626|23|M|scientist|19807 627 | 627|24|M|engineer|08034 628 | 628|13|M|none|94306 629 | 629|46|F|other|44224 630 | 630|26|F|healthcare|55408 631 | 631|18|F|student|38866 632 | 632|18|M|student|55454 633 | 633|35|M|programmer|55414 634 | 634|39|M|engineer|T8H1N 635 | 635|22|M|other|23237 636 | 636|47|M|educator|48043 637 | 637|30|M|other|74101 638 | 638|45|M|engineer|01940 639 | 639|42|F|librarian|12065 640 | 640|20|M|student|61801 641 | 641|24|M|student|60626 642 | 642|18|F|student|95521 643 | 643|39|M|scientist|55122 644 | 644|51|M|retired|63645 645 | 645|27|M|programmer|53211 646 | 646|17|F|student|51250 647 | 647|40|M|educator|45810 648 | 648|43|M|engineer|91351 649 | 649|20|M|student|39762 650 | 650|42|M|engineer|83814 651 | 651|65|M|retired|02903 652 | 652|35|M|other|22911 653 | 653|31|M|executive|55105 654 | 654|27|F|student|78739 655 | 655|50|F|healthcare|60657 656 | 656|48|M|educator|10314 657 | 657|26|F|none|78704 658 | 658|33|M|programmer|92626 659 | 659|31|M|educator|54248 660 | 660|26|M|student|77380 661 | 661|28|M|programmer|98121 662 | 662|55|M|librarian|19102 663 | 663|26|M|other|19341 664 | 664|30|M|engineer|94115 665 | 665|25|M|administrator|55412 666 | 666|44|M|administrator|61820 667 | 667|35|M|librarian|01970 668 | 668|29|F|writer|10016 669 | 669|37|M|other|20009 670 | 670|30|M|technician|21114 671 | 671|21|M|programmer|91919 672 | 672|54|F|administrator|90095 673 | 673|51|M|educator|22906 674 | 674|13|F|student|55337 675 | 675|34|M|other|28814 676 | 676|30|M|programmer|32712 677 | 677|20|M|other|99835 678 
| 678|50|M|educator|61462 679 | 679|20|F|student|54302 680 | 680|33|M|lawyer|90405 681 | 681|44|F|marketing|97208 682 | 682|23|M|programmer|55128 683 | 683|42|M|librarian|23509 684 | 684|28|M|student|55414 685 | 685|32|F|librarian|55409 686 | 686|32|M|educator|26506 687 | 687|31|F|healthcare|27713 688 | 688|37|F|administrator|60476 689 | 689|25|M|other|45439 690 | 690|35|M|salesman|63304 691 | 691|34|M|educator|60089 692 | 692|34|M|engineer|18053 693 | 693|43|F|healthcare|85210 694 | 694|60|M|programmer|06365 695 | 695|26|M|writer|38115 696 | 696|55|M|other|94920 697 | 697|25|M|other|77042 698 | 698|28|F|programmer|06906 699 | 699|44|M|other|96754 700 | 700|17|M|student|76309 701 | 701|51|F|librarian|56321 702 | 702|37|M|other|89104 703 | 703|26|M|educator|49512 704 | 704|51|F|librarian|91105 705 | 705|21|F|student|54494 706 | 706|23|M|student|55454 707 | 707|56|F|librarian|19146 708 | 708|26|F|homemaker|96349 709 | 709|21|M|other|N4T1A 710 | 710|19|M|student|92020 711 | 711|22|F|student|15203 712 | 712|22|F|student|54901 713 | 713|42|F|other|07204 714 | 714|26|M|engineer|55343 715 | 715|21|M|technician|91206 716 | 716|36|F|administrator|44265 717 | 717|24|M|technician|84105 718 | 718|42|M|technician|64118 719 | 719|37|F|other|V0R2H 720 | 720|49|F|administrator|16506 721 | 721|24|F|entertainment|11238 722 | 722|50|F|homemaker|17331 723 | 723|26|M|executive|94403 724 | 724|31|M|executive|40243 725 | 725|21|M|student|91711 726 | 726|25|F|administrator|80538 727 | 727|25|M|student|78741 728 | 728|58|M|executive|94306 729 | 729|19|M|student|56567 730 | 730|31|F|scientist|32114 731 | 731|41|F|educator|70403 732 | 732|28|F|other|98405 733 | 733|44|F|other|60630 734 | 734|25|F|other|63108 735 | 735|29|F|healthcare|85719 736 | 736|48|F|writer|94618 737 | 737|30|M|programmer|98072 738 | 738|35|M|technician|95403 739 | 739|35|M|technician|73162 740 | 740|25|F|educator|22206 741 | 741|25|M|writer|63108 742 | 742|35|M|student|29210 743 | 743|31|M|programmer|92660 744 | 
744|35|M|marketing|47024 745 | 745|42|M|writer|55113 746 | 746|25|M|engineer|19047 747 | 747|19|M|other|93612 748 | 748|28|M|administrator|94720 749 | 749|33|M|other|80919 750 | 750|28|M|administrator|32303 751 | 751|24|F|other|90034 752 | 752|60|M|retired|21201 753 | 753|56|M|salesman|91206 754 | 754|59|F|librarian|62901 755 | 755|44|F|educator|97007 756 | 756|30|F|none|90247 757 | 757|26|M|student|55104 758 | 758|27|M|student|53706 759 | 759|20|F|student|68503 760 | 760|35|F|other|14211 761 | 761|17|M|student|97302 762 | 762|32|M|administrator|95050 763 | 763|27|M|scientist|02113 764 | 764|27|F|educator|62903 765 | 765|31|M|student|33066 766 | 766|42|M|other|10960 767 | 767|70|M|engineer|00000 768 | 768|29|M|administrator|12866 769 | 769|39|M|executive|06927 770 | 770|28|M|student|14216 771 | 771|26|M|student|15232 772 | 772|50|M|writer|27105 773 | 773|20|M|student|55414 774 | 774|30|M|student|80027 775 | 775|46|M|executive|90036 776 | 776|30|M|librarian|51157 777 | 777|63|M|programmer|01810 778 | 778|34|M|student|01960 779 | 779|31|M|student|K7L5J 780 | 780|49|M|programmer|94560 781 | 781|20|M|student|48825 782 | 782|21|F|artist|33205 783 | 783|30|M|marketing|77081 784 | 784|47|M|administrator|91040 785 | 785|32|M|engineer|23322 786 | 786|36|F|engineer|01754 787 | 787|18|F|student|98620 788 | 788|51|M|administrator|05779 789 | 789|29|M|other|55420 790 | 790|27|M|technician|80913 791 | 791|31|M|educator|20064 792 | 792|40|M|programmer|12205 793 | 793|22|M|student|85281 794 | 794|32|M|educator|57197 795 | 795|30|M|programmer|08610 796 | 796|32|F|writer|33755 797 | 797|44|F|other|62522 798 | 798|40|F|writer|64131 799 | 799|49|F|administrator|19716 800 | 800|25|M|programmer|55337 801 | 801|22|M|writer|92154 802 | 802|35|M|administrator|34105 803 | 803|70|M|administrator|78212 804 | 804|39|M|educator|61820 805 | 805|27|F|other|20009 806 | 806|27|M|marketing|11217 807 | 807|41|F|healthcare|93555 808 | 808|45|M|salesman|90016 809 | 809|50|F|marketing|30803 810 | 
810|55|F|other|80526 811 | 811|40|F|educator|73013 812 | 812|22|M|technician|76234 813 | 813|14|F|student|02136 814 | 814|30|M|other|12345 815 | 815|32|M|other|28806 816 | 816|34|M|other|20755 817 | 817|19|M|student|60152 818 | 818|28|M|librarian|27514 819 | 819|59|M|administrator|40205 820 | 820|22|M|student|37725 821 | 821|37|M|engineer|77845 822 | 822|29|F|librarian|53144 823 | 823|27|M|artist|50322 824 | 824|31|M|other|15017 825 | 825|44|M|engineer|05452 826 | 826|28|M|artist|77048 827 | 827|23|F|engineer|80228 828 | 828|28|M|librarian|85282 829 | 829|48|M|writer|80209 830 | 830|46|M|programmer|53066 831 | 831|21|M|other|33765 832 | 832|24|M|technician|77042 833 | 833|34|M|writer|90019 834 | 834|26|M|other|64153 835 | 835|44|F|executive|11577 836 | 836|44|M|artist|10018 837 | 837|36|F|artist|55409 838 | 838|23|M|student|01375 839 | 839|38|F|entertainment|90814 840 | 840|39|M|artist|55406 841 | 841|45|M|doctor|47401 842 | 842|40|M|writer|93055 843 | 843|35|M|librarian|44212 844 | 844|22|M|engineer|95662 845 | 845|64|M|doctor|97405 846 | 846|27|M|lawyer|47130 847 | 847|29|M|student|55417 848 | 848|46|M|engineer|02146 849 | 849|15|F|student|25652 850 | 850|34|M|technician|78390 851 | 851|18|M|other|29646 852 | 852|46|M|administrator|94086 853 | 853|49|M|writer|40515 854 | 854|29|F|student|55408 855 | 855|53|M|librarian|04988 856 | 856|43|F|marketing|97215 857 | 857|35|F|administrator|V1G4L 858 | 858|63|M|educator|09645 859 | 859|18|F|other|06492 860 | 860|70|F|retired|48322 861 | 861|38|F|student|14085 862 | 862|25|M|executive|13820 863 | 863|17|M|student|60089 864 | 864|27|M|programmer|63021 865 | 865|25|M|artist|11231 866 | 866|45|M|other|60302 867 | 867|24|M|scientist|92507 868 | 868|21|M|programmer|55303 869 | 869|30|M|student|10025 870 | 870|22|M|student|65203 871 | 871|31|M|executive|44648 872 | 872|19|F|student|74078 873 | 873|48|F|administrator|33763 874 | 874|36|M|scientist|37076 875 | 875|24|F|student|35802 876 | 876|41|M|other|20902 877 | 
877|30|M|other|77504 878 | 878|50|F|educator|98027 879 | 879|33|F|administrator|55337 880 | 880|13|M|student|83702 881 | 881|39|M|marketing|43017 882 | 882|35|M|engineer|40503 883 | 883|49|M|librarian|50266 884 | 884|44|M|engineer|55337 885 | 885|30|F|other|95316 886 | 886|20|M|student|61820 887 | 887|14|F|student|27249 888 | 888|41|M|scientist|17036 889 | 889|24|M|technician|78704 890 | 890|32|M|student|97301 891 | 891|51|F|administrator|03062 892 | 892|36|M|other|45243 893 | 893|25|M|student|95823 894 | 894|47|M|educator|74075 895 | 895|31|F|librarian|32301 896 | 896|28|M|writer|91505 897 | 897|30|M|other|33484 898 | 898|23|M|homemaker|61755 899 | 899|32|M|other|55116 900 | 900|60|M|retired|18505 901 | 901|38|M|executive|L1V3W 902 | 902|45|F|artist|97203 903 | 903|28|M|educator|20850 904 | 904|17|F|student|61073 905 | 905|27|M|other|30350 906 | 906|45|M|librarian|70124 907 | 907|25|F|other|80526 908 | 908|44|F|librarian|68504 909 | 909|50|F|educator|53171 910 | 910|28|M|healthcare|29301 911 | 911|37|F|writer|53210 912 | 912|51|M|other|06512 913 | 913|27|M|student|76201 914 | 914|44|F|other|08105 915 | 915|50|M|entertainment|60614 916 | 916|27|M|engineer|N2L5N 917 | 917|22|F|student|20006 918 | 918|40|M|scientist|70116 919 | 919|25|M|other|14216 920 | 920|30|F|artist|90008 921 | 921|20|F|student|98801 922 | 922|29|F|administrator|21114 923 | 923|21|M|student|E2E3R 924 | 924|29|M|other|11753 925 | 925|18|F|salesman|49036 926 | 926|49|M|entertainment|01701 927 | 927|23|M|programmer|55428 928 | 928|21|M|student|55408 929 | 929|44|M|scientist|53711 930 | 930|28|F|scientist|07310 931 | 931|60|M|educator|33556 932 | 932|58|M|educator|06437 933 | 933|28|M|student|48105 934 | 934|61|M|engineer|22902 935 | 935|42|M|doctor|66221 936 | 936|24|M|other|32789 937 | 937|48|M|educator|98072 938 | 938|38|F|technician|55038 939 | 939|26|F|student|33319 940 | 940|32|M|administrator|02215 941 | 941|20|M|student|97229 942 | 942|48|F|librarian|78209 943 | 943|22|M|student|77841 944 | 
-------------------------------------------------------------------------------- /julia-df.jl: -------------------------------------------------------------------------------- 1 | using DataFrames 2 | u_col_names=[symbol("user_id"), symbol("age"), symbol("sex"), symbol("occupation"), symbol("zip_code")] 3 | ### 4 | #a twisted way to do the same without adding each entry as a symbol 5 | col_names=["user_id", "age", "sex", "occupation", "zip_code"] 6 | u_col_names=[symbol(col_names[i]) for i in 1:size(col_names)[1]] 7 | ### 8 | #As pointed out by Valentin and in the HN comments section the easiest way to do this is 9 | u_col_names=[:user_id, :age, :sex, :occupation, :zip_code] 10 | users = DataFrames.readtable("data/ml-100k/u.user", separator='|', header=false, names=u_col_names) 11 | 12 | eltypes(users) 13 | describe(users) 14 | 15 | head(users, 3) 16 | 17 | users[50:55,:] 18 | 19 | head(users[:occupation]) 20 | 21 | head(users[:,[:occupation, :sex, :age]]) 22 | 23 | users[users[:occupation] .== "writer", :] 24 | 25 | 26 | r_col_names = [:user_id, :movie_id, :rating, :unix_timestamp] 27 | ratings = DataFrames.readtable("data/ml-100k/u.data", separator='\t', header=false, names=r_col_names) 28 | # load the movies file, then keep only its first five columns 29 | m_col_names = [:movie_id, :title, :release_date, :video_release_date, :imdb_url] 30 | movies = DataFrames.readtable("data/ml-100k/u.item", separator='|', header=false) 31 | movies = movies[:, 1:5] 32 | names!(movies, m_col_names) 33 | 
-------------------------------------------------------------------------------- /python-df.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "name": "", 4 | "signature": "sha256:d578f21d531bb0315f767d6ebcaf0123b318b39b3928ea74f3fdc00388fc84a3" 5 | }, 6 | "nbformat": 3, 7 | "nbformat_minor": 0, 8 | "worksheets": [ 9 | { 10 | "cells": [ 11 | { 12 | "cell_type": "code", 13 | "collapsed": false, 14 | "input": [ 15 | "import pandas as pd\n", 16 | "u_col_names = ['user_id', 'age', 'sex', 'occuptation', 'zip_code']\n", 17 | "users = pd.read_csv('data/ml-100k/u.user', sep = '|', names=u_col_names)\n", 18 | "\n", 19 | "r_col_names = ['user_id', 'movie_id', 'rating', 'unix_timestamp']\n", 20 | "ratings = pd.read_csv('data/ml-100k/u.data', sep='\\t', names=r_col_names)\n", 21 | "\n", 22 | "# let's only load the first five columns of the file with usecols\n", 23 | "m_col_names = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']\n", 24 | "movies = pd.read_csv('data/ml-100k/u.item', sep='|', names=m_col_names, usecols=range(5))" 25 | ], 26 | "language": "python", 27 | "metadata": {}, 28 | "outputs": [], 29 | "prompt_number": 2 30 | }, 31 | { 32 | "cell_type": "code", 33 | "collapsed": false, 34 | "input": [ 35 | "type(users)" 36 | ], 37 | "language": "python", 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "metadata": {}, 42 | "output_type": "pyout", 43 | "prompt_number": 20, 44 | "text": [ 45 | "pandas.core.frame.DataFrame" 46 | ] 47 | } 48 | ], 49 | "prompt_number": 20 50 | }, 51 | { 52 | "cell_type": "code", 53 | "collapsed": false, 54 | "input": [ 55 | "users.dtypes" 56 | ], 57 | "language": "python", 58 | "metadata": {}, 59 | "outputs": [ 60 | { 61 | "metadata": {}, 62 | "output_type": "pyout", 63 | "prompt_number": 19, 64 | "text": [ 65 | "user_id int64\n", 66 | "age int64\n", 67 | "sex
object\n", 68 | "occuptation object\n", 69 | "zip_code object\n", 70 | "dtype: object" 71 | ] 72 | } 73 | ], 74 | "prompt_number": 19 75 | }, 76 | { 77 | "cell_type": "code", 78 | "collapsed": false, 79 | "input": [ 80 | "users.describe()" 81 | ], 82 | "language": "python", 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "html": [ 87 | "
" 140 | ], 141 | "metadata": {}, 142 | "output_type": "pyout", 143 | "prompt_number": 21, 144 | "text": [ 145 | " user_id age\n", 146 | "count 943.000000 943.000000\n", 147 | "mean 472.000000 34.051962\n", 148 | "std 272.364951 12.192740\n", 149 | "min 1.000000 7.000000\n", 150 | "25% 236.500000 25.000000\n", 151 | "50% 472.000000 31.000000\n", 152 | "75% 707.500000 43.000000\n", 153 | "max 943.000000 73.000000" 154 | ] 155 | } 156 | ], 157 | "prompt_number": 21 158 | }, 159 | { 160 | "cell_type": "code", 161 | "collapsed": false, 162 | "input": [ 163 | "movies.head()" 164 | ], 165 | "language": "python", 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "html": [ 170 | "
" 226 | ], 227 | "metadata": {}, 228 | "output_type": "pyout", 229 | "prompt_number": 6, 230 | "text": [ 231 | " movie_id title release_date video_release_date \\\n", 232 | "0 1 Toy Story (1995) 01-Jan-1995 NaN \n", 233 | "1 2 GoldenEye (1995) 01-Jan-1995 NaN \n", 234 | "2 3 Four Rooms (1995) 01-Jan-1995 NaN \n", 235 | "3 4 Get Shorty (1995) 01-Jan-1995 NaN \n", 236 | "4 5 Copycat (1995) 01-Jan-1995 NaN \n", 237 | "\n", 238 | " imdb_url \n", 239 | "0 http://us.imdb.com/M/title-exact?Toy%20Story%2... \n", 240 | "1 http://us.imdb.com/M/title-exact?GoldenEye%20(... \n", 241 | "2 http://us.imdb.com/M/title-exact?Four%20Rooms%... \n", 242 | "3 http://us.imdb.com/M/title-exact?Get%20Shorty%... \n", 243 | "4 http://us.imdb.com/M/title-exact?Copycat%20(1995) " 244 | ] 245 | } 246 | ], 247 | "prompt_number": 6 248 | }, 249 | { 250 | "cell_type": "code", 251 | "collapsed": false, 252 | "input": [ 253 | "movies.tail()" 254 | ], 255 | "language": "python", 256 | "metadata": {}, 257 | "outputs": [ 258 | { 259 | "html": [ 260 | "
" 316 | ], 317 | "metadata": {}, 318 | "output_type": "pyout", 319 | "prompt_number": 7, 320 | "text": [ 321 | " movie_id title release_date \\\n", 322 | "1677 1678 Mat' i syn (1997) 06-Feb-1998 \n", 323 | "1678 1679 B. Monkey (1998) 06-Feb-1998 \n", 324 | "1679 1680 Sliding Doors (1998) 01-Jan-1998 \n", 325 | "1680 1681 You So Crazy (1994) 01-Jan-1994 \n", 326 | "1681 1682 Scream of Stone (Schrei aus Stein) (1991) 08-Mar-1996 \n", 327 | "\n", 328 | " video_release_date imdb_url \n", 329 | "1677 NaN http://us.imdb.com/M/title-exact?Mat%27+i+syn+... \n", 330 | "1678 NaN http://us.imdb.com/M/title-exact?B%2E+Monkey+(... \n", 331 | "1679 NaN http://us.imdb.com/Title?Sliding+Doors+(1998) \n", 332 | "1680 NaN http://us.imdb.com/M/title-exact?You%20So%20Cr... \n", 333 | "1681 NaN http://us.imdb.com/M/title-exact?Schrei%20aus%... " 334 | ] 335 | } 336 | ], 337 | "prompt_number": 7 338 | }, 339 | { 340 | "cell_type": "code", 341 | "collapsed": false, 342 | "input": [ 343 | "movies.head(3)" 344 | ], 345 | "language": "python", 346 | "metadata": {}, 347 | "outputs": [ 348 | { 349 | "html": [ 350 | "
" 390 | ], 391 | "metadata": {}, 392 | "output_type": "pyout", 393 | "prompt_number": 17, 394 | "text": [ 395 | " movie_id title release_date video_release_date \\\n", 396 | "0 1 Toy Story (1995) 01-Jan-1995 NaN \n", 397 | "1 2 GoldenEye (1995) 01-Jan-1995 NaN \n", 398 | "2 3 Four Rooms (1995) 01-Jan-1995 NaN \n", 399 | "\n", 400 | " imdb_url \n", 401 | "0 http://us.imdb.com/M/title-exact?Toy%20Story%2... \n", 402 | "1 http://us.imdb.com/M/title-exact?GoldenEye%20(... \n", 403 | "2 http://us.imdb.com/M/title-exact?Four%20Rooms%... " 404 | ] 405 | } 406 | ], 407 | "prompt_number": 17 408 | }, 409 | { 410 | "cell_type": "code", 411 | "collapsed": false, 412 | "input": [ 413 | "movies[50:55]" 414 | ], 415 | "language": "python", 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "html": [ 420 | "
" 476 | ], 477 | "metadata": {}, 478 | "output_type": "pyout", 479 | "prompt_number": 8, 480 | "text": [ 481 | " movie_id title release_date \\\n", 482 | "50 51 Legends of the Fall (1994) 01-Jan-1994 \n", 483 | "51 52 Madness of King George, The (1994) 01-Jan-1994 \n", 484 | "52 53 Natural Born Killers (1994) 01-Jan-1994 \n", 485 | "53 54 Outbreak (1995) 01-Jan-1995 \n", 486 | "54 55 Professional, The (1994) 01-Jan-1994 \n", 487 | "\n", 488 | " video_release_date imdb_url \n", 489 | "50 NaN http://us.imdb.com/M/title-exact?Legends%20of%... \n", 490 | "51 NaN http://us.imdb.com/M/title-exact?Madness%20of%... \n", 491 | "52 NaN http://us.imdb.com/M/title-exact?Natural%20Bor... \n", 492 | "53 NaN http://us.imdb.com/M/title-exact?Outbreak%20(1... \n", 493 | "54 NaN http://us.imdb.com/Title?L%E9on+(1994) " 494 | ] 495 | } 496 | ], 497 | "prompt_number": 8 498 | }, 499 | { 500 | "cell_type": "code", 501 | "collapsed": false, 502 | "input": [ 503 | "movies['title'].head()" 504 | ], 505 | "language": "python", 506 | "metadata": {}, 507 | "outputs": [ 508 | { 509 | "metadata": {}, 510 | "output_type": "pyout", 511 | "prompt_number": 18, 512 | "text": [ 513 | "0 Toy Story (1995)\n", 514 | "1 GoldenEye (1995)\n", 515 | "2 Four Rooms (1995)\n", 516 | "3 Get Shorty (1995)\n", 517 | "4 Copycat (1995)\n", 518 | "Name: title, dtype: object" 519 | ] 520 | } 521 | ], 522 | "prompt_number": 18 523 | }, 524 | { 525 | "cell_type": "code", 526 | "collapsed": false, 527 | "input": [ 528 | "movies[['title', 'release_date', 'imdb_url']].head()" 529 | ], 530 | "language": "python", 531 | "metadata": {}, 532 | "outputs": [ 533 | { 534 | "html": [ 535 | "
\n", 536 | "\n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | "
title release_date imdb_url
0 Toy Story (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Toy%20Story%2...
1 GoldenEye (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?GoldenEye%20(...
2 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%...
3 Get Shorty (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Get%20Shorty%...
4 Copycat (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Copycat%20(1995)
\n", 578 | "
" 579 | ], 580 | "metadata": {}, 581 | "output_type": "pyout", 582 | "prompt_number": 9, 583 | "text": [ 584 | " title release_date \\\n", 585 | "0 Toy Story (1995) 01-Jan-1995 \n", 586 | "1 GoldenEye (1995) 01-Jan-1995 \n", 587 | "2 Four Rooms (1995) 01-Jan-1995 \n", 588 | "3 Get Shorty (1995) 01-Jan-1995 \n", 589 | "4 Copycat (1995) 01-Jan-1995 \n", 590 | "\n", 591 | " imdb_url \n", 592 | "0 http://us.imdb.com/M/title-exact?Toy%20Story%2... \n", 593 | "1 http://us.imdb.com/M/title-exact?GoldenEye%20(... \n", 594 | "2 http://us.imdb.com/M/title-exact?Four%20Rooms%... \n", 595 | "3 http://us.imdb.com/M/title-exact?Get%20Shorty%... \n", 596 | "4 http://us.imdb.com/M/title-exact?Copycat%20(1995) " 597 | ] 598 | } 599 | ], 600 | "prompt_number": 9 601 | }, 602 | { 603 | "cell_type": "code", 604 | "collapsed": false, 605 | "input": [ 606 | "movies[movies.title == 'Toy Story (1995)']" 607 | ], 608 | "language": "python", 609 | "metadata": {}, 610 | "outputs": [ 611 | { 612 | "html": [ 613 | "
\n", 614 | "\n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | "
movie_id title release_date video_release_date imdb_url
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2...
\n", 636 | "
" 637 | ], 638 | "metadata": {}, 639 | "output_type": "pyout", 640 | "prompt_number": 10, 641 | "text": [ 642 | " movie_id title release_date video_release_date \\\n", 643 | "0 1 Toy Story (1995) 01-Jan-1995 NaN \n", 644 | "\n", 645 | " imdb_url \n", 646 | "0 http://us.imdb.com/M/title-exact?Toy%20Story%2... " 647 | ] 648 | } 649 | ], 650 | "prompt_number": 10 651 | }, 652 | { 653 | "cell_type": "code", 654 | "collapsed": false, 655 | "input": [ 656 | "movies.query('title==\"Toy Story (1995)\"')" 657 | ], 658 | "language": "python", 659 | "metadata": {}, 660 | "outputs": [ 661 | { 662 | "html": [ 663 | "
\n", 664 | "\n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | "
movie_id title release_date video_release_date imdb_url
0 1 Toy Story (1995) 01-Jan-1995 NaN http://us.imdb.com/M/title-exact?Toy%20Story%2...
\n", 686 | "
" 687 | ], 688 | "metadata": {}, 689 | "output_type": "pyout", 690 | "prompt_number": 11, 691 | "text": [ 692 | " movie_id title release_date video_release_date \\\n", 693 | "0 1 Toy Story (1995) 01-Jan-1995 NaN \n", 694 | "\n", 695 | " imdb_url \n", 696 | "0 http://us.imdb.com/M/title-exact?Toy%20Story%2... " 697 | ] 698 | } 699 | ], 700 | "prompt_number": 11 701 | } 702 | ], 703 | "metadata": {} 704 | } 705 | ] 706 | }
--------------------------------------------------------------------------------
/r-df.R:
--------------------------------------------------------------------------------
1 | u_col_names <- c('user_id', 'age', 'sex', 'occupation', 'zip_code')
2 | # the MovieLens files have no header row, so pass header=FALSE; otherwise read.csv consumes the first record as a header
3 | users <- read.csv('data/ml-100k/u.user', sep='|', header=FALSE, col.names=u_col_names)
4 | r_col_names <- c('user_id', 'movie_id', 'rating', 'unix_timestamp')
5 | ratings <- read.csv('data/ml-100k/u.data', sep='\t', header=FALSE, col.names=r_col_names)
6 | # load only the first five columns of u.item: a colClasses entry of "NULL" skips that column (R's counterpart to the pandas "usecols" param)
7 | # see http://stackoverflow.com/questions/5788117/only-read-limited-number-of-columns-in-r
8 | # quotes inside strings cause import errors, so quoting is disabled with quote=""
9 | m_col_names <- c('movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url')
10 | movies <- read.table('data/ml-100k/u.item', sep='|', colClasses=c("integer", "character", "factor", "factor", "character", rep("NULL", 19)), quote="")
11 | # col.names cannot be specified in read.table here because 19 columns are skipped; R complains with "more columns than column names"
12 | colnames(movies) <- m_col_names
13 |
14 | class(users)
15 | lapply(users, class)
16 | # a better way to get the class of all columns is str
17 | str(users)
18 | summary(users)
19 |
20 | head(movies)
21 | tail(movies)
22 |
23 | head(movies, n=3)
24 |
25 | # R indexing is 1-based and inclusive, so this returns six rows (50 through 55)
26 | movies[50:55,]
27 | head(movies$title)
28 | head(movies[,c('title', 'release_date', 'imdb_url')])
29 | movies[movies$title == 'Toy Story (1995)',]
30 |
31 |
--------------------------------------------------------------------------------