├── .gitignore ├── LICENSE ├── README.md ├── conf ├── lang │ ├── contractions_ca.txt │ ├── contractions_fr.txt │ ├── contractions_ga.txt │ ├── contractions_it.txt │ ├── hyphenations_ga.txt │ ├── stemdict_nl.txt │ ├── stoptags_ja.txt │ ├── stopwords_ar.txt │ ├── stopwords_bg.txt │ ├── stopwords_ca.txt │ ├── stopwords_cz.txt │ ├── stopwords_da.txt │ ├── stopwords_de.txt │ ├── stopwords_el.txt │ ├── stopwords_en.txt │ ├── stopwords_es.txt │ ├── stopwords_eu.txt │ ├── stopwords_fa.txt │ ├── stopwords_fi.txt │ ├── stopwords_fr.txt │ ├── stopwords_ga.txt │ ├── stopwords_gl.txt │ ├── stopwords_hi.txt │ ├── stopwords_hu.txt │ ├── stopwords_hy.txt │ ├── stopwords_id.txt │ ├── stopwords_it.txt │ ├── stopwords_ja.txt │ ├── stopwords_lv.txt │ ├── stopwords_nl.txt │ ├── stopwords_no.txt │ ├── stopwords_pt.txt │ ├── stopwords_ro.txt │ ├── stopwords_ru.txt │ ├── stopwords_sv.txt │ ├── stopwords_th.txt │ ├── stopwords_tr.txt │ └── userdict_ja.txt ├── managed-schema ├── params.json ├── protwords.txt ├── solrconfig.xml ├── stopwords.txt └── synonyms.txt ├── data ├── movieDetails.json └── movieDetails_mongo.json ├── mongo_to_solr.py ├── query_mongo.py ├── query_solr.py └── solr_index_movies.py /.gitignore: -------------------------------------------------------------------------------- 1 | venv/ 2 | .idea/ 3 | **/.DS_Store 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Gary A. 
Stafford 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Searching with Apache Solr 2 | 3 | Materials for the post, [Apache Solr: Because your Database is not a Search Engine](https://wp.me/p1RD28-6e9). In this post, we will examine what sets Apache Solr aside from databases, like MongoDB, as a search engine. We will explore the similarities and differences between Solr and MongoDB by analyzing a series of comparative queries. We then delve into some of Solr’s more advanced search capabilities. 
4 | 5 | The movie data used in the demo is publicly available from MongoDB: [Setup and Import the Data](https://docs.mongodb.com/charts/master/tutorial/movie-details/prereqs-and-import-data/#download-the-data) 6 | 7 | ## Set-up Instructions 8 | 9 | More detailed set-up instructions are in the post, [Apache Solr: Because your Database is not a Search Engine](https://wp.me/p1RD28-6e9). 10 | 11 | - Create MongoDB and Solr Docker containers (commands below) 12 | - Set two environment variables (commands below) 13 | - Import JSON data to MongoDB (command below) 14 | - Index JSON data to Solr (command below) 15 | - Run `query_mongo.py` and `query_solr.py` query scripts 16 | 17 | ## Useful Commands 18 | 19 | Create MongoDB and Solr Docker containers. The Solr container bind-mounts the `conf` directory from this project. 20 | 21 | ```bash 22 | docker run --name mongo -p 27017:27017 -d mongo:latest 23 | docker run --name solr -d -p 8983:8983 -v "$PWD/conf":/conf solr:latest solr-create -c movies -d /conf 24 | 25 | docker logs solr --follow 26 | ``` 27 | 28 | _Optional: Copy config from Solr container to local path_ 29 | 30 | ```bash 31 | docker run --name solr -p 8983:8983 -d solr:latest solr-create -c movies 32 | docker cp solr:/opt/solr/server/solr/movies/conf/ . 
33 | ``` 34 | 35 | _Optional: Create your own core in the Solr container_ 36 | 37 | ```bash 38 | docker exec -it --user=solr solr bin/solr create_core -c movies 39 | ``` 40 | 41 | Set the environment variables, substituting your own values 42 | 43 | ```bash 44 | # local docker example 45 | export SOLR_URL="http://localhost:8983/solr" 46 | export MONOGDB_CONN="mongodb://localhost:27017/movies" 47 | 48 | env | grep 'SOLR_URL\|MONOGDB_CONN' 49 | ``` 50 | 51 | Import `movieDetails_mongo.json` JSON data to MongoDB 52 | 53 | ```bash 54 | mongoimport \ 55 | --uri "$MONOGDB_CONN" \ 56 | --collection "movieDetails" \ 57 | --drop --file "data/movieDetails_mongo.json" 58 | ``` 59 | 60 | Index JSON data to Solr 61 | 62 | ```bash 63 | python3 ./solr_index_movies.py 64 | ``` 65 | 66 | FYI only: Modify the Solr `movies` schema 67 | 68 | ```bash 69 | curl -X POST \ 70 | "${SOLR_URL}/movies/schema" \ 71 | -H 'Content-Type: application/json' \ 72 | -d '{ 73 | "replace-field":{ 74 | "name":"title", 75 | "type":"text_en", 76 | "multiValued":false 77 | }, 78 | "replace-field":{ 79 | "name":"plot", 80 | "type":"text_en", 81 | "multiValued":false 82 | }, 83 | "replace-field":{ 84 | "name":"genres", 85 | "type":"text_en", 86 | "multiValued":true 87 | } 88 | }' 89 | ``` 90 | 91 | Run query scripts 92 | 93 | ```bash 94 | time python3 ./query_mongo.py 95 | time python3 ./query_solr.py 96 | ``` 97 | 98 | ## Output from Solr Searches 99 | 100 | ```text 101 | > time python3 ./query_solr.py 102 | 103 | Target Solr instance: http://localhost:8983/solr 104 | ---------- 105 | 106 | Parameters 107 | ---------- 108 | q: *:* 109 | kwargs: {'defType': 'lucene', 'fl': 'title score', 'sort': 'title asc', 'rows': '5'} 110 | 111 | Results 112 | ---------- 113 | document count: 2250 114 | qtime (ms): 2 115 | {'title': 'If....', 'score': 1.0} 116 | {'title': 'To Be or Not to Be', 'score': 1.0} 117 | {'title': 'MW: Dai 0 shô akuma no gêmu', 'score': 1.0} 118 | {'title': 'Km. 
0 - Kilometer Zero', 'score': 1.0} 119 | {'title': '0,60 mg', 'score': 1.0} 120 | ---------- 121 | 122 | Parameters 123 | ---------- 124 | q: *:* 125 | kwargs: {'defType': 'lucene', 'omitHeader': 'true', 'rows': '0'} 126 | 127 | Results 128 | ---------- 129 | document count: 2250 130 | qtime (ms): None 131 | ---------- 132 | 133 | Parameters 134 | ---------- 135 | q: title: "Star Wars: Episode V - The Empire Strikes Back" 136 | kwargs: {'defType': 'lucene', 'fl': 'title score'} 137 | 138 | Results 139 | ---------- 140 | document count: 1 141 | qtime (ms): 0 142 | {'title': 'Star Wars: Episode V - The Empire Strikes Back', 'score': 30.66} 143 | ---------- 144 | 145 | Parameters 146 | ---------- 147 | q: "star wars" 148 | kwargs: {'defType': 'lucene', 'df': 'title', 'fl': 'title score'} 149 | 150 | Results 151 | ---------- 152 | document count: 6 153 | qtime (ms): 0 154 | {'title': 'Star Wars: Episode IV - A New Hope', 'score': 8.27} 155 | {'title': 'Star Wars: Episode VI - Return of the Jedi', 'score': 8.27} 156 | {'title': 'Star Wars: Episode I - The Phantom Menace', 'score': 8.27} 157 | {'title': 'Star Wars: Episode III - Revenge of the Sith', 'score': 8.27} 158 | {'title': 'Star Wars: Episode II - Attack of the Clones', 'score': 8.27} 159 | {'title': 'Star Wars: Episode V - The Empire Strikes Back', 'score': 7.6} 160 | ---------- 161 | 162 | Parameters 163 | ---------- 164 | q: star wars 165 | kwargs: {'defType': 'lucene', 'fq': 'countries: USA', 'df': 'title', 'fl': 'title score', 'rows': '5'} 166 | 167 | Results 168 | ---------- 169 | document count: 18 170 | qtime (ms): 0 171 | {'title': 'Star Wars: Episode IV - A New Hope', 'score': 8.27} 172 | {'title': 'Star Wars: Episode VI - Return of the Jedi', 'score': 8.27} 173 | {'title': 'Star Wars: Episode I - The Phantom Menace', 'score': 8.27} 174 | {'title': 'Star Wars: Episode III - Revenge of the Sith', 'score': 8.27} 175 | {'title': 'Star Wars: Episode II - Attack of the Clones', 'score': 8.27} 176 | 
---------- 177 | 178 | Parameters 179 | ---------- 180 | q: adventure action western 181 | kwargs: {'defType': 'lucene', 'fq': 'countries: USA', 'df': 'genres', 'fl': 'title genres score', 'rows': '5'} 182 | 183 | Results 184 | ---------- 185 | document count: 244 186 | qtime (ms): 1 187 | {'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 7.18} 188 | {'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 6.26} 189 | {'title': 'The Big Trail', 'genres': ['Adventure', 'Western', 'Romance'], 'score': 5.46} 190 | {'title': 'Once Upon a Time in the West', 'genres': ['Western'], 'score': 5.26} 191 | {'title': 'How the West Was Won', 'genres': ['Western'], 'score': 5.26} 192 | ---------- 193 | 194 | Parameters 195 | ---------- 196 | q: adventure action +western 197 | kwargs: {'defType': 'lucene', 'fq': 'countries: USA', 'df': 'genres', 'fl': 'title genres score', 'rows': '5'} 198 | 199 | Results 200 | ---------- 201 | document count: 24 202 | qtime (ms): 1 203 | {'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 7.18} 204 | {'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 6.26} 205 | {'title': 'The Big Trail', 'genres': ['Adventure', 'Western', 'Romance'], 'score': 5.46} 206 | {'title': 'Once Upon a Time in the West', 'genres': ['Western'], 'score': 5.26} 207 | {'title': 'How the West Was Won', 'genres': ['Western'], 'score': 5.26} 208 | ---------- 209 | 210 | Parameters 211 | ---------- 212 | q: adventure action western 213 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title genres', 'fl': 'title genres score', 'rows': '5'} 214 | 215 | Results 216 | ---------- 217 | document count: 259 218 | qtime (ms): 1 219 | {'title': 'The Secret Life of Walter Mitty', 'genres': ['Adventure', 'Comedy', 'Drama'], 'score': 7.67} 220 | {'title': 'Western Union', 'genres': ['History', 'Western'], 'score': 7.39} 221 | {'title': 'The Adventures of Tintin', 'genres': 
['Animation', 'Action', 'Adventure'], 'score': 7.36} 222 | {'title': 'Adventures in Babysitting', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 7.36} 223 | {'title': 'The Poseidon Adventure', 'genres': ['Action', 'Adventure', 'Drama'], 'score': 7.36} 224 | ---------- 225 | 226 | Parameters 227 | ---------- 228 | q: adventure action western 229 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title^2.0 genres^4.0', 'fl': 'title genres score', 'rows': '5'} 230 | 231 | Results 232 | ---------- 233 | document count: 259 234 | qtime (ms): 3 235 | {'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 28.71} 236 | {'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 25.05} 237 | {'title': 'The Big Trail', 'genres': ['Adventure', 'Western', 'Romance'], 'score': 21.84} 238 | {'title': 'Once Upon a Time in the West', 'genres': ['Western'], 'score': 21.05} 239 | {'title': 'How the West Was Won', 'genres': ['Western'], 'score': 21.05} 240 | ---------- 241 | 242 | Parameters 243 | ---------- 244 | q: adventure action +western -romance 245 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title^2.0 genres^4.0', 'fl': 'title genres score', 'rows': '5'} 246 | 247 | Results 248 | ---------- 249 | document count: 25 250 | qtime (ms): 4 251 | {'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 28.71} 252 | {'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 25.05} 253 | {'title': 'Once Upon a Time in the West', 'genres': ['Western'], 'score': 21.05} 254 | {'title': 'How the West Was Won', 'genres': ['Western'], 'score': 21.05} 255 | {'title': 'Cowboy', 'genres': ['Western'], 'score': 21.05} 256 | ---------- 257 | 258 | Parameters 259 | ---------- 260 | q: adventure action +western -romance 261 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title genres', 'fl': 'title genres score', 'rows': '5'} 262 | 263 | Results 264 | 
---------- 265 | document count: 25 266 | qtime (ms): 1 267 | {'title': 'Western Union', 'genres': ['History', 'Western'], 'score': 7.39} 268 | {'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 7.18} 269 | {'title': 'Western Spaghetti', 'genres': ['Short'], 'score': 6.64} 270 | {'title': 'Crossfire Trail', 'genres': ['Action', 'Western'], 'score': 6.26} 271 | {'title': 'Butch Cassidy and the Sundance Kid', 'genres': ['Biography', 'Crime', 'Drama'], 'score': 6.23} 272 | ---------- 273 | 274 | Parameters 275 | ---------- 276 | q: A cowboys movie 277 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title genres', 'fl': 'title genres score', 'rows': '10'} 278 | 279 | Results 280 | ---------- 281 | document count: 23 282 | qtime (ms): 7 283 | {'title': 'Cowboy Bebop: The Movie', 'genres': ['Animation', 'Action', 'Crime'], 'score': 11.24} 284 | {'title': 'Cowboy', 'genres': ['Western'], 'score': 7.31} 285 | {'title': 'TV: The Movie', 'genres': ['Comedy'], 'score': 6.42} 286 | {'title': 'Space Cowboys', 'genres': ['Action', 'Adventure', 'Thriller'], 'score': 6.33} 287 | {'title': 'Midnight Cowboy', 'genres': ['Drama'], 'score': 6.33} 288 | {'title': 'Drugstore Cowboy', 'genres': ['Crime', 'Drama'], 'score': 6.33} 289 | {'title': 'Urban Cowboy', 'genres': ['Drama', 'Romance', 'Western'], 'score': 6.33} 290 | {'title': 'The Cowboy Way', 'genres': ['Action', 'Comedy', 'Drama'], 'score': 6.33} 291 | {'title': 'The Cowboy and the Lady', 'genres': ['Comedy', 'Drama', 'Romance'], 'score': 6.33} 292 | {'title': 'Toy Story', 'genres': ['Animation', 'Adventure', 'Comedy'], 'score': 5.65} 293 | ---------- 294 | 295 | Parameters 296 | ---------- 297 | q: The Lego Movie -movie 298 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title genres', 'fl': 'title genres score', 'rows': '10'} 299 | 300 | Results 301 | ---------- 302 | document count: 1 303 | qtime (ms): 1 304 | {'title': 'Lego DC Comics Super Heroes: 
Justice League vs. Bizarro League', 'genres': ['Animation', 'Action', 'Adventure'], 'score': 4.05} 305 | ---------- 306 | 307 | Parameters 308 | ---------- 309 | q: A cowboys movie 310 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title genres', 'bq': 'title:movie^-2.0', 'fl': 'title genres score', 'rows': '10'} 311 | 312 | Results 313 | ---------- 314 | document count: 23 315 | qtime (ms): 1 316 | {'title': 'Cowboy', 'genres': ['Western'], 'score': 7.31} 317 | {'title': 'Space Cowboys', 'genres': ['Action', 'Adventure', 'Thriller'], 'score': 6.33} 318 | {'title': 'Midnight Cowboy', 'genres': ['Drama'], 'score': 6.33} 319 | {'title': 'Drugstore Cowboy', 'genres': ['Crime', 'Drama'], 'score': 6.33} 320 | {'title': 'Urban Cowboy', 'genres': ['Drama', 'Romance', 'Western'], 'score': 6.33} 321 | {'title': 'The Cowboy Way', 'genres': ['Action', 'Comedy', 'Drama'], 'score': 6.33} 322 | {'title': 'The Cowboy and the Lady', 'genres': ['Comedy', 'Drama', 'Romance'], 'score': 6.33} 323 | {'title': 'Toy Story', 'genres': ['Animation', 'Adventure', 'Comedy'], 'score': 5.65} 324 | {'title': "Ride 'Em Cowboy", 'genres': ['Comedy', 'Western', 'Musical'], 'score': 5.58} 325 | {'title': "G.M. 
Whiting's Enemy", 'genres': ['Mystery'], 'score': 5.32} 326 | ---------- 327 | 328 | Parameters 329 | ---------- 330 | q: adventure action +western -romance 331 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title genres', 'fl': 'title awards.wins score', 'rows': '5'} 332 | 333 | Results 334 | ---------- 335 | document count: 25 336 | qtime (ms): 7 337 | {'title': 'Western Union', 'awards.wins': [0.0], 'score': 7.39} 338 | {'title': 'The Wild Bunch', 'awards.wins': [5.0], 'score': 7.18} 339 | {'title': 'Western Spaghetti', 'awards.wins': [2.0], 'score': 6.64} 340 | {'title': 'Crossfire Trail', 'awards.wins': [1.0], 'score': 6.26} 341 | {'title': 'Butch Cassidy and the Sundance Kid', 'awards.wins': [16.0], 'score': 6.23} 342 | ---------- 343 | 344 | Parameters 345 | ---------- 346 | q: adventure action +western -romance 347 | kwargs: {'defType': 'edismax', 'fq': 'countries: USA', 'qf': 'plot title genres', 'fl': 'title awards.wins score', 'boost': 'div(field(awards.wins,min),2)', 'rows': '5'} 348 | 349 | Results 350 | ---------- 351 | document count: 25 352 | qtime (ms): 2 353 | {'title': 'Butch Cassidy and the Sundance Kid', 'awards.wins': [16.0], 'score': 49.86} 354 | {'title': 'Wild Wild West', 'awards.wins': [10.0], 'score': 26.03} 355 | {'title': 'How the West Was Won', 'awards.wins': [7.0], 'score': 18.42} 356 | {'title': 'The Wild Bunch', 'awards.wins': [5.0], 'score': 17.95} 357 | {'title': 'All Quiet on the Western Front', 'awards.wins': [5.0], 'score': 13.08} 358 | ---------- 359 | 360 | Parameters 361 | ---------- 362 | title: Star Wars: Episode I - The Phantom Menace 363 | 364 | Results 365 | ---------- 366 | id: 5b107bec1d2952d0da9046ed 367 | ---------- 368 | 369 | Parameters 370 | ---------- 371 | q: {!mlt qf="genres" mintf=1 mindf=1}5b107bec1d2952d0da9046ed 372 | kwargs: {'defType': 'lucene', 'fq': 'countries: USA', 'fl': 'title genres score', 'rows': '5'} 373 | 374 | Results 375 | ---------- 376 | document count: 252 377 | qtime 
(ms): 1 378 | {'title': 'Star Wars: Episode IV - A New Hope', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33} 379 | {'title': 'Star Wars: Episode V - The Empire Strikes Back', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33} 380 | {'title': 'Star Wars: Episode VI - Return of the Jedi', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33} 381 | {'title': 'Star Wars: Episode III - Revenge of the Sith', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33} 382 | {'title': 'Star Wars: Episode II - Attack of the Clones', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 6.33} 383 | ---------- 384 | 385 | Parameters 386 | ---------- 387 | q: id:"5b107bec1d2952d0da9046ed" 388 | kwargs: {'defType': 'lucene', 'fl': 'actors director writers'} 389 | 390 | Results 391 | ---------- 392 | document count: 1 393 | qtime (ms): 0 394 | {'director': ['George Lucas'], 'writers': ['George Lucas'], 'actors': ['Liam Neeson', 'Ewan McGregor', 'Natalie Portman', 'Jake Lloyd']} 395 | ---------- 396 | 397 | Parameters 398 | ---------- 399 | q: {!mlt qf="actors director writers" mintf=1 mindf=1}5b107bec1d2952d0da9046ed 400 | kwargs: {'defType': 'lucene', 'fq': 'countries: USA', 'fl': 'title actors director writers score', 'rows': '10'} 401 | 402 | Results 403 | ---------- 404 | document count: 55 405 | qtime (ms): 1 406 | {'title': 'Star Wars: Episode III - Revenge of the Sith', 'director': ['George Lucas'], 'writers': ['George Lucas'], 'actors': ['Ewan McGregor', 'Natalie Portman', 'Hayden Christensen', 'Ian McDiarmid'], 'score': 44.84} 407 | {'title': 'Star Wars: Episode II - Attack of the Clones', 'director': ['George Lucas'], 'writers': ['George Lucas', 'Jonathan Hales', 'George Lucas'], 'actors': ['Ewan McGregor', 'Natalie Portman', 'Hayden Christensen', 'Christopher Lee'], 'score': 44.58} 408 | {'title': 'Star Wars: Episode IV - A New Hope', 'director': ['George Lucas'], 'writers': ['George Lucas'], 'actors': ['Mark Hamill', 'Harrison Ford', 
'Carrie Fisher', 'Peter Cushing'], 'score': 23.51} 409 | {'title': 'Star Wars: Episode VI - Return of the Jedi', 'director': ['Richard Marquand'], 'writers': ['Lawrence Kasdan', 'George Lucas', 'George Lucas'], 'actors': ['Mark Hamill', 'Harrison Ford', 'Carrie Fisher', 'Billy Dee Williams'], 'score': 11.96} 410 | {'title': 'A Million Ways to Die in the West', 'director': ['Seth MacFarlane'], 'writers': ['Seth MacFarlane', 'Alec Sulkin', 'Wellesley Wild'], 'actors': ['Seth MacFarlane', 'Charlize Theron', 'Amanda Seyfried', 'Liam Neeson'], 'score': 11.72} 411 | {'title': 'Run All Night', 'director': ['Jaume Collet-Serra'], 'writers': ['Brad Ingelsby'], 'actors': ['Liam Neeson', 'Ed Harris', 'Joel Kinnaman', 'Boyd Holbrook'], 'score': 11.72} 412 | {'title': 'I Love You Phillip Morris', 'director': ['Glenn Ficarra, John Requa'], 'writers': ['John Requa', 'Glenn Ficarra', 'Steve McVicker'], 'actors': ['Jim Carrey', 'Ewan McGregor', 'Leslie Mann', 'Rodrigo Santoro'], 'score': 10.97} 413 | {'title': 'The Island', 'director': ['Michael Bay'], 'writers': ['Caspian Tredwell-Owen', 'Alex Kurtzman', 'Roberto Orci', 'Caspian Tredwell-Owen'], 'actors': ['Ewan McGregor', 'Scarlett Johansson', 'Djimon Hounsou', 'Sean Bean'], 'score': 10.97} 414 | {'title': 'Big Fish', 'director': ['Tim Burton'], 'writers': ['Daniel Wallace', 'John August'], 'actors': ['Ewan McGregor', 'Albert Finney', 'Billy Crudup', 'Jessica Lange'], 'score': 10.97} 415 | {'title': 'New Meet Me on South Street: The Story of JC Dobbs', 'director': ['George Manney'], 'writers': ['George Manney'], 'actors': ['Tony Bidgood', 'Peter Stone Brown', 'Stephen Caldwell', 'Tommy Conwell'], 'score': 10.5} 416 | ---------- 417 | 418 | Parameters 419 | ---------- 420 | q: ciborg 421 | kwargs: {'defType': 'edismax', 'qf': 'title plot genres', 'fl': 'title score', 'stopwords': 'true', 'rows': '5'} 422 | 423 | Results 424 | ---------- 425 | document count: 2 426 | qtime (ms): 0 427 | {'title': 'Terminator 2: Judgment Day', 
'score': 8.17} 428 | {'title': "I'm a Cyborg, But That's OK", 'score': 7.13} 429 | ---------- 430 | 431 | Parameters 432 | ---------- 433 | q: droid 434 | kwargs: {'defType': 'edismax', 'qf': 'title plot genres', 'fl': 'title score', 'stopwords': 'true', 'rows': '5'} 435 | 436 | Results 437 | ---------- 438 | document count: 15 439 | qtime (ms): 2 440 | {'title': 'Robo Jî', 'score': 7.67} 441 | {'title': "I'm a Cyborg, But That's OK", 'score': 7.13} 442 | {'title': 'BV-01', 'score': 6.6} 443 | {'title': 'Robot Chicken: DC Comics Special', 'score': 6.44} 444 | {'title': 'Terminator 2: Judgment Day', 'score': 6.23} 445 | ---------- 446 | 447 | Parameters 448 | ---------- 449 | q: scary 450 | kwargs: {'defType': 'edismax', 'qf': 'title plot genres', 'fl': 'title score', 'stopwords': 'true', 'rows': '5'} 451 | 452 | Results 453 | ---------- 454 | document count: 141 455 | qtime (ms): 0 456 | {'title': 'See No Evil, Hear No Evil', 'score': 7.9} 457 | {'title': 'The Evil Dead', 'score': 7.23} 458 | {'title': 'Evil Dead', 'score': 7.23} 459 | {'title': 'Evil Ed', 'score': 7.23} 460 | {'title': 'Evil Dead II', 'score': 6.37} 461 | ---------- 462 | 463 | Parameters 464 | ---------- 465 | q: lol 466 | kwargs: {'defType': 'edismax', 'qf': 'title plot genres', 'fl': 'title score', 'stopwords': 'true', 'rows': '5'} 467 | 468 | Results 469 | ---------- 470 | document count: 1 471 | qtime (ms): 2 472 | {'title': 'JK LOL', 'score': 9.05} 473 | 474 | python3 ./query_solr.py 0.47s user 0.20s system 42% cpu 1.559 total 475 | ``` 476 | 477 | ## Output from MongoDB Queries 478 | 479 | ```text 480 | > time python3 ./query_mongo.py 481 | 482 | Target MongoDB instance: mongodb://localhost:27017/movies 483 | No index to remove 484 | ---------- 485 | 486 | Parameters 487 | ---------- 488 | query: {} 489 | projection: {'_id': 0, 'title': 1} 490 | sort: none 491 | 492 | Results 493 | ---------- 494 | document count: 2250 495 | {'title': 'West Side Story'} 496 | {'title': 'A Million Ways to 
Die in the West'} 497 | {'title': 'Once Upon a Time in the West'} 498 | {'title': 'Wild Wild West'} 499 | {'title': 'An American Tail: Fievel Goes West'} 500 | ---------- 501 | 502 | Parameters 503 | ---------- 504 | query: {} 505 | 506 | Results 507 | ---------- 508 | document count: 2250 509 | ---------- 510 | 511 | Parameters 512 | ---------- 513 | query: {'title': 'Star Wars: Episode V - The Empire Strikes Back'} 514 | projection: {'_id': 0, 'title': 1} 515 | sort: none 516 | 517 | Results 518 | ---------- 519 | document count: 1 520 | {'title': 'Star Wars: Episode V - The Empire Strikes Back'} 521 | ---------- 522 | 523 | Parameters 524 | ---------- 525 | query: {'title': {'$regex': '\\bstar wars\\b', '$options': 'i'}} 526 | projection: {'_id': 0, 'title': 1} 527 | sort: none 528 | 529 | Results 530 | ---------- 531 | document count: 6 532 | {'title': 'Star Wars: Episode I - The Phantom Menace'} 533 | {'title': 'Star Wars: Episode II - Attack of the Clones'} 534 | {'title': 'Star Wars: Episode III - Revenge of the Sith'} 535 | {'title': 'Star Wars: Episode IV - A New Hope'} 536 | {'title': 'Star Wars: Episode V - The Empire Strikes Back'} 537 | ---------- 538 | 539 | Parameters 540 | ---------- 541 | query: {'$text': {'$search': 'star wars', '$language': 'en', '$caseSensitive': False}, 'countries': 'USA'} 542 | projection: {'score': {'$meta': 'textScore'}, '_id': 0, 'title': 1} 543 | sort: [('score', {'$meta': 'textScore'})] 544 | 545 | Results 546 | ---------- 547 | document count: 18 548 | {'title': 'Star Wars: Episode I - The Phantom Menace', 'score': 1.2} 549 | {'title': 'Star Wars: Episode IV - A New Hope', 'score': 1.17} 550 | {'title': 'Star Wars: Episode III - Revenge of the Sith', 'score': 1.17} 551 | {'title': 'Star Wars: Episode VI - Return of the Jedi', 'score': 1.17} 552 | {'title': 'Star Wars: Episode II - Attack of the Clones', 'score': 1.17} 553 | ---------- 554 | 555 | Parameters 556 | ---------- 557 | query: {'genres': {'$in': ['Adventure', 
'Action', 'Western']}, 'countries': 'USA'} 558 | projection: {'_id': 0, 'genres': 1, 'title': 1} 559 | sort: none 560 | 561 | Results 562 | ---------- 563 | document count: 244 564 | {'title': 'A Million Ways to Die in the West', 'genres': ['Comedy', 'Western']} 565 | {'title': 'Once Upon a Time in the West', 'genres': ['Western']} 566 | {'title': 'Wild Wild West', 'genres': ['Action', 'Western', 'Comedy']} 567 | {'title': 'An American Tail: Fievel Goes West', 'genres': ['Animation', 'Adventure', 'Family']} 568 | {'title': 'How the West Was Won', 'genres': ['Western']} 569 | ---------- 570 | 571 | Parameters 572 | ---------- 573 | query: {'$text': {'$search': 'western action adventure', '$language': 'en', '$caseSensitive': False}, 'countries': 'USA'} 574 | projection: {'score': {'$meta': 'textScore'}, '_id': 0, 'genres': 1, 'title': 1} 575 | sort: [('score', {'$meta': 'textScore'})] 576 | 577 | Results 578 | ---------- 579 | document count: 259 580 | {'title': 'Zathura: A Space Adventure', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 3.3} 581 | {'title': 'The Extraordinary Adventures of Adèle Blanc-Sec', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 3.24} 582 | {'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 3.2} 583 | {'title': 'The Adventures of Tintin', 'genres': ['Animation', 'Action', 'Adventure'], 'score': 2.85} 584 | {'title': 'Adventures in Babysitting', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 2.85} 585 | ---------- 586 | 587 | Parameters 588 | ---------- 589 | query: {'$text': {'$search': 'western action adventure', '$language': 'en', '$caseSensitive': False}, 'countries': 'USA'} 590 | projection: {'score': {'$meta': 'textScore'}, '_id': 0, 'genres': 1, 'title': 1} 591 | sort: [('score', {'$meta': 'textScore'})] 592 | 593 | Results 594 | ---------- 595 | document count: 259 596 | {'title': 'The Wild Bunch', 'genres': ['Action', 'Adventure', 'Western'], 'score': 12.8} 597 | {'title': 
'Zathura: A Space Adventure', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 10.27} 598 | {'title': 'The Extraordinary Adventures of Adèle Blanc-Sec', 'genres': ['Action', 'Adventure', 'Fantasy'], 'score': 10.14} 599 | {'title': 'The Adventures of Tintin', 'genres': ['Animation', 'Action', 'Adventure'], 'score': 9.9} 600 | {'title': 'Adventures in Babysitting', 'genres': ['Action', 'Adventure', 'Comedy'], 'score': 9.9} 601 | 602 | python3 ./query_mongo.py 0.21s user 0.11s system 18% cpu 1.716 total 603 | ``` 604 | 605 | ## References 606 | 607 | 608 | 609 | 610 | 611 | 612 | 613 | -------------------------------------------------------------------------------- /conf/lang/contractions_ca.txt: -------------------------------------------------------------------------------- 1 | # Set of Catalan contractions for ElisionFilter 2 | # TODO: load this as a resource from the analyzer and sync it in build.xml 3 | d 4 | l 5 | m 6 | n 7 | s 8 | t 9 | -------------------------------------------------------------------------------- /conf/lang/contractions_fr.txt: -------------------------------------------------------------------------------- 1 | # Set of French contractions for ElisionFilter 2 | # TODO: load this as a resource from the analyzer and sync it in build.xml 3 | l 4 | m 5 | t 6 | qu 7 | n 8 | s 9 | j 10 | d 11 | c 12 | jusqu 13 | quoiqu 14 | lorsqu 15 | puisqu 16 | -------------------------------------------------------------------------------- /conf/lang/contractions_ga.txt: -------------------------------------------------------------------------------- 1 | # Set of Irish contractions for ElisionFilter 2 | # TODO: load this as a resource from the analyzer and sync it in build.xml 3 | d 4 | m 5 | b 6 | -------------------------------------------------------------------------------- /conf/lang/contractions_it.txt: -------------------------------------------------------------------------------- 1 | # Set of Italian contractions for ElisionFilter 2 | # TODO: 
load this as a resource from the analyzer and sync it in build.xml 3 | c 4 | l 5 | all 6 | dall 7 | dell 8 | nell 9 | sull 10 | coll 11 | pell 12 | gl 13 | agl 14 | dagl 15 | degl 16 | negl 17 | sugl 18 | un 19 | m 20 | t 21 | s 22 | v 23 | d 24 | -------------------------------------------------------------------------------- /conf/lang/hyphenations_ga.txt: -------------------------------------------------------------------------------- 1 | # Set of Irish hyphenations for StopFilter 2 | # TODO: load this as a resource from the analyzer and sync it in build.xml 3 | h 4 | n 5 | t 6 | -------------------------------------------------------------------------------- /conf/lang/stemdict_nl.txt: -------------------------------------------------------------------------------- 1 | # Set of overrides for the Dutch stemmer 2 | # TODO: load this as a resource from the analyzer and sync it in build.xml 3 | fiets fiets 4 | bromfiets bromfiets 5 | ei eier 6 | kind kinder 7 | -------------------------------------------------------------------------------- /conf/lang/stoptags_ja.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file defines a Japanese stoptag set for JapanesePartOfSpeechStopFilter. 3 | # 4 | # Any token with a part-of-speech tag that exactly matches one defined in this 5 | # file is removed from the token stream. 6 | # 7 | # Set your own stoptags by uncommenting the lines below. Note that comments are 8 | # not allowed on the same line as a stoptag. See LUCENE-3745 for frequency lists, 9 | # etc. that can be useful for building your own stoptag set. 10 | # 11 | # The entire possible tagset is provided below for convenience. 
12 | # 13 | ##### 14 | # noun: unclassified nouns 15 | #名詞 16 | # 17 | # noun-common: Common nouns or nouns where the sub-classification is undefined 18 | #名詞-一般 19 | # 20 | # noun-proper: Proper nouns where the sub-classification is undefined 21 | #名詞-固有名詞 22 | # 23 | # noun-proper-misc: miscellaneous proper nouns 24 | #名詞-固有名詞-一般 25 | # 26 | # noun-proper-person: Personal names where the sub-classification is undefined 27 | #名詞-固有名詞-人名 28 | # 29 | # noun-proper-person-misc: names that cannot be divided into surname and 30 | # given name; foreign names; names where the surname or given name is unknown. 31 | # e.g. お市の方 32 | #名詞-固有名詞-人名-一般 33 | # 34 | # noun-proper-person-surname: Mainly Japanese surnames. 35 | # e.g. 山田 36 | #名詞-固有名詞-人名-姓 37 | # 38 | # noun-proper-person-given_name: Mainly Japanese given names. 39 | # e.g. 太郎 40 | #名詞-固有名詞-人名-名 41 | # 42 | # noun-proper-organization: Names representing organizations. 43 | # e.g. 通産省, NHK 44 | #名詞-固有名詞-組織 45 | # 46 | # noun-proper-place: Place names where the sub-classification is undefined 47 | #名詞-固有名詞-地域 48 | # 49 | # noun-proper-place-misc: Place names excluding countries. 50 | # e.g. アジア, バルセロナ, 京都 51 | #名詞-固有名詞-地域-一般 52 | # 53 | # noun-proper-place-country: Country names. 54 | # e.g. 日本, オーストラリア 55 | #名詞-固有名詞-地域-国 56 | # 57 | # noun-pronoun: Pronouns where the sub-classification is undefined 58 | #名詞-代名詞 59 | # 60 | # noun-pronoun-misc: miscellaneous pronouns: 61 | # e.g. それ, ここ, あいつ, あなた, あちこち, いくつ, どこか, なに, みなさん, みんな, わたくし, われわれ 62 | #名詞-代名詞-一般 63 | # 64 | # noun-pronoun-contraction: Spoken language contraction made by combining a 65 | # pronoun and the particle 'wa'. 66 | # e.g. ありゃ, こりゃ, こりゃあ, そりゃ, そりゃあ 67 | #名詞-代名詞-縮約 68 | # 69 | # noun-adverbial: Temporal nouns such as names of days or months that behave 70 | # like adverbs. Nouns that represent amount or ratios and can be used adverbially, 71 | # e.g. 
金曜, 一月, 午後, 少量 72 | #名詞-副詞可能 73 | # 74 | # noun-verbal: Nouns that take arguments with case and can appear followed by 75 | # 'suru' and related verbs (する, できる, なさる, くださる) 76 | # e.g. インプット, 愛着, 悪化, 悪戦苦闘, 一安心, 下取り 77 | #名詞-サ変接続 78 | # 79 | # noun-adjective-base: The base form of adjectives, words that appear before な ("na") 80 | # e.g. 健康, 安易, 駄目, だめ 81 | #名詞-形容動詞語幹 82 | # 83 | # noun-numeric: Arabic numbers, Chinese numerals, and counters like 何 (回), 数. 84 | # e.g. 0, 1, 2, 何, 数, 幾 85 | #名詞-数 86 | # 87 | # noun-affix: noun affixes where the sub-classification is undefined 88 | #名詞-非自立 89 | # 90 | # noun-affix-misc: Of adnominalizers, the case-marker の ("no"), and words that 91 | # attach to the base form of inflectional words, words that cannot be classified 92 | # into any of the other categories below. This category includes indefinite nouns. 93 | # e.g. あかつき, 暁, かい, 甲斐, 気, きらい, 嫌い, くせ, 癖, こと, 事, ごと, 毎, しだい, 次第, 94 | # 順, せい, 所為, ついで, 序で, つもり, 積もり, 点, どころ, の, はず, 筈, はずみ, 弾み, 95 | # 拍子, ふう, ふり, 振り, ほう, 方, 旨, もの, 物, 者, ゆえ, 故, ゆえん, 所以, わけ, 訳, 96 | # わり, 割り, 割, ん-口語/, もん-口語/ 97 | #名詞-非自立-一般 98 | # 99 | # noun-affix-adverbial: noun affixes that can behave as adverbs. 100 | # e.g. あいだ, 間, あげく, 挙げ句, あと, 後, 余り, 以外, 以降, 以後, 以上, 以前, 一方, うえ, 101 | # 上, うち, 内, おり, 折り, かぎり, 限り, きり, っきり, 結果, ころ, 頃, さい, 際, 最中, さなか, 102 | # 最中, じたい, 自体, たび, 度, ため, 為, つど, 都度, とおり, 通り, とき, 時, ところ, 所, 103 | # とたん, 途端, なか, 中, のち, 後, ばあい, 場合, 日, ぶん, 分, ほか, 他, まえ, 前, まま, 104 | # 儘, 侭, みぎり, 矢先 105 | #名詞-非自立-副詞可能 106 | # 107 | # noun-affix-aux: noun affixes treated as 助動詞 ("auxiliary verb") in school grammars 108 | # with the stem よう(だ) ("you(da)"). 109 | # e.g. よう, やう, 様 (よう) 110 | #名詞-非自立-助動詞語幹 111 | # 112 | # noun-affix-adjective-base: noun affixes that can connect to the indeclinable 113 | # connection form な (aux "da"). 114 | # e.g. みたい, ふう 115 | #名詞-非自立-形容動詞語幹 116 | # 117 | # noun-special: special nouns where the sub-classification is undefined.
118 | #名詞-特殊 119 | # 120 | # noun-special-aux: The そうだ ("souda") stem form that is used for reporting news, is 121 | # treated as 助動詞 ("auxiliary verb") in school grammars, and attach to the base 122 | # form of inflectional words. 123 | # e.g. そう 124 | #名詞-特殊-助動詞語幹 125 | # 126 | # noun-suffix: noun suffixes where the sub-classification is undefined. 127 | #名詞-接尾 128 | # 129 | # noun-suffix-misc: Of the nouns or stem forms of other parts of speech that connect 130 | # to ガル or タイ and can combine into compound nouns, words that cannot be classified into 131 | # any of the other categories below. In general, this category is more inclusive than 132 | # 接尾語 ("suffix") and is usually the last element in a compound noun. 133 | # e.g. おき, かた, 方, 甲斐 (がい), がかり, ぎみ, 気味, ぐるみ, (~した) さ, 次第, 済 (ず) み, 134 | # よう, (でき)っこ, 感, 観, 性, 学, 類, 面, 用 135 | #名詞-接尾-一般 136 | # 137 | # noun-suffix-person: Suffixes that form nouns and attach to person names more often 138 | # than other nouns. 139 | # e.g. 君, 様, 著 140 | #名詞-接尾-人名 141 | # 142 | # noun-suffix-place: Suffixes that form nouns and attach to place names more often 143 | # than other nouns. 144 | # e.g. 町, 市, 県 145 | #名詞-接尾-地域 146 | # 147 | # noun-suffix-verbal: Of the suffixes that attach to nouns and form nouns, those that 148 | # can appear before スル ("suru"). 149 | # e.g. 化, 視, 分け, 入り, 落ち, 買い 150 | #名詞-接尾-サ変接続 151 | # 152 | # noun-suffix-aux: The stem form of そうだ (様態) that is used to indicate conditions, 153 | # is treated as 助動詞 ("auxiliary verb") in school grammars, and attach to the 154 | # conjunctive form of inflectional words. 155 | # e.g. そう 156 | #名詞-接尾-助動詞語幹 157 | # 158 | # noun-suffix-adjective-base: Suffixes that attach to other nouns or the conjunctive 159 | # form of inflectional words and appear before the copula だ ("da"). 160 | # e.g. 的, げ, がち 161 | #名詞-接尾-形容動詞語幹 162 | # 163 | # noun-suffix-adverbial: Suffixes that attach to other nouns and can behave as adverbs. 164 | # e.g. 
後 (ご), 以後, 以降, 以前, 前後, 中, 末, 上, 時 (じ) 165 | #名詞-接尾-副詞可能 166 | # 167 | # noun-suffix-classifier: Suffixes that attach to numbers and form nouns. This category 168 | # is more inclusive than 助数詞 ("classifier") and includes common nouns that attach 169 | # to numbers. 170 | # e.g. 個, つ, 本, 冊, パーセント, cm, kg, カ月, か国, 区画, 時間, 時半 171 | #名詞-接尾-助数詞 172 | # 173 | # noun-suffix-special: Special suffixes that mainly attach to inflecting words. 174 | # e.g. (楽し) さ, (考え) 方 175 | #名詞-接尾-特殊 176 | # 177 | # noun-suffix-conjunctive: Nouns that behave like conjunctions and join two words 178 | # together. 179 | # e.g. (日本) 対 (アメリカ), 対 (アメリカ), (3) 対 (5), (女優) 兼 (主婦) 180 | #名詞-接続詞的 181 | # 182 | # noun-verbal_aux: Nouns that attach to the conjunctive particle て ("te") and are 183 | # semantically verb-like. 184 | # e.g. ごらん, ご覧, 御覧, 頂戴 185 | #名詞-動詞非自立的 186 | # 187 | # noun-quotation: text that cannot be segmented into words, proverbs, Chinese poetry, 188 | # dialects, English, etc. Currently, the only entry for 名詞 引用文字列 ("noun quotation") 189 | # is いわく ("iwaku"). 190 | #名詞-引用文字列 191 | # 192 | # noun-nai_adjective: Words that appear before the auxiliary verb ない ("nai") and 193 | # behave like an adjective. 194 | # e.g. 申し訳, 仕方, とんでも, 違い 195 | #名詞-ナイ形容詞語幹 196 | # 197 | ##### 198 | # prefix: unclassified prefixes 199 | #接頭詞 200 | # 201 | # prefix-nominal: Prefixes that attach to nouns (including adjective stem forms) 202 | # excluding numerical expressions. 203 | # e.g. お (水), 某 (氏), 同 (社), 故 (~氏), 高 (品質), お (見事), ご (立派) 204 | #接頭詞-名詞接続 205 | # 206 | # prefix-verbal: Prefixes that attach to the imperative form of a verb or a verb 207 | # in conjunctive form followed by なる/なさる/くださる. 208 | # e.g. お (読みなさい), お (座り) 209 | #接頭詞-動詞接続 210 | # 211 | # prefix-adjectival: Prefixes that attach to adjectives. 212 | # e.g. お (寒いですねえ), バカ (でかい) 213 | #接頭詞-形容詞接続 214 | # 215 | # prefix-numerical: Prefixes that attach to numerical expressions. 216 | # e.g. 
約, およそ, 毎時 217 | #接頭詞-数接続 218 | # 219 | ##### 220 | # verb: unclassified verbs 221 | #動詞 222 | # 223 | # verb-main: 224 | #動詞-自立 225 | # 226 | # verb-auxiliary: 227 | #動詞-非自立 228 | # 229 | # verb-suffix: 230 | #動詞-接尾 231 | # 232 | ##### 233 | # adjective: unclassified adjectives 234 | #形容詞 235 | # 236 | # adjective-main: 237 | #形容詞-自立 238 | # 239 | # adjective-auxiliary: 240 | #形容詞-非自立 241 | # 242 | # adjective-suffix: 243 | #形容詞-接尾 244 | # 245 | ##### 246 | # adverb: unclassified adverbs 247 | #副詞 248 | # 249 | # adverb-misc: Words that can be segmented into one unit and where adnominal 250 | # modification is not possible. 251 | # e.g. あいかわらず, 多分 252 | #副詞-一般 253 | # 254 | # adverb-particle_conjunction: Adverbs that can be followed by の, は, に, 255 | # な, する, だ, etc. 256 | # e.g. こんなに, そんなに, あんなに, なにか, なんでも 257 | #副詞-助詞類接続 258 | # 259 | ##### 260 | # adnominal: Words that only have noun-modifying forms. 261 | # e.g. この, その, あの, どの, いわゆる, なんらかの, 何らかの, いろんな, こういう, そういう, ああいう, 262 | # どういう, こんな, そんな, あんな, どんな, 大きな, 小さな, おかしな, ほんの, たいした, 263 | # 「(, も) さる (ことながら)」, 微々たる, 堂々たる, 単なる, いかなる, 我が」「同じ, 亡き 264 | #連体詞 265 | # 266 | ##### 267 | # conjunction: Conjunctions that can occur independently. 268 | # e.g. が, けれども, そして, じゃあ, それどころか 269 | 接続詞 270 | # 271 | ##### 272 | # particle: unclassified particles. 273 | 助詞 274 | # 275 | # particle-case: case particles where the subclassification is undefined. 276 | 助詞-格助詞 277 | # 278 | # particle-case-misc: Case particles. 279 | # e.g. から, が, で, と, に, へ, より, を, の, にて 280 | 助詞-格助詞-一般 281 | # 282 | # particle-case-quote: the "to" that appears after nouns, a person’s speech, 283 | # quotation marks, expressions of decisions from a meeting, reasons, judgements, 284 | # conjectures, etc. 285 | # e.g. ( だ) と (述べた.), ( である) と (して執行猶予...) 286 | 助詞-格助詞-引用 287 | # 288 | # particle-case-compound: Compounds of particles and verbs that mainly behave 289 | # like case particles. 290 | # e.g. 
という, といった, とかいう, として, とともに, と共に, でもって, にあたって, に当たって, に当って, 291 | # にあたり, に当たり, に当り, に当たる, にあたる, において, に於いて,に於て, における, に於ける, 292 | # にかけ, にかけて, にかんし, に関し, にかんして, に関して, にかんする, に関する, に際し, 293 | # に際して, にしたがい, に従い, に従う, にしたがって, に従って, にたいし, に対し, にたいして, 294 | # に対して, にたいする, に対する, について, につき, につけ, につけて, につれ, につれて, にとって, 295 | # にとり, にまつわる, によって, に依って, に因って, により, に依り, に因り, による, に依る, に因る, 296 | # にわたって, にわたる, をもって, を以って, を通じ, を通じて, を通して, をめぐって, をめぐり, をめぐる, 297 | # って-口語/, ちゅう-関西弁「という」/, (何) ていう (人)-口語/, っていう-口語/, といふ, とかいふ 298 | 助詞-格助詞-連語 299 | # 300 | # particle-conjunctive: 301 | # e.g. から, からには, が, けれど, けれども, けど, し, つつ, て, で, と, ところが, どころか, とも, ども, 302 | # ながら, なり, ので, のに, ば, ものの, や ( した), やいなや, (ころん) じゃ(いけない)-口語/, 303 | # (行っ) ちゃ(いけない)-口語/, (言っ) たって (しかたがない)-口語/, (それがなく)ったって (平気)-口語/ 304 | 助詞-接続助詞 305 | # 306 | # particle-dependency: 307 | # e.g. こそ, さえ, しか, すら, は, も, ぞ 308 | 助詞-係助詞 309 | # 310 | # particle-adverbial: 311 | # e.g. がてら, かも, くらい, 位, ぐらい, しも, (学校) じゃ(これが流行っている)-口語/, 312 | # (それ)じゃあ (よくない)-口語/, ずつ, (私) なぞ, など, (私) なり (に), (先生) なんか (大嫌い)-口語/, 313 | # (私) なんぞ, (先生) なんて (大嫌い)-口語/, のみ, だけ, (私) だって-口語/, だに, 314 | # (彼)ったら-口語/, (お茶) でも (いかが), 等 (とう), (今後) とも, ばかり, ばっか-口語/, ばっかり-口語/, 315 | # ほど, 程, まで, 迄, (誰) も (が)([助詞-格助詞] および [助詞-係助詞] の前に位置する「も」) 316 | 助詞-副助詞 317 | # 318 | # particle-interjective: particles with interjective grammatical roles. 319 | # e.g. (松島) や 320 | 助詞-間投助詞 321 | # 322 | # particle-coordinate: 323 | # e.g. と, たり, だの, だり, とか, なり, や, やら 324 | 助詞-並立助詞 325 | # 326 | # particle-final: 327 | # e.g. かい, かしら, さ, ぜ, (だ)っけ-口語/, (とまってる) で-方言/, な, ナ, なあ-口語/, ぞ, ね, ネ, 328 | # ねぇ-口語/, ねえ-口語/, ねん-方言/, の, のう-口語/, や, よ, ヨ, よぉ-口語/, わ, わい-口語/ 329 | 助詞-終助詞 330 | # 331 | # particle-adverbial/conjunctive/final: The particle "ka" when unknown whether it is 332 | # adverbial, conjunctive, or sentence final. For example: 333 | # (a) 「A か B か」. Ex:「(国内で運用する) か,(海外で運用する) か (.)」 334 | # (b) Inside an adverb phrase. 
Ex:「(幸いという) か (, 死者はいなかった.)」 335 | # 「(祈りが届いたせい) か (, 試験に合格した.)」 336 | # (c) 「かのように」. Ex:「(何もなかった) か (のように振る舞った.)」 337 | # e.g. か 338 | 助詞-副助詞/並立助詞/終助詞 339 | # 340 | # particle-adnominalizer: The "no" that attaches to nouns and modifies 341 | # non-inflectional words. 342 | 助詞-連体化 343 | # 344 | # particle-adverbializer: The "ni" and "to" that appear following nouns and adverbs 345 | # that are giongo, giseigo, or gitaigo. 346 | # e.g. に, と 347 | 助詞-副詞化 348 | # 349 | # particle-special: A particle that does not fit into one of the above classifications. 350 | # This includes particles that are used in Tanka, Haiku, and other poetry. 351 | # e.g. かな, けむ, ( しただろう) に, (あんた) にゃ(わからん), (俺) ん (家) 352 | 助詞-特殊 353 | # 354 | ##### 355 | # auxiliary-verb: 356 | 助動詞 357 | # 358 | ##### 359 | # interjection: Greetings and other exclamations. 360 | # e.g. おはよう, おはようございます, こんにちは, こんばんは, ありがとう, どうもありがとう, ありがとうございます, 361 | # いただきます, ごちそうさま, さよなら, さようなら, はい, いいえ, ごめん, ごめんなさい 362 | #感動詞 363 | # 364 | ##### 365 | # symbol: unclassified Symbols. 366 | 記号 367 | # 368 | # symbol-misc: A general symbol not in one of the categories below. 369 | # e.g. [○◎@$〒→+] 370 | 記号-一般 371 | # 372 | # symbol-comma: Commas 373 | # e.g. [,、] 374 | 記号-読点 375 | # 376 | # symbol-period: Periods and full stops. 377 | # e.g. [..。] 378 | 記号-句点 379 | # 380 | # symbol-space: Full-width whitespace. 381 | 記号-空白 382 | # 383 | # symbol-open_bracket: 384 | # e.g. [({‘“『【] 385 | 記号-括弧開 386 | # 387 | # symbol-close_bracket: 388 | # e.g. [)}’”』」】] 389 | 記号-括弧閉 390 | # 391 | # symbol-alphabetic: 392 | #記号-アルファベット 393 | # 394 | ##### 395 | # other: unclassified other 396 | #その他 397 | # 398 | # other-interjection: Words that are hard to classify as noun-suffixes or 399 | # sentence-final particles. 400 | # e.g. (だ)ァ 401 | その他-間投 402 | # 403 | ##### 404 | # filler: Aizuchi that occurs during a conversation or sounds inserted as filler. 405 | # e.g.
あの, うんと, えと 406 | フィラー 407 | # 408 | ##### 409 | # non-verbal: non-verbal sound. 410 | 非言語音 411 | # 412 | ##### 413 | # fragment: 414 | #語断片 415 | # 416 | ##### 417 | # unknown: unknown part of speech. 418 | #未知語 419 | # 420 | ##### End of file 421 | -------------------------------------------------------------------------------- /conf/lang/stopwords_ar.txt: -------------------------------------------------------------------------------- 1 | # This file was created by Jacques Savoy and is distributed under the BSD license. 2 | # See http://members.unine.ch/jacques.savoy/clef/index.html. 3 | # Also see http://www.opensource.org/licenses/bsd-license.html 4 | # Cleaned on October 11, 2009 (not normalized, so use before normalization) 5 | # This means that when modifying this list, you might need to add some 6 | # redundant entries, for example containing forms with both أ and ا 7 | من 8 | ومن 9 | منها 10 | منه 11 | في 12 | وفي 13 | فيها 14 | فيه 15 | و 16 | ف 17 | ثم 18 | او 19 | أو 20 | ب 21 | بها 22 | به 23 | ا 24 | أ 25 | اى 26 | اي 27 | أي 28 | أى 29 | لا 30 | ولا 31 | الا 32 | ألا 33 | إلا 34 | لكن 35 | ما 36 | وما 37 | كما 38 | فما 39 | عن 40 | مع 41 | اذا 42 | إذا 43 | ان 44 | أن 45 | إن 46 | انها 47 | أنها 48 | إنها 49 | انه 50 | أنه 51 | إنه 52 | بان 53 | بأن 54 | فان 55 | فأن 56 | وان 57 | وأن 58 | وإن 59 | التى 60 | التي 61 | الذى 62 | الذي 63 | الذين 64 | الى 65 | الي 66 | إلى 67 | إلي 68 | على 69 | عليها 70 | عليه 71 | اما 72 | أما 73 | إما 74 | ايضا 75 | أيضا 76 | كل 77 | وكل 78 | لم 79 | ولم 80 | لن 81 | ولن 82 | هى 83 | هي 84 | هو 85 | وهى 86 | وهي 87 | وهو 88 | فهى 89 | فهي 90 | فهو 91 | انت 92 | أنت 93 | لك 94 | لها 95 | له 96 | هذه 97 | هذا 98 | تلك 99 | ذلك 100 | هناك 101 | كانت 102 | كان 103 | يكون 104 | تكون 105 | وكانت 106 | وكان 107 | غير 108 | بعض 109 | قد 110 | نحو 111 | بين 112 | بينما 113 | منذ 114 | ضمن 115 | حيث 116 | الان 117 | الآن 118 | خلال 119 | بعد 120 | قبل 121 | حتى 122 | عند 123 | عندما 124 | لدى 125 | جميع 126 | 
-------------------------------------------------------------------------------- /conf/lang/stopwords_bg.txt: -------------------------------------------------------------------------------- 1 | # This file was created by Jacques Savoy and is distributed under the BSD license. 2 | # See http://members.unine.ch/jacques.savoy/clef/index.html. 3 | # Also see http://www.opensource.org/licenses/bsd-license.html 4 | а 5 | аз 6 | ако 7 | ала 8 | бе 9 | без 10 | беше 11 | би 12 | бил 13 | била 14 | били 15 | било 16 | близо 17 | бъдат 18 | бъде 19 | бяха 20 | в 21 | вас 22 | ваш 23 | ваша 24 | вероятно 25 | вече 26 | взема 27 | ви 28 | вие 29 | винаги 30 | все 31 | всеки 32 | всички 33 | всичко 34 | всяка 35 | във 36 | въпреки 37 | върху 38 | г 39 | ги 40 | главно 41 | го 42 | д 43 | да 44 | дали 45 | до 46 | докато 47 | докога 48 | дори 49 | досега 50 | доста 51 | е 52 | едва 53 | един 54 | ето 55 | за 56 | зад 57 | заедно 58 | заради 59 | засега 60 | затова 61 | защо 62 | защото 63 | и 64 | из 65 | или 66 | им 67 | има 68 | имат 69 | иска 70 | й 71 | каза 72 | как 73 | каква 74 | какво 75 | както 76 | какъв 77 | като 78 | кога 79 | когато 80 | което 81 | които 82 | кой 83 | който 84 | колко 85 | която 86 | къде 87 | където 88 | към 89 | ли 90 | м 91 | ме 92 | между 93 | мен 94 | ми 95 | мнозина 96 | мога 97 | могат 98 | може 99 | моля 100 | момента 101 | му 102 | н 103 | на 104 | над 105 | назад 106 | най 107 | направи 108 | напред 109 | например 110 | нас 111 | не 112 | него 113 | нея 114 | ни 115 | ние 116 | никой 117 | нито 118 | но 119 | някои 120 | някой 121 | няма 122 | обаче 123 | около 124 | освен 125 | особено 126 | от 127 | отгоре 128 | отново 129 | още 130 | пак 131 | по 132 | повече 133 | повечето 134 | под 135 | поне 136 | поради 137 | после 138 | почти 139 | прави 140 | пред 141 | преди 142 | през 143 | при 144 | пък 145 | първо 146 | с 147 | са 148 | само 149 | се 150 | сега 151 | си 152 | скоро 153 | след 154 | сме 155 | според 156 | сред 157 | срещу 158 
| сте 159 | съм 160 | със 161 | също 162 | т 163 | тази 164 | така 165 | такива 166 | такъв 167 | там 168 | твой 169 | те 170 | тези 171 | ти 172 | тн 173 | то 174 | това 175 | тогава 176 | този 177 | той 178 | толкова 179 | точно 180 | трябва 181 | тук 182 | тъй 183 | тя 184 | тях 185 | у 186 | харесва 187 | ч 188 | че 189 | често 190 | чрез 191 | ще 192 | щом 193 | я 194 | -------------------------------------------------------------------------------- /conf/lang/stopwords_ca.txt: -------------------------------------------------------------------------------- 1 | # Catalan stopwords from http://github.com/vcl/cue.language (Apache 2 Licensed) 2 | a 3 | abans 4 | ací 5 | ah 6 | així 7 | això 8 | al 9 | als 10 | aleshores 11 | algun 12 | alguna 13 | algunes 14 | alguns 15 | alhora 16 | allà 17 | allí 18 | allò 19 | altra 20 | altre 21 | altres 22 | amb 23 | ambdós 24 | ambdues 25 | apa 26 | aquell 27 | aquella 28 | aquelles 29 | aquells 30 | aquest 31 | aquesta 32 | aquestes 33 | aquests 34 | aquí 35 | baix 36 | cada 37 | cadascú 38 | cadascuna 39 | cadascunes 40 | cadascuns 41 | com 42 | contra 43 | d'un 44 | d'una 45 | d'unes 46 | d'uns 47 | dalt 48 | de 49 | del 50 | dels 51 | des 52 | després 53 | dins 54 | dintre 55 | donat 56 | doncs 57 | durant 58 | e 59 | eh 60 | el 61 | els 62 | em 63 | en 64 | encara 65 | ens 66 | entre 67 | érem 68 | eren 69 | éreu 70 | es 71 | és 72 | esta 73 | està 74 | estàvem 75 | estaven 76 | estàveu 77 | esteu 78 | et 79 | etc 80 | ets 81 | fins 82 | fora 83 | gairebé 84 | ha 85 | han 86 | has 87 | havia 88 | he 89 | hem 90 | heu 91 | hi 92 | ho 93 | i 94 | igual 95 | iguals 96 | ja 97 | l'hi 98 | la 99 | les 100 | li 101 | li'n 102 | llavors 103 | m'he 104 | ma 105 | mal 106 | malgrat 107 | mateix 108 | mateixa 109 | mateixes 110 | mateixos 111 | me 112 | mentre 113 | més 114 | meu 115 | meus 116 | meva 117 | meves 118 | molt 119 | molta 120 | moltes 121 | molts 122 | mon 123 | mons 124 | n'he 125 | n'hi 126 | ne 127 | ni 128 | no 
129 | nogensmenys 130 | només 131 | nosaltres 132 | nostra 133 | nostre 134 | nostres 135 | o 136 | oh 137 | oi 138 | on 139 | pas 140 | pel 141 | pels 142 | per 143 | però 144 | perquè 145 | poc 146 | poca 147 | pocs 148 | poques 149 | potser 150 | propi 151 | qual 152 | quals 153 | quan 154 | quant 155 | que 156 | què 157 | quelcom 158 | qui 159 | quin 160 | quina 161 | quines 162 | quins 163 | s'ha 164 | s'han 165 | sa 166 | semblant 167 | semblants 168 | ses 169 | seu 170 | seus 171 | seva 172 | seva 173 | seves 174 | si 175 | sobre 176 | sobretot 177 | sóc 178 | solament 179 | sols 180 | son 181 | són 182 | sons 183 | sota 184 | sou 185 | t'ha 186 | t'han 187 | t'he 188 | ta 189 | tal 190 | també 191 | tampoc 192 | tan 193 | tant 194 | tanta 195 | tantes 196 | teu 197 | teus 198 | teva 199 | teves 200 | ton 201 | tons 202 | tot 203 | tota 204 | totes 205 | tots 206 | un 207 | una 208 | unes 209 | uns 210 | us 211 | va 212 | vaig 213 | vam 214 | van 215 | vas 216 | veu 217 | vosaltres 218 | vostra 219 | vostre 220 | vostres 221 | -------------------------------------------------------------------------------- /conf/lang/stopwords_cz.txt: -------------------------------------------------------------------------------- 1 | a 2 | s 3 | k 4 | o 5 | i 6 | u 7 | v 8 | z 9 | dnes 10 | cz 11 | tímto 12 | budeš 13 | budem 14 | byli 15 | jseš 16 | můj 17 | svým 18 | ta 19 | tomto 20 | tohle 21 | tuto 22 | tyto 23 | jej 24 | zda 25 | proč 26 | máte 27 | tato 28 | kam 29 | tohoto 30 | kdo 31 | kteří 32 | mi 33 | nám 34 | tom 35 | tomuto 36 | mít 37 | nic 38 | proto 39 | kterou 40 | byla 41 | toho 42 | protože 43 | asi 44 | ho 45 | naši 46 | napište 47 | re 48 | což 49 | tím 50 | takže 51 | svých 52 | její 53 | svými 54 | jste 55 | aj 56 | tu 57 | tedy 58 | teto 59 | bylo 60 | kde 61 | ke 62 | pravé 63 | ji 64 | nad 65 | nejsou 66 | či 67 | pod 68 | téma 69 | mezi 70 | přes 71 | ty 72 | pak 73 | vám 74 | ani 75 | když 76 | však 77 | neg 78 | jsem 79 | tento 80 | článku 81 | 
články 82 | aby 83 | jsme 84 | před 85 | pta 86 | jejich 87 | byl 88 | ještě 89 | až 90 | bez 91 | také 92 | pouze 93 | první 94 | vaše 95 | která 96 | nás 97 | nový 98 | tipy 99 | pokud 100 | může 101 | strana 102 | jeho 103 | své 104 | jiné 105 | zprávy 106 | nové 107 | není 108 | vás 109 | jen 110 | podle 111 | zde 112 | už 113 | být 114 | více 115 | bude 116 | již 117 | než 118 | který 119 | by 120 | které 121 | co 122 | nebo 123 | ten 124 | tak 125 | má 126 | při 127 | od 128 | po 129 | jsou 130 | jak 131 | další 132 | ale 133 | si 134 | se 135 | ve 136 | to 137 | jako 138 | za 139 | zpět 140 | ze 141 | do 142 | pro 143 | je 144 | na 145 | atd 146 | atp 147 | jakmile 148 | přičemž 149 | já 150 | on 151 | ona 152 | ono 153 | oni 154 | ony 155 | my 156 | vy 157 | jí 158 | ji 159 | mě 160 | mne 161 | jemu 162 | tomu 163 | těm 164 | těmu 165 | němu 166 | němuž 167 | jehož 168 | jíž 169 | jelikož 170 | jež 171 | jakož 172 | načež 173 | -------------------------------------------------------------------------------- /conf/lang/stopwords_da.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/danish/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | A Danish stop word list. Comments begin with vertical bar. Each stop 11 | | word is at the start of a line. 12 | 13 | | This is a ranked list (commonest to rarest) of stopwords derived from 14 | | a large text sample. 15 | 16 | 17 | og | and 18 | i | in 19 | jeg | I 20 | det | that (dem. pronoun)/it (pers. pronoun) 21 | at | that (in front of a sentence)/to (with infinitive) 22 | en | a/an 23 | den | it (pers. 
pronoun)/that (dem. pronoun) 24 | til | to/at/for/until/against/by/of/into, more 25 | er | present tense of "to be" 26 | som | who, as 27 | på | on/upon/in/on/at/to/after/of/with/for, on 28 | de | they 29 | med | with/by/in, along 30 | han | he 31 | af | of/by/from/off/for/in/with/on, off 32 | for | at/for/to/from/by/of/ago, in front/before, because 33 | ikke | not 34 | der | who/which, there/those 35 | var | past tense of "to be" 36 | mig | me/myself 37 | sig | oneself/himself/herself/itself/themselves 38 | men | but 39 | et | a/an/one, one (number), someone/somebody/one 40 | har | present tense of "to have" 41 | om | round/about/for/in/a, about/around/down, if 42 | vi | we 43 | min | my 44 | havde | past tense of "to have" 45 | ham | him 46 | hun | she 47 | nu | now 48 | over | over/above/across/by/beyond/past/on/about, over/past 49 | da | then, when/as/since 50 | fra | from/off/since, off, since 51 | du | you 52 | ud | out 53 | sin | his/her/its/one's 54 | dem | them 55 | os | us/ourselves 56 | op | up 57 | man | you/one 58 | hans | his 59 | hvor | where 60 | eller | or 61 | hvad | what 62 | skal | must/shall etc. 63 | selv | myself/yourself/herself/ourselves etc., even 64 | her | here 65 | alle | all/everyone/everybody etc.
66 | vil | will (verb) 67 | blev | past tense of "to stay/to remain/to get/to become" 68 | kunne | could 69 | ind | in 70 | når | when 71 | være | infinitive of "to be" 72 | dog | however/yet/after all 73 | noget | something 74 | ville | would 75 | jo | you know/you see (adv), yes 76 | deres | their/theirs 77 | efter | after/behind/according to/for/by/from, later/afterwards 78 | ned | down 79 | skulle | should 80 | denne | this 81 | end | than 82 | dette | this 83 | mit | my/mine 84 | også | also 85 | under | under/beneath/below/during, below/underneath 86 | have | have 87 | dig | you 88 | anden | other 89 | hende | her 90 | mine | my 91 | alt | everything 92 | meget | much/very, plenty of 93 | sit | his, her, its, one's 94 | sine | his, her, its, one's 95 | vor | our 96 | mod | against 97 | disse | these 98 | hvis | if 99 | din | your/yours 100 | nogle | some 101 | hos | by/at 102 | blive | be/become 103 | mange | many 104 | ad | by/through 105 | bliver | present tense of "to be/to become" 106 | hendes | her/hers 107 | været | been 108 | thi | for (conj) 109 | jer | you 110 | sådan | such, like this/like that 111 | -------------------------------------------------------------------------------- /conf/lang/stopwords_de.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/german/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | A German stop word list. Comments begin with vertical bar. Each stop 11 | | word is at the start of a line. 12 | 13 | | The number of forms in this list is reduced significantly by passing it 14 | | through the German stemmer.
15 | 16 | 17 | aber | but 18 | 19 | alle | all 20 | allem 21 | allen 22 | aller 23 | alles 24 | 25 | als | than, as 26 | also | so 27 | am | an + dem 28 | an | at 29 | 30 | ander | other 31 | andere 32 | anderem 33 | anderen 34 | anderer 35 | anderes 36 | anderm 37 | andern 38 | anderr 39 | anders 40 | 41 | auch | also 42 | auf | on 43 | aus | out of 44 | bei | by 45 | bin | am 46 | bis | until 47 | bist | art 48 | da | there 49 | damit | with it 50 | dann | then 51 | 52 | der | the 53 | den 54 | des 55 | dem 56 | die 57 | das 58 | 59 | daß | that 60 | 61 | derselbe | the same 62 | derselben 63 | denselben 64 | desselben 65 | demselben 66 | dieselbe 67 | dieselben 68 | dasselbe 69 | 70 | dazu | to that 71 | 72 | dein | thy 73 | deine 74 | deinem 75 | deinen 76 | deiner 77 | deines 78 | 79 | denn | because 80 | 81 | derer | of those 82 | dessen | of him 83 | 84 | dich | thee 85 | dir | to thee 86 | du | thou 87 | 88 | dies | this 89 | diese 90 | diesem 91 | diesen 92 | dieser 93 | dieses 94 | 95 | 96 | doch | (several meanings) 97 | dort | (over) there 98 | 99 | 100 | durch | through 101 | 102 | ein | a 103 | eine 104 | einem 105 | einen 106 | einer 107 | eines 108 | 109 | einig | some 110 | einige 111 | einigem 112 | einigen 113 | einiger 114 | einiges 115 | 116 | einmal | once 117 | 118 | er | he 119 | ihn | him 120 | ihm | to him 121 | 122 | es | it 123 | etwas | something 124 | 125 | euer | your 126 | eure 127 | eurem 128 | euren 129 | eurer 130 | eures 131 | 132 | für | for 133 | gegen | towards 134 | gewesen | p.p. 
of sein 135 | hab | have 136 | habe | have 137 | haben | have 138 | hat | has 139 | hatte | had 140 | hatten | had 141 | hier | here 142 | hin | there 143 | hinter | behind 144 | 145 | ich | I 146 | mich | me 147 | mir | to me 148 | 149 | 150 | ihr | you, to her 151 | ihre 152 | ihrem 153 | ihren 154 | ihrer 155 | ihres 156 | euch | to you 157 | 158 | im | in + dem 159 | in | in 160 | indem | while 161 | ins | in + das 162 | ist | is 163 | 164 | jede | each, every 165 | jedem 166 | jeden 167 | jeder 168 | jedes 169 | 170 | jene | that 171 | jenem 172 | jenen 173 | jener 174 | jenes 175 | 176 | jetzt | now 177 | kann | can 178 | 179 | kein | no 180 | keine 181 | keinem 182 | keinen 183 | keiner 184 | keines 185 | 186 | können | can 187 | könnte | could 188 | machen | do 189 | man | one 190 | 191 | manche | some, many a 192 | manchem 193 | manchen 194 | mancher 195 | manches 196 | 197 | mein | my 198 | meine 199 | meinem 200 | meinen 201 | meiner 202 | meines 203 | 204 | mit | with 205 | muss | must 206 | musste | had to 207 | nach | to(wards) 208 | nicht | not 209 | nichts | nothing 210 | noch | still, yet 211 | nun | now 212 | nur | only 213 | ob | whether 214 | oder | or 215 | ohne | without 216 | sehr | very 217 | 218 | sein | his 219 | seine 220 | seinem 221 | seinen 222 | seiner 223 | seines 224 | 225 | selbst | self 226 | sich | herself 227 | 228 | sie | they, she 229 | ihnen | to them 230 | 231 | sind | are 232 | so | so 233 | 234 | solche | such 235 | solchem 236 | solchen 237 | solcher 238 | solches 239 | 240 | soll | shall 241 | sollte | should 242 | sondern | but 243 | sonst | else 244 | über | over 245 | um | about, around 246 | und | and 247 | 248 | uns | us 249 | unse 250 | unsem 251 | unsen 252 | unser 253 | unses 254 | 255 | unter | under 256 | viel | much 257 | vom | von + dem 258 | von | from 259 | vor | before 260 | während | while 261 | war | was 262 | waren | were 263 | warst | wast 264 | was | what 265 | weg | away, off 266 | weil | because 267 
| weiter | further 268 | 269 | welche | which 270 | welchem 271 | welchen 272 | welcher 273 | welches 274 | 275 | wenn | when 276 | werde | will 277 | werden | will 278 | wie | how 279 | wieder | again 280 | will | want 281 | wir | we 282 | wird | will 283 | wirst | willst 284 | wo | where 285 | wollen | want 286 | wollte | wanted 287 | würde | would 288 | würden | would 289 | zu | to 290 | zum | zu + dem 291 | zur | zu + der 292 | zwar | indeed 293 | zwischen | between 294 | 295 | -------------------------------------------------------------------------------- /conf/lang/stopwords_el.txt: -------------------------------------------------------------------------------- 1 | # Lucene Greek Stopwords list 2 | # Note: by default this file is used after GreekLowerCaseFilter, 3 | # so when modifying this file use 'σ' instead of 'ς' 4 | ο 5 | η 6 | το 7 | οι 8 | τα 9 | του 10 | τησ 11 | των 12 | τον 13 | την 14 | και 15 | κι 16 | κ 17 | ειμαι 18 | εισαι 19 | ειναι 20 | ειμαστε 21 | ειστε 22 | στο 23 | στον 24 | στη 25 | στην 26 | μα 27 | αλλα 28 | απο 29 | για 30 | προσ 31 | με 32 | σε 33 | ωσ 34 | παρα 35 | αντι 36 | κατα 37 | μετα 38 | θα 39 | να 40 | δε 41 | δεν 42 | μη 43 | μην 44 | επι 45 | ενω 46 | εαν 47 | αν 48 | τοτε 49 | που 50 | πωσ 51 | ποιοσ 52 | ποια 53 | ποιο 54 | ποιοι 55 | ποιεσ 56 | ποιων 57 | ποιουσ 58 | αυτοσ 59 | αυτη 60 | αυτο 61 | αυτοι 62 | αυτων 63 | αυτουσ 64 | αυτεσ 65 | αυτα 66 | εκεινοσ 67 | εκεινη 68 | εκεινο 69 | εκεινοι 70 | εκεινεσ 71 | εκεινα 72 | εκεινων 73 | εκεινουσ 74 | οπωσ 75 | ομωσ 76 | ισωσ 77 | οσο 78 | οτι 79 | -------------------------------------------------------------------------------- /conf/lang/stopwords_en.txt: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one or more 2 | # contributor license agreements. See the NOTICE file distributed with 3 | # this work for additional information regarding copyright ownership. 
4 | # The ASF licenses this file to You under the Apache License, Version 2.0 5 | # (the "License"); you may not use this file except in compliance with 6 | # the License. You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | 16 | # a couple of test stopwords to test that the words are really being 17 | # configured from this file: 18 | stopworda 19 | stopwordb 20 | 21 | # Standard english stop words taken from Lucene's StopAnalyzer 22 | a 23 | an 24 | and 25 | are 26 | as 27 | at 28 | be 29 | but 30 | by 31 | for 32 | if 33 | in 34 | into 35 | is 36 | it 37 | no 38 | not 39 | of 40 | on 41 | or 42 | such 43 | that 44 | the 45 | their 46 | then 47 | there 48 | these 49 | they 50 | this 51 | to 52 | was 53 | will 54 | with 55 | -------------------------------------------------------------------------------- /conf/lang/stopwords_es.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/spanish/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | A Spanish stop word list. Comments begin with vertical bar. Each stop 11 | | word is at the start of a line. 12 | 13 | 14 | | The following is a ranked list (commonest to rarest) of stopwords 15 | | deriving from a large sample of text. 
16 | 17 | | Extra words have been added at the end. 18 | 19 | de | from, of 20 | la | the, her 21 | que | who, that 22 | el | the 23 | en | in 24 | y | and 25 | a | to 26 | los | the, them 27 | del | de + el 28 | se | himself, from him etc 29 | las | the, them 30 | por | for, by, etc 31 | un | a 32 | para | for 33 | con | with 34 | no | no 35 | una | a 36 | su | his, her 37 | al | a + el 38 | | es from SER 39 | lo | him 40 | como | how 41 | más | more 42 | pero | pero 43 | sus | su plural 44 | le | to him, her 45 | ya | already 46 | o | or 47 | | fue from SER 48 | este | this 49 | | ha from HABER 50 | sí | himself etc 51 | porque | because 52 | esta | this 53 | | son from SER 54 | entre | between 55 | | está from ESTAR 56 | cuando | when 57 | muy | very 58 | sin | without 59 | sobre | on 60 | | ser from SER 61 | | tiene from TENER 62 | también | also 63 | me | me 64 | hasta | until 65 | hay | there is/are 66 | donde | where 67 | | han from HABER 68 | quien | whom, that 69 | | están from ESTAR 70 | | estado from ESTAR 71 | desde | from 72 | todo | all 73 | nos | us 74 | durante | during 75 | | estados from ESTAR 76 | todos | all 77 | uno | a 78 | les | to them 79 | ni | nor 80 | contra | against 81 | otros | other 82 | | fueron from SER 83 | ese | that 84 | eso | that 85 | | había from HABER 86 | ante | before 87 | ellos | they 88 | e | and (variant of y) 89 | esto | this 90 | mí | me 91 | antes | before 92 | algunos | some 93 | qué | what? 
94 | unos | a 95 | yo | I 96 | otro | other 97 | otras | other 98 | otra | other 99 | él | he 100 | tanto | so much, many 101 | esa | that 102 | estos | these 103 | mucho | much, many 104 | quienes | who 105 | nada | nothing 106 | muchos | many 107 | cual | who 108 | | sea from SER 109 | poco | few 110 | ella | she 111 | estar | to be 112 | | haber from HABER 113 | estas | these 114 | | estaba from ESTAR 115 | | estamos from ESTAR 116 | algunas | some 117 | algo | something 118 | nosotros | we 119 | 120 | | other forms 121 | 122 | mi | me 123 | mis | mi plural 124 | tú | thou 125 | te | thee 126 | ti | thee 127 | tu | thy 128 | tus | tu plural 129 | ellas | they 130 | nosotras | we 131 | vosotros | you 132 | vosotras | you 133 | os | you 134 | mío | mine 135 | mía | 136 | míos | 137 | mías | 138 | tuyo | thine 139 | tuya | 140 | tuyos | 141 | tuyas | 142 | suyo | his, hers, theirs 143 | suya | 144 | suyos | 145 | suyas | 146 | nuestro | ours 147 | nuestra | 148 | nuestros | 149 | nuestras | 150 | vuestro | yours 151 | vuestra | 152 | vuestros | 153 | vuestras | 154 | esos | those 155 | esas | those 156 | 157 | | forms of estar, to be (not including the infinitive): 158 | estoy 159 | estás 160 | está 161 | estamos 162 | estáis 163 | están 164 | esté 165 | estés 166 | estemos 167 | estéis 168 | estén 169 | estaré 170 | estarás 171 | estará 172 | estaremos 173 | estaréis 174 | estarán 175 | estaría 176 | estarías 177 | estaríamos 178 | estaríais 179 | estarían 180 | estaba 181 | estabas 182 | estábamos 183 | estabais 184 | estaban 185 | estuve 186 | estuviste 187 | estuvo 188 | estuvimos 189 | estuvisteis 190 | estuvieron 191 | estuviera 192 | estuvieras 193 | estuviéramos 194 | estuvierais 195 | estuvieran 196 | estuviese 197 | estuvieses 198 | estuviésemos 199 | estuvieseis 200 | estuviesen 201 | estando 202 | estado 203 | estada 204 | estados 205 | estadas 206 | estad 207 | 208 | | forms of haber, to have (not including the infinitive): 209 | he 210 | has 211 | ha 
212 | hemos 213 | habéis 214 | han 215 | haya 216 | hayas 217 | hayamos 218 | hayáis 219 | hayan 220 | habré 221 | habrás 222 | habrá 223 | habremos 224 | habréis 225 | habrán 226 | habría 227 | habrías 228 | habríamos 229 | habríais 230 | habrían 231 | había 232 | habías 233 | habíamos 234 | habíais 235 | habían 236 | hube 237 | hubiste 238 | hubo 239 | hubimos 240 | hubisteis 241 | hubieron 242 | hubiera 243 | hubieras 244 | hubiéramos 245 | hubierais 246 | hubieran 247 | hubiese 248 | hubieses 249 | hubiésemos 250 | hubieseis 251 | hubiesen 252 | habiendo 253 | habido 254 | habida 255 | habidos 256 | habidas 257 | 258 | | forms of ser, to be (not including the infinitive): 259 | soy 260 | eres 261 | es 262 | somos 263 | sois 264 | son 265 | sea 266 | seas 267 | seamos 268 | seáis 269 | sean 270 | seré 271 | serás 272 | será 273 | seremos 274 | seréis 275 | serán 276 | sería 277 | serías 278 | seríamos 279 | seríais 280 | serían 281 | era 282 | eras 283 | éramos 284 | erais 285 | eran 286 | fui 287 | fuiste 288 | fue 289 | fuimos 290 | fuisteis 291 | fueron 292 | fuera 293 | fueras 294 | fuéramos 295 | fuerais 296 | fueran 297 | fuese 298 | fueses 299 | fuésemos 300 | fueseis 301 | fuesen 302 | siendo 303 | sido 304 | | sed also means 'thirst' 305 | 306 | | forms of tener, to have (not including the infinitive): 307 | tengo 308 | tienes 309 | tiene 310 | tenemos 311 | tenéis 312 | tienen 313 | tenga 314 | tengas 315 | tengamos 316 | tengáis 317 | tengan 318 | tendré 319 | tendrás 320 | tendrá 321 | tendremos 322 | tendréis 323 | tendrán 324 | tendría 325 | tendrías 326 | tendríamos 327 | tendríais 328 | tendrían 329 | tenía 330 | tenías 331 | teníamos 332 | teníais 333 | tenían 334 | tuve 335 | tuviste 336 | tuvo 337 | tuvimos 338 | tuvisteis 339 | tuvieron 340 | tuviera 341 | tuvieras 342 | tuviéramos 343 | tuvierais 344 | tuvieran 345 | tuviese 346 | tuvieses 347 | tuviésemos 348 | tuvieseis 349 | tuviesen 350 | teniendo 351 | tenido 352 | tenida 353 | tenidos 
354 | tenidas 355 | tened 356 | 357 | -------------------------------------------------------------------------------- /conf/lang/stopwords_eu.txt: -------------------------------------------------------------------------------- 1 | # example set of Basque stopwords 2 | al 3 | anitz 4 | arabera 5 | asko 6 | baina 7 | bat 8 | batean 9 | batek 10 | bati 11 | batzuei 12 | batzuek 13 | batzuetan 14 | batzuk 15 | bera 16 | beraiek 17 | berau 18 | berauek 19 | bere 20 | berori 21 | beroriek 22 | beste 23 | bezala 24 | da 25 | dago 26 | dira 27 | ditu 28 | du 29 | dute 30 | edo 31 | egin 32 | ere 33 | eta 34 | eurak 35 | ez 36 | gainera 37 | gu 38 | gutxi 39 | guzti 40 | haiei 41 | haiek 42 | haietan 43 | hainbeste 44 | hala 45 | han 46 | handik 47 | hango 48 | hara 49 | hari 50 | hark 51 | hartan 52 | hau 53 | hauei 54 | hauek 55 | hauetan 56 | hemen 57 | hemendik 58 | hemengo 59 | hi 60 | hona 61 | honek 62 | honela 63 | honetan 64 | honi 65 | hor 66 | hori 67 | horiei 68 | horiek 69 | horietan 70 | horko 71 | horra 72 | horrek 73 | horrela 74 | horretan 75 | horri 76 | hortik 77 | hura 78 | izan 79 | ni 80 | noiz 81 | nola 82 | non 83 | nondik 84 | nongo 85 | nor 86 | nora 87 | ze 88 | zein 89 | zen 90 | zenbait 91 | zenbat 92 | zer 93 | zergatik 94 | ziren 95 | zituen 96 | zu 97 | zuek 98 | zuen 99 | zuten 100 | -------------------------------------------------------------------------------- /conf/lang/stopwords_fa.txt: -------------------------------------------------------------------------------- 1 | # This file was created by Jacques Savoy and is distributed under the BSD license. 2 | # See http://members.unine.ch/jacques.savoy/clef/index.html. 
3 | # Also see http://www.opensource.org/licenses/bsd-license.html 4 | # Note: by default this file is used after normalization, so when adding entries 5 | # to this file, use the arabic 'ي' instead of 'ی' 6 | انان 7 | نداشته 8 | سراسر 9 | خياه 10 | ايشان 11 | وي 12 | تاكنون 13 | بيشتري 14 | دوم 15 | پس 16 | ناشي 17 | وگو 18 | يا 19 | داشتند 20 | سپس 21 | هنگام 22 | هرگز 23 | پنج 24 | نشان 25 | امسال 26 | ديگر 27 | گروهي 28 | شدند 29 | چطور 30 | ده 31 | و 32 | دو 33 | نخستين 34 | ولي 35 | چرا 36 | چه 37 | وسط 38 | ه 39 | كدام 40 | قابل 41 | يك 42 | رفت 43 | هفت 44 | همچنين 45 | در 46 | هزار 47 | بله 48 | بلي 49 | شايد 50 | اما 51 | شناسي 52 | گرفته 53 | دهد 54 | داشته 55 | دانست 56 | داشتن 57 | خواهيم 58 | ميليارد 59 | وقتيكه 60 | امد 61 | خواهد 62 | جز 63 | اورده 64 | شده 65 | بلكه 66 | خدمات 67 | شدن 68 | برخي 69 | نبود 70 | بسياري 71 | جلوگيري 72 | حق 73 | كردند 74 | نوعي 75 | بعري 76 | نكرده 77 | نظير 78 | نبايد 79 | بوده 80 | بودن 81 | داد 82 | اورد 83 | هست 84 | جايي 85 | شود 86 | دنبال 87 | داده 88 | بايد 89 | سابق 90 | هيچ 91 | همان 92 | انجا 93 | كمتر 94 | كجاست 95 | گردد 96 | كسي 97 | تر 98 | مردم 99 | تان 100 | دادن 101 | بودند 102 | سري 103 | جدا 104 | ندارند 105 | مگر 106 | يكديگر 107 | دارد 108 | دهند 109 | بنابراين 110 | هنگامي 111 | سمت 112 | جا 113 | انچه 114 | خود 115 | دادند 116 | زياد 117 | دارند 118 | اثر 119 | بدون 120 | بهترين 121 | بيشتر 122 | البته 123 | به 124 | براساس 125 | بيرون 126 | كرد 127 | بعضي 128 | گرفت 129 | توي 130 | اي 131 | ميليون 132 | او 133 | جريان 134 | تول 135 | بر 136 | مانند 137 | برابر 138 | باشيم 139 | مدتي 140 | گويند 141 | اكنون 142 | تا 143 | تنها 144 | جديد 145 | چند 146 | بي 147 | نشده 148 | كردن 149 | كردم 150 | گويد 151 | كرده 152 | كنيم 153 | نمي 154 | نزد 155 | روي 156 | قصد 157 | فقط 158 | بالاي 159 | ديگران 160 | اين 161 | ديروز 162 | توسط 163 | سوم 164 | ايم 165 | دانند 166 | سوي 167 | استفاده 168 | شما 169 | كنار 170 | داريم 171 | ساخته 172 | طور 173 | امده 174 | رفته 175 | نخست 176 | بيست 177 | نزديك 178 
| طي 179 | كنيد 180 | از 181 | انها 182 | تمامي 183 | داشت 184 | يكي 185 | طريق 186 | اش 187 | چيست 188 | روب 189 | نمايد 190 | گفت 191 | چندين 192 | چيزي 193 | تواند 194 | ام 195 | ايا 196 | با 197 | ان 198 | ايد 199 | ترين 200 | اينكه 201 | ديگري 202 | راه 203 | هايي 204 | بروز 205 | همچنان 206 | پاعين 207 | كس 208 | حدود 209 | مختلف 210 | مقابل 211 | چيز 212 | گيرد 213 | ندارد 214 | ضد 215 | همچون 216 | سازي 217 | شان 218 | مورد 219 | باره 220 | مرسي 221 | خويش 222 | برخوردار 223 | چون 224 | خارج 225 | شش 226 | هنوز 227 | تحت 228 | ضمن 229 | هستيم 230 | گفته 231 | فكر 232 | بسيار 233 | پيش 234 | براي 235 | روزهاي 236 | انكه 237 | نخواهد 238 | بالا 239 | كل 240 | وقتي 241 | كي 242 | چنين 243 | كه 244 | گيري 245 | نيست 246 | است 247 | كجا 248 | كند 249 | نيز 250 | يابد 251 | بندي 252 | حتي 253 | توانند 254 | عقب 255 | خواست 256 | كنند 257 | بين 258 | تمام 259 | همه 260 | ما 261 | باشند 262 | مثل 263 | شد 264 | اري 265 | باشد 266 | اره 267 | طبق 268 | بعد 269 | اگر 270 | صورت 271 | غير 272 | جاي 273 | بيش 274 | ريزي 275 | اند 276 | زيرا 277 | چگونه 278 | بار 279 | لطفا 280 | مي 281 | درباره 282 | من 283 | ديده 284 | همين 285 | گذاري 286 | برداري 287 | علت 288 | گذاشته 289 | هم 290 | فوق 291 | نه 292 | ها 293 | شوند 294 | اباد 295 | همواره 296 | هر 297 | اول 298 | خواهند 299 | چهار 300 | نام 301 | امروز 302 | مان 303 | هاي 304 | قبل 305 | كنم 306 | سعي 307 | تازه 308 | را 309 | هستند 310 | زير 311 | جلوي 312 | عنوان 313 | بود 314 | -------------------------------------------------------------------------------- /conf/lang/stopwords_fi.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/finnish/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 
7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | forms of BE 11 | 12 | olla 13 | olen 14 | olet 15 | on 16 | olemme 17 | olette 18 | ovat 19 | ole | negative form 20 | 21 | oli 22 | olisi 23 | olisit 24 | olisin 25 | olisimme 26 | olisitte 27 | olisivat 28 | olit 29 | olin 30 | olimme 31 | olitte 32 | olivat 33 | ollut 34 | olleet 35 | 36 | en | negation 37 | et 38 | ei 39 | emme 40 | ette 41 | eivät 42 | 43 | |Nom Gen Acc Part Iness Elat Illat Adess Ablat Allat Ess Trans 44 | minä minun minut minua minussa minusta minuun minulla minulta minulle | I 45 | sinä sinun sinut sinua sinussa sinusta sinuun sinulla sinulta sinulle | you 46 | hän hänen hänet häntä hänessä hänestä häneen hänellä häneltä hänelle | he she 47 | me meidän meidät meitä meissä meistä meihin meillä meiltä meille | we 48 | te teidän teidät teitä teissä teistä teihin teillä teiltä teille | you 49 | he heidän heidät heitä heissä heistä heihin heillä heiltä heille | they 50 | 51 | tämä tämän tätä tässä tästä tähän tallä tältä tälle tänä täksi | this 52 | tuo tuon tuotä tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi | that 53 | se sen sitä siinä siitä siihen sillä siltä sille sinä siksi | it 54 | nämä näiden näitä näissä näistä näihin näillä näiltä näille näinä näiksi | these 55 | nuo noiden noita noissa noista noihin noilla noilta noille noina noiksi | those 56 | ne niiden niitä niissä niistä niihin niillä niiltä niille niinä niiksi | they 57 | 58 | kuka kenen kenet ketä kenessä kenestä keneen kenellä keneltä kenelle kenenä keneksi| who 59 | ketkä keiden ketkä keitä keissä keistä keihin keillä keiltä keille keinä keiksi | (pl) 60 | mikä minkä minkä mitä missä mistä mihin millä miltä mille minä miksi | which what 61 | mitkä | (pl) 62 | 63 | joka jonka jota jossa josta johon jolla jolta jolle jona joksi | who which 64 | jotka joiden joita joissa joista joihin joilla joilta joille joina joiksi | (pl) 65 | 66 | | conjunctions 67 | 68 | että | 
that 69 | ja | and 70 | jos | if 71 | koska | because 72 | kuin | than 73 | mutta | but 74 | niin | so 75 | sekä | and 76 | sillä | for 77 | tai | or 78 | vaan | but 79 | vai | or 80 | vaikka | although 81 | 82 | 83 | | prepositions 84 | 85 | kanssa | with 86 | mukaan | according to 87 | noin | about 88 | poikki | across 89 | yli | over, across 90 | 91 | | other 92 | 93 | kun | when 94 | niin | so 95 | nyt | now 96 | itse | self 97 | 98 | -------------------------------------------------------------------------------- /conf/lang/stopwords_fr.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/french/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | A French stop word list. Comments begin with vertical bar. Each stop 11 | | word is at the start of a line. 
12 | 13 | au | a + le 14 | aux | a + les 15 | avec | with 16 | ce | this 17 | ces | these 18 | dans | with 19 | de | of 20 | des | de + les 21 | du | de + le 22 | elle | she 23 | en | `of them' etc 24 | et | and 25 | eux | them 26 | il | he 27 | je | I 28 | la | the 29 | le | the 30 | leur | their 31 | lui | him 32 | ma | my (fem) 33 | mais | but 34 | me | me 35 | même | same; as in moi-même (myself) etc 36 | mes | me (pl) 37 | moi | me 38 | mon | my (masc) 39 | ne | not 40 | nos | our (pl) 41 | notre | our 42 | nous | we 43 | on | one 44 | ou | where 45 | par | by 46 | pas | not 47 | pour | for 48 | qu | que before vowel 49 | que | that 50 | qui | who 51 | sa | his, her (fem) 52 | se | oneself 53 | ses | his (pl) 54 | son | his, her (masc) 55 | sur | on 56 | ta | thy (fem) 57 | te | thee 58 | tes | thy (pl) 59 | toi | thee 60 | ton | thy (masc) 61 | tu | thou 62 | un | a 63 | une | a 64 | vos | your (pl) 65 | votre | your 66 | vous | you 67 | 68 | | single letter forms 69 | 70 | c | c' 71 | d | d' 72 | j | j' 73 | l | l' 74 | à | to, at 75 | m | m' 76 | n | n' 77 | s | s' 78 | t | t' 79 | y | there 80 | 81 | | forms of être (not including the infinitive): 82 | été 83 | étée 84 | étées 85 | étés 86 | étant 87 | suis 88 | es 89 | est 90 | sommes 91 | êtes 92 | sont 93 | serai 94 | seras 95 | sera 96 | serons 97 | serez 98 | seront 99 | serais 100 | serait 101 | serions 102 | seriez 103 | seraient 104 | étais 105 | était 106 | étions 107 | étiez 108 | étaient 109 | fus 110 | fut 111 | fûmes 112 | fûtes 113 | furent 114 | sois 115 | soit 116 | soyons 117 | soyez 118 | soient 119 | fusse 120 | fusses 121 | fût 122 | fussions 123 | fussiez 124 | fussent 125 | 126 | | forms of avoir (not including the infinitive): 127 | ayant 128 | eu 129 | eue 130 | eues 131 | eus 132 | ai 133 | as 134 | avons 135 | avez 136 | ont 137 | aurai 138 | auras 139 | aura 140 | aurons 141 | aurez 142 | auront 143 | aurais 144 | aurait 145 | aurions 146 | auriez 147 | auraient 148 | avais 149 | 
avait 150 | avions 151 | aviez 152 | avaient 153 | eut 154 | eûmes 155 | eûtes 156 | eurent 157 | aie 158 | aies 159 | ait 160 | ayons 161 | ayez 162 | aient 163 | eusse 164 | eusses 165 | eût 166 | eussions 167 | eussiez 168 | eussent 169 | 170 | | Later additions (from Jean-Christophe Deschamps) 171 | ceci | this 172 | cela | that 173 | celà | that 174 | cet | this 175 | cette | this 176 | ici | here 177 | ils | they 178 | les | the (pl) 179 | leurs | their (pl) 180 | quel | which 181 | quels | which 182 | quelle | which 183 | quelles | which 184 | sans | without 185 | soi | oneself 186 | 187 | -------------------------------------------------------------------------------- /conf/lang/stopwords_ga.txt: -------------------------------------------------------------------------------- 1 | 2 | a 3 | ach 4 | ag 5 | agus 6 | an 7 | aon 8 | ar 9 | arna 10 | as 11 | b' 12 | ba 13 | beirt 14 | bhúr 15 | caoga 16 | ceathair 17 | ceathrar 18 | chomh 19 | chtó 20 | chuig 21 | chun 22 | cois 23 | céad 24 | cúig 25 | cúigear 26 | d' 27 | daichead 28 | dar 29 | de 30 | deich 31 | deichniúr 32 | den 33 | dhá 34 | do 35 | don 36 | dtí 37 | dá 38 | dár 39 | dó 40 | faoi 41 | faoin 42 | faoina 43 | faoinár 44 | fara 45 | fiche 46 | gach 47 | gan 48 | go 49 | gur 50 | haon 51 | hocht 52 | i 53 | iad 54 | idir 55 | in 56 | ina 57 | ins 58 | inár 59 | is 60 | le 61 | leis 62 | lena 63 | lenár 64 | m' 65 | mar 66 | mo 67 | mé 68 | na 69 | nach 70 | naoi 71 | naonúr 72 | ná 73 | ní 74 | níor 75 | nó 76 | nócha 77 | ocht 78 | ochtar 79 | os 80 | roimh 81 | sa 82 | seacht 83 | seachtar 84 | seachtó 85 | seasca 86 | seisear 87 | siad 88 | sibh 89 | sinn 90 | sna 91 | sé 92 | sí 93 | tar 94 | thar 95 | thú 96 | triúr 97 | trí 98 | trína 99 | trínár 100 | tríocha 101 | tú 102 | um 103 | ár 104 | é 105 | éis 106 | í 107 | ó 108 | ón 109 | óna 110 | ónár 111 | -------------------------------------------------------------------------------- /conf/lang/stopwords_gl.txt: 
-------------------------------------------------------------------------------- 1 | # Galician stopwords 2 | a 3 | aínda 4 | alí 5 | aquel 6 | aquela 7 | aquelas 8 | aqueles 9 | aquilo 10 | aquí 11 | ao 12 | aos 13 | as 14 | así 15 | á 16 | ben 17 | cando 18 | che 19 | co 20 | coa 21 | comigo 22 | con 23 | connosco 24 | contigo 25 | convosco 26 | coas 27 | cos 28 | cun 29 | cuns 30 | cunha 31 | cunhas 32 | da 33 | dalgunha 34 | dalgunhas 35 | dalgún 36 | dalgúns 37 | das 38 | de 39 | del 40 | dela 41 | delas 42 | deles 43 | desde 44 | deste 45 | do 46 | dos 47 | dun 48 | duns 49 | dunha 50 | dunhas 51 | e 52 | el 53 | ela 54 | elas 55 | eles 56 | en 57 | era 58 | eran 59 | esa 60 | esas 61 | ese 62 | eses 63 | esta 64 | estar 65 | estaba 66 | está 67 | están 68 | este 69 | estes 70 | estiven 71 | estou 72 | eu 73 | é 74 | facer 75 | foi 76 | foron 77 | fun 78 | había 79 | hai 80 | iso 81 | isto 82 | la 83 | las 84 | lle 85 | lles 86 | lo 87 | los 88 | mais 89 | me 90 | meu 91 | meus 92 | min 93 | miña 94 | miñas 95 | moi 96 | na 97 | nas 98 | neste 99 | nin 100 | no 101 | non 102 | nos 103 | nosa 104 | nosas 105 | noso 106 | nosos 107 | nós 108 | nun 109 | nunha 110 | nuns 111 | nunhas 112 | o 113 | os 114 | ou 115 | ó 116 | ós 117 | para 118 | pero 119 | pode 120 | pois 121 | pola 122 | polas 123 | polo 124 | polos 125 | por 126 | que 127 | se 128 | senón 129 | ser 130 | seu 131 | seus 132 | sexa 133 | sido 134 | sobre 135 | súa 136 | súas 137 | tamén 138 | tan 139 | te 140 | ten 141 | teñen 142 | teño 143 | ter 144 | teu 145 | teus 146 | ti 147 | tido 148 | tiña 149 | tiven 150 | túa 151 | túas 152 | un 153 | unha 154 | unhas 155 | uns 156 | vos 157 | vosa 158 | vosas 159 | voso 160 | vosos 161 | vós 162 | -------------------------------------------------------------------------------- /conf/lang/stopwords_hi.txt: -------------------------------------------------------------------------------- 1 | # Also see http://www.opensource.org/licenses/bsd-license.html 2 | 
# See http://members.unine.ch/jacques.savoy/clef/index.html. 3 | # This file was created by Jacques Savoy and is distributed under the BSD license. 4 | # Note: by default this file also contains forms normalized by HindiNormalizer 5 | # for spelling variation (see section below), such that it can be used whether or 6 | # not you enable that feature. When adding additional entries to this list, 7 | # please add the normalized form as well. 8 | अंदर 9 | अत 10 | अपना 11 | अपनी 12 | अपने 13 | अभी 14 | आदि 15 | आप 16 | इत्यादि 17 | इन 18 | इनका 19 | इन्हीं 20 | इन्हें 21 | इन्हों 22 | इस 23 | इसका 24 | इसकी 25 | इसके 26 | इसमें 27 | इसी 28 | इसे 29 | उन 30 | उनका 31 | उनकी 32 | उनके 33 | उनको 34 | उन्हीं 35 | उन्हें 36 | उन्हों 37 | उस 38 | उसके 39 | उसी 40 | उसे 41 | एक 42 | एवं 43 | एस 44 | ऐसे 45 | और 46 | कई 47 | कर 48 | करता 49 | करते 50 | करना 51 | करने 52 | करें 53 | कहते 54 | कहा 55 | का 56 | काफ़ी 57 | कि 58 | कितना 59 | किन्हें 60 | किन्हों 61 | किया 62 | किर 63 | किस 64 | किसी 65 | किसे 66 | की 67 | कुछ 68 | कुल 69 | के 70 | को 71 | कोई 72 | कौन 73 | कौनसा 74 | गया 75 | घर 76 | जब 77 | जहाँ 78 | जा 79 | जितना 80 | जिन 81 | जिन्हें 82 | जिन्हों 83 | जिस 84 | जिसे 85 | जीधर 86 | जैसा 87 | जैसे 88 | जो 89 | तक 90 | तब 91 | तरह 92 | तिन 93 | तिन्हें 94 | तिन्हों 95 | तिस 96 | तिसे 97 | तो 98 | था 99 | थी 100 | थे 101 | दबारा 102 | दिया 103 | दुसरा 104 | दूसरे 105 | दो 106 | द्वारा 107 | न 108 | नहीं 109 | ना 110 | निहायत 111 | नीचे 112 | ने 113 | पर 114 | पर 115 | पहले 116 | पूरा 117 | पे 118 | फिर 119 | बनी 120 | बही 121 | बहुत 122 | बाद 123 | बाला 124 | बिलकुल 125 | भी 126 | भीतर 127 | मगर 128 | मानो 129 | मे 130 | में 131 | यदि 132 | यह 133 | यहाँ 134 | यही 135 | या 136 | यिह 137 | ये 138 | रखें 139 | रहा 140 | रहे 141 | ऱ्वासा 142 | लिए 143 | लिये 144 | लेकिन 145 | व 146 | वर्ग 147 | वह 148 | वह 149 | वहाँ 150 | वहीं 151 | वाले 152 | वुह 153 | वे 154 | वग़ैरह 155 | संग 156 | सकता 157 | सकते 158 | सबसे 159 | सभी 160 | साथ 161 | साबुत 162 | साभ 163 | सारा 164 | 
से 165 | सो 166 | ही 167 | हुआ 168 | हुई 169 | हुए 170 | है 171 | हैं 172 | हो 173 | होता 174 | होती 175 | होते 176 | होना 177 | होने 178 | # additional normalized forms of the above 179 | अपनि 180 | जेसे 181 | होति 182 | सभि 183 | तिंहों 184 | इंहों 185 | दवारा 186 | इसि 187 | किंहें 188 | थि 189 | उंहों 190 | ओर 191 | जिंहें 192 | वहिं 193 | अभि 194 | बनि 195 | हि 196 | उंहिं 197 | उंहें 198 | हें 199 | वगेरह 200 | एसे 201 | रवासा 202 | कोन 203 | निचे 204 | काफि 205 | उसि 206 | पुरा 207 | भितर 208 | हे 209 | बहि 210 | वहां 211 | कोइ 212 | यहां 213 | जिंहों 214 | तिंहें 215 | किसि 216 | कइ 217 | यहि 218 | इंहिं 219 | जिधर 220 | इंहें 221 | अदि 222 | इतयादि 223 | हुइ 224 | कोनसा 225 | इसकि 226 | दुसरे 227 | जहां 228 | अप 229 | किंहों 230 | उनकि 231 | भि 232 | वरग 233 | हुअ 234 | जेसा 235 | नहिं 236 | -------------------------------------------------------------------------------- /conf/lang/stopwords_hu.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/hungarian/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 
7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | Hungarian stop word list 11 | | prepared by Anna Tordai 12 | 13 | a 14 | ahogy 15 | ahol 16 | aki 17 | akik 18 | akkor 19 | alatt 20 | által 21 | általában 22 | amely 23 | amelyek 24 | amelyekben 25 | amelyeket 26 | amelyet 27 | amelynek 28 | ami 29 | amit 30 | amolyan 31 | amíg 32 | amikor 33 | át 34 | abban 35 | ahhoz 36 | annak 37 | arra 38 | arról 39 | az 40 | azok 41 | azon 42 | azt 43 | azzal 44 | azért 45 | aztán 46 | azután 47 | azonban 48 | bár 49 | be 50 | belül 51 | benne 52 | cikk 53 | cikkek 54 | cikkeket 55 | csak 56 | de 57 | e 58 | eddig 59 | egész 60 | egy 61 | egyes 62 | egyetlen 63 | egyéb 64 | egyik 65 | egyre 66 | ekkor 67 | el 68 | elég 69 | ellen 70 | elő 71 | először 72 | előtt 73 | első 74 | én 75 | éppen 76 | ebben 77 | ehhez 78 | emilyen 79 | ennek 80 | erre 81 | ez 82 | ezt 83 | ezek 84 | ezen 85 | ezzel 86 | ezért 87 | és 88 | fel 89 | felé 90 | hanem 91 | hiszen 92 | hogy 93 | hogyan 94 | igen 95 | így 96 | illetve 97 | ill. 
98 | ill 99 | ilyen 100 | ilyenkor 101 | ison 102 | ismét 103 | itt 104 | jó 105 | jól 106 | jobban 107 | kell 108 | kellett 109 | keresztül 110 | keressünk 111 | ki 112 | kívül 113 | között 114 | közül 115 | legalább 116 | lehet 117 | lehetett 118 | legyen 119 | lenne 120 | lenni 121 | lesz 122 | lett 123 | maga 124 | magát 125 | majd 126 | majd 127 | már 128 | más 129 | másik 130 | meg 131 | még 132 | mellett 133 | mert 134 | mely 135 | melyek 136 | mi 137 | mit 138 | míg 139 | miért 140 | milyen 141 | mikor 142 | minden 143 | mindent 144 | mindenki 145 | mindig 146 | mint 147 | mintha 148 | mivel 149 | most 150 | nagy 151 | nagyobb 152 | nagyon 153 | ne 154 | néha 155 | nekem 156 | neki 157 | nem 158 | néhány 159 | nélkül 160 | nincs 161 | olyan 162 | ott 163 | össze 164 | ő 165 | ők 166 | őket 167 | pedig 168 | persze 169 | rá 170 | s 171 | saját 172 | sem 173 | semmi 174 | sok 175 | sokat 176 | sokkal 177 | számára 178 | szemben 179 | szerint 180 | szinte 181 | talán 182 | tehát 183 | teljes 184 | tovább 185 | továbbá 186 | több 187 | úgy 188 | ugyanis 189 | új 190 | újabb 191 | újra 192 | után 193 | utána 194 | utolsó 195 | vagy 196 | vagyis 197 | valaki 198 | valami 199 | valamint 200 | való 201 | vagyok 202 | van 203 | vannak 204 | volt 205 | voltam 206 | voltak 207 | voltunk 208 | vissza 209 | vele 210 | viszont 211 | volna 212 | -------------------------------------------------------------------------------- /conf/lang/stopwords_hy.txt: -------------------------------------------------------------------------------- 1 | # example set of Armenian stopwords. 
2 | այդ 3 | այլ 4 | այն 5 | այս 6 | դու 7 | դուք 8 | եմ 9 | են 10 | ենք 11 | ես 12 | եք 13 | է 14 | էի 15 | էին 16 | էինք 17 | էիր 18 | էիք 19 | էր 20 | ըստ 21 | թ 22 | ի 23 | ին 24 | իսկ 25 | իր 26 | կամ 27 | համար 28 | հետ 29 | հետո 30 | մենք 31 | մեջ 32 | մի 33 | ն 34 | նա 35 | նաև 36 | նրա 37 | նրանք 38 | որ 39 | որը 40 | որոնք 41 | որպես 42 | ու 43 | ում 44 | պիտի 45 | վրա 46 | և 47 | -------------------------------------------------------------------------------- /conf/lang/stopwords_id.txt: -------------------------------------------------------------------------------- 1 | # from appendix D of: A Study of Stemming Effects on Information 2 | # Retrieval in Bahasa Indonesia 3 | ada 4 | adanya 5 | adalah 6 | adapun 7 | agak 8 | agaknya 9 | agar 10 | akan 11 | akankah 12 | akhirnya 13 | aku 14 | akulah 15 | amat 16 | amatlah 17 | anda 18 | andalah 19 | antar 20 | diantaranya 21 | antara 22 | antaranya 23 | diantara 24 | apa 25 | apaan 26 | mengapa 27 | apabila 28 | apakah 29 | apalagi 30 | apatah 31 | atau 32 | ataukah 33 | ataupun 34 | bagai 35 | bagaikan 36 | sebagai 37 | sebagainya 38 | bagaimana 39 | bagaimanapun 40 | sebagaimana 41 | bagaimanakah 42 | bagi 43 | bahkan 44 | bahwa 45 | bahwasanya 46 | sebaliknya 47 | banyak 48 | sebanyak 49 | beberapa 50 | seberapa 51 | begini 52 | beginian 53 | beginikah 54 | beginilah 55 | sebegini 56 | begitu 57 | begitukah 58 | begitulah 59 | begitupun 60 | sebegitu 61 | belum 62 | belumlah 63 | sebelum 64 | sebelumnya 65 | sebenarnya 66 | berapa 67 | berapakah 68 | berapalah 69 | berapapun 70 | betulkah 71 | sebetulnya 72 | biasa 73 | biasanya 74 | bila 75 | bilakah 76 | bisa 77 | bisakah 78 | sebisanya 79 | boleh 80 | bolehkah 81 | bolehlah 82 | buat 83 | bukan 84 | bukankah 85 | bukanlah 86 | bukannya 87 | cuma 88 | percuma 89 | dahulu 90 | dalam 91 | dan 92 | dapat 93 | dari 94 | daripada 95 | dekat 96 | demi 97 | demikian 98 | demikianlah 99 | sedemikian 100 | dengan 101 | depan 102 | di 103 | dia 104 | dialah 105 | 
dini 106 | diri 107 | dirinya 108 | terdiri 109 | dong 110 | dulu 111 | enggak 112 | enggaknya 113 | entah 114 | entahlah 115 | terhadap 116 | terhadapnya 117 | hal 118 | hampir 119 | hanya 120 | hanyalah 121 | harus 122 | haruslah 123 | harusnya 124 | seharusnya 125 | hendak 126 | hendaklah 127 | hendaknya 128 | hingga 129 | sehingga 130 | ia 131 | ialah 132 | ibarat 133 | ingin 134 | inginkah 135 | inginkan 136 | ini 137 | inikah 138 | inilah 139 | itu 140 | itukah 141 | itulah 142 | jangan 143 | jangankan 144 | janganlah 145 | jika 146 | jikalau 147 | juga 148 | justru 149 | kala 150 | kalau 151 | kalaulah 152 | kalaupun 153 | kalian 154 | kami 155 | kamilah 156 | kamu 157 | kamulah 158 | kan 159 | kapan 160 | kapankah 161 | kapanpun 162 | dikarenakan 163 | karena 164 | karenanya 165 | ke 166 | kecil 167 | kemudian 168 | kenapa 169 | kepada 170 | kepadanya 171 | ketika 172 | seketika 173 | khususnya 174 | kini 175 | kinilah 176 | kiranya 177 | sekiranya 178 | kita 179 | kitalah 180 | kok 181 | lagi 182 | lagian 183 | selagi 184 | lah 185 | lain 186 | lainnya 187 | melainkan 188 | selaku 189 | lalu 190 | melalui 191 | terlalu 192 | lama 193 | lamanya 194 | selama 195 | selama 196 | selamanya 197 | lebih 198 | terlebih 199 | bermacam 200 | macam 201 | semacam 202 | maka 203 | makanya 204 | makin 205 | malah 206 | malahan 207 | mampu 208 | mampukah 209 | mana 210 | manakala 211 | manalagi 212 | masih 213 | masihkah 214 | semasih 215 | masing 216 | mau 217 | maupun 218 | semaunya 219 | memang 220 | mereka 221 | merekalah 222 | meski 223 | meskipun 224 | semula 225 | mungkin 226 | mungkinkah 227 | nah 228 | namun 229 | nanti 230 | nantinya 231 | nyaris 232 | oleh 233 | olehnya 234 | seorang 235 | seseorang 236 | pada 237 | padanya 238 | padahal 239 | paling 240 | sepanjang 241 | pantas 242 | sepantasnya 243 | sepantasnyalah 244 | para 245 | pasti 246 | pastilah 247 | per 248 | pernah 249 | pula 250 | pun 251 | merupakan 252 | rupanya 253 | serupa 254 | saat 255 | 
saatnya 256 | sesaat 257 | saja 258 | sajalah 259 | saling 260 | bersama 261 | sama 262 | sesama 263 | sambil 264 | sampai 265 | sana 266 | sangat 267 | sangatlah 268 | saya 269 | sayalah 270 | se 271 | sebab 272 | sebabnya 273 | sebuah 274 | tersebut 275 | tersebutlah 276 | sedang 277 | sedangkan 278 | sedikit 279 | sedikitnya 280 | segala 281 | segalanya 282 | segera 283 | sesegera 284 | sejak 285 | sejenak 286 | sekali 287 | sekalian 288 | sekalipun 289 | sesekali 290 | sekaligus 291 | sekarang 292 | sekarang 293 | sekitar 294 | sekitarnya 295 | sela 296 | selain 297 | selalu 298 | seluruh 299 | seluruhnya 300 | semakin 301 | sementara 302 | sempat 303 | semua 304 | semuanya 305 | sendiri 306 | sendirinya 307 | seolah 308 | seperti 309 | sepertinya 310 | sering 311 | seringnya 312 | serta 313 | siapa 314 | siapakah 315 | siapapun 316 | disini 317 | disinilah 318 | sini 319 | sinilah 320 | sesuatu 321 | sesuatunya 322 | suatu 323 | sesudah 324 | sesudahnya 325 | sudah 326 | sudahkah 327 | sudahlah 328 | supaya 329 | tadi 330 | tadinya 331 | tak 332 | tanpa 333 | setelah 334 | telah 335 | tentang 336 | tentu 337 | tentulah 338 | tentunya 339 | tertentu 340 | seterusnya 341 | tapi 342 | tetapi 343 | setiap 344 | tiap 345 | setidaknya 346 | tidak 347 | tidakkah 348 | tidaklah 349 | toh 350 | waduh 351 | wah 352 | wahai 353 | sewaktu 354 | walau 355 | walaupun 356 | wong 357 | yaitu 358 | yakni 359 | yang 360 | -------------------------------------------------------------------------------- /conf/lang/stopwords_it.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/italian/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 
7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | An Italian stop word list. Comments begin with vertical bar. Each stop 11 | | word is at the start of a line. 12 | 13 | ad | a (to) before vowel 14 | al | a + il 15 | allo | a + lo 16 | ai | a + i 17 | agli | a + gli 18 | all | a + l' 19 | agl | a + gl' 20 | alla | a + la 21 | alle | a + le 22 | con | with 23 | col | con + il 24 | coi | con + i (forms collo, cogli etc are now very rare) 25 | da | from 26 | dal | da + il 27 | dallo | da + lo 28 | dai | da + i 29 | dagli | da + gli 30 | dall | da + l' 31 | dagl | da + gl' 32 | dalla | da + la 33 | dalle | da + le 34 | di | of 35 | del | di + il 36 | dello | di + lo 37 | dei | di + i 38 | degli | di + gli 39 | dell | di + l' 40 | degl | di + gl' 41 | della | di + la 42 | delle | di + le 43 | in | in 44 | nel | in + il 45 | nello | in + lo 46 | nei | in + i 47 | negli | in + gli 48 | nell | in + l' 49 | negl | in + gl' 50 | nella | in + la 51 | nelle | in + le 52 | su | on 53 | sul | su + il 54 | sullo | su + lo 55 | sui | su + i 56 | sugli | su + gli 57 | sull | su + l' 58 | sugl | su + gl' 59 | sulla | su + la 60 | sulle | su + le 61 | per | through, by 62 | tra | among 63 | contro | against 64 | io | I 65 | tu | thou 66 | lui | he 67 | lei | she 68 | noi | we 69 | voi | you 70 | loro | they 71 | mio | my 72 | mia | 73 | miei | 74 | mie | 75 | tuo | 76 | tua | 77 | tuoi | thy 78 | tue | 79 | suo | 80 | sua | 81 | suoi | his, her 82 | sue | 83 | nostro | our 84 | nostra | 85 | nostri | 86 | nostre | 87 | vostro | your 88 | vostra | 89 | vostri | 90 | vostre | 91 | mi | me 92 | ti | thee 93 | ci | us, there 94 | vi | you, there 95 | lo | him, the 96 | la | her, the 97 | li | them 98 | le | them, the 99 | gli | to him, the 100 | ne | from there etc 101 | il | the 102 | un | a 103 | uno | a 104 | una | a 105 | ma | but 106 | ed | and 107 | se | if 108 | perché | why, because 109 | anche | also 110 | come | how 111 | 
dov | where (as dov') 112 | dove | where 113 | che | who, that 114 | chi | who 115 | cui | whom 116 | non | not 117 | più | more 118 | quale | who, that 119 | quanto | how much 120 | quanti | 121 | quanta | 122 | quante | 123 | quello | that 124 | quelli | 125 | quella | 126 | quelle | 127 | questo | this 128 | questi | 129 | questa | 130 | queste | 131 | si | yes 132 | tutto | all 133 | tutti | all 134 | 135 | | single letter forms: 136 | 137 | a | at 138 | c | as c' for ce or ci 139 | e | and 140 | i | the 141 | l | as l' 142 | o | or 143 | 144 | | forms of avere, to have (not including the infinitive): 145 | 146 | ho 147 | hai 148 | ha 149 | abbiamo 150 | avete 151 | hanno 152 | abbia 153 | abbiate 154 | abbiano 155 | avrò 156 | avrai 157 | avrà 158 | avremo 159 | avrete 160 | avranno 161 | avrei 162 | avresti 163 | avrebbe 164 | avremmo 165 | avreste 166 | avrebbero 167 | avevo 168 | avevi 169 | aveva 170 | avevamo 171 | avevate 172 | avevano 173 | ebbi 174 | avesti 175 | ebbe 176 | avemmo 177 | aveste 178 | ebbero 179 | avessi 180 | avesse 181 | avessimo 182 | avessero 183 | avendo 184 | avuto 185 | avuta 186 | avuti 187 | avute 188 | 189 | | forms of essere, to be (not including the infinitive): 190 | sono 191 | sei 192 | è 193 | siamo 194 | siete 195 | sia 196 | siate 197 | siano 198 | sarò 199 | sarai 200 | sarà 201 | saremo 202 | sarete 203 | saranno 204 | sarei 205 | saresti 206 | sarebbe 207 | saremmo 208 | sareste 209 | sarebbero 210 | ero 211 | eri 212 | era 213 | eravamo 214 | eravate 215 | erano 216 | fui 217 | fosti 218 | fu 219 | fummo 220 | foste 221 | furono 222 | fossi 223 | fosse 224 | fossimo 225 | fossero 226 | essendo 227 | 228 | | forms of fare, to do (not including the infinitive, fa, fat-): 229 | faccio 230 | fai 231 | facciamo 232 | fanno 233 | faccia 234 | facciate 235 | facciano 236 | farò 237 | farai 238 | farà 239 | faremo 240 | farete 241 | faranno 242 | farei 243 | faresti 244 | farebbe 245 | faremmo 246 | fareste 247 | farebbero 
248 | facevo 249 | facevi 250 | faceva 251 | facevamo 252 | facevate 253 | facevano 254 | feci 255 | facesti 256 | fece 257 | facemmo 258 | faceste 259 | fecero 260 | facessi 261 | facesse 262 | facessimo 263 | facessero 264 | facendo 265 | 266 | | forms of stare, to be (not including the infinitive): 267 | sto 268 | stai 269 | sta 270 | stiamo 271 | stanno 272 | stia 273 | stiate 274 | stiano 275 | starò 276 | starai 277 | starà 278 | staremo 279 | starete 280 | staranno 281 | starei 282 | staresti 283 | starebbe 284 | staremmo 285 | stareste 286 | starebbero 287 | stavo 288 | stavi 289 | stava 290 | stavamo 291 | stavate 292 | stavano 293 | stetti 294 | stesti 295 | stette 296 | stemmo 297 | steste 298 | stettero 299 | stessi 300 | stesse 301 | stessimo 302 | stessero 303 | stando 304 | -------------------------------------------------------------------------------- /conf/lang/stopwords_ja.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file defines a stopword set for Japanese. 3 | # 4 | # This set is made up of hand-picked frequent terms from segmented Japanese Wikipedia. 5 | # Punctuation characters and frequent kanji have mostly been left out. See LUCENE-3745 6 | # for frequency lists, etc. that can be useful for making your own set (if desired) 7 | # 8 | # Note that there is an overlap between these stopwords and the terms stopped when used 9 | # in combination with the JapanesePartOfSpeechStopFilter. When editing this file, note 10 | # that comments are not allowed on the same line as stopwords. 11 | # 12 | # Also note that stopping is done in a case-insensitive manner. Change your StopFilter 13 | # configuration if you need case-sensitive stopping. Lastly, note that stopping is done 14 | # using the same character width as the entries in this file. 
Since this StopFilter is 15 | # normally done after a CJKWidthFilter in your chain, you would usually want your romaji 16 | # entries to be in half-width and your kana entries to be in full-width. 17 | # 18 | の 19 | に 20 | は 21 | を 22 | た 23 | が 24 | で 25 | て 26 | と 27 | し 28 | れ 29 | さ 30 | ある 31 | いる 32 | も 33 | する 34 | から 35 | な 36 | こと 37 | として 38 | い 39 | や 40 | れる 41 | など 42 | なっ 43 | ない 44 | この 45 | ため 46 | その 47 | あっ 48 | よう 49 | また 50 | もの 51 | という 52 | あり 53 | まで 54 | られ 55 | なる 56 | へ 57 | か 58 | だ 59 | これ 60 | によって 61 | により 62 | おり 63 | より 64 | による 65 | ず 66 | なり 67 | られる 68 | において 69 | ば 70 | なかっ 71 | なく 72 | しかし 73 | について 74 | せ 75 | だっ 76 | その後 77 | できる 78 | それ 79 | う 80 | ので 81 | なお 82 | のみ 83 | でき 84 | き 85 | つ 86 | における 87 | および 88 | いう 89 | さらに 90 | でも 91 | ら 92 | たり 93 | その他 94 | に関する 95 | たち 96 | ます 97 | ん 98 | なら 99 | に対して 100 | 特に 101 | せる 102 | 及び 103 | これら 104 | とき 105 | では 106 | にて 107 | ほか 108 | ながら 109 | うち 110 | そして 111 | とともに 112 | ただし 113 | かつて 114 | それぞれ 115 | または 116 | お 117 | ほど 118 | ものの 119 | に対する 120 | ほとんど 121 | と共に 122 | といった 123 | です 124 | とも 125 | ところ 126 | ここ 127 | ##### End of file 128 | -------------------------------------------------------------------------------- /conf/lang/stopwords_lv.txt: -------------------------------------------------------------------------------- 1 | # Set of Latvian stopwords from A Stemming Algorithm for Latvian, Karlis Kreslins 2 | # the original list of over 800 forms was refined: 3 | # pronouns, adverbs, interjections were removed 4 | # 5 | # prepositions 6 | aiz 7 | ap 8 | ar 9 | apakš 10 | ārpus 11 | augšpus 12 | bez 13 | caur 14 | dēļ 15 | gar 16 | iekš 17 | iz 18 | kopš 19 | labad 20 | lejpus 21 | līdz 22 | no 23 | otrpus 24 | pa 25 | par 26 | pār 27 | pēc 28 | pie 29 | pirms 30 | pret 31 | priekš 32 | starp 33 | šaipus 34 | uz 35 | viņpus 36 | virs 37 | virspus 38 | zem 39 | apakšpus 40 | # Conjunctions 41 | un 42 | bet 43 | jo 44 | ja 45 | ka 46 | lai 47 | tomēr 48 | tikko 49 | 
turpretī 50 | arī 51 | kaut 52 | gan 53 | tādēļ 54 | tā 55 | ne 56 | tikvien 57 | vien 58 | kā 59 | ir 60 | te 61 | vai 62 | kamēr 63 | # Particles 64 | ar 65 | diezin 66 | droši 67 | diemžēl 68 | nebūt 69 | ik 70 | it 71 | taču 72 | nu 73 | pat 74 | tiklab 75 | iekšpus 76 | nedz 77 | tik 78 | nevis 79 | turpretim 80 | jeb 81 | iekam 82 | iekām 83 | iekāms 84 | kolīdz 85 | līdzko 86 | tiklīdz 87 | jebšu 88 | tālab 89 | tāpēc 90 | nekā 91 | itin 92 | jā 93 | jau 94 | jel 95 | nē 96 | nezin 97 | tad 98 | tikai 99 | vis 100 | tak 101 | iekams 102 | vien 103 | # modal verbs 104 | būt 105 | biju 106 | biji 107 | bija 108 | bijām 109 | bijāt 110 | esmu 111 | esi 112 | esam 113 | esat 114 | būšu 115 | būsi 116 | būs 117 | būsim 118 | būsiet 119 | tikt 120 | tiku 121 | tiki 122 | tika 123 | tikām 124 | tikāt 125 | tieku 126 | tiec 127 | tiek 128 | tiekam 129 | tiekat 130 | tikšu 131 | tiks 132 | tiksim 133 | tiksiet 134 | tapt 135 | tapi 136 | tapāt 137 | topat 138 | tapšu 139 | tapsi 140 | taps 141 | tapsim 142 | tapsiet 143 | kļūt 144 | kļuvu 145 | kļuvi 146 | kļuva 147 | kļuvām 148 | kļuvāt 149 | kļūstu 150 | kļūsti 151 | kļūst 152 | kļūstam 153 | kļūstat 154 | kļūšu 155 | kļūsi 156 | kļūs 157 | kļūsim 158 | kļūsiet 159 | # verbs 160 | varēt 161 | varēju 162 | varējām 163 | varēšu 164 | varēsim 165 | var 166 | varēji 167 | varējāt 168 | varēsi 169 | varēsiet 170 | varat 171 | varēja 172 | varēs 173 | -------------------------------------------------------------------------------- /conf/lang/stopwords_nl.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/dutch/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 
7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | A Dutch stop word list. Comments begin with vertical bar. Each stop 11 | | word is at the start of a line. 12 | 13 | | This is a ranked list (commonest to rarest) of stopwords derived from 14 | | a large sample of Dutch text. 15 | 16 | | Dutch stop words frequently exhibit homonym clashes. These are indicated 17 | | clearly below. 18 | 19 | de | the 20 | en | and 21 | van | of, from 22 | ik | I, the ego 23 | te | (1) chez, at etc, (2) to, (3) too 24 | dat | that, which 25 | die | that, those, who, which 26 | in | in, inside 27 | een | a, an, one 28 | hij | he 29 | het | the, it 30 | niet | not, nothing, naught 31 | zijn | (1) to be, being, (2) his, one's, its 32 | is | is 33 | was | (1) was, past tense of all persons sing. of 'zijn' (to be) (2) wax, (3) the washing, (4) rise of river 34 | op | on, upon, at, in, up, used up 35 | aan | on, upon, to (as dative) 36 | met | with, by 37 | als | like, such as, when 38 | voor | (1) before, in front of, (2) furrow 39 | had | had, past tense all persons sing. of 'hebben' (have) 40 | er | there 41 | maar | but, only 42 | om | round, about, for etc 43 | hem | him 44 | dan | then 45 | zou | should/would, past tense all persons sing. of 'zullen' 46 | of | or, whether, if 47 | wat | what, something, anything 48 | mijn | possessive and noun 'mine' 49 | men | people, 'one' 50 | dit | this 51 | zo | so, thus, in this way 52 | door | through, by 53 | over | over, across 54 | ze | she, her, they, them 55 | zich | oneself 56 | bij | (1) a bee, (2) by, near, at 57 | ook | also, too 58 | tot | till, until 59 | je | you 60 | mij | me 61 | uit | out of, from 62 | der | Old Dutch form of 'van der' still found in surnames 63 | daar | (1) there, (2) because 64 | haar | (1) her, their, them, (2) hair 65 | naar | (1) unpleasant, unwell etc, (2) towards, (3) as 66 | heb | present first person sing. 
of 'to have' 67 | hoe | how, why 68 | heeft | present third person sing. of 'to have' 69 | hebben | 'to have' and various parts thereof 70 | deze | this 71 | u | you 72 | want | (1) for, (2) mitten, (3) rigging 73 | nog | yet, still 74 | zal | 'shall', first and third person sing. of verb 'zullen' (will) 75 | me | me 76 | zij | she, they 77 | nu | now 78 | ge | 'thou', still used in Belgium and south Netherlands 79 | geen | none 80 | omdat | because 81 | iets | something, somewhat 82 | worden | to become, grow, get 83 | toch | yet, still 84 | al | all, every, each 85 | waren | (1) 'were' (2) to wander, (3) wares, (3) 86 | veel | much, many 87 | meer | (1) more, (2) lake 88 | doen | to do, to make 89 | toen | then, when 90 | moet | noun 'spot/mote' and present form of 'to must' 91 | ben | (1) am, (2) 'are' in interrogative second person singular of 'to be' 92 | zonder | without 93 | kan | noun 'can' and present form of 'to be able' 94 | hun | their, them 95 | dus | so, consequently 96 | alles | all, everything, anything 97 | onder | under, beneath 98 | ja | yes, of course 99 | eens | once, one day 100 | hier | here 101 | wie | who 102 | werd | imperfect third person sing. of 'become' 103 | altijd | always 104 | doch | yet, but etc 105 | wordt | present third person sing. 
of 'become' 106 | wezen | (1) to be, (2) 'been' as in 'been fishing', (3) orphans 107 | kunnen | to be able 108 | ons | us/our 109 | zelf | self 110 | tegen | against, towards, at 111 | na | after, near 112 | reeds | already 113 | wil | (1) present tense of 'want', (2) 'will', noun, (3) fender 114 | kon | could; past tense of 'to be able' 115 | niets | nothing 116 | uw | your 117 | iemand | somebody 118 | geweest | been; past participle of 'be' 119 | andere | other 120 | -------------------------------------------------------------------------------- /conf/lang/stopwords_no.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/norwegian/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | A Norwegian stop word list. Comments begin with vertical bar. Each stop 11 | | word is at the start of a line. 12 | 13 | | This stop word list is for the dominant bokmål dialect. Words unique 14 | | to nynorsk are marked *. 15 | 16 | | Revised by Jan Bruusgaard , Jan 2005 17 | 18 | og | and 19 | i | in 20 | jeg | I 21 | det | it/this/that 22 | at | to (w. inf.) 
23 | en | a/an 24 | et | a/an 25 | den | it/this/that 26 | til | to 27 | er | is/am/are 28 | som | who/that 29 | på | on 30 | de | they / you(formal) 31 | med | with 32 | han | he 33 | av | of 34 | ikke | not 35 | ikkje | not * 36 | der | there 37 | så | so 38 | var | was/were 39 | meg | me 40 | seg | you 41 | men | but 42 | ett | one 43 | har | have 44 | om | about 45 | vi | we 46 | min | my 47 | mitt | my 48 | ha | have 49 | hadde | had 50 | hun | she 51 | nå | now 52 | over | over 53 | da | when/as 54 | ved | by/know 55 | fra | from 56 | du | you 57 | ut | out 58 | sin | your 59 | dem | them 60 | oss | us 61 | opp | up 62 | man | you/one 63 | kan | can 64 | hans | his 65 | hvor | where 66 | eller | or 67 | hva | what 68 | skal | shall/must 69 | selv | self (reflective) 70 | sjøl | self (reflective) 71 | her | here 72 | alle | all 73 | vil | will 74 | bli | become 75 | ble | became 76 | blei | became * 77 | blitt | have become 78 | kunne | could 79 | inn | in 80 | når | when 81 | være | be 82 | kom | come 83 | noen | some 84 | noe | some 85 | ville | would 86 | dere | you 87 | som | who/which/that 88 | deres | their/theirs 89 | kun | only/just 90 | ja | yes 91 | etter | after 92 | ned | down 93 | skulle | should 94 | denne | this 95 | for | for/because 96 | deg | you 97 | si | hers/his 98 | sine | hers/his 99 | sitt | hers/his 100 | mot | against 101 | å | to 102 | meget | much 103 | hvorfor | why 104 | dette | this 105 | disse | these/those 106 | uten | without 107 | hvordan | how 108 | ingen | none 109 | din | your 110 | ditt | your 111 | blir | become 112 | samme | same 113 | hvilken | which 114 | hvilke | which (plural) 115 | sånn | such a 116 | inni | inside/within 117 | mellom | between 118 | vår | our 119 | hver | each 120 | hvem | who 121 | vors | us/ours 122 | hvis | whose 123 | både | both 124 | bare | only/just 125 | enn | than 126 | fordi | as/because 127 | før | before 128 | mange | many 129 | også | also 130 | slik | just 131 | vært | been 132 | 
være | to be 133 | båe | both * 134 | begge | both 135 | siden | since 136 | dykk | your * 137 | dykkar | yours * 138 | dei | they * 139 | deira | them * 140 | deires | theirs * 141 | deim | them * 142 | di | your (fem.) * 143 | då | as/when * 144 | eg | I * 145 | ein | a/an * 146 | eit | a/an * 147 | eitt | a/an * 148 | elles | or * 149 | honom | he * 150 | hjå | at * 151 | ho | she * 152 | hoe | she * 153 | henne | her 154 | hennar | her/hers 155 | hennes | hers 156 | hoss | how * 157 | hossen | how * 158 | ikkje | not * 159 | ingi | noone * 160 | inkje | noone * 161 | korleis | how * 162 | korso | how * 163 | kva | what/which * 164 | kvar | where * 165 | kvarhelst | where * 166 | kven | who/whom * 167 | kvi | why * 168 | kvifor | why * 169 | me | we * 170 | medan | while * 171 | mi | my * 172 | mine | my * 173 | mykje | much * 174 | no | now * 175 | nokon | some (masc./neut.) * 176 | noka | some (fem.) * 177 | nokor | some * 178 | noko | some * 179 | nokre | some * 180 | si | his/hers * 181 | sia | since * 182 | sidan | since * 183 | so | so * 184 | somt | some * 185 | somme | some * 186 | um | about* 187 | upp | up * 188 | vere | be * 189 | vore | was * 190 | verte | become * 191 | vort | become * 192 | varte | became * 193 | vart | became * 194 | 195 | -------------------------------------------------------------------------------- /conf/lang/stopwords_pt.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/portuguese/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | A Portuguese stop word list. Comments begin with vertical bar. 
Each stop 11 | | word is at the start of a line. 12 | 13 | 14 | | The following is a ranked list (commonest to rarest) of stopwords 15 | | deriving from a large sample of text. 16 | 17 | | Extra words have been added at the end. 18 | 19 | de | of, from 20 | a | the; to, at; her 21 | o | the; him 22 | que | who, that 23 | e | and 24 | do | de + o 25 | da | de + a 26 | em | in 27 | um | a 28 | para | for 29 | | é from SER 30 | com | with 31 | não | not, no 32 | uma | a 33 | os | the; them 34 | no | em + o 35 | se | himself etc 36 | na | em + a 37 | por | for 38 | mais | more 39 | as | the; them 40 | dos | de + os 41 | como | as, like 42 | mas | but 43 | | foi from SER 44 | ao | a + o 45 | ele | he 46 | das | de + as 47 | | tem from TER 48 | à | a + a 49 | seu | his 50 | sua | her 51 | ou | or 52 | | ser from SER 53 | quando | when 54 | muito | much 55 | | há from HAV 56 | nos | em + os; us 57 | já | already, now 58 | | está from EST 59 | eu | I 60 | também | also 61 | só | only, just 62 | pelo | per + o 63 | pela | per + a 64 | até | up to 65 | isso | that 66 | ela | she 67 | entre | between 68 | | era from SER 69 | depois | after 70 | sem | without 71 | mesmo | same 72 | aos | a + os 73 | | ter from TER 74 | seus | his 75 | quem | whom 76 | nas | em + as 77 | me | me 78 | esse | that 79 | eles | they 80 | | estão from EST 81 | você | you 82 | | tinha from TER 83 | | foram from SER 84 | essa | that 85 | num | em + um 86 | nem | nor 87 | suas | her 88 | meu | my 89 | às | a + as 90 | minha | my 91 | | têm from TER 92 | numa | em + uma 93 | pelos | per + os 94 | elas | they 95 | | havia from HAV 96 | | seja from SER 97 | qual | which 98 | | será from SER 99 | nós | we 100 | | tenho from TER 101 | lhe | to him, her 102 | deles | of them 103 | essas | those 104 | esses | those 105 | pelas | per + as 106 | este | this 107 | | fosse from SER 108 | dele | of him 109 | 110 | | other words. 
There are many contractions such as naquele = em+aquele, 111 | | mo = me+o, but they are rare. 112 | | Indefinite article plural forms are also rare. 113 | 114 | tu | thou 115 | te | thee 116 | vocês | you (plural) 117 | vos | you 118 | lhes | to them 119 | meus | my 120 | minhas 121 | teu | thy 122 | tua 123 | teus 124 | tuas 125 | nosso | our 126 | nossa 127 | nossos 128 | nossas 129 | 130 | dela | of her 131 | delas | of them 132 | 133 | esta | this 134 | estes | these 135 | estas | these 136 | aquele | that 137 | aquela | that 138 | aqueles | those 139 | aquelas | those 140 | isto | this 141 | aquilo | that 142 | 143 | | forms of estar, to be (not including the infinitive): 144 | estou 145 | está 146 | estamos 147 | estão 148 | estive 149 | esteve 150 | estivemos 151 | estiveram 152 | estava 153 | estávamos 154 | estavam 155 | estivera 156 | estivéramos 157 | esteja 158 | estejamos 159 | estejam 160 | estivesse 161 | estivéssemos 162 | estivessem 163 | estiver 164 | estivermos 165 | estiverem 166 | 167 | | forms of haver, to have (not including the infinitive): 168 | hei 169 | há 170 | havemos 171 | hão 172 | houve 173 | houvemos 174 | houveram 175 | houvera 176 | houvéramos 177 | haja 178 | hajamos 179 | hajam 180 | houvesse 181 | houvéssemos 182 | houvessem 183 | houver 184 | houvermos 185 | houverem 186 | houverei 187 | houverá 188 | houveremos 189 | houverão 190 | houveria 191 | houveríamos 192 | houveriam 193 | 194 | | forms of ser, to be (not including the infinitive): 195 | sou 196 | somos 197 | são 198 | era 199 | éramos 200 | eram 201 | fui 202 | foi 203 | fomos 204 | foram 205 | fora 206 | fôramos 207 | seja 208 | sejamos 209 | sejam 210 | fosse 211 | fôssemos 212 | fossem 213 | for 214 | formos 215 | forem 216 | serei 217 | será 218 | seremos 219 | serão 220 | seria 221 | seríamos 222 | seriam 223 | 224 | | forms of ter, to have (not including the infinitive): 225 | tenho 226 | tem 227 | temos 228 | tém 229 | tinha 230 | tínhamos 231 | tinham 232 | 
tive 233 | teve 234 | tivemos 235 | tiveram 236 | tivera 237 | tivéramos 238 | tenha 239 | tenhamos 240 | tenham 241 | tivesse 242 | tivéssemos 243 | tivessem 244 | tiver 245 | tivermos 246 | tiverem 247 | terei 248 | terá 249 | teremos 250 | terão 251 | teria 252 | teríamos 253 | teriam 254 | -------------------------------------------------------------------------------- /conf/lang/stopwords_ro.txt: -------------------------------------------------------------------------------- 1 | # This file was created by Jacques Savoy and is distributed under the BSD license. 2 | # See http://members.unine.ch/jacques.savoy/clef/index.html. 3 | # Also see http://www.opensource.org/licenses/bsd-license.html 4 | acea 5 | aceasta 6 | această 7 | aceea 8 | acei 9 | aceia 10 | acel 11 | acela 12 | acele 13 | acelea 14 | acest 15 | acesta 16 | aceste 17 | acestea 18 | aceşti 19 | aceştia 20 | acolo 21 | acum 22 | ai 23 | aia 24 | aibă 25 | aici 26 | al 27 | ăla 28 | ale 29 | alea 30 | ălea 31 | altceva 32 | altcineva 33 | am 34 | ar 35 | are 36 | aş 37 | aşadar 38 | asemenea 39 | asta 40 | ăsta 41 | astăzi 42 | astea 43 | ăstea 44 | ăştia 45 | asupra 46 | aţi 47 | au 48 | avea 49 | avem 50 | aveţi 51 | azi 52 | bine 53 | bucur 54 | bună 55 | ca 56 | că 57 | căci 58 | când 59 | care 60 | cărei 61 | căror 62 | cărui 63 | cât 64 | câte 65 | câţi 66 | către 67 | câtva 68 | ce 69 | cel 70 | ceva 71 | chiar 72 | cînd 73 | cine 74 | cineva 75 | cît 76 | cîte 77 | cîţi 78 | cîtva 79 | contra 80 | cu 81 | cum 82 | cumva 83 | curând 84 | curînd 85 | da 86 | dă 87 | dacă 88 | dar 89 | datorită 90 | de 91 | deci 92 | deja 93 | deoarece 94 | departe 95 | deşi 96 | din 97 | dinaintea 98 | dintr 99 | dintre 100 | drept 101 | după 102 | ea 103 | ei 104 | el 105 | ele 106 | eram 107 | este 108 | eşti 109 | eu 110 | face 111 | fără 112 | fi 113 | fie 114 | fiecare 115 | fii 116 | fim 117 | fiţi 118 | iar 119 | ieri 120 | îi 121 | îl 122 | îmi 123 | împotriva 124 | în 125 | înainte 126 | înaintea 127 
| încât 128 | încît 129 | încotro 130 | între 131 | întrucât 132 | întrucît 133 | îţi 134 | la 135 | lângă 136 | le 137 | li 138 | lîngă 139 | lor 140 | lui 141 | mă 142 | mâine 143 | mea 144 | mei 145 | mele 146 | mereu 147 | meu 148 | mi 149 | mine 150 | mult 151 | multă 152 | mulţi 153 | ne 154 | nicăieri 155 | nici 156 | nimeni 157 | nişte 158 | noastră 159 | noastre 160 | noi 161 | noştri 162 | nostru 163 | nu 164 | ori 165 | oricând 166 | oricare 167 | oricât 168 | orice 169 | oricînd 170 | oricine 171 | oricît 172 | oricum 173 | oriunde 174 | până 175 | pe 176 | pentru 177 | peste 178 | pînă 179 | poate 180 | pot 181 | prea 182 | prima 183 | primul 184 | prin 185 | printr 186 | sa 187 | să 188 | săi 189 | sale 190 | sau 191 | său 192 | se 193 | şi 194 | sînt 195 | sîntem 196 | sînteţi 197 | spre 198 | sub 199 | sunt 200 | suntem 201 | sunteţi 202 | ta 203 | tăi 204 | tale 205 | tău 206 | te 207 | ţi 208 | ţie 209 | tine 210 | toată 211 | toate 212 | tot 213 | toţi 214 | totuşi 215 | tu 216 | un 217 | una 218 | unde 219 | undeva 220 | unei 221 | unele 222 | uneori 223 | unor 224 | vă 225 | vi 226 | voastră 227 | voastre 228 | voi 229 | voştri 230 | vostru 231 | vouă 232 | vreo 233 | vreun 234 | -------------------------------------------------------------------------------- /conf/lang/stopwords_ru.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/russian/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | a russian stop word list. comments begin with vertical bar. each stop 11 | | word is at the start of a line. 
12 | 13 | | this is a ranked list (commonest to rarest) of stopwords derived from 14 | | a large text sample. 15 | 16 | | letter `ё' is translated to `е'. 17 | 18 | и | and 19 | в | in/into 20 | во | alternative form 21 | не | not 22 | что | what/that 23 | он | he 24 | на | on/onto 25 | я | i 26 | с | from 27 | со | alternative form 28 | как | how 29 | а | milder form of `no' (but) 30 | то | conjunction and form of `that' 31 | все | all 32 | она | she 33 | так | so, thus 34 | его | him 35 | но | but 36 | да | yes/and 37 | ты | thou 38 | к | towards, by 39 | у | around, chez 40 | же | intensifier particle 41 | вы | you 42 | за | beyond, behind 43 | бы | conditional/subj. particle 44 | по | up to, along 45 | только | only 46 | ее | her 47 | мне | to me 48 | было | it was 49 | вот | here is/are, particle 50 | от | away from 51 | меня | me 52 | еще | still, yet, more 53 | нет | no, there isnt/arent 54 | о | about 55 | из | out of 56 | ему | to him 57 | теперь | now 58 | когда | when 59 | даже | even 60 | ну | so, well 61 | вдруг | suddenly 62 | ли | interrogative particle 63 | если | if 64 | уже | already, but homonym of `narrower' 65 | или | or 66 | ни | neither 67 | быть | to be 68 | был | he was 69 | него | prepositional form of его 70 | до | up to 71 | вас | you accusative 72 | нибудь | indef. 
suffix preceded by hyphen 73 | опять | again 74 | уж | already, but homonym of `adder' 75 | вам | to you 76 | сказал | he said 77 | ведь | particle `after all' 78 | там | there 79 | потом | then 80 | себя | oneself 81 | ничего | nothing 82 | ей | to her 83 | может | usually with `быть' as `maybe' 84 | они | they 85 | тут | here 86 | где | where 87 | есть | there is/are 88 | надо | got to, must 89 | ней | prepositional form of ей 90 | для | for 91 | мы | we 92 | тебя | thee 93 | их | them, their 94 | чем | than 95 | была | she was 96 | сам | self 97 | чтоб | in order to 98 | без | without 99 | будто | as if 100 | человек | man, person, one 101 | чего | genitive form of `what' 102 | раз | once 103 | тоже | also 104 | себе | to oneself 105 | под | beneath 106 | жизнь | life 107 | будет | will be 108 | ж | short form of intensifier particle `же' 109 | тогда | then 110 | кто | who 111 | этот | this 112 | говорил | was saying 113 | того | genitive form of `that' 114 | потому | for that reason 115 | этого | genitive form of `this' 116 | какой | which 117 | совсем | altogether 118 | ним | prepositional form of `его', `они' 119 | здесь | here 120 | этом | prepositional form of `этот' 121 | один | one 122 | почти | almost 123 | мой | my 124 | тем | instrumental/dative plural of `тот', `то' 125 | чтобы | full form of `in order that' 126 | нее | her (acc.) 127 | кажется | it seems 128 | сейчас | now 129 | были | they were 130 | куда | where to 131 | зачем | why 132 | сказать | to say 133 | всех | all (acc., gen. preposn. plural) 134 | никогда | never 135 | сегодня | today 136 | можно | possible, one can 137 | при | by 138 | наконец | finally 139 | два | two 140 | об | alternative form of `о', about 141 | другой | another 142 | хоть | even 143 | после | after 144 | над | above 145 | больше | more 146 | тот | that one (masc.) 
147 | через | across, in 148 | эти | these 149 | нас | us 150 | про | about 151 | всего | in all, only, of all 152 | них | prepositional form of `они' (they) 153 | какая | which, feminine 154 | много | lots 155 | разве | interrogative particle 156 | сказала | she said 157 | три | three 158 | эту | this, acc. fem. sing. 159 | моя | my, feminine 160 | впрочем | moreover, besides 161 | хорошо | good 162 | свою | ones own, acc. fem. sing. 163 | этой | oblique form of `эта', fem. `this' 164 | перед | in front of 165 | иногда | sometimes 166 | лучше | better 167 | чуть | a little 168 | том | preposn. form of `that one' 169 | нельзя | one must not 170 | такой | such a one 171 | им | to them 172 | более | more 173 | всегда | always 174 | конечно | of course 175 | всю | acc. fem. sing of `all' 176 | между | between 177 | 178 | 179 | | b: some paradigms 180 | | 181 | | personal pronouns 182 | | 183 | | я меня мне мной [мною] 184 | | ты тебя тебе тобой [тобою] 185 | | он его ему им [него, нему, ним] 186 | | она ее ей ею [нее, ней, нею] 187 | | оно его ему им [него, нему, ним] 188 | | 189 | | мы нас нам нами 190 | | вы вас вам вами 191 | | они их им ими [них, ним, ними] 192 | | 193 | | себя себе собой [собою] 194 | | 195 | | demonstrative pronouns: этот (this), тот (that) 196 | | 197 | | этот эта это эти 198 | | этого эту это эти 199 | | этого этой этого этих 200 | | этому этой этому этим 201 | | этим этой этим [этою] этими 202 | | этом этой этом этих 203 | | 204 | | тот та то те 205 | | того ту то те 206 | | того той того тех 207 | | тому той тому тем 208 | | тем той тем [тою] теми 209 | | том той том тех 210 | | 211 | | determinative pronouns 212 | | 213 | | (a) весь (all) 214 | | 215 | | весь вся все все 216 | | всего всю все все 217 | | всего всей всего всех 218 | | всему всей всему всем 219 | | всем всей всем [всею] всеми 220 | | всем всей всем всех 221 | | 222 | | (b) сам (himself etc) 223 | | 224 | | сам сама само сами 225 | | самого саму само самих 226 | | самого самой 
самого самих 227 | | самому самой самому самим 228 | | самим самой самим [самою] самими 229 | | самом самой самом самих 230 | | 231 | | stems of verbs `to be', `to have', `to do' and modal 232 | | 233 | | быть бы буд быв есть суть 234 | | име 235 | | дел 236 | | мог мож мочь 237 | | уме 238 | | хоч хот 239 | | долж 240 | | можн 241 | | нужн 242 | | нельзя 243 | 244 | -------------------------------------------------------------------------------- /conf/lang/stopwords_sv.txt: -------------------------------------------------------------------------------- 1 | | From svn.tartarus.org/snowball/trunk/website/algorithms/swedish/stop.txt 2 | | This file is distributed under the BSD License. 3 | | See http://snowball.tartarus.org/license.php 4 | | Also see http://www.opensource.org/licenses/bsd-license.html 5 | | - Encoding was converted to UTF-8. 6 | | - This notice was added. 7 | | 8 | | NOTE: To use this file with StopFilterFactory, you must specify format="snowball" 9 | 10 | | A Swedish stop word list. Comments begin with vertical bar. Each stop 11 | | word is at the start of a line. 12 | 13 | | This is a ranked list (commonest to rarest) of stopwords derived from 14 | | a large text sample. 15 | 16 | | Swedish stop words occasionally exhibit homonym clashes. For example 17 | | så = so, but also seed. These are indicated clearly below. 
18 | 19 | och | and 20 | det | it, this/that 21 | att | to (with infinitive) 22 | i | in, at 23 | en | a 24 | jag | I 25 | hon | she 26 | som | who, that 27 | han | he 28 | på | on 29 | den | it, this/that 30 | med | with 31 | var | where, each 32 | sig | him(self) etc 33 | för | for 34 | så | so (also: seed) 35 | till | to 36 | är | is 37 | men | but 38 | ett | a 39 | om | if; around, about 40 | hade | had 41 | de | they, these/those 42 | av | of 43 | icke | not, no 44 | mig | me 45 | du | you 46 | henne | her 47 | då | then, when 48 | sin | his 49 | nu | now 50 | har | have 51 | inte | inte någon = no one 52 | hans | his 53 | honom | him 54 | skulle | 'sake' 55 | hennes | her 56 | där | there 57 | min | my 58 | man | one (pronoun) 59 | ej | nor 60 | vid | at, by, on (also: vast) 61 | kunde | could 62 | något | some etc 63 | från | from, off 64 | ut | out 65 | när | when 66 | efter | after, behind 67 | upp | up 68 | vi | we 69 | dem | them 70 | vara | be 71 | vad | what 72 | över | over 73 | än | than 74 | dig | you 75 | kan | can 76 | sina | his 77 | här | here 78 | ha | have 79 | mot | towards 80 | alla | all 81 | under | under (also: wonder) 82 | någon | some etc 83 | eller | or (else) 84 | allt | all 85 | mycket | much 86 | sedan | since 87 | ju | why 88 | denna | this/that 89 | själv | myself, yourself etc 90 | detta | this/that 91 | åt | to 92 | utan | without 93 | varit | was 94 | hur | how 95 | ingen | no 96 | mitt | my 97 | ni | you 98 | bli | to be, become 99 | blev | from bli 100 | oss | us 101 | din | thy 102 | dessa | these/those 103 | några | some etc 104 | deras | their 105 | blir | from bli 106 | mina | my 107 | samma | (the) same 108 | vilken | who, that 109 | er | you, your 110 | sådan | such a 111 | vår | our 112 | blivit | from bli 113 | dess | its 114 | inom | within 115 | mellan | between 116 | sådant | such a 117 | varför | why 118 | varje | each 119 | vilka | who, that 120 | ditt | thy 121 | vem | who 122 | vilket | who, that 123 | sitta | 
his 124 | sådana | such a 125 | vart | each 126 | dina | thy 127 | vars | whose 128 | vårt | our 129 | våra | our 130 | ert | your 131 | era | your 132 | vilkas | whose 133 | 134 | -------------------------------------------------------------------------------- /conf/lang/stopwords_th.txt: -------------------------------------------------------------------------------- 1 | # Thai stopwords from: 2 | # "Opinion Detection in Thai Political News Columns 3 | # Based on Subjectivity Analysis" 4 | # Khampol Sukhum, Supot Nitsuwat, and Choochart Haruechaiyasak 5 | ไว้ 6 | ไม่ 7 | ไป 8 | ได้ 9 | ให้ 10 | ใน 11 | โดย 12 | แห่ง 13 | แล้ว 14 | และ 15 | แรก 16 | แบบ 17 | แต่ 18 | เอง 19 | เห็น 20 | เลย 21 | เริ่ม 22 | เรา 23 | เมื่อ 24 | เพื่อ 25 | เพราะ 26 | เป็นการ 27 | เป็น 28 | เปิดเผย 29 | เปิด 30 | เนื่องจาก 31 | เดียวกัน 32 | เดียว 33 | เช่น 34 | เฉพาะ 35 | เคย 36 | เข้า 37 | เขา 38 | อีก 39 | อาจ 40 | อะไร 41 | ออก 42 | อย่าง 43 | อยู่ 44 | อยาก 45 | หาก 46 | หลาย 47 | หลังจาก 48 | หลัง 49 | หรือ 50 | หนึ่ง 51 | ส่วน 52 | ส่ง 53 | สุด 54 | สําหรับ 55 | ว่า 56 | วัน 57 | ลง 58 | ร่วม 59 | ราย 60 | รับ 61 | ระหว่าง 62 | รวม 63 | ยัง 64 | มี 65 | มาก 66 | มา 67 | พร้อม 68 | พบ 69 | ผ่าน 70 | ผล 71 | บาง 72 | น่า 73 | นี้ 74 | นํา 75 | นั้น 76 | นัก 77 | นอกจาก 78 | ทุก 79 | ที่สุด 80 | ที่ 81 | ทําให้ 82 | ทํา 83 | ทาง 84 | ทั้งนี้ 85 | ทั้ง 86 | ถ้า 87 | ถูก 88 | ถึง 89 | ต้อง 90 | ต่างๆ 91 | ต่าง 92 | ต่อ 93 | ตาม 94 | ตั้งแต่ 95 | ตั้ง 96 | ด้าน 97 | ด้วย 98 | ดัง 99 | ซึ่ง 100 | ช่วง 101 | จึง 102 | จาก 103 | จัด 104 | จะ 105 | คือ 106 | ความ 107 | ครั้ง 108 | คง 109 | ขึ้น 110 | ของ 111 | ขอ 112 | ขณะ 113 | ก่อน 114 | ก็ 115 | การ 116 | กับ 117 | กัน 118 | กว่า 119 | กล่าว 120 | -------------------------------------------------------------------------------- /conf/lang/stopwords_tr.txt: -------------------------------------------------------------------------------- 1 | # Turkish stopwords from LUCENE-559 2 | # merged with the list from "Information Retrieval on 
Turkish Texts" 3 | # (http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf) 4 | acaba 5 | altmış 6 | altı 7 | ama 8 | ancak 9 | arada 10 | aslında 11 | ayrıca 12 | bana 13 | bazı 14 | belki 15 | ben 16 | benden 17 | beni 18 | benim 19 | beri 20 | beş 21 | bile 22 | bin 23 | bir 24 | birçok 25 | biri 26 | birkaç 27 | birkez 28 | birşey 29 | birşeyi 30 | biz 31 | bize 32 | bizden 33 | bizi 34 | bizim 35 | böyle 36 | böylece 37 | bu 38 | buna 39 | bunda 40 | bundan 41 | bunlar 42 | bunları 43 | bunların 44 | bunu 45 | bunun 46 | burada 47 | çok 48 | çünkü 49 | da 50 | daha 51 | dahi 52 | de 53 | defa 54 | değil 55 | diğer 56 | diye 57 | doksan 58 | dokuz 59 | dolayı 60 | dolayısıyla 61 | dört 62 | edecek 63 | eden 64 | ederek 65 | edilecek 66 | ediliyor 67 | edilmesi 68 | ediyor 69 | eğer 70 | elli 71 | en 72 | etmesi 73 | etti 74 | ettiği 75 | ettiğini 76 | gibi 77 | göre 78 | halen 79 | hangi 80 | hatta 81 | hem 82 | henüz 83 | hep 84 | hepsi 85 | her 86 | herhangi 87 | herkesin 88 | hiç 89 | hiçbir 90 | için 91 | iki 92 | ile 93 | ilgili 94 | ise 95 | işte 96 | itibaren 97 | itibariyle 98 | kadar 99 | karşın 100 | katrilyon 101 | kendi 102 | kendilerine 103 | kendini 104 | kendisi 105 | kendisine 106 | kendisini 107 | kez 108 | ki 109 | kim 110 | kimden 111 | kime 112 | kimi 113 | kimse 114 | kırk 115 | milyar 116 | milyon 117 | mu 118 | mü 119 | mı 120 | nasıl 121 | ne 122 | neden 123 | nedenle 124 | nerde 125 | nerede 126 | nereye 127 | niye 128 | niçin 129 | o 130 | olan 131 | olarak 132 | oldu 133 | olduğu 134 | olduğunu 135 | olduklarını 136 | olmadı 137 | olmadığı 138 | olmak 139 | olması 140 | olmayan 141 | olmaz 142 | olsa 143 | olsun 144 | olup 145 | olur 146 | olursa 147 | oluyor 148 | on 149 | ona 150 | ondan 151 | onlar 152 | onlardan 153 | onları 154 | onların 155 | onu 156 | onun 157 | otuz 158 | oysa 159 | öyle 160 | pek 161 | rağmen 162 | sadece 163 | sanki 164 | sekiz 165 | seksen 166 | sen 167 | senden 168 | seni 169 | senin 170 | siz 
171 | sizden 172 | sizi 173 | sizin 174 | şey 175 | şeyden 176 | şeyi 177 | şeyler 178 | şöyle 179 | şu 180 | şuna 181 | şunda 182 | şundan 183 | şunları 184 | şunu 185 | tarafından 186 | trilyon 187 | tüm 188 | üç 189 | üzere 190 | var 191 | vardı 192 | ve 193 | veya 194 | ya 195 | yani 196 | yapacak 197 | yapılan 198 | yapılması 199 | yapıyor 200 | yapmak 201 | yaptı 202 | yaptığı 203 | yaptığını 204 | yaptıkları 205 | yedi 206 | yerine 207 | yetmiş 208 | yine 209 | yirmi 210 | yoksa 211 | yüz 212 | zaten 213 | -------------------------------------------------------------------------------- /conf/lang/userdict_ja.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This is a sample user dictionary for Kuromoji (JapaneseTokenizer) 3 | # 4 | # Add entries to this file in order to override the statistical model in terms 5 | # of segmentation, readings and part-of-speech tags. Notice that entries do 6 | # not have weights since they are always used when found. This is by-design 7 | # in order to maximize ease-of-use. 8 | # 9 | # Entries are defined using the following CSV format: 10 | # <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag> 11 | # 12 | # Notice that a single half-width space separates tokens and readings, and 13 | # that the number of tokens and readings must match exactly. 14 | # 15 | # Also notice that multiple entries with the same <text> is undefined. 16 | # 17 | # Whitespace only lines are ignored. Comments are not allowed on entry lines.
18 | # 19 | 20 | # Custom segmentation for kanji compounds 21 | 日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞 22 | 関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞 23 | 24 | # Custom segmentation for compound katakana 25 | トートバッグ,トート バッグ,トート バッグ,かずカナ名詞 26 | ショルダーバッグ,ショルダー バッグ,ショルダー バッグ,かずカナ名詞 27 | 28 | # Custom reading for former sumo wrestler 29 | 朝青龍,朝青龍,アサショウリュウ,カスタム人名 30 | -------------------------------------------------------------------------------- /conf/managed-schema: -------------------------------------------------------------------------------- 162 | id
1004 | -------------------------------------------------------------------------------- /conf/params.json: -------------------------------------------------------------------------------- 1 | {"params":{ 2 | "query":{ 3 | "defType":"edismax", 4 | "q.alt":"*:*", 5 | "rows":"10", 6 | "fl":"*,score", 7 | "":{"v":0} 8 | }, 9 | "facets":{ 10 | "facet":"on", 11 | "facet.mincount": "1", 12 | "":{"v":0} 13 | }, 14 | "velocity":{ 15 | "wt": "velocity", 16 | "v.template":"browse", 17 | "v.layout": "layout", 18 | "":{"v":0} 19 | } 20 | }} -------------------------------------------------------------------------------- /conf/protwords.txt: -------------------------------------------------------------------------------- 1 | # The ASF licenses this file to You under the Apache License, Version 2.0 2 | # (the "License"); you may not use this file except in compliance with 3 | # the License. You may obtain a copy of the License at 4 | # 5 | # http://www.apache.org/licenses/LICENSE-2.0 6 | # 7 | # Unless required by applicable law or agreed to in writing, software 8 | # distributed under the License is distributed on an "AS IS" BASIS, 9 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 | # See the License for the specific language governing permissions and 11 | # limitations under the License. 12 | 13 | #----------------------------------------------------------------------- 14 | # Use a protected word file to protect against the stemmer reducing two 15 | # unrelated words to the same base word. 16 | 17 | # Some non-words that normally won't be encountered, 18 | # just to test that they won't be stemmed. 19 | dontstems 20 | zwhacky 21 | 22 | -------------------------------------------------------------------------------- /conf/stopwords.txt: -------------------------------------------------------------------------------- 1 | # Licensed to the Apache Software Foundation (ASF) under one or more 2 | # contributor license agreements. See the NOTICE file distributed with 3 | # this work for additional information regarding copyright ownership. 4 | # The ASF licenses this file to You under the Apache License, Version 2.0 5 | # (the "License"); you may not use this file except in compliance with 6 | # the License. You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | -------------------------------------------------------------------------------- /conf/synonyms.txt: -------------------------------------------------------------------------------- 1 | # The ASF licenses this file to You under the Apache License, Version 2.0 2 | # (the "License"); you may not use this file except in compliance with 3 | # the License. 
You may obtain a copy of the License at 4 | # 5 | # http://www.apache.org/licenses/LICENSE-2.0 6 | # 7 | # Unless required by applicable law or agreed to in writing, software 8 | # distributed under the License is distributed on an "AS IS" BASIS, 9 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 10 | # See the License for the specific language governing permissions and 11 | # limitations under the License. 12 | 13 | #----------------------------------------------------------------------- 14 | #some test synonym mappings unlikely to appear in real input text 15 | aaafoo => aaabar 16 | bbbfoo => bbbfoo bbbbar 17 | cccfoo => cccbar cccbaz 18 | fooaaa,baraaa,bazaaa 19 | 20 | # Some synonym groups specific to this example 21 | GB,gib,gigabyte,gigabytes 22 | MB,mib,megabyte,megabytes 23 | Television, Televisions, TV, TVs 24 | #notice we use "gib" instead of "GiB" so any WordDelimiterGraphFilter coming 25 | #after us won't split it into two words. 26 | 27 | # Synonym mappings can be used for spelling correction too 28 | pixima => pixma 29 | 30 | ## Custom synonym groups for movies index ## 31 | 32 | # Replacement Synonyms example 33 | scarey => scary 34 | spookey => spooky 35 | ciborg => cyborg 36 | 37 | # Multiway Expansion Synonyms examples 38 | scary,slasher,spooky,evil,horror 39 | # cowboy,buckaroo,buckeroo,cowhand,cowman,cowpoke,cowpuncher,wrangler 40 | 41 | # Oneway Expansion Synonyms examples 42 | droid => droid,android,robot,cyborg 43 | 44 | # Jargon 45 | ai,artificial intelligence 46 | ml,machine learning 47 | cia,central intelligence agency 48 | fbi,federal bureau investigation 49 | lol,laughing out loud,league legends 50 | -------------------------------------------------------------------------------- /mongo_to_solr.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # 3 | # author: Gary A. 
Stafford 4 | # site: https://programmaticponderings.com 5 | # license: MIT License 6 | # purpose: ETL - Import MongoDB collection of documents to Solr index 7 | # usage: python3 ./mongo_to_solr.py 8 | 9 | import pymongo 10 | import os 11 | import requests 12 | from bson.json_util import dumps 13 | 14 | mongodb_conn = os.environ.get('MONGODB_CONN') 15 | mongodb_client = pymongo.MongoClient(mongodb_conn) 16 | mongo_db = mongodb_client['video'] 17 | mongo_collection = mongo_db['movieDetails'] 18 | solr_url = os.environ.get('SOLR_URL') 19 | solr_collection = 'movies' 20 | 21 | 22 | def main(): 23 | print('Target MongoDB instance: %s' % mongodb_conn) 24 | print('Target Solr instance: %s' % solr_url) 25 | 26 | get_documents() 27 | add_all() 28 | 29 | 30 | # Read all documents from the MongoDB collection 31 | def get_documents(): 32 | # https://www.w3schools.com/python/python_mongodb_query.asp 33 | mongo_query = {} 34 | documents = mongo_collection.find(mongo_query) 35 | return documents 36 | 37 | 38 | # Add documents to Solr in bulk 39 | def add_all(): 40 | # https://lucene.apache.org/solr/guide/7_6/uploading-data-with-index-handlers.html#adding-multiple-json-documents 41 | documents = get_documents() 42 | 43 | path = '/update/json/docs?commit=true' 44 | 45 | print('documents to add: ', dumps(documents.count())) 46 | 47 | r = requests.post(solr_url + '/' + solr_collection + path, data=dumps(documents)) 48 | print('add all status: ', r.status_code, r.reason, r.url, r.content) 49 | 50 | 51 | # Add documents to Solr one at a time 52 | def add_each(): 53 | # https://lucene.apache.org/solr/guide/7_6/uploading-data-with-index-handlers.html#adding-multiple-json-documents 54 | 55 | documents = get_documents() 56 | path = '/update/json/docs?commit=true' 57 | print('documents to add: ', dumps(documents.count())) 58 | 59 | for document in documents: 60 | print(dumps(document)) 61 | r = requests.post(solr_url + '/' + solr_collection + path, data=dumps(document)) 62 | print('add each status: ', 
r.status_code, r.reason, r.url, r.content) 63 | 64 | 65 | # TODO: Fix - Create collection function not working correctly 66 | def create_collection(): 67 | # https://lucene.apache.org/solr/guide/7_6/collections-api.html 68 | path = '/admin/collections?action=CREATE&name=' + solr_collection + \ 69 | '&collection.configName=_default&numShards=2&replicationFactor=1&wt=xml' 70 | r = requests.get(solr_url + path) 71 | print('create collection status: ', r.status_code, r.reason, r.content) 72 | 73 | path = '/config' 74 | data = {'set-user-property': {'update.autoCreateFields': 'false'}} 75 | r = requests.post(solr_url + '/' + solr_collection + path, json=data) 76 | print('config update status: ', r.status_code, r.reason, r.url, r.content) 77 | 78 | 79 | # Change title and plot to multiValued = false; genres remains multiValued 80 | def multi_value_false(): 81 | path = '/schema' 82 | json_data = '{"replace-field":{"name":"title","type":"text_en","multiValued":false},' \ 83 | '"replace-field":{"name":"plot","type":"text_en","multiValued":false},' \ 84 | '"replace-field":{"name":"genres","type":"text_en","multiValued":true}}' 85 | 86 | r = requests.post(solr_url + '/' + solr_collection + path, data=json_data) 87 | print('schema update status: ', r.status_code, r.reason) 88 | 89 | 90 | if __name__ == "__main__": 91 | main() 92 | -------------------------------------------------------------------------------- /query_mongo.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # 3 | # author: Gary A. 
Stafford 4 | # site: https://programmaticponderings.com 5 | # license: MIT License 6 | # purpose: Perform queries against MongoDB 7 | # usage: python3 ./query_mongo.py 8 | 9 | import pymongo 10 | import os 11 | 12 | mongodb_conn = os.environ.get('MONGODB_CONN') 13 | mongodb_client = pymongo.MongoClient(mongodb_conn) 14 | mongo_db = mongodb_client['video'] 15 | mongo_collection = mongo_db['movieDetails'] 16 | 17 | 18 | def main(): 19 | print('Target MongoDB instance: %s' % mongodb_conn) 20 | 21 | create_indexes() 22 | 23 | # Query 1a: All Documents 24 | find_documents({}) 25 | 26 | # Query 1b: Count Only 27 | count_documents() 28 | 29 | # Query 2: Exact Search 30 | find_documents({'title': 'Star Wars: Episode V - The Empire Strikes Back'}) 31 | 32 | # Query 3: Search Phrase 33 | find_documents({'title': {'$regex': r'\bstar wars\b', '$options': 'i'}}) 34 | 35 | # Query 4: Search Terms 36 | # find_documents({'title': {'$regex': 'star|wars', '$options': 'i'}}) 37 | find_documents({'$text': {'$search': 'star wars', 38 | '$language': 'en', 39 | '$caseSensitive': False}, 40 | 'countries': 'USA'}, 41 | [('score', {'$meta': 'textScore'})], 42 | projection={'score': {'$meta': 'textScore'}, '_id': 0, 'title': 1}) 43 | 44 | # Query 5a: Multiple Search Terms 45 | find_documents({'genres': {'$in': ['Adventure', 'Action', 'Western']}, 'countries': 'USA'}, 46 | projection={'_id': 0, 'genres': 1, 'title': 1}) 47 | 48 | # modify index 49 | modify_index() 50 | 51 | # Query 6a 52 | find_documents({'$text': {'$search': 'western action adventure', 53 | '$language': 'en', 54 | '$caseSensitive': False}, 55 | 'countries': 'USA'}, 56 | [('score', {'$meta': 'textScore'})], 57 | projection={'score': {'$meta': 'textScore'}, '_id': 0, 'genres': 1, 'title': 1}) 58 | 59 | weight_index() 60 | 61 | # Query 6b: Boosting Fields 62 | find_documents({'$text': {'$search': 'western action adventure', 63 | '$language': 'en', 64 | '$caseSensitive': False}, 65 | 'countries': 'USA'}, 66 | [('score', 
{'$meta': 'textScore'})], 67 | projection={'score': {'$meta': 'textScore'}, '_id': 0, 'genres': 1, 'title': 1}) 68 | 69 | # # Additional Unused Query Variations 70 | # find_documents({'$text': {'$search': 'Star Wars: Episode V - The Empire Strikes Back', 71 | # '$language': 'en', 72 | # '$caseSensitive': False}, 73 | # 'countries': 'USA'}, 74 | # [('score', {'$meta': 'textScore'})], 75 | # projection={'score': {'$meta': 'textScore'}, '_id': 0, 'title': 1}) 76 | # find_documents({'title': {'$regex': r'\bstar wars\b|\bstar trek\b', '$options': 'i'}}) 77 | # 78 | # find_documents({'plot': {'$regex': r'\bwestern\b|\baction\b|\badventure\b', '$options': 'i'}, 79 | # 'countries': 'USA'}) 80 | # 81 | # find_documents({'plot': {'$regex': 'western|action|adventure', '$options': 'i'}, 82 | # 'countries': 'USA'}) 83 | # 84 | # find_documents({'title': {'$regex': r'\bstar\b|\bwars\b', '$options': 'i'}}) 85 | # 86 | # find_documents({'$or': [{'title': {'$regex': r'\bwestern\b|\baction\b|\badventure\b', '$options': 'i'}}, 87 | # {'plot': {'$regex': r'\bwestern\b|\baction\b|\badventure\b', '$options': 'i'}}, 88 | # {'genres': {'$regex': r'\bwestern\b|\baction\b|\badventure\b', '$options': 'i'}}], 89 | # 'countries': 'USA'}) 90 | # 91 | # find_documents({'$or': [{'title': {'$regex': 'western|action|adventure', '$options': 'i'}}, 92 | # {'plot': {'$regex': 'western|action|adventure', '$options': 'i'}}, 93 | # {'genres': {'$regex': 'western|action|adventure', '$options': 'i'}}], 94 | # 'countries': 'USA'}) 95 | 96 | 97 | def create_indexes(): 98 | try: 99 | mongo_collection.drop_index('genres_text_title_text_plot_text') 100 | except pymongo.errors.OperationFailure: 101 | print('No index to remove') 102 | 103 | try: 104 | mongo_collection.drop_index('title_text') 105 | except pymongo.errors.OperationFailure: 106 | print('No index to remove') 107 | 108 | try: 109 | mongo_collection.drop_index('countries_1') 110 | except pymongo.errors.OperationFailure: 111 | print('No index to remove') 
112 | 113 | try: 114 | mongo_collection.drop_index('title_1') 115 | except pymongo.errors.OperationFailure: 116 | print('No index to remove') 117 | 118 | mongo_collection.create_index([('title', pymongo.ASCENDING)]) 119 | mongo_collection.create_index([('countries', pymongo.ASCENDING)]) 120 | mongo_collection.create_index([('title', pymongo.TEXT)]) 121 | 122 | 123 | def modify_index(): 124 | try: 125 | mongo_collection.drop_index('title_text') 126 | except pymongo.errors.OperationFailure: 127 | print('No index to remove') 128 | 129 | mongo_collection.create_index([('genres', pymongo.TEXT), 130 | ('title', pymongo.TEXT), 131 | ('plot', pymongo.TEXT)]) 132 | 133 | 134 | def weight_index(): 135 | try: 136 | mongo_collection.drop_index('genres_text_title_text_plot_text') 137 | except pymongo.errors.OperationFailure: 138 | print('No index to remove') 139 | 140 | mongo_collection.create_index([('genres', pymongo.TEXT), 141 | ('title', pymongo.TEXT), 142 | ('plot', pymongo.TEXT)], 143 | weights={'genres': 4, 'title': 2}) 144 | 145 | 146 | def find_documents(query, *sort, projection={'_id': 0, 'title': 1}): 147 | if sort: 148 | documents = mongo_collection \ 149 | .find(query, projection) \ 150 | .sort(*sort) \ 151 | .limit(5) 152 | else: 153 | documents = mongo_collection \ 154 | .find(query, projection) \ 155 | .limit(5) 156 | 157 | print("----------\n") 158 | print("Parameters\n----------") 159 | # print(documents.explain()) 160 | print("query: %s" % query) 161 | print("projection: %s" % projection) 162 | try: 163 | print("sort: %s" % sort) 164 | except TypeError: 165 | print("sort: %s" % "none") 166 | print("\nResults\n----------") 167 | print("document count: %s" % documents.count()) 168 | for document in documents: 169 | if 'score' in document: 170 | document['score'] = round(document['score'], 2) 171 | print(document) 172 | 173 | 174 | def count_documents(): 175 | count = mongo_collection.count() 176 | 177 | print("----------\n") 178 | 
print("Parameters\n----------") 179 | # print(documents.explain()) 180 | print("query: {}") 181 | print("\nResults\n----------") 182 | print("document count: %s" % count) 183 | 184 | 185 | if __name__ == "__main__": 186 | main() 187 | -------------------------------------------------------------------------------- /query_solr.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # 3 | # author: Gary A. Stafford 4 | # site: https://programmaticponderings.com 5 | # license: MIT License 6 | # purpose: Perform searches against a Solr index 7 | # usage: python3 ./query_solr.py 8 | 9 | import json 10 | import pysolr 11 | import os 12 | from bson.json_util import dumps 13 | 14 | solr_url = os.environ.get('SOLR_URL') 15 | solr_collection = "movies" 16 | solr = pysolr.Solr(solr_url + "/" + solr_collection) 17 | 18 | 19 | def main(): 20 | print('Target Solr instance: %s' % solr_url) 21 | 22 | # Query 1a: All Documents 23 | solr_search("*:*", **{ 24 | "defType": "lucene", 25 | "fl": "*, score", 26 | "sort": "title asc", 27 | "rows": "5"}) 28 | 29 | # Query 1b: Count Only 30 | solr_search("*:*", **{ 31 | "defType": "lucene", 32 | "omitHeader": "true", 33 | "rows": "0"}) 34 | 35 | # Query 2: Exact Search 36 | solr_search("title: \"Star Wars: Episode V - The Empire Strikes Back\"", **{ 37 | "defType": "lucene", 38 | "fl": "title score"}) 39 | 40 | # Query 3: Search Phrase 41 | solr_search("\"star wars\"", **{ 42 | "defType": "lucene", 43 | "df": "title", 44 | "fl": "title, score"}) 45 | 46 | # Query 4: Search Terms 47 | solr_search("star wars", **{ 48 | "defType": "lucene", 49 | "fq": "countries: USA", 50 | "df": "title", 51 | "fl": "title, score", 52 | "rows": "5"}) 53 | 54 | # Query 5a: Multiple Search Terms 55 | solr_search("adventure action western", **{ 56 | "defType": "lucene", 57 | "fq": "countries: USA", 58 | "df": "genres", 59 | "fl": "title, genres, score", 60 | "rows": "5"}) 61 | 62 | # Query 5b: Required 
Search Term
    solr_search("adventure action +western", **{
        "defType": "lucene",
        "fq": "countries: USA",
        "df": "genres",
        "fl": "title, genres, score",
        "rows": "5"})

    # Query 6a: eDisMax Query
    solr_search("adventure action western", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title genres",
        "fl": "title, genres, score",
        "rows": "5"})

    # Query 6b: eDisMax Boosted Fields
    solr_search("adventure action western", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title^2.0 genres^4.0",
        "fl": "title, genres, score",
        "rows": "5"})

    # Query 6c: eDisMax Boosted with Required/Prohibited Terms
    solr_search("adventure action +western -romance", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title^2.0 genres^4.0",
        "fl": "title, genres, score",
        "rows": "5"})

    # Query 6d: eDisMax w/ Required/Prohibited Terms, w/o Boost
    solr_search("adventure action +western -romance", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title genres",
        "fl": "title, genres, score",
        "rows": "5"})

    # Query 7a: The Movie Dilemma
    solr_search("A cowboys movie", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title genres",
        "fl": "title, genres, score",
        "rows": "10"})

    # Query 7b: Stop Words (simulation)
    solr_search("The Lego Movie -movie", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title genres",
        "fl": "title, genres, score",
        "rows": "10"})

    # Query 7c: Negative Boost
    solr_search("A cowboys movie", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title genres",
        "bq": "title:movie^-2.0",
        "fl": "title, genres, score",
        "rows": "10"})

    # Query 8: Boost Function
    solr_search("adventure action +western -romance", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title genres",
        "fl": "title, awards.wins, score",
        "rows": "5"})

    solr_search("adventure action +western -romance", **{
        "defType": "edismax",
        "fq": "countries: USA",
        "qf": "plot title genres",
        "fl": "title, awards.wins, score",
        "boost": "div(field(awards.wins,min),2)",
        "rows": "5"})

    # Query 9a: MLT Genres
    mlt_id = get_movie_id("Star Wars: Episode I - The Phantom Menace")

    mlt_qf = "genres"

    solr_search("{!mlt qf=\"%s\" mintf=1 mindf=1}%s" % (mlt_qf, mlt_id), **{
        "defType": "lucene",
        "fq": "countries: USA",
        "fl": "title, genres, score",
        "rows": "5"})

    # Query 9b: The Problem with George
    mlt_qf = "actors director writers"

    solr_search("id:\"%s\"" % mlt_id, **{
        "defType": "lucene",
        "fl": "actors director writers"})

    solr_search("{!mlt qf=\"%s\" mintf=1 mindf=1}%s" % (mlt_qf, mlt_id), **{
        "defType": "lucene",
        "fq": "countries: USA",
        "fl": "title, actors, director, writers, score",
        "rows": "10"})

    # Query 10a: Replacement Synonyms
    solr_search("ciborg", **{
        "defType": "edismax",
        "qf": "title plot genres",
        "fl": "title, score",
        "stopwords": "true",
        "rows": "5"})

    # Query 10b: Oneway Expansion Synonyms
    solr_search("droid", **{
        "defType": "edismax",
        "qf": "title plot genres",
        "fl": "title, score",
        "stopwords": "true",
        "rows": "5"})

    # Query 10c: Multiway Expansion Synonyms
    solr_search("scary", **{
        "defType": "edismax",
        "qf": "title plot genres",
        "fl": "title, score",
        "stopwords": "true",
        "rows": "5"})

    # 10d: Synonymous Phrases
    solr_search("lol", **{
        "defType": "edismax",
        "qf": "title plot genres",
        "fl": "title, score",
        "stopwords": "true",
        "rows": "5"})

    # Query 11: Faceting
    get_facets("adventure action +western -romance", **{
        "defType": "edismax",
        "omitHeader": "true",
        "qf": "title plot genres",
        "fl": "title, genres, score",
        "facet": "on",
        "facet.field": "genres",
        "facet.mincount": "1",
        "facet.sort": "genres",
        "rows": "0"})

    # # Additional Unused Query Variations
    # # eDisMax - Basic example, multiple search terms
    # solr_search("actors:\"John Wayne\" AND western action adventure", **{
    #     "defType": "edismax",
    #     "qf": "plot title genres actors director",
    #     "fl": "id, plot, title, genres, actors, director, score",
    #     "rows": "5"})
    #
    # solr_search("western action adventure with John Wayne", **{
    #     "defType": "edismax",
    #     "qf": "plot title genres actors director",
    #     "fl": "id plot title genres actors director score",
    #     "rows": "5"})
    #
    # solr_search("western action adventure +\"John Wayne\"", **{
    #     "defType": "edismax",
    #     "qf": "plot title genres actors director",
    #     "fl": "id plot title genres actors director score",
    #     "rows": "5"})
    #
    # # eDisMax - Boosted fields
    # solr_search("western action adventure", **{
    #     "defType": "edismax",
    #     "qf": "plot title^2.0 genres^3.0",
    #     "fl": "title genres score",
    #     "rows": "5"})
    #
    # solr_search("classic western action adventure adventure", **{
    #     "defType": "edismax",
    #     "qf": "plot title^2.0 genres^3.0",
    #     "fl": "title genres score",
    #     "rows": "5"})
    #
    # # eDisMax - Boost results that have a field that matches a specific value
    # solr_search("classic western action adventure adventure", **{
    #     "defType": "edismax",
    #     "qf": "plot title^2.0 genres^3.0",
    #     "bq": "genres:western^5.0",
    #     "fl": "title genres score",
    #     "rows": "5"})
    #
    # solr_search("\"star wars\" OR \"star trek\"", **{
    #     "defType": "lucene",
    #     "df": "title",
    #     "fl": "title score",
    #     "rows": "5"})
    #
    # # why we can't add 'movie' as a stop word
    # solr_search("\"movie\"", **{
    #     "defType": "lucene",
    #     "df": "title",
    #     "fl": "title score",
    #     "rows": "5"})
    #
    # solr_search("*western* *action* *adventure*", **{
    #     "defType": "edismax",
    #     "fq": "countries: USA",
    #     "qf": "plot title genres",
    #     "fl": "title genres score",
    #     "rows": "5"})


# Solr's default Query Parser (aka lucene parser)
def solr_search(q, **kwargs):
    results = solr.search(q, **kwargs)

    print("----------\n")
    print("Parameters\n----------")
    print("q: %s" % q)
    print("kwargs: %s" % kwargs)
    print("\nResults\n----------")
    print("document count: %s" % results.hits)
    print("qtime (ms): %s" % results.qtime)
    # print("docs: %s" % dumps(results.docs))
    for document in results.docs:
        if 'score' in document:
            document['score'] = round(document['score'], 2)
        # print(json.dumps(document, indent=2, sort_keys=True))  # json pretty print
        print(document)


def get_movie_id(title):
    # More Like This Query Parser (MLTQParser) example
    movie_id = solr.search("title:\"%s\"" % title, **{
        "defType": "lucene",
        "omitHeader": "true",
        "fl": "id",
        "indent": "false",
        "rows": 1
    }).docs[0]['id']

    print("----------\n")
    print("Parameters\n----------")
    print("title: %s " % title)
    print("\nResults\n----------")
    print("id: %s " % movie_id)

    return movie_id


def get_facets(q, **kwargs):
    results = solr.search(q, **kwargs)

    print("----------\n")
    print("Parameters\n----------")
    print("q: %s" % q)
    print("kwargs: %s" % kwargs)
    print("\nResults\n----------")
    print("document count: %s" % results.hits)
    print("qtime (ms): %s" % results.qtime)
    # print("Facets: %s" % json.dumps(results.facets, indent=2))
    print("Facets: %s" % results.facets)


# TODO: Fix - MLT Query Parser function not working
def more_like_this_query_parser(q, mltfl):
    results = solr.more_like_this(q, mltfl)

    print("----------\n")
    print("Parameters\n----------")
    print("q : %s" % q)
    print("mlt fl: %s" % mltfl)
    print("\nResults\n----------")
    print("document count: %s" % results.hits)
    print("qtime (ms): %s" % results.qtime)
    print("docs: %s" % dumps(results.docs))


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
/solr_index_movies.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
#
# author: Gary A. Stafford
# site: https://programmaticponderings.com
# license: MIT License
# purpose: Index the collection of movie documents to Solr
# usage: python3 ./solr_index_movies.py

import json
import os
import requests

solr_url = os.environ.get('SOLR_URL')
solr_collection = 'movies'
data_file = 'data/movieDetails.json'


def main():
    print('Target Solr instance: %s' % solr_url)
    delete_all_documents()
    # load_json_file_to_solr()
    multi_value_false()
    delete_all_documents()
    load_json_file_to_solr()
    get_document_count()


def delete_all_documents():
    # https://wiki.apache.org/solr/FAQ
    path = '/update'
    # charset is a parameter of the Content-type header, not a separate header
    headers = {'Content-type': 'text/xml; charset=utf-8'}

    # Delete-by-query command that matches every document
    raw_data = '<delete><query>*:*</query></delete>'
    r = requests.post(solr_url + '/' + solr_collection + path, data=raw_data, headers=headers)
    print('Delete all documents: ', r.status_code, r.reason)

    # Commit so that the delete becomes visible to searches
    raw_data = '<commit />'
    r = requests.post(solr_url + '/' + solr_collection + path, data=raw_data, headers=headers)
    print('Commit delete: ', r.status_code, r.reason)


def load_json_file_to_solr():
    with open(data_file) as data:
        json_data = json.load(data)
        json_data = add_id(json_data)

    path = '/update/json/docs?commit=true'
    r = requests.post(solr_url + '/' + solr_collection + path, json=json_data)
    print('Bulk add all documents: ', r.status_code, r.reason)


# Add Solr ID field by copying MongoDB ID field
def add_id(json_data):
    for document in json_data:
        document['id'] = document['_id']['$oid']
    return json_data


def multi_value_false():
    path = '/schema'
    # The 'replace-field' key is repeated on purpose, one command per field,
    # so the payload is written as a raw string rather than built from a dict
    json_data = '{"replace-field":{"name":"title","type":"text_en","multiValued":false},' \
                '"replace-field":{"name":"plot","type":"text_en","multiValued":false},' \
                '"replace-field":{"name":"genres","type":"text_en","multiValued":true}}'

    r = requests.post(solr_url + '/' + solr_collection + path, data=json_data)
    print('Modify schema: ', r.status_code, r.reason)


def get_document_count():
    path = '/select?q=*:*&rows=0'
    r = requests.get(solr_url + '/' + solr_collection + path)
    print('Document count: ', r.status_code, r.reason, r.text)


if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
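The `multi_value_false` function in `solr_index_movies.py` repeats the `replace-field` key inside a hand-written JSON string, which a Python `dict` cannot represent. The following is a minimal sketch of one possible alternative, not the repository's method. It assumes that Solr's Schema API also accepts a list of field definitions under a single command key, which is worth confirming against the Schema API documentation for the Solr version in use. `build_replace_fields_payload` is a hypothetical helper name introduced only for this illustration.

```python
import json


def build_replace_fields_payload(fields):
    """Serialize field definitions under a single 'replace-field' command.

    Assumption: the Schema API accepts a JSON array of definitions for one
    command, so the key does not have to be repeated.
    """
    return json.dumps({"replace-field": list(fields)})


# The same three field definitions that multi_value_false() sends
fields = [
    {"name": "title", "type": "text_en", "multiValued": False},
    {"name": "plot", "type": "text_en", "multiValued": False},
    {"name": "genres", "type": "text_en", "multiValued": True},
]

payload = build_replace_fields_payload(fields)
print(payload)
```

If the assumption holds, `payload` could be passed as `data=payload` to the existing `requests.post` call in `multi_value_false`, which keeps the field definitions as ordinary Python data that is easier to edit and validate.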