├── LICENSE
├── romanian.stop
└── README.md

/LICENSE:
--------------------------------------------------------------------------------
Copyright (c) 2016, Ionut-Cristian Florescu

Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted, provided that the above
copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
--------------------------------------------------------------------------------
/romanian.stop:
--------------------------------------------------------------------------------
a
abia
acea
aceasta
aceea
aceia
acel
acela
acelasi
acele
acelea
aceluiasi
acest
acesta
aceste
acestea
acestei
acesti
acestia
acestor
acestora
acestui
acolo
acum
adica
ai
aia
aici
al
ala
alaturi
ale
alt
alta
altceva
alte
altele
altfel
alti
altii
altul
am
anume
apoi
ar
are
as
asa
asemenea
asta
astazi
astfel
asupra
atare
atat
atata
atatea
atati
atatia
ati
atit
atiti
atitia
atunci
au
avea
avem
avut
azi
ba
bine
ca
cam
cand
care
careia
carora
caruia
cat
cata
cate
cati
catre
ce
cea
ceea
cei
ceilalti
cel
cele
celelalte
celor
ceva
chiar
ci
cind
cine
cineva
cit
cite
citeva
citi
citiva
cu
cui
cum
cumva
da
daca
dar
de
deasupra
decat
deci
decit
deja
desi
despre
din
dintr
dintre
doar
dupa
ea
ei
el
ele
era
este
eu
fara
fecareia
fel
fi
fie
fiecare
fiecarui
fiecaruia
fiind
foarte
fost
i-au
iar
ieri
ii
il
imi
impotriva
in
inainte
inapoi
inca
incat
incit
insa
insusi
intr
intre
isi
iti
l-am
la
le
li
lor
lui
ma
mai
mare
mereu
mod
mult
multa
multe
multi
ne
nici
niciodata
nimeni
nimic
niste
noi
nostri
nostru
noua
nu
numai
o
oarecare
oarece
oarecine
oarecui
or
orice
oricum
pana
pe
pentru
peste
pina
plus
poata
prea
prin
printr-o
putini
s-ar
sa
sa-i
sa-mi
sa-si
sa-ti
sai
sale
sau
se
si
sint
sintem
sinteti
spre
sub
sunt
suntem
sunteti
te
ti
toata
toate
tocmai
tot
toti
totul
totusi
tu
tuturor
un
una
unde
unei
unele
uneori
unii
unor
unui
unul
va
voi
vom
vor
vreo
vreun
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# PostgreSQL text-search utils

A collection of files and patterns to improve the default PostgreSQL text search.

## What's this?

The PostgreSQL text search engine is one of the best available across both commercial and open-source RDBMSs, but it lacks a number of features when it comes to supporting languages other than English.

This repository aims to provide useful patterns and files that may be missing from a default PostgreSQL distribution.

## Note

The location of `$SHAREDIR` depends on your OS and PostgreSQL distribution. On a Debian/Ubuntu-based box with PostgreSQL 9.5, you'll find it at `/usr/share/postgresql/9.5/`; running `pg_config --sharedir` should print it on most installations. Ask around on Stack Overflow if you can't find yours. Please refrain from raising issues in this repo to ask about it.

## Updated unaccent rules

The `unaccent.rules` file on your system may lack a number of UTF-8 characters, depending on how old your PostgreSQL version is. If necessary, you can replace the one in your `$SHAREDIR/tsearch_data` folder with the latest one from the PostgreSQL repository:

    cd `pg_config --sharedir`/tsearch_data
    curl -O https://raw.githubusercontent.com/postgres/postgres/master/contrib/unaccent/unaccent.rules

## Language-specific stop words

In a standard PostgreSQL 9.5 package there are 14 `*.stop` files covering Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and Turkish.

If your language is not among those, there's a chance you'll find the missing file in this repository. For example, to install the Romanian stop words:

    cd `pg_config --sharedir`/tsearch_data
    curl -O https://raw.githubusercontent.com/icflorescu/postgresql-tsearch-utils/master/romanian.stop

If the file you're looking for is not here, then by all means feel free to contribute a useful pull request.
Simply raising an issue to ask for it will probably not help, but I'll gladly accept pull requests with `greek.stop`, `bulgarian.stop`, `czech.stop`, etc.

## Useful information around the web

- [Controlling Text Search](https://www.postgresql.org/docs/current/static/textsearch-controls.html) in the official PostgreSQL manual
- [Postgres full-text search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/)
- [Full text search in milliseconds with PostgreSQL](https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/)
- [PostgreSQL: A full text search engine](http://shisaa.jp/postset/postgresql-full-text-search-part-1.html)
- [Indexing for full text search in PostgreSQL](https://www.compose.io/articles/indexing-for-full-text-search-in-postgresql/)
- [From LIKE to Full-Text Search](http://www.nomadblue.com/blog/django/from-like-to-full-text-search-part-I/)

## Example

Here's how to create an improved search configuration for the Romanian language:

    /* Make sure you've updated the unaccent.rules file
     * before creating the extension
     */
    CREATE EXTENSION unaccent;

    /* Create a Romanian "snowball" dictionary that
     * takes into account the stop words defined in romanian.stop
     */
    CREATE TEXT SEARCH DICTIONARY ro (
      TEMPLATE = snowball
    , LANGUAGE = romanian
    , STOPWORDS = romanian
    );

    /* Create a new text search configuration
     * based on the newly created dictionary and unaccent.rules
     */
    CREATE TEXT SEARCH CONFIGURATION ro (COPY = romanian);
    ALTER TEXT SEARCH CONFIGURATION ro
      ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, hword, hword_part, word
      WITH unaccent, ro;

    /* The built-in configuration gives poor results: stop words such as
     * 'cu', 'de' and 'și' are kept, and so are the diacritics:
     * 'autom':10 'cu':3 'cut':7 'de':8 'integral':5 'limuzin':1 'tracțiun':4 'verd':2 'vitez':9 'și':6
     */
    SELECT to_tsvector('romanian', 'limuzină verde cu tracțiune integrală și cutie de viteze automată');

    /* Using the new search configuration gives better results:
     * 'autom':10 'cut':7 'integral':5 'limuzin':1 'tractiun':4 'verd':2 'vitez':9
     */
    SELECT to_tsvector('ro', 'limuzină verde cu tracțiune integrală și cutie de viteze automată');

## Read carefully before raising issues

I'm getting lots of questions from people just learning to do web development or simply looking to solve a very specific problem they're dealing with. While I will answer some of them for the benefit of the community, please understand that open source is a shared effort and it's definitely not about piggybacking on other people's work. In places like GitHub, that means raising issues is encouraged, but **coming up with useful pull requests is a lot better**. If I'm willing to share some of my code for free, I'm doing it for a number of reasons: my own intellectual challenges, pride, arrogance, a stubborn belief that I'm contributing to common progress and freedom, etc. Your particular well-being is probably not one of those reasons. I'm not in the business of providing free consultancy, so if you need my help to solve your specific problem, there's a fee for that.

## Asking for help or a new feature

See the note above. If you need help and are willing to pay for it, drop me a message. If you have an idea for a new feature that doesn't break existing ones and you're willing to invest the effort to make it happen, have a look at the code and feel free to open a pull request.

## Credits

See contributors [here](https://github.com/icflorescu/postgresql-tsearch-utils/graphs/contributors).

If you find this repo useful, don't hesitate to give it a star and [spread the word](http://twitter.com/share?text=Checkout%20this%20PostgreSQL%20text%20search%20utils%20repo!&url=http%3A%2F%2Fgithub.com/icflorescu/postgresql-tsearch-utils&hashtags=PostgreSQL,database,textsearch&via=icflorescu).

## License

The [ISC License](https://github.com/icflorescu/postgresql-tsearch-utils/blob/master/LICENSE).
--------------------------------------------------------------------------------