├── LICENSE
├── README.md
└── klapa.sql

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2020 Yoni Goldblat
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Klapa
2 | Text-pattern clustering in PostgreSQL
3 | 
4 | ## What is Klapa?
5 | **Klapa** (pronounced kla·pa /ˈkläpä/) is an original, relatively fast and efficient algorithm for finding patterns in a dataset of strings.
6 | 
7 | **Klapa** can cluster patterns from database tables or the file system, and relies on PostgreSQL (>= 11) to tackle the performance issues that are common in clustering combinatorics.
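To make the pattern notation used in this README concrete before the demo below: each `*` in a pattern stands for exactly one word, and a trailing `*?` means the string may continue with further words. The following is an illustrative Python sketch of that reading (it is not part of Klapa itself, which is pure PL/pgSQL):

```python
import re

def klapa_pattern_to_regex(pattern: str) -> re.Pattern:
    """Interpret a Klapa-style pattern as a regular expression:
    '*' matches exactly one word (like \\S+), and a trailing '*?'
    allows the string to continue with any number of extra words."""
    parts = pattern.split()
    optional_tail = bool(parts) and parts[-1] == "*?"
    if optional_tail:
        parts = parts[:-1]
    # Literal words are escaped; each '*' becomes one non-space run
    body = r"\s+".join(r"\S+" if p == "*" else re.escape(p) for p in parts)
    tail = r"(?:\s+\S+)*" if optional_tail else ""
    return re.compile(r"^" + body + tail + r"$")

rx = klapa_pattern_to_regex("the * * * over the *?")
print(bool(rx.match("the black fox jumped over the bench")))  # True
print(bool(rx.match("one yellow fox saw a bird")))            # False
```

Applied to the eight demo strings below, this pattern matches 6 of them, consistent with the 0.75 figure reported in the demo table.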
8 | 
9 | ## Motivation
10 | The ability to detect recurring textual patterns is useful in fields that require data investigation, such as:
11 | - Log file analysis (anomaly detection, health monitoring)
12 | - Debugging trace files
13 | - Natural language processing (NLP)
14 | 
15 | ## Demo
16 | Assume you have the following strings and want to find the most common patterns:
17 | 
18 | - the black fox jumped over the bench
19 | - the red fox walked over the river
20 | - the yellow fox skipped the line
21 | - one yellow fox saw a bird
22 | - the yellow dog flipped over the pool
23 | - the black cow flipped over the pool
24 | - the black cat flipped over the pool
25 | - the yellow dog flipped over the pool
26 | 
27 | **Klapa** clusters these strings into the following output (top 5 results):
28 | 
29 | | p | pattern | occurrences | # words | sample |
30 | |-------|-----------------------------------------|-------------|--------|--------------------------------------|
31 | | 0.875 | the *? | 7 | 1 | the black cat flipped over the pool |
32 | | 0.75 | the * * * over the *? | 6 | 3 | the black cat flipped over the pool |
33 | | 0.5 | the * * flipped over the pool *? | 4 | 5 | the black cat flipped over the pool |
34 | | 0.5 | * * fox *? | 4 | 1 | one yellow fox saw a bird |
35 | | 0.5 | * yellow *? | 4 | 1 | one yellow fox saw a bird |
36 | 
37 | So it can easily be seen that, for instance, the pattern `the * * * over the *?` appears in 75% of the strings (each asterisk represents a single word in its place, similar to `\S+` in regex).
38 | 
39 | 
40 | ## Installation
41 | 1. The best way to install Klapa is to import the `klapa.sql` file into your chosen database.
42 | 2. Enable the required extensions:
43 | `postgres=# CREATE EXTENSION intarray;`
44 | `postgres=# CREATE EXTENSION btree_gist;`
45 | 3. 
If you want to analyze text files, enable the foreign data wrapper extension and create a "server":
46 | `postgres=# CREATE EXTENSION file_fdw;`
47 | `postgres=# CREATE SERVER file_fdw_server FOREIGN DATA WRAPPER file_fdw;`
48 | 
49 | From there, you can start interacting with the `cluster_patterns` function.
50 | 
51 | ## Usage
52 | 
53 | ### Quick Start
54 | 
55 | Cluster a text column in the database using default parameters:
56 | ```sql
57 | postgres=# SELECT * FROM cluster_patterns(source_table_name => 'email.messages', source_table_column_name => 'subject')
58 | ```
59 | 
60 | Cluster a text file:
61 | ```sql
62 | postgres=# SELECT * FROM cluster_patterns(source_file_name => '/var/log/postgresql/postgresql.log')
63 | ```
64 | 
65 | ### Parameters
66 | 
67 | - **source_file_name** - Full path to the text file. The OS user running Postgres must have read access to it. Either this parameter or `source_table_name` is mandatory, depending on your data-source type.
68 | - **source_table_name** - Name of the table that holds the column to cluster. Note that if this table is not in the same schema where the `cluster_patterns` function is installed, you should provide the schema-qualified name in the form `'schema.tablename'`.
69 | - **source_table_column_name** - Name of the column that holds the strings to cluster. Mandatory when providing the `source_table_name` parameter.
70 | - **tail_source_p** - Process only the last x% of rows in the data source. A value of 1 processes the whole data source, while a value of 0.1 processes only the last 10% of the rows. This is mostly useful when more recent data needs inspecting, such as with log files. If the data source is a file, this parameter keeps the last lines of the file; if the data source is a database table, the function tails the last rows returned by the database for `select * from tablename`, which might not reflect insertion, sequential or timestamp order. 
(default 0.1)
71 | - **words_range** - The range of word positions in the sentence that the algorithm takes into account: first and last word positions, separated by a comma. Useful for log files whose lines start with a unique pattern (such as a timestamp, server name or IP address) or hold no relevant text after a certain position. A narrower range reduces the possible word combinations and results in faster performance. (default '6,20')
72 | - **words_delimiter** - Regular expression that, when matched, is used to split sentences into words. The default characters are space and the opening square bracket. (default '[\s\[]+')
73 | - **sample_p** - Percentage of rows to analyze from the data source after the tail step has been applied. The rows are chosen randomly. (default 0.5)
74 | - **word_occurrences_threshold** - Minimum number of times a word should occur in the data set to be considered for inclusion in patterns. Lowering this number increases the possible combinations in the recursive patterns loop (preparation of the words_tree table) and negatively impacts performance. (default 50)
75 | - **pattern_occurrences_threshold** - Minimum number of times a pattern should occur in the data source to be included in the query result. Changing this number has little impact on performance. (default 50)
76 | 
77 | #### Example
78 | 
79 | Find patterns in the first 5 words of the subject column in the messages table. Limit calculations to the last 25% of messages, and randomly sample only 50% of them.
80 | ```sql
81 | postgres=# SELECT * FROM cluster_patterns(source_table_name => 'email.messages',
82 | source_table_column_name =>'subject',
83 | words_range => '1,5',
84 | tail_source_p=>0.25,
85 | sample_p=>0.5)
86 | ```
87 | 
88 | ### Caveats
89 | 1. Due to Postgres `COPY` encoding limitations, some files might produce an `SQL Error [22P04]: ERROR: literal carriage return found in data` error. 
These files need further preparation before processing, to remove special Unicode characters and carriage returns. Using the terminal:
90 | `sed 's:\\:\\\\:g; s:\r:\\r:g' /var/log/postgresql/postgresql.log > /var/log/postgresql/altered.log`
91 | 
92 | ## Algorithm Features
93 | 
94 | ### Scaling
95 | Klapa was benchmarked against multiple scenarios to optimize the execution plans and to improve performance and memory footprint. This was done by pushing some PostgreSQL features to their limits, which is also why a few native extensions are required.
96 | Still, for very large datasets, exploding the words dictionary and aggregating the patterns can consume significant resources.
97 | The following tips can help reduce processing time, CPU and memory consumption:
98 | - If the number of strings in the data source is high, lower the `sample_p` *(default=0.5)* parameter to reduce the random sample size.
99 | - For sources where recent data is more relevant (such as log files), use the `tail_source_p` *(default=0.1)* parameter to analyze only the last x% of the data.
100 | - For strings with repeating irrelevant parts (such as log files where the first words represent a timestamp, machine name, etc.), change the `words_range` *(default=6,20)*.
101 | - Based on your dataset size, try altering `word_occurrences_threshold` *(default=50)* and `pattern_occurrences_threshold` *(default=50)* to pick only the most repetitive words and patterns.
102 | 
103 | ### Principles
104 | 
105 | - Code Readability. SQL formatting based on the style guide at [sqlstyle.guide](https://www.sqlstyle.guide/)
106 | - Ease of Use. This implementation is distributed as a single PL/pgSQL function, which means it can be installed and used easily.
107 | - Using DB Goodies. Postgres comes with lots of features that can stretch performance to the max. Some of them are tricky and need deep knowledge of the DB engine.
108 | - Built for Production. No compromises for the sake of a POC.
109 | - Leave No Trace. 
All runtime objects are temporary and live only for the duration of the query session.
110 | 
111 | ## Future Work
112 | 
113 | ### Todo
114 | 
115 | 1. Add ignore_characters[array], stop_words[array] and underscore_words[bool] parameters to slice the words dictionary.
116 | 2. Consider **HyperLogLog** to extrapolate calculations from the sample to the whole data source.
117 | 3. Utilize Postgres' multi-core parallelism to speed up the recursive query that builds the `words_tree` combinations table. This is the heaviest query in the algorithm. PG12 doesn't support parallel gathering of recursive CTEs (I could not achieve parallel execution even by dividing the parent node of the recursive loop into multiple UNION ALL queries).
118 | 4. ~~Change aggregated words_tree columns (tree, pos_tree) to be arrays of word IDs and word positions instead of concatenated strings.~~ (proved to increase memory consumption, and the performance improvement is negligible even for huge datasets)
119 | 5. ~~Separate `explode_words_pivot_sliced_word_position_gist` into 2 indexes~~ (the Postgres plan optimizer prefers such indexes over the current multi-column index, yet they provide slower execution times)
120 | 
121 | ### Bugs & Issues
122 | I'm open to contributions of any kind and happy to discuss, review and merge pull requests.
123 | Questions, comments, bug reports and patches are also welcome directly at my email address.
124 | 
125 | 
126 | ## Inspiration
127 | The **Klapa** algorithm is original, yet the idea of clustering text into patterns was inspired by the following publications:
128 | - [LogMine](https://dl.acm.org/citation.cfm?id=2983323.2983358): Fast Pattern Recognition for Log Analytics - for the idea of coarse-grained clustering and of type deduction to simplify logs.
129 | - [nestordemeure/textPatterns](https://github.com/nestordemeure/textPatterns/) - Hierarchical clustering of lines of text. 
130 | - Kusto reduce operator [(link)](https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/reduceoperator)
131 | - A Data Clustering Algorithm for Mining Patterns From Event Logs by Risto Vaarandi [(link)](https://ristov.github.io/publications/slct-ipom03-web.pdf)
132 | 
133 | ## About
134 | 
135 | ### Author
136 | The project was founded in 2020 by Yoni G ([email](goldblat-remove.this-@gmail.com) / [LinkedIn](https://www.linkedin.com/in/goldblat/))
137 | 
138 | **💸 Do you like my work?** I am a Senior Backend & Data Engineer open to offers and opportunities.
139 | 
140 | ### Why the name?
141 | The word klapa translates from Croatian as ["a group of friends"](https://en.wikipedia.org/wiki/Klapa), which resembles the idea of clustering patterns.
142 | In addition, it consists of 2 syllables, each pronounced like the stems of "**clu**stering **pa**tterns".
143 | 
144 | 
145 | ## License
146 | **Klapa** is distributed under the MIT license, which means you can use and modify it however you want. If you make an enhancement, please consider sending a pull request.
147 | 
148 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
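## Appendix: The core idea, sketched

As a companion to the Algorithm Features section, here is a deliberately simplified, in-memory Python sketch of Klapa's core loop: index each (position, word) pair by the set of strings it appears in, then repeatedly intersect those sets to grow longer patterns. The real implementation in `klapa.sql` does this with `intarray` intersections inside a recursive CTE and applies sampling, tailing and separate word/pattern thresholds; all names below are illustrative.

```python
def cluster_patterns_sketch(strings, min_occurrences=2):
    """Simplified sketch of Klapa's core loop: index each
    (position, word) pair by the set of strings it occurs in, then
    grow patterns left to right, keeping only combinations whose
    string sets still intersect in >= min_occurrences strings."""
    # 1. "Explode" words: (position, word) -> set of string ids
    index = {}
    for sid, s in enumerate(strings):
        for pos, word in enumerate(s.split()):
            index.setdefault((pos, word), set()).add(sid)
    index = {k: v for k, v in index.items() if len(v) >= min_occurrences}

    # 2. Recursively extend patterns (mirrors the words_tree CTE)
    patterns = {}
    frontier = {(key,): ids for key, ids in index.items()}
    while frontier:
        patterns.update(frontier)
        nxt = {}
        for path, ids in frontier.items():
            last_pos = path[-1][0]
            for (pos, word), wids in index.items():
                if pos > last_pos and len(ids & wids) >= min_occurrences:
                    nxt[path + ((pos, word),)] = ids & wids
        frontier = nxt

    # 3. Render: one '*' for every skipped word position
    results = []
    for path, ids in patterns.items():
        rendered, cursor = [], 0
        for pos, word in path:
            rendered += ["*"] * (pos - cursor) + [word]
            cursor = pos + 1
        results.append((" ".join(rendered), len(ids)))
    return sorted(results, key=lambda r: -r[1])
```

Running this on the eight demo strings reproduces the occurrence counts from the demo table: `the` occurs in 7 strings, `the * * * over the` in 6, and `the * * flipped over the pool` in 4.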
--------------------------------------------------------------------------------
/klapa.sql:
--------------------------------------------------------------------------------
1 | CREATE OR REPLACE FUNCTION cluster_patterns(source_file_name text DEFAULT ''::text,
2 | tail_source_p real DEFAULT 0.1,
3 | source_table_name text DEFAULT ''::text,
4 | source_table_column_name text DEFAULT ''::text,
5 | words_range text DEFAULT '6,20'::text,
6 | words_delimiter text DEFAULT '[\s\[]+'::text,
7 | sample_p real DEFAULT 0.5,
8 | word_occurrences_threshold integer DEFAULT 50,
9 | pattern_occurrences_threshold integer DEFAULT 50)
10 | RETURNS TABLE(p double precision, pattern text, occurrences integer, num_of_words integer, sample text)
11 | LANGUAGE plpgsql
12 | AS $function$
13 | 
14 | DECLARE
15 | total_rows_in_source INT;
16 | 
17 | 
18 | BEGIN
19 | 
20 | RAISE NOTICE 'START SCRIPT: %' , timeofday()::timestamp;
21 | 
22 | IF (source_file_name != '' AND source_table_name != '') OR (source_file_name = '' AND source_table_name = '') THEN
23 | RAISE EXCEPTION 'Please provide exactly one source for clustering: source_file_name OR source_table_name';
24 | END IF;
25 | 
26 | IF (source_table_name != '' AND source_table_column_name = '') THEN
27 | RAISE EXCEPTION 'Please provide the column name to cluster (source_table_column_name) when specifying source_table_name';
28 | END IF;
29 | 
30 | IF (SPLIT_PART(words_range,',',1)::INT > SPLIT_PART(words_range,',',2)::INT) THEN
31 | RAISE EXCEPTION 'words_range parameter should be in the form x,y where x<=y';
32 | END IF;
33 | 
34 | IF (tail_source_p > 1 OR tail_source_p < 0 OR sample_p > 1 OR sample_p < 0) THEN
35 | RAISE EXCEPTION 'tail_source_p OR sample_p values are out of range (0..1)';
36 | END IF;
37 | 
38 | IF (source_table_name != '') THEN
39 | EXECUTE 'CREATE TEMPORARY VIEW source AS SELECT ' || source_table_column_name || ' AS string FROM ' || source_table_name;
40 | ELSE
41 | EXECUTE 'CREATE FOREIGN TABLE source(string TEXT) SERVER 
FILE_FDW_SERVER OPTIONS (FORMAT ''text'',FILENAME ''' || source_file_name || ''',ENCODING ''utf8'', DELIMITER E''\x01'');'; 42 | END IF; 43 | 44 | SELECT COUNT(string) 45 | FROM source 46 | INTO total_rows_in_source; 47 | 48 | RAISE NOTICE 'FINISHED count(*): %' , timeofday()::timestamp; 49 | 50 | CREATE TEMPORARY TABLE source_sample ON COMMIT DROP AS 51 | SELECT string, 52 | ROW_NUMBER() OVER () AS id 53 | FROM (SELECT string, 54 | ROW_NUMBER() OVER (ORDER BY string) AS original_id 55 | FROM source 56 | ) AS ordered_source 57 | WHERE (original_id > total_rows_in_source * (1 - tail_source_p)) 58 | ORDER BY RANDOM() 59 | LIMIT (total_rows_in_source * tail_source_p * sample_p); 60 | 61 | RAISE NOTICE 'FINISHED BUILDING source_sample: %' , timeofday()::timestamp; 62 | 63 | IF (source_table_name != '') THEN 64 | DROP VIEW source; 65 | ELSE 66 | DROP FOREIGN TABLE source; 67 | END IF; 68 | 69 | CREATE TEMPORARY TABLE explode_words_pivot ON COMMIT DROP AS 70 | WITH explode_words AS (SELECT id::INT, 71 | word, 72 | word_position::BIGINT 73 | FROM source_sample t, 74 | regexp_split_to_table(t.string, words_delimiter) 75 | WITH ORDINALITY x(word, word_position)) 76 | SELECT word_position, 77 | word, 78 | ARRAY_AGG(DISTINCT id) word_pos_in_strings 79 | FROM explode_words 80 | WHERE TRIM(word) != '' 81 | AND word_position BETWEEN SPLIT_PART(words_range,',',1)::INT 82 | AND SPLIT_PART(words_range,',',2)::INT 83 | GROUP BY word, 84 | word_position 85 | ORDER BY word_position, 86 | ARRAY_LENGTH(ARRAY_AGG(DISTINCT id), 1) DESC; 87 | 88 | RAISE NOTICE 'FINISHED BUILDING explode_words_pivot: %' , timeofday()::timestamp; 89 | 90 | CREATE TEMPORARY TABLE explode_words_pivot_sliced ON COMMIT DROP AS 91 | SELECT * 92 | FROM explode_words_pivot 93 | WHERE ARRAY_LENGTH(word_pos_in_strings, 1) >= word_occurrences_threshold; 94 | 95 | RAISE NOTICE 'FINISHED BUILDING explode_words_pivot_sliced: %' , timeofday()::timestamp; 96 | 97 | DROP TABLE explode_words_pivot; 98 | /* Histogram of words 
occurring in strings:
99 | SELECT COUNT(*) number_of_words,
100 | ARRAY_LENGTH(word_pos_in_strings, 1) occurring_in_strings
101 | FROM explode_words_pivot_sliced
102 | GROUP BY ARRAY_LENGTH(word_pos_in_strings, 1)
103 | ORDER BY ARRAY_LENGTH(word_pos_in_strings, 1) DESC
104 | */
105 | 
106 | CREATE INDEX explode_words_pivot_sliced_word_position_gist
107 | ON explode_words_pivot_sliced
108 | USING GIST(word_pos_in_strings GIST__INTBIG_OPS, word_position);
109 | 
110 | RAISE NOTICE 'FINISHED BUILDING index explode_words_pivot_sliced_word_position_gist: %' , timeofday()::timestamp;
111 | 
112 | 
113 | CREATE TEMPORARY TABLE words_tree ON COMMIT DROP AS
114 | WITH RECURSIVE words_tree AS
115 | (SELECT word_position,
116 | word,
117 | REPEAT('* ', (word_position - 1)::INT) || word AS tree,
118 | word_position::TEXT AS pos_tree,
119 | word_pos_in_strings,
120 | 1 AS depth,
121 | ARRAY_LENGTH(word_pos_in_strings, 1) AS number_of_occurrences
122 | FROM explode_words_pivot_sliced
123 | WHERE array_length(word_pos_in_strings, 1) >= pattern_occurrences_threshold
124 | UNION
125 | SELECT child.word_position,
126 | child.word,
127 | parent.tree || ' ' || repeat('* ', (child.word_position - parent.word_position - 1)::int) || child.word,
128 | parent.pos_tree || '.' 
|| child.word_position::text, 129 | parent.word_pos_in_strings & child.word_pos_in_strings, 130 | parent.depth + 1, 131 | array_length(parent.word_pos_in_strings & child.word_pos_in_strings, 1) 132 | FROM explode_words_pivot_sliced child 133 | JOIN words_tree parent 134 | ON parent.word_position < child.word_position 135 | AND child.word_pos_in_strings && parent.word_pos_in_strings 136 | WHERE ARRAY_LENGTH(parent.word_pos_in_strings & child.word_pos_in_strings, 1) >= pattern_occurrences_threshold) 137 | SELECT words_tree.tree, 138 | words_tree.pos_tree, 139 | word_pos_in_strings, 140 | words_tree.number_of_occurrences, 141 | depth 142 | FROM words_tree; 143 | 144 | RAISE NOTICE 'FINISHED BUILDING words_tree: %' , timeofday()::timestamp; 145 | 146 | RETURN QUERY 147 | SELECT words_tree_matches.number_of_occurrences / (total_rows_in_source * tail_source_p * sample_p) AS p, 148 | words_tree_matches.tree || COALESCE(CASE 149 | WHEN ARRAY_LENGTH(STRING_TO_ARRAY(words_tree_matches.tree, ' '), 1) < SPLIT_PART(words_range,',',2)::INT 150 | THEN ' *?' 151 | END, '') AS pattern, 152 | words_tree_matches.number_of_occurrences occurrences, 153 | words_tree_matches.depth AS num_of_words, 154 | source_sample.string::text AS sample 155 | FROM (SELECT *, 156 | ROW_NUMBER() OVER(PARTITION BY word_pos_in_strings ORDER BY depth DESC) AS best_match 157 | FROM words_tree) AS words_tree_matches 158 | LEFT JOIN source_sample 159 | ON words_tree_matches.word_pos_in_strings[1] = source_sample.id 160 | WHERE words_tree_matches.best_match = 1 161 | ORDER BY words_tree_matches.number_of_occurrences DESC, 162 | depth DESC; 163 | 164 | RAISE NOTICE 'FINISHED SCRIPT: %' , timeofday()::timestamp; 165 | 166 | 167 | END 168 | 169 | $function$ 170 | ; 171 | --------------------------------------------------------------------------------
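As a closing note, the two small output rules of the final RETURN QUERY in `cluster_patterns` can be restated in Python (an illustrative translation, not part of the distribution): `p` is the pattern's occurrence count divided by the effective sample size (`total_rows * tail_source_p * sample_p`), and ` *?` is appended whenever the pattern contains fewer words than the upper bound of `words_range`.

```python
def render_result(tree: str, occurrences: int, total_rows: int,
                  tail_source_p: float, sample_p: float,
                  words_range_max: int):
    """Mirror the final SELECT in cluster_patterns: compute the share
    of sampled rows the pattern covers, and append ' *?' when more
    words may follow (pattern shorter than the words_range maximum)."""
    p = occurrences / (total_rows * tail_source_p * sample_p)
    suffix = " *?" if len(tree.split()) < words_range_max else ""
    return p, tree + suffix

# Demo-table numbers: "the" occurs in 7 of the 8 sampled strings
print(render_result("the", 7, 8, 1.0, 1.0, 20))  # (0.875, 'the *?')
```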