├── LICENSE
├── README.md
└── klapa.sql

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2020 Yoni Goldblat
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Klapa
2 | Text-pattern clustering in PostgreSQL
3 | 
4 | ## What is Klapa?
5 | **Klapa** (pronounced kla·pa /ˈkläpä/) is an original, relatively fast and efficient algorithm for finding patterns in a dataset of strings.
6 | 
7 | **Klapa** can cluster patterns from database tables or the file system, and relies on PostgreSQL (>= 11) to tackle the performance issues that are common in clustering combinatorics.
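To make the pattern notation used in this README concrete before the demo below: each `*` in a pattern stands for exactly one word, and a trailing `*?` means the string may continue with further words. The following is an illustrative Python sketch of that reading (it is not part of Klapa itself, which is pure PL/pgSQL):

```python
import re

def klapa_pattern_to_regex(pattern: str) -> re.Pattern:
    """Interpret a Klapa-style pattern as a regular expression:
    '*' matches exactly one word (like \\S+), and a trailing '*?'
    allows the string to continue with any number of extra words."""
    parts = pattern.split()
    optional_tail = bool(parts) and parts[-1] == "*?"
    if optional_tail:
        parts = parts[:-1]
    # Literal words are escaped; each '*' becomes one non-space run
    body = r"\s+".join(r"\S+" if p == "*" else re.escape(p) for p in parts)
    tail = r"(?:\s+\S+)*" if optional_tail else ""
    return re.compile(r"^" + body + tail + r"$")

rx = klapa_pattern_to_regex("the * * * over the *?")
print(bool(rx.match("the black fox jumped over the bench")))  # True
print(bool(rx.match("one yellow fox saw a bird")))            # False
```

Applied to the eight demo strings below, this pattern matches 6 of them, consistent with the 0.75 figure reported in the demo table.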
8 | 
9 | ## Motivation
10 | The ability to detect recurring textual patterns is useful in fields that require data investigation, such as:
11 | - Log file analysis (anomaly detection, health monitoring)
12 | - Debugging trace files
13 | - Natural language processing (NLP)
14 | 
15 | ## Demo
16 | Assume you have the following strings and want to find the most common patterns:
17 | 
18 | - the black fox jumped over the bench
19 | - the red fox walked over the river
20 | - the yellow fox skipped the line
21 | - one yellow fox saw a bird
22 | - the yellow dog flipped over the pool
23 | - the black cow flipped over the pool
24 | - the black cat flipped over the pool
25 | - the yellow dog flipped over the pool
26 | 
27 | **Klapa** clusters these strings into the following output (top 5 results):
28 | 
29 | | p | pattern | occurrences | # words | sample |
30 | |-------|-----------------------------------------|-------------|--------|--------------------------------------|
31 | | 0.875 | the *? | 7 | 1 | the black cat flipped over the pool |
32 | | 0.75 | the * * * over the *? | 6 | 3 | the black cat flipped over the pool |
33 | | 0.5 | the * * flipped over the pool *? | 4 | 5 | the black cat flipped over the pool |
34 | | 0.5 | * * fox *? | 4 | 1 | one yellow fox saw a bird |
35 | | 0.5 | * yellow *? | 4 | 1 | one yellow fox saw a bird |
36 | 
37 | So it can easily be seen that, for instance, the pattern `the * * * over the *?` appears in 75% of the strings (each asterisk represents a single word in its place, similar to `\S+` in regex).
38 | 
39 | 
40 | ## Installation
41 | 1. The best way to install Klapa is to import the `klapa.sql` file into your chosen database.
42 | 2. Enable the required extensions:
43 | `postgres=# CREATE EXTENSION intarray;`
44 | `postgres=# CREATE EXTENSION btree_gist;`
45 | 3. 
If you want to analyze text files, enable the foreign data wrapper extension and create a "server":
46 | `postgres=# CREATE EXTENSION file_fdw;`
47 | `postgres=# CREATE SERVER file_fdw_server FOREIGN DATA WRAPPER file_fdw;`
48 | 
49 | From there, you can start interacting with the `cluster_patterns` function.
50 | 
51 | ## Usage
52 | 
53 | ### Quick Start
54 | 
55 | Cluster a text column in the database using default parameters:
56 | ```sql
57 | postgres=# SELECT * FROM cluster_patterns(source_table_name => 'email.messages', source_table_column_name => 'subject')
58 | ```
59 | 
60 | Cluster a text file:
61 | ```sql
62 | postgres=# SELECT * FROM cluster_patterns(source_file_name => '/var/log/postgresql/postgresql.log')
63 | ```
64 | 
65 | ### Parameters
66 | 
67 | - **source_file_name** - Full path to the text file. The OS user running Postgres must have read access to it. Either this parameter or `source_table_name` is mandatory, depending on your data-source type.
68 | - **source_table_name** - Name of the table that holds the column to cluster. Note that if this table is not in the same schema where the `cluster_patterns` function is installed, you should provide the schema-qualified name in the form `'schema.tablename'`.
69 | - **source_table_column_name** - Name of the column that holds the strings to cluster. Mandatory when providing the `source_table_name` parameter.
70 | - **tail_source_p** - Process only the last x% of rows in the data source. A value of 1 processes the whole data source, while a value of 0.1 processes only the last 10% of the rows. This is mostly useful when more recent data needs inspecting, such as with log files. If the data source is a file, this parameter keeps the last lines of the file; if the data source is a database table, the function tails the last rows returned by the database for `select * from tablename`, which might not reflect insertion, sequential or timestamp order. 
(default 0.1)
71 | - **words_range** - The range of word positions in the sentence that the algorithm takes into account: first and last word positions, separated by a comma. Useful for log files whose lines start with a unique pattern (such as a timestamp, server name or IP address) or hold no relevant text after a certain position. A narrower range reduces the possible word combinations and results in faster performance. (default '6,20')
72 | - **words_delimiter** - Regular expression that, when matched, is used to split sentences into words. The default characters are space and the opening square bracket. (default '[\s\[]+')
73 | - **sample_p** - Percentage of rows to analyze from the data source after the tail step has been applied. The rows are chosen randomly. (default 0.5)
74 | - **word_occurrences_threshold** - Minimum number of times a word should occur in the data set to be considered for inclusion in patterns. Lowering this number increases the possible combinations in the recursive patterns loop (preparation of the words_tree table) and negatively impacts performance. (default 50)
75 | - **pattern_occurrences_threshold** - Minimum number of times a pattern should occur in the data source to be included in the query result. Changing this number has little impact on performance. (default 50)
76 | 
77 | #### Example
78 | 
79 | Find patterns in the first 5 words of the subject column in the messages table. Limit calculations to the last 25% of messages, and randomly sample only 50% of them.
80 | ```sql
81 | postgres=# SELECT * FROM cluster_patterns(source_table_name => 'email.messages',
82 | source_table_column_name =>'subject',
83 | words_range => '1,5',
84 | tail_source_p=>0.25,
85 | sample_p=>0.5)
86 | ```
87 | 
88 | ### Caveats
89 | 1. Due to Postgres `COPY` encoding limitations, some files might produce an `SQL Error [22P04]: ERROR: literal carriage return found in data` error. 
These files need further preparation before processing, to remove special Unicode characters and carriage returns. Using the terminal:
90 | `sed 's:\\:\\\\:g; s:\r:\\r:g' /var/log/postgresql/postgresql.log > /var/log/postgresql/altered.log`
91 | 
92 | ## Algorithm Features
93 | 
94 | ### Scaling
95 | Klapa was benchmarked against multiple scenarios to optimize the execution plans and to improve performance and memory footprint. This was done by pushing some PostgreSQL features to their limits, which is also why a few native extensions are required.
96 | Still, for very large datasets, exploding the words dictionary and aggregating the patterns can consume significant resources.
97 | The following tips can help reduce processing time, CPU and memory consumption:
98 | - If the number of strings in the data source is high, lower the `sample_p` *(default=0.5)* parameter to reduce the random sample size.
99 | - For sources where recent data is more relevant (such as log files), use the `tail_source_p` *(default=0.1)* parameter to analyze only the last x% of the data.
100 | - For strings with repeating irrelevant parts (such as log files where the first words represent a timestamp, machine name, etc.), change the `words_range` *(default=6,20)*.
101 | - Based on your dataset size, try altering `word_occurrences_threshold` *(default=50)* and `pattern_occurrences_threshold` *(default=50)* to pick only the most repetitive words and patterns.
102 | 
103 | ### Principles
104 | 
105 | - Code Readability. SQL formatting based on the style guide at [sqlstyle.guide](https://www.sqlstyle.guide/)
106 | - Ease of Use. This implementation is distributed as a single PL/pgSQL function, which means it can be installed and used easily.
107 | - Using DB Goodies. Postgres comes with lots of features that can stretch performance to the max. Some of them are tricky and need deep knowledge of the DB engine.
108 | - Built for Production. No compromises for the sake of a POC.
109 | - Leave No Trace. 
All runtime objects are temporary and live only for the duration of the query session.
110 | 
111 | ## Future Work
112 | 
113 | ### Todo
114 | 
115 | 1. Add ignore_characters[array], stop_words[array] and underscore_words[bool] parameters to slice the words dictionary.
116 | 2. Consider **HyperLogLog** to extrapolate calculations from the sample to the whole data source.
117 | 3. Utilize Postgres' multi-core parallelism to speed up the recursive query that builds the `words_tree` combinations table. This is the heaviest query in the algorithm. PG12 doesn't support parallel gathering of recursive CTEs (I could not achieve parallel execution even by dividing the parent node of the recursive loop into multiple UNION ALL queries).
118 | 4. ~~Change aggregated words_tree columns (tree, pos_tree) to be arrays of word IDs and word positions instead of concatenated strings.~~ (proved to increase memory consumption, and the performance improvement is negligible even for huge datasets)
119 | 5. ~~Separate `explode_words_pivot_sliced_word_position_gist` into 2 indexes~~ (the Postgres plan optimizer prefers such indexes over the current multi-column index, yet they provide slower execution times)
120 | 
121 | ### Bugs & Issues
122 | I'm open to contributions of any kind and happy to discuss, review and merge pull requests.
123 | Questions, comments, bug reports and patches are also welcome directly at my email address.
124 | 
125 | 
126 | ## Inspiration
127 | The **Klapa** algorithm is original, yet the idea of clustering text into patterns was inspired by the following publications:
128 | - [LogMine](https://dl.acm.org/citation.cfm?id=2983323.2983358): Fast Pattern Recognition for Log Analytics - for the idea of coarse-grained clustering and of type deduction to simplify logs.
129 | - [nestordemeure/textPatterns](https://github.com/nestordemeure/textPatterns/) - Hierarchical clustering of lines of text. 
130 | - Kusto reduce operator [(link)](https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/reduceoperator)
131 | - A Data Clustering Algorithm for Mining Patterns From Event Logs by Risto Vaarandi [(link)](https://ristov.github.io/publications/slct-ipom03-web.pdf)
132 | 
133 | ## About
134 | 
135 | ### Author
136 | The project was founded in 2020 by Yoni G ([email](goldblat-remove.this-@gmail.com) / [LinkedIn](https://www.linkedin.com/in/goldblat/))
137 | 
138 | **💸 Do you like my work?** I am a Senior Backend & Data Engineer open to offers and opportunities.
139 | 
140 | ### Why the name?
141 | The word klapa translates from Croatian as ["a group of friends"](https://en.wikipedia.org/wiki/Klapa), which resembles the idea of clustering patterns.
142 | In addition, it consists of 2 syllables, each pronounced like the stems of "**clu**stering **pa**tterns".
143 | 
144 | 
145 | ## License
146 | **Klapa** is distributed under the MIT license, which means you can use and modify it however you want. If you make an enhancement, please consider sending a pull request.
147 | 
148 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
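## Appendix: The core idea, sketched

As a companion to the Algorithm Features section, here is a deliberately simplified, in-memory Python sketch of Klapa's core loop: index each (position, word) pair by the set of strings it appears in, then repeatedly intersect those sets to grow longer patterns. The real implementation in `klapa.sql` does this with `intarray` intersections inside a recursive CTE and applies sampling, tailing and separate word/pattern thresholds; all names below are illustrative.

```python
def cluster_patterns_sketch(strings, min_occurrences=2):
    """Simplified sketch of Klapa's core loop: index each
    (position, word) pair by the set of strings it occurs in, then
    grow patterns left to right, keeping only combinations whose
    string sets still intersect in >= min_occurrences strings."""
    # 1. "Explode" words: (position, word) -> set of string ids
    index = {}
    for sid, s in enumerate(strings):
        for pos, word in enumerate(s.split()):
            index.setdefault((pos, word), set()).add(sid)
    index = {k: v for k, v in index.items() if len(v) >= min_occurrences}

    # 2. Recursively extend patterns (mirrors the words_tree CTE)
    patterns = {}
    frontier = {(key,): ids for key, ids in index.items()}
    while frontier:
        patterns.update(frontier)
        nxt = {}
        for path, ids in frontier.items():
            last_pos = path[-1][0]
            for (pos, word), wids in index.items():
                if pos > last_pos and len(ids & wids) >= min_occurrences:
                    nxt[path + ((pos, word),)] = ids & wids
        frontier = nxt

    # 3. Render: one '*' for every skipped word position
    results = []
    for path, ids in patterns.items():
        rendered, cursor = [], 0
        for pos, word in path:
            rendered += ["*"] * (pos - cursor) + [word]
            cursor = pos + 1
        results.append((" ".join(rendered), len(ids)))
    return sorted(results, key=lambda r: -r[1])
```

Running this on the eight demo strings reproduces the occurrence counts from the demo table: `the` occurs in 7 strings, `the * * * over the` in 6, and `the * * flipped over the pool` in 4.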
--------------------------------------------------------------------------------
/klapa.sql:
--------------------------------------------------------------------------------
1 | CREATE OR REPLACE FUNCTION cluster_patterns(source_file_name text DEFAULT ''::text,
2 | tail_source_p real DEFAULT 0.1,
3 | source_table_name text DEFAULT ''::text,
4 | source_table_column_name text DEFAULT ''::text,
5 | words_range text DEFAULT '6,20'::text,
6 | words_delimiter text DEFAULT '[\s\[]+'::text,
7 | sample_p real DEFAULT 0.5,
8 | word_occurrences_threshold integer DEFAULT 50,
9 | pattern_occurrences_threshold integer DEFAULT 50)
10 | RETURNS TABLE(p double precision, pattern text, occurrences integer, num_of_words integer, sample text)
11 | LANGUAGE plpgsql
12 | AS $function$
13 | 
14 | DECLARE
15 | total_rows_in_source INT;
16 | 
17 | 
18 | BEGIN
19 | 
20 | RAISE NOTICE 'START SCRIPT: %' , timeofday()::timestamp;
21 | 
22 | IF (source_file_name != '' AND source_table_name != '') OR (source_file_name = '' AND source_table_name = '') THEN
23 | RAISE EXCEPTION 'Please provide exactly one source for clustering: source_file_name OR source_table_name';
24 | END IF;
25 | 
26 | IF (source_table_name != '' AND source_table_column_name = '') THEN
27 | RAISE EXCEPTION 'Please provide the column name to cluster (source_table_column_name) when specifying source_table_name';
28 | END IF;
29 | 
30 | IF (SPLIT_PART(words_range,',',1)::INT > SPLIT_PART(words_range,',',2)::INT) THEN
31 | RAISE EXCEPTION 'words_range parameter should be in the form x,y where x<=y';
32 | END IF;
33 | 
34 | IF (tail_source_p > 1 OR tail_source_p < 0 OR sample_p > 1 OR sample_p < 0) THEN
35 | RAISE EXCEPTION 'tail_source_p OR sample_p values are out of range (0..1)';
36 | END IF;
37 | 
38 | IF (source_table_name != '') THEN
39 | EXECUTE 'CREATE TEMPORARY VIEW source AS SELECT ' || source_table_column_name || ' AS string FROM ' || source_table_name;
40 | ELSE
41 | EXECUTE 'CREATE FOREIGN TABLE source(string TEXT) SERVER 
FILE_FDW_SERVER OPTIONS (FORMAT ''text'',FILENAME ''' || source_file_name || ''',ENCODING ''utf8'', DELIMITER E''\x01'');'; 42 | END IF; 43 | 44 | SELECT COUNT(string) 45 | FROM source 46 | INTO total_rows_in_source; 47 | 48 | RAISE NOTICE 'FINISHED count(*): %' , timeofday()::timestamp; 49 | 50 | CREATE TEMPORARY TABLE source_sample ON COMMIT DROP AS 51 | SELECT string, 52 | ROW_NUMBER() OVER () AS id 53 | FROM (SELECT string, 54 | ROW_NUMBER() OVER (ORDER BY string) AS original_id 55 | FROM source 56 | ) AS ordered_source 57 | WHERE (original_id > total_rows_in_source * (1 - tail_source_p)) 58 | ORDER BY RANDOM() 59 | LIMIT (total_rows_in_source * tail_source_p * sample_p); 60 | 61 | RAISE NOTICE 'FINISHED BUILDING source_sample: %' , timeofday()::timestamp; 62 | 63 | IF (source_table_name != '') THEN 64 | DROP VIEW source; 65 | ELSE 66 | DROP FOREIGN TABLE source; 67 | END IF; 68 | 69 | CREATE TEMPORARY TABLE explode_words_pivot ON COMMIT DROP AS 70 | WITH explode_words AS (SELECT id::INT, 71 | word, 72 | word_position::BIGINT 73 | FROM source_sample t, 74 | regexp_split_to_table(t.string, words_delimiter) 75 | WITH ORDINALITY x(word, word_position)) 76 | SELECT word_position, 77 | word, 78 | ARRAY_AGG(DISTINCT id) word_pos_in_strings 79 | FROM explode_words 80 | WHERE TRIM(word) != '' 81 | AND word_position BETWEEN SPLIT_PART(words_range,',',1)::INT 82 | AND SPLIT_PART(words_range,',',2)::INT 83 | GROUP BY word, 84 | word_position 85 | ORDER BY word_position, 86 | ARRAY_LENGTH(ARRAY_AGG(DISTINCT id), 1) DESC; 87 | 88 | RAISE NOTICE 'FINISHED BUILDING explode_words_pivot: %' , timeofday()::timestamp; 89 | 90 | CREATE TEMPORARY TABLE explode_words_pivot_sliced ON COMMIT DROP AS 91 | SELECT * 92 | FROM explode_words_pivot 93 | WHERE ARRAY_LENGTH(word_pos_in_strings, 1) >= word_occurrences_threshold; 94 | 95 | RAISE NOTICE 'FINISHED BUILDING explode_words_pivot_sliced: %' , timeofday()::timestamp; 96 | 97 | DROP TABLE explode_words_pivot; 98 | /* Histogram of words 
occurring in strings:
99 | SELECT COUNT(*) number_of_words,
100 | ARRAY_LENGTH(word_pos_in_strings, 1) occurring_in_strings
101 | FROM explode_words_pivot_sliced
102 | GROUP BY ARRAY_LENGTH(word_pos_in_strings, 1)
103 | ORDER BY ARRAY_LENGTH(word_pos_in_strings, 1) DESC
104 | */
105 | 
106 | CREATE INDEX explode_words_pivot_sliced_word_position_gist
107 | ON explode_words_pivot_sliced
108 | USING GIST(word_pos_in_strings GIST__INTBIG_OPS, word_position);
109 | 
110 | RAISE NOTICE 'FINISHED BUILDING index explode_words_pivot_sliced_word_position_gist: %' , timeofday()::timestamp;
111 | 
112 | 
113 | CREATE TEMPORARY TABLE words_tree ON COMMIT DROP AS
114 | WITH RECURSIVE words_tree AS
115 | (SELECT word_position,
116 | word,
117 | REPEAT('* ', (word_position - 1)::INT) || word AS tree,
118 | word_position::TEXT AS pos_tree,
119 | word_pos_in_strings,
120 | 1 AS depth,
121 | ARRAY_LENGTH(word_pos_in_strings, 1) AS number_of_occurrences
122 | FROM explode_words_pivot_sliced
123 | WHERE array_length(word_pos_in_strings, 1) >= pattern_occurrences_threshold
124 | UNION
125 | SELECT child.word_position,
126 | child.word,
127 | parent.tree || ' ' || repeat('* ', (child.word_position - parent.word_position - 1)::int) || child.word,
128 | parent.pos_tree || '.' 
|| child.word_position::text, 129 | parent.word_pos_in_strings & child.word_pos_in_strings, 130 | parent.depth + 1, 131 | array_length(parent.word_pos_in_strings & child.word_pos_in_strings, 1) 132 | FROM explode_words_pivot_sliced child 133 | JOIN words_tree parent 134 | ON parent.word_position < child.word_position 135 | AND child.word_pos_in_strings && parent.word_pos_in_strings 136 | WHERE ARRAY_LENGTH(parent.word_pos_in_strings & child.word_pos_in_strings, 1) >= pattern_occurrences_threshold) 137 | SELECT words_tree.tree, 138 | words_tree.pos_tree, 139 | word_pos_in_strings, 140 | words_tree.number_of_occurrences, 141 | depth 142 | FROM words_tree; 143 | 144 | RAISE NOTICE 'FINISHED BUILDING words_tree: %' , timeofday()::timestamp; 145 | 146 | RETURN QUERY 147 | SELECT words_tree_matches.number_of_occurrences / (total_rows_in_source * tail_source_p * sample_p) AS p, 148 | words_tree_matches.tree || COALESCE(CASE 149 | WHEN ARRAY_LENGTH(STRING_TO_ARRAY(words_tree_matches.tree, ' '), 1) < SPLIT_PART(words_range,',',2)::INT 150 | THEN ' *?' 151 | END, '') AS pattern, 152 | words_tree_matches.number_of_occurrences occurrences, 153 | words_tree_matches.depth AS num_of_words, 154 | source_sample.string::text AS sample 155 | FROM (SELECT *, 156 | ROW_NUMBER() OVER(PARTITION BY word_pos_in_strings ORDER BY depth DESC) AS best_match 157 | FROM words_tree) AS words_tree_matches 158 | LEFT JOIN source_sample 159 | ON words_tree_matches.word_pos_in_strings[1] = source_sample.id 160 | WHERE words_tree_matches.best_match = 1 161 | ORDER BY words_tree_matches.number_of_occurrences DESC, 162 | depth DESC; 163 | 164 | RAISE NOTICE 'FINISHED SCRIPT: %' , timeofday()::timestamp; 165 | 166 | 167 | END 168 | 169 | $function$ 170 | ; 171 | --------------------------------------------------------------------------------
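As a closing note, the two small output rules of the final RETURN QUERY in `cluster_patterns` can be restated in Python (an illustrative translation, not part of the distribution): `p` is the pattern's occurrence count divided by the effective sample size (`total_rows * tail_source_p * sample_p`), and ` *?` is appended whenever the pattern contains fewer words than the upper bound of `words_range`.

```python
def render_result(tree: str, occurrences: int, total_rows: int,
                  tail_source_p: float, sample_p: float,
                  words_range_max: int):
    """Mirror the final SELECT in cluster_patterns: compute the share
    of sampled rows the pattern covers, and append ' *?' when more
    words may follow (pattern shorter than the words_range maximum)."""
    p = occurrences / (total_rows * tail_source_p * sample_p)
    suffix = " *?" if len(tree.split()) < words_range_max else ""
    return p, tree + suffix

# Demo-table numbers: "the" occurs in 7 of the 8 sampled strings
print(render_result("the", 7, 8, 1.0, 1.0, 20))  # (0.875, 'the *?')
```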