├── .github
    └── ISSUE_TEMPLATE.md
├── .gitignore
├── .travis.yml
├── LICENSE.txt
├── README.md
├── liwc
    ├── __init__.py
    ├── dic.py
    └── trie.py
├── setup.cfg
├── setup.py
└── test
    ├── alpha.dic
    └── test_alpha_dic.py

/.github/ISSUE_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | Please do not open an issue with the intent of subverting encryption implemented by the LIWC developers.
2 |
3 | If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable `*.dic` file, please contact the distributor directly.
4 |

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | build/
2 | dist/
3 |

--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python:
3 |   - "2.7"
4 |   - "3.6"
5 | script:
6 |   - python setup.py test
7 |

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Copyright © 2012-2020 Christopher Brown
2 |
3 | MIT License
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
6 |
7 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
8 |
9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
10 |

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # `liwc`
2 |
3 | [![PyPI version](https://badge.fury.io/py/liwc.svg)](https://pypi.org/project/liwc/)
4 | [![Travis CI Build Status](https://travis-ci.org/chbrown/liwc-python.svg?branch=master)](https://travis-ci.org/chbrown/liwc-python)
5 |
6 | This repository is a Python package implementing two basic functions:
7 | 1. Loading (parsing) a Linguistic Inquiry and Word Count (LIWC) dictionary from the `.dic` file format.
8 | 2. Using that dictionary to count category matches on provided texts.
9 |
10 | This is not an official LIWC product nor is it in any way affiliated with the LIWC development team or Receptiviti.
11 |
12 |
13 | ## Obtaining LIWC
14 |
15 | The LIWC lexicon is proprietary, so it is _not_ included in this repository.
16 |
17 | The lexicon data can be acquired (purchased) from [liwc.net](http://liwc.net/).
18 |
19 | * If you are a researcher at an academic institution, please contact [Dr. James W. Pennebaker](https://liberalarts.utexas.edu/psychology/faculty/pennebak) directly.
20 | * For commercial use, contact [Receptiviti](https://www.receptiviti.com/), which is the company that holds the exclusive commercial license.
21 |
22 | Finally, please do not open an issue in this repository with the intent of subverting encryption implemented by the LIWC developers.
23 | If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable `*.dic` file, please contact the distributor directly.
24 |
25 |
26 | ## Setup
27 |
28 | Install from [PyPI](https://pypi.python.org/pypi/liwc):
29 |
30 |     pip install liwc
31 |
32 |
33 | ## Example
34 |
35 | This example reads the LIWC dictionary from a file named `LIWC2007_English100131.dic`, which looks like this:
36 |
37 |     %
38 |     1	funct
39 |     2	pronoun
40 |     [...]
41 |     %
42 |     a	1	10
43 |     abdomen*	146	147
44 |     about	1	16	17
45 |     [...]
46 |
47 |
48 | #### Loading the lexicon
49 |
50 | ```python
51 | import liwc
52 | parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')
53 | ```
54 |
55 | * `parse` is a function from a token of text (a string) to a list of matching LIWC categories (a list of strings)
56 | * `category_names` is all LIWC categories in the lexicon (a list of strings)
57 |
58 |
59 | #### Analyzing text
60 |
61 | ```python
62 | import re
63 |
64 | def tokenize(text):
65 |     # you may want to use a smarter tokenizer
66 |     for match in re.finditer(r'\w+', text, re.UNICODE):
67 |         yield match.group(0)
68 |
69 | gettysburg = '''Four score and seven years ago our fathers brought forth on
70 | this continent a new nation, conceived in liberty, and dedicated to the
71 | proposition that all men are created equal. Now we are engaged in a great
72 | civil war, testing whether that nation, or any nation so conceived and so
73 | dedicated, can long endure. We are met on a great battlefield of that war.
74 | We have come to dedicate a portion of that field, as a final resting place
75 | for those who here gave their lives that that nation might live. It is
76 | altogether fitting and proper that we should do this.'''.lower()
77 | gettysburg_tokens = tokenize(gettysburg)
78 | ```
79 |
80 | Now, count all the categories in all of the tokens, and print the results:
81 |
82 | ```python
83 | from collections import Counter
84 | gettysburg_counts = Counter(category for token in gettysburg_tokens for category in parse(token))
85 | print(gettysburg_counts)
86 | #=> Counter({'funct': 58, 'pronoun': 18, 'cogmech': 17, ...})
87 | ```
88 |
89 | ### _N.B._:
90 |
91 | * The LIWC lexicon only matches lowercase strings, so you will most likely want to lowercase your input text before passing it to `parse(...)`.
92 |   In the example above, I call `.lower()` on the entire string, but you could alternatively incorporate that into your tokenization process (e.g., by using [spaCy](https://spacy.io/api/token)'s `token.lower_`).
93 |
94 |
95 | ## License
96 |
97 | Copyright (c) 2012-2020 Christopher Brown.
98 | [MIT Licensed](LICENSE.txt).
99 |

--------------------------------------------------------------------------------
/liwc/__init__.py:
--------------------------------------------------------------------------------
1 | from .dic import read_dic
2 | from .trie import build_trie, search_trie
3 |
4 | try:
5 |     import pkg_resources
6 |
7 |     __version__ = pkg_resources.get_distribution("liwc").version
8 | except Exception:
9 |     __version__ = None
10 |
11 |
12 | def load_token_parser(filepath):
13 |     """
14 |     Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
15 |     (parse, category_names), where:
16 |     * `parse` is a function from a token to a list of strings (potentially
17 |       empty) of matching categories
18 |     * `category_names` is a list of strings representing all LIWC categories in
19 |       the lexicon
20 |     """
21 |     lexicon, category_names = read_dic(filepath)
22 |     trie = build_trie(lexicon)
23 |
24 |     def parse_token(token):
25 |         for category_name in search_trie(trie, token):
26 |             yield category_name
27 |
28 |     return parse_token, category_names
29 |

--------------------------------------------------------------------------------
/liwc/dic.py:
--------------------------------------------------------------------------------
1 | def _parse_categories(lines):
2 |     """
3 |     Read (category_id, category_name) pairs from the categories section.
4 |     Each line consists of an integer followed by a tab and then the category name.
5 |     This section is separated from the lexicon by a line consisting of a single "%".
6 |     """
7 |     for line in lines:
8 |         line = line.strip()
9 |         if line == "%":
10 |             return
11 |         # ignore non-matching groups of categories
12 |         if "\t" in line:
13 |             category_id, category_name = line.split("\t", 1)
14 |             yield category_id, category_name
15 |
16 |
17 | def _parse_lexicon(lines, category_mapping):
18 |     """
19 |     Read (match_expression, category_names) pairs from the lexicon section.
20 |     Each line consists of a match expression followed by a tab and then one or more
21 |     tab-separated integers, which are mapped to category names using `category_mapping`.
22 |     """
23 |     for line in lines:
24 |         line = line.strip()
25 |         parts = line.split("\t")
26 |         yield parts[0], [category_mapping[category_id] for category_id in parts[1:]]
27 |
28 |
29 | def read_dic(filepath):
30 |     """
31 |     Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
32 |     (lexicon, category_names), where:
33 |     * `lexicon` is a dict mapping string patterns to lists of category names
34 |     * `category_names` is a list of category names (as strings)
35 |     """
36 |     with open(filepath) as lines:
37 |         # read up to first "%" (should be very first line of file)
38 |         for line in lines:
39 |             if line.strip() == "%":
40 |                 break
41 |         # read categories (a mapping from integer string to category name)
42 |         category_mapping = dict(_parse_categories(lines))
43 |         # read lexicon (a mapping from matching string to a list of category names)
44 |         lexicon = dict(_parse_lexicon(lines, category_mapping))
45 |     return lexicon, list(category_mapping.values())
46 |

--------------------------------------------------------------------------------
/liwc/trie.py:
--------------------------------------------------------------------------------
1 | def build_trie(lexicon):
2 |     """
3 |     Build a character-trie from the plain pattern_string -> categories_list
4 |     mapping provided by `lexicon`.
5 |
6 |     Some LIWC patterns end with a `*` to indicate a wildcard match.
7 |     """
8 |     trie = {}
9 |     for pattern, category_names in lexicon.items():
10 |         cursor = trie
11 |         for char in pattern:
12 |             if char == "*":
13 |                 cursor["*"] = category_names
14 |                 break
15 |             if char not in cursor:
16 |                 cursor[char] = {}
17 |             cursor = cursor[char]
18 |         cursor["$"] = category_names
19 |     return trie
20 |
21 |
22 | def search_trie(trie, token, token_i=0):
23 |     """
24 |     Search the given character-trie for paths that match the `token` string.
25 |     """
26 |     if "*" in trie:
27 |         return trie["*"]
28 |     if "$" in trie and token_i == len(token):
29 |         return trie["$"]
30 |     if token_i < len(token):
31 |         char = token[token_i]
32 |         if char in trie:
33 |             return search_trie(trie[char], token, token_i + 1)
34 |     return []
35 |

--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | name = liwc
3 | author = Christopher Brown
4 | author_email = chrisbrown@utexas.edu
5 | url = https://github.com/chbrown/liwc-python
6 | description = Linguistic Inquiry and Word Count (LIWC) analyzer (proprietary data not included)
7 | long_description = file: README.md
8 | long_description_content_type = text/markdown
9 | license = MIT
10 |
11 | [options]
12 | packages = find:
13 | zip_safe = True
14 | setup_requires =
15 |     pytest-runner
16 |     setuptools-scm
17 | tests_require =
18 |     pytest
19 |     pytest-black
20 |
21 | [aliases]
22 | test = pytest
23 |
24 | [tool:pytest]
25 | addopts =
26 |     --black
27 |
28 | [bdist_wheel]
29 | universal = 1
30 |

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 |
3 | setup(use_scm_version=True)
4 |

--------------------------------------------------------------------------------
/test/alpha.dic:
--------------------------------------------------------------------------------
1 | %
2 | 1	A
3 | 2	Bravo
4 | %
5 | a*	1
6 | bravo	2
7 |

--------------------------------------------------------------------------------
/test/test_alpha_dic.py:
--------------------------------------------------------------------------------
1 | import os.path
2 |
3 | import liwc
4 |
5 | test_dir = os.path.dirname(__file__)
6 |
7 |
8 | def test_category_names():
9 |     _, category_names = liwc.load_token_parser(os.path.join(test_dir, "alpha.dic"))
10 |     assert category_names == ["A", "Bravo"]
11 |
12 |
13 | def test_parse():
14 |     parse, _ = liwc.load_token_parser(os.path.join(test_dir, "alpha.dic"))
15 |     sentence = "Any alpha a bravo charlie Bravo boy"
16 |     tokens = sentence.split()
17 |     matches = [category for token in tokens for category in parse(token)]
18 |     # matching is case-sensitive, so the only matches are "alpha" (A), "a" (A) and "bravo" (Bravo)
19 |     assert matches == ["A", "A", "Bravo"]
20 |

--------------------------------------------------------------------------------
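
Illustrative sketch (not a file in this repository): the block below restates the trie logic from `liwc/trie.py` in a single self-contained script, so the wildcard and case-sensitivity behavior exercised by `test/test_alpha_dic.py` can be tried without a `.dic` file on disk. The in-memory `lexicon` dict is an assumption standing in for what `read_dic` would return for `test/alpha.dic`, and the names `count_categories`, `counts` and `lowered_counts` are hypothetical helpers introduced only for this example.

```python
from collections import Counter

# Assumed stand-in for the parsed contents of test/alpha.dic:
# pattern string -> list of category names.
lexicon = {"a*": ["A"], "bravo": ["Bravo"]}


def build_trie(lexicon):
    # Mirrors liwc/trie.py: "*" marks a wildcard prefix, "$" marks an exact end.
    trie = {}
    for pattern, category_names in lexicon.items():
        cursor = trie
        for char in pattern:
            if char == "*":
                cursor["*"] = category_names
                break
            if char not in cursor:
                cursor[char] = {}
            cursor = cursor[char]
        cursor["$"] = category_names
    return trie


def search_trie(trie, token, token_i=0):
    # A wildcard at this node matches any remaining suffix of the token.
    if "*" in trie:
        return trie["*"]
    # An exact pattern matches only when the whole token has been consumed.
    if "$" in trie and token_i == len(token):
        return trie["$"]
    if token_i < len(token):
        char = token[token_i]
        if char in trie:
            return search_trie(trie[char], token, token_i + 1)
    return []


def count_categories(trie, tokens):
    # Hypothetical helper: tally every category matched by every token.
    return Counter(c for token in tokens for c in search_trie(trie, token))


trie = build_trie(lexicon)
sentence = "Any alpha a bravo charlie Bravo boy"

# Matching is case-sensitive, so "Any" and "Bravo" are missed here.
counts = count_categories(trie, sentence.split())

# Lowercasing first lets "any" hit the "a*" wildcard and "bravo" match twice.
lowered_counts = count_categories(trie, sentence.lower().split())

print(counts)
print(lowered_counts)
```

Comparing the two counters shows why the README recommends lowercasing input before calling `parse(...)`: the same sentence yields more matches once capitalized tokens are normalized.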