├── .github
    └── ISSUE_TEMPLATE.md
├── .gitignore
├── .travis.yml
├── LICENSE.txt
├── README.md
├── liwc
    ├── __init__.py
    ├── dic.py
    └── trie.py
├── setup.cfg
├── setup.py
└── test
    ├── alpha.dic
    └── test_alpha_dic.py


/.github/ISSUE_TEMPLATE.md:
--------------------------------------------------------------------------------
1 | Please do not open an issue with the intent of subverting encryption implemented by the LIWC developers.
2 | 
3 | If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable `*.dic` file, please contact the distributor directly.
4 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | build/
2 | dist/
3 | 


--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | python:
3 |   - "2.7"
4 |   - "3.6"
5 | script:
6 |   - python setup.py test
7 | 


--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
 1 | Copyright © 2012-2020 Christopher Brown <io@henrian.com>
 2 | 
 3 | MIT License
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
 6 | 
 7 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
 8 | 
 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
10 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # `liwc`
 2 | 
 3 | [![PyPI version](https://badge.fury.io/py/liwc.svg)](https://pypi.org/project/liwc/)
 4 | [![Travis CI Build Status](https://travis-ci.org/chbrown/liwc-python.svg?branch=master)](https://travis-ci.org/chbrown/liwc-python)
 5 | 
 6 | This repository is a Python package implementing two basic functions:
 7 | 1. Loading (parsing) a Linguistic Inquiry and Word Count (LIWC) dictionary from the `.dic` file format.
 8 | 2. Using that dictionary to count category matches on provided texts.
 9 | 
10 | This is not an official LIWC product nor is it in any way affiliated with the LIWC development team or Receptiviti.
11 | 
12 | 
13 | ## Obtaining LIWC
14 | 
15 | The LIWC lexicon is proprietary, so it is _not_ included in this repository.
16 | 
17 | The lexicon data can be acquired (purchased) from [liwc.net](http://liwc.net/).
18 | 
19 | * If you are a researcher at an academic institution, please contact [Dr. James W. Pennebaker](https://liberalarts.utexas.edu/psychology/faculty/pennebak) directly.
 20 | * For commercial use, contact [Receptiviti](https://www.receptiviti.com/), which is the company that holds the exclusive commercial license.
21 | 
22 | Finally, please do not open an issue in this repository with the intent of subverting encryption implemented by the LIWC developers.
23 | If the version of LIWC that you purchased (or otherwise legitimately obtained as a researcher at an academic institution) does not provide a machine-readable `*.dic` file, please contact the distributor directly.
24 | 
25 | 
26 | ## Setup
27 | 
28 | Install from [PyPI](https://pypi.python.org/pypi/liwc):
29 | 
30 |     pip install liwc
31 | 
32 | 
33 | ## Example
34 | 
35 | This example reads the LIWC dictionary from a file named `LIWC2007_English100131.dic`, which looks like this:
36 | 
37 |     %
38 |     1   funct
39 |     2   pronoun
40 |     [...]
41 |     %
42 |     a   1   10
43 |     abdomen*    146 147
44 |     about   1   16  17
45 |     [...]
46 | 
47 | 
48 | #### Loading the lexicon
49 | 
50 | ```python
51 | import liwc
52 | parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')
53 | ```
54 | 
 55 | * `parse` is a function from a token of text (a string) to an iterable of matching LIWC categories (a generator of strings, possibly empty)
56 | * `category_names` is all LIWC categories in the lexicon (a list of strings)
57 | 
58 | 
59 | #### Analyzing text
60 | 
61 | ```python
62 | import re
63 | 
64 | def tokenize(text):
65 |     # you may want to use a smarter tokenizer
66 |     for match in re.finditer(r'\w+', text, re.UNICODE):
67 |         yield match.group(0)
68 | 
69 | gettysburg = '''Four score and seven years ago our fathers brought forth on
70 |   this continent a new nation, conceived in liberty, and dedicated to the
71 |   proposition that all men are created equal. Now we are engaged in a great
72 |   civil war, testing whether that nation, or any nation so conceived and so
73 |   dedicated, can long endure. We are met on a great battlefield of that war.
74 |   We have come to dedicate a portion of that field, as a final resting place
75 |   for those who here gave their lives that that nation might live. It is
76 |   altogether fitting and proper that we should do this.'''.lower()
77 | gettysburg_tokens = tokenize(gettysburg)
78 | ```
79 | 
80 | Now, count all the categories in all of the tokens, and print the results:
81 | 
82 | ```python
83 | from collections import Counter
84 | gettysburg_counts = Counter(category for token in gettysburg_tokens for category in parse(token))
85 | print(gettysburg_counts)
86 | #=> Counter({'funct': 58, 'pronoun': 18, 'cogmech': 17, ...})
87 | ```
88 | 
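One thing to watch in the example above: `tokenize` is a generator function, so `gettysburg_tokens` can only be iterated once. The following is a minimal sketch of materializing the tokens into a list so they can be reused, for instance to relate category counts to the total number of tokens. `TOY_LEXICON` and `toy_parse` are hypothetical stand-ins for the `parse` function returned by `liwc.load_token_parser`, since the real lexicon is not included here; the categories are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # same simple tokenizer as in the example above
    for match in re.finditer(r"\w+", text, re.UNICODE):
        yield match.group(0)


# Hypothetical stand-in for the `parse` returned by liwc.load_token_parser;
# these entries are invented for illustration, not taken from LIWC.
TOY_LEXICON = {"we": ["pronoun"], "our": ["pronoun"], "and": ["conj"]}


def toy_parse(token):
    for category in TOY_LEXICON.get(token, []):
        yield category


text = "we are met and we have come to honor our fathers".lower()

# Materialize the generator so the tokens can be traversed more than once.
tokens = list(tokenize(text))

counts = Counter(category for token in tokens for category in toy_parse(token))
total = len(tokens)  # still available because `tokens` is a list
shares = {category: count / total for category, count in counts.items()}
```

If the tokens are only needed once, as in the counting example above, passing the generator straight through is fine and avoids holding all tokens in memory.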
89 | ### _N.B._:
90 | 
91 | * The LIWC lexicon only matches lowercase strings, so you will most likely want to lowercase your input text before passing it to `parse(...)`.
92 |   In the example above, I call `.lower()` on the entire string, but you could alternatively incorporate that into your tokenization process (e.g., by using [spaCy](https://spacy.io/api/token)'s `token.lower_`).
93 | 
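As a rough sketch of the alternative mentioned above, lowercasing can be folded into the tokenizer itself rather than applied to the whole input string. `tokenize_lower` is a hypothetical name, and this variant only uses the standard library (it does not depend on spaCy).

```python
import re


def tokenize_lower(text):
    # Lowercase each token as it is produced, leaving the input text untouched.
    for match in re.finditer(r"\w+", text, re.UNICODE):
        yield match.group(0).lower()


tokens = list(tokenize_lower("Four score and SEVEN years ago"))
```

The resulting tokens can then be passed to `parse(...)` one at a time, exactly as in the example above.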
94 | 
95 | ## License
96 | 
97 | Copyright (c) 2012-2020 Christopher Brown.
98 | [MIT Licensed](LICENSE.txt).
99 | 


--------------------------------------------------------------------------------
/liwc/__init__.py:
--------------------------------------------------------------------------------
 1 | from .dic import read_dic
 2 | from .trie import build_trie, search_trie
 3 | 
 4 | try:
 5 |     import pkg_resources
 6 | 
 7 |     __version__ = pkg_resources.get_distribution("liwc").version
 8 | except Exception:
 9 |     __version__ = None
10 | 
11 | 
12 | def load_token_parser(filepath):
13 |     """
14 |     Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
15 |     (parse, category_names), where:
 16 |     * `parse` is a function from a token to an iterable (a generator) of
 17 |       strings, potentially empty, naming the matching categories
18 |     * `category_names` is a list of strings representing all LIWC categories in
19 |       the lexicon
20 |     """
21 |     lexicon, category_names = read_dic(filepath)
22 |     trie = build_trie(lexicon)
23 | 
24 |     def parse_token(token):
25 |         for category_name in search_trie(trie, token):
26 |             yield category_name
27 | 
28 |     return parse_token, category_names
29 | 


--------------------------------------------------------------------------------
/liwc/dic.py:
--------------------------------------------------------------------------------
 1 | def _parse_categories(lines):
 2 |     """
 3 |     Read (category_id, category_name) pairs from the categories section.
 4 |     Each line consists of an integer followed by a tab and then the category name.
 5 |     This section is separated from the lexicon by a line consisting of a single "%".
 6 |     """
 7 |     for line in lines:
 8 |         line = line.strip()
 9 |         if line == "%":
10 |             return
11 |         # ignore lines that are not tab-separated (category_id, category_name) pairs
12 |         if "\t" in line:
13 |             category_id, category_name = line.split("\t", 1)
14 |             yield category_id, category_name
15 | 
16 | 
17 | def _parse_lexicon(lines, category_mapping):
18 |     """
19 |     Read (match_expression, category_names) pairs from the lexicon section.
20 |     Each line consists of a match expression followed by a tab and then one or more
21 |     tab-separated integers, which are mapped to category names using `category_mapping`.
22 |     """
23 |     for line in lines:
24 |         line = line.strip()
25 |         parts = line.split("\t")
26 |         yield parts[0], [category_mapping[category_id] for category_id in parts[1:]]
27 | 
28 | 
29 | def read_dic(filepath):
30 |     """
31 |     Reads a LIWC lexicon from a file in the .dic format, returning a tuple of
32 |     (lexicon, category_names), where:
33 |     * `lexicon` is a dict mapping string patterns to lists of category names
34 |     * `category_names` is a list of category names (as strings)
35 |     """
36 |     with open(filepath) as lines:
37 |         # read up to first "%" (should be very first line of file)
38 |         for line in lines:
39 |             if line.strip() == "%":
40 |                 break
41 |         # read categories (a mapping from integer string to category name)
42 |         category_mapping = dict(_parse_categories(lines))
43 |         # read lexicon (a mapping from matching string to a list of category names)
44 |         lexicon = dict(_parse_lexicon(lines, category_mapping))
45 |     return lexicon, list(category_mapping.values())
46 | 
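To illustrate the `.dic` layout that `read_dic` expects (a `%` line, tab-separated category lines, a second `%` line, then tab-separated lexicon lines), here is a minimal sketch that parses the same format from an in-memory string. `read_dic_text` is a hypothetical helper, not part of this package, and unlike `_parse_lexicon` it skips blank lines in the lexicon section.

```python
def read_dic_text(text):
    """Hypothetical helper: parse .dic-formatted text held in a string."""
    lines = iter(text.splitlines())
    # skip everything up to and including the first "%" line
    for line in lines:
        if line.strip() == "%":
            break
    # categories section: "<id>\t<name>" lines, ended by the second "%" line
    category_mapping = {}
    for line in lines:
        line = line.strip()
        if line == "%":
            break
        if "\t" in line:
            category_id, category_name = line.split("\t", 1)
            category_mapping[category_id] = category_name
    # lexicon section: "<pattern>\t<id>\t<id>..." lines; blank lines are skipped
    lexicon = {}
    for line in lines:
        parts = line.strip().split("\t")
        if parts[0]:
            lexicon[parts[0]] = [category_mapping[i] for i in parts[1:]]
    return lexicon, list(category_mapping.values())


# same content as test/alpha.dic
ALPHA_DIC = "%\n1\tA\n2\tBravo\n%\na*\t1\nbravo\t2\n"
lexicon, category_names = read_dic_text(ALPHA_DIC)
```

Because all three loops share one iterator, each section resumes where the previous one stopped, which mirrors how `read_dic` passes the open file object to `_parse_categories` and then `_parse_lexicon`.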


--------------------------------------------------------------------------------
/liwc/trie.py:
--------------------------------------------------------------------------------
 1 | def build_trie(lexicon):
 2 |     """
 3 |     Build a character-trie from the plain pattern_string -> categories_list
 4 |     mapping provided by `lexicon`.
 5 | 
 6 |     Some LIWC patterns end with a `*`, a wildcard matching any token with that prefix.
 7 |     """
 8 |     trie = {}
 9 |     for pattern, category_names in lexicon.items():
10 |         cursor = trie
11 |         for char in pattern:
12 |             if char == "*":
13 |                 cursor["*"] = category_names
14 |                 break
15 |             if char not in cursor:
16 |                 cursor[char] = {}
17 |             cursor = cursor[char]
18 |         cursor["$"] = category_names
19 |     return trie
20 | 
21 | 
22 | def search_trie(trie, token, token_i=0):
23 |     """
24 |     Search the given character-trie for paths that match the `token` string.
25 |     """
26 |     if "*" in trie:
27 |         return trie["*"]
28 |     if "$" in trie and token_i == len(token):
29 |         return trie["$"]
30 |     if token_i < len(token):
31 |         char = token[token_i]
32 |         if char in trie:
33 |             return search_trie(trie[char], token, token_i + 1)
34 |     return []
35 | 
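To make the data structure concrete, the sketch below builds the trie for the two patterns in `test/alpha.dic` and searches it with a loop instead of recursion. `build_trie` is repeated (with `setdefault`) only so the block is self-contained; `search_trie_iterative` is a hypothetical variant, intended to return the same results as `search_trie` for ordinary tokens, and is not part of this package.

```python
def build_trie(lexicon):
    # Same construction as liwc/trie.py, repeated to keep the sketch self-contained.
    trie = {}
    for pattern, category_names in lexicon.items():
        cursor = trie
        for char in pattern:
            if char == "*":
                cursor["*"] = category_names
                break
            cursor = cursor.setdefault(char, {})
        cursor["$"] = category_names
    return trie


def search_trie_iterative(trie, token):
    # Hypothetical loop-based variant of search_trie.
    cursor = trie
    for char in token:
        if "*" in cursor:
            # a wildcard matches regardless of the remaining characters
            return cursor["*"]
        # "$" is reserved as the end-of-pattern marker, so never follow it
        if char == "$" or char not in cursor:
            return []
        cursor = cursor[char]
    if "*" in cursor:
        return cursor["*"]
    return cursor.get("$", [])


lexicon = {"a*": ["A"], "bravo": ["Bravo"]}
trie = build_trie(lexicon)
# trie == {
#     "a": {"*": ["A"], "$": ["A"]},
#     "b": {"r": {"a": {"v": {"o": {"$": ["Bravo"]}}}}},
# }

sentence = "Any alpha a bravo charlie Bravo boy"
matches = [c for token in sentence.split() for c in search_trie_iterative(trie, token)]
```

Note that a wildcard pattern such as `a*` leaves both a `*` and a `$` entry on the same node, because the assignment after the inner loop also runs after `break`. This is harmless for lookups, since `*` is checked first.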


--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
 1 | [metadata]
 2 | name = liwc
 3 | author = Christopher Brown
 4 | author_email = chrisbrown@utexas.edu
 5 | url = https://github.com/chbrown/liwc-python
 6 | description = Linguistic Inquiry and Word Count (LIWC) analyzer (proprietary data not included)
 7 | long_description = file: README.md
 8 | long_description_content_type = text/markdown
 9 | license = MIT
10 | 
11 | [options]
12 | packages = find:
13 | zip_safe = True
14 | setup_requires =
15 |   pytest-runner
16 |   setuptools-scm
17 | tests_require =
18 |   pytest
19 |   pytest-black
20 | 
21 | [aliases]
22 | test = pytest
23 | 
24 | [tool:pytest]
25 | addopts =
26 |   --black
27 | 
28 | [bdist_wheel]
29 | universal = 1
30 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | 
3 | setup(use_scm_version=True)
4 | 


--------------------------------------------------------------------------------
/test/alpha.dic:
--------------------------------------------------------------------------------
1 | %
2 | 1	A
3 | 2	Bravo
4 | %
5 | a*	1
6 | bravo	2
7 | 


--------------------------------------------------------------------------------
/test/test_alpha_dic.py:
--------------------------------------------------------------------------------
 1 | import os.path
 2 | 
 3 | import liwc
 4 | 
 5 | test_dir = os.path.dirname(__file__)
 6 | 
 7 | 
 8 | def test_category_names():
 9 |     _, category_names = liwc.load_token_parser(os.path.join(test_dir, "alpha.dic"))
10 |     assert category_names == ["A", "Bravo"]
11 | 
12 | 
13 | def test_parse():
14 |     parse, _ = liwc.load_token_parser(os.path.join(test_dir, "alpha.dic"))
15 |     sentence = "Any alpha a bravo charlie Bravo boy"
16 |     tokens = sentence.split()
17 |     matches = [category for token in tokens for category in parse(token)]
18 |     # matching is case-sensitive, so the only matches are "alpha" (A), "a" (A) and "bravo" (Bravo)
19 |     assert matches == ["A", "A", "Bravo"]
20 | 


--------------------------------------------------------------------------------