├── README.md
├── TUMCC-clean.txt
└── TUMCC-raw.7z

/README.md:
--------------------------------------------------------------------------------
# TUMCC (Telegram Underground Market Chinese Corpus)

TUMCC is the first Chinese corpus in the jargon identification field.

When building TUMCC, we collected **28,749** sentences (**804,971** characters) from **19,821** Telegram users in **12** Telegram groups.

We completed data screening and word segmentation before releasing this corpus, so it should be easier for you to use.

After cleaning, TUMCC contains 3,863 sentences (100,000 characters) from 3,139 Telegram users.

## Files

``TUMCC-clean.txt`` contains the corpus after our cleaning. You can use it directly in your research.

``TUMCC-raw.7z`` contains the raw data we collected from Telegram. You can apply your own text cleaning to it to obtain more valid data and valuable information.

For more details about the Telegram groups used as data sources, please refer to the paper [`Identification of Chinese Dark Jargons in Telegram Underground Markets Using Context-Oriented and Linguistic Features`](https://doi.org/10.1016/j.ipm.2022.103033) ([Information Processing and Management](https://www.sciencedirect.com/journal/information-processing-and-management), 2022).
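
As a rough illustration of using ``TUMCC-clean.txt`` directly, the sketch below loads the file and computes a few basic statistics. It assumes the file is UTF-8 encoded, holds one sentence per line, and separates segmented words with whitespace; none of this is stated in this README, so check the file and adjust if needed. The function names `load_corpus` and `corpus_stats` are hypothetical helpers, not part of the dataset.

```python
from collections import Counter
from pathlib import Path


def load_corpus(path, encoding="utf-8"):
    """Read a segmented corpus file into a list of token lists.

    Assumption: one sentence per line, tokens separated by whitespace.
    """
    sentences = []
    with Path(path).open(encoding=encoding) as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences.append(tokens)
    return sentences


def corpus_stats(sentences):
    """Return (sentence count, character count, token frequencies).

    The character count sums token lengths, so separators are excluded.
    """
    freq = Counter(token for sentence in sentences for token in sentence)
    n_chars = sum(len(token) for sentence in sentences for token in sentence)
    return len(sentences), n_chars, freq


# Example usage (run from the repository root):
#   corpus = load_corpus("TUMCC-clean.txt")
#   n_sentences, n_chars, freq = corpus_stats(corpus)
#   print(n_sentences, n_chars, freq.most_common(10))
```

If the counts differ from the figures reported above, the format assumptions (encoding, one sentence per line, whitespace separators) are the first thing to verify.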

## Citation
Thanks for your interest in our dataset. Please feel free to leave a ⭐️ or cite us with:

```
@article{hou2022identification,
  title={Identification of Chinese dark jargons in Telegram underground markets using context-oriented and linguistic features},
  author={Hou, Yiwei and Wang, Hailin and Wang, Haizhou},
  journal={Information Processing \& Management},
  volume={59},
  number={5},
  pages={103033,1--20},
  year={2022},
  publisher={Elsevier}
}
```

--------------------------------------------------------------------------------
/TUMCC-raw.7z:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/m1-llie/TUMCC/637d7deaa308426061d0220efd0d638cc411f8e3/TUMCC-raw.7z
--------------------------------------------------------------------------------