├── LICENSE └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | CC0 1.0 Universal 2 | 3 | Statement of Purpose 4 | 5 | The laws of most jurisdictions throughout the world automatically confer 6 | exclusive Copyright and Related Rights (defined below) upon the creator and 7 | subsequent owner(s) (each and all, an "owner") of an original work of 8 | authorship and/or a database (each, a "Work"). 9 | 10 | Certain owners wish to permanently relinquish those rights to a Work for the 11 | purpose of contributing to a commons of creative, cultural and scientific 12 | works ("Commons") that the public can reliably and without fear of later 13 | claims of infringement build upon, modify, incorporate in other works, reuse 14 | and redistribute as freely as possible in any form whatsoever and for any 15 | purposes, including without limitation commercial purposes. These owners may 16 | contribute to the Commons to promote the ideal of a free culture and the 17 | further production of creative, cultural and scientific works, or to gain 18 | reputation or greater distribution for their Work in part through the use and 19 | efforts of others. 20 | 21 | For these and/or other purposes and motivations, and without any expectation 22 | of additional consideration or compensation, the person associating CC0 with a 23 | Work (the "Affirmer"), to the extent that he or she is an owner of Copyright 24 | and Related Rights in the Work, voluntarily elects to apply CC0 to the Work 25 | and publicly distribute the Work under its terms, with knowledge of his or her 26 | Copyright and Related Rights in the Work and the meaning and intended legal 27 | effect of CC0 on those rights. 28 | 29 | 1. Copyright and Related Rights. A Work made available under CC0 may be 30 | protected by copyright and related or neighboring rights ("Copyright and 31 | Related Rights"). Copyright and Related Rights include, but are not limited 32 | to, the following: 33 | 34 | i. the right to reproduce, adapt, distribute, perform, display, communicate, 35 | and translate a Work; 36 | 37 | ii. moral rights retained by the original author(s) and/or performer(s); 38 | 39 | iii. publicity and privacy rights pertaining to a person's image or likeness 40 | depicted in a Work; 41 | 42 | iv. rights protecting against unfair competition in regards to a Work, 43 | subject to the limitations in paragraph 4(a), below; 44 | 45 | v. rights protecting the extraction, dissemination, use and reuse of data in 46 | a Work; 47 | 48 | vi. database rights (such as those arising under Directive 96/9/EC of the 49 | European Parliament and of the Council of 11 March 1996 on the legal 50 | protection of databases, and under any national implementation thereof, 51 | including any amended or successor version of such directive); and 52 | 53 | vii. other similar, equivalent or corresponding rights throughout the world 54 | based on applicable law or treaty, and any national implementations thereof. 55 | 56 | 2. Waiver. To the greatest extent permitted by, but not in contravention of, 57 | applicable law, Affirmer hereby overtly, fully, permanently, irrevocably and 58 | unconditionally waives, abandons, and surrenders all of Affirmer's Copyright 59 | and Related Rights and associated claims and causes of action, whether now 60 | known or unknown (including existing as well as future claims and causes of 61 | action), in the Work (i) in all territories worldwide, (ii) for the maximum 62 | duration provided by applicable law or treaty (including future time 63 | extensions), (iii) in any current or future medium and for any number of 64 | copies, and (iv) for any purpose whatsoever, including without limitation 65 | commercial, advertising or promotional purposes (the "Waiver"). Affirmer makes 66 | the Waiver for the benefit of each member of the public at large and to the 67 | detriment of Affirmer's heirs and successors, fully intending that such Waiver 68 | shall not be subject to revocation, rescission, cancellation, termination, or 69 | any other legal or equitable action to disrupt the quiet enjoyment of the Work 70 | by the public as contemplated by Affirmer's express Statement of Purpose. 71 | 72 | 3. Public License Fallback. Should any part of the Waiver for any reason be 73 | judged legally invalid or ineffective under applicable law, then the Waiver 74 | shall be preserved to the maximum extent permitted taking into account 75 | Affirmer's express Statement of Purpose. In addition, to the extent the Waiver 76 | is so judged Affirmer hereby grants to each affected person a royalty-free, 77 | non transferable, non sublicensable, non exclusive, irrevocable and 78 | unconditional license to exercise Affirmer's Copyright and Related Rights in 79 | the Work (i) in all territories worldwide, (ii) for the maximum duration 80 | provided by applicable law or treaty (including future time extensions), (iii) 81 | in any current or future medium and for any number of copies, and (iv) for any 82 | purpose whatsoever, including without limitation commercial, advertising or 83 | promotional purposes (the "License"). The License shall be deemed effective as 84 | of the date CC0 was applied by Affirmer to the Work. Should any part of the 85 | License for any reason be judged legally invalid or ineffective under 86 | applicable law, such partial invalidity or ineffectiveness shall not 87 | invalidate the remainder of the License, and in such case Affirmer hereby 88 | affirms that he or she will not (i) exercise any of his or her remaining 89 | Copyright and Related Rights in the Work or (ii) assert any associated claims 90 | and causes of action with respect to the Work, in either case contrary to 91 | Affirmer's express Statement of Purpose. 92 | 93 | 4. Limitations and Disclaimers. 94 | 95 | a. No trademark or patent rights held by Affirmer are waived, abandoned, 96 | surrendered, licensed or otherwise affected by this document. 97 | 98 | b. Affirmer offers the Work as-is and makes no representations or warranties 99 | of any kind concerning the Work, express, implied, statutory or otherwise, 100 | including without limitation warranties of title, merchantability, fitness 101 | for a particular purpose, non infringement, or the absence of latent or 102 | other defects, accuracy, or the present or absence of errors, whether or not 103 | discoverable, all to the greatest extent permissible under applicable law. 104 | 105 | c. Affirmer disclaims responsibility for clearing rights of other persons 106 | that may apply to the Work or any use thereof, including without limitation 107 | any person's Copyright and Related Rights in the Work. Further, Affirmer 108 | disclaims responsibility for obtaining any necessary consents, permissions 109 | or other rights required for any use of the Work. 110 | 111 | d. Affirmer understands and acknowledges that Creative Commons is not a 112 | party to this document and has no duty or obligation with respect to this 113 | CC0 or use of the Work. 114 | 115 | For more information, please see 116 | 117 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Awesome software for Text ML [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome) 2 | 3 | A curated list of awesome ML frameworks and text embeddings. Focused on SOTA libraries which are actively maintained on GitHub. 4 | 5 | ## Frameworks and libraries 6 | 7 | ### :snake: Python 8 | 9 | #### Text processing 10 | 11 | * [HanLP](https://github.com/hankcs/HanLP) - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface. https://bbs.hankcs.com/ 12 | 13 | * [flair](https://github.com/flairNLP/flair) - A powerful NLP library for state-of-the-art natural language processing (NLP) models, such as named entity recognition (NER), part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification. 14 | 15 | * [sentencepiece](https://github.com/google/sentencepiece) - Unsupervised text tokenizer for Neural Network-based text generation. 16 | 17 | * [stanza](https://github.com/stanfordnlp/stanza) - Official Stanford NLP Python Library for Many Human Languages. https://stanfordnlp.github.io/stanza/ 18 | 19 | #### Pipelines / block-programming 20 | 21 | * [texthero](https://github.com/jbesomi/texthero) - Text preprocessing, representation and visualization from zero to hero. https://texthero.org/ 22 | 23 | #### Distributed computing 24 | 25 | * [spark-nlp](https://github.com/JohnSnowLabs/spark-nlp) - Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. https://nlp.johnsnowlabs.com/ 26 | 27 | #### Machine Learning 28 | 29 | * [sklearn](https://github.com/scikit-learn/scikit-learn) - Scikit-learn is a Python module for machine learning built on top of SciPy, including tools for text vectorization and vector space compression. https://scikit-learn.org/stable/ 30 | 31 | * [gensim](https://github.com/RaRe-Technologies/gensim) - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community. https://radimrehurek.com/gensim/ 32 | 33 | * [nlpaug](https://github.com/makcedward/nlpaug) - Augmenting nlp for your machine learning projects. 34 | 35 | * [AugLy](https://github.com/facebookresearch/AugLy) - A data augmentations library from Facebook research for audio, image, text, and video. 36 | 37 | #### Deep Learning 38 | 39 | * [Transformers](https://github.com/huggingface/transformers) - Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. https://huggingface.co/transformers 40 | 41 | * [fairseq](https://github.com/facebookresearch/fairseq) - Facebook AI Research Sequence-to-Sequence Toolkit written in Python. https://fairseq.readthedocs.io/en/latest/ 42 | 43 | * [bert-as-service](https://github.com/hanxiao/bert-as-service) - Mapping a variable-length sentence to a fixed-length vector using BERT model. https://bert-as-service.readthedocs.io 44 | 45 | * [Kashgari](https://github.com/BrikerMan/Kashgari) - Kashgari is a Production-ready NLP Transfer learning framework for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding. 46 | 47 | #### Natural Language Understanding 48 | 49 | * [Snips NLU](https://github.com/snipsco/snips-nlu) - Snips Python library to extract meaning from text. https://snips-nlu.readthedocs.io 50 | 51 | * [IKY](https://github.com/alfredfrancis/ai-chatbot-framework) - A python chatbot framework with Natural Language Understanding and Artificial Intelligence. 52 | 53 | * [rasa](https://github.com/RasaHQ/rasa) - Framework to automate text- and voice-based conversations: NLU, dialogue management, chatbots. https://rasa.com/docs/rasa/ 54 | 55 | * [ParlAI](https://github.com/facebookresearch/ParlAI) - A framework for training and evaluating AI models on a variety of openly available dialogue datasets. https://parl.ai/ 56 | 57 | * [DeepPavlov](https://github.com/deeppavlov/DeepPavlov) - An open source library for deep learning end-to-end dialog systems and chatbots. https://deeppavlov.ai/ 58 | 59 | * [Rhino](https://github.com/Picovoice/rhino) - On-device speech-to-intent engine powered by deep learning. https://picovoice.ai/ 60 | 61 | * [langchain](https://github.com/hwchase17/langchain) - Building applications with LLMs (large language models) through composability. https://langchain.readthedocs.io 62 | 63 | * [NeMo](https://github.com/NVIDIA/NeMo) - NeMo: a toolkit for conversational AI. https://nvidia.github.io/NeMo/ 64 | 65 | #### Text mining 66 | 67 | * [dedupe](https://github.com/dedupeio/dedupe) - A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution. 68 | 69 | #### Visualizations 70 | 71 | * [Scattertext](https://github.com/JasonKessler/scattertext) - Beautiful visualizations of how language differs among document types. 72 | 73 | #### Big language models 74 | 75 | * [BIG-bench](https://github.com/google/BIG-bench) - Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models. 76 | 77 | ### C++ 78 | 79 | #### Text processing 80 | 81 | Currently empty 🪹 82 | 83 | ## Knowledge 📚 84 | 85 | ### Learning 101 86 | 87 | * [Virgilio](https://github.com/virgili0/Virgilio) - Virgilio is an open-source initiative, aiming to mentor and guide anyone in the world of the Data Science. 88 | 89 | ### Multiple languages 90 | 91 | * [Awesome Sentiment Analysis](https://github.com/laugustyniak/awesome-sentiment-analysis) - Repository with all what is necessary for sentiment analysis and related areas 92 | 93 | ### Python (and Python Notebooks) 94 | 95 | * [practicalAI](https://github.com/practicalAI/practicalAI) - A practical approach to machine learning to enable everyone to learn, explore and build. https://practicalai.me 96 | 97 | * [nlp-recipes](https://github.com/microsoft/nlp-recipes) - Comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. 98 | 99 | ## No longer maintained 100 | 101 | * [NeuronBlocks](https://github.com/microsoft/NeuronBlocks) - NLP DNN Toolkit - Building Your NLP DNN Models Like Playing Lego. 102 | 103 | * [artificial-adversary](https://github.com/airbnb/artificial-adversary) - Tool to generate adversarial text examples and test machine learning models against them. 104 | 105 | * [DELTA](https://github.com/didi/delta) - DELTA is a deep learning based natural language and speech processing platform. https://delta-didi.readthedocs.io/ 106 | 107 | * [EventForecast](https://github.com/moment-of-peace/EventForecast) - Time series prediction and text analysis using Keras LSTM, plus clustering, association rules mining. 108 | 109 | * [lazynlp](https://github.com/chiphuyen/lazynlp) - Library to scrape and clean web pages to create massive datasets. 110 | 111 | * [MeTA: ModErn Text Analysis](https://github.com/meta-toolkit/meta) - A Modern C++ Data Sciences Toolkit. https://meta-toolkit.org 112 | --------------------------------------------------------------------------------