├── NLP - Text processing pipeline.ipynb ├── README.md └── text-processing-pipeline-classification-models.ipynb /NLP - Text processing pipeline.ipynb: -------------------------------------------------------------------------------- 1 | {"cells":[{"metadata":{},"cell_type":"markdown","source":"This notebook contains functions required for text cleaning and processing pipeline in NLP problems.\nThese are ready-to-use functions and use NLTK and SKlearn packages."},{"metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","trusted":true},"cell_type":"code","source":"# This Python 3 environment comes with many helpful analytics libraries installed\n# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n# For example, here's several helpful packages to load\n\nimport numpy as np # linear algebra\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\nimport string as st\nimport re\nimport nltk\nfrom nltk import PorterStemmer, WordNetLemmatizer\n\n# Input data files are available in the read-only \"../input/\" directory\n# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n\nimport os\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n for filename in filenames:\n print(os.path.join(dirname, filename))\n\n# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session","execution_count":2,"outputs":[{"output_type":"stream","text":"/kaggle/input/spam-text-message-classification/SPAM text message 20170820 - Data.csv\n","name":"stdout"}]},{"metadata":{"_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","trusted":true},"cell_type":"code","source":"# 
Read the data. Here it is already in .csv format.\ndata = pd.read_csv('../input/spam-text-message-classification/SPAM text message 20170820 - Data.csv')\ndata.head()","execution_count":3,"outputs":[{"output_type":"execute_result","execution_count":3,"data":{"text/plain":" Category Message\n0 ham Go until jurong point, crazy.. Available only ...\n1 ham Ok lar... Joking wif u oni...\n2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n3 ham U dun say so early hor... U c already then say...\n4 ham Nah I don't think he goes to usf, he lives aro...","text/html":"
|   | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |

|   | Category | Message | removed_punc |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go until jurong point crazy Available only in ... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar Joking wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say so early hor U c already then say |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... |
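
The cell that produced `removed_punc` is not shown in this dump. The sketch below is one way to get the same column, assuming every character in `string.punctuation` is dropped; the helper name `remove_punct` is hypothetical.

```python
import string


def remove_punct(text: str) -> str:
    """Drop every character listed in string.punctuation (assumed rule)."""
    return "".join(ch for ch in text if ch not in string.punctuation)


# Row 1 of the table above
print(remove_punct("Ok lar... Joking wif u oni..."))  # Ok lar Joking wif u oni
```

On the DataFrame this would likely be applied as `data['Message'].apply(remove_punct)`.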

|   | Category | Message | removed_punc | tokens |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go until jurong point crazy Available only in ... | [go, until, jurong, point, crazy, available, o... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar Joking wif u oni | [ok, lar, joking, wif, u, oni] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say so early hor U c already then say | [u, dun, say, so, early, hor, u, c, already, t... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... | [nah, i, dont, think, he, goes, to, usf, he, l... |
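
The `tokens` column looks like the punctuation-free text lowercased and split on whitespace. A minimal sketch under that assumption follows; the function name `tokenize` is hypothetical and the original notebook may have used a regular expression instead.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on runs of whitespace (assumed rule)."""
    return text.lower().split()


# Row 1 of the table above
print(tokenize("Ok lar Joking wif u oni"))
# ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```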

|   | Category | Message | removed_punc | tokens | larger_tokens |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go until jurong point crazy Available only in ... | [go, until, jurong, point, crazy, available, o... | [until, jurong, point, crazy, available, only,... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar Joking wif u oni | [ok, lar, joking, wif, u, oni] | [joking] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... | [free, entry, wkly, comp, final, tkts, 21st, 2... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say so early hor U c already then say | [u, dun, say, so, early, hor, u, c, already, t... | [early, already, then] |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... | [nah, i, dont, think, he, goes, to, usf, he, l... | [dont, think, goes, lives, around, here, though] |
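
Judging from the rows above, `larger_tokens` keeps only tokens longer than three characters (`wkly` and `then` survive, `win` and `usf` do not). The sketch below assumes that threshold; `remove_small_words` and `min_len` are hypothetical names.

```python
from typing import List


def remove_small_words(tokens: List[str], min_len: int = 4) -> List[str]:
    """Keep tokens with at least min_len characters (threshold inferred from the table)."""
    return [tok for tok in tokens if len(tok) >= min_len]


# Row 1 of the table above
print(remove_small_words(["ok", "lar", "joking", "wif", "u", "oni"]))  # ['joking']
```

A length filter like this is crude: it also discards short but meaningful tokens such as `win` or `cup`, so the threshold is worth tuning per dataset.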

|   | Category | Message | removed_punc | tokens | larger_tokens | clean_tokens |
|---|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go until jurong point crazy Available only in ... | [go, until, jurong, point, crazy, available, o... | [until, jurong, point, crazy, available, only,... | [jurong, point, crazy, available, bugis, great... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar Joking wif u oni | [ok, lar, joking, wif, u, oni] | [joking] | [joking] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... | [free, entry, wkly, comp, final, tkts, 21st, 2... | [free, entry, wkly, comp, final, tkts, 21st, 2... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say so early hor U c already then say | [u, dun, say, so, early, hor, u, c, already, t... | [early, already, then] | [early, already] |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... | [nah, i, dont, think, he, goes, to, usf, he, l... | [dont, think, goes, lives, around, here, though] | [dont, think, goes, lives, around, though] |
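
`clean_tokens` appears to be `larger_tokens` with stop words removed (`then` and `here` disappear). The notebook imports `nltk`, so it probably used `nltk.corpus.stopwords.words('english')`. To stay runnable without downloading that corpus, the sketch below substitutes a tiny hand-written set; both `STOP_WORDS_SAMPLE` and `remove_stopwords` are hypothetical.

```python
from typing import Iterable, List

# Stand-in for a full stop-word list; only the words needed for the rows above.
STOP_WORDS_SAMPLE = frozenset({"then", "here", "until", "only"})


def remove_stopwords(tokens: List[str],
                     stop_words: Iterable[str] = STOP_WORDS_SAMPLE) -> List[str]:
    """Drop every token that appears in stop_words."""
    stop_set = set(stop_words)
    return [tok for tok in tokens if tok not in stop_set]


# Row 3 of the table above
print(remove_stopwords(["early", "already", "then"]))  # ['early', 'already']
```

Passing a full stop-word list as `stop_words` would replace the sample set without changing the function.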

|   | Category | Message | removed_punc | tokens | larger_tokens | clean_tokens | stem_words |
|---|---|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go until jurong point crazy Available only in ... | [go, until, jurong, point, crazy, available, o... | [until, jurong, point, crazy, available, only,... | [jurong, point, crazy, available, bugis, great... | [jurong, point, crazi, avail, bugi, great, wor... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar Joking wif u oni | [ok, lar, joking, wif, u, oni] | [joking] | [joking] | [joke] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... | [free, entry, wkly, comp, final, tkts, 21st, 2... | [free, entry, wkly, comp, final, tkts, 21st, 2... | [free, entri, wkli, comp, final, tkt, 21st, 20... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say so early hor U c already then say | [u, dun, say, so, early, hor, u, c, already, t... | [early, already, then] | [early, already] | [earli, alreadi] |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... | [nah, i, dont, think, he, goes, to, usf, he, l... | [dont, think, goes, lives, around, here, though] | [dont, think, goes, lives, around, though] | [dont, think, goe, live, around, though] |

|   | Category | Message | removed_punc | tokens | larger_tokens | clean_tokens | stem_words | lemma_words |
|---|---|---|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go until jurong point crazy Available only in ... | [go, until, jurong, point, crazy, available, o... | [until, jurong, point, crazy, available, only,... | [jurong, point, crazy, available, bugis, great... | [jurong, point, crazi, avail, bugi, great, wor... | [jurong, point, crazy, available, bugis, great... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar Joking wif u oni | [ok, lar, joking, wif, u, oni] | [joking] | [joking] | [joke] | [joking] |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... | [free, entry, wkly, comp, final, tkts, 21st, 2... | [free, entry, wkly, comp, final, tkts, 21st, 2... | [free, entri, wkli, comp, final, tkt, 21st, 20... | [free, entry, wkly, comp, final, tkts, 21st, 2... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say so early hor U c already then say | [u, dun, say, so, early, hor, u, c, already, t... | [early, already, then] | [early, already] | [earli, alreadi] | [early, already] |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... | [nah, i, dont, think, he, goes, to, usf, he, l... | [dont, think, goes, lives, around, here, though] | [dont, think, goes, lives, around, though] | [dont, think, goe, live, around, though] | [dont, think, go, life, around, though] |

|   | Category | Message | removed_punc | tokens | larger_tokens | clean_tokens | stem_words | lemma_words | clean_text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | Go until jurong point crazy Available only in ... | [go, until, jurong, point, crazy, available, o... | [until, jurong, point, crazy, available, only,... | [jurong, point, crazy, available, bugis, great... | [jurong, point, crazi, avail, bugi, great, wor... | [jurong, point, crazy, available, bugis, great... | jurong point crazy available bugis great world... |
| 1 | ham | Ok lar... Joking wif u oni... | Ok lar Joking wif u oni | [ok, lar, joking, wif, u, oni] | [joking] | [joking] | [joke] | [joking] | joking |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | Free entry in 2 a wkly comp to win FA Cup fina... | [free, entry, in, 2, a, wkly, comp, to, win, f... | [free, entry, wkly, comp, final, tkts, 21st, 2... | [free, entry, wkly, comp, final, tkts, 21st, 2... | [free, entri, wkli, comp, final, tkt, 21st, 20... | [free, entry, wkly, comp, final, tkts, 21st, 2... | free entry wkly comp final tkts 21st 2005 text... |
| 3 | ham | U dun say so early hor... U c already then say... | U dun say so early hor U c already then say | [u, dun, say, so, early, hor, u, c, already, t... | [early, already, then] | [early, already] | [earli, alreadi] | [early, already] | early already |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | Nah I dont think he goes to usf he lives aroun... | [nah, i, dont, think, he, goes, to, usf, he, l... | [dont, think, goes, lives, around, here, though] | [dont, think, goes, lives, around, though] | [dont, think, goe, live, around, though] | [dont, think, go, life, around, though] | dont think go life around though |
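
The final `clean_text` column matches `lemma_words` joined back into a single space-separated string, which is the form text vectorizers usually expect. A minimal sketch, assuming a plain join; the name `join_tokens` is hypothetical.

```python
from typing import List


def join_tokens(tokens: List[str]) -> str:
    """Join tokens into one space-separated string."""
    return " ".join(tokens)


# Row 4 of the table above
print(join_tokens(["dont", "think", "go", "life", "around", "though"]))
# dont think go life around though
```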