├── README.md
├── 文本分析及語料庫統計
    ├── .gitkeep
    ├── README.md
    ├── data
    │   ├── news.csv
    │   ├── news_annotation.p
    │   └── polls.json
    ├── ipynb
    │   ├── 1. Fundamentals of Data Analysis.ipynb
    │   ├── 2. Descriptive Statistics.ipynb
    │   ├── 3. Models.ipynb
    │   └── 4. Practice on news annotation.ipynb
    └── slides
    │   ├── 1. Fundamentals of Data Analysis.slides.html
    │   ├── 2. Descriptive Statistics.slides.html
    │   └── 3. Models.slides.html
├── 語料庫介面設計
    ├── .gitkeep
    └── README.md
├── 語料庫標記及語言學分析
    ├── .gitkeep
    ├── README.md
    ├── practice
    │   ├── .ipynb_checkpoints
    │   │   └── sentiment_annotation-checkpoint.ipynb
    │   ├── 2018 語料庫程式實務工作坊_語言學分析&標記實作_學習單.pdf
    │   ├── negative_words.txt
    │   ├── new_sample.txt
    │   ├── positive_words.txt
    │   ├── readme.txt
    │   ├── sentiment_annotation.ipynb
    │   ├── sentiment_annotation.py
    │   └── sentiment_annotation_sample.csv
    └── slide
    │   └── 20181104_語料庫語言學工作坊.pdf
├── 語料庫爬蟲
    ├── README.md
    ├── data
    │   ├── appledaily.json
    │   └── han.json
    └── src
    │   ├── applecrawler.py
    │   └── 實戰.ipynb
├── 語料庫資料前處理
    ├── .gitkeep
    ├── README.md
    ├── data
    │   ├── 106_student.csv
    │   ├── clean_han.json
    │   └── han.json
    ├── dict.txt.big
    ├── pre_process.ipynb
    ├── stopwords.txt
    └── userdict.txt
└── 語料庫開放框架及上線部署
    ├── .gitkeep
    ├── Lyrics_analytics.ipynb
    └── README.md

/README.md:
--------------------------------------------------------------------------------
1 | ## COPENS Workshop 2018 語料庫程式實務工作坊
2 |
3 | [Workshop link](http://lope.linguistics.ntu.edu.tw/hocor2018/)
4 |
5 | ### Hands-on Corpus Programming Workshop
6 | Corpora are the basic infrastructure of linguistic research and language technology applications. As corpora grow and research questions become more complex, existing tools no longer suffice, and programming has become a basic skill for researchers.
7 | This workshop aims to give people with introductory Python knowledge a set of specialized programming skills for corpus processing.
8 |
--------------------------------------------------------------------------------
/文本分析及語料庫統計/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/文本分析及語料庫統計/.gitkeep
--------------------------------------------------------------------------------
/文本分析及語料庫統計/README.md:
--------------------------------------------------------------------------------
1 | Text Analysis and Corpus Statistics
2 |
3 | ---
4 |
5 | This section has three parts:
6 |
7 | 1. Fundamentals of Data Analysis
8 | Basic data processing with Pandas, including reading and writing files, inspecting data tables, selecting specific data, sorting, transforming data, and plotting charts.
9 |
10 | 2. Descriptive Statistics
11 | Basic descriptive statistics, including frequency, mean, standard deviation, standardization, and correlation coefficients.
12 |
13 | 3. Models
14 | Building models: first a Linear Model for predicting numeric values, then a classifier built with Logistic Regression. We also walk through using the scikit-learn package.
15 |
--------------------------------------------------------------------------------
/文本分析及語料庫統計/data/news_annotation.p:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/文本分析及語料庫統計/data/news_annotation.p
--------------------------------------------------------------------------------
/文本分析及語料庫統計/data/polls.json:
--------------------------------------------------------------------------------
1 | [{"有效樣本": "1077", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/10/22", "支持率": {"其他": "19.9%", "蘇貞昌": "31.0%", "侯友宜": "49.1%"}}, {"有效樣本": "1026", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/10/19", "支持率": {"其他": "20.0%", "蘇貞昌": "30.0%", "侯友宜": "50.0%"}}, {"有效樣本": "1075", "機構": "信傳媒", "訪問主題": "彰化縣長當選人", "時間": "2018/10/18", "支持率": {"魏明谷": "34.5%", "黃文玲": "4.8%", "其他": "27.4%", "王惠美": "33.3%"}}, {"有效樣本": "1009", "機構": "TVBS", "訪問主題": "高雄市長當選人", "時間": "2018/10/17", "支持率": {"韓國瑜": "42.0%", "陳其邁": "35.0%", "其他": "23.0%"}}, {"有效樣本": "1070", "機構": "美麗島電子報", "訪問主題": "台北市長當選人", "時間": "2018/10/15", "支持率": {"柯文哲": "37.5%", "姚文智": "11.3%", "丁守中": "25.0%", "其他": "26.2%"}}, {"有效樣本": "1271", "機構": "ETtoday 東森新聞雲", "訪問主題": "屏東縣長當選人", "時間": "2018/10/15", "支持率": {"潘孟安": "36.4%", "蘇清泉": "24.8%", "其他": "38.8%"}}, {"有效樣本": "803", "機構": "TVBS", "訪問主題": "嘉義市長當選人", "時間": "2018/10/15", "支持率": {"蕭淑麗": "15.0%", "黃敏惠": "35.0%", "其他": "21.0%", "涂醒者": "29.0%"}}, {"有效樣本": "1077", "機構": "旺旺中時", "訪問主題": "新北市長當選人", 
"時間": "2018/10/11", "支持率": {"其他": "17.8%", "蘇貞昌": "32.8%", "侯友宜": "49.4%"}}, {"有效樣本": "985", "機構": "TVBS", "訪問主題": "台東縣長當選人", "時間": "2018/10/11", "支持率": {"饒慶鈴": "40.0%", "鄺麗貞": "5.0%", "其他": "24.0%", "劉櫂豪": "31.0%"}}, {"有效樣本": "830", "機構": "TVBS", "訪問主題": "宜蘭縣長當選人", "時間": "2018/10/09", "支持率": {"林姿妙": "47.0%", "陳歐珀": "19.0%", "其他": "34.0%"}}, {"有效樣本": "1084", "機構": "信傳媒", "訪問主題": "台中市長當選人", "時間": "2018/10/08", "支持率": {"林佳龍": "40.0%", "盧秀燕": "34.6%", "其他": "25.4%"}}, {"有效樣本": "1074", "機構": "聯合報", "訪問主題": "宜蘭縣長當選人", "時間": "2018/10/08", "支持率": {"林姿妙": "42.0%", "陳歐珀": "13.0%", "其他": "45.0%"}}, {"有效樣本": "1078", "機構": "旺旺中時", "訪問主題": "台北市長當選人", "時間": "2018/10/08", "支持率": {"柯文哲": "41.6%", "姚文智": "15.2%", "丁守中": "27.2%", "其他": "16.0%"}}, {"有效樣本": "1533", "機構": "ETtoday 東森新聞雲", "訪問主題": "嘉義縣長當選人", "時間": "2018/10/08", "支持率": {"翁章梁": "29.3%", "其他": "42.9%", "吳育仁": "22.8%", "吳芳銘": "5.0%"}}, {"有效樣本": "1075", "機構": "聯合報", "訪問主題": "嘉義市長當選人", "時間": "2018/10/05", "支持率": {"蕭淑麗": "17.0%", "涂醒哲": "25.0%", "黃敏惠": "28.0%", "其他": "30.0%"}}, {"有效樣本": "1069", "機構": "聯合報", "訪問主題": "新竹縣長當選人", "時間": "2018/10/04", "支持率": {"徐欣瑩": "31.0%", "楊文科": "25.0%", "其他": "33.0%", "鄭朝方": "11.0%"}}, {"有效樣本": "1014", "機構": "旺旺中時", "訪問主題": "台東縣長當選人", "時間": "2018/10/04", "支持率": {"饒慶鈴": "38.6%", "鄺麗貞": "3.3%", "其他": "27.8%", "劉櫂豪": "30.3%"}}, {"有效樣本": "1080", "機構": "聯合報", "訪問主題": "彰化縣長當選人", "時間": "2018/10/01", "支持率": {"魏明谷": "24.0%", "其他": "42.0%", "王惠美": "34.0%"}}, {"有效樣本": "1087", "機構": "旺旺中時", "訪問主題": "台中市長當選人", "時間": "2018/10/01", "支持率": {"林佳龍": "38.1%", "盧秀燕": "40.2%", "其他": "21.7%"}}, {"有效樣本": "1068", "機構": "三立電視", "訪問主題": "台北市長當選人", "時間": "2018/10/01", "支持率": {"柯文哲": "36.1%", "姚文智": "11.1%", "丁守中": "25.8%", "其他": "27.0%"}}, {"有效樣本": "1086", "機構": "信傳媒", "訪問主題": "宜蘭縣長當選人", "時間": "2018/10/01", "支持率": {"林姿妙": "40.1%", "陳歐珀": "26.5%", "其他": "33.4%"}}, {"有效樣本": "1501", "機構": "ETtoday 東森新聞雲", "訪問主題": "嘉義市長當選人", "時間": "2018/09/30", "支持率": {"蕭淑麗": "10.7%", "涂醒哲": "26.0%", "黃敏惠": "35.4%", "其他": "27.9%"}}, {"有效樣本": 
"1072", "機構": "自由時報", "訪問主題": "嘉義市長當選人", "時間": "2018/09/29", "支持率": {"蕭淑麗": "17.9%", "涂醒哲": "30.1%", "黃敏惠": "24.6%", "其他": "27.1%"}}, {"有效樣本": "1070", "機構": "聯合報", "訪問主題": "桃園市長當選人", "時間": "2018/09/27", "支持率": {"楊麗環": "6.0%", "陳學聖": "16.0%", "其他": "27.0%", "鄭文燦": "51.0%"}}, {"有效樣本": "1072", "機構": "聯合報", "訪問主題": "台南市長當選人", "時間": "2018/09/27", "支持率": {"林義豐": "14.0%", "黃偉哲": "31.0%", "其他": "43.0%", "高思博": "12.0%"}}, {"有效樣本": "1086", "機構": "聯合報", "訪問主題": "高雄市長當選人", "時間": "2018/09/25", "支持率": {"韓國瑜": "32.0%", "陳其邁": "34.0%", "其他": "34.0%"}}, {"有效樣本": "1723", "機構": "ETtoday 東森新聞雲", "訪問主題": "宜蘭縣長當選人", "時間": "2018/09/25", "支持率": {"林姿妙": "35.6%", "陳歐珀": "23.1%", "其他": "41.3%"}}, {"有效樣本": "1064", "機構": "自由時報", "訪問主題": "宜蘭縣長當選人", "時間": "2018/09/22", "支持率": {"林姿妙": "37.6%", "陳歐珀": "23.4%", "其他": "39.0%"}}, {"有效樣本": "1071", "機構": "聯合報", "訪問主題": "台北市長當選人", "時間": "2018/09/21", "支持率": {"柯文哲": "37.0%", "姚文智": "8.0%", "丁守中": "29.0%", "其他": "26.0%"}}, {"有效樣本": "1070", "機構": "蘋果日報", "訪問主題": "桃園市長當選人", "時間": "2018/09/21", "支持率": {"楊麗環": "5.4%", "陳學聖": "18.5%", "其他": "33.8%", "鄭文燦": "42.3%"}}, {"有效樣本": "1501", "機構": "美麗島電子報", "訪問主題": "新北市長當選人", "時間": "2018/09/21", "支持率": {"其他": "30.1%", "蘇貞昌": "28.2%", "侯友宜": "41.7%"}}, {"有效樣本": "1089", "機構": "美麗島電子報", "訪問主題": "台北市長當選人", "時間": "2018/09/21", "支持率": {"柯文哲": "34.6%", "姚文智": "11.1%", "丁守中": "23.5%", "其他": "30.8%"}}, {"有效樣本": "1076", "機構": "蘋果日報", "訪問主題": "台南市長當選人", "時間": "2018/09/20", "支持率": {"林義豐": "5.9%", "黃偉哲": "23.9%", "其他": "56.4%", "高思博": "13.8%"}}, {"有效樣本": "1074", "機構": "信傳媒", "訪問主題": "高雄市長當選人", "時間": "2018/09/20", "支持率": {"韓國瑜": "35.5%", "陳其邁": "41.2%", "其他": "23.3%"}}, {"有效樣本": "1068", "機構": "聯合報", "訪問主題": "新北市長當選人", "時間": "2018/09/19", "支持率": {"其他": "28.0%", "蘇貞昌": "24.0%", "侯友宜": "48.0%"}}, {"有效樣本": "1073", "機構": "蘋果日報", "訪問主題": "高雄市長當選人", "時間": "2018/09/19", "支持率": {"韓國瑜": "31.2%", "陳其邁": "33.8%", "其他": "35.0%"}}, {"有效樣本": "1068", "機構": "蘋果日報", "訪問主題": "台中市長當選人", "時間": "2018/09/18", "支持率": {"林佳龍": "30.3%", "盧秀燕": "30.9%", "其他": 
"38.8%"}}, {"有效樣本": "1068", "機構": "蘋果日報", "訪問主題": "台北市長當選人", "時間": "2018/09/17", "支持率": {"柯文哲": "34.9%", "姚文智": "10.4%", "丁守中": "30.8%", "其他": "23.9%"}}, {"有效樣本": "2408", "機構": "ETtoday 東森新聞雲", "訪問主題": "台北市長當選人", "時間": "2018/09/17", "支持率": {"柯文哲": "41.7%", "姚文智": "8.4%", "丁守中": "29.9%", "其他": "20.0%"}}, {"有效樣本": "1001", "機構": "TVBS", "訪問主題": "台北市長當選人", "時間": "2018/09/17", "支持率": {"柯文哲": "37.0%", "姚文智": "11.0%", "丁守中": "32.0%", "其他": "20.0%"}}, {"有效樣本": "1068", "機構": "蘋果日報", "訪問主題": "新北市長當選人", "時間": "2018/09/17", "支持率": {"其他": "30.4%", "蘇貞昌": "29.4%", "侯友宜": "40.2%"}}, {"有效樣本": "1011", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/09/17", "支持率": {"其他": "20.0%", "蘇貞昌": "32.0%", "侯友宜": "48.0%"}}, {"有效樣本": "801", "機構": "TVBS", "訪問主題": "桃園市長當選人", "時間": "2018/09/17", "支持率": {"楊麗環": "6.0%", "陳學聖": "17.0%", "其他": "24.0%", "鄭文燦": "54.0%"}}, {"有效樣本": "1015", "機構": "TVBS", "訪問主題": "台中市長當選人", "時間": "2018/09/17", "支持率": {"林佳龍": "35.0%", "盧秀燕": "38.0%", "其他": "27.0%"}}, {"有效樣本": "859", "機構": "TVBS", "訪問主題": "台南市長當選人", "時間": "2018/09/17", "支持率": {"林義豐": "13.0%", "黃偉哲": "33.0%", "其他": "40.0%", "高思博": "14.0%"}}, {"有效樣本": "1019", "機構": "TVBS", "訪問主題": "高雄市長當選人", "時間": "2018/09/17", "支持率": {"韓國瑜": "35.0%", "陳其邁": "39.0%", "其他": "26.0%"}}, {"有效樣本": "1063", "機構": "自由時報", "訪問主題": "彰化縣長當選人", "時間": "2018/09/15", "支持率": {"魏明谷": "31.3%", "其他": "38.2%", "王惠美": "30.5%"}}, {"有效樣本": "1097", "機構": "聯合報", "訪問主題": "台中市長當選人", "時間": "2018/09/13", "支持率": {"林佳龍": "33.0%", "盧秀燕": "34.0%", "其他": "33.0%"}}, {"有效樣本": "1071", "機構": "年代電視", "訪問主題": "新竹縣長當選人", "時間": "2018/09/12", "支持率": {"徐欣瑩": "32.6%", "楊文科": "23.3%", "其他": "35.1%", "鄭朝方": "9.1%"}}, {"有效樣本": "1087", "機構": "ETtoday 東森新聞雲", "訪問主題": "彰化縣長當選人", "時間": "2018/09/10", "支持率": {"魏明谷": "29.6%", "黃文玲": "4.7%", "其他": "36.7%", "王惠美": "29.0%"}}, {"有效樣本": "1082", "機構": "自由時報", "訪問主題": "新北市長當選人", "時間": "2018/09/09", "支持率": {"其他": "29.3%", "蘇貞昌": "30.9%", "侯友宜": "39.9%"}}, {"有效樣本": "1107", "機構": "風傳媒", "訪問主題": "新竹市長當選人", "時間": "2018/09/05", "支持率": {"其他": 
"30.0%", "林智堅": "48.5%", "謝文進": "7.9%", "許明財": "13.6%"}}, {"有效樣本": "1070", "機構": "信傳媒", "訪問主題": "嘉義市長當選人", "時間": "2018/09/03", "支持率": {"蕭淑麗": "18.9%", "涂醒哲": "39.3%", "黃敏惠": "24.8%", "其他": "17.0%"}}, {"有效樣本": "1685", "機構": "ETtoday 東森新聞雲", "訪問主題": "高雄市長當選人", "時間": "2018/09/03", "支持率": {"韓國瑜": "31.0%", "陳其邁": "36.7%", "其他": "32.3%"}}, {"有效樣本": "1070", "機構": "自由時報", "訪問主題": "台北市長當選人", "時間": "2018/09/01", "支持率": {"柯文哲": "33.4%", "姚文智": "14.9%", "丁守中": "24.7%", "其他": "27.0%"}}, {"有效樣本": "1068", "機構": "TVBS", "訪問主題": "新竹縣長當選人", "時間": "2018/08/31", "支持率": {"徐欣瑩": "28.0%", "楊文科": "25.0%", "其他": "34.0%", "鄭朝方": "13.0%"}}, {"有效樣本": "1748", "機構": "ETtoday 東森新聞雲", "訪問主題": "台南市長當選人", "時間": "2018/08/27", "支持率": {"林義豐": "14.1%", "黃偉哲": "31.3%", "其他": "38.7%", "高思博": "15.9%"}}, {"有效樣本": "1069", "機構": "自由時報", "訪問主題": "台中市長當選人", "時間": "2018/08/25", "支持率": {"林佳龍": "38.5%", "盧秀燕": "32.4%", "其他": "39.1%"}}, {"有效樣本": "1580", "機構": "ETtoday 東森新聞雲", "訪問主題": "桃園市長當選人", "時間": "2018/08/20", "支持率": {"陳學聖": "23.8%", "其他": "25.5%", "鄭文燦": "50.7%"}}, {"有效樣本": "1069", "機構": "自由時報", "訪問主題": "台南市長當選人", "時間": "2018/08/18", "支持率": {"林義豐": "9.7%", "黃偉哲": "40.6%", "其他": "38.1%", "高思博": "11.6%"}}, {"有效樣本": "1082", "機構": "美麗島電子報", "訪問主題": "高雄市長當選人", "時間": "2018/08/17", "支持率": {"韓國瑜": "27.8%", "陳其邁": "38.9%", "其他": "33.3%"}}, {"有效樣本": "1607", "機構": "ETtoday 東森新聞雲", "訪問主題": "台中市長當選人", "時間": "2018/08/14", "支持率": {"林佳龍": "31.0%", "盧秀燕": "35.6%", "其他": "33.4%"}}, {"有效樣本": "1753", "機構": "ETtoday 東森新聞雲", "訪問主題": "新北市長當選人", "時間": "2018/08/10", "支持率": {"其他": "30.6%", "蘇貞昌": "24.2%", "侯友宜": "45.2%"}}, {"有效樣本": "1088", "機構": "NOWnews今日新聞", "訪問主題": "台中市長當選人", "時間": "2018/08/10", "支持率": {"林佳龍": "35.8%", "盧秀燕": "38.0%", "其他": "26.2%"}}, {"有效樣本": "1022", "機構": "自由時報", "訪問主題": "高雄市長當選人", "時間": "2018/08/04", "支持率": {"韓國瑜": "26.3%", "陳其邁": "38.5%", "其他": "35.2%"}}, {"有效樣本": "1100", "機構": "信傳媒", "訪問主題": "台中市長當選人", "時間": "2018/08/01", "支持率": {"林佳龍": "43.9%", "盧秀燕": "31.1%", "其他": "25.0%"}}, {"有效樣本": "1857", "機構": "ETtoday 
東森新聞雲", "訪問主題": "台北市長當選人", "時間": "2018/07/30", "支持率": {"柯文哲": "42.0%", "姚文智": "5.4%", "丁守中": "31.0%", "其他": "21.6%"}}, {"有效樣本": "1376", "機構": "ETtoday 東森新聞雲", "訪問主題": "新竹縣長當選人", "時間": "2018/07/26", "支持率": {"徐欣瑩": "27.5%", "楊文科": "23.6%", "其他": "24.3%", "林為洲": "18.1%"}}, {"有效樣本": "870", "機構": "TVBS", "訪問主題": "台北市長當選人", "時間": "2018/07/24", "支持率": {"柯文哲": "40.0%", "姚文智": "11.0%", "丁守中": "30.0%", "其他": "19.0%"}}, {"有效樣本": "1030", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/07/24", "支持率": {"其他": "23.0%", "蘇貞昌": "29.0%", "侯友宜": "48.0%"}}, {"有效樣本": "803", "機構": "TVBS", "訪問主題": "桃園市長當選人", "時間": "2018/07/24", "支持率": {"陳學聖": "20.0%", "其他": "24.0%", "鄭文燦": "56.0%"}}, {"有效樣本": "1077", "機構": "TVBS", "訪問主題": "台中市長當選人", "時間": "2018/07/24", "支持率": {"林佳龍": "33.0%", "盧秀燕": "39.0%", "其他": "28.0%"}}, {"有效樣本": "833", "機構": "TVBS", "訪問主題": "台南市長當選人", "時間": "2018/07/24", "支持率": {"林義豐": "7.0%", "黃偉哲": "41.0%", "其他": "37.0%", "高思博": "15.0%"}}, {"有效樣本": "1012", "機構": "TVBS", "訪問主題": "高雄市長當選人", "時間": "2018/07/24", "支持率": {"韓國瑜": "32.0%", "陳其邁": "40.0%", "其他": "28.0%"}}, {"有效樣本": "1070", "機構": "信傳媒", "訪問主題": "新北市長當選人", "時間": "2018/07/23", "支持率": {"其他": "22.6%", "蘇貞昌": "31.1%", "侯友宜": "46.3%"}}, {"有效樣本": "2607", "機構": "ETtoday 東森新聞雲", "訪問主題": "連江縣長當選人", "時間": "2018/07/23", "支持率": {"劉增應": "34.1%", "蘇柏豪": "27.8%", "其他": "38.1%"}}, {"有效樣本": "2607", "機構": "ETtoday 東森新聞雲", "訪問主題": "澎湖縣長當選人", "時間": "2018/07/23", "支持率": {"陳光復": "20.1%", "賴峰偉": "29.1%", "其他": "50.8%"}}, {"有效樣本": "2607", "機構": "ETtoday 東森新聞雲", "訪問主題": "金門縣長當選人", "時間": "2018/07/23", "支持率": {"其他": "47.4%", "陳福海": "23.0%", "楊鎮浯": "28.0%"}}, {"有效樣本": "2093", "機構": "ETtoday 東森新聞雲", "訪問主題": "苗栗縣長當選人", "時間": "2018/07/20", "支持率": {"徐定禎": "16.5%", "徐耀昌": "35.9%", "其他": "47.6%"}}, {"有效樣本": "1074", "機構": "三立電視", "訪問主題": "金門縣長當選人", "時間": "2018/07/17", "支持率": {"其他": "34.7%", "陳福海": "30.6%", "楊鎮浯": "32.7%"}}, {"有效樣本": "0", "機構": "年代電視", "訪問主題": "台南市長當選人", "時間": "2018/07/15", "支持率": {"黃偉哲": "36.6%", "其他": "50.4%", "高思博": "13.0%"}}, {"有效樣本": "1082", 
"機構": "美麗島電子報", "訪問主題": "台北市長當選人", "時間": "2018/07/12", "支持率": {"柯文哲": "38.7%", "姚文智": "16.4%", "丁守中": "30.5%", "其他": "14.4%"}}, {"有效樣本": "1070", "機構": "信傳媒", "訪問主題": "台北市長當選人", "時間": "2018/07/11", "支持率": {"柯文哲": "38.9%", "姚文智": "11.0%", "丁守中": "27.5%", "其他": "22.6%"}}, {"有效樣本": "2315", "機構": "ETtoday 東森新聞雲", "訪問主題": "新竹縣長當選人", "時間": "2018/07/11", "支持率": {"徐欣瑩": "19.7%", "楊文科": "26.3%", "其他": "44.3%", "鄭朝方": "9.7%"}}, {"有效樣本": "1070", "機構": "旺旺中時", "訪問主題": "新竹市長當選人", "時間": "2018/07/04", "支持率": {"其他": "22.8%", "林智堅": "51.4%", "謝文進": "9.5%", "許明財": "16.3%"}}, {"有效樣本": "2520", "機構": "TVBS", "訪問主題": "台中市長當選人", "時間": "2018/06/29", "支持率": {"林佳龍": "33.0%", "盧秀燕": "39.7%", "其他": "27.3%"}}, {"有效樣本": "1024", "機構": "TVBS", "訪問主題": "新竹縣長當選人", "時間": "2018/06/29", "支持率": {"徐欣瑩": "27.0%", "楊文科": "20.0%", "其他": "21.0%", "林為洲": "24.0%"}}, {"有效樣本": "3273", "機構": "ETtoday 東森新聞雲", "訪問主題": "屏東縣長當選人", "時間": "2018/06/27", "支持率": {"潘孟安": "37.9%", "蘇清泉": "20.9%", "其他": "41.2%"}}, {"有效樣本": "907", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/06/26", "支持率": {"其他": "21.0%", "蘇貞昌": "31.0%", "侯友宜": "48.0%"}}, {"有效樣本": "1072", "機構": "旺旺中時", "訪問主題": "基隆市長當選人", "時間": "2018/06/26", "支持率": {"謝立功": "15.9%", "其他": "26.8%", "林右昌": "50.0%"}}, {"有效樣本": "853", "機構": "TVBS", "訪問主題": "桃園市長當選人", "時間": "2018/06/23", "支持率": {"陳學聖": "21.0%", "其他": "18.0%", "鄭文燦": "61.0%"}}, {"有效樣本": "1069", "機構": "旺旺中時", "訪問主題": "新竹縣長當選人", "時間": "2018/06/22", "支持率": {"徐欣瑩": "18.3%", "楊文科": "21.0%", "其他": "56.2%", "鄭朝方": "4.5%"}}, {"有效樣本": "1070", "機構": "年代電視", "訪問主題": "新竹縣長當選人", "時間": "2018/06/22", "支持率": {"徐欣瑩": "40.2%", "楊文科": "23.3%", "其他": "27.7%", "鄭朝方": "8.8%"}}, {"有效樣本": "3435", "機構": "ETtoday 東森新聞雲", "訪問主題": "台東縣長當選人", "時間": "2018/06/22", "支持率": {"饒慶鈴": "34.1%", "其他": "38.1%", "劉櫂豪": "27.8%"}}, {"有效樣本": "N/A", "機構": "三立電視", "訪問主題": "新竹市長當選人", "時間": "2018/06/21", "支持率": {"其他": "16.0%", "林智堅": "55.0%", "謝文進": "14.0%", "許明財": "15.0%"}}, {"有效樣本": "1075", "機構": "旺旺中時", "訪問主題": "新竹縣長當選人", "時間": "2018/06/19", "支持率": {"徐欣瑩": 
"20.1%", "楊文科": "27.0%", "其他": "48.2%", "鄭朝方": "4.7%"}}, {"有效樣本": "1217", "機構": "自由時報", "訪問主題": "新竹市長當選人", "時間": "2018/06/19", "支持率": {"其他": "22.0%", "林智堅": "55.8%", "謝文進": "7.7%", "許明財": "14.5%"}}, {"有效樣本": "1075", "機構": "美麗島電子報", "訪問主題": "新北市長當選人", "時間": "2018/06/15", "支持率": {"其他": "34.7%", "蘇貞昌": "23.6%", "侯友宜": "41.7%"}}, {"有效樣本": "1869", "機構": "ETtoday 東森新聞雲", "訪問主題": "新竹市長當選人", "時間": "2018/06/14", "支持率": {"其他": "39.0%", "林智堅": "38.0%", "謝文進": "6.0%", "許明財": "17.0%"}}, {"有效樣本": "1071", "機構": "旺旺中時", "訪問主題": "苗栗縣長當選人", "時間": "2018/06/13", "支持率": {"徐定禎": "17.6%", "徐耀昌": "56.9%", "其他": "25.5%"}}, {"有效樣本": "3435", "機構": "ETtoday 東森新聞雲", "訪問主題": "花蓮縣長當選人", "時間": "2018/06/13", "支持率": {"劉曉玫": "19.0%", "徐榛蔚": "44.4%", "其他": "36.6%"}}, {"有效樣本": "3453", "機構": "ETtoday 東森新聞雲", "訪問主題": "宜蘭縣長當選人", "時間": "2018/06/12", "支持率": {"林姿妙": "41.2%", "陳歐珀": "15.8%", "其他": "43.0%"}}, {"有效樣本": "1007", "機構": "旺旺中時", "訪問主題": "桃園市長當選人", "時間": "2018/06/12", "支持率": {"陳學聖": "24.9%", "其他": "23.6%", "鄭文燦": "51.5%"}}, {"有效樣本": "1009", "機構": "旺旺中時", "訪問主題": "台南市長當選人", "時間": "2018/06/08", "支持率": {"黃偉哲": "40.3%", "其他": "42.4%", "高思博": "17.3%"}}, {"有效樣本": "3518", "機構": "ETtoday 東森新聞雲", "訪問主題": "嘉義縣長當選人", "時間": "2018/06/04", "支持率": {"翁章梁": "34.3%", "其他": "50.4%", "吳育仁": "15.3%"}}, {"有效樣本": "3518", "機構": "ETtoday 東森新聞雲", "訪問主題": "嘉義市長當選人", "時間": "2018/06/04", "支持率": {"蕭淑麗": "18.0%", "涂醒哲": "18.2%", "黃敏惠": "31.2%", "其他": "32.6%"}}, {"有效樣本": "1195", "機構": "TVBS", "訪問主題": "台北市長當選人", "時間": "2018/05/31", "支持率": {"柯文哲": "31.0%", "姚文智": "13.0%", "丁守中": "33.0%", "其他": "23.0%"}}, {"有效樣本": "3518", "機構": "ETtoday 東森新聞雲", "訪問主題": "雲林縣長當選人", "時間": "2018/05/31", "支持率": {"張麗善": "19.6%", "其他": "47.6%", "李進勇": "32.8%"}}, {"有效樣本": "1079", "機構": "蘋果日報", "訪問主題": "台北市長當選人", "時間": "2018/05/29", "支持率": {"柯文哲": "29.0%", "姚文智": "13.5%", "丁守中": "29.1%", "其他": "28.4%"}}, {"有效樣本": "3518", "機構": "ETtoday 東森新聞雲", "訪問主題": "台南市長當選人", "時間": "2018/05/28", "支持率": {"林義豐": "2.9%", "黃偉哲": "31.2%", "其他": "50.3%", "高思博": "18.5%"}}, 
{"有效樣本": "1001", "機構": "聯合報", "訪問主題": "台中市長當選人", "時間": "2018/05/28", "支持率": {"林佳龍": "32.0%", "盧秀燕": "39.0%", "其他": "29.0%"}}, {"有效樣本": "1010", "機構": "旺旺中時", "訪問主題": "高雄市長當選人", "時間": "2018/05/27", "支持率": {"韓國瑜": "33.0%", "陳其邁": "39.5%", "其他": "27.5%"}}, {"有效樣本": "819", "機構": "聯合報", "訪問主題": "台北市長當選人", "時間": "2018/05/19", "支持率": {"柯文哲": "38.0%", "姚文智": "8.0%", "丁守中": "39.0%", "其他": "15.0%"}}, {"有效樣本": "1848", "機構": "ETtoday 東森新聞雲", "訪問主題": "台北市長當選人", "時間": "2018/05/17", "支持率": {"柯文哲": "36.4%", "姚文智": "13.4%", "丁守中": "37.5%", "其他": "12.7%"}}, {"有效樣本": "1073", "機構": "美麗島電子報", "訪問主題": "台北市長當選人", "時間": "2018/05/14", "支持率": {"柯文哲": "35.2%", "姚文智": "15.0%", "丁守中": "33.1%", "其他": "16.7%"}}, {"有效樣本": "3273", "機構": "ETtoday 東森新聞雲", "訪問主題": "高雄市長當選人", "時間": "2018/05/04", "支持率": {"韓國瑜": "30.4%", "陳其邁": "32.0%", "其他": "37.6%"}}, {"有效樣本": "1003", "機構": "美麗島電子報", "訪問主題": "台中市長當選人", "時間": "2018/05/03", "支持率": {"林佳龍": "33.0%", "盧秀燕": "35.7%", "其他": "31.3%"}}, {"有效樣本": "980", "機構": "聯合報", "訪問主題": "新北市長當選人", "時間": "2018/05/01", "支持率": {"其他": "29.0%", "蘇貞昌": "26.0%", "侯友宜": "45.0%"}}, {"有效樣本": "1083", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/05/01", "支持率": {"其他": "22.3%", "蘇貞昌": "27.1%", "侯友宜": "50.6%"}}, {"有效樣本": "1081", "機構": "大社會民調", "訪問主題": "嘉義市長當選人", "時間": "2018/04/18", "支持率": {"蕭淑麗": "15.1%", "涂醒哲": "21.9%", "黃敏惠": "34.4%", "其他": "0.0%"}}, {"有效樣本": "1909", "機構": "ETtoday 東森新聞雲", "訪問主題": "新北市長當選人", "時間": "2018/04/13", "支持率": {"其他": "30.9%", "蘇貞昌": "22.7%", "侯友宜": "46.4%"}}, {"有效樣本": "957", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/04/12", "支持率": {"其他": "28.0%", "蘇貞昌": "32.0%", "侯友宜": "40.0%"}}, {"有效樣本": "1896", "機構": "ETtoday 東森新聞雲", "訪問主題": "新竹市長當選人", "時間": "2018/04/12", "支持率": {"其他": "36.6%", "林智堅": "51.4%", "許明財": "12.0%"}}, {"有效樣本": "1841", "機構": "ETtoday 東森新聞雲", "訪問主題": "桃園市長當選人", "時間": "2018/04/11", "支持率": {"陳學聖": "31.7%", "其他": "21.6%", "鄭文燦": "46.7%"}}, {"有效樣本": "N/A", "機構": "旺旺中時", "訪問主題": "彰化縣長當選人", "時間": "2018/04/10", "支持率": {"魏明谷": "39.9%", "其他": "34.4%", "王惠美": 
"25.7%"}}, {"有效樣本": "0", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/04/07", "支持率": {"其他": "19.3%", "蘇貞昌": "26.4%", "侯友宜": "54.3%"}}, {"有效樣本": "541", "機構": "蘋果日報", "訪問主題": "新北市長當選人", "時間": "2018/04/07", "支持率": {"其他": "11.1%", "蘇貞昌": "27.5%", "侯友宜": "61.4%"}}, {"有效樣本": "2016", "機構": "ETtoday 東森新聞雲", "訪問主題": "基隆市長當選人", "時間": "2018/03/30", "支持率": {"謝立功": "24.0%", "其他": "31.9%", "林右昌": "44.1%"}}, {"有效樣本": "1084", "機構": "旺旺中時", "訪問主題": "南投縣長當選人", "時間": "2018/03/28", "支持率": {"其他": "53.2%", "洪國浩": "8.9%", "林明溱": "37.9%"}}, {"有效樣本": "1071", "機構": "旺旺中時", "訪問主題": "花蓮縣長當選人", "時間": "2018/03/25", "支持率": {"劉曉玫": "17.0%", "徐榛蔚": "40.1%", "其他": "42.9%"}}, {"有效樣本": "3240", "機構": "ETtoday 東森新聞雲", "訪問主題": "彰化縣長當選人", "時間": "2018/03/19", "支持率": {"魏明谷": "31.9%", "其他": "39.8%", "王惠美": "28.3%"}}, {"有效樣本": "3240", "機構": "ETtoday 東森新聞雲", "訪問主題": "南投縣長當選人", "時間": "2018/03/19", "支持率": {"其他": "28.9%", "洪國浩": "11.7%", "林明溱": "59.4%"}}, {"有效樣本": "2259", "機構": "ETtoday 東森新聞雲", "訪問主題": "台中市長當選人", "時間": "2018/03/05", "支持率": {"林佳龍": "32.6%", "盧秀燕": "32.0%", "其他": "35.4%"}}, {"有效樣本": "1430", "機構": "ETtoday 東森新聞雲", "訪問主題": "新北市長當選人", "時間": "2018/03/02", "支持率": {"其他": "21.2%", "蘇貞昌": "27.4%", "侯友宜": "51.4%"}}, {"有效樣本": "1007", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/02/26", "支持率": {"其他": "12.6%", "蘇貞昌": "33.6%", "侯友宜": "53.8%"}}, {"有效樣本": "1010", "機構": "旺旺中時", "訪問主題": "台中市長當選人", "時間": "2018/02/13", "支持率": {"林佳龍": "36.4%", "盧秀燕": "30.6%", "其他": "33.0%"}}, {"有效樣本": "1073", "機構": "美麗島電子報", "訪問主題": "台中市長當選人", "時間": "2018/02/09", "支持率": {"林佳龍": "35.8%", "盧秀燕": "37.6%", "其他": "26.6%"}}, {"有效樣本": "1073", "機構": "旺旺中時", "訪問主題": "新竹市長當選人", "時間": "2018/01/26", "支持率": {"其他": "10.6%", "林智堅": "56.4%", "許明財": "33.0%"}}, {"有效樣本": "1068", "機構": "年代電視", "訪問主題": "澎湖縣長當選人", "時間": "2018/01/18", "支持率": {"陳光復": "26.4%", "賴峰偉": "39.4%", "其他": "34.2%"}}, {"有效樣本": "1073", "機構": "年代電視", "訪問主題": "嘉義市長當選人", "時間": "2018/01/16", "支持率": {"其他": "24.7%", "涂醒哲": "22.1%", "黃敏惠": "53.2%"}}, {"有效樣本": "1070", "機構": "美麗島電子報", "訪問主題": 
"新北市長當選人", "時間": "2018/01/15", "支持率": {"其他": "16.8%", "蘇貞昌": "41.7%", "侯友宜": "41.5%"}}, {"有效樣本": "N/A", "機構": "三立電視", "訪問主題": "基隆市長當選人", "時間": "2018/01/15", "支持率": {"謝立功": "15.2%", "其他": "26.8%", "林右昌": "58.0%"}}, {"有效樣本": "1069", "機構": "年代電視", "訪問主題": "新竹縣長當選人", "時間": "2018/01/04", "支持率": {"徐欣瑩": "32.2%", "楊文科": "23.6%", "其他": "31.6%", "鄭朝方": "12.6%"}}, {"有效樣本": "1069", "機構": "年代電視", "訪問主題": "彰化縣長當選人", "時間": "2018/01/04", "支持率": {"魏明谷": "26.3%", "其他": "37.8%", "王惠美": "35.9%"}}, {"有效樣本": "1070", "機構": "年代電視", "訪問主題": "台中市長當選人", "時間": "2017/12/28", "支持率": {"林佳龍": "36.1%", "盧秀燕": "35.1%", "其他": "28.8%"}}, {"有效樣本": "1070", "機構": "年代電視", "訪問主題": "宜蘭縣長當選人", "時間": "2017/12/27", "支持率": {"林姿妙": "45.9%", "陳歐珀": "23.4%", "其他": "30.7%"}}, {"有效樣本": "1068", "機構": "風傳媒", "訪問主題": "桃園市長當選人", "時間": "2017/12/13", "支持率": {"楊麗環": "4.8%", "陳學聖": "4.7%", "其他": "36.9%", "鄭文燦": "53.6%"}}, {"有效樣本": "897", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2017/12/06", "支持率": {"其他": "13.0%", "蘇貞昌": "39.0%", "侯友宜": "48.0%"}}, {"有效樣本": "1081", "機構": "旺旺中時", "訪問主題": "新竹市長當選人", "時間": "2017/11/14", "支持率": {"其他": "20.3%", "林智堅": "59.9%", "許明財": "19.8%"}}, {"有效樣本": "1071", "機構": "美麗島電子報", "訪問主題": "台中市長當選人", "時間": "2017/11/10", "支持率": {"林佳龍": "37.4%", "盧秀燕": "33.1%", "其他": "29.5%"}}, {"有效樣本": "890", "機構": "TVBS", "訪問主題": "高雄市長當選人", "時間": "2017/08/15", "支持率": {"韓國瑜": "31.0%", "陳其邁": "59.0%", "其他": "10.0%"}}, {"有效樣本": "811", "機構": "TVBS", "訪問主題": "宜蘭縣長當選人", "時間": "2017/07/19", "支持率": {"林姿妙": "57.0%", "陳歐珀": "23.0%", "其他": "20.0%"}}] -------------------------------------------------------------------------------- /文本分析及語料庫統計/ipynb/2. Descriptive Statistics.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "# 2. 
Descriptive Statistics\n", 12 | "## Outline\n", 13 | "* [Frequency](#frequency)\n", 14 | "* [Measures of central tendency](#measuresOfCentralTendency)\n", 15 | "* [Measures of dispersion](#measuresOfDispersion)\n", 16 | "* [Normalization and Standardization](#normalizationAndStandardization)\n", 17 | "* [Coefficients of correlation](#coefficientsOfCorrelation)" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": { 23 | "slideshow": { 24 | "slide_type": "slide" 25 | } 26 | }, 27 | "source": [ 28 | "## Frequency\n", 29 | "\n", 30 | "We often need to know how frequently values occur in the data. `.value_counts()` counts how many times each value appears in a column." 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 1, 36 | "metadata": { 37 | "slideshow": { 38 | "slide_type": "fragment" 39 | } 40 | }, 41 | "outputs": [ 42 | { 43 | "data": { 44 | "text/html": [ 45 | "
" 114 | ], 115 | "text/plain": [ 116 | " title \\\n", 117 | "0 「把25年前韓國瑜打人事件當英雄看」陳水扁批:吳敦義「災難政治學」的表現 \n", 118 | "1 【Yahoo論壇/林青弘】柯文哲是否一再說謊? \n", 119 | "2 【Yahoo論壇】民進黨誰最怕陳其邁落選? \n", 120 | "3 抽中籤王 韓國瑜車隊掃街 民眾路邊紛比讚 \n", 121 | "4 百年土地公上香祈福 陳學聖提五不原則 \n", 122 | "\n", 123 | " content \\\n", 124 | "0 國民黨主席吳敦義日前提到高雄市長候選人韓國瑜過去打陳水扁,表示「很認同跟敬佩」並形容「允文允... \n", 125 | "1 柯文哲市長在台北市北投區七星公園造勢,行動競選總部的大卡車開進公園,違規臨停。競辦被開罰六張... \n", 126 | "2 讀者投書:廖念漢(現任奇策盟文宣部主任、曾任海巡署專聘講師)\\n 《長平之戰》是戰國時代最戲... \n", 127 | "3 國民黨高雄市長候選人韓國瑜聲勢上漲,又抽中一號籤王,心情相當興奮,立即展開掃街拜,經過的地方... \n", 128 | "4 【綜合報導】普悠瑪列車出軌意外舉國震驚如同國難,令社會大眾、競選團隊及陳學聖本人都感到十分沉... \n", 129 | "\n", 130 | " time provider \\\n", 131 | "0 2018-10-22 12:16:02+08:00 風傳媒 \n", 132 | "1 2018-10-22 14:00:26+08:00 林青弘 \n", 133 | "2 2018-10-22 13:57:44+08:00 讀者投書 \n", 134 | "3 2018-10-22 13:32:00+08:00 EBC東森新聞 \n", 135 | "4 2018-10-22 13:17:44+08:00 民眾日報 \n", 136 | "\n", 137 | " url \n", 138 | "0 https://tw.news.yahoo.com/把25年前韓國瑜打人事件當英雄看-陳水扁... \n", 139 | "1 https://tw.news.yahoo.com/【yahoo論壇%EF%BC%8F林青弘... \n", 140 | "2 https://tw.news.yahoo.com/【yahoo論壇】民進黨誰最怕陳其邁落選... \n", 141 | "3 https://tw.news.yahoo.com/抽中籤王-韓國瑜車隊掃街-民眾路邊紛比讚... \n", 142 | "4 https://tw.news.yahoo.com/百年土地公上香祈福-陳學聖提五不原則-0... 
" 143 | ] 144 | }, 145 | "execution_count": 1, 146 | "metadata": {}, 147 | "output_type": "execute_result" 148 | } 149 | ], 150 | "source": [ 151 | "import pandas as pd\n", 152 | "from pathlib import Path\n", 153 | "data_folder = Path(\"../data/\")\n", 154 | "\n", 155 | "news = pd.read_csv(data_folder / \"news.csv\")\n", 156 | "news.head()" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 2, 162 | "metadata": { 163 | "slideshow": { 164 | "slide_type": "subslide" 165 | } 166 | }, 167 | "outputs": [ 168 | { 169 | "data": { 170 | "text/plain": [ 171 | "中央社 14\n", 172 | "聯合新聞網 13\n", 173 | "今日新聞NOWnews 11\n", 174 | "新頭殼 11\n", 175 | "風傳媒 11\n", 176 | "三立新聞網 setn.com 10\n", 177 | "民眾日報 10\n", 178 | "TVBS新聞網 9\n", 179 | "民視 7\n", 180 | "台灣好新聞報 6\n", 181 | "EBC東森新聞 6\n", 182 | "華視 3\n", 183 | "中華日報 2\n", 184 | "信傳媒 2\n", 185 | "讀者投書 1\n", 186 | "壹電視影音 1\n", 187 | "林青弘 1\n", 188 | "上報 1\n", 189 | "詹為元 1\n", 190 | "Name: provider, dtype: int64" 191 | ] 192 | }, 193 | "execution_count": 2, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "news['provider'].value_counts()" 200 | ] 201 | }, 202 | { 203 | "cell_type": "code", 204 | "execution_count": 3, 205 | "metadata": { 206 | "slideshow": { 207 | "slide_type": "subslide" 208 | } 209 | }, 210 | "outputs": [ 211 | { 212 | "data": { 213 | "text/plain": [ 214 | "False 84\n", 215 | "True 36\n", 216 | "Name: 柯文哲, dtype: int64" 217 | ] 218 | }, 219 | "execution_count": 3, 220 | "metadata": {}, 221 | "output_type": "execute_result" 222 | } 223 | ], 224 | "source": [ 225 | "word = '柯文哲' \n", 226 | "news[word] = [word in text for text in news.content]\n", 227 | "news[word].value_counts()" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 4, 233 | "metadata": { 234 | "slideshow": { 235 | "slide_type": "fragment" 236 | } 237 | }, 238 | "outputs": [ 239 | { 240 | "data": { 241 | "text/html": [ 242 | "
" 283 | ], 284 | "text/plain": [ 285 | "姚文智 False True \n", 286 | "柯文哲 \n", 287 | "False 76 8\n", 288 | "True 12 24" 289 | ] 290 | }, 291 | "execution_count": 4, 292 | "metadata": {}, 293 | "output_type": "execute_result" 294 | } 295 | ], 296 | "source": [ 297 | "word = '姚文智' \n", 298 | "news[word] = [word in text for text in news.content]\n", 299 | "pd.crosstab(news[\"柯文哲\"], news[\"姚文智\"])" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 5, 305 | "metadata": { 306 | "slideshow": { 307 | "slide_type": "subslide" 308 | } 309 | }, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "text/plain": [ 314 | "0 42\n", 315 | "1 29\n", 316 | "2 17\n", 317 | "3 12\n", 318 | "5 7\n", 319 | "4 5\n", 320 | "6 3\n", 321 | "9 2\n", 322 | "11 1\n", 323 | "8 1\n", 324 | "7 1\n", 325 | "Name: 民進黨, dtype: int64" 326 | ] 327 | }, 328 | "execution_count": 5, 329 | "metadata": {}, 330 | "output_type": "execute_result" 331 | } 332 | ], 333 | "source": [ 334 | "word = '民進黨'\n", 335 | "news[word] = [text.count(word) for text in news.content]\n", 336 | "news[word].value_counts()" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": { 342 | "slideshow": { 343 | "slide_type": "slide" 344 | } 345 | }, 346 | "source": [ 347 | "## Measures of central tendency
\n", 348 | "Use `.mode()` for the mode, `.median()` for the median, and `.mean()` for the mean." 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 6, 354 | "metadata": { 355 | "slideshow": { 356 | "slide_type": "fragment" 357 | } 358 | }, 359 | "outputs": [ 360 | { 361 | "data": { 362 | "text/plain": [ 363 | "0 中央社\n", 364 | "dtype: object" 365 | ] 366 | }, 367 | "execution_count": 6, 368 | "metadata": {}, 369 | "output_type": "execute_result" 370 | } 371 | ], 372 | "source": [ 373 | "# mode\n", 374 | "news['provider'].mode()" 375 | ] 376 | }, 377 | { 378 | "cell_type": "code", 379 | "execution_count": 7, 380 | "metadata": { 381 | "slideshow": { 382 | "slide_type": "subslide" 383 | } 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "# length of each article, in characters\n", 388 | "news['length'] = news['content'].apply(len)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": 8, 394 | "metadata": { 395 | "slideshow": { 396 | "slide_type": "fragment" 397 | } 398 | }, 399 | "outputs": [ 400 | { 401 | "data": { 402 | "text/plain": [ 403 | "680.5" 404 | ] 405 | }, 406 | "execution_count": 8, 407 | "metadata": {}, 408 | "output_type": "execute_result" 409 | } 410 | ], 411 | "source": [ 412 | "# median\n", 413 | "news['length'].median()" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": 9, 419 | "metadata": { 420 | "scrolled": true, 421 | "slideshow": { 422 | "slide_type": "fragment" 423 | } 424 | }, 425 | "outputs": [ 426 | { 427 | "data": { 428 | "text/plain": [ 429 | "699.4833333333333" 430 | ] 431 | }, 432 | "execution_count": 9, 433 | "metadata": {}, 434 | "output_type": "execute_result" 435 | } 436 | ], 437 | "source": [ 438 | "# mean\n", 439 | "news['length'].mean()" 440 | ] 441 | }, 442 | { 443 | "cell_type": "markdown", 444 | "metadata": { 445 | "slideshow": { 446 | "slide_type": "slide" 447 | } 448 | }, 449 | "source": [ 450 | "### Measures of dispersion\n", 451 | "Use `.max()` for the maximum and `.min()` for the minimum; their difference is the range. \n", 452 | 
"Use `.quantile()` for quantiles, `.std()` for the standard deviation, and `.var()` for the variance (pandas computes the sample estimates, `ddof=1`, by default). \n", 453 | "`.describe()` summarizes the numeric columns of a DataFrame: count, mean, standard deviation, minimum, maximum, median, and quartiles." 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": 10, 459 | "metadata": { 460 | "slideshow": { 461 | "slide_type": "fragment" 462 | } 463 | }, 464 | "outputs": [ 465 | { 466 | "data": { 467 | "text/plain": [ 468 | "1885" 469 | ] 470 | }, 471 | "execution_count": 10, 472 | "metadata": {}, 473 | "output_type": "execute_result" 474 | } 475 | ], 476 | "source": [ 477 | "# range\n", 478 | "news.length.max() - news.length.min()" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 11, 484 | "metadata": { 485 | "slideshow": { 486 | "slide_type": "fragment" 487 | } 488 | }, 489 | "outputs": [ 490 | { 491 | "data": { 492 | "text/plain": [ 493 | "526.5" 494 | ] 495 | }, 496 | "execution_count": 11, 497 | "metadata": {}, 498 | "output_type": "execute_result" 499 | } 500 | ], 501 | "source": [ 502 | "# first quartile (25th percentile)\n", 503 | "news.length.quantile(0.25)" 504 | ] 505 | }, 506 | { 507 | "cell_type": "code", 508 | "execution_count": 12, 509 | "metadata": { 510 | "slideshow": { 511 | "slide_type": "subslide" 512 | } 513 | }, 514 | "outputs": [ 515 | { 516 | "data": { 517 | "text/plain": [ 518 | "319.96102660076804" 519 | ] 520 | }, 521 | "execution_count": 12, 522 | "metadata": {}, 523 | "output_type": "execute_result" 524 | } 525 | ], 526 | "source": [ 527 | "# Standard deviation\n", 528 | "news.length.std()" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": 13, 534 | "metadata": { 535 | "slideshow": { 536 | "slide_type": "fragment" 537 | } 538 | }, 539 | "outputs": [ 540 | { 541 | "data": { 542 | "text/plain": [ 543 | "102375.0585434174" 544 | ] 545 | }, 546 | "execution_count": 13, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "# Variance\n", 553 | "news.length.var()" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 
558 | "execution_count": 14, 559 | "metadata": { 560 | "slideshow": { 561 | "slide_type": "fragment" 562 | } 563 | }, 564 | "outputs": [ 565 | { 566 | "data": { 567 | "text/plain": [ 568 | "102375.05854341738" 569 | ] 570 | }, 571 | "execution_count": 14, 572 | "metadata": {}, 573 | "output_type": "execute_result" 574 | } 575 | ], 576 | "source": [ 577 | "news.length.std() ** 2" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 15, 583 | "metadata": { 584 | "slideshow": { 585 | "slide_type": "subslide" 586 | } 587 | }, 588 | "outputs": [ 589 | { 590 | "data": { 591 | "text/html": [ 592 | "
" 658 | ], 659 | "text/plain": [ 660 | " 民進黨 length\n", 661 | "count 120.000000 120.000000\n", 662 | "mean 1.800000 699.483333\n", 663 | "std 2.198548 319.961027\n", 664 | "min 0.000000 63.000000\n", 665 | "25% 0.000000 526.500000\n", 666 | "50% 1.000000 680.500000\n", 667 | "75% 3.000000 832.000000\n", 668 | "max 11.000000 1948.000000" 669 | ] 670 | }, 671 | "execution_count": 15, 672 | "metadata": {}, 673 | "output_type": "execute_result" 674 | } 675 | ], 676 | "source": [ 677 | "news.describe()" 678 | ] 679 | }, 680 | { 681 | "cell_type": "markdown", 682 | "metadata": { 683 | "slideshow": { 684 | "slide_type": "slide" 685 | } 686 | }, 687 | "source": [ 688 | "### Normalization and Standardization
\n", 689 | "Before building a model, the data are usually rescaled; the two common methods are as follows. \n", 690 | "Normalization: \n", 691 | "$ x_{\\text{norm}} = (x-x_{\\text{min}}) / (x_{\\text{max}} - x_{\\text{min}}) $ \n", 692 | "The values $x_{\\text{norm}}$ lie between 0 and 1.\n", 693 | "\n", 694 | "Standardization: \n", 695 | "$ x_{\\text{std}} = (x-\\mu) / \\sigma $ \n", 696 | "The values $x_{\\text{std}}$ have mean 0 and standard deviation 1.\n" 697 | ] 698 | }, 699 | { 700 | "cell_type": "code", 701 | "execution_count": 16, 702 | "metadata": { 703 | "slideshow": { 704 | "slide_type": "fragment" 705 | } 706 | }, 707 | "outputs": [], 708 | "source": [ 709 | "news['length_norm'] = (news.length - news.length.min())/(news.length.max() - news.length.min())  # min-max scaling to [0, 1]\n", 710 | "news['length_std'] = (news.length - news.length.mean())/news.length.std()  # z-score; pandas .std() defaults to ddof=1" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": 17, 716 | "metadata": { 717 | "scrolled": true, 718 | "slideshow": { 719 | "slide_type": "subslide" 720 | } 721 | }, 722 | "outputs": [ 723 | { 724 | "data": { 725 | "text/plain": [ 726 | "" 727 | ] 728 | }, 729 | "execution_count": 17, 730 | "metadata": {}, 731 | "output_type": "execute_result" 732 | }, 733 | { 734 | "data": { 735 | "image/png": 
"[base64 PNG data omitted: histogram of news['length_norm']]\n", 736 | "text/plain": [ 737 | "
" 738 | ] 739 | }, 740 | "metadata": { 741 | "needs_background": "light" 742 | }, 743 | "output_type": "display_data" 744 | } 745 | ], 746 | "source": [ 747 | "%matplotlib inline\n", 748 | "\n", 749 | "news['length_norm'].hist()" 750 | ] 751 | }, 752 | { 753 | "cell_type": "code", 754 | "execution_count": 18, 755 | "metadata": { 756 | "slideshow": { 757 | "slide_type": "subslide" 758 | } 759 | }, 760 | "outputs": [ 761 | { 762 | "data": { 763 | "text/plain": [ 764 | "" 765 | ] 766 | }, 767 | "execution_count": 18, 768 | "metadata": {}, 769 | "output_type": "execute_result" 770 | }, 771 | { 772 | "data": { 773 | "image/png": "[base64 PNG data omitted: histogram of news['length_std']]\n", 774 | "text/plain": [ 775 | "
" 776 | ] 777 | }, 778 | "metadata": { 779 | "needs_background": "light" 780 | }, 781 | "output_type": "display_data" 782 | } 783 | ], 784 | "source": [ 785 | "news['length_std'].hist()" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": 19, 791 | "metadata": { 792 | "scrolled": true, 793 | "slideshow": { 794 | "slide_type": "subslide" 795 | } 796 | }, 797 | "outputs": [ 798 | { 799 | "data": { 800 | "text/plain": [ 801 | "" 802 | ] 803 | }, 804 | "execution_count": 19, 805 | "metadata": {}, 806 | "output_type": "execute_result" 807 | }, 808 | { 809 | "data": { 810 | "image/png": "[base64 PNG data omitted: histogram of news['length']]\n", 811 | "text/plain": [ 812 | "
" 813 | ] 814 | }, 815 | "metadata": { 816 | "needs_background": "light" 817 | }, 818 | "output_type": "display_data" 819 | } 820 | ], 821 | "source": [ 822 | "news['length'].hist()" 823 | ] 824 | }, 825 | { 826 | "cell_type": "markdown", 827 | "metadata": { 828 | "slideshow": { 829 | "slide_type": "slide" 830 | } 831 | }, 832 | "source": [ 833 | "### Coefficients of correlation\n", 834 | "\n", 835 | "Use `.corr()` to compute the pairwise correlation coefficients between columns (Pearson by default; Kendall and Spearman are also available through the `method` parameter)." 836 | ] 837 | }, 838 | { 839 | "cell_type": "code", 840 | "execution_count": 20, 841 | "metadata": { 842 | "slideshow": { 843 | "slide_type": "fragment" 844 | } 845 | }, 846 | "outputs": [ 847 | { 848 | "data": { 849 | "text/html": [ 850 | "
" 895 | ], 896 | "text/plain": [ 897 | " 柯文哲 姚文智 民進黨\n", 898 | "柯文哲 1.000000 0.592157 -0.031563\n", 899 | "姚文智 0.592157 1.000000 0.227232\n", 900 | "民進黨 -0.031563 0.227232 1.000000" 901 | ] 902 | }, 903 | "execution_count": 20, 904 | "metadata": {}, 905 | "output_type": "execute_result" 906 | } 907 | ], 908 | "source": [ 909 | "news.loc[:,['柯文哲','姚文智','民進黨']].corr()" 910 | ] 911 | } 912 | ], 913 | "metadata": { 914 | "anaconda-cloud": {}, 915 | "celltoolbar": "Slideshow", 916 | "kernelspec": { 917 | "display_name": "Python 3", 918 | "language": "python", 919 | "name": "python3" 920 | }, 921 | "language_info": { 922 | "codemirror_mode": { 923 | "name": "ipython", 924 | "version": 3 925 | }, 926 | "file_extension": ".py", 927 | "mimetype": "text/x-python", 928 | "name": "python", 929 | "nbconvert_exporter": "python", 930 | "pygments_lexer": "ipython3", 931 | "version": "3.7.0" 932 | } 933 | }, 934 | "nbformat": 4, 935 | "nbformat_minor": 1 936 | } 937 | -------------------------------------------------------------------------------- /文本分析及語料庫統計/ipynb/3. Models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "slideshow": { 7 | "slide_type": "slide" 8 | } 9 | }, 10 | "source": [ 11 | "# 3. 
Models\n", 12 | "## Outline\n", 13 | "\n", 14 | "* [Preprocessing](#Preprocessing)\n", 15 | " * [Segmentation](#Segmentation)\n", 16 | " * [tf-idf](#tf-idf)\n", 17 | "* [Linear Models](#LinearModels)\n", 18 | "* [Binary Logistic Regression Models](#BinaryLogisticRegressionModels)\n", 19 | " * [Cross Validation](#CrossValidation)\n", 20 | "* [Exercises and Solutions](#ExercisesAndSolutions)" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "slideshow": { 27 | "slide_type": "notes" 28 | } 29 | }, 30 | "source": [ 31 | "This section shows how to build models: first a linear model for predicting numeric values, then a logistic regression classifier. It also introduces the scikit-learn package." 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": { 37 | "slideshow": { 38 | "slide_type": "slide" 39 | } 40 | }, 41 | "source": [ 42 | "## Preprocessing\n", 43 | "First preprocess the data: segment the news content into words and compute term frequencies." 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 1, 49 | "metadata": { 50 | "scrolled": false, 51 | "slideshow": { 52 | "slide_type": "fragment" 53 | } 54 | }, 55 | "outputs": [ 56 | { 57 | "data": { 58 | "text/html": [ 59 | "
" 128 | ], 129 | "text/plain": [ 130 | " title \\\n", 131 | "0 「把25年前韓國瑜打人事件當英雄看」陳水扁批:吳敦義「災難政治學」的表現 \n", 132 | "1 【Yahoo論壇/林青弘】柯文哲是否一再說謊? \n", 133 | "2 【Yahoo論壇】民進黨誰最怕陳其邁落選? \n", 134 | "3 抽中籤王 韓國瑜車隊掃街 民眾路邊紛比讚 \n", 135 | "4 百年土地公上香祈福 陳學聖提五不原則 \n", 136 | "\n", 137 | " content \\\n", 138 | "0 國民黨主席吳敦義日前提到高雄市長候選人韓國瑜過去打陳水扁,表示「很認同跟敬佩」並形容「允文允... \n", 139 | "1 柯文哲市長在台北市北投區七星公園造勢,行動競選總部的大卡車開進公園,違規臨停。競辦被開罰六張... \n", 140 | "2 讀者投書:廖念漢(現任奇策盟文宣部主任、曾任海巡署專聘講師)\\n 《長平之戰》是戰國時代最戲... \n", 141 | "3 國民黨高雄市長候選人韓國瑜聲勢上漲,又抽中一號籤王,心情相當興奮,立即展開掃街拜,經過的地方... \n", 142 | "4 【綜合報導】普悠瑪列車出軌意外舉國震驚如同國難,令社會大眾、競選團隊及陳學聖本人都感到十分沉... \n", 143 | "\n", 144 | " time provider \\\n", 145 | "0 2018-10-22 12:16:02+08:00 風傳媒 \n", 146 | "1 2018-10-22 14:00:26+08:00 林青弘 \n", 147 | "2 2018-10-22 13:57:44+08:00 讀者投書 \n", 148 | "3 2018-10-22 13:32:00+08:00 EBC東森新聞 \n", 149 | "4 2018-10-22 13:17:44+08:00 民眾日報 \n", 150 | "\n", 151 | " url \n", 152 | "0 https://tw.news.yahoo.com/把25年前韓國瑜打人事件當英雄看-陳水扁... \n", 153 | "1 https://tw.news.yahoo.com/【yahoo論壇%EF%BC%8F林青弘... \n", 154 | "2 https://tw.news.yahoo.com/【yahoo論壇】民進黨誰最怕陳其邁落選... \n", 155 | "3 https://tw.news.yahoo.com/抽中籤王-韓國瑜車隊掃街-民眾路邊紛比讚... \n", 156 | "4 https://tw.news.yahoo.com/百年土地公上香祈福-陳學聖提五不原則-0... 
" 157 | ] 158 | }, 159 | "execution_count": 1, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "import pandas as pd\n", 166 | "from pathlib import Path\n", 167 | "data_folder = Path(\"../data/\")\n", 168 | "\n", 169 | "news = pd.read_csv(data_folder / \"news.csv\")\n", 170 | "news.head()" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 2, 176 | "metadata": { 177 | "slideshow": { 178 | "slide_type": "subslide" 179 | } 180 | }, 181 | "outputs": [ 182 | { 183 | "data": { 184 | "text/plain": [ 185 | "0 627\n", 186 | "1 1304\n", 187 | "2 1673\n", 188 | "3 503\n", 189 | "4 585\n", 190 | "Name: length, dtype: int64" 191 | ] 192 | }, 193 | "execution_count": 2, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "news['length'] = news['content'].apply(len)\n", 200 | "news['length'].head()" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": { 206 | "slideshow": { 207 | "slide_type": "subslide" 208 | } 209 | }, 210 | "source": [ 211 | "### Segmentation\n", 212 | "使用 jieba 來斷詞" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 3, 218 | "metadata": { 219 | "slideshow": { 220 | "slide_type": "fragment" 221 | } 222 | }, 223 | "outputs": [], 224 | "source": [ 225 | "import jieba" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 4, 231 | "metadata": { 232 | "scrolled": false, 233 | "slideshow": { 234 | "slide_type": "fragment" 235 | } 236 | }, 237 | "outputs": [ 238 | { 239 | "name": "stdout", 240 | "output_type": "stream", 241 | "text": [ 242 | "國民黨主席吳敦義日前提到高雄市長候選人韓國瑜過去打陳水扁,表示「很認同跟敬佩」並形容「允文允武」。前總統陳水扁今(22)日在《新勇哥物語》質疑,這是「災難政治學」的表現,反批吳敦義以為高雄市長贏定了,得意忘形的囂張之情溢於言表。\n", 243 | "《新勇哥物語》今天刊出,陳水扁借勇哥表示,不相信吳敦義會講鼓勵暴力的話,也不相信吳敦義主席會認為,當年韓國瑜公然在國會殿堂打阿扁打到住院是對的。\n", 244 | "陳水扁指出,吳敦義的發言,是「災難政治學」的表現,他批評吳敦義還真的以為他提名韓國瑜參選高雄市長贏定了,得意忘形的囂張之情溢於言表,竟把25年前的打人事件拿出來捧為文武雙全的英雄看待。「是非不明,黑白不分,不是很可怕嗎?!」\n", 245 | 
"陳水扁也在文中還原,1993年5月韓國瑜推倒他導致受傷住院,隔天有幫派份子聚集到立法院,衝突場面導致10多人受傷掛彩,韓國瑜遭到質疑找黑道兄弟助陣,風波一度越演越烈。韓國瑜後來也道歉,「我願意為我肢體衝突,向陳水扁委員致歉。」\n", 246 | "陳水扁表示,當時他是在幫榮民講話,因為他對政府的榮民就養照護問題向退輔會提出質詢,認為「不能把榮民當豬養,不是說榮民是豬」。韓國瑜聽到「豬」就抓狂,這跟扁小時候家裡養豬賣錢供給讀書,生活經驗完全不同。(推薦閱讀:普悠瑪翻車慘劇》吳敦義籲所有九合一選舉候選人暫停選舉活動)\n", 247 | "相關報導● 吳敦義因韓國瑜「打扁」才提名選高雄?段宜康酸:乾脆提名殺過人的!● 強力反擊韓國瑜 洪耀福:愛河水臭、自來水不能喝,這是吳敦義當市長時的高雄\n" 248 | ] 249 | } 250 | ], 251 | "source": [ 252 | "text = news.content[0]\n", 253 | "print(text)" 254 | ] 255 | }, 256 | { 257 | "cell_type": "code", 258 | "execution_count": 5, 259 | "metadata": { 260 | "scrolled": false, 261 | "slideshow": { 262 | "slide_type": "subslide" 263 | } 264 | }, 265 | "outputs": [ 266 | { 267 | "name": "stderr", 268 | "output_type": "stream", 269 | "text": [ 270 | "Building prefix dict from the default dictionary ...\n", 271 | "Loading model from cache /var/folders/80/gbrxkvp9687cyvtdv78_5z3h0000gn/T/jieba.cache\n", 272 | "Loading model cost 1.275 seconds.\n", 273 | "Prefix dict has been built succesfully.\n" 274 | ] 275 | }, 276 | { 277 | "name": "stdout", 278 | "output_type": "stream", 279 | "text": [ 280 | "國民黨 主席 吳敦義 日前 提到 高雄市 長 候選人 韓國瑜 過去 打陳水 扁 , 表示 「 很 認同 跟 敬佩 」 並 形容 「 允文允武 」 。 前 總統 陳 水扁 今 ( 22 ) 日 在 《 新勇哥 物語 》 質疑 , 這是 「 災難 政治 學 」 的 表現 , 反批 吳敦義以 為 高雄市 長 贏定 了 , 得意忘形 的 囂張 之 情溢 於 言表 。 \n", 281 | " 《 新勇哥 物語 》 今天 刊出 , 陳 水扁 借勇哥 表示 , 不 相信 吳敦義會 講鼓勵 暴力 的 話 , 也 不 相信 吳敦義 主席 會 認為 , 當年 韓國瑜 公然 在 國會 殿堂 打阿扁 打到 住院 是 對 的 。 \n", 282 | " 陳 水扁 指出 , 吳敦義的 發言 , 是 「 災難 政治 學 」 的 表現 , 他 批評 吳敦義還 真的 以為 他 提名 韓國瑜 參選 高雄市 長 贏定 了 , 得意忘形 的 囂張 之 情溢 於 言表 , 竟 把 25 年前 的 打人 事件 拿出 來 捧 為 文武 雙全 的 英雄 看待 。 「 是非 不明 , 黑白不分 , 不是 很 可怕 嗎 ? ! 
」 \n", 283 | " 陳 水扁 也 在 文中 還原 , 1993 年 5 月 韓國瑜 推倒 他導致 受傷 住院 , 隔天 有 幫派 份子 聚集 到 立法院 , 衝突場 面導致 10 多人 受傷 掛彩 , 韓國瑜 遭到 質疑 找 黑道 兄弟 助陣 , 風波 一度 越演 越烈 。 韓國瑜後來 也 道歉 , 「 我願意 為 我 肢體 衝突 , 向 陳 水 扁委員 致歉 。 」 \n", 284 | " 陳 水扁 表示 , 當時 他 是 在 幫榮民 講話 , 因為 他 對 政府 的 榮民 就 養照護 問題 向 退 輔會 提出 質詢 , 認為 「 不能 把 榮民當 豬養 , 不是 說榮民 是豬 」 。 韓國瑜 聽 到 「 豬 」 就 抓狂 , 這跟 扁小時 候家裡 養豬 賣 錢 供給 讀書 , 生活 經驗 完全 不同 。 ( 推薦 閱讀 : 普悠瑪 翻車 慘劇 》 吳敦義籲 所有 九 合一 選舉候 選人 暫停 選舉 活動 ) \n", 285 | " 相關 報導 ● 吳敦義因 韓國瑜 「 打扁 」 才 提名 選高雄 ? 段 宜康酸 : 乾脆 提名 殺過 人 的 ! ● 強力 反擊 韓國瑜 洪耀福 : 愛 河水 臭 、 自來 水 不能 喝 , 這是 吳敦義當 市長 時 的 高雄\n" 286 | ] 287 | } 288 | ], 289 | "source": [ 290 | "print(\" \".join(jieba.cut(text)))" 291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 6, 296 | "metadata": { 297 | "slideshow": { 298 | "slide_type": "subslide" 299 | } 300 | }, 301 | "outputs": [ 302 | { 303 | "data": { 304 | "text/plain": [ 305 | "0 國民黨 主席 吳敦義 日前 提到 高雄市 長 候選人 韓國瑜 過去 打陳水 扁 , 表示 「...\n", 306 | "1 柯文 哲市 長 在 台北市 北投 區 七星 公園 造勢 , 行動 競選 總部 的 大卡 車開...\n", 307 | "2 讀者 投書 : 廖念漢 ( 現任 奇策 盟文 宣部 主任 、 曾任 海巡 署 專聘 講師 )...\n", 308 | "3 國民黨 高雄市 長 候選人 韓國瑜 聲勢 上 漲 , 又 抽中 一號 籤 王 , 心情 相當...\n", 309 | "4 【 綜合 報導 】 普悠瑪 列車 出軌 意外 舉國震 驚 如同 國難 , 令 社會 大眾 、...\n", 310 | "Name: segmentation, dtype: object" 311 | ] 312 | }, 313 | "execution_count": 6, 314 | "metadata": {}, 315 | "output_type": "execute_result" 316 | } 317 | ], 318 | "source": [ 319 | "news['segmentation'] = news.content.apply(lambda text: \" \".join(jieba.cut(text)))\n", 320 | "news['segmentation'].head()" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": { 326 | "slideshow": { 327 | "slide_type": "subslide" 328 | } 329 | }, 330 | "source": [ 331 | "### tf-idf\n", 332 | "tf: term frequency, how often a term occurs within a single document,\n", 333 | "idf: inverse document frequency, the total number of documents divided by the number of documents that contain the term, usually taken on a log scale (TfidfVectorizer uses a smoothed logarithm by default) \n", 334 | "\n", 335 | "$\\text{tf-idf} = tf \\times idf$ \n", 336 | "\n", 337 | "For example, 「的」 may have a high term frequency within a document, but every document contains 「的」, so its idf is small and the product tf-idf is small as well, which marks the term as carrying little information" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | 
"execution_count": 7, 343 | "metadata": { 344 | "slideshow": { 345 | "slide_type": "fragment" 346 | } 347 | }, 348 | "outputs": [], 349 | "source": [ 350 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 351 | "v = TfidfVectorizer()\n", 352 | "news_tfidf = v.fit_transform(news.segmentation)" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 8, 358 | "metadata": { 359 | "scrolled": false, 360 | "slideshow": { 361 | "slide_type": "fragment" 362 | } 363 | }, 364 | "outputs": [ 365 | { 366 | "data": { 367 | "text/plain": [ 368 | "(120, 7806)" 369 | ] 370 | }, 371 | "execution_count": 8, 372 | "metadata": {}, 373 | "output_type": "execute_result" 374 | } 375 | ], 376 | "source": [ 377 | "news_tfidf.shape" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": { 383 | "slideshow": { 384 | "slide_type": "slide" 385 | } 386 | }, 387 | "source": [ 388 | "## Linear Models
\n", 389 | "Use a linear model to predict unknown numeric values" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 9, 395 | "metadata": { 396 | "slideshow": { 397 | "slide_type": "fragment" 398 | } 399 | }, 400 | "outputs": [], 401 | "source": [ 402 | "import sklearn\n", 403 | "from sklearn.model_selection import train_test_split\n", 404 | "X_train, X_test, y_train, y_test = train_test_split(\n", 405 | " news_tfidf, \n", 406 | " news[['length']],\n", 407 | " test_size=0.3, \n", 408 | " random_state=7)" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 10, 414 | "metadata": { 415 | "slideshow": { 416 | "slide_type": "subslide" 417 | } 418 | }, 419 | "outputs": [], 420 | "source": [ 421 | "from sklearn.linear_model import LinearRegression\n", 422 | "from sklearn.metrics import mean_squared_error, r2_score\n", 423 | "\n", 424 | "regr = LinearRegression()\n", 425 | "regr.fit(X_train, y_train)\n", 426 | "y_pred = regr.predict(X_test)" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": 11, 432 | "metadata": { 433 | "scrolled": true, 434 | "slideshow": { 435 | "slide_type": "subslide" 436 | } 437 | }, 438 | "outputs": [ 439 | { 440 | "name": "stdout", 441 | "output_type": "stream", 442 | "text": [ 443 | "Coefficients: \n", 444 | " [[96.85880969 -2.22508701 14.08190329 ... 0. 36.56082547\n", 445 | " 0. 
]]\n", 446 | "Mean squared error: 61652.85\n", 447 | "Variance score: 0.15\n" 448 | ] 449 | } 450 | ], 451 | "source": [ 452 | "print('Coefficients: \\n', regr.coef_)\n", 453 | "\n", 454 | "print(\"Mean squared error: %.2f\"\n", 455 | " % mean_squared_error(y_test, y_pred))\n", 456 | "\n", 457 | "# r2_score returns the coefficient of determination (R^2): 1 is perfect prediction\n", 458 | "print('Variance score: %.2f' % r2_score(y_test, y_pred))" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": { 464 | "slideshow": { 465 | "slide_type": "slide" 466 | } 467 | }, 468 | "source": [ 469 | "## Binary Logistic Regression Models \n", 470 | "Use a binary classification model to predict the class of each item" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 12, 476 | "metadata": { 477 | "slideshow": { 478 | "slide_type": "fragment" 479 | } 480 | }, 481 | "outputs": [ 482 | { 483 | "data": { 484 | "text/html": [ 485 | "
\n", 486 | "\n", 499 | "\n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | "
\n", 535 | "
" 536 | ], 537 | "text/plain": [ 538 | " content provider\n", 539 | "14 在競選活動方面,韓流在全台發威,韓國瑜從台灣頭跑到台灣尾進行輔選,他昨天一整天馬不停蹄,為高... 聯合新聞網\n", 540 | "15 「台北客家義民嘉年華」重頭戲挑擔踩街活動昨天登場,由5000多人組成的踩街隊伍,一早浩浩蕩蕩... 聯合新聞網\n", 541 | "16 市長選戰攻防激烈,台北市長柯文哲卻連連失言,前天酸「台女不化妝上街嚇人」,行動競總「開進」公... 聯合新聞網\n", 542 | "32 (中央社記者王揚宇台北21日電)民進黨台北市長參選人姚文智今天在一場論壇說,學校作為有機體,... 中央社\n", 543 | "33 (中央社記者黃麗芸台北21日電)「雙北市長青年論壇」今天登場,中國國民黨台北市長參選人丁守中... 中央社" 544 | ] 545 | }, 546 | "execution_count": 12, 547 | "metadata": {}, 548 | "output_type": "execute_result" 549 | } 550 | ], 551 | "source": [ 552 | "selected_news = news.loc[news.provider.isin(['中央社','聯合新聞網']), ['content','provider']]\n", 553 | "selected_news.head()" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 13, 559 | "metadata": { 560 | "slideshow": { 561 | "slide_type": "subslide" 562 | } 563 | }, 564 | "outputs": [], 565 | "source": [ 566 | "selected_news_tfidf = news_tfidf[selected_news.index]" 567 | ] 568 | }, 569 | { 570 | "cell_type": "code", 571 | "execution_count": 14, 572 | "metadata": { 573 | "scrolled": true, 574 | "slideshow": { 575 | "slide_type": "fragment" 576 | } 577 | }, 578 | "outputs": [], 579 | "source": [ 580 | "X_train, X_test, y_train, y_test = train_test_split(\n", 581 | " selected_news_tfidf, \n", 582 | " selected_news[['provider']],\n", 583 | " test_size=0.3, \n", 584 | " random_state=0)" 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": 15, 590 | "metadata": { 591 | "slideshow": { 592 | "slide_type": "subslide" 593 | } 594 | }, 595 | "outputs": [ 596 | { 597 | "data": { 598 | "text/plain": [ 599 | "<18x7806 sparse matrix of type ''\n", 600 | "\twith 2601 stored elements in Compressed Sparse Row format>" 601 | ] 602 | }, 603 | "execution_count": 15, 604 | "metadata": {}, 605 | "output_type": "execute_result" 606 | } 607 | ], 608 | "source": [ 609 | "X_train" 610 | ] 611 | }, 612 | { 613 | "cell_type": "code", 614 | "execution_count": 16, 615 | "metadata": { 616 | "slideshow": { 617 
| "slide_type": "fragment" 618 | } 619 | }, 620 | "outputs": [ 621 | { 622 | "data": { 623 | "text/plain": [ 624 | "<9x7806 sparse matrix of type ''\n", 625 | "\twith 1380 stored elements in Compressed Sparse Row format>" 626 | ] 627 | }, 628 | "execution_count": 16, 629 | "metadata": {}, 630 | "output_type": "execute_result" 631 | } 632 | ], 633 | "source": [ 634 | "X_test" 635 | ] 636 | }, 637 | { 638 | "cell_type": "code", 639 | "execution_count": 17, 640 | "metadata": { 641 | "scrolled": true, 642 | "slideshow": { 643 | "slide_type": "subslide" 644 | } 645 | }, 646 | "outputs": [ 647 | { 648 | "data": { 649 | "text/html": [ 650 | "
\n", 651 | "\n", 664 | "\n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | "
\n", 746 | "
" 747 | ], 748 | "text/plain": [ 749 | " provider\n", 750 | "96 聯合新聞網\n", 751 | "92 聯合新聞網\n", 752 | "15 聯合新聞網\n", 753 | "72 中央社\n", 754 | "119 中央社\n", 755 | "109 中央社\n", 756 | "58 中央社\n", 757 | "49 中央社\n", 758 | "33 中央社\n", 759 | "94 聯合新聞網\n", 760 | "71 中央社\n", 761 | "50 中央社\n", 762 | "98 聯合新聞網\n", 763 | "32 中央社\n", 764 | "14 聯合新聞網\n", 765 | "97 聯合新聞網\n", 766 | "91 聯合新聞網\n", 767 | "74 中央社" 768 | ] 769 | }, 770 | "execution_count": 17, 771 | "metadata": {}, 772 | "output_type": "execute_result" 773 | } 774 | ], 775 | "source": [ 776 | "y_train" 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": 18, 782 | "metadata": { 783 | "scrolled": true, 784 | "slideshow": { 785 | "slide_type": "subslide" 786 | } 787 | }, 788 | "outputs": [ 789 | { 790 | "data": { 791 | "text/html": [ 792 | "
\n", 793 | "\n", 806 | "\n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | "
\n", 852 | "
" 853 | ], 854 | "text/plain": [ 855 | " provider\n", 856 | "16 聯合新聞網\n", 857 | "107 中央社\n", 858 | "90 聯合新聞網\n", 859 | "93 聯合新聞網\n", 860 | "34 中央社\n", 861 | "73 中央社\n", 862 | "99 聯合新聞網\n", 863 | "77 中央社\n", 864 | "95 聯合新聞網" 865 | ] 866 | }, 867 | "execution_count": 18, 868 | "metadata": {}, 869 | "output_type": "execute_result" 870 | } 871 | ], 872 | "source": [ 873 | "y_test" 874 | ] 875 | }, 876 | { 877 | "cell_type": "code", 878 | "execution_count": 19, 879 | "metadata": { 880 | "scrolled": true, 881 | "slideshow": { 882 | "slide_type": "subslide" 883 | } 884 | }, 885 | "outputs": [ 886 | { 887 | "data": { 888 | "text/plain": [ 889 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 890 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", 891 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 892 | " verbose=0, warm_start=False)" 893 | ] 894 | }, 895 | "execution_count": 19, 896 | "metadata": {}, 897 | "output_type": "execute_result" 898 | } 899 | ], 900 | "source": [ 901 | "from sklearn.linear_model import LogisticRegression\n", 902 | "lr = LogisticRegression()\n", 903 | "lr.fit(X_train,y_train.provider.values)" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": 20, 909 | "metadata": { 910 | "slideshow": { 911 | "slide_type": "subslide" 912 | } 913 | }, 914 | "outputs": [ 915 | { 916 | "data": { 917 | "text/plain": [ 918 | "0.7777777777777778" 919 | ] 920 | }, 921 | "execution_count": 20, 922 | "metadata": {}, 923 | "output_type": "execute_result" 924 | } 925 | ], 926 | "source": [ 927 | "from sklearn.metrics import accuracy_score\n", 928 | "accuracy_score(y_test.provider.values, lr.predict(X_test))" 929 | ] 930 | }, 931 | { 932 | "cell_type": "code", 933 | "execution_count": 21, 934 | "metadata": { 935 | "scrolled": true, 936 | "slideshow": { 937 | "slide_type": "fragment" 938 | } 939 | }, 940 | "outputs": [ 941 | { 942 | "data": { 943 | "text/plain": [ 944 | 
"array(['聯合新聞網', '中央社', '聯合新聞網', '聯合新聞網', '中央社', '中央社', '聯合新聞網', '中央社',\n", 945 | " '聯合新聞網'], dtype=object)" 946 | ] 947 | }, 948 | "execution_count": 21, 949 | "metadata": {}, 950 | "output_type": "execute_result" 951 | } 952 | ], 953 | "source": [ 954 | "y_test.provider.values" 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": 22, 960 | "metadata": { 961 | "scrolled": true, 962 | "slideshow": { 963 | "slide_type": "fragment" 964 | } 965 | }, 966 | "outputs": [ 967 | { 968 | "data": { 969 | "text/plain": [ 970 | "array(['中央社', '中央社', '聯合新聞網', '中央社', '中央社', '中央社', '聯合新聞網', '中央社',\n", 971 | " '聯合新聞網'], dtype=object)" 972 | ] 973 | }, 974 | "execution_count": 22, 975 | "metadata": {}, 976 | "output_type": "execute_result" 977 | } 978 | ], 979 | "source": [ 980 | "lr.predict(X_test)" 981 | ] 982 | }, 983 | { 984 | "cell_type": "markdown", 985 | "metadata": { 986 | "slideshow": { 987 | "slide_type": "subslide" 988 | } 989 | }, 990 | "source": [ 991 | "### Cross Validation
\n", 992 | "We can use Cross Validation to evaluate a Classifier. A common approach is k-fold: split the data into k equal parts, train on k-1 of them and test on the remaining one, and repeat k times so that every part serves as the test set once. This makes full use of the data already at hand." 993 | ] 994 | }, 995 | { 996 | "cell_type": "code", 997 | "execution_count": 23, 998 | "metadata": { 999 | "scrolled": true, 1000 | "slideshow": { 1001 | "slide_type": "fragment" 1002 | } 1003 | }, 1004 | "outputs": [ 1005 | { 1006 | "name": "stdout", 1007 | "output_type": "stream", 1008 | "text": [ 1009 | "[0.5 1. 1. 0.8 1. ]\n" 1010 | ] 1011 | } 1012 | ], 1013 | "source": [ 1014 | "from sklearn.model_selection import cross_val_score\n", 1015 | "scores = cross_val_score(lr, selected_news_tfidf, selected_news.provider.values, cv=5)\n", 1016 | "print(scores)" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": 24, 1022 | "metadata": { 1023 | "slideshow": { 1024 | "slide_type": "fragment" 1025 | } 1026 | }, 1027 | "outputs": [ 1028 | { 1029 | "name": "stdout", 1030 | "output_type": "stream", 1031 | "text": [ 1032 | "Accuracy: 0.86 (+/- 0.39)\n" 1033 | ] 1034 | } 1035 | ], 1036 | "source": [ 1037 | "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))" 1038 | ] 1039 | }, 1040 | { 1041 | "cell_type": "markdown", 1042 | "metadata": { 1043 | "slideshow": { 1044 | "slide_type": "slide" 1045 | } 1046 | }, 1047 | "source": [ 1048 | "## Exercises and Solutions \n", 1049 | "\n", 1050 | "
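Before the exercises, the k-fold procedure and the accuracy summary from the Cross Validation section can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names k_fold_indices and train_test_splits are hypothetical, the folds are contiguous and unstratified (scikit-learn stratifies the folds by class when the estimator is a classifier), and the score list is copied from the cross_val_score output above.

```python
import statistics


def k_fold_indices(n_samples, k):
    # Split range(n_samples) into k contiguous folds; the first
    # n_samples % k folds receive one extra item.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(n_samples, k):
    # Yield (train, test) index lists; each fold is the test set exactly once.
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 27 documents (18 train + 9 test in the split above), 5 folds.
fold_sizes = [len(test) for _, test in train_test_splits(27, 5)]
print(fold_sizes)

# Per-fold accuracies copied from the cross_val_score output above.
scores = [0.5, 1.0, 1.0, 0.8, 1.0]
mean = sum(scores) / len(scores)
# Population standard deviation, which is what numpy's std() computes by default.
spread = 2 * statistics.pstdev(scores)
summary = 'Accuracy: %0.2f (+/- %0.2f)' % (mean, spread)
print(summary)
```

With only 27 documents each fold holds five or six items, so a single misclassified item moves a fold's accuracy by roughly 0.17 to 0.2, which is why the reported spread is so wide.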
\n", 1051 | "
1. Use the F1 score instead to evaluate the Classifier \n", 1052 | "

\n", 1053 | " \n", 1054 | "```python\n", 1055 | "from sklearn.metrics import f1_score\n", 1056 | "f1_score(y_test.provider.values, lr.predict(X_test), average='macro')\n", 1057 | "```\n", 1058 | "\n", 1059 | "

\n", 1060 | "
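To see what the macro-averaged F1 in exercise 1 measures, the score can be worked out by hand. The sketch below is plain Python rather than scikit-learn; the two label lists are copied from the y_test.provider.values and lr.predict(X_test) outputs shown earlier, and the helper names f1_for_label and macro_f1 are illustrative assumptions, not part of any library.

```python
# Labels copied from the y_test and lr.predict(X_test) outputs shown earlier.
y_true = ['聯合新聞網', '中央社', '聯合新聞網', '聯合新聞網', '中央社',
          '中央社', '聯合新聞網', '中央社', '聯合新聞網']
y_pred = ['中央社', '中央社', '聯合新聞網', '中央社', '中央社',
          '中央社', '聯合新聞網', '中央社', '聯合新聞網']


def f1_for_label(label, y_true, y_pred):
    # Per-class F1: harmonic mean of precision and recall for one label.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    if tp == 0:
        return 0.0
    precision = tp / predicted
    recall = tp / actual
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    # Macro averaging: unweighted mean of the per-class F1 scores.
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(label, y_true, y_pred) for label in labels) / len(labels)


per_class = {label: f1_for_label(label, y_true, y_pred)
             for label in sorted(set(y_true))}
score = macro_f1(y_true, y_pred)
print(per_class, round(score, 3))
```

On these nine test items the per-class scores come to 0.8 for 中央社 and 0.75 for 聯合新聞網, so the macro average is 0.775, slightly below the 0.78 accuracy because the classifier over-predicts 中央社.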
\n", 1061 | "\n", 1062 | "
2. Build a new Classifier with Multinomial Naive Bayes \n", 1063 | "

\n", 1064 | " \n", 1065 | "```python\n", 1066 | "from sklearn.naive_bayes import MultinomialNB\n", 1067 | "nb = MultinomialNB()\n", 1068 | "nb.fit(X_train,y_train.provider.values)\n", 1069 | "accuracy_score(y_test.provider.values, nb.predict(X_test))\n", 1070 | "```\n", 1071 | "

\n", 1072 | "
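Exercise 2 relies on MultinomialNB from scikit-learn. As a rough illustration of the idea behind it, the following plain-Python sketch trains a multinomial Naive Bayes model with add-one smoothing on made-up token lists. The function names train_nb and predict_nb and the toy documents are assumptions for illustration only; scikit-learn's implementation works on feature matrices and differs in detail.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: one class label per document.
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Add-alpha (Laplace) smoothing keeps unseen tokens from zeroing the product.
        log_like = {tok: math.log((token_counts[label][tok] + alpha) /
                                  (total + alpha * len(vocab)))
                    for tok in vocab}
        model[label] = (log_prior, log_like)
    return model


def predict_nb(model, doc):
    # Score each class by log prior plus summed log likelihoods;
    # tokens outside the training vocabulary are skipped.
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[tok] for tok in doc if tok in log_like)
    return max(model, key=score)


# Toy, made-up token lists (not taken from news.csv).
docs = [['election', 'mayor', 'rally'],
        ['mayor', 'vote', 'rally'],
        ['rain', 'typhoon', 'wind'],
        ['typhoon', 'rain', 'flood']]
labels = ['politics', 'politics', 'weather', 'weather']

model = train_nb(docs, labels)
print(predict_nb(model, ['mayor', 'vote']))
print(predict_nb(model, ['flood', 'wind', 'rain']))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters once documents contain hundreds of tokens as the news articles here do.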
\n" 1073 | ] 1074 | }, 1075 | { 1076 | "cell_type": "markdown", 1077 | "metadata": { 1078 | "slideshow": { 1079 | "slide_type": "notes" 1080 | } 1081 | }, 1082 | "source": [ 1083 | "## More about: \n", 1084 | "1. [An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n", 1085 | "2. [Working With Text Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)\n", 1086 | "3. [Scikit Learn User Guide](http://scikit-learn.org/stable/user_guide.html)\n" 1087 | ] 1088 | } 1089 | ], 1090 | "metadata": { 1091 | "anaconda-cloud": {}, 1092 | "celltoolbar": "Slideshow", 1093 | "kernelspec": { 1094 | "display_name": "Python 3", 1095 | "language": "python", 1096 | "name": "python3" 1097 | }, 1098 | "language_info": { 1099 | "codemirror_mode": { 1100 | "name": "ipython", 1101 | "version": 3 1102 | }, 1103 | "file_extension": ".py", 1104 | "mimetype": "text/x-python", 1105 | "name": "python", 1106 | "nbconvert_exporter": "python", 1107 | "pygments_lexer": "ipython3", 1108 | "version": "3.7.0" 1109 | } 1110 | }, 1111 | "nbformat": 4, 1112 | "nbformat_minor": 1 1113 | } 1114 | -------------------------------------------------------------------------------- /文本分析及語料庫統計/ipynb/4. Practice on news annotation.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 4. Practice on news annotation\n", 8 | "Build models from the news_annotation data produced in the previous session \n", 9 | "1. Convert the segmented cln_content column to tf-idf\n", 10 | "2. Use polarity_score to fit a Linear Model\n", 11 | "3. 
Use sentiment to fit a Logistic Regression Model" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Preprocessing" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 1, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "import pandas as pd\n", 28 | "from pathlib import Path\n", 29 | "data_folder = Path(\"../data/\")" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 2, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "news = pd.read_pickle(data_folder / 'news_annotation.p')" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "name": "stdout", 48 | "output_type": "stream", 49 | "text": [ 50 | "\n", 51 | "RangeIndex: 381 entries, 0 to 380\n", 52 | "Data columns (total 3 columns):\n", 53 | "cln_content 381 non-null object\n", 54 | "polarity_score 381 non-null int64\n", 55 | "sentiment 381 non-null object\n", 56 | "dtypes: int64(1), object(2)\n", 57 | "memory usage: 9.0+ KB\n" 58 | ] 59 | } 60 | ], 61 | "source": [ 62 | "news.info()" 63 | ] 64 | }, 65 | { 66 | "cell_type": "code", 67 | "execution_count": 4, 68 | "metadata": {}, 69 | "outputs": [ 70 | { 71 | "data": { 72 | "text/html": [ 73 | "
\n", 74 | "\n", 87 | "\n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | "
\n", 129 | "
" 130 | ], 131 | "text/plain": [ 132 | " cln_content polarity_score sentiment\n", 133 | "0 國民黨 高雄 市長 候選人 韓國瑜 昨 接受 媒體 專訪 時稱 當選 市長 後 禁止 政治 ... 15 1\n", 134 | "1 國民黨 高雄 市長 候選人 韓國瑜 今天 台北 大安區 舉辦 第二場 北漂 青年 座談 表示... 28 1\n", 135 | "2 韓流北 漂 第二場 國民黨 高雄 市長 韓國瑜 主攻 北漂 族群 昨天 舉行 北 漂族 見面... 17 1\n", 136 | "3 國民黨 高雄 市長 候選人 韓國瑜 日前 接受 專訪 時 指出 高雄 市長 所有 高雄 街頭... 12 1\n", 137 | "4 國民黨 高雄 市長 候選人 韓國瑜 接受 中時 電子報 專訪 稱 當選 市長 後 未來 政治... 5 1" 138 | ] 139 | }, 140 | "execution_count": 4, 141 | "metadata": {}, 142 | "output_type": "execute_result" 143 | } 144 | ], 145 | "source": [ 146 | "news.head()" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 5, 152 | "metadata": { 153 | "scrolled": true 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "news.sentiment = news.sentiment.astype('category')" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 6, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 167 | "v = TfidfVectorizer()\n", 168 | "news_tfidf = v.fit_transform(news.cln_content)" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 7, 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "data": { 178 | "text/plain": [ 179 | "(381, 16479)" 180 | ] 181 | }, 182 | "execution_count": 7, 183 | "metadata": {}, 184 | "output_type": "execute_result" 185 | } 186 | ], 187 | "source": [ 188 | "news_tfidf.shape" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "## Linear Model" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 8, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "import sklearn\n", 205 | "from sklearn.model_selection import train_test_split\n", 206 | "X_train, X_test, y_train, y_test = train_test_split(\n", 207 | " news_tfidf, \n", 208 | " news.polarity_score,\n", 209 | " test_size=0.3, \n", 210 | " random_state=7)" 211 | ] 212 
| }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 9, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "from sklearn.linear_model import LinearRegression\n", 220 | "from sklearn.metrics import mean_squared_error, r2_score\n", 221 | "\n", 222 | "regr = LinearRegression()\n", 223 | "regr.fit(X_train, y_train)\n", 224 | "y_pred = regr.predict(X_test)" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": 10, 230 | "metadata": {}, 231 | "outputs": [ 232 | { 233 | "name": "stdout", 234 | "output_type": "stream", 235 | "text": [ 236 | "Coefficients: \n", 237 | " [3.63835662 0.45605763 1.81917831 ... 0. 0.34342381 0.04584367]\n", 238 | "Mean squared error: 170.07\n", 239 | "Variance score: 0.24\n" 240 | ] 241 | } 242 | ], 243 | "source": [ 244 | "print('Coefficients: \\n', regr.coef_)\n", 245 | "\n", 246 | "print(\"Mean squared error: %.2f\"\n", 247 | " % mean_squared_error(y_test, y_pred))\n", 248 | "\n", 249 | "# r2_score returns the coefficient of determination (R^2): 1 is perfect prediction\n", 250 | "print('Variance score: %.2f' % r2_score(y_test, y_pred))" 251 | ] 252 | }, 253 | { 254 | "cell_type": "markdown", 255 | "metadata": {}, 256 | "source": [ 257 | "## Logistic Regression Model" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": 11, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "import sklearn\n", 267 | "from sklearn.model_selection import train_test_split\n", 268 | "\n", 269 | "X_train, X_test, y_train, y_test = train_test_split(\n", 270 | " news_tfidf, \n", 271 | " news.sentiment,\n", 272 | " test_size=0.3, \n", 273 | " random_state=0)" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 12, 279 | "metadata": { 280 | "scrolled": false 281 | }, 282 | "outputs": [ 283 | { 284 | "data": { 285 | "text/plain": [ 286 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", 287 | " intercept_scaling=1, max_iter=100, multi_class='ovr', 
n_jobs=1,\n", 288 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", 289 | " verbose=0, warm_start=False)" 290 | ] 291 | }, 292 | "execution_count": 12, 293 | "metadata": {}, 294 | "output_type": "execute_result" 295 | } 296 | ], 297 | "source": [ 298 | "from sklearn.linear_model import LogisticRegression\n", 299 | "lr = LogisticRegression()\n", 300 | "lr.fit(X_train,y_train)" 301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": 13, 306 | "metadata": {}, 307 | "outputs": [ 308 | { 309 | "data": { 310 | "text/plain": [ 311 | "0.9391304347826087" 312 | ] 313 | }, 314 | "execution_count": 13, 315 | "metadata": {}, 316 | "output_type": "execute_result" 317 | } 318 | ], 319 | "source": [ 320 | "from sklearn.metrics import accuracy_score\n", 321 | "accuracy_score(y_test, lr.predict(X_test))" 322 | ] 323 | }, 324 | { 325 | "cell_type": "code", 326 | "execution_count": 14, 327 | "metadata": {}, 328 | "outputs": [ 329 | { 330 | "name": "stdout", 331 | "output_type": "stream", 332 | "text": [ 333 | "[0.93506494 0.93506494 0.93421053 0.93421053 0.94666667]\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "from sklearn.model_selection import cross_val_score\n", 339 | "scores = cross_val_score(lr, news_tfidf, news.sentiment, cv=5)\n", 340 | "print(scores)" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 15, 346 | "metadata": {}, 347 | "outputs": [ 348 | { 349 | "name": "stdout", 350 | "output_type": "stream", 351 | "text": [ 352 | "Accuracy: 0.94 (+/- 0.01)\n" 353 | ] 354 | } 355 | ], 356 | "source": [ 357 | "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))" 358 | ] 359 | } 360 | ], 361 | "metadata": { 362 | "kernelspec": { 363 | "display_name": "Python 3", 364 | "language": "python", 365 | "name": "python3" 366 | }, 367 | "language_info": { 368 | "codemirror_mode": { 369 | "name": "ipython", 370 | "version": 3 371 | }, 372 | "file_extension": ".py", 373 | "mimetype": 
"text/x-python", 374 | "name": "python", 375 | "nbconvert_exporter": "python", 376 | "pygments_lexer": "ipython3", 377 | "version": "3.7.0" 378 | } 379 | }, 380 | "nbformat": 4, 381 | "nbformat_minor": 2 382 | } 383 | -------------------------------------------------------------------------------- /語料庫介面設計/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫介面設計/.gitkeep -------------------------------------------------------------------------------- /語料庫介面設計/README.md: -------------------------------------------------------------------------------- 1 | 語料庫介面設計 2 | 3 | --- 4 | -------------------------------------------------------------------------------- /語料庫標記及語言學分析/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫標記及語言學分析/.gitkeep -------------------------------------------------------------------------------- /語料庫標記及語言學分析/README.md: -------------------------------------------------------------------------------- 1 | 語料庫標記及語言學分析 2 | 3 | --- 4 | -------------------------------------------------------------------------------- /語料庫標記及語言學分析/practice/.ipynb_checkpoints/sentiment_annotation-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 58, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import csv\n", 12 | "import pandas as pd\n", 13 | "import re" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 59, 19 | "metadata": { 20 | "collapsed": false 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "sample_file = list(csv.reader(open('new_sample.txt', \"r\"), delimiter = 
'\\t'))" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 60, 30 | "metadata": { 31 | "collapsed": false, 32 | "scrolled": true 33 | }, 34 | "outputs": [], 35 | "source": [ 36 | "df = pd.DataFrame(sample_file, columns=[\"sample\", \"sample_segmented\", \"polarity\"])\n", 37 | "df = df.drop(['sample', 'polarity'], axis=1)\n", 38 | "# df" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 61, 44 | "metadata": { 45 | "collapsed": true 46 | }, 47 | "outputs": [], 48 | "source": [ 49 | "# NTUSD wordlist\n", 50 | "positive_words = open(\"positive_words.txt\",\"r\").read().split(\"\\n\")\n", 51 | "negative_words = open(\"negative_words.txt\",\"r\").read().split(\"\\n\")" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 62, 57 | "metadata": { 58 | "collapsed": false 59 | }, 60 | "outputs": [], 61 | "source": [ 62 | "# === 計算positive word 在每一個sample出現的count === #\n", 63 | "positive_word_score = []\n", 64 | "for text in list(df.sample_segmented):\n", 65 | " result = 0\n", 66 | " for words in positive_words:\n", 67 | " if words in text:\n", 68 | " result += 1 \n", 69 | " positive_word_score.append(result)\n", 70 | "# positive_word_score\n", 71 | "\n", 72 | "# === 計算positive pattern 在每一個sample出現的count === #\n", 73 | "positive_pattern = '還好.+(會|不會)?'\n", 74 | "positive_pattern_score = []\n", 75 | "for text in list(df.sample_segmented):\n", 76 | " positive_pattern_score.append(len(re.findall(positive_pattern,text)))\n", 77 | "# positive_pattern_score\n", 78 | "\n", 79 | "# === 將 positive word和positive pattern計算後的結果合併===#\n", 80 | "positive_score = [positive_word_score[i] + positive_pattern_score[i] for i in range(len(positive_word_score))]" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 63, 86 | "metadata": { 87 | "collapsed": false, 88 | "scrolled": true 89 | }, 90 | "outputs": [ 91 | { 92 | "data": { 93 | "text/plain": [ 94 | "[0,\n", 95 | " 0,\n", 96 | " 0,\n", 97 | " 0,\n", 98 | " 
0,\n", 99 | " 0,\n", 100 | " 0,\n", 101 | " 0,\n", 102 | " -1,\n", 103 | " 0,\n", 104 | " 0,\n", 105 | " 0,\n", 106 | " 0,\n", 107 | " -2,\n", 108 | " 0,\n", 109 | " -2,\n", 110 | " 0,\n", 111 | " 0,\n", 112 | " 0,\n", 113 | " -1,\n", 114 | " 0,\n", 115 | " 0,\n", 116 | " -1,\n", 117 | " -1,\n", 118 | " 0,\n", 119 | " 0,\n", 120 | " 0,\n", 121 | " 0,\n", 122 | " -1,\n", 123 | " -1,\n", 124 | " 0,\n", 125 | " 0,\n", 126 | " 0,\n", 127 | " 0,\n", 128 | " 0,\n", 129 | " 0,\n", 130 | " 0,\n", 131 | " 0,\n", 132 | " 0,\n", 133 | " -1,\n", 134 | " 0,\n", 135 | " 0,\n", 136 | " 0,\n", 137 | " 0,\n", 138 | " 0,\n", 139 | " 0,\n", 140 | " 0,\n", 141 | " 0,\n", 142 | " 0,\n", 143 | " 0,\n", 144 | " 0,\n", 145 | " 0,\n", 146 | " -2,\n", 147 | " -2,\n", 148 | " 0,\n", 149 | " -4,\n", 150 | " 0,\n", 151 | " 0,\n", 152 | " 0,\n", 153 | " 0,\n", 154 | " -1,\n", 155 | " -1,\n", 156 | " 0,\n", 157 | " -2,\n", 158 | " 0,\n", 159 | " -1,\n", 160 | " 0,\n", 161 | " 0,\n", 162 | " -2,\n", 163 | " 0,\n", 164 | " 0,\n", 165 | " 0,\n", 166 | " -1,\n", 167 | " -2,\n", 168 | " 0,\n", 169 | " -1,\n", 170 | " 0,\n", 171 | " 0,\n", 172 | " -1,\n", 173 | " 0,\n", 174 | " 0,\n", 175 | " 0,\n", 176 | " -3,\n", 177 | " 0,\n", 178 | " 0,\n", 179 | " 0,\n", 180 | " 0,\n", 181 | " 0,\n", 182 | " 0,\n", 183 | " 0,\n", 184 | " 0,\n", 185 | " 0,\n", 186 | " 0,\n", 187 | " 0,\n", 188 | " 0,\n", 189 | " 0,\n", 190 | " 0]" 191 | ] 192 | }, 193 | "execution_count": 63, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "# === 計算negative word 在每一個sample出現的count === #\n", 200 | "negative_word_score = []\n", 201 | "for text in list(df.sample_segmented):\n", 202 | " result = 0\n", 203 | " for words in negative_words:\n", 204 | " if words in text:\n", 205 | " result -= 1 \n", 206 | " negative_word_score.append(result)\n", 207 | "# negative_word_score\n", 208 | "\n", 209 | "# === 計算negative pattern 在每一個sample出現的count === #\n", 210 | "negative_pattern = 
r'都.*了.*還.*|連.+都.+|結果.+都'\n", 211 | "negative_pattern_score = []\n", 212 | "for text in list(df.sample_segmented):\n", 213 | " negative_pattern_score.append(len(re.findall(negative_pattern,text))*-1)\n", 214 | "# negative_pattern_score\n", 215 | "\n", 216 | "# === 將 negative word和 negative pattern計算後的結果合併===#\n", 217 | "negative_score = [negative_word_score[i] + negative_pattern_score[i] for i in range(len(negative_word_score))]\n", 218 | "negative_score\n" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 64, 224 | "metadata": { 225 | "collapsed": false 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "# df['positive_score'] = positive_score\n", 230 | "# df['negative_score'] = negative_score\n", 231 | "df['polarity_score'] = [positive_score[i] + negative_score[i] for i in range(len(positive_score))]\n" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 65, 237 | "metadata": { 238 | "collapsed": false 239 | }, 240 | "outputs": [], 241 | "source": [ 242 | "df.loc[df.polarity_score > 0, 'sentiment'] = '1' \n", 243 | "df.loc[df.polarity_score < 0, 'sentiment'] = '-1' \n", 244 | "df.loc[df.polarity_score == 0, 'sentiment'] = '0' " 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 66, 250 | "metadata": { 251 | "collapsed": false 252 | }, 253 | "outputs": [], 254 | "source": [ 255 | "# df = df.drop(['polarity_score'], axis=1)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": 67, 261 | "metadata": { 262 | "collapsed": false 263 | }, 264 | "outputs": [ 265 | { 266 | "data": { 267 | "text/html": [ 268 | "
" 662 | ], 663 | "text/plain": [ 664 | " sample_segmented polarity_score \\\n", 665 | "0 有趣 中肯 溫馨 3 \n", 666 | "1 五樓 耶 , 好巧 喔 0 \n", 667 | "2 結束 了 才 要 去 你 是 想到 ㄛ 妳 0 \n", 668 | "3 年度 最 中肯 ! ! 1 \n", 669 | "4 好 好玩 阿 1 \n", 670 | "5 有趣 中肯 溫馨 3 \n", 671 | "6 年度 最 中肯 ! ! 1 \n", 672 | "7 生日快樂 3 \n", 673 | "8 我 還沒 去 .... -1 \n", 674 | "9 有趣 中肯 溫馨 3 \n", 675 | "10 早睡早起 精神 好 0 \n", 676 | "11 太棒 了 ! ! 1 \n", 677 | "12 有趣 中肯 溫馨 3 \n", 678 | "13 恐怖 金紙 鋪 -2 \n", 679 | "14 年度 最 中肯 ! ! 1 \n", 680 | "15 睿 真的 有夠 生氣 -2 \n", 681 | "16 嘖 有 看到 花博 了 還嫌 哩 0 \n", 682 | "17 爭 豔 館 那 紫色 讚 ! ! 1 \n", 683 | "18 有趣 中肯 溫馨 3 \n", 684 | "19 我 可 是 去 走 了 數小時 一個館 都沒 逛 呢 -1 \n", 685 | "20 美女 晚安 安 ~ ~ 0 \n", 686 | "21 美女 晚安 安 ~ ~ ~ 0 \n", 687 | "22 我 還沒 去過 -1 \n", 688 | "23 可憐 你 了 -1 \n", 689 | "24 我 不 怎麼 喜歡 報告 的 說 3 \n", 690 | "25 真是太 厲害 囉 ~ 1 \n", 691 | "26 有趣 中肯 溫馨 3 \n", 692 | "27 假日 平安 ! 順心 ! 1 \n", 693 | "28 好險 我 都 逛 完了 ~ 乎 -1 \n", 694 | "29 買 一送 一 沒了 -1 \n", 695 | ".. ... ... \n", 696 | "67 有趣 中肯 溫馨 3 \n", 697 | "68 結果 說 要 去 都 沒去 成 -2 \n", 698 | "69 早上好 ~ 1 \n", 699 | "70 去 了 兩次 .. 也 只 看 了 一區 半 0 \n", 700 | "71 看來 你 順利 回到 家 了 1 \n", 701 | "72 上次 跟 我媽 他們 去 六點 多 就 起床 .. 還沒 拿到 夢想 館 的 票 , 只能... 0 \n", 702 | "73 可惜 的 是 這次 沒能 跟 幸茵 好好 坐下 來 吃 頓 美食 ... 幸茵 就 走 了 -2 \n", 703 | "74 她 真的 很乖 0 \n", 704 | "75 那 它 分給 別人 不 就 一堆 人 進來 嘖嘖 真 不好 -1 \n", 705 | "76 我 沒 有 排耶 XDD 真 幸運 1 \n", 706 | "77 有趣 中肯 溫馨 3 \n", 707 | "78 你 都 去 幾次 了 。 我 半次 都 沒去 過 -1 \n", 708 | "79 好聽 耶 1 \n", 709 | "80 五月 之 後 就會 有館 重新開放 了 ~ 到 時候 可以 再 慢慢 晃 4 \n", 710 | "81 真的 ! ! ! 今天 好熱 ! ! ! 0 \n", 711 | "82 對 啊 ~ ~ ~ 我 去 那 一天 剛好 是 人數 最多 的 一天 .... 啥 館 都... -2 \n", 712 | "83 有趣 中肯 溫馨 3 \n", 713 | "84 我 還有 好 幾 ㄍ 館 沒 看 ... 1 \n", 714 | "85 Have a Nice Day ~ 0 \n", 715 | "86 Good morning ~ 0 \n", 716 | "87 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天... 1 \n", 717 | "88 有趣 中肯 溫馨 3 \n", 718 | "89 早上好 1 \n", 719 | "90 午安 0 \n", 720 | "91 也 為 開始 新 的 一歲 倒數 ~ ~ ~ Happy birthday 喔 ~... 
0 \n", 721 | "92 哀 噁 0 \n", 722 | "93 有趣 中肯 溫馨 3 \n", 723 | "94 me too 哦 鼻涕 擤 不庭 加上 喉嚨 痛 0 \n", 724 | "95 不錯 玩 1 \n", 725 | "96 別去 都 是 人 .. 0 \n", 726 | "\n", 727 | " sentiment \n", 728 | "0 1 \n", 729 | "1 0 \n", 730 | "2 0 \n", 731 | "3 1 \n", 732 | "4 1 \n", 733 | "5 1 \n", 734 | "6 1 \n", 735 | "7 1 \n", 736 | "8 -1 \n", 737 | "9 1 \n", 738 | "10 0 \n", 739 | "11 1 \n", 740 | "12 1 \n", 741 | "13 -1 \n", 742 | "14 1 \n", 743 | "15 -1 \n", 744 | "16 0 \n", 745 | "17 1 \n", 746 | "18 1 \n", 747 | "19 -1 \n", 748 | "20 0 \n", 749 | "21 0 \n", 750 | "22 -1 \n", 751 | "23 -1 \n", 752 | "24 1 \n", 753 | "25 1 \n", 754 | "26 1 \n", 755 | "27 1 \n", 756 | "28 -1 \n", 757 | "29 -1 \n", 758 | ".. ... \n", 759 | "67 1 \n", 760 | "68 -1 \n", 761 | "69 1 \n", 762 | "70 0 \n", 763 | "71 1 \n", 764 | "72 0 \n", 765 | "73 -1 \n", 766 | "74 0 \n", 767 | "75 -1 \n", 768 | "76 1 \n", 769 | "77 1 \n", 770 | "78 -1 \n", 771 | "79 1 \n", 772 | "80 1 \n", 773 | "81 0 \n", 774 | "82 -1 \n", 775 | "83 1 \n", 776 | "84 1 \n", 777 | "85 0 \n", 778 | "86 0 \n", 779 | "87 1 \n", 780 | "88 1 \n", 781 | "89 1 \n", 782 | "90 0 \n", 783 | "91 0 \n", 784 | "92 0 \n", 785 | "93 1 \n", 786 | "94 0 \n", 787 | "95 1 \n", 788 | "96 0 \n", 789 | "\n", 790 | "[97 rows x 3 columns]" 791 | ] 792 | }, 793 | "execution_count": 67, 794 | "metadata": {}, 795 | "output_type": "execute_result" 796 | } 797 | ], 798 | "source": [ 799 | "df" 800 | ] 801 | }, 802 | { 803 | "cell_type": "code", 804 | "execution_count": 68, 805 | "metadata": { 806 | "collapsed": true 807 | }, 808 | "outputs": [], 809 | "source": [ 810 | "df.to_csv('sentiment_annotation_sample.csv', index=False, header=True)" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "metadata": { 816 | "collapsed": true 817 | }, 818 | "source": [ 819 | "## 試試也將新聞語料庫做情緒標記" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": 1, 825 | "metadata": { 826 | "collapsed": true 827 | }, 828 | "outputs": [], 829 | "source": [ 830 
| "import json" 831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": null, 836 | "metadata": { 837 | "collapsed": true 838 | }, 839 | "outputs": [], 840 | "source": [ 841 | "with open('../../語料庫資料前處理/data/clean_han.json') as json_data:\n", 842 | " d = json.load(json_data)" 843 | ] 844 | } 845 | ], 846 | "metadata": { 847 | "anaconda-cloud": {}, 848 | "kernelspec": { 849 | "display_name": "Python [Root]", 850 | "language": "python", 851 | "name": "Python [Root]" 852 | }, 853 | "language_info": { 854 | "codemirror_mode": { 855 | "name": "ipython", 856 | "version": 3 857 | }, 858 | "file_extension": ".py", 859 | "mimetype": "text/x-python", 860 | "name": "python", 861 | "nbconvert_exporter": "python", 862 | "pygments_lexer": "ipython3", 863 | "version": "3.5.2" 864 | } 865 | }, 866 | "nbformat": 4, 867 | "nbformat_minor": 0 868 | } 869 | -------------------------------------------------------------------------------- /語料庫標記及語言學分析/practice/2018 語料庫程式實務工作坊_語言學分析&標記實作_學習單.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫標記及語言學分析/practice/2018 語料庫程式實務工作坊_語言學分析&標記實作_學習單.pdf -------------------------------------------------------------------------------- /語料庫標記及語言學分析/practice/new_sample.txt: -------------------------------------------------------------------------------- 1 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 2 | 五樓耶, 好巧喔 五樓 耶 , 好巧 喔 + 3 | 結束了才要去 你是想到ㄛ妳 結束 了 才 要 去 你 是 想到 ㄛ 妳 - 4 | 年度最中肯!! 年度 最 中肯 ! ! + 5 | 好好玩阿 好 好玩 阿 + 6 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 7 | 年度最中肯!! 年度 最 中肯 ! ! + 8 | 生日快樂 生日快樂 + 9 | 我還沒去.... 我 還沒 去 .... - 10 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 11 | 早睡早起精神好 早睡早起 精神 好 + 12 | 太棒了!! 太棒 了 ! ! + 13 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 14 | 恐怖金紙鋪 恐怖 金紙 鋪 - 15 | 年度最中肯!! 年度 最 中肯 ! ! + 16 | 睿真的有夠生氣 睿 真的 有夠 生氣 - 17 | 嘖 有看到花博了 還嫌哩 嘖 有 看到 花博 了 還嫌 哩 - 18 | 爭豔館那紫色讚!! 爭 豔 館 那 紫色 讚 ! ! 
+ 19 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 20 | 我可是去走了數小時一個館都沒逛呢 我 可 是 去 走 了 數小時 一個館 都沒 逛 呢 - 21 | 美女晚安安~ ~ 美女 晚安 安 ~ ~ + 22 | 美女晚安安~ ~~ 美女 晚安 安 ~ ~ ~ + 23 | 我還沒去過 我 還沒 去過 - 24 | 可憐你了 可憐 你 了 - 25 | 我不怎麼喜歡報告的說 我 不 怎麼 喜歡 報告 的 說 - 26 | 真是太厲害囉~ 真是太 厲害 囉 ~ + 27 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 28 | 假日平安!順心! 假日 平安 ! 順心 ! + 29 | 好險我都逛完了~乎 好險 我 都 逛 完了 ~ 乎 + 30 | 買一送一沒了 買 一送 一 沒了 - 31 | 上星期去花博給它拼了3館!爭豔館、養生館、天使館~,還有看到一點點的未來館~~還真是拼ㄚ 上星期 去 花博給 它 拼 了 3 館 ! 爭 豔 館 、 養生館 、 天使 館 ~ , 還有 看到 一點點 的 未來館 ~ ~ 還 真是 拼 ㄚ + 32 | 你去了喔XDDD 趕上了 你 去 了 喔 XDDD 趕上 了 + 33 | 對阿 謝謝志工 有玩的很開心就好 這樣子志工跟我都很開心!!YA 對 阿 謝謝 志工 有 玩 的 很開心 就 好 這樣 子志工 跟 我 都 很開心 ! ! Y A + 34 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 35 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 36 | 鐵腿了吧 鐵腿 了 吧 - 37 | 躺著也中槍 躺 著 也 中槍 - 38 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 39 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 40 | 幹麻不帶 幹麻 不帶 - 41 | 我最近也滿喜歡聽這個的 我 最近 也 滿 喜歡 聽 這個 的 + 42 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 43 | 嘻嘻嘻 枝姐姐新的大頭照好閃亮唷 是因為本人最近也很閃嗎? 嘻嘻 嘻 枝 姐姐 新 的 大頭照 好 閃亮 唷 是 因為 本人 最近 也 很 閃 嗎 ? + 44 | 早睡早起精神好 早睡早起 精神 好 + 45 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 46 | 恭喜恭喜 恭喜 恭喜 + 47 | 午安 今天高溫記得多喝水 午安 今天 高溫 記得 多喝水 + 48 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 49 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 50 | 還好夢想館會留下 還好 夢想 館會 留下 + 51 | 好餓喔... 好 餓 喔 ... - 52 | 宿舍哪來新聞 宿舍 哪來 新聞 - 53 | 好悲慘 好 悲慘 - 54 | 好恐怖 好 恐怖 - 55 | XDD 我要喝喜酒 XDD 我要 喝 喜酒 + 56 | 只見新人笑~不見就人哭~可憐阿~還沒卸任就已經被人唾棄了~ 只見 新人 笑 ~ 不見 就 人 哭 ~ 可憐 阿 ~ 還沒 卸任 就 已經 被 人 唾棄 了 ~ - 57 | 早安 ! 來去線上收聽 KISSRADIO 開始美好的一天!! 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天 ! ! + 58 | 玩的開心點喔~ 玩 的 開心 點 喔 ~ + 59 | 花博又開放啦 花博 又 開放 啦 + 60 | + 61 | 真的快累斃了!!!! 真的 快累 斃了 ! ! ! ! - 62 | 哈哈哈~發酒瘋給她看XDD 哈哈哈 ~ 發酒瘋 給 她 看 XDD + 63 | 真讚~~ 真 讚 ~ ~ + 64 | 昨晚硬是把它用完 ..(但還有兩張遺失在我找不到的地方>< 昨晚 硬是 把 它 用 完 .. ( 但 還有 兩張 遺失 在 我 找不到 的 地方 > < - 65 | No good, Bob No good , Bob - 66 | 呵呵,我是乖孩子,小雪大人都沒打過我~ 呵呵 , 我 是 乖孩子 , 小雪 大人 都沒 打過 我 ~ + 67 | 我可能要去看醫生,掰伊~~ 我 可能 要 去 看 醫生 , 掰 伊 ~ ~ - 68 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 69 | 結果說要去都沒去成 結果 說 要 去 都 沒去 成 - 70 | 早上好~ 早上好 ~ + 71 | 去了兩次..也只看了一區半 去 了 兩次 .. 也 只 看 了 一區 半 - 72 | 看來你順利回到家了 看來 你 順利 回到 家 了 + 73 | 上次跟我媽他們去六點多就起床..還沒拿到夢想館的票,只能在門口等開門,一整天下來整個就是累慘了~~你們中午才去至少有睡飽 上次 跟 我媽 他們 去 六點 多 就 起床 .. 
還沒 拿到 夢想 館 的 票 , 只能 在 門口 等 開門 , 一整天 下來 整個 就是 累慘 了 ~ ~ 你們 中午 才 去 至少 有 睡 飽 - 74 | 可惜的是這次沒能跟幸茵好好坐下來吃頓美食...幸茵就走了 可惜 的 是 這次 沒能 跟 幸茵 好好 坐下 來 吃 頓 美食 ... 幸茵 就 走 了 - 75 | 她真的很乖 她 真的 很乖 + 76 | 那它分給別人 不就一堆人進來 嘖嘖 真不好 那 它 分給 別人 不 就 一堆 人 進來 嘖嘖 真 不好 - 77 | 我沒有排耶XDD 真幸運 我 沒 有 排耶 XDD 真 幸運 + 78 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 79 | 你都去幾次了。 我半次都沒去過 你 都 去 幾次 了 。 我 半次 都 沒去 過 - 80 | 好聽耶 好聽 耶 + 81 | 五月之後就會有館重新開放了~到時候可以再慢慢晃 五月 之 後 就會 有館 重新開放 了 ~ 到 時候 可以 再 慢慢 晃 + 82 | 真的!!!今天好熱!!! 真的 ! ! ! 今天 好熱 ! ! ! - 83 | 對啊~~~我去那一天剛好是人數最多的一天....啥館都沒得看....又熱得半死...隨便逛逛就回家囉 對 啊 ~ ~ ~ 我 去 那 一天 剛好 是 人數 最多 的 一天 .... 啥 館 都 沒得 看 .... 又 熱得 半死 ... 隨便 逛逛 就 回家 囉 - 84 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 85 | 我還有好幾ㄍ館沒看... 我 還有 好 幾 ㄍ 館 沒 看 ... - 86 | Have a Nice Day~ Have a Nice Day ~ + 87 | Good morning~ Good morning ~ + 88 | 早安 ! 來去線上收聽 KISSRADIO 開始美好的一天!! 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天 ! ! + 89 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 90 | 早上好 早上好 + 91 | 午安 午安 + 92 | 也為開始新的一歲倒數~~~Happy birthday 喔~~~ 也 為 開始 新 的 一歲 倒數 ~ ~ ~ Happy birthday 喔 ~ ~ ~ + 93 | 哀噁 哀 噁 - 94 | 有趣 中肯 溫馨 有趣 中肯 溫馨 + 95 | me too哦 鼻涕擤不庭加上喉嚨痛 me too 哦 鼻涕 擤 不庭 加上 喉嚨 痛 - 96 | 不錯玩 不錯 玩 + 97 | 別去 都是人.. 別去 都 是 人 .. - 98 | -------------------------------------------------------------------------------- /語料庫標記及語言學分析/practice/readme.txt: -------------------------------------------------------------------------------- 1 | [new_sample.txt] 2 | sample data released by SocialNlp, one additional column is added in, which is the segmented data using jieba 3 | 4 | [positive_words.txt]/[positivewordsall.txt] 5 | [negative_words.txt]/[negativewordsall.txt] 6 | contain both the NTUSD data and data from psychological experiment 7 | ps. 
NTUSD data are automatically extracted by machine 8 | 9 | [sentiment_annotation.py] / [sentiment_annotation.ipynb] 10 | script that uses both the positive and negative wordlists to automatically annotate the sentiment polarity of each text in new_sample.txt 11 | -------------------------------------------------------------------------------- /語料庫標記及語言學分析/practice/sentiment_annotation.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import pandas as pd 3 | import re 4 | 5 | # --- Read the file to annotate --- # 6 | sample_file = list(csv.reader(open('new_sample.txt', "r", encoding="utf-8"), delimiter = '\t')) 7 | 8 | # --- Name the columns --- # 9 | df = pd.DataFrame(sample_file, columns=["sample", "sample_segmented", "polarity"]) 10 | df = df.drop(['sample', 'polarity'], axis=1) # drop the reference-only columns 11 | # df 12 | 13 | # --- Load the sentiment lexicons (NTUSD wordlist); blank lines are skipped because an empty string is a substring of every text --- # 14 | positive_words = [w for w in open("positive_words.txt","r", encoding="utf-8").read().split("\n") if w] 15 | negative_words = [w for w in open("negative_words.txt","r", encoding="utf-8").read().split("\n") if w] 16 | 17 | # === Count the positive words that occur in each sample === # 18 | positive_word_score = [] 19 | for text in list(df.sample_segmented): 20 | result = 0 21 | for words in positive_words: 22 | if words in text: 23 | result += 1 24 | positive_word_score.append(result) 25 | # positive_word_score 26 | 27 | # === Count the positive-pattern matches in each sample === # 28 | positive_pattern = r'還好.+(會|不會)?' 
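The pattern defined just above can be checked in isolation. The following is a minimal sketch and not part of the original script: the helper name `count_pattern` and the sample strings are assumptions chosen for illustration. It suggests two things worth keeping in mind. The greedy `.+` swallows the rest of the text, so the pattern matches at most once per sample. Also, `re.findall` returns the contents of the capturing group rather than the full match, so only the length of its result is meaningful as a count.

```python
import re

# Same pattern as in the script: "還好", then anything, then an optional 會/不會.
positive_pattern = r'還好.+(會|不會)?'


def count_pattern(pattern, text):
    """Return the number of non-overlapping matches of `pattern` in `text`."""
    return len(re.findall(pattern, text))


# Hypothetical samples: one contains the pattern, one does not.
samples = [
    "還好 夢想 館會 留下",
    "好 餓 喔 ...",
]

counts = [count_pattern(positive_pattern, s) for s in samples]
print(counts)  # [1, 0]

# findall yields the (empty) optional group, not the matched text itself.
print(re.findall(positive_pattern, samples[0]))  # ['']

# Two occurrences of "還好" still count once, because ".+" is greedy.
print(count_pattern(positive_pattern, "還好 A 還好 B"))  # 1
```

If a per-occurrence count were wanted instead, a non-greedy `.+?` combined with a non-capturing group `(?:會|不會)` would be one option, though that would change the scores the script produces.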
29 | positive_pattern_score = [] 30 | for text in list(df.sample_segmented): 31 | positive_pattern_score.append(len(re.findall(positive_pattern,text))) 32 | # positive_pattern_score 33 | 34 | # === Combine the positive word and positive pattern counts === # 35 | positive_score = [positive_word_score[i] + positive_pattern_score[i] for i in range(len(positive_word_score))] 36 | 37 | # === Count the negative words that occur in each sample (each hit scores -1) === # 38 | negative_word_score = [] 39 | for text in list(df.sample_segmented): 40 | result = 0 41 | for words in negative_words: 42 | if words in text: 43 | result -= 1 44 | negative_word_score.append(result) 45 | # negative_word_score 46 | 47 | # === Count the negative-pattern matches in each sample (each match scores -1) === # 48 | negative_pattern = r'都.*了.*還.*|連.+都.+|結果.+都' 49 | negative_pattern_score = [] 50 | for text in list(df.sample_segmented): 51 | negative_pattern_score.append(len(re.findall(negative_pattern,text))*-1) 52 | # negative_pattern_score 53 | 54 | # === Combine the negative word and negative pattern counts === # 55 | negative_score = [negative_word_score[i] + negative_pattern_score[i] for i in range(len(negative_word_score))] 56 | 57 | # === Sum all sentiment polarity scores === # 58 | # df['positive_score'] = positive_score 59 | # df['negative_score'] = negative_score 60 | df['polarity_score'] = [positive_score[i] + negative_score[i] for i in range(len(positive_score))] 61 | 62 | 63 | # --- Label sentiment polarity numerically (positive: 1 / neutral: 0 / negative: -1) --- # 64 | df.loc[df.polarity_score > 0, 'sentiment'] = '1' 65 | df.loc[df.polarity_score < 0, 'sentiment'] = '-1' 66 | df.loc[df.polarity_score == 0, 'sentiment'] = '0' 67 | # df = df.drop(['polarity_score'], axis=1) 68 | 69 | # --- Write the annotated data out as CSV --- # 70 | df.to_csv('sentiment_annotation_sample.csv', index=False, header=True) 71 | 72 | 73 | ### >>>>>>>>>>> Try annotating the news corpus with sentiment as well <<<<<<<<<< ### 74 | # --- Packages you will need 75 | # import pandas as pd 76 | # import json 77 | 78 | # # --- Read the news corpus file (json file) 79 | # with open('your json file') as json_data: 80 | # d = json.load(json_data) 81 | 82 | # # --- Convert the loaded json file to a DataFrame 83 | # news_data = pd.DataFrame.from_records(d) 84 | # # news_data 85 | 86 | # # --- Keep only the column holding the segmented news text 87 | # news_content = pd.DataFrame(news_data['cln_content']) 88 | 89 | # # --- Count the positive words that occur in each sample 90 | # positive_word_score = [] 91 | # for text in list(news_content.cln_content): 92 | # result = 0 93 | # for words in positive_words: 94 | # if words in text: 95 | # result += 1 96 | # positive_word_score.append(result) 97 | # # positive_word_score 98 | 99 | # # --- Count the positive-pattern matches in each sample 100 | # positive_pattern = '還好.+(會|不會)?' 101 | # positive_pattern_score = [] 102 | # for text in list(news_content.cln_content): 103 | # positive_pattern_score.append(len(re.findall(positive_pattern,text))) 104 | # # positive_pattern_score 105 | 106 | # # --- Combine the positive word and positive pattern counts 107 | # positive_score = [positive_word_score[i] + positive_pattern_score[i] for i in range(len(positive_word_score))] 108 | # #positive_score 109 | 110 | # # --- Count the negative words that occur in each sample 111 | # negative_word_score = [] 112 | # for text in list(news_content.cln_content): 113 | # result = 0 114 | # for words in negative_words: 115 | # if words in text: 116 | # result -= 1 117 | # negative_word_score.append(result) 118 | # # negative_word_score 119 | 120 | # # --- Count the negative-pattern matches in each sample 121 | # negative_pattern = r'都.*了.*還.*|連.+都.+|結果.+都' 122 | # negative_pattern_score = [] 123 | # for text in list(news_content.cln_content): 124 | # negative_pattern_score.append(len(re.findall(negative_pattern,text))*-1) 125 | # # negative_pattern_score 126 | 127 | # # --- Combine the negative word and negative pattern counts 128 | # negative_score = [negative_word_score[i] + negative_pattern_score[i] for i in range(len(negative_word_score))] 129 | # # negative_score 130 | 131 | # news_content['polarity_score'] = [positive_score[i] + negative_score[i] for i in range(len(positive_score))] 132 | 133 | # news_content.to_csv('news_annotation.csv', 
index=False, header=True) -------------------------------------------------------------------------------- /語料庫標記及語言學分析/practice/sentiment_annotation_sample.csv: -------------------------------------------------------------------------------- 1 | sample_segmented,polarity_score,sentiment 2 | 有趣 中肯 溫馨 ,3,1 3 | " 五樓 耶 , 好巧 喔 ",0,0 4 | 結束 了 才 要 去 你 是 想到 ㄛ 妳 ,0,0 5 | 年度 最 中肯 ! ! ,1,1 6 | 好 好玩 阿 ,1,1 7 | 有趣 中肯 溫馨 ,3,1 8 | 年度 最 中肯 ! ! ,1,1 9 | 生日快樂 ,3,1 10 | 我 還沒 去 .... ,-1,-1 11 | 有趣 中肯 溫馨 ,3,1 12 | 早睡早起 精神 好 ,0,0 13 | 太棒 了 ! ! ,1,1 14 | 有趣 中肯 溫馨 ,3,1 15 | 恐怖 金紙 鋪 ,-2,-1 16 | 年度 最 中肯 ! ! ,1,1 17 | 睿 真的 有夠 生氣 ,-2,-1 18 | 嘖 有 看到 花博 了 還嫌 哩 ,0,0 19 | 爭 豔 館 那 紫色 讚 ! ! ,1,1 20 | 有趣 中肯 溫馨 ,3,1 21 | 我 可 是 去 走 了 數小時 一個館 都沒 逛 呢 ,-1,-1 22 | 美女 晚安 安 ~ ~ ,0,0 23 | 美女 晚安 安 ~ ~ ~ ,0,0 24 | 我 還沒 去過 ,-1,-1 25 | 可憐 你 了 ,-1,-1 26 | 我 不 怎麼 喜歡 報告 的 說 ,3,1 27 | 真是太 厲害 囉 ~ ,1,1 28 | 有趣 中肯 溫馨 ,3,1 29 | 假日 平安 ! 順心 ! ,1,1 30 | 好險 我 都 逛 完了 ~ 乎 ,-1,-1 31 | 買 一送 一 沒了 ,-1,-1 32 | 上星期 去 花博給 它 拼 了 3 館 ! 爭 豔 館 、 養生館 、 天使 館 ~ , 還有 看到 一點點 的 未來館 ~ ~ 還 真是 拼 ㄚ ,2,1 33 | 你 去 了 喔 XDDD 趕上 了 ,0,0 34 | 對 阿 謝謝 志工 有 玩 的 很開心 就 好 這樣 子志工 跟 我 都 很開心 ! ! Y A ,4,1 35 | 有趣 中肯 溫馨 ,3,1 36 | 有趣 中肯 溫馨 ,3,1 37 | 鐵腿 了 吧 ,0,0 38 | 躺 著 也 中槍 ,0,0 39 | 有趣 中肯 溫馨 ,3,1 40 | 有趣 中肯 溫馨 ,3,1 41 | 幹麻 不帶 ,-1,-1 42 | 我 最近 也 滿 喜歡 聽 這個 的 ,2,1 43 | 有趣 中肯 溫馨 ,3,1 44 | 嘻嘻 嘻 枝 姐姐 新 的 大頭照 好 閃亮 唷 是 因為 本人 最近 也 很 閃 嗎 ? ,1,1 45 | 早睡早起 精神 好 ,0,0 46 | 有趣 中肯 溫馨 ,3,1 47 | 恭喜 恭喜 ,1,1 48 | 午安 今天 高溫 記得 多喝水 ,0,0 49 | 有趣 中肯 溫馨 ,3,1 50 | 有趣 中肯 溫馨 ,3,1 51 | 還好 夢想 館會 留下 ,4,1 52 | 好 餓 喔 ... ,0,0 53 | 宿舍 哪來 新聞 ,0,0 54 | 好 悲慘 ,-2,-1 55 | 好 恐怖 ,-2,-1 56 | XDD 我要 喝 喜酒 ,1,1 57 | 只見 新人 笑 ~ 不見 就 人 哭 ~ 可憐 阿 ~ 還沒 卸任 就 已經 被 人 唾棄 了 ~ ,-3,-1 58 | 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天 ! ! ,1,1 59 | 玩 的 開心 點 喔 ~ ,2,1 60 | 花博 又 開放 啦 ,1,1 61 | +,0,0 62 | 真的 快累 斃了 ! ! ! ! ,-1,-1 63 | 哈哈哈 ~ 發酒瘋 給 她 看 XDD ,0,0 64 | 真 讚 ~ ~ ,1,1 65 | 昨晚 硬是 把 它 用 完 .. 
( 但 還有 兩張 遺失 在 我 找不到 的 地方 > < ,-1,-1 66 | " No good , Bob ",0,0 67 | 呵呵 , 我 是 乖孩子 , 小雪 大人 都沒 打過 我 ~ ,0,0 68 | 我 可能 要 去 看 醫生 , 掰 伊 ~ ~ ,0,0 69 | 有趣 中肯 溫馨 ,3,1 70 | 結果 說 要 去 都 沒去 成 ,-2,-1 71 | 早上好 ~ ,1,1 72 | 去 了 兩次 .. 也 只 看 了 一區 半 ,0,0 73 | 看來 你 順利 回到 家 了 ,1,1 74 | " 上次 跟 我媽 他們 去 六點 多 就 起床 .. 還沒 拿到 夢想 館 的 票 , 只能 在 門口 等 開門 , 一整天 下來 整個 就是 累慘 了 ~ ~ 你們 中午 才 去 至少 有 睡 飽 ",0,0 75 | 可惜 的 是 這次 沒能 跟 幸茵 好好 坐下 來 吃 頓 美食 ... 幸茵 就 走 了 ,-2,-1 76 | 她 真的 很乖 ,0,0 77 | 那 它 分給 別人 不 就 一堆 人 進來 嘖嘖 真 不好 ,-1,-1 78 | 我 沒 有 排耶 XDD 真 幸運 ,1,1 79 | 有趣 中肯 溫馨 ,3,1 80 | 你 都 去 幾次 了 。 我 半次 都 沒去 過 ,-1,-1 81 | 好聽 耶 ,1,1 82 | 五月 之 後 就會 有館 重新開放 了 ~ 到 時候 可以 再 慢慢 晃 ,4,1 83 | 真的 ! ! ! 今天 好熱 ! ! ! ,0,0 84 | 對 啊 ~ ~ ~ 我 去 那 一天 剛好 是 人數 最多 的 一天 .... 啥 館 都 沒得 看 .... 又 熱得 半死 ... 隨便 逛逛 就 回家 囉 ,-2,-1 85 | 有趣 中肯 溫馨 ,3,1 86 | 我 還有 好 幾 ㄍ 館 沒 看 ... ,1,1 87 | Have a Nice Day ~ ,0,0 88 | Good morning ~ ,0,0 89 | 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天 ! ! ,1,1 90 | 有趣 中肯 溫馨 ,3,1 91 | 早上好 ,1,1 92 | 午安 ,0,0 93 | 也 為 開始 新 的 一歲 倒數 ~ ~ ~ Happy birthday 喔 ~ ~ ~ ,0,0 94 | 哀 噁 ,0,0 95 | 有趣 中肯 溫馨 ,3,1 96 | me too 哦 鼻涕 擤 不庭 加上 喉嚨 痛 ,0,0 97 | 不錯 玩 ,1,1 98 | 別去 都 是 人 .. 
,0,0 99 | -------------------------------------------------------------------------------- /語料庫標記及語言學分析/slide/20181104_語料庫語言學工作坊.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫標記及語言學分析/slide/20181104_語料庫語言學工作坊.pdf -------------------------------------------------------------------------------- /語料庫爬蟲/README.md: -------------------------------------------------------------------------------- 1 | 語料庫爬蟲 2 | 3 | --- 4 | 5 | Slides: [link](https://aji.tw/slides/corpus-crawler) 6 | -------------------------------------------------------------------------------- /語料庫爬蟲/src/applecrawler.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from time import sleep 3 | from random import random 4 | import json 5 | 6 | from requests import Session 7 | from bs4 import BeautifulSoup 8 | 9 | session = Session() 10 | session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36' 11 | 12 | 13 | # Collect links to today's articles from the paginated real-time listing 14 | base_url = 'https://tw.appledaily.com/new/realtime/{page}' 15 | page = 1 16 | links = [] 17 | current_date = datetime.now().date() 18 | 19 | while True: 20 | url = base_url.format(page=page) 21 | response = session.get(url) 22 | print(f'已經爬到第{page}頁') 23 | dom = BeautifulSoup(response.text, 'html.parser') 24 | raw_time = dom.select('h1.dddd > time')[0].text 25 | date = datetime.strptime(raw_time, '%Y / %m / %d').date() 26 | if date < current_date: 27 | break 28 | elements = dom.select('h1.dddd + ul.rtddd > li') 29 | for element in elements: 30 | link = element.select('a')[0]['href'] 31 | links.append(link) 32 | sleep(random() * 5) 33 | page += 1 34 | 35 | # Download the HTML of every collected link, pausing a random interval between requests 36 | htmls = [] 37 | for num, link in enumerate(links): 38 | print(f'還剩下{len(links) - num}個頁面') 39 | response = session.get(link) 40 | htmls.append(response.text) 41 | sleep(random() * 5) 42 | 43 | # Locate the elements on each article page and build structured records 44 | posts = [] 45 | for html in htmls: 46 | dom = BeautifulSoup(html, 'html.parser') 47 | title = dom.select('article.ndArticle_leftColumn h1')[0].text 48 | created_time = dom.select('article.ndArticle_leftColumn div.ndArticle_creat')[0].text 49 | category = dom.select('nav div.ndgTag a.current')[0].text 50 | content = dom.select('article.ndArticle_content p')[0].text 51 | post = dict(title=title, created_time=created_time, category=category, content=content) 52 | posts.append(post) 53 | 54 | # Write the structured records to a JSON file, keeping Chinese text readable 55 | with open('appledaily.json', 'w', encoding='utf-8') as f: 56 | json.dump(posts, f, ensure_ascii=False, indent=2) 57 | -------------------------------------------------------------------------------- /語料庫爬蟲/src/實戰.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 流程\n", 8 | "1. 蒐集 **即時 > 最新** 所有當日新聞連結\n", 9 | "1. 蒐集所有連結的內容並轉存為結構化的資料\n", 10 | "1. 
將結構化的資料輸出成檔案" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": {}, 16 | "source": [ 17 | "事前準備" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 1, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from datetime import datetime\n", 27 | "from time import sleep\n", 28 | "from random import random\n", 29 | "\n", 30 | "from requests import Session\n", 31 | "from bs4 import BeautifulSoup\n", 32 | "\n", 33 | "session = Session()\n", 34 | "session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### 蒐集連結" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "base_url = 'https://tw.appledaily.com/new/realtime/{page}'\n", 51 | "page = 1\n", 52 | "links = []\n", 53 | "current_date = datetime.now().date()" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "遞迴找出所有連結" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "name": "stdout", 70 | "output_type": "stream", 71 | "text": [ 72 | "已經爬到第1頁面\n", 73 | "已經爬到第2頁面\n", 74 | "已經爬到第3頁面\n", 75 | "已經爬到第4頁面\n" 76 | ] 77 | } 78 | ], 79 | "source": [ 80 | "while True:\n", 81 | " url = base_url.format(page=page)\n", 82 | " response = session.get(url)\n", 83 | " print(f'已經爬到第{page}頁面')\n", 84 | " dom = BeautifulSoup(response.text)\n", 85 | " raw_time = dom.select('h1.dddd > time')[0].text\n", 86 | " date = datetime.strptime(raw_time, '%Y / %m / %d').date()\n", 87 | " if date < current_date:\n", 88 | " break\n", 89 | " elements = dom.select('h1.dddd + ul.rtddd > li')\n", 90 | " for element in elements:\n", 91 | " link = element.select('a')[0]['href'] \n", 92 | " links.append(link)\n", 93 | " 
sleep(random() * 5)\n", 94 | " page += 1" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### 蒐集內容" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 4, 107 | "metadata": {}, 108 | "outputs": [ 109 | { 110 | "name": "stdout", 111 | "output_type": "stream", 112 | "text": [ 113 | "還剩下90個頁面\n", 114 | "還剩下89個頁面\n", 115 | "還剩下88個頁面\n", 116 | "還剩下87個頁面\n", 117 | "還剩下86個頁面\n", 118 | "還剩下85個頁面\n", 119 | "還剩下84個頁面\n", 120 | "還剩下83個頁面\n", 121 | "還剩下82個頁面\n", 122 | "還剩下81個頁面\n", 123 | "還剩下80個頁面\n", 124 | "還剩下79個頁面\n", 125 | "還剩下78個頁面\n", 126 | "還剩下77個頁面\n", 127 | "還剩下76個頁面\n", 128 | "還剩下75個頁面\n", 129 | "還剩下74個頁面\n", 130 | "還剩下73個頁面\n", 131 | "還剩下72個頁面\n", 132 | "還剩下71個頁面\n", 133 | "還剩下70個頁面\n", 134 | "還剩下69個頁面\n", 135 | "還剩下68個頁面\n", 136 | "還剩下67個頁面\n", 137 | "還剩下66個頁面\n", 138 | "還剩下65個頁面\n", 139 | "還剩下64個頁面\n", 140 | "還剩下63個頁面\n", 141 | "還剩下62個頁面\n", 142 | "還剩下61個頁面\n", 143 | "還剩下60個頁面\n", 144 | "還剩下59個頁面\n", 145 | "還剩下58個頁面\n", 146 | "還剩下57個頁面\n", 147 | "還剩下56個頁面\n", 148 | "還剩下55個頁面\n", 149 | "還剩下54個頁面\n", 150 | "還剩下53個頁面\n", 151 | "還剩下52個頁面\n", 152 | "還剩下51個頁面\n", 153 | "還剩下50個頁面\n", 154 | "還剩下49個頁面\n", 155 | "還剩下48個頁面\n", 156 | "還剩下47個頁面\n", 157 | "還剩下46個頁面\n", 158 | "還剩下45個頁面\n", 159 | "還剩下44個頁面\n", 160 | "還剩下43個頁面\n", 161 | "還剩下42個頁面\n", 162 | "還剩下41個頁面\n", 163 | "還剩下40個頁面\n", 164 | "還剩下39個頁面\n", 165 | "還剩下38個頁面\n", 166 | "還剩下37個頁面\n", 167 | "還剩下36個頁面\n", 168 | "還剩下35個頁面\n", 169 | "還剩下34個頁面\n", 170 | "還剩下33個頁面\n", 171 | "還剩下32個頁面\n", 172 | "還剩下31個頁面\n", 173 | "還剩下30個頁面\n", 174 | "還剩下29個頁面\n", 175 | "還剩下28個頁面\n", 176 | "還剩下27個頁面\n", 177 | "還剩下26個頁面\n", 178 | "還剩下25個頁面\n", 179 | "還剩下24個頁面\n", 180 | "還剩下23個頁面\n", 181 | "還剩下22個頁面\n", 182 | "還剩下21個頁面\n", 183 | "還剩下20個頁面\n", 184 | "還剩下19個頁面\n", 185 | "還剩下18個頁面\n", 186 | "還剩下17個頁面\n", 187 | "還剩下16個頁面\n", 188 | "還剩下15個頁面\n", 189 | "還剩下14個頁面\n", 190 | "還剩下13個頁面\n", 191 | "還剩下12個頁面\n", 192 | "還剩下11個頁面\n", 193 | "還剩下10個頁面\n", 194 | "還剩下9個頁面\n", 195 | 
"還剩下8個頁面\n", 196 | "還剩下7個頁面\n", 197 | "還剩下6個頁面\n", 198 | "還剩下5個頁面\n", 199 | "還剩下4個頁面\n", 200 | "還剩下3個頁面\n", 201 | "還剩下2個頁面\n", 202 | "還剩下1個頁面\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | "htmls = []\n", 208 | "for num, link in enumerate(links):\n", 209 | " print(f'還剩下{len(links) - num}個頁面')\n", 210 | " response = session.get(link)\n", 211 | " htmls.append(response.text)\n", 212 | " sleep(random() * 5)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "元素定位與將資料結構化" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 5, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "posts = []\n", 229 | "for html in htmls:\n", 230 | " dom = BeautifulSoup(html)\n", 231 | " title = dom.select('article.ndArticle_leftColumn h1')[0].text\n", 232 | " created_time = dom.select('article.ndArticle_leftColumn div.ndArticle_creat')[0].text\n", 233 | " category = dom.select('nav div.ndgTag a.current')[0].text\n", 234 | " content = dom.select('article.ndArticle_content p')[0].text\n", 235 | " post = dict(title=title, created_time=created_time, category=category, content=content)\n", 236 | " posts.append(post)" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "### 輸出檔案" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 7, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "import json\n", 253 | "\n", 254 | "with open('appledaily.json', 'w') as f:\n", 255 | " json.dump(posts, f, ensure_ascii=False, indent=2)" 256 | ] 257 | } 258 | ], 259 | "metadata": { 260 | "kernelspec": { 261 | "display_name": "Python 3", 262 | "language": "python", 263 | "name": "python3" 264 | }, 265 | "language_info": { 266 | "codemirror_mode": { 267 | "name": "ipython", 268 | "version": 3 269 | }, 270 | "file_extension": ".py", 271 | "mimetype": "text/x-python", 272 | "name": "python", 273 | "nbconvert_exporter": "python", 274 
| "pygments_lexer": "ipython3", 275 | "version": "3.7.0" 276 | } 277 | }, 278 | "nbformat": 4, 279 | "nbformat_minor": 2 280 | } 281 | -------------------------------------------------------------------------------- /語料庫資料前處理/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫資料前處理/.gitkeep -------------------------------------------------------------------------------- /語料庫資料前處理/README.md: -------------------------------------------------------------------------------- 1 | 語料庫資料前處理 2 | 3 | --- 4 | 簡報:[連結](https://goo.gl/GDcYGV) 5 | -------------------------------------------------------------------------------- /語料庫資料前處理/stopwords.txt: -------------------------------------------------------------------------------- 1 | , 2 | ? 3 | 、 4 | 。 5 | “ 6 | ” 7 | 「 8 | 」 9 | 《 10 | 》 11 | ! 12 | , 13 | : 14 | ; 15 | ? 16 | 人民 17 | 末##末 18 | 啊 19 | 阿 20 | 哎 21 | 哎呀 22 | 哎喲 23 | 唉 24 | 我 25 | 我們 26 | 按 27 | 按照 28 | 依照 29 | 吧 30 | 吧噠 31 | 把 32 | 罷了 33 | 被 34 | 本 35 | 本著 36 | 比 37 | 比方 38 | 比如 39 | 鄙人 40 | 彼 41 | 彼此 42 | 邊 43 | 別 44 | 別的 45 | 別說 46 | 並 47 | 並且 48 | 不比 49 | 不成 50 | 不單 51 | 不但 52 | 不獨 53 | 不管 54 | 不光 55 | 不過 56 | 不僅 57 | 不拘 58 | 不論 59 | 不怕 60 | 不然 61 | 不如 62 | 不特 63 | 不惟 64 | 不問 65 | 不只 66 | 朝 67 | 朝著 68 | 趁 69 | 趁著 70 | 乘 71 | 沖 72 | 除 73 | 除此之外 74 | 除非 75 | 除了 76 | 此 77 | 此間 78 | 此外 79 | 從 80 | 從而 81 | 打 82 | 待 83 | 但 84 | 但是 85 | 當 86 | 當著 87 | 到 88 | 得 89 | 的 90 | 的話 91 | 等 92 | 等等 93 | 地 94 | 第 95 | 叮咚 96 | 對 97 | 對於 98 | 多 99 | 多少 100 | 而 101 | 而況 102 | 而且 103 | 而是 104 | 而外 105 | 而言 106 | 而已 107 | 爾後 108 | 反過來 109 | 反過來說 110 | 反之 111 | 非但 112 | 非徒 113 | 否則 114 | 嘎 115 | 嘎登 116 | 該 117 | 趕 118 | 個 119 | 各 120 | 各個 121 | 各位 122 | 各種 123 | 各自 124 | 給 125 | 根據 126 | 跟 127 | 故 128 | 故此 129 | 固然 130 | 關於 131 | 管 132 | 歸 133 | 果然 134 | 果真 135 | 過 136 | 哈 137 | 哈哈 138 | 呵 139 | 和 140 | 何 141 | 何處 142 | 何況 143 
| 何時 144 | 嘿 145 | 哼 146 | 哼唷 147 | 呼哧 148 | 乎 149 | 嘩 150 | 還是 151 | 還有 152 | 換句話說 153 | 換言之 154 | 或 155 | 或是 156 | 或者 157 | 極了 158 | 及 159 | 及其 160 | 及至 161 | 即 162 | 即便 163 | 即或 164 | 即令 165 | 即若 166 | 即使 167 | 幾 168 | 幾時 169 | 己 170 | 既 171 | 既然 172 | 既是 173 | 繼而 174 | 加之 175 | 假如 176 | 假若 177 | 假使 178 | 鑒於 179 | 將 180 | 較 181 | 較之 182 | 叫 183 | 接著 184 | 結果 185 | 借 186 | 緊接著 187 | 進而 188 | 盡 189 | 儘管 190 | 經 191 | 經過 192 | 就 193 | 就是 194 | 就是說 195 | 據 196 | 具體地說 197 | 具體說來 198 | 開始 199 | 開外 200 | 靠 201 | 咳 202 | 可 203 | 可見 204 | 可是 205 | 可以 206 | 況且 207 | 啦 208 | 來 209 | 來著 210 | 離 211 | 例如 212 | 哩 213 | 連 214 | 連同 215 | 兩者 216 | 了 217 | 臨 218 | 另 219 | 另外 220 | 另一方面 221 | 論 222 | 嘛 223 | 嗎 224 | 慢說 225 | 漫說 226 | 冒 227 | 麼 228 | 每 229 | 每當 230 | 們 231 | 莫若 232 | 某 233 | 某個 234 | 某些 235 | 拿 236 | 哪 237 | 哪邊 238 | 哪兒 239 | 哪個 240 | 哪裏 241 | 哪年 242 | 哪怕 243 | 哪天 244 | 哪些 245 | 哪樣 246 | 那 247 | 那邊 248 | 那兒 249 | 那個 250 | 那會兒 251 | 那裏 252 | 那麼 253 | 那麼些 254 | 那麼樣 255 | 那時 256 | 那些 257 | 那樣 258 | 乃 259 | 乃至 260 | 呢 261 | 能 262 | 你 263 | 你們 264 | 您 265 | 寧 266 | 寧可 267 | 寧肯 268 | 寧願 269 | 哦 270 | 嘔 271 | 啪達 272 | 旁人 273 | 呸 274 | 憑 275 | 憑藉 276 | 其 277 | 其次 278 | 其二 279 | 其他 280 | 其它 281 | 其一 282 | 其餘 283 | 其中 284 | 起 285 | 起見 286 | 豈但 287 | 恰恰相反 288 | 前後 289 | 前者 290 | 且 291 | 然而 292 | 然後 293 | 然則 294 | 讓 295 | 人家 296 | 任 297 | 任何 298 | 任憑 299 | 如 300 | 如此 301 | 如果 302 | 如何 303 | 如其 304 | 如若 305 | 如上所述 306 | 若 307 | 若非 308 | 若是 309 | 啥 310 | 上下 311 | 尚且 312 | 設若 313 | 設使 314 | 甚而 315 | 甚麼 316 | 甚至 317 | 省得 318 | 時候 319 | 什麼 320 | 什麼樣 321 | 使得 322 | 是 323 | 是的 324 | 首先 325 | 誰 326 | 誰知 327 | 順 328 | 順著 329 | 似的 330 | 雖 331 | 雖然 332 | 雖說 333 | 雖則 334 | 隨 335 | 隨著 336 | 所 337 | 所以 338 | 他 339 | 他們 340 | 他人 341 | 它 342 | 它們 343 | 她 344 | 她們 345 | 倘 346 | 倘或 347 | 倘然 348 | 倘若 349 | 倘使 350 | 騰 351 | 替 352 | 通過 353 | 同 354 | 同時 355 | 哇 356 | 萬一 357 | 往 358 | 望 359 | 為 360 | 為何 361 | 為了 362 | 為什麼 363 | 為著 364 | 餵 365 | 嗡嗡 366 | 我 367 | 我們 368 | 嗚 369 | 嗚呼 370 | 烏乎 371 | 無論 
372 | 無寧 373 | 毋寧 374 | 嘻 375 | 嚇 376 | 相對而言 377 | 像 378 | 向 379 | 向著 380 | 噓 381 | 呀 382 | 焉 383 | 沿 384 | 沿著 385 | 要 386 | 要不 387 | 要不然 388 | 要不是 389 | 要麼 390 | 要是 391 | 也 392 | 也罷 393 | 也好 394 | 一 395 | 一般 396 | 一旦 397 | 一方面 398 | 一來 399 | 一切 400 | 一樣 401 | 一則 402 | 依 403 | 依照 404 | 矣 405 | 以 406 | 以便 407 | 以及 408 | 以免 409 | 以至 410 | 以至於 411 | 以致 412 | 抑或 413 | 因 414 | 因此 415 | 因而 416 | 因為 417 | 喲 418 | 用 419 | 由 420 | 由此可見 421 | 由於 422 | 有 423 | 有的 424 | 有關 425 | 有些 426 | 又 427 | 於 428 | 於是 429 | 於是乎 430 | 與 431 | 與此同時 432 | 與否 433 | 與其 434 | 越是 435 | 雲雲 436 | 哉 437 | 再說 438 | 再者 439 | 在 440 | 在下 441 | 咱 442 | 咱們 443 | 則 444 | 怎 445 | 怎麼 446 | 怎麼辦 447 | 怎麼樣 448 | 怎樣 449 | 咋 450 | 照 451 | 照著 452 | 者 453 | 這 454 | 這邊 455 | 這兒 456 | 這個 457 | 這會兒 458 | 這就是說 459 | 這裏 460 | 這麼 461 | 這麼點兒 462 | 這麼些 463 | 這麼樣 464 | 這時 465 | 這些 466 | 這樣 467 | 正如 468 | 吱 469 | 之 470 | 之類 471 | 之所以 472 | 之一 473 | 只是 474 | 只限 475 | 只要 476 | 只有 477 | 至 478 | 至於 479 | 諸位 480 | 著 481 | 著呢 482 | 自 483 | 自從 484 | 自個兒 485 | 自各兒 486 | 自己 487 | 自家 488 | 自身 489 | 綜上所述 490 | 總的來看 491 | 總的來說 492 | 總的說來 493 | 總而言之 494 | 總之 495 | 縱 496 | 縱令 497 | 縱然 498 | 縱使 499 | 遵照 500 | 作為 501 | 兮 502 | 呃 503 | 唄 504 | 咚 505 | 咦 506 | 喏 507 | 啐 508 | 喔唷 509 | 嗬 510 | 嗯 511 | 噯 512 | ~ 513 | ! 514 | . 515 | : 516 | " 517 | ' 518 | ( 519 | ) 520 | * 521 | A 522 | 白 523 | 社會主義 524 | -- 525 | .. 526 | >> 527 | [ 528 | ] 529 | 530 | < 531 | > 532 | / 533 | \ 534 | | 535 | - 536 | _ 537 | + 538 | = 539 | & 540 | ^ 541 | % 542 | # 543 | @ 544 | ` 545 | ; 546 | $ 547 | ( 548 | ) 549 | —— 550 | — 551 | ¥ 552 | · 553 | ... 
554 | ‘ 555 | ’ 556 | 〉 557 | 〈 558 | … 559 |   560 | 0 561 | 1 562 | 2 563 | 3 564 | 4 565 | 5 566 | 6 567 | 7 568 | 8 569 | 9 570 | 0 571 | 1 572 | 2 573 | 3 574 | 4 575 | 5 576 | 6 577 | 7 578 | 8 579 | 9 580 | 二 581 | 三 582 | 四 583 | 五 584 | 六 585 | 七 586 | 八 587 | 九 588 | 零 589 | > 590 | < 591 | @ 592 | # 593 | $ 594 | % 595 | ︿ 596 | & 597 | * 598 | + 599 | ~ 600 | | 601 | [ 602 | ] 603 | { 604 | } 605 | 啊哈 606 | 啊呀 607 | 啊喲 608 | 挨次 609 | 挨個 610 | 挨家挨戶 611 | 挨門挨戶 612 | 挨門逐戶 613 | 挨著 614 | 按理 615 | 按期 616 | 按時 617 | 按說 618 | 暗地裏 619 | 暗中 620 | 暗自 621 | 昂然 622 | 八成 623 | 白白 624 | 半 625 | 梆 626 | 保管 627 | 保險 628 | 飽 629 | 背地裏 630 | 背靠背 631 | 倍感 632 | 倍加 633 | 本人 634 | 本身 635 | 甭 636 | 比起 637 | 比如說 638 | 比照 639 | 畢竟 640 | 必 641 | 必定 642 | 必將 643 | 必須 644 | 便 645 | 別人 646 | 並非 647 | 並肩 648 | 並沒 649 | 並沒有 650 | 併排 651 | 並無 652 | 勃然 653 | 不 654 | 不必 655 | 不常 656 | 不大 657 | 不但...而且 658 | 不得 659 | 不得不 660 | 不得了 661 | 不得已 662 | 不迭 663 | 不定 664 | 不對 665 | 不妨 666 | 不管怎樣 667 | 不會 668 | 不僅...而且 669 | 不僅僅 670 | 不僅僅是 671 | 不經意 672 | 不可開交 673 | 不可抗拒 674 | 不力 675 | 不了 676 | 不料 677 | 不滿 678 | 不免 679 | 不能不 680 | 不起 681 | 不巧 682 | 不然的話 683 | 不日 684 | 不少 685 | 不勝 686 | 不時 687 | 不是 688 | 不同 689 | 不能 690 | 不要 691 | 不外 692 | 不外乎 693 | 不下 694 | 不限 695 | 不消 696 | 不已 697 | 不亦樂乎 698 | 不由得 699 | 不再 700 | 不擇手段 701 | 不怎麼 702 | 不曾 703 | 不知不覺 704 | 不止 705 | 不止一次 706 | 不至於 707 | 才 708 | 才能 709 | 策略地 710 | 差不多 711 | 差一點 712 | 常 713 | 常常 714 | 常言道 715 | 常言說 716 | 常言說得好 717 | 長此下去 718 | 長話短說 719 | 長期以來 720 | 長線 721 | 敞開兒 722 | 徹夜 723 | 陳年 724 | 趁便 725 | 趁機 726 | 趁熱 727 | 趁勢 728 | 趁早 729 | 成年 730 | 成年累月 731 | 成心 732 | 乘機 733 | 乘勝 734 | 乘勢 735 | 乘隙 736 | 乘虛 737 | 誠然 738 | 遲早 739 | 充分 740 | 充其極 741 | 充其量 742 | 抽冷子 743 | 臭 744 | 初 745 | 出 746 | 出來 747 | 出去 748 | 除此 749 | 除此而外 750 | 除此以外 751 | 除開 752 | 除去 753 | 除卻 754 | 除外 755 | 處處 756 | 川流不息 757 | 傳 758 | 傳說 759 | 傳聞 760 | 串列 761 | 純 762 | 純粹 763 | 此後 764 | 此中 765 | 次第 766 | 匆匆 767 | 從不 768 | 從此 769 | 從此以後 770 | 從古到今 771 | 從古至今 772 | 從今以後 773 | 從寬 
774 | 從來 775 | 從輕 776 | 從速 777 | 從頭 778 | 從未 779 | 從無到有 780 | 從小 781 | 從新 782 | 從嚴 783 | 從優 784 | 從早到晚 785 | 從中 786 | 從重 787 | 湊巧 788 | 粗 789 | 存心 790 | 達旦 791 | 打從 792 | 打開天窗說亮話 793 | 大 794 | 大不了 795 | 大大 796 | 大抵 797 | 大都 798 | 大多 799 | 大凡 800 | 大概 801 | 大家 802 | 大舉 803 | 大略 804 | 大面兒上 805 | 大事 806 | 大體 807 | 大體上 808 | 大約 809 | 大張旗鼓 810 | 大致 811 | 呆呆地 812 | 帶 813 | 殆 814 | 待到 815 | 單 816 | 單純 817 | 單單 818 | 但願 819 | 彈指之間 820 | 當場 821 | 當兒 822 | 當即 823 | 當口兒 824 | 當然 825 | 當庭 826 | 當頭 827 | 當下 828 | 當真 829 | 當中 830 | 倒不如 831 | 倒不如說 832 | 倒是 833 | 到處 834 | 到底 835 | 到了兒 836 | 到目前為止 837 | 到頭 838 | 到頭來 839 | 得起 840 | 得天獨厚 841 | 的確 842 | 等到 843 | 叮噹 844 | 頂多 845 | 定 846 | 動不動 847 | 動輒 848 | 陡然 849 | 都 850 | 獨 851 | 獨自 852 | 斷然 853 | 頓時 854 | 多次 855 | 多多 856 | 多多少少 857 | 多多益善 858 | 多虧 859 | 多年來 860 | 多年前 861 | 而後 862 | 而論 863 | 而又 864 | 爾等 865 | 二話不說 866 | 二話沒說 867 | 反倒 868 | 反倒是 869 | 反而 870 | 反手 871 | 反之亦然 872 | 反之則 873 | 方 874 | 方才 875 | 方能 876 | 放量 877 | 非常 878 | 非得 879 | 分期 880 | 分期分批 881 | 分頭 882 | 奮勇 883 | 憤然 884 | 風雨無阻 885 | 逢 886 | 弗 887 | 甫 888 | 嘎嘎 889 | 該當 890 | 概 891 | 趕快 892 | 趕早不趕晚 893 | 敢 894 | 敢情 895 | 敢於 896 | 剛 897 | 剛才 898 | 剛好 899 | 剛巧 900 | 高低 901 | 格外 902 | 隔日 903 | 隔夜 904 | 個人 905 | 各式 906 | 更 907 | 更加 908 | 更進一步 909 | 更為 910 | 公然 911 | 共 912 | 共總 913 | 夠瞧的 914 | 姑且 915 | 古來 916 | 故而 917 | 故意 918 | 固 919 | 怪 920 | 怪不得 921 | 慣常 922 | 光 923 | 光是 924 | 歸根到底 925 | 歸根結底 926 | 過於 927 | 毫不 928 | 毫無 929 | 毫無保留地 930 | 毫無例外 931 | 好在 932 | 何必 933 | 何嘗 934 | 何妨 935 | 何苦 936 | 何樂而不為 937 | 何須 938 | 何止 939 | 很 940 | 很多 941 | 很少 942 | 轟然 943 | 後來 944 | 呼啦 945 | 忽地 946 | 忽然 947 | 互 948 | 互相 949 | 嘩啦 950 | 話說 951 | 還 952 | 恍然 953 | 會 954 | 豁然 955 | 活 956 | 夥同 957 | 或多或少 958 | 或許 959 | 基本 960 | 基本上 961 | 基於 962 | 極 963 | 極大 964 | 極度 965 | 極端 966 | 極力 967 | 極其 968 | 極為 969 | 急匆匆 970 | 即將 971 | 即刻 972 | 即是說 973 | 幾度 974 | 幾番 975 | 幾乎 976 | 幾經 977 | 既...又 978 | 繼之 979 | 加上 980 | 加以 981 | 間或 982 | 簡而言之 983 | 簡言之 984 | 簡直 985 | 見 986 | 將才 987 | 將近 988 | 將要 989 | 交口 990 
| 較比 991 | 較為 992 | 接連不斷 993 | 接下來 994 | 皆可 995 | 截然 996 | 截至 997 | 藉以 998 | 藉此 999 | 藉以 1000 | 屆時 1001 | 僅 1002 | 僅僅 1003 | 謹 1004 | 進來 1005 | 進去 1006 | 近 1007 | 近幾年來 1008 | 近來 1009 | 近年來 1010 | 儘管如此 1011 | 儘可能 1012 | 儘快 1013 | 儘量 1014 | 盡然 1015 | 盡如人意 1016 | 盡心竭力 1017 | 盡心盡力 1018 | 儘早 1019 | 精光 1020 | 經常 1021 | 竟 1022 | 竟然 1023 | 究竟 1024 | 就此 1025 | 就地 1026 | 就算 1027 | 居然 1028 | 局外 1029 | 舉凡 1030 | 據稱 1031 | 據此 1032 | 據實 1033 | 據說 1034 | 據我所知 1035 | 據悉 1036 | 具體來說 1037 | 決不 1038 | 決非 1039 | 絕 1040 | 絕不 1041 | 絕頂 1042 | 絕對 1043 | 絕非 1044 | 均 1045 | 喀 1046 | 看 1047 | 看來 1048 | 看起來 1049 | 看上去 1050 | 看樣子 1051 | 可好 1052 | 可能 1053 | 恐怕 1054 | 快 1055 | 快要 1056 | 來不及 1057 | 來得及 1058 | 來講 1059 | 來看 1060 | 攔腰 1061 | 牢牢 1062 | 老 1063 | 老大 1064 | 老老實實 1065 | 老是 1066 | 累次 1067 | 累年 1068 | 理當 1069 | 理該 1070 | 理應 1071 | 歷 1072 | 立 1073 | 立地 1074 | 立刻 1075 | 立馬 1076 | 立時 1077 | 聯袂 1078 | 連連 1079 | 連日 1080 | 連日來 1081 | 連聲 1082 | 連袂 1083 | 臨到 1084 | 另方面 1085 | 另行 1086 | 另一個 1087 | 路經 1088 | 屢 1089 | 屢次 1090 | 屢次三番 1091 | 屢屢 1092 | 縷縷 1093 | 率爾 1094 | 率然 1095 | 略 1096 | 略加 1097 | 略微 1098 | 略為 1099 | 論說 1100 | 馬上 1101 | 蠻 1102 | 滿 1103 | 沒 1104 | 沒有 1105 | 每逢 1106 | 每每 1107 | 每時每刻 1108 | 猛然 1109 | 猛然間 1110 | 莫 1111 | 莫不 1112 | 莫非 1113 | 莫如 1114 | 默默地 1115 | 默然 1116 | 吶 1117 | 那末 1118 | 奈 1119 | 難道 1120 | 難得 1121 | 難怪 1122 | 難說 1123 | 內 1124 | 年覆一年 1125 | 凝神 1126 | 偶而 1127 | 偶爾 1128 | 怕 1129 | 砰 1130 | 碰巧 1131 | 譬如 1132 | 偏偏 1133 | 乒 1134 | 平素 1135 | 頗 1136 | 迫於 1137 | 撲通 1138 | 其後 1139 | 其實 1140 | 奇 1141 | 齊 1142 | 起初 1143 | 起來 1144 | 起首 1145 | 起頭 1146 | 起先 1147 | 豈 1148 | 豈非 1149 | 豈止 1150 | 迄 1151 | 恰逢 1152 | 恰好 1153 | 恰恰 1154 | 恰巧 1155 | 恰如 1156 | 恰似 1157 | 千 1158 | 千萬 1159 | 千萬千萬 1160 | 切 1161 | 切不可 1162 | 切莫 1163 | 切切 1164 | 切勿 1165 | 竊 1166 | 親口 1167 | 親身 1168 | 親手 1169 | 親眼 1170 | 親自 1171 | 頃 1172 | 頃刻 1173 | 頃刻間 1174 | 頃刻之間 1175 | 請勿 1176 | 窮年累月 1177 | 取道 1178 | 去 1179 | 權時 1180 | 全都 1181 | 全力 1182 | 全年 1183 | 全然 1184 | 全身心 1185 | 然 1186 | 人人 1187 | 仍 1188 | 仍舊 1189 | 仍然 
1190 | 日覆一日 1191 | 日見 1192 | 日漸 1193 | 日益 1194 | 日臻 1195 | 如常 1196 | 如此等等 1197 | 如次 1198 | 如今 1199 | 如期 1200 | 如前所述 1201 | 如上 1202 | 如下 1203 | 汝 1204 | 三番兩次 1205 | 三番五次 1206 | 三天兩頭 1207 | 瑟瑟 1208 | 沙沙 1209 | 上 1210 | 上來 1211 | 上去
--------------------------------------------------------------------------------
/語料庫資料前處理/userdict.txt:
--------------------------------------------------------------------------------
1 | 體檢 2 n
--------------------------------------------------------------------------------
/語料庫開放框架及上線部署/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫開放框架及上線部署/.gitkeep
--------------------------------------------------------------------------------
/語料庫開放框架及上線部署/README.md:
--------------------------------------------------------------------------------
1 | Open Corpus Frameworks and Production Deployment (語料庫開放框架及上線部署)
2 |
3 | ---
4 |
--------------------------------------------------------------------------------
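A minimal sketch of how a one-entry-per-line list such as `stopwords.txt` above might be applied to text that has already been segmented into words. The helper names `load_stopwords` and `remove_stopwords` and the sample sentence are assumptions made for illustration. They are not taken from the workshop's `pre_process.ipynb`, and word segmentation itself is left out.

```python
from typing import Iterable, List, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Build a stopword set from an iterable of lines (one entry per line)."""
    # Strip only the line terminator so that whitespace entries (for example
    # a full-width space) survive; completely empty lines are dropped.
    return {line.rstrip('\r\n') for line in lines} - {''}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Return the tokens that are not stopwords, preserving their order."""
    return [token for token in tokens if token not in stopwords]


# A few entries that appear in stopwords.txt. In practice the lines would come
# from a file handle such as open('stopwords.txt', encoding='utf-8').
sample_lines = ['的\n', '了\n', '我們\n', '很\n', '\n']
stopwords = load_stopwords(sample_lines)

# Hypothetical, already-segmented sentence (assumption for illustration).
tokens = ['我們', '今天', '做', '了', '體檢']
filtered = remove_stopwords(tokens, stopwords)
print(filtered)  # → ['今天', '做', '體檢']
```

Using a set keeps each membership check constant-time on average, which matters when a list with over a thousand entries is applied to every token of a corpus.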