├── README.md
├── 文本分析及語料庫統計
│   ├── .gitkeep
│   ├── README.md
│   ├── data
│   │   ├── news.csv
│   │   ├── news_annotation.p
│   │   └── polls.json
│   ├── ipynb
│   │   ├── 1. Fundamentals of Data Analysis.ipynb
│   │   ├── 2. Descriptive Statistics.ipynb
│   │   ├── 3. Models.ipynb
│   │   └── 4. Practice on news annotation.ipynb
│   └── slides
│       ├── 1. Fundamentals of Data Analysis.slides.html
│       ├── 2. Descriptive Statistics.slides.html
│       └── 3. Models.slides.html
├── 語料庫介面設計
│   ├── .gitkeep
│   └── README.md
├── 語料庫標記及語言學分析
│   ├── .gitkeep
│   ├── README.md
│   ├── practice
│   │   ├── .ipynb_checkpoints
│   │   │   └── sentiment_annotation-checkpoint.ipynb
│   │   ├── 2018 語料庫程式實務工作坊_語言學分析&標記實作_學習單.pdf
│   │   ├── negative_words.txt
│   │   ├── new_sample.txt
│   │   ├── positive_words.txt
│   │   ├── readme.txt
│   │   ├── sentiment_annotation.ipynb
│   │   ├── sentiment_annotation.py
│   │   └── sentiment_annotation_sample.csv
│   └── slide
│       └── 20181104_語料庫語言學工作坊.pdf
├── 語料庫爬蟲
│   ├── README.md
│   ├── data
│   │   ├── appledaily.json
│   │   └── han.json
│   └── src
│       ├── applecrawler.py
│       └── 實戰.ipynb
├── 語料庫資料前處理
│   ├── .gitkeep
│   ├── README.md
│   ├── data
│   │   ├── 106_student.csv
│   │   ├── clean_han.json
│   │   └── han.json
│   ├── dict.txt.big
│   ├── pre_process.ipynb
│   ├── stopwords.txt
│   └── userdict.txt
└── 語料庫開放框架及上線部署
    ├── .gitkeep
    ├── Lyrics_analytics.ipynb
    └── README.md
/README.md:
--------------------------------------------------------------------------------
1 | ## COPENS Workshop 2018 語料庫程式實務工作坊
2 |
3 | [Workshop website](http://lope.linguistics.ntu.edu.tw/hocor2018/)
4 |
5 | ### Hands-on Corpus Programming workshop
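6 | Corpora are the basic infrastructure of linguistic research and language technology applications. As corpora grow and research questions become more complex, existing tools no longer suffice, and programming has become a basic skill for researchers.
7 | This workshop aims to give people with introductory Python knowledge a set of specialized programming skills for corpus processing.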
8 |
--------------------------------------------------------------------------------
/文本分析及語料庫統計/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/文本分析及語料庫統計/.gitkeep
--------------------------------------------------------------------------------
/文本分析及語料庫統計/README.md:
--------------------------------------------------------------------------------
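1 | 文本分析及語料庫統計 (Text Analysis and Corpus Statistics)
2 |
3 | ---
4 |
5 | This module has three parts:
6 |
7 | 1. Fundamentals of Data Analysis
8 | How to use Pandas for basic data processing: reading and writing files, inspecting data tables, selecting specific data, sorting, transforming data, and plotting charts.
9 |
10 | 2. Descriptive Statistics
11 | Basic descriptive statistics: frequency, mean, standard deviation, standardization, and correlation coefficients.
12 |
13 | 3. Models
14 | Building models: first a Linear Model for numeric prediction, then a classifier built with Logistic Regression. The session also walks through the scikit-learn package.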
15 |
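The statistics listed in part 2 can be sketched without any dependencies. The block below is a minimal illustration using only the Python standard library, not the notebook's own code (the notebooks use Pandas). The toy lists `providers`, `lengths`, and `mentions` and the helper `pearson` are made up for this example.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev

# Hypothetical toy data (not taken from news.csv): the provider of each
# article, its length in characters, and a keyword count per article.
providers = ["中央社", "風傳媒", "中央社", "民視", "中央社", "風傳媒"]
lengths = [520, 710, 640, 980, 450, 830]
mentions = [1, 3, 2, 5, 0, 4]

# Frequency: the same tally that pandas' Series.value_counts() reports.
freq = Counter(providers).most_common()

# Mean and sample standard deviation (pandas' Series.std() also divides by n - 1).
avg = mean(lengths)
sd = stdev(lengths)

# Standardization: z-scores have mean 0 and standard deviation 1.
z_scores = [(x - avg) / sd for x in lengths]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(lengths, mentions)

print(freq)
print(round(avg, 1), round(sd, 1), round(r, 3))
```

In Pandas, the corresponding calls on a column are `.value_counts()`, `.mean()`, `.std()`, and `.corr()`.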
--------------------------------------------------------------------------------
/文本分析及語料庫統計/data/news_annotation.p:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/文本分析及語料庫統計/data/news_annotation.p
--------------------------------------------------------------------------------
/文本分析及語料庫統計/data/polls.json:
--------------------------------------------------------------------------------
1 | [{"有效樣本": "1077", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/10/22", "支持率": {"其他": "19.9%", "蘇貞昌": "31.0%", "侯友宜": "49.1%"}}, {"有效樣本": "1026", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/10/19", "支持率": {"其他": "20.0%", "蘇貞昌": "30.0%", "侯友宜": "50.0%"}}, {"有效樣本": "1075", "機構": "信傳媒", "訪問主題": "彰化縣長當選人", "時間": "2018/10/18", "支持率": {"魏明谷": "34.5%", "黃文玲": "4.8%", "其他": "27.4%", "王惠美": "33.3%"}}, {"有效樣本": "1009", "機構": "TVBS", "訪問主題": "高雄市長當選人", "時間": "2018/10/17", "支持率": {"韓國瑜": "42.0%", "陳其邁": "35.0%", "其他": "23.0%"}}, {"有效樣本": "1070", "機構": "美麗島電子報", "訪問主題": "台北市長當選人", "時間": "2018/10/15", "支持率": {"柯文哲": "37.5%", "姚文智": "11.3%", "丁守中": "25.0%", "其他": "26.2%"}}, {"有效樣本": "1271", "機構": "ETtoday 東森新聞雲", "訪問主題": "屏東縣長當選人", "時間": "2018/10/15", "支持率": {"潘孟安": "36.4%", "蘇清泉": "24.8%", "其他": "38.8%"}}, {"有效樣本": "803", "機構": "TVBS", "訪問主題": "嘉義市長當選人", "時間": "2018/10/15", "支持率": {"蕭淑麗": "15.0%", "黃敏惠": "35.0%", "其他": "21.0%", "涂醒者": "29.0%"}}, {"有效樣本": "1077", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/10/11", "支持率": {"其他": "17.8%", "蘇貞昌": "32.8%", "侯友宜": "49.4%"}}, {"有效樣本": "985", "機構": "TVBS", "訪問主題": "台東縣長當選人", "時間": "2018/10/11", "支持率": {"饒慶鈴": "40.0%", "鄺麗貞": "5.0%", "其他": "24.0%", "劉櫂豪": "31.0%"}}, {"有效樣本": "830", "機構": "TVBS", "訪問主題": "宜蘭縣長當選人", "時間": "2018/10/09", "支持率": {"林姿妙": "47.0%", "陳歐珀": "19.0%", "其他": "34.0%"}}, {"有效樣本": "1084", "機構": "信傳媒", "訪問主題": "台中市長當選人", "時間": "2018/10/08", "支持率": {"林佳龍": "40.0%", "盧秀燕": "34.6%", "其他": "25.4%"}}, {"有效樣本": "1074", "機構": "聯合報", "訪問主題": "宜蘭縣長當選人", "時間": "2018/10/08", "支持率": {"林姿妙": "42.0%", "陳歐珀": "13.0%", "其他": "45.0%"}}, {"有效樣本": "1078", "機構": "旺旺中時", "訪問主題": "台北市長當選人", "時間": "2018/10/08", "支持率": {"柯文哲": "41.6%", "姚文智": "15.2%", "丁守中": "27.2%", "其他": "16.0%"}}, {"有效樣本": "1533", "機構": "ETtoday 東森新聞雲", "訪問主題": "嘉義縣長當選人", "時間": "2018/10/08", "支持率": {"翁章梁": "29.3%", "其他": "42.9%", "吳育仁": "22.8%", "吳芳銘": "5.0%"}}, {"有效樣本": "1075", "機構": "聯合報", "訪問主題": "嘉義市長當選人", "時間": "2018/10/05", "支持率": {"蕭淑麗": "17.0%", "涂醒哲": "25.0%", 
"黃敏惠": "28.0%", "其他": "30.0%"}}, {"有效樣本": "1069", "機構": "聯合報", "訪問主題": "新竹縣長當選人", "時間": "2018/10/04", "支持率": {"徐欣瑩": "31.0%", "楊文科": "25.0%", "其他": "33.0%", "鄭朝方": "11.0%"}}, {"有效樣本": "1014", "機構": "旺旺中時", "訪問主題": "台東縣長當選人", "時間": "2018/10/04", "支持率": {"饒慶鈴": "38.6%", "鄺麗貞": "3.3%", "其他": "27.8%", "劉櫂豪": "30.3%"}}, {"有效樣本": "1080", "機構": "聯合報", "訪問主題": "彰化縣長當選人", "時間": "2018/10/01", "支持率": {"魏明谷": "24.0%", "其他": "42.0%", "王惠美": "34.0%"}}, {"有效樣本": "1087", "機構": "旺旺中時", "訪問主題": "台中市長當選人", "時間": "2018/10/01", "支持率": {"林佳龍": "38.1%", "盧秀燕": "40.2%", "其他": "21.7%"}}, {"有效樣本": "1068", "機構": "三立電視", "訪問主題": "台北市長當選人", "時間": "2018/10/01", "支持率": {"柯文哲": "36.1%", "姚文智": "11.1%", "丁守中": "25.8%", "其他": "27.0%"}}, {"有效樣本": "1086", "機構": "信傳媒", "訪問主題": "宜蘭縣長當選人", "時間": "2018/10/01", "支持率": {"林姿妙": "40.1%", "陳歐珀": "26.5%", "其他": "33.4%"}}, {"有效樣本": "1501", "機構": "ETtoday 東森新聞雲", "訪問主題": "嘉義市長當選人", "時間": "2018/09/30", "支持率": {"蕭淑麗": "10.7%", "涂醒哲": "26.0%", "黃敏惠": "35.4%", "其他": "27.9%"}}, {"有效樣本": "1072", "機構": "自由時報", "訪問主題": "嘉義市長當選人", "時間": "2018/09/29", "支持率": {"蕭淑麗": "17.9%", "涂醒哲": "30.1%", "黃敏惠": "24.6%", "其他": "27.1%"}}, {"有效樣本": "1070", "機構": "聯合報", "訪問主題": "桃園市長當選人", "時間": "2018/09/27", "支持率": {"楊麗環": "6.0%", "陳學聖": "16.0%", "其他": "27.0%", "鄭文燦": "51.0%"}}, {"有效樣本": "1072", "機構": "聯合報", "訪問主題": "台南市長當選人", "時間": "2018/09/27", "支持率": {"林義豐": "14.0%", "黃偉哲": "31.0%", "其他": "43.0%", "高思博": "12.0%"}}, {"有效樣本": "1086", "機構": "聯合報", "訪問主題": "高雄市長當選人", "時間": "2018/09/25", "支持率": {"韓國瑜": "32.0%", "陳其邁": "34.0%", "其他": "34.0%"}}, {"有效樣本": "1723", "機構": "ETtoday 東森新聞雲", "訪問主題": "宜蘭縣長當選人", "時間": "2018/09/25", "支持率": {"林姿妙": "35.6%", "陳歐珀": "23.1%", "其他": "41.3%"}}, {"有效樣本": "1064", "機構": "自由時報", "訪問主題": "宜蘭縣長當選人", "時間": "2018/09/22", "支持率": {"林姿妙": "37.6%", "陳歐珀": "23.4%", "其他": "39.0%"}}, {"有效樣本": "1071", "機構": "聯合報", "訪問主題": "台北市長當選人", "時間": "2018/09/21", "支持率": {"柯文哲": "37.0%", "姚文智": "8.0%", "丁守中": "29.0%", "其他": "26.0%"}}, {"有效樣本": "1070", "機構": "蘋果日報", "訪問主題": "桃園市長當選人", 
"時間": "2018/09/21", "支持率": {"楊麗環": "5.4%", "陳學聖": "18.5%", "其他": "33.8%", "鄭文燦": "42.3%"}}, {"有效樣本": "1501", "機構": "美麗島電子報", "訪問主題": "新北市長當選人", "時間": "2018/09/21", "支持率": {"其他": "30.1%", "蘇貞昌": "28.2%", "侯友宜": "41.7%"}}, {"有效樣本": "1089", "機構": "美麗島電子報", "訪問主題": "台北市長當選人", "時間": "2018/09/21", "支持率": {"柯文哲": "34.6%", "姚文智": "11.1%", "丁守中": "23.5%", "其他": "30.8%"}}, {"有效樣本": "1076", "機構": "蘋果日報", "訪問主題": "台南市長當選人", "時間": "2018/09/20", "支持率": {"林義豐": "5.9%", "黃偉哲": "23.9%", "其他": "56.4%", "高思博": "13.8%"}}, {"有效樣本": "1074", "機構": "信傳媒", "訪問主題": "高雄市長當選人", "時間": "2018/09/20", "支持率": {"韓國瑜": "35.5%", "陳其邁": "41.2%", "其他": "23.3%"}}, {"有效樣本": "1068", "機構": "聯合報", "訪問主題": "新北市長當選人", "時間": "2018/09/19", "支持率": {"其他": "28.0%", "蘇貞昌": "24.0%", "侯友宜": "48.0%"}}, {"有效樣本": "1073", "機構": "蘋果日報", "訪問主題": "高雄市長當選人", "時間": "2018/09/19", "支持率": {"韓國瑜": "31.2%", "陳其邁": "33.8%", "其他": "35.0%"}}, {"有效樣本": "1068", "機構": "蘋果日報", "訪問主題": "台中市長當選人", "時間": "2018/09/18", "支持率": {"林佳龍": "30.3%", "盧秀燕": "30.9%", "其他": "38.8%"}}, {"有效樣本": "1068", "機構": "蘋果日報", "訪問主題": "台北市長當選人", "時間": "2018/09/17", "支持率": {"柯文哲": "34.9%", "姚文智": "10.4%", "丁守中": "30.8%", "其他": "23.9%"}}, {"有效樣本": "2408", "機構": "ETtoday 東森新聞雲", "訪問主題": "台北市長當選人", "時間": "2018/09/17", "支持率": {"柯文哲": "41.7%", "姚文智": "8.4%", "丁守中": "29.9%", "其他": "20.0%"}}, {"有效樣本": "1001", "機構": "TVBS", "訪問主題": "台北市長當選人", "時間": "2018/09/17", "支持率": {"柯文哲": "37.0%", "姚文智": "11.0%", "丁守中": "32.0%", "其他": "20.0%"}}, {"有效樣本": "1068", "機構": "蘋果日報", "訪問主題": "新北市長當選人", "時間": "2018/09/17", "支持率": {"其他": "30.4%", "蘇貞昌": "29.4%", "侯友宜": "40.2%"}}, {"有效樣本": "1011", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/09/17", "支持率": {"其他": "20.0%", "蘇貞昌": "32.0%", "侯友宜": "48.0%"}}, {"有效樣本": "801", "機構": "TVBS", "訪問主題": "桃園市長當選人", "時間": "2018/09/17", "支持率": {"楊麗環": "6.0%", "陳學聖": "17.0%", "其他": "24.0%", "鄭文燦": "54.0%"}}, {"有效樣本": "1015", "機構": "TVBS", "訪問主題": "台中市長當選人", "時間": "2018/09/17", "支持率": {"林佳龍": "35.0%", "盧秀燕": "38.0%", "其他": "27.0%"}}, {"有效樣本": "859", "機構": 
"TVBS", "訪問主題": "台南市長當選人", "時間": "2018/09/17", "支持率": {"林義豐": "13.0%", "黃偉哲": "33.0%", "其他": "40.0%", "高思博": "14.0%"}}, {"有效樣本": "1019", "機構": "TVBS", "訪問主題": "高雄市長當選人", "時間": "2018/09/17", "支持率": {"韓國瑜": "35.0%", "陳其邁": "39.0%", "其他": "26.0%"}}, {"有效樣本": "1063", "機構": "自由時報", "訪問主題": "彰化縣長當選人", "時間": "2018/09/15", "支持率": {"魏明谷": "31.3%", "其他": "38.2%", "王惠美": "30.5%"}}, {"有效樣本": "1097", "機構": "聯合報", "訪問主題": "台中市長當選人", "時間": "2018/09/13", "支持率": {"林佳龍": "33.0%", "盧秀燕": "34.0%", "其他": "33.0%"}}, {"有效樣本": "1071", "機構": "年代電視", "訪問主題": "新竹縣長當選人", "時間": "2018/09/12", "支持率": {"徐欣瑩": "32.6%", "楊文科": "23.3%", "其他": "35.1%", "鄭朝方": "9.1%"}}, {"有效樣本": "1087", "機構": "ETtoday 東森新聞雲", "訪問主題": "彰化縣長當選人", "時間": "2018/09/10", "支持率": {"魏明谷": "29.6%", "黃文玲": "4.7%", "其他": "36.7%", "王惠美": "29.0%"}}, {"有效樣本": "1082", "機構": "自由時報", "訪問主題": "新北市長當選人", "時間": "2018/09/09", "支持率": {"其他": "29.3%", "蘇貞昌": "30.9%", "侯友宜": "39.9%"}}, {"有效樣本": "1107", "機構": "風傳媒", "訪問主題": "新竹市長當選人", "時間": "2018/09/05", "支持率": {"其他": "30.0%", "林智堅": "48.5%", "謝文進": "7.9%", "許明財": "13.6%"}}, {"有效樣本": "1070", "機構": "信傳媒", "訪問主題": "嘉義市長當選人", "時間": "2018/09/03", "支持率": {"蕭淑麗": "18.9%", "涂醒哲": "39.3%", "黃敏惠": "24.8%", "其他": "17.0%"}}, {"有效樣本": "1685", "機構": "ETtoday 東森新聞雲", "訪問主題": "高雄市長當選人", "時間": "2018/09/03", "支持率": {"韓國瑜": "31.0%", "陳其邁": "36.7%", "其他": "32.3%"}}, {"有效樣本": "1070", "機構": "自由時報", "訪問主題": "台北市長當選人", "時間": "2018/09/01", "支持率": {"柯文哲": "33.4%", "姚文智": "14.9%", "丁守中": "24.7%", "其他": "27.0%"}}, {"有效樣本": "1068", "機構": "TVBS", "訪問主題": "新竹縣長當選人", "時間": "2018/08/31", "支持率": {"徐欣瑩": "28.0%", "楊文科": "25.0%", "其他": "34.0%", "鄭朝方": "13.0%"}}, {"有效樣本": "1748", "機構": "ETtoday 東森新聞雲", "訪問主題": "台南市長當選人", "時間": "2018/08/27", "支持率": {"林義豐": "14.1%", "黃偉哲": "31.3%", "其他": "38.7%", "高思博": "15.9%"}}, {"有效樣本": "1069", "機構": "自由時報", "訪問主題": "台中市長當選人", "時間": "2018/08/25", "支持率": {"林佳龍": "38.5%", "盧秀燕": "32.4%", "其他": "39.1%"}}, {"有效樣本": "1580", "機構": "ETtoday 東森新聞雲", "訪問主題": "桃園市長當選人", "時間": "2018/08/20", "支持率": {"陳學聖": 
"23.8%", "其他": "25.5%", "鄭文燦": "50.7%"}}, {"有效樣本": "1069", "機構": "自由時報", "訪問主題": "台南市長當選人", "時間": "2018/08/18", "支持率": {"林義豐": "9.7%", "黃偉哲": "40.6%", "其他": "38.1%", "高思博": "11.6%"}}, {"有效樣本": "1082", "機構": "美麗島電子報", "訪問主題": "高雄市長當選人", "時間": "2018/08/17", "支持率": {"韓國瑜": "27.8%", "陳其邁": "38.9%", "其他": "33.3%"}}, {"有效樣本": "1607", "機構": "ETtoday 東森新聞雲", "訪問主題": "台中市長當選人", "時間": "2018/08/14", "支持率": {"林佳龍": "31.0%", "盧秀燕": "35.6%", "其他": "33.4%"}}, {"有效樣本": "1753", "機構": "ETtoday 東森新聞雲", "訪問主題": "新北市長當選人", "時間": "2018/08/10", "支持率": {"其他": "30.6%", "蘇貞昌": "24.2%", "侯友宜": "45.2%"}}, {"有效樣本": "1088", "機構": "NOWnews今日新聞", "訪問主題": "台中市長當選人", "時間": "2018/08/10", "支持率": {"林佳龍": "35.8%", "盧秀燕": "38.0%", "其他": "26.2%"}}, {"有效樣本": "1022", "機構": "自由時報", "訪問主題": "高雄市長當選人", "時間": "2018/08/04", "支持率": {"韓國瑜": "26.3%", "陳其邁": "38.5%", "其他": "35.2%"}}, {"有效樣本": "1100", "機構": "信傳媒", "訪問主題": "台中市長當選人", "時間": "2018/08/01", "支持率": {"林佳龍": "43.9%", "盧秀燕": "31.1%", "其他": "25.0%"}}, {"有效樣本": "1857", "機構": "ETtoday 東森新聞雲", "訪問主題": "台北市長當選人", "時間": "2018/07/30", "支持率": {"柯文哲": "42.0%", "姚文智": "5.4%", "丁守中": "31.0%", "其他": "21.6%"}}, {"有效樣本": "1376", "機構": "ETtoday 東森新聞雲", "訪問主題": "新竹縣長當選人", "時間": "2018/07/26", "支持率": {"徐欣瑩": "27.5%", "楊文科": "23.6%", "其他": "24.3%", "林為洲": "18.1%"}}, {"有效樣本": "870", "機構": "TVBS", "訪問主題": "台北市長當選人", "時間": "2018/07/24", "支持率": {"柯文哲": "40.0%", "姚文智": "11.0%", "丁守中": "30.0%", "其他": "19.0%"}}, {"有效樣本": "1030", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/07/24", "支持率": {"其他": "23.0%", "蘇貞昌": "29.0%", "侯友宜": "48.0%"}}, {"有效樣本": "803", "機構": "TVBS", "訪問主題": "桃園市長當選人", "時間": "2018/07/24", "支持率": {"陳學聖": "20.0%", "其他": "24.0%", "鄭文燦": "56.0%"}}, {"有效樣本": "1077", "機構": "TVBS", "訪問主題": "台中市長當選人", "時間": "2018/07/24", "支持率": {"林佳龍": "33.0%", "盧秀燕": "39.0%", "其他": "28.0%"}}, {"有效樣本": "833", "機構": "TVBS", "訪問主題": "台南市長當選人", "時間": "2018/07/24", "支持率": {"林義豐": "7.0%", "黃偉哲": "41.0%", "其他": "37.0%", "高思博": "15.0%"}}, {"有效樣本": "1012", "機構": "TVBS", "訪問主題": "高雄市長當選人", "時間": 
"2018/07/24", "支持率": {"韓國瑜": "32.0%", "陳其邁": "40.0%", "其他": "28.0%"}}, {"有效樣本": "1070", "機構": "信傳媒", "訪問主題": "新北市長當選人", "時間": "2018/07/23", "支持率": {"其他": "22.6%", "蘇貞昌": "31.1%", "侯友宜": "46.3%"}}, {"有效樣本": "2607", "機構": "ETtoday 東森新聞雲", "訪問主題": "連江縣長當選人", "時間": "2018/07/23", "支持率": {"劉增應": "34.1%", "蘇柏豪": "27.8%", "其他": "38.1%"}}, {"有效樣本": "2607", "機構": "ETtoday 東森新聞雲", "訪問主題": "澎湖縣長當選人", "時間": "2018/07/23", "支持率": {"陳光復": "20.1%", "賴峰偉": "29.1%", "其他": "50.8%"}}, {"有效樣本": "2607", "機構": "ETtoday 東森新聞雲", "訪問主題": "金門縣長當選人", "時間": "2018/07/23", "支持率": {"其他": "47.4%", "陳福海": "23.0%", "楊鎮浯": "28.0%"}}, {"有效樣本": "2093", "機構": "ETtoday 東森新聞雲", "訪問主題": "苗栗縣長當選人", "時間": "2018/07/20", "支持率": {"徐定禎": "16.5%", "徐耀昌": "35.9%", "其他": "47.6%"}}, {"有效樣本": "1074", "機構": "三立電視", "訪問主題": "金門縣長當選人", "時間": "2018/07/17", "支持率": {"其他": "34.7%", "陳福海": "30.6%", "楊鎮浯": "32.7%"}}, {"有效樣本": "0", "機構": "年代電視", "訪問主題": "台南市長當選人", "時間": "2018/07/15", "支持率": {"黃偉哲": "36.6%", "其他": "50.4%", "高思博": "13.0%"}}, {"有效樣本": "1082", "機構": "美麗島電子報", "訪問主題": "台北市長當選人", "時間": "2018/07/12", "支持率": {"柯文哲": "38.7%", "姚文智": "16.4%", "丁守中": "30.5%", "其他": "14.4%"}}, {"有效樣本": "1070", "機構": "信傳媒", "訪問主題": "台北市長當選人", "時間": "2018/07/11", "支持率": {"柯文哲": "38.9%", "姚文智": "11.0%", "丁守中": "27.5%", "其他": "22.6%"}}, {"有效樣本": "2315", "機構": "ETtoday 東森新聞雲", "訪問主題": "新竹縣長當選人", "時間": "2018/07/11", "支持率": {"徐欣瑩": "19.7%", "楊文科": "26.3%", "其他": "44.3%", "鄭朝方": "9.7%"}}, {"有效樣本": "1070", "機構": "旺旺中時", "訪問主題": "新竹市長當選人", "時間": "2018/07/04", "支持率": {"其他": "22.8%", "林智堅": "51.4%", "謝文進": "9.5%", "許明財": "16.3%"}}, {"有效樣本": "2520", "機構": "TVBS", "訪問主題": "台中市長當選人", "時間": "2018/06/29", "支持率": {"林佳龍": "33.0%", "盧秀燕": "39.7%", "其他": "27.3%"}}, {"有效樣本": "1024", "機構": "TVBS", "訪問主題": "新竹縣長當選人", "時間": "2018/06/29", "支持率": {"徐欣瑩": "27.0%", "楊文科": "20.0%", "其他": "21.0%", "林為洲": "24.0%"}}, {"有效樣本": "3273", "機構": "ETtoday 東森新聞雲", "訪問主題": "屏東縣長當選人", "時間": "2018/06/27", "支持率": {"潘孟安": "37.9%", "蘇清泉": "20.9%", "其他": "41.2%"}}, {"有效樣本": "907", "機構": 
"TVBS", "訪問主題": "新北市長當選人", "時間": "2018/06/26", "支持率": {"其他": "21.0%", "蘇貞昌": "31.0%", "侯友宜": "48.0%"}}, {"有效樣本": "1072", "機構": "旺旺中時", "訪問主題": "基隆市長當選人", "時間": "2018/06/26", "支持率": {"謝立功": "15.9%", "其他": "26.8%", "林右昌": "50.0%"}}, {"有效樣本": "853", "機構": "TVBS", "訪問主題": "桃園市長當選人", "時間": "2018/06/23", "支持率": {"陳學聖": "21.0%", "其他": "18.0%", "鄭文燦": "61.0%"}}, {"有效樣本": "1069", "機構": "旺旺中時", "訪問主題": "新竹縣長當選人", "時間": "2018/06/22", "支持率": {"徐欣瑩": "18.3%", "楊文科": "21.0%", "其他": "56.2%", "鄭朝方": "4.5%"}}, {"有效樣本": "1070", "機構": "年代電視", "訪問主題": "新竹縣長當選人", "時間": "2018/06/22", "支持率": {"徐欣瑩": "40.2%", "楊文科": "23.3%", "其他": "27.7%", "鄭朝方": "8.8%"}}, {"有效樣本": "3435", "機構": "ETtoday 東森新聞雲", "訪問主題": "台東縣長當選人", "時間": "2018/06/22", "支持率": {"饒慶鈴": "34.1%", "其他": "38.1%", "劉櫂豪": "27.8%"}}, {"有效樣本": "N/A", "機構": "三立電視", "訪問主題": "新竹市長當選人", "時間": "2018/06/21", "支持率": {"其他": "16.0%", "林智堅": "55.0%", "謝文進": "14.0%", "許明財": "15.0%"}}, {"有效樣本": "1075", "機構": "旺旺中時", "訪問主題": "新竹縣長當選人", "時間": "2018/06/19", "支持率": {"徐欣瑩": "20.1%", "楊文科": "27.0%", "其他": "48.2%", "鄭朝方": "4.7%"}}, {"有效樣本": "1217", "機構": "自由時報", "訪問主題": "新竹市長當選人", "時間": "2018/06/19", "支持率": {"其他": "22.0%", "林智堅": "55.8%", "謝文進": "7.7%", "許明財": "14.5%"}}, {"有效樣本": "1075", "機構": "美麗島電子報", "訪問主題": "新北市長當選人", "時間": "2018/06/15", "支持率": {"其他": "34.7%", "蘇貞昌": "23.6%", "侯友宜": "41.7%"}}, {"有效樣本": "1869", "機構": "ETtoday 東森新聞雲", "訪問主題": "新竹市長當選人", "時間": "2018/06/14", "支持率": {"其他": "39.0%", "林智堅": "38.0%", "謝文進": "6.0%", "許明財": "17.0%"}}, {"有效樣本": "1071", "機構": "旺旺中時", "訪問主題": "苗栗縣長當選人", "時間": "2018/06/13", "支持率": {"徐定禎": "17.6%", "徐耀昌": "56.9%", "其他": "25.5%"}}, {"有效樣本": "3435", "機構": "ETtoday 東森新聞雲", "訪問主題": "花蓮縣長當選人", "時間": "2018/06/13", "支持率": {"劉曉玫": "19.0%", "徐榛蔚": "44.4%", "其他": "36.6%"}}, {"有效樣本": "3453", "機構": "ETtoday 東森新聞雲", "訪問主題": "宜蘭縣長當選人", "時間": "2018/06/12", "支持率": {"林姿妙": "41.2%", "陳歐珀": "15.8%", "其他": "43.0%"}}, {"有效樣本": "1007", "機構": "旺旺中時", "訪問主題": "桃園市長當選人", "時間": "2018/06/12", "支持率": {"陳學聖": "24.9%", "其他": "23.6%", "鄭文燦": 
"51.5%"}}, {"有效樣本": "1009", "機構": "旺旺中時", "訪問主題": "台南市長當選人", "時間": "2018/06/08", "支持率": {"黃偉哲": "40.3%", "其他": "42.4%", "高思博": "17.3%"}}, {"有效樣本": "3518", "機構": "ETtoday 東森新聞雲", "訪問主題": "嘉義縣長當選人", "時間": "2018/06/04", "支持率": {"翁章梁": "34.3%", "其他": "50.4%", "吳育仁": "15.3%"}}, {"有效樣本": "3518", "機構": "ETtoday 東森新聞雲", "訪問主題": "嘉義市長當選人", "時間": "2018/06/04", "支持率": {"蕭淑麗": "18.0%", "涂醒哲": "18.2%", "黃敏惠": "31.2%", "其他": "32.6%"}}, {"有效樣本": "1195", "機構": "TVBS", "訪問主題": "台北市長當選人", "時間": "2018/05/31", "支持率": {"柯文哲": "31.0%", "姚文智": "13.0%", "丁守中": "33.0%", "其他": "23.0%"}}, {"有效樣本": "3518", "機構": "ETtoday 東森新聞雲", "訪問主題": "雲林縣長當選人", "時間": "2018/05/31", "支持率": {"張麗善": "19.6%", "其他": "47.6%", "李進勇": "32.8%"}}, {"有效樣本": "1079", "機構": "蘋果日報", "訪問主題": "台北市長當選人", "時間": "2018/05/29", "支持率": {"柯文哲": "29.0%", "姚文智": "13.5%", "丁守中": "29.1%", "其他": "28.4%"}}, {"有效樣本": "3518", "機構": "ETtoday 東森新聞雲", "訪問主題": "台南市長當選人", "時間": "2018/05/28", "支持率": {"林義豐": "2.9%", "黃偉哲": "31.2%", "其他": "50.3%", "高思博": "18.5%"}}, {"有效樣本": "1001", "機構": "聯合報", "訪問主題": "台中市長當選人", "時間": "2018/05/28", "支持率": {"林佳龍": "32.0%", "盧秀燕": "39.0%", "其他": "29.0%"}}, {"有效樣本": "1010", "機構": "旺旺中時", "訪問主題": "高雄市長當選人", "時間": "2018/05/27", "支持率": {"韓國瑜": "33.0%", "陳其邁": "39.5%", "其他": "27.5%"}}, {"有效樣本": "819", "機構": "聯合報", "訪問主題": "台北市長當選人", "時間": "2018/05/19", "支持率": {"柯文哲": "38.0%", "姚文智": "8.0%", "丁守中": "39.0%", "其他": "15.0%"}}, {"有效樣本": "1848", "機構": "ETtoday 東森新聞雲", "訪問主題": "台北市長當選人", "時間": "2018/05/17", "支持率": {"柯文哲": "36.4%", "姚文智": "13.4%", "丁守中": "37.5%", "其他": "12.7%"}}, {"有效樣本": "1073", "機構": "美麗島電子報", "訪問主題": "台北市長當選人", "時間": "2018/05/14", "支持率": {"柯文哲": "35.2%", "姚文智": "15.0%", "丁守中": "33.1%", "其他": "16.7%"}}, {"有效樣本": "3273", "機構": "ETtoday 東森新聞雲", "訪問主題": "高雄市長當選人", "時間": "2018/05/04", "支持率": {"韓國瑜": "30.4%", "陳其邁": "32.0%", "其他": "37.6%"}}, {"有效樣本": "1003", "機構": "美麗島電子報", "訪問主題": "台中市長當選人", "時間": "2018/05/03", "支持率": {"林佳龍": "33.0%", "盧秀燕": "35.7%", "其他": "31.3%"}}, {"有效樣本": "980", "機構": "聯合報", "訪問主題": 
"新北市長當選人", "時間": "2018/05/01", "支持率": {"其他": "29.0%", "蘇貞昌": "26.0%", "侯友宜": "45.0%"}}, {"有效樣本": "1083", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/05/01", "支持率": {"其他": "22.3%", "蘇貞昌": "27.1%", "侯友宜": "50.6%"}}, {"有效樣本": "1081", "機構": "大社會民調", "訪問主題": "嘉義市長當選人", "時間": "2018/04/18", "支持率": {"蕭淑麗": "15.1%", "涂醒哲": "21.9%", "黃敏惠": "34.4%", "其他": "0.0%"}}, {"有效樣本": "1909", "機構": "ETtoday 東森新聞雲", "訪問主題": "新北市長當選人", "時間": "2018/04/13", "支持率": {"其他": "30.9%", "蘇貞昌": "22.7%", "侯友宜": "46.4%"}}, {"有效樣本": "957", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2018/04/12", "支持率": {"其他": "28.0%", "蘇貞昌": "32.0%", "侯友宜": "40.0%"}}, {"有效樣本": "1896", "機構": "ETtoday 東森新聞雲", "訪問主題": "新竹市長當選人", "時間": "2018/04/12", "支持率": {"其他": "36.6%", "林智堅": "51.4%", "許明財": "12.0%"}}, {"有效樣本": "1841", "機構": "ETtoday 東森新聞雲", "訪問主題": "桃園市長當選人", "時間": "2018/04/11", "支持率": {"陳學聖": "31.7%", "其他": "21.6%", "鄭文燦": "46.7%"}}, {"有效樣本": "N/A", "機構": "旺旺中時", "訪問主題": "彰化縣長當選人", "時間": "2018/04/10", "支持率": {"魏明谷": "39.9%", "其他": "34.4%", "王惠美": "25.7%"}}, {"有效樣本": "0", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/04/07", "支持率": {"其他": "19.3%", "蘇貞昌": "26.4%", "侯友宜": "54.3%"}}, {"有效樣本": "541", "機構": "蘋果日報", "訪問主題": "新北市長當選人", "時間": "2018/04/07", "支持率": {"其他": "11.1%", "蘇貞昌": "27.5%", "侯友宜": "61.4%"}}, {"有效樣本": "2016", "機構": "ETtoday 東森新聞雲", "訪問主題": "基隆市長當選人", "時間": "2018/03/30", "支持率": {"謝立功": "24.0%", "其他": "31.9%", "林右昌": "44.1%"}}, {"有效樣本": "1084", "機構": "旺旺中時", "訪問主題": "南投縣長當選人", "時間": "2018/03/28", "支持率": {"其他": "53.2%", "洪國浩": "8.9%", "林明溱": "37.9%"}}, {"有效樣本": "1071", "機構": "旺旺中時", "訪問主題": "花蓮縣長當選人", "時間": "2018/03/25", "支持率": {"劉曉玫": "17.0%", "徐榛蔚": "40.1%", "其他": "42.9%"}}, {"有效樣本": "3240", "機構": "ETtoday 東森新聞雲", "訪問主題": "彰化縣長當選人", "時間": "2018/03/19", "支持率": {"魏明谷": "31.9%", "其他": "39.8%", "王惠美": "28.3%"}}, {"有效樣本": "3240", "機構": "ETtoday 東森新聞雲", "訪問主題": "南投縣長當選人", "時間": "2018/03/19", "支持率": {"其他": "28.9%", "洪國浩": "11.7%", "林明溱": "59.4%"}}, {"有效樣本": "2259", "機構": "ETtoday 東森新聞雲", "訪問主題": "台中市長當選人", "時間": 
"2018/03/05", "支持率": {"林佳龍": "32.6%", "盧秀燕": "32.0%", "其他": "35.4%"}}, {"有效樣本": "1430", "機構": "ETtoday 東森新聞雲", "訪問主題": "新北市長當選人", "時間": "2018/03/02", "支持率": {"其他": "21.2%", "蘇貞昌": "27.4%", "侯友宜": "51.4%"}}, {"有效樣本": "1007", "機構": "旺旺中時", "訪問主題": "新北市長當選人", "時間": "2018/02/26", "支持率": {"其他": "12.6%", "蘇貞昌": "33.6%", "侯友宜": "53.8%"}}, {"有效樣本": "1010", "機構": "旺旺中時", "訪問主題": "台中市長當選人", "時間": "2018/02/13", "支持率": {"林佳龍": "36.4%", "盧秀燕": "30.6%", "其他": "33.0%"}}, {"有效樣本": "1073", "機構": "美麗島電子報", "訪問主題": "台中市長當選人", "時間": "2018/02/09", "支持率": {"林佳龍": "35.8%", "盧秀燕": "37.6%", "其他": "26.6%"}}, {"有效樣本": "1073", "機構": "旺旺中時", "訪問主題": "新竹市長當選人", "時間": "2018/01/26", "支持率": {"其他": "10.6%", "林智堅": "56.4%", "許明財": "33.0%"}}, {"有效樣本": "1068", "機構": "年代電視", "訪問主題": "澎湖縣長當選人", "時間": "2018/01/18", "支持率": {"陳光復": "26.4%", "賴峰偉": "39.4%", "其他": "34.2%"}}, {"有效樣本": "1073", "機構": "年代電視", "訪問主題": "嘉義市長當選人", "時間": "2018/01/16", "支持率": {"其他": "24.7%", "涂醒哲": "22.1%", "黃敏惠": "53.2%"}}, {"有效樣本": "1070", "機構": "美麗島電子報", "訪問主題": "新北市長當選人", "時間": "2018/01/15", "支持率": {"其他": "16.8%", "蘇貞昌": "41.7%", "侯友宜": "41.5%"}}, {"有效樣本": "N/A", "機構": "三立電視", "訪問主題": "基隆市長當選人", "時間": "2018/01/15", "支持率": {"謝立功": "15.2%", "其他": "26.8%", "林右昌": "58.0%"}}, {"有效樣本": "1069", "機構": "年代電視", "訪問主題": "新竹縣長當選人", "時間": "2018/01/04", "支持率": {"徐欣瑩": "32.2%", "楊文科": "23.6%", "其他": "31.6%", "鄭朝方": "12.6%"}}, {"有效樣本": "1069", "機構": "年代電視", "訪問主題": "彰化縣長當選人", "時間": "2018/01/04", "支持率": {"魏明谷": "26.3%", "其他": "37.8%", "王惠美": "35.9%"}}, {"有效樣本": "1070", "機構": "年代電視", "訪問主題": "台中市長當選人", "時間": "2017/12/28", "支持率": {"林佳龍": "36.1%", "盧秀燕": "35.1%", "其他": "28.8%"}}, {"有效樣本": "1070", "機構": "年代電視", "訪問主題": "宜蘭縣長當選人", "時間": "2017/12/27", "支持率": {"林姿妙": "45.9%", "陳歐珀": "23.4%", "其他": "30.7%"}}, {"有效樣本": "1068", "機構": "風傳媒", "訪問主題": "桃園市長當選人", "時間": "2017/12/13", "支持率": {"楊麗環": "4.8%", "陳學聖": "4.7%", "其他": "36.9%", "鄭文燦": "53.6%"}}, {"有效樣本": "897", "機構": "TVBS", "訪問主題": "新北市長當選人", "時間": "2017/12/06", "支持率": {"其他": "13.0%", "蘇貞昌": "39.0%", 
"侯友宜": "48.0%"}}, {"有效樣本": "1081", "機構": "旺旺中時", "訪問主題": "新竹市長當選人", "時間": "2017/11/14", "支持率": {"其他": "20.3%", "林智堅": "59.9%", "許明財": "19.8%"}}, {"有效樣本": "1071", "機構": "美麗島電子報", "訪問主題": "台中市長當選人", "時間": "2017/11/10", "支持率": {"林佳龍": "37.4%", "盧秀燕": "33.1%", "其他": "29.5%"}}, {"有效樣本": "890", "機構": "TVBS", "訪問主題": "高雄市長當選人", "時間": "2017/08/15", "支持率": {"韓國瑜": "31.0%", "陳其邁": "59.0%", "其他": "10.0%"}}, {"有效樣本": "811", "機構": "TVBS", "訪問主題": "宜蘭縣長當選人", "時間": "2017/07/19", "支持率": {"林姿妙": "57.0%", "陳歐珀": "23.0%", "其他": "20.0%"}}]
--------------------------------------------------------------------------------
/文本分析及語料庫統計/ipynb/2. Descriptive Statistics.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "slideshow": {
7 | "slide_type": "slide"
8 | }
9 | },
10 | "source": [
11 | "# 2. Descriptive Statistics\n",
12 | "## Outline\n",
13 | "* [Frequency](#frequency)\n",
14 | "* [Measures of central tendency](#measuresOfCentralTendency)\n",
15 | "* [Measures of dispersion](#measuresOfDispersion)\n",
16 | "* [Normalization and Standardization](#normalizationAndStandardization)\n",
17 | "* [Coefficients of correlation](#coefficientsOfCorrelation)"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {
23 | "slideshow": {
24 | "slide_type": "slide"
25 | }
26 | },
27 | "source": [
28 | "## Frequency\n",
29 | "\n",
30 | "We often need to count how frequently values occur in the data; `.value_counts()` counts the number of occurrences of each value in a column."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": 1,
36 | "metadata": {
37 | "slideshow": {
38 | "slide_type": "fragment"
39 | }
40 | },
41 | "outputs": [
42 | {
43 | "data": {
115 | "text/plain": [
116 | " title \\\n",
117 | "0 「把25年前韓國瑜打人事件當英雄看」陳水扁批:吳敦義「災難政治學」的表現 \n",
118 | "1 【Yahoo論壇/林青弘】柯文哲是否一再說謊? \n",
119 | "2 【Yahoo論壇】民進黨誰最怕陳其邁落選? \n",
120 | "3 抽中籤王 韓國瑜車隊掃街 民眾路邊紛比讚 \n",
121 | "4 百年土地公上香祈福 陳學聖提五不原則 \n",
122 | "\n",
123 | " content \\\n",
124 | "0 國民黨主席吳敦義日前提到高雄市長候選人韓國瑜過去打陳水扁,表示「很認同跟敬佩」並形容「允文允... \n",
125 | "1 柯文哲市長在台北市北投區七星公園造勢,行動競選總部的大卡車開進公園,違規臨停。競辦被開罰六張... \n",
126 | "2 讀者投書:廖念漢(現任奇策盟文宣部主任、曾任海巡署專聘講師)\\n 《長平之戰》是戰國時代最戲... \n",
127 | "3 國民黨高雄市長候選人韓國瑜聲勢上漲,又抽中一號籤王,心情相當興奮,立即展開掃街拜,經過的地方... \n",
128 | "4 【綜合報導】普悠瑪列車出軌意外舉國震驚如同國難,令社會大眾、競選團隊及陳學聖本人都感到十分沉... \n",
129 | "\n",
130 | " time provider \\\n",
131 | "0 2018-10-22 12:16:02+08:00 風傳媒 \n",
132 | "1 2018-10-22 14:00:26+08:00 林青弘 \n",
133 | "2 2018-10-22 13:57:44+08:00 讀者投書 \n",
134 | "3 2018-10-22 13:32:00+08:00 EBC東森新聞 \n",
135 | "4 2018-10-22 13:17:44+08:00 民眾日報 \n",
136 | "\n",
137 | " url \n",
138 | "0 https://tw.news.yahoo.com/把25年前韓國瑜打人事件當英雄看-陳水扁... \n",
139 | "1 https://tw.news.yahoo.com/【yahoo論壇%EF%BC%8F林青弘... \n",
140 | "2 https://tw.news.yahoo.com/【yahoo論壇】民進黨誰最怕陳其邁落選... \n",
141 | "3 https://tw.news.yahoo.com/抽中籤王-韓國瑜車隊掃街-民眾路邊紛比讚... \n",
142 | "4 https://tw.news.yahoo.com/百年土地公上香祈福-陳學聖提五不原則-0... "
143 | ]
144 | },
145 | "execution_count": 1,
146 | "metadata": {},
147 | "output_type": "execute_result"
148 | }
149 | ],
150 | "source": [
151 | "import pandas as pd\n",
152 | "from pathlib import Path\n",
153 | "data_folder = Path(\"../data/\")\n",
154 | "\n",
155 | "news = pd.read_csv(data_folder / \"news.csv\")\n",
156 | "news.head()"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 2,
162 | "metadata": {
163 | "slideshow": {
164 | "slide_type": "subslide"
165 | }
166 | },
167 | "outputs": [
168 | {
169 | "data": {
170 | "text/plain": [
171 | "中央社 14\n",
172 | "聯合新聞網 13\n",
173 | "今日新聞NOWnews 11\n",
174 | "新頭殼 11\n",
175 | "風傳媒 11\n",
176 | "三立新聞網 setn.com 10\n",
177 | "民眾日報 10\n",
178 | "TVBS新聞網 9\n",
179 | "民視 7\n",
180 | "台灣好新聞報 6\n",
181 | "EBC東森新聞 6\n",
182 | "華視 3\n",
183 | "中華日報 2\n",
184 | "信傳媒 2\n",
185 | "讀者投書 1\n",
186 | "壹電視影音 1\n",
187 | "林青弘 1\n",
188 | "上報 1\n",
189 | "詹為元 1\n",
190 | "Name: provider, dtype: int64"
191 | ]
192 | },
193 | "execution_count": 2,
194 | "metadata": {},
195 | "output_type": "execute_result"
196 | }
197 | ],
198 | "source": [
199 | "news['provider'].value_counts()"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": 3,
205 | "metadata": {
206 | "slideshow": {
207 | "slide_type": "subslide"
208 | }
209 | },
210 | "outputs": [
211 | {
212 | "data": {
213 | "text/plain": [
214 | "False 84\n",
215 | "True 36\n",
216 | "Name: 柯文哲, dtype: int64"
217 | ]
218 | },
219 | "execution_count": 3,
220 | "metadata": {},
221 | "output_type": "execute_result"
222 | }
223 | ],
224 | "source": [
225 | "word = '柯文哲' \n",
226 | "news[word] = [word in text for text in news.content]\n",
227 | "news[word].value_counts()"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 4,
233 | "metadata": {
234 | "slideshow": {
235 | "slide_type": "fragment"
236 | }
237 | },
238 | "outputs": [
239 | {
240 | "data": {
284 | "text/plain": [
285 | "姚文智 False True \n",
286 | "柯文哲 \n",
287 | "False 76 8\n",
288 | "True 12 24"
289 | ]
290 | },
291 | "execution_count": 4,
292 | "metadata": {},
293 | "output_type": "execute_result"
294 | }
295 | ],
296 | "source": [
297 | "word = '姚文智' \n",
298 | "news[word] = [word in text for text in news.content]\n",
299 | "pd.crosstab(news[\"柯文哲\"], news[\"姚文智\"])"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": 5,
305 | "metadata": {
306 | "slideshow": {
307 | "slide_type": "subslide"
308 | }
309 | },
310 | "outputs": [
311 | {
312 | "data": {
313 | "text/plain": [
314 | "0 42\n",
315 | "1 29\n",
316 | "2 17\n",
317 | "3 12\n",
318 | "5 7\n",
319 | "4 5\n",
320 | "6 3\n",
321 | "9 2\n",
322 | "11 1\n",
323 | "8 1\n",
324 | "7 1\n",
325 | "Name: 民進黨, dtype: int64"
326 | ]
327 | },
328 | "execution_count": 5,
329 | "metadata": {},
330 | "output_type": "execute_result"
331 | }
332 | ],
333 | "source": [
334 | "word = '民進黨'\n",
335 | "news[word] = [text.count(word) for text in news.content]\n",
336 | "news[word].value_counts()"
337 | ]
338 | },
339 | {
340 | "cell_type": "markdown",
341 | "metadata": {
342 | "slideshow": {
343 | "slide_type": "slide"
344 | }
345 | },
346 | "source": [
347 | "## Measures of central tendency\n",
348 | "Use `.mode()` to get the mode, `.median()` to get the median, and `.mean()` to get the mean."
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": 6,
354 | "metadata": {
355 | "slideshow": {
356 | "slide_type": "fragment"
357 | }
358 | },
359 | "outputs": [
360 | {
361 | "data": {
362 | "text/plain": [
363 | "0 中央社\n",
364 | "dtype: object"
365 | ]
366 | },
367 | "execution_count": 6,
368 | "metadata": {},
369 | "output_type": "execute_result"
370 | }
371 | ],
372 | "source": [
373 | "# mode\n",
374 | "news['provider'].mode()"
375 | ]
376 | },
377 | {
378 | "cell_type": "code",
379 | "execution_count": 7,
380 | "metadata": {
381 | "slideshow": {
382 | "slide_type": "subslide"
383 | }
384 | },
385 | "outputs": [],
386 | "source": [
387 | "# count the news length\n",
388 | "news['length'] = news['content'].apply(len)"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": 8,
394 | "metadata": {
395 | "slideshow": {
396 | "slide_type": "fragment"
397 | }
398 | },
399 | "outputs": [
400 | {
401 | "data": {
402 | "text/plain": [
403 | "680.5"
404 | ]
405 | },
406 | "execution_count": 8,
407 | "metadata": {},
408 | "output_type": "execute_result"
409 | }
410 | ],
411 | "source": [
412 | "# median\n",
413 | "news['length'].median()"
414 | ]
415 | },
416 | {
417 | "cell_type": "code",
418 | "execution_count": 9,
419 | "metadata": {
420 | "scrolled": true,
421 | "slideshow": {
422 | "slide_type": "fragment"
423 | }
424 | },
425 | "outputs": [
426 | {
427 | "data": {
428 | "text/plain": [
429 | "699.4833333333333"
430 | ]
431 | },
432 | "execution_count": 9,
433 | "metadata": {},
434 | "output_type": "execute_result"
435 | }
436 | ],
437 | "source": [
438 | "# mean\n",
439 | "news['length'].mean()"
440 | ]
441 | },
442 | {
443 | "cell_type": "markdown",
444 | "metadata": {
445 | "slideshow": {
446 | "slide_type": "slide"
447 | }
448 | },
449 | "source": [
450 | "### Measures of dispersion\n",
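451 | "Use `.max()` and `.min()` to get the maximum and minimum; their difference is the range. \n",
452 | "Use `.quantile()` to get quantiles, `.std()` to get the standard deviation, and `.var()` to get the variance. \n",
453 | "`.describe()` gives summary statistics for the numeric columns of a table, including the mean, standard deviation, minimum and maximum, median, and quartiles."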
454 | ]
455 | },
456 | {
457 | "cell_type": "code",
458 | "execution_count": 10,
459 | "metadata": {
460 | "slideshow": {
461 | "slide_type": "fragment"
462 | }
463 | },
464 | "outputs": [
465 | {
466 | "data": {
467 | "text/plain": [
468 | "1885"
469 | ]
470 | },
471 | "execution_count": 10,
472 | "metadata": {},
473 | "output_type": "execute_result"
474 | }
475 | ],
476 | "source": [
477 | "# range\n",
478 | "news.length.max() - news.length.min()"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": 11,
484 | "metadata": {
485 | "slideshow": {
486 | "slide_type": "fragment"
487 | }
488 | },
489 | "outputs": [
490 | {
491 | "data": {
492 | "text/plain": [
493 | "526.5"
494 | ]
495 | },
496 | "execution_count": 11,
497 | "metadata": {},
498 | "output_type": "execute_result"
499 | }
500 | ],
501 | "source": [
502 | "# Quantiles and quartiles \n",
503 | "news.length.quantile(0.25)"
504 | ]
505 | },
506 | {
507 | "cell_type": "code",
508 | "execution_count": 12,
509 | "metadata": {
510 | "slideshow": {
511 | "slide_type": "subslide"
512 | }
513 | },
514 | "outputs": [
515 | {
516 | "data": {
517 | "text/plain": [
518 | "319.96102660076804"
519 | ]
520 | },
521 | "execution_count": 12,
522 | "metadata": {},
523 | "output_type": "execute_result"
524 | }
525 | ],
526 | "source": [
527 | "# Standard deviation\n",
528 | "news.length.std()"
529 | ]
530 | },
531 | {
532 | "cell_type": "code",
533 | "execution_count": 13,
534 | "metadata": {
535 | "slideshow": {
536 | "slide_type": "fragment"
537 | }
538 | },
539 | "outputs": [
540 | {
541 | "data": {
542 | "text/plain": [
543 | "102375.0585434174"
544 | ]
545 | },
546 | "execution_count": 13,
547 | "metadata": {},
548 | "output_type": "execute_result"
549 | }
550 | ],
551 | "source": [
552 | "# Variance\n",
553 | "news.length.var()"
554 | ]
555 | },
556 | {
557 | "cell_type": "code",
558 | "execution_count": 14,
559 | "metadata": {
560 | "slideshow": {
561 | "slide_type": "fragment"
562 | }
563 | },
564 | "outputs": [
565 | {
566 | "data": {
567 | "text/plain": [
568 | "102375.05854341738"
569 | ]
570 | },
571 | "execution_count": 14,
572 | "metadata": {},
573 | "output_type": "execute_result"
574 | }
575 | ],
576 | "source": [
577 | "news.length.std() ** 2"
578 | ]
579 | },
580 | {
581 | "cell_type": "code",
582 | "execution_count": 15,
583 | "metadata": {
584 | "slideshow": {
585 | "slide_type": "subslide"
586 | }
587 | },
588 | "outputs": [
589 | {
590 | "data": {
591 | "text/html": [
592 | "<div>\n",
606 | "<table border=\"1\" class=\"dataframe\">\n",
607 | "  <thead>\n",
608 | "    <tr style=\"text-align: right;\">\n",
609 | "      <th></th>\n",
610 | "      <th>民進黨</th>\n",
611 | "      <th>length</th>\n",
612 | "    </tr>\n",
613 | "  </thead>\n",
614 | "  <tbody>\n",
615 | "    <tr>\n",
616 | "      <th>count</th>\n",
617 | "      <td>120.000000</td>\n",
618 | "      <td>120.000000</td>\n",
619 | "    </tr>\n",
620 | "    <tr>\n",
621 | "      <th>mean</th>\n",
622 | "      <td>1.800000</td>\n",
623 | "      <td>699.483333</td>\n",
624 | "    </tr>\n",
625 | "    <tr>\n",
626 | "      <th>std</th>\n",
627 | "      <td>2.198548</td>\n",
628 | "      <td>319.961027</td>\n",
629 | "    </tr>\n",
630 | "    <tr>\n",
631 | "      <th>min</th>\n",
632 | "      <td>0.000000</td>\n",
633 | "      <td>63.000000</td>\n",
634 | "    </tr>\n",
635 | "    <tr>\n",
636 | "      <th>25%</th>\n",
637 | "      <td>0.000000</td>\n",
638 | "      <td>526.500000</td>\n",
639 | "    </tr>\n",
640 | "    <tr>\n",
641 | "      <th>50%</th>\n",
642 | "      <td>1.000000</td>\n",
643 | "      <td>680.500000</td>\n",
644 | "    </tr>\n",
645 | "    <tr>\n",
646 | "      <th>75%</th>\n",
647 | "      <td>3.000000</td>\n",
648 | "      <td>832.000000</td>\n",
649 | "    </tr>\n",
650 | "    <tr>\n",
651 | "      <th>max</th>\n",
652 | "      <td>11.000000</td>\n",
653 | "      <td>1948.000000</td>\n",
654 | "    </tr>\n",
655 | "  </tbody>\n",
656 | "</table>\n",
657 | "</div>"
658 | ],
659 | "text/plain": [
660 | " 民進黨 length\n",
661 | "count 120.000000 120.000000\n",
662 | "mean 1.800000 699.483333\n",
663 | "std 2.198548 319.961027\n",
664 | "min 0.000000 63.000000\n",
665 | "25% 0.000000 526.500000\n",
666 | "50% 1.000000 680.500000\n",
667 | "75% 3.000000 832.000000\n",
668 | "max 11.000000 1948.000000"
669 | ]
670 | },
671 | "execution_count": 15,
672 | "metadata": {},
673 | "output_type": "execute_result"
674 | }
675 | ],
676 | "source": [
677 | "news.describe()"
678 | ]
679 | },
680 | {
681 | "cell_type": "markdown",
682 | "metadata": {
683 | "slideshow": {
684 | "slide_type": "slide"
685 | }
686 | },
687 | "source": [
688 | "### Normalization and Standardization\n",
689 | "Before building a model, the data are usually rescaled. Two common methods are shown below. \n",
690 | "Normalization: \n",
691 | "$ x_{\\text{norm}} = (x-x_{\\text{min}}) / (x_{\\text{max}} - x_{\\text{min}}) $ \n",
692 | "The values $x_{\\text{norm}}$ lie between 0 and 1.\n",
693 | "\n",
694 | "Standardization: \n",
695 | "$ x_{\\text{std}} = (x-\\mu) / \\sigma $ \n",
696 | "The values $x_{\\text{std}}$ have mean 0 and standard deviation 1.\n"
697 | ]
698 | },
699 | {
700 | "cell_type": "code",
701 | "execution_count": 16,
702 | "metadata": {
703 | "slideshow": {
704 | "slide_type": "fragment"
705 | }
706 | },
707 | "outputs": [],
708 | "source": [
709 | "news['length_norm'] = (news.length - news.length.min())/(news.length.max() - news.length.min())\n",
710 | "news['length_std'] = (news.length - news.length.mean())/news.length.std()"
711 | ]
712 | },
713 | {
714 | "cell_type": "code",
715 | "execution_count": 17,
716 | "metadata": {
717 | "scrolled": true,
718 | "slideshow": {
719 | "slide_type": "subslide"
720 | }
721 | },
722 | "outputs": [
723 | {
724 | "data": {
725 | "text/plain": [
726 | ""
727 | ]
728 | },
729 | "execution_count": 17,
730 | "metadata": {},
731 | "output_type": "execute_result"
732 | },
733 | {
734 | "data": {
735 | "image/png": "[base64-encoded PNG omitted: histogram of news['length_norm']]",
736 | "text/plain": [
737 | ""
738 | ]
739 | },
740 | "metadata": {
741 | "needs_background": "light"
742 | },
743 | "output_type": "display_data"
744 | }
745 | ],
746 | "source": [
747 | "%matplotlib inline\n",
748 | "\n",
749 | "news['length_norm'].hist()"
750 | ]
751 | },
752 | {
753 | "cell_type": "code",
754 | "execution_count": 18,
755 | "metadata": {
756 | "slideshow": {
757 | "slide_type": "subslide"
758 | }
759 | },
760 | "outputs": [
761 | {
762 | "data": {
763 | "text/plain": [
764 | ""
765 | ]
766 | },
767 | "execution_count": 18,
768 | "metadata": {},
769 | "output_type": "execute_result"
770 | },
771 | {
772 | "data": {
773 | "image/png": "[base64-encoded PNG omitted: histogram of news['length_std']]",
774 | "text/plain": [
775 | ""
776 | ]
777 | },
778 | "metadata": {
779 | "needs_background": "light"
780 | },
781 | "output_type": "display_data"
782 | }
783 | ],
784 | "source": [
785 | "news['length_std'].hist()"
786 | ]
787 | },
788 | {
789 | "cell_type": "code",
790 | "execution_count": 19,
791 | "metadata": {
792 | "scrolled": true,
793 | "slideshow": {
794 | "slide_type": "subslide"
795 | }
796 | },
797 | "outputs": [
798 | {
799 | "data": {
800 | "text/plain": [
801 | ""
802 | ]
803 | },
804 | "execution_count": 19,
805 | "metadata": {},
806 | "output_type": "execute_result"
807 | },
808 | {
809 | "data": {
810 | "image/png": "[base64-encoded PNG omitted: histogram of news['length']]",
811 | "text/plain": [
812 | ""
813 | ]
814 | },
815 | "metadata": {
816 | "needs_background": "light"
817 | },
818 | "output_type": "display_data"
819 | }
820 | ],
821 | "source": [
822 | "news['length'].hist()"
823 | ]
824 | },
825 | {
826 | "cell_type": "markdown",
827 | "metadata": {
828 | "slideshow": {
829 | "slide_type": "slide"
830 | }
831 | },
832 | "source": [
833 | "### Coefficients of correlation\n",
834 | "\n",
835 | "Use `.corr()` to get the correlation coefficients between columns (Pearson by default; Kendall and Spearman are also available through the `method` argument)."
836 | ]
837 | },
838 | {
839 | "cell_type": "code",
840 | "execution_count": 20,
841 | "metadata": {
842 | "slideshow": {
843 | "slide_type": "fragment"
844 | }
845 | },
846 | "outputs": [
847 | {
848 | "data": {
849 | "text/html": [
850 | "<div>\n",
864 | "<table border=\"1\" class=\"dataframe\">\n",
865 | "  <thead>\n",
866 | "    <tr style=\"text-align: right;\">\n",
867 | "      <th></th>\n",
868 | "      <th>柯文哲</th>\n",
869 | "      <th>姚文智</th>\n",
870 | "      <th>民進黨</th>\n",
871 | "    </tr>\n",
872 | "  </thead>\n",
873 | "  <tbody>\n",
874 | "    <tr>\n",
875 | "      <th>柯文哲</th>\n",
876 | "      <td>1.000000</td>\n",
877 | "      <td>0.592157</td>\n",
878 | "      <td>-0.031563</td>\n",
879 | "    </tr>\n",
880 | "    <tr>\n",
881 | "      <th>姚文智</th>\n",
882 | "      <td>0.592157</td>\n",
883 | "      <td>1.000000</td>\n",
884 | "      <td>0.227232</td>\n",
885 | "    </tr>\n",
886 | "    <tr>\n",
887 | "      <th>民進黨</th>\n",
888 | "      <td>-0.031563</td>\n",
889 | "      <td>0.227232</td>\n",
890 | "      <td>1.000000</td>\n",
891 | "    </tr>\n",
892 | "  </tbody>\n",
893 | "</table>\n",
894 | "</div>"
895 | ],
896 | "text/plain": [
897 | " 柯文哲 姚文智 民進黨\n",
898 | "柯文哲 1.000000 0.592157 -0.031563\n",
899 | "姚文智 0.592157 1.000000 0.227232\n",
900 | "民進黨 -0.031563 0.227232 1.000000"
901 | ]
902 | },
903 | "execution_count": 20,
904 | "metadata": {},
905 | "output_type": "execute_result"
906 | }
907 | ],
908 | "source": [
909 | "news.loc[:,['柯文哲','姚文智','民進黨']].corr()"
910 | ]
911 | }
912 | ],
913 | "metadata": {
914 | "anaconda-cloud": {},
915 | "celltoolbar": "Slideshow",
916 | "kernelspec": {
917 | "display_name": "Python 3",
918 | "language": "python",
919 | "name": "python3"
920 | },
921 | "language_info": {
922 | "codemirror_mode": {
923 | "name": "ipython",
924 | "version": 3
925 | },
926 | "file_extension": ".py",
927 | "mimetype": "text/x-python",
928 | "name": "python",
929 | "nbconvert_exporter": "python",
930 | "pygments_lexer": "ipython3",
931 | "version": "3.7.0"
932 | }
933 | },
934 | "nbformat": 4,
935 | "nbformat_minor": 1
936 | }
937 |
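The correlation cell in this notebook notes that `.corr()` defaults to Pearson and can also use Kendall or Spearman. As a rough illustration of how Pearson and Spearman differ, the following is a minimal pure-Python sketch. The helper names `pearson`, `average_ranks`, and `spearman` and the toy data are hypothetical and are not part of the workshop material; pandas' own implementation may differ in details such as missing-value handling.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data (hypothetical): y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print("pearson :", round(pearson(x, y), 4))
print("spearman:", round(spearman(x, y), 4))
```

On monotone but non-linear data like this, Spearman reaches 1 (up to floating-point rounding) while Pearson stays slightly below it, which is one reason to try a rank-based method when a relationship is not expected to be linear.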
--------------------------------------------------------------------------------
/文本分析及語料庫統計/ipynb/3. Models.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "slideshow": {
7 | "slide_type": "slide"
8 | }
9 | },
10 | "source": [
11 | "# 3. Models\n",
12 | "## Outline\n",
13 | "\n",
14 | "* [Preprocessing](#Preprocessing)\n",
15 | " * [Segmentation](#Segmentation)\n",
16 | " * [tf-idf](#tf-idf)\n",
17 | "* [Linear Models](#LinearModels)\n",
18 | "* [Binary Logistic Regression Models](#BinaryLogisticRegressionModels)\n",
19 | " * [Cross Validation](#CrossValidation)\n",
20 | "* [Exercises and Solutions](#ExercisesAndSolutions)"
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {
26 | "slideshow": {
27 | "slide_type": "notes"
28 | }
29 | },
30 | "source": [
31 | "This section shows how to build models: first a Linear Model for predicting numeric values, then a classifier based on Logistic Regression. It also introduces the scikit-learn package."
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {
37 | "slideshow": {
38 | "slide_type": "slide"
39 | }
40 | },
41 | "source": [
42 | "## Preprocessing\n",
43 | "First, preprocess the data: segment the news content into words and compute word frequencies."
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": 1,
49 | "metadata": {
50 | "scrolled": false,
51 | "slideshow": {
52 | "slide_type": "fragment"
53 | }
54 | },
55 | "outputs": [
56 | {
57 | "data": {
58 | "text/html": [
59 | "<div>\n",
73 | "<table border=\"1\" class=\"dataframe\">\n",
74 | "  <thead>\n",
75 | "    <tr style=\"text-align: right;\">\n",
76 | "      <th></th>\n",
77 | "      <th>title</th>\n",
78 | "      <th>content</th>\n",
79 | "      <th>time</th>\n",
80 | "      <th>provider</th>\n",
81 | "      <th>url</th>\n",
82 | "    </tr>\n",
83 | "  </thead>\n",
84 | "  <tbody>\n",
85 | "    <tr>\n",
86 | "      <th>0</th>\n",
87 | "      <td>「把25年前韓國瑜打人事件當英雄看」陳水扁批:吳敦義「災難政治學」的表現</td>\n",
88 | "      <td>國民黨主席吳敦義日前提到高雄市長候選人韓國瑜過去打陳水扁,表示「很認同跟敬佩」並形容「允文允...</td>\n",
89 | "      <td>2018-10-22 12:16:02+08:00</td>\n",
90 | "      <td>風傳媒</td>\n",
91 | "      <td>https://tw.news.yahoo.com/把25年前韓國瑜打人事件當英雄看-陳水扁...</td>\n",
92 | "    </tr>\n",
93 | "    <tr>\n",
94 | "      <th>1</th>\n",
95 | "      <td>【Yahoo論壇/林青弘】柯文哲是否一再說謊?</td>\n",
96 | "      <td>柯文哲市長在台北市北投區七星公園造勢,行動競選總部的大卡車開進公園,違規臨停。競辦被開罰六張...</td>\n",
97 | "      <td>2018-10-22 14:00:26+08:00</td>\n",
98 | "      <td>林青弘</td>\n",
99 | "      <td>https://tw.news.yahoo.com/【yahoo論壇%EF%BC%8F林青弘...</td>\n",
100 | "    </tr>\n",
101 | "    <tr>\n",
102 | "      <th>2</th>\n",
103 | "      <td>【Yahoo論壇】民進黨誰最怕陳其邁落選?</td>\n",
104 | "      <td>讀者投書:廖念漢(現任奇策盟文宣部主任、曾任海巡署專聘講師)\\n 《長平之戰》是戰國時代最戲...</td>\n",
105 | "      <td>2018-10-22 13:57:44+08:00</td>\n",
106 | "      <td>讀者投書</td>\n",
107 | "      <td>https://tw.news.yahoo.com/【yahoo論壇】民進黨誰最怕陳其邁落選...</td>\n",
108 | "    </tr>\n",
109 | "    <tr>\n",
110 | "      <th>3</th>\n",
111 | "      <td>抽中籤王 韓國瑜車隊掃街 民眾路邊紛比讚</td>\n",
112 | "      <td>國民黨高雄市長候選人韓國瑜聲勢上漲,又抽中一號籤王,心情相當興奮,立即展開掃街拜,經過的地方...</td>\n",
113 | "      <td>2018-10-22 13:32:00+08:00</td>\n",
114 | "      <td>EBC東森新聞</td>\n",
115 | "      <td>https://tw.news.yahoo.com/抽中籤王-韓國瑜車隊掃街-民眾路邊紛比讚...</td>\n",
116 | "    </tr>\n",
117 | "    <tr>\n",
118 | "      <th>4</th>\n",
119 | "      <td>百年土地公上香祈福 陳學聖提五不原則</td>\n",
120 | "      <td>【綜合報導】普悠瑪列車出軌意外舉國震驚如同國難,令社會大眾、競選團隊及陳學聖本人都感到十分沉...</td>\n",
121 | "      <td>2018-10-22 13:17:44+08:00</td>\n",
122 | "      <td>民眾日報</td>\n",
123 | "      <td>https://tw.news.yahoo.com/百年土地公上香祈福-陳學聖提五不原則-0...</td>\n",
124 | "    </tr>\n",
125 | "  </tbody>\n",
126 | "</table>\n",
127 | "</div>"
128 | ],
129 | "text/plain": [
130 | " title \\\n",
131 | "0 「把25年前韓國瑜打人事件當英雄看」陳水扁批:吳敦義「災難政治學」的表現 \n",
132 | "1 【Yahoo論壇/林青弘】柯文哲是否一再說謊? \n",
133 | "2 【Yahoo論壇】民進黨誰最怕陳其邁落選? \n",
134 | "3 抽中籤王 韓國瑜車隊掃街 民眾路邊紛比讚 \n",
135 | "4 百年土地公上香祈福 陳學聖提五不原則 \n",
136 | "\n",
137 | " content \\\n",
138 | "0 國民黨主席吳敦義日前提到高雄市長候選人韓國瑜過去打陳水扁,表示「很認同跟敬佩」並形容「允文允... \n",
139 | "1 柯文哲市長在台北市北投區七星公園造勢,行動競選總部的大卡車開進公園,違規臨停。競辦被開罰六張... \n",
140 | "2 讀者投書:廖念漢(現任奇策盟文宣部主任、曾任海巡署專聘講師)\\n 《長平之戰》是戰國時代最戲... \n",
141 | "3 國民黨高雄市長候選人韓國瑜聲勢上漲,又抽中一號籤王,心情相當興奮,立即展開掃街拜,經過的地方... \n",
142 | "4 【綜合報導】普悠瑪列車出軌意外舉國震驚如同國難,令社會大眾、競選團隊及陳學聖本人都感到十分沉... \n",
143 | "\n",
144 | " time provider \\\n",
145 | "0 2018-10-22 12:16:02+08:00 風傳媒 \n",
146 | "1 2018-10-22 14:00:26+08:00 林青弘 \n",
147 | "2 2018-10-22 13:57:44+08:00 讀者投書 \n",
148 | "3 2018-10-22 13:32:00+08:00 EBC東森新聞 \n",
149 | "4 2018-10-22 13:17:44+08:00 民眾日報 \n",
150 | "\n",
151 | " url \n",
152 | "0 https://tw.news.yahoo.com/把25年前韓國瑜打人事件當英雄看-陳水扁... \n",
153 | "1 https://tw.news.yahoo.com/【yahoo論壇%EF%BC%8F林青弘... \n",
154 | "2 https://tw.news.yahoo.com/【yahoo論壇】民進黨誰最怕陳其邁落選... \n",
155 | "3 https://tw.news.yahoo.com/抽中籤王-韓國瑜車隊掃街-民眾路邊紛比讚... \n",
156 | "4 https://tw.news.yahoo.com/百年土地公上香祈福-陳學聖提五不原則-0... "
157 | ]
158 | },
159 | "execution_count": 1,
160 | "metadata": {},
161 | "output_type": "execute_result"
162 | }
163 | ],
164 | "source": [
165 | "import pandas as pd\n",
166 | "from pathlib import Path\n",
167 | "data_folder = Path(\"../data/\")\n",
168 | "\n",
169 | "news = pd.read_csv(data_folder / \"news.csv\")\n",
170 | "news.head()"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": 2,
176 | "metadata": {
177 | "slideshow": {
178 | "slide_type": "subslide"
179 | }
180 | },
181 | "outputs": [
182 | {
183 | "data": {
184 | "text/plain": [
185 | "0 627\n",
186 | "1 1304\n",
187 | "2 1673\n",
188 | "3 503\n",
189 | "4 585\n",
190 | "Name: length, dtype: int64"
191 | ]
192 | },
193 | "execution_count": 2,
194 | "metadata": {},
195 | "output_type": "execute_result"
196 | }
197 | ],
198 | "source": [
199 | "news['length'] = news['content'].apply(len)\n",
200 | "news['length'].head()"
201 | ]
202 | },
203 | {
204 | "cell_type": "markdown",
205 | "metadata": {
206 | "slideshow": {
207 | "slide_type": "subslide"
208 | }
209 | },
210 | "source": [
211 | "### Segmentation\n",
212 | "Use jieba for word segmentation."
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 3,
218 | "metadata": {
219 | "slideshow": {
220 | "slide_type": "fragment"
221 | }
222 | },
223 | "outputs": [],
224 | "source": [
225 | "import jieba"
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": 4,
231 | "metadata": {
232 | "scrolled": false,
233 | "slideshow": {
234 | "slide_type": "fragment"
235 | }
236 | },
237 | "outputs": [
238 | {
239 | "name": "stdout",
240 | "output_type": "stream",
241 | "text": [
242 | "國民黨主席吳敦義日前提到高雄市長候選人韓國瑜過去打陳水扁,表示「很認同跟敬佩」並形容「允文允武」。前總統陳水扁今(22)日在《新勇哥物語》質疑,這是「災難政治學」的表現,反批吳敦義以為高雄市長贏定了,得意忘形的囂張之情溢於言表。\n",
243 | "《新勇哥物語》今天刊出,陳水扁借勇哥表示,不相信吳敦義會講鼓勵暴力的話,也不相信吳敦義主席會認為,當年韓國瑜公然在國會殿堂打阿扁打到住院是對的。\n",
244 | "陳水扁指出,吳敦義的發言,是「災難政治學」的表現,他批評吳敦義還真的以為他提名韓國瑜參選高雄市長贏定了,得意忘形的囂張之情溢於言表,竟把25年前的打人事件拿出來捧為文武雙全的英雄看待。「是非不明,黑白不分,不是很可怕嗎?!」\n",
245 | "陳水扁也在文中還原,1993年5月韓國瑜推倒他導致受傷住院,隔天有幫派份子聚集到立法院,衝突場面導致10多人受傷掛彩,韓國瑜遭到質疑找黑道兄弟助陣,風波一度越演越烈。韓國瑜後來也道歉,「我願意為我肢體衝突,向陳水扁委員致歉。」\n",
246 | "陳水扁表示,當時他是在幫榮民講話,因為他對政府的榮民就養照護問題向退輔會提出質詢,認為「不能把榮民當豬養,不是說榮民是豬」。韓國瑜聽到「豬」就抓狂,這跟扁小時候家裡養豬賣錢供給讀書,生活經驗完全不同。(推薦閱讀:普悠瑪翻車慘劇》吳敦義籲所有九合一選舉候選人暫停選舉活動)\n",
247 | "相關報導● 吳敦義因韓國瑜「打扁」才提名選高雄?段宜康酸:乾脆提名殺過人的!● 強力反擊韓國瑜 洪耀福:愛河水臭、自來水不能喝,這是吳敦義當市長時的高雄\n"
248 | ]
249 | }
250 | ],
251 | "source": [
252 | "text = news.content[0]\n",
253 | "print(text)"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 5,
259 | "metadata": {
260 | "scrolled": false,
261 | "slideshow": {
262 | "slide_type": "subslide"
263 | }
264 | },
265 | "outputs": [
266 | {
267 | "name": "stderr",
268 | "output_type": "stream",
269 | "text": [
270 | "Building prefix dict from the default dictionary ...\n",
271 | "Loading model from cache /var/folders/80/gbrxkvp9687cyvtdv78_5z3h0000gn/T/jieba.cache\n",
272 | "Loading model cost 1.275 seconds.\n",
273 | "Prefix dict has been built succesfully.\n"
274 | ]
275 | },
276 | {
277 | "name": "stdout",
278 | "output_type": "stream",
279 | "text": [
280 | "國民黨 主席 吳敦義 日前 提到 高雄市 長 候選人 韓國瑜 過去 打陳水 扁 , 表示 「 很 認同 跟 敬佩 」 並 形容 「 允文允武 」 。 前 總統 陳 水扁 今 ( 22 ) 日 在 《 新勇哥 物語 》 質疑 , 這是 「 災難 政治 學 」 的 表現 , 反批 吳敦義以 為 高雄市 長 贏定 了 , 得意忘形 的 囂張 之 情溢 於 言表 。 \n",
281 | " 《 新勇哥 物語 》 今天 刊出 , 陳 水扁 借勇哥 表示 , 不 相信 吳敦義會 講鼓勵 暴力 的 話 , 也 不 相信 吳敦義 主席 會 認為 , 當年 韓國瑜 公然 在 國會 殿堂 打阿扁 打到 住院 是 對 的 。 \n",
282 | " 陳 水扁 指出 , 吳敦義的 發言 , 是 「 災難 政治 學 」 的 表現 , 他 批評 吳敦義還 真的 以為 他 提名 韓國瑜 參選 高雄市 長 贏定 了 , 得意忘形 的 囂張 之 情溢 於 言表 , 竟 把 25 年前 的 打人 事件 拿出 來 捧 為 文武 雙全 的 英雄 看待 。 「 是非 不明 , 黑白不分 , 不是 很 可怕 嗎 ? ! 」 \n",
283 | " 陳 水扁 也 在 文中 還原 , 1993 年 5 月 韓國瑜 推倒 他導致 受傷 住院 , 隔天 有 幫派 份子 聚集 到 立法院 , 衝突場 面導致 10 多人 受傷 掛彩 , 韓國瑜 遭到 質疑 找 黑道 兄弟 助陣 , 風波 一度 越演 越烈 。 韓國瑜後來 也 道歉 , 「 我願意 為 我 肢體 衝突 , 向 陳 水 扁委員 致歉 。 」 \n",
284 | " 陳 水扁 表示 , 當時 他 是 在 幫榮民 講話 , 因為 他 對 政府 的 榮民 就 養照護 問題 向 退 輔會 提出 質詢 , 認為 「 不能 把 榮民當 豬養 , 不是 說榮民 是豬 」 。 韓國瑜 聽 到 「 豬 」 就 抓狂 , 這跟 扁小時 候家裡 養豬 賣 錢 供給 讀書 , 生活 經驗 完全 不同 。 ( 推薦 閱讀 : 普悠瑪 翻車 慘劇 》 吳敦義籲 所有 九 合一 選舉候 選人 暫停 選舉 活動 ) \n",
285 | " 相關 報導 ● 吳敦義因 韓國瑜 「 打扁 」 才 提名 選高雄 ? 段 宜康酸 : 乾脆 提名 殺過 人 的 ! ● 強力 反擊 韓國瑜 洪耀福 : 愛 河水 臭 、 自來 水 不能 喝 , 這是 吳敦義當 市長 時 的 高雄\n"
286 | ]
287 | }
288 | ],
289 | "source": [
290 | "print(\" \".join(jieba.cut(text)))"
291 | ]
292 | },
293 | {
294 | "cell_type": "code",
295 | "execution_count": 6,
296 | "metadata": {
297 | "slideshow": {
298 | "slide_type": "subslide"
299 | }
300 | },
301 | "outputs": [
302 | {
303 | "data": {
304 | "text/plain": [
305 | "0 國民黨 主席 吳敦義 日前 提到 高雄市 長 候選人 韓國瑜 過去 打陳水 扁 , 表示 「...\n",
306 | "1 柯文 哲市 長 在 台北市 北投 區 七星 公園 造勢 , 行動 競選 總部 的 大卡 車開...\n",
307 | "2 讀者 投書 : 廖念漢 ( 現任 奇策 盟文 宣部 主任 、 曾任 海巡 署 專聘 講師 )...\n",
308 | "3 國民黨 高雄市 長 候選人 韓國瑜 聲勢 上 漲 , 又 抽中 一號 籤 王 , 心情 相當...\n",
309 | "4 【 綜合 報導 】 普悠瑪 列車 出軌 意外 舉國震 驚 如同 國難 , 令 社會 大眾 、...\n",
310 | "Name: segmentation, dtype: object"
311 | ]
312 | },
313 | "execution_count": 6,
314 | "metadata": {},
315 | "output_type": "execute_result"
316 | }
317 | ],
318 | "source": [
319 | "news['segmentation'] = news.content.apply(lambda text: \" \".join(jieba.cut(text)))\n",
320 | "news['segmentation'].head()"
321 | ]
322 | },
323 | {
324 | "cell_type": "markdown",
325 | "metadata": {
326 | "slideshow": {
327 | "slide_type": "subslide"
328 | }
329 | },
330 | "source": [
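331 | "### tf-idf\n",
332 | "tf (term frequency): how often a term occurs within a single document.  \n",
333 | "idf (inverse document frequency): the total number of documents divided by the number of documents that contain the term, usually taken on a log scale.  \n",
334 | "\n",
335 | "$\\text{tf-idf} = tf \\times idf$ \n",
336 | "\n",
337 | "For example, 「的」 may have a high term frequency within a document, but because it appears in every document its idf is very small. The tf-idf product is therefore small as well, which marks the word as carrying little information."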
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": 7,
343 | "metadata": {
344 | "slideshow": {
345 | "slide_type": "fragment"
346 | }
347 | },
348 | "outputs": [],
349 | "source": [
350 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
351 | "v = TfidfVectorizer()\n",
352 | "news_tfidf = v.fit_transform(news.segmentation)"
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": 8,
358 | "metadata": {
359 | "scrolled": false,
360 | "slideshow": {
361 | "slide_type": "fragment"
362 | }
363 | },
364 | "outputs": [
365 | {
366 | "data": {
367 | "text/plain": [
368 | "(120, 7806)"
369 | ]
370 | },
371 | "execution_count": 8,
372 | "metadata": {},
373 | "output_type": "execute_result"
374 | }
375 | ],
376 | "source": [
377 | "news_tfidf.shape"
378 | ]
379 | },
380 | {
381 | "cell_type": "markdown",
382 | "metadata": {
383 | "slideshow": {
384 | "slide_type": "slide"
385 | }
386 | },
387 | "source": [
388 | "## Linear Models \n",
389 | "Use a linear model to predict unknown numeric values."
390 | ]
391 | },
392 | {
393 | "cell_type": "code",
394 | "execution_count": 9,
395 | "metadata": {
396 | "slideshow": {
397 | "slide_type": "fragment"
398 | }
399 | },
400 | "outputs": [],
401 | "source": [
402 | "import sklearn\n",
403 | "from sklearn.model_selection import train_test_split\n",
404 | "X_train, X_test, y_train, y_test = train_test_split(\n",
405 | " news_tfidf, \n",
406 | " news[['length']],\n",
407 | " test_size=0.3, \n",
408 | " random_state=7)"
409 | ]
410 | },
411 | {
412 | "cell_type": "code",
413 | "execution_count": 10,
414 | "metadata": {
415 | "slideshow": {
416 | "slide_type": "subslide"
417 | }
418 | },
419 | "outputs": [],
420 | "source": [
421 | "from sklearn.linear_model import LinearRegression\n",
422 | "from sklearn.metrics import mean_squared_error, r2_score\n",
423 | "\n",
424 | "regr = LinearRegression()\n",
425 | "regr.fit(X_train, y_train)\n",
426 | "y_pred = regr.predict(X_test)"
427 | ]
428 | },
429 | {
430 | "cell_type": "code",
431 | "execution_count": 11,
432 | "metadata": {
433 | "scrolled": true,
434 | "slideshow": {
435 | "slide_type": "subslide"
436 | }
437 | },
438 | "outputs": [
439 | {
440 | "name": "stdout",
441 | "output_type": "stream",
442 | "text": [
443 | "Coefficients: \n",
444 | " [[96.85880969 -2.22508701 14.08190329 ... 0. 36.56082547\n",
445 | " 0. ]]\n",
446 | "Mean squared error: 61652.85\n",
447 | "Variance score: 0.15\n"
448 | ]
449 | }
450 | ],
451 | "source": [
452 | "print('Coefficients: \\n', regr.coef_)\n",
453 | "\n",
454 | "print(\"Mean squared error: %.2f\"\n",
455 | " % mean_squared_error(y_test, y_pred))\n",
456 | "\n",
457 | "# R^2 (coefficient of determination): 1 is perfect prediction\n",
458 | "print('Variance score: %.2f' % r2_score(y_test, y_pred))"
459 | ]
460 | },
461 | {
462 | "cell_type": "markdown",
463 | "metadata": {
464 | "slideshow": {
465 | "slide_type": "slide"
466 | }
467 | },
468 | "source": [
469 | "## Binary Logistic Regression Models \n",
470 | "Use a binary classification model to predict the class of each document."
471 | ]
472 | },
473 | {
474 | "cell_type": "code",
475 | "execution_count": 12,
476 | "metadata": {
477 | "slideshow": {
478 | "slide_type": "fragment"
479 | }
480 | },
481 | "outputs": [
482 | {
483 | "data": {
537 | "text/plain": [
538 | " content provider\n",
539 | "14 在競選活動方面,韓流在全台發威,韓國瑜從台灣頭跑到台灣尾進行輔選,他昨天一整天馬不停蹄,為高... 聯合新聞網\n",
540 | "15 「台北客家義民嘉年華」重頭戲挑擔踩街活動昨天登場,由5000多人組成的踩街隊伍,一早浩浩蕩蕩... 聯合新聞網\n",
541 | "16 市長選戰攻防激烈,台北市長柯文哲卻連連失言,前天酸「台女不化妝上街嚇人」,行動競總「開進」公... 聯合新聞網\n",
542 | "32 (中央社記者王揚宇台北21日電)民進黨台北市長參選人姚文智今天在一場論壇說,學校作為有機體,... 中央社\n",
543 | "33 (中央社記者黃麗芸台北21日電)「雙北市長青年論壇」今天登場,中國國民黨台北市長參選人丁守中... 中央社"
544 | ]
545 | },
546 | "execution_count": 12,
547 | "metadata": {},
548 | "output_type": "execute_result"
549 | }
550 | ],
551 | "source": [
552 | "selected_news = news.loc[news.provider.isin(['中央社','聯合新聞網']), ['content','provider']]\n",
553 | "selected_news.head()"
554 | ]
555 | },
556 | {
557 | "cell_type": "code",
558 | "execution_count": 13,
559 | "metadata": {
560 | "slideshow": {
561 | "slide_type": "subslide"
562 | }
563 | },
564 | "outputs": [],
565 | "source": [
566 | "selected_news_tfidf = news_tfidf[selected_news.index]"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": 14,
572 | "metadata": {
573 | "scrolled": true,
574 | "slideshow": {
575 | "slide_type": "fragment"
576 | }
577 | },
578 | "outputs": [],
579 | "source": [
580 | "X_train, X_test, y_train, y_test = train_test_split(\n",
581 | " selected_news_tfidf, \n",
582 | " selected_news[['provider']],\n",
583 | " test_size=0.3, \n",
584 | " random_state=0)"
585 | ]
586 | },
587 | {
588 | "cell_type": "code",
589 | "execution_count": 15,
590 | "metadata": {
591 | "slideshow": {
592 | "slide_type": "subslide"
593 | }
594 | },
595 | "outputs": [
596 | {
597 | "data": {
598 | "text/plain": [
599 | "<18x7806 sparse matrix of type ''\n",
600 | "\twith 2601 stored elements in Compressed Sparse Row format>"
601 | ]
602 | },
603 | "execution_count": 15,
604 | "metadata": {},
605 | "output_type": "execute_result"
606 | }
607 | ],
608 | "source": [
609 | "X_train"
610 | ]
611 | },
612 | {
613 | "cell_type": "code",
614 | "execution_count": 16,
615 | "metadata": {
616 | "slideshow": {
617 | "slide_type": "fragment"
618 | }
619 | },
620 | "outputs": [
621 | {
622 | "data": {
623 | "text/plain": [
624 | "<9x7806 sparse matrix of type ''\n",
625 | "\twith 1380 stored elements in Compressed Sparse Row format>"
626 | ]
627 | },
628 | "execution_count": 16,
629 | "metadata": {},
630 | "output_type": "execute_result"
631 | }
632 | ],
633 | "source": [
634 | "X_test"
635 | ]
636 | },
637 | {
638 | "cell_type": "code",
639 | "execution_count": 17,
640 | "metadata": {
641 | "scrolled": true,
642 | "slideshow": {
643 | "slide_type": "subslide"
644 | }
645 | },
646 | "outputs": [
647 | {
648 | "data": {
748 | "text/plain": [
749 | " provider\n",
750 | "96 聯合新聞網\n",
751 | "92 聯合新聞網\n",
752 | "15 聯合新聞網\n",
753 | "72 中央社\n",
754 | "119 中央社\n",
755 | "109 中央社\n",
756 | "58 中央社\n",
757 | "49 中央社\n",
758 | "33 中央社\n",
759 | "94 聯合新聞網\n",
760 | "71 中央社\n",
761 | "50 中央社\n",
762 | "98 聯合新聞網\n",
763 | "32 中央社\n",
764 | "14 聯合新聞網\n",
765 | "97 聯合新聞網\n",
766 | "91 聯合新聞網\n",
767 | "74 中央社"
768 | ]
769 | },
770 | "execution_count": 17,
771 | "metadata": {},
772 | "output_type": "execute_result"
773 | }
774 | ],
775 | "source": [
776 | "y_train"
777 | ]
778 | },
779 | {
780 | "cell_type": "code",
781 | "execution_count": 18,
782 | "metadata": {
783 | "scrolled": true,
784 | "slideshow": {
785 | "slide_type": "subslide"
786 | }
787 | },
788 | "outputs": [
789 | {
790 | "data": {
854 | "text/plain": [
855 | " provider\n",
856 | "16 聯合新聞網\n",
857 | "107 中央社\n",
858 | "90 聯合新聞網\n",
859 | "93 聯合新聞網\n",
860 | "34 中央社\n",
861 | "73 中央社\n",
862 | "99 聯合新聞網\n",
863 | "77 中央社\n",
864 | "95 聯合新聞網"
865 | ]
866 | },
867 | "execution_count": 18,
868 | "metadata": {},
869 | "output_type": "execute_result"
870 | }
871 | ],
872 | "source": [
873 | "y_test"
874 | ]
875 | },
876 | {
877 | "cell_type": "code",
878 | "execution_count": 19,
879 | "metadata": {
880 | "scrolled": true,
881 | "slideshow": {
882 | "slide_type": "subslide"
883 | }
884 | },
885 | "outputs": [
886 | {
887 | "data": {
888 | "text/plain": [
889 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
890 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
891 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
892 | " verbose=0, warm_start=False)"
893 | ]
894 | },
895 | "execution_count": 19,
896 | "metadata": {},
897 | "output_type": "execute_result"
898 | }
899 | ],
900 | "source": [
901 | "from sklearn.linear_model import LogisticRegression\n",
902 | "lr = LogisticRegression()\n",
903 | "lr.fit(X_train,y_train.provider.values)"
904 | ]
905 | },
906 | {
907 | "cell_type": "code",
908 | "execution_count": 20,
909 | "metadata": {
910 | "slideshow": {
911 | "slide_type": "subslide"
912 | }
913 | },
914 | "outputs": [
915 | {
916 | "data": {
917 | "text/plain": [
918 | "0.7777777777777778"
919 | ]
920 | },
921 | "execution_count": 20,
922 | "metadata": {},
923 | "output_type": "execute_result"
924 | }
925 | ],
926 | "source": [
927 | "from sklearn.metrics import accuracy_score\n",
928 | "accuracy_score(y_test.provider.values, lr.predict(X_test))"
929 | ]
930 | },
931 | {
932 | "cell_type": "code",
933 | "execution_count": 21,
934 | "metadata": {
935 | "scrolled": true,
936 | "slideshow": {
937 | "slide_type": "fragment"
938 | }
939 | },
940 | "outputs": [
941 | {
942 | "data": {
943 | "text/plain": [
944 | "array(['聯合新聞網', '中央社', '聯合新聞網', '聯合新聞網', '中央社', '中央社', '聯合新聞網', '中央社',\n",
945 | " '聯合新聞網'], dtype=object)"
946 | ]
947 | },
948 | "execution_count": 21,
949 | "metadata": {},
950 | "output_type": "execute_result"
951 | }
952 | ],
953 | "source": [
954 | "y_test.provider.values"
955 | ]
956 | },
957 | {
958 | "cell_type": "code",
959 | "execution_count": 22,
960 | "metadata": {
961 | "scrolled": true,
962 | "slideshow": {
963 | "slide_type": "fragment"
964 | }
965 | },
966 | "outputs": [
967 | {
968 | "data": {
969 | "text/plain": [
970 | "array(['中央社', '中央社', '聯合新聞網', '中央社', '中央社', '中央社', '聯合新聞網', '中央社',\n",
971 | " '聯合新聞網'], dtype=object)"
972 | ]
973 | },
974 | "execution_count": 22,
975 | "metadata": {},
976 | "output_type": "execute_result"
977 | }
978 | ],
979 | "source": [
980 | "lr.predict(X_test)"
981 | ]
982 | },
983 | {
984 | "cell_type": "markdown",
985 | "metadata": {
986 | "slideshow": {
987 | "slide_type": "subslide"
988 | }
989 | },
990 | "source": [
991 | "### Cross Validation \n",
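992 | "We can use cross-validation to evaluate the classifier. A common approach is k-fold: split the data into k equal parts, train on k-1 of them and test on the remaining one, and repeat k times so that each part serves as the test set once. This makes full use of the data we already have."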
993 | ]
994 | },
995 | {
996 | "cell_type": "code",
997 | "execution_count": 23,
998 | "metadata": {
999 | "scrolled": true,
1000 | "slideshow": {
1001 | "slide_type": "fragment"
1002 | }
1003 | },
1004 | "outputs": [
1005 | {
1006 | "name": "stdout",
1007 | "output_type": "stream",
1008 | "text": [
1009 | "[0.5 1. 1. 0.8 1. ]\n"
1010 | ]
1011 | }
1012 | ],
1013 | "source": [
1014 | "from sklearn.model_selection import cross_val_score\n",
1015 | "scores = cross_val_score(lr, selected_news_tfidf, selected_news.provider.values, cv=5)\n",
1016 | "print(scores)"
1017 | ]
1018 | },
1019 | {
1020 | "cell_type": "code",
1021 | "execution_count": 24,
1022 | "metadata": {
1023 | "slideshow": {
1024 | "slide_type": "fragment"
1025 | }
1026 | },
1027 | "outputs": [
1028 | {
1029 | "name": "stdout",
1030 | "output_type": "stream",
1031 | "text": [
1032 | "Accuracy: 0.86 (+/- 0.39)\n"
1033 | ]
1034 | }
1035 | ],
1036 | "source": [
1037 | "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
1038 | ]
1039 | },
1040 | {
1041 | "cell_type": "markdown",
1042 | "metadata": {
1043 | "slideshow": {
1044 | "slide_type": "slide"
1045 | }
1046 | },
1047 | "source": [
1048 | "## Exercises and Solutions \n",
1049 | "\n",
1050 | "\n",
1051 | "1. Use the F1 score instead of accuracy to evaluate the classifier.\n",
1052 | "\n",
1053 | " \n",
1054 | "```python\n",
1055 | "from sklearn.metrics import f1_score\n",
1056 | "f1_score(y_test.provider.values, lr.predict(X_test), average='macro')\n",
1057 | "```\n",
1058 | "\n",
1059 | "\n",
1060 | " \n",
1061 | "\n",
1062 | "2. Train a new classifier with Multinomial Naive Bayes.\n",
1063 | "\n",
1064 | " \n",
1065 | "```python\n",
1066 | "from sklearn.naive_bayes import MultinomialNB\n",
1067 | "nb = MultinomialNB()\n",
1068 | "nb.fit(X_train,y_train.provider.values)\n",
1069 | "accuracy_score(y_test.provider.values, nb.predict(X_test))\n",
1070 | "```\n",
1071 | "\n",
1072 | " \n"
1073 | ]
1074 | },
1075 | {
1076 | "cell_type": "markdown",
1077 | "metadata": {
1078 | "slideshow": {
1079 | "slide_type": "notes"
1080 | }
1081 | },
1082 | "source": [
1083 | "## More about: \n",
1084 | "1. [An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n",
1085 | "2. [Working With Text Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)\n",
1086 | "3. [Scikit Learn User Guide](http://scikit-learn.org/stable/user_guide.html)\n"
1087 | ]
1088 | }
1089 | ],
1090 | "metadata": {
1091 | "anaconda-cloud": {},
1092 | "celltoolbar": "Slideshow",
1093 | "kernelspec": {
1094 | "display_name": "Python 3",
1095 | "language": "python",
1096 | "name": "python3"
1097 | },
1098 | "language_info": {
1099 | "codemirror_mode": {
1100 | "name": "ipython",
1101 | "version": 3
1102 | },
1103 | "file_extension": ".py",
1104 | "mimetype": "text/x-python",
1105 | "name": "python",
1106 | "nbconvert_exporter": "python",
1107 | "pygments_lexer": "ipython3",
1108 | "version": "3.7.0"
1109 | }
1110 | },
1111 | "nbformat": 4,
1112 | "nbformat_minor": 1
1113 | }
1114 |
--------------------------------------------------------------------------------
/文本分析及語料庫統計/ipynb/4. Practice on news annotation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# 4. Practice on news annotation\n",
8 | "Build models from the news_annotation data produced in the previous session. \n",
9 | "1. Convert the segmented cln_content column to tf-idf\n",
10 | "2. Fit a linear model on polarity_score\n",
11 | "3. Fit a logistic regression model on sentiment"
12 | ]
13 | },
14 | {
15 | "cell_type": "markdown",
16 | "metadata": {},
17 | "source": [
18 | "## Preprocessing"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 1,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "import pandas as pd\n",
28 | "from pathlib import Path\n",
29 | "data_folder = Path(\"../data/\")"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 2,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "news = pd.read_pickle(data_folder / 'news_annotation.p')"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 3,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "name": "stdout",
48 | "output_type": "stream",
49 | "text": [
50 | "\n",
51 | "RangeIndex: 381 entries, 0 to 380\n",
52 | "Data columns (total 3 columns):\n",
53 | "cln_content 381 non-null object\n",
54 | "polarity_score 381 non-null int64\n",
55 | "sentiment 381 non-null object\n",
56 | "dtypes: int64(1), object(2)\n",
57 | "memory usage: 9.0+ KB\n"
58 | ]
59 | }
60 | ],
61 | "source": [
62 | "news.info()"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": 4,
68 | "metadata": {},
69 | "outputs": [
70 | {
71 | "data": {
72 | "text/html": [
73 | "\n",
74 | "\n",
87 | "
\n",
88 | " \n",
89 | " \n",
90 | " | \n",
91 | " cln_content | \n",
92 | " polarity_score | \n",
93 | " sentiment | \n",
94 | "
\n",
95 | " \n",
96 | " \n",
97 | " \n",
98 | " 0 | \n",
99 | " 國民黨 高雄 市長 候選人 韓國瑜 昨 接受 媒體 專訪 時稱 當選 市長 後 禁止 政治 ... | \n",
100 | " 15 | \n",
101 | " 1 | \n",
102 | "
\n",
103 | " \n",
104 | " 1 | \n",
105 | " 國民黨 高雄 市長 候選人 韓國瑜 今天 台北 大安區 舉辦 第二場 北漂 青年 座談 表示... | \n",
106 | " 28 | \n",
107 | " 1 | \n",
108 | "
\n",
109 | " \n",
110 | " 2 | \n",
111 | " 韓流北 漂 第二場 國民黨 高雄 市長 韓國瑜 主攻 北漂 族群 昨天 舉行 北 漂族 見面... | \n",
112 | " 17 | \n",
113 | " 1 | \n",
114 | "
\n",
115 | " \n",
116 | " 3 | \n",
117 | " 國民黨 高雄 市長 候選人 韓國瑜 日前 接受 專訪 時 指出 高雄 市長 所有 高雄 街頭... | \n",
118 | " 12 | \n",
119 | " 1 | \n",
120 | "
\n",
121 | " \n",
122 | " 4 | \n",
123 | " 國民黨 高雄 市長 候選人 韓國瑜 接受 中時 電子報 專訪 稱 當選 市長 後 未來 政治... | \n",
124 | " 5 | \n",
125 | " 1 | \n",
126 | "
\n",
127 | " \n",
128 | "
\n",
129 | "
"
130 | ],
131 | "text/plain": [
132 | " cln_content polarity_score sentiment\n",
133 | "0 國民黨 高雄 市長 候選人 韓國瑜 昨 接受 媒體 專訪 時稱 當選 市長 後 禁止 政治 ... 15 1\n",
134 | "1 國民黨 高雄 市長 候選人 韓國瑜 今天 台北 大安區 舉辦 第二場 北漂 青年 座談 表示... 28 1\n",
135 | "2 韓流北 漂 第二場 國民黨 高雄 市長 韓國瑜 主攻 北漂 族群 昨天 舉行 北 漂族 見面... 17 1\n",
136 | "3 國民黨 高雄 市長 候選人 韓國瑜 日前 接受 專訪 時 指出 高雄 市長 所有 高雄 街頭... 12 1\n",
137 | "4 國民黨 高雄 市長 候選人 韓國瑜 接受 中時 電子報 專訪 稱 當選 市長 後 未來 政治... 5 1"
138 | ]
139 | },
140 | "execution_count": 4,
141 | "metadata": {},
142 | "output_type": "execute_result"
143 | }
144 | ],
145 | "source": [
146 | "news.head()"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 5,
152 | "metadata": {
153 | "scrolled": true
154 | },
155 | "outputs": [],
156 | "source": [
157 | "news.sentiment = news.sentiment.astype('category')"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 6,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": [
166 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
167 | "v = TfidfVectorizer()\n",
168 | "news_tfidf = v.fit_transform(news.cln_content)"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 7,
174 | "metadata": {},
175 | "outputs": [
176 | {
177 | "data": {
178 | "text/plain": [
179 | "(381, 16479)"
180 | ]
181 | },
182 | "execution_count": 7,
183 | "metadata": {},
184 | "output_type": "execute_result"
185 | }
186 | ],
187 | "source": [
188 | "news_tfidf.shape"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "## Linear Model"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 8,
201 | "metadata": {},
202 | "outputs": [],
203 | "source": [
204 | "import sklearn\n",
205 | "from sklearn.model_selection import train_test_split\n",
206 | "X_train, X_test, y_train, y_test = train_test_split(\n",
207 | " news_tfidf, \n",
208 | " news.polarity_score,\n",
209 | " test_size=0.3, \n",
210 | " random_state=7)"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 9,
216 | "metadata": {},
217 | "outputs": [],
218 | "source": [
219 | "from sklearn.linear_model import LinearRegression\n",
220 | "from sklearn.metrics import mean_squared_error, r2_score\n",
221 | "\n",
222 | "regr = LinearRegression()\n",
223 | "regr.fit(X_train, y_train)\n",
224 | "y_pred = regr.predict(X_test)"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": 10,
230 | "metadata": {},
231 | "outputs": [
232 | {
233 | "name": "stdout",
234 | "output_type": "stream",
235 | "text": [
236 | "Coefficients: \n",
237 | " [3.63835662 0.45605763 1.81917831 ... 0. 0.34342381 0.04584367]\n",
238 | "Mean squared error: 170.07\n",
239 | "Variance score: 0.24\n"
240 | ]
241 | }
242 | ],
243 | "source": [
244 | "print('Coefficients: \\n', regr.coef_)\n",
245 | "\n",
246 | "print(\"Mean squared error: %.2f\"\n",
247 | " % mean_squared_error(y_test, y_pred))\n",
248 | "\n",
249 | "# R^2 (coefficient of determination): 1 is perfect prediction\n",
250 | "print('Variance score: %.2f' % r2_score(y_test, y_pred))"
251 | ]
252 | },
253 | {
254 | "cell_type": "markdown",
255 | "metadata": {},
256 | "source": [
257 | "## Logistic Regression Model"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": 11,
263 | "metadata": {},
264 | "outputs": [],
265 | "source": [
266 | "import sklearn\n",
267 | "from sklearn.model_selection import train_test_split\n",
268 | "\n",
269 | "X_train, X_test, y_train, y_test = train_test_split(\n",
270 | " news_tfidf, \n",
271 | " news.sentiment,\n",
272 | " test_size=0.3, \n",
273 | " random_state=0)"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "execution_count": 12,
279 | "metadata": {
280 | "scrolled": false
281 | },
282 | "outputs": [
283 | {
284 | "data": {
285 | "text/plain": [
286 | "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n",
287 | " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n",
288 | " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n",
289 | " verbose=0, warm_start=False)"
290 | ]
291 | },
292 | "execution_count": 12,
293 | "metadata": {},
294 | "output_type": "execute_result"
295 | }
296 | ],
297 | "source": [
298 | "from sklearn.linear_model import LogisticRegression\n",
299 | "lr = LogisticRegression()\n",
300 | "lr.fit(X_train,y_train)"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 13,
306 | "metadata": {},
307 | "outputs": [
308 | {
309 | "data": {
310 | "text/plain": [
311 | "0.9391304347826087"
312 | ]
313 | },
314 | "execution_count": 13,
315 | "metadata": {},
316 | "output_type": "execute_result"
317 | }
318 | ],
319 | "source": [
320 | "from sklearn.metrics import accuracy_score\n",
321 | "accuracy_score(y_test, lr.predict(X_test))"
322 | ]
323 | },
324 | {
325 | "cell_type": "code",
326 | "execution_count": 14,
327 | "metadata": {},
328 | "outputs": [
329 | {
330 | "name": "stdout",
331 | "output_type": "stream",
332 | "text": [
333 | "[0.93506494 0.93506494 0.93421053 0.93421053 0.94666667]\n"
334 | ]
335 | }
336 | ],
337 | "source": [
338 | "from sklearn.model_selection import cross_val_score\n",
339 | "scores = cross_val_score(lr, news_tfidf, news.sentiment, cv=5)\n",
340 | "print(scores)"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 15,
346 | "metadata": {},
347 | "outputs": [
348 | {
349 | "name": "stdout",
350 | "output_type": "stream",
351 | "text": [
352 | "Accuracy: 0.94 (+/- 0.01)\n"
353 | ]
354 | }
355 | ],
356 | "source": [
357 | "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
358 | ]
359 | }
360 | ],
361 | "metadata": {
362 | "kernelspec": {
363 | "display_name": "Python 3",
364 | "language": "python",
365 | "name": "python3"
366 | },
367 | "language_info": {
368 | "codemirror_mode": {
369 | "name": "ipython",
370 | "version": 3
371 | },
372 | "file_extension": ".py",
373 | "mimetype": "text/x-python",
374 | "name": "python",
375 | "nbconvert_exporter": "python",
376 | "pygments_lexer": "ipython3",
377 | "version": "3.7.0"
378 | }
379 | },
380 | "nbformat": 4,
381 | "nbformat_minor": 2
382 | }
383 |
--------------------------------------------------------------------------------
/語料庫介面設計/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫介面設計/.gitkeep
--------------------------------------------------------------------------------
/語料庫介面設計/README.md:
--------------------------------------------------------------------------------
1 | Corpus Interface Design
2 |
3 | ---
4 |
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫標記及語言學分析/.gitkeep
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/README.md:
--------------------------------------------------------------------------------
1 | Corpus Annotation and Linguistic Analysis
2 |
3 | ---
4 |
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/practice/.ipynb_checkpoints/sentiment_annotation-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 58,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "import csv\n",
12 | "import pandas as pd\n",
13 | "import re"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 59,
19 | "metadata": {
20 | "collapsed": false
21 | },
22 | "outputs": [],
23 | "source": [
24 | "sample_file = list(csv.reader(open('new_sample.txt', \"r\"), delimiter = '\\t'))"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 60,
30 | "metadata": {
31 | "collapsed": false,
32 | "scrolled": true
33 | },
34 | "outputs": [],
35 | "source": [
36 | "df = pd.DataFrame(sample_file, columns=[\"sample\", \"sample_segmented\", \"polarity\"])\n",
37 | "df = df.drop(['sample', 'polarity'], axis=1)\n",
38 | "# df"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 61,
44 | "metadata": {
45 | "collapsed": true
46 | },
47 | "outputs": [],
48 | "source": [
49 | "# NTUSD wordlist\n",
50 | "positive_words = open(\"positive_words.txt\",\"r\").read().split(\"\\n\")\n",
51 | "negative_words = open(\"negative_words.txt\",\"r\").read().split(\"\\n\")"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 62,
57 | "metadata": {
58 | "collapsed": false
59 | },
60 | "outputs": [],
61 | "source": [
62 | "# === Count the positive lexicon words found in each sample === #\n",
63 | "positive_word_score = []\n",
64 | "for text in list(df.sample_segmented):\n",
65 | "    result = 0\n",
66 | "    for words in positive_words:\n",
67 | "        if words in text:\n",
68 | "            result += 1\n",
69 | "    positive_word_score.append(result)\n",
70 | "# positive_word_score\n",
71 | "\n",
72 | "# === Count the positive pattern matches in each sample === #\n",
73 | "positive_pattern = '還好.+(會|不會)?'\n",
74 | "positive_pattern_score = []\n",
75 | "for text in list(df.sample_segmented):\n",
76 | "    positive_pattern_score.append(len(re.findall(positive_pattern,text)))\n",
77 | "# positive_pattern_score\n",
78 | "\n",
79 | "# === Combine the positive word and positive pattern counts === #\n",
80 | "positive_score = [positive_word_score[i] + positive_pattern_score[i] for i in range(len(positive_word_score))]"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 63,
86 | "metadata": {
87 | "collapsed": false,
88 | "scrolled": true
89 | },
90 | "outputs": [
91 | {
92 | "data": {
93 | "text/plain": [
94 | "[0,\n",
95 | " 0,\n",
96 | " 0,\n",
97 | " 0,\n",
98 | " 0,\n",
99 | " 0,\n",
100 | " 0,\n",
101 | " 0,\n",
102 | " -1,\n",
103 | " 0,\n",
104 | " 0,\n",
105 | " 0,\n",
106 | " 0,\n",
107 | " -2,\n",
108 | " 0,\n",
109 | " -2,\n",
110 | " 0,\n",
111 | " 0,\n",
112 | " 0,\n",
113 | " -1,\n",
114 | " 0,\n",
115 | " 0,\n",
116 | " -1,\n",
117 | " -1,\n",
118 | " 0,\n",
119 | " 0,\n",
120 | " 0,\n",
121 | " 0,\n",
122 | " -1,\n",
123 | " -1,\n",
124 | " 0,\n",
125 | " 0,\n",
126 | " 0,\n",
127 | " 0,\n",
128 | " 0,\n",
129 | " 0,\n",
130 | " 0,\n",
131 | " 0,\n",
132 | " 0,\n",
133 | " -1,\n",
134 | " 0,\n",
135 | " 0,\n",
136 | " 0,\n",
137 | " 0,\n",
138 | " 0,\n",
139 | " 0,\n",
140 | " 0,\n",
141 | " 0,\n",
142 | " 0,\n",
143 | " 0,\n",
144 | " 0,\n",
145 | " 0,\n",
146 | " -2,\n",
147 | " -2,\n",
148 | " 0,\n",
149 | " -4,\n",
150 | " 0,\n",
151 | " 0,\n",
152 | " 0,\n",
153 | " 0,\n",
154 | " -1,\n",
155 | " -1,\n",
156 | " 0,\n",
157 | " -2,\n",
158 | " 0,\n",
159 | " -1,\n",
160 | " 0,\n",
161 | " 0,\n",
162 | " -2,\n",
163 | " 0,\n",
164 | " 0,\n",
165 | " 0,\n",
166 | " -1,\n",
167 | " -2,\n",
168 | " 0,\n",
169 | " -1,\n",
170 | " 0,\n",
171 | " 0,\n",
172 | " -1,\n",
173 | " 0,\n",
174 | " 0,\n",
175 | " 0,\n",
176 | " -3,\n",
177 | " 0,\n",
178 | " 0,\n",
179 | " 0,\n",
180 | " 0,\n",
181 | " 0,\n",
182 | " 0,\n",
183 | " 0,\n",
184 | " 0,\n",
185 | " 0,\n",
186 | " 0,\n",
187 | " 0,\n",
188 | " 0,\n",
189 | " 0,\n",
190 | " 0]"
191 | ]
192 | },
193 | "execution_count": 63,
194 | "metadata": {},
195 | "output_type": "execute_result"
196 | }
197 | ],
198 | "source": [
199 | "# === Count the negative lexicon words found in each sample === #\n",
200 | "negative_word_score = []\n",
201 | "for text in list(df.sample_segmented):\n",
202 | "    result = 0\n",
203 | "    for words in negative_words:\n",
204 | "        if words in text:\n",
205 | "            result -= 1\n",
206 | "    negative_word_score.append(result)\n",
207 | "# negative_word_score\n",
208 | "\n",
209 | "# === Count the negative pattern matches in each sample === #\n",
210 | "negative_pattern = r'都.*了.*還.*|連.+都.+|結果.+都'\n",
211 | "negative_pattern_score = []\n",
212 | "for text in list(df.sample_segmented):\n",
213 | "    negative_pattern_score.append(len(re.findall(negative_pattern,text))*-1)\n",
214 | "# negative_pattern_score\n",
215 | "\n",
216 | "# === Combine the negative word and negative pattern scores === #\n",
217 | "negative_score = [negative_word_score[i] + negative_pattern_score[i] for i in range(len(negative_word_score))]\n",
218 | "negative_score\n"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": 64,
224 | "metadata": {
225 | "collapsed": false
226 | },
227 | "outputs": [],
228 | "source": [
229 | "# df['positive_score'] = positive_score\n",
230 | "# df['negative_score'] = negative_score\n",
231 | "df['polarity_score'] = [positive_score[i] + negative_score[i] for i in range(len(positive_score))]\n"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 65,
237 | "metadata": {
238 | "collapsed": false
239 | },
240 | "outputs": [],
241 | "source": [
242 | "df.loc[df.polarity_score > 0, 'sentiment'] = '1' \n",
243 | "df.loc[df.polarity_score < 0, 'sentiment'] = '-1' \n",
244 | "df.loc[df.polarity_score == 0, 'sentiment'] = '0' "
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 66,
250 | "metadata": {
251 | "collapsed": false
252 | },
253 | "outputs": [],
254 | "source": [
255 | "# df = df.drop(['polarity_score'], axis=1)"
256 | ]
257 | },
258 | {
259 | "cell_type": "code",
260 | "execution_count": 67,
261 | "metadata": {
262 | "collapsed": false
263 | },
264 | "outputs": [
265 | {
266 | "data": {
663 | "text/plain": [
664 | " sample_segmented polarity_score \\\n",
665 | "0 有趣 中肯 溫馨 3 \n",
666 | "1 五樓 耶 , 好巧 喔 0 \n",
667 | "2 結束 了 才 要 去 你 是 想到 ㄛ 妳 0 \n",
668 | "3 年度 最 中肯 ! ! 1 \n",
669 | "4 好 好玩 阿 1 \n",
670 | "5 有趣 中肯 溫馨 3 \n",
671 | "6 年度 最 中肯 ! ! 1 \n",
672 | "7 生日快樂 3 \n",
673 | "8 我 還沒 去 .... -1 \n",
674 | "9 有趣 中肯 溫馨 3 \n",
675 | "10 早睡早起 精神 好 0 \n",
676 | "11 太棒 了 ! ! 1 \n",
677 | "12 有趣 中肯 溫馨 3 \n",
678 | "13 恐怖 金紙 鋪 -2 \n",
679 | "14 年度 最 中肯 ! ! 1 \n",
680 | "15 睿 真的 有夠 生氣 -2 \n",
681 | "16 嘖 有 看到 花博 了 還嫌 哩 0 \n",
682 | "17 爭 豔 館 那 紫色 讚 ! ! 1 \n",
683 | "18 有趣 中肯 溫馨 3 \n",
684 | "19 我 可 是 去 走 了 數小時 一個館 都沒 逛 呢 -1 \n",
685 | "20 美女 晚安 安 ~ ~ 0 \n",
686 | "21 美女 晚安 安 ~ ~ ~ 0 \n",
687 | "22 我 還沒 去過 -1 \n",
688 | "23 可憐 你 了 -1 \n",
689 | "24 我 不 怎麼 喜歡 報告 的 說 3 \n",
690 | "25 真是太 厲害 囉 ~ 1 \n",
691 | "26 有趣 中肯 溫馨 3 \n",
692 | "27 假日 平安 ! 順心 ! 1 \n",
693 | "28 好險 我 都 逛 完了 ~ 乎 -1 \n",
694 | "29 買 一送 一 沒了 -1 \n",
695 | ".. ... ... \n",
696 | "67 有趣 中肯 溫馨 3 \n",
697 | "68 結果 說 要 去 都 沒去 成 -2 \n",
698 | "69 早上好 ~ 1 \n",
699 | "70 去 了 兩次 .. 也 只 看 了 一區 半 0 \n",
700 | "71 看來 你 順利 回到 家 了 1 \n",
701 | "72 上次 跟 我媽 他們 去 六點 多 就 起床 .. 還沒 拿到 夢想 館 的 票 , 只能... 0 \n",
702 | "73 可惜 的 是 這次 沒能 跟 幸茵 好好 坐下 來 吃 頓 美食 ... 幸茵 就 走 了 -2 \n",
703 | "74 她 真的 很乖 0 \n",
704 | "75 那 它 分給 別人 不 就 一堆 人 進來 嘖嘖 真 不好 -1 \n",
705 | "76 我 沒 有 排耶 XDD 真 幸運 1 \n",
706 | "77 有趣 中肯 溫馨 3 \n",
707 | "78 你 都 去 幾次 了 。 我 半次 都 沒去 過 -1 \n",
708 | "79 好聽 耶 1 \n",
709 | "80 五月 之 後 就會 有館 重新開放 了 ~ 到 時候 可以 再 慢慢 晃 4 \n",
710 | "81 真的 ! ! ! 今天 好熱 ! ! ! 0 \n",
711 | "82 對 啊 ~ ~ ~ 我 去 那 一天 剛好 是 人數 最多 的 一天 .... 啥 館 都... -2 \n",
712 | "83 有趣 中肯 溫馨 3 \n",
713 | "84 我 還有 好 幾 ㄍ 館 沒 看 ... 1 \n",
714 | "85 Have a Nice Day ~ 0 \n",
715 | "86 Good morning ~ 0 \n",
716 | "87 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天... 1 \n",
717 | "88 有趣 中肯 溫馨 3 \n",
718 | "89 早上好 1 \n",
719 | "90 午安 0 \n",
720 | "91 也 為 開始 新 的 一歲 倒數 ~ ~ ~ Happy birthday 喔 ~... 0 \n",
721 | "92 哀 噁 0 \n",
722 | "93 有趣 中肯 溫馨 3 \n",
723 | "94 me too 哦 鼻涕 擤 不庭 加上 喉嚨 痛 0 \n",
724 | "95 不錯 玩 1 \n",
725 | "96 別去 都 是 人 .. 0 \n",
726 | "\n",
727 | " sentiment \n",
728 | "0 1 \n",
729 | "1 0 \n",
730 | "2 0 \n",
731 | "3 1 \n",
732 | "4 1 \n",
733 | "5 1 \n",
734 | "6 1 \n",
735 | "7 1 \n",
736 | "8 -1 \n",
737 | "9 1 \n",
738 | "10 0 \n",
739 | "11 1 \n",
740 | "12 1 \n",
741 | "13 -1 \n",
742 | "14 1 \n",
743 | "15 -1 \n",
744 | "16 0 \n",
745 | "17 1 \n",
746 | "18 1 \n",
747 | "19 -1 \n",
748 | "20 0 \n",
749 | "21 0 \n",
750 | "22 -1 \n",
751 | "23 -1 \n",
752 | "24 1 \n",
753 | "25 1 \n",
754 | "26 1 \n",
755 | "27 1 \n",
756 | "28 -1 \n",
757 | "29 -1 \n",
758 | ".. ... \n",
759 | "67 1 \n",
760 | "68 -1 \n",
761 | "69 1 \n",
762 | "70 0 \n",
763 | "71 1 \n",
764 | "72 0 \n",
765 | "73 -1 \n",
766 | "74 0 \n",
767 | "75 -1 \n",
768 | "76 1 \n",
769 | "77 1 \n",
770 | "78 -1 \n",
771 | "79 1 \n",
772 | "80 1 \n",
773 | "81 0 \n",
774 | "82 -1 \n",
775 | "83 1 \n",
776 | "84 1 \n",
777 | "85 0 \n",
778 | "86 0 \n",
779 | "87 1 \n",
780 | "88 1 \n",
781 | "89 1 \n",
782 | "90 0 \n",
783 | "91 0 \n",
784 | "92 0 \n",
785 | "93 1 \n",
786 | "94 0 \n",
787 | "95 1 \n",
788 | "96 0 \n",
789 | "\n",
790 | "[97 rows x 3 columns]"
791 | ]
792 | },
793 | "execution_count": 67,
794 | "metadata": {},
795 | "output_type": "execute_result"
796 | }
797 | ],
798 | "source": [
799 | "df"
800 | ]
801 | },
802 | {
803 | "cell_type": "code",
804 | "execution_count": 68,
805 | "metadata": {
806 | "collapsed": true
807 | },
808 | "outputs": [],
809 | "source": [
810 | "df.to_csv('sentiment_annotation_sample.csv', index=False, header=True)"
811 | ]
812 | },
813 | {
814 | "cell_type": "markdown",
815 | "metadata": {
816 | "collapsed": true
817 | },
818 | "source": [
819 | "## Try annotating the news corpus with sentiment as well"
820 | ]
821 | },
822 | {
823 | "cell_type": "code",
824 | "execution_count": 1,
825 | "metadata": {
826 | "collapsed": true
827 | },
828 | "outputs": [],
829 | "source": [
830 | "import json"
831 | ]
832 | },
833 | {
834 | "cell_type": "code",
835 | "execution_count": null,
836 | "metadata": {
837 | "collapsed": true
838 | },
839 | "outputs": [],
840 | "source": [
841 | "with open('../../語料庫資料前處理/data/clean_han.json') as json_data:\n",
842 | "    d = json.load(json_data)"
843 | ]
844 | }
845 | ],
846 | "metadata": {
847 | "anaconda-cloud": {},
848 | "kernelspec": {
849 | "display_name": "Python [Root]",
850 | "language": "python",
851 | "name": "Python [Root]"
852 | },
853 | "language_info": {
854 | "codemirror_mode": {
855 | "name": "ipython",
856 | "version": 3
857 | },
858 | "file_extension": ".py",
859 | "mimetype": "text/x-python",
860 | "name": "python",
861 | "nbconvert_exporter": "python",
862 | "pygments_lexer": "ipython3",
863 | "version": "3.5.2"
864 | }
865 | },
866 | "nbformat": 4,
867 | "nbformat_minor": 0
868 | }
869 |
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/practice/2018 語料庫程式實務工作坊_語言學分析&標記實作_學習單.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫標記及語言學分析/practice/2018 語料庫程式實務工作坊_語言學分析&標記實作_學習單.pdf
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/practice/new_sample.txt:
--------------------------------------------------------------------------------
1 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
2 | 五樓耶, 好巧喔 五樓 耶 , 好巧 喔 +
3 | 結束了才要去 你是想到ㄛ妳 結束 了 才 要 去 你 是 想到 ㄛ 妳 -
4 | 年度最中肯!! 年度 最 中肯 ! ! +
5 | 好好玩阿 好 好玩 阿 +
6 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
7 | 年度最中肯!! 年度 最 中肯 ! ! +
8 | 生日快樂 生日快樂 +
9 | 我還沒去.... 我 還沒 去 .... -
10 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
11 | 早睡早起精神好 早睡早起 精神 好 +
12 | 太棒了!! 太棒 了 ! ! +
13 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
14 | 恐怖金紙鋪 恐怖 金紙 鋪 -
15 | 年度最中肯!! 年度 最 中肯 ! ! +
16 | 睿真的有夠生氣 睿 真的 有夠 生氣 -
17 | 嘖 有看到花博了 還嫌哩 嘖 有 看到 花博 了 還嫌 哩 -
18 | 爭豔館那紫色讚!! 爭 豔 館 那 紫色 讚 ! ! +
19 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
20 | 我可是去走了數小時一個館都沒逛呢 我 可 是 去 走 了 數小時 一個館 都沒 逛 呢 -
21 | 美女晚安安~ ~ 美女 晚安 安 ~ ~ +
22 | 美女晚安安~ ~~ 美女 晚安 安 ~ ~ ~ +
23 | 我還沒去過 我 還沒 去過 -
24 | 可憐你了 可憐 你 了 -
25 | 我不怎麼喜歡報告的說 我 不 怎麼 喜歡 報告 的 說 -
26 | 真是太厲害囉~ 真是太 厲害 囉 ~ +
27 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
28 | 假日平安!順心! 假日 平安 ! 順心 ! +
29 | 好險我都逛完了~乎 好險 我 都 逛 完了 ~ 乎 +
30 | 買一送一沒了 買 一送 一 沒了 -
31 | 上星期去花博給它拼了3館!爭豔館、養生館、天使館~,還有看到一點點的未來館~~還真是拼ㄚ 上星期 去 花博給 它 拼 了 3 館 ! 爭 豔 館 、 養生館 、 天使 館 ~ , 還有 看到 一點點 的 未來館 ~ ~ 還 真是 拼 ㄚ +
32 | 你去了喔XDDD 趕上了 你 去 了 喔 XDDD 趕上 了 +
33 | 對阿 謝謝志工 有玩的很開心就好 這樣子志工跟我都很開心!!YA 對 阿 謝謝 志工 有 玩 的 很開心 就 好 這樣 子志工 跟 我 都 很開心 ! ! Y A +
34 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
35 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
36 | 鐵腿了吧 鐵腿 了 吧 -
37 | 躺著也中槍 躺 著 也 中槍 -
38 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
39 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
40 | 幹麻不帶 幹麻 不帶 -
41 | 我最近也滿喜歡聽這個的 我 最近 也 滿 喜歡 聽 這個 的 +
42 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
43 | 嘻嘻嘻 枝姐姐新的大頭照好閃亮唷 是因為本人最近也很閃嗎? 嘻嘻 嘻 枝 姐姐 新 的 大頭照 好 閃亮 唷 是 因為 本人 最近 也 很 閃 嗎 ? +
44 | 早睡早起精神好 早睡早起 精神 好 +
45 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
46 | 恭喜恭喜 恭喜 恭喜 +
47 | 午安 今天高溫記得多喝水 午安 今天 高溫 記得 多喝水 +
48 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
49 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
50 | 還好夢想館會留下 還好 夢想 館會 留下 +
51 | 好餓喔... 好 餓 喔 ... -
52 | 宿舍哪來新聞 宿舍 哪來 新聞 -
53 | 好悲慘 好 悲慘 -
54 | 好恐怖 好 恐怖 -
55 | XDD 我要喝喜酒 XDD 我要 喝 喜酒 +
56 | 只見新人笑~不見就人哭~可憐阿~還沒卸任就已經被人唾棄了~ 只見 新人 笑 ~ 不見 就 人 哭 ~ 可憐 阿 ~ 還沒 卸任 就 已經 被 人 唾棄 了 ~ -
57 | 早安 ! 來去線上收聽 KISSRADIO 開始美好的一天!! 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天 ! ! +
58 | 玩的開心點喔~ 玩 的 開心 點 喔 ~ +
59 | 花博又開放啦 花博 又 開放 啦 +
60 | +
61 | 真的快累斃了!!!! 真的 快累 斃了 ! ! ! ! -
62 | 哈哈哈~發酒瘋給她看XDD 哈哈哈 ~ 發酒瘋 給 她 看 XDD +
63 | 真讚~~ 真 讚 ~ ~ +
64 | 昨晚硬是把它用完 ..(但還有兩張遺失在我找不到的地方>< 昨晚 硬是 把 它 用 完 .. ( 但 還有 兩張 遺失 在 我 找不到 的 地方 > < -
65 | No good, Bob No good , Bob -
66 | 呵呵,我是乖孩子,小雪大人都沒打過我~ 呵呵 , 我 是 乖孩子 , 小雪 大人 都沒 打過 我 ~ +
67 | 我可能要去看醫生,掰伊~~ 我 可能 要 去 看 醫生 , 掰 伊 ~ ~ -
68 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
69 | 結果說要去都沒去成 結果 說 要 去 都 沒去 成 -
70 | 早上好~ 早上好 ~ +
71 | 去了兩次..也只看了一區半 去 了 兩次 .. 也 只 看 了 一區 半 -
72 | 看來你順利回到家了 看來 你 順利 回到 家 了 +
73 | 上次跟我媽他們去六點多就起床..還沒拿到夢想館的票,只能在門口等開門,一整天下來整個就是累慘了~~你們中午才去至少有睡飽 上次 跟 我媽 他們 去 六點 多 就 起床 .. 還沒 拿到 夢想 館 的 票 , 只能 在 門口 等 開門 , 一整天 下來 整個 就是 累慘 了 ~ ~ 你們 中午 才 去 至少 有 睡 飽 -
74 | 可惜的是這次沒能跟幸茵好好坐下來吃頓美食...幸茵就走了 可惜 的 是 這次 沒能 跟 幸茵 好好 坐下 來 吃 頓 美食 ... 幸茵 就 走 了 -
75 | 她真的很乖 她 真的 很乖 +
76 | 那它分給別人 不就一堆人進來 嘖嘖 真不好 那 它 分給 別人 不 就 一堆 人 進來 嘖嘖 真 不好 -
77 | 我沒有排耶XDD 真幸運 我 沒 有 排耶 XDD 真 幸運 +
78 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
79 | 你都去幾次了。 我半次都沒去過 你 都 去 幾次 了 。 我 半次 都 沒去 過 -
80 | 好聽耶 好聽 耶 +
81 | 五月之後就會有館重新開放了~到時候可以再慢慢晃 五月 之 後 就會 有館 重新開放 了 ~ 到 時候 可以 再 慢慢 晃 +
82 | 真的!!!今天好熱!!! 真的 ! ! ! 今天 好熱 ! ! ! -
83 | 對啊~~~我去那一天剛好是人數最多的一天....啥館都沒得看....又熱得半死...隨便逛逛就回家囉 對 啊 ~ ~ ~ 我 去 那 一天 剛好 是 人數 最多 的 一天 .... 啥 館 都 沒得 看 .... 又 熱得 半死 ... 隨便 逛逛 就 回家 囉 -
84 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
85 | 我還有好幾ㄍ館沒看... 我 還有 好 幾 ㄍ 館 沒 看 ... -
86 | Have a Nice Day~ Have a Nice Day ~ +
87 | Good morning~ Good morning ~ +
88 | 早安 ! 來去線上收聽 KISSRADIO 開始美好的一天!! 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天 ! ! +
89 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
90 | 早上好 早上好 +
91 | 午安 午安 +
92 | 也為開始新的一歲倒數~~~Happy birthday 喔~~~ 也 為 開始 新 的 一歲 倒數 ~ ~ ~ Happy birthday 喔 ~ ~ ~ +
93 | 哀噁 哀 噁 -
94 | 有趣 中肯 溫馨 有趣 中肯 溫馨 +
95 | me too哦 鼻涕擤不庭加上喉嚨痛 me too 哦 鼻涕 擤 不庭 加上 喉嚨 痛 -
96 | 不錯玩 不錯 玩 +
97 | 別去 都是人.. 別去 都 是 人 .. -
98 |
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/practice/readme.txt:
--------------------------------------------------------------------------------
1 | [new_sample.txt]
2 | Sample data released by SocialNlp, with one additional column containing the text segmented with jieba.
3 |
4 | [positive_words.txt]/[positivewordsall.txt]
5 | [negative_words.txt]/[negativewordsall.txt]
6 | Contain both the NTUSD data and data from a psychological experiment.
7 | P.S. The NTUSD data were extracted automatically by machine.
8 |
9 | [sentiment_annotation.py] / [sentiment_annotation.ipynb]
10 | Script that uses both the positive and negative word lists to automatically annotate the sentiment polarity of each text in new_sample.txt.
11 |
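The scoring idea described above (lexicon hits plus pattern hits, then a sign-based label) can be sketched roughly as follows. This is a minimal illustration, not the workshop script itself: the two tiny word sets are hypothetical stand-ins for positive_words.txt and negative_words.txt, the function names `polarity_score` and `sentiment_label` are made up for this example, and words are matched as whole tokens of the segmented text, whereas the script tests substrings of the whole string. The two regular expressions are copied from sentiment_annotation.py.

```python
import re

# Hypothetical miniature lexicons; the real lists are read from
# positive_words.txt and negative_words.txt.
POSITIVE_WORDS = {"有趣", "中肯", "溫馨"}
NEGATIVE_WORDS = {"恐怖", "可憐"}

# Patterns copied from sentiment_annotation.py.
POSITIVE_PATTERN = re.compile('還好.+(會|不會)?')
NEGATIVE_PATTERN = re.compile(r'都.*了.*還.*|連.+都.+|結果.+都')


def polarity_score(segmented_text):
    """Return (positive hits) - (negative hits) for one segmented sample."""
    # Assumption: tokens are separated by whitespace, as in new_sample.txt.
    tokens = set(segmented_text.split())
    positive_hits = len(tokens & POSITIVE_WORDS)
    negative_hits = len(tokens & NEGATIVE_WORDS)
    # Each regex match counts as one additional hit.
    positive_hits += len(POSITIVE_PATTERN.findall(segmented_text))
    negative_hits += len(NEGATIVE_PATTERN.findall(segmented_text))
    return positive_hits - negative_hits


def sentiment_label(score):
    """Map a polarity score to 1 (positive), 0 (neutral), or -1 (negative)."""
    return (score > 0) - (score < 0)


for sample in ["有趣 中肯 溫馨", "結果 說 要 去 都 沒去 成", "午安"]:
    score = polarity_score(sample)
    print(sample, score, sentiment_label(score))
```

Matching whole tokens avoids two side effects of substring tests: a short lexicon entry firing inside a longer word, and an empty string (for example from a trailing newline in a word list) matching every sample. Scores from this sketch will therefore not necessarily equal those in sentiment_annotation_sample.csv.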
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/practice/sentiment_annotation.py:
--------------------------------------------------------------------------------
1 | import csv
2 | import pandas as pd
3 | import re
4 |
5 | # --- Read the file to annotate --- #
6 | sample_file = list(csv.reader(open('new_sample.txt', "r"), delimiter = '\t'))
7 |
8 | # --- Name the columns --- #
9 | df = pd.DataFrame(sample_file, columns=["sample", "sample_segmented", "polarity"])
10 | df = df.drop(['sample', 'polarity'], axis=1)  # drop the reference-only columns
11 | # df
12 |
13 | # --- Load the sentiment lexicons (NTUSD wordlist) --- #
14 | positive_words = open("positive_words.txt","r").read().split("\n")
15 | negative_words = open("negative_words.txt","r").read().split("\n")
16 |
17 | # === Count the positive lexicon words found in each sample === #
18 | positive_word_score = []
19 | for text in list(df.sample_segmented):
20 |     result = 0
21 |     for words in positive_words:
22 |         if words in text:
23 |             result += 1
24 |     positive_word_score.append(result)
25 | # positive_word_score
26 |
27 | # === Count the positive pattern matches in each sample === #
28 | positive_pattern = '還好.+(會|不會)?'
29 | positive_pattern_score = []
30 | for text in list(df.sample_segmented):
31 |     positive_pattern_score.append(len(re.findall(positive_pattern,text)))
32 | # positive_pattern_score
33 |
34 | # === Combine the positive word and positive pattern counts === #
35 | positive_score = [positive_word_score[i] + positive_pattern_score[i] for i in range(len(positive_word_score))]
36 |
37 | # === Count the negative lexicon words found in each sample === #
38 | negative_word_score = []
39 | for text in list(df.sample_segmented):
40 |     result = 0
41 |     for words in negative_words:
42 |         if words in text:
43 |             result -= 1
44 |     negative_word_score.append(result)
45 | # negative_word_score
46 |
47 | # === Count the negative pattern matches in each sample === #
48 | negative_pattern = r'都.*了.*還.*|連.+都.+|結果.+都'
49 | negative_pattern_score = []
50 | for text in list(df.sample_segmented):
51 |     negative_pattern_score.append(len(re.findall(negative_pattern,text))*-1)
52 | # negative_pattern_score
53 |
54 | # === Combine the negative word and negative pattern scores === #
55 | negative_score = [negative_word_score[i] + negative_pattern_score[i] for i in range(len(negative_word_score))]
56 |
57 | # === Sum all sentiment polarity scores === #
58 | # df['positive_score'] = positive_score
59 | # df['negative_score'] = negative_score
60 | df['polarity_score'] = [positive_score[i] + negative_score[i] for i in range(len(positive_score))]
61 |
62 |
63 | # --- Label sentiment polarity numerically (positive: 1 / neutral: 0 / negative: -1) --- #
64 | df.loc[df.polarity_score > 0, 'sentiment'] = '1'
65 | df.loc[df.polarity_score < 0, 'sentiment'] = '-1'
66 | df.loc[df.polarity_score == 0, 'sentiment'] = '0'
67 | # df = df.drop(['polarity_score'], axis=1)
68 |
69 | # --- Write the annotated data out as CSV --- #
70 | df.to_csv('sentiment_annotation_sample.csv', index=False, header=True)
71 |
72 |
73 | ### >>>>>>>>>>> Try annotating the news corpus with sentiment as well <<<<<<<<<< ###
74 | # --- Packages you will need
75 | # import pandas as pd
76 | # import json
77 |
78 | # # --- Read the news corpus file (json file)
79 | # with open('your json file') as json_data:
80 | #     d = json.load(json_data)
81 |
82 | # # --- Convert the loaded json data to a DataFrame
83 | # news_data = pd.DataFrame.from_records(d)
84 | # # news_data
85 |
86 | # # --- Keep only the column with the segmented news text
87 | # news_content = pd.DataFrame(news_data['cln_content'])
88 |
89 | # # --- Count the positive lexicon words found in each sample
90 | # positive_word_score = []
91 | # for text in list(news_content.cln_content):
92 | #     result = 0
93 | #     for words in positive_words:
94 | #         if words in text:
95 | #             result += 1
96 | #     positive_word_score.append(result)
97 | # # positive_word_score
98 |
99 | # # --- Count the positive pattern matches in each sample
100 | # positive_pattern = '還好.+(會|不會)?'
101 | # positive_pattern_score = []
102 | # for text in list(news_content.cln_content):
103 | #     positive_pattern_score.append(len(re.findall(positive_pattern,text)))
104 | # # positive_pattern_score
105 |
106 | # # --- Combine the positive word and positive pattern counts
107 | # positive_score = [positive_word_score[i] + positive_pattern_score[i] for i in range(len(positive_word_score))]
108 | # #positive_score
109 |
110 | # # --- Count the negative lexicon words found in each sample
111 | # negative_word_score = []
112 | # for text in list(news_content.cln_content):
113 | #     result = 0
114 | #     for words in negative_words:
115 | #         if words in text:
116 | #             result -= 1
117 | #     negative_word_score.append(result)
118 | # # negative_word_score
119 |
120 | # # --- Count the negative pattern matches in each sample
121 | # negative_pattern = r'都.*了.*還.*|連.+都.+|結果.+都'
122 | # negative_pattern_score = []
123 | # for text in list(news_content.cln_content):
124 | #     negative_pattern_score.append(len(re.findall(negative_pattern,text))*-1)
125 | # # negative_pattern_score
126 |
127 | # # --- Combine the negative word and negative pattern scores
128 | # negative_score = [negative_word_score[i] + negative_pattern_score[i] for i in range(len(negative_word_score))]
129 | # # negative_score
130 |
131 | # news_content['polarity_score'] = [positive_score[i] + negative_score[i] for i in range(len(positive_score))]
132 |
133 | # news_content.to_csv('news_annotation.csv', index=False, header=True)
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/practice/sentiment_annotation_sample.csv:
--------------------------------------------------------------------------------
1 | sample_segmented,polarity_score,sentiment
2 | 有趣 中肯 溫馨 ,3,1
3 | " 五樓 耶 , 好巧 喔 ",0,0
4 | 結束 了 才 要 去 你 是 想到 ㄛ 妳 ,0,0
5 | 年度 最 中肯 ! ! ,1,1
6 | 好 好玩 阿 ,1,1
7 | 有趣 中肯 溫馨 ,3,1
8 | 年度 最 中肯 ! ! ,1,1
9 | 生日快樂 ,3,1
10 | 我 還沒 去 .... ,-1,-1
11 | 有趣 中肯 溫馨 ,3,1
12 | 早睡早起 精神 好 ,0,0
13 | 太棒 了 ! ! ,1,1
14 | 有趣 中肯 溫馨 ,3,1
15 | 恐怖 金紙 鋪 ,-2,-1
16 | 年度 最 中肯 ! ! ,1,1
17 | 睿 真的 有夠 生氣 ,-2,-1
18 | 嘖 有 看到 花博 了 還嫌 哩 ,0,0
19 | 爭 豔 館 那 紫色 讚 ! ! ,1,1
20 | 有趣 中肯 溫馨 ,3,1
21 | 我 可 是 去 走 了 數小時 一個館 都沒 逛 呢 ,-1,-1
22 | 美女 晚安 安 ~ ~ ,0,0
23 | 美女 晚安 安 ~ ~ ~ ,0,0
24 | 我 還沒 去過 ,-1,-1
25 | 可憐 你 了 ,-1,-1
26 | 我 不 怎麼 喜歡 報告 的 說 ,3,1
27 | 真是太 厲害 囉 ~ ,1,1
28 | 有趣 中肯 溫馨 ,3,1
29 | 假日 平安 ! 順心 ! ,1,1
30 | 好險 我 都 逛 完了 ~ 乎 ,-1,-1
31 | 買 一送 一 沒了 ,-1,-1
32 | 上星期 去 花博給 它 拼 了 3 館 ! 爭 豔 館 、 養生館 、 天使 館 ~ , 還有 看到 一點點 的 未來館 ~ ~ 還 真是 拼 ㄚ ,2,1
33 | 你 去 了 喔 XDDD 趕上 了 ,0,0
34 | 對 阿 謝謝 志工 有 玩 的 很開心 就 好 這樣 子志工 跟 我 都 很開心 ! ! Y A ,4,1
35 | 有趣 中肯 溫馨 ,3,1
36 | 有趣 中肯 溫馨 ,3,1
37 | 鐵腿 了 吧 ,0,0
38 | 躺 著 也 中槍 ,0,0
39 | 有趣 中肯 溫馨 ,3,1
40 | 有趣 中肯 溫馨 ,3,1
41 | 幹麻 不帶 ,-1,-1
42 | 我 最近 也 滿 喜歡 聽 這個 的 ,2,1
43 | 有趣 中肯 溫馨 ,3,1
44 | 嘻嘻 嘻 枝 姐姐 新 的 大頭照 好 閃亮 唷 是 因為 本人 最近 也 很 閃 嗎 ? ,1,1
45 | 早睡早起 精神 好 ,0,0
46 | 有趣 中肯 溫馨 ,3,1
47 | 恭喜 恭喜 ,1,1
48 | 午安 今天 高溫 記得 多喝水 ,0,0
49 | 有趣 中肯 溫馨 ,3,1
50 | 有趣 中肯 溫馨 ,3,1
51 | 還好 夢想 館會 留下 ,4,1
52 | 好 餓 喔 ... ,0,0
53 | 宿舍 哪來 新聞 ,0,0
54 | 好 悲慘 ,-2,-1
55 | 好 恐怖 ,-2,-1
56 | XDD 我要 喝 喜酒 ,1,1
57 | 只見 新人 笑 ~ 不見 就 人 哭 ~ 可憐 阿 ~ 還沒 卸任 就 已經 被 人 唾棄 了 ~ ,-3,-1
58 | 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天 ! ! ,1,1
59 | 玩 的 開心 點 喔 ~ ,2,1
60 | 花博 又 開放 啦 ,1,1
61 | +,0,0
62 | 真的 快累 斃了 ! ! ! ! ,-1,-1
63 | 哈哈哈 ~ 發酒瘋 給 她 看 XDD ,0,0
64 | 真 讚 ~ ~ ,1,1
65 | 昨晚 硬是 把 它 用 完 .. ( 但 還有 兩張 遺失 在 我 找不到 的 地方 > < ,-1,-1
66 | " No good , Bob ",0,0
67 | 呵呵 , 我 是 乖孩子 , 小雪 大人 都沒 打過 我 ~ ,0,0
68 | 我 可能 要 去 看 醫生 , 掰 伊 ~ ~ ,0,0
69 | 有趣 中肯 溫馨 ,3,1
70 | 結果 說 要 去 都 沒去 成 ,-2,-1
71 | 早上好 ~ ,1,1
72 | 去 了 兩次 .. 也 只 看 了 一區 半 ,0,0
73 | 看來 你 順利 回到 家 了 ,1,1
74 | " 上次 跟 我媽 他們 去 六點 多 就 起床 .. 還沒 拿到 夢想 館 的 票 , 只能 在 門口 等 開門 , 一整天 下來 整個 就是 累慘 了 ~ ~ 你們 中午 才 去 至少 有 睡 飽 ",0,0
75 | 可惜 的 是 這次 沒能 跟 幸茵 好好 坐下 來 吃 頓 美食 ... 幸茵 就 走 了 ,-2,-1
76 | 她 真的 很乖 ,0,0
77 | 那 它 分給 別人 不 就 一堆 人 進來 嘖嘖 真 不好 ,-1,-1
78 | 我 沒 有 排耶 XDD 真 幸運 ,1,1
79 | 有趣 中肯 溫馨 ,3,1
80 | 你 都 去 幾次 了 。 我 半次 都 沒去 過 ,-1,-1
81 | 好聽 耶 ,1,1
82 | 五月 之 後 就會 有館 重新開放 了 ~ 到 時候 可以 再 慢慢 晃 ,4,1
83 | 真的 ! ! ! 今天 好熱 ! ! ! ,0,0
84 | 對 啊 ~ ~ ~ 我 去 那 一天 剛好 是 人數 最多 的 一天 .... 啥 館 都 沒得 看 .... 又 熱得 半死 ... 隨便 逛逛 就 回家 囉 ,-2,-1
85 | 有趣 中肯 溫馨 ,3,1
86 | 我 還有 好 幾 ㄍ 館 沒 看 ... ,1,1
87 | Have a Nice Day ~ ,0,0
88 | Good morning ~ ,0,0
89 | 早安 ! 來 去 線 上 收 聽 KISSRADIO 開始 美好 的 一天 ! ! ,1,1
90 | 有趣 中肯 溫馨 ,3,1
91 | 早上好 ,1,1
92 | 午安 ,0,0
93 | 也 為 開始 新 的 一歲 倒數 ~ ~ ~ Happy birthday 喔 ~ ~ ~ ,0,0
94 | 哀 噁 ,0,0
95 | 有趣 中肯 溫馨 ,3,1
96 | me too 哦 鼻涕 擤 不庭 加上 喉嚨 痛 ,0,0
97 | 不錯 玩 ,1,1
98 | 別去 都 是 人 .. ,0,0
99 |
--------------------------------------------------------------------------------
/語料庫標記及語言學分析/slide/20181104_語料庫語言學工作坊.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫標記及語言學分析/slide/20181104_語料庫語言學工作坊.pdf
--------------------------------------------------------------------------------
/語料庫爬蟲/README.md:
--------------------------------------------------------------------------------
1 | Corpus Crawler
2 | 
3 | ---
4 | 
5 | Slides: [link](https://aji.tw/slides/corpus-crawler)
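
The crawler in `src/applecrawler.py` pages through the realtime listing and stops at the first page whose date heading is older than today. Below is a minimal sketch of that stopping rule, assuming the heading text uses the `'%Y / %m / %d'` format seen in the script. `parse_listing_date`, `collect_links`, `fetch_page`, and the `fake_site` data are hypothetical names introduced for illustration, and no network request is made:

```python
from datetime import date, datetime
from typing import Callable, List, Tuple

# A listing page is modelled as (date heading text, article links on that page).
Page = Tuple[str, List[str]]


def parse_listing_date(raw_time: str) -> date:
    """Parse a heading such as '2018 / 11 / 04' into a date."""
    return datetime.strptime(raw_time.strip(), '%Y / %m / %d').date()


def collect_links(fetch_page: Callable[[int], Page], today: date) -> List[str]:
    """Collect links page by page until a page dated before `today` appears."""
    links: List[str] = []
    page = 1
    while True:
        raw_time, page_links = fetch_page(page)
        if parse_listing_date(raw_time) < today:
            break  # older than today: stop without keeping this page's links
        links.extend(page_links)
        page += 1
    return links


# Hypothetical stand-in for the real HTTP request plus HTML parsing.
fake_site = {
    1: ('2018 / 11 / 04', ['/a', '/b']),
    2: ('2018 / 11 / 04', ['/c']),
    3: ('2018 / 11 / 03', ['/d']),
}
today_links = collect_links(lambda n: fake_site[n], today=date(2018, 11, 4))
print(today_links)
```

Passing the fetch step in as a function keeps the stopping logic testable offline. In a real run it would wrap the `session.get` call and the CSS selectors from the script.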
6 |
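
The last step of the crawler writes the structured posts to a JSON file. The sketch below uses only the standard library to show why `ensure_ascii=False` with an explicit UTF-8 encoding keeps Chinese text readable in the output file. The sample post and the temporary path are made up for illustration:

```python
import json
import os
import tempfile

# Made-up sample record with the same keys the crawler produces.
posts = [{'title': '測試標題', 'created_time': '2018/11/04',
          'category': '生活', 'content': '這是一段示範內文。'}]

path = os.path.join(tempfile.mkdtemp(), 'sample.json')

# Explicit encoding avoids depending on the platform's default code page.
with open(path, 'w', encoding='utf-8') as f:
    json.dump(posts, f, ensure_ascii=False, indent=2)

with open(path, encoding='utf-8') as f:
    raw_text = f.read()

# With the default ensure_ascii=True, non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(posts)
round_trip = json.loads(raw_text)

print('測試標題' in raw_text)
print('測試標題' in escaped)
```

Both forms load back to the same data. The difference only matters when the file is read or searched as plain text.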
--------------------------------------------------------------------------------
/語料庫爬蟲/src/applecrawler.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from time import sleep
3 | from random import random
4 | import json
5 | 
6 | from requests import Session
7 | from bs4 import BeautifulSoup
8 | 
9 | # Reuse one session and send a browser-like User-Agent with every request.
10 | session = Session()
11 | session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
12 | 
13 | # Collect the links to all of today's articles from the realtime listing.
14 | base_url = 'https://tw.appledaily.com/new/realtime/{page}'
15 | page = 1
16 | links = []
17 | current_date = datetime.now().date()
18 | 
19 | while True:
20 |     url = base_url.format(page=page)
21 |     response = session.get(url)
22 |     print(f'已經爬到第{page}頁')  # "Crawled page {page}"
23 |     # Name the parser explicitly so the result does not depend on which parsers are installed.
24 |     dom = BeautifulSoup(response.text, 'html.parser')
25 |     raw_time = dom.select('h1.dddd > time')[0].text
26 |     date = datetime.strptime(raw_time, '%Y / %m / %d').date()
27 |     if date < current_date:  # the listing has reached articles older than today
28 |         break
29 |     elements = dom.select('h1.dddd + ul.rtddd > li')
30 |     for element in elements:
31 |         link = element.select('a')[0]['href']
32 |         links.append(link)
33 |     sleep(random() * 5)  # random pause of up to 5 seconds between requests
34 |     page += 1
35 | 
36 | # Download the HTML of every collected article.
37 | htmls = []
38 | for num, link in enumerate(links):
39 |     print(f'還剩下{len(links) - num}個頁面')  # "{n} pages remaining"
40 |     response = session.get(link)
41 |     htmls.append(response.text)
42 |     sleep(random() * 5)
43 | 
44 | # Locate the fields in each page and build structured records.
45 | posts = []
46 | for html in htmls:
47 |     dom = BeautifulSoup(html, 'html.parser')
48 |     title = dom.select('article.ndArticle_leftColumn h1')[0].text
49 |     created_time = dom.select('article.ndArticle_leftColumn div.ndArticle_creat')[0].text
50 |     category = dom.select('nav div.ndgTag a.current')[0].text
51 |     content = dom.select('article.ndArticle_content p')[0].text
52 |     post = dict(title=title, created_time=created_time, category=category, content=content)
53 |     posts.append(post)
54 | 
55 | # Write the records as UTF-8 JSON, keeping Chinese text unescaped.
56 | with open('appledaily.json', 'w', encoding='utf-8') as f:
57 |     json.dump(posts, f, ensure_ascii=False, indent=2)
58 | 
--------------------------------------------------------------------------------
/語料庫爬蟲/src/實戰.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
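7 | "### Workflow\n",
8 | "1. Collect the links to all of today's news under **即時 > 最新** (Realtime > Latest)\n",
9 | "1. Fetch the content behind every link and convert it into structured data\n",
10 | "1. Write the structured data to a file"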
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {},
16 | "source": [
17 | "Setup"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 1,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "from datetime import datetime\n",
27 | "from time import sleep\n",
28 | "from random import random\n",
29 | "\n",
30 | "from requests import Session\n",
31 | "from bs4 import BeautifulSoup\n",
32 | "\n",
33 | "session = Session()\n",
34 | "session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "### Collect links"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 2,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "base_url = 'https://tw.appledaily.com/new/realtime/{page}'\n",
51 | "page = 1\n",
52 | "links = []\n",
53 | "current_date = datetime.now().date()"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {},
59 | "source": [
60 | "Loop through the listing pages to find all links"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 3,
66 | "metadata": {},
67 | "outputs": [
68 | {
69 | "name": "stdout",
70 | "output_type": "stream",
71 | "text": [
72 | "已經爬到第1頁面\n",
73 | "已經爬到第2頁面\n",
74 | "已經爬到第3頁面\n",
75 | "已經爬到第4頁面\n"
76 | ]
77 | }
78 | ],
79 | "source": [
80 | "while True:\n",
81 | "    url = base_url.format(page=page)\n",
82 | "    response = session.get(url)\n",
83 | "    print(f'已經爬到第{page}頁面')\n",
84 | "    dom = BeautifulSoup(response.text, 'html.parser')\n",
85 | "    raw_time = dom.select('h1.dddd > time')[0].text\n",
86 | "    date = datetime.strptime(raw_time, '%Y / %m / %d').date()\n",
87 | "    if date < current_date:\n",
88 | "        break\n",
89 | "    elements = dom.select('h1.dddd + ul.rtddd > li')\n",
90 | "    for element in elements:\n",
91 | "        link = element.select('a')[0]['href']\n",
92 | "        links.append(link)\n",
93 | "    sleep(random() * 5)\n",
94 | "    page += 1"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "### Collect content"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 4,
107 | "metadata": {},
108 | "outputs": [
109 | {
110 | "name": "stdout",
111 | "output_type": "stream",
112 | "text": [
113 | "還剩下90個頁面\n",
114 | "還剩下89個頁面\n",
115 | "還剩下88個頁面\n",
116 | "還剩下87個頁面\n",
117 | "還剩下86個頁面\n",
118 | "還剩下85個頁面\n",
119 | "還剩下84個頁面\n",
120 | "還剩下83個頁面\n",
121 | "還剩下82個頁面\n",
122 | "還剩下81個頁面\n",
123 | "還剩下80個頁面\n",
124 | "還剩下79個頁面\n",
125 | "還剩下78個頁面\n",
126 | "還剩下77個頁面\n",
127 | "還剩下76個頁面\n",
128 | "還剩下75個頁面\n",
129 | "還剩下74個頁面\n",
130 | "還剩下73個頁面\n",
131 | "還剩下72個頁面\n",
132 | "還剩下71個頁面\n",
133 | "還剩下70個頁面\n",
134 | "還剩下69個頁面\n",
135 | "還剩下68個頁面\n",
136 | "還剩下67個頁面\n",
137 | "還剩下66個頁面\n",
138 | "還剩下65個頁面\n",
139 | "還剩下64個頁面\n",
140 | "還剩下63個頁面\n",
141 | "還剩下62個頁面\n",
142 | "還剩下61個頁面\n",
143 | "還剩下60個頁面\n",
144 | "還剩下59個頁面\n",
145 | "還剩下58個頁面\n",
146 | "還剩下57個頁面\n",
147 | "還剩下56個頁面\n",
148 | "還剩下55個頁面\n",
149 | "還剩下54個頁面\n",
150 | "還剩下53個頁面\n",
151 | "還剩下52個頁面\n",
152 | "還剩下51個頁面\n",
153 | "還剩下50個頁面\n",
154 | "還剩下49個頁面\n",
155 | "還剩下48個頁面\n",
156 | "還剩下47個頁面\n",
157 | "還剩下46個頁面\n",
158 | "還剩下45個頁面\n",
159 | "還剩下44個頁面\n",
160 | "還剩下43個頁面\n",
161 | "還剩下42個頁面\n",
162 | "還剩下41個頁面\n",
163 | "還剩下40個頁面\n",
164 | "還剩下39個頁面\n",
165 | "還剩下38個頁面\n",
166 | "還剩下37個頁面\n",
167 | "還剩下36個頁面\n",
168 | "還剩下35個頁面\n",
169 | "還剩下34個頁面\n",
170 | "還剩下33個頁面\n",
171 | "還剩下32個頁面\n",
172 | "還剩下31個頁面\n",
173 | "還剩下30個頁面\n",
174 | "還剩下29個頁面\n",
175 | "還剩下28個頁面\n",
176 | "還剩下27個頁面\n",
177 | "還剩下26個頁面\n",
178 | "還剩下25個頁面\n",
179 | "還剩下24個頁面\n",
180 | "還剩下23個頁面\n",
181 | "還剩下22個頁面\n",
182 | "還剩下21個頁面\n",
183 | "還剩下20個頁面\n",
184 | "還剩下19個頁面\n",
185 | "還剩下18個頁面\n",
186 | "還剩下17個頁面\n",
187 | "還剩下16個頁面\n",
188 | "還剩下15個頁面\n",
189 | "還剩下14個頁面\n",
190 | "還剩下13個頁面\n",
191 | "還剩下12個頁面\n",
192 | "還剩下11個頁面\n",
193 | "還剩下10個頁面\n",
194 | "還剩下9個頁面\n",
195 | "還剩下8個頁面\n",
196 | "還剩下7個頁面\n",
197 | "還剩下6個頁面\n",
198 | "還剩下5個頁面\n",
199 | "還剩下4個頁面\n",
200 | "還剩下3個頁面\n",
201 | "還剩下2個頁面\n",
202 | "還剩下1個頁面\n"
203 | ]
204 | }
205 | ],
206 | "source": [
207 | "htmls = []\n",
208 | "for num, link in enumerate(links):\n",
209 | "    print(f'還剩下{len(links) - num}個頁面')\n",
210 | "    response = session.get(link)\n",
211 | "    htmls.append(response.text)\n",
212 | "    sleep(random() * 5)"
213 | ]
214 | },
215 | {
216 | "cell_type": "markdown",
217 | "metadata": {},
218 | "source": [
219 | "Locate elements and structure the data"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 5,
225 | "metadata": {},
226 | "outputs": [],
227 | "source": [
228 | "posts = []\n",
229 | "for html in htmls:\n",
230 | "    dom = BeautifulSoup(html, 'html.parser')\n",
231 | "    title = dom.select('article.ndArticle_leftColumn h1')[0].text\n",
232 | "    created_time = dom.select('article.ndArticle_leftColumn div.ndArticle_creat')[0].text\n",
233 | "    category = dom.select('nav div.ndgTag a.current')[0].text\n",
234 | "    content = dom.select('article.ndArticle_content p')[0].text\n",
235 | "    post = dict(title=title, created_time=created_time, category=category, content=content)\n",
236 | "    posts.append(post)"
237 | ]
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "### Write the output file"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": 7,
249 | "metadata": {},
250 | "outputs": [],
251 | "source": [
252 | "import json\n",
253 | "\n",
254 | "with open('appledaily.json', 'w', encoding='utf-8') as f:\n",
255 | "    json.dump(posts, f, ensure_ascii=False, indent=2)"
256 | ]
257 | }
258 | ],
259 | "metadata": {
260 | "kernelspec": {
261 | "display_name": "Python 3",
262 | "language": "python",
263 | "name": "python3"
264 | },
265 | "language_info": {
266 | "codemirror_mode": {
267 | "name": "ipython",
268 | "version": 3
269 | },
270 | "file_extension": ".py",
271 | "mimetype": "text/x-python",
272 | "name": "python",
273 | "nbconvert_exporter": "python",
274 | "pygments_lexer": "ipython3",
275 | "version": "3.7.0"
276 | }
277 | },
278 | "nbformat": 4,
279 | "nbformat_minor": 2
280 | }
281 |
--------------------------------------------------------------------------------
/語料庫資料前處理/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫資料前處理/.gitkeep
--------------------------------------------------------------------------------
/語料庫資料前處理/README.md:
--------------------------------------------------------------------------------
1 | Corpus Data Preprocessing
2 | 
3 | ---
4 | Slides: [link](https://goo.gl/GDcYGV)
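
`stopwords.txt` in this folder lists one stop word per line, with some entries repeated. Below is a minimal sketch of how such a list could be applied to space-separated tokens, using only the standard library. `load_stopwords` and `remove_stopwords` are hypothetical helpers, and the inline sample lines stand in for reading the real file:

```python
from typing import Iterable, List, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Build a stop-word set from lines (one entry per line).

    Blank lines are skipped and duplicates collapse automatically.
    Note: this also drops entries that consist only of whitespace.
    """
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stopwords]


# Inline sample; the repeated '我' mimics the duplicates in the real list.
sample_lines = ['的\n', '了\n', '!\n', '\n', '我\n', '我\n']
stopwords = load_stopwords(sample_lines)

tokens = '我 沒 有 排耶 XDD 真 幸運 !'.split()
filtered = remove_stopwords(tokens, stopwords)
print(filtered)
```

With the real file, `load_stopwords(open('stopwords.txt', encoding='utf-8'))` would build the set. A set is used because each membership test is then constant time on average.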
5 |
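
`userdict.txt` holds entries such as `體檢 2 n`. The sketch below assumes these follow the word / frequency / part-of-speech layout of jieba user dictionaries, where the last two fields are optional. The repository does not state the format explicitly, and `DictEntry` and `parse_userdict_line` are hypothetical names:

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    word: str
    freq: Optional[int]  # None when the line gives no frequency
    pos: Optional[str]   # None when the line gives no part-of-speech tag


def parse_userdict_line(line: str) -> DictEntry:
    """Split one whitespace-separated dictionary line into its fields."""
    parts = line.split()
    if not parts:
        raise ValueError('empty dictionary line')
    word = parts[0]
    freq: Optional[int] = None
    pos: Optional[str] = None
    # Assumed layout: an all-digit field is the frequency, anything else the tag.
    for field in parts[1:3]:
        if field.isdecimal():
            freq = int(field)
        else:
            pos = field
    return DictEntry(word, freq, pos)


entry = parse_userdict_line('體檢 2 n')
print(entry)
```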
--------------------------------------------------------------------------------
/語料庫資料前處理/stopwords.txt:
--------------------------------------------------------------------------------
1 | ,
2 | ?
3 | 、
4 | 。
5 | “
6 | ”
7 | 「
8 | 」
9 | 《
10 | 》
11 | !
12 | ,
13 | :
14 | ;
15 | ?
16 | 人民
17 | 末##末
18 | 啊
19 | 阿
20 | 哎
21 | 哎呀
22 | 哎喲
23 | 唉
24 | 我
25 | 我們
26 | 按
27 | 按照
28 | 依照
29 | 吧
30 | 吧噠
31 | 把
32 | 罷了
33 | 被
34 | 本
35 | 本著
36 | 比
37 | 比方
38 | 比如
39 | 鄙人
40 | 彼
41 | 彼此
42 | 邊
43 | 別
44 | 別的
45 | 別說
46 | 並
47 | 並且
48 | 不比
49 | 不成
50 | 不單
51 | 不但
52 | 不獨
53 | 不管
54 | 不光
55 | 不過
56 | 不僅
57 | 不拘
58 | 不論
59 | 不怕
60 | 不然
61 | 不如
62 | 不特
63 | 不惟
64 | 不問
65 | 不只
66 | 朝
67 | 朝著
68 | 趁
69 | 趁著
70 | 乘
71 | 沖
72 | 除
73 | 除此之外
74 | 除非
75 | 除了
76 | 此
77 | 此間
78 | 此外
79 | 從
80 | 從而
81 | 打
82 | 待
83 | 但
84 | 但是
85 | 當
86 | 當著
87 | 到
88 | 得
89 | 的
90 | 的話
91 | 等
92 | 等等
93 | 地
94 | 第
95 | 叮咚
96 | 對
97 | 對於
98 | 多
99 | 多少
100 | 而
101 | 而況
102 | 而且
103 | 而是
104 | 而外
105 | 而言
106 | 而已
107 | 爾後
108 | 反過來
109 | 反過來說
110 | 反之
111 | 非但
112 | 非徒
113 | 否則
114 | 嘎
115 | 嘎登
116 | 該
117 | 趕
118 | 個
119 | 各
120 | 各個
121 | 各位
122 | 各種
123 | 各自
124 | 給
125 | 根據
126 | 跟
127 | 故
128 | 故此
129 | 固然
130 | 關於
131 | 管
132 | 歸
133 | 果然
134 | 果真
135 | 過
136 | 哈
137 | 哈哈
138 | 呵
139 | 和
140 | 何
141 | 何處
142 | 何況
143 | 何時
144 | 嘿
145 | 哼
146 | 哼唷
147 | 呼哧
148 | 乎
149 | 嘩
150 | 還是
151 | 還有
152 | 換句話說
153 | 換言之
154 | 或
155 | 或是
156 | 或者
157 | 極了
158 | 及
159 | 及其
160 | 及至
161 | 即
162 | 即便
163 | 即或
164 | 即令
165 | 即若
166 | 即使
167 | 幾
168 | 幾時
169 | 己
170 | 既
171 | 既然
172 | 既是
173 | 繼而
174 | 加之
175 | 假如
176 | 假若
177 | 假使
178 | 鑒於
179 | 將
180 | 較
181 | 較之
182 | 叫
183 | 接著
184 | 結果
185 | 借
186 | 緊接著
187 | 進而
188 | 盡
189 | 儘管
190 | 經
191 | 經過
192 | 就
193 | 就是
194 | 就是說
195 | 據
196 | 具體地說
197 | 具體說來
198 | 開始
199 | 開外
200 | 靠
201 | 咳
202 | 可
203 | 可見
204 | 可是
205 | 可以
206 | 況且
207 | 啦
208 | 來
209 | 來著
210 | 離
211 | 例如
212 | 哩
213 | 連
214 | 連同
215 | 兩者
216 | 了
217 | 臨
218 | 另
219 | 另外
220 | 另一方面
221 | 論
222 | 嘛
223 | 嗎
224 | 慢說
225 | 漫說
226 | 冒
227 | 麼
228 | 每
229 | 每當
230 | 們
231 | 莫若
232 | 某
233 | 某個
234 | 某些
235 | 拿
236 | 哪
237 | 哪邊
238 | 哪兒
239 | 哪個
240 | 哪裏
241 | 哪年
242 | 哪怕
243 | 哪天
244 | 哪些
245 | 哪樣
246 | 那
247 | 那邊
248 | 那兒
249 | 那個
250 | 那會兒
251 | 那裏
252 | 那麼
253 | 那麼些
254 | 那麼樣
255 | 那時
256 | 那些
257 | 那樣
258 | 乃
259 | 乃至
260 | 呢
261 | 能
262 | 你
263 | 你們
264 | 您
265 | 寧
266 | 寧可
267 | 寧肯
268 | 寧願
269 | 哦
270 | 嘔
271 | 啪達
272 | 旁人
273 | 呸
274 | 憑
275 | 憑藉
276 | 其
277 | 其次
278 | 其二
279 | 其他
280 | 其它
281 | 其一
282 | 其餘
283 | 其中
284 | 起
285 | 起見
286 | 豈但
287 | 恰恰相反
288 | 前後
289 | 前者
290 | 且
291 | 然而
292 | 然後
293 | 然則
294 | 讓
295 | 人家
296 | 任
297 | 任何
298 | 任憑
299 | 如
300 | 如此
301 | 如果
302 | 如何
303 | 如其
304 | 如若
305 | 如上所述
306 | 若
307 | 若非
308 | 若是
309 | 啥
310 | 上下
311 | 尚且
312 | 設若
313 | 設使
314 | 甚而
315 | 甚麼
316 | 甚至
317 | 省得
318 | 時候
319 | 什麼
320 | 什麼樣
321 | 使得
322 | 是
323 | 是的
324 | 首先
325 | 誰
326 | 誰知
327 | 順
328 | 順著
329 | 似的
330 | 雖
331 | 雖然
332 | 雖說
333 | 雖則
334 | 隨
335 | 隨著
336 | 所
337 | 所以
338 | 他
339 | 他們
340 | 他人
341 | 它
342 | 它們
343 | 她
344 | 她們
345 | 倘
346 | 倘或
347 | 倘然
348 | 倘若
349 | 倘使
350 | 騰
351 | 替
352 | 通過
353 | 同
354 | 同時
355 | 哇
356 | 萬一
357 | 往
358 | 望
359 | 為
360 | 為何
361 | 為了
362 | 為什麼
363 | 為著
364 | 餵
365 | 嗡嗡
366 | 我
367 | 我們
368 | 嗚
369 | 嗚呼
370 | 烏乎
371 | 無論
372 | 無寧
373 | 毋寧
374 | 嘻
375 | 嚇
376 | 相對而言
377 | 像
378 | 向
379 | 向著
380 | 噓
381 | 呀
382 | 焉
383 | 沿
384 | 沿著
385 | 要
386 | 要不
387 | 要不然
388 | 要不是
389 | 要麼
390 | 要是
391 | 也
392 | 也罷
393 | 也好
394 | 一
395 | 一般
396 | 一旦
397 | 一方面
398 | 一來
399 | 一切
400 | 一樣
401 | 一則
402 | 依
403 | 依照
404 | 矣
405 | 以
406 | 以便
407 | 以及
408 | 以免
409 | 以至
410 | 以至於
411 | 以致
412 | 抑或
413 | 因
414 | 因此
415 | 因而
416 | 因為
417 | 喲
418 | 用
419 | 由
420 | 由此可見
421 | 由於
422 | 有
423 | 有的
424 | 有關
425 | 有些
426 | 又
427 | 於
428 | 於是
429 | 於是乎
430 | 與
431 | 與此同時
432 | 與否
433 | 與其
434 | 越是
435 | 雲雲
436 | 哉
437 | 再說
438 | 再者
439 | 在
440 | 在下
441 | 咱
442 | 咱們
443 | 則
444 | 怎
445 | 怎麼
446 | 怎麼辦
447 | 怎麼樣
448 | 怎樣
449 | 咋
450 | 照
451 | 照著
452 | 者
453 | 這
454 | 這邊
455 | 這兒
456 | 這個
457 | 這會兒
458 | 這就是說
459 | 這裏
460 | 這麼
461 | 這麼點兒
462 | 這麼些
463 | 這麼樣
464 | 這時
465 | 這些
466 | 這樣
467 | 正如
468 | 吱
469 | 之
470 | 之類
471 | 之所以
472 | 之一
473 | 只是
474 | 只限
475 | 只要
476 | 只有
477 | 至
478 | 至於
479 | 諸位
480 | 著
481 | 著呢
482 | 自
483 | 自從
484 | 自個兒
485 | 自各兒
486 | 自己
487 | 自家
488 | 自身
489 | 綜上所述
490 | 總的來看
491 | 總的來說
492 | 總的說來
493 | 總而言之
494 | 總之
495 | 縱
496 | 縱令
497 | 縱然
498 | 縱使
499 | 遵照
500 | 作為
501 | 兮
502 | 呃
503 | 唄
504 | 咚
505 | 咦
506 | 喏
507 | 啐
508 | 喔唷
509 | 嗬
510 | 嗯
511 | 噯
512 | ~
513 | !
514 | .
515 | :
516 | "
517 | '
518 | (
519 | )
520 | *
521 | A
522 | 白
523 | 社會主義
524 | --
525 | ..
526 | >>
527 | [
528 | ]
529 |
530 | <
531 | >
532 | /
533 | \
534 | |
535 | -
536 | _
537 | +
538 | =
539 | &
540 | ^
541 | %
542 | #
543 | @
544 | `
545 | ;
546 | $
547 | (
548 | )
549 | ——
550 | —
551 | ¥
552 | ·
553 | ...
554 | ‘
555 | ’
556 | 〉
557 | 〈
558 | …
559 |
560 | 0
561 | 1
562 | 2
563 | 3
564 | 4
565 | 5
566 | 6
567 | 7
568 | 8
569 | 9
570 | 0
571 | 1
572 | 2
573 | 3
574 | 4
575 | 5
576 | 6
577 | 7
578 | 8
579 | 9
580 | 二
581 | 三
582 | 四
583 | 五
584 | 六
585 | 七
586 | 八
587 | 九
588 | 零
589 | >
590 | <
591 | @
592 | #
593 | $
594 | %
595 | ︿
596 | &
597 | *
598 | +
599 | ~
600 | |
601 | [
602 | ]
603 | {
604 | }
605 | 啊哈
606 | 啊呀
607 | 啊喲
608 | 挨次
609 | 挨個
610 | 挨家挨戶
611 | 挨門挨戶
612 | 挨門逐戶
613 | 挨著
614 | 按理
615 | 按期
616 | 按時
617 | 按說
618 | 暗地裏
619 | 暗中
620 | 暗自
621 | 昂然
622 | 八成
623 | 白白
624 | 半
625 | 梆
626 | 保管
627 | 保險
628 | 飽
629 | 背地裏
630 | 背靠背
631 | 倍感
632 | 倍加
633 | 本人
634 | 本身
635 | 甭
636 | 比起
637 | 比如說
638 | 比照
639 | 畢竟
640 | 必
641 | 必定
642 | 必將
643 | 必須
644 | 便
645 | 別人
646 | 並非
647 | 並肩
648 | 並沒
649 | 並沒有
650 | 併排
651 | 並無
652 | 勃然
653 | 不
654 | 不必
655 | 不常
656 | 不大
657 | 不但...而且
658 | 不得
659 | 不得不
660 | 不得了
661 | 不得已
662 | 不迭
663 | 不定
664 | 不對
665 | 不妨
666 | 不管怎樣
667 | 不會
668 | 不僅...而且
669 | 不僅僅
670 | 不僅僅是
671 | 不經意
672 | 不可開交
673 | 不可抗拒
674 | 不力
675 | 不了
676 | 不料
677 | 不滿
678 | 不免
679 | 不能不
680 | 不起
681 | 不巧
682 | 不然的話
683 | 不日
684 | 不少
685 | 不勝
686 | 不時
687 | 不是
688 | 不同
689 | 不能
690 | 不要
691 | 不外
692 | 不外乎
693 | 不下
694 | 不限
695 | 不消
696 | 不已
697 | 不亦樂乎
698 | 不由得
699 | 不再
700 | 不擇手段
701 | 不怎麼
702 | 不曾
703 | 不知不覺
704 | 不止
705 | 不止一次
706 | 不至於
707 | 才
708 | 才能
709 | 策略地
710 | 差不多
711 | 差一點
712 | 常
713 | 常常
714 | 常言道
715 | 常言說
716 | 常言說得好
717 | 長此下去
718 | 長話短說
719 | 長期以來
720 | 長線
721 | 敞開兒
722 | 徹夜
723 | 陳年
724 | 趁便
725 | 趁機
726 | 趁熱
727 | 趁勢
728 | 趁早
729 | 成年
730 | 成年累月
731 | 成心
732 | 乘機
733 | 乘勝
734 | 乘勢
735 | 乘隙
736 | 乘虛
737 | 誠然
738 | 遲早
739 | 充分
740 | 充其極
741 | 充其量
742 | 抽冷子
743 | 臭
744 | 初
745 | 出
746 | 出來
747 | 出去
748 | 除此
749 | 除此而外
750 | 除此以外
751 | 除開
752 | 除去
753 | 除卻
754 | 除外
755 | 處處
756 | 川流不息
757 | 傳
758 | 傳說
759 | 傳聞
760 | 串列
761 | 純
762 | 純粹
763 | 此後
764 | 此中
765 | 次第
766 | 匆匆
767 | 從不
768 | 從此
769 | 從此以後
770 | 從古到今
771 | 從古至今
772 | 從今以後
773 | 從寬
774 | 從來
775 | 從輕
776 | 從速
777 | 從頭
778 | 從未
779 | 從無到有
780 | 從小
781 | 從新
782 | 從嚴
783 | 從優
784 | 從早到晚
785 | 從中
786 | 從重
787 | 湊巧
788 | 粗
789 | 存心
790 | 達旦
791 | 打從
792 | 打開天窗說亮話
793 | 大
794 | 大不了
795 | 大大
796 | 大抵
797 | 大都
798 | 大多
799 | 大凡
800 | 大概
801 | 大家
802 | 大舉
803 | 大略
804 | 大面兒上
805 | 大事
806 | 大體
807 | 大體上
808 | 大約
809 | 大張旗鼓
810 | 大致
811 | 呆呆地
812 | 帶
813 | 殆
814 | 待到
815 | 單
816 | 單純
817 | 單單
818 | 但願
819 | 彈指之間
820 | 當場
821 | 當兒
822 | 當即
823 | 當口兒
824 | 當然
825 | 當庭
826 | 當頭
827 | 當下
828 | 當真
829 | 當中
830 | 倒不如
831 | 倒不如說
832 | 倒是
833 | 到處
834 | 到底
835 | 到了兒
836 | 到目前為止
837 | 到頭
838 | 到頭來
839 | 得起
840 | 得天獨厚
841 | 的確
842 | 等到
843 | 叮噹
844 | 頂多
845 | 定
846 | 動不動
847 | 動輒
848 | 陡然
849 | 都
850 | 獨
851 | 獨自
852 | 斷然
853 | 頓時
854 | 多次
855 | 多多
856 | 多多少少
857 | 多多益善
858 | 多虧
859 | 多年來
860 | 多年前
861 | 而後
862 | 而論
863 | 而又
864 | 爾等
865 | 二話不說
866 | 二話沒說
867 | 反倒
868 | 反倒是
869 | 反而
870 | 反手
871 | 反之亦然
872 | 反之則
873 | 方
874 | 方才
875 | 方能
876 | 放量
877 | 非常
878 | 非得
879 | 分期
880 | 分期分批
881 | 分頭
882 | 奮勇
883 | 憤然
884 | 風雨無阻
885 | 逢
886 | 弗
887 | 甫
888 | 嘎嘎
889 | 該當
890 | 概
891 | 趕快
892 | 趕早不趕晚
893 | 敢
894 | 敢情
895 | 敢於
896 | 剛
897 | 剛才
898 | 剛好
899 | 剛巧
900 | 高低
901 | 格外
902 | 隔日
903 | 隔夜
904 | 個人
905 | 各式
906 | 更
907 | 更加
908 | 更進一步
909 | 更為
910 | 公然
911 | 共
912 | 共總
913 | 夠瞧的
914 | 姑且
915 | 古來
916 | 故而
917 | 故意
918 | 固
919 | 怪
920 | 怪不得
921 | 慣常
922 | 光
923 | 光是
924 | 歸根到底
925 | 歸根結底
926 | 過於
927 | 毫不
928 | 毫無
929 | 毫無保留地
930 | 毫無例外
931 | 好在
932 | 何必
933 | 何嘗
934 | 何妨
935 | 何苦
936 | 何樂而不為
937 | 何須
938 | 何止
939 | 很
940 | 很多
941 | 很少
942 | 轟然
943 | 後來
944 | 呼啦
945 | 忽地
946 | 忽然
947 | 互
948 | 互相
949 | 嘩啦
950 | 話說
951 | 還
952 | 恍然
953 | 會
954 | 豁然
955 | 活
956 | 夥同
957 | 或多或少
958 | 或許
959 | 基本
960 | 基本上
961 | 基於
962 | 極
963 | 極大
964 | 極度
965 | 極端
966 | 極力
967 | 極其
968 | 極為
969 | 急匆匆
970 | 即將
971 | 即刻
972 | 即是說
973 | 幾度
974 | 幾番
975 | 幾乎
976 | 幾經
977 | 既...又
978 | 繼之
979 | 加上
980 | 加以
981 | 間或
982 | 簡而言之
983 | 簡言之
984 | 簡直
985 | 見
986 | 將才
987 | 將近
988 | 將要
989 | 交口
990 | 較比
991 | 較為
992 | 接連不斷
993 | 接下來
994 | 皆可
995 | 截然
996 | 截至
997 | 藉以
998 | 藉此
999 | 藉以
1000 | 屆時
1001 | 僅
1002 | 僅僅
1003 | 謹
1004 | 進來
1005 | 進去
1006 | 近
1007 | 近幾年來
1008 | 近來
1009 | 近年來
1010 | 儘管如此
1011 | 儘可能
1012 | 儘快
1013 | 儘量
1014 | 盡然
1015 | 盡如人意
1016 | 盡心竭力
1017 | 盡心盡力
1018 | 儘早
1019 | 精光
1020 | 經常
1021 | 竟
1022 | 竟然
1023 | 究竟
1024 | 就此
1025 | 就地
1026 | 就算
1027 | 居然
1028 | 局外
1029 | 舉凡
1030 | 據稱
1031 | 據此
1032 | 據實
1033 | 據說
1034 | 據我所知
1035 | 據悉
1036 | 具體來說
1037 | 決不
1038 | 決非
1039 | 絕
1040 | 絕不
1041 | 絕頂
1042 | 絕對
1043 | 絕非
1044 | 均
1045 | 喀
1046 | 看
1047 | 看來
1048 | 看起來
1049 | 看上去
1050 | 看樣子
1051 | 可好
1052 | 可能
1053 | 恐怕
1054 | 快
1055 | 快要
1056 | 來不及
1057 | 來得及
1058 | 來講
1059 | 來看
1060 | 攔腰
1061 | 牢牢
1062 | 老
1063 | 老大
1064 | 老老實實
1065 | 老是
1066 | 累次
1067 | 累年
1068 | 理當
1069 | 理該
1070 | 理應
1071 | 歷
1072 | 立
1073 | 立地
1074 | 立刻
1075 | 立馬
1076 | 立時
1077 | 聯袂
1078 | 連連
1079 | 連日
1080 | 連日來
1081 | 連聲
1082 | 連袂
1083 | 臨到
1084 | 另方面
1085 | 另行
1086 | 另一個
1087 | 路經
1088 | 屢
1089 | 屢次
1090 | 屢次三番
1091 | 屢屢
1092 | 縷縷
1093 | 率爾
1094 | 率然
1095 | 略
1096 | 略加
1097 | 略微
1098 | 略為
1099 | 論說
1100 | 馬上
1101 | 蠻
1102 | 滿
1103 | 沒
1104 | 沒有
1105 | 每逢
1106 | 每每
1107 | 每時每刻
1108 | 猛然
1109 | 猛然間
1110 | 莫
1111 | 莫不
1112 | 莫非
1113 | 莫如
1114 | 默默地
1115 | 默然
1116 | 吶
1117 | 那末
1118 | 奈
1119 | 難道
1120 | 難得
1121 | 難怪
1122 | 難說
1123 | 內
1124 | 年覆一年
1125 | 凝神
1126 | 偶而
1127 | 偶爾
1128 | 怕
1129 | 砰
1130 | 碰巧
1131 | 譬如
1132 | 偏偏
1133 | 乒
1134 | 平素
1135 | 頗
1136 | 迫於
1137 | 撲通
1138 | 其後
1139 | 其實
1140 | 奇
1141 | 齊
1142 | 起初
1143 | 起來
1144 | 起首
1145 | 起頭
1146 | 起先
1147 | 豈
1148 | 豈非
1149 | 豈止
1150 | 迄
1151 | 恰逢
1152 | 恰好
1153 | 恰恰
1154 | 恰巧
1155 | 恰如
1156 | 恰似
1157 | 千
1158 | 千萬
1159 | 千萬千萬
1160 | 切
1161 | 切不可
1162 | 切莫
1163 | 切切
1164 | 切勿
1165 | 竊
1166 | 親口
1167 | 親身
1168 | 親手
1169 | 親眼
1170 | 親自
1171 | 頃
1172 | 頃刻
1173 | 頃刻間
1174 | 頃刻之間
1175 | 請勿
1176 | 窮年累月
1177 | 取道
1178 | 去
1179 | 權時
1180 | 全都
1181 | 全力
1182 | 全年
1183 | 全然
1184 | 全身心
1185 | 然
1186 | 人人
1187 | 仍
1188 | 仍舊
1189 | 仍然
1190 | 日覆一日
1191 | 日見
1192 | 日漸
1193 | 日益
1194 | 日臻
1195 | 如常
1196 | 如此等等
1197 | 如次
1198 | 如今
1199 | 如期
1200 | 如前所述
1201 | 如上
1202 | 如下
1203 | 汝
1204 | 三番兩次
1205 | 三番五次
1206 | 三天兩頭
1207 | 瑟瑟
1208 | 沙沙
1209 | 上
1210 | 上來
1211 | 上去
--------------------------------------------------------------------------------
/語料庫資料前處理/userdict.txt:
--------------------------------------------------------------------------------
1 | 體檢 2 n
--------------------------------------------------------------------------------
/語料庫開放框架及上線部署/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lopentu/BestPracticeInCorpusProgramming/5ab27de7aaf4822799bfe2c2d4f65f37c2eb437f/語料庫開放框架及上線部署/.gitkeep
--------------------------------------------------------------------------------
/語料庫開放框架及上線部署/README.md:
--------------------------------------------------------------------------------
1 | Open Corpus Framework and Online Deployment
2 |
3 | ---
4 |
--------------------------------------------------------------------------------