`|paragraph\n","`<a>`|hyperlink\n","`<td>`|table cell\n","`<br>`|line break (no closing tag)"]},{"cell_type":"markdown","metadata":{"id":"1MDh03CGhhdg","colab_type":"text"},"source":["#### Common attributes\n","\n","|Attribute|Purpose|\n","|-|-|\n","|class|class of the tag (may be shared by several tags)|\n","|id|id of the tag (must be unique)|\n","|title|advisory text shown for the tag|\n","|style|inline style of the tag|\n","|data-*|custom, user-defined attributes|"]},{"cell_type":"markdown","metadata":{"id":"AiAe7hnEjftP","colab_type":"text"},"source":["### Essentials for fetching web pages"]},{"cell_type":"markdown","metadata":{"id":"G9fT8Fykjhi0","colab_type":"text"},"source":["- HTTP defines several request methods. With mobile devices now everywhere, more and more products and websites offer Web API services, so before fetching web content we need to know how HTTP requests are made.\n","- HTTP/1.1 defines eight methods:\n","  - OPTIONS\n","  - **GET**\n","  - HEAD\n","  - **POST**\n","  - PUT\n","  - DELETE\n","  - TRACE\n","  - CONNECT"]},{"cell_type":"markdown","metadata":{"id":"ckztGOzfjlSG","colab_type":"text"},"source":["- The five most common methods:\n","\n","|method|meaning|\n","|-|-|\n","|GET|Retrieve the data or state of the requested resource.|\n","|POST|Create a new record, much like submitting a form.|\n","|PUT|Create or replace the record at a specified location.|\n","|PATCH|Partially update an existing record.|\n","|DELETE|Delete the specified record.|"]},{"cell_type":"markdown","metadata":{"id":"6S1aqUi_jquX","colab_type":"text"},"source":["- For more detail, see the HTTP/1.1 specification, [RFC 2616 section 9.5](https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.5), hosted by the W3C; the PATCH method is defined separately in RFC 5789.\n","- [淺談 HTTP Method:表單中的 GET 與 POST 有什麼差別?](https://blog.toright.com/posts/1203/%E6%B7%BA%E8%AB%87-http-method%EF%BC%9A%E8%A1%A8%E5%96%AE%E4%B8%AD%E7%9A%84-get-%E8%88%87-post-%E6%9C%89%E4%BB%80%E9%BA%BC%E5%B7%AE%E5%88%A5%EF%BC%9F.html)"]},{"cell_type":"markdown","metadata":{"id":"mxDpTI-8RfSb","colab_type":"text"},"source":["- When working with web pages and databases you will also often hear the term CRUD: Create, Read, Update and Delete, the four basic operations on a database such as MySQL.\n","\n","- See [Day 15 - 實作第一個 CRUD 之 Create、Update、Delete](https://ithelp.ithome.com.tw/articles/10206716)"]},{"cell_type":"markdown","metadata":{"id":"yGUfAF1GjsBF","colab_type":"text"},"source":["### A brief look at RESTful APIs\n"]},{"cell_type":"markdown","metadata":{"id":"btWqwWF_juHm","colab_type":"text"},"source":["- REST stands for Representational State Transfer. Its core idea is to build on the HTTP protocol so that API specifications stay simple and consistent:\n","  - Resource: the thing being addressed.\n","  - Representational: the form in which it is represented, e.g. JSON or XML.\n","  - State Transfer: a change of state, triggered with the HTTP verbs described above."]},{"cell_type":"markdown","metadata":{"id":"PigVvHPgjwfe","colab_type":"text"},"source":["- REST is a style of interaction between client and server in which the application is designed to satisfy a fixed set of constraints and principles. Operations on resources (retrieve, create, modify, delete) map onto the basic database operations Create, Read, Update and Delete (**CRUD**)."]},{"cell_type":"markdown","metadata":{"id":"z9v2bBUVjzsZ","colab_type":"text"},"source":["- Example interface of a product Web API:\n","  - Get a product: GET /getItem/9527\n","  - Create a product: POST /createItem\n","  - Update a product: POST /updateItem/\n","  - Delete a product: POST /deleteItem/\n","\n","- The same interface designed as a RESTful API:\n","  - Get a product: GET /items/9527\n","  - Create a product: POST /items\n","  - Update a product: PATCH /items/9527\n","  - Delete a product: DELETE /items/9527\n","\n","- This is a slight digression, but general web knowledge always helps when collecting real-world data."]},{"cell_type":"markdown","metadata":{"id":"TBsilSkyj7t3","colab_type":"text"},"source":["- Further reading\n","  - [[不是工程師] 休息(REST)式架構- 寧靜式(RESTful)的Web API是現在的潮流?](https://progressbar.tw/posts/53)\n","  >When a web service exposes its interface through a Web API, every URL we design becomes a dedicated service window.\n","  - [RESTful API 設計準則與實務經驗](https://blog.toright.com/posts/5523/restful-api-%E8%A8%AD%E8%A8%88%E6%BA%96%E5%89%87%E8%88%87%E5%AF%A6%E5%8B%99%E7%B6%93%E9%A9%97.html)"]},{"cell_type":"markdown","metadata":{"id":"uZP7CHTATafy","colab_type":"text"},"source":["### What is a web crawler?"]},{"cell_type":"markdown","metadata":{"id":"sOh4tWNniCUa","colab_type":"text"},"source":["\n","https://blog.apify.com/what-is-web-scraping-1b548f8d6ac1\n"]},{"cell_type":"markdown","metadata":{"id":"TTuweXJRT23m","colab_type":"text"},"source":["- A web crawler is like a robot that fetches the information you want automatically.\n","- Crawlers are everywhere: search engines such as Google and Baidu are crawlers too."]},{"cell_type":"markdown","metadata":{"id":"gRUO9OXPVT0E","colab_type":"text"},"source":["- What are crawlers used for?\n","  - Reviews of popular games, public-opinion analysis, sales analysis, travel booking, ...\n"]},{"cell_type":"markdown","metadata":{"id":"K0NoiLeiiANu","colab_type":"text"},"source":["- A second look at the structure of a web page element\n","  - `<tag attributes> target </tag>`\n","  - 
`<target tag + auxiliary info>target information</target tag>`"]},{"cell_type":"markdown","metadata":{"id":"pwGvs5JYicez","colab_type":"text"},"source":["- Things to check before writing a crawler\n","  - Has someone already written one?\n","  - Does the site already provide an API?\n","  - Be polite (large numbers of frequent requests put load on the server)"]},{"cell_type":"markdown","metadata":{"id":"IWzSAtfvWyYq","colab_type":"text"},"source":["### Main steps of a crawler"]},{"cell_type":"markdown","metadata":{"id":"_YEX9RhlW9iI","colab_type":"text"},"source":["- Fetch the HTML you need\n","  - Python's requests module can fetch HTML\n","- Parse the data to extract the target information\n","  - Python's BeautifulSoup module can parse HTML\n","- Automate (Robotic Process Automation, RPA) to tie the steps into your service"]},{"cell_type":"markdown","metadata":{"id":"WOtmHmThBJcB","colab_type":"text"},"source":["### A polite crawler"]},{"cell_type":"markdown","metadata":{"id":"q3i-MAVdBOuY","colab_type":"text"},"source":["- When crawling a site, do not request data too frequently; make good use of time.sleep()\n"]},{"cell_type":"code","metadata":{"id":"N4_7pmC1Bdpx","colab_type":"code","outputId":"7e833873-ac70-4277-8e9e-d1e334fdb7f8","executionInfo":{"status":"ok","timestamp":1572072862856,"user_tz":-480,"elapsed":3666,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":50}},"source":["import time\n","\n","print('----start----')\n","time.sleep(3)\n","print('----done----')\n"],"execution_count":0,"outputs":[{"output_type":"stream","text":["----start----\n","----done----\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"mx54vzuoCS1d","colab_type":"code","outputId":"08078393-ec9f-4016-c5c8-bd6095ddca2e","executionInfo":{"status":"ok","timestamp":1572072893537,"user_tz":-480,"elapsed":2569,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":50}},"source":["import time\n","import random\n","\n","random_s = 1 + random.randint(0,2)  # wait a random 1 to 3 seconds\n","print('----start----')\n","time.sleep( 
random_s)\n","print('----done----')\n"],"execution_count":0,"outputs":[{"output_type":"stream","text":["----start----\n","----done----\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"0QQXheu3Dnj8","colab_type":"text"},"source":["- A site that has been through SEO may publish rules on which pages may or may not be crawled; look at `https://*.*.*/robots.txt` on its domain, for example `https://www.facebook.com/robots.txt` and `https://twitter.com/robots.txt`\n","- `robots.txt` only states which parts of the site crawlers should stay away from; many web scraping tools obey it by default (though the default can be switched off)\n","\n","- Also pay attention to intellectual property (IP) such as trademarks, copyright and patents: use without consent that is intentional and causes actual harm may be against the law.\n"]},{"cell_type":"markdown","metadata":{"id":"iWkiq1Q_GxCK","colab_type":"text"},"source":["- To avoid having frequent requests blocked by the target server, you can test your crawler over your phone's 4G connection. If you get banned, switch the phone to airplane mode for a while and then turn 4G back on; the phone is then assigned a new IP address."]},{"cell_type":"markdown","metadata":{"id":"sGcVUywikBdi","colab_type":"text"},"source":["## Hands-on: GET a web page\n"]},{"cell_type":"markdown","metadata":{"id":"iodaXDy83xtV","colab_type":"text"},"source":["### Using the example page"]},{"cell_type":"markdown","metadata":{"id":"RKprYoPf31yH","colab_type":"text"},"source":["\n","- First inspect the target page: http://www.example.com/\n","- Fetch the page source with `requests.get` and print the result"]},{"cell_type":"code","metadata":{"id":"3GFnb4ACf3ad","colab_type":"code","outputId":"ae8cb3fe-41f6-4ba0-c227-7ac0a69b11cb","executionInfo":{"status":"ok","timestamp":1572073234919,"user_tz":-480,"elapsed":608,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":289}},"source":["import requests\n","from bs4 import BeautifulSoup\n","\n","res = requests.get('http://www.example.com/')\n","print(res.text[:500])\n","\n","# Assumption: the cells below use `soup`, which is not defined anywhere in this part of the notebook, so build it here\n","soup = BeautifulSoup(res.text, 'html.parser')"],"execution_count":0,"outputs":[{"output_type":"stream","text":["\n","\n","\n","\n"," \n"," \n","\n","\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"Bgtv3fFn6AD2","colab_type":"text"},"source":["#### Commonly used BeautifulSoup functions\n","- `soup.find()` finds one tag\n","- It returns the first block enclosed by that tag\n","- The first argument is usually the tag name; a second positional argument given as a plain string is matched against the class attribute, and attributes such as id can also be passed directly to locate a block. Once a block is located, its attributes and the strings it contains can be read."]},{"cell_type":"code","metadata":{"id":"-F3nFNZF6XUA","colab_type":"code","colab":{}},"source":["?soup.find\n","#soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"9oNQZZcyknE0","colab_type":"code","outputId":"6dc630c9-5624-4433-daec-a082b7644c62","executionInfo":{"status":"ok","timestamp":1572073713068,"user_tz":-480,"elapsed":605,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":50}},"source":["soup.find('p')"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["\n"," Example Domain\n","\n","\n"," This domain is for use in illustrative examples in documents. You may use this\n"," domain in literature without prior coordination or asking for permission.\n"," \n","\n"," \n"," More information...\n"," \n"," \n","This domain is for use in illustrative examples in documents. You may use this\n"," domain in literature without prior coordination or asking for permission. "]},"metadata":{"tags":[]},"execution_count":13}]},{"cell_type":"markdown","metadata":{"id":"5u8ObFd_a3JZ","colab_type":"text"},"source":["- soup.find_all() returns every block enclosed by the tag, as a list"]},{"cell_type":"code","metadata":{"id":"YwlPhDkkk-kl","colab_type":"code","outputId":"f8a1771a-0032-4757-eccc-8cbf424e9382","executionInfo":{"status":"ok","timestamp":1572073727959,"user_tz":-480,"elapsed":599,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":87}},"source":["soup.find_all('p')"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[This domain is for use in illustrative examples in documents. You may use this\n"," domain in literature without prior coordination or asking for permission. 
,\n"," ]"]},"metadata":{"tags":[]},"execution_count":14}]},{"cell_type":"code","metadata":{"id":"K20pi5WQbI4Q","colab_type":"code","outputId":"adcd0b60-4af4-42f6-dc0f-aa669165e6e0","executionInfo":{"status":"ok","timestamp":1572073755664,"user_tz":-480,"elapsed":597,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["a = soup.find(\"a\")\n","a"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["More information..."]},"metadata":{"tags":[]},"execution_count":15}]},{"cell_type":"code","metadata":{"id":"pD0ShQhibZYe","colab_type":"code","outputId":"3a4e3737-29a3-4ea3-851a-17eb20a69ff3","executionInfo":{"status":"ok","timestamp":1572073806255,"user_tz":-480,"elapsed":576,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["a[\"href\"]"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'https://www.iana.org/domains/example'"]},"metadata":{"tags":[]},"execution_count":16}]},{"cell_type":"code","metadata":{"id":"rFyV0EaQ50KG","colab_type":"code","outputId":"2c77866d-6e84-4210-a58f-34755ef0f3dc","executionInfo":{"status":"ok","timestamp":1572073860003,"user_tz":-480,"elapsed":582,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["a.text"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["'More information...'"]},"metadata":{"tags":[]},"execution_count":17}]},{"cell_type":"markdown","metadata":{"colab_type":"text","id":"SVhs9e9sbya3"},"source":["- 
Get the text content of a node"]},{"cell_type":"code","metadata":{"id":"FMLtwCWo3cHc","colab_type":"code","outputId":"c50b6044-ea0a-450f-e53e-0857946d3c9e","executionInfo":{"status":"ok","timestamp":1572073974054,"user_tz":-480,"elapsed":571,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["title_tag = soup.title\n","title_tag"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["This domain is for use in illustrative examples in documents. You may use this\n"," domain in literature without prior coordination or asking for permission. ,\n"," ]"]},"metadata":{"tags":[]},"execution_count":30}]},{"cell_type":"code","metadata":{"id":"c2WOX3w8dKAn","colab_type":"code","outputId":"82ca34ac-2f7e-42d4-cc55-52ee093ea06b","executionInfo":{"status":"ok","timestamp":1572074268530,"user_tz":-480,"elapsed":629,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":67}},"source":["# Find the nodes and read the content of each item in the list\n","p_tags = soup.find_all(\"p\")\n","\n","for tag in p_tags:\n","    print(tag.string)"],"execution_count":0,"outputs":[{"output_type":"stream","text":["This domain is for use in illustrative examples in documents. 
You may use this\n"," domain in literature without prior coordination or asking for permission.\n","More information...\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"jNMNUGqIBNRV","colab_type":"code","outputId":"4e546887-2626-4230-d6ed-dc8cd2e41546","executionInfo":{"status":"ok","timestamp":1572074321794,"user_tz":-480,"elapsed":585,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["# Read an attribute of each node\n","a_tags = soup.find_all(\"a\")\n","\n","for tag in a_tags:\n","    print(tag['href'])"],"execution_count":0,"outputs":[{"output_type":"stream","text":["https://www.iana.org/domains/example\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"p7J5TuVsBiKs","colab_type":"text"},"source":["#### Searching for several tags at once with a list"]},{"cell_type":"code","metadata":{"id":"HFKxz_XoBe3P","colab_type":"code","outputId":"70ebb509-0758-44bb-d211-ca07bdfc4502","executionInfo":{"status":"ok","timestamp":1572074361715,"user_tz":-480,"elapsed":376,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":161}},"source":["# Find all hyperlinks (a), bold text (b), paragraphs (p) and div blocks\n","tags = soup.find_all([\"a\", \"b\", \"p\", \"div\"])\n","print(tags)"],"execution_count":0,"outputs":[{"output_type":"stream","text":["[\n"," , Example Domain\n","This domain is for use in illustrative examples in documents. You may use this\n"," domain in literature without prior coordination or asking for permission. \n","\n","This domain is for use in illustrative examples in documents. You may use this\n"," domain in literature without prior coordination or asking for permission. 
, , More information...]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"8iCzPmRpB-Em","colab_type":"text"},"source":["#### Limiting the number of results of `find_all()` with `limit`\n","- If you only need one result, use `find()` instead"]},{"cell_type":"code","metadata":{"id":"QdWbrjToCZJe","colab_type":"code","outputId":"19cfb3ce-02cd-4fe0-eb02-9a2b6ebd1f1e","executionInfo":{"status":"ok","timestamp":1572074391893,"user_tz":-480,"elapsed":596,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["# Limit the number of search results\n","tags = soup.find_all([\"a\", \"b\"], limit=2)\n","print(tags)"],"execution_count":0,"outputs":[{"output_type":"stream","text":["[More information...]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"fw1fg9t0ChkF","colab_type":"text"},"source":["#### Turning off recursive search in `find_all()`\n","- By default find_all() searches all descendants recursively\n","- Pass `recursive=False` to search direct children only"]},{"cell_type":"code","metadata":{"id":"f77l2SitC2Oq","colab_type":"code","outputId":"565918ce-c684-4939-afef-97890256127d","executionInfo":{"status":"ok","timestamp":1572074409091,"user_tz":-480,"elapsed":597,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["# No recursion: look only at the direct children of the html node\n","soup.html.find_all(\"title\", recursive=False)"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[]"]},"metadata":{"tags":[]},"execution_count":38}]},{"cell_type":"code","metadata":{"id":"hzxqRSC77tc_","colab_type":"code","colab":{}},"source":["# Without naming a tag, find every tag whose class is \"zzz\"\n","print(soup.find_all(\"\", {\"class\":\"zzz\"}))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"eKV3k44i7j-y","colab_type":"code","colab":{}},"source":["# Take the third td tag and find the a tag inside it (the page must contain at least three td tags)\n","print(soup.find_all(\"td\")[2].find(\"a\"))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"SqziFClS70k9","colab_type":"code","colab":{}},"source":["# Find every string whose content equals Example Domain\n","print(soup.find_all(text=\"Example Domain\"))"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"eShAJ7Ff73le","colab_type":"code","colab":{}},"source":["# Find the first a tag and print its attributes\n","print(soup.find(\"a\").attrs)\n","print(soup.find(\"a\")[\"href\"])"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"uUNaqQ-w881S","colab_type":"code","colab":{}},"source":["# Find all a tags and count them with len\n","print(len(soup.find_all(\"a\")))\n","\n","# Find the div tag with id = \"id1\" and print its text\n","print(soup.find(\"div\", id=\"id1\").text)\n","\n","# Inspecting the page shows that the cell in row 3, column 3 has id = hyperlink, which lets us locate the tag; then read its href\n","print(soup.find(\"a\", {\"id\":\"hyperlink\"})[\"href\"])"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"PbkO2Q-K9QCi","colab_type":"text"},"source":["### Searching with regular expressions\n"]},{"cell_type":"markdown","metadata":{"id":"eNSsgL9jx250","colab_type":"text"},"source":["- Regular expressions are very helpful for precisely matching tags and text in a page; they handle many cases that XPath and CSS selectors cannot pin down exactly, so they are worth learning properly.\n","- You can test patterns against sample text at [regex101.com](https://regex101.com/)."]},{"cell_type":"markdown","metadata":{"id":"ZdytI0fA9tDP","colab_type":"text"},"source":["|Meaning|Syntax|Example|\n","|-|-|-|\n","|Start|`^`|123ABC `/^1/`\n","|End|`$`|123ABC `/C$/`\n","|Range|`[\n[無言] 誰會委託同事買餐巾紙?\n"," , \n","Fw: [問卦] 鮭魚握壽司被蒸熟了該怎麼辦?\n"," , \n","[無言] 牙線棒卡在牙套上\n"," , \n","[公告] 笨板板規\n"," , \n","[公告]本板即日起不可PO問卷文\n"," , \n","[公告] 10月份置底閒聊文\n"," ]\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"vRrphsEiq8YP","colab_type":"code","outputId":"2f7228db-538b-4511-e508-fc75d4d5ec4f","executionInfo":{"status":"ok","timestamp":1572075593395,"user_tz":-480,"elapsed":757,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":145}},"source":["article_href = 
soup.select(\"div.title a\")\n","article_href"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["[[無言] 誰會委託同事買餐巾紙?,\n"," Fw: [問卦] 鮭魚握壽司被蒸熟了該怎麼辦?,\n"," [無言] 牙線棒卡在牙套上,\n"," [公告] 笨板板規,\n"," [公告]本板即日起不可PO問卷文,\n"," [公告] 10月份置底閒聊文]"]},"metadata":{"tags":[]},"execution_count":51}]},{"cell_type":"code","metadata":{"id":"zKnxpxaorHjr","colab_type":"code","outputId":"34acdd84-d21c-4a2e-cfe7-c8b4321d50d7","executionInfo":{"status":"ok","timestamp":1572075835147,"user_tz":-480,"elapsed":6039,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":319}},"source":["# Take each title in turn and build the full link\n","for a in article_href:\n","    print('title:', a.text)\n","    print('href:','https://www.ptt.cc'+a['href'])\n","\n","    # Open the linked page and save it to a file\n","    content_url = 'https://www.ptt.cc'+ a['href']\n","    r = requests.get(content_url)\n","    with open(a.text + '.html', 'w+', encoding='utf-8') as f:\n","        f.write(r.text)\n","    print('saved')"],"execution_count":0,"outputs":[{"output_type":"stream","text":["title: [無言] 誰會委託同事買餐巾紙?\n","href: https://www.ptt.cc/bbs/StupidClown/M.1572014689.A.F9B.html\n","saved\n","title: Fw: [問卦] 鮭魚握壽司被蒸熟了該怎麼辦?\n","href: https://www.ptt.cc/bbs/StupidClown/M.1572064384.A.96C.html\n","saved\n","title: [無言] 牙線棒卡在牙套上\n","href: https://www.ptt.cc/bbs/StupidClown/M.1572068103.A.43A.html\n","saved\n","title: [公告] 笨板板規\n","href: https://www.ptt.cc/bbs/StupidClown/M.1158735717.A.828.html\n","saved\n","title: [公告]本板即日起不可PO問卷文\n","href: https://www.ptt.cc/bbs/StupidClown/M.1435710970.A.31E.html\n","saved\n","title: [公告] 10月份置底閒聊文\n","href: 
https://www.ptt.cc/bbs/StupidClown/M.1569938128.A.51D.html\n","saved\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"MkRd8kgpuIQh","colab_type":"code","outputId":"9978aa63-838b-490f-f091-4d67c5fa8bab","executionInfo":{"status":"ok","timestamp":1572075857697,"user_tz":-480,"elapsed":1769,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":134}},"source":["%ls"],"execution_count":0,"outputs":[{"output_type":"stream","text":["'Fw: [問卦] 鮭魚握壽司被蒸熟了該怎麼辦?.html'\n"," \u001b[0m\u001b[01;34msample_data\u001b[0m/\n","'[公告] 10月份置底閒聊文.html'\n","'[公告]本板即日起不可PO問卷文.html'\n","'[公告] 笨板板規.html'\n","'[無言] 牙線棒卡在牙套上.html'\n","'[無言] 誰會委託同事買餐巾紙?.html'\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"1vyg4HrFLfVJ","colab_type":"code","colab":{}},"source":["# Boards restricted to age 18+ need a cookie\n","import requests\n","\n","def fetch(url):\n","    # Send the over18 cookie with every request to tell the server we are over 18\n","    response = requests.get(url, cookies={'over18': '1'})\n","    return response\n","\n","url = 'https://www.ptt.cc/bbs/Gossiping/index.html'\n","resp = fetch(url) # step-1\n","\n","print(resp.text) # result of step-1"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"lDF2XVSvNs_4","colab_type":"text"},"source":["- For more, see [爬蟲教學 CrawlerTutorial](https://github.com/leVirve/CrawlerTutorial)"]},{"cell_type":"markdown","metadata":{"id":"sdsBCb4BkOkK","colab_type":"text"},"source":["### Example: the Wikipedia list of Asian countries"]},{"cell_type":"markdown","metadata":{"id":"rINKbz4-kXl8","colab_type":"text"},"source":["- Source: [Web Scraping Wikipedia Tables using BeautifulSoup and Python](https://medium.com/analytics-vidhya/web-scraping-wiki-tables-using-beautifulsoup-and-python-6b9ea26d8722)"]},{"cell_type":"code","metadata":{"id":"MwJK3gS2kpmJ","colab_type":"code","colab":{}},"source":["import requests\n","\n","# Keep the URL in one string literal so that no whitespace ends up inside it\n","website_url = requests.get(\n","    'https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text\n","\n","from 
bs4 import BeautifulSoup\n","soup = BeautifulSoup(website_url,'lxml')\n","print(soup.prettify())"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ysX4Mq1JlKTT","colab_type":"text"},"source":[""]},{"cell_type":"code","metadata":{"id":"ELYuAc4plPCN","colab_type":"code","colab":{}},"source":["My_table = soup.find(\"table\",{\"class\":\"wikitable sortable\"})\n","My_table"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"vPKOb288lhI1","colab_type":"code","colab":{}},"source":["links = My_table.find_all('a')\n","links"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"m7n_vzVClthW","colab_type":"code","colab":{}},"source":["Country = [link.get('title') for link in links if link.get('title') is not None]"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"lD4Uf9lVDNxL","colab_type":"code","colab":{}},"source":["Country"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"5QaTp8NcmdGZ","colab_type":"code","colab":{}},"source":["import pandas as pd\n","\n","df = pd.DataFrame()\n","df['Country'] = Country\n","df"],"execution_count":0,"outputs":[]},{"cell_type":"code","metadata":{"id":"0Hdt3qwunW_g","colab_type":"code","colab":{}},"source":["df = df.sort_values(by=\"Country\")\n","# reset_index returns a new DataFrame, so assign the result back\n","df = df.reset_index(drop=True)\n","df"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"vnv7fzVMaPCz","colab_type":"text"},"source":["## Exercises"]},{"cell_type":"markdown","metadata":{"id":"Ta5rh-Up1vQe","colab_type":"text"},"source":["### Exercise 1"]},{"cell_type":"markdown","metadata":{"id":"E76LBd6BaSRE","colab_type":"text"},"source":["- Try to read, run and take apart the following program\n","- 
Source: https://github.com/jwlin/web-crawler-tutorial/blob/master/ch3/ptt_gossiping.py"]},{"cell_type":"code","metadata":{"id":"8HJHhY5qaJtf","colab_type":"code","outputId":"cc214cef-ddf0-456e-9a12-2bc8f4c9868c","executionInfo":{"status":"ok","timestamp":1572076619044,"user_tz":-480,"elapsed":22301,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":581}},"source":["import requests\n","import time\n","import json\n","from bs4 import BeautifulSoup\n","\n","\n","PTT_URL = 'https://www.ptt.cc'\n","\n","\n","def get_web_page(url):\n","    resp = requests.get(\n","        url=url,\n","        cookies={'over18': '1'}\n","    )\n","    if resp.status_code != 200:\n","        print('Invalid url:', resp.url)\n","        return None\n","    else:\n","        return resp.text\n","\n","\n","def get_articles(dom, date):\n","    soup = BeautifulSoup(dom, 'html5lib')\n","\n","    # Get the link to the previous page\n","    paging_div = soup.find('div', 'btn-group btn-group-paging')\n","    prev_url = paging_div.find_all('a')[1]['href']\n","\n","    articles = []  # collected article data\n","    divs = soup.find_all('div', 'r-ent')\n","    for d in divs:\n","        if d.find('div', 'date').text.strip() == date:  # the post date matches\n","            # Get the push count\n","            push_count = 0\n","            push_str = d.find('div', 'nrec').text\n","            if push_str:\n","                try:\n","                    push_count = int(push_str)  # convert the string to a number\n","                except ValueError:\n","                    # If conversion fails, the value may be '爆' or 'X1', 'X2', ...\n","                    # Otherwise do nothing; push_count stays 0\n","                    if push_str == '爆':\n","                        push_count = 99\n","                    elif push_str.startswith('X'):\n","                        push_count = -10\n","\n","            # Get the article link and title\n","            if d.find('a'):  # a hyperlink means the article exists and has not been deleted\n","                href = d.find('a')['href']\n","                title = d.find('a').text\n","                author = ''  # author = d.find('div', 'author').text if d.find('div', 'author') else ''\n","                articles.append({\n","                    'title': title,\n","                    'href': href,\n","                    'push_count': push_count,\n","                    'author': author\n","                })\n","    return articles, prev_url\n","\n","\n","def get_author_ids(posts, pattern):\n","    ids = set()\n","    for post in posts:\n","        if pattern in post['author']:\n","            ids.add(post['author'])\n","    return ids\n","\n","if __name__ == '__main__':\n","    current_page = get_web_page(PTT_URL + '/bbs/Gossiping/index.html')\n","    if current_page:\n","        articles = []  # all of today's articles\n","        today = time.strftime(\"%m/%d\").lstrip('0')  # today's date without the leading '0', to match PTT's format\n","        current_articles, prev_url = get_articles(current_page, today)  # today's articles on the current page\n","        while current_articles:  # while the current page has articles from today, add them and move to the previous page\n","            articles += current_articles\n","            current_page = get_web_page(PTT_URL + prev_url)\n","            if current_page is None:  # stop if the previous page could not be fetched\n","                break\n","            current_articles, prev_url = get_articles(current_page, today)\n","\n","        # Print every distinct author id containing '5566'\n","        # print(get_author_ids(articles, '5566'))\n","\n","        # Store or process the article data\n","        print('今天有', len(articles), '篇文章')\n","        threshold = 50\n","        print('熱門文章(> %d 推):' % (threshold))\n","        for a in articles:\n","            if int(a['push_count']) > threshold:\n","                print(a)\n","        with open('gossiping.json', 'w', encoding='utf-8') as f:\n","            json.dump(articles, f, indent=2, sort_keys=True, ensure_ascii=False)"],"execution_count":0,"outputs":[{"output_type":"stream","text":["今天有 776 篇文章\n","熱門文章(> 50 推):\n","{'title': '[新聞] 星宇航空首架機交付 張國煒親自駕回台', 'href': '/bbs/Gossiping/M.1572072025.A.F0A.html', 'push_count': 57, 'author': ''}\n","{'title': '[新聞] 同志遊行前夕 蘇貞昌:相互尊重讓台灣更', 'href': '/bbs/Gossiping/M.1572070242.A.738.html', 'push_count': 58, 'author': ''}\n","{'title': '[新聞] 違法三缺一?中國觸怒民怨的「麻將館禁令」', 'href': '/bbs/Gossiping/M.1572068601.A.93D.html', 'push_count': 60, 'author': ''}\n","{'title': '[新聞] 進台北市區注意!明天同志遊行中午起交通', 'href': '/bbs/Gossiping/M.1572066682.A.275.html', 'push_count': 54, 'author': ''}\n","{'title': '[新聞] 非洲豬瘟肆虐 菲律賓養豬業每月損失約6億', 'href': '/bbs/Gossiping/M.1572066826.A.A34.html', 'push_count': 80, 'author': ''}\n","{'title': '[問卦] 有沒有做出PttChrome計算推樓插件的八卦', 'href': '/bbs/Gossiping/M.1572066384.A.30B.html', 'push_count': 99, 'author': ''}\n","{'title': '[爆卦] 中國食品凍漲擋不住了', 'href': '/bbs/Gossiping/M.1572065077.A.9C0.html', 'push_count': 99, 'author': 
''}\n","{'title': '[新聞] 貧窮線調高! 北市月收不到2萬4293元就', 'href': '/bbs/Gossiping/M.1572064119.A.A22.html', 'push_count': 55, 'author': ''}\n","{'title': '[爆卦] GD權志龍退伍了!', 'href': '/bbs/Gossiping/M.1572064886.A.775.html', 'push_count': 61, 'author': ''}\n","{'title': 'Re: [新聞] 李亞萍哭認余苑綺「是末期了」:心裡有數', 'href': '/bbs/Gossiping/M.1572063783.A.974.html', 'push_count': 58, 'author': ''}\n","{'title': '[新聞] 卓榮泰說沒民進黨柯文哲會落選?對手丁守', 'href': '/bbs/Gossiping/M.1572063861.A.27F.html', 'push_count': 51, 'author': ''}\n","{'title': '[新聞] 謝震武紅遍政論節目 竟讓劉寶傑好委屈', 'href': '/bbs/Gossiping/M.1572063933.A.371.html', 'push_count': 71, 'author': ''}\n","{'title': '[問卦] 癌症三期了會治療還是放棄?', 'href': '/bbs/Gossiping/M.1572061470.A.980.html', 'push_count': 54, 'author': ''}\n","{'title': '[新聞] 韓國瑜「國外遊學」支票 蔡英文:韓也說', 'href': '/bbs/Gossiping/M.1572061561.A.6B0.html', 'push_count': 99, 'author': ''}\n","{'title': 'Re: [問卦] 故宮的東西在世界上算厲害嗎', 'href': '/bbs/Gossiping/M.1572061164.A.F36.html', 'push_count': 99, 'author': ''}\n","{'title': '[新聞] 地府只能收美金?中國新法令冥紙禁印人', 'href': '/bbs/Gossiping/M.1572060525.A.E97.html', 'push_count': 56, 'author': ''}\n","{'title': '[新聞] 強國人為何想偷渡英國?\\u3000華春瑩嗆CNN:', 'href': '/bbs/Gossiping/M.1572059049.A.526.html', 'push_count': 99, 'author': ''}\n","{'title': 'Re: [新聞] 旺中「跳船」?《旺報》社長砲轟韓國瑜', 'href': '/bbs/Gossiping/M.1572059203.A.FE5.html', 'push_count': 69, 'author': ''}\n","{'title': '[問卦] 認真問,去駕訓班學手排.還是自己考自排?', 'href': '/bbs/Gossiping/M.1572058150.A.FBA.html', 'push_count': 59, 'author': ''}\n","{'title': '[問卦] 可以數位故宮那能不能數位遊學?!', 'href': '/bbs/Gossiping/M.1572056208.A.0EE.html', 'push_count': 64, 'author': ''}\n","{'title': 'Re: [新聞] 快訊/韓國瑜晚間開支票:只要當總統,大', 'href': '/bbs/Gossiping/M.1572051391.A.DCD.html', 'push_count': 99, 'author': ''}\n","{'title': '[新聞] 美軍盼晶片在地生產 台積電評估赴美建新', 'href': '/bbs/Gossiping/M.1572051828.A.2F6.html', 'push_count': 99, 'author': ''}\n","{'title': '[新聞] 台北警公務車載女友!女方「屁蛋妹」身份', 'href': '/bbs/Gossiping/M.1572050526.A.01D.html', 'push_count': 53, 'author': ''}\n","{'title': 
'Re: [新聞] 快訊/韓國瑜晚間開支票:只要當總統,大', 'href': '/bbs/Gossiping/M.1572048367.A.98A.html', 'push_count': 99, 'author': ''}\n","{'title': '[問卦] 你周遭朋友,買過春的男生多嗎?', 'href': '/bbs/Gossiping/M.1572037109.A.44C.html', 'push_count': 88, 'author': ''}\n","{'title': '[問卦] 香港反送中為何不平息?', 'href': '/bbs/Gossiping/M.1572029681.A.250.html', 'push_count': 51, 'author': ''}\n","{'title': '[新聞] 死刑「注射針」插歪!他痛苦3倍時間才身亡', 'href': '/bbs/Gossiping/M.1572026898.A.906.html', 'push_count': 70, 'author': ''}\n","{'title': 'Re: [新聞] 藏人自焚說 柯文哲:用字不精確但我意思', 'href': '/bbs/Gossiping/M.1572026908.A.C09.html', 'push_count': 53, 'author': ''}\n","{'title': 'Re: [問卦] 五月天跪中國了?', 'href': '/bbs/Gossiping/M.1572020507.A.C3B.html', 'push_count': 99, 'author': ''}\n","{'title': '[新聞] 告別立院演說 王金平:太陽花學運不流血落幕 我無愧天地', 'href': '/bbs/Gossiping/M.1572019252.A.AE8.html', 'push_count': 99, 'author': ''}\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"wyXAXDIN1MBI","colab_type":"text"},"source":["### Exercise 2"]},{"cell_type":"markdown","metadata":{"id":"l9CxzWKIcqEZ","colab_type":"text"},"source":["- Try to read, run and take apart the following program\n","- Source: https://github.com/jwlin/web-crawler-tutorial/blob/master/ch3/yahoo_movie.py"]},{"cell_type":"code","metadata":{"id":"mq4Xi5X-cc9A","colab_type":"code","outputId":"1666f312-3144-4b81-99ce-45f0511f9a08","executionInfo":{"status":"ok","timestamp":1572076585531,"user_tz":-480,"elapsed":2183,"user":{"displayName":"陳宇威","photoUrl":"","userId":"00090422027206355406"}},"colab":{"base_uri":"https://localhost:8080/","height":211}},"source":["import requests\n","import re\n","import json\n","from bs4 import BeautifulSoup\n","\n","\n","Y_MOVIE_URL = 'https://tw.movies.yahoo.com/movie_thisweek.html'\n","\n","# Append \"/id=MOVIE_ID\" to the URLs below to get each kind of information for a movie\n","Y_INTRO_URL = 'https://tw.movies.yahoo.com/movieinfo_main.html'  # details\n","Y_PHOTO_URL = 'https://tw.movies.yahoo.com/movieinfo_photos.html'  # stills\n","Y_TIME_URL = 'https://tw.movies.yahoo.com/movietime_result.html'  # showtimes\n","\n","\n","def get_web_page(url):\n","    resp = requests.get(url)\n","    if resp.status_code != 200:\n","        print('Invalid url:', resp.url)\n","        return None\n","    else:\n","        return resp.text\n","\n","\n","def get_movies(dom):\n","    soup = BeautifulSoup(dom, 'html5lib')\n","    movies = []\n","    rows = soup.find_all('div', 'release_info_text')\n","    for row in rows:\n","        movie = dict()\n","        movie['expectation'] = row.find('div', 'leveltext').span.text.strip()\n","        movie['ch_name'] = row.find('div', 'release_movie_name').a.text.strip()\n","        movie['eng_name'] = row.find('div', 'release_movie_name').find('div', 'en').a.text.strip()\n","        movie['movie_id'] = get_movie_id(row.find('div', 'release_movie_name').a['href'])\n","        movie['poster_url'] = row.parent.find_previous_sibling('div', 'release_foto').a.img['src']\n","        movie['release_date'] = get_date(row.find('div', 'release_movie_time').text)\n","        movie['intro'] = row.find('div', 'release_text').text.replace(u'詳全文', '').strip()\n","        trailer_a = row.find_next_sibling('div', 'release_btn color_btnbox').find_all('a')[1]\n","        movie['trailer_url'] = trailer_a['href'] if 'href' in trailer_a.attrs.keys() else ''\n","        movies.append(movie)\n","    return movies\n","\n","\n","def get_date(date_str):\n","    # e.g. \"上映日期:2017-03-23\" -> match.group(0): \"2017-03-23\"\n","    pattern = r'\\d+-\\d+-\\d+'\n","    match = re.search(pattern, date_str)\n","    if match is None:\n","        return date_str\n","    else:\n","        return match.group(0)\n","\n","\n","def get_movie_id(url):\n","    # 20180515: the URL format changed, e.g. 'https://movies.yahoo.com.tw/movieinfo_main/%E6%AD%BB%E4%BE%8D2-deadpool-2-7820.html' -> '7820'\n","    # Older URLs looked like \"https://tw.rd.yahoo.com/referurl/movie/thisweek/info/*https://tw.movies.yahoo.com/movieinfo_main.html/id=6707\"\n","    try:\n","        movie_id = url.split('.html')[0].split('-')[-1]\n","    except Exception:\n","        movie_id = url\n","    return movie_id\n","\n","\n","def get_trailer_url(url):\n","    # e.g., 'https://tw.rd.yahoo.com/referurl/movie/thisweek/trailer/*https://tw.movies.yahoo.com/video/美女與野獸-最終版預告-024340912.html'\n","    return url.split('*')[1]\n","\n","\n","def get_complete_intro(movie_id):\n","    page = get_web_page(Y_INTRO_URL + '/id=' + movie_id)\n","    if page:\n","        soup = BeautifulSoup(page, 'html5lib')\n","        infobox = soup.find('div', 'gray_infobox_inner')\n","        print(infobox.text.strip())\n","\n","\n","def main():\n","    page = get_web_page(Y_MOVIE_URL)\n","    if page:\n","        movies = get_movies(page)\n","        for movie in movies:\n","            print(movie)\n","            # get_complete_intro(movie[\"movie_id\"])\n","        with open('movie.json', 'w', encoding='utf-8') as f:\n","            json.dump(movies, f, indent=2, sort_keys=True, ensure_ascii=False)\n","\n","\n","if __name__ == '__main__':\n","    main()"],"execution_count":0,"outputs":[{"output_type":"stream","text":["{'expectation': '90%', 'ch_name': '雙子殺手', 'eng_name': 'Gemini Man', 'movie_id': '10017', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/October2019/YSeFXqDiSK0XrIcSdHi1-486x720.JPG', 'release_date': '2019-10-23', 'intro': '威爾史密斯飾演的一名頂尖殺手亨利布羅根,卻被一位神秘的年輕殺手追殺,而且這位年輕的殺手竟然能事先預測亨利的一舉一動。《雙子殺手》由金像獎得主李安執導,知名製作人傑瑞布洛克海默、大衛艾利森、戴娜戈柏和唐葛蘭傑製作。其他演員還有瑪麗伊莉莎白文斯蒂德、克里夫歐文和班奈狄克王。', 'trailer_url': 
'https://movies.yahoo.com.tw/video/%E9%9B%99%E5%AD%90%E6%AE%BA%E6%89%8B-%E6%9C%80%E6%96%B0%E9%A0%90%E5%91%8A-035402781.html?movie_id=10017'}\n","{'expectation': '88%', 'ch_name': '電影版 吹響吧!上低音號~誓言的終章~', 'eng_name': 'Sound! Euphonium, the Movie -Our Promise: A Brand New Day-', 'movie_id': '10306', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/October2019/ZXplEx71mOQQlcTpxiYd-992x1418.JPG', 'release_date': '2019-10-23', 'intro': '順利在去年的全日本管樂競演會中出場的北宇治高中管樂社,升上二年級的黃前久美子和三年級的加部友惠,兩人開始一起負責指導從四月開始新加入的一年級新生。\\n\\xa0\\n由於身為全國大賽的出場學校,因此吸引了大量的一年級新生入部,其中,有四名新生來到了低音部,包括乍看之下似乎毫無問題的久石奏、不融入周圍的鈴木美玲、想要和美玲成為朋友的鈴木五月,以及從不提及私事的月永求。面對接連而來的Sunrise祭、選拔賽以及競演會,以「全國大賽金獎」為目標的管樂社,卻接連發生問題……!?北宇治高中管樂社,風波不斷的日子開始了!', 'trailer_url': ''}\n","{'expectation': '56%', 'ch_name': '日本國民導演—山田洋次影展', 'eng_name': '', 'movie_id': '10299', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/October2019/CLNjaW36KsCZoGzmmhMn-509x720.jpg', 'release_date': '2019-10-24', 'intro': '現年88歲的日本國民導演-山田洋次,從事電影工作超過一甲子的時間,已累績八十多部的導演作品,至今仍持續推出高品質的新作,半世紀以來,一直深受不同世代觀眾的共鳴。2019年適逢《男人真命苦》問世五十週年,光點台北因此特別規劃山田洋次回顧影展,精選八部代表作品,從1960年代到二十一世紀,一窺這位日本庶民大師五十年來的創作軌跡!10.04周五起正式售票,10.24-11.06影展期間於光點台北電影館播映,歡迎影迷們共襄盛舉!\\n\\xa0\\n本次影展共選《男人真命苦》、《家族》、《故鄉》、《幸福的黃手帕》、《遠山的呼喚》、《兒子》《黃昏清兵衛》、《東京小屋的回憶》八部經典之作,透過不同時期的電影回顧,激起屬於每個時代的獨特記憶。其中1969年正式上映的松竹電影《男人真命苦》,由渥美清、倍賞千惠子、前田吟、森川信等人主演,描述東京平凡市井小民─寅次郎的故事,本片不只創下破億日幣票房,在接下來的二十六年間以相同班底連續推出48集,同時也締造金氏世界紀錄最長系列片,成為日本人的共同記憶!\\n\\xa0\\n山田洋次從影歷程幾乎是半部日本電影史,耕耘多年的他繼承小津安二郎、木下惠介等前輩樹立的「庶民劇」傳統,終其一生致力於書寫市井小民的喜怒哀樂,也如同小津一樣,長期與固定的劇組合作,不僅培養出工作夥伴間的良好默契,也宛如陪伴觀眾成長的左鄰右舍一般親切。\\n\\xa0\\n「身為一介小市民,我要將日常生活中觸動─人心的事,透過某種契機使之成形,在構思中建立骨骼、填上血肉,完成具體的作品」,多年來山田洋次的秉持著相同的理念,在他的電影裡可以感受到最真摯、最貼近生活的創作,欲重溫這些經典作品的影迷們,10.24-11.06期間於光點台北電影館千萬別錯過【日本國民導演—山田洋次 影展】。\\n\\xa0\\n【售票資訊】10.04(五)起開始售票\\n\\xa0\\n全票:240元/張 ,光點會員200元/張\\n\\xa0\\n套票:720元/4張乙套\\n\\xa0\\n✔專屬優惠好禮─凡於現場一次購買『#兩組套票』,即贈 大春煉皂 
提供【經典米萃皂】乙份。(數量有限,送完為止)\\n\\xa0\\n●購票一律請親至光點台北售票服務台並於現場確認場次\\n\\xa0\\n●光點會員請持會員卡及有效證件購票享會員優惠\\n\\xa0\\n●愛心票僅供65歲以上老人、身心障礙人士與乙名必要陪同者(需同時入場)購買,購票時請出示相關證件\\n\\xa0\\n●套票售出概不退換', 'trailer_url': ''}\n","{'expectation': '89%', 'ch_name': '金翅雀', 'eng_name': 'The Goldfinch', 'movie_id': '10072', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/October2019/WR18DRSUJn9Z1GdkwUNI-2985x4263.jpg', 'release_date': '2019-10-25', 'intro': '《金翅雀》故事敘述小名「席歐」的13歲少年席爾鐸戴克(奧克斯弗格雷 飾)與母親參觀大都會藝術博物館,當他們在欣賞母親最愛的「金翅雀」這幅畫作時,館內突然發生大爆炸,席歐幸運地逃過一劫,卻也因此與母親從此天人永隔,這場突如其來的鉅變改變了席歐的一生,讓他展開充滿波折的人生與探索的旅程;成年後的席歐(安索艾格特 飾)經歷了無止盡的悲傷與自責、一連串的重生、贖罪與溫暖的愛之後,當他回首生命的這一切,他心中放不下的還是那幅改變他一生的畫作:一隻腳被細細的鎖鏈綁在棲木上的小鳥兒,那個看似美麗,卻永遠無法獲得自由,同時也是母親在世時摯愛的作品:「金翅雀」。', 'trailer_url': 'https://movies.yahoo.com.tw/video/%E9%87%91%E7%BF%85%E9%9B%80-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-072238436.html?movie_id=10072'}\n","{'expectation': '94%', 'ch_name': '陪你很久很久', 'eng_name': 'Stand by me', 'movie_id': '10160', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/September2019/IyCzEAuwXvQbjJFZZ4TI-510x720.jpg', 'release_date': '2019-10-25', 'intro': '★今年最熱血、最青春、最動感的校園愛情電影!\\n★李淳首挑大樑,攜手兩大超人氣女星邵雨薇、蔡瑞雪,共解最複雜的愛情習題!\\n★顛覆傳統浪漫小清新,荒唐、逗趣、充滿驚喜,新世代青春爆笑喜劇!\\n\\xa0\\n憨直的九餅(李淳飾)從小就暗戀著薄荷(邵雨薇飾),兩人一路相互陪伴,卻始終維持著「好朋友」的關係。直到上了大學後,九餅意外地與甜美高中生夏天(蔡瑞雪飾)成為同房室友,又在迎新試膽大會時,不小心將薄荷推向校園天菜麥子學長(宋柏緯飾)懷中,眼看著心愛的她即將被追走,九餅究竟要如何守住薄荷,為自己的青春奮力一搏!', 'trailer_url': 'https://movies.yahoo.com.tw/video/%E9%99%AA%E4%BD%A0%E5%BE%88%E4%B9%85%E5%BE%88%E4%B9%85-%E9%9D%92%E6%98%A5%E7%86%B1%E8%A1%80%E7%89%88%E9%A0%90%E5%91%8A-132846573.html?movie_id=10160'}\n","{'expectation': '62%', 'ch_name': '伊索遊戲', 'eng_name': 'Aesop’s Game', 'movie_id': '10216', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/August2019/fJBtPhAz38ToLmsSaIVb-800x1142.jpg', 'release_date': '2019-10-25', 'intro': 
'「一隻烏龜從天而降砸傷了路人,不知是意外還是惡作劇,警方正在追查這隻肇事逃逸的烏龜…」故事從烏龜開始,加上兔子,還有狗狗來亂入?顛覆你所知道的伊索寓言!\\n\\xa0\\n龜田美羽(龜),內向的女大學生,和家人感情很好,唯一的朋友卻是隻烏龜。\\n兔草早織(兔),超人氣的星二代,出生「明星家族」,卻天生一副戀愛體質。\\n戌井小柚(犬),身手不凡,和爸爸一起經營「復仇」生意,天天幫人尋仇。\\n當三位少女相遇,你說這是部酸酸甜甜的青春電影?……大錯特錯!\\n\\xa0\\n綁架、背叛、復仇、揭開虛偽的假象,一場超乎想像的鬥智心理遊戲正要開始,結局會和你想像的……完全不同!!!校園純愛戀曲是假的,懸疑驚悚才是真的!當綁架事件爆發,最難預測的才不是青春,而是這部電影的劇情走向!\\n\\xa0\\n【關於電影】\\n\\xa0\\n延續神片《一屍到底》最強騙術\\n三位導演、三倍翻轉,一次滿足!\\n本片由日本觀影人次超過220萬、票房突破31億日圓、2018年最受矚目的話題作品《一屍到底》導演上田慎一郎親自操刀原創劇本,借三名導演之力搬上大銀幕,承襲《一屍到底》一路「騙」很大、在騙局中加洋蔥的反轉風格,將再次帶給觀眾就算被騙到底,也甘之如飴的觀影體驗。此次與上田慎一郎共同執導的導演之一是在《一屍到底》擔任副導的中泉裕矢,他在2011年推出第一部導演長片《圓罪》,其後作品《與母親的旅程》、《來去拍片尾》都在日本國內影展蔚為話題。另一位導演則是在《一屍到底》中擔任劇照師的淺沼直也,他曾以《若冬天燃起》在SKIP CITY國際電影節獲得2017年短片單元最佳作品,是日本當前備受期待的新銳創作者。三人在2015年便曾合作過短片集《四個與貓的故事》,當時三人各自擔任其中一部的導演,因此這次決定再度攜手合作。截然不同的三人共同擔任導演,將蹦出三種不同特色的火花,這部花費三年以上製作的電影,終於要登上大銀幕和觀眾見面了!\\n\\xa0\\n以《一屍到底》快速竄紅的導演上田慎一郎表示自己從中學時期就開始自己嘗試拍片,當時沒有YouTube,也不知道有電影節這種管道,光是把作品拍出來就已經是對自己最大的獎勵了!長大之後,也有曾經抱持著:「作品非在電影節放映不可!」、「我的作品一定要被偉大的人認可!」然而在拍攝《一屍到底》時,反而試圖讓自己不去多想成敗,只專注於自己想拍的東西,最後的成果反而得到崇高肯定,上田慎一郎表示:「投入《伊索遊戲》創作後我也期許自己秉持初心,繼續拍自己想拍的電影!」上田導演坦言《伊索遊戲》這部作品前後花了三年構想,不過事實上前兩年都還在和另兩位導演爭論究竟要拍什麼樣的故事,「三個人一起的執導,絕對會有各自無法妥協的部分,不過我們也有著要將這些矛盾、糾結化為這部電影的魅力的覺悟!」本片也大膽啟用新銳演員石川瑠華、井桁弘恵、紅甘擔綱主演,上田慎一郎在電影上映前也再次向觀眾呼籲:「這部電影有三名導演、三位女主角,所以大家請務必三人以上同行前往電影院觀看!」', 'trailer_url': 'https://movies.yahoo.com.tw/video/%E4%BC%8A%E7%B4%A2%E9%81%8A%E6%88%B2-%E5%B0%8E%E6%BC%94%E5%95%8F%E5%80%99%E7%AF%87-105145492.html?movie_id=10216'}\n","{'expectation': '69%', 'ch_name': '西幽玹歌', 'eng_name': 'Thunderbolt Fantasy', 'movie_id': '10244', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/September2019/EcNK84hDr6NAucwtBl3N-506x720.jpg', 'release_date': '2019-10-25', 'intro': 
'生來便懷有異能歌聲的少年浪巫謠,自小便跟在隱遁雪山的盲眼母親身邊,接受經年累月的苛刻訓練。\\n\\xa0\\n母親懷著野心,想將兒子的歌聲鍛鍊成天下無雙,然後送入宮廷。然而過於苛刻的訓練卻招致了不幸的事故,母親在浪巫謠眼前斷送了性命。\\n\\xa0\\n失去照顧者後,浪巫謠成了流浪之身。但他的異能卻總被無情的人們利用,成為他們滿足慾望的道具,漸漸消磨著少年的心靈。\\n\\xa0\\n儘管如此,浪巫謠罕見的歌聲終於吸引到了西幽皇女,因而得到了過去母親所夢想的飛黃騰達。但是等待著他的,卻是成為執政者玩物、賭上性命與其他樂師進行演奏競賽的殘忍遊戲。\\n\\xa0\\n就在某一天,浪巫謠聽聞有一名在西幽各地奪取魔劍、占為己有的大罪人──「啖劍太歲」。而啖劍太歲的下一個目標,正是皇帝藏在宮中的聖劍。\\n\\xa0\\n【關於電影】\\n\\xa0\\n在臺灣可以說是「無人不知、無人不曉」,從大人到小孩都非常喜愛的木偶劇「布袋戲」。 本次作品由對布袋戲深深著迷的Nitroplus「虛淵玄」擔任故事原著・劇本・總監修,與臺灣布袋戲中最具知名度及以其高品質製作聞名的「霹靂國際多媒體」(日文簡稱:霹靂社)合作, 以此奇蹟般的組合所完成的日本及臺灣地區共同企劃之影像作品。\\n\\xa0\\n於2018年12月24日(一)播放TV版第二季《Thunderbolt Fantasy 東離劍遊紀2》最後一集, 在那之後隨即發表「第三季製作決定!」的情報, 並且在3月22日(五)公佈第二季外傳《Thunderbolt Fantasy 西幽玹歌》正在進行製作的消息。\\xa0\\n\\xa0\\n2019年10月25日(五)於戲院上映的《Thunderbolt Fantasy 西幽玹歌》,以活躍於TV版第二季的「浪巫謠」為主角,講述他的過去、以及他在西幽發生的故事,為Nitroplus「虛淵玄」擔任故事原著・劇本・總監修的全新作品。\\n\\xa0\\n角色設計和至今為止的系列作品同樣,由Nitroplus所率領的設計團隊(新加入成員「minoa」)負責,同時亦邀請公仔製作公司「GOOD SMILE COMPANY」擔任戲偶造型顧問。 配樂的部份也同樣邀請到專注在偶像劇、動畫、電影界配樂、並且擔任許多歌手樂曲之提供的「澤野弘之」擔任製作。', 'trailer_url': 'https://movies.yahoo.com.tw/video/%E8%A5%BF%E5%B9%BD%E7%8E%B9%E6%AD%8C-%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-053315107.html?movie_id=10244'}\n","{'expectation': '80%', 'ch_name': '失憶的總理大臣', 'eng_name': '', 'movie_id': '10246', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/October2019/w1eivF3UfcYZ5YI6T45k-800x1132.jpg', 'release_date': '2019-10-25', 'intro': 
'無良政客一夕失憶,竟然轉性善良大叔!?\\n史上民調最低的總理大臣黑田啟介(中井貴一飾),走到哪都被市民噓爆,黑粉滿街跑!某天,總理又被沿街抗議,沒想到一顆石頭飛來,砸中額頭而昏迷,醒來後,他竟完全喪失記憶,黑粉美夢一夕成真?!為了避免國家大亂,他的三位貼身秘書竭盡所能讓國事照常運作,全面隱瞞真相。而一向唯利是圖的無良總理,一夕之間轉性變成傾聽民心、致力國政的善良大叔,異常反差的態度也讓全民開始起疑。眼看總理「我失憶了」秘密即將露餡,政壇的狐群狗黨們為私利大亂鬥,美國總統偏偏在此時將參訪日本,一場顛覆政壇的狂喜劇即將鬧上全世界!\\n\\xa0\\n【關於電影】\\n\\xa0\\n喜劇大師三谷幸喜醞釀四十年《失憶的總理大臣》引爆全民共鳴\\n導演撿到槍電影質問首相安倍晉三幽默回應心得秒破冰!\\n日本喜劇大師三谷幸喜睽違多年,在影迷千呼萬喚之下,終於帶來最高傑作《失憶的總理大臣》!三谷幸喜端出醞釀四十年的故事,並集結黃金卡司包括日本奧斯卡影帝中井貴一、《月薪嬌妻》「全民小阿姨」石田百合子、《大叔之愛》田中圭、《告白》木村佳乃、老班底日本奧斯卡影帝佐藤浩市等一線明星,再次打造三谷幸喜無人能擋的魅力!三谷幸喜表示,故事靈感來自他高中的志願發想,「如果像我這樣沒有私利私慾、不追求權力金錢慾的人做總理,不是很好的政治家嗎?」但路人突然變成總理根本不可能,喜劇鬼才的他就想到石頭砸破腦袋的橋段,創作出這部不分國界引爆觀眾共鳴的喜劇新作。此外,膽大包天的《失憶的總理大臣》劇組甚至直接邀請日本現任首相安倍晉三看片,映後導演戰戰兢兢地問首相:「感想是?」沒想到安倍幽默拿出政治人物愛用語回:「我失去記憶了!」爆笑氣氛現場一秒破冰!三谷再次以縝密的故事編排和獨特「三谷式」幽默驚艷觀眾,電影對政客的惡搞更引起日本全民共鳴,展開「最希望哪位政治人物失憶」的熱烈討論,造就另一股社會話題炫風!\\n\\xa0\\n每一部都讓你笑到哭又感動哭!\\n三谷幸喜魅力無人能敵日本巨星爭相合作朝聖\\n鬼才導演三谷幸喜能編能導,作品橫跨電影、電視及舞台劇,過去不僅以叫好叫座的喜劇《魔幻時刻》、《鬼壓床了沒》、《有頂天大飯店》累積破百億票房,NHK大河劇《真田丸》也大獲好評,他所執導的舞台劇更是在日本及台灣都場場爆滿,是日本巨星阿部寬、妻夫木聰、綾瀨遙、松隆子、深津繪里、役所廣司等人爭相合作的對象,就算只能演小配角也沒問題!這次更加入首度合作的石田百合子及藤岡靛黃金陣容,加上三谷幸喜作品天馬行空卻又天衣無縫的巧合與誤會,荒謬劇情再度讓觀眾笑到岔氣,同時透過小人物的熱血初衷,帶來意想不到的逆轉結局,讓人忍不住笑著笑著就感動哭了!今年新作《失憶的總理大臣》以觀眾再熟悉不過的政治為題,三谷幸喜表示:「這不是政治諷刺劇,而是政治狂想曲!」大導透過架空的舞台為觀眾創造更多聯想空間,就算來自不同國家的觀眾,也能輕鬆帶入自己所熟悉的時事!', 'trailer_url': 'https://movies.yahoo.com.tw/video/%E5%A4%B1%E6%86%B6%E7%9A%84%E7%B8%BD%E7%90%86%E5%A4%A7%E8%87%A3-%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-122854501.html?movie_id=10246'}\n","{'expectation': '46%', 'ch_name': '阿嬤養的豬', 'eng_name': \"Esmeralda's Twilight\", 'movie_id': '10272', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/September2019/crRK7LKzsnj0hdMz3Yv2-488x720.JPG', 'release_date': '2019-10-25', 'intro': '★2019墨西哥奧斯卡「阿里爾獎」奪最佳處女電影獎導演提名\\n★2019墨西哥奧斯卡「阿里爾獎」最佳女主角提名\\n★墨西哥電影界最高榮譽「墨西哥電影電視藝術學院獎」提名\\n★比《我不笨,我有話要說》更感人的溫馨催淚片\\xa0\\n最佛心的阿嬤,豬年行大運\\n★ 豬儂我儂,有阿嬤疼的小豬最窩心、像個寶\\n\\xa0\\n年度必看的感人溫馨催淚電影 
內含洋蔥讓人揪心落淚\\n隨著丈夫的去世和她兒子不在身旁,老婦人已經失去了對生活的樂趣,直到她重新獲得希望…當一隻小豬進入她的生活。\\n\\xa0\\n在丈夫過世後,老婦人獨居在小鎮,與移居美國的兒子通上電話,成為每天生活唯一的動力,對習慣照顧人的她來說,生命像失去目標。\\n\\xa0\\n有天,一頭小豬意外闖進她的生活,她開始把小豬當成孩子照顧,三餐讓牠吃飽飽,梳毛散步都不少,帶給她與鄰居很多歡樂,不久這隻豬懷孕了,她傾盡所有的積蓄、精力與時間,準備迎接豬孫的到來,卻發現豬女兒身體不如以往了…\\n\\xa0\\n【關於電影】\\xa0\\n\\xa0\\n喪偶奶奶透過小豬重獲新生 純樸鄉村裡人與豬之間最溫馨的羈絆關係\\n《阿嬤養的豬》透過獨居婦人對動物的依戀,點出老人現實處境,熱帶風情的景色、美味的料理、人情味的可愛人物,打造出充滿愛與信念的樸素小鎮,細膩堆疊出人與小豬的真摯情感,已近薄暮的主角,面對人生的無常,選擇放手與釋懷,心境轉換帶出希望的光輝,更顯溫暖令人動容。', 'trailer_url': 'https://movies.yahoo.com.tw/video/%E9%98%BF%E5%AC%A4%E9%A4%8A%E7%9A%84%E8%B1%AC-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-134459693.html?movie_id=10272'}\n","{'expectation': '53%', 'ch_name': '我的朋友霸王龍', 'eng_name': 'My TYRANO: Together, Forever', 'movie_id': '10274', 'poster_url': 'https://movies.yahoo.com.tw/x/r/w420/i/o/production/movies/September2019/FjcsFOd9I7bDwpFHNcpP-800x1142.jpg', 'release_date': '2019-10-25', 'intro': '★改編自全球暢銷繪本作家宮西達也的恐龍系列作品,取自〈永遠永遠在一起〉〈我愛你〉〈我相信你〉三部繪本,是《你看起來好像很好吃》同系列電影。\\n★由《名偵探柯南》動畫系列導演靜野孔文出任導演、監製,是一部中、日、韓三地聯合製作的動畫電影。\\xa0\\n★奧斯卡金獎配樂大師坂本龍一第一部於動畫電影界亮相的作品。\\n★由手塚治虫創辦的手塚製作公司擔任動畫製作。\\n\\xa0\\n在冰河世紀來臨前的恐龍時期,一隻正被魔鬼龍追逐的粉紅色小翼龍普娜,就在她正要被吞下肚之前,一隻巨大威猛的食肉霸王龍突然出現在牠們面前;\\n當他一出現,早已不見被嚇到的魔鬼龍蹤跡,此時只獨留下普娜在這隻霸王龍面前,他張開嘴靠近普娜,一口口咬了下去,但他吞下的竟然是紅果子而不是這隻小翼龍普娜。\\n”大個子,你不是吃肉的動物嗎?怎麼會吃紅色水果?”普娜在他面前喋喋不休地問著;說著說著,她下定決心跟隨這隻霸王龍,路途中遇見了與媽媽失散的三角龍男孩,他們決定陪著牠回家並結伴一起開始尋找綠洲。\\n\\xa0\\n一隻不會飛的翼龍、一隻不吃肉的霸王龍與孤獨的男孩三角龍,在前往尋找綠洲的旅程中,三人成為像是好朋友又像是家人的好夥伴。\\n\\xa0\\n一路上,他們是否可以避開魔鬼龍的追逐,安全地到達那大家口中所說的的充滿和平與富裕的綠洲呢?', 'trailer_url': 'https://movies.yahoo.com.tw/video/%E6%88%91%E7%9A%84%E6%9C%8B%E5%8F%8B%E9%9C%B8%E7%8E%8B%E9%BE%8D-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-122910372.html?movie_id=10274'}\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"rgZmPsh9zZlN","colab_type":"text"},"source":["### Exercise 3\n"]},{"cell_type":"markdown","metadata":{"id":"w3GT1nsfzhEy","colab_type":"text"},"source":["- Fetch and parse an article from the PTT joke board\n","- Practice with the following steps:\n","  - Use the GET method to read in the source of the page https://www.ptt.cc/bbs/joke/M.1571755669.A.663.html\n","  - Following the steps above, parse out the content and author of each push comment\n","  - Use a for loop to print them neatly"]}]}