├── .gitignore
├── LICENSE
├── README.md
├── Text_to_Epub.ipynb
├── Word_檔文件轉換為_Excel.ipynb
├── crawler.ipynb
├── data
    ├── GRB_105.txt
    ├── GRB_106.txt
    ├── GRB_107.txt
    ├── GRB_108.txt
    ├── GRB_109.txt
    ├── GRB_110.txt
    └── iKnow_2017-2021.txt
├── oTranscribe_txt_to_srt_格式轉換.ipynb
├── whisper_Test.ipynb
├── youtuber_逐字稿.ipynb
├── 台股_Q1~Q3_EPS_抓取.ipynb
├── 技術議題關鍵字擴展.ipynb
├── 英文單字計算.ipynb
└── 錄音檔轉文字.ipynb


/.gitignore:
--------------------------------------------------------------------------------
  1 | # Byte-compiled / optimized / DLL files
  2 | __pycache__/
  3 | *.py[cod]
  4 | *$py.class
  5 | 
  6 | # C extensions
  7 | *.so
  8 | 
  9 | # Distribution / packaging
 10 | .Python
 11 | build/
 12 | develop-eggs/
 13 | dist/
 14 | downloads/
 15 | eggs/
 16 | .eggs/
 17 | lib/
 18 | lib64/
 19 | parts/
 20 | sdist/
 21 | var/
 22 | wheels/
 23 | pip-wheel-metadata/
 24 | share/python-wheels/
 25 | *.egg-info/
 26 | .installed.cfg
 27 | *.egg
 28 | MANIFEST
 29 | 
 30 | # PyInstaller
 31 | #  Usually these files are written by a python script from a template
 32 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 33 | *.manifest
 34 | *.spec
 35 | 
 36 | # Installer logs
 37 | pip-log.txt
 38 | pip-delete-this-directory.txt
 39 | 
 40 | # Unit test / coverage reports
 41 | htmlcov/
 42 | .tox/
 43 | .nox/
 44 | .coverage
 45 | .coverage.*
 46 | .cache
 47 | nosetests.xml
 48 | coverage.xml
 49 | *.cover
 50 | *.py,cover
 51 | .hypothesis/
 52 | .pytest_cache/
 53 | 
 54 | # Translations
 55 | *.mo
 56 | *.pot
 57 | 
 58 | # Django stuff:
 59 | *.log
 60 | local_settings.py
 61 | db.sqlite3
 62 | db.sqlite3-journal
 63 | 
 64 | # Flask stuff:
 65 | instance/
 66 | .webassets-cache
 67 | 
 68 | # Scrapy stuff:
 69 | .scrapy
 70 | 
 71 | # Sphinx documentation
 72 | docs/_build/
 73 | 
 74 | # PyBuilder
 75 | target/
 76 | 
 77 | # Jupyter Notebook
 78 | .ipynb_checkpoints
 79 | 
 80 | # IPython
 81 | profile_default/
 82 | ipython_config.py
 83 | 
 84 | # pyenv
 85 | .python-version
 86 | 
 87 | # pipenv
 88 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 89 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 90 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 91 | #   install all needed dependencies.
 92 | #Pipfile.lock
 93 | 
 94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
 95 | __pypackages__/
 96 | 
 97 | # Celery stuff
 98 | celerybeat-schedule
 99 | celerybeat.pid
100 | 
101 | # SageMath parsed files
102 | *.sage.py
103 | 
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 | 
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 | 
117 | # Rope project settings
118 | .ropeproject
119 | 
120 | # mkdocs documentation
121 | /site
122 | 
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 | 
128 | # Pyre type checker
129 | .pyre/
130 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2021 Reic Wang
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # colab_python
 2 | A place for small tools developed on the Colab platform.   by 瑞課
 3 | 
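 4 | # Speech-to-text tool
 5 | [錄音檔轉文字.ipynb](https://github.com/reic/colab_python/blob/main/%E9%8C%84%E9%9F%B3%E6%AA%94%E8%BD%89%E6%96%87%E5%AD%97.ipynb) is written in Python and can be used to transcribe interview recordings, generate video subtitles, and handle other related tasks.
 6 | 
 7 | It does the conversion on the Google Colab platform with Google's speech-to-text tool, so a Google account is all you need to get an environment that can run it. With a few simple settings, users who cannot program can complete the work as well.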
 8 | 
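 9 | == Changelog ==
10 | 
11 | - 2021/5/3: Added multithreading to shorten the conversion time.
12 | - 2021/5/9: Fixed a problem where certain file names prevented the OTR file from being generated. Thanks to 「彩虹小馬」 for the feedback.
13 | - 2021/5/12: Added a variable for choosing the language to convert, and included a language-code reference table in the notebook. Thanks to chin ho Lau for the feedback.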
14 | 
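15 | # Keyword expansion tool
16 | [技術議題關鍵字擴展.ipynb](https://github.com/reic/colab_python/blob/main/%E6%8A%80%E8%A1%93%E8%AD%B0%E9%A1%8C%E9%97%9C%E9%8D%B5%E5%AD%97%E6%93%B4%E5%B1%95.ipynb) is written in Python and uses government open data as its dataset: the public records of the Government Research Bulletin (grb.gov.tw) for ROC years 105-109 (2016-2020).
17 | 
18 | For a topic you are not familiar with, this tool shows the technologies related to a **specific keyword**, which speeds up getting to know a particular technical field.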
19 | 
20 | # Text file to EPUB
21 | [Text_to_Epub.ipynb](https://github.com/reic/colab_python/blob/main/Text_to_Epub.ipynb) is written in Python. It processes a text file into multiple md files, then uses the pandoc package on Colab to convert the md files to EPUB.
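
The conversion described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the notebook's exact code: `split_chapters` and `pandoc_command` are hypothetical helper names, the heading pattern only recognises lines that begin with 第 and contain 卷, 章 or 節, and the pandoc command is only built, not executed.

```python
import re

# Assumption: a heading is any line that starts with 第 and contains 卷, 章 or 節.
HEADING = re.compile(r"^(第.*?[卷章節].*)$", flags=re.MULTILINE)


def split_chapters(text):
    """Return one Markdown string per chapter found in `text`."""
    # Drop full-width spaces and normalise runs of newlines to blank lines.
    text = re.sub(r"\n+", "\n\n", text.replace("\u3000", ""))
    # Turn every heading line into a level-1 Markdown heading.
    marked = HEADING.sub(r"# \1", text)
    # Split in front of each heading line; text before the first heading
    # (for example a preface) becomes its own part.
    parts = re.split(r"(?m)^(?=# )", marked)
    return [part.strip() + "\n" for part in parts if part.strip()]


def pandoc_command(title, md_files):
    """Build the pandoc argument list; title.txt holds the YAML metadata."""
    return ["pandoc", "-o", f"{title}.epub", "title.txt", *md_files]


if __name__ == "__main__":
    sample = "前言\n第一章 開始\n內容一\n第二章 繼續\n內容二"
    for number, chapter in enumerate(split_chapters(sample)):
        print(f"{number:04d}.md", repr(chapter))
    print(pandoc_command("book", ["0000.md", "0001.md", "0002.md"]))
```

A body line that happens to start with 第 and contain one of the three characters would also be treated as a heading, so real input may need a stricter pattern.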
22 | 
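23 | This program is the first to introduce a Python module check. A function named `chkmodule` checks, before a module is imported, whether it is already installed on the system, and installs it first if it is not. It uses `os.popen` to obtain the list of installed Python packages and checks whether the module to import is in that list; if it is not, the module is installed through `os.system`.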
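
The same check-then-install idea can also be written without parsing the output of `pip3 list`. The sketch below is an alternative, not the notebook's code: `ensure_module` is a hypothetical name, and it assumes that pip is available for the running interpreter and that the module name passed in is a top-level name.

```python
import importlib
import importlib.util
import subprocess
import sys


def ensure_module(import_name, pip_name=None):
    """Return the imported module, installing it with pip first if missing.

    `pip_name` covers packages whose install name differs from the import
    name (for example, bs4 is installed as beautifulsoup4).
    """
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(import_name) is None:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or import_name]
        )
        # Let the import system notice the freshly installed package.
        importlib.invalidate_caches()
    return importlib.import_module(import_name)


if __name__ == "__main__":
    json_module = ensure_module("json")  # standard library, nothing is installed
    print(json_module.dumps({"ok": True}))
```

Calling pip through `sys.executable -m pip` keeps the installation tied to the interpreter that runs the notebook, which a bare `pip3` command does not guarantee.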
24 | 


--------------------------------------------------------------------------------
/Text_to_Epub.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "name": "Text_to_Epub.ipynb",
  7 |       "provenance": [],
  8 |       "collapsed_sections": [],
  9 |       "authorship_tag": "ABX9TyO6Ti7BlNS7lcoQdFSewsG/",
 10 |       "include_colab_link": true
 11 |     },
 12 |     "kernelspec": {
 13 |       "name": "python3",
 14 |       "display_name": "Python 3"
 15 |     },
 16 |     "language_info": {
 17 |       "name": "python"
 18 |     }
 19 |   },
 20 |   "cells": [
 21 |     {
 22 |       "cell_type": "markdown",
 23 |       "metadata": {
 24 |         "id": "view-in-github",
 25 |         "colab_type": "text"
 26 |       },
 27 |       "source": [
 28 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/Text_to_Epub.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 29 |       ]
 30 |     },
 31 |     {
 32 |       "cell_type": "code",
 33 |       "source": [
 34 |         "#@title Text 轉 epub 程式\n",
 35 |         "import re, os\n",
 36 |         "\n",
 37 |         "# Check that a package is installed; install it with pip3 if it is missing\n",
 38 |         "def chkmodule(libset):\n",
 39 |         "  itm=os.popen(\"pip3 list\").read().splitlines()\n",
 40 |         "  itm=[ data[:data.find(\" \")].lower() for data in itm if len(data[:data.find(\" \")])>0 ]\n",
 41 |         "  # set(itm)\n",
 42 |         "  if libset.lower() in set(itm):\n",
 43 |         "    return\n",
 44 |         "  os.system(f\"pip3 install {libset}\")\n",
 45 |         "\n",
 46 |         "chkmodule(\"inlp\")\n",
 47 |         "from inlp.convert import chinese\n",
 48 |         "\n",
 49 |         "# 基礎設定資料\n",
 50 |         "filename=\"/content/drive/MyDrive/tmp/\\u4FEE\\u771F\\u804A\\u5929\\u7FA4.txt\" #@param {type:\"string\"}\n",
 51 |         "title=\"修真聊天群\" #@param {type:\"string\"}\n",
 52 |         "author=\"聖騎士的傳說\" #@param {type:\"string\"}\n",
 53 |         "#@markdown Tick to convert Simplified Chinese to Traditional Chinese\n",
 54 |         "chinese_S2T = True #@param {type:\"boolean\"}\n",
 55 |         "\n",
 56 |         "# Title metadata (YAML block passed to pandoc)\n",
 57 |         "YAML=f'''---\n",
 58 |         "title: {title}\n",
 59 |         "author: {author}\n",
 60 |         "language: zh-Hant\n",
 61 |         "---'''\n",
 62 |         "# page_break=\"<div style='page-break-after:always; visibility:hidden'></div>\"\n",
 63 |         "\n",
 64 |         "\n",
 65 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
 66 |         "  f.write(YAML)\n",
 67 |         "\n",
 68 |         "with open(filename,mode=\"r\",encoding='utf-8') as f:\n",
 69 |         "  content=f.read()\n",
 70 |         "text=re.sub(r\"\\n+\",\"\\n\\n\",re.sub(r\"[\\u3000]+\",\"\",content))\n",
 71 |         "\n",
 72 |         "print(\"{:=^50s}\".format(\" Markdown 處理 \"))\n",
 73 |         "# 章/卷處理\n",
 74 |         "patterns=[[r\"\\n(第.*卷\\s?)\",\"#\"],[r\"\\n(第.*章\\s?)\",\"#\"],[r\"\\n(第.*節\\s?)\",\"#\"]]\n",
 75 |         "for pns in patterns:  \n",
 76 |         "  text=re.sub(pns[0],r\"\\n{} \\1\".format(pns[1]),text)\n",
 77 |         "print(\"{}\".format(\"完成...\"))\n",
 78 |         "# 建立 md 檔的函數\n",
 79 |         "\n",
 80 |         "def writemd(title,arrtomd):\n",
 81 |         "  with open(\"{:04d}.md\".format(title),mode=\"w\",encoding='utf-8') as f:\n",
 82 |         "    f.write(arrtomd)\n",
 83 |         "\n",
 84 |         "## 文字處理\n",
 85 |         "textarry=text.split(\"# \")\n",
 86 |         "counter=0 # counter　為 md 檔名\n",
 87 |         "mdfiles=[] # 記錄 md 檔名\n",
 88 |         "# 若為簡體文件，就需要用 註解的那一個\n",
 89 |         "print(\"{:=^50s}\".format(\" md 檔分割 \"))\n",
 90 |         "\n",
 91 |         "for itm in textarry:\n",
 92 |         "  if chinese_S2T:\n",
 93 |         "    itm=chinese.s2t(itm)\n",
 94 |         "  # writemd(counter,\"# {}\".format(chinese.s2t(itm)))\n",
 95 |         "  writemd(counter,\"# {}\".format(itm))\n",
 96 |         "  mdfiles.append(\"{:04d}.md\".format(counter))\n",
 97 |         "  counter+=1\n",
 98 |         "\n",
 99 |         "print(\"{}\".format(\"完成...\"))\n",
100 |         "# counter　為 md 檔名\n",
101 |         "# mdfiles=[\"{:04d}.md\".format(itm) for itm in range(counter)]\n",
102 |         "\n",
103 |         "print(\"{:=^50s}\".format(\" 產生 epub 檔 \"))\n",
104 |         "# 透過 pandoc 生成 epub \n",
105 |         "os.system(\"pandoc -o \\\"{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
106 |         "print(\"{}\".format(\"完成...\"))\n",
107 |         "\n",
108 |         "# 下載\n",
109 |         "print(\"{:=^50s}\".format(\" 下載 epub 檔 \"))\n",
110 |         "from google.colab import files\n",
111 |         "files.download('{}.epub'.format(title))"
112 |       ],
113 |       "metadata": {
114 |         "id": "j7I7yNlusJCZ"
115 |       },
116 |       "execution_count": null,
117 |       "outputs": []
118 |     }
119 |   ]
120 | }


--------------------------------------------------------------------------------
/Word_檔文件轉換為_Excel.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "name": "Word 檔文件轉換為 Excel",
  7 |       "provenance": [],
  8 |       "collapsed_sections": [],
  9 |       "toc_visible": true,
 10 |       "mount_file_id": "1pM5vkIjp5tu3tlTGE4UepW08_Ldsoag8",
 11 |       "authorship_tag": "ABX9TyPF9xlQR5JKqopdiPQB1DFT",
 12 |       "include_colab_link": true
 13 |     },
 14 |     "kernelspec": {
 15 |       "name": "python3",
 16 |       "display_name": "Python 3"
 17 |     },
 18 |     "language_info": {
 19 |       "name": "python"
 20 |     }
 21 |   },
 22 |   "cells": [
 23 |     {
 24 |       "cell_type": "markdown",
 25 |       "metadata": {
 26 |         "id": "view-in-github",
 27 |         "colab_type": "text"
 28 |       },
 29 |       "source": [
 30 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/Word_%E6%AA%94%E6%96%87%E4%BB%B6%E8%BD%89%E6%8F%9B%E7%82%BA_Excel.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 31 |       ]
 32 |     },
 33 |     {
 34 |       "cell_type": "markdown",
 35 |       "metadata": {
 36 |         "id": "uo5TrNiAsub-"
 37 |       },
 38 |       "source": [
 39 |         "# **Convert formatted Word content to Excel**\n",
 40 |         "The Word document must follow a fixed layout before its data can be converted to Excel.\n",
 41 |         "\n",
 42 |         "See the Word authoring example at the end for details."
 43 |       ]
 44 |     },
 45 |     {
 46 |       "cell_type": "code",
 47 |       "metadata": {
 48 |         "id": "vWDnAa8DVqKv",
 49 |         "cellView": "form"
 50 |       },
 51 |       "source": [
 52 |         "#@title 安裝必要套件\n",
 53 |         "!pip install docx2txt "
 54 |       ],
 55 |       "execution_count": null,
 56 |       "outputs": []
 57 |     },
 58 |     {
 59 |       "cell_type": "code",
 60 |       "metadata": {
 61 |         "id": "Xcya3e-oYvoS",
 62 |         "cellView": "form"
 63 |       },
 64 |       "source": [
 65 |         "import gspread\n",
 66 |         "import pandas as pd\n",
 67 |         "import datetime,os,docx2txt\n",
 68 |         "\n",
 69 |         "\n",
 70 |         "#@title Write Data into Gsheet \n",
 71 |         "\n",
 72 |         "#@markdown Path of the input Word (.docx) file\n",
 73 |         "wordFilename = \"/content/drive/MyDrive/mytest.docx\" #@param {type:\"string\"}\n",
 74 |         "\n",
 75 |         "#@markdown Output Excel file name; the file is saved in the same directory as the docx\n",
 76 |         "excel__name = \"mytest.xlsx\" #@param {type:\"string\"}\n",
 77 |         "\n",
 78 |         "\n",
 79 |         "#@markdown 欄位分隔符號\n",
 80 |         "delimiter=\"\\uFF1A\" #@param {type:\"string\"}\n",
 81 |         "\n",
 82 |         "#@markdown 欄位名稱，請用 分隔符號 分隔\n",
 83 |         "columNames = \"\\u6A19\\u984C\\uFF1A\\u5167\\u5BB9\\uFF1A\\u51FA\\u8655\\uFF1A\\u6587\\u7AE0\\u65E5\\u671F\" #@param {type:\"string\"}\n",
 84 |         "\n",
 85 |         "#@markdown 在 word 檔案，需要採用同樣的分隔符號，區分欄位和內容, 例如：\n",
 86 |         "\n",
 87 |         "#@markdown 標題：蘋果新春發表有...\n",
 88 |         "\n",
 89 |         "#@markdown 內容：蘋果於台灣時間今（21）日凌晨舉辦今年首場新品發表會，此次一共發表了五款硬體新品，包括搭配Mini LED螢幕的新一代iPad Pro平板、七種繽紛顏色的新一代 iMac、AirTag藍牙防丟器、搭載A12仿生晶片的新一代Apple TV 4K機上盒及iPhone 12紫色新款。\n",
 90 |         "\n",
 91 |         "#@markdown 出處：yahoo!3C科技\n",
 92 |         "\n",
 93 |         "#@markdown 文章日期：2021/4/21\n",
 94 |         "delimiter=delimiter.strip()\n",
 95 |         "wordPos=wordFilename.rfind(\"/\")\n",
 96 |         "dirPath=wordFilename[:wordPos]\n",
 97 |         "columList=[itm.strip() for itm in columNames.split(delimiter)]\n",
 98 |         "# docx2txt.process(wordFilename)\n",
 99 |         "content=[itm.strip() for itm in docx2txt.process(wordFilename).split(\"\\n\") if len(itm)>1]\n",
100 |         "topd={}\n",
101 |         "for itm in columList:\n",
102 |         "  topd[itm]=list()\n",
103 |         "\n",
104 |         "for itm in content:\n",
105 |         "  posistion=itm.find(delimiter)\n",
106 |         "  topd[itm[:posistion]].append(itm[posistion+len(delimiter):].strip())\n",
107 |         "\n",
108 |         "df=pd.DataFrame(topd,columns=columList)\n",
109 |         "\n",
110 |         "df.to_excel(f\"{dirPath}/{excel__name}\",index=False)\n",
111 |         "print(\" 資料轉換結束 \".center(100,\"=\"))\n",
112 |         "print()\n",
113 |         "print(f\"Excel 檔名 {excel__name} ，\\n\\n已儲存於 {wordFilename[wordPos+1:]} 相同目錄\\n \")\n",
114 |         "print(\"\".center(106,\"=\"))"
115 |       ],
116 |       "execution_count": null,
117 |       "outputs": []
118 |     },
119 |     {
120 |       "cell_type": "markdown",
121 |       "metadata": {
122 |         "id": "Q4_IWYW7rlah"
123 |       },
124 |       "source": [
125 |         "# **Word authoring example**\n",
126 |         "\n",
127 |         "## Field names: 標題：內容：出處：文章日期\n",
128 |         "\n",
129 |         "## Delimiter: ： (full-width colon)\n",
130 |         "\n",
131 |         "標題：蘋果新春發表有...\n",
132 |         "\n",
133 |         "內容：蘋果於台灣時間今（21）日凌晨舉辦今年首場新品發表會，此次一共發表了五款硬體新品，包括搭配Mini LED螢幕的新一代iPad Pro平板、七種繽紛顏色的新一代 iMac、AirTag藍牙防丟器、搭載A12仿生晶片的新一代Apple TV 4K機上盒及iPhone 12紫色新款。\n",
136 |         "\n",
137 |         "出處：yahoo!3C科技\n",
138 |         "\n",
139 |         "文章日期：2021/4/21\n",
140 |         "\n",
141 |         "......\n",
142 |         "\n",
143 |         "......\n",
144 |         "\n",
145 |         "標題：韓媒：三星多款新機 提前在8月上市\n",
146 |         "\n",
147 |         "內容：消息人士透露，全球最大手機製造商三星電子正與南韓電信商洽談，規劃在8月推出平價款旗艦機Galaxy S21 FE和折疊系列Galaxy Z Fold3、Galaxy Z Flip3，以擴大手機市場的市占率。韓聯社報導，若三星證實這項消息，意味這次新機上市時機比往年早。Galaxy S21 FE的前身Galaxy S20 FE在去年10月上市，三星折疊手機Galaxy Z Fold2和Galaxy Z Flip則於9月上市。\n",
148 |         "\n",
149 |         "出處：聯合新聞網\n",
150 |         "\n",
151 |         "文章日期：2021/5/9\n"
152 |       ]
153 |     }
154 |   ]
155 | }


--------------------------------------------------------------------------------
/crawler.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |   "cells": [
   3 |     {
   4 |       "cell_type": "markdown",
   5 |       "metadata": {
   6 |         "id": "view-in-github",
   7 |         "colab_type": "text"
   8 |       },
   9 |       "source": [
  10 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/crawler.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
  11 |       ]
  12 |     },
  13 |     {
  14 |       "cell_type": "markdown",
  15 |       "metadata": {
  16 |         "id": "OjvmI-ML-U4Y"
  17 |       },
  18 |       "source": [
  19 |         "# Web crawler and multithreading practice\n",
  20 |         "\n"
  21 |       ]
  22 |     },
  23 |     {
  24 |       "cell_type": "code",
  25 |       "execution_count": null,
  26 |       "metadata": {
  27 |         "cellView": "form",
  28 |         "collapsed": true,
  29 |         "id": "wGkkH1MmUvuV"
  30 |       },
  31 |       "outputs": [],
  32 |       "source": [
  33 |         "#@title 必要元件安裝\n",
  34 |         "# def check_package(package_name,is_system_package=False):\n",
  35 |         "#   import importlib, subprocess, shutil\n",
  36 |         "#   if is_system_package:\n",
  37 |         "#     if(shutil.which(package_name)):\n",
  38 |         "#       print(f\"{package_name} 套件已經安裝\")\n",
  39 |         "#     else:\n",
  40 |         "#       print(f\"{package_name} 套件尚未安裝，正在安裝中...\")\n",
  41 |         "#       subprocess.check_call([\"apt-get\", \"install\", \"-y\", package_name])\n",
  42 |         "#   else:\n",
  43 |         "#     try:\n",
  44 |         "#       importlib.import_module(package_name)\n",
  45 |         "#       print(f\"{package_name} 套件已經安裝\")\n",
  46 |         "#     except ImportError:\n",
  47 |         "#       print(f\"{package_name} 套件尚未安裝，正在安裝中...\")\n",
  48 |         "#       subprocess.check_call([\"pip\", \"install\", package_name])\n",
  49 |         "\n",
  50 |         "# # 檢查 pandoc 是否已安裝，若無則安裝\n",
  51 |         "# check_package(\"pandoc\",is_system_package=True)\n",
  52 |         "# check_package(\"inlp\")\n",
  53 |         "\n",
  54 |         "!apt-get install -y pandoc\n",
  55 |         "!pip install inlp"
  56 |       ]
  57 |     },
  58 |     {
  59 |       "cell_type": "markdown",
  60 |       "source": [
  61 |         "# Supported websites\n",
  62 |         "\n",
  63 |         "* 小書包小說網：https://www.xiaoshubao.net/read/488175/1.html\n",
  64 |         "* 冬日小說網：https://www.drxsw.com/\n",
  65 |         "* UU看書：https://www.uuread.tw/\n"
  66 |       ],
  67 |       "metadata": {
  68 |         "id": "uPeuRtTlEhXp"
  69 |       }
  70 |     },
  71 |     {
  72 |       "cell_type": "code",
  73 |       "execution_count": null,
  74 |       "metadata": {
  75 |         "id": "hKEPfJ6aEPIj",
  76 |         "cellView": "form"
  77 |       },
  78 |       "outputs": [],
  79 |       "source": [
  80 |         "import requests\n",
  81 |         "import os,re\n",
  82 |         "from bs4 import BeautifulSoup\n",
  83 |         "import concurrent.futures\n",
  84 |         "from inlp.convert import chinese\n",
  85 |         "\n",
  86 |         "try:\n",
  87 |         "  os.mkdir(\"/content/tmp\")\n",
  88 |         "except FileExistsError:\n",
  89 |         "  print(\"Directory already exists\")\n",
  90 |         "os.chdir(\"/content/tmp\")\n",
  91 |         "os.system(\"rm -fr *\")\n",
  92 |         "\n",
  93 |         "\n",
  94 |         "\n",
  95 |         "\n",
  96 |         "# @title 抓取書籍目錄名稱、網址\n",
  97 |         "url=\"https://www.xiaoshubao.net/read/503289/\" #@param {type:'string'}\n",
  98 |         "title=\"我在妖魔世界拾取技能碎片\" #@param {type:\"string\"}\n",
  99 |         "author=\"第九天命\" #@param {type:\"string\"}\n",
 100 |         "cssselector=\"#list > dl > dd:nth-child(n+3) > a\" #@param {type:\"string\"}\n",
 101 |         "#@markdown 打勾，將會直接變成 epub\n",
 102 |         "file2epub= True #@param {type:\"boolean\"}\n",
 103 |         "debugmode = True #@param {type:\"boolean\"}\n",
 104 |         "\n",
 105 |         "encode = \"utf-8\" # @param {\"type\":\"string\",\"placeholder\":\"utf-8\"}\n",
 106 |         "\n",
 107 |         "# 標題設定義\n",
 108 |         "YAML=f'''---\n",
 109 |         "title: {title}\n",
 110 |         "author: {author}\n",
 111 |         "language: zh-Hant\n",
 112 |         "---'''\n",
 113 |         "\n",
 114 |         "with open(\"title.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
 115 |         "  f.write(YAML)\n",
 116 |         "\n",
 117 |         "sites=url[:url.find(\"/\",8)]\n",
 118 |         "#sites=url[:url.rfind(\"/\")]\n",
 119 |         "\n",
 120 |         "reg=requests.get(url)\n",
 121 |         "reg.encoding=encode\n",
 122 |         "# Name the parser explicitly so bs4 does not have to guess (assumes lxml is installed, as on Colab)\n",
 123 |         "soup=BeautifulSoup(reg.text,\"lxml\")\n",
 124 |         "articles = soup.select(cssselector)\n",
 125 |         "# articles=soup.find(name=\"ul\",id=\"chapterList\").find_all(\"a\")\n",
 126 |         "\n",
 127 |         "links=[]\n",
 128 |         "# len(articles)\n",
 129 |         "for i in articles:\n",
 130 |         "  href=i.get(\"href\")\n",
 131 |         "  links.append([i.getText(),f\"{sites}{href}\"])\n",
 132 |         "\n",
 133 |         "files_text=[link[1][link[1].rfind(\"/\")+1:]+\".txt\" for link in links]\n",
 134 |         "if debugmode:\n",
 135 |         "  print(links[:4])\n",
 136 |         "  print(files_text[:4])\n"
 137 |       ]
 138 |     },
 139 |     {
 140 |       "cell_type": "code",
 141 |       "execution_count": null,
 142 |       "metadata": {
 143 |         "id": "nRS1C2TfF7QP",
 144 |         "cellView": "form"
 145 |       },
 146 |       "outputs": [],
 147 |       "source": [
 148 |         "# @title 抓取章節內容 update:2024/12/31\n",
 149 |         "content_selector=\"#content\" #@param {type:\"string\"}\n",
 150 |         "#@markdown Tick to fetch only the first chapter and print its text for debugging\n",
 151 |         "debug_content = False #@param {type:\"boolean\"}\n",
 152 |         "no_next_chapter = False # @param {\"type\":\"boolean\",\"placeholder\":\"True\"}\n",
 153 |         "next_page_selector = \"#content > p > a\" # @param {\"type\":\"string\"}\n",
 154 |         "find_All_p = False # @param {\"type\":\"boolean\",\"placeholder\":\"false\"}\n",
 155 |         "\n",
 156 |         "import lxml.html, time\n",
 157 |         "import re,os\n",
 158 |         "import json,ast\n",
 159 |         "\n",
 160 |         "\n",
 161 |         "\n",
 162 |         "\n",
 163 |         "def get_html(urls,pages=False):\n",
 164 |         "  time.sleep(2)\n",
 165 |         "  [title,art_url]=urls\n",
 166 |         "  art_id=art_url[art_url.rfind(\"/\")+1:]\n",
 167 |         "  if pages:\n",
 168 |         "    art_id=art_id[:art_id.rfind(\"_\")]+\".html\"\n",
 169 |         "  reg=requests.get(art_url)\n",
 170 |         "  reg.encoding=encode\n",
 171 |         "  soup=BeautifulSoup(reg.text,\"lxml\")\n",
 172 |         "  # content=soup.find(name=\"div\",class_='content')\n",
 173 |         "  content=soup.select(content_selector)\n",
 174 |         "\n",
 175 |         "  if find_All_p:\n",
 176 |         "    content=content[0].find_all(\"p\")\n",
 177 |         "  content=\"<br/>\".join([p.get_text() for p in content])\n",
 178 |         "\n",
 179 |         "  print(art_id)\n",
 180 |         "  textcon = f\"# {title}\\n\\n\" if not pages else \"\"\n",
 181 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
 182 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
 183 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
 184 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
 185 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
 186 |         "  context=[itm.strip() for itm in context if len(itm)>0]\n",
 187 |         "  textcon+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
 188 |         "  if debug_content:\n",
 189 |         "    print(chinese.s2t(textcon))\n",
 190 |         "\n",
 191 |         "  filemode=\"w\" if not pages else \"a\"\n",
 192 |         "  with open(f\"{art_id}.txt\",mode=filemode,encoding=\"utf-8\") as f:\n",
 193 |         "    f.write(chinese.s2t(textcon))\n",
 194 |         "  if no_next_chapter:\n",
 195 |         "    return\n",
 196 |         "  next_page = soup.select_one(next_page_selector)\n",
 197 |         "  if next_page and '_' in next_page['href']:\n",
 198 |         "    art_url = next_page['href'] if next_page['href'].startswith('https') else f\"{sites}{next_page['href']}\"\n",
 199 |         "    get_html([title,art_url],1)\n",
 200 |         "\n",
 201 |         "if debug_content:\n",
 202 |         "  get_html(links[0])\n",
 203 |         "else:\n",
 204 |         "  alreadydown=len(os.listdir())\n",
 205 |         "  startid=0\n",
 206 |         "  if alreadydown>1:\n",
 207 |         "    startid=alreadydown-2\n",
 208 |         "  for link in links[startid:]:\n",
 209 |         "    get_html(link)\n",
 210 |         "\n",
 211 |         "  output_name=title\n",
 212 |         "  # files_text=os.listdir()\n",
 213 |         "  # files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
 214 |         "  # Sorting must cope with file names of different lengths; for now they are sorted numerically\n",
 215 |         "  # files_text.sort(key=lambda x:int(x[:-4]))\n",
 216 |         "  if file2epub:\n",
 217 |         "    os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(files_text)))\n",
 218 |         "    from google.colab import files\n",
 219 |         "    files.download('../{}.epub'.format(title))\n",
 220 |         "    pass\n",
 221 |         "  else:\n",
 222 |         "    with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
 223 |         "      for file in files_text:\n",
 224 |         "        with open(file,\"r\",encoding='utf-8') as f2:\n",
 225 |         "          f.write(f2.read())\n",
 226 |         "    from google.colab import files\n",
 227 |         "    files.download('../{}.txt'.format(output_name))"
 228 |       ]
 229 |     },
 230 |     {
 231 |       "cell_type": "code",
 232 |       "execution_count": null,
 233 |       "metadata": {
 234 |         "cellView": "form",
 235 |         "id": "vLrNyHB0M7LL"
 236 |       },
 237 |       "outputs": [],
 238 |       "source": [
 239 |         "#@title 小說狂人\n",
 240 |         "\n",
 241 |         "def check_package(package_name,is_system_package=False):\n",
 242 |         "  import importlib, subprocess, shutil\n",
 243 |         "  if is_system_package:\n",
 244 |         "    if(shutil.which(package_name)):\n",
 245 |         "      print(f\"{package_name} 套件已經安裝\")\n",
 246 |         "    else:\n",
 247 |         "      print(f\"{package_name} 套件尚未安裝，正在安裝中...\")\n",
 248 |         "      subprocess.check_call([\"apt-get\", \"install\", \"-y\", package_name])\n",
 249 |         "  else:\n",
 250 |         "    try:\n",
 251 |         "      importlib.import_module(package_name)\n",
 252 |         "      print(f\"{package_name} 套件已經安裝\")\n",
 253 |         "    except ImportError:\n",
 254 |         "      print(f\"{package_name} 套件尚未安裝，正在安裝中...\")\n",
 255 |         "      subprocess.check_call([\"pip\", \"install\", package_name])\n",
 256 |         "\n",
 257 |         "# 檢查 pandoc 是否已安裝，若無則安裝\n",
 258 |         "check_package(\"pandoc\",is_system_package=True)\n",
 259 |         "check_package(\"inlp\")\n",
 260 |         "\n",
 261 |         "\n",
 262 |         "import requests\n",
 263 |         "import os,re\n",
 264 |         "from bs4 import BeautifulSoup\n",
 265 |         "import concurrent.futures\n",
 266 |         "from inlp.convert import chinese\n",
 267 |         "\n",
 268 |         "try:\n",
 269 |         "  os.mkdir(\"/content/tmp\")\n",
 270 |         "except FileExistsError:\n",
 271 |         "  print(\"Directory already exists\")\n",
 272 |         "os.chdir(\"/content/tmp\")\n",
 273 |         "os.system(\"rm -fr *\")\n",
 274 |         "\n",
 275 |         "\n",
 276 |         "\n",
 277 |         "def get_html(urls):\n",
 278 |         "  [title,art_url]=urls\n",
 279 |         "  art_id=art_url[art_url.rfind(\"/\")+1:]\n",
 280 |         "  reg=requests.get(art_url)\n",
 281 |         "  reg.encoding=\"utf-8\"\n",
 282 |         "  soup=BeautifulSoup(reg.text)\n",
 283 |         "  content=soup.find(name=\"div\",class_='content')\n",
 284 |         "\n",
 285 |         "  print(art_id)\n",
 286 |         "  text=f\"# {title}\\n\\n\"\n",
 287 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
 288 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
 289 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
 290 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
 291 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
 292 |         "  context=[itm.strip() for itm in context if len(itm)>0]\n",
 293 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
 294 |         "\n",
 295 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
 296 |         "    f.write(text)\n",
 297 |         "\n",
 298 |         "\n",
 299 |         "#@markdown 書籍目錄網址\n",
 300 |         "url=\"https://czbooks.net/n/s6g1n14e\" #@param {type:'string'}\n",
 301 |         "title=\"我在原始部落當酋長\" #@param {type:\"string\"}\n",
 302 |         "author=\"西原公子\" #@param {type:\"string\"}\n",
 303 |         "#@markdown 打勾，將會直接變成 epub\n",
 304 |         "file2epub = True #@param {type:\"boolean\"}\n",
 305 |         "\n",
 306 |         "# 標題設定義\n",
 307 |         "YAML=f'''---\n",
 308 |         "title: {title}\n",
 309 |         "author: {author}\n",
 310 |         "language: zh-Hant\n",
 311 |         "---'''\n",
 312 |         "\n",
 313 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
 314 |         "  f.write(YAML)\n",
 315 |         "\n",
 316 |         "sites=url[:url.find(\"/\",8)]\n",
 317 |         "# sites=url[:url.rfind(\"/\")]\n",
 318 |         "\n",
 319 |         "reg=requests.get(url)\n",
 320 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
 321 |         "reg.encoding=\"utf-8\"\n",
 322 |         "soup=BeautifulSoup(reg.text)\n",
 323 |         "articles = soup.select('#chapter-list > li a')\n",
 324 |         "# articles=soup.find(name=\"ul\",id=\"chapterList\").find_all(\"a\")\n",
 325 |         "\n",
 326 |         "links=[]\n",
 327 |         "# len(articles)\n",
 328 |         "for i in articles:\n",
 329 |         "  href=i.get(\"href\")\n",
 330 |         "\n",
 331 |         "  links.append([i.text,f\"https:{href}\"])\n",
 332 |         "\n",
 333 |         "files_text=[link[1][link[1].rfind(\"/\")+1:]+\".txt\" for link in links]\n",
 334 |         "\n",
 335 |         "\n",
 336 |         "\n",
 337 |         "# # # Temporarily disabled: Colab currently only allows 2 threads, and they can only run for 60 seconds\n",
 338 |         "# # # # Create and start 5 worker threads at once\n",
 339 |         "# # # with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
 340 |         "# # #     executor.map(get_html, links)\n",
 341 |         "for link in links:\n",
 342 |         "  get_html(link)\n",
 343 |         "\n",
 344 |         "output_name=title\n",
 345 |         "# files_text=os.listdir()\n",
 346 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
 347 |         "# Sort the files; file names vary in length, so for now they are sorted by numeric value\n",
 348 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
 349 |         "if file2epub:\n",
 350 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(files_text)))\n",
 351 |         "  from google.colab import files\n",
 352 |         "  files.download('../{}.epub'.format(title))\n",
 353 |         "  pass\n",
 354 |         "else:\n",
 355 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
 356 |         "    for file in files_text:\n",
 357 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
 358 |         "        f.write(f2.read())\n",
 359 |         "  from google.colab import files\n",
 360 |         "  files.download('../{}.txt'.format(output_name))\n",
 361 |         "\n"
 362 |       ]
 363 |     },
 364 |     {
 365 |       "cell_type": "code",
 366 |       "execution_count": null,
 367 |       "metadata": {
 368 |         "cellView": "form",
 369 |         "id": "yONyl870ergz"
 370 |       },
 371 |       "outputs": [],
 372 |       "source": [
 373 |         "#@title UU看書網\n",
 374 |         "#@markdown Work in progress; this cell can be run directly on its own\n",
 375 |         "\n",
 376 |         "def check_package(itm):\n",
 377 |         "  import importlib\n",
 378 |         "  try:\n",
 379 |         "    importlib.import_module(itm)\n",
 380 |         "    print(f\"{itm} is already installed\")\n",
 381 |         "  except ImportError:\n",
 382 |         "    print(f\"{itm} is not installed yet; installing...\")\n",
 383 |         "    !pip install {itm}\n",
 384 |         "\n",
 385 |         "\n",
 386 |         "check_package(\"inlp\")\n",
 387 |         "\n",
 388 |         "import requests\n",
 389 |         "import os,re\n",
 390 |         "from bs4 import BeautifulSoup\n",
 391 |         "import concurrent.futures\n",
 392 |         "from inlp.convert import chinese\n",
 393 |         "\n",
 394 |         "try:\n",
 395 |         "  os.mkdir(\"/content/tmp\")\n",
 396 |         "except FileExistsError:\n",
 397 |         "  print(\"Directory already exists\")\n",
 398 |         "os.chdir(\"/content/tmp\")\n",
 399 |         "os.system(\"rm -fr *\")\n",
 400 |         "\n",
 401 |         "\n",
 402 |         "def get_html(urls):\n",
 403 |         "  [title,art_url]=urls\n",
 404 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
 405 |         "  reg=requests.get(art_url)\n",
 406 |         "  reg.encoding=\"utf-8\"\n",
 407 |         "  soup=BeautifulSoup(reg.text)\n",
 408 |         "  content=soup.find(name=\"div\",id='contentbox')\n",
 409 |         "  print(art_id)\n",
 410 |         "  text=f\"# {title}\\n\\n\"\n",
 411 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
 412 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
 413 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
 414 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
 415 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
 416 |         "  context=[itm.strip() for itm in context if len(itm)>0]\n",
 417 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
 418 |         "\n",
 419 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
 420 |         "    f.write(text)\n",
 421 |         "\n",
 422 |         "\n",
 423 |         "#@markdown Book table-of-contents URL\n",
 424 |         "url=\"https://tw.uukanshu.com/b/239755/\" #@param {type:'string'}\n",
 425 |         "title=\"我家娘子，不對勁\" #@param {type:\"string\"}\n",
 426 |         "author=\"一蟬知夏\" #@param {type:\"string\"}\n",
 427 |         "#@markdown Check this box to convert the result directly to EPUB\n",
 428 |         "file2epub = True #@param {type:\"boolean\"}\n",
 429 |         "\n",
 430 |         "# Define the title metadata\n",
 431 |         "YAML=f'''---\n",
 432 |         "title: {title}\n",
 433 |         "author: {author}\n",
 434 |         "language: zh-Hant\n",
 435 |         "---'''\n",
 436 |         "\n",
 437 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
 438 |         "  f.write(YAML)\n",
 439 |         "\n",
 440 |         "sites=url[:url.find(\"/\",8)]\n",
 441 |         "# sites=url[:url.rfind(\"/\")]\n",
 442 |         "\n",
 443 |         "reg=requests.get(url)\n",
 444 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
 445 |         "reg.encoding=\"utf-8\"\n",
 446 |         "soup=BeautifulSoup(reg.text)\n",
 447 |         "articles=soup.find(name=\"ul\",id=\"chapterList\").find_all(\"a\")\n",
 448 |         "\n",
 449 |         "links=[]\n",
 450 |         "# len(articles)\n",
 451 |         "for i in articles:\n",
 452 |         "  href=i.get(\"href\")\n",
 453 |         "  links.append([i.get(\"title\"),f\"{sites}{href}\"])\n",
 454 |         "# links.sort(key=lambda x: x[1])\n",
 455 |         "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n",
 456 |         "\n",
 457 |         "# Temporarily disabled: Colab currently only allows 2 threads, and they can only run for 60 seconds\n",
 458 |         "# # Create and start 5 worker threads at once\n",
 459 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
 460 |         "#     executor.map(get_html, links)\n",
 461 |         "for link in links:\n",
 462 |         "  get_html(link)\n",
 463 |         "\n",
 464 |         "output_name=title\n",
 465 |         "# files_text=os.listdir()\n",
 466 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
 467 |         "# Sort the files; file names vary in length, so for now they are sorted by numeric value\n",
 468 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
 469 |         "if file2epub:\n",
 470 |         "  mdfiles=[ itm for itm in files_text[::-1] ]\n",
 471 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
 472 |         "  from google.colab import files\n",
 473 |         "  files.download('../{}.epub'.format(title))\n",
 474 |         "  pass\n",
 475 |         "else:\n",
 476 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
 477 |         "    for file in files_text[::-1]:\n",
 478 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
 479 |         "        f.write(f2.read())\n",
 480 |         "  from google.colab import files\n",
 481 |         "  files.download('../{}.txt'.format(output_name))\n",
 482 |         "\n"
 483 |       ]
 484 |     },
 485 |     {
 486 |       "cell_type": "code",
 487 |       "execution_count": null,
 488 |       "metadata": {
 489 |         "id": "pNaQIMaFVWba",
 490 |         "cellView": "form"
 491 |       },
 492 |       "outputs": [],
 493 |       "source": [
 494 |         "#@title UU看書網\n",
 495 |         "#@markdown Work in progress; this cell can be run directly on its own\n",
 496 |         "\n",
 497 |         "def check_package(itm):\n",
 498 |         "  import importlib\n",
 499 |         "  try:\n",
 500 |         "    importlib.import_module(itm)\n",
 501 |         "    print(f\"{itm} is already installed\")\n",
 502 |         "  except ImportError:\n",
 503 |         "    print(f\"{itm} is not installed yet; installing...\")\n",
 504 |         "    !pip install {itm}\n",
 505 |         "\n",
 506 |         "\n",
 507 |         "check_package(\"inlp\")\n",
 508 |         "\n",
 509 |         "import requests\n",
 510 |         "import os,re\n",
 511 |         "from bs4 import BeautifulSoup\n",
 512 |         "import concurrent.futures\n",
 513 |         "from inlp.convert import chinese\n",
 514 |         "\n",
 515 |         "try:\n",
 516 |         "  os.mkdir(\"/content/tmp\")\n",
 517 |         "  chk_cont=False\n",
 518 |         "except FileExistsError:\n",
 519 |         "  print(\"Directory already exists\")\n",
 520 |         "  chk_cont=True\n",
 521 |         "os.chdir(\"/content/tmp\")\n",
 522 |         "# os.system(\"rm -fr *\")\n",
 523 |         "\n",
 524 |         "\n",
 525 |         "def get_html(urls):\n",
 526 |         "  [title,art_url]=urls\n",
 527 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
 528 |         "  reg=requests.get(art_url)\n",
 529 |         "  reg.encoding=\"utf-8\"\n",
 530 |         "  soup=BeautifulSoup(reg.text)\n",
 531 |         "  # content=\"<br>\\n\".join([tag.string for tag in soup.find(name=\"div\",id='nr').find_all(\"p\")])\n",
 532 |         "  tmp=soup.find(name=\"div\",id='nr').find_all(\"p\")\n",
 533 |         "  print(art_id)\n",
 534 |         "  # pages=soup.find('h1').getText()\n",
 535 |         "  # numbers = re.findall(r'\\d+', pages)\n",
 536 |         "  check=soup.find(\"div\",class_=\"operate\").find_all(\"a\")\n",
 537 |         "  if (check[-1].getText()==\"下一頁\"):\n",
 538 |         "    page_url=sites+check[-1].get(\"href\")\n",
 539 |         "  # if (len(numbers) >1):\n",
 540 |         "  #   last_number=numbers[-1]\n",
 541 |         "  #   # print(last_number)\n",
 542 |         "  #   for i in range(2,int(last_number)+1):\n",
 543 |         "  #     page_url=art_url[:-5]+f\"_{i}.html\"\n",
 544 |         "    content_extend=get_html_page(page_url)\n",
 545 |         "    tmp=tmp+content_extend\n",
 546 |         "\n",
 547 |         "\n",
 548 |         "  content=\"\\n\".join([str(itm) for itm in tmp])\n",
 549 |         "  # print(\"<br>\".join([tag.string for tag in content]))\n",
 550 |         "  text=f\"# {title}\\n\\n\"\n",
 551 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
 552 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
 553 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
 554 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
 555 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
 556 |         "  context=[itm.strip() for itm in context if len(itm)>0]\n",
 557 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
 558 |         "\n",
 559 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
 560 |         "    f.write(text)\n",
 561 |         "\n",
 562 |         "\n",
 563 |         "def get_html_page(page_url):\n",
 564 |         "  reg=requests.get(page_url)\n",
 565 |         "\n",
 566 |         "  reg.encoding=\"utf-8\"\n",
 567 |         "  soup=BeautifulSoup(reg.text)\n",
 568 |         "  tmp1=soup.find(name=\"div\",id='nr').find_all(\"p\")\n",
 569 |         "  check=soup.find(\"div\",class_=\"operate\").find_all(\"a\")\n",
 570 |         "  if (check[-1].getText()==\"下一頁\"):\n",
 571 |         "    page_url=sites+check[-1].get(\"href\")\n",
 572 |         "  # if (len(numbers) >1):\n",
 573 |         "  #   last_number=numbers[-1]\n",
 574 |         "  #   # print(last_number)\n",
 575 |         "  #   for i in range(2,int(last_number)+1):\n",
 576 |         "  #     page_url=art_url[:-5]+f\"_{i}.html\"\n",
 577 |         "    content_extend=get_html_page(page_url)\n",
 578 |         "    tmp1=tmp1+content_extend\n",
 579 |         "\n",
 580 |         "  return tmp1\n",
 581 |         "\n",
 582 |         "\n",
 583 |         "\n",
 584 |         "#@markdown Book table-of-contents URL\n",
 585 |         "url=\"https://www.uuread.tw/28100\" #@param {type:'string'}\n",
 586 |         "title=\"系統賦我長生，我熬死了所有人\" #@param {type:\"string\"}\n",
 587 |         "author=\"一隻榴蓮3號\" #@param {type:\"string\"}\n",
 588 |         "#@markdown Check this box to convert the result directly to EPUB\n",
 589 |         "file2epub = True #@param {type:\"boolean\"}\n",
 590 |         "\n",
 591 |         "# Define the title metadata\n",
 592 |         "YAML=f'''---\n",
 593 |         "title: {title}\n",
 594 |         "author: {author}\n",
 595 |         "language: zh-Hant\n",
 596 |         "---'''\n",
 597 |         "\n",
 598 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
 599 |         "  f.write(YAML)\n",
 600 |         "\n",
 601 |         "sites=url[:url.find(\"/\",8)]\n",
 602 |         "# sites=url[:url.rfind(\"/\")]\n",
 603 |         "\n",
 604 |         "reg=requests.get(url)\n",
 605 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
 606 |         "reg.encoding=\"utf-8\"\n",
 607 |         "soup=BeautifulSoup(reg.text)\n",
 608 |         "articles=soup.find(name=\"ul\",id=\"newlist\").find_all(\"a\")\n",
 609 |         "\n",
 610 |         "links=[]\n",
 611 |         "# len(articles)\n",
 612 |         "for i in articles:\n",
 613 |         "  href=i.get(\"href\")\n",
 614 |         "  links.append([i.get(\"title\"),f\"{sites}{href}\"])\n",
 615 |         "# links.sort(key=lambda x: x[1])\n",
 616 |         "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n",
 617 |         "\n",
 618 |         "# # Temporarily disabled: Colab currently only allows 2 threads, and they can only run for 60 seconds\n",
 619 |         "# # # Create and start 5 worker threads at once\n",
 620 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
 621 |         "#     executor.map(get_html, links)\n",
 622 |         "\n",
 623 |         "if chk_cont:\n",
 624 |         "  start_num=int(len(os.listdir())-1)\n",
 625 |         "  print(start_num)\n",
 626 |         "else:\n",
 627 |         "  start_num=0\n",
 628 |         "for index,link in enumerate(links[start_num:],start_num+1):\n",
 629 |         "  print(index)\n",
 630 |         "  get_html(link)\n",
 631 |         "\n",
 632 |         "output_name=title\n",
 633 |         "# files_text=os.listdir()\n",
 634 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
 635 |         "# Sort the files; file names vary in length, so for now they are sorted by numeric value\n",
 636 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
 637 |         "if file2epub:\n",
 638 |         "  mdfiles=[ itm for itm in files_text ]\n",
 639 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
 640 |         "  from google.colab import files\n",
 641 |         "  files.download('../{}.epub'.format(title))\n",
 642 |         "  pass\n",
 643 |         "else:\n",
 644 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
 645 |         "    for file in files_text:\n",
 646 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
 647 |         "        f.write(f2.read())\n",
 648 |         "  from google.colab import files\n",
 649 |         "  files.download('../{}.txt'.format(output_name))\n",
 650 |         "\n"
 651 |       ]
 652 |     },
 653 |     {
 654 |       "cell_type": "code",
 655 |       "execution_count": null,
 656 |       "metadata": {
 657 |         "cellView": "form",
 658 |         "id": "stXbsG0Qb0wC"
 659 |       },
 660 |       "outputs": [],
 661 |       "source": [
 662 |         "#@title UU看書網(修正版)\n",
 663 |         "#@markdown Work in progress; this cell can be run directly on its own\n",
 664 |         "\n",
 665 |         "def check_package(package_name,is_system_package=False):\n",
 666 |         "  import importlib.util,shutil,subprocess\n",
 667 |         "  # System packages are detected by their executable, Python packages by their module spec\n",
 668 |         "  installed=shutil.which(package_name) if is_system_package else importlib.util.find_spec(package_name)\n",
 669 |         "  if installed:\n",
 670 |         "    print(f\"{package_name} is already installed\")\n",
 671 |         "  else:\n",
 672 |         "    print(f\"{package_name} is not installed yet; installing...\")\n",
 673 |         "    if is_system_package:\n",
 674 |         "      subprocess.check_call([\"apt-get\", \"install\", \"-y\", package_name])\n",
 675 |         "    else:\n",
 676 |         "      subprocess.check_call([\"pip\", \"install\", package_name])\n",
 677 |         "\n",
 678 |         "# Check whether pandoc is installed, and install it if not\n",
 679 |         "check_package(\"pandoc\",is_system_package=True)\n",
 680 |         "\n",
 681 |         "check_package(\"inlp\")\n",
 682 |         "\n",
 683 |         "import requests\n",
 684 |         "import os,re\n",
 685 |         "from bs4 import BeautifulSoup\n",
 686 |         "import concurrent.futures\n",
 687 |         "from inlp.convert import chinese\n",
 688 |         "import time\n",
 689 |         "import random\n",
 690 |         "\n",
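 691 |         "# Request headers for the fetch functions below; the User-Agent value is an assumed placeholder\n",
 692 |         "headers={\"User-Agent\":\"Mozilla/5.0\"}\n",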
 693 |         "try:\n",
 694 |         "  os.mkdir(\"/content/tmp\")\n",
 695 |         "  chk_cont=False\n",
 696 |         "except FileExistsError:\n",
 697 |         "  print(\"Directory already exists\")\n",
 698 |         "  chk_cont=True\n",
 699 |         "os.chdir(\"/content/tmp\")\n",
 700 |         "# os.system(\"rm -fr *\")\n",
 701 |         "\n",
 702 |         "def get_article_content(article_url):\n",
 703 |         "    try:\n",
 704 |         "        response = requests.get(article_url,headers=headers, timeout=10)\n",
 705 |         "        response.encoding = \"utf-8\"\n",
 706 |         "        if response.status_code == 200:\n",
 707 |         "            soup = BeautifulSoup(response.text, 'html.parser')\n",
 708 |         "            content_div = soup.find('div', id='nr')\n",
 709 |         "            if content_div:\n",
 710 |         "                paragraphs = content_div.find_all('p')\n",
 711 |         "                content = '\\n'.join([p.get_text(strip=True) for p in paragraphs])\n",
 712 |         "\n",
 713 |         "                next_button_text = soup.find(\"div\", class_=\"operate\").find_all(\"a\")[-1].getText()\n",
 714 |         "                next_page_url = None\n",
 715 |         "                if next_button_text == \"下一頁\":\n",
 716 |         "                    next_page_url = sites + soup.find(\"div\", class_=\"operate\").find_all(\"a\")[-1].get(\"href\")\n",
 717 |         "\n",
 718 |         "                return content, next_page_url\n",
 719 |         "            else:\n",
 720 |         "                print(f\"Content not found: {article_url}\")\n",
 721 |         "        else:\n",
 722 |         "            print(f\"Could not access the page: {article_url}\")\n",
 723 |         "    except Exception as e:\n",
 724 |         "        print(f\"An error occurred: {e}\")\n",
 725 |         "    return None, None\n",
 726 |         "\n",
 727 |         "def save_to_text_file(title, content, file_name, add_title=True):\n",
 728 |         "    try:\n",
 729 |         "        with open(file_name, 'a', encoding='utf-8') as file:\n",
 730 |         "            if add_title:\n",
 731 |         "                file.write(f\"# {title}\\n\\n\")\n",
 732 |         "            # Strip extra whitespace and line breaks\n",
 733 |         "            # print(content)\n",
 734 |         "            content = BeautifulSoup(content, \"html.parser\").get_text(separator=\"\\n\\n\")\n",
 735 |         "            content = content.replace(\"\\xa0\\xa0\\xa0\\xa0\", \"\").replace(\"<br/>\", \"\\n\").replace(\"</p>\", \"\\n\")\n",
 736 |         "\n",
 737 |         "            file.write(str(content))\n",
 738 |         "            print(f\"Saved to file: {file_name}\")\n",
 739 |         "    except Exception as e:\n",
 740 |         "        print(f\"Error while saving the file: {e}\")\n",
 741 |         "\n",
 742 |         "def crawl_articles(urls,start_num):\n",
 743 |         "\n",
 744 |         "    for index, (title, article_url) in enumerate(urls, start=start_num):\n",
 745 |         "        article_id = article_url.split('/')[-1].split('.')[0]\n",
 746 |         "        print(f\"Processing article: {article_id}\")\n",
 747 |         "        content, next_page_url = get_article_content(article_url)\n",
 748 |         "        if content:\n",
 749 |         "            # Save the content of the first page\n",
 750 |         "            file_name = f\"{article_id}.txt\"\n",
 751 |         "            save_to_text_file(title, content, file_name)\n",
 752 |         "\n",
 753 |         "            # Then process the remaining pages, if any\n",
 754 |         "            while next_page_url:\n",
 755 |         "                next_page_content, next_page_url = get_article_content(next_page_url)\n",
 756 |         "                if next_page_content:\n",
 757 |         "                    # Save the content of the remaining pages\n",
 758 |         "                    save_to_text_file(title, next_page_content, file_name, add_title=False)\n",
 759 |         "\n",
 760 |         "\n",
 761 |         "def get_html(urls):\n",
 762 |         "  [title,art_url]=urls\n",
 763 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
 764 |         "  reg=requests.get(art_url,headers=headers, timeout=10)\n",
 765 |         "  reg.encoding=\"utf-8\"\n",
 766 |         "  soup=BeautifulSoup(reg.text)\n",
 767 |         "  # content=\"<br>\\n\".join([tag.string for tag in soup.find(name=\"div\",id='nr').find_all(\"p\")])\n",
 768 |         "  tmp=soup.find(name=\"div\",id='nr').find_all(\"p\")\n",
 769 |         "  print(art_id)\n",
 770 |         "  # pages=soup.find('h1').getText()\n",
 771 |         "  # numbers = re.findall(r'\\d+', pages)\n",
 772 |         "  check=soup.find(\"div\",class_=\"operate\").find_all(\"a\")\n",
 773 |         "  if (check[-1].getText()==\"下一頁\"):\n",
 774 |         "    page_url=sites+check[-1].get(\"href\")\n",
 775 |         "  # if (len(numbers) >1):\n",
 776 |         "  #   last_number=numbers[-1]\n",
 777 |         "  #   # print(last_number)\n",
 778 |         "  #   for i in range(2,int(last_number)+1):\n",
 779 |         "  #     page_url=art_url[:-5]+f\"_{i}.html\"\n",
 780 |         "    content_extend=get_html_page(page_url)\n",
 781 |         "    tmp=tmp+content_extend\n",
 782 |         "\n",
 783 |         "\n",
 784 |         "  content=\"\\n\".join([str(itm) for itm in tmp])\n",
 785 |         "  # print(\"<br>\".join([tag.string for tag in content]))\n",
 786 |         "  text=f\"# {title}\\n\\n\"\n",
 787 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
 788 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
 789 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
 790 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
 791 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
 792 |         "  context=[itm.strip() for itm in context if len(itm)>0]\n",
 793 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
 794 |         "\n",
 795 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
 796 |         "    f.write(text)\n",
 797 |         "\n",
 798 |         "\n",
 799 |         "def get_html_page(page_url):\n",
 800 |         "  reg=requests.get(page_url,headers=headers, timeout=10)\n",
 801 |         "\n",
 802 |         "  reg.encoding=\"utf-8\"\n",
 803 |         "  soup=BeautifulSoup(reg.text)\n",
 804 |         "  tmp1=soup.find(name=\"div\",id='nr').find_all(\"p\")\n",
 805 |         "  check=soup.find(\"div\",class_=\"operate\").find_all(\"a\")\n",
 806 |         "  if (check[-1].getText()==\"下一頁\"):\n",
 807 |         "    page_url=sites+check[-1].get(\"href\")\n",
 808 |         "  # if (len(numbers) >1):\n",
 809 |         "  #   last_number=numbers[-1]\n",
 810 |         "  #   # print(last_number)\n",
 811 |         "  #   for i in range(2,int(last_number)+1):\n",
 812 |         "  #     page_url=art_url[:-5]+f\"_{i}.html\"\n",
 813 |         "    content_extend=get_html_page(page_url)\n",
 814 |         "    tmp1=tmp1+content_extend\n",
 815 |         "\n",
 816 |         "  return tmp1\n",
 817 |         "\n",
 818 |         "\n",
 819 |         "\n",
 820 |         "#@markdown Book table-of-contents URL\n",
 821 |         "url=\"https://www.uuread.tw/94294\" #@param {type:'string'}\n",
 822 |         "title=\"阿斗之智近乎妖\" #@param {type:\"string\"}\n",
 823 |         "author=\"漢衛\" #@param {type:\"string\"}\n",
 824 |         "#@markdown Check this box to convert the result directly to EPUB\n",
 825 |         "file2epub = True #@param {type:\"boolean\"}\n",
 826 |         "\n",
 827 |         "# Define the title metadata\n",
 828 |         "YAML=f'''---\n",
 829 |         "title: {title}\n",
 830 |         "author: {author}\n",
 831 |         "language: zh-Hant\n",
 832 |         "---'''\n",
 833 |         "\n",
 834 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
 835 |         "  f.write(YAML)\n",
 836 |         "\n",
 837 |         "sites=url[:url.find(\"/\",8)]\n",
 838 |         "# sites=url[:url.rfind(\"/\")]\n",
 839 |         "\n",
 840 |         "reg=requests.get(url)\n",
 841 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
 842 |         "reg.encoding=\"utf-8\"\n",
 843 |         "soup=BeautifulSoup(reg.text)\n",
 844 |         "articles=soup.find(name=\"ul\",id=\"newlist\").find_all(\"a\")\n",
 845 |         "\n",
 846 |         "links=[]\n",
 847 |         "# len(articles)\n",
 848 |         "for i in articles:\n",
 849 |         "  href=i.get(\"href\")\n",
 850 |         "  links.append([i.get(\"title\"),f\"{sites}{href}\"])\n",
 851 |         "# links.sort(key=lambda x: x[1])\n",
 852 |         "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n",
 853 |         "\n",
 854 |         "# # Temporarily disabled: Colab currently only allows 2 threads, and they can only run for 60 seconds\n",
 855 |         "# # # Create and start 5 worker threads at once\n",
 856 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
 857 |         "#     executor.map(get_html, links)\n",
 858 |         "\n",
 859 |         "if chk_cont:\n",
 860 |         "  start_num=int(len(os.listdir())-1)\n",
 861 |         "  print(start_num)\n",
 862 |         "else:\n",
 863 |         "  start_num=0\n",
 864 |         "for index,link in enumerate(links[start_num:],start_num+1):\n",
 865 |         "  print(index)\n",
 866 |         "  get_html(link)\n",
 867 |         "  time.sleep(random.uniform(1, 2))\n",
 868 |         "\n",
 869 |         "# start_getnum=len(os.listdir(\".\"))-1\n",
 870 |         "\n",
 871 |         "# crawl_articles(links[start_getnum:],start_num)\n",
 872 |         "\n",
 873 |         "\n",
 874 |         "\n",
 875 |         "output_name=title\n",
 876 |         "# # files_text=os.listdir()\n",
 877 |         "# # files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
 878 |         "# # Sort the files; file names vary in length, so for now they are sorted by numeric value\n",
 879 |         "# # files_text.sort(key=lambda x:int(x[:-4]))\n",
 880 |         "\n",
 881 |         "!apt-get install -y pandoc\n",
 882 |         "if file2epub:\n",
 883 |         "  mdfiles=[ itm for itm in files_text ]\n",
 884 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
 885 |         "  from google.colab import files\n",
 886 |         "  files.download('../{}.epub'.format(title))\n",
 887 |         "  pass\n",
 888 |         "else:\n",
 889 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
 890 |         "    for file in files_text:\n",
 891 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
 892 |         "        f.write(f2.read())\n",
 893 |         "  from google.colab import files\n",
 894 |         "  files.download('../{}.txt'.format(output_name))\n",
 895 |         "\n"
 896 |       ]
 897 |     },
 898 |     {
 899 |       "cell_type": "code",
 900 |       "execution_count": null,
 901 |       "metadata": {
 902 |         "cellView": "form",
 903 |         "id": "e21JYlPGnuy7"
 904 |       },
 905 |       "outputs": [],
 906 |       "source": [
 907 |         "#@title UU看書\n",
 908 |         "#@markdown Work in progress; this cell can be run directly on its own\n",
 909 |         "\n",
 910 |         "def check_package(itm):\n",
 911 |         "  import importlib\n",
 912 |         "  try:\n",
 913 |         "    importlib.import_module(itm)\n",
 914 |         "    print(f\"{itm} is already installed\")\n",
 915 |         "  except ImportError:\n",
 916 |         "    print(f\"{itm} is not installed yet; installing...\")\n",
 917 |         "    !pip install {itm}\n",
 918 |         "\n",
 919 |         "\n",
 920 |         "check_package(\"inlp\")\n",
 921 |         "\n",
 922 |         "import requests\n",
 923 |         "import os,re\n",
 924 |         "from bs4 import BeautifulSoup\n",
 925 |         "import concurrent.futures\n",
 926 |         "from inlp.convert import chinese\n",
 927 |         "\n",
 928 |         "try:\n",
 929 |         "  os.mkdir(\"/content/tmp\")\n",
 930 |         "except FileExistsError:\n",
 931 |         "  print(\"Directory already exists\")\n",
 932 |         "os.chdir(\"/content/tmp\")\n",
 933 |         "os.system(\"rm -fr *\")\n",
 934 |         "\n",
 935 |         "\n",
 936 |         "def get_html(urls):\n",
 937 |         "  [title,art_url]=urls\n",
 938 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
 939 |         "  reg=requests.get(art_url)\n",
 940 |         "  reg.encoding=\"utf-8\"\n",
 941 |         "  soup=BeautifulSoup(reg.text)\n",
 942 |         "  content=soup.find(name=\"div\",id='TextContent')\n",
 943 |         "  print(art_id)\n",
 944 |         "  text=f\"# {title}\\n\\n\"\n",
 945 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
 946 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
 947 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
 948 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
 949 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
 950 |         "  context=[itm.strip() for itm in context if len(itm)>0]\n",
 951 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
 952 |         "  text=chinese.s2t(text)\n",
 953 |         "\n",
 954 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
 955 |         "    f.write(text)\n",
 956 |         "\n",
 957 |         "\n",
 958 |         "#@markdown Book table-of-contents URL\n",
 959 |         "url=\"https://www.uuks5.com/book/632203/\" #@param {type:'string'}\n",
 960 |         "title=\"我在古代當便宜爹\" #@param {type:\"string\"}\n",
 961 |         "author=\"雲山風海\" #@param {type:\"string\"}\n",
 962 |         "#@markdown Check this box to convert the result directly to EPUB\n",
 963 |         "file2epub = True #@param {type:\"boolean\"}\n",
 964 |         "\n",
 965 |         "# Define the title metadata\n",
 966 |         "YAML=f'''---\n",
 967 |         "title: {title}\n",
 968 |         "author: {author}\n",
 969 |         "language: zh-Hant\n",
 970 |         "---'''\n",
 971 |         "\n",
 972 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
 973 |         "  f.write(YAML)\n",
 974 |         "\n",
 975 |         "sites=url[:url.find(\"/\",8)]\n",
 976 |         "# sites=url[:url.rfind(\"/\")]\n",
 977 |         "\n",
 978 |         "reg=requests.get(url)\n",
 979 |         "reg.encoding=\"utf-8\"\n",
 980 |         "soup=BeautifulSoup(reg.text,\"html.parser\")\n",
 981 |         "\n",
 982 |         "\n",
 983 |         "articles=soup.find(name=\"ul\",id=\"chapterList\").find_all(name=\"a\")\n",
 984 |         "\n",
 985 |         "links=[]\n",
 986 |         "# # len(articles)\n",
 987 |         "for i in articles:\n",
 988 |         "  href=i.get(\"href\")\n",
 989 |         "  # print(i.text)\n",
 990 |         "  links.append([i.text,f\"{sites}{href}\"])\n",
 991 |         "# links.sort(key=lambda x: x[1])\n",
 992 |         "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n",
 993 |         "\n",
 994 |         "\n",
 995 |         "for link in links:\n",
 996 |         "  get_html(link)\n",
 997 |         "\n",
 998 |         "output_name=title\n",
 999 |         "# files_text=os.listdir()\n",
1000 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
1001 |         "# Sort the files; file names vary in length, so for now they are sorted by numeric value\n",
1002 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
1003 |         "if file2epub:\n",
1004 |         "  mdfiles=[ itm for itm in files_text]\n",
1005 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
1006 |         "  from google.colab import files\n",
1007 |         "  files.download('../{}.epub'.format(title))\n",
1008 |         "  pass\n",
1009 |         "else:\n",
1010 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
1011 |         "    for file in files_text[::-1]:\n",
1012 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
1013 |         "        f.write(f2.read())\n",
1014 |         "  from google.colab import files\n",
1015 |         "  files.download('../{}.txt'.format(output_name))\n",
1016 |         "\n"
1017 |       ]
1018 |     },
1019 |     {
1020 |       "cell_type": "code",
1021 |       "execution_count": null,
1022 |       "metadata": {
1023 |         "cellView": "form",
1024 |         "id": "zXHImWy6FhSW"
1025 |       },
1026 |       "outputs": [],
1027 |       "source": [
1028 |         "#@title 飄天文學\n",
1029 |         "#@markdown Still being revised; this cell can be run on its own\n",
1030 |         "def check_package(itm):\n",
1031 |         "  import importlib\n",
1032 |         "  try:\n",
1033 |         "    importlib.import_module(itm)\n",
1034 |         "    print(f\"{itm} package is already installed\")\n",
1035 |         "  except ImportError:\n",
1036 |         "    print(f\"{itm} package is not installed; installing...\")\n",
1037 |         "    !pip install {itm}\n",
1038 |         "\n",
1039 |         "\n",
1040 |         "check_package(\"inlp\")\n",
1041 |         "\n",
1042 |         "\n",
1043 |         "import requests\n",
1044 |         "import os,re\n",
1045 |         "from bs4 import BeautifulSoup\n",
1046 |         "import concurrent.futures\n",
1047 |         "from inlp.convert import chinese\n",
1048 |         "\n",
1049 |         "try:\n",
1050 |         "  os.mkdir(\"/content/tmp\")\n",
1051 |         "except FileExistsError:\n",
1052 |         "  print(\"Directory already exists\")\n",
1053 |         "os.chdir(\"/content/tmp\")\n",
1054 |         "os.system(\"rm -fr *\")\n",
1055 |         "\n",
1056 |         "\n",
1057 |         "def get_html(urls):\n",
1058 |         "  [title,art_url]=urls\n",
1059 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
1060 |         "  # print(art_url)\n",
1061 |         "  req=requests.get(art_url)\n",
1062 |         "  # req.encoding=\"gbk\"\n",
1063 |         "  soup=BeautifulSoup(req.text)\n",
1064 |         "  # content=soup.find('div', {'id': 'content', 'class': 'fonts_mesne'})\n",
1065 |         "\n",
1066 |         "  print(art_id)\n",
1067 |         "\n",
1068 |         "\n",
1069 |         "  # Locate the chapter body, then drop nested elements that are not chapter text\n",
1071 |         "  content=soup.find(name=\"div\",id='booktxt')\n",
1072 |         "  # print(content)\n",
1073 |         "  # print(soup)\n",
1074 |         "  for tag in content.find_all(['div', 'h1','table','script','center']):\n",
1075 |         "    tag.extract()\n",
1076 |         "\n",
1077 |         "  text=f\"# {title}\\n\\n\"\n",
1078 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
1079 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
1080 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
1081 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
1082 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
1083 |         "  context=[itm.strip() for itm in context if itm.strip()]\n",
1084 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
1085 |         "  text=chinese.s2t(text)\n",
1086 |         "\n",
1087 |         "\n",
1088 |         "\n",
1089 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
1090 |         "    f.write(text)\n",
1091 |         "\n",
1092 |         "\n",
1093 |         "#@markdown URL of the book's table of contents\n",
1094 |         "url=\"https://www.piaotianba.com/ptoewgi\" #@param {type:'string'}\n",
1095 |         "title=\"先修諸天萬道再修仙\" #@param {type:\"string\"}\n",
1096 |         "author=\"門前鴨\" #@param {type:\"string\"}\n",
1097 |         "#@markdown Check this box to convert the output directly to EPUB\n",
1098 |         "file2epub = True #@param {type:\"boolean\"}\n",
1099 |         "\n",
1100 |         "# Define the title metadata (YAML block passed to pandoc)\n",
1101 |         "YAML=f'''---\n",
1102 |         "title: {title}\n",
1103 |         "author: {author}\n",
1104 |         "language: zh-Hant\n",
1105 |         "---'''\n",
1106 |         "\n",
1107 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
1108 |         "  f.write(YAML)\n",
1109 |         "\n",
1110 |         "# sites=url\n",
1111 |         "sites=url[:url.rfind(\"/\")]\n",
1112 |         "\n",
1113 |         "reg=requests.get(url)\n",
1114 |         "#reg.encoding = 'gbk'\n",
1115 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
1116 |         "soup=BeautifulSoup(reg.text)\n",
1117 |         "\n",
1118 |         "\n",
1119 |         "# articles=soup.find(name=\"div\",class_=\"centent\").find_all(\"a\")\n",
1120 |         "articles=soup.find(name=\"dl\",id=\"newlist\").find_all(\"a\")\n",
1121 |         "\n",
1122 |         "links=[]\n",
1123 |         "# len(articles)\n",
1124 |         "for i in articles:\n",
1125 |         "  href=i.get(\"href\")\n",
1126 |         "  links.append([chinese.s2t(i.text),f\"{sites}/{href}\"])\n",
1127 |         "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n",
1128 |         "\n",
1129 |         "# # Temporarily disabled: Colab currently allows only 2 threads, and they can run for only 60 seconds\n",
1130 |         "# # # Create and start 5 worker threads concurrently\n",
1131 |         "# # with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
1132 |         "# #     executor.map(get_html, links)\n",
1133 |         "for link in links:\n",
1134 |         "  get_html(link)\n",
1135 |         "\n",
1136 |         "output_name=title\n",
1137 |         "# files_text=os.listdir()\n",
1138 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
1139 |         "# File sorting: file names differ in length, so they are currently sorted by numeric value\n",
1140 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
1141 |         "if file2epub:\n",
1142 |         "  mdfiles=[ itm for itm in files_text]\n",
1143 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
1144 |         "  from google.colab import files\n",
1145 |         "  files.download('../{}.epub'.format(title))\n",
1146 |         "  pass\n",
1147 |         "else:\n",
1148 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
1149 |         "    for file in files_text:\n",
1150 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
1151 |         "        f.write(f2.read())\n",
1152 |         "  from google.colab import files\n",
1153 |         "  files.download('../{}.txt'.format(output_name))\n",
1154 |         "\n"
1155 |       ]
1156 |     },
1157 |     {
1158 |       "cell_type": "code",
1159 |       "execution_count": null,
1160 |       "metadata": {
1161 |         "cellView": "form",
1162 |         "id": "Sj1k1RHcre_O"
1163 |       },
1164 |       "outputs": [],
1165 |       "source": [
1166 |         "#@title 69書吧\n",
1167 |         "#@markdown Still being revised; this cell can be run on its own\n",
1168 |         "def check_package(itm):\n",
1169 |         "  import importlib\n",
1170 |         "  try:\n",
1171 |         "    importlib.import_module(itm)\n",
1172 |         "    print(f\"{itm} package is already installed\")\n",
1173 |         "  except ImportError:\n",
1174 |         "    print(f\"{itm} package is not installed; installing...\")\n",
1175 |         "    !pip install {itm}\n",
1176 |         "\n",
1177 |         "\n",
1178 |         "check_package(\"inlp\")\n",
1179 |         "\n",
1180 |         "\n",
1181 |         "import requests\n",
1182 |         "import os,re\n",
1183 |         "from bs4 import BeautifulSoup\n",
1184 |         "import concurrent.futures\n",
1185 |         "from inlp.convert import chinese\n",
1186 |         "\n",
1187 |         "try:\n",
1188 |         "  os.mkdir(\"/content/tmp\")\n",
1189 |         "except FileExistsError:\n",
1190 |         "  print(\"Directory already exists\")\n",
1191 |         "os.chdir(\"/content/tmp\")\n",
1192 |         "os.system(\"rm -fr *\")\n",
1193 |         "\n",
1194 |         "\n",
1195 |         "def get_html(urls):\n",
1196 |         "  [title,art_url]=urls\n",
1197 |         "  art_id=art_url[art_url.rfind(\"/\")+1:]\n",
1198 |         "  # print(art_url)\n",
1199 |         "  req=requests.get(art_url)\n",
1200 |         "  req.encoding=\"gbk\"\n",
1201 |         "  soup=BeautifulSoup(req.text)\n",
1202 |         "  print(art_id)\n",
1203 |         "\n",
1204 |         "  # Locate the chapter body, then drop nested elements that are not chapter text\n",
1206 |         "  content=soup.find(name=\"div\",class_=\"txtnav\")\n",
1207 |         "\n",
1208 |         "  for tag in content.find_all(['div','h1','script','center']):\n",
1209 |         "    tag.extract()\n",
1210 |         "\n",
1211 |         "\n",
1212 |         "  text=f\"# {title}\\n\\n\"\n",
1213 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
1214 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
1215 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
1216 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
1217 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
1218 |         "  context=[itm.strip() for itm in context if itm.strip()]\n",
1219 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
1220 |         "  text=chinese.s2t(text)\n",
1221 |         "\n",
1222 |         "  # print(text)\n",
1223 |         "\n",
1224 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
1225 |         "    f.write(text)\n",
1226 |         "\n",
1227 |         "\n",
1228 |         "#@markdown URL of the book's table of contents\n",
1229 |         "url=\"https://69shuba.cx/book/58865/\" #@param {type:'string'}\n",
1230 |         "title=\"暴富很難？我的超市通古今！\" #@param {type:\"string\"}\n",
1231 |         "author=\"琴止\" #@param {type:\"string\"}\n",
1232 |         "#@markdown Check this box to convert the output directly to EPUB\n",
1233 |         "file2epub = True #@param {type:\"boolean\"}\n",
1234 |         "\n",
1235 |         "# Define the title metadata (YAML block passed to pandoc)\n",
1236 |         "YAML=f'''---\n",
1237 |         "title: {title}\n",
1238 |         "author: {author}\n",
1239 |         "language: zh-Hant\n",
1240 |         "---'''\n",
1241 |         "\n",
1242 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
1243 |         "  f.write(YAML)\n",
1244 |         "\n",
1245 |         "sites=url\n",
1246 |         "# sites=url[:url.rfind(\"/\")]\n",
1247 |         "\n",
1248 |         "reg=requests.get(url)\n",
1249 |         "reg.encoding = 'gbk'\n",
1250 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
1251 |         "soup=BeautifulSoup(reg.text)\n",
1252 |         "\n",
1253 |         "articles=soup.find(name=\"div\",id=\"catalog\").find_all(\"a\")\n",
1254 |         "\n",
1255 |         "links=[]\n",
1256 |         "# # len(articles)\n",
1257 |         "for i in articles:\n",
1258 |         "  href=i.get(\"href\")\n",
1259 |         "  links.append([chinese.s2t(i.text),href])\n",
1260 |         "\n",
1261 |         "files_text=[link[1][link[1].rfind(\"/\")+1:]+\".txt\" for link in links]\n",
1262 |         "\n",
1263 |         "# Temporarily disabled: Colab currently allows only 2 threads, and they can run for only 60 seconds\n",
1264 |         "# # Create and start 5 worker threads concurrently\n",
1265 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
1266 |         "#     executor.map(get_html, links)\n",
1267 |         "for link in links:\n",
1268 |         "  get_html(link)\n",
1269 |         "\n",
1270 |         "output_name=title\n",
1271 |         "# files_text=os.listdir()\n",
1272 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
1273 |         "# File sorting: file names differ in length, so they are currently sorted by numeric value\n",
1274 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
1275 |         "if file2epub:\n",
1276 |         "  mdfiles=[ itm for itm in files_text]\n",
1277 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
1278 |         "  from google.colab import files\n",
1279 |         "  files.download('../{}.epub'.format(title))\n",
1280 |         "  pass\n",
1281 |         "else:\n",
1282 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
1283 |         "    for file in files_text:\n",
1284 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
1285 |         "        f.write(f2.read())\n",
1286 |         "  from google.colab import files\n",
1287 |         "  files.download('../{}.txt'.format(output_name))"
1288 |       ]
1289 |     },
1290 |     {
1291 |       "cell_type": "code",
1292 |       "execution_count": null,
1293 |       "metadata": {
1294 |         "cellView": "form",
1295 |         "id": "lkVncEDb8tXj"
1296 |       },
1297 |       "outputs": [],
1298 |       "source": [
1299 |         "#@title 笔趣阁(bqg9527.net/)\n",
1300 |         "#@markdown Still being revised; this cell can be run on its own\n",
1301 |         "\n",
1302 |         "def check_package(itm):\n",
1303 |         "  import importlib\n",
1304 |         "  try:\n",
1305 |         "    importlib.import_module(itm)\n",
1306 |         "    print(f\"{itm} package is already installed\")\n",
1307 |         "  except ImportError:\n",
1308 |         "    print(f\"{itm} package is not installed; installing...\")\n",
1309 |         "    !pip install {itm}\n",
1310 |         "\n",
1311 |         "\n",
1312 |         "check_package(\"inlp\")\n",
1313 |         "\n",
1314 |         "import requests\n",
1315 |         "import os,re\n",
1316 |         "from bs4 import BeautifulSoup\n",
1317 |         "import concurrent.futures\n",
1318 |         "from inlp.convert import chinese\n",
1319 |         "\n",
1320 |         "\n",
1321 |         "try:\n",
1322 |         "  os.mkdir(\"/content/tmp\")\n",
1323 |         "except FileExistsError:\n",
1324 |         "  print(\"Directory already exists\")\n",
1325 |         "os.chdir(\"/content/tmp\")\n",
1326 |         "os.system(\"rm -fr *\")\n",
1327 |         "\n",
1328 |         "\n",
1329 |         "def get_html(urls):\n",
1330 |         "  [title,art_url]=urls\n",
1331 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
1332 |         "\n",
1333 |         "  soup=BeautifulSoup(requests.get(art_url).text)\n",
1334 |         "  # content=soup.find(name=\"div\",id='nr1')\n",
1335 |         "  content=soup.find(name=\"div\",id=\"content\")\n",
1336 |         "  print(art_id)\n",
1337 |         "  # print(soup)\n",
1338 |         "\n",
1339 |         "  text=f\"# {title}\\n\\n\"\n",
1340 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
1341 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
1342 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
1343 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
1344 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
1345 |         "  context=[itm.strip() for itm in context if itm.strip()]\n",
1346 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
1347 |         "  text=chinese.s2t(text)\n",
1348 |         "\n",
1349 |         "  # print(text)\n",
1350 |         "\n",
1351 |         "\n",
1352 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
1353 |         "    f.write(text)\n",
1354 |         "\n",
1355 |         "\n",
1356 |         "#@markdown URL of the book's table of contents\n",
1357 |         "url=\"https://www.bqg9527.net/book/342930/\" #@param {type:'string'}\n",
1358 |         "title=\"我出錢你出命，我倆一起神經病\" #@param {type:\"string\"}\n",
1359 |         "author=\"我煞費苦心\" #@param {type:\"string\"}\n",
1360 |         "#@markdown Check this box to convert the output directly to EPUB\n",
1361 |         "file2epub = True #@param {type:\"boolean\"}\n",
1362 |         "\n",
1363 |         "# Define the title metadata (YAML block passed to pandoc)\n",
1364 |         "YAML=f'''---\n",
1365 |         "title: {title}\n",
1366 |         "author: {author}\n",
1367 |         "language: zh-Hant\n",
1368 |         "---'''\n",
1369 |         "\n",
1370 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
1371 |         "  f.write(YAML)\n",
1372 |         "\n",
1373 |         "# sites=url[:url.find(\"/\",8)+1]\n",
1374 |         "# sites=url[:url.rfind(\"/\")]\n",
1375 |         "sites=url\n",
1376 |         "reg=requests.get(url)\n",
1377 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
1378 |         "soup=BeautifulSoup(reg.text)\n",
1379 |         "# articles=soup.find_all(name=\"ul\",class_=\"chapter\")[1].find_all(\"a\")\n",
1380 |         "articles=soup.find(name=\"div\",id=\"list\").find_all(\"dt\")[1].find_next_siblings('dd')\n",
1381 |         "\n",
1382 |         "# second_dt=soup.find_all('dt')[1]\n",
1383 |         "# dd_tags = second_dt.find_next_siblings('dd')\n",
1384 |         "\n",
1385 |         "links=[]\n",
1386 |         "# len(articles)\n",
1387 |         "for itm in articles:\n",
1388 |         "  i=itm.find(\"a\")\n",
1389 |         "  href=i.get(\"href\")\n",
1390 |         "  if href[-4:]!= \"html\":\n",
1391 |         "    continue\n",
1392 |         "  links.append([i.text,f\"{sites}{href}\"])\n",
1393 |         "\n",
1394 |         "\n",
1395 |         "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n",
1396 |         "# print(links[0])\n",
1397 |         "# print(files_text[0])\n",
1398 |         "# get_html(links[0])\n",
1399 |         "\n",
1400 |         "# # Earlier note: Colab allowed only 2 threads, and they could run for only 60 seconds\n",
1401 |         "# # # Fetch the chapters concurrently with 5 worker threads\n",
1402 |         "with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
1403 |         "    executor.map(get_html, links)\n",
1404 |         "\n",
1405 |         "# for link in links:\n",
1406 |         "  # get_html(link)\n",
1407 |         "\n",
1408 |         "# output_name=soup.find(\"h1\").getText()\n",
1409 |         "output_name=title\n",
1410 |         "# files_text=os.listdir()\n",
1411 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
1412 |         "# File sorting: file names differ in length, so they are currently sorted by numeric value\n",
1413 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
1414 |         "if file2epub:\n",
1415 |         "  mdfiles=[ itm for itm in files_text]\n",
1416 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
1417 |         "  from google.colab import files\n",
1418 |         "  files.download('../{}.epub'.format(title))\n",
1419 |         "  pass\n",
1420 |         "else:\n",
1421 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
1422 |         "    for file in files_text:\n",
1423 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
1424 |         "        f.write(f2.read())\n",
1425 |         "  from google.colab import files\n",
1426 |         "  files.download('../{}.txt'.format(output_name))\n",
1427 |         "\n"
1428 |       ]
1429 |     },
1430 |     {
1431 |       "cell_type": "code",
1432 |       "execution_count": null,
1433 |       "metadata": {
1434 |         "cellView": "form",
1435 |         "id": "Afz6HuGb85ZL"
1436 |       },
1437 |       "outputs": [],
1438 |       "source": [
1439 |         "#@title 笔趣阁(p2wt)\n",
1440 |         "#@markdown Still being revised; this cell can be run on its own\n",
1441 |         "\n",
1442 |         "def check_package(itm):\n",
1443 |         "  import importlib\n",
1444 |         "  try:\n",
1445 |         "    print(f\"{itm} package is already installed\")\n",
1446 |         "  except ImportError:\n",
1447 |         "    print(f\"{itm} package is not installed; installing...\")\n",
1448 |         "    print(f\"{itm} 套件尚未安裝，正在安裝中...\")\n",
1449 |         "    !pip install {itm}\n",
1450 |         "\n",
1451 |         "\n",
1452 |         "check_package(\"inlp\")\n",
1453 |         "\n",
1454 |         "import requests\n",
1455 |         "import os,re\n",
1456 |         "from bs4 import BeautifulSoup\n",
1457 |         "import concurrent.futures\n",
1458 |         "from inlp.convert import chinese\n",
1459 |         "\n",
1460 |         "\n",
1461 |         "try:\n",
1462 |         "except FileExistsError:\n",
1463 |         "  print(\"Directory already exists\")\n",
1464 |         "  print(\"目錄已存在\")\n",
1465 |         "os.chdir(\"/content/tmp\")\n",
1466 |         "os.system(\"rm -fr *\")\n",
1467 |         "\n",
1468 |         "\n",
1469 |         "def get_html(urls):\n",
1470 |         "  [title,art_url]=urls\n",
1471 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
1472 |         "  reg=requests.get(art_url)\n",
1473 |         "  reg.encoding=\"utf-8\"\n",
1474 |         "  soup=BeautifulSoup(reg.text)\n",
1475 |         "  content=soup.find(name=\"div\",id='chaptercontent')\n",
1476 |         "  print(art_id)\n",
1477 |         "\n",
1478 |         "  text=f\"# {title}\\n\\n\"\n",
1479 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
1480 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
1481 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
1482 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
1483 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
1484 |         "  context=[itm.strip() for itm in context if itm.strip()]\n",
1485 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
1486 |         "  text=chinese.s2t(text)\n",
1487 |         "\n",
1488 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
1489 |         "    f.write(text)\n",
1490 |         "\n",
1491 |         "\n",
1492 |         "\n",
1493 |         "def toEpub():\n",
1494 |         "# files_text=os.listdir()\n",
1495 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
1496 |         "# File sorting: file names differ in length, so they are currently sorted by numeric value\n",
1497 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
1498 |         "  if file2epub:\n",
1499 |         "    mdfiles=[ itm for itm in files_text]\n",
1500 |         "    os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
1501 |         "    from google.colab import files\n",
1502 |         "    files.download('../{}.epub'.format(title))\n",
1503 |         "    pass\n",
1504 |         "  else:\n",
1505 |         "    with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
1506 |         "      for file in files_text:\n",
1507 |         "        with open(file,\"r\",encoding='utf-8') as f2:\n",
1508 |         "          f.write(f2.read())\n",
1509 |         "    from google.colab import files\n",
1510 |         "    files.download('../{}.txt'.format(output_name))\n",
1511 |         "\n",
1512 |         "\n",
1513 |         "#@markdown URL of the book's table of contents\n",
1514 |         "url=\"https://m.bqg9527.net/book/480/\" #@param {type:'string'}\n",
1515 |         "title=\"\\u7D66\\u4E0D\\u8D77\\u5F69\\u79AE\\uFF0C\\u53EA\\u597D\\u5A36\\u4E86\\u9B54\\u9580\\u8056\\u5973\" #@param {type:\"string\"}\n",
1516 |         "author=\"\\u5149\\u5F71\" #@param {type:\"string\"}\n",
1517 |         "#@markdown Check this box to convert the output directly to EPUB\n",
1518 |         "file2epub = True #@param {type:\"boolean\"}\n",
1519 |         "\n",
1520 |         "# Define the title metadata (YAML block passed to pandoc)\n",
1521 |         "YAML=f'''---\n",
1522 |         "title: {title}\n",
1523 |         "author: {author}\n",
1524 |         "language: zh-Hant\n",
1525 |         "---'''\n",
1526 |         "\n",
1527 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
1528 |         "  f.write(YAML)\n",
1529 |         "\n",
1530 |         "sites=url[:url.find(\"/\",8)]\n",
1531 |         "# sites=url[:url.rfind(\"/\")]  # alternative way to derive sites\n",
1532 |         "\n",
1533 |         "reg=requests.get(url)\n",
1534 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
1535 |         "reg.encoding=\"utf-8\"\n",
1536 |         "soup=BeautifulSoup(reg.text)\n",
1537 |         "output_name=soup.find(\"h1\").getText()\n",
1538 |         "articles=soup.find(name=\"div\",class_=\"listmain\").find_all(\"a\")\n",
1539 |         "\n",
1540 |         "\n",
1541 |         "links=[]\n",
1542 |         "# len(articles)\n",
1543 |         "for i in articles:\n",
1544 |         "  href=i.get(\"href\")\n",
1545 |         "  if href[-4:]!= \"html\":\n",
1546 |         "    continue\n",
1547 |         "  links.append([i.text,f\"{sites}{href}\"])\n",
1548 |         "\n",
1549 |         "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n",
1550 |         "\n",
1551 |         "\n",
1552 |         "# Temporarily disabled: Colab currently allows only 2 threads, and they can run for only 60 seconds\n",
1553 |         "# # Create and start 5 worker threads concurrently\n",
1554 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
1555 |         "#     executor.map(get_html, links)\n",
1556 |         "for link in links:\n",
1557 |         "  get_html(link)\n",
1558 |         "\n",
1559 |         "\n",
1560 |         "toEpub()\n"
1561 |       ]
1562 |     },
1563 |     {
1564 |       "cell_type": "code",
1565 |       "execution_count": null,
1566 |       "metadata": {
1567 |         "id": "roy0O2ZJnAvz",
1568 |         "cellView": "form"
1569 |       },
1570 |       "outputs": [],
1571 |       "source": [
1572 |         "#@title 愛下電子書 (work in progress)\n",
1573 |         "\n",
1574 |         "\n",
1575 |         "import requests\n",
1576 |         "import os,re\n",
1577 |         "from bs4 import BeautifulSoup\n",
1578 |         "import concurrent.futures\n",
1579 |         "\n",
1580 |         "try:\n",
1581 |         "  os.mkdir(\"/content/tmp\")\n",
1582 |         "except FileExistsError:\n",
1583 |         "  print(\"Directory already exists\")\n",
1584 |         "os.chdir(\"/content/tmp\")\n",
1585 |         "os.system(\"rm -fr *\")\n",
1586 |         "\n",
1587 |         "\n",
1588 |         "\n",
1589 |         "def get_html(urls):\n",
1590 |         "  [title,art_url]=urls\n",
1591 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
1592 |         "\n",
1593 |         "  soup=BeautifulSoup(requests.get(art_url).text)\n",
1594 |         "  # content=soup.find(name=\"div\",id='nr1')\n",
1595 |         "  content=soup.find(\"section\")\n",
1596 |         "\n",
1597 |         "  print(art_id)\n",
1598 |         "\n",
1599 |         "  text=f\"# {title}\\n\\n\"\n",
1600 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
1601 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
1602 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
1603 |         "  context=re.sub('<script.*?</script>','',context)\n",
1604 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
1605 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
1606 |         "  context=[itm.strip() for itm in context if itm.strip()]\n",
1607 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
1608 |         "  # text=chinese.s2t(text)\n",
1609 |         "\n",
1610 |         "  # print(text)\n",
1611 |         "\n",
1612 |         "\n",
1613 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
1614 |         "    f.write(text)\n",
1615 |         "\n",
1616 |         "\n",
1617 |         "\n",
1618 |         "\n",
1619 |         "\n",
1620 |         "#@markdown URL of the book's table of contents\n",
1621 |         "url=\"https://ixdzs8.tw/read/381070/p484.html\" #@param {type:'string'}\n",
1622 |         "title=\"豪門梟士\" #@param {type:\"string\"}\n",
1623 |         "author=\"雲山風海\" #@param {type:\"string\"}\n",
1624 |         "#@markdown Check this box to convert the output directly to EPUB\n",
1625 |         "file2epub = True #@param {type:\"boolean\"}\n",
1626 |         "\n",
1627 |         "# Define the title metadata (YAML block passed to pandoc)\n",
1628 |         "YAML=f'''---\n",
1629 |         "title: {title}\n",
1630 |         "author: {author}\n",
1631 |         "language: zh-Hant\n",
1632 |         "---'''\n",
1633 |         "\n",
1634 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
1635 |         "  f.write(YAML)\n",
1636 |         "\n",
1637 |         "sites=url[:url.find(\"/\",8)]\n",
1638 |         "# sites=url[:url.rfind(\"/\")]\n",
1639 |         "\n",
1640 |         "# reg=requests.get(url)\n",
1641 |         "soup=BeautifulSoup(data)  # data: catalog HTML; not assigned in this cell, so it must be defined beforehand\n",
1642 |         "# articles=soup.find(name=\"div\",id=\"readerlists\").find_all(\"a\")\n",
1643 |         "articles=soup.find_all(\"a\")\n",
1644 |         "# print(articles)\n",
1645 |         "\n",
1646 |         "\n",
1647 |         "links=[]\n",
1648 |         "\n",
1649 |         "for i in articles:\n",
1650 |         "  href=i.get(\"href\")\n",
1651 |         "  itmTitle=i.getText()\n",
1652 |         "  links.append([itmTitle,f\"{sites}{href}\"])\n",
1653 |         "# links.sort(key=lambda x: x[1])\n",
1654 |         "\n",
1655 |         "\n",
1656 |         "files_text=[link[1][link[1].rfind(\"/\")+1:-5]+\".txt\" for link in links]\n",
1657 |         "# links[:4]\n",
1658 |         "\n",
1659 |         "# Temporarily disabled: Colab currently allows only 2 threads, and they can run for only 60 seconds\n",
1660 |         "# # Create and start 5 worker threads concurrently\n",
1661 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
1662 |         "#     executor.map(get_html, links)\n",
1663 |         "\n",
1664 |         "for link in links:\n",
1665 |         "  get_html(link)\n",
1666 |         "\n",
1667 |         "output_name=title\n",
1668 |         "listfiles=os.listdir()\n",
1669 |         "mdfiles=[ itm for itm in files_text if itm in listfiles]\n",
1670 |         "# File sorting: file names differ in length, so they are currently sorted by numeric value\n",
1671 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
1672 |         "if file2epub:\n",
1673 |         "  # mdfiles=[ itm for itm in files_text if itm in listfiles]\n",
1674 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
1675 |         "  from google.colab import files\n",
1676 |         "  files.download('../{}.epub'.format(title))\n",
1677 |         "  pass\n",
1678 |         "else:\n",
1679 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
1680 |         "    for file in mdfiles:\n",
1681 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
1682 |         "        f.write(f2.read())\n",
1683 |         "  from google.colab import files\n",
1684 |         "  files.download('../{}.txt'.format(output_name))\n",
1685 |         "\n"
1686 |       ]
1687 |     },
1688 |     {
1689 |       "cell_type": "code",
1690 |       "execution_count": null,
1691 |       "metadata": {
1692 |         "id": "X-2pn1oynXxC",
1693 |         "cellView": "form"
1694 |       },
1695 |       "outputs": [],
1696 |       "source": [
1697 |         "#@title 52書庫(www.52shuku.vip)\n",
1698 |         "#@markdown Still being revised; this cell can be run on its own\n",
1699 |         "\n",
1700 |         "def check_package(itm):\n",
1701 |         "  import importlib\n",
1702 |         "  try:\n",
1703 |         "    importlib.import_module(itm)\n",
1704 |         "    print(f\"{itm} package is already installed\")\n",
1705 |         "  except ImportError:\n",
1706 |         "    print(f\"{itm} package is not installed; installing...\")\n",
1707 |         "    !pip install {itm}\n",
1708 |         "\n",
1709 |         "\n",
1710 |         "check_package(\"inlp\")\n",
1711 |         "\n",
1712 |         "import requests\n",
1713 |         "import os,re\n",
1714 |         "from bs4 import BeautifulSoup\n",
1715 |         "import concurrent.futures\n",
1716 |         "from inlp.convert import chinese\n",
1717 |         "\n",
1718 |         "\n",
1719 |         "try:\n",
1720 |         "  os.mkdir(\"/content/tmp\")\n",
1721 |         "except FileExistsError:\n",
1722 |         "  print(\"Directory already exists\")\n",
1723 |         "os.chdir(\"/content/tmp\")\n",
1724 |         "os.system(\"rm -fr *\")\n",
1725 |         "\n",
1726 |         "\n",
1727 |         "def get_html(urls):\n",
1728 |         "  [title,art_url]=urls\n",
1729 |         "  art_id=art_url[art_url.rfind(\"/\")+1:-5]\n",
1730 |         "  # print(art_id)\n",
1731 |         "  # print(art_url)\n",
1732 |         "  reg=requests.get(art_url)\n",
1733 |         "  reg.encoding=\"utf-8\"\n",
1734 |         "  soup=BeautifulSoup(reg.text,\"lxml\")\n",
1735 |         "  content=soup.find(name=\"div\",id='text')\n",
1736 |         "  print(art_id)\n",
1737 |         "\n",
1738 |         "  text=f\"# {title}\\n\\n\"\n",
1739 |         "  context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n",
1740 |         "  context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n",
1741 |         "  context=re.sub('<div class=\"ad_content\".*?</div>','',context)\n",
1742 |         "  context=context.replace(\"<br/>\",\"\\n\").replace(\"</p>\",\"\\n\")\n",
1743 |         "  context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n",
1744 |         "  context=[itm.strip() for itm in context if itm.strip()]\n",
1745 |         "  text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n",
1746 |         "  text=chinese.s2t(text)\n",
1747 |         "\n",
1748 |         "  # print(text)\n",
1749 |         "  with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n",
1750 |         "    f.write(text)\n",
1751 |         "\n",
1752 |         "\n",
1753 |         "#@markdown 書籍目錄網址\n",
1754 |         "url=\"https://www.52shuku.vip/yanqing/b/bjPGg.html\" #@param {type:'string'}\n",
1755 |         "title=\"\\u59D1\\u5A18\\u5979\\u7F8E\\u8C8C\\u5374\\u66B4\\u529B\" #@param {type:\"string\"}\n",
1756 |         "author=\"\\u4E60\\u6829\\u5112\\u751F\" #@param {type:\"string\"}\n",
1757 |         "#@markdown 打勾，將會直接變成 epub\n",
1758 |         "file2epub = True #@param {type:\"boolean\"}\n",
1759 |         "\n",
1760 |         "# Title metadata (YAML front matter passed to pandoc)\n",
1761 |         "YAML=f'''---\n",
1762 |         "title: {title}\n",
1763 |         "author: {author}\n",
1764 |         "language: zh-Hant\n",
1765 |         "---'''\n",
1766 |         "\n",
1767 |         "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n",
1768 |         "  f.write(YAML)\n",
1769 |         "\n",
1770 |         "sites=url[:url.find(\"/\",8)]\n",
1771 |         "# sites=url[:url.rfind(\"/\")]\n",
1772 |         "\n",
1773 |         "reg=requests.get(url)\n",
1774 |         "reg.encoding=\"utf-8\"\n",
1775 |         "# soup=BeautifulSoup(reg.text,\"html.parser\")\n",
1776 |         "soup=BeautifulSoup(reg.text,\"lxml\")\n",
1777 |         "output_name=soup.find(\"h1\").getText()\n",
1778 |         "articles=soup.find(name=\"ul\",class_=\"list\").find_all(\"a\")\n",
1779 |         "\n",
1780 |         "\n",
1781 |         "links=[]\n",
1782 |         "# len(articles)\n",
1783 |         "for i in articles:\n",
1784 |         "  href=i.get(\"href\")\n",
1785 |         "  if href[-4:]!= \"html\":\n",
1786 |         "    continue\n",
1787 |         "  links.append([i.text,f\"{href}\"])\n",
1788 |         "\n",
1789 |         "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n",
1790 |         "\n",
1791 |         "\n",
1792 |         "\n",
1793 |         "# Temporarily disabled: Colab currently appears to allow only 2 threads, and they run for only 60 seconds\n",
1794 |         "# # Create and start 5 threads at once\n",
1795 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n",
1796 |         "#     executor.map(get_html, links)\n",
1797 |         "for link in links:\n",
1798 |         "  get_html(link)\n",
1799 |         "\n",
1800 |         "output_name=soup.find(\"h1\").getText()\n",
1801 |         "# files_text=os.listdir()\n",
1802 |         "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n",
1803 |         "# File ordering must account for file names of different lengths; this is currently handled by sorting on the numeric part\n",
1804 |         "# files_text.sort(key=lambda x:int(x[:-4]))\n",
1805 |         "if file2epub:\n",
1806 |         "  mdfiles=[ itm for itm in files_text]\n",
1807 |         "  os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n",
1808 |         "  from google.colab import files\n",
1809 |         "  files.download('../{}.epub'.format(title))\n",
1810 |         "  pass\n",
1811 |         "else:\n",
1812 |         "  with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
1813 |         "    for file in files_text[::-1]:\n",
1814 |         "      with open(file,\"r\",encoding='utf-8') as f2:\n",
1815 |         "        f.write(f2.read())\n",
1816 |         "  from google.colab import files\n",
1817 |         "  files.download('../{}.txt'.format(output_name))\n",
1818 |         "\n"
1819 |       ]
1820 |     },
1821 |     {
1822 |       "cell_type": "markdown",
1823 |       "metadata": {
1824 |         "id": "XJOqF-eIbaCT"
1825 |       },
1826 |       "source": [
1827 |         "# Code blocks that are no longer used after testing"
1828 |       ]
1829 |     },
1830 |     {
1831 |       "cell_type": "markdown",
1832 |       "metadata": {
1833 |         "id": "eNxKvxlfhvO6"
1834 |       },
1835 |       "source": [
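1836 |         "# Efficiency\n",
1837 |         "\n",
1838 |         "Merging the files with the approach below is slow: the output file is reopened (and a new shell is spawned) for every input file, and performance degrades as the output file grows.\n",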
1839 |         "\n",
1840 |         "```python\n",
1841 |         "for file in files:\n",
1842 |         "  os.system(\"cat {}>> ../{}.txt\".format(file,output_name))\n",
1843 |         "```\n",
1844 |         "With the approach below, the output file is opened only once, which shortens the run time considerably.\n",
1845 |         "\n",
1846 |         "```python\n",
1847 |         "with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n",
1848 |         "  for file in files_text:\n",
1849 |         "    with open(file,\"r\") as f2:\n",
1850 |         "      f.write(f2.read())\n",
1851 |         "```"
1852 |       ]
1853 |     },
1854 |     {
1855 |       "cell_type": "markdown",
1856 |       "metadata": {
1857 |         "id": "8KCQpgIUtZ3M"
1858 |       },
1859 |       "source": [
1860 |         "# References"
1861 |       ]
1862 |     },
1863 |     {
1864 |       "cell_type": "code",
1865 |       "execution_count": null,
1866 |       "metadata": {
1867 |         "cellView": "form",
1868 |         "id": "4CCMkYp5OxmZ"
1869 |       },
1870 |       "outputs": [],
1871 |       "source": [
1872 |         "#@title Multithreading reference example\n",
1873 |         "\n",
1874 |         "from bs4 import BeautifulSoup\n",
1875 |         "import concurrent.futures\n",
1876 |         "import requests\n",
1877 |         "import time\n",
1878 |         "\n",
1879 |         "\n",
1880 |         "def scrape(urls):\n",
1881 |         "\n",
1882 |         "    response = requests.get(urls)\n",
1883 |         "\n",
1884 |         "    soup = BeautifulSoup(response.content, \"lxml\")\n",
1885 |         "\n",
1886 |         "    # 爬取文章標題\n",
1887 |         "    titles = soup.find_all(\"h3\", {\"class\": \"post_title\"})\n",
1888 |         "\n",
1889 |         "    for title in titles:\n",
1890 |         "        print(title.getText().strip())\n",
1891 |         "\n",
1892 |         "    time.sleep(2)\n",
1893 |         "\n",
1894 |         "\n",
1895 |         "base_url = \"https://www.inside.com.tw/tag/AI\"\n",
1896 |         "urls = [f\"{base_url}?page={page}\" for page in range(1, 6)]  # 1~5頁的網址清單\n",
1897 |         "print(urls)\n",
1898 |         "start_time = time.time()  # 開始時間\n",
1899 |         "# scrape(urls)\n",
1900 |         "\n",
1901 |         "\n",
1902 |         "# 同時建立及啟用10個執行緒\n",
1903 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:\n",
1904 |         "#     executor.map(scrape, urls)\n",
1905 |         "\n",
1906 |         "end_time = time.time()\n",
1907 |         "print(f\"{end_time - start_time} 秒爬取 {len(urls)} 頁的文章\")"
1908 |       ]
1909 |     }
1910 |   ],
1911 |   "metadata": {
1912 |     "colab": {
1913 |       "provenance": [],
1914 |       "mount_file_id": "11h2gvU2w0w-cWs1O2ciOwdgK4Wi6qndj",
1915 |       "authorship_tag": "ABX9TyM7n29sOH7wFTuDQLfmH77G",
1916 |       "include_colab_link": true
1917 |     },
1918 |     "kernelspec": {
1919 |       "display_name": "Python 3",
1920 |       "name": "python3"
1921 |     }
1922 |   },
1923 |   "nbformat": 4,
1924 |   "nbformat_minor": 0
1925 | }


--------------------------------------------------------------------------------
/oTranscribe_txt_to_srt_格式轉換.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "name": "oTranscribe txt to srt 格式轉換",
  7 |       "provenance": [],
  8 |       "collapsed_sections": [],
  9 |       "authorship_tag": "ABX9TyPY0cE4hDO0EpD8PepnE1dl",
 10 |       "include_colab_link": true
 11 |     },
 12 |     "kernelspec": {
 13 |       "name": "python3",
 14 |       "display_name": "Python 3"
 15 |     }
 16 |   },
 17 |   "cells": [
 18 |     {
 19 |       "cell_type": "markdown",
 20 |       "metadata": {
 21 |         "id": "view-in-github",
 22 |         "colab_type": "text"
 23 |       },
 24 |       "source": [
 25 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/oTranscribe_txt_to_srt_%E6%A0%BC%E5%BC%8F%E8%BD%89%E6%8F%9B.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 26 |       ]
 27 |     },
 28 |     {
 29 |       "cell_type": "markdown",
 30 |       "metadata": {
 31 |         "id": "wfFfVeSgXqTT"
 32 |       },
 33 |       "source": [
 34 |         "# Convert an oTranscribe txt export to srt format\r\n",
 35 |         "\r\n",
 36 |         "srt is the SubRip (.srt) format, which can be used for YouTube closed captions (CC)."
 37 |       ]
 38 |     },
 39 |     {
 40 |       "cell_type": "code",
 41 |       "metadata": {
 42 |         "cellView": "form",
 43 |         "id": "T9JBJrZYZbZe"
 44 |       },
 45 |       "source": [
 46 |         "#@title 需求模組預載\r\n",
 47 |         "#@markdown 此區塊一定要執行\r\n",
 48 |         "\r\n",
 49 |         "from google.colab import files\r\n",
 50 |         "import re"
 51 |       ],
 52 |       "execution_count": null,
 53 |       "outputs": []
 54 |     },
 55 |     {
 56 |       "cell_type": "code",
 57 |       "metadata": {
 58 |         "cellView": "form",
 59 |         "id": "MkCN-ubYYYSe"
 60 |       },
 61 |       "source": [
 62 |         "#@title 上傳檔案\r\n",
 63 |         "uploaded = files.upload()"
 64 |       ],
 65 |       "execution_count": null,
 66 |       "outputs": []
 67 |     },
 68 |     {
 69 |       "cell_type": "code",
 70 |       "metadata": {
 71 |         "cellView": "form",
 72 |         "id": "1UYrae9gEnBI"
 73 |       },
 74 |       "source": [
 75 |         "#@title 環境設定\r\n",
 76 |         "\r\n",
 77 |         "#@markdown 上傳檔案名稱\r\n",
 78 |         "input_filename='1.txt' #@param {type:\"string\"} \r\n",
 79 |         "\r\n",
 80 |         "#@markdown 輸出檔案名稱\r\n",
 81 |         "output_filename=\"srt_output.txt\" #@param {type:\"string\"}"
 82 |       ],
 83 |       "execution_count": null,
 84 |       "outputs": []
 85 |     },
 86 |     {
 87 |       "cell_type": "code",
 88 |       "metadata": {
 89 |         "cellView": "form",
 90 |         "id": "MzZEky3BHApL"
 91 |       },
 92 |       "source": [
 93 |         "#@title 執行格式轉換\r\n",
 94 |         "with open(input_filename,'r',encoding='utf-8') as f:\r\n",
 95 |         "  text=f.read().replace(\"\\xa0\",' ')\r\n",
 96 |         "\r\n",
 97 |         "re_patten=r'([0-9:]+)\\s{0,2}(.*)\\s?\\n'\r\n",
 98 |         "aa=re.findall(re_patten,text)\r\n",
 99 |         "content=''\r\n",
100 |         "end=len(aa)-1\r\n",
101 |         "for idx,itm in enumerate(aa):\r\n",
102 |         "  if len(itm)==0:\r\n",
103 |         "    continue\r\n",
104 |         "  content+=\"%d\\n\"%(idx+1)\r\n",
105 |         "  if idx != end:\r\n",
106 |         "    content+=\"00:{} --> 00:{}\\n\".format(itm[0],aa[idx+1][0])\r\n",
107 |         "  else:\r\n",
108 |         "    tmp=itm[0].split(\":\")\r\n",
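109 |         "    total=(int(tmp[-2])*60 if len(tmp)>1 else 0)+int(tmp[-1])+5  # last caption lasts 5 seconds\r\n",
110 |         "    tmp[-2:]=[\"%02d\"%(total//60),\"%02d\"%(total%60)]  # carry seconds past 59 into minutes\r\n",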
111 |         "    content+=\"00:{} --> 00:{}\\n\".format(itm[0],\":\".join(tmp))\r\n",
112 |         "  content+=\"%s\\n\\n\"%itm[1]\r\n",
113 |         " \r\n",
114 |         "with open(output_filename,\"w\",encoding='utf-8') as f:\r\n",
115 |         "  f.write(content)\r\n",
116 |         "  "
117 |       ],
118 |       "execution_count": null,
119 |       "outputs": []
120 |     },
121 |     {
122 |       "cell_type": "code",
123 |       "metadata": {
124 |         "cellView": "form",
125 |         "id": "QyeMdPh5HZJA"
126 |       },
127 |       "source": [
128 |         "#@title 下載檔案\r\n",
129 |         "from google.colab import files\r\n",
130 |         "files.download(output_filename)"
131 |       ],
132 |       "execution_count": null,
133 |       "outputs": []
134 |     }
135 |   ]
136 | }


--------------------------------------------------------------------------------
/whisper_Test.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "provenance": [],
  7 |       "gpuType": "T4",
  8 |       "mount_file_id": "1z5py79AseIPWuKFO0oZ95uo4t3nA_0iG",
  9 |       "authorship_tag": "ABX9TyPUgpFfyXfEk66r4MgTLpTv",
 10 |       "include_colab_link": true
 11 |     },
 12 |     "kernelspec": {
 13 |       "name": "python3",
 14 |       "display_name": "Python 3"
 15 |     },
 16 |     "language_info": {
 17 |       "name": "python"
 18 |     },
 19 |     "accelerator": "GPU"
 20 |   },
 21 |   "cells": [
 22 |     {
 23 |       "cell_type": "markdown",
 24 |       "metadata": {
 25 |         "id": "view-in-github",
 26 |         "colab_type": "text"
 27 |       },
 28 |       "source": [
 29 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/whisper_Test.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 30 |       ]
 31 |     },
 32 |     {
 33 |       "cell_type": "markdown",
 34 |       "source": [
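 35 |         "# Speech-to-text AI tool\n",
 36 |         "This tool uses OpenAI's open-source [Whisper](https://github.com/openai/whisper) model, which can transcribe speech to text with relatively high accuracy.\n",
 37 |         "\n",
 38 |         "# (I) Choose a suitable runtime: T4 GPU\n",
 39 |         "This Colab virtual machine is a free environment offering several GPU types; the T4 GPU is already selected.\n",
 40 |         "\n",
 41 |         "If you opened this notebook directly from GitHub, you can ignore this note."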
 42 |       ],
 43 |       "metadata": {
 44 |         "id": "Z8j9agRoP2Ef"
 45 |       }
 46 |     },
 47 |     {
 48 |       "cell_type": "code",
 49 |       "execution_count": null,
 50 |       "metadata": {
 51 |         "id": "ESQe_Qm7Ceoz"
 52 |       },
 53 |       "outputs": [],
 54 |       "source": [
 55 |         "# @title (1) 安裝 whisper\n",
 56 |         "!pip install git+https://github.com/openai/whisper.git"
 57 |       ]
 58 |     },
 59 |     {
 60 |       "cell_type": "markdown",
 61 |       "source": [
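 62 |         "### (2) Mount Google Drive\n",
 63 |         "1. Mount the drive from the panel on the left side of Colab\n",
 64 |         "2. Upload the audio/video file to Google Drive\n",
 65 |         "  - I prefer to create a tmp folder in Google Drive\n",
 66 |         "  - Upload the audio file to the tmp folder\n",
 67 |         "  - In the mounted-drive panel on the left side of Colab, find drive => MyDrive => tmp\n",
 68 |         "  - Right-click the uploaded audio file and choose Copy path\n",
 69 |         "3. Paste the copied path into the filename field of the conversion cell\n"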
 70 |       ],
 71 |       "metadata": {
 72 |         "id": "vj1rk1zOKoh7"
 73 |       }
 74 |     },
 75 |     {
 76 |       "cell_type": "code",
 77 |       "source": [
 78 |         "# @title (3) 轉檔\n",
 79 |         "import os\n",
 80 |         "filename = \"/content/drive/MyDrive/tmp/phison.mp4\" # @param {type:\"string\"}\n",
 81 |         "#@markdown Model to use; see the [Whisper Model Card](https://github.com/openai/whisper/blob/main/model-card.md) to choose a suitable one\n",
 82 |         "model= \"medium\" # @param {type:\"string\"}\n",
 83 |         "\n",
 84 |         "#@markdown Primary language, e.g. Chinese or English; for others, see [tokenizer.py](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py)\n",
 85 |         "language = \"English\" # @param {type:\"string\"}\n",
 86 |         "os.chdir(os.path.dirname(filename))\n",
 87 |         "os.getcwd()\n",
 88 |         "!whisper \"{filename}\" --model {model} --language {language}"
 89 |       ],
 90 |       "metadata": {
 91 |         "id": "t1k2RIWHDfhz",
 92 |         "cellView": "form"
 93 |       },
 94 |       "execution_count": null,
 95 |       "outputs": []
 96 |     },
 97 |     {
 98 |       "cell_type": "markdown",
 99 |       "source": [
100 |         "# (II) Get an earnings-call transcript\n",
101 |         "## 1. First run step (1) to install whisper\n",
102 |         "## 2. Then run step (4), the earnings-call transcript, ...."
103 |       ],
104 |       "metadata": {
105 |         "id": "OjboOjHQ6qOJ"
106 |       }
107 |     },
108 |     {
109 |       "cell_type": "code",
110 |       "source": [
111 |         "# @title (4) 法說會逐字稿，當影片在 Youtube 可直接使用這個\n",
112 |         "!pip install yt-dlp\n",
113 |         "\n",
114 |         "tubeUrl = \"https://www.youtube.com/watch?v=Q6sI_eY6sdU\" # @param {type:\"string\"}\n",
115 |         "import os\n",
116 |         "from yt_dlp import YoutubeDL\n",
117 |         "companyName=\"科技小電報\" # @param {type:\"string\"}\n",
118 |         "model= \"large\" # @param {type:\"string\"}\n",
119 |         "language = \"Chinese\" # @param {type:\"string\"}\n",
120 |         "\n",
121 |         "\n",
122 |         "\n",
123 |         "filename = companyName+\".m4a\"\n",
124 |         "ydl_opts = {'overwrites': True, 'format': 'bestaudio[ext=m4a]', 'outtmpl': filename}\n",
125 |         "with YoutubeDL(ydl_opts) as ydl:\n",
126 |         "    ydl.download([tubeUrl])\n",
127 |         "\n",
128 |         "!whisper \"{filename}\" --model {model} --language {language}\n",
129 |         "\n",
130 |         "from google.colab import files\n",
131 |         "exts=[\"txt\",\"srt\",\"tsv\",\"vtt\"]\n",
132 |         "for ext in exts:\n",
133 |         "  files.download('{}.{}'.format(companyName,ext))"
134 |       ],
135 |       "metadata": {
136 |         "id": "2kNMQIdNgS48",
137 |         "cellView": "form"
138 |       },
139 |       "execution_count": null,
140 |       "outputs": []
141 |     }
142 |   ]
143 | }


--------------------------------------------------------------------------------
/youtuber_逐字稿.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "name": "youtuber 逐字稿",
  7 |       "provenance": [],
  8 |       "mount_file_id": "1PSXhAQ7DI5i_3836QB96XgGK7vi9kJ5z",
  9 |       "authorship_tag": "ABX9TyPv/Hov1nyluiLhL1f4L3FD",
 10 |       "include_colab_link": true
 11 |     },
 12 |     "kernelspec": {
 13 |       "name": "python3",
 14 |       "display_name": "Python 3"
 15 |     },
 16 |     "language_info": {
 17 |       "name": "python"
 18 |     }
 19 |   },
 20 |   "cells": [
 21 |     {
 22 |       "cell_type": "markdown",
 23 |       "metadata": {
 24 |         "id": "view-in-github",
 25 |         "colab_type": "text"
 26 |       },
 27 |       "source": [
 28 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/youtuber_%E9%80%90%E5%AD%97%E7%A8%BF.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 29 |       ]
 30 |     },
 31 |     {
 32 |       "cell_type": "code",
 33 |       "source": [
 34 |         "!pip install pytube\n",
 35 |         "!pip install inlp\n",
 36 |         "!pip install speechrecognition\n",
 37 |         "#@markdown 安裝必要套件\n",
 38 |         "\n",
 39 |         "#!pip install you-get"
 40 |       ],
 41 |       "metadata": {
 42 |         "id": "gTr19YDIVkKC",
 43 |         "cellView": "form"
 44 |       },
 45 |       "execution_count": null,
 46 |       "outputs": []
 47 |     },
 48 |     {
 49 |       "cell_type": "code",
 50 |       "execution_count": null,
 51 |       "metadata": {
 52 |         "id": "i0rNL4f_VV3c",
 53 |         "cellView": "form"
 54 |       },
 55 |       "outputs": [],
 56 |       "source": [
 57 |         "#@title 輸入 Youtube 網址\n",
 58 |         "import os\n",
 59 |         "from pytube import YouTube\n",
 60 |         "\n",
 61 |         "url = \"https://www.youtube.com/watch?v=Jp8xnYRhWnw\" #@param {type:\"string\"}\n",
 62 |         "\n",
 63 |         "def onProgress(stream, chunk, remains):\n",
 64 |         "    total = stream.filesize\n",
 65 |         "    percent = (total-remains) / total * 100\n",
 66 |         "    print('下載中… {:05.2f}%'.format(percent), end='\\r')\n",
 67 |         "\n",
 68 |         "\n",
 69 |         "# yt.streams.filter().get_highest_resolution().download()\n",
 70 |         "\n",
 71 |         "yt=YouTube(url,on_progress_callback=onProgress)\n",
 72 |         "yfilename=yt.streams.filter(type=\"audio\").first().download()\n",
 73 |         "\n",
 74 |         "# filename=yt.streams.filter(type=\"audio\").first().default_filename\n",
 75 |         "\n",
 76 |         "musicname=yfilename[:yfilename.rfind(\".\")]\n",
 77 |         "# yt.streams.filter(type=\"audio\").first().download()\n",
 78 |         "print(\"完成下載： \",yfilename)\n",
 79 |         "\n",
 80 |         "# print(\"完成下載： \",yt.streams.first().download())\n",
 81 |         "# print(\"轉檔中........\")\n",
 82 |         "# os.system('{} -i \"{}\" -vn -sn -dn \"{}.mp3\"'.format(\"ffmpeg\",filename, musicname))\n",
 83 |         "# print(\"完成轉檔： {}.mp3\".format(musicname))\n",
 84 |         "\n",
 85 |         "\n",
 86 |         "# # from google.colab import files\n",
 87 |         "# # files.download(\"{}.mp4\".format(filename[:filename.rfind(\".\")]))\n",
 88 |         "# from google.colab import files\n",
 89 |         "# files.download(f\"{musicname}.mp3\")\n",
 90 |         "\n",
 91 |         "import os\n",
 92 |         "import shutil\n",
 93 |         "import speech_recognition as sr\n",
 94 |         "import concurrent.futures\n",
 95 |         "import wave\n",
 96 |         "import json\n",
 97 |         "import numpy as np\n",
 98 |         "from inlp.convert import chinese\n",
 99 |         "\n",
100 |         " \n",
101 |         "\n",
102 |         "mp3Name= yfilename\n",
103 |         " \n",
104 |         "CutTimeDef = 20 \n",
105 |         "wav_path='wav' \n",
106 |         "txt_path='txt' \n",
107 |         "thread_num = 10 \n",
108 |         "\n",
109 |         "workpath=os.path.dirname(mp3Name)\n",
110 |         "mp3Name=os.path.basename(mp3Name)\n",
111 |         "FileName = mp3Name[:mp3Name.rfind(\".\")]+\".wav\"\n",
112 |         "os.chdir(workpath)\n",
113 |         "chk='y' \n",
114 |         " \n",
115 |         "def reset_dir(path):\n",
116 |         "    try:\n",
117 |         "        os.mkdir(path)\n",
118 |         "    except Exception:\n",
119 |         "      if chk==\"y\":\n",
120 |         "        shutil.rmtree(path)\n",
121 |         "        os.mkdir(path)\n",
122 |         " \n",
123 |         "def CutFile(FileName, target_path):\n",
124 |         " \n",
125 |         "    # print(\"CutFile File Name is \", FileName)\n",
126 |         "    f = wave.open(FileName, \"rb\")\n",
127 |         "    params = f.getparams()    \n",
128 |         "    nchannels, sampwidth, framerate, nframes = params[:4]\n",
129 |         "    CutFrameNum = framerate * CutTimeDef\n",
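130 |         "    # Read the format information\n",
131 |         "    # getparams() returns all of the WAV file's format information at once as a tuple: number of channels, sample width (in bytes),\n",
132 |         "    # frame rate, number of frames, compression type, and compression description. The wave module only supports uncompressed data, so the last two can be ignored\n",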
133 |         " \n",
134 |         "    # print(\"CutFrameNum=%d\" % (CutFrameNum))\n",
135 |         "    # print(\"nchannels=%d\" % (nchannels))\n",
136 |         "    # print(\"sampwidth=%d\" % (sampwidth))\n",
137 |         "    # print(\"framerate=%d\" % (framerate))\n",
138 |         "    # print(\"nframes=%d\" % (nframes))\n",
139 |         " \n",
140 |         "    str_data = f.readframes(nframes)\n",
141 |         "    f.close()  # 將波形資料轉換成陣列\n",
142 |         "    # Cutnum =nframes/framerate/CutTimeDef\n",
143 |         "    # Convert the raw bytes into a numeric array, one row per frame and one column per channel (assumes 16-bit samples)\n",
144 |         "    wave_data = np.frombuffer(str_data, dtype=np.short)\n",
145 |         "    wave_data.shape = -1, nchannels\n",
146 |         "    wave_data = wave_data.T\n",
147 |         "    temp_data = wave_data.T\n",
148 |         "    # StepNum = int(nframes/200)\n",
149 |         "    StepNum = CutFrameNum\n",
150 |         "    StepTotalNum = 0\n",
151 |         "    haha = 0\n",
152 |         "    while StepTotalNum < nframes:\n",
153 |         "        # for j in range(int(Cutnum)):\n",
154 |         "        # print(\"Stemp=%d\" % (haha))\n",
155 |         "        SaveFile = \"%s-%03d.wav\" % (FileName[:-4], (haha+1))\n",
156 |         "        # print(FileName)\n",
157 |         "        if haha % 3==0:\n",
158 |         "          print(\"*\",end='')\n",
159 |         "        temp_dataTemp = temp_data[StepNum * (haha):StepNum * (haha + 1)]\n",
160 |         "        haha = haha + 1\n",
161 |         "        StepTotalNum = haha * StepNum\n",
162 |         "        temp_dataTemp.shape = 1, -1\n",
163 |         "        temp_dataTemp = temp_dataTemp.astype(np.short)  # 開啟WAV文件\n",
164 |         "        f = wave.open(target_path+\"/\" + SaveFile, \"wb\")\n",
165 |         "        # 配置聲道數、量化位數和取樣頻率\n",
166 |         "        f.setnchannels(nchannels)\n",
167 |         "        f.setsampwidth(sampwidth)\n",
168 |         "        f.setframerate(framerate)\n",
169 |         "        # 將wav_data轉換為二進位制資料寫入檔案\n",
170 |         "        f.writeframes(temp_dataTemp.tobytes())\n",
171 |         "        f.close()\n",
172 |         " \n",
173 |         "\n",
174 |         " \n",
175 |         "def texts_to_one(path, target_file):\n",
176 |         "    files = os.listdir(path)\n",
177 |         "    files.sort()\n",
178 |         "    files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n",
179 |         "    with open(target_file, \"w\", encoding=\"utf-8\") as f:\n",
180 |         "        for file in files:\n",
181 |         "            with open(file, \"r\", encoding='utf-8') as f2:\n",
182 |         "                f.write(f2.read())\n",
183 |         "    print(\"完成合併, 檔案位於 %s \" % target_file)\n",
184 |         " \n",
185 |         " \n",
186 |         "def texts2otr(path, target_file, audio_name, timeperiod):\n",
187 |         "    template = '''<p><span class=\"timestamp\" data-timestamp=\"{}.000000\">{}</span>{}</p><p><br/></p>\n",
188 |         "    '''\n",
189 |         "    files = os.listdir(path)\n",
190 |         "    files.sort()\n",
191 |         "    content = ''\n",
192 |         "    files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n",
193 |         "    with open(target_file, \"w\", encoding=\"utf-8\") as f:\n",
194 |         " \n",
195 |         "        for file in files:\n",
196 |         "            with open(file, \"r\", encoding=\"utf-8\") as f2:\n",
197 |         "                txt = f2.read().split(\"\\n\")\n",
198 |         "                if len(txt) < 2:\n",
199 |         "                    continue\n",
200 |         "                pos=txt[0].rfind(\".\")\n",
201 |         "                time=int(txt[0][pos-3:pos])\n",
202 |         "                # times = (int(txt[0].split(\"-\")[1][:-5])-1)*CutTimeDef\n",
203 |         "                times=(time-1)*CutTimeDef\n",
204 |         "                secs, mins = times % 60, (times//60) % 60\n",
205 |         "                hours = (times//60)//60\n",
206 |         "                timeF = \"{:02d}:{:02d}:{:02d}\".format(hours, mins, secs)\n",
207 |         "                content += template.format(times, timeF, txt[1])\n",
208 |         " \n",
209 |         "        output = {\"text\": content, \"media\": audio_name,\n",
210 |         "                  \"media-time\": timeperiod}\n",
211 |         "        f.write(json.dumps(output, ensure_ascii=False))\n",
212 |         "    print(\"完成合併, otr 檔案位於 %s \" % target_file)\n",
213 |         "   \n",
214 |         "#@title 執行音頻轉換與分割\n",
215 |         " \n",
216 |         "print(\" mp3 轉 wav 檔 \".center(100,'=')) \n",
217 |         "os.system('{} -i \"{}\" \"{}\"'.format(\"ffmpeg\",mp3Name, FileName))\n",
218 |         "print(\" Wav 檔名為 {} \".format(FileName).center(96))\n",
219 |         "reset_dir(wav_path)\n",
220 |         "reset_dir(txt_path)\n",
221 |         "# # Cut Wave Setting\n",
222 |         "\n",
223 |         "print(\" 音頻以每{}秒分割 \".format(CutTimeDef).center(94,'='))\n",
224 |         "CutFile(FileName, wav_path)\n",
225 |         "print(\"\")\n",
226 |         "print(\" 完成分割 \".center(100,'-'))\n",
227 |         "#@title 執行語音轉文字 (需要耗費不少時間)\n",
228 |         "\n",
229 |         "#@markdown 指定翻譯的語言類型，如何設定語系請參考 [支援列表](https://cloud.google.com/speech-to-text/docs/languages)\n",
230 |         "voiceLanguage=\"cmn-Hant-TW\" #@param {type:\"string\"}\n",
231 |         "\n",
232 |         "def VoiceToText_thread(file):\n",
233 |         "  txt_file = \"%s/%s.txt\" % (txt_path, file[:-4])\n",
234 |         "      \n",
235 |         "  if os.path.isfile(txt_file):\n",
236 |         "    return\n",
237 |         "  with open(\"%s/%s.txt\" % (txt_path, file[:-4]), \"w\", encoding=\"utf-8\") as f:\n",
238 |         "    f.write(\"%s:\\n\" % file)\n",
239 |         "    r = sr.Recognizer()  # 預設辨識英文\n",
240 |         "    with sr.AudioFile(wav_path+\"/\"+file) as source:  # read the wav file\n",
241 |         "      audio = r.record(source)\n",
242 |         "      # r.adjust_for_ambient_noise(source)\n",
243 |         "      # audio = r.listen(source)\n",
244 |         "    try:\n",
245 |         "      text = r.recognize_google(audio,language = voiceLanguage)\n",
246 |         "      text = chinese.s2t(text)\n",
247 |         "      # r.recognize_google(audio)\n",
248 |         "      \n",
249 |         "      if len(text) == 0:\n",
250 |         "        print(\"===無資料==\")\n",
251 |         "        return\n",
252 |         "\n",
253 |         "      print(f\"{file}\\t{text}\")\n",
254 |         "      f.write(\"%s \\n\\n\" % text)\n",
255 |         "      if file == files[-1]:\n",
256 |         "          print(\"結束翻譯\")\n",
257 |         "    except sr.RequestError as e:\n",
258 |         "      print(\"無法翻譯{0}\".format(e))\n",
259 |         "      # 兩個 except 是當語音辨識不出來的時候 防呆用的\n",
260 |         "      # 使用Google的服務\n",
261 |         "    except LookupError:\n",
262 |         "      print(\"Could not understand audio\")\n",
263 |         "    except sr.UnknownValueError:\n",
264 |         "      print(f\"Error: 無法識別 Audio\\t {file}\")\n",
265 |         "  \n",
266 |         "\n",
267 |         "\n",
268 |         "\n",
269 |         "files = os.listdir(wav_path)\n",
270 |         "files.sort()\n",
271 |         "\n",
272 |         "with concurrent.futures.ThreadPoolExecutor(max_workers=thread_num) as executor:\n",
273 |         "    executor.map(VoiceToText_thread, files)\n",
274 |         "  \n",
275 |         "# VoiceToText(wav_path, files, txt_path)\n",
276 |         " \n",
277 |         "target_txtfile = \"{}.txt\".format(FileName[:-4])\n",
278 |         "texts_to_one(txt_path, target_txtfile)\n",
279 |         "otr_file = \"{}.otr\".format(FileName[:-4])\n",
280 |         "with wave.open(FileName, \"rb\") as f:\n",
281 |         "    params = f.getparams()\n",
282 |         "texts2otr(txt_path, otr_file, FileName, params.nframes / params.framerate)"
283 |       ]
284 |     },
294 |     {
295 |       "cell_type": "code",
296 |       "source": [
297 |         "#@title 下載 otr\n",
298 |         "import os\n",
299 |         "import shutil\n",
300 |         "# Move the audio file to a specific directory\n",
301 |         "source_dir=\"/content/\"\n",
302 |         "target_dir=\"/content/drive/MyDrive/tmp/\"\n",
303 |         "def findotr(arr):\n",
304 |         "  for itm in arr[::-1]:\n",
305 |         "    if itm.endswith(\".otr\"):\n",
306 |         "      return itm\n",
307 |         "otrfilename=findotr(os.listdir(\".\"))\n",
308 |         "from google.colab import files\n",
309 |         "files.download(\"{}\".format(otrfilename))\n",
310 |         "# shutil.move(f\"{source_dir}{yfilename}\",target_dir)"
311 |       ],
312 |       "metadata": {
313 |         "colab": {
314 |           "base_uri": "https://localhost:8080/",
315 |           "height": 17
316 |         },
317 |         "id": "XseEE1QBiBQu",
318 |         "outputId": "7b4a5a66-f31b-4202-9ed8-c4c051161754",
319 |         "cellView": "form"
320 |       },
321 |       "execution_count": 14,
322 |       "outputs": [
323 |         {
324 |           "output_type": "display_data",
325 |           "data": {
326 |             "text/plain": [
327 |               "<IPython.core.display.Javascript object>"
328 |             ],
329 |             "application/javascript": [
330 |               "\n",
331 |               "    async function download(id, filename, size) {\n",
332 |               "      if (!google.colab.kernel.accessAllowed) {\n",
333 |               "        return;\n",
334 |               "      }\n",
335 |               "      const div = document.createElement('div');\n",
336 |               "      const label = document.createElement('label');\n",
337 |               "      label.textContent = `Downloading \"${filename}\": `;\n",
338 |               "      div.appendChild(label);\n",
339 |               "      const progress = document.createElement('progress');\n",
340 |               "      progress.max = size;\n",
341 |               "      div.appendChild(progress);\n",
342 |               "      document.body.appendChild(div);\n",
343 |               "\n",
344 |               "      const buffers = [];\n",
345 |               "      let downloaded = 0;\n",
346 |               "\n",
347 |               "      const channel = await google.colab.kernel.comms.open(id);\n",
348 |               "      // Send a message to notify the kernel that we're ready.\n",
349 |               "      channel.send({})\n",
350 |               "\n",
351 |               "      for await (const message of channel.messages) {\n",
352 |               "        // Send a message to notify the kernel that we're ready.\n",
353 |               "        channel.send({})\n",
354 |               "        if (message.buffers) {\n",
355 |               "          for (const buffer of message.buffers) {\n",
356 |               "            buffers.push(buffer);\n",
357 |               "            downloaded += buffer.byteLength;\n",
358 |               "            progress.value = downloaded;\n",
359 |               "          }\n",
360 |               "        }\n",
361 |               "      }\n",
362 |               "      const blob = new Blob(buffers, {type: 'application/binary'});\n",
363 |               "      const a = document.createElement('a');\n",
364 |               "      a.href = window.URL.createObjectURL(blob);\n",
365 |               "      a.download = filename;\n",
366 |               "      div.appendChild(a);\n",
367 |               "      a.click();\n",
368 |               "      div.remove();\n",
369 |               "    }\n",
370 |               "  "
371 |             ]
372 |           },
373 |           "metadata": {}
374 |         },
375 |         {
376 |           "output_type": "display_data",
377 |           "data": {
378 |             "text/plain": [
379 |               "<IPython.core.display.Javascript object>"
380 |             ],
381 |             "application/javascript": [
382 |               "download(\"download_c235d0ba-d21a-43be-ad8a-cb63194b34f6\", \"\\u5b64\\u7368\\u7684\\u6211\\u662f\\u5e78\\u798f\\u7684\\uff5c\\u6587\\u68ee\\u8aaa\\u66f8.otr\", 13904)"
383 |             ]
384 |           },
385 |           "metadata": {}
386 |         }
387 |       ]
388 |     },
389 |     {
390 |       "cell_type": "code",
391 |       "source": [
392 |         "#@title Download audio file\n",
393 |         "from google.colab import files\n",
394 |         "files.download(\"{}\".format(yfilename))"
395 |       ],
396 |       "metadata": {
397 |         "colab": {
398 |           "base_uri": "https://localhost:8080/",
399 |           "height": 17
400 |         },
401 |         "cellView": "form",
402 |         "id": "hUXzAOTfbvE2",
403 |         "outputId": "31939c83-611d-4e56-fa6c-7f97f136d77f"
404 |       },
405 |       "execution_count": 15,
406 |       "outputs": [
407 |         {
408 |           "output_type": "display_data",
409 |           "data": {
410 |             "text/plain": [
411 |               "<IPython.core.display.Javascript object>"
412 |             ],
413 |             "application/javascript": [
414 |               "\n",
415 |               "    async function download(id, filename, size) {\n",
416 |               "      if (!google.colab.kernel.accessAllowed) {\n",
417 |               "        return;\n",
418 |               "      }\n",
419 |               "      const div = document.createElement('div');\n",
420 |               "      const label = document.createElement('label');\n",
421 |               "      label.textContent = `Downloading \"${filename}\": `;\n",
422 |               "      div.appendChild(label);\n",
423 |               "      const progress = document.createElement('progress');\n",
424 |               "      progress.max = size;\n",
425 |               "      div.appendChild(progress);\n",
426 |               "      document.body.appendChild(div);\n",
427 |               "\n",
428 |               "      const buffers = [];\n",
429 |               "      let downloaded = 0;\n",
430 |               "\n",
431 |               "      const channel = await google.colab.kernel.comms.open(id);\n",
432 |               "      // Send a message to notify the kernel that we're ready.\n",
433 |               "      channel.send({})\n",
434 |               "\n",
435 |               "      for await (const message of channel.messages) {\n",
436 |               "        // Send a message to notify the kernel that we're ready.\n",
437 |               "        channel.send({})\n",
438 |               "        if (message.buffers) {\n",
439 |               "          for (const buffer of message.buffers) {\n",
440 |               "            buffers.push(buffer);\n",
441 |               "            downloaded += buffer.byteLength;\n",
442 |               "            progress.value = downloaded;\n",
443 |               "          }\n",
444 |               "        }\n",
445 |               "      }\n",
446 |               "      const blob = new Blob(buffers, {type: 'application/binary'});\n",
447 |               "      const a = document.createElement('a');\n",
448 |               "      a.href = window.URL.createObjectURL(blob);\n",
449 |               "      a.download = filename;\n",
450 |               "      div.appendChild(a);\n",
451 |               "      a.click();\n",
452 |               "      div.remove();\n",
453 |               "    }\n",
454 |               "  "
455 |             ]
456 |           },
457 |           "metadata": {}
458 |         },
459 |         {
460 |           "output_type": "display_data",
461 |           "data": {
462 |             "text/plain": [
463 |               "<IPython.core.display.Javascript object>"
464 |             ],
465 |             "application/javascript": [
466 |               "download(\"download_d7ea9f3e-65ed-4da1-8b0e-609894886090\", \"\\u5b64\\u7368\\u7684\\u6211\\u662f\\u5e78\\u798f\\u7684\\uff5c\\u6587\\u68ee\\u8aaa\\u66f8.mp4\", 4164951)"
467 |             ]
468 |           },
469 |           "metadata": {}
470 |         }
471 |       ]
472 |     }
473 |   ]
474 | }
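
The "Download otr" cell above picks the transcript by scanning the working directory for an `.otr` file. A minimal sketch of a stricter lookup, assuming the most recently modified `.otr` file is the one wanted; `find_latest_otr` is a hypothetical helper and not part of the notebook:

```python
import os


def find_latest_otr(directory="."):
    """Return the name of the most recently modified .otr file, or None."""
    candidates = [
        name for name in os.listdir(directory)
        if name.lower().endswith(".otr")
        and os.path.isfile(os.path.join(directory, name))
    ]
    if not candidates:
        return None  # caller decides what to do when nothing was produced
    return max(
        candidates,
        key=lambda name: os.path.getmtime(os.path.join(directory, name)),
    )
```

Returning `None` explicitly lets the calling cell skip the download instead of asking Colab for a file literally named "None".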


--------------------------------------------------------------------------------
/台股_Q1~Q3_EPS_抓取.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |   "nbformat": 4,
 3 |   "nbformat_minor": 0,
 4 |   "metadata": {
 5 |     "colab": {
 6 |       "name": "台股 Q1~Q3 EPS 抓取",
 7 |       "provenance": [],
 8 |       "collapsed_sections": [],
 9 |       "authorship_tag": "ABX9TyPn0uCcBvxxCRI0YDMGbTl3",
10 |       "include_colab_link": true
11 |     },
12 |     "kernelspec": {
13 |       "name": "python3",
14 |       "display_name": "Python 3"
15 |     },
16 |     "language_info": {
17 |       "name": "python"
18 |     }
19 |   },
20 |   "cells": [
21 |     {
22 |       "cell_type": "markdown",
23 |       "metadata": {
24 |         "id": "view-in-github",
25 |         "colab_type": "text"
26 |       },
27 |       "source": [
28 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/%E5%8F%B0%E8%82%A1_Q1~Q3_EPS_%E6%8A%93%E5%8F%96.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
29 |       ]
30 |     },
31 |     {
32 |       "cell_type": "markdown",
33 |       "source": [
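34 |         "# Automatically fetch Q1~Q3 EPS for Taiwan stocks\n",
35 |         "\n",
36 |         "Text file format: separate the stock code and the name with a tab, one stock per line.\n",
37 |         "\n",
38 |         "You can paste the rows directly from Excel into the txt file.\n",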
39 |         "\n",
40 |         "---\n",
41 |         "2330 台積電"
42 |       ],
43 |       "metadata": {
44 |         "id": "lCqeZiJAPLwc"
45 |       }
46 |     },
47 |     {
48 |       "cell_type": "code",
49 |       "source": [
50 |         "import requests\n",
51 |         "import concurrent.futures\n",
52 |         "from bs4 import BeautifulSoup, UnicodeDammit\n",
53 |         "import pandas as pd\n",
54 |         "\n",
55 |         "def addeps(stockid):\n",
56 |         "  \n",
57 |         "  url=url_template.format(stockid)\n",
58 |         "  reg=requests.get(url)\n",
59 |         "  soup=BeautifulSoup(reg.text)\n",
60 |         "  stockData=[]\n",
61 |         "  for itm in soup.find(\"section\",id=\"qsp-eps-table\").find_all(\"span\",class_=\"\")[1:7:2]:\n",
62 |         "    stockData.insert(0,itm.getText())\n",
63 |         "  \n",
64 |         "  stockids.get(stockid).extend(stockData)\n",
65 |         "\n",
66 |         "\n",
67 |         "mytest=list()\n",
68 |         "#@markdown 用 txt 檔存股票代碼\n",
69 |         "stock_txt=\"/content/stock_id_name.txt\" #@param {type:'string'} \n",
70 |         "with open(stock_txt,'rb') as f:\n",
71 |         "  encode=UnicodeDammit(f.read()).original_encoding\n",
72 |         "with open(stock_txt,\"r\",encoding=encode) as f:\n",
73 |         "  content=[itm.split() for itm in f.read().splitlines()]\n",
74 |         "\n",
75 |         "'''\n",
76 |         "Convert the parsed rows to a dictionary keyed by stock code\n",
77 |         "Each value is that row's list [code, name], which addeps extends with the EPS values\n",
78 |         "''' \n",
79 |         "# create Stock_id data\n",
80 |         "stockids = { i[0] : i for i in content }\n",
81 |         "url_template=\"https://tw.stock.yahoo.com/quote/{}/eps\" \n",
82 |         "with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:\n",
83 |         "  executor.map(addeps,stockids)\n",
84 |         "df=pd.DataFrame(stockids.values(),columns=[\"股票代碼\",\"股票名稱\",\"Q1 EPS\",\"Q2 EPS\",\"Q3 EPS\"])\n",
85 |         "df.to_excel(\"TWstocks_EPS.xlsx\",index=False)\n",
86 |         "df\n",
87 |         "\n",
88 |         "  "
89 |       ],
90 |       "metadata": {
91 |         "id": "6hjtu_OMGJud"
92 |       },
93 |       "execution_count": null,
94 |       "outputs": []
95 |     }
96 |   ]
97 | }
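
The EPS notebook above calls `executor.map(addeps, stockids)` without reading the returned iterator, so an exception raised inside a worker (for example, a page without the EPS table) is never surfaced. A minimal sketch of one way to collect results and failures explicitly; `parse_stock_lines`, `collect_eps`, and the `fetch_eps` callable are hypothetical names introduced for illustration, and the real scraping logic is assumed to live in `fetch_eps`:

```python
import concurrent.futures


def parse_stock_lines(lines):
    """Map stock code -> [code, name] from tab- or space-separated lines."""
    rows = {}
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace only
        if len(parts) == 2:
            rows[parts[0]] = [parts[0], parts[1].strip()]
    return rows


def collect_eps(rows, fetch_eps, max_workers=10):
    """Call fetch_eps(code) in worker threads; return (rows, errors)."""
    errors = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch_eps, code): code for code in rows}
        for future in concurrent.futures.as_completed(futures):
            code = futures[future]
            try:
                # Rows are only modified here, in the calling thread.
                rows[code].extend(future.result())
            except Exception as exc:  # keep going, but remember what failed
                errors[code] = exc
    return rows, errors
```

Stocks that fail keep their `[code, name]` row without EPS values, and `errors` records why, so the caller can decide whether to retry or drop them before building the DataFrame.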


--------------------------------------------------------------------------------
/技術議題關鍵字擴展.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "name": "技術議題關鍵字擴展.ipynb",
  7 |       "provenance": [],
  8 |       "collapsed_sections": [],
  9 |       "authorship_tag": "ABX9TyO5GIToPuAyaKLz4lnvSqaj",
 10 |       "include_colab_link": true
 11 |     },
 12 |     "kernelspec": {
 13 |       "name": "python3",
 14 |       "display_name": "Python 3"
 15 |     },
 16 |     "language_info": {
 17 |       "name": "python"
 18 |     }
 19 |   },
 20 |   "cells": [
 21 |     {
 22 |       "cell_type": "markdown",
 23 |       "metadata": {
 24 |         "id": "view-in-github",
 25 |         "colab_type": "text"
 26 |       },
 27 |       "source": [
 28 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/%E6%8A%80%E8%A1%93%E8%AD%B0%E9%A1%8C%E9%97%9C%E9%8D%B5%E5%AD%97%E6%93%B4%E5%B1%95.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 29 |       ]
 30 |     },
 31 |     {
 32 |       "cell_type": "code",
 33 |       "metadata": {
 34 |         "id": "hQN-VXYIGSKX",
 35 |         "cellView": "form"
 36 |       },
 37 |       "source": [
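 38 |         "#@title Keyword expansion\n",
 39 |         "#@markdown Keyword expansion uses the public data of the Government Research Bulletin (grb.gov.tw) for ROC years 105-110 (2016-2021) as its dataset; currently only Chinese keywords are supported\n",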
 40 |         "\n",
 41 |         "import requests\n",
 42 |         "\n",
 43 |         "print(\"Loading data .......\")\n",
 44 |         "url=\"https://raw.githubusercontent.com/reic/colab_python/main/data/\"\n",
 45 |         "fnames=[\"GRB_105.txt\",\"GRB_106.txt\",\"GRB_107.txt\",\"GRB_108.txt\",\"GRB_109.txt\",\"GRB_110.txt\"]\n",
 46 |         "\n",
 47 |         "#@markdown Keyword\n",
 48 |         "print(f\"{len(fnames)} data sources\")\n",
 49 |         "inputword = \"\\u865B\\u64EC\\u5BE6\\u5883\" #@param {type:\"string\"}\n",
 50 |         "#@markdown Number of expanded keywords to list\n",
 51 |         "extendNumber =20 #@param {type:\"number\"}\n",
 52 |         "content=[]\n",
 53 |         "\n",
 54 |         "for fname in fnames:\n",
 55 |         "  # print(f\"Downloading data file {fname} from GitHub\")\n",
 56 |         "  reg=requests.get(f\"{url}{fname}\")\n",
 57 |         "  content.extend(reg.text.splitlines())\n",
 58 |         "print(\"=== Data loaded \".ljust(100,\"=\"))\n",
 59 |         "print(\"\")\n",
 60 |         "projectData = {}\n",
 61 |         "for itm in content:\n",
 62 |         "  # print(itm)\n",
 63 |         "  [id, keyword] = itm.split('\\t')\n",
 64 |         "  projectData[id] = keyword.split(\"：\")\n",
 65 |         "# print(projectData)\n",
 66 |         "\n",
 67 |         "inputword = inputword.lower()\n",
 68 |         "keywords = []\n",
 69 |         "projects = []\n",
 70 |         "for itm in content:\n",
 71 |         "    if inputword in itm.lower():\n",
 72 |         "        pro = itm.split(\"\\t\")\n",
 73 |         "        projects.append(pro[0])\n",
 74 |         "        getkeywordset = pro[1].split(\"：\")\n",
 75 |         "        keywords.extend(getkeywordset)\n",
 76 |         "\n",
 77 |         "# # print(len(keywords))\n",
 78 |         "# # print(len(list(set(keywords))))\n",
 79 |         "# # print(len(projects))\n",
 80 |         "# # print(projectData[projects[0]])\n",
 81 |         "uniqueKeywordCount = dict.fromkeys(keywords, 0)\n",
 82 |         "for itm in uniqueKeywordCount:\n",
 83 |         "    uniqueKeywordCount[itm] = keywords.count(itm)\n",
 84 |         "\n",
 85 |         "keywordExtend=[]\n",
 86 |         "for key,value in uniqueKeywordCount.items():\n",
 87 |         "  # if int(value) <2:    \n",
 88 |         "  #   continue\n",
 89 |         "  #   print(value)\n",
 90 |         "  keywordExtend.append([key,value])\n",
 91 |         "keywordExtend.sort(key=lambda x:x[1],reverse=True)\n",
 92 |         "\n",
 93 |         "if extendNumber > len(keywordExtend):\n",
 94 |         "  extendNumber=len(keywordExtend)\n",
 95 |         "\n",
 96 |         "for itm in keywordExtend[:extendNumber]:\n",
 97 |         "  print(f\"{itm[0]:15s}\\t{itm[1]}\")\n",
 98 |         "\n",
 99 |         "\n"
100 |       ],
101 |       "execution_count": null,
102 |       "outputs": []
103 |     },
104 |     {
105 |       "cell_type": "code",
106 |       "source": [
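107 |         "#@title Industry trend keyword exploration\n",
108 |         "#@markdown The keyword data is provided by 科技產業資訊室 (iknow.stpi.narl.org.tw). The keywords are defined by iKnow, so you can see how a given keyword co-occurs with other industry keywords.\n",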
109 |         "\n",
110 |         "import requests\n",
111 |         "\n",
112 |         "print(\"Loading data .......\")\n",
113 |         "url=\"https://raw.githubusercontent.com/reic/colab_python/main/data/\"\n",
114 |         "fnames=[\"iKnow_2017-2021.txt\"]\n",
115 |         "\n",
116 |         "#@markdown Keyword\n",
117 |         "print(f\"{len(fnames)} data sources\")\n",
118 |         "inputword = \"Apple\" #@param {type:\"string\"}\n",
119 |         "\n",
120 |         "#@markdown Number of expanded keywords to list\n",
121 |         "extendNumber =20 #@param {type:\"number\"}\n",
122 |         "content=[]\n",
123 |         "\n",
124 |         "for fname in fnames:\n",
125 |         "  # print(f\"Downloading data file {fname} from GitHub\")\n",
126 |         "  reg=requests.get(f\"{url}{fname}\")\n",
127 |         "  content.extend(reg.text.splitlines())\n",
128 |         "print(\"=== Data loaded \".ljust(100,\"=\"))\n",
129 |         "print(\"\")\n",
130 |         "projectData = {}\n",
131 |         "for itm in content:\n",
132 |         "  # print(itm)\n",
133 |         "  [id, keyword] = itm.split('\\t')\n",
134 |         "  projectData[id] = keyword.split(\"：\")\n",
135 |         "# print(projectData)\n",
136 |         "\n",
137 |         "inputword = inputword.lower()\n",
138 |         "keywords = []\n",
139 |         "projects = []\n",
140 |         "for itm in content:\n",
141 |         "    if inputword in itm.lower():\n",
142 |         "        pro = itm.split(\"\\t\")\n",
143 |         "        projects.append(pro[0])\n",
144 |         "        getkeywordset = pro[1].split(\"：\")\n",
145 |         "        keywords.extend(getkeywordset)\n",
146 |         "\n",
147 |         "# # print(len(keywords))\n",
148 |         "# # print(len(list(set(keywords))))\n",
149 |         "# # print(len(projects))\n",
150 |         "# # print(projectData[projects[0]])\n",
151 |         "uniqueKeywordCount = dict.fromkeys(keywords, 0)\n",
152 |         "for itm in uniqueKeywordCount:\n",
153 |         "    uniqueKeywordCount[itm] = keywords.count(itm)\n",
154 |         "\n",
155 |         "keywordExtend=[]\n",
156 |         "for key,value in uniqueKeywordCount.items():\n",
157 |         "  # if int(value) <2:    \n",
158 |         "  #   continue\n",
159 |         "  #   print(value)\n",
160 |         "  keywordExtend.append([key,value])\n",
161 |         "keywordExtend.sort(key=lambda x:x[1],reverse=True)\n",
162 |         "\n",
163 |         "if extendNumber > len(keywordExtend):\n",
164 |         "  extendNumber=len(keywordExtend)\n",
165 |         "\n",
166 |         "for itm in keywordExtend[:extendNumber]:\n",
167 |         "  print(f\"{itm[0]:15s}\\t{itm[1]}\")\n",
168 |         "\n",
169 |         "\n"
170 |       ],
171 |       "metadata": {
172 |         "cellView": "form",
173 |         "id": "mcHn3ncMl-en"
174 |       },
175 |       "execution_count": null,
176 |       "outputs": []
177 |     }
178 |   ]
179 | }
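
Both cells in the keyword notebook above count co-occurring keywords by calling `keywords.count(itm)` once per unique keyword, which rescans the whole list each time. A minimal sketch of the same counting with `collections.Counter`, assuming the `id<TAB>kw1：kw2：...` line format used by the data files; `expand_keyword` is a hypothetical name, not a function from the notebook:

```python
from collections import Counter


def expand_keyword(lines, query, top_n=20, sep="："):
    """Count keywords that co-occur with query in 'id<TAB>kw1：kw2' lines."""
    query = query.lower()
    counts = Counter()
    for line in lines:
        project_id, tab, keyword_field = line.partition("\t")
        if not tab or query not in line.lower():
            continue  # skip malformed lines and lines without the query
        counts.update(kw for kw in keyword_field.split(sep) if kw)
    # most_common returns (keyword, count) pairs, highest count first
    return counts.most_common(top_n)
```

`partition` also tolerates blank or tab-less lines, which would raise a `ValueError` in the notebook's `[id, keyword] = itm.split('\t')` unpacking.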


--------------------------------------------------------------------------------
/英文單字計算.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "provenance": [],
  7 |       "collapsed_sections": [],
  8 |       "mount_file_id": "1z6G-heUGHnpYJkvVwXvsAnv3P2QRc19r",
  9 |       "authorship_tag": "ABX9TyM8taDxTDLlejhnxOuDq0au",
 10 |       "include_colab_link": true
 11 |     },
 12 |     "kernelspec": {
 13 |       "name": "python3",
 14 |       "display_name": "Python 3"
 15 |     },
 16 |     "language_info": {
 17 |       "name": "python"
 18 |     }
 19 |   },
 20 |   "cells": [
 21 |     {
 22 |       "cell_type": "markdown",
 23 |       "metadata": {
 24 |         "id": "view-in-github",
 25 |         "colab_type": "text"
 26 |       },
 27 |       "source": [
 28 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/%E8%8B%B1%E6%96%87%E5%96%AE%E5%AD%97%E8%A8%88%E7%AE%97.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 29 |       ]
 30 |     },
 31 |     {
 32 |       "cell_type": "code",
 33 |       "execution_count": 2,
 34 |       "metadata": {
 35 |         "id": "QxEVE_SZC_Bc",
 36 |         "cellView": "form"
 37 |       },
 38 |       "outputs": [],
 39 |       "source": [
 40 |         "#@title Paste the English words annotated in Zotero into context; they are saved to wordtank.txt\n",
 41 |         "#@markdown Mount Google Drive first; wordtank is stored on Google Drive\n",
 42 |         "\n",
 43 |         "import re\n",
 44 |         "from nltk.stem import PorterStemmer\n",
 45 |         "import os\n",
 46 |         "\n",
 47 |         "def checkFileexist(path,file):\n",
 48 |         "  fileWithPath=\"{}/{}\".format(path,file)\n",
 49 |         "  if os.path.isfile(fileWithPath):\n",
 50 |         "    return\n",
 51 |         "  with open(fileWithPath,mode=\"w\",encoding='utf-8') as f:\n",
 52 |         "    f.write(\"fromReic\")\n",
 53 |         "\n",
 54 |         "def stemworkcheck(worda,wordb):\n",
 55 |         "  if len(wordb)>len(worda):\n",
 56 |         "    return worda\n",
 57 |         "  return wordb\n",
 58 |         "  \n",
 59 |         "context = \"\\u201Celectrolyte\\u201D ([Wang \\u7B49\\u3002, 2022, p. 1](zotero://select/library/items/VHI4XUQN)) ([pdf](zotero://open-pdf/library/items/3N7DAHAP?page=1&annotation=IG6RK45L)) electrolyte   \\u82F1 [\\u026A\\u02C8lektr\\u0259la\\u026At]   \\u7F8E [\\u026A\\u02C8lektr\\u0259la\\u026At]   n. \\u7535\\u89E3\\u6DB2\\uFF0C\\u7535\\u89E3\\u8D28\\uFF1B\\u7535\\u89E3   [ \\u590D\\u6570 electrolytes ]  \\u201Ccontradictions\\u201D ([Wang \\u7B49\\u3002, 2022, p. 2](zotero://select/library/items/VHI4XUQN)) ([pdf](zotero://open-pdf/library/items/3N7DAHAP?page=2&annotation=G3VLNTTD)) \\u77DB\\u76FE  \\u201Climbs\\u201D ([Wang \\u7B49\\u3002, 2022, p. 2](zotero://select/library/items/VHI4XUQN)) ([pdf](zotero://open-pdf/library/items/3N7DAHAP?page=2&annotation=UD9D4T95)) limbs   \\u82F1 [l\\u026Amz]   \\u7F8E [l\\u026Amz]   n. [\\u89E3\\u5256]\\u56DB\\u80A2\\uFF08limb \\u7684\\u590D\\u6570\\uFF09\" #@param {type:\"string\"}\n",
 60 |         "savedir = \"/content/drive/MyDrive/reic\" #@param {type:\"string\"}\n",
 61 |         "wordtank = \"wordtank.txt\" #@param {type:\"string\"}\n",
 62 |         "\n",
 63 |         "\n",
 64 |         "\n",
 65 |         "re_pattern=r\"“(\\w+)[, .]?”\"\n",
 66 |         "ps=PorterStemmer()\n",
 67 |         "req=re.findall(re_pattern, context)\n",
 68 |         "req=[itm.lower() for itm in req]\n",
 69 |         "\n",
 70 |         "checkFileexist(savedir,wordtank)\n",
 71 |         "\n",
 72 |         "with open(\"{}/{}\".format(savedir,wordtank),mode=\"a\",encoding=\"utf-8\") as f:\n",
 73 |         "  f.write(\", {}\".format(\", \".join(req)))  "
 74 |       ]
 75 |     },
 76 |     {
 77 |       "cell_type": "code",
 78 |       "source": [
 79 |         "#@title List the English words looked up most often\n",
 80 |         "toprank = 10 #@param {type:\"number\"}\n",
 81 |         "with open(\"{}/{}\".format(savedir,wordtank),mode=\"r\",encoding=\"utf-8\") as f:\n",
 82 |         "  content=f.read().split(\", \")\n",
 83 |         "\n",
 84 |         "word=dict()\n",
 85 |         "wordcount=dict()\n",
 86 |         "for itm in content:\n",
 87 |         "  wordstem=ps.stem(itm)\n",
 88 |         "  word[wordstem]=stemworkcheck(word.get(wordstem,itm.lower()),itm.lower())\n",
 89 |         "  wordcount[wordstem]=wordcount.get(wordstem,0)+1\n",
 90 |         "\n",
 91 |         "arr=sorted(wordcount.items(),key=lambda x:x[1],reverse=True)\n",
 92 |         "for key,value in arr[:toprank]:\n",
 93 |         "  print(word[key],value)"
 94 |       ],
 95 |       "metadata": {
 96 |         "colab": {
 97 |           "base_uri": "https://localhost:8080/"
 98 |         },
 99 |         "id": "aGrzd_DBJ2R3",
100 |         "outputId": "307e0750-7dff-40dc-be8b-b53eb296c97c",
101 |         "cellView": "form"
102 |       },
103 |       "execution_count": 9,
104 |       "outputs": [
105 |         {
106 |           "output_type": "stream",
107 |           "name": "stdout",
108 |           "text": [
109 |             "fromreic 1\n",
110 |             "electrolyte 1\n",
111 |             "contradictions 1\n",
112 |             "limbs 1\n"
113 |           ]
114 |         }
115 |       ]
116 |     }
117 |   ]
118 | }
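
The word-count notebook above pulls annotated words out of the pasted Zotero text with a regular expression over curly quotes. A minimal sketch of that extraction step in isolation, using a raw string so `\w` reaches the regex engine unchanged; `QUOTED_WORD` and `extract_quoted_words` are hypothetical names:

```python
import re

# The curly quotes are literal characters; an optional trailing
# comma, space, or period inside the quotes is tolerated.
QUOTED_WORD = re.compile(r"“(\w+)[, .]?”")


def extract_quoted_words(text):
    """Return the lowercased single words found inside “...” in text."""
    return [word.lower() for word in QUOTED_WORD.findall(text)]
```

Only single-word annotations match; a quoted phrase containing a space in the middle is skipped, which is the same behavior as the notebook's pattern.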


--------------------------------------------------------------------------------
/錄音檔轉文字.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "cells": [
  3 |     {
  4 |       "cell_type": "markdown",
  5 |       "metadata": {
  6 |         "id": "view-in-github",
  7 |         "colab_type": "text"
  8 |       },
  9 |       "source": [
 10 |         "<a href=\"https://colab.research.google.com/github/reic/colab_python/blob/main/%E9%8C%84%E9%9F%B3%E6%AA%94%E8%BD%89%E6%96%87%E5%AD%97.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 11 |       ]
 12 |     },
 13 |     {
 14 |       "cell_type": "markdown",
 15 |       "metadata": {
 16 |         "id": "sfatoyUwFvio"
 17 |       },
 18 |       "source": [
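 19 |         "# Speech-to-text utility\n",
 20 |         "\n",
 21 |         "This tool is written in Python and can be used to transcribe **interview recordings**, generate **video subtitles**, and for other related tasks.\n",
 22 |         "\n",
 23 |         "Transcription runs on the Google Colab platform with Google's speech-to-text service, so a Google account is all you need to run this program. With a few simple settings, users who do not program can complete the work too.\n",
 24 |         "\n",
 25 |         "# Newer AI speech-to-text tool with more accurate results\n",
 26 |         "You can try my newer speech-to-text tool, written with the Whisper AI model; **the text is more accurate**\n",
 27 |         "https://github.com/reic/colab_python/blob/main/whisper_Test.ipynb\n",
 28 |         "\n",
 29 |         "\n",
 30 |         "by 瑞課\n",
 31 |         "\n",
 32 |         "== Changelog ===\n",
 33 |         "- 2024/2/20 Colab changed its thread limits: at most 2 threads, each limited to 60 seconds. Multithreading no longer works correctly.\n",
 34 |         "- 2023/4/11 Changed how the txt file is written and set the default language to Traditional Chinese.\n",
 35 |         "- 2023/3/15 Fixed a function error left by an incomplete change. Thanks to 左埕安 for the report.\n",
 36 |         "- 2023/3/15 Adjusted how the txt file content is presented.\n",
 37 |         "- 2021/6/1 Fixed the failure to convert to wav and to split when the filename contains spaces.\n",
 38 |         "- 2021/5/12 Added a variable for choosing the recognition language and included a language-code reference table in the file. Thanks to chin ho Lau for the feedback.\n",
 39 |         "- 2021/5/9 Fixed the OTR file not being generated for certain filenames. Thanks to 彩虹小馬 for the feedback.\n",
 40 |         "- 2021/5/3 Added a multithreaded method to shorten transcription time.\n",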
 41 |         "\n",
 42 |         "\n",
 43 |         "\n"
 44 |       ]
 45 |     },
 46 |     {
 47 |       "cell_type": "markdown",
 48 |       "metadata": {
 49 |         "id": "FQPGQ9dlTlNI"
 50 |       },
 51 |       "source": [
 52 |         "## 1. Install required packages\n",
 53 |         "* Speech-to-text package\n",
 54 |         "* Traditional/Simplified Chinese conversion package"
 55 |       ]
 56 |     },
 57 |     {
 58 |       "cell_type": "code",
 59 |       "execution_count": null,
 60 |       "metadata": {
 61 |         "cellView": "form",
 62 |         "id": "zzDanp7lDmSC"
 63 |       },
 64 |       "outputs": [],
 65 |       "source": [
 66 |         "#@title Install the required packages\n",
 67 |         "!pip3 install SpeechRecognition\n",
 68 |         "!pip3 install iNLP"
 69 |       ]
 70 |     },
 71 |     {
 72 |       "cell_type": "markdown",
 73 |       "metadata": {
 74 |         "id": "YHUEj0k9HhSc"
 75 |       },
 76 |       "source": [
 77 |         "## 2. Mount Google Drive\n",
 78 |         "\n",
 79 |         "Click the **Files** icon on the left to mount Google Drive, or run the code cell."
 80 |       ]
 81 |     },
 82 |     {
 83 |       "cell_type": "code",
 84 |       "execution_count": null,
 85 |       "metadata": {
 86 |         "cellView": "form",
 87 |         "id": "nn9tQeSLF8oF"
 88 |       },
 89 |       "outputs": [],
 90 |       "source": [
 91 |         "#@title Mount Google Drive\n",
 92 |         "from google.colab import drive\n",
 93 |         "drive.mount('/content/drive')"
 94 |       ]
 95 |     },
 96 |     {
 97 |       "cell_type": "markdown",
 98 |       "metadata": {
 99 |         "id": "AoJDchjgHxl8"
100 |       },
101 |       "source": [
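102 |         "## 3. Set environment variables and preload functions\n",
103 |         "\n",
104 |         "Provide the path to the **recording**, a temporary directory for the split wav files, and a temporary directory for the txt output. Make sure the **recording**'s directory has no directories with the same names, or that any such directories contain nothing important.\n",
105 |         "\n",
106 |         "If you want to create the directories yourself, set **chk** to n\n"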
107 |       ]
108 |     },
109 |     {
110 |       "cell_type": "code",
111 |       "execution_count": 12,
112 |       "metadata": {
113 |         "id": "R4kZJGQdHdjy",
114 |         "cellView": "form"
115 |       },
116 |       "outputs": [],
117 |       "source": [
118 |         "#@title Basic environment setup\n",
119 |         "import os\n",
120 |         "import shutil\n",
121 |         "import speech_recognition as sr\n",
122 |         "import concurrent.futures\n",
123 |         "import wave\n",
124 |         "import json\n",
125 |         "import numpy as np\n",
126 |         "from google.colab import files\n",
127 |         "from inlp.convert import chinese\n",
128 |         "\n",
129 |         "\n",
130 |         "#@markdown Path to the recording\n",
131 |         "mp3Name= '/content/drive/MyDrive/tmp/240113_0558.mp3' #@param {type:\"string\"}\n",
132 |         "\n",
133 |         "#@markdown Length of each split segment, in seconds. If segments are too long, transcription quality drops.\n",
134 |         "CutTimeDef = 20 #@param {type:\"integer\"}\n",
135 |         "#@markdown Temporary directory for the split wav files\n",
136 |         "wav_path='wav' #@param {type:\"string\"}\n",
137 |         "#@markdown Temporary directory for text files. Each audio segment of CutTimeDef seconds is transcribed to text\n",
138 |         "txt_path='txt' #@param {type:\"string\"}\n",
139 |         "# #@markdown Number of threads\n",
140 |         "# thread_num = 1 #@param {type:\"number\"}\n",
141 |         "\n",
142 |         "workpath=os.path.dirname(mp3Name)\n",
143 |         "mp3Name=os.path.basename(mp3Name)\n",
144 |         "FileName = mp3Name[:mp3Name.rfind(\".\")]+\".wav\"\n",
145 |         "os.chdir(workpath)\n",
146 |         "#@markdown If the wav_path or txt_path directory already exists, remove and recreate it?\n",
147 |         "chk='y' #@param [\"y\",\"n\"]\n",
148 |         "\n",
149 |         "def reset_dir(path):\n",
150 |         "    try:\n",
151 |         "        os.mkdir(path)\n",
152 |         "    except Exception:\n",
153 |         "      if chk==\"y\":\n",
154 |         "        shutil.rmtree(path)\n",
155 |         "        os.mkdir(path)\n",
156 |         "\n",
157 |         "def CutFile(FileName, target_path):\n",
158 |         "\n",
159 |         "    # print(\"CutFile File Name is \", FileName)\n",
160 |         "    f = wave.open(FileName, \"rb\")\n",
161 |         "    params = f.getparams()\n",
162 |         "    nchannels, sampwidth, framerate, nframes = params[:4]\n",
163 |         "    CutFrameNum = framerate * CutTimeDef\n",
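164 |         "    # Read the format information\n",
165 |         "    # getparams() returns all of the WAV file's format information at once as a tuple: number of channels, sample width (in bytes),\n",
166 |         "    # frame rate, number of frames, compression type, and compression description. The wave module only supports uncompressed data, so the last two can be ignored\n",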
167 |         "\n",
168 |         "    # print(\"CutFrameNum=%d\" % (CutFrameNum))\n",
169 |         "    # print(\"nchannels=%d\" % (nchannels))\n",
170 |         "    # print(\"sampwidth=%d\" % (sampwidth))\n",
171 |         "    # print(\"framerate=%d\" % (framerate))\n",
172 |         "    # print(\"nframes=%d\" % (nframes))\n",
173 |         "\n",
174 |         "    str_data = f.readframes(nframes)\n",
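175 |         "    f.close()  # Convert the waveform data into an array\n",
176 |         "    # Cutnum =nframes/framerate/CutTimeDef\n",
177 |         "    # The raw bytes must be reshaped according to the channel count and sample width into an array that can be sliced\n",
178 |         "    wave_data = np.frombuffer(str_data, dtype=np.short)  # assumes 16-bit samples\n",
179 |         "    wave_data.shape = -1, nchannels  # one row per frame, one column per channel\n",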
180 |         "    wave_data = wave_data.T\n",
181 |         "    temp_data = wave_data.T\n",
182 |         "    # StepNum = int(nframes/200)\n",
183 |         "    StepNum = CutFrameNum\n",
184 |         "    StepTotalNum = 0\n",
185 |         "    haha = 0\n",
186 |         "    while StepTotalNum < nframes:\n",
187 |         "        # for j in range(int(Cutnum)):\n",
188 |         "        # print(\"Stemp=%d\" % (haha))\n",
189 |         "        SaveFile = \"%s-%03d.wav\" % (FileName[:-4], (haha+1))\n",
190 |         "        # print(FileName)\n",
191 |         "        if haha % 3==0:\n",
192 |         "          print(\"*\",end='')\n",
193 |         "        temp_dataTemp = temp_data[StepNum * (haha):StepNum * (haha + 1)]\n",
194 |         "        haha = haha + 1\n",
195 |         "        StepTotalNum = haha * StepNum\n",
196 |         "        temp_dataTemp.shape = 1, -1\n",
197 |         "        temp_dataTemp = temp_dataTemp.astype(np.short)  # 開啟WAV文件\n",
198 |         "        f = wave.open(target_path+\"/\" + SaveFile, \"wb\")\n",
199 |         "        # 配置聲道數、量化位數和取樣頻率\n",
200 |         "        f.setnchannels(nchannels)\n",
201 |         "        f.setsampwidth(sampwidth)\n",
202 |         "        f.setframerate(framerate)\n",
203 |         "        # 將wav_data轉換為二進位制資料寫入檔案\n",
204 |         "        f.writeframes(temp_dataTemp.tobytes())\n",
205 |         "        f.close()\n",
206 |         "\n",
207 |         "\n",
208 |         "\n",
209 |         "\n",
210 |         "def texts_to_one(path, target_file):\n",
211 |         "    files = os.listdir(path)\n",
212 |         "    files.sort()\n",
213 |         "    files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n",
214 |         "    with open(target_file, \"w\", encoding=\"utf-8\") as f:\n",
215 |         "        for file in files:\n",
216 |         "            with open(file, \"r\", encoding='utf-8') as f2:\n",
217 |         "                txt= f2.read().split(\"\\n\")\n",
218 |         "                if len(txt) < 2:\n",
219 |         "                    continue\n",
220 |         "                f.write(txt[1])\n",
221 |         "    print(\"Merge complete; file saved to %s \" % target_file)\n",
222 |         "\n",
223 |         "\n",
224 |         "def texts2otr(path, target_file, audio_name, timeperiod):\n",
225 |         "    template = '''<p><span class=\"timestamp\" data-timestamp=\"{}.000000\">{}</span>{}</p><p><br/></p>\n",
226 |         "    '''\n",
227 |         "    files = os.listdir(path)\n",
228 |         "    files.sort()\n",
229 |         "    content = ''\n",
230 |         "    files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n",
231 |         "    with open(target_file, \"w\", encoding=\"utf-8\") as f:\n",
232 |         "\n",
233 |         "        for file in files:\n",
234 |         "            with open(file, \"r\", encoding=\"utf-8\") as f2:\n",
235 |         "                txt = f2.read().split(\"\\n\")\n",
236 |         "                if len(txt) < 2:\n",
237 |         "                    continue\n",
238 |         "                pos=txt[0].rfind(\".\")\n",
239 |         "                time=int(txt[0][pos-3:pos])\n",
240 |         "                # times = (int(txt[0].split(\"-\")[1][:-5])-1)*CutTimeDef\n",
241 |         "                times=(time-1)*CutTimeDef\n",
242 |         "                secs, mins = times % 60, (times//60) % 60\n",
243 |         "                hours = (times//60)//60\n",
244 |         "                timeF = \"{:02d}:{:02d}:{:02d}\".format(hours, mins, secs)\n",
245 |         "                content += template.format(times, timeF, txt[1])\n",
246 |         "\n",
247 |         "        output = {\"text\": content, \"media\": audio_name,\n",
248 |         "                  \"media-time\": timeperiod}\n",
249 |         "        f.write(json.dumps(output, ensure_ascii=False))\n",
250 |         "    print(\"Merge complete; otr file saved to %s \" % target_file)\n",
251 |         "\n",
252 |         "    #files.download(target_file)"
253 |       ]
254 |     },
255 |     {
256 |       "cell_type": "markdown",
257 |       "metadata": {
258 |         "id": "RIaMvxr7Jz_W"
259 |       },
260 |       "source": [
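261 |         "## 4. Audio Conversion and Splitting\n",
262 |         "\n",
263 |         "1. Convert the mp3 file to a wav file\n",
264 |         "2. Split the audio into segments and place them in the wav_path directory\n",
265 |         "3. Create txt_path as the output directory for the speech recognition results\n",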
266 |         "\n",
267 |         "\n"
268 |       ]
269 |     },
270 |     {
271 |       "cell_type": "code",
272 |       "execution_count": 13,
273 |       "metadata": {
274 |         "cellView": "form",
275 |         "id": "rpnUIqKBKBnQ",
276 |         "colab": {
277 |           "base_uri": "https://localhost:8080/"
278 |         },
279 |         "outputId": "757b14db-6b7c-47e9-c25c-78e8fd68f020"
280 |       },
281 |       "outputs": [
282 |         {
283 |           "output_type": "stream",
284 |           "name": "stdout",
285 |           "text": [
286 |             "====================================== mp3 to wav conversion =======================================\n",
287 |             "                                   Wav file name: 240113_0558.wav                                   \n",
288 |             "============================= Splitting audio into 20-second segments ==============================\n",
289 |             "********\n",
290 |             "---------------------------------------- Splitting complete ----------------------------------------\n"
291 |           ]
292 |         }
293 |       ],
294 |       "source": [
295 |         "#@title Run audio conversion and splitting\n",
296 |         "\n",
297 |         "print(\" mp3 to wav conversion \".center(100,'='))\n",
298 |         "os.system('{} -i \"{}\" \"{}\"'.format(\"ffmpeg\",mp3Name, FileName))\n",
299 |         "print(\" Wav file name: {} \".format(FileName).center(100))\n",
300 |         "reset_dir(wav_path)\n",
301 |         "reset_dir(txt_path)\n",
302 |         "# # Cut Wave Setting\n",
303 |         "\n",
304 |         "print(\" Splitting audio into {}-second segments \".format(CutTimeDef).center(100,'='))\n",
305 |         "CutFile(FileName, wav_path)\n",
306 |         "print(\"\")\n",
307 |         "print(\" Splitting complete \".center(100,'-'))"
308 |       ]
309 |     },
310 |     {
311 |       "cell_type": "markdown",
312 |       "metadata": {
313 |         "id": "msvFQZENdwGZ"
314 |       },
315 |       "source": [
316 |         "## 5. Speech to Text"
317 |       ]
318 |     },
319 |     {
320 |       "cell_type": "code",
321 |       "execution_count": null,
322 |       "metadata": {
323 |         "id": "rUh0kL6hC6yd",
324 |         "cellView": "form"
325 |       },
326 |       "outputs": [],
327 |       "source": [
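328 |         "#@title Run speech-to-text (this takes a considerable amount of time)\n",
329 |         "#@markdown Specify the language to transcribe; see the [supported languages list](https://cloud.google.com/speech-to-text/docs/languages) for the language codes.\n",
330 |         "\n",
331 |         "#@markdown Traditional Chinese: zh-TW (or cmn-Hant-TW); English: en-US\n",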
332 |         "voiceLanguage=\"zh-TW\" #@param {type:\"string\"}\n",
333 |         "# cmn-Hant-TW\n",
334 |         "\n",
335 |         "def VoiceToText_thread(file):\n",
336 |         "  txt_file = \"%s/%s.txt\" % (txt_path, file[:-4])\n",
337 |         "\n",
338 |         "  if os.path.isfile(txt_file):\n",
339 |         "    return\n",
340 |         "  with open(txt_file, \"w\", encoding=\"utf-8\") as f:\n",
341 |         "    f.write(\"%s:\\n\" % file)\n",
342 |         "    r = sr.Recognizer()  # recognizes English by default\n",
343 |         "    with sr.AudioFile(wav_path+\"/\"+file) as source:  # read the wav file\n",
344 |         "      audio = r.record(source)\n",
345 |         "      # r.adjust_for_ambient_noise(source)\n",
346 |         "      # audio = r.listen(source)\n",
347 |         "    try:\n",
348 |         "      text = r.recognize_google(audio,language = voiceLanguage)\n",
349 |         "      text = chinese.s2t(text)\n",
350 |         "      # r.recognize_google(audio)\n",
351 |         "\n",
352 |         "      if len(text) == 0:\n",
353 |         "        print(\"=== No data ===\")\n",
354 |         "        return\n",
355 |         "\n",
356 |         "      print(f\"{file}\\t{text}\")\n",
357 |         "      f.write(\"%s \\n\\n\" % text)\n",
358 |         "      if file == wav_files[-1]:\n",
359 |         "        print(\"Transcription finished\")\n",
360 |         "    except sr.RequestError as e:\n",
361 |         "      print(\"Transcription request failed: {0}\".format(e))\n",
362 |         "      # The two except clauses below are safeguards for audio that cannot be recognized\n",
363 |         "      # when using the Google service\n",
364 |         "    except LookupError:\n",
365 |         "      print(\"Could not understand audio\")\n",
366 |         "    except sr.UnknownValueError:\n",
367 |         "      print(f\"Error: could not recognize audio\\t {file}\")\n",
368 |         "\n",
369 |         "\n",
370 |         "\n",
371 |         "# Named wav_files so the list does not shadow the google.colab files module (assumed imported earlier)\n",
372 |         "wav_files = os.listdir(wav_path)\n",
373 |         "wav_files.sort()\n",
374 |         "\n",
375 |         "# Colab changed its threading policy (max=2, shut down after at most 60 seconds), so the thread pool is disabled\n",
376 |         "# with concurrent.futures.ThreadPoolExecutor(max_workers=thread_num) as executor:\n",
377 |         "#     executor.map(VoiceToText_thread, wav_files)\n",
378 |         "for file in wav_files:\n",
379 |         "  VoiceToText_thread(file)\n",
380 |         "\n",
381 |         "# VoiceToText(wav_path, wav_files, txt_path)\n",
382 |         "\n",
383 |         "target_txtfile = \"{}.txt\".format(FileName[:-4])\n",
384 |         "texts_to_one(txt_path, target_txtfile)\n",
385 |         "otr_file = \"{}.otr\".format(FileName[:-4])\n",
386 |         "with wave.open(FileName, \"rb\") as f:\n",
387 |         "    params = f.getparams()\n",
388 |         "texts2otr(txt_path, otr_file, FileName, params.nframes / params.framerate)  # duration in seconds, not frames"
389 |       ]
390 |     },
391 |     {
392 |       "cell_type": "code",
393 |       "execution_count": null,
394 |       "metadata": {
395 |         "cellView": "form",
396 |         "id": "Rqlix8f26WTs"
397 |       },
398 |       "outputs": [],
399 |       "source": [
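400 |         "#@title List the names of the merged text files\n",
401 |         "#@markdown Produces a txt file and an otr file for the [oTranscribe](https://otranscribe.com/) website. The output files are placed in the same directory as the uploaded recording.\n",
402 |         "\n",
403 |         "#@markdown If you already know the file names, you do not need to run this cell.\n",
404 |         "print(\" Output file names \".center(100,'='))\n",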
405 |         "print(target_txtfile)\n",
406 |         "print(otr_file)"
407 |       ]
408 |     },
409 |     {
410 |       "cell_type": "markdown",
411 |       "metadata": {
412 |         "id": "nFJ79zpxDuMt"
413 |       },
414 |       "source": [
415 |         "## 6. Clean Up Temporary Files and Directories"
416 |       ]
417 |     },
418 |     {
419 |       "cell_type": "code",
420 |       "execution_count": 15,
421 |       "metadata": {
422 |         "cellView": "form",
423 |         "id": "XWG0Z-L-D6AK"
424 |       },
425 |       "outputs": [],
426 |       "source": [
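427 |         "#@title Remove temporary files and directories\n",
428 |         "\n",
429 |         "#@markdown This removes the wav and txt directories and the temporary .wav file.\n",
430 |         "\n",
431 |         "#@markdown You can also delete them manually in **Google Drive** instead of running this cell.\n",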
432 |         "\n",
433 |         "\n",
434 |         "shutil.rmtree(wav_path)\n",
435 |         "shutil.rmtree(txt_path)\n",
436 |         "os.remove(FileName)\n"
437 |       ]
438 |     },
439 |     {
440 |       "cell_type": "code",
441 |       "execution_count": null,
442 |       "metadata": {
443 |         "cellView": "form",
444 |         "id": "hO9kdxCwaadE"
445 |       },
446 |       "outputs": [],
447 |       "source": [
448 |         "#@title Unmount **Google Drive**\n",
449 |         "drive.flush_and_unmount()"
450 |       ]
451 |     },
452 |     {
453 |       "cell_type": "markdown",
454 |       "metadata": {
455 |         "id": "YnV3vGD6gS-W"
456 |       },
457 |       "source": [
458 |         "## Appendix 1. YouTube Subtitle Format Output"
459 |       ]
460 |     },
461 |     {
462 |       "cell_type": "code",
463 |       "execution_count": null,
464 |       "metadata": {
465 |         "cellView": "form",
466 |         "id": "D9M7MJS7a281"
467 |       },
468 |       "outputs": [],
469 |       "source": [
470 |         "#@title YouTube subtitle (.srt) format output\n",
471 |         "def get_timeF(times):\n",
472 |         "  secs, mins = times % 60, (times//60) % 60\n",
473 |         "  hours = (times//60)//60\n",
474 |         "  timeF = \"{:02d}:{:02d}:{:02d}\".format(hours, mins, secs)\n",
475 |         "  return timeF\n",
476 |         "\n",
477 |         "def texts2srt(path, target_file):\n",
478 |         "    template = '''{}\\n{},000 --> {},000\\n{}\\n\\n'''  # SRT timestamps need a millisecond field\n",
479 |         "    files = os.listdir(path)\n",
480 |         "    files.sort()\n",
481 |         "    content = ''\n",
482 |         "    counter = 0\n",
483 |         "    files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n",
484 |         "    with open(target_file, \"w\", encoding=\"utf-8\") as f:\n",
485 |         "\n",
486 |         "        for file in files:\n",
487 |         "            with open(file, \"r\", encoding=\"utf-8\") as f2:\n",
488 |         "                txt = f2.read().split(\"\\n\")\n",
489 |         "                if len(txt) < 2:\n",
490 |         "                    continue\n",
491 |         "                counter+=1\n",
492 |         "                times = (int(txt[0].rsplit(\"-\", 1)[1][:-5])-1)*CutTimeDef  # rsplit tolerates '-' in the file name\n",
493 |         "                time_start=get_timeF(times)\n",
494 |         "                time_end=get_timeF(times+CutTimeDef)\n",
495 |         "                content += template.format(counter, time_start, time_end, txt[1])\n",
496 |         "        f.write(content)\n",
497 |         "    print(\"Merge complete; srt file saved to %s \" % target_file)\n",
498 |         "\n",
499 |         "srt_file = \"{}_srt.txt\".format(FileName[:-4])\n",
500 |         "texts2srt(txt_path, srt_file)\n",
501 |         "files.download(srt_file)"
502 |       ]
503 |     }
504 |   ],
505 |   "metadata": {
506 |     "colab": {
507 |       "provenance": [],
508 |       "mount_file_id": "1SPRxSXsaErSrZ4riQ-1sxHankJ3Hlc9X",
509 |       "authorship_tag": "ABX9TyMWp1agJax/qdgy3Ri4I38A",
510 |       "include_colab_link": true
511 |     },
512 |     "kernelspec": {
513 |       "display_name": "Python 3",
514 |       "name": "python3"
515 |     }
516 |   },
517 |   "nbformat": 4,
518 |   "nbformat_minor": 0
519 | }


--------------------------------------------------------------------------------