├── .gitignore ├── LICENSE ├── README.md ├── Text_to_Epub.ipynb ├── Word_檔文件轉換為_Excel.ipynb ├── crawler.ipynb ├── data ├── GRB_105.txt ├── GRB_106.txt ├── GRB_107.txt ├── GRB_108.txt ├── GRB_109.txt ├── GRB_110.txt └── iKnow_2017-2021.txt ├── oTranscribe_txt_to_srt_格式轉換.ipynb ├── whisper_Test.ipynb ├── youtuber_逐字稿.ipynb ├── 台股_Q1~Q3_EPS_抓取.ipynb ├── 技術議題關鍵字擴展.ipynb ├── 英文單字計算.ipynb └── 錄音檔轉文字.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 Reic Wang 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # colab_python 2 | A place to keep small tools developed on the Colab platform. by 瑞課 3 | 4 | # Speech-to-text tool 5 | [錄音檔轉文字.ipynb](https://github.com/reic/colab_python/blob/main/%E9%8C%84%E9%9F%B3%E6%AA%94%E8%BD%89%E6%96%87%E5%AD%97.ipynb) is written in Python. It can transcribe interview recordings, generate subtitles for videos, and serve other related uses. 6 | 7 | The conversion runs on the Google Colab platform with Google's speech-to-text tool, so a Google account is all that is needed to run the notebook. With a few simple settings, users who cannot program can also complete the work. 8 | 9 | == Changelog == 10 | 11 | - 2021/5/3 Added multithreading to shorten the conversion time 12 | - 2021/5/9 Fixed a problem where certain file names prevented the OTR file from being generated; thanks to 「彩虹小馬」 for the feedback 13 | - 2021/5/12 Added a variable for selecting the conversion language, with a language-code reference table in the notebook. Thanks to chin ho Lau for the feedback. 14 | 15 | # Keyword expansion tool 16 | [技術議題關鍵字擴展.ipynb](https://github.com/reic/colab_python/blob/main/%E6%8A%80%E8%A1%93%E8%AD%B0%E9%A1%8C%E9%97%9C%E9%8D%B5%E5%AD%97%E6%93%B4%E5%B1%95.ipynb) is written in Python and uses government open data as its dataset: the public records of the Government Research Bulletin (grb.gov.tw) for ROC years 105-109 (2016-2020). 17 | 18 | For an unfamiliar topic, this tool shows the technologies related to a **specific keyword**, which speeds up understanding of a particular technical field. 19 | 20 | # Text file to epub 21 | [Text_to_Epub.ipynb](https://github.com/reic/colab_python/blob/main/Text_to_Epub.ipynb) is written in Python. It splits a text file into multiple md files, then uses pandoc on the Colab system to convert md to epub. 22 | 23 | This notebook is the first one here to check for Python modules before importing them. A `chkmodule` function checks whether a module is already installed and, if it is not, installs it before the import. It uses `os.popen` to read the list of installed Python packages and checks whether the module to import is on that list; if the module is missing, it is installed through `os.system`. 24 | -------------------------------------------------------------------------------- /Text_to_Epub.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Text_to_Epub.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "authorship_tag": "ABX9TyO6Ti7BlNS7lcoQdFSewsG/", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | 
} 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "source": [ 34 | "#@title Text 轉 epub 程式\n", 35 | "import re, os\n", 36 | "\n", 37 | "# check 套件\n", 38 | "def chkmodule(libset):\n", 39 | " itm=os.popen(\"pip3 list\").read().splitlines()\n", 40 | " itm=[ data[:data.find(\" \")].lower() for data in itm if len(data[:data.find(\" \")])>0 ]\n", 41 | " # set(itm)\n", 42 | " if libset.lower() in set(itm):\n", 43 | " return\n", 44 | " os.system(f\"pip3 install {libset}\")\n", 45 | "\n", 46 | "chkmodule(\"inlp\")\n", 47 | "from inlp.convert import chinese\n", 48 | "\n", 49 | "# 基礎設定資料\n", 50 | "filename=\"/content/drive/MyDrive/tmp/\\u4FEE\\u771F\\u804A\\u5929\\u7FA4.txt\" #@param {type:\"string\"}\n", 51 | "title=\"修真聊天群\" #@param {type:\"string\"}\n", 52 | "author=\"聖騎士的傳說\" #@param {type:\"string\"}\n", 53 | "#@markdown 打勾,將會協助進行簡體轉繁體\n", 54 | "chinese_S2T = True #@param {type:\"boolean\"}\n", 55 | "\n", 56 | "# 標題設定義\n", 57 | "YAML=f'''---\n", 58 | "title: {title}\n", 59 | "author: {author}\n", 60 | "language: zh-Hant\n", 61 | "---'''\n", 62 | "# page_break=\"
\"\n", 63 | "\n", 64 | "\n", 65 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 66 | " f.write(YAML)\n", 67 | "\n", 68 | "with open(filename,mode=\"r\",encoding='utf-8') as f:\n", 69 | " content=f.read()\n", 70 | "text=re.sub(r\"\\n+\",\"\\n\\n\",re.sub(r\"[\\u3000]+\",\"\",content))\n", 71 | "\n", 72 | "print(\"{:=^50s}\".format(\" Markdown 處理 \"))\n", 73 | "# 章/卷處理\n", 74 | "patterns=[[r\"\\n(第.*卷\\s?)\",\"#\"],[r\"\\n(第.*章\\s?)\",\"#\"],[r\"\\n(第.*節\\s?)\",\"#\"]]\n", 75 | "for pns in patterns: \n", 76 | " text=re.sub(pns[0],r\"\\n{} \\1\".format(pns[1]),text)\n", 77 | "print(\"{}\".format(\"完成...\"))\n", 78 | "# 建立 md 檔的函數\n", 79 | "\n", 80 | "def writemd(title,arrtomd):\n", 81 | " with open(\"{:04d}.md\".format(title),mode=\"w\",encoding='utf-8') as f:\n", 82 | " f.write(arrtomd)\n", 83 | "\n", 84 | "## 文字處理\n", 85 | "textarry=text.split(\"# \")\n", 86 | "counter=0 # counter 為 md 檔名\n", 87 | "mdfiles=[] # 記錄 md 檔名\n", 88 | "# 若為簡體文件,就需要用 註解的那一個\n", 89 | "print(\"{:=^50s}\".format(\" md 檔分割 \"))\n", 90 | "\n", 91 | "for itm in textarry:\n", 92 | " if chinese_S2T:\n", 93 | " itm=chinese.s2t(itm)\n", 94 | " # writemd(counter,\"# {}\".format(chinese.s2t(itm)))\n", 95 | " writemd(counter,\"# {}\".format(itm))\n", 96 | " mdfiles.append(\"{:04d}.md\".format(counter))\n", 97 | " counter+=1\n", 98 | "\n", 99 | "print(\"{}\".format(\"完成...\"))\n", 100 | "# counter 為 md 檔名\n", 101 | "# mdfiles=[\"{:04d}.md\".format(itm) for itm in range(counter)]\n", 102 | "\n", 103 | "print(\"{:=^50s}\".format(\" 產生 epub 檔 \"))\n", 104 | "# 透過 pandoc 生成 epub \n", 105 | "os.system(\"pandoc -o \\\"{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 106 | "print(\"{}\".format(\"完成...\"))\n", 107 | "\n", 108 | "# 下載\n", 109 | "print(\"{:=^50s}\".format(\" 下載 epub 檔 \"))\n", 110 | "from google.colab import files\n", 111 | "files.download('{}.epub'.format(title))" 112 | ], 113 | "metadata": { 114 | "id": "j7I7yNlusJCZ" 115 | }, 116 | "execution_count": null, 
117 | "outputs": [] 118 | } 119 | ] 120 | } -------------------------------------------------------------------------------- /Word_檔文件轉換為_Excel.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "Word 檔文件轉換為 Excel", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "toc_visible": true, 10 | "mount_file_id": "1pM5vkIjp5tu3tlTGE4UepW08_Ldsoag8", 11 | "authorship_tag": "ABX9TyPF9xlQR5JKqopdiPQB1DFT", 12 | "include_colab_link": true 13 | }, 14 | "kernelspec": { 15 | "name": "python3", 16 | "display_name": "Python 3" 17 | }, 18 | "language_info": { 19 | "name": "python" 20 | } 21 | }, 22 | "cells": [ 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "id": "view-in-github", 27 | "colab_type": "text" 28 | }, 29 | "source": [ 30 | "\"Open" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": { 36 | "id": "uo5TrNiAsub-" 37 | }, 38 | "source": [ 39 | "# **Convert formatted Word content to Excel**\n", 40 | "First, the Word text must follow a fixed convention before the data in the Word document can be converted to Excel.\n", 41 | "\n", 42 | "See the Word authoring example at the end for details." 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "metadata": { 48 | "id": "vWDnAa8DVqKv", 49 | "cellView": "form" 50 | }, 51 | "source": [ 52 | "#@title Install required packages\n", 53 | "!pip install docx2txt " 54 | ], 55 | "execution_count": null, 56 | "outputs": [] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "metadata": { 61 | "id": "Xcya3e-oYvoS", 62 | "cellView": "form" 63 | }, 64 | "source": [ 65 | "import gspread\n", 66 | "import pandas as pd\n", 67 | "import datetime,os,docx2txt\n", 68 | "\n", 69 | "\n", 70 | "#@title Write Data into Gsheet \n", 71 | "\n", 72 | "#@markdown Path of the input Word (docx) file\n", 73 | "wordFilename = \"/content/drive/MyDrive/mytest.docx\" #@param {type:\"string\"}\n", 74 | "\n", 75 | "#@markdown Output Excel file name; the file is saved in the same directory as the docx\n", 76 | "excel__name = \"mytest.xlsx\" #@param {type:\"string\"}\n", 77 | "\n", 78 | "\n", 79 | "#@markdown 
Field delimiter\n", 80 | "delimiter=\"\\uFF1A\" #@param {type:\"string\"}\n", 81 | "\n", 82 | "#@markdown Column names, separated by the delimiter\n", 83 | "columNames = \"\\u6A19\\u984C\\uFF1A\\u5167\\u5BB9\\uFF1A\\u51FA\\u8655\\uFF1A\\u6587\\u7AE0\\u65E5\\u671F\" #@param {type:\"string\"}\n", 84 | "\n", 85 | "#@markdown In the Word file, use the same delimiter to separate each field name from its content, for example:\n", 86 | "\n", 87 | "#@markdown 標題:蘋果新春發表有...\n", 88 | "\n", 89 | "#@markdown 內容:蘋果於台灣時間今(21)日凌晨舉辦今年首場新品發表會,此次一共發表了五款硬體新品,包括搭配Mini LED螢幕的新一代iPad Pro平板、七種繽紛顏色的新一代 iMac、AirTag藍牙防丟器、搭載A12仿生晶片的新一代Apple TV 4K機上盒及iPhone 12紫色新款。\n", 90 | "\n", 91 | "#@markdown 出處:yahoo!3C科技\n", 92 | "\n", 93 | "#@markdown 文章日期:2021/4/21\n", 94 | "delimiter=delimiter.strip()\n", 95 | "wordPos=wordFilename.rfind(\"/\")\n", 96 | "dirPath=wordFilename[:wordPos]\n", 97 | "columList=[itm.strip() for itm in columNames.split(delimiter)]\n", 98 | "# docx2txt.process(wordFilename)\n", 99 | "content=[itm.strip() for itm in docx2txt.process(wordFilename).split(\"\\n\") if len(itm)>1]\n", 100 | "topd={}\n", 101 | "for itm in columList:\n", 102 | " topd[itm]=list()\n", 103 | "\n", 104 | "for itm in content:\n", 105 | " position=itm.find(delimiter)\n", 106 | " topd[itm[:position]].append(itm[position+len(delimiter):].strip())\n", 107 | "\n", 108 | "df=pd.DataFrame(topd,columns=columList)\n", 109 | "\n", 110 | "df.to_excel(f\"{dirPath}/{excel__name}\",index=False)\n", 111 | "print(\" Conversion finished \".center(100,\"=\"))\n", 112 | "print()\n", 113 | "print(f\"Excel file {excel__name}\\n\\nsaved in the same directory as {wordFilename[wordPos+1:]}\\n \")\n", 114 | "print(\"\".center(106,\"=\"))" 115 | ], 116 | "execution_count": null, 117 | "outputs": [] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": { 122 | "id": "Q4_IWYW7rlah" 123 | }, 124 | "source": [ 125 | "# **Word authoring example**\n", 126 | "\n", 127 | "## Column names: 標題:內容:出處:文章日期\n", 128 | "\n", 129 | "## Delimiter: :(full-width colon)\n", 130 | "\n", 131 | "標題:蘋果新春發表有...\n", 132 | "\n", 133 | "內容:蘋果於台灣時間今(21)日凌晨舉辦今年首場新品發表會,此次一共發表了五款硬體新品,包括搭配Mini \n", 134 | "\n", 135 | "LED螢幕的新一代iPad 
Pro平板、七種繽紛顏色的新一代 iMac、AirTag藍牙防丟器、搭載A12仿生晶片的新一代Apple TV 4K機上盒及iPhone 12紫色新款。\n", 136 | "\n", 137 | "出處:yahoo!3C科技\n", 138 | "\n", 139 | "文章日期:2021/4/21\n", 140 | "\n", 141 | "......\n", 142 | "\n", 143 | "......\n", 144 | "\n", 145 | "標題:韓媒:三星多款新機 提前在8月上市\n", 146 | "\n", 147 | "內容:消息人士透露,全球最大手機製造商三星電子正與南韓電信商洽談,規劃在8月推出平價款旗艦機Galaxy S21 FE和折疊系列Galaxy Z Fold3、Galaxy Z Flip3,以擴大手機市場的市占率。韓聯社報導,若三星證實這項消息,意味這次新機上市時機比往年早。Galaxy S21 FE的前身Galaxy S20 FE在去年10月上市,三星折疊手機Galaxy Z Fold2和Galaxy Z Flip則於9月上市。\n", 148 | "\n", 149 | "出處:聯合新聞網\n", 150 | "\n", 151 | "文章日期:2021/5/9\n" 152 | ] 153 | } 154 | ] 155 | } -------------------------------------------------------------------------------- /crawler.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "view-in-github", 7 | "colab_type": "text" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "id": "OjvmI-ML-U4Y" 17 | }, 18 | "source": [ 19 | "# 網路爬蟲與多執行緒練習\n", 20 | "\n" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": { 27 | "cellView": "form", 28 | "collapsed": true, 29 | "id": "wGkkH1MmUvuV" 30 | }, 31 | "outputs": [], 32 | "source": [ 33 | "#@title 必要元件安裝\n", 34 | "# def check_package(package_name,is_system_package=False):\n", 35 | "# import importlib, subprocess, shutil\n", 36 | "# if is_system_package:\n", 37 | "# if(shutil.which(package_name)):\n", 38 | "# print(f\"{package_name} 套件已經安裝\")\n", 39 | "# else:\n", 40 | "# print(f\"{package_name} 套件尚未安裝,正在安裝中...\")\n", 41 | "# subprocess.check_call([\"apt-get\", \"install\", \"-y\", package_name])\n", 42 | "# else:\n", 43 | "# try:\n", 44 | "# importlib.import_module(package_name)\n", 45 | "# print(f\"{package_name} 套件已經安裝\")\n", 46 | "# except ImportError:\n", 47 | "# print(f\"{package_name} 套件尚未安裝,正在安裝中...\")\n", 48 | "# subprocess.check_call([\"pip\", 
\"install\", package_name])\n", 49 | "\n", 50 | "# # 檢查 pandoc 是否已安裝,若無則安裝\n", 51 | "# check_package(\"pandoc\",is_system_package=True)\n", 52 | "# check_package(\"inlp\")\n", 53 | "\n", 54 | "!apt-get install -y pandoc\n", 55 | "!pip install inlp" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "source": [ 61 | "# 可用網站清單\n", 62 | "\n", 63 | "* 小書包小說網:https://www.xiaoshubao.net/read/488175/1.html\n", 64 | "* 冬日小說網:https://www.drxsw.com/\n", 65 | "* UU看書:https://www.uuread.tw/\n" 66 | ], 67 | "metadata": { 68 | "id": "uPeuRtTlEhXp" 69 | } 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 | "metadata": { 75 | "id": "hKEPfJ6aEPIj", 76 | "cellView": "form" 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "import requests\n", 81 | "import os,re\n", 82 | "from bs4 import BeautifulSoup\n", 83 | "import concurrent.futures\n", 84 | "from inlp.convert import chinese\n", 85 | "\n", 86 | "try:\n", 87 | " os.mkdir(\"/content/tmp\")\n", 88 | "except:\n", 89 | " print(\"目錄已存在\")\n", 90 | "os.chdir(\"/content/tmp\")\n", 91 | "os.system(\"rm -fr *\")\n", 92 | "\n", 93 | "\n", 94 | "\n", 95 | "\n", 96 | "# @title 抓取書籍目錄名稱、網址\n", 97 | "url=\"https://www.xiaoshubao.net/read/503289/\" #@param {type:'string'}\n", 98 | "title=\"我在妖魔世界拾取技能碎片\" #@param {type:\"string\"}\n", 99 | "author=\"第九天命\" #@param {type:\"string\"}\n", 100 | "cssselector=\"#list > dl > dd:nth-child(n+3) > a\" #@param {type:\"string\"}\n", 101 | "#@markdown 打勾,將會直接變成 epub\n", 102 | "file2epub= True #@param {type:\"boolean\"}\n", 103 | "debugmode = True #@param {type:\"boolean\"}\n", 104 | "\n", 105 | "encode = \"utf-8\" # @param {\"type\":\"string\",\"placeholder\":\"utf-8\"}\n", 106 | "\n", 107 | "# 標題設定義\n", 108 | "YAML=f'''---\n", 109 | "title: {title}\n", 110 | "author: {author}\n", 111 | "language: zh-Hant\n", 112 | "---'''\n", 113 | "\n", 114 | "with open(\"title.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 115 | " f.write(YAML)\n", 116 | "\n", 117 | 
"sites=url[:url.find(\"/\",8)]\n", 118 | "#sites=url[:url.rfind(\"/\")]\n", 119 | "\n", 120 | "reg=requests.get(url)\n", 121 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 122 | "reg.encoding=encode\n", 123 | "soup=BeautifulSoup(reg.text)\n", 124 | "articles = soup.select(cssselector)\n", 125 | "# articles=soup.find(name=\"ul\",id=\"chapterList\").find_all(\"a\")\n", 126 | "\n", 127 | "links=[]\n", 128 | "# len(articles)\n", 129 | "for i in articles:\n", 130 | " href=i.get(\"href\")\n", 131 | " links.append([i.getText(),f\"{sites}{href}\"])\n", 132 | "\n", 133 | "files_text=[link[1][link[1].rfind(\"/\")+1:]+\".txt\" for link in links]\n", 134 | "if debugmode:\n", 135 | " print(links[:4])\n", 136 | " print(files_text[:4])\n" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": null, 142 | "metadata": { 143 | "id": "nRS1C2TfF7QP", 144 | "cellView": "form" 145 | }, 146 | "outputs": [], 147 | "source": [ 148 | "# @title 抓取章節內容 update:2024/12/31\n", 149 | "content_selector=\"#content\" #@param {type:\"string\"}\n", 150 | "#@markdown 打勾,將會直接變成 epub\n", 151 | "debug_content = False #@param {type:\"boolean\"}\n", 152 | "no_next_chapter = False # @param {\"type\":\"boolean\",\"placeholder\":\"True\"}\n", 153 | "next_page_selector = \"#content > p > a\" # @param {\"type\":\"string\"}\n", 154 | "find_All_p = False # @param {\"type\":\"boolean\",\"placeholder\":\"false\"}\n", 155 | "\n", 156 | "import lxml.html, time\n", 157 | "import re,os\n", 158 | "import json,ast\n", 159 | "\n", 160 | "\n", 161 | "\n", 162 | "\n", 163 | "def get_html(urls,pages=False):\n", 164 | " time.sleep(2)\n", 165 | " [title,art_url]=urls\n", 166 | " art_id=art_url[art_url.rfind(\"/\")+1:]\n", 167 | " if pages:\n", 168 | " art_id=art_id[:art_id.rfind(\"_\")]+\".html\"\n", 169 | " reg=requests.get(art_url)\n", 170 | " reg.encoding=encode\n", 171 | " soup=BeautifulSoup(reg.text)\n", 172 | " # content=soup.find(name=\"div\",class_='content')\n", 173 | " 
content=soup.select(content_selector)\n", 174 | "\n", 175 | " if find_All_p:\n", 176 | " content=content[0].find_all(\"p\")\n", 177 | " content=\"
\".join([p.get_text() for p in content])\n", 178 | "\n", 179 | " print(art_id)\n", 180 | " textcon = f\"# {title}\\n\\n\" if not pages else \"\"\n", 181 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 182 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 183 | " context=re.sub('
','',context)\n", 184 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 185 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 186 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 187 | " textcon+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 188 | " if debug_content:\n", 189 | " print(chinese.s2t(textcon))\n", 190 | "\n", 191 | " filemode=\"w\" if not pages else \"a\"\n", 192 | " with open(f\"{art_id}.txt\",mode=filemode,encoding=\"utf-8\") as f:\n", 193 | " f.write(chinese.s2t(textcon))\n", 194 | " if no_next_chapter:\n", 195 | " return\n", 196 | " next_page = soup.select_one(next_page_selector)\n", 197 | " if next_page and '_' in next_page['href']:\n", 198 | " art_url = next_page['href'] if next_page['href'].startswith('https') else f\"{sites}{next_page['href']}\"\n", 199 | " get_html([title,art_url],1)\n", 200 | "\n", 201 | "if debug_content:\n", 202 | " get_html(links[0])\n", 203 | "else:\n", 204 | " alreadydown=len(os.listdir())\n", 205 | " startid=0\n", 206 | " if alreadydown>1:\n", 207 | " startid=alreadydown-2\n", 208 | " for link in links[startid:]:\n", 209 | " get_html(link)\n", 210 | "\n", 211 | " output_name=title\n", 212 | " # files_text=os.listdir()\n", 213 | " # files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 214 | " # 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 215 | " # files_text.sort(key=lambda x:int(x[:-4]))\n", 216 | " if file2epub:\n", 217 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(files_text)))\n", 218 | " from google.colab import files\n", 219 | " files.download('../{}.epub'.format(title))\n", 220 | " pass\n", 221 | " else:\n", 222 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 223 | " for file in files_text:\n", 224 | " with open(file,\"r\") as f2:\n", 225 | " f.write(f2.read())\n", 226 | " from google.colab import files\n", 227 | " files.download('../{}.txt'.format(output_name))" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": { 234 | 
"cellView": "form", 235 | "id": "vLrNyHB0M7LL" 236 | }, 237 | "outputs": [], 238 | "source": [ 239 | "#@title 小說狂人\n", 240 | "\n", 241 | "def check_package(package_name,is_system_package=False):\n", 242 | " import importlib, subprocess, shutil\n", 243 | " if is_system_package:\n", 244 | " if(shutil.which(package_name)):\n", 245 | " print(f\"{package_name} 套件已經安裝\")\n", 246 | " else:\n", 247 | " print(f\"{package_name} 套件尚未安裝,正在安裝中...\")\n", 248 | " subprocess.check_call([\"apt-get\", \"install\", \"-y\", package_name])\n", 249 | " else:\n", 250 | " try:\n", 251 | " importlib.import_module(package_name)\n", 252 | " print(f\"{package_name} 套件已經安裝\")\n", 253 | " except ImportError:\n", 254 | " print(f\"{package_name} 套件尚未安裝,正在安裝中...\")\n", 255 | " subprocess.check_call([\"pip\", \"install\", package_name])\n", 256 | "\n", 257 | "# 檢查 pandoc 是否已安裝,若無則安裝\n", 258 | "check_package(\"pandoc\",is_system_package=True)\n", 259 | "check_package(\"inlp\")\n", 260 | "\n", 261 | "\n", 262 | "import requests\n", 263 | "import os,re\n", 264 | "from bs4 import BeautifulSoup\n", 265 | "import concurrent.futures\n", 266 | "from inlp.convert import chinese\n", 267 | "\n", 268 | "try:\n", 269 | " os.mkdir(\"/content/tmp\")\n", 270 | "except:\n", 271 | " print(\"目錄已存在\")\n", 272 | "os.chdir(\"/content/tmp\")\n", 273 | "os.system(\"rm -fr *\")\n", 274 | "\n", 275 | "\n", 276 | "\n", 277 | "def get_html(urls):\n", 278 | " [title,art_url]=urls\n", 279 | " art_id=art_url[art_url.rfind(\"/\")+1:]\n", 280 | " reg=requests.get(art_url)\n", 281 | " reg.encoding=\"utf-8\"\n", 282 | " soup=BeautifulSoup(reg.text)\n", 283 | " content=soup.find(name=\"div\",class_='content')\n", 284 | "\n", 285 | " print(art_id)\n", 286 | " text=f\"# {title}\\n\\n\"\n", 287 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 288 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 289 | " context=re.sub('
','',context)\n", 290 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 291 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 292 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 293 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 294 | "\n", 295 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 296 | " f.write(text)\n", 297 | "\n", 298 | "\n", 299 | "#@markdown 書籍目錄網址\n", 300 | "url=\"https://czbooks.net/n/s6g1n14e\" #@param {type:'string'}\n", 301 | "title=\"我在原始部落當酋長\" #@param {type:\"string\"}\n", 302 | "author=\"西原公子\" #@param {type:\"string\"}\n", 303 | "#@markdown 打勾,將會直接變成 epub\n", 304 | "file2epub = True #@param {type:\"boolean\"}\n", 305 | "\n", 306 | "# 標題設定義\n", 307 | "YAML=f'''---\n", 308 | "title: {title}\n", 309 | "author: {author}\n", 310 | "language: zh-Hant\n", 311 | "---'''\n", 312 | "\n", 313 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 314 | " f.write(YAML)\n", 315 | "\n", 316 | "sites=url[:url.find(\"/\",8)]\n", 317 | "# sites=url[:url.rfind(\"/\")]\n", 318 | "\n", 319 | "reg=requests.get(url)\n", 320 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 321 | "reg.encoding=\"utf-8\"\n", 322 | "soup=BeautifulSoup(reg.text)\n", 323 | "articles = soup.select('#chapter-list > li a')\n", 324 | "# articles=soup.find(name=\"ul\",id=\"chapterList\").find_all(\"a\")\n", 325 | "\n", 326 | "links=[]\n", 327 | "# len(articles)\n", 328 | "for i in articles:\n", 329 | " href=i.get(\"href\")\n", 330 | "\n", 331 | " links.append([i.text,f\"https:{href}\"])\n", 332 | "\n", 333 | "files_text=[link[1][link[1].rfind(\"/\")+1:]+\".txt\" for link in links]\n", 334 | "\n", 335 | "\n", 336 | "\n", 337 | "# # # 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 338 | "# # # # 同時建立及啟用10個執行緒\n", 339 | "# # # with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 340 | "# # # executor.map(get_html, links)\n", 341 | "for link in links:\n", 342 | " get_html(link)\n", 343 | "\n", 344 | "output_name=title\n", 345 | "# files_text=os.listdir()\n", 346 
| "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 347 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 348 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 349 | "if file2epub:\n", 350 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(files_text)))\n", 351 | " from google.colab import files\n", 352 | " files.download('../{}.epub'.format(title))\n", 353 | " pass\n", 354 | "else:\n", 355 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 356 | " for file in files_text:\n", 357 | " with open(file,\"r\") as f2:\n", 358 | " f.write(f2.read())\n", 359 | " from google.colab import files\n", 360 | " files.download('../{}.txt'.format(output_name))\n", 361 | "\n" 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": null, 367 | "metadata": { 368 | "cellView": "form", 369 | "id": "yONyl870ergz" 370 | }, 371 | "outputs": [], 372 | "source": [ 373 | "#@title UU看書網\n", 374 | "#@markdown 還在修正的程式,可以直接從這一個區塊執行\n", 375 | "\n", 376 | "def check_package(itm):\n", 377 | " import importlib\n", 378 | " try:\n", 379 | " importlib.import_module(itm)\n", 380 | " print(f\"{itm} 套件已經安裝\")\n", 381 | " except ImportError:\n", 382 | " print(f\"{itm} 套件尚未安裝,正在安裝中...\")\n", 383 | " !pip install {itm}\n", 384 | "\n", 385 | "\n", 386 | "check_package(\"inlp\")\n", 387 | "\n", 388 | "import requests\n", 389 | "import os,re\n", 390 | "from bs4 import BeautifulSoup\n", 391 | "import concurrent.futures\n", 392 | "from inlp.convert import chinese\n", 393 | "\n", 394 | "try:\n", 395 | " os.mkdir(\"/content/tmp\")\n", 396 | "except:\n", 397 | " print(\"目錄已存在\")\n", 398 | "os.chdir(\"/content/tmp\")\n", 399 | "os.system(\"rm -fr *\")\n", 400 | "\n", 401 | "\n", 402 | "def get_html(urls):\n", 403 | " [title,art_url]=urls\n", 404 | " art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 405 | " reg=requests.get(art_url)\n", 406 | " reg.encoding=\"utf-8\"\n", 407 | " soup=BeautifulSoup(reg.text)\n", 408 | " 
content=soup.find(name=\"div\",id='contentbox')\n", 409 | " print(art_id)\n", 410 | " text=f\"# {title}\\n\\n\"\n", 411 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 412 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 413 | " context=re.sub('
','',context)\n", 414 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 415 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 416 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 417 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 418 | "\n", 419 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 420 | " f.write(text)\n", 421 | "\n", 422 | "\n", 423 | "#@markdown 書籍目錄網址\n", 424 | "url=\"https://tw.uukanshu.com/b/239755/\" #@param {type:'string'}\n", 425 | "title=\"\\u6211\\u5BB6\\u5A18\\u5B50\\uFF0C\\u4E0D\\u5C0D\\u52C1\" #@param {type:\"string\"}\n", 426 | "author=\"\\u4E00\\u87EC\\u77E5\\u590F\" #@param {type:\"string\"}\n", 427 | "#@markdown 打勾,將會直接變成 epub\n", 428 | "file2epub = True #@param {type:\"boolean\"}\n", 429 | "\n", 430 | "# 標題設定義\n", 431 | "YAML=f'''---\n", 432 | "title: {title}\n", 433 | "author: {author}\n", 434 | "language: zh-Hant\n", 435 | "---'''\n", 436 | "\n", 437 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 438 | " f.write(YAML)\n", 439 | "\n", 440 | "sites=url[:url.find(\"/\",8)]\n", 441 | "# sites=url[:url.rfind(\"/\")]\n", 442 | "\n", 443 | "reg=requests.get(url)\n", 444 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 445 | "reg.encoding=\"utf-8\"\n", 446 | "soup=BeautifulSoup(reg.text)\n", 447 | "articles=soup.find(name=\"ul\",id=\"chapterList\").find_all(\"a\")\n", 448 | "\n", 449 | "links=[]\n", 450 | "# len(articles)\n", 451 | "for i in articles:\n", 452 | " href=i.get(\"href\")\n", 453 | " links.append([i.get(\"title\"),f\"{sites}{href}\"])\n", 454 | "# links.sort(key=lambda x: x[1])\n", 455 | "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n", 456 | "\n", 457 | "# 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 458 | "# # 同時建立及啟用10個執行緒\n", 459 | "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 460 | "# executor.map(get_html, links)\n", 461 | "for link in links:\n", 462 | " get_html(link)\n", 463 | "\n", 464 | "output_name=title\n", 465 | "# 
files_text=os.listdir()\n", 466 | "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 467 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 468 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 469 | "if file2epub:\n", 470 | " mdfiles=[ itm for itm in files_text[::-1] ]\n", 471 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 472 | " from google.colab import files\n", 473 | " files.download('../{}.epub'.format(title))\n", 474 | " pass\n", 475 | "else:\n", 476 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 477 | " for file in files_text[::-1]:\n", 478 | " with open(file,\"r\") as f2:\n", 479 | " f.write(f2.read())\n", 480 | " from google.colab import files\n", 481 | " files.download('../{}.txt'.format(output_name))\n", 482 | "\n" 483 | ] 484 | }, 485 | { 486 | "cell_type": "code", 487 | "execution_count": null, 488 | "metadata": { 489 | "id": "pNaQIMaFVWba", 490 | "cellView": "form" 491 | }, 492 | "outputs": [], 493 | "source": [ 494 | "#@title UU看書網\n", 495 | "#@markdown 還在修正的程式,可以直接從這一個區塊執行\n", 496 | "\n", 497 | "def check_package(itm):\n", 498 | " import importlib\n", 499 | " try:\n", 500 | " importlib.import_module(itm)\n", 501 | " print(f\"{itm} 套件已經安裝\")\n", 502 | " except ImportError:\n", 503 | " print(f\"{itm} 套件尚未安裝,正在安裝中...\")\n", 504 | " !pip install {itm}\n", 505 | "\n", 506 | "\n", 507 | "check_package(\"inlp\")\n", 508 | "\n", 509 | "import requests\n", 510 | "import os,re\n", 511 | "from bs4 import BeautifulSoup\n", 512 | "import concurrent.futures\n", 513 | "from inlp.convert import chinese\n", 514 | "\n", 515 | "try:\n", 516 | " os.mkdir(\"/content/tmp\")\n", 517 | " chk_cont=False\n", 518 | "except:\n", 519 | " print(\"目錄已存在\")\n", 520 | " chk_cont=True\n", 521 | "os.chdir(\"/content/tmp\")\n", 522 | "# os.system(\"rm -fr *\")\n", 523 | "\n", 524 | "\n", 525 | "def get_html(urls):\n", 526 | " [title,art_url]=urls\n", 527 | " 
art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 528 | " reg=requests.get(art_url)\n", 529 | " reg.encoding=\"utf-8\"\n", 530 | " soup=BeautifulSoup(reg.text)\n", 531 | " # content=\"
\\n\".join([tag.string for tag in soup.find(name=\"div\",id='nr').find_all(\"p\")])\n", 532 | " tmp=soup.find(name=\"div\",id='nr').find_all(\"p\")\n", 533 | " print(art_id)\n", 534 | " # pages=soup.find('h1').getText()\n", 535 | " # numbers = re.findall(r'\\d+', pages)\n", 536 | " check=soup.find(\"div\",class_=\"operate\").find_all(\"a\")\n", 537 | " if (check[-1].getText()==\"下一頁\"):\n", 538 | " page_url=sites+check[-1].get(\"href\")\n", 539 | " # if (len(numbers) >1):\n", 540 | " # last_number=numbers[-1]\n", 541 | " # # print(last_number)\n", 542 | " # for i in range(2,int(last_number)+1):\n", 543 | " # page_url=art_url[:-5]+f\"_{i}.html\"\n", 544 | " content_extend=get_html_page(page_url)\n", 545 | " tmp=tmp+content_extend\n", 546 | "\n", 547 | "\n", 548 | " content=\"\\n\".join([str(itm) for itm in tmp])\n", 549 | " # print(\"
\".join([tag.string for tag in content]))\n", 550 | " text=f\"# {title}\\n\\n\"\n", 551 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 552 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 553 | " context=re.sub('
','',context)\n", 554 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 555 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 556 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 557 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 558 | "\n", 559 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 560 | " f.write(text)\n", 561 | "\n", 562 | "\n", 563 | "def get_html_page(page_url):\n", 564 | " reg=requests.get(page_url)\n", 565 | "\n", 566 | " reg.encoding=\"utf-8\"\n", 567 | " soup=BeautifulSoup(reg.text)\n", 568 | " tmp1=soup.find(name=\"div\",id='nr').find_all(\"p\")\n", 569 | " check=soup.find(\"div\",class_=\"operate\").find_all(\"a\")\n", 570 | " if (check[-1].getText()==\"下一頁\"):\n", 571 | " page_url=sites+check[-1].get(\"href\")\n", 572 | " # if (len(numbers) >1):\n", 573 | " # last_number=numbers[-1]\n", 574 | " # # print(last_number)\n", 575 | " # for i in range(2,int(last_number)+1):\n", 576 | " # page_url=art_url[:-5]+f\"_{i}.html\"\n", 577 | " content_extend=get_html_page(page_url)\n", 578 | " tmp1=tmp1+content_extend\n", 579 | "\n", 580 | " return tmp1\n", 581 | "\n", 582 | "\n", 583 | "\n", 584 | "#@markdown 書籍目錄網址\n", 585 | "url=\"https://www.uuread.tw/28100\" #@param {type:'string'}\n", 586 | "title=\"系統賦我長生,我熬死了所有人\" #@param {type:\"string\"}\n", 587 | "author=\"一隻榴蓮3號\" #@param {type:\"string\"}\n", 588 | "#@markdown 打勾,將會直接變成 epub\n", 589 | "file2epub = True #@param {type:\"boolean\"}\n", 590 | "\n", 591 | "# 標題設定義\n", 592 | "YAML=f'''---\n", 593 | "title: {title}\n", 594 | "author: {author}\n", 595 | "language: zh-Hant\n", 596 | "---'''\n", 597 | "\n", 598 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 599 | " f.write(YAML)\n", 600 | "\n", 601 | "sites=url[:url.find(\"/\",8)]\n", 602 | "# sites=url[:url.rfind(\"/\")]\n", 603 | "\n", 604 | "reg=requests.get(url)\n", 605 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 606 | "reg.encoding=\"utf-8\"\n", 607 | "soup=BeautifulSoup(reg.text)\n", 608 | 
"articles=soup.find(name=\"ul\",id=\"newlist\").find_all(\"a\")\n", 609 | "\n", 610 | "links=[]\n", 611 | "# len(articles)\n", 612 | "for i in articles:\n", 613 | " href=i.get(\"href\")\n", 614 | " links.append([i.get(\"title\"),f\"{sites}{href}\"])\n", 615 | "# links.sort(key=lambda x: x[1])\n", 616 | "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n", 617 | "\n", 618 | "# # 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 619 | "# # # 同時建立及啟用10個執行緒\n", 620 | "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 621 | "# executor.map(get_html, links)\n", 622 | "\n", 623 | "if chk_cont:\n", 624 | " start_num=int(len(os.listdir())-1)\n", 625 | " print(start_num)\n", 626 | "else:\n", 627 | " start_num=0\n", 628 | "for index,link in enumerate(links[start_num:],start_num+1):\n", 629 | " print(index)\n", 630 | " get_html(link)\n", 631 | "\n", 632 | "output_name=title\n", 633 | "# files_text=os.listdir()\n", 634 | "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 635 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 636 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 637 | "if file2epub:\n", 638 | " mdfiles=[ itm for itm in files_text ]\n", 639 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 640 | " from google.colab import files\n", 641 | " files.download('../{}.epub'.format(title))\n", 642 | " pass\n", 643 | "else:\n", 644 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 645 | " for file in files_text[::-1]:\n", 646 | " with open(file,\"r\") as f2:\n", 647 | " f.write(f2.read())\n", 648 | " from google.colab import files\n", 649 | " files.download('../{}.txt'.format(output_name))\n", 650 | "\n" 651 | ] 652 | }, 653 | { 654 | "cell_type": "code", 655 | "execution_count": null, 656 | "metadata": { 657 | "cellView": "form", 658 | "id": "stXbsG0Qb0wC" 659 | }, 660 | "outputs": [], 661 | "source": [ 662 | "#@title 
UU看書網 (revised version)\n", 663 | "#@markdown Still being revised; this cell can be run on its own\n", 664 | "\n", 665 | "def check_package(package_name,is_system_package=False):\n", 666 | " import importlib\n", 667 | " import subprocess\n", 668 | " try:\n", 669 | " importlib.import_module(package_name)\n", 670 | " print(f\"{package_name} is already installed\")\n", 671 | " except ImportError:\n", 672 | " print(f\"{package_name} is not installed; installing...\")\n", 673 | " if is_system_package:\n", 674 | " subprocess.check_call([\"apt-get\", \"install\", \"-y\", package_name])\n", 675 | " else:\n", 676 | " subprocess.check_call([\"pip\", \"install\", package_name])\n", 677 | "\n", 678 | "# Make sure pandoc is installed. It is a system binary rather than a Python module, so it is installed through apt-get\n", 679 | "check_package(\"pandoc\",is_system_package=True)\n", 680 | "\n", 681 | "check_package(\"inlp\")\n", 682 | "\n", 683 | "import requests\n", 684 | "import os,re\n", 685 | "from bs4 import BeautifulSoup\n", 686 | "import concurrent.futures\n", 687 | "from inlp.convert import chinese\n", 688 | "import time\n", 689 | "import random\n", 690 | "\n", 691 | "headers={\"User-Agent\":\"Mozilla/5.0\"} # assumed value: headers is used below but was never defined in this cell\n", 692 | "\n", 693 | "try:\n", 694 | " os.mkdir(\"/content/tmp\")\n", 695 | " chk_cont=False\n", 696 | "except FileExistsError:\n", 697 | " print(\"Directory already exists\")\n", 698 | " chk_cont=True\n", 699 | "os.chdir(\"/content/tmp\")\n", 700 | "# os.system(\"rm -fr *\")\n", 701 | "\n", 702 | "def get_article_content(article_url):\n", 703 | " try:\n", 704 | " response = requests.get(article_url,headers=headers, timeout=10)\n", 705 | " response.encoding = \"utf-8\"\n", 706 | " if response.status_code == 200:\n", 707 | " soup = BeautifulSoup(response.text, 'html.parser')\n", 708 | " content_div = soup.find('div', id='nr')\n", 709 | " if content_div:\n", 710 | " paragraphs = content_div.find_all('p')\n", 711 | " content = '\\n'.join([p.get_text(strip=True) for p in paragraphs])\n", 712 | "\n", 713 | " next_button_text = soup.find(\"div\", class_=\"operate\").find_all(\"a\")[-1].getText()\n", 714 | " next_page_url = None\n", 715 | " if next_button_text == \"下一頁\": # the site's \"next page\" link text\n", 716 | " next_page_url = sites + soup.find(\"div\", 
class_=\"operate\").find_all(\"a\")[-1].get(\"href\")\n", 717 | "\n", 718 | " return content, next_page_url\n", 719 | " else:\n", 720 | " print(f\"無法找到內容:{article_url}\")\n", 721 | " else:\n", 722 | " print(f\"無法訪問網頁:{article_url}\")\n", 723 | " except Exception as e:\n", 724 | " print(f\"發生錯誤:{e}\")\n", 725 | " return None, None\n", 726 | "\n", 727 | "def save_to_text_file(title, content, file_name, add_title=True):\n", 728 | " try:\n", 729 | " with open(file_name, 'a', encoding='utf-8') as file:\n", 730 | " if add_title:\n", 731 | " file.write(f\"# {title}\\n\\n\")\n", 732 | " # 移除多餘的空白和換行\n", 733 | " # print(content)\n", 734 | " content = BeautifulSoup(content, \"html.parser\").get_text(separator=\"\\n\\n\")\n", 735 | " content = content.replace(\"\\xa0\\xa0\\xa0\\xa0\", \"\").replace(\"
\", \"\\n\").replace(\"

\", \"\\n\")\n", 736 | "\n", 737 | " file.write(str(content))\n", 738 | " print(f\"已保存至文件:{file_name}\")\n", 739 | " except Exception as e:\n", 740 | " print(f\"保存文件時出錯:{e}\")\n", 741 | "\n", 742 | "def crawl_articles(urls,start_num):\n", 743 | "\n", 744 | " for index, (title, article_url) in enumerate(urls, start=start_num):\n", 745 | " article_id = article_url.split('/')[-1].split('.')[0]\n", 746 | " print(f\"正在處理文章:{article_id}\")\n", 747 | " content, next_page_url = get_article_content(article_url)\n", 748 | " if content:\n", 749 | " # 保存第一頁的內容\n", 750 | " file_name = f\"{article_id}.txt\"\n", 751 | " save_to_text_file(title, content, file_name)\n", 752 | "\n", 753 | " # 接著處理其他頁面(如果有)\n", 754 | " while next_page_url:\n", 755 | " next_page_content, next_page_url = get_article_content(next_page_url)\n", 756 | " if next_page_content:\n", 757 | " # 保存其他頁面的內容\n", 758 | " save_to_text_file(title, next_page_content, file_name, add_title=False)\n", 759 | "\n", 760 | "\n", 761 | "def get_html(urls):\n", 762 | " [title,art_url]=urls\n", 763 | " art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 764 | " reg=requests.get(art_url,headers=headers, timeout=10)\n", 765 | " reg.encoding=\"utf-8\"\n", 766 | " soup=BeautifulSoup(reg.text)\n", 767 | " # content=\"
\\n\".join([tag.string for tag in soup.find(name=\"div\",id='nr').find_all(\"p\")])\n", 768 | " tmp=soup.find(name=\"div\",id='nr').find_all(\"p\")\n", 769 | " print(art_id)\n", 770 | " # pages=soup.find('h1').getText()\n", 771 | " # numbers = re.findall(r'\\d+', pages)\n", 772 | " check=soup.find(\"div\",class_=\"operate\").find_all(\"a\")\n", 773 | " if (check[-1].getText()==\"下一頁\"):\n", 774 | " page_url=sites+check[-1].get(\"href\")\n", 775 | " # if (len(numbers) >1):\n", 776 | " # last_number=numbers[-1]\n", 777 | " # # print(last_number)\n", 778 | " # for i in range(2,int(last_number)+1):\n", 779 | " # page_url=art_url[:-5]+f\"_{i}.html\"\n", 780 | " content_extend=get_html_page(page_url)\n", 781 | " tmp=tmp+content_extend\n", 782 | "\n", 783 | "\n", 784 | " content=\"\\n\".join([str(itm) for itm in tmp])\n", 785 | " # print(\"
\".join([tag.string for tag in content]))\n", 786 | " text=f\"# {title}\\n\\n\"\n", 787 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 788 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 789 | " context=re.sub('
','',context)\n", 790 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 791 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 792 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 793 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 794 | "\n", 795 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 796 | " f.write(text)\n", 797 | "\n", 798 | "\n", 799 | "def get_html_page(page_url):\n", 800 | " reg=requests.get(page_url,headers=headers, timeout=10)\n", 801 | "\n", 802 | " reg.encoding=\"utf-8\"\n", 803 | " soup=BeautifulSoup(reg.text)\n", 804 | " tmp1=soup.find(name=\"div\",id='nr').find_all(\"p\")\n", 805 | " check=soup.find(\"div\",class_=\"operate\").find_all(\"a\")\n", 806 | " if (check[-1].getText()==\"下一頁\"):\n", 807 | " page_url=sites+check[-1].get(\"href\")\n", 808 | " # if (len(numbers) >1):\n", 809 | " # last_number=numbers[-1]\n", 810 | " # # print(last_number)\n", 811 | " # for i in range(2,int(last_number)+1):\n", 812 | " # page_url=art_url[:-5]+f\"_{i}.html\"\n", 813 | " content_extend=get_html_page(page_url)\n", 814 | " tmp1=tmp1+content_extend\n", 815 | "\n", 816 | " return tmp1\n", 817 | "\n", 818 | "\n", 819 | "\n", 820 | "#@markdown Book table-of-contents URL\n", 821 | "url=\"https://www.uuread.tw/94294\" #@param {type:'string'}\n", 822 | "title=\"阿斗之智近乎妖\" #@param {type:\"string\"}\n", 823 | "author=\"漢衛\" #@param {type:\"string\"}\n", 824 | "#@markdown If checked, the output is converted directly to EPUB\n", 825 | "file2epub = True #@param {type:\"boolean\"}\n", 826 | "\n", 827 | "# Title metadata block\n", 828 | "YAML=f'''---\n", 829 | "title: {title}\n", 830 | "author: {author}\n", 831 | "language: zh-Hant\n", 832 | "---'''\n", 833 | "\n", 834 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 835 | " f.write(YAML)\n", 836 | "\n", 837 | "sites=url[:url.find(\"/\",8)]\n", 838 | "# sites=url[:url.rfind(\"/\")]\n", 839 | "\n", 840 | "reg=requests.get(url)\n", 841 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 842 | "reg.encoding=\"utf-8\"\n", 843 | "soup=BeautifulSoup(reg.text)\n", 844 | 
"articles=soup.find(name=\"ul\",id=\"newlist\").find_all(\"a\")\n", 845 | "\n", 846 | "links=[]\n", 847 | "# len(articles)\n", 848 | "for i in articles:\n", 849 | " href=i.get(\"href\")\n", 850 | " links.append([i.get(\"title\"),f\"{sites}{href}\"])\n", 851 | "# links.sort(key=lambda x: x[1])\n", 852 | "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n", 853 | "\n", 854 | "# # Temporarily disabled: Colab currently provides only 2 threads, and they can run for only 60 seconds\n", 855 | "# # # Create and start 5 threads at once\n", 856 | "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 857 | "# executor.map(get_html, links)\n", 858 | "\n", 859 | "if chk_cont:\n", 860 | " start_num=int(len(os.listdir())-1) # resume: count the files already saved, excluding title.txt\n", 861 | " print(start_num)\n", 862 | "else:\n", 863 | " start_num=0\n", 864 | "for index,link in enumerate(links[start_num:],start_num+1):\n", 865 | " print(index)\n", 866 | " get_html(link)\n", 867 | " time.sleep(random.uniform(1, 2))\n", 868 | "\n", 869 | "# start_getnum=len(os.listdir(\".\"))-1\n", 870 | "\n", 871 | "# crawl_articles(links[start_getnum:],start_num)\n", 872 | "\n", 873 | "\n", 874 | "\n", 875 | "output_name=title\n", 876 | "# # files_text=os.listdir()\n", 877 | "# # files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 878 | "# # File sorting: file names vary in length, so for now they are sorted by their numeric value\n", 879 | "# # files_text.sort(key=lambda x:int(x[:-4]))\n", 880 | "\n", 881 | "!apt-get install -y pandoc\n", 882 | "if file2epub:\n", 883 | " mdfiles=[ itm for itm in files_text ]\n", 884 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 885 | " from google.colab import files\n", 886 | " files.download('../{}.epub'.format(title))\n", 887 | " pass\n", 888 | "else:\n", 889 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 890 | " for file in files_text[::-1]:\n", 891 | " with open(file,\"r\",encoding='utf-8') as f2:\n", 892 | " f.write(f2.read())\n", 893 | " from google.colab import files\n", 894 | " 
files.download('../{}.txt'.format(output_name))\n", 895 | "\n" 896 | ] 897 | }, 898 | { 899 | "cell_type": "code", 900 | "execution_count": null, 901 | "metadata": { 902 | "cellView": "form", 903 | "id": "e21JYlPGnuy7" 904 | }, 905 | "outputs": [], 906 | "source": [ 907 | "#@title UU看書\n", 908 | "#@markdown 還在修正的程式,可以直接從這一個區塊執行\n", 909 | "\n", 910 | "def check_package(itm):\n", 911 | " import importlib\n", 912 | " try:\n", 913 | " importlib.import_module(itm)\n", 914 | " print(f\"{itm} 套件已經安裝\")\n", 915 | " except ImportError:\n", 916 | " print(f\"{itm} 套件尚未安裝,正在安裝中...\")\n", 917 | " !pip install {itm}\n", 918 | "\n", 919 | "\n", 920 | "check_package(\"inlp\")\n", 921 | "\n", 922 | "import requests\n", 923 | "import os,re\n", 924 | "from bs4 import BeautifulSoup\n", 925 | "import concurrent.futures\n", 926 | "from inlp.convert import chinese\n", 927 | "\n", 928 | "try:\n", 929 | " os.mkdir(\"/content/tmp\")\n", 930 | "except:\n", 931 | " print(\"目錄已存在\")\n", 932 | "os.chdir(\"/content/tmp\")\n", 933 | "os.system(\"rm -fr *\")\n", 934 | "\n", 935 | "\n", 936 | "def get_html(urls):\n", 937 | " [title,art_url]=urls\n", 938 | " art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 939 | " reg=requests.get(art_url)\n", 940 | " reg.encoding=\"utf-8\"\n", 941 | " soup=BeautifulSoup(reg.text)\n", 942 | " content=soup.find(name=\"div\",id='TextContent')\n", 943 | " print(art_id)\n", 944 | " text=f\"# {title}\\n\\n\"\n", 945 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 946 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 947 | " context=re.sub('
','',context)\n", 948 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 949 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 950 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 951 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 952 | " text=chinese.s2t(text)\n", 953 | "\n", 954 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 955 | " f.write(text)\n", 956 | "\n", 957 | "\n", 958 | "#@markdown Book table-of-contents URL\n", 959 | "url=\"https://www.uuks5.com/book/632203/\" #@param {type:'string'}\n", 960 | "title=\"我在古代當便宜爹\" #@param {type:\"string\"}\n", 961 | "author=\"雲山風海\" #@param {type:\"string\"}\n", 962 | "#@markdown If checked, the output is converted directly to EPUB\n", 963 | "file2epub = True #@param {type:\"boolean\"}\n", 964 | "\n", 965 | "# Title metadata block\n", 966 | "YAML=f'''---\n", 967 | "title: {title}\n", 968 | "author: {author}\n", 969 | "language: zh-Hant\n", 970 | "---'''\n", 971 | "\n", 972 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 973 | " f.write(YAML)\n", 974 | "\n", 975 | "sites=url[:url.find(\"/\",8)]\n", 976 | "# sites=url[:url.rfind(\"/\")]\n", 977 | "\n", 978 | "reg=requests.get(url)\n", 979 | "reg.encoding=\"utf-8\" # set the encoding before reg.text is read\n", 980 | "soup=BeautifulSoup(reg.text,\"html.parser\")\n", 981 | "\n", 982 | "\n", 983 | "articles=soup.find(name=\"ul\",id=\"chapterList\").find_all(name=\"a\")\n", 984 | "\n", 985 | "links=[]\n", 986 | "# # len(articles)\n", 987 | "for i in articles:\n", 988 | " href=i.get(\"href\")\n", 989 | " # print(i.text)\n", 990 | " links.append([i.text,f\"{sites}{href}\"])\n", 991 | "# links.sort(key=lambda x: x[1])\n", 992 | "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n", 993 | "\n", 994 | "\n", 995 | "for link in links:\n", 996 | " get_html(link)\n", 997 | "\n", 998 | "output_name=title\n", 999 | "# files_text=os.listdir()\n", 1000 | "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 1001 | "# File sorting: file names vary in length, so for now they are sorted by their numeric value\n", 1002 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 1003 | "if 
file2epub:\n", 1004 | " mdfiles=[ itm for itm in files_text]\n", 1005 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 1006 | " from google.colab import files\n", 1007 | " files.download('../{}.epub'.format(title))\n", 1008 | " pass\n", 1009 | "else:\n", 1010 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 1011 | " for file in files_text[::-1]:\n", 1012 | " with open(file,\"r\",encoding='utf-8') as f2:\n", 1013 | " f.write(f2.read())\n", 1014 | " from google.colab import files\n", 1015 | " files.download('../{}.txt'.format(output_name))\n", 1016 | "\n" 1017 | ] 1018 | }, 1019 | { 1020 | "cell_type": "code", 1021 | "execution_count": null, 1022 | "metadata": { 1023 | "cellView": "form", 1024 | "id": "zXHImWy6FhSW" 1025 | }, 1026 | "outputs": [], 1027 | "source": [ 1028 | "#@title 飄天文學\n", 1029 | "#@markdown Still being revised; this cell can be run on its own\n", 1030 | "def check_package(itm):\n", 1031 | " import importlib\n", 1032 | " try:\n", 1033 | " importlib.import_module(itm)\n", 1034 | " print(f\"{itm} is already installed\")\n", 1035 | " except ImportError:\n", 1036 | " print(f\"{itm} is not installed; installing...\")\n", 1037 | " !pip install {itm}\n", 1038 | "\n", 1039 | "\n", 1040 | "check_package(\"inlp\")\n", 1041 | "\n", 1042 | "\n", 1043 | "import requests\n", 1044 | "import os,re\n", 1045 | "from bs4 import BeautifulSoup\n", 1046 | "import concurrent.futures\n", 1047 | "from inlp.convert import chinese\n", 1048 | "\n", 1049 | "try:\n", 1050 | " os.mkdir(\"/content/tmp\")\n", 1051 | "except FileExistsError:\n", 1052 | " print(\"Directory already exists\")\n", 1053 | "os.chdir(\"/content/tmp\")\n", 1054 | "os.system(\"rm -fr *\")\n", 1055 | "\n", 1056 | "\n", 1057 | "def get_html(urls):\n", 1058 | " [title,art_url]=urls\n", 1059 | " art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 1060 | " # print(art_url)\n", 1061 | " req=requests.get(art_url)\n", 1062 | " # req.encoding=\"gbk\"\n", 1063 | " soup=BeautifulSoup(req.text)\n", 1064 | " # content=soup.find('div', {'id': 'content', 'class': 
'fonts_mesne'})\n", 1065 | "\n", 1066 | " print(art_id)\n", 1067 | "\n", 1068 | "\n", 1069 | " # Find all the siblings of the table element up to the center element\n", 1070 | " content = []\n", 1071 | " content=soup.find(name=\"div\",id='booktxt')\n", 1072 | " # print(content)\n", 1073 | " # print(soup)\n", 1074 | " for tag in content.find_all(['div', 'h1','table','script','center']):\n", 1075 | " tag.extract()\n", 1076 | "\n", 1077 | " text=f\"# {title}\\n\\n\"\n", 1078 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 1079 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 1080 | " context=re.sub('
','',context)\n", 1081 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 1082 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 1083 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 1084 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 1085 | " text=chinese.s2t(text)\n", 1086 | "\n", 1087 | "\n", 1088 | "\n", 1089 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 1090 | " f.write(text)\n", 1091 | "\n", 1092 | "\n", 1093 | "#@markdown 書籍目錄網址\n", 1094 | "url=\"https://www.piaotianba.com/ptoewgi\" #@param {type:'string'}\n", 1095 | "title=\"先修諸天萬道再修仙\" #@param {type:\"string\"}\n", 1096 | "author=\"門前鴨\" #@param {type:\"string\"}\n", 1097 | "#@markdown 打勾,將會直接變成 epub\n", 1098 | "file2epub = True #@param {type:\"boolean\"}\n", 1099 | "\n", 1100 | "# 標題設定義\n", 1101 | "YAML=f'''---\n", 1102 | "title: {title}\n", 1103 | "author: {author}\n", 1104 | "language: zh-Hant\n", 1105 | "---'''\n", 1106 | "\n", 1107 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 1108 | " f.write(YAML)\n", 1109 | "\n", 1110 | "# sites=url\n", 1111 | "sites=url[:url.rfind(\"/\")]\n", 1112 | "\n", 1113 | "reg=requests.get(url)\n", 1114 | "#reg.encoding = 'gbk'\n", 1115 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 1116 | "soup=BeautifulSoup(reg.text)\n", 1117 | "\n", 1118 | "\n", 1119 | "# articles=soup.find(name=\"div\",class_=\"centent\").find_all(\"a\")\n", 1120 | "articles=soup.find(name=\"dl\",id=\"newlist\").find_all(\"a\")\n", 1121 | "\n", 1122 | "links=[]\n", 1123 | "# len(articles)\n", 1124 | "for i in articles:\n", 1125 | " href=i.get(\"href\")\n", 1126 | " links.append([chinese.s2t(i.text),f\"{sites}/{href}\"])\n", 1127 | "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n", 1128 | "\n", 1129 | "# # 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 1130 | "# # # 同時建立及啟用10個執行緒\n", 1131 | "# # with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 1132 | "# # executor.map(get_html, links)\n", 1133 | "for link in 
links:\n", 1134 | " get_html(link)\n", 1135 | "\n", 1136 | "output_name=title\n", 1137 | "# files_text=os.listdir()\n", 1138 | "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 1139 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 1140 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 1141 | "if file2epub:\n", 1142 | " mdfiles=[ itm for itm in files_text]\n", 1143 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 1144 | " from google.colab import files\n", 1145 | " files.download('../{}.epub'.format(title))\n", 1146 | " pass\n", 1147 | "else:\n", 1148 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 1149 | " for file in files_text:\n", 1150 | " with open(file,\"r\") as f2:\n", 1151 | " f.write(f2.read())\n", 1152 | " from google.colab import files\n", 1153 | " files.download('../{}.txt'.format(output_name))\n", 1154 | "\n" 1155 | ] 1156 | }, 1157 | { 1158 | "cell_type": "code", 1159 | "execution_count": null, 1160 | "metadata": { 1161 | "cellView": "form", 1162 | "id": "Sj1k1RHcre_O" 1163 | }, 1164 | "outputs": [], 1165 | "source": [ 1166 | "#@title 69書吧\n", 1167 | "#@markdown 還在修正的程式,可以直接從這一個區塊執行\n", 1168 | "def check_package(itm):\n", 1169 | " import importlib\n", 1170 | " try:\n", 1171 | " importlib.import_module(itm)\n", 1172 | " print(f\"{itm} 套件已經安裝\")\n", 1173 | " except ImportError:\n", 1174 | " print(f\"{itm} 套件尚未安裝,正在安裝中...\")\n", 1175 | " !pip install {itm}\n", 1176 | "\n", 1177 | "\n", 1178 | "check_package(\"inlp\")\n", 1179 | "\n", 1180 | "\n", 1181 | "import requests\n", 1182 | "import os,re\n", 1183 | "from bs4 import BeautifulSoup\n", 1184 | "import concurrent.futures\n", 1185 | "from inlp.convert import chinese\n", 1186 | "\n", 1187 | "try:\n", 1188 | " os.mkdir(\"/content/tmp\")\n", 1189 | "except:\n", 1190 | " print(\"目錄已存在\")\n", 1191 | "os.chdir(\"/content/tmp\")\n", 1192 | "os.system(\"rm -fr *\")\n", 1193 | "\n", 1194 | "\n", 1195 | "def 
get_html(urls):\n", 1196 | " [title,art_url]=urls\n", 1197 | " art_id=art_url[art_url.rfind(\"/\")+1:]\n", 1198 | " # print(art_url)\n", 1199 | " req=requests.get(art_url)\n", 1200 | " req.encoding=\"gbk\"\n", 1201 | " soup=BeautifulSoup(req.text)\n", 1202 | " print(art_id)\n", 1203 | "\n", 1204 | " # Find all the siblings of the table element up to the center element\n", 1205 | " content = []\n", 1206 | " content=soup.find(name=\"div\",class_=\"txtnav\")\n", 1207 | "\n", 1208 | " for tag in content.find_all(['div','h1','script','center']):\n", 1209 | " tag.extract()\n", 1210 | "\n", 1211 | "\n", 1212 | " text=f\"# {title}\\n\\n\"\n", 1213 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 1214 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 1215 | " context=re.sub('
','',context)\n", 1216 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 1217 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 1218 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 1219 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 1220 | " text=chinese.s2t(text)\n", 1221 | "\n", 1222 | " # print(text)\n", 1223 | "\n", 1224 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 1225 | " f.write(text)\n", 1226 | "\n", 1227 | "\n", 1228 | "#@markdown 書籍目錄網址\n", 1229 | "url=\"https://69shuba.cx/book/58865/\" #@param {type:'string'}\n", 1230 | "title=\"暴富很難?我的超市通古今!\" #@param {type:\"string\"}\n", 1231 | "author=\"琴止\" #@param {type:\"string\"}\n", 1232 | "#@markdown 打勾,將會直接變成 epub\n", 1233 | "file2epub = True #@param {type:\"boolean\"}\n", 1234 | "\n", 1235 | "# 標題設定義\n", 1236 | "YAML=f'''---\n", 1237 | "title: {title}\n", 1238 | "author: {author}\n", 1239 | "language: zh-Hant\n", 1240 | "---'''\n", 1241 | "\n", 1242 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 1243 | " f.write(YAML)\n", 1244 | "\n", 1245 | "sites=url\n", 1246 | "# sites=url[:url.rfind(\"/\")]\n", 1247 | "\n", 1248 | "reg=requests.get(url)\n", 1249 | "reg.encoding = 'gbk'\n", 1250 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 1251 | "soup=BeautifulSoup(reg.text)\n", 1252 | "\n", 1253 | "articles=soup.find(name=\"div\",id=\"catalog\").find_all(\"a\")\n", 1254 | "\n", 1255 | "links=[]\n", 1256 | "# # len(articles)\n", 1257 | "for i in articles:\n", 1258 | " href=i.get(\"href\")\n", 1259 | " links.append([chinese.s2t(i.text),href])\n", 1260 | "\n", 1261 | "files_text=[link[1][link[1].rfind(\"/\")+1:]+\".txt\" for link in links]\n", 1262 | "\n", 1263 | "# 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 1264 | "# # 同時建立及啟用10個執行緒\n", 1265 | "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 1266 | "# executor.map(get_html, links)\n", 1267 | "for link in links:\n", 1268 | " get_html(link)\n", 1269 | "\n", 1270 | "output_name=title\n", 1271 | "# 
files_text=os.listdir()\n", 1272 | "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 1273 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 1274 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 1275 | "if file2epub:\n", 1276 | " mdfiles=[ itm for itm in files_text]\n", 1277 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 1278 | " from google.colab import files\n", 1279 | " files.download('../{}.epub'.format(title))\n", 1280 | " pass\n", 1281 | "else:\n", 1282 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 1283 | " for file in files_text:\n", 1284 | " with open(file,\"r\") as f2:\n", 1285 | " f.write(f2.read())\n", 1286 | " from google.colab import files\n", 1287 | " files.download('../{}.txt'.format(output_name))" 1288 | ] 1289 | }, 1290 | { 1291 | "cell_type": "code", 1292 | "execution_count": null, 1293 | "metadata": { 1294 | "cellView": "form", 1295 | "id": "lkVncEDb8tXj" 1296 | }, 1297 | "outputs": [], 1298 | "source": [ 1299 | "#@title 笔趣阁(bqg9527.net/)\n", 1300 | "#@markdown 還在修正的程式,可以直接從這一個區塊執行\n", 1301 | "\n", 1302 | "def check_package(itm):\n", 1303 | " import importlib\n", 1304 | " try:\n", 1305 | " importlib.import_module(itm)\n", 1306 | " print(f\"{itm} 套件已經安裝\")\n", 1307 | " except ImportError:\n", 1308 | " print(f\"{itm} 套件尚未安裝,正在安裝中...\")\n", 1309 | " !pip install {itm}\n", 1310 | "\n", 1311 | "\n", 1312 | "check_package(\"inlp\")\n", 1313 | "\n", 1314 | "import requests\n", 1315 | "import os,re\n", 1316 | "from bs4 import BeautifulSoup\n", 1317 | "import concurrent.futures\n", 1318 | "from inlp.convert import chinese\n", 1319 | "\n", 1320 | "\n", 1321 | "try:\n", 1322 | " os.mkdir(\"/content/tmp\")\n", 1323 | "except:\n", 1324 | " print(\"目錄已存在\")\n", 1325 | "os.chdir(\"/content/tmp\")\n", 1326 | "os.system(\"rm -fr *\")\n", 1327 | "\n", 1328 | "\n", 1329 | "def get_html(urls):\n", 1330 | " [title,art_url]=urls\n", 1331 | " 
art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 1332 | "\n", 1333 | " soup=BeautifulSoup(requests.get(art_url).text)\n", 1334 | " # content=soup.find(name=\"div\",id='nr1')\n", 1335 | " content=soup.find(name=\"div\",id=\"content\")\n", 1336 | " print(art_id)\n", 1337 | " # print(soup)\n", 1338 | "\n", 1339 | " text=f\"# {title}\\n\\n\"\n", 1340 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 1341 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 1342 | " context=re.sub('
','',context)\n", 1343 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 1344 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 1345 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 1346 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 1347 | " text=chinese.s2t(text)\n", 1348 | "\n", 1349 | " # print(text)\n", 1350 | "\n", 1351 | "\n", 1352 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 1353 | " f.write(text)\n", 1354 | "\n", 1355 | "\n", 1356 | "#@markdown 書籍目錄網址\n", 1357 | "url=\"https://www.bqg9527.net/book/342930/\" #@param {type:'string'}\n", 1358 | "title=\"我出錢你出命,我倆一起神經病\" #@param {type:\"string\"}\n", 1359 | "author=\"我煞費苦心\" #@param {type:\"string\"}\n", 1360 | "#@markdown 打勾,將會直接變成 epub\n", 1361 | "file2epub = True #@param {type:\"boolean\"}\n", 1362 | "\n", 1363 | "# 標題設定義\n", 1364 | "YAML=f'''---\n", 1365 | "title: {title}\n", 1366 | "author: {author}\n", 1367 | "language: zh-Hant\n", 1368 | "---'''\n", 1369 | "\n", 1370 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 1371 | " f.write(YAML)\n", 1372 | "\n", 1373 | "# sites=url[:url.find(\"/\",8)+1]\n", 1374 | "# sites=url[:url.rfind(\"/\")]\n", 1375 | "sites=url\n", 1376 | "reg=requests.get(url)\n", 1377 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 1378 | "soup=BeautifulSoup(reg.text)\n", 1379 | "# articles=soup.find_all(name=\"ul\",class_=\"chapter\")[1].find_all(\"a\")\n", 1380 | "articles=soup.find(name=\"div\",id=\"list\").find_all(\"dt\")[1].find_next_siblings('dd')\n", 1381 | "\n", 1382 | "# second_dt=soup.find_all('dt')[1]\n", 1383 | "# dd_tags = second_dt.find_next_siblings('dd')\n", 1384 | "\n", 1385 | "links=[]\n", 1386 | "# len(articles)\n", 1387 | "for itm in articles:\n", 1388 | " i=itm.find(\"a\")\n", 1389 | " href=i.get(\"href\")\n", 1390 | " if href[-4:]!= \"html\":\n", 1391 | " continue\n", 1392 | " links.append([i.text,f\"{sites}{href}\"])\n", 1393 | "\n", 1394 | "\n", 1395 | "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in 
links]\n", 1396 | "# print(links[0])\n", 1397 | "# print(files_text[0])\n", 1398 | "# get_html(links[0])\n", 1399 | "\n", 1400 | "# # 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 1401 | "# # # 同時建立及啟用10個執行緒\n", 1402 | "with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 1403 | " executor.map(get_html, links)\n", 1404 | "\n", 1405 | "# for link in links:\n", 1406 | " # get_html(link)\n", 1407 | "\n", 1408 | "# output_name=soup.find(\"h1\").getText()\n", 1409 | "output_name=title\n", 1410 | "# files_text=os.listdir()\n", 1411 | "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 1412 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 1413 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 1414 | "if file2epub:\n", 1415 | " mdfiles=[ itm for itm in files_text]\n", 1416 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 1417 | " from google.colab import files\n", 1418 | " files.download('../{}.epub'.format(title))\n", 1419 | " pass\n", 1420 | "else:\n", 1421 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 1422 | " for file in files_text[::-1]:\n", 1423 | " with open(file,\"r\") as f2:\n", 1424 | " f.write(f2.read())\n", 1425 | " from google.colab import files\n", 1426 | " files.download('../{}.txt'.format(output_name))\n", 1427 | "\n" 1428 | ] 1429 | }, 1430 | { 1431 | "cell_type": "code", 1432 | "execution_count": null, 1433 | "metadata": { 1434 | "cellView": "form", 1435 | "id": "Afz6HuGb85ZL" 1436 | }, 1437 | "outputs": [], 1438 | "source": [ 1439 | "#@title 笔趣阁(p2wt)\n", 1440 | "#@markdown 還在修正的程式,可以直接從這一個區塊執行\n", 1441 | "\n", 1442 | "def check_package(itm):\n", 1443 | " import importlib\n", 1444 | " try:\n", 1445 | " importlib.import_module(itm)\n", 1446 | " print(f\"{itm} 套件已經安裝\")\n", 1447 | " except ImportError:\n", 1448 | " print(f\"{itm} 套件尚未安裝,正在安裝中...\")\n", 1449 | " !pip install {itm}\n", 1450 | "\n", 1451 | "\n", 1452 | "check_package(\"inlp\")\n", 
1453 | "\n", 1454 | "import requests\n", 1455 | "import os,re\n", 1456 | "from bs4 import BeautifulSoup\n", 1457 | "import concurrent.futures\n", 1458 | "from inlp.convert import chinese\n", 1459 | "\n", 1460 | "\n", 1461 | "try:\n", 1462 | " os.mkdir(\"/content/tmp\")\n", 1463 | "except:\n", 1464 | " print(\"目錄已存在\")\n", 1465 | "os.chdir(\"/content/tmp\")\n", 1466 | "os.system(\"rm -fr *\")\n", 1467 | "\n", 1468 | "\n", 1469 | "def get_html(urls):\n", 1470 | " [title,art_url]=urls\n", 1471 | " art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 1472 | " reg=requests.get(art_url)\n", 1473 | " reg.encoding=\"utf-8\"\n", 1474 | " soup=BeautifulSoup(reg.text)\n", 1475 | " content=soup.find(name=\"div\",id='chaptercontent')\n", 1476 | " print(art_id)\n", 1477 | "\n", 1478 | " text=f\"# {title}\\n\\n\"\n", 1479 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 1480 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 1481 | " context=re.sub('
','',context)\n", 1482 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 1483 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 1484 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 1485 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 1486 | " text=chinese.s2t(text)\n", 1487 | "\n", 1488 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 1489 | " f.write(text)\n", 1490 | "\n", 1491 | "\n", 1492 | "\n", 1493 | "def toEpub():\n", 1494 | "# files_text=os.listdir()\n", 1495 | "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 1496 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 1497 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 1498 | " if file2epub:\n", 1499 | " mdfiles=[ itm for itm in files_text]\n", 1500 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 1501 | " from google.colab import files\n", 1502 | " files.download('../{}.epub'.format(title))\n", 1503 | " pass\n", 1504 | " else:\n", 1505 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 1506 | " for file in files_text[::-1]:\n", 1507 | " with open(file,\"r\") as f2:\n", 1508 | " f.write(f2.read())\n", 1509 | " from google.colab import files\n", 1510 | " files.download('../{}.txt'.format(output_name))\n", 1511 | "\n", 1512 | "\n", 1513 | "#@markdown 書籍目錄網址\n", 1514 | "url=\"https://m.bqg9527.net/book/480/\" #@param {type:'string'}\n", 1515 | "title=\"\\u7D66\\u4E0D\\u8D77\\u5F69\\u79AE\\uFF0C\\u53EA\\u597D\\u5A36\\u4E86\\u9B54\\u9580\\u8056\\u5973\" #@param {type:\"string\"}\n", 1516 | "author=\"\\u5149\\u5F71\" #@param {type:\"string\"}\n", 1517 | "#@markdown 打勾,將會直接變成 epub\n", 1518 | "file2epub = True #@param {type:\"boolean\"}\n", 1519 | "\n", 1520 | "# 標題設定義\n", 1521 | "YAML=f'''---\n", 1522 | "title: {title}\n", 1523 | "author: {author}\n", 1524 | "language: zh-Hant\n", 1525 | "---'''\n", 1526 | "\n", 1527 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 1528 | " f.write(YAML)\n", 1529 | "\n", 1530 | 
"sites=url[:url.find(\"/\",8)]\n", 1531 | "# sites=url[:url.rfind(\"/\")], 取得 sites\n", 1532 | "\n", 1533 | "reg=requests.get(url)\n", 1534 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 1535 | "reg.encoding=\"utf-8\"\n", 1536 | "soup=BeautifulSoup(reg.text)\n", 1537 | "output_name=soup.find(\"h1\").getText()\n", 1538 | "articles=soup.find(name=\"div\",class_=\"listmain\").find_all(\"a\")\n", 1539 | "\n", 1540 | "\n", 1541 | "links=[]\n", 1542 | "# len(articles)\n", 1543 | "for i in articles:\n", 1544 | " href=i.get(\"href\")\n", 1545 | " if href[-4:]!= \"html\":\n", 1546 | " continue\n", 1547 | " links.append([i.text,f\"{sites}{href}\"])\n", 1548 | "\n", 1549 | "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n", 1550 | "\n", 1551 | "\n", 1552 | "# 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 1553 | "# # 同時建立及啟用10個執行緒\n", 1554 | "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 1555 | "# executor.map(get_html, links)\n", 1556 | "for link in links:\n", 1557 | " get_html(link)\n", 1558 | "\n", 1559 | "\n", 1560 | "toEpub()\n" 1561 | ] 1562 | }, 1563 | { 1564 | "cell_type": "code", 1565 | "execution_count": null, 1566 | "metadata": { 1567 | "id": "roy0O2ZJnAvz", 1568 | "cellView": "form" 1569 | }, 1570 | "outputs": [], 1571 | "source": [ 1572 | "#@title 愛下電子書(撰寫中)\n", 1573 | "\n", 1574 | "\n", 1575 | "import requests\n", 1576 | "import os,re\n", 1577 | "from bs4 import BeautifulSoup\n", 1578 | "import concurrent.futures\n", 1579 | "\n", 1580 | "try:\n", 1581 | " os.mkdir(\"/content/tmp\")\n", 1582 | "except:\n", 1583 | " print(\"目錄已存在\")\n", 1584 | "os.chdir(\"/content/tmp\")\n", 1585 | "os.system(\"rm -fr *\")\n", 1586 | "\n", 1587 | "\n", 1588 | "\n", 1589 | "def get_html(urls):\n", 1590 | " [title,art_url]=urls\n", 1591 | " art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 1592 | "\n", 1593 | " soup=BeautifulSoup(requests.get(art_url).text)\n", 1594 | " # 
content=soup.find(name=\"div\",id='nr1')\n", 1595 | " content=soup.find(\"section\")\n", 1596 | "\n", 1597 | " print(art_id)\n", 1598 | "\n", 1599 | " text=f\"# {title}\\n\\n\"\n", 1600 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 1601 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 1602 | " context=re.sub('
','',context)\n", 1603 | " context=re.sub('','',context)\n", 1604 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 1605 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 1606 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 1607 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 1608 | " # text=chinese.s2t(text)\n", 1609 | "\n", 1610 | " # print(text)\n", 1611 | "\n", 1612 | "\n", 1613 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 1614 | " f.write(text)\n", 1615 | "\n", 1616 | "\n", 1617 | "\n", 1618 | "\n", 1619 | "\n", 1620 | "#@markdown 書籍目錄網址\n", 1621 | "url=\"https://ixdzs8.tw/read/381070/p484.html\" #@param {type:'string'}\n", 1622 | "title=\"豪門梟士\" #@param {type:\"string\"}\n", 1623 | "author=\"雲山風海\" #@param {type:\"string\"}\n", 1624 | "#@markdown 打勾,將會直接變成 epub\n", 1625 | "file2epub = True #@param {type:\"boolean\"}\n", 1626 | "\n", 1627 | "# 標題設定義\n", 1628 | "YAML=f'''---\n", 1629 | "title: {title}\n", 1630 | "author: {author}\n", 1631 | "language: zh-Hant\n", 1632 | "---'''\n", 1633 | "\n", 1634 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 1635 | " f.write(YAML)\n", 1636 | "\n", 1637 | "sites=url[:url.find(\"/\",8)]\n", 1638 | "# sites=url[:url.rfind(\"/\")]\n", 1639 | "\n", 1640 | "# reg=requests.get(url)\n", 1641 | "soup=BeautifulSoup(data)\n", 1642 | "# articles=soup.find(name=\"div\",id=\"readerlists\").find_all(\"a\")\n", 1643 | "articles=soup.find_all(\"a\")\n", 1644 | "# print(articles)\n", 1645 | "\n", 1646 | "\n", 1647 | "links=[]\n", 1648 | "\n", 1649 | "for i in articles:\n", 1650 | " href=i.get(\"href\")\n", 1651 | " itmTitle=i.getText()\n", 1652 | " links.append([itmTitle,f\"{sites}{href}\"])\n", 1653 | "# links.sort(key=lambda x: x[1])\n", 1654 | "\n", 1655 | "\n", 1656 | "files_text=[link[1][link[1].rfind(\"/\")+1:-5]+\".txt\" for link in links]\n", 1657 | "# links[:4]\n", 1658 | "\n", 1659 | "# 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 1660 | "# # 同時建立及啟用10個執行緒\n", 1661 | "# with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 1662 | "# 
executor.map(get_html, links)\n", 1663 | "\n", 1664 | "for link in links:\n", 1665 | " get_html(link)\n", 1666 | "\n", 1667 | "output_name=title\n", 1668 | "listfiles=os.listdir()\n", 1669 | "mdfiles=[ itm for itm in files_text if itm in listfiles]\n", 1670 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 1671 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 1672 | "if file2epub:\n", 1673 | " # mdfiles=[ itm for itm in files_text if itm in listfiles]\n", 1674 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 1675 | " from google.colab import files\n", 1676 | " files.download('../{}.epub'.format(title))\n", 1677 | " pass\n", 1678 | "else:\n", 1679 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 1680 | " for file in mdfiles:\n", 1681 | " with open(file,\"r\") as f2:\n", 1682 | " f.write(f2.read())\n", 1683 | " from google.colab import files\n", 1684 | " files.download('../{}.txt'.format(output_name))\n", 1685 | "\n" 1686 | ] 1687 | }, 1688 | { 1689 | "cell_type": "code", 1690 | "execution_count": null, 1691 | "metadata": { 1692 | "id": "X-2pn1oynXxC", 1693 | "cellView": "form" 1694 | }, 1695 | "outputs": [], 1696 | "source": [ 1697 | "#@title 52書庫(www.52shuku.vip)\n", 1698 | "#@markdown 還在修正的程式,可以直接從這一個區塊執行\n", 1699 | "\n", 1700 | "def check_package(itm):\n", 1701 | " import importlib\n", 1702 | " try:\n", 1703 | " importlib.import_module(itm)\n", 1704 | " print(f\"{itm} 套件已經安裝\")\n", 1705 | " except ImportError:\n", 1706 | " print(f\"{itm} 套件尚未安裝,正在安裝中...\")\n", 1707 | " !pip install {itm}\n", 1708 | "\n", 1709 | "\n", 1710 | "check_package(\"inlp\")\n", 1711 | "\n", 1712 | "import requests\n", 1713 | "import os,re\n", 1714 | "from bs4 import BeautifulSoup\n", 1715 | "import concurrent.futures\n", 1716 | "from inlp.convert import chinese\n", 1717 | "\n", 1718 | "\n", 1719 | "try:\n", 1720 | " os.mkdir(\"/content/tmp\")\n", 1721 | "except:\n", 1722 | " print(\"目錄已存在\")\n", 1723 | 
"os.chdir(\"/content/tmp\")\n", 1724 | "os.system(\"rm -fr *\")\n", 1725 | "\n", 1726 | "\n", 1727 | "def get_html(urls):\n", 1728 | " [title,art_url]=urls\n", 1729 | " art_id=art_url[art_url.rfind(\"/\")+1:-5]\n", 1730 | " # print(art_id)\n", 1731 | " # print(art_url)\n", 1732 | " reg=requests.get(art_url)\n", 1733 | " reg.encoding=\"utf-8\"\n", 1734 | " soup=BeautifulSoup(reg.text)\n", 1735 | " content=soup.find(name=\"div\",id='text')\n", 1736 | " print(art_id)\n", 1737 | "\n", 1738 | " text=f\"# {title}\\n\\n\"\n", 1739 | " context=str(content).replace(\"\\n\",\"\").replace(\"\\r\",\"\")\n", 1740 | " context=context.replace(\"\\xa0\\xa0\\xa0\\xa0\",\"\")\n", 1741 | " context=re.sub('
','',context)\n", 1742 | " context=context.replace(\"
\",\"\\n\").replace(\"

\",\"\\n\")\n", 1743 | " context=re.sub('<.*?>',\"\",context).split(\"\\n\")\n", 1744 | " context=[itm.strip() for itm in context if len(itm)>0]\n", 1745 | " text+=\"\\n\\n\".join(context)+\"\\n\\n\"\n", 1746 | " text=chinese.s2t(text)\n", 1747 | "\n", 1748 | " # print(text)\n", 1749 | " with open(f\"{art_id}.txt\",mode=\"w\",encoding=\"utf-8\") as f:\n", 1750 | " f.write(text)\n", 1751 | "\n", 1752 | "\n", 1753 | "#@markdown 書籍目錄網址\n", 1754 | "url=\"https://www.52shuku.vip/yanqing/b/bjPGg.html\" #@param {type:'string'}\n", 1755 | "title=\"\\u59D1\\u5A18\\u5979\\u7F8E\\u8C8C\\u5374\\u66B4\\u529B\" #@param {type:\"string\"}\n", 1756 | "author=\"\\u4E60\\u6829\\u5112\\u751F\" #@param {type:\"string\"}\n", 1757 | "#@markdown 打勾,將會直接變成 epub\n", 1758 | "file2epub = True #@param {type:\"boolean\"}\n", 1759 | "\n", 1760 | "# 標題設定義\n", 1761 | "YAML=f'''---\n", 1762 | "title: {title}\n", 1763 | "author: {author}\n", 1764 | "language: zh-Hant\n", 1765 | "---'''\n", 1766 | "\n", 1767 | "with open(\"title.txt\",mode=\"w\",encoding='utf-8') as f:\n", 1768 | " f.write(YAML)\n", 1769 | "\n", 1770 | "sites=url[:url.find(\"/\",8)]\n", 1771 | "# sites=url[:url.rfind(\"/\")]\n", 1772 | "\n", 1773 | "reg=requests.get(url)\n", 1774 | "reg.encoding=\"utf-8\"\n", 1775 | "# soup=BeautifulSoup(reg.text,\"html.parser\")\n", 1776 | "soup=BeautifulSoup(reg.text)\n", 1777 | "output_name=soup.find(\"h1\").getText()\n", 1778 | "articles=soup.find(name=\"ul\",class_=\"list\").find_all(\"a\")\n", 1779 | "\n", 1780 | "\n", 1781 | "links=[]\n", 1782 | "# len(articles)\n", 1783 | "for i in articles:\n", 1784 | " href=i.get(\"href\")\n", 1785 | " if href[-4:]!= \"html\":\n", 1786 | " continue\n", 1787 | " links.append([i.text,f\"{href}\"])\n", 1788 | "\n", 1789 | "files_text=[link[1][link[1].rfind(\"/\")+1:link[1].rfind(\".\")]+\".txt\" for link in links]\n", 1790 | "\n", 1791 | "\n", 1792 | "\n", 1793 | "# 暫時無法使用,目前 colab 只要開放 2 個執行緖、只能運作 60 秒\n", 1794 | "# # 同時建立及啟用10個執行緒\n", 1795 | "# with 
concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:\n", 1796 | "# executor.map(get_html, links)\n", 1797 | "for link in links:\n", 1798 | " get_html(link)\n", 1799 | "\n", 1800 | "output_name=soup.find(\"h1\").getText()\n", 1801 | "# files_text=os.listdir()\n", 1802 | "# files_text=[file for file in files_text if file.endswith(\".txt\")]\n", 1803 | "# 檔案排序,需要考慮 檔案名稱長短不一的問題,問前是透過數字的處理\n", 1804 | "# files_text.sort(key=lambda x:int(x[:-4]))\n", 1805 | "if file2epub:\n", 1806 | " mdfiles=[ itm for itm in files_text]\n", 1807 | " os.system(\"pandoc -o \\\"../{}.epub\\\" title.txt {}\".format(title,\" \".join(mdfiles)))\n", 1808 | " from google.colab import files\n", 1809 | " files.download('../{}.epub'.format(title))\n", 1810 | " pass\n", 1811 | "else:\n", 1812 | " with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 1813 | " for file in files_text[::-1]:\n", 1814 | " with open(file,\"r\") as f2:\n", 1815 | " f.write(f2.read())\n", 1816 | " from google.colab import files\n", 1817 | " files.download('../{}.txt'.format(output_name))\n", 1818 | "\n" 1819 | ] 1820 | }, 1821 | { 1822 | "cell_type": "markdown", 1823 | "metadata": { 1824 | "id": "XJOqF-eIbaCT" 1825 | }, 1826 | "source": [ 1827 | "# 測試後,暫時沒有使用的程式碼區塊" 1828 | ] 1829 | }, 1830 | { 1831 | "cell_type": "markdown", 1832 | "metadata": { 1833 | "id": "eNxKvxlfhvO6" 1834 | }, 1835 | "source": [ 1836 | "# 效率\n", 1837 | "\n", 1838 | "透過下述的方法,合併檔案,因為輸出檔需要被反覆的開始太多次,隨著檔案大小逐漸增加。讓效能下跌\n", 1839 | "\n", 1840 | "```python\n", 1841 | "for file in files:\n", 1842 | " os.system(\"cat {}>> ../{}.txt\".format(file,output_name))\n", 1843 | "```\n", 1844 | "若改用下述的方法, output 檔,只需要開啟一次。可以大大縮短時間。\n", 1845 | "\n", 1846 | "```python\n", 1847 | "with open(f\"../{output_name}.txt\",\"w\",encoding='utf-8') as f:\n", 1848 | " for file in files_text:\n", 1849 | " with open(file,\"r\") as f2:\n", 1850 | " f.write(f2.read())\n", 1851 | "```" 1852 | ] 1853 | }, 1854 | { 1855 | "cell_type": "markdown", 1856 | 
"metadata": { 1857 | "id": "8KCQpgIUtZ3M" 1858 | }, 1859 | "source": [ 1860 | "# 參考資料" 1861 | ] 1862 | }, 1863 | { 1864 | "cell_type": "code", 1865 | "execution_count": null, 1866 | "metadata": { 1867 | "cellView": "form", 1868 | "id": "4CCMkYp5OxmZ" 1869 | }, 1870 | "outputs": [], 1871 | "source": [ 1872 | "#@title 多執行序參考程式範例\n", 1873 | "\n", 1874 | "from bs4 import BeautifulSoup\n", 1875 | "import concurrent.futures\n", 1876 | "import requests\n", 1877 | "import time\n", 1878 | "\n", 1879 | "\n", 1880 | "def scrape(urls):\n", 1881 | "\n", 1882 | " response = requests.get(urls)\n", 1883 | "\n", 1884 | " soup = BeautifulSoup(response.content, \"lxml\")\n", 1885 | "\n", 1886 | " # 爬取文章標題\n", 1887 | " titles = soup.find_all(\"h3\", {\"class\": \"post_title\"})\n", 1888 | "\n", 1889 | " for title in titles:\n", 1890 | " print(title.getText().strip())\n", 1891 | "\n", 1892 | " time.sleep(2)\n", 1893 | "\n", 1894 | "\n", 1895 | "base_url = \"https://www.inside.com.tw/tag/AI\"\n", 1896 | "urls = [f\"{base_url}?page={page}\" for page in range(1, 6)] # 1~5頁的網址清單\n", 1897 | "print(urls)\n", 1898 | "start_time = time.time() # 開始時間\n", 1899 | "# scrape(urls)\n", 1900 | "\n", 1901 | "\n", 1902 | "# 同時建立及啟用10個執行緒\n", 1903 | "# with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:\n", 1904 | "# executor.map(scrape, urls)\n", 1905 | "\n", 1906 | "end_time = time.time()\n", 1907 | "print(f\"{end_time - start_time} 秒爬取 {len(urls)} 頁的文章\")" 1908 | ] 1909 | } 1910 | ], 1911 | "metadata": { 1912 | "colab": { 1913 | "provenance": [], 1914 | "mount_file_id": "11h2gvU2w0w-cWs1O2ciOwdgK4Wi6qndj", 1915 | "authorship_tag": "ABX9TyM7n29sOH7wFTuDQLfmH77G", 1916 | "include_colab_link": true 1917 | }, 1918 | "kernelspec": { 1919 | "display_name": "Python 3", 1920 | "name": "python3" 1921 | } 1922 | }, 1923 | "nbformat": 4, 1924 | "nbformat_minor": 0 1925 | } -------------------------------------------------------------------------------- /oTranscribe_txt_to_srt_格式轉換.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "oTranscribe txt to srt 格式轉換", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "authorship_tag": "ABX9TyPY0cE4hDO0EpD8PepnE1dl", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | } 16 | }, 17 | "cells": [ 18 | { 19 | "cell_type": "markdown", 20 | "metadata": { 21 | "id": "view-in-github", 22 | "colab_type": "text" 23 | }, 24 | "source": [ 25 | "\"Open" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": { 31 | "id": "wfFfVeSgXqTT" 32 | }, 33 | "source": [ 34 | "# oTranscribe txt 轉出轉 srt 格式\r\n", 35 | "\r\n", 36 | "srt 為 SubRip (.srt) 的格式,可用於 YouTube cc 字幕。" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "metadata": { 42 | "cellView": "form", 43 | "id": "T9JBJrZYZbZe" 44 | }, 45 | "source": [ 46 | "#@title 需求模組預載\r\n", 47 | "#@markdown 此區塊一定要執行\r\n", 48 | "\r\n", 49 | "from google.colab import files\r\n", 50 | "import re" 51 | ], 52 | "execution_count": null, 53 | "outputs": [] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "metadata": { 58 | "cellView": "form", 59 | "id": "MkCN-ubYYYSe" 60 | }, 61 | "source": [ 62 | "#@title 上傳檔案\r\n", 63 | "uploaded = files.upload()" 64 | ], 65 | "execution_count": null, 66 | "outputs": [] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "metadata": { 71 | "cellView": "form", 72 | "id": "1UYrae9gEnBI" 73 | }, 74 | "source": [ 75 | "#@title 環境設定\r\n", 76 | "\r\n", 77 | "#@markdown 上傳檔案名稱\r\n", 78 | "input_filename='1.txt' #@param {type:\"string\"} \r\n", 79 | "\r\n", 80 | "#@markdown 輸出檔案名稱\r\n", 81 | "output_filename=\"srt_output.txt\" #@param {type:\"string\"}" 82 | ], 83 | "execution_count": null, 84 | "outputs": [] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "metadata": { 89 | "cellView": "form", 90 | "id": "MzZEky3BHApL" 91 | }, 92 | "source": [ 93 | 
"#@title 執行格式轉換\r\n", 94 | "with open(input_filename,'r',encoding='utf-8') as f:\r\n", 95 | " text=f.read().replace(\"\\xa0\",' ')\r\n", 96 | "\r\n", 97 | "re_patten=r'([0-9:]+)\\s{0,2}(.*)\\s?\\n'\r\n", 98 | "aa=re.findall(re_patten,text)\r\n", 99 | "content=''\r\n", 100 | "end=len(aa)-1\r\n", 101 | "for idx,itm in enumerate(aa):\r\n", 102 | " if len(itm)==0:\r\n", 103 | " continue\r\n", 104 | " content+=\"%d\\n\"%(idx+1)\r\n", 105 | " if idx != end:\r\n", 106 | " content+=\"00:{} --> 00:{}\\n\".format(itm[0],aa[idx+1][0])\r\n", 107 | " else:\r\n", 108 | " tmp=itm[0].split(\":\")\r\n", 109 | " tmp[-1]=str(int(tmp[-1])+5)\r\n", 110 | " # print(\":\".join(tmp))\r\n", 111 | " content+=\"00:{} --> 00:{}\\n\".format(itm[0],\":\".join(tmp))\r\n", 112 | " content+=\"%s\\n\\n\"%itm[1]\r\n", 113 | " \r\n", 114 | "with open(output_filename,\"w\",encoding='utf-8') as f:\r\n", 115 | " f.write(content)\r\n", 116 | " " 117 | ], 118 | "execution_count": null, 119 | "outputs": [] 120 | }, 121 | { 122 | "cell_type": "code", 123 | "metadata": { 124 | "cellView": "form", 125 | "id": "QyeMdPh5HZJA" 126 | }, 127 | "source": [ 128 | "#@title 下載檔案\r\n", 129 | "from google.colab import files\r\n", 130 | "files.download(output_filename)" 131 | ], 132 | "execution_count": null, 133 | "outputs": [] 134 | } 135 | ] 136 | } -------------------------------------------------------------------------------- /whisper_Test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "gpuType": "T4", 8 | "mount_file_id": "1z5py79AseIPWuKFO0oZ95uo4t3nA_0iG", 9 | "authorship_tag": "ABX9TyPUgpFfyXfEk66r4MgTLpTv", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | }, 19 | "accelerator": "GPU" 20 | }, 21 | "cells": [ 22 | { 23 | "cell_type": 
"markdown", 24 | "metadata": { 25 | "id": "view-in-github", 26 | "colab_type": "text" 27 | }, 28 | "source": [ 29 | "\"Open" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "source": [ 35 | "# 語音轉文字 AI 工具\n", 36 | "本工具使用 [OpenAI 的開源工具 Whisper](https://github.com/openai/whisper) 模型, 可以相對精準的將隨語音轉文字。\n", 37 | "\n", 38 | "# (一) 選擇適合的運作環境: T4 GPU\n", 39 | "本 Colab 虛擬機器使用為免費、多GPU的環境。已指定 T4 GPU 版本。\n", 40 | "\n", 41 | "若由 Github 直接開啟,可以忽略此說明。" 42 | ], 43 | "metadata": { 44 | "id": "Z8j9agRoP2Ef" 45 | } 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": null, 50 | "metadata": { 51 | "id": "ESQe_Qm7Ceoz" 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "# @title (1) 安裝 whisper\n", 56 | "!pip install git+https://github.com/openai/whisper.git" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "source": [ 62 | "### (2) 掛載雲端硬碟\n", 63 | "1. 透過 Coloab 左邊的操作介面掛載\n", 64 | "2. 上傳音檔/影像檔到 Google drive\n", 65 | " - 個人偏好在 Google drive 建一個 tmp 資料夾\n", 66 | " - 將音檔上傳到 tmp 資料夾\n", 67 | " - 在 Colab 左邊的掛載介面找到 drive => MyDrive => tmp\n", 68 | " - 點選上載的音檔,按滑鼠右鍵,點選 複製路徑\n", 69 | "3. 
將複製的路徑貼到轉檔區塊的 filenames 欄位中\n" 70 | ], 71 | "metadata": { 72 | "id": "vj1rk1zOKoh7" 73 | } 74 | }, 75 | { 76 | "cell_type": "code", 77 | "source": [ 78 | "# @title (3) 轉檔\n", 79 | "import os\n", 80 | "filename = \"/content/drive/MyDrive/tmp/phison.mp4\" # @param {type:\"string\"}\n", 81 | "#@markdown 設定使用的模型, 請參考 [Whisper Model Card](https://github.com/openai/whisper/blob/main/model-card.md) 選擇適合的模型\n", 82 | "model= \"medium\" # @param {type:\"string\"}\n", 83 | "\n", 84 | "#@markdown 設定主要的語言,如 Chinese, English,其它請參表 [tokenizer 文件](https://github.com/openai/whisper/blob/main/whisper/tokenizer.py)\n", 85 | "language = \"English\" # @param {type:\"string\"}\n", 86 | "os.chdir(os.path.dirname(filename))\n", 87 | "os.getcwd()\n", 88 | "!whisper \"{filename}\" --model {model} --language {language}" 89 | ], 90 | "metadata": { 91 | "id": "t1k2RIWHDfhz", 92 | "cellView": "form" 93 | }, 94 | "execution_count": null, 95 | "outputs": [] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "source": [ 100 | "# (二) 取得法說會文字\n", 101 | "## 1. 請先執行 (1) 安裝 whisper\n", 102 | "## 2. 再執行 (4) 法說會逐字稿, ...." 
103 | ], 104 | "metadata": { 105 | "id": "OjboOjHQ6qOJ" 106 | } 107 | }, 108 | { 109 | "cell_type": "code", 110 | "source": [ 111 | "# @title (4) 法說會逐字稿,當影片在 Youtube 可直接使用這個\n", 112 | "!pip install yt-dlp\n", 113 | "\n", 114 | "tubeUrl = \"https://www.youtube.com/watch?v=Q6sI_eY6sdU\" # @param {type:\"string\"}\n", 115 | "import os\n", 116 | "from yt_dlp import YoutubeDL\n", 117 | "companyName=\"科技小電報\" # @param {type:\"string\"}\n", 118 | "model= \"large\" # @param {type:\"string\"}\n", 119 | "language = \"Chinese\" # @param {type:\"string\"}\n", 120 | "\n", 121 | "\n", 122 | "\n", 123 | "filename = companyName+\".m4a\"\n", 124 | "ydl_opts = {'overwrites': True, 'format': 'bestaudio[ext=m4a]', 'outtmpl': filename}\n", 125 | "with YoutubeDL(ydl_opts) as ydl:\n", 126 | " ydl.download([tubeUrl])\n", 127 | "\n", 128 | "!whisper \"{filename}\" --model {model} --language {language}\n", 129 | "\n", 130 | "from google.colab import files\n", 131 | "exts=[\"txt\",\"srt\",\"tsv\",\"vtt\"]\n", 132 | "for ext in exts:\n", 133 | " files.download('{}.{}'.format(companyName,ext))" 134 | ], 135 | "metadata": { 136 | "id": "2kNMQIdNgS48", 137 | "cellView": "form" 138 | }, 139 | "execution_count": null, 140 | "outputs": [] 141 | } 142 | ] 143 | } -------------------------------------------------------------------------------- /youtuber_逐字稿.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "youtuber 逐字稿", 7 | "provenance": [], 8 | "mount_file_id": "1PSXhAQ7DI5i_3836QB96XgGK7vi9kJ5z", 9 | "authorship_tag": "ABX9TyPv/Hov1nyluiLhL1f4L3FD", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | } 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 
27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "source": [ 34 | "!pip install pytube\n", 35 | "!pip install inlp\n", 36 | "!pip install speechrecognition\n", 37 | "#@markdown 安裝必要套件\n", 38 | "\n", 39 | "#!pip install you-get" 40 | ], 41 | "metadata": { 42 | "id": "gTr19YDIVkKC", 43 | "cellView": "form" 44 | }, 45 | "execution_count": null, 46 | "outputs": [] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "id": "i0rNL4f_VV3c", 53 | "cellView": "form" 54 | }, 55 | "outputs": [], 56 | "source": [ 57 | "#@title 輸入 Youtube 網址\n", 58 | "import os\n", 59 | "from pytube import YouTube\n", 60 | "\n", 61 | "url = \"https://www.youtube.com/watch?v=Jp8xnYRhWnw\" #@param {type:\"string\"}\n", 62 | "\n", 63 | "def onProgress(stream, chunk, remains):\n", 64 | " total = stream.filesize\n", 65 | " percent = (total-remains) / total * 100\n", 66 | " print('下載中… {:05.2f}%'.format(percent), end='\\r')\n", 67 | "\n", 68 | "\n", 69 | "# yt.streams.filter().get_highest_resolution().download()\n", 70 | "\n", 71 | "yt=YouTube(url,on_progress_callback=onProgress)\n", 72 | "yfilename=yt.streams.filter(type=\"audio\").first().download()\n", 73 | "\n", 74 | "# filename=yt.streams.filter(type=\"audio\").first().default_filename\n", 75 | "\n", 76 | "musicname=yfilename[:yfilename.rfind(\".\")]\n", 77 | "# yt.streams.filter(type=\"audio\").first().download()\n", 78 | "print(\"完成下載: \",yfilename)\n", 79 | "\n", 80 | "# print(\"完成下載: \",yt.streams.first().download())\n", 81 | "# print(\"轉檔中........\")\n", 82 | "# os.system('{} -i \"{}\" -vn -sn -dn \"{}.mp3\"'.format(\"ffmpeg\",filename, musicname))\n", 83 | "# print(\"完成轉檔: {}.mp3\".format(musicname))\n", 84 | "\n", 85 | "\n", 86 | "# # from google.colab import files\n", 87 | "# # files.download(\"{}.mp4\".format(filename[:filename.rfind(\".\")]))\n", 88 | "# from google.colab import files\n", 89 | "# files.download(f\"{musicname}.mp3\")\n", 90 | "\n", 91 | "import 
os\n", 92 | "import shutil\n", 93 | "import speech_recognition as sr\n", 94 | "import concurrent.futures\n", 95 | "import wave\n", 96 | "import json\n", 97 | "import numpy as np\n", 98 | "from inlp.convert import chinese\n", 99 | "\n", 100 | " \n", 101 | "\n", 102 | "mp3Name= yfilename\n", 103 | " \n", 104 | "CutTimeDef = 20 \n", 105 | "wav_path='wav' \n", 106 | "txt_path='txt' \n", 107 | "thread_num = 10 \n", 108 | "\n", 109 | "workpath=os.path.dirname(mp3Name)\n", 110 | "mp3Name=os.path.basename(mp3Name)\n", 111 | "FileName = mp3Name[:mp3Name.rfind(\".\")]+\".wav\"\n", 112 | "os.chdir(workpath)\n", 113 | "chk='y' \n", 114 | " \n", 115 | "def reset_dir(path):\n", 116 | " try:\n", 117 | " os.mkdir(path)\n", 118 | " except Exception:\n", 119 | " if chk==\"y\":\n", 120 | " shutil.rmtree(path)\n", 121 | " os.mkdir(path)\n", 122 | " \n", 123 | "def CutFile(FileName, target_path):\n", 124 | " \n", 125 | " # print(\"CutFile File Name is \", FileName)\n", 126 | " f = wave.open(FileName, \"rb\")\n", 127 | " params = f.getparams() \n", 128 | " nchannels, sampwidth, framerate, nframes = params[:4]\n", 129 | " CutFrameNum = framerate * CutTimeDef\n", 130 | " # 讀取格式資訊\n", 131 | " # 一次性返回所有的WAV檔案的格式資訊,它返回的是一個組元(tuple):聲道數, 量化位數(byte 單位), 採\n", 132 | " # 樣頻率, 取樣點數, 壓縮型別, 壓縮型別的描述。wave模組只支援非壓縮的資料,因此可以忽略最後兩個資訊\n", 133 | " \n", 134 | " # print(\"CutFrameNum=%d\" % (CutFrameNum))\n", 135 | " # print(\"nchannels=%d\" % (nchannels))\n", 136 | " # print(\"sampwidth=%d\" % (sampwidth))\n", 137 | " # print(\"framerate=%d\" % (framerate))\n", 138 | " # print(\"nframes=%d\" % (nframes))\n", 139 | " \n", 140 | " str_data = f.readframes(nframes)\n", 141 | " f.close() # 將波形資料轉換成陣列\n", 142 | " # Cutnum =nframes/framerate/CutTimeDef\n", 143 | " # 需要根據聲道數和量化單位,將讀取的二進位制資料轉換為一個可以計算的陣列\n", 144 | " wave_data = np.frombuffer(str_data, dtype=np.short)\n", 145 | " wave_data.shape = -1, 2\n", 146 | " wave_data = wave_data.T\n", 147 | " temp_data = wave_data.T\n", 148 | " # StepNum = int(nframes/200)\n", 
149 | " StepNum = CutFrameNum\n", 150 | " StepTotalNum = 0\n", 151 | " haha = 0\n", 152 | " while StepTotalNum < nframes:\n", 153 | " # for j in range(int(Cutnum)):\n", 154 | " # print(\"Stemp=%d\" % (haha))\n", 155 | " SaveFile = \"%s-%03d.wav\" % (FileName[:-4], (haha+1))\n", 156 | " # print(FileName)\n", 157 | " if haha % 3==0:\n", 158 | " print(\"*\",end='')\n", 159 | " temp_dataTemp = temp_data[StepNum * (haha):StepNum * (haha + 1)]\n", 160 | " haha = haha + 1\n", 161 | " StepTotalNum = haha * StepNum\n", 162 | " temp_dataTemp.shape = 1, -1\n", 163 | " temp_dataTemp = temp_dataTemp.astype(np.short) # 開啟WAV文件\n", 164 | " f = wave.open(target_path+\"/\" + SaveFile, \"wb\")\n", 165 | " # 配置聲道數、量化位數和取樣頻率\n", 166 | " f.setnchannels(nchannels)\n", 167 | " f.setsampwidth(sampwidth)\n", 168 | " f.setframerate(framerate)\n", 169 | " # 將wav_data轉換為二進位制資料寫入檔案\n", 170 | " f.writeframes(temp_dataTemp.tobytes())\n", 171 | " f.close()\n", 172 | " \n", 173 | "\n", 174 | " \n", 175 | "def texts_to_one(path, target_file):\n", 176 | " files = os.listdir(path)\n", 177 | " files.sort()\n", 178 | " files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n", 179 | " with open(target_file, \"w\", encoding=\"utf-8\") as f:\n", 180 | " for file in files:\n", 181 | " with open(file, \"r\", encoding='utf-8') as f2:\n", 182 | " f.write(f2.read())\n", 183 | " print(\"完成合併, 檔案位於 %s \" % target_file)\n", 184 | " \n", 185 | " \n", 186 | "def texts2otr(path, target_file, audio_name, timeperiod):\n", 187 | " template = '''

{}{}


\n", 188 | " '''\n", 189 | " files = os.listdir(path)\n", 190 | " files.sort()\n", 191 | " content = ''\n", 192 | " files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n", 193 | " with open(target_file, \"w\", encoding=\"utf-8\") as f:\n", 194 | " \n", 195 | " for file in files:\n", 196 | " with open(file, \"r\", encoding=\"utf-8\") as f2:\n", 197 | " txt = f2.read().split(\"\\n\")\n", 198 | " if len(txt) < 2:\n", 199 | " continue\n", 200 | " pos=txt[0].rfind(\".\")\n", 201 | " time=int(txt[0][pos-3:pos])\n", 202 | " # times = (int(txt[0].split(\"-\")[1][:-5])-1)*CutTimeDef\n", 203 | " times=(time-1)*CutTimeDef\n", 204 | " secs, mins = times % 60, (times//60) % 60\n", 205 | " hours = (times//60)//60\n", 206 | " timeF = \"{:02d}:{:02d}:{:02d}\".format(hours, mins, secs)\n", 207 | " content += template.format(times, timeF, txt[1])\n", 208 | " \n", 209 | " output = {\"text\": content, \"media\": audio_name,\n", 210 | " \"media-time\": timeperiod}\n", 211 | " f.write(json.dumps(output, ensure_ascii=False))\n", 212 | " print(\"完成合併, otr 檔案位於 %s \" % target_file)\n", 213 | " \n", 214 | "#@title 執行音頻轉換與分割\n", 215 | " \n", 216 | "print(\" mp3 轉 wav 檔 \".center(100,'=')) \n", 217 | "os.system('{} -i \"{}\" \"{}\"'.format(\"ffmpeg\",mp3Name, FileName))\n", 218 | "print(\" Wav 檔名為 {} \".format(FileName).center(96))\n", 219 | "reset_dir(wav_path)\n", 220 | "reset_dir(txt_path)\n", 221 | "# # Cut Wave Setting\n", 222 | "\n", 223 | "print(\" 音頻以每{}秒分割 \".format(CutTimeDef).center(94,'='))\n", 224 | "CutFile(FileName, wav_path)\n", 225 | "print(\"\")\n", 226 | "print(\" 完成分割 \".center(100,'-'))\n", 227 | "#@title 執行語音轉文字 (需要耗費不少時間)\n", 228 | "\n", 229 | "#@markdown 指定翻譯的語言類型,如何設定語系請參考 [支援列表](https://cloud.google.com/speech-to-text/docs/languages)\n", 230 | "voiceLanguage=\"cmn-Hant-TW\" #@param {type:\"string\"}\n", 231 | "\n", 232 | "def VoiceToText_thread(file):\n", 233 | " txt_file = \"%s/%s.txt\" % (txt_path, file[:-4])\n", 234 | " \n", 235 | " if 
os.path.isfile(txt_file):\n", 236 | " return\n", 237 | " with open(\"%s/%s.txt\" % (txt_path, file[:-4]), \"w\", encoding=\"utf-8\") as f:\n", 238 | " f.write(\"%s:\\n\" % file)\n", 239 | " r = sr.Recognizer() # 預設辨識英文\n", 240 | " with sr.WavFile(wav_path+\"/\"+file) as source: # 讀取wav檔\n", 241 | " audio = r.record(source)\n", 242 | " # r.adjust_for_ambient_noise(source)\n", 243 | " # audio = r.listen(source)\n", 244 | " try:\n", 245 | " text = r.recognize_google(audio,language = voiceLanguage)\n", 246 | " text = chinese.s2t(text)\n", 247 | " # r.recognize_google(audio)\n", 248 | " \n", 249 | " if len(text) == 0:\n", 250 | " print(\"===無資料==\")\n", 251 | " return\n", 252 | "\n", 253 | " print(f\"{file}\\t{text}\")\n", 254 | " f.write(\"%s \\n\\n\" % text)\n", 255 | " if file == files[-1]:\n", 256 | " print(\"結束翻譯\")\n", 257 | " except sr.RequestError as e:\n", 258 | " print(\"無法翻譯{0}\".format(e))\n", 259 | " # 兩個 except 是當語音辨識不出來的時候 防呆用的\n", 260 | " # 使用Google的服務\n", 261 | " except LookupError:\n", 262 | " print(\"Could not understand audio\")\n", 263 | " except sr.UnknownValueError:\n", 264 | " print(f\"Error: 無法識別 Audio\\t {file}\")\n", 265 | " \n", 266 | "\n", 267 | "\n", 268 | "\n", 269 | "files = os.listdir(wav_path)\n", 270 | "files.sort()\n", 271 | "\n", 272 | "with concurrent.futures.ThreadPoolExecutor(max_workers=thread_num) as executor:\n", 273 | " executor.map(VoiceToText_thread, files)\n", 274 | " \n", 275 | "# VoiceToText(wav_path, files, txt_path)\n", 276 | " \n", 277 | "target_txtfile = \"{}.txt\".format(FileName[:-4])\n", 278 | "texts_to_one(txt_path, target_txtfile)\n", 279 | "otr_file = \"{}.otr\".format(FileName[:-4])\n", 280 | "with wave.open(FileName, \"rb\") as f:\n", 281 | " params = f.getparams()\n", 282 | "texts2otr(txt_path, otr_file, FileName, params.nframes)" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "source": [ 288 | "`" 289 | ], 290 | "metadata": { 291 | "id": "f5CFfXa-ZomL" 292 | } 293 | }, 294 | { 295 | 
"cell_type": "code", 296 | "source": [ 297 | "#@title 下載 otr\n", 298 | "import os\n", 299 | "import shutil\n", 300 | "# 搬移音錄檔案到特定的目錄下\n", 301 | "source_dir=\"/content/\"\n", 302 | "target_dir=\"/content/drive/MyDrive/tmp/\"\n", 303 | "def findotr(arr):\n", 304 | " for itm in arr[::-1]:\n", 305 | " if \"otr\" in itm:\n", 306 | " return itm\n", 307 | "otrfilename=findotr(os.listdir(\".\"))\n", 308 | "from google.colab import files\n", 309 | "files.download(\"{}\".format(otrfilename))\n", 310 | "# shutil.move(f\"{source_dir}{yfilename}\",target_dir)" 311 | ], 312 | "metadata": { 313 | "colab": { 314 | "base_uri": "https://localhost:8080/", 315 | "height": 17 316 | }, 317 | "id": "XseEE1QBiBQu", 318 | "outputId": "7b4a5a66-f31b-4202-9ed8-c4c051161754", 319 | "cellView": "form" 320 | }, 321 | "execution_count": 14, 322 | "outputs": [ 323 | { 324 | "output_type": "display_data", 325 | "data": { 326 | "text/plain": [ 327 | "" 328 | ], 329 | "application/javascript": [ 330 | "\n", 331 | " async function download(id, filename, size) {\n", 332 | " if (!google.colab.kernel.accessAllowed) {\n", 333 | " return;\n", 334 | " }\n", 335 | " const div = document.createElement('div');\n", 336 | " const label = document.createElement('label');\n", 337 | " label.textContent = `Downloading \"${filename}\": `;\n", 338 | " div.appendChild(label);\n", 339 | " const progress = document.createElement('progress');\n", 340 | " progress.max = size;\n", 341 | " div.appendChild(progress);\n", 342 | " document.body.appendChild(div);\n", 343 | "\n", 344 | " const buffers = [];\n", 345 | " let downloaded = 0;\n", 346 | "\n", 347 | " const channel = await google.colab.kernel.comms.open(id);\n", 348 | " // Send a message to notify the kernel that we're ready.\n", 349 | " channel.send({})\n", 350 | "\n", 351 | " for await (const message of channel.messages) {\n", 352 | " // Send a message to notify the kernel that we're ready.\n", 353 | " channel.send({})\n", 354 | " if (message.buffers) {\n", 355 | " 
for (const buffer of message.buffers) {\n", 356 | " buffers.push(buffer);\n", 357 | " downloaded += buffer.byteLength;\n", 358 | " progress.value = downloaded;\n", 359 | " }\n", 360 | " }\n", 361 | " }\n", 362 | " const blob = new Blob(buffers, {type: 'application/binary'});\n", 363 | " const a = document.createElement('a');\n", 364 | " a.href = window.URL.createObjectURL(blob);\n", 365 | " a.download = filename;\n", 366 | " div.appendChild(a);\n", 367 | " a.click();\n", 368 | " div.remove();\n", 369 | " }\n", 370 | " " 371 | ] 372 | }, 373 | "metadata": {} 374 | }, 375 | { 376 | "output_type": "display_data", 377 | "data": { 378 | "text/plain": [ 379 | "" 380 | ], 381 | "application/javascript": [ 382 | "download(\"download_c235d0ba-d21a-43be-ad8a-cb63194b34f6\", \"\\u5b64\\u7368\\u7684\\u6211\\u662f\\u5e78\\u798f\\u7684\\uff5c\\u6587\\u68ee\\u8aaa\\u66f8.otr\", 13904)" 383 | ] 384 | }, 385 | "metadata": {} 386 | } 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "source": [ 392 | "#@title 下載 音檔\n", 393 | "from google.colab import files\n", 394 | "files.download(\"{}\".format(yfilename))" 395 | ], 396 | "metadata": { 397 | "colab": { 398 | "base_uri": "https://localhost:8080/", 399 | "height": 17 400 | }, 401 | "cellView": "form", 402 | "id": "hUXzAOTfbvE2", 403 | "outputId": "31939c83-611d-4e56-fa6c-7f97f136d77f" 404 | }, 405 | "execution_count": 15, 406 | "outputs": [ 407 | { 408 | "output_type": "display_data", 409 | "data": { 410 | "text/plain": [ 411 | "" 412 | ], 413 | "application/javascript": [ 414 | "\n", 415 | " async function download(id, filename, size) {\n", 416 | " if (!google.colab.kernel.accessAllowed) {\n", 417 | " return;\n", 418 | " }\n", 419 | " const div = document.createElement('div');\n", 420 | " const label = document.createElement('label');\n", 421 | " label.textContent = `Downloading \"${filename}\": `;\n", 422 | " div.appendChild(label);\n", 423 | " const progress = document.createElement('progress');\n", 424 | " progress.max = 
size;\n", 425 | " div.appendChild(progress);\n", 426 | " document.body.appendChild(div);\n", 427 | "\n", 428 | " const buffers = [];\n", 429 | " let downloaded = 0;\n", 430 | "\n", 431 | " const channel = await google.colab.kernel.comms.open(id);\n", 432 | " // Send a message to notify the kernel that we're ready.\n", 433 | " channel.send({})\n", 434 | "\n", 435 | " for await (const message of channel.messages) {\n", 436 | " // Send a message to notify the kernel that we're ready.\n", 437 | " channel.send({})\n", 438 | " if (message.buffers) {\n", 439 | " for (const buffer of message.buffers) {\n", 440 | " buffers.push(buffer);\n", 441 | " downloaded += buffer.byteLength;\n", 442 | " progress.value = downloaded;\n", 443 | " }\n", 444 | " }\n", 445 | " }\n", 446 | " const blob = new Blob(buffers, {type: 'application/binary'});\n", 447 | " const a = document.createElement('a');\n", 448 | " a.href = window.URL.createObjectURL(blob);\n", 449 | " a.download = filename;\n", 450 | " div.appendChild(a);\n", 451 | " a.click();\n", 452 | " div.remove();\n", 453 | " }\n", 454 | " " 455 | ] 456 | }, 457 | "metadata": {} 458 | }, 459 | { 460 | "output_type": "display_data", 461 | "data": { 462 | "text/plain": [ 463 | "" 464 | ], 465 | "application/javascript": [ 466 | "download(\"download_d7ea9f3e-65ed-4da1-8b0e-609894886090\", \"\\u5b64\\u7368\\u7684\\u6211\\u662f\\u5e78\\u798f\\u7684\\uff5c\\u6587\\u68ee\\u8aaa\\u66f8.mp4\", 4164951)" 467 | ] 468 | }, 469 | "metadata": {} 470 | } 471 | ] 472 | } 473 | ] 474 | } -------------------------------------------------------------------------------- /台股_Q1~Q3_EPS_抓取.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "台股 Q1~Q3 EPS 抓取", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "authorship_tag": "ABX9TyPn0uCcBvxxCRI0YDMGbTl3", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | 
"name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | } 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "source": [ 34 | "# 自動取得台股 Q1~Q3 EPS\n", 35 | "\n", 36 | "文字標格式。數字和文字間用 tab 區分。一張股票一行。\n", 37 | "\n", 38 | "可以直接從 excel 貼到 txt 檔即可\n", 39 | "\n", 40 | "---\n", 41 | "2330 台積電" 42 | ], 43 | "metadata": { 44 | "id": "lCqeZiJAPLwc" 45 | } 46 | }, 47 | { 48 | "cell_type": "code", 49 | "source": [ 50 | "import requests\n", 51 | "import concurrent.futures\n", 52 | "from bs4 import BeautifulSoup, UnicodeDammit\n", 53 | "import pandas as pd\n", 54 | "\n", 55 | "def addeps(stockid):\n", 56 | " \n", 57 | " url=url_template.format(stockid)\n", 58 | " reg=requests.get(url)\n", 59 | " soup=BeautifulSoup(reg.text)\n", 60 | " stockData=[]\n", 61 | " for itm in soup.find(\"section\",id=\"qsp-eps-table\").find_all(\"span\",class_=\"\")[1:7:2]:\n", 62 | " stockData.insert(0,itm.getText())\n", 63 | " \n", 64 | " stockids.get(stockid).extend(stockData)\n", 65 | "\n", 66 | "\n", 67 | "mytest=list()\n", 68 | "#@markdown 用 txt 檔存股票代碼\n", 69 | "stock_txt=\"/content/stock_id_name.txt\" #@param {type:'string'} \n", 70 | "with open(stock_txt,'rb') as f:\n", 71 | " encode=UnicodeDammit(f.read()).original_encoding\n", 72 | "with open(stock_txt,\"r\",encoding=encode) as f:\n", 73 | " content=[itm.split() for itm in f.read().splitlines()]\n", 74 | "\n", 75 | "'''\n", 76 | "Converting a list to dictionary with list elements as keys in dictionary\n", 77 | "All keys will have same value\n", 78 | "''' \n", 79 | "# create Stock_id data\n", 80 | "stockids = { i[0] : i for i in content }\n", 81 | "url_template=\"https://tw.stock.yahoo.com/quote/{}/eps\" \n", 82 | "with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:\n", 83 | " 
executor.map(addeps,stockids)\n", 84 | "df=pd.DataFrame(stockids.values(),columns=[\"股票代碼\",\"股票名稱\",\"Q1 EPS\",\"Q2 EPS\",\"Q3 EPS\"])\n", 85 | "df.to_excel(\"TWstocks_EPS.xlsx\",index=False)\n", 86 | "df\n", 87 | "\n", 88 | " " 89 | ], 90 | "metadata": { 91 | "id": "6hjtu_OMGJud" 92 | }, 93 | "execution_count": null, 94 | "outputs": [] 95 | } 96 | ] 97 | } -------------------------------------------------------------------------------- /技術議題關鍵字擴展.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "技術議題關鍵字擴展.ipynb", 7 | "provenance": [], 8 | "collapsed_sections": [], 9 | "authorship_tag": "ABX9TyO5GIToPuAyaKLz4lnvSqaj", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | } 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "metadata": { 34 | "id": "hQN-VXYIGSKX", 35 | "cellView": "form" 36 | }, 37 | "source": [ 38 | "#@title 關鍵字擴展\n", 39 | "#@markdown 關鍵字擴展是用 政府研究資訊網 (grb.gov.tw) 的民國 105-110年公開資料做為資料集,目前只支援中文字關鍵字的擴展\n", 40 | "\n", 41 | "import requests\n", 42 | "\n", 43 | "print(\"資料載入中 .......\")\n", 44 | "url=\"https://raw.githubusercontent.com/reic/colab_python/main/data/\"\n", 45 | "fnames=[\"GRB_105.txt\",\"GRB_106.txt\",\"GRB_107.txt\",\"GRB_108.txt\",\"GRB_109.txt\",\"GRB_110.txt\"]\n", 46 | "\n", 47 | "#@markdown 關鍵字\n", 48 | "print(f\"{len(fnames)} 個資料源\")\n", 49 | "inputword = \"\\u865B\\u64EC\\u5BE6\\u5883\" #@param {type:\"string\"}\n", 50 | "#@markdown 列出的擴展關鍵字之數量\n", 51 | "extendNumber =20 #@param {type:\"number\"}\n", 52 | "content=[]\n", 53 | "\n", 54 | "for fname in fnames:\n", 55 | " # print(f\"從 Github 下載資料檔 
{fname}\")\n", 56 | " reg=requests.get(f\"{url}{fname}\")\n", 57 | " content.extend(reg.text.splitlines())\n", 58 | "print(\"=== 資料載入完成 \".ljust(100,\"=\"))\n", 59 | "print(\"\")\n", 60 | "projectData = {}\n", 61 | "for itm in content:\n", 62 | " # print(itm)\n", 63 | " [id, keyword] = itm.split('\\t')\n", 64 | " projectData[id] = keyword.split(\":\")\n", 65 | "# print(projectData)\n", 66 | "\n", 67 | "inputword = inputword.lower()\n", 68 | "keywords = []\n", 69 | "projects = []\n", 70 | "for itm in content:\n", 71 | " if inputword in itm.lower():\n", 72 | " pro = itm.split(\"\\t\")\n", 73 | " projects.append(pro[0])\n", 74 | " getkeywordset = pro[1].split(\":\")\n", 75 | " keywords.extend(getkeywordset)\n", 76 | "\n", 77 | "# # print(len(keywords))\n", 78 | "# # print(len(list(set(keywords))))\n", 79 | "# # print(len(projects))\n", 80 | "# # print(projectData[projects[0]])\n", 81 | "uniqueKeywordCount = dict.fromkeys(keywords, 0)\n", 82 | "for itm in uniqueKeywordCount:\n", 83 | " uniqueKeywordCount[itm] = keywords.count(itm)\n", 84 | "\n", 85 | "keywordExtend=[]\n", 86 | "for key,value in uniqueKeywordCount.items():\n", 87 | " # if int(value) <2: \n", 88 | " # continue\n", 89 | " # print(value)\n", 90 | " keywordExtend.append([key,value])\n", 91 | "keywordExtend.sort(key=lambda x:x[1],reverse=True)\n", 92 | "\n", 93 | "if extendNumber > len(keywordExtend):\n", 94 | " extendNumber=len(keywordExtend)\n", 95 | "\n", 96 | "for itm in keywordExtend[:extendNumber]:\n", 97 | " print(f\"{itm[0]:15s}\\t{itm[1]}\")\n", 98 | "\n", 99 | "\n" 100 | ], 101 | "execution_count": null, 102 | "outputs": [] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "source": [ 107 | "#@title 產業趨勢關鍵字探索\n", 108 | "#@markdown 關鍵字擴展由 科技產業資訊室(iknow.stpi.narl.org.tw) 提供。關鍵字由 iknow 設定義,可以掌握特定關鍵字和其它產業關鍵字共現關係。\n", 109 | "\n", 110 | "import requests\n", 111 | "\n", 112 | "print(\"資料載入中 .......\")\n", 113 | "url=\"https://raw.githubusercontent.com/reic/colab_python/main/data/\"\n", 114 | 
"fnames=[\"iKnow_2017-2021.txt\"]\n", 115 | "\n", 116 | "#@markdown 關鍵字\n", 117 | "print(f\"{len(fnames)} 個資料源\")\n", 118 | "inputword = \"Apple\" #@param {type:\"string\"}\n", 119 | "\n", 120 | "#@markdown 列出的擴展關鍵字之數量\n", 121 | "extendNumber =20 #@param {type:\"number\"}\n", 122 | "content=[]\n", 123 | "\n", 124 | "for fname in fnames:\n", 125 | " # print(f\"從 Github 下載資料檔 {fname}\")\n", 126 | " reg=requests.get(f\"{url}{fname}\")\n", 127 | " content.extend(reg.text.splitlines())\n", 128 | "print(\"=== 資料載入完成 \".ljust(100,\"=\"))\n", 129 | "print(\"\")\n", 130 | "projectData = {}\n", 131 | "for itm in content:\n", 132 | " # print(itm)\n", 133 | " [id, keyword] = itm.split('\\t')\n", 134 | " projectData[id] = keyword.split(\":\")\n", 135 | "# print(projectData)\n", 136 | "\n", 137 | "inputword = inputword.lower()\n", 138 | "keywords = []\n", 139 | "projects = []\n", 140 | "for itm in content:\n", 141 | " if inputword in itm.lower():\n", 142 | " pro = itm.split(\"\\t\")\n", 143 | " projects.append(pro[0])\n", 144 | " getkeywordset = pro[1].split(\":\")\n", 145 | " keywords.extend(getkeywordset)\n", 146 | "\n", 147 | "# # print(len(keywords))\n", 148 | "# # print(len(list(set(keywords))))\n", 149 | "# # print(len(projects))\n", 150 | "# # print(projectData[projects[0]])\n", 151 | "uniqueKeywordCount = dict.fromkeys(keywords, 0)\n", 152 | "for itm in uniqueKeywordCount:\n", 153 | " uniqueKeywordCount[itm] = keywords.count(itm)\n", 154 | "\n", 155 | "keywordExtend=[]\n", 156 | "for key,value in uniqueKeywordCount.items():\n", 157 | " # if int(value) <2: \n", 158 | " # continue\n", 159 | " # print(value)\n", 160 | " keywordExtend.append([key,value])\n", 161 | "keywordExtend.sort(key=lambda x:x[1],reverse=True)\n", 162 | "\n", 163 | "if extendNumber > len(keywordExtend):\n", 164 | " extendNumber=len(keywordExtend)\n", 165 | "\n", 166 | "for itm in keywordExtend[:extendNumber]:\n", 167 | " print(f\"{itm[0]:15s}\\t{itm[1]}\")\n", 168 | "\n", 169 | "\n" 170 | ], 171 | 
"metadata": { 172 | "cellView": "form", 173 | "id": "mcHn3ncMl-en" 174 | }, 175 | "execution_count": null, 176 | "outputs": [] 177 | } 178 | ] 179 | } -------------------------------------------------------------------------------- /英文單字計算.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "provenance": [], 7 | "collapsed_sections": [], 8 | "mount_file_id": "1z6G-heUGHnpYJkvVwXvsAnv3P2QRc19r", 9 | "authorship_tag": "ABX9TyM8taDxTDLlejhnxOuDq0au", 10 | "include_colab_link": true 11 | }, 12 | "kernelspec": { 13 | "name": "python3", 14 | "display_name": "Python 3" 15 | }, 16 | "language_info": { 17 | "name": "python" 18 | } 19 | }, 20 | "cells": [ 21 | { 22 | "cell_type": "markdown", 23 | "metadata": { 24 | "id": "view-in-github", 25 | "colab_type": "text" 26 | }, 27 | "source": [ 28 | "\"Open" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 2, 34 | "metadata": { 35 | "id": "QxEVE_SZC_Bc", 36 | "cellView": "form" 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "#@title 在 context 輸入 zotero 標注的英文單字,並將單字存至 wordtank.txt\n", 41 | "#@markdown 請先掛載 google drive ,wordtank 是放在 google雲端硬碟內\n", 42 | "\n", 43 | "import re\n", 44 | "from nltk.stem import PorterStemmer\n", 45 | "import os\n", 46 | "\n", 47 | "def checkFileexist(path,file):\n", 48 | " fileWithPath=\"{}/{}\".format(path,file)\n", 49 | " if os.path.isfile(fileWithPath):\n", 50 | " return\n", 51 | " with open(fileWithPath,mode=\"w\",encoding='utf-8') as f:\n", 52 | " f.write(\"fromReic\")\n", 53 | "\n", 54 | "def stemworkcheck(worda,wordb):\n", 55 | " if len(wordb)>len(worda):\n", 56 | " return worda\n", 57 | " return wordb\n", 58 | " \n", 59 | "context = \"\\u201Celectrolyte\\u201D ([Wang \\u7B49\\u3002, 2022, p. 
1](zotero://select/library/items/VHI4XUQN)) ([pdf](zotero://open-pdf/library/items/3N7DAHAP?page=1&annotation=IG6RK45L)) electrolyte \\u82F1 [\\u026A\\u02C8lektr\\u0259la\\u026At] \\u7F8E [\\u026A\\u02C8lektr\\u0259la\\u026At] n. \\u7535\\u89E3\\u6DB2\\uFF0C\\u7535\\u89E3\\u8D28\\uFF1B\\u7535\\u89E3 [ \\u590D\\u6570 electrolytes ] \\u201Ccontradictions\\u201D ([Wang \\u7B49\\u3002, 2022, p. 2](zotero://select/library/items/VHI4XUQN)) ([pdf](zotero://open-pdf/library/items/3N7DAHAP?page=2&annotation=G3VLNTTD)) \\u77DB\\u76FE \\u201Climbs\\u201D ([Wang \\u7B49\\u3002, 2022, p. 2](zotero://select/library/items/VHI4XUQN)) ([pdf](zotero://open-pdf/library/items/3N7DAHAP?page=2&annotation=UD9D4T95)) limbs \\u82F1 [l\\u026Amz] \\u7F8E [l\\u026Amz] n. [\\u89E3\\u5256]\\u56DB\\u80A2\\uFF08limb \\u7684\\u590D\\u6570\\uFF09\" #@param {type:\"string\"}\n", 60 | "savedir = \"/content/drive/MyDrive/reic\" #@param {type:\"string\"}\n", 61 | "wordtank = \"wordtank.txt\" #@param {type:\"string\"}\n", 62 | "\n", 63 | "\n", 64 | "\n", 65 | "re_pattern=\"“(\\w+)[, .]?”\"\n", 66 | "ps=PorterStemmer()\n", 67 | "req=re.findall(re_pattern, context)\n", 68 | "req=[itm.lower() for itm in req]\n", 69 | "\n", 70 | "checkFileexist(savedir,wordtank)\n", 71 | "\n", 72 | "with open(\"{}/{}\".format(savedir,wordtank),mode=\"a\",encoding=\"utf-8\") as f:\n", 73 | " f.write(\", {}\".format(\", \".join(req))) " 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "source": [ 79 | "#@title 列出重複查詢次數最多的英文單字\n", 80 | "toprank = 10 #@param {type:\"number\"}\n", 81 | "with open(\"{}/{}\".format(savedir,wordtank),mode=\"r\",encoding=\"utf-8\") as f:\n", 82 | " content=f.read().split(\", \")\n", 83 | "\n", 84 | "word=dict()\n", 85 | "wordcount=dict()\n", 86 | "for itm in content:\n", 87 | " wordstem=ps.stem(itm)\n", 88 | " word[wordstem]=stemworkcheck(word.get(wordstem,itm.lower()),itm.lower())\n", 89 | " wordcount[wordstem]=wordcount.get(wordstem,0)+1\n", 90 | "\n", 91 | 
"arr=sorted(wordcount.items(),key=lambda x:x[1],reverse=True)\n", 92 | "for key,value in arr[:toprank]:\n", 93 | " print(word[key],value)" 94 | ], 95 | "metadata": { 96 | "colab": { 97 | "base_uri": "https://localhost:8080/" 98 | }, 99 | "id": "aGrzd_DBJ2R3", 100 | "outputId": "307e0750-7dff-40dc-be8b-b53eb296c97c", 101 | "cellView": "form" 102 | }, 103 | "execution_count": 9, 104 | "outputs": [ 105 | { 106 | "output_type": "stream", 107 | "name": "stdout", 108 | "text": [ 109 | "fromreic 1\n", 110 | "electrolyte 1\n", 111 | "contradictions 1\n", 112 | "limbs 1\n" 113 | ] 114 | } 115 | ] 116 | } 117 | ] 118 | } -------------------------------------------------------------------------------- /錄音檔轉文字.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "id": "view-in-github", 7 | "colab_type": "text" 8 | }, 9 | "source": [ 10 | "\"Open" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "metadata": { 16 | "id": "sfatoyUwFvio" 17 | }, 18 | "source": [ 19 | "# 語音轉文字小工具\n", 20 | "\n", 21 | "此工具採用 Python 開發,可應用於**訪談錄音檔**轉文字、**影片的字幕**的生成,及其它相關應用。\n", 22 | "\n", 23 | "因為透過Google Colab 平台、Google的語音轉文字工具,完成語音轉文字的工作。只需要有 Google 帳號,即可具備執行此程式的環境,輔以簡單的設定,不會程式的使用者也可以完成相關的工作。\n", 24 | "\n", 25 | "# 新版的 AI 語音轉文字工具,結果更精準\n", 26 | "可以試試我用 Whisper AI模型撰新的新語音轉文字工具,**文字更精準**\n", 27 | "https://github.com/reic/colab_python/blob/main/whisper_Test.ipynb\n", 28 | "\n", 29 | "\n", 30 | "by 瑞課\n", 31 | "\n", 32 | "== 更新記錄 ===\n", 33 | "- 2024/2/20 Colab 調整執行緖限制,最多 2 個,只能執行60秒。多執行緖無法正確使用了。\n", 34 | "- 2023/4/11 調整 txt 檔的輸出模式,並將預設語言改為「繁體中文」\n", 35 | "- 2023/3/15 調整修改未完全的函式錯誤。 謝謝「左埕安」的回報\n", 36 | "- 2023/3/15 調整 txt檔的內容呈現\n", 37 | "- 2021/6/1 修正檔名有空白時,無法轉成 wav 和切割問題\n", 38 | "- 2021/5/12 增加不同翻譯語言變數的設定,並於檔案中提供語系參考表。 謝謝 chin ho Lau 的回饋。\n", 39 | "- 2021/5/9 修正因檔名無法產生 OTR 檔的問題,謝謝「彩虹小馬」的回饋\n", 40 | "- 2021/5/3 增加多執行緖的方法,縮短翻譯的時間\n", 41 | "\n", 42 | "\n", 43 | "\n" 44 | ] 45 | }, 46 | 
{ 47 | "cell_type": "markdown", 48 | "metadata": { 49 | "id": "FQPGQ9dlTlNI" 50 | }, 51 | "source": [ 52 | "## 1.安裝需求套件\n", 53 | "* 文字轉語音套件\n", 54 | "* 繁簡轉換套件" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": { 61 | "cellView": "form", 62 | "id": "zzDanp7lDmSC" 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "#@title 安裝運作所需套件\n", 67 | "!pip3 install SpeechRecognition\n", 68 | "!pip3 install iNLP" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": { 74 | "id": "YHUEj0k9HhSc" 75 | }, 76 | "source": [ 77 | "## 2.掛載 google 雲端硬碟\n", 78 | "\n", 79 | "可點選左側的 **檔案** 圖示,掛載 Google Drive 雲端硬碟,或執行程式" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": { 86 | "cellView": "form", 87 | "id": "nn9tQeSLF8oF" 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "#@title 掛載 Google雲端硬碟\n", 92 | "from google.colab import drive\n", 93 | "drive.mount('/content/drive')" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": { 99 | "id": "AoJDchjgHxl8" 100 | }, 101 | "source": [ 102 | "## 3.設定環境變數與函數預載\n", 103 | "\n", 104 | "需給予**錄音檔**的路徑、 wav 切割檔的暫存目錄、txt 輸出檔的暫存目錄。請確定在**錄音檔**目錄下,沒有相同名稱目錄、或相同名稱目錄下沒有重要的資料。\n", 105 | "\n", 106 | "若要自行建立目錄者,請將 **chk** 設定為 n\n" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 12, 112 | "metadata": { 113 | "id": "R4kZJGQdHdjy", 114 | "cellView": "form" 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "#@title 基礎環境設定\n", 119 | "import os\n", 120 | "import shutil\n", 121 | "import speech_recognition as sr\n", 122 | "import concurrent.futures\n", 123 | "import wave\n", 124 | "import json\n", 125 | "import numpy as np\n", 126 | "from google.colab import files\n", 127 | "from inlp.convert import chinese\n", 128 | "\n", 129 | "\n", 130 | "#@markdown 錄音檔的位置\n", 131 | "mp3Name= '/content/drive/MyDrive/tmp/240113_0558.mp3' #@param {type:\"string\"}\n", 132 | "\n", 133 | "#@markdown 設定錄音檔的分割大小,單位:秒。時間太長,轉文字的效果會較差。\n", 
134 | "CutTimeDef = 20 #@param {type:\"integer\"}\n", 135 | "#@markdown 設定 wav 切割檔的暫存目錄\n", 136 | "wav_path='wav' #@param {type:\"string\"}\n", 137 | "#@markdown 設定文字檔暫存目錄。將特定秒數(CutTimeDef)的音檔轉為文字\n", 138 | "txt_path='txt' #@param {type:\"string\"}\n", 139 | "# #@markdown 執行緖的數量\n", 140 | "# thread_num = 1 #@param {type:\"number\"}\n", 141 | "\n", 142 | "workpath=os.path.dirname(mp3Name)\n", 143 | "mp3Name=os.path.basename(mp3Name)\n", 144 | "FileName = mp3Name[:mp3Name.rfind(\".\")]+\".wav\"\n", 145 | "os.chdir(workpath)\n", 146 | "#@markdown 若 wav_path, txt_path 目錄存在是否移除重建\n", 147 | "chk='y' #@param [\"y\",\"n\"]\n", 148 | "\n", 149 | "def reset_dir(path):\n", 150 | " try:\n", 151 | " os.mkdir(path)\n", 152 | " except Exception:\n", 153 | " if chk==\"y\":\n", 154 | " shutil.rmtree(path)\n", 155 | " os.mkdir(path)\n", 156 | "\n", 157 | "def CutFile(FileName, target_path):\n", 158 | "\n", 159 | " # print(\"CutFile File Name is \", FileName)\n", 160 | " f = wave.open(FileName, \"rb\")\n", 161 | " params = f.getparams()\n", 162 | " nchannels, sampwidth, framerate, nframes = params[:4]\n", 163 | " CutFrameNum = framerate * CutTimeDef\n", 164 | " # 讀取格式資訊\n", 165 | " # 一次性返回所有的WAV檔案的格式資訊,它返回的是一個組元(tuple):聲道數, 量化位數(byte 單位), 採\n", 166 | " # 樣頻率, 取樣點數, 壓縮型別, 壓縮型別的描述。wave模組只支援非壓縮的資料,因此可以忽略最後兩個資訊\n", 167 | "\n", 168 | " # print(\"CutFrameNum=%d\" % (CutFrameNum))\n", 169 | " # print(\"nchannels=%d\" % (nchannels))\n", 170 | " # print(\"sampwidth=%d\" % (sampwidth))\n", 171 | " # print(\"framerate=%d\" % (framerate))\n", 172 | " # print(\"nframes=%d\" % (nframes))\n", 173 | "\n", 174 | " str_data = f.readframes(nframes)\n", 175 | " f.close() # 將波形資料轉換成陣列\n", 176 | " # Cutnum =nframes/framerate/CutTimeDef\n", 177 | " # 需要根據聲道數和量化單位,將讀取的二進位制資料轉換為一個可以計算的陣列\n", 178 | " wave_data = np.frombuffer(str_data, dtype=np.short)\n", 179 | " wave_data.shape = -1, 2\n", 180 | " wave_data = wave_data.T\n", 181 | " temp_data = wave_data.T\n", 182 | " # StepNum = int(nframes/200)\n", 183 | " 
StepNum = CutFrameNum\n", 184 | " StepTotalNum = 0\n", 185 | " haha = 0\n", 186 | " while StepTotalNum < nframes:\n", 187 | " # for j in range(int(Cutnum)):\n", 188 | " # print(\"Stemp=%d\" % (haha))\n", 189 | " SaveFile = \"%s-%03d.wav\" % (FileName[:-4], (haha+1))\n", 190 | " # print(FileName)\n", 191 | " if haha % 3==0:\n", 192 | " print(\"*\",end='')\n", 193 | " temp_dataTemp = temp_data[StepNum * (haha):StepNum * (haha + 1)]\n", 194 | " haha = haha + 1\n", 195 | " StepTotalNum = haha * StepNum\n", 196 | " temp_dataTemp.shape = 1, -1\n", 197 | " temp_dataTemp = temp_dataTemp.astype(np.short) # 開啟WAV文件\n", 198 | " f = wave.open(target_path+\"/\" + SaveFile, \"wb\")\n", 199 | " # 配置聲道數、量化位數和取樣頻率\n", 200 | " f.setnchannels(nchannels)\n", 201 | " f.setsampwidth(sampwidth)\n", 202 | " f.setframerate(framerate)\n", 203 | " # 將wav_data轉換為二進位制資料寫入檔案\n", 204 | " f.writeframes(temp_dataTemp.tobytes())\n", 205 | " f.close()\n", 206 | "\n", 207 | "\n", 208 | "\n", 209 | "\n", 210 | "def texts_to_one(path, target_file):\n", 211 | " files = os.listdir(path)\n", 212 | " files.sort()\n", 213 | " files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n", 214 | " with open(target_file, \"w\", encoding=\"utf-8\") as f:\n", 215 | " for file in files:\n", 216 | " with open(file, \"r\", encoding='utf-8') as f2:\n", 217 | " txt= f2.read().split(\"\\n\")\n", 218 | " if len(txt) < 2:\n", 219 | " continue\n", 220 | " f.write(txt[1])\n", 221 | " print(\"完成合併, 檔案位於 %s \" % target_file)\n", 222 | "\n", 223 | "\n", 224 | "def texts2otr(path, target_file, audio_name, timeperiod):\n", 225 | " template = '''

{}{}


\n", 226 | " '''\n", 227 | " files = os.listdir(path)\n", 228 | " files.sort()\n", 229 | " content = ''\n", 230 | " files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n", 231 | " with open(target_file, \"w\", encoding=\"utf-8\") as f:\n", 232 | "\n", 233 | " for file in files:\n", 234 | " with open(file, \"r\", encoding=\"utf-8\") as f2:\n", 235 | " txt = f2.read().split(\"\\n\")\n", 236 | " if len(txt) < 2:\n", 237 | " continue\n", 238 | " pos=txt[0].rfind(\".\")\n", 239 | " time=int(txt[0][pos-3:pos])\n", 240 | " # times = (int(txt[0].split(\"-\")[1][:-5])-1)*CutTimeDef\n", 241 | " times=(time-1)*CutTimeDef\n", 242 | " secs, mins = times % 60, (times//60) % 60\n", 243 | " hours = (times//60)//60\n", 244 | " timeF = \"{:02d}:{:02d}:{:02d}\".format(hours, mins, secs)\n", 245 | " content += template.format(times, timeF, txt[1])\n", 246 | "\n", 247 | " output = {\"text\": content, \"media\": audio_name,\n", 248 | " \"media-time\": timeperiod}\n", 249 | " f.write(json.dumps(output, ensure_ascii=False))\n", 250 | " print(\"完成合併, otr 檔案位於 %s \" % target_file)\n", 251 | "\n", 252 | " #files.download(target_file)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": { 258 | "id": "RIaMvxr7Jz_W" 259 | }, 260 | "source": [ 261 | "## 4.音頻轉換與切割\n", 262 | "\n", 263 | "1. 將 mp3 轉成 wav 檔\n", 264 | "2. 將音頻切割,並置於 wav_path 目錄下\n", 265 | "3. 
建立 txt_path ,做為語音判識的輸出檔\n", 266 | "\n", 267 | "\n" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 13, 273 | "metadata": { 274 | "cellView": "form", 275 | "id": "rpnUIqKBKBnQ", 276 | "colab": { 277 | "base_uri": "https://localhost:8080/" 278 | }, 279 | "outputId": "757b14db-6b7c-47e9-c25c-78e8fd68f020" 280 | }, 281 | "outputs": [ 282 | { 283 | "output_type": "stream", 284 | "name": "stdout", 285 | "text": [ 286 | "=========================================== mp3 轉 wav 檔 ============================================\n", 287 | " Wav 檔名為 240113_0558.wav \n", 288 | "========================================= 音頻以每20秒分割 ==========================================\n", 289 | "********\n", 290 | "----------------------------------------------- 完成分割 -----------------------------------------------\n" 291 | ] 292 | } 293 | ], 294 | "source": [ 295 | "#@title 執行音頻轉換與分割\n", 296 | "\n", 297 | "print(\" mp3 轉 wav 檔 \".center(100,'='))\n", 298 | "os.system('{} -i \"{}\" \"{}\"'.format(\"ffmpeg\",mp3Name, FileName))\n", 299 | "print(\" Wav 檔名為 {} \".format(FileName).center(96))\n", 300 | "reset_dir(wav_path)\n", 301 | "reset_dir(txt_path)\n", 302 | "# # Cut Wave Setting\n", 303 | "\n", 304 | "print(\" 音頻以每{}秒分割 \".format(CutTimeDef).center(94,'='))\n", 305 | "CutFile(FileName, wav_path)\n", 306 | "print(\"\")\n", 307 | "print(\" 完成分割 \".center(100,'-'))" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": { 313 | "id": "msvFQZENdwGZ" 314 | }, 315 | "source": [ 316 | "## 5.文字轉語音" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": null, 322 | "metadata": { 323 | "id": "rUh0kL6hC6yd", 324 | "cellView": "form" 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "#@title 執行語音轉文字 (需要耗費不少時間)\n", 329 | "#@markdown 指定翻譯的語言類型,如何設定語系請參考 [支援列表](https://cloud.google.com/speech-to-text/docs/languages)\n", 330 | "\n", 331 | "#@markdown 繁體中文:zh-TW(or cmn-Hant-TW)、英文: en-US\n", 332 | "voiceLanguage=\"zh-TW\" #@param 
{type:\"string\"}\n", 333 | "# cmn-Hant-TW\n", 334 | "\n", 335 | "def VoiceToText_thread(file):\n", 336 | " txt_file = \"%s/%s.txt\" % (txt_path, file[:-4])\n", 337 | "\n", 338 | " if os.path.isfile(txt_file):\n", 339 | " return\n", 340 | " with open(\"%s/%s.txt\" % (txt_path, file[:-4]), \"w\", encoding=\"utf-8\") as f:\n", 341 | " f.write(\"%s:\\n\" % file)\n", 342 | " r = sr.Recognizer() # 預設辨識英文\n", 343 | " with sr.WavFile(wav_path+\"/\"+file) as source: # 讀取wav檔\n", 344 | " audio = r.record(source)\n", 345 | " # r.adjust_for_ambient_noise(source)\n", 346 | " # audio = r.listen(source)\n", 347 | " try:\n", 348 | " text = r.recognize_google(audio,language = voiceLanguage)\n", 349 | " text = chinese.s2t(text)\n", 350 | " # r.recognize_google(audio)\n", 351 | "\n", 352 | " if len(text) == 0:\n", 353 | " print(\"===無資料==\")\n", 354 | " return\n", 355 | "\n", 356 | " print(f\"{file}\\t{text}\")\n", 357 | " f.write(\"%s \\n\\n\" % text)\n", 358 | " if file == files[-1]:\n", 359 | " print(\"結束翻譯\")\n", 360 | " except sr.RequestError as e:\n", 361 | " print(\"無法翻譯{0}\".format(e))\n", 362 | " # 兩個 except 是當語音辨識不出來的時候 防呆用的\n", 363 | " # 使用Google的服務\n", 364 | " except LookupError:\n", 365 | " print(\"Could not understand audio\")\n", 366 | " except sr.UnknownValueError:\n", 367 | " print(f\"Error: 無法識別 Audio\\t {file}\")\n", 368 | "\n", 369 | "\n", 370 | "\n", 371 | "\n", 372 | "files = os.listdir(wav_path)\n", 373 | "files.sort()\n", 374 | "\n", 375 | "# 因為 colab 調整執行緒的使用原則,max=2 最多 60秒就關閉\n", 376 | "# with concurrent.futures.ThreadPoolExecutor(max_workers=thread_num) as executor:\n", 377 | "# executor.map(VoiceToText_thread, files)\n", 378 | "for file in files:\n", 379 | " VoiceToText_thread(file)\n", 380 | "\n", 381 | "# VoiceToText(wav_path, files, txt_path)\n", 382 | "\n", 383 | "target_txtfile = \"{}.txt\".format(FileName[:-4])\n", 384 | "texts_to_one(txt_path, target_txtfile)\n", 385 | "otr_file = \"{}.otr\".format(FileName[:-4])\n", 386 | "with wave.open(FileName, 
\"rb\") as f:\n",
        "    params = f.getparams()\n",
        "texts2otr(txt_path, otr_file, FileName, params.nframes)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "Rqlix8f26WTs"
      },
      "outputs": [],
      "source": [
        "#@title List the merged text file names\n",
        "#@markdown Produces a txt file and an otr file for the [oTranscribe](https://otranscribe.com/) website. The output files are placed in the same directory as the uploaded recording.\n",
        "\n",
        "#@markdown If you already know the file names, you do not need to run this cell.\n",
        "print(\" Output file names \".center(100,'='))\n",
        "print(target_txtfile)\n",
        "print(otr_file)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nFJ79zpxDuMt"
      },
      "source": [
        "## 6. Clean up temporary files and directories"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "cellView": "form",
        "id": "XWG0Z-L-D6AK"
      },
      "outputs": [],
      "source": [
        "#@title Remove temporary files and directories\n",
        "\n",
        "#@markdown Removes the wav and txt directories and the temporary .wav file.\n",
        "\n",
        "#@markdown You can also delete them manually in **Google Drive** instead of running this cell.\n",
        "\n",
        "\n",
        "shutil.rmtree(wav_path)\n",
        "shutil.rmtree(txt_path)\n",
        "os.remove(FileName)\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "hO9kdxCwaadE"
      },
      "outputs": [],
      "source": [
        "#@title Unmount **Google Drive**\n",
        "drive.flush_and_unmount()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YnV3vGD6gS-W"
      },
      "source": [
        "## Appendix 1. YouTube subtitle format output"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "D9M7MJS7a281"
      },
      "outputs": [],
      "source": [
        "#@title YouTube subtitle (.srt) format output\n",
        "def get_timeF(times):\n",
        "    secs, mins = times % 60, 
(times//60) % 60\n",
        "    hours = (times//60)//60\n",
        "    # SRT timestamps require a millisecond field (HH:MM:SS,mmm)\n",
        "    timeF = \"{:02d}:{:02d}:{:02d},000\".format(hours, mins, secs)\n",
        "    return timeF\n",
        "\n",
        "def texts2srt(path, target_file):\n",
        "    template = '''{}\\n{} --> {}\\n{}\\n\\n'''\n",
        "    files = os.listdir(path)\n",
        "    files.sort()\n",
        "    content = ''\n",
        "    counter = 0\n",
        "    files = [path+\"/\" + f for f in files if f.endswith(\".txt\")]\n",
        "    with open(target_file, \"w\", encoding=\"utf-8\") as f:\n",
        "\n",
        "        for file in files:\n",
        "            with open(file, \"r\", encoding=\"utf-8\") as f2:\n",
        "                txt = f2.read().split(\"\\n\")\n",
        "                if len(txt) < 2:\n",
        "                    continue\n",
        "                counter += 1\n",
        "                # The first line holds the segment file name; its index gives the start time\n",
        "                times = (int(txt[0].split(\"-\")[1][:-5])-1)*CutTimeDef\n",
        "                time_start = get_timeF(times)\n",
        "                time_end = get_timeF(times+CutTimeDef)\n",
        "                content += template.format(counter, time_start, time_end, txt[1])\n",
        "        f.write(content)\n",
        "    print(\"Merge complete; the srt file is at %s \" % target_file)\n",
        "\n",
        "srt_file = \"{}_srt.txt\".format(FileName[:-4])\n",
        "texts2srt(txt_path, srt_file)\n",
        "# `files` was rebound to a list of wav file names in an earlier cell,\n",
        "# so import the Colab download helper under a different name\n",
        "from google.colab import files as colab_files\n",
        "colab_files.download(srt_file)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "mount_file_id": "1SPRxSXsaErSrZ4riQ-1sxHankJ3Hlc9X",
      "authorship_tag": "ABX9TyMWp1agJax/qdgy3Ri4I38A",
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
--------------------------------------------------------------------------------