├── .gitignore ├── LICENSE ├── README.md ├── README_zh.md ├── docs └── images │ ├── LLM_guide.png │ ├── demo.png │ ├── demo_en.png │ ├── dingding.png │ ├── guide.jpg │ ├── interface.jpg │ └── wechat.png ├── font └── STHeitiMedium.ttc ├── funclip ├── __init__.py ├── introduction.py ├── launch.py ├── llm │ ├── demo_prompt.py │ ├── g4f_openai_api.py │ ├── openai_api.py │ └── qwen_api.py ├── test │ ├── imagemagick_test.py │ └── test.sh ├── utils │ ├── argparse_tools.py │ ├── subtitle_utils.py │ ├── theme.json │ └── trans_utils.py └── videoclipper.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | .DS_Store 3 | *.DS_Store 4 | ClipVideo/clipvideo/output 5 | *__pycache__ 6 | *.spec -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Alibaba 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![SVG Banners](https://svg-banners.vercel.app/api?type=rainbow&text1=FunClip%20%20🥒&width=800&height=210)](https://github.com/Akshay090/svg-banners) 2 | 3 | ###

「[简体中文](./README_zh.md) | English」

4 | 5 | **

⚡ Open-source, accurate and easy-to-use video clipping tool

** 6 | **

🧠 Explore LLM based video clipping with FunClip

** 7 | 8 |

9 | 10 |

11 | alibaba-damo-academy%2FFunClip | Trendshift 12 |

13 | 14 |
15 |

16 | What's New 17 | | On Going 18 | | Install 19 | | Usage 20 | | Community 21 |

22 |
23 |
24 | **FunClip** is a fully open-source, locally deployed, automated video clipping tool. It leverages Alibaba TONGYI speech lab's open-source [FunASR](https://github.com/alibaba-damo-academy/FunASR) Paraformer series models to perform speech recognition on videos. Users can then freely choose text segments or speakers from the recognition results and click the clip button to obtain the video clips corresponding to the selected segments (Quick Experience [Modelscope⭐](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary) [HuggingFace🤗](https://huggingface.co/spaces/R1ckShi/FunClip)).
25 |
26 | ## Highlights🎨
27 |
28 | - 🔥Try AI clipping using LLMs in FunClip now.
29 | - FunClip integrates Alibaba's open-source industrial-grade model [Paraformer-Large](https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which is one of the best-performing open-source Chinese ASR models available, with over 13 million downloads on Modelscope. It can also accurately predict timestamps in an integrated manner.
30 | - FunClip incorporates the hotword customization feature of [SeACo-Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), allowing users to specify certain entity words, names, etc., as hotwords during the ASR process to enhance recognition results.
31 | - FunClip integrates the [CAM++](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) speaker recognition model, enabling users to use the auto-recognized speaker ID as the clipping target, so segments from a specific speaker can be clipped out.
32 | - The functionalities are realized through Gradio interaction, offering simple installation and ease of use. FunClip can also be deployed on a server and accessed via a browser.
33 | - FunClip supports multi-segment free clipping and automatically returns the full-video SRT subtitles and the target-segment SRT subtitles, offering a simple and convenient user experience.
34 |
35 |
36 | ## What's New🚀
37 | - 2024/06/12 FunClip now supports recognizing and clipping English audio files. Run `python funclip/launch.py -l en` to try it.
38 | - 🔥2024/05/13 FunClip v2.0.0 now supports smart clipping with large language models, integrating models from the qwen series, GPT series, etc., and providing default prompts. You can also explore and share tips for setting prompts. The usage is as follows:
39 |     1. After the recognition, select the name of the large language model and configure your own apikey;
40 |     2. Click on the 'LLM Inference' button, and FunClip will automatically combine the two prompts with the video's SRT subtitles;
41 |     3. Click on the 'AI Clip' button, and based on the output of the large language model from the previous step, FunClip will extract the timestamps for clipping (an illustrative LLM output is shown after this What's New list);
42 |     4. You can try changing the prompt to leverage the capabilities of large language models to get the results you want;
43 | - 2024/05/09 FunClip updated to v1.1.0, including the following updates and fixes:
44 |     - Support configuring the output file directory, saving ASR intermediate results and video clipping intermediate files;
45 |     - UI upgrade (see the guide picture below): video and audio clipping functions are now on the same page, and button positions were adjusted;
46 |     - Fixed a bug introduced by a FunASR interface upgrade, which had caused some serious clipping errors;
47 |     - Support configuring different start and end time offsets for each paragraph;
48 |     - Code updates, etc.;
49 | - 2024/03/06 Fixed bugs in using FunClip from the command line.
50 | - 2024/02/28 [FunASR](https://github.com/alibaba-damo-academy/FunASR) was updated to version 1.0; FunClip now uses FunASR 1.0 and SeACo-Paraformer to conduct ASR with hotword customization.
51 | - 2023/10/17 Fixed a bug in choosing multiple periods, which used to return a video with the wrong length.
52 | - 2023/10/10 FunClipper now supports recognition with speaker diarization: choose the 'Yes' button under 'Recognize Speakers' and you will get recognition results with a speaker ID for each sentence. You can then clip out the periods of one or more speakers (e.g. 'spk0' or 'spk0#spk3') using FunClipper.
53 |
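For reference, the default prompt in `funclip/launch.py` asks the model to answer in the format `1. [start time-end time] text`, and the 'AI Clip' step then parses those timestamps. The block below is an illustrative (not real) LLM output; the subtitle lines are taken from the sample SRT in `funclip/llm/demo_prompt.py`, and the actual result depends on your video and prompt:

```text
1. [00:00:04,670-00:00:11,930] 今天要和您分享的这篇文章是人民日报,为什么要多读书?这是我听过最好的答案
2. [00:01:39,830-00:01:46,595] 你无法到达的地方文字在你过去,你无法经历的人生舒淇,带你相遇
```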
54 |
55 | ## On Going🌵
56 |
57 | - [x] FunClip will support the Whisper model for English users, coming soon (ASR using Whisper with timestamps requires massive GPU memory, so we support timestamp prediction for vanilla Paraformer in FunASR to achieve this).
58 | - [x] FunClip will further explore the abilities of large language model based AI clipping; discussions about prompt setting, clipping, etc. are welcome.
59 | - [ ] Reverse period choosing while clipping.
60 | - [ ] Removing silence periods.
61 |
62 |
63 | ## Install🔨
64 |
65 | ### Python env install
66 |
67 | FunClip's basic functions rely only on a Python environment.
68 | ```shell
69 | # clone funclip repo
70 | git clone https://github.com/alibaba-damo-academy/FunClip.git
71 | cd FunClip
72 | # install Python requirements
73 | pip install -r ./requirements.txt
74 | ```
75 |
76 | ### imagemagick install (Optional)
77 |
78 | If you want to clip video files with embedded subtitles:
79 |
80 | 1. ffmpeg and imagemagick are required
81 |
82 |     - On Ubuntu
83 |     ```shell
84 |     apt-get -y update && apt-get -y install ffmpeg imagemagick
85 |     sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml
86 |     ```
87 |     - On MacOS
88 |     ```shell
89 |     brew install imagemagick
90 |     sed -i 's/none/read,write/g' /usr/local/Cellar/imagemagick/7.1.1-8_1/etc/ImageMagick-7/policy.xml
91 |     ```
92 |     - On Windows
93 |
94 |     Download and install imagemagick from https://imagemagick.org/script/download.php#windows
95 |
96 |     Find your Python install path and change `IMAGEMAGICK_BINARY` to your imagemagick install path in the file `site-packages\moviepy\config_defaults.py`
97 |
98 | 2. Download the font file to funclip/font
99 |
100 | ```shell
101 | wget https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/STHeitiMedium.ttc -O font/STHeitiMedium.ttc
102 | ```
103 |
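(Optional) To quickly verify that moviepy can render subtitles through ImageMagick with the downloaded font, a minimal check along the lines of `funclip/test/imagemagick_test.py` can be used; the output filename here is just an example:

```python
# minimal ImageMagick + moviepy check, run from the FunClip repo root
from moviepy.editor import TextClip

clip = TextClip("FunClip 字幕测试", font="./font/STHeitiMedium.ttc", fontsize=48, color="white")
clip.save_frame("imagemagick_check.png")  # succeeds only if ImageMagick is configured correctly
```

If this raises an ImageMagick-related error, revisit the policy.xml edit or the `IMAGEMAGICK_BINARY` setting above.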
104 | ## Use FunClip
105 |
106 | ### A. Use FunClip as a local Gradio service
107 | You can establish your own FunClip service, which is the same as the [Modelscope Space](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary), as follows:
108 | ```shell
109 | python funclip/launch.py
110 | # '-l en' for English audio recognition
111 | # '-p xxx' for setting the port number
112 | # '-s' for establishing a share link for public access
113 | ```
114 | Then visit ```localhost:7860``` and you will get a Gradio service like the one below; you can use FunClip following these steps:
115 |
116 | - Step1: Upload your video file (or try the example videos below)
117 | - Step2: Copy the text segments you need to 'Text to Clip'
118 | - Step3: Adjust the subtitle settings (if needed)
119 | - Step4: Click 'Clip' or 'Clip and Generate Subtitles'
120 |
121 |
122 |
123 | Follow the guide below to explore LLM based clipping:
124 |
125 |
126 |
127 | ### B. Experience FunClip in Modelscope
128 |
129 | [FunClip@Modelscope Space⭐](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary)
130 |
131 | [FunClip@HuggingFace Space🤗](https://huggingface.co/spaces/R1ckShi/FunClip)
132 |
133 | ### C. Use FunClip from the command line
134 |
135 | FunClip also supports recognition and clipping via the command line:
136 | ```shell
137 | # step1: Recognize
138 | python funclip/videoclipper.py --stage 1 \
139 |        --file examples/2022云栖大会_片段.mp4 \
140 |        --output_dir ./output
141 | # now you can find the recognition results and the entire SRT file in ./output/
142 | # step2: Clip
143 | python funclip/videoclipper.py --stage 2 \
144 |        --file examples/2022云栖大会_片段.mp4 \
145 |        --output_dir ./output \
146 |        --dest_text '我们把它跟乡村振兴去结合起来,利用我们的设计的能力' \
147 |        --start_ost 0 \
148 |        --end_ost 100 \
149 |        --output_file './output/res.mp4'
150 | ```
151 |
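The same two stages can also be driven from Python. The sketch below mirrors how `funclip/launch.py` calls the `VideoClipper` class; treat the argument values as examples and check `funclip/videoclipper.py` for the authoritative signatures:

```python
# rough sketch of programmatic use, following the calls made in funclip/launch.py
from funasr import AutoModel
from videoclipper import VideoClipper  # run from inside the funclip/ directory

funasr_model = AutoModel(
    model="iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
)
clipper = VideoClipper(funasr_model)
clipper.lang = "zh"

# stage 1: recognition returns the text, the SRT string and a state object reused for clipping
res_text, res_srt, state = clipper.video_recog(
    "examples/2022云栖大会_片段.mp4", "No", "", output_dir="./output")

# stage 2: clip the sentences matching dest_text (multiple segments can be joined with '#')
clip_file, message, clip_srt = clipper.video_clip(
    "我们把它跟乡村振兴去结合起来,利用我们的设计的能力", 0, 100, state, output_dir="./output")
```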
152 |
153 | ## Community Communication🍟
154 |
155 | FunClip was first open-sourced by the FunASR team; any useful PR is welcome.
156 |
157 | You can also scan the following DingTalk group or WeChat group QR code to join the community group for communication.
158 |
159 | | DingTalk group | WeChat group |
160 | |:-------------------------------------------------------------------:|:-----------------------------------------------------:|
161 | | |
| 162 | 163 | ## Find Speech Models in FunASR 164 | 165 | [FunASR](https://github.com/alibaba-damo-academy/FunASR) hopes to build a bridge between academic research and industrial applications on speech recognition. By supporting the training & finetuning of the industrial-grade speech recognition model released on ModelScope, researchers and developers can conduct research and production of speech recognition models more conveniently, and promote the development of speech recognition ecology. ASR for Fun! 166 | 167 | 📚FunASR Paper: 168 | 169 | 📚SeACo-Paraformer Paper: 170 | 171 | 🌟Support FunASR: 172 | -------------------------------------------------------------------------------- /README_zh.md: -------------------------------------------------------------------------------- 1 | [![SVG Banners](https://svg-banners.vercel.app/api?type=rainbow&text1=FunClip%20%20🥒&width=800&height=210)](https://github.com/Akshay090/svg-banners) 2 | 3 | ###

「简体中文 | [English](./README.md)」

4 | 5 | **

⚡ 开源、精准、方便的视频切片工具

** 6 | **

🧠 通过FunClip探索基于大语言模型的视频剪辑

** 7 | 8 |

9 | 10 |

11 | alibaba-damo-academy%2FFunClip | Trendshift 12 |

13 | 14 |
15 |

近期更新 16 | | 施工中 17 | | 安装环境 18 | | 使用方法 19 | | 社区交流 20 |

21 |
22 | 23 | **FunClip**是一款完全开源、本地部署的自动化视频剪辑工具,通过调用阿里巴巴通义实验室开源的[FunASR](https://github.com/alibaba-damo-academy/FunASR) Paraformer系列模型进行视频的语音识别,随后用户可以自由选择识别结果中的文本片段或说话人,点击裁剪按钮即可获取对应片段的视频(快速体验 [Modelscope⭐](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary) [HuggingFace🤗](https://huggingface.co/spaces/R1ckShi/FunClip))。 24 | 25 | ## 热点&特性🎨 26 | 27 | - 🔥FunClip集成了多种大语言模型调用方式并提供了prompt配置接口,尝试通过大语言模型进行视频裁剪~ 28 | - FunClip集成了阿里巴巴开源的工业级模型[Paraformer-Large](https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary),是当前识别效果最优的开源中文ASR模型之一,Modelscope下载量1300w+次,并且能够一体化的准确预测时间戳。 29 | - FunClip集成了[SeACo-Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)的热词定制化功能,在ASR过程中可以指定一些实体词、人名等作为热词,提升识别效果。 30 | - FunClip集成了[CAM++](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary)说话人识别模型,用户可以将自动识别出的说话人ID作为裁剪目标,将某一说话人的段落裁剪出来。 31 | - 通过Gradio交互实现上述功能,安装简单使用方便,并且可以在服务端搭建服务通过浏览器使用。 32 | - FunClip支持多段自由剪辑,并且会自动返回全视频SRT字幕、目标段落SRT字幕,使用简单方便。 33 | 34 | 欢迎体验使用,欢迎提出关于字幕生成或语音识别的需求与宝贵建议~ 35 | 36 | 37 | ## 近期更新🚀 38 | 39 | - 2024/06/12 FunClip现在支持识别与裁剪英文视频,通过`python funclip/launch.py -l en`来启动英文版本服务。 40 | - 🔥2024/05/13 FunClip v2.0.0加入大语言模型智能裁剪功能,集成qwen系列,gpt系列等模型,提供默认prompt,您也可以探索并分享prompt的设置技巧,使用方法如下: 41 | 1. 在进行识别之后,选择大模型名称,配置你自己的apikey; 42 | 2. 点击'LLM智能段落选择'按钮,FunClip将自动组合两个prompt与视频的srt字幕; 43 | 3. 点击'LLM智能裁剪'按钮,基于前一步的大语言模型输出结果,FunClip将提取其中的时间戳进行裁剪; 44 | 4. 您可以尝试改变prompt来借助大语言模型的能力来获取您想要的结果; 45 | - 2024/05/09 FunClip更新至v1.1.0,包含如下更新与修复: 46 | - 支持配置输出文件目录,保存ASR中间结果与视频裁剪中间文件; 47 | - UI升级(见下方演示图例),视频与音频裁剪功能在同一页,按钮位置调整; 48 | - 修复了由于FunASR接口升级引入的bug,该bug曾导致一些严重的剪辑错误; 49 | - 支持为每一个段落配置不同的起止时间偏移; 50 | - 代码优化等; 51 | - 2024/03/06 命令行调用方式更新与问题修复,相关功能可以正常使用。 52 | - 2024/02/28 FunClip升级到FunASR1.0模型调用方式,通过FunASR开源的SeACo-Paraformer模型在视频剪辑中进一步支持热词定制化功能。 53 | - 2024/02/28 原FunASR-APP/ClipVideo更名为FunClip。 54 | 55 | 56 | ## 施工中🌵 57 | 58 | - [x] FunClip将会集成Whisper模型,以提供英文视频剪辑能力(Whisper模型的时间戳预测功能需要显存较大,我们在FunASR中添加了Paraformer英文模型的时间戳预测支持以允许FunClip支持英文识别裁剪)。 59 | - [x] 集成大语言模型的能力,提供智能视频剪辑相关功能。大家可以基于FunClip探索使用大语言模型的视频剪辑~ 60 | - [ ] 给定文本段落,反向选取其他段落。 61 | - [ ] 删除视频中无人说话的片段。 62 | 63 | 64 | ## 安装🔨 65 | 66 | ### Python环境安装 67 | 68 | FunClip的运行仅依赖于一个Python环境,若您是一个小白开发者,可以先了解下如何使用Python,pip等~ 69 | ```shell 70 | # 克隆funclip仓库 71 | git clone https://github.com/alibaba-damo-academy/FunClip.git 72 | cd FunClip 73 | # 安装相关Python依赖 74 | pip install -r ./requirements.txt 75 | ``` 76 | 77 | ### 安装imagemagick(可选) 78 | 79 | 1. 如果你希望使用自动生成字幕的视频裁剪功能,需要安装imagemagick 80 | 81 | - Ubuntu 82 | ```shell 83 | apt-get -y update && apt-get -y install ffmpeg imagemagick 84 | sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml 85 | ``` 86 | - MacOS 87 | ```shell 88 | brew install imagemagick 89 | sed -i 's/none/read,write/g' /usr/local/Cellar/imagemagick/7.1.1-8_1/etc/ImageMagick-7/policy.xml 90 | ``` 91 | - Windows 92 | 93 | 首先下载并安装imagemagick https://imagemagick.org/script/download.php#windows 94 | 95 | 然后确定您的Python安装位置,在其中的`site-packages\moviepy\config_defaults.py`文件中修改`IMAGEMAGICK_BINARY`为imagemagick的exe路径 96 | 97 | 2. 
下载你需要的字体文件,这里我们提供一个默认的黑体字体文件 98 | 99 | ```shell 100 | wget https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/STHeitiMedium.ttc -O font/STHeitiMedium.ttc 101 | ``` 102 | 103 | 104 | ## 使用FunClip 105 | 106 | ### A.在本地启动Gradio服务 107 | 108 | ```shell 109 | python funclip/launch.py 110 | # '-l en' for English audio recognize 111 | # '-p xxx' for setting port number 112 | # '-s True' for establishing service for public accessing 113 | ``` 114 | 随后在浏览器中访问```localhost:7860```即可看到如下图所示的界面,按如下步骤即可进行视频剪辑 115 | 1. 上传你的视频(或使用下方的视频用例) 116 | 2. (可选)设置热词,设置文件输出路径(保存识别结果、视频等) 117 | 3. 点击识别按钮获取识别结果,或点击识别+区分说话人在语音识别基础上识别说话人ID 118 | 4. 将识别结果中的选段复制到对应位置,或者将说话人ID输入到对应为止 119 | 5. (可选)配置剪辑参数,偏移量与字幕设置等 120 | 6. 点击“裁剪”或“裁剪+字幕”按钮 121 | 122 | 123 | 124 | 使用大语言模型裁剪请参考如下教程 125 | 126 | 127 | 128 | ### B.通过命令行调用使用FunClip的相关功能 129 | ```shell 130 | # 步骤一:识别 131 | python funclip/videoclipper.py --stage 1 \ 132 | --file examples/2022云栖大会_片段.mp4 \ 133 | --output_dir ./output 134 | # ./output中生成了识别结果与srt字幕等 135 | # 步骤二:裁剪 136 | python funclip/videoclipper.py --stage 2 \ 137 | --file examples/2022云栖大会_片段.mp4 \ 138 | --output_dir ./output \ 139 | --dest_text '我们把它跟乡村振兴去结合起来,利用我们的设计的能力' \ 140 | --start_ost 0 \ 141 | --end_ost 100 \ 142 | --output_file './output/res.mp4' 143 | ``` 144 | 145 | ### C.通过创空间与Space体验FunClip 146 | 147 | [FunClip@Modelscope创空间⭐](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary) 148 | 149 | [FunClip@HuggingFace Space🤗](https://huggingface.co/spaces/R1ckShi/FunClip) 150 | 151 | 152 | 153 | ## 社区交流🍟 154 | 155 | FunClip开源项目由FunASR社区维护,欢迎加入社区,交流与讨论,以及合作开发等。 156 | 157 | | 钉钉群 | 微信群 | 158 | |:-------------------------------------------------------------------:|:-----------------------------------------------------:| 159 | |
|
| 160 | 161 | ## 通过FunASR了解语音识别相关技术 162 | 163 | [FunASR](https://github.com/alibaba-damo-academy/FunASR)是阿里巴巴通义实验室开源的端到端语音识别工具包,目前已经成为主流ASR工具包之一。其主要包括Python pipeline,SDK部署与海量开源工业ASR模型等。 164 | 165 | 📚FunASR论文: 166 | 167 | 📚SeACo-Paraformer论文: 168 | 169 | ⭐支持FunASR: 170 | -------------------------------------------------------------------------------- /docs/images/LLM_guide.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/LLM_guide.png -------------------------------------------------------------------------------- /docs/images/demo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/demo.png -------------------------------------------------------------------------------- /docs/images/demo_en.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/demo_en.png -------------------------------------------------------------------------------- /docs/images/dingding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/dingding.png -------------------------------------------------------------------------------- /docs/images/guide.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/guide.jpg -------------------------------------------------------------------------------- /docs/images/interface.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/interface.jpg -------------------------------------------------------------------------------- /docs/images/wechat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/wechat.png -------------------------------------------------------------------------------- /font/STHeitiMedium.ttc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/font/STHeitiMedium.ttc -------------------------------------------------------------------------------- /funclip/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/funclip/__init__.py -------------------------------------------------------------------------------- /funclip/introduction.py: -------------------------------------------------------------------------------- 1 | top_md_1 = (""" 2 |
3 |
4 | FunClip: 5 | 🌟支持我们: 6 |
7 |
8 | 9 | 基于阿里巴巴通义实验室自研并开源的[FunASR](https://github.com/alibaba-damo-academy/FunASR)工具包及Paraformer系列模型及语音识别、端点检测、标点预测、时间戳预测、说话人区分、热词定制化开源链路 10 | 11 | 准确识别,自由复制所需段落,或者设置说话人标识,一键裁剪、添加字幕 12 | 13 | * Step1: 上传视频或音频文件(或使用下方的用例体验),点击 **识别** 按钮 14 | * Step2: 复制识别结果中所需的文字至右上方,或者右设置说话人标识,设置偏移与字幕配置(可选) 15 | * Step3: 点击 **裁剪** 按钮或 **裁剪并添加字幕** 按钮获得结果 16 | 17 | 🔥 FunClip现在集成了大语言模型智能剪辑功能,选择LLM模型进行体验吧~ 18 | """) 19 | 20 | top_md_3 = ("""访问FunASR项目与论文能够帮助您深入了解ParaClipper中所使用的语音处理相关模型: 21 |
22 |
23 | FunASR: 24 | FunASR Paper: 25 | 🌟Star FunASR: 26 |
27 |
28 | """) 29 | 30 | top_md_4 = ("""我们在「LLM智能裁剪」模块中提供三种LLM调用方式, 31 | 1. 选择阿里云百炼平台通过api调用qwen系列模型,此时需要您准备百炼平台的apikey,请访问[阿里云百炼](https://bailian.console.aliyun.com/#/home); 32 | 2. 选择GPT开头的模型即为调用openai官方api,此时需要您自备sk与网络环境; 33 | 3. [gpt4free](https://github.com/xtekky/gpt4free?tab=readme-ov-file)项目也被集成进FunClip,可以通过它免费调用gpt模型; 34 | 35 | 其中方式1与方式2需要在界面中传入相应的apikey 36 | 方式3而可能非常不稳定,返回时间可能很长或者结果获取失败,可以多多尝试或者自己准备sk使用方式1,2 37 | 38 | 不要同时打开同一端口的多个界面,会导致文件上传非常缓慢或卡死,关闭其他界面即可解决 39 | """) 40 | -------------------------------------------------------------------------------- /funclip/launch.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved. 4 | # MIT License (https://opensource.org/licenses/MIT) 5 | 6 | from http import server 7 | import os 8 | import logging 9 | import argparse 10 | import gradio as gr 11 | from funasr import AutoModel 12 | from videoclipper import VideoClipper 13 | from llm.openai_api import openai_call 14 | from llm.qwen_api import call_qwen_model 15 | from llm.g4f_openai_api import g4f_openai_call 16 | from utils.trans_utils import extract_timestamps 17 | from introduction import top_md_1, top_md_3, top_md_4 18 | 19 | 20 | if __name__ == "__main__": 21 | parser = argparse.ArgumentParser(description='argparse testing') 22 | parser.add_argument('--lang', '-l', type=str, default = "zh", help="language") 23 | parser.add_argument('--share', '-s', action='store_true', help="if to establish gradio share link") 24 | parser.add_argument('--port', '-p', type=int, default=7860, help='port number') 25 | parser.add_argument('--listen', action='store_true', help="if to listen to all hosts") 26 | args = parser.parse_args() 27 | 28 | if args.lang == 'zh': 29 | funasr_model = AutoModel(model="iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch", 30 | vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch", 31 | punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", 32 | spk_model="damo/speech_campplus_sv_zh-cn_16k-common", 33 | ) 34 | else: 35 | funasr_model = AutoModel(model="iic/speech_paraformer_asr-en-16k-vocab4199-pytorch", 36 | vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch", 37 | punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", 38 | spk_model="damo/speech_campplus_sv_zh-cn_16k-common", 39 | ) 40 | audio_clipper = VideoClipper(funasr_model) 41 | audio_clipper.lang = args.lang 42 | 43 | server_name='127.0.0.1' 44 | if args.listen: 45 | server_name = '0.0.0.0' 46 | 47 | 48 | 49 | def audio_recog(audio_input, sd_switch, hotwords, output_dir): 50 | return audio_clipper.recog(audio_input, sd_switch, None, hotwords, output_dir=output_dir) 51 | 52 | def video_recog(video_input, sd_switch, hotwords, output_dir): 53 | return audio_clipper.video_recog(video_input, sd_switch, hotwords, output_dir=output_dir) 54 | 55 | def video_clip(dest_text, video_spk_input, start_ost, end_ost, state, output_dir): 56 | return audio_clipper.video_clip( 57 | dest_text, start_ost, end_ost, state, dest_spk=video_spk_input, output_dir=output_dir 58 | ) 59 | 60 | def mix_recog(video_input, audio_input, hotwords, output_dir): 61 | output_dir = output_dir.strip() 62 | if not len(output_dir): 63 | output_dir = None 64 | else: 65 | output_dir = os.path.abspath(output_dir) 66 | audio_state, video_state = None, None 67 | if video_input is not None: 68 | res_text, res_srt, 
video_state = video_recog( 69 | video_input, 'No', hotwords, output_dir=output_dir) 70 | return res_text, res_srt, video_state, None 71 | if audio_input is not None: 72 | res_text, res_srt, audio_state = audio_recog( 73 | audio_input, 'No', hotwords, output_dir=output_dir) 74 | return res_text, res_srt, None, audio_state 75 | 76 | def mix_recog_speaker(video_input, audio_input, hotwords, output_dir): 77 | output_dir = output_dir.strip() 78 | if not len(output_dir): 79 | output_dir = None 80 | else: 81 | output_dir = os.path.abspath(output_dir) 82 | audio_state, video_state = None, None 83 | if video_input is not None: 84 | res_text, res_srt, video_state = video_recog( 85 | video_input, 'Yes', hotwords, output_dir=output_dir) 86 | return res_text, res_srt, video_state, None 87 | if audio_input is not None: 88 | res_text, res_srt, audio_state = audio_recog( 89 | audio_input, 'Yes', hotwords, output_dir=output_dir) 90 | return res_text, res_srt, None, audio_state 91 | 92 | def mix_clip(dest_text, video_spk_input, start_ost, end_ost, video_state, audio_state, output_dir): 93 | output_dir = output_dir.strip() 94 | if not len(output_dir): 95 | output_dir = None 96 | else: 97 | output_dir = os.path.abspath(output_dir) 98 | if video_state is not None: 99 | clip_video_file, message, clip_srt = audio_clipper.video_clip( 100 | dest_text, start_ost, end_ost, video_state, dest_spk=video_spk_input, output_dir=output_dir) 101 | return clip_video_file, None, message, clip_srt 102 | if audio_state is not None: 103 | (sr, res_audio), message, clip_srt = audio_clipper.clip( 104 | dest_text, start_ost, end_ost, audio_state, dest_spk=video_spk_input, output_dir=output_dir) 105 | return None, (sr, res_audio), message, clip_srt 106 | 107 | def video_clip_addsub(dest_text, video_spk_input, start_ost, end_ost, state, output_dir, font_size, font_color): 108 | output_dir = output_dir.strip() 109 | if not len(output_dir): 110 | output_dir = None 111 | else: 112 | output_dir = os.path.abspath(output_dir) 113 | return audio_clipper.video_clip( 114 | dest_text, start_ost, end_ost, state, 115 | font_size=font_size, font_color=font_color, 116 | add_sub=True, dest_spk=video_spk_input, output_dir=output_dir 117 | ) 118 | 119 | def llm_inference(system_content, user_content, srt_text, model, apikey): 120 | SUPPORT_LLM_PREFIX = ['qwen', 'gpt', 'g4f', 'moonshot'] 121 | if model.startswith('qwen'): 122 | return call_qwen_model(apikey, model, user_content+'\n'+srt_text, system_content) 123 | if model.startswith('gpt') or model.startswith('moonshot'): 124 | return openai_call(apikey, model, system_content, user_content+'\n'+srt_text) 125 | elif model.startswith('g4f'): 126 | model = "-".join(model.split('-')[1:]) 127 | return g4f_openai_call(model, system_content, user_content+'\n'+srt_text) 128 | else: 129 | logging.error("LLM name error, only {} are supported as LLM name prefix." 
130 | .format(SUPPORT_LLM_PREFIX)) 131 | 132 | def AI_clip(LLM_res, dest_text, video_spk_input, start_ost, end_ost, video_state, audio_state, output_dir): 133 | timestamp_list = extract_timestamps(LLM_res) 134 | output_dir = output_dir.strip() 135 | if not len(output_dir): 136 | output_dir = None 137 | else: 138 | output_dir = os.path.abspath(output_dir) 139 | if video_state is not None: 140 | clip_video_file, message, clip_srt = audio_clipper.video_clip( 141 | dest_text, start_ost, end_ost, video_state, 142 | dest_spk=video_spk_input, output_dir=output_dir, timestamp_list=timestamp_list, add_sub=False) 143 | return clip_video_file, None, message, clip_srt 144 | if audio_state is not None: 145 | (sr, res_audio), message, clip_srt = audio_clipper.clip( 146 | dest_text, start_ost, end_ost, audio_state, 147 | dest_spk=video_spk_input, output_dir=output_dir, timestamp_list=timestamp_list, add_sub=False) 148 | return None, (sr, res_audio), message, clip_srt 149 | 150 | def AI_clip_subti(LLM_res, dest_text, video_spk_input, start_ost, end_ost, video_state, audio_state, output_dir): 151 | timestamp_list = extract_timestamps(LLM_res) 152 | output_dir = output_dir.strip() 153 | if not len(output_dir): 154 | output_dir = None 155 | else: 156 | output_dir = os.path.abspath(output_dir) 157 | if video_state is not None: 158 | clip_video_file, message, clip_srt = audio_clipper.video_clip( 159 | dest_text, start_ost, end_ost, video_state, 160 | dest_spk=video_spk_input, output_dir=output_dir, timestamp_list=timestamp_list, add_sub=True) 161 | return clip_video_file, None, message, clip_srt 162 | if audio_state is not None: 163 | (sr, res_audio), message, clip_srt = audio_clipper.clip( 164 | dest_text, start_ost, end_ost, audio_state, 165 | dest_spk=video_spk_input, output_dir=output_dir, timestamp_list=timestamp_list, add_sub=True) 166 | return None, (sr, res_audio), message, clip_srt 167 | 168 | # gradio interface 169 | theme = gr.Theme.load("funclip/utils/theme.json") 170 | with gr.Blocks(theme=theme) as funclip_service: 171 | gr.Markdown(top_md_1) 172 | # gr.Markdown(top_md_2) 173 | gr.Markdown(top_md_3) 174 | gr.Markdown(top_md_4) 175 | video_state, audio_state = gr.State(), gr.State() 176 | with gr.Row(): 177 | with gr.Column(): 178 | with gr.Row(): 179 | video_input = gr.Video(label="视频输入 | Video Input") 180 | audio_input = gr.Audio(label="音频输入 | Audio Input") 181 | with gr.Column(): 182 | gr.Examples(['https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/%E4%B8%BA%E4%BB%80%E4%B9%88%E8%A6%81%E5%A4%9A%E8%AF%BB%E4%B9%A6%EF%BC%9F%E8%BF%99%E6%98%AF%E6%88%91%E5%90%AC%E8%BF%87%E6%9C%80%E5%A5%BD%E7%9A%84%E7%AD%94%E6%A1%88-%E7%89%87%E6%AE%B5.mp4', 183 | 'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/2022%E4%BA%91%E6%A0%96%E5%A4%A7%E4%BC%9A_%E7%89%87%E6%AE%B52.mp4', 184 | 'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/%E4%BD%BF%E7%94%A8chatgpt_%E7%89%87%E6%AE%B5.mp4'], 185 | [video_input], 186 | label='示例视频 | Demo Video') 187 | gr.Examples(['https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/%E8%AE%BF%E8%B0%88.mp4'], 188 | [video_input], 189 | label='多说话人示例视频 | Multi-speaker Demo Video') 190 | gr.Examples(['https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/%E9%B2%81%E8%82%83%E9%87%87%E8%AE%BF%E7%89%87%E6%AE%B51.wav'], 191 | [audio_input], 192 | label="示例音频 | Demo Audio") 193 | with gr.Column(): 194 | # with gr.Row(): 195 | # video_sd_switch = gr.Radio(["No", "Yes"], label="👥区分说话人 Get Speakers", value='No') 196 | hotwords_input = 
gr.Textbox(label="🚒 热词 | Hotwords(可以为空,多个热词使用空格分隔,仅支持中文热词)") 197 | output_dir = gr.Textbox(label="📁 文件输出路径 | File Output Dir (可以为空,Linux, mac系统可以稳定使用)", value=" ") 198 | with gr.Row(): 199 | recog_button = gr.Button("👂 识别 | ASR", variant="primary") 200 | recog_button2 = gr.Button("👂👫 识别+区分说话人 | ASR+SD") 201 | video_text_output = gr.Textbox(label="✏️ 识别结果 | Recognition Result") 202 | video_srt_output = gr.Textbox(label="📖 SRT字幕内容 | RST Subtitles") 203 | with gr.Column(): 204 | with gr.Tab("🧠 LLM智能裁剪 | LLM Clipping"): 205 | with gr.Column(): 206 | prompt_head = gr.Textbox(label="Prompt System (按需更改,最好不要变动主体和要求)", value=("你是一个视频srt字幕分析剪辑器,输入视频的srt字幕," 207 | "分析其中的精彩且尽可能连续的片段并裁剪出来,输出四条以内的片段,将片段中在时间上连续的多个句子及它们的时间戳合并为一条," 208 | "注意确保文字与时间戳的正确匹配。输出需严格按照如下格式:1. [开始时间-结束时间] 文本,注意其中的连接符是“-”")) 209 | prompt_head2 = gr.Textbox(label="Prompt User(不需要修改,会自动拼接左下角的srt字幕)", value=("这是待裁剪的视频srt字幕:")) 210 | with gr.Column(): 211 | with gr.Row(): 212 | llm_model = gr.Dropdown( 213 | choices=["qwen-plus", 214 | "gpt-3.5-turbo", 215 | "gpt-3.5-turbo-0125", 216 | "gpt-4-turbo", 217 | "g4f-gpt-3.5-turbo"], 218 | value="qwen-plus", 219 | label="LLM Model Name", 220 | allow_custom_value=True) 221 | apikey_input = gr.Textbox(label="APIKEY") 222 | llm_button = gr.Button("LLM推理 | LLM Inference(首先进行识别,非g4f需配置对应apikey)", variant="primary") 223 | llm_result = gr.Textbox(label="LLM Clipper Result") 224 | with gr.Row(): 225 | llm_clip_button = gr.Button("🧠 LLM智能裁剪 | AI Clip", variant="primary") 226 | llm_clip_subti_button = gr.Button("🧠 LLM智能裁剪+字幕 | AI Clip+Subtitles") 227 | with gr.Tab("✂️ 根据文本/说话人裁剪 | Text/Speaker Clipping"): 228 | video_text_input = gr.Textbox(label="✏️ 待裁剪文本 | Text to Clip (多段文本使用'#'连接)") 229 | video_spk_input = gr.Textbox(label="✏️ 待裁剪说话人 | Speaker to Clip (多个说话人使用'#'连接)") 230 | with gr.Row(): 231 | clip_button = gr.Button("✂️ 裁剪 | Clip", variant="primary") 232 | clip_subti_button = gr.Button("✂️ 裁剪+字幕 | Clip+Subtitles") 233 | with gr.Row(): 234 | video_start_ost = gr.Slider(minimum=-500, maximum=1000, value=0, step=50, label="⏪ 开始位置偏移 | Start Offset (ms)") 235 | video_end_ost = gr.Slider(minimum=-500, maximum=1000, value=100, step=50, label="⏩ 结束位置偏移 | End Offset (ms)") 236 | with gr.Row(): 237 | font_size = gr.Slider(minimum=10, maximum=100, value=32, step=2, label="🔠 字幕字体大小 | Subtitle Font Size") 238 | font_color = gr.Radio(["black", "white", "green", "red"], label="🌈 字幕颜色 | Subtitle Color", value='white') 239 | # font = gr.Radio(["黑体", "Alibaba Sans"], label="字体 Font") 240 | video_output = gr.Video(label="裁剪结果 | Video Clipped") 241 | audio_output = gr.Audio(label="裁剪结果 | Audio Clipped") 242 | clip_message = gr.Textbox(label="⚠️ 裁剪信息 | Clipping Log") 243 | srt_clipped = gr.Textbox(label="📖 裁剪部分SRT字幕内容 | Clipped RST Subtitles") 244 | 245 | recog_button.click(mix_recog, 246 | inputs=[video_input, 247 | audio_input, 248 | hotwords_input, 249 | output_dir, 250 | ], 251 | outputs=[video_text_output, video_srt_output, video_state, audio_state]) 252 | recog_button2.click(mix_recog_speaker, 253 | inputs=[video_input, 254 | audio_input, 255 | hotwords_input, 256 | output_dir, 257 | ], 258 | outputs=[video_text_output, video_srt_output, video_state, audio_state]) 259 | clip_button.click(mix_clip, 260 | inputs=[video_text_input, 261 | video_spk_input, 262 | video_start_ost, 263 | video_end_ost, 264 | video_state, 265 | audio_state, 266 | output_dir 267 | ], 268 | outputs=[video_output, audio_output, clip_message, srt_clipped]) 269 | clip_subti_button.click(video_clip_addsub, 270 | inputs=[video_text_input, 
271 | video_spk_input, 272 | video_start_ost, 273 | video_end_ost, 274 | video_state, 275 | output_dir, 276 | font_size, 277 | font_color, 278 | ], 279 | outputs=[video_output, clip_message, srt_clipped]) 280 | llm_button.click(llm_inference, 281 | inputs=[prompt_head, prompt_head2, video_srt_output, llm_model, apikey_input], 282 | outputs=[llm_result]) 283 | llm_clip_button.click(AI_clip, 284 | inputs=[llm_result, 285 | video_text_input, 286 | video_spk_input, 287 | video_start_ost, 288 | video_end_ost, 289 | video_state, 290 | audio_state, 291 | output_dir, 292 | ], 293 | outputs=[video_output, audio_output, clip_message, srt_clipped]) 294 | llm_clip_subti_button.click(AI_clip_subti, 295 | inputs=[llm_result, 296 | video_text_input, 297 | video_spk_input, 298 | video_start_ost, 299 | video_end_ost, 300 | video_state, 301 | audio_state, 302 | output_dir, 303 | ], 304 | outputs=[video_output, audio_output, clip_message, srt_clipped]) 305 | 306 | # start gradio service in local or share 307 | if args.listen: 308 | funclip_service.launch(share=args.share, server_port=args.port, server_name=server_name, inbrowser=False) 309 | else: 310 | funclip_service.launch(share=args.share, server_port=args.port, server_name=server_name) 311 | -------------------------------------------------------------------------------- /funclip/llm/demo_prompt.py: -------------------------------------------------------------------------------- 1 | demo_prompt=""" 2 | 你是一个视频srt字幕剪辑工具,输入视频的srt字幕之后根据如下要求剪辑对应的片段并输出每个段落的开始与结束时间, 3 | 剪辑出以下片段中最有意义的、尽可能连续的部分,按如下格式输出:1. [开始时间-结束时间] 文本, 4 | 原始srt字幕如下: 5 | 0 6 | 00:00:00,50 --> 00:00:02,10 7 | 读万卷书行万里路, 8 | 1 9 | 00:00:02,310 --> 00:00:03,990 10 | 这里是读书三六九, 11 | 2 12 | 00:00:04,670 --> 00:00:07,990 13 | 今天要和您分享的这篇文章是人民日报, 14 | 3 15 | 00:00:08,510 --> 00:00:09,730 16 | 为什么要多读书? 17 | 4 18 | 00:00:10,90 --> 00:00:11,930 19 | 这是我听过最好的答案, 20 | 5 21 | 00:00:12,310 --> 00:00:13,190 22 | 经常有人问, 23 | 6 24 | 00:00:13,730 --> 00:00:14,690 25 | 读了那么多书, 26 | 7 27 | 00:00:14,990 --> 00:00:17,250 28 | 最终还不是要回到一座平凡的城, 29 | 8 30 | 00:00:17,610 --> 00:00:19,410 31 | 打一份平凡的工组, 32 | 9 33 | 00:00:19,410 --> 00:00:20,670 34 | 建一个平凡的家庭, 35 | 10 36 | 00:00:21,330 --> 00:00:25,960 37 | 何苦折腾一个人读书的意义究竟是什么? 38 | 11 39 | 00:00:26,680 --> 00:00:30,80 40 | 今天给大家分享人民日报推荐的八条理由, 41 | 12 42 | 00:00:30,540 --> 00:00:32,875 43 | 告诉你人为什么要多读书? 44 | 13 45 | 00:00:34,690 --> 00:00:38,725 46 | 一脚步丈量不到的地方文字可以。 47 | 14 48 | 00:00:40,300 --> 00:00:41,540 49 | 钱钟书先生说过, 50 | 15 51 | 00:00:42,260 --> 00:00:43,140 52 | 如果不读书, 53 | 16 54 | 00:00:43,520 --> 00:00:44,400 55 | 行万里路, 56 | 17 57 | 00:00:44,540 --> 00:00:45,695 58 | 也只是个邮差。 59 | 18 60 | 00:00:46,900 --> 00:00:47,320 61 | 北京、 62 | 19 63 | 00:00:47,500 --> 00:00:47,980 64 | 西安、 65 | 20 66 | 00:00:48,320 --> 00:00:51,200 67 | 南京和洛阳少了学识的浸润, 68 | 21 69 | 00:00:51,600 --> 00:00:55,565 70 | 他们只是一个个耳中熟悉又眼里陌生的地名。 71 | 22 72 | 00:00:56,560 --> 00:00:59,360 73 | 故宫避暑山庄岱庙、 74 | 23 75 | 00:00:59,840 --> 00:01:02,920 76 | 曲阜三孔有了文化照耀, 77 | 24 78 | 00:01:03,120 --> 00:01:05,340 79 | 他们才不是被时间风化的标本。 80 | 25 81 | 00:01:05,820 --> 00:01:08,105 82 | 而是活了成百上千年的生命, 83 | 26 84 | 00:01:09,650 --> 00:01:10,370 85 | 不去读书, 86 | 27 87 | 00:01:10,670 --> 00:01:12,920 88 | 就是一个邮差风景, 89 | 28 90 | 00:01:13,0 --> 00:01:13,835 91 | 过眼就忘, 92 | 29 93 | 00:01:14,750 --> 00:01:17,365 94 | 就算踏破铁鞋又有什么用处呢? 
95 | 30 96 | 00:01:19,240 --> 00:01:22,380 97 | 阅读不仅仅会让现实的旅行更加丰富, 98 | 31 99 | 00:01:23,120 --> 00:01:27,260 100 | 更重要的是能让精神突破现实和身体的桎梏, 101 | 32 102 | 00:01:27,640 --> 00:01:29,985 103 | 来一场灵魂长足的旅行。 104 | 33 105 | 00:01:31,850 --> 00:01:32,930 106 | 听过这样一句话, 107 | 34 108 | 00:01:33,490 --> 00:01:35,190 109 | 没有一艘非凡的船舰, 110 | 35 111 | 00:01:35,330 --> 00:01:36,430 112 | 能像一册书籍, 113 | 36 114 | 00:01:36,690 --> 00:01:38,595 115 | 把我们带到浩瀚的天地, 116 | 37 117 | 00:01:39,830 --> 00:01:42,685 118 | 你无法到达的地方文字在你过去, 119 | 38 120 | 00:01:43,530 --> 00:01:45,750 121 | 你无法经历的人生舒淇, 122 | 39 123 | 00:01:45,770 --> 00:01:46,595 124 | 带你相遇。 125 | 40 126 | 00:01:47,640 --> 00:01:50,340 127 | 那些读过的书会一本本充实, 128 | 41 129 | 00:01:50,340 --> 00:01:50,940 130 | 你的内心, 131 | 42 132 | 00:01:51,640 --> 00:01:54,855 133 | 让虚无单调的世界变得五彩斑斓。 134 | 43 135 | 00:01:55,930 --> 00:01:59,690 136 | 那些书中的人物会在你深陷生活泥潭之时, 137 | 44 138 | 00:02:00,170 --> 00:02:01,190 139 | 轻声的呼唤, 140 | 45 141 | 00:02:01,950 --> 00:02:03,270 142 | 用他们心怀梦想、 143 | 46 144 | 00:02:03,630 --> 00:02:04,950 145 | 不卑不亢的故事, 146 | 47 147 | 00:02:05,310 --> 00:02:07,90 148 | 激励你抵御苦难, 149 | 48 150 | 00:02:07,430 --> 00:02:08,525 151 | 勇往直前。 152 | 49 153 | 00:02:11,290 --> 00:02:11,695 154 | 二、 155 | 50 156 | 00:02:12,440 --> 00:02:16,900 157 | 读书的意义是使人虚心叫通达不固执、 158 | 51 159 | 00:02:17,200 --> 00:02:18,35 160 | 不偏执。 161 | 52 162 | 00:02:20,290 --> 00:02:22,935 163 | 读书越少的人越容易过得痛苦。 164 | 53 165 | 00:02:23,600 --> 00:02:24,400 166 | 读书越多, 167 | 54 168 | 00:02:24,800 --> 00:02:26,185 169 | 人才会越通透, 170 | 55 171 | 00:02:27,890 --> 00:02:30,30 172 | 知乎上有位网友讲过自己的故事。 173 | 56 174 | 00:02:30,750 --> 00:02:31,310 175 | 有一次, 176 | 57 177 | 00:02:31,530 --> 00:02:32,650 178 | 他跟伴侣吵架, 179 | 58 180 | 00:02:33,190 --> 00:02:35,505 181 | 气得连续好几个晚上没睡好, 182 | 59 183 | 00:02:36,360 --> 00:02:38,880 184 | 直到他读到一本关于亲密关系的书。 185 | 60 186 | 00:02:39,500 --> 00:02:41,920 187 | 书中有段关于夫妻关系的解读, 188 | 61 189 | 00:02:42,80 --> 00:02:43,100 190 | 让他豁然开朗, 191 | 62 192 | 00:02:43,460 --> 00:02:47,170 193 | 突然想明白了很多事气消了, 194 | 63 195 | 00:02:47,430 --> 00:02:48,410 196 | 心情好了, 197 | 64 198 | 00:02:48,790 --> 00:02:50,194 199 | 整个人也舒爽了。 200 | 65 201 | 00:02:51,780 --> 00:02:54,340 202 | 一个人书读的不多见识, 203 | 66 204 | 00:02:54,380 --> 00:02:55,180 205 | 难免受限, 206 | 67 207 | 00:02:55,720 --> 00:02:58,495 208 | 结果就必须受着眼前世界的禁锢, 209 | 68 210 | 00:02:59,540 --> 00:03:00,740 211 | 稍微遇到一点不顺, 212 | 69 213 | 00:03:00,940 --> 00:03:02,460 214 | 就极易消极悲观, 215 | 70 216 | 00:03:02,900 --> 00:03:03,720 217 | 郁郁寡欢, 218 | 71 219 | 00:03:04,140 --> 00:03:05,765 220 | 让自己困在情绪里, 221 | 72 222 | 00:03:06,900 --> 00:03:09,760 223 | 只有通过阅读才能看透人生真相, 224 | 73 225 | 00:03:10,300 --> 00:03:12,140 226 | 收获为人处事的智慧, 227 | 74 228 | 00:03:12,480 --> 00:03:14,95 229 | 把日子越过越好。 230 | 75 231 | 00:03:16,730 --> 00:03:17,890 232 | 生活的艺术里说, 233 | 76 234 | 00:03:18,410 --> 00:03:20,30 235 | 人一定要时时读书, 236 | 77 237 | 00:03:20,430 --> 00:03:22,915 238 | 不然便会鄙令晚腐。 239 | 78 240 | 00:03:23,690 --> 00:03:28,730 241 | 完剑俗剑生满身上一个人的落伍迂腐, 242 | 79 243 | 00:03:29,210 --> 00:03:31,205 244 | 就是不肯实施读书所致。 245 | 80 246 | 00:03:33,10 --> 00:03:34,790 247 | 只有在不断阅读的过程中, 248 | 81 249 | 00:03:34,990 --> 00:03:35,970 250 | 修心养性, 251 | 82 252 | 00:03:36,430 --> 00:03:38,735 253 | 才能摆脱我们的鄙俗和顽固。 254 | 83 255 | 00:03:39,920 --> 00:03:41,720 256 | 这世间没有谁的生活, 257 | 84 258 | 00:03:41,800 --> 00:03:42,540 259 | 没有烦恼, 260 | 85 261 | 00:03:43,140 --> 00:03:45,455 262 | 唯读书是最好的解药。 263 | 86 264 | 00:03:47,730 --> 00:03:48,185 265 | 三、 266 | 87 267 | 00:03:49,40 --> 00:03:50,720 
268 | 书中未必有黄金屋, 269 | 88 270 | 00:03:51,0 --> 00:03:52,595 271 | 但一定有更好的自己。 272 | """ -------------------------------------------------------------------------------- /funclip/llm/g4f_openai_api.py: -------------------------------------------------------------------------------- 1 | from g4f.client import Client 2 | 3 | if __name__ == '__main__': 4 | from llm.demo_prompt import demo_prompt 5 | client = Client() 6 | response = client.chat.completions.create( 7 | model="gpt-3.5-turbo", 8 | messages=[{"role": "user", "content": "你好你的名字是什么"}], 9 | ) 10 | print(response.choices[0].message.content) 11 | 12 | 13 | def g4f_openai_call(model="gpt-3.5-turbo", 14 | user_content="如何做西红柿炖牛腩?", 15 | system_content=None): 16 | client = Client() 17 | if system_content is not None and len(system_content.strip()): 18 | messages = [ 19 | {'role': 'system', 'content': system_content}, 20 | {'role': 'user', 'content': user_content} 21 | ] 22 | else: 23 | messages = [ 24 | {'role': 'user', 'content': user_content} 25 | ] 26 | response = client.chat.completions.create( 27 | model=model, 28 | messages=messages, 29 | ) 30 | return(response.choices[0].message.content) -------------------------------------------------------------------------------- /funclip/llm/openai_api.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | from openai import OpenAI 4 | 5 | 6 | if __name__ == '__main__': 7 | from llm.demo_prompt import demo_prompt 8 | client = OpenAI( 9 | # This is the default and can be omitted 10 | api_key=os.environ.get("OPENAI_API_KEY"), 11 | ) 12 | 13 | chat_completion = client.chat.completions.create( 14 | messages=[ 15 | { 16 | "role": "user", 17 | "content": demo_prompt, 18 | } 19 | ], 20 | model="gpt-3.5-turbo-0125", 21 | ) 22 | print(chat_completion.choices[0].message.content) 23 | 24 | 25 | def openai_call(apikey, 26 | model="gpt-3.5-turbo", 27 | user_content="如何做西红柿炖牛腩?", 28 | system_content=None): 29 | client = OpenAI( 30 | # This is the default and can be omitted 31 | api_key=apikey, 32 | ) 33 | if system_content is not None and len(system_content.strip()): 34 | messages = [ 35 | {'role': 'system', 'content': system_content}, 36 | {'role': 'user', 'content': user_content} 37 | ] 38 | else: 39 | messages = [ 40 | {'role': 'user', 'content': user_content} 41 | ] 42 | 43 | chat_completion = client.chat.completions.create( 44 | messages=messages, 45 | model=model, 46 | ) 47 | logging.info("Openai model inference done.") 48 | return chat_completion.choices[0].message.content -------------------------------------------------------------------------------- /funclip/llm/qwen_api.py: -------------------------------------------------------------------------------- 1 | import dashscope 2 | from dashscope import Generation 3 | 4 | 5 | def call_qwen_model(key=None, 6 | model="qwen_plus", 7 | user_content="如何做西红柿炖牛腩?", 8 | system_content=None): 9 | dashscope.api_key = key 10 | if system_content is not None and len(system_content.strip()): 11 | messages = [ 12 | {'role': 'system', 'content': system_content}, 13 | {'role': 'user', 'content': user_content} 14 | ] 15 | else: 16 | messages = [ 17 | {'role': 'user', 'content': user_content} 18 | ] 19 | responses = Generation.call(model, 20 | messages=messages, 21 | result_format='message', # 设置输出为'message'格式 22 | stream=False, # 设置输出方式为流式输出 23 | incremental_output=False # 增量式流式输出 24 | ) 25 | print(responses) 26 | return responses['output']['choices'][0]['message']['content'] 27 | 28 | 29 | if __name__ == 
'__main__': 30 | call_qwen_model('YOUR_BAILIAN_APIKEY') -------------------------------------------------------------------------------- /funclip/test/imagemagick_test.py: -------------------------------------------------------------------------------- 1 | from moviepy.editor import * 2 | from moviepy.video.tools.subtitles import SubtitlesClip, TextClip 3 | from moviepy.editor import VideoFileClip, concatenate_videoclips 4 | from moviepy.video.compositing import CompositeVideoClip 5 | 6 | generator = lambda txt: TextClip(txt, font='./font/STHeitiMedium.ttc', fontsize=48, color='white') 7 | subs = [((0, 2), 'sub1中文字幕'), 8 | ((2, 4), 'subs2'), 9 | ((4, 6), 'subs3'), 10 | ((6, 8), 'subs4')] 11 | 12 | subtitles = SubtitlesClip(subs, generator) 13 | 14 | video = VideoFileClip("examples/2022云栖大会_片段.mp4.mp4") 15 | video = video.subclip(0, 8) 16 | video = CompositeVideoClip([video, subtitles.set_pos(('center','bottom'))]) 17 | 18 | video.write_videofile("test_output.mp4") -------------------------------------------------------------------------------- /funclip/test/test.sh: -------------------------------------------------------------------------------- 1 | # step1: Recognize 2 | python videoclipper.py --stage 1 \ 3 | --file ../examples/2022云栖大会_片段.mp4 \ 4 | --sd_switch yes \ 5 | --output_dir ./output 6 | # now you can find recognition results and entire SRT file in ./output/ 7 | # step2: Clip 8 | python videoclipper.py --stage 2 \ 9 | --file ../examples/2022云栖大会_片段.mp4 \ 10 | --output_dir ./output \ 11 | --dest_text '所以这个是我们办这个奖的初心啊,我们也会一届一届的办下去' \ 12 | # --dest_spk spk0 \ 13 | --start_ost 0 \ 14 | --end_ost 100 \ 15 | --output_file './output/res.mp4' -------------------------------------------------------------------------------- /funclip/utils/argparse_tools.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved. 4 | # MIT License (https://opensource.org/licenses/MIT) 5 | 6 | import argparse 7 | from pathlib import Path 8 | 9 | import yaml 10 | import sys 11 | 12 | 13 | class ArgumentParser(argparse.ArgumentParser): 14 | """Simple implementation of ArgumentParser supporting config file 15 | 16 | This class is originated from https://github.com/bw2/ConfigArgParse, 17 | but this class is lack of some features that it has. 18 | 19 | - Not supporting multiple config files 20 | - Automatically adding "--config" as an option. 
21 | - Not supporting any formats other than yaml 22 | - Not checking argument type 23 | 24 | """ 25 | 26 | def __init__(self, *args, **kwargs): 27 | super().__init__(*args, **kwargs) 28 | self.add_argument("--config", help="Give config file in yaml format") 29 | 30 | def parse_known_args(self, args=None, namespace=None): 31 | # Once parsing for setting from "--config" 32 | _args, _ = super().parse_known_args(args, namespace) 33 | if _args.config is not None: 34 | if not Path(_args.config).exists(): 35 | self.error(f"No such file: {_args.config}") 36 | 37 | with open(_args.config, "r", encoding="utf-8") as f: 38 | d = yaml.safe_load(f) 39 | if not isinstance(d, dict): 40 | self.error("Config file has non dict value: {_args.config}") 41 | 42 | for key in d: 43 | for action in self._actions: 44 | if key == action.dest: 45 | break 46 | else: 47 | self.error(f"unrecognized arguments: {key} (from {_args.config})") 48 | 49 | # NOTE(kamo): Ignore "--config" from a config file 50 | # NOTE(kamo): Unlike "configargparse", this module doesn't check type. 51 | # i.e. We can set any type value regardless of argument type. 52 | self.set_defaults(**d) 53 | return super().parse_known_args(args, namespace) 54 | 55 | 56 | def get_commandline_args(): 57 | extra_chars = [ 58 | " ", 59 | ";", 60 | "&", 61 | "(", 62 | ")", 63 | "|", 64 | "^", 65 | "<", 66 | ">", 67 | "?", 68 | "*", 69 | "[", 70 | "]", 71 | "$", 72 | "`", 73 | '"', 74 | "\\", 75 | "!", 76 | "{", 77 | "}", 78 | ] 79 | 80 | # Escape the extra characters for shell 81 | argv = [ 82 | arg.replace("'", "'\\''") 83 | if all(char not in arg for char in extra_chars) 84 | else "'" + arg.replace("'", "'\\''") + "'" 85 | for arg in sys.argv 86 | ] 87 | 88 | return sys.executable + " " + " ".join(argv) -------------------------------------------------------------------------------- /funclip/utils/subtitle_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved. 
4 | # MIT License (https://opensource.org/licenses/MIT) 5 | import re 6 | 7 | def time_convert(ms): 8 | ms = int(ms) 9 | tail = ms % 1000 10 | s = ms // 1000 11 | mi = s // 60 12 | s = s % 60 13 | h = mi // 60 14 | mi = mi % 60 15 | h = "00" if h == 0 else str(h) 16 | mi = "00" if mi == 0 else str(mi) 17 | s = "00" if s == 0 else str(s) 18 | tail = str(tail) 19 | if len(h) == 1: h = '0' + h 20 | if len(mi) == 1: mi = '0' + mi 21 | if len(s) == 1: s = '0' + s 22 | return "{}:{}:{},{}".format(h, mi, s, tail) 23 | 24 | def str2list(text): 25 | pattern = re.compile(r'[\u4e00-\u9fff]|[\w-]+', re.UNICODE) 26 | elements = pattern.findall(text) 27 | return elements 28 | 29 | class Text2SRT(): 30 | def __init__(self, text, timestamp, offset=0): 31 | self.token_list = text 32 | self.timestamp = timestamp 33 | start, end = timestamp[0][0] - offset, timestamp[-1][1] - offset 34 | self.start_sec, self.end_sec = start, end 35 | self.start_time = time_convert(start) 36 | self.end_time = time_convert(end) 37 | def text(self): 38 | if isinstance(self.token_list, str): 39 | return self.token_list 40 | else: 41 | res = "" 42 | for word in self.token_list: 43 | if '\u4e00' <= word <= '\u9fff': 44 | res += word 45 | else: 46 | res += " " + word 47 | return res.lstrip() 48 | def srt(self, acc_ost=0.0): 49 | return "{} --> {}\n{}\n".format( 50 | time_convert(self.start_sec+acc_ost*1000), 51 | time_convert(self.end_sec+acc_ost*1000), 52 | self.text()) 53 | def time(self, acc_ost=0.0): 54 | return (self.start_sec/1000+acc_ost, self.end_sec/1000+acc_ost) 55 | 56 | 57 | def generate_srt(sentence_list): 58 | srt_total = '' 59 | for i, sent in enumerate(sentence_list): 60 | t2s = Text2SRT(sent['text'], sent['timestamp']) 61 | if 'spk' in sent: 62 | srt_total += "{} spk{}\n{}".format(i, sent['spk'], t2s.srt()) 63 | else: 64 | srt_total += "{}\n{}".format(i, t2s.srt()) 65 | return srt_total 66 | 67 | def generate_srt_clip(sentence_list, start, end, begin_index=0, time_acc_ost=0.0): 68 | start, end = int(start * 1000), int(end * 1000) 69 | srt_total = '' 70 | cc = 1 + begin_index 71 | subs = [] 72 | for _, sent in enumerate(sentence_list): 73 | if isinstance(sent['text'], str): 74 | sent['text'] = str2list(sent['text']) 75 | if sent['timestamp'][-1][1] <= start: 76 | # print("CASE0") 77 | continue 78 | if sent['timestamp'][0][0] >= end: 79 | # print("CASE4") 80 | break 81 | # parts in between 82 | if (sent['timestamp'][-1][1] <= end and sent['timestamp'][0][0] > start) or (sent['timestamp'][-1][1] == end and sent['timestamp'][0][0] == start): 83 | # print("CASE1"); import pdb; pdb.set_trace() 84 | t2s = Text2SRT(sent['text'], sent['timestamp'], offset=start) 85 | srt_total += "{}\n{}".format(cc, t2s.srt(time_acc_ost)) 86 | subs.append((t2s.time(time_acc_ost), t2s.text())) 87 | cc += 1 88 | continue 89 | if sent['timestamp'][0][0] <= start: 90 | # print("CASE2"); import pdb; pdb.set_trace() 91 | if not sent['timestamp'][-1][1] > end: 92 | for j, ts in enumerate(sent['timestamp']): 93 | if ts[1] > start: 94 | break 95 | _text = sent['text'][j:] 96 | _ts = sent['timestamp'][j:] 97 | else: 98 | for j, ts in enumerate(sent['timestamp']): 99 | if ts[1] > start: 100 | _start = j 101 | break 102 | for j, ts in enumerate(sent['timestamp']): 103 | if ts[1] > end: 104 | _end = j 105 | break 106 | # _text = " ".join(sent['text'][_start:_end]) 107 | _text = sent['text'][_start:_end] 108 | _ts = sent['timestamp'][_start:_end] 109 | if len(ts): 110 | t2s = Text2SRT(_text, _ts, offset=start) 111 | srt_total += "{}\n{}".format(cc, 
t2s.srt(time_acc_ost)) 112 | subs.append((t2s.time(time_acc_ost), t2s.text())) 113 | cc += 1 114 | continue 115 | if sent['timestamp'][-1][1] > end: 116 | # print("CASE3"); import pdb; pdb.set_trace() 117 | for j, ts in enumerate(sent['timestamp']): 118 | if ts[1] > end: 119 | break 120 | _text = sent['text'][:j] 121 | _ts = sent['timestamp'][:j] 122 | if len(_ts): 123 | t2s = Text2SRT(_text, _ts, offset=start) 124 | srt_total += "{}\n{}".format(cc, t2s.srt(time_acc_ost)) 125 | subs.append( 126 | (t2s.time(time_acc_ost), t2s.text()) 127 | ) 128 | cc += 1 129 | continue 130 | return srt_total, subs, cc 131 | -------------------------------------------------------------------------------- /funclip/utils/theme.json: -------------------------------------------------------------------------------- 1 | { 2 | "theme": { 3 | "_font": [ 4 | { 5 | "__gradio_font__": true, 6 | "name": "Montserrat", 7 | "class": "google" 8 | }, 9 | { 10 | "__gradio_font__": true, 11 | "name": "ui-sans-serif", 12 | "class": "font" 13 | }, 14 | { 15 | "__gradio_font__": true, 16 | "name": "system-ui", 17 | "class": "font" 18 | }, 19 | { 20 | "__gradio_font__": true, 21 | "name": "sans-serif", 22 | "class": "font" 23 | } 24 | ], 25 | "_font_mono": [ 26 | { 27 | "__gradio_font__": true, 28 | "name": "IBM Plex Mono", 29 | "class": "google" 30 | }, 31 | { 32 | "__gradio_font__": true, 33 | "name": "ui-monospace", 34 | "class": "font" 35 | }, 36 | { 37 | "__gradio_font__": true, 38 | "name": "Consolas", 39 | "class": "font" 40 | }, 41 | { 42 | "__gradio_font__": true, 43 | "name": "monospace", 44 | "class": "font" 45 | } 46 | ], 47 | "background_fill_primary": "*neutral_50", 48 | "background_fill_primary_dark": "*neutral_950", 49 | "background_fill_secondary": "*neutral_50", 50 | "background_fill_secondary_dark": "*neutral_900", 51 | "block_background_fill": "white", 52 | "block_background_fill_dark": "*neutral_800", 53 | "block_border_color": "*border_color_primary", 54 | "block_border_color_dark": "*border_color_primary", 55 | "block_border_width": "0px", 56 | "block_border_width_dark": "0px", 57 | "block_info_text_color": "*body_text_color_subdued", 58 | "block_info_text_color_dark": "*body_text_color_subdued", 59 | "block_info_text_size": "*text_sm", 60 | "block_info_text_weight": "400", 61 | "block_label_background_fill": "*primary_100", 62 | "block_label_background_fill_dark": "*primary_600", 63 | "block_label_border_color": "*border_color_primary", 64 | "block_label_border_color_dark": "*border_color_primary", 65 | "block_label_border_width": "1px", 66 | "block_label_border_width_dark": "1px", 67 | "block_label_margin": "*spacing_md", 68 | "block_label_padding": "*spacing_sm *spacing_md", 69 | "block_label_radius": "*radius_md", 70 | "block_label_right_radius": "0 calc(*radius_lg - 1px) 0 calc(*radius_lg - 1px)", 71 | "block_label_text_color": "*primary_500", 72 | "block_label_text_color_dark": "*white", 73 | "block_label_text_size": "*text_md", 74 | "block_label_text_weight": "600", 75 | "block_padding": "*spacing_xl calc(*spacing_xl + 2px)", 76 | "block_radius": "*radius_lg", 77 | "block_shadow": "none", 78 | "block_shadow_dark": "none", 79 | "block_title_background_fill": "*block_label_background_fill", 80 | "block_title_background_fill_dark": "*block_label_background_fill", 81 | "block_title_border_color": "none", 82 | "block_title_border_color_dark": "none", 83 | "block_title_border_width": "0px", 84 | "block_title_border_width_dark": "0px", 85 | "block_title_padding": "*block_label_padding", 86 | 
"block_title_radius": "*block_label_radius", 87 | "block_title_text_color": "*primary_500", 88 | "block_title_text_color_dark": "*white", 89 | "block_title_text_size": "*text_md", 90 | "block_title_text_weight": "600", 91 | "body_background_fill": "*background_fill_primary", 92 | "body_background_fill_dark": "*background_fill_primary", 93 | "body_text_color": "*neutral_800", 94 | "body_text_color_dark": "*neutral_100", 95 | "body_text_color_subdued": "*neutral_400", 96 | "body_text_color_subdued_dark": "*neutral_400", 97 | "body_text_size": "*text_md", 98 | "body_text_weight": "400", 99 | "border_color_accent": "*primary_300", 100 | "border_color_accent_dark": "*neutral_600", 101 | "border_color_primary": "*neutral_200", 102 | "border_color_primary_dark": "*neutral_700", 103 | "button_border_width": "*input_border_width", 104 | "button_border_width_dark": "*input_border_width", 105 | "button_cancel_background_fill": "*button_secondary_background_fill", 106 | "button_cancel_background_fill_dark": "*button_secondary_background_fill", 107 | "button_cancel_background_fill_hover": "*button_secondary_background_fill_hover", 108 | "button_cancel_background_fill_hover_dark": "*button_secondary_background_fill_hover", 109 | "button_cancel_border_color": "*button_secondary_border_color", 110 | "button_cancel_border_color_dark": "*button_secondary_border_color", 111 | "button_cancel_border_color_hover": "*button_cancel_border_color", 112 | "button_cancel_border_color_hover_dark": "*button_cancel_border_color", 113 | "button_cancel_text_color": "*button_secondary_text_color", 114 | "button_cancel_text_color_dark": "*button_secondary_text_color", 115 | "button_cancel_text_color_hover": "*button_cancel_text_color", 116 | "button_cancel_text_color_hover_dark": "*button_cancel_text_color", 117 | "button_large_padding": "*spacing_lg calc(2 * *spacing_lg)", 118 | "button_large_radius": "*radius_lg", 119 | "button_large_text_size": "*text_lg", 120 | "button_large_text_weight": "600", 121 | "button_primary_background_fill": "*primary_500", 122 | "button_primary_background_fill_dark": "*primary_700", 123 | "button_primary_background_fill_hover": "*primary_400", 124 | "button_primary_background_fill_hover_dark": "*primary_500", 125 | "button_primary_border_color": "*primary_200", 126 | "button_primary_border_color_dark": "*primary_600", 127 | "button_primary_border_color_hover": "*button_primary_border_color", 128 | "button_primary_border_color_hover_dark": "*button_primary_border_color", 129 | "button_primary_text_color": "white", 130 | "button_primary_text_color_dark": "white", 131 | "button_primary_text_color_hover": "*button_primary_text_color", 132 | "button_primary_text_color_hover_dark": "*button_primary_text_color", 133 | "button_secondary_background_fill": "white", 134 | "button_secondary_background_fill_dark": "*neutral_600", 135 | "button_secondary_background_fill_hover": "*neutral_100", 136 | "button_secondary_background_fill_hover_dark": "*primary_500", 137 | "button_secondary_border_color": "*neutral_200", 138 | "button_secondary_border_color_dark": "*neutral_600", 139 | "button_secondary_border_color_hover": "*button_secondary_border_color", 140 | "button_secondary_border_color_hover_dark": "*button_secondary_border_color", 141 | "button_secondary_text_color": "*neutral_800", 142 | "button_secondary_text_color_dark": "white", 143 | "button_secondary_text_color_hover": "*button_secondary_text_color", 144 | "button_secondary_text_color_hover_dark": "*button_secondary_text_color", 145 | 
"button_shadow": "*shadow_drop_lg", 146 | "button_shadow_active": "*shadow_inset", 147 | "button_shadow_hover": "*shadow_drop_lg", 148 | "button_small_padding": "*spacing_sm calc(2 * *spacing_sm)", 149 | "button_small_radius": "*radius_lg", 150 | "button_small_text_size": "*text_md", 151 | "button_small_text_weight": "400", 152 | "button_transition": "background-color 0.2s ease", 153 | "checkbox_background_color": "*background_fill_primary", 154 | "checkbox_background_color_dark": "*neutral_800", 155 | "checkbox_background_color_focus": "*checkbox_background_color", 156 | "checkbox_background_color_focus_dark": "*checkbox_background_color", 157 | "checkbox_background_color_hover": "*checkbox_background_color", 158 | "checkbox_background_color_hover_dark": "*checkbox_background_color", 159 | "checkbox_background_color_selected": "*primary_600", 160 | "checkbox_background_color_selected_dark": "*primary_700", 161 | "checkbox_border_color": "*neutral_100", 162 | "checkbox_border_color_dark": "*neutral_600", 163 | "checkbox_border_color_focus": "*primary_500", 164 | "checkbox_border_color_focus_dark": "*primary_600", 165 | "checkbox_border_color_hover": "*neutral_300", 166 | "checkbox_border_color_hover_dark": "*neutral_600", 167 | "checkbox_border_color_selected": "*primary_600", 168 | "checkbox_border_color_selected_dark": "*primary_700", 169 | "checkbox_border_radius": "*radius_sm", 170 | "checkbox_border_width": "1px", 171 | "checkbox_border_width_dark": "*input_border_width", 172 | "checkbox_check": "url(\"data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M12.207 4.793a1 1 0 010 1.414l-5 5a1 1 0 01-1.414 0l-2-2a1 1 0 011.414-1.414L6.5 9.086l4.293-4.293a1 1 0 011.414 0z'/%3e%3c/svg%3e\")", 173 | "checkbox_label_background_fill": "*button_secondary_background_fill", 174 | "checkbox_label_background_fill_dark": "*button_secondary_background_fill", 175 | "checkbox_label_background_fill_hover": "*button_secondary_background_fill_hover", 176 | "checkbox_label_background_fill_hover_dark": "*button_secondary_background_fill_hover", 177 | "checkbox_label_background_fill_selected": "*primary_500", 178 | "checkbox_label_background_fill_selected_dark": "*primary_600", 179 | "checkbox_label_border_color": "*border_color_primary", 180 | "checkbox_label_border_color_dark": "*border_color_primary", 181 | "checkbox_label_border_color_hover": "*checkbox_label_border_color", 182 | "checkbox_label_border_color_hover_dark": "*checkbox_label_border_color", 183 | "checkbox_label_border_width": "*input_border_width", 184 | "checkbox_label_border_width_dark": "*input_border_width", 185 | "checkbox_label_gap": "*spacing_lg", 186 | "checkbox_label_padding": "*spacing_md calc(2 * *spacing_md)", 187 | "checkbox_label_shadow": "*shadow_drop_lg", 188 | "checkbox_label_text_color": "*body_text_color", 189 | "checkbox_label_text_color_dark": "*body_text_color", 190 | "checkbox_label_text_color_selected": "white", 191 | "checkbox_label_text_color_selected_dark": "*checkbox_label_text_color", 192 | "checkbox_label_text_size": "*text_md", 193 | "checkbox_label_text_weight": "400", 194 | "checkbox_shadow": "none", 195 | "color_accent": "*primary_500", 196 | "color_accent_soft": "*primary_50", 197 | "color_accent_soft_dark": "*neutral_700", 198 | "container_radius": "*radius_lg", 199 | "embed_radius": "*radius_lg", 200 | "error_background_fill": "#fee2e2", 201 | "error_background_fill_dark": "*background_fill_primary", 202 | "error_border_color": "#fecaca", 203 | 
"error_border_color_dark": "*border_color_primary", 204 | "error_border_width": "1px", 205 | "error_border_width_dark": "1px", 206 | "error_text_color": "#ef4444", 207 | "error_text_color_dark": "#ef4444", 208 | "font": "'Montserrat', 'ui-sans-serif', 'system-ui', sans-serif", 209 | "font_mono": "'IBM Plex Mono', 'ui-monospace', 'Consolas', monospace", 210 | "form_gap_width": "0px", 211 | "input_background_fill": "white", 212 | "input_background_fill_dark": "*neutral_700", 213 | "input_background_fill_focus": "*secondary_500", 214 | "input_background_fill_focus_dark": "*secondary_600", 215 | "input_background_fill_hover": "*input_background_fill", 216 | "input_background_fill_hover_dark": "*input_background_fill", 217 | "input_border_color": "*neutral_50", 218 | "input_border_color_dark": "*border_color_primary", 219 | "input_border_color_focus": "*secondary_300", 220 | "input_border_color_focus_dark": "*neutral_700", 221 | "input_border_color_hover": "*input_border_color", 222 | "input_border_color_hover_dark": "*input_border_color", 223 | "input_border_width": "0px", 224 | "input_border_width_dark": "0px", 225 | "input_padding": "*spacing_xl", 226 | "input_placeholder_color": "*neutral_400", 227 | "input_placeholder_color_dark": "*neutral_500", 228 | "input_radius": "*radius_lg", 229 | "input_shadow": "*shadow_drop", 230 | "input_shadow_dark": "*shadow_drop", 231 | "input_shadow_focus": "*shadow_drop_lg", 232 | "input_shadow_focus_dark": "*shadow_drop_lg", 233 | "input_text_size": "*text_md", 234 | "input_text_weight": "400", 235 | "layout_gap": "*spacing_xxl", 236 | "link_text_color": "*secondary_600", 237 | "link_text_color_active": "*secondary_600", 238 | "link_text_color_active_dark": "*secondary_500", 239 | "link_text_color_dark": "*secondary_500", 240 | "link_text_color_hover": "*secondary_700", 241 | "link_text_color_hover_dark": "*secondary_400", 242 | "link_text_color_visited": "*secondary_500", 243 | "link_text_color_visited_dark": "*secondary_600", 244 | "loader_color": "*color_accent", 245 | "loader_color_dark": "*color_accent", 246 | "name": "base", 247 | "neutral_100": "#f3f4f6", 248 | "neutral_200": "#e5e7eb", 249 | "neutral_300": "#d1d5db", 250 | "neutral_400": "#9ca3af", 251 | "neutral_50": "#f9fafb", 252 | "neutral_500": "#6b7280", 253 | "neutral_600": "#4b5563", 254 | "neutral_700": "#374151", 255 | "neutral_800": "#1f2937", 256 | "neutral_900": "#111827", 257 | "neutral_950": "#0b0f19", 258 | "panel_background_fill": "*background_fill_secondary", 259 | "panel_background_fill_dark": "*background_fill_secondary", 260 | "panel_border_color": "*border_color_primary", 261 | "panel_border_color_dark": "*border_color_primary", 262 | "panel_border_width": "1px", 263 | "panel_border_width_dark": "1px", 264 | "primary_100": "#e0e7ff", 265 | "primary_200": "#c7d2fe", 266 | "primary_300": "#a5b4fc", 267 | "primary_400": "#818cf8", 268 | "primary_50": "#eef2ff", 269 | "primary_500": "#6366f1", 270 | "primary_600": "#4f46e5", 271 | "primary_700": "#4338ca", 272 | "primary_800": "#3730a3", 273 | "primary_900": "#312e81", 274 | "primary_950": "#2b2c5e", 275 | "prose_header_text_weight": "600", 276 | "prose_text_size": "*text_md", 277 | "prose_text_weight": "400", 278 | "radio_circle": "url(\"data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3ccircle cx='8' cy='8' r='3'/%3e%3c/svg%3e\")", 279 | "radius_lg": "6px", 280 | "radius_md": "4px", 281 | "radius_sm": "2px", 282 | "radius_xl": "8px", 283 | "radius_xs": "1px", 284 | "radius_xxl": 
"12px", 285 | "radius_xxs": "1px", 286 | "secondary_100": "#ecfccb", 287 | "secondary_200": "#d9f99d", 288 | "secondary_300": "#bef264", 289 | "secondary_400": "#a3e635", 290 | "secondary_50": "#f7fee7", 291 | "secondary_500": "#84cc16", 292 | "secondary_600": "#65a30d", 293 | "secondary_700": "#4d7c0f", 294 | "secondary_800": "#3f6212", 295 | "secondary_900": "#365314", 296 | "secondary_950": "#2f4e14", 297 | "section_header_text_size": "*text_md", 298 | "section_header_text_weight": "400", 299 | "shadow_drop": "0 1px 4px 0 rgb(0 0 0 / 0.1)", 300 | "shadow_drop_lg": "0 2px 5px 0 rgb(0 0 0 / 0.1)", 301 | "shadow_inset": "rgba(0,0,0,0.05) 0px 2px 4px 0px inset", 302 | "shadow_spread": "6px", 303 | "shadow_spread_dark": "1px", 304 | "slider_color": "*primary_500", 305 | "slider_color_dark": "*primary_600", 306 | "spacing_lg": "6px", 307 | "spacing_md": "4px", 308 | "spacing_sm": "2px", 309 | "spacing_xl": "9px", 310 | "spacing_xs": "1px", 311 | "spacing_xxl": "12px", 312 | "spacing_xxs": "1px", 313 | "stat_background_fill": "*primary_300", 314 | "stat_background_fill_dark": "*primary_500", 315 | "table_border_color": "*neutral_300", 316 | "table_border_color_dark": "*neutral_700", 317 | "table_even_background_fill": "white", 318 | "table_even_background_fill_dark": "*neutral_950", 319 | "table_odd_background_fill": "*neutral_50", 320 | "table_odd_background_fill_dark": "*neutral_900", 321 | "table_radius": "*radius_lg", 322 | "table_row_focus": "*color_accent_soft", 323 | "table_row_focus_dark": "*color_accent_soft", 324 | "text_lg": "16px", 325 | "text_md": "14px", 326 | "text_sm": "12px", 327 | "text_xl": "22px", 328 | "text_xs": "10px", 329 | "text_xxl": "26px", 330 | "text_xxs": "9px" 331 | }, 332 | "version": "0.0.1" 333 | } -------------------------------------------------------------------------------- /funclip/utils/trans_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved. 
4 | # MIT License (https://opensource.org/licenses/MIT)
5 |
6 | import os
7 | import re
8 | import numpy as np
9 |
10 | PUNC_LIST = ['，', '。', '！', '？', '、', ',', '.', '?', '!']
11 |
12 | def pre_proc(text):
13 |     res = ''
14 |     for i in range(len(text)):
15 |         if text[i] in PUNC_LIST:
16 |             continue
17 |         if '\u4e00' <= text[i] <= '\u9fff':
18 |             if len(res) and res[-1] != " ":
19 |                 res += ' ' + text[i]+' '
20 |             else:
21 |                 res += text[i]+' '
22 |         else:
23 |             res += text[i]
24 |     if res[-1] == ' ':
25 |         res = res[:-1]
26 |     return res
27 |
28 | def proc(raw_text, timestamp, dest_text, lang='zh'):
29 |     # simple matching
30 |     ld = len(dest_text.split())
31 |     mi, ts = [], []
32 |     offset = 0
33 |     while True:
34 |         fi = raw_text.find(dest_text, offset, len(raw_text))
35 |         ti = raw_text[:fi].count(' ')
36 |         if fi == -1:
37 |             break
38 |         offset = fi + ld
39 |         mi.append(fi)
40 |         ts.append([timestamp[ti][0]*16, timestamp[ti+ld-1][1]*16])
41 |     return ts
42 |
43 |
44 | def proc_spk(dest_spk, sd_sentences):
45 |     ts = []
46 |     for d in sd_sentences:
47 |         d_start = d['timestamp'][0][0]
48 |         d_end = d['timestamp'][-1][1]
49 |         spkid=dest_spk[3:]
50 |         if str(d['spk']) == spkid and d_end-d_start>999:
51 |             ts.append([d_start*16, d_end*16])
52 |     return ts
53 |
54 | def generate_vad_data(data, sd_sentences, sr=16000):
55 |     assert len(data.shape) == 1
56 |     vad_data = []
57 |     for d in sd_sentences:
58 |         d_start = round(d['ts_list'][0][0]/1000, 2)
59 |         d_end = round(d['ts_list'][-1][1]/1000, 2)
60 |         vad_data.append([d_start, d_end, data[int(d_start * sr):int(d_end * sr)]])
61 |     return vad_data
62 |
63 | def write_state(output_dir, state):
64 |     for key in ['/recog_res_raw', '/timestamp', '/sentences']:#, '/sd_sentences']:
65 |         with open(output_dir+key, 'w') as fout:
66 |             fout.write(str(state[key[1:]]))
67 |     if 'sd_sentences' in state:
68 |         with open(output_dir+'/sd_sentences', 'w') as fout:
69 |             fout.write(str(state['sd_sentences']))
70 |
71 | def load_state(output_dir):
72 |     state = {}
73 |     with open(output_dir+'/recog_res_raw') as fin:
74 |         line = fin.read()
75 |         state['recog_res_raw'] = line
76 |     with open(output_dir+'/timestamp') as fin:
77 |         line = fin.read()
78 |         state['timestamp'] = eval(line)
79 |     with open(output_dir+'/sentences') as fin:
80 |         line = fin.read()
81 |         state['sentences'] = eval(line)
82 |     if os.path.exists(output_dir+'/sd_sentences'):
83 |         with open(output_dir+'/sd_sentences') as fin:
84 |             line = fin.read()
85 |             state['sd_sentences'] = eval(line)
86 |     return state
87 |
88 | def convert_pcm_to_float(data):
89 |     if data.dtype == np.float64:
90 |         return data
91 |     elif data.dtype == np.float32:
92 |         return data.astype(np.float64)
93 |     elif data.dtype == np.int16:
94 |         bit_depth = 16
95 |     elif data.dtype == np.int32:
96 |         bit_depth = 32
97 |     elif data.dtype == np.int8:
98 |         bit_depth = 8
99 |     else:
100 |         raise ValueError("Unsupported audio data type")
101 |
102 |     # Now handle the integer types
103 |     max_int_value = float(2 ** (bit_depth - 1))
104 |     if bit_depth == 8:
105 |         data = data - 128
106 |     return (data.astype(np.float64) / max_int_value)
107 |
108 | def convert_time_to_millis(time_str):
109 |     # Format: [hours:minutes:seconds,milliseconds]
110 |     hours, minutes, seconds, milliseconds = map(int, re.split('[:,]', time_str))
111 |     return (hours * 3600 + minutes * 60 + seconds) * 1000 + milliseconds
112 |
113 | def extract_timestamps(input_text):
114 |     # Use a regular expression to find all timestamp pairs
115 |     timestamps = re.findall(r'\[(\d{2}:\d{2}:\d{2},\d{2,3})\s*-\s*(\d{2}:\d{2}:\d{2},\d{2,3})\]', input_text)
116 |     times_list = []
117 |     print(timestamps)
118 |     # Loop over the extracted timestamps and convert them to milliseconds
119 |     for start_time, end_time in timestamps:
120 |         start_millis = convert_time_to_millis(start_time)
121 |         end_millis = convert_time_to_millis(end_time)
122 |         times_list.append([start_millis, end_millis])
123 |
124 |     return times_list
125 |
126 |
127 | if __name__ == '__main__':
128 |     text = ("1. [00:00:00,500-00:00:05,850] 在我们的设计普惠当中,有一个我经常津津乐道的项目叫寻找远方的美好。"
129 |             "2. [00:00:07,120-00:00:12,940] 啊,在这样一个我们叫寻美在这样的一个项目当中,我们把它跟乡村振兴去结合起来,利用我们的设计的能力。"
130 |             "3. [00:00:13,240-00:00:25,620] 问我们自身员工的设设计能力,我们设计生态伙伴的能力,帮助乡村振兴当中,要希望把他的产品推向市场,把他的农产品把他加工产品推向市场的这样的伙伴做一件事情,")
131 |
132 |     print(extract_timestamps(text))
--------------------------------------------------------------------------------
/funclip/videoclipper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- encoding: utf-8 -*-
3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved.
4 | # MIT License (https://opensource.org/licenses/MIT)
5 |
6 | import re
7 | import os
8 | import sys
9 | import copy
10 | import librosa
11 | import logging
12 | import argparse
13 | import numpy as np
14 | import soundfile as sf
15 | from moviepy.editor import *
16 | import moviepy.editor as mpy
17 | from moviepy.video.tools.subtitles import SubtitlesClip, TextClip
18 | from moviepy.editor import VideoFileClip, concatenate_videoclips
19 | from moviepy.video.compositing.CompositeVideoClip import CompositeVideoClip
20 | from utils.subtitle_utils import generate_srt, generate_srt_clip
21 | from utils.argparse_tools import ArgumentParser, get_commandline_args
22 | from utils.trans_utils import pre_proc, proc, write_state, load_state, proc_spk, convert_pcm_to_float
23 |
24 |
25 | class VideoClipper():
26 |     def __init__(self, funasr_model):
27 |         logging.warning("Initializing VideoClipper.")
28 |         self.funasr_model = funasr_model
29 |         self.GLOBAL_COUNT = 0
30 |
31 |     def recog(self, audio_input, sd_switch='no', state=None, hotwords="", output_dir=None):
32 |         if state is None:
33 |             state = {}
34 |         sr, data = audio_input
35 |
36 |         # Convert to float64 consistently (includes data type checking)
37 |         data = convert_pcm_to_float(data)
38 |
39 |         # assert sr == 16000, "16kHz sample rate required, {} given.".format(sr)
40 |         if sr != 16000: # resample with librosa
41 |             data = librosa.resample(data, orig_sr=sr, target_sr=16000)
42 |         if len(data.shape) == 2:  # multi-channel wav input
43 |             logging.warning("Input wav shape: {}, only the first channel is kept.".format(data.shape))
44 |             data = data[:,0]
45 |         state['audio_input'] = (sr, data)
46 |         if sd_switch.lower() == 'yes':  # accept both 'yes' (CLI) and 'Yes' (UI)
47 |             rec_result = self.funasr_model.generate(data,
48 |                                                     return_spk_res=True,
49 |                                                     return_raw_text=True,
50 |                                                     is_final=True,
51 |                                                     output_dir=output_dir,
52 |                                                     hotword=hotwords,
53 |                                                     pred_timestamp=self.lang=='en',
54 |                                                     en_post_proc=self.lang=='en',
55 |                                                     cache={})
56 |             res_srt = generate_srt(rec_result[0]['sentence_info'])
57 |             state['sd_sentences'] = rec_result[0]['sentence_info']
58 |         else:
59 |             rec_result = self.funasr_model.generate(data,
60 |                                                     return_spk_res=False,
61 |                                                     sentence_timestamp=True,
62 |                                                     return_raw_text=True,
63 |                                                     is_final=True,
64 |                                                     hotword=hotwords,
65 |                                                     output_dir=output_dir,
66 |                                                     pred_timestamp=self.lang=='en',
67 |                                                     en_post_proc=self.lang=='en',
68 |                                                     cache={})
69 |             res_srt = generate_srt(rec_result[0]['sentence_info'])
70 |         state['recog_res_raw'] = rec_result[0]['raw_text']
71 |         state['timestamp'] = rec_result[0]['timestamp']
72 |         state['sentences'] = rec_result[0]['sentence_info']
73 |         res_text = rec_result[0]['text']
74 |         return res_text, res_srt, state
75 |
76 |     def clip(self, dest_text, start_ost, end_ost, state, dest_spk=None, output_dir=None, timestamp_list=None):
77 |         # get from state
78 |         audio_input = state['audio_input']
79 |         recog_res_raw = state['recog_res_raw']
80 |         timestamp = state['timestamp']
81 |         sentences = state['sentences']
82 |         sr, data = audio_input
83 |         data = data.astype(np.float64)
84 |
85 |         if timestamp_list is None:
86 |             all_ts = []
87 |             if dest_spk is None or dest_spk == '' or 'sd_sentences' not in state:
88 |                 for _dest_text in dest_text.split('#'):
89 |                     if '[' in _dest_text:
90 |                         match = re.search(r'\[(\d+),\s*(\d+)\]', _dest_text)
91 |                         if match:
92 |                             offset_b, offset_e = map(int, match.groups())
93 |                             log_append = ""
94 |                         else:
95 |                             offset_b, offset_e = 0, 0
96 |                             log_append = "(Bracket detected in dest_text but offset time matching failed)"
97 |                         _dest_text = _dest_text[:_dest_text.find('[')]
98 |                     else:
99 |                         log_append = ""
100 |                         offset_b, offset_e = 0, 0
101 |                     _dest_text = pre_proc(_dest_text)
102 |                     ts = proc(recog_res_raw, timestamp, _dest_text)
103 |                     for _ts in ts: all_ts.append([_ts[0]+offset_b*16, _ts[1]+offset_e*16])
104 |                     if len(ts) > 1 and match:
105 |                         log_append += '(offsets detected but No.{} sub-sentence matched to {} periods in audio, \
106 |                             offsets are applied to all periods)'
107 |             else:
108 |                 for _dest_spk in dest_spk.split('#'):
109 |                     ts = proc_spk(_dest_spk, state['sd_sentences'])
110 |                     for _ts in ts: all_ts.append(_ts)
111 |                 log_append = ""
112 |         else:
113 |             all_ts, log_append = timestamp_list, ""  # initialize log_append so the message below can be built
114 |         ts = all_ts
115 |         # ts.sort()
116 |         srt_index = 0
117 |         clip_srt = ""
118 |         if len(ts):
119 |             start, end = ts[0]
120 |             start = min(max(0, start+start_ost*16), len(data))
121 |             end = min(max(0, end+end_ost*16), len(data))
122 |             res_audio = data[start:end]
123 |             start_end_info = "from {} to {}".format(start/16000, end/16000)
124 |             srt_clip, _, srt_index = generate_srt_clip(sentences, start/16000.0, end/16000.0, begin_index=srt_index)
125 |             clip_srt += srt_clip
126 |             for _ts in ts[1:]: # multiple sentence input or multiple output matched
127 |                 start, end = _ts
128 |                 start = min(max(0, start+start_ost*16), len(data))
129 |                 end = min(max(0, end+end_ost*16), len(data))
130 |                 start_end_info += ", from {} to {}".format(start/16000, end/16000)  # report in seconds, like the first period
131 |                 res_audio = np.concatenate([res_audio, data[start:end]], -1)  # start/end already include the offsets
132 |                 srt_clip, _, srt_index = generate_srt_clip(sentences, start/16000.0, end/16000.0, begin_index=srt_index-1)
133 |                 clip_srt += srt_clip
134 |         if len(ts):
135 |             message = "{} periods found in the speech: ".format(len(ts)) + start_end_info + log_append
136 |         else:
137 |             message = "No period found in the speech, return raw speech. You may check the recognition result and try other destination text."
138 | res_audio = data 139 | return (sr, res_audio), message, clip_srt 140 | 141 | def video_recog(self, video_filename, sd_switch='no', hotwords="", output_dir=None): 142 | video = mpy.VideoFileClip(video_filename) 143 | # Extract the base name, add '_clip.mp4', and 'wav' 144 | if output_dir is not None: 145 | os.makedirs(output_dir, exist_ok=True) 146 | _, base_name = os.path.split(video_filename) 147 | base_name, _ = os.path.splitext(base_name) 148 | clip_video_file = base_name + '_clip.mp4' 149 | audio_file = base_name + '.wav' 150 | audio_file = os.path.join(output_dir, audio_file) 151 | else: 152 | base_name, _ = os.path.splitext(video_filename) 153 | clip_video_file = base_name + '_clip.mp4' 154 | audio_file = base_name + '.wav' 155 | 156 | if video.audio is None: 157 | logging.error("No audio information found.") 158 | sys.exit(1) 159 | 160 | video.audio.write_audiofile(audio_file) 161 | wav = librosa.load(audio_file, sr=16000)[0] 162 | # delete the audio file after processing 163 | if os.path.exists(audio_file): 164 | os.remove(audio_file) 165 | state = { 166 | 'video_filename': video_filename, 167 | 'clip_video_file': clip_video_file, 168 | 'video': video, 169 | } 170 | # res_text, res_srt = self.recog((16000, wav), state) 171 | return self.recog((16000, wav), sd_switch, state, hotwords, output_dir) 172 | 173 | def video_clip(self, 174 | dest_text, 175 | start_ost, 176 | end_ost, 177 | state, 178 | font_size=32, 179 | font_color='white', 180 | add_sub=False, 181 | dest_spk=None, 182 | output_dir=None, 183 | timestamp_list=None): 184 | # get from state 185 | recog_res_raw = state['recog_res_raw'] 186 | timestamp = state['timestamp'] 187 | sentences = state['sentences'] 188 | video = state['video'] 189 | clip_video_file = state['clip_video_file'] 190 | video_filename = state['video_filename'] 191 | 192 | if timestamp_list is None: 193 | all_ts = [] 194 | if dest_spk is None or dest_spk == '' or 'sd_sentences' not in state: 195 | for _dest_text in dest_text.split('#'): 196 | if '[' in _dest_text: 197 | match = re.search(r'\[(\d+),\s*(\d+)\]', _dest_text) 198 | if match: 199 | offset_b, offset_e = map(int, match.groups()) 200 | log_append = "" 201 | else: 202 | offset_b, offset_e = 0, 0 203 | log_append = "(Bracket detected in dest_text but offset time matching failed)" 204 | _dest_text = _dest_text[:_dest_text.find('[')] 205 | else: 206 | offset_b, offset_e = 0, 0 207 | log_append = "" 208 | # import pdb; pdb.set_trace() 209 | _dest_text = pre_proc(_dest_text) 210 | ts = proc(recog_res_raw, timestamp, _dest_text.lower()) 211 | for _ts in ts: all_ts.append([_ts[0]+offset_b*16, _ts[1]+offset_e*16]) 212 | if len(ts) > 1 and match: 213 | log_append += '(offsets detected but No.{} sub-sentence matched to {} periods in audio, \ 214 | offsets are applied to all periods)' 215 | else: 216 | for _dest_spk in dest_spk.split('#'): 217 | ts = proc_spk(_dest_spk, state['sd_sentences']) 218 | for _ts in ts: all_ts.append(_ts) 219 | else: # AI clip pass timestamp as input directly 220 | all_ts = [[i[0]*16.0, i[1]*16.0] for i in timestamp_list] 221 | 222 | srt_index = 0 223 | time_acc_ost = 0.0 224 | ts = all_ts 225 | # ts.sort() 226 | clip_srt = "" 227 | if len(ts): 228 | if self.lang == 'en' and isinstance(sentences, str): 229 | sentences = sentences.split() 230 | start, end = ts[0][0] / 16000, ts[0][1] / 16000 231 | srt_clip, subs, srt_index = generate_srt_clip(sentences, start, end, begin_index=srt_index, time_acc_ost=time_acc_ost) 232 | start, end = start+start_ost/1000.0, end+end_ost/1000.0 
233 | video_clip = video.subclip(start, end) 234 | start_end_info = "from {} to {}".format(start, end) 235 | clip_srt += srt_clip 236 | if add_sub: 237 | generator = lambda txt: TextClip(txt, font='./font/STHeitiMedium.ttc', fontsize=font_size, color=font_color) 238 | subtitles = SubtitlesClip(subs, generator) 239 | video_clip = CompositeVideoClip([video_clip, subtitles.set_pos(('center','bottom'))]) 240 | concate_clip = [video_clip] 241 | time_acc_ost += end+end_ost/1000.0 - (start+start_ost/1000.0) 242 | for _ts in ts[1:]: 243 | start, end = _ts[0] / 16000, _ts[1] / 16000 244 | srt_clip, subs, srt_index = generate_srt_clip(sentences, start, end, begin_index=srt_index-1, time_acc_ost=time_acc_ost) 245 | if not len(subs): 246 | continue 247 | chi_subs = [] 248 | sub_starts = subs[0][0][0] 249 | for sub in subs: 250 | chi_subs.append(((sub[0][0]-sub_starts, sub[0][1]-sub_starts), sub[1])) 251 | start, end = start+start_ost/1000.0, end+end_ost/1000.0 252 | _video_clip = video.subclip(start, end) 253 | start_end_info += ", from {} to {}".format(str(start)[:5], str(end)[:5]) 254 | clip_srt += srt_clip 255 | if add_sub: 256 | generator = lambda txt: TextClip(txt, font='./font/STHeitiMedium.ttc', fontsize=font_size, color=font_color) 257 | subtitles = SubtitlesClip(chi_subs, generator) 258 | _video_clip = CompositeVideoClip([_video_clip, subtitles.set_pos(('center','bottom'))]) 259 | # _video_clip.write_videofile("debug.mp4", audio_codec="aac") 260 | concate_clip.append(copy.copy(_video_clip)) 261 | time_acc_ost += end+end_ost/1000.0 - (start+start_ost/1000.0) 262 | message = "{} periods found in the audio: ".format(len(ts)) + start_end_info 263 | logging.warning("Concating...") 264 | if len(concate_clip) > 1: 265 | video_clip = concatenate_videoclips(concate_clip) 266 | # clip_video_file = clip_video_file[:-4] + '_no{}.mp4'.format(self.GLOBAL_COUNT) 267 | if output_dir is not None: 268 | os.makedirs(output_dir, exist_ok=True) 269 | _, file_with_extension = os.path.split(clip_video_file) 270 | clip_video_file_name, _ = os.path.splitext(file_with_extension) 271 | print(output_dir, clip_video_file) 272 | clip_video_file = os.path.join(output_dir, "{}_no{}.mp4".format(clip_video_file_name, self.GLOBAL_COUNT)) 273 | temp_audio_file = os.path.join(output_dir, "{}_tempaudio_no{}.mp4".format(clip_video_file_name, self.GLOBAL_COUNT)) 274 | else: 275 | clip_video_file = clip_video_file[:-4] + '_no{}.mp4'.format(self.GLOBAL_COUNT) 276 | temp_audio_file = clip_video_file[:-4] + '_tempaudio_no{}.mp4'.format(self.GLOBAL_COUNT) 277 | video_clip.write_videofile(clip_video_file, audio_codec="aac", temp_audiofile=temp_audio_file) 278 | self.GLOBAL_COUNT += 1 279 | else: 280 | clip_video_file = video_filename 281 | message = "No period found in the audio, return raw speech. You may check the recognition result and try other destination text." 
282 |             srt_clip = ''
283 |         return clip_video_file, message, clip_srt
284 |
285 |
286 | def get_parser():
287 |     parser = ArgumentParser(
288 |         description="ClipVideo Argument",
289 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter,
290 |     )
291 |     parser.add_argument(
292 |         "--stage",
293 |         type=int,
294 |         choices=(1, 2),
295 |         help="Stage, 1 for recognizing and 2 for clipping",
296 |         required=True
297 |     )
298 |     parser.add_argument(
299 |         "--file",
300 |         type=str,
301 |         default=None,
302 |         help="Input file path",
303 |         required=True
304 |     )
305 |     parser.add_argument(
306 |         "--sd_switch",
307 |         type=str,
308 |         choices=("no", "yes"),
309 |         default="no",
310 |         help="Turn on the speaker diarization or not",
311 |     )
312 |     parser.add_argument(
313 |         "--output_dir",
314 |         type=str,
315 |         default='./output',
316 |         help="Output files path",
317 |     )
318 |     parser.add_argument(
319 |         "--dest_text",
320 |         type=str,
321 |         default=None,
322 |         help="Destination text string for clipping",
323 |     )
324 |     parser.add_argument(
325 |         "--dest_spk",
326 |         type=str,
327 |         default=None,
328 |         help="Destination spk id for clipping",
329 |     )
330 |     parser.add_argument(
331 |         "--start_ost",
332 |         type=int,
333 |         default=0,
334 |         help="Offset time in ms at beginning for clipping"
335 |     )
336 |     parser.add_argument(
337 |         "--end_ost",
338 |         type=int,
339 |         default=0,
340 |         help="Offset time in ms at ending for clipping"
341 |     )
342 |     parser.add_argument(
343 |         "--output_file",
344 |         type=str,
345 |         default=None,
346 |         help="Output file path"
347 |     )
348 |     parser.add_argument(
349 |         "--lang",
350 |         type=str,
351 |         default='zh',
352 |         help="Language of the input audio, 'zh' or 'en'"
353 |     )
354 |     return parser
355 |
356 |
357 | def runner(stage, file, sd_switch, output_dir, dest_text, dest_spk, start_ost, end_ost, output_file, config=None, lang='zh'):
358 |     audio_suffixs = ['.wav','.mp3','.aac','.m4a','.flac']
359 |     video_suffixs = ['.mp4','.avi','.mkv','.flv','.mov','.webm','.ts','.mpeg']
360 |     _,ext = os.path.splitext(file)
361 |     if ext.lower() in audio_suffixs:
362 |         mode = 'audio'
363 |     elif ext.lower() in video_suffixs:
364 |         mode = 'video'
365 |     else:
366 |         logging.error("Unsupported file format: {}\n\nplease choose one of the following: {}".format(file, audio_suffixs+video_suffixs))
367 |         sys.exit(1) # exit if the file is not supported
368 |     while output_dir.endswith('/'):
369 |         output_dir = output_dir[:-1]
370 |     if not os.path.exists(output_dir):
371 |         os.mkdir(output_dir)
372 |     if stage == 1:
373 |         from funasr import AutoModel
374 |         # initialize funasr automodel
375 |         logging.warning("Initializing modelscope asr pipeline.")
376 |         if lang == 'zh':
377 |             funasr_model = AutoModel(model="iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
378 |                                      vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
379 |                                      punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
380 |                                      spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
381 |                                      )
382 |             audio_clipper = VideoClipper(funasr_model)
383 |             audio_clipper.lang = 'zh'
384 |         elif lang == 'en':
385 |             funasr_model = AutoModel(model="iic/speech_paraformer_asr-en-16k-vocab4199-pytorch",
386 |                                      vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
387 |                                      punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
388 |                                      spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
389 |                                      )
390 |             audio_clipper = VideoClipper(funasr_model)
391 |             audio_clipper.lang = 'en'
392 |         if mode == 'audio':
393 |             logging.warning("Recognizing audio file: {}".format(file))
394 |             wav, sr = librosa.load(file, sr=16000)
395 |             res_text, res_srt, state = audio_clipper.recog((sr, wav), sd_switch)
396 |         if mode == 'video':
397 |             logging.warning("Recognizing video file: {}".format(file))
398 |             res_text, res_srt, state = audio_clipper.video_recog(file, sd_switch)
399 |         total_srt_file = output_dir + '/total.srt'
400 |         with open(total_srt_file, 'w') as fout:
401 |             fout.write(res_srt)
402 |         logging.warning("Write total subtitle to {}".format(total_srt_file))
403 |         write_state(output_dir, state)
404 |         logging.warning("Recognition succeeded. You can copy the text segments from below and use them in stage 2.")
405 |         print(res_text)
406 |     if stage == 2:
407 |         audio_clipper = VideoClipper(None)
408 |         if mode == 'audio':
409 |             state = load_state(output_dir)
410 |             wav, sr = librosa.load(file, sr=16000)
411 |             state['audio_input'] = (sr, wav)
412 |             (sr, audio), message, srt_clip = audio_clipper.clip(dest_text, start_ost, end_ost, state, dest_spk=dest_spk)
413 |             if output_file is None:
414 |                 output_file = output_dir + '/result.wav'
415 |             clip_srt_file = output_file[:-3] + 'srt'
416 |             logging.warning(message)
417 |             sf.write(output_file, audio, 16000)
418 |             assert output_file.endswith('.wav'), "output_file must end with '.wav'"
419 |             logging.warning("Save clipped wav file to {}".format(output_file))
420 |             with open(clip_srt_file, 'w') as fout:
421 |                 fout.write(srt_clip)
422 |             logging.warning("Write clipped subtitle to {}".format(clip_srt_file))
423 |         if mode == 'video':
424 |             state = load_state(output_dir)
425 |             state['video_filename'] = file
426 |             if output_file is None:
427 |                 state['clip_video_file'] = file[:-4] + '_clip.mp4'
428 |             else:
429 |                 state['clip_video_file'] = output_file
430 |             clip_srt_file = state['clip_video_file'][:-3] + 'srt'
431 |             state['video'] = mpy.VideoFileClip(file)
432 |             clip_video_file, message, srt_clip = audio_clipper.video_clip(dest_text, start_ost, end_ost, state, dest_spk=dest_spk)
433 |             logging.warning("Clipping Log: {}".format(message))
434 |             logging.warning("Save clipped mp4 file to {}".format(clip_video_file))
435 |             with open(clip_srt_file, 'w') as fout:
436 |                 fout.write(srt_clip)
437 |             logging.warning("Write clipped subtitle to {}".format(clip_srt_file))
438 |
439 |
440 | def main(cmd=None):
441 |     print(get_commandline_args(), file=sys.stderr)
442 |     parser = get_parser()
443 |     args = parser.parse_args(cmd)
444 |     kwargs = vars(args)
445 |     runner(**kwargs)
446 |
447 |
448 | if __name__ == '__main__':
449 |     main()
450 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | librosa
2 | soundfile
3 | scikit-learn>=1.3.2
4 | funasr>=1.1.2
5 | moviepy==1.0.3
6 | numpy==1.26.4
7 | gradio
8 | modelscope
9 | torch>=1.13
10 | torchaudio
11 | openai
12 | g4f
13 | dashscope
14 | curl_cffi
15 |
--------------------------------------------------------------------------------