├── .gitignore ├── LICENSE ├── README.md ├── README_zh.md ├── docs └── images │ ├── LLM_guide.png │ ├── demo.png │ ├── demo_en.png │ ├── dingding.png │ ├── guide.jpg │ ├── interface.jpg │ └── wechat.png ├── font └── STHeitiMedium.ttc ├── funclip ├── __init__.py ├── introduction.py ├── launch.py ├── llm │ ├── demo_prompt.py │ ├── g4f_openai_api.py │ ├── openai_api.py │ └── qwen_api.py ├── test │ ├── imagemagick_test.py │ └── test.sh ├── utils │ ├── argparse_tools.py │ ├── subtitle_utils.py │ ├── theme.json │ └── trans_utils.py └── videoclipper.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | .DS_Store 3 | *.DS_Store 4 | ClipVideo/clipvideo/output 5 | *__pycache__ 6 | *.spec -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Alibaba 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![SVG Banners](https://svg-banners.vercel.app/api?type=rainbow&text1=FunClip%20%20🥒&width=800&height=210)](https://github.com/Akshay090/svg-banners) 2 | 3 | ###

「[简体中文](./README_zh.md) | English」

4 | 5 | **

⚡ Open-source, accurate and easy-to-use video clipping tool

** 6 | **

🧠 Explore LLM based video clipping with FunClip

** 7 | 8 |

9 | 10 |

11 | alibaba-damo-academy%2FFunClip | Trendshift 12 |

13 | 14 |
15 |

16 | What's New 17 | | On Going 18 | | Install 19 | | Usage 20 | | Community 21 |

22 |
23 |
24 | **FunClip** is a fully open-source, locally deployed, automated video clipping tool. It leverages Alibaba TONGYI speech lab's open-source [FunASR](https://github.com/alibaba-damo-academy/FunASR) Paraformer series models to perform speech recognition on videos. Users can then freely choose text segments or speakers from the recognition results and click the clip button to obtain the video clips corresponding to the selected segments (Quick Experience [Modelscope⭐](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary) [HuggingFace🤗](https://huggingface.co/spaces/R1ckShi/FunClip)).
25 |
26 | ## Highlights🎨
27 |
28 | - 🔥Try AI clipping using LLMs in FunClip now.
29 | - FunClip integrates Alibaba's open-source industrial-grade model [Paraformer-Large](https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which is one of the best-performing open-source Chinese ASR models available, with over 13 million downloads on Modelscope. It can also accurately predict timestamps in an integrated manner.
30 | - FunClip incorporates the hotword customization feature of [SeACo-Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), allowing users to specify certain entity words, names, etc., as hotwords during the ASR process to enhance recognition results.
31 | - FunClip integrates the [CAM++](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) speaker recognition model, enabling users to use the auto-recognized speaker ID as the clipping target, so segments from a specific speaker can be clipped out.
32 | - The functionalities are realized through Gradio interaction, offering simple installation and ease of use. FunClip can also be deployed on a server and accessed via a browser.
33 | - FunClip supports multi-segment free clipping and automatically returns the full-video SRT subtitles and the target-segment SRT subtitles, offering a simple and convenient user experience.
34 |
35 |
36 | ## What's New🚀
37 | - 2024/06/12 FunClip now supports recognizing and clipping English audio files. Run `python funclip/launch.py -l en` to try it.
38 | - 🔥2024/05/13 FunClip v2.0.0 now supports smart clipping with large language models, integrating models from the qwen series, GPT series, etc., and providing default prompts. You can also explore and share tips for setting prompts. The usage is as follows:
39 |     1. After the recognition, select the name of the large language model and configure your own apikey;
40 |     2. Click on the 'LLM Inference' button, and FunClip will automatically combine the two prompts with the video's SRT subtitles;
41 |     3. Click on the 'AI Clip' button, and based on the output of the large language model from the previous step, FunClip will extract the timestamps for clipping (an illustrative LLM output is shown after this What's New list);
42 |     4. You can try changing the prompt to leverage the capabilities of large language models to get the results you want;
43 | - 2024/05/09 FunClip updated to v1.1.0, including the following updates and fixes:
44 |     - Support configuring the output file directory, saving ASR intermediate results and video clipping intermediate files;
45 |     - UI upgrade (see the guide picture below): video and audio clipping functions are now on the same page, and button positions were adjusted;
46 |     - Fixed a bug introduced by a FunASR interface upgrade, which had caused some serious clipping errors;
47 |     - Support configuring different start and end time offsets for each paragraph;
48 |     - Code updates, etc.;
49 | - 2024/03/06 Fixed bugs in using FunClip from the command line.
50 | - 2024/02/28 [FunASR](https://github.com/alibaba-damo-academy/FunASR) was updated to version 1.0; FunClip now uses FunASR 1.0 and SeACo-Paraformer to conduct ASR with hotword customization.
51 | - 2023/10/17 Fixed a bug in choosing multiple periods, which used to return a video with the wrong length.
52 | - 2023/10/10 FunClipper now supports recognition with speaker diarization: choose the 'Yes' button under 'Recognize Speakers' and you will get recognition results with a speaker ID for each sentence. You can then clip out the periods of one or more speakers (e.g. 'spk0' or 'spk0#spk3') using FunClipper.
53 |
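For reference, the default prompt in `funclip/launch.py` asks the model to answer in the format `1. [start time-end time] text`, and the 'AI Clip' step then parses those timestamps. The block below is an illustrative (not real) LLM output; the subtitle lines are taken from the sample SRT in `funclip/llm/demo_prompt.py`, and the actual result depends on your video and prompt:

```text
1. [00:00:04,670-00:00:11,930] 今天要和您分享的这篇文章是人民日报,为什么要多读书?这是我听过最好的答案
2. [00:01:39,830-00:01:46,595] 你无法到达的地方文字在你过去,你无法经历的人生舒淇,带你相遇
```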
54 |
55 | ## On Going🌵
56 |
57 | - [x] FunClip will support the Whisper model for English users, coming soon (ASR using Whisper with timestamps requires massive GPU memory, so we support timestamp prediction for vanilla Paraformer in FunASR to achieve this).
58 | - [x] FunClip will further explore the abilities of large language model based AI clipping; discussions about prompt setting, clipping, etc. are welcome.
59 | - [ ] Reverse period choosing while clipping.
60 | - [ ] Removing silence periods.
61 |
62 |
63 | ## Install🔨
64 |
65 | ### Python env install
66 |
67 | FunClip's basic functions rely only on a Python environment.
68 | ```shell
69 | # clone funclip repo
70 | git clone https://github.com/alibaba-damo-academy/FunClip.git
71 | cd FunClip
72 | # install Python requirements
73 | pip install -r ./requirements.txt
74 | ```
75 |
76 | ### imagemagick install (Optional)
77 |
78 | If you want to clip video files with embedded subtitles:
79 |
80 | 1. ffmpeg and imagemagick are required
81 |
82 |     - On Ubuntu
83 |     ```shell
84 |     apt-get -y update && apt-get -y install ffmpeg imagemagick
85 |     sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml
86 |     ```
87 |     - On MacOS
88 |     ```shell
89 |     brew install imagemagick
90 |     sed -i 's/none/read,write/g' /usr/local/Cellar/imagemagick/7.1.1-8_1/etc/ImageMagick-7/policy.xml
91 |     ```
92 |     - On Windows
93 |
94 |     Download and install imagemagick from https://imagemagick.org/script/download.php#windows
95 |
96 |     Find your Python install path and change `IMAGEMAGICK_BINARY` to your imagemagick install path in the file `site-packages\moviepy\config_defaults.py`
97 |
98 | 2. Download the font file to funclip/font
99 |
100 | ```shell
101 | wget https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/STHeitiMedium.ttc -O font/STHeitiMedium.ttc
102 | ```
103 |
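(Optional) To quickly verify that moviepy can render subtitles through ImageMagick with the downloaded font, a minimal check along the lines of `funclip/test/imagemagick_test.py` can be used; the output filename here is just an example:

```python
# minimal ImageMagick + moviepy check, run from the FunClip repo root
from moviepy.editor import TextClip

clip = TextClip("FunClip 字幕测试", font="./font/STHeitiMedium.ttc", fontsize=48, color="white")
clip.save_frame("imagemagick_check.png")  # succeeds only if ImageMagick is configured correctly
```

If this raises an ImageMagick-related error, revisit the policy.xml edit or the `IMAGEMAGICK_BINARY` setting above.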
104 | ## Use FunClip
105 |
106 | ### A. Use FunClip as a local Gradio service
107 | You can establish your own FunClip service, which is the same as the [Modelscope Space](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary), as follows:
108 | ```shell
109 | python funclip/launch.py
110 | # '-l en' for English audio recognition
111 | # '-p xxx' for setting the port number
112 | # '-s' for establishing a share link for public access
113 | ```
114 | Then visit ```localhost:7860``` and you will get a Gradio service like the one below; you can use FunClip following these steps:
115 |
116 | - Step1: Upload your video file (or try the example videos below)
117 | - Step2: Copy the text segments you need to 'Text to Clip'
118 | - Step3: Adjust the subtitle settings (if needed)
119 | - Step4: Click 'Clip' or 'Clip and Generate Subtitles'
120 |
121 |
122 |
123 | Follow the guide below to explore LLM based clipping:
124 |
125 |
126 |
127 | ### B. Experience FunClip in Modelscope
128 |
129 | [FunClip@Modelscope Space⭐](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary)
130 |
131 | [FunClip@HuggingFace Space🤗](https://huggingface.co/spaces/R1ckShi/FunClip)
132 |
133 | ### C. Use FunClip from the command line
134 |
135 | FunClip also supports recognition and clipping via the command line:
136 | ```shell
137 | # step1: Recognize
138 | python funclip/videoclipper.py --stage 1 \
139 |        --file examples/2022云栖大会_片段.mp4 \
140 |        --output_dir ./output
141 | # now you can find the recognition results and the entire SRT file in ./output/
142 | # step2: Clip
143 | python funclip/videoclipper.py --stage 2 \
144 |        --file examples/2022云栖大会_片段.mp4 \
145 |        --output_dir ./output \
146 |        --dest_text '我们把它跟乡村振兴去结合起来,利用我们的设计的能力' \
147 |        --start_ost 0 \
148 |        --end_ost 100 \
149 |        --output_file './output/res.mp4'
150 | ```
151 |
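The same two stages can also be driven from Python. The sketch below mirrors how `funclip/launch.py` calls the `VideoClipper` class; treat the argument values as examples and check `funclip/videoclipper.py` for the authoritative signatures:

```python
# rough sketch of programmatic use, following the calls made in funclip/launch.py
from funasr import AutoModel
from videoclipper import VideoClipper  # run from inside the funclip/ directory

funasr_model = AutoModel(
    model="iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
)
clipper = VideoClipper(funasr_model)
clipper.lang = "zh"

# stage 1: recognition returns the text, the SRT string and a state object reused for clipping
res_text, res_srt, state = clipper.video_recog(
    "examples/2022云栖大会_片段.mp4", "No", "", output_dir="./output")

# stage 2: clip the sentences matching dest_text (multiple segments can be joined with '#')
clip_file, message, clip_srt = clipper.video_clip(
    "我们把它跟乡村振兴去结合起来,利用我们的设计的能力", 0, 100, state, output_dir="./output")
```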
152 |
153 | ## Community Communication🍟
154 |
155 | FunClip was first open-sourced by the FunASR team; any useful PR is welcome.
156 |
157 | You can also scan the following DingTalk group or WeChat group QR code to join the community group for communication.
158 |
159 | | DingTalk group | WeChat group |
160 | |:-------------------------------------------------------------------:|:-----------------------------------------------------:|
161 | | |
| 162 | 163 | ## Find Speech Models in FunASR 164 | 165 | [FunASR](https://github.com/alibaba-damo-academy/FunASR) hopes to build a bridge between academic research and industrial applications on speech recognition. By supporting the training & finetuning of the industrial-grade speech recognition model released on ModelScope, researchers and developers can conduct research and production of speech recognition models more conveniently, and promote the development of speech recognition ecology. ASR for Fun! 166 | 167 | 📚FunASR Paper: 168 | 169 | 📚SeACo-Paraformer Paper: 170 | 171 | 🌟Support FunASR: 172 | -------------------------------------------------------------------------------- /README_zh.md: -------------------------------------------------------------------------------- 1 | [![SVG Banners](https://svg-banners.vercel.app/api?type=rainbow&text1=FunClip%20%20🥒&width=800&height=210)](https://github.com/Akshay090/svg-banners) 2 | 3 | ###

「简体中文 | [English](./README.md)」

4 | 5 | **

⚡ 开源、精准、方便的视频切片工具

** 6 | **

🧠 通过FunClip探索基于大语言模型的视频剪辑

** 7 | 8 |

9 | 10 |

11 | alibaba-damo-academy%2FFunClip | Trendshift 12 |

13 | 14 |
15 |

近期更新 16 | | 施工中 17 | | 安装环境 18 | | 使用方法 19 | | 社区交流 20 |

21 |
22 | 23 | **FunClip**是一款完全开源、本地部署的自动化视频剪辑工具,通过调用阿里巴巴通义实验室开源的[FunASR](https://github.com/alibaba-damo-academy/FunASR) Paraformer系列模型进行视频的语音识别,随后用户可以自由选择识别结果中的文本片段或说话人,点击裁剪按钮即可获取对应片段的视频(快速体验 [Modelscope⭐](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary) [HuggingFace🤗](https://huggingface.co/spaces/R1ckShi/FunClip))。 24 | 25 | ## 热点&特性🎨 26 | 27 | - 🔥FunClip集成了多种大语言模型调用方式并提供了prompt配置接口,尝试通过大语言模型进行视频裁剪~ 28 | - FunClip集成了阿里巴巴开源的工业级模型[Paraformer-Large](https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary),是当前识别效果最优的开源中文ASR模型之一,Modelscope下载量1300w+次,并且能够一体化的准确预测时间戳。 29 | - FunClip集成了[SeACo-Paraformer](https://modelscope.cn/models/iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)的热词定制化功能,在ASR过程中可以指定一些实体词、人名等作为热词,提升识别效果。 30 | - FunClip集成了[CAM++](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary)说话人识别模型,用户可以将自动识别出的说话人ID作为裁剪目标,将某一说话人的段落裁剪出来。 31 | - 通过Gradio交互实现上述功能,安装简单使用方便,并且可以在服务端搭建服务通过浏览器使用。 32 | - FunClip支持多段自由剪辑,并且会自动返回全视频SRT字幕、目标段落SRT字幕,使用简单方便。 33 | 34 | 欢迎体验使用,欢迎提出关于字幕生成或语音识别的需求与宝贵建议~ 35 | 36 | 37 | ## 近期更新🚀 38 | 39 | - 2024/06/12 FunClip现在支持识别与裁剪英文视频,通过`python funclip/launch.py -l en`来启动英文版本服务。 40 | - 🔥2024/05/13 FunClip v2.0.0加入大语言模型智能裁剪功能,集成qwen系列,gpt系列等模型,提供默认prompt,您也可以探索并分享prompt的设置技巧,使用方法如下: 41 | 1. 在进行识别之后,选择大模型名称,配置你自己的apikey; 42 | 2. 点击'LLM智能段落选择'按钮,FunClip将自动组合两个prompt与视频的srt字幕; 43 | 3. 点击'LLM智能裁剪'按钮,基于前一步的大语言模型输出结果,FunClip将提取其中的时间戳进行裁剪; 44 | 4. 您可以尝试改变prompt来借助大语言模型的能力来获取您想要的结果; 45 | - 2024/05/09 FunClip更新至v1.1.0,包含如下更新与修复: 46 | - 支持配置输出文件目录,保存ASR中间结果与视频裁剪中间文件; 47 | - UI升级(见下方演示图例),视频与音频裁剪功能在同一页,按钮位置调整; 48 | - 修复了由于FunASR接口升级引入的bug,该bug曾导致一些严重的剪辑错误; 49 | - 支持为每一个段落配置不同的起止时间偏移; 50 | - 代码优化等; 51 | - 2024/03/06 命令行调用方式更新与问题修复,相关功能可以正常使用。 52 | - 2024/02/28 FunClip升级到FunASR1.0模型调用方式,通过FunASR开源的SeACo-Paraformer模型在视频剪辑中进一步支持热词定制化功能。 53 | - 2024/02/28 原FunASR-APP/ClipVideo更名为FunClip。 54 | 55 | 56 | ## 施工中🌵 57 | 58 | - [x] FunClip将会集成Whisper模型,以提供英文视频剪辑能力(Whisper模型的时间戳预测功能需要显存较大,我们在FunASR中添加了Paraformer英文模型的时间戳预测支持以允许FunClip支持英文识别裁剪)。 59 | - [x] 集成大语言模型的能力,提供智能视频剪辑相关功能。大家可以基于FunClip探索使用大语言模型的视频剪辑~ 60 | - [ ] 给定文本段落,反向选取其他段落。 61 | - [ ] 删除视频中无人说话的片段。 62 | 63 | 64 | ## 安装🔨 65 | 66 | ### Python环境安装 67 | 68 | FunClip的运行仅依赖于一个Python环境,若您是一个小白开发者,可以先了解下如何使用Python,pip等~ 69 | ```shell 70 | # 克隆funclip仓库 71 | git clone https://github.com/alibaba-damo-academy/FunClip.git 72 | cd FunClip 73 | # 安装相关Python依赖 74 | pip install -r ./requirements.txt 75 | ``` 76 | 77 | ### 安装imagemagick(可选) 78 | 79 | 1. 如果你希望使用自动生成字幕的视频裁剪功能,需要安装imagemagick 80 | 81 | - Ubuntu 82 | ```shell 83 | apt-get -y update && apt-get -y install ffmpeg imagemagick 84 | sed -i 's/none/read,write/g' /etc/ImageMagick-6/policy.xml 85 | ``` 86 | - MacOS 87 | ```shell 88 | brew install imagemagick 89 | sed -i 's/none/read,write/g' /usr/local/Cellar/imagemagick/7.1.1-8_1/etc/ImageMagick-7/policy.xml 90 | ``` 91 | - Windows 92 | 93 | 首先下载并安装imagemagick https://imagemagick.org/script/download.php#windows 94 | 95 | 然后确定您的Python安装位置,在其中的`site-packages\moviepy\config_defaults.py`文件中修改`IMAGEMAGICK_BINARY`为imagemagick的exe路径 96 | 97 | 2. 
下载你需要的字体文件,这里我们提供一个默认的黑体字体文件 98 | 99 | ```shell 100 | wget https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/STHeitiMedium.ttc -O font/STHeitiMedium.ttc 101 | ``` 102 | 103 | 104 | ## 使用FunClip 105 | 106 | ### A.在本地启动Gradio服务 107 | 108 | ```shell 109 | python funclip/launch.py 110 | # '-l en' for English audio recognize 111 | # '-p xxx' for setting port number 112 | # '-s True' for establishing service for public accessing 113 | ``` 114 | 随后在浏览器中访问```localhost:7860```即可看到如下图所示的界面,按如下步骤即可进行视频剪辑 115 | 1. 上传你的视频(或使用下方的视频用例) 116 | 2. (可选)设置热词,设置文件输出路径(保存识别结果、视频等) 117 | 3. 点击识别按钮获取识别结果,或点击识别+区分说话人在语音识别基础上识别说话人ID 118 | 4. 将识别结果中的选段复制到对应位置,或者将说话人ID输入到对应为止 119 | 5. (可选)配置剪辑参数,偏移量与字幕设置等 120 | 6. 点击“裁剪”或“裁剪+字幕”按钮 121 | 122 | 123 | 124 | 使用大语言模型裁剪请参考如下教程 125 | 126 | 127 | 128 | ### B.通过命令行调用使用FunClip的相关功能 129 | ```shell 130 | # 步骤一:识别 131 | python funclip/videoclipper.py --stage 1 \ 132 | --file examples/2022云栖大会_片段.mp4 \ 133 | --output_dir ./output 134 | # ./output中生成了识别结果与srt字幕等 135 | # 步骤二:裁剪 136 | python funclip/videoclipper.py --stage 2 \ 137 | --file examples/2022云栖大会_片段.mp4 \ 138 | --output_dir ./output \ 139 | --dest_text '我们把它跟乡村振兴去结合起来,利用我们的设计的能力' \ 140 | --start_ost 0 \ 141 | --end_ost 100 \ 142 | --output_file './output/res.mp4' 143 | ``` 144 | 145 | ### C.通过创空间与Space体验FunClip 146 | 147 | [FunClip@Modelscope创空间⭐](https://modelscope.cn/studios/iic/funasr_app_clipvideo/summary) 148 | 149 | [FunClip@HuggingFace Space🤗](https://huggingface.co/spaces/R1ckShi/FunClip) 150 | 151 | 152 | 153 | ## 社区交流🍟 154 | 155 | FunClip开源项目由FunASR社区维护,欢迎加入社区,交流与讨论,以及合作开发等。 156 | 157 | | 钉钉群 | 微信群 | 158 | |:-------------------------------------------------------------------:|:-----------------------------------------------------:| 159 | |
|
| 160 | 161 | ## 通过FunASR了解语音识别相关技术 162 | 163 | [FunASR](https://github.com/alibaba-damo-academy/FunASR)是阿里巴巴通义实验室开源的端到端语音识别工具包,目前已经成为主流ASR工具包之一。其主要包括Python pipeline,SDK部署与海量开源工业ASR模型等。 164 | 165 | 📚FunASR论文: 166 | 167 | 📚SeACo-Paraformer论文: 168 | 169 | ⭐支持FunASR: 170 | -------------------------------------------------------------------------------- /docs/images/LLM_guide.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/LLM_guide.png -------------------------------------------------------------------------------- /docs/images/demo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/demo.png -------------------------------------------------------------------------------- /docs/images/demo_en.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/demo_en.png -------------------------------------------------------------------------------- /docs/images/dingding.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/dingding.png -------------------------------------------------------------------------------- /docs/images/guide.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/guide.jpg -------------------------------------------------------------------------------- /docs/images/interface.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/interface.jpg -------------------------------------------------------------------------------- /docs/images/wechat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/docs/images/wechat.png -------------------------------------------------------------------------------- /font/STHeitiMedium.ttc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/font/STHeitiMedium.ttc -------------------------------------------------------------------------------- /funclip/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/modelscope/FunClip/13e84ecd9b945606723ccbe5040ce80d9912d40a/funclip/__init__.py -------------------------------------------------------------------------------- /funclip/introduction.py: -------------------------------------------------------------------------------- 1 | top_md_1 = (""" 2 |
3 |
4 | FunClip: 5 | 🌟支持我们: 6 |
7 |
8 | 9 | 基于阿里巴巴通义实验室自研并开源的[FunASR](https://github.com/alibaba-damo-academy/FunASR)工具包及Paraformer系列模型及语音识别、端点检测、标点预测、时间戳预测、说话人区分、热词定制化开源链路 10 | 11 | 准确识别,自由复制所需段落,或者设置说话人标识,一键裁剪、添加字幕 12 | 13 | * Step1: 上传视频或音频文件(或使用下方的用例体验),点击 **识别** 按钮 14 | * Step2: 复制识别结果中所需的文字至右上方,或者右设置说话人标识,设置偏移与字幕配置(可选) 15 | * Step3: 点击 **裁剪** 按钮或 **裁剪并添加字幕** 按钮获得结果 16 | 17 | 🔥 FunClip现在集成了大语言模型智能剪辑功能,选择LLM模型进行体验吧~ 18 | """) 19 | 20 | top_md_3 = ("""访问FunASR项目与论文能够帮助您深入了解ParaClipper中所使用的语音处理相关模型: 21 |
22 |
23 | FunASR: 24 | FunASR Paper: 25 | 🌟Star FunASR: 26 |
27 |
28 | """) 29 | 30 | top_md_4 = ("""我们在「LLM智能裁剪」模块中提供三种LLM调用方式, 31 | 1. 选择阿里云百炼平台通过api调用qwen系列模型,此时需要您准备百炼平台的apikey,请访问[阿里云百炼](https://bailian.console.aliyun.com/#/home); 32 | 2. 选择GPT开头的模型即为调用openai官方api,此时需要您自备sk与网络环境; 33 | 3. [gpt4free](https://github.com/xtekky/gpt4free?tab=readme-ov-file)项目也被集成进FunClip,可以通过它免费调用gpt模型; 34 | 35 | 其中方式1与方式2需要在界面中传入相应的apikey 36 | 方式3而可能非常不稳定,返回时间可能很长或者结果获取失败,可以多多尝试或者自己准备sk使用方式1,2 37 | 38 | 不要同时打开同一端口的多个界面,会导致文件上传非常缓慢或卡死,关闭其他界面即可解决 39 | """) 40 | -------------------------------------------------------------------------------- /funclip/launch.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved. 4 | # MIT License (https://opensource.org/licenses/MIT) 5 | 6 | from http import server 7 | import os 8 | import logging 9 | import argparse 10 | import gradio as gr 11 | from funasr import AutoModel 12 | from videoclipper import VideoClipper 13 | from llm.openai_api import openai_call 14 | from llm.qwen_api import call_qwen_model 15 | from llm.g4f_openai_api import g4f_openai_call 16 | from utils.trans_utils import extract_timestamps 17 | from introduction import top_md_1, top_md_3, top_md_4 18 | 19 | 20 | if __name__ == "__main__": 21 | parser = argparse.ArgumentParser(description='argparse testing') 22 | parser.add_argument('--lang', '-l', type=str, default = "zh", help="language") 23 | parser.add_argument('--share', '-s', action='store_true', help="if to establish gradio share link") 24 | parser.add_argument('--port', '-p', type=int, default=7860, help='port number') 25 | parser.add_argument('--listen', action='store_true', help="if to listen to all hosts") 26 | args = parser.parse_args() 27 | 28 | if args.lang == 'zh': 29 | funasr_model = AutoModel(model="iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch", 30 | vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch", 31 | punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", 32 | spk_model="damo/speech_campplus_sv_zh-cn_16k-common", 33 | ) 34 | else: 35 | funasr_model = AutoModel(model="iic/speech_paraformer_asr-en-16k-vocab4199-pytorch", 36 | vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch", 37 | punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch", 38 | spk_model="damo/speech_campplus_sv_zh-cn_16k-common", 39 | ) 40 | audio_clipper = VideoClipper(funasr_model) 41 | audio_clipper.lang = args.lang 42 | 43 | server_name='127.0.0.1' 44 | if args.listen: 45 | server_name = '0.0.0.0' 46 | 47 | 48 | 49 | def audio_recog(audio_input, sd_switch, hotwords, output_dir): 50 | return audio_clipper.recog(audio_input, sd_switch, None, hotwords, output_dir=output_dir) 51 | 52 | def video_recog(video_input, sd_switch, hotwords, output_dir): 53 | return audio_clipper.video_recog(video_input, sd_switch, hotwords, output_dir=output_dir) 54 | 55 | def video_clip(dest_text, video_spk_input, start_ost, end_ost, state, output_dir): 56 | return audio_clipper.video_clip( 57 | dest_text, start_ost, end_ost, state, dest_spk=video_spk_input, output_dir=output_dir 58 | ) 59 | 60 | def mix_recog(video_input, audio_input, hotwords, output_dir): 61 | output_dir = output_dir.strip() 62 | if not len(output_dir): 63 | output_dir = None 64 | else: 65 | output_dir = os.path.abspath(output_dir) 66 | audio_state, video_state = None, None 67 | if video_input is not None: 68 | res_text, res_srt, 
video_state = video_recog( 69 | video_input, 'No', hotwords, output_dir=output_dir) 70 | return res_text, res_srt, video_state, None 71 | if audio_input is not None: 72 | res_text, res_srt, audio_state = audio_recog( 73 | audio_input, 'No', hotwords, output_dir=output_dir) 74 | return res_text, res_srt, None, audio_state 75 | 76 | def mix_recog_speaker(video_input, audio_input, hotwords, output_dir): 77 | output_dir = output_dir.strip() 78 | if not len(output_dir): 79 | output_dir = None 80 | else: 81 | output_dir = os.path.abspath(output_dir) 82 | audio_state, video_state = None, None 83 | if video_input is not None: 84 | res_text, res_srt, video_state = video_recog( 85 | video_input, 'Yes', hotwords, output_dir=output_dir) 86 | return res_text, res_srt, video_state, None 87 | if audio_input is not None: 88 | res_text, res_srt, audio_state = audio_recog( 89 | audio_input, 'Yes', hotwords, output_dir=output_dir) 90 | return res_text, res_srt, None, audio_state 91 | 92 | def mix_clip(dest_text, video_spk_input, start_ost, end_ost, video_state, audio_state, output_dir): 93 | output_dir = output_dir.strip() 94 | if not len(output_dir): 95 | output_dir = None 96 | else: 97 | output_dir = os.path.abspath(output_dir) 98 | if video_state is not None: 99 | clip_video_file, message, clip_srt = audio_clipper.video_clip( 100 | dest_text, start_ost, end_ost, video_state, dest_spk=video_spk_input, output_dir=output_dir) 101 | return clip_video_file, None, message, clip_srt 102 | if audio_state is not None: 103 | (sr, res_audio), message, clip_srt = audio_clipper.clip( 104 | dest_text, start_ost, end_ost, audio_state, dest_spk=video_spk_input, output_dir=output_dir) 105 | return None, (sr, res_audio), message, clip_srt 106 | 107 | def video_clip_addsub(dest_text, video_spk_input, start_ost, end_ost, state, output_dir, font_size, font_color): 108 | output_dir = output_dir.strip() 109 | if not len(output_dir): 110 | output_dir = None 111 | else: 112 | output_dir = os.path.abspath(output_dir) 113 | return audio_clipper.video_clip( 114 | dest_text, start_ost, end_ost, state, 115 | font_size=font_size, font_color=font_color, 116 | add_sub=True, dest_spk=video_spk_input, output_dir=output_dir 117 | ) 118 | 119 | def llm_inference(system_content, user_content, srt_text, model, apikey): 120 | SUPPORT_LLM_PREFIX = ['qwen', 'gpt', 'g4f', 'moonshot'] 121 | if model.startswith('qwen'): 122 | return call_qwen_model(apikey, model, user_content+'\n'+srt_text, system_content) 123 | if model.startswith('gpt') or model.startswith('moonshot'): 124 | return openai_call(apikey, model, system_content, user_content+'\n'+srt_text) 125 | elif model.startswith('g4f'): 126 | model = "-".join(model.split('-')[1:]) 127 | return g4f_openai_call(model, system_content, user_content+'\n'+srt_text) 128 | else: 129 | logging.error("LLM name error, only {} are supported as LLM name prefix." 
130 | .format(SUPPORT_LLM_PREFIX)) 131 | 132 | def AI_clip(LLM_res, dest_text, video_spk_input, start_ost, end_ost, video_state, audio_state, output_dir): 133 | timestamp_list = extract_timestamps(LLM_res) 134 | output_dir = output_dir.strip() 135 | if not len(output_dir): 136 | output_dir = None 137 | else: 138 | output_dir = os.path.abspath(output_dir) 139 | if video_state is not None: 140 | clip_video_file, message, clip_srt = audio_clipper.video_clip( 141 | dest_text, start_ost, end_ost, video_state, 142 | dest_spk=video_spk_input, output_dir=output_dir, timestamp_list=timestamp_list, add_sub=False) 143 | return clip_video_file, None, message, clip_srt 144 | if audio_state is not None: 145 | (sr, res_audio), message, clip_srt = audio_clipper.clip( 146 | dest_text, start_ost, end_ost, audio_state, 147 | dest_spk=video_spk_input, output_dir=output_dir, timestamp_list=timestamp_list, add_sub=False) 148 | return None, (sr, res_audio), message, clip_srt 149 | 150 | def AI_clip_subti(LLM_res, dest_text, video_spk_input, start_ost, end_ost, video_state, audio_state, output_dir): 151 | timestamp_list = extract_timestamps(LLM_res) 152 | output_dir = output_dir.strip() 153 | if not len(output_dir): 154 | output_dir = None 155 | else: 156 | output_dir = os.path.abspath(output_dir) 157 | if video_state is not None: 158 | clip_video_file, message, clip_srt = audio_clipper.video_clip( 159 | dest_text, start_ost, end_ost, video_state, 160 | dest_spk=video_spk_input, output_dir=output_dir, timestamp_list=timestamp_list, add_sub=True) 161 | return clip_video_file, None, message, clip_srt 162 | if audio_state is not None: 163 | (sr, res_audio), message, clip_srt = audio_clipper.clip( 164 | dest_text, start_ost, end_ost, audio_state, 165 | dest_spk=video_spk_input, output_dir=output_dir, timestamp_list=timestamp_list, add_sub=True) 166 | return None, (sr, res_audio), message, clip_srt 167 | 168 | # gradio interface 169 | theme = gr.Theme.load("funclip/utils/theme.json") 170 | with gr.Blocks(theme=theme) as funclip_service: 171 | gr.Markdown(top_md_1) 172 | # gr.Markdown(top_md_2) 173 | gr.Markdown(top_md_3) 174 | gr.Markdown(top_md_4) 175 | video_state, audio_state = gr.State(), gr.State() 176 | with gr.Row(): 177 | with gr.Column(): 178 | with gr.Row(): 179 | video_input = gr.Video(label="视频输入 | Video Input") 180 | audio_input = gr.Audio(label="音频输入 | Audio Input") 181 | with gr.Column(): 182 | gr.Examples(['https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/%E4%B8%BA%E4%BB%80%E4%B9%88%E8%A6%81%E5%A4%9A%E8%AF%BB%E4%B9%A6%EF%BC%9F%E8%BF%99%E6%98%AF%E6%88%91%E5%90%AC%E8%BF%87%E6%9C%80%E5%A5%BD%E7%9A%84%E7%AD%94%E6%A1%88-%E7%89%87%E6%AE%B5.mp4', 183 | 'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/2022%E4%BA%91%E6%A0%96%E5%A4%A7%E4%BC%9A_%E7%89%87%E6%AE%B52.mp4', 184 | 'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/%E4%BD%BF%E7%94%A8chatgpt_%E7%89%87%E6%AE%B5.mp4'], 185 | [video_input], 186 | label='示例视频 | Demo Video') 187 | gr.Examples(['https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/%E8%AE%BF%E8%B0%88.mp4'], 188 | [video_input], 189 | label='多说话人示例视频 | Multi-speaker Demo Video') 190 | gr.Examples(['https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ClipVideo/%E9%B2%81%E8%82%83%E9%87%87%E8%AE%BF%E7%89%87%E6%AE%B51.wav'], 191 | [audio_input], 192 | label="示例音频 | Demo Audio") 193 | with gr.Column(): 194 | # with gr.Row(): 195 | # video_sd_switch = gr.Radio(["No", "Yes"], label="👥区分说话人 Get Speakers", value='No') 196 | hotwords_input = 
gr.Textbox(label="🚒 热词 | Hotwords(可以为空,多个热词使用空格分隔,仅支持中文热词)") 197 | output_dir = gr.Textbox(label="📁 文件输出路径 | File Output Dir (可以为空,Linux, mac系统可以稳定使用)", value=" ") 198 | with gr.Row(): 199 | recog_button = gr.Button("👂 识别 | ASR", variant="primary") 200 | recog_button2 = gr.Button("👂👫 识别+区分说话人 | ASR+SD") 201 | video_text_output = gr.Textbox(label="✏️ 识别结果 | Recognition Result") 202 | video_srt_output = gr.Textbox(label="📖 SRT字幕内容 | RST Subtitles") 203 | with gr.Column(): 204 | with gr.Tab("🧠 LLM智能裁剪 | LLM Clipping"): 205 | with gr.Column(): 206 | prompt_head = gr.Textbox(label="Prompt System (按需更改,最好不要变动主体和要求)", value=("你是一个视频srt字幕分析剪辑器,输入视频的srt字幕," 207 | "分析其中的精彩且尽可能连续的片段并裁剪出来,输出四条以内的片段,将片段中在时间上连续的多个句子及它们的时间戳合并为一条," 208 | "注意确保文字与时间戳的正确匹配。输出需严格按照如下格式:1. [开始时间-结束时间] 文本,注意其中的连接符是“-”")) 209 | prompt_head2 = gr.Textbox(label="Prompt User(不需要修改,会自动拼接左下角的srt字幕)", value=("这是待裁剪的视频srt字幕:")) 210 | with gr.Column(): 211 | with gr.Row(): 212 | llm_model = gr.Dropdown( 213 | choices=["qwen-plus", 214 | "gpt-3.5-turbo", 215 | "gpt-3.5-turbo-0125", 216 | "gpt-4-turbo", 217 | "g4f-gpt-3.5-turbo"], 218 | value="qwen-plus", 219 | label="LLM Model Name", 220 | allow_custom_value=True) 221 | apikey_input = gr.Textbox(label="APIKEY") 222 | llm_button = gr.Button("LLM推理 | LLM Inference(首先进行识别,非g4f需配置对应apikey)", variant="primary") 223 | llm_result = gr.Textbox(label="LLM Clipper Result") 224 | with gr.Row(): 225 | llm_clip_button = gr.Button("🧠 LLM智能裁剪 | AI Clip", variant="primary") 226 | llm_clip_subti_button = gr.Button("🧠 LLM智能裁剪+字幕 | AI Clip+Subtitles") 227 | with gr.Tab("✂️ 根据文本/说话人裁剪 | Text/Speaker Clipping"): 228 | video_text_input = gr.Textbox(label="✏️ 待裁剪文本 | Text to Clip (多段文本使用'#'连接)") 229 | video_spk_input = gr.Textbox(label="✏️ 待裁剪说话人 | Speaker to Clip (多个说话人使用'#'连接)") 230 | with gr.Row(): 231 | clip_button = gr.Button("✂️ 裁剪 | Clip", variant="primary") 232 | clip_subti_button = gr.Button("✂️ 裁剪+字幕 | Clip+Subtitles") 233 | with gr.Row(): 234 | video_start_ost = gr.Slider(minimum=-500, maximum=1000, value=0, step=50, label="⏪ 开始位置偏移 | Start Offset (ms)") 235 | video_end_ost = gr.Slider(minimum=-500, maximum=1000, value=100, step=50, label="⏩ 结束位置偏移 | End Offset (ms)") 236 | with gr.Row(): 237 | font_size = gr.Slider(minimum=10, maximum=100, value=32, step=2, label="🔠 字幕字体大小 | Subtitle Font Size") 238 | font_color = gr.Radio(["black", "white", "green", "red"], label="🌈 字幕颜色 | Subtitle Color", value='white') 239 | # font = gr.Radio(["黑体", "Alibaba Sans"], label="字体 Font") 240 | video_output = gr.Video(label="裁剪结果 | Video Clipped") 241 | audio_output = gr.Audio(label="裁剪结果 | Audio Clipped") 242 | clip_message = gr.Textbox(label="⚠️ 裁剪信息 | Clipping Log") 243 | srt_clipped = gr.Textbox(label="📖 裁剪部分SRT字幕内容 | Clipped RST Subtitles") 244 | 245 | recog_button.click(mix_recog, 246 | inputs=[video_input, 247 | audio_input, 248 | hotwords_input, 249 | output_dir, 250 | ], 251 | outputs=[video_text_output, video_srt_output, video_state, audio_state]) 252 | recog_button2.click(mix_recog_speaker, 253 | inputs=[video_input, 254 | audio_input, 255 | hotwords_input, 256 | output_dir, 257 | ], 258 | outputs=[video_text_output, video_srt_output, video_state, audio_state]) 259 | clip_button.click(mix_clip, 260 | inputs=[video_text_input, 261 | video_spk_input, 262 | video_start_ost, 263 | video_end_ost, 264 | video_state, 265 | audio_state, 266 | output_dir 267 | ], 268 | outputs=[video_output, audio_output, clip_message, srt_clipped]) 269 | clip_subti_button.click(video_clip_addsub, 270 | inputs=[video_text_input, 
271 | video_spk_input, 272 | video_start_ost, 273 | video_end_ost, 274 | video_state, 275 | output_dir, 276 | font_size, 277 | font_color, 278 | ], 279 | outputs=[video_output, clip_message, srt_clipped]) 280 | llm_button.click(llm_inference, 281 | inputs=[prompt_head, prompt_head2, video_srt_output, llm_model, apikey_input], 282 | outputs=[llm_result]) 283 | llm_clip_button.click(AI_clip, 284 | inputs=[llm_result, 285 | video_text_input, 286 | video_spk_input, 287 | video_start_ost, 288 | video_end_ost, 289 | video_state, 290 | audio_state, 291 | output_dir, 292 | ], 293 | outputs=[video_output, audio_output, clip_message, srt_clipped]) 294 | llm_clip_subti_button.click(AI_clip_subti, 295 | inputs=[llm_result, 296 | video_text_input, 297 | video_spk_input, 298 | video_start_ost, 299 | video_end_ost, 300 | video_state, 301 | audio_state, 302 | output_dir, 303 | ], 304 | outputs=[video_output, audio_output, clip_message, srt_clipped]) 305 | 306 | # start gradio service in local or share 307 | if args.listen: 308 | funclip_service.launch(share=args.share, server_port=args.port, server_name=server_name, inbrowser=False) 309 | else: 310 | funclip_service.launch(share=args.share, server_port=args.port, server_name=server_name) 311 | -------------------------------------------------------------------------------- /funclip/llm/demo_prompt.py: -------------------------------------------------------------------------------- 1 | demo_prompt=""" 2 | 你是一个视频srt字幕剪辑工具,输入视频的srt字幕之后根据如下要求剪辑对应的片段并输出每个段落的开始与结束时间, 3 | 剪辑出以下片段中最有意义的、尽可能连续的部分,按如下格式输出:1. [开始时间-结束时间] 文本, 4 | 原始srt字幕如下: 5 | 0 6 | 00:00:00,50 --> 00:00:02,10 7 | 读万卷书行万里路, 8 | 1 9 | 00:00:02,310 --> 00:00:03,990 10 | 这里是读书三六九, 11 | 2 12 | 00:00:04,670 --> 00:00:07,990 13 | 今天要和您分享的这篇文章是人民日报, 14 | 3 15 | 00:00:08,510 --> 00:00:09,730 16 | 为什么要多读书? 17 | 4 18 | 00:00:10,90 --> 00:00:11,930 19 | 这是我听过最好的答案, 20 | 5 21 | 00:00:12,310 --> 00:00:13,190 22 | 经常有人问, 23 | 6 24 | 00:00:13,730 --> 00:00:14,690 25 | 读了那么多书, 26 | 7 27 | 00:00:14,990 --> 00:00:17,250 28 | 最终还不是要回到一座平凡的城, 29 | 8 30 | 00:00:17,610 --> 00:00:19,410 31 | 打一份平凡的工组, 32 | 9 33 | 00:00:19,410 --> 00:00:20,670 34 | 建一个平凡的家庭, 35 | 10 36 | 00:00:21,330 --> 00:00:25,960 37 | 何苦折腾一个人读书的意义究竟是什么? 38 | 11 39 | 00:00:26,680 --> 00:00:30,80 40 | 今天给大家分享人民日报推荐的八条理由, 41 | 12 42 | 00:00:30,540 --> 00:00:32,875 43 | 告诉你人为什么要多读书? 44 | 13 45 | 00:00:34,690 --> 00:00:38,725 46 | 一脚步丈量不到的地方文字可以。 47 | 14 48 | 00:00:40,300 --> 00:00:41,540 49 | 钱钟书先生说过, 50 | 15 51 | 00:00:42,260 --> 00:00:43,140 52 | 如果不读书, 53 | 16 54 | 00:00:43,520 --> 00:00:44,400 55 | 行万里路, 56 | 17 57 | 00:00:44,540 --> 00:00:45,695 58 | 也只是个邮差。 59 | 18 60 | 00:00:46,900 --> 00:00:47,320 61 | 北京、 62 | 19 63 | 00:00:47,500 --> 00:00:47,980 64 | 西安、 65 | 20 66 | 00:00:48,320 --> 00:00:51,200 67 | 南京和洛阳少了学识的浸润, 68 | 21 69 | 00:00:51,600 --> 00:00:55,565 70 | 他们只是一个个耳中熟悉又眼里陌生的地名。 71 | 22 72 | 00:00:56,560 --> 00:00:59,360 73 | 故宫避暑山庄岱庙、 74 | 23 75 | 00:00:59,840 --> 00:01:02,920 76 | 曲阜三孔有了文化照耀, 77 | 24 78 | 00:01:03,120 --> 00:01:05,340 79 | 他们才不是被时间风化的标本。 80 | 25 81 | 00:01:05,820 --> 00:01:08,105 82 | 而是活了成百上千年的生命, 83 | 26 84 | 00:01:09,650 --> 00:01:10,370 85 | 不去读书, 86 | 27 87 | 00:01:10,670 --> 00:01:12,920 88 | 就是一个邮差风景, 89 | 28 90 | 00:01:13,0 --> 00:01:13,835 91 | 过眼就忘, 92 | 29 93 | 00:01:14,750 --> 00:01:17,365 94 | 就算踏破铁鞋又有什么用处呢? 
95 | 30 96 | 00:01:19,240 --> 00:01:22,380 97 | 阅读不仅仅会让现实的旅行更加丰富, 98 | 31 99 | 00:01:23,120 --> 00:01:27,260 100 | 更重要的是能让精神突破现实和身体的桎梏, 101 | 32 102 | 00:01:27,640 --> 00:01:29,985 103 | 来一场灵魂长足的旅行。 104 | 33 105 | 00:01:31,850 --> 00:01:32,930 106 | 听过这样一句话, 107 | 34 108 | 00:01:33,490 --> 00:01:35,190 109 | 没有一艘非凡的船舰, 110 | 35 111 | 00:01:35,330 --> 00:01:36,430 112 | 能像一册书籍, 113 | 36 114 | 00:01:36,690 --> 00:01:38,595 115 | 把我们带到浩瀚的天地, 116 | 37 117 | 00:01:39,830 --> 00:01:42,685 118 | 你无法到达的地方文字在你过去, 119 | 38 120 | 00:01:43,530 --> 00:01:45,750 121 | 你无法经历的人生舒淇, 122 | 39 123 | 00:01:45,770 --> 00:01:46,595 124 | 带你相遇。 125 | 40 126 | 00:01:47,640 --> 00:01:50,340 127 | 那些读过的书会一本本充实, 128 | 41 129 | 00:01:50,340 --> 00:01:50,940 130 | 你的内心, 131 | 42 132 | 00:01:51,640 --> 00:01:54,855 133 | 让虚无单调的世界变得五彩斑斓。 134 | 43 135 | 00:01:55,930 --> 00:01:59,690 136 | 那些书中的人物会在你深陷生活泥潭之时, 137 | 44 138 | 00:02:00,170 --> 00:02:01,190 139 | 轻声的呼唤, 140 | 45 141 | 00:02:01,950 --> 00:02:03,270 142 | 用他们心怀梦想、 143 | 46 144 | 00:02:03,630 --> 00:02:04,950 145 | 不卑不亢的故事, 146 | 47 147 | 00:02:05,310 --> 00:02:07,90 148 | 激励你抵御苦难, 149 | 48 150 | 00:02:07,430 --> 00:02:08,525 151 | 勇往直前。 152 | 49 153 | 00:02:11,290 --> 00:02:11,695 154 | 二、 155 | 50 156 | 00:02:12,440 --> 00:02:16,900 157 | 读书的意义是使人虚心叫通达不固执、 158 | 51 159 | 00:02:17,200 --> 00:02:18,35 160 | 不偏执。 161 | 52 162 | 00:02:20,290 --> 00:02:22,935 163 | 读书越少的人越容易过得痛苦。 164 | 53 165 | 00:02:23,600 --> 00:02:24,400 166 | 读书越多, 167 | 54 168 | 00:02:24,800 --> 00:02:26,185 169 | 人才会越通透, 170 | 55 171 | 00:02:27,890 --> 00:02:30,30 172 | 知乎上有位网友讲过自己的故事。 173 | 56 174 | 00:02:30,750 --> 00:02:31,310 175 | 有一次, 176 | 57 177 | 00:02:31,530 --> 00:02:32,650 178 | 他跟伴侣吵架, 179 | 58 180 | 00:02:33,190 --> 00:02:35,505 181 | 气得连续好几个晚上没睡好, 182 | 59 183 | 00:02:36,360 --> 00:02:38,880 184 | 直到他读到一本关于亲密关系的书。 185 | 60 186 | 00:02:39,500 --> 00:02:41,920 187 | 书中有段关于夫妻关系的解读, 188 | 61 189 | 00:02:42,80 --> 00:02:43,100 190 | 让他豁然开朗, 191 | 62 192 | 00:02:43,460 --> 00:02:47,170 193 | 突然想明白了很多事气消了, 194 | 63 195 | 00:02:47,430 --> 00:02:48,410 196 | 心情好了, 197 | 64 198 | 00:02:48,790 --> 00:02:50,194 199 | 整个人也舒爽了。 200 | 65 201 | 00:02:51,780 --> 00:02:54,340 202 | 一个人书读的不多见识, 203 | 66 204 | 00:02:54,380 --> 00:02:55,180 205 | 难免受限, 206 | 67 207 | 00:02:55,720 --> 00:02:58,495 208 | 结果就必须受着眼前世界的禁锢, 209 | 68 210 | 00:02:59,540 --> 00:03:00,740 211 | 稍微遇到一点不顺, 212 | 69 213 | 00:03:00,940 --> 00:03:02,460 214 | 就极易消极悲观, 215 | 70 216 | 00:03:02,900 --> 00:03:03,720 217 | 郁郁寡欢, 218 | 71 219 | 00:03:04,140 --> 00:03:05,765 220 | 让自己困在情绪里, 221 | 72 222 | 00:03:06,900 --> 00:03:09,760 223 | 只有通过阅读才能看透人生真相, 224 | 73 225 | 00:03:10,300 --> 00:03:12,140 226 | 收获为人处事的智慧, 227 | 74 228 | 00:03:12,480 --> 00:03:14,95 229 | 把日子越过越好。 230 | 75 231 | 00:03:16,730 --> 00:03:17,890 232 | 生活的艺术里说, 233 | 76 234 | 00:03:18,410 --> 00:03:20,30 235 | 人一定要时时读书, 236 | 77 237 | 00:03:20,430 --> 00:03:22,915 238 | 不然便会鄙令晚腐。 239 | 78 240 | 00:03:23,690 --> 00:03:28,730 241 | 完剑俗剑生满身上一个人的落伍迂腐, 242 | 79 243 | 00:03:29,210 --> 00:03:31,205 244 | 就是不肯实施读书所致。 245 | 80 246 | 00:03:33,10 --> 00:03:34,790 247 | 只有在不断阅读的过程中, 248 | 81 249 | 00:03:34,990 --> 00:03:35,970 250 | 修心养性, 251 | 82 252 | 00:03:36,430 --> 00:03:38,735 253 | 才能摆脱我们的鄙俗和顽固。 254 | 83 255 | 00:03:39,920 --> 00:03:41,720 256 | 这世间没有谁的生活, 257 | 84 258 | 00:03:41,800 --> 00:03:42,540 259 | 没有烦恼, 260 | 85 261 | 00:03:43,140 --> 00:03:45,455 262 | 唯读书是最好的解药。 263 | 86 264 | 00:03:47,730 --> 00:03:48,185 265 | 三、 266 | 87 267 | 00:03:49,40 --> 00:03:50,720 
268 | 书中未必有黄金屋, 269 | 88 270 | 00:03:51,0 --> 00:03:52,595 271 | 但一定有更好的自己。 272 | """ -------------------------------------------------------------------------------- /funclip/llm/g4f_openai_api.py: -------------------------------------------------------------------------------- 1 | from g4f.client import Client 2 | 3 | if __name__ == '__main__': 4 | from llm.demo_prompt import demo_prompt 5 | client = Client() 6 | response = client.chat.completions.create( 7 | model="gpt-3.5-turbo", 8 | messages=[{"role": "user", "content": "你好你的名字是什么"}], 9 | ) 10 | print(response.choices[0].message.content) 11 | 12 | 13 | def g4f_openai_call(model="gpt-3.5-turbo", 14 | user_content="如何做西红柿炖牛腩?", 15 | system_content=None): 16 | client = Client() 17 | if system_content is not None and len(system_content.strip()): 18 | messages = [ 19 | {'role': 'system', 'content': system_content}, 20 | {'role': 'user', 'content': user_content} 21 | ] 22 | else: 23 | messages = [ 24 | {'role': 'user', 'content': user_content} 25 | ] 26 | response = client.chat.completions.create( 27 | model=model, 28 | messages=messages, 29 | ) 30 | return(response.choices[0].message.content) -------------------------------------------------------------------------------- /funclip/llm/openai_api.py: -------------------------------------------------------------------------------- 1 | import os 2 | import logging 3 | from openai import OpenAI 4 | 5 | 6 | if __name__ == '__main__': 7 | from llm.demo_prompt import demo_prompt 8 | client = OpenAI( 9 | # This is the default and can be omitted 10 | api_key=os.environ.get("OPENAI_API_KEY"), 11 | ) 12 | 13 | chat_completion = client.chat.completions.create( 14 | messages=[ 15 | { 16 | "role": "user", 17 | "content": demo_prompt, 18 | } 19 | ], 20 | model="gpt-3.5-turbo-0125", 21 | ) 22 | print(chat_completion.choices[0].message.content) 23 | 24 | 25 | def openai_call(apikey, 26 | model="gpt-3.5-turbo", 27 | user_content="如何做西红柿炖牛腩?", 28 | system_content=None): 29 | client = OpenAI( 30 | # This is the default and can be omitted 31 | api_key=apikey, 32 | ) 33 | if system_content is not None and len(system_content.strip()): 34 | messages = [ 35 | {'role': 'system', 'content': system_content}, 36 | {'role': 'user', 'content': user_content} 37 | ] 38 | else: 39 | messages = [ 40 | {'role': 'user', 'content': user_content} 41 | ] 42 | 43 | chat_completion = client.chat.completions.create( 44 | messages=messages, 45 | model=model, 46 | ) 47 | logging.info("Openai model inference done.") 48 | return chat_completion.choices[0].message.content -------------------------------------------------------------------------------- /funclip/llm/qwen_api.py: -------------------------------------------------------------------------------- 1 | import dashscope 2 | from dashscope import Generation 3 | 4 | 5 | def call_qwen_model(key=None, 6 | model="qwen_plus", 7 | user_content="如何做西红柿炖牛腩?", 8 | system_content=None): 9 | dashscope.api_key = key 10 | if system_content is not None and len(system_content.strip()): 11 | messages = [ 12 | {'role': 'system', 'content': system_content}, 13 | {'role': 'user', 'content': user_content} 14 | ] 15 | else: 16 | messages = [ 17 | {'role': 'user', 'content': user_content} 18 | ] 19 | responses = Generation.call(model, 20 | messages=messages, 21 | result_format='message', # 设置输出为'message'格式 22 | stream=False, # 设置输出方式为流式输出 23 | incremental_output=False # 增量式流式输出 24 | ) 25 | print(responses) 26 | return responses['output']['choices'][0]['message']['content'] 27 | 28 | 29 | if __name__ == 
'__main__': 30 | call_qwen_model('YOUR_BAILIAN_APIKEY') -------------------------------------------------------------------------------- /funclip/test/imagemagick_test.py: -------------------------------------------------------------------------------- 1 | from moviepy.editor import * 2 | from moviepy.video.tools.subtitles import SubtitlesClip, TextClip 3 | from moviepy.editor import VideoFileClip, concatenate_videoclips 4 | from moviepy.video.compositing import CompositeVideoClip 5 | 6 | generator = lambda txt: TextClip(txt, font='./font/STHeitiMedium.ttc', fontsize=48, color='white') 7 | subs = [((0, 2), 'sub1中文字幕'), 8 | ((2, 4), 'subs2'), 9 | ((4, 6), 'subs3'), 10 | ((6, 8), 'subs4')] 11 | 12 | subtitles = SubtitlesClip(subs, generator) 13 | 14 | video = VideoFileClip("examples/2022云栖大会_片段.mp4.mp4") 15 | video = video.subclip(0, 8) 16 | video = CompositeVideoClip([video, subtitles.set_pos(('center','bottom'))]) 17 | 18 | video.write_videofile("test_output.mp4") -------------------------------------------------------------------------------- /funclip/test/test.sh: -------------------------------------------------------------------------------- 1 | # step1: Recognize 2 | python videoclipper.py --stage 1 \ 3 | --file ../examples/2022云栖大会_片段.mp4 \ 4 | --sd_switch yes \ 5 | --output_dir ./output 6 | # now you can find recognition results and entire SRT file in ./output/ 7 | # step2: Clip 8 | python videoclipper.py --stage 2 \ 9 | --file ../examples/2022云栖大会_片段.mp4 \ 10 | --output_dir ./output \ 11 | --dest_text '所以这个是我们办这个奖的初心啊,我们也会一届一届的办下去' \ 12 | # --dest_spk spk0 \ 13 | --start_ost 0 \ 14 | --end_ost 100 \ 15 | --output_file './output/res.mp4' -------------------------------------------------------------------------------- /funclip/utils/argparse_tools.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved. 4 | # MIT License (https://opensource.org/licenses/MIT) 5 | 6 | import argparse 7 | from pathlib import Path 8 | 9 | import yaml 10 | import sys 11 | 12 | 13 | class ArgumentParser(argparse.ArgumentParser): 14 | """Simple implementation of ArgumentParser supporting config file 15 | 16 | This class is originated from https://github.com/bw2/ConfigArgParse, 17 | but this class is lack of some features that it has. 18 | 19 | - Not supporting multiple config files 20 | - Automatically adding "--config" as an option. 
21 | - Not supporting any formats other than yaml 22 | - Not checking argument type 23 | 24 | """ 25 | 26 | def __init__(self, *args, **kwargs): 27 | super().__init__(*args, **kwargs) 28 | self.add_argument("--config", help="Give config file in yaml format") 29 | 30 | def parse_known_args(self, args=None, namespace=None): 31 | # Once parsing for setting from "--config" 32 | _args, _ = super().parse_known_args(args, namespace) 33 | if _args.config is not None: 34 | if not Path(_args.config).exists(): 35 | self.error(f"No such file: {_args.config}") 36 | 37 | with open(_args.config, "r", encoding="utf-8") as f: 38 | d = yaml.safe_load(f) 39 | if not isinstance(d, dict): 40 | self.error("Config file has non dict value: {_args.config}") 41 | 42 | for key in d: 43 | for action in self._actions: 44 | if key == action.dest: 45 | break 46 | else: 47 | self.error(f"unrecognized arguments: {key} (from {_args.config})") 48 | 49 | # NOTE(kamo): Ignore "--config" from a config file 50 | # NOTE(kamo): Unlike "configargparse", this module doesn't check type. 51 | # i.e. We can set any type value regardless of argument type. 52 | self.set_defaults(**d) 53 | return super().parse_known_args(args, namespace) 54 | 55 | 56 | def get_commandline_args(): 57 | extra_chars = [ 58 | " ", 59 | ";", 60 | "&", 61 | "(", 62 | ")", 63 | "|", 64 | "^", 65 | "<", 66 | ">", 67 | "?", 68 | "*", 69 | "[", 70 | "]", 71 | "$", 72 | "`", 73 | '"', 74 | "\\", 75 | "!", 76 | "{", 77 | "}", 78 | ] 79 | 80 | # Escape the extra characters for shell 81 | argv = [ 82 | arg.replace("'", "'\\''") 83 | if all(char not in arg for char in extra_chars) 84 | else "'" + arg.replace("'", "'\\''") + "'" 85 | for arg in sys.argv 86 | ] 87 | 88 | return sys.executable + " " + " ".join(argv) -------------------------------------------------------------------------------- /funclip/utils/subtitle_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved. 
4 | # MIT License (https://opensource.org/licenses/MIT) 5 | import re 6 | 7 | def time_convert(ms): 8 | ms = int(ms) 9 | tail = ms % 1000 10 | s = ms // 1000 11 | mi = s // 60 12 | s = s % 60 13 | h = mi // 60 14 | mi = mi % 60 15 | h = "00" if h == 0 else str(h) 16 | mi = "00" if mi == 0 else str(mi) 17 | s = "00" if s == 0 else str(s) 18 | tail = str(tail) 19 | if len(h) == 1: h = '0' + h 20 | if len(mi) == 1: mi = '0' + mi 21 | if len(s) == 1: s = '0' + s 22 | return "{}:{}:{},{}".format(h, mi, s, tail) 23 | 24 | def str2list(text): 25 | pattern = re.compile(r'[\u4e00-\u9fff]|[\w-]+', re.UNICODE) 26 | elements = pattern.findall(text) 27 | return elements 28 | 29 | class Text2SRT(): 30 | def __init__(self, text, timestamp, offset=0): 31 | self.token_list = text 32 | self.timestamp = timestamp 33 | start, end = timestamp[0][0] - offset, timestamp[-1][1] - offset 34 | self.start_sec, self.end_sec = start, end 35 | self.start_time = time_convert(start) 36 | self.end_time = time_convert(end) 37 | def text(self): 38 | if isinstance(self.token_list, str): 39 | return self.token_list 40 | else: 41 | res = "" 42 | for word in self.token_list: 43 | if '\u4e00' <= word <= '\u9fff': 44 | res += word 45 | else: 46 | res += " " + word 47 | return res.lstrip() 48 | def srt(self, acc_ost=0.0): 49 | return "{} --> {}\n{}\n".format( 50 | time_convert(self.start_sec+acc_ost*1000), 51 | time_convert(self.end_sec+acc_ost*1000), 52 | self.text()) 53 | def time(self, acc_ost=0.0): 54 | return (self.start_sec/1000+acc_ost, self.end_sec/1000+acc_ost) 55 | 56 | 57 | def generate_srt(sentence_list): 58 | srt_total = '' 59 | for i, sent in enumerate(sentence_list): 60 | t2s = Text2SRT(sent['text'], sent['timestamp']) 61 | if 'spk' in sent: 62 | srt_total += "{} spk{}\n{}".format(i, sent['spk'], t2s.srt()) 63 | else: 64 | srt_total += "{}\n{}".format(i, t2s.srt()) 65 | return srt_total 66 | 67 | def generate_srt_clip(sentence_list, start, end, begin_index=0, time_acc_ost=0.0): 68 | start, end = int(start * 1000), int(end * 1000) 69 | srt_total = '' 70 | cc = 1 + begin_index 71 | subs = [] 72 | for _, sent in enumerate(sentence_list): 73 | if isinstance(sent['text'], str): 74 | sent['text'] = str2list(sent['text']) 75 | if sent['timestamp'][-1][1] <= start: 76 | # print("CASE0") 77 | continue 78 | if sent['timestamp'][0][0] >= end: 79 | # print("CASE4") 80 | break 81 | # parts in between 82 | if (sent['timestamp'][-1][1] <= end and sent['timestamp'][0][0] > start) or (sent['timestamp'][-1][1] == end and sent['timestamp'][0][0] == start): 83 | # print("CASE1"); import pdb; pdb.set_trace() 84 | t2s = Text2SRT(sent['text'], sent['timestamp'], offset=start) 85 | srt_total += "{}\n{}".format(cc, t2s.srt(time_acc_ost)) 86 | subs.append((t2s.time(time_acc_ost), t2s.text())) 87 | cc += 1 88 | continue 89 | if sent['timestamp'][0][0] <= start: 90 | # print("CASE2"); import pdb; pdb.set_trace() 91 | if not sent['timestamp'][-1][1] > end: 92 | for j, ts in enumerate(sent['timestamp']): 93 | if ts[1] > start: 94 | break 95 | _text = sent['text'][j:] 96 | _ts = sent['timestamp'][j:] 97 | else: 98 | for j, ts in enumerate(sent['timestamp']): 99 | if ts[1] > start: 100 | _start = j 101 | break 102 | for j, ts in enumerate(sent['timestamp']): 103 | if ts[1] > end: 104 | _end = j 105 | break 106 | # _text = " ".join(sent['text'][_start:_end]) 107 | _text = sent['text'][_start:_end] 108 | _ts = sent['timestamp'][_start:_end] 109 | if len(ts): 110 | t2s = Text2SRT(_text, _ts, offset=start) 111 | srt_total += "{}\n{}".format(cc, 
t2s.srt(time_acc_ost)) 112 | subs.append((t2s.time(time_acc_ost), t2s.text())) 113 | cc += 1 114 | continue 115 | if sent['timestamp'][-1][1] > end: 116 | # print("CASE3"); import pdb; pdb.set_trace() 117 | for j, ts in enumerate(sent['timestamp']): 118 | if ts[1] > end: 119 | break 120 | _text = sent['text'][:j] 121 | _ts = sent['timestamp'][:j] 122 | if len(_ts): 123 | t2s = Text2SRT(_text, _ts, offset=start) 124 | srt_total += "{}\n{}".format(cc, t2s.srt(time_acc_ost)) 125 | subs.append( 126 | (t2s.time(time_acc_ost), t2s.text()) 127 | ) 128 | cc += 1 129 | continue 130 | return srt_total, subs, cc 131 | -------------------------------------------------------------------------------- /funclip/utils/theme.json: -------------------------------------------------------------------------------- 1 | { 2 | "theme": { 3 | "_font": [ 4 | { 5 | "__gradio_font__": true, 6 | "name": "Montserrat", 7 | "class": "google" 8 | }, 9 | { 10 | "__gradio_font__": true, 11 | "name": "ui-sans-serif", 12 | "class": "font" 13 | }, 14 | { 15 | "__gradio_font__": true, 16 | "name": "system-ui", 17 | "class": "font" 18 | }, 19 | { 20 | "__gradio_font__": true, 21 | "name": "sans-serif", 22 | "class": "font" 23 | } 24 | ], 25 | "_font_mono": [ 26 | { 27 | "__gradio_font__": true, 28 | "name": "IBM Plex Mono", 29 | "class": "google" 30 | }, 31 | { 32 | "__gradio_font__": true, 33 | "name": "ui-monospace", 34 | "class": "font" 35 | }, 36 | { 37 | "__gradio_font__": true, 38 | "name": "Consolas", 39 | "class": "font" 40 | }, 41 | { 42 | "__gradio_font__": true, 43 | "name": "monospace", 44 | "class": "font" 45 | } 46 | ], 47 | "background_fill_primary": "*neutral_50", 48 | "background_fill_primary_dark": "*neutral_950", 49 | "background_fill_secondary": "*neutral_50", 50 | "background_fill_secondary_dark": "*neutral_900", 51 | "block_background_fill": "white", 52 | "block_background_fill_dark": "*neutral_800", 53 | "block_border_color": "*border_color_primary", 54 | "block_border_color_dark": "*border_color_primary", 55 | "block_border_width": "0px", 56 | "block_border_width_dark": "0px", 57 | "block_info_text_color": "*body_text_color_subdued", 58 | "block_info_text_color_dark": "*body_text_color_subdued", 59 | "block_info_text_size": "*text_sm", 60 | "block_info_text_weight": "400", 61 | "block_label_background_fill": "*primary_100", 62 | "block_label_background_fill_dark": "*primary_600", 63 | "block_label_border_color": "*border_color_primary", 64 | "block_label_border_color_dark": "*border_color_primary", 65 | "block_label_border_width": "1px", 66 | "block_label_border_width_dark": "1px", 67 | "block_label_margin": "*spacing_md", 68 | "block_label_padding": "*spacing_sm *spacing_md", 69 | "block_label_radius": "*radius_md", 70 | "block_label_right_radius": "0 calc(*radius_lg - 1px) 0 calc(*radius_lg - 1px)", 71 | "block_label_text_color": "*primary_500", 72 | "block_label_text_color_dark": "*white", 73 | "block_label_text_size": "*text_md", 74 | "block_label_text_weight": "600", 75 | "block_padding": "*spacing_xl calc(*spacing_xl + 2px)", 76 | "block_radius": "*radius_lg", 77 | "block_shadow": "none", 78 | "block_shadow_dark": "none", 79 | "block_title_background_fill": "*block_label_background_fill", 80 | "block_title_background_fill_dark": "*block_label_background_fill", 81 | "block_title_border_color": "none", 82 | "block_title_border_color_dark": "none", 83 | "block_title_border_width": "0px", 84 | "block_title_border_width_dark": "0px", 85 | "block_title_padding": "*block_label_padding", 86 | 
"block_title_radius": "*block_label_radius", 87 | "block_title_text_color": "*primary_500", 88 | "block_title_text_color_dark": "*white", 89 | "block_title_text_size": "*text_md", 90 | "block_title_text_weight": "600", 91 | "body_background_fill": "*background_fill_primary", 92 | "body_background_fill_dark": "*background_fill_primary", 93 | "body_text_color": "*neutral_800", 94 | "body_text_color_dark": "*neutral_100", 95 | "body_text_color_subdued": "*neutral_400", 96 | "body_text_color_subdued_dark": "*neutral_400", 97 | "body_text_size": "*text_md", 98 | "body_text_weight": "400", 99 | "border_color_accent": "*primary_300", 100 | "border_color_accent_dark": "*neutral_600", 101 | "border_color_primary": "*neutral_200", 102 | "border_color_primary_dark": "*neutral_700", 103 | "button_border_width": "*input_border_width", 104 | "button_border_width_dark": "*input_border_width", 105 | "button_cancel_background_fill": "*button_secondary_background_fill", 106 | "button_cancel_background_fill_dark": "*button_secondary_background_fill", 107 | "button_cancel_background_fill_hover": "*button_secondary_background_fill_hover", 108 | "button_cancel_background_fill_hover_dark": "*button_secondary_background_fill_hover", 109 | "button_cancel_border_color": "*button_secondary_border_color", 110 | "button_cancel_border_color_dark": "*button_secondary_border_color", 111 | "button_cancel_border_color_hover": "*button_cancel_border_color", 112 | "button_cancel_border_color_hover_dark": "*button_cancel_border_color", 113 | "button_cancel_text_color": "*button_secondary_text_color", 114 | "button_cancel_text_color_dark": "*button_secondary_text_color", 115 | "button_cancel_text_color_hover": "*button_cancel_text_color", 116 | "button_cancel_text_color_hover_dark": "*button_cancel_text_color", 117 | "button_large_padding": "*spacing_lg calc(2 * *spacing_lg)", 118 | "button_large_radius": "*radius_lg", 119 | "button_large_text_size": "*text_lg", 120 | "button_large_text_weight": "600", 121 | "button_primary_background_fill": "*primary_500", 122 | "button_primary_background_fill_dark": "*primary_700", 123 | "button_primary_background_fill_hover": "*primary_400", 124 | "button_primary_background_fill_hover_dark": "*primary_500", 125 | "button_primary_border_color": "*primary_200", 126 | "button_primary_border_color_dark": "*primary_600", 127 | "button_primary_border_color_hover": "*button_primary_border_color", 128 | "button_primary_border_color_hover_dark": "*button_primary_border_color", 129 | "button_primary_text_color": "white", 130 | "button_primary_text_color_dark": "white", 131 | "button_primary_text_color_hover": "*button_primary_text_color", 132 | "button_primary_text_color_hover_dark": "*button_primary_text_color", 133 | "button_secondary_background_fill": "white", 134 | "button_secondary_background_fill_dark": "*neutral_600", 135 | "button_secondary_background_fill_hover": "*neutral_100", 136 | "button_secondary_background_fill_hover_dark": "*primary_500", 137 | "button_secondary_border_color": "*neutral_200", 138 | "button_secondary_border_color_dark": "*neutral_600", 139 | "button_secondary_border_color_hover": "*button_secondary_border_color", 140 | "button_secondary_border_color_hover_dark": "*button_secondary_border_color", 141 | "button_secondary_text_color": "*neutral_800", 142 | "button_secondary_text_color_dark": "white", 143 | "button_secondary_text_color_hover": "*button_secondary_text_color", 144 | "button_secondary_text_color_hover_dark": "*button_secondary_text_color", 145 | 
"button_shadow": "*shadow_drop_lg", 146 | "button_shadow_active": "*shadow_inset", 147 | "button_shadow_hover": "*shadow_drop_lg", 148 | "button_small_padding": "*spacing_sm calc(2 * *spacing_sm)", 149 | "button_small_radius": "*radius_lg", 150 | "button_small_text_size": "*text_md", 151 | "button_small_text_weight": "400", 152 | "button_transition": "background-color 0.2s ease", 153 | "checkbox_background_color": "*background_fill_primary", 154 | "checkbox_background_color_dark": "*neutral_800", 155 | "checkbox_background_color_focus": "*checkbox_background_color", 156 | "checkbox_background_color_focus_dark": "*checkbox_background_color", 157 | "checkbox_background_color_hover": "*checkbox_background_color", 158 | "checkbox_background_color_hover_dark": "*checkbox_background_color", 159 | "checkbox_background_color_selected": "*primary_600", 160 | "checkbox_background_color_selected_dark": "*primary_700", 161 | "checkbox_border_color": "*neutral_100", 162 | "checkbox_border_color_dark": "*neutral_600", 163 | "checkbox_border_color_focus": "*primary_500", 164 | "checkbox_border_color_focus_dark": "*primary_600", 165 | "checkbox_border_color_hover": "*neutral_300", 166 | "checkbox_border_color_hover_dark": "*neutral_600", 167 | "checkbox_border_color_selected": "*primary_600", 168 | "checkbox_border_color_selected_dark": "*primary_700", 169 | "checkbox_border_radius": "*radius_sm", 170 | "checkbox_border_width": "1px", 171 | "checkbox_border_width_dark": "*input_border_width", 172 | "checkbox_check": "url(\"data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3cpath d='M12.207 4.793a1 1 0 010 1.414l-5 5a1 1 0 01-1.414 0l-2-2a1 1 0 011.414-1.414L6.5 9.086l4.293-4.293a1 1 0 011.414 0z'/%3e%3c/svg%3e\")", 173 | "checkbox_label_background_fill": "*button_secondary_background_fill", 174 | "checkbox_label_background_fill_dark": "*button_secondary_background_fill", 175 | "checkbox_label_background_fill_hover": "*button_secondary_background_fill_hover", 176 | "checkbox_label_background_fill_hover_dark": "*button_secondary_background_fill_hover", 177 | "checkbox_label_background_fill_selected": "*primary_500", 178 | "checkbox_label_background_fill_selected_dark": "*primary_600", 179 | "checkbox_label_border_color": "*border_color_primary", 180 | "checkbox_label_border_color_dark": "*border_color_primary", 181 | "checkbox_label_border_color_hover": "*checkbox_label_border_color", 182 | "checkbox_label_border_color_hover_dark": "*checkbox_label_border_color", 183 | "checkbox_label_border_width": "*input_border_width", 184 | "checkbox_label_border_width_dark": "*input_border_width", 185 | "checkbox_label_gap": "*spacing_lg", 186 | "checkbox_label_padding": "*spacing_md calc(2 * *spacing_md)", 187 | "checkbox_label_shadow": "*shadow_drop_lg", 188 | "checkbox_label_text_color": "*body_text_color", 189 | "checkbox_label_text_color_dark": "*body_text_color", 190 | "checkbox_label_text_color_selected": "white", 191 | "checkbox_label_text_color_selected_dark": "*checkbox_label_text_color", 192 | "checkbox_label_text_size": "*text_md", 193 | "checkbox_label_text_weight": "400", 194 | "checkbox_shadow": "none", 195 | "color_accent": "*primary_500", 196 | "color_accent_soft": "*primary_50", 197 | "color_accent_soft_dark": "*neutral_700", 198 | "container_radius": "*radius_lg", 199 | "embed_radius": "*radius_lg", 200 | "error_background_fill": "#fee2e2", 201 | "error_background_fill_dark": "*background_fill_primary", 202 | "error_border_color": "#fecaca", 203 | 
"error_border_color_dark": "*border_color_primary", 204 | "error_border_width": "1px", 205 | "error_border_width_dark": "1px", 206 | "error_text_color": "#ef4444", 207 | "error_text_color_dark": "#ef4444", 208 | "font": "'Montserrat', 'ui-sans-serif', 'system-ui', sans-serif", 209 | "font_mono": "'IBM Plex Mono', 'ui-monospace', 'Consolas', monospace", 210 | "form_gap_width": "0px", 211 | "input_background_fill": "white", 212 | "input_background_fill_dark": "*neutral_700", 213 | "input_background_fill_focus": "*secondary_500", 214 | "input_background_fill_focus_dark": "*secondary_600", 215 | "input_background_fill_hover": "*input_background_fill", 216 | "input_background_fill_hover_dark": "*input_background_fill", 217 | "input_border_color": "*neutral_50", 218 | "input_border_color_dark": "*border_color_primary", 219 | "input_border_color_focus": "*secondary_300", 220 | "input_border_color_focus_dark": "*neutral_700", 221 | "input_border_color_hover": "*input_border_color", 222 | "input_border_color_hover_dark": "*input_border_color", 223 | "input_border_width": "0px", 224 | "input_border_width_dark": "0px", 225 | "input_padding": "*spacing_xl", 226 | "input_placeholder_color": "*neutral_400", 227 | "input_placeholder_color_dark": "*neutral_500", 228 | "input_radius": "*radius_lg", 229 | "input_shadow": "*shadow_drop", 230 | "input_shadow_dark": "*shadow_drop", 231 | "input_shadow_focus": "*shadow_drop_lg", 232 | "input_shadow_focus_dark": "*shadow_drop_lg", 233 | "input_text_size": "*text_md", 234 | "input_text_weight": "400", 235 | "layout_gap": "*spacing_xxl", 236 | "link_text_color": "*secondary_600", 237 | "link_text_color_active": "*secondary_600", 238 | "link_text_color_active_dark": "*secondary_500", 239 | "link_text_color_dark": "*secondary_500", 240 | "link_text_color_hover": "*secondary_700", 241 | "link_text_color_hover_dark": "*secondary_400", 242 | "link_text_color_visited": "*secondary_500", 243 | "link_text_color_visited_dark": "*secondary_600", 244 | "loader_color": "*color_accent", 245 | "loader_color_dark": "*color_accent", 246 | "name": "base", 247 | "neutral_100": "#f3f4f6", 248 | "neutral_200": "#e5e7eb", 249 | "neutral_300": "#d1d5db", 250 | "neutral_400": "#9ca3af", 251 | "neutral_50": "#f9fafb", 252 | "neutral_500": "#6b7280", 253 | "neutral_600": "#4b5563", 254 | "neutral_700": "#374151", 255 | "neutral_800": "#1f2937", 256 | "neutral_900": "#111827", 257 | "neutral_950": "#0b0f19", 258 | "panel_background_fill": "*background_fill_secondary", 259 | "panel_background_fill_dark": "*background_fill_secondary", 260 | "panel_border_color": "*border_color_primary", 261 | "panel_border_color_dark": "*border_color_primary", 262 | "panel_border_width": "1px", 263 | "panel_border_width_dark": "1px", 264 | "primary_100": "#e0e7ff", 265 | "primary_200": "#c7d2fe", 266 | "primary_300": "#a5b4fc", 267 | "primary_400": "#818cf8", 268 | "primary_50": "#eef2ff", 269 | "primary_500": "#6366f1", 270 | "primary_600": "#4f46e5", 271 | "primary_700": "#4338ca", 272 | "primary_800": "#3730a3", 273 | "primary_900": "#312e81", 274 | "primary_950": "#2b2c5e", 275 | "prose_header_text_weight": "600", 276 | "prose_text_size": "*text_md", 277 | "prose_text_weight": "400", 278 | "radio_circle": "url(\"data:image/svg+xml,%3csvg viewBox='0 0 16 16' fill='white' xmlns='http://www.w3.org/2000/svg'%3e%3ccircle cx='8' cy='8' r='3'/%3e%3c/svg%3e\")", 279 | "radius_lg": "6px", 280 | "radius_md": "4px", 281 | "radius_sm": "2px", 282 | "radius_xl": "8px", 283 | "radius_xs": "1px", 284 | "radius_xxl": 
"12px", 285 | "radius_xxs": "1px", 286 | "secondary_100": "#ecfccb", 287 | "secondary_200": "#d9f99d", 288 | "secondary_300": "#bef264", 289 | "secondary_400": "#a3e635", 290 | "secondary_50": "#f7fee7", 291 | "secondary_500": "#84cc16", 292 | "secondary_600": "#65a30d", 293 | "secondary_700": "#4d7c0f", 294 | "secondary_800": "#3f6212", 295 | "secondary_900": "#365314", 296 | "secondary_950": "#2f4e14", 297 | "section_header_text_size": "*text_md", 298 | "section_header_text_weight": "400", 299 | "shadow_drop": "0 1px 4px 0 rgb(0 0 0 / 0.1)", 300 | "shadow_drop_lg": "0 2px 5px 0 rgb(0 0 0 / 0.1)", 301 | "shadow_inset": "rgba(0,0,0,0.05) 0px 2px 4px 0px inset", 302 | "shadow_spread": "6px", 303 | "shadow_spread_dark": "1px", 304 | "slider_color": "*primary_500", 305 | "slider_color_dark": "*primary_600", 306 | "spacing_lg": "6px", 307 | "spacing_md": "4px", 308 | "spacing_sm": "2px", 309 | "spacing_xl": "9px", 310 | "spacing_xs": "1px", 311 | "spacing_xxl": "12px", 312 | "spacing_xxs": "1px", 313 | "stat_background_fill": "*primary_300", 314 | "stat_background_fill_dark": "*primary_500", 315 | "table_border_color": "*neutral_300", 316 | "table_border_color_dark": "*neutral_700", 317 | "table_even_background_fill": "white", 318 | "table_even_background_fill_dark": "*neutral_950", 319 | "table_odd_background_fill": "*neutral_50", 320 | "table_odd_background_fill_dark": "*neutral_900", 321 | "table_radius": "*radius_lg", 322 | "table_row_focus": "*color_accent_soft", 323 | "table_row_focus_dark": "*color_accent_soft", 324 | "text_lg": "16px", 325 | "text_md": "14px", 326 | "text_sm": "12px", 327 | "text_xl": "22px", 328 | "text_xs": "10px", 329 | "text_xxl": "26px", 330 | "text_xxs": "9px" 331 | }, 332 | "version": "0.0.1" 333 | } -------------------------------------------------------------------------------- /funclip/utils/trans_utils.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | # -*- encoding: utf-8 -*- 3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved. 
4 | # MIT License (https://opensource.org/licenses/MIT)
5 |
6 | import os
7 | import re
8 | import numpy as np
9 |
10 | PUNC_LIST = ['，', '。', '！', '？', '、', ',', '.', '?', '!']
11 |
12 | def pre_proc(text):
13 |     res = ''
14 |     for i in range(len(text)):
15 |         if text[i] in PUNC_LIST:
16 |             continue
17 |         if '\u4e00' <= text[i] <= '\u9fff':
18 |             if len(res) and res[-1] != " ":
19 |                 res += ' ' + text[i]+' '
20 |             else:
21 |                 res += text[i]+' '
22 |         else:
23 |             res += text[i]
24 |     if res[-1] == ' ':
25 |         res = res[:-1]
26 |     return res
27 |
28 | def proc(raw_text, timestamp, dest_text, lang='zh'):
29 |     # simple matching
30 |     ld = len(dest_text.split())
31 |     mi, ts = [], []
32 |     offset = 0
33 |     while True:
34 |         fi = raw_text.find(dest_text, offset, len(raw_text))
35 |         ti = raw_text[:fi].count(' ')
36 |         if fi == -1:
37 |             break
38 |         offset = fi + ld
39 |         mi.append(fi)
40 |         ts.append([timestamp[ti][0]*16, timestamp[ti+ld-1][1]*16])
41 |     return ts
42 |
43 |
44 | def proc_spk(dest_spk, sd_sentences):
45 |     ts = []
46 |     for d in sd_sentences:
47 |         d_start = d['timestamp'][0][0]
48 |         d_end = d['timestamp'][-1][1]
49 |         spkid=dest_spk[3:]
50 |         if str(d['spk']) == spkid and d_end-d_start>999:
51 |             ts.append([d_start*16, d_end*16])
52 |     return ts
53 |
54 | def generate_vad_data(data, sd_sentences, sr=16000):
55 |     assert len(data.shape) == 1
56 |     vad_data = []
57 |     for d in sd_sentences:
58 |         d_start = round(d['ts_list'][0][0]/1000, 2)
59 |         d_end = round(d['ts_list'][-1][1]/1000, 2)
60 |         vad_data.append([d_start, d_end, data[int(d_start * sr):int(d_end * sr)]])
61 |     return vad_data
62 |
63 | def write_state(output_dir, state):
64 |     for key in ['/recog_res_raw', '/timestamp', '/sentences']:#, '/sd_sentences']:
65 |         with open(output_dir+key, 'w') as fout:
66 |             fout.write(str(state[key[1:]]))
67 |     if 'sd_sentences' in state:
68 |         with open(output_dir+'/sd_sentences', 'w') as fout:
69 |             fout.write(str(state['sd_sentences']))
70 |
71 | def load_state(output_dir):
72 |     state = {}
73 |     with open(output_dir+'/recog_res_raw') as fin:
74 |         line = fin.read()
75 |         state['recog_res_raw'] = line
76 |     with open(output_dir+'/timestamp') as fin:
77 |         line = fin.read()
78 |         state['timestamp'] = eval(line)
79 |     with open(output_dir+'/sentences') as fin:
80 |         line = fin.read()
81 |         state['sentences'] = eval(line)
82 |     if os.path.exists(output_dir+'/sd_sentences'):
83 |         with open(output_dir+'/sd_sentences') as fin:
84 |             line = fin.read()
85 |             state['sd_sentences'] = eval(line)
86 |     return state
87 |
88 | def convert_pcm_to_float(data):
89 |     if data.dtype == np.float64:
90 |         return data
91 |     elif data.dtype == np.float32:
92 |         return data.astype(np.float64)
93 |     elif data.dtype == np.int16:
94 |         bit_depth = 16
95 |     elif data.dtype == np.int32:
96 |         bit_depth = 32
97 |     elif data.dtype == np.int8:
98 |         bit_depth = 8
99 |     else:
100 |         raise ValueError("Unsupported audio data type")
101 |
102 |     # Now handle the integer types
103 |     max_int_value = float(2 ** (bit_depth - 1))
104 |     if bit_depth == 8:
105 |         data = data - 128
106 |     return (data.astype(np.float64) / max_int_value)
107 |
108 | def convert_time_to_millis(time_str):
109 |     # Format: [hours:minutes:seconds,milliseconds]
110 |     hours, minutes, seconds, milliseconds = map(int, re.split('[:,]', time_str))
111 |     return (hours * 3600 + minutes * 60 + seconds) * 1000 + milliseconds
112 |
113 | def extract_timestamps(input_text):
114 |     # Use a regular expression to find all timestamp pairs
115 |     timestamps = re.findall(r'\[(\d{2}:\d{2}:\d{2},\d{2,3})\s*-\s*(\d{2}:\d{2}:\d{2},\d{2,3})\]', input_text)
116 |     times_list = []
117 |     print(timestamps)
118 |     # Loop over the extracted timestamps and convert them to milliseconds
119 |     for start_time, end_time in timestamps:
120 |         start_millis = convert_time_to_millis(start_time)
121 |         end_millis = convert_time_to_millis(end_time)
122 |         times_list.append([start_millis, end_millis])
123 |
124 |     return times_list
125 |
126 |
127 | if __name__ == '__main__':
128 |     text = ("1. [00:00:00,500-00:00:05,850] 在我们的设计普惠当中,有一个我经常津津乐道的项目叫寻找远方的美好。"
129 |             "2. [00:00:07,120-00:00:12,940] 啊,在这样一个我们叫寻美在这样的一个项目当中,我们把它跟乡村振兴去结合起来,利用我们的设计的能力。"
130 |             "3. [00:00:13,240-00:00:25,620] 问我们自身员工的设设计能力,我们设计生态伙伴的能力,帮助乡村振兴当中,要希望把他的产品推向市场,把他的农产品把他加工产品推向市场的这样的伙伴做一件事情,")
131 |
132 |     print(extract_timestamps(text))
--------------------------------------------------------------------------------
/funclip/videoclipper.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- encoding: utf-8 -*-
3 | # Copyright FunASR (https://github.com/alibaba-damo-academy/FunClip). All Rights Reserved.
4 | # MIT License (https://opensource.org/licenses/MIT)
5 |
6 | import re
7 | import os
8 | import sys
9 | import copy
10 | import librosa
11 | import logging
12 | import argparse
13 | import numpy as np
14 | import soundfile as sf
15 | from moviepy.editor import *
16 | import moviepy.editor as mpy
17 | from moviepy.video.tools.subtitles import SubtitlesClip, TextClip
18 | from moviepy.editor import VideoFileClip, concatenate_videoclips
19 | from moviepy.video.compositing.CompositeVideoClip import CompositeVideoClip
20 | from utils.subtitle_utils import generate_srt, generate_srt_clip
21 | from utils.argparse_tools import ArgumentParser, get_commandline_args
22 | from utils.trans_utils import pre_proc, proc, write_state, load_state, proc_spk, convert_pcm_to_float
23 |
24 |
25 | class VideoClipper():
26 |     def __init__(self, funasr_model):
27 |         logging.warning("Initializing VideoClipper.")
28 |         self.funasr_model = funasr_model
29 |         self.GLOBAL_COUNT = 0
30 |
31 |     def recog(self, audio_input, sd_switch='no', state=None, hotwords="", output_dir=None):
32 |         if state is None:
33 |             state = {}
34 |         sr, data = audio_input
35 |
36 |         # Convert to float64 consistently (includes data type checking)
37 |         data = convert_pcm_to_float(data)
38 |
39 |         # assert sr == 16000, "16kHz sample rate required, {} given.".format(sr)
40 |         if sr != 16000: # resample with librosa
41 |             data = librosa.resample(data, orig_sr=sr, target_sr=16000)
42 |         if len(data.shape) == 2:  # multi-channel wav input
43 |             logging.warning("Input wav shape: {}, only the first channel is kept.".format(data.shape))
44 |             data = data[:,0]
45 |         state['audio_input'] = (sr, data)
46 |         if sd_switch.lower() == 'yes':  # accept both 'yes' (CLI) and 'Yes' (UI)
47 |             rec_result = self.funasr_model.generate(data,
48 |                                                     return_spk_res=True,
49 |                                                     return_raw_text=True,
50 |                                                     is_final=True,
51 |                                                     output_dir=output_dir,
52 |                                                     hotword=hotwords,
53 |                                                     pred_timestamp=self.lang=='en',
54 |                                                     en_post_proc=self.lang=='en',
55 |                                                     cache={})
56 |             res_srt = generate_srt(rec_result[0]['sentence_info'])
57 |             state['sd_sentences'] = rec_result[0]['sentence_info']
58 |         else:
59 |             rec_result = self.funasr_model.generate(data,
60 |                                                     return_spk_res=False,
61 |                                                     sentence_timestamp=True,
62 |                                                     return_raw_text=True,
63 |                                                     is_final=True,
64 |                                                     hotword=hotwords,
65 |                                                     output_dir=output_dir,
66 |                                                     pred_timestamp=self.lang=='en',
67 |                                                     en_post_proc=self.lang=='en',
68 |                                                     cache={})
69 |             res_srt = generate_srt(rec_result[0]['sentence_info'])
70 |         state['recog_res_raw'] = rec_result[0]['raw_text']
71 |         state['timestamp'] = rec_result[0]['timestamp']
72 |         state['sentences'] = rec_result[0]['sentence_info']
73 |         res_text = rec_result[0]['text']
74 |         return res_text, res_srt, state
75 |
76 |     def clip(self, dest_text, start_ost, end_ost, state, dest_spk=None, output_dir=None, timestamp_list=None):
77 |         # get from state
78 |         audio_input = state['audio_input']
79 |         recog_res_raw = state['recog_res_raw']
80 |         timestamp = state['timestamp']
81 |         sentences = state['sentences']
82 |         sr, data = audio_input
83 |         data = data.astype(np.float64)
84 |
85 |         if timestamp_list is None:
86 |             all_ts = []
87 |             if dest_spk is None or dest_spk == '' or 'sd_sentences' not in state:
88 |                 for _dest_text in dest_text.split('#'):
89 |                     if '[' in _dest_text:
90 |                         match = re.search(r'\[(\d+),\s*(\d+)\]', _dest_text)
91 |                         if match:
92 |                             offset_b, offset_e = map(int, match.groups())
93 |                             log_append = ""
94 |                         else:
95 |                             offset_b, offset_e = 0, 0
96 |                             log_append = "(Bracket detected in dest_text but offset time matching failed)"
97 |                         _dest_text = _dest_text[:_dest_text.find('[')]
98 |                     else:
99 |                         log_append = ""
100 |                         offset_b, offset_e = 0, 0
101 |                     _dest_text = pre_proc(_dest_text)
102 |                     ts = proc(recog_res_raw, timestamp, _dest_text)
103 |                     for _ts in ts: all_ts.append([_ts[0]+offset_b*16, _ts[1]+offset_e*16])
104 |                     if len(ts) > 1 and match:
105 |                         log_append += '(offsets detected but No.{} sub-sentence matched to {} periods in audio, \
106 |                             offsets are applied to all periods)'
107 |             else:
108 |                 for _dest_spk in dest_spk.split('#'):
109 |                     ts = proc_spk(_dest_spk, state['sd_sentences'])
110 |                     for _ts in ts: all_ts.append(_ts)
111 |                 log_append = ""
112 |         else:
113 |             all_ts, log_append = timestamp_list, ""  # initialize log_append so the message below can be built
114 |         ts = all_ts
115 |         # ts.sort()
116 |         srt_index = 0
117 |         clip_srt = ""
118 |         if len(ts):
119 |             start, end = ts[0]
120 |             start = min(max(0, start+start_ost*16), len(data))
121 |             end = min(max(0, end+end_ost*16), len(data))
122 |             res_audio = data[start:end]
123 |             start_end_info = "from {} to {}".format(start/16000, end/16000)
124 |             srt_clip, _, srt_index = generate_srt_clip(sentences, start/16000.0, end/16000.0, begin_index=srt_index)
125 |             clip_srt += srt_clip
126 |             for _ts in ts[1:]: # multiple sentence input or multiple output matched
127 |                 start, end = _ts
128 |                 start = min(max(0, start+start_ost*16), len(data))
129 |                 end = min(max(0, end+end_ost*16), len(data))
130 |                 start_end_info += ", from {} to {}".format(start/16000, end/16000)  # report in seconds, like the first period
131 |                 res_audio = np.concatenate([res_audio, data[start:end]], -1)  # start/end already include the offsets
132 |                 srt_clip, _, srt_index = generate_srt_clip(sentences, start/16000.0, end/16000.0, begin_index=srt_index-1)
133 |                 clip_srt += srt_clip
134 |         if len(ts):
135 |             message = "{} periods found in the speech: ".format(len(ts)) + start_end_info + log_append
136 |         else:
137 |             message = "No period found in the speech, return raw speech. You may check the recognition result and try other destination text."
138 | res_audio = data 139 | return (sr, res_audio), message, clip_srt 140 | 141 | def video_recog(self, video_filename, sd_switch='no', hotwords="", output_dir=None): 142 | video = mpy.VideoFileClip(video_filename) 143 | # Extract the base name, add '_clip.mp4', and 'wav' 144 | if output_dir is not None: 145 | os.makedirs(output_dir, exist_ok=True) 146 | _, base_name = os.path.split(video_filename) 147 | base_name, _ = os.path.splitext(base_name) 148 | clip_video_file = base_name + '_clip.mp4' 149 | audio_file = base_name + '.wav' 150 | audio_file = os.path.join(output_dir, audio_file) 151 | else: 152 | base_name, _ = os.path.splitext(video_filename) 153 | clip_video_file = base_name + '_clip.mp4' 154 | audio_file = base_name + '.wav' 155 | 156 | if video.audio is None: 157 | logging.error("No audio information found.") 158 | sys.exit(1) 159 | 160 | video.audio.write_audiofile(audio_file) 161 | wav = librosa.load(audio_file, sr=16000)[0] 162 | # delete the audio file after processing 163 | if os.path.exists(audio_file): 164 | os.remove(audio_file) 165 | state = { 166 | 'video_filename': video_filename, 167 | 'clip_video_file': clip_video_file, 168 | 'video': video, 169 | } 170 | # res_text, res_srt = self.recog((16000, wav), state) 171 | return self.recog((16000, wav), sd_switch, state, hotwords, output_dir) 172 | 173 | def video_clip(self, 174 | dest_text, 175 | start_ost, 176 | end_ost, 177 | state, 178 | font_size=32, 179 | font_color='white', 180 | add_sub=False, 181 | dest_spk=None, 182 | output_dir=None, 183 | timestamp_list=None): 184 | # get from state 185 | recog_res_raw = state['recog_res_raw'] 186 | timestamp = state['timestamp'] 187 | sentences = state['sentences'] 188 | video = state['video'] 189 | clip_video_file = state['clip_video_file'] 190 | video_filename = state['video_filename'] 191 | 192 | if timestamp_list is None: 193 | all_ts = [] 194 | if dest_spk is None or dest_spk == '' or 'sd_sentences' not in state: 195 | for _dest_text in dest_text.split('#'): 196 | if '[' in _dest_text: 197 | match = re.search(r'\[(\d+),\s*(\d+)\]', _dest_text) 198 | if match: 199 | offset_b, offset_e = map(int, match.groups()) 200 | log_append = "" 201 | else: 202 | offset_b, offset_e = 0, 0 203 | log_append = "(Bracket detected in dest_text but offset time matching failed)" 204 | _dest_text = _dest_text[:_dest_text.find('[')] 205 | else: 206 | offset_b, offset_e = 0, 0 207 | log_append = "" 208 | # import pdb; pdb.set_trace() 209 | _dest_text = pre_proc(_dest_text) 210 | ts = proc(recog_res_raw, timestamp, _dest_text.lower()) 211 | for _ts in ts: all_ts.append([_ts[0]+offset_b*16, _ts[1]+offset_e*16]) 212 | if len(ts) > 1 and match: 213 | log_append += '(offsets detected but No.{} sub-sentence matched to {} periods in audio, \ 214 | offsets are applied to all periods)' 215 | else: 216 | for _dest_spk in dest_spk.split('#'): 217 | ts = proc_spk(_dest_spk, state['sd_sentences']) 218 | for _ts in ts: all_ts.append(_ts) 219 | else: # AI clip pass timestamp as input directly 220 | all_ts = [[i[0]*16.0, i[1]*16.0] for i in timestamp_list] 221 | 222 | srt_index = 0 223 | time_acc_ost = 0.0 224 | ts = all_ts 225 | # ts.sort() 226 | clip_srt = "" 227 | if len(ts): 228 | if self.lang == 'en' and isinstance(sentences, str): 229 | sentences = sentences.split() 230 | start, end = ts[0][0] / 16000, ts[0][1] / 16000 231 | srt_clip, subs, srt_index = generate_srt_clip(sentences, start, end, begin_index=srt_index, time_acc_ost=time_acc_ost) 232 | start, end = start+start_ost/1000.0, end+end_ost/1000.0 
233 | video_clip = video.subclip(start, end) 234 | start_end_info = "from {} to {}".format(start, end) 235 | clip_srt += srt_clip 236 | if add_sub: 237 | generator = lambda txt: TextClip(txt, font='./font/STHeitiMedium.ttc', fontsize=font_size, color=font_color) 238 | subtitles = SubtitlesClip(subs, generator) 239 | video_clip = CompositeVideoClip([video_clip, subtitles.set_pos(('center','bottom'))]) 240 | concate_clip = [video_clip] 241 | time_acc_ost += end+end_ost/1000.0 - (start+start_ost/1000.0) 242 | for _ts in ts[1:]: 243 | start, end = _ts[0] / 16000, _ts[1] / 16000 244 | srt_clip, subs, srt_index = generate_srt_clip(sentences, start, end, begin_index=srt_index-1, time_acc_ost=time_acc_ost) 245 | if not len(subs): 246 | continue 247 | chi_subs = [] 248 | sub_starts = subs[0][0][0] 249 | for sub in subs: 250 | chi_subs.append(((sub[0][0]-sub_starts, sub[0][1]-sub_starts), sub[1])) 251 | start, end = start+start_ost/1000.0, end+end_ost/1000.0 252 | _video_clip = video.subclip(start, end) 253 | start_end_info += ", from {} to {}".format(str(start)[:5], str(end)[:5]) 254 | clip_srt += srt_clip 255 | if add_sub: 256 | generator = lambda txt: TextClip(txt, font='./font/STHeitiMedium.ttc', fontsize=font_size, color=font_color) 257 | subtitles = SubtitlesClip(chi_subs, generator) 258 | _video_clip = CompositeVideoClip([_video_clip, subtitles.set_pos(('center','bottom'))]) 259 | # _video_clip.write_videofile("debug.mp4", audio_codec="aac") 260 | concate_clip.append(copy.copy(_video_clip)) 261 | time_acc_ost += end+end_ost/1000.0 - (start+start_ost/1000.0) 262 | message = "{} periods found in the audio: ".format(len(ts)) + start_end_info 263 | logging.warning("Concating...") 264 | if len(concate_clip) > 1: 265 | video_clip = concatenate_videoclips(concate_clip) 266 | # clip_video_file = clip_video_file[:-4] + '_no{}.mp4'.format(self.GLOBAL_COUNT) 267 | if output_dir is not None: 268 | os.makedirs(output_dir, exist_ok=True) 269 | _, file_with_extension = os.path.split(clip_video_file) 270 | clip_video_file_name, _ = os.path.splitext(file_with_extension) 271 | print(output_dir, clip_video_file) 272 | clip_video_file = os.path.join(output_dir, "{}_no{}.mp4".format(clip_video_file_name, self.GLOBAL_COUNT)) 273 | temp_audio_file = os.path.join(output_dir, "{}_tempaudio_no{}.mp4".format(clip_video_file_name, self.GLOBAL_COUNT)) 274 | else: 275 | clip_video_file = clip_video_file[:-4] + '_no{}.mp4'.format(self.GLOBAL_COUNT) 276 | temp_audio_file = clip_video_file[:-4] + '_tempaudio_no{}.mp4'.format(self.GLOBAL_COUNT) 277 | video_clip.write_videofile(clip_video_file, audio_codec="aac", temp_audiofile=temp_audio_file) 278 | self.GLOBAL_COUNT += 1 279 | else: 280 | clip_video_file = video_filename 281 | message = "No period found in the audio, return raw speech. You may check the recognition result and try other destination text." 
282 |             srt_clip = ''
283 |         return clip_video_file, message, clip_srt
284 |
285 |
286 | def get_parser():
287 |     parser = ArgumentParser(
288 |         description="ClipVideo Argument",
289 |         formatter_class=argparse.ArgumentDefaultsHelpFormatter,
290 |     )
291 |     parser.add_argument(
292 |         "--stage",
293 |         type=int,
294 |         choices=(1, 2),
295 |         help="Stage, 1 for recognizing and 2 for clipping",
296 |         required=True
297 |     )
298 |     parser.add_argument(
299 |         "--file",
300 |         type=str,
301 |         default=None,
302 |         help="Input file path",
303 |         required=True
304 |     )
305 |     parser.add_argument(
306 |         "--sd_switch",
307 |         type=str,
308 |         choices=("no", "yes"),
309 |         default="no",
310 |         help="Turn on the speaker diarization or not",
311 |     )
312 |     parser.add_argument(
313 |         "--output_dir",
314 |         type=str,
315 |         default='./output',
316 |         help="Output files path",
317 |     )
318 |     parser.add_argument(
319 |         "--dest_text",
320 |         type=str,
321 |         default=None,
322 |         help="Destination text string for clipping",
323 |     )
324 |     parser.add_argument(
325 |         "--dest_spk",
326 |         type=str,
327 |         default=None,
328 |         help="Destination spk id for clipping",
329 |     )
330 |     parser.add_argument(
331 |         "--start_ost",
332 |         type=int,
333 |         default=0,
334 |         help="Offset time in ms at beginning for clipping"
335 |     )
336 |     parser.add_argument(
337 |         "--end_ost",
338 |         type=int,
339 |         default=0,
340 |         help="Offset time in ms at ending for clipping"
341 |     )
342 |     parser.add_argument(
343 |         "--output_file",
344 |         type=str,
345 |         default=None,
346 |         help="Output file path"
347 |     )
348 |     parser.add_argument(
349 |         "--lang",
350 |         type=str,
351 |         default='zh',
352 |         help="Language of the input audio, 'zh' or 'en'"
353 |     )
354 |     return parser
355 |
356 |
357 | def runner(stage, file, sd_switch, output_dir, dest_text, dest_spk, start_ost, end_ost, output_file, config=None, lang='zh'):
358 |     audio_suffixs = ['.wav','.mp3','.aac','.m4a','.flac']
359 |     video_suffixs = ['.mp4','.avi','.mkv','.flv','.mov','.webm','.ts','.mpeg']
360 |     _,ext = os.path.splitext(file)
361 |     if ext.lower() in audio_suffixs:
362 |         mode = 'audio'
363 |     elif ext.lower() in video_suffixs:
364 |         mode = 'video'
365 |     else:
366 |         logging.error("Unsupported file format: {}\n\nplease choose one of the following: {}".format(file, audio_suffixs+video_suffixs))
367 |         sys.exit(1) # exit if the file is not supported
368 |     while output_dir.endswith('/'):
369 |         output_dir = output_dir[:-1]
370 |     if not os.path.exists(output_dir):
371 |         os.mkdir(output_dir)
372 |     if stage == 1:
373 |         from funasr import AutoModel
374 |         # initialize funasr automodel
375 |         logging.warning("Initializing modelscope asr pipeline.")
376 |         if lang == 'zh':
377 |             funasr_model = AutoModel(model="iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
378 |                                      vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
379 |                                      punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
380 |                                      spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
381 |                                      )
382 |             audio_clipper = VideoClipper(funasr_model)
383 |             audio_clipper.lang = 'zh'
384 |         elif lang == 'en':
385 |             funasr_model = AutoModel(model="iic/speech_paraformer_asr-en-16k-vocab4199-pytorch",
386 |                                      vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
387 |                                      punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
388 |                                      spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
389 |                                      )
390 |             audio_clipper = VideoClipper(funasr_model)
391 |             audio_clipper.lang = 'en'
392 |         if mode == 'audio':
393 |             logging.warning("Recognizing audio file: {}".format(file))
394 |             wav, sr = librosa.load(file, sr=16000)
395 |             res_text, res_srt, state = audio_clipper.recog((sr, wav), sd_switch)
396 |         if mode == 'video':
397 |             logging.warning("Recognizing video file: {}".format(file))
398 |             res_text, res_srt, state = audio_clipper.video_recog(file, sd_switch)
399 |         total_srt_file = output_dir + '/total.srt'
400 |         with open(total_srt_file, 'w') as fout:
401 |             fout.write(res_srt)
402 |         logging.warning("Write total subtitle to {}".format(total_srt_file))
403 |         write_state(output_dir, state)
404 |         logging.warning("Recognition succeeded. You can copy the text segments from below and use them in stage 2.")
405 |         print(res_text)
406 |     if stage == 2:
407 |         audio_clipper = VideoClipper(None)
408 |         if mode == 'audio':
409 |             state = load_state(output_dir)
410 |             wav, sr = librosa.load(file, sr=16000)
411 |             state['audio_input'] = (sr, wav)
412 |             (sr, audio), message, srt_clip = audio_clipper.clip(dest_text, start_ost, end_ost, state, dest_spk=dest_spk)
413 |             if output_file is None:
414 |                 output_file = output_dir + '/result.wav'
415 |             clip_srt_file = output_file[:-3] + 'srt'
416 |             logging.warning(message)
417 |             sf.write(output_file, audio, 16000)
418 |             assert output_file.endswith('.wav'), "output_file must end with '.wav'"
419 |             logging.warning("Save clipped wav file to {}".format(output_file))
420 |             with open(clip_srt_file, 'w') as fout:
421 |                 fout.write(srt_clip)
422 |             logging.warning("Write clipped subtitle to {}".format(clip_srt_file))
423 |         if mode == 'video':
424 |             state = load_state(output_dir)
425 |             state['video_filename'] = file
426 |             if output_file is None:
427 |                 state['clip_video_file'] = file[:-4] + '_clip.mp4'
428 |             else:
429 |                 state['clip_video_file'] = output_file
430 |             clip_srt_file = state['clip_video_file'][:-3] + 'srt'
431 |             state['video'] = mpy.VideoFileClip(file)
432 |             clip_video_file, message, srt_clip = audio_clipper.video_clip(dest_text, start_ost, end_ost, state, dest_spk=dest_spk)
433 |             logging.warning("Clipping Log: {}".format(message))
434 |             logging.warning("Save clipped mp4 file to {}".format(clip_video_file))
435 |             with open(clip_srt_file, 'w') as fout:
436 |                 fout.write(srt_clip)
437 |             logging.warning("Write clipped subtitle to {}".format(clip_srt_file))
438 |
439 |
440 | def main(cmd=None):
441 |     print(get_commandline_args(), file=sys.stderr)
442 |     parser = get_parser()
443 |     args = parser.parse_args(cmd)
444 |     kwargs = vars(args)
445 |     runner(**kwargs)
446 |
447 |
448 | if __name__ == '__main__':
449 |     main()
450 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | librosa
2 | soundfile
3 | scikit-learn>=1.3.2
4 | funasr>=1.1.2
5 | moviepy==1.0.3
6 | numpy==1.26.4
7 | gradio
8 | modelscope
9 | torch>=1.13
10 | torchaudio
11 | openai
12 | g4f
13 | dashscope
14 | curl_cffi
15 |
--------------------------------------------------------------------------------