├── gptpdf
    ├── __init__.py
    └── parse.py
├── docs
    ├── demo.jpg
    ├── wechat.jpg
    └── develop.md
├── examples
    ├── rh.pdf
    ├── rh
    │   ├── 0_0.png
    │   ├── 0_1.png
    │   ├── 2_0.png
    │   ├── 5_0.png
    │   ├── 5_1.png
    │   ├── 6_0.png
    │   ├── 7_0.png
    │   ├── 8_0.png
    │   ├── 9_0.png
    │   └── output.md
    ├── attention_is_all_you_need.pdf
    ├── attention_is_all_you_need
    │   ├── 0_0.png
    │   ├── 12_0.png
    │   ├── 13_0.png
    │   ├── 14_0.png
    │   ├── 2_0.png
    │   ├── 3_0.png
    │   ├── 3_1.png
    │   ├── 5_0.png
    │   ├── 7_0.png
    │   ├── 8_0.png
    │   ├── 9_0.png
    │   └── output.md
    └── gptpdf_Quick_Tour.ipynb
├── test
    ├── .env.example
    └── test.py
├── .gitignore
├── pyproject.toml
├── LICENSE
├── README_CN.md
└── README.md


/gptpdf/__init__.py:
--------------------------------------------------------------------------------
1 | from .parse import parse_pdf


--------------------------------------------------------------------------------
/docs/demo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/docs/demo.jpg


--------------------------------------------------------------------------------
/docs/wechat.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/docs/wechat.jpg


--------------------------------------------------------------------------------
/examples/rh.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh.pdf


--------------------------------------------------------------------------------
/test/.env.example:
--------------------------------------------------------------------------------
1 | OPENAI_API_KEY=''
2 | OPENAI_API_BASE='https://api.openai.com/v1/'


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | test/.env
2 | gp/*
3 | *.pyc
4 | dist/*
5 | .idea
6 | venv
7 | test_output


--------------------------------------------------------------------------------
/examples/rh/0_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/0_0.png


--------------------------------------------------------------------------------
/examples/rh/0_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/0_1.png


--------------------------------------------------------------------------------
/examples/rh/2_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/2_0.png


--------------------------------------------------------------------------------
/examples/rh/5_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/5_0.png


--------------------------------------------------------------------------------
/examples/rh/5_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/5_1.png


--------------------------------------------------------------------------------
/examples/rh/6_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/6_0.png


--------------------------------------------------------------------------------
/examples/rh/7_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/7_0.png


--------------------------------------------------------------------------------
/examples/rh/8_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/8_0.png


--------------------------------------------------------------------------------
/examples/rh/9_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/rh/9_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need.pdf


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/0_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/0_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/12_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/12_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/13_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/13_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/14_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/14_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/2_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/2_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/3_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/3_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/3_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/3_1.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/5_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/5_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/7_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/7_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/8_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/8_0.png


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/9_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alexfazio/gptpdf/main/examples/attention_is_all_you_need/9_0.png


--------------------------------------------------------------------------------
/docs/develop.md:
--------------------------------------------------------------------------------
 1 | # 发布
 2 | 
 3 | ```bash
 4 | # 发布pip库
 5 | poetry build -f sdist
 6 | poetry publish
 7 | ```
 8 | 
 9 | # 测试
10 | 
11 | ```shell
12 | # 新建python环境
13 | python -m venv gp
14 | source gp/bin/activate
15 | 
16 | # 临时取消python别名 (如果有)
17 | unalias python
18 | 
19 | # 安装依赖
20 | pip install .
21 | 
22 | # 测试
23 | cd test
24 | # 导出环境变量
25 | export $(grep -v '^#' .env | sed 's/^export //g' | xargs)
26 | python test.py
27 | ```


--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
 1 | [tool.poetry]
 2 | name = "gptpdf"
 3 | version = "0.0.15"
 4 | description = "Using GPT to parse PDF"
 5 | authors = ["Chen Li <lichenarthurdata@gmail.com>"]
 6 | license = "Apache 2.0"
 7 | readme = "README.md"
 8 | repository = "https://github.com/CosmosShadow/gptpdf"
 9 | packages = [
10 |     { include = "gptpdf" },
11 | ]
12 | 
13 | [tool.poetry.dependencies]
14 | python = ">=3.8.1,<4.0"
15 | GeneralAgent = "^0.3.21"
16 | shapely = "^2.0.1"
17 | pymupdf = "^1.24.7"
18 | python-dotenv = "^1.0.0"
19 | 
20 | [tool.poetry.group.dev.dependencies]
21 | pytest = "^7.4.3"
22 | pytest-asyncio = "^0.21.1"
23 | 
24 | 
25 | [[tool.poetry.source]]
26 | name = "PyPI"
27 | priority="primary"
28 | 
29 | 
30 | [build-system]
31 | requires = ["poetry-core"]
32 | build-backend = "poetry.core.masonry.api"


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License
 2 | 
 3 | Copyright (c) 2024 Chen Li
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in
13 | all copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
21 | THE SOFTWARE.


--------------------------------------------------------------------------------
/test/test.py:
--------------------------------------------------------------------------------
 1 | import os
 2 | 
 3 | # laod environment variables from .env file
 4 | import dotenv
 5 | dotenv.load_dotenv()
 6 | 
 7 | pdf_path = '../examples/attention_is_all_you_need.pdf'
 8 | output_dir = '../examples/attention_is_all_you_need/'   
 9 | 
10 | pdf_path = '../examples/rh.pdf'
11 | output_dir = '../examples/rh/'
12 | 
13 | # 清空output_dir
14 | # import shutil
15 | # shutil.rmtree(output_dir, ignore_errors=True)
16 | 
17 | def test_use_api_key():
18 |     from gptpdf import parse_pdf
19 |     api_key = os.getenv('OPENAI_API_KEY')
20 |     base_url = os.getenv('OPENAI_API_BASE')
21 |     # Manually provide OPENAI_API_KEY and OPEN_API_BASE
22 |     content, image_paths = parse_pdf(pdf_path, output_dir=output_dir, api_key=api_key, base_url=base_url, model='gpt-4o', gpt_worker=6)
23 |     print(content)
24 |     print(image_paths)
25 |     # also output_dir/output.md is generated
26 | 
27 | 
28 | def test_use_env():
29 |     from gptpdf import parse_pdf
30 |     # Use OPENAI_API_KEY and OPENAI_API_BASE from environment variables
31 |     content, image_paths = parse_pdf(pdf_path, output_dir=output_dir, model='gpt-4o', verbose=True)
32 |     print(content)
33 |     print(image_paths)
34 |     # also output_dir/output.md is generated
35 | 
36 | 
37 | def test_azure():
38 |     from gptpdf import parse_pdf
39 |     api_key = '8ef0b4df45e444079cd5a4xxxxx' # Azure API Key
40 |     base_url = 'https://xxx.openai.azure.com/' # Azure API Base URL
41 |     model = 'azure_xxxx' # azure_ with deploy ID name (not open ai model name), e.g. azure_cpgpt4
42 |     # Use OPENAI_API_KEY and OPENAI_API_BASE from environment variables
43 |     content, image_paths = parse_pdf(pdf_path, output_dir=output_dir, api_key=api_key, base_url=base_url, model=model, verbose=True)
44 |     print(content)
45 |     print(image_paths)
46 | 
47 | def test_qwen_vl_max():
48 |     from gptpdf import parse_pdf
49 |     api_key = 'sk-xxxx'
50 |     base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1"
51 |     # Refer to: https://help.aliyun.com/zh/dashscope/developer-reference/compatibility-of-openai-with-dashscope
52 |     model  = 'qwen-vl-max'
53 |     content, image_paths = parse_pdf(pdf_path, output_dir=output_dir, api_key=api_key, base_url=base_url, model=model, verbose=True, temperature=0.5, max_tokens=1000, top_p=0.9, frequency_penalty=1)
54 |     print(content)
55 |     print(image_paths)
56 | 
57 | 
58 | if __name__ == '__main__':
59 |     # test_use_api_key()
60 |     test_use_env()
61 |     # test_azure()
62 |     # test_qwen_vl_max()


--------------------------------------------------------------------------------
/README_CN.md:
--------------------------------------------------------------------------------
  1 | # gptpdf
  2 | 
  3 | <p align="center">
  4 | <a href="README_CN.md"><img src="https://img.shields.io/badge/文档-中文版-blue.svg" alt="CN doc"></a>
  5 | <a href="README.md"><img src="https://img.shields.io/badge/document-English-blue.svg" alt="EN doc"></a>
  6 | </p>
  7 | 
  8 | 使用视觉大语言模型（如 GPT-4o）将 PDF 解析为 markdown。
  9 | 
 10 | 我们的方法非常简单(只有293行代码)，但几乎可以完美地解析排版、数学公式、表格、图片、图表等。
 11 | 
 12 | 每页平均价格：0.013 美元
 13 | 
 14 | 我们使用 [GeneralAgent](https://github.com/CosmosShadow/GeneralAgent) lib 与 OpenAI API 交互。
 15 | 
 16 | [pdfgpt-ui](https://github.com/daodao97/gptpdf-ui) 是一个基于 gptpdf 的可视化工具。
 17 | 
 18 | ## 处理流程
 19 | 
 20 | 1. 使用 PyMuPDF 库，对 PDF 进行解析出所有非文本区域，并做好标记，比如:
 21 | 
 22 | ![](docs/demo.jpg)
 23 | 
 24 | 2. 使用视觉大模型（如 GPT-4o）进行解析，得到 markdown 文件。
 25 | 
 26 | ## 样例
 27 | 
 28 | 有关
 29 | PDF，请参阅 [examples/attention_is_all_you_need/output.md](examples/attention_is_all_you_need/output.md) [examples/attention_is_all_you_need.pdf](examples/attention_is_all_you_need.pdf)。
 30 | 
 31 | ## 安装
 32 | 
 33 | ```bash
 34 | pip install gptpdf
 35 | ```
 36 | 
 37 | ## 使用
 38 | 
 39 | ```python
 40 | from gptpdf import parse_pdf
 41 | 
 42 | api_key = 'Your OpenAI API Key'
 43 | content, image_paths = parse_pdf(pdf_path, api_key=api_key)
 44 | print(content)
 45 | ```
 46 | 
 47 | 更多内容请见 [test/test.py](test/test.py)
 48 | 
 49 | ## API
 50 | 
 51 | ### parse_pdf
 52 | 
 53 | **函数**：
 54 | 
 55 | ```
 56 | def parse_pdf(
 57 |         pdf_path: str,
 58 |         output_dir: str = './',
 59 |         prompt: Optional[Dict] = None,
 60 |         api_key: Optional[str] = None,
 61 |         base_url: Optional[str] = None,
 62 |         model: str = 'gpt-4o',
 63 |         verbose: bool = False,
 64 |         gpt_worker: int = 1,
 65 |         **args
 66 | ) -> Tuple[str, List[str]]:
 67 | ```
 68 | 
 69 | 将 PDF 文件解析为 Markdown 文件，并返回 Markdown 内容和所有图片路径列表。
 70 | 
 71 | **参数**：
 72 | 
 73 | - **pdf_path**：*str*  
 74 |   PDF 文件路径
 75 | 
 76 | - **output_dir**：*str*，默认值：'./'  
 77 |   输出目录，存储所有图片和 Markdown 文件
 78 | 
 79 | - **api_key**：*Optional[str]*，可选  
 80 |   OpenAI API 密钥。如果未提供，则使用 `OPENAI_API_KEY` 环境变量。
 81 | 
 82 | - **base_url**：*Optional[str]*，可选  
 83 |   OpenAI 基本 URL。如果未提供，则使用 `OPENAI_BASE_URL` 环境变量。可以通过修改该环境变量调用 OpenAI API
 84 |   类接口的其他大模型服务，例如`GLM-4V`。
 85 | 
 86 | - **model**：*str*，默认值：'gpt-4o'。OpenAI API 格式的多模态大模型。如果需要使用其他模型，例如
 87 |     - [qwen-vl-max](https://help.aliyun.com/zh/dashscope/developer-reference/compatibility-of-openai-with-dashscope)
 88 |     - [GLM-4V](https://open.bigmodel.cn/dev/api#glm-4v)
 89 |     - [Yi-Vision](https://platform.lingyiwanwu.com/docs)
 90 |     - Azure OpenAI，通过将 `base_url` 指定为 `https://xxxx.openai.azure.com/` 来使用 Azure OpenAI，`api_key` 是 Azure API
 91 |       密钥，模型类似于 `azure_xxxx`，其中 `xxxx` 是部署的模型名称（已测试）。
 92 | 
 93 | - **verbose**：*bool*，默认值：False，详细模式，开启后会在命令行显示大模型解析的内容。
 94 | 
 95 | - **gpt_worker**：*int*，默认值：1  
 96 |   GPT 解析工作线程数。如果您的机器性能较好，可以适当调高，以提高解析速度。
 97 | 
 98 | - **prompt**: *dict*, 可选，如果您使用的模型与本仓库默认的提示词不匹配，无法发挥出最佳效果，我们支持自定义加入提示词。
 99 |   仓库中，提示词分为三个部分，分别是：
100 |     + `prompt`：主要用于指导模型如何处理和转换图片中的文本内容。
101 |     + `rect_prompt`：用于处理图片中标注了特定区域（例如表格或图片）的情况。
102 |     + `role_prompt`：定义了模型的角色，确保模型理解它在执行PDF文档解析任务。
103 |       您可以用字典的形式传入自定义的提示词，实现对任意提示词的替换，这是一个例子：
104 | 
105 |   ```python
106 |   prompt = {
107 |       "prompt": "自定义提示词语",
108 |       "rect_prompt": "自定义提示词",
109 |       "role_prompt": "自定义提示词"
110 |   }
111 |   
112 |   content, image_paths = parse_pdf(
113 |       pdf_path=pdf_path,
114 |       output_dir='./output',
115 |       model="gpt-4o",
116 |       prompt="",
117 |       verbose=False,
118 |   )
119 |   
120 |   ```
121 |   您不需要替换所有的提示词，如果您没有传入自定义提示词，仓库会自动使用默认的提示词。默认提示词使用的是中文，如果您的PDF文档是英文的，或者您的模型不支持中文，建议您自定义提示词。
122 | 
123 | - **args"": LLM 中其他参数，例如 `temperature`，`max_tokens`, `top_p`, `frequency_penalty`, `presence_penalty` 等。
124 | 
125 | 
126 | ## 加入我们👏🏻
127 | 
128 | 使用微信扫描下方二维码，加入微信群聊，或参与贡献。
129 | 
130 | <p align="center">
131 | <img src="./docs/wechat.jpg" alt="wechat" width=400/>
132 | </p>
133 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # gptpdf
  2 | 
  3 | <p align="center">
  4 | <a href="README_CN.md"><img src="https://img.shields.io/badge/文档-中文版-blue.svg" alt="CN doc"></a>
  5 | <a href="README.md"><img src="https://img.shields.io/badge/document-English-blue.svg" alt="EN doc"></a>
  6 | </p>
  7 | 
  8 | Using VLLM (like GPT-4o) to parse PDF into markdown.
  9 | 
 10 | Our approach is very simple (only 293 lines of code), but can almost perfectly parse typography, math formulas, tables, pictures, charts, etc.
 11 | 
 12 | Average cost per page: $0.013
 13 | 
 14 | This package use [GeneralAgent](https://github.com/CosmosShadow/GeneralAgent) lib to interact with OpenAI API.
 15 | 
 16 | [pdfgpt-ui](https://github.com/daodao97/gptpdf-ui) is a visual tool based on gptpdf.
 17 | 
 18 | 
 19 | 
 20 | ## Process steps
 21 | 
 22 | 1. Use the PyMuPDF library to parse the PDF to find all non-text areas and mark them, for example:
 23 | 
 24 | ![](docs/demo.jpg)
 25 | 
 26 | 2. Use a large visual model (such as GPT-4o) to parse and get a markdown file.
 27 | 
 28 | 
 29 | 
 30 | ## DEMO
 31 | 
 32 | 1. [examples/attention_is_all_you_need/output.md](examples/attention_is_all_you_need/output.md) for PDF [examples/attention_is_all_you_need.pdf](examples/attention_is_all_you_need.pdf).
 33 | 
 34 | 
 35 | 2. [examples/rh/output.md](examples/rh/output.md) for PDF [examples/rh.pdf](examples/rh.pdf).
 36 | 
 37 | 
 38 | ## Installation
 39 | 
 40 | ```bash
 41 | pip install gptpdf
 42 | ```
 43 | 
 44 | 
 45 | 
 46 | ## Usage
 47 | 
 48 | ```python
 49 | from gptpdf import parse_pdf
 50 | api_key = 'Your OpenAI API Key'
 51 | content, image_paths = parse_pdf(pdf_path, api_key=api_key)
 52 | print(content)
 53 | ```
 54 | 
 55 | See more in [test/test.py](test/test.py)
 56 | 
 57 | 
 58 | ## API
 59 | 
 60 | ### parse_pdf
 61 | 
 62 | **Function**: 
 63 | ```
 64 | def parse_pdf(
 65 |         pdf_path: str,
 66 |         output_dir: str = './',
 67 |         prompt: Optional[Dict] = None,
 68 |         api_key: Optional[str] = None,
 69 |         base_url: Optional[str] = None,
 70 |         model: str = 'gpt-4o',
 71 |         verbose: bool = False,
 72 |         gpt_worker: int = 1
 73 | ) -> Tuple[str, List[str]]:
 74 | ```
 75 | 
 76 | Parses a PDF file into a Markdown file and returns the Markdown content along with all image paths.
 77 | 
 78 | **Parameters**:
 79 | 
 80 | - **pdf_path**: *str*  
 81 |   Path to the PDF file
 82 | 
 83 | - **output_dir**: *str*, default: './'  
 84 |   Output directory to store all images and the Markdown file
 85 | 
 86 | - **api_key**: *Optional[str]*, optional  
 87 |   OpenAI API key. If not provided, the `OPENAI_API_KEY` environment variable will be used.
 88 | 
 89 | - **base_url**: *Optional[str]*, optional  
 90 |   OpenAI base URL. If not provided, the `OPENAI_BASE_URL` environment variable will be used. This can be modified to call other large model services with OpenAI API interfaces, such as `GLM-4V`.
 91 | 
 92 | - **model**: *str*, default: 'gpt-4o'  
 93 |   OpenAI API formatted multimodal large model. If you need to use other models, such as:
 94 |   - [qwen-vl-max](https://help.aliyun.com/zh/dashscope/developer-reference/compatibility-of-openai-with-dashscope) 
 95 |   - [GLM-4V](https://open.bigmodel.cn/dev/api#glm-4v)
 96 |   - [Yi-Vision](https://platform.lingyiwanwu.com/docs) 
 97 |   - Azure OpenAI, by setting the `base_url` to `https://xxxx.openai.azure.com/` to use Azure OpenAI, where `api_key` is the Azure API key, and the model is similar to `azure_xxxx`, where `xxxx` is the deployed model name (tested).
 98 | 
 99 | - **verbose**: *bool*, default: False  
100 |   Verbose mode. When enabled, the content parsed by the large model will be displayed in the command line.
101 | 
102 | - **gpt_worker**: *int*, default: 1  
103 |   Number of GPT parsing worker threads. If your machine has better performance, you can increase this value to speed up the parsing.
104 | 
105 | - **prompt**: *dict*, optional  
106 |   If the model you are using does not match the default prompt provided in this repository and cannot achieve the best results, we support adding custom prompts. The prompts in the repository are divided into three parts:
107 |   - `prompt`: Mainly used to guide the model on how to process and convert text content in images.
108 |   - `rect_prompt`: Used to handle cases where specific areas (such as tables or images) are marked in the image.
109 |   - `role_prompt`: Defines the role of the model to ensure the model understands it is performing a PDF document parsing task.
110 | 
111 |   You can pass custom prompts in the form of a dictionary to replace any of the prompts. Here is an example:
112 | 
113 |   ```python
114 |   prompt = {
115 |       "prompt": "Custom prompt text",
116 |       "rect_prompt": "Custom rect prompt",
117 |       "role_prompt": "Custom role prompt"
118 |   }
119 | 
120 |   content, image_paths = parse_pdf(
121 |       pdf_path=pdf_path,
122 |       output_dir='./output',
123 |       model="gpt-4o",
124 |       prompt=prompt,
125 |       verbose=False,
126 |   )
127 | 
128 | - **args"": LLM other parameters, such as `temperature`, `top_p`, `max_tokens`, `presence_penalty`, `frequency_penalty`, etc.
129 |   
130 | ## Join Us 👏🏻
131 | 
132 | Scan the QR code below with WeChat to join our group chat or contribute.
133 | 
134 | <p align="center">
135 | <img src="./docs/wechat.jpg" alt="wechat" width=400/>
136 | </p>
137 | 


--------------------------------------------------------------------------------
/gptpdf/parse.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import re
  3 | from typing import List, Tuple, Optional, Dict
  4 | import logging
  5 | 
  6 | logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
  7 | import fitz  # PyMuPDF
  8 | import shapely.geometry as sg
  9 | from shapely.geometry.base import BaseGeometry
 10 | from shapely.validation import explain_validity
 11 | import concurrent.futures
 12 | 
 13 | # This Default Prompt Using Chinese and could be changed to other languages.
 14 | 
 15 | DEFAULT_PROMPT = """使用markdown语法，将图片中识别到的文字转换为markdown格式输出。你必须做到：
 16 | 1. 输出和使用识别到的图片的相同的语言，例如，识别到英语的字段，输出的内容必须是英语。
 17 | 2. 不要解释和输出无关的文字，直接输出图片中的内容。例如，严禁输出 “以下是我根据图片内容生成的markdown文本：”这样的例子，而是应该直接输出markdown。
 18 | 3. 内容不要包含在```markdown ```中、段落公式使用 $$ $$ 的形式、行内公式使用 $ $ 的形式、忽略掉长直线、忽略掉页码。
 19 | 再次强调，不要解释和输出无关的文字，直接输出图片中的内容。
 20 | """
 21 | DEFAULT_RECT_PROMPT = """图片中用红色框和名称(%s)标注出了一些区域。如果区域是表格或者图片，使用 ![]() 的形式插入到输出内容中，否则直接输出文字内容。
 22 | """
 23 | DEFAULT_ROLE_PROMPT = """你是一个PDF文档解析器，使用markdown和latex语法输出图片的内容。
 24 | """
 25 | 
 26 | 
 27 | def _is_near(rect1: BaseGeometry, rect2: BaseGeometry, distance: float = 20) -> bool:
 28 |     """
 29 |     Check if two rectangles are near each other if the distance between them is less than the target.
 30 |     """
 31 |     return rect1.buffer(0.1).distance(rect2.buffer(0.1)) < distance
 32 | 
 33 | 
 34 | def _is_horizontal_near(rect1: BaseGeometry, rect2: BaseGeometry, distance: float = 100) -> bool:
 35 |     """
 36 |     Check if two rectangles are near horizontally if one of them is a horizontal line.
 37 |     """
 38 |     result = False
 39 |     if abs(rect1.bounds[3] - rect1.bounds[1]) < 0.1 or abs(rect2.bounds[3] - rect2.bounds[1]) < 0.1:
 40 |         if abs(rect1.bounds[0] - rect2.bounds[0]) < 0.1 and abs(rect1.bounds[2] - rect2.bounds[2]) < 0.1:
 41 |             result = abs(rect1.bounds[3] - rect2.bounds[3]) < distance
 42 |     return result
 43 | 
 44 | 
 45 | def _union_rects(rect1: BaseGeometry, rect2: BaseGeometry) -> BaseGeometry:
 46 |     """
 47 |     Union two rectangles.
 48 |     """
 49 |     return sg.box(*(rect1.union(rect2).bounds))
 50 | 
 51 | 
 52 | def _merge_rects(rect_list: List[BaseGeometry], distance: float = 20, horizontal_distance: Optional[float] = None) -> \
 53 |         List[BaseGeometry]:
 54 |     """
 55 |     Merge rectangles in the list if the distance between them is less than the target.
 56 |     """
 57 |     merged = True
 58 |     while merged:
 59 |         merged = False
 60 |         new_rect_list = []
 61 |         while rect_list:
 62 |             rect = rect_list.pop(0)
 63 |             for other_rect in rect_list:
 64 |                 if _is_near(rect, other_rect, distance) or (
 65 |                         horizontal_distance and _is_horizontal_near(rect, other_rect, horizontal_distance)):
 66 |                     rect = _union_rects(rect, other_rect)
 67 |                     rect_list.remove(other_rect)
 68 |                     merged = True
 69 |             new_rect_list.append(rect)
 70 |         rect_list = new_rect_list
 71 |     return rect_list
 72 | 
 73 | 
 74 | def _adsorb_rects_to_rects(source_rects: List[BaseGeometry], target_rects: List[BaseGeometry], distance: float = 10) -> \
 75 |         Tuple[List[BaseGeometry], List[BaseGeometry]]:
 76 |     """
 77 |     Adsorb a set of rectangles to another set of rectangles.
 78 |     """
 79 |     new_source_rects = []
 80 |     for text_area_rect in source_rects:
 81 |         adsorbed = False
 82 |         for index, rect in enumerate(target_rects):
 83 |             if _is_near(text_area_rect, rect, distance):
 84 |                 rect = _union_rects(text_area_rect, rect)
 85 |                 target_rects[index] = rect
 86 |                 adsorbed = True
 87 |                 break
 88 |         if not adsorbed:
 89 |             new_source_rects.append(text_area_rect)
 90 |     return new_source_rects, target_rects
 91 | 
 92 | 
 93 | def _parse_rects(page: fitz.Page) -> List[Tuple[float, float, float, float]]:
 94 |     """
 95 |     Parse drawings in the page and merge adjacent rectangles.
 96 |     """
 97 | 
 98 |     # 提取画的内容
 99 |     drawings = page.get_drawings()
100 | 
101 |     # 忽略掉长度小于30的水平直线
102 |     is_short_line = lambda x: abs(x['rect'][3] - x['rect'][1]) < 1 and abs(x['rect'][2] - x['rect'][0]) < 30
103 |     drawings = [drawing for drawing in drawings if not is_short_line(drawing)]
104 | 
105 |     # 转换为shapely的矩形
106 |     rect_list = [sg.box(*drawing['rect']) for drawing in drawings]
107 | 
108 |     # 提取图片区域
109 |     images = page.get_image_info()
110 |     image_rects = [sg.box(*image['bbox']) for image in images]
111 | 
112 |     # 合并drawings和images
113 |     rect_list += image_rects
114 | 
115 |     merged_rects = _merge_rects(rect_list, distance=10, horizontal_distance=100)
116 |     merged_rects = [rect for rect in merged_rects if explain_validity(rect) == 'Valid Geometry']
117 | 
118 |     # 将大文本区域和小文本区域分开处理: 大文本相小合并，小文本靠近合并
119 |     is_large_content = lambda x: (len(x[4]) / max(1, len(x[4].split('\n')))) > 5
120 |     small_text_area_rects = [sg.box(*x[:4]) for x in page.get_text('blocks') if not is_large_content(x)]
121 |     large_text_area_rects = [sg.box(*x[:4]) for x in page.get_text('blocks') if is_large_content(x)]
122 |     _, merged_rects = _adsorb_rects_to_rects(large_text_area_rects, merged_rects, distance=0.1) # 完全相交
123 |     _, merged_rects = _adsorb_rects_to_rects(small_text_area_rects, merged_rects, distance=5) # 靠近
124 | 
125 |     # 再次自身合并
126 |     merged_rects = _merge_rects(merged_rects, distance=10)
127 | 
128 |     # 过滤比较小的矩形
129 |     merged_rects = [rect for rect in merged_rects if rect.bounds[2] - rect.bounds[0] > 20 and rect.bounds[3] - rect.bounds[1] > 20]
130 | 
131 |     return [rect.bounds for rect in merged_rects]
132 | 
133 | 
134 | def _parse_pdf_to_images(pdf_path: str, output_dir: str = './') -> List[Tuple[str, List[str]]]:
135 |     """
136 |     Parse PDF to images and save to output_dir.
137 |     """
138 |     # 打开PDF文件
139 |     pdf_document = fitz.open(pdf_path)
140 |     image_infos = []
141 | 
142 |     for page_index, page in enumerate(pdf_document):
143 |         logging.info(f'parse page: {page_index}')
144 |         rect_images = []
145 |         rects = _parse_rects(page)
146 |         for index, rect in enumerate(rects):
147 |             fitz_rect = fitz.Rect(rect)
148 |             # 保存页面为图片
149 |             pix = page.get_pixmap(clip=fitz_rect, matrix=fitz.Matrix(4, 4))
150 |             name = f'{page_index}_{index}.png'
151 |             pix.save(os.path.join(output_dir, name))
152 |             rect_images.append(name)
153 |             # # 在页面上绘制红色矩形
154 |             big_fitz_rect = fitz.Rect(fitz_rect.x0 - 1, fitz_rect.y0 - 1, fitz_rect.x1 + 1, fitz_rect.y1 + 1)
155 |             # 空心矩形
156 |             page.draw_rect(big_fitz_rect, color=(1, 0, 0), width=1)
157 |             # 画矩形区域(实心)
158 |             # page.draw_rect(big_fitz_rect, color=(1, 0, 0), fill=(1, 0, 0))
159 |             # 在矩形内的左上角写上矩形的索引name，添加一些偏移量
160 |             text_x = fitz_rect.x0 + 2
161 |             text_y = fitz_rect.y0 + 10
162 |             text_rect = fitz.Rect(text_x, text_y - 9, text_x + 80, text_y + 2)
163 |             # 绘制白色背景矩形
164 |             page.draw_rect(text_rect, color=(1, 1, 1), fill=(1, 1, 1))
165 |             # 插入带有白色背景的文字
166 |             page.insert_text((text_x, text_y), name, fontsize=10, color=(1, 0, 0))
167 |         page_image_with_rects = page.get_pixmap(matrix=fitz.Matrix(3, 3))
168 |         page_image = os.path.join(output_dir, f'{page_index}.png')
169 |         page_image_with_rects.save(page_image)
170 |         image_infos.append((page_image, rect_images))
171 | 
172 |     pdf_document.close()
173 |     return image_infos
174 | 
175 | 
176 | def _gpt_parse_images(
177 |         image_infos: List[Tuple[str, List[str]]],
178 |         prompt_dict: Optional[Dict] = None,
179 |         output_dir: str = './',
180 |         api_key: Optional[str] = None,
181 |         base_url: Optional[str] = None,
182 |         model: str = 'gpt-4o',
183 |         verbose: bool = False,
184 |         gpt_worker: int = 1,
185 |         **args
186 | ) -> str:
187 |     """
188 |     Parse images to markdown content.
189 |     """
190 |     from GeneralAgent import Agent
191 | 
192 |     if isinstance(prompt_dict, dict) and 'prompt' in prompt_dict:
193 |         prompt = prompt_dict['prompt']
194 |         logging.info("prompt is provided, using user prompt.")
195 |     else:
196 |         prompt = DEFAULT_PROMPT
197 |         logging.info("prompt is not provided, using default prompt.")
198 |     if isinstance(prompt_dict, dict) and 'rect_prompt' in prompt_dict:
199 |         rect_prompt = prompt_dict['rect_prompt']
200 |         logging.info("rect_prompt is provided, using user prompt.")
201 |     else:
202 |         rect_prompt = DEFAULT_RECT_PROMPT
203 |         logging.info("rect_prompt is not provided, using default prompt.")
204 |     if isinstance(prompt_dict, dict) and 'role_prompt' in prompt_dict:
205 |         role_prompt = prompt_dict['role_prompt']
206 |         logging.info("role_prompt is provided, using user prompt.")
207 |     else:
208 |         role_prompt = DEFAULT_ROLE_PROMPT
209 |         logging.info("role_prompt is not provided, using default prompt.")
210 | 
211 |     def _process_page(index: int, image_info: Tuple[str, List[str]]) -> Tuple[int, str]:
212 |         logging.info(f'gpt parse page: {index}')
213 |         agent = Agent(role=role_prompt, api_key=api_key, base_url=base_url, disable_python_run=True, model=model, **args)
214 |         page_image, rect_images = image_info
215 |         local_prompt = prompt
216 |         if rect_images:
217 |             local_prompt += rect_prompt + ', '.join(rect_images)
218 |         content = agent.run([local_prompt, {'image': page_image}], display=verbose)
219 |         return index, content
220 | 
221 |     contents = [None] * len(image_infos)
222 |     with concurrent.futures.ThreadPoolExecutor(max_workers=gpt_worker) as executor:
223 |         futures = [executor.submit(_process_page, index, image_info) for index, image_info in enumerate(image_infos)]
224 |         for future in concurrent.futures.as_completed(futures):
225 |             index, content = future.result()
226 | 
227 |             # 在某些情况下大模型还是会输出 ```markdown ```字符串
228 |             if '```markdown' in content:
229 |                 content = content.replace('```markdown\n', '')
230 |                 last_backticks_pos = content.rfind('```')
231 |                 if last_backticks_pos != -1:
232 |                     content = content[:last_backticks_pos] + content[last_backticks_pos + 3:]
233 | 
234 |             contents[index] = content
235 | 
236 |     output_path = os.path.join(output_dir, 'output.md')
237 |     with open(output_path, 'w', encoding='utf-8') as f:
238 |         f.write('\n\n'.join(contents))
239 | 
240 |     return '\n\n'.join(contents)
241 | 
242 | 
243 | def parse_pdf(
244 |         pdf_path: str,
245 |         output_dir: str = './',
246 |         prompt: Optional[Dict] = None,
247 |         api_key: Optional[str] = None,
248 |         base_url: Optional[str] = None,
249 |         model: str = 'gpt-4o',
250 |         verbose: bool = False,
251 |         gpt_worker: int = 1,
252 |         **args
253 | ) -> Tuple[str, List[str]]:
254 |     """
255 |     Parse a PDF file to a markdown file.
256 |     """
257 |     if not os.path.exists(output_dir):
258 |         os.makedirs(output_dir)
259 | 
260 |     image_infos = _parse_pdf_to_images(pdf_path, output_dir=output_dir)
261 |     content = _gpt_parse_images(
262 |         image_infos=image_infos,
263 |         output_dir=output_dir,
264 |         prompt_dict=prompt,
265 |         api_key=api_key,
266 |         base_url=base_url,
267 |         model=model,
268 |         verbose=verbose,
269 |         gpt_worker=gpt_worker,
270 |         **args
271 |     )
272 | 
273 |     all_rect_images = []
274 |     # remove all rect images
275 |     if not verbose:
276 |         for page_image, rect_images in image_infos:
277 |             if os.path.exists(page_image):
278 |                 os.remove(page_image)
279 |             all_rect_images.extend(rect_images)
280 |     return content, all_rect_images
281 | 


--------------------------------------------------------------------------------
/examples/gptpdf_Quick_Tour.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |   "nbformat": 4,
  3 |   "nbformat_minor": 0,
  4 |   "metadata": {
  5 |     "colab": {
  6 |       "private_outputs": true,
  7 |       "provenance": [],
  8 |       "authorship_tag": "ABX9TyPw6K7sJaE/ySw8g6ZjwqpY",
  9 |       "include_colab_link": true
 10 |     },
 11 |     "kernelspec": {
 12 |       "name": "python3",
 13 |       "display_name": "Python 3"
 14 |     },
 15 |     "language_info": {
 16 |       "name": "python"
 17 |     }
 18 |   },
 19 |   "cells": [
 20 |     {
 21 |       "cell_type": "markdown",
 22 |       "metadata": {
 23 |         "id": "view-in-github",
 24 |         "colab_type": "text"
 25 |       },
 26 |       "source": [
 27 |         "<a href=\"https://colab.research.google.com/github/alexfazio/gptpdf/blob/main/examples/gptpdf_Quick_Tour.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
 28 |       ]
 29 |     },
 30 |     {
 31 |       "cell_type": "markdown",
 32 |       "source": [
 33 |         "## Quick Tour of `GPTPDF`\n",
 34 |         "\n",
 35 |         "The example below serves as a quick start guide on how to start using `gptpdf`.\n",
 36 |         "\n",
 37 |         "For a detailed overview of the features, please visit our [documentation page](https://github.com/CosmosShadow/gptpdf)."
 38 |       ],
 39 |       "metadata": {
 40 |         "id": "omUvC33EBFbv"
 41 |       }
 42 |     },
 43 |     {
 44 |       "cell_type": "code",
 45 |       "source": [
 46 |         "# @title 🛠️ Install Requirements\n",
 47 |         "!pip install fitz shapely GeneralAgent shapely pymupdf python-dotenv\n",
 48 |         "!pip install --force-reinstall pymupdf"
 49 |       ],
 50 |       "metadata": {
 51 |         "id": "3ef6XZP_O680",
 52 |         "cellView": "form"
 53 |       },
 54 |       "execution_count": null,
 55 |       "outputs": []
 56 |     },
 57 |     {
 58 |       "cell_type": "code",
 59 |       "execution_count": null,
 60 |       "metadata": {
 61 |         "id": "UR_97-9NO5nH",
 62 |         "cellView": "form"
 63 |       },
 64 |       "outputs": [],
 65 |       "source": [
 66 |         "# @title ⚙️ Function Definitions\n",
 67 |         "\n",
 68 |         "import os\n",
 69 |         "import re\n",
 70 |         "from typing import List, Tuple, Optional, Dict\n",
 71 |         "import logging\n",
 72 |         "\n",
 73 |         "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')\n",
 74 |         "import fitz\n",
 75 |         "import shapely.geometry as sg\n",
 76 |         "from shapely.geometry.base import BaseGeometry\n",
 77 |         "from shapely.validation import explain_validity\n",
 78 |         "import concurrent.futures\n",
 79 |         "\n",
 80 |         "# This Default Prompt Using Chinese and could be changed to other languages.\n",
 81 |         "\n",
 82 |         "DEFAULT_PROMPT = \"\"\"使用markdown语法，将图片中识别到的文字转换为markdown格式输出。你必须做到：\n",
 83 |         "1. 输出和使用识别到的图片的相同的语言，例如，识别到英语的字段，输出的内容必须是英语。\n",
 84 |         "2. 不要解释和输出无关的文字，直接输出图片中的内容。例如，严禁输出 “以下是我根据图片内容生成的markdown文本：”这样的例子，而是应该直接输出markdown。\n",
 85 |         "3. 内容不要包含在```markdown ```中、段落公式使用 $$ $$ 的形式、行内公式使用 $ $ 的形式、忽略掉长直线、忽略掉页码。\n",
 86 |         "再次强调，不要解释和输出无关的文字，直接输出图片中的内容。\n",
 87 |         "\"\"\"\n",
 88 |         "DEFAULT_RECT_PROMPT = \"\"\"图片中用红色框和名称(%s)标注出了一些区域。如果区域是表格或者图片，使用 ![]() 的形式插入到输出内容中，否则直接输出文字内容。\n",
 89 |         "\"\"\"\n",
 90 |         "DEFAULT_ROLE_PROMPT = \"\"\"你是一个PDF文档解析器，使用markdown和latex语法输出图片的内容。\n",
 91 |         "\"\"\"\n",
 92 |         "\n",
 93 |         "\n",
 94 |         "def _is_near(rect1: BaseGeometry, rect2: BaseGeometry, distance: float = 20) -> bool:\n",
 95 |         "    \"\"\"\n",
 96 |         "    Check if two rectangles are near each other if the distance between them is less than the target.\n",
 97 |         "    \"\"\"\n",
 98 |         "    return rect1.buffer(0.1).distance(rect2.buffer(0.1)) < distance\n",
 99 |         "\n",
100 |         "\n",
101 |         "def _is_horizontal_near(rect1: BaseGeometry, rect2: BaseGeometry, distance: float = 100) -> bool:\n",
102 |         "    \"\"\"\n",
103 |         "    Check if two rectangles are near horizontally if one of them is a horizontal line.\n",
104 |         "    \"\"\"\n",
105 |         "    result = False\n",
106 |         "    if abs(rect1.bounds[3] - rect1.bounds[1]) < 0.1 or abs(rect2.bounds[3] - rect2.bounds[1]) < 0.1:\n",
107 |         "        if abs(rect1.bounds[0] - rect2.bounds[0]) < 0.1 and abs(rect1.bounds[2] - rect2.bounds[2]) < 0.1:\n",
108 |         "            result = abs(rect1.bounds[3] - rect2.bounds[3]) < distance\n",
109 |         "    return result\n",
110 |         "\n",
111 |         "\n",
112 |         "def _union_rects(rect1: BaseGeometry, rect2: BaseGeometry) -> BaseGeometry:\n",
113 |         "    \"\"\"\n",
114 |         "    Union two rectangles.\n",
115 |         "    \"\"\"\n",
116 |         "    return sg.box(*(rect1.union(rect2).bounds))\n",
117 |         "\n",
118 |         "\n",
119 |         "def _merge_rects(rect_list: List[BaseGeometry], distance: float = 20, horizontal_distance: Optional[float] = None) -> \\\n",
120 |         "        List[BaseGeometry]:\n",
121 |         "    \"\"\"\n",
122 |         "    Merge rectangles in the list if the distance between them is less than the target.\n",
123 |         "    \"\"\"\n",
124 |         "    merged = True\n",
125 |         "    while merged:\n",
126 |         "        merged = False\n",
127 |         "        new_rect_list = []\n",
128 |         "        while rect_list:\n",
129 |         "            rect = rect_list.pop(0)\n",
130 |         "            for other_rect in rect_list:\n",
131 |         "                if _is_near(rect, other_rect, distance) or (\n",
132 |         "                        horizontal_distance and _is_horizontal_near(rect, other_rect, horizontal_distance)):\n",
133 |         "                    rect = _union_rects(rect, other_rect)\n",
134 |         "                    rect_list.remove(other_rect)\n",
135 |         "                    merged = True\n",
136 |         "            new_rect_list.append(rect)\n",
137 |         "        rect_list = new_rect_list\n",
138 |         "    return rect_list\n",
139 |         "\n",
140 |         "\n",
141 |         "def _adsorb_rects_to_rects(source_rects: List[BaseGeometry], target_rects: List[BaseGeometry], distance: float = 10) -> \\\n",
142 |         "        Tuple[List[BaseGeometry], List[BaseGeometry]]:\n",
143 |         "    \"\"\"\n",
144 |         "    Adsorb a set of rectangles to another set of rectangles.\n",
145 |         "    \"\"\"\n",
146 |         "    new_source_rects = []\n",
147 |         "    for text_area_rect in source_rects:\n",
148 |         "        adsorbed = False\n",
149 |         "        for index, rect in enumerate(target_rects):\n",
150 |         "            if _is_near(text_area_rect, rect, distance):\n",
151 |         "                rect = _union_rects(text_area_rect, rect)\n",
152 |         "                target_rects[index] = rect\n",
153 |         "                adsorbed = True\n",
154 |         "                break\n",
155 |         "        if not adsorbed:\n",
156 |         "            new_source_rects.append(text_area_rect)\n",
157 |         "    return new_source_rects, target_rects\n",
158 |         "\n",
159 |         "\n",
160 |         "def _parse_rects(page: fitz.Page) -> List[Tuple[float, float, float, float]]:\n",
161 |         "    \"\"\"\n",
162 |         "    Parse drawings in the page and merge adjacent rectangles.\n",
163 |         "    \"\"\"\n",
164 |         "\n",
165 |         "    # 提取画的内容\n",
166 |         "    drawings = page.get_drawings()\n",
167 |         "\n",
168 |         "    # 忽略掉长度小于30的水平直线\n",
169 |         "    is_short_line = lambda x: abs(x['rect'][3] - x['rect'][1]) < 1 and abs(x['rect'][2] - x['rect'][0]) < 30\n",
170 |         "    drawings = [drawing for drawing in drawings if not is_short_line(drawing)]\n",
171 |         "\n",
172 |         "    # 转换为shapely的矩形\n",
173 |         "    rect_list = [sg.box(*drawing['rect']) for drawing in drawings]\n",
174 |         "\n",
175 |         "    # 提取图片区域\n",
176 |         "    images = page.get_image_info()\n",
177 |         "    image_rects = [sg.box(*image['bbox']) for image in images]\n",
178 |         "\n",
179 |         "    # 合并drawings和images\n",
180 |         "    rect_list += image_rects\n",
181 |         "\n",
182 |         "    merged_rects = _merge_rects(rect_list, distance=10, horizontal_distance=100)\n",
183 |         "    merged_rects = [rect for rect in merged_rects if explain_validity(rect) == 'Valid Geometry']\n",
184 |         "\n",
185 |         "    # 将大文本区域和小文本区域分开处理: 大文本相小合并，小文本靠近合并\n",
186 |         "    is_large_content = lambda x: (len(x[4]) / max(1, len(x[4].split('\\n')))) > 5\n",
187 |         "    small_text_area_rects = [sg.box(*x[:4]) for x in page.get_text('blocks') if not is_large_content(x)]\n",
188 |         "    large_text_area_rects = [sg.box(*x[:4]) for x in page.get_text('blocks') if is_large_content(x)]\n",
189 |         "    _, merged_rects = _adsorb_rects_to_rects(large_text_area_rects, merged_rects, distance=0.1) # 完全相交\n",
190 |         "    _, merged_rects = _adsorb_rects_to_rects(small_text_area_rects, merged_rects, distance=5) # 靠近\n",
191 |         "\n",
192 |         "    # 再次自身合并\n",
193 |         "    merged_rects = _merge_rects(merged_rects, distance=10)\n",
194 |         "\n",
195 |         "    # 过滤比较小的矩形\n",
196 |         "    merged_rects = [rect for rect in merged_rects if rect.bounds[2] - rect.bounds[0] > 20 and rect.bounds[3] - rect.bounds[1] > 20]\n",
197 |         "\n",
198 |         "    return [rect.bounds for rect in merged_rects]\n",
199 |         "\n",
200 |         "\n",
201 |         "def _parse_pdf_to_images(pdf_path: str, output_dir: str = './') -> List[Tuple[str, List[str]]]:\n",
202 |         "    \"\"\"\n",
203 |         "    Parse PDF to images and save to output_dir.\n",
204 |         "    \"\"\"\n",
205 |         "    # 打开PDF文件\n",
206 |         "    pdf_document = fitz.open(pdf_path)\n",
207 |         "    image_infos = []\n",
208 |         "\n",
209 |         "    for page_index, page in enumerate(pdf_document):\n",
210 |         "        logging.info(f'parse page: {page_index}')\n",
211 |         "        rect_images = []\n",
212 |         "        rects = _parse_rects(page)\n",
213 |         "        for index, rect in enumerate(rects):\n",
214 |         "            fitz_rect = fitz.Rect(rect)\n",
215 |         "            # 保存页面为图片\n",
216 |         "            pix = page.get_pixmap(clip=fitz_rect, matrix=fitz.Matrix(4, 4))\n",
217 |         "            name = f'{page_index}_{index}.png'\n",
218 |         "            pix.save(os.path.join(output_dir, name))\n",
219 |         "            rect_images.append(name)\n",
220 |         "            # # 在页面上绘制红色矩形\n",
221 |         "            big_fitz_rect = fitz.Rect(fitz_rect.x0 - 1, fitz_rect.y0 - 1, fitz_rect.x1 + 1, fitz_rect.y1 + 1)\n",
222 |         "            # 空心矩形\n",
223 |         "            page.draw_rect(big_fitz_rect, color=(1, 0, 0), width=1)\n",
224 |         "            # 画矩形区域(实心)\n",
225 |         "            # page.draw_rect(big_fitz_rect, color=(1, 0, 0), fill=(1, 0, 0))\n",
226 |         "            # 在矩形内的左上角写上矩形的索引name，添加一些偏移量\n",
227 |         "            text_x = fitz_rect.x0 + 2\n",
228 |         "            text_y = fitz_rect.y0 + 10\n",
229 |         "            text_rect = fitz.Rect(text_x, text_y - 9, text_x + 80, text_y + 2)\n",
230 |         "            # 绘制白色背景矩形\n",
231 |         "            page.draw_rect(text_rect, color=(1, 1, 1), fill=(1, 1, 1))\n",
232 |         "            # 插入带有白色背景的文字\n",
233 |         "            page.insert_text((text_x, text_y), name, fontsize=10, color=(1, 0, 0))\n",
234 |         "        page_image_with_rects = page.get_pixmap(matrix=fitz.Matrix(3, 3))\n",
235 |         "        page_image = os.path.join(output_dir, f'{page_index}.png')\n",
236 |         "        page_image_with_rects.save(page_image)\n",
237 |         "        image_infos.append((page_image, rect_images))\n",
238 |         "\n",
239 |         "    pdf_document.close()\n",
240 |         "    return image_infos\n",
241 |         "\n",
242 |         "\n",
243 |         "def _gpt_parse_images(\n",
244 |         "        image_infos: List[Tuple[str, List[str]]],\n",
245 |         "        prompt_dict: Optional[Dict] = None,\n",
246 |         "        output_dir: str = './',\n",
247 |         "        api_key: Optional[str] = None,\n",
248 |         "        base_url: Optional[str] = None,\n",
249 |         "        model: str = 'gpt-4o',\n",
250 |         "        verbose: bool = False,\n",
251 |         "        gpt_worker: int = 1,\n",
252 |         "        **args\n",
253 |         ") -> str:\n",
254 |         "    \"\"\"\n",
255 |         "    Parse images to markdown content.\n",
256 |         "    \"\"\"\n",
257 |         "    from GeneralAgent import Agent\n",
258 |         "\n",
259 |         "    if isinstance(prompt_dict, dict) and 'prompt' in prompt_dict:\n",
260 |         "        prompt = prompt_dict['prompt']\n",
261 |         "        logging.info(\"prompt is provided, using user prompt.\")\n",
262 |         "    else:\n",
263 |         "        prompt = DEFAULT_PROMPT\n",
264 |         "        logging.info(\"prompt is not provided, using default prompt.\")\n",
265 |         "    if isinstance(prompt_dict, dict) and 'rect_prompt' in prompt_dict:\n",
266 |         "        rect_prompt = prompt_dict['rect_prompt']\n",
267 |         "        logging.info(\"rect_prompt is provided, using user prompt.\")\n",
268 |         "    else:\n",
269 |         "        rect_prompt = DEFAULT_RECT_PROMPT\n",
270 |         "        logging.info(\"rect_prompt is not provided, using default prompt.\")\n",
271 |         "    if isinstance(prompt_dict, dict) and 'role_prompt' in prompt_dict:\n",
272 |         "        role_prompt = prompt_dict['role_prompt']\n",
273 |         "        logging.info(\"role_prompt is provided, using user prompt.\")\n",
274 |         "    else:\n",
275 |         "        role_prompt = DEFAULT_ROLE_PROMPT\n",
276 |         "        logging.info(\"role_prompt is not provided, using default prompt.\")\n",
277 |         "\n",
278 |         "    def _process_page(index: int, image_info: Tuple[str, List[str]]) -> Tuple[int, str]:\n",
279 |         "        logging.info(f'gpt parse page: {index}')\n",
280 |         "        agent = Agent(role=role_prompt, api_key=api_key, base_url=base_url, disable_python_run=True, model=model, **args)\n",
281 |         "        page_image, rect_images = image_info\n",
282 |         "        local_prompt = prompt\n",
283 |         "        if rect_images:\n",
284 |         "            local_prompt += rect_prompt + ', '.join(rect_images)\n",
285 |         "        content = agent.run([local_prompt, {'image': page_image}], display=verbose)\n",
286 |         "        return index, content\n",
287 |         "\n",
288 |         "    contents = [None] * len(image_infos)\n",
289 |         "    with concurrent.futures.ThreadPoolExecutor(max_workers=gpt_worker) as executor:\n",
290 |         "        futures = [executor.submit(_process_page, index, image_info) for index, image_info in enumerate(image_infos)]\n",
291 |         "        for future in concurrent.futures.as_completed(futures):\n",
292 |         "            index, content = future.result()\n",
293 |         "\n",
294 |         "            # 在某些情况下大模型还是会输出 ```markdown ```字符串\n",
295 |         "            if '```markdown' in content:\n",
296 |         "                content = content.replace('```markdown\\n', '')\n",
297 |         "                last_backticks_pos = content.rfind('```')\n",
298 |         "                if last_backticks_pos != -1:\n",
299 |         "                    content = content[:last_backticks_pos] + content[last_backticks_pos + 3:]\n",
300 |         "\n",
301 |         "            contents[index] = content\n",
302 |         "\n",
303 |         "    output_path = os.path.join(output_dir, 'output.md')\n",
304 |         "    with open(output_path, 'w', encoding='utf-8') as f:\n",
305 |         "        f.write('\\n\\n'.join(contents))\n",
306 |         "\n",
307 |         "    return '\\n\\n'.join(contents)\n",
308 |         "\n",
309 |         "\n",
310 |         "def parse_pdf(\n",
311 |         "        pdf_path: str,\n",
312 |         "        output_dir: str = './',\n",
313 |         "        prompt: Optional[Dict] = None,\n",
314 |         "        api_key: Optional[str] = None,\n",
315 |         "        base_url: Optional[str] = None,\n",
316 |         "        model: str = 'gpt-4o',\n",
317 |         "        verbose: bool = False,\n",
318 |         "        gpt_worker: int = 1,\n",
319 |         "        **args\n",
320 |         ") -> Tuple[str, List[str]]:\n",
321 |         "    \"\"\"\n",
322 |         "    Parse a PDF file to a markdown file.\n",
323 |         "    \"\"\"\n",
324 |         "    if not os.path.exists(output_dir):\n",
325 |         "        os.makedirs(output_dir)\n",
326 |         "\n",
327 |         "    image_infos = _parse_pdf_to_images(pdf_path, output_dir=output_dir)\n",
328 |         "    content = _gpt_parse_images(\n",
329 |         "        image_infos=image_infos,\n",
330 |         "        output_dir=output_dir,\n",
331 |         "        prompt_dict=prompt,\n",
332 |         "        api_key=api_key,\n",
333 |         "        base_url=base_url,\n",
334 |         "        model=model,\n",
335 |         "        verbose=verbose,\n",
336 |         "        gpt_worker=gpt_worker,\n",
337 |         "        **args\n",
338 |         "    )\n",
339 |         "\n",
340 |         "    all_rect_images = []\n",
341 |         "    # remove all rect images\n",
342 |         "    if not verbose:\n",
343 |         "        for page_image, rect_images in image_infos:\n",
344 |         "            if os.path.exists(page_image):\n",
345 |         "                os.remove(page_image)\n",
346 |         "            all_rect_images.extend(rect_images)\n",
347 |         "    return content, all_rect_images\n"
348 |       ]
349 |     },
350 |     {
351 |       "cell_type": "code",
352 |       "source": [
353 |         "# @title ⬇📕 Download Sample `.PDF` Document\n",
354 |         "\n",
355 |         "import os\n",
356 |         "import requests\n",
357 |         "\n",
358 |         "# Sample .pdf data from Arxiv\n",
359 |         "pdf_url = 'https://arxiv.org/pdf/2310.10035.pdf' # @param {type:\"string\"}\n",
360 |         "response = requests.get(pdf_url)\n",
361 |         "\n",
362 |         "# Extract the filename from the URL\n",
363 |         "filename = os.path.basename(pdf_url)\n",
364 |         "\n",
365 |         "# Specify the file path in the current working directory\n",
366 |         "pdf_file_path = os.path.join(os.getcwd(), filename)\n",
367 |         "\n",
368 |         "# Save the file, overwriting if it already exists\n",
369 |         "with open(pdf_file_path, 'wb') as file:\n",
370 |         "    file.write(response.content)\n",
371 |         "\n",
372 |         "print(f\"PDF file downloaded and saved to: {pdf_file_path}\")\n"
373 |       ],
374 |       "metadata": {
375 |         "id": "6j-nqwINSFgI",
376 |         "cellView": "form"
377 |       },
378 |       "execution_count": null,
379 |       "outputs": []
380 |     },
381 |     {
382 |       "cell_type": "code",
383 |       "source": [
384 |         "# @title 📂 Create the necessary directories\n",
385 |         "# Create a folder named 'new_folder' in the current working directory\n",
386 |         "os.makedirs('output', exist_ok=True)"
387 |       ],
388 |       "metadata": {
389 |         "id": "0o9kHCK9TPBo"
390 |       },
391 |       "execution_count": null,
392 |       "outputs": []
393 |     },
394 |     {
395 |       "cell_type": "code",
396 |       "source": [
397 |         "# @title 🚀 Define vars & Kickoff\n",
398 |         "result = parse_pdf(\n",
399 |         "    pdf_path=pdf_file_path,\n",
400 |         "    output_dir=\"./output\",\n",
401 |         "    api_key=\"insert-api-key-here\", # @param {type:\"string\"} - String Field\n",
402 |         "    model=\"gpt-4o\",\n",
403 |         "    verbose=True,\n",
404 |         "    gpt_worker=2\n",
405 |         ")"
406 |       ],
407 |       "metadata": {
408 |         "id": "sw10v-1gTo0N"
409 |       },
410 |       "execution_count": null,
411 |       "outputs": []
412 |     },
413 |     {
414 |       "cell_type": "code",
415 |       "source": [
416 |         "# @title 🖨️ Print Result\n",
417 |         "\n",
418 |         "from IPython.display import Markdown\n",
419 |         "Markdown(result[0])"
420 |       ],
421 |       "metadata": {
422 |         "id": "ENOP8q96DRsP"
423 |       },
424 |       "execution_count": null,
425 |       "outputs": []
426 |     }
427 |   ]
428 | }


--------------------------------------------------------------------------------
/examples/attention_is_all_you_need/output.md:
--------------------------------------------------------------------------------
  1 | ![](0_0.png)
  2 | 
  3 | Ashish Vaswani\*
  4 | Google Brain
  5 | avaswani@google.com
  6 | 
  7 | Noam Shazeer\*
  8 | Google Brain
  9 | noam@google.com
 10 | 
 11 | Niki Parmar\*
 12 | Google Research
 13 | nikip@google.com
 14 | 
 15 | Jakob Uszkoreit\*
 16 | Google Research
 17 | usz@google.com
 18 | 
 19 | Llion Jones\*
 20 | Google Research
 21 | llion@google.com
 22 | 
 23 | Aidan N. Gomez\* †
 24 | University of Toronto
 25 | aidan@cs.toronto.edu
 26 | 
 27 | Łukasz Kaiser\*
 28 | Google Brain
 29 | lukaszkaiser@google.com
 30 | 
 31 | Illia Polosukhin\* ‡
 32 | illia.polosukhin@gmail.com
 33 | 
 34 | ## Abstract
 35 | 
 36 | The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
 37 | 
 38 | \*Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with model novel variants, was responsible for our initial codebase, and efficient inference and visualizations. Łukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
 39 | 
 40 | †Work performed while at Google Brain.
 41 | 
 42 | ‡Work performed while at Google Research.
 43 | 
 44 | 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
 45 | 
 46 | # 1 Introduction
 47 | 
 48 | Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
 49 | 
 50 | Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states $h_t$, as a function of the previous hidden state $h_{t-1}$ and the input for position $t$. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
 51 | 
 52 | Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
 53 | 
 54 | In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
 55 | 
 56 | # 2 Background
 57 | 
 58 | The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
 59 | 
 60 | Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
 61 | 
 62 | End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
 63 | 
 64 | To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].
 65 | 
 66 | # 3 Model Architecture
 67 | 
 68 | Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations $(x_1, ..., x_n)$ to a sequence of continuous representations $z = (z_1, ..., z_n)$. Given $z$, the decoder then generates an output sequence $(y_1, ..., y_m)$ of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
 69 | 
 70 | ### 3.1 Encoder and Decoder Stacks
 71 | 
 72 | **Encoder:** The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is $LayerNorm(x + Sublayer(x))$, where $Sublayer(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{model} = 512$.
 73 | 
 74 | **Decoder:** The decoder is also composed of a stack of $N = 6$ identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position $i$ can depend only on the known outputs at positions less than $i$.
 75 | 
 76 | ### 3.2 Attention
 77 | 
 78 | An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
 79 | 
 80 | ![](2_0.png)
 81 | 
 82 | Figure 1: The Transformer - model architecture.
 83 | 
 84 | The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
 85 | 
 86 | ### Scaled Dot-Product Attention
 87 | 
 88 | ![](3_0.png) ![](3_1.png)
 89 | 
 90 | Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
 91 | 
 92 | of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
 93 | 
 94 | #### 3.2.1 Scaled Dot-Product Attention
 95 | 
 96 | We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
 97 | 
 98 | In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:
 99 | 
100 | $$
101 | \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
102 | $$
103 | 
104 | The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
105 | 
106 | While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$ [3]. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.
107 | 
108 | #### 3.2.2 Multi-Head Attention
109 | 
110 | Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional
111 | 
112 | Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.
113 | 
114 | $$
115 | \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
116 | $$
117 | 
118 | where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$
119 | 
120 | Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.
121 | 
122 | In this work we employ $h = 8$ parallel attention layers, or heads. For each of these we use $d_k = d_v = d_{model}/h = 64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
123 | 
124 | ### 3.2.3 Applications of Attention in our Model
125 | 
126 | The Transformer uses multi-head attention in three different ways:
127 | 
128 | - In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
129 | - The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
130 | - Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections. See Figure 2.
131 | 
132 | ### 3.3 Position-wise Feed-Forward Networks
133 | 
134 | In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.
135 | 
136 | $$
137 | \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
138 | $$
139 | 
140 | While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{model} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.
141 | 
142 | ### 3.4 Embeddings and Softmax
143 | 
144 | Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.
145 | 
146 | Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. $n$ is the sequence length, $d$ is the representation dimension, $k$ is the kernel size of convolutions and $r$ the size of the neighborhood in restricted self-attention.
147 | 
148 | ![](5_0.png)
149 | 
150 | ### 3.5 Positional Encoding
151 | 
152 | Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].
153 | 
154 | In this work, we use sine and cosine functions of different frequencies:
155 | 
156 | $$
157 | PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})
158 | $$
159 | 
160 | $$
161 | PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})
162 | $$
163 | 
164 | where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{(pos+k)}$ can be represented as a linear function of $PE_{pos}$.
165 | 
166 | We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
167 | 
168 | ### 4 Why Self-Attention
169 | 
170 | In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations $(x_1, ..., x_n)$ to another sequence of equal length $(z_1, ..., z_n)$, with $z_i \in \mathbb{R}^d$, such as a hidden layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.
171 | 
172 | One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
173 | 
174 | The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies [12]. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
175 | 
176 | As noted in Table 1, a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence
177 | 
178 | ## 5 Training
179 | 
180 | This section describes the training regime for our models.
181 | 
182 | ### 5.1 Training Data and Batching
183 | 
184 | We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding [3], which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary [38]. Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.
185 | 
186 | ### 5.2 Hardware and Schedule
187 | 
188 | We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models (described on the bottom line of table 3), step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).
189 | 
190 | ### 5.3 Optimizer
191 | 
192 | We used the Adam optimizer [20] with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We varied the learning rate over the course of training, according to the formula:
193 | 
194 | $$
195 | lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot warmup\_steps^{-1.5})
196 | $$
197 | 
198 | This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps = 4000$.
199 | 
200 | ### 5.4 Regularization
201 | 
202 | We employ three types of regularization during training:
203 | 
204 | Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
205 | 
206 | ![](7_0.png)
207 | 
208 | **Residual Dropout** We apply dropout [33] to the output of each sub-layer, before it is added to the sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$.
209 | 
210 | **Label Smoothing** During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.
211 | 
212 | ## 6 Results
213 | 
214 | ### 6.1 Machine Translation
215 | 
216 | On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
217 | 
218 | On the WMT 2014 English-to-French translation task, our big model achieves a BLEU score of 41.0, outperforming all of the previously published single models, at less than 1/4 the training cost of the previous state-of-the-art model. The Transformer (big) model trained for English-to-French used dropout rate $P_{drop} = 0.1$, instead of 0.3.
219 | 
220 | For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints. We used beam search with a beam size of 4 and length penalty $\alpha = 0.6$ [38]. These hyperparameters were chosen after experimentation on the development set. We set the maximum output length during inference to input length + 50, but terminate early when possible [38].
221 | 
222 | Table 2 summarizes our results and compares our translation quality and training costs to other model architectures from the literature. We estimate the number of floating point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU.
223 | 
224 | ### 6.2 Model Variations
225 | 
226 | To evaluate the importance of different components of the Transformer, we varied our base model in different ways, measuring the change in performance on English-to-German translation on the
227 | 
228 | ---
229 | 
230 | ^5 We used values of 2.8, 3.7, 6.0 and 9.5 TFLOPS for K80, K40, M40 and P100, respectively.
231 | 
232 | Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
233 | 
234 | ![](8_0.png)
235 | 
236 | development set, newstest2013. We used beam search as described in the previous section, but no checkpoint averaging. We present these results in Table 3.
237 | 
238 | In Table 3 rows (A), we vary the number of attention heads and the attention key and value dimensions, keeping the amount of computation constant, as described in Section 3.2.2. While single-head attention is 0.9 BLEU worse than the best setting, quality also drops off with too many heads.
239 | 
240 | In Table 3 rows (B), we observe that reducing the attention key size $d_k$ hurts model quality. This suggests that determining compatibility is not easy and that a more sophisticated compatibility function than dot product may be beneficial. We further observe in rows (C) and (D) that, as expected, bigger models are better, and dropout is very helpful in avoiding over-fitting. In row (E) we replace our sinusoidal positional encoding with learned positional embeddings [9], and observe nearly identical results to the base model.
241 | 
242 | ### 6.3 English Constituency Parsing
243 | 
244 | To evaluate if the Transformer can generalize to other tasks we performed experiments on English constituency parsing. This task presents specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Furthermore, RNN sequence-to-sequence models have not been able to attain state-of-the-art results in small-data regimes [37].
245 | 
246 | We trained a 4-layer transformer with $d_{model} = 1024$ on the Wall Street Journal (WSJ) portion of the Penn Treebank [25], about 40K training sentences. We also trained it in a semi-supervised setting, using the larger high-confidence and BerkleyParser corpora from with approximately 17M sentences [37]. We used a vocabulary of 16K tokens for the WSJ only setting and a vocabulary of 32K tokens for the semi-supervised setting.
247 | 
248 | We performed only a small number of experiments to select the dropout, both attention and residual (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters remained unchanged from the English-to-German base translation model. During inference, we
249 | 
250 | Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)
251 | 
252 | ![](9_0.png)
253 | 
254 | increased the maximum output length to input length + 300. We used a beam size of 21 and $\alpha = 0.3$ for both WSJ only and the semi-supervised setting.
255 | 
256 | Our results in Table 4 show that despite the lack of task-specific tuning our model performs surprisingly well, yielding better results than all previously reported models with the exception of the Recurrent Neural Network Grammar [8].
257 | 
258 | In contrast to RNN sequence-to-sequence models [37], the Transformer outperforms the Berkeley-Parser [29] even when training only on the WSJ training set of 40K sentences.
259 | 
260 | ## 7 Conclusion
261 | 
262 | In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
263 | 
264 | For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
265 | 
266 | We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goals of ours.
267 | 
268 | The code we used to train and evaluate our models is available at [https://github.com/tensorflow/tensor2tensor](https://github.com/tensorflow/tensor2tensor).
269 | 
270 | ## Acknowledgements
271 | 
272 | We are grateful to Nal Kalchbrenner and Stephan Gouws for their fruitful comments, corrections and inspiration.
273 | 
274 | ## References
275 | 
276 | [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. *arXiv preprint arXiv:1607.06450*, 2016.
277 | 
278 | [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. *CoRR, abs/1409.0473*, 2014.
279 | 
280 | [3] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc V Le. Massive exploration of neural machine translation architectures. *CoRR, abs/1703.03906*, 2017.
281 | 
282 | [4] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. *arXiv preprint arXiv:1601.06733*, 2016.
283 | 
284 | [5] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
285 | 
286 | [6] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
287 | 
288 | [7] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
289 | 
290 | [8] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proc. of NAACL, 2016.
291 | 
292 | [9] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.
293 | 
294 | [10] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
295 | 
296 | [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
297 | 
298 | [12] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
299 | 
300 | [13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
301 | 
302 | [14] Zhongqiang Huang and Mary Harper. Self-training PCFG grammars with latent annotations across languages. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 832–841. ACL, August 2009.
303 | 
304 | [15] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
305 | 
306 | [16] Łukasz Kaiser and Samy Bengio. Can active memory replace attention? In Advances in Neural Information Processing Systems, (NIPS), 2016.
307 | 
308 | [17] Łukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In International Conference on Learning Representations (ICLR), 2016.
309 | 
310 | [18] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099v2, 2017.
311 | 
312 | [19] Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. Structured attention networks. In International Conference on Learning Representations, 2017.
313 | 
314 | [20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
315 | 
316 | [21] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. arXiv preprint arXiv:1703.10722, 2017.
317 | 
318 | [22] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
319 | 
320 | [23] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114, 2015.
321 | 
322 | [24] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
323 | 
324 | [25] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. *Computational linguistics*, 19(2):313–330, 1993.
325 | 
326 | [26] David McClosky, Eugene Charniak, and Mark Johnson. Effective self-training for parsing. In *Proceedings of the Human Language Technology Conference of the NAACL, Main Conference*, pages 152–159. ACL, June 2006.
327 | 
328 | [27] Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. A decomposable attention model. In *Empirical Methods in Natural Language Processing*, 2016.
329 | 
330 | [28] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. *arXiv preprint arXiv:1705.04304*, 2017.
331 | 
332 | [29] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In *Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL*, pages 433–440. ACL, July 2006.
333 | 
334 | [30] Ofir Press and Lior Wolf. Using the output embedding to improve language models. *arXiv preprint arXiv:1608.05859*, 2016.
335 | 
336 | [31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. *arXiv preprint arXiv:1508.07909*, 2015.
337 | 
338 | [32] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. *arXiv preprint arXiv:1701.06538*, 2017.
339 | 
340 | [33] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. *Journal of Machine Learning Research*, 15(1):1929–1958, 2014.
341 | 
342 | [34] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, *Advances in Neural Information Processing Systems 28*, pages 2440–2448. Curran Associates, Inc., 2015.
343 | 
344 | [35] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural networks. In *Advances in Neural Information Processing Systems*, pages 3104–3112, 2014.
345 | 
346 | [36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. *CoRR*, abs/1512.00567, 2015.
347 | 
348 | [37] Vinyals & Kaiser, Koo, Petrov, Sutskever, and Hinton. Grammar as a foreign language. In *Advances in Neural Information Processing Systems*, 2015.
349 | 
350 | [38] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. *arXiv preprint arXiv:1609.08144*, 2016.
351 | 
352 | [39] Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. Deep recurrent models with fast-forward connections for neural machine translation. *CoRR*, abs/1606.04199, 2016.
353 | 
354 | [40] Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. Fast and accurate shift-reduce constituent parsing. In *Proceedings of the 51st Annual Meeting of the ACL (Volume 1: Long Papers)*, pages 434–443. ACL, August 2013.
355 | 
356 | ![](12_0.png)
357 | 
358 | Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads. Best viewed in color.
359 | 
360 | ![](13_0.png)
361 | 
362 | ![](14_0.png)


--------------------------------------------------------------------------------
/examples/rh/output.md:
--------------------------------------------------------------------------------
  1 | ![](0_0.png)
  2 | 
  3 | # Regioselective hydroxylation of cholecalciferol, cholesterol and other sterol derivatives by steroid C25 dehydrogenase
  4 | 
  5 | ![](0_1.png)
  6 | 
  7 | Received: 14 April 2016 / Revised: 25 August 2016 / Accepted: 20 September 2016 / Published online: 11 October 2016  
  8 | © Springer-Verlag Berlin Heidelberg 2016
  9 | 
 10 | **Abstract** Steroid C25 dehydrogenase (S25DH) from *Sterolibacterium denitrificans* Chol-1S is a molybdenum oxido-reductase belonging to the so-called ethylbenzene dehydrogenase (EBDH)-like subclass of DMSO reductases capable of the regioselective hydroxylation of cholesterol or cholecalciferol to 25-hydroxy products. Both products are important biologically active molecules: 25-hydroxycholesterol is responsible for a complex regulatory function in the immunological system, while 25-hydroxycholecalciferol (calcifediol) is the activated form of vitamin D$_3$ used in the treatment of rickets and other calcium disorders. Studies revealed that the optimal enzymatic synthesis proceeds in fed-batch reactors under anaerobic conditions, with 6–9 % (w/v) 2-hydroxypropyl-β-cyclodextrin as solubilizer and 1.25–5 % (v/v) 2-methoxyethanol as an organic co-solvent, both adjusted to the substrate type, and 8–15 mM K$_3$[Fe(CN)$_6$] as an electron acceptor. Such thorough optimization of the reaction conditions resulted in high product concentrations: 0.8 g/L for 25-hydroxycholesterol, 1.4 g/L for calcifediol and 2.2 g/L for 25-hydroxy-3-ketosterols. Although the purification protocol yields approximately 2.3 mg of pure S25DH from 30 g of wet cell mass (specific activity of 14 nmol min$^{-1}$ mg$^{-1}$), the non-purified crude extract or enzyme preparation can be readily used for the regioselective hydroxylation of both cholesterol and cholecalciferol. On the other hand, pure S25DH can be efficiently immobilized either on powder or a monolithic silica support functionalized with an organic linker providing NH$_2$ groups for enzyme covalent binding. Although such immobilization reduced the enzyme initial activity to less than twofold it extended S25DH catalytic lifetime under working conditions at least 3.5 times.
 11 | 
 12 | **Keywords** Calcifediol · 25-hydroxycholesterol · Regioselective hydroxylation · Sterol · Steroid C25 dehydrogenase · Molybdenum enzyme
 13 | 
 14 | **Electronic supplementary material** The online version of this article (doi:10.1007/s00253-016-7882-2) contains supplementary material, which is available to authorized users.
 15 | 
 16 | **M. Szaleniec**  
 17 | ncszalen@cyf-kr.edu.pl
 18 | 
 19 | 1 Jerzy Haber Institute of Catalysis and Surface Chemistry, Polish Academy of Sciences, Niezapominajek 8, PL-30239 Krakow, Poland  
 20 | 2 Institute of Pharmacology, Polish Academy of Sciences, Smetna 12, 31343 Krakow, Poland  
 21 | 3 Department of Organic Chemistry, Faculty of Chemistry, Jagiellonian University, Ingardena 3, 30060 Krakow, Poland  
 22 | 4 Department of Chemical Engineering, Silesian University of Technology, Ks. M. Strzody 7, 44100 Gliwice, Poland  
 23 | 5 Institute of Chemical Engineering, Polish Academy of Sciences, Batycka 5, 44100 Gliwice, Poland  
 24 | 6 Department of Biotechnology and Food Microbiology, Poznan University of Life Sciences, Wojska Polskiego 48, 60627 Poznan, Poland
 25 | 
 26 | **Introduction**
 27 | 
 28 | Sterols are ubiquitous compounds in nature that play a range of physiological roles in all living organisms. In mammals, they are mostly used as cell membrane components, hormones, and vitamin D precursors, which leads to their wide application as pharmaceuticals. Therefore, the ability to modify their base structure and to introduce functional groups is of the utmost importance. Although traditional methods of organic chemistry in the synthesis of many steroid drugs have been successfully used with great efficiency, biocatalytic methods (being more selective and environmentally friendly) have recently attracted more attention and have been incorporated into the organic
 29 | 
 30 | chemist's toolbox (Brixius-Anderko et al. 2015; Donova 2007; Holland 1992; Riva 1991; Zhang et al. 2014).
 31 | 
 32 | 25-hydroxylated derivatives of sterols play an important role in human metabolic control and treatment of associated disorders. 25-hydroxycholesterol (25-OH-Ch) is known to perform a complex regulatory function in the immunological system by (i) controlling the differentiation of monocytes into macrophages, (ii) suppressing the production of IgA by B cells, and (iii) directing the migration of activated B cells to the germinal follicle (Bauman et al. 2009; McDonald and Russell 2010; Reboldi et al. 2014). Thus, 25-OH-Ch is proposed as a drug in diseases associated with IgA overproduction such as Berger disease (Bauman et al. 2009). Calcifediol (25-OH-D$_3$) is an activated form of cholecalciferol (vitamin D$_3$), an important compound introduced into the organism from dietary sources or formed in skin tissue from 7-dehydrocholesterol after sunlight UV irradiation and further activation (regioselective hydroxylation) in liver (Zhu et al. 2013). 25-OH-D$_3$ is the main blood-circulating metabolite responsible for maintaining calcium and phosphate homeostasis. As a drug, calcifediol is much more potent than the parental vitamin D$_3$ and therefore is used in the treatment of rickets and other calcium disorders (Bischoff-Ferrari et al. 2012; Brandi and Minisola 2013; Jetter et al. 2014).
 33 | 
 34 | However, the production of sterols derivatives activated at position 25 and their use in medical applications is limited by the chemical synthesis of both compounds require a multi-step approach which poses high time and labor demands on the production and results in relatively low yields of the overall synthetic pathways (6–25 %) (Kurek-Tyrlik et al. 2005; Miyamoto et al. 1986; Ogawa et al. 2009; Riediker and Schwartz 1981; Reynar et al. 2002; Westover and Covey 2006). The more efficient method of 25-OH-Ch synthesis (yield 60–70 %) requires an expensive starting material such as desmosterol (Zhao et al. 2014). On the other hand, the reported enzymatic methods are much more straightforward and usually involve single hydroxylation step catalyzed by P450 cytochromes (Ban et al. 2014; Yasuda et al. 2013; Yasutake et al. 2013) or a non-heme monooxygenase overproduced in transgenic plants Arabidopsis thaliana and Solanum tuberosum (Beste et al. 2011).
 35 | 
 36 | Steroid C25 dehydrogenase (S25DH) from $\beta$-proteobacterium *Sterolibacterium denitrificans* (Chol-1S), a facultative anaerobic microorganism capable of the full mineralization of cholesterol in both aerobic and anaerobic conditions (Tarlera 2003) appears to be another example of a catalyst for the regioselective hydroxylation of sterols and their derivatives. S25DH is a molybdenum enzyme belonging to the so-called ethylbenzene dehydrogenase (EBDH)-like subclass of DMSO reductases (Heider et al. 2016; Hill et al. 2014). It is a heterotrimer ($\alpha\beta\gamma$, 168 ± 12 kDa) containing a bis-pyranopterin-guanine dinucleotides (MGD)-molybdenum cofactor and [4Fe-4S] cluster in the $\alpha$ subunit (108 kDa), four more iron-sulfur clusters in the $\beta$ subunit (38 kDa) and one heme in the $\gamma$ subunit (27 kDa) (Demmer and Fuchs 2012).
 37 | 
 38 | In the cholesterol degradation pathway, S25DH activates a tertiary C25 carbon atom of the cholest-4-en-3-one by introduction of the oxygen atom that originates from a water molecule, finally yielding the 25-hydroxycholest-4-en-3-one (Chiang et al. 2007, 2008b) (Fig. 1a). Beside its native substrate, S25DH also catalyzes the regioselective hydroxylation of other 3-ketosterols, 3-hydroxy sterols (e.g., cholesterol and 7-dehydrocholesterol) and cholecalciferol (Demmer and Fuchs 2012; Warnke et al. 2016) (Fig. 1b). Therefore, the enzyme can be applied in the catalytic synthesis of pharmacologically important molecules such as 25-hydroxycholesterol, 25-hydroxy-7-dehydrocholesterol (hydroxylated pro-vitamin D$_3$, 25-OH-pro-D$_3$), and calcifediol (25-OH-D$_3$) (Demmer and Fuchs 2012; Szaleniec et al. 2015; Warnke et al. 2016).
 39 | 
 40 | In this work, we present extensive studies on the engineering of reaction medium, the optimization of the biocatalyst formulation and the simplified aerobic purification procedure of steroid C25 dehydrogenase from *S. denitrificans*. To apply S25DH for hydroxylation of sterols, cholecalciferol, or ergocalciferol (Fig. 1b), the following obstacles had to be overcome: (i) the low solubility of hydrophobic reagents in aqueous medium, (ii) the sensitivity of S25DH to oxygen, (iii) the high sensitivity of the product 25-hydroxycholesterol to oxidation, (iv) the low conversion of cholesterol to 3-ketosterols (such as cholest-4-en-3-one), and (v) the limitation of the reaction conversion by the availability of enzyme re-oxidant. Here, we provide the results of a stepwise optimization of the reaction conditions that addresses these issues. Moreover, to demonstrate a real application potential of S25DH, we conducted the synthesis of 25-OH-Ch and 25-OH-D$_3$ in fed-batch reactors using either pure enzyme or enzyme preparation. The syntheses of 25-hydroxy 3-ketosterols were carried out in fed-batch reactors systems using either the homogenous or immobilized enzyme. We also tested a plug flow reactor thus broadening the scope of reaction systems and introducing a possibility of switching from a batch process to a continuous one.
 41 | 
 42 | ## Methods
 43 | 
 44 | ### Materials and bacterial strain
 45 | 
 46 | All chemicals of analytical grade were purchased from Sigma-Aldrich (Poland), Avantor Performance Materials (Poland), GE Healthcare (USA) or Carbosynth Ltd. (UK). Cholest-1,4-dien-3-one was synthesized according to Barton's protocol (Iida et al. 1988) (melting point: 99–101 °C, lit. 97–100 °C (Czarny et al. 1977)), while cholesteryl succinate Tris salt was prepared according to Bildzukiewich's protocol
 47 | 
 48 | ![](2_0.png)
 49 | 
 50 | (Bildziukevich et al. 2013) (melting point of acid: 174–176 °C, lit. 176–177 °C (Carvalho et al. 2010)). *S. denitrificans* Chol-1S (DSMZ 13999) was purchased from the Deutsche Sammlung für Mikroorganismen und Zellkulturen GmbH (Braunschweig, Germany).
 51 | 
 52 | ### Cultivation of bacteria
 53 | 
 54 | *Sterolibacterium denitrificans* was grown on cholesterol as a sole carbon source at 30 °C under anoxic, denitrifying conditions as previously described (Chiang et al. 2008b; Tarlera 2003). Large-scale fermenter cultures (100–150 L) were conducted according to a previously described procedure (Chiang et al. 2007) with the automatic measurement of pH and supplementation of 1 M sulfuric acid. The fermentations were conducted in the facility of the Department of Biotechnology and Food Microbiology, University of Life Sciences in Poznan, Poland. Cells were harvested by centrifugation during the exponential growth phase at an optical density of 0.8–1.0 and stored at −80 °C.
 55 | 
 56 | ### Aerobic enzyme purification
 57 | 
 58 | The S25DH was purified from *S. denitrificans* according to a modified protocol (Dermer and Fuchs 2012) under aerobic conditions. Briefly, the procedure comprised four steps: (i) cell extract solubilization, (ii) ion-exchange separation on DEAE-Sepharose, (iii) ion-exchange separation on Q-Sepharose, and (iv) affinity separation on Reactive Red 120 (see Table S1).
 59 | 
 60 | The active pool after (i) is referred to as "crude enzyme," after (ii) as "enzyme preparation," and after (iv) as "pure enzyme." Details of the purification procedure were provided in the [Supplementary Material](#).
 61 | 
 62 | ### Enzyme immobilization
 63 | 
 64 | The immobilization of pure S25DH was conducted on four types of support: (i) Granocel and commercial cellulose; (ii) mesostructured cellular foam (MCF), SBA15, and SBA15-ultra (Santa Barbara Amorphous) silica powder carriers with amino groups; (iii) Eupergit® C; and (iv) silica monoliths with amino groups (see Table S2 for BET surface characterization). Briefly, the immobilization procedure advanced along the following steps: (i) carrier activation (if needed), (ii) rinsing of the activator, (iii) binding of S25DH in the presence of a protective re-oxidant ($K_3[Fe(CN)_6]$), and (iv) end-capping of the still-free active surface groups with Tris. All eluates were collected and analyzed for the presence of the unbound protein and S25DH activity. The amount of protein bound to a carrier was calculated as the difference between the amount of the protein used for the immobilization and the amount of unbound protein. The initial activity of the immobilized biocatalysts was determined in an HPLC assay for the first 3 h of reaction in the batch system. The activity recovery (AR) was calculated as a ratio of the specific activities [mU/mg of enzyme] measured for the immobilized and the free enzyme. A description of the carrier syntheses and functionalizations, as well as a detailed description of the
 65 | 
 66 | protein immobilization protocols, were provided in the Supplementary Material.
 67 | 
 68 | ### S25DH activity assay
 69 | 
 70 | The standard reaction assay was prepared in a 0.4-mL sample probe containing 280 mM KH$_2$PO$_4$/K$_2$HPO$_4$ pH 7.0 buffer, 12.5 mM K$_3$[Fe(CN)$_6$] as an electron acceptor, 8% (w/v) 2-hydroxypropyl-β-cyclodextrin (HBC) as a solubilizer, 1.25% (w/v) 2-methoxyethanol (EGME) as an organic co-solvent and substrate stock solution, containing app. 0.25 g/L of sterol substrate (C3-ketone, C3-alcohol, C3-ester, or cholecalciferol) and S25DH (in an amount differing depending on the substrate type and the purity of the enzyme). The reactions were carried out under anaerobic conditions at 30 °C in a thermoblock shaker at 800 rpm. As in each reaction cycle two equivalent protons are released to the reaction medium, the elevated concentration of the buffer (i.e., 280 mM) was used in order to maintain a stable pH during the prolonged reaction runs (up to 10 min).
 71 | 
 72 | The tests with immobilized S25DH were conducted at a 10-fold higher scale (4 mL), where 0.5 mL of the immobilized enzyme suspension was used as a catalyst.
 73 | 
 74 | ### UV-Vis detection of activity
 75 | 
 76 | The activity was measured spectrophotometrically at 420 nm (ε = 6218.9 M$^{-1}$ cm$^{-1}$) as an electron acceptor. The initial substrate concentration was 0.25 g/L, and the pure enzyme concentration was 0.625 μg/mL. The tests were conducted in triplicate. The assay was used in the optimization of reaction conditions for cholest-4-en-3-one, where the HBC content was tested in the range of 8–20% (w/v) and EGME in the range of 1.25–10% (v/v).
 77 | 
 78 | ### HPLC detection of activity
 79 | 
 80 | The reagent concentrations in the reactors were monitored over time by the injection of 10-μL samples. The reactions were stopped by the addition of the sample to 10 μL of isopropanol and 1 μL of a saturated solution of FeSO$_4$. The precipitated enzyme and electron acceptor were centrifuged (25,000g, 15 min), and the sample was transferred to a glass vial for LC-MS analysis. Samples were analyzed with RP-HPLC-DAD-MS on an Ascentis® Express RP-Amide column (2.7 μm, 7.5 cm × 4.6 mm, 1 mL/min) using the gradient method 55–98% acetonitrile/H$_2$O/10 mM NH$_4$CH$_3$COO (DAD/(+)-ESI-MS) for C3-ketones and cholecalciferol and 95–98% acetonitrile/H$_2$O/0.01% HCOOH for C3-alcohols and esters (DAD/(+)-APCI-MS). The quantitative analysis was conducted with a DAD detector (240 nm for cholest-4-en-3-one and cholest-1,4-dien-3-one, 265 nm for cholecalciferol and ergocalciferol, 280 nm for 7-dehydrocholesterol, and 205 nm for cholesterol and cholesteryl succinate). The hydroxylation of the product was confirmed by MS spectroscopy (Table S4) by the detection of the characteristic quasi-molecular signals corresponding to the masses of the product [M + H]$^+$ or [M + K]$^+$ higher by 16 m/z than the respective quasi-molecular signals of the substrate or characteristic of product fragmentation signals (e.g., [M + H – H$_2$O]$^+$).
 81 | 
 82 | ### Reactor systems (batch, fed-batch, plug flow)
 83 | 
 84 | Optimization of the reactor conditions for cholecalciferol and cholesterol were carried out under anaerobic conditions in 20 and 2 mL volumes, respectively. The reactor contents comprised the enzyme preparation (15.5 and 1.15 mL, respectively; C$_{f}$ = 1.01 mg/mL, specific activity (SA) = 1.34 mU/mg) as a catalyst and reaction buffer, 12.5 mM K$_3$[Fe(CN)$_6$] and varied amounts of EGME with substrate (1.25–5% (v/v), substrate stock concentration: 20.2 g/L for cholecalciferol, 12.8 g/L for cholesterol) and HBC (1–12% (w/v) for cholecalciferol, 4–16% (w/v) for cholesterol). Increased amounts of EGME (1.25, 2.5, and 5% (v/v)) resulted in increased loadings of cholecalciferol (0.25, 0.5, and 1 g/L, respectively) or cholesterol (0.16, 0.32, and 0.64 g/L, respectively).
 85 | 
 86 | The 12 different reactions (four tests with replicates) with cholecalciferol were carried out under optimized conditions, magnetic stirring bars (500 rpm) at 25–30 °C. The 12 different reactions with cholesterol were carried out under optimized conditions (500 rpm, 30 °C). The average volume activity (VA) for each reactor was calculated from the first 24 or 15.5 h of the reaction (for cholesterol or cholecalciferol, respectively) and presented as the 3D contour plots in STATISTICA v10 (StatSoft) using a distance-weighted least squares fitting for non-linear interpolation (Neter et al. 1985).
 87 | 
 88 | In fed-batch mode, the substrate and electron acceptor were supplemented by the addition of substrate in EGME (20 g/L stock solution) or K$_3$[Fe(CN)$_6$] (1 M stock solution) whenever an HPLC analysis showed a low level of the substrate (<0.05 g/L) or a UV-Vis measurement (Abs 420 nm, ε = 1040 M$^{-1}$ cm$^{-1}$) showed a low concentration of the re-oxidant.
 89 | 
 90 | ### Reactors with electrochemical recovery of re-oxidant
 91 | 
 92 | The reactions with the electrochemical recovery of the re-oxidant were carried out in anaerobic conditions in batch reactors fitted with an electrochemical system, as previously described (Tataruch et al. 2014).
 93 | 
 94 | ### Plug flow reactor
 95 | 
 96 | The reactions in 1.5-mL monolithic silica plug flow reactors (Szymanska et al. 2013) were conducted in anaerobic conditions using cholest-1,4-dien-3-one (0.23 g/L) as a substrate in
 97 | 
 98 | 280 mM buffer K$_2$HPO$_4$/KH$_2$PO$_4$ pH 7.0 containing 8 % (w/v) HBC, 1.25 % (v/v) EGME, and 12.5 mM K$_3$[Fe(CN)$_6$]. Next, 10 mL of the reaction mixture was pumped through the reactor at a 0.1 mL/min flow rate. The reagent concentrations were determined by HPLC at the end of the reactor. After each pass through the reactor, the reaction mixture was collected for the next pass. Altogether, the reactor was tested across 7 days time with six passages of the reaction mixture.
 99 | 
100 | ### Product separation
101 | 
102 | The reaction mixtures were extracted with ethyl acetate (3 × 0.25 of reaction medium volume). The combined extracts were washed with saturated KCl$_{aq}$, dried over anhydrous magnesium sulfate and evaporated under reduced pressure. The obtained residue was purified using column chromatography on silica with ethyl acetate: hexane (1:1).
103 | 
104 | ### Results
105 | 
106 | #### Low solubility of hydrophobic substrates in aqueous medium
107 | 
108 | The reaction mixtures of hydrophobic substrates and their derivatives by S25DH with simultaneous minimization of the required HBC and organic solvent content. The total replacement of HBC in such solubilization was reported as difficult (Dermer and Fuchs 2012; Warnke et al. 2016). Indeed, the employment of other solubilizers, such as n-dodecyl-β-D-thio-maltoside (DDM), CHAPS (3-[(3-cholamidopropyl)dimethylammonio]-1-propanesulfonate) or saponins resulted in a significant decrease in product conversion (data not shown). As β-cyclodextrin was reported to be a better solubilizer than the α- and γ-ones (Seung-Kwon et al. 2002), we compared three different β-cyclodextrin derivatives, i.e., 2-hydroxypropyl-, methyl-, and unmodified β-cyclodextrin. The HBC proved to be the best solubilizer, as the reaction in methyl-β-cyclodextrin proceeded two times slower for cholest-4-en-3-one (both reactors with 8 % (w/v) solubilizer content), while the reaction with unmodified β-cyclodextrin proceeded four times slower for cholest-4-en-3-one and 2.5 times slower for cholecalciferol (reactor with saturated 2 % (w/v) solubilizer content compared with 8 % (w/v) HBC). Despite a slight slower initial activity observed for methyl-β-cyclodextrin with respect to HBC, in both reactions the final conversion reached 100 % (after 5 vs. 12 h, respectively for 2-propyl- and methyl-β-cyclodextrin) (Fig. S1 of Supplementary Material). Meanwhile, the conversion in reactors with unmodified β-cyclodextrin reached only 5–8 % (Fig. S2 of Supplementary Material).
109 | 
110 | Despite the presence of solubilizer a small content of organic solvent (at least 1 %) for introduction of steroid into the reaction mixture is always required. The previously established S25DH assay was based on 20 % (v/v) HBC and 1.25 % (v/v) 1,4-dioxane (Dermer and Fuchs 2012). We tested S25DH activity with UV-Vis assay in other organic solvents for compatibility with S25DH such as tert-butanol, 2-propanol, methanol, 1,2-propanediol and 2-methoxyethanol (ethylene glycol monomethyl ether (EGME)) (data not shown). Above all, EGME proved to be the most efficient substitute of 1,4-dioxane. We determined S25DH initial activity in cholest-4-en-3-one hydroxylation (constant substrate concentration) with different HBC and EGME content (Fig. 2a). Then, S25DH reaction rate turned out to be dependent on the HBC solubility in the reaction medium (lower range of HBC/EGME concentration) and putative substrate sequestrating by the multiple-HBC complexes (Yamamoto et al. 2005) (upper HBC range) together with the detrimental influence of the organic solvent on the enzyme activity (EGME upper range). Notably, in the presence of EGME, the content of HBC could be reduced from 20 % (w/v) to 8 % (w/v) without cholest-4-en-3-one precipitation from the solution. This, in turn, increased the observed initial S25DH activity threefold (reaction with 8 % (w/v) HBC and 1.25 % (v/v) EGME compared to the assay with 20 % (w/v) HBC and 1.25 % (v/v) 1,4-dioxane). The most favorable for hydroxylation of cholest-4-en-3-one, although the enzyme tolerated EGME levels up to 5 % (v/v).
111 | 
112 | Subsequently, we selected optimal medium conditions utilizing EGME was compared to previously described conditions utilizing 1,4-dioxane in the experiment employing two parallel reactors with the same enzyme and initial cholest-4-en-3-one concentrations (Fig. 2b). Again, the change of solubilizer content and organic co-solvent type resulted in a significantly higher conversion rate after 4 h of reaction, i.e., 31 % for reactor with EGME and 18 % for reactor with 1,4-dioxane. Moreover, a better dissolution of the substrate was achieved in reactor with EGME (respectively, 0.3 and 0.25 g/L), despite the same substrate concentration in the organic stock solutions.
113 | 
114 | As the substrates of interest differ in hydrophobicity (log P for cholesterol 8.7, cholest-4-en-3-one 8.4 and cholecalciferol 7.9 – XLogP, PubChem DataBase), the HBC and EGME content in the S25DH reaction mixture were optimized individually for each substrate. Furthermore, as the reported apparent $K_m$ values for S25DH are relatively high (in the range of 0.4–0.8 mM) (Warnke et al. 2016), the increased substrate loading of the reactors has a significant impact on the observed reaction rate. Therefore, the increased EGME content was combined with the increase of substrate concentration. The optimal reaction conditions were determined using the average volume activity (VA) from the first 24 or 15.5 h of cholesterol or cholecalciferol hydroxylation in both reactors.
115 | 
116 | ![](5_0.png)
117 | 
118 | Fig. 2 Reaction medium optimization for cholest-4-en-3-one. a Initial volume activity $VA_{init}$ of S25DH as a function of HBC and EGME contents. In all tests, the substrate concentration was 0.25 g/L. b Progress curves for cholest-4-en-3-one (squares) to 25-OH-product (circles) conversions for reaction mixture with 1.25 % (v/v) 1,4-dioxane (black, solid line) and EGME (blue, dashed line). In both experiments, the other conditions, including the amount of S25DH enzyme (SA 15 mU/mg), were identical
119 | 
120 | For cholecalciferol (Fig. 3a), the optimum reaction medium conditions were found to be 6–8 % (w/v) HBC and 5 % (v/v) EGME while for cholesterol (Fig. 3b): 6–9 % (w/v) HBC and 2.5 % (v/v) EGME. Initial substrate concentrations reached 0.52 g/L for cholecalciferol and 0.32 g/L for cholesterol, respectively.
121 | 
122 | Aerobic vs. anaerobic atmosphere
123 | 
124 | The enzymes of the EBDH class, including S25DH, were reported to be oxygen sensitive, especially in their reduced state (Dermer and Fuchs 2012; Szaleniec et al. 2007; Tataruch et al. 2014). Despite some contradicting reports (Warmke et al. 2016), we decided to assess the influence of an oxygen-containing atmosphere on the long-term performance of the S25DH catalyst. Two 8-mL batch reactors were prepared for the hydroxylation of 7-dehydrocholesterol: one under aerobic conditions and the other in a glove box (97 % $N_2/3 % H_2$) (Fig. 4, Fig. S3). The final product concentration in aerobic reactor reached 0.05 g/L and the enzyme was inactivated (no further change in product concentration) after approximately 48 h. Meanwhile, under anaerobic conditions, 0.15 g/L of product was reached, and the enzyme remained active for at least 150 h. A similar effect was observed for the immobilized S25DH (Fig. 4 circles), where after an initial period of an identical reaction rate (app. 24 h), the enzyme working under aerobic atmosphere lost most of its activity within 96 h, while the enzyme under anaerobic atmosphere remained active after 480 h of continuous processing. It should be underlined that in each case, the reaction progress was not limited by substrate availability, as the initial concentration of 7-dehydrocholesterol was in the range of 0.29–0.36 g/L (Fig. S3 of Supplementary Material), and the $K_3[Fe(CN)_6]$ was replenished whenever it reached a low level.
125 | 
126 | ![](5_1.png)
127 | 
128 | Fig. 3 Volume activity (VA) of S25DH in batch reactor tests with a cholecalciferol and b cholesterol as a function of the HBC and EGME content. As the substrates were dissolved in EGME, higher substrate loadings were obtained for higher EGME contents
129 | 
130 | ![6_0.png](6_0.png)
131 | 
132 | **Fig. 4** Influence of oxygen-containing atmosphere on S25DH activity. Progress curves of 7-dehydrocholesterol conversions conducted in aerobic (filled symbols) and anaerobic (empty symbols) conditions for homogenous (squares, solid black line SA 0.5 mU/mg) and immobilized (SBA15-AEAPTS) pure enzyme (blue circles, dashed line SA 0.1 mU/mg)
133 | 
134 | ### Purity of S25DH
135 | 
136 | Two protocols for the S25DH purification described in the literature are composed of multiple steps (Dermer and Fuchs 2012; Warnke et al. 2016). However, recently it was suggested that the conversions of cholecalciferol and 7-dehydrocholesterol can be achieved with a crude enzyme preparation as a catalyst (Warnke et al. 2016). We studied three different types of S25DH: (i) crude enzyme, (ii) enzyme preparation (eluate of diethylaminoethanol (DEAE)-Sepharose column), and (iii) pure enzyme (three chromatographic steps (DEAE-Sepharose, Q-Sepharose, Reactive Red 120)). Notably, the enzyme types (i) and (ii) contained cholest-4-en-3-one-Δ1-dehydrogenase (AcmB), the FAD-enzyme that in the presence of S25DH-oxidants catalyzes oxidative dehydrogenation of 3-ketosteriods and 7-ketosterols (Chiang et al. 2008a). Nevertheless, AcmB is unable to dehydrate 3-hydroxy substrates such as cholesterol, 7-dehydrocholesterol and cholecalciferol. Therefore, less S25DH performance in a real reactor system with different substrates was never systematically surveyed before. The reported initial apparent specific activity of S25DH was 5–10 times higher in the hydroxylation of 3-ketosterols compared to that of 3-hydroxysteroids (Dermer and Fuchs 2012). A series of small 0.4-mL tests in batch mode was studied with pure S25DH (Table 1 and Fig. S5 of Supplementary Material). The catalyst amount was adjusted to the reported initial activity of the substrate, i.e., lower for very active 3-keto substrates, higher for 3-hydroxy substrates. During the long reaction time, a gradual inactivation of the enzyme was observed. For 3-ketosterols, 90 % of the substrate conversion was reached in 4 h, while in the case of cholecalciferol such conversion was obtained after 2 days. For cholesterol and 7-dehydrocholesterol, the 90 % conversion was not obtained, as the enzyme was deactivated before reaching such a conversion.
137 | 
138 | Therefore, higher yields of 25-hydroxy-3-hydroxysteroid were obtained by adding more of the catalyst into the reactor, e.g., doubling the amount of enzyme in the reactors with cholesterol resulted in an increase of conversion from 35 to 67 % (first part of the fed-batch reactor Fig. S6a, b). Interestingly, although the substitution of cholesterol by its succinate ester
139 | 
140 | **Table 1** Results of batch reactor conversion of S25DH substrates in reaction mixture containing 8 % (w/v) HBC and 1.25 % (v/v) EGME
141 | 
142 | | Substrate | VA [mU mL⁻¹] | C₀ [g L⁻¹] | Product Cfin [g L⁻¹] | Time to 90 % conversion [h] | Total reaction time [days] |
143 | |-----------|---------------|------------|----------------------|-----------------------------|--------------------------|
144 | | Cholest-4-en-3-one | 1.18 | 0.28 | 0.28 | 4 | 1 |
145 | | Cholest-1,4-dien-3-one | 1.18 | 0.25 | 0.25 | 4 | 1 |
146 | | Cholecalciferol | 3.15 | 0.3 | 0.3 | 46 | 5 |
147 | | 7-dehydrocholesterol | 3.15 | 0.34 | 0.19 | - | - |
148 | | Ergocalciferol | 3.15 | 0.25 | 0.0012 | - | - |
149 | | Cholesterol | 6.3 | 0.34 | 0.23 | - | - |
150 | | Cholesteryl succinate | 6.3 | 0.92 | 0.506 | - | 19 |
151 | 
152 | *Total enzyme amount of pure enzyme introduced into reactor, C₀ initial substrate concentration, Cfin final product concentration, time to 90 % conversion time in which 90 % of substrate was converted to product, Total reaction time time at which final product concentration was reached*
153 | 
154 | allowed a much higher loading of the batch reactor (0.92 g/L instead of 0.26 g/L), which resulted in a high yield of the product (0.5 g/L), the reaction proceeded significantly slower than in the case of cholesterol (Table 1, Fig. S6e of Supplementary Material). The identity of S25DH products was confirmed by LC-MS and NMR (Table S4, Fig. S10–13 of Supplementary Material) and was consistent with that to reported before (Chiang et al. 2007; Warnke et al. 2016).
155 | 
156 | ### Electrochemical recovery of S25DH re-oxidant
157 | 
158 | During the reaction, S25DH is reduced by a sterol substrate and then re-oxidized by an artificial electron acceptor (K₃[Fe(CN)₆] or [Fe(cp)₂]BF₄). The preliminary tests with [Fe(cp)₂]BF₄ indicated its interaction with a substrate solubilizer, HBC. This phenomenon was confirmed by cyclic voltammetry experiments, which showed a gradual shift of the ferrocene potential in the HBC solution toward more positive value and a decrease of the observed current (data not shown). As no such effect was observed for K₃[Fe(CN)₆], it was used in the subsequent tests. The reactor experiments revealed efficient substrate hydroxylation in a broad range of K₃[Fe(CN)₆] concentrations (1–15 mM). However, an increase of the re-oxidant concentration above 10 mM had a negative effect on the reaction rate. K₃[Fe(CN)₆] was employed in an electrochemical reactor (Fig. 5) with cholesterol as a substrate and crude enzyme as a catalyst, i.e., a low-cost catalyst that, due to the side reactions with other redox proteins, consumes more re-oxidant. Initially, at sustained high concentration of K₃[Fe(CN)₆], the conversion in the electrochemical reactor proceeded faster than in the control reactor without electrochemical recovery. However, after approximately 48 h, the hydroxylation rate in the electrochemical reactor decreased, and the enzyme had become inactive after 100 h. Meanwhile, the enzyme in the control reactor was able to catalyze the hydroxylation of cholesterol for 700 h, despite the gradual loss of its activity. A similar effect was observed for pure enzyme hydroxylating cholest-4-en-3-one, as well as for the hydroxylation of ethylbenzene by immobilized EBDH (data not shown).
159 | 
160 | ![](7_0.png)
161 | 
162 | **Fig. 5** Influence of electrochemical recovery of K₃[Fe(CN)₆] on S25DH activity (crude enzyme, SA = 0.39 mU/mg): progress curves for cholesterol (filled squares) to 25-OH-Ch (empty circles) conversions conducted with (blue, dashed line) and without electrochemical recovery (black, solid line)
163 | 
164 | ### Immobilization of S25DH
165 | 
166 | The activity of immobilized enzyme preparations was evaluated for different supports in batch reactors with cholest-1,4-dien-3-one as a substrate (Table 2). As a reference, homogenized S25DH with an initial activity corresponding to that used for preparation of the immobilized catalyst (8.04 mU) was used. In each case, 0.5 mL of the settled immobilized catalyst was suspended in 3.5 mL of the reaction mixture and placed in a thermostated reactor at 30 °C in an anaerobic atmosphere. The reaction progress was monitored by HPLC for 2 weeks (Fig. S9 of Supplementary Material). The highest activity recovery (% of immobilized specific activity, AR) was observed for mesostructured cellular foam (MCF) and Santa Barbara Amorphous (SBA-15) silica supports both functionalized with 2-aminoethyl-3-aminopropyltrimethoxysilane (AEAPTS; 45 % and 40 %, followed by SBA-ultra supports functionalized with aminopropyltriethoxysilane (APTS) (13–14 %). The lowest ARs were detected for S25DH immobilized on cellulose carriers (1–2 %) and Eupergit® C (0.3 %). In order to test the influence of the functionalization linker, further experiments were conducted employing AEAPTS or APTES as linkers on SBA-ultra carriers with pure enzyme (SA 4.67 mU/mg) (Table S3 Supplementary Material). For both carriers functionalized by a longer (AEAPTS) or a shorter (APTES) linker, similar values of protein loading as well as initial activity and AR were obtained. Thus, under experimental conditions the influence of the linker was not observed.
167 | 
168 | ### S25DH hydroxylation applications
169 | 
170 | #### Synthesis of 25-OH-Ch
171 | 
172 | The synthesis of 25-OH-Ch in a fed-batch reactor (25 °C, 8 % (w/v) HBC; 1.25–3.75 % (v/v) EGME; 0.12 mL of pure enzyme SA 14 mU/mg, anaerobic atmosphere) resulted in an approx. 0.8 g/L product concentration after 500 h of reaction (Fig. S6 of Supplementary Material) and conversion of 40–60 %. The conversion and subsequent downstream processing can be further optimized by decreasing the substrate addition at the later stage of the reaction.
173 | 
174 | #### Synthesis of 25-OH-D₃
175 | 
176 | The synthesis of 25-OH-D₃ in a fed-batch reactor was carried out in two parallel 20-mL glass reactors (25 °C, 8 % (w/v) HBC and 5 % (v/v) EGME, initial
177 | 
178 | Appl Microbiol Biotechnol (2017) 101:1163–1174
179 | 
180 | Table 2  Catalytic characterization of immobilized S25DH in reaction with cholest-1,4-dien-3-one. The reference activity of homogenous S25DH tested with cholest-1,4-dien-3-one was 20.1 mU/mL (SA 6.6 mU/mg)
181 | 
182 | | Carrier                | Bound protein [mg/g] | Protein loading [mg/mL] | Initial activity [mU/mL] | Activity recovery [%] | $c^{prod}$ [14th day [g/L]] |
183 | |------------------------|----------------------|-------------------------|--------------------------|------------------------|-----------------------------|
184 | | OH-GranoCel            | 1.17 (0%)            | 2.34                    | 0.3                      | 2.1                    | 0.034                       |
185 | | Commercial cellulose   | 0.96 (7%)            | 1.92                    | 0.1                      | 1.0                    | 0.006                       |
186 | | MCF-AEAPS1             | 1.21 (9%)            | 2.42                    | 7.2                      | 45.1                   | 1.748                       |
187 | | SBA15-AEAPTS           | 1.22 (10%)           | 2.44                    | 4.8                      | 29.9                   | 1.670                       |
188 | | SBA15-ultra-1-APTES    | 1.21 (10%)           | 2.42                    | 2.0                      | 12.7                   | 0.437                       |
189 | | SBA15-ultra-2-APTIMS   | 1.22 (10%)           | 2.44                    | 2.2                      | 13.9                   | 0.368                       |
190 | | Eupergit® C            | 0.68 (6%)            | 1.36                    | 0.3                      | 0.3                    | 0.229                       |
191 | 
192 | _MCF mesostructured cellular foam, AEAPTS 2-aminoethyl-3-aminopropyltrimethoxysilane, SBA Santa Barbara amorphous, APTES 3-aminopropyltriethoxysilane, APTMS 3-aminopropyltrimethoxysilane_
193 | 
194 | substrate concentration of 1.2 g/L, anaerobic atmosphere) with enzyme preparation as a catalyst (15 mL, SA 1.34 mU/ mg). After 162 h of reaction (Fig. S7 of Supplementary Material), the product concentration reached 1.4 g/L with conversion of 99 %. The product, 25-OH-D$_3$, was isolated (see “Methods”) with 70 % yield (40 mg, melting point = 81– 83 °C, lit. 81–83 °C (Campbell et al. 1969)) and analyzed by NMR (Fig. 6 and Figs. S11–S13 of Supplementary Material). The synthesis of 3-ketosteroids was carried out in a fed-batch reactor (30 °C, 8 % (w/v) HBC and 1.25–6.25 % (v/v) EGME, 0.02 mL of pure enzyme SA 14 mU/mg, anaerobic atmosphere) for 120 h, yielding a 1.64 g/L product concentration (>99 % conversion) (Fig. S8 of Supplementary Material). Similarly, the synthesis of 25-hydroxycholest-1,4-dien-3-one was carried out in two 400-h experiments (Fig. S9 of Supplementary Material), using pure (8.04 mU) or immobilized S25DH (3.6 mU) at 25 °C. The reactor with pure enzyme reached the product concentration of 2.21 g/L, while the reactor with immobilized enzyme reached 1.74 g/L (conversions 82 %). The same reaction carried out in a plug flow reactor under anaerobic atmosphere resulted in 7.5 % conversion during the first pass (100 min), followed by 6.7 % conversion during the second and third passes (second day of the reaction), and 4.5–5 % during the fourth to sixth passes (third and seventh day of the reaction). The final cumulative conversion was 33.5 % (0.16 g/L) and the catalyst exhibited 50 % of its initial activity after 7 days of discontinuous work (intermediate storage of catalyst under anaerobic conditions for 10 days).
195 | 
196 | ![](8_0.png)
197 | 
198 | _Fig. 6 Reaction progress curve of the fed-batch synthesis of calcifediol. Filled circles calcifediol, empty squares cholecalciferol_
199 | 
200 | Discussion
201 | 
202 | Optimization of reaction conditions The low solubility of steroids in aqueous medium is a known problem in their biotransformation that limits the yield of such processes (Rao et al. 2013). However, the reactor loading can be increased by the addition of steroid solubilizers, such as 2-hydroxypropyl-$\beta$-cyclodextrin (HBC) together with organic co-solvent, which additionally enables an easy introduction of the substrate to the reaction medium. Comparative tests of differently functionalized $\beta$-cyclodextrins showed that HBC is especially efficient in the solubilization of sterols, followed by methyl-$\beta$-cyclodextrin. The unmodified $\beta$-cyclodextrin proved to be very inefficient in that process and limited the conversion. A thorough investigation of the influence of HBC on the observed activity showed that both too low and too high contents of cyclodextrin were detrimental to the reaction rate. The low HBC content results in the low solubility of the substrate and its subsequent precipitation from the reaction medium. On the other hand, a high content of HBC is most probably associated with substrate sequestration by cyclodextrins forming tubular supramolecular structures. This effectively decreases the substrate concentration available to the enzyme (Decaprio et al. 1992; Williams et al. 1998). As a
203 | 
204 | consequence, for more hydrophilic compounds (e.g., cholecalciferol), a lower concentration of the expensive HBC can be used (6 % instead of 8 % (w/v)). The optimal EGME content is also bracketed by low sterol concentrations for low levels of EGME in the reaction mixture and enzyme deactivation due to exposure to high concentrations of denaturizing organic solvent. The tests conducted at the constant concentration of cholest-4-en-3-one indicated, that a low concentration of EGME is optimal for the initial rate of the enzyme. However, the increased EGME concentration not only allows the introduction of higher doses of sterol substrates but also seems to stabilize the high concentration of sterol in the water/ HBC medium. As a result the optimal productivity of the reactor can be achieved at higher contents of the EGME due to elevated concentration of the sterol substrate (Fig. 7).
205 | 
206 | The optimization showed that S25DH performs best under anaerobic conditions. The long-term exposure to aerobic conditions during the reaction results in the significantly faster deactivation of the catalysts. The optimal concentration of the $K_3[Fe(CN)_6]$ re-oxidant is in the range of 10–15 mM, where a concentration of approximately 1 mM results in the kinetic limitation of the re-oxidation process, and approximately 100 mM in the inactivation of the enzyme (possibly due to too high redox potential). Surprisingly, the introduction of the electrochemical re-oxidation of $K_3[Fe(CN)_6]$ by a Pt electrode in the plug flow system resulted in a decrease in the enzyme activity by approximately 50 % under anaerobic conditions (associated with a high and steady concentration of the re-oxidant). The effect was observed both for homogenous and immobilized biocatalyst, and we have been yet unable to explain this effect, which seems to be characteristic of the EBDH-like class of enzymes. Nevertheless, the potential use of electrochemical recovery in the plug flow system (in the form of an electrochemical flow cell) seems to be an attractive alternative to increase the economic feasibility of the process.
207 | 
208 | The covalent immobilization of S25DH on a solid support resulted in a significant loss of activity (50–80 %). Although the carrier survey was not very systematic, and many factors could influence the biocatalyst activity after immobilization, one can draw two conclusions from these experiments: (i) cellulose-based carriers behave significantly worse than those functionalized with hydrophobic linker silica supports and (ii) morphology of the silica support is significantly influencing the observed activity of the immobilized biocatalyst. Although these effects require further study, it is possible that membrane-associated S25DH could retain more of its activity when immobilized on a more hydrophobic surface (i.e., silica functionalized with hydrophobic linker) (Schilke and Kelly 2008). Interestingly, despite the significant decrease of the observed initial activity for the immobilized enzyme (45 % of activity for homogenous enzyme either due to mass transfer limitations or over-binding by covalent linker), the immobilized catalyst performed comparably to the homogenous catalyst. This may suggest that for the immobilized enzyme, the deactivation process is indeed slower, which extends the performance over long-term use (up to months). Additionally, taking into account the simplified downstream processing of the reaction medium, the potential application of the immobilized S25DH enzyme in industrial practice seems to be promising (DiCosimo et al. 2013).
209 | 
210 | S25DH as a catalyst in the synthesis of 25-hydroxysterols and calcifeidol S25DH has been proven to be an efficient catalyst, especially in the hydroxylation of 3-ketosterols and cholecalciferol. The observed reaction rates were 5–10 times higher for 3-ketosterols compared to 3-hydroxysterols (e.g., cholesterol, 7-dehydrocholesterol or cholesteryl succinate), and as a result, a higher yield could be obtained in the synthesis of 25-hydroxy-3-ketosterols (up to 1.5 g/L compared to 0.8 g/ L). Therefore, in the synthesis of 25-hydroxycholesterol, we propose chemoenzymatic approach with one enzymatic step: synthesis of 25-hydroxycholest-4-en-3-one from cholest-4-en-3-one and a further chemical isomerization and reduction step afterwards. Interestingly, cholecalciferol is a very good substrate for S25DH, despite the fact that it chemically resembles 3-hydroxyserols. Moreover, S25DH activity in the hydroxylation of ergocalciferol was detected (Table 1). Unfortunately, the reaction rates for ergocalciferol were very low, most probably due to the steric hindrance introduced by the additional methyl group close to the hydroxylation site. A higher solubility of cholecalciferol (compared to other sterols) in aqueous medium seems to result in better saturation of the enzyme active site, resulting in a higher observed hydroxylation rate. To our best knowledge, the use of S25DH for the synthesis of calcifeidol yields the highest concentration of the product that has been obtained with biotechnological methods (Fujii et al. 2009; Kang et al. 2006, 2015; Sasaki et al. 1992). The highest product concentration of 0.57 g/L was reported for Rhodococcus
211 | 
212 | ![9_0.png](9_0.png)
213 | 
214 | **Fig. 7** Schematic representation of the optimization of reaction conditions for S25DH presenting how concentrations of sterol solubilizer (HBC) and organic solvent (EGME) influence the S25DH activity and productivity of 25-hydroxyserols and other sterol derivatives
215 | 
216 | erythropolis cells containing a recombinant vitamin D$_3$ hydroxylase from *Pseudonocardia autotrophica* (Yasutake et al. 2013). In our approach, we were able to reach 1.4 g/L, while experiments with other substrates demonstrated the possibility of going even above 2.0 g/L. The use of isolated enzyme instead of cells also enables the use of less physiological conditions (such as the presence of organic co-solvents) as well as an easy shift from a batch system to flow reactors. In summary, S25DH is an interesting biocatalyst and can be efficiently used in the fine chemical and pharmaceutical industries.
217 | 
218 | ### Acknowledgments
219 | The authors acknowledge the financial support of the Polish institutions of the National Center of Research and Development (grant project: LIDER/33/147/L-3/11/NCBR) and the National Center of Science (grant SONATA UMO-2012/05/D/ST4/ 00777).
220 | 
221 | ### Compliance with ethical standards
222 | 
223 | #### Conflict of interest
224 | All authors declare no conflict of interests.
225 | 
226 | #### Human and animal rights and informed consent
227 | This article does not contain any studies with human participants or animals performed by any of the authors.
228 | 
229 | ### References
230 | 
231 | Ban JO, Kim HB, Lee MJ, Anbu P, Kim ES (2014) Identification of a vitamin D$_3$-specific hydroxylase genes through actinomycetes genome mining. J Ind Microbiol Biotechnol 41(2):265–273. doi:10.1007/s10295-013-1336-9
232 | 
233 | Bauman DR, Bitmansour AD, McDonald JG, Thompson BM, Liang G, Russell DW (2005) 25-hydroxycholesterol secreted by macrophages in response to toll-like receptor activation suppresses immunoglobulin production. Proc Natl Acad Sci U S A 102(39):16764– 16769. doi:10.1073/pnas.0509145102
234 | 
235 | Bestle A, Nahn D, Lehmann K, Fujokata S, Jonsson L, Dutta PC, Stiborn F (2011) Synthesis of hydroxylated sterols in transgenic Arabidopsis plants alters growth and sterol metabolism. Plant Physiol 157(1): 426–440. doi:10.1104/pp.110.171199
236 | 
237 | Bidlukiewicz U, Rarova L, Saman D, Hlavicek L, Drasar P, Wimmer Z (2013) Amides derived from heteroaromatic amines and selected sterol by mesiteres. Steroids 78(4):134–137
238 | 
239 | Bischoff-Ferrari HA, Dawson-Hughes B, Stöcklin E, Sidelnikov E, Willett WC, Edel JO, Stähelin HB, Wolfram S, Jetter A, Schwager J, Henschkowski J, von Eckardstein A, Egli A (2012) Oral supplementation with 25(OH)D$_3$ versus vitamin D$_3$: effects on 25(OH)D levels, lower extremity function, blood pressure, and markers of innate immunity. J Bone Miner Res 27(1):160–169. doi:10.1002/ jbmr.551
240 | 
241 | Brandi ML, Minisola S (2013) Calcidiol (25OHD$_3$): from diagnostic marker to therapeutic agent. Curr Med Res Opin 29(11):1565– 1572. doi:10.1185/03007995.2013.838549
242 | 
243 | Brixius-Anderko S, Fischer L, Hannemann F, Janocha B, Bernhardt R (2015) A CYP21A2 based whole-cell system in Escherichia coli for the biotechnological production of premedrol. Microb Cell Factories 14:135. doi:10.1186/s12934-015-0333-2
244 | 
245 | Campbell JA, Squires DM, Babcock JC (1969) Synthesis of 25-hydroxycholecalciferol biologically effective metabolite of vitamin D$_3$. Steroids 13(5):567–577
246 | 
247 | Carvalho IF, Silva MM, Moreira JN, Simoes S, Sa e Melo ML (2010) Sterols as anticancer agents: synthesis of ring B oxygenated steroids, cytotoxic profile, and comprehensive SAR analysis. J Med Chem 53(21):7632–7638. doi:10.1021/jm100769e
248 | 
249 | Chiang YR, Ismail W, Muller M, Fuchs G (2007) Initial steps in the anoxic metabolism of cholesterol by the denitrifying *Sterolibacterium denitrificans*. J Biol Chem 282(18):13240– 13249. doi:10.1074/jbc.M610930200
250 | 
251 | Chiang YR, Ismail W, Gallien S, Heintz D, Van Dorsselaer A, Fuchs G (2008a) Cholest-4-en-3-one-Delta1-dehydrogenase, a flavoprotein catalyzing the second step in anoxic cholesterol metabolism. Appl Environ Microbiol 74(1):107–113
252 | 
253 | Chiang YR, Ismail W, Heintz D, Schaeffer C, Van Dorsselaer A, Fuchs G (2008b) Study of anoxic and oxic cholesterol metabolism by *Sterolibacterium denitrificans*. J Bacteriol 190(3):905–914. doi:10.1128/jb.01525-07
254 | 
255 | Cranny MR, Nelson JA, Spencer TA (1977) Synthesis of a novel C$_{19}$ steroid: glycolcopentanophenan-3beta-ol. J Org Chem 42(17):2941– 2944. doi:10.1021/jo00437a041
256 | 
257 | Decaprio J, Yun J, Javitt NB (1992) Bile acid and sterol solubilization in 2-hydroxypropyl-Beta-cyclodextrin. J Lipid Res 33(3):441–443
258 | 
259 | Demler J, Fuchs G (2012) Molybdoxybenzen that catalyze the anaerobic hydroxylation of a tertiary carbon atom in the side chain of cholesterol. J Biol Chem 287(4):3695–3696
260 | 
261 | DiCosimo R, McAuliffe J, Poulose AJ, Bohlmann G (2013) Industrial use of immobilized enzymes. Chem Soc Rev 42(15):6437–6474. doi:10.1039/C3CS35506C
262 | 
263 | Donova MV (2007) Transformation of steroids by actinobacteria: a review. Appl Biochem Microbiol 43(1):1–14
264 | 
265 | Fujii T, Kigawa Y, Mochida K, Mase T, Fujii T (2011) Functional formations using Escherichia coli with tolC acrAB mutations expressing cytochrome P450 genes. Biosci Biotechnol Biochem 74(4): 805–810. doi:10.1271/bbb.80627
266 | 
267 | Heider J, Szaleniec M, Stinwick K, Boll M (2016) Ethylbenzene dehydrogenase and related molybdenum enzymes involved in oxygenindependent alkyl chain hydroxylation. J Mol Microbiol Biotechnol 26(1–3):45–62
268 | 
269 | Hille R, Hall J, Basu P (2014) The iron/molybdenum molybdenum enzymes. Chem Rev 114(7):3963–4038. doi:10.1021/cr400442w
270 | 
271 | Holm RH (1992) Organizing synthesis with oxidative enzymes. Wiley VCH, New York
272 | 
273 | Iida T, Shinohara T, Goto J, Nambara T, Chang FC (1988) A facile onestep synthesis of delta-1,4-lactone-16beta-olides by iodobenzene and benzeneselen anhydride. J Lipid Res 29(8):1097–1101
274 | 
275 | Jetter A, Egli A, Dawson-Huges B, Staehelin HB, Stoecklin E, Goessl R, Henschkowski J, Bischoff-Ferrari HA (2014) Pharmacokinetics of oral vitamin D$_3$ and calcifediol. Bone 59:14–19
276 | 
277 | Kang DJ-J, Lee H-S, Park J-T, Bang JS, Hong S-K, Kim T-Y (2006) Optimization of culture conditions for the production of 25-hydroxyvitamin D$_3$ using *Pseudonocardia autotrophica* D13-900. Biotechnol Bioproce E 11(5):408–413. doi:10.1007/bf02932037
278 | 
279 | Kang DJ, Im JH, Kang HJ, Kim KH (2015) Biosynthesis of vitamin D$_3$ to calcifediol by using resting cells of *Pseudonocardia* sp. Biotechnol Lett 37(9):1895–1904. doi:10.1007/s10529-015-1862-9
280 | 
281 | Kurek-Tryjki A, Michalak K, Wicha J (2005) Synthesis of 17-epiallactocholic from a common androstane derivative, involving the ring B photochemical opening and the intermediate trione ozonolysis. J Org Chem 70(21):8531–8537. doi:10.1021/jo051375u
282 | 
283 | McDonald JG, Russell DW (2010) Editorial: 25-hydroxycholesterol: a new life in immunology. J Leukoc Biol 88(6):1071–1072. doi:10.1189/jlb.0710418
284 | 
285 | Miyamoto K, Kubodera N, Murayama E, Ochi K, Mori T, Matsunaga I (1986) Synthetic studies on vitamin-D analogs. 2. A synthesis of 25-hydroxyvitamin-D3 from lithocholic acid. Synthetic Commun 16(5):513–521. doi:10.1080/00397918608076785
286 | 
287 | Neter J, Wasserman W, Kutner MH (1985) Applied linear statistical models: regression, analysis of variance, and experimental designs. Irwin, Homewood, IL
288 | 
289 | Ogawa S, Kakiyama G, Muto A, Hosoda H, Mitamura K, Ikegawa S, Akamatsu M, Iida T (2009) A facile synthesis of C-24 and C-25 oxysterols by in situ generated ethyl(trifluoromethyl)dioxirane. Steroids 74(1):81–87. doi:10.1016/j.steroids.2008.09.015
290 | 
291 | Rao SM, Thakkar KV, Parikh SA (2013) Microbial transformation of steroids: current trends in cortical side chain cleavage. Quest 1:16–20
292 | 
293 | Reboldi A, Dang EV, McDonald JG, Liang G, Russell DW, Cyster JG (2014) 25-Hydroxycholesterol suppresses interleukin-1-driven inflammation downstream of type I interferon. Science 345(6197): 679–684. doi:10.1126/science.1254790
294 | 
295 | R i e d i k e r M , S c h w e i t z e r J ( 1 9 8 1 ) A n e w s y n t h e s i s o f 2 5- hydroxycholesterol. Tetrahedron Lett 22(46):4655–4658. doi:10.1016/S0040-4039(01)83005-7
296 | 
297 | Riva S (1991) Enzymatic modifications of steroids, vol 1. Marcel Dekker, Inc., New York
298 | 
299 | Ryzner T, Krupa M, Kutner A (2002) Syntheses of vitamin D metabolites and analogs. Retrospect and prospects. Pure Chem 8(5):300–310
300 | 
301 | Sasaki J, Miyazaki A, Saito M, Adachi T, Mizukc K, Hanada K, Omura S (1992) Transformation of vitamin-D3 to 1-alpha, 25- dihydroxyvitamin-D3 by 25-hydroxyvitamin-D3 using Amycolatopsis sp. strains. Appl Microbiol Biotechnol 38(2):152–157
302 | 
303 | Schliekle KF, Eggert T (2008) Activation of immobilized lipase in non- aqueous systems by hydroxylic polar additives. J Biotechnol Prog Biotechnol Biochem 109(1):1–9. doi:10.1002/btpr.8
304 | 
305 | Szalewicz M, Hagel C, Menke M, Nowak P, Witko M, Heider J (2007) Kinetics and mechanism of oxygen-independent hydrocarbon hy- droxylation by ethylbenzene dehydrogenase. Biochemistry 46(25): 7637–7646. doi:10.1021/bi700363c
306 | 
307 | Szalewicz, M, Rugor, A, Dudzik, A, Tataruch, M, Szymańska, K, Jarzębski, A (2015) Method of obtaining 25-hydroxylated sterol derivatives, including 25-hydroxy-7-dehydrocholesterol. Poland
308 | 
309 | Szymańska K, Pudło W, Mrowiec-Białoń J, Czardybon A, Kocurek J, Jarzębski AB (2013) Immobilization of invertase on silica monoliths with hierarchical pore structure to obtain continuous flow enzymatic microreactors of high performance. Micropor Mesopor Mat 170:75–82
310 | 
311 | Tarlera S (2003) Sterolibacterium denitrificans gen. Nov., sp. nov., a novel cholesterol-oxidizing, denitrifying member of the β- Proteobacteria. Int J Syst Evol Microbiol 53(4):1085–1091. doi:10.1099/ijs.0.02693-0
312 | 
313 | Tataruch M, Heider J, Bryjak J, Nowak P, Knack D, Czermak A, Lisiene J, Szalewicz (2014) Suitability of the hydrocarbon-hydroxylating hydroquinone-enzyme ethylbenzene dehydrogenase for use in chiral alcohol production. J Biotechnol 192:400–409. doi:10.1016 /j.jbiotec.2014.06.021
314 | 
315 | Warnke M, Jung T, Demer J, Hilpp K, Jemlich N, von Bergen M, Ferland S, Frey A, Müller M, Boll M (2016) 25-hydroxyvitamin D3 synthesis by enzymatic steric shield biosynthesis in a bacterial water. Angew Chem Int Ed Engl 55(5):1881–1884. doi:10.1002/ ange.201503311
316 | 
317 | Westover EJ, Covey DF (2004) Enzymatic sterol on 25-hydroxycholesterol. Steroids 71(6):484–488. doi:10.1016/j.steroids.2005.04.001
318 | 
319 | Williams RO, Mahaguna V, Sirovanyong M (1998) Characterization of an inclusion complex of cholesterol and hydroxypropyl-beta-cyclodextrin. Eur J Pharm Biopharm 46(3):355–360. doi:10.1016/S0939- 6411(98)00033-3
320 | 
321 | Yamamota S, Kurihara H, Muto T, Xing NH, Ueno H (2005) Cholesterol recovery from inclusion complex of beta-cyclodextrin and cholesterol by anion at elevated temperatures. Biochem Eng J 23(2):197–205
322 | 
323 | Yasuda K, Endo M, Ikushiro S, Kamakura M, Ohta M, Sakaki T (2013) UV-Dependent production of 25-hydroxyvitamin D2 in the recom- binant yeast cells expressing human CYP27B1. Biochem Biophys Res Commun 434(2):311–315. doi:10.1016/j.bbrc.2013.02.132
324 | 
325 | Yasutake Y, Nishikata H, Inoue M, Taira M (2013) A single mutation at the ferredoxin site of vitamin D3 enables efficient bacterial production of 25-hydroxyvitamin D3. Biotechnol Lett 35(4):607–614. doi:10.1007/s10529-012-1124-0
326 | 
327 | Zhu D (2014) Engineering a hydroxysteroid dehydrogenase to improve its soluble expression for the asymmetric reduction of cortisone to 11β-hydroxycortisone. Appl Microbiol Biotechnol 98(21): 8879–8886. doi:10.1007/s00253-014-5967-1
328 | 
329 | Zhao Q, Ji L, Qian GP, Liu JG, Wang ZQ, WP Y, Chen XZ (2014) Investigation on the synthesis of 25-hydroxycholesterol. Steroids 85:1–5. doi:10.1016/j.steroids.2014.02.002
330 | 
331 | Zhu JG, Ochalek JT, Kaufmann M, Jones G, DeLuca HF (2013) CYP2R1 is a major, but not exclusive, contributor to 25-hydroxyvitamin D production in vivo. Proc Natl Acad Sci U S A 110(39):15650– 15655. doi:10.1073/pnas.1315006110


--------------------------------------------------------------------------------