├── .gitignore
├── docker_ssh
    ├── Dockerfile
    ├── README.md
    └── run.sh
├── README.md
├── Vagrantfile
└── data-provider.md


/.gitignore:
--------------------------------------------------------------------------------
1 | paddle
2 | paddle_html
3 | woboq_codebrowser
4 | .DS_Store
5 | *~
6 | .vagrant
7 |
--------------------------------------------------------------------------------
/docker_ssh/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM paddledev/paddle:cpu-latest
2 |
3 | MAINTAINER Yi Wang
4 |
5 | RUN apt-get update
6 | RUN apt-get install -y openssh-server emacs24-nox
7 | RUN mkdir /var/run/sshd
8 | RUN echo 'root:root' | chpasswd
9 |
10 | RUN sed -ri 's/^PermitRootLogin\s+.*/PermitRootLogin yes/' /etc/ssh/sshd_config
11 | RUN sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config
12 |
13 | EXPOSE 22
14 |
15 | CMD ["/usr/sbin/sshd", "-D"]
16 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Browsing Paddle C++ Source Code
2 |
3 | By git-cloning this repository and running `vagrant up`, we use
4 | [woboq_codebrowser](https://github.com/woboq/woboq_codebrowser) to
5 | convert [Paddle](https://github.com/baidu/paddle) into indexed and
6 | browsable HTML files in the `paddle_html` directory. We can then copy
7 | the directory to a Web server to publish the result; for example, see
8 | [here](https://link.zhihu.com/?target=http%3A//162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/trainer/TrainerMain.cpp.html).
9 |
10 | For more details, such as why we would like to browse the Paddle source
11 | code, please refer to
12 | [this post](https://zhuanlan.zhihu.com/p/22484207?refer=cxwangyi) in
13 | Chinese.
14 |
15 | TODO: It seems that we don't have to provision a virtual machine for
16 | this; instead, we could build a Docker image.
17 |
18 |
20 |
--------------------------------------------------------------------------------
/docker_ssh/README.md:
--------------------------------------------------------------------------------
1 | # Run Paddle Docker Container and SSH into It.
2 |
3 | It is true that we can run Paddle in a Docker container by simply typing
4 |
5 | ```
6 | docker run -it paddledev/paddle:cpu-latest /bin/bash
7 | ```
8 |
9 | but it is usually a pain to use bash inside a container. For example,
10 | if we run bash inside a container, we have to
11 | [type Ctrl-P twice](http://stackoverflow.com/questions/20828657/docker-change-ctrlp-to-something-else)
12 | before getting the previous command line.
13 |
14 | An easy solution is to run both Paddle and an OpenSSH server
15 | inside a container and SSH into the container from the host
16 | machine, as described in
17 | [Paddle's document](http://www.paddlepaddle.org/doc/build/docker_install.html#remote-access).
18 |
19 | This repo slightly simplifies the described method by providing a
20 | `run.sh` script, which builds the Docker image of Paddle and the OpenSSH
21 | server, runs the container, and SSHes into it.
22 |
23 |
25 |
--------------------------------------------------------------------------------
/docker_ssh/run.sh:
--------------------------------------------------------------------------------
1 | realpath() {
2 |     [[ $1 = /* ]] && echo "$1" || echo "$PWD/${1#./}"
3 | }
4 |
5 | DIR=$(dirname "$(realpath "$0")")
6 | PADDLE=$(realpath "$DIR/../paddle")
7 |
8 | if [[ ! -d $PADDLE ]]; then
9 |     printf "Clone paddle ... "
10 |     git clone https://github.com/baidu/Paddle "$PADDLE" > /dev/null 2>&1 || { echo "Failed"; exit 1; }
11 | else
12 |     printf "Update paddle ... "
13 |     cd "$PADDLE"
14 |     git pull > /dev/null 2>&1 || { echo "Failed"; exit 1; }
15 | fi
16 | echo "Done"
17 |
18 | printf "Linking paddle ... "
19 | ln -sfn "$PADDLE" "$DIR/paddle"  # -n: replace an existing link instead of descending into it
20 | echo "Done"
21 |
22 | if docker images --format '{{.Repository}}' | grep -q '^paddle_ssh$'; then
23 |     echo "Use the existing Docker image paddle_ssh."
24 | else
25 |     printf "Building paddle_ssh Docker image ... "
26 |     cd "$DIR"
27 |     docker build -t paddle_ssh . || { echo "Failed"; exit 1; }
28 |     echo "Done"
29 | fi
30 |
31 | printf "Starting paddle and sshd container ... "
32 | docker rm -f paddle_ssh > /dev/null 2>&1
33 | docker run -d --name paddle_ssh -p 8022:22 -v "$PADDLE":/paddle paddle_ssh
34 | echo "Done"
35 |
36 | ssh -p 8022 root@localhost
37 |
--------------------------------------------------------------------------------
/Vagrantfile:
--------------------------------------------------------------------------------
1 | # -*- mode: ruby -*-
2 | Vagrant.configure("2") do |config|
3 |   config.vm.box = "ubuntu/trusty64"
4 |
5 |   config.vm.provider "virtualbox" do |vb|
6 |     vb.gui = false
7 |     vb.memory = "4096"
8 |   end
9 |
10 |   config.vm.provision "shell", inline: <<-SHELL
11 |     apt-get update
12 |     apt-get install -y clang-3.8 llvm-3.8 libclang-3.8-dev g++ make cmake build-essential libatlas-base-dev python python-pip libpython-dev m4 libprotobuf-dev protobuf-compiler python-protobuf python-numpy git libgoogle-glog-dev libgflags-dev libgtest-dev
13 |     sudo pip install wheel
14 |     sudo chmod -R a+w /usr/src/gtest
15 |     cd /usr/src/gtest
16 |     cmake . && make
17 |     sudo cp *.a /usr/lib
18 |
19 |     cd /vagrant
20 |     if [[ ! -d woboq_codebrowser ]]; then
21 |       git clone https://github.com/woboq/woboq_codebrowser
22 |     fi
23 |     cd woboq_codebrowser
24 |     git checkout master
25 |     git pull origin master
26 |     cmake . -DLLVM_CONFIG_EXECUTABLE=/usr/bin/llvm-config-3.8 -DCMAKE_BUILD_TYPE=Release
27 |     make -j4
28 |
29 |     cd /vagrant
30 |     if [[ ! 
-d paddle ]]; then
31 |       git clone https://github.com/baidu/paddle
32 |     fi
33 |     cd paddle
34 |     git checkout master
35 |     git pull origin master
36 |     if [[ -f Makefile ]]; then
37 |       make clean
38 |     fi
39 |     rm -f CMakeCache.txt
40 |     cmake . -DWITH_GPU=OFF -DWITH_DOC=OFF -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
41 |     make -j4
42 |
43 |     export OUTPUTDIRECTORY=/vagrant/paddle_html/codebrowser
44 |     cp -rv /vagrant/woboq_codebrowser/data /vagrant/paddle_html/
45 |     BUILDIRECTORY=$PWD
46 |     VERSION=`git describe --always --tags`
47 |     /vagrant/woboq_codebrowser/generator/codebrowser_generator -b $BUILDIRECTORY -a -o $OUTPUTDIRECTORY -p codebrowser:$BUILDIRECTORY:$VERSION
48 |     /vagrant/woboq_codebrowser/indexgenerator/codebrowser_indexgenerator $OUTPUTDIRECTORY
49 |   SHELL
50 | end
51 |
--------------------------------------------------------------------------------
/data-provider.md:
--------------------------------------------------------------------------------
1 | # Paddle's Data Provider
2 |
3 | Yi Wang
4 |
5 |
6 | ## Overview
7 |
8 | A Paddle program usually consists of two parts: a data provider and a main program.
9 |
10 |
11 | In Paddle's [Quick Start tutorial](http://www.paddlepaddle.org/doc/demo/quick_start/index_en.html), a very simple main program is [trainer_config.lr.py](https://github.com/baidu/Paddle/blob/master/demo/quick_start/trainer_config.lr.py), in which a [call](https://github.com/baidu/Paddle/blob/master/demo/quick_start/trainer_config.lr.py#L35) to the function `define_py_data_sources2` specifies its data provider:
12 |
13 | ```
14 | define_py_data_sources2(train_list=trn,
15 |                         test_list=tst,
16 |                         module="dataprovider_bow",
17 |                         obj=process,
18 |                         args={"dictionary": word_dict})
19 | ```
20 |
21 | This call specifies the training data `trn = 'data/train.list'` and the test data `tst = 'data/test.list'`. Both are text files in which each line is the name of a data file. The Python function that reads the data files is [`process`](https://github.com/baidu/Paddle/blob/master/demo/quick_start/dataprovider_bow.py#L50), which is defined in `dataprovider_bow.py`:
22 |
23 |
24 | ```
25 | @provider(init_hook=initializer, cache=CacheType.CACHE_PASS_IN_MEM)
26 | def process(settings, file_name):
27 
| ```
28 |
29 | Here `@provider` is a Python decorator defined in [`python/paddle/trainer/PyDataProvider2.py`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer/PyDataProvider2.py):
30 |
31 | ```
32 | def provider(input_types=None, should_shuffle=None, pool_size=-1,
33 |              min_pool_size=-1,
34 |              can_over_batch_size=True,
35 |              calc_batch_size=None,
36 |              cache=CacheType.NO_CACHE,
37 |              check=False, check_fail_continue=False,
38 |              use_dynamic_order=True,
39 |              init_hook=None, **kwargs):
40 |     """
41 |     Provider decorator. Use it to make a function into PyDataProvider2 object.
42 |     In this function, user only need to get each sample for some train/test
43 |     file.
44 | ```
45 |
46 | `@provider` is a decorator that takes arguments, which is more involved than the simpler [decorator without arguments](http://jfine-python-classes.readthedocs.io/en/latest/decorators.html); decorators with arguments are described [here](http://stackoverflow.com/questions/5929107/python-decorators-with-parameters).
47 |
48 |
49 | ## Defining a Data Provider
50 |
51 | To understand the data provider mechanism in detail, let us trace the execution of the Paddle program `trainer_config.lr.py` above step by step. The process starts from [`train.sh`](https://github.com/baidu/Paddle/blob/master/demo/quick_start/train.sh):
52 |
53 | ```
54 | paddle train \
55 |   --config=trainer_config.lr.py \
56 |   --save_dir=./output \
57 |   --trainer_count=4 \
58 |   --log_period=20 \
59 |   --num_passes=15 \
60 |   --use_gpu=false \
61 |   --show_parameter_stats_period=100 \
62 |   --test_all_data_in_one_period=1 \
63 |   2>&1 | tee 'train.log'
64 | ```
65 |
66 | When we run the `paddle` command, it executes the Python program `trainer_config.lr.py` specified by `--config`. This program specifies the data provider, describes the (deep) neural network, and starts a training, testing, or prediction job. The part that specifies the data provider, which we have already seen above, does so by calling `define_py_data_sources2`.
67 |
68 |
69 | `define_py_data_sources2` is defined in [`python/paddle/trainer_config_helpers/data_sources.py`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer_config_helpers/data_sources.py) and simply calls another function, [`define_py_data_sources`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer_config_helpers/data_sources.py#L97):
70 |
71 | ```
72 | define_py_data_sources(train_list=train_list,
73 |                        test_list=test_list,
74 |                        module=module,
75 |                        obj=obj,
76 |                        args=args,
77 |                        data_cls=None)
78 | ```
79 |
80 | [`define_py_data_sources`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer_config_helpers/data_sources.py#L97) in turn calls [`define_py_data_source`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer_config_helpers/data_sources.py#L29) twice:
81 |
82 | ```
83 | define_py_data_source(file_list=train_list, cls=TrainData, module=module, obj=obj, args=args,
84 |                       train_async=False, data_cls=None)
85 |
86 | define_py_data_source(file_list=test_list, cls=TestData, module=module, obj=obj, args=args,
87 |                       train_async=False, data_cls=None)
88 | ```
89 |
90 | [`define_py_data_source`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer_config_helpers/data_sources.py#L29) does the following:
91 |
92 |
93 | 1. If `file_list` is a Python list, it is converted into a text file that lists the data files.
94 |
95 |
96 | 1. Because the argument `data_cls` in the example above is `None`, `define_py_data_source` sets `data_cls` to the nested function [`py_data2`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer_config_helpers/data_sources.py#L79).
97 |
98 |    ```
99 |    if data_cls is None:
100 |        def py_data2(files, load_data_module, load_data_object, load_data_args,
101 |                     **kwargs):
102 |            data = DataBase()
103 |            data.type = 'py2'
104 |            data.files = files
105 |            data.load_data_module = load_data_module
106 |            data.load_data_object = load_data_object
107 |            data.load_data_args = load_data_args
108 |            return data
109 |        data_cls = py_data2
110 |    ```
111 |
112 |    `py_data2` calls [`DataBase()`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer/config_parser.py#L788). This function creates a protobuf message `DataConfig` and fills in information that describes a data source. One important piece of information is the type of the data source, `DataConfig.type="py2"`. We will see below what this type is used for.
113 |
114 |
115 | 1. 
Next, `define_py_data_source` calls the function `cls`. In the example above, the value of `cls` is [`TrainData`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer/config_parser.py#L938) or [`TestData`](https://github.com/baidu/Paddle/blob/master/python/paddle/trainer/config_parser.py#L950):
116 |
117 |    ```
118 |    cls(data_cls(files=file_list,
119 |                 load_data_module=module,
120 |                 load_data_object=obj,
121 |                 load_data_args=args,
122 |                 async_load_data=async))
123 |    ```
124 |
125 |    `TrainData` and `TestData` do not do much. They mainly record the `DataConfig` variable in the global variable `g_config`.
126 |
127 |
128 | ## Invoking a Data Provider
129 |
130 | In the example above, the `module` and `obj` arguments are written into `DataConfig.load_data_module` and `DataConfig.load_data_object`. These two fields are used by a C++ class, [`PyDataProvider2`](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/gserver/dataproviders/PyDataProvider2.cpp.html#paddle::PyDataProvider2).
131 |
132 | Right below the definition of `PyDataProvider2`, there is [a line](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/gserver/dataproviders/PyDataProvider2.cpp.html#667):
133 |
134 | ```
135 | REGISTER_DATA_PROVIDER_EX(py2, PyDataProvider2);
136 | ```
137 |
138 | The macro `REGISTER_DATA_PROVIDER_EX` is defined [here](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/gserver/dataproviders/DataProvider.h.html#_M/REGISTER_DATA_PROVIDER_EX):
139 |
140 | ```
141 | #define REGISTER_DATA_PROVIDER_EX(__type_name, __class_name) \
142 |   static InitFunction __reg_type_##__type_name([] { \
143 |     DataProvider::registrar_.registerClass<__class_name>(#__type_name); \
144 |   })
145 | ```
146 |
147 | It registers `__type_name` ("py2" in our example) and `__class_name` (`PyDataProvider2` in our example) in [`registrar_`](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/gserver/dataproviders/DataProvider.h.html#_ZN6paddle12DataProvider10registrar_E), a static member of class `DataProvider`:
148 |
149 | ```
150 | class DataProvider {
151 | public:
152 |   static ClassRegistrar<DataProvider, DataConfig, ModelConfig, bool> registrar_;
153 | ```
154 |
155 | Here, the class template 
`ClassRegistrar` is defined [here](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/utils/ClassRegistrar.h.html#44). It maintains a map from class names to their creator functions:
156 |
157 | ```
158 | template <class BaseClass, typename... CreateArgs>
159 | class ClassRegistrar {
160 |   typedef std::function<BaseClass*(CreateArgs...)> ClassCreator;
161 |   std::map<std::string, ClassCreator> creatorMap_;
162 | ```
163 |
164 | `DataProvider::registrar_` is used by the member function [`DataProvider::create`](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/gserver/dataproviders/DataProvider.cpp.html#_ZN6paddle12DataProvider6createERKNS_10DataConfigERKNS_11ModelConfigEb) to create a data provider:
165 |
166 | ```
167 | DataProvider* DataProvider::create(const DataConfig& config,
168 |                                    const ModelConfig& modelConfig,
169 |                                    bool useGpu) {
170 |   return registrar_.createByType(config.type(), config, modelConfig, useGpu);
171 | }
172 | ```
173 |
174 | `DataProvider::create` is in turn called by [`Trainer::init`](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/trainer/Trainer.cpp.html#196):
175 |
176 | ```
177 | if (!dataProvider_ && config_->hasDataConfig()) {
178 |   dataProvider_.reset(DataProvider::create(*config_, *config_, gpuData));
179 | }
180 | ```
181 |
182 | The data provider returned by this call is stored in the data member [`Trainer::dataProvider_`](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/trainer/Trainer.h.html#paddle::Trainer::dataProvider_).
183 |
184 |
185 | [`Trainer::train`](http://162.243.141.242/paddle_html/codebrowser/codebrowser/paddle/trainer/Trainer.cpp.html#274) passes `Trainer::dataProvider_` to the `start` member function of the gradient machine:
186 |
187 | ```
188 | void Trainer::train(size_t numPasses) {
189 |   ...
190 |   trainerInternal_.getGradientMachine()->start(*config_, dataProvider_);
191 |   ...
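192 | ```
193 |
194 | This `start` function calls `PyDataProvider2::createPyData` and `PyDataProvider2::readByFields` to create the required Python function (the `process` function mentioned at the beginning of this article) and read the data.
195 |
196 | `Trainer::train` then runs a loop that iteratively updates the model parameters. At the end of each iteration, it calls the data provider's `reset` function to notify the data provider that one scan of the data has finished.
197 |
198 | ```
199 | for (size_t i = 0; i < numPasses; ++i) {
200 |   ...
201 |   if (i < numPasses - 1) {
202 |     dataProvider_->reset();
203 |   }
204 | }
205 | ```
206 |
207 | The data provider can take this opportunity to update its cache, so that the next iteration does not need to read the data from disk and can instead use the copy already cached in memory.
208 |
--------------------------------------------------------------------------------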
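The two ideas that close `data-provider.md`, a decorator that takes arguments and a cache that becomes usable once `reset` is called between passes, can be sketched in plain Python. This is only an illustrative toy under stated assumptions: `toy_provider`, `ToyDataProvider`, `samples`, and `disk_reads` are hypothetical names invented for this sketch, not Paddle's `provider` or `PyDataProvider2` API, and the toy `process` takes only a file name rather than `(settings, file_name)`.

```python
def toy_provider(cache_in_memory=False):
    """Hypothetical decorator factory (NOT Paddle's real `provider`).

    Calling it with arguments returns the actual decorator, which is
    how a decorator with parameters works in Python.
    """

    def decorator(read_file):
        class ToyDataProvider:
            def __init__(self, file_names):
                self.file_names = list(file_names)
                self.cache = None            # in-memory copy of one full pass
                self._pending = []           # samples seen in the current pass
                self._pass_complete = False  # True once a pass was fully read
                self.disk_reads = 0          # number of files actually read

            def samples(self):
                """Yield every sample of one pass over the data."""
                if self.cache is not None:
                    # Later passes: serve the cached copy, no file access.
                    yield from self.cache
                    return
                self._pending = []
                self._pass_complete = False
                for name in self.file_names:
                    self.disk_reads += 1
                    for sample in read_file(name):
                        self._pending.append(sample)
                        yield sample
                self._pass_complete = True

            def reset(self):
                """Called between passes, in the spirit of `dataProvider_->reset()`."""
                if cache_in_memory and self.cache is None and self._pass_complete:
                    self.cache = self._pending

        return ToyDataProvider

    return decorator


@toy_provider(cache_in_memory=True)
def process(file_name):
    # Stand-in for reading a data file: two fake samples per file.
    yield (file_name, 0)
    yield (file_name, 1)


dp = process(["a.txt", "b.txt"])  # `process` is now a ToyDataProvider factory
for _ in range(3):                # three passes over the data
    batch = list(dp.samples())
    dp.reset()
```

In this toy, only the first pass touches the "files"; after the first `reset`, later passes are served from memory. The design choice of filling the cache in `reset` rather than during iteration mirrors the idea that the trainer, not the provider, decides when a pass has ended.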