├── LICENSE ├── Makefile ├── README.md ├── R_api └── index.rst ├── all_api └── index.rst ├── cli_api └── index.rst ├── conf.py ├── demo └── index.rst ├── images ├── out-of-core.png ├── ps.png └── speed.png ├── index.rst ├── install ├── index.rst └── install_windows.rst ├── large └── index.rst ├── make.bat ├── python_api └── index.rst ├── tune └── index.rst └── tutorial └── index.rst /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SPHINXPROJ = xLearn 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # xlearn_doc_cn 2 | Chinese documentation for xlearn 3 | -------------------------------------------------------------------------------- /R_api/index.rst: -------------------------------------------------------------------------------- 1 | xLearn R API Guide 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | The xLearn R package guide is coming soon. -------------------------------------------------------------------------------- /all_api/index.rst: -------------------------------------------------------------------------------- 1 | xLearn API Overview 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | This page lists all xLearn APIs, including the command line interface and the Python interface. 5 | 6 | Command Line Interface 7 | ------------------------------ 8 | 9 | Training: :: 10 | 11 | xlearn_train [OPTIONS] 12 | 13 | Options: :: 14 | 15 | -s : Type of machine learning model (default 0) 16 | for classification task: 17 | 0 -- linear model (GLM) 18 | 1 -- factorization machines (FM) 19 | 2 -- field-aware factorization machines (FFM) 20 | for regression task: 21 | 3 -- linear model (GLM) 22 | 4 -- factorization machines (FM) 23 | 5 -- field-aware factorization machines (FFM) 24 | 25 | -x : The metric can be 'acc', 'prec', 'recall', 'f1', 'auc' for classification, and 26 | 'mae', 'mape', 'rmsd (rmse)' for regression. By default, xLearn will not print 27 | any evaluation metric information (it only prints the loss value). 28 | 29 | -p : Choose the optimization method, including 'sgd', 'adagrad', and 'ftrl'. By default, 30 | xLearn uses the 'adagrad' optimization method. 31 | 32 | -v : Path of the validation data. This option is empty by default; in that case, 33 | xLearn will not perform the validation process. 34 | 35 | -m : Path of the model dump file. By default, the model file name is 'train_file' + '.model'.
36 | If we set this value to 'none', xLearn will not dump the model checkpoint. 37 | 38 | -pre : Path of the pre-trained model. This can be used for online learning. 39 | 40 | -t : Path of the TEXT model checkpoint file. By default, this option is not set 41 | and xLearn will not dump the TEXT model. 42 | 43 | -l : Path of the log file. xLearn uses '/tmp/xlearn_log.*' by default. 44 | 45 | -k : Number of latent factors used by FM and FFM tasks. Using 4 by default. 46 | Note that we will get the same model size when setting k to 1 and 4. 47 | This is because we use SSE instructions and the memory needs to be aligned. 48 | So even if you assign k = 1, we still fill some dummy zeros from k = 2 to 4. 49 | 50 | -r : Learning rate for the optimization method. Using 0.2 by default. 51 | xLearn can use adaptive gradient descent (AdaGrad) for optimization; 52 | if you choose the AdaGrad method, the learning rate will be adapted automatically. 53 | 54 | -b : Lambda for L2 regularization. Using 0.00002 by default. We can disable the 55 | regularization term by setting this value to zero. 56 | 57 | -alpha : Hyper parameter used by ftrl. 58 | 59 | -beta : Hyper parameter used by ftrl. 60 | 61 | -lambda_1 : Hyper parameter used by ftrl. 62 | 63 | -lambda_2 : Hyper parameter used by ftrl. 64 | 65 | -u : Hyper parameter used to initialize model parameters. Using 0.66 by default. 66 | 67 | -e : Number of epochs for the training process. Using 10 by default. Note that xLearn will perform 68 | early-stopping by default, so this value is just an upper bound. 69 | 70 | -f : Number of folds for cross-validation (if we set the --cv option). Using 5 by default. 71 | 72 | -nthread : Number of threads for multi-thread lock-free learning (Hogwild!). 73 | 74 | -block : Block size for on-disk training. 75 | 76 | -sw : Size of the stop window for early-stopping. Using 2 by default. 77 | 78 | -seed : Random seed to shuffle the data set.
79 | 80 | --disk : Enable on-disk training for large-scale machine learning problems. 81 | 82 | --cv : Enable cross-validation in training tasks. If we use this option, xLearn will ignore 83 | the validation file (set by the -v option). 84 | 85 | --dis-lock-free : Disable lock-free training. Lock-free training can accelerate training but the result 86 | is non-deterministic. Our suggestion is that you can set this flag if the training data 87 | is big and sparse. 88 | 89 | --dis-es : Disable early-stopping in training. By default, xLearn will use early-stopping 90 | in the training process, except when training with cross-validation. 91 | 92 | --no-norm : Disable instance-wise normalization. By default, xLearn will use instance-wise 93 | normalization in both training and prediction processes. 94 | 95 | --no-bin : Do not generate a bin file for the training and test data files. 96 | 97 | --quiet : Don't print any evaluation information during training; just train the 98 | model quietly. It can accelerate the training process. 99 | 100 | Prediction: :: 101 | 102 | xlearn_predict [OPTIONS] 103 | 104 | Options: :: 105 | 106 | -o : Path of the output file. By default, this value will be set to 'test_file' + '.out' 107 | 108 | -l : Path of the log file. xLearn uses '/tmp/xlearn_log' by default. 109 | 110 | -nthread : Number of threads for multi-thread lock-free learning (Hogwild!). 111 | 112 | -block : Block size for on-disk prediction. 113 | 114 | --sign : Convert the output results to 0 or 1. 115 | 116 | --sigmoid : Convert the output results to the range (0, 1) (probability). 117 | 118 | --disk : On-disk prediction. 119 | 120 | --no-norm : Disable instance-wise normalization. By default, xLearn will use instance-wise 121 | normalization in both training and prediction processes.
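The ``--sign`` and ``--sigmoid`` flags above are simple post-processing steps applied to the raw model scores. A minimal pure-Python sketch of what they compute (illustrative only, not xLearn's own code):

```python
import math

def sigmoid(score):
    # Map a raw model score to a probability in (0, 1), as --sigmoid does.
    return 1.0 / (1.0 + math.exp(-score))

def sign(score):
    # Map a raw model score to a hard 0/1 label, as --sign does.
    return 1 if score > 0 else 0

# Raw scores like those written by xlearn_predict:
raw_scores = [-1.9872, -0.0707959, 0.456214]
probs = [sigmoid(s) for s in raw_scores]
labels = [sign(s) for s in raw_scores]
```

A negative raw score maps to a probability below 0.5 and a hard label of 0, which is why both flags agree on which samples are negative.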
122 | 123 | Python Interface 124 | ------------------------------ 125 | 126 | API list: :: 127 | 128 | import xlearn as xl # Import xlearn package 129 | 130 | xl.hello() # Say hello to user 131 | 132 | # This part is for data 133 | # X is feature data, which can be a pandas DataFrame or numpy.ndarray, 134 | # y is the label, default None, which can be a pandas DataFrame/Series, array, or list, 135 | # field_map is the field map of features, default None, which can be a pandas DataFrame/Series, array, or list 136 | dmatrix = xl.DMatrix(X, y, field_map) 137 | 138 | model = xl.create_linear() # Create linear model. 139 | 140 | model = xl.create_fm() # Create factorization machines. 141 | 142 | model = xl.create_ffm() # Create field-aware factorization machines. 143 | 144 | model.show() # Show model information. 145 | 146 | model.fit(param, "model_path") # Train model. 147 | 148 | model.cv(param) # Perform cross-validation. 149 | 150 | # Users can choose one of these two 151 | model.predict("model_path", "output_path") # Perform prediction, write results to file, return None. 152 | model.predict("model_path") # Perform prediction, return results as numpy.ndarray. 153 | 154 | # Users can choose one of these two 155 | model.setTrain("data_path") # Set training data from file for xLearn. 156 | model.setTrain(dmatrix) # Set training data from DMatrix for xLearn. 157 | 158 | # Users can choose one of these two 159 | # note: the validation data must be of the same type as the training data; 160 | # that is, if training data is set from a file, validation data must also be set from a file 161 | model.setValidate("data_path") # Set validation data from file for xLearn. 162 | model.setValidate(dmatrix) # Set validation data from DMatrix for xLearn. 163 | 164 | # Users can choose one of these two 165 | model.setTest("data_path") # Set test data from file for xLearn. 166 | model.setTest(dmatrix) # Set test data from DMatrix for xLearn. 167 | 168 | model.setQuiet() # Set xlearn to train the model quietly. 169 | 170 | model.setOnDisk() # Set xlearn to use on-disk training.
171 | 172 | model.setNoBin() # Do not generate bin file for training and test data. 173 | 174 | model.setSign() # Convert prediction to 0 and 1. 175 | 176 | model.setSigmoid() # Convert prediction to (0, 1). 177 | 178 | model.disableNorm() # Disable instance-wise normalization. 179 | 180 | model.disableLockFree() # Disable lock-free training. 181 | 182 | model.disableEarlyStop() # Disable early-stopping. 183 | 184 | Hyper-parameter list: :: 185 | 186 | task : {'binary', # Binary classification 187 | 'reg'} # Regression 188 | 189 | metric : {'acc', 'prec', 'recall', 'f1', 'auc', # for classification 190 | 'mae', 'mape', 'rmse', 'rmsd'} # for regression 191 | 192 | lr : float value # learning rate 193 | 194 | lambda : float value # regularization lambda 195 | 196 | k : int value # latent factors for fm and ffm 197 | 198 | init : float value # model initialization 199 | 200 | alpha : float value # hyper parameter for ftrl 201 | 202 | beta : float value # hyper parameter for ftrl 203 | 204 | lambda_1 : float value # hyper parameter for ftrl 205 | 206 | lambda_2 : float value # hyper parameter for ftrl 207 | 208 | nthread : int value # the number of CPU cores 209 | 210 | epoch : int value # number of epochs 211 | 212 | fold : int value # number of folds for cross-validation 213 | 214 | opt : {'sgd', 'adagrad', 'ftrl'} # optimization method 215 | 216 | stop_window : int value # size of the stop window for early-stopping 217 | 218 | block_size : int value # block size for on-disk training 219 | 220 | R Interface 221 | ------------------------------ 222 | 223 | The xLearn R API page is coming soon.
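The hyper-parameter list earlier on this page is passed to xLearn as a plain Python ``dict`` (e.g. to ``model.fit``). A minimal sketch, in pure Python with no xLearn dependency, that builds such a dict and checks the choice-type fields against the documented values; the ``check_param`` helper is illustrative and not part of the xLearn API:

```python
# Documented choice-type hyper-parameters and their allowed values.
ALLOWED = {
    "task": {"binary", "reg"},
    "metric": {"acc", "prec", "recall", "f1", "auc",
               "mae", "mape", "rmse", "rmsd"},
    "opt": {"sgd", "adagrad", "ftrl"},
}

def check_param(param):
    # Return the (key, value) pairs whose value is not an allowed choice.
    return [(k, v) for k, v in param.items()
            if k in ALLOWED and v not in ALLOWED[k]]

param = {"task": "binary", "lr": 0.2, "lambda": 0.002,
         "metric": "auc", "opt": "adagrad", "epoch": 10}
bad = check_param(param)   # empty list: every choice above is valid
```

Numeric fields such as ``lr`` and ``lambda`` are passed through unchecked here; only the enumerated options are validated.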
224 | -------------------------------------------------------------------------------- /cli_api/index.rst: -------------------------------------------------------------------------------- 1 | xLearn Command Line Guide 2 | =============================== 3 | 4 | If you have compiled and installed xLearn, you will see the two executables ``xlearn_train`` and ``xlearn_predict`` in the current ``build`` directory. They can be used for model training and prediction. 5 | 6 | Quick Start 7 | ---------------------------------------- 8 | 9 | Make sure you are in the xLearn ``build`` directory, where you can find the two sample data sets ``small_test.txt`` and ``small_train.txt``. We train a model with the following command: :: 10 | 11 | ./xlearn_train ./small_train.txt 12 | 13 | Below is part of the program output. Note that the ``log_loss`` values shown here may differ slightly from the ``log_loss`` values computed on your machine: :: 14 | 15 | [ ACTION ] Start to train ... 16 | [------------] Epoch Train log_loss Time cost (sec) 17 | [ 10% ] 1 0.569292 0.00 18 | [ 20% ] 2 0.517142 0.00 19 | [ 30% ] 3 0.490124 0.00 20 | [ 40% ] 4 0.470445 0.00 21 | [ 50% ] 5 0.451919 0.00 22 | [ 60% ] 6 0.437888 0.00 23 | [ 70% ] 7 0.425603 0.00 24 | [ 80% ] 8 0.415573 0.00 25 | [ 90% ] 9 0.405933 0.00 26 | [ 100% ] 10 0.396388 0.00 27 | [ ACTION ] Start to save model ... 28 | [------------] Model file: ./small_train.txt.model 29 | 30 | By default, xLearn trains our model with *logistic regression (LR)* for 10 epochs.
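The ``Train log_loss`` column in the output above is the average logistic loss over the training set at the end of each epoch. As a quick illustration of how such a value is computed (a generic sketch, not xLearn's implementation):

```python
import math

def log_loss(y_true, y_prob):
    """Average logistic loss; y_true in {0, 1}, y_prob in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Predicting 0.5 for everything gives ln(2) ~ 0.693; confident correct
# predictions push the loss toward 0, as in the epochs above.
loss = log_loss([1, 0, 1], [0.9, 0.2, 0.8])
```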
31 | 32 | We find that after training, xLearn has produced a new file called ``small_train.txt.model`` in the current directory. This file stores the trained model, which we can use for prediction later: :: 33 | 34 | ./xlearn_predict ./small_test.txt ./small_train.txt.model 35 | 36 | After running the above command, we get a new file ``small_test.txt.out`` in the current directory, which is the output of the prediction task. We can show the first few lines of this output file with the following command: :: 37 | 38 | head -n 5 ./small_test.txt.out 39 | 40 | -1.9872 41 | -0.0707959 42 | -0.456214 43 | -0.170811 44 | -1.28986 45 | 46 | Each line here is the score for one sample in the test data. A negative score means we predict the sample to be a negative sample, and a positive score means a positive sample (there are none in this example). In xLearn, users can convert the scores into the range (0, 1) with the ``--sigmoid`` option, or into 0/1 with the ``--sign`` option: :: 47 | 48 | ./xlearn_predict ./small_test.txt ./small_train.txt.model --sigmoid 49 | head -n 5 ./small_test.txt.out 50 | 51 | 0.120553 52 | 0.482308 53 | 0.387884 54 | 0.457401 55 | 0.215877 56 | 57 | ./xlearn_predict ./small_test.txt ./small_train.txt.model --sign 58 | head -n 5 ./small_test.txt.out 59 | 60 | 0 61 | 0 62 | 0 63 | 0 64 | 0 65 | 66 | Model Output 67 | ---------------------------------------- 68 | 69 | Users can generate different models by setting different hyper-parameters, and xLearn uses the ``-m`` option to specify the path of these output model files. By default, the model file path is ``training_data_name`` + ``.model`` in the current working directory: :: 70 | 71 | ./xlearn_train ./small_train.txt -m new_model 72 | 73 | Users can also dump the model into a human-readable ``TXT`` format with the ``-t`` option, for example: :: 74 | 75 | ./xlearn_train ./small_train.txt -t model.txt 76 | 77 | After running the above command, we find a new file ``model.txt`` in the current directory, which stores the output model in ``TXT`` format: :: 78 | 79 | head -n 5 ./model.txt 80 | 81 | -0.688182 82 | 0.458082 83 | 0 84 | 0 85 | 0 86 | 87 | For a linear model, the ``TXT`` format stores each model parameter on one line. For FM and FFM, the model stores each latent vector on one line. 88 | 89 | Linear: :: 90 | 91 | bias: 0 92 | i_0: 0 93 | i_1: 0 94 | i_2: 0 95 | i_3: 0 96 | 97 | FM: :: 98 | 99 | bias: 0 100 | i_0: 0 101 | i_1: 0 102 | i_2: 0 103 | i_3: 0 104 | v_0: 5.61937e-06 0.0212581 0.150338 0.222903 105 | v_1: 0.241989 0.0474224 0.128744 0.0995021 106 | v_2: 0.0657265 0.185878 0.0223869 0.140097 107 | v_3: 0.145557 0.202392 0.14798 0.127928 108 | 109 | FFM: :: 110 | 111 | bias: 0 112 | i_0: 0 113 | i_1: 0 114 | i_2: 0 115 | i_3: 0 116 | v_0_0: 5.61937e-06 0.0212581 0.150338 0.222903 117 | v_0_1: 0.241989 0.0474224 0.128744 0.0995021 118 | v_0_2: 0.0657265 0.185878 0.0223869 0.140097 119 | v_0_3: 0.145557 0.202392 0.14798 0.127928 120 | v_1_0: 0.219158 0.248771 0.181553 0.241653 121 | v_1_1: 0.0742756 0.106513 0.224874 0.16325 122 | v_1_2: 0.225384 0.240383 0.0411782 0.214497 123 | v_1_3: 0.226711 0.0735065 0.234061 0.103661 124 | v_2_0: 0.0771142 0.128723 0.0988574 0.197446 125 | v_2_1: 0.172285 0.136068 0.148102 0.0234075 126 | v_2_2: 0.152371 0.108065 0.149887 0.211232 127 | v_2_3: 0.123096 0.193212 0.0179155 0.0479647 128 | v_3_0: 0.055902 0.195092 0.0209918 0.0453358 129 | v_3_1: 0.154174 0.144785 0.184828 0.0785329 130 | v_3_2: 0.109711 0.102996 0.227222 0.248076 131 | v_3_3: 0.144264 0.0409806 0.17463 0.083712 132 | 133 | Online Learning 134 | ---------------------------------------- 135 | xLearn supports online learning, i.e., it can load a previously pre-trained model and continue learning from it. Users can specify the path of the pre-trained model file with the ``-pre`` option. For example: :: 136 | 137 | ./xlearn_train ./small_train.txt -s 0 -pre ./pre_model 138 | 139 | Note that xLearn can only load a binary pre-trained model, not a TXT-format text model. 140 | 141 | Prediction Output 142 | ---------------------------------------- 143 | 144 | Users can specify the path of the prediction output file with the ``-o`` option. For example: :: 145 | 146 | ./xlearn_predict ./small_test.txt ./small_train.txt.model -o output.txt 147 | head -n 5 ./output.txt 148 | 149 | -2.01192 150 | -0.0657416 151 | -0.456185 152 | -0.170979 153 | -1.28849 154 | 155 | By default, the prediction output file path is ``test_data_name`` + ``.out`` in the current directory. 156 | 157 | Choosing the Machine Learning Algorithm 158 | ---------------------------------------- 159 | 160 | Currently, xLearn supports three different machine learning algorithms: the linear model (LR), the factorization machine (FM), and the field-aware factorization machine (FFM).
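The ``-s`` values follow a fixed scheme: 0-2 select classification models and 3-5 the corresponding regression models. As a quick reference, a small lookup table makes the scheme explicit (illustrative only):

```python
# Mapping of the -s option to (task, model), as documented for xlearn_train.
S_OPTION = {
    0: ("classification", "linear"),
    1: ("classification", "fm"),
    2: ("classification", "ffm"),
    3: ("regression", "linear"),
    4: ("regression", "fm"),
    5: ("regression", "ffm"),
}

def describe(s):
    # Human-readable description of an -s value.
    task, model = S_OPTION[s]
    return f"{task} with {model}"
```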
161 | 162 | Users can select different algorithms with the ``-s`` option: :: 163 | 164 | ./xlearn_train ./small_train.txt -s 0 # Classification: Linear model (GLM) 165 | ./xlearn_train ./small_train.txt -s 1 # Classification: Factorization machine (FM) 166 | ./xlearn_train ./small_train.txt -s 2 # Classification: Field-aware factorization machine (FFM) 167 | 168 | ./xlearn_train ./small_train.txt -s 3 # Regression: Linear model (GLM) 169 | ./xlearn_train ./small_train.txt -s 4 # Regression: Factorization machine (FM) 170 | ./xlearn_train ./small_train.txt -s 5 # Regression: Field-aware factorization machine (FFM) 171 | 172 | For the LR and FM algorithms, the input data format must be ``CSV`` or ``libsvm``. For the FFM algorithm, the input data must be in ``libffm`` format: :: 173 | 174 | libsvm format: 175 | 176 | y index_1:value_1 index_2:value_2 ... index_n:value_n 177 | 178 | 0 0:0.1 1:0.5 3:0.2 ... 179 | 0 0:0.2 2:0.3 5:0.1 ... 180 | 1 0:0.2 2:0.3 5:0.1 ... 181 | 182 | CSV format: 183 | 184 | y value_1 value_2 .. value_n 185 | 186 | 0 0.1 0.2 0.2 ... 187 | 1 0.2 0.3 0.1 ... 188 | 0 0.1 0.2 0.4 ... 189 | 190 | libffm format: 191 | 192 | y field_1:index_1:value_1 field_2:index_2:value_2 ... 193 | 194 | 0 0:0:0.1 1:1:0.5 2:3:0.2 ... 195 | 0 0:0:0.2 1:2:0.3 2:5:0.1 ... 196 | 1 0:0:0.2 1:2:0.3 2:5:0.1 ... 197 | 198 | xLearn can also use ``,`` as the data separator, for example: :: 199 | 200 | libsvm format: 201 | 202 | label,index_1:value_1,index_2:value_2 ... index_n:value_n 203 | 204 | CSV format: 205 | 206 | label,value_1,value_2 .. value_n 207 | 208 | libffm format: 209 | 210 | label,field_1:index_1:value_1,field_2:index_2:value_2 ... 211 | 212 | Note that if the input csv file does not contain a ``y`` value, users must manually add a placeholder to every line (this also applies to test data). Otherwise, xLearn will treat the first element as ``y``.
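The formats above are easy to parse by hand. A minimal sketch that reads one space-separated libsvm or libffm line into a label and a sparse feature map (a ``,`` separator would work the same way after splitting on commas):

```python
def parse_libsvm_line(line):
    # "y idx:val idx:val ..." -> (y, {idx: val})
    parts = line.split()
    label = float(parts[0])
    feats = {}
    for tok in parts[1:]:
        idx, val = tok.split(":")
        feats[int(idx)] = float(val)
    return label, feats

def parse_libffm_line(line):
    # "y field:idx:val ..." -> (y, {(field, idx): val})
    parts = line.split()
    label = float(parts[0])
    feats = {}
    for tok in parts[1:]:
        field, idx, val = tok.split(":")
        feats[(int(field), int(idx))] = float(val)
    return label, feats

label, feats = parse_libffm_line("1 0:0:0.2 1:2:0.3 2:5:0.1")
```

This also shows why LR and FM can consume libffm input: dropping the leading ``field`` from each token leaves a valid libsvm token.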
213 | 214 | LR 和 FM 算法的输入可以是 ``libffm`` 格式,xLearn 会忽略其中的 ``field`` 项并将其视为 ``libsvm`` 格式。 215 | 216 | 设置 Validation Dataset(验证集) 217 | ---------------------------------------- 218 | 219 | 在机器学习中,我们可以通过 Validation Dataset (验证集) 来进行超参数调优。在 xLearn 中,用户可以使用 ``-v`` 选项来指定验证集文件,例如: :: 220 | 221 | ./xlearn_train ./small_train.txt -v ./small_test.txt 222 | 223 | 下面是程序的一部分输出: :: 224 | 225 | [ ACTION ] Start to train ... 226 | [------------] Epoch Train log_loss Test log_loss Time cost (sec) 227 | [ 10% ] 1 0.571922 0.531160 0.00 228 | [ 20% ] 2 0.520315 0.542134 0.00 229 | [ 30% ] 3 0.492147 0.529684 0.00 230 | [ 40% ] 4 0.470234 0.538684 0.00 231 | [ 50% ] 5 0.452695 0.537496 0.00 232 | [ 60% ] 6 0.439367 0.537790 0.00 233 | [ 70% ] 7 0.425216 0.534396 0.00 234 | [ 80% ] 8 0.416215 0.542883 0.00 235 | [ 90% ] 9 0.404673 0.547597 0.00 236 | 237 | 我们可以看到,在这个任务中 ``Train log_loss`` 在不断的下降,而 ``Test log_loss`` (validation loss) 则是先下降,后上升。这代表当前我们训练的模型已经 overfit (过拟合)我们的训练数据。 238 | 239 | 在默认的情况下,xLearn 会在每一轮 epoch 结束后计算 validation loss 的数值,而用户可以使用 ``-x`` 选项来制定不同的评价指标。对于分类任务而言,评价指标有: ``acc`` (accuracy), ``prec`` (precision), ``f1``, 以及 ``auc``,例如: :: 240 | 241 | ./xlearn_train ./small_train.txt -v ./small_test.txt -x acc 242 | ./xlearn_train ./small_train.txt -v ./small_test.txt -x prec 243 | ./xlearn_train ./small_train.txt -v ./small_test.txt -x f1 244 | ./xlearn_train ./small_train.txt -v ./small_test.txt -x auc 245 | 246 | 对于回归任务而言,评价指标包括:``mae``, ``mape``, 以及 ``rmsd`` (或者叫作 ``rmse``),例如: :: 247 | 248 | cd demo/house_price/ 249 | ../../xlearn_train ./house_price_train.txt -s 3 -x rmse --cv 250 | ../../xlearn_train ./house_price_train.txt -s 3 -x rmsd --cv 251 | 252 | 注意,这里我们通过设置 ``--cv`` 选项使用了 *Cross-Validation (交叉验证)* 功能, 我们将在下一节详细介绍该功能。 253 | 254 | Cross-Validation (交叉验证) 255 | ---------------------------------------- 256 | 257 | 在机器学习中,Cross-Validation (交叉验证) 是一种被广泛使用的模型超参数调优技术。在 xLearn 中,用户可以使用 ``--cv`` 258 | 选项来使用交叉验证功能,例如: :: 259 | 260 | ./xlearn_train ./small_train.txt 
--cv 261 | 262 | 在默认的情况下,xLearn 使用 3-folds 交叉验证 (即将数据集平均分成 3 份),用户也可以通过 ``-f`` 选项来指定数据划分的份数,例如: :: 263 | 264 | ./xlearn_train ./small_train.txt -f 5 --cv 265 | 266 | 上述命令将数据集划分成为 5 份,并且 xLearn 会在最后计算出平均的 validation loss: :: 267 | 268 | ... 269 | [------------] Average log_loss: 0.549417 270 | [ ACTION ] Finish Cross-Validation 271 | [ ACTION ] Clear the xLearn environment ... 272 | [------------] Total time cost: 0.03 (sec) 273 | 274 | 选择优化算法 275 | ---------------------------------------- 276 | 277 | 在 xLearn 中,用户可以通过 ``-p`` 选项来选择使用不同的优化算法。目前,xLearn 支持 ``SGD``, ``AdaGrad``, 以及 ``FTRL`` 这三种优化算法。 278 | 在默认的情况下,xLearn 使用 ``AdaGrad`` 优化算法: :: 279 | 280 | ./xlearn_train ./small_train.txt -p sgd 281 | ./xlearn_train ./small_train.txt -p adagrad 282 | ./xlearn_train ./small_train.txt -p ftrl 283 | 284 | 相比于传统的 ``SGD`` (随机梯度下降) 算法,``AdaGrad`` 可以自适应的调整学习速率 learning rate,对于不常用的参数进行较大的更新,对于常用的参数进行较小的更新。 285 | 正因如此,``AdaGrad`` 算法常用于稀疏数据的优化问题上。除此之外,相比于 ``AdaGrad``,``SGD`` 对学习速率的大小更敏感,这增加了用户调参的难度。 286 | 287 | ``FTRL`` (Follow-the-Regularized-Leader) 同样被广泛应用于大规模稀疏数据的优化问题上。相比于 ``SGD`` 和 ``AdaGrad``, ``FTRL`` 需要用户调试更多的超参数,我们将在下一节详细介绍 xLearn 的超参数调优。 288 | 289 | 超参数调优 290 | ---------------------------------------- 291 | 292 | 在机器学习中,*hyper-parameter* (超参数) 是指在训练之前设置的参数,而模型参数是指在训练过程中更新的参数。超参数调优通常是机器学习训练过程中不可避免的一个环节。 293 | 294 | 首先,``learning rate`` (学习速率) 是机器学习中的一个非常重要的超参数,用来控制每次模型迭代时更新的步长。在默认的情况下,这个值在 xLearn 中被设置为 ``0.2``,用户可以通过 ``-r`` 选项来改变这个值: :: 295 | 296 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 297 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.5 298 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.01 299 | 300 | 用户还可以通过 ``-b`` 选项来控制 regularization (正则项)。xLearn 使用 ``L2`` 正则项,这个值被默认设置为 ``0.00002``: :: 301 | 302 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.001 303 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.002 304 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.01 305 | 306 
| 对于 ``FTRL`` 算法来说,除了学习速率和正则项,我们还需要调节其他的超参数,包括:``-alpha``, ``-beta``, ``-lambda_1`` 和 ``-lambda_2``,例如: :: 307 | 308 | ./xlearn_train ./small_train.txt -p ftrl -alpha 0.002 -beta 0.8 -lambda_1 0.001 -lambda_2 1.0 309 | 310 | 对于 FM 和 FFM 模型,用户需要通过 ``-k`` 选项来设置 *latent vector* (隐向量) 的长度。在默认的情况下,xLearn 将其设置为 ``4``: :: 311 | 312 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 2 313 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 4 314 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 5 315 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 8 316 | 317 | 注意,xLearn 使用了 *SSE* 硬件指令来加速向量运算,该指令会同时进行向量长度为 ``4`` 的运算,因此 ``k=2`` 和 ``k=4`` 所需的运算时间是相同的。 318 | 319 | 除此之外,对于 FM 和 FFM,用户可以通过设置超参数 ``-u`` 来调节模型的初始化参数。在默认的情况下,这个值被设置为 ``0.66``: :: 320 | 321 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.80 322 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.40 323 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.10 324 | 325 | 迭代次数 & Early-Stop (提前终止) 326 | ---------------------------------------- 327 | 328 | 在模型的训练过程中,每一个 epoch 都会遍历整个训练数据。在 xLearn 中,用户可以通过 ``-e`` 选项来设置需要的 epoch 数量: :: 329 | 330 | ./xlearn_train ./small_train.txt -e 3 331 | ./xlearn_train ./small_train.txt -e 5 332 | ./xlearn_train ./small_train.txt -e 10 333 | 334 | 如果用户设置了 validation dataset (验证集),xLearn 在默认情况下会在得到最好的 validation 结果时进行 early-stop (提前终止训练),例如: :: 335 | 336 | ./xlearn_train ./small_train.txt -s 2 -v ./small_test.txt -e 10 337 | 338 | 在上述命令中,我们设置 epoch 的大小为 ``10``,但是 xLearn 会在第 7 轮提前停止训练 (你可能在你的本地计算机上会得到不同的轮次): :: 339 | 340 | ... 341 | [ ACTION ] Early-stopping at epoch 7 342 | [ ACTION ] Start to save model ...
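与上面的命令行选项对应,epoch 数量与提前停止的相关设置在 Python API 中同样通过参数字典表达。以下是一个示意,假设参数名 ``epoch`` 与 ``stop_window`` 分别对应 ``-e`` 与 ``-sw``,具体参数名请以所用 xLearn 版本的 Python API 文档为准:

```python
# 'epoch' 对应命令行的 -e 选项;'stop_window' 对应 -sw 选项。
# 注意:这里的参数名为示意,请以实际版本的 Python API 文档为准。
param = {'task': 'binary',
         'lr': 0.2,
         'lambda': 0.002,
         'epoch': 10,        # 最多训练 10 个 epoch
         'stop_window': 3}   # 连续 3 轮验证结果无提升则提前终止

# early-stop 需要验证集,训练的调用方式与前文相同 (示意):
# ffm_model.setValidate("./small_test.txt")
# ffm_model.fit(param, './model.out')
print(sorted(param))
```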
343 | 344 | 用户可以通过 ``-sw`` 来设置提前停止机制的窗口大小。即,``-sw=2`` 意味着如果在后两轮的时间窗口之内都没有比当前更好的验证结果,则停止训练,并保存之前最好的模型: :: 345 | 346 | ./xlearn_train ./small_train.txt -e 10 -v ./small_test.txt -sw 3 347 | 348 | 用户可以通过 ``--dis-es`` 选项来禁止 early-stop: :: 349 | 350 | ./xlearn_train ./small_train.txt -s 2 -v ./small_test.txt -e 10 --dis-es 351 | 352 | 在上述命令中,xLearn 将进行完整的 10 轮 epoch 训练。 353 | 354 | 注意,在默认情况下,如果没有设置 metric,则 xLearn 会通过 test_loss 来选择最佳停止时机。如果设置了 metric,则 xLearn 通过 metric 的值来决定停止时机。 355 | 356 | 无锁 (Lock-free) 学习 357 | ---------------------------------------- 358 | 359 | 在默认情况下,xLearn 会进行 *Hogwild!* 无锁学习,该方法通过 CPU 多核进行并行训练,提高 CPU 利用率,加快算法收敛速度。但是,该无锁算法是非确定性的算法 (*non-deterministic*). 即,如果我们多次运行如下的命令,我们会在每一次运行得到略微不同的 loss 结果: :: 360 | 361 | ./xlearn_train ./small_train.txt 362 | 363 | The 1st time: 0.396352 364 | 365 | ./xlearn_train ./small_train.txt 366 | 367 | The 2nd time: 0.396119 368 | 369 | ./xlearn_train ./small_train.txt 370 | 371 | The 3rd time: 0.396187 372 | 373 | 用户可以通过 ``-nthread`` 选项来设置使用 CPU 核心的数量,例如: :: 374 | 375 | ./xlearn_train ./small_train.txt -nthread 2 376 | 377 | 上述命令指定使用 2 个 CPU Core 来进行模型训练。如果用户不设置该选项,xLearn 在默认情况下会使用全部的 CPU 核心进行计算。xLearn 会显示当前使用线程数量的情况: :: 378 | 379 | [------------] xLearn uses 2 threads for training task. 380 | [ ACTION ] Read Problem ... 381 | 382 | 用户可以通过设置 ``--dis-lock-free`` 选项禁止多核无锁学习: :: 383 | 384 | ./xlearn_train ./small_train.txt --dis-lock-free 385 | 386 | 这时,xLearn 计算的结果是确定性的 (*deterministic*): :: 387 | 388 | ./xlearn_train ./small_train.txt 389 | 390 | The 1st time: 0.396372 391 | 392 | ./xlearn_train ./small_train.txt 393 | 394 | The 2nd time: 0.396372 395 | 396 | ./xlearn_train ./small_train.txt 397 | 398 | The 3rd time: 0.396372 399 | 400 | 使用 ``--dis-lock-free`` 的缺点是训练速度会明显慢于无锁训练,因此我们建议仅在需要可复现结果时使用该选项,在大规模数据训练时保持默认的无锁训练。 401 | 402 | Instance-Wise 归一化 403 | ---------------------------------------- 404 | 405 | 对于 FM 和 FFM 来说,xLearn 会默认对特征进行 *Instance-Wise Normalization* (归一化).
在一些大规模稀疏数据的场景 (例如 CTR 预估), 这一技术非常有效,但是有些时候它也会影响模型的准确率。用户可以通过设置 ``--no-norm`` 来关掉该功能: :: 406 | 407 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt --no-norm 408 | 409 | 注意,如果在训练过程中使用了 Instance-Wise 归一化,用户需要在预测过程中同样使用该功能。 410 | 411 | Quiet Mode 安静模式 412 | ---------------------------------------- 413 | 414 | xLearn 的训练支持安静模式,在安静模式下,用户通过设置 ``--quiet`` 选项来使得 xLearn 的训练过程不会计算任何评价指标,这样可以很大程度上提高训练速度: :: 415 | 416 | ./xlearn_train ./small_train.txt -e 10 --quiet 417 | 418 | xLearn 还可以支持 Python API,我们将在下一节详细介绍。 419 | -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # xlearn_doc documentation build configuration file, created by 4 | # sphinx-quickstart on Sun Dec 3 18:43:51 2017. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | # If extensions (or modules to document with autodoc) are in another directory, 16 | # add these directories to sys.path here. If the directory is relative to the 17 | # documentation root, use os.path.abspath to make it absolute, like shown here. 18 | # 19 | # import os 20 | # import sys 21 | # sys.path.insert(0, os.path.abspath('.')) 22 | 23 | # -- General configuration ------------------------------------------------ 24 | 25 | # If your documentation needs a minimal Sphinx version, state it here. 26 | # 27 | # needs_sphinx = '1.0' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones.
32 | extensions = ['sphinx.ext.autodoc'] 33 | 34 | # Add any paths that contain templates here, relative to this directory. 35 | templates_path = ['_templates'] 36 | 37 | # The suffix(es) of source filenames. 38 | # You can specify multiple suffix as a list of string: 39 | # 40 | # source_suffix = ['.rst', '.md'] 41 | source_suffix = '.rst' 42 | 43 | # The master toctree document. 44 | master_doc = 'index' 45 | 46 | # General information about the project. 47 | project = u'xLearn' 48 | copyright = u'2017, Chao Ma' 49 | author = u'Chao Ma' 50 | 51 | # The version info for the project you're documenting, acts as replacement for 52 | # |version| and |release|, also used in various other places throughout the 53 | # built documents. 54 | # 55 | # The short X.Y version. 56 | version = u'0.4.0' 57 | # The full version, including alpha/beta/rc tags. 58 | release = u'0.4.0' 59 | 60 | # The language for content autogenerated by Sphinx. Refer to documentation 61 | # for a list of supported languages. 62 | # 63 | # This is also used if you do content translation via gettext catalogs. 64 | # Usually you set "language" from the command line for these cases. 65 | language = None 66 | 67 | # List of patterns, relative to source directory, that match files and 68 | # directories to ignore when looking for source files. 69 | # This patterns also effect to html_static_path and html_extra_path 70 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] 71 | 72 | # The name of the Pygments (syntax highlighting) style to use. 73 | pygments_style = 'sphinx' 74 | 75 | # If true, `todo` and `todoList` produce output, else they produce nothing. 76 | todo_include_todos = False 77 | 78 | 79 | # -- Options for HTML output ---------------------------------------------- 80 | 81 | # The theme to use for HTML and HTML Help pages. See the documentation for 82 | # a list of builtin themes. 
83 | # 84 | html_theme = 'sphinx_rtd_theme' 85 | 86 | # Theme options are theme-specific and customize the look and feel of a theme 87 | # further. For a list of options available for each theme, see the 88 | # documentation. 89 | # 90 | # html_theme_options = {} 91 | 92 | # Add any paths that contain custom static files (such as style sheets) here, 93 | # relative to this directory. They are copied after the builtin static files, 94 | # so a file named "default.css" will overwrite the builtin "default.css". 95 | html_static_path = ['_static'] 96 | 97 | # Custom sidebar templates, must be a dictionary that maps document names 98 | # to template names. 99 | # 100 | # This is required for the alabaster theme 101 | # refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars 102 | html_sidebars = { 103 | '**': [ 104 | 'relations.html', # needs 'show_related': True theme option to display 105 | 'searchbox.html', 106 | ] 107 | } 108 | 109 | 110 | # -- Options for HTMLHelp output ------------------------------------------ 111 | 112 | # Output file base name for HTML help builder. 113 | htmlhelp_basename = 'xlearn_docdoc' 114 | 115 | 116 | # -- Options for LaTeX output --------------------------------------------- 117 | 118 | latex_elements = { 119 | # The paper size ('letterpaper' or 'a4paper'). 120 | # 121 | # 'papersize': 'letterpaper', 122 | 123 | # The font size ('10pt', '11pt' or '12pt'). 124 | # 125 | # 'pointsize': '10pt', 126 | 127 | # Additional stuff for the LaTeX preamble. 128 | # 129 | # 'preamble': '', 130 | 131 | # Latex figure (float) alignment 132 | # 133 | # 'figure_align': 'htbp', 134 | } 135 | 136 | # Grouping the document tree into LaTeX files. List of tuples 137 | # (source start file, target name, title, 138 | # author, documentclass [howto, manual, or own class]). 
139 | latex_documents = [ 140 | (master_doc, 'xlearn_doc.tex', u'xlearn\\_doc Documentation', 141 | u'Chao Ma', 'manual'), 142 | ] 143 | 144 | # -- Options for manual page output --------------------------------------- 145 | 146 | # One entry per manual page. List of tuples 147 | # (source start file, name, description, authors, manual section). 148 | man_pages = [ 149 | (master_doc, 'xlearn_doc', u'xlearn_doc Documentation', 150 | [author], 1) 151 | ] 152 | 153 | # -- Options for Texinfo output ------------------------------------------- 154 | 155 | # Grouping the document tree into Texinfo files. List of tuples 156 | # (source start file, target name, title, author, 157 | # dir menu entry, description, category) 158 | texinfo_documents = [ 159 | (master_doc, 'xlearn_doc', u'xlearn_doc Documentation', 160 | author, 'xlearn_doc', 'One line description of project.', 161 | 'Miscellaneous'), 162 | ] -------------------------------------------------------------------------------- /demo/index.rst: -------------------------------------------------------------------------------- 1 | xLearn 样例程序 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | 注意:这里使用的所有数据集,其所有权均属于原作者。 5 | 6 | Criteo 在线广告预估 7 | --------------------------- 8 | 9 | Kaggle 预测广告是否会被用户点击 (`链接`__) 10 | 11 | Display advertising is a billion dollar effort and one of the central uses of machine learning on the Internet. 12 | However, its data and methods are usually kept under lock and key. In this research competition, CriteoLabs is 13 | sharing a week’s worth of data for you to develop models predicting ad click-through rate (CTR). Given a user 14 | and the page he is visiting, what is the probability that he will click on a given ad? 15 | 16 | 样例数据在: ``/demo/classification/criteo_ctr/``. 17 | 18 | The following code is the Python demo: 19 | 20 | ..
code-block:: python 21 | 22 | import xlearn as xl 23 | 24 | # Training task 25 | ffm_model = xl.create_ffm() # Use field-aware factorization machine 26 | ffm_model.setTrain("./small_train.txt") # Training data 27 | ffm_model.setValidate("./small_test.txt") # Validation data 28 | 29 | # param: 30 | # 0. binary classification 31 | # 1. learning rate: 0.2 32 | # 2. regular lambda: 0.002 33 | # 3. evaluation metric: accuracy 34 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric':'acc'} 35 | 36 | # Start to train 37 | # The trained model will be stored in model.out 38 | ffm_model.fit(param, './model.out') 39 | 40 | # Prediction task 41 | ffm_model.setTest("./small_test.txt") # Test data 42 | ffm_model.setSigmoid() # Convert output to 0-1 43 | 44 | # Start to predict 45 | # The output result will be stored in output.txt 46 | ffm_model.predict("./model.out", "./output.txt") 47 | 48 | 蘑菇分类 49 | --------------------------- 50 | 51 | 数据集来自 UCI Machine Learning Repository (`链接`__) 52 | 53 | This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in 54 | the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, 55 | or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly 56 | states that there is no simple rule for determining the edibility of a mushroom; no rule like *leaflets three, let it be* 57 | for Poisonous Oak and Ivy. 58 | 59 | 样例数据在: ``/demo/classification/mushroom/``. 60 | 61 | The following code is the Python demo: 62 | 63 | .. code-block:: python 64 | 65 | import xlearn as xl 66 | 67 | # Training task 68 | linear_model = xl.create_linear() # Use linear model 69 | linear_model.setTrain("./agaricus_train.txt") # Training data 70 | linear_model.setValidate("./agaricus_test.txt") # Validation data 71 | 72 | # param: 73 | # 0. binary classification 74 | # 1. learning rate: 0.2 75 | # 2.
lambda: 0.002 76 | # 3. evaluation metric: accuracy 77 | # 4. use sgd optimization method 78 | param = {'task':'binary', 'lr':0.2, 79 | 'lambda':0.002, 'metric':'acc', 80 | 'opt':'sgd'} 81 | 82 | # Start to train 83 | # The trained model will be stored in model.out 84 | linear_model.fit(param, './model.out') 85 | 86 | # Prediction task 87 | linear_model.setTest("./agaricus_test.txt") # Test data 88 | linear_model.setSigmoid() # Convert output to 0-1 89 | 90 | # Start to predict 91 | # The output result will be stored in output.txt 92 | linear_model.predict("./model.out", "./output.txt") 93 | 94 | 泰坦尼克生还预测 95 | ----------------------------- 96 | 97 | This challenge comes from Kaggle. In this challenge, we ask you to complete the analysis of what sorts of people 98 | were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers 99 | survived the tragedy. (`链接`__) 100 | 101 | You can find the data used in this demo in the path ``/demo/classification/titanic/``. 102 | 103 | The following code is the Python demo: 104 | 105 | .. code-block:: python 106 | 107 | import xlearn as xl 108 | 109 | # Training task 110 | fm_model = xl.create_fm() # Use factorization machine 111 | fm_model.setTrain("./titanic_train.txt") # Training data 112 | 113 | # param: 114 | # 0. Binary classification task 115 | # 1. learning rate: 0.2 116 | # 2. lambda: 0.002 117 | # 3. metric: accuracy 118 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric':'acc'} 119 | 120 | # Use cross-validation 121 | fm_model.cv(param) 122 | 123 | 房价预测 124 | ----------------------------- 125 | 126 | This demo shows how to use xLearn to solve the regression problem, and it comes from Kaggle. The Ames 127 | Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative 128 | for data scientists looking for a modernized and expanded version of the often cited Boston 129 | Housing dataset.
(`链接`__) 130 | 131 | 样例数据在: ``/demo/regression/house_price/``. 132 | 133 | The following code is the Python demo: 134 | 135 | .. code-block:: python 136 | 137 | import xlearn as xl 138 | 139 | # Training task 140 | fm_model = xl.create_fm() # Use factorization machine 141 | fm_model.setTrain("./house_price_train.txt") # Training data 142 | 143 | # param: 144 | # 0. Regression task 145 | # 1. learning rate: 0.2 146 | # 2. regular lambda: 0.002 147 | # 3. evaluation metric: rmse 148 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002, 'metric':'rmse'} 149 | 150 | # Use cross-validation 151 | fm_model.cv(param) 152 | 153 | More demos for xLearn are coming soon. 154 | 155 | .. __: https://www.kaggle.com/c/criteo-display-ad-challenge 156 | .. __: https://archive.ics.uci.edu/ml/datasets/Mushroom 157 | .. __: https://www.kaggle.com/c/titanic 158 | .. __: https://www.kaggle.com/c/house-prices-advanced-regression-techniques 159 | -------------------------------------------------------------------------------- /images/out-of-core.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aksnzhy/xlearn_doc_cn/eaa11f7ca9e54c3815850696802d1910ee72bc1a/images/out-of-core.png -------------------------------------------------------------------------------- /images/ps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aksnzhy/xlearn_doc_cn/eaa11f7ca9e54c3815850696802d1910ee72bc1a/images/ps.png -------------------------------------------------------------------------------- /images/speed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aksnzhy/xlearn_doc_cn/eaa11f7ca9e54c3815850696802d1910ee72bc1a/images/speed.png -------------------------------------------------------------------------------- /index.rst: -------------------------------------------------------------------------------- 1 | ..
xlearn_doc documentation master file, created by 2 | sphinx-quickstart on Sun Dec 3 18:43:51 2017. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | 欢迎使用 xLearn ! 7 | =============================== 8 | 9 | xLearn 是一款高性能的,易用的,并且可扩展的机器学习算法库,你可以用它来解决大规模机器学习问题,尤其是大规模稀疏数据机器学习问题。在近年来,大规模稀疏数据机器学习算法被广泛应用在各种领域,例如广告点击率预测、推荐系统等。如果你是 liblinear、libfm、libffm 的用户,那么现在 xLearn 将会是你更好的选择,因为 xLearn 几乎囊括了这些系统的全部功能,并且具有更好的性能,易用性,以及可扩展性。 10 | 11 | .. image:: ./images/speed.png 12 | :width: 650 13 | 14 | 快速开始 15 | ---------------------------------- 16 | 17 | 我们接下来展示如何在一个小型数据样例 (Criteo 广告点击预测数据) 上使用 xLearn 来解决二分类问题。在这个问题里,机器学习算法需要判断当前用户是否会点击给定的广告。 18 | 19 | 安装 xLearn 20 | ^^^^^^^^^^^^^ 21 | 22 | xLearn 最简单的安装方法是使用 ``pip`` 安装工具. 下面的命令会下载 xLearn 的源代码,并且在用户的本地机器上进行编译和安装。 :: 23 | 24 | sudo pip install xlearn 25 | 26 | 上述安装过程可能会持续一段时间,请耐心等候。安装完成后,用户可以使用下面的代码来检测 xLearn 是否安装成功。 :: 27 | 28 | >>> import xlearn as xl 29 | >>> xl.hello() 30 | 31 | 如果安装成功,用户会看到如下显示: :: 32 | 33 | ------------------------------------------------------------------------- 34 | _ 35 | | | 36 | __ _| | ___ __ _ _ __ _ __ 37 | \ \/ / | / _ \/ _` | '__| '_ \ 38 | > <| |___| __/ (_| | | | | | | 39 | /_/\_\_____/\___|\__,_|_| |_| |_| 40 | 41 | xLearn -- 0.44 Version -- 42 | ------------------------------------------------------------------------- 43 | 44 | 如果你在安装的过程中遇到了任何问题,或者你希望自己通过在 `Github`__ 上最新的源代码进行手动编译,或者你想使用 xLearn 的命令行接口,你可以从这里 (`Installation Guide`__) 查看如何对 xLearn 进行从源码的手动编译和安装。 45 | 46 | .. __: https://github.com/aksnzhy/xlearn 47 | .. __: ./install/index.html 48 | 49 | Python 样例 50 | ^^^^^^^^^^^^^^ 51 | 52 | 下面的 Python 代码展示了如何使用 xLearn 的 *FFM* 算法来处理机器学习二分类任务: 53 | 54 | .. 
code-block:: python 55 | 56 | import xlearn as xl 57 | 58 | # Training task 59 | ffm_model = xl.create_ffm() # Use field-aware factorization machine (ffm) 60 | ffm_model.setTrain("./small_train.txt") # Set the path of training dataset 61 | ffm_model.setValidate("./small_test.txt") # Set the path of validation dataset 62 | 63 | # Parameters: 64 | # 0. task: binary classification 65 | # 1. learning rate: 0.2 66 | # 2. regular lambda: 0.002 67 | # 3. evaluation metric: accuracy 68 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric':'acc'} 69 | 70 | # Start to train 71 | # The trained model will be stored in model.out 72 | ffm_model.fit(param, './model.out') 73 | 74 | # Prediction task 75 | ffm_model.setTest("./small_test.txt") # Set the path of test dataset 76 | ffm_model.setSigmoid() # Convert output to 0-1 77 | 78 | # Start to predict 79 | # The output result will be stored in output.txt 80 | ffm_model.predict("./model.out", "./output.txt") 81 | 82 | 上述样例通过使用 *field-aware factorization machines (FFM)* 来解决一个简单的二分类任务。用户可以在 ``demo/classification/criteo_ctr`` 83 | 路径下找到我们所使用的样例数据 (``small_train.txt`` 和 ``small_test.txt``). 84 | 85 | 其他资源链接 86 | ---------------------------------------- 87 | 88 | .. toctree:: 89 | :glob: 90 | :maxdepth: 1 91 | 92 | self 93 | install/index 94 | cli_api/index 95 | python_api/index 96 | R_api/index 97 | tune/index 98 | all_api/index 99 | large/index 100 | demo/index 101 | tutorial/index -------------------------------------------------------------------------------- /install/index.rst: -------------------------------------------------------------------------------- 1 | 详细安装指南 2 | ---------------------------------- 3 | 4 | 目前 xLearn 可以支持 Linux, Mac OS X 以及 Windows 平台. 在 Windows 平台安装 xLearn 请参考 `link`__ . 这一节主要介绍了如何在 Linux 和 Mac OSX 平台通过 ``pip`` 工具安装 xLearn,并且详细介绍了如何通过源码手动编译并安装 xLearn. 无论你使用哪种方法安装 xLearn,请确保你的机器上已经安装了支持 C++11 的编译器,例如 ``GCC`` 或者 ``Clang``. 除此之外,用户还需要提前安装好 ``CMake`` 编译工具. 5 | 6 | .. 
__: ./install_windows.html 7 | 8 | 安装 GCC 或 Clang 9 | ^^^^^^^^^^^^^^^^^^^^^^^^ 10 | 11 | *如果你已经安装了支持 C++ 11 的编译器,请忽略此节内容。* 12 | 13 | * 在 Cygwin 上, 运行 ``setup.exe`` 并安装 ``gcc`` 和 ``binutils``. 14 | * 在 Debian/Ubuntu Linux 上, 输入如下命令: :: 15 | 16 | sudo apt-get install gcc binutils 17 | 18 | 安装 GCC (或者 Clang) :: 19 | 20 | sudo apt-get install clang 21 | 22 | * 在 FreeBSD 上, 输入以下命令安装 Clang: :: 23 | 24 | sudo pkg_add -r clang 25 | 26 | * 在 Mac OS X, 安装 ``XCode`` 来获得 Clang. 27 | 28 | 29 | 安装 CMake 30 | ^^^^^^^^^^^^^^^^^^^^^^^^ 31 | 32 | *如果你已经安装了 CMake,请忽略此节内容。* 33 | 34 | * 在 Cygwin 上, 运行 ``setup.exe`` 并安装 cmake. 35 | * 在 Debian/Ubuntu Linux 上, 输入以下命令安装 cmake: :: 36 | 37 | sudo apt-get install cmake 38 | 39 | * 在 FreeBSD 上, 输入以下命令: :: 40 | 41 | sudo pkg_add -r cmake 42 | 43 | 在 Mac OS X, 如果你安装了 ``homebrew``, 输入以下命令: :: 44 | 45 | brew install cmake 46 | 47 | 或者你安装了 ``MacPorts``, 输入以下命令: :: 48 | 49 | sudo port install cmake 50 | 51 | 从源码安装 xLearn 52 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 53 | 54 | 从源码安装 xLearn 分为两个步骤: 55 | 56 | 首先,我们需要编译 xLearn 得到 ``xlearn_train`` 和 ``xlearn_predict`` 这两个可执行文件。除此之外,我们还需要得到 ``libxlearn_api.so`` (Linux 平台) 和 ``libxlearn_api.dylib`` (Mac OS X 平台) 这两个动态链接库 (用来进行 Python 调用)。随后,用户可以安装 xLearn Python Package. 
57 | 58 | 编译 xLearn 59 | =========== 60 | 61 | 用户从 Github 上 clone 下 xLearn 源代码: :: 62 | 63 | git clone https://github.com/aksnzhy/xlearn.git 64 | 65 | cd xlearn 66 | mkdir build 67 | cd build 68 | cmake ../ 69 | make 70 | 71 | 如果编译成功,用户将在 build 文件夹下看到 ``xlearn_train`` 和 ``xlearn_predict`` 这两个可执行文件。用户可以通过如下命令检查 xLearn 是否安装成功: :: 72 | 73 | ./run_example.sh 74 | 75 | 安装 Python Package 76 | ==================== 77 | 78 | 之后,你就可以通过 ``install-python.sh`` 脚本来安装 xLearn Python 包: :: 79 | 80 | cd python-package 81 | sudo ./install-python.sh 82 | 83 | 用户可以通过如下命令检测 xLearn Python 库是否安装成功: :: 84 | 85 | cd ../ 86 | python run_demo_ctr.py 87 | 88 | 一键安装脚本 89 | ============ 90 | 91 | 我们已经写好了一个脚本 ``build.sh`` 来帮助用户做上述所有的安装工作。 92 | 93 | 用户只需要从 Github 上 clone 下 xLearn 源代码: :: 94 | 95 | git clone https://github.com/aksnzhy/xlearn.git 96 | 97 | 然后通过以下命令进行编译和安装: :: 98 | 99 | cd xlearn 100 | sudo ./build.sh 101 | 102 | 在安装过程中用户可能会被要求输入管理员账户密码。 103 | 104 | 通过 pip 安装 xLearn 105 | ^^^^^^^^^^^^^^^^^^^^^^^^ 106 | 107 | 安装 xLearn 最简单的方法是使用 ``pip`` 安装工具. 如下命令会下载 xLearn 源代码,并在你的本地计算机进行编译和安装工作,该方法使用的前提是你已经安装了 xLearn 所需的开发环境,例如 C++11 和 CMake: :: 108 | 109 | sudo pip install xlearn 110 | 111 | 上述安装过程可能会持续一段时间,请耐心等候。安装完成后,用户可以使用下面的代码来检测 xLearn 是否安装成功: :: 112 | 113 | >>> import xlearn as xl 114 | >>> xl.hello() 115 | 116 | 如果安装成功,你会看到如下显示: :: 117 | 118 | ------------------------------------------------------------------------- 119 | _ 120 | | | 121 | __ _| | ___ __ _ _ __ _ __ 122 | \ \/ / | / _ \/ _` | '__| '_ \ 123 | > <| |___| __/ (_| | | | | | | 124 | /_/\_\_____/\___|\__,_|_| |_| |_| 125 | 126 | xLearn -- 0.44 Version -- 127 | ------------------------------------------------------------------------- 128 | 129 | 安装 R 库 130 | ^^^^^^^^^^^^^^^^^^^^^^^^ 131 | 132 | The R package installation guide is coming soon. 
133 | -------------------------------------------------------------------------------- /install/install_windows.rst: -------------------------------------------------------------------------------- 1 | Windows 安装指南 2 | ---------------------------------- 3 | 4 | xLearn 支持 Windows 平台的安装和使用。本小节主要介绍如何在 Windows 平台安装并使用 xLearn 库。 5 | 6 | 安装 Visual Studio 2017 7 | ^^^^^^^^^^^^^^^^^^^^^^^^ 8 | 9 | *如果你的 Windows 系统已经安装过 Visual studio,你可以跳过这一步。* 10 | 11 | 从 https://visualstudio.microsoft.com/downloads/ 下载你所需要的 Visual studio (``vs_xxxx_xxxx.exe``)。之后,你可以通过 VS 的安装说明 (https://docs.microsoft.com/en-us/visualstudio/install/install-visual-studio?view=vs-2017.)进行安装。 12 | 13 | 安装 CMake 14 | ^^^^^^^^^^^^^^^^^^^^^^^^ 15 | 16 | *如果你的系统已经安装了 CMake,你可以跳过这一步* 17 | 18 | 从这里 https://cmake.org/download/ 下载最新版本 (至少 v3.10) CMake。请确保安装 CMake 后将其路径正确添加到你的系统路径。 19 | 20 | 从源码安装 xLearn 21 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 22 | 23 | 从源码安装 xLearn 包括了两个步骤: 24 | 25 | 首先你需要编译源码得到两个可执行文件:``xlearn_train.exe`` 和 ``xlearn_predict.exe``,并且得到动态链接库 ``xlearn_api.dll``。 之后,需要安装 xLearn Python 包。 26 | 27 | 编译源代码 28 | ======================= 29 | 30 | 用户进入 DOS 控制台,输入命令: :: 31 | 32 | git clone https://github.com/aksnzhy/xlearn.git 33 | 34 | cd xlearn 35 | mkdir build 36 | cd build 37 | cmake -G "Visual Studio 15 Win64" ../ 38 | "C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64 39 | MSBuild xLearn.sln /p:Configuration=Release 40 | 41 | **注意:** 你需要将路径 ``"C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Auxiliary\Build\vcvarsall.bat"`` 42 | 替换成你自己的 VS 安装路径. 43 | 44 | 例如,默认情况下 VS 的路径为 ``"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat"``. 
45 | 46 | 如果安装成功, 用户可以在 ``build\Release`` 路径下看到 ``xlearn_train.exe`` 和 ``xlearn_predict.exe`` 两个可执行文件。 47 | 48 | 用户可以通过如下命令进行测试: :: 49 | 50 | run_example.bat 51 | 52 | 从 Visual Studio 解决方案编译源码 53 | ======================= 54 | 这个编译方法是上面“编译源代码”方法的一个备用选择,如果你已经使用上面的方法进行编译,你可以跳过这个部分。 55 | 56 | 我们为用户提供了 Visual Studio 解决方案,这些文件在 xLearn 项目根目录的 windows 目录下面,用户可以直接使用 ``xLearn.sln`` 编译源代码。 57 | 58 | 这个解决方案包括三个项目:``xlearn_train``、``xlearn_test``、``xlearn_api``,分别对应产生 xLearn 的训练、预测可执行文件,以及 Windows 动态链接库 (DLL) API。 59 | 60 | 用户需要保证所使用的 VS 工具平台版本在 v141 及以上。 61 | 62 | **注意:** 从这个解决方案编译得到的可执行文件和动态链接库,会和使用 cmake 构建、编译得到的有所不同,这是因为二者的构建配置不相同。 63 | 64 | 安装 Python 包 65 | ======================= 66 | 67 | 用户可以通过如下命令安装 Python 包: :: 68 | 69 | cd python-package 70 | python setup.py install 71 | 72 | 然后通过如下命令对安装进行测试: :: 73 | 74 | cd ../ 75 | python test_python.py 76 | 77 | 一键安装 78 | ======================= 79 | 80 | 用户可以通过 ``build.bat`` 脚本来对 xLearn 进行一键安装: :: 81 | 82 | git clone https://github.com/aksnzhy/xlearn.git 83 | 84 | cd xlearn 85 | build.bat 86 | 87 | 从 pip 安装 88 | ^^^^^^^^^^^^^^^^^^^^^^^^ 89 | 90 | 我们现在提供了 Windows 平台下的二进制 Python 包,它支持 64 位 Python 的以下版本:``2.7, 3.4, 3.5, 3.6, 3.7``。 91 | 92 | 用户可以从 release_ 栏 (xLearn 项目主页) 下载,然后用 ``pip`` 命令安装下载下来的后缀为 ``.whl`` 的二进制安装包文件。 93 | 94 | ..
_release: https://github.com/aksnzhy/xlearn/releases 96 | 97 | 98 | 用户可以通过如下命令检查 xLearn 是否安装成功: :: 99 | 100 | >>> import xlearn as xl 101 | >>> xl.hello() 102 | 103 | 如果安装成功,你可以看到: :: 104 | 105 | ------------------------------------------------------------------------- 106 | _ 107 | | | 108 | __ _| | ___ __ _ _ __ _ __ 109 | \ \/ / | / _ \/ _` | '__| '_ \ 110 | > <| |___| __/ (_| | | | | | | 111 | /_/\_\_____/\___|\__,_|_| |_| |_| 112 | 113 | xLearn -- 0.44 Version -- 114 | ------------------------------------------------------------------------- 115 | -------------------------------------------------------------------------------- /large/index.rst: -------------------------------------------------------------------------------- 1 | xLearn 大规模机器学习 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | 我们在这一节里主要展示如何使用 xLearn 来处理大规模机器学习问题。近年来,快速增长的海量数据为机器学习任务带来了挑战。例如,我们的数据集可能会有数千亿条训练样本,这些数据是不可能被存放在单台计算机的内存中的。正因如此,我们在设计 xLearn 时专门考虑了如何解决大规模数据的机器学习训练问题。首先,xLearn 支持外存计算,通过利用单台计算机的磁盘来处理 TB 量级的数据训练任务。此外,xLearn 可以通过基于参数服务器的分布式架构来进行多机分布式训练。 5 | 6 | 外存计算 7 | -------------------------------- 8 | 9 | 外存计算适用于那些数据量过大不能被内存装下,但是可以被磁盘等外部存储设备装下的情况。通常情况下,单台机器的内存容量从几个 GB 到几百个 GB 不等。然而,当前的服务器外存容量通常可以很容易达到几个 TB. 外存计算的核心是通过 mini-batch 的方法,在每一次的计算时只读取一小部分数据进入内存,增量式地学习所有的训练数据。外存计算需要用户设定合适的 mini-batch-size. 10 | 11 | ..
image:: ../images/out-of-core.png 12 | :width: 500 13 | 14 | 命令行接口 15 | =================================================== 16 | 17 | 在 xLearn 中,用户可以通过设置 ``--disk`` 选项来进行外存计算。例如: :: 18 | 19 | ./xlearn_train ./big_data.txt -s 2 --disk 20 | 21 | Epoch Train log_loss Time cost (sec) 22 | 1 0.483997 4.41 23 | 2 0.466553 4.56 24 | 3 0.458234 4.88 25 | 4 0.451463 4.77 26 | 5 0.445169 4.79 27 | 6 0.438834 4.71 28 | 7 0.432173 4.84 29 | 8 0.424904 4.91 30 | 9 0.416855 5.03 31 | 10 0.407846 4.53 32 | 33 | 在上述示例中,xLearn 需要花费将近 ``4.5`` 秒进行每一个 epoch 的训练任务。如果我们取消 ``--disk`` 选项,xLearn 的训练速度会变快: :: 34 | 35 | ./xlearn_train ./big_data.txt -s 2 36 | 37 | Epoch Train log_loss Time cost (sec) 38 | 1 0.484022 1.65 39 | 2 0.466452 1.64 40 | 3 0.458112 1.64 41 | 4 0.451371 1.76 42 | 5 0.445040 1.83 43 | 6 0.438680 1.92 44 | 7 0.432007 1.99 45 | 8 0.424695 1.95 46 | 9 0.416579 1.96 47 | 10 0.407518 2.11 48 | 49 | 这一次,每一个 epoch 的训练时间变成了 ``1.8`` 秒。我们还可以通过 ``-block`` 选项来设置外存计算的内存 block 大小 (MB)。 50 | 51 | 用户同样可以在预测任务中使用 ``--disk`` 选项,例如: :: 52 | 53 | ./xlearn_predict ./big_data_test.txt ./big_data.txt.model --disk 54 | 55 | Python 接口 56 | =================================================== 57 | 58 | 在 Python 中,用户可以通过 ``setOnDisk()`` API 来使用外存计算,例如: :: 59 | 60 | import xlearn as xl 61 | 62 | # Training task 63 | ffm_model = xl.create_ffm() # Use field-aware factorization machine 64 | 65 | # On-disk training 66 | ffm_model.setOnDisk() 67 | 68 | ffm_model.setTrain("./small_train.txt") # Training data 69 | ffm_model.setValidate("./small_test.txt") # Validation data 70 | 71 | # param: 72 | # 0. binary classification 73 | # 1. learning rate: 0.2 74 | # 2. regular lambda: 0.002 75 | # 3. 
evaluation metric: accuracy 76 | param = {'task':'binary', 'lr':0.2, 77 | 'lambda':0.002, 'metric':'acc'} 78 | 79 | # Start to train 80 | # The trained model will be stored in model.out 81 | ffm_model.fit(param, './model.out') 82 | 83 | # Prediction task 84 | ffm_model.setTest("./small_test.txt") # Test data 85 | ffm_model.setSigmoid() # Convert output to 0-1 86 | 87 | # Start to predict 88 | # The output result will be stored in output.txt 89 | ffm_model.predict("./model.out", "./output.txt") 90 | 91 | 在命令行中,用户还可以通过 ``-block`` 选项 (Python 中对应参数 ``block_size``) 来设置外存计算的内存 block 大小 (MB),例如: :: 92 | 93 | ./xlearn_train ./big_data.txt -s 2 -block 1000 --disk 94 | 95 | 如上所示, 我们将 block size 设置为 ``1000MB``. 在默认的情况下, 这个值会被设置为 ``500``. 96 | 97 | R 接口 98 | =================================================== 99 | 100 | The R guide is coming soon. 101 | 102 | 分布式计算 (参数服务器架构) 103 | -------------------------------- 104 | 105 | 面对海量数据,很多情况下我们无法通过一台机器就完成机器学习的训练任务。例如大规模 CTR 任务,用户可能需要处理千亿级别的训练样本和十亿级别的模型参数,这些都是一台计算机的内存无法装下的。对于这样的挑战,我们需要采用多机分布式训练。 106 | 107 | *Parameter Server* (参数服务器) 是近几年提出并被广泛应用的一种分布式机器学习架构,专门针对 “大数据” 和 “大模型” 带来的挑战。在这个架构下,训练数据和计算任务被划分到多台 worker 节点之上,而 Server 节点负责存储机器学习模型的参数 (所以叫作参数服务器)。下图展示了一个参数服务器的工作流程。 108 | 109 | .. image:: ../images/ps.png 110 | :width: 500 111 | 112 | 如图所示,一个标准的参数服务器系统提供给用户两个简洁的 API: *Push* 和 *Pull*. 113 | 114 | *Push*: 向参数服务器发送 key-value pairs. 以分布式梯度下降为例,worker 节点会计算本地的梯度 (gradient) 并将其发送给参数服务器。由于数据的稀疏性,只有一小部分数据不为 0. 我们通常会发送一个 (key, value) 的向量给参数服务器,其中 key 是参数的标记位,value 是梯度的数值。 115 | 116 | *Pull*: 通过发送 key 的列表从参数服务器请求更新后的模型参数。在大规模机器学习下,模型的大小通常无法被存放在一台机器中,所以 *pull* 接口只会请求那些当前计算需要的模型参数,而并不会将整个模型请求下来。 117 | 118 | The distributed training guide for xLearn is coming soon.
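在分布式训练指南发布之前,可以用下面这个极简的 Python 玩具示例来理解 *Push* / *Pull* 的语义。注意:这是纯教学示意代码,并非 xLearn 的实际分布式实现,其中的类名与更新规则均为假设:

```python
class ToyParamServer:
    """极简参数服务器示意:仅演示 Push/Pull 语义,并非 xLearn 的实现。"""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.weights = {}  # key -> 模型参数;未出现过的 key 视为 0

    def push(self, grads):
        """worker 推送稀疏梯度 (key, value),server 端执行梯度下降更新。"""
        for key, g in grads.items():
            self.weights[key] = self.weights.get(key, 0.0) - self.lr * g

    def pull(self, keys):
        """worker 只请求当前计算需要的那部分参数,而非整个模型。"""
        return {k: self.weights.get(k, 0.0) for k in keys}

server = ToyParamServer(lr=0.5)
# 稀疏数据下,worker 的本地梯度只有少数 key 非零
server.push({'w_3': 0.5, 'w_7': -0.5})
print(server.pull(['w_3', 'w_7']))  # {'w_3': -0.25, 'w_7': 0.25}
```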
119 | -------------------------------------------------------------------------------- /make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=. 11 | set BUILDDIR=_build 12 | set SPHINXPROJ=xlearn_doc 13 | 14 | if "%1" == "" goto help 15 | 16 | %SPHINXBUILD% >NUL 2>NUL 17 | if errorlevel 9009 ( 18 | echo. 19 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 20 | echo.installed, then set the SPHINXBUILD environment variable to point 21 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 22 | echo.may add the Sphinx directory to PATH. 23 | echo. 24 | echo.If you don't have Sphinx installed, grab it from 25 | echo.http://sphinx-doc.org/ 26 | exit /b 1 27 | ) 28 | 29 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 30 | goto end 31 | 32 | :help 33 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 34 | 35 | :end 36 | popd 37 | -------------------------------------------------------------------------------- /python_api/index.rst: -------------------------------------------------------------------------------- 1 | xLearn Python API 使用指南 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | xLearn 支持简单易用的 Python 接口。在使用之前,请确保你已经成功安装了 xLearn Python Package. 
You can open a Python shell and type the following code to check whether the xLearn Python package is installed correctly: :: 5 | 6 | >>> import xlearn as xl 7 | >>> xl.hello() 8 | 9 | If the installation succeeded, you will see: :: 10 | 11 | ------------------------------------------------------------------------- 12 | _ 13 | | | 14 | __ _| | ___ __ _ _ __ _ __ 15 | \ \/ / | / _ \/ _` | '__| '_ \ 16 | > <| |___| __/ (_| | | | | | | 17 | /_/\_\_____/\___|\__,_|_| |_| |_| 18 | 19 | xLearn -- 0.44 Version -- 20 | ------------------------------------------------------------------------- 21 | 22 | Quick Start 23 | ---------------------------------------- 24 | 25 | The following code shows how to use the xLearn Python API. You can find the sample data (``small_train.txt`` and ``small_test.txt``) under the ``demo/classification/criteo_ctr`` directory: 26 | 27 | .. code-block:: python 28 | 29 | import xlearn as xl 30 | 31 | # Training task 32 | ffm_model = xl.create_ffm() # Use field-aware factorization machine (ffm) 33 | ffm_model.setTrain("./small_train.txt") # Set the path of training data 34 | 35 | # parameter: 36 | # 0. task: binary classification 37 | # 1. learning rate : 0.2 38 | # 2. regularization lambda : 0.002 39 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 40 | 41 | # Train model 42 | ffm_model.fit(param, "./model.out") 43 | 44 | Part of the xLearn output is shown below: :: 45 | 46 | ... 47 | [ ACTION ] Start to train ... 48 | [------------] Epoch Train log_loss Time cost (sec) 49 | [ 10% ] 1 0.595881 0.00 50 | [ 20% ] 2 0.538845 0.00 51 | [ 30% ] 3 0.520051 0.00 52 | [ 40% ] 4 0.504366 0.00 53 | [ 50% ] 5 0.492811 0.00 54 | [ 60% ] 6 0.483286 0.00 55 | [ 70% ] 7 0.472567 0.00 56 | [ 80% ] 8 0.465035 0.00 57 | [ 90% ] 9 0.457047 0.00 58 | [ 100% ] 10 0.448725 0.00 59 | [ ACTION ] Start to save model ...
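The ``log_loss`` column above is the standard binary cross-entropy averaged over the training samples. A minimal sketch in plain Python (the function name and clipping constant are illustrative, not xLearn's internal implementation):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    # Average binary cross-entropy; probabilities are clipped away from
    # 0 and 1 to avoid log(0).
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(log_loss([1, 0, 1], [0.9, 0.2, 0.8]))  # ~= 0.1839
```

A confident, correct prediction (probability near the true label) contributes almost nothing; a confident wrong one dominates the average, which is why the metric falls steadily as the model fits the data.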
60 | 61 | In the example above, xLearn uses *field-aware factorization machines (FFM)* to solve a binary classification problem. To solve a regression problem instead, set the ``task`` parameter to ``reg``: :: 62 | 63 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002} 64 | 65 | After training, xLearn creates a new file named ``model.out`` in the current directory. This file stores the trained model, and we can use it later for prediction: :: 66 | 67 | ffm_model.setTest("./small_test.txt") 68 | ffm_model.predict("./model.out", "./output.txt") 69 | 70 | After running the commands above, we get a new file ``output.txt`` in the current directory, which contains the output of the prediction task. We can show its first few lines with: :: 71 | 72 | head -n 5 ./output.txt 73 | 74 | -1.58631 75 | -0.393496 76 | -0.638334 77 | -0.38465 78 | -1.15343 79 | 80 | Each score corresponds to one example in the test data. A negative score predicts a negative example, and a positive score predicts a positive one (there are none in this example). In xLearn, the ``setSigmoid()`` API converts the scores to the (0, 1) range: :: 81 | 82 | ffm_model.setSigmoid() 83 | ffm_model.setTest("./small_test.txt") 84 | ffm_model.predict("./model.out", "./output.txt") 85 | 86 | The result looks like: :: 87 | 88 | head -n 5 ./output.txt 89 | 90 | 0.174698 91 | 0.413642 92 | 0.353551 93 | 0.414588 94 | 0.250373 95 | 96 | You can also use the ``setSign()`` API to convert the predictions to 0 or 1: :: 97 | 98 | ffm_model.setSign() 99 | ffm_model.setTest("./small_test.txt") 100 | ffm_model.predict("./model.out", "./output.txt") 101 | 102 | The result looks like: :: 103 | 104 | head -n 5 ./output.txt 105 | 106 | 0 107 | 0 108 | 0 109 | 0 110 | 0 111 | 112 | Model Output 113 | ---------------------------------------- 114 | 115 | The ``setTXTModel()`` API saves the model in a human-readable ``TXT`` format, for example: :: 116 | 117 | ffm_model.setTXTModel("./model.txt") 118 | ffm_model.fit(param, "./model.out") 119 | 120 | After running the commands above, a new file ``model.txt`` appears in the current directory, storing the model in ``TXT`` format: :: 121 | 122 | head -n 5 ./model.txt 123 | 124 | -1.041 125 | 0.31609 126 | 0 127 | 0 128 | 0 129 | 130 | For the linear model, the TXT output stores one model parameter per line. For FM and FFM, the model stores one latent vector per line. 131 | 132 | Linear: :: 133 | 134 | bias: 0 135 | i_0: 0 136 | i_1: 0 137 | i_2: 0 138 | i_3: 0 139 | 140 | FM: :: 141 | 142 | bias: 0 143 | i_0: 0 144 | i_1: 0 145 | i_2: 0 146 | i_3: 0 147 | v_0: 5.61937e-06 0.0212581 0.150338
0.222903 148 | v_1: 0.241989 0.0474224 0.128744 0.0995021 149 | v_2: 0.0657265 0.185878 0.0223869 0.140097 150 | v_3: 0.145557 0.202392 0.14798 0.127928 151 | 152 | FFM: :: 153 | 154 | bias: 0 155 | i_0: 0 156 | i_1: 0 157 | i_2: 0 158 | i_3: 0 159 | v_0_0: 5.61937e-06 0.0212581 0.150338 0.222903 160 | v_0_1: 0.241989 0.0474224 0.128744 0.0995021 161 | v_0_2: 0.0657265 0.185878 0.0223869 0.140097 162 | v_0_3: 0.145557 0.202392 0.14798 0.127928 163 | v_1_0: 0.219158 0.248771 0.181553 0.241653 164 | v_1_1: 0.0742756 0.106513 0.224874 0.16325 165 | v_1_2: 0.225384 0.240383 0.0411782 0.214497 166 | v_1_3: 0.226711 0.0735065 0.234061 0.103661 167 | v_2_0: 0.0771142 0.128723 0.0988574 0.197446 168 | v_2_1: 0.172285 0.136068 0.148102 0.0234075 169 | v_2_2: 0.152371 0.108065 0.149887 0.211232 170 | v_2_3: 0.123096 0.193212 0.0179155 0.0479647 171 | v_3_0: 0.055902 0.195092 0.0209918 0.0453358 172 | v_3_1: 0.154174 0.144785 0.184828 0.0785329 173 | v_3_2: 0.109711 0.102996 0.227222 0.248076 174 | v_3_3: 0.144264 0.0409806 0.17463 0.083712 175 | 176 | Online Learning 177 | ---------------------------------------- 178 | xLearn supports online learning, i.e., it can load a previously trained model and continue training. The ``setPreModel()`` API specifies the path of the pre-trained model file. For example: :: 179 | 180 | import xlearn as xl 181 | 182 | ffm_model = xl.create_ffm() 183 | ffm_model.setTrain("./small_train.txt") 184 | ffm_model.setValidate("./small_test.txt") 185 | ffm_model.setPreModel("./pre_model") 186 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 187 | 188 | ffm_model.fit(param, "./model.out") 189 | 190 | Note that xLearn can only load a binary pre-trained model, not a TXT-format text model. 191 | 192 | Choosing a Machine Learning Algorithm 193 | ---------------------------------------- 194 | 195 | Currently, xLearn supports three machine learning algorithms: the linear model (LR), the factorization machine (FM), and the field-aware factorization machine (FFM): :: 196 | 197 | import xlearn as xl 198 | 199 | ffm_model = xl.create_ffm() 200 | fm_model = xl.create_fm() 201 | lr_model = xl.create_linear() 202 | 203 | For LR and FM, the input data must be in ``CSV`` or ``libsvm`` format.
For FFM, the input data must be in ``libffm`` format: :: 204 | 205 | libsvm format: 206 | 207 | y index_1:value_1 index_2:value_2 ... index_n:value_n 208 | 209 | 0 0:0.1 1:0.5 3:0.2 ... 210 | 0 0:0.2 2:0.3 5:0.1 ... 211 | 1 0:0.2 2:0.3 5:0.1 ... 212 | 213 | CSV format: 214 | 215 | y value_1 value_2 .. value_n 216 | 217 | 0 0.1 0.2 0.2 ... 218 | 1 0.2 0.3 0.1 ... 219 | 0 0.1 0.2 0.4 ... 220 | 221 | libffm format: 222 | 223 | y field_1:index_1:value_1 field_2:index_2:value_2 ... 224 | 225 | 0 0:0:0.1 1:1:0.5 2:3:0.2 ... 226 | 0 0:0:0.2 1:2:0.3 2:5:0.1 ... 227 | 1 0:0:0.2 1:2:0.3 2:5:0.1 ... 228 | 229 | xLearn also accepts ``,`` as the separator, for example: :: 230 | 231 | libsvm format: 232 | 233 | label,index_1:value_1,index_2:value_2 ... index_n:value_n 234 | 235 | CSV format: 236 | 237 | label,value_1,value_2 .. value_n 238 | 239 | libffm format: 240 | 241 | label,field_1:index_1:value_1,field_2:index_2:value_2 ... 242 | 243 | Note that if the input CSV file does not contain ``y`` values, you must manually add a placeholder to every line (for the test data as well); otherwise, xLearn treats the first element as ``y``. 244 | 245 | LR and FM can also take ``libffm``-format input: xLearn simply ignores the ``field`` entries and treats the data as ``libsvm`` format. 246 | 247 | Setting the Validation Dataset 248 | ---------------------------------------- 249 | 250 | In machine learning, a validation dataset is used to tune hyper-parameters. In xLearn, the ``setValidate()`` API specifies the validation file, for example: :: 251 | 252 | import xlearn as xl 253 | 254 | ffm_model = xl.create_ffm() 255 | ffm_model.setTrain("./small_train.txt") 256 | ffm_model.setValidate("./small_test.txt") 257 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 258 | 259 | ffm_model.fit(param, "./model.out") 260 | 261 | Part of the program output is shown below: :: 262 | 263 | [ ACTION ] Start to train ...
264 | [------------] Epoch Train log_loss Test log_loss Time cost (sec) 265 | [ 10% ] 1 0.589475 0.535867 0.00 266 | [ 20% ] 2 0.540977 0.546504 0.00 267 | [ 30% ] 3 0.521881 0.531474 0.00 268 | [ 40% ] 4 0.507194 0.530958 0.00 269 | [ 50% ] 5 0.495460 0.530627 0.00 270 | [ 60% ] 6 0.483910 0.533307 0.00 271 | [ 70% ] 7 0.470661 0.527650 0.00 272 | [ 80% ] 8 0.465455 0.532556 0.00 273 | [ 90% ] 9 0.455787 0.538841 0.00 274 | [ ACTION ] Early-stopping at epoch 7 275 | 276 | As we can see, in this task the ``Train log_loss`` keeps decreasing, while the ``Test log_loss`` (validation loss) first decreases and then rises. This indicates that the model has begun to overfit the training data. 277 | 278 | By default, xLearn computes the validation loss at the end of every epoch; the ``metric`` parameter specifies a different evaluation metric. For classification tasks, the available metrics are ``acc`` (accuracy), ``prec`` (precision), ``f1``, and ``auc``, for example: :: 279 | 280 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric': 'acc'} 281 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric': 'prec'} 282 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric': 'f1'} 283 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric': 'auc'} 284 | 285 | For regression tasks, the metrics include ``mae``, ``mape``, and ``rmsd`` (also known as ``rmse``), for example: :: 286 | 287 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002, 'metric': 'rmse'} 288 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002, 'metric': 'mae'} 289 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002, 'metric': 'mape'} 290 | 291 | Cross-Validation 292 | ---------------------------------------- 293 | 294 | Cross-validation is a widely used technique for model hyper-parameter tuning. In xLearn, the ``cv()`` API enables it, for example: :: 295 | 296 | import xlearn as xl 297 | 298 | ffm_model = xl.create_ffm() 299 | ffm_model.setTrain("./small_train.txt") 300 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 301 | 302 | ffm_model.cv(param) 303 | 304 | By default, xLearn uses 3-fold cross-validation (i.e., it splits the dataset evenly into 3 parts); the ``fold`` parameter sets the number of folds, for example: :: 305 | 306 | import xlearn as xl 307 | 308 | ffm_model = xl.create_ffm() 309 |
ffm_model.setTrain("./small_train.txt") 310 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'fold':5} 311 | 312 | ffm_model.cv(param) 313 | 314 | The command above splits the dataset into 5 folds, and xLearn computes the average validation loss at the end: :: 315 | 316 | [------------] Average log_loss: 0.549758 317 | [ ACTION ] Finish Cross-Validation 318 | [ ACTION ] Clear the xLearn environment ... 319 | [------------] Total time cost: 0.05 (sec) 320 | 321 | Choosing an Optimization Algorithm 322 | ---------------------------------------- 323 | 324 | In xLearn, the ``opt`` parameter selects the optimization algorithm. Currently, xLearn supports ``SGD``, ``AdaGrad``, and ``FTRL``; the default is ``AdaGrad``: :: 325 | 326 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'opt':'sgd'} 327 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'opt':'adagrad'} 328 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'opt':'ftrl'} 329 | 330 | Compared with classical SGD (stochastic gradient descent), AdaGrad adapts the learning rate per parameter: it makes larger updates for infrequent parameters and smaller updates for frequent ones. For this reason, AdaGrad is commonly used for optimization on sparse data. Moreover, SGD is more sensitive to the learning rate than AdaGrad, which makes it harder to tune. 331 | 332 | FTRL (Follow-the-Regularized-Leader) is also widely used for large-scale sparse-data optimization. Compared with SGD and AdaGrad, FTRL requires the user to tune more hyper-parameters; we introduce xLearn's hyper-parameter tuning in detail in the next section. 333 | 334 | Hyper-parameter Tuning 335 | ---------------------------------------- 336 | 337 | In machine learning, a hyper-parameter is a parameter set before training, while model parameters are the values updated during training. Hyper-parameter tuning is usually an unavoidable part of the training process. 338 | 339 | First, the learning rate is one of the most important hyper-parameters; it controls the step size of each model update. The default value in xLearn is 0.2, and the ``lr`` parameter changes it: :: 340 | 341 | param = {'task':'binary', 'lr':0.2} 342 | param = {'task':'binary', 'lr':0.5} 343 | param = {'task':'binary', 'lr':0.01} 344 | 345 | The ``lambda`` parameter controls regularization. xLearn uses ``L2`` regularization, whose default value is ``0.00002``: :: 346 | 347 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01} 348 | param = {'task':'binary', 'lr':0.2, 'lambda':0.02} 349 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 350 | 351 | For the FTRL algorithm, besides the learning rate and the regularization term, we also need to tune additional hyper-parameters, including ``alpha``, ``beta``, ``lambda_1``, and ``lambda_2``, for example: :: 352 | 353 | param =
{'alpha':0.002, 'beta':0.8, 'lambda_1':0.001, 'lambda_2': 1.0} 354 | 355 | For FM and FFM, the ``k`` parameter sets the size of the latent vectors. By default, xLearn sets it to ``4``: :: 356 | 357 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'k':2} 358 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'k':4} 359 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'k':5} 360 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'k':8} 361 | 362 | Note that xLearn uses *SSE* instructions to accelerate vector operations, processing 4 vector elements at a time, so ``k=2`` takes the same computation time as ``k=4``. 363 | 364 | In addition, for FM and FFM the ``init`` hyper-parameter scales the model initialization. By default, this value is set to ``0.66``: :: 365 | 366 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'init':0.80} 367 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'init':0.40} 368 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'init':0.10} 369 | 370 | Epochs & Early Stopping 371 | ---------------------------------------- 372 | 373 | During training, each epoch passes over the entire training data. In xLearn, the ``epoch`` parameter sets the number of epochs: :: 374 | 375 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'epoch':3} 376 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'epoch':5} 377 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'epoch':10} 378 | 379 | If a validation dataset is set, xLearn by default performs early stopping when the best validation result has been reached, for example: :: 380 | 381 | import xlearn as xl 382 | 383 | ffm_model = xl.create_ffm() 384 | ffm_model.setTrain("./small_train.txt") 385 | ffm_model.setValidate("./small_test.txt") 386 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'epoch':10} 387 | 388 | ffm_model.fit(param, "./model.out") 389 | 390 | In the code above we ask for 10 epochs, but xLearn stops early at epoch 7 (you may get a different epoch on your local machine): :: 391 | 392 | Early-stopping at epoch 7 393 | Start to save model ...
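This early-stopping behaviour can be sketched generically. The function below is illustrative only, not xLearn's internal code; the window size of 2 is an assumption used to reproduce the run above, fed with the validation losses from the earlier epoch table:

```python
def train_with_early_stop(losses, window=2):
    # losses: validation loss after each epoch (1-based epochs).
    # Stop once `window` consecutive epochs fail to beat the best loss
    # seen so far; the model from the best epoch is the one kept.
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= window:
            break  # early stop: no improvement within the window
    return best_epoch, best_loss

# Validation losses from the run shown earlier: best at epoch 7,
# training halts two non-improving epochs later.
epoch, loss = train_with_early_stop(
    [0.535867, 0.546504, 0.531474, 0.530958, 0.530627,
     0.533307, 0.527650, 0.532556, 0.538841])
```

Running this returns epoch 7, matching the ``Early-stopping at epoch 7`` line in the output above.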
394 | 395 | The ``stop_window`` parameter sets the window size of the early-stopping mechanism; e.g., ``stop_window=2`` means that training stops if no better validation result appears within the next two epochs, and the best model seen so far is saved: :: 396 | 397 | param = {'task':'binary', 'lr':0.2, 398 | 'lambda':0.002, 'epoch':10, 399 | 'stop_window':3} 400 | 401 | ffm_model.fit(param, "./model.out") 402 | 403 | The ``disableEarlyStop()`` API disables early stopping: :: 404 | 405 | import xlearn as xl 406 | 407 | ffm_model = xl.create_ffm() 408 | ffm_model.setTrain("./small_train.txt") 409 | ffm_model.setValidate("./small_test.txt") 410 | ffm_model.disableEarlyStop() 411 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'epoch':10} 412 | 413 | ffm_model.fit(param, "./model.out") 414 | 415 | With this setting, xLearn runs the full 10 training epochs. 416 | 417 | Note that by default, if no metric is set, xLearn selects the best stopping point by the test loss; if a metric is set, the metric value decides when to stop. 418 | 419 | Lock-Free Learning 420 | ---------------------------------------- 421 | 422 | By default, xLearn performs *Hogwild!* lock-free training, which parallelizes training across multiple CPU cores to improve CPU utilization and speed up convergence. However, the lock-free algorithm is non-deterministic: running the following command several times yields slightly different loss values: :: 423 | 424 | import xlearn as xl 425 | 426 | ffm_model = xl.create_ffm() 427 | ffm_model.setTrain("./small_train.txt") 428 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 429 | 430 | ffm_model.fit(param, "./model.out") 431 | 432 | The 1st time: 0.449056 433 | The 2nd time: 0.449302 434 | The 3rd time: 0.449185 435 | 436 | The ``nthread`` parameter sets the number of CPU cores to use, for example: :: 437 | 438 | import xlearn as xl 439 | 440 | ffm_model = xl.create_ffm() 441 | ffm_model.setTrain("./small_train.txt") 442 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'nthread':4} 443 | 444 | ffm_model.fit(param, "./model.out") 445 | 446 | The code above trains the model with 4 CPU cores. If this option is not set, xLearn uses all CPU cores by default. 447 | 448 | xLearn reports the number of threads in use: :: 449 | 450 | [------------] xLearn uses 4 threads for training task. 451 | [ ACTION ] Read Problem ...
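The small run-to-run differences come from unsynchronized floating-point updates: when threads accumulate gradient contributions in different orders, the rounded results differ, because floating-point addition is not associative. A minimal illustration in plain Python (unrelated to xLearn's internals):

```python
# Floating-point addition is not associative, so two threads interleaving
# the same updates in different orders can produce slightly different
# weights -- the source of the non-deterministic loss values above.
a = (0.1 + 0.2) + 0.3   # one interleaving
b = 0.1 + (0.2 + 0.3)   # another interleaving
print(a == b)           # False: 0.6000000000000001 vs 0.6
```

The discrepancy is on the order of one unit in the last place, which is why the loss values above agree to three or four significant digits.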
452 | 453 | 454 | The ``disableLockFree()`` API disables multi-core lock-free training: :: 455 | 456 | import xlearn as xl 457 | 458 | ffm_model = xl.create_ffm() 459 | ffm_model.setTrain("./small_train.txt") 460 | ffm_model.disableLockFree() 461 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 462 | 463 | ffm_model.fit(param, "./model.out") 464 | 465 | Now the results computed by xLearn are deterministic: :: 466 | 467 | The 1st time: 0.449172 468 | The 2nd time: 0.449172 469 | The 3rd time: 0.449172 470 | 471 | The drawback of ``disableLockFree()`` is that training becomes much slower than lock-free training, so we recommend keeping lock-free training enabled on large-scale data. 472 | 473 | Instance-Wise Normalization 474 | ---------------------------------------- 475 | 476 | For FM and FFM, xLearn applies instance-wise normalization to the features by default. In large-scale sparse-data scenarios (such as CTR prediction), this technique is very effective, but it can sometimes hurt model accuracy. The ``disableNorm()`` API turns it off: :: 477 | 478 | import xlearn as xl 479 | 480 | ffm_model = xl.create_ffm() 481 | ffm_model.setTrain("./small_train.txt") 482 | ffm_model.disableNorm() 483 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 484 | 485 | ffm_model.fit(param, "./model.out") 486 | 487 | Note that if instance-wise normalization is used during training, it must also be used during prediction. 488 | 489 | Quiet Mode 490 | ---------------------------------------- 491 | 492 | xLearn supports a quiet training mode: calling the ``setQuiet()`` API makes xLearn skip computing any evaluation metric during training, which can greatly speed up training: :: 493 | 494 | import xlearn as xl 495 | 496 | ffm_model = xl.create_ffm() 497 | ffm_model.setTrain("./small_train.txt") 498 | ffm_model.setQuiet() 499 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 500 | 501 | ffm_model.fit(param, "./model.out") 502 | 503 | DMatrix Transformation 504 | ---------------------------------------- 505 | The following code shows how to use the xLearn Python DMatrix API. You can find the sample data (``house_price_train.txt`` and ``house_price_test.txt``) under the ``demo/regression/house_price`` directory: 506 | 507 | ..
code-block:: python 508 | 509 | import xlearn as xl 510 | import numpy as np 511 | import pandas as pd 512 | 513 | # Read data from file 514 | house_price_train = pd.read_csv("house_price_train.txt", header=None, sep="\t") 515 | house_price_test = pd.read_csv("house_price_test.txt", header=None, sep="\t") 516 | 517 | # Get train X, y 518 | X_train = house_price_train[house_price_train.columns[1:]] 519 | y_train = house_price_train[0] 520 | 521 | # Get test X, y 522 | X_test = house_price_test[house_price_test.columns[1:]] 523 | y_test = house_price_test[0] 524 | 525 | # DMatrix transformation; to use fields, the user must pass a field map (an array) for the features 526 | xdm_train = xl.DMatrix(X_train, y_train) 527 | xdm_test = xl.DMatrix(X_test, y_test) 528 | 529 | # Training task 530 | fm_model = xl.create_fm() # Use factorization machine 531 | # We use the same API as training from a file, 532 | # i.e., you can now also pass an xl.DMatrix to this API 533 | fm_model.setTrain(xdm_train) # Training data 534 | fm_model.setValidate(xdm_test) # Validation data 535 | 536 | # param: 537 | # 0. regression task 538 | # 1. learning rate: 0.2 539 | # 2. regularization lambda: 0.002 540 | # 3.
evaluation metric: mae 541 | param = {'task':'reg', 'lr':0.2, 542 | 'lambda':0.002, 'metric':'mae'} 543 | 544 | # Start to train 545 | # The trained model will be stored in model_dm.out 546 | fm_model.fit(param, './model_dm.out') 547 | 548 | # Prediction task 549 | # We use the same API as testing from a file, 550 | # i.e., you can now also pass an xl.DMatrix to this API 551 | fm_model.setTest(xdm_test) # Test data 552 | 553 | # Start to predict 554 | # No output path is given here, so the result is 555 | # returned as a numpy.ndarray 556 | res = fm_model.predict("./model_dm.out") 557 | 558 | **Note:** Training from a DMatrix does not support cross-validation yet; we will add this feature soon. 559 | 560 | Scikit-learn API 561 | ---------------------------------------- 562 | 563 | xLearn also supports the Scikit-learn API: :: 564 | 565 | import numpy as np 566 | import xlearn as xl 567 | from sklearn.datasets import load_iris 568 | from sklearn.model_selection import train_test_split 569 | 570 | # Load dataset 571 | iris_data = load_iris() 572 | X = iris_data['data'] 573 | y = (iris_data['target'] == 2) 574 | 575 | X_train, \ 576 | X_val, \ 577 | y_train, \ 578 | y_val = train_test_split(X, y, test_size=0.3, random_state=0) 579 | 580 | # param: 581 | # 0. binary classification 582 | # 1. model scale: 0.1 583 | # 2. epoch number: 10 (auto early-stop) 584 | # 3. learning rate: 0.1 585 | # 4. regularization lambda: 1.0 586 | # 5. use sgd optimization method 587 | linear_model = xl.LRModel(task='binary', init=0.1, 588 | epoch=10, lr=0.1, 589 | reg_lambda=1.0, opt='sgd') 590 | 591 | # Start to train 592 | linear_model.fit(X_train, y_train, 593 | eval_set=[X_val, y_val], 594 | is_lock_free=False) 595 | 596 | # Generate predictions 597 | y_pred = linear_model.predict(X_val) 598 | 599 | ..
__: https://github.com/aksnzhy/xlearn/tree/master/demo/classification/scikit_learn_demo 600 | -------------------------------------------------------------------------------- /tune/index.rst: -------------------------------------------------------------------------------- 1 | xLearn Hyper-parameter Tuning Guide 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | Coming soon ... -------------------------------------------------------------------------------- /tutorial/index.rst: -------------------------------------------------------------------------------- 1 | xLearn Tutorials 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | (1) `深入FFM原理与实践(美团技术团队)`__ 5 | (2) `一文读懂FM算法优势,并用python实现`__ 6 | (3) `Introductory Guide – Factorization Machines & their application on huge datasets (with codes in Python)`__ 7 | (4) `简单高效的组合特征自动挖掘框架`__ 8 | 9 | .. __: https://tech.meituan.com/deep_understanding_of_ffm_principles_and_practices.html 10 | .. __: https://yq.aliyun.com/articles/374170 11 | .. __: https://www.analyticsvidhya.com/blog/2018/01/factorization-machines/ 12 | .. __: https://zhuanlan.zhihu.com/p/42946318 --------------------------------------------------------------------------------