├── LICENSE ├── Makefile ├── README.md ├── R_api └── index.rst ├── all_api └── index.rst ├── cli_api └── index.rst ├── conf.py ├── demo └── index.rst ├── images ├── out-of-core.png ├── ps.png └── speed.png ├── index.rst ├── install ├── index.rst └── install_windows.rst ├── large └── index.rst ├── make.bat ├── python_api └── index.rst ├── tune └── index.rst └── tutorial └── index.rst /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 
34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | # Minimal makefile for Sphinx documentation 2 | # 3 | 4 | # You can set these variables from the command line. 5 | SPHINXOPTS = 6 | SPHINXBUILD = sphinx-build 7 | SPHINXPROJ = xLearn 8 | SOURCEDIR = . 9 | BUILDDIR = _build 10 | 11 | # Put it first so that "make" without argument is like "make help". 12 | help: 13 | @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) 14 | 15 | .PHONY: help Makefile 16 | 17 | # Catch-all target: route all unknown targets to Sphinx using the new 18 | # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). 
19 | %: Makefile 20 | @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # xlearn_doc_cn 2 | Chinese documentation for xlearn 3 | -------------------------------------------------------------------------------- /R_api/index.rst: -------------------------------------------------------------------------------- 1 | xLearn R API Guide 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | The xLearn R package guide is coming soon. -------------------------------------------------------------------------------- /all_api/index.rst: -------------------------------------------------------------------------------- 1 | xLearn API Overview 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | This page lists all xLearn APIs, including the command line interface and the Python interface. 5 | 6 | Command Line Interface 7 | ------------------------------ 8 | 9 | Training: :: 10 | 11 | xlearn_train [OPTIONS] 12 | 13 | Options: :: 14 | 15 | -s : Type of machine learning model (default 0) 16 | for classification task: 17 | 0 -- linear model (GLM) 18 | 1 -- factorization machines (FM) 19 | 2 -- field-aware factorization machines (FFM) 20 | for regression task: 21 | 3 -- linear model (GLM) 22 | 4 -- factorization machines (FM) 23 | 5 -- field-aware factorization machines (FFM) 24 | 25 | -x : The metric can be 'acc', 'prec', 'recall', 'f1', 'auc' for classification, and 26 | 'mae', 'mape', 'rmsd (rmse)' for regression. By default, xLearn will not print 27 | any evaluation metric information (it only prints the loss value). 28 | 29 | -p : Choose the optimization method, including 'sgd', 'adagrad', and 'ftrl'. By default, 30 | xLearn uses the 'adagrad' optimization method. 31 | 32 | -v : Path of the validation data. This option is empty by default; in that case, 33 | xLearn will not perform the validation process. 34 | 35 | -m : Path of the model dump file. By default, the model file name is 'train_file' + '.model'.
36 | If we set this value to 'none', xLearn will not dump the model checkpoint. 37 | 38 | -pre : Path of the pre-trained model. This can be used for online learning. 39 | 40 | -t : Path of the TEXT model checkpoint file. By default, this option is not set 41 | and xLearn will not dump the TEXT model. 42 | 43 | -l : Path of the log file. xLearn uses '/tmp/xlearn_log.*' by default. 44 | 45 | -k : Number of latent factors used by FM and FFM tasks. Using 4 by default. 46 | Note that we will get the same model size when setting k to 1 and 4. 47 | This is because we use SSE instructions and the memory needs to be aligned. 48 | So even if you assign k = 1, we still fill some dummy zeros from k = 2 to 4. 49 | 50 | -r : Learning rate for the optimization method. Using 0.2 by default. 51 | xLearn can use adaptive gradient descent (AdaGrad) for optimization; 52 | if you choose the AdaGrad method, the learning rate will be adapted automatically. 53 | 54 | -b : Lambda for L2 regularization. Using 0.00002 by default. We can disable the 55 | regularization term by setting this value to zero. 56 | 57 | -alpha : Hyper parameter used by ftrl. 58 | 59 | -beta : Hyper parameter used by ftrl. 60 | 61 | -lambda_1 : Hyper parameter used by ftrl. 62 | 63 | -lambda_2 : Hyper parameter used by ftrl. 64 | 65 | -u : Hyper parameter used to initialize model parameters. Using 0.66 by default. 66 | 67 | -e : Number of epochs for the training process. Using 10 by default. Note that xLearn will perform 68 | early-stopping by default, so this value is just an upper bound. 69 | 70 | -f : Number of folds for cross-validation (if we set the --cv option). Using 5 by default. 71 | 72 | -nthread : Number of threads for multi-thread lock-free learning (Hogwild!). 73 | 74 | -block : Block size for on-disk training. 75 | 76 | -sw : Size of the stop window for early-stopping. Using 2 by default. 77 | 78 | -seed : Random seed to shuffle the data set.
79 | 80 | --disk : Enable on-disk training for large-scale machine learning problems. 81 | 82 | --cv : Enable cross-validation in training tasks. If we use this option, xLearn will ignore 83 | the validation file (set by the -v option). 84 | 85 | --dis-lock-free : Disable lock-free training. Lock-free training can accelerate training but the result 86 | is non-deterministic. Our suggestion is that you can set this flag if the training data 87 | is big and sparse. 88 | 89 | --dis-es : Disable early-stopping in training. By default, xLearn will use early-stopping 90 | in the training process, except when training with cross-validation. 91 | 92 | --no-norm : Disable instance-wise normalization. By default, xLearn will use instance-wise 93 | normalization in both training and prediction processes. 94 | 95 | --no-bin : Do not generate a bin file for the training and test data files. 96 | 97 | --quiet : Don't print any evaluation information during training; just train the 98 | model quietly. It can accelerate the training process. 99 | 100 | Prediction: :: 101 | 102 | xlearn_predict [OPTIONS] 103 | 104 | Options: :: 105 | 106 | -o : Path of the output file. By default, this value will be set to 'test_file' + '.out' 107 | 108 | -l : Path of the log file. xLearn uses '/tmp/xlearn_log' by default. 109 | 110 | -nthread : Number of threads for multi-thread lock-free learning (Hogwild!). 111 | 112 | -block : Block size for on-disk prediction. 113 | 114 | --sign : Convert the output results to 0 or 1. 115 | 116 | --sigmoid : Convert the output results to the range (0, 1) (probability). 117 | 118 | --disk : On-disk prediction. 119 | 120 | --no-norm : Disable instance-wise normalization. By default, xLearn will use instance-wise 121 | normalization in both training and prediction processes.
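The ``--sign`` and ``--sigmoid`` flags above are simple post-processing steps applied to the raw model scores. A minimal pure-Python sketch of what they compute (illustrative only, not xLearn's own code):

```python
import math

def sigmoid(score):
    # Map a raw model score to a probability in (0, 1), as --sigmoid does.
    return 1.0 / (1.0 + math.exp(-score))

def sign(score):
    # Map a raw model score to a hard 0/1 label, as --sign does.
    return 1 if score > 0 else 0

# Raw scores like those written by xlearn_predict:
raw_scores = [-1.9872, -0.0707959, 0.456214]
probs = [sigmoid(s) for s in raw_scores]
labels = [sign(s) for s in raw_scores]
```

A negative raw score maps to a probability below 0.5 and a hard label of 0, which is why both flags agree on which samples are negative.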
122 | 123 | Python Interface 124 | ------------------------------ 125 | 126 | API list: :: 127 | 128 | import xlearn as xl # Import xlearn package 129 | 130 | xl.hello() # Say hello to user 131 | 132 | # This part is for data 133 | # X is feature data, which can be a pandas DataFrame or numpy.ndarray, 134 | # y is the label, default None, which can be a pandas DataFrame/Series, array, or list, 135 | # field_map is the field map of features, default None, which can be a pandas DataFrame/Series, array, or list 136 | dmatrix = xl.DMatrix(X, y, field_map) 137 | 138 | model = xl.create_linear() # Create linear model. 139 | 140 | model = xl.create_fm() # Create factorization machines. 141 | 142 | model = xl.create_ffm() # Create field-aware factorization machines. 143 | 144 | model.show() # Show model information. 145 | 146 | model.fit(param, "model_path") # Train model. 147 | 148 | model.cv(param) # Perform cross-validation. 149 | 150 | # Users can choose one of these two 151 | model.predict("model_path", "output_path") # Perform prediction, write results to file, return None. 152 | model.predict("model_path") # Perform prediction, return results as numpy.ndarray. 153 | 154 | # Users can choose one of these two 155 | model.setTrain("data_path") # Set training data from file for xLearn. 156 | model.setTrain(dmatrix) # Set training data from DMatrix for xLearn. 157 | 158 | # Users can choose one of these two 159 | # note: the validation data must be of the same type as the training data; 160 | # that is, if training data is set from a file, validation data must also be set from a file 161 | model.setValidate("data_path") # Set validation data from file for xLearn. 162 | model.setValidate(dmatrix) # Set validation data from DMatrix for xLearn. 163 | 164 | # Users can choose one of these two 165 | model.setTest("data_path") # Set test data from file for xLearn. 166 | model.setTest(dmatrix) # Set test data from DMatrix for xLearn. 167 | 168 | model.setQuiet() # Set xlearn to train the model quietly. 169 | 170 | model.setOnDisk() # Set xlearn to use on-disk training.
171 | 172 | model.setNoBin() # Do not generate bin file for training and test data. 173 | 174 | model.setSign() # Convert prediction to 0 and 1. 175 | 176 | model.setSigmoid() # Convert prediction to (0, 1). 177 | 178 | model.disableNorm() # Disable instance-wise normalization. 179 | 180 | model.disableLockFree() # Disable lock-free training. 181 | 182 | model.disableEarlyStop() # Disable early-stopping. 183 | 184 | Hyper-parameter list: :: 185 | 186 | task : {'binary', # Binary classification 187 | 'reg'} # Regression 188 | 189 | metric : {'acc', 'prec', 'recall', 'f1', 'auc', # for classification 190 | 'mae', 'mape', 'rmse', 'rmsd'} # for regression 191 | 192 | lr : float value # learning rate 193 | 194 | lambda : float value # regularization lambda 195 | 196 | k : int value # latent factors for fm and ffm 197 | 198 | init : float value # model initialization 199 | 200 | alpha : float value # hyper parameter for ftrl 201 | 202 | beta : float value # hyper parameter for ftrl 203 | 204 | lambda_1 : float value # hyper parameter for ftrl 205 | 206 | lambda_2 : float value # hyper parameter for ftrl 207 | 208 | nthread : int value # the number of CPU cores 209 | 210 | epoch : int value # number of epochs 211 | 212 | fold : int value # number of folds for cross-validation 213 | 214 | opt : {'sgd', 'adagrad', 'ftrl'} # optimization method 215 | 216 | stop_window : int value # size of the stop window for early-stopping 217 | 218 | block_size : int value # block size for on-disk training 219 | 220 | R Interface 221 | ------------------------------ 222 | 223 | The xLearn R API page is coming soon.
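The hyper-parameter list earlier on this page is passed to xLearn as a plain Python ``dict`` (e.g. to ``model.fit``). A minimal sketch, in pure Python with no xLearn dependency, that builds such a dict and checks the choice-type fields against the documented values; the ``check_param`` helper is illustrative and not part of the xLearn API:

```python
# Documented choice-type hyper-parameters and their allowed values.
ALLOWED = {
    "task": {"binary", "reg"},
    "metric": {"acc", "prec", "recall", "f1", "auc",
               "mae", "mape", "rmse", "rmsd"},
    "opt": {"sgd", "adagrad", "ftrl"},
}

def check_param(param):
    # Return the (key, value) pairs whose value is not an allowed choice.
    return [(k, v) for k, v in param.items()
            if k in ALLOWED and v not in ALLOWED[k]]

param = {"task": "binary", "lr": 0.2, "lambda": 0.002,
         "metric": "auc", "opt": "adagrad", "epoch": 10}
bad = check_param(param)   # empty list: every choice above is valid
```

Numeric fields such as ``lr`` and ``lambda`` are passed through unchecked here; only the enumerated options are validated.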
224 | -------------------------------------------------------------------------------- /cli_api/index.rst: -------------------------------------------------------------------------------- 1 | xLearn Command Line Guide 2 | =============================== 3 | 4 | If you have compiled and installed xLearn, you will see the two executables ``xlearn_train`` and ``xlearn_predict`` in the current ``build`` directory. They can be used for model training and prediction. 5 | 6 | Quick Start 7 | ---------------------------------------- 8 | 9 | Make sure you are in the xLearn ``build`` directory, where you can find the two sample data sets ``small_test.txt`` and ``small_train.txt``. We train a model with the following command: :: 10 | 11 | ./xlearn_train ./small_train.txt 12 | 13 | Below is part of the program output. Note that the ``log_loss`` values shown here may differ slightly from the ``log_loss`` values computed on your machine: :: 14 | 15 | [ ACTION ] Start to train ... 16 | [------------] Epoch Train log_loss Time cost (sec) 17 | [ 10% ] 1 0.569292 0.00 18 | [ 20% ] 2 0.517142 0.00 19 | [ 30% ] 3 0.490124 0.00 20 | [ 40% ] 4 0.470445 0.00 21 | [ 50% ] 5 0.451919 0.00 22 | [ 60% ] 6 0.437888 0.00 23 | [ 70% ] 7 0.425603 0.00 24 | [ 80% ] 8 0.415573 0.00 25 | [ 90% ] 9 0.405933 0.00 26 | [ 100% ] 10 0.396388 0.00 27 | [ ACTION ] Start to save model ... 28 | [------------] Model file: ./small_train.txt.model 29 | 30 | By default, xLearn trains our model with *logistic regression (LR)* for 10 epochs.
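The ``Train log_loss`` column in the output above is the average logistic loss over the training set at the end of each epoch. As a quick illustration of how such a value is computed (a generic sketch, not xLearn's implementation):

```python
import math

def log_loss(y_true, y_prob):
    """Average logistic loss; y_true in {0, 1}, y_prob in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Predicting 0.5 for everything gives ln(2) ~ 0.693; confident correct
# predictions push the loss toward 0, as in the epochs above.
loss = log_loss([1, 0, 1], [0.9, 0.2, 0.8])
```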
31 | 32 | We find that after training, xLearn has produced a new file called ``small_train.txt.model`` in the current directory. This file stores the trained model, which we can use for prediction later: :: 33 | 34 | ./xlearn_predict ./small_test.txt ./small_train.txt.model 35 | 36 | After running the above command, we get a new file ``small_test.txt.out`` in the current directory, which is the output of the prediction task. We can show the first few lines of this output file with the following command: :: 37 | 38 | head -n 5 ./small_test.txt.out 39 | 40 | -1.9872 41 | -0.0707959 42 | -0.456214 43 | -0.170811 44 | -1.28986 45 | 46 | Each line here is the score for one sample in the test data. A negative score means we predict the sample to be a negative sample, and a positive score means a positive sample (there are none in this example). In xLearn, users can convert the scores into the range (0, 1) with the ``--sigmoid`` option, or into 0/1 with the ``--sign`` option: :: 47 | 48 | ./xlearn_predict ./small_test.txt ./small_train.txt.model --sigmoid 49 | head -n 5 ./small_test.txt.out 50 | 51 | 0.120553 52 | 0.482308 53 | 0.387884 54 | 0.457401 55 | 0.215877 56 | 57 | ./xlearn_predict ./small_test.txt ./small_train.txt.model --sign 58 | head -n 5 ./small_test.txt.out 59 | 60 | 0 61 | 0 62 | 0 63 | 0 64 | 0 65 | 66 | Model Output 67 | ---------------------------------------- 68 | 69 | Users can generate different models by setting different hyper-parameters, and xLearn uses the ``-m`` option to specify the path of these output model files. By default, the model file path is ``training_data_name`` + ``.model`` in the current working directory: :: 70 | 71 | ./xlearn_train ./small_train.txt -m new_model 72 | 73 | Users can also dump the model into a human-readable ``TXT`` format with the ``-t`` option, for example: :: 74 | 75 | ./xlearn_train ./small_train.txt -t model.txt 76 | 77 | After running the above command, we find a new file ``model.txt`` in the current directory, which stores the output model in ``TXT`` format: :: 78 | 79 | head -n 5 ./model.txt 80 | 81 | -0.688182 82 | 0.458082 83 | 0 84 | 0 85 | 0 86 | 87 | For a linear model, the ``TXT`` format stores each model parameter on one line. For FM and FFM, the model stores each latent vector on one line. 88 | 89 | Linear: :: 90 | 91 | bias: 0 92 | i_0: 0 93 | i_1: 0 94 | i_2: 0 95 | i_3: 0 96 | 97 | FM: :: 98 | 99 | bias: 0 100 | i_0: 0 101 | i_1: 0 102 | i_2: 0 103 | i_3: 0 104 | v_0: 5.61937e-06 0.0212581 0.150338 0.222903 105 | v_1: 0.241989 0.0474224 0.128744 0.0995021 106 | v_2: 0.0657265 0.185878 0.0223869 0.140097 107 | v_3: 0.145557 0.202392 0.14798 0.127928 108 | 109 | FFM: :: 110 | 111 | bias: 0 112 | i_0: 0 113 | i_1: 0 114 | i_2: 0 115 | i_3: 0 116 | v_0_0: 5.61937e-06 0.0212581 0.150338 0.222903 117 | v_0_1: 0.241989 0.0474224 0.128744 0.0995021 118 | v_0_2: 0.0657265 0.185878 0.0223869 0.140097 119 | v_0_3: 0.145557 0.202392 0.14798 0.127928 120 | v_1_0: 0.219158 0.248771 0.181553 0.241653 121 | v_1_1: 0.0742756 0.106513 0.224874 0.16325 122 | v_1_2: 0.225384 0.240383 0.0411782 0.214497 123 | v_1_3: 0.226711 0.0735065 0.234061 0.103661 124 | v_2_0: 0.0771142 0.128723 0.0988574 0.197446 125 | v_2_1: 0.172285 0.136068 0.148102 0.0234075 126 | v_2_2: 0.152371 0.108065 0.149887 0.211232 127 | v_2_3: 0.123096 0.193212 0.0179155 0.0479647 128 | v_3_0: 0.055902 0.195092 0.0209918 0.0453358 129 | v_3_1: 0.154174 0.144785 0.184828 0.0785329 130 | v_3_2: 0.109711 0.102996 0.227222 0.248076 131 | v_3_3: 0.144264 0.0409806 0.17463 0.083712 132 | 133 | Online Learning 134 | ---------------------------------------- 135 | xLearn supports online learning, i.e., it can load a previously pre-trained model and continue learning from it. Users can specify the path of the pre-trained model file with the ``-pre`` option. For example: :: 136 | 137 | ./xlearn_train ./small_train.txt -s 0 -pre ./pre_model 138 | 139 | Note that xLearn can only load a binary pre-trained model, not a TXT-format text model. 140 | 141 | Prediction Output 142 | ---------------------------------------- 143 | 144 | Users can specify the path of the prediction output file with the ``-o`` option. For example: :: 145 | 146 | ./xlearn_predict ./small_test.txt ./small_train.txt.model -o output.txt 147 | head -n 5 ./output.txt 148 | 149 | -2.01192 150 | -0.0657416 151 | -0.456185 152 | -0.170979 153 | -1.28849 154 | 155 | By default, the prediction output file path is ``test_data_name`` + ``.out`` in the current directory. 156 | 157 | Choosing the Machine Learning Algorithm 158 | ---------------------------------------- 159 | 160 | Currently, xLearn supports three different machine learning algorithms: the linear model (LR), the factorization machine (FM), and the field-aware factorization machine (FFM).
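The ``-s`` values follow a fixed scheme: 0-2 select classification models and 3-5 the corresponding regression models. As a quick reference, a small lookup table makes the scheme explicit (illustrative only):

```python
# Mapping of the -s option to (task, model), as documented for xlearn_train.
S_OPTION = {
    0: ("classification", "linear"),
    1: ("classification", "fm"),
    2: ("classification", "ffm"),
    3: ("regression", "linear"),
    4: ("regression", "fm"),
    5: ("regression", "ffm"),
}

def describe(s):
    # Human-readable description of an -s value.
    task, model = S_OPTION[s]
    return f"{task} with {model}"
```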
161 | 162 | Users can select different algorithms with the ``-s`` option: :: 163 | 164 | ./xlearn_train ./small_train.txt -s 0 # Classification: Linear model (GLM) 165 | ./xlearn_train ./small_train.txt -s 1 # Classification: Factorization machine (FM) 166 | ./xlearn_train ./small_train.txt -s 2 # Classification: Field-aware factorization machine (FFM) 167 | 168 | ./xlearn_train ./small_train.txt -s 3 # Regression: Linear model (GLM) 169 | ./xlearn_train ./small_train.txt -s 4 # Regression: Factorization machine (FM) 170 | ./xlearn_train ./small_train.txt -s 5 # Regression: Field-aware factorization machine (FFM) 171 | 172 | For the LR and FM algorithms, the input data format must be ``CSV`` or ``libsvm``. For the FFM algorithm, the input data must be in ``libffm`` format: :: 173 | 174 | libsvm format: 175 | 176 | y index_1:value_1 index_2:value_2 ... index_n:value_n 177 | 178 | 0 0:0.1 1:0.5 3:0.2 ... 179 | 0 0:0.2 2:0.3 5:0.1 ... 180 | 1 0:0.2 2:0.3 5:0.1 ... 181 | 182 | CSV format: 183 | 184 | y value_1 value_2 .. value_n 185 | 186 | 0 0.1 0.2 0.2 ... 187 | 1 0.2 0.3 0.1 ... 188 | 0 0.1 0.2 0.4 ... 189 | 190 | libffm format: 191 | 192 | y field_1:index_1:value_1 field_2:index_2:value_2 ... 193 | 194 | 0 0:0:0.1 1:1:0.5 2:3:0.2 ... 195 | 0 0:0:0.2 1:2:0.3 2:5:0.1 ... 196 | 1 0:0:0.2 1:2:0.3 2:5:0.1 ... 197 | 198 | xLearn can also use ``,`` as the data separator, for example: :: 199 | 200 | libsvm format: 201 | 202 | label,index_1:value_1,index_2:value_2 ... index_n:value_n 203 | 204 | CSV format: 205 | 206 | label,value_1,value_2 .. value_n 207 | 208 | libffm format: 209 | 210 | label,field_1:index_1:value_1,field_2:index_2:value_2 ... 211 | 212 | Note that if the input csv file does not contain a ``y`` value, users must manually add a placeholder to every line (this also applies to test data). Otherwise, xLearn will treat the first element as ``y``.
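The formats above are easy to parse by hand. A minimal sketch that reads one space-separated libsvm or libffm line into a label and a sparse feature map (a ``,`` separator would work the same way after splitting on commas):

```python
def parse_libsvm_line(line):
    # "y idx:val idx:val ..." -> (y, {idx: val})
    parts = line.split()
    label = float(parts[0])
    feats = {}
    for tok in parts[1:]:
        idx, val = tok.split(":")
        feats[int(idx)] = float(val)
    return label, feats

def parse_libffm_line(line):
    # "y field:idx:val ..." -> (y, {(field, idx): val})
    parts = line.split()
    label = float(parts[0])
    feats = {}
    for tok in parts[1:]:
        field, idx, val = tok.split(":")
        feats[(int(field), int(idx))] = float(val)
    return label, feats

label, feats = parse_libffm_line("1 0:0:0.2 1:2:0.3 2:5:0.1")
```

This also shows why LR and FM can consume libffm input: dropping the leading ``field`` from each token leaves a valid libsvm token.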
213 | 214 | LR 和 FM 算法的输入可以是 ``libffm`` 格式,xLearn 会忽略其中的 ``field`` 项并将其视为 ``libsvm`` 格式。 215 | 216 | 设置 Validation Dataset(验证集) 217 | ---------------------------------------- 218 | 219 | 在机器学习中,我们可以通过 Validation Dataset (验证集) 来进行超参数调优。在 xLearn 中,用户可以使用 ``-v`` 选项来指定验证集文件,例如: :: 220 | 221 | ./xlearn_train ./small_train.txt -v ./small_test.txt 222 | 223 | 下面是程序的一部分输出: :: 224 | 225 | [ ACTION ] Start to train ... 226 | [------------] Epoch Train log_loss Test log_loss Time cost (sec) 227 | [ 10% ] 1 0.571922 0.531160 0.00 228 | [ 20% ] 2 0.520315 0.542134 0.00 229 | [ 30% ] 3 0.492147 0.529684 0.00 230 | [ 40% ] 4 0.470234 0.538684 0.00 231 | [ 50% ] 5 0.452695 0.537496 0.00 232 | [ 60% ] 6 0.439367 0.537790 0.00 233 | [ 70% ] 7 0.425216 0.534396 0.00 234 | [ 80% ] 8 0.416215 0.542883 0.00 235 | [ 90% ] 9 0.404673 0.547597 0.00 236 | 237 | 我们可以看到,在这个任务中 ``Train log_loss`` 在不断的下降,而 ``Test log_loss`` (validation loss) 则是先下降,后上升。这代表当前我们训练的模型已经 overfit (过拟合)我们的训练数据。 238 | 239 | 在默认的情况下,xLearn 会在每一轮 epoch 结束后计算 validation loss 的数值,而用户可以使用 ``-x`` 选项来制定不同的评价指标。对于分类任务而言,评价指标有: ``acc`` (accuracy), ``prec`` (precision), ``f1``, 以及 ``auc``,例如: :: 240 | 241 | ./xlearn_train ./small_train.txt -v ./small_test.txt -x acc 242 | ./xlearn_train ./small_train.txt -v ./small_test.txt -x prec 243 | ./xlearn_train ./small_train.txt -v ./small_test.txt -x f1 244 | ./xlearn_train ./small_train.txt -v ./small_test.txt -x auc 245 | 246 | 对于回归任务而言,评价指标包括:``mae``, ``mape``, 以及 ``rmsd`` (或者叫作 ``rmse``),例如: :: 247 | 248 | cd demo/house_price/ 249 | ../../xlearn_train ./house_price_train.txt -s 3 -x rmse --cv 250 | ../../xlearn_train ./house_price_train.txt -s 3 -x rmsd --cv 251 | 252 | 注意,这里我们通过设置 ``--cv`` 选项使用了 *Cross-Validation (交叉验证)* 功能, 我们将在下一节详细介绍该功能。 253 | 254 | Cross-Validation (交叉验证) 255 | ---------------------------------------- 256 | 257 | 在机器学习中,Cross-Validation (交叉验证) 是一种被广泛使用的模型超参数调优技术。在 xLearn 中,用户可以使用 ``--cv`` 258 | 选项来使用交叉验证功能,例如: :: 259 | 260 | ./xlearn_train ./small_train.txt 
--cv 261 | 262 | 在默认的情况下,xLearn 使用 3-folds 交叉验证 (即将数据集平均分成 3 份),用户也可以通过 ``-f`` 选项来指定数据划分的份数,例如: :: 263 | 264 | ./xlearn_train ./small_train.txt -f 5 --cv 265 | 266 | 上述命令将数据集划分成为 5 份,并且 xLearn 会在最后计算出平均的 validation loss: :: 267 | 268 | ... 269 | [------------] Average log_loss: 0.549417 270 | [ ACTION ] Finish Cross-Validation 271 | [ ACTION ] Clear the xLearn environment ... 272 | [------------] Total time cost: 0.03 (sec) 273 | 274 | 选择优化算法 275 | ---------------------------------------- 276 | 277 | 在 xLearn 中,用户可以通过 ``-p`` 选项来选择使用不同的优化算法。目前,xLearn 支持 ``SGD``, ``AdaGrad``, 以及 ``FTRL`` 这三种优化算法。 278 | 在默认的情况下,xLearn 使用 ``AdaGrad`` 优化算法: :: 279 | 280 | ./xlearn_train ./small_train.txt -p sgd 281 | ./xlearn_train ./small_train.txt -p adagrad 282 | ./xlearn_train ./small_train.txt -p ftrl 283 | 284 | 相比于传统的 ``SGD`` (随机梯度下降) 算法,``AdaGrad`` 可以自适应的调整学习速率 learning rate,对于不常用的参数进行较大的更新,对于常用的参数进行较小的更新。 285 | 正因如此,``AdaGrad`` 算法常用于稀疏数据的优化问题上。除此之外,相比于 ``AdaGrad``,``SGD`` 对学习速率的大小更敏感,这增加了用户调参的难度。 286 | 287 | ``FTRL`` (Follow-the-Regularized-Leader) 同样被广泛应用于大规模稀疏数据的优化问题上。相比于 ``SGD`` 和 ``AdaGrad``, ``FTRL`` 需要用户调试更多的超参数,我们将在下一节详细介绍 xLearn 的超参数调优。 288 | 289 | 超参数调优 290 | ---------------------------------------- 291 | 292 | 在机器学习中,*hyper-parameter* (超参数) 是指在训练之前设置的参数,而模型参数是指在训练过程中更新的参数。超参数调优通常是机器学习训练过程中不可避免的一个环节。 293 | 294 | 首先,``learning rate`` (学习速率) 是机器学习中的一个非常重要的超参数,用来控制每次模型迭代时更新的步长。在默认的情况下,这个值在 xLearn 中被设置为 ``0.2``,用户可以通过 ``-r`` 选项来改变这个值: :: 295 | 296 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 297 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.5 298 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.01 299 | 300 | 用户还可以通过 ``-b`` 选项来控制 regularization (正则项)。xLearn 使用 ``L2`` 正则项,这个值被默认设置为 ``0.00002``: :: 301 | 302 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.001 303 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.002 304 | ./xlearn_train ./small_train.txt -v ./small_test.txt -r 0.1 -b 0.01 305 | 306 
| 对于 ``FTRL`` 算法来说,除了学习速率和正则项,我们还需要调节其他的超参数,包括:``-alpha``, ``-beta``, ``-lambda_1`` 和 ``-lambda_2``,例如: :: 307 | 308 | ./xlearn_train ./small_train.txt -p ftrl -alpha 0.002 -beta 0.8 -lambda_1 0.001 -lambda_2 1.0 309 | 310 | 对于 FM 和 FFM 模型,用户需要通过 ``-k`` 选项来设置 *latent vector* (隐向量) 的长度。在默认的情况下,xLearn 将其设置为 ``4``: :: 311 | 312 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 2 313 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 4 314 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 5 315 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -k 8 316 | 317 | 注意,xLearn 使用了 *SSE* 硬件指令来加速向量运算,该指令会同时进行向量长度为 ``4`` 的运算,因此 ``k=2`` 和 ``k=4`` 所需的运算时间是相同的。 318 | 319 | 除此之外,对于 FM 和 FFM,用户可以通过设置超参数 ``-u`` 来调节模型的初始化参数。在默认的情况下,这个值被设置为 ``0.66``: :: 320 | 321 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.80 322 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.40 323 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt -u 0.10 324 | 325 | 迭代次数 & Early-Stop (提前终止) 326 | ---------------------------------------- 327 | 328 | 在模型的训练过程中,每一个 epoch 都会遍历整个训练数据。在 xLearn 中,用户可以通过 ``-e`` 选项来设置需要的 epoch 数量: :: 329 | 330 | ./xlearn_train ./small_train.txt -e 3 331 | ./xlearn_train ./small_train.txt -e 5 332 | ./xlearn_train ./small_train.txt -e 10 333 | 334 | 如果用户设置了 validation dataset (验证集),xLearn 在默认情况下会在得到最好的 validation 结果时进行 early-stop (提前终止训练),例如: :: 335 | 336 | ./xlearn_train ./small_train.txt -s 2 -v ./small_test.txt -e 10 337 | 338 | 在上述命令中,我们设置 epoch 的大小为 ``10``,但是 xLearn 会在第 7 轮提前停止训练 (你可能在你的本地计算机上会得到不同的轮次): :: 339 | 340 | ... 341 | [ ACTION ] Early-stopping at epoch 7 342 | [ ACTION ] Start to save model ...
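与上面的命令行选项对应,epoch 数量与提前停止的相关设置在 Python API 中同样通过参数字典表达。以下是一个示意,假设参数名 ``epoch`` 与 ``stop_window`` 分别对应 ``-e`` 与 ``-sw``,具体参数名请以所用 xLearn 版本的 Python API 文档为准:

```python
# 'epoch' 对应命令行的 -e 选项;'stop_window' 对应 -sw 选项。
# 注意:这里的参数名为示意,请以实际版本的 Python API 文档为准。
param = {'task': 'binary',
         'lr': 0.2,
         'lambda': 0.002,
         'epoch': 10,        # 最多训练 10 个 epoch
         'stop_window': 3}   # 连续 3 轮验证结果无提升则提前终止

# early-stop 需要验证集,训练的调用方式与前文相同 (示意):
# ffm_model.setValidate("./small_test.txt")
# ffm_model.fit(param, './model.out')
print(sorted(param))
```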
343 | 344 | 用户可以通过 ``-sw`` 来设置提前停止机制的窗口大小。即,``-sw=2`` 意味着如果在后两轮的时间窗口之内都没有比当前更好的验证结果,则停止训练,并保存之前最好的模型: :: 345 | 346 | ./xlearn_train ./small_train.txt -e 10 -v ./small_test.txt -sw 3 347 | 348 | 用户可以通过 ``--dis-es`` 选项来禁止 early-stop: :: 349 | 350 | ./xlearn_train ./small_train.txt -s 2 -v ./small_test.txt -e 10 --dis-es 351 | 352 | 在上述命令中,xLearn 将进行完整的 10 轮 epoch 训练。 353 | 354 | 注意,在默认情况下,如果没有设置 metric,则 xLearn 会通过 test_loss 来选择最佳停止时机。如果设置了 metric,则 xLearn 通过 metric 的值来决定停止时机。 355 | 356 | 无锁 (Lock-free) 学习 357 | ---------------------------------------- 358 | 359 | 在默认情况下,xLearn 会进行 *Hogwild!* 无锁学习,该方法通过 CPU 多核进行并行训练,提高 CPU 利用率,加快算法收敛速度。但是,该无锁算法是非确定性的算法 (*non-deterministic*). 即,如果我们多次运行如下的命令,我们会在每一次运行得到略微不同的 loss 结果: :: 360 | 361 | ./xlearn_train ./small_train.txt 362 | 363 | The 1st time: 0.396352 364 | 365 | ./xlearn_train ./small_train.txt 366 | 367 | The 2nd time: 0.396119 368 | 369 | ./xlearn_train ./small_train.txt 370 | 371 | The 3rd time: 0.396187 372 | 373 | 用户可以通过 ``-nthread`` 选项来设置使用 CPU 核心的数量,例如: :: 374 | 375 | ./xlearn_train ./small_train.txt -nthread 2 376 | 377 | 上述命令指定使用 2 个 CPU Core 来进行模型训练。如果用户不设置该选项,xLearn 在默认情况下会使用全部的 CPU 核心进行计算。xLearn 会显示当前使用线程数量的情况: :: 378 | 379 | [------------] xLearn uses 2 threads for training task. 380 | [ ACTION ] Read Problem ... 381 | 382 | 用户可以通过设置 ``--dis-lock-free`` 选项禁止多核无锁学习: :: 383 | 384 | ./xlearn_train ./small_train.txt --dis-lock-free 385 | 386 | 这时,xLearn 计算的结果是确定性的 (*deterministic*): :: 387 | 388 | ./xlearn_train ./small_train.txt 389 | 390 | The 1st time: 0.396372 391 | 392 | ./xlearn_train ./small_train.txt 393 | 394 | The 2nd time: 0.396372 395 | 396 | ./xlearn_train ./small_train.txt 397 | 398 | The 3rd time: 0.396372 399 | 400 | 使用 ``--dis-lock-free`` 的缺点是训练速度会明显慢于无锁训练,因此我们建议仅在需要可复现结果时使用该选项,在大规模数据训练时保持默认的无锁训练。 401 | 402 | Instance-Wise 归一化 403 | ---------------------------------------- 404 | 405 | 对于 FM 和 FFM 来说,xLearn 会默认对特征进行 *Instance-Wise Normalization* (归一化).
在一些大规模稀疏数据的场景 (例如 CTR 预估), 这一技术非常有效,但是有些时候它也会影响模型的准确率。用户可以通过设置 ``--no-norm`` 来关掉该功能: :: 406 | 407 | ./xlearn_train ./small_train.txt -s 1 -v ./small_test.txt --no-norm 408 | 409 | 注意,如果在训练过程中使用了 Instance-Wise 归一化,用户需要在预测过程中同样使用该功能。 410 | 411 | Quiet Mode 安静模式 412 | ---------------------------------------- 413 | 414 | xLearn 的训练支持安静模式,在安静模式下,用户通过设置 ``--quiet`` 选项来使得 xLearn 的训练过程不会计算任何评价指标,这样可以很大程度上提高训练速度: :: 415 | 416 | ./xlearn_train ./small_train.txt -e 10 --quiet 417 | 418 | xLearn 还可以支持 Python API,我们将在下一节详细介绍。 419 | -------------------------------------------------------------------------------- /conf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # 3 | # xlearn_doc documentation build configuration file, created by 4 | # sphinx-quickstart on Sun Dec 3 18:43:51 2017. 5 | # 6 | # This file is execfile()d with the current directory set to its 7 | # containing dir. 8 | # 9 | # Note that not all possible configuration values are present in this 10 | # autogenerated file. 11 | # 12 | # All configuration values have a default; values that are commented out 13 | # serve to show the default. 14 | 15 | # If extensions (or modules to document with autodoc) are in another directory, 16 | # add these directories to sys.path here. If the directory is relative to the 17 | # documentation root, use os.path.abspath to make it absolute, like shown here. 18 | # 19 | # import os 20 | # import sys 21 | # sys.path.insert(0, os.path.abspath('.')) 22 | 23 | # -- General configuration ------------------------------------------------ 24 | 25 | # If your documentation needs a minimal Sphinx version, state it here. 26 | # 27 | # needs_sphinx = '1.0' 28 | 29 | # Add any Sphinx extension module names here, as strings. They can be 30 | # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom 31 | # ones.
32 | extensions = ['sphinx.ext.autodoc'] 33 | 34 | # Add any paths that contain templates here, relative to this directory. 35 | templates_path = ['_templates'] 36 | 37 | # The suffix(es) of source filenames. 38 | # You can specify multiple suffix as a list of string: 39 | # 40 | # source_suffix = ['.rst', '.md'] 41 | source_suffix = '.rst' 42 | 43 | # The master toctree document. 44 | master_doc = 'index' 45 | 46 | # General information about the project. 47 | project = u'xLearn' 48 | copyright = u'2017, Chao Ma' 49 | author = u'Chao Ma' 50 | 51 | # The version info for the project you're documenting, acts as replacement for 52 | # |version| and |release|, also used in various other places throughout the 53 | # built documents. 54 | # 55 | # The short X.Y version. 56 | version = u'0.4.0' 57 | # The full version, including alpha/beta/rc tags. 58 | release = u'0.4.0' 59 | 60 | # The language for content autogenerated by Sphinx. Refer to documentation 61 | # for a list of supported languages. 62 | # 63 | # This is also used if you do content translation via gettext catalogs. 64 | # Usually you set "language" from the command line for these cases. 65 | language = None 66 | 67 | # List of patterns, relative to source directory, that match files and 68 | # directories to ignore when looking for source files. 69 | # This patterns also effect to html_static_path and html_extra_path 70 | exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] 71 | 72 | # The name of the Pygments (syntax highlighting) style to use. 73 | pygments_style = 'sphinx' 74 | 75 | # If true, `todo` and `todoList` produce output, else they produce nothing. 76 | todo_include_todos = False 77 | 78 | 79 | # -- Options for HTML output ---------------------------------------------- 80 | 81 | # The theme to use for HTML and HTML Help pages. See the documentation for 82 | # a list of builtin themes. 
83 | # 84 | html_theme = 'sphinx_rtd_theme' 85 | 86 | # Theme options are theme-specific and customize the look and feel of a theme 87 | # further. For a list of options available for each theme, see the 88 | # documentation. 89 | # 90 | # html_theme_options = {} 91 | 92 | # Add any paths that contain custom static files (such as style sheets) here, 93 | # relative to this directory. They are copied after the builtin static files, 94 | # so a file named "default.css" will overwrite the builtin "default.css". 95 | html_static_path = ['_static'] 96 | 97 | # Custom sidebar templates, must be a dictionary that maps document names 98 | # to template names. 99 | # 100 | # This is required for the alabaster theme 101 | # refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars 102 | html_sidebars = { 103 | '**': [ 104 | 'relations.html', # needs 'show_related': True theme option to display 105 | 'searchbox.html', 106 | ] 107 | } 108 | 109 | 110 | # -- Options for HTMLHelp output ------------------------------------------ 111 | 112 | # Output file base name for HTML help builder. 113 | htmlhelp_basename = 'xlearn_docdoc' 114 | 115 | 116 | # -- Options for LaTeX output --------------------------------------------- 117 | 118 | latex_elements = { 119 | # The paper size ('letterpaper' or 'a4paper'). 120 | # 121 | # 'papersize': 'letterpaper', 122 | 123 | # The font size ('10pt', '11pt' or '12pt'). 124 | # 125 | # 'pointsize': '10pt', 126 | 127 | # Additional stuff for the LaTeX preamble. 128 | # 129 | # 'preamble': '', 130 | 131 | # Latex figure (float) alignment 132 | # 133 | # 'figure_align': 'htbp', 134 | } 135 | 136 | # Grouping the document tree into LaTeX files. List of tuples 137 | # (source start file, target name, title, 138 | # author, documentclass [howto, manual, or own class]). 
139 | latex_documents = [ 140 | (master_doc, 'xlearn_doc.tex', u'xlearn\\_doc Documentation', 141 | u'Chao Ma', 'manual'), 142 | ] 143 | 144 | # -- Options for manual page output --------------------------------------- 145 | 146 | # One entry per manual page. List of tuples 147 | # (source start file, name, description, authors, manual section). 148 | man_pages = [ 149 | (master_doc, 'xlearn_doc', u'xlearn_doc Documentation', 150 | [author], 1) 151 | ] 152 | 153 | # -- Options for Texinfo output ------------------------------------------- 154 | 155 | # Grouping the document tree into Texinfo files. List of tuples 156 | # (source start file, target name, title, author, 157 | # dir menu entry, description, category) 158 | texinfo_documents = [ 159 | (master_doc, 'xlearn_doc', u'xlearn_doc Documentation', 160 | author, 'xlearn_doc', 'One line description of project.', 161 | 'Miscellaneous'), 162 | ] -------------------------------------------------------------------------------- /demo/index.rst: -------------------------------------------------------------------------------- 1 | xLearn 样例程序 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | 注意:这里使用的所有数据集,其所有权均属于原作者。 5 | 6 | Criteo 在线广告预估 7 | --------------------------- 8 | 9 | Kaggle 预测广告是否会被用户点击 (`链接`__) 10 | 11 | Display advertising is a billion dollar effort and one of the central uses of machine learning on the Internet. 12 | However, its data and methods are usually kept under lock and key. In this research competition, CriteoLabs is 13 | sharing a week’s worth of data for you to develop models predicting ad click-through rate (CTR). Given a user 14 | and the page he is visiting, what is the probability that he will click on a given ad? 15 | 16 | 样例数据在: ``/demo/classification/criteo_ctr/``. 17 | 18 | The following code is the Python demo: 19 | 20 | ..
code-block:: python 21 | 22 | import xlearn as xl 23 | 24 | # Training task 25 | ffm_model = xl.create_ffm() # Use field-aware factorization machine 26 | ffm_model.setTrain("./small_train.txt") # Training data 27 | ffm_model.setValidate("./small_test.txt") # Validation data 28 | 29 | # param: 30 | # 0. binary classification 31 | # 1. learning rate: 0.2 32 | # 2. regular lambda: 0.002 33 | # 3. evaluation metric: accuracy 34 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric':'acc'} 35 | 36 | # Start to train 37 | # The trained model will be stored in model.out 38 | ffm_model.fit(param, './model.out') 39 | 40 | # Prediction task 41 | ffm_model.setTest("./small_test.txt") # Test data 42 | ffm_model.setSigmoid() # Convert output to 0-1 43 | 44 | # Start to predict 45 | # The output result will be stored in output.txt 46 | ffm_model.predict("./model.out", "./output.txt") 47 | 48 | 蘑菇分类 49 | --------------------------- 50 | 51 | 数据集来自 UCI Machine Learning Repository (`链接`__) 52 | 53 | This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in 54 | the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, 55 | or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly 56 | states that there is no simple rule for determining the edibility of a mushroom; no rule like *leaflets three, let it be* 57 | for Poisonous Oak and Ivy. 58 | 59 | 样例数据在: ``/demo/classification/mushroom/``. 60 | 61 | The following code is the Python demo: 62 | 63 | .. code-block:: python 64 | 65 | import xlearn as xl 66 | 67 | # Training task 68 | linear_model = xl.create_linear() # Use linear model 69 | linear_model.setTrain("./agaricus_train.txt") # Training data 70 | linear_model.setValidate("./agaricus_test.txt") # Validation data 71 | 72 | # param: 73 | # 0. binary classification 74 | # 1. learning rate: 0.2 75 | # 2.
lambda: 0.002 76 | # 3. evaluation metric: accuracy 77 | # 4. use sgd optimization method 78 | param = {'task':'binary', 'lr':0.2, 79 | 'lambda':0.002, 'metric':'acc', 80 | 'opt':'sgd'} 81 | 82 | # Start to train 83 | # The trained model will be stored in model.out 84 | linear_model.fit(param, './model.out') 85 | 86 | # Prediction task 87 | linear_model.setTest("./agaricus_test.txt") # Test data 88 | linear_model.setSigmoid() # Convert output to 0-1 89 | 90 | # Start to predict 91 | # The output result will be stored in output.txt 92 | linear_model.predict("./model.out", "./output.txt") 93 | 94 | 泰坦尼克生还预测 95 | ----------------------------- 96 | 97 | This challenge comes from Kaggle. In this challenge, we ask you to complete the analysis of what sorts of people 98 | were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers 99 | survived the tragedy. (`链接`__) 100 | 101 | You can find the data used in this demo in the path ``/demo/classification/titanic/``. 102 | 103 | The following code is the Python demo: 104 | 105 | .. code-block:: python 106 | 107 | import xlearn as xl 108 | 109 | # Training task 110 | fm_model = xl.create_fm() # Use factorization machine 111 | fm_model.setTrain("./titanic_train.txt") # Training data 112 | 113 | # param: 114 | # 0. Binary classification task 115 | # 1. learning rate: 0.2 116 | # 2. lambda: 0.002 117 | # 3. metric: accuracy 118 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric':'acc'} 119 | 120 | # Use cross-validation 121 | fm_model.cv(param) 122 | 123 | 房价预测 124 | ----------------------------- 125 | 126 | This demo shows how to use xLearn to solve the regression problem, and it comes from Kaggle. The Ames 127 | Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative 128 | for data scientists looking for a modernized and expanded version of the often cited Boston 129 | Housing dataset.
(`链接`__) 130 | 131 | 样例数据在: ``/demo/regression/house_price/``. 132 | 133 | The following code is the Python demo: 134 | 135 | .. code-block:: python 136 | 137 | import xlearn as xl 138 | 139 | # Training task 140 | fm_model = xl.create_fm() # Use factorization machine 141 | fm_model.setTrain("./house_price_train.txt") # Training data 142 | 143 | # param: 144 | # 0. Regression task 145 | # 1. learning rate: 0.2 146 | # 2. regular lambda: 0.002 147 | # 3. evaluation metric: rmse 148 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002, 'metric':'rmse'} 149 | 150 | # Use cross-validation 151 | fm_model.cv(param) 152 | 153 | More demos for xLearn are coming soon. 154 | 155 | .. __: https://www.kaggle.com/c/criteo-display-ad-challenge 156 | .. __: https://archive.ics.uci.edu/ml/datasets/Mushroom 157 | .. __: https://www.kaggle.com/c/titanic 158 | .. __: https://www.kaggle.com/c/house-prices-advanced-regression-techniques 159 | -------------------------------------------------------------------------------- /images/out-of-core.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aksnzhy/xlearn_doc_cn/eaa11f7ca9e54c3815850696802d1910ee72bc1a/images/out-of-core.png -------------------------------------------------------------------------------- /images/ps.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aksnzhy/xlearn_doc_cn/eaa11f7ca9e54c3815850696802d1910ee72bc1a/images/ps.png -------------------------------------------------------------------------------- /images/speed.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aksnzhy/xlearn_doc_cn/eaa11f7ca9e54c3815850696802d1910ee72bc1a/images/speed.png -------------------------------------------------------------------------------- /index.rst: -------------------------------------------------------------------------------- 1 | ..
xlearn_doc documentation master file, created by 2 | sphinx-quickstart on Sun Dec 3 18:43:51 2017. 3 | You can adapt this file completely to your liking, but it should at least 4 | contain the root `toctree` directive. 5 | 6 | 欢迎使用 xLearn ! 7 | =============================== 8 | 9 | xLearn 是一款高性能的,易用的,并且可扩展的机器学习算法库,你可以用它来解决大规模机器学习问题,尤其是大规模稀疏数据机器学习问题。在近年来,大规模稀疏数据机器学习算法被广泛应用在各种领域,例如广告点击率预测、推荐系统等。如果你是 liblinear、libfm、libffm 的用户,那么现在 xLearn 将会是你更好的选择,因为 xLearn 几乎囊括了这些系统的全部功能,并且具有更好的性能,易用性,以及可扩展性。 10 | 11 | .. image:: ./images/speed.png 12 | :width: 650 13 | 14 | 快速开始 15 | ---------------------------------- 16 | 17 | 我们接下来展示如何在一个小型数据样例 (Criteo 广告点击预测数据) 上使用 xLearn 来解决二分类问题。在这个问题里,机器学习算法需要判断当前用户是否会点击给定的广告。 18 | 19 | 安装 xLearn 20 | ^^^^^^^^^^^^^ 21 | 22 | xLearn 最简单的安装方法是使用 ``pip`` 安装工具. 下面的命令会下载 xLearn 的源代码,并且在用户的本地机器上进行编译和安装。 :: 23 | 24 | sudo pip install xlearn 25 | 26 | 上述安装过程可能会持续一段时间,请耐心等候。安装完成后,用户可以使用下面的代码来检测 xLearn 是否安装成功。 :: 27 | 28 | >>> import xlearn as xl 29 | >>> xl.hello() 30 | 31 | 如果安装成功,用户会看到如下显示: :: 32 | 33 | ------------------------------------------------------------------------- 34 | _ 35 | | | 36 | __ _| | ___ __ _ _ __ _ __ 37 | \ \/ / | / _ \/ _` | '__| '_ \ 38 | > <| |___| __/ (_| | | | | | | 39 | /_/\_\_____/\___|\__,_|_| |_| |_| 40 | 41 | xLearn -- 0.44 Version -- 42 | ------------------------------------------------------------------------- 43 | 44 | 如果你在安装的过程中遇到了任何问题,或者你希望自己通过在 `Github`__ 上最新的源代码进行手动编译,或者你想使用 xLearn 的命令行接口,你可以从这里 (`Installation Guide`__) 查看如何对 xLearn 进行从源码的手动编译和安装。 45 | 46 | .. __: https://github.com/aksnzhy/xlearn 47 | .. __: ./install/index.html 48 | 49 | Python 样例 50 | ^^^^^^^^^^^^^^ 51 | 52 | 下面的 Python 代码展示了如何使用 xLearn 的 *FFM* 算法来处理机器学习二分类任务: 53 | 54 | .. 
code-block:: python 55 | 56 | import xlearn as xl 57 | 58 | # Training task 59 | ffm_model = xl.create_ffm() # Use field-aware factorization machine (ffm) 60 | ffm_model.setTrain("./small_train.txt") # Set the path of training dataset 61 | ffm_model.setValidate("./small_test.txt") # Set the path of validation dataset 62 | 63 | # Parameters: 64 | # 0. task: binary classification 65 | # 1. learning rate: 0.2 66 | # 2. regular lambda: 0.002 67 | # 3. evaluation metric: accuracy 68 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric':'acc'} 69 | 70 | # Start to train 71 | # The trained model will be stored in model.out 72 | ffm_model.fit(param, './model.out') 73 | 74 | # Prediction task 75 | ffm_model.setTest("./small_test.txt") # Set the path of test dataset 76 | ffm_model.setSigmoid() # Convert output to 0-1 77 | 78 | # Start to predict 79 | # The output result will be stored in output.txt 80 | ffm_model.predict("./model.out", "./output.txt") 81 | 82 | 上述样例通过使用 *field-aware factorization machines (FFM)* 来解决一个简单的二分类任务。用户可以在 ``demo/classification/criteo_ctr`` 83 | 路径下找到我们所使用的样例数据 (``small_train.txt`` 和 ``small_test.txt``). 84 | 85 | 其他资源链接 86 | ---------------------------------------- 87 | 88 | .. toctree:: 89 | :glob: 90 | :maxdepth: 1 91 | 92 | self 93 | install/index 94 | cli_api/index 95 | python_api/index 96 | R_api/index 97 | tune/index 98 | all_api/index 99 | large/index 100 | demo/index 101 | tutorial/index -------------------------------------------------------------------------------- /install/index.rst: -------------------------------------------------------------------------------- 1 | 详细安装指南 2 | ---------------------------------- 3 | 4 | 目前 xLearn 可以支持 Linux, Mac OS X 以及 Windows 平台. 在 Windows 平台安装 xLearn 请参考 `link`__ . 这一节主要介绍了如何在 Linux 和 Mac OSX 平台通过 ``pip`` 工具安装 xLearn,并且详细介绍了如何通过源码手动编译并安装 xLearn. 无论你使用哪种方法安装 xLearn,请确保你的机器上已经安装了支持 C++11 的编译器,例如 ``GCC`` 或者 ``Clang``. 除此之外,用户还需要提前安装好 ``CMake`` 编译工具. 5 | 6 | .. 
__: ./install_windows.html 7 | 8 | 安装 GCC 或 Clang 9 | ^^^^^^^^^^^^^^^^^^^^^^^^ 10 | 11 | *如果你已经安装了支持 C++ 11 的编译器,请忽略此节内容。* 12 | 13 | * 在 Cygwin 上, 运行 ``setup.exe`` 并安装 ``gcc`` 和 ``binutils``. 14 | * 在 Debian/Ubuntu Linux 上, 输入如下命令: :: 15 | 16 | sudo apt-get install gcc binutils 17 | 18 | 安装 GCC (或者 Clang) :: 19 | 20 | sudo apt-get install clang 21 | 22 | * 在 FreeBSD 上, 输入以下命令安装 Clang: :: 23 | 24 | sudo pkg_add -r clang 25 | 26 | * 在 Mac OS X, 安装 ``XCode`` 来获得 Clang. 27 | 28 | 29 | 安装 CMake 30 | ^^^^^^^^^^^^^^^^^^^^^^^^ 31 | 32 | *如果你已经安装了 CMake,请忽略此节内容。* 33 | 34 | * 在 Cygwin 上, 运行 ``setup.exe`` 并安装 cmake. 35 | * 在 Debian/Ubuntu Linux 上, 输入以下命令安装 cmake: :: 36 | 37 | sudo apt-get install cmake 38 | 39 | * 在 FreeBSD 上, 输入以下命令: :: 40 | 41 | sudo pkg_add -r cmake 42 | 43 | 在 Mac OS X, 如果你安装了 ``homebrew``, 输入以下命令: :: 44 | 45 | brew install cmake 46 | 47 | 或者你安装了 ``MacPorts``, 输入以下命令: :: 48 | 49 | sudo port install cmake 50 | 51 | 从源码安装 xLearn 52 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 53 | 54 | 从源码安装 xLearn 分为两个步骤: 55 | 56 | 首先,我们需要编译 xLearn 得到 ``xlearn_train`` 和 ``xlearn_predict`` 这两个可执行文件。除此之外,我们还需要得到 ``libxlearn_api.so`` (Linux 平台) 和 ``libxlearn_api.dylib`` (Mac OS X 平台) 这两个动态链接库 (用来进行 Python 调用)。随后,用户可以安装 xLearn Python Package. 
57 | 58 | 编译 xLearn 59 | =========== 60 | 61 | 用户从 Github 上 clone 下 xLearn 源代码: :: 62 | 63 | git clone https://github.com/aksnzhy/xlearn.git 64 | 65 | cd xlearn 66 | mkdir build 67 | cd build 68 | cmake ../ 69 | make 70 | 71 | 如果编译成功,用户将在 build 文件夹下看到 ``xlearn_train`` 和 ``xlearn_predict`` 这两个可执行文件。用户可以通过如下命令检查 xLearn 是否安装成功: :: 72 | 73 | ./run_example.sh 74 | 75 | 安装 Python Package 76 | ==================== 77 | 78 | 之后,你就可以通过 ``install-python.sh`` 脚本来安装 xLearn Python 包: :: 79 | 80 | cd python-package 81 | sudo ./install-python.sh 82 | 83 | 用户可以通过如下命令检测 xLearn Python 库是否安装成功: :: 84 | 85 | cd ../ 86 | python run_demo_ctr.py 87 | 88 | 一键安装脚本 89 | ============ 90 | 91 | 我们已经写好了一个脚本 ``build.sh`` 来帮助用户做上述所有的安装工作。 92 | 93 | 用户只需要从 Github 上 clone 下 xLearn 源代码: :: 94 | 95 | git clone https://github.com/aksnzhy/xlearn.git 96 | 97 | 然后通过以下命令进行编译和安装: :: 98 | 99 | cd xlearn 100 | sudo ./build.sh 101 | 102 | 在安装过程中用户可能会被要求输入管理员账户密码。 103 | 104 | 通过 pip 安装 xLearn 105 | ^^^^^^^^^^^^^^^^^^^^^^^^ 106 | 107 | 安装 xLearn 最简单的方法是使用 ``pip`` 安装工具. 如下命令会下载 xLearn 源代码,并在你的本地计算机进行编译和安装工作,该方法使用的前提是你已经安装了 xLearn 所需的开发环境,例如 C++11 和 CMake: :: 108 | 109 | sudo pip install xlearn 110 | 111 | 上述安装过程可能会持续一段时间,请耐心等候。安装完成后,用户可以使用下面的代码来检测 xLearn 是否安装成功: :: 112 | 113 | >>> import xlearn as xl 114 | >>> xl.hello() 115 | 116 | 如果安装成功,你会看到如下显示: :: 117 | 118 | ------------------------------------------------------------------------- 119 | _ 120 | | | 121 | __ _| | ___ __ _ _ __ _ __ 122 | \ \/ / | / _ \/ _` | '__| '_ \ 123 | > <| |___| __/ (_| | | | | | | 124 | /_/\_\_____/\___|\__,_|_| |_| |_| 125 | 126 | xLearn -- 0.44 Version -- 127 | ------------------------------------------------------------------------- 128 | 129 | 安装 R 库 130 | ^^^^^^^^^^^^^^^^^^^^^^^^ 131 | 132 | The R package installation guide is coming soon. 
133 | -------------------------------------------------------------------------------- /install/install_windows.rst: -------------------------------------------------------------------------------- 1 | Windows 安装指南 2 | ---------------------------------- 3 | 4 | xLearn 支持 Windows 平台的安装和使用。本小节主要介绍如何在 Windows 平台安装并使用 xLearn 库。 5 | 6 | 安装 Visual Studio 2017 7 | ^^^^^^^^^^^^^^^^^^^^^^^^ 8 | 9 | *如果你的 Windows 系统已经安装过 Visual studio,你可以跳过这一步。* 10 | 11 | 从 https://visualstudio.microsoft.com/downloads/ 下载你所需要的 Visual studio (``vs_xxxx_xxxx.exe``)。之后,你可以通过 VS 的安装说明 (https://docs.microsoft.com/en-us/visualstudio/install/install-visual-studio?view=vs-2017.)进行安装。 12 | 13 | 安装 CMake 14 | ^^^^^^^^^^^^^^^^^^^^^^^^ 15 | 16 | *如果你的系统已经安装了 CMake,你可以跳过这一步* 17 | 18 | 从这里 https://cmake.org/download/ 下载最新版本 (至少 v3.10) CMake。请确保安装 CMake 后将其路径正确添加到你的系统路径。 19 | 20 | 从源码安装 xLearn 21 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 22 | 23 | 从源码安装 xLearn 包括了两个步骤: 24 | 25 | 首先你需要编译源码得到两个可执行文件:``xlearn_train.exe`` 和 ``xlearn_predict.exe``,并且得到动态链接库 ``xlearn_api.dll``。 之后,需要安装 xLearn Python 包。 26 | 27 | 编译源代码 28 | ======================= 29 | 30 | 用户进入 DOS 控制台,输入命令: :: 31 | 32 | git clone https://github.com/aksnzhy/xlearn.git 33 | 34 | cd xlearn 35 | mkdir build 36 | cd build 37 | cmake -G "Visual Studio 15 Win64" ../ 38 | "C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" x64 39 | MSBuild xLearn.sln /p:Configuration=Release 40 | 41 | **注意:** 你需要将路径 ``"C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\VC\Auxiliary\Build\vcvarsall.bat"`` 42 | 替换成你自己的 VS 安装路径. 43 | 44 | 例如,默认情况下 VS 的路径为 ``"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat"``. 
45 | 46 | 如果安装成功, 用户可以在 ``build\Release`` 路径下看到 ``xlearn_train.exe`` 和 ``xlearn_predict.exe`` 两个可执行文件。 47 | 48 | 用户可以通过如下命令进行测试: :: 49 | 50 | run_example.bat 51 | 52 | 从 Visual Studio 解决方案编译源码 53 | ======================= 54 | 这个编译方法是上面“编译源代码”方法的一个备用选择,如果你已经使用上面的方法进行编译,你可以跳过这个部分。 55 | 56 | 我们为用户提供了 Visual Studio 解决方案,这些文件在 xLearn 项目根目录的 windows 目录下面,用户可以直接使用 ``xLearn.sln`` 编译源代码。 57 | 58 | 这个解决方案包括三个项目:``xlearn_train``、``xlearn_test``、``xlearn_api``,分别对应产生 xLearn 的训练、预测可执行文件,以及 Windows 动态链接库 (DLL) API。 59 | 60 | 用户需要保证所使用的 VS 工具平台版本在 v141 及以上。 61 | 62 | **注意:** 从这个解决方案编译得到的可执行文件和动态链接库,会和使用 cmake 构建、编译得到的有所不同,这是因为二者的构建配置不相同。 63 | 64 | 安装 Python 包 65 | ======================= 66 | 67 | 用户可以通过如下命令安装 Python 包: :: 68 | 69 | cd python-package 70 | python setup.py install 71 | 72 | 然后通过如下命令对安装进行测试: :: 73 | 74 | cd ../ 75 | python test_python.py 76 | 77 | 一键安装 78 | ======================= 79 | 80 | 用户可以通过 ``build.bat`` 脚本来对 xLearn 进行一键安装: :: 81 | 82 | git clone https://github.com/aksnzhy/xlearn.git 83 | 84 | cd xlearn 85 | build.bat 86 | 87 | 从 pip 安装 88 | ^^^^^^^^^^^^^^^^^^^^^^^^ 89 | 90 | 我们现在提供了 Windows 平台下的二进制 Python 包,它支持 64 位 Python 的以下版本:``2.7, 3.4, 3.5, 3.6, 3.7``。 91 | 92 | 用户可以从 release_ 栏 (xLearn 项目主页) 下载,然后用 ``pip`` 命令安装下载下来的后缀为 ``.whl`` 的二进制安装包文件。 93 | 94 | ..
_release: https://github.com/aksnzhy/xlearn/releases 96 | 97 | 98 | 用户可以通过如下命令检查 xLearn 是否安装成功: :: 99 | 100 | >>> import xlearn as xl 101 | >>> xl.hello() 102 | 103 | 如果安装成功,你可以看到: :: 104 | 105 | ------------------------------------------------------------------------- 106 | _ 107 | | | 108 | __ _| | ___ __ _ _ __ _ __ 109 | \ \/ / | / _ \/ _` | '__| '_ \ 110 | > <| |___| __/ (_| | | | | | | 111 | /_/\_\_____/\___|\__,_|_| |_| |_| 112 | 113 | xLearn -- 0.44 Version -- 114 | ------------------------------------------------------------------------- 115 | -------------------------------------------------------------------------------- /large/index.rst: -------------------------------------------------------------------------------- 1 | xLearn 大规模机器学习 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | 我们在这一节里主要展示如何使用 xLearn 来处理大规模机器学习问题。近年来,快速增长的海量数据为机器学习任务带来了挑战。例如,我们的数据集可能会有数千亿条训练样本,这些数据是不可能被存放在单台计算机的内存中的。正因如此,我们在设计 xLearn 时专门考虑了如何解决大规模数据的机器学习训练问题。首先,xLearn 支持外存计算,通过利用单台计算机的磁盘来处理 TB 量级的数据训练任务。此外,xLearn 可以通过基于参数服务器的分布式架构来进行多机分布式训练。 5 | 6 | 外存计算 7 | -------------------------------- 8 | 9 | 外存计算适用于那些数据量过大不能被内存装下,但是可以被磁盘等外部存储设备装下的情况。通常情况下,单台机器的内存容量从几个 GB 到几百个 GB 不等。然而,当前的服务器外存容量通常可以很容易达到几个 TB. 外存计算的核心是通过 mini-batch 的方法,在每一次的计算时只读取一小部分数据进入内存,增量式地学习所有的训练数据。外存计算需要用户设定合适的 mini-batch-size. 10 | 11 | ..
image:: ../images/out-of-core.png 12 | :width: 500 13 | 14 | 命令行接口 15 | =================================================== 16 | 17 | 在 xLearn 中,用户可以通过设置 ``--disk`` 选项来进行外存计算。例如: :: 18 | 19 | ./xlearn_train ./big_data.txt -s 2 --disk 20 | 21 | Epoch Train log_loss Time cost (sec) 22 | 1 0.483997 4.41 23 | 2 0.466553 4.56 24 | 3 0.458234 4.88 25 | 4 0.451463 4.77 26 | 5 0.445169 4.79 27 | 6 0.438834 4.71 28 | 7 0.432173 4.84 29 | 8 0.424904 4.91 30 | 9 0.416855 5.03 31 | 10 0.407846 4.53 32 | 33 | 在上述示例中,xLearn 需要花费将近 ``4.5`` 秒进行每一个 epoch 的训练任务。如果我们取消 ``--disk`` 选项,xLearn 的训练速度会变快: :: 34 | 35 | ./xlearn_train ./big_data.txt -s 2 36 | 37 | Epoch Train log_loss Time cost (sec) 38 | 1 0.484022 1.65 39 | 2 0.466452 1.64 40 | 3 0.458112 1.64 41 | 4 0.451371 1.76 42 | 5 0.445040 1.83 43 | 6 0.438680 1.92 44 | 7 0.432007 1.99 45 | 8 0.424695 1.95 46 | 9 0.416579 1.96 47 | 10 0.407518 2.11 48 | 49 | 这一次,每一个 epoch 的训练时间变成了 ``1.8`` 秒。我们还可以通过 ``-block`` 选项来设置外存计算的内存 block 大小 (MB)。 50 | 51 | 用户同样可以在预测任务中使用 ``--disk`` 选项,例如: :: 52 | 53 | ./xlearn_predict ./big_data_test.txt ./big_data.txt.model --disk 54 | 55 | Python 接口 56 | =================================================== 57 | 58 | 在 Python 中,用户可以通过 ``setOnDisk()`` API 来使用外存计算,例如: :: 59 | 60 | import xlearn as xl 61 | 62 | # Training task 63 | ffm_model = xl.create_ffm() # Use field-aware factorization machine 64 | 65 | # On-disk training 66 | ffm_model.setOnDisk() 67 | 68 | ffm_model.setTrain("./small_train.txt") # Training data 69 | ffm_model.setValidate("./small_test.txt") # Validation data 70 | 71 | # param: 72 | # 0. binary classification 73 | # 1. learning rate: 0.2 74 | # 2. regular lambda: 0.002 75 | # 3. 
evaluation metric: accuracy 76 | param = {'task':'binary', 'lr':0.2, 77 | 'lambda':0.002, 'metric':'acc'} 78 | 79 | # Start to train 80 | # The trained model will be stored in model.out 81 | ffm_model.fit(param, './model.out') 82 | 83 | # Prediction task 84 | ffm_model.setTest("./small_test.txt") # Test data 85 | ffm_model.setSigmoid() # Convert output to 0-1 86 | 87 | # Start to predict 88 | # The output result will be stored in output.txt 89 | ffm_model.predict("./model.out", "./output.txt") 90 | 91 | 在命令行中,用户还可以通过 ``-block`` 选项 (Python 中对应参数 ``block_size``) 来设置外存计算的内存 block 大小 (MB),例如: :: 92 | 93 | ./xlearn_train ./big_data.txt -s 2 -block 1000 --disk 94 | 95 | 如上所示, 我们将 block size 设置为 ``1000MB``. 在默认的情况下, 这个值会被设置为 ``500``. 96 | 97 | R 接口 98 | =================================================== 99 | 100 | The R guide is coming soon. 101 | 102 | 分布式计算 (参数服务器架构) 103 | -------------------------------- 104 | 105 | 面对海量数据,很多情况下我们无法通过一台机器就完成机器学习的训练任务。例如大规模 CTR 任务,用户可能需要处理千亿级别的训练样本和十亿级别的模型参数,这些都是一台计算机的内存无法装下的。对于这样的挑战,我们需要采用多机分布式训练。 106 | 107 | *Parameter Server* (参数服务器) 是近几年提出并被广泛应用的一种分布式机器学习架构,专门针对 “大数据” 和 “大模型” 带来的挑战。在这个架构下,训练数据和计算任务被划分到多台 worker 节点之上,而 Server 节点负责存储机器学习模型的参数 (所以叫作参数服务器)。下图展示了一个参数服务器的工作流程。 108 | 109 | .. image:: ../images/ps.png 110 | :width: 500 111 | 112 | 如图所示,一个标准的参数服务器系统提供给用户两个简洁的 API: *Push* 和 *Pull*. 113 | 114 | *Push*: 向参数服务器发送 key-value pairs. 以分布式梯度下降为例,worker 节点会计算本地的梯度 (gradient) 并将其发送给参数服务器。由于数据的稀疏性,只有一小部分数据不为 0. 我们通常会发送一个 (key, value) 的向量给参数服务器,其中 key 是参数的标记位,value 是梯度的数值。 115 | 116 | *Pull*: 通过发送 key 的列表从参数服务器请求更新后的模型参数。在大规模机器学习下,模型的大小通常无法被存放在一台机器中,所以 *pull* 接口只会请求那些当前计算需要的模型参数,而并不会将整个模型请求下来。 117 | 118 | The distributed training guide for xLearn is coming soon.
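在分布式训练指南发布之前,可以用下面这个极简的 Python 玩具示例来理解 *Push* / *Pull* 的语义。注意:这是纯教学示意代码,并非 xLearn 的实际分布式实现,其中的类名与更新规则均为假设:

```python
class ToyParamServer:
    """极简参数服务器示意:仅演示 Push/Pull 语义,并非 xLearn 的实现。"""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.weights = {}  # key -> 模型参数;未出现过的 key 视为 0

    def push(self, grads):
        """worker 推送稀疏梯度 (key, value),server 端执行梯度下降更新。"""
        for key, g in grads.items():
            self.weights[key] = self.weights.get(key, 0.0) - self.lr * g

    def pull(self, keys):
        """worker 只请求当前计算需要的那部分参数,而非整个模型。"""
        return {k: self.weights.get(k, 0.0) for k in keys}

server = ToyParamServer(lr=0.5)
# 稀疏数据下,worker 的本地梯度只有少数 key 非零
server.push({'w_3': 0.5, 'w_7': -0.5})
print(server.pull(['w_3', 'w_7']))  # {'w_3': -0.25, 'w_7': 0.25}
```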
119 | -------------------------------------------------------------------------------- /make.bat: -------------------------------------------------------------------------------- 1 | @ECHO OFF 2 | 3 | pushd %~dp0 4 | 5 | REM Command file for Sphinx documentation 6 | 7 | if "%SPHINXBUILD%" == "" ( 8 | set SPHINXBUILD=sphinx-build 9 | ) 10 | set SOURCEDIR=. 11 | set BUILDDIR=_build 12 | set SPHINXPROJ=xlearn_doc 13 | 14 | if "%1" == "" goto help 15 | 16 | %SPHINXBUILD% >NUL 2>NUL 17 | if errorlevel 9009 ( 18 | echo. 19 | echo.The 'sphinx-build' command was not found. Make sure you have Sphinx 20 | echo.installed, then set the SPHINXBUILD environment variable to point 21 | echo.to the full path of the 'sphinx-build' executable. Alternatively you 22 | echo.may add the Sphinx directory to PATH. 23 | echo. 24 | echo.If you don't have Sphinx installed, grab it from 25 | echo.http://sphinx-doc.org/ 26 | exit /b 1 27 | ) 28 | 29 | %SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 30 | goto end 31 | 32 | :help 33 | %SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% 34 | 35 | :end 36 | popd 37 | -------------------------------------------------------------------------------- /python_api/index.rst: -------------------------------------------------------------------------------- 1 | xLearn Python API 使用指南 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | xLearn 支持简单易用的 Python 接口。在使用之前,请确保你已经成功安装了 xLearn Python Package. 
You can open a Python shell and type the following code to check whether the xLearn Python package is installed correctly: :: 5 | 6 | >>> import xlearn as xl 7 | >>> xl.hello() 8 | 9 | If the installation succeeded, you will see: :: 10 | 11 | ------------------------------------------------------------------------- 12 | _ 13 | | | 14 | __ _| | ___ __ _ _ __ _ __ 15 | \ \/ / | / _ \/ _` | '__| '_ \ 16 | > <| |___| __/ (_| | | | | | | 17 | /_/\_\_____/\___|\__,_|_| |_| |_| 18 | 19 | xLearn -- 0.44 Version -- 20 | ------------------------------------------------------------------------- 21 | 22 | Quick Start 23 | ---------------------------------------- 24 | 25 | The following code shows how to use the xLearn Python API. You can find the sample data (``small_train.txt`` and ``small_test.txt``) under the ``demo/classification/criteo_ctr`` directory: 26 | 27 | .. code-block:: python 28 | 29 | import xlearn as xl 30 | 31 | # Training task 32 | ffm_model = xl.create_ffm() # Use field-aware factorization machine (ffm) 33 | ffm_model.setTrain("./small_train.txt") # Set the path of training data 34 | 35 | # parameter: 36 | # 0. task: binary classification 37 | # 1. learning rate : 0.2 38 | # 2. regularization lambda : 0.002 39 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 40 | 41 | # Train model 42 | ffm_model.fit(param, "./model.out") 43 | 44 | Part of the xLearn output is shown below: :: 45 | 46 | ... 47 | [ ACTION ] Start to train ... 48 | [------------] Epoch Train log_loss Time cost (sec) 49 | [ 10% ] 1 0.595881 0.00 50 | [ 20% ] 2 0.538845 0.00 51 | [ 30% ] 3 0.520051 0.00 52 | [ 40% ] 4 0.504366 0.00 53 | [ 50% ] 5 0.492811 0.00 54 | [ 60% ] 6 0.483286 0.00 55 | [ 70% ] 7 0.472567 0.00 56 | [ 80% ] 8 0.465035 0.00 57 | [ 90% ] 9 0.457047 0.00 58 | [ 100% ] 10 0.448725 0.00 59 | [ ACTION ] Start to save model ...
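The ``log_loss`` column above is the standard binary cross-entropy averaged over the training samples. A minimal sketch in plain Python (the function name and clipping constant are illustrative, not xLearn's internal implementation):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    # Average binary cross-entropy; probabilities are clipped away from
    # 0 and 1 to avoid log(0).
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(log_loss([1, 0, 1], [0.9, 0.2, 0.8]))  # ~= 0.1839
```

A confident, correct prediction (probability near the true label) contributes almost nothing; a confident wrong one dominates the average, which is why the metric falls steadily as the model fits the data.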
60 | 61 | In the example above, xLearn uses *field-aware factorization machines (FFM)* to solve a binary classification problem. To solve a regression problem instead, set the ``task`` parameter to ``reg``: :: 62 | 63 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002} 64 | 65 | After training, xLearn creates a new file named ``model.out`` in the current directory. This file stores the trained model, and we can use it later for prediction: :: 66 | 67 | ffm_model.setTest("./small_test.txt") 68 | ffm_model.predict("./model.out", "./output.txt") 69 | 70 | After running the commands above, we get a new file ``output.txt`` in the current directory, which contains the output of the prediction task. We can show its first few lines with: :: 71 | 72 | head -n 5 ./output.txt 73 | 74 | -1.58631 75 | -0.393496 76 | -0.638334 77 | -0.38465 78 | -1.15343 79 | 80 | Each score corresponds to one example in the test data. A negative score predicts a negative example, and a positive score predicts a positive one (there are none in this example). In xLearn, the ``setSigmoid()`` API converts the scores to the (0, 1) range: :: 81 | 82 | ffm_model.setSigmoid() 83 | ffm_model.setTest("./small_test.txt") 84 | ffm_model.predict("./model.out", "./output.txt") 85 | 86 | The result looks like: :: 87 | 88 | head -n 5 ./output.txt 89 | 90 | 0.174698 91 | 0.413642 92 | 0.353551 93 | 0.414588 94 | 0.250373 95 | 96 | You can also use the ``setSign()`` API to convert the predictions to 0 or 1: :: 97 | 98 | ffm_model.setSign() 99 | ffm_model.setTest("./small_test.txt") 100 | ffm_model.predict("./model.out", "./output.txt") 101 | 102 | The result looks like: :: 103 | 104 | head -n 5 ./output.txt 105 | 106 | 0 107 | 0 108 | 0 109 | 0 110 | 0 111 | 112 | Model Output 113 | ---------------------------------------- 114 | 115 | The ``setTXTModel()`` API saves the model in a human-readable ``TXT`` format, for example: :: 116 | 117 | ffm_model.setTXTModel("./model.txt") 118 | ffm_model.fit(param, "./model.out") 119 | 120 | After running the commands above, a new file ``model.txt`` appears in the current directory, storing the model in ``TXT`` format: :: 121 | 122 | head -n 5 ./model.txt 123 | 124 | -1.041 125 | 0.31609 126 | 0 127 | 0 128 | 0 129 | 130 | For the linear model, the TXT output stores one model parameter per line. For FM and FFM, the model stores one latent vector per line. 131 | 132 | Linear: :: 133 | 134 | bias: 0 135 | i_0: 0 136 | i_1: 0 137 | i_2: 0 138 | i_3: 0 139 | 140 | FM: :: 141 | 142 | bias: 0 143 | i_0: 0 144 | i_1: 0 145 | i_2: 0 146 | i_3: 0 147 | v_0: 5.61937e-06 0.0212581 0.150338
0.222903 148 | v_1: 0.241989 0.0474224 0.128744 0.0995021 149 | v_2: 0.0657265 0.185878 0.0223869 0.140097 150 | v_3: 0.145557 0.202392 0.14798 0.127928 151 | 152 | FFM: :: 153 | 154 | bias: 0 155 | i_0: 0 156 | i_1: 0 157 | i_2: 0 158 | i_3: 0 159 | v_0_0: 5.61937e-06 0.0212581 0.150338 0.222903 160 | v_0_1: 0.241989 0.0474224 0.128744 0.0995021 161 | v_0_2: 0.0657265 0.185878 0.0223869 0.140097 162 | v_0_3: 0.145557 0.202392 0.14798 0.127928 163 | v_1_0: 0.219158 0.248771 0.181553 0.241653 164 | v_1_1: 0.0742756 0.106513 0.224874 0.16325 165 | v_1_2: 0.225384 0.240383 0.0411782 0.214497 166 | v_1_3: 0.226711 0.0735065 0.234061 0.103661 167 | v_2_0: 0.0771142 0.128723 0.0988574 0.197446 168 | v_2_1: 0.172285 0.136068 0.148102 0.0234075 169 | v_2_2: 0.152371 0.108065 0.149887 0.211232 170 | v_2_3: 0.123096 0.193212 0.0179155 0.0479647 171 | v_3_0: 0.055902 0.195092 0.0209918 0.0453358 172 | v_3_1: 0.154174 0.144785 0.184828 0.0785329 173 | v_3_2: 0.109711 0.102996 0.227222 0.248076 174 | v_3_3: 0.144264 0.0409806 0.17463 0.083712 175 | 176 | Online Learning 177 | ---------------------------------------- 178 | xLearn supports online learning, i.e., it can load a previously trained model and continue training. The ``setPreModel()`` API specifies the path of the pre-trained model file. For example: :: 179 | 180 | import xlearn as xl 181 | 182 | ffm_model = xl.create_ffm() 183 | ffm_model.setTrain("./small_train.txt") 184 | ffm_model.setValidate("./small_test.txt") 185 | ffm_model.setPreModel("./pre_model") 186 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 187 | 188 | ffm_model.fit(param, "./model.out") 189 | 190 | Note that xLearn can only load a binary pre-trained model, not a TXT-format text model. 191 | 192 | Choosing a Machine Learning Algorithm 193 | ---------------------------------------- 194 | 195 | Currently, xLearn supports three machine learning algorithms: the linear model (LR), the factorization machine (FM), and the field-aware factorization machine (FFM): :: 196 | 197 | import xlearn as xl 198 | 199 | ffm_model = xl.create_ffm() 200 | fm_model = xl.create_fm() 201 | lr_model = xl.create_linear() 202 | 203 | For LR and FM, the input data must be in ``CSV`` or ``libsvm`` format.
For FFM, the input data must be in ``libffm`` format: :: 204 | 205 | libsvm format: 206 | 207 | y index_1:value_1 index_2:value_2 ... index_n:value_n 208 | 209 | 0 0:0.1 1:0.5 3:0.2 ... 210 | 0 0:0.2 2:0.3 5:0.1 ... 211 | 1 0:0.2 2:0.3 5:0.1 ... 212 | 213 | CSV format: 214 | 215 | y value_1 value_2 .. value_n 216 | 217 | 0 0.1 0.2 0.2 ... 218 | 1 0.2 0.3 0.1 ... 219 | 0 0.1 0.2 0.4 ... 220 | 221 | libffm format: 222 | 223 | y field_1:index_1:value_1 field_2:index_2:value_2 ... 224 | 225 | 0 0:0:0.1 1:1:0.5 2:3:0.2 ... 226 | 0 0:0:0.2 1:2:0.3 2:5:0.1 ... 227 | 1 0:0:0.2 1:2:0.3 2:5:0.1 ... 228 | 229 | xLearn also accepts ``,`` as the separator, for example: :: 230 | 231 | libsvm format: 232 | 233 | label,index_1:value_1,index_2:value_2 ... index_n:value_n 234 | 235 | CSV format: 236 | 237 | label,value_1,value_2 .. value_n 238 | 239 | libffm format: 240 | 241 | label,field_1:index_1:value_1,field_2:index_2:value_2 ... 242 | 243 | Note that if the input CSV file does not contain ``y`` values, you must manually add a placeholder to every line (for the test data as well); otherwise, xLearn treats the first element as ``y``. 244 | 245 | LR and FM can also take ``libffm``-format input: xLearn simply ignores the ``field`` entries and treats the data as ``libsvm`` format. 246 | 247 | Setting the Validation Dataset 248 | ---------------------------------------- 249 | 250 | In machine learning, a validation dataset is used to tune hyper-parameters. In xLearn, the ``setValidate()`` API specifies the validation file, for example: :: 251 | 252 | import xlearn as xl 253 | 254 | ffm_model = xl.create_ffm() 255 | ffm_model.setTrain("./small_train.txt") 256 | ffm_model.setValidate("./small_test.txt") 257 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 258 | 259 | ffm_model.fit(param, "./model.out") 260 | 261 | Part of the program output is shown below: :: 262 | 263 | [ ACTION ] Start to train ...
264 | [------------] Epoch Train log_loss Test log_loss Time cost (sec) 265 | [ 10% ] 1 0.589475 0.535867 0.00 266 | [ 20% ] 2 0.540977 0.546504 0.00 267 | [ 30% ] 3 0.521881 0.531474 0.00 268 | [ 40% ] 4 0.507194 0.530958 0.00 269 | [ 50% ] 5 0.495460 0.530627 0.00 270 | [ 60% ] 6 0.483910 0.533307 0.00 271 | [ 70% ] 7 0.470661 0.527650 0.00 272 | [ 80% ] 8 0.465455 0.532556 0.00 273 | [ 90% ] 9 0.455787 0.538841 0.00 274 | [ ACTION ] Early-stopping at epoch 7 275 | 276 | As we can see, in this task the ``Train log_loss`` keeps decreasing, while the ``Test log_loss`` (validation loss) first decreases and then rises. This indicates that the model has begun to overfit the training data. 277 | 278 | By default, xLearn computes the validation loss at the end of every epoch; the ``metric`` parameter specifies a different evaluation metric. For classification tasks, the available metrics are ``acc`` (accuracy), ``prec`` (precision), ``f1``, and ``auc``, for example: :: 279 | 280 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric': 'acc'} 281 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric': 'prec'} 282 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric': 'f1'} 283 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'metric': 'auc'} 284 | 285 | For regression tasks, the metrics include ``mae``, ``mape``, and ``rmsd`` (also known as ``rmse``), for example: :: 286 | 287 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002, 'metric': 'rmse'} 288 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002, 'metric': 'mae'} 289 | param = {'task':'reg', 'lr':0.2, 'lambda':0.002, 'metric': 'mape'} 290 | 291 | Cross-Validation 292 | ---------------------------------------- 293 | 294 | Cross-validation is a widely used technique for model hyper-parameter tuning. In xLearn, the ``cv()`` API enables it, for example: :: 295 | 296 | import xlearn as xl 297 | 298 | ffm_model = xl.create_ffm() 299 | ffm_model.setTrain("./small_train.txt") 300 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 301 | 302 | ffm_model.cv(param) 303 | 304 | By default, xLearn uses 3-fold cross-validation (i.e., it splits the dataset evenly into 3 parts); the ``fold`` parameter sets the number of folds, for example: :: 305 | 306 | import xlearn as xl 307 | 308 | ffm_model = xl.create_ffm() 309 |
ffm_model.setTrain("./small_train.txt") 310 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'fold':5} 311 | 312 | ffm_model.cv(param) 313 | 314 | The command above splits the dataset into 5 folds, and xLearn computes the average validation loss at the end: :: 315 | 316 | [------------] Average log_loss: 0.549758 317 | [ ACTION ] Finish Cross-Validation 318 | [ ACTION ] Clear the xLearn environment ... 319 | [------------] Total time cost: 0.05 (sec) 320 | 321 | Choosing an Optimization Algorithm 322 | ---------------------------------------- 323 | 324 | In xLearn, the ``opt`` parameter selects the optimization algorithm. Currently, xLearn supports ``SGD``, ``AdaGrad``, and ``FTRL``; the default is ``AdaGrad``: :: 325 | 326 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'opt':'sgd'} 327 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'opt':'adagrad'} 328 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'opt':'ftrl'} 329 | 330 | Compared with classical SGD (stochastic gradient descent), AdaGrad adapts the learning rate per parameter: it makes larger updates for infrequent parameters and smaller updates for frequent ones. For this reason, AdaGrad is commonly used for optimization on sparse data. Moreover, SGD is more sensitive to the learning rate than AdaGrad, which makes it harder to tune. 331 | 332 | FTRL (Follow-the-Regularized-Leader) is also widely used for large-scale sparse-data optimization. Compared with SGD and AdaGrad, FTRL requires the user to tune more hyper-parameters; we introduce xLearn's hyper-parameter tuning in detail in the next section. 333 | 334 | Hyper-parameter Tuning 335 | ---------------------------------------- 336 | 337 | In machine learning, a hyper-parameter is a parameter set before training, while model parameters are the values updated during training. Hyper-parameter tuning is usually an unavoidable part of the training process. 338 | 339 | First, the learning rate is one of the most important hyper-parameters; it controls the step size of each model update. The default value in xLearn is 0.2, and the ``lr`` parameter changes it: :: 340 | 341 | param = {'task':'binary', 'lr':0.2} 342 | param = {'task':'binary', 'lr':0.5} 343 | param = {'task':'binary', 'lr':0.01} 344 | 345 | The ``lambda`` parameter controls regularization. xLearn uses ``L2`` regularization, whose default value is ``0.00002``: :: 346 | 347 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01} 348 | param = {'task':'binary', 'lr':0.2, 'lambda':0.02} 349 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 350 | 351 | For the FTRL algorithm, besides the learning rate and the regularization term, we also need to tune additional hyper-parameters, including ``alpha``, ``beta``, ``lambda_1``, and ``lambda_2``, for example: :: 352 | 353 | param =
{'alpha':0.002, 'beta':0.8, 'lambda_1':0.001, 'lambda_2': 1.0} 354 | 355 | For FM and FFM, the ``k`` parameter sets the size of the latent vectors. By default, xLearn sets it to ``4``: :: 356 | 357 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'k':2} 358 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'k':4} 359 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'k':5} 360 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'k':8} 361 | 362 | Note that xLearn uses *SSE* instructions to accelerate vector operations, processing 4 vector elements at a time, so ``k=2`` takes the same computation time as ``k=4``. 363 | 364 | In addition, for FM and FFM the ``init`` hyper-parameter scales the model initialization. By default, this value is set to ``0.66``: :: 365 | 366 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'init':0.80} 367 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'init':0.40} 368 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'init':0.10} 369 | 370 | Epochs & Early Stopping 371 | ---------------------------------------- 372 | 373 | During training, each epoch passes over the entire training data. In xLearn, the ``epoch`` parameter sets the number of epochs: :: 374 | 375 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'epoch':3} 376 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'epoch':5} 377 | param = {'task':'binary', 'lr':0.2, 'lambda':0.01, 'epoch':10} 378 | 379 | If a validation dataset is set, xLearn by default performs early stopping when the best validation result has been reached, for example: :: 380 | 381 | import xlearn as xl 382 | 383 | ffm_model = xl.create_ffm() 384 | ffm_model.setTrain("./small_train.txt") 385 | ffm_model.setValidate("./small_test.txt") 386 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'epoch':10} 387 | 388 | ffm_model.fit(param, "./model.out") 389 | 390 | In the code above we ask for 10 epochs, but xLearn stops early at epoch 7 (you may get a different epoch on your local machine): :: 391 | 392 | Early-stopping at epoch 7 393 | Start to save model ...
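This early-stopping behaviour can be sketched generically. The function below is illustrative only, not xLearn's internal code; the window size of 2 is an assumption used to reproduce the run above, fed with the validation losses from the earlier epoch table:

```python
def train_with_early_stop(losses, window=2):
    # losses: validation loss after each epoch (1-based epochs).
    # Stop once `window` consecutive epochs fail to beat the best loss
    # seen so far; the model from the best epoch is the one kept.
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= window:
            break  # early stop: no improvement within the window
    return best_epoch, best_loss

# Validation losses from the run shown earlier: best at epoch 7,
# training halts two non-improving epochs later.
epoch, loss = train_with_early_stop(
    [0.535867, 0.546504, 0.531474, 0.530958, 0.530627,
     0.533307, 0.527650, 0.532556, 0.538841])
```

Running this returns epoch 7, matching the ``Early-stopping at epoch 7`` line in the output above.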
394 | 395 | The ``stop_window`` parameter sets the window size of the early-stopping mechanism; e.g., ``stop_window=2`` means that training stops if no better validation result appears within the next two epochs, and the best model seen so far is saved: :: 396 | 397 | param = {'task':'binary', 'lr':0.2, 398 | 'lambda':0.002, 'epoch':10, 399 | 'stop_window':3} 400 | 401 | ffm_model.fit(param, "./model.out") 402 | 403 | The ``disableEarlyStop()`` API disables early stopping: :: 404 | 405 | import xlearn as xl 406 | 407 | ffm_model = xl.create_ffm() 408 | ffm_model.setTrain("./small_train.txt") 409 | ffm_model.setValidate("./small_test.txt") 410 | ffm_model.disableEarlyStop() 411 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'epoch':10} 412 | 413 | ffm_model.fit(param, "./model.out") 414 | 415 | With this setting, xLearn runs the full 10 training epochs. 416 | 417 | Note that by default, if no metric is set, xLearn selects the best stopping point by the test loss; if a metric is set, the metric value decides when to stop. 418 | 419 | Lock-Free Learning 420 | ---------------------------------------- 421 | 422 | By default, xLearn performs *Hogwild!* lock-free training, which parallelizes training across multiple CPU cores to improve CPU utilization and speed up convergence. However, the lock-free algorithm is non-deterministic: running the following command several times yields slightly different loss values: :: 423 | 424 | import xlearn as xl 425 | 426 | ffm_model = xl.create_ffm() 427 | ffm_model.setTrain("./small_train.txt") 428 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 429 | 430 | ffm_model.fit(param, "./model.out") 431 | 432 | The 1st time: 0.449056 433 | The 2nd time: 0.449302 434 | The 3rd time: 0.449185 435 | 436 | The ``nthread`` parameter sets the number of CPU cores to use, for example: :: 437 | 438 | import xlearn as xl 439 | 440 | ffm_model = xl.create_ffm() 441 | ffm_model.setTrain("./small_train.txt") 442 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002, 'nthread':4} 443 | 444 | ffm_model.fit(param, "./model.out") 445 | 446 | The code above trains the model with 4 CPU cores. If this option is not set, xLearn uses all CPU cores by default. 447 | 448 | xLearn reports the number of threads in use: :: 449 | 450 | [------------] xLearn uses 4 threads for training task. 451 | [ ACTION ] Read Problem ...
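The small run-to-run differences come from unsynchronized floating-point updates: when threads accumulate gradient contributions in different orders, the rounded results differ, because floating-point addition is not associative. A minimal illustration in plain Python (unrelated to xLearn's internals):

```python
# Floating-point addition is not associative, so two threads interleaving
# the same updates in different orders can produce slightly different
# weights -- the source of the non-deterministic loss values above.
a = (0.1 + 0.2) + 0.3   # one interleaving
b = 0.1 + (0.2 + 0.3)   # another interleaving
print(a == b)           # False: 0.6000000000000001 vs 0.6
```

The discrepancy is on the order of one unit in the last place, which is why the loss values above agree to three or four significant digits.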
452 | 453 | 454 | The ``disableLockFree()`` API disables multi-core lock-free training: :: 455 | 456 | import xlearn as xl 457 | 458 | ffm_model = xl.create_ffm() 459 | ffm_model.setTrain("./small_train.txt") 460 | ffm_model.disableLockFree() 461 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 462 | 463 | ffm_model.fit(param, "./model.out") 464 | 465 | Now the results computed by xLearn are deterministic: :: 466 | 467 | The 1st time: 0.449172 468 | The 2nd time: 0.449172 469 | The 3rd time: 0.449172 470 | 471 | The drawback of ``disableLockFree()`` is that training becomes much slower than lock-free training, so we recommend keeping lock-free training enabled on large-scale data. 472 | 473 | Instance-Wise Normalization 474 | ---------------------------------------- 475 | 476 | For FM and FFM, xLearn applies instance-wise normalization to the features by default. In large-scale sparse-data scenarios (such as CTR prediction), this technique is very effective, but it can sometimes hurt model accuracy. The ``disableNorm()`` API turns it off: :: 477 | 478 | import xlearn as xl 479 | 480 | ffm_model = xl.create_ffm() 481 | ffm_model.setTrain("./small_train.txt") 482 | ffm_model.disableNorm() 483 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 484 | 485 | ffm_model.fit(param, "./model.out") 486 | 487 | Note that if instance-wise normalization is used during training, it must also be used during prediction. 488 | 489 | Quiet Mode 490 | ---------------------------------------- 491 | 492 | xLearn supports a quiet training mode: calling the ``setQuiet()`` API makes xLearn skip computing any evaluation metric during training, which can greatly speed up training: :: 493 | 494 | import xlearn as xl 495 | 496 | ffm_model = xl.create_ffm() 497 | ffm_model.setTrain("./small_train.txt") 498 | ffm_model.setQuiet() 499 | param = {'task':'binary', 'lr':0.2, 'lambda':0.002} 500 | 501 | ffm_model.fit(param, "./model.out") 502 | 503 | DMatrix Transformation 504 | ---------------------------------------- 505 | The following code shows how to use the xLearn Python DMatrix API. You can find the sample data (``house_price_train.txt`` and ``house_price_test.txt``) under the ``demo/regression/house_price`` directory: 506 | 507 | ..
code-block:: python 508 | 509 | import xlearn as xl 510 | import numpy as np 511 | import pandas as pd 512 | 513 | # Read data from file 514 | house_price_train = pd.read_csv("house_price_train.txt", header=None, sep="\t") 515 | house_price_test = pd.read_csv("house_price_test.txt", header=None, sep="\t") 516 | 517 | # Get train X, y 518 | X_train = house_price_train[house_price_train.columns[1:]] 519 | y_train = house_price_train[0] 520 | 521 | # Get test X, y 522 | X_test = house_price_test[house_price_test.columns[1:]] 523 | y_test = house_price_test[0] 524 | 525 | # DMatrix transformation; to use fields, the user must pass a field map (an array) for the features 526 | xdm_train = xl.DMatrix(X_train, y_train) 527 | xdm_test = xl.DMatrix(X_test, y_test) 528 | 529 | # Training task 530 | fm_model = xl.create_fm() # Use factorization machine 531 | # We use the same API as training from a file, 532 | # i.e., you can now also pass an xl.DMatrix to this API 533 | fm_model.setTrain(xdm_train) # Training data 534 | fm_model.setValidate(xdm_test) # Validation data 535 | 536 | # param: 537 | # 0. regression task 538 | # 1. learning rate: 0.2 539 | # 2. regularization lambda: 0.002 540 | # 3.
evaluation metric: mae 541 | param = {'task':'reg', 'lr':0.2, 542 | 'lambda':0.002, 'metric':'mae'} 543 | 544 | # Start to train 545 | # The trained model will be stored in model_dm.out 546 | fm_model.fit(param, './model_dm.out') 547 | 548 | # Prediction task 549 | # We use the same API as testing from a file, 550 | # i.e., you can now also pass an xl.DMatrix to this API 551 | fm_model.setTest(xdm_test) # Test data 552 | 553 | # Start to predict 554 | # No output path is given here, so the result is 555 | # returned as a numpy.ndarray 556 | res = fm_model.predict("./model_dm.out") 557 | 558 | **Note:** Training from a DMatrix does not support cross-validation yet; we will add this feature soon. 559 | 560 | Scikit-learn API 561 | ---------------------------------------- 562 | 563 | xLearn also supports the Scikit-learn API: :: 564 | 565 | import numpy as np 566 | import xlearn as xl 567 | from sklearn.datasets import load_iris 568 | from sklearn.model_selection import train_test_split 569 | 570 | # Load dataset 571 | iris_data = load_iris() 572 | X = iris_data['data'] 573 | y = (iris_data['target'] == 2) 574 | 575 | X_train, \ 576 | X_val, \ 577 | y_train, \ 578 | y_val = train_test_split(X, y, test_size=0.3, random_state=0) 579 | 580 | # param: 581 | # 0. binary classification 582 | # 1. model scale: 0.1 583 | # 2. epoch number: 10 (auto early-stop) 584 | # 3. learning rate: 0.1 585 | # 4. regularization lambda: 1.0 586 | # 5. use sgd optimization method 587 | linear_model = xl.LRModel(task='binary', init=0.1, 588 | epoch=10, lr=0.1, 589 | reg_lambda=1.0, opt='sgd') 590 | 591 | # Start to train 592 | linear_model.fit(X_train, y_train, 593 | eval_set=[X_val, y_val], 594 | is_lock_free=False) 595 | 596 | # Generate predictions 597 | y_pred = linear_model.predict(X_val) 598 | 599 | ..
__: https://github.com/aksnzhy/xlearn/tree/master/demo/classification/scikit_learn_demo 600 | -------------------------------------------------------------------------------- /tune/index.rst: -------------------------------------------------------------------------------- 1 | xLearn Hyper-parameter Tuning Guide 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | Coming soon ... -------------------------------------------------------------------------------- /tutorial/index.rst: -------------------------------------------------------------------------------- 1 | xLearn Tutorials 2 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3 | 4 | (1) `深入FFM原理与实践(美团技术团队)`__ 5 | (2) `一文读懂FM算法优势,并用python实现`__ 6 | (3) `Introductory Guide – Factorization Machines & their application on huge datasets (with codes in Python)`__ 7 | (4) `简单高效的组合特征自动挖掘框架`__ 8 | 9 | .. __: https://tech.meituan.com/deep_understanding_of_ffm_principles_and_practices.html 10 | .. __: https://yq.aliyun.com/articles/374170 11 | .. __: https://www.analyticsvidhya.com/blog/2018/01/factorization-machines/ 12 | .. __: https://zhuanlan.zhihu.com/p/42946318 --------------------------------------------------------------------------------