├── README.md
├── dataset
    └── webshell.zip
├── log
    ├── checklog.log
    └── trainlog.log
├── model
    ├── mlp.pkl
    └── tfidftransformer.pkl
├── pic
    └── 1.jpg
├── train.py
└── webshellDc.py


/README.md:
--------------------------------------------------------------------------------
  1 | # webshellDc v0.1
  2 | 
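  3 | A webshell is usually a tool written in a scripting language such as ASP, JSP, PHP, Python, or Perl for managing a web server; it is also called a webadmin. Because a webshell can upload and download files, browse databases, and run system commands, attackers often use one to carry out a series of intrusions against a server. Webshells are highly dangerous and hard to detect.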
  4 | 
  5 | 
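  6 | This project collects webshells from 160 GitHub projects as malicious (black) samples and a large number of open-source PHP, JSP, ASP, and Java projects as benign (white) samples. After deduplication there are 2944 malicious samples and 11945 benign samples. The samples are split into n-grams and turned into feature vectors with CountVectorizer and TfidfTransformer, then used to train a multilayer neural network, XGBoost, and naive Bayes. Of these, the MLPClassifier model performed comparatively well.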
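
To make the n-gram step concrete, here is a minimal standard-library sketch of the word-bigram features that `CountVectorizer` is configured to count in `train.py` (`ngram_range=(2, 2)` with the token pattern `[^\w\s]+|\b\w+\b`). The helper name `word_bigrams` and the sample string are illustrative assumptions, not part of the project; the real pipeline also caps the vocabulary at `max_features`.

```python
import re
from collections import Counter

# Same token pattern as train.py: runs of punctuation, or whole words.
TOKEN_PATTERN = re.compile(r"[^\w\s]+|\b\w+\b")


def word_bigrams(source):
    """Return the word bigrams of a source string (hypothetical helper).

    Text is lowercased first, which mirrors CountVectorizer's default.
    """
    tokens = TOKEN_PATTERN.findall(source.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


# Illustrative one-line PHP sample, assumed for this sketch only.
sample = '<?php eval($_POST["cmd"]); ?>'
bigram_counts = Counter(word_bigrams(sample))
for bigram, count in bigram_counts.items():
    print(repr(bigram), count)
```

Keeping punctuation runs as tokens means that pairs such as `eval ($` survive as features, which is presumably why the pattern does not discard symbols.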
  7 | 
  8 | 
  9 | 
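The TF-IDF weighting can also be sketched by hand. With `TfidfTransformer(smooth_idf=False)`, as used in `train.py`, scikit-learn computes `idf(t) = ln(n / df(t)) + 1` and then L2-normalizes each row. The function below is a plain-Python illustration under those settings; the name `tfidf_rows`, the toy count matrix, and the handling of all-zero columns are assumptions made for this sketch.

```python
import math


def tfidf_rows(count_rows):
    """Weight a dense count matrix with idf = ln(n / df) + 1, then L2-normalize rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of rows in which each term appears.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Assumption: a term that never appears gets weight 0 (not reachable after fit).
    idf = [math.log(n_docs / d) + 1.0 if d else 0.0 for d in df]
    result = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(value * value for value in weighted))
        result.append([value / norm if norm else 0.0 for value in weighted])
    return result


# Toy counts: two documents, two bigram features (illustrative only).
rows = tfidf_rows([[1, 1], [1, 0]])
print(rows)
```

The second feature appears in only one document, so it receives the larger weight in the first row.
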
 10 | ## Usage
 11 | ```
 12 | Training:
 13 | python train.py -n webshelldir (path to malicious samples) -p normaldir (path to benign samples) -m mlp (model option)
 14 | 
 15 | Testing:
 16 | python webshellDc.py
 17 | ```
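
For reference, the options that `train.py` accepts can be summarized as follows. The script itself uses `optparse`; this is an equivalent `argparse` sketch, not the project's code. The flag names, destinations, and defaults are taken from `train.py` (including the `--postive_dir` spelling), while `required=True` and `choices` are assumptions added here for stricter validation.

```python
import argparse


def build_parser():
    """Build an argparse parser mirroring the options defined in train.py."""
    parser = argparse.ArgumentParser(description="Train a webshell classifier (sketch).")
    parser.add_argument("-v", "--version", dest="version", default="v0",
                        help="training version tag")
    parser.add_argument("-s", "--seed", dest="seed", default=777, type=int,
                        help="random seed for the train/test split")
    # Long flag spelling kept exactly as in train.py.
    parser.add_argument("-p", "--postive_dir", dest="normal", required=True,
                        help="directory of benign samples")
    parser.add_argument("-n", "--negative_dir", dest="webshell", required=True,
                        help="directory of malicious samples")
    parser.add_argument("-m", "--model", dest="mode", default="mlp",
                        choices=["mlp", "xgb", "gnb"], help="model type to train")
    parser.add_argument("-d", "--dimensions", dest="max_features", default=25000,
                        type=int, help="feature vector dimensions")
    return parser


# Example invocation with hypothetical directory names.
args = build_parser().parse_args(["-n", "dataset/webshell", "-p", "dataset/normal"])
print(args.mode, args.seed, args.max_features)  # prints: mlp 777 25000
```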
 18 | 
 19 | 
 20 | ## Training environment
 21 | 
 22 | > System: macOS, 16 GB RAM, Python 3.6.3
 23 | > Run time: 134s
 24 | 
 25 | 
 26 | ## Screenshot
 27 | <!-- ![mlpevaluation](pic/1.jpg) -->
 28 | <img src="pic/1.jpg" width="350" height="500" align="center" />
 29 | 
 30 | 
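 31 | Benign-sample (whitelist) check:
 32 | Total checked: 11945, flagged as webshell: 23, classified as normal: 11922
 33 | False positive rate: 0.0019254918375889493
 34 | 
 35 | Malicious-sample (blacklist) check:
 36 | Total checked: 2944, flagged as webshell: 2925, classified as normal: 19
 37 | Recall: 0.993546195652174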
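
The two rates above follow directly from the reported counts. As a quick check, the false positive rate is the share of benign files flagged as webshells, and recall is the share of webshells that were flagged. The helper name `ratio` is an assumption for this sketch.

```python
def ratio(count, total):
    """Return count / total, rejecting an empty sample set."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


# Counts reported in the README.
benign_total, benign_flagged = 11945, 23
malicious_total, malicious_flagged = 2944, 2925

false_positive_rate = ratio(benign_flagged, benign_total)
recall = ratio(malicious_flagged, malicious_total)

print("false positive rate: %.6f" % false_positive_rate)
print("recall: %.6f" % recall)
```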
 38 | 
 39 | 
 40 | ## Malicious samples
 41 | 
 42 | https://github.com/tennc/webshell  
 43 | https://github.com/ysrc/webshell-sample
 44 | https://github.com/xl7dev/WebShell
 45 | https://github.com/tdifg/WebShell
 46 | https://github.com/fictivekin/webshell
 47 | https://github.com/bartblaze/PHP-backdoors
 48 | https://github.com/malwares/WebShell
 49 | https://github.com/xypiie/WebShell
 50 | https://github.com/testsecer/WebShell
 51 | https://github.com/nbs-system/php-malware-finder
 52 | https://github.com/BlackArch/webshells
 53 | https://github.com/tanjiti/webshellSample
 54 | https://github.com/dotcppfile/DAws
 55 | https://github.com/theralfbrown/webshell
 56 | https://github.com/gokyle/webshell
 57 | https://github.com/sunnyelf/cheetah
 58 | https://github.com/JohnTroony/php-webshells
 59 | https://github.com/evilcos/python-webshell
 60 | https://github.com/lhlsec/webshell
 61 | https://github.com/shewey/webshell
 62 | https://github.com/boy-hack/WebshellManager
 63 | https://github.com/liulongfei/web_shell_bopo
 64 | https://github.com/Ni7eipr/webshell
 65 | https://github.com/WangYihang/Webshell-Sniper
 66 | https://github.com/pm2-hive/pm2-webshell
 67 | https://github.com/samdark/yii2-webshell
 68 | https://github.com/b1ueb0y/webshell
 69 | https://github.com/oneoneplus/webshell
 70 | https://github.com/zhaojh329/xterminal
 71 | https://github.com/juanparati/Webshell
 72 | https://github.com/wofeiwo/webshell-find-tools
 73 | https://github.com/abcdlzy/webshell-manager
 74 | https://github.com/alert0/webshellch
 75 | https://github.com/needle-wang/jweevely
 76 | https://github.com/tengzhangchao/PyCmd
 77 | https://github.com/0x73686974/WebShell
 78 | https://github.com/wonderqs/Blade
 79 | https://github.com/le4f/aspexec
 80 | https://github.com/jijinggang/WebShell
 81 | https://github.com/matiasmenares/Shuffle
 82 | https://github.com/Skycrab/PySpy
 83 | https://github.com/huge818/webshell
 84 | https://github.com/gb-sn/go-webshell
 85 | https://github.com/BlackHole1/Fastener
 86 | https://github.com/blackhalt/WebShells
 87 | https://github.com/tomas1000r/webshell
 88 | https://github.com/hanzhibin/Webshell
 89 | https://github.com/decebel/webShell
 90 | https://github.com/Aviso-hub/Webshell
 91 | https://github.com/vnhacker1337/Webshell
 92 | https://github.com/bittorrent3389/Webshell
 93 | https://github.com/anhday22/WebShell
 94 | https://github.com/buxiaomo/webshell
 95 | https://github.com/z3robat/webshell
 96 | https://github.com/n3oism/webshell
 97 | https://github.com/uuleaf/WebShell
 98 | https://github.com/onefor1/webshell
 99 | https://github.com/cunlin-yu/webshell
100 | https://github.com/roytest1/webshell
101 | https://github.com/backlion/webshell
102 | https://github.com/opetrovski/webshell
104 | https://github.com/gsmlg/webshell
105 | https://github.com/health901/webshell
106 | https://github.com/inof8r/WebShell
107 | https://github.com/Najones19746/webShell
108 | https://github.com/RaspiCar/WebShell
110 | https://github.com/dinamsky/WebShell
111 | https://github.com/Fay48/WebShell
112 | https://github.com/tuz358/webshell
113 | https://github.com/shajf/Webshell
114 | https://github.com/t17lab/WebShell
115 | https://github.com/blacksunwen/webshell
116 | https://github.com/webshellarchive/webshellco
117 | https://github.com/lolwaleet/Rubshell
118 | https://github.com/WhiteWinterWolf/WhiteWinterWolf-php-webshell
119 | https://github.com/goodtouch/jruby-webshell
120 | https://github.com/maestrano/webshell-server
121 | https://github.com/LuciferoO/webshell-collector
122 | https://github.com/wangeradd1/myWebShell
123 | https://github.com/0xHJK/caidao
124 | https://github.com/alintamvanz/1945shell
125 | https://github.com/Venen0/vshell
126 | https://github.com/lojikil/tinyshell
127 | https://github.com/wso-shell/PHP-SHELL-WSO
128 | https://github.com/meme-lord/PHPShellBackdoors
129 | https://github.com/Learn2Better/51mp3L-Web-Backdoor
130 | https://github.com/yuxiaokui/JBoss-Hack
131 | https://github.com/SecurityRiskAdvisors/cmd.jsp
132 | https://github.com/ddcunningham/crude-shellhunter
133 | https://github.com/stormdark/BackdoorPHP
134 | https://github.com/vduddu/Malware
135 | https://github.com/1oid/BurstPHPshell
136 | https://github.com/gokyle/urlshorten_ng
137 | https://github.com/rhelsing/trello_osx
138 | https://github.com/pfrazee/wsh-grammar
139 | https://github.com/x-o-r-r-o/PHP-Webshells-Collection
140 | https://github.com/IHA114/WebShell2
141 | https://github.com/WangYihang/WebShellCracker
142 | https://github.com/KINGSABRI/WebShellConsole
143 | https://github.com/jujinesy/webshells.17.03.18
144 | https://github.com/hackzsd/HandyShells
145 | https://github.com/mperlet/pomsky
146 | https://github.com/cybernoir/bns-php-shell
147 | https://github.com/XianThi/rexShell
148 | https://github.com/H4CK3RT3CH/php-webshells
149 | https://github.com/minisllc/subshell
150 | https://github.com/linuxsec/indoxploit-shell
151 | https://github.com/kuniasahi/mpshell
152 | https://github.com/datasiph0n/MyBB-Shell-Plugin
153 | https://github.com/magicming200/evil-koala-php-webshell
154 | https://github.com/0xK3v/Simple-WebShell
155 | https://github.com/djoq/docker-pm2-webshell
156 | https://github.com/SMRUCC/GCModeller.WebShell
157 | https://github.com/darknesstiller/WebShells
158 | https://github.com/devilscream/remoteshell
159 | https://github.com/0verl0ad/gorosaurus
160 | https://github.com/grCod/poly
161 | https://github.com/cryptobioz/wizhack
162 | https://github.com/amwso/docker-webshell
163 | https://github.com/William-Hunter/JSP_Webshell
164 | https://github.com/yangbaopeng/ashx_webshell
165 | https://github.com/webshellpub/awsome-webshell
166 | https://github.com/noalh8t/simple-webshell
167 | https://github.com/s3cureshell/wso-2.8-web-shell
168 | https://github.com/LiamRandall/simpleexec
169 | https://github.com/Samorodek/humhub-modules-webshell
170 | https://github.com/mwambler/webshell-xpages-ext-lib
171 | https://github.com/AVGP/Wesh
172 | https://github.com/edibledinos/weevely3-stealth
173 | https://github.com/lehins/haskell-webshell
174 | https://github.com/guglia001/php-secure-remove
175 | https://github.com/gokyle/webshell_tutorial
176 | https://github.com/azmanishak/webshell-php
177 | https://github.com/andrefernandes/docker-webshell
178 | https://github.com/codehz/node-webshell
179 | https://github.com/koolshare/merlin-webshell
180 | https://github.com/StephaneP/erl-webshell
181 | https://github.com/jjjmaracay3/webshells
182 | https://github.com/grCod/webshells
183 | https://github.com/ian4hu/bootshell
184 | https://github.com/Ghostboy-287/wso-webshell
185 | https://github.com/xiaoxiaoleo/xiao-webshell
186 | https://github.com/alexbires/webshellmanagement
187 | https://github.com/codeT/collectWebShell
188 | https://github.com/PhilCodeEx/jak3fr0z
189 | https://github.com/Ettack/WebshellCCL
190 | https://github.com/jubal-R/TinyWebShell
191 | https://github.com/CaledoniaProject/AxisInvoker
192 | https://github.com/theBrianCui/ISSS_webShell
193 | https://github.com/webshell/webshell-node-sdk
194 | https://github.com/Medicean/AS_BugScan
195 | https://github.com/3xp10it/xwebshell
196 | https://github.com/niemand-sec/RazorSyntaxWebshell
198 | https://github.com/0verl0ad/HideShell
199 | https://github.com/L-codes/oneshellcrack
200 | https://github.com/ArchAssault-Project/webshells
201 | https://github.com/AndrHacK/andrshell
202 | 


--------------------------------------------------------------------------------
/dataset/webshell.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hooog/webshellDc/164564520f2598491b3e3d747410369a97d44f9f/dataset/webshell.zip


--------------------------------------------------------------------------------
/log/trainlog.log:
--------------------------------------------------------------------------------
  1 | 2020-07-20 15:48:29,770 - train.py[line:122] - INFO: Training time: 148.7166247367859
  2 | 2020-07-20 16:20:34,958 - train.py[line:72] - INFO: Benign samples: 11955
  3 | 2020-07-20 16:20:34,959 - train.py[line:73] - INFO: Malicious samples: 2944
  4 | 2020-07-20 16:24:30,351 - train.py[line:72] - INFO: Benign samples: 11955
  5 | 2020-07-20 16:24:30,351 - train.py[line:73] - INFO: Malicious samples: 2944
  6 | 2020-07-20 16:27:06,981 - train.py[line:116] - INFO: Initialization parameters:
  7 | 2020-07-20 16:27:50,587 - train.py[line:116] - INFO: Initialization parameters:
  8 | 2020-07-20 16:27:50,587 - train.py[line:117] - INFO: Current training version: v0
  9 | 2020-07-20 16:27:50,587 - train.py[line:118] - INFO: Benign samples: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/normal
 10 | 2020-07-20 16:27:50,588 - train.py[line:119] - INFO: Malicious samples: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/webshell
 11 | 2020-07-20 16:27:50,588 - train.py[line:120] - INFO: Training model: mlp
 12 | 2020-07-20 16:27:50,588 - train.py[line:121] - INFO: Random seed: 777
 13 | 2020-07-20 16:27:50,588 - train.py[line:122] - INFO: Feature dimensions: 25000
 14 | 2020-07-20 16:28:13,146 - train.py[line:72] - INFO: Benign samples: 11955
 15 | 2020-07-20 16:28:13,146 - train.py[line:73] - INFO: Malicious samples: 2944
 16 | 2020-07-20 16:29:10,831 - train.py[line:116] - INFO: Initialization parameters:
 17 | 2020-07-20 16:29:10,831 - train.py[line:117] - INFO: Current training version: v0
 18 | 2020-07-20 16:29:10,831 - train.py[line:118] - INFO: Benign sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/normal
 19 | 2020-07-20 16:29:10,831 - train.py[line:119] - INFO: Malicious sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/webshell
 20 | 2020-07-20 16:29:10,831 - train.py[line:120] - INFO: Training model: mlp
 21 | 2020-07-20 16:29:10,831 - train.py[line:121] - INFO: Random seed: 777
 22 | 2020-07-20 16:29:10,831 - train.py[line:122] - INFO: Feature dimensions: 25000
 23 | 2020-07-20 16:29:25,707 - train.py[line:72] - INFO: Total benign samples: 11955
 24 | 2020-07-20 16:29:25,707 - train.py[line:73] - INFO: Total malicious samples: 2944
 25 | 2020-07-20 16:32:08,022 - train.py[line:99] - INFO: Training set evaluation:
 26 | 2020-07-20 16:32:08,585 - train.py[line:89] - INFO: Accuracy: 0.999808227059162
 27 | 2020-07-20 16:32:08,601 - train.py[line:90] - INFO: [[8372    1]
 28 |  [   1 2055]]
 29 | 2020-07-20 16:32:08,609 - train.py[line:91] - INFO:              precision    recall  f1-score   support
 30 | 
 31 |           0       1.00      1.00      1.00      8373
 32 |           1       1.00      1.00      1.00      2056
 33 | 
 34 | avg / total       1.00      1.00      1.00     10429
 35 | 
 36 | 2020-07-20 16:32:08,611 - train.py[line:101] - INFO: Test set evaluation:
 37 | 2020-07-20 16:32:09,026 - train.py[line:89] - INFO: Accuracy: 0.9910514541387024
 38 | 2020-07-20 16:32:09,035 - train.py[line:90] - INFO: [[3560   22]
 39 |  [  18  870]]
 40 | 2020-07-20 16:32:09,039 - train.py[line:91] - INFO:              precision    recall  f1-score   support
 41 | 
 42 |           0       0.99      0.99      0.99      3582
 43 |           1       0.98      0.98      0.98       888
 44 | 
 45 | avg / total       0.99      0.99      0.99      4470
 46 | 
 47 | 2020-07-20 16:32:09,417 - train.py[line:127] - INFO: Training time: 178.58603620529175
 48 | 2020-07-20 16:32:34,282 - train.py[line:116] - INFO: Initialization parameters:
 49 | 2020-07-20 16:32:34,282 - train.py[line:117] - INFO: Current training version: v1
 50 | 2020-07-20 16:32:34,282 - train.py[line:118] - INFO: Benign sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/normal
 51 | 2020-07-20 16:32:34,282 - train.py[line:119] - INFO: Malicious sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/webshell
 52 | 2020-07-20 16:32:34,283 - train.py[line:120] - INFO: Training model: mlp
 53 | 2020-07-20 16:32:34,283 - train.py[line:121] - INFO: Random seed: 777
 54 | 2020-07-20 16:32:34,283 - train.py[line:122] - INFO: Feature dimensions: 25000
 55 | 2020-07-20 16:32:55,886 - train.py[line:116] - INFO: Initialization parameters:
 56 | 2020-07-20 16:32:55,887 - train.py[line:117] - INFO: Current training version: v1
 57 | 2020-07-20 16:32:55,887 - train.py[line:118] - INFO: Benign sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/normal
 58 | 2020-07-20 16:32:55,887 - train.py[line:119] - INFO: Malicious sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/webshell
 59 | 2020-07-20 16:32:55,887 - train.py[line:120] - INFO: Training model: mlp
 60 | 2020-07-20 16:32:55,887 - train.py[line:121] - INFO: Random seed: 666
 61 | 2020-07-20 16:32:55,887 - train.py[line:122] - INFO: Feature dimensions: 25000
 62 | 2020-07-20 16:33:15,973 - train.py[line:72] - INFO: Total benign samples: 11955
 63 | 2020-07-20 16:33:15,973 - train.py[line:73] - INFO: Total malicious samples: 2944
 64 | 2020-07-20 16:35:51,871 - train.py[line:99] - INFO: Training set evaluation:
 65 | 2020-07-20 16:35:53,377 - train.py[line:89] - INFO: Accuracy: 1.0
 66 | 2020-07-20 16:35:53,421 - train.py[line:90] - INFO: [[8393    0]
 67 |  [   0 2036]]
 68 | 2020-07-20 16:35:53,438 - train.py[line:91] - INFO:              precision    recall  f1-score   support
 69 | 
 70 |           0       1.00      1.00      1.00      8393
 71 |           1       1.00      1.00      1.00      2036
 72 | 
 73 | avg / total       1.00      1.00      1.00     10429
 74 | 
 75 | 2020-07-20 16:35:53,438 - train.py[line:101] - INFO: Test set evaluation:
 76 | 2020-07-20 16:35:54,622 - train.py[line:89] - INFO: Accuracy: 0.9937360178970918
 77 | 2020-07-20 16:35:54,642 - train.py[line:90] - INFO: [[3552   10]
 78 |  [  18  890]]
 79 | 2020-07-20 16:35:54,665 - train.py[line:91] - INFO:              precision    recall  f1-score   support
 80 | 
 81 |           0       0.99      1.00      1.00      3562
 82 |           1       0.99      0.98      0.98       908
 83 | 
 84 | avg / total       0.99      0.99      0.99      4470
 85 | 
 86 | 2020-07-20 16:35:55,162 - train.py[line:127] - INFO: Training time: 179.27528595924377
 87 | 2020-08-12 14:41:26,580 - train.py[line:116] - INFO: Initialization parameters:
 88 | 2020-08-12 14:41:26,580 - train.py[line:117] - INFO: Current training version: v0
 89 | 2020-08-12 14:41:26,580 - train.py[line:118] - INFO: Benign sample path: dataset/dataset/normal
 90 | 2020-08-12 14:41:26,580 - train.py[line:119] - INFO: Malicious sample path: dataset/dataset/webshell
 91 | 2020-08-12 14:41:26,580 - train.py[line:120] - INFO: Training model: mlp
 92 | 2020-08-12 14:41:26,580 - train.py[line:121] - INFO: Random seed: 777
 93 | 2020-08-12 14:41:26,580 - train.py[line:122] - INFO: Feature dimensions: 25000
 94 | 2020-08-12 14:41:44,037 - train.py[line:72] - INFO: Total benign samples: 11928
 95 | 2020-08-12 14:41:44,037 - train.py[line:73] - INFO: Total malicious samples: 2944
 96 | 2020-08-12 14:43:52,266 - train.py[line:99] - INFO: Training set evaluation:
 97 | 2020-08-12 14:43:52,635 - train.py[line:89] - INFO: Accuracy: 0.9998078770413065
 98 | 2020-08-12 14:43:52,651 - train.py[line:90] - INFO: [[8351    1]
 99 |  [   1 2057]]
100 | 2020-08-12 14:43:52,656 - train.py[line:91] - INFO:              precision    recall  f1-score   support
101 | 
102 |           0       1.00      1.00      1.00      8352
103 |           1       1.00      1.00      1.00      2058
104 | 
105 | avg / total       1.00      1.00      1.00     10410
106 | 
107 | 2020-08-12 14:43:52,659 - train.py[line:101] - INFO: Test set evaluation:
108 | 2020-08-12 14:43:52,880 - train.py[line:89] - INFO: Accuracy: 0.9937247870909905
109 | 2020-08-12 14:43:52,885 - train.py[line:90] - INFO: [[3563   13]
110 |  [  15  871]]
111 | 2020-08-12 14:43:52,888 - train.py[line:91] - INFO:              precision    recall  f1-score   support
112 | 
113 |           0       1.00      1.00      1.00      3576
114 |           1       0.99      0.98      0.98       886
115 | 
116 | avg / total       0.99      0.99      0.99      4462
117 | 
118 | 2020-08-12 14:43:53,142 - train.py[line:127] - INFO: Training time: 146.56202912330627
119 | 2020-08-12 15:28:12,545 - train.py[line:140] - INFO: Initialization parameters:
120 | 2020-08-12 15:28:12,545 - train.py[line:141] - INFO: Current training version: v0
121 | 2020-08-12 15:28:12,545 - train.py[line:142] - INFO: Benign sample path: dataset/dataset/normal
122 | 2020-08-12 15:28:12,546 - train.py[line:143] - INFO: Malicious sample path: dataset/dataset/webshell
123 | 2020-08-12 15:28:12,546 - train.py[line:144] - INFO: Training model: mlp
124 | 2020-08-12 15:28:12,546 - train.py[line:145] - INFO: Random seed: 777
125 | 2020-08-12 15:28:12,546 - train.py[line:146] - INFO: Feature dimensions: 25000
126 | 2020-08-12 15:28:31,642 - train.py[line:73] - INFO: Total benign samples: 11928
127 | 2020-08-12 15:28:31,642 - train.py[line:74] - INFO: Total malicious samples: 2944
128 | 2020-08-12 15:30:36,342 - train.py[line:124] - INFO: Test set evaluation:
129 | 2020-08-12 15:30:36,589 - train.py[line:112] - INFO: Accuracy: 0.9937247870909905
130 | 2020-08-12 15:30:36,596 - train.py[line:113] - INFO: [[3563   13]
131 |  [  15  871]]
132 | 2020-08-12 15:30:36,601 - train.py[line:114] - INFO:              precision    recall  f1-score   support
133 | 
134 |           0       1.00      1.00      1.00      3576
135 |           1       0.99      0.98      0.98       886
136 | 
137 | avg / total       0.99      0.99      0.99      4462
138 | 
139 | 2020-08-12 15:30:37,934 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
140 | 2020-08-12 15:30:37,974 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=12.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
141 | 2020-08-12 15:32:47,656 - train.py[line:151] - INFO: Training time: 275.1095280647278
142 | 2020-08-12 15:38:10,802 - train.py[line:147] - INFO: Initialization parameters:
143 | 2020-08-12 15:38:10,802 - train.py[line:148] - INFO: Current training version: v0
144 | 2020-08-12 15:38:10,802 - train.py[line:149] - INFO: Benign sample path: dataset/dataset/normal
145 | 2020-08-12 15:38:10,802 - train.py[line:150] - INFO: Malicious sample path: dataset/dataset/webshell
146 | 2020-08-12 15:38:10,802 - train.py[line:151] - INFO: Training model: mlp
147 | 2020-08-12 15:38:10,802 - train.py[line:152] - INFO: Random seed: 777
148 | 2020-08-12 15:38:10,802 - train.py[line:153] - INFO: Feature dimensions: 25000
149 | 2020-08-12 15:38:28,691 - train.py[line:73] - INFO: Total benign samples: 11928
150 | 2020-08-12 15:38:28,691 - train.py[line:74] - INFO: Total malicious samples: 2944
151 | 2020-08-12 15:40:40,195 - train.py[line:131] - INFO: Test set evaluation:
152 | 2020-08-12 15:40:40,449 - train.py[line:119] - INFO: Accuracy: 0.9937247870909905
153 | 2020-08-12 15:40:40,456 - train.py[line:120] - INFO: [[3563   13]
154 |  [  15  871]]
155 | 2020-08-12 15:40:40,459 - train.py[line:121] - INFO:              precision    recall  f1-score   support
156 | 
157 |           0       1.00      1.00      1.00      3576
158 |           1       0.99      0.98      0.98       886
159 | 
160 | avg / total       0.99      0.99      0.99      4462
161 | 
162 | 2020-08-12 15:40:41,361 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
163 | 2020-08-12 15:40:41,390 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=12.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
164 | 2020-08-12 15:48:35,288 - train.py[line:158] - INFO: Training time: 624.4852471351624
165 | 2020-08-12 15:53:13,740 - train.py[line:147] - INFO: Initialization parameters:
166 | 2020-08-12 15:53:13,740 - train.py[line:148] - INFO: Current training version: v0
167 | 2020-08-12 15:53:13,741 - train.py[line:149] - INFO: Benign sample path: dataset/dataset/normal
168 | 2020-08-12 15:53:13,741 - train.py[line:150] - INFO: Malicious sample path: dataset/dataset/webshell
169 | 2020-08-12 15:53:13,741 - train.py[line:151] - INFO: Training model: mlp
170 | 2020-08-12 15:53:13,741 - train.py[line:152] - INFO: Random seed: 777
171 | 2020-08-12 15:53:13,741 - train.py[line:153] - INFO: Feature dimensions: 25000
172 | 2020-08-12 15:53:31,687 - train.py[line:73] - INFO: Total benign samples: 11928
173 | 2020-08-12 15:53:31,687 - train.py[line:74] - INFO: Total malicious samples: 2944
174 | 2020-08-12 15:55:42,848 - train.py[line:131] - INFO: Test set evaluation:
175 | 2020-08-12 15:55:43,293 - train.py[line:119] - INFO: Accuracy: 0.9937247870909905
176 | 2020-08-12 15:55:43,300 - train.py[line:120] - INFO: [[3563   13]
177 |  [  15  871]]
178 | 2020-08-12 15:55:43,303 - train.py[line:121] - INFO:              precision    recall  f1-score   support
179 | 
180 |           0       1.00      1.00      1.00      3576
181 |           1       0.99      0.98      0.98       886
182 | 
183 | avg / total       0.99      0.99      0.99      4462
184 | 
185 | 2020-08-12 15:55:50,371 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
186 | 2020-08-12 15:55:50,398 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=12.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
187 | 2020-08-12 15:58:50,293 - train.py[line:158] - INFO: Training time: 336.55256724357605
188 | 2020-08-13 09:06:01,743 - train.py[line:147] - INFO: Initialization parameters:
189 | 2020-08-13 09:06:01,744 - train.py[line:148] - INFO: Current training version: v0
190 | 2020-08-13 09:06:01,744 - train.py[line:149] - INFO: Benign sample path: dataset/dataset/normal
191 | 2020-08-13 09:06:01,744 - train.py[line:150] - INFO: Malicious sample path: dataset/dataset/webshell
192 | 2020-08-13 09:06:01,744 - train.py[line:151] - INFO: Training model: mlp
193 | 2020-08-13 09:06:01,744 - train.py[line:152] - INFO: Random seed: 777
194 | 2020-08-13 09:06:01,744 - train.py[line:153] - INFO: Feature dimensions: 25000
195 | 2020-08-13 09:06:21,212 - train.py[line:73] - INFO: Total benign samples: 11928
196 | 2020-08-13 09:06:21,212 - train.py[line:74] - INFO: Total malicious samples: 2944
197 | 2020-08-13 09:08:31,024 - train.py[line:131] - INFO: Test set evaluation:
198 | 2020-08-13 09:08:31,256 - train.py[line:119] - INFO: Accuracy: 0.9937247870909905
199 | 2020-08-13 09:08:31,268 - train.py[line:120] - INFO: [[3563   13]
200 |  [  15  871]]
201 | 2020-08-13 09:08:31,270 - train.py[line:121] - INFO:              precision    recall  f1-score   support
202 | 
203 |           0       1.00      1.00      1.00      3576
204 |           1       0.99      0.98      0.98       886
205 | 
206 | avg / total       0.99      0.99      0.99      4462
207 | 
208 | 2020-08-13 09:08:32,249 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
209 | 2020-08-13 09:08:32,284 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=12.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
210 | 2020-08-13 09:09:50,606 - train.py[line:158] - INFO: Training time: 228.86156272888184
211 | 


--------------------------------------------------------------------------------
/model/mlp.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hooog/webshellDc/164564520f2598491b3e3d747410369a97d44f9f/model/mlp.pkl


--------------------------------------------------------------------------------
/model/tfidftransformer.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hooog/webshellDc/164564520f2598491b3e3d747410369a97d44f9f/model/tfidftransformer.pkl


--------------------------------------------------------------------------------
/pic/1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hooog/webshellDc/164564520f2598491b3e3d747410369a97d44f9f/pic/1.jpg


--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # coding=utf-8
  3 | # Python3
  4 | """
  5 | @Author: liuhaojie
  6 | @File  : train.py
  7 | @Idea  : IntelliJ IDEA
  8 | @Date  : 2020/7/16
  9 | @Desc  : Descriptions
 10 | """
 11 | import logging
 12 | import optparse
 13 | import os
 14 | import time
 15 | import matplotlib.pyplot as plt
 16 | from sklearn.metrics import roc_curve, auc
 17 | from sklearn import metrics
 18 | import joblib  # replaces sklearn.externals.joblib, which was removed in scikit-learn 0.23
 19 | from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
 20 | from sklearn.metrics import classification_report, confusion_matrix
 21 | from sklearn.model_selection import train_test_split
 22 | from sklearn.naive_bayes import GaussianNB
 23 | from sklearn.neural_network import MLPClassifier
 24 | from xgboost import XGBClassifier
 25 | 
 26 | logging.basicConfig(level=logging.DEBUG,
 27 |                     filename='log/trainlog.log',
 28 |                     filemode='a',
 29 |                     format='%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s')
 30 | logger = logging.getLogger(__name__)
 31 | logger.addHandler(logging.StreamHandler())
 32 | 
 33 | 
 34 | def model_collection(mode):
 35 |     if mode == 'mlp':
 36 |         return MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
 37 |     if mode == 'xgb':
 38 |         return XGBClassifier()
 39 |     if mode == 'gnb':
 40 |         return GaussianNB()
 41 | 
 42 | 
 43 | def read_file(filename):
 44 |     text = b""
 45 |     with open(filename, "rb") as f:
 46 |         for line in f:
 47 |             line = line.strip(b"\r\t")
 48 |             text += line
 49 |     return text
 50 | 
 51 | 
 52 | def read_dir(path):
 53 |     text_list = []
 54 |     for r, d, files in os.walk(path):
 55 |         for file in files:
 56 |             filename = os.path.join(r, file)
 57 |             if os.path.splitext(file)[-1].lower() in filetypes:
 58 |                 text = read_file(filename)
 59 |                 text_list.append(text)
 60 |     return text_list
 61 | 
 62 | 
 63 | def features_process(negativedir, positivedir, maxfeatures):
 64 |     webshell_texts = read_dir(negativedir)
 65 |     normal_texts = read_dir(positivedir)
 66 |     webshell_number = len(webshell_texts)
 67 |     normal_number = len(normal_texts)
 68 |     texts = webshell_texts + normal_texts
 69 |     webshell_labels = [1] * webshell_number  # 1 = webshell
 70 |     normal_labels = [0] * normal_number  # 0 = normal file
 71 |     labels = webshell_labels + normal_labels
 72 |     logger.info("Total benign samples: %i" % normal_number)
 73 |     logger.info("Total malicious samples: %i" % webshell_number)
 74 | 
 75 |     countvectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
 76 |                                       min_df=1, analyzer="word",
 77 |                                       token_pattern=r'[^\w\s]+|\b\w+\b',
 78 |                                       max_features=maxfeatures)
 79 |     tfidftransformer = TfidfTransformer(smooth_idf=False)
 80 |     cv_x = countvectorizer.fit_transform(texts).toarray()
 81 |     tf_x = tfidftransformer.fit_transform(cv_x).toarray()
 82 | 
 83 |     joblib.dump(countvectorizer, "model/countvectorizer_" + options.version + ".pkl")
 84 |     joblib.dump(tfidftransformer, "model/tfidftransformer_" + options.version + ".pkl")
 85 |     return tf_x, labels, countvectorizer, tfidftransformer
 86 | 
 87 | 
 88 | def plot_roc(x_test, y_test, clf):
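 89 |     """
 90 |     Plot the ROC curve; only runs when the model is mlp.
 91 |     :param x_test: test-set feature matrix
 92 |     :param y_test: test-set labels
 93 |     :param clf: trained classifier
 94 |     :return: None
 95 |     """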
 96 |     if options.mode == 'mlp':
 97 |         y_test_score = clf.predict_proba(x_test)[:, 1]
 98 |         # y_pred = clf.predict(x_test)
 99 |         fpr, tpr, threshold = roc_curve(y_test, y_test_score)
100 |         roc_auc = auc(fpr, tpr)
101 |         lw = 2
102 |         plt.figure(figsize=(7, 7))
103 |         plt.plot(fpr, tpr, color='darkorange',
104 |                  lw=lw, label='webshellDc (AUC = %0.5f)' % roc_auc)  # x-axis: false positive rate; y-axis: true positive rate
105 |         plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
106 |         plt.xlim([0.0, 1.0])
107 |         plt.ylim([0.0, 1.05])
108 |         plt.xlabel('False Positive Rate')
109 |         plt.ylabel('True Positive Rate')
110 |         plt.title('Receiver operating characteristic curve')
111 |         plt.legend(loc="lower right")
112 |         plt.show()
113 |     else:
114 |         pass
115 | 
116 | 
117 | def evaluation(y_test, y_pred):
118 |     logger.info("Accuracy: %s" % metrics.accuracy_score(y_test, y_pred))
119 |     logger.info(confusion_matrix(y_test, y_pred))
120 |     logger.info(classification_report(y_test, y_pred))
121 | 
122 | 
123 | def train(trainset, labels, mode, seed):
124 |     x_train, x_test, y_train, y_test = train_test_split(trainset, labels, test_size=0.3, random_state=seed)
125 |     clf = model_collection(mode)
126 |     clfname = "model/" + mode + "_" + options.version + ".pkl"
127 |     clf.fit(x_train, y_train)
128 |     # logger.info("Training set evaluation:")
129 |     # evaluation(y_train, clf.predict(x_train))
130 |     logger.info("Test set evaluation:")
131 |     evaluation(y_test, clf.predict(x_test))
132 |     joblib.dump(clf, clfname, compress=3)
133 |     plot_roc(x_test, y_test, clf)
134 | 
135 | 
136 | if __name__ == "__main__":
137 |     parser = optparse.OptionParser()
138 |     parser.add_option("-v", "--version", dest="version", default="v0", help=u'当前训练版本号')
139 |     parser.add_option("-s", "--seed", dest="seed", default=777, type="int", help=u'模型训练随机种子')
140 |     parser.add_option("-p", "--postive_dir", dest="normal", default=False, help=u'训练白样本文件夹路径')
141 |     parser.add_option("-n", "--negative_dir", dest="webshell", default=False, help=u'训练黑样本文件夹路径')
142 |     parser.add_option("-m", "--model", dest="mode", default="mlp", help=u'训责训练的模型种类')
143 |     parser.add_option("-d", "--dimensions", dest="max_features", default=25000, type="int", help=u'特征向量维度')
144 |     options, _ = parser.parse_args()
145 |     filetypes = ['.php', '.jsp', '.asp', '.aspx', '.jspx', '.java', '.txt']
146 |     logger.info("Initialization parameters:")
147 |     logger.info("Training version: %s" % options.version)
148 |     logger.info("Normal sample path: %s" % options.normal)
149 |     logger.info("Webshell sample path: %s" % options.webshell)
150 |     logger.info("Model: %s" % options.mode)
151 |     logger.info("Random seed: %s" % options.seed)
152 |     logger.info("Feature dimensions: %s" % options.max_features)
153 |     sTime = time.time()
154 |     x, y, cv, transformer = features_process(options.webshell, options.normal, options.max_features)
155 |     train(x, y, options.mode, options.seed)
156 |     eTime = time.time()
157 |     logger.info("Training time (s): %s" % str(eTime - sTime))
158 | 
159 | # /Users/hadoop/Jupyter/恶意代码/webshell/dataset/
160 | 


--------------------------------------------------------------------------------
/webshellDc.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # coding=utf-8
  3 | # Python3
  4 | """
  5 | @Author: liuhaojie
  6 | @File  : webshellDc.py
  7 | @Idea  : IntelliJ IDEA
  8 | @Date  : 2020/7/15
  9 | @Desc  : Descriptions
 10 | """
 11 | import logging
 12 | import os
 13 | import numpy as np
 14 | 
 15 | import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
 16 | 
 17 | logging.basicConfig(level=logging.DEBUG,
 18 |                     filename='log/checklog.log',
 19 |                     filemode='a',
 20 |                     format='%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s')
 21 | logger = logging.getLogger(__name__)
 22 | logger.addHandler(logging.StreamHandler())
 23 | 
 24 | 
 25 | class WebshellDec(object):
 26 |     def __init__(self, vs):
 27 |         super(WebshellDec, self).__init__()
 28 |         self.cv = joblib.load("model/countvectorizer_" + vs + ".pkl")
 29 |         self.transformer = joblib.load("model/tfidftransformer_" + vs + ".pkl")
 30 |         self.mlp = joblib.load("model/mlp_" + vs + ".pkl")
 31 | 
 32 |     @staticmethod
 33 |     def load_file(file_path):
 34 |         t = b''
 35 |         with open(file_path, "rb") as f:
 36 |             for line in f:
 37 |                 line = line.strip(b'\r\t')
 38 |                 t += line
 39 |         return t
 40 | 
 41 |     def checkdir(self, path):
 42 |         counter, webshell_number, normal_number = 0, 0, 0
 43 |         for r, d, files in os.walk(path):
 44 |             for file in files:
 45 |                 file_path = os.path.join(r, file)
 46 |                 if os.path.splitext(file)[-1].lower() in ['.php', '.jsp', '.jspx', '.java', '.asp', '.aspx']:
 47 |                     t = self.load_file(file_path)
 48 |                     t_list = list()
 49 |                     t_list.append(t)
 50 |                     x = self.cv.transform(t_list).toarray()
 51 |                     x = self.transformer.transform(x).toarray()
 52 |                     y_pred = self.mlp.predict(x)
 53 |                     counter += 1
 54 |                     if y_pred[0] == 1:
 55 |                         logger.info("{} is webshell".format(file_path))
 56 |                         webshell_number += 1
 57 |                     else:
 58 |                         logger.info("{} is not webshell".format(file_path))
 59 |                         normal_number += 1
 60 |         logger.info("Total files checked: %i, webshells detected: %i, normal files detected: %i" % (counter, webshell_number, normal_number))
 61 | 
 62 |     def check(self, input_data):
 63 |         if os.path.isdir(input_data):
 64 |             return self.checkdir(input_data)
 65 |         elif os.path.isfile(input_data):
 66 |             t = [self.load_file(input_data)]
 67 |             x = self.cv.transform(t).toarray()
 68 |             x = self.transformer.transform(x).toarray()
 69 |             y_pred = self.mlp.predict(x)[0]
 70 |             logger.info("Input file: %s" % input_data)
 71 |             logger.info("Result: %s" % ("webshell" if y_pred == 1 else "normal file"))
 72 |         else:
 73 |             if isinstance(input_data, (bytes, str)):
 74 |                 try:
 75 |                     t = [input_data.decode()]
 76 |                 except AttributeError:
 77 |                     t = [input_data]
 78 |                 x = self.cv.transform(t).toarray()
 79 |                 x = self.transformer.transform(x).toarray()
 80 |                 y_pred = self.mlp.predict(x)[0]
 81 |                 logger.info("Input code: %s" % input_data)
 82 |                 logger.info("Result: %s" % ("webshell" if y_pred == 1 else "normal code"))
 83 |             else:
 84 |                 logger.info("Invalid input")
 85 | 
 86 | 
 87 | if __name__ == "__main__":
 88 |     assert np.lib.NumpyVersion(np.__version__) >= "1.18.5"  # compare as versions, not as strings
 89 |     logger.info("Enter the model version...")
 90 |     while True:
 91 |         try:
 92 |             version = input("version:")
 93 |             logger.info("Initializing model...")
 94 |             webshelldc = WebshellDec(version)
 95 |             logger.info("Model initialized. Enter content to check: script files, code text, and file paths are supported")
 96 |             break
 97 |         except FileNotFoundError:
 98 |             logger.info("No model found for that version; please enter the model version again...")
 99 | 
100 |     while True:
101 |         inputs = input("inputs:")
102 |         if inputs == "exit":
103 |             logger.info("Exiting detection!")
104 |             break
105 |         else:
106 |             webshelldc.check(inputs)
107 |             logger.info("Enter the next input:")
108 | # text = '<?php@eval($_GET[\'p\'])\n<?php assert (    $_GET[\'p\']\n)\n$func="test";$b374k=$func(\'$x\', \'ev\'.\'al\')\n$b=$W(\'\',$S);$b();\n;$pouet($pif,$paf);\n${$pouet}\n\'pouet\'.\'pif\' . \'pouet\' . "lol" ."kwainkwain"\n'
109 | # print(webshelldc.check(text))
110 | 


--------------------------------------------------------------------------------