├── README.md
├── dataset
    └── webshell.zip
├── log
    ├── checklog.log
    └── trainlog.log
├── model
    ├── mlp.pkl
    └── tfidftransformer.pkl
├── pic
    └── 1.jpg
├── train.py
└── webshellDc.py


/README.md:
--------------------------------------------------------------------------------
# webshellDc v0.1

A webshell is usually a server management tool written in a scripting language such as ASP, JSP, PHP, Python, or Perl; it is also called a webadmin. Because a webshell can upload and download files, browse databases, and invoke system commands, attackers often use it to carry out intrusions against servers. Webshells are both high-impact and hard to spot.


This project collects webshell samples (black samples) from 160 GitHub projects and uses a large number of open-source PHP, JSP, ASP, and Java projects as normal samples (white samples). After deduplication there are 2944 black samples and 11945 white samples. The samples are split into n-grams and turned into feature vectors with CountVectorizer and TfidfTransformer, and three models are trained on them: a multilayer neural network, XGBoost, and naive Bayes. Of these, the MLPClassifier model performed comparatively well.



## Usage
```
Training:
python train.py -n webshelldir (path to black samples) -p normaldir (path to white samples) -m mlp (model option)

Testing:
python webshellDc.py
```


## Training environment

> System: macOS 16 GB + Python 3.6.3
> Run time: 134 s


## Run screenshot




Whitelist check:
Total checked: 11945, detected webshells: 23, detected normal files: 11922
False positive rate: 0.0019254918375889493

Blacklist check:
Total checked: 2944, detected webshells: 2925, detected normal files: 19
Recall: 0.993546195652174


## Black samples

https://github.com/tennc/webshell
https://github.com/ysrc/webshell-sample
https://github.com/xl7dev/WebShell
https://github.com/tdifg/WebShell
https://github.com/fictivekin/webshell
https://github.com/bartblaze/PHP-backdoors
https://github.com/malwares/WebShell
https://github.com/xypiie/WebShell
https://github.com/testsecer/WebShell
https://github.com/nbs-system/php-malware-finder
https://github.com/BlackArch/webshells
https://github.com/tanjiti/webshellSample
https://github.com/dotcppfile/DAws
https://github.com/theralfbrown/webshell
https://github.com/gokyle/webshell
https://github.com/sunnyelf/cheetah
https://github.com/JohnTroony/php-webshells
https://github.com/evilcos/python-webshell
https://github.com/lhlsec/webshell
https://github.com/shewey/webshell
https://github.com/boy-hack/WebshellManager
https://github.com/liulongfei/web_shell_bopo
https://github.com/Ni7eipr/webshell
https://github.com/WangYihang/Webshell-Sniper
https://github.com/pm2-hive/pm2-webshell
https://github.com/samdark/yii2-webshell
https://github.com/b1ueb0y/webshell
https://github.com/oneoneplus/webshell
https://github.com/zhaojh329/xterminal
https://github.com/juanparati/Webshell
https://github.com/wofeiwo/webshell-find-tools
https://github.com/abcdlzy/webshell-manager
https://github.com/alert0/webshellch
https://github.com/needle-wang/jweevely
https://github.com/tengzhangchao/PyCmd
https://github.com/0x73686974/WebShell
https://github.com/wonderqs/Blade
https://github.com/le4f/aspexec
https://github.com/jijinggang/WebShell
https://github.com/matiasmenares/Shuffle
https://github.com/Skycrab/PySpy
https://github.com/huge818/webshell
https://github.com/gb-sn/go-webshell
https://github.com/BlackHole1/Fastener
https://github.com/blackhalt/WebShells
https://github.com/tomas1000r/webshell
https://github.com/hanzhibin/Webshell
https://github.com/decebel/webShell
https://github.com/Aviso-hub/Webshell
https://github.com/vnhacker1337/Webshell
https://github.com/bittorrent3389/Webshell
https://github.com/anhday22/WebShell
https://github.com/buxiaomo/webshell
https://github.com/z3robat/webshell
https://github.com/n3oism/webshell
https://github.com/uuleaf/WebShell
https://github.com/onefor1/webshell
https://github.com/cunlin-yu/webshell
https://github.com/roytest1/webshell
https://github.com/backlion/webshell
https://github.com/opetrovski/webshell
https://github.com/opetrovski/webshell
https://github.com/gsmlg/webshell
https://github.com/health901/webshell
https://github.com/inof8r/WebShell
https://github.com/Najones19746/webShell
https://github.com/RaspiCar/WebShell
https://github.com/health901/webshell
https://github.com/dinamsky/WebShell
https://github.com/Fay48/WebShell
https://github.com/tuz358/webshell
https://github.com/shajf/Webshell
https://github.com/t17lab/WebShell
https://github.com/blacksunwen/webshell
https://github.com/webshellarchive/webshellco
https://github.com/lolwaleet/Rubshell
https://github.com/WhiteWinterWolf/WhiteWinterWolf-php-webshell
https://github.com/goodtouch/jruby-webshell
https://github.com/maestrano/webshell-server
https://github.com/LuciferoO/webshell-collector
https://github.com/wangeradd1/myWebShell
https://github.com/0xHJK/caidao
https://github.com/alintamvanz/1945shell
https://github.com/Venen0/vshell
https://github.com/lojikil/tinyshell
https://github.com/wso-shell/PHP-SHELL-WSO
https://github.com/meme-lord/PHPShellBackdoors
https://github.com/Learn2Better/51mp3L-Web-Backdoor
https://github.com/yuxiaokui/JBoss-Hack
https://github.com/SecurityRiskAdvisors/cmd.jsp
https://github.com/ddcunningham/crude-shellhunter
https://github.com/stormdark/BackdoorPHP
https://github.com/vduddu/Malware
https://github.com/1oid/BurstPHPshell
https://github.com/gokyle/urlshorten_ng
https://github.com/rhelsing/trello_osx
https://github.com/pfrazee/wsh-grammar
https://github.com/x-o-r-r-o/PHP-Webshells-Collection
https://github.com/IHA114/WebShell2
https://github.com/WangYihang/WebShellCracker
https://github.com/KINGSABRI/WebShellConsole
https://github.com/jujinesy/webshells.17.03.18
https://github.com/hackzsd/HandyShells
https://github.com/mperlet/pomsky
https://github.com/cybernoir/bns-php-shell
https://github.com/XianThi/rexShell
https://github.com/H4CK3RT3CH/php-webshells
https://github.com/minisllc/subshell
https://github.com/linuxsec/indoxploit-shell
https://github.com/kuniasahi/mpshell
https://github.com/datasiph0n/MyBB-Shell-Plugin
https://github.com/magicming200/evil-koala-php-webshell
https://github.com/0xK3v/Simple-WebShell
https://github.com/djoq/docker-pm2-webshell
https://github.com/SMRUCC/GCModeller.WebShell
https://github.com/darknesstiller/WebShells
https://github.com/devilscream/remoteshell
https://github.com/0verl0ad/gorosaurus
https://github.com/grCod/poly
https://github.com/cryptobioz/wizhack
https://github.com/amwso/docker-webshell
https://github.com/William-Hunter/JSP_Webshell
https://github.com/yangbaopeng/ashx_webshell
https://github.com/webshellpub/awsome-webshell
https://github.com/noalh8t/simple-webshell
https://github.com/s3cureshell/wso-2.8-web-shell
https://github.com/LiamRandall/simpleexec
https://github.com/Samorodek/humhub-modules-webshell
https://github.com/mwambler/webshell-xpages-ext-lib
https://github.com/AVGP/Wesh
https://github.com/edibledinos/weevely3-stealth
https://github.com/lehins/haskell-webshell
https://github.com/guglia001/php-secure-remove
https://github.com/gokyle/webshell_tutorial
https://github.com/azmanishak/webshell-php
https://github.com/andrefernandes/docker-webshell
https://github.com/codehz/node-webshell
https://github.com/koolshare/merlin-webshell
https://github.com/StephaneP/erl-webshell
https://github.com/jjjmaracay3/webshells
https://github.com/grCod/webshells
https://github.com/ian4hu/bootshell
https://github.com/Ghostboy-287/wso-webshell
https://github.com/xiaoxiaoleo/xiao-webshell
https://github.com/alexbires/webshellmanagement
https://github.com/codeT/collectWebShell
https://github.com/PhilCodeEx/jak3fr0z
https://github.com/Ettack/WebshellCCL
https://github.com/jubal-R/TinyWebShell
https://github.com/CaledoniaProject/AxisInvoker
https://github.com/theBrianCui/ISSS_webShell
https://github.com/webshell/webshell-node-sdk
https://github.com/Medicean/AS_BugScan
https://github.com/3xp10it/xwebshell
https://github.com/niemand-sec/RazorSyntaxWebshell
https://github.com/LuciferoO/webshell-collector
https://github.com/0verl0ad/HideShell
https://github.com/L-codes/oneshellcrack
https://github.com/ArchAssault-Project/webshells
https://github.com/AndrHacK/andrshell


--------------------------------------------------------------------------------
/dataset/webshell.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hooog/webshellDc/164564520f2598491b3e3d747410369a97d44f9f/dataset/webshell.zip


--------------------------------------------------------------------------------
/log/trainlog.log:
--------------------------------------------------------------------------------
2020-07-20 15:48:29,770 - train.py[line:122] - INFO: Training time: 148.7166247367859
2020-07-20 16:20:34,958 - train.py[line:72] - INFO: White samples: 11955
2020-07-20 16:20:34,959 - train.py[line:73] - INFO: Black samples: 2944
2020-07-20 16:24:30,351 - train.py[line:72] - INFO: White samples: 11955
2020-07-20 16:24:30,351 - train.py[line:73] - INFO: Black samples: 2944
2020-07-20 16:27:06,981 - train.py[line:116] - INFO: Initial parameters:
2020-07-20 16:27:50,587 - train.py[line:116] - INFO: Initial parameters:
2020-07-20 16:27:50,587 - train.py[line:117] - INFO: Current training version: v0
2020-07-20 16:27:50,587 - train.py[line:118] - INFO: White samples: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/normal
2020-07-20 16:27:50,588 - train.py[line:119] - INFO: Black samples: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/webshell
2020-07-20 16:27:50,588 - train.py[line:120] - INFO: Training model: mlp
2020-07-20 16:27:50,588 - train.py[line:121] - INFO: Random seed: 777
2020-07-20 16:27:50,588 - train.py[line:122] - INFO: Feature dimensions: 25000
2020-07-20 16:28:13,146 - train.py[line:72] - INFO: White samples: 11955
2020-07-20 16:28:13,146 - train.py[line:73] - INFO: Black samples: 2944
2020-07-20 16:29:10,831 - train.py[line:116] - INFO: Initial parameters:
2020-07-20 16:29:10,831 - train.py[line:117] - INFO: Current training version: v0
2020-07-20 16:29:10,831 - train.py[line:118] - INFO: White sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/normal
2020-07-20 16:29:10,831 - train.py[line:119] - INFO: Black sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/webshell
2020-07-20 16:29:10,831 - train.py[line:120] - INFO: Training model: mlp
2020-07-20 16:29:10,831 - train.py[line:121] - INFO: Random seed: 777
2020-07-20 16:29:10,831 - train.py[line:122] - INFO: Feature dimensions: 25000
2020-07-20 16:29:25,707 - train.py[line:72] - INFO: Total white samples: 11955
2020-07-20 16:29:25,707 - train.py[line:73] - INFO: Total black samples: 2944
2020-07-20 16:32:08,022 - train.py[line:99] - INFO: Training set evaluation:
2020-07-20 16:32:08,585 - train.py[line:89] - INFO: Accuracy: 0.999808227059162
2020-07-20 16:32:08,601 - train.py[line:90] - INFO: [[8372    1]
 [   1 2055]]
2020-07-20 16:32:08,609 - train.py[line:91] - INFO:              precision    recall  f1-score   support

          0       1.00      1.00      1.00      8373
          1       1.00      1.00      1.00      2056

avg / total       1.00      1.00      1.00     10429

2020-07-20 16:32:08,611 - train.py[line:101] - INFO: Test set evaluation:
2020-07-20 16:32:09,026 - train.py[line:89] - INFO: Accuracy: 0.9910514541387024
2020-07-20 16:32:09,035 - train.py[line:90] - INFO: [[3560   22]
 [  18  870]]
2020-07-20 16:32:09,039 - train.py[line:91] - INFO:              precision    recall  f1-score   support

          0       0.99      0.99      0.99      3582
          1       0.98      0.98      0.98       888

avg / total       0.99      0.99      0.99      4470

2020-07-20 16:32:09,417 - train.py[line:127] - INFO: Training time: 178.58603620529175
2020-07-20 16:32:34,282 - train.py[line:116] - INFO: Initial parameters:
2020-07-20 16:32:34,282 - train.py[line:117] - INFO: Current training version: v1
2020-07-20 16:32:34,282 - train.py[line:118] - INFO: White sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/normal
2020-07-20 16:32:34,282 - train.py[line:119] - INFO: Black sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/webshell
2020-07-20 16:32:34,283 - train.py[line:120] - INFO: Training model: mlp
2020-07-20 16:32:34,283 - train.py[line:121] - INFO: Random seed: 777
2020-07-20 16:32:34,283 - train.py[line:122] - INFO: Feature dimensions: 25000
2020-07-20 16:32:55,886 - train.py[line:116] - INFO: Initial parameters:
2020-07-20 16:32:55,887 - train.py[line:117] - INFO: Current training version: v1
2020-07-20 16:32:55,887 - train.py[line:118] - INFO: White sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/normal
2020-07-20 16:32:55,887 - train.py[line:119] - INFO: Black sample path: /Users/hadoop/Jupyter/恶意代码/webshell/dataset/webshell
2020-07-20 16:32:55,887 - train.py[line:120] - INFO: Training model: mlp
2020-07-20 16:32:55,887 - train.py[line:121] - INFO: Random seed: 666
2020-07-20 16:32:55,887 - train.py[line:122] - INFO: Feature dimensions: 25000
2020-07-20 16:33:15,973 - train.py[line:72] - INFO: Total white samples: 11955
2020-07-20 16:33:15,973 - train.py[line:73] - INFO: Total black samples: 2944
2020-07-20 16:35:51,871 - train.py[line:99] - INFO: Training set evaluation:
2020-07-20 16:35:53,377 - train.py[line:89] - INFO: Accuracy: 1.0
2020-07-20 16:35:53,421 - train.py[line:90] - INFO: [[8393    0]
 [   0 2036]]
2020-07-20 16:35:53,438 - train.py[line:91] - INFO:              precision    recall  f1-score   support

          0       1.00      1.00      1.00      8393
          1       1.00      1.00      1.00      2036

avg / total       1.00      1.00      1.00     10429

2020-07-20 16:35:53,438 - train.py[line:101] - INFO: Test set evaluation:
2020-07-20 16:35:54,622 - train.py[line:89] - INFO: Accuracy: 0.9937360178970918
2020-07-20 16:35:54,642 - train.py[line:90] - INFO: [[3552   10]
 [  18  890]]
2020-07-20 16:35:54,665 - train.py[line:91] - INFO:              precision    recall  f1-score   support

          0       0.99      1.00      1.00      3562
          1       0.99      0.98      0.98       908

avg / total       0.99      0.99      0.99      4470

2020-07-20 16:35:55,162 - train.py[line:127] - INFO: Training time: 179.27528595924377
2020-08-12 14:41:26,580 - train.py[line:116] - INFO: Initial parameters:
2020-08-12 14:41:26,580 - train.py[line:117] - INFO: Current training version: v0
2020-08-12 14:41:26,580 - train.py[line:118] - INFO: White sample path: dataset/dataset/normal
2020-08-12 14:41:26,580 - train.py[line:119] - INFO: Black sample path: dataset/dataset/webshell
2020-08-12 14:41:26,580 - train.py[line:120] - INFO: Training model: mlp
2020-08-12 14:41:26,580 - train.py[line:121] - INFO: Random seed: 777
2020-08-12 14:41:26,580 - train.py[line:122] - INFO: Feature dimensions: 25000
2020-08-12 14:41:44,037 - train.py[line:72] - INFO: Total white samples: 11928
2020-08-12 14:41:44,037 - train.py[line:73] - INFO: Total black samples: 2944
2020-08-12 14:43:52,266 - train.py[line:99] - INFO: Training set evaluation:
2020-08-12 14:43:52,635 - train.py[line:89] - INFO: Accuracy: 0.9998078770413065
2020-08-12 14:43:52,651 - train.py[line:90] - INFO: [[8351    1]
 [   1 2057]]
2020-08-12 14:43:52,656 - train.py[line:91] - INFO:              precision    recall  f1-score   support

          0       1.00      1.00      1.00      8352
          1       1.00      1.00      1.00      2058

avg / total       1.00      1.00      1.00     10410

2020-08-12 14:43:52,659 - train.py[line:101] - INFO: Test set evaluation:
2020-08-12 14:43:52,880 - train.py[line:89] - INFO: Accuracy: 0.9937247870909905
2020-08-12 14:43:52,885 - train.py[line:90] - INFO: [[3563   13]
 [  15  871]]
2020-08-12 14:43:52,888 - train.py[line:91] - INFO:              precision    recall  f1-score   support

          0       1.00      1.00      1.00      3576
          1       0.99      0.98      0.98       886

avg / total       0.99      0.99      0.99      4462

2020-08-12 14:43:53,142 - train.py[line:127] - INFO: Training time: 146.56202912330627
2020-08-12 15:28:12,545 - train.py[line:140] - INFO: Initial parameters:
2020-08-12 15:28:12,545 - train.py[line:141] - INFO: Current training version: v0
2020-08-12 15:28:12,545 - train.py[line:142] - INFO: White sample path: dataset/dataset/normal
2020-08-12 15:28:12,546 - train.py[line:143] - INFO: Black sample path: dataset/dataset/webshell
2020-08-12 15:28:12,546 - train.py[line:144] - INFO: Training model: mlp
2020-08-12 15:28:12,546 - train.py[line:145] - INFO: Random seed: 777
2020-08-12 15:28:12,546 - train.py[line:146] - INFO: Feature dimensions: 25000
2020-08-12 15:28:31,642 - train.py[line:73] - INFO: Total white samples: 11928
2020-08-12 15:28:31,642 - train.py[line:74] - INFO: Total black samples: 2944
2020-08-12 15:30:36,342 - train.py[line:124] - INFO: Test set evaluation:
2020-08-12 15:30:36,589 - train.py[line:112] - INFO: Accuracy: 0.9937247870909905
2020-08-12 15:30:36,596 - train.py[line:113] - INFO: [[3563   13]
 [  15  871]]
2020-08-12 15:30:36,601 - train.py[line:114] - INFO:              precision    recall  f1-score   support

          0       1.00      1.00      1.00      3576
          1       0.99      0.98      0.98       886

avg / total       0.99      0.99      0.99      4462

2020-08-12 15:30:37,934 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
2020-08-12 15:30:37,974 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=12.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
2020-08-12 15:32:47,656 - train.py[line:151] - INFO: Training time: 275.1095280647278
2020-08-12 15:38:10,802 - train.py[line:147] - INFO: Initial parameters:
2020-08-12 15:38:10,802 - train.py[line:148] - INFO: Current training version: v0
2020-08-12 15:38:10,802 - train.py[line:149] - INFO: White sample path: dataset/dataset/normal
2020-08-12 15:38:10,802 - train.py[line:150] - INFO: Black sample path: dataset/dataset/webshell
2020-08-12 15:38:10,802 - train.py[line:151] - INFO: Training model: mlp
2020-08-12 15:38:10,802 - train.py[line:152] - INFO: Random seed: 777
2020-08-12 15:38:10,802 - train.py[line:153] - INFO: Feature dimensions: 25000
2020-08-12 15:38:28,691 - train.py[line:73] - INFO: Total white samples: 11928
2020-08-12 15:38:28,691 - train.py[line:74] - INFO: Total black samples: 2944
2020-08-12 15:40:40,195 - train.py[line:131] - INFO: Test set evaluation:
2020-08-12 15:40:40,449 - train.py[line:119] - INFO: Accuracy: 0.9937247870909905
2020-08-12 15:40:40,456 - train.py[line:120] - INFO: [[3563   13]
 [  15  871]]
2020-08-12 15:40:40,459 - train.py[line:121] - INFO:              precision    recall  f1-score   support

          0       1.00      1.00      1.00      3576
          1       0.99      0.98      0.98       886

avg / total       0.99      0.99      0.99      4462

2020-08-12 15:40:41,361 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
2020-08-12 15:40:41,390 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=12.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
2020-08-12 15:48:35,288 - train.py[line:158] - INFO: Training time: 624.4852471351624
2020-08-12 15:53:13,740 - train.py[line:147] - INFO: Initial parameters:
2020-08-12 15:53:13,740 - train.py[line:148] - INFO: Current training version: v0
2020-08-12 15:53:13,741 - train.py[line:149] - INFO: White sample path: dataset/dataset/normal
2020-08-12 15:53:13,741 - train.py[line:150] - INFO: Black sample path: dataset/dataset/webshell
2020-08-12 15:53:13,741 - train.py[line:151] - INFO: Training model: mlp
2020-08-12 15:53:13,741 - train.py[line:152] - INFO: Random seed: 777
2020-08-12 15:53:13,741 - train.py[line:153] - INFO: Feature dimensions: 25000
2020-08-12 15:53:31,687 - train.py[line:73] - INFO: Total white samples: 11928
2020-08-12 15:53:31,687 - train.py[line:74] - INFO: Total black samples: 2944
2020-08-12 15:55:42,848 - train.py[line:131] - INFO: Test set evaluation:
2020-08-12 15:55:43,293 - train.py[line:119] - INFO: Accuracy: 0.9937247870909905
2020-08-12 15:55:43,300 - train.py[line:120] - INFO: [[3563   13]
 [  15  871]]
2020-08-12 15:55:43,303 - train.py[line:121] - INFO:              precision    recall  f1-score   support

          0       1.00      1.00      1.00      3576
          1       0.99      0.98      0.98       886

avg / total       0.99      0.99      0.99      4462

2020-08-12 15:55:50,371 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
2020-08-12 15:55:50,398 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=12.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
2020-08-12 15:58:50,293 - train.py[line:158] - INFO: Training time: 336.55256724357605
2020-08-13 09:06:01,743 - train.py[line:147] - INFO: Initial parameters:
2020-08-13 09:06:01,744 - train.py[line:148] - INFO: Current training version: v0
2020-08-13 09:06:01,744 - train.py[line:149] - INFO: White sample path: dataset/dataset/normal
2020-08-13 09:06:01,744 - train.py[line:150] - INFO: Black sample path: dataset/dataset/webshell
2020-08-13 09:06:01,744 - train.py[line:151] - INFO: Training model: mlp
2020-08-13 09:06:01,744 - train.py[line:152] - INFO: Random seed: 777
2020-08-13 09:06:01,744 - train.py[line:153] - INFO: Feature dimensions: 25000
2020-08-13 09:06:21,212 - train.py[line:73] - INFO: Total white samples: 11928
2020-08-13 09:06:21,212 - train.py[line:74] - INFO: Total black samples: 2944
2020-08-13 09:08:31,024 - train.py[line:131] - INFO: Test set evaluation:
2020-08-13 09:08:31,256 - train.py[line:119] - INFO: Accuracy: 0.9937247870909905
2020-08-13 09:08:31,268 - train.py[line:120] - INFO: [[3563   13]
 [  15  871]]
2020-08-13 09:08:31,270 - train.py[line:121] - INFO:              precision    recall  f1-score   support

          0       1.00      1.00      1.00      3576
          1       0.99      0.98      0.98       886

avg / total       0.99      0.99      0.99      4462

2020-08-13 09:08:32,249 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=10.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
2020-08-13 09:08:32,284 - /Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/font_manager.py[line:1343] - DEBUG: findfont: Matching :family=sans-serif:style=normal:variant=normal:weight=normal:stretch=normal:size=12.0 to DejaVu Sans ('/Users/hadoop/anaconda3/lib/python3.6/site-packages/matplotlib/mpl-data/fonts/ttf/DejaVuSans.ttf') with score of 0.050000
2020-08-13 09:09:50,606 - train.py[line:158] - INFO: Training time: 228.86156272888184


--------------------------------------------------------------------------------
/model/mlp.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hooog/webshellDc/164564520f2598491b3e3d747410369a97d44f9f/model/mlp.pkl


--------------------------------------------------------------------------------
/model/tfidftransformer.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hooog/webshellDc/164564520f2598491b3e3d747410369a97d44f9f/model/tfidftransformer.pkl
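
The confusion matrices in trainlog.log follow scikit-learn's layout, rows being true labels and columns predicted labels, so with class 1 as webshell they read as `[[TN, FP], [FN, TP]]`. As a small worked example, the sketch below recomputes the headline metrics from the test-set matrix `[[3563 13] [15 871]]` logged on 2020-08-12. The helper name `binary_metrics` is introduced here for illustration only and is not part of the repository; it also assumes no denominator is zero.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),            # of files flagged as webshell, share that are
        "recall": tp / (tp + fn),               # of real webshells, share that were flagged
        "false_positive_rate": fp / (fp + tn),  # of normal files, share wrongly flagged
    }


# Counts taken from the logged test-set matrix [[3563 13], [15 871]].
m = binary_metrics(tn=3563, fp=13, fn=15, tp=871)
for name, value in m.items():
    print("%s: %.4f" % (name, value))
```

The accuracy agrees with the logged 0.9937247870909905, and the class-1 precision and recall round to the 0.99 and 0.98 shown in the classification report.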


--------------------------------------------------------------------------------
/pic/1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hooog/webshellDc/164564520f2598491b3e3d747410369a97d44f9f/pic/1.jpg


--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# coding=utf-8
# Python3
"""
@Author: liuhaojie
@File  : train.py
@Idea  : IntelliJ IDEA
@Date  : 2020/7/16
@Desc  : Descriptions
"""
import logging
import optparse
import os
import time
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

try:
    import joblib  # standalone package
except ImportError:
    # Older scikit-learn bundled joblib; this module was removed in scikit-learn 0.23.
    from sklearn.externals import joblib

logging.basicConfig(level=logging.DEBUG,
                    filename='log/trainlog.log',
                    filemode='a',
                    format='%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler())


def model_collection(mode):
    """Return an untrained classifier for the given mode: 'mlp', 'xgb' or 'gnb'."""
    if mode == 'mlp':
        return MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
    if mode == 'xgb':
        return XGBClassifier()
    if mode == 'gnb':
        return GaussianNB()
    raise ValueError("unknown model: %s" % mode)


def read_file(filename):
    """Read a file as bytes, stripping CR and TAB bytes from both ends of each line."""
    text = b""
    with open(filename, "rb") as f:
        for line in f:
            line = line.strip(b"\r\t")
            text += line
    return text


def read_dir(path):
    text_list = []
    for r, d, files in os.walk(path):
        for file in files:
            filename = r + "/" + file
            # Only files with a known script extension are read.
            if os.path.splitext(file)[-1].lower() in filetypes:
                text = read_file(filename)
                text_list.append(text)
    return text_list


def features_process(negativedir, postivedir, maxfeatures):
    webshell_texts = read_dir(negativedir)
    normal_texts = read_dir(postivedir)
    webshell_number = len(webshell_texts)
    normal_number = len(normal_texts)
    texts = webshell_texts + normal_texts
    # Label 1 = webshell (black sample), label 0 = normal file (white sample).
    webshell_lables = [1] * webshell_number
    normal_lables = [0] * normal_number
    lables = webshell_lables + normal_lables
    logger.info("Total white samples: %i" % normal_number)
    logger.info("Total black samples: %i" % webshell_number)

    # Word-level 2-grams; runs of punctuation count as tokens too.
    countvectorizer = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                      min_df=1, analyzer="word",
                                      token_pattern=r'[^\w\s]+|\b\w+\b',
                                      max_features=maxfeatures)
    tfidftransformer = TfidfTransformer(smooth_idf=False)
    cv_x = countvectorizer.fit_transform(texts).toarray()
    tf_x = tfidftransformer.fit_transform(cv_x).toarray()

    joblib.dump(countvectorizer, "model/countvectorizer_" + options.version + ".pkl")
    joblib.dump(tfidftransformer, "model/tfidftransformer_" + options.version + ".pkl")
    return tf_x, lables, countvectorizer, tfidftransformer


def plot_roc(x_test, y_test, clf):
    """
    Plot the ROC curve when the model is mlp.
    :param x_test:
    :param y_test:
    :param clf:
    :return:
    """
    if options.mode == 'mlp':
        y_test_score = clf.predict_proba(x_test)[:, 1]
        # y_pred = clf.predict(x_test)
        fpr, tpr, threshold = roc_curve(y_test, y_test_score)
        roc_auc = auc(fpr, tpr)
        lw = 2
        plt.figure(figsize=(7, 7))
        # x-axis: false positive rate, y-axis: true positive rate
        plt.plot(fpr, tpr, color='darkorange',
                 lw=lw, label='webshellDc (AUC = %0.5f)' % roc_auc)
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver operating characteristic curve')
        plt.legend(loc="lower right")
        plt.show()
    else:
        pass


def evaluation(y_test, y_pred):
    logger.info("Accuracy: %s" % metrics.accuracy_score(y_test, y_pred))
    logger.info(confusion_matrix(y_test, y_pred))
    logger.info(classification_report(y_test, y_pred))


def train(trainset, lables, mode, seed):
    # Hold out 30% of the samples as the test set.
    x_train, x_test, y_train, y_test = train_test_split(trainset, lables, test_size=0.3, random_state=seed)
    clf = model_collection(mode)
    clfname = "model/" + mode + "_" + options.version + ".pkl"
    clf.fit(x_train, y_train)
    # logger.info("Training set evaluation:")
    # evaluation(y_train, clf.predict(x_train))
    logger.info("Test set evaluation:")
    evaluation(y_test, clf.predict(x_test))
    joblib.dump(clf, clfname, compress=3)
    plot_roc(x_test, y_test, clf)


if __name__ == "__main__":
    parser = optparse.OptionParser()
    parser.add_option("-v", "--version", dest="version", default="v0", help=u'Current training version')
    parser.add_option("-s", "--seed", dest="seed", default=777, type="int", help=u'Random seed for model training')
    parser.add_option("-p", "--postive_dir", dest="normal", default=False, help=u'Path to the white (normal) sample directory')
    parser.add_option("-n", "--negative_dir", dest="webshell", default=False, help=u'Path to the black (webshell) sample directory')
    parser.add_option("-m", "--model", dest="mode", default="mlp", help=u'Model type to train')
    parser.add_option("-d", "--dimensions", dest="max_features", default=25000, type="int", help=u'Feature vector dimensions')
    options, _ = parser.parse_args()
    filetypes = ['.php', '.jsp', '.asp', '.aspx', '.jspx', '.java', '.txt']
    logger.info("Initial parameters:")
    logger.info("Current training version: %s" % options.version)
    logger.info("White sample path: %s" % options.normal)
    logger.info("Black sample path: %s" % options.webshell)
    logger.info("Training model: %s" % options.mode)
    logger.info("Random seed: %s" % options.seed)
    logger.info("Feature dimensions: %s" % options.max_features)
    sTime = time.time()
    x, y, cv, transformer = features_process(options.webshell, options.normal, options.max_features)
    train(x, y, options.mode, options.seed)
    eTime = time.time()
    logger.info("Training time: %s" % str(eTime - sTime))

# /Users/hadoop/Jupyter/恶意代码/webshell/dataset/


--------------------------------------------------------------------------------
/webshellDc.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# coding=utf-8
# Python3
"""
@Author: liuhaojie
@File  : webshellDc.py
@Idea  : IntelliJ IDEA
@Date  : 2020/7/15
@Desc  : Descriptions
"""
import logging
import os
import numpy as np

try:
    import joblib  # standalone package
except ImportError:
    # Older scikit-learn bundled joblib; this module was removed in scikit-learn 0.23.
    from sklearn.externals import joblib

logging.basicConfig(level=logging.DEBUG,
                    filename='log/checklog.log',
                    filemode='a',
                    format='%(asctime)s - %(pathname)s[line:%(lineno)d] - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler())


class WebshellDec(object):
    def __init__(self, vs):
        super(WebshellDec, self).__init__()
        # Load the vectorizer, TF-IDF transformer and classifier saved by train.py for version `vs`.
        self.cv = joblib.load("model/countvectorizer_" + vs + ".pkl")
        self.transformer = joblib.load("model/tfidftransformer_" + vs + ".pkl")
        self.mlp = joblib.load("model/mlp_" + vs + ".pkl")

    @staticmethod
    def load_file(file_path):
        t = b''
        with open(file_path, "rb") as f:
            for line in f:
                line = line.strip(b'\r\t')
                t += line
        return t

    def checkdir(self, path):
        counter, webshell_number, normal_number = 0, 0, 0
        for r, d, files in os.walk(path):
            for file in files:
                file_path = r + '/' + file
                if os.path.splitext(file)[-1].lower() in ['.php', '.jsp', '.jspx', '.java', '.asp', '.aspx']:
                    t = self.load_file(file_path)
                    t_list = list()
                    t_list.append(t)
                    x = self.cv.transform(t_list).toarray()
                    x = self.transformer.transform(x).toarray()
                    y_pred = self.mlp.predict(x)
                    counter += 1
                    if y_pred[0] == 1:
                        logger.info("{} is webshell".format(file_path))
                        webshell_number += 1
                    else:
                        logger.info("{} is not webshell".format(file_path))
                        normal_number += 1
        logger.info("Total checked: %i, detected webshells: %i, detected normal files: %i" % (counter, webshell_number, normal_number))

    def check(self, input_data):
        if os.path.isdir(input_data):
            return self.checkdir(input_data)
        elif os.path.isfile(input_data):
            t = [self.load_file(input_data)]
            x = self.cv.transform(t).toarray()
            x = self.transformer.transform(x).toarray()
            y_pred = self.mlp.predict(x)[0]
            logger.info("Input file: %s" % input_data)
            # The conditional is parenthesized because "%" binds tighter than "if ... else";
            # without the parentheses the "normal" branch logs without its prefix.
            logger.info("Result: %s" % ("webshell" if y_pred == 1 else "normal file"))
        else:
            if type(input_data) is bytes or type(input_data) is str:
                try:
                    t = [input_data.decode()]
                except AttributeError:
                    t = [input_data]
                x = self.cv.transform(t).toarray()
                x = self.transformer.transform(x).toarray()
                y_pred = self.mlp.predict(x)[0]
                logger.info("Input code: %s" % input_data)
                logger.info("Result: %s" % ("webshell" if y_pred == 1 else "normal code"))
            else:
                logger.info("Invalid input")


if __name__ == "__main__":
    # NumpyVersion compares numerically; as plain strings, "1.9.0" would sort above "1.18.5".
    assert np.lib.NumpyVersion(np.__version__) >= "1.18.5"
    logger.info("Enter the model version...")
    while True:
        try:
            version = input("version:")
            logger.info("Initializing model...")
            webshelldc = WebshellDec(version)
            logger.info("Model initialized. Enter the content to check: a script file, source code text, or a directory path")
            break
        except FileNotFoundError:
            logger.info("No model found for that version; enter the model version again...")

    while True:
        inputs = input("inputs:")
        if inputs == "exit":
            logger.info("Exiting detection!")
            break
        else:
            webshelldc.check(inputs)
            logger.info("Continue entering input:")
# text = '
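
train.py builds its features from word-level 2-grams using the token pattern `[^\w\s]+|\b\w+\b`, so runs of punctuation such as `<?` or `']);` are kept as tokens alongside identifiers. To make that concrete, here is a rough pure-Python approximation of the tokenization step: lowercase the text, find the tokens, then pair adjacent ones. It mirrors the settings passed to CountVectorizer but is only a sketch, not scikit-learn's implementation; `word_bigrams` and the sample string are made up for illustration.

```python
import re

# Same pattern that train.py passes to CountVectorizer as token_pattern.
TOKEN_PATTERN = re.compile(r'[^\w\s]+|\b\w+\b')


def word_bigrams(source):
    """Approximate CountVectorizer(analyzer="word", ngram_range=(2, 2)) tokenization."""
    tokens = TOKEN_PATTERN.findall(source.lower())  # CountVectorizer lowercases by default
    # Join each pair of adjacent tokens with a space to form a 2-gram.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


sample = "<?php eval($_POST['cmd']); ?>"  # hypothetical one-line input
for gram in word_bigrams(sample):
    print(gram)
```

Keeping punctuation runs as tokens lets pairs such as `eval ($` become features, which a purely alphanumeric tokenizer would discard.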