├── .gitattributes
├── .gitignore
├── ReadMe.md
├── TuDao.App
│   ├── Program.cs
│   ├── Properties
│   │   └── AssemblyInfo.cs
│   ├── ShangPin.cs
│   ├── TuDao.App.csproj
│   ├── dll
│   │   └── CsQuery.dll
│   └── packages.config
├── TuDao.sln
└── 发布包
    └── 屠刀.zip

/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

# Custom for Visual Studio
*.cs diff=csharp

# Standard to msysgit
*.doc diff=astextplain
*.DOC diff=astextplain
*.docx diff=astextplain
*.DOCX diff=astextplain
*.dot diff=astextplain
*.DOT diff=astextplain
*.pdf diff=astextplain
*.PDF diff=astextplain
*.rtf diff=astextplain
*.RTF diff=astextplain

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.

# User-specific files
*.suo
*.user
*.userosscache
*.sln.docstates

# Build results
[Dd]ebug/
[Dd]ebugPublic/
[Rr]elease/
[Rr]eleases/
x64/
x86/
build/
bld/
[Bb]in/
[Oo]bj/

# Roslyn cache directories
*.ide/

# MSTest test Results
[Tt]est[Rr]esult*/
[Bb]uild[Ll]og.*

#NUNIT
*.VisualState.xml
TestResult.xml

# Build Results of an ATL Project
[Dd]ebugPS/
[Rr]eleasePS/
dlldata.c

*_i.c
*_p.c
*_i.h
*.ilk
*.meta
*.obj
*.pch
*.pdb
*.pgc
*.pgd
*.rsp
*.sbr
*.tlb
*.tli
*.tlh
*.tmp
*.tmp_proj
*.log
*.vspscc
*.vssscc
.builds
*.pidb
*.svclog
*.scc

# Chutzpah Test files
_Chutzpah*

# Visual C++ cache files
ipch/
*.aps
*.ncb
*.opensdf
*.sdf
*.cachefile

# Visual Studio profiler
*.psess
*.vsp
*.vspx

# TFS 2012 Local Workspace
$tf/

# Guidance Automation Toolkit
*.gpState

# ReSharper is a .NET coding add-in
_ReSharper*/
*.[Rr]e[Ss]harper
*.DotSettings.user

# JustCode is a .NET coding add-in
.JustCode

# TeamCity is a build add-in
_TeamCity*

# DotCover is a Code Coverage Tool
*.dotCover

# NCrunch
_NCrunch_*
.*crunch*.local.xml

# MightyMoose
*.mm.*
AutoTest.Net/

# Web workbench (sass)
.sass-cache/

# Installshield output folder
[Ee]xpress/

# DocProject is a documentation generator add-in
DocProject/buildhelp/
DocProject/Help/*.HxT
DocProject/Help/*.HxC
DocProject/Help/*.hhc
DocProject/Help/*.hhk
DocProject/Help/*.hhp
DocProject/Help/Html2
DocProject/Help/html

# Click-Once directory
publish/

# Publish Web Output
*.[Pp]ublish.xml
*.azurePubxml
# TODO: Comment the next line if you want to checkin your web deploy settings
# but database connection strings (with potential passwords) will be unencrypted
*.pubxml
*.publishproj

# NuGet Packages
*.nupkg
# The packages folder can be ignored because of Package Restore
**/packages/*
# except build/, which is used as an MSBuild target.
!**/packages/build/
# If using the old MSBuild-Integrated Package Restore, uncomment this:
#!**/packages/repositories.config

# Windows Azure Build Output
csx/
*.build.csdef

# Windows Store app package directory
AppPackages/

# Others
sql/
*.Cache
ClientBin/
[Ss]tyle[Cc]op.*
~$*
*~
*.dbmdl
*.dbproj.schemaview
*.pfx
*.publishsettings
node_modules/

# RIA/Silverlight projects
Generated_Code/

# Backup & report files from converting an old project file
# to a newer Visual Studio version. Backup files are not needed,
# because we have git ;-)
_UpgradeReport_Files/
Backup*/
UpgradeLog*.XML
UpgradeLog*.htm

# SQL Server files
*.mdf
*.ldf

# Business Intelligence projects
*.rdl.data
*.bim.layout
*.bim_*.settings

# Microsoft Fakes
FakesAssemblies/

# =========================
# Operating System Files
# =========================

# OSX
# =========================

.DS_Store
.AppleDouble
.LSOverride

# Thumbnails
._*

# Files that might appear on external disk
.Spotlight-V100
.Trashes

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

# Windows
# =========================

# Windows image file caches
Thumbs.db
ehthumbs.db

# Folder config file
Desktop.ini

# Recycle Bin used on file shares
$RECYCLE.BIN/

# Windows Installer files
*.cab
*.msi
*.msm
*.msp

# Windows shortcuts
*.lnk

--------------------------------------------------------------------------------
/ReadMe.md:
--------------------------------------------------------------------------------

**Project description:**

> A tool for collecting the product images of an entire Tmall store (main images, color images, and detail-content images; suited to clothing stores; images are saved per product under its item number). It has only been tested against one store so far, so some problems may remain; fixing them is offered as a paid service.

--------------------------------------------------------------------------------
/TuDao.App/Program.cs:
--------------------------------------------------------------------------------
using CsQuery;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Text.RegularExpressions;

namespace TuDao.App
{
    class Program
    {
        static string basePath;
        static string baseListUrl;
        static string baseItemUrl = "https://detail.tmall.com/item.htm?id=";
        static void Main(string[] args)
        {
            Console.ForegroundColor = ConsoleColor.Green;
            Console.WriteLine("This program collects images in three steps (restart the program after each step):");
            Console.WriteLine("Step 1: collect product IDs from the store's product-list page URL");
            Console.WriteLine("Step 2: collect the image URLs for each product ID");
            Console.WriteLine("Step 3: download the images from the collected URLs");
            Console.WriteLine("Which step do you want to run? (Enter 1, 2, or 3, then press Enter)");
            var key = Console.ReadLine();
            if (key == "1")
            {
                Console.WriteLine("Enter the URL of the target store's product-list page:");
                baseListUrl = Console.ReadLine();
                getId();
                Console.WriteLine("Step 1 finished. Press any key to exit.");
            }
            else if (key == "2")
            {
                Console.WriteLine("Starting step 2:");
                prepareData();
                Console.WriteLine("Step 2 finished. Press any key to exit.");
            }
            else if (key == "3")
            {
                Console.WriteLine("Starting step 3:");
                downloadPic();
                Console.WriteLine("Step 3 finished. Press any key to exit.");
            }
            Console.ReadKey();
        }

        static List<string> idlist;

        // Step 3: read each product's config.txt under data/ and download its images.
        static void downloadPic()
        {
            DirectoryInfo basedi = new DirectoryInfo("data");
            foreach (var di in basedi.EnumerateDirectories())
            {
                var jsonstr = File.ReadAllText(Path.Combine(di.FullName, "config.txt"));
                var obj = JObject.Parse(jsonstr);
                var dic1 = JsonConvert.DeserializeObject<Dictionary<string, string>>(obj["TiTu"].ToString());
                var dic2 = JsonConvert.DeserializeObject<Dictionary<string, string>>(obj["SeTu"].ToString());
                var dic3 = JsonConvert.DeserializeObject<Dictionary<string, string>>(obj["NeiRongTu"].ToString());
                eachPic(di.FullName, dic1);
                eachPic(di.FullName, dic2);
                eachPic(di.FullName, dic3);
            }
        }
        // Downloads every image in dic1 (file-name key -> URL) into the directory dicName.
        static void eachPic(string dicName, Dictionary<string, string> dic1)
        {
            foreach (var key in dic1.Keys)
            {
                // File name = dictionary key + the extension taken from the URL.
                var name = Path.Combine(dicName, key + dic1[key].Substring(dic1[key].LastIndexOf('.')));
                try
                {
                    getPic(dic1[key], name);
                    Console.WriteLine("Image downloaded: " + name);
                }
                catch (Exception)
                {
                    // Record URLs that failed to download so they can be retried later.
                    File.AppendAllText("err2.txt", dic1[key] + Environment.NewLine);
                }
            }
        }
        static void getPic(string url, string name)
        {
            ServicePointManager.ServerCertificateValidationCallback = ValidateServerCertificate;
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            request.UseDefaultCredentials = true;
            // Dispose the response and both streams even if the copy throws.
            using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
            using (Stream stream = response.GetResponseStream())
            using (var fileStream = new FileStream(name, FileMode.Create, FileAccess.Write))
            {
                stream.CopyTo(fileStream);
            }
        }
        static string getHtml(string url)
        {
            ServicePointManager.ServerCertificateValidationCallback = ValidateServerCertificate;
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            request.UseDefaultCredentials = true;
            using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
            using (Stream stream = response.GetResponseStream())
            // Encoding.Default is the system ANSI code page, so decoding depends on the machine's locale.
            using (StreamReader reader = new StreamReader(stream, Encoding.Default))
            {
                return reader.ReadToEnd();
            }
        }
        // WARNING: accepts every server certificate, which disables TLS certificate validation.
        static bool ValidateServerCertificate(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors)
        {
            return true;
        }
        // Step 1: walk every page of the product list and write the product IDs to ids.txt.
        static void getId()
        {
            idlist = new List<string>();
            CQ doc = getHtml(baseListUrl);
            // The pager text has the form "current/total"; take the total page count.
            var pageCount = Convert.ToInt32(doc[".ui-page-s-len"].Text().Split('/')[1]);
            var liList = doc[".item"].ToList().Take(60);
            foreach (var obj in liList)
            {
                var id = obj.GetAttribute("data-id");
                idlist.Add(id);
                Console.WriteLine("Collected id: {0}", id);
            }
            for (var i = 2; i <= pageCount; i++)
            {
                CQ doc1 = getHtml(baseListUrl + "&pageNo=" + i.ToString());
                var liList1 = doc1[".item"].ToList().Take(60);
                foreach (var obj in liList1)
                {
                    var id = obj.GetAttribute("data-id");
                    idlist.Add(id);
                    Console.WriteLine("Collected id: {0}", id);
                }
            }
            var sb = new StringBuilder();
            foreach (var id in idlist)
            {
                sb.AppendLine(id);
            }
            File.WriteAllText("ids.txt", sb.ToString());
        }
        // Scrapes one product page: item number, main images, color images, and content images.
        static ShangPin getShangPin(string id)
        {
            //id = "522670612044";
            var sp = new ShangPin();
            sp.Id = id;
            var html = getHtml(baseItemUrl + id);
            // Look for the item-number label: 货号 (item no.), then 款号 (style no.), then 型号 (model no.).
            var indexHH = html.IndexOf("货号");
            if (indexHH < 1)
            {
                indexHH = html.IndexOf("款号");
                if (indexHH < 1)
                {
                    indexHH = html.IndexOf("型号");
                    if (indexHH < 1)
                    {
                        // No item number found: log the product ID and skip this product.
                        File.AppendAllText("err.txt", id + Environment.NewLine);
                        return null;
                    }

                }
            }
            if (html.Substring(indexHH - 7, 12).StartsWith("name"))
            {
                // The label is the value of a "name" field; the item number follows in the next quoted value.
                sp.HuoHao = html.Substring(indexHH + 13, 60);
                sp.HuoHao = sp.HuoHao.Substring(0, sp.HuoHao.IndexOf('"'));
            }
            else
            {
                // The label sits in page markup: drop the 3-character label prefix and cut at the next tag.
                sp.HuoHao = html.Substring(indexHH, 60);
                sp.HuoHao = sp.HuoHao.Replace(" ", "").Substring(3);
                sp.HuoHao = sp.HuoHao.Substring(0, sp.HuoHao.IndexOf('<'));
            }

            sp.DetailJsonUrl = html.Substring(html.IndexOf("descUrl") + 10);
            sp.DetailJsonUrl = "https:" + sp.DetailJsonUrl.Substring(0, sp.DetailJsonUrl.IndexOf('"'));
            CQ doc = html;
            var shoutulist = doc["#J_UlThumb img"].ToList();
            var i = 1;
            foreach (var st in shoutulist)
            {
                var src = "https:" + st.GetAttribute("src");
                // Strip the thumbnail-size suffix after the last '_' to get the full-size image URL.
                src = src.Substring(0, src.LastIndexOf('_'));
                // The key becomes the file name: "题图" (main image) + index.
                sp.TiTu.Add("题图" + i, src);
                Console.WriteLine("Collected main image: {0}", src);
                i += 1;
            }
            var setuList = doc[".tb-sku .J_TSaleProp a"].ToList();
            i = 1;
            foreach (var st in setuList)
            {
                var style = st.GetAttribute("style");
                if (string.IsNullOrEmpty(style))
                {
                    continue;
                }
                // Extract the URL between the parentheses of the inline style, e.g. background:url(...).
                style = style.Substring(style.IndexOf("(") + 1);
                style = style.Substring(0, style.IndexOf(")"));
                style = "http:" + style;
                style = style.Substring(0, style.LastIndexOf('_'));
                sp.SeTu.Add(st.InnerText.Trim() + i, style);
                Console.WriteLine("Collected color image: {0}", style);
                i += 1;
            }
            var neirongJsonStr = getHtml(sp.DetailJsonUrl);
            // Regex.Split includes captured groups in its result, so the src values end up in the array.
            // The leading "<img\b[^<>" and the group name were lost from this listing; "imgUrl" is an assumed name.
            var neirongArr = Regex.Split(neirongJsonStr, @"<img\b[^<>]*?\bsrc[\s\t\r\n]*=[\s\t\r\n]*[""']?[\s\t\r\n]*(?<imgUrl>[^\s\t\r\n""'<>]*)[^<>]*?/?[\s\t\r\n]*>", RegexOptions.IgnoreCase);

            i = 1;
            foreach (var nrt in neirongArr)
            {
                // Keep only absolute image URLs and skip the spaceball.gif placeholder.
                if (!nrt.StartsWith("http") || nrt.EndsWith("spaceball.gif"))
                {
                    continue;
                }
                // The key becomes the file name: "内容" (content) + index.
                sp.NeiRongTu.Add("内容" + i, nrt);
                Console.WriteLine("Collected content image: {0}", nrt);
                i += 1;
            }
            return sp;
        }
        // Step 2: for every ID in ids.txt, scrape the product and save it as data/<item number>/config.txt.
        static void prepareData()
        {
            basePath = Directory.CreateDirectory("data").FullName;
            idlist = File.ReadLines("ids.txt").ToList();
            foreach (var id in idlist)
            {
                var sp = getShangPin(id);
                if (sp == null)
                {
                    continue;
                }
                var curP = Path.Combine(basePath, sp.HuoHao);
                Directory.CreateDirectory(curP);
                var jsonStr = JsonConvert.SerializeObject(sp);
                File.WriteAllText(Path.Combine(curP, "config.txt"), jsonStr);
            }
        }
    }
}

--------------------------------------------------------------------------------
/TuDao.App/Properties/AssemblyInfo.cs:
--------------------------------------------------------------------------------
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// General information about an assembly is controlled through the following
// set of attributes. Change these attribute values to modify the information
// associated with an assembly.
[assembly: AssemblyTitle("TuDao.App")]
[assembly: AssemblyDescription("")]
[assembly: AssemblyConfiguration("")]
[assembly: AssemblyCompany("")]
[assembly: AssemblyProduct("TuDao.App")]
[assembly: AssemblyCopyright("Copyright © 2015")]
[assembly: AssemblyTrademark("")]
[assembly: AssemblyCulture("")]

// Setting ComVisible to false makes the types in this assembly not visible
// to COM components. If you need to access a type in this assembly from
// COM, set the ComVisible attribute to true on that type.
[assembly: ComVisible(false)]

// The following GUID is for the ID of the typelib if this project is exposed to COM
[assembly: Guid("63ae8a49-be75-4396-b341-76f854976f2f")]

// Version information for an assembly consists of the following four values:
//
//      Major Version
//      Minor Version
//      Build Number
//      Revision
//
// You can specify all the values, or you can default the Build and Revision Numbers
// by using the '*' as shown below:
// [assembly: AssemblyVersion("1.0.*")]
[assembly: AssemblyVersion("1.0.0.0")]
[assembly: AssemblyFileVersion("1.0.0.0")]

--------------------------------------------------------------------------------
/TuDao.App/ShangPin.cs:
--------------------------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace TuDao.App
{
    // One product (ShangPin) and its image URLs, keyed by the file name to save under.
    public class ShangPin
    {
        // Main (title) images.
        public Dictionary<string, string> TiTu = new Dictionary<string, string>();
        // Color images.
        public Dictionary<string, string> SeTu = new Dictionary<string, string>();
        // Detail-content images.
        public Dictionary<string, string> NeiRongTu = new Dictionary<string, string>();
        public string DetailJsonUrl
        {
            get;
            set;
        }
        // Item number; used as the product's directory name.
        public string HuoHao
        {
            get;
            set;
        }
        public string Id
        {
            get;
            set;
        }
    }
}

--------------------------------------------------------------------------------
/TuDao.App/TuDao.App.csproj:
--------------------------------------------------------------------------------
Debug
AnyCPU
{63AE8A49-BE75-4396-B341-76F854976F2F}
Exe
Properties
TuDao.App
TuDao.App
v4.0
512

AnyCPU
true
full
false
bin\Debug\
DEBUG;TRACE
prompt
4

AnyCPU
pdbonly
true
bin\Release\
TRACE
prompt
4

False
dll\CsQuery.dll

..\packages\Newtonsoft.Json.7.0.1\lib\net40\Newtonsoft.Json.dll
True

--------------------------------------------------------------------------------
/TuDao.App/dll/CsQuery.dll:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xland/TuDao/e6288aaea5f5e80f5284ded684a92ce23a4df3c6/TuDao.App/dll/CsQuery.dll

--------------------------------------------------------------------------------
/TuDao.App/packages.config:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/TuDao.sln:
--------------------------------------------------------------------------------

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 14
VisualStudioVersion = 14.0.23107.0
MinimumVisualStudioVersion = 10.0.40219.1
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "TuDao", "TuDao\TuDao.csproj", "{8AEF09CF-0F9E-4C0B-9CEC-138DC70E6614}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "TuDao.App", "TuDao.App\TuDao.App.csproj", "{63AE8A49-BE75-4396-B341-76F854976F2F}"
EndProject
Global
	GlobalSection(SolutionConfigurationPlatforms) = preSolution
		Debug|Any CPU = Debug|Any CPU
		Release|Any CPU = Release|Any CPU
	EndGlobalSection
	GlobalSection(ProjectConfigurationPlatforms) = postSolution
		{8AEF09CF-0F9E-4C0B-9CEC-138DC70E6614}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
		{8AEF09CF-0F9E-4C0B-9CEC-138DC70E6614}.Debug|Any CPU.Build.0 = Debug|Any CPU
		{8AEF09CF-0F9E-4C0B-9CEC-138DC70E6614}.Release|Any CPU.ActiveCfg = Release|Any CPU
		{8AEF09CF-0F9E-4C0B-9CEC-138DC70E6614}.Release|Any CPU.Build.0 = Release|Any CPU
		{63AE8A49-BE75-4396-B341-76F854976F2F}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
		{63AE8A49-BE75-4396-B341-76F854976F2F}.Debug|Any CPU.Build.0 = Debug|Any CPU
		{63AE8A49-BE75-4396-B341-76F854976F2F}.Release|Any CPU.ActiveCfg = Release|Any CPU
		{63AE8A49-BE75-4396-B341-76F854976F2F}.Release|Any CPU.Build.0 = Release|Any CPU
	EndGlobalSection
	GlobalSection(SolutionProperties) = preSolution
		HideSolutionNode = FALSE
	EndGlobalSection
EndGlobal

--------------------------------------------------------------------------------
/发布包/屠刀.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xland/TuDao/e6288aaea5f5e80f5284ded684a92ce23a4df3c6/发布包/屠刀.zip

--------------------------------------------------------------------------------