├── .gitignore
├── README.md
├── decode_objurl.py
├── image_crawler.sh
└── query_list.txt

/.gitignore:
--------------------------------------------------------------------------------
.*
!.gitignore
baidu/
google/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# image\_crawler v1.0.4
This is an image crawler written in pure shell script. Given a keyword, it downloads the images (actually the image URLs found on the web) for that keyword. It visits Google Images/Baidu Images as a human search would, using the keywords you provide, and then parses and records the results (image URLs) returned by Google or Baidu. Once you have the image URLs, you can download the actual images with any scripting language you like (a minimal wget sketch follows the usage section below). This tool can also be used to compare search quality and relevance between Google Images and Baidu Images.

## What's NEW?
* Now it supports both wget and curl!
* Performance is enhanced roughly 10x by increasing parallelism with multi-process background jobs in bash (a sketch of the pattern follows the Performance section).
* Baidu/Google image search URLs are updated. This script was created about 5 years ago, and things have changed a lot since then: both Baidu and Google changed their image search query parameters and rules. So I spent quite a bit of time this weekend figuring out what changed and updating the script to make it work again.
* A Python script is introduced to decode the Baidu image objURL. I originally planned to do it in bash, but that requires a hash table, which is painfully complicated to get right in bash alone... So I'll leave it as a TODO for now: port this Python script to bash with the same functionality.
* [Experiment]: Downloading images after parsing out the image URLs. You can turn this feature off by setting EXPERIMENT="OFF".

## TODO:
* Continue to improve the performance.
* Add proxy support?
* Port the Python objURL decoder (decode_objurl.py) to pure bash.
* ...

## How to use it?
* Input: A file named query\_list.txt, one keyword per line.
* Usage: `./image_crawler.sh google <num_of_images>`
* `google` can be replaced by `baidu`.
* `<num_of_images>` is the number of images you want to download for each keyword.
* Output: The script generates a directory named google/ (or baidu/, whichever you chose) containing files named "i\_objURL-list\_keyword[i]", where i is the position of the keyword in query\_list.txt. Each of these files contains `<num_of_images>` lines, one image URL per line.
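
For example, once the URL-list files are generated, you could download the actual images with a small wget loop like the one below. This is only a sketch: the list file name is illustrative (the first keyword in query\_list.txt is "ruby", so its Google results would land in a file like google/1\_objURL-list\_ruby), and you may prefer curl or a Python downloader instead.

    #!/bin/bash
    # Sketch: download every image URL listed in one of the generated files.
    # Adjust the list file and output directory to match your keyword.
    list_file="google/1_objURL-list_ruby"
    out_dir="images/ruby"
    mkdir -p "$out_dir"
    while read -r url; do
        wget -q -P "$out_dir" "$url"   # -P saves downloads into the given directory
    done < "$list_file"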
## Performance
I've tested this script with 10 keywords (the same ones as in query\_list.txt), each keyword crawling 300 results. Results are as follows:

    [unix14 ~/imagecrawler]$ time ./image_crawler.sh google 300
    real    0m5.766s
    user    0m2.425s
    sys     0m2.254s

    [unix14 ~/imagecrawler]$ time ./image_crawler.sh baidu 300
    real    0m11.419s
    user    0m1.254s
    sys     0m1.044s

The result is not bad, and in the future I'll tweak it into a more concurrent version.
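
The speedup described in "What's NEW?" comes from multi-process background jobs in bash. The snippet below is only a sketch of that general pattern (here, one job per keyword), not the actual code in image\_crawler.sh; fetch\_keyword is a hypothetical stand-in for the real fetch-and-parse logic.

    #!/bin/bash
    # Sketch of the fan-out pattern: fetch_keyword is a placeholder for the
    # per-keyword fetch/parse step inside image_crawler.sh.
    num_images=${1:-300}

    fetch_keyword() {
        local keyword=$1 num=$2
        # ...fetch result pages with wget/curl and parse out $num image URLs...
        echo "finished: $keyword"
    }

    while read -r keyword; do
        fetch_keyword "$keyword" "$num_images" &   # one background job per keyword
    done < query_list.txt
    wait                                           # block until every job has finished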
## Note
It works on any platform that supports bash, egrep, awk, python, and wget or curl: Ubuntu, macOS, etc.
--------------------------------------------------------------------------------
/decode_objurl.py:
--------------------------------------------------------------------------------
#!/usr/bin/python

import sys

# Multi-character tokens Baidu uses to encode ':', '.' and '/' in objURL.
encoded_str1 = "_z2C$q"
encoded_str2 = "_z&e3B"
encoded_str3 = "AzdH3F"

# Substitution table: the multi-character tokens above plus a simple
# letter/digit shuffle for the remaining characters.
HASHTABLE = {
    encoded_str1 : ":",
    encoded_str2 : ".",
    encoded_str3 : "/",
    "w" : "a",
    "k" : "b",
    "v" : "c",
    "1" : "d",
    "j" : "e",
    "u" : "f",
    "2" : "g",
    "i" : "h",
    "t" : "i",
    "3" : "j",
    "h" : "k",
    "s" : "l",
    "4" : "m",
    "g" : "n",
    "5" : "o",
    "r" : "p",
    "q" : "q",
    "6" : "r",
    "f" : "s",
    "p" : "t",
    "7" : "u",
    "e" : "v",
    "o" : "w",
    "8" : "1",
    "d" : "2",
    "n" : "3",
    "9" : "4",
    "c" : "5",
    "m" : "6",
    "0" : "7",
    "b" : "8",
    "l" : "9",
    "a" : "0",
}

def DecodeObjUrl(objURL):
    # Walk the encoded string once, replacing the multi-character tokens
    # first and then single characters; anything not in the table passes
    # through unchanged.
    # Illustrative (made-up) example:
    #   "ippr_z2C$qAzdH3FAzdH3Fjxw4rsj_z&e3Bv54AzdH3Fw_z&e3B3r2"
    #   decodes to "http://example.com/a.jpg".
    i = 0
    decoded_url = ""
    while i < len(objURL):
        if (objURL[i] == "_" and
            objURL.find(encoded_str1, i, i + len(encoded_str1)) != -1):
            decoded_url += HASHTABLE[encoded_str1]
            i += len(encoded_str1)
        elif (objURL[i] == "_" and
              objURL.find(encoded_str2, i, i + len(encoded_str2)) != -1):
            decoded_url += HASHTABLE[encoded_str2]
            i += len(encoded_str2)
        elif (objURL[i] == "A" and
              objURL.find(encoded_str3, i, i + len(encoded_str3)) != -1):
            decoded_url += HASHTABLE[encoded_str3]
            i += len(encoded_str3)
        elif objURL[i] in HASHTABLE:
            decoded_url += HASHTABLE[objURL[i]]
            i += 1
        else:
            decoded_url += objURL[i]
            i += 1
    return decoded_url

def ProcessRawObjUrlFile(in_f, out_f):
    # Decode every line of in_f and write the result to out_f.
    with open(in_f, 'r') as infile:
        with open(out_f, 'w') as outfile:
            for objURL in infile:
                decoded_url = DecodeObjUrl(objURL)
                outfile.write(decoded_url)

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: ./decode_objurl.py <raw_objurl_file> <decoded_url_file>")
        sys.exit(1)
    infile = sys.argv[1]
    outfile = sys.argv[2]
    ProcessRawObjUrlFile(infile, outfile)
--------------------------------------------------------------------------------
/image_crawler.sh:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dryruner/imagecrawler/6c79873b883f6d495d79dea2c3b6a787893a3b52/image_crawler.sh
--------------------------------------------------------------------------------
/query_list.txt:
--------------------------------------------------------------------------------
ruby
python
C++
Java
white
black
red
green
blue
pink
--------------------------------------------------------------------------------
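
A quick way to sanity-check the decoder from the shell (the encoded string below is the same made-up example used in the decode_objurl.py comments, not a real Baidu objURL; the file names are arbitrary):

    printf 'ippr_z2C$qAzdH3FAzdH3Fjxw4rsj_z&e3Bv54AzdH3Fw_z&e3B3r2\n' > raw_objurls.txt
    python decode_objurl.py raw_objurls.txt decoded_urls.txt
    cat decoded_urls.txt   # expected output: http://example.com/a.jpg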