├── .gitattributes
├── .gitignore
├── Blog_Png
    ├── 1-1.jpg
    ├── 1-2.png
    ├── 1-3.png
    ├── 1-4.png
    ├── 2-1.png
    ├── 2-2.png
    └── 2-3.png
├── Code
    ├── Image_Process.py
    └── Image_Spider.py
└── README.md

/.gitattributes:
--------------------------------------------------------------------------------
# Auto detect text files and perform LF normalization
* text=auto

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

--------------------------------------------------------------------------------
/Blog_Png/1-1.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FTLIKON/My_Py_Image/572793abedcd30e4b0e9cd67e6c679d02abb46ca/Blog_Png/1-1.jpg
--------------------------------------------------------------------------------
/Blog_Png/1-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FTLIKON/My_Py_Image/572793abedcd30e4b0e9cd67e6c679d02abb46ca/Blog_Png/1-2.png
--------------------------------------------------------------------------------
/Blog_Png/1-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FTLIKON/My_Py_Image/572793abedcd30e4b0e9cd67e6c679d02abb46ca/Blog_Png/1-3.png
--------------------------------------------------------------------------------
/Blog_Png/1-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FTLIKON/My_Py_Image/572793abedcd30e4b0e9cd67e6c679d02abb46ca/Blog_Png/1-4.png
--------------------------------------------------------------------------------
/Blog_Png/2-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FTLIKON/My_Py_Image/572793abedcd30e4b0e9cd67e6c679d02abb46ca/Blog_Png/2-1.png
--------------------------------------------------------------------------------
/Blog_Png/2-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FTLIKON/My_Py_Image/572793abedcd30e4b0e9cd67e6c679d02abb46ca/Blog_Png/2-2.png
--------------------------------------------------------------------------------
/Blog_Png/2-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/FTLIKON/My_Py_Image/572793abedcd30e4b0e9cd67e6c679d02abb46ca/Blog_Png/2-3.png
--------------------------------------------------------------------------------
/Code/Image_Process.py:
--------------------------------------------------------------------------------
from PIL import Image, ImageDraw


val = {}  # binary value of each pixel, keyed by (x, y)
dx = [-1, -1, -1, 0, 0, 1, 1, 1]  # offsets to the eight neighboring pixels
dy = [-1, 0, 1, -1, 1, 1, 0, -1]


def two_value(image, G):  # G: integer binarization threshold
    for y in range(0, image.size[1]):
        for x in range(0, image.size[0]):  # visit every pixel
            g = image.getpixel((x, y))  # gray value, since the image is in mode 'L'
            if g > G:  # brighter than the threshold
                val[(x, y)] = 1
            else:
                val[(x, y)] = 0


def clear_noise(image, N):  # N: noise-removal threshold
    # Set the top-left and bottom-right corner pixels to 1.
    val[(0, 0)] = 1
    val[(image.size[0] - 1, image.size[1] - 1)] = 1
    # Border pixels are skipped so that every visited pixel has eight neighbors.
    for x in range(1, image.size[0] - 1):
        for y in range(1, image.size[1] - 1):
            nearDots = 0  # number of neighbors with the same value
            L = val[(x, y)]  # current pixel
            for j in range(8):  # compare with the eight surrounding pixels
                if L == val[(x + dx[j], y + dy[j])]:
                    nearDots += 1
            if nearDots < N:  # fewer than N matching neighbors: treat as noise, set to 0
                val[(x, y)] = 0


def save_png(filename, size):
    image = Image.new("1", size)  # 1-bit black-and-white image
    draw = ImageDraw.Draw(image)
    for x in range(0, size[0]):
        for y in range(0, size[1]):
            draw.point((x, y), val[(x, y)])  # draw each pixel from val
    image.save(filename)


for i in range(100):
    path = "image/" + str(i) + ".png"  # input path
    image = Image.open(path)
    image = image.convert('L')  # convert to 8-bit grayscale
    two_value(image, 230)
    clear_noise(image, 4)
    pathout = "image/" + str(i) + "+" + str(i) + ".png"  # output path
    save_png(pathout, image.size)
--------------------------------------------------------------------------------
/Code/Image_Spider.py:
--------------------------------------------------------------------------------
import os

import requests

os.makedirs("image", exist_ok=True)  # create the output directory if needed

for i in range(100):
    img_url = "http://passport2.chaoxing.com/num/code?1593027117381"
    r = requests.get(img_url, timeout=10)
    r.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(os.path.join("image", "%d.png" % i), "wb") as f:
        f.write(r.content)

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# A Python-Based Image Crawler and Image Processing

## Preface
I've only just started learning Python, so I'm learning as I go~
This is a small project with no fixed goal; I build whatever comes to mind.
It has been fun so far, so I expect to keep updating it~

Current progress:

__1. Implemented a simple crawler that downloads captcha images__ 2020/6/25

__2. Implemented binarization and noise removal for the images__ 2020/6/26

---

## Crawler: Fetching Captcha Images

~~(To be honest, I originally wanted to scrape pictures of cute girls from Pixiv)~~

I wanted a large set of similar images with shared features that I could later process in depth or classify, and that made me think of captchas!
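So I quietly opened the login page of my school's academic affairs system (just kidding, please don't come after me):

![Academic affairs system login page](https://github.com/FTLIKON/My_Py_Image/blob/master/Blog_Png/1-1.jpg?raw=true)

Then I noticed that this captcha behaves a little differently. On many sites the captcha refreshes when you click it, but here it changes every time the page is reloaded.

Setting that aside for a moment, the usual first step is to find the image URL. I use Chrome, so a right-click and Inspect found it right away.

![Finding the captcha URL](https://github.com/FTLIKON/My_Py_Image/blob/master/Blog_Png/1-2.png?raw=true)

Opening that URL directly confirmed my guess: the captcha really does change on every reload!

![Discovering how the captcha behaves](https://github.com/FTLIKON/My_Py_Image/blob/master/Blog_Png/1-3.png?raw=true)

This means every request the crawler sends to this URL returns a different image, which makes the crawler very simple to write. Here is the code:

```python
import os

import requests

os.makedirs("image", exist_ok=True)  # create the output directory if needed

for i in range(100):  # number of images to download; I use 100
    img_url = "http://passport2.chaoxing.com/num/code?1593027117381"
    r = requests.get(img_url, timeout=10)
    r.raise_for_status()  # stop on HTTP errors instead of saving an error page
    # Write the bytes in binary mode; the file is created as image/<i>.png.
    with open(os.path.join("image", "%d.png" % i), "wb") as f:
        f.write(r.content)
```

Run it! As expected, 100 captchas landed in the `image` directory. Nice!

![Downloaded captcha images](https://github.com/FTLIKON/My_Py_Image/blob/master/Blog_Png/1-4.png?raw=true)

With the images downloaded, the next step is processing them.

---

## Image Binarization and Noise Removal


Thanks to the crawler above, I now have 100 captcha images!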
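However, I noticed two common problems with these captchas:

__1. The images are not sharp, which makes them hard to read__

__2. There is too much noise, which also makes them hard to read__

Binarization can deal with the sharpness problem first. Binarizing an image means setting every pixel's gray value to either 0 or 255, so that the whole image becomes pure black and white.
I read up on the PIL library and found the `getpixel((x, y))` method, which returns the value of the pixel at `(x, y)`. The full script converts each image to grayscale with `convert('L')` first, so that value is a single intensity from 0 to 255 rather than an RGB tuple. With a binarization threshold in place, only the darker pixels are kept. Here is the snippet:

```python
val = {}  # binary value of each pixel, keyed by (x, y)


def two_value(image, G):  # G: binarization threshold
    for y in range(0, image.size[1]):
        for x in range(0, image.size[0]):  # visit every pixel
            g = image.getpixel((x, y))
            if g > G:  # binarize against the threshold
                val[(x, y)] = 1
            else:
                val[(x, y)] = 0
```

I found that a threshold of G = 230 works best. Here is the result:

Before:

![Before](https://github.com/FTLIKON/My_Py_Image/blob/master/Blog_Png/2-1.png?raw=true)

After:

![After binarization](https://github.com/FTLIKON/My_Py_Image/blob/master/Blog_Png/2-2.png?raw=true)

The result is quite good, but a few isolated pixels still sit outside the characters, and they only hurt recognition.
To handle them, I count how many of each pixel's eight neighbors have the same value as the pixel itself. If that count falls below a threshold, the pixel is treated as noise and removed. Here is the snippet:

```python
dx = [-1, -1, -1, 0, 0, 1, 1, 1]  # offsets to the eight neighboring pixels
dy = [-1, 0, 1, -1, 1, 1, 0, -1]


def clear_noise(image, N):  # N: noise-removal threshold
    # Set the top-left and bottom-right corner pixels to 1.
    val[(0, 0)] = 1
    val[(image.size[0] - 1, image.size[1] - 1)] = 1
    for x in range(1, image.size[0] - 1):
        for y in range(1, image.size[1] - 1):
            nearDots = 0  # number of neighbors with the same value
            L = val[(x, y)]  # current pixel
            for j in range(8):  # compare with the eight surrounding pixels
                if L == val[(x + dx[j], y + dy[j])]:
                    nearDots += 1
            if nearDots < N:  # fewer than N matching neighbors: treat as noise, set to 0
                val[(x, y)] = 0
```

I found that N = 4 works best here. Here is the result:

![After noise removal](https://github.com/FTLIKON/My_Py_Image/blob/master/Blog_Png/2-3.png?raw=true)

The result is pretty stunning! I won't walk through the image-saving code here; the full code is in the `/Code` folder above, so help yourself~
--------------------------------------------------------------------------------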
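
As a rough sketch of the neighbor-count idea, the pure-Python example below runs on a tiny hand-made grid instead of a real captcha, so it needs no third-party library. The names `binarize`, `remove_specks`, and `NEIGHBORS` are made up for this illustration and are not part of `Code/Image_Process.py`. It also makes two assumptions that differ from `clear_noise`: it assumes dark characters (0) on a light background (1) and flips a removed pixel to 1, and it counts neighbors on an untouched copy of the grid so that earlier removals cannot influence later ones. The demo threshold of 3 was picked for this toy grid and is not tuned on real images.

```python
# Hypothetical helper names for illustration; not part of Code/Image_Process.py.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, 1), (1, 0), (1, -1)]


def binarize(gray, threshold):
    """Map a 2-D list of 0-255 gray values to 1 (light) or 0 (dark)."""
    return [[1 if g > threshold else 0 for g in row] for row in gray]


def remove_specks(bits, min_same):
    """Turn dark pixels with fewer than `min_same` dark neighbors into background.

    Neighbors are counted on the untouched input grid, so the result does not
    depend on scan order. Border pixels are copied unchanged.
    """
    height, width = len(bits), len(bits[0])
    out = [row[:] for row in bits]  # work on a copy, leave the input intact
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            if bits[y][x] != 0:
                continue  # assumption: only dark pixels can be noise
            same = sum(1 for dx, dy in NEIGHBORS if bits[y + dy][x + dx] == 0)
            if same < min_same:
                out[y][x] = 1  # assumption: removed pixels become background
    return out


gray = [
    [255, 255, 255, 255, 255, 255],
    [255,  40, 255, 255, 255, 255],  # lone dark pixel at x=1, y=1
    [255, 255, 255,  30,  30, 255],  # top row of a 2x2 dark block
    [255, 255, 255,  30,  30, 255],  # bottom row of the same block
    [255, 255, 255, 255, 255, 255],
]
bits = binarize(gray, 230)
clean = remove_specks(bits, 3)
for row in clean:
    print("".join("#" if b == 0 else "." for b in row))
```

On this grid the lone pixel has no dark neighbors and is removed, while each pixel of the 2x2 block has three dark neighbors and survives, so the printed rows are `......`, `......`, `...##.`, `...##.`, `......`.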