├── README.md
├── Remove_Duplicates.py
├── phone_search.py
└── see_what.py
/README.md:
--------------------------------------------------------------------------------
1 | # PythonTo-repeat-the-text-Bigdata
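2 | Remove_Duplicates.py: Python script that quickly deduplicates text data at the ten-million-record scale
3 |
4 | see_what.py: only previews what the data inside a large text file looks like
5 |
6 | phone_search.py: targeted check for whether a specific phone number appears in leaked data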
7 |
8 |
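As a rough illustration of the deduplication idea, here is a minimal line-based sketch. It is not the code in Remove_Duplicates.py, which tokenizes the whole file with a regex and passes the tokens through `set()`. The function name `dedupe_lines` and the choice to store 16-byte BLAKE2 digests instead of full lines are assumptions made for this example. Storing digests keeps memory bounded per record, at the cost of a theoretically possible (but extremely unlikely) hash collision.

```python
import hashlib
import io


def dedupe_lines(src, dst):
    """Copy lines from src to dst, keeping only the first occurrence of each.

    Hypothetical helper for illustration; src and dst are text file objects.
    """
    seen = set()  # holds 16-byte digests, not the lines themselves
    kept = 0
    for line in src:  # streams line by line instead of reading the whole file
        record = line.rstrip("\r\n")
        if not record:
            continue  # blank lines are skipped
        digest = hashlib.blake2b(record.encode("utf-8"), digest_size=16).digest()
        if digest in seen:
            continue  # duplicate record
        seen.add(digest)
        dst.write(record + "\n")
        kept += 1
    return kept


# Small in-memory demonstration; real use would pass open() file objects.
sample = io.StringIO("alice\nbob\nalice\n\ncarol\nbob\n")
result = io.StringIO()
count = dedupe_lines(sample, result)
print(count)
print(result.getvalue())
```

Unlike the `set()`-based approach, this sketch preserves the order in which records first appear.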
--------------------------------------------------------------------------------
/Remove_Duplicates.py:
--------------------------------------------------------------------------------
1 | #coding=utf-8
2 |
3 | import sys, re, os
4 |
5 | def getDictList(dict):
6 |
7 |     regx = r'''[\w\~`\!\@\#\$\%\^\&\*\(\)\_\-\+\=\[\]\{\}\:\;\,\.\/\<\>\?]+'''
8 |
9 |     with open(dict) as f:
10 |
11 |         data = f.read()  # reads the whole file into memory
12 |
13 |     return re.findall(regx, data)
14 |
15 | def rmdp(dictList):
16 |
17 |     return list(set(dictList))  # set() drops duplicates; original order is not preserved
18 |
19 | def fileSave(dictRmdp, out):
20 |
21 |     with open(out, 'a') as f:  # append mode: an existing output file is extended, not overwritten
22 |
23 |         for line in dictRmdp:
24 |
25 |             f.write(line + '\n')
26 |
27 | def main():  # Usage: run `Remove_Duplicates.py test.txt result.txt` from the command line
28 |
29 |     try:
30 |
31 |         dict = sys.argv[1].strip()
32 |
33 |         out = sys.argv[2].strip()
34 |
35 |     except Exception as e:
36 |
37 |         print 'error:', e
38 |
39 |         me = os.path.basename(__file__)
40 |
41 |         print 'usage: %s