Crawler: URL Deduplication with Simhash

Pre:

To avoid scanning the same URL repeatedly and to improve scanning efficiency, URL deduplication is an essential part of a scanner's crawler.


Implementation:

Step 1: Generalization

Parse every parameter of the URL and generalize each parameter value.

For example, map every letter in a parameter value to A, every digit to N, special symbols (here: , - _) to T, and any other symbol or character to C (a short code sketch of this mapping follows the examples below).

For example:

http://tuan.qunar.com/?in_track=home_tuan_content&list=gengduo
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_xiaoyu50
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_50dao100
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_100dao150
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_150dao200
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_200dao500

After generalization (each original URL is followed by its generalized form):

http://tuan.qunar.com/?in_track=home_tuan_content&list=gengduo
http://tuan.qunar.com/?list=AAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_xiaoyu50
http://tuan.qunar.com/?tag=AAAAATAAAAAANN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_50dao100
http://tuan.qunar.com/?tag=AAAAATNNAAANNN&in_track=AAAATAAAATAAAAAAA
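
The per-character mapping is straightforward to implement. A minimal standalone sketch (the function name generalize is mine; the full code below implements the same rule in its etl() function):

def generalize(value):
    # letters -> A, digits -> N, a few separators -> T, everything else -> C
    out = []
    for c in value.lower():
        if 'a' <= c <= 'z':
            out.append('A')
        elif '0' <= c <= '9':
            out.append('N')
        elif c in (',', '-', '_'):
            out.append('T')
        else:
            out.append('C')
    return ''.join(out)

print(generalize('jiage_50dao100'))  # -> AAAAATNNAAANNN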

Step 2: Simhash

Simhash is the algorithm Google uses to detect near-duplicate web pages.

In short, Simhash is used here to judge whether two URLs are similar: if the Hamming distance between their fingerprints is within a certain threshold, the two URLs are considered similar.
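
To make "Hamming distance" concrete, here is a minimal sketch using the same simhash package the full code imports; the two inputs are generalized query strings from the example above:

from simhash import Simhash

a = Simhash('list=AAAAAAA&in_track=AAAATAAAATAAAAAAA')
b = Simhash('tag=AAAAATAAAAAANN&in_track=AAAATAAAATAAAAAAA')

# distance() counts the differing bits between the two 64-bit fingerprints;
# the smaller the distance, the more similar the two strings.
print(a.distance(b))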


Full code:

#!/usr/local/bin/python
# -*- coding:utf-8 -*-
# @Time : 2019/8/5 8:41 PM
# @Author : Jerry
# @Desc : refs: https://docs.ioin.in/writeup/www.noblexu.com/_%E5%88%A9%E7%94%A8Simhash%E5%81%9AURL%E5%8E%BB%E9%87%8D%E7%9A%84%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F_/index.html
# @File : filter.py

from urllib import parse as urlparse
from urllib.parse import unquote

from simhash import Simhash

# Separators treated as "special symbols" during generalization
Chars = [',', '-', '_']


def url_etl(url):
    '''
    Generalize a URL: replace each query-parameter value with its
    generalized form, so that URLs differing only in parameter values
    map to (near-)identical strings.
    :param url: original URL
    :return: generalized URL
    '''
    params_new = {}
    u = urlparse.urlparse(url)
    query = unquote(u.query)
    if not query:
        return url
    params = urlparse.parse_qsl(query, True)
    for k, v in params:
        if v:
            params_new[k] = etl(v)
    query_new = urlparse.urlencode(params_new)
    return urlparse.urlunparse(
        (u.scheme, u.netloc, u.path, u.params, query_new, u.fragment))


def etl(s):
    '''
    Map each character of a string: letters to A, digits to N,
    special symbols (Chars) to T, anything else to C.
    :param s: original string
    :return: generalized string
    '''
    chars = ""
    for c in s.lower():
        if 'a' <= c <= 'z':
            chars += 'A'
        elif '0' <= c <= '9':
            chars += 'N'
        elif c in Chars:
            chars += 'T'
        else:
            chars += 'C'
    return chars


def url_compare(url, link):
    # Two URLs are considered similar when the Hamming distance between
    # their Simhash fingerprints is below the threshold (5 bits here).
    return Simhash(url).distance(Simhash(link)) < 5


def reduce_urls(ori_urls):
    '''
    Deduplicate a list of URLs.
    :param ori_urls: original URL list
    :return: deduplicated URL list
    '''
    etl_urls = []
    result_urls = []
    for ori_url in ori_urls:
        etl_url = url_etl(ori_url)
        # Keep a URL only if it is not similar to any already-kept URL
        if all(not url_compare(etl_url, seen) for seen in etl_urls):
            etl_urls.append(etl_url)
            result_urls.append(ori_url)
    return result_urls


if __name__ == '__main__':
    test_list = [
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58963',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58964',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58965',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58966',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58967',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58968',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58969',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58970',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58971',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58972',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58973',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58974',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58975',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58976',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58977',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58978',
    ]

    print(reduce_urls(test_list))

    # url = 'http://tuan.qunar.com/ext/sact/RjABVv?in_track=home_tuan_content_lunbo'
    # print(url_etl(url))
    # print(etl(url))
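
With this test list, every URL generalizes to the same query string (siteId=NN&type=N&article=NNNNN), so the Simhash distance between any pair is 0 and reduce_urls should return only the first URL.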

Summary:

Boolean-based SQL injection scanning requires sending quite a few test payloads.

At first the crawler did not deduplicate URLs. When it hit a news/article-style site, the first few dozen crawled links turned out to be almost identical, with the same parameters, and without deduplication the scanner kept re-testing the same few parameters on the same page.

Efficiency and results were both very poor. After adding the filter, the number of URLs to scan dropped sharply and efficiency improved considerably.

Even so, be careful at this point: check whether any URLs are being wrongly dropped by the filter. Better to scan duplicates than to miss too much…


Refs:

https://docs.ioin.in/writeup/www.noblexu.com/_%E5%88%A9%E7%94%A8Simhash%E5%81%9AURL%E5%8E%BB%E9%87%8D%E7%9A%84%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F_/index.html