Pre:
To avoid scanning the same URL repeatedly and to improve scanning efficiency, URL deduplication is an indispensable part of a scanner's crawler.
Implementation approach:
Step 1: Generalization
Parse every parameter of the URL and generalize each parameter value. Concretely, letters in a value are mapped to A, digits to N, special symbols to T, and any other character to C (a short sketch of this mapping follows the examples below).
For example:
```
http://tuan.qunar.com/?in_track=home_tuan_content&list=gengduo
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_xiaoyu50
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_50dao100
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_100dao150
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_150dao200
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_200dao500
```
After generalization (each original URL followed by its generalized form):
```
http://tuan.qunar.com/?in_track=home_tuan_content&list=gengduo
http://tuan.qunar.com/?list=AAAAAAA&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_xiaoyu50
http://tuan.qunar.com/?tag=AAAAATAAAAAANN&in_track=AAAATAAAATAAAAAAA
http://tuan.qunar.com/?in_track=home_tuan_content&tag=jiage_50dao100
http://tuan.qunar.com/?tag=AAAAATNNAAANNN&in_track=AAAATAAAATAAAAAAA
```
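To make the rule concrete, here is a minimal sketch of the character-class mapping applied to one of the values above. The function name `generalize` is just for illustration; the full code below implements the same idea as `etl`.

```python
def generalize(value):
    # Letters -> A, digits -> N, the special symbols ',', '-', '_' -> T,
    # anything else -> C (the same classes described above).
    out = []
    for ch in value.lower():
        if 'a' <= ch <= 'z':
            out.append('A')
        elif '0' <= ch <= '9':
            out.append('N')
        elif ch in (',', '-', '_'):
            out.append('T')
        else:
            out.append('C')
    return ''.join(out)


print(generalize('jiage_50dao100'))  # -> AAAAATNNAAANNN
```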
Step 2: Simhash
Simhash is the algorithm Google uses for near-duplicate detection of web pages.
Put simply, Simhash is used here to decide whether two URLs are similar: if the Hamming distance between their Simhash fingerprints falls within a certain range, the two URLs are judged to be similar.
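As a quick illustration of that idea, here is a minimal sketch using the same `simhash` Python package as the full code below. The example strings are made up (already-generalized URLs), and the distance threshold of 5 mirrors the one used in `url_compare` below.

```python
from simhash import Simhash

# Two generalized URLs with the same pattern have a zero Hamming distance;
# a structurally different URL should land much further away.
a = 'http://xxx.com/_visitcountdisplay?siteId=NN&type=N&article=NNNNN'
b = 'http://xxx.com/_visitcountdisplay?siteId=NN&type=N&article=NNNNN'
c = 'http://xxx.com/search?keyword=AAAAAA&page=N'

print(Simhash(a).distance(Simhash(b)))  # 0 -- identical strings, treated as duplicates
print(Simhash(a).distance(Simhash(c)))  # typically a much larger distance -> different
```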
Full code:
```python
from urllib import parse as urlparse
from urllib.parse import unquote
from simhash import Simhash

# Characters treated as "special symbols" during generalization
Chars = [',', '-', '_']


def url_etl(url):
    '''
    Generalize a URL: replace every query-parameter value with its
    character-class pattern.
    :param url: original URL
    :return: generalized URL
    '''
    params_new = {}
    u = urlparse.urlparse(url)
    query = unquote(u.query)
    if not query:
        return url
    params = urlparse.parse_qsl(query, True)
    for k, v in params:
        if v:
            params_new[k] = etl(v)
    query_new = urlparse.urlencode(params_new)
    url_new = urlparse.urlunparse(
        (u.scheme, u.netloc, u.path, u.params, query_new, u.fragment))
    return url_new


def etl(s):
    '''
    Map each character of a string to its class: letters become A, digits
    become N, special symbols become T, anything else becomes C.
    :param s: parameter value
    :return: generalized pattern string
    '''
    chars = ""
    for c in s:
        c = c.lower()
        if ord('a') <= ord(c) <= ord('z'):
            chars += 'A'
        elif ord('0') <= ord(c) <= ord('9'):
            chars += 'N'
        elif c in Chars:
            chars += 'T'
        else:
            chars += 'C'
    return chars


def url_compare(url, link):
    '''Return True if the two (generalized) URLs are similar.'''
    dis = Simhash(url).distance(Simhash(link))
    return dis < 5  # Hamming distance threshold


def reduce_urls(ori_urls):
    '''
    Deduplicate a list of URLs.
    :param ori_urls: original URL list
    :return: deduplicated URL list
    '''
    etl_urls = []
    result_urls = []
    for ori_url in ori_urls:
        etl_new = url_etl(ori_url)
        print(etl_new)  # debug: show the generalized form
        score = 0
        if etl_urls:
            # Count how many of the kept URLs this one is NOT similar to
            for etl_url in etl_urls:
                if not url_compare(etl_new, etl_url):
                    score += 1
            # Keep it only if it differs from every URL kept so far
            if score == len(etl_urls):
                result_urls.append(ori_url)
                etl_urls.append(etl_new)
        else:
            etl_urls.append(etl_new)
            result_urls.append(ori_url)
    return result_urls


if __name__ == '__main__':
    test_list = [
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58963',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58964',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58965',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58966',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58967',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58968',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58969',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58970',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58971',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58972',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58973',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58974',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58975',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58976',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58977',
        'http://xxx.com/_visitcountdisplay?siteId=64&type=3&article=58978',
    ]
    print(reduce_urls(test_list))
```
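For context, this is roughly how the deduplication could sit inside the crawler itself: check each newly extracted link against the patterns already queued and only scan it if it is not a near-duplicate. This is a hypothetical sketch (the `should_scan` helper and `seen_patterns` list are illustrative, not part of the original code); it reuses `url_etl` and `url_compare` from the code above.

```python
seen_patterns = []  # generalized forms of URLs already queued for scanning


def should_scan(url):
    '''Return True if url is not a near-duplicate of anything already queued.'''
    pattern = url_etl(url)
    for seen in seen_patterns:
        if url_compare(pattern, seen):
            return False  # similar to a URL we already plan to scan -> skip
    seen_patterns.append(pattern)
    return True


# Inside the crawler loop, for each extracted link:
#     if should_scan(link):
#         scan_queue.append(link)
```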
Summary:
Boolean-based SQL injection checks require sending quite a lot of test payloads. At first the crawler did not deduplicate URLs, and when it hit an article-style site, the first few dozen crawled links turned out to be almost identical, with the same parameters. Without deduplication the scanner kept testing the same few parameters on the same page over and over, which was terrible for both efficiency and results. After adding this filter, the number of URLs to scan dropped sharply and efficiency improved noticeably. That said, keep an eye on whether the filter drops URLs it should not: better to scan something twice than to miss too much…
Refs: