cdxy.me
Cyber Security / Data Science / Trading

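A single-machine, multithreaded crawler that took 30 hours to scrape the public data of 20 million Bilibili users and store it in a database.

It builds a web index of users' personal signatures, which may well be the most imaginative little one-liners in the Eastern Hemisphere.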

Web version: http://cdxy.me/CI/

Project source: https://github.com/Xyntax/POC-T/blob/master/module/spider.py

The script is very simple and has already been integrated as a module into my multithreaded framework:

import requests
import json
import MySQLdb


def info():
    pass


def exp():
    pass

def poc(uid):
    # uid: the Bilibili user ID (mid) to fetch, passed in as a string
    url = 'http://space.bilibili.com/ajax/member/GetInfo?mid=' + uid
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'
    }

    jscontent = requests.get(url, headers=head, verify=False).content
    jsDict = json.loads(jscontent)
    if jsDict['status'] and jsDict['data']['sign']:
        jsData = jsDict['data']
        mid = jsData['mid']
        name = jsData['name']
        sign = jsData['sign']
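        conn = None
        try:
            conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, charset='utf8')
            cur = conn.cursor()
            conn.select_db('bilibili')
            cur.execute(
                'INSERT INTO bilibili_user_info VALUES (%s,%s,%s,%s)', [mid, mid, name, sign])
            # MySQLdb does not autocommit by default; without this the row may be lost
            conn.commit()
            return True

        except MySQLdb.Error:
            # Best effort: skip this user on any database error
            pass

        finally:
            # Always release the connection, whether or not the insert succeeded
            if conn is not None:
                conn.close()

    return False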
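
The script above handles a single user ID and leaves the threading to the framework. For reference, 20 million users in 30 hours works out to roughly 185 requests per second, so the work has to be spread across many threads. The following is only a minimal sketch of that idea using the standard library, not the framework's actual scheduler: the `crawl` and `_safe` helpers, the `workers` and `batch` parameters, and the assumption that IDs form one contiguous range are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def _safe(fetch, mid):
    """Run fetch(mid) and treat any exception (network, JSON, ...) as a miss."""
    try:
        return bool(fetch(mid))
    except Exception:
        return False


def crawl(fetch, first_mid, last_mid, workers=50, batch=10000):
    """Call fetch(str(mid)) for every ID in [first_mid, last_mid] on a thread pool.

    Returns how many calls reported success (a truthy return value).
    The defaults for workers and batch are illustrative, not tuned values.
    """
    stored = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit one batch at a time so the number of pending tasks stays bounded
        for start in range(first_mid, last_mid + 1, batch):
            stop = min(start + batch, last_mid + 1)
            ids = [str(mid) for mid in range(start, stop)]
            results = pool.map(lambda mid: _safe(fetch, mid), ids)
            stored += sum(1 for ok in results if ok)
    return stored
```

Under these assumptions, a full run would be started with something like `crawl(poc, 1, 20000000)`. Catching every exception inside `_safe` keeps one failed request from aborting the whole batch, at the cost of silently skipping that user.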