Background

Anyone who has downloaded study materials from the 51testing forum probably knows these pain points:
there are many links that have to be clicked one by one, and sifting through lots of search results is tiring;
with many tabs open, it is hard to keep track of which resources you already have.

Crawler
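
The crawler uses requests plus BeautifulSoup and assumes a logged-in session. Here is a minimal, hypothetical setup for the names the snippet below relies on: the seed url, the cook login cookie, and the headers are placeholders to replace with values from your own browser session, and page is a set that collects crawled links.

import time

import requests
from bs4 import BeautifulSoup

url = "http://bbs.51testing.com/forum.php"  # hypothetical seed page, use the real board URL
cook = {"auth": "..."}                      # hypothetical login cookie copied from the browser
headers = {"User-Agent": "Mozilla/5.0"}     # minimal headers so the request looks like a browser
page = set()                                # collects crawled resource URLs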

data = requests.get(url=url, cookies=cook, headers=headers)
data.encoding = "gbk"  # the forum pages are GBK-encoded
web_data = BeautifulSoup(data.text, 'lxml')

# Only crawl links inside the target resource div
for link in web_data.find('div', {'class': 'two_js_yw'}).findAll('a'):
    if 'href' in link.attrs and 'target' in link.attrs:
        resul_url = link.attrs['href']
        # Keep only 51testing resource links, skipping search and category pages
        if '51t' in resul_url and 'search' not in resul_url and 'catid' not in resul_url:
            page.add(resul_url)  # note: membership is never checked, so revisits are possible
            print("Crawled URL: " + resul_url)
            GetDownFile(resul_url)  # defined in the next section
            time.sleep(0.1)  # throttle requests slightly

Getting the download address (the GetDownFile function called above):

def GetDownFile(url):
    data = requests.get(url=url, cookies=cook, headers=headers)
    data.encoding = "gbk"
    web_data = BeautifulSoup(data.text, 'lxml')
    links = web_data.findAll('a')
    titles = web_data.find_all('h3')
    # Use the page's first <h3> as the file title, with a fallback
    title = titles[0].text if len(titles) > 0 else "crawler"
    for link in links:
        if 'href' in link.attrs:
            web_link = link.attrs['href']
            if '.pdf' in web_link:
                print("Download URL: " + web_link)
                insert(web_link, title)  # save url and title to the local database
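
The insert call above is the author's database helper and its code is not in the post. A minimal sketch of what it might look like, assuming a local MySQL table test51file(id, url, title), the same table the Java code below queries, with hypothetical pymysql connection settings:

import pymysql

def insert(web_link, title):
    # Hypothetical connection settings; the table name matches the Java query below
    conn = pymysql.connect(host='localhost', user='root',
                           password='***', db='test', charset='utf8')
    try:
        with conn.cursor() as cursor:
            cursor.execute("INSERT INTO test51file (url, title) VALUES (%s, %s)",
                           (web_link, title))
        conn.commit()
    finally:
        conn.close()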

Downloading files by URL

Since I don't know Python multithreading very well, the download part is written in Java: it reads the crawled URLs from the local database and runs the download tasks in batches.

public static void main(String[] args) throws SQLException {
    Connection conn = getConn();  // getConn() is the author's JDBC helper (not shown)
    Statement statement = conn.createStatement();
    // Pull a batch of crawled records from the local database
    String sql = "SELECT a.url, a.title FROM test51file a WHERE id >= 941 AND id <= 1140";
    ResultSet rs = statement.executeQuery(sql);
    while (rs.next()) {
        // Start one download thread per record
        UploadRannable up = new UploadRannable();
        up.setPath(path);  // path holds the local download directory
        up.setTitle(rs.getString("title"));
        up.setUrl(rs.getString("url"));
        Thread thd = new Thread(up);
        thd.start();
    }
}

public class UploadRannable implements Runnable {

    private String url;
    private String path;
    private String title;

    @Override
    public void run() {
        try {
            uploadFile(url, path, title);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void uploadFile(String url, String path, String title) throws IOException {
        // Derive the local file name from the page title and the URL's extension
        String name = null;
        if (url.contains(".pdf")) {
            name = title + ".pdf";
        } else if (url.contains(".doc")) {
            name = title + ".doc";
        } else if (url.contains(".xls")) {
            name = title + ".xls";
        }
        // DownloadFile is the author's helper class (implementation not shown)
        DownloadFile.downLoadFromUrl(url, name, path);
    }

    public void setUrl(String url) { this.url = url; }
    public void setPath(String path) { this.path = path; }
    public void setTitle(String title) { this.title = title; }
}
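
For reference, the same batch download can also be done in Python without managing threads by hand, which was my hesitation above. This is only a sketch under assumptions: concurrent.futures.ThreadPoolExecutor replaces the one-thread-per-record loop, download_file is a hypothetical helper mirroring the Java naming rule, and the rows data is illustrative (in practice it would come from the test51file table):

import os
from concurrent.futures import ThreadPoolExecutor

import requests

def download_file(url, title, path):
    # Same naming rule as the Java version: title + extension taken from the URL
    for ext in ('.pdf', '.doc', '.xls'):
        if ext in url:
            name = title + ext
            break
    else:
        return  # skip unknown file types
    resp = requests.get(url, stream=True)
    with open(os.path.join(path, name), 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)

# Illustrative data; real rows would come from the test51file table as above
rows = [("http://example.com/a.pdf", "sample title")]
os.makedirs("downloads", exist_ok=True)
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, title in rows:
        pool.submit(download_file, url, title, "downloads")

A fixed-size pool caps concurrency, whereas the Java loop above starts one thread per row and can spawn hundreds of threads on a large result set.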


Summary

So far I have crawled about 1,000 records. Because the code is not very robust, the crawler eventually fell into a crawler trap (an endless loop of links). This is my first blog post, and I hope it passes muster.
The crawled download links cover both development and testing materials:
Link: http://pan.baidu.com/s/1eRQ6M8y  Password: 1h1j

