Anyone who has downloaded study materials from the 51test forum probably knows these pain points:
There are lots of links that have to be clicked one by one, and scanning through many search results gets tiring.
With many tabs open, it is hard to keep track of which resources you already have.
So I wrote a simple crawler with requests and BeautifulSoup to collect the resource links and their download addresses automatically.
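The crawling code below relies on a logged-in cookie, request headers, an entry URL, and a set used to de-duplicate crawled pages, none of which appear in the post. A minimal sketch of that setup, where the cookie, User-Agent, and entry URL values are only placeholders:

import time
import requests
from bs4 import BeautifulSoup

# Placeholder values: the real cookie comes from a logged-in browser session
cook = {"auth": "xxxxxxxx"}                     # hypothetical cookie name/value
headers = {"User-Agent": "Mozilla/5.0"}         # minimal request headers
url = "http://bbs.51testing.com/forum.php"      # hypothetical entry page
page = set()                                    # de-duplicates crawled resource URLs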
Crawl the resource links:
data = requests.get(url=url, cookies=cook, headers=headers)
data.encoding = "gbk"
web_data = BeautifulSoup(data.text, 'lxml')
# Only follow links inside the target div on the page
for link in web_data.find('div', {'class': 'two_js_yw'}).findAll('a'):
    if 'href' in link.attrs and 'target' in link.attrs:
        resul_url = link.attrs['href']
        # Keep only resource pages; skip search result and category pages
        if '51t' in resul_url and 'search' not in resul_url and 'catid' not in resul_url:
            page.add(resul_url)
            print("Crawled page: " + resul_url)
            GetDownFile(resul_url)   # defined below: extracts the download address from this page
            time.sleep(0.1)          # throttle requests a little
Get the download address (the GetDownFile function used above):
def GetDownFile(url):
    data = requests.get(url=url, cookies=cook, headers=headers)
    data.encoding = "gbk"
    web_data = BeautifulSoup(data.text, 'lxml')
    urls = web_data.findAll('a')
    titles = web_data.find_all('h3')
    for url in urls:
        if "href" in url.attrs:
            web_link = url.attrs['href']
            # Use the page's first <h3> as the file title, with a fallback
            if len(titles) > 0:
                title = titles[0].text
            else:
                title = "crawler"
            if '.pdf' in web_link:
                print("Download address: " + web_link)
                insert(web_link, title)   # store the address and title in the local database
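The insert() helper used above is not shown in the post; it writes each download address and title into the local database that the Java downloader later reads. A minimal sketch, assuming the pymysql driver, a local MySQL database (here named spider, a placeholder), and the test51file table with url and title columns that the Java query below selects from:

import pymysql

def insert(web_link, title):
    # Hypothetical connection parameters; adjust to the local database
    conn = pymysql.connect(host="localhost", user="root", password="root",
                           database="spider", charset="utf8")
    try:
        with conn.cursor() as cursor:
            cursor.execute("INSERT INTO test51file (url, title) VALUES (%s, %s)",
                           (web_link, title))
        conn.commit()
    finally:
        conn.close()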
Since I am not very familiar with Python multithreading, the download part is written in Java: it reads the collected addresses from the local database and runs the download tasks in batches (a Python alternative with concurrent.futures is sketched after the Java code, for reference).
public static void main(String[] args) throws SQLException {
    Connection conn = getConn();                  // getConn() opens a JDBC connection to the local database (not shown in the post)
    Statement statement = conn.createStatement();
    // Pull a batch of crawled download addresses from the table filled by the Python crawler
    String sql = "SELECT a.url,a.title FROM test51file a WHERE id >= 941 AND id <= 1140";
    ResultSet rs = statement.executeQuery(sql);
    String path = "D:/51test/";                   // local download directory (placeholder value)
    while (rs.next()) {
        // One thread per record; each thread downloads a single file
        UploadRannable up = new UploadRannable();
        up.setPath(path);
        up.setTitle(rs.getString("title"));
        up.setUrl(rs.getString("url"));
        Thread thd = new Thread(up);
        thd.start();
    }
}
import java.io.IOException;

public class UploadRannable implements Runnable {
    private String url;
    private String path;
    private String title;

    // Setters used by main() to hand each thread its task
    public void setUrl(String url) { this.url = url; }
    public void setPath(String path) { this.path = path; }
    public void setTitle(String title) { this.title = title; }

    @Override
    public void run() {
        try {
            uploadFile(url, path, title);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Build the file name from the title and the URL's extension, then download
    public void uploadFile(String url, String path, String title) throws IOException {
        String name = null;
        if (url.contains(".pdf")) {
            name = title + ".pdf";
        } else if (url.contains(".doc")) {
            name = title + ".doc";
        } else if (url.contains(".xls")) {
            name = title + ".xls";
        }
        DownloadFile.downLoadFromUrl(url, name, path);  // DownloadFile is a helper class not shown in the post
    }
}
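For reference, the same batch of download jobs could also be run from Python with concurrent.futures, which avoids managing threads by hand. This is only an alternative sketch, not part of the original post; the connection parameters, the spider database name, the download directory, and the download_file helper are assumptions:

import os
from concurrent.futures import ThreadPoolExecutor

import pymysql
import requests

def download_file(url, title, path):
    # Hypothetical downloader: pick an extension from the URL and stream the file to disk
    ext = ".pdf" if ".pdf" in url else ".doc" if ".doc" in url else ".xls"
    resp = requests.get(url, stream=True)
    with open(os.path.join(path, title + ext), "wb") as f:
        for chunk in resp.iter_content(8192):
            f.write(chunk)

def batch_download(path):
    conn = pymysql.connect(host="localhost", user="root", password="root",
                           database="spider", charset="utf8")
    with conn.cursor() as cursor:
        cursor.execute("SELECT url, title FROM test51file WHERE id >= 941 AND id <= 1140")
        rows = cursor.fetchall()
    conn.close()
    # The executor caps how many files are downloaded at the same time
    with ThreadPoolExecutor(max_workers=10) as pool:
        for url, title in rows:
            pool.submit(download_file, url, title, path)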
So far I have crawled 1,000 records. Because the code is not very robust, the crawler later fell into a crawler black hole (an endless loop of links). This is my first blog post, and I hope it makes it through.
The crawled materials cover both development and testing topics; download them here:
Link: http://pan.baidu.com/s/1eRQ6M8y  Password: 1h1j