FunTester: An Image Crawler in Practice

FunTester · March 9, 2021 · 2136 views

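I previously wrote a post comparing file downloads in Java and Groovy, where the main exercise was verifying the feature by downloading images. I also promised an image crawler back then, and an opportunity just came up: I wrote a crawler to download some QR code image assets.

The approach is the same as before: first collect the address of each asset from the index page, then match the image URL in each asset page, and finally download the image to local disk.

Script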

package com.funtester.groovy

import com.funtester.httpclient.FunLibrary
import com.funtester.utils.FileUtil
import com.funtester.utils.RWUtil
import com.funtester.utils.Regex

import java.util.stream.Collectors

class FunTester extends FunLibrary {

    static void main(String[] args) {
        String url = "https://kt.fkw.com/muban/word-7502-0-0-0-0-0-0.html"
        def get = getHttpGet(url)

        def response = getHttpResponse(get)
        def s = response.getString(RESPONSE_CONTENT).replaceAll("\\s", EMPTY)
        // Escape the dot before "html" so it matches only a literal "."
        def urls = (Regex.regexAll(s, "//kt\\.fkw\\.com/tupian/\\w{8}\\.html") as Set) as List
        //        output(s)
        def collect = urls.stream().map {
            x -> "https:" + x
        }.collect(Collectors.toList())
        output(collect)
        collect.each {
            downPic(it)
        }

    }

    /**
     * Download the image found on an asset page
     * @param picurl URL of the asset page
     * @return
     */
    static def downPic(String picurl) {
        def get1 = getHttpGet(picurl)
        def response1 = getHttpResponse(get1)
        def pic = response1.getString(RESPONSE_CONTENT).replaceAll("\\s", EMPTY)
        //        output(pic)
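        def path = Regex.findFirst(pic, "//1\\.s91i\\.faiusr\\.com/\\d/.+?\\.png")
        // findFirst returns EMPTY when nothing matches, so skip pages without a PNG link
        if (path == EMPTY) return
        def all = "https:" + path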
        def tuple = FileUtil.handlePicName(all)
        RWUtil.down(tuple.first, LONG_Path + "pic/" + tuple.second)
    }

}

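I have to say, regular expressions are really handy. It is worth spending a little time to learn the basics.

Here I wrote a wrapper method that returns the first match: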
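
To make the two patterns in the script more concrete, here is a minimal Java sketch that uses only java.util.regex. The class name RegexPatternDemo, the findAll method, and the sample HTML string are made up for illustration and are not part of the FunTester library. The patterns mirror the ones in the script, with the dot before html escaped so that it matches only a literal dot.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the FunTester library.
class RegexPatternDemo {

    // Asset-page pattern: exactly 8 word characters followed by a literal ".html".
    static final Pattern DETAIL_PAGE = Pattern.compile("//kt\\.fkw\\.com/tupian/\\w{8}\\.html");

    // Image pattern: ".+?" is lazy, so the match stops at the first ".png".
    static final Pattern IMAGE = Pattern.compile("//1\\.s91i\\.faiusr\\.com/\\d/.+?\\.png");

    // Collect every match with an "https:" prefix, dropping duplicates
    // while keeping first-seen order.
    static Set<String> findAll(Pattern pattern, String text) {
        Set<String> result = new LinkedHashSet<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            result.add("https:" + matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input containing one duplicated link.
        String html = "<a href=\"//kt.fkw.com/tupian/g0a7Z5l6.html\"></a>"
                + "<a href=\"//kt.fkw.com/tupian/g0a7Z5l6.html\"></a>"
                + "<a href=\"//kt.fkw.com/tupian/2icutZh7.html\"></a>";
        System.out.println(findAll(DETAIL_PAGE, html));
    }
}
```

A LinkedHashSet plays the same role as the `as Set` cast in the script: duplicate links on the index page are visited only once.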

/**
 * Get the first match
 *
 * @param text  the text to search
 * @param regex the regular expression
 * @return the first match, or EMPTY if there is none
 */
public static String findFirst(String text, String regex) {
    Matcher matcher = matcher(text, regex);
    if (matcher.find()) return matcher.group();
    return EMPTY;
}
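
The `matcher` helper and the `EMPTY` constant used above are not shown in this post. A minimal self-contained sketch, assuming that `matcher` simply compiles the pattern and that `EMPTY` is the empty string, might look like this (the class name FindFirstDemo is hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the Regex utility; assumptions are marked below.
class FindFirstDemo {

    // Assumption: EMPTY is just the empty string.
    static final String EMPTY = "";

    // Assumption: the helper compiles the regex and binds it to the text.
    static Matcher matcher(String text, String regex) {
        return Pattern.compile(regex).matcher(text);
    }

    // Return the first match, or EMPTY when nothing matches.
    static String findFirst(String text, String regex) {
        Matcher matcher = matcher(text, regex);
        return matcher.find() ? matcher.group() : EMPTY;
    }

    public static void main(String[] args) {
        // Made-up sample input.
        String page = "<img src=\"//1.s91i.faiusr.com/4/demo!800x800.png\">";
        String first = findFirst(page, "//1\\.s91i\\.faiusr\\.com/\\d/.+?\\.png");
        // Check for EMPTY before prefixing; otherwise a miss would yield the bare string "https:".
        if (first.isEmpty()) {
            System.out.println("no image link found");
        } else {
            System.out.println("https:" + first);
        }
    }
}
```

Because a miss returns an empty string rather than null, callers never hit a NullPointerException, but they do need to check for the empty result before building a URL from it.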

Console output

INFO-> Current user: fv, IP: 192.168.0.103, working directory: /Users/fv/Documents/workspace/funtester/, system encoding: UTF-8, system Mac OS X version: 10.16
WARN-> Response body is not JSON; it has been converted to JSON automatically!
INFO-> Request URI: https://kt.fkw.com/muban/word-7502-0-0-0-0-0-0.html, elapsed: 2058 ms, 
INFO-> No. 1: https://kt.fkw.com/tupian/g0a7Z5l6.html
... N log lines omitted here ...
INFO-> No. 50: https://kt.fkw.com/tupian/2icutZh7.html
INFO-> No. 51: https://kt.fkw.com/tupian/2icutZh5.html
WARN-> Response body is not JSON; it has been converted to JSON automatically!
INFO-> Request URI: https://kt.fkw.com/tupian/g0a7Z5l6.html, elapsed: 1790 ms, 
INFO-> Download link: https://1.s91i.faiusr.com/4/AFsIABAEGAAgq8-27AUohsvXxgMwhAc49AM!800x800.png, saved as: /Users/fv/Documents/workspace/funtester/long/pic/AFsIABAEGAAgq8-27AUohsvXxgMwhAc49AM!800x800.png

Process finished with exit code 130 (interrupted by signal 2: SIGINT)
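
The last download log line shows a stored file name that equals the final path segment of the download link. `FileUtil.handlePicName` and `RWUtil.down` are not shown in this post, so the following is only a sketch of how a similar result could be obtained with the JDK alone. The class PicDownloadDemo, its method names, and the target directory are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical demo class; not the FunTester implementation.
class PicDownloadDemo {

    // Take everything after the last '/' as the file name.
    static String fileNameOf(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    // Stream the resource to dir/<file name>, creating the directory if needed.
    static Path download(String url, Path dir) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileNameOf(url));
        try (InputStream in = URI.create(url).toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    public static void main(String[] args) {
        String url = "https://1.s91i.faiusr.com/4/AFsIABAEGAAgq8-27AUohsvXxgMwhAc49AM!800x800.png";
        System.out.println(fileNameOf(url));
        // To actually fetch the file (Java 11+), declare "throws IOException" on main and call:
        // download(url, Path.of("pic"));
    }
}
```

The try-with-resources block closes the network stream even if the copy fails partway, which matters when looping over dozens of asset pages.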

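If you are interested, take a look at the page structure yourself and try crawling it with a framework such as Selenium as well.

FunTester: named an author of the year by the Tencent Cloud community, and a not-so-famous test developer. Feel free to follow.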
