
Scraping a Training Institution's Baidu Netdisk Links


Reading this article takes about 15 minutes.


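In today's training market, where good and bad providers are mixed together, many training institutions record at least a few promotional videos. These videos are actually quite useful for people who want to get into the IT industry; I got into IT myself by watching videos like these. Today, let's scrape the Baidu Netdisk addresses of all of Heima's free videos.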

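The page that hosts Heima's free videos

Looking at this page and clicking through the page numbers of the download list, you can see that the addresses follow a regular pattern: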

.html
.html..
.html

The only part that changes is the n in n.html, so the following function builds the URLs of all the list pages:

public static List<String> getUrlLit() {
    String url = "/";
    List<String> urls = new ArrayList<String>();
    // List pages are numbered 1 through 22
    for (int i = 1; i < 23; i++) {
        String tmp = url + i + ".html";
        urls.add(tmp);
    }
    return urls;
}

This gives the result:

[.html,
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html, 
.html]

For each list page URL such as

.html

fetch the page's HTML source:

public static String getListHtml(String url) {
    String listHtml = HttpUtils.sendGet(url);
    return listHtml;
}

The sendGet method of the corresponding HttpUtils.java utility class is as follows:

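public static String sendGet(String url) {
    HttpGet get = null;
    CloseableHttpResponse resp = null;
    CloseableHttpClient client = null;
    try {
        client = HttpClients.createDefault();
        get = new HttpGet(url);
        resp = client.execute(get);
        int statusCode = resp.getStatusLine().getStatusCode();
        // Only 2xx responses are treated as success
        if (statusCode >= 200 && statusCode < 300) {
            HttpEntity entity = resp.getEntity();
            // CHARSET is a constant defined elsewhere in HttpUtils (not shown here)
            String content = EntityUtils.toString(entity, CHARSET);
            return content;
        }
    } catch (Exception e) {
        // The logger's name was lost in the source; "logger" is assumed here
        logger.error("sendGet failed: ", e);
    } finally {
        try {
            if (resp != null) resp.close();
        } catch (IOException e) {
        }
        try {
            if (client != null) client.close();
        } catch (IOException e) {
        }
    }
    // null signals a non-2xx status or an exception
    return null;
}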
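For readers who would rather not pull in Apache HttpClient, here is a minimal sketch of the same idea using the java.net.http client that ships with Java 11 and later. This is a swapped-in technique, not the code used in this article: the class name HttpFetchSketch, the method names, and the 10-second connect timeout are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical stand-in for HttpUtils, built on the JDK's own HTTP client.
final class HttpFetchSketch {

    // Same success rule as sendGet: any 2xx status counts as success.
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    // Returns the response body for a 2xx response, or null on any other
    // status, on an I/O error, or if the thread is interrupted.
    static String fetch(String url) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // assumed value
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return isSuccess(response.statusCode()) ? response.body() : null;
        } catch (IOException e) {
            return null;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still see it
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

Like sendGet, this returns null on failure, so the caller should check for null before parsing the result.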

Analyzing the source of each list page

.html

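Each boxed item represents one content page, and we need the href of its link. Inspecting the page source first confirms that the boxed element is the link address.

The following code collects all content page URLs from one list page: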

public static Set<String> parseListHtml(String html) {
    Set<String> urls = new HashSet<String>();
    // Keep only the fragment between the course list container and the trailing script
    html = html.substring(html.indexOf("<div class=\"alllist v_list\">"), html.indexOf("getCookie('isclose')"));
    Document doc = Jsoup.parse(html);
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        String linkHref = link.attr("href");
        // Keep course content pages, skip the pagination links
        if (linkHref.startsWith("/course/") && !linkHref.startsWith("/course/index/p/")) {
            urls.add(linkHref);
        }
    }
    System.out.println(urls);
    return urls;
}

Result:

[/course/8.html, /course/375.html, /course/267.html,
/course/36.html, /course/33.html, /course/30.html,
/course/5.html, /course/35.html, /course/273.html,
/course/6.html, /course/3.html, /course/7.html,
/course/10.html, /course/31.html, /course/1.html,
/course/4.html, /course/135.html, /course/140.html,
/course/9.html, /course/87.html]
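The filtering rule is the heart of this step, so here it is pulled out on its own as a small sketch that runs without jsoup. The class name CourseLinkFilter and the sample hrefs are made up for illustration; only the two prefixes come from the code in this article.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper that isolates the href filter used in parseListHtml.
final class CourseLinkFilter {

    // True for course content paths, false for pagination links and anything else.
    static boolean isCoursePage(String href) {
        return href.startsWith("/course/") && !href.startsWith("/course/index/p/");
    }

    // Keeps matching hrefs in first-seen order and drops duplicates.
    static Set<String> filter(List<String> hrefs) {
        Set<String> result = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (isCoursePage(href)) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input: one course page (listed twice), one pagination link, one unrelated link
        List<String> hrefs = Arrays.asList(
                "/course/8.html", "/course/index/p/2.html", "/about.html", "/course/8.html");
        System.out.println(filter(hrefs)); // prints [/course/8.html]
    }
}
```

Using a set matters here because the same course can be linked more than once on a list page, for example from both a thumbnail and a title.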

Because these addresses all start with /, the absolute URL of a content page is the site's base URL followed by the path:

+ /course/xxx.html
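String concatenation works fine for this, but java.net.URI can also do the resolution, which avoids doubled or missing slashes. The sketch below uses https://example.com/ purely as a placeholder base address; it is not the real site, and the class name ContentUrlSketch is likewise an assumption.

```java
import java.net.URI;

// Hypothetical helper showing how a site-relative path becomes an absolute URL.
final class ContentUrlSketch {

    // Resolves a path that starts with "/" against the site's base URL.
    static String toAbsolute(String baseUrl, String path) {
        return URI.create(baseUrl).resolve(path).toString();
    }

    public static void main(String[] args) {
        // https://example.com/ is a placeholder, not the address of the scraped site
        System.out.println(toAbsolute("https://example.com/", "/course/8.html"));
        // prints https://example.com/course/8.html
    }
}
```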

Use jsoup to scrape the Baidu Netdisk addresses directly:

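public static Set<String> getBaiDuYunPan(String url) {
    // Base address of the site, prepended to the relative content path
    String baseUrl = "";
    String contentUrl = baseUrl + url;
    Set<String> yunpanUrl = new HashSet<String>();
    try {
        // Wait 5 seconds between requests to go easy on the server
        Thread.sleep(5000);
        // Fetch and parse the content page with a 100-second timeout
        Document doc = Jsoup.parse(new URL(contentUrl), 100000);
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String linkHref = link.attr("href");
            // Keep only links that point at Baidu Netdisk
            if (linkHref.contains("pan.baidu")) {
                yunpanUrl.add(linkHref);
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return yunpanUrl;
}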

The url argument of this function is a path obtained from a list page, for example /course/xxx.html.

Final result

The main function that ties everything together:

public static void main(String[] args) {
    List<String> urls = getUrlLit();
    System.out.println(urls);
    for (int i = 0; i < urls.size(); i++) {
        // Download one list page and extract its content page paths
        String listHtml = getListHtml(urls.get(i));
        Set<String> urlSet = parseListHtml(listHtml);
        System.out.println("print page num: " + i);
        Iterator<String> iter = urlSet.iterator();
        while (iter.hasNext()) {
            // Visit each content page and print the netdisk links found on it
            String contentUrl = iter.next();
            Set<String> contentUrlList = getBaiDuYunPan(contentUrl);
            System.out.println(contentUrlList);
        }
    }
}
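The main function above prints one set per content page. If you would rather end up with a single de-duplicated collection, the loop could be restructured roughly as follows. This is only a sketch: the class name CrawlPipelineSketch and the two function parameters are invented for illustration, and skipping content pages that were already visited is an addition that the original loop does not make. The network calls are passed in as functions so the example runs with stubs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical restructuring of the crawl loop that returns one merged result.
final class CrawlPipelineSketch {

    // Walks every list page, then every content page found on it,
    // and merges all netdisk links into one de-duplicated set.
    static Set<String> collect(List<String> listPageUrls,
                               Function<String, Set<String>> contentUrlsOf,
                               Function<String, Set<String>> netdiskLinksOf) {
        Set<String> allLinks = new LinkedHashSet<>();
        Set<String> visited = new LinkedHashSet<>();
        for (String listPageUrl : listPageUrls) {
            for (String contentUrl : contentUrlsOf.apply(listPageUrl)) {
                // Skip content pages already seen on an earlier list page
                if (visited.add(contentUrl)) {
                    allLinks.addAll(netdiskLinksOf.apply(contentUrl));
                }
            }
        }
        return allLinks;
    }

    public static void main(String[] args) {
        // Stub functions stand in for the real download and parse steps
        Set<String> links = collect(
                Arrays.asList("list-1", "list-2"),
                listPage -> Collections.singleton("/course/1.html"),
                contentPage -> Collections.singleton("link-for:" + contentPage));
        System.out.println(links); // prints [link-for:/course/1.html]
    }
}
```

With the functions from this article, the call would presumably look like collect(getUrlLit(), u -> parseListHtml(getListHtml(u)), u -> getBaiDuYunPan(u)), assuming all of them live in the same class.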

The project's pom.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>yunpan</groupId>
    <artifactId>yunpan</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.5</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.9</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.49</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.24</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.12.1</version>
        </dependency>
    </dependencies>
</project>


Published: 2024-02-05 02:18:10

Article link: https://www.4u4v.net/it/170721784562113.html
