广西空中课堂: scraping each day's Grade 5 teaching videos (tools: scrapy, selenium, re, BeautifulSoup)


For special reasons these past few days I've been stuck at home with nothing to do, and it happens that my little sister has to take her classes from home. We don't have a Guangxi Cable (广西广电) set-top box, so the only option is to download the videos from the web and play them on the TV. I had picked up a bit of web scraping recently, so this was a good chance to practice (I checked: the site has no robots.txt restrictions).

Site link: 广西空中课堂

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
import re
import datetime
from selenium import webdriver
import os
import time


class MycoursespiderSpider(scrapy.Spider):
    name = 'mycoursespider'
    global mydict
    mydict = {}
    start_urls = ['.html']  # the full start URL was lost from the post; it is the site's course listing page

    def parse(self, response):
        curr_time = datetime.datetime.now()
        global today
        today = str(curr_time.month) + '月' + str(curr_time.day) + '日'  # e.g. 3月5日, used to match today's titles
        global mypath
        mypath = os.path.dirname(os.path.realpath(__file__)) + '/' + today
        if not os.path.exists(mypath):
            os.mkdir(mypath)
            mypath = mypath + '/'
        else:
            mypath = mypath + '/'
        domain = ''  # site domain for building absolute links; the value was lost from the post
        ctable = response.css('a#ctable::attr(href)').extract()[0]  # link to the course timetable
        yield scrapy.Request(ctable, callback=self.parsecoursetable, meta={'url': ctable})
        g5 = response.css('ul#g5 a[target=_blank]').extract()  # grade 5 column
        g4 = response.css('ul#g4 a[target=_blank]').extract()  # grade 4 column
        g5 = ''.join(g5)
        g4 = ''.join(g4)
        soup = BeautifulSoup(g5, 'html.parser')
        ensoup = BeautifulSoup(g4, 'html.parser')
        for i in ensoup.find_all('a'):
            # check whether there is an English lesson today; rural grade 5 follows the grade 4 English class
            if re.search(today + '-英语', i['title']) is not None:
                mydict.update({i['title']: domain + i['href']})
        for i in soup.find_all('a'):  # today's grade 5 updates
            if re.search(today, i['title']) is not None:
                if re.search('-英语', i['title']) is not None:
                    pass
                else:
                    mydict.update({i['title']: domain + i['href']})  # queue the page for parsing
        for key in mydict:
            page = mydict[key]
            yield scrapy.Request(page, callback=self.parseinside)

    def parseinside(self, response):
        curr_time = datetime.datetime.now()  # current time
        filename = str(curr_time.month) + '-' + str(curr_time.day) + '.txt'
        playhost = '.*mp4|.*mp4'  # regex for the playable video URL; the two host prefixes were lost from the post
        resp = response.text
        print(resp)
        title = response.css('h3#title::text').extract_first()
        print(title)
        playlink = re.search(playhost, resp)
        if playlink is not None:
            video = str(playlink.group(0))
            mydict[title] = video
        else:
            return
        with open(mypath + filename, 'w+') as f:
            for key in mydict:
                f.write(str(key).replace(u'\xa0', u' ') + ':' + str(mydict[key]).replace(u'\xa0', u' '))
                f.write('\n')
        yield scrapy.Request(video, self.parsevideo, meta={'title': title})  # comment this out to skip downloading; meta passes values between callbacks

    def parsevideo(self, response):  # save the video
        print(response.request.headers['User-Agent'])
        title = response.meta['title'] + '.mp4'
        # title = title.translate(None, r'|\?*<":>+[]/')
        with open(mypath + title, 'wb') as f:
            f.write(response.body)

    def parsecoursetable(self, response):  # switched to selenium: a bit slow, but it gets the job done
        url = response.meta['url']
        browser = webdriver.Firefox(executable_path=r'C:\Users\qq\AppData\Local\Programs\Python')  # geckodriver path, truncated in the post
        browser.get(url)
        time.sleep(3)
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        i = soup.find(id='news_con')
        print(str(i))
        piclink = re.findall('.*?jpg|.*?jpg', str(i))  # timetable image URLs; host prefixes were lost from the post
        browser.quit()
        if piclink is not None:
            lengh = len(piclink)
            for i in range(0, lengh):
                yield scrapy.Request(piclink[i], callback=self.parsepicture, meta={'title': str(i)})

    def parsepicture(self, response):
        title = today + '_' + response.meta['title'] + '.jpg'
        with open(mypath + title, 'wb') as f:
            f.write(response.body)
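The usual way to run this is "scrapy crawl mycoursespider" from inside a Scrapy project. If you'd rather run it as a plain script, a minimal sketch is below; the module name mycoursespider and both settings values are my assumptions, not from the original post:

    # Minimal sketch: run the spider as a standalone script instead of
    # "scrapy crawl mycoursespider". Assumes the class above is saved in
    # mycoursespider.py next to this script (hypothetical module name).
    from scrapy.crawler import CrawlerProcess
    from mycoursespider import MycoursespiderSpider

    process = CrawlerProcess(settings={
        'ROBOTSTXT_OBEY': False,      # the post notes the site has no robots.txt restrictions
        'USER_AGENT': 'Mozilla/5.0',  # hypothetical browser-like UA string
    })
    process.crawl(MycoursespiderSpider)
    process.start()                   # blocks until the crawl finishes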

I gave up on Splash and used Selenium instead; it's much simpler.
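Calling webdriver directly inside a spider callback, as parsecoursetable does, blocks Scrapy's event loop while the browser renders. A more idiomatic home for the Selenium work would be a downloader middleware; here is a sketch of that alternative, where the class name, the 'use_selenium' meta flag, and the crude wait are all my own choices, not from the original post:

    # Sketch of moving the Selenium rendering into a Scrapy downloader middleware.
    from scrapy.http import HtmlResponse
    from selenium import webdriver
    import time

    class SeleniumMiddleware:
        def __init__(self):
            self.browser = webdriver.Firefox()  # assumes geckodriver is on PATH

        def process_request(self, request, spider):
            if not request.meta.get('use_selenium'):
                return None  # fall through to Scrapy's normal downloader
            self.browser.get(request.url)
            time.sleep(3)  # same crude wait as the post; WebDriverWait would be nicer
            # returning a Response here short-circuits the download for this request
            return HtmlResponse(request.url, body=self.browser.page_source,
                                encoding='utf-8', request=request)

With this enabled in DOWNLOADER_MIDDLEWARES, the spider would simply yield the timetable request with meta={'use_selenium': True} and parse the returned response as usual, instead of driving the browser itself.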

A quick note: when writing the file, GBK cannot encode \xa0 (the page's non-breaking space character), hence the .replace(u'\xa0', u' ') calls.
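To make the note concrete, this small sketch reproduces the error and shows the fix; the explicit-encoding variant at the end is my suggestion, not what the post does:

    # -*- coding: utf-8 -*-
    # '\xa0' is the non-breaking space (&nbsp;) that appears in the scraped titles.
    try:
        '\xa0'.encode('gbk')            # fails: GBK has no mapping for U+00A0
    except UnicodeEncodeError as e:
        print(e)                        # 'gbk' codec can't encode character '\xa0' ...
    print('五年级\xa0语文'.replace(u'\xa0', u' ').encode('gbk'))  # the post's fix: swap in a plain space first

    # Alternative: open the file with an explicit encoding so no replace is needed:
    # with open(mypath + filename, 'w+', encoding='utf-8') as f: ...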
