A Python crawler for 51job recruitment data


1. Import the libraries

The code is as follows (example):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import time
from lxml import etree
import pymongo
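
The imports bring in WebDriverWait and expected_conditions, but the scraper below settles for a fixed time.sleep(3) after each page load. Below is a minimal sketch of an explicit wait instead, assuming the result list still renders inside the dw_table container that the XPath expressions later in this post already target; get_text_with_wait is a hypothetical variant of the author's get_text():

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

def get_text_with_wait(url, timeout=10):
    # Hypothetical variant of get_text(): wait until the result list appears
    # instead of sleeping a fixed 3 seconds.
    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # Blocks until div.dw_table is in the DOM, or raises TimeoutException.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'dw_table'))
        )
        return browser.page_source
    finally:
        browser.close()

This returns as soon as the list is present rather than always paying the full sleep, and fails loudly on pages that never load.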

2. Read in the data

The code is as follows (example):

def get_text_url(n):
    # Build the search-result page URLs. The scheme/host prefix of this URL
    # was stripped from the original post; the standard 51job search prefix
    # is assumed here. The double-URL-encoded keyword is 人工智能
    # (artificial intelligence).
    base = 'https://search.51job.com/list/000000,000000,0000,00,9,99,%25E4%25BA%25BA%25E5%25B7%25A5%25E6%2599%25BA%25E8%2583%25BD,2,'
    for page in range(1, n):
        # Build each page URL from the fixed base. (The original accumulated
        # into `url` across iterations, corrupting every page after the first.)
        yield base + str(page) + '.html'

def get_text(url):
    # Render the page in Chrome and return its HTML source.
    browser = webdriver.Chrome()
    try:
        browser.get(url)
        time.sleep(3)  # crude wait for the page to finish rendering
        return browser.page_source
    finally:
        browser.close()

def get_url_from_text(text):
    # Extract the detail-page link of every posting in the result list.
    selector = etree.HTML(text)
    res = selector.xpath("//div[@class='dw_table']//div[@class='el']/p/span/a/@href")
    for i in res:
        yield i

def parse_detail_page(url):
    # Scrape one job detail page into a flat dict.
    detail_text = get_text(url)
    selector = etree.HTML(detail_text)
    dic = {}
    job_name = selector.xpath("//div[@class='cn']/h1/text()")
    dic['job_name'] = ''.join(job_name).strip()
    job_salary = selector.xpath("//div[@class='cn']/strong/text()")
    dic['job_salary'] = ''.join(job_salary).strip()
    exp_edu_req = selector.xpath("//div[@class='cn']/p[@class='msg ltype']/text()")
    dic['exp_edu_req'] = ','.join(exp_edu_req).strip()
    job_welfare = selector.xpath("//div[@class='cn']//div[@class='t1']/span/text()")
    dic['job_welfare'] = ','.join(job_welfare).strip()
    tech_responsibility_describe = selector.xpath("//div[@class='tCompany_main']//div[@class='bmsg job_msg inbox']/p/text()")
    dic['tech_responsibility_describe'] = ''.join(tech_responsibility_describe).strip()
    company_name = selector.xpath("//div[@class='cn']/p[@class='cname']/a/@title")
    dic['company_name'] = ','.join(company_name).strip()
    company_info = selector.xpath("//div[@class='tCompany_sidebar']//div[@class='com_tag']//text()")
    dic['company_info'] = ','.join([x.strip() for x in company_info]).strip()
    # dic['company_info'] = ','.join(company_info).strip()
    company_introduce = selector.xpath("//div[@class='tCompany_main']//div[@class='tmsg inbox']//text()")
    dic['company_introduce'] = ','.join(company_introduce).strip()
    return dic

def insertMongo(data):
    # Store one job dict in the local MongoDB (database `spider`, collection `job51`).
    client = pymongo.MongoClient('mongodb://localhost:27017/')
    db = client.spider
    collection = db.job51
    result = collection.insert_one(data)
    print(result)

if __name__ == '__main__':
    # Walk the first 5 result pages, then scrape and store every posting.
    for i in get_text_url(6):
        text = get_text(i)
        for j in get_url_from_text(text):
            data = parse_detail_page(j)
            print(data)
            insertMongo(data)
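
Note that insertMongo() opens a fresh MongoClient for every document, which becomes wasteful once hundreds of postings are inserted. Below is a minimal sketch, assuming the same local MongoDB instance and the spider.job51 collection from above, that reuses a single client and batches the writes; the insert_many_jobs helper name is hypothetical:

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
collection = client.spider.job51  # same database/collection as insertMongo()

def insert_many_jobs(records):
    # Hypothetical batch variant of insertMongo(): one round trip for a whole
    # page of postings instead of one connection per document.
    if records:
        collection.insert_many(records)

# Quick sanity check: read back a few stored postings.
for doc in collection.find({}, {'_id': 0, 'job_name': 1, 'job_salary': 1}).limit(5):
    print(doc)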
