21 Days to Build a Distributed Crawler: Douban Movies and Movie Heaven in Practice (Part 3)
Published: 2019-06-14


3.1. Douban Movies

Using lxml to parse the now-playing list:

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
    'Referer': 'https://movie.douban.com/'
}
url = 'https://movie.douban.com/cinema/nowplaying/beijing/'
response = requests.get(url, headers=headers)
text = response.text
html = etree.HTML(text)

# Grab the list of movies currently in theaters
ul = html.xpath("//ul[@class='lists']")[0]
lis = ul.xpath("./li")
movies = []
for li in lis:
    title = li.xpath("@data-title")[0]
    score = li.xpath("@data-score")[0]
    duration = li.xpath("@data-duration")[0]
    region = li.xpath("@data-region")[0]
    director = li.xpath("@data-director")[0]
    actors = li.xpath("@data-actors")[0]
    # Movie poster image
    thumbnail = li.xpath(".//img/@src")[0]
    movie = {
        'title': title,
        'score': score,
        'duration': duration,
        'region': region,
        'director': director,
        'actors': actors,
        'thumbnail': thumbnail,
    }
    movies.append(movie)
print(movies)
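Douban embeds each movie's metadata as data-* attributes on the <li> elements, which is why the XPath above is mostly attribute lookups rather than text extraction. A minimal sketch of that pattern against a made-up HTML fragment (the fragment and its values are illustrative, not the live Douban markup, which can change at any time):

from lxml import etree

# Hypothetical fragment mimicking the data-* attribute layout
snippet = '''
<ul class="lists">
  <li data-title="Example Movie" data-score="8.5">
    <img src="http://example.com/poster.jpg"/>
  </li>
</ul>
'''
html = etree.HTML(snippet)
li = html.xpath("//ul[@class='lists']/li")[0]
print(li.xpath("@data-title")[0])   # Example Movie
print(li.xpath("@data-score")[0])   # 8.5
print(li.xpath(".//img/@src")[0])   # http://example.com/poster.jpg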

3.2. Movie Heaven (电影天堂)

Using lxml to crawl the list pages and each movie's detail page:

import requests
from lxml import etree

BASE_DOMAIN = 'http://dytt8.net'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}

def get_detail_urls(url):
    '''Collect the detail-page URLs from one list page.'''
    response = requests.get(url, headers=HEADERS)
    text = response.text
    html = etree.HTML(text)
    # Relative links to each movie's detail page
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    # Prepend the domain to build complete URLs
    detail_urls = map(lambda url: BASE_DOMAIN + url, detail_urls)
    return detail_urls

def parse_detail_page(url):
    '''Parse one detail page into a movie dict.'''
    movie = {}
    response = requests.get(url, headers=HEADERS)
    text = response.content.decode('gbk')   # dytt8 pages are GBK-encoded
    html = etree.HTML(text)
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    movie['title'] = title
    zoomE = html.xpath("//div[@id='Zoom']")[0]
    imgs = zoomE.xpath(".//img/@src")
    cover = imgs[0]           # movie poster
    movie['cover'] = cover
    # Some movies have no screenshot, so guard against a missing second image
    try:
        screenshot = imgs[1]  # movie screenshot
        movie['screenshot'] = screenshot
    except IndexError:
        pass
    infos = zoomE.xpath(".//text()")
    for index, info in enumerate(infos):
        if info.startswith("◎年  代"):
            info = info.replace("◎年  代", "").strip()
            movie['year'] = info
        elif info.startswith("◎产  地"):
            info = info.replace("◎产  地", "").strip()
            movie['country'] = info
        elif info.startswith("◎类  别"):
            info = info.replace("◎类  别", "").strip()
            movie['category'] = info
        elif info.startswith("◎豆瓣评分"):
            info = info.replace("◎豆瓣评分", "").strip()
            movie['douban_rating'] = info
        elif info.startswith("◎片  长"):
            info = info.replace("◎片  长", "").strip()
            movie['duration'] = info
        elif info.startswith("◎导  演"):
            info = info.replace("◎导  演", "").strip()
            movie['director'] = info
        # A movie can have several leading actors, one per text node,
        # so collect lines until the synopsis marker appears
        elif info.startswith("◎主  演"):
            info = info.replace("◎主  演", "").strip()
            actors = [info]
            for x in range(index + 1, len(infos)):
                actor = infos[x].strip()
                if actor.startswith("◎简  介 "):
                    break
                actors.append(actor)
            movie['actors'] = actors
        elif info.startswith("◎简  介 "):
            # The synopsis can also span several text nodes; accumulate and
            # join them instead of overwriting the value on every iteration
            profiles = []
            for x in range(index + 1, len(infos)):
                profile = infos[x].strip()
                if profile.startswith("【下载地址】"):
                    break
                profiles.append(profile)
            movie['profile'] = ''.join(profiles)
    # Download link (in the highlighted table cell)
    download_url = html.xpath("//td[@bgcolor='#fdfddf']/a/@href")[0]
    movie['download_url'] = download_url
    return movie

def spider():
    base_url = 'http://dytt8.net/html/gndy/dyzz/list_23_{}.html'
    movies = []
    for x in range(1, 8):   # crawl the first 7 pages
        url = base_url.format(x)
        detail_urls = get_detail_urls(url)
        for detail_url in detail_urls:
            movie = parse_detail_page(detail_url)
            movies.append(movie)
            print(movie)
    print(movies)   # all collected movie info

if __name__ == '__main__':
    spider()
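The spider above only prints its results. A small follow-up sketch, assuming you want to persist the data, that dumps the collected list to a JSON file; the save_movies helper and the dytt_movies.json filename are illustrative additions, not part of the original tutorial:

import json

def save_movies(movies, path='dytt_movies.json'):
    '''Write the scraped movie dicts to a JSON file (hypothetical helper).'''
    # ensure_ascii=False keeps the Chinese titles human-readable in the file
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(movies, f, ensure_ascii=False, indent=2)

spider() would then end with save_movies(movies) instead of the final print(movies).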

 

Reposted from: https://www.cnblogs.com/derek1184405959/p/9384050.html
