Scraping text, images, and video

Beginner

Text (scrape the article-list titles from the Baidu homepage)

# This example relies on the text attribute to decode the response. If you read content instead, call decode() with the matching encoding.
import requests
from lxml import etree

header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"}

res = requests.get('http://www.baidu.com', headers=header)
print(res.encoding)

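# Set the file encoding explicitly; the platform default may not be able to encode the page's Chinese text.
with open('baidu.html', 'w', encoding='utf-8') as f: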
    f.write(res.text)

selector = etree.HTML(res.text)
result = selector.xpath('//ul[@class]/li//span[@class="title-content-title"]/text()')
print(result)
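
The difference between text and content comes down to who does the decoding. The sketch below illustrates this without any network access: `decode_body` is a hypothetical helper (not part of requests), and `sample` is a made-up payload standing in for `res.content`. It shows that decoding the raw bytes with the right encoding recovers the page text, while the wrong encoding produces garbled characters.

```python
from typing import Optional


def decode_body(raw: bytes, declared_encoding: Optional[str], fallback: str = "utf-8") -> str:
    """Decode raw response bytes, falling back to UTF-8 when no encoding is declared.

    Hypothetical helper for illustration; it mimics decoding res.content by hand.
    """
    encoding = declared_encoding or fallback
    return raw.decode(encoding, errors="replace")


# Made-up payload standing in for res.content.
original = "<title>百度一下，你就知道</title>"
sample = original.encode("utf-8")

correct = decode_body(sample, "utf-8")       # matches the original text
garbled = decode_body(sample, "ISO-8859-1")  # wrong encoding: mojibake

print(correct == original)
print(garbled == original)
```

With a real response, the equivalent call would be `res.content.decode('utf-8')`, assuming the page is actually served as UTF-8.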

Image (download the Baidu logo and save it to disk)

import requests

res = requests.get('https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png')

print(res.content)

with open('baidu-logo.png', 'wb') as f:
    f.write(res.content)
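
Before writing downloaded bytes to a .png file, it can be worth checking that they really are a PNG, since a failed request may return an HTML error page instead. The sketch below is one possible check, not something the original example does: `looks_like_png` is a hypothetical helper, and the two byte strings are made-up stand-ins for `res.content`. It only compares the first eight bytes against the standard PNG signature.

```python
# Every PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Return True if data starts with the PNG signature (hypothetical helper)."""
    return data.startswith(PNG_SIGNATURE)


# Made-up payloads standing in for res.content.
png_like = PNG_SIGNATURE + b"...remaining image bytes..."
error_page = b"<html><body>404 Not Found</body></html>"

print(looks_like_png(png_like))
print(looks_like_png(error_page))
```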

Video (download an mp4 from a given URL and save it to disk)

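# Write the response in chunks with iter_content() so the whole file is never held in memory.
# stream=True is required: without it, requests downloads the entire body before iter_content() runs.
import requests

res = requests.get('https://baikevideo.cdn.bcebos.com/media/mda-OgKIAVGwqTr85ead/6010f8507988556ac6e536ffb7f74031.mp4', stream=True)

with open('intro.mp4', 'wb') as f:
    # iter_content() defaults to chunk_size=1 (one byte per iteration), so pass a larger size.
    for chunk in res.iter_content(chunk_size=8192):
        f.write(chunk)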
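
The chunked-write pattern can be tried without downloading anything. In the sketch below, `save_stream` and `fake_chunks` are hypothetical names introduced for illustration: `fake_chunks` stands in for `res.iter_content(chunk_size=...)` on a request made with `stream=True`, and the payload and output path are made up. Only one chunk is passed to the file at a time.

```python
import os
import tempfile
from typing import Iterable, Iterator


def save_stream(chunks: Iterable[bytes], path: str) -> int:
    """Write chunks to path one at a time and return the total bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def fake_chunks(payload: bytes, chunk_size: int) -> Iterator[bytes]:
    """Stand-in for res.iter_content(chunk_size=...): yield payload in slices."""
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]


# Made-up 20,000-byte payload and a temporary output path.
payload = b"\x00" * 20000
target = os.path.join(tempfile.gettempdir(), "intro_demo.bin")
written = save_stream(fake_chunks(payload, 8192), target)
print(written)
```

With a real download, `save_stream(res.iter_content(chunk_size=8192), 'intro.mp4')` would play the same role, assuming the request was created with `stream=True`.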


Last updated 2 years ago
