Scraping text, images, and video
Beginner
Text (scrape the article-list titles from the Baidu home page)
# This example decodes the response through the text attribute; if you read content instead, call decode() with the matching encoding
import requests
from lxml import etree

header = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"}
res = requests.get('http://www.baidu.com', headers=header)
print(res.encoding)
# Set the file encoding explicitly so the write does not depend on the platform default
with open('baidu.html', 'w', encoding='utf-8') as f:
    f.write(res.text)
selector = etree.HTML(res.text)
result = selector.xpath('//ul[@class]/li//span[@class="title-content-title"]/text()')
print(result)

Image (scrape the Baidu logo and save it to disk)
import requests

res = requests.get('https://www.baidu.com/img/PCtm_d9c8750bed0b3c7d089fa7d55720d6cf.png')
print(res.content)  # raw bytes of the image
with open('baidu-log.png', 'wb') as f:
    f.write(res.content)

Video (scrape the mp4 video at a given URL and save it to disk)
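# Use iter_content() to write in chunks so the whole file never sits in memory
import requests

# stream=True is required; without it requests downloads the whole body before iter_content() runs
res = requests.get('https://baikevideo.cdn.bcebos.com/media/mda-OgKIAVGwqTr85ead/6010f8507988556ac6e536ffb7f74031.mp4', stream=True)
with open('intro.mp4', 'wb') as f:
    # iter_content() defaults to 1-byte chunks, so pass a larger chunk_size
    for item in res.iter_content(chunk_size=8192):
        f.write(item)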
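
The chunked download above can be wrapped in a small reusable helper. The following is only a sketch under a few assumptions: the names write_chunks and download, the 8192-byte chunk size, and the 10-second timeout are illustrative choices, not part of the original example. It relies on stream=True so that the body is fetched lazily as iter_content() is consumed.

```python
import requests


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written.

    Hypothetical helper name, introduced for illustration only.
    """
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download(url, path, chunk_size=8192, timeout=10):
    """Stream url to path without holding the whole body in memory.

    chunk_size and timeout are assumed defaults; tune them as needed.
    """
    # stream=True defers the body download until iter_content() is consumed
    with requests.get(url, stream=True, timeout=timeout) as res:
        # Fail on 4xx/5xx instead of saving an error page as the video
        res.raise_for_status()
        return write_chunks(res.iter_content(chunk_size=chunk_size), path)
```

Calling download() with the mp4 URL used above and 'intro.mp4' as the path should produce the same file as the original loop. Keeping the file-writing loop in its own function also lets it be exercised without a network connection.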