Web Page Parsing

RE

Find all content that matches a pattern

import re

# Extract specific content
s = """
<a href="http://www.badu.com/s?wd=hahaha">hahaha</a>
<a href="http://www.tmall.com/">tmall</a>
<a href="http://www.tmall.com/">tmall</a>
"""

result = re.findall('<a href="(.*?)">(.*?)</a>', s, re.S)
print(result)

# Output:
[('http://www.badu.com/s?wd=hahaha', 'hahaha'), ('http://www.tmall.com/', 'tmall'), ('http://www.tmall.com/', 'tmall')]


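# Explanation:
# re.findall finds every match and returns the results as a list
# () marks a capture group; with more than one group, each match is returned as a tuple of the captured text
# re.S lets . match newline characters as well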
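How the shape of the `re.findall` result depends on the number of capture groups can be seen in a minimal sketch. The `html` string and the variable names below are made up for illustration and are not part of the original example.

```python
import re

# Hypothetical input, used only to show the three result shapes.
html = '<a href="/a">first</a> <a href="/b">second</a>'

# No capture group: each item is the whole matched text.
whole = re.findall(r'<a href=".*?">.*?</a>', html)

# One capture group: each item is that group's text (a plain string).
hrefs = re.findall(r'<a href="(.*?)">', html)

# Two capture groups: each item is a tuple with one entry per group.
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html)

print(whole)
print(hrefs)
print(pairs)
```

If only the link targets are needed, the single-group form avoids unpacking tuples afterwards.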

Find the first match and give the captured groups names

import re

s = """
我的第一个手机号码是13928835900,
我的第二个手机号是15100363326,
我的qq邮箱是[email protected],
我的gmail邮箱是[email protected].
"""

# \. matches a literal dot; an unescaped . would match any character before "com"
result = re.search(r"(?P<tel_1>1[1-9][0-9]{9}).*?(?P<email_1>[0-9a-zA-Z_]+@[0-9a-zA-Z_]+\.com)", s, re.S)

print(result.group("tel_1"))
print(result.group("email_1"))
print(result)

# Output:
13928835900
[email protected]
<re.Match object; span=(11, 67), match='13928835900,\n我的第二个手机号是15100363326,\n我的qq邮箱是anyci>


# Explanation:
# re.search scans the string and returns only the first match; if nothing matches, it returns None
# ?P<name> gives a capture group a name; .group('name') returns the text that group matched
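Because `re.search` can return `None`, it is usually safer to check the result before calling `.group()`. The sketch below reuses the same pattern idea on a made-up string; the phone number, the address, and the names `text`, `pattern`, `fields`, and `missing` are all hypothetical.

```python
import re

# Hypothetical sample text; the number and the address are made up.
text = "tel: 13800000000, mail: demo_user@example.com"

pattern = re.compile(
    r"(?P<tel>1[1-9][0-9]{9}).*?(?P<email>[0-9a-zA-Z_]+@[0-9a-zA-Z_]+\.com)",
    re.S,
)

match = pattern.search(text)
fields = {}
if match is not None:
    # groupdict() returns every named group as a dict keyed by group name.
    fields = match.groupdict()
    print(fields["tel"], fields["email"])

# No phone number here, so search returns None instead of a match object.
missing = pattern.search("no phone or mail here")
print(missing)
```

`groupdict()` is convenient when several named groups are extracted at once, since the result can be passed around as an ordinary dictionary.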

BS4

XPath

Example: extracting chapter links from the 17k chapter HTML

XPath rules

# XPath rule for the link href ("正文" is the page's "main text" label)
//span[contains(text(), "正文")]/../../dd/a/@href

# XPath rule for the link text
//span[contains(text(), "正文")]/../../dd/a/span/text()

# Select div elements that have the custom attribute ne-if with a value containing {{, e.g. <div class="hidden" ne-if="{{__i == 0}}">
//div[contains(@ne-if,"{{")]
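The three rules above can be tried out on a small inline document before pointing them at the real page. The markup below is a hypothetical stand-in that only imitates the nesting the rules assume (a label `span` two levels below the element that also holds the `dd` links); the actual 17k markup may differ.

```python
from lxml import etree

# Hypothetical markup that imitates the structure the rules expect.
html = """
<dl>
  <dt><span>正文</span></dt>
  <dd>
    <a href="/chapter/1.html"><span> Chapter 1 </span></a>
    <a href="/chapter/2.html"><span> Chapter 2 </span></a>
  </dd>
</dl>
<div class="hidden" ne-if="{{__i == 0}}">template</div>
<div class="plain">static</div>
"""

root = etree.HTML(html)

# From the label span, go up two levels with /../.. and back down to the links.
hrefs = root.xpath('//span[contains(text(), "正文")]/../../dd/a/@href')
titles = [
    t.strip()
    for t in root.xpath('//span[contains(text(), "正文")]/../../dd/a/span/text()')
]

# contains() on an attribute: only the div whose ne-if value includes "{{" matches.
templated = root.xpath('//div[contains(@ne-if, "{{")]/@class')

print(hrefs)
print(titles)
print(templated)
```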

Python implementation

from lxml import etree

# res is the response object returned by requests
selector = etree.HTML(res.text)
aResult = selector.xpath('//span[contains(text(), "正文")]/../../dd/a')
for item in aResult:
    name = item.xpath('./span/text()')[0].strip()
    href = item.xpath('./@href')[0]
    print(name, href)

# Output:
第一章  爷爷的秘密 /chapter/3425715/46114063.html
第二章  火车上的事情 /chapter/3425715/46117382.html
第三章  谢家人的出现 /chapter/3425715/46118300.html
...
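The extracted `href` values are site-relative paths, so they need to be resolved against the page URL before they can be requested. A minimal sketch using the standard library's `urljoin`, assuming the list page URL is the base (the variable names are illustrative):

```python
from urllib.parse import urljoin

# URL of the chapter list page the links were scraped from.
base_url = "https://www.17k.com/list/3425715.html"

# One of the relative hrefs printed above.
href = "/chapter/3425715/46114063.html"

# A leading slash replaces the whole path of the base URL.
full_url = urljoin(base_url, href)
print(full_url)
```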

HTML source (https://www.17k.com/list/3425715.html): chapter.html, 32KB

Example: extracting the article text from the 17K page HTML

XPath rules

# XPath rule for the article text (every p tag except the last one)
//div[contains(@class,"content")]/div[@class="p"]/p[position()<last()]
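The `position()<last()` predicate keeps every `p` except the final one within each parent. The sketch below applies the rule to hypothetical markup; the `reader` class name and the paragraph text are made up, and only the `content` / `p` class names come from the rule itself.

```python
from lxml import etree

# Hypothetical markup: the last <p> plays the role of the trailing notice to skip.
html = """
<div class="reader content">
  <div class="p">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <p>Footer notice to skip.</p>
  </div>
</div>
"""

root = etree.HTML(html)
paragraphs = root.xpath(
    '//div[contains(@class,"content")]/div[@class="p"]/p[position()<last()]'
)

# string(.) collects all text inside each <p>, including nested tags.
text = "\n".join(p.xpath("string(.)").strip() for p in paragraphs)
print(text)
```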

HTML source (https://www.17k.com/chapter/3425715/46114063.html): page.html, 39KB


Last updated 2 years ago
