
I. Declaration

from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, parser)  # markup: the fetched content; parser: the name of the parser to use

II. Basic elements

1. Understanding the BeautifulSoup library

BeautifulSoup is a library for parsing, traversing, and maintaining a "tag tree".

2. The BeautifulSoup class

(1) How it works

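flowchart LR
	HTML <--> tree[tag tree]
	tree <--> bs[BeautifulSoup class]

from bs4 import BeautifulSoup
# Build from an HTML string
soup = BeautifulSoup("<html>data</html>", "html.parser")
# Build from a file; the with block closes the file afterwards
with open("D:/demo.html") as f:
    soup2 = BeautifulSoup(f, "html.parser")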
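The correspondence between HTML text, the tag tree, and a BeautifulSoup object can be checked in both directions. The following is a minimal sketch; the markup string is an illustrative assumption, not taken from the notes above.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption made for this sketch)
markup = "<html><body><p>data</p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# HTML -> tag tree: each tag is reachable from the soup object
first_p = soup.p
parent_name = first_p.parent.name  # name of the tag that encloses <p>

# Tag tree -> HTML: converting the object to str serializes the tree
round_trip = str(soup)

print(parent_name)
print(round_trip)
```

For well-formed markup like this, the serialized text matches the input, which is why the object can be treated as another view of the same document.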

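(2) Parsers

| Parser | Usage | Requirement |
| --- | --- | --- |
| bs4's HTML parser (Python's built-in html.parser) | BeautifulSoup(mk, "html.parser") | Install the bs4 library (pip install beautifulsoup4) |
| lxml's HTML parser | BeautifulSoup(mk, "lxml") | pip install lxml |
| lxml's XML parser | BeautifulSoup(mk, "xml") | pip install lxml |
| html5lib parser | BeautifulSoup(mk, "html5lib") | pip install html5lib |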
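Because only "html.parser" works without an extra install, code that prefers a third-party parser may want a fallback. The sketch below is one possible approach: `make_soup` and its `preferred` order are hypothetical names chosen for this example, while `FeatureNotFound` is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    Hypothetical helper for illustration; the preference order is an assumption.
    """
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>unclosed paragraph")
print(used, soup.p.string)
```

The parsers can build slightly different trees from invalid markup, so it helps to name the parser explicitly rather than rely on whichever one happens to be installed.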

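(3) Basic elements

| Basic element | Description |
| --- | --- |
| Tag | A tag, the most basic unit; it opens with <> and closes with </> |
| Name | The tag's name; accessed as .name |
| Attributes | The tag's attributes, organized as a dictionary; accessed as .attrs |
| NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </> |
| Comment | The comment portion of the string inside a tag |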
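The five element types in the table can be inspected on a small document. This is a minimal sketch; the markup is an assumption made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative markup (an assumption): one tag with attributes and text,
# and one tag whose only content is a comment
markup = '<p class="title" id="intro">Hello</p><b><!--a comment--></b>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p                # Tag: the first <p> in the document
tag_name = tag.name         # Name: the tag's name as a str
tag_attrs = tag.attrs       # Attributes: a dictionary of the tag's attributes
text = tag.string           # NavigableString: the text inside the tag
comment = soup.b.string     # Comment: .string also returns a comment

print(isinstance(tag, Tag), tag_name, tag_attrs)
print(type(text).__name__, type(comment).__name__)
```

Comment is a subclass of NavigableString, and `.string` returns both without any visible difference when printed, so check the type when comments must be kept apart from ordinary text.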

III. Usage

1. Loading

(1) Building from a string

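from bs4 import BeautifulSoup

html = '''
<html lang="zh-cn">
<head>
    <meta charset="utf-8" />
</head>
<body>
<div id="main">
    <span role="heading" aria-level="2">span</span>
    <h1>h1</h1>
    <p>p</p>
</div>
</body>
</html>
'''
# The built-in parser is named 'html.parser' (with a dot, not an underscore)
soup = BeautifulSoup(html, 'html.parser')
# Print the tag tree with one tag per line, indented by depth
print(soup.prettify())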

(2) Loading from a file

# '测试.html' ("test.html") is the author's example file name
with open('测试.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
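To try file loading without an existing page on disk, the sketch below first writes a small document to a temporary directory. The file name `demo.html` and the page content are assumptions made for illustration. It also shows the binary-mode alternative, where BeautifulSoup decodes the bytes itself instead of relying on the `encoding` argument to `open`.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative page content (an assumption); the heading is non-ASCII on purpose
page = '<html><head><meta charset="utf-8" /></head><body><h1>测试</h1></body></html>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)

    # Text mode: Python decodes the file, so the encoding is given to open()
    with open(path, encoding="utf-8") as f:
        soup_text = BeautifulSoup(f, "html.parser")

    # Binary mode: BeautifulSoup receives bytes and works out the encoding
    with open(path, "rb") as f:
        soup_bytes = BeautifulSoup(f, "html.parser")

print(soup_text.h1.string)
print(soup_bytes.h1.string, soup_bytes.original_encoding)
```

The soup objects stay usable after the files are closed because the whole document is read during construction.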