Writing this up took real effort; if you find it useful, please give it a like!!!
Background: I actually scraped a lot of Tianyancha data two years ago, including phone numbers, addresses, and other basic information, plus each company's shareholders, patents, and outbound investments. But that computer was never backed up and the code is gone. Recently an education company in Shandong paid me to scrape company phone numbers and addresses from Tianyancha, so I rebuilt the scraper.
Setup: selenium + PhantomJS, or selenium + Firefox.
I went with the latter here, selenium + Firefox.
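For reference, a minimal setup sketch. It assumes geckodriver is on your PATH and a Selenium 3.x install (which matches the find_element_by_* calls used below); zhangHao and miMa are the account credentials used later, shown here as placeholders:

import time
import random
from selenium import webdriver

loginURL = 'https://www.tianyancha.com/login'
zhangHao = 'your-phone-number'   # Tianyancha account phone number (placeholder)
miMa = 'your-password'           # account password (placeholder)

browser = webdriver.Firefox()    # needs geckodriver on the PATH
browser.maximize_window()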
Approach: the code for scraping this information isn't hard. The main steps are simulating the login, collecting the company page URLs, and extracting the information from each company page.
Simulating the login
Login page URL: https://www.tianyancha.com/login
The page looks like this:
The selenium code for logging in:
# browser, loginURL, zhangHao and miMa come from the setup sketch above
time.sleep(random.random() + 1)
browser.get(loginURL)
time.sleep(random.random() + random.randint(2, 3))

# Switch to the password-login tab
browser.find_element_by_css_selector('div.title:nth-child(2)').click()
time.sleep(random.uniform(0.5, 1))

# Enter the account phone number
phone = browser.find_element_by_css_selector(
    'div.modulein:nth-child(2) > div:nth-child(2) > input:nth-child(1)')
phone.send_keys(zhangHao)
time.sleep(random.uniform(0.4, 0.9))

# Enter the password
password = browser.find_element_by_css_selector('.input-pwd')
password.send_keys(miMa)

# Click the login button, then give the site time to finish logging in
click = browser.find_element_by_css_selector(
    'div.modulein:nth-child(2) > div:nth-child(5)')
click.click()
time.sleep(random.uniform(0.5, 1) + 10)
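The hard-coded ten-second sleep after clicking login works, but it is fragile on a slow connection. One optional refinement, sketched on the assumption that a successful login navigates away from loginURL, is an explicit wait:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 20 seconds until the browser leaves the login page
WebDriverWait(browser, 20).until(EC.url_changes(loginURL))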
The page after logging in:
The search URL for a keyword is https://www.tianyancha.com/search?key= plus the keyword.
Taking “滴滴” (Didi) as the example: https://www.tianyancha.com/search?key=滴滴
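One small caveat: when building the URL by hand, a Chinese keyword should strictly be percent-encoded (browsers generally encode it for you, but doing it explicitly avoids surprises). A quick sketch:

from urllib.parse import quote

key = '滴滴'
url = 'https://www.tianyancha.com/search?key=' + quote(key)
print(url)  # https://www.tianyancha.com/search?key=%E6%BB%B4%E6%BB%B4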
The results page looks like this:
Collecting the company page URLs
Parse the HTML of the keyword's results pages to get each company's URL.
Note: non-members can only view the first five pages of company results.
Code:
from bs4 import BeautifulSoup

# Load the first results page for the keyword so the page count can be read
key = '滴滴'
browser.get('https://www.tianyancha.com/search?key=' + key)
time.sleep(random.uniform(0.6, 1) + 2)
soup = BeautifulSoup(browser.page_source, 'lxml')

# Get the number of result pages from the pagination bar; a keyword with a
# single page of results has no pagination, hence the fallback to 1
try:
    pages = int(soup.find('ul', class_='pagination').find_all('li')[-2]
                    .getText().replace('...', ''))
except Exception:
    pages = 1
finally:
    print('pages:', pages)

# Pull the company-page URL out of every result entry on one page
def getUid(soup):
    urls = []
    divs = soup.find('div', class_='result-list sv-search-container') \
               .find_all('div', class_='search-item sv-search-company')
    for div in divs:
        urls.append(div.find('div', class_='header').find('a')['href'])
    return urls

# Non-members can only crawl the first five pages
if pages > 5:
    pages = 5

urls = []
for i in range(1, pages + 1):
    url = 'https://www.tianyancha.com/search/p' + str(i) + '?key=' + key
    browser.get(url)
    time.sleep(random.uniform(0.6, 1) + 2)
    soup = BeautifulSoup(browser.page_source, 'lxml')
    urls.extend(getUid(soup))
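If the same company happens to appear on more than one results page, urls will contain duplicates. An optional, order-preserving de-duplication (Python 3.7+):

urls = list(dict.fromkeys(urls))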
Extracting the company information
Finally, parse each company page's HTML for the fields we need. Looking at the page, the phone number and the address we want both appear right at the top.
The code for this part:
import pandas as pd

# To avoid losing everything if something goes wrong, the results are
# written back to the Excel file after every single company
path = r'C:\Users\liuliang_i\Desktop\tianYanCha.xlsx'
try:
    for url in urls:
        # Re-read the workbook, or start a fresh DataFrame on the first run
        try:
            df1 = pd.read_excel(path)
        except Exception:
            df1 = pd.DataFrame(columns=['Company', 'Phone', 'Address', 'Url'])

        browser.get(url)
        time.sleep(random.uniform(0.4, 0.8) + 1)
        soup = BeautifulSoup(browser.page_source, 'lxml')

        # Name, phone and address all sit in the header block at the top of the page
        company = soup.find('div', class_='header').find('h1', class_='name').getText()
        phone = soup.find('div', class_='in-block sup-ie-company-header-child-1') \
                    .find_all('span')[1].getText()
        address = soup.find('div', class_='auto-folder').find('div').getText()

        # Append one row and write the file out immediately
        row = df1.shape[0]
        df1.loc[row, 'Company'] = company
        df1.loc[row, 'Phone'] = phone
        df1.loc[row, 'Address'] = address
        df1.loc[row, 'Url'] = url
        df1.to_excel(path, index=False)
except Exception:
    pass
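Re-reading and rewriting the whole workbook on every iteration gets slower as the file grows. A possible alternative, sketched under the assumption that it replaces the DataFrame bookkeeping inside the same for url in urls: loop (the CSV file name here is made up for illustration), is to append each record to a CSV instead:

import csv
import os

csv_path = r'C:\Users\liuliang_i\Desktop\tianYanCha.csv'  # hypothetical output path
write_header = not os.path.exists(csv_path)

# 'a' appends one record at a time; utf-8-sig keeps Chinese text readable in Excel
with open(csv_path, 'a', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    if write_header:
        writer.writerow(['Company', 'Phone', 'Address', 'Url'])
    writer.writerow([company, phone, address, url])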