Simulating Sina Weibo Login with a Python 3 Crawler

I am new to Python 3, so corrections are welcome.

Process analysis

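From filling in the form to landing on the home page, the whole process goes through seven steps:
1. Before login, once the account name has been typed and the input box loses focus, the browser sends two requests. The first fetches the servertime, nonce and pubkey needed to encrypt the password before login (item 3 in the figure).
2. The second AJAX request concerns the captcha (items 4 and 5 in the figure).
3. After the login button is clicked, the encrypted data is POSTed to the server (item 6 in the figure).
4. The server returns a batch of data (item 8 in the figure) that contains the redirect address.
5. After several page loads, a JSON packet arrives from the server (item 19 in the figure); it contains the uniqueid specific to each Weibo user.
6. A script carrying the user's information is returned (item 22 in the figure).
7. After a series of redirects, the browser finally lands on the personal home page.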

Pitfalls

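The worst pitfall by far was the captcha.
Analyzing the captcha request took the better part of a day. This is the URL the captcha is requested from:
https://login.sina.com.cn/cgi/pin.php?r=18674039&s=0&p=yf-c92f2edb50c21d4bcbfdc3fccfdb94c4c23f
Breaking it down:
Fixed server URL: https://login.sina.com.cn/cgi/pin.php?
Parameters: r = 18674039, p = yf-c92f2edb50c21d4bcbfdc3fccfdb94c4c23f, s = 0
Here s and p are fixed values, while r is a number that keeps changing. I looked for a pattern in r and found none, so I used Fiddler to test whether the captcha could still be fetched with r omitted. Happily, it could.
In fact, as long as the captcha request shares its cookies with the current login attempt, the captcha can be fetched proactively.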

            
              
from urllib import request


# Fetch the captcha image
def get_verificationcode():
    print("Requesting the captcha...")
    url = "https://login.sina.com.cn/cgi/pin.php?s=0&p=yf-9f5e31626347e127bc21874aa9d6f4d745ca"
    # Save the image locally so it can be opened and read by hand
    request.urlretrieve(url=url, filename="./img/code.jpg")
    print("Captcha downloaded.")
    return input("Enter the captcha: ")

My approach to the captcha is semi-manual: each captcha image is saved to the img folder under the current directory, and a person looks at it and types it in.
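The cookie requirement described above can be met with the standard library alone. The following is a minimal sketch, not the code from the original project: the helper name build_cookie_opener is hypothetical, and it assumes the rest of the crawler goes through urllib.request. Once an opener with a cookie processor is installed globally, request.urlopen() and request.urlretrieve() share one cookie jar, so the captcha request and the login request stay in the same session.

```python
from http.cookiejar import CookieJar
from urllib import request


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = CookieJar()
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = build_cookie_opener()
# After install_opener(), request.urlopen() (and request.urlretrieve(),
# which calls urlopen() internally) use this opener and its cookie jar.
request.install_opener(opener)
```

In this sketch the opener would be installed once, before the pre-login request, so that every later call in the login flow reuses the same cookies.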

About step 1: encrypting the account name and password

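Analysis shows that before the login POST request of the second step is sent, the browser first sends another request, whose response is as follows: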

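The servertime, pubkey, nonce and rsakv it carries are essential later, both for encrypting the password and for assembling the POST data. For how to arrive at this analysis, and for the overall structure of the crawler, I also referred to this blog post: https://www.cnblogs.com/houkai/p/3488468.html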
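As an illustration of how those four values might be pulled out of the pre-login response, here is a minimal sketch. It assumes the body is a JSONP-style wrapper around a JSON object, for example sinaSSOController.preloginCallBack({...}); that shape, the function name parse_prelogin and every value in the sample are assumptions for demonstration, not data captured from the server.

```python
import json
import re


def parse_prelogin(text):
    """Extract servertime, nonce, pubkey and rsakv from a JSONP-style body."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    # Grab the JSON object between the callback's parentheses.
    match = re.search(r"\((\{.*\})\)", text, re.S)
    if match is None:
        raise ValueError("no JSON object found in prelogin response")
    data = json.loads(match.group(1))
    return data["servertime"], data["nonce"], data["pubkey"], data["rsakv"]


# Made-up sample with placeholder values, only to show the assumed shape.
sample = (b'sinaSSOController.preloginCallBack({"retcode":0,'
          b'"servertime":1564805281,"nonce":"ABC123",'
          b'"pubkey":"EB2A38","rsakv":"1234567890"})')
servertime, nonce, pubkey, rsakv = parse_prelogin(sample)
```

A helper of this kind would play the role of dealdata.get_prelogin in the login method shown later.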
Below is the code of the encryption module encrypt.py:

            
              
import base64
import binascii

import rsa  # third-party package


# Encode the user name
def encryUsername(username):
    print("Encoding the user name...")
    text = base64.b64encode(username.encode(encoding="utf-8"))
    text = text.decode()
    return str(text).replace("=", "")


# Encrypt the password
def encryPassword(password, servertime, nonce, pubkey):
    print("Encrypting the password...")
    rsaPublickey = int(pubkey, 16)
    key = rsa.PublicKey(rsaPublickey, 65537)  # build the public key
    # Plaintext layout, taken from the site's JS encryption file
    message = str(servertime) + '\t' + str(nonce) + '\n' + str(password)
    message = bytes(message, encoding="utf-8")
    passwd = rsa.encrypt(message, key)  # RSA-encrypt the plaintext
    passwd = binascii.b2a_hex(passwd)  # convert the ciphertext to hexadecimal
    return passwd
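To make the two transformations in encrypt.py easier to check by hand, the sketch below reproduces only their standard-library parts: the Base64 user name with its padding stripped, and the plaintext that is handed to rsa.encrypt. The function names encode_su, decode_su and build_plaintext and the sample inputs are hypothetical; the RSA step itself is left out so the example runs without third-party packages.

```python
import base64


def encode_su(username):
    """Base64-encode the user name and drop '=' padding, as encryUsername does."""
    encoded = base64.b64encode(username.encode("utf-8")).decode("ascii")
    return encoded.replace("=", "")


def decode_su(su):
    """Restore the stripped padding and decode back to the user name."""
    padded = su + "=" * (-len(su) % 4)
    return base64.b64decode(padded).decode("utf-8")


def build_plaintext(servertime, nonce, password):
    """Assemble the bytes that encryPassword passes to rsa.encrypt."""
    text = str(servertime) + "\t" + str(nonce) + "\n" + str(password)
    return text.encode("utf-8")


su = encode_su("user@example.com")
plaintext = build_plaintext(1564805281, "ABC123", "secret")
```

Stripping the padding loses no information, since it can be recomputed from the length; whether the server accepts the unpadded form is something the original code relies on rather than something this sketch verifies.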

Assembling the login POST data

The relevant parameters can be captured in Chrome's developer tools.

            
              
from urllib import parse

import encrypt


# Assemble the POST data
def get_postData(su, password, servertime, nonce, pubkey, rsakv):
    print("Assembling the POST data...")
    # Encrypt the password
    sp = encrypt.encryPassword(password, servertime, nonce, pubkey)
    # Request the captcha
    door = get_verificationcode()
    # Build the POST parameters
    data = {
        "door": door,
        "entry": "weibo",
        "gateway": 1,
        "from": "",
        "savestate": 7,
        "su": su,
        "sp": sp,
        "servertime": servertime,
        "service": "miniblog",
        "nonce": nonce,
        "rsakv": rsakv,
        "encoding": "UTF-8",
        "domain": "sina.com.cn",
        "returntype": "META",
        "vsnf": 1,
        "useticket": 1,
        "pwencode": "rsa2",
        "prelt": 372,
        "qrcode_flag": "false",
        "url": "https://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack"
    }
    # URL-encode the form and convert it to bytes for the request body
    data = parse.urlencode(data).encode("utf-8")
    return data
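The last two lines of get_postData turn a dictionary with mixed value types into a form-encoded request body. As a small sketch of what parse.urlencode does with such values, the example below uses a shortened, made-up dictionary: the bytes value stands in for the hex string returned by encryPassword, and none of the values are real login data.

```python
from urllib import parse

fields = {
    "sp": b"0a1b2c",  # bytes, like the hex value encryPassword returns
    "gateway": 1,     # integers are converted with str()
    "url": "https://weibo.com/ajaxlogin.php?framelogin=1",
}
# Reserved characters in values are percent-encoded; the result is bytes.
body = parse.urlencode(fields).encode("utf-8")
decoded = parse.parse_qs(body.decode("utf-8"))
```

Here body is b"sp=0a1b2c&gateway=1&url=https%3A%2F%2Fweibo.com%2Fajaxlogin.php%3Fframelogin%3D1", which is the kind of value urllib.request.Request accepts as its data argument.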

Redirecting to the home page after login

This part draws on: https://www.cnblogs.com/woaixuexi9999/p/9404745.html
The module login.py defines a class Login; the code of its login method is:

            
              
    def login(self):
        # Step 1: fetch the timestamp, public key, nonce and related data
        # (HTTP method names are case-sensitive, so use upper case)
        req = request.Request(url=self.__preloginUrl, headers=self.headers1, method="GET")
        response = request.urlopen(req)
        text = response.read()
        servertime, nonce, pubkey, rsakv = dealdata.get_prelogin(text=text)
        # Step 2: POST the login data to the server
        postdata = dealdata.get_postData(self.__su, self.__password, servertime, nonce, pubkey, rsakv)
        req = request.Request(url=self.__loginUrl, headers=self.__postheaders, data=postdata, method="POST")
        response = request.urlopen(req)
        text = response.read()
        # Step 3: parse the login response and extract the intermediate URL
        replaceUrl = dealdata.get_replaceUrl(text=text)
        # Check the login result
        result, retcode, reason = dealdata.get_reason(replaceUrl)
        if result == False:
            print("Login failed!")
            print("Reason:", reason)
            return
        else:
            print("Login succeeded!")
            print("Redirecting to the personal home page...")
        # Step 4: load the intermediate URL and extract the ticket
        response = request.urlopen(replaceUrl)
        text = response.read()
        ticket = dealdata.get_ticket(text=text)
        # Step 5: combine the ticket with the fixed query part to build the URL
        # that returns the JSON data carrying uniqueid
        uniqueidUrl = ticket + "&callback=sinaSSOController.doCrossDomainCallBack&scriptId=ssoscript0&client=ssologin.js(v1.4.19)&_=1564805281285"
        response = request.urlopen(uniqueidUrl)
        text = response.read()
        uniqueid = dealdata.get_uniqueid(text)
        # Step 6: go to the home page and save it
        print("Entering the personal home page...")
        homeUrl = "https://weibo.com/u/" + uniqueid + "/home"
        request.urlretrieve(homeUrl, "./html/home.html")

Other modules

The data-processing module dealdata.py
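The source of dealdata.py is not reproduced in the text, so the following is only a sketch of what helpers such as get_replaceUrl and get_reason might look like. It assumes that the login response contains a location.replace("...") call and that the target URL carries retcode and reason as query parameters, with retcode 0 meaning success. The function names find_replace_url and read_result, that success convention and the sample response are all assumptions for illustration.

```python
import re
from urllib import parse


def find_replace_url(text):
    """Return the URL passed to location.replace(...), or None if absent."""
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="replace")
    match = re.search(r"""location\.replace\(\s*['"](.*?)['"]\s*\)""", text)
    return match.group(1) if match else None


def read_result(url):
    """Return (ok, retcode, reason) read from the URL's query string."""
    query = parse.parse_qs(parse.urlsplit(url).query)
    retcode = query.get("retcode", [""])[0]
    reason = query.get("reason", [""])[0]
    return retcode == "0", retcode, reason


# Made-up response body, only to illustrate the assumed shape.
sample = (b'<script>location.replace('
          b'"https://example.com/cb?retcode=4049&reason=captcha+required"'
          b');</script>')
url = find_replace_url(sample)
ok, retcode, reason = read_result(url)
```

The tuple returned by read_result mirrors the (result, retcode, reason) unpacking used in the login method, so a failed login can report its reason before returning.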

Project file list

code.jpg is the captcha image
home.html is the saved home page

Execution result

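As a Python beginner, I am deeply impressed by how powerful the language is. Life is short, I use Python.
This program is for experimentation only and must not be used commercially.
Please credit the source when reposting: https://blog.csdn.net/Blz624613442/article/details/98368815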
