Scraping Tmall Reviews of a 3D Printing Pen with Python (2)

2022-08-16

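After reading around online, I found that in the request for the review JSON, the parameters that come after currentPage are all timestamp-related values with no real meaning, and removing them does not affect the response. That makes it easy to fetch every review for a single Tmall listing.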
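As a rough illustration of the trimmed request, here is a minimal sketch that builds the URL from only the parameters up to currentPage. The helper name build_review_url is made up for this example; the parameter names and values are the ones used in the code further down.

```python
from urllib.parse import urlencode

BASE_URL = "https://rate.tmall.com/list_detail_rate.htm"


def build_review_url(item_id, seller_id, page):
    """Return the review-list URL for one page (hypothetical helper)."""
    # Only the parameters up to currentPage are kept; the timestamp-style
    # parameters that normally follow are left out.
    params = {
        "itemId": item_id,
        "spuId": 0,
        "sellerId": seller_id,
        "order": 3,
        "currentPage": page,
    }
    return BASE_URL + "?" + urlencode(params)


print(build_review_url(623333281518, 898146183, 1))
```

Building the query string from a dict keeps the item and seller IDs in one place, so pointing the script at another listing only means changing two arguments.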

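The loops in this code are clumsy: this post uses three for loops. I will need to find a way to simplify that later, otherwise future changes will be painful.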

Here is the code:

import json

import requests

for i in range(1, 6):  # pages 1 to 5; raise the upper bound to the listing's total number of review pages
    url="https://rate.tmall.com/list_detail_rate.htm?itemId=623333281518&spuId=0&sellerId=898146183&order=3&currentPage={}".format(i)
    headers={
'cookie': 'lid=%E7%AC%94%E6%9D%86%E5%AD%90%E5%8A%9E%E5%85%AC%E6%97%97%E8%88%B0%E5%BA%97; hng=CN%7Czh-CN%7CCNY%7C156; enc=L1%2BEWKfqEhWH1WILeWEF1KOiuDf2Cajd%2F0eZYzQgcI3e%2FsTc5rVan3hyj4mSQDEslXHbyj4chZunVGKjZ4fTTheXwGRUVwKZANtPTzFrMBg%3D; cna=Vuo7F5s2vysCAXFoyUirZNF7; t=bbf5313974f93cdbf4b381462ad8b608; tracknick=%5Cu7B14%5Cu6746%5Cu5B50%5Cu529E%5Cu516C%5Cu65D7%5Cu8230%5Cu5E97; lgc=%5Cu7B14%5Cu6746%5Cu5B50%5Cu529E%5Cu516C%5Cu65D7%5Cu8230%5Cu5E97; _tb_token_=e17ee57eb8696; cookie2=1c657b9ae328ee25031a029bd22e24b2; xlly_s=1; dnk=%5Cu7B14%5Cu6746%5Cu5B50%5Cu529E%5Cu516C%5Cu65D7%5Cu8230%5Cu5E97; uc1=cookie16=V32FPkk%2FxXMk5UvIbNtImtMfJQ%3D%3D&pas=0&cookie14=UoeyDtntTKOlhw%3D%3D&cookie21=UtASsssmfavW4WY1P7YMYA%3D%3D&existShop=true&tmb=1&cookie15=UtASsssmOIJ0bQ%3D%3D; uc3=lg2=VFC%2FuZ9ayeYq2g%3D%3D&vt3=F8dCv4CINajEPRaps2k%3D&id2=UUphwocR7BRT9edm5Q%3D%3D&nk2=0o8%2FnXBGkvTgGSdoApFcWQ%3D%3D; _l_g_=Ug%3D%3D; uc4=id4=0%40U2grGR8RjYrYeyzJ7CYb6fCfVbRSSrkf&nk4=0%400D4kcvaqbgtfR8uEG8IjbABnAOkd7Z1Xb3Va; unb=2208092032956; cookie1=BYlty4Hl0E049Ew0r6wFoRPzJm4uOVqjRMkbuC2MJuQ%3D; login=true; cookie17=UUphwocR7BRT9edm5Q%3D%3D; _nk_=%5Cu7B14%5Cu6746%5Cu5B50%5Cu529E%5Cu516C%5Cu65D7%5Cu8230%5Cu5E97; sgcookie=E100AZQtyZkIPh6oeaQCkw3UQbCUcx%2FjU6T8vC5S%2B64N5G27kUmU12AQLDg7Zdamq2DrlA9Q%2FCzljUnSK%2Bybhj9VaEW9tVwLrYmOyakY%2BrPJxho%3D; cancelledSubSites=empty; sg=%E5%BA%9764; csg=a7fc881e; s=VWsvES7x; x5sec=7b22617365727665723b32223a22656437643165386632393037396563343765313730643530663634343938396543506962375a6347454a3676344b32506e4a715762786f504d6a49774f4441354d6a417a4d6a6b314e6a73784d4b36683859304351414d3d227d; tfstk=cyMlBbYdMbPSl7nZYYwWKF0whOMlZaWUSvkj33t23esgBoDViifV_ZlHmzY6z61..; l=eBgzLE7VLtIL9c-wBOfZlurza77TkIRfguPzaNbMiOCP_H1w52_lW6YGP9YeCnGVHs1XR3uy_uqHBJLOjydVokb4d_BkdlkmndC..; isg=BElJrMBGTwNZKDN8LPpfl20ZWHWjlj3I7HY1EOu-uTBvMmhEM-TumHpkdJaEatUA',
'referer': 'https://detail.tmall.com/item.htm?spm=a230r.1.14.16.70cb67c0bCF7G4&id=623333281518&ns=1&abbucket=1&skuId=5049447779543',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
    }
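    html = requests.get(url, headers=headers).text
    # The response wraps the JSON in extra text, so cut out the object that
    # runs from '{"rateDetail"' to the last closing brace before parsing it.
    start = html.find('{"rateDetail"')
    end = html.rfind('}') + 1
    for p in json.loads(html[start:end])['rateDetail']['rateList']:
        print(p['rateContent'])      # review text
        print(p['displayUserNick'])  # reviewer's display nickname
        print(p['rateDate'])         # review date
        # 'pics' may be missing or empty; print each attached image URL
        for pic in p.get('pics') or []:
            print('http:' + pic)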

Just change the range of the first for loop to match the listing's total number of review pages, and all of the review data will be scraped.
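Instead of editing the range by hand, the page count could probably be read from the response itself. The sketch below assumes the rateDetail object carries a paginator entry with a lastPage field; I have not confirmed that against the live response, so treat the field name as an assumption. fetch_page is a hypothetical callable that returns the parsed rateDetail dict for a page, and the demo uses in-memory data rather than a real request.

```python
def iter_reviews(fetch_page, max_pages=100):
    """Yield reviews page by page until the reported last page is reached."""
    page = 1
    while page <= max_pages:
        detail = fetch_page(page)
        yield from detail.get("rateList") or []
        # Assumption: the total page count is reported as paginator.lastPage.
        # If the field is missing, stop after the current page.
        last_page = (detail.get("paginator") or {}).get("lastPage", page)
        if page >= last_page:
            break
        page += 1


# Demo with fake pages standing in for real responses.
fake_pages = {
    1: {"rateList": [{"rateContent": "first"}], "paginator": {"lastPage": 2}},
    2: {"rateList": [{"rateContent": "second"}], "paginator": {"lastPage": 2}},
}
for review in iter_reviews(lambda page: fake_pages[page]):
    print(review["rateContent"])
```

The max_pages cap is only a safety net so that a wrong or missing page count cannot turn into an endless loop.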

The result is shown in the figure below:

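I ran into a lot of problems along the way. For example, Tmall's anti-scraping check kicks in every so often, and the headers have to be rewritten each time.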
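One way to notice a blocked request early is to check for the rateDetail marker before parsing. This is only a sketch: it assumes that a blocked response simply lacks the '{"rateDetail"' marker, and the function name extract_rate_detail and the "jsonp1(...)" wrapper in the demo are invented for illustration.

```python
import json


def extract_rate_detail(text):
    """Return the parsed 'rateDetail' object from a response body.

    Raises ValueError when the marker is missing, which this sketch treats
    as a hint that the request was blocked and the headers need refreshing.
    """
    start = text.find('{"rateDetail"')
    end = text.rfind('}') + 1
    if start == -1 or end <= start:
        raise ValueError("no rateDetail in response; headers or cookie may need updating")
    return json.loads(text[start:end])["rateDetail"]


# Demo on a made-up response body.
print(extract_rate_detail('jsonp1({"rateDetail":{"rateList":[]}})'))
```

Failing with a clear message is easier to debug than the JSON parsing error that a blocked page would otherwise produce.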

Another was the nesting of the for loops, which I did get working in the end.

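The export step is still unsolved, and I am setting it aside for now. Later I will probably need to go back and review how to download and save documents and images.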
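For the text part of the export, a possible starting point is the standard csv module. This is a sketch rather than a finished solution: the function name write_reviews_csv and the column choice are my own, and the sample review is made up. It does not cover downloading the images.

```python
import csv
import io

# Review fields to export, using the keys printed by the scraping loop.
FIELDS = ["displayUserNick", "rateDate", "rateContent"]


def write_reviews_csv(reviews, fileobj):
    """Write a header row and one row per review to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for review in reviews:
        # Keep only the chosen fields; missing ones become empty cells.
        writer.writerow({key: review.get(key, "") for key in FIELDS})


# Demo writing to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample = [{"displayUserNick": "u***r", "rateDate": "2022-08-01", "rateContent": "works well"}]
write_reviews_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk, the same function can be given a file opened with open("reviews.csv", "w", newline="", encoding="utf-8-sig"), where the file name is just an example. The utf-8-sig encoding tends to help spreadsheet programs display Chinese text correctly.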
