b0207191 posted on 2023-6-17 09:40

A question about scrapy

This post was last edited by b0207191 on 2023-6-17 09:55

I followed the method on the page below to log in to a site and scrape pages:

Simple automated login with Scrapy – 51CTO Blog – scrapy crawl
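To show the setup, my spider follows the pattern from that page and looks roughly like this (simplified; the class name, spider name and form fields are placeholders, and the site URL is redacted the same way as in the log below):

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login"                # placeholder spider name
    start_urls = ["https://网站"]  # real URL redacted

    def parse(self, response):
        # submit the login form, then hand the logged-in page to next()
        post_data = {"username": "xxx", "password": "xxx"}  # placeholder credentials
        yield scrapy.FormRequest(
            response.url,
            formdata=post_data,
            callback=self.next,
        )

    def next(self, response):
        # scrape the page that is only reachable after logging in
        print(response.text)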

But it throws an error:

2023-06-17 09:31:27 ERROR: Spider error processing <POST https://网站> (referer: https://网站)
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/site-packages/scrapy/utils/defer.py", line 74, in mustbe_deferred
    result = f(*args, **kw)
File "/root/miniconda3/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 94, in _process_spider_input
    return scrape_func(response, request, spider)
File "/root/miniconda3/lib/python3.8/site-packages/scrapy/core/scraper.py", line 209, in call_spider
    warn_on_generator_with_return_value(spider, callback)
File "/root/miniconda3/lib/python3.8/site-packages/scrapy/utils/misc.py", line 263, in warn_on_generator_with_return_value
    if is_generator_with_return_value(callable):
File "/root/miniconda3/lib/python3.8/site-packages/scrapy/utils/misc.py", line 239, in is_generator_with_return_value
    src = inspect.getsource(func)
File "/root/miniconda3/lib/python3.8/inspect.py", line 985, in getsource
    lines, lnum = getsourcelines(object)
File "/root/miniconda3/lib/python3.8/inspect.py", line 967, in getsourcelines
    lines, lnum = findsource(object)
File "/root/miniconda3/lib/python3.8/inspect.py", line 794, in findsource
    lines = linecache.getlines(file, module.__dict__)
File "/root/miniconda3/lib/python3.8/linecache.py", line 47, in getlines
    return updatecache(filename, module_globals)
File "/root/miniconda3/lib/python3.8/linecache.py", line 137, in updatecache
    lines = fp.readlines()
File "/root/miniconda3/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcf in position 380: invalid continuation byte
Looking into it, it's probably because the web page is GBK-encoded, so I went looking for where to set the encoding. As a first step I added print statements at the entry and exit of every function.

I then found that the exception is thrown after the parse function has finished running but before the next function is entered (roughly as the prints below show). How do I fix this?
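What the entry/exit prints looked like, schematically (names as in the sketch above):

    def parse(self, response):
        print("enter parse")
        # ... build post_data and yield the FormRequest ...
        print("exit parse")   # this does get printed

    def next(self, response):
        print("enter next")   # never reached before the exception
        # ...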

I also tried adding encoding="GBK" to the scrapy.FormRequest call, but that didn't help either:

            callback=self.next,
            formdata=post_data,
            encoding="GBK"
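
Thinking about it more, what I was really looking for is probably where to override the encoding of the response itself, e.g. inside the parse callback of the spider sketched above. Is something along these lines the right direction? (Just a sketch based on my reading of Scrapy's TextResponse docs, not the tutorial; "gbk" is only my guess at the site's real encoding.)

    def parse(self, response):
        # re-declare the body as GBK so that response.text decodes correctly
        page = response.replace(encoding="gbk")
        print(page.encoding, page.text[:100])  # sanity check on the decoded text
        # ...continue working with page instead of response...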
