Scrapy This Week: Proxy Auth, Crawler Settings, and scrapy-lint
Scrapy landed 17 commits on master over the past week, touching 71 files with a net gain of about 400 lines. Most of the diff is test coverage, but a handful of changes matter if you run spiders behind proxies, pass prebuilt Crawler objects into runners, or rely on rel="nofollow" to stay off certain pages. The headline fix stops Proxy-Authorization from leaking into upstream requests when using the streaming download handlers.
Proxy auth no longer rides along to the origin
Operators who scrape through authenticated HTTP or SOCKS proxies hit a subtle bug in the streaming download path. BaseStreamingDownloadHandler used to pop the Proxy-Authorization header off the Request object while extracting proxy credentials. That mutation meant the header could still reach the target server on the actual download call, which is both wrong and a credential leak.
The fix in proxy authorization handling for streaming handlers adds a _request_headers() helper that copies headers and strips Proxy-Authorization before the handler sends the request. Proxy credentials are read with .get() instead of .pop(), so the original Request stays intact. The HttpxDownloadHandler path now passes the cleaned header tuple list into client.stream().
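The copy-and-strip approach described above can be sketched with plain dicts. This is a simplified model, not the patch itself: split_proxy_headers is a hypothetical name, and real Scrapy requests use a Headers class rather than a dict.

```python
def split_proxy_headers(headers):
    """Return (proxy_auth, outbound_headers) without mutating `headers`.

    Hypothetical helper: a plain-dict model of the copy-and-strip idea,
    not Scrapy's actual _request_headers() implementation.
    """
    # Read the credential with .get() so the caller's mapping stays intact.
    proxy_auth = headers.get("Proxy-Authorization")
    # Build a cleaned copy; header names are compared case-insensitively.
    outbound = [
        (name, value)
        for name, value in headers.items()
        if name.lower() != "proxy-authorization"
    ]
    return proxy_auth, outbound


request_headers = {
    "User-Agent": "example-bot",
    "Proxy-Authorization": "Basic dXNlcjpwYXNz",
}
proxy_auth, outbound = split_proxy_headers(request_headers)
```

The credential is still available for the proxy connection, the outbound list no longer carries it, and the original mapping is left untouched.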
Tests in tests/test_downloader_handlers_http_base.py were refactored around a single parameterized proxy_server fixture covering plain HTTP, HTTPS, and SOCKS5 proxies.
CrawlerRunner settings merge into existing Crawler objects
A long-standing footgun: passing an already constructed Crawler into CrawlerRunner.create_crawler() ignored the runner’s settings entirely. Only spider class inputs picked up CrawlerRunner({"FOO": "bar"}) values. Scripts that build a Crawler manually and hand it to the runner would silently miss runner-level defaults.
"Do not ignore CrawlerProcess settings" fixes scrapy/crawler.py so that create_crawler() calls crawler.settings.update(self.settings) when given a Crawler instance. The merge respects priority: runner values apply only where the crawler does not already hold an equal or higher priority setting. Spider custom_settings still win over runner defaults.
CrawlerProcess and CrawlerRunner settings dicts now behave the same whether you pass a spider class or a prebuilt crawler.
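The priority rule can be illustrated with a toy model. The sketch below does not use Scrapy's Settings class: merge_runner_settings and the numeric priority levels are assumptions made for illustration, and the tie-breaking follows the description in this article rather than Scrapy's source.

```python
# Toy model: each setting is stored as name -> (value, priority).
def merge_runner_settings(crawler_settings, runner_settings):
    """Merge runner settings into crawler settings, respecting priority.

    Hypothetical helper. A runner value is applied only when the crawler
    does not already hold the key at an equal or higher priority.
    """
    merged = dict(crawler_settings)
    for name, (value, priority) in runner_settings.items():
        existing = merged.get(name)
        if existing is None or existing[1] < priority:
            merged[name] = (value, priority)
    return merged


PROJECT, SPIDER = 20, 30  # assumed priority levels; higher wins

crawler = {"DOWNLOAD_DELAY": (2.0, SPIDER)}  # e.g. set via custom_settings
runner = {"DOWNLOAD_DELAY": (0.5, PROJECT), "FOO": ("bar", PROJECT)}
merged = merge_runner_settings(crawler, runner)
```

In this model the spider-level DOWNLOAD_DELAY survives, while FOO, which the crawler never set, is filled in from the runner.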
Link following and signal dispatch fixes
Two small runtime fixes are easy to miss but painful when they bite.
rel_has_nofollow() in scrapy/utils/misc.py treated nofollow as case sensitive. Sites that emit rel="NoFollow" or rel="NOFOLLOW" were crawled anyway. "Fix nofollow detection to match the HTML spec" lowercases the attribute before token splitting.
Separately, scrapy/utils/signal.py had a loop variable bug in _send_catch_log_deferred(). The deferred callback closed over receiver, so concurrent signal dispatch could pair the wrong receiver with a result. "Fix the signal dispatch loop variable bug" passes receiver as an explicit lambda argument instead.
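A case-insensitive token check might look like the sketch below. has_nofollow is a hypothetical stand-in written for this post, not a copy of Scrapy's rel_has_nofollow(); treating commas as separators is an assumption.

```python
def has_nofollow(rel):
    """Return True if a rel attribute value contains the nofollow token."""
    if rel is None:
        return False
    # Lowercase first, then treat commas as separators and split on whitespace.
    tokens = rel.lower().replace(",", " ").split()
    return "nofollow" in tokens
```

Matching whole tokens rather than substrings keeps a value such as rel="nofollowish" from being misread as nofollow.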
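The closure pitfall is a general Python one and can be reproduced without Twisted or Scrapy. The sketch below uses placeholder receiver names and binds the value with a default argument, which is one common remedy; the exact form used in the patch may differ.

```python
receivers = ["receiver_a", "receiver_b", "receiver_c"]

# Buggy pattern: each lambda looks up `receiver` when it is called,
# by which time the loop has finished and the name points at the last item.
late_bound = []
for receiver in receivers:
    late_bound.append(lambda result: (receiver, result))

# Fixed pattern: a default argument captures the current value
# at the moment the lambda is defined.
early_bound = []
for receiver in receivers:
    early_bound.append(lambda result, receiver=receiver: (receiver, result))

late_results = [callback("ok") for callback in late_bound]
early_results = [callback("ok") for callback in early_bound]
```

Every callback in late_bound reports the last receiver, while each callback in early_bound reports the receiver it was created for.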
Static analysis replaces some runtime nagging
Scrapy is pushing mistake detection out of the framework and into tooling.
"Document scrapy lint and drop runtime spider checks" does three things at once. The docs gain a static analysis section in docs/topics/practices.rst pointing at scrapy-lint for common spider mistakes. The framework drops the start_url vs start_urls typo guard from scrapy/spiders/__init__.py. OffsiteMiddleware no longer warns when allowed_domains contains full URLs or port suffixes; invalid entries now flow straight into the compiled regex.
That is a behavior change for sloppy allowed_domains lists. A value like https://example.com used to be ignored with a warning. Now it becomes part of the domain regex. Run scrapy-lint locally or in pre-commit rather than expecting Scrapy to scold you at crawl time.
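To see why an unvalidated entry matters, consider a simplified model of a host filter. build_host_regex is a hypothetical function written for this post, and its pattern is an assumption about how such a regex might be assembled, not Scrapy's actual code.

```python
import re


def build_host_regex(allowed_domains):
    """Compile a host-matching regex from allowed_domains with no validation.

    Simplified, hypothetical model of an offsite filter; not Scrapy's code.
    """
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(rf"^(.*\.)?({escaped})$")


good = build_host_regex(["example.com"])
sloppy = build_host_regex(["https://example.com"])

host = "www.example.com"  # the hostname part of a request URL
good_match = bool(good.search(host))
sloppy_match = bool(sloppy.search(host))
```

In this model the URL-shaped entry can never equal a bare hostname, so a list containing only https://example.com would match nothing and the mistake would surface as missing pages rather than as a warning.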
Custom command authors get a deprecation notice too. "Deprecate ScrapyCommand.help()" warns when a subclass overrides help() instead of long_desc(). Override long_desc() for extended command help text going forward.
Docs and coverage work in the same window
Not everything here changes runtime behavior. "Add a content based image filtering example" adds an ImageClassifierPipeline pattern to docs/topics/media-pipeline.rst showing how to override get_images() and is_valid_image() for TensorFlow-based filtering.
The other half of the week’s diff is test coverage across settings, link extractors, priority queues, and response types. "Improve test coverage for settings/" locks in getdictorlist() edge cases and deprecated settings like DNS_RESOLVER. Good for contributors, invisible to operators.
What to watch
Three items to track after pulling master:
- Proxy downloads: If you set Proxy-Authorization manually on requests, confirm origin servers no longer see that header.
- Crawler construction: Audit scripts that pass a Crawler instance into CrawlerRunner or AsyncCrawlerRunner. Runner settings now merge in.
- Lint over warnings: Adopt scrapy-lint for start_urls typos and bad allowed_domains entries. Framework-level checks for those mistakes are gone.
More context lives in the Scrapy repository commit log.