web crawler - Scrapy: Stop crawling a domain and hop to the next if a condition is met

July 15, 2012

I'd like to write a BFO (breadth-first order) broad crawler that does the following:

1. Begin with the first URL.
2. Try to find links to the Impressum (translation: imprint) with the regex '.*mpressum.*'.
3. Check whether a condition is met, in my case whether the postal code is in a given range.
4. If the condition is met, continue crawling the page.
5. If the condition is not met, stop crawling the domain and blacklist it for future crawls.
6. Continue with the next domain.
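
The link search and the postal code check from the steps above could look roughly like this in plain Python. This is only a sketch: the function names and the five-digit pattern are my own assumptions, and a five-digit match on an imprint page is not guaranteed to be a postal code (it could be part of a phone or registration number).

```python
import re

# Pattern from the question: matches "Impressum" or "impressum" anywhere in a URL.
IMPRESSUM_RE = re.compile(r".*mpressum.*")

# Assumption: German postal codes are written as standalone five-digit numbers.
POSTAL_CODE_RE = re.compile(r"\b(\d{5})\b")


def find_impressum_links(links):
    """Return the links whose URL matches the imprint pattern."""
    return [link for link in links if IMPRESSUM_RE.match(link)]


def has_postal_code_in_range(text, low, high):
    """Return True if any five-digit number in text lies within [low, high]."""
    return any(low <= int(code) <= high for code in POSTAL_CODE_RE.findall(text))
```

For example, has_postal_code_in_range(imprint_text, 80000, 81999) would be the condition that decides whether the crawler stays on the domain or moves on.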

How can I implement this behavior in Scrapy?

Basically, I'm doing this because I want to answer the following question:
Which domains in Germany are in a given postal code range?

My code is a mess; I'm just learning Scrapy at the moment.

You can use the allowed_domains variable in your spider. When the condition is not met, remove the domain from allowed_domains. This will not cancel downloads that are already queued, but I believe it will not let the spider queue new ones.

PS: For reference, see https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.offsitemiddleware
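
Since allowed_domains is a whitelist and a broad crawl usually starts without one, an explicit blacklist may be easier to work with. Below is a minimal sketch of such a blacklist in plain Python; the DomainBlacklist class and its method names are hypothetical and not part of Scrapy. Inside Scrapy, one place a check like this could live is a downloader middleware whose process_request raises scrapy.exceptions.IgnoreRequest for blocked URLs, but that wiring is an assumption and is not shown here. Depending on the Scrapy version, the offsite filter may be built once when the spider opens, so it is worth verifying that runtime changes to allowed_domains actually take effect.

```python
from urllib.parse import urlparse


class DomainBlacklist:
    """Hypothetical helper: remembers domains that failed the condition."""

    def __init__(self):
        self._blocked = set()

    def block(self, url):
        """Blacklist the host of the given URL."""
        host = urlparse(url).hostname  # already lowercased by urlparse
        if host:
            self._blocked.add(host)

    def is_blocked(self, url):
        """Return True if the URL's host, or a parent domain of it, is blacklisted."""
        host = urlparse(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in self._blocked)
```

The spider would call block(response.url) when the postal code check fails and test is_blocked(url) before following any new link. Requests that are already queued for that domain would still need to be dropped when they come up.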
