python - Download and save images from a website using Scrapy


I am new to Scrapy and Python, so my question may be a simple one. Using an existing website guide, I've written a scraper that scrapes a website's pages and writes each image URL, name, and so on to an output file. I want to download the images into a directory, but the output directory is empty!

Here is my code:

myspider.py

    import scrapy


    class BrickSetSpider(scrapy.Spider):
        name = 'brick_spider'
        start_urls = ['http://brickset.com/sets/year-2016']

        def parse(self, response):
            set_selector = '.set'
            for brickset in response.css(set_selector):
                name_selector = 'h1 ::text'
                pieces_selector = './/dl[dt/text() = "Pieces"]/dd/a/text()'
                minifigs_selector = './/dl[dt/text() = "Minifigs"]/dd[2]/a/text()'
                image_selector = 'img ::attr(src)'
                yield {
                    'name': brickset.css(name_selector).extract_first(),
                    'pieces': brickset.xpath(pieces_selector).extract_first(),
                    'minifigs': brickset.xpath(minifigs_selector).extract_first(),
                    'image': brickset.css(image_selector).extract_first(),
                }

            # Follow the pagination link, if there is one.
            next_page_selector = '.next ::attr(href)'
            next_page = response.css(next_page_selector).extract_first()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    callback=self.parse
                )

settings.py

    ITEM_PIPELINES = {'brickset.pipelines.BricksetPipeline': 1}
    IMAGES_STORE = '/home/nmd/brickset/brickset/spiders/output'

items.py

    import scrapy


    class BrickSetSpider(scrapy.Item):
        image_urls = scrapy.Field()
        images = scrapy.Field()

Scrapy provides a media pipeline if you are interested in downloading files or images. Enable the images pipeline in settings.py:

    ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
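As a rough sketch, the relevant part of settings.py might look like the following. It assumes the storage path from the question is reused. The images pipeline also needs `IMAGES_STORE` to be set, and it depends on the Pillow library being installed.

```python
# settings.py (sketch): enable Scrapy's built-in images pipeline.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

# Assumption: the directory from the question is reused here;
# any writable directory should work.
IMAGES_STORE = '/home/nmd/brickset/brickset/spiders/output'
```

With the default settings, downloaded files are written to a `full/` subdirectory of `IMAGES_STORE` and are named after a hash of the image URL.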

Then you need to add image_urls to the item so that the pipeline downloads the files. Change

    yield {
        'name': brickset.css(name_selector).extract_first(),
        'pieces': brickset.xpath(pieces_selector).extract_first(),
        'minifigs': brickset.xpath(minifigs_selector).extract_first(),
        'image': brickset.css(image_selector).extract_first(),
    }

to

    yield {
        'name': brickset.css(name_selector).extract_first(),
        'pieces': brickset.xpath(pieces_selector).extract_first(),
        'minifigs': brickset.xpath(minifigs_selector).extract_first(),
        # image_urls must be a list of URLs, so use extract() rather than extract_first()
        'image_urls': brickset.css(image_selector).extract(),
    }
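The images pipeline expects `image_urls` to be a list of absolute URLs, not a single string. Below is a minimal sketch of a helper that normalizes extracted `src` values into that shape. The name `to_image_urls` is hypothetical and not part of Scrapy, and the example URLs are made up for illustration.

```python
from urllib.parse import urljoin


def to_image_urls(page_url, srcs):
    """Return a list of absolute image URLs built from extracted src values.

    page_url: URL of the page the src values were scraped from.
    srcs: iterable of src strings (entries may be None or empty).
    """
    # Skip missing values and resolve relative paths against the page URL.
    return [urljoin(page_url, src) for src in srcs if src]


# Illustrative values only (hypothetical image paths).
urls = to_image_urls(
    'http://brickset.com/sets/year-2016',
    ['/img/a.jpg', None, 'https://images.brickset.com/b.jpg'],
)
```

Inside the spider, this could be used as `'image_urls': to_image_urls(response.url, brickset.css(image_selector).extract())`.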

For more details, refer to https://doc.scrapy.org/en/latest/topics/media-pipeline.html

