Python - Download and save images from a website using Scrapy
I am new to Scrapy and Python, so my question may be a simple one. Following an existing guide, I've written a scraper that crawls a website's pages and writes each set's image URL, name, and so on to an output file. I also want to download the images into a directory, but the output directory stays empty!
Here is my code:
myspider.py
    import scrapy


    class BricksetSpider(scrapy.Spider):
        name = 'brick_spider'
        start_urls = ['http://brickset.com/sets/year-2016']

        def parse(self, response):
            set_selector = '.set'
            for brickset in response.css(set_selector):
                name_selector = 'h1 ::text'
                pieces_selector = './/dl[dt/text() = "pieces"]/dd/a/text()'
                minifigs_selector = './/dl[dt/text() = "minifigs"]/dd[2]/a/text()'
                image_selector = 'img ::attr(src)'
                yield {
                    'name': brickset.css(name_selector).extract_first(),
                    'pieces': brickset.xpath(pieces_selector).extract_first(),
                    'minifigs': brickset.xpath(minifigs_selector).extract_first(),
                    'image': brickset.css(image_selector).extract_first(),
                }

            # Follow the pagination link, if there is one.
            next_page_selector = '.next ::attr(href)'
            next_page = response.css(next_page_selector).extract_first()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    callback=self.parse
                )
settings.py
    ITEM_PIPELINES = {'brickset.pipelines.BricksetPipeline': 1}
    IMAGES_STORE = '/home/nmd/brickset/brickset/spiders/output'

items.py

    import scrapy


    class BricksetSpider(scrapy.Item):
        image_urls = scrapy.Field()
        images = scrapy.Field()
Scrapy provides a media pipeline if you are interested in downloading files or images. Enable it in settings.py:

    ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
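As a rough sketch, the relevant part of settings.py might then look like the following. The storage path is the one from the question. Note that ImagesPipeline needs the Pillow library installed, and it stays disabled unless IMAGES_STORE is set.

```python
# settings.py (sketch): enable the built-in images pipeline and tell it where to save files.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory from the question; the pipeline stays disabled if this setting is missing.
IMAGES_STORE = "/home/nmd/brickset/brickset/spiders/output"
```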
Then you need to add image_urls to your item so that the pipeline downloads the files. Change
    yield {
        'name': brickset.css(name_selector).extract_first(),
        'pieces': brickset.xpath(pieces_selector).extract_first(),
        'minifigs': brickset.xpath(minifigs_selector).extract_first(),
        'image': brickset.css(image_selector).extract_first(),
    }
to
    yield {
        'name': brickset.css(name_selector).extract_first(),
        'pieces': brickset.xpath(pieces_selector).extract_first(),
        'minifigs': brickset.xpath(minifigs_selector).extract_first(),
        # image_urls must be a list of URLs, so use extract() here
        # rather than extract_first(), which returns a single string.
        'image_urls': brickset.css(image_selector).extract(),
    }
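The pipeline expects image_urls to be a list of absolute URLs. The snippet below is only an illustration of that shape: build_item is a hypothetical helper (not part of Scrapy), and the set name and image path are made-up example values. Inside a spider, response.urljoin would play the role of urljoin here.

```python
from urllib.parse import urljoin


def build_item(page_url, name, image_src):
    """Hypothetical helper: return an item dict whose image_urls is a list of absolute URLs."""
    # Wrap the single URL in a list; use an empty list when no image was found.
    image_urls = [urljoin(page_url, image_src)] if image_src else []
    return {"name": name, "image_urls": image_urls}


# Example values are invented for illustration.
item = build_item(
    "http://brickset.com/sets/year-2016",
    "Example set",
    "/images/example.jpg",
)
print(item["image_urls"])
```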
For more details, refer to https://doc.scrapy.org/en/latest/topics/media-pipeline.html
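To check that the download worked, it helps to know where the files end up. As far as the media pipeline documentation describes it, images are stored by default under IMAGES_STORE in a full/ subdirectory, named after the SHA-1 hash of the image URL with a .jpg extension. The sketch below mirrors that naming. default_image_path is a hypothetical helper, not a Scrapy function, and the URL is a placeholder.

```python
import hashlib


def default_image_path(url):
    """Hypothetical helper mirroring the documented default naming: full/<SHA-1 of URL>.jpg."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/" + digest + ".jpg"


# Placeholder URL; the result is relative to IMAGES_STORE.
print(default_image_path("http://www.example.com/image.jpg"))
```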