stormcrawler - Tuning StormCrawler to fully use available resources


I have a node dedicated to a StormCrawler-based crawler. At my disposal are 20 dual-core CPUs, 130 GB of RAM and a 10 Gb/s Ethernet connection.

I reduced the topology to its bare minimum: CollapsingSpout -> URLPartitionerBolt -> FetcherBolt. The spout reads from an Elasticsearch index (with ~50 M records). Elasticsearch is configured with 30 GB of RAM and 2 shards.
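For reference, a minimal sketch of what a reduced topology like this could look like when wired by hand (package and class names are those of the StormCrawler Elasticsearch module as far as I know; the component ids and parallelism hints are illustrative, not my actual values):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.elasticsearch.persistence.CollapsingSpout;

public class MinimalCrawlTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout reads URLs to fetch from the ES status index (2 shards -> 2 instances here)
        builder.setSpout("spout", new CollapsingSpout(), 2);

        // Partitioner assigns a partition key (e.g. host or domain) to each URL
        builder.setBolt("partitioner", new URLPartitionerBolt(), 2)
               .shuffleGrouping("spout");

        // Fetcher receives URLs grouped by key so politeness is enforced per host/domain
        builder.setBolt("fetch", new FetcherBolt(), 2)
               .fieldsGrouping("partitioner", new Fields("key"));

        Config conf = new Config();
        conf.setNumWorkers(1); // single worker, as described above

        StormSubmitter.submitTopology("crawler", conf, builder.createTopology());
    }
}
```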

I use a single worker with 50 GB of RAM dedicated to its JVM. Playing with different settings (mainly the total number of fetcher threads, the number of threads per queue, max spout pending, and Elasticsearch-related ones such as the number of buckets and the bucket size) I can reach an overall fetching speed of 100 MB/s. However, looking at the Ganglia reports, this corresponds to only 10% of the bandwidth available to me. Note that CPU usage is at about 20% and RAM is not an issue.
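For context, these are the kinds of knobs I have been turning; a sketch with placeholder values (the key names are the standard Storm / StormCrawler configuration properties as far as I know, and the numbers are examples rather than the values I ended up with):

```java
import org.apache.storm.Config;

public class CrawlerTuning {

    // Sketch of the configuration knobs mentioned above; values are placeholders
    static Config tunedConfig() {
        Config conf = new Config();

        // Total number of fetching threads in the FetcherBolt
        conf.put("fetcher.threads.number", 200);
        // How many threads may fetch from the same queue (host/domain/IP) at once
        conf.put("fetcher.threads.per.queue", 1);
        // Max number of unacked tuples pending per spout task
        conf.setMaxSpoutPending(2500);

        // Elasticsearch spout: number of buckets per query and URLs per bucket
        conf.put("es.status.max.buckets", 50);
        conf.put("es.status.max.urls.per.bucket", 10);

        return conf;
    }
}
```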

I'm seeking hints on where the bottleneck might be and advice on how to tune/adjust the crawler so that it uses all the resources available to me.

Thanks in advance.

Etienne

You could use Kibana or Grafana to visualise the metrics generated by StormCrawler; see the tutorial. This will give you insights into the performance. The Storm UI will also tell you about bottlenecks at the component level.
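If it helps, a sketch of how the metrics consumer from the Elasticsearch module can be registered so that the StormCrawler metrics end up in an ES index for Kibana/Grafana to chart (class name and signature as I recall them; check the version you are on):

```java
import org.apache.storm.Config;

import com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer;

public class MetricsSetup {

    // Register the ES-backed metrics consumer so the topology metrics get indexed
    static Config withMetrics(Config conf) {
        conf.registerMetricsConsumer(MetricsConsumer.class, 1);
        return conf;
    }
}
```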

You could use more than 2 shards for the status index and have a corresponding number of spout instances (see the sketch after the AggregationSpout suggestion below). This would increase the parallelism.

Do you follow outlinks from the web pages, or does the size of the index remain constant? 50 M URLs is not a lot, so I wouldn't expect ES to be super busy.

Have you tried using AggregationSpout instead? The CollapsingSpout is pretty new; also, it is probably better to use it with a bucket size of 1, I think, as it emits a separate query for each bucket.
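A rough sketch of what that change could look like, assuming the AggregationSpout from the ES module and a status index re-created with more shards (the shard count of 10 below is just an example); the spout parallelism is set to match so that each instance queries its own shard:

```java
import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout;

public class SpoutSwap {

    // Replace the single CollapsingSpout with one AggregationSpout instance per shard
    static void wireSpout(TopologyBuilder builder, int numShards) {
        builder.setSpout("spout", new AggregationSpout(), numShards);
    }
}
```

With a 10-shard status index you would call wireSpout(builder, 10), for example.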

It is difficult to tell what the problem is without seeing the topology. Try to find any obvious culprits using the methods above.

