stormcrawler - Tuning StormCrawler to fully use available resources


I have a node dedicated to a StormCrawler-based crawler. At my disposal are 20 dual-core CPUs, 130 GB of RAM and a 10 Gb/s Ethernet connection.

I reduced the topology to its bare minimum: CollapsingSpout -> URLPartitionerBolt -> FetcherBolt. The spout reads from an Elasticsearch index (with ~50 M records). Elasticsearch is configured with 30 GB of RAM and 2 shards.
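For reference, a minimal sketch of what a reduced topology like this could look like when wired by hand (package and class names are those of the StormCrawler Elasticsearch module as far as I know; the component ids and parallelism hints are illustrative, not my actual values):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.elasticsearch.persistence.CollapsingSpout;

public class MinimalCrawlTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout reads URLs to fetch from the ES status index (2 shards -> 2 instances here)
        builder.setSpout("spout", new CollapsingSpout(), 2);

        // Partitioner assigns a partition key (e.g. host or domain) to each URL
        builder.setBolt("partitioner", new URLPartitionerBolt(), 2)
               .shuffleGrouping("spout");

        // Fetcher receives URLs grouped by key so politeness is enforced per host/domain
        builder.setBolt("fetch", new FetcherBolt(), 2)
               .fieldsGrouping("partitioner", new Fields("key"));

        Config conf = new Config();
        conf.setNumWorkers(1); // single worker, as described above

        StormSubmitter.submitTopology("crawler", conf, builder.createTopology());
    }
}
```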

I use a single worker with 50 GB of RAM dedicated to its JVM. Playing with different settings (mainly the total number of fetcher threads, the number of threads per queue, max spout pending, and Elasticsearch-related ones such as the number of buckets and the bucket size) I can reach an overall fetching speed of 100 MB/s. However, looking at the Ganglia reports, this corresponds to only 10% of the bandwidth available to me. Note that CPU usage is at about 20% and RAM is not an issue.
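For context, these are the kinds of knobs I have been turning; a sketch with placeholder values (the key names are the standard Storm / StormCrawler configuration properties as far as I know, and the numbers are examples rather than the values I ended up with):

```java
import org.apache.storm.Config;

public class CrawlerTuning {

    // Sketch of the configuration knobs mentioned above; values are placeholders
    static Config tunedConfig() {
        Config conf = new Config();

        // Total number of fetching threads in the FetcherBolt
        conf.put("fetcher.threads.number", 200);
        // How many threads may fetch from the same queue (host/domain/IP) at once
        conf.put("fetcher.threads.per.queue", 1);
        // Max number of unacked tuples pending per spout task
        conf.setMaxSpoutPending(2500);

        // Elasticsearch spout: number of buckets per query and URLs per bucket
        conf.put("es.status.max.buckets", 50);
        conf.put("es.status.max.urls.per.bucket", 10);

        return conf;
    }
}
```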

I'm seeking hints on where the bottleneck might be and advice on how to tune/adjust the crawler so that it uses all the resources available to me.

Thanks in advance.

Etienne

You could use Kibana or Grafana to visualise the metrics generated by StormCrawler; see the tutorial. This will give you insights into the performance. The Storm UI will also tell you about bottlenecks at the component level.
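If it helps, a sketch of how the metrics consumer from the Elasticsearch module can be registered so that the StormCrawler metrics end up in an ES index for Kibana/Grafana to chart (class name and signature as I recall them; check the version you are on):

```java
import org.apache.storm.Config;

import com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer;

public class MetricsSetup {

    // Register the ES-backed metrics consumer so the topology metrics get indexed
    static Config withMetrics(Config conf) {
        conf.registerMetricsConsumer(MetricsConsumer.class, 1);
        return conf;
    }
}
```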

You could use more than 2 shards for the status index and have a corresponding number of spout instances (see the sketch after the AggregationSpout suggestion below). This would increase the parallelism.

Do you follow outlinks from the web pages, or does the size of the index remain constant? 50 M URLs is not a lot, so I wouldn't expect ES to be super busy.

Have you tried using AggregationSpout instead? The CollapsingSpout is pretty new; also, it is probably better to use it with a bucket size of 1, I think, as it emits a separate query for each bucket.
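A rough sketch of what that change could look like, assuming the AggregationSpout from the ES module and a status index re-created with more shards (the shard count of 10 below is just an example); the spout parallelism is set to match so that each instance queries its own shard:

```java
import org.apache.storm.topology.TopologyBuilder;

import com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout;

public class SpoutSwap {

    // Replace the single CollapsingSpout with one AggregationSpout instance per shard
    static void wireSpout(TopologyBuilder builder, int numShards) {
        builder.setSpout("spout", new AggregationSpout(), numShards);
    }
}
```

With a 10-shard status index you would call wireSpout(builder, 10), for example.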

It is difficult to tell what the problem is without seeing the topology. Try to find any obvious culprits using the methods above.

