Spring Cloud DataFlow http polling and deduplication -


i have been reading spring cloud dataflow , related documentation in order produce data ingest solution run in organization's cloud foundry deployment. goal poll http service data, perhaps 3 times per day sake of discussion, , insert/update data in postgresql database. http service seems provide 10s of thousands of records per day.

one point of confusion far best practice in context of dataflow pipeline deduplicating polled records. source data not have timestamp field aid in tracking polling, coarse day-level date field. have no guarantee records not ever updated retroactively. records appear have unique id, can dedup records way, not sure based on documentation how best implement logic in dataflow. far can tell, spring cloud stream starters not provide out-of-the-box. reading spring integration's smart polling, i'm not sure that's meant address concern either.

my intuition create custom processor java component in dataflow stream performs database query determine whether polled records have been inserted, inserts appropriate records target database, or passes them on down stream. querying target database in intermediate step acceptable in stream app? alternatively, implement in spring cloud task batch operation triggers based on schedule.

what best way proceed respect dataflow app? common/best practices achieving deduplication described above in dataflow/stream/task/integration app? should copy setup of starter app or start scratch, because i'll need write custom code? need spring cloud dataflow, because i'm not sure i'll using dsl @ all? apologies questions, being new cloud foundry , these spring projects, it's daunting piece together.

thanks in advance help.

you on right track, given requirements need create custom processor. need keep track of has been inserted in order avoid duplication.

there's nothing preventing writing such processor in stream app, performance may take hit, since each record issue db query.

if order not important, parallelize query process several concurrent messages, in end db still pay price.

another approach use bloomfilter can quite lot on speeding checking inserted records.

you can start cloning starter apps, have poller trigger http client processor fetches data , go through custom code processor , jdbc-sink. stream create time --triger.cron=<cron_expression> | httpclient --httpclient.url-expression=<remote_endpoint> | customprocessor | jdbc

one of advantages of using scdf independently scale custom processor via deployment properties such deployer.customprocessor.count=8


Comments

Popular posts from this blog

android - InAppBilling registering BroadcastReceiver in AndroidManifest -

python Tkinter Capturing keyboard events save as one single string -

sql server - Why does Linq-to-SQL add unnecessary COUNT()? -