scala - MongoDB Spark Connector: load/join loses some data
Environment: Spark 2.2, MongoDB Spark Connector 2.2, Scala 2.11, Java 1.8.
I have a problem with the MongoDB Spark Connector. First of all, I want to join two datasets. Here is the original code (both result1 and result2 have an imei_md5 column):
import sc.implicits._

val textFile1 = sc.sparkContext.textFile(inputFile1)
val result1 = textFile1.map(_.split(" ")).map(i => GDT(i(0), i(1), i(2), i(3), i(4))).toDS()

val textFile2 = sc.sparkContext.textFile(inputFile2)
val result2 = textFile2.map(_.split("\t"))
  .filter(i => i.length == 2)
  .map(i => AppList(i(0), MD5.hash(i(0)), i(1)))
  .toDS()

val result3 = result1.select("imei_md5").distinct().join(result2, "imei_md5")
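For reference, the post does not show GDT, AppList, or MD5, so the following is only a sketch of how they might look, inferred from how they are constructed above; every field name other than imei_md5, and the MD5 helper itself, is my assumption.

// Assumed shapes, not from the original post. GDT must contain an imei_md5 field
// for the later select("imei_md5") to work; the other field names are placeholders.
case class GDT(imei_md5: String, f1: String, f2: String, f3: String, f4: String)

// AppList(imei, md5 of imei, app list string), matching AppList(i(0), MD5.hash(i(0)), i(1)).
case class AppList(imei: String, imei_md5: String, apps: String)

// Hypothetical stand-in for the MD5.hash helper used above.
import java.security.MessageDigest
object MD5 {
  def hash(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
}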
This code gives me the correct join result (no data loss). But when result1 already exists as a collection in MongoDB, loading it with the connector and then joining produces a result with missing rows (about 10 million (1000w) records lost). The problematic code is below:
val textFile = sc.sparkContext.textFile(filePath)
val result = textFile.map(_.split("\t"))
  .filter(i => i.length == 2)
  .map(i => AppList(i(0), MD5.hash(i(0)), i(1)))
  .toDF()

/* the sc config sets input.uri to a collection holding the same data as textFile1 in the code above */
val temp = MongoSpark.load(sc)
val result2 = temp.join(result, "imei_md5")
println(result2.count())
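The post only mentions that the sc config sets input.uri; a minimal sketch of how that setup might look with the 2.2 connector is below. The host, database, and collection in the URI are placeholders, not from the post.

import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark

// `sc` here is a SparkSession, as in the snippets above. The URI is a placeholder.
val sc = SparkSession.builder()
  .appName("imei-join")
  .config("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.result1")
  .getOrCreate()

// Loads the collection named by spark.mongodb.input.uri as a DataFrame.
val temp = MongoSpark.load(sc)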
Why is this happening, and how can I deal with it?
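(For reference, one way to see which side of the join is missing keys, using the same names as the snippets above, is to compare distinct imei_md5 counts before joining; this is only a diagnostic sketch, not a fix.)

println(temp.select("imei_md5").distinct().count())   // keys loaded from MongoDB
println(result.select("imei_md5").distinct().count()) // keys parsed from the text file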