scala - MongoDB Spark Connector: load/join loses some data -


Environment: Spark 2.2, MongoDB Spark Connector 2.2, Scala 2.11, Java 1.8.

I have a problem with the MongoDB Spark Connector. First of all, I want to join two datasets. Here is the original code (both result1 and result2 have an imei_md5 column):

import sc.implicits._

val textFile1 = sc.sparkContext.textFile(inputFile1)
val result1 = textFile1.map(_.split(" "))
                       .map(i => GDT(i(0), i(1), i(2), i(3), i(4)))
                       .toDS()

val textFile2 = sc.sparkContext.textFile(inputFile2)
val result2 = textFile2.map(_.split("\t"))
                       .filter(i => i.length == 2)
                       .map(i => AppList(i(0), MD5.hash(i(0)), i(1)))
                       .toDS()

val result3 = result1.select("imei_md5").distinct().join(result2, "imei_md5")
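For context, here is a minimal sketch of what the case classes and hash helper referenced above might look like. GDT, AppList, and MD5.hash are not shown in the question, so the field names, types, and signatures below are assumptions:

// Assumed shapes of the types used above; the real fields may differ.
case class GDT(imei_md5: String, f1: String, f2: String, f3: String, f4: String)
case class AppList(imei: String, imei_md5: String, apps: String)

object MD5 {
  import java.security.MessageDigest
  // Hash a string to its lowercase hex MD5 digest.
  def hash(s: String): String =
    MessageDigest.getInstance("MD5")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString
}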

This code gives me the correct join result (no data loss). However, when I store result1 in a MongoDB collection, then load it back and join it, the result is incomplete (about 10 million records are lost). The problematic code is below:

val textFile = sc.sparkContext.textFile(filePath)
val result = textFile.map(_.split("\t"))
                     .filter(i => i.length == 2)
                     .map(i => AppList(i(0), MD5.hash(i(0)), i(1)))
                     .toDF()

/* sc's config sets spark.mongodb.input.uri to the collection that holds
   the same data as textFile1 in the code above */
val temp = MongoSpark.load(sc)

val result2 = temp.join(result, "imei_md5")

println(result2.count())
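For reference, this is roughly how the SparkSession (sc) is presumably configured with the input.uri the comment refers to. The URI, database, and collection names here are placeholders, not the actual values from the question:

import org.apache.spark.sql.SparkSession
import com.mongodb.spark.MongoSpark

// Hypothetical configuration; the real URI/database/collection are not
// given in the question.
val sc = SparkSession.builder()
  .appName("mongo-join")
  .config("spark.mongodb.input.uri",
          "mongodb://localhost:27017/mydb.result1_collection")
  .getOrCreate()

// Loads the collection named in spark.mongodb.input.uri as a DataFrame.
val temp = MongoSpark.load(sc)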

Why is this happening, and how can I deal with it?
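One way to narrow down where the rows disappear (a diagnostic sketch, not a fix) is to compare the row and distinct-key counts on each side before joining. If the count loaded from MongoDB is already smaller than the count read from the text file, the loss happens at load time rather than in the join:

// Compare row/key counts on both sides before joining (diagnostic only).
val fromMongo = MongoSpark.load(sc)
val fromText  = result  // the DataFrame built from the text file above

println(s"mongo rows:          ${fromMongo.count()}")
println(s"text rows:           ${fromText.count()}")
println(s"mongo distinct keys: ${fromMongo.select("imei_md5").distinct().count()}")
println(s"text distinct keys:  ${fromText.select("imei_md5").distinct().count()}")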

