python - Unexpected Spark behaviour when joining on union of literal columns


I've been using Spark (currently version 2.1.0) for some time now, and I ran into some strange behaviour.

Let's say we have two DataFrames:

# sparkSession is an existing SparkSession instance
df_before = sparkSession.createDataFrame([('a', 4), ('b', 5)], ['a', 'b'])
df_after = sparkSession.createDataFrame([('a', 6), ('b', 7)], ['a', 'b'])

We augment them with a column describing their origin (or something similar). In this case the column says either before or after:

from pyspark.sql.functions import lit

df_before = df_before.withColumn('c', lit('before'))
df_after = df_after.withColumn('c', lit('after'))

And put them into one DataFrame:

df_all = df_before.union(df_after) 

Which gives us:

a | b | c
---|---|------
a | 4 | before
b | 5 | before
a | 6 | after
b | 7 | after

Next, we happen to have a different DataFrame:

data_other = [
    ('a', 'before', 10),
    ('b', 'before', 11),
    ('a', 'after', 12),
    ('b', 'after', 13),
]

df_other = sparkSession.createDataFrame(data_other, ['a', 'c', 'd'])

If I join the two straightforwardly:

df_all.join(df_other, ['a', 'c']) 

I get:

a | c      | b | d
---|--------|---|----
a | before | 4 | 10
b | before | 5 | 11
a | after  | 6 | 10
b | after  | 7 | 11

Which is different from what I expected:

a | c      | b | d
---|--------|---|----
a | before | 4 | 10
b | before | 5 | 11
a | after  | 6 | 12
b | after  | 7 | 13
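To make the expectation concrete, here is a minimal plain-Python sketch of what an inner join on the columns a and c should return for this data. It does not use Spark at all; `reference_join`, `rows_all`, `rows_other` and `observed` are hypothetical names introduced only for this illustration, and the rows are copied from the example above.

```python
# Plain-Python reference for an inner join on the key columns ('a', 'c').
# No Spark involved; all names here are illustrative, not part of any API.

rows_all = [  # columns: a, b, c
    ('a', 4, 'before'),
    ('b', 5, 'before'),
    ('a', 6, 'after'),
    ('b', 7, 'after'),
]
rows_other = [  # columns: a, c, d
    ('a', 'before', 10),
    ('b', 'before', 11),
    ('a', 'after', 12),
    ('b', 'after', 13),
]


def reference_join(left, right):
    """Inner join on (a, c); output columns are a, c, b, d."""
    lookup = {}
    for a, c, d in right:
        lookup.setdefault((a, c), []).append(d)
    return [
        (a, c, b, d)
        for a, b, c in left
        for d in lookup.get((a, c), [])
    ]


expected = reference_join(rows_all, rows_other)

# Rows as reported from Spark 2.1.0 in the question.
observed = [
    ('a', 'before', 4, 10),
    ('b', 'before', 5, 11),
    ('a', 'after', 6, 10),
    ('b', 'after', 7, 11),
]

# Rows Spark returned that a correct join would not produce.
wrong_rows = sorted(set(observed) - set(expected))
```

Both rows in `wrong_rows` carry c = 'after' but the d value of the matching 'before' row. On a real cluster, `observed` could instead be built with `[tuple(r) for r in df_all.join(df_other, ['a', 'c']).collect()]`.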

Can anyone explain this behaviour? Am I doing something wrong?

General solution: upgrade to Spark 2.1.1; the bugfix for SPARK-19766 solves it.
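If the same code has to run on several clusters, a small version check can tell whether the running Spark predates the fix. The sketch below is only an illustration: `parse_version` and `needs_workaround` are hypothetical helpers, and treating every release before 2.1.1 as suspect is an assumption, since only 2.1.0 is confirmed as affected here.

```python
import re

# Per the answer above, the SPARK-19766 fix ships with Spark 2.1.1.
FIXED_IN = (2, 1, 1)


def parse_version(version):
    """Turn '2.1.0' or '2.1.1-SNAPSHOT' into a tuple such as (2, 1, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised Spark version: %r" % version)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def needs_workaround(version):
    """True when the version predates the release that carries the fix.

    Assumption: anything older than 2.1.1 is treated as suspect, although
    only 2.1.0 is known to be affected from the discussion above.
    """
    return parse_version(version) < FIXED_IN
```

With a live session, the version string would come from `sparkSession.version`, for example `needs_workaround(sparkSession.version)`.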

If for some reason that option is not available, here is a workaround for 2.1.0:

Instead of:

df_after = df_after.withColumn('c', lit('after'))

use a UDF to create the column:

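from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_lit(literal):
    # The UDF ignores its input column and always returns the literal
    def returnLiteral(x):
        return literal
    return udf(returnLiteral, StringType())

df_after = df_after.withColumn('c', my_lit('after')(df_after['a']))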
