python - Unexpected Spark behaviour when joining on union of literal columns
I've been using Spark (currently version 2.1.0) for some time now, and ran into some strange behaviour.
Let's have two DataFrames:
df_before = sparkSession.createDataFrame([('a', 4), ('b', 5)], ['a', 'b'])
df_after = sparkSession.createDataFrame([('a', 6), ('b', 7)], ['a', 'b'])
We augment them with a column describing their origin (or something). In this case the column says before or after:
from pyspark.sql.functions import lit
df_before = df_before.withColumn('c', lit('before'))
df_after = df_after.withColumn('c', lit('after'))
And put them in one DataFrame:
df_all = df_before.union(df_after)
Which gives us:
a | b | c
---|---|------
a | 4 | before
b | 5 | before
a | 6 | after
b | 7 | after
Next, we happen to have a different DataFrame:
data_other = [
    ('a', 'before', 10),
    ('b', 'before', 11),
    ('a', 'after', 12),
    ('b', 'after', 13)
]
df_other = sparkSession.createDataFrame(data_other, ['a', 'c', 'd'])
If I join the two straightforwardly:
df_all.join(df_other, ['a', 'c'])
I get:
a | c | b | d
---|--------|---|----
a | before | 4 | 10
b | before | 5 | 11
a | after | 6 | 10
b | after | 7 | 11
Which is different from what I expected:
a | c | b | d
---|--------|---|----
a | before | 4 | 10
b | before | 5 | 11
a | after | 6 | 12
b | after | 7 | 13
Can anyone explain this behaviour? Am I doing something wrong?
General solution: upgrade to Spark 2.1.1; the bugfix SPARK-19766 solves it.
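If the same code has to run on several clusters, a small version check can decide whether anything special is needed. This is only a sketch: `parse_version` and `needs_literal_workaround` are hypothetical helper names, and the check covers only what is stated here (2.1.0 affected, 2.1.1 fixed); whether other versions are affected is not addressed.

```python
from itertools import takewhile


def parse_version(version):
    """Parse a string like '2.1.0' into (2, 1, 0), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split('.')[:3]:
        digits = ''.join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '2.1' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_literal_workaround(version):
    """Hypothetical helper: True only for 2.1.0, the version discussed here."""
    return parse_version(version) == (2, 1, 0)


print(needs_literal_workaround('2.1.0'))
print(needs_literal_workaround('2.1.1'))
```

With a live session, the version string would typically come from `sparkSession.version`.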
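If for some reason upgrading is not an option, here is a workaround for 2.1.0:
Instead of:
df_after = df_after.withColumn('c', lit('after'))
Use a UDF to create the column:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def my_lit(literal):
    # Return a UDF that ignores its input and always yields `literal`.
    def returnLiteral(x):
        return literal
    return udf(returnLiteral, StringType())

df_after = df_after.withColumn('c', my_lit('after')(df_after['a']))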
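The workaround rests on an ordinary closure: the outer function captures the literal, and the inner function returns it no matter what argument it receives. Stripped of the Spark `udf` wrapper, the pattern looks roughly like this (`make_constant` is a hypothetical name used only for this sketch). Presumably the column is then computed row by row rather than treated as a constant by the optimizer, though that explanation is an assumption and not something stated above.

```python
def make_constant(literal):
    """Build a one-argument function that always returns `literal`."""
    def return_literal(_ignored):
        # The argument exists only so the function can be applied to a column value.
        return literal
    return return_literal


always_after = make_constant('after')

# Whatever value is passed in, the captured literal comes back.
values = [always_after(x) for x in ['a', 'b']]
print(values)
```

This is also why the workaround passes `df_after['a']` to the UDF: any existing column will do, since its value is never used.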