python - Trying to understand how pandas merge works, right join specifically
I ran into a confusing result with a larger DataFrame, and have made a toy one that captures what's confusing me:
    import pandas as pd

    big_index = [123, 124, 125, 126, 127, 128, 129, 130]
    big_dat = {'year': pd.Series([2000, 2000, 2000, 2001, 2002, 2002, 2002, 2004], index=big_index),
               'other': pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], index=big_index)}
    big_df = pd.DataFrame(big_dat)

    year_index = [2003, 2000, 2001, 2002]
    year_dat = {'a': pd.Series([1, 2, 3, 4], index=year_index),
                'b': pd.Series([5, 6, 7, 8], index=year_index)}
    year_df = pd.DataFrame(year_dat)
The left and inner merges work as I'd expect, but the right and outer merges produce odd results:
    merged_right = pd.merge(
        big_df,
        year_df,
        how='right',
        left_on='year',
        right_index=True
    )
    merged_right

        other  year  a  b
    123     a  2000  2  6
    124     b  2000  2  6
    125     c  2000  2  6
    126     d  2001  3  7
    127     e  2002  4  8
    128     f  2002  4  8
    129     g  2002  4  8
    130   NaN  2003  1  5

    merged_outer = pd.merge(
        big_df,
        year_df,
        how='outer',
        left_on='year',
        right_index=True
    )
    merged_outer

        other  year    a    b
    123     a  2000  2.0  6.0
    124     b  2000  2.0  6.0
    125     c  2000  2.0  6.0
    126     d  2001  3.0  7.0
    127     e  2002  4.0  8.0
    128     f  2002  4.0  8.0
    129     g  2002  4.0  8.0
    130     h  2004  NaN  NaN
    130   NaN  2003  1.0  5.0
In both cases, index 130 gets associated with the 2003 year entry, for no apparent reason. I realize there's no "good" way to handle this, since I'm assuming the index can't have NaN in it. I'd have expected it to throw an error, though, rather than returning an incorrect last row. I assume I'm misunderstanding what pandas is doing under the hood. Any tips or resources to help figure out why this is going wrong would be appreciated, as would code showing how to do this right.
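For what it's worth, here is a minimal sketch of one way to sidestep the ambiguous index label. It assumes the original row labels of `big_df` matter and should survive the merge. The idea is to move both indexes into ordinary columns and merge column-on-column, so rows that exist only in `year_df` show up with a missing label instead of borrowing one. The column name `big_id` is made up for illustration, and `indicator=True` is added only to make the unmatched rows easy to spot. The toy data is rebuilt in a more compact form so the snippet runs on its own.

```python
import pandas as pd

# Same toy data as in the question, built directly from lists.
big_index = [123, 124, 125, 126, 127, 128, 129, 130]
big_df = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2001, 2002, 2002, 2002, 2004],
        "other": list("abcdefgh"),
    },
    index=big_index,
)
year_df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
    index=[2003, 2000, 2001, 2002],
)

# Turn both indexes into regular columns.
# "big_id" is a hypothetical name chosen for this example.
left = big_df.rename_axis("big_id").reset_index()
right = year_df.rename_axis("year").reset_index()

# Column-on-column merges ignore the original indexes, so the result gets
# a fresh integer index and unmatched rows carry NaN in "big_id".
merged_right = left.merge(right, how="right", on="year", indicator=True)
merged_outer = left.merge(right, how="outer", on="year", indicator=True)

print(merged_right)
print(merged_outer)
```

With this layout the 2003 row should have NaN in both `big_id` and `other`, and the `_merge` column marks it as `right_only`, so nothing is attributed to label 130 by accident. If the labels are needed as an index afterwards, `set_index("big_id")` can be applied once the rows with a missing label have been dealt with.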