python - Ignoring bad rows of data in pandas.read_csv() that break the header= keyword
I have a series of messy *.csv files being read into pandas. An example csv is:
```
instrument 35392
"log file name : station"
"setup date (mmddyy) : 031114"
"setup time (hhmmss) : 073648"
"starting date (mmddyy) : 031114"
"starting time (hhmmss) : 090000"
"stopping date (mmddyy) : 031115"
"stopping time (hhmmss) : 235959"
"interval (hhmmss) : 010000"
"sensor warmup (hhmmss) : 000200"
"circltr warmup (hhmmss) : 000200"
"date","time","","temp","","spcond","","sal","","ibatt",""
"mmddyy","hhmmss","","øc","","ms/cm","","ppt","","volts",""
"random message here 031114 073721 031114 083200"
03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
"message 3"
"message 4"
```
I have been using the following code to import the *.csv file, process the double headers, pull out the empty columns, and strip the offending rows with bad data:
```python
df = pd.read_csv(badfile, parse_dates={'datetime_(ascii)': [0, 1]}, sep=",",
                 header=[10, 11], na_values=['', 'na', 'nan nan'],
                 skiprows=[10], encoding='cp1252')
df = df.dropna(how="all", axis=1)
df = df.dropna(thresh=2)
droplist = ['message', 'random']
df = df[~df['datetime_(ascii)'].str.contains('|'.join(droplist))]
df.head()
```

```
    datetime_(ascii)  (temp, øc)  (spcond, ms/cm)  (sal, ppt)  (ibatt, volts)
0  03/11/14 09:00:00       15.85            1.408        0.74             6.2
1  03/11/14 10:00:00       15.99            1.960        1.05             6.3
2  03/11/14 11:00:00       14.20           40.800       26.12             6.2
3  03/11/14 12:00:01       14.20           41.700       26.77             6.2
4  03/11/14 13:00:00       14.50           41.300       26.52             6.2
```
This was working fine and dandy until I got a file that has an erroneous one-line row after the header: `"random message here 031114 073721 031114 083200"`
The error I receive is:
```
c:\users\user\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
   1554         data, names = _process_date_conversion(
   1555             data, self._date_conv, self.parse_dates, self.index_col,
-> 1556             self.index_names, names, keep_date_col=self.keep_date_col)
   1557
   1558         return names, data

c:\users\user\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
   2975     if not keep_date_col:
   2976         for c in list(date_cols):
-> 2977             data_dict.pop(c)
   2978             new_cols.remove(c)
   2979

KeyError: ('time', 'hhmmss')
```
If I remove that line, the code works fine. Similarly, if I remove the header= line, the code works fine. However, I want to be able to preserve both, because I am reading in hundreds of these files.
Difficulty: I'd prefer not to open each file before the call to pandas.read_csv(), since these files can be rather large, and I don't want to read and save them multiple times! Also, I'd prefer a real pandas/pythonic solution that doesn't involve first opening the file as a StringIO buffer and removing the offending lines.
Here's one approach, making use of the fact that `skiprows` accepts a callable. The callable receives only the row index being considered, which is a built-in limitation of the parameter.

As such, the callable `skip_test()` first checks whether the current index is in the set of known indices to skip. If it is not, it opens the actual file and checks the corresponding row to see whether its contents match.

The `skip_test()` function is a little hacky in the sense that it does inspect the actual file, although it only reads up to the current row index it is evaluating. It also assumes that the bad line always begins with the same string (in the example case, `"foo"`), but that seems to be a safe assumption given the OP.
```python
# Example data
""" foo.csv
uid,a,b,c
0,1,2,3
skip me
1,11,22,33
foo
2,111,222,333
"""

import pandas as pd

def skip_test(r, fn, fail_on, known):
    if r in known:  # we already know we want to skip these
        return True
    # check if the row index matches the problem line in the file;
    # for efficiency, quit once we pass the row index in the file
    f = open(fn, "r")
    data = f.read()
    for i, line in enumerate(data.splitlines()):
        if (i == r) & line.startswith(fail_on):
            return True
        elif i > r:
            break
    return False

fname = "foo.csv"
fail_str = "foo"
known_skip = [2]

pd.read_csv(fname, sep=",", header=0,
            skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))

# Output
#    uid    a    b    c
# 0    0    1    2    3
# 1    1   11   22   33
# 2    2  111  222  333
```
If you know which line the random message will appear on when it does appear, this will be much faster, since you can tell it not to inspect the file contents for any index past the potential offending line.
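As a sketch of that optimization (the helper name `skip_if_bad` and the `max_suspect` parameter are my own, not from the answer above): when the bad line can only occur at or before some known index, the callable can return `False` immediately for every later index and only open the file for the few candidate rows.

```python
import os
import pandas as pd

# Recreate the same toy file used in the answer above.
with open("foo.csv", "w") as f:
    f.write("uid,a,b,c\n0,1,2,3\nskip me\n1,11,22,33\nfoo\n2,111,222,333\n")

def skip_if_bad(r, fn, fail_on, known, max_suspect):
    """Hypothetical faster variant: only inspect the file when the row
    index cannot be ruled out in advance."""
    if r in known:        # known-bad indices: skip without touching the file
        return True
    if r > max_suspect:   # past the last possible offending line: never skip
        return False
    with open(fn) as f:   # inspect only the candidate row
        for i, line in enumerate(f):
            if i == r:
                return line.startswith(fail_on)
    return False

# Here the message can appear no later than line index 4 (an assumption
# made for this sketch), so rows 5+ are accepted without any file I/O.
df = pd.read_csv("foo.csv", sep=",", header=0,
                 skiprows=lambda x: skip_if_bad(x, "foo.csv", "foo", {2}, 4))
print(df)
os.remove("foo.csv")
```

For files with hundreds of thousands of rows this avoids re-reading the file once per data row, since the expensive branch only runs for indices up to `max_suspect`.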