python - Ignoring bad rows of data in pandas.read_csv() that break the header= keyword


I have a series of messy *.csv files being read into pandas. An example csv is:

    instrument 35392
    "log file name : station"
    "setup date (mmddyy) : 031114"
    "setup time (hhmmss) : 073648"
    "starting date (mmddyy) : 031114"
    "starting time (hhmmss) : 090000"
    "stopping date (mmddyy) : 031115"
    "stopping time (hhmmss) : 235959"
    "interval (hhmmss) : 010000"
    "sensor warmup (hhmmss) : 000200"
    "circltr warmup (hhmmss) : 000200"
    "date","time","","temp","","spcond","","sal","","ibatt",""
    "mmddyy","hhmmss","","øc","","ms/cm","","ppt","","volts",""
    "random message here 031114 073721 031114 083200"
    03/11/14,09:00:00,"",15.85,"",1.408,"",.74,"",6.2,""
    03/11/14,10:00:00,"",15.99,"",1.96,"",1.05,"",6.3,""
    03/11/14,11:00:00,"",14.2,"",40.8,"",26.12,"",6.2,""
    03/11/14,12:00:01,"",14.2,"",41.7,"",26.77,"",6.2,""
    03/11/14,13:00:00,"",14.5,"",41.3,"",26.52,"",6.2,""
    03/11/14,14:00:00,"",14.96,"",41,"",26.29,"",6.2,""
    "message 3"
    "message 4"

I have been using this code to import the *.csv file, process the double headers, pull out the empty columns, and strip the offending rows with bad data:

    import pandas as pd

    df = pd.read_csv(badfile, parse_dates={'datetime_(ascii)': [0, 1]}, sep=",",
                     header=[10, 11], na_values=['', 'na', 'nan nan'],
                     skiprows=[10], encoding='cp1252')

    df = df.dropna(how="all", axis=1)
    df = df.dropna(thresh=2)
    droplist = ['message', 'random']
    df = df[~df['datetime_(ascii)'].str.contains('|'.join(droplist))]

    df.head()

       datetime_(ascii)   (temp, øc)  (spcond, ms/cm)  (sal, ppt)  (ibatt, volts)
    0  03/11/14 09:00:00       15.85            1.408        0.74             6.2
    1  03/11/14 10:00:00       15.99            1.960        1.05             6.3
    2  03/11/14 11:00:00       14.20           40.800       26.12             6.2
    3  03/11/14 12:00:01       14.20           41.700       26.77             6.2
    4  03/11/14 13:00:00       14.50           41.300       26.52             6.2

This was working fine and dandy until I got a file that has an erroneous one-line row after the header: "random message here 031114 073721 031114 083200"

The error I receive is:

    c:\users\user\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _do_date_conversions(self, names, data)
       1554             data, names = _process_date_conversion(
       1555                 data, self._date_conv, self.parse_dates, self.index_col,
    -> 1556                 self.index_names, names, keep_date_col=self.keep_date_col)
       1557
       1558         return names, data

    c:\users\user\appdata\local\continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _process_date_conversion(data_dict, converter, parse_spec, index_col, index_names, columns, keep_date_col)
       2975     if not keep_date_col:
       2976         for c in list(date_cols):
    -> 2977             data_dict.pop(c)
       2978             new_cols.remove(c)
       2979

    KeyError: ('time', 'hhmmss')

If I remove that line, the code works fine. Similarly, if I remove the header= line, the code works fine. However, I want to be able to preserve this because I am reading in hundreds of these files.

Difficulty: I would prefer not to open each file before the call to pandas.read_csv(), as these files can be rather large - I don't want to read and save them multiple times! Also, I would prefer a real pandas/pythonic solution that doesn't involve first opening the file as a StringIO buffer and removing the offending lines.

Here's one approach, making use of the fact that skiprows accepts a callable function. The function receives only the row index being considered, which is a built-in limitation of that parameter.

As such, the callable function skip_test() first checks whether the current index is in the set of known indices to skip. If not, it opens the actual file and checks the corresponding row to see if its contents match.

The skip_test() function is a little hacky in the sense that it does inspect the actual file, although it only inspects up until the current row index it's evaluating. It also assumes that the bad line always begins with the same string (in the example case, "foo"), but that seems to be a safe assumption given the OP.

    # example data
    """ foo.csv
    uid,a,b,c
    0,1,2,3
    skip me
    1,11,22,33
    foo
    2,111,222,333
    """

    import pandas as pd

    def skip_test(r, fn, fail_on, known):
        if r in known:  # we already know we want to skip these
            return True
        # check if the row index matches a problem line in the file;
        # for efficiency, quit once we pass that row index in the file
        with open(fn, "r") as f:
            data = f.read()
        for i, line in enumerate(data.splitlines()):
            if i == r and line.startswith(fail_on):
                return True
            elif i > r:
                break
        return False

    fname = "foo.csv"
    fail_str = "foo"
    known_skip = [2]
    pd.read_csv(fname, sep=",", header=0,
                skiprows=lambda x: skip_test(x, fname, fail_str, known_skip))

    # output
       uid    a    b    c
    0    0    1    2    3
    1    1   11   22   33
    2    2  111  222  333
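Since the OP's files can be rather large, a small variant (not part of the original answer) streams the file object line by line instead of slurping the whole thing with f.read(), so each call reads no further than the row index it is testing. The name skip_test_lazy is illustrative:

    def skip_test_lazy(r, fn, fail_on, known):
        if r in known:
            return True
        # iterate the file object directly so only lines up to
        # index r are ever read; stop as soon as we reach it
        with open(fn, "r") as f:
            for i, line in enumerate(f):
                if i == r:
                    return line.startswith(fail_on)
        return False

Note that it still reopens the file once per candidate row, which is the same trade-off the original skip_test() makes.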

If you know what line the random message will appear on when it does appear, then this will be much faster, because you can tell it not to inspect the file contents for any index past the potential offending line. A sketch of that follows.
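For example, a minimal sketch of that idea, assuming the stray message can only ever land within the first few lines of each file (the cutoff max_suspect and the function name are illustrative assumptions, not from the original answer):

    def skip_test_bounded(r, fn, fail_on, known, max_suspect=15):
        if r in known:
            return True
        if r > max_suspect:
            # past the window where the message could appear:
            # skip the file I/O entirely for this row
            return False
        with open(fn, "r") as f:
            for i, line in enumerate(f):
                if i == r:
                    return line.startswith(fail_on)
        return False

    pd.read_csv(fname, sep=",", header=0,
                skiprows=lambda x: skip_test_bounded(x, fname, fail_str, known_skip))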

