Spark Session: read multiple files instead of using a pattern
I'm trying to read a couple of CSV files using SparkSession from a folder on HDFS (i.e. I don't want to read all the files in the folder).

I get the following error while running (code at the end):

Path does not exist: file:/home/cloudera/works/javakafkasparkstream/input/input_2.csv, /home/cloudera/works/javakafkasparkstream/input/input_1.csv

I don't want to use a pattern while reading, such as /home/temp/*.csv, the reason being that in the future I will have logic to pick only 1 or 2 files in the folder out of 100 CSV files.

Please advise.
SparkSession sparkSession = SparkSession
    .builder()
    .appName(SparkCsvProcessors.class.getName())
    .master(master)
    .getOrCreate();
SparkContext context = sparkSession.sparkContext();
context.setLogLevel("ERROR");

Set<String> fileSet = Files.list(Paths.get("/home/cloudera/works/javakafkasparkstream/input/"))
    .filter(name -> name.toString().endsWith(".csv"))
    .map(name -> name.toString())
    .collect(Collectors.toSet());

SQLContext sqlCtx = sparkSession.sqlContext();
Dataset<Row> rawDataset = sparkSession.read()
    .option("inferSchema", "true")
    .option("header", "true")
    .format("com.databricks.spark.csv")
    .option("delimiter", ",")
    //.load(String.join(" , ", fileSet));
    .load("/home/cloudera/works/javakafkasparkstream/input/input_2.csv, "
        + "/home/cloudera/works/javakafkasparkstream/input/input_1.csv");
Update

I can iterate over the files and union them as below. Please recommend if there is a better way ...
Dataset<Row> unifiedDataset = null;
for (String fileName : fileSet) {
    Dataset<Row> tempDataset = sparkSession.read()
        .option("inferSchema", "true")
        .option("header", "true")
        .format("csv")
        .option("delimiter", ",")
        .load(fileName);
    if (unifiedDataset != null) {
        unifiedDataset = unifiedDataset.unionAll(tempDataset);
    } else {
        unifiedDataset = tempDataset;
    }
}
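The loop-and-union above isn't strictly needed: Spark's DataFrameReader.load accepts multiple paths as varargs, so the selected file names can be collected into a String[] and passed in one call. A minimal, self-contained sketch of the selection step, using plain java.nio against a temporary directory (the HDFS path from the question is assumed unavailable here; the Spark call is shown as a comment):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class CsvFileSelector {
    // Collect the absolute paths of the .csv files in a directory.
    // The resulting array can be handed to sparkSession.read().load(paths),
    // since DataFrameReader.load(String... paths) takes varargs.
    static String[] selectCsvFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(p -> p.toString().endsWith(".csv"))
                .map(p -> p.toAbsolutePath().toString())
                .sorted()                      // deterministic order
                .toArray(String[]::new);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("input");
        Files.createFile(dir.resolve("input_1.csv"));
        Files.createFile(dir.resolve("input_2.csv"));
        Files.createFile(dir.resolve("notes.txt")); // filtered out

        String[] paths = selectCsvFiles(dir);
        System.out.println(paths.length);           // prints 2

        // Dataset<Row> all = sparkSession.read()
        //     .option("inferSchema", "true")
        //     .option("header", "true")
        //     .format("csv")
        //     .load(paths);   // one read, no union needed
    }
}
```

This also keeps the "pick 1 or 2 files out of 100" requirement: any filtering logic just changes which names end up in the array.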
Your problem is that you are creating a single string value:
"/home/cloudera/works/javakafkasparkstream/input/input_2.csv, /home/cloudera/works/javakafkasparkstream/input/input_1.csv"
instead of passing the 2 filenames as separate parameters. It should be done like this:

.load("/home/cloudera/works/javakafkasparkstream/input/input_2.csv",
      "/home/cloudera/works/javakafkasparkstream/input/input_1.csv");

The comma has to be outside the strings, and you should have 2 values instead of 1 string.
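The distinction can be illustrated with a plain-Java sketch of varargs semantics, using a hypothetical load method standing in for DataFrameReader.load: a comma inside the string produces one argument, while a comma between strings produces two:

```java
public class VarargsDemo {
    // Stand-in for DataFrameReader.load(String... paths): just reports
    // how many path arguments the caller actually passed.
    static int load(String... paths) {
        return paths.length;
    }

    public static void main(String[] args) {
        // Comma inside the string: ONE argument, so Spark would look for
        // a single (nonexistent) path that contains a comma.
        int joined = load("/input/input_2.csv, /input/input_1.csv");

        // Comma between the strings: TWO arguments, so two files are read.
        int separate = load("/input/input_2.csv", "/input/input_1.csv");

        System.out.println(joined);   // prints 1
        System.out.println(separate); // prints 2
    }
}
```

This is why String.join(" , ", fileSet) in the original code fails: however many files are in the set, the result is always a single string, and Spark treats it as one path.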