Spark Session: read multiple files instead of using a pattern


I'm trying to read a couple of CSV files using SparkSession from a folder on HDFS (i.e. I don't want to read all the files in the folder).

I get the following error while running (code at the end):

Path does not exist: file:/home/cloudera/works/javakafkasparkstream/input/input_2.csv, /home/cloudera/works/javakafkasparkstream/input/input_1.csv

I don't want to use a pattern while reading, like /home/temp/*.csv, the reason being that in the future I'll have logic to pick only 1 or 2 files out of 100 CSV files in the folder.

Please advise.

    SparkSession sparkSession = SparkSession
            .builder()
            .appName(SparkCsvProcessors.class.getName())
            .master(master)
            .getOrCreate();

    SparkContext context = sparkSession.sparkContext();
    context.setLogLevel("ERROR");

    Set<String> fileSet = Files.list(Paths.get("/home/cloudera/works/javakafkasparkstream/input/"))
            .filter(name -> name.toString().endsWith(".csv"))
            .map(name -> name.toString())
            .collect(Collectors.toSet());

    SQLContext sqlCtx = sparkSession.sqlContext();

    Dataset<Row> rawDataset = sparkSession.read()
            .option("inferSchema", "true")
            .option("header", "true")
            .format("com.databricks.spark.csv")
            .option("delimiter", ",")
            //.load(String.join(" , ", fileSet));
            .load("/home/cloudera/works/javakafkasparkstream/input/input_2.csv, " +
                    "/home/cloudera/works/javakafkasparkstream/input/input_1.csv");

UPDATE

I can iterate over the files and union them as below. Please recommend if there is a better way ...

    Dataset<Row> unifiedDataset = null;

    for (String fileName : fileSet) {
        Dataset<Row> tempDataset = sparkSession.read()
                .option("inferSchema", "true")
                .option("header", "true")
                .format("csv")
                .option("delimiter", ",")
                .load(fileName);
        if (unifiedDataset != null) {
            unifiedDataset = unifiedDataset.unionAll(tempDataset);
        } else {
            unifiedDataset = tempDataset;
        }
    }

Your problem is that you are creating a single string value:

"/home/cloudera/works/javakafkasparkstream/input/input_2.csv, /home/cloudera/works/javakafkasparkstream/input/input_1.csv"

Instead of that, you should pass the 2 filenames as separate parameters, like this:

    .load("/home/cloudera/works/javakafkasparkstream/input/input_2.csv",
          "/home/cloudera/works/javakafkasparkstream/input/input_1.csv");

The comma has to be outside the strings, so that you pass 2 values instead of 1 string.
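
For example, if you keep building fileSet as in your question, you could (assuming a Spark version where DataFrameReader.load accepts a varargs list of paths, i.e. 1.6+/2.x) hand the whole set to a single load call. A rough sketch:

    // Sketch only, reusing the sparkSession and fileSet built in the question.
    // Each collected path becomes its own argument, so no comma-joined string
    // and no loop-and-union are needed.
    String[] paths = fileSet.toArray(new String[0]);

    Dataset<Row> rawDataset = sparkSession.read()
            .option("inferSchema", "true")
            .option("header", "true")
            .option("delimiter", ",")
            .format("csv")
            .load(paths);

Whatever subset of the 100 files your future logic picks can be filtered into fileSet first; the load call itself does not change.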

