java - External Sort GC Overhead


I am writing an external sort to sort a large 2 GB file on disk.

I first split the file into chunks that fit in memory, sort each one individually, and rewrite them to disk. However, during this process I am getting a GC memory overhead exception in the String.split method in the function getModel. Below is the code.

    private static List<Model> getModel(String file, long lineCount, final long readSize) {
        List<Model> modelList = new ArrayList<Model>();
        long read = 0L;
        try (BufferedReader br = new BufferedReader(new FileReader(file))) {
            // skip lineCount lines
            for (long i = 0; i < lineCount; i++)
                br.readLine();
            String line = "";
            while ((line = br.readLine()) != null) {
                read += line.length();
                if (read > readSize)
                    break;
                String[] split = line.split("\t");
                String curvature = (split.length >= 7) ? split[6] : "";
                String heading = (split.length >= 8) ? split[7] : "";
                String slope = (split.length == 9) ? split[8] : "";

                modelList.add(new Model(split[0], split[1], split[2], split[3], split[4], split[5], curvature, heading, slope));
            }
            br.close();
            return modelList;
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return null;
    }

    private static void split(String inputDir, String inputFile, String outputDir, final long readSize) throws IOException {
        long lineCount = 0L;
        int count = 0;
        int writeSize = 100000;
        System.out.println("reading...");
        List<Model> curModel = getModel(inputDir + inputFile, lineCount, readSize);
        System.out.println("reading complete");
        while (curModel.size() > 0) {
            lineCount += curModel.size();
            System.out.println("sorting...");
            curModel.sort(new Comparator<Model>() {
                @Override
                public int compare(Model arg0, Model arg1) {
                    return arg0.compareTo(arg1);
                }
            });
            System.out.println("sorting complete");
            System.out.println("writing...");
            writeFile(curModel, outputDir + inputFile + count, writeSize);
            System.out.println("writing complete");
            count++;
            System.out.println("reading...");
            curModel = getModel(inputDir + inputFile, lineCount, readSize);
            System.out.println("reading complete");
        }
    }

It makes it through one pass and sorts ~250 MB of data from the file. However, on the second pass it throws a GC memory overhead exception in the String.split function. I do not want to use external libraries; I want to learn this on my own. The sorting and splitting work, but I cannot understand why the GC is throwing a memory overhead exception in the String.split function.

I'm not sure what is causing the exception. Manipulating large strings, in particular cutting and splicing them, can be a huge memory/GC issue. StringBuilder can help, but in general you may have to take more direct control over the process.
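As one illustration of the StringBuilder suggestion, here is a minimal sketch that reassembles fields into a tab-separated output line with a single reused builder, so no intermediate strings are created per concatenation. The class name LineJoiner is hypothetical and not part of the question's code, and whether this changes GC behaviour noticeably for this workload would need to be measured.

```java
import java.util.List;

// Hypothetical helper (not from the original post): joins fields with tabs
// using one reusable StringBuilder instead of repeated string concatenation.
final class LineJoiner {
    private final StringBuilder buf = new StringBuilder(256);

    String join(List<String> fields) {
        buf.setLength(0); // reuse the same backing buffer for every line
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                buf.append('\t');
            }
            buf.append(fields.get(i));
        }
        return buf.toString(); // the only new String created per line
    }
}
```

A writer loop could create one LineJoiner and call join once per record before writing the result.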

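To figure out more, you will want to run a profiler on your app. There is one that was built into the JDK (VisualVM; a separate download since JDK 9) that is quite functional. It will show you which objects Java is holding on to. Because of the nature of strings, it's possible you are holding onto a lot of redundant character array data.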
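Short of attaching a profiler, a rough first check is to log heap usage between passes. The sketch below uses only java.lang.Runtime; the class name HeapLog is hypothetical. One thing such logging could help confirm or rule out: in the posted split method, the previous chunk is still referenced by curModel while getModel builds the next one, so two chunks may be reachable at the same time during the second read.

```java
// Hypothetical helper (not from the original post): reports approximate
// heap usage so it can be printed between the read/sort/write passes.
final class HeapLog {
    private HeapLog() {}

    // Bytes currently in use on the heap (approximate; includes garbage
    // that has not been collected yet).
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Human-readable summary, e.g. for System.out.println.
    static String describe(String label) {
        long mb = 1024L * 1024L;
        long max = Runtime.getRuntime().maxMemory();
        return String.format("%s: used=%d MB, max=%d MB", label, usedBytes() / mb, max / mb);
    }
}
```

For example, calling System.out.println(HeapLog.describe("before second read")) just before the second getModel call would show how much of the heap is already occupied at that point.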

Personally I'd try a different approach. For instance, what if you sorted the entire file in memory by loading only the first 10(?) sortable characters of each line into an array, along with the file location each was read from? You would sort that array and then resolve ties by loading more (the rest?) of the lines whose prefixes are identical.

If you did that, you should be able to seek to each line and copy it to the destination file without ever caching more than one line in memory, and while only reading through the source file twice.

I suppose you could manufacture a file where this would fail, if the strings were all identical until the last couple of characters. If that ever became an issue, you might have to be able to flush the full strings you've cached (there is a Java memory reference object made to do this automatically for you; it's not particularly hard).
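The prefix-plus-offset idea can be sketched as follows. This is only an illustration under several assumptions: lines are ordered by plain String comparison of the whole line (the question's Model.compareTo is not shown), the file is treated as single-byte text, and the names PrefixIndexSort, Entry and PREFIX_LEN are hypothetical. RandomAccessFile.readLine is unbuffered and slow, so a real implementation would likely want buffered reads.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: sort a file by keeping only (prefix, offset) pairs in memory.
final class PrefixIndexSort {
    static final int PREFIX_LEN = 10; // assumption: the "first 10(?)" characters

    static final class Entry {
        final String prefix;
        final long offset;
        Entry(String prefix, long offset) { this.prefix = prefix; this.offset = offset; }
    }

    static void sort(Path in, Path out) throws IOException {
        List<Entry> index = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(in.toFile(), "r")) {
            // Pass 1: record each line's prefix and starting byte offset.
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                int n = Math.min(PREFIX_LEN, line.length());
                index.add(new Entry(line.substring(0, n), offset));
                offset = raf.getFilePointer();
            }
            // Sort by prefix; on a tie, re-read both full lines from disk.
            index.sort((a, b) -> {
                int c = a.prefix.compareTo(b.prefix);
                if (c != 0) return c;
                try {
                    return readLineAt(raf, a.offset).compareTo(readLineAt(raf, b.offset));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            // Pass 2: seek to each line in sorted order and copy it out.
            try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.ISO_8859_1)) {
                for (Entry e : index) {
                    w.write(readLineAt(raf, e.offset));
                    w.write('\n');
                }
            }
        }
    }

    static String readLineAt(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        return raf.readLine(); // maps each byte to one char (Latin-1 style)
    }
}
```

Writing with ISO_8859_1 mirrors how RandomAccessFile.readLine maps bytes to chars, so the bytes of each line are copied through unchanged.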
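The answer does not name the reference object it has in mind; assuming it means java.lang.ref.SoftReference, a cache of full lines that the garbage collector is allowed to clear under memory pressure might look like the sketch below. The class name SoftLineCache and the loader callback are hypothetical.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical sketch: caches full lines by file offset behind SoftReferences,
// so the GC may discard them when memory runs low; misses are reloaded.
final class SoftLineCache {
    private final Map<Long, SoftReference<String>> cache = new HashMap<>();
    private final LongFunction<String> loader; // e.g. seeks to the offset and reads the line

    SoftLineCache(LongFunction<String> loader) {
        this.loader = loader;
    }

    String get(long offset) {
        SoftReference<String> ref = cache.get(offset);
        String line = (ref == null) ? null : ref.get(); // null if never cached or already cleared
        if (line == null) {
            line = loader.apply(offset);
            cache.put(offset, new SoftReference<>(line));
        }
        return line;
    }
}
```

The map entries themselves still take memory, so this only bounds the cost of the cached strings, not of the index.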

