powershell - Need to split long (1,000,000+ line) CSV files based on column value and rename with value from other column


I have a folder of CSV files in the following format:

file-2017-08-14.csv

ticker  price   date
aapl    1   2017-08-14
aapl    2   2017-08-14
aapl    3   2017-08-14
aapl    4   2017-08-14
msft    5   2017-08-14
msft    6   2017-08-14
msft    7   2017-08-14
goog    8   2017-08-14
goog    9   2017-08-14
...

file-2017-08-13.csv

ticker  price   date
aapl    1   2017-08-13
aapl    2   2017-08-13
aapl    3   2017-08-13
aapl    4   2017-08-13
msft    5   2017-08-13
msft    6   2017-08-13
msft    7   2017-08-13
goog    8   2017-08-13
goog    9   2017-08-13
...

and so on. I need to split them into 2 x 3 = 6 subfiles, named accordingly:

/out/aapl-2017-08-14.csv

ticker  price   date
aapl    1   2017-08-14
aapl    2   2017-08-14
aapl    3   2017-08-14
aapl    4   2017-08-14

/out/msft-2017-08-14.csv

ticker  price   date
msft    5   2017-08-14
msft    6   2017-08-14
msft    7   2017-08-14

/out/goog-2017-08-14.csv

ticker  price   date
goog    8   2017-08-14
goog    9   2017-08-14

/out/aapl-2017-08-13.csv

ticker  price   date
aapl    1   2017-08-13
aapl    2   2017-08-13
aapl    3   2017-08-13
aapl    4   2017-08-13

/out/msft-2017-08-13.csv

ticker  price   date
msft    5   2017-08-13
msft    6   2017-08-13
msft    7   2017-08-13

/out/goog-2017-08-13.csv

ticker  price   date
goog    8   2017-08-13
goog    9   2017-08-13

I have written a script that can group by ticker and split a single file, but I can't figure out how to do the proper renaming, and I don't know how to loop over all of the files in the input folder.

Import-Csv file-2017-08-14.csv | Group-Object -Property "ticker" | ForEach-Object {
    $path = $_.Name + ".csv";
    $_.Group | Export-Csv -Path $path -NoTypeInformation
}

Any ideas?

Method 1

Get-ChildItem -Filter '*.csv' -File -Force `
    | Select-Object -ExpandProperty 'FullName' `
    | Import-Csv -Delimiter "`t" `
    | ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.ticker, $_.date;

        $_ | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
    };

The lines above perform the following:

  • Get-ChildItem retrieves the .csv files in the current directory (not including child directories).
  • The result of Get-ChildItem is FileInfo instances, but we want to pass string instances representing file paths to Import-Csv, so we use Select-Object to send only the FullName property down the pipeline.
  • Import-Csv reads the CSV file(s) specified in the pipeline and passes each record down the pipeline.
  • Inside of ForEach-Object, the $_ variable holds each CSV record. We build the output path appropriate to that record using its ticker and date properties (the latter is already a string, not a DateTime, so no formatting is necessary). We then pass the record to Export-Csv, having it append a new row to the file at $outputFilePath.

While this code is short and simple, opening and appending to an output file once per input record is slow for a million lines, though memory usage is minimal because only one record is in memory at a given time.

Method 2

We can improve on this code by appending to each output file only after every 1,000 records (or whatever value you like) instead of after every record. A hashtable stores the pending records for each output file, and the pending records are flushed when a given output file has exceeded its pending record limit or there are no more records to read (the end of the input files):

$pendingRecordsByFilePath = @{};
$maxPendingRecordsPerFilePath = 1000;

Get-ChildItem -Filter '*.csv' -File -Force `
    | Select-Object -ExpandProperty 'FullName' `
    | Import-Csv -Delimiter "`t" `
    | ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.ticker, $_.date;
        $pendingRecords = $pendingRecordsByFilePath[$outputFilePath];

        if ($pendingRecords -eq $null)
        {
            # First time we're encountering this output file; create a new array
            $pendingRecords = @();
        }
        elseif ($pendingRecords.Length -ge $maxPendingRecordsPerFilePath)
        {
            # Flush the pending records to the output file
            $pendingRecords `
                | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
            $pendingRecords = @();
        }

        $pendingRecords += $_;
        $pendingRecordsByFilePath[$outputFilePath] = $pendingRecords;
    };

# No more input records to read; flush the pending records for each output file
foreach ($outputFilePath in $pendingRecordsByFilePath.Keys)
{
    $pendingRecordsByFilePath[$outputFilePath] `
        | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
}

Method 3

We can improve this further by using a List&lt;Object&gt; instead of an array to store the pending records to write. Setting the capacity of the list to $maxPendingRecordsPerFilePath upon creation eliminates the overhead of expanding arrays every time a record is added.

$pendingRecordsByFilePath = @{};
$maxPendingRecordsPerFilePath = 1000;

Get-ChildItem -Filter '*.csv' -File -Force `
    | Select-Object -ExpandProperty 'FullName' `
    | Import-Csv -Delimiter "`t" `
    | ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.ticker, $_.date;
        $pendingRecords = $pendingRecordsByFilePath[$outputFilePath];

        if ($pendingRecords -eq $null)
        {
            # First time we're encountering this output file; create a new list
            $pendingRecords = New-Object `
                -TypeName 'System.Collections.Generic.List[Object]' `
                -ArgumentList (,$maxPendingRecordsPerFilePath);
            $pendingRecordsByFilePath[$outputFilePath] = $pendingRecords;
        }
        elseif ($pendingRecords.Count -ge $maxPendingRecordsPerFilePath)
        {
            # Flush the pending records to the output file
            $pendingRecords `
                | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
            $pendingRecords.Clear();
        }

        $pendingRecords.Add($_);
    };

# No more input records to read; flush the pending records for each output file
foreach ($outputFilePath in $pendingRecordsByFilePath.Keys)
{
    $pendingRecordsByFilePath[$outputFilePath] `
        | Export-Csv -Path $outputFilePath -Append -NoTypeInformation;
}

Method 4a

We can eliminate the need to buffer output records/lines and to repeatedly open/append the output files if we use the StreamWriter class. We'll create one StreamWriter per output file and leave them all open until we're finished. A try/finally block is necessary to ensure they are closed properly. We use ConvertTo-Csv to generate the output, which includes a header line whether we need it or not, so there is logic to ensure we only write the header when a file is first opened.

$truncateExistingOutputFiles = $true;
$outputFileWritersByPath = @{};

try
{
    Get-ChildItem -Filter '*.csv' -File -Force `
        | Select-Object -ExpandProperty 'FullName' `
        | Import-Csv -Delimiter "`t" `
        | ForEach-Object -Process {
            $outputFilePath = Join-Path -Path (Get-Location) -ChildPath ('out\{0}-{1}.csv' -f $_.ticker, $_.date);
            $outputFileWriter = $outputFileWritersByPath[$outputFilePath];
            $outputLines = $_ | ConvertTo-Csv -NoTypeInformation;

            if ($outputFileWriter -eq $null)
            {
                # First time we're encountering this output file; create a new StreamWriter
                $outputFileWriter = New-Object `
                    -TypeName 'System.IO.StreamWriter' `
                    -ArgumentList ($outputFilePath, -not $truncateExistingOutputFiles, [System.Text.Encoding]::ASCII);

                $outputFileWritersByPath[$outputFilePath] = $outputFileWriter;

                # Write the header line
                $outputFileWriter.WriteLine($outputLines[0]);
            }

            # Write the data line
            $outputFileWriter.WriteLine($outputLines[1]);
        };
}
finally
{
    foreach ($writer in $outputFileWritersByPath.Values)
    {
        $writer.Close();
    }
}

Surprisingly, this caused a performance change of 175%... slower. I'll nail down why as I further revise this code below.

Method 4b

My first thought to address the performance drop was to reintroduce output buffering; basically, combine Methods 3 and 4a. Equally surprisingly, this further hurt performance. My guess as to why is that StreamWriter does its own character buffering, which makes doing our own buffering unnecessary. In fact, I tested values of $maxPendingRecordsPerFilePath in powers of 10 from 10 to 100,000 and the overall performance difference between the two extremes was a mere 5 seconds. Thus, our own buffering isn't helping anything, and the tiny overhead of managing the list adds an additional 30 seconds of run time over a million iterations. So, let's scrap the buffering.
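For reference, the buffered variant's inner loop looked roughly like the sketch below. This is a reconstruction from the description above, not the exact code that was benchmarked; it assumes a hypothetical $pendingLinesByFilePath hashtable (initialized to @{} alongside $outputFileWritersByPath) that buffers already-converted data lines per output file:

# Sketch only. Inside ForEach-Object, after the StreamWriter for
# $outputFilePath has been created/looked up exactly as in Method 4a:
$pendingLines = $pendingLinesByFilePath[$outputFilePath];

if ($pendingLines -eq $null)
{
    # First time we're encountering this output file; create a new list
    $pendingLines = New-Object `
        -TypeName 'System.Collections.Generic.List[string]' `
        -ArgumentList (,$maxPendingRecordsPerFilePath);
    $pendingLinesByFilePath[$outputFilePath] = $pendingLines;
}
elseif ($pendingLines.Count -ge $maxPendingRecordsPerFilePath)
{
    # Flush the buffered data lines to the already-open StreamWriter
    foreach ($line in $pendingLines)
    {
        $outputFileWriter.WriteLine($line);
    }
    $pendingLines.Clear();
}

# Buffer this record's data line (ConvertTo-Csv returns a header line and a data line)
$pendingLines.Add(($_ | ConvertTo-Csv -NoTypeInformation)[1]);

# Note: as in Methods 2 and 3, any remaining buffered lines must be flushed to
# their StreamWriters after the pipeline completes, before closing the writers.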

Method 4c

Instead of using ConvertTo-Csv to get a 2-element array of strings (a header line and a data line), let's build those two lines ourselves using string formatting.
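Concretely, the two WriteLine calls in the loop become the following (the same lines that appear in the final listing below):

# Header line, hard-coded and written once when the output file is first created
$outputFileWriter.WriteLine('"ticker","price","date"');

# Data line, built with string interpolation instead of ConvertTo-Csv
$outputFileWriter.WriteLine("""$($_.ticker)"",""$($_.price)"",""$($_.date)""");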

Method 4d

On every iteration of ForEach-Object we need to build the output file path because it's based on the input object's ticker and date properties. We pass an absolute path when constructing the StreamWriter because PowerShell has a different notion of the "current directory" (from which relative paths are based) than a typical .NET application. We've been calling Get-Location to build that absolute path on every iteration, which isn't necessary because that part of the path never changes. So, let's move the call to Get-Location outside of ForEach-Object.
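In other words, the Get-Location call is hoisted out of the loop so it runs once rather than a million times; roughly:

# Resolve the absolute output directory once, before the pipeline starts
$outputDirectoryPath = Join-Path -Path (Get-Location) -ChildPath 'out';

# ...then inside ForEach-Object only the file name varies per record
$outputFilePath = Join-Path -Path $outputDirectoryPath -ChildPath ('{0}-{1}.csv' -f $_.ticker, $_.date);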

Method 4e

Instead of using Join-Path to build our output file path, let's try .NET's Path.Combine method.
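That is, the per-record path construction becomes a direct call to the static .NET method, as in the final listing below:

$outputFileName = '{0}-{1}.csv' -f $_.ticker, $_.date;
$outputFilePath = [System.IO.Path]::Combine($outputDirectoryPath, $outputFileName);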

Method 4f

Instead of using Join-Path to build our output file path, let's try less platform-agnostic string interpolation ($outputFilePath = "$outputDirectoryPath\$outputFileName";).

Combining the changes from Methods 4a, 4c, 4d, and 4e gives this final code:

$truncateExistingOutputFiles = $true;
$outputDirectoryPath = Join-Path -Path (Get-Location) -ChildPath 'out';
$outputFileWritersByPath = @{};

try
{
    Get-ChildItem -Filter '*.csv' -File -Force `
        | Select-Object -ExpandProperty 'FullName' `
        | Import-Csv -Delimiter "`t" `
        | ForEach-Object -Process {
            $outputFileName = '{0}-{1}.csv' -f $_.ticker, $_.date;
            $outputFilePath = [System.IO.Path]::Combine($outputDirectoryPath, $outputFileName);
            $outputFileWriter = $outputFileWritersByPath[$outputFilePath];

            if ($outputFileWriter -eq $null)
            {
                # First time we're encountering this output file; create a new StreamWriter
                $outputFileWriter = New-Object `
                    -TypeName 'System.IO.StreamWriter' `
                    -ArgumentList ($outputFilePath, -not $truncateExistingOutputFiles, [System.Text.Encoding]::ASCII);

                $outputFileWritersByPath[$outputFilePath] = $outputFileWriter;

                # Write the header line
                $outputFileWriter.WriteLine('"ticker","price","date"');
            }

            # Write the data line
            $outputFileWriter.WriteLine("""$($_.ticker)"",""$($_.price)"",""$($_.date)""");
        };
}
finally
{
    foreach ($writer in $outputFileWritersByPath.Values)
    {
        $writer.Close();
    }
}

Here are benchmarks for each method, averaged over 3 runs each against a million-line CSV. They were performed on a Core i7 860 @ 2.8 GHz with TurboBoost disabled, running 64-bit PowerShell v5.1 on Windows 10 Pro v1703:

+--------+----------------------+----------------------+--------------+---------------------+-----------------+
| Method |     Path handling    |     Line building    | File writing |   Output buffering  |  Execution time |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|    1   |       relative       |      Export-Csv      |  Export-Csv  |          no         | 2,178.5 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|    2   |       relative       |      Export-Csv      |  Export-Csv  | 1,000-element array |   222.9 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|    3   |       relative       |      Export-Csv      |  Export-Csv  |  1,000-element list |   154.2 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4a   |       Join-Path      |     ConvertTo-Csv    | StreamWriter |          no         |   425.0 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4b   |       Join-Path      |     ConvertTo-Csv    | StreamWriter |  1,000-element list |   456.1 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4c   |       Join-Path      | string interpolation | StreamWriter |          no         |   302.5 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4d   |       Join-Path      | string interpolation | StreamWriter |          no         |   225.1 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4e   | [IO.Path]::Combine() | string interpolation | StreamWriter |          no         |    78.0 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
|   4f   | string interpolation | string interpolation | StreamWriter |          no         |    77.7 seconds |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+

Key takeaways:

  • When using Export-Csv, output buffering (1 → 2 and 1 → 3) provides a massive performance improvement.
  • When using StreamWriters, output buffering (4a → 4b) is not helpful and actually causes a small performance hit.
  • Eliminating ConvertTo-Csv (4b → 4c) reduced execution time by a third (153.6 seconds).
  • Method 4a is slower than the buffered Export-Csv methods because it introduced the use of Get-Location and Join-Path. Either these cmdlets involve a lot more processing behind the scenes than meets the eye, or invoking cmdlets is slow in general (when done a million times, of course).
    • Moving Get-Location outside of ForEach-Object (4c → 4d) reduced execution time by a quarter (77.4 seconds).
    • Using [System.IO.Path]::Combine() instead of Join-Path (4d → 4e) reduced execution time by two-thirds (147.1 seconds).
  • Script optimization is fun and educational!
