PowerShell - Need to split long (1,000,000+ line) CSV files based on column value and rename with value from another column
I have a folder of CSV files in the following format:
file-2017-08-14.csv
ticker  price  date
aapl    1      2017-08-14
aapl    2      2017-08-14
aapl    3      2017-08-14
aapl    4      2017-08-14
msft    5      2017-08-14
msft    6      2017-08-14
msft    7      2017-08-14
goog    8      2017-08-14
goog    9      2017-08-14
...
file-2017-08-13.csv
ticker  price  date
aapl    1      2017-08-13
aapl    2      2017-08-13
aapl    3      2017-08-13
aapl    4      2017-08-13
msft    5      2017-08-13
msft    6      2017-08-13
msft    7      2017-08-13
goog    8      2017-08-13
goog    9      2017-08-13
...
And so on. I need to split these into 2 x 3 = 6 subfiles, named accordingly:
/out/aapl-2017-08-14.csv
ticker  price  date
aapl    1      2017-08-14
aapl    2      2017-08-14
aapl    3      2017-08-14
aapl    4      2017-08-14
/out/msft-2017-08-14.csv
ticker  price  date
msft    5      2017-08-14
msft    6      2017-08-14
msft    7      2017-08-14
/out/goog-2017-08-14.csv
ticker  price  date
goog    8      2017-08-14
goog    9      2017-08-14
/out/aapl-2017-08-13.csv
ticker  price  date
aapl    1      2017-08-13
aapl    2      2017-08-13
aapl    3      2017-08-13
aapl    4      2017-08-13
/out/msft-2017-08-13.csv
ticker  price  date
msft    5      2017-08-13
msft    6      2017-08-13
msft    7      2017-08-13
/out/goog-2017-08-13.csv
ticker  price  date
goog    8      2017-08-13
goog    9      2017-08-13
I have written a script that can group by ticker and split a single file, but I can't figure out how to do the renaming properly, and I don't know how to loop over all of the files in the input folder.
Import-Csv file-2017-08-14.csv |
    Group-Object -Property "ticker" |
    ForEach-Object {
        $path = $_.Name + ".csv"
        $_.Group | Export-Csv -Path $path -NoTypeInformation
    }
Any ideas?
Method 1
Get-ChildItem -Filter '*.csv' -File -Force |
    Select-Object -ExpandProperty 'FullName' |
    Import-Csv -Delimiter "`t" |
    ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.ticker, $_.date
        $_ | Export-Csv -Path $outputFilePath -Append -NoTypeInformation
    }
The lines above do the following:

- Get-ChildItem retrieves the .csv files in the current directory (not including child directories).
- The result of Get-ChildItem is FileInfo instances, but we want to pass string instances representing file paths to Import-Csv, so we use Select-Object to pass the FullName property down the pipeline.
- Import-Csv reads the CSV file(s) specified in the pipeline and passes each record down the pipeline.
- Inside of ForEach-Object, the $_ variable holds each CSV record. We build the output path appropriate to the record using its ticker and date properties (the latter of which is already a string, not a DateTime, so no formatting is necessary), then pass the record to Export-Csv, having it append a new row to the file at $outputFilePath.
While this code is short and simple, opening and appending to an output file once per input record makes it very slow for a million lines, though memory usage is minimal because only one record is in memory at a given time.
Method 2

We can improve the code by appending to each output file after every 1,000 records (or whatever value you like) instead of after every record. A hashtable stores the pending records for each output file, and the pending records are flushed when a given output file exceeds its pending record limit or there are no more records to read (end of the input files):
$pendingRecordsByFilePath = @{}
$maxPendingRecordsPerFilePath = 1000

Get-ChildItem -Filter '*.csv' -File -Force |
    Select-Object -ExpandProperty 'FullName' |
    Import-Csv -Delimiter "`t" |
    ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.ticker, $_.date
        $pendingRecords = $pendingRecordsByFilePath[$outputFilePath]

        if ($pendingRecords -eq $null) {
            # First time we're encountering this output file; create a new array
            $pendingRecords = @()
        } elseif ($pendingRecords.Length -ge $maxPendingRecordsPerFilePath) {
            # Flush the pending records to the output file
            $pendingRecords |
                Export-Csv -Path $outputFilePath -Append -NoTypeInformation
            $pendingRecords = @()
        }

        $pendingRecords += $_
        $pendingRecordsByFilePath[$outputFilePath] = $pendingRecords
    }

# No more input records to read; flush the pending records for each output file
foreach ($outputFilePath in $pendingRecordsByFilePath.Keys) {
    $pendingRecordsByFilePath[$outputFilePath] |
        Export-Csv -Path $outputFilePath -Append -NoTypeInformation
}
Method 3

We can improve this further by using a List[Object] instead of an array to store the pending records to write. Setting the capacity of the list to $maxPendingRecordsPerFilePath upon creation eliminates the overhead of expanding the array every time a record is added.
$pendingRecordsByFilePath = @{}
$maxPendingRecordsPerFilePath = 1000

Get-ChildItem -Filter '*.csv' -File -Force |
    Select-Object -ExpandProperty 'FullName' |
    Import-Csv -Delimiter "`t" |
    ForEach-Object -Process {
        $outputFilePath = "out\{0}-{1}.csv" -f $_.ticker, $_.date
        $pendingRecords = $pendingRecordsByFilePath[$outputFilePath]

        if ($pendingRecords -eq $null) {
            # First time we're encountering this output file; create a new list
            $pendingRecords = New-Object `
                -TypeName 'System.Collections.Generic.List[Object]' `
                -ArgumentList (,$maxPendingRecordsPerFilePath)
            $pendingRecordsByFilePath[$outputFilePath] = $pendingRecords
        } elseif ($pendingRecords.Count -ge $maxPendingRecordsPerFilePath) {
            # Flush the pending records to the output file
            $pendingRecords |
                Export-Csv -Path $outputFilePath -Append -NoTypeInformation
            $pendingRecords.Clear()
        }

        $pendingRecords.Add($_)
    }

# No more input records to read; flush the pending records for each output file
foreach ($outputFilePath in $pendingRecordsByFilePath.Keys) {
    $pendingRecordsByFilePath[$outputFilePath] |
        Export-Csv -Path $outputFilePath -Append -NoTypeInformation
}
Method 4a

We can eliminate the need for buffering records/lines of output, and for repeatedly opening and appending to the output files, by using the StreamWriter class. We'll create one StreamWriter per output file and leave them all open until we're finished. A try/finally block is necessary to ensure they are closed properly. We use ConvertTo-Csv to generate the output, which always includes a header line whether we need it or not, so there is logic to ensure we only write the header when a file is first opened.
$truncateExistingOutputFiles = $true
$outputFileWritersByPath = @{}

try {
    Get-ChildItem -Filter '*.csv' -File -Force |
        Select-Object -ExpandProperty 'FullName' |
        Import-Csv -Delimiter "`t" |
        ForEach-Object -Process {
            $outputFilePath = Join-Path -Path (Get-Location) -ChildPath ('out\{0}-{1}.csv' -f $_.ticker, $_.date)
            $outputFileWriter = $outputFileWritersByPath[$outputFilePath]
            $outputLines = $_ | ConvertTo-Csv -NoTypeInformation

            if ($outputFileWriter -eq $null) {
                # First time we're encountering this output file; create a new StreamWriter
                $outputFileWriter = New-Object `
                    -TypeName 'System.IO.StreamWriter' `
                    -ArgumentList ($outputFilePath, -not $truncateExistingOutputFiles, [System.Text.Encoding]::ASCII)
                $outputFileWritersByPath[$outputFilePath] = $outputFileWriter

                # Write the header line
                $outputFileWriter.WriteLine($outputLines[0])
            }

            # Write the data line
            $outputFileWriter.WriteLine($outputLines[1])
        }
} finally {
    foreach ($writer in $outputFileWritersByPath.Values) {
        $writer.Close()
    }
}
Surprisingly, this caused a performance change of 175%... slower. I'll nail down why as I further revise this code.
Method 4b

My first thought to address the performance drop was to reintroduce output buffering; basically, combine methods 3 and 4a. Equally surprising, this further hurt performance. My guess as to why is that StreamWriter does its own character buffering, which makes doing our own buffering unnecessary. In fact, I tested values of $maxPendingRecordsPerFilePath in powers of 10 from 10 to 100,000, and the overall performance difference between the two extremes was a mere 5 seconds. Thus, our own buffering isn't helping anything, and the tiny overhead of managing the List adds an additional 30 seconds of run time over a million iterations. So, let's scrap the buffering.
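For reference, the buffered variant looked roughly like this inside the ForEach-Object block (a sketch of the idea, not the exact code I benchmarked; it assumes a per-output-file List[string] named $pendingLines kept alongside each StreamWriter):

# Buffer the data line instead of writing it immediately
$pendingLines.Add(($_ | ConvertTo-Csv -NoTypeInformation)[1])

# Once the limit is hit, flush the buffered lines to the already-open StreamWriter
if ($pendingLines.Count -ge $maxPendingRecordsPerFilePath) {
    foreach ($line in $pendingLines) {
        $outputFileWriter.WriteLine($line)
    }
    $pendingLines.Clear()
}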
Method 4c

Instead of using ConvertTo-Csv to get a two-element array of strings (a header line and a data line), let's build those two lines ourselves using string formatting.
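As a sketch of the change (the same lines appear in the final code below), the two WriteLine calls inside ForEach-Object become:

# When the file is first opened, write the constant header literally...
$outputFileWriter.WriteLine('"ticker","price","date"')

# ...and for every record, build the data line by interpolating the properties directly
$outputFileWriter.WriteLine("""$($_.ticker)"",""$($_.price)"",""$($_.date)""")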
Method 4d

On every iteration of ForEach-Object we need to build the output file path because it's based on the input object's ticker and date properties. We pass an absolute path when constructing the StreamWriter because PowerShell has a different notion of "current directory" (from which relative paths are based) than a typical .NET application. We've been calling Get-Location to build that absolute path on every iteration, which isn't necessary because the directory portion of the path never changes. So, let's move the call to Get-Location outside of ForEach-Object, as in the sketch below.
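Roughly (a sketch; $outputDirectoryPath is the name used in the final code below):

# Before the pipeline: resolve the output directory once
$outputDirectoryPath = Join-Path -Path (Get-Location) -ChildPath 'out'

# Inside ForEach-Object: only the per-record file name is built each time
$outputFilePath = Join-Path -Path $outputDirectoryPath -ChildPath ('{0}-{1}.csv' -f $_.ticker, $_.date)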
Method 4e

Instead of using Join-Path to build our output file path, let's try .NET's Path.Combine method.
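That change looks like this inside ForEach-Object (the same lines appear in the final code below):

# Build the path with the .NET API instead of the Join-Path cmdlet
$outputFileName = '{0}-{1}.csv' -f $_.ticker, $_.date
$outputFilePath = [System.IO.Path]::Combine($outputDirectoryPath, $outputFileName)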
Method 4f

Instead of using Join-Path to build our output file path, let's try the less platform-agnostic string interpolation ($outputFilePath = "$outputDirectoryPath\$outputFileName";).
Combining the changes from methods 4a, 4c, 4d, and 4e, here is the final code:
$truncateExistingOutputFiles = $true
$outputDirectoryPath = Join-Path -Path (Get-Location) -ChildPath 'out'
$outputFileWritersByPath = @{}

try {
    Get-ChildItem -Filter '*.csv' -File -Force |
        Select-Object -ExpandProperty 'FullName' |
        Import-Csv -Delimiter "`t" |
        ForEach-Object -Process {
            $outputFileName = '{0}-{1}.csv' -f $_.ticker, $_.date
            $outputFilePath = [System.IO.Path]::Combine($outputDirectoryPath, $outputFileName)
            $outputFileWriter = $outputFileWritersByPath[$outputFilePath]

            if ($outputFileWriter -eq $null) {
                # First time we're encountering this output file; create a new StreamWriter
                $outputFileWriter = New-Object `
                    -TypeName 'System.IO.StreamWriter' `
                    -ArgumentList ($outputFilePath, -not $truncateExistingOutputFiles, [System.Text.Encoding]::ASCII)
                $outputFileWritersByPath[$outputFilePath] = $outputFileWriter

                # Write the header line
                $outputFileWriter.WriteLine('"ticker","price","date"')
            }

            # Write the data line
            $outputFileWriter.WriteLine("""$($_.ticker)"",""$($_.price)"",""$($_.date)""")
        }
} finally {
    foreach ($writer in $outputFileWritersByPath.Values) {
        $writer.Close()
    }
}
Here are benchmarks for each method, averaged over 3 runs each against a million-line CSV. They were performed on a Core i7 860 @ 2.8 GHz with TurboBoost disabled, running 64-bit PowerShell v5.1 on Windows 10 Pro v1703:
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
| Method | Path handling        | Line building        | File writing | Output buffering    | Execution time  |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
| 1      | Relative             | Export-Csv           | Export-Csv   | No                  | 2,178.5 seconds |
| 2      | Relative             | Export-Csv           | Export-Csv   | 1,000-element array | 222.9 seconds   |
| 3      | Relative             | Export-Csv           | Export-Csv   | 1,000-element List  | 154.2 seconds   |
| 4a     | Join-Path            | ConvertTo-Csv        | StreamWriter | No                  | 425.0 seconds   |
| 4b     | Join-Path            | ConvertTo-Csv        | StreamWriter | 1,000-element List  | 456.1 seconds   |
| 4c     | Join-Path            | String interpolation | StreamWriter | No                  | 302.5 seconds   |
| 4d     | Join-Path            | String interpolation | StreamWriter | No                  | 225.1 seconds   |
| 4e     | [IO.Path]::Combine() | String interpolation | StreamWriter | No                  | 78.0 seconds    |
| 4f     | String interpolation | String interpolation | StreamWriter | No                  | 77.7 seconds    |
+--------+----------------------+----------------------+--------------+---------------------+-----------------+
Key takeaways:

- When using Export-Csv, output buffering (1 → 2 and 1 → 3) provides a massive performance improvement.
- When using StreamWriters, output buffering (4a → 4b) is not helpful and actually causes a small performance hit.
- Eliminating ConvertTo-Csv (4a → 4c) reduced execution time by a third (153.6 seconds).
- Method 4a was slower than the buffered Export-Csv methods because it introduced the use of Get-Location and Join-Path. Either these cmdlets involve a lot more processing behind the scenes than meets the eye, or invoking cmdlets is slow in general (when done a million times, of course).
- Moving the Get-Location call outside of ForEach-Object (4c → 4d) reduced execution time by a quarter (77.4 seconds).
- Using [System.IO.Path]::Combine() instead of Join-Path (4d → 4e) reduced execution time by two-thirds (147.1 seconds).
- Script optimization is fun and educational!