python - batch data prefetching using a queue on sharded data files
I generated 500 sharded numpy data files, each of them containing 10000 data samples (e.g., image, label), for example:
file-000001.npy file-000002.npy file-000003.npy ... file-000500.npy
Each .npy file contains a numpy dictionary with keys and sizes {'image': 10000x3x512x64 (dtype=np.float32), 'label': 10000x100 (dtype=np.float32)}. Please note that some of these numpy files contain fewer than 10000 samples, e.g. 8111, etc.
During training, each epoch needs to iterate over all 500x10000 samples. These data cannot all be loaded into memory due to capacity limits, so a common solution is data prefetching with a queue.
My thought is as follows: (1) first record the filenames and the count of data samples in each file; (2) for each batch, compute the batch indices, then load the corresponding data files into memory and read the required data samples to compose the batch.
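Step (1) can be sketched as a cumulative-sum index that maps a global sample index to a (file, local index) pair. The filenames and per-file counts below are stand-ins for illustration (one shard is marked short, like the 8111-sample files mentioned above):

```python
import numpy as np

# Hypothetical shard inventory: filenames and the sample count of each file.
files = ["file-%06d.npy" % i for i in range(1, 501)]
counts = np.full(len(files), 10000, dtype=np.int64)
counts[41] = 8111  # example of a shard holding fewer than 10000 samples

# Cumulative boundaries let us map a global sample index to (file, local index).
offsets = np.cumsum(counts)  # offsets[i] = first global index AFTER file i
total = int(offsets[-1])     # total number of samples across all shards

def locate(global_idx):
    """Map a global sample index to (file index, index within that file)."""
    file_idx = int(np.searchsorted(offsets, global_idx, side="right"))
    local_idx = int(global_idx - (offsets[file_idx - 1] if file_idx > 0 else 0))
    return file_idx, local_idx
```

With this table, computing the batch indices in step (2) reduces to calling `locate` on each global index and grouping the results by file.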
During step (2), if the batch size is set to 256, it is possible that we need to read 256 different files and take just 1 sample from each of them to compose the batch. That might be slow and impractical.
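One common workaround for this worst case (my suggestion, not part of the question) is to shuffle the shard order and then shuffle the samples inside each shard, so a batch is always read from a single file. A minimal sketch, assuming each shard is a pickled dict saved with `np.save` (tiny stand-in shards are created here just to exercise the function):

```python
import os
import tempfile
import numpy as np

def shard_local_batches(paths, batch_size, seed=0):
    """Yield (image, label) batches, reading one shard at a time so each
    batch touches a single file instead of up to batch_size files."""
    rng = np.random.default_rng(seed)
    for path in rng.permutation(paths):          # shuffle shard order per epoch
        shard = np.load(path, allow_pickle=True).item()  # {'image': ..., 'label': ...}
        perm = rng.permutation(len(shard["label"]))      # shuffle within the shard
        for start in range(0, len(perm), batch_size):
            idx = perm[start:start + batch_size]
            yield shard["image"][idx], shard["label"][idx]

# Tiny stand-in shards (the real ones would be 10000x3x512x64 / 10000x100).
tmpdir = tempfile.mkdtemp()
paths = []
for i, n in enumerate([10, 7]):                  # the second shard is "short"
    p = os.path.join(tmpdir, "file-%06d.npy" % (i + 1))
    np.save(p, {"image": np.zeros((n, 3, 4, 4), np.float32),
                "label": np.zeros((n, 5), np.float32)})
    paths.append(p)

batches = list(shard_local_batches(paths, batch_size=4))
```

The trade-off is that shuffling is no longer fully global across the dataset, which is usually acceptable for SGD-style training.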
With a queue, data loading can run on background threads, and the loaded batches are saved in the queue (whose capacity might be a large number, depending on memory capacity). The background threads keep reading data to fill the queue whenever it has free space.
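The idea above maps directly onto Python's standard `queue.Queue` and `threading` modules: a bounded queue blocks the producer thread when full, which gives the fill-when-space-available behavior for free. A minimal sketch (the toy `make_batches` iterator stands in for real shard loading):

```python
import queue
import threading
import numpy as np

class Prefetcher:
    """Background thread that keeps a bounded queue filled with batches."""

    def __init__(self, batch_iter_fn, capacity=8):
        # batch_iter_fn: callable returning a fresh iterator of batches (one epoch)
        self.batch_iter_fn = batch_iter_fn
        self.q = queue.Queue(maxsize=capacity)   # blocks the producer when full
        self._sentinel = object()
        self.thread = threading.Thread(target=self._fill, daemon=True)
        self.thread.start()

    def _fill(self):
        for batch in self.batch_iter_fn():
            self.q.put(batch)        # waits automatically while the queue is full
        self.q.put(self._sentinel)   # signal end of epoch

    def __iter__(self):
        while True:
            item = self.q.get()
            if item is self._sentinel:
                return
            yield item

# Toy usage: batches are just index arrays here; in practice the iterator
# would read samples from the .npy shards.
def make_batches():
    data = np.arange(1000)
    for i in range(0, 1000, 256):
        yield data[i:i + 256]

batches = list(Prefetcher(make_batches))
```

Since loading is mostly disk I/O, the GIL is released during reads and plain threads are usually enough; for CPU-heavy decoding, `multiprocessing` would be the analogous tool.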
Is it hard to implement this? I've searched on Google, and it seems there are more advanced solutions, such as using caching techniques or using mmap. I'm not familiar with these. Are there simple examples of this?
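On the mmap point: numpy's `np.load(..., mmap_mode='r')` memory-maps a file so that only the rows you slice are actually read from disk. One caveat under the setup described above: this only works for plain-array .npy files, not for a pickled dict, so each shard's 'image' and 'label' would need to be saved as separate arrays (the filenames and the small array shape below are illustrative stand-ins):

```python
import os
import tempfile
import numpy as np

# Hypothetical layout: file-000001-image.npy holds just the image array.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "file-000001-image.npy")
np.save(path, np.zeros((100, 3, 4, 4), dtype=np.float32))  # small stand-in shard

arr = np.load(path, mmap_mode="r")  # only the header is read at this point
batch = np.array(arr[10:20])        # only these rows are paged in and copied
```

This makes the "read 1 sample from each of 256 files" case much cheaper, since opening a memory-mapped shard does not load the whole 10000-sample array.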