python - batch data prefetching using a queue on sharded data files -


I generated 500 sharded numpy data files, each containing 10000 data samples (e.g., image, label). Example:

    file-000001.npy     file-000002.npy     file-000003.npy     ...     file-000500.npy 

Each .npy file contains a numpy dictionary with keys and shapes {'image': 10000x3x512x64 (dtype=np.float32), 'label': 10000x100 (dtype=np.float32)}. Please note that some of these numpy files contain fewer than 10000 samples (e.g., 8111).
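For reference, a shard saved this way can be read back as follows. This is a minimal sketch with shapes scaled down from 10000x3x512x64 for illustration; `allow_pickle=True` is needed because `np.save` stores a Python dict as a 0-d object array.

```python
import os
import tempfile

import numpy as np

# Write a small example shard (shapes scaled down for illustration).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'file-000001.npy')
shard = {
    'image': np.zeros((100, 3, 8, 4), dtype=np.float32),
    'label': np.zeros((100, 10), dtype=np.float32),
}
np.save(path, shard)

# Reading it back: np.save wraps the dict in a 0-d object array,
# so allow_pickle=True and .item() are required to recover the dict.
data = np.load(path, allow_pickle=True).item()
num_samples = data['image'].shape[0]  # per-file sample count (100 here)
```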

During training, each epoch needs to iterate over 500x10000 samples. These data cannot all be loaded into memory due to capacity limits. A common solution is data prefetching with a queue.

My thought is as follows: (1) first record the filenames and the count of data samples in each file; (2) for each batch, compute the batch indices, then load the corresponding data files into memory and read the required samples to compose the batch.
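Steps (1) and (2) can be sketched with cumulative offsets and `np.searchsorted` (the filenames and per-file counts below are made up; in practice you would read each shard once to record its count):

```python
import numpy as np

# Step (1): record filenames and per-file sample counts
# (hypothetical values; obtain real counts by scanning the shards once).
filenames = ['file-000001.npy', 'file-000002.npy', 'file-000003.npy']
counts = np.array([10000, 8111, 10000])

# Cumulative start offset of each file in the global sample ordering.
offsets = np.concatenate([[0], np.cumsum(counts)])  # [0, 10000, 18111, 28111]
total = int(offsets[-1])

def locate(global_idx):
    """Map a global sample index to (file index, index within that file)."""
    file_idx = int(np.searchsorted(offsets, global_idx, side='right')) - 1
    return file_idx, int(global_idx - offsets[file_idx])

# Step (2): for one batch, group the needed indices by file so that
# each file is opened at most once per batch.
batch_indices = [5, 10003, 18110, 18111]
by_file = {}
for g in batch_indices:
    f, local = locate(g)
    by_file.setdefault(f, []).append(local)
```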

During step (2), if the batch size is set to 256, it is possible that we would need to read 256 different files and take just 1 sample from each of them to compose the batch. This might be slow and impractical.

With a queue-based approach, data loading could run on background threads, and the loaded batch data would be stored in a queue (whose capacity could be a large number, depending on memory capacity). The background threads would continuously read data to fill the queue whenever it has free space.
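The scheme above maps directly onto Python's standard `queue.Queue` plus a background thread. A minimal sketch, with `load_batch` as a stand-in for the actual disk reads and the shapes scaled down:

```python
import queue
import threading

import numpy as np

BATCH_SIZE = 4
QUEUE_CAPACITY = 8  # number of prefetched batches; tune to available memory

batch_queue = queue.Queue(maxsize=QUEUE_CAPACITY)
STOP = object()  # sentinel signalling the end of the epoch

def load_batch(batch_idx):
    """Stand-in for reading one batch's samples from the shard files."""
    images = np.zeros((BATCH_SIZE, 3, 8, 4), dtype=np.float32)
    labels = np.zeros((BATCH_SIZE, 10), dtype=np.float32)
    return images, labels

def producer(num_batches):
    # Runs in the background; put() blocks while the queue is full,
    # so memory stays bounded at QUEUE_CAPACITY prefetched batches.
    for b in range(num_batches):
        batch_queue.put(load_batch(b))
    batch_queue.put(STOP)

t = threading.Thread(target=producer, args=(10,), daemon=True)
t.start()

# Training loop: get() blocks only when the queue is empty,
# i.e. when disk I/O cannot keep up with training.
n_consumed = 0
while True:
    item = batch_queue.get()
    if item is STOP:
        break
    images, labels = item
    n_consumed += 1
t.join()
```

Because NumPy I/O releases the GIL for most of the read, a plain thread is usually enough here; multiple producer threads can share the same queue if one cannot keep up.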

Is it hard to implement this? I've searched on Google, and it seems there are more advanced solutions, such as using caching techniques or mmap, but I'm not familiar with them. Are there any simple examples of this?

