09-16-2019 10:25 AM
Hi all,
I have the following problem:
I have a very big, temporary, binary file (a few gigabytes), filled with (complex) data.
In my application I need to save this data into a new file that contains some more information besides the data.
Since the original file is very big I'm reading it in chunks and writing each chunk to the new file.
Once the new file is complete I'm erasing the temporary file.
It works fine, but...
Let's say the original file size was 20GB.
After saving the data to the new file I get 2 files with an overall size of more than 40GB (just before erasing the temporary file) since the data is duplicated.
Sometimes the free space on my HD may not be enough to contain both the files...
What I would like to do is to read a chunk from the BEGINNING of the temporary data file, save it to the new file and then DELETE it from the temporary data file. Then read the next chunk and so on...
This way the overall data size of the two files together always stays the same (temporary file gets smaller and the new file gets larger...).
Is there some nice way to implement this?
Any idea would be appreciated 🙂
Thank you in advance...
09-16-2019 10:38 AM
I believe that, in principle, you can do this. The trick is how to delete data from the original file without making a copy. I think the following algorithm will work, but you should test it to be sure:
1. Open the original file and create the new file.
2. Read a chunk from the beginning of the original file and write it to the new file.
3. Read everything remaining in the original file into memory.
4. Write that remaining data back to the beginning of the original file, overwriting what was there.
5. Use Set File Size to shrink the original file to the size of the remaining data, then repeat from Step 2 until done.
What I think will happen (but please test to be sure!) is that the original file will stay in the same blocks on disk but become "shorter", as the beginning N% of the file is no longer present.
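Since LabVIEW code is graphical, here is a minimal text sketch of the same idea in Python (the chunk size and file names are placeholders; in LabVIEW you'd use Read/Write from Binary File and Set File Size):

```python
import os

CHUNK = 512 * 1024 * 1024  # 512 MB per pass; tune to your RAM

def move_in_chunks(src_path, dst_path):
    """One pass per loop: copy a chunk to the new file, then shift the
    rest of the original forward and truncate it (Steps 2-5 above).
    Note that Step 3 still needs enough RAM to hold the remainder."""
    with open(src_path, "r+b") as src, open(dst_path, "ab") as dst:
        while True:
            src.seek(0)
            chunk = src.read(CHUNK)       # Step 2: read a chunk...
            if not chunk:
                break
            dst.write(chunk)              # ...and append it to the new file
            remaining = src.read()        # Step 3: read the rest into memory
            src.seek(0)
            src.write(remaining)          # Step 4: shift the rest to the front
            src.truncate(len(remaining))  # Step 5: shrink the original
    os.remove(src_path)                   # original is now empty

# move_in_chunks("temp_data.bin", "final_data.bin")  # placeholder names
```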
Bob Schor
09-16-2019 11:15 AM - edited 09-16-2019 11:53 AM
Can you guarantee that no new data is written to the temporary file while you are doing the copying?
Where are you adding the "more information"? If it is at the beginning of the file, can't you just reserve a sufficient chunk before writing to the temporary file (make sure the header is padded to a fixed size)? Then, when it is time to save, just overwrite the header of the temporary file, close it, and rename the temporary file to the saved file. No need to shovel GBs of data.
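A minimal Python sketch of that idea (the header size, field layout, and names are assumptions for illustration, not your actual format): reserve a fixed-size header up front, stream the data after it, then overwrite the header in place and rename:

```python
import os
import struct

HEADER_SIZE = 4096  # fixed, padded header region (assumed size)

def start_acquisition_file(path):
    """Create the temporary file with a zero-filled placeholder header."""
    f = open(path, "wb")
    f.write(b"\x00" * HEADER_SIZE)
    return f  # keep appending data chunks after the reserved region

def finalize(f, temp_path, final_path, sample_count):
    """Overwrite the reserved header in place, then rename; the gigabytes
    of data are never copied."""
    header = struct.pack("<Q", sample_count)  # hypothetical header field
    assert len(header) <= HEADER_SIZE
    f.seek(0)
    f.write(header)
    f.close()
    os.replace(temp_path, final_path)  # rename (= move on the same volume)
```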
09-16-2019 11:40 AM
Hi Bob,
Thanks for replying.
I think I get your idea.
However, it looks like it has two obstacles:
1. Step #3, "Read everything remaining in the original file into memory":
Assuming the file is very big, it may be a problem to read the remaining gigabytes (GB!) into memory.
2. What I really want is to keep the same free space on the HD:
Even if the temporary file indeed gets shorter, if the original "deleted" data stays in the same blocks on the HD, then the two files would still occupy double the space...
Mentos
09-16-2019 11:57 AM
Hi Altenbach,
In reply to your questions:
My temporary file is built during the acquisition.
Only when the acquisition is done does the copying process start.
The additional information is indeed at the beginning of the new file.
Apart from adding the additional information to the new file, I sometimes do some post-processing on some of the data before saving it to the new file. So historically, the new file is saved after the acquisition is done.
But you've got a good point... It may be worthwhile to change my code according to your suggestion.
Then I'll have to rename the temporary file and also MOVE it to its final destination.
I'll check and let you know...
Thanks,
Mentos
09-16-2019 12:05 PM
@Mentos wrote:
1. Step #3, "Read everything remaining in the original file into memory":
Assuming the file is very big, it may be a problem to read the remaining gigabytes into memory.

You are absolutely correct. You'd need to (a) have enough memory (e.g. 32 GB) and (b) be running LabVIEW (64-bit) to be able to access it all.

@Mentos wrote:
2. What I really want is to keep the same free space on the HD. So, if the temporary file indeed gets shorter, but the original "deleted" data stays in the same blocks on the HD, then the two files would still occupy double the space.

It shouldn't. Assume you have a 20 GB file in a 30 GB "hole" (i.e. your disk started out with 30 GB of free space, all compacted into a single set of blocks, and you wrote a 20 GB file in there). You'd now have a 20 GB file and a 10 GB hole.

You open the 20 GB file, read 5 GB, and write it as a 5 GB file. You now have a 20 GB file, a 5 GB file, and a 5 GB hole. You copy the last 15 GB into memory and write them to the beginning of the 20 GB file (overwriting what was there on the disk); the space in use stays the same. You now reset the size of the 20 GB file to 15 GB (the size you just wrote) and close the file. You now have a 15 GB file, a 5 GB hole, a 5 GB file, and another 5 GB hole. Basically, you've got two files totaling 20 GB and two holes totaling 10 GB, same as when you started.

But can you do this again? Yes, because Windows files do not need to fit into a single "hole"; they can just fill the available space. True, splitting a file can take a little more disk space than a single file (there are always a few "straggly bits" that don't fit neatly into a disk block, so you'll have some unused space at the end of each file), but this is trivial.
But the Proof of the Pudding (or the Algorithm) is in the Testing -- try it and see if it works for you. I'll bet you a beer (or glass of wine) that it does.
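If you want a quick test, a little Python snippet (file name and sizes are arbitrary; other disk activity can skew the numbers slightly) shows that truncating a file really does return its blocks to the free pool:

```python
import os
import shutil

path = "truncate_test.bin"  # arbitrary test file name
with open(path, "wb") as f:
    f.write(os.urandom(100 * 1024 * 1024))  # 100 MB of dummy data

before = shutil.disk_usage(".").free
with open(path, "r+b") as f:
    f.truncate(50 * 1024 * 1024)  # cut the file in half

after = shutil.disk_usage(".").free
print(f"Freed about {(after - before) / 1024**2:.0f} MB")  # expect ~50 MB
os.remove(path)
```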
Bob Schor
09-16-2019 12:11 PM
Rename is the same as move.
09-16-2019 03:14 PM
Can you consider using a file format that accommodates your needs better?
NI has been advocating TDMS as an efficient target for high-speed streaming. It's pretty easy to stream while testing and still be able to add new "header-like" metadata and post-processed results data after the fact. You could do this by simply adding to the original file rather than making a copy.
Note: TDMS isn't the most space-efficient format. It's really good at streaming speed, not as good at packing density. I haven't personally explored HDF5, but it probably supports options like compression, if storage space is at a bigger premium than streaming speed.
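For a rough idea of the pattern (a hedged sketch using the third-party npTDMS package in Python; file name, group/channel names, and properties are made up, and in LabVIEW you'd use the native TDMS VIs instead):

```python
import numpy as np
from nptdms import TdmsWriter, RootObject, ChannelObject

# Stream acquisition data in segments as it arrives.
with TdmsWriter("acq.tdms") as writer:
    for _ in range(10):  # placeholder for the acquisition loop
        block = np.random.rand(100_000)  # dummy data block
        writer.write_segment([ChannelObject("acq", "signal", block)])

# Later: append "header-like" metadata without rewriting the bulk data.
with TdmsWriter("acq.tdms", mode="a") as writer:
    writer.write_segment([RootObject(properties={
        "operator": "Mentos",          # hypothetical metadata fields
        "status": "post-processed",
    })])
```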
-Kevin P
09-17-2019 06:37 AM
@Kevin_Price wrote:
Note: TDMS isn't the most space-efficient format. It's really good at streaming speed, not as good at packing density.
The density can be improved by using TDMS Defragment as part of the post-processing. Depending on file size, this could take some time; great improvements were made in LabVIEW 2015 to speed this up, along with the addition of TDMS Defrag Status.
09-17-2019 09:46 AM - edited 09-17-2019 10:03 AM
Edit:
Would it be efficient to "reverse" the file first into a temporary file, then reverse it again into the final file?
First reverse:
Read a chunk from the end of the original file and append it to the temporary file (the chunks end up in reverse order).
Shrink the original file (we can do that from the end).
Repeat until the original file is empty.
Second reverse:
Read the temporary file from the end in the same chunk sizes, write the chunks to the final file, and shrink the temporary file as you go.
Alternatively, to avoid extra space reallocation: the reversed data can be written straight into the final file, then the first and last chunks swapped (then the second and the second-to-last, etc.) to restore the order.
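A rough Python sketch of the double reverse (chunk size and file names are placeholders). One subtlety: if the file size isn't a multiple of the chunk size, the second pass must reuse the first pass's chunk boundaries, or the partial chunk misaligns everything; the sketch handles this by returning the sizes it wrote:

```python
import os

CHUNK = 256 * 1024 * 1024  # 256 MB per chunk; tune to taste

def reverse_move(src_path, dst_path, sizes=None):
    """Append chunks taken from the END of src to dst, truncating src as we
    go, so src shrinks while dst grows. Chunks land in dst in reverse order;
    returns the chunk sizes written so a second pass can mirror them."""
    written = []
    it = iter(sizes) if sizes is not None else None
    with open(src_path, "r+b") as src, open(dst_path, "ab") as dst:
        size = src.seek(0, os.SEEK_END)  # current size of src
        while size > 0:
            take = next(it) if it else min(CHUNK, size)
            src.seek(size - take)
            dst.write(src.read(take))    # copy the last chunk of src
            src.truncate(size - take)    # shrink src from the end
            size -= take
            written.append(take)
    os.remove(src_path)
    return written

# First reverse, then second reverse with mirrored chunk boundaries:
# sizes = reverse_move("original.bin", "reversed.tmp")
# reverse_move("reversed.tmp", "final.bin", sizes[::-1])
```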
Regarding file structure:
Do the headers need to be at the beginning? Just reserve space for a data-size field at the beginning and update it once the data has been written. Then you know where your header starts (right after the data), as sketched below.
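A tiny Python sketch of that header-after-the-data layout (the 8-byte size field and the field format are assumptions): reserve the size field up front, stream the data, append the header, then go back and fill in the size:

```python
import struct

def write_file(path, data_blocks, header_bytes):
    """Layout: [8-byte data size][data ...][header]. The size field tells
    a reader where the data ends and the trailing header begins."""
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", 0))  # placeholder for the data size
        data_size = 0
        for block in data_blocks:       # stream the acquisition data
            f.write(block)
            data_size += len(block)
        f.write(header_bytes)           # header goes after the data
        f.seek(0)
        f.write(struct.pack("<Q", data_size))  # fill in the real size
```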