09-16-2019 10:25 AM
Hi all,
I have the following problem:
I have a very big, temporary, binary file (a few gigabytes), filled with (complex) data.
In my application I need to save this data into a new file that contains some more information besides the data.
Since the original file is very big I'm reading it in chunks and writing each chunk to the new file.
Once the new file is complete I'm erasing the temporary file.
It works fine, but...
Let's say the original file size was 20GB.
After saving the data to the new file I get 2 files with an overall size of more than 40GB (just before erasing the temporary file) since the data is duplicated.
Sometimes the free space on my HD may not be enough to contain both the files...
What I would like to do is to read a chunk from the BEGINNING of the temporary data file, save it to the new file and then DELETE it from the temporary data file. Then read the next chunk and so on...
This way the overall data size of the two files together always stays the same (temporary file gets smaller and the new file gets larger...).
Is there some nice way to implement this?
Any idea would be appreciated 🙂
Thank you in advance...
09-16-2019 10:38 AM
I believe that, in principle, you can do this. The trick is how to delete data from the original file without making a copy. I think the following algorithm will work, but you should test it to be sure:
1. Open the original file and create the new file.
2. Read a chunk from the beginning of the original file and write it to the new file.
3. Read everything remaining in the original file into memory.
4. Write that remaining data back to the beginning of the original file, overwriting what was there.
5. Use Set File Size to shrink the original file to the size of the remaining data, then repeat from Step 2 until done.
What I think will happen (but please test to be sure!) is that the original file will stay in the same blocks on disk but become "shorter", as the beginning N% of the file is no longer present.
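Since LabVIEW code is graphical, here is a minimal text sketch of the same idea in Python (the chunk size and file names are placeholders; in LabVIEW you'd use Read/Write from Binary File and Set File Size):

```python
import os

CHUNK = 512 * 1024 * 1024  # 512 MB per pass; tune to your RAM

def move_in_chunks(src_path, dst_path):
    """One pass per loop: copy a chunk to the new file, then shift the
    rest of the original forward and truncate it (Steps 2-5 above).
    Note that Step 3 still needs enough RAM to hold the remainder."""
    with open(src_path, "r+b") as src, open(dst_path, "ab") as dst:
        while True:
            src.seek(0)
            chunk = src.read(CHUNK)       # Step 2: read a chunk...
            if not chunk:
                break
            dst.write(chunk)              # ...and append it to the new file
            remaining = src.read()        # Step 3: read the rest into memory
            src.seek(0)
            src.write(remaining)          # Step 4: shift the rest to the front
            src.truncate(len(remaining))  # Step 5: shrink the original
    os.remove(src_path)                   # original is now empty

# move_in_chunks("temp_data.bin", "final_data.bin")  # placeholder names
```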
Bob Schor
09-16-2019 11:15 AM - edited 09-16-2019 11:53 AM
Can you guarantee that no new data is written to the temporary file while you are doing the copying?
Where are you adding the "more information"? If it is at the beginning of the file, can't you just reserve a sufficient chunk before writing to the temporary file (make sure the header is padded to a fixed size)? Then, when it is time to save, just overwrite the header of the temporary file, close it, and rename the temporary file to the saved file. No need to shovel GBs of data.
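A minimal Python sketch of that idea (the header size, field layout, and names are assumptions for illustration, not your actual format): reserve a fixed-size header up front, stream the data after it, then overwrite the header in place and rename:

```python
import os
import struct

HEADER_SIZE = 4096  # fixed, padded header region (assumed size)

def start_acquisition_file(path):
    """Create the temporary file with a zero-filled placeholder header."""
    f = open(path, "wb")
    f.write(b"\x00" * HEADER_SIZE)
    return f  # keep appending data chunks after the reserved region

def finalize(f, temp_path, final_path, sample_count):
    """Overwrite the reserved header in place, then rename; the gigabytes
    of data are never copied."""
    header = struct.pack("<Q", sample_count)  # hypothetical header field
    assert len(header) <= HEADER_SIZE
    f.seek(0)
    f.write(header)
    f.close()
    os.replace(temp_path, final_path)  # rename (= move on the same volume)
```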
09-16-2019 11:40 AM
Hi Bob,
Thanks for replying.
I think I get your idea.
However, it looks like it has two obstacles:
1. Step #3, "Read everything remaining in the original file into memory":
Assuming the file is very big, it may be a problem to read the remaining gigabytes (GB!) into memory.
2. What I really want is to keep the same free space on the HD:
Even if the temporary file indeed gets shorter, if the original "deleted" data stays in the same blocks on the HD, then the two files would still occupy double the space...
Mentos
09-16-2019 11:57 AM
Hi Altenbach,
In reply to your questions:
My temporary file is built during the acquisition.
Only when the acquisition is done does the copying process start.
The additional information is indeed at the beginning of the new file.
Apart from adding the additional information to the new file, I sometimes do some post-processing on some of the data before saving it to the new file. So historically, the new file is saved after the acquisition is done.
But you've got a good point... It may be worthwhile to change my code according to your suggestion.
Then I'll have to rename the temporary file and also MOVE it to its final destination.
I'll check and let you know...
Thanks,
Mentos
09-16-2019 12:05 PM
@Mentos wrote:
1. Step #3, "Read everything remaining in the original file into memory":
Assuming the file is very big, it may be a problem to read the remaining gigabytes into memory.

You are absolutely correct. You'd need to (a) have enough memory (e.g. 32 GB) and (b) be running LabVIEW (64-bit) to be able to access it all.

@Mentos wrote:
2. What I really want is to keep the same free space on the HD. So, if the temporary file indeed gets shorter, but the original "deleted" data stays in the same blocks on the HD, then the two files would still occupy double the space.

It shouldn't. Assume you have a 20 GB file in a 30 GB "hole" (i.e. your disk started out with 30 GB of free space, all compacted into a single set of blocks, and you wrote a 20 GB file in there). You'd now have a 20 GB file and a 10 GB hole.

You open the 20 GB file, read 5 GB, and write it as a 5 GB file. You now have a 20 GB file, a 5 GB file, and a 5 GB hole. You copy the last 15 GB into memory and write them to the beginning of the 20 GB file (overwriting what was there on the disk); the space in use stays the same. You now reset the size of the 20 GB file to 15 GB (the size you just wrote) and close the file. You now have a 15 GB file, a 5 GB hole, a 5 GB file, and another 5 GB hole. Basically, you've got two files totaling 20 GB and two holes totaling 10 GB, same as when you started.

But can you do this again? Yes, because Windows files do not need to fit into a single "hole"; they can just fill the available space. True, splitting a file can take a little more disk space than a single file (there are always a few "straggly bits" that don't fit neatly into a disk block, so you'll have some unused space at the end of each file), but this is trivial.
But the Proof of the Pudding (or the Algorithm) is in the Testing -- try it and see if it works for you. I'll bet you a beer (or glass of wine) that it does.
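If you want a quick test, a little Python snippet (file name and sizes are arbitrary; other disk activity can skew the numbers slightly) shows that truncating a file really does return its blocks to the free pool:

```python
import os
import shutil

path = "truncate_test.bin"  # arbitrary test file name
with open(path, "wb") as f:
    f.write(os.urandom(100 * 1024 * 1024))  # 100 MB of dummy data

before = shutil.disk_usage(".").free
with open(path, "r+b") as f:
    f.truncate(50 * 1024 * 1024)  # cut the file in half

after = shutil.disk_usage(".").free
print(f"Freed about {(after - before) / 1024**2:.0f} MB")  # expect ~50 MB
os.remove(path)
```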
Bob Schor
09-16-2019 12:11 PM
Rename is the same as move.
09-16-2019 03:14 PM
Can you consider using a file format that accommodates your needs better?
NI has been advocating TDMS as an efficient target for high-speed streaming. It's pretty easy to stream while testing and still be able to add new "header-like" metadata and post-processed results data after the fact. You could do this by simply adding to the original file rather than making a copy.
Note: TDMS isn't the most space-efficient format. It's really good at streaming speed, not as good at packing density. I haven't personally explored HDF5, but it probably supports options like compression, if storage space is at a bigger premium than streaming speed.
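For a rough idea of the pattern (a hedged sketch using the third-party npTDMS package in Python; file name, group/channel names, and properties are made up, and in LabVIEW you'd use the native TDMS VIs instead):

```python
import numpy as np
from nptdms import TdmsWriter, RootObject, ChannelObject

# Stream acquisition data in segments as it arrives.
with TdmsWriter("acq.tdms") as writer:
    for _ in range(10):  # placeholder for the acquisition loop
        block = np.random.rand(100_000)  # dummy data block
        writer.write_segment([ChannelObject("acq", "signal", block)])

# Later: append "header-like" metadata without rewriting the bulk data.
with TdmsWriter("acq.tdms", mode="a") as writer:
    writer.write_segment([RootObject(properties={
        "operator": "Mentos",          # hypothetical metadata fields
        "status": "post-processed",
    })])
```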
-Kevin P
09-17-2019 06:37 AM
@Kevin_Price wrote:
Note: TDMS isn't the most space-efficient format. It's really good at streaming speed, not as good at packing density.
The density can be improved by using TDMS Defragment as part of the post-processing. Depending on file size, this could take some time; great improvements were made in LabVIEW 2015 to speed this up, along with the addition of TDMS Defrag Status.
09-17-2019 09:46 AM - edited 09-17-2019 10:03 AM
Edit:
Would it be efficient to "reverse" the file first into a temporary file, then reverse it again into the final file?
First reverse:
Read a chunk from the end of the original file and append it to the temporary file (the chunks end up in reverse order).
Shrink the original file (we can do that from the end).
Repeat until the original file is empty.
Second reverse:
Read the temporary file from the end in the same chunk sizes, write the chunks to the final file, and shrink the temporary file as you go.
Alternatively, to avoid extra space reallocation: the reversed data can be written straight into the final file, then the first and last chunks swapped (then the second and the second-to-last, etc.) to restore the order.
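A rough Python sketch of the double reverse (chunk size and file names are placeholders). One subtlety: if the file size isn't a multiple of the chunk size, the second pass must reuse the first pass's chunk boundaries, or the partial chunk misaligns everything; the sketch handles this by returning the sizes it wrote:

```python
import os

CHUNK = 256 * 1024 * 1024  # 256 MB per chunk; tune to taste

def reverse_move(src_path, dst_path, sizes=None):
    """Append chunks taken from the END of src to dst, truncating src as we
    go, so src shrinks while dst grows. Chunks land in dst in reverse order;
    returns the chunk sizes written so a second pass can mirror them."""
    written = []
    it = iter(sizes) if sizes is not None else None
    with open(src_path, "r+b") as src, open(dst_path, "ab") as dst:
        size = src.seek(0, os.SEEK_END)  # current size of src
        while size > 0:
            take = next(it) if it else min(CHUNK, size)
            src.seek(size - take)
            dst.write(src.read(take))    # copy the last chunk of src
            src.truncate(size - take)    # shrink src from the end
            size -= take
            written.append(take)
    os.remove(src_path)
    return written

# First reverse, then second reverse with mirrored chunk boundaries:
# sizes = reverse_move("original.bin", "reversed.tmp")
# reverse_move("reversed.tmp", "final.bin", sizes[::-1])
```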
Regarding file structure:
Do the headers need to be at the beginning? Just reserve space for a data-size field at the beginning and update it once the data has been written. Then you know where your header starts (right after the data), as sketched below.
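A tiny Python sketch of that header-after-the-data layout (the 8-byte size field and the field format are assumptions): reserve the size field up front, stream the data, append the header, then go back and fill in the size:

```python
import struct

def write_file(path, data_blocks, header_bytes):
    """Layout: [8-byte data size][data ...][header]. The size field tells
    a reader where the data ends and the trailing header begins."""
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", 0))  # placeholder for the data size
        data_size = 0
        for block in data_blocks:       # stream the acquisition data
            f.write(block)
            data_size += len(block)
        f.write(header_bytes)           # header goes after the data
        f.seek(0)
        f.write(struct.pack("<Q", data_size))  # fill in the real size
```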