LabVIEW


fastest way to do a bitwise comparison of 2 large files

Hello All,
 
I have two relatively large files (memory dumps) of approx. 1 MB. Both files have the same size.
I need to do a bitwise comparison of both the files. (one is the reference file, the other is the measured file)
 
What is the fastest way of doing this? I need to know how many bitwise differences there are.
I need to present the user with a percentage of how good the match between both files is and also the number of differences.
 
Maybe later on I also need to list the differences.
 
I was thinking about opening the files in binary mode, reading them out, converting them to boolean arrays, and then using the autoindexing of for loops to determine where there are differences. There are probably much better ways of doing this.
 
Your help is greatly appreciated!
 
noxus.
 
 
0 Kudos
Message 1 of 29
(5,648 Views)
Read the files as binary strings and compare with "=". Your large files are so tiny that they fit easily into the memory of any computer. I initially thought that your files were several hundred gigabytes. We just bought a new 3 TB RAID array for our workstation to hold our somewhat larger files.
--
Tomi Maila
0 Kudos
Message 2 of 29
(5,631 Views)
Read each file as a U8 array and do a boolean XOR between them.
 
Create a lookup table for all 256 possible U8 values (index=number, value=# of high bits in that number). Now loop through the XORed array using each value as an index into the lookup table and accumulate the total number of differences in a shift register.
 
Clear enough? If not, I can make a quick example.  
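In text form, the idea is roughly this (a Python sketch for readability, since the actual solution is a LabVIEW diagram; the function name is just a placeholder):

def count_bit_differences(ref_bytes, meas_bytes):
    # Lookup table: index = byte value, value = number of high (1) bits.
    lut = [bin(n).count("1") for n in range(256)]
    total = 0                                  # plays the role of the shift register
    for a, b in zip(ref_bytes, meas_bytes):    # autoindexing for loop
        total += lut[a ^ b]                    # XOR, then look up the bit count
    return total

The percentage match asked for in the original post is then 100 * (1 - total / (8 * file size in bytes)).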
0 Kudos
Message 3 of 29
(5,623 Views)
I missed that it was bitwise and not bytewise. XOR is the solution then, so do as altenbach suggested. If equality is expected, then it may be faster to do the bytewise comparison with '=' first and then check the bytes that didn't match using XOR at the end.

Tomi
--
Tomi Maila
0 Kudos
Message 4 of 29
(5,618 Views)
OK, here's a quick draft (LV 8.0) of what I meant. I assume you know how to read a file as a U8 binary array, so let's start with two U8 arrays of equal length.
 
The lookup table needs to be calculated only once, then you can replace it with a diagram constant. Still, it is very cheap to do. 🙂
 
I doubt you can count the number of bit differences much faster than this. 😄
 
 
Display all arrays in binary so you can see what's going on. ("# of differences" should probably be labeled "# of bitwise differences".)
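For reference, something equivalent in text form (a hedged numpy sketch; the file names are placeholders, and the two files are assumed to be the same length, as in the original question):

import numpy as np

# Built once, then reused -- the text-language analogue of the diagram constant.
lut = np.array([bin(n).count("1") for n in range(256)], dtype=np.uint8)

ref  = np.fromfile("reference.bin", dtype=np.uint8)   # read file as U8 array
meas = np.fromfile("measured.bin", dtype=np.uint8)

diff_bits = int(lut[np.bitwise_xor(ref, meas)].sum())
percent_match = 100.0 * (1.0 - diff_bits / (8.0 * ref.size))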
 

Message Edited by altenbach on 02-14-2007 02:50 PM

0 Kudos
Message 5 of 29
(5,615 Views)

@altenbach wrote:
 
I doubt you can count the number of bit differences much faster than this. 😄


I take the challenge 😄 I assume there is only a small number of differing bits, otherwise the whole comparison doesn't make any sense. The code below was somewhat faster than altenbach's code for 1000 differing bytes in 10,000,000 bytes: I timed 120-140 ms for my loop against 160-180 ms for altenbach's loop.
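Roughly, in text form (a Python sketch of the idea, not the actual diagram; lut is the same 256-entry table as in the earlier posts):

def count_bit_differences_sparse(ref, meas, lut):
    equal = [a == b for a, b in zip(ref, meas)]   # bytewise "=" comparison
    total = 0
    start = 0
    while True:
        try:
            i = equal.index(False, start)         # "search 1D array" for the next mismatch
        except ValueError:
            break                                 # no more differing bytes
        total += lut[ref[i] ^ meas[i]]            # count bits only where bytes differ
        start = i + 1
    return total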



Tomi

Message Edited by Tomi M on 02-15-2007 01:52 AM

--
Tomi Maila
0 Kudos
Message 6 of 29
(5,601 Views)

That's NOT "much faster". 😄

Yes, you can throw much more code at it and (in the limiting case of few errors) you might win. Still, even "search array" needs to look at each element one way or another.

Then there's the cost of allocating a full-sized boolean array and, if the errors are more frequent, huge output arrays that you sum later. Try it with an array that barely fits in memory and it won't be pretty. 😮

 

Here's a very simple potential improvement that skips the lookup, addition, and coercion if no error exists. See if it does any better. 🙂

(The other case is "0" and just carries the shift register value across.)
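In text form the case structure corresponds to something like this (Python sketch, same assumptions as the earlier sketches):

def count_bit_differences_case(ref, meas, lut):
    total = 0
    for a, b in zip(ref, meas):
        x = a ^ b
        if x != 0:             # "non-zero" case: look up and accumulate
            total += lut[x]
        # "0" case: nothing to do, the running total just passes through
    return total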

Message Edited by altenbach on 02-14-2007 04:39 PM

0 Kudos
Message 7 of 29
(5,596 Views)


@altenbach wrote:
Here's a very simple potential improvement that skips the lookup, addition, and coercion if no error exists. See if it does any better. :)

Some casual testing with your random test data skeleton:

No case structure: 170ms (Seems our PCs are about the same speed).
With case structure: 90ms

Almost double the speed. 🙂

Interesting. I'm sure it could still be optimized for multiprocessor, dual core, etc. It should scale well, since everything could be done in parallel.

0 Kudos
Message 8 of 29
(5,586 Views)
Altenbach, you won! 🙂 Your code could still be made faster by using 32-bit blocks on 32-bit environments instead of 8-bit blocks. In addition, running as many parallel loops as there are processors would speed things up.
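Something along these lines, sketched in Python with numpy (hedged: the chunk count of 4 is an arbitrary stand-in for the number of processors, and the byte length is assumed to be a multiple of 4 so the reinterpretation as 32-bit words is valid):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

lut = np.array([bin(n).count("1") for n in range(256)], dtype=np.uint8)

def popcount(chunk):
    # View the XORed 32-bit words as bytes again and reuse the 256-entry table.
    return int(lut[chunk.view(np.uint8)].sum())

def count_bit_differences_parallel(ref, meas, n_chunks=4):
    # Process in 32-bit blocks, split into chunks that could run in parallel.
    x = np.bitwise_xor(ref.view(np.uint32), meas.view(np.uint32))
    chunks = np.array_split(x, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        return sum(pool.map(popcount, chunks))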

Tomi
--
Tomi Maila
0 Kudos
Message 9 of 29
(5,570 Views)


@Tomi M wrote:
Altenbach, you won! 🙂

Well, you should have known better than trying to have an optimization contest with Altenbach. It's normally a recipe for losing (and for learning 🙂), and being the sole user of other obscure LV features (I hope Stephen doesn't read this) does not mean you can handle these minute details as well. 😄


___________________
Try to take over the world!
0 Kudos
Message 10 of 29
(5,564 Views)