fastest way to do a bitwise comparison of 2 large files

No tricks here! I ran exactly your unmodified code with the posted diff1 and diffAlt (except for the correction in the timing measurement :D) under LabVIEW 8.20.
 
I consistently get the 5x results as shown above.
 
(This is on an Intel Core 2 Duo T7600, 2.33 GHz, 1.5 GB of RAM. What is your CPU?)
 
It is quite possible that LabVIEW 8.20 behaves differently from your LabVIEW 7.1.
 
Here's another test version down-converted to LabVIEW 7.1 (make sure to "save" before running to eliminate conversion issues). I avoid subVI calls and only test one code version in any given run. The times are quite reproducible between runs. Notice that your code would speed up by about 20% if you placed the XOR before the FOR loop, as in my version. For the given default parameters, my code is about 6x faster (53 ms vs. 320 ms). What do you get?
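
For anyone reading along without the attachments, here is a minimal C analog of what is being timed (the byte arrays in main are just placeholder data; the real code works on the file contents): XOR the two buffers and sum the set bits of each result byte via a precomputed 256-entry lookup table.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* 256-entry table mapping a byte value to its number of set bits. */
static uint8_t bit_count_8[256];

static void build_table(void)
{
    for (int v = 0; v < 256; v++) {
        int bits = 0;
        for (int b = 0; b < 8; b++)
            bits += (v >> b) & 1;
        bit_count_8[v] = (uint8_t)bits;
    }
}

/* Count how many bits differ between two equally sized byte buffers. */
static size_t count_bit_diffs(const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t diffs = 0;
    for (size_t i = 0; i < n; i++)
        diffs += bit_count_8[a[i] ^ b[i]];   /* XOR marks the differing bits */
    return diffs;
}

int main(void)
{
    build_table();
    uint8_t x[] = { 0x00, 0xFF, 0xAA };
    uint8_t y[] = { 0x01, 0xFF, 0x55 };
    printf("%zu differing bits\n", count_bit_diffs(x, y, sizeof x)); /* prints 9 */
    return 0;
}
```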
 
Message 21 of 29
Hi Altenbach, thanks for humoring me with this...
As delivered (lookup/multicase), I'm seeing roughly 94 ms/258 ms, running a Socket-7 AMD supposedly equivalent to a 2.2 GHz Pentium. If two completely independent random arrays are compared, the difference is more like 210/320, but this changes to 506/433 with "Allow Debugging" = true. 😉 So, lookup is clearly faster, though multicase is simpler (to debug)?
 
Cheers!
"Inside every large program is a small program struggling to get out." (attributed to Tony Hoare)
Message 22 of 29
Many years ago (I think it was during the first LabVIEW challenge), I noticed that a case structure with two cases is always faster than a case with more than two cases.
 
Yes, the debugging option seems to have a proportionally higher impact on my code. I have no idea why; maybe the extra debugging code causes cache thrashing. I think it is fair to compare code with debugging disabled. We want to time the raw code, not the "trimmings". 🙂
 
Anyway, we don't have to read the files as strings or bytes; we can read them directly in a multibyte representation such as U16 or U32 to find the bit differences. We can also use a 16-bit lookup table. These changes allow us to make the algorithm even faster. Here are quick drafts (a rough C sketch of the 32-bit variant follows the numbers below).
 
Results for the default values:
8bit data, 8bit lookup: 53ms (see above)
16bit data, 16bit lookup: 33ms!
32bit data, 16bit lookup: 22ms!!
 
LabVIEW 8.0+ also has U64. So:
64bit data, 16bit lookup: 20ms!!!
(probably not worth it)
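
Since the drafts themselves are LabVIEW VIs, here is a rough C sketch of the 32-bit-data / 16-bit-lookup variant, purely for illustration (the word arrays in main are placeholder data): each XORed 32-bit word is split into two 16-bit halves that index a 65,536-entry bit-count table.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* 65,536-entry table: popcount of every 16-bit value. */
static uint8_t bit_count_16[65536];

static void build_table_16(void)
{
    for (uint32_t v = 0; v < 65536; v++) {
        uint32_t bits = 0;
        for (int b = 0; b < 16; b++)
            bits += (v >> b) & 1;
        bit_count_16[v] = (uint8_t)bits;
    }
}

/* Compare the data 32 bits at a time; each XORed word is split into
   two 16-bit halves and looked up in the table. */
static size_t count_bit_diffs_32(const uint32_t *a, const uint32_t *b, size_t n_words)
{
    size_t diffs = 0;
    for (size_t i = 0; i < n_words; i++) {
        uint32_t x = a[i] ^ b[i];
        diffs += bit_count_16[x & 0xFFFF] + bit_count_16[x >> 16];
    }
    return diffs;
}

int main(void)
{
    build_table_16();
    uint32_t x[] = { 0x00000000u, 0xFFFFFFFFu };
    uint32_t y[] = { 0x00000001u, 0x0000FFFFu };
    printf("%zu differing bits\n", count_bit_diffs_32(x, y, 2)); /* prints 17 */
    return 0;
}
```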
 
As you can see, there is still quite a bit of slack left. I am sure it can be further improved. 😄
 
(I challenge you to make a multicase structure for the 16bit situation! 🐵)
Message 23 of 29
It might be late to jump on this thread, but I figured I'd mention that the OpenG "File" library has a function called Compare File Binary for comparing two files on disk. I have no idea about its performance when comparing large files. Perhaps someone in this thread is interested in helping improve its performance 😉

Thanks,

-Jim

Message 24 of 29


@altenbach wrote:
Many years ago (I think it was during the first LabVIEW challenge), I noticed that a case structure with two cases is always faster than a case with more than two cases.

I expect a search is required to associate a specific selector value with its case. I'd hoped that when the number of possible selector values was "small", they would just create a table of case #s, one for every possible selector value; then the case selection would be as fast as a table lookup. 😉
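
Purely as an illustration of that hope (and not a claim about how LabVIEW actually compiles a case structure), here is a tiny C sketch that makes the "table of case #s" explicit as an array of handlers, so selecting a case is one indexed load no matter how many cases exist:

```c
#include <stdio.h>

/* Three hypothetical "cases", one handler per selector value. */
static void case0(void) { puts("case 0"); }
static void case1(void) { puts("case 1"); }
static void case2(void) { puts("case 2"); }

/* Explicit jump table: case selection is a single indexed load plus an
   indirect call, independent of how many cases there are. */
static void (*const handlers[])(void) = { case0, case1, case2 };

int main(void)
{
    unsigned selector = 2;                               /* example selector */
    if (selector < sizeof handlers / sizeof handlers[0])
        handlers[selector]();                            /* table lookup, then call */
    return 0;
}
```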
 

32bit data, 16bit lookup: 22ms!!
At more than double the original speed, this could be a worthwhile "upgrade" (wonder if David is still following?).
 

LabVIEW 8.0+ also has U64. So:
64bit data, 16bit lookup: 20ms!!!
(probably not worth it)
 
(I challenge you to make a multicase structure for the 16bit situation! 🐵

Hmm, there are only 12,870 values to type (worst case) 😄

"Inside every large program is a small program struggling to get out." (attributed to Tony Hoare)
Message 25 of 29


@Jim Kring wrote:
I have no idea about the performance when comparing large files. 

Unfortunately, I don't have the OpenG stuff installed at the moment, but your file comparison is a slightly different problem, because we only need to know whether two files of equal size are different or not: a single Boolean. (It does not report, e.g., the number of differences found.)

You would probably simply use a WHILE loop and exit at the first difference found. It would only become expensive with huge files that differ only near the very end, for example.
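
Something along these lines, sketched here in C with placeholder file names and an arbitrary chunk size (not the OpenG implementation):

```c
#include <stdio.h>
#include <string.h>

#define CHUNK (64 * 1024)   /* arbitrary chunk size for this sketch */

/* Returns 1 if the files differ, 0 if identical, -1 if either file
   could not be opened. Stops reading as soon as the first difference
   (or size mismatch) is seen. */
static int files_differ(const char *path_a, const char *path_b)
{
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = -1;

    if (fa && fb) {
        static unsigned char buf_a[CHUNK], buf_b[CHUNK];
        result = 0;
        for (;;) {
            size_t na = fread(buf_a, 1, CHUNK, fa);
            size_t nb = fread(buf_b, 1, CHUNK, fb);
            if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
                result = 1;          /* early exit at the first difference */
                break;
            }
            if (na == 0)             /* both files ended, no difference found */
                break;
        }
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return result;
}

int main(void)
{
    /* "a.bin" and "b.bin" are placeholder file names. */
    printf("differ: %d\n", files_differ("a.bin", "b.bin"));
    return 0;
}
```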

Message 26 of 29
I'm a super beginner at LabVIEW... and I basically need to do the same thing shown here. I need to take two huge text files and compare them to see if they are the same, and if they're not, I need to know where and what wasn't the same. Altenbach's example works pretty well, except my file is huge, so if the difference is way down in the file you can't see where the error is. How would I find the location of the error more easily?
Message 27 of 29


@dusty_g wrote:
I'm a super beginner at LabVIEW... and I basically need to do the same thing shown here. I need to take two huge text files and compare them to see if they are the same, and if they're not, I need to know where and what wasn't the same. Altenbach's example works pretty well, except my file is huge, so if the difference is way down in the file you can't see where the error is. How would I find the location of the error more easily?

If you don't really need to do it with LabVIEW, Unix (Linux, ...) has some very powerful commands to stream through huge text files and create various outputs of the resulting differences. However, since I can't remember the commands (with their endless possible options), writing a simple LabVIEW program might be faster than digging through the Unix help files...


Greetings from Germany
Henrik

LV since v3.1

“ground” is a convenient fantasy

'˙˙˙˙uıɐƃɐ lɐıp puɐ °06 ǝuoɥd ɹnoʎ uɹnʇ ǝsɐǝld 'ʎɹɐuıƃɐɯı sı pǝlɐıp ǝʌɐɥ noʎ ɹǝqɯnu ǝɥʇ'


Message 28 of 29

I must have missed this thread before, but it just perc'ed its way up near the top again.

All the optimization stuff was fun to read through, but I thought that the original poster had additional concerns in the initial post (though he seemed satisfied with what was addressed when he chimed in).  Also, dusty_g shares one of them.

I don't have LV nearby (or much spare time) to participate now, but anyone want to take a shot at the following mods:

1. Keep track of the indices of all U8 bytes that are different. It could be simply a matter of pre-allocating an array and then trimming it at the end. But I'm a little curious about the very general case where you might want both speed and a small memory footprint, i.e., you're not allowed to pre-allocate a huge array as big as the arrays being compared. Considering that the differences might be either sparse or dense, how might you approach this? (One rough sketch follows after the list below.)

2. Instead of 1 count for # different bytes, how about 8 counts so that # differences can be tracked for each bit individually?  What other/better methods are there besides a 256x8 lookup table that requires you to copy out a 1x8 array for summing?
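
Not a LabVIEW answer, but here is a rough C sketch of one way to attack both mods (all names and the growth policy are illustrative, and error handling is omitted): collect the differing indices in a buffer that doubles when full, so nothing as big as the input has to be pre-allocated, and keep eight per-bit counters by shifting each XOR byte instead of copying rows out of a 256x8 table.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Collected results: which byte indices differed, and how many
   differences were seen in each of the eight bit positions. */
typedef struct {
    size_t *idx;            /* indices of differing bytes */
    size_t  count;
    size_t  capacity;
    size_t  bit_count[8];   /* per-bit difference tallies */
} diff_report;

static void record_diff(diff_report *r, size_t i, uint8_t x)
{
    if (r->count == r->capacity) {                 /* grow by doubling */
        r->capacity = r->capacity ? r->capacity * 2 : 1024;
        r->idx = realloc(r->idx, r->capacity * sizeof *r->idx);
    }
    r->idx[r->count++] = i;
    for (int b = 0; b < 8; b++)                    /* per-bit tallies */
        r->bit_count[b] += (x >> b) & 1;
}

static void diff_bytes(const uint8_t *a, const uint8_t *b, size_t n, diff_report *r)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t x = a[i] ^ b[i];
        if (x)
            record_diff(r, i, x);
    }
}

int main(void)
{
    uint8_t a[] = { 1, 2, 3, 4 };
    uint8_t b[] = { 1, 0, 3, 5 };
    diff_report r = { 0 };
    diff_bytes(a, b, sizeof a, &r);
    printf("%zu differing bytes, first at index %zu\n", r.count, r.idx[0]);
    printf("bit 0 differed %zu times\n", r.bit_count[0]);
    free(r.idx);
    return 0;
}
```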

-Kevin P.

ALERT! LabVIEW's subscription-only policy came to an end (finally!). Unfortunately, pricing favors the captured and committed over new adopters -- so tread carefully.
Message 29 of 29