
Delete Array Duplicate for only one column


SimpleJack wrote:

1. No, the duplicates are not always adjacent and probably will not be. Duplicates can occur more than once, depending on how many times the operator retests. You see, I am going after first-pass yield, so I only care about the first entry.


If the duplicates are not necessarily adjacent, you need to use my version 3 or later. Version 3 also sorts the output by serial number. If you want to sort by the first occurrence of each serial number instead, you can make the following trivial modification.

 

 

 

In the case of many columns (55), this will be very efficient, because the bulk of the operations is done on the single serial-number column and the full dataset is touched only once at the end.
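For readers following along without LabVIEW, here is a rough Python sketch of that idea (the actual solution is a LabVIEW diagram; the function name and details below are mine, not taken from the VI): sort (serial number, row index) pairs, keep the first index of each run of equal keys, then index into the full dataset exactly once.

```python
def first_pass_rows(data, key_col=0):
    """Keep the first row for each unique value in the key column,
    in order of first occurrence (first-pass yield)."""
    # Sort (key, original index) pairs: O(N log N), touching only
    # the single key column, not all 55 columns.
    keyed = sorted((row[key_col], i) for i, row in enumerate(data))

    # One pass over the sorted keys: the first entry of each run of
    # equal keys carries the smallest original index, i.e. the
    # operator's first test of that serial number.
    keep, prev = [], None
    for key, i in keyed:
        if not keep or key != prev:
            keep.append(i)
            prev = key

    # Restore first-occurrence order (the "trivial modification"
    # above), then touch the full dataset once at the very end.
    keep.sort()
    return [data[i] for i in keep]
```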

 

The sort step in my code is O(N log N) (see also) and most operations are done "in place", while some of the alternative suggestions above are O(N²) and carry huge memory-reallocation penalties due to constant array resizing. My version will probably be orders of magnitude faster for large arrays.

 

Can you work with the attached 2010 snippet, or do you want real VIs, possibly saved for an earlier version?

Message 11 of 29

Here is how I roll:

 

[Image: RemoveDuplicateSN.png]
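In text form, the trick in that image boils down to a hash-table lookup keyed by serial number; a dict plays the role of LabVIEW's variant attributes in this hypothetical Python rendering (the names are mine, not from the diagram):

```python
def dedup_by_column(rows, key_col=0):
    """Keep the first row seen for each key value, via hash lookup
    (the dict stands in for LabVIEW variant attributes)."""
    seen = {}
    for row in rows:
        # setdefault stores the row only when the key is new, so the
        # FIRST occurrence wins without iterating in reverse.
        seen.setdefault(row[key_col], row)
    return list(seen.values())  # dicts preserve insertion order
```

A plain attribute write overwrites earlier entries and keeps the last occurrence instead, which is presumably where the following wish comes from.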

 

Another time I wish for a reverse iteration terminal:

http://forums.ni.com/t5/LabVIEW-Idea-Exchange/Reverse-Iteration-Terminal-in-For-Loop/idi-p/1174449

 

 

Message 12 of 29

I knew that if I waited long enough, somebody would bring variant attributes to the table. 😄

 

(imagine how much more intuitive it would look after this idea is implemented :D)

Message 13 of 29

Well, it still cheeses me off that I have to use a shift register there, because (at least up to LV10) the Feedback Node chokes on variants (similar code is much, much slower 😠). Assuming that is fixed in LV11 or soon thereafter, we need this as well:

 

http://forums.ni.com/t5/LabVIEW-Idea-Exchange/An-output-terminal-for-feedback-nodes-that-mirrors-the...

 

I learned to search much earlier in the idea process after I had drawn the following and was about to post it:

 

[Image: FN output.png]

Message 14 of 29

Of course my version is almost an order of magnitude faster. 😄

Message 15 of 29

@altenbach wrote:

Of course my version is almost an order of magnitude faster. 😄


It's of course also 10 times faster than my code 😞 But in your code, the speed drops by half if you sort the values as strings (i.e., if you don't first convert them to numbers).

Message 16 of 29

I only see about a factor of 3 difference at most (it depends a bit on the number of duplicates and elements), much of which is due to the fortuitous numeric conversion (myle00 stole my thunder there).

 

Personally I like the attributes in the read-a-text-file-and-remove-duplicates game for their mixture of speed (typically quite good) and flexibility (easy to key on multiple columns with multiple types).
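For the multi-column case, the same hypothetical dict sketch extends naturally: a tuple of column values acts like a combined attribute key and can mix types freely (the column indices below are made up for illustration).

```python
def dedup_by_columns(rows, key_cols=(0, 3)):
    """Keep the first row per combination of values in key_cols."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)  # mixed types are fine
        seen.setdefault(key, row)
    return list(seen.values())
```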

 

Message 17 of 29

Sorry, I was in a seminar... I only tested on an old Athlon XP, and I had eyeballed it as 800 ms / 100 ms, but it seems closer to about 6.5x. 😄

 

Yes, it depends on the number of elements and the number of duplicates. The above is for an array of 100,000 elements with ~10 duplicates of each number.

 

Here is my benchmarking VI. I am sure that other processors will give different results.
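The real benchmark is the attached VI; purely as a sketch, a minimal text-language harness in the same spirit might look like this (the data shape follows the description above, 100,000 rows with ~10 duplicates per value; everything else is an assumption):

```python
import random
import time

N, DUPES = 100_000, 10
data = [[str(random.randrange(N // DUPES)), "payload"] for _ in range(N)]

t0 = time.perf_counter()
result = first_pass_rows(data)  # the sort-based sketch from earlier
t1 = time.perf_counter()
print(f"{len(result)} unique serial numbers in {(t1 - t0) * 1e3:.1f} ms")
```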

 

(...and yes, if I skipped the numeric conversion, you would win by about a factor of three ;))

Message 18 of 29

@myle00 wrote:
It's of course also 10 times faster than my code 😞 But in your code, the speed drops by half if you sort the values as strings (i.e., if you don't first convert them to numbers).

In my benchmark, yours is about 300x slower than mine (size = 100,000, ~10 duplicates each), while for 10x smaller inputs it is only 30x slower. As mentioned above, yours is O(N²), so things really deteriorate as the sizes get bigger.

Mine seems not much worse than O(N) and thus scales about linearly with input size. (I guess the sorting is a minor part of the total ;))

 

A 10x larger array costs you 100x more, while my code slows down only about 10x for the same increase. For a million elements, mine takes 1.2 seconds (measured!), while yours would probably take around 5 minutes on the same computer (estimated, not tested).
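Written out, that scaling claim is just the asymptotics (the constants c and c' here are hypothetical):

```latex
% O(N^2): a 10x larger input costs 100x
T_{quad}(10N) = c\,(10N)^2 = 100\,c\,N^2 = 100\,T_{quad}(N)
% O(N log N): the same increase costs ~12x at N = 10^5
% (the log base cancels in the ratio)
\frac{T_{sort}(10N)}{T_{sort}(N)} = 10 \cdot \frac{\log(10N)}{\log N}
                                  = 10 \cdot \frac{6}{5} = 12
```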

Message 19 of 29

You're right. Mine is O(N²) if there are no duplicates, while the more duplicates are present, the closer it gets to O(N). The sort function seems to be O(N), so that shouldn't make yours worse than O(N) (see attached).
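That behavior matches the usual search-and-insert pattern; here is a guess at it in the same Python shorthand (not the actual VI): every incoming key is searched linearly in the output built so far, so the cost is O(N·U) for U unique keys, i.e. O(N²) with no duplicates and close to O(N) when nearly everything is a duplicate.

```python
def dedup_search_insert(rows, key_col=0):
    """Search-and-insert dedup: O(N * U) for U unique keys."""
    keys, out = [], []
    for row in rows:
        if row[key_col] not in keys:   # linear search every iteration
            keys.append(row[key_col])  # constant resizing hurts, too
            out.append(row)
    return out
```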

Message 20 of 29