05-19-2009 07:24 AM
Hello,
I am having a little problem and was wondering if anyone had any ideas on how to best solve it.
Here is the problem:
- I have a large file: 6000 rows by 2500 columns.
- First I sort the file by columns 1 and 2.
- Then I find that various rows have duplicate values in these two columns (1 and 2): sometimes only twice, but sometimes three, four, five, or up to nine times.
- This duplication occurs only in the first two columns, but we don't know in which rows or how much duplication there is. The remaining columns, i.e. columns 3 to 2500, contain data for the corresponding rows.
- Programmatically, I would like to find the duplicated rows by searching columns 1 and 2 and, when I find them, average the respective data for these rows in columns 3 to 2500.
- Once this is done, I want to save the averaged data to a file. Each row of this file should have the values of columns 1 and 2 and the averaged row values for columns 3 to 2500. So the file will have n rows by 2500 columns, where n depends on how many duplicated rows there are in the original file.
I hope that this makes sense. I have outlined the problem in a simple example below:
In the example below we have two duplicates in rows 1 and 2 and four duplicates in rows 5 to 8.
Example input file:
Col1 Col2 Col3 ... Col2500
3 4 0.2 ... 0.5
3 4 0.4 ... 0.8
8 5 0.1 ... 0.4
7 9 0.7 ... 0.9
2 8 0.1 ... 0.5
2 8 0.5 ... 0.8
2 8 0.3 ... 0.2
2 8 0.6 ... 0.7
6 9 0.9 ... 0.1
...
So, based on the above example, the first two rows need averaging (two duplicates) as do rows 5 to 8 (four duplicates). The output file should look like this:
Col1 Col2 Col3 ... Col2500
3 4 0.3 ... 0.65
8 5 0.1 ... 0.4
7 9 0.7 ... 0.9
2 8 0.375 ... 0.55
6 9 0.9 ... 0.1
...
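For instance, taking the four "2 8" rows above: column 3 averages to (0.1 + 0.5 + 0.3 + 0.6) / 4 = 0.375 and column 2500 averages to (0.5 + 0.8 + 0.2 + 0.7) / 4 = 0.55, which is where the fourth row of the output comes from.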
05-19-2009 11:21 AM
05-19-2009 09:45 PM
Thank you for that very useful response. As I said in my first post, I have a problem and I do not know how best to approach it.
Obviously I could provide you with the code/diagram that I have, but it would be no good because, apart from the basic structures, I am not sure how to search the data for these duplicates. Remember that there could be 2, 3, 4... up to 9 duplicates, and there is no pattern.
Can anyone help please?
Cheers
05-19-2009 10:15 PM
The difficulty here isn't with finding the duplicates - that part is relatively easy. The difficulty with your problem is memory management. You said you have a large file with 6000 rows and 2500 columns, with what I presume are DBLs. That gives 6000 x 2500 x 8 = 114 MB of memory just to store the array. This is a good chunk of memory, and one has to be really careful not to have multiple copies of this data floating around. This means that array operations must be done in place as much as possible. You said you are sorting the arrays. Are you doing this in LabVIEW, or somewhere else? If it's in LabVIEW, how are you doing it? How sensitive are your needs to memory and speed? And please don't say "the least amount of memory in the fastest time", since those are opposing requirements. Also, what version of LabVIEW are you running, as that affects which approach to take.
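To spell out that arithmetic: 6000 rows x 2500 columns x 8 bytes per DBL = 120,000,000 bytes, roughly 114 MB, and every additional full copy of that array made along the way costs about the same again.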
05-19-2009 10:25 PM
Thank you for your response, Smercurio.
I can sort elsewhere, but I was planning to do it in LabVIEW using a Sort 2D Array VI that I have. If it will cause issues, I can easily do it elsewhere.
In terms of memory and speed: I basically want to get this done ASAP, as a lot depends on it. That is the priority over everything else. I can leave it running for a couple of hours, that's OK.
I am running 8.6 on a Mac, but I also have the PC version.
What you say is easy is, I think, quite difficult for me. I am a relative beginner.
Hope that you can help.
Best wishes.
05-19-2009 10:33 PM
05-20-2009 12:54 AM
Hello,
Sorry for the delay. Here is the sample file; I have reduced it.
Thank you for your help.
The file is a tab delimited text file.
05-20-2009 10:20 AM
Actually, I wanted to get a copy of the full file, since that would be a real test. No matter, as I just created one based on the file that you uploaded. I now have a nice fat 88 MB file on my hard drive.
As I indicated, the memory issue on this problem is going to be very difficult. I've tried a few things and so far I've run out of memory each time. Even just trying to read the file in was taking its toll. This will require some time to work out a good solution in terms of memory management.
05-20-2009 01:59 PM
Well, here's an initial crack at it. The premise behind this solution is to not even bother with the sorting. Also, trying to read the whole file at once just leads to memory problems. The approach taken is to read the file in chunks (as lines) and then for each line create a lookup key to see if that particular line has the same first and second columns as one that we've previously encountered. A shift register is used to keep track of the unique "keys".
This is only an initial attempt and has known issues. Since a Build Array is used to create the resulting output array, the loop will slow down over time, though it may slow down, speed up, and slow down again as LabVIEW performs internal memory management to allocate more memory for the resultant array. On the large 6000 x 2500 array it took several minutes on my computer. I did this in LabVIEW 8.2, and I know that LabVIEW 8.6 has better memory management, so the performance will likely be different.
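Since I can't paste the block diagram as text, here is roughly the same idea written out as a Python sketch. The file names, the tab delimiter, and the assumption that there is no header row are all placeholders, and this version keeps running sums and counts per key rather than building the output array row by row as the VI does:

# Sketch of the chunked, key-based approach (not the actual VI).
# Assumptions: tab-delimited text, no header row (skip one line first
# if there is one), columns 1-2 are the key, columns 3..2500 are DBLs.
sums = {}      # key -> running column sums for columns 3..2500
counts = {}    # key -> number of rows seen with that key
order = []     # keys in the order they were first encountered

with open("data.txt") as src:            # "data.txt" is a placeholder name
    for line in src:                     # one row at a time, never the whole file
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], fields[1])     # lookup key from columns 1 and 2
        values = [float(x) for x in fields[2:]]
        if key not in sums:              # first time this key is seen
            order.append(key)
            sums[key] = values
            counts[key] = 1
        else:                            # duplicate row: accumulate
            sums[key] = [s + v for s, v in zip(sums[key], values)]
            counts[key] += 1

with open("averaged.txt", "w") as dst:   # "averaged.txt" is a placeholder name
    for key in order:
        avg = [s / counts[key] for s in sums[key]]
        dst.write("\t".join(list(key) + [str(a) for a in avg]) + "\n")

The shift register in the VI plays roughly the role of these dictionaries; the Build Array growth of the result is what causes the slowdown mentioned above.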
05-20-2009 07:57 PM
Hi Smercurio,
The code works in 8.6 and is not too slow! That is really great!
One thing is that the averaging of the rows is not working well: when I checked the data, I did not get the correct averages. I will have a look at your code. Any suggestions?
Best wishes,
M