LabVIEW


How to extract a column out of a large ASCII file?

Hi all.
 
After searching the board and trying several solution approaches, my problem still remains. Maybe you can help me.
 
The data sources I have to deal with are large ASCII files (~540 MB) with 14 columns (delimiter: TAB). Each column represents one channel, and the number of characters in each field is variable. I have to read user-defined columns (= channels) out of each data set. Needless to say, reading the whole file at once runs into memory problems.
 
If anyone has an idea, I'd be happy to hear it. :)
Thanks in advance.
 
Greets
Kane
Message 1 of 11
What you'll need to do is read the data out of the file in chunks that you can handle efficiently. The overall process will look something like this:

1) Read some number of bytes.
2) Process all complete lines in the chunk to extract the column you need (the Spreadsheet String To Array function will help here).
3) Repeat as needed till you get to the end of the file.

Note that unless you know ahead of time the number of rows in the datafile, this is going to be an inherently (very!) inefficient process because you are going to be allocating a lot of memory on the fly.
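
Since LabVIEW code is graphical, here is a rough, untested Python sketch of that loop just to make the flow concrete; the file name, chunk size, and column index are placeholders, not values from this thread:

```python
CHUNK_SIZE = 65536   # bytes per read (placeholder)
COLUMN = 4           # zero-based index of the channel to extract (placeholder)

column_data = []
with open("data.txt", "r") as f:
    leftover = ""                      # partial line carried between chunks
    while True:
        chunk = f.read(CHUNK_SIZE)     # 1) read some number of bytes
        if not chunk:
            break
        lines = (leftover + chunk).split("\n")
        leftover = lines.pop()         # last element may be an incomplete line
        for line in lines:             # 2) process all complete lines
            fields = line.split("\t")
            if len(fields) > COLUMN:
                column_data.append(fields[COLUMN])
# 3) the loop has repeated to the end of the file; a final line without
# a trailing newline is still sitting in 'leftover'
if leftover.strip():
    fields = leftover.split("\t")
    if len(fields) > COLUMN:
        column_data.append(fields[COLUMN])
```

Note that column_data grows on every append, which is exactly the on-the-fly allocation cost mentioned above; the replies below show how to preallocate instead.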

Mike...

PS: Longer term, is there any likelihood of the data getting saved in some other form?

Message 2 of 11

Good information from Mike. I just wanted to add some ideas to make it more efficient, so that you don't have to rebuild the array every time you read a chunk of data.

1. You can estimate the number of lines in the file by reading the file size and one row of data: file size / length of one line ≈ number of lines. Add 100 or so to this estimate to leave room for everything, or just grow the array when you hit the end. Use this to initialize an array of the size you need for the column, and use Replace Array Subset to write data into it.

2. Make sure you keep the file open and pass the refnum through a shift register in the loop you use to read the file. This will be faster than something like Read From Spreadsheet File.vi, which opens and closes the file every time it reads.
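
In text form (Python again, purely for illustration; the file name, column index, and headroom values are assumptions), the idea looks roughly like this:

```python
import os

FILENAME = "data.txt"   # placeholder
COLUMN = 4              # zero-based channel index (placeholder)

# 1) Estimate line count: file size / length of one line ~ number of lines.
file_size = os.path.getsize(FILENAME)
with open(FILENAME, "r") as f:   # 2) open once; keep the "refnum" alive
    first_line = f.readline()
    est = file_size // max(len(first_line), 1) + 100   # plus some headroom

    column = [0.0] * est         # preallocate, like Initialize Array
    f.seek(0)
    n = 0
    for line in f:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) > COLUMN:
            if n == len(column):                 # estimate ran out: grow in blocks
                column.extend([0.0] * 100)
            column[n] = float(fields[COLUMN])    # like Replace Array Subset
            n += 1
    column = column[:n]          # trim the unused tail
```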

Message 3 of 11

Thank you for your help. Unfortunately, I already failed at steps 1 and 2 of Mike's explanation. Perhaps you could give me a hint with a small example?

Thanks a lot in advance.

Greets

Kane

 

PS: Unfortunately I don't have any possibility to change the data format. I get the data and have to eat it as it comes ;)


Message 4 of 11
PPS: Just for info, I'm not too lazy to do it myself; I'm really stuck at this point :(
Message 5 of 11

No problem, here is an example in 8.2. Like I said in the comment, I haven't tested it, but it should get you started. If you have trouble getting it to run, post back and I'll take a look at it. Edit: Woohoo! I'm a veteran!


Message 6 of 11

I was playing around with this and found a problem. You need to wire an end-of-line constant to the 'delimiter' input of Array To Spreadsheet String. Since you're reading a 1D array of lines from the file, this builds the spreadsheet string correctly, with an end of line after each array element; the elements themselves are already tab-delimited.

I also got rid of the error cluster shift register and just wired the error to the stop terminal of the loop so that the error wouldn't propagate to the Close File node.

Message 7 of 11
Looks good to me, thanks for that.
Unfortunately, with test data I get results different from what I expect, so I've attached a small piece of the large file for testing purposes.
Greets
Kane

PS: I use LabVIEW 8.0.
Message 8 of 11

OK, I had to change a few things to get it to work with that data. The problem is that your rows aren't consistently the same length, so the estimate of the array length wasn't working too well. If you format all your data to the same width, it will work a lot better. I added 1000 instead of 50 to the estimate and that worked, but I can't promise it will work for any amount of data you throw at it.

Here is the new VI, saved in 8.0.

Message 9 of 11
I hate to defocus you, but there is a more efficient way to do this.  My apologies that I do not have the time to write code, but here is the pseudocode.
  1. Create an array for your output greater than or equal to what you think you will need.
  2. Read a 65,000 character chunk from the file (or the rest of the file, whichever is smaller).
  3. Use the string search functions to find successive line ends and the appropriate tab character delimiters for your column.
  4. Convert and replace the element in your output array.
  5. When done, trim your output array to the right size.
If you drop an LVM read, convert it to a regular VI, and dive in, you will see an example of this type of process.  The idea is to keep disk reads, which are very inefficient, to a minimum.  It also minimizes your memory allocations, because you do not need to resize your input buffer for every line.  Problems you will need to deal with (which are handled by the LVM read) are such things as:
  1. Your line crosses a chunk boundary.
  2. The end-of-file creates a smaller chunk than 65,000 characters (the optimum chunk size for Win32 systems).
  3. The end-of-line character is not well defined (in your case, this is probably not an issue).
  4. Searching for a character can produce memory allocations.
You may want to try reading the data as a U8 array instead of a string and doing your searches on that.
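
A hedged Python sketch of that pseudocode, with bytes standing in for a U8 array (the file name, chunk size, and column index are assumptions, and this is untested):

```python
CHUNK = 65000    # chunk size suggested above
COLUMN = 4       # zero-based column to extract (assumption)

values = [0.0] * 1_000_000          # 1) oversized output array
count = 0
with open("data.txt", "rb") as f:   # bytes play the role of a U8 array
    tail = b""
    while True:
        chunk = f.read(CHUNK)       # 2) read a chunk, or the rest of the file
        if not chunk:
            break
        data = tail + chunk
        cut = data.rfind(b"\n")     # position of the last complete line end
        if cut == -1:               # problem 1: line crosses the chunk boundary
            tail = data
            continue
        tail = data[cut + 1:]
        pos = 0
        while pos <= cut:           # 3) search successive line ends
            eol = data.find(b"\n", pos)
            fields = data[pos:eol].split(b"\t")       # tab delimiters
            if len(fields) > COLUMN:
                if count == len(values):              # output estimate ran out
                    values.extend([0.0] * 100_000)
                values[count] = float(fields[COLUMN]) # 4) convert and replace
                count += 1
            pos = eol + 1
if tail.strip():                    # final line with no trailing newline
    fields = tail.split(b"\t")
    if len(fields) > COLUMN:
        values[count:count + 1] = [float(fields[COLUMN])]
        count += 1
values = values[:count]             # 5) trim the output to the right size
```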

I have always wanted to write this piece of code, but never had the time or reason to do so.  Good luck.  I will try to help if I can.
Message 10 of 11