
Scan large ASCII file for data.

Hello guys. I have some ASCII files which contain 500,000 rows each. Each file is a circular buffer of measured data written by another program. That means the data in the file is sorted, but when the file is full, the most recent data (the last line written by the program) wraps back to the first row and loops around the file. My goal is to first read the file looking for the max date (I have a double column called "Time_ms" from which I can tell which row is the latest). I thought I could do a simple binary-search-like approach to avoid reading the whole file (since my data is sorted, or nearly sorted), but since my rows have different lengths (in bytes), I cannot figure out where the lines start. My idea was to read, say, every 10,000th line, compare the dates on those lines, and close in on the portion of the file containing the max row. But I cannot find a fast way to get the desired byte offset for the Read From Spreadsheet File function to jump to the target lines, since to get the correct offsets I would have to call Read From Spreadsheet File again, and therefore would not save any time.

 

How can I quickly skip through carriage returns in the file without having to actually read the data out in LabVIEW?

 

Just to sum up: I have a file of 500,000 rows containing a max date somewhere. Data is sorted ascending from the start of the file up to the max, and ascending again from the max position to the end of the file, but those are older records than the previous ones. Therefore, there is a position in the file where the data jumps from the MAX date down to the actual MIN of the file. I need to quickly find the location of this line (in characters) to be able to read the file around that position and look for new data (the file is constantly having lines appended, let's say 5 or 10 lines per second). Once I find new data, I can move this marker line through the file once in a while and keep monitoring the new data being added to the file.
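A minimal text sketch of that jump-to-a-line trick (Python standing in for LabVIEW's Set File Position and Read from Text File primitives, since LabVIEW is graphical; the file name and offset are placeholders): seek to an arbitrary byte offset, throw away the partial line you land in, and read the next complete one.

```python
def line_at_offset(f, offset):
    """First complete line starting at or after the given byte offset."""
    f.seek(offset)
    if offset > 0:
        f.readline()        # discard the (probably partial) line we landed in
    return f.readline()     # this one begins at a real line boundary

with open("buffer.csv", "rb") as f:        # placeholder file name
    print(line_at_offset(f, 1_000_000))    # a line near the 1 MB mark
```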

 

Since I have about 22 different files, I don't want to spend 4-5 seconds per file just to identify where I have to start tracking for new data. I also have trouble with the file being opened and written to while I'm doing the search, since it seems that if the file is open, the EOF terminal of Read From Spreadsheet File won't work properly.

 

If you have some other ideas on how to track new data in a big file that is being written to, please suggest them.

 

Search file.png

This image shows a really slow way of finding the offset of the target line by reading the whole file line by line (which is what I don't want to end up having to do).

Message 1 of 11

Hi Antonio,

 

when you overwrite old data in your file (as it is a circular buffer), all lines should be of the same length!

If this is the case, you only need to multiply the line index by the line length…

 

(Overwriting lines in a buffer with changing line size doesn't make much sense to me.)
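A sketch of that arithmetic (Python as pseudocode; LINE_LEN is an assumed constant record size in bytes):

```python
LINE_LEN = 80                        # assumed fixed bytes per line

def read_record(f, index):
    f.seek(index * LINE_LEN)         # jump straight to record `index`
    return f.read(LINE_LEN)
```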

 

Other option: read the file as pure text and parse it on your own (avoiding the conversion to a big 2D array). 500k lines at ~80 bytes/line is just 40,000 kByte = 40 MByte; any modern PC should handle this - as long as you avoid additional buffer allocations!
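A rough sketch of that option (Python as pseudocode; the path, column index, and delimiter are placeholders): one big read, then walk the text line by line without ever building a 2D array of strings.

```python
def max_of_column(path, column=3, delimiter=","):
    best = float("-inf")
    with open(path) as f:
        text = f.read()                      # a single ~40 MB read
    for line in text.splitlines():
        fields = line.split(delimiter)
        if len(fields) > column:
            try:
                best = max(best, float(fields[column]))
            except ValueError:
                pass                         # skip malformed rows
    return best
```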

Best regards,
GerdW


using LV2016/2019/2021 on Win10/11+cRIO, TestStand2016/2019
Message 2 of 11

The lines are not of constant length because two columns are strings that might change, such as the name of the tag the data belongs to, and for some reason this program stores the timestamp as a string as well (and some dates having only one digit compared to ones with two, say 9:00:00 versus 10:00:00, is a problem too).

Message 3 of 11

But I guess your second suggestion is probably going to work perfectly.

 

Something like this could allow me to search the file for specific lines much faster.

 

Searchfile2.png

 

1. Read the file as one big string.

2. Skip 10 lines by searching for the end-of-line character (see the sketch below).

3. Get the next line.
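A text sketch of those three steps (Python as pseudocode; names are made up):

```python
def line_after_skipping(text, start, n_lines):
    """Skip n_lines forward from `start` by hunting for line feeds, then
    return the next line and its character offset (None, -1 at end of text)."""
    pos = start
    for _ in range(n_lines):
        pos = text.find("\n", pos) + 1   # jump just past the next EOL
        if pos == 0:                     # find() returned -1: ran off the end
            return None, -1
    end = text.find("\n", pos)
    return text[pos:end if end != -1 else len(text)], pos

# e.g. the line 10 lines ahead of offset 0, plus its offset in the string:
# line, offset = line_after_skipping(big_string, 0, 10)
```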

 

I'll try it like this and post the results when I'm done.

Message 4 of 11

I have made an algorithm to search for the maximum value in a file.

 

The idea is to quickly find the max value of a certain column of the file by reading the file as a string and performing a binary search. The file must therefore be sorted (or semi-sorted: as I explained before, I have a circular buffer overwriting old data with new data, so the max position could be in any row, but the file is sorted before and after that row). I have added terminals to specify the delimiter used in the rows and which column the file is sorted by. I have not added a terminal for the end-of-line delimiter; it is set as a line feed string constant, since that works for me.
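For reference, here is a rough text sketch of that algorithm (Python as pseudocode, since LabVIEW diagrams don't paste well; all names are made up, and it assumes LF line endings, a numeric sort column, and a non-empty file of well-formed rows):

```python
def find_max_offset(path, column, delimiter=","):
    """Binary search a circularly-sorted file for the row with the largest
    value in `column`; returns (value, byte offset of that row's start)."""
    def sample(f, offset):
        f.seek(offset)
        if offset:
            f.readline()                   # discard the partial line
        start = f.tell()
        line = f.readline()
        if not line:
            return None, start             # ran off the end of the file
        return float(line.decode().split(delimiter)[column]), start

    with open(path, "rb") as f:            # bytes, so offsets are exact
        size = f.seek(0, 2)
        first_val, _ = sample(f, 0)
        lo, hi = 0, size
        while hi - lo > 4096:              # narrow down to a small window
            mid = (lo + hi) // 2
            val, _ = sample(f, mid)
            if val is not None and val >= first_val:
                lo = mid                   # still before the wrap point
            else:
                hi = mid                   # already past the wrap point
        # finish with a short linear scan of the remaining window
        best_val, best_off = first_val, 0
        f.seek(lo)
        if lo:
            f.readline()
        while f.tell() <= hi:
            off = f.tell()
            line = f.readline()
            if not line:
                break
            val = float(line.decode().split(delimiter)[column])
            if val > best_val:
                best_val, best_off = val, off
        return best_val, best_off
```

Each probe seeks to the middle of the remaining byte range, aligns to the next line boundary, and compares that row's sort value against the first row's; duplicates of the first value at the fold would need extra care.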

 

I would very much appreciate suggestions and improvements on this topic!

 

Thanks for your suggestion:

"Other option: read the file as pure text and parse on your own (avoiding conversion to big 2D array). 500k lines by ~80bytes/line is just 40000kByte=40MByte, any modern PC should handle this - as long as you avoid additional buffer allocations!"

 

I think this will fit perfectly with what I need to do (which is monitoring these circular-buffer CSV files that another program is writing, for the newly added data).

 

PS: This is not 100% debugged xD, it might fail in some cases I haven't tested enough.

Message 5 of 11

Edit 1: Fixed a missing shift register on the max value in the while loop.

Message 6 of 11

Here are some comments, things to think about:

  • Kudos to you for creating Icons for your VIs!  Makes it so much easier to "self-document" your LabVIEW code.
  • I looked at your example .csv file.  It looks to me like it has the following format:  String (comma-delimited), TimeStamp in Date/Time string format, I32 (unknown purpose), I32 (might be index).
  • You want to process this file, which is "semi-sorted" (that is, it is "circularly sorted", and you don't know where the first (lowest) line is located).

Several things are unclear to me, so I'll make some assumptions (which might be wrong, but it's the "principle of the thing").  First, I assume you want to do something with these data, and may want to do further processing of the files (rewritten so as to be "truly sorted", with the first record in the first position).  Second, I'm going to assume that the record is sorted by the Time Stamp column -- I'm going to ignore the fourth column that "looks like" it could be an index.

 

What I would be tempted to do if faced with this task would be to start by defining a "Record Cluster" having four elements -- Description (String), Date/Time (TimeStamp), Value1 (I32), and Value2 (I32).  I would then open the file as a Text file, and create a While loop that does the following:  Read one line (a right-click option on the Read from Text File function), use Match Pattern twice with commas as the separator to isolate the Description, Date/Time String, and the string with the two Values, use the appropriate Time functions (and possibly Scan from String) to turn the Date/Time string into a (numeric) TimeStamp (which can be sorted!), and use Scan from String (with format %d,%d) to get the two Values.  Package them into your Cluster, and pass out an Indexed tunnel.  Note that when you read the End of File, Read from Text File will give you an Error, which you can use to stop your While loop (clear the Error after using it to stop the loop).  Note that this will result in a "bad point" being put at the end of your new Array of Clusters -- you can avoid this by using a Conditional Indexed Tunnel and only adding the point if there is no Error (i.e. the Read was good).
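In text form, that loop amounts to something like this (Python as pseudocode; the Record type stands in for the Cluster TypeDef, and the file name and date format string are assumptions):

```python
from typing import NamedTuple
from datetime import datetime

class Record(NamedTuple):                 # stands in for the LabVIEW cluster
    description: str
    timestamp: datetime                   # numeric/sortable, unlike the string
    value1: int
    value2: int

def parse_line(line, fmt="%d/%m/%Y %H:%M:%S"):   # date format is a guess
    desc, ts, v1, v2 = line.rstrip("\n").split(",")
    return Record(desc, datetime.strptime(ts, fmt), int(v1), int(v2))

records = []
with open("buffer.csv") as f:             # placeholder file name
    for line in f:
        try:
            records.append(parse_line(line))
        except ValueError:
            pass                          # skip the bad/partial line at EOF
records.sort(key=lambda r: r.timestamp)   # Sort Array-style sorting is now trivial
```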

 

Why go to all this bother?  First, I suspect (but could be wrong!) that the Array of Clusters will be much smaller than the Array of Strings.  Second, if you define your Cluster with the Index Element (which I'm assuming is the second, Date/Time, element) as the first Cluster element, when you do Sort Array, it will sort by Date/Time.  Third, when you do whatever you need to do with this data, you won't have to re-parse the String information.  Note that when you go to write out this data, you will be writing out a Binary File, which will be smaller, have faster I/O, but will require that you use/know/specify the Cluster TypeDef in order to read it back.

 

This might not work for you at all -- if you do decide to try it, check out if my assumptions (as to speed, size, efficiency, etc.) are correct.  I also think doing the string parsing once and in the simplest manner possible will make the code more robust and easier to use, but that's also up to you to decide.

 

Bob Schor

Message 7 of 11

Hello Bob Schor. Thank you for your comments and kudos (=. I'll try out your suggestions as soon as I can.

 

To clarify my situation: I have data from multiple tags (such as temperature sensors, etc.) being stored by a third program (WinCC) in these circular CSV files. WinCC reads the data from a Siemens PLC. Since I'm not communicating LabVIEW directly with the PLC or with WinCC (and won't do so for now), I'm trying to monitor the latest values of the tags in real time and graph them in LabVIEW (since the WinCC graphical interface is bad). My goal is to track these files and always have the latest data in LabVIEW. Since the files are large, I cannot read the whole file with Read From Spreadsheet File every time, so I'm trying to "point" to the line with the max value of the index column and just "track" its position in the file by reading a few lines around it. I do not want to sort the CSV file myself or write into it, just read the new values.

 

The first column is the name of the tag, the second is the timestamp (not the actual column I'm sorting by, though it could be), the third is a validation column (1 indicates a valid value, 0 an invalid one), and the fourth column is the time elapsed in ms (which is the column I'm ordering by, with "order by index" set to 3, since it is more precise; for some reason WinCC does not put the ms in the timestamp as well...). The test file I posted was simulated by myself in LabVIEW (only 2,000 rows, because the 500,000-row file was too big to upload), just writing random data. And I'm actually missing the column with the actual values of the tag (I forgot about it, but there should be another column with the values of the data being measured) xD.

 

So yeah, my plan is to quickly locate the character offset from which I need to start reading the file, then read some lines around it so as not to miss any new data, and then move this offset along as the circular buffer keeps looping and overwriting old data.

 

Thanks for the help!

Message 8 of 11

From your suggestion I understood something like the attached VI. This is slower, I guess because I'm converting the whole array and then sorting. It takes about 7-8 seconds to complete for the 500k lines, while the binary search takes 100 ms. I don't know if I'm doing something wrong. Thank you for your time in advance.

Message 9 of 11

Thank you for your clear explanation -- I now better understand what you are trying to do.

 

So here is a small modification to my suggestions.  First, you don't need to sort anything, so you can leave the Cluster definition in its "natural" order of "Description, Date/Time, Valid, and Time Offset".  The last element, "Time Offset", will always be increasing until it reaches the "wrap-around" point, where it will decrease.  So use Binary Search.  I haven't given a lot of thought to finding the Wrap Point on a "folded sorted array", but it seems to me that dividing the array into, say, thirds and comparing three values might enable you to determine in which segment the "wrap" is located.  [Hmm -- this is a fascinating problem -- I'll have to think more about this ...]
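For what it's worth, finding the wrap point of a "folded" sorted array turns out to be a textbook binary search, so no thirds are needed. A sketch (Python as pseudocode, assuming no duplicate values at the fold):

```python
def wrap_index(a):
    """Index of the maximum of an ascending array that was rotated by an
    unknown amount (i.e. the last row written before the buffer wrapped)."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] > a[hi]:      # the fold lies to the right of mid
            lo = mid + 1
        else:                   # mid..hi is sorted; fold is at or left of mid
            hi = mid
    return (lo - 1) % len(a)    # lo is the minimum; the max sits just before it

print(wrap_index([40, 50, 60, 10, 20, 30]))   # -> 2 (the 60)
```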

 

Bob Schor

Message 10 of 11