12-23-2015 11:09 AM
I have a CSV file that contains 3 columns: time, voltage, and current. This data spans a period of 12 days with 1ms between samples, and the file size is 53.4GB!!
I'm using LabVIEW to chunk this file into more manageable pieces for analysis using the Read From Spreadsheet File.vi and Write to Spreadsheet File.vi, and my method works great...to a point. Here's a code snippet of what I'm doing:
I'm using the "start of read offset" input to the Read From Spreadsheet File.vi to inch my way through the file, but since this is an I32 datatype it eventually rolls over and the VI errors out. What I think I should do is delete chunks from the main file as I read and write them to their respective smaller file, but the method to do this is stumping me. Are there any suggestions out there on how I could pull this off while still using the Spreadsheet File vi's (to avoid reinventing the wheel as much as possible)?
A couple ideas I'm toying with:
1. Iteratively delete data and reset the marker to "0" in the file (reading data in order of present to future).
2. Start at the END of the file (i.e. read the data from future to present) and iteratively set the file size smaller, thus deleting the chunks of data as I go.
12-23-2015 11:54 AM
I'm almost afraid to ask about the nature of the data! Sampling at 1 kHz implies you are looking at fairly rapidly-changing data (with signals of interest of at least 100 Hz), but sampling continuously for 12 days also suggests looking at ultra-slow phenomena (that take hours to manifest themselves)! So much data to process ...
But LabVIEW can help! By my calculations, you have about a billion samples, so each sample generates about 50 bytes in your 53.4GB file. Let's say we want to produce 12 * 24 = 288 files, each holding an hour's worth of the CSV data -- this would be a file of about 185 MB that would represent 3.6 million samples, still quite a bit, but certainly something LabVIEW should be able to handle.
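Just to make those numbers concrete, here's a quick back-of-envelope check (plain Python, used only for the arithmetic; the bytes-per-sample figure is simply what the stated file size implies):

    samples = 12 * 24 * 3600 * 1000              # 12 days at 1 kHz -> ~1.04 billion samples
    bytes_per_sample = 53.4e9 / samples          # ~51 bytes per CSV line
    samples_per_hour = 3600 * 1000               # 3.6 million samples per hour
    mb_per_hour = samples_per_hour * bytes_per_sample / 1e6   # ~185 MB per hourly file
    print(samples, bytes_per_sample, samples_per_hour, mb_per_hour)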
The secret to doing this is to treat the main file as a Binary File, and read part of the file into a "Work buffer", figure out where the "hour breaks" fall, and write that out as an appropriately-named Binary File (with the .csv extension so you'll be able to read it back in easily).
Figuring out where the hour breaks fall just involves finding and counting the individual lines in the work buffer. Lines typically end with <CR><LF> (or 13, 10 if interpreted as U8).
Suppose you start with a Work Buffer of 200 million bytes, which should be big enough for an hour of data. Find where the hour ends (somewhere around 185 million). Using Delete from Array, split the Array into this initial First Hour and "Beginning of Next Hour" arrays, and write the First Hour to your first .CSV file. Now take the remaining (smaller) array and fill it back up to 200 million by reading additional bytes from the main file. Repeat until done.
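Here is that loop as a rough text sketch (Python used purely as pseudocode -- in LabVIEW this would be Read from Binary File, the array/string searching functions, and Write to Binary File); the chunk size, line count per hour, and output file names below are just placeholders:

    CHUNK = 200_000_000          # work-buffer target size, ~200 million bytes
    LINES_PER_HOUR = 3_600_000   # 1 kHz sampling -> 3.6 million lines per hour

    def split_hourly(src_path):
        buf = b""
        hour = 0
        with open(src_path, "rb") as src:
            while True:
                buf += src.read(CHUNK - len(buf))   # top the work buffer back up
                if not buf:
                    break                           # main file exhausted
                # walk past LINES_PER_HOUR line endings to find where the hour break falls
                pos = 0
                for _ in range(LINES_PER_HOUR):
                    nxt = buf.find(b"\r\n", pos)    # lines end with <CR><LF> (13, 10)
                    if nxt < 0:
                        pos = len(buf)              # last, partial hour of the file
                        break
                    pos = nxt + 2
                with open(f"hour_{hour:03d}.csv", "wb") as dst:
                    dst.write(buf[:pos])            # write this hour out as its own .csv
                buf = buf[pos:]                     # keep the "beginning of next hour"
                hour += 1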
This might not be the optimal way to do this (i.e. there may be more efficient ways of handling these large arrays), but the algorithm should work.
Bob Schor
12-23-2015 02:53 PM - edited 12-23-2015 02:54 PM
I know hindsight is 20/20, but a little forethought goes a long way.
Next time, consider 'chunking' your data file from the start. Use a timer or a change in date/time to create a new file at regular intervals, or heck, even just a "write counter" to limit the file size (a rough sketch of that pattern follows below).
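Here's roughly what that rolling-file pattern looks like, sketched in Python just for readability (in LabVIEW it would be a write loop with the current file reference held in a shift register); the hourly interval and the file-name pattern are only examples:

    import time

    class RollingCsvLogger:
        """Append samples to a CSV file, starting a fresh file every roll_seconds."""

        def __init__(self, roll_seconds=3600):
            self.roll_seconds = roll_seconds
            self.file = None
            self.file_start = None

        def write(self, t, voltage, current):
            # roll to a new, timestamp-named file once the interval has elapsed
            if self.file is None or (t - self.file_start) >= self.roll_seconds:
                if self.file:
                    self.file.close()
                self.file_start = t
                name = time.strftime("log_%Y%m%d_%H%M%S.csv", time.localtime(t))
                self.file = open(name, "w")
            self.file.write(f"{t},{voltage},{current}\n")

Inside the acquisition loop you'd call something like logger.write(time.time(), v, i); the same idea works with a simple row counter instead of a timer.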
Also, TDMS was actually intended for handling such large data sets, but it has its own learning curve and idiosyncrasies.
12-23-2015 03:26 PM
Read from Spreadsheet File.vi creates multiple copies of the data. It has at least one copy of the file as a string and one copy of the numerical output array. I have not dug into it recently, but I think I recall that there may also be other copies under certain circumstances.
So, using Read from Spreadsheet File.vi may not be a particularly good way to manage large files.
Lynn
12-28-2015 03:20 AM
The easiest solution would probably be to modify the existing spreadsheet VIs (or a copy you make of them). As you realized, they use 32-bit values to track the position, but if you look inside the VIs, you should see that the actual file functions use 64-bit values (the 32-bit values in the VIs are kept for backward compatibility). I haven't looked closely, but my guess is that you can simply change those to 64-bit and your code should then be able to go through the entire file.
12-28-2015 09:40 AM - edited 12-28-2015 09:40 AM
I appreciate the responses!
Bob_Schor, thanks for the suggested method, I'll keep this technique in my toolbelt. To avoid having to rewrite my VI, I went with tst's recommendation and made modified copies of the spreadsheet VIs. These are the ones I had to modify to adapt to my VI (I appended "I64 Marker" to all of them):
Read File+ (string) - I64 Marker.vi
Read From Spreadsheet File (DBL) - I64 Marker.vi
Read From Spreadsheet File (string) - I64 Marker.vi
Read Lines From File - I64 Marker.vi
Read File+ (string) - I64 Marker.vi is at the bottom of the call stack. In addition to changing the relevant controls/indicators to I64 rather than I32, I also had to delete a "To Long Integer" conversion at the "offset (in bytes)" output of Get File Position.vi.
My top-level VI is off and running, so far so good!
12-28-2015 11:17 AM
Great! Be sure to mark tst's reply as a Solution (you, the Original Poster, are the only one who can do that) -- this lets other Forum users know that "This Question Has Been Answered", in case they have a similar issue.
Bob Schor