07-16-2011 08:16 PM
I have a large CSV file, about 332 Mbytes. It's way too large to fit in memory; I tried that, and somehow loading that 332 Mbyte file overran my computer's 6 Gbytes of memory by roughly a factor of 8.
It has 1024 columns and about 162,000 lines, and I need to transpose it.
Writing this code in LabVIEW would be more than a little tedious. Is there a function or VI out there that does this?
07-18-2011 08:09 AM - edited 07-18-2011 08:10 AM
First, I would recommend you read Managing Large Data Sets in LabVIEW (the same information is in the LabVIEW help in newer versions of LabVIEW). Next, using the knowledge you just gained, fix the memory copies, which are probably the root cause of your problem. Your data set is roughly 700MBytes in memory as SGL (twice that as DBL), still within LabVIEW's ability on a 6GByte machine. However, due to memory fragmentation issues, it may only take a couple of copies of this data to run you out of memory. Proper use of In Place Element structures and Data Value References should make this easier. A simple strategy that should work for you is the following:
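In outline: read the file in chunks into a preallocated array, transpose the array in place, and write it back out in chunks. A rough sketch of that idea, in Python/NumPy rather than LabVIEW purely for illustration (the chunk sizes and file names are placeholders):

```python
import numpy as np

N_ROWS, N_COLS = 162_000, 1024      # sizes from the original post
CHUNK_ROWS = 4_096                   # placeholder read-chunk size

# Preallocate the full array once (float32/SGL keeps it around 660 MBytes).
data = np.empty((N_ROWS, N_COLS), dtype=np.float32)

# Read the CSV in row chunks straight into the preallocated block,
# so no full extra copy of the data set is ever made.
with open("input.csv") as f:
    row = 0
    while row < N_ROWS:
        chunk = np.loadtxt(f, delimiter=",", dtype=np.float32,
                           max_rows=min(CHUNK_ROWS, N_ROWS - row), ndmin=2)
        data[row:row + len(chunk)] = chunk
        row += len(chunk)

# Transpose; in NumPy this is a view, i.e. an in-place reinterpretation.
transposed = data.T                  # shape is now (1024, 162000)

# Write the result back out a few output rows at a time.
with open("output.csv", "w") as out:
    for start in range(0, N_COLS, 64):
        np.savetxt(out, transposed[start:start + 64], fmt="%g", delimiter=",")
```

In LabVIEW the equivalent building blocks would be things like a preallocated array filled with Replace Array Subset inside the read loop, Transpose 2D Array, and chunked file writes.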
Good luck. Let us know if you run into further issues.
07-18-2011 08:45 AM
@DFGray wrote:
5. Write it back out to disk about a quarter line at a time (fast) or a line at a time (easy).
Damien,
My brain may not be fully firing today, so please could you expand on that last comment?
Why would writing a quarter line at a time be more efficient?
Thanks,
Ian
07-18-2011 09:09 AM
His final result will have 162,000 columns. Assuming a Windows operating system, the fastest disk performance is achieved (without a lot of low-level tweaking) by writing about 65,000 bytes per chunk. Performance drops off pretty fast below 65,000 bytes per chunk and more slowly above it. Given about 8 bytes per column, the optimum would actually be about 20 chunks per line (162,000 columns × 8 bytes/column ÷ 65,000 bytes ≈ 20). That's what I get for being in too much of a hurry and not bothering to actually do the math before posting. My apologies.
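As a quick back-of-the-envelope check of that chunk count (plain arithmetic, nothing LabVIEW-specific):

```python
BYTES_PER_VALUE = 8          # roughly 8 bytes per value (DBL, or ~8 characters as text)
VALUES_PER_LINE = 162_000    # columns in the transposed file
TARGET_CHUNK_BYTES = 65_000  # roughly where Windows disk writes stop improving

line_bytes = VALUES_PER_LINE * BYTES_PER_VALUE             # 1,296,000 bytes per line
chunks_per_line = round(line_bytes / TARGET_CHUNK_BYTES)   # about 20 chunks
values_per_chunk = VALUES_PER_LINE // chunks_per_line      # about 8,100 values per write
print(chunks_per_line, values_per_chunk)
```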
07-18-2011 10:32 AM
You already got some good advice.
Since the fields in the raw file are already numerically formatted strings with commas as delimiters, and the final data in the output file are exactly the same strings (comma delimited), I would recommend NOT even converting to numeric and back. That just churns bytes for no reason, and you might actually introduce small differences due to the limitations of decimal formatting.
I would read the file in reasonable chunks, but only retain a few columns in memory with each pass, which can then be written directly to the output file. Repeat until all columns are processed.
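Something along these lines, sketched in Python only to show the pass structure (line-by-line reading stands in for LabVIEW's chunked reads, the column-group size is a placeholder, and the fields stay as strings the whole way through):

```python
N_COLS = 1024
COLS_PER_PASS = 64            # placeholder: columns kept in memory per pass

with open("transposed.csv", "w") as out:
    for first in range(0, N_COLS, COLS_PER_PASS):
        last = first + COLS_PER_PASS
        # One bucket of field strings per retained column; no numeric conversion.
        buckets = [[] for _ in range(COLS_PER_PASS)]
        with open("input.csv") as src:
            for line in src:
                fields = line.rstrip("\r\n").split(",")
                for i, value in enumerate(fields[first:last]):
                    buckets[i].append(value)
        # Each retained input column becomes one output line.
        for bucket in buckets:
            out.write(",".join(bucket) + "\n")
```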
By now you have probably realized that CSV is one of the worst formats for storing massive amounts of numeric data. First of all, it is not easily compatible with computers whose language setting uses the comma as the decimal separator (e.g. German). Secondly, the file is much larger than the actual data warrants, because of the small subset of ASCII characters allowed. Finally, scanning numeric strings and formatting numbers back into strings causes truncation errors and is relatively expensive, computationally. In the future, decide on a binary format for large data files.
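For example, a flat binary file of 32-bit floats instead of CSV, sketched in Python/NumPy (the file names and column count are just for illustration):

```python
import numpy as np

# Any 2-D table of numbers; here a small stand-in array.
data = np.arange(12, dtype=np.float32).reshape(3, 4)

# Flat binary, no header: 4 bytes per value instead of a formatted string,
# no locale-dependent decimal separator, no scan/format round trip on reload.
data.tofile("table.f32")

# Reading it back only needs the column count to restore the shape.
restored = np.fromfile("table.f32", dtype=np.float32).reshape(-1, 4)
```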
07-18-2011 12:03 PM
Well, those are a lot of good suggestions. This is the kind of thing that will need to be done once or twice a day, and it could burn up an hour or two each time unless I can make it faster.
I'll try those memory tricks and also find out if it's OK that I go straight binary. Probably 32-bit floating point numbers would be acceptable for this application.
07-18-2011 09:45 PM
For directly reading and writing large tables, switching from the CSV format to a binary format helped insanely.
Here are my machine specs:
AMD Athlon 7550 Dual-Core, 2.5 GHz
6.00 GB RAM
64-bit
A simple write of the file went from 35+ minutes to 21 seconds. Wow!
A simple read of the file went from 2 minutes to 4 seconds. Also wow!
However, since I'm transposing the file, it still takes 27 minutes just to do the transpose even in binary land; my first shot just now reads one float64 at a time and then writes out the entire row. To transpose an array on disk this way, every read has to hop from one disk address to a non-adjacent one, so I can't take advantage of the disk's built-in caching; hence the slowness.
But it's still a great improvement: 27 minutes with binary vs. 62 minutes (27 + 35) with CSV.
With some clever LabVIEW caching I could hopefully improve that 27-minute transpose by a factor of 10, so it would be around 2.7 minutes.
I do like the idea of keeping everything in memory though, as I like thinking in terms of seconds rather than minutes, so I'll eventually wind up exploring the memory-only route as DFGray initially suggested with his Managing Large Data Sets in LabVIEW link.
Thanks all!
07-18-2011 10:31 PM
bmihura wrote:I must hop from one address on disk to a non-adjacent address every time. So I cannot take advantage of the disk's built-in caching; hence the slowness.
It might still be faster to read larger chunks, then simply discard everything except the columns processed in the current pass.
Can you show us some code?
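In the meantime, here is roughly what I have in mind, sketched in Python for a flat float64 file (block sizes and file names are placeholders):

```python
import numpy as np

N_COLS = 1024
ROWS_PER_READ = 16_384        # placeholder: big sequential reads
COLS_PER_PASS = 128           # placeholder: columns kept in memory per pass

with open("table.f64", "rb") as src, open("transposed.f64", "wb") as dst:
    n_rows = src.seek(0, 2) // (8 * N_COLS)   # file size -> row count
    for first in range(0, N_COLS, COLS_PER_PASS):
        kept = np.empty((COLS_PER_PASS, n_rows), dtype=np.float64)
        src.seek(0)
        row = 0
        while row < n_rows:
            count = min(ROWS_PER_READ, n_rows - row)
            block = np.fromfile(src, dtype=np.float64,
                                count=count * N_COLS).reshape(count, N_COLS)
            # Keep only the columns handled in this pass, discard the rest.
            kept[:, row:row + count] = block[:, first:first + COLS_PER_PASS].T
            row += count
        kept.tofile(dst)      # COLS_PER_PASS complete output rows per pass
```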
07-18-2011 11:02 PM
I understand you 100%, altenbach, as far as reading in chunks of data and thereby "caching" them.
The code I've attached now does none of that, as it was all whipped up in about 4 hours to demonstrate things I've been saying. It is all flat-file 2-dimensional float64 data with no header.
Before I put everybody to sleep, try that file called Write Binary or CSV File.vi and watch the difference in time between writing large amounts of binary data vs. CSV data to disk.
Summary of code:
Write Binary or CSV File.vi
Demonstrates the speed difference between writing a CSV table and a binary table (I wrote this because of your post).
Read Binary File.vi
Reads the same thing in, just to make sure it wrote correctly.
Transpose Binary File.vi
What I really wanted to do in the first place, but other disk-caching issues have arisen, as mentioned earlier. Here is where I know how to improve things with LabVIEW caching.
07-19-2011 08:04 AM
My original algorithm should work quite well for you. Read the file in chunks into a preallocated array (disk speed limited, around 700MBytes with SGL data type, so about 30s with modern disks). Transpose the array (in place operation, less than one second). Write it back out in chunks. Combined time for the operation should be under two minutes and limited primarily by your disk speed.
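A sketch of that sequence in Python/NumPy, purely for illustration (the chunk sizes and file names are placeholders; the on-disk file is assumed to be the flat float64 file from your earlier post, downconverted to SGL in memory):

```python
import numpy as np

N_ROWS, N_COLS = 162_000, 1024
CHUNK_ROWS = 8_192                     # placeholder read-chunk size

# Preallocate once; SGL (float32) keeps the whole table around 660 MBytes.
data = np.empty((N_ROWS, N_COLS), dtype=np.float32)

# 1. Read in chunks into the preallocated array (disk-speed limited).
with open("table.f64", "rb") as src:
    for row in range(0, N_ROWS, CHUNK_ROWS):
        count = min(CHUNK_ROWS, N_ROWS - row)
        chunk = np.fromfile(src, dtype=np.float64, count=count * N_COLS)
        data[row:row + count] = chunk.reshape(count, N_COLS)  # DBL -> SGL

# 2. Transpose (a view; no extra copy is made until the data is written).
transposed = data.T

# 3. Write back out in chunks, copying only one small slab at a time.
with open("transposed.f32", "wb") as dst:
    for start in range(0, N_COLS, 64):
        np.ascontiguousarray(transposed[start:start + 64]).tofile(dst)
```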
A couple of things to be careful with. Using SGL data type, you have ~700MBytes of data. DBL will be twice that. Since you have a 64-bit OS, you should be able to do this, with a lot of care. I tested using LabVIEW 2010 SP1. What version are you using?
File reads and writes can make an extra copy, so you need to use chunks to keep the impact of that copy down.