11-09-2011 07:47 AM
Btw: I am not debating your "facts". The issue has always been speed. And DU.exe is the best (fastest) solution. Way better than any pure-G solution. And way better than re-writing my OS. Which is just silly.
11-09-2011 08:48 AM
Fine. Stay out of the discussion. You may see it as solved for yourself, but it's hardly that, given Matt's results, which contradict your claims that DU is the "best" solution.
Matt, I am curious about the results. When you ran du did you run it via System Exec? What command-line switches did you use? Also, what OS were you using? Did you have anti-virus turned off?
11-09-2011 11:33 AM
If you change the contents of your folder does DU still run very quickly?
The size data gets cached somewhere in Windows, and I'm not sure what it takes to make it recalculate.
When I right-clicked on my Windows folder and selected Properties, I could see it counting up in size as it processed through all the files. The first time, it took several seconds to calculate the 8+ GB size.
Once it finished I did it again. This time the size display was almost instant. It knew what it'd calculated before and didn't have to do it again! I suspect the DU program is doing something similar.
11-09-2011 01:25 PM
I don't think there's any doubt that there's definitely some caching going on. I created a folder with 1.2 million 1K files. Rebooted the PC and waited for Windows to settle down. Using the simple File/Directory File Info function to get a folder's size (remember we don't care about subfolders for the moment), the VI took 590 seconds the first run. The next run? 1.4 seconds. Quit LabVIEW and restarted LabVIEW. Ran the VI. Time? 1.4 seconds. Rebooted the PC and waited for Windows to settle down. Ran a VI that used du via System Exec. First run of du? 626 seconds. Second run? 2.9 seconds.
Clearly, du is the solution to this issue.
Can you sense my sarcasm through all those internet wires?
11-09-2011 01:56 PM
Your results are dubious.
11-09-2011 04:26 PM
smercurio_fc wrote:
Matt, I am curious about the results. When you ran du did you run it via System Exec? What command-line switches did you use? Also, what OS were you using? Did you have anti-virus turned off?
I used the code as posted in josborne's link:
https://decibel.ni.com/content/docs/DOC-17862
This was on XP, no antivirus, LV2011f2.
Personally, I expected the du function to be faster, simply because the Sysinternals guys really know their stuff, but I was thinking maybe twice as fast. I suspect du is handling corner cases (hardlinks and the like), which is slowing it down a bit. That would probably force some extra disk I/O even though I'm trying to run it on cached data, which would certainly slow it down relative to LabVIEW (hard drive I/O is a performance killer, far more than LabVIEW's string allocations). NTFS slows down with a large number of files in one directory (I've heard 10,000+), and I think it can also store small files inline with their metadata; du may handle those cases better.
My work computer takes a couple of minutes to boot up, so I haven't been trying the non-cached case.
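For what it's worth, one way a tool like du could guard against double-counting hardlinks (this is only a guess at the technique, not du's actual code) is to open each file for metadata only and key on the volume serial number plus file index returned by GetFileInformationByHandle, counting a multi-link file only the first time that ID shows up. A minimal C++ sketch; the paths in main are made up:

```cpp
#include <windows.h>
#include <cstdint>
#include <iostream>
#include <set>
#include <string>
#include <utility>

// Returns the file's size, or 0 if this file ID was already counted
// (i.e. we reached it through another hardlinked name) or it couldn't be opened.
std::uint64_t SizeIfFirstLink(const std::wstring& path,
                              std::set<std::pair<std::uint64_t, std::uint64_t>>& seen) {
    // Desired access 0 = query metadata only, no read access needed.
    HANDLE h = CreateFileW(path.c_str(), 0,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           nullptr, OPEN_EXISTING, 0, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 0;

    BY_HANDLE_FILE_INFORMATION info{};
    std::uint64_t size = 0;
    if (GetFileInformationByHandle(h, &info)) {
        // {volume serial, file index} identifies the file no matter which name we used.
        const std::pair<std::uint64_t, std::uint64_t> id{
            info.dwVolumeSerialNumber,
            (static_cast<std::uint64_t>(info.nFileIndexHigh) << 32) | info.nFileIndexLow };
        if (info.nNumberOfLinks <= 1 || seen.insert(id).second)
            size = (static_cast<std::uint64_t>(info.nFileSizeHigh) << 32) | info.nFileSizeLow;
    }
    CloseHandle(h);
    return size;
}

int main() {
    std::set<std::pair<std::uint64_t, std::uint64_t>> seen;
    // Hypothetical paths: two names for the same hardlinked file only count once.
    std::uint64_t total = SizeIfFirstLink(L"C:\\temp\\a.txt", seen)
                        + SizeIfFirstLink(L"C:\\temp\\a_hardlink.txt", seen);
    std::cout << total << " bytes\n";
    return 0;
}
```

If du does something along these lines, every one of those extra per-file opens is more work than just reading the size out of the directory listing, which would fit the "a bit slower even on cached data" observation.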
11-09-2011 04:53 PM
@josborne wrote:
Your results are dubious.
I thought you were happy with your solution, so you weren't bothering with this thread anymore and any "academic" analysis of why there's no clear-cut solution to this problem, since the operating system (and buffering by the disk controller) plays as much a role here as any technique you can come up with in LabVIEW. All I did was replicate your conditions and presented the results. Don't seem to match what you're claiming. Perhaps the Elysian Fields aren't so idyllic.
@MattW: My test bed is a Windows 7, 64-bit machine. I am currently trying to test the same thing on XP, but I'm not expecting the results to be any different from what I saw. While there may be a way to write a VI that does not use the Recursive File List VI to generate the list of folders ahead of time, but rather accumulates the total folder usage "on the fly", I'm not expecting much of a difference there either. The killer is, and always will be, disk I/O.
Aside: You know what the funny part in all of this is? Look at the very first response in this thread. It's from me. I suggested using the Disk Usage utility not for speed, but because the original poster didn't want to iterate through the directories, adding up all the file sizes. The Disk Usage utility simply hid that from the user, providing a nice number at the end. The fundamental way it works is no different, though, and its time is no better than a pure G solution's, as your results and mine show.
11-10-2011 12:46 AM
I was planning on doing non-cached runs, but due to some issues that show up later I didn't have the time today.
Fully Cached means I ran it at least once before rerunning to measure the time.
For the modified pure G version I'm recursing by hand and avoiding a lot of the unneeded arrays used in the simple solution.
The Windows API version is an interface I hacked together (not heavily tested; it calls kernel32 functions through CLNs).
This is a copy of my dev directory on a USB2 external hard drive.
| Method | Not Cached (ms) | Fully Cached (ms) | Size (MB) |
|---|---|---|---|
| DU | not run | 27167 | 9538.67 |
| Pure G | not run | 10147 | 9538.67 |
| Modified Pure G | not run | 9919 | 9538.67 |
| Windows API | not run | 1192 | 9538.67 |
But with a test of 1,267,788 files in the 20-28k size range (attempting to mirror josborne's data), something is really breaking down in the pure G solutions; I haven't figured out what yet. Although, to be fair, it's considered bad to have over 10 thousand files in a single directory (it causes performance issues), let alone over a million. It seems that DU slows down when recursing directories, since it's quite fast in the single-folder case. I suspect the not-cached tests would be a lot faster on an SSD.
| Method | Not Cached (ms) | Cached (ms) | Size (MB) |
|---|---|---|---|
| DU | 333488 | 1786 | 29718.2 |
| Pure G | not run | 610245 | 29718.2 |
| Modified Pure G | not run | 552049 | 29718.2 |
| Windows API | 336001 | 1364 | 29718.2 |
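We haven't seen the CLN-based "Windows API" interface itself, but a typical kernel32 folder-size routine looks something like the C++ sketch below: FindFirstFileW/FindNextFileW walk each directory, and the WIN32_FIND_DATAW record already carries each file's size, so there's no separate per-file call. The C:\temp path in main is a placeholder.

```cpp
#include <windows.h>
#include <cstdint>
#include <iostream>
#include <string>

// Sums the sizes (in bytes) of all files under 'dir', recursing into subdirectories.
std::uint64_t FolderSize(const std::wstring& dir) {
    std::uint64_t total = 0;
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) return 0;

    do {
        const std::wstring name = fd.cFileName;
        if (name == L"." || name == L"..") continue;

        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            total += FolderSize(dir + L"\\" + name);  // recurse into the subfolder
        } else {
            // The find data already holds the size as two 32-bit halves.
            total += (static_cast<std::uint64_t>(fd.nFileSizeHigh) << 32) | fd.nFileSizeLow;
        }
    } while (FindNextFileW(h, &fd));

    FindClose(h);
    return total;
}

int main() {
    std::wcout << FolderSize(L"C:\\temp") << L" bytes\n";  // placeholder path
    return 0;
}
```

That one-pass enumeration is presumably why the Windows API rows above come out so far ahead once the data is cached.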
To check the basis of the problem I rewrote the List Folder node with CLNs and the Windows API. It handles patterns, but it doesn't handle LLBs, nor does it look for datalogs.
An average over 10000 runs on my dev folder (remember list folder doesn't recurse)
| Method | Fully Cached (ms) |
|---|---|
| LV Primitive | 1.77 |
| Mine | 1.01 |
An average of 10 runs on the folder with ~1.2 mil files
| Method | Fully Cached (ms) |
|---|---|
| LV Primitive | 18902.6 |
| Mine | 1951.9 |
So the issue is at least partially at the primitive level, but that doesn't seem sufficient to explain all the slowdown in the file size functions. I know paths in LabVIEW have their own structure (i.e., they're not just strings), so it might have something to do with that.
On a side note, it's not too often I get to beat a primitive function so handily speed-wise.
@smercurio_fc wrote:
...
Using the simple File/Directory File Info function to get a folder's size (remember we don't care about subfolders for the moment)
...
That size is the number of files within the directory, not how much space they take up in total (a little confusing, I know).
11-10-2011 07:25 AM
My first thought about this problem was that LabVIEW was spending too much time resizing the array of filenames.
I created a version of the Recursive File List that got the file size for each file in each folder and eliminated the huge file array. The execution was about the same.
I created a second version that included queues. One loop to get all the filenames and a recursive VI called multiple times to get the file sizes. This pushed the processor to 100% and ran in about half the time.
I used VI.lib as the base folder.
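This isn't Phillip's actual VI, but the producer/consumer idea maps directly to other languages. Here's a rough C++17 sketch under those assumptions: one producer walks the tree and pushes file paths onto a queue, while several worker threads pop paths and accumulate their sizes, so the directory walk and the size lookups overlap.

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

int main(int argc, char** argv) {
    const fs::path root = (argc > 1) ? argv[1] : ".";

    std::queue<fs::path> work;           // plays the role of the LabVIEW queue
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::atomic<std::uint64_t> total{0};

    // Consumers: pop a path, look up its size, add it to the running total.
    auto consumer = [&] {
        for (;;) {
            fs::path p;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return done || !work.empty(); });
                if (work.empty()) return;          // producer finished and queue drained
                p = std::move(work.front());
                work.pop();
            }
            std::error_code ec;
            const auto sz = fs::file_size(p, ec);
            if (!ec) total += sz;
        }
    };

    std::vector<std::thread> pool;
    const unsigned n = std::max(2u, std::thread::hardware_concurrency());
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(consumer);

    // Producer: walk the tree once and hand every regular file to the pool.
    std::error_code ec;
    for (fs::recursive_directory_iterator it(root, fs::directory_options::skip_permission_denied, ec), end;
         it != end; it.increment(ec)) {
        if (!ec && it->is_regular_file(ec)) {
            std::lock_guard<std::mutex> lk(m);
            work.push(it->path());
            cv.notify_one();
        }
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_all();
    for (auto& t : pool) t.join();

    std::cout << root << " : " << total.load() << " bytes\n";
    return 0;
}
```

The queue does the same job as the LabVIEW one: it decouples filename production from the size lookups so multiple lookups can run at once.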
11-10-2011 08:12 AM
That's a superb analysis, Matt. I wish I could give you more than 1 Kudos. What I take away from your analysis is that there is no clear-cut solution to the problem. In your first set of results, the difference between the pure G and the modified G is quite small (relatively speaking). DU is almost 3X slower. In your second set the difference between pure G and modified G is much larger, but completely buried by DU's results. So, in one case G handily wins, and in the other case du handily wins. Would it be possible for you to upload the code you used so that others can test it out under other conditions?
One thing we need to keep in mind here is that we are trying to find a cross-platform solution, and we don't know what these results look like on other file system types; that's a factor here as well. As you noted, NTFS doesn't like a lot of small files.
Matt W wrote:
@smercurio_fc wrote:
...
Using the simple File/Directory File Info function to get a folder's size (remember we don't care about subfolders for the moment)
...
That size is the number of files within the directory, not how much space they take up in total (a little confusing, I know).
Yes, that's correct. My presumption was that whatever effects cause a huge hit to the File I/O functions on non-cached information would be seen here as well. The results I got seemed to indicate as much.
@Phillip Brooks wrote:
My first thought about this problem was that LabVIEW was spending too much time resizing the array of filenames.
I created a version of the Recursive File List that got the file size for each file in each folder and eliminated the huge file array. The execution was about the same.
I created a second version that included queues. One loop to get all the filenames and a recursive VI called multiple times to get the file sizes. This pushed the processor to 100% and ran in about half the time.
I used VI.lib as the base folder.
I'm trying your VI now. It's still running after 20 minutes on the single folder with 1.2 million files. Something doesn't seem quite right...