11-09-2011 07:47 AM
Btw: I am not debating your "facts". The issue has always been speed. And DU.exe is the best (fastest) solution. Way better than any pure-G solution. And way better than re-writing my OS. Which is just silly.
11-09-2011 08:48 AM
Fine. Stay out of the discussion. You may see it as solved for yourself, but it's hardly that, given Matt's results, which contradict your claims that DU is the "best" solution.
Matt, I am curious about the results. When you ran du did you run it via System Exec? What command-line switches did you use? Also, what OS were you using? Did you have anti-virus turned off?
11-09-2011 11:33 AM
If you change the contents of your folder does DU still run very quickly?
The size data gets cached somewhere in Windows, and I'm not sure what it takes to make it recalculate.
When I right-clicked on my Windows folder and selected Properties, I could see it counting up in size as it processed through all the files. The first time, it took several seconds to calculate the 8+ GB size.
Once it finished I did it again. This time the size display was almost instant. It knew what it'd calculated before and didn't have to do it again! I suspect the DU program is doing something similar.
11-09-2011 01:25 PM
I don't think there's any doubt that there's definitely some caching going on. I created a folder with 1.2 million 1K files. Rebooted the PC and waited for Windows to settle down. Using the simple File/Directory File Info function to get a folder's size (remember we don't care about subfolders for the moment), the VI took 590 seconds the first run. The next run? 1.4 seconds. Quit LabVIEW and restarted LabVIEW. Ran the VI. Time? 1.4 seconds. Rebooted the PC and waited for Windows to settle down. Ran a VI that used du via System Exec. First run of du? 626 seconds. Second run? 2.9 seconds.
Clearly, du is the solution to this issue.
Can you sense my sarcasm through all those internet wires?
11-09-2011 01:56 PM
Your results are dubious.
11-09-2011 04:26 PM
smercurio_fc wrote:
Matt, I am curious about the results. When you ran du did you run it via System Exec? What command-line switches did you use? Also, what OS were you using? Did you have anti-virus turned off?
I used the code as posted in josborne's link:
https://decibel.ni.com/content/docs/DOC-17862
This was on XP, no antivirus, LV2011f2.
Personally, I expected the du function to be faster, simply because the Sysinternals guys really know their stuff, but I was thinking maybe twice as fast. I suspect du is handling corner cases (hardlinks and the like), which is slowing it down a bit. That would probably force some extra disk I/O even though I'm trying to run it on cached data, which would certainly slow it down relative to LabVIEW (hard drive I/O is a performance killer, far more than LabVIEW's string allocations). NTFS slows down with a large number of files in one directory (I've heard 10,000+), and I think it can also store small files inline with their metadata; du may handle those cases better.
My work computer takes a couple of minutes to boot up, so I haven't been trying the non-cached case.
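For what it's worth, one way a tool like du could guard against double-counting hardlinks (this is only a guess at the technique, not du's actual code) is to open each file for metadata only and key on the volume serial number plus file index returned by GetFileInformationByHandle, counting a multi-link file only the first time that ID shows up. A minimal C++ sketch; the paths in main are made up:

```cpp
#include <windows.h>
#include <cstdint>
#include <iostream>
#include <set>
#include <string>
#include <utility>

// Returns the file's size, or 0 if this file ID was already counted
// (i.e. we reached it through another hardlinked name) or it couldn't be opened.
std::uint64_t SizeIfFirstLink(const std::wstring& path,
                              std::set<std::pair<std::uint64_t, std::uint64_t>>& seen) {
    // Desired access 0 = query metadata only, no read access needed.
    HANDLE h = CreateFileW(path.c_str(), 0,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           nullptr, OPEN_EXISTING, 0, nullptr);
    if (h == INVALID_HANDLE_VALUE) return 0;

    BY_HANDLE_FILE_INFORMATION info{};
    std::uint64_t size = 0;
    if (GetFileInformationByHandle(h, &info)) {
        // {volume serial, file index} identifies the file no matter which name we used.
        const std::pair<std::uint64_t, std::uint64_t> id{
            info.dwVolumeSerialNumber,
            (static_cast<std::uint64_t>(info.nFileIndexHigh) << 32) | info.nFileIndexLow };
        if (info.nNumberOfLinks <= 1 || seen.insert(id).second)
            size = (static_cast<std::uint64_t>(info.nFileSizeHigh) << 32) | info.nFileSizeLow;
    }
    CloseHandle(h);
    return size;
}

int main() {
    std::set<std::pair<std::uint64_t, std::uint64_t>> seen;
    // Hypothetical paths: two names for the same hardlinked file only count once.
    std::uint64_t total = SizeIfFirstLink(L"C:\\temp\\a.txt", seen)
                        + SizeIfFirstLink(L"C:\\temp\\a_hardlink.txt", seen);
    std::cout << total << " bytes\n";
    return 0;
}
```

If du does something along these lines, every one of those extra per-file opens is more work than just reading the size out of the directory listing, which would fit the "a bit slower even on cached data" observation.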
11-09-2011 04:53 PM
@josborne wrote:
Your results are dubious.
I thought you were happy with your solution, so you weren't bothering with this thread anymore and any "academic" analysis of why there's no clear-cut solution to this problem, since the operating system (and buffering by the disk controller) plays as much a role here as any technique you can come up with in LabVIEW. All I did was replicate your conditions and presented the results. Don't seem to match what you're claiming. Perhaps the Elysian Fields aren't so idyllic.
@MattW: My test bed is a Windows 7, 64-bit machine. I am currently trying to test the same thing on XP, but I'm not expecting the results to be any different from what I saw. While there may be a way to write a VI that does not use the Recursive File List VI to generate the list of folders ahead of time, but rather accumulates the total folder usage "on the fly", I'm not expecting much of a difference there either. The killer is, and always will be, disk I/O.
Aside: You know what the funny part in all of this is? Look at the very first response in this thread. It's from me. I suggested using the Disk Usage utility not for speed, but because the original poster didn't want to iterate through the directories, adding up all the file sizes. The Disk Usage utility simply hid that from the user, providing a nice number at the end. The fundamental way it works is no different, though, and its time is no better than a pure G solution's, as your results and mine show.
11-10-2011 12:46 AM
I was planning on doing non-cached runs, but due to some issues that show up later I didn't have the time today.
Fully Cached means I ran it at least once before rerunning to measure the time.
For the modified pure G version I'm recursing by hand and avoiding a lot of the unneeded arrays used in the simple solution.
The Windows API version is an interface I hacked together (not heavily tested; it calls kernel32 functions through CLNs).
This is a copy of my dev directory on a USB2 external hard drive.
| Method | Not Cached (ms) | Fully Cached (ms) | Size (MB) |
|---|---|---|---|
| DU | not run | 27167 | 9538.67 |
| Pure G | not run | 10147 | 9538.67 |
| Modified Pure G | not run | 9919 | 9538.67 |
| Windows API | not run | 1192 | 9538.67 |
But with a test of 1,267,788 files in the 20-28k size range (attempting to mirror josborne's data), something is really breaking down in the pure G solutions; I haven't figured out what yet. Although, to be fair, it's considered bad to have over 10 thousand files in a single directory (it causes performance issues), let alone over a million. It seems that DU slows down when recursing directories, since it's quite fast in the single-folder case. I suspect the not-cached tests would be a lot faster on an SSD.
| Method | Not Cached (ms) | Cached (ms) | Size (MB) |
|---|---|---|---|
| DU | 333488 | 1786 | 29718.2 |
| Pure G | not run | 610245 | 29718.2 |
| Modified Pure G | not run | 552049 | 29718.2 |
| Windows API | 336001 | 1364 | 29718.2 |
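We haven't seen the CLN-based "Windows API" interface itself, but a typical kernel32 folder-size routine looks something like the C++ sketch below: FindFirstFileW/FindNextFileW walk each directory, and the WIN32_FIND_DATAW record already carries each file's size, so there's no separate per-file call. The C:\temp path in main is a placeholder.

```cpp
#include <windows.h>
#include <cstdint>
#include <iostream>
#include <string>

// Sums the sizes (in bytes) of all files under 'dir', recursing into subdirectories.
std::uint64_t FolderSize(const std::wstring& dir) {
    std::uint64_t total = 0;
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) return 0;

    do {
        const std::wstring name = fd.cFileName;
        if (name == L"." || name == L"..") continue;

        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            total += FolderSize(dir + L"\\" + name);  // recurse into the subfolder
        } else {
            // The find data already holds the size as two 32-bit halves.
            total += (static_cast<std::uint64_t>(fd.nFileSizeHigh) << 32) | fd.nFileSizeLow;
        }
    } while (FindNextFileW(h, &fd));

    FindClose(h);
    return total;
}

int main() {
    std::wcout << FolderSize(L"C:\\temp") << L" bytes\n";  // placeholder path
    return 0;
}
```

That one-pass enumeration is presumably why the Windows API rows above come out so far ahead once the data is cached.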
To check the basis of the problem I rewrote the List Folder node with CLNs and the Windows API. It handles patterns, but it doesn't handle LLBs, nor does it look for datalogs.
An average over 10000 runs on my dev folder (remember list folder doesn't recurse)
| Method | Fully Cached (ms) |
|---|---|
| LV Primitive | 1.77 |
| Mine | 1.01 |
An average of 10 runs on the folder with ~1.2 mil files
| Method | Fully Cached (ms) |
|---|---|
| LV Primitive | 18902.6 |
| Mine | 1951.9 |
So the issue is at least partially at the primitive level, but that doesn't seem sufficient to explain all the slowdown in the file size functions. I know paths in LabVIEW have their own structure (i.e., they're not just strings), so it might have something to do with that.
On a side note, it's not too often I get to beat a primitive function so handily speed-wise.
@smercurio_fc wrote:
...
Using the simple File/Directory File Info function to get a folder's size (remember we don't care about subfolders for the moment)
...
That size is the number of files within the directory, not how much space they take up in total (a little confusing, I know).
11-10-2011 07:25 AM
My first thought about this problem was that LabVIEW was spending too much time resizing the array of filenames.
I created a version of the Recursive File List that got the file size for each file in each folder and eliminated the huge file array. The execution was about the same.
I created a second version that included queues. One loop to get all the filenames and a recursive VI called multiple times to get the file sizes. This pushed the processor to 100% and ran in about half the time.
I used VI.lib as the base folder.
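This isn't Phillip's actual VI, but the producer/consumer idea maps directly to other languages. Here's a rough C++17 sketch under those assumptions: one producer walks the tree and pushes file paths onto a queue, while several worker threads pop paths and accumulate their sizes, so the directory walk and the size lookups overlap.

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

namespace fs = std::filesystem;

int main(int argc, char** argv) {
    const fs::path root = (argc > 1) ? argv[1] : ".";

    std::queue<fs::path> work;           // plays the role of the LabVIEW queue
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::atomic<std::uint64_t> total{0};

    // Consumers: pop a path, look up its size, add it to the running total.
    auto consumer = [&] {
        for (;;) {
            fs::path p;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return done || !work.empty(); });
                if (work.empty()) return;          // producer finished and queue drained
                p = std::move(work.front());
                work.pop();
            }
            std::error_code ec;
            const auto sz = fs::file_size(p, ec);
            if (!ec) total += sz;
        }
    };

    std::vector<std::thread> pool;
    const unsigned n = std::max(2u, std::thread::hardware_concurrency());
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(consumer);

    // Producer: walk the tree once and hand every regular file to the pool.
    std::error_code ec;
    for (fs::recursive_directory_iterator it(root, fs::directory_options::skip_permission_denied, ec), end;
         it != end; it.increment(ec)) {
        if (!ec && it->is_regular_file(ec)) {
            std::lock_guard<std::mutex> lk(m);
            work.push(it->path());
            cv.notify_one();
        }
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_all();
    for (auto& t : pool) t.join();

    std::cout << root << " : " << total.load() << " bytes\n";
    return 0;
}
```

The queue does the same job as the LabVIEW one: it decouples filename production from the size lookups so multiple lookups can run at once.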
11-10-2011 08:12 AM
That's a superb analysis, Matt. I wish I could give you more than 1 Kudos. What I take away from your analysis is that there is no clear-cut solution to the problem. In your first set of results, the difference between the pure G and the modified G is quite small (relatively speaking). DU is almost 3X slower. In your second set the difference between pure G and modified G is much larger, but completely buried by DU's results. So, in one case G handily wins, and in the other case du handily wins. Would it be possible for you to upload the code you used so that others can test it out under other conditions?
One thing we need to keep in mind here is that we are trying to find a cross-platform solution, and we don't know what these results look like on other file system types; that's a factor here as well. As you noted, NTFS doesn't like a lot of small files.
Matt W wrote:
@smercurio_fc wrote:
...
Using the simple File/Directory File Info function to get a folder's size (remember we don't care about subfolders for the moment)
...
That size is the number of files within the directory, not how much space they take up in total (a little confusing, I know).
Yes, that's correct. My presumption was that whatever effects cause a huge hit to the File I/O functions on non-cached information would be seen here as well. The results I got seemed to indicate as much.
@Phillip Brooks wrote:
My first thought about this problem was that LabVIEW was spending too much time resizing the array of filenames.
I created a version of the Recursive File List that got the file size for each file in each folder and eliminated the huge file array. The execution was about the same.
I created a second version that included queues. One loop to get all the filenames and a recursive VI called multiple times to get the file sizes. This pushed the processor to 100% and ran in about half the time.
I used VI.lib as the base folder.
I'm trying your VI now. It's still running after 20 minutes on the single folder with 1.2 million files. Something doesn't seem quite right...