Differentiating flattened strings

BowenM · 05-12-2023

I was handed some legacy code the other day that was responsible for generating a ton of test data files that nobody has the ability to read. It is my task to write a viewer for all of these legacy files.

The code that writes the file has a header that is just a flattened string from a cluster. However, there are two different cluster types: one with arrays in it, and one without. I need some code to uniquely identify which one of these clusters was flattened (see below).

I'm hoping someone has a clever way to do this. Some options I have considered:

1. Look at the size of the flattened string. Those with arrays should contain the prepended array size bytes, so they would be bigger.
   - Drawback: Channel Name is an unbounded string. A "Single" cluster with a long channel name could be larger than a "Multi" cluster with only one channel.
2. Look for "meaningful" values in Channel Offset. Prepended array size bytes would likely turn the remaining values into nonsense.
   - Drawback: Possibility of failure, since some of the data could legitimately contain values in the Channel Offset field that do not look meaningful.
3. Attempt the "Single" conversion first, then check the Channel Name field for non-ASCII characters. If any are found, assume it is a "Multi" cluster.
   - Drawback: Binary values could happen to match ASCII characters, so the check would fail to catch the "Multi" data type.

At this point, I'm leaning towards option 2. I can imagine a scenario in which it would fail the check, but I don't think that is very likely. Still, I'm hoping someone else has a better idea.

altenbach · 05-12-2023

If you try to unflatten to the two data structures in parallel, I would expect one to typically generate an error or a "remaining string" that is not empty. Have you tried?
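
For readers without the VIs at hand, here is a text-language sketch of the try-both idea. It assumes LabVIEW's default flattened layout (big-endian, with a 4-byte I32 length in front of every string and every array) and a hypothetical field order of Offset (DBL), Gain (DBL), Name (string) for "Single", with the same three fields as arrays for "Multi". The real cluster definitions are not shown in this thread, and `Reader`, `parse_single`, `parse_multi`, and `candidates` are illustrative names, not LabVIEW functions.

```python
import struct


class Reader:
    """Cursor over a big-endian byte string."""

    def __init__(self, data: bytes):
        self.data, self.pos = data, 0

    def take(self, fmt: str):
        size = struct.calcsize(fmt)
        if self.pos + size > len(self.data):
            raise ValueError("ran out of bytes")
        (value,) = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += size
        return value

    def string(self) -> bytes:
        n = self.take(">i")  # I32 length prefix
        if n < 0 or self.pos + n > len(self.data):
            raise ValueError("bad string length")
        s = self.data[self.pos:self.pos + n]
        self.pos += n
        return s

    def array(self, read_element):
        n = self.take(">i")  # I32 element count
        # Every element needs at least one byte, so a larger count is bogus.
        if n < 0 or n > len(self.data) - self.pos:
            raise ValueError("bad array length")
        return [read_element() for _ in range(n)]


def parse_single(data: bytes):
    # Assumed layout: Offset (DBL), Gain (DBL), Name (string).
    r = Reader(data)
    cluster = {"offset": r.take(">d"), "gain": r.take(">d"), "name": r.string()}
    return cluster, data[r.pos:]


def parse_multi(data: bytes):
    # Assumed layout: Offset[] (DBL), Gain[] (DBL), Name[] (string).
    r = Reader(data)
    cluster = {
        "offsets": r.array(lambda: r.take(">d")),
        "gains": r.array(lambda: r.take(">d")),
        "names": r.array(r.string),
    }
    return cluster, data[r.pos:]


def candidates(data: bytes):
    """Layouts that parse without error and leave no bytes over."""
    result = []
    for label, parse in (("single", parse_single), ("multi", parse_multi)):
        try:
            _, rest = parse(data)
        except ValueError:
            continue
        if not rest:
            result.append(label)
    return result
```

If `candidates` returns exactly one label, the header is unambiguous under these assumptions; two labels mean further plausibility checks are needed.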

LabVIEW Champion.

BertMcMahan · 05-12-2023

@altenbach wrote:

If you try to unflatten to the two data structures in parallel, I would expect one to typically generate an error or a "remaining string" that is not empty. Have you tried?

I was curious about this so I made the following example:

As long as there are enough bytes in the flattened string, you can unflatten whatever you'd like. You both probably knew this already but I wasn't sure.

I might try doing both, then doing some checks on the data to see if they make sense. For example, a multi-channel system will have the same number of elements in the Offset, Gain, and Name arrays. It'll be very unlikely for that to accidentally happen. Add your option 3 to look for non-ASCII characters and you could filter even more errors (the ol' "Swiss Cheese Model").
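
The layered checks described above could look roughly like this in a text language. This is only a sketch: the function name `failed_checks`, the magnitude threshold `limit`, and the rule that a name must be non-empty printable ASCII are all assumptions to tune against real files. A "Single" result can be checked the same way by passing one-element lists.

```python
import math


def failed_checks(offsets, gains, names, limit=1e6):
    """Return the plausibility checks that a decoded header fails.

    offsets, gains: lists of floats; names: list of bytes objects.
    limit is a hypothetical bound on believable offset/gain magnitudes.
    """
    failures = []
    # A real multi-channel header has one offset, gain, and name per channel.
    if not (len(offsets) == len(gains) == len(names)):
        failures.append("array lengths differ")
    # Misaligned bytes tend to decode as NaN, infinity, or absurd magnitudes.
    if not all(math.isfinite(v) and abs(v) <= limit for v in [*offsets, *gains]):
        failures.append("implausible number")
    # Channel names are assumed to be non-empty printable ASCII.
    if not all(name and all(0x20 <= b <= 0x7E for b in name) for name in names):
        failures.append("non-printable name")
    return failures
```

An empty result means every layer passed; the more layers a wrong interpretation has to slip through, the less likely a misclassification becomes.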

If all else fails and it's still ambiguous, you could have a popup for the user to view. Something like plotting the data points in a popup two ways and asking the user to identify which one "looks right".

altenbach · 05-12-2023

You have not looked at the size of the "rest of binary string" output. It'll tell you if there is more data than expected.


BowenM · 05-12-2023

Thank you both.

I didn't include "look for error" on my list because with data in the arrays it almost always worked without error so it wasn't worth checking. However, I did not consider looking at the extra bytes.

According to help "This function does not convert all of the bytes if the size of binary string is not a multiple of the size of type". If I'm reading that right, I could theoretically still get a "multi" cluster that would convert as a "single" with zero remaining bytes... but I think a combination of these methods will work well enough.

Side note: It seems like every time I've picked up a project from LabVIEW 8 ish era or earlier, it was super common to see config files with just a flattened cluster. And every single time I've seen it, the practice has caused me problems. I sure am glad better methods have come about like JSON or XML files...

rolfk · 05-12-2023

I have used binary flattened files regularly. But the magic is to have some versioning if you allow for different file types. I typically start such files with a "magic" 32-bit integer. Then each block has its own "magic" 32-bit integer. That way it is actually fairly easy to add new block types to a file.
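
A minimal sketch of that kind of tagged layout, written in Python for illustration. The magic values, the function names, and in particular the per-block length field are assumptions added here (the length lets a reader skip block types it does not know); they are not taken from any existing file format.

```python
import struct

# Hypothetical magic numbers, chosen so they read as ASCII in a hex dump.
FILE_MAGIC = 0x54444631    # "TDF1"
BLOCK_SINGLE = 0x53494E47  # "SING"
BLOCK_MULTI = 0x4D554C54   # "MULT"


def write_file(blocks) -> bytes:
    """blocks: iterable of (tag, payload_bytes) pairs."""
    out = struct.pack(">I", FILE_MAGIC)
    for tag, payload in blocks:
        # Block header: magic tag, then payload length (an added assumption).
        out += struct.pack(">II", tag, len(payload)) + payload
    return out


def read_file(data: bytes):
    """Return the list of (tag, payload_bytes) pairs, or raise ValueError."""
    if len(data) < 4 or struct.unpack_from(">I", data, 0)[0] != FILE_MAGIC:
        raise ValueError("not a recognized file")
    pos, blocks = 4, []
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated block header")
        tag, size = struct.unpack_from(">II", data, pos)
        pos += 8
        if pos + size > len(data):
            raise ValueError("truncated block")
        blocks.append((tag, data[pos:pos + size]))
        pos += size
    return blocks
```

With a tag in front of each block, the reader dispatches on the tag instead of guessing which cluster was flattened.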

Of course, one problem with binary files remains: a single corrupted byte can make the whole file pretty difficult to parse.

Rolf Kalbermatter

mcduff · 05-13-2023

@BertMcMahan wrote:

@altenbach wrote:

If you try to unflatten to the two data structures in parallel, I would expect one to typically generate an error or a "remaining string" that is not empty. Have you tried?

I was curious about this so I made the following example:

As long as there are enough bytes in the flattened string, you can unflatten whatever you'd like. You both probably knew this already but I wasn't sure.

I might try doing both, then doing some checks on the data to see if they make sense. For example, a multi-channel system will have the same number of elements in the Offset, Gain, and Name arrays. It'll be very unlikely for that to accidentally happen. Add your option 3 to look for non-ASCII characters and you could filter even more errors (the ol' "Swiss Cheese Model").

If all else fails and it's still ambiguous, you could have a popup for the user to view. Something like plotting the data points in a popup two ways and asking the user to identify which one "looks right".

Can't give any advice but this problem is not easy. Two different structures can give the same flattened string. ¯\_(ツ)_/¯

raphschru · 05-13-2023

Indeed you could deliberately choose values that work with both clusters without error and remaining string:

In the first example, all real values are single-precision floats, so both scalar zeros and empty arrays flatten to 4 null bytes.

The second example is trickier. I have chosen a double number that has 0x0000 0001 in its first 4 bytes, so that it is the same as an array length of 1 in the other cluster. A similar trick could be done with the channel names, but it would require names with unprintable characters…

While it is possible for both clusters to produce the same flattened string, with real-world data it is highly unlikely because it would require either null data (like example 1) or extremely small numbers or names with unrealistic characters (like example 2).
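
The two tricks can be reproduced in a few lines of Python. This is a sketch of the byte layout only, not the VIs from the examples above: it assumes big-endian flattening with I32 length prefixes and a hypothetical field order of Offset, Gain, Name, and all function names are illustrative.

```python
import struct


def flat_str(s: bytes) -> bytes:
    # Strings flatten as an I32 length followed by the bytes.
    return struct.pack(">i", len(s)) + s


def flat_array(items) -> bytes:
    # Arrays flatten as an I32 element count followed by the elements.
    items = list(items)
    return struct.pack(">i", len(items)) + b"".join(items)


def flatten_single(offset, gain, name, real=">f"):
    # real=">f" is SGL, real=">d" is DBL.
    return struct.pack(real, offset) + struct.pack(real, gain) + flat_str(name)


def flatten_multi(offsets, gains, names, real=">f"):
    return (flat_array(struct.pack(real, v) for v in offsets)
            + flat_array(struct.pack(real, v) for v in gains)
            + flat_array(flat_str(n) for n in names))


# Example 1: SGL zeros plus an empty name vs. three empty arrays.
single_bytes = flatten_single(0.0, 0.0, b"")
multi_bytes = flatten_multi([], [], [])
ambiguous = single_bytes == multi_bytes  # both are 12 null bytes

# Example 2: a DBL whose first four bytes are 0x00000001, i.e. the same
# bytes as an array length of 1. It decodes to a subnormal, nonzero value.
tiny = struct.unpack(">d", bytes.fromhex("0000000100000000"))[0]
```

Here `ambiguous` is `True`, and `tiny` is far smaller than any realistic offset or gain, which is why a magnitude check catches the second case in practice.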

altenbach · 05-13-2023

Yes, if we assume that the orange values are SGL and allow empty arrays, we might get an ambiguity.

Of course downstream we probably want the same cluster for both versions, so we might create arrays with one element if the scalar data is encountered.
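
As a small illustration of that normalization, sketched in Python with hypothetical names (`ChannelConfig`, `from_single`) and assumed field types:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChannelConfig:
    """Common downstream shape: always arrays, even for one channel."""
    offsets: List[float]
    gains: List[float]
    names: List[str]


def from_single(offset: float, gain: float, name: str) -> ChannelConfig:
    # Wrap each scalar field in a one-element list.
    return ChannelConfig([offset], [gain], [name])
```

The viewer then only ever handles the array form, whichever header type the file contained.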

