
UTF-8 to UTF-8 BOM conversion

Solved!

I want to convert UTF-8 text to UTF-8 text with a BOM.

Does anybody have a program they could share?

Thank you for your support!

Message 1 of 9
Solution
Accepted by topic author mmm21

That is pretty simple. Just prepend the 3 bytes 0xEF, 0xBB, 0xBF (in this order) to the actual UTF-8 byte stream and you are done. Use the Write to Binary File node to write this to disk. The Write to Text File node assumes that the incoming byte stream is in the current locale encoding (ANSI under Windows) and will indiscriminately do line-end translation and the like on the bytes before writing them to disk. Write to Binary File is guaranteed to leave the bytes alone and will not accidentally mess up your UTF-8 encoded byte stream by assuming it is encoded in your locally configured code page.
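For reference, the same idea as a minimal Python sketch (the file names are just examples):

```python
UTF8_BOM = b"\xEF\xBB\xBF"

with open("input.txt", "rb") as f:   # read the raw bytes, no decoding
    data = f.read()

if not data.startswith(UTF8_BOM):    # only prepend the BOM once
    data = UTF8_BOM + data

with open("output.txt", "wb") as f:  # binary mode: bytes are written as-is
    f.write(data)
```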

Rolf Kalbermatter  My Blog
DEMO, Electronic and Mechanical Support department, room 36.LB00.390
Message 2 of 9


@rolfk wrote:

The Write to Text File node assumes that the incoming byte stream is in the current locale encoding (ANSI under Windows) and will indiscriminately do line-end translation and the like on the bytes before writing them to disk.


I never used anything else than Write to Text File, and never noticed any automatic conversion (after turning Convert EOL off by right-clicking it, of course).

Message 4 of 9

wiebe@CARYA wrote:

@rolfk wrote:

The Write to Text File node assumes that the incoming byte stream is in the current locale encoding (ANSI under Windows) and will indiscriminately do line-end translation and the like on the bytes before writing them to disk.


I never used anything else than Write to Text File, and never noticed any automatic conversion (after turning Convert EOL off by right-clicking it, of course).


It shouldn't, and generally doesn't, but Write to Binary File is the method to use if you do not want any chance of modifications to the byte stream.
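To make the hazard concrete, a small Python sketch of text mode versus binary mode (the Windows behaviour is the point):

```python
# On Windows, text mode translates "\n" to "\r\n" on write; binary mode
# leaves every byte exactly as handed over.
content = "line1\nline2\n"

with open("text_mode.txt", "w") as f:     # text mode: EOL translation happens
    f.write(content)

with open("binary_mode.txt", "wb") as f:  # binary mode: bytes pass through untouched
    f.write(content.encode("utf-8"))

print(open("text_mode.txt", "rb").read())    # b'line1\r\nline2\r\n' on Windows
print(open("binary_mode.txt", "rb").read())  # b'line1\nline2\n' on every platform
```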

 

In a perfect world LabVIEW would have full Unicode support, and LabVIEW strings would have an encoding attribute that tells all string functions how to deal with them. The file read and write functions would have an additional terminal to allow adding a BOM, or to interpret a BOM present in the file to determine that file's encoding.

 

In the real world, LabVIEW strings are still simply syntactic sugar around a byte stream, and their interpretation as text relies on whatever the current platform locale configuration is, independent of what you actually put into those bytes. And yes, in that perfect world there would of course also be conversion functions that turn a binary stream into a LabVIEW string, with a parameter telling the function to use a BOM if present or a specific encoding, plus the corresponding reverse function.
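For comparison, Python's "utf-8-sig" codec is an existing example of such a BOM-aware conversion pair (a minimal sketch):

```python
# "utf-8-sig" behaves like the conversion pair described above:
# encoding prepends the BOM, decoding consumes one if present.
text = "Grüße"

raw = text.encode("utf-8-sig")            # b'\xef\xbb\xbfGr\xc3\xbc\xc3\x9fe'
assert raw.startswith(b"\xEF\xBB\xBF")

assert raw.decode("utf-8-sig") == text    # BOM stripped again on decode
assert not text.encode("utf-8").startswith(b"\xEF\xBB\xBF")  # plain UTF-8: no BOM
```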

 

The difficulty would be how to specify that encoding parameter. Most programming languages, like Java, Python and .Net, use a string, as that is generally the most flexible. But that brings the difficulty that you need to know the syntax of those strings, and that they could name encodings that are not supported on the current platform. The alternative of using an enum is worse, however, as it would limit the encodings to the few the developers thought of, rather than whatever the current platform can support. On Windows things are even hairier, as the Unicode translation functions use codepages that are just 16-bit numbers with magic values. For standard codepages there is a direct mapping between the "cpXXXX" name and the numeric code, but UTF-8 and the pseudo-encodings ANSI and OEM, which simply tell Windows to use the currently configured locale (itself one of the actual codepage values), are magic numbers. Other platforms only sometimes understand the Windows "cpXXXX" names and usually use other forms such as "UTF8", "US-ASCII", "ISO-8859-1", etc.
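To illustrate, a rough sketch of those magic values and the name-based lookups used elsewhere (the numeric constants are the ones defined in the Windows SDK; the rest is just Python's codec registry):

```python
import codecs

# The Windows codepage "magic values" mentioned above (constants as defined
# in the Windows SDK; the descriptions are indicative, not exhaustive).
WINDOWS_CODEPAGES = {
    0:     "CP_ACP   - the currently configured ANSI codepage",
    1:     "CP_OEMCP - the currently configured OEM codepage",
    65000: "CP_UTF7  - UTF-7",
    65001: "CP_UTF8  - UTF-8",
    1252:  "cp1252   - an actual codepage (Western European)",
}

# Other platforms use IANA-style names instead; Python's codec registry
# resolves several spellings to the same canonical codec:
assert codecs.lookup("cp1252").name == codecs.lookup("windows-1252").name
assert codecs.lookup("UTF8").name == codecs.lookup("utf_8").name
```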

Rolf Kalbermatter  My Blog
DEMO, Electronic and Mechanical Support department, room 36.LB00.390
Message 5 of 9

@rolfk wrote:

wiebe@CARYA wrote:

@rolfk wrote:

The Write to Text File node assumes that the incoming byte stream is in the current locale encoding (ANSI under Windows) and will indiscriminately do line-end translation and the like on the bytes before writing them to disk.


I never used anything else than Write to Text File, and never noticed any automatic conversion (after turning Convert EOL off by right-clicking it, of course).


It shouldn't, and generally doesn't, but Write to Binary File is the method to use if you do not want any chance of modifications to the byte stream.


Ironically, I probably started using Write to Text File because Write to Binary File manipulated my data.

 

I'm not sure if the option to not prepend the string/array size has always been there. IIRC, that was the problem I 'solved' by using Write to Text File. I probably made that decision 20 years ago... I could have simply overlooked the option, or maybe I was fed up with having to wire it.
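For context, Write to Binary File's default is to prepend an int32 byte count to strings and arrays; in Python terms, roughly (a rough sketch of the length prefix, not LabVIEW's actual implementation):

```python
import struct

data = b"hello"

# What the "prepend array or string size?" input (default TRUE) does, roughly:
# an int32 byte count (big-endian by default) precedes the payload.
with_size = struct.pack(">i", len(data)) + data
print(with_size)   # b'\x00\x00\x00\x05hello'

# With that input wired FALSE, only the payload itself is written.
print(data)        # b'hello'
```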

Message 6 of 9

wiebe@CARYA wrote:


Ironically, I probably started using Write to Text File because Write to Binary File manipulated my data.


I can't remember that option not being there, but it's possible that it wasn't. I simply prefer the clear and obvious guarantee that the Write to Binary File function won't mess with the bytes at all and stores them exactly as they are in memory.

 

I have tried for a long time to get the point across to NI that this "string == byte stream" notion should have been abandoned many, many moons ago. All the functions that operate on IO should in general NOT have string inputs and outputs for the data stream, but byte arrays instead. There could be convenience wrappers around those that accept strings, but they should also let you indicate the encoding desired on the external side, together with encoding conversion functions that convert strings to binary streams and vice versa.

If NI had started doing that 20 years ago, they could have adopted a system that initially provided compatibility functions working like the old methods, but built on a clean string interface with the encoding as an actual attribute of the string. By now those compatibility functions would reside in some _string_compatibility.llb library and would have been off the palettes since at least LabVIEW 2009. The new encoding-aware string could have been introduced as a new data type, with type code 0x31 instead of 0x30, just as was done when the boolean changed from an int16 to an int8 between LabVIEW 4 and LabVIEW 5. And yes, the flattened format would always be a UTF-8 string. (Yes, I know that type code 0x31 is actually already used for the undocumented and meanwhile unsupported handle data type. But it could also have been 0x38 or whatever.)
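A rough sketch of that layering in Python terms (the function names are hypothetical, purely to show the split):

```python
# Hypothetical sketch: the IO primitive deals in bytes only, and a thin
# convenience wrapper handles encoding at the boundary.

def write_bytes(path: str, data: bytes) -> None:
    """Core primitive: writes the byte stream exactly as given."""
    with open(path, "wb") as f:
        f.write(data)

def write_text(path: str, text: str, encoding: str = "utf-8",
               add_bom: bool = False) -> None:
    """Convenience wrapper: the encoding is explicit, never a platform default."""
    data = text.encode(encoding)
    if add_bom and encoding.lower().replace("-", "").replace("_", "") == "utf8":
        data = b"\xEF\xBB\xBF" + data
    write_bytes(path, data)

write_text("example.txt", "hello", encoding="utf-8", add_bom=True)
```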

 

It would have required a bit of effort to turn the UTF-8 support already present in the LabVIEW manager functions into an officially accessible interface, but it would have been fully independent of actually supporting Unicode in the UI: first make the underlying text manager able to deal with encodings, then worry about getting this functionality into the UI as well. Instead they tried to hack Unicode support directly into the UI, and then let it linger as an unfinished project in the code base without making any headway on properly supporting string encodings, at least in the programming interface.

Rolf Kalbermatter  My Blog
DEMO, Electronic and Mechanical Support department, room 36.LB00.390
Message 7 of 9

Well, you have my vote.

 

I hope NI's listening, as Unicode is on the roadmap...

Message 8 of 9

Thank you.

I could convert the strings by prepending the bytes you described.

Message 9 of 9