Never Gonna BOM You Up

.NET supported Unicode from its very beginning. Pretty much anything you might need for Unicode manipulation is there. Yes, as early adopters, they made a bet on UTF-16 that didn’t pay off since the rest of the world has moved toward UTF-8 as an (almost) exclusive encoding. However, if we ignore the slightly higher memory footprint, C# strings made Unicode as easy as it gets.

And, while UTF-8 is not the native encoding for its strings, C# is no slouch and has a convenient Encoding.UTF8 static property allowing for easy conversion. However, if you hand that Encoding.UTF8 to anything that writes a file or stream (a StreamWriter, File.WriteAllText, and friends), you will get a bit extra.

That something extra is the byte order mark (BOM). Its intention is noble - to help detect endianness. However, for UTF-8 it is of dubious help since a byte-oriented encoding doesn’t really have endianness issues to start with. The Unicode specification itself does allow for one but doesn’t recommend it. It merely acknowledges it might appear as a side effect of converting data from other Unicode encodings that do have endianness.
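For the curious, those three bytes are easy to see for yourself. A minimal sketch (assuming nothing more than a modern .NET console project with top-level statements) that just prints the preamble Encoding.UTF8 advertises:

using System;
using System.Text;

// Prints EF-BB-BF - the three bytes of the UTF-8 BOM.
byte[] bom = Encoding.UTF8.GetPreamble();
Console.WriteLine(BitConverter.ToString(bom));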

So, in theory, UTF-8 with a BOM should be perfectly acceptable. In practice, only Microsoft really embraced the UTF-8 BOM. Pretty much everybody else decided to go with UTF-8 without a BOM since that kept full compatibility with 7-bit ASCII.

With time, .NET/C# stopped being Windows-only and has by now become a really good multiplatform solution. And now a helper that ought to simplify things is actually producing output that will annoy many command-line tools that don’t expect it. If you read the documentation, a solution exists - just create your own UTF-8 encoding instance.

// Same UTF-8 encoding, but with an empty preamble - no BOM gets written.
private static readonly Encoding Utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

Now you can pass Utf8 around instead and you will get the expected result on all platforms, including Windows - no BOM, no problems.
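To make the difference concrete, here is a small sketch (file names are purely illustrative, and the local noBomUtf8 mirrors the Utf8 field above) that writes the same text twice and compares sizes:

using System;
using System.IO;
using System.Text;

var noBomUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

// The default Encoding.UTF8 has a preamble, so the writer prepends the BOM.
File.WriteAllText("with-bom.txt", "Hello", Encoding.UTF8);
// Our own instance has no preamble, so the file starts with the text itself.
File.WriteAllText("without-bom.txt", "Hello", noBomUtf8);

Console.WriteLine(File.ReadAllBytes("with-bom.txt").Length);     // 8 bytes (BOM + "Hello")
Console.WriteLine(File.ReadAllBytes("without-bom.txt").Length);  // 5 bytes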

So, one could argue that the Encoding.UTF8 default should be changed to a more appropriate value. I mean, .NET is multiplatform and the current default doesn’t work well everywhere. One could argue, but this default is not changing, ever.

When any project starts, decisions must be made. And you won’t know for a while if those decisions were good. On the other hand, people will start depending on whatever behavior you selected.

In the case of the BOM, it might be that a developer got so used to having those three extra bytes that, instead of checking the file content, they simply treat anything of 3 bytes or fewer as an empty file. Or they have a script that takes the output of some C# application and blindly strips the first three bytes before passing it to a non-BOM-friendly input. Or any other decision somebody made in a project years ago. It doesn’t really matter how bad someone’s code is. What matters is that the code currently works, and a new .NET release shouldn’t silently break it.
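Purely as an illustration (this helper is hypothetical, not taken from any real codebase), the kind of code that quietly bakes in those three bytes might look like this:

using System.IO;

// Hypothetical legacy check: anything of 3 bytes or fewer counts as "empty".
// It only holds because the producer always writes a BOM before the content.
static bool IsEffectivelyEmpty(string path) =>
    new FileInfo(path).Length <= 3;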

So, I am reasonably sure that Microsoft won’t ever change this default. And, begrudgingly, I agree with that. Some bad choices are simply meant to stay around.


PS: And don’t get me started on GUIDs and their binary format…