Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Windows is also a rare bird in UTF-16.

"UTF-16 is used by the Windows API, and by many programming environments such as Java and Qt. The variable-length character of UTF-16, combined with the fact that most characters are not variable-length (so variable length is rarely tested), has led to many bugs in software, including in Windows itself.

"UTF-16 is the only encoding (still) allowed on the web that is incompatible with 8-bit ASCII. It has never gained popularity on the web, where it is declared by under 0.004% of public web pages (and even then, the web pages are most likely also using UTF-8). UTF-8, by comparison, gained dominance years ago and accounted for 99% of all web pages by 2025."

https://en.wikipedia.org/wiki/UTF-16

 help



NT shipped with USC-2 as UTF-8 (and -16) did not yet exist. USC-2 naturally translated to UTF-16, hence the choice. NT/Win32 is also designed for fixed-with code units, something UTF-8 doesn't support.

You can use UTF-8 on a per-application basis, within limits.

https://learn.microsoft.com/en-us/windows/apps/design/global...

Conversely, UEFI is UTF-16 only, thanks to Windows.

UTF-8 only would be an ABI breaking change, so that's not going to happen. We don't want the NT kernel to end up like Linux, after all :-)


> UTF-8 only would be an ABI breaking change, so that's not going to happen. We don't want the NT kernel to end up like Linux, after all :-)

I think you're making a joke, but it still doesn't make sense. Linux does avoid breaking changes to its userspace ABI


Linux kernel's ABI/API surface is completely byte-based and is tiny compared to Win32 API. The stable ABI of Windows is strictly in the user space and it covers the entire useful operating system. Don't forget that glibc isn't that ABI stable nor anything that goes all the way up to systemd/X11/gettext/Wayland/cairo/GTK/Qt/glade/Pipewire/xdg. Nothing equivalent in Linux environment is stable, especially compared against Win32 system libraries.

You encounter encoding requirements not in kernel system calls but in more user-facing sides of the OS. So your comparison should include all the userspace components of a Linux system (e.g. gettext), if you're considering the places where you encounter actual Unicode string-based operations where UTF-16 comes into play in Windows.

ALL OF THE equivalent Windows libraries (user32.dll, kernel32.dll, ws2_32.dll, shellex.dll ...) to the Linux ones I counted above and more are ABI and API stable.


See, I was sympathetic to that view, except that they specifically posted

> We don't want the NT kernel to end up like Linux, after all

(Emphasis mine)

And that's... The one part of the OS where Linux is ABI stable.


I think they didn't mean Linux as in the kernel sense but in the overall "Family of OSes and ecosystem with very unstable APIs" sense.

Additional Detail: it is specifically utf-16 little endian when a byte order mark is not used, which is the opposite of the recommended choice of big endian in the RFC.

Worse are the byte order marks required to support both endians that end up in files.


The development of Windows NT based on UCS-2 precedes the RFC by roughly a decade, and little-endian was the natural choice for the Intel PC platform. Obviously the endianness had to remain the same when UCS-2 was extended to support UTF-16.

Also true. The BOMs though are annoying.

UTF-16 is also used by C#, Java, and JavaScript. Since JavaScript is so widely adopted, I wouldn't call it a rare bird. Not necessarily used when reading or writing files, but it's what's used internally for the strings. As a result, your strings use UTF-16 surrogate pairs to represent characters outside of the basic multilingual plane (such as Emoji).

UTF-16 is the internal format of the ICU library (International Components for Unicode, the support library from the Unicode standards people) which is a common way to add "full fat" Unicode support to a programming language. This has knock-on effects everywhere. If you're using ICU, you either use UTF-16, too, or you constantly convert back and forth every time you interact with ICU. You're often best off using UTF-16 in memory and only converting to UTF-8 when you write files or transmit over the network.

> Windows is also a rare bird in UTF-16.

Web browsers use UTF-16 internally. So Windows isn't even largest "platform" that uses UTF-16.


> Windows is also a rare bird in UTF-16.

an interesting tidbit, some Windows kernel developer realized that most registry keys are ascii anyways so they could save up to 50% space simply by storing the name as ascii. The flag is called "compressed name" and they will pad with 0x00 when reading the name to make a proper utf-16 string.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: