
Now that I'm on my laptop I have access to the two simple programs. First, for text output:

  import encodings
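  # convert(s, destEncoding, srcEncoding): the UTF-8 source is converted to DOS codepage 850 for the console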
  var hello1 = convert("Hellø, wørld!", "850", "UTF-8")

  # Doesn't work - seems to think current codepage is utf8.
  var hello2 = convert("Hellø, wørld!", getCurrentEncoding(), "UTF-8")

  # Outputs correct text:
  echo hello1
  # Outputs corrupted text:
  echo hello2

(And simply outputting an unconverted string fails, just like hello2 does.)

As for the linking problem I had: the main issue is that getting started with GUI programming in Nim on Windows seems harder than it probably should be. The following code works, provided one manually obtains sdl2.dll (and sdl2.lib for static linking) - however, even with static linking flags (and no errors), the resulting exe still depends on the presence of sdl2.dll:

  import nimx.window
  import nimx.text_field
  import nimx.system_logger

  proc startApp() =
    var window = newWindow(newRect(40, 40, 800, 600))
    let label = newLabel(newRect(20, 20, 150, 20))
    label.text = "Hellø, wørld!"
    window.addSubview(label)

  runApplication:
    startApp()

With sdl2.dll (and .lib) in the same folder, both of these work:

  nim --threads:on --dynlibOverride:SDL2 --passL:SDL2.lib c -r win.nim

  nim --threads:on c -r win.nim

but both fail without sdl2.dll present (ie: the "statically" linked exe still depends on dynamically loading sdl2.dll).

And there's so far no easy way of getting a "supported" sdl2.dll to go with the Nim compiler - as far as I can tell, neither "nimble install sdl2" nor "nimble install nimx" provides a way to get sdl2.dll and/or the C source to compile it.

But perhaps managing DLLs and such is considered out of scope for nimble/the Nim package manager for now.




If you have an "echo" primitive/function and "out of the box Unicode support", then respecting the host OS codepage/output encoding by default is the sane thing to do (as indicated in the bug linked by the sibling comment: "just use the win32 api").

It's not that Nim can't output wide characters from a UTF-8 source, it's just that it's not obvious how to do it in a standard way - one might think that a UTF-8 Unicode "hello world" should "just work" on Windows, and it doesn't.

It doesn't really make much sense to only do the right thing on systems that happen to have a utf8 locale (it's not that windows doesn't handle wide strings, it just doesn't have a utf8 locale by default).

It's not my impression that the nim community doesn't want to be cross-platform and beginner friendly - it's just that they're going through a phase of modernizing the win(32) sub-system.

Correct text handling isn't "just use utf8", as fun as that would be - correct handling is figuring out "what encoding is the source text", "what encoding is the destination file/device" and "how do I put the source into the destination".

I'd expect a latin1, a utf16 and a utf8 string to all be output correctly with "echo" - on all supported platforms. It's kind of why you would want to use a higher level language with "batteries included" in the first place.
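
To make that concrete, here's a rough sketch of the general idea a runtime's "echo" could follow on Windows (in C++, and not Nim's actual implementation - just an illustration): write UTF-16 to a real console, and raw UTF-8 bytes when output is redirected.

  #include <windows.h>
  #include <cstdio>
  #include <string>

  void echoUtf8(const std::string &utf8) {
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;
    if (GetConsoleMode(h, &mode)) {
      // stdout is an actual console: convert UTF-8 to UTF-16 and use the wide API.
      int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
      std::wstring w(n, L'\0');
      MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &w[0], n);
      DWORD written = 0;
      WriteConsoleW(h, w.data(), (DWORD)w.size(), &written, nullptr);
    } else {
      // Redirected to a file or pipe: just emit the UTF-8 bytes as-is.
      fwrite(utf8.data(), 1, utf8.size(), stdout);
    }
  }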


Good to know that cross-compilation is supported at least.

LLVM is supported on Windows. [1]

Talking to Windows APIs requires UTF-16, but UTF-8 can (and should!) be used in all languages on all platforms internally. It's trivial to add a UTF-8-to-16 conversion step when calling a Windows API (and the other direction on receiving such strings from Windows).

[1] http://llvm.org/docs/GettingStartedVS.html
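
For anyone unfamiliar with what that conversion step looks like in practice, a minimal sketch using the Win32 API directly (the helper name is just for illustration):

  #include <windows.h>
  #include <string>

  std::wstring utf8ToUtf16(const std::string &s) {
    if (s.empty()) return std::wstring();
    // First call computes the required length in wide characters.
    int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), nullptr, 0);
    std::wstring w(n, L'\0');
    // Second call performs the actual conversion.
    MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &w[0], n);
    return w;
  }

  // Keep UTF-8 internally and convert only at the API boundary, e.g.:
  //   MessageBoxW(nullptr, utf8ToUtf16("Hellø, wørld!").c_str(), L"Demo", MB_OK);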


We "solved" (worked around? hacked?) this by creating a set of FunctionU macros and in some cases stubs that wrap all of the Windows entry points we use with incoming and outgoing converters. It's ugly under the hood and a bit slower than it needs to be, but the payoff of the app being able to consistently "think" in UTF-8 has been worth it.

Of course, we had to ditch resource-based string storage anyway for other cross-platform reasons, and were never particularly invested in the "Windows way" of doing things, so it wasn't a big shock to our developers when we made this change.


We constantly have to deal with Win32 as a build platform, and we write our apps natively for that platform using wchar. I think the main difficulty is that most developers hate adding another library to their stack, and to make matters worse, displaying this text in a Windows GUI requires conversion to wchar. That's why I think they are in for a lot of resistance, at least in the Windows world. If the Windows APIs were friendlier to UTF-8, there might be hope. But as it stands right now, using UTF-8 requires the CA2W/CW2A macros, which is just a lot of dancing to keep your strings in UTF-8 when they ultimately must be rendered as wchar/UTF-16.

Maybe there might be a shot in getting developers to switch if Windows GUIs/native API would render Unicode text presented in UTF-8. But right now, it's back to encoding/decoding.
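
For readers who haven't seen them, the CA2W/CW2A dance looks roughly like this (ATL conversion helpers; a sketch assuming ATL headers are available and the std::string holds UTF-8):

  #include <windows.h>
  #include <atlconv.h>
  #include <string>

  void setLabelText(HWND hwnd, const std::string &utf8) {
    // CA2W builds a temporary wide copy so the -W API can render it;
    // the application itself keeps "thinking" in UTF-8.
    SetWindowTextW(hwnd, CA2W(utf8.c_str(), CP_UTF8));
  }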


See, it just broke when UTF-8 was interpreted as ASCII. It's entirely possible to treat bytes as bytes and leave encoding out of it for the vast majority of programs. If you're dealing with text editing and so on, then you know you need to be UTF-8 aware, and the broken programs would still be broken in either language.

The visibility of the errors is a minor point, but I think it more appropriate that it be solved by e.g. the windowing toolkit API.


Windows UTF-8 support is relatively recent and I have no experience with it, so I don’t know. However I expect Windows to just do the same conversions internally or in the linked runtime, that the program would otherwise have to do by itself. I’d assume that there will be edge cases that such programs then can’t handle, such as UI input and file paths containing unpaired surrogates.

For our multi-platform game projects we have settled with UTF-8, UTF-16 and UTF-32 using the Unicode/LLVM standalone source code to convert between them.

We no longer mess around with code pages, or old-school multibyte encodings like Shift-JIS, and we also don't compile Windows executables in "UNICODE mode", instead we keep all strings as UTF-8, and convert from and to UTF-16 (on Windows) or UTF-32 (on UNIX-like operating systems). Conversion to and from UTF-32 is hardly necessary though, since outside Windows, everything seems to use UTF-8 anyway.

Instead of using OS functions like MultiByteToWideChar() or the iconv library, we have integrated the LLVM/Unicode standalone UTF conversion functions: http://llvm.org/docs/doxygen/html/ConvertUTF_8h_source.html, although with C++11 these conversions are now builtin.
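
For reference, the C++11 built-in route mentioned above looks like this (std::wstring_convert/std::codecvt_utf8_utf16, which work but were later deprecated in C++17):

  #include <codecvt>
  #include <locale>
  #include <string>

  std::u16string toUtf16(const std::string &utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
  }

  std::string toUtf8(const std::u16string &utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
  }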

A few other notable points:

- properly handling IME input for Asian languages can be tricky (for fullscreen 3D games)

- Arabic text rendering is tricky (not because of right-to-left, but because a character's appearance changes depending on whether it sits at the start or end of a word, and there is nearly no sample code around which demonstrates this behaviour - the one example we found had all-Arabic comments)

- some Asian languages require incredibly huge font textures (most 3D-game text renderers are font-texture based, as far as I'm aware)

[edit: removed redundant link]


I had to check, but apparently that'll work for UTF-8 on POSIX (Windows, not so much) with Unicode string literals and -std=c++11 (or higher, presumably). Nim assumes UTF-8. For C++:

  puts(u8"World is not spelled 'wørld'");

Trying to handle character encoding on Windows in multi-platform programs is a nightmare. In C++ you can almost always get away with treating C strings as UTF-8 for input/output and you only need special consideration for the encoding if you want to do language-based tasks like converting to lowercase or measuring the "display width" of a string. Not on Windows. Whether or not you define the magical UNICODE macro, Windows will fail to open UTF-8 encoded filenames using standard C library functions. You have to use non-standard wchar overloads or use the Windows API. That is to say, there is no standard-conformant internationalization-friendly way to open a file by name on Windows in C or C++. I really wish Microsoft would at least support UTF-8, even if they want to stick with UTF-16 internally.

The section titled "How to do text on Windows" on http://utf8everywhere.org/#windows covers the insanity in more detail.
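
For completeness, the non-standard detour described above looks something like this (a sketch only; the helper name is made up):

  #include <windows.h>
  #include <cstdio>
  #include <string>

  FILE *fopenUtf8(const std::string &utf8Path, const wchar_t *mode) {
    // Convert the UTF-8 path to UTF-16 (-1 length includes the terminator)...
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, nullptr, 0);
    std::wstring wpath(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, &wpath[0], n);
    // ...and use the non-standard wide overload (Windows-only).
    return _wfopen(wpath.c_str(), mode);
  }

  // Usage: FILE *f = fopenUtf8("wørld.txt", L"rb");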


With system-default code pages on Windows, it's not only platform-dependent, it's also System Locale dependent.

Windows badly dropped the ball here by not providing a simple opt-in way to make all the Ansi functions (TextOutA, etc.) use the UTF-8 code page until many, many years later, with the manifest file. This should have been a feature introduced in NT4 or Windows 98, not something that was put off until midway through Windows 10's development cycle.
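
(The opt-in did eventually arrive as the ActiveCodePage manifest setting in a late Windows 10 update; from memory the documented fragment looks roughly like the following, so treat the exact namespaces as approximate:)

  <assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0">
    <application>
      <windowsSettings>
        <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
      </windowsSettings>
    </application>
  </assembly>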


The window manager seems to do a bunch of different things, some with files and input, so basically it would expect non-UTF text for those operations.

re. UTF-8 support: ncurses does include this, in libncursesw. It seems to work, although I haven't tried anything outside of Latin-1.

> Firstly you do not need A functions. They are there only for old programs. For really old programs from win9x era.

See here, it might be that they recognized that choices made in the 90's were wrong: https://docs.microsoft.com/en-us/windows/apps/design/globali...

"Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes."

> I like this one. Maybe you are right after all.

Thanks :-)


Another reason is Windows. Even READLINE breaks when used on Windows with UTF-8 characters.

> that release finally made UTF-8 the standard text encoding

This is a bit of a reach unfortunately. UTF-16 (most often broken WTF-16) is still the standard text encoding on NT. The manifest change only makes A-suffixed functions always do a "UTF-8 to UTF-16" conversion, instead of the unhelpful "local codepage to UTF-16" conversion. Sure, that's better, but:

- Microsoft has expressed no interest in switching the actual underlying encoding to UTF-8, they will likely never do so in NT.

- A-suffixed functions only exist for part of the Win32 API. Newer Win32 functions are defined as taking only Unicode strings. All COM, WinRT and native (Rtl/Nt) APIs use Unicode strings as well. The UTF-8 API surface is actually quite small - one might have to step out of the UTF-8 comfort zone in practice.

- Performance is being left on the table. Transcoding still has to be done, it would be more beneficial to at least keep it local for optimization purposes. The most trivial, yet still important use case: constant string conversions can't be optimized across DLL boundaries.

- Unlike, say, DPI awareness mode, there is no official API to change this behavior at runtime. It has to be done through the manifest, and a compiler implicitly embedding such a manifest into every executable it produces is not ideal.

TL;DR: It's an improvement, but just barely, so bumping the minimum version just for this "feature" is not worth it IMO. Just bite the bullet and convert to UTF-16 if you have to, it's conceptually simpler and likely faster, too.


Thanks! Yeah I wouldn't try to detect encodings or use heuristics either. If you could just reduce a single pattern into the OR of a bunch of byte sequences in each encoding, I think that should work? I'm not sure how easy that is with the interface you're given. (I wouldn't call UTF-16 'broken', but either way... it's a reality; a huge fraction of the time when you're searching binary files on Windows it's to find text inside executables, which on Windows are generally UTF-16.)
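
A toy sketch of that "OR of byte sequences" idea (ASCII-only needle, made-up helper, not taken from any actual tool):

  #include <string>
  #include <vector>

  // Expand one ASCII needle into the byte patterns to search for:
  // the UTF-8 form (identical to ASCII here) and the UTF-16LE form.
  std::vector<std::string> needleForms(const std::string &ascii) {
    std::string utf16le;
    for (unsigned char c : ascii) {
      utf16le.push_back((char)c);
      utf16le.push_back('\0');   // little-endian high byte
    }
    return {ascii, utf16le};     // match if any form occurs in the raw bytes
  }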

> Unfortunately the newer FreePascal/Delphi versions made it all very confusing by adding an encoding field

I can clear this up. This change is really useful and important if you are using strings.

A string of bytes has attached metadata saying what it is. Is it ANSI of some sort, or UTF8, or...? Is it a specific encoding, such as Windows-1252? Without that data, all you have are bytes, and you don't know how to interpret them.

Thus, RawByteString (bytes); UTF8String (UTF8); and ANSI strings with the encoding, plus UnicodeString which is native Unicode on whichever platform (eg, on Windows it matches Windows UTF16.)

This data is essential to convert to and from different string types. I don't know where conversion does "not always" happen - can't think of anywhere offhand. But if you ever run into issues, there are RTL functions for conversion. Check out the TEncoding class: http://docwiki.embarcadero.com/Libraries/Tokyo/en/System.Sys...

> In the past you could just assume all strings are UTF-8 as code style rule.

This was an incorrect assumption, because before the encoding metadata was added, you would have been using an AnsiString there, and that, by definition ('Ansi'), is not UTF8. These days, if you have a UTF8 string, you can place it in a UTF8String type. Correctness enforced by libraries is much better than a coding convention that a certain type contains a subtly different payload. That way lies horror. Metadata and strong typing is much safer.

--

I do agree with you that Delphi has the best strings :) Copy on write and embedded length both seem a real win, after twenty years of use, not to mention great string twiddling methods.

I'm looking at adding string_view support to the strings currently (for C++17 support); one thing it highlights is how much more powerful the inbuilt String types are, and how much string_view is a workaround for a problem in C++'s string design which other string libraries - not just ours, but ours is IMO very good - do not suffer from.


And more than that can cause compatibility issues. For example, it's a bit of a pain to use UTF-8 in the normal Windows cmd.

It says you only plan UTF-8 support and no Windows line endings. The Windows line endings part is OK, and I get why UTF-8 made things easier in Rust (although there are some good crates for handling encodings). However, this seems awfully restrictive to me.