Can the authors expound on the reasons why they can't compile their language's string semantics into whatever representation will be used by WASI? Both C++ and Rust support numerous string representations, C++ even more so than Rust.
How would an end developer write code in one fashion (e.g. `let foo: string = "hello"`) while the compiler makes that work perfectly in every scenario? It would take a large amount of engineering effort compared to having one format that works well on the web to begin with.
How does a compiler ensure that when that string is passed to a Rust Wasm module it arrives as UTF-8, and then when moments later the same string is passed by the same module to JS it goes over as WTF-16?
How will the compiler know where the string is being passed after compilation (at runtime)?
What new syntax would you propose for TypeScript to make it possible to work with all string types? How would you keep TS/JS developer ergonomics on par with what currently exists?
If Interface Types were to consider the web a first-class citizen (because Wasm originated as a web feature), then interop between Wasm modules and JS would be considered of utmost importance, without making a web language (such as AssemblyScript) go to great lengths to engineer around the aforementioned complication.
I don't really understand the questions. Have you looked at any prior art to see how C++ and Rust handle different string representations? C++ is probably the best influence due to type coercion since it sounds like you care about ergonomics over correctness.
For FFI there's nothing a compiler can do. That's why FFI is unsafe and restricted to rudimentary types in most languages - it's up to the caller to ensure the data is laid out as the callee expects.
I also don't know what interface types have to do with anything. Wasm is far lower level than interfaces, and nothing is stopping you from implementing interfaces in your language and doing automatic type conversion through them to handle string representations as required.
Look past the web for a moment - wasm is a competitor with the JVM, GraalVM, and LLVM as a platform and implementation independent byte code. Think about how your language would be implemented on those targets before the web.
> On August 3rd, the WebAssembly CG will poll on whether JavaScript string semantics/encoding are out of scope of the Interface Types proposal. This decision will likely be backed by Google (C++), Mozilla (Rust) and the Bytecode Alliance (WASI), who appear to have a common interest to exclusively promote C++, Rust respectively non-Web semantics and concepts in WebAssembly.
> If the poll passes, which is likely, AssemblyScript will be severely impacted as the tools it has developed must be deprecated due to unresolvable correctness and security problems the decision imposes upon languages utilizing JavaScript-like 16-bit string semantics and its users.
So, the problem is that AssemblyScript wants to keep using UTF-16? I'm not sure I understand.
Is AssemblyScript the thing that lets you hand-write WebAsm?
I’m confused why they can’t just switch their (nascent) language to UTF-8, and if so why the alarmist attitude? I didn’t think they were mature enough to claim no breaking changes, for example.
I probably prefer we drag the web (and .Net and Java) platforms towards UTF-8, to be honest… but maybe that’s just me.
What should Blazor and TeaVM do when existing code allows for isolated surrogates? If they perform implicit conversions to UTF-8, they have the option to either trap or perform a lossy conversion, which has immense security and data integrity implications.
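To make those two options concrete, here is a small sketch using existing web APIs as an analogy only (not the Interface Types mechanism itself): `encodeURIComponent` traps on an isolated surrogate, while `TextEncoder` performs the lossy conversion to U+FFFD.

```typescript
// A string containing an isolated surrogate: a legal JS value, not valid UTF-16.
const s = "\uD800";

// Option 1: trap. encodeURIComponent rejects lone surrogates outright.
try {
  encodeURIComponent(s);
} catch (e) {
  console.log((e as Error).name); // "URIError"
}

// Option 2: lossy conversion. TextEncoder replaces the surrogate with U+FFFD.
const lossy = new TextDecoder().decode(new TextEncoder().encode(s));
console.log(lossy === s); // false: the original data did not survive the trip
```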
Realistically speaking, you can't "switch" AssemblyScript to UTF-8 unless you also decide it only can run in UTF-8 host environments (i.e. not web browsers). Right now it uses UTF-16, which is what the web uses. If you move it over to UTF-8 now every operation that passes strings to web APIs has to perform encoding and decoding, and you end up with a bunch of new performance and correctness issues. It's a very complex migration.
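A minimal sketch of that per-call overhead, assuming a hypothetical UTF-8-based AssemblyScript module that exposes its memory plus a pointer/length pair; the export names here are invented for illustration.

```typescript
const decoder = new TextDecoder("utf-8");

function readWasmString(instance: WebAssembly.Instance): string {
  // Hypothetical exports: linear memory plus a pointer/length pair for one string.
  const { memory, getMsgPtr, getMsgLen } = instance.exports as {
    memory: WebAssembly.Memory;
    getMsgPtr(): number;
    getMsgLen(): number;
  };
  // View the UTF-8 bytes in linear memory...
  const bytes = new Uint8Array(memory.buffer, getMsgPtr(), getMsgLen());
  // ...and transcode them into a (WTF-16) JS string. With a UTF-8 internal
  // representation, this decode would run for every string handed to a web API.
  return decoder.decode(bytes);
}
```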
P.S. the web will never switch to UTF-8. It would break too many web pages. Most browser vendors won't even accept breaking 0.1% of web pages, unless they're doing it to show you more ads (i.e. Chrome).
Yes, but WebAssembly operates at the boundary with JS, and that is not UTF-8. JS uses WTF-16 at runtime, and if WebAssembly did too then this would make interop between Wasm and JS a first-class feature with maximal performance and without security and data integrity issues.
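To illustrate what "WTF-16 at runtime" means in plain JS:

```typescript
// JS strings are sequences of 16-bit code units, not guaranteed well-formed UTF-16.
const emoji = "😀";
console.log(emoji.length);                     // 2: one surrogate pair, two code units
console.log(emoji.charCodeAt(0).toString(16)); // "d83d": the high surrogate

const lone = "\uD800"; // an isolated surrogate is a perfectly legal JS string...
console.log(lone.length); // 1
// ...but it has no valid UTF-8 encoding, which is where the boundary friction starts.
```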
The canonical representation of DOM content is DOMString (https://developer.mozilla.org/en-US/docs/Web/API/DOMString), which is not UTF-8. Your HTML being encoded in UTF-8 is irrelevant, it gets decoded when it's loaded into whatever the canonical representation is. Your HTML could be in Shift-JIS or ASCII or whatever and not UTF-8, same difference.
You can abstract it, but AssemblyScript did not. So it's not a trivial change, it's a complex migration. Similarly, you can use UTF-8 in Java and C#, but you can't just "switch" them over to the encoding directly, it has to be exposed via new types/etc.
Note that AssemblyScript rides on TypeScript language syntax. How would
```
let foo: string = "whatever"
```
be able to work the same way it does in TS/JS? How can that one type map to multiple string types? The idea is that both AS and TS use the same syntax for strings and are compatible across boundaries (TS on the JS side, AS on the Wasm side). Having multiple string types is possible, but it would greatly reduce developer ergonomics.
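A rough sketch of that ergonomics cost, modelling a second "Wasm-facing" string as UTF-8 bytes; `Utf8String`, `toUtf8`, and `fromUtf8` are invented names for illustration, not a proposal.

```typescript
type Utf8String = Uint8Array;

const toUtf8 = (s: string): Utf8String => new TextEncoder().encode(s);
const fromUtf8 = (b: Utf8String): string => new TextDecoder().decode(b);

let foo: string = "whatever"; // what TS and AS share today

// With two string representations, every boundary crossing becomes visible
// and has to be written (and paid for) explicitly:
const forWasm: Utf8String = toUtf8(foo);    // JS/TS side -> UTF-8 module
const backInJs: string = fromUtf8(forWasm); // UTF-8 module -> JS/TS side
console.log(backInJs === foo); // true here, but only because "whatever" is well-formed
```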
AssemblyScript is a compiler that aims to compile TypeScript code (with slight differences needed to make sense in Wasm, though it tries to minimize those differences) into Wasm. To remain compatible with TypeScript (which is AssemblyScript's goal) and be an optimal language for Wasm that communicates with TS/JS on the other side, the type `string` needs to use the same format on both sides, to avoid any performance hit or data loss while passing things of that type across the Wasm-JS boundary.
I'm not sure what type of compiler to label that as, but that's the goal.
Because if they did, then interop with JS would require performance-losing conversion any time a string needs to be sent from one side to the other, making the web a secondary and irrelevant target compared to native.
That's not what the web needs. The web needs WebAssembly to work flawlessly with JavaScript for maximal potential, so the web will be great and not just a performance landmine that native developers will laugh (as much) at.
I'm not going to enter the discussion regarding UTF-8 vs WTF-16 for representing strings, as I lack the context to determine which one is the right approach if everything has to fit the same model. However, I think an approach that allows multiple serialization/deserialization mechanisms depending on the host/guest language seems like a nice way to move it forward.
If you want to chime in and retrieve more context, here are some relevant issues:
Yep, it will impact any language with WTF-16. Those languages may incur a performance hit if the vote passes to not support "expressive UTF-16", but more notably, there will be data integrity issues that can lead to security issues.
I think you're mistaken there: AssemblyScript is much newer, and its momentum is only just getting started. AssemblyScript is now one of the top three most desired languages for WebAssembly: https://blog.scottlogic.com/2021/06/21/state-of-wasm.html
In the past year it has gained numerous libraries and bindings, including from Surma from Google. Stay tuned...
It is fair to like Rust, but there is nonetheless an influx of web developers who already know JavaScript and TypeScript moving to AssemblyScript to (finally) experience what Wasm is all about. They don't want to move to other languages. I believe their experience should be highly valued, and as optimal as possible.
There is a huge opportunity here to build an optimal foundation for these incoming developers, so that they won't be let down.
The influx has only just begun.
Ideally though, interface types would give languages options: the ability to choose which format their boundary will use. Obviously a JS host and a language like AssemblyScript would align on WTF-16, while a Rust Wasm module running on a Rust-powered Wasm runtime like wasmtime could optimally choose UTF-8.
I'm hoping things will be designed with flexibility in mind for this upcoming most-generic runtime feature.
So basically, even in UTF-8 you can create a malformed string, for example an emoji with its last byte dropped. That missing byte may cause problems in editors and text viewers that don't handle or pre-verify such cases. Valid UTF-8 has a specific binary format: a single-byte character is always of the form '0xxxxxxx', where 'x' is any binary digit; a two-byte character is always of the form '110xxxxx 10xxxxxx'; and three- and four-byte characters start with '1110xxxx' and '11110xxx' respectively, followed by one fewer '10xxxxxx' continuation byte than the total number of bytes.
So it's not just UTF-16 that has issues and can cause security problems. I just wanted to emphasize that.
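A minimal structural check for the byte patterns described above (a real validator would also reject overlong forms, surrogates, and values above U+10FFFF):

```typescript
function hasValidUtf8Structure(bytes: Uint8Array): boolean {
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    let extra: number;
    if ((b & 0b1000_0000) === 0) extra = 0;                // 0xxxxxxx
    else if ((b & 0b1110_0000) === 0b1100_0000) extra = 1; // 110xxxxx
    else if ((b & 0b1111_0000) === 0b1110_0000) extra = 2; // 1110xxxx
    else if ((b & 0b1111_1000) === 0b1111_0000) extra = 3; // 11110xxx
    else return false;                                     // stray continuation byte
    for (let k = 1; k <= extra; k++) {
      if (i + k >= bytes.length) return false;             // truncated sequence
      if ((bytes[i + k] & 0b1100_0000) !== 0b1000_0000) return false;
    }
    i += extra + 1;
  }
  return true;
}

// A four-byte emoji with its last byte dropped fails the check:
hasValidUtf8Structure(new TextEncoder().encode("🌍").slice(0, 3)); // false
```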
The point is that UTF-16 is the worst of both worlds. It's not ASCII compatible like UTF-8, but it still has the disadvantage of being a variable-length encoding.
Every problem UTF-8 has, UTF-16 shares. It also shares every problem UTF-32 has.
The fact that UTF-16 is bad doesn't mean you should necessarily get rid of it. We keep all sorts of bad stuff around, like C strings (and C).
You can certainly decide you don't care about any existing code, and that anyone using UTF-16 based platforms (Windows, .NET, Java, JavaScript) should get a bad experience, but I don't think the case for that is as obvious as you believe it is.
The upcoming Interface Types spec for WebAssembly is leaning toward not supporting the WTF-16 string format, which means any Wasm modules (for example those written in AssemblyScript, a web-inspired language) passing strings from one side to the other (e.g. from Wasm to JS) will experience two things:
- an extra performance cost due to format conversion at the boundary,
- as well as negative implications on security and data integrity,
thus making this a loss for the web if Interface Types is not fully compatible with the web (JavaScript) by default.
That dates back to the origins of the WASM spec process; it's always been very combative due to the fact that the Google side of things was reluctantly shooting Native Client in the head while the Mozilla side had basically done all of the initial heavy lifting to prove their model with asm.js. I would have preferred a more mature process, personally, but the tension didn't really interfere with the actual outcome as far as I could tell, it just made it a bit more stressful. Because WASM sits on top of existing JS runtimes, each JS vendor also had to make compromises in order for it to be possible to implement it across all browsers (the very strange control flow model is a good example of this - some JS engines couldn't handle unconstrained control flow).
The needs of all the different WASM consumers also creates tension here. A C# programmer trying to ship a webapp has very different needs from a C programmer trying to run WASM on a cloudflare edge node, and you can't really satisfy both of them, so you end up having to tell one of them to go take a walk into the sea.
This is an unfortunate consequence of the poor choice of keeping UCS-2 alive as UTF-16 for way too long. The plug on 16-bit encodings should have been pulled a long time ago, but some people were, and still are, so focused on backwards compatibility that they didn't see they were just pushing the issue into another decade. UTF-8 has won, completely. UTF-16 is basically a zombie nobody wants anymore, kept artificially alive by big 90s frameworks' fear of a clean break with the past.
We must get rid of legacy encodings no matter the cost, I'm tired of seeing Java and Qt apps wasting millions of CPU cycles mindlessly converting stuff back and forth from UTF-16. It's plain madness, and sometimes you just need the courage to destroy everything and start again.
I love reading stuff like this, because it reminds me that there are two entire universes of IT, and both are mostly filled with people blissfully unaware of the other.
UTF-8 is a great hack that works wonderfully on Linux and BSD, because neither actually supported internationalisation properly until recently. They clung to 8-bit ASCII with white knuckles until they could bear it no longer, but then UTF-8 came to the rescue and there was much rejoicing. "It's the inevitable future!" cried millions of Linux devs... in English. I mention this because UTF-8 is a bit... shit... if you're from Asia.
Meanwhile, in the other universe, UCS-2 or UTF-16 have been around for forever because in that Universe people do things for money and had to take internationalisation seriously. Not just recently, but decades ago. Before some Linux developers were born. In this Universe, an ungodly amount of Real Important Code was written by Big Business and Big Government. The type of code that processes trillions of dollars, not the type used to call MySQL unreliably from some Python ML bullshit running in a container or whatever the kids are doing these days.
So, yes. Clearly UTF-16 has to "die" because it's inconvenient for C developers that never figured out how to deal with strings in more than one encoding.
PS: There are several Unicode compression formats that blow UTF-8 out of the water if used in the right way. If you can support those, then you can support UTF-16. If you can't, then you can't claim that you chose UTF-8 because you care about performance.
Are you sure UTF-8 is the ideal format? After all, we have grapheme clusters that cannot be rendered as text units using UTF-8. Maybe UTF-8 is already obsolete and never took over the world? I am more than sure that soon we will see the new Unicode format ;)
Full disclosure, I am an active participant in WebAssembly standardisation, my github is here (https://github.com/conrad-watt). What follows is purely my personal opinion.
This announcement is deliberately phrased to scare people who do not have sufficient context. I don't know why some AssemblyScript maintainers have decided to act in this extreme way over what is quite a niche issue. The vote that this announcement is sounding the alarm over is _not_ a vote on whether UTF-16 should be supported.
There has been a longstanding debate as part of the Wasm interface types proposal regarding whether UTF-8 should be privileged as a canonical string representation. Recently, we have moved in the direction of supporting both UTF-8 and UTF-16, although a vote to confirm this is still pending (but I personally believe would pass uncontroversially).
However, JavaScript strings are not always well-formed UTF-16 - in particular some validation is deferred for performance reasons, meaning that strings can contain invalid code points called isolated surrogates. Again, the referenced vote is _not_ a vote on whether UTF-16 should be supported, but is in fact a vote on whether we should require that invalid code points should be sanitised when strings are copied across component boundaries. Some AS maintainers have developed a strong opinion that such sanitisation would somehow be a webcompat/security hazard and have campaigned stridently against it. However sanitising strings in this way is actually a recommended security practice (https://websec.github.io/unicode-security-guide/character-tr...), so they haven't gained the traction they were hoping for with their objections.
The announcement is worded to obscure this point - talking about "JavaScript-like 16-bit string semantics" (i.e. where isolated surrogates are not sanitised) as opposed to merely "UTF-16", which forbids isolated surrogates by definition, but inviting the conflation of the two.
AS does not need to radically alter its string representation - if we were to support UTF-16 with sanitisation, they could simply document that their potentially invalid UTF-16 strings will be sanitised when passed between components. Note that the component model is actually still being specified, so this design choice doesn't even affect any currently existing AS code. I interpret the announcement's threat of radical change as some maintainers holding AS hostage over the (again, very niche) string sanitisation issue, which is frankly pretty poor behaviour.
The security aspect is separate from whether UTF-16 lowering and lifting is supported. One simply cannot roundtrip every possible DOMString, C#, Java etc. String through the single concept of Unicode Scalar Values without introducing lots of surface area for either a) silent data corruption (what you call "recommended security practice", i.e. strings no longer comparing equal after what appears to be an innocent function call) or b) (deliberate) denial of service (when erroring instead). I mean, there is a good reason why all these languages do not do that in between function calls, but instead try very hard to guarantee integrity. And in a real program, an Interface Types function looks like any other function to the developer, just imported somewhere in the codebase, so good luck documenting that. Other than that, I do not know how to respond to your subtle insults; please forgive my ignorance.
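A small sketch of that integrity concern; `sanitize` here is my own stand-in (a UTF-8 round trip that replaces isolated surrogates with U+FFFD), not the actual Interface Types lowering.

```typescript
function sanitize(s: string): string {
  // Round-tripping through UTF-8 replaces isolated surrogates with U+FFFD.
  return new TextDecoder().decode(new TextEncoder().encode(s));
}

const sessionKeys = new Map<string, number>();
const key = "user-\uD83D"; // ends in an isolated surrogate
sessionKeys.set(key, 42);

// If the key crosses a boundary and comes back sanitised,
// the lookup quietly fails instead of erroring:
console.log(sessionKeys.get(sanitize(key))); // undefined
console.log(sanitize(key) === key);          // false
```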
At that time, it was roughly "you either stop advocating for the security concern, or we make sure you get nothing at all". In fact, I believe we would need to trap to at least make the breakage non-silent, but that would break AssemblyScript even more obviously and only exchange data corruption for denial of service. Not necessarily an improvement if you care about people using what you are building. In hindsight I am not proud of that comment and should have protested, no matter how dire the situation.
I don't agree with your representation that sanitisation of isolated surrogates constitutes "corruption". As a high-level point, when passing a string from your component to an external one, the external component receives a sanitised copy of your string - the original string is not modified in-place. So you still have access to your original string if you're relying on the presence of isolated surrogates for some reason.
For fairness, I will link below to your concrete example of "corruption", noting that you claim it will render Wasm "the biggest security disaster man ever created for everything that uses or opted to preserve the semantics of 16-bit Unicode".
I'd argue that the fundamental bug here is in splitting a string between the two code units (a surrogate pair) that make up an emoji, creating isolated surrogates. This kind of mistake is common and can already cause logic and display errors in other parts of the code (e.g. for languages with non-BMP characters), independent of whether components are involved (again, I emphasise that no code using components has been written yet).
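For example, in plain JS today, no components involved:

```typescript
const name = "🙂ok";
console.log(name.length);            // 4: the emoji alone is two code units

const truncated = name.slice(0, 1);  // keeps only the high surrogate
console.log(truncated === "\uD83D"); // true: an isolated surrogate
console.log([...truncated].length);  // 1, and it renders as a broken character
```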
EDIT: I should also note that if it becomes necessary to transfer raw/invalid code points between components, the fallback of the `list u8` or `list u16` interface type always exists, although I acknowledge that the ergonomics may not be ideal, especially prior to adaptor functions existing.
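A sketch of that `list u16` fallback from the JS side; `callComponent` is hypothetical, and only the representation matters here.

```typescript
const raw = "user-\uD83D"; // contains an isolated surrogate

// Ship the raw 16-bit code units instead of a `string`, so nothing is sanitised.
const units = new Uint16Array(raw.length);
for (let i = 0; i < raw.length; i++) units[i] = raw.charCodeAt(i);
// callComponent(units); // would arrive bit-for-bit as a list of u16

// The receiving side has to reassemble the string itself:
const back = String.fromCharCode(...Array.from(units));
console.log(back === raw); // true: nothing was lost, but it's no longer a `string`
```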
And sure, you can transfer your string (which someone else does not consider a string) using alternative mechanisms, but then you are only not doing anything wrong because you are not doing it at all, for entire categories of languages. There is no integration story for these, and once one mixes in optimizations like compact strings, or has multiple encodings under the hood, one cannot statically annotate the appropriate type anyhow. And sadly, adapter functions won't help either when the fundamental 'char' type backing the 'string' type is already unable to represent your language's strings.
I also do not understand where the idea that a single language always lives in a single component comes from. Certainly not from npm, NuGet, Maven or DLLs.
Extended this post to provide additional relevant context. It's not a bug, it's a feature.
I agree with the linked quote - it captures an important reason why it is valuable to _enforce_ sanitisation at component boundaries, rather than merely documenting "please don't rely on isolated surrogates being preserved across component boundaries" (which would be a problem if we didn't enforce it, since an external component you don't control may be forced to internally sanitise the string if it relies on (e.g.) an API, language runtime, or storage mechanism that admits only well-formed strings).
EDIT: since a whole other paragraph was edited in as I replied, I will respond by saying that within a component, your string can have whatever invalid representation you want. Most written code will naturally be a single component (which could even be made up of both JS and Wasm composed together through the JS API). The code may interface with other components, and this discussion is purely about what enforcement is appropriate at that boundary.
EDIT2: please consider a further reply to my post, rather than repeatedly editing your parent post in response. It is disorientating for observers. In any case, my paragraph above did not claim that there will be one component per language, but that the code _one writes oneself_ within a single language (or a collection of languages/libraries which can be tightly coupled through an API/build system) will naturally form one component.
Sure, we could resolve this problem by either a) giving these languages a separate fitting string type to use internally or externally (Rust for instance can use 'string' everywhere) or b) integrating their semantics into the single one so they are covered as well as first-class citizens. And coincidentally, that would fit JavaScript perfectly, which is rather surprising being off the table in a Web standard. Yet we are polling on having a "single" "list-of-USV" string type, likely closing the door for them forever with everything it implies.
There is no problem, assuming that one believes that the list-of-USV abstraction (i.e. sanitising strings to be valid unicode) is the right thing to enforce at the component boundary, _including_ when the internals of the component are implemented using JavaScript.
I appreciate that this is exactly the point where we currently disagree, and accept that I won't be able to convince you here. However, the AS website's announcement did not make the boundaries of the debate clear.
If Interface Types doesn't convert WTF strings to UTF, the flaw won't have to be documented, and the risk that someone forgets to do the right thing and causes a program to break will be eliminated. This seems like a better outcome. I haven't been able to think of downsides of not "sanitizing". Can you list those?