I see the point, but I don't think it has enough justification. Something evolutionarily successful isn't always globally optimal - rather, it's whatever was most probable along the path from there to here.
JavaScript may stay with us for a long time. Or - as sometimes happens - it could be eclipsed in a few years, as has happened to technologies and paradigms before. Guessing correctly in 1990 which technologies would be in use in 2000 would have been quite hard.
Not that the "more direct approach" of defining and standardizing a bytecode would always be better. With sufficiently disruptive ideas, many bets are off.
But it seems to me that "evolving into" the use of other languages through JavaScript will let us start ignoring the fact that it's JavaScript in particular we're targeting - such that, one day soon, the browser-makers will just give us a "shortcut" to the things that have already evolved, without all the JavaScript-y mess in between.
But something has to get popular first as a new "open web-scripting language" before the browser-makers will all be willing to go in on supporting it. (Otherwise you get reactions like Mozilla's to Google's NaCl.)
And that was a chicken-and-egg problem until now, because you can't really create and force universal adoption of an "open web-scripting language" (or framework, or platform, or bytecode, etc.) if you're just one company. But now, with asm.js, you can--and the rest of the steps will follow soon after.
WebSQL was a completely separate debate. If I recall correctly, the issues there had to do with WebSQL being heavily dependent on SQLite, a single implementation, so it was hard to spec in a vendor-independent way as web standards require.
IndexedDB is much less capable than SQLite, no doubt about it, but it's still very useful, and far, far simpler and more feasible to spec and standardize (which it has been).
It is another case where they made a fundamentally bad decision.
In the case of WebSQL, 40 years of legacy is not important. The needs of millions of business app developers are not important. A clean and nice spec is more important to the Mozilla guys.
However, in the case of Brendan Eich's baby, legacy is very important. Ugly hacks like asm.js, without W3C specs, are promoted. Try to guess why.
I don't know why others are giving you sensible answers, but your use of the words 'Brendan Eich's baby' shows your level of trolling. If you want a _good_ discussion, stop doing that.
I found the parallax background incredibly distracting. Maybe I just lack focusing skills, but I would much rather read a plain text file than have a slightly laggy background change every time I scroll.
I'm more than happy to make the effort to read a blog that pushes the limits with JS/CSS/HTML the way acko's blog does, even if it marginally decreases readability. I spend more time reading the source than the articles anyway. Think of acko.net as a tech demo first, blog second.
In this exchange UTF-8 got dragged into the list of ugly hacks, but it is a beautiful hack.
Endian-independent, more efficient than UTF-16 for most languages (often including CJK web pages: the halved cost of HTML markup and URLs makes up for the 33% extra cost of the text itself), supports easy and safe substring search, lets you detect cut-off characters, and all that with ASCII and C-string backwards compatibility.
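Two of those properties are easy to demo (a quick sketch in Python; the byte math is just standard UTF-8, nothing implementation-specific):

    # A multi-byte character cut in half is detectable: continuation bytes
    # always look like 0b10xxxxxx, so a decoder knows the string was truncated.
    buf = "héllo".encode("utf-8")        # b'h\xc3\xa9llo'
    try:
        buf[:2].decode("utf-8")          # chops the two-byte 'é' in the middle
    except UnicodeDecodeError:
        print("cut character detected")

    # For markup-heavy CJK pages, ASCII tags at 1 byte each roughly cancel out
    # the 3-bytes-per-ideograph cost compared to UTF-16.
    page = "<p class='intro'>" + "中文" * 10 + "</p>"
    print(len(page.encode("utf-8")), len(page.encode("utf-16-le")))   # 81 vs 82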
If I could redesign the entire computing platform from scratch, UTF-8 is one thing I wouldn't change.
The disadvantage of UTF-8 is that it's a variable-length encoding. This means that certain operations which are usually O(1) become O(n) with UTF-8: one is finding the n-th character in a string, another is finding the length in characters (though that's also true of null-terminated C strings).
Another problem is that swapping a character in a string might change the string's byte length, which might force a reallocation (also slow), and if you're doing it in a loop over the string, your loop-ending condition might now have changed because the byte length has changed.
In fact, iterating over a UTF-8 string's characters isn't something you can do with a simple for loop any more; it requires at least one, possibly two function calls per character (one to find the next character, one to find the end of the string, which might just have moved due to your modification).
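To make that concrete, here's roughly what "find the n-th code point" has to look like over raw UTF-8 bytes (a Python sketch; the real thing would live inside the string library):

    def nth_codepoint_offset(buf: bytes, n: int) -> int:
        """Byte offset of the n-th code point: an O(n) scan, not an O(1) index."""
        count = 0
        for i, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:       # not a continuation byte, so a code point starts here
                if count == n:
                    return i
                count += 1
        raise IndexError(n)

    data = "naïve 中文".encode("utf-8")
    offset = nth_codepoint_offset(data, 6)       # the 7th code point, '中'
    print(data[offset:].decode("utf-8"))         # 中文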
Finally, efficiency: for English text, UTF-8 is the most efficient Unicode encoding, but for other languages that isn't true. A Chinese text requires three to four bytes per character, as opposed to just two in UCS-2 (which is what most OSes and languages use, even though it can't encode all of Unicode).
For these reasons, dealing with a fixed-length encoding is much more convenient (and speedier) while the string is loaded in memory. UTF-8 is great for I/O and storage on disk, but in memory it's inconvenient.
UCS-2 or UTF-16 is the reverse: it's very inconvenient on disk and for I/O (need I say more than "BOM"?), but in memory UCS-2 is very convenient, even though it doesn't support all of Unicode. It's in fact so convenient that it's used by most programming environments (see yesterday's discussion about strings being broken).
Take Python 3.3 and later, for example: even though they now fully support Unicode, including characters outside the BMP that need more than two bytes of storage, they didn't go with a variable-length in-memory encoding; instead, for each string they pick the narrowest fixed-length encoding that can represent it.
This seems like an awful lot of work to me, but they decided the fixed-lengthness was still worth it.
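You can see the per-string width choice directly (CPython 3.3+; exact byte counts vary by build, but the ratios show the 1/2/4-byte storage):

    import sys
    print(sys.getsizeof("a" * 1000))       # ~1 byte per character (Latin-1 storage)
    print(sys.getsizeof("é" * 1000))       # still ~1 byte per character
    print(sys.getsizeof("中" * 1000))      # ~2 bytes per character (UCS-2 storage)
    print(sys.getsizeof("😀" * 1000))      # ~4 bytes per character (UCS-4 storage)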
Umm, is there any Unicode encoding where finding the n-th character (not code point) in a string is O(1)? In any encoding you can have a single 'composite character' that consists of dozens of bytes but needs to be counted as a single character for the purposes of string length, the n-th symbol, and cutting substrings.
This is not a disadvantage of UTF-8 but of Unicode (or of natural-language complexity) as such.
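For example (Python here, but the point is encoding-independent):

    import unicodedata
    s = "e\u0301"                      # 'é' built from a base letter + combining acute accent
    print(len(s))                      # 2 code points, even in UTF-32
    print(s == "\u00e9")               # False: one grapheme, two different code point sequences
    print(unicodedata.normalize("NFC", s) == "\u00e9")   # True only after normalization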
"UTF-32 (or UCS-4) is a protocol to encode Unicode characters that uses exactly 32 bits per Unicode code point. All other Unicode transformation formats use variable-length encodings. The UTF-32 form of a character is a direct representation of its codepoint." (http://en.wikipedia.org/wiki/UTF-32)
Of course, the problem of combining marks and CJK ideographs remains.
That's the point - you get O(1) functions that work on code points. But since for pretty much all practical purposes you don't want to work on code points but on characters, code-point-level efficiency is pretty much irrelevant.
I'm actually hard-pressed to find any example where I'd want to use a function that works on code points. Text editor internals and a direct implementation of keyboard input? For what I'd say is 99% of use cases, if code-point-level functions are used, that's simply a not-yet-discovered bug (the code will break on valid text that contains composite characters - say, a foreign surname).
If a programmer doesn't want to go into the details of encodings, then I'd much prefer the default string functions to be 'safe but less efficient' rather than 'faster but gives wrong results on some valid data'.
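The typical latent bug looks like this (a contrived Python example; the surname is made up):

    name = "Muño\u0301z"     # hypothetical surname where the 'ó' is a base letter + combining accent
    print(name[:4])          # 'Muño' - code point slicing silently dropped the accent
    print(name[::-1])        # reversal reattaches the accent to the wrong letter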
For a lot of use cases you're just dealing with ASCII though (hello, HTML). Wouldn't it be possible, in a string implementation, to have a flag indicating that the string is pure ASCII (set by the language internals), thereby indicating that fast O(1) operations are safe to use?
What you describe is exactly what you get with UTF-8 plus such a flag - if the string is pure ASCII (codes under 128), its UTF-8 representation is identical. IIRC the latest Python does exactly that, passing UTF-8 straight to C functions that expect ASCII when the strings are 'clean'.
But for what common use cases are you just dealing with ASCII? Unless your data comes from a COBOL mainframe, you're going to get non-ASCII input in random places.
HTML is a prime example of that - the default encoding is UTF-8, HTML pages very often include unescaped non-ASCII content, and even if you're US-English only, your page content can include things such as accented proper names or the various non-ASCII quotation marks - such as the '»' used on the NYTimes front page.
It really depends on what kind of thing you are doing. Say you're processing financial data from a big CSV. Sure, you may run into non-ASCII characters on some lines. So what? As long as you're streaming the data line by line, it's still a big win. You could say the same for HTML - you're going to pay the Unicode price on accented content, but not on all your DOM manipulations that only involve element names (though I don't know how much of a hit something like == takes when dealing with non-ASCII), or when dealing with text nodes that don't contain special characters.
I'm happy not to pay a performance price for things I don't use :)
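Something like this toy sketch (Python, data made up) is what I have in mind - the numeric columns never need more than ASCII handling, and only the odd name field pays for a real decode:

    rows = ("2013-11-27,Ascii Corp,10.00\n"
            "2013-11-27,Müller GmbH,12.50\n").encode("utf-8")

    total = 0.0
    for line in rows.splitlines():                 # stream line by line, as raw bytes
        date, name, amount = line.split(b",")
        total += float(amount.decode("ascii"))     # numeric fields are plain ASCII work
        if max(name) > 127:                        # only decode where non-ASCII actually appears
            name = name.decode("utf-8")
    print(total)                                   # 22.5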
The scenarios you mention would actually perform significantly better in UTF-8 (where they're identical to ASCII) than in the wider fixed-character-size encodings such as UCS-2 or UTF-32 that were recommended above. That's why UTF-8 is the recommended encoding for HTML content.
Streaming, 'dealing with text nodes' while ignoring their meaning, and equality comparisons are byte operations whose cost mostly depends on the size of the text after encoding.
You're still wrong about UCS-2: except for JavaScript, where it's still used (AFAIK there are unportable workarounds, but it's still specified that way), all the others use UTF-16. Yes, many methods operate on code units instead of code points (because the character type is usually 16 bits wide), but code-point-aware methods usually exist in Java, .NET and others. Just because a string is treated as a sequence of UTF-16 code units doesn't make it UCS-2. By the same argument you could say that everything that uses UTF-8 and gives you access to the underlying bytes doesn't support Unicode at all, but only ASCII.
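The code unit vs. code point distinction in a couple of lines (shown via Python's encoder; the length-2 result is what a 16-bit-code-unit string type reports, e.g. JavaScript's '😀'.length):

    ch = "\U0001F600"                    # one code point outside the BMP (😀)
    units = ch.encode("utf-16-le")
    print(len(ch), len(units) // 2)      # 1 code point, but 2 UTF-16 code units (a surrogate pair)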
"Text normalization in Go" https://news.ycombinator.com/item?id=6806062 was here yesterday and made me wonder: how can you handle ligatures, accents, digraphs etc in fixed-width?
Yes, it is hard. Instead of code points, humans usually handle text in graphemes, which can have arbitrary length in edge cases [0] no matter what encoding is used. An O(1) solution for graphemes would be an array of pointers to grapheme objects, which is slower and more memory-hungry than necessary in most cases.
O(n) is only an issue for large strings - let's say it might start being a concern at n > 1024, although on modern systems n can frankly be substantially larger without incurring any noticeable time loss. In my 20-year career it has been rare to deal with strings larger than 1024 characters. In those cases, I don't think it's unreasonable to have to use a specialised class that tracks string indexing so that lookup becomes O(1). With enough mucking around you can even get insertion to O(1).
For the remaining use cases of strings (the vast majority - filenames, URLs, UI labels, database fields), O(n) performance is perfectly adequate. This is why UTF-8 is such a successful hack.
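If it ever matters, the specialised class can be fairly dumb - a Python sketch (names made up) that trades an O(n) build pass and some extra memory for O(1) code point lookup; O(1) insertion takes considerably more mucking around and is left out:

    class IndexedStr:
        """UTF-8 bytes plus a code point -> byte offset table."""

        def __init__(self, text: str):
            self._buf = text.encode("utf-8")
            # One O(n) pass records where each code point starts.
            self._offsets = [i for i, b in enumerate(self._buf) if b & 0xC0 != 0x80]
            self._offsets.append(len(self._buf))   # sentinel for the final slice

        def __len__(self) -> int:
            return len(self._offsets) - 1

        def __getitem__(self, n: int) -> str:      # O(1) after the build
            start, end = self._offsets[n], self._offsets[n + 1]
            return self._buf[start:end].decode("utf-8")

    s = IndexedStr("grüße 中文")
    print(len(s), s[6])                            # 8 中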
A variable-length encoding is absolutely a good choice for i18n. It forces you to give up random access, which is nearly impossible to rely on in i18n text processing anyway (just think of a vowel followed by 2 or more accents). i18n libraries, such as the popular regexp packages, can easily deal with UTF-8 without much of a penalty. The library has to deal with all kinds of complexity anyway; random access was never a real need. Disclaimer: I contributed to the regexp package of a major programming language.
"On a shared medium like the web, where content has to run across all OSes, platforms, and browsers, backwards-compatible strategies are far more likely to succeed than discrete jumps."
This is a valid point, but it hits on something that constantly grates on me. We already have something massively more cross-platform and performance-focused than JavaScript - C. The problem is that the standard library for C is utterly lacking, and there is no real focus on developing mechanisms to deploy native code /directly/ across the web.
Don't kid yourself either - all the data being thrown around at the moment is always (always, always, always, otherwise it wouldn't even work!) translated into native code in some way - we are already throwing native code around the web, just in a horribly inefficient way. In today's world of sandboxing and virtualisation, all of the security arguments that used to be completely reasonable are quite thoroughly invalidated.
I really do believe that fixing this from a technical perspective is not an impossible or even a hard problem to solve - I cannot stress this point enough. Game developers are constantly re-solving it in limited, optimisation-focused ways. JavaScript implementations themselves are ultimately built on top of this technology, or on other technologies built on top of it. The standard libraries of modern scripting languages can be used from within the constraints of C. We have already solved all of these problems, and there are myriad examples.
On the other hand, the political problem could be intractable... and that's sad.
Worst of all, perhaps, C can be made better for performance without much thought or effort, and none of the new languages I see actually tackle this problem - which is quite real and measurable, and impacts everything from productivity to the environment. They (quite rightly, in many ways) focus entirely on ease of use and massive standard libraries. Why can't we focus this effort on fixing C, or on providing a better alternative?
Why are we trying to catch up with native code performance instead of just using native code? Why aren't we improving native code performance in a serious way?
but it trivially can be - this is very measurable and lots of people do it. see for instance iOS, OS X, Android, Windows 8... actually I can just list the modern operating systems that don't provide such an environment out of the box:
I can totally understand the frustration that shines through; basically I agree with you. Eight years ago I saw the start of an explosion of new languages/frameworks/etc., all solving one or more parts of a bigger problem. To my mind it became too fragmented: too many layers where too many things can go wrong, and even when they don't, there's a performance penalty.
So I went for a more abstract approach by working visually with models. After some tries I discovered the OutSystems development environment for web applications and have been working with it on a daily basis ever since. The advantage of working visually is that I don't know or see what the compiler creates (of course I still encounter some SQL, JS, CSS and HTML).
However, sometimes I miss text-based coding, so this year I looked around to see if the development situation had improved. In my opinion it has not. There seems to be less fragmentation, but that's probably because only a select set survived. Besides that, new initiatives have been started.
Personally I like the idea of using C as a cross-platform language, because it already IS one. However, I wouldn't be surprised if the lower levels of the languages in use already rely on it in some form (remember, it always needs to go native eventually). Besides that, I think that if C became the standard, then within a couple of weeks you would have language-X-to-C compilers, then ABC->X->C, then a web variant WEB->ABC->X->C, while undoubtedly someone would create a JavaScript-to-web-to-ABC-to-X-to-C compiler. And then we are right back where we started :)
So in my opinion the problem lies with the chains. Some abstraction is needed, of course, but not layers upon layers. I would like to see more direct-to-C compilers - JS->C, Web->C, ABC->C, X->C - whereby the goal is not to program in C but to have it as an intermediate format.
I believe that software development has not moved forward enough. There is plenty of movement left and right, but little going forward. So I am back to my visual modelling environment and will check again later...
IMHO the "sandboxes" those OS'es provide are all terribly broken. You'll have to sign your code, you can't distribute your code outside of app shops (either completely impossible like on iOS, or scare-dialogs pop up like in OSX), there are gate keepers which dictate what software gets in the app shops and which don't, and which can remove your app from the shop at a whim. And did you look at the Windows8 Store Apps API? It's a f*cking joke. Innovation doesn't happen in those closed-down environments, only commerce (but commerce depends on innovation).
Native code is, at the bottom of its ladder of abstraction, always beholden to one vendor's platform or another. Microsoft, Apple, and Google will never agree on a set of standard C libraries that could be used to build rich, modern graphical applications. If they could, we'd never have needed the web for anything other than the ideas of hyperlinks and intents - everything else could have been done with URL-specified zero-install applications. (This was the future portended by HyperCard.)
But things didn't go that way. Instead, each vendor built its own walled garden, with its own platform-specific libraries to do the same things.
The web, in this reality, is simply our attempt to encapsulate away the entire OS, along with all its platform-specific libraries, as a sort of really-big-BIOS, and then to build a vendor-neutral platform on top, with APIs that work on every computer, running every OS.
But the fun thing is, once we get this web platform nailed down, and it can do everything? Once the OS beneath it is redundant in every respect? We can "molt" the outer layer away.
The future of Operating Systems will be the lineage of today's ChromeOS and FirefoxOS, not Windows or OSX.
"Microsoft, Apple, and Google will never agree on a set of standard C libraries that could be used to build rich, modern graphical applications."
this is probably very true, but it isn't even a real problem. the standard library can be expanded to encompass rendering, audio, networking and co. regardless of the platform nitty-gritty; we can implement a layer over the top (lots of people already do this to make games).
when i look at the web as an approach to being platform independent i see something which is truly quite poorly constructed. the kinds of bugs and problems in the web stack are utterly alien in my world... libraries to work around browser bugs? laying out objects on a screen being 'complicated'? it's just shoddy all over... especially the browser implementations and the myriad frameworks piled on top of javascript and CSS.
coupled with the staggering loss of performance and increase in complexity I have no desire to use web tech to develop my cross platform products - and my native development moves at an extremely rapid pace.
i qualify this very heavily with "i know how to do this and have done it, alone and in teams, many more times than once"
a great example of this is everything already in the C standard library, which - whilst it seems simple today now it's done - hides a great deal of OS- and platform-specific detail that vendors still disagree on but programmers in that environment are largely (and rightly) unaware of. a quick look at the Win32 APIs, X11, the BSD socket implementations or the Cocoa/Objective-C stack on OS X will show just how different the platforms can be in their 'low level' interfaces for many things.
"Convenient though it would be if it were true, Mozilla is not big because it's full of useless crap. Mozilla is big because your needs are big. Your needs are big because the Internet is big."
I think any document and application layout system that handles all the use cases of CSS is going to be about as complex as CSS. Certainly PDF and Microsoft Word .DOC are up there in terms of complexity.
C is not sandboxed, and C has lots of undefined behavior. For those reasons, it is problematic for the web, which needs to let people view any site from any browser and platform, in a safe way.
People have tried to "fix" those issues with C, and usually they end up going pretty far, ending up with stuff like the JVM or CLR.
this opinion is utterly unjustifiable imo and exactly what i mean by 'the old security arguments are obsolete'
your comment suggests that you simply don't understand sandboxing and virtualisation, i'm afraid, nor how the technology of the web works down to the metal.
Are you familiar with the joke that ends with the punchline "I don't know, but he's got the Pope for a driver"?
For all I know, you have the knowledge and experience to hold your own, but do you realize that in just this thread you've defended your position by suggesting that (among others) the authors of Emscripten and Rust don't understand how the web works at a low level? That seems unlikely. I think you have a good point to make about the indirectness of asm.js, but your style of argument leaves something to be desired.
Well, you can't get out of the sandbox or even crash the process you're running in with C/C++ compiled to asm.js. To me this qualifies as "safety". Of course you can mess up, fragment the local heap/typed array and produce memory leaks. But it's all contained.
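A toy way to picture it (a conceptual Python sketch of "the program only ever touches one big array it was handed" - not how an engine actually implements asm.js heap access):

    # The compiled program's whole "address space" is a single buffer.
    HEAP = bytearray(64 * 1024)

    def store8(addr: int, value: int) -> None:
        if 0 <= addr < len(HEAP):          # out-of-range accesses simply have no effect
            HEAP[addr] = value & 0xFF      # wild pointers can only scribble inside HEAP

    store8(123, 65)            # a normal store
    store8(10**9, 66)          # a buggy "wild pointer" store: contained, nothing outside is touched
    print(HEAP[120:126])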
Naive question: is there a reason why we cannot just compile programs to both native/bytecode and JavaScript and have browsers automatically fetch the version they support? If it turns out that native/bytecode universally runs (say) 20% faster and loads in half the time, the JavaScript target will eventually die a natural death of obsolescence without compatibility ever being sacrificed. If it turns out that the JavaScript target can keep up with the performance of native/bytecode, then we can just stop compiling to anything but JavaScript, and nothing will be lost. The attitude that we should not even try to do better than compiling to JavaScript just seems odd.
I don't care if it's just Google or Google + Mozilla + a bunch of other guys. At least with Dart I can see the difference, and it took a couple of years to implement instead of a dozen. I hope Dart gets some traction so that other players will start innovating again instead of relying on legacy languages.
We can do that - I mean, it's a chicken-and-egg problem; browsers don't support a proper bytecode because there isn't a critical mass of software that needs that support, and there won't be a mass of software until the support exists in most browsers.
Asm.js provides a 'fake egg' that can be hatched by the already-existing JS support - but once there is enough software being built through, say, LLVM to asm.js, it would make sense for a browser vendor to offer a feature like "supply alternative bytecode built by the exact same compiler toolchain, and you'll get 50% better performance".
Well, that seems to be what Google is doing with PNaCl and pepper.js. Pepper.js uses emscripten to compile PNaCl apps so they run in JS (specifically asm.js). In Chrome the PNaCl version can run, and everywhere else it runs in JS. That sounds like what you are proposing?
Those two sites have the same codebases built for both PNaCl and asm.js. Overall they run pretty well in both, so this doesn't seem to show a clear advantage for either JS or a non-JS bytecode. PNaCl starts more slowly but then runs more quickly, though even those differences are not that big. And surely PNaCl will get faster to start up and JS will get faster to run, because there is no reason both cannot get pretty much to native speed in both startup and execution.
Note though that there are risks to this approach. No one enforces that everyone create dual builds of this nature, and there is no guarantee that the dual builds will be equivalent. So while this is interesting to do, it does open up a whole set of compatibility risks, which could fragment the web.
Except for the fact that you cannot compile threaded programs to JavaScript (I'm pretty sure), while you can with PNaCl. That's my problem with all this talk of asm.js: until it supports threads, I don't really consider it an acceptable solution.
Nope, it's a better idea to create a higher-level, thread-pool-based parallel task system which abstracts away the differences between pthreads and WebWorkers. I think most game engines have such a system in place anyway, and it can be adapted relatively easily. YMMV, because it may cost more overhead to get data in and out of WebWorkers since they don't have a shared address space (so in that regard they are more like processes).
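If a Python analogy helps: it's the difference between a shared-memory thread pool and a process pool, where every input and result has to be copied across the boundary - WebWorkers behave like the latter. A rough sketch of the task-system shape:

    from concurrent.futures import ProcessPoolExecutor

    def crunch(chunk):                         # must be a pure function of its inputs:
        return sum(x * x for x in chunk)       # there is no shared address space to lean on

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]
        with ProcessPoolExecutor(max_workers=4) as pool:
            # arguments and results are serialized in and out, much like postMessage
            print(sum(pool.map(crunch, chunks)))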