
Coding editors also often show this kind of thing intentionally, as those characters are meaningful for interpretation purposes. Many of them are very UTF-friendly, but they still render zero-width spaces as e.g. "<zwsp>" on purpose.

They've also shown non-printable ASCII control characters basically forever. Null bytes and BEL (\a) and whatnot are very important despite being "invisible", and they've been around for decades.




In some languages which allow non-ASCII but aren't Unicode-aware (PHP, for instance), you can add significant, invisible zero-width spaces to identifiers.
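A minimal Python sketch of the same hazard at the string level (the names and values are made up for illustration):

    # U+200B is ZERO WIDTH SPACE: it renders as nothing in most fonts.
    user = "admin"
    spoofed = "admin\u200b"

    print(user == spoofed)           # False, despite looking identical
    print(len(user), len(spoofed))   # 5 6

    # Escaping non-ASCII code points makes the difference visible:
    print(spoofed.encode("unicode_escape"))  # b'admin\\u200b'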

Some editors used a byte order mark to help detect UTF-8 encoded files. Since U+FEFF is also a valid zero-width no-break space character, it served as a nice easter egg for people who ended up editing their Linux shell scripts with a Windows text editor.
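For reference, a short Python sketch of that detection heuristic (the file name is hypothetical):

    # The UTF-8 encoding of the BOM (U+FEFF) is the byte sequence EF BB BF.
    with open("script.sh", "rb") as f:
        has_bom = f.read(3) == b"\xef\xbb\xbf"
    print("UTF-8 BOM present:", has_bom)
    # A shell treats those bytes as part of the shebang line, which is why
    # a BOM-prepending Windows editor can break a Linux script.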

Zero-width Unicode chars have been used in exploit kits for a while now; just use hd (or something similar) when debugging.
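If hd isn't handy, a rough Python equivalent is easy to sketch (the output format approximates hd's, not byte-for-byte):

    import sys

    data = open(sys.argv[1], "rb").read()
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        print(f"{off:08x}  {hexpart:<47}  |{text}|")
    # A zero-width space shows up as the UTF-8 bytes e2 80 8b --
    # exactly the kind of thing the rendered text hides.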

I'll trade editor awareness of control chars for eliminating the problem in 99.999% of cases.

Also, some editors do show the ASCII codes, like Notepad++.


In Emacs those characters are by default displayed as one-pixel-wide spaces; to make them more apparent, eval (update-glyphless-char-display 'glyphless-char-display-control '((format-control . empty-box) (no-font . hex-code))).

Then you run into ASCII art some end user made, or pre-Unicode text from Scandinavia where 0x7C was the code point for a letter (ø in the Danish/Norwegian ISO 646 variants). Commas make it pretty obvious which tools are unusably broken, whereas very rare characters let these bugs go undetected far too long.

What a bizarre choice. If they're going to commit to weird ASCII control chars, you'd think they could just use 0x1C to 0x1F, which are explicitly intended as delimiters/separators... sigh. (I've always wondered why more people don't use the various separator characters, but I admit human readability is a big advantage.)

Have you ever seen the ASCII separator characters used as they were intended? I don't think I have. It's obvious what problem they were trying to solve, but it was too little, too late. It doesn't help that they're control characters that aren't meant to be displayed, so they're practically invisible.

From the article, it's likely you'd not even notice - unless you pasted it into an ASCII-only editor that doesn't allow anything other than plain old text.

Even ASCII-only forums and media have come up with informal markups for features like *bold*, _underline_, SPEAKING LOUDLY, etc. You might be so accustomed to these that you don't even see the codes anymore, but they are codes, and quite ugly too.

I am not a fan of C's strings either. Null termination rather than communicating the length as a preamble has been a fount of bugs for decades. On the other hand, ASCII is merely a way of interpreting a stream of bits that chunks the bits into bytes and maps the bytes to characters. Unicode is another way of interpreting a stream of bits which also chunks the bits into bytes, but then chunks bytes together before mapping to characters.
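A minimal Python sketch of the length-preamble idea (Pascal-style strings), just to contrast it with NUL termination; the wire format here is invented for illustration:

    import struct

    def pack_str(s: str) -> bytes:
        # Length-prefixed: 4-byte little-endian byte count, then UTF-8 payload.
        data = s.encode("utf-8")
        return struct.pack("<I", len(data)) + data

    def unpack_str(buf: bytes) -> str:
        # No scanning for a terminator: the length says exactly where to stop.
        (n,) = struct.unpack_from("<I", buf, 0)
        return buf[4:4 + n].decode("utf-8")

    packed = pack_str("hello\x00world")    # an embedded NUL is no problem here
    assert unpack_str(packed) == "hello\x00world"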

Both Unicode and ASCII are abstractions built on top of streams of bits, largely intended to communicate text primarily and strings (in the computational sense) secondarily, for example as commands to a REST endpoint (both contain control codes such as <BEL>). For example, C has had a wide character type for about 25 years [1], available as an abstraction built on top of strings... like much of C, how wide "wide" is is implementation-dependent, and explicit 16-bit and 32-bit wide characters were standardized more recently.
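The "how wide is wide" question is easy to see from Python, using encodings rather than C types (a quick illustration):

    s = "Aé€𝄞"   # four code points of increasing magnitude
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, len(s.encode(enc)), "bytes for", len(s), "code points")
    # Only utf-32-le is a fixed 4 bytes per code point; a 16-bit wide char
    # (wchar_t on Windows) needs a surrogate pair for 𝄞 (U+1D11E).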

[1]:


People are lazy.

The ASCII C0 separators -- FS, GS, RS, US -- are non-printable (by design), but this hurts their usability for laypeople: they don't show up as obvious symbols in a plain-text editor, and, most critically, they appear to be absent from keyboards (on a terminal they're typed as Ctrl+\, Ctrl+], Ctrl+^, and Ctrl+_), so additional domain knowledge is required to produce them. So instead, people developed all sorts of formats that are subject to delimiter collision.
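For the curious, a minimal Python sketch of the separators used as intended (the records are made up); no quoting or escaping is needed as long as the payload never contains these control bytes:

    FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

    rows = [["Ada Lovelace", "1815"], ["Alan Turing", "1912"]]

    # Encode: US between fields, RS between records.
    encoded = RS.join(US.join(fields) for fields in rows)

    # Decode without worrying about commas, quotes, or embedded newlines.
    decoded = [record.split(US) for record in encoded.split(RS)]
    assert decoded == rows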

Also, ASCII is recognized as a widely deployed standard now, but this wasn't always the case. Computers used dozens of different codepages to represent characters as bytes, and while the 26 English letters, 0-9, and some punctuation were always present, the control characters seldom had equivalents in a different codepage, so interchange was a problem: in most machines' native codepages, these delimiters were simply absent.

ASCII actually stands for 'American Standard Code for Information Interchange', but it largely came to be used for printable characters only -- "plain text" -- and not as a format for structured data.

Although by that point the ship on C0 delimiters had largely sailed, to compound the chicken-and-egg problem, some codepages developed after ASCII discarded parts of the control-character space entirely and redefined those byte sequences as additional printable characters. Windows-1252, which repurposed the C1 control range (0x80-0x9F) as printable characters, was a notable offender [1].

[1] https://en.wikipedia.org/wiki/Windows-1252


It looks like each character is taking up twice the width of an ASCII symbol, but half of that is empty space. Why is that? Is that space completely unusable?

It's actually Unicode instead of ASCII. It's full of special box-drawing characters, and more for the function library.

But I get your point, you probably meant "fixed width text characters" instead of ascii. I'm just being pedantic :)
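If anyone wants to check why they render double width: Python's unicodedata module exposes the East Asian Width property that terminals typically consult when deciding between one and two cells (a minimal sketch):

    import unicodedata

    for ch in "A─│あ":
        print(f"U+{ord(ch):04X} {ch!r}: {unicodedata.east_asian_width(ch)}")
    # 'Na' = narrow, 'W' = wide (two cells), 'A' = ambiguous.
    # The box-drawing set is 'A' (ambiguous), and ambiguous characters
    # render double width in some fonts and locales, hence the gap.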


I think there are sensible reasons people don't like using unlabeled, whitespace-delimited formats that require an ASCII-art diagram to explain in this day and age.

That's cos it's the ASCII version :).

Let's not be so pedantic. I meant visible ASCII symbols, character codes 32-126 plus newline.

The examples at the bottom of the page are mostly special characters, like the valid program `;i;c;;#\\?z{;?;;fn':.;, `

I've been working with ASCII for over 30 years and never realized this smh
