What a bizarre choice. If they're going to commit to weird ASCII control chars you'd think they could just use 0x1C to 0x1F, which are explicitly intended as delimiters/Separators... sigh. (I've always wondered why more people don't use the various Separators, but I admit human-readability is a big advantage)
The ASCII C0 separators -- FS, GS, RS, US -- are non-printable (by design), but this impacts their usability by laypeople, because they don't show up as obvious symbols in a plaintext editor, and most critically, they visually seem to be absent from keyboards, so additional domain knowledge is required for people to figure out how to produce them. So instead, people developed all sorts of formats that are subject to delimiter collision.
Also, ASCII is recognized as a widely-deployed standard now, but this wasn't always the case. Computers used dozens of different codepages to represent characters with bytes, and while the 26 English letters, 0-9, and some punctuation were always present, control characters seldom had equivalents in a different codepage. Interchange was therefore a problem: in most machines' native codepages, these delimiters were simply absent.
ASCII actually abbreviates 'American Standard Code for Information Interchange', but it largely came to be used for printable characters only -- "plain text", and not as a format for structured data.
By that point, the ship on C0 delimiters had largely sailed. To compound the chicken-and-egg problem, some codepages developed after ASCII discarded the notion of control characters entirely and redefined those byte values as additional printable characters. Windows-1252 was a notable offender [1].
Hell, there's ASCII characters specifically for delimiters. 0x1C to 0x1F are respectively defined as file, group, record, and unit separators. Unicode naturally inherits them all.
CSVs are such a pain if you have freeform text (in which you have to handle newlines and escaped quotes).
It seems like using FS/GS/RS/US from the C0 control codes would be divine (provided there was implicit support for viewing/editing them in a simple text editor). I get that they're non-printable control characters, and thus not traditionally rendered... but that latter point could have been addressed ages ago as a standard convention in editors (enter Vim et al., which will indeed let you see them).
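To make the CSV pain concrete, here's a rough Python sketch. It uses the standard csv module for the CSV side; the sample row and the choice of US (0x1F) as the field delimiter are just my assumptions for illustration, not anything a particular tool does.

```python
import csv
import io

US = "\x1f"  # Unit Separator, assumed here as the field delimiter

# One row whose second field contains quotes, a comma, and a newline.
row = ["Ada", 'said "hi",\nthen left']

# CSV has to quote the field and double the embedded quotes.
buf = io.StringIO()
csv.writer(buf).writerow(row)
csv_text = buf.getvalue()

# Naive "split on newline, then on comma" parsing breaks on this row.
naive = csv_text.split("\n")[0].split(",")

# A real CSV parser recovers it, but only by tracking quote state.
parsed = next(csv.reader(io.StringIO(csv_text, newline="")))

# With US as the delimiter, a plain join/split round-trips the same row.
us_text = US.join(row)
us_parsed = us_text.split(US)
```

The US version only works because the freeform text never contains 0x1F, which is exactly the bet the separators are making.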
Have you ever seen the ASCII separator characters used as they were intended? I don't think I have. It's obvious the problem they were trying to solve, but it was too little too late. It doesn't help that they're control characters that aren't meant to be displayed so they're practically invisible.
ASCII has special delimiters 0x1E Record Separator and 0x1F Unit Separator to avoid conflicting with values, but they have never gained widespread adoption.
An exceedingly stupid act of paranoia; I knew the input _could not_ go above the normal ASCII character set without errors elsewhere in the pipeline, so it seemed more robust to choose one that, by other invariants, could never be hit. That being said, your group separators might still have been a more valid answer, had I thought harder about it. (But then I wouldn't be able to talk about it as quite so much of a dirty hack!) I imagine they aren't used much because, frankly, I hadn't even thought about their distinct function more than two or three times in my entire post-programming life.
I've always been curious about the characters in ASCII for this, but I've never seen them used in the wild. Stuff like "Group Separator" (0x1D), "Record Separator" (0x1E) or "Unit Separator" (0x1F)
Is there a reason why nobody uses these? Did someone work out back in the 90s they were pure evil and we've just never used them since?
Tangentially, I'm a bit surprised that we've completely dropped the ASCII separator control characters: 28/0x1C FS (File Separator), 29/0x1D GS (Group Separator), 30/0x1E RS (Record Separator), 31/0x1F US (Unit Separator).
It's a pity, because I usually see Space (32/0x20) as the record separator, which I suppose is convenient because it works well in a standard text editor, but it does mean we've built up decades of habit/trauma-avoidance about avoiding spaces in names, replacing them with underscores (_) or dashes (-)...
I never understood why the ASCII separator characters aren't used more. It seems like we're one simple text editor feature away from having easy display and modification. Is there some historical reason for not doing that?
The answer makes sense to me, but I wish we could fix editors to properly handle the ASCII separators (1C, 1D, 1E, 1F) instead of resorting to Unicode control picture characters (241C, 241D, 241E, 241F).
Maybe if editors are fixed up we could adopt ASCII Separated Values (ASV) as the new standard.
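For what it's worth, the control-picture trick mentioned above is only a few lines if you do it at display time rather than in the file. A minimal Python sketch, assuming the data stays stored with the real 1C-1F bytes and the pictures are used purely for viewing (the function names `show` and `hide` are made up):

```python
# Map the four C0 separators (0x1C-0x1F) to their Unicode control
# pictures (U+241C-U+241F), which sit at 0x2400 + the control code.
TO_PICTURES = {code: 0x2400 + code for code in range(0x1C, 0x20)}
FROM_PICTURES = {picture: code for code, picture in TO_PICTURES.items()}

def show(text):
    """Return a display version with visible symbols for FS/GS/RS/US."""
    return text.translate(TO_PICTURES)

def hide(text):
    """Turn the control pictures back into the real separators."""
    return text.translate(FROM_PICTURES)

raw = "name\x1fnote\x1eAda\x1fhello"
visible = show(raw)
```

One caveat: `hide` can't tell a picture that `show` produced from one that was already in the data, so the round trip is only safe if the data never contains U+241C to U+241F itself.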
It’s so easy to play games with specs like this though. I could choose any of a dozen of the original ASCII control characters, or U+FFFE, or some other character that I never expected to encounter and I’d probably be able to avoid ever hearing about the problems it caused, perhaps even because there were none (depending on how obscure the character). I maintain that it’s bad design. It’s not that much more expensive to wrap or escape your text properly. Hell, there are even unused bytes that can’t appear in valid UTF-8 that you could use instead. Just use FF and FE as your delimiters and sleep easy with full domain support.
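A quick Python sketch of the FF/FE idea, assuming every field is text that gets encoded as UTF-8 before framing. Which byte separates fields and which separates records is my arbitrary choice, and `pack`/`unpack` are hypothetical names:

```python
FIELD_SEP = b"\xfe"   # 0xFE never appears in valid UTF-8
RECORD_SEP = b"\xff"  # 0xFF never appears in valid UTF-8

def pack(rows):
    """Encode each field as UTF-8, then frame with the two spare bytes."""
    return RECORD_SEP.join(
        FIELD_SEP.join(field.encode("utf-8") for field in row)
        for row in rows
    )

def unpack(blob):
    """Split on the spare bytes first, then decode each field."""
    return [
        [field.decode("utf-8") for field in record.split(FIELD_SEP)]
        for record in blob.split(RECORD_SEP)
    ]

# Fields may contain any code point, including C0 controls and U+FFFE.
rows = [["caf\u00e9", "\ufffe\x1f\x01"], ["\U0010ffff", ""]]
blob = pack(rows)
```

The catch is that the packed blob as a whole is no longer valid UTF-8, so it has to be treated as bytes until it's been split. It also only covers text fields; arbitrary binary fields can contain FF and FE like anything else.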
The thing about encoding multiple fields within a cell using \x01 somewhat bugs me -- not because it's a hack but because it's yet another example of needless reinvention of wheels. Good old ASCII has characters specifically devoted to separating fields, keys, etc. that no one uses for anything else. Why not use them instead of inventing a new character that does the same thing? (By the same token, there was never any need for CSV or tab-delimited text, since the separator characters can't be typed into spreadsheet cells, whereas parsing CSV can be a pain and TABs can be typed.)
If you care, the characters are:
FS -- file separator 1C
GS -- group separator 1D
RS -- record separator 1E
US -- unit separator 1F
As you can see, they also have the benefit of being self-describing (unlike \x01, as the article points out).
(The sad thing is I only learned about these characters because I had to parse files in a 1960s format originally designed to be stored on tape drives -- and they used these delimiters and they worked great.)
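The four characters listed above nest from FS down to US, so a parser is just four nested splits. A Python sketch under that assumption (the `parse` helper and the sample string are made up; I'm not claiming this matches the tape format I mentioned):

```python
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

def parse(text):
    """Split into files -> groups -> records -> units (fields)."""
    return [
        [
            [record.split(US) for record in group.split(RS)]
            for group in file.split(GS)
        ]
        for file in text.split(FS)
    ]

# Two files; the first has two groups, and its first group has two records.
sample = "a" + US + "b" + RS + "c" + US + "d" + GS + "e" + FS + "f"
tree = parse(sample)
```

No quoting and no state machine, as long as the fields themselves never contain any of the four characters.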
Well, the obvious solution would be ASCII 0x1D (Group Separator)! Except no one actually uses those ASCII characters. Kind of bums me out that UNIX basically skipped out on them.
>Fun fact: ASCII actually reserved control characters for this stuff. 1F is the "unit separator" and 1E is the "record separator". There is even a "group separator" (1D) and a "file separator" (1C).
Which means you can't safely use it for arbitrary data (since your records themselves could contain these separators). Most of the time that doesn't matter; sometimes it does.
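Here's a small Python sketch of the failure and of one way around it. The escaping scheme (prefixing with ESC, 0x1B) is my own assumption for illustration, not part of any standard use of these characters, and the function names are hypothetical:

```python
US = "\x1f"   # Unit Separator, used as the field delimiter
ESC = "\x1b"  # assumed escape character for this sketch

fields = ["a\x1fb", "c\x1bd", ""]  # first field contains US itself

# Naive join/split: the embedded US creates a phantom extra field.
naive = US.join(fields).split(US)

def escape_join(fields):
    """Prefix every ESC or US inside a field with ESC, then join on US."""
    escaped = [
        f.replace(ESC, ESC + ESC).replace(US, ESC + US) for f in fields
    ]
    return US.join(escaped)

def unescape_split(text):
    """Scan left to right; a character after ESC is always literal."""
    result, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == ESC:
            literal = next(chars, None)
            if literal is None:
                raise ValueError("dangling escape at end of input")
            current.append(literal)
        elif ch == US:
            result.append("".join(current))
            current = []
        else:
            current.append(ch)
    result.append("".join(current))
    return result

round_trip = unescape_split(escape_join(fields))
```

Of course, once you need the scanner you're back to roughly the complexity of a CSV parser, which is the "most of the time that doesn't matter; sometimes it does" trade-off.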