Incremental Parsing in Go

xyzzy4747 | karma 88 | avg karma 0.24 · 2022-10-22 10:19:17

For max optimization, wouldn’t it be better to create a Rust or C library for parsing that Go links into? I personally don’t see the usefulness of trying to optimize Go itself too much as it’s handicapped by the runtime and garbage collection.

tester756 | karma 3905 | avg karma 1.96 · 2022-10-22 10:28:34

C library for parsing?

isn't it dangerous from security perspective?

xyzzy4747 | karma 88 | avg karma 0.24 · 2022-10-22 10:32:31

It's just an example. The options are really Rust (what I'd prefer), C++, C, or perhaps something like Nim that compiles to C.

If you’re trying to make an unoptimized parser, then use whatever you want.

pharmakom | karma 3602 | avg karma 2.45 · 2022-10-22 10:29:49

For some reason we insist that language parsers are implemented in the language itself, even when the language isn’t great for parsers.

37ef_ced3 | karma 898 | avg karma 3.1 · 2022-10-22 10:31:42

You're in for a big surprise. Try using the language.

Spend some time using Go, and you will be impressed by its performance.

You'll wonder, "Were all those haters on Hacker News misinformed?"

xyzzy4747 | karma 88 | avg karma 0.24 · 2022-10-22 10:33:49

If you're making something requiring CPU optimization as a core feature, you might as well go with one of the fastest languages instead of handicapping your project from Day 1. Go is not considered one of the fastest. It's better for network or filesystem logic that is I/O limited.

dymk | karma 8044 | avg karma 2.72 · 2022-10-22 10:39:18

This is a premature optimization, and keeping everything in the same language has benefits like greatly simplified tooling and building

xyzzy4747 | karma 88 | avg karma 0.24 · 2022-10-22 10:44:35

It’s not a premature optimization - it’s deciding the maximum that the parser can be optimized in the future. Choosing Go sets a lower ceiling.

> Keeping everything in the same language has benefits like greatly simplified tooling and building

Surely there are other Go libraries that incorporate C, C++, or Rust? Also if both parsers existed and were equally easy to set up, and you were planning on doing a ton of parsing, it would make sense to go with the faster one.

dymk | karma 8044 | avg karma 2.72 · 2022-10-22 10:47:29

It absolutely is a premature optimization. If it's fast enough, then it's fast enough. The author hasn't indicated that the current Go implementation is hitting a ceiling imposed by the language yet.

If you'd like to, you can provide some real-world examples - or even microbenchmarks - showing that Go is so much slower than <your choice here> that it's going to make a difference.

> Also if both parsers existed and were equally easy to set up

They're not equally easy to set up. Language interop is a pain in the ass.

Jtsummers | karma 23473 | avg karma 2.62 · 2022-10-22 11:28:16

Look at the current Makefile:

https://github.com/aretext/aretext/blob/main/Makefile

Build is literally a `go build ...` and install is `go install`. Adding any other language to the mix would make this a polyglot project and not be "equally easy to set up". The other question is, do both parsers exist? In this write-up they point to tree-sitter as a possibility which is a JS program that produces C code. This would be viable, but here's the author's take:

> I considered integrating tree-sitter, an incremental parsing library with parsers for many existing languages. However, running JavaScript to generate parsers and linking to a C library would have greatly complicated the build process. Today, aretext can be built on almost any platform using a single go install command. I’ve had users install aretext on ARM laptops, FreeBSD servers, Chromebooks, and Android phones. To maintain portability, I wanted a pure Go implementation.

So this wasn't some casual decision, but something they at least considered long enough to describe here.

And the parsing library itself is only around 1200 lines total (comments, blanks, and code). The parsers for each language add a lot more, of course, but should be roughly equivalent given the same library and interface. I imagine that if this project really takes off and performance becomes a real problem they can do the rewrite at that point. Right now, the code works, seems to work fast enough for its author and primary users, and it's trivial to install on any platform supported by Go. So yes, it would have been a premature optimization to complicate the build process, probably reduce the number of supported platforms (or greatly increase the effort to support the same number of platforms), just to have a slightly faster parser.

samatman | karma 17828 | avg karma 2.76 · 2022-10-22 10:42:42

The optimization here is using incremental parsing, so that changing parse state goes from O(n) to may-as-well-be-O(1). It's probably linear with tree depth.

Any language is fast enough to do this, certainly Go is. Naive parser combinators written in slow languages can tokenize six-figure LOC files fast enough that the user won't notice.
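
To make the O(n) versus near-O(1) point concrete, here is a minimal sketch of incremental re-tokenization. It is not the aretext implementation: the `token` type, `retokenize`, and the whitespace-delimited "language" are all assumptions made up for illustration, and only insertions are handled. The idea is to keep the tokens before the edit, rescan from there, and stop as soon as a freshly scanned token lines up with an old token that starts after the edit.

```go
package main

import (
	"fmt"
	"unicode"
)

// token is a half-open rune range [start, end) in the text (hypothetical type).
type token struct{ start, end int }

// nextToken scans one whitespace-delimited word starting at or after pos.
func nextToken(text []rune, pos int) (token, bool) {
	for pos < len(text) && unicode.IsSpace(text[pos]) {
		pos++
	}
	if pos == len(text) {
		return token{}, false
	}
	start := pos
	for pos < len(text) && !unicode.IsSpace(text[pos]) {
		pos++
	}
	return token{start, pos}, true
}

// tokenize scans the whole text: the O(n) baseline.
func tokenize(text []rune) []token {
	var out []token
	for pos := 0; ; {
		t, ok := nextToken(text, pos)
		if !ok {
			return out
		}
		out = append(out, t)
		pos = t.end
	}
}

// retokenize updates old after `inserted` runes were inserted at rune offset
// `at`; text is the text after the edit. It also reports how many tokens had
// to be scanned again.
func retokenize(old []token, text []rune, at, inserted int) (out []token, scanned int) {
	// Tokens that end strictly before the edit cannot have changed.
	keep := 0
	for keep < len(old) && old[keep].end < at {
		keep++
	}
	out = append(out, old[:keep]...)
	pos := 0
	if keep > 0 {
		pos = old[keep-1].end
	}
	// Old tokens starting at or after the edit are candidates for reuse.
	next := keep
	for next < len(old) && old[next].start < at {
		next++
	}
	for {
		t, ok := nextToken(text, pos)
		if !ok {
			return out, scanned
		}
		for next < len(old) && old[next].start+inserted < t.start {
			next++
		}
		if next < len(old) && old[next].start+inserted == t.start {
			// Back in sync: the rest of the text is unchanged, so shift and reuse.
			for _, o := range old[next:] {
				out = append(out, token{o.start + inserted, o.end + inserted})
			}
			return out, scanned
		}
		out = append(out, t)
		scanned++
		pos = t.end
	}
}

func main() {
	old := tokenize([]rune("foo bar baz qux"))
	// Insert "X" at rune offset 5, inside "bar".
	updated, scanned := retokenize(old, []rune("foo bXar baz qux"), 5, 1)
	fmt.Println(updated, "rescanned:", scanned)
}
```

Shifting the reused tail is still linear in this sketch. Storing tokens in a tree with relative offsets is one way to avoid that copy, which is presumably where the tree-depth cost mentioned above comes from.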

jerf | karma 85298 | avg karma 5.28 · 2022-10-22 10:55:57

This is kind of a test of how nuanced your understanding of programming languages can be.

Rust with a bit of effort put into optimization will be faster than Go with a bit of effort put into optimization, it is true. However, you need to double-check your intuition for how big and how consequential the delta is, because I'd guesstimate it as roughly a factor of two, possibly a touch less. It is true that Rust does a crapton more "optimizations", but a lot of those optimizations have diminishing returns.

A factor of 2 may still sound large, but in practice it isn't as large as it sounds, because my qualification "a bit of effort put into optimization" is not redundant. Go with a bit of optimization will probably beat someone's first draft of Rust. Go with a ton of careful optimization will probably beat Rust with a bit of optimization. The raw performance of the two is reasonably close, and the gap is smaller than the improvements you can usually get with optimization. So Rust's speed advantage, which is real, generally only matters in cases where you're going to optimize very heavily.

Is this one of them? For that I can give a solid... Maybe! There are times and places where parsing is something you want optimized to within an inch of its life, certainly. However... it isn't all the places, and some of your intuitions may lead you astray if you're not careful; you might think a heavy duty programming language would need a great parser, but if it's going to chew on optimizations for possibly literally 100x the time, it may matter a lot less.

In general, Rust is capable of going faster than Go (again, I'd guesstimate about a factor of 2, with isolated tasks where it may be able to go even faster, but that only matters if that's the bulk of your program), but Go is going to be fast enough that that only matters in certain limited places where you're willing to put some non-trivial effort into performance in the first place.

This is in contrast to a comparison between Go/Rust and Python, where even casually written Go/Rust can outpace optimized pure Python, even before we start talking about how much better Go/Rust will be using multiple CPUs. This is because Python is just that slow, and let's not even talk about how slow Python can be if you don't write it to be optimized and you start heavily using all its features without realizing how expensive they are. From the point of view of Python, Go and Rust have very similar performance characteristics. (But then, of course, one must be careful with Python because something like NumPy will blow your socks off when it turns out to not really be Python at all.)

It's a rich, nuanced problem space and should not be approached with sloganeering and "my language is better than yours".

My summary of Go is: It's a slow compiled language... but it is still a compiled language, and it is faster than pretty much everything that isn't, with the possible exception of LuaJIT, and the delta between slowest compiled and fastest compiled is only ~2-3x, which in the overall space of programming language speed isn't actually that great.

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 11:38:14

Not sure if rust vs go would be the best example here. Rust vs Java would be a better one — go has a very primitive GC in comparison, and java does optimize hot loops to a higher degree, so a naive code base would be very hard to beat in a lower level language.

richieartoul | karma 812 | avg karma 6.11 · 2022-10-22 12:12:02

I do a lot of “high throughput” stuff at work in both Go and Java, and the Go stuff is usually faster by default.

Java tends to win for really naive programs where the author didn’t bother caring about performance or allocations at all, but if any care was put into it at all Go usually wins in my experience.

The trope that Go’s GC is primitive in comparison to Java’s is not really accurate. You can’t consider a language’s GC in isolation.

Java’s GC and JIT are extremely complex because the language semantics are terrible for performance by default. The “everything is an object” model made sense when the language was designed and main memory access times were roughly equal to a CPU cycle, but that’s no longer true by a factor of 100 to 200 now.

Go’s GC makes different trade offs (low latency, extremely high concurrency, tight integration with the runtime and scheduler) because the language semantics are much more sympathetic to modern hardware (“true” structs, automatic escape analysis, etc), so it can.

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 12:22:09

Sure, Go can get away with more primitive GC exactly because it has “value types”, so less garbage is created. But they are still much worse, lower latency only means that they pause threads to get more breathing space if they have been allocating too heavily, they are absolutely not even close to the same league Java’s low latency ZGC does.

chrsig | karma 2686 | avg karma 3.64 · 2022-10-22 12:24:50

Do you have benchmarks you can share?

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 12:38:27

Well, cross-language benchmarking is always hard, but for purely testing the GC this is not too bad: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

See how ahead Java is of any other managed language (and it doesn’t really make sense to do this benchmark with non-GCd languages)? Though this is done with the G1 GC, not the low-latency one - this default GC is more throughput oriented with a max pause time target goal. Also note how Java does use multiple times more memory, as it “postpones running GC when it knows it still will have enough time to collect all of it without running out of the target goal” - this is also the reason why java is quite ahead on “energy efficiency” reports as well. And also, GCs work better with more heap space.

geodel | karma 6768 | avg karma 2.18 · 2022-10-22 14:38:17

> this is also the reason why java is quite ahead on “energy efficiency” reports as well.

Very soon businesses will be asking for "dollar efficiency" too. Going by the effort Java and its framework vendors are putting into packing more Java processes/pods onto a VM, I think tech-savvy customers are already asking for it.

So the old assumption that server-side customers only care about raw throughput and not machine size, because RAM/CPU/disk is cheap, does not hold up well in cloud-based deployments, where each of these now matters.

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 16:19:02

To be honest, I really don’t get this microservice/cloud hype, stackoverflow (which let’s be honest will be bigger than that 34th startup) runs on a single (very beefy though) dedicated server machine.

I pay like 5 dollars a month for a VM with very low params, but even that will happily run anything. And since the DB likely can’t be shared anyway, the bus factor will remain 1.

geodel | karma 6768 | avg karma 2.18 · 2022-10-23 00:42:14

> To be honest, I really don’t get this microservice/cloud hype, ..

I agree on that. And the bureaucracy that has evolved around "Microservice Architecture" (Kube pods, service mesh, and the many other pieces it requires) feels like trying to do well something that should not be done in the first place.

richieartoul | karma 812 | avg karma 6.11 · 2022-10-22 13:49:07

It’s really the usage of the word primitive that I’m arguing with. Java’s GC comes with a lot of additional trade offs that Go’s doesn’t.

For example, the fact that the Java GC is copying and generational means that there is a LOT more overhead introduced by write barriers.

If you benchmark the rate at which the GCs can clean up garbage, Java always wins, but the Java GC impairs you a lot more in other situations that the Go one doesn’t.

It’s trade offs, but the Go one makes much better trade offs for modern hardware IMO.

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 15:31:00

Write barriers are a single local conditional on the fast path, if I’m not mistaken. Also, since a JIT compiler is in action, it may have a much wider range than every object interaction. It’s basically free on modern computers with branch prediction.

ZGC (the low-lat GC) does employ read barriers though which will decrease throughput considerably, but latency and throughput are almost universally opposites of each other.

geodel | karma 6768 | avg karma 2.18 · 2022-10-22 14:26:56

> they are absolutely not even close to the same league Java’s low latency ZGC does

This is the kind of thing always offered without any serious numbers extracted from real life or even realistic test programs.

So even if technically true in a very narrow sense, it is more like high-performance car marketing with fancy algorithm and data structure names. By the time GCs are used in end-user programs with tons of libraries, frameworks, design patterns, and inefficient-to-implement business rules, they show little of the difference that the fancy ads promised on TV.

fsdjkflsjfsoij | karma 228 | avg karma 4.3 · 2022-10-22 12:14:21

Java requires a much more advanced GC and JIT because Java programs tend to allocate a lot more and have extremely bad memory layout when you're not restricting yourself to primitives. Project Valhalla's value types will significantly improve the situation. Relying so heavily on the JIT also has other problems especially in programs that have widely varying execution paths.

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 12:43:32

Surely, that’s the incentive part for why the team spent many many work hours improving their GC - just because the JVM typically depends more on a good GC doesn’t make it any less useful - long running processes do make significant use of automatic memory management.

Also, Java’s GCs are moving/compacting GCs, so while the immediate memory representation is indeed inefficient, again, for long running processes Java will place often together-used objects physically close to each other, and will defragment the heap. But Valhalla can’t come soon enough, I agree.

> Relying so heavily on the JIT also has other problems especially in programs that have widely varying execution paths

Has it? I would think that an AOT program would have a worse time with widely varying execution paths, while a JIT compiler is free to reoptimise based on changing application state.

geodel | karma 6768 | avg karma 2.18 · 2022-10-22 14:50:25

> just because the JVM typically depends more on a good GC doesn’t make it any less useful -

I mean it feels like personal choice. Do I praise my spouse for wrecking the whole kitchen while making a dish and then cleaning up quickly afterwards? Or do I take it as "Well, you made the mess, so cleaning up fast for later use was the basic expectation"?

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 16:11:33

I would wager that most applications have plenty of object lifetimes that are not regular at all, for example a web server with some stateful sessions. So your analogy doesn’t really make sense: Go can’t avoid these situations at all and will perform significantly worse in these cases.

foldr | karma 6163 | avg karma 1.17 · 2022-10-22 10:56:09

You’d generally expect Rust and Go to perform about the same for CPU bound workloads. Rust has access to more advanced codegen and optimizations via LLVM, but Go’s garbage collector will often be faster than refcounting (or whatever manual memory management technique your Rust code ends up using). This is especially so given that the GC runs on a separate thread without any additional effort on your part, making it almost ‘free’ in the context of parsers (which tend to be single threaded).

A real world example of this is esbuild. The author posted on HN that his initial Rust implementation was actually somewhat slower than the subsequent Go version.

super_flanker | karma 101 | avg karma 1.4 · 2022-10-22 14:37:53

> Go’s garbage collector will often be faster than refcounting (or whatever manual memory management technique your Rust code ends up using)

I'm not supporting the argument that everything should be written in Rust (or whatever) for good performance. However, a blanket statement like this is not true; micro-benchmarks are often misleading. Many factors affect performance and they come with tradeoffs, so you can choose the options that favor you most. In the end, Rust objectively offers more ways to optimize your program.

foldr | karma 6163 | avg karma 1.17 · 2022-10-23 02:35:56

Rust doesn’t offer the option of using a multi-threaded garbage collector. And ‘will often be faster’ is not a blanket statement; it’s just a rough generalization. I was basing the statement not on microbenchmarks but on the profiling done by the author of esbuild: https://news.ycombinator.com/item?id=22336284

super_flanker | karma 101 | avg karma 1.4 · 2022-10-25 17:31:11

> Rust doesn’t offer the option of using a multi-threaded garbage collector.

I'm not sure why you would want a garbage collector in Rust; most likely an arena allocator would be sufficient for what you may be looking for.

> esbuild

That's an interesting situation. I'll have to take the author's word here since the Rust version is not available for us to see. I write both Go and Rust (Go for a living and Rust for side projects), and I can see some situations where a naive implementation in Go could perform better than a naive implementation in Rust (assuming both implement the same algorithm). Again, Rust provides options to mitigate performance problems once you decide to optimize a bottleneck, but if you touch the upper ceiling in go program, there is not much you can do about it.

foldr | karma 6163 | avg karma 1.17 · 2022-10-27 17:06:01

You’d want a garbage collector in Rust for the same reason you’d want it in any other language. Manual memory management adds code complexity and can often be slower. Arena allocators only work in certain situations and considerably complicate management of lifetimes.

> , but if you touch the upper ceiling in go program, there is not much you can do about it

I’m not seeing this. There’s lots you can do to optimize Go code. Could you give a concrete example?

bugfix-66 | karma 501 | avg karma 2.15 · 2022-10-22 10:51:39

Correct, Go is fast, very close to C.

And just like in C, if you want to avoid memory management overhead you can use a slice of structs, integers instead of pointers, and a freelist (if needed). For example, here is a pointerless sparse bit vector:

https://bugfix-66.com/7256e0772dc3b02d72abf15b171731c933fd44...

The article is storing parses in a balanced binary tree, like a packrat memoizing parser.

Here is the fastest balanced search tree in Go. It allocates (and uses Go's garbage collector) but you can easily use a slice of structs with integer index pointers and a freelist instead:

https://bugfix-66.com/c93e950965804eba90a34e0055985b1c42d5a1...

The above code will perform very similarly to C.
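
For readers who haven't seen the slice-of-structs pattern, here is a minimal sketch of it. It is not the code behind the links above: the `pool` and `node` names are made up for illustration, and the tree is a plain unbalanced binary search tree to keep the example short. Children are `int32` indexes into one slice, and freed slots are chained into a freelist, so the collector has no pointers to follow inside the node storage.

```go
package main

import "fmt"

// nilIdx plays the role of a nil pointer.
const nilIdx = -1

// node refers to its children by index into pool.nodes, not by pointer.
type node struct {
	key         int
	left, right int32
}

// pool owns every node. Freed slots are chained through their left field.
type pool struct {
	nodes []node
	free  int32 // head of the freelist, or nilIdx
}

func newPool() *pool { return &pool{free: nilIdx} }

// alloc reuses a freed slot when one exists, otherwise grows the slice.
func (p *pool) alloc(key int) int32 {
	if p.free != nilIdx {
		i := p.free
		p.free = p.nodes[i].left
		p.nodes[i] = node{key: key, left: nilIdx, right: nilIdx}
		return i
	}
	p.nodes = append(p.nodes, node{key: key, left: nilIdx, right: nilIdx})
	return int32(len(p.nodes) - 1)
}

// release pushes slot i onto the freelist.
func (p *pool) release(i int32) {
	p.nodes[i].left = p.free
	p.free = i
}

// insert adds key to the tree rooted at root and returns the root index.
func (p *pool) insert(root int32, key int) int32 {
	if root == nilIdx {
		return p.alloc(key)
	}
	// The recursive call may grow p.nodes, so finish it before indexing
	// the slice on the left-hand side of the assignment.
	if key < p.nodes[root].key {
		l := p.insert(p.nodes[root].left, key)
		p.nodes[root].left = l
	} else if key > p.nodes[root].key {
		r := p.insert(p.nodes[root].right, key)
		p.nodes[root].right = r
	}
	return root
}

// contains walks the tree by index.
func (p *pool) contains(root int32, key int) bool {
	for i := root; i != nilIdx; {
		n := p.nodes[i]
		switch {
		case key < n.key:
			i = n.left
		case key > n.key:
			i = n.right
		default:
			return true
		}
	}
	return false
}

func main() {
	p := newPool()
	root := int32(nilIdx)
	for _, k := range []int{5, 2, 8} {
		root = p.insert(root, k)
	}
	fmt.Println(p.contains(root, 8), p.contains(root, 3)) // true false
}
```

The price of this layout is that indexes are only meaningful relative to one pool, and nothing stops you from using an index after releasing it, much like a dangling pointer in C.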

rollcat | karma 3733 | avg karma 3.02 · 2022-10-22 10:56:21

One of Go's primary goals has always been compilation speed.

Go started out in C, and was later (post-1.0) incrementally rewritten to be self-hosting. One of the authors (Ken Thompson) is also one of the co-creators of C. I would argue these guys know what they are doing.

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 11:13:09

I don’t know, not implementing generics when it was pretty obviously needed was a huge oversight, so I’m not sure.

Also, the reason for compiler bootstrapping is more of a “beauty thing” than practicality. It would definitely be faster in a low-level language, but I doubt it would matter to an end user.

fredrikholm | karma 1057 | avg karma 3.97 · 2022-10-22 11:30:54

You aren't sure if Ken Thompson knows what he's doing?

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 11:55:39

As a software architect? Absolutely. Programming language designer? Not sure, neither C nor Go is a good language in my personal opinion.

EDIT: I meant to write that I think very highly of him as an architect/developer.

sjansen | karma 405 | avg karma 4.88 · 2022-10-22 17:09:51

Experience has shown that often “worse is better”. Go does an amazing job of balancing complexity and power. I haven’t seen a “better” language that isn’t either slower, harder to become productive in, or both.

https://en.wikipedia.org/wiki/Worse_is_better

kcartlidge | karma 664 | avg karma 1.23 · 2022-10-22 11:34:19

> not implementing generics when it was pretty obviously needed was a huge oversight

I get the desire for generics. I do a lot of C# and have used generics for a very many years. Yet I've been writing Go for around 6 or 7 years and other than in the beginning (when I was new to it) I haven't found myself missing them at all.

In other words, for many people the lack of generics comes across as an oversight. For others, including myself (again, a heavy generics user in C#) that really isn't the case. I write Go in the style of Go and it just hasn't been an issue.

Blanket statements are rarely true. YMMV.

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 11:42:23

Well, it is not a blanket statement, it’s just the generic truth (pun not intended) based on decades of evolution of programming languages and a relatively expensive mistake for Java, which would have been a perfect opportunity to learn from.

Sure, it is seldom missed as an end user, but as a library user it is essential. That’s why map and the like had to be hard coded into the language, and why concurrent versions couldn’t be implemented for a long time in the language, the same way it was done for Java forever.

kcartlidge | karma 664 | avg karma 1.23 · 2022-10-22 11:56:02

> Well, it is not a blanket statement, it’s just the generic truth

A bit contradictory, really, but that's just semantics I suppose.

More importantly even if you could say 99% of devs agree (and you can't because they don't) that still doesn't make it an oversight.

If they'd neglected to add generics because it wasn't considered, that's an oversight. If it was neglected because in the opinion of the creators of the language it wasn't needed for the purposes they created it for, that isn't an oversight but a thought-through engineering decision.

Of course you're free to disagree with that decision, but an oversight it was not.

kaba0 | karma 9701 | avg karma 1.18 · 2022-10-22 12:00:33

Fair enough, I may not have used the correct word, but it is still a typical “told you” situation, both during development, after go’s initial appearance and ever since until it finally was decided that it should be indeed implemented.

kcartlidge | karma 664 | avg karma 1.23 · 2022-10-22 12:12:33

True.

Whilst I haven't missed it in Go myself, enough other people say they do that its inclusion was inevitable. Which means it probably should have gone in sooner.

chrsig | karma 2686 | avg karma 3.64 · 2022-10-22 12:22:10

There's a big misconception that the creators of go didn't want generics. They've stated a number of times that they didn't have a design that they all thought was adequate.

After several years and attempts at a good enough proposal, Ian Lance Taylor put one out that was able to cross the finish line, and now we have generics.

kcartlidge | karma 664 | avg karma 1.23 · 2022-10-22 12:36:18

> They've stated a number of times that they didn't have a design that they all thought was adequate

You know what, with all the 'discussions' in recent years about whether Go should have generics I'd actually lost track of that amongst all the noise. Which is irritating, as I do now remember the early conversations about it.

Thank you for the reminder.

Thaxll | karma 4756 | avg karma 1.87 · 2022-10-22 11:30:57

I've seen some real-world examples where Go was as fast as or faster than Rust for CPU / IO intensive tasks.

Go is a fast language even with a GC.

https://github.com/boyter/scc/#performance

akira2501 | karma 10559 | avg karma 3.02 · 2022-10-22 13:37:04

For maximum return on investment, wouldn't it be better to focus on something other than raw speed? I personally don't see the usefulness of trying to make everything in Rust itself too much as it's handicapped by its compiler and lack of specification.

frou_dh | karma 8157 | avg karma 3.09 · 2022-10-23 02:05:42

None of your ideas in this thread mention involving a profiler, so I take it you're from the Wild-Ass-Guess school of optimization?

binwiederhier | karma 2647 | avg karma 7.44 · 2022-10-22 10:27:41

Interesting read. Thank you for sharing. I always found parsers fascinating and mystical. It seems like these parser functions (which I think are analogous to what Rob Pike calls state functions) are a common way to do parsing, though I know very little about it. I especially found the combinators intriguing, though I don't care much for the functional programming syntax in a language like Go.

Anyway, thanks for sharing.

Tangentially, I wrote a little mini parser [0] of my own for my side project. It is inspired by Rob Pike's talk on parsers [1]. It doesn't use state functions, but instead just uses the call stack to keep track of where we are.

[0] https://github.com/binwiederhier/ntfy/blob/main/server/actio...

[1] https://www.youtube.com/watch?v=HxaD_trXwRE and https://go.dev/src/text/template/parse/lex.go

skohan | karma 9090 | avg karma 3.19 · 2022-10-22 12:26:43

Yeah parsers are fun! We did a recursive descent parser for a toy language in uni and I think it was one of the most illuminating and fun projects we did at school.

Lately I've been working on a tool to make it easy to implement a parser, and I end up using it for everything, because DSL's are so nice to work with.

tester756 | karma 3905 | avg karma 1.96 · 2022-10-22 10:49:12

The difference in complexity between an IDE's parser and a compiler's parser feels like an order of magnitude.

An IDE's parser has to be very fast, so you use techniques like partial tree reparsing, and now that I think about it, you may also need to update other places.

Say you use a type that is defined somewhere below,

and at first parse that type definition doesn't compile, so the type usage above shows an error.

When you change the type definition so that it compiles, if you only update that part of the tree, the earlier usage would still scream about the error.

It's really tricky.

The theory behind how to deal with all of these problems seems easy,

but when you actually get to the coding, you have to be really thoughtful, careful, and experienced in order to get the modeling right.

https://learn.microsoft.com/en-us/shows/seth-juarez/anders-h...

________

I have some experience with simpler and more complex parsers

and for me no other type of software requires this much careful thought as parsers do if you want to address all those things like

correctness, speed and recovery on broken code fragments, good error messages, good code, maybe concurrency

bbkane | karma 1033 | avg karma 2.62 · 2022-10-22 22:19:35

You would like https://rust-analyzer.github.io/blog/2020/07/20/three-archit...

pharmakom | karma 3602 | avg karma 2.45 · 2022-10-22 10:57:55

> a successful parse consumes at least one rune

This avoids infinite loops?

Jtsummers | karma 23473 | avg karma 2.62 · 2022-10-22 11:14:05

Yes, there would be two outcomes for an attempted parsing. Either it succeeds and makes progress (and eventually terminates) or it fails and consumes nothing (and terminates because eventually you run out of parsers to try).

throwaway290 | karma 1892 | avg karma 0.62 · 2022-10-22 11:43:22

Off-topic but are there any aretext users? How does it fare?

derek8bai | karma -1 | avg karma -1.0 · 2022-10-22 11:47:29

cool stuff

hk__2 | karma 6067 | avg karma 3.46 · 2022-10-22 12:17:51

If you want a hard/interesting parsing challenge, try Clojure’s #_ reader macro. It’s a powerful construct that allows you to comment the next form. If you’re not used to Clojure, it’s like writing #_ in front of anything --a function, an array, a keyword, etc-- to comment it, even if it’s on multiple lines.

For example:

    #_ (defn foo
         [x y]
         (println x y))

This is equivalent to commenting the three lines. Things become even harder when you learn that these things "swallow" the next form and can be used anywhere:

    (let [a #_ b 43] #_ #_ hello (H N) (print a))

The code above is equivalent to the following:

    (let [a 43] (print a))

All the rest is comments.

The hard/interesting bit is that to tokenize you must construct a syntax tree in order to correctly parse the next form, but in order to construct a syntax tree you first need to tokenize the code.
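
One way around the chicken-and-egg problem is to skip the separate tokenizer and read forms recursively straight from the characters. Below is a minimal sketch of that idea in Go. It is not how Clojure's reader is implemented: the `reader` type and `stripDiscards` are made up for illustration, only symbols, `(...)`, and `[...]` are understood (no strings, maps, or other reader macros), and delimiters are not validated.

```go
package main

import (
	"fmt"
	"strings"
)

// reader walks the source rune by rune (hypothetical type).
type reader struct {
	src []rune
	pos int
}

// closer maps each opening delimiter to its closing one.
var closer = map[rune]rune{'(': ')', '[': ']'}

func (r *reader) skipSpace() {
	for r.pos < len(r.src) && strings.ContainsRune(" \t\n,", r.src[r.pos]) {
		r.pos++
	}
}

// readForm returns the text of the next form that is not discarded.
// It returns ok=false at a closing delimiter or at the end of input.
func (r *reader) readForm() (string, bool) {
	for {
		r.skipSpace()
		if r.pos >= len(r.src) {
			return "", false
		}
		c := r.src[r.pos]
		switch {
		case c == ')' || c == ']':
			return "", false
		case c == '#' && r.pos+1 < len(r.src) && r.src[r.pos+1] == '_':
			r.pos += 2
			r.readForm() // read one complete form and throw it away
			// then loop around to read the form that follows
		case c == '(' || c == '[':
			r.pos++
			var items []string
			for {
				item, ok := r.readForm()
				if !ok {
					break
				}
				items = append(items, item)
			}
			if r.pos < len(r.src) {
				r.pos++ // consume the closing delimiter
			}
			return string(c) + strings.Join(items, " ") + string(closer[c]), true
		default:
			start := r.pos
			for r.pos < len(r.src) && !strings.ContainsRune(" \t\n,()[]", r.src[r.pos]) {
				r.pos++
			}
			return string(r.src[start:r.pos]), true
		}
	}
}

// stripDiscards re-prints src with every #_ form removed.
func stripDiscards(src string) string {
	r := &reader{src: []rune(src)}
	var forms []string
	for {
		f, ok := r.readForm()
		if !ok {
			break
		}
		forms = append(forms, f)
	}
	return strings.Join(forms, " ")
}

func main() {
	fmt.Println(stripDiscards("(let [a #_ b 43] #_ #_ hello (H N) (print a))"))
	// prints: (let [a 43] (print a))
}
```

The stacked `#_ #_` case falls out of the recursion: the inner `#_` discards `hello` and then returns `(H N)` as the form for the outer `#_` to discard.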

diffxx | karma 546 | avg karma 4.07 · 2022-10-22 12:29:45

> The parsers produce a sequence of tokens, not a full syntax tree. Writing a tokenizer is much easier than parsing full syntax trees...Most other editors don’t construct the full syntax tree.

Syntax highlighting is nice, but fast semantic analysis is the real holy grail.

I have come to think that the best way to develop a new language would be to implement a text editor in that language before releasing the language. The editor should (ideally) have emacs and vim key bindings, though it would surely at least have the bindings that the language author uses. The compiler/interpreter for the language would be embedded in the editor. This would allow for a much richer editing experience that goes beyond syntax highlighting. Indeed, the source code would become like a living document in the editor, where the editor could display inline information about both syntax _and_ semantics. The editor need not be fancy. It could be written as a terminal application, kind of like a language-specific nano or vim.

If the language/editor author is careful in how they design the editor, all of the syntactical and semantic tooling should be exportable into packages that can be consumed by other editors with plugin systems/lsp like vscode or neovim. Then the rich editing experience can be relatively easily exported to any other text editor. Tool authors would also then be able to write static analysis/linting/formatting/whatever tools on top of the semantic tooling that the language supports.

In some ways what I am describing is a more minimalist version of what one would get out of an image/IDE based language like smalltalk but the code representation would still remain as text files. The editor then becomes like a REPL except rather than being an ephemeral process, the REPL state is continually being written to disk and is resumable at any time from any computer capable of running the editor.

reply

maxbond | karma 6958 | avg karma 2.86 · 2022-10-22 12:59:36

I think it would be sufficient & more valuable to implement a language server for your new language, which keeps you focused on the parts related to your language rather than the struggles of implementing an editor, and then you'll be able to drop into VSCode, neovim, etc.

thechao | karma 4578 | avg karma 3.54 · 2022-10-22 13:04:27

Does anyone have a good intro tutorial for writing a language server in, say, C for some simple language? I find most of the docs a little too inscrutable to follow for just a bit of dabbling.

diffxx | karma 546 | avg karma 4.07 · 2022-10-22 13:14:36

If you write your own editor with language server features, writing a language server implementation _should_ (dangerous word) be relatively easy. By writing the editor in your language, you prove that your language is actually useful for a real world problem. You also have the flexibility to add any functionality that you wish without contorting your thinking to whatever may be imposed on you by the lsp model.

It isn't necessarily clear to me that writing a language server implementation is that much easier than writing an editor. To be fair, I have personally done neither (though I have written REPLs) so perhaps I am wildly off in my estimation.

reply

maxbond | karma 6958 | avg karma 2.86 · 2022-10-22 13:30:01

I don't really disagree, but would point out that, for the languages I hack on, they're DSLs that wouldn't be appropriate to write an editor in. I agree that you should be writing code in your language & solving problems with it though, even before there's a working implementation. It does really help hone your vision and focus on the problems you're solving.

I've not written a language server either, and I'm positive it's one or two orders of magnitude more difficult than a simple editor, but I think the time is better spent for many projects because it gets you into the nitty gritty mechanics of your language & sets you up for a good developer experience. If your language specializes in writing things like editors though, that would certainly not be the case.

Tangentially, my language development hot take is that YAML is a really great platform to start with, so you can focus on prototyping your runtime & semantics and kick the can on parsing & syntax. Parsing takes up a lot of time and you might not know if this language idea is worthwhile or not yet. Additionally, YAML is pretty darn good and you may never need to move off of it.

reply

diffxx | karma 546 | avg karma 4.07 · 2022-10-22 14:00:54

> for the languages I hack on, they're DSLs that wouldn't be appropriate to write an editor in

Ha, I have been working on a DSL for writing DSLs, which perhaps explains my perspective a bit.

reply

maxbond | karma 6958 | avg karma 2.86 · 2022-10-22 14:08:57

Sounds cool! If you have something public I'd love to see it. Email is in my profile.

diffxx | karma 546 | avg karma 4.07 · 2022-10-22 15:01:18

It's not quite ready to share but hopefully soon :)

maxbond | karma 6958 | avg karma 2.86 · 2022-10-22 15:53:16

Best of luck & happy hacking

Gibbon1 | karma 7778 | avg karma 1.11 · 2022-10-22 16:20:54

I agree with the parent. The big advantage of what he's suggesting is that code objects exist as first-class objects. They don't have to be created by parsing a text file, which means the editor can directly manipulate them.

If you've ever used CAD programs especially Schematic Capture and PCB Layout programs you'll get the idea. Everything displayed on the screen is an object in a database.

Instead of a parser blindly parsing a text file and coming across a struct definition, a struct is an object created by the programmer. Which means you can have hard links between the objects that make up the program. And those can be directly manipulated manually or programmatically.

The big advantage comes from maintenance and refactoring. Change the name of a field? Happens in exactly one place in the program. So instead of a diff with 1537 file changes you have 'renamed struct fobar to struct foobar'. Change a comment? Well it's just a changed comment.
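
A minimal sketch of the "hard links" idea, in Go. Everything here (`NodeID`, `StructDef`, `FieldAccess`, `Program`) is a hypothetical model made up for illustration, not how any existing editor or language stores code. The assumption is that use sites reference a struct by a stable ID rather than by name, so a rename touches one record:

```go
package main

import "fmt"

// NodeID is a stable identity for a code object; the name is only an attribute.
type NodeID int

// StructDef is a hypothetical "code object" stored in a program database.
type StructDef struct {
	ID   NodeID
	Name string
}

// FieldAccess refers to its struct by ID (a "hard link"), never by name.
type FieldAccess struct {
	Struct NodeID
	Field  string
}

// Program is a toy database of code objects.
type Program struct {
	Structs map[NodeID]*StructDef
	Uses    []FieldAccess
}

// Rename changes the name in exactly one place and returns a one-line
// description of the change; use sites are not touched.
func (p *Program) Rename(id NodeID, name string) string {
	old := p.Structs[id].Name
	p.Structs[id].Name = name
	return fmt.Sprintf("renamed struct %s to struct %s", old, name)
}

// Render resolves the link at display time, so it always shows the current name.
func (p *Program) Render(u FieldAccess) string {
	return p.Structs[u.Struct].Name + "." + u.Field
}

func main() {
	p := &Program{
		Structs: map[NodeID]*StructDef{1: {ID: 1, Name: "fobar"}},
		Uses:    []FieldAccess{{Struct: 1, Field: "x"}, {Struct: 1, Field: "y"}},
	}
	fmt.Println(p.Rename(1, "foobar"))
	for _, u := range p.Uses {
		fmt.Println(p.Render(u))
	}
}
```

The use sites never store the name, so there is nothing to rewrite in them; the "diff" is the single message returned by `Rename`.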

reply

maxbond | karma 6958 | avg karma 2.86 · 2022-10-22 16:30:53

I'd like to see languages like that, but it's a massive lift. Our systems of source control and CI/CD are built around the assumption that languages are text files that are somewhat line oriented. It's valuable for people to be able to use different editors and tools, and text files are the integration point that makes this work.

Fallible parsers are a bit of a hack, but they're a hack that works and is widely deployed already, and I'm not convinced that moving to a symbolic/binary representation creates enough value to justify the risk involved in the migration. That being said, I'd love to see it happen. We'd end the tabs and spaces debate once and for all!

reply

bbkane | karma 1033 | avg karma 2.62 · 2022-10-22 22:11:24

I know of two languages trying something similar to this:

- https://dion.systems/faq.html
- https://www.roc-lang.org/

They're both experimental, but fun to read about!

reply

theta_d | karma 134 | avg karma 3.05 · 2022-10-23 08:36:26

So smalltalk?

sesm | karma 878 | avg karma 2.03 · 2022-10-22 16:18:21

This article uses the word 'rune' extensively. From the context I assume it means 'lexeme' or 'token' (i.e. the unit the lexer/scanner produces and feeds to the parser). But then the article uses the word 'token' to mean the output of a parser ('keyword token'), while the usual terminology is that parser output is called a 'parse tree'.

So, in this terminology, the parser consumes 'runes' and outputs 'tokens', while the usual terminology is that the parser consumes 'tokens'/'lexemes' and outputs a 'parse tree'.
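
To make the two vocabularies concrete, here is a rough Go sketch of a stage that consumes runes and emits tokens. The `Token` type, its `Kind` strings, and the `Tokenize` function are invented for this illustration and are not taken from the article; the classification rules are deliberately simplistic.

```go
package main

import (
	"fmt"
	"unicode"
)

// Token is a hypothetical output unit: a kind plus the matched text.
type Token struct {
	Kind string // "ident", "number", or "punct"
	Text string
}

// Tokenize consumes runes (Unicode code points) and groups them into tokens.
// Whitespace is skipped; any other unrecognized rune becomes a "punct" token.
func Tokenize(src string) []Token {
	runes := []rune(src)
	var toks []Token
	for i := 0; i < len(runes); {
		r := runes[i]
		switch {
		case unicode.IsSpace(r):
			i++
		case unicode.IsLetter(r) || r == '_':
			start := i
			for i < len(runes) && (unicode.IsLetter(runes[i]) || unicode.IsDigit(runes[i]) || runes[i] == '_') {
				i++
			}
			toks = append(toks, Token{Kind: "ident", Text: string(runes[start:i])})
		case unicode.IsDigit(r):
			start := i
			for i < len(runes) && unicode.IsDigit(runes[i]) {
				i++
			}
			toks = append(toks, Token{Kind: "number", Text: string(runes[start:i])})
		default:
			toks = append(toks, Token{Kind: "punct", Text: string(r)})
			i++
		}
	}
	return toks
}

func main() {
	for _, t := range Tokenize("x = 42") {
		fmt.Printf("%s %q\n", t.Kind, t.Text)
	}
}
```

In the usual terminology this function would be called a lexer or scanner, and a separate parser would turn its `[]Token` output into a parse tree.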

reply

pgwhalen | karma 1163 | avg karma 1.94 · 2022-10-22 16:27:27

Rune is a type alias in Go, which more or less maps to the more common words “character” or “code point”.

https://go.dev/blog/strings

reply

sesm | karma 878 | avg karma 2.03 · 2022-10-22 16:29:04

Thanks! I don't know Go, so this terminology was surprising to me.

sjansen | karma 405 | avg karma 4.88 · 2022-10-22 16:32:15

In Go, `rune` is an alias for `int32` and is used to indicate the value is a Unicode "code point".

For characters in the ASCII range, that means it's just a character encoded using more bits. If you need to worry about the full Unicode range then it's important to understand Unicode Normalization Forms.

https://go.dev/blog/strings

https://en.wikipedia.org/wiki/Unicode_equivalence
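
A small sketch of both points, using only the standard library. The helper name `Describe` is made up for this example. It shows that a `rune` can be assigned to an `int32` without conversion, and that the same visible character can be one code point or two depending on the normalization form:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Describe reports the byte length and the rune (code point) count of s.
func Describe(s string) (bytes, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// rune is an alias for int32, so this assignment needs no conversion.
	var r rune = 'é' // U+00E9
	var i int32 = r
	fmt.Println(i)

	precomposed := "\u00e9" // "é" as a single code point
	decomposed := "e\u0301" // "e" followed by a combining acute accent

	fmt.Println(Describe(precomposed)) // 2 bytes, 1 rune
	fmt.Println(Describe(decomposed))  // 3 bytes, 2 runes

	// The two strings render the same but compare as different.
	fmt.Println(precomposed == decomposed)
}
```

Comparing such strings as equal requires normalizing them first. The standard library does not do this; the `golang.org/x/text/unicode/norm` package is the usual place to look.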

reply