Supercharging the Git Commit Graph

slobotron | karma 846 | avg karma 8.38 · 2018-06-28 06:43:00+00:00

Interesting tidbit:

> The developers making Microsoft Windows use Git

reply

xkcdefgh | karma 19 | avg karma 1.0 · 2018-06-28 06:57:35+00:00

Yeah, i believe it was a huge effort some time ago to migrate windows to git

LyalinDotCom | karma 2091 | avg karma 3.04 · 2018-06-28 07:44:08+00:00

Yep we use Git to build windows! here is the story that was told about our efforts.

https://arstechnica.com/gadgets/2018/03/building-windows-4-m...

reply

LyalinDotCom | karma 2091 | avg karma 3.04 · 2018-06-28 15:54:30+00:00

oh and also we did an interview with the DevOps lead for both source control and build system around Windows, so check that out too if you want: https://www.youtube.com/watch?v=nsXiKLLaH4M

01100011 | karma 7985 | avg karma 3.41 · 2018-06-28 03:40:49

Kinda surprising. I know everyone seems to love git these days, but I find it's really better suited towards distributed and/or smaller projects. After feeling the pain of a megarepo system at work, I'm pushing to switch to a monorepo(well, like 4 repos instead of 200). git sort of sucks for monorepos.

Also, even after learning a fair amount of git, I still find I spend a noticeable amount of time dealing with it. I don't remember spending a lot of time on any of the VCS systems I've used in the past. They just stayed out of my way and let do my thing.

reply

rtcoolaid | karma 0 | avg karma 0.0 · 2018-06-28 04:37:23

You are actually right and down voters have no clue (and this is what I hate about HN, if you don't comment then don't downvoye)

To answer your question though - Microsoft has a lot of git extensions that they are slowly submitting upstream. Hence git is usable for their megarepo.

Look up GVFS on GitHub, very cool work.

Same for Twitter (they are not submitting upstream though as far as I know)

Similar for fb/mercurial

reply

adrianN | karma 29995 | avg karma 2.78 · 2018-06-28 09:56:36+00:00

I do a lot of things that aren't really possible in older VCS with git, which is why I spent more time with it.

gmueckl | karma 4572 | avg karma 2.3 · 2018-06-28 10:39:14+00:00

I don't do new things in git, but since we transitioned, I spend about 3 to 4 times longer wrestling the VCS than with anything that I have used previously. This is merely for routine stuff, because it takes more steps

adrianN | karma 29995 | avg karma 2.78 · 2018-06-28 11:24:12+00:00

The additional steps enable workflows that weren't possible before, so it's a tradeoff.

gmueckl | karma 4572 | avg karma 2.3 · 2018-06-28 14:24:32+00:00

This argument is just wrong. I have fewer steps with less mental load on other VCS implementations, yet I can get better workflows at the same time with zero loss of functionality.

Prime example: the git staging area/cache/index needs to die. Git would be half as difficult to use with fewer code shredding surprises. This abomination is a prime example of badly exposed internal structure. Everything feature that is crammed into this whatever-the-hell-that-is could be replaced by a vastly superior solution which does not require anything like it.

reply

nemetroid | karma 3753 | avg karma 3.53 · 2018-06-28 16:18:28+00:00

The staging area is not necessary for your particular workflows. I found Mercurial horrible to work with precisely due to the lack of a staging step.

erik_seaberg | karma 4377 | avg karma 1.56 · 2018-06-28 16:25:57+00:00

Untested commits should never exist, so why are we creating commits from abstracted storage that can't even be built and tested? Staging should happen in the workspace, stashing stuff that's not ready to commit.

DaiPlusPlus | karma 8401 | avg karma 2.53 · 2018-06-28 17:14:18+00:00

> Untested commits should never exist

Says who? Branches and commits are cheap, it's how we can undo ourselves and freely experiment.

I agree that untested commits into a publishing/release branch shouldn't exist: all commits there should be merges from dev branches, but to say every commit should be tested is utter bollocks and denies us the advantages of cheap branching and commits.

reply

bzbarsky | karma 8095 | avg karma 3.01 · 2018-06-28 22:33:28+00:00

> all commits there should be merges from dev branches

Either they need to be squash+merge or the dev branches need to end up with working commits before the merge. Otherwise your life will become hell the first time you need to bisect.

reply

gmueckl | karma 4572 | avg karma 2.3 · 2018-06-28 11:34:36

Your dislike of mercurial is noted. But that does not change the fact that the staging area is objectively confusing and hindering users (don't have a link to the study here that looked into it).

jdbernard | karma 1277 | avg karma 1.89 · 2018-06-28 17:59:51+00:00

Objectivity is not something you can just claim. There must be some objective measure. You mention a study, but that's not sufficient without an actual reference.

jdbernard | karma 1277 | avg karma 1.89 · 2018-06-28 13:02:07

As a counter-anecdote, the staging area/cache/index is one of the main reasons I use git. When they were first released Mercurial was my initial choice. It was certainly more friendly to a beginner. But I quickly moved to using git and the staging area/cache/index you mention was one of the main reasons.

pdpi | karma 11546 | avg karma 4.83 · 2018-06-28 10:18:53+00:00

> git sort of sucks for monorepos.

Microsoft works around several of the issues there by using GVFS. Also, at Microsoft scale, everything "sort of sucks", there's just no silver bullets. You take one of the least bad options and put all the effort you can towards making it work as well as you can.

reply

a-dub | karma 3806 | avg karma 2.07 · 2018-06-28 18:01:30+00:00

No. Everything only "sort of sucks" at "Microsoft scale" if you're willing to blame it on "scale."

At "Microsoft scale," you have the resources to purpose build anything you need from scratch for any scale you're working at... therefore if anything "sort of sucks" it's because:

a) It's not worth the money/resources. (The "let it suck" approach)

or

b) Nobody cares (The "acceptance that we suck, at 'scale'" approach)

reply

a-dub | karma 3806 | avg karma 2.07 · 2018-06-28 21:26:42+00:00

... and I'm going to belabor this, because I think it's important.

There's a fair bit of excitement around the new, exciting, open source friendly Microsoft with built-in Linux kernel emulation, an embrace of git and a huge release of open source tools...

Honestly though, I'm not buying it, the issue I've had with Microsoft over the years doesn't just stem from the shitty software or aggressive business practices... it's the deep rooted culture of mediocrity it promotes.

People who belong to that church believe it's ok to build shitty software because doing it right is too hard. Why root-cause an issue when you can just script reboots? It's too hard anyway, we're at "Microsoft scale."

The rise of the internet and the companies that grew around it showed us that if you have a culture of giving a shit, you can build really complex things "at scale" that aren't complete shit.

I worry that the new Microsoft will be a different kind of trojan horse for the OSS world. It won't be "embrace, extend, extinguish" it will be more like a social media psyops campaign that beats it into everyone's heads that now we're at "Microsoft scale" it's ok for everything to "kinda suck", and if we're not careful... everything will.

reply

MaxBarraclough | karma 10788 | avg karma 2.11 · 2018-06-28 10:00:11

> I find it's really better suited towards distributed and/or smaller projects

But it was specifically designed for the Linux kernel.

reply

rootlocus | karma 2736 | avg karma 2.61 · 2018-06-28 10:21:12

Which is highly distributed with multiple contributors, maintainers, etc. The "monolithic" aspect is just one glorified repo: Linus's

yoklov | karma 1767 | avg karma 2.38 · 2018-06-28 06:43:16+00:00

Dang, this made `git log` inside the repo I use at work (which is enormous with a stupid number of commits) nearly instantaneous. Great work.

`git status` still takes over a second for me though, oh well...

reply

fanf2 | karma 38482 | avg karma 6.45 · 2018-06-28 07:52:03+00:00

Have you tried the fsmonitor hook feature that was added in git 2.17? https://blog.github.com/2018-04-05-git-217-released/

yoklov | karma 1767 | avg karma 2.38 · 2018-06-28 11:37:35+00:00

Yep, as well as the untracked cache. I also was using a split index for a while, but it didn’t play nicely with some tools...

erikb | karma 4464 | avg karma 1.37 · 2018-06-28 07:51:08

Everybody who seriously uses git will also want to see the graph somehow. Here's how I do it: https://github.com/erikbgithub/dot-files/blob/master/.gitcon...

bluebluetimes | karma 22 | avg karma 1.0 · 2018-06-28 08:09:37+00:00

To enable the commit-graph feature in your repository, run git config core.commitGraph true. Then, you can update your commit-graph file by running

‘git show-ref -s | git commit-graph write —stdin-commits'

how do you automatically update this?

reply

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 08:55:36+00:00

It's not perfect but doing it in a pre-push hook can be useful. Unfortunately there's no post-receive hook on client-side, which could have been useful in this situation...

masklinn | karma 65147 | avg karma 3.36 · 2018-06-28 11:49:15+00:00

> It's not perfect but doing it in a pre-push hook can be useful.

That's completely unnecessary and way too frequent. Some where else (reddit I think?) the authors noted that they'd like to have it run alongside GCs in the next version. So running it as pre-auto-gc is a better idea.

reply

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 12:17:40

I was thinking that you need to do that everytime refs change, but it's just a boost and I presume you can have some refs in the commit-graph and some not in there and it will still work, so you're probably right.

stolee | karma 99 | avg karma 3.81 · 2018-06-28 10:42:51+00:00

We are working to make this run automatically in the future. https://public-inbox.org/git/20180627132447.142473-1-dstolee...

You don't need to run this with every commit, but maybe once after a big fetch or right after a new clone.

reply

pmarin | karma 7280 | avg karma 11.7 · 2018-06-28 08:27:11+00:00

This is basically what Fossil call timeline or I'm missing something?

https://www.fossil-scm.org/index.html/timeline

reply

masklinn | karma 65147 | avg karma 3.36 · 2018-06-28 08:33:11+00:00

It's not a UI feature, the UI feature has existed forever (it's the log and log graph). This is a cache for the graph so that it does not have to be rebuilt every time it's displayed.

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 09:15:38+00:00

Yet another file in the .git directory. The work is impressive and certainly helpful, but I can already hear Fossil proponents say "just use SQLite", which is getting more and more true.

rtcoolaid | karma 0 | avg karma 0.0 · 2018-06-28 09:40:38

Is that what you got from the article? Or you're the guy that knows the names of all he latest tools but understands the underlying concept of none and comments this making HN a waste of time to actually learn anything.

naturalgradient | karma 1745 | avg karma 10.26 · 2018-06-28 10:16:58+00:00

I think you should refrain from making such comments and follow the best-intention assumption on HN.

Aissen | karma 14537 | avg karma 8.02 · 2018-06-28 09:58:15+00:00

I had to bite… The sqlite repository has only 20795 commits since May 2000 at the time of this writing: https://www.sqlite.org/src/timeline?udc=1&ss=m&n=100000&y=ci

This is the amount of commits that goes into Linux every ~5months.

Has anyone done any meaningful performance comparison between fossil and git?

reply

adrianN | karma 29995 | avg karma 2.78 · 2018-06-28 11:39:25+00:00

The number of commits is not a very meaningful metric.

masklinn | karma 65147 | avg karma 3.36 · 2018-06-28 11:47:28+00:00

It's very much a meaningful metric when the entire point of TFA is caching the commits graph. This is only an issue when you actually have lots of commits, and even more so a very branchy graph.

adrianN | karma 29995 | avg karma 2.78 · 2018-06-28 12:22:37

It's not a meaningful metric when discussing whether to use sqlite as storage for git instead of files in the .git directory.

jasode | karma 31908 | avg karma 8.42 · 2018-06-28 13:11:35

>It's not a meaningful metric when discussing whether to use sqlite as storage for git instead of files in the .git

There are 2 different conversations happening.

You seemed to be responding to ggp (rakoo) "sqlite db vs files".

However masklinn was responding to gp (Aissen) question of "Fossil vs Git" performance and a charitable reading would be comparing the Fossil algorithms (also affected by combination with underlying SQLite db engine algorithm) for commit searches, graph traversal, etc. In that case, the high number of commits and total repository size to stress test Fossil/Git would be very relevant.

An example post about Fossil scaling: https://www.fossil-scm.org/xfer/technote/be8f2f3447ef2ea3344...

reply

reificator | karma 6682 | avg karma 3.76 · 2018-06-28 14:21:46+00:00

How does the number of commits in the SQLite repository make it unsuitable as alternate store for git metadata?

Aissen | karma 14537 | avg karma 8.02 · 2018-06-29 07:58:10

@jasode's reply in this thread made a good summary of the two parallel discussions/quid pro quo:

- on filesystem vs sqlite (put git files in sqlite): there's a good benchmark on https://www.sqlite.org/fasterthanfs.html claiming sqlite is up to 35% faster than fs. I'd like to see the same benchmark with git's file pattern; also, it's a known issue with git that it was written for linux first, hence optimized against Linux (relatively) good fs performance (vs Windows and Mac at the time). Same with most OSS build systems that (over)use process forking, which is also very optimized in Linux.

- on Fossil vs git (why bother putting git files in sqlite and not directly jump to fossil?): that was my comment, and it relates to subject of this article (the commit-graph). I'm wondering if Fossil has seen the optimization that git has, with regards to number of commits, considering that sqlite is the only high-profile project that uses it. Maybe performance is supposed to be taken care of by the sqlite database itself ?

Edit: see https://www.fossil-scm.org/xfer/doc/trunk/www/stats.wiki for a fossil-provided performance analysis, that does a comparison with CVS (!).

reply

superflyguy | karma 550 | avg karma 2.13 · 2018-06-28 13:07:04

Use sqllite instead of git? Or git should use sqllite? If the latter then one problem is you'd need to keep your own fork forever as they don't accept patches. I'm not sure if that's a price worth paying to reduce the number of files git uses. Why is this a problem for you, anyway?

girvo | karma 11277 | avg karma 2.24 · 2018-06-28 13:53:08

They mean to use SQLite instead of the .git folder and file structures, I believe.

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 14:12:03

It's not a problem for me, because all I see is the different commands that _use_ the underlying infrastructure. It's more about the design that was chosen: if you want to speed up things with git you have to implement specific logic in application code that will write a file and will need to update it periodically to keep it up-to-date, instead of using a querying engine made specifically for this purpose.

I don't see git ever changing its file format, but I do see another tool that imports everything from git and gives you a read-only sqlite db where you can do whatever you want, including displaying a graph quickly as the post advertises.

reply

kyberias | karma 2624 | avg karma 1.95 · 2018-06-28 14:18:35

I think you don't fully understand what you are proposing. The storage engine (file system or SQLite) has little to do with git graph algorithm performance. SQLite doesn't magically "display a graph quickly".

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 14:29:00+00:00

What I'm saying is, the iteration step from "existing dataset" (what we have today) to "faster data traversal" (what the article proposes) is a custom file with a custom format on one side, and the appropriate query/index on the other side; one is definitely more understandable, portable and maintainable than the other.

WorldMaker | karma 12216 | avg karma 1.52 · 2018-06-28 16:43:43+00:00

Except that SQL has never had great DAG data structures, queries, nor indexes. You can model a DAG in a relational database, and you can non-standard SQL extensions to get some decent but not great recursive queries to do some okay semi-poorly indexed graph work, but having maintained databases like that at various times that all gets to be just as much a "custom file with a custom format" as dependent on database version and business logic as anything git is doing here.

If there was a stronger graph database store and graph query language for consideration than SQL you might be on to something. SQL isn't a great fit here either.

reply

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 19:47:34

Fossil itself is stored entirely inside a SQLite db and only uses it to do everything it needs; if Fossil can do it, any VCS can do it. In fact, there is a whole section on that point in the official SQLite page (https://www.sqlite.org/lang_with.html#rcex2).

I'm not saying SQL is the best way to store and query DAGs; any graph database would be better. All I'm saying is that SQL is probably better at designing and maintaining a solution than what git does with its custom file format and custom code.

I'm only comparing what the pile-of-files that git currently is and a full-fledged SQL database. None is perfect, but one feels overall easier than the other.

reply

WorldMaker | karma 12216 | avg karma 1.52 · 2018-06-28 20:46:06+00:00

But you are also almost intentionally confusing the SQL standard here in your comment with the SQLite implementation (a de facto standard, of a sort, but not a recognized standard by any body of peers to my knowledge) with SQLite's particular binary format (which does change between versions even). That is a custom file format with custom code. Certainly it is very portable custom code, as SQLite is open source and ported to a large number of systems, but just because it is related to the SQL standards doesn't gift it the benefit of being an SQL standard in and of itself.

The SQL standards define a query language, not a storage format. There are SQL databases that themselves optimize their internal storage structures into "piles of files". In fact, most have at one point or another. SQLite is an intentional outlier here; it's part of why SQLite exists.

There's nothing stopping anyone from building an SQL query engine that executes over a git database, for what that is worth. Because you can't execute SQL queries against it today doesn't really say anything at all about whether or not git's database storage format is insufficient or not.

All of that is also before you even start to get into the weeds about standards compliance in the SQL query language itself and how very little is truly compliant between database engines, as they all have slightly different dialects due to historic oddities. Or the weeds that there's never been a good interchange format between SQL database storage formats other than overly verbose DDL and INSERT statement dumps. That again are sometimes subject to compatibility failures if trying to migrate between database engines, due to dialectal differences. Including what should be incredibly fundamental things like making sure that foreign key relationships import and index correctly, without data loss or data security issues, because even some of that is dialectal and varies between engines (drop keys, ignore keys, read keys, make sure everything is atomically transacted to the strongest transaction level available in that particular engine, etc).

Git's current pile of files may not be better than "a full-fledged SQL database", that's a long and difficult academic study to undertake, but a "a full-fledged SQL database" isn't necessarily the best solution just because it has a mostly standard query language, either.

reply

SQLite | karma 2375 | avg karma 12.12 · 2018-06-28 22:59:05+00:00

> SQLite [is] not a recognized standard by any body of peers to my knowledge...

Well, there is this: https://www.loc.gov/preservation/resources/rfs/data.html

Also, the on-disk format for SQLite has been extended, but has not fundamentally changed since version 3.0.0 was released on 2004-06-18. SQLite version 3.0.0 can still read and write database files created by the latest release, as long as the database does not use any of the newer features. And, of course, the latest release of SQLite can read/write any database. There are over a trillion SQLite databases in active use in the wild, and so it is important to maintain backwards compatibility. We do test for that.

The on-disk format is well-documented (https://sqlite.org/fileformat2.html) and multiple third parties have used that document to independently create software that both reads and writes SQLite database files. (We know this because they have brought ambiguities and omissions to our attention - all of which have now been fixed.)

reply

rhencke | karma 1367 | avg karma 4.99 · 2018-06-28 13:28:00+00:00

I love SQLite, and Fossil is very cool, but I don't see the fundamental difference between Git adding another file in the .git directory, and Fossil adding another table or index in the SQLite database.

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 13:39:45

This is about having a single, unified interface for all operations. This is all explained in great details by SQLite itself at https://sqlite.org/appfileformat.html

grive | karma 783 | avg karma 4.35 · 2018-06-28 14:17:42

As usual with the SQLite / Fossil developer argumentation, it just seems very biased and far-fetched. Just one example:

> Pile-of-Files Formats. Sometimes the application state is stored as a hierarchy of files. Git is a prime example of this, though the phenomenon occurs frequently in one-off and bespoke applications. A pile-of-files format essentially uses the filesystem as a key/value database, storing small chunks of information into separate files. This gives the advantage of making the content more accessible to common utility programs such as text editors or "awk" or "grep". But even if many of the files in a pile-of-files format are easily readable, there are usually some files that have their own custom format (example: Git "Packfiles") and are hence "opaque blobs" that are not readable or writable without specialized tools. It is also much less convenient to move a pile-of-files from one place or machine to another, than it is to move a single file. And it is hard to make a pile-of-files document into an email attachment, for example. Finally, a pile-of-files format breaks the "document metaphor": there is no one file that a user can point to that is "the document".

More precisely:

> But even if many of the files in a pile-of-files format are easily readable, there are usually some files that have their own custom format (example: Git "Packfiles") and are hence "opaque blobs" that are not readable or writable without specialized tools.

What is advocated here is to transform the pile-of-files in a single SQLite database accessed through SQL queries. So instead of having only a few binary blob, transform everything in a binary blob and force the use of one specialized tool for everything.

> It is also much less convenient to move a pile-of-files from one place or machine to another, than it is to move a single file.

This is not true.

> And it is hard to make a pile-of-files document into an email attachment, for example.

I would not trust someone that had just sent his git repo over email.

> Finally, a pile-of-files format breaks the "document metaphor": there is no one file that a user can point to that is "the document".

A VCS will track source files. Maybe their argument is true for other applications, but for a VCS this is plain useless.

Indeed having only an SQL connector accessing a database is a unified interface to the file. But unifying this to the user means that you have to move the complexity further down, as explained:

> But an SQLite database is not limited to a simple key/value structure like a pile-of-files database. An SQLite database can have dozens or hundreds or thousands of different tables, with dozens or hundreds or thousands of fields per table, each with different datatypes and constraints and particular meanings, all cross-referencing each other, appropriately and automatically indexed for rapid retrieval, and all stored efficiently and compactly in a single disk file. And all of this structure is succinctly documented for humans by the SQL schema.

Yeah, and I don't want to have this complexity managed by a single "entity", I want to have several different tools available to do whichever kind of work I need to do. If I'm working on graphs and need to store them, I would prefer having the ability to read my file directly in my other tools for graph analysis / debugging without having to take the intermediate step of connecting to the SQL database, or redefining a way to work with the SQL paradigm to adapt my file format to the "dozens or hunders or thousands of different tables, fields per table, each with different datatypes".

This point is even more salient regarding grep / awk. The author obviously prefer using the query language of his choice and disregards the variety of tools to work on text, but there are many, many tools available to do all kind of work on it, and believing that

> An SQLite database file is not an opaque blob. It is true that command-line tools such as text editors or "grep" or "awk" are not useful on an SQLite database, but the SQL query language is a much more powerful and convenient way for examining the content, so the inability to use "grep" and "awk" and the like is not seen as a loss.

Is just nonsense. Passing on the file edition conveniently put under the rug, querying the text is usually only the beginning, usually someone wants to parse the output and act upon it, maybe even put back some modified version (sed), and so on.

The author just seems close-minded and living in his own world, unable to imagine that other people might want to work differently.

This reminds me a lot of his rant against git and for fossil, with the exact same bad faith arguments and lack of knowledge about other ways to do things.

reply

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 20:04:21

Your points are valid, especially considering that this page is explaining the benefits (for the author) of using SQLite as a generic application file format; however we're only talking about git here, and my usage of git is limited to "git some-command", sometimes "git some-command | grep foobar", and most of the time I'm in a GUI anyway. I'm not grepping the git objects directly, so whether I use a git subcommand or a sql subcommand won't make any difference to me. The real advantages of using sql subcommands for me are:

- I could probably plug that into something else with more ease than something that is git-specific - I have more flexibility for querying out of the box, without learning the specifics of each subcommand. The full SQL language is there at my disposal for outputting exactly what I need

reply

geezerjay | karma 3623 | avg karma 1.89 · 2018-06-28 16:26:13+00:00

> This is about having a single, unified interface for all operations.

Unless you're intending to run joins on git data, were exactlty do you see any fundamental difference between running CRUD operations via an SQL interface or just importing/exporting a file?

reply

rakoo | karma 6746 | avg karma 2.2 · 2018-06-28 19:34:18

That's the whole point: git data is highly relational. Retrieving a commit alone is completely useless to you, just as retrieving any of the core objects alone is. Every operation you do requires retrieving multiple, interconnected objects... which SQL excels at.

farresito | karma 1147 | avg karma 2.53 · 2018-06-28 09:15:52+00:00

To save people some time, an alias for your .gitconfig

create-graph = "!f() { git show-ref -s | git commit-graph write --stdin-commits ; } ; f"

reply

masklinn | karma 65147 | avg karma 3.36 · 2018-06-28 09:53:50

How does it save time, given this command will rarely be run, and next version will automatically run it on GC? Instead of copy/pasting a command you now have to copy-paste that same command with more gunk added and run it separately.

farresito | karma 1147 | avg karma 2.53 · 2018-06-28 10:59:50+00:00

You have to run it on every repository you want this in, don't you? At least, that's what I understood.

masklinn | karma 65147 | avg karma 3.36 · 2018-06-28 11:46:34

This is mostly for large repositories where building the revisions graph takes a long time (aka 5+ figures revisions). I have two of those from $dayjob, most of the stuff I work with/on/for doesn't come even remotely close. Running this on a repo with 5 commits is all but useless.

And even then you still really only need run it once per repository, you can just cd/paste/return; cd/paste/return; … Hell, you'd probably have an easier time writing a script which looks for all git repositories and runs the command versus having to manually visit each and do so.

reply

wscott | karma 1598 | avg karma 6.52 · 2018-06-28 09:51:27+00:00

"Before I joined Microsoft, I was a mathematician working in computational graph theory. I spent years thinking about graphs every day, so it was a habit that was hard to break. "

As a former BitKeeper developer, this is a key person for Microsoft to have on hand to improve git. BitKeeper got about 10X faster after it was used on the Linux kernel and many of the key performance wins were due to better graph traversal algorithms. Rick was a wizard at that sort of thing. The other was memory layout optimizations and caching. (my contribution)

So Rick made sure we walked the graph as little as necessary and I made sure the graph had an extremely compact representation containing as little information as possible and then once a target commit is found it is looked up in another store.

However, this quote was disturbing. "There are a few Git features that don’t work well with the commit-graph, such as shallow clones, replace-objects, and commit grafts. If you never use any of those features, then you should have no problems!" The joy of not having to write commercial software! Reading between the lines it appears they changed the default output order for 'git log' or some internal API and then didn't bother to fix the cases that depending on the old order.

reply

stolee | karma 99 | avg karma 3.81 · 2018-06-28 10:39:56+00:00

Sorry for the worrying note about the experimental feature. One issue when working in open source is that contributors don't have control over the release cycle, and review requires smaller series than having the feature be delivered all at once.

These interactions with grafts, replace-objects, and shallow clones are one reason 2.18 does not create and manage this file automatically. The commit-graph file works by representing the commit relationships in a new file, and if that file exists, we treat that as the truth. The commit grafts, replace-objects, and shallow clones use another set of special files to track commits whose parents have been modified in special ways. If you would like to see our progress on integrating these features together, please see this thread on the Git mailing list: https://public-inbox.org/git/20180531174024.124488-1-dstolee...

reply

wscott | karma 1598 | avg karma 6.52 · 2018-06-28 11:01:50+00:00

> One issue when working in open source is that contributors don't have control over the release cycle, and review requires smaller series than having the feature be delivered all at once.

Yes, my reply was unnecessarily disparaging. Overall this looks like a cool feature. Perhaps a stopgap solution if for those commands to just delete your cache. But then your repository will get mysteriously slower. I just need to think of this as a technology demonstration.

reply

stolee | karma 99 | avg karma 3.81 · 2018-06-28 11:06:02+00:00

The good news is that if you are using shallow clones, then you probably don’t have enough commits locally to need the commit-graph feature!

vhbit | karma 58 | avg karma 1.14 · 2018-06-29 04:21:03+00:00

Off topic: considering you've left BitKeeper and you were one of the most active developers in user forum, what's the status of BitKeeper? Is it still developed* or it's in maintenance mode for existing commercial clients?

* yes, it's open source, but being open source and "you can add any feature yourself" doesn't imply there is a momentum behind and a kind of "directed" force to move it forward

reply

tomfloyer | karma 49 | avg karma 24.5 · 2018-06-28 11:48:12+00:00

I wish i had those amazing algorithm skills.

xvilka | karma 6760 | avg karma 3.09 · 2018-06-28 14:19:45+00:00

Would be nice to have support of it in tig too.

yebyen | karma 2755 | avg karma 1.38 · 2018-06-28 16:44:52

One of the best things I ever did for my git usage was installing this ridiculous thing in my .gitconfig aliases:

    [alias]
        l = log --date-order --date=iso --graph --full-history --all --pretty=format:'%x08%x09%C(red)%h %C(cyan)%ad%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08%x08 %C(bold blue)%aN%C(reset)%C(bold yellow)%d %C(reset)%s'

It lets me type 'git l' and get this kind of commit graph:

    *       8264cf2a 2018-06-28 yebyen (origin/dev, dev) ...
    |\
    | *     df6e4c49 blabla commit messages
    | *     195d8924 commit messages
    |/
    | *     96e74a5c (branch-label) commit message
    | |\
    | | *   b2786c07 blabla
    | | *   85721288 blabla

This is semi-related but not nearly as interesting from a technical perspective. But it's one of the missing features in Git, to visually help people understand why it's helpful to rebase occasionally, and keep your commit history clean.

I think I found it here[1]

[1]: https://stackoverflow.com/a/16735971/661659

reply

WorldMaker | karma 12216 | avg karma 1.52 · 2018-06-28 16:56:09+00:00

I had a similar alias for a while at a past job, but realized I was just using `gitk` anyway, and `gitk` is installed by default everywhere and doesn't need an alias setup, so it's also especially easier to teach to junior developers.

Also, I don't encourage rebasing, especially to junior developers. I realize everyone has different preferences, but a messy graph is useful and there are other tools like `git log --first-parent` for getting "clean" baselines from the graph. The great thing about using a graph in the first place is there are a lot of traversal options and you don't have to micromanage where the trees are to still see the forest.

reply

yebyen | karma 2755 | avg karma 1.38 · 2018-06-28 17:58:18+00:00

Yes! I agree with you mostly, except that we try to keep from having junior developers for too long. Nobody is really a junior developer, and everyone should get a turn as release manager. The main benefit of this is that people can see directly how their work style impacts the release engineering process, if they do know exactly what that process involves and actually get a turn at it. (You have to stub your own toe in order to know how bad it hurts.)

Our team is actually really small, and we like to make sure everyone knows about the hassles involved in putting together a release and doing a complete code review when it's needed. The main reason I encourage rebasing is because it helps avoid a messy graph, and a messy graph makes it immediately much harder to do a rebase across any number of merges, or any non-trivial span of time.

So in other words, I like to selfishly expect the other developers on my team to do rebases at the appropriate times, in order to preserve my own capability to easily do rebases when needed. (We also learned last week that git rebase has a --preserve-merges option that I feel foolish to not have known about sooner. We've wiped out so many merge commits unnecessarily.)

We treat the master branch as carved in stone like anyone should, but other branches should clearly flow their merges only one way (features into releases, or features into environments), and if a branch hasn't actually been included in a release tag, it's considered fair game for rebasing. It helps us to prevent our three developers' sometimes too many concurrent trains of thought from resulting in equally too many confusing merges, or an intractable number of HEADs to manage and organize into releases.

One of the places we struggle is that we're not all-the-way onboard with CI/CD processes, but we do the best we can so that whenever support for that kind of thing materializes, we will mostly not need to change our processes at all and can obtain the benefits of that kind of tooling as quickly as possible.

reply

WorldMaker | karma 12216 | avg karma 1.52 · 2018-06-28 19:22:14+00:00

> other branches should clearly flow their merges only one way

I tend to disagree on this as well. The best person to merge a conflict is the developer creating the conflict in the first place, as soon or nearly as soon as they create the conflict, because they are most likely to know why the conflict exists in the first place. "Merge early, merge often."

Delaying "reverse" merges until the last possible second means you often don't have the integration expertise needed without research involved, even if it was "you" that introduced the conflict you may have moved on to other problems since and not recall why you did something one way or another, or which bits are important to integrate.

Delaying those merges as rebases I feel is even worse, because not only do you not have the resources to know why an integration needs to be made, you also aren't recording a history of the merge conflicts you saw in the rebase such that you can easily revisit the integration if you made an integration mistake (which does happen, because we are, after all, only human).

My advice tends to be to use a good PR system of some sort that makes your proper, code reviewed "forward" merges clear and obvious, and then don't worry about a mess of other merges "underneath" that top-level of strong PR merges. (Hence the key suggestion is that `git log --first-parent` is your best friend if you want a "clean" view of the DAG. It gives you a linear list of just your PR merges in master, or branch work and PR merges to that branch in any other branch.) Also, yes, CI/CD are really good ideas.

reply

yebyen | karma 2755 | avg karma 1.38 · 2018-06-28 19:35:50

> you also aren't recording a history of the merge conflicts you saw in the rebase

This is a good point. Mistakes are made during merge conflict resolution. But if you are tracking your upstream when you develop a long-lived feature branch, and rebasing when there are changes to the base, and actually comparing your rebased feature branches to the version that you had before you force push over the old remote version, then the only thing you really need to track is "does my diff still look like the change I intended to make."

The most valuable skill we are learning from Git is how to avoid merge conflicts altogether, and it tends to be more a human problem than a technological issue. (You don't avoid merge conflicts with some special commit strategy, you do it by ensuring that two people are not actively changing the same part of the codebase unless it's absolutely necessary.)

reply

WorldMaker | karma 12216 | avg karma 1.52 · 2018-06-28 20:05:08

I still think you are better off preserving every merge point-in-time when they were made than ever rebasing feature branches. Because yes, every rebase is an opportunity for mistakes to go unnoticed, and there's no "rebase log" to try to unwind a mistake weeks or months later.

Also, every rebased branch is its own likely source of merge conflicts for people working on "differently rebased" versions of the same code.

> The most valuable skill we are learning from Git is how to avoid merge conflicts altogether, and it tends to be more a human problem than a technological issue. (You don't avoid merge conflicts with some special commit strategy, you do it by ensuring that two people are not actively changing the same part of the codebase unless it's absolutely necessary.)

Definitely. Communication is a key. It's also why I think "merge early, merge often" works better, because it forces communication as early and often as possible ("I'm seeing a merge conflict with this work you are doing, can you explain it to me?"). Keep branches as short-lived as possible, and try to avoid "separate but equal" work where you can't comingle features and have to intentionally fence your branches from integrating with each other. Merge feature branches between each other, even, to keep communication up. Find tools like feature flags that allow you to "ship" to master "unfinished" work faster to save yourself from trying to integrate long running branches after the fact. (If you are going to let "marketing" choose which features "ship" in a version, it is much nicer to do so by flipping flags in a dashboard somewhere, maybe even one Marketing can use themselves, than to try to furiously merge long-running feature branches at release time.)

reply

yebyen | karma 2755 | avg karma 1.38 · 2018-06-28 20:17:08+00:00

+1 for feature flags. We have struggled to manage long-lived changes (our current iteration has been going on for nearly a year.)

The amount of uhh "pucker" I'll feel when we release next month, and certain pieces of code are touching prod for the first time, is much higher than I'd like.

The only reason I can sleep at night while spending a whole year with only hotfixes and minor feature addons for the last release going to prod, is because we've spent much too much time at all layers of the testing pyramid, and every time I've made a change that should probably break some tests, it actually breaks a few more tests than I expected (so I know that our coverage is pretty good.) Every time we write a brand new feature, it absolutely gets a matching Cucumber feature test in plain english that we run before and after every merge. And that's not even to mention unit testing.

So we can be reasonably confident that everything we've ever promised, is still true. Our test suite takes over an hour to run at this point, which is way longer than it should. But it helps us catch the bugs, and well before they start to have "piled up."

I'd feel marginally better if we knew what code doesn't run in prod because we could turn it on and off, and see those feature tests failing (or simply exclude them, since they would also be tagged with the feature label.)

It's interesting that some people feel feature flags are complicated enough to make "Feature flags as-a-Service" businesses a thing, like LaunchDarkly. I looked at what they are offering and thought it would serve us well, but we haven't actually started using feature flags for anything.

reply

linkmotif | karma 1335 | avg karma 1.53 · 2018-06-28 19:09:16

Rebasing is fundamental to a proficient local workflow. Interactive rebasing with git rebase -p lets you craft and recraft your commits easily until they are readable and tell a story. The sooner a person can rebase, the sooner they can use Git. It’s not really possibly to proficiently or pleasantly use Git without rebasing.

WorldMaker | karma 12216 | avg karma 1.52 · 2018-06-28 19:27:49+00:00

I very strongly disagree with that. I'd rather someone be proficient with the staging index, and maybe even git stash, than rebasing. Maybe, if they are feeling fancy, `git add --interactive`, `git add --patch`, and/or `git commit --amend`.

Rebasing is a fascinating footgun. I'd rather a messy story that includes details of how someone screwed up, then fixed their mistakes, than an entire branch I need to cherry pick in a hazmat suit because a developer rebased the wrong thing.

reply

yebyen | karma 2755 | avg karma 1.38 · 2018-06-28 19:42:09+00:00

> I'd rather a messy story that includes details of how someone screwed up

Strongly disagree. I am mostly interested in reviewing your best work, and ensuring that it is correct, as much as I enjoy hearing the story about how you got there. When we spend a lot of time on specs and ensuring they are correct, it's mostly important that I can review your change and compare it to the spec to ensure correctness. A commit that was wrong, but was corrected before it ever went to prod, doesn't really help me at all during code reviews.

I agree that 'git stash' and 'git commit --amend' are also important skills to have, but I prefer to make small, incremental commits that are obviously correct changes in isolation, even if they are not very interesting, so long as they are moving something toward the goal. Then I'll condense them into commits that look something like "the whole change I intended to make" – but that's maybe once it's fully formed and the tests are actually passing.

I don't like to keep a lot of small commits around, again selfishly and simply because they can make rebasing harder. It is telling how many things I do in service of keeping it easy to rebase a changeset. It's not an unobtrusive feature and there's definitely merit to the argument that it's a nice gun to shoot yourself in the foot with.

There is a happy medium between squashing all commits within a feature branch, and keeping every "boring, but obviously correct" incremental change intact forever. A change is only interesting in isolation if it might be reverted, or if I read it in isolation and for whatever reason strongly believe that I might ever want to read it in isolation again.

reply

WorldMaker | karma 12216 | avg karma 1.52 · 2018-06-28 20:26:54+00:00

A point I've tried to make is that we don't need a "happy medium", we can keep the mess that the sausage was made from and still get the sausage final product.

We have the power of a DAG to both record and hide all the mess. I can take my messiest repository and with a couple DAG traversal options still get that clean "story" back out of it. On the flipside with rebase you are absolutely erasing history and there is no way to dig deeper into any information lost or discarded in that rebase. I can give you what you want to see from my repositories, but you can't give me what I sometimes want to see from yours.

Merge commits are a great place to tell the story of "the whole change I intended to make". A good PR system even encourages and automates exactly that. (It also makes it easier to review the branch as a whole, or incrementally, or what have you in between to suit your needs/interests.)

I don't outlaw rebasing locally. There are absolutely cases in which a local rebase is necessary, and if a developer feels comfortable with rebasing I'm not going to stop them from doing it in a branch they control.

But I also don't encourage rebasing. I'd rather see how the sausage was made, ugly as it was, as it does tell a story even if you may not find it an "interesting" story, and sometimes in code archaeology or debugging nightmares you do need to dive into tiny incremental trivia.

reply

yebyen | karma 2755 | avg karma 1.38 · 2018-06-28 20:43:10

Your argument has merit, but I think we can agree to disagree. I'm okay with erasing some history sometimes.

The main point I'm trying to make is that developers should not have to feel compelled by some unspoken pressure to make fewer commits, knowing that once they are made, some day, someone will be stuck reading or replaying them. Commit history is malleable, and unpublished commits are mostly free to be thought of as exactly like soup; free to stir at any time, and ladle into portions as needed.

Sometimes I read a commit in a chain of commits that I made and, though it was interesting enough to commit for some reason, I can immediately tell that the moment I've just replayed will never be a point in time that I'll ever want to replay again. I can rebase in that moment and potentially save repeating some wasted time for "future-me."

Sometimes I even know at the time of making the commit that it will not be interesting ever again, but that at that moment, I'm less likely to make an error if I just `git add -p` the "done part," commit that, and keep the WIP unstaged. Maybe the change is so boring that I'll actually write "squash" or "fixup" as the commit message, and then I'll be sure to follow my own advice some time soon, when it's not so interruptive to my flow, and I will squash that commit before it gets published or merged.

I love descriptive commit messages, but I also love totally obvious, eminently readable code. And I feel like sometimes the message I need to get out to developer members of my team is that just one "readable" or patently obvious block of code is easily worth ten descriptive commit messages.

There also may be times when "nobody is reading your commit history" is a little demoralizing but also exactly what a developer needs to hear. "We don't need a novel about each commit, write just what you need to and spend that time you saved making the code itself actually better."

reply

linkmotif | karma 1335 | avg karma 1.53 · 2018-06-28 22:03:23+00:00

Take some time and play around with these commands and read about them. They’re great... there’s nothing “fancy” or “advanced” about them. They just help you express your thoughts with Git more and better than if you don’t use them. Trust me, no one wants to read your free form Git history.

claytonjy | karma 1395 | avg karma 2.22 · 2018-06-28 17:20:37

I wonder if any of these folks will have a hand in improving the Github commit graph, now that Microsoft owns it?

As great as Github is, I've never understood why they still have such a hard-to-read, horizontal commit graph while competitors like Stash/bitbucket/gitlab have all had beautiful vertical graphs (like shown in this article) for as long as I can remember. I think this is especially valuable for newbies who are less inclined to get similar viz at the command line, but still useful for vets when they (inevitably) end up in weird branch situations.

reply

s2g | karma None | avg karma None · 2018-07-03 15:43:33+00:00

This is very cool.

Makes me sad I'll never do anything like this.

reply