Git's initial commit (github.com)
351 points by olalonde | 2014-11-23 18:18:04 | 128 comments




I love checking out very early versions of projects. You often get to see the essence before the real world came in and ruined the beauty of it.

I do this as well. It really should be more widely broadcast.

(I've also spent some time thinking about how it's kind of a hack, and what we can do to make it better: http://akkartik.name/post/wart-layers)


There is The Architecture of Open Source Applications series of books (http://aosabook.org/en/index.html), where one of the authors of each piece of software explains the essence of the program.

I know what you mean. The SystemD controversy motivated me to take a look at the initial version of NetBSD's init rc script, which was nicely simple.

Great to see the original command set, and the title of course: "GIT - the stupid content tracker"

If I recall, Linus was highly pissed at the time he wrote GIT. Lots of his comments at the time were meant as a slam against the guy who was reverse-engineering the Bitkeeper protocol, which resulted in the license for Bitkeeper getting yanked for the kernel project. I wonder if Linus is still angry with Tridgell?

I've been in the situation where a combative party has spurred me on to do some of my best work. I doubt Linus holds a grudge... and considering the consequences I wouldn't be surprised if he wrote a tongue-in-cheek thank you letter!

Following the tradition of sports, I propose that commit id e83c5163316f89bfbde7d9ab23ca2e25604af290 be officially retired.

Every commit id is pretty much automatically retired. The odds of collisions in our lifetime are pretty small.

Or am I over-thinking it?


Yeah, I'm pretty sure it was a joke.

I was under the impression that git hashes are generated based on the contents in a commit. If that's true you cannot retire them. They're not random.

While that is true, part of the contents that is hashed is the time of commit. Therefore if you (somehow...) managed to get that hash, you could just reset and commit again. Or amend.

In any case, I believe he was joking. The odds of a sha1 collision are very very low.
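
To make the point about the commit time concrete, here is a minimal sketch of how a commit id is derived in present-day git: SHA-1 over a small header plus the commit text. The author string, message and timestamps are made-up placeholders, and the object layout is that of modern git, which may not match the initial commit exactly.

```python
import hashlib

def git_object_id(kind: bytes, body: bytes) -> str:
    """SHA-1 over '<kind> <size>\\0' followed by the body, as modern git does."""
    header = kind + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

def commit_body(tree: str, who: str, timestamp: int, message: str) -> bytes:
    """Build the text of a parentless commit object."""
    lines = [
        f"tree {tree}",
        f"author {who} {timestamp} +0000",
        f"committer {who} {timestamp} +0000",
        "",
        message,
    ]
    return ("\n".join(lines) + "\n").encode()

# Id of the empty tree; the identity and timestamps below are placeholders.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "A U Thor <author@example.com>"

first = git_object_id(b"commit", commit_body(EMPTY_TREE, WHO, 1112911993, "initial"))
again = git_object_id(b"commit", commit_body(EMPTY_TREE, WHO, 1112911993, "initial"))
later = git_object_id(b"commit", commit_body(EMPTY_TREE, WHO, 1112911994, "initial"))

print(first == again)  # same tree, metadata and time: same id
print(first == later)  # committed one second later: different id
```

So the id is deterministic, but re-committing the same tree a second later already yields a different one.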


Very low indeed. The work needed to find a SHA-1 collision is on the order of 2^52 operations. One person explained it this way: the chance of everyone currently on Earth winning a jackpot in their lifetime is actually higher than that of a single random SHA-1 collision. It is mind-boggling how many software systems and algorithms rely on hashes not colliding.

I think it is 2^61 now; the authors of the paper that once claimed the 2^52 attack have withdrawn that claim.

Given that the only way to reuse it is to duplicate the tree and commit metadata exactly, or find an sha1 collision, I think it's pretty safe. :)

I wonder if there are any git sha1 collisions out there in aggregate, say across all of github. Would they even notice if there were?


>I wonder if there are any git sha1 collisions out there in aggregate, say across all of github.

Despite the incredibly high number of commits there must be, I think a collision is still very unlikely. 2^160 is a pretty big number.


The number of inputs before a likely collision is more on the order of 2^80. Which is still pretty large.

True, the birthday paradox definitely makes it a lot more likely, but as you say the odds should still be too low.

This is comparable to the number of atoms in the universe. Pretty large! We will never see an accidental collision.

Not quite, atoms in the universe is in the range of 10^80, which is a bit less than 2^266.

On the other hand, 2^80 is "only" approx. 1.2 * 10^24. Still, good luck colliding with that without big effort.
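
The numbers in this subthread are easy to check. This is a rough sketch using the standard birthday approximation p ≈ 1 - exp(-n² / 2^(bits+1)); the figure of ten billion objects is an arbitrary assumption for illustration, not a real count of anything on GitHub.

```python
import math

def collision_probability(n_objects: int, hash_bits: int) -> float:
    """Birthday approximation for at least one accidental collision."""
    exponent = n_objects * n_objects / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-exponent)

print(f"2^80 as a decimal: {2 ** 80:.3e}")
print(f"10^80 as a power of two: 2^{math.log2(10 ** 80):.1f}")
print(f"p(collision) with 2^80 objects: {collision_probability(2 ** 80, 160):.3f}")
print(f"p(collision) with 10^10 objects: {collision_probability(10 ** 10, 160):.3e}")
```

With 2^80 random 160-bit hashes the chance of a collision is already substantial, which is why 2^80 rather than 2^160 is the relevant figure; with any realistic number of objects it is vanishingly small.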


There's a table in http://en.wikipedia.org/wiki/Birthday_attack which gives some numbers, but it's missing the 160-bit entry. Nevertheless, even the number of 128-bit hashes required for a random collision is extremely high.

In hindsight, it's good that git didn't choose MD5, since collisions for MD5 can be generated almost trivially now. However, the decreasing security of SHA-1 could be a concern for the future.


I don't think commit hash was ever intended to be cryptographically secure. It's just a unique identifier.

> Source control management systems such as Git and Mercurial use SHA-1 not for security but for ensuring that the data has not changed due to accidental corruption. Linus Torvalds has said about Git: "If you have disk corruption, if you have DRAM corruption, if you have any kind of problems at all, Git will notice them. It's not a question of if, it's a guarantee. You can have people who try to be malicious. They won't succeed. [...] Nobody has been able to break SHA-1, but the point is the SHA-1, as far as Git is concerned, isn't even a security feature. It's purely a consistency check. The security parts are elsewhere, so a lot of people assume that since Git uses SHA-1 and SHA-1 is used for cryptographically secure stuff, they think that, OK, it's a huge security feature. It has nothing at all to do with security, it's just the best hash you can get."

http://en.wikipedia.org/wiki/SHA-1#Data_integrity


Too bad he didn't use SHA-256 though. It had been available for three years at that moment.

According to the Wikipedia entry[0], "No actual collisions have yet been produced", github or otherwise. The NSA might have produced them, but publicly none have been found, and it's not for lack of trying.

[0] http://en.wikipedia.org/wiki/SHA-1


I take that statement to imply "on purpose", or as part of an attack. You can't know whether there's a coincidental collision anywhere in github unless you bother to look. But I do understand that it's still extremely improbable.

In a thousand years, a git sha-1 collision is going to cause a lot of trouble.

Done. Don't ask me how I did it, but you will never see that hash come up again naturally during your lifetime.

It is SHA-1 though; it might still be broken during our lifetime.

Which is why he added the caveat "naturally".


Good memory!


Is there a reason there aren't any braces around single-line if statements? Is that a C thing? It seems kind of inviting to bugs to me.

It's a C-style language syntax option. If it's only a single line in after the if, the braces are optional. I've also seen it in C++ and PHP.

Whether or not it's sloppy is up for debate and just a matter of personal preference.


C# also has it, and I've heard Java has got it as well, but I've never tested it in the latter.

I personally like being able to do it since it allows me to do away with the 2 extra lines auto indent puts in if I add brackets. That's a 50% reduction for a 4 line if. Maybe I should just buy a bigger monitor.


The lead programmer at my last job encouraged us to use it as a way to make sure our conditionals & foreach loops (PHP programmer) weren't doing too much. If we had to use braces, it was a sign to check it out and see if it could stand some refactoring.

Sounds a tad inane.

I use a plugin called littlebrace for visual studio, which reduces brace lines to like 3 pt font. My C# code has started looking like Python :)

This sounds really neat. Can you drop a link? I browsed around but couldn't find anything.


Also see this fork/port to VS2013:

https://github.com/owen2/little-braces

Also, it is available from the online add-in manager if you search for "little braces." The VS2013 community edition is just in time :)

I couple it with the indent guideline plugin for best effect (braces are super small, light lines to track indent level, 2 space indent...).


Or use a better bracing style. Or move to significant whitespace.

"If it's only a single line in after the if, the braces are optional."

Not _line_, _statement_. Consider

  if(flag)
    foo(); bar();

and

  if(flag)
    foo =
      bar +
      baz;

That first example always calls bar().

Warning: I haven't tested this, and am beginning to doubt a bit. It must be correct, but why, then, don't I remember seeing this in underhanded C contests? Combining that with macros allows you to hide the semicolon.


So as to have more code on the page. It's official style of the Linux kernel [1].

[1]: https://www.kernel.org/doc/Documentation/CodingStyle


In the C grammar, braces denote compound statements. Control flow statements can take any type of statement as their body rather than just the compound variety.

It confuses me that it doesn't work for functions, like

    int main() return 0;

In K&R C, the function braces serve to separate parameter declarations and local variables:

    int main(argc, argv)
      int argc;
      char **argv;
    {
      int local;
    }

Thanks, I haven't seen that syntax before.

It's pointless to argue over these kinds of things. Every major project/company has its own codified code style guide, and if you want to contribute/earn your salary then you must follow that style guide to a T. Here's the relevant quote from the Linux kernel coding style[0]:

    Do not unnecessarily use braces where a single statement will do.

    if (condition)
            action();

[0] https://www.kernel.org/doc/Documentation/CodingStyle

I consider that a bug in their style spec. Single line if statements are known to cause bugs.

That's what I'd thought for maybe over a decade. About 3 years ago I revamped my personal coding style to eliminate as much unnecessary baggage as possible. As part of that I stopped using braces for single-line ifs, and I have yet to bump into a bug because of that. Overall I find the code looks more compact and cleaner, and maybe even reads with less friction. Nowadays when I see braces around a single-line if I get that "oh, that's clunky code" feeling in my stomach. Things are worse with C# and a lot of Java code, where people insist not only on having braces around a single-line if but also on putting { on its own separate line.

I think a good language shouldn't have braces to mark blocks in the first place. Given indentation, they are redundant most of the time and just add clunk. This is exactly the case with Python, where this is essentially the default style, and people haven't been complaining about it causing bugs.


There's an enormous difference with Python, because the indentation is syntax. These two code snippets, one in Python, the other in C, do not mean the same thing:

C:

  if (condition)
    statement_1();
    statement_2();

Python:

  if condition:
    statement_1()
    statement_2()

Personally, I always use braces in C and C++, even though it is more clunky. I want the assurance. I also frequently have to make changes to code that does not use braces, and then I have to add the braces in because I am adding statements to a conditional. To me, that is more clunky.

I nowadays resist the urge to make syntactically beautiful code if that means that it is a little bit brittle or vulnerable to mistakes.

Wrt. using { }, I omit them if I put the block to be executed on the same line, and usually I do that only with special cases, e.g.

  if(expr1) continue;

  if(expr2) throw new RuntimeException();

I see merit in both sides of the debate. As long as the style is consistent, it should be legible. It is an argument similar to vi versus Emacs -- both can be right. It is more important to be consistent through a project. Adopt one style.

From this day forward, let us proclaim to always use brackets; so that intent is more obvious to the reader.


You can sidestep the braces debate by using a lisp.

To introduce Lisp you need to fight the parens debate instead ...

or you could just use Python

... if you want to get in to the whitespace debate instead

Easily solved by pretending the parens are whitespace. :-D


It's common in Java too...

Most C-inspired languages (most popular programming languages) allow this. The only ones I've used that don't are Go and Swift. And I find it a little annoying.

If you're worried about bugs, there are other things in C/C++ to criticize first ;)


It's true for Rust as well. What these three languages have in common is that they don't require parentheses around the switching boolean expression, and when you have things like:

    if expression1 expression2

it can be fiendishly hard to determine the boundary between expression1 and expression2.

C has potentially ambiguous association on nested "else"s. The habit of too many braces is a safety.

Interesting fact about Git is that it was self hosting in two weeks, IIRC.

How can something that isn't a programming language be self-hosting?

Overloading the term. The OP presumably meant that the source for git was under git source control.

Yes, that's what I meant.

'Hosting' means 'contain', 'serve'. A building can host a department or a convention, and a married couple can host a dinner party, with neither being required to be a webserver or programming language.

To add to that, IMHO self-hosting for VCSs is closer to the original meaning of the phrase than for compilers.

Version control systems are self-hosting when they are used to manage the primary repository of their own source code. This shows confidence because if the program breaks, then it breaks its own configuration management, which could be a headache to unravel. For example if the repository format changes, then the change has to be managed so that the old versions remain accessible through the new version of the software. If this is not managed, and old compiled binaries of the version control system disappear from existence, then it may become impossible to recover the old sources.

Thus, successfully self-hosting a version control system is some measure of evidence that the developers know what they are doing and can manage the changes. (And thus they understand change management and we can trust them to be working on version control software.)

http://en.wikipedia.org/wiki/Self-hosting

"Other programs [than compilers] that are typically self-hosting include kernels, assemblers, command-line interpreters and revision control software."


Why are there multiple main() functions? I've never seen this style before. Is it multi-process?

There's a bunch of different utilities in there. Each has its own main() function, and they're compiled into a bunch of binaries.

Could just be initializers for different modules maybe?

There are multiple main() functions because there are multiple programs! Check out https://github.com/git/git/commit/e83c5163316f89bfbde7d9ab23...

Look at the makefile, it just has a command line program for each basic operation.

Well, while we're looking at FIRST POSTS, here's Mercurial's, self-hosting a month after git, and like git, also created to replace bitkeeper:

http://selenic.com/hg/rev/0#l10.1

The revlog data structure from then is still around, slightly tweaked, but essentially unchanged in almost a decade.


Mercurial is impressive for making Git's UI look intuitive.

Other way around

C'mon, man, you make branches by cloning the repository[1]. That's insanity.

[1] http://hginit.com/05.html


git at revision 0 worked the same way. You can see that there are no references in git at that time either. They're both copying bitkeeper, which worked the same way.

Nowadays git has references (branches), and hg has bookmarks which are the same, plus hg also has the option to label every commit with a permanent branch name. They also still have branching-by-cloning, and if you listen to Linus's original Google code talk about Git, you can see that he conflates "branch" and "clone" because that's what he originally envisioned! Even in 2007 he was still thinking in bitkeeper terms too. I bet that branching with references was Junio Hamano's idea, after Linus did the code hand-off.

I find branching-by-cloning a bit more natural in hg, because you can push to any repo. It's useful for quick, throwaway, local, easy testing out of ideas. In git, you can only push if your push doesn't modify HEAD, which typically translates into only being able to push to bare repos.


Interesting, thanks for the info. I've only been using Git since 2009 or so. I love Git's model of commits being objects in their own right, allowing you to cherry-pick them across branches, or rebase them to reorder or squash several commits together, for example.

My usual development routine is to make a ton of small commits that add up to a small set of good commits, to promote bisect-ability. I do dozens of rebases, squashes and amends when working on a topic branch. I have to use Mercurial for one of my clients, and it's a nightmare doing my development model in an SCM where I can't toss commits around willy-nilly like I can in Git.


> I have to use Mercurial for one of my clients, and it's a nightmare doing my development model in an SCM where I can't toss commits around willy-nilly like I can in Git.

Yes you can. `hg histedit` is a lot like `git rebase -i`, and `hg rebase` is like `git rebase` without -i and `hg commit --amend` is a lot like `git commit --amend`.

There are also some really cool things that we're working on with hg:

https://www.youtube.com/watch?v=4OlDm3akbqg


hg and git have feature parity at this point

hg just starts out more user friendly, and puts the rest in extensions. I like it more!

ok, hg is a bit slower


It's so short.

The readme is the best explanation of git I've seen.


Does anyone know if the structure of git has changed much? I would like to read this thinking this is pretty close to the current implementation but I would have no idea. anyone?

You can just see the structure with git cat-file

    -> % git cat-file -p 8c48d1a36c3d11db44c75a431d4f09cb0035222f
    tree 288c2d5379768f685f391bdbffd31b8965318c63
    parent 002ae35061beef02453b7fb1045a50fa2f7f30f8
    author Denis Bilenko <denis.bilenko@gmail.com> 1246939605 +0700
    committer Denis Bilenko <denis.bilenko@gmail.com> 1246939605 +0700

    MANIFEST.in: include libevent.h and libevent-internal.h
    -> % git cat-file -p 288c2d5379768f685f391bdbffd31b8965318c63
    100644 blob 6e543dc13df1b556fd95530061ac0c77a9178309    .hgignore
    100644 blob 79c7beb2227ce149c7a71e58e2f7379071b7a189    MANIFEST.in
    100644 blob 0d05178544942a035a82599900bec27fbac1c9c5    README.eventlet
    040000 tree edb8f37fa622315dcf7bf4f7316d5e85c48cfdbd    examples
    040000 tree 64cf252d77a4162099442bb0153985fc20ed5ba3    gevent
    040000 tree 261052e04b4aece469b2e767e394aafbc9d88a32    greentest
    100644 blob 488e805c563dfeeb6af5e7a1a8953b706d9676e3    setup.py
    -> % git cat-file -p 6e543dc13df1b556fd95530061ac0c77a9178309
    syntax: glob
    *~
    *.pyc
    *.orig
    dist
    gevent.egg-info
    build
    htmlreports
    results.*.db
    gevent/core.so

And yeah it's still very similar though it currently doesn't store the objects individually but rather packs them together.
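
For contrast with packed storage, here is a small sketch of how a single "loose" object is laid out in present-day git, as far as I understand it: a header plus content, hashed with SHA-1, zlib-compressed, and filed under a directory named after the first two hex digits of the id. The function name is made up for illustration, and the initial commit's on-disk format may differ in detail.

```python
import hashlib
import zlib

def loose_object(kind: bytes, body: bytes):
    """Return (id, path, compressed bytes) for one loose git object."""
    raw = kind + b" " + str(len(body)).encode() + b"\x00" + body
    oid = hashlib.sha1(raw).hexdigest()
    # The first two hex digits pick the directory, the other 38 the file name.
    path = f".git/objects/{oid[:2]}/{oid[2:]}"
    return oid, path, zlib.compress(raw)

# The empty blob makes a convenient, well-known example.
oid, path, data = loose_object(b"blob", b"")
print(oid)
print(path)
print(zlib.decompress(data))
```

Packfiles store many such objects together, but the ids stay the same, so `git cat-file -p` works on either form.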

I wrote about the format of git trees (and other object types) here:

http://alblue.bandlem.com/2011/08/git-tip-of-week-trees.html


While it looks arcane, this comes in handy enough when grepping through history that I actually have "cat-file -p" aliased to 'cf'.

One noteworthy difference is that in the original repository format, a tree object was just a list of named blobs. Nowadays each subdirectory of a tree is its own nested tree object, which means that when you're comparing two trees, you can skip over the directories that are identical.

I'm not sure when that change was made but it must have been very early on, because the repository format has been basically stable for many years now.


It seems to be mostly the same, except that "Changeset" is now called "Commit" and "Current directory cache" is now called "index", but they are functionally the same.

It's actually really great to see that the model hasn't changed much (there must have been a long phase of thinking before though)

If you want to go deeper, you can check out this page:

http://www.git-scm.com/book/en/v2/Git-Internals-Git-Objects


I wonder what the first commits for big sites/projects look like?

I tried to compile a list of a few of them a while ago:

http://jcooney.net/post/2011/06/22/First-Check-in-Comments-f...


I've read so many git tutorials, I wish I had seen that README file before.

This. I find that learning from original documentation tends to be much more efficient than learning from third party blogs/tutorials which try to "simplify" things, and usually do the opposite.

Does Github offer an easy way to get to the first commit of a project? Traveling page by page back in time is time-consuming (yeah, I did that).

You can go to the project's network graph (append /network to the url) then press shift ?. If the project has a lot of forks like git/git does it won't work though.

No, but if you have the full history you can grab it with a shell command.

    echo https://github.com/git/git/commit/$(git log --pretty=format:%H | tail -1)

https://github.com/git/git/commits?page=1091

If you want to see the commits going forward from here.


I found so many JAVA _PROGRAMMERS_ here asking stupid c question. What a world! people don't know C are building software. That's why so many Indian java coders in US. shameful.

I only realised reading the README that git is a great lesson in branding.

This is a great lesson in writing focused & succinct specs, when one clearly sees what his/her program is going to do.

My god... the comments. Looks like the reddit culture (i.e. fun for in jokes but not particularly professional)

"A marathon of clicking 'next page,' but the view is worth it." So, this commenter practically worships git, but apparently doesn't actually understand it well enough to know a better way to find the hash of the first commit and punch that into Github. Or, it was just a joke and they got there the quick way, but still felt obliged to post a dumb joke to inflate their own ego by "leaving their mark" on git. Maybe I'm being too mean, but yeah, I also think a lot of the comments are pointless.

It's probably the "I F*cking Love Computer Science" sub-reddits.

It's lots of subreddits. There are some serious ones, but the main-stream ones all contain the usual memes, injokes etc.

I enjoy diving into reddit every now and again. But I use github for work (and code for fun, although it's 'serious' fun). Although open-source collaboration is a fundamentally social activity, I think that mixing source control with a social network does inevitably lead to these kinds of comments. And I wouldn't dream of mixing that up with my professional identity.

Maybe it's just a marker of how versatile github is, and the community of people who write programs and put them in source control.


AFAIK you can't search by commit hash. You have to do some URL manipulation.

  git rev-list --max-parents=0 HEAD | tail -1

Without having to pipe:

> git rev-list --reverse HEAD


> Maybe I'm being too mean, but yeah, I also think a lot of the comments are pointless.

Yeah, I think you're being a little mean. If you browse to that user's GitHub page, it looks like it's just somebody new who's excited about software. Good for them.

The comments are pointless, sure, but also harmless. Similar comments might crowd out productive discussion if they were on (say) the head of the master branch, but I doubt that any serious development is happening on git's initial commit anyway. Let the new people have their fun.

As far as newbie disruptiveness goes, it could be far worse. When I was getting started with Linux, I posted this cringeworthy gem to LKML, now enshrined in the archives for all eternity: https://lkml.org/lkml/2000/10/22/69 If newbies today are merely posting "yay, git!" and "thank you!" to a secondary forum where it doesn't disrupt development, I'd say they're doing pretty well in comparison. :)


Yeah, fair enough. Good on you for linking your own cringey post. I think a lot of developers have those early cringe moments, especially if they were young when they started.

As far as disruption, it did occur to me later that somebody may be getting notification emails about these comments. But it's not too bad, as I assume they could just send the emails to /dev/null, since Github is not the official host of git. (As a tangential note, I sort of wish Github would handle this better. So many Github-mirrored projects end up with something like "don't submit pull requests or open issues here, they will be ignored" in their repo description.)


Linus wrote:

> Side note on trees: since a "tree" object is a sorted list of "filename+content", you can create a diff between two trees without actually having to unpack two trees. Just ignore all common parts, and your diff will look right. In other words, you can effectively (and efficiently) tell the difference between any two random trees by O(n) where "n" is the size of the difference, rather than the size of the tree.

Um, What?


Since a git hash points to a sorted list of filenames and content hashes, to diff two git commits you lookup the commit objects by their hash, run down the resultant list of filename/hash pairs & then only lookup & diff the content of those files that have differing hashes (if they have the same hash, they must have the same content according to the git data model, so they can be safely ignored).

Hence diffing arbitrary commits with git is always O(N) in the number of changed files, regardless of the number of interstitial commits.


In particular he's saying that for a tree, you can quickly skip sub-trees if they are the same, regardless of how deep they go. Kind of like a Merkle tree: http://en.m.wikipedia.org/wiki/Merkle_tree

I'm no git internals expert, but I suspect for a flat list of files the complexity is still O(n) where n is the number of files (not changes) because at very least you must check that n checksums are the same.


> I'm no git internals expert, but I suspect for a flat list of files the complexity is still O(n) where n is the number of files (not changes) because at very least you must check that n checksums are the same.

Sure. The constant factors make a huge difference though - even if you've cached all the data in memory walking all those structures and diffing the actual file data is going to be enormously slower than simply walking a list of hashes, so you're really saying that the total time is big * O(number of files changed) + small * O(number of files). If small*N ~ big then it's reasonable to just disregard that cost - it's going to be lost in the noise.


I'm not arguing that, but rather that this ability to skip unchanged trees because the hash of all contents is bubbled up is specifically what Linus is referring to in the comment, not simply the comparison of hashes in the flat-directory use-case.
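
Here is a toy sketch of that "bubbled up" hash idea, not git's actual code: directories are nested dicts, each tree's id is a hash of its sorted entries, and the diff never descends into a subtree whose id is unchanged. All names (`store_tree`, `diff_trees`, the in-memory `store`) are made up for illustration.

```python
import hashlib

def store_tree(node: dict, store: dict) -> str:
    """Hash a nested dict of files (bytes) and directories (dicts) bottom-up."""
    entries = {}
    for name, child in node.items():
        if isinstance(child, dict):
            entries[name] = ("tree", store_tree(child, store))
        else:
            entries[name] = ("blob", hashlib.sha1(child).hexdigest())
    listing = "".join(
        f"{kind} {oid} {name}\n" for name, (kind, oid) in sorted(entries.items())
    )
    oid = hashlib.sha1(listing.encode()).hexdigest()
    store[oid] = entries
    return oid

def diff_trees(a, b, store, prefix="", visited=None):
    """List changed paths; subtrees with equal ids are never opened.

    A directory added or removed as a whole is reported by its name only.
    """
    if a == b:
        return []
    if visited is not None:
        visited.append(prefix or "/")
    changed = []
    old_entries, new_entries = store[a], store[b]
    for name in sorted(set(old_entries) | set(new_entries)):
        before, after = old_entries.get(name), new_entries.get(name)
        if before == after:
            continue  # same id means same content, however deep it goes
        path = prefix + name
        if before and after and before[0] == after[0] == "tree":
            changed += diff_trees(before[1], after[1], store, path + "/", visited)
        else:
            changed.append(path)
    return changed

store = {}
old = store_tree({"README": b"hi", "src": {"a.c": b"1", "b.c": b"2"}, "docs": {"x": b"d"}}, store)
new = store_tree({"README": b"hi", "src": {"a.c": b"1", "b.c": b"3"}, "docs": {"x": b"d"}}, store)

opened = []
print(diff_trees(old, new, store, visited=opened))
print(opened)
```

Only the root and `src/` get opened here; `docs/` is skipped on the strength of its id alone. Within each opened tree the entries are still compared one by one, which is the O(number of files) cost the flat-list objection is about.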

Wouldn't it still be O(total)? Or at the very least O(log total)? You have to look at all the files even if it's just to compare the hash. The size of the file doesn't matter so I think what Linus should have said was it's O(number of files) still, and maybe O(log total) in the average case. But if there are 1,000,000 files and only 2 change then I don't see how you don't have to look at all the hashes.

Was all of the initial commit code written by Linus Torvalds?

Yes. The interesting thing is actually that it isn't that much code.

Code comment about git:

  stupid. contemptible and despicable.

That sums it up quite well. Every day I pay thanks to The One Who Programmed Me that my workflow doesn't put me in need of that shitload of crap that is git. I pity those who do need git.

Maybe I've been drilled too hard by a couple of programming gurus, but I immediately noticed there are quite a lot of repeated yet unnamed magic constants in the (otherwise pretty clean) code. According to wikipedia [1] the rule to not use them is even one of the oldest in programming. Curious what kind of profanity Linus would come up with when confronted with this :]

[1] https://en.wikipedia.org/wiki/Magic_number_%28programming%29...


Gotta love the fact that there are open pull requests.

https://github.com/git/git/pulls


Where are the tests?
