There is The Architecture of Open Source Applications series of books (http://aosabook.org/en/index.html) where one of the authors of each piece of software explains the essence of the program.
If I recall, Linus was highly pissed at the time he wrote Git. Lots of his comments at the time were meant as a slam against the guy who was reverse-engineering the BitKeeper protocol, which resulted in the BitKeeper license getting yanked for the kernel project. I wonder if Linus is still angry with Tridgell?
I've been in the situation where a combative party has spurred me on to do some of my best work. I doubt Linus holds a grudge... and considering the consequences, I wouldn't be surprised if he wrote a tongue-in-cheek thank-you letter!
While that is true, part of the content that gets hashed is the commit timestamp. Therefore if you (somehow...) managed to end up with that hash, you could just reset and commit again. Or amend.
In any case, I believe he was joking. The odds of a sha1 collision are very very low.
Very low indeed. The best known attack on SHA-1 is estimated to take on the order of 2^52 operations, and a random collision needs on the order of 2^80 hashes. One person explained it this way: the chance of everyone currently on Earth winning a jackpot in their lifetime is higher than that of a single random SHA-1 collision. It is actually mind-boggling how many software systems and algorithms rely on hashes not colliding.
>I wonder if there are any git sha1 collisions out there in aggregate, say across all of github.
Despite the incredibly large number of commits out there, I think a collision is still very unlikely. 2^160 is a pretty big number.
There's a table at http://en.wikipedia.org/wiki/Birthday_attack which gives some numbers, but it's missing the 160-bit entry. Nevertheless, even the number of 128-bit hashes required for a random collision is extremely high.
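For a rough sense of scale, here's a back-of-the-envelope sketch using the standard birthday approximation, P ≈ n^2 / 2^(b+1). The ~2^30 (roughly one billion) object count is just an assumption I'm plugging in for illustration, not a real GitHub statistic:

    #include <stdio.h>

    int main(void)
    {
        double n_log2 = 30.0;     /* assumed log2 of the number of hashed objects */
        double hash_bits = 160.0; /* SHA-1 output size */
        /* birthday bound: P[collision] ~= n^2 / 2^(bits+1) */
        double p_log2 = 2.0 * n_log2 - (hash_bits + 1.0);
        printf("log2(P[collision]) ~= %.0f\n", p_log2); /* prints about -101 */
        return 0;
    }

In other words, even a billion objects leaves the odds of a random collision around 2^-101, which is why nobody loses sleep over it.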
In hindsight, it's good that git didn't choose MD5, since collisions for MD5 can be generated almost trivially now. However, the decreasing security of SHA-1 could be a concern for the future.
I don't think commit hash was ever intended to be cryptographically secure. It's just a unique identifier.
> Source control management systems such as Git and Mercurial use SHA-1 not for security but for ensuring that the data has not changed due to accidental corruption. Linus Torvalds has said about Git: "If you have disk corruption, if you have DRAM corruption, if you have any kind of problems at all, Git will notice them. It's not a question of if, it's a guarantee. You can have people who try to be malicious. They won't succeed. [...] Nobody has been able to break SHA-1, but the point is the SHA-1, as far as Git is concerned, isn't even a security feature. It's purely a consistency check. The security parts are elsewhere, so a lot of people assume that since Git uses SHA-1 and SHA-1 is used for cryptographically secure stuff, they think that, OK, it's a huge security feature. It has nothing at all to do with security, it's just the best hash you can get.
According to the Wikipedia entry[0], "No actual collisions have yet been produced", on GitHub or otherwise. The NSA might have produced them, but publicly none have been found, and it's not for lack of trying.
I take that statement to imply "on purpose", or as part of an attack. You can't know whether there's a coincidental collision anywhere in github unless you bother to look. But I do understand that it's still extremely improbable.
C# also has it, and I've heard Java has it as well, but I've never tested it in the latter.
I personally like being able to do it, since it lets me do away with the two extra lines auto-indent puts in if I add braces. That's a 50% reduction for a four-line if. Maybe I should just buy a bigger monitor.
The lead programmer at my last job encouraged us to use it as a way to make sure our conditionals & foreach loops (PHP programmer) weren't doing too much. If we had to use braces, it was a sign to check it out and see if it could stand some refactoring.
"If it's only a single line in after the if, the braces are optional."
Not _line_, _statement_. Consider
    if(flag)
        foo(); bar();
and
    if(flag)
        foo =
            bar +
            baz;
That first example always calls bar().
Warning: I haven't tested this, and am beginning to doubt a bit. It must be correct, but why, then, don't I remember seeing this in underhanded C contests? Combining that with macros allows you to hide the semicolon.
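Here's a minimal sketch of the macro trick I mean; the VALIDATE name and the helper functions are made up purely for illustration:

    #include <stdio.h>

    /* The second statement hides inside the macro, so it escapes the if. */
    #define VALIDATE(x) check(x); log_result(x)

    static int check(int x)       { printf("check(%d)\n", x); return x > 0; }
    static void log_result(int x) { printf("log_result(%d)\n", x); }

    int main(void)
    {
        int input = -1;
        if (input > 0)
            VALIDATE(input); /* expands to: check(input); log_result(input); */
        /* log_result() runs even though the condition is false */
        return 0;
    }

With braces around the body, the stray log_result() call would stay inside the conditional.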
In the C grammar, braces denote compound statements. Control flow statements can take any type of statement as their body rather than just the compound variety.
It's pointless to argue over these kinds of things. Every major project/company has its own codified coding style guide, and if you want to contribute/earn your salary then you must follow that style guide to a T. Here's the relevant quote from the Linux kernel coding style[0]:
Do not unnecessarily use braces where a single statement will do.
    if (condition)
        action();
That's what I'd thought for maybe over a decade. About three years ago I revamped my personal coding style to eliminate as much unnecessary baggage as possible. As part of that I stopped using braces for single-statement ifs, and I have yet to run into a bug because of it. Overall I find the code looks more compact and cleaner, maybe even lower-friction to read. Nowadays when I see braces around a single-line if I get that "oh, that's clunky code" feeling in my stomach. Things are worse with C# and a lot of Java code, where people insist not only on braces around a single-line if but also on putting the { on its own separate line.
I think a good language shouldn't have braces to mark blocks in the first place. Given indentation, they are redundant most of the time and just contribute clutter. This is exactly the case with Python, where this is essentially the default style, and people haven't been complaining about it causing bugs.
There's an enormous difference with Python, because the indentation is syntax. These two code snippets, one in Python, the other in C, do not mean the same thing:
C:
    if (condition)
        statement_1();
        statement_2();
Python:
    if condition:
        statement_1()
        statement_2()
Personally, I always use braces in C and C++, even though it is more clunky. I want the assurance. I also frequently have to make changes to code that does not use braces, and then I have to add the braces in because I am adding statements to a conditional. To me, that is more clunky.
I see merit in both sides of the debate. As long as the style is consistent, it should be legible. It is an argument similar to vi versus Emacs -- both can be right. It is more important to be consistent through a project. Adopt one style.
From this day forward, let us proclaim to always use braces, so that intent is more obvious to the reader.
Most C-inspired languages (most popular programming languages) allow this. The only ones I've used that don't are Go and Swift. And I find it a little annoying.
If you're worried about bugs, there are other things in C/C++ to criticize first ;)
It's true for Rust as well. What these three languages have in common is that they don't require parentheses around the switching boolean expression, and when you have things like:
if expression1 expression2
it can be fiendishly hard to determine the boundary between expression1 and expression2.
'Hosting' means 'contain', 'serve'. A building can host a department or a convention, and a married couple can host a dinner party, with neither being required to be a webserver or programming language.
Version control systems are self-hosting when they are used to manage the primary repository of their own source code. This shows confidence because if the program breaks, then it breaks its own configuration management, which could be a headache to unravel. For example if the repository format changes, then the change has to be managed so that the old versions remain accessible through the new version of the software. If this is not managed, and old compiled binaries of the version control system disappear from existence, then it may become impossible to recover the old sources.
Thus, successfully self-hosting a version control system is some measure of evidence that the developers know what they are doing and can manage the changes. (And thus they understand change management and we can trust them to be working on version control software.)
git at revision 0 worked the same way. You can see that there are no references in git at that time either. They're both copying BitKeeper, which worked the same way.
Nowadays git has references (branches), and hg has bookmarks, which are the same thing, plus hg also has the option to label every commit with a permanent branch name. They both still have branching-by-cloning, and if you listen to Linus's original Google talk about Git, you can hear him conflate "branch" and "clone", because that's what he originally envisioned! Even in 2007 he was still thinking in BitKeeper terms. I bet that branching with references was Junio Hamano's idea, after Linus did the code hand-off.
I find branching-by-cloning a bit more natural in hg, because you can push to any repo. It's useful for quick, throwaway, local testing of ideas. In git, by default you can't push to a branch that's currently checked out in the destination repo, which typically translates into only being able to push to bare repos.
Interesting, thanks for the info. I've only been using Git since 2009 or so. I love Git's model of commits being objects in their own right, allowing you to cherry-pick them across branches, or rebase them to reorder or squash several commits together, for example.
My usual development routine is to make a ton of small commits that add up to a small set of good commits, to promote bisect-ability. I do dozens of rebases, squashes and amends when working on a topic branch. I have to use Mercurial for one of my clients, and it's a nightmare doing my development model in an SCM where I can't toss commits around willy-nilly like I can in Git.
> I have to use Mercurial for one of my clients, and it's a nightmare doing my development model in an SCM where I can't toss commits around willy-nilly like I can in Git.
Yes you can. `hg histedit` is a lot like `git rebase -i`, `hg rebase` is like `git rebase` without -i, and `hg commit --amend` is a lot like `git commit --amend`.
There are also some really cool things that we're working on with hg:
Does anyone know if the structure of git has changed much? I would like to read this assuming it's pretty close to the current implementation, but I have no idea. Anyone?
One noteworthy difference is that in the original repository format, a tree object was just a list of named blobs. Nowadays each subdirectory of a tree is its own nested tree object, which means that when you're comparing two trees, you can skip over the directories that are identical.
I'm not sure when that change was made but it must have been very early on, because the repository format has been basically stable for many years now.
It seems to be mostly the same, except that "Changeset" is now called "Commit" and "Current directory cache" is now called "index", but they are functionally the same.
It's actually really great to see that the model hasn't changed much (there must have been a long phase of thinking before though)
If you want to go deeper, you can check out this page:
This. I find that learning from original documentation tends to be much more efficient than learning from third party blogs/tutorials which try to "simplify" things, and usually do the opposite.
You can go to the project's network graph (append /network to the url) then press shift ?. If the project has a lot of forks like git/git does it won't work though.
I found so many JAVA _PROGRAMMERS_ here asking stupid c question. What a world! people don't know C are building software. That's why so many Indian java coders in US. shameful.
"A marathon of clicking 'next page,' but the view is worth it." So, this commenter practically worships git, but apparently doesn't actually understand it well enough to know a better way to find the hash of the first commit and punch that into Github. Or, it was just a joke and they got there the quick way, but still felt obliged to post a dumb joke to inflate their own ego by "leaving their mark" on git. Maybe I'm being too mean, but yeah, I also think a lot of the comments are pointless.
It's the same on lots of subreddits. There are some serious ones, but the mainstream ones all contain the usual memes, in-jokes, etc.
I enjoy diving into reddit every now and again. But I use GitHub for work (and code for fun, although it's 'serious' fun). Although open-source collaboration is a fundamentally social activity, I think that mixing source control with a social network inevitably leads to these kinds of comments. And I wouldn't dream of mixing that up with my professional identity.
Maybe it's just a marker of how versatile github is, and the community of people who write programs and put them in source control.
> Maybe I'm being too mean, but yeah, I also think a lot of the comments are pointless.
Yeah, I think you're being a little mean. If you browse to that user's GitHub page, it looks like it's just somebody new who's excited about software. Good for them.
The comments are pointless, sure, but also harmless. Similar comments might crowd out productive discussion if they were on (say) the head of the master branch, but I doubt that any serious development is happening on git's initial commit anyway. Let the new people have their fun.
As far as newbie disruptiveness goes, it could be far worse. When I was getting started with Linux, I posted this cringeworthy gem to LKML, now enshrined in the archives for all eternity: https://lkml.org/lkml/2000/10/22/69 If newbies today are merely posting "yay, git!" and "thank you!" to a secondary forum where it doesn't disrupt development, I'd say they're doing pretty well in comparison. :)
Yeah, fair enough. Good on you for linking your own cringey post. I think a lot of developers have those early cringe moments, especially if they were young when they started.
As far as disruption, it did occur to me later that somebody may be getting notification emails about these comments. But it's not too bad, as I assume they could just send the emails to /dev/null, since Github is not the official host of git. (As a tangential note, I sort of wish Github would handle this better. So many Github-mirrored projects end up with something like "don't submit pull requests or open issues here, they will be ignored" in their repo description.)
> Side note on trees: since a "tree" object is a sorted list of "filename+content", you can create a diff between two trees without actually having to unpack two trees. Just ignore all common parts, and your diff will look right. In other words, you can effectively (and efficiently) tell the difference between any two random trees by O(n) where "n" is the size of the difference, rather than the size of the tree.
Since a git hash points to a sorted list of filenames and content hashes, to diff two git commits you look up the commit objects by their hash, run down the resulting list of filename/hash pairs, and then only look up and diff the content of those files that have differing hashes (if they have the same hash, they must have the same content according to the git data model, so they can be safely ignored).
Hence diffing arbitrary commits with git is always O(N) in the number of changed files, regardless of the number of interstitial commits.
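Roughly, the walk over two sorted entry lists looks something like this toy sketch (not git's actual code; the struct and the sample file names are invented for illustration):

    #include <stdio.h>
    #include <string.h>

    /* Toy model of a tree entry: a name plus the hash of its content. */
    struct entry {
        const char *name;
        const char *hash; /* hex SHA-1 of the blob or subtree */
    };

    /* Walk two sorted entry lists in lockstep; only entries whose hashes
     * differ (or that exist on one side only) need any further work. */
    static void diff_trees(const struct entry *a, int na,
                           const struct entry *b, int nb)
    {
        int i = 0, j = 0;
        while (i < na || j < nb) {
            int cmp = (i == na) ? 1 : (j == nb) ? -1
                                    : strcmp(a[i].name, b[j].name);
            if (cmp < 0)
                printf("deleted:  %s\n", a[i++].name);
            else if (cmp > 0)
                printf("added:    %s\n", b[j++].name);
            else {
                if (strcmp(a[i].hash, b[j].hash) != 0)
                    printf("modified: %s\n", a[i].name); /* diff contents here */
                i++, j++; /* identical hashes: skip without opening the contents */
            }
        }
    }

    int main(void)
    {
        struct entry old_tree[] = { {"Makefile", "aaa"}, {"cache.h", "bbb"} };
        struct entry new_tree[] = { {"Makefile", "aaa"}, {"cache.h", "ccc"},
                                    {"read-cache.c", "ddd"} };
        diff_trees(old_tree, 2, new_tree, 3);
        return 0;
    }

With the flat lists of the original format the walk still touches every entry, but once trees became nested (as noted elsewhere in this thread), an identical subdirectory collapses to a single hash comparison.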
In particular he's saying that for a tree, you can quickly skip sub-trees if they are the same, regardless of how deep they go. Kind of like a Merkle tree: http://en.m.wikipedia.org/wiki/Merkle_tree
I'm no git internals expert, but I suspect that for a flat list of files the complexity is still O(n), where n is the number of files (not changes), because at the very least you must check that all n checksums are the same.
Sure. The constant factors make a huge difference though - even if you've cached all the data in memory walking all those structures and diffing the actual file data is going to be enormously slower than simply walking a list of hashes, so you're really saying that the total time is big * O(number of files changed) + small * O(number of files). If small*N ~ big then it's reasonable to just disregard that cost - it's going to be lost in the noise.
I'm not arguing that, but rather that this ability to skip unchanged trees because the hash of all contents is bubbled up is specifically what Linus is referring to in the comment, not simply the comparison of hashes in the flat-directory use-case.
Wouldn't it still be O(total)? Or at the very least O(log total)? You have to look at all the files even if it's just to compare the hash. The size of the file doesn't matter so I think what Linus should have said was it's O(number of files) still, and maybe O(log total) in the average case. But if there are 1,000,000 files and only 2 change then I don't see how you don't have to look at all the hashes.
That sums it up quite well. Every day I pay thanks to The One Who Programmed Me that my workflow doesn't put me in need of that shitload of crap that is git. I pity those who do need git.
Maybe I've been drilled too hard by a couple of programming gurus, but I immediately noticed there are quite a lot of repeated yet unnamed magic constants in the (otherwise pretty clean) code. According to Wikipedia[1], the rule not to use them is one of the oldest in programming. Curious what kind of profanity Linus would come up with when confronted with this :]