
There is no formal linking structure in many, if not most, cases. Ctrl+V is the weapon of choice of many a programmer. To say nothing of somebody then making superficial changes to the code to, for instance, fit their personal style or adapt it to their project. And then of course on top of it, GitHub is not the alpha and omega of code. The original code could have been published anywhere, or even nowhere, as in a case of outright theft.

Then there's also parallel discovery. People frequently come to the same solution at roughly the same time, completely independently. And this is nothing new. For instance, who discovered calculus, Newton or Leibniz? This was a roaring controversy at the time, with both claiming credit. The reality is that they both likely discovered it, completely independently, at about the same time. And there are a whole lot more people working on stuff now than in Newton's time!

There's also just parallel creation. Task enough people with creating an octree-based level-of-detail system in computer graphics and you're going to get a lot of relatively lengthy code that looks extremely similar, in spite of the fact that it's a generally esoteric and non-trivial problem.
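
To make that concrete, here's a minimal, hypothetical sketch of the octree node almost any two independent implementers would plausibly converge on (all names and the refinement heuristic are illustrative, not taken from any real codebase):

    # Sketch: the "obvious" octree node for a LOD system. The problem
    # itself dictates the shape: a center, a half-size, an eight-way
    # split, and a distance-based refinement test.
    class OctreeNode:
        def __init__(self, center, half_size, depth=0):
            self.center = center        # (x, y, z) midpoint of this cell
            self.half_size = half_size  # half the cell's edge length
            self.depth = depth
            self.children = None        # None for a leaf, else 8 subnodes

        def subdivide(self):
            # There is essentially one sensible way to write this loop.
            h = self.half_size / 2
            cx, cy, cz = self.center
            self.children = [
                OctreeNode((cx + dx*h, cy + dy*h, cz + dz*h), h, self.depth + 1)
                for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)
            ]

        def select_lod(self, camera_pos, threshold):
            # Refine cells near the camera; keep distant ones coarse.
            d = sum((a - b)**2 for a, b in zip(self.center, camera_pos)) ** 0.5
            if self.children and d < threshold * self.half_size:
                return [n for c in self.children
                          for n in c.select_lod(camera_pos, threshold)]
            return [self]

Two people writing that independently will land on nearly identical code, because the data structure leaves so little room for variation.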




In practice, the pointers are the authors.

In Computer Science, it’s a mess that works fine in practice. I go to GitHub - if it’s not there, then I google the authors or the name of the project. And I don’t know what I do if I don’t find any of them or their projects, because I’ve never encountered that scenario.

I’ve never seen a paper more than say 10 years old and thought “god I wish I had their code/data” because CS moves at a rapid speed. Important stuff is preserved because other people build on it. For example, it becomes the basis of an open source project.

Whilst from a rigorous and idealistic point of view, we want central long term data storage, in practice right now the real problem is that _many scientists do not make their data and code available in any form at all_ rather than worrying about pointers decaying.


Scientist is cool. It's a little annoying that the article makes it seem like no one has thought of this before Github. But whatever, maybe more people will learn about the pattern and use it.

The other implicit conclusion that the article makes, that this will somehow make software more like a traditional engineering discipline, also makes me a little uncomfortable. There still is no silver bullet. http://c2.com/cgi/wiki?NoSilverBullet


[New account for anonymity]

An often neglected force in this argument is that many practitioners of "scientific coding" take rapid iteration to its illogical and deleterious conclusion.

I'm often lightly chastised for my tendency to write maintainable, documented, reusable code. People laugh guiltily when I ask them to try checking out an svn repository, let alone cloning a git repo. Certainly, in my field (ECE and CS), some people are very adamant about clean coding conventions, and we're definitely able to make an impact by bringing people around to higher-level languages and better documentation practices.

But that doesn't mean an hour goes by without seeing results reversed by a bug buried deep in 10k lines of undocumented C or Perl or MATLAB, full of single-letter variables and negligible modularity.
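
A contrived but representative sketch of how such a bug hides (hypothetical code, deliberately written in the style being complained about):

    # One-character bug in terse, undocumented numerical code.
    def r(a, n):
        s = 0.0
        for i in range(1, n):   # BUG: skips a[0]; should be range(n)
            s += a[i]
        return s / n            # the mean is silently biased low

Nothing there looks wrong at a glance: it runs, produces plausible numbers, and ten thousand lines deep it quietly reverses a result.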


I used to work (current PhD student) in HPC (switched specialties) and I was extremely surprised that not only did these people not share code (the DOE actually encourages open sourcing code), but they would not even do so upon request. Several times I had to get my advisor to ask the author's advisor. Several times I found out why I couldn't replicate another person's results. It is amazing to me that anyone in CS does not share code. It is trivial to do, and we're all using git in some form or another anyway.

I worked at a productive computing research institute for a number of years. I cannot count the number of times I found research teams duplicating critical algorithms. Research scientists not only pay the price of the spaghetti nature of the code, they pay it over and over again by not sharing and improving on what has already been built by previous research groups.

The software industry has its own share of problems, but from what I've seen the research community is still largely operating on an outdated software model that shuns open collaboration out of fear of being "scooped".


This pisses me off so much. I'm not a mathematician, but I like to think I'm a pretty good programmer. I feel like I could pick up a mathematical concept described in a computer science paper more easily if I could actually see the damn code and run it myself. But most of the papers I've read don't mention where to find the referenced source code or, if they do, it's either horribly written and only runs on the author's machine, or it requires specialized software that only a university could afford.

For sure - I wish we could find ways of seeing whose "code" was "forked", and thus whose shoulders everyone is collectively standing on - the sum of that is your advance in the field, right?

The current system uses paper publishing (publication count / impact factor) as a proxy for this, but it produces a lot of perverse incentives. If we found a way to see whose concepts were being used and when, people would instead throw out any idea they had, hoping it gets picked up (even unknowingly) by other researchers so they can get some of the credit.

If this could be done, would it fix this status/scooped problem, and is there actually a way to replicate the "fork" structure of something like GitHub? Unfortunately, science is both 1) hard to pull in-depth semantic data out of (I'm guessing a lot of nuance) and 2) not formally encoded like a programming language.


In the world of StackOverflow-driven software authorship, a particular sequence of instructions appearing in source code, given the presence of HTTP requests to that article, coupled with people doing work in that area?

Yeah... I don't need statistics for that one. The bloody tooling and the chart of human agency/interest speak for themselves. To be honest, I find it amazing the backflips people are doing to justify that there's no possible way an interested grad student with the right tools would have tried something like this.

I would have. Of course, I avoid doing science in those types of areas, because Murphy finds a way.


I'll give you some credit since Programming is not Science/Engineering but rather some combination of Science/Authority/Tradition/Art.

But in science fields, authors' names don't really matter. The same outcome will exist regardless of who is studying it.


The algorithm is inherently flawed because it is based purely on authorship of files, and not the quality of the documentation, source code, source code comments, etc.
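
For context, a minimal, hypothetical sketch of the kind of authorship-only metric being criticized - the function names and the bus-factor-one heuristic are my own illustration, not the actual algorithm:

    import subprocess

    def authors_of(path):
        # Count distinct committers per file; this is ALL the metric sees.
        out = subprocess.run(
            ["git", "log", "--format=%an", "--", path],
            capture_output=True, text=True, check=True,
        ).stdout
        return {line for line in out.splitlines() if line}

    def at_risk(paths):
        # Flags single-author files, ignoring docs and comments entirely.
        return [p for p in paths if len(authors_of(p)) == 1]

A superbly documented single-author project scores exactly the same as an undocumented one, which is the flaw.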

It doesn't matter if a piece of software was written by only one person. If it has great documentation then the original author can die (or less morbidly just abandon it) and someone else can pick it up easily.

All the node.js modules by TJ Holowaychuk are a good example. They are extremely well written and documented, allowing them to continue being worked on by alternate maintainers even though he has moved away from the Node.js ecosystem to Go.


Thanks. I've had enough problems getting my own code working a few months later, never mind anyone else's, which is partly where this comes from. But the overall motivation is the same: how can we judge scientific work if we can't examine it statically (open source) and dynamically (recomputation)?

I have a better TL;DR: Nature is a gatekeeper that is not provably fair.

So basically, if the knowledge you discovered is TOO revolutionary, Nature can pass on it, and it will get short shrift even if it's true.

I am SO glad that the programming field is not like this. My code stands on its own merits. So would science... since it's, you know, science... independently verifiable... One would think! What if all code had to go through GitHub admins before it was published on GitHub?


Agreed. In fact, code can be more expressive than mathematical notation, and providing code would do a better job of demonstrating a theory.
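
As a hypothetical illustration of that expressiveness, compare the usual summation notation for a weighted mean with code that computes it - the code pins down everything the notation leaves implicit (index ranges, empty input, zero weights):

    # Paper notation: xbar = (sum_i w_i * x_i) / (sum_i w_i)
    def weighted_mean(xs, ws):
        if len(xs) != len(ws):
            raise ValueError("xs and ws must have the same length")
        total_w = sum(ws)
        if total_w == 0:
            raise ValueError("weights must not sum to zero")
        return sum(x * w for x, w in zip(xs, ws)) / total_w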

A lot of the proofs are just to legitimize papers. The cynic in me also thinks that by not providing code, the authors intentionally make it more difficult to reproduce results to hide tampered data. You see the same happen in science all the time with p-hacking, or with just plain fraudulent data that is purposely difficult to reproduce.

Luckily there are a good number of papers that provide a repo link, but not enough yet.


It's expected in academia. Even worse, I've seen cases where the main contributors who wrote the actual code weren't even included as authors.

I like the concept of an unbiased third party rewriting all the code to see if they can reproduce the results, but in practice what this leads to is:

1. People will not bother. It took a lot of minds to come up with the software used (in my friend's case, several PhDs' worth of work). No one is going to invest that much effort inventing their own software libraries to get it to work.

2. Even when you do write your own version of the software, there are a lot of subtleties involved in, say, computational physics. Choices you make (inadvertently) affect the convergence and accuracy. My producing software that gives different results could mean I had a bug. It could mean they did. It could mean we both did. Until both our codes are in the open, no one can know.
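
A minimal sketch of the kind of inadvertent choice meant here (a toy example of my own, not from any particular paper): integrate dy/dt = -y with forward Euler, and let the only difference between two runs be the step size.

    import math

    def euler(f, y0, t_end, dt):
        # Forward Euler: the simplest (and least accurate) integrator.
        y, t = y0, 0.0
        while t < t_end - 1e-12:
            y += dt * f(y)
            t += dt
        return y

    exact = math.exp(-5.0)  # true solution of dy/dt = -y at t = 5
    for dt in (0.5, 0.01):
        y = euler(lambda y: -y, 1.0, 5.0, dt)
        print(f"dt={dt}: y={y:.6f} (exact {exact:.6f}, error {abs(y-exact):.2e})")

Both runs are bug-free, yet they disagree with each other and with the exact answer. Two groups making different undocumented choices like this will publish different numbers, with no bug in the usual sense on either side.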

It is very unlikely that you'll have a case of one group's software giving one result and everyone else's giving another. More like everyone else giving different results.

Case in point:

https://physicstoday.scitation.org/do/10.1063/PT.6.1.2018082...

HN discussion:

https://news.ycombinator.com/item?id=17819420


This can only end if peer-reviewed journals require source code (and if possible datasets) to be made available as well. High-impact journals have the weight to enforce such policies.

It's true that third parties can apply methods easily to new data. But that is a testimony to the method, and references will help build the reputation of the original inventor.

Another concern only addressed in the comments on this blog post is that most scientists do not produce beautiful programs. The reasons are twofold:

- Programs are hacked together as quickly as possible to produce results. Scientists are mostly concerned with testing their theories, and not so much in producing software for public consumption.

- Most scientists are not great programmers.

Consequently, scientists usually do not want to make their source code available.

This situation sucks, given that in many countries taxpayers fund science.


I also like papers with code, but linking an author’s github in a blind conference submission wouldn’t make a ton of sense.

A problem for computational science is that people care about their publications more than about others being able to reproduce their work. The funny thing is that open sourcing the code is a sure way to attain a legacy.

As discussed in the linked article, it's mostly because of code quality, but with a healthy dose of perceived threat to job security. Unfortunately, the academic citation economy isn't very nurturing of community codes, so a lot of work gets repeated when secretive research groups compete.

That's not the whole story, and the fact that this article is getting published speaks to that; however, I still don't think it's likely that things will change in the short term.

My friend Matt Turk wrote about a similar topic recently: http://arxiv.org/abs/1301.7064

