People working in DevRel often aggregate developer-oriented content and gain popularity that way; "swyx" is one example. I'm not taking a dump on his work, but you can see the GitHub influencer effect over there.
The "github star" claim links to the source (it's some github program where you can nominate people to be accepted into some promotion campaign). Saying self proclaimed makes him sound pretentious, it's actually awarded by github.
You can be factual and still sound pretentious and cringey. Like the medical doctors who insist on being called “doctor”, to the point of smugly “correcting” strangers in a social setting.
I don’t know this user and won’t assume his intentions, but I can see how having “I’m a GitHub star [star emoji]” as the first sentence on the profile is doing him a disservice: it makes it seem like it’s the most impressive thing he’s achieved and diminishes everything else.
To reiterate, I don’t know you and don’t assume your intentions—and thus do not judge them. I’m also not familiar with your work, but I have no doubt it’s more relevant than whatever that “star award” is.
In other words: it makes zero difference to me what you write in your bio though I can see how its previous wording took away from what’s important. I was conveying to the parent comment my understanding of the comment they were replying to.
Apologies for making you feel judged; that was not the point. Quite the contrary: I wanted to underline that, since I don’t know your intentions, it does not make sense to criticise how you choose to present yourself.
yup no hard feelings. felt defensive heheh. i guess as my career has gone on i've accumulated other stuff but early on the github star thing really did feel like a big deal + if i wasnt gonna plug it on my github readme where else
swyx is on hn and a legit great writer. He's influenced my thinking in many areas.
I've never seen his github account before but I expect that people following him there are doing so because of the content he's putting out. His blog has been on the HN front page many times, and he has a book about developer career building.
My github account isn't as pimped out as his, but marketing yourself isn't toxic, it's smart.
Agreed that marketing yourself is not toxic. I follow "swyx" on Twitter and find his insight valuable, and so do a lot of my peers. Btw, looks like his Github profile has not been updated for some time - he's no longer Head of DX at Airbyte and is now an independent consultant. https://www.swyx.io/about
appreciate it but also whoa this literally just happened and its freaky how up to date you are. consulting is temporary (check out https://www.trychroma.com/ if you are exploring LangChain/OpenAI apps and need an embeddings database) and i'm working on an ai infra startup idea on the side with a couple cofounders.
i honestly dont even view my github readme as "marketing yourself". most pple dont even go to an individual's profile in the first place, but if you do its kinda like a cute little myspace thing where you can let people know you as a human being and be a little quirky. i certainly dont hold myself out as an authority on writing the best software in the world and hey if 40k stars on the react-typescript stuff doesnt count i'm alright with that
yeah i also am surprised that people use the follow feature for my work even tho i dont run a popular oss project.
well idk what "github influencer" even means but fwiw i am not "people living just from github". ive never taken a dime of github sponsor money. as far as github is concerned i just put my stuff up for free and the github stars program gets me an early look into new features so i can give them feedback. (eg i helped with Hey GitHub before the big launch at GH Universe).
obviously i'll happily ambassador github to anyone who will listen but who isnt already on github here
Yeah. Several years ago extremely clueless recruiters used to email people heaps. Lots of people were complaining about getting tonnes of spam from them. :(
Had to change my Location (or some similar obvious field) in my GitHub profile to "Recruiters FUCK OFF" before they took the hint. ;)
Thankfully, GitHub introduced some other way to signal if you are/aren't interested in getting a job (toggle switch?) not long after, which seemed to work.
I think it would be tough (a good thing) because how often do people go to someone's root github page, even if they have a good repo? Not to say it never happens, but github is really about the repo, not the person (again a good thing) so it would be harder for an individual to become "influential". Hopefully nobody gets any ideas.
Taylor Otwell lol.. He has some pretty dope cars in his garage and is doing well.
I follow him on GitHub, and pay for some of his products. I have been heavily influenced by his coding style and the tools he uses. His code just looks so tight and perfect. He writes his stuff so open-ended and reusable that he basically writes a method once and then reuses it across numerous projects.
> In spam detection, we often use heuristics in conjunction with machine learning to identify spammers.
Heuristics can only be used to identify suspected spammers. Not everyone who behaves like a spammer is a spammer, it could be e.g. a random user with privacy settings on, or someone who didn’t update their bio in a while and it got affected by link rot, etc.
Even if a group of low activity accounts stars the same projects, it could be that the account owners just discuss these projects elsewhere.
The article notes this, and like any spam detection method, it has a degree of false positives, but it seems very low (less than a percent according to the article). I'm sure an official implementation of this could take more internal, non-public factors into account, like IP addresses and clustering of account creation times, to make it even more accurate and drastically reduce the amount of spam users.
The claim I saw in the article is 98% precision, which doesn't actually tell us the predictive value without the base rate, and the base rate seems to be all over the place.
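To make that concrete with a toy calculation (the recall and false-positive rates below are made-up assumptions, not figures from the article): the same classifier gives wildly different precision depending on how common fake accounts are in the population you run it on.

    # Toy illustration: fixed true/false positive *rates*, varying base rate.
    # TPR and FPR are assumed values for the sake of the example.
    def precision(tpr, fpr, base_rate):
        tp = tpr * base_rate          # fraction of accounts that are fake and flagged
        fp = fpr * (1 - base_rate)    # fraction that are genuine but flagged anyway
        return tp / (tp + fp)

    TPR = 0.95   # assumed recall of the detector
    FPR = 0.001  # assumed false positive rate on genuine accounts

    for base in (0.10, 0.01, 0.001):  # share of accounts that are actually fake
        print(f"base rate {base:>6.3f} -> precision {precision(TPR, FPR, base):.3f}")
    # base rate  0.100 -> precision 0.991
    # base rate  0.010 -> precision 0.906
    # base rate  0.001 -> precision 0.487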
Goodhart's law: if you rely on a social signal to tell you what's good, you'll break that signal.
Very soon, the domain of bullshit will extend to actual text. We'll be able to buy HN comments by the thousand -- expertly wordsmithed, lucid AI comments -- and you can get them to say "this GitHub repo is the best", or "this startup is the real deal". Won't that be fun?
(I ninja-edited my comment in the first minute; the parent might have responded to a less clear version, since they posted at +3 minutes. I added "AI" in a revision).
If you want to, you can always set 'delay' in your profile to the number of minutes (up to 10) that you would like your comments to be visible only to you. This puts the stealth back in stealth editing. https://news.ycombinator.com/newsfaq.html
I rely heavily on this because it's somehow only after the comment is 'real' (i.e. staring back at me from a real HN thread) that I notice most of the edits I want to make.
Reddit better hold their IPO soon or they'll get caught up in this. Pretty soon there will be dozens of different GPT/LLM-powered Reddit spam bots on Github. Some of them no doubt for political trolling. [1]
Phone, then ID-based verification is a stopgap, but IDV services will have to spin up to support the mass volume of verifying all humans.
[1] I kind of want to do this from an innocent / artistic perspective myself. Perhaps a bot that responds with a bunch of rhetorical questions or onomatopoeia. Then I'd scale it to the point people start noticing and feeling weirded out by it. "Is this the new Gen Alpha lingo?" Alas, I have too many other AI projects.
If people see AI-generated comments on HN they should flag them and let us know at hn@ycombinator.com. HN is for humans to converse, and bots have never been allowed.
Of course it's not always easy to say what's AI-generated or not. But if an account is making a habit of it, it still seems possible to tell.
> Very soon, the domain of bullshit will extend to actual text. We'll be able to buy HN comments by the thousand -- expertly wordsmithed, lucid AI comments -- and you can get them to say "this GitHub repo is the best", or "this startup is the real deal". Won't that be fun?
Definitely already the case, you really think Rust and SQLite would get more than a couple of upvotes otherwise? :D
Content-based auto moderation has been shitty since its inception. I don’t like that GPT will cause the biggest flood of shit mankind has ever seen, but I am happy that it will kill these flawed ideas about policing.
The obvious problem is we don’t have any great alternatives. We have captcha, and we can look at behavior and source data (IP), and of course everyone’s favorite fingerprinting. To make matters worse: abuse, spam and fraud prevention lives in the same security-by-obscurity paradigm that cyber security lived in for decades before “we” collectively gave up on it, and decided that openness is better. People would laugh at you to suggest abuse tech should be open (“you’d just help the spammers”).
I tried to find whether academia has taken a stab at these problems but came up pretty much empty-handed. Hopefully I’m just bad at searching. I truly don’t get why people aren’t looking at these issues seriously and systematically.
In the medium term, I’m worried that we’ll not address the systemic threats, and continue to throw ID checks, heuristics and ML at the wall, enjoying the short-lived successes when some classifier works for a month before it’s defeated. The reason this is concerning is that we will be neck deep in crap (think SEO blogspam and recipe sites, but for everything), which will be disorienting for long enough to erode a lot of trust that we could really use right now.
The closest example I know of is the Korean internet. It is nigh impossible to get an account on major websites without an SSN and a phone number. Despite this, there are still countless bots and scammers that use hacked or leaked personal data. So I’m not sure it would be that effective.
I am thinking more like webauthn - but where I own a key pair, I go to the post office with my passport, they give me a nonce, I prove it's my key pair, and then they attest that the public key is definitely me. I can then use that attestation as my "username", and any challenge-response includes the public key, so they know that only I could be signing up.
I am very aware of "designing a security system they themselves cannot break" and the difficulties of key management etc.
Would be interested in knowing more from smarter people
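Very roughly, the flow I have in mind looks something like this (a minimal sketch using Ed25519 via the Python cryptography library; the post-office attestation step is hand-waved, and everything here is illustrative rather than a concrete proposal):

    import os
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # 1. I generate and keep the private key; the public key is what the post
    #    office would attest against my passport and what I'd use as a "username".
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    # 2. A site hands out a random nonce as a challenge at sign-up or login.
    nonce = os.urandom(32)

    # 3. I sign the nonce; the site verifies against the attested public key,
    #    proving I control the key without revealing anything else about me.
    signature = private_key.sign(nonce)
    public_key.verify(signature, nonce)  # raises InvalidSignature if it doesn't match
    print("challenge answered by the attested key")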
something like 2 billion people have a phone with a secure enclave capable of this in their pockets today - and they use it every day for logins, payments and paying at the car park.
We have the penetration
(Afaik smartphone penetration is around 4.5-5 BN, and something like 50%+ have secure enclaves, but honestly I don't follow that deeply so would defer to more knowledgeable people)
That’s not your identity, it’s an access token protected by an advanced lock screen (which is greatly useful, but not the same). If you lose your device, the way you get back into your accounts is your de-facto identity—usually it ranges between the email you used during signup to your govt id.
There isn’t a widely deployed public key network with keys that represent a person, afaik. PGP is the closest maybe?
> something like 2 billion people have a phone with a secure enclave capable of this in their pockets today - and they use it every day for logins, payments and paying at the car park.
They don't own a key pair. They carry one around, which is owned by google or some other entity?
Because the only way it'd work is if it was mandatory (because of point 2); it'd then be extended to porn sites to protect the children. That means politicians' browsing history on pornhub would also be recorded and inevitably leaked when they get hacked.
If spam was your only problem, now we have two: spam and identity theft. Selling/obtaining identity information becomes very profitable, and those working at the post office must guard access like a bank vault.
The paradigm of fixed identity information as proof is pretty obviously doomed. Just like how the 1970s concept of username/password as proof of identity is on its way out. Or credit card numbers alone being used to validate transactions.
All of those notions are pre-internet ways of proving identity. In a world where we're all rarely more than an arm's length from a globally connected computer, they're on the way out.
But, and I understand the argument, that is a problem for IRL society / government to solve.
If someone walks up to me in the voting booth and says "vote for X or I will kill you", that's a crime. If they do it in the pub, it's probably a crime. If they do it online, the police don't have enough manpower to deal with the situation.
We should change that.
Every time some fuckwit tweets "you and your kids are going to get raped to death and I know where you live" because some woman dares suggest some political change, I would like to see jail time.
And if we do that then I can understand your argument, but I would then say it is not valid - in a society that protects free speech.
It might get to be that way some day, but for now there is recourse. France is (in)famous for it, and they are currently making use of it.
And this is important because a "fair democratic society" that doesn't need people to be able to protest is, as history has shown many times, only a temporary affair. The best way to keep it is to not give the government the tools a worse government could use to suppress dissent.
I expect that's where we're heading. But then, as somebody who writes online mostly under my own name, maybe I'm just biased. Come on in, the water's fine!
I think there are cases for anonymous/pseudonymous speech, but I think that's going to have to shift away from disposable identities. Newspapers, for example, have been providing selective anonymity for hundreds of years, so I think there's a model to follow: trusted people/organizations who validate the quality of a non-public identity.
So a place like HN, for example, could promise that each pseudonymous account is connected to a unique human via some sort of government ID with challenge/response capability. Or you could end up with third-party ID providers that provide a similar service that goes beyond mere identity, like the Twitter Verified program scaled up.
Disposable identities have always been a struggle. E.g., look at Reddit's very popular Am I the Asshole, where people widely believe a lot of the content is creative writing exercises. But keeping up a fake identity over the long term was a lot of work. Not anymore, though!
> The obvious problem is we don’t have any great alternatives.
Of course we do. The rise of digital finance services has led to the creation of a number of services that offer the identity verification necessary for KYC. All such services offer APIs, so adding an identity verification requirement to your forum is trivial.
Of course, if it isn't obvious, I'm only half joking.
Maybe even push that a level higher and have org-to-org vouching as well (so it can scale and reputation propagates across social bubbles). Bootstrapping remains somewhat of an issue.
One somewhat popular solution for bootstrapping is to allow people to buy in, paired with quickly banning those members in cases of rule violation. It's by no means perfect, but it puts a real price on abuse and thus reduces it a lot
I've mentioned a "market of lemons" elsewhere in this thread. One such market is the market for malware and stolen credit card details. One result of the market being broken: serious criminals restrict themselves to very small (company like) social circles and invite only forums. One signal of trust that remained very long: a very short ICQ number. You don't want to burn such a handle with a bad trade, so trust was given upfront.
How would you imagine that applying here? If fake accounts are at least as convincing as real ones, then it seems like trust networks would be quickly prone to corruption as the fake accounts gain enough of a foothold to start recommending each other.
On a network started by 2-3-10 people, the first new members would need to be vouched for by a percentage of those to get in - and so on.
If someone down the line does some BS activity, the accounts that vouched for it have their reputation on the line.
A whole tree of the person who did the BS and 1-2 layers of vouching above gets put under scrutiny, gets a big red warning label in their UI presence (e.g. under their avatar/name), and loses privileges. It could even just get immediately deleted.
And since I said "identity based", you would need to provide a real-world ID to get in, on top of others vouching for you. It can be made so you wouldn't be able to get a fake account any easier than you can get a fake passport.
Next keyword: market of lemons. If you can't rely on said signals anymore, you must treat every item the same (untrusted), which drives the legitimate players out of the market. We have a lot of lemon markets; we can probably infer from them what the social result will be.
You can do it already. It's a normal order for a copywriter, nobody will bat an eye when you post an offer. It costs cents/dollars per 1000 words instead of fraction of a cent, but that's not exactly outside of reach of a funded startup.
This system is already essentially broken. Either you worked at a large business that only gives out dates of employment and job title by policy or you are in complete control of who the hiring company talks to.
The first time you don’t get a job because of a reference you gave you learn a lesson. If it ever happens again, it’s on you.
What really is the alternative? At least where I live, a multi-year gap in your CV is going to set off more red flags than an honest "It didn't work out between us".
Don’t give them your boss’s name. Give them a coworker’s name. Give them a friend’s name and have them lie for you.
If a company is proactively contacting people you don’t give them contact information for, that’s not requiring references — which is the process I (and the comment I replied to) was talking about. If a company knows where you’ve worked, they can contact them if they want.
Then you’re fucked if they check and the reference is bad and they care. Either you take your chances, leave it as a gap in your resume, or you make something up.
In the past, I’ve extended the time I was at either the company before/after and then leave the one in the middle off. Smaller gap is easier to explain and you just need a coworker at the one you stretched to cover for you - or have it be somebody who wasn’t there during the time you added. You can also just say you did the “freelance” thing and then talk about whatever you want.
I’ve also just been 100% honest and said, “I didn’t like this job and left on bad terms. I’d rather you not contact them.”
Just have to read the situation and make your best guess as to what is going to get you the job.
We'll be back to the 1990s "software agents" craze, take two: needing AI-driven agents that seek out, index, and evaluate content on our behalf, and that negotiate with each other for recommendations, with the currency being trust based on how "your" agent evaluated prior results.
I'm hoping to put an AI between me and my e-mail inbox this weekend (I had ChatGPT write most of the code; it's not much); not fully automated, but evaluating, summarising and categorising. I might extend that to e.g. give me an "algorithm" for my Mastodon timeline (despite all of the people insisting on reverse chronological, I'm at a few hundred people I follow and already can't keep up), and to a number of other sites I visit. For most of these things latency does not matter, so e.g. putting them through llama.cpp rather than something faster is fine, and precision isn't critical (I won't trust it to automatically reply or reject anything, just to prioritise and categorise, where missteps won't have any critical impact).
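For the curious, the shape of it is roughly this (a sketch rather than my actual code; it assumes a local llama.cpp server exposing its OpenAI-compatible endpoint on localhost:8080, and the IMAP host, credentials and category labels are placeholders):

    import imaplib, email, requests

    imap = imaplib.IMAP4_SSL("imap.example.com")      # placeholder host
    imap.login("me@example.com", "app-password")      # placeholder credentials
    imap.select("INBOX")
    _, ids = imap.search(None, "UNSEEN")

    for msg_id in ids[0].split()[:20]:
        _, data = imap.fetch(msg_id, "(BODY[HEADER.FIELDS (SUBJECT)])")
        subject = email.message_from_bytes(data[0][1])["Subject"] or ""
        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",   # assumed local llama.cpp server
            json={"messages": [{
                "role": "user",
                "content": "Categorise this email subject as one of "
                           "[urgent, personal, newsletter, notification, other]: " + subject}]},
            timeout=120,
        )
        label = resp.json()["choices"][0]["message"]["content"].strip()
        print(f"{label:>12}  {subject}")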
> We'll be able to buy HN comments by the thousand -- expertly wordsmithed, lucid AI comments
You're forgetting the millions of additional comments that will be written by humans to trick the AI into promoting their content.
Even worse, currently if you ask ChatGPT to write you some code, it will make up an API endpoint that doesn't exist and then make up a URL that doesn't exist where you can register for an API key. People are already registering these domains and parking fake sites on them to scam people. ChatGPT is creating a huge market for creating fake companies to match the fake information it's generating.
The biggest risk may not be people using AI-generated comments to promote their own repos, but rather registering new repos to match the fake ones that the AI is already promoting.
I feel like you’re overstating this as a long-term issue. Sure, it’s a problem now, but realistically, how long before code hallucinations are patched out?
The black box nature of the model means this isn't something you can really 'patch out'. It's a byproduct of the way the system processes data - they'll get less frequent with targeted fine tuning and improved model power, but there's no easy solve.
this is clearly untrue. it’s an input, a black box, then an output. openai have 100% control over the output. they may not be able to directly control what comes out of the black box, but a) they can tune the model, and they undoubtedly will, and b) they can control what comes after the black box. they can—for example—simply block urls
They don’t have control over the output. They created something that creates something else. They can only tweak what they created, not whatever was created by what they created.
E.g., if I create a great paintbrush which creates amazing spatter designs on the wall when it is used just so, then, beyond a point, I have no way to control the spatter designs - I can only influence the designs to some extent.
Assuming those hallucinations are a thing to be patched out and not the core part of a system that works by essentially sampling a probability distribution for the most likely following word.
evidently, they can hard-code exceptions into it. this idea that it's entirely a black box that they have no control over is really strange and incorrect and feels to me like little more than contrarianism to my comment
Folks, doesn't it seem a little harsh to pile downvotes onto this comment? It's an interesting objection stimulating meaningful conversation for us all to learn from.
If you disagree or have proof of the opposite, just say so and don't vote up. There's no reason to get so emotional that we try to hide it from the community by spamming it down into oblivion.
Most people don’t understand the technology and maths at play in these systems. That’s normal, as is using familiar words that make that feel less awful. If you have a genuine interest in understanding how and why errant generated content emerges, it will take some study. There isn’t (in my opinion) a quick helpful answer.
I genuinely want to understand whether there’s a meaningful difference between non-hallucinatory and hallucinatory content generation other than “real world correctness”.
I’m far from an expert, but as I understand it the reference point isn’t so much the “real world” as it is the training data: a hallucination is when the model generates a strongly weighted association that isn’t in the data, and perhaps shouldn’t exist at all. I’d prefer a word like “superstition”; it seems more relatable.
By hallucinating they’re trying to imply that it didn’t just get something wrong but instead dreamed up an alternate world where what you want existed, and then described that.
Or another way to look at it, it gave an answer that looks right enough that you can’t immediately tell it is wrong.
this isn't a good explanation. these LLMs are essentially statistical models. when they "hallucinate", they're not "imagining" or "dreaming", they're simply producing a string of results that your prompt combined with its training corpus implies to be likely
Now is the time to cultivate friendships and to make networks that persist online, and are verified via irl meetups / contacts. People who pull that off now will be in much, much better shape in the future. GPT's output is apparent to a discerning eye right now, but according to the power law, it won't take much "novel" input to train upon to make that discernment useless. Then, the only internet community that could be dependably reliable would be your group of irl verified people.
Agreed. It's very difficult now to build communities that have lasting impact, because everyone's saturated with info as-is. Contributions to niche communities now rely on a societal "outsider" status, which means there's basically a couple of people that contribute heavily and very few onlookers. Everything else is either gamified or comes from video games / gambling.
On the bright side, it's THE time to cultivate close friendships and to seek like-minded people. The entire phenomenon of popular attention hugging a community to death does not exist any longer. You can now have OG members persisting with notions for a long time and building a shared mythos with a small group of friends, because information is now more accessible than ever.
Obviously, most people aren't part of these communities. The people that are "drifting" alone are given to wasting their time on charismatic attention-seekers that talk a big game (twitch/e-celebs) but deliver nothing of value. So there's also room in the market for charismatic folk with some technical expertise to rally people to their cause, but only very briefly. This is because the number of people half-committing and then jumping ship is likely the highest it's ever been. Also, platforms have now resorted to paying people to stay on their platform (youtube / tiktok / sponsorships / twitch boosting streamers / etc.) to combat occasional ennui, ironically exacerbating the issue.
Most tight, close-knit groups originate from shared mythos. These can be family, proximity, "same school year", "same college", "friend of best friend", etc. Online, you can find people that are interested in some niche topic (or elaboration of some popular topic to an absurd degree) and engage with them. Small newsletters are also a good way to get people talking. What most people don't do is return attention, aka reciprocate positively. This could also mean you'd have to write about unrelated things or maybe try to build a "business relationship" that would then progress if you invest some time and hope for the best.
It's a really bad time to try and get the attention of someone more famous / notable than you, though. Sure, you can go on $platform and talk to them, but it's really not the same when they have a gorillion other messages. Same goes for people in large communities that are a "guy" there, known for something. Extremely high-return investments but you're likely going to fail.
Some people try to start youtube channels / info streams and then entice people to join their forum / server. While this does seem to work, it only brings in quality people AFTER the community is fully formed and rigorous laws are in place. The initial stragglers are usually the recently excommunicated looking to try their hand at the same shit somewhere else.
If you really put some effort into a topic and blog about it, you're likely to get some high-quality responses even if you only pose a question to someone that's partly interested. I've found this to be a really great way to separate the folks that are actually interested from those that aren't. You'll usually get people around your own level this way and IME this is the best approach.
It takes a lot of effort to make people clock in regularly to your online circle, and it's better to establish digital / irl face-to-face contact after a good interaction. It builds trust and because we're wired to judge people from their facial reactions rather than text, it also sobers conversation / tempers over potentially divisive topics. Works well with cerebral / "deep" people. Doesn't work with people that only come online to blow steam / enact a persona, so it's a good filter.
TL;DR: Touch grass (digitally), make friends (digitally)
Stop making up laws. You'll do much more good dismantling existing ones. And non-social signals like # of commits, # of pull requests cannot be faked? We need signals among the noise.
Sometimes signals are noise we just need to calibrate.
I mean, there have always been shills. What's changing now is the cost of shilling is dropping from dollars per comment to fractions of a cent. Troll farms used to be a lot of work to put together, but soon they'll be aaS.
Those of us who are careful internet readers have spent years developing good heuristics to use textual clues to tell us about the person behind the text. Are they smart? Are they sincere? Are they honest? Are they commenting in good faith? Those skills will soon be obsolete.
The folks at OpenAI, who are nominally on a mission to make sure AI "benefits all of humanity", have condemned us to a life sentence of fending off high-volume, high-quality bullshit. Bullshit that they are actively working to make harder to detect. And I think the first victims of that will be internet forums where text is the main signal, places like this and Reddit.
"The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor."
That's my issue with stars already. One repo having more stars than another doesn't mean it's better in any way. It might just mean it's been promoted more.
That's how record labels can simply decide what's going to be the next summer hit. They pick a song and promote the hell out of it. It's not the summer hit because it was somehow better, just more promoted.
You can buy Twitter followers, Instagram followers, YouTube views, Amazon reviews, Reddit upvotes, Reddit comments, and Yelp reviews - so what's so shocking about GitHub stars?
After that post on HN months ago[1] where users discovered OAuth permissions for unrelated things being used/abused to star projects without their knowledge, this news of buying stars didn't come as a surprise.
It's unfortunate as I've seen stars used as a metric of trustworthiness in general user discussions.
GitHub is fully aware of these schemes - would they consider something like a "confirmed" star count that subtracts the suspicious/fake number? Or is that too much of a slippery slope?
GitHub gradually removes these users as they catch up to them, so not helpful to have extra steps. I have a couple of repos which were briefly popular, so when a new user stars it today, and I see 1000s of other stars, it's suspicious and I get a peek into their world.
There are obvious numeric usernames, but also fake orgs with repos for the users to fork and interact with, and a few account takeovers (i.e. someone had signed up for GitHub in 2015 to make a free wedding website, abandoned it, and the account fell into spammer hands). These used to be easier to report.
>GitHub gradually removes these users as they catch up to them
With collateral damage too, I presume [1]. I guess I've been the victim of some automated system. They have banned my account without warning or explanation and they've been ignoring my support tickets for about 2 months!
> They have banned my account without warning or explanation and they've been ignoring my support tickets for about 2 months!
Which is especially ridiculous if this was due to a false positive spam detection as real spammers will not bother with chasing support when new accounts can be created easily.
How did you find out the name of the company behind GitHub24, though? If I go to their website I cannot see it, and I cannot even find anything if I search for the company name.
I was also surprised when I saw it. A GbR is a German "Gesellschaft bürgerlichen Rechts" which does not need to be formally incorporated and offers no limited liability. The name needs to include the names of all partners, so we can deduce it is being run by two persons. I am quite surprised they do this without liability protection. Upon googling, I found only a playlist on YouTube which has this name and contains one explainer video about signing up a company with German tax authorities.
If they are indeed based in Germany, they're required to have an Impressum / imprint on their home-page, without it, they risk being fined.
Show HN: there are maybe dozens of those posted every day, but they rarely hit the front page.
A Reddit ad is great to kick off star growth, but unless you have something interesting to many people, don’t expect more than 50 stars on the first day, and then a plateau to a star every few days.
Most GH stars I’ve got came from somebody mentioning my project in a comment in some heated discussion on HN. So I guess drama sells?
Is it just me, or does the fact that Dagster has one of their competitors, Mage.ai, listed here as a repo with around 15% fake stars seem like an odd coincidence?
If you’re going to accuse a competitor of fraud, writing a blog post showing your work seems like the safest way to do it. People lie with statistics all the time, of course.
[Blogpost author here]
We ran the numbers for Prefect and several other repos in our space and they came out clean. As we note in the article, while some repos game the system, from what we can tell the number of abusers is actually fairly small.
> we track our own GitHub star count along with that of other projects. So when we spotted some new open-source projects suddenly racking up hundreds of stars a week, we were impressed. In some cases, it looked a bit too good to be true, and the patterns seemed off
If their competitor has fake-looking star counts, I'd expect them to be the ones best equipped and most likely to suspect it.
It’s possible that was the impetus of the blog post. Maybe they suspect Mage.ai of astroturfing GitHub stars and investigate it as above. They then publish a blog post that:
1. Indicates the astroturfing without actually specifically calling them out
2. Does so in a way where others can verify their work and use it on other repos
3. Uses their product to do so
> Yet [GitHub stars] influence serious, high stakes decisions, including which projects get used by enterprises, which startups get funded, and which companies talented professionals join.
Really? I honestly just don't believe this... if I were to believe this, I think I'd have to conclude the world is just too broken to bother rescuing.
The list in the article, though, was carefully selected to presume competent people doing the decision-making. I totally believe many people use that star count for something... but an "enterprise"? Someone investing a non-trivial amount of money? A specifically-"talented professional"? I just find that really difficult to believe. I've sold software to enterprise, I've worked with a number of venture capital funds, and I know a ton of actually-talented professionals... I dare say most of them consider GitHub's social features to be a joke.
The enterprises I've dealt with cared almost exclusively about stuff like license choices, support contract options, and "invoice billing" ;P. The vetting process I've dealt with at VCs was intense, having worked both sides of that situation; and I know multiple people who have worked data science jobs at such firms to try to better select investments. As for a "talented professional", I can pretty much guarantee they are going to look at your codebase, not the number of stars it has, while they evaluate any number of more reasonable things to judge an opportunity on (commute, pay, management style, etc.). A key property of competent deciders is that they aren't using trivial metrics.
One of my stock interview questions asks people how they evaluate 3rd-party dependencies for use in a production environment. So many interviewees respond with GitHub stars as their main or only criterion. It depresses me every time.
It depresses me too, but what else can you do? I check what the docs look like, but if I'm to depend on a thing I'd rather choose something popular than unpopular. GitHub stars, Hackage downloads, StackShare... what else can one check?
That's a very interesting question. There are so many things you can look at. How is the documentation? Who are the primary maintainers? How are they funded? What are their motivations? Are the primary maintainers active on Stack Overflow, Reddit, Discord, etc...? How many contributors are there? How does their Github issues page look? What about the Github discussion page? How many maintainers are there total? How many downloads per week on NPM (for JS libraries)? From all of these things - how long do you expect this library to be maintained? And that's just the initial qualification research, nothing about how it will impact the actual code-base.
What did I miss? What's the best answer you've ever heard? How do you evaluate 3rd party dependencies?
You overlooked what I consider to be the first thing you should check — when was the repository last committed to. There are countless projects that rank high on every other metric, but are essentially abandonware.
Yeah good point... definitely something I would have checked, forgot to put it in the list. I'm baffled people have trouble coming up with more than "number of stars" for this.
Of course there can be libraries that are more or less "finished", so the last commit/frequency of commits isn't on its own a deciding factor, but in proper context/holistically it is definitely an important metric!
FWIW, I am not baffled by that, as the vast majority of programmers are not "talented professionals" (which is the specific category of potential employee I was balking at, along with enterprises and venture capital firms). So like, you ask your question, they say "star count", and you don't have to really continue the interview.
(When I was in high school, I used to work for a pre-Internet company that helped people pre-filter interview candidates for ads posted in classified sections of newspapers and what they did was have questions like this that could be asked by people well before they reached your calendar for an interview.)
However, some language ecosystems are more OK with "finished" software than others. It hasn't had a commit in 4 years because none were necessary. Needing constant updates is a sign the local ecosystem is driven by churn over quality.
I don't really think this generalization holds. TeX is one of the very few widely used pieces of software that's considered complete, more or less everything else is either getting updated or superseded by other things.
Clojure, Elixir, and Lisp (especially Clojure) all have slower acceptable churn rates than other language ecosystems. If it works sensibly (both in terms of being fully debugged and ergonomics) and the underlying system hasn't had significant changes, what good does a commit within the past six months do beyond signaling to the GitHub meta game?
Insights -> contributors, and the number of active maintainers based on the entire commit history of the project and the frequency of commits. Also, the network page, which shows the number of active forks. Also, PRs, and how they are handled.
Contributors is the most informative page for me. So many projects are a one-man show basically all the time. I don't mind that, it means passion, but it also means it can disappear at any moment depending on circumstances.
I also look into issue details to see how maintainers communicate with community members that do due diligence before asking for help.
Stars only mean something because of the people who do. They're the ones leading the herd. If you're just going off the social signals, then you're just monitoring where the herd is going.
Yep, this one is the headline item for me. Look at the code and, if it has further dependencies of its own, look at the code for those too.
The main question I'm asking myself while looking at the code is: if I had to fork this thing and maintain it myself, how would I feel about it? Because sometimes that happens.
I'd add support to that list. When it breaks, can I cut a contract and get an expert available to diagnose the problem within a few hours? Production outages are not the time for self-help and digging around in other people's code bases.
So, ask yourself for a moment: what is it you are actually caring about?
I'd like the project to not introduce security vulnerabilities or bugs into my code. I thereby care what language it was written in, what libraries they use, what their testing and QA/CI process is, and whether it is being used by any "critical" projects (like, if that library is embedded in Chrome, you have to bet there are tons of people like me every day trying to hack it).
As part of that, I care about whether the project takes a cavalier attitude towards contributions: if I see a number of pull requests from random "contributors" being casually accepted, that is going to be a major major red flag; if possible, I want to see a core team doing most of the development and integration (and not merely most of the "review", as I see in some projects where the people in charge feel above doing work).
I definitely care that the project is being maintained and that there are people paying attention to issues, and it needs to have a culture of taking bug reports seriously... nothing is more dangerous than a project that tries to pretend they are responsive using bots to "automatically close" issues: I'd rather see bugs open for years than worry a critical issue was reported and subsequently lost.
I am certainly curious how work on the project is funded and whether I can trust that its license is going to hold constant over time: I don't want to end up relying on a dependency that is really the pet project of a small startup that is either going to disappear next year or will decide to redirect development to a closed-source fork. I'd thereby also prefer the project be run by a core committee of participants from multiple companies.
I honestly can't imagine caring two shits about how many stars a project had on GitHub... hell: what if the project isn't even on GitHub? What then? Do you just give up and decide it sucks? A world where everyone feels any incentive at all to put their code on a centralized platform is one where we have all failed as stewards of the future of software :(.
Activity on other sites related to finance/coding is similar (seekingalpha likes, for example) and I've gotten organic inbound requests for work periodically scraping such info into... Excel.
I have a half-written article about this, but I didn't have any good notion about quantifying the problem so this article is very welcome info to me.
My own angle is that copilot has shifted the incentives around this practice, maybe substantially. Businesses want to get (free tiers of) their paid SaaS endpoints into copilot suggestions - it's a great funnel!
I'd guess that github is as likely as not to become an SEO spam battlefield (like the rest of the web).
I wrote a tiny tool which calculates the "brightness" score of a github repo based on calculating the total star count of the people who starred your repo. It will automatically detect these kinds of scams (assuming that it's mostly low star bots giving the stars).
Edit: I love clustering, I really do, but I think that techniques like the one I am using are far superior to unsupervised learning for trying to detect fake accounts in this context.
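The gist, heavily simplified (this is a sketch of the idea rather than the tool itself; it uses public GitHub REST endpoints, the token/owner/repo values are placeholders, and pagination is omitted for brevity):

    import requests

    TOKEN, OWNER, REPO = "ghp_...", "some-owner", "some-repo"   # placeholders
    H = {"Authorization": f"token {TOKEN}", "Accept": "application/vnd.github+json"}

    # Who starred the target repo?
    stargazers = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/stargazers",
        headers=H, params={"per_page": 100}).json()

    # "Brightness": sum the stars earned by each stargazer's own repos, so that
    # stars from throwaway accounts with no footprint count for very little.
    brightness = 0
    for user in stargazers:
        repos = requests.get(
            f"https://api.github.com/users/{user['login']}/repos",
            headers=H, params={"per_page": 100}).json()
        brightness += sum(r["stargazers_count"] for r in repos)

    print(f"brightness of {OWNER}/{REPO}: {brightness}")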
It is worth noting that it is trivial to buy fake stars for a project you are not affiliated with. The reason someone might do this would be to "test" the purchasing of fake stars without risking contaminating their own project.
The projects with suspicious stars were still >80% nonfake stars. That to me suggests that most of the fake stars have been classified as nonfake. There isn't much psychological value in boosting your star count by just 25%.
Depends on when the fake stars were created. If they are early in a project's life cycle, they may be used to get attention on the project, and once it has awareness, fake stars are no longer necessary.
While evaluating an OSS project, a key indicator is community activity.
GitHub stars are a weak community activity indicator. Firstly, as shown in the article, they can be gamed. Also, starring is a very low-threshold action, so it does not indicate whether the person who starred the project will actually use it.
I think two great community activity indicators are GitHub issues and Slack/Discord/Discourse comments.
One key thing with GitHub issues, in my opinion, is that if the issues are mostly from the core team, it is not a great sign. You want a large mix of issues from customers or users, not from the team. This is a good indicator of whether the project is solving a real problem or not. The same goes for the Slack comments: they should have both volume and freshness.
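If you want to eyeball the issue mix quickly, GitHub's issue objects carry an author_association field (OWNER / MEMBER / COLLABORATOR / CONTRIBUTOR / NONE), which is a rough proxy for core team vs. outside users. A small sketch (owner/repo are placeholders, pagination omitted):

    import requests

    OWNER, REPO = "some-owner", "some-repo"    # placeholders
    issues = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/issues",
        params={"state": "all", "per_page": 100},
        headers={"Accept": "application/vnd.github+json"},
    ).json()

    core = {"OWNER", "MEMBER", "COLLABORATOR"}
    real_issues = [i for i in issues if "pull_request" not in i]   # the endpoint mixes in PRs
    outside = sum(1 for i in real_issues if i["author_association"] not in core)
    print(f"{outside}/{len(real_issues)} recent issues were opened outside the core team")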
You have a point. I have often seen OSS projects with no revenue being funded on the basis of GitHub stars, even though all the other parameters show that the project's health is not that great.
Maybe not as cheap as you may think. I think GitHub takes a small cut, plus you may need to declare the donation as income on your taxes.
Also, if you get "smart" and donate from multiple cards, I would think it is a trivial task for GitHub to determine it is a scam. The CC address would match your address for the funds you receive.
> GitHub Sponsors does not charge any fees for sponsorships from personal accounts, so 100% of these sponsorships go to the sponsored developer or organization. The 10% fee for sponsorships from organizations is waived during the beta. For more information, see "About billing for GitHub Sponsors."
GitHub sponsors has been out of beta for a long time, they take 10% of the donations if the code is under an organization which is very common for OSS projects. Of course one of the ways to get around it is to sponsor the lead developer, which is sometimes available as an option. Or just sponsor the developer some other way which doesn't go through Microsoft such as Liberapay or Opencollective.
Pretty sure those who game their repo are motivated by investment in the associated startup. I think you are right that community activity is a high-fidelity indicator, and a smart investor in OSS startups should definitely not only lurk in the community but, if possible, actually have the resources to kick the project's tires as well.
In a very strange way (but reflective of the economic regime), a startup that fakes stars vs a straight-arrow startup that doesn't is demonstrating a key element for success in business, which seems to require a significant element of bullshitting and outright deceiving. The mantra has been that "grow grow grow" is the only guideline for success. Inflating your stars is just rookie-hour practice for bigger, better growth b.s. down the line.
My ex-employer used GitHub stars in their job descriptions and during recruitment pitches. They regularly encouraged employees to go and star the firm's repos on GitHub. In all-hands meetings, GitHub stars were one of the items they reported: "we've surpassed X in GitHub stars" (applause).
(The firm X, however, is a more well-known name than my ex-employer was).
A while ago, I listened to a Freakonomics episode where it was discussed that businesses use proxies to both boost their image and to cover up their incompetency. The example was that a lot of businesses chose fancy names starting with A (like, AAA plumbers), so that they get listed first in business directories. These firms were later proven to be very incompetent and/or even fraudulent.
do you mind elaborating on this? I am using Reddit to advertise some of my projects because it seems like a relevant crowd to advertise to, but I am curious to hear how it would be perceived.
only vaguely related - but I've been recently trying out dagster and I'm pretty impressed so far. I've run large-scale data processing from Hadoop onwards and was expecting the usual crumminess whenever you hit an edge case.
Instead I found a system that seems to be thoughtfully designed and, crucially, easy to debug.
I didn't know people used stars to make decisions. For me it is more like HN karma points. I use the issue history/PR history to get an idea of how good or bad a project is.
I'm surprised that Github stars are valuable enough to buy. Personally I never look at the star count because even if they were legit, they don't really tell me anything more useful than I get from looking at other things in the repo.
I tend to check the age difference between the earliest and latest commits because that lets me be sure it's not a project that someone spent a couple weeks coding up, dropped on github, and then forgot about. I'll also check the issues on there. I'm looking for more closed issues than open ones, but I'll also quickly scan over them to get a rough idea of how many are truly meaningful issues. I also get signals from the readme and docs. It's not a hard pass if there are issues with those, but it certainly helps my opinion if they exist and are both clear and detailed.
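Most of that can be pulled from the public API if you want to script a first pass; a rough sketch (owner/repo are placeholders, and note that open_issues_count also counts open PRs):

    import requests

    OWNER, REPO = "some-owner", "some-repo"   # placeholders
    H = {"Accept": "application/vnd.github+json"}

    meta = requests.get(f"https://api.github.com/repos/{OWNER}/{REPO}", headers=H).json()
    print("created:", meta["created_at"], "| last push:", meta["pushed_at"])

    # One way to count closed issues is the search API.
    closed = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"repo:{OWNER}/{REPO} type:issue state:closed"},
        headers=H,
    ).json()["total_count"]
    print("open issues (incl. PRs):", meta["open_issues_count"], "| closed issues:", closed)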
I mean based on the number of repos they identified buying stars and prices advertised, the revenue just doesn’t make sense. The sellers have made like, hundreds of dollars at most. How much effort have they invested for this meager return?
I find stars helpful when I'm evaluating several different repos to choose a particular tool for a job.
If one of the repos has many more stars, I weigh that strongly when choosing. Freshness of commits is definitely important, but for me the fact that many other people starred the repo shows that there are eyeballs and activity.
I'll admit I've used them. In particular, I've used paperswithcode to find implementations of ML models. There are often a number of implementations of the same model, and the quality is highly variable. I've used stars (which paperswithcode displays) as a pre-screen. Spoiler alert: the highest-starred implementations are not always the best. But it still helps to triage, as a proxy for how well used it is.
You are likely not important enough to scam. The first people I can imagine this being shown to are VCs in pitch decks who are only going to see this on a powerpoint and not actually on github. Very unlikely the VC will check github itself to verify the number, and if they do, even less likely they'll verify that the stars are real.
You're the kind that checks everything. Even if you had something valuable, a scammer wouldn't waste their time with you when there are easier fish to bait.
Interesting, I just use them to keep track of interesting projects (edit: not the number of stars as a proxy; stars are basically my bookmarks). People treat them as internet points?
I really wish GitHub would have some sort of flag for "stale" projects. I use your methods too (issues, dates, etc.), and I'm usually disappointed when search results bring up ghost projects. However, in a few instances, I found a project that was similar to an issue I was working on that went one step beyond where I was, and even though it was a ghost project, it helped. But in general, these projects don't help. I'm also disappointed that I'm thinking, "Hmmm, maybe LLMs can help..."
Why is stale a bad thing? It could be something that was created to serve a purpose, developed to the point that it was feature complete for that purpose, and now requires no more development yet continues to do its purpose without modifications.
It's almost like you are thinking of it as an expiration date and the software has spoiled.
"Stale" and "done" are different states. Stale is when bugs are known but not fixed, dependencies old and unsupported, build instructions do not work any more on modern versions of OSes and other environments.
All software is subject to shifting environments over time that will eventually render it obsolete. How fast this happens really depends on the ecosystem—it's a function of the abstraction level and context in which it runs. C or Go code that compiles to a standalone binary will be less susceptible to this, higher level Ruby or Node code that depends on a lot of peer libraries moving in lockstep will be more susceptible. Newer languages that have some notion of backwards compatibility baked into their charter like Elixir or Rust are somewhere in between.
well, the original dev did release the code as open source. you are free to take their lead and continue on with modifications in your own source or even as a fork if you feel so strongly about it needing to be maintained to that level.
Yes, I certainly could. This comment chain started with "why is stale a bad thing". It's bad because I have to do that.
There might be a maintained fork/separate project that does what I want that I would like to find instead. Or maybe I was just searching to save myself 30 minutes on a one time task and I'm not up for adopting an abandoned project.
Because many languages have breaking changes in the interpreter. For example, it is almost impossible to review old Python projects: you have to change so much that in many cases it is easier to rewrite.
Rust and other compiled languages that have backward and forward compatibility in mind do much better.
But in that case it should have a note saying it's finished or in maintenance mode (e.g. https://github.com/sirupsen/logrus); include references to replacements, offer paid support if you really need it or still use it, keep an eye on issues, and update dependencies.
Else, ask for a new maintainer. While code can be considered done (especially if no new features are added), it should never go unmaintained. If it's actually used a lot of course.
I have one project on GitHub that I use all the time as part of a script and only push changes when the Python API breaks it. It is essentially “finished” and usually just needs a quick compile against the new Python version whenever I upgrade the distro. I haven’t even had to touch it for at least as long as GitHub has required SSH keys, so by all accounts this would be an abandoned project.
Now that I think about it — it is a python wrapper around a boost library and neither of those have made backwards incompatible changes in a long time which is quite suspicious.
Boost libs circa Ubuntu 14 or 16.04 had a JSON parser that allowed comments, while the newer Boost in Ubuntu 20.04 (and I think already in 18.04) had "updated" it so that it didn't allow comments any more.
Just a small anecdote of Boost changing behavior that broke some of my stuff.
I kind of expect that I’ll have to do some work at upgrade time, but it’s been a while. Usually Python is the culprit; I can only remember Boost breaking something once, and that was a different project. The maintainer was quite nice in trying to help me figure it out, but I don’t think I ever got it working the same again.
Metrics based on issues / commit activity are certainly higher fidelity.
As you indicate though, they require more effort to adjudicate. Are issues from core team members? Are commits meaningful? Is community activity meaningful? I wish GitHub would allow us to parse things like this more easily.
My use of star count is generally a binary indicator. 1k+ is probably a legit project and below is probably still early. Beyond that, it's probably too noisy.
Closed issues don't mean anything though... a lot of maintainers bulk close hundreds of issues as "nofix", "no activity after 3 months", and so on. Just sweeping them under the rug. And many of them pride themselves on the 0 open issues like it means something. Any software in the world can have 0 issues if the maintainers play this game.
So unless you are really well versed in the project and spent some time following it, stars actually might be a better indicator of the project quality and reputation.
> a lot of maintainers bulk close hundreds of issues as "nofix", "no activity after 3 months", and so on
God, I hate this. Every time I have an issue with something, look it up on the issue tracker and find the exact issue I'm having autoclosed as "stale" by a fucking bot because the author didn't reply "this is still an issue" once every 24 hours, it instantly makes my blood boil and I avoid using the software in question as much as possible in the future. Nothing screams "I care more about github numbers than my users or the quality of my software" more than this.
I don't think GP said anything about making demands. They said they avoid using that piece of software and that is not a demand on the software's author.
If you read my comment carefully you'll notice that I at no point demanded that the developers actually fix the issue.
The problem here is simply closing issues that are not fixed because they're "stale"; there is no reason to do this unless you're obsessed with keeping the number of open issues low to deceive people into believing no issues exist. Keeping issues open does not take any effort.
I can be upset with people lying to me even if I don't pay them, and there is nothing wrong with avoiding projects that engage in such behavior and warning others about them.
> I tend to check the age difference between the earliest and latest commits because that lets me be sure it's not a project that someone spent a couple weeks coding up
I doubt anyone would do this, but commit date can be arbitrarily changed.
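To illustrate: git takes the author and committer timestamps from environment variables when they're set, so nothing stops someone from backdating an entire history. A minimal sketch (the date and message are made up):

    import os
    import subprocess

    # Git reads these variables for the commit's author and committer
    # timestamps, so the resulting commit will claim to be from 2013.
    fake_date = "2013-01-01T12:00:00"
    env = dict(os.environ,
               GIT_AUTHOR_DATE=fake_date,
               GIT_COMMITTER_DATE=fake_date)

    # --allow-empty just avoids needing staged changes for the demo.
    subprocess.run(
        ["git", "commit", "--allow-empty", "-m", "backdated commit"],
        env=env, check=True)

The same thing works for the author date with plain `git commit --date=...`, so "the earliest commit is N years old" is a weak signal on its own.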
I have moved all my repositories to sourcehut. They are generally mirrored by a github repository consisting of a single README file explaining the new location for the project, and my reasons for the migration.
However, given that sourcehut eschews the use of such "social metrics" (at some level I agree with the principle behind that; on the other hand, I do appreciate the value of being able to give visibility to good projects), I usually mention in my README: "If you like the project and wish to promote it, feel free to star this github page".
I'm sure github probably wouldn't like this use-case, but the stars would certainly be genuine, even if possibly quite dodgy-looking.
I’m conflicted about this. Sourcehut, Codeberg, etc. are great. But having everything I’m looking for on GitHub is extremely convenient. I use the “Add to List” function extensively for bookmarking.
Yes, this is why I didn't want to migrate without leaving a trace on github. The redirecting README on github is a good compromise, I think.
Having said that, it may be worth thinking about the price we may be paying as a community for this convenience. MS GitHub is clearly already past the "embrace" phase, and well into the "extend" phase.
Rabbit trail: I accidentally right-clicked on their home icon and it brought up their branding page with license agreements for their IP. Really neat idea.
This sort of gamification exists only because there are too many green engineers who only care about their salaries; they mimic what people successfully recruited by FAANG (etc.) did, and so do other companies. Then this purity spirals into taking the entire field down because there's no one around to educate the new newbies. Facebook was IMO a step in the right direction because it was a "general" social network: you could post anything. Imagine if FB had released some sort of "extension" that allowed you to share anything via a template of sorts, instead of having to type out everything in the same old text post. It would have been meta enough (sorry) to not spiral very quickly.
Leaving the arena is the only viable option. Software projects that aren't dependent on github drive their own vehicle; everyone else is on a crowded bus.
I wrote on this topic a while ago; experimenting, I found out you can basically change repo names and keep the stars. This wouldn't work if you use the repo as an issue tracker or PR tracker, since the history would all be broken, but if it's pretty much just the code, it's easy to swap the star count between two repos.
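For reference, a rename is just an update call on the repository in the GitHub REST API, and stars (and watchers) carry over, with a redirect left at the old name. A rough sketch, assuming a personal access token with repo scope; the owner, names and token below are placeholders:

    import requests

    def rename_repo(owner, repo, new_name, token):
        # PATCH /repos/{owner}/{repo} with a new "name" renames the repo;
        # star counts survive and the old URL redirects to the new one.
        r = requests.patch(
            f"https://api.github.com/repos/{owner}/{repo}",
            json={"name": new_name},
            headers={"Accept": "application/vnd.github+json",
                     "Authorization": f"Bearer {token}"})
        r.raise_for_status()
        return r.json()["full_name"]

    # hypothetical usage:
    # rename_repo("someuser", "old-project", "new-project", token="ghp_...")

Swapping the counts between two repos is then just a sequence of renames, since the old name frees up once the first rename goes through.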
Sounds like they take it more seriously than Google takes likes on YouTube. A competitor had a video that rapidly got over 100k likes, but if you looked at the total time played, each view averaged out to just a couple of seconds on a video over 10 minutes long. Reported it, but nothing came of it. (No, not something we regularly do. I think it may be the only video I've ever reported; I just want a fair playing field.)
youtube competitor. that's just funny to me. kind of even comes across as petty. you took however much time to investigate a competitor's average view time and then cried to daddy about the perceived "advantage" instead of taking that time to improve your competing product.
No, we had someone show up out of the blue, with no established presence in the space, with a video with hundreds of thousands of views. I was curious how they went viral so fast.
Overall, it's bad for everyone if someone can create fraudulent views: us, other companies, and most importantly, consumers.
> taking that time to improve your competing product to make it better.
Took less than 3 minutes to do the math and send the report. I'm a fast developer, but I can't improve our product that fast :-)
> if you looked at the total time played, each view averaged to just a couple of seconds on a video over 10 minutes.
That makes no sense to me. Speaking as someone who has been using YouTube Data API v3 and YouTube Analytics API v2 for many years, estimated minutes watched of a video shouldn’t be public info. So how can you “look at the total time played” on a competitor’s video?
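To make that concrete: to the best of my knowledge, the public Data API only exposes aggregate counters on a video, while watch-time metrics such as estimated minutes watched live in the Analytics API and require the channel owner's authorization. A sketch of what anyone with an API key can retrieve; the key and video ID are placeholders:

    import requests

    def public_video_stats(video_id, api_key):
        # videos.list with part=statistics returns counters such as
        # viewCount, likeCount and commentCount; there is no public
        # field for watch time.
        r = requests.get(
            "https://www.googleapis.com/youtube/v3/videos",
            params={"part": "statistics", "id": video_id, "key": api_key})
        r.raise_for_status()
        return r.json()["items"][0]["statistics"]

    # hypothetical usage:
    # print(public_video_stats("VIDEO_ID", api_key="YOUR_API_KEY"))

So unless the uploader shared their analytics, I don't see where a public "total time played" number would come from.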
This is a great article. I've developed the same tactics for other projects but was never able to find the proper vernacular; it really helps with figuring out how to organize and present information.
I wonder if this is also part of general OSINT or (ISC)² training - everything this article showed about following breadcrumb trails and working the operation in reverse (e.g. pay a company to do the work, see how it turns out, evaluate the results, see if you can find other work similar or akin to it).
Things like this are part of why I cringe when I see supply chain analysis/security companies include “popularity” in their criticality metrics: the relationship between public popularity signals (like GitHub stars) and criticality is weak, at best.
In my experience, it's actually a great signal. That's why so many people rely on it. The distribution of GitHub stars is an extreme power law.[1] Stargazer thresholds are used to make decisions about including projects, for purposes ranging from dependency management to package-manager maintainers deciding to list software by name.[2]
Selection suitability and criticality are different metrics. The former is what Homebrew uses, as a way to lessen maintainer load and prevent inclusion in Homebrew becoming its own quality signal. The latter is what I’ve seen supply chain companies provide: an implication that a project is somehow critical or essential to the overall ecosystem because it has so-and-so many stars.
That first use is not unreasonable, in my opinion. The second one is questionable, at best.
Maybe our code forges don't need to be social media platforms. These 'stars' have pretty dubious value and rarely correlate with code quality or importance (core libraries generally get less attention than apps or tools). There's also a heavy language skew: JavaScript and Python libraries and programs get way more thumbs-ups even when they're technically no better than the alternatives.