Who is horse_js? (whoishorsejs.com)
206 points by juniusfree | karma 234 | avg karma 6.88 | 2019-01-19 07:01:23+00:00 | 87 comments





Very fun read, it contains just the right amount of detail to stay entertaining without getting bogged down in minutiae.

Neat concept, wonderful execution, and beautiful presentation. Now if only the entire web could follow suit.

I have to politely disagree. This could've been made as a static page. Instead, with JavaScript off you see just a totally blank page. Thank God that most of the web doesn't follow suit.

I would add to that the unnecessary scrolling of this presentation style. With a classical layout you can get far more information above the fold and allow your reader to skip stuff ("ah, yeah, I see what they did there") without having to constantly interact with the keyboard / mouse.

This is journalism!

This is native advertising.

But it does have the tone of a type of “data journalism” that we should see more often.

I would appreciate a site that treats all news the way 538 treats politics.


I was scrolling through the whole article raging for an ethics paragraph, but I guess they handled that pretty well.

With a nifty and I think necessary touch of themselves still being in the dark; I very much doubt that the data they gathered can really reveal the author's identity, and the result they arrived at (Tom Dale) seems to largely originate from the "quotes one person far more than others" metric.

You could almost consider it an anti-metric: which intently pseudonymous author would dare to retweet their own nym? However, to counter this analysis you'd have to blend in with the average Twitter user in your niche, so it then comes down to a psychological game of "what would horse_js do?".


The Ember part is a giveaway too: since it's a framework with a small but passionate user base, you'd expect the person behind the account to care about Ember. That being said, the reasoning doesn't rule out wycats.

> I was scrolling through the whole article raging for an ethics paragraph, but I guess they handled that pretty well.

What good is an "ethics paragraph" anyway? Isn't it like prefacing an offensive statement with "no offense, but"? It's one thing if you have a disclaimer to protect the user, since the user is making the decision. It's another thing to make a disclaimer that just lectures the user about privacy, but does nothing to protect the doxxed. That just seems like a lame attempt to make the site less liable, like when people post copyright content with the disclaimer "I do not own this."


I'm clearly a terrible person, but I read the reveal and immediately thought "But X isn't funny enough to be horse_js!"

I thought "Oh! Exactly! That is definitely his type of humor, makes so much sense". Especially the actual accounts he quotes are often very small and quite specific.

I'm very surprised they didn't find him based on the least quoted people's followers actually.


Wowwwwww

haha, ur funny

> We got permission from our suspect before we released this site and they have allowed us to use their name and release the data that we had about them.

Looking forward to your response!


It's interesting but also worrying. This is pretty much doxxing under the guise of intellectual curiosity.

If the article explained why they wanted to identify this account then fair enough. However, you are going to end up on an ethical slippery slope where it will be used to doxx people who are controversial, trolls, political dissidents and whistleblowers.


None of the techniques they used to come to their conclusion were groundbreaking or even surprising.

To you or me, but for others it's an education on how they can be compromised or how they can compromise others.

I don't see the problem. We didn't need this demonstration to know that people can be unmasked with these methods. They had consent from the "doxxed" person and explicitly addressed the ethical issues.

What more do you want? It's not like we can take away these dangers by not talking about them openly.


Bug report: time series analysis chart fails to display properly on Firefox, all points stay on x=0, I see "Unexpected value NaN parsing cx attribute" x250 in the console. Works fine on Chrome.

Breaks in Safari too.

The layout is completely broken in Edge too. They must have forgotten that it does not use the Chromium engine yet.

Too bad for a (fun and clever) Microsoft advertisement.


It seems that the date format is not correctly recognized by moment.js. There is a warning in the console.

This must be them prepping for that new Internet Explorer-flavored Chrome or whatever.

I actually found it easier to read in Firefox, as there was only one axis to read from. The correlation was clearer.

In case you didn't notice, this is clever advertising for Microsoft Azure. Both authors work for Microsoft. I could count 3 mentions of Azure, two direct links to Azure products, 1 quote from a Microsoft researcher, 1 quote from a Microsoft dev advocate and 1 embedded Bing maps. You've just been played by Microsoft marketing. Also, Tom Dale works for Microsoft himself so it's just one big family story.

Marketing for EmberJS perhaps ? :P

Noticed it, but I like(d) the ad / package / presentation.

Personally, I think this is "good" marketing.


Kudos to you for noticing, but I think they should add some kind of disclosure.

This is the kind of PR (rather than marketing) that I despise, because it is deceiving.

I mean, it's annoying that it wasn't clearly stated that it was a MS article, but from the beginning I thought it was fairly clear that it was an advertisement for something, and all the Microsoft references were a pretty strong hint it was for them.

Do we know for a fact that this was a marketing tactic conceived and developed by the Marketing group at Microsoft? Versus simply these three individuals, of their own accord, wanting to show off the tools in a fun(?) way?

Does it matter? If you're showing off something that depends on your company's product then you're advertising for them, regardless of whether you're paid for it directly.

Yes, it does matter. I'd question the intentions of a marketing department very differently than I would 3 developers who appreciate a given set of tools and want to share them with other developers.

Marketing for Twitter? (Bad sarcasm)

Fuckkkk, the ads are evolving. South Park is right!! I didn't realize it.

Compulsory XKCD link: https://xkcd.com/810/

Tom Dale working for MS makes it likely that this was a reverse-engineering effort.

As in, they knew who they were looking for from the start, and just worked with the data to find the known conclusion.

Also, there's no actual machine learning in this really, except calling out to a hosted language processing service...


A nice case of parallel construction.

Technically any project showcasing a specific technology could be considered advertising though.

Ah, who is whoishorsejs.com

This is really telling because the twitter api already tells you the source of a tweet anyway (android/tweetdeck/hootsuite). It's like they didn't even try.

What's the relevance of this? The article points this out, no?

That’s awfully cynical. Other than wondering exactly why they used those Azure services (which makes sense now that I know they work for Microsoft), nothing else about Microsoft stood out for me when I read through.

Edit: I’m not sure how that was worthy of a downvote. Not everything is a conspiracy, but thanks for valuing contradictory opinions.


>>That’s awfully cynical.

Perhaps so, but that doesn't make it not true.

http://www.paulgraham.com/submarine.html


>That’s awfully cynical.

The world is awfully exploitative and cunning. The world of business doubly so.


Am I the only one that skimmed the story so fast that I didn't notice the MS associations?

I could tell it wasn't going to be a super deep dive or deeply technical just from a look at layout/style of the post -- but it wasn't supposed to be super deeply technical right? Kind of just like a fun post.

They were extremely helpful in making the "conclusion" bits on the end of every section very obvious though (I aim to write like that as well, to save readers time), so I basically only read those, and skipped to the bottom where they unveiled and clicked the button...


The article is fun and well written, and it showcases how the marketed tools/services can actually be used to accomplish something interesting. As long as the content is useful, should I care if it was written as a way to advertise a product or not? This is so much better than door-slamming ads in one's face.

Thank goodness. I was worried that we were going to see another doxing like the one that drove why the lucky stiff away.

> Finally. Some Machine Learning.

> We ran all of @horse_js tweets from the last 2 years through Azure Cognitive Services Text Analytics service. This service identifies keywords in phrases.

How was that necessary in comparison to a simple "split by whitespace, count occurrences"? :P
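
For reference, the "split by whitespace, count occurrences" baseline could look roughly like the sketch below. The function name and the sample tweets are made up for illustration; this is not what the article's authors ran, and it does none of the stop-word or key-phrase handling a hosted service might add.

```python
from collections import Counter


def count_words(tweets):
    """Count lowercase, whitespace-separated tokens across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # str.split() with no arguments splits on any run of whitespace.
        counts.update(tweet.lower().split())
    return counts


# Made-up sample tweets, not real @horse_js data.
sample = ["ember is fine", "Ember templates are fine"]
word_counts = count_words(sample)
print(word_counts.most_common(2))
```

Punctuation stays attached to words in this version, so "fine." and "fine" would be counted separately.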


Sounds better in a job interview.

That service does more NLP-level stuff: remove stop words (a, the, an, etc.), stem or lemmatize the text (eating becomes eat), keep only words that represent the core message of the text. I think that's about it.

Neat, that's actually something I'll find useful in one of my personal projects. Guess I'll have to check it out.

We used Lucene (open source) in our information retrieval course and tokenizing (w/ removing stop words etc.) is one of the things it does. If you just want to experiment, that's also another option to look at if you like!

That's stuff I can hack together in 20 minutes using JS or PHP... Why on earth would you need a remote API for that?

This wasn't very rigorous at all, but it was a moderately fun read because it really made it clear how "large amounts of data" with some simple visualization can help people make some mostly-educated guesses.

I was most surprised at them glossing over the activity patterns with guesses based on their assumption of the target's sleeping patterns -- their guess of a time zone would have been stymied by someone who liked to sleep early/late or had an unusual work schedule, but there was no mention of that or their reasoning.

That all made sense given patrickaljord's comment about it all being one big Microsoft ad, though.


This can also be a pretty powerful troubleshooting approach, which is why I consider having something like grafana or prometheus around extremely valuable.

Looking for anomalies at the same time, or in sequence easily turns "What is going on?" into "Alright, why is there so much more stuff coming into the system, and why is that increased ingress causing increased memory usage per event?"


I think it's less likely that it would be irregular sleep patterns but you're right, it could've been mentioned.

> their guess of a time zone would have been stymied by someone who liked to sleep early/late or had an unusual work schedule, but there was no mention of that or their reasoning.

As you said - this only makes sense because it's a Microsoft ad and they could have arguably known the answer to start with anyway.

Given that this is focused on someone who's at least interested in software development, they made some pretty specific assumptions.

There's no way anyone (unless they somehow literally think there is nothing outside America) would just discount the concept of it being someone in another country, and/or with non-'regular' work hours.

But this is Microsoft after all. The imagination of a brick. I suspect if you asked about remote work, they'd think you want to work on remote desktop.


Just use elasticsearch/kibana and you will save yourselves from lots of Azure costs. Find keywords and group by device and location. Simple as that.

> Just use elasticsearch/kibana

Are you volunteering all the time to set that up for me? ;-)


> @horse_js lives in either the Central or Eastern time zone. Their activity dwindles sharply in the evening and disappears between ~11 PM - 12 AM CST and reappears at ~8 AM - 9 AM CST because they are likely asleep.

As someone commenting at 4 AM, this might not be a great assumption to make ;)


do you comment every day with consistency around 4am? the point is with enough data you establish a pattern and ignore outliers. the time series graph is indicative of an EST/CST sleep schedule.
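
A rough sketch of that "enough data establishes a pattern" idea: bucket post timestamps by hour and look for the quietest stretch. Everything here is an assumption for illustration (the function names, the 8-hour window, the synthetic timestamps); the article does not say how its chart was analyzed, and the timestamps are assumed to be timezone-aware.

```python
from collections import Counter
from datetime import datetime, timezone


def hourly_activity(timestamps):
    """Return a 24-element list of post counts per UTC hour."""
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def quietest_window(hist, width=8):
    """Return the UTC start hour of the `width`-hour window with the fewest
    posts, wrapping around midnight. Ties go to the earliest start hour."""
    def window_total(start):
        return sum(hist[(start + i) % 24] for i in range(width))
    return min(range(24), key=window_total)


# Synthetic data: one post per day in each active hour, none from 05:00 to 13:59 UTC.
active_hours = list(range(14, 24)) + list(range(0, 5))
posts = [
    datetime(2019, 1, day, hour, tzinfo=timezone.utc)
    for day in range(1, 8)
    for hour in active_hours
]
hist = hourly_activity(posts)
quiet_start = quietest_window(hist)
print(quiet_start)
```

A quiet window only suggests a sleep schedule. Mapping it to a time zone still rests on the assumption that the person sleeps at conventional hours, which is exactly the objection raised above.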

More consistently than I'd certainly like. I've been told by concerned friends that I have issues with my sleep schedule; the most recent was something along the lines of "why is it that when I check Hacker News in the morning I keep finding your comments made three hours ago".

The article was quite enjoyable to read though full of MS marketing.

When I saw the part with Azure Cognitive Services Text Analytics, I burst out laughing. Earlier, they quoted: Half of the time when companies say they need "AI" what they really need is a SELECT clause with GROUP BY.

Their motivation of using AI is even below that threshold.

Now awaiting some horse_js comment about this absurdity.


This is very bad. They spoiled it for everyone. That privacy disclaimer at the end is of no use really. Also Bing maps? That's the first time I saw that.

It's made by Microsoft employees, that's why all the advertisements for Azure and other Microsoft products.

That was a nice, fun read, and a simple way to show how sometimes, simple data analysis and common sense trump everything else :)

Also, great website design. Simple and clean!


So why was this a statistics problem?

cluster horse_js and Tom Dale's tweets in an embedding space and you can confirm your hypothesis.

> The API is rate limited, so we created a set of Node.js Azure Functions that ran on timers. These functions would request as many tweets as they could before they were rate limited, wait for the timeout interval specified by the API docs, then resume processing where they left off.

How does this work? You pay for your function's run time in serverless so you wouldn't want to just have the function sleep for x minutes or however long it gets rate limited surely. I can see a way to do it using a service bus queue (push the message with a delay of x minutes, have the function set up to run on messages on that queue) but they specifically said timers. Does Azure let you programmatically set the timer for a function from inside that function (e.g. "Run me again in 3 minutes")?


Azure Functions can be configured to run on fixed schedules via timer triggers (https://docs.microsoft.com/en-us/azure/azure-functions/funct...), so I’m guessing they set theirs to run every API timeout interval + max amount of time they could request tweets before getting rate limited. Their Cosmos DB instance could then be set up to track how many tweets they had gotten through on each function run.
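
The fixed-schedule guess above can be sketched without any Azure specifics: each scheduled run resumes from a saved cursor, exits as soon as it is rate limited rather than sleeping, and leaves the rest for the next tick. All names here (`run_once`, `RateLimited`, the fake API, the `state` dict standing in for a database record) are hypothetical; nothing in the article confirms this is how the authors did it.

```python
class RateLimited(Exception):
    """Raised by the (hypothetical) client when the API quota is exhausted."""


def run_once(fetch_page, state, max_pages=100):
    """One timer-triggered run: resume from the saved cursor, stop when rate limited."""
    fetched = []
    if state.get("done"):
        return fetched
    for _ in range(max_pages):
        try:
            tweets, next_cursor = fetch_page(state.get("cursor"))
        except RateLimited:
            break  # exit instead of sleeping; the next timer tick resumes here
        fetched.extend(tweets)
        state["cursor"] = next_cursor
        if next_cursor is None:
            state["done"] = True  # reached the end of the timeline
            break
    return fetched


def make_fake_api(pages, quota):
    """Fake paged API allowing `quota` calls per run, for demonstration only."""
    calls = {"n": 0}

    def fetch_page(cursor):
        if calls["n"] >= quota:
            calls["n"] = 0  # pretend the quota has reset by the next run
            raise RateLimited()
        calls["n"] += 1
        index = cursor or 0
        next_cursor = index + 1 if index + 1 < len(pages) else None
        return pages[index], next_cursor

    return fetch_page


api = make_fake_api([["a"], ["b"], ["c"]], quota=2)
state = {}  # stands in for a persisted record of progress
first = run_once(api, state)
second = run_once(api, state)
third = run_once(api, state)
print(first, second, third)
```

Because each run returns right after hitting the limit, billed run time stays short; the cost is that the schedule interval has to be at least as long as the API's reset window.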

> "But I already know who @horse_js is, and it's not [...]!"

Perhaps. The data here is not 100% conclusive. There are some critical assumptions holding up our conclusion and [...] has never confirmed (or denied) our findings.

Perhaps the horse lives to tweet another day...

Ironically this highlights one of the main problems with how machine learning is used.

On a very high level, I think you can sum up machine learning algorithms as finding patterns in enormous heaps of noisy data ("training") then trying to apply the discovered patterns to novel data and using the result to guess the answer to a question you posed ("predicting").

The keyword being guess here. Unlike algorithms not based on learning, there is no guarantee that the answer is correct, because you usually don't know if the training data you supplied was sufficient or if the learned patterns were the ones you need. If you knew, you could just hard code the patterns directly and get rid of the whole learning overhead altogether.

Researchers know and communicate this. However, in the press, "AI" seems to be seen as almost the exact opposite: Not only can those fantasy AI systems answer questions about fuzzy human concepts with the precision of a computer, their answers are even better than the human ones - which is why the things we need to worry about are ethics discussions and humanity becoming obsolete...

This could be funny if it were just restricted to science fiction and public discussion, but it becomes problematic when "AI" systems are used to make life-changing decisions like setting insurance premiums or declaring persons suspicious to law enforcement.


> it becomes problematic when "AI" systems are used to make life-changing decisions like setting insurance premiums or declaring persons suspicious to law enforcement.

Hasn't the latter already happened? I'm without link/source, but I seem to recall reading about there being tests of using a humanoid-looking AI-driven "attendant" at a border somewhere that would judge people based on looks/temperament and try to guess if they were lying about what's in their luggage.


Yes, discriminatory machine learning is commonplace, but often hard to uncover, as it's not often obvious to the users of automated systems how those values are constructed.

Luckily some critical parts like insurance calculation are regulated (in some parts of the world) to require explainable algorithms to prevent this kind of discrimination, so it's not as bleak as it's often made out to be. Of course it's also important that it stays that way.


I don't see that as a major problem. In fact most life-changing decisions are already based on probability. Insurance companies already do risk analysis and whether the algorithm uses ML or basic statistics, there is a threshold level of confidence used.

I'd also argue it's how our brains work. Many times as we come to a decision we are going off of confidence, not true correctness. In the case of declaring suspicious persons, well, by definition they are suspects based on confidence, not by truth. Even in court we determine verdicts based on human probabilistic confidence that comes from the evidence.


I am so tired of Microsoft meddling in the developer space. I wish they just crawled away and made themselves irrelevant. If not for the money and how they shove their products down people's throats nobody would consider even using them. Internet? They had to taint it with IE6. Operating systems? The dreadful Windows 10 malware OS. Then they lure developers with their software, which, if you have money, you can produce a ton of, and just crush everything that is good in IT.

Nice try, Angelina Fabbro

This is marketing. Someone should flag it.

not knowing is half the fun.


lon ingram
