Isn't this just called user testing? Also, this is in the context of a fucking dataset. If data needs to go through DI in case something blows up on Twitter, I guess it's a sad state we're in.
> There is no “testing” going on here, unfortunately. You’re just taking turns placing users in the treatment or control group, arbitrarily, depending on when they visit the page. How can you measure any treatment effect when everyone is part of either group?
This is not a perfect solution, but I have found it is good enough to identify clear winners (if there is actually a winning version). A lot of visitors won't hit your profile multiple times anyway. They visit it once, and they either follow you or they don't - influenced, of course, by the currently displayed version :). They won't keep coming back to your profile for no reason, and if they do, they will also convert better on the version they prefer. So one version might nudge a user into following you while another might not. I would disagree that this is not testing; I think it is for the majority of the profile clicks you receive.
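Once enough impressions accumulate per version, the "clear winner" claim can be checked with a standard two-proportion z-test. This is just an illustrative sketch - the function and all the numbers are made up, not anything Birdy actually does:

```python
import math

def two_proportion_z(follows_a, visits_a, follows_b, visits_b):
    """Two-proportion z-test: do profile versions A and B convert
    visitors into followers at significantly different rates?"""
    p_a = follows_a / visits_a
    p_b = follows_b / visits_b
    pooled = (follows_a + follows_b) / (visits_a + visits_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b))
    z = (p_a - p_b) / se
    # two-sided p-value via the normal CDF, Phi(x) = (1 + erf(x/sqrt(2))) / 2
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# made-up numbers: 1,000 visits per version, 60 vs 40 follows
z, p = two_proportion_z(60, 1000, 40, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.05, p = 0.040
```

Note that repeat visitors who see both versions do violate the test's independence assumption, which is the commenter's original objection - the sketch only quantifies the difference, it doesn't fix that.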
> I guess you could try something like switchback testing, but I’m not convinced visits to the average Twitter profile will yield enough samples.
I don't think that would be possible with the current Twitter API capabilities anyway.
> I think it’s a well-executed idea, but I don’t think it’s fair to sell results under the guise of statistical validity when they don’t appear to have that. (although it’s just Twitter profiles, not eg medical treatment, so no real harm done)
While the results might not be perfectly accurate, I think they are accurate enough to provide value, especially if you let the test run long enough to build a large sample size. I personally use Birdy (obviously :D) and I have noticed much better conversion, which is why I'm confident.
I'm looking forward to seeing new capabilities appear on the Twitter API to always make the process more accurate though.
At least now we know. The sample size of the test was basically the whole Twitter user base, and the result seemed to be quite negative, but I imagine if they are considering it a bad idea then they must have some real data to back it up now. Cool
Whatever ends up in this repo, I hope people realize that depending on what data you put into an algorithm you can get whatever output you want, and Twitter is never going to (and neither can nor should they) publish everyone's personal information and interactions on the site.
So I'm not sure what the ultimate point of this exercise is, other than producing faux-transparency.
Also, the skeptics are really hung up on the semantics of "intelligence" and aren't addressing the model's output, much less where this is going to be in the coming years.
Like, the take-home coding test is probably just dead. Today.
I'm far from a data scientist. I'm just someone who has an interest in building UIs and saw a potential to look at data in a way I hadn't seen before. I'm not actively working on this right now, but if you'd like to drop a line for whatever reason, I still check my Twitter for messages when they come in.
Correct. In internal Twitter jargon, some data is "perspectival" and some, for performance reasons, isn't. Whether you can actually view a tweet is calculated on the fly from your personal perspective, since honoring privacy settings, blocks, etc. is crucial. But that's not true for counts, so those will be off.
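A toy sketch of that distinction - every class and field name here is illustrative, not real Twitter internals. Visibility is recomputed per viewer, while the count is a single cached number that ignores the viewer's perspective, so the two can legitimately disagree:

```python
from dataclasses import dataclass, field

@dataclass
class User:
    id: int
    is_protected: bool = False
    blocked_ids: set = field(default_factory=set)
    follower_ids: set = field(default_factory=set)

@dataclass
class Tweet:
    author: User
    cached_like_count: int = 0  # non-perspectival: one number for everyone

def can_view(tweet: Tweet, viewer: User) -> bool:
    """Perspectival: computed on the fly for each viewer."""
    if viewer.id in tweet.author.blocked_ids:
        return False
    if tweet.author.is_protected and viewer.id not in tweet.author.follower_ids:
        return False
    return True

def like_count(tweet: Tweet) -> int:
    """Cached for performance; may count likes from users whose
    tweets this particular viewer could never see."""
    return tweet.cached_like_count

# A blocked viewer can't see the tweet, but the cached count is unchanged.
author = User(id=1, blocked_ids={2})
tweet = Tweet(author=author, cached_like_count=5)
print(can_view(tweet, User(id=2)), like_count(tweet))  # False 5
```

Recomputing counts per viewer would mean running the visibility check against every liker on every page load, which is exactly the cost the caching avoids.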
People who find this a shocking and objectionable sign of bugs are generally people who have not built software at such large scales.
It's a fair criticism, but the nice thing about analyzing Twitter is that the data is already there. Polling has its own set of issues - how do you randomly and uniformly sample from the pool of programmers? It's probably not something you could do for a quick fun blog post.