There are data sources that are difficult or expensive to obtain but provide a more accurate gauge of web traffic. For example, ad networks can provide this. You can dig deeper and get a representative sample of data from sources such as the OpenDNS Investigate platform.
It's a cool product, but the market fit isn't perfect yet.
You don't have many options if you need high accuracy: you either pay a lot or try to trick Google, which may be both immoral and illegal, is certainly tricky and hard to maintain, and can't be counted on in the long run.
Let's hope alternatives to Google-provided traffic data appear. For now they seem to have monopolized it by offering it for free, at a loss, to discourage competition.
Agreed, buying sampled NetFlow data would technically hit this bar. It's hard to know whether that is troubling without knowing the purpose. Are we using this to analyze citizens' traffic and correlate it with other identifiers? Or are we analyzing traffic patterns to critical infrastructure and/or foreign nations?
It's just one data point though. The real juice comes when you have a hundred thousand traffic logs to compare, then you can start inferring similarities even from vague and incomplete data points.
They say they're port mirroring AD traffic to gather data for analysis. What kind of information is in AD traffic? I know basically nothing about AD.
Presumably at some sort of additional cost, though. So then we’re into the business of weighing up whether to spend money on obtaining raw logs or purchasing the CDN’s own traffic analytics add-on… or just going with a third party. This stuff isn’t just built in.
When you get to the level of traffic where it matters to check this, you are at the point where you cannot actually check it. Just think of the amount of traffic per second, the number of source addresses, and how many people it would take, and for how long, to research a single minute's worth of traffic from last month.
It's a shame if Disqus can't provide statistics about traffic sources versus actions. I'd like to see what the referrer traffic looks like, as well as the country lookup for IP addresses.
It's generally good for the web to know where traffic is coming from. For example, it would be quite frustrating to be linked on, say, Hacker News and not know it, but still see your server sending out thousands and thousands of additional page views an hour.
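If you have your own web server access logs, a rough referrer breakdown is not hard to get yourself. Here is a minimal sketch, assuming logs in the common "combined" format; the sample lines, the `referrer_hosts` name and the `(direct)` label are made up for illustration. Country lookup is left out because it needs a separate GeoIP database.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed "combined" log format, which ends with:
#   "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"$')


def referrer_hosts(lines):
    """Count requests per referring host; missing referrers count as direct."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # skip lines that do not look like combined-format entries
        referer = match.group("referer")
        host = urlsplit(referer).hostname if referer not in ("", "-") else None
        counts[host or "(direct)"] += 1
    return counts


# Hypothetical sample data using documentation IP ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /post HTTP/1.1" 200 2326 "https://news.ycombinator.com/item?id=1" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Oct/2023:13:55:37 +0000] "GET /post HTTP/1.1" 200 2326 "https://news.ycombinator.com/" "Mozilla/5.0"',
    '198.51.100.4 - - [10/Oct/2023:13:55:40 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.0"',
    'not a log line',
]

sample_counts = referrer_hosts(SAMPLE_LOG)
for host, hits in sample_counts.most_common():
    print(host, hits)
```

A sudden jump in one host's count is usually the first sign that you have been linked somewhere.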
That's one of the raw sources that go into the network data. There are also the routing tables (without which you can't always match ASNs to netblocks), other datasets, and a bunch of custom scripts that post-process the data, performing cleanup, heuristic matching and more. Then it all goes into a data format that supports handling 250 million requests a day and serving over 90% of them in less than 10ms.
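To make the ASN-to-netblock step concrete, here is a minimal sketch of a longest-prefix match over routing-table entries. It is not the pipeline described above, just an illustration; the prefixes and ASNs are made-up values from documentation-reserved ranges, and `build_table` and `lookup_asn` are hypothetical names. A linear scan like this would be far too slow for the request volumes mentioned, where a trie or similar structure would be the usual choice.

```python
import ipaddress

# Hypothetical routing-table entries: (announced prefix, origin ASN).
ROUTES = [
    ("192.0.2.0/24", 64500),
    ("198.51.100.0/24", 64501),
    ("198.51.100.128/25", 64502),  # more specific announcement inside the /24
]


def build_table(routes):
    """Parse prefixes and sort them so the most specific ones come first."""
    table = [(ipaddress.ip_network(prefix), asn) for prefix, asn in routes]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table


def lookup_asn(table, address):
    """Return the ASN of the longest matching prefix, or None if unrouted."""
    ip = ipaddress.ip_address(address)
    for network, asn in table:
        if ip.version == network.version and ip in network:
            return asn
    return None


table = build_table(ROUTES)
print(lookup_asn(table, "198.51.100.200"))  # falls inside the more specific /25
print(lookup_asn(table, "198.51.100.5"))    # only the /24 covers this address
print(lookup_asn(table, "203.0.113.1"))     # no covering prefix in the table
```

Without the routing table, the third case is the common one: you have an address and nothing that says who announces it.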