
There are data sources that are difficult or expensive to obtain but give a more accurate gauge of web traffic. Ad networks are one example. You can also dig deeper and get a representative sample of data from sources such as the OpenDNS Investigate platform.

It's a cool product, but the market fit isn't perfect yet.




I think they use DNS data to estimate the amount of traffic.

Don't you want to know the sources of the traffic too?

You don't have many options if you need high accuracy: you either pay a lot or try to trick Google. The latter might be both immoral and against the law, and it is certainly tricky, hard to maintain, and not something you can count on in the long run.

Let's hope there will be alternatives to Google-provided traffic data. For now they seem to have monopolized it by offering it for free, at a loss, to discourage competition.


Agreed, buying sampled NetFlow data would technically hit this bar. It's hard to know whether it is troublesome without knowing the purpose. Are we analyzing citizens' traffic and correlating it with other identifiers? Or are we analyzing traffic patterns to critical infrastructure and/or foreign nations?
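For anyone wondering what "sampled" means in practice: with 1-in-N packet sampling, the usual first-order estimate is to multiply the sampled counters by N. A minimal sketch, assuming flow records reduced to (destination, sampled bytes) pairs; the record layout, the function name and the sampling rate are made up for illustration and are not taken from any particular NetFlow exporter.

```python
from collections import defaultdict


def estimate_totals(flows, sampling_rate):
    """Scale sampled byte counts up to an estimate of the true totals.

    flows: iterable of (destination, sampled_bytes) pairs (hypothetical layout).
    sampling_rate: N for 1-in-N packet sampling.
    """
    totals = defaultdict(int)
    for destination, sampled_bytes in flows:
        # Each sampled packet stands in for roughly N real packets.
        totals[destination] += sampled_bytes * sampling_rate
    return dict(totals)


if __name__ == "__main__":
    # Documentation-range addresses, not real traffic.
    sample = [("192.0.2.10", 1500), ("192.0.2.10", 500), ("198.51.100.7", 40)]
    print(estimate_totals(sample, sampling_rate=1000))
```

The estimate gets noisier for small flows, since a flow that contributes only a handful of sampled packets can be badly over- or under-counted.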

Can you elaborate on your comment on the open traffic data? Is it difficult to acquire?

They own a number of popular browser extensions, which feed them data and provide them with a good "sample" of web traffic.

It can be surprisingly close


How about web traffic?

It's just one data point though. The real juice comes when you have a hundred thousand traffic logs to compare, then you can start inferring similarities even from vague and incomplete data points.

They say they're doing port mirroring of AD traffic to gather data for analysis. What kinds of information are in AD traffic? I know basically nothing about AD.

Presumably at some sort of additional cost, though. So then we're into the business of weighing up whether to spend money on obtaining raw logs, purchase the CDN's own traffic analytics add-on, or just go with a third party. This stuff isn't just built in.

Fantastic. Thanks for sharing. Two questions:

1. How much data did you gather before extrapolating to the numbers in your post?

2. What tooling did you use to estimate traffic?


By the time your traffic is large enough that checking this matters, you can no longer actually check it. Just think of the amount of traffic per second, the number of source addresses, and how many people it would take, and for how long, to research one minute's worth of traffic from last month.

Web traffic maybe?

This is great, what's your traffic like? Where did you get your data from?

It's a shame if Disqus can't provide statistics about traffic sources versus actions. I'd like to see what the referrer traffic looks like, as well as a country lookup for IP addresses.
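If you have your own access logs, the referrer side of this is easy to approximate yourself. A rough sketch, assuming you have already pulled the raw Referer header values out of the logs ("-" or empty meaning no referrer); the function name and the "(direct)" label are my own choices, not anything Disqus exposes. The country lookup is left out because it needs a GeoIP database.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_counts(referers):
    """Count hits per referring host from raw Referer header values."""
    counts = Counter()
    for value in referers:
        # Log formats commonly write "-" when the header is absent.
        host = urlsplit(value).hostname if value and value != "-" else None
        counts[host or "(direct)"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "https://news.ycombinator.com/item?id=1",
        "https://news.ycombinator.com/",
        "-",
        "https://example.com/a",
    ]
    for host, hits in referrer_counts(sample).most_common():
        print(host, hits)
```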

ISPs (almost all of them) and lots of corporate (school, business, university) firewalls run analysers on websites and aggregate that data.

This data can provide a fair bit of powerful insight and intelligence into things like change of trends and growth of businesses.


It's generally good for the web to know where traffic is coming from. For example, it would be quite frustrating to be linked on, say, Hacker News and not know it, but still see your server sending out thousands and thousands of additional page views an hour.

Can you conceive of an alternate way to score traffic on the Internet? What might that be?

That's one of the raw sources that go into the network data. There's also the routing tables (without which you can't always match ASNs to netblocks), other datasets, and a bunch of custom scripts that post-process the data, performing cleanup, heuristic matching and more. And then putting it all into a data format that supports handling 250 million requests a day, and serving over 90% of them in less than 10ms.
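To make the "match ASNs to netblocks" part concrete, here is a minimal sketch of a longest-prefix match using only Python's standard `ipaddress` module. The routing entries are invented (documentation prefixes and private-use ASNs), the names `build_table` and `lookup_asn` are hypothetical, and a linear scan like this is nowhere near what you would use to serve hundreds of millions of requests a day; it only illustrates the matching rule.

```python
import ipaddress

# Hypothetical routing-table excerpt: (prefix, origin ASN).
ROUTES = [
    ("192.0.2.0/24", 64500),
    ("198.51.100.0/24", 64501),
    ("198.51.100.128/25", 64502),
]


def build_table(routes):
    """Parse prefixes and order them most-specific first."""
    table = [(ipaddress.ip_network(prefix), asn) for prefix, asn in routes]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table


def lookup_asn(table, address):
    """Return the ASN of the longest matching prefix, or None if unrouted."""
    ip = ipaddress.ip_address(address)
    for network, asn in table:
        if ip in network:
            return asn
    return None


if __name__ == "__main__":
    table = build_table(ROUTES)
    for addr in ("198.51.100.200", "198.51.100.5", "203.0.113.1"):
        print(addr, lookup_asn(table, addr))
```

A production version would more likely use a radix tree or a precomputed sorted range index, which is presumably where the custom post-processing comes in.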