I really like CDNs because of the ability to drop in a file and know it will be cached correctly. (Also, there is a high probability that your user already has a cached version of the file.) But I never thought about CDNs being able to track you.
Isn't there an alternative? A more transparent way to provide users with source files and still keep the 'cached items' aspect.
I guess browsers could provide a separate cache for CDN domains that's _really_ long-lived (there's a certain guarantee that files won't change there) and also always send requests there with no referer.
The script wouldn't have to be from a CDN to track people using the browser cache. I could infer whether you've visited a site that doesn't use CDNs or trackers by asking you to load something from that site and judging from how long it took whether you already had that resource cached.
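A minimal sketch of that kind of cache-timing probe (the asset URL and the 50 ms threshold are made-up values, just to illustrate the idea):

```typescript
// Illustrative cache-timing probe. The URL and the 50 ms threshold are
// invented for this example, not taken from the thread.
async function probablyCached(url: string, thresholdMs = 50): Promise<boolean> {
  const start = performance.now();
  try {
    // 'no-cors' lets us request a cross-origin asset; we can't read the
    // body, but the request still completes and can be timed.
    await fetch(url, { mode: "no-cors" });
  } catch {
    return false; // network error: no signal either way
  }
  // A response that arrives almost instantly was very likely served from
  // the local browser cache rather than fetched over the network.
  return performance.now() - start < thresholdMs;
}

// Hypothetical usage: guess whether the user has visited example-forum.com
// before by probing one of its static assets.
probablyCached("https://example-forum.com/static/logo.png")
  .then((visited) => console.log("visited before?", visited));
```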
In most cases, the client does not even request bytes from the CDN, which is then not able to track the client.
But then again, CDNs could implement tracking based on this lack of requests (which is kind of ironic, and should become infeasible the more clients use this technique, I think).
Actually the other issues are solved by the "DHT" part of this idea:
no centralized party can track which assets are already in your history.
The only tracking I can think of is by your nearest neighbours' browsers.
If such a neighbour N empties your cache (DNS attack?), it will trigger a full fetch from N.
Then N can attempt to fingerprint that asset query against what other pages list.
But then the whole point of this is to cache assets that are used on most pages!
I love this idea. Let's make the Web decentralized again!
(I couldn't resist)
I think that depends on the user. Some use CDNs to improve performance via already-cached files, others to offload traffic and get the actual CDN performance benefits from geo-balancing.
Personally I don't think a user will have anything cached other than jQuery from Google, which in this case I removed to follow the rule of eating my own dogfood.
Even if the CDN can't (for whatever reason), one could easily include a tracking pixel on every page that is marked as `Cache-Control: no-cache`, or insert a few lines of JS to do the same.
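For illustration, the "few lines of JS" could look roughly like this; the tracker host and `/pixel` endpoint are invented:

```typescript
// Hypothetical page-view beacon. Because we bypass the cache entirely
// (the same effect a `Cache-Control: no-cache` pixel would have), every
// page view reaches the tracking server and can be logged.
function trackPageView(): void {
  const url =
    `https://tracker.example/pixel?page=${encodeURIComponent(location.pathname)}`;
  // cache: "no-store" forces the request onto the network even if a copy
  // of the response happens to be cached locally.
  void fetch(url, { cache: "no-store", mode: "no-cors", keepalive: true });
}

trackPageView();
```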
Dude, a good CDN will set cache headers that make browsers store the items forever, without even re-validating the content (something like the headers sketched below).
That's definitely not very usable for tracking purposes; they will only know about the first visit of a user. Even more so: if sites A and B use the same JS library and version, they will only know about the first one the user visits.
Anyway, it's a bad technical decision not to use a different domain, which ensures clients don't need to send extra cookies in the request headers.
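For concreteness, "store forever without revalidating" usually means headers along these lines; the tiny Node origin below is just a hypothetical stand-in for whatever sits behind the CDN:

```typescript
import { createServer } from "node:http";

createServer((_req, res) => {
  // One year plus `immutable`: the browser keeps the file and never asks
  // again. This is safe when filenames are versioned (e.g. app.3f9c2.js),
  // so the content at a given URL genuinely never changes.
  res.setHeader("Cache-Control", "public, max-age=31536000, immutable");
  res.setHeader("Content-Type", "application/javascript");
  res.end("/* versioned, never-changing asset body */");
}).listen(8080);
```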
I also use an add-on called Decentraleyes. It caches various common scripts from popular CDNs within the add-on itself, so your device doesn't need to make any network requests for them. It was originally meant as a privacy tool, but the caching seems to be at least as valuable.
CDNs aren’t intended as a “canonical store”; content can be invalidated from a CDN’s caches at any time, for any reason (e.g. because the CDN replaced one of their disk nodes), and the CDN expects to be able to re-fetch it from the origin. You need to maintain the canonical store yourself — usually in the form of an object store. (Also, because CDNs try to be nearly-stateless, they don’t tend to be built with an architecture capable of fetching one “primary” copy of your canonical-store data and then mirroring it from there; but rather they usually have each CDN node fetch its own copy directly from your origin. That can be expensive for you, if this data is being computed each time it’s fetched!)
Your own HTTP reverse-proxy caching scheme, meanwhile, can be made durable, such that the cache is guaranteed to only re-fetch at explicit controlled intervals. In that sense, it can be the “canonical store”, replacing an object store — at least for the type of data that “expires.”
This provides a very nice pipeline: you can write “reporting” code in your backend, exposed on a regular HTTP route, that does some very expensive computations and then just streams them out as an HTTP response; and then you can put your HTTP reverse-proxy cache in front of that route. As long as the cache is durable, and the caching headers are set correctly, you’ll only actually have the reporting endpoint on the backend re-requested when the previous report expires; so you’ll never do a “redundant” re-computation. And yet you don’t need to write a single scrap of rate-limiting code in the backend itself to protect that endpoint from being used to DDoS your system. It’s inherently protected by the caching.
You get essentially the same semantics as if the backend itself was a worker running a scheduler that triggered the expensive computation and then pushed the result into an object store, which is then fronted by a CDN; but your backend doesn’t need to know anything about scheduling, or object stores, or any of that. It can be completely stateless, just doing some database/upstream-API queries in response to an HTTP request, building a response, and streaming it. It can be a Lambda, or a single non-framework PHP file, or whatever-you-like.
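A rough sketch of the backend half of that pipeline, with invented route names and cache lifetimes: a stateless handler that does the expensive work on demand and sets the headers a durable reverse-proxy cache in front of it would honor.

```typescript
import { createServer } from "node:http";

// Stand-in for the heavy database / upstream-API aggregation work.
async function buildExpensiveReport(): Promise<string> {
  return JSON.stringify({ generatedAt: new Date().toISOString(), rows: [] });
}

createServer(async (req, res) => {
  if (req.url !== "/reports/daily-summary") {
    res.statusCode = 404;
    res.end();
    return;
  }
  res.writeHead(200, {
    "Content-Type": "application/json",
    // The reverse-proxy cache keys off s-maxage; browsers get a short
    // max-age. The backend itself stays completely stateless.
    "Cache-Control": "public, max-age=60, s-maxage=3600",
  });
  res.end(await buildExpensiveReport());
}).listen(3000);
```

With a durable cache in front, the proxy serves its stored copy for the full hour, so the expensive computation runs at most once per expiry interval regardless of request volume, which is the inherent DDoS protection described above.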
Doesn't that kill the possibility that a visitor will already have common content cached when visiting your site for the first time? I know that's not the only reason to use a CDN, but it's a pretty big one.