Cloudflare and the Wayback Machine, joining forces for a more reliable Web (blog.archive.org)
711 points by jgrahamc | 2020-09-17 | 163 comments




Normally I'd be upset about Cloudflare getting involved in anything good and pure like archive.org, but this relationship, which just suggests new URLs to archive, seems harmless enough.

> seems harmless enough.

The number of things that started out with those words and then went on to become huge problems is non-trivial, I am sure.

But I want to believe...


> I'd be upset about Cloudflare getting involved in anything good and pure

Why?

At least by FAANG standards, Cloudflare stands out to me as one of the good guys.


You fool, you have opened a can of worms; no good will come of the resulting flame war.

https://en.wikipedia.org/wiki/Cloudflare#Controversies


Wow, zero coverage about their attempts to centralize key parts of the Internet under their control and how many people got outraged about that, but plenty on the trivial outrage of CF supplying services to customers they found offensive. I'm kind of tempted to update this section.

Why establish a false dichotomy? One can be just as mad about them abusing their position to attack free speech in a censorious way, as one is about their other efforts to create a monopoly over a large swathe of internet traffic.

I get it, nobody wants to defend badthink, but it's a hell of a lot easier philosophically to defend all political speech regardless of content than it is to try to pick and choose and make some moral case.


Set aside the free speech stuff which involves groups inciting violence - while we can agree that it's bad, deciding what to do will inevitably lead to wildly divergent ideas.

Cloudflare says that it's "free speech" for their customers to claim to be Adobe offering a "Flash Updater", or for their customers to claim to be Bank of America.

That is how evil Cloudflare is - they allow clear and unambiguous crime for pennies.


They do? I'm pretty sure they've always banned phishing pages. Do you have a source on that?

Any sufficiently large and centralized infrastructure (like Cloudflare is becoming) has an incentive to engage in user-hostile behaviors.

So far, I personally think most people would say they've been mostly benevolent.

But they are a public company, and a new CEO can have different priorities.

For example, (fairly or not) the public perception of Google has slowly been diminishing over time.


>So far, I personally think most people would say they've been mostly benevolent.

They're denying access to no-JS users to many sites.


The owner of the site is denying access, Cloudflare just provides a service. The owner can turn the feature off.

No, I can't even get through to the site itself past Cloudflare's gate...

The gate is configured by the website owner. Not to defend the frustrating experience - but I personally prefer my website to not be attacked by bots.

I'm just a happy Cloudflare customer.


Cloudflare being the enabler of a user-hostile technology doesn’t make them benevolent, no matter how happy the customers.

We can sit here and argue all day around this and we still won't reach an agreement.

On one hand, as a website operator, I want to prevent DDoS, spam, etc. for my website. I can implement these solutions myself and do a bad job, or I can use Cloudflare, which solves most of them. It's probably going to rule out some users such as yourself - which is a shame. But until there's a better way to know that a visitor from the internet is not trying to attack the website, I'd have to use something like Cloudflare.

On the other hand, it's not like it's that hard to leave Cloudflare for me - so if there's a better alternative without causing legitimate users pain, I'd be happy to jump on board.


I don't think anyone is arguing that preventing DDoS attacks is undesirable.

It's about cutting off access to a small segment of users just because it is easier that way.

I think that, similar to wheelchair access, we will continue to push for access to all devices and users as much as possible.

This attitude of "it's just 1%" or "it's just 0.1%" will become just as unacceptable as saying "well, there are only 3 people who need access ramps out of 30, so they need to suck it up and deal with it."

It's up to you which side you want to be on.


Disabling JS is a conscious decision. Using a wheelchair is not — you cannot just flip a switch and walk once your spine is fucked.

> Any sufficiently large and centralized infrastructure

archive.org is exactly that. With rather poor performance no less.


Agreed. Just a matter of time, personnel, and offering price

Any reality that becomes more adjacent in possibility-space today is better conceived as the seed of a future you might find yourself living in. Most any protective institution is just a set of measures working against this process -- moving things away from us in possibility-space.

So worrying about this stuff is fair game. The only way I'd not worry, is if there was a complementary protective measure that protected us from the future we don't want. But that wasn't part of the announcement, so we're probably all a little bit less secure in getting the world we want. (EDIT: Though maybe funding archive.org would count as that...?)


> (EDIT: Though maybe funding archive.org would count as that...?)

That reminds me a bit of how Google "funds" Mozilla. Not the same, but when dealing with the devil...


I've always thought that people think about this backwards. Every dollar the devil gives you, is one dollar less the devil has to spend on devilry. Especially when that dollar is just charity, and the devil isn't getting anything out of it. But it also applies when the devil is getting something out of it that satisfies their preferences, as long as that satisfaction isn't displacing a need that frees up any of their other dollars to now be dedicated toward devilry.

Or, to put that another way: if you can charge [infamous politician] a million dollars for a fancy-but-useless painting, you absolutely should do so. Now you have a million dollars; they're out a million dollars; and all they have is a painting!


The issue comes when the politician comes back to you and says “Hey, I gave you a great deal on that painting, can you do a favor for me?”

Extracting money from bad sources is good as long as you absolutely positively don’t extract anything else. That’s hard to do in any circumstance. However, in this situation I think it’s fine and worth it.


Exactly right. And I agree pretty strongly. Unironically, this is why I think accepting money from the Saudi Arabian Sovereign Wealth Fund can be a great force for good. Adam Neumann may (again unironically) be a remarkable hero, accidentally.

This only works if you accept money from every devil that passes you by. If the majority of your funding comes from one devil, it doesn't matter how perfectly normal the underlying business transaction is - the moment you get in the way of their devilry, you're out a job.

Mozilla is a good case study in this: they are financially dependent on Google money to continue browser development. Google hasn't actually intervened in their affairs a whole lot. However, they could, which is why they're going through all sorts of self-inflicted harm trying to get away from their business of selling a browser default to a search engine.

Public companies are even worse, because what they are looking for isn't money, it's more money, or "growth". This is why a lot of American media companies suddenly got really quiet about certain kinds of atrocities committed by certain governments. If you call the devil out on concentration camps full of Uighurs, then maybe he doesn't buy your paintings anymore, and then you're out of the painting biz.


You’re talking about being employed by a devil, or maybe receiving continuing patronage from a devil. I’m talking more about having the one-off opportunity to drain a devil’s coffers (whether or not you get the resulting money), without having the ability to turn that into an ongoing relationship.

Basically, this is the other side of the coin to the idea that iterating the Prisoner’s Dilemma gets you the potential for tit-for-tat, and thereby cooperation under expectation of tit-for-tat. In this case, “defecting” against a devil is good — but, just like in the traditional Prisoner’s Dilemma, it’s only practical to defect if the scenario is one-shot.


> But they are a public company, and a new CEO can have different priorities.

And CF have access to decrypted SSL for many sites. Stuff like passwords, personal details, keys and tokens.


Just a reminder: Cloudflare with its standard settings is breaking the internet for second- and third-world countries with its captchas on websites. This is discrimination in my opinion. As long as you are only in a first-world country you will never notice.

Minor correction: As long as you are only in a first world country [and you don't use a VPN service or Tor] you will never notice.

I point this out because I think it's an anti-privacy "feature" and I wish CF would stop.


They could just block these IPs outright. A massive percentage of attack traffic runs over these networks. Before, site owners would often have to block Tor completely after dealing with enough spam/attacks from exit nodes, and now they can allow actual humans without any sort of complex setup.

I think it's a net win for everyone involved, personally.


Isn't this because of carrier-grade NAT, where one bad actor (either maliciously or via running something like Hola VPN) can kill a few hundred users' IP reputation?

I do almost all of my internet access from a 2nd/3rd world country and hadn’t noticed?

Depends on the country/ISP probably. From the Philippines, Firefox with uBlockOrigin and PrivacyBadger hit captcha walls all the time; all that stopped as soon as I moved to Singapore.

What else do you expect them to do? Also allow all the bots that use IPs from 2nd and 3rd world countries?

'First world country' means 'aligned with Western democracies in the Cold War'; 'second world country' means 'aligned with the USSR in the Cold War'; 'third world country' means 'unaligned in the Cold War'.

Are you talking about countries with under-developed internet infrastructure? That includes swaths of the US....


No, GP means exactly the old classic definition of "second/third world" countries.

While Cloudflare actions have been mostly positive, the sheer size of their network is a concern for decentralization.

We can’t trust that they’ll always make good decisions.

I’m glad that the Internet Archive is independent and hope it always remains that way.


They allow and support known spammers and scammers. They make reporting of spammers and scammers arduous. They play dumb when asked about their barriers and repeat inane responses rather than answering questions.

In other words, they have a history of clearly, unambiguously showing themselves to be unapologetic assholes.


Wow, this is awesome to see. I hope this doesn't put a lot of load on the IA, though...

I would assume that when a site goes offline Cloudflare fetches a snapshot from IA only once and then serves this copy to all further visitors, unless I'm missing something?

Here's a more detailed description of the service from Cloudflare support pages: https://support.cloudflare.com/hc/en-us/articles/200168436


Is this basically Archive.org becoming a customer of Cloudflare CDN to reduce load off their servers?

It's Archive.org being provided URL telemetry by a Cloudflare product, for archiving public sites they have not yet found through traditional means (crawling, or users submitting requests through the Wayback Save page).

The next step would be for Cloudflare to point to Archive.org Wayback links when an origin isn't available (similar to browser extensions that point to Archive.org when sites 404 or are down, but in Cloudflare's core).

Cool stuff. Thanks Cloudflare folks.


I really doubt their customers would want that. Usually when a page is 404 it's because the company in question wants to forget about it :)

You would return the archived page for a 5xx error, not a 4xx error.
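A minimal sketch of that 5xx-vs-4xx distinction (note: `archiveFallbackUrl` is a hypothetical helper for illustration, not a real Cloudflare API; the Wayback `/web/` URL pattern, which redirects to the latest snapshot, is assumed):

```javascript
// Given an origin response status and the requested URL, return a
// Wayback Machine URL to fall back to for server errors, or null.
function archiveFallbackUrl(status, url) {
  // 5xx means the origin itself is failing, so an archived copy helps;
  // 4xx (e.g. 404) usually reflects the site owner's intent, so pass it through.
  if (status >= 500 && status <= 599) {
    return "https://web.archive.org/web/" + url;
  }
  return null;
}

console.log(archiveFallbackUrl(503, "https://example.com/page")); // a web.archive.org URL
console.log(archiveFallbackUrl(404, "https://example.com/page")); // null
```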

Ah I see. But this is precisely a use case for Cloudflare's own caching service.

It wouldn't be fair to use archive.org's community-sponsored resources for propping up businesses which are too cheap to pay for proper IT :)


While it's not explicitly mentioned, I think Cloudflare is providing financial support to the Internet Archive.

One would hope so. Considering the timing relative to the IA's potentially very expensive legal battle, I fully expect this to be the case. Still, considering CF's anti-privacy/anti-Tor stance this is a deal with the devil. Guess I should give money directly to the IA. Considering how much value they provide, I'll do this immediately after updating this post.

In what ways is CF anti-privacy?

This is actually a really good symbiotic relationship that should foster the archival of a ton more content. Hoping to see this toggle enabled by default at some point.

I don't like the idea of "we're tacking this onto an existing service lots of people have enabled". CF bit me recently by suddenly taking away proxied dns wildcards from free zones, as it's now a premium feature (breaking the security promise in the process by changing the wild card entry to non-proxied). I don't like surprises and opt-out changes in critical infrastructure.

It's one thing to use CF's Always-On service - you're a customer, you know you can remove your data from it. It's another to get the Internet Archive involved, who may or may not remove your data, and may or may not honor robots.txt.


Sending the details seems to be tied to clicking the 'Update' button in the Cloudflare UI, which documents that by clicking it you agree. So they might not be sending your PII to a 3rd party until they get your permission. Hopefully any automated updates are not violating customers' wishes. Yes, it is annoying that the features have been tied together for people who choose to have as little interaction with the IA as possible.

Great, until your admin panel is archived...

Cloudflare just gives more discovery, it doesn't give IA access to anything that was previously more secure...

"COVID tests: great, until you find a positive result"


> As new URLs are added to sites that use that service they are submitted for archiving to the Wayback Machine.

Yes, this would prevent most order-confirmation pages or otherwise private must-be-logged-in pages from being archived, but it will expose presumed-private URLs that are thought to be unique (tracking numbers, files uploaded with unique names, unique/private image urls that are otherwise publicly accessible)

If you've made efforts in your systems to prevent enumeration attacks, this could partly bypass them.


What I'm hearing is "don't rely on security by obscurity", which I wholeheartedly agree with.

Was worried about CF getting their claws dug into archive.org, but on reading, this is a decidedly non-evil deal, actually it sounds wonderful. Still, I worry if there might be some unseen long term interest in the archive.

Never forget Dejanews


Keep in mind how Cloudflare makes most of their money: They sell a web proxy service with security and performance features including a CDN. Cloudflare's interests are furthered by improving that service in ways that help its customers. Keeping the Web Archive healthily stocked with content is aligned with their long term revenue growth.

T+10 years, I very much expect Cloudflare's core business to have expanded significantly. I remember that time my Googler friend told me they were about to release that one thing they'd absolutely never do; Chrome came out a few weeks later, and now look at Firefox.

You need to pay attention to the silent positioning of these companies to even guess at where they might go, so deals with things like archive.org may have some unseen substance to them that might only become obvious much later


As a business they absolutely are not going to stay in the CDN lane as a primary.

Akamai has $3b in sales and an $18b market cap.

Cloudflare has $348m in sales and a $10.8b market cap.

Akamai is their maximum ceiling if they focus primarily on the CDN segment. Cloudflare is rapidly approaching their valuation ceiling if they stick to CDN as their core (and they'd have to start killing Akamai just to get there; the CDN business is increasingly a slower growth segment in the larger cloud industry).

Companies all around them in the cloud are growing faster, yet few are more important than Cloudflare. Zero question Cloudflare will continue to aggressively branch out, leveraging their critical positioning. In the not-so-distant future CDN will not be the center of their business. CDN is and will remain a springboard for them, a gateway drug, milk at the back of the grocery store.


Great comment. Cloudflare is not a CDN. They are an edge computing platform that happens to offer CDN services. Could Akamai grow into that market faster than Cloudflare can consume it? TBD.

Edge computing is super interesting, and today's CDN providers should be able to provide it given their current infrastructure deployment. It could really bring in the next era of computing and technology once certain networks/providers reach critical mass to provide edge services within 5-10ms to customers.

If jgrahamc is reading this, I'd really like to know if Cloudflare wants to work with telcos.

Imagine a small server in every cell tower, with locally-cached maps/Wikipedia/latest movies.

Some communication couldn't be cached (e.g. real-time video calls), but a lot of broadcast media could be. Of course there are copyright implications, and it might require partnering with Netflix or others.

The quick load times would be great for users, and the reduced load on the backbone would be good for the telecom companies.

If you'd like me to chat to some friends in telcos in New Zealand about this, drop me an email. It's not my job now (I'm in IoT) but I know who to talk to if you'd like to get this kind of thing moving.


AFAIK Netflix (and YouTube and others) already do this edge caching. They partner with telcos, as you said.


Kiwix does this for Wikipedia, other Wikimedia projects and other free content projects, through "Kiwix hotspot" (based on kiwix-serve). https://www.kiwix.org/en/downloads/kiwix-hotspot/

Akamai is not their ceiling because Akamai doesn't serve all segments of the market.

I'm fairly critical of Cloudflare for a lot of reasons, but one thing I think they did right was focus on the SMB market with plans that were actually affordable to the average business. They targeted customers that companies like Akamai pretended didn't exist. Even now they have the cheapest plan available, and once they consolidate the market even further they can start raising those prices.


Akamai is their ceiling in CDN because they have a much higher value segment of the business, representing a drastically larger share of all dollars in the CDN space. Their business is nine times the size of Cloudflare's because their customers are far more lucrative.

If Cloudflare holds onto all of their already considerable number of customers, and then kills Akamai and somehow takes all of Akamai's business, the combination will be a mere 10% larger than Akamai already is now. There is your general indie ceiling in action, with all segments combined (and Cloudflare isn't going to monopolize the entire CDN business besides).

All you need to know to spot the independent CDN ceiling is that Cloudflare + Fastly + Akamai = $3.6 billion in sales (with the understanding that it's a slowly increasing ceiling, as the CDN market is still growing). The ceiling in that space for Cloudflare just can't realistically be much larger than that combined group and that's not much larger than where Akamai is already at. The only way this isn't the case, is if you project Cloudflare knocks off most competitors and takes the market (they can't, Amazon, Microsoft, Google among other giants, are standing in the way of that outcome).

It'll take Cloudflare a small lifetime to get to $3 billion in sales in the CDN space at the rate they're growing (they're adding ~$8m-$10m per quarter in growth (all of which obviously isn't CDN), so maybe it'll only take a few decades with some compounding). It took Akamai 22 years to get there with very high value customers and a pretty nice open field for many of those years.

Akamai in absolute dollar terms is growing faster than Cloudflare + Fastly combined. The CDN ceiling is actually running away from Cloudflare at present. That shouldn't be happening.

Cloudflare knows full well CDN isn't their brightest business future. It's why so much of their expansion effort is going into everything else. Given the way they price-structured their CDN from day one, Cloudflare has always known CDN was a lure and the upside was in sprawling outward from it. Come for the CDN, stay for the workers or whatever preferably higher margin thing we can sell you on. It's also why they're not interested in / worried about trying to make money on domain registrations, as with SSL before that. They'll happily murder the margins in foreign services all day long (areas where they don't compete, but there is margin to wipe out cost effectively, and with customers to lure in), so long as they can occasionally launch a new service where they have a distinct advantage and can convert their base to use it and increase total revenue per customer in the process.

What would be a better path: if Cloudflare could own a big part of Akamai's CDN business by trying to aggressively climb up the ladder from an unassailable price-value position Akamai doesn't want to go down to, like an ARM eating an Intel from the feet upward; or just leave the snoring giant alone to keep snoozing in his enterprise tower while Cloudflare busies itself sprawling out in many directions, leveraging the volume of customers that Akamai doesn't want to (and/or can't) go after because they're not viewed as lucrative enough? I think what Cloudflare can find outside of the CDN business is likely to be more valuable than what's inside the CDN business, very long-term speaking.

And if you're Akamai and you let Cloudflare get far enough along with that sprawling (likely already too late), how about if they drop your CDN legs out from under you. Cloudflare builds out many other legs to stand on, so they flip the switch on the margin and kill the CDN market for the independents, as they were willing to do with domains and SSL. Free CDN, all tiers, all features. They can't do that today, they might be able to do it tomorrow. The CDN market becomes the SSL market, and as a totally free lure it accelerates a rush into Cloudflare's other more exclusive services (including for larger, lucrative enterprise customers). Surely this switch has been pondered inside of Cloudflare, road-mapped as a potential.


> As a business they absolutely are not going to stay in the CDN lane as a primary.

Yeah, and all five big cloud vendors (AWS, Azure, GCP, IBM, Oracle) have their own CDN solutions bundled. Hard to make a case to purchase separate CDN solutions.


I'm not sure about all the providers but Amazon's CloudFront CDN product has additional costs, so it's "bundled" but not in the sense that it's free, only that it's integrated.

And one of Cloudflare's selling points imo is the multi-cloud customers. Use AWS all the way but Cloudflare as your CDN, and you could switch to GCP seamlessly. Or route traffic based on pricing, etc. I think you're right that they will/have absolutely branch out from CDN, but I think their CDN product is actually compelling, especially to bigger companies that are more afraid of Amazon than they are of Cloudflare.

(Other interesting point - it's worth noting that IBM's CDN is essentially white labeled Cloudflare).


They control SSL decryption for a massive number of websites. Governments will gladly fund Cloudflare for eternity.

Akamai as well then?

Much of the US Government already uses it, so yes.

If your root CA is subject to the laws of a government that can seize the root certificates and MITM the connection, that's not much better. Cloudflare just makes it easier.

Certificate Transparency makes this significantly harder to do stealthily. I’m not convinced that Cloudflare is a deep state operation either, but Cloudflare's ability to secretly MITM is a position afforded to a select few, and certainly not every CA.

It's much easier (and virtually undetectable) to MITM when you are also the reverse proxy though.

More like a web blocker "service". It is profoundly unhelpful to me that a proxy service cares if I have Javascript disabled in my browser.

If a site requires JavaScript, that's because the website has manually enabled a feature. Cloudflare does not require JavaScript out of the box.

Please clarify. I thought all those captcha puzzles were coming from Cloudflare. Are you saying they are only enabled if the destination page has JS?

I believe GP is referring to a setting that a cloudflare user has to flip for requiring visitors to enable JavaScript

When users are used to this (getting redirected to an archived copy when the site is down/not available) and when this trial balloon has proven to work, Cloudflare will replace archive.org with their own infrastructure. This is the common game plan.

Doesn't CF already have an "Always Online" feature using their own infrastructure? So this seems like the opposite happening.


Uh, no. We're literally doing the opposite. We used to have our own caching infrastructure for "Always Online" and we're getting rid of it and using archive.org instead.

How do you handle robots.txt? The previous incarnation of Always Online didn't care about robots.txt, while archive.org does.

https://blog.cloudflare.com/cloudflares-always-online-and-th...

We tell archive.org about the URI, they crawl it. They handle robots.txt.


archive.org doesn't handle robots.txt in any meaningful way (see my comment above at https://news.ycombinator.com/item?id=24516875 ). If that's changed recently, I'd like to know more.

Note that archive.org stopped respecting robots.txt in 2017. [1]

In my experience, the site owner must email archive.org support to be excluded from its crawler and archiving.

[1]: https://boingboing.net/2017/04/22/internet-archive-to-ignore...


"We're literally doing the opposite."

How does what you do now contradict what you will do in the future? What legal assurances are there that you won't do that down the road? (See the Facebook/Oculus "no Facebook account" promise)


Wait... so you think Cloudflare's master plan is to roll this new thing out to get people to accept it as normal, and then suddenly make a big shift to.... what they currently have?

Why don't they skip this step and just keep what they have now, then? No one seems to be up in arms that they currently provide their customers offline caching...


And thank god for it. Trying to explain to end users why their site was not, in fact, always online, on account of the creaking behemoth plodding along in IAD that barely managed to successfully cache and serve anything, was never any fun.

The original Always Online infra was long unloved and probably kept on life support far too long for lack of want to deprecate an early feature.


Thanks, so maybe this page is outdated where it mentions your own crawler with user-agent? Or does the Internet Archive use it for these crawls? https://www.cloudflare.com/always-online/

Not long ago, CF was blocking access from Tor. And they sometimes block access from my web crawler. I don't like CF, as they act as a police or gatekeeper to the origin website, deciding who to penalize and who not to, while pretending to be speeding up websites and protecting from 'threats'.

> while pretending to be speeding up websites and protecting from 'threats'.

They do though. That's why people pay them lots of money to do those two things. Not sure what part you think is "pretending"?


One of the first 100 people to use Cloudflare when it launched.

Paying them today to speed up a couple of websites while protecting them.

They rock at making big things possible for very small companies.


Hey, me too! Do you have the first-users t-shirt?

They’re acting more as a security guard. Which is to say that they’re intentionally employed by the owner of the property you’re trying to enter. Often specifically to “bounce” users like you, malicious or not. Believe it or not there are legitimate reasons for wanting only real human users on your website!

What worries me is that Cloudflare is deanonymizing a huge load of Tor users, and the issue that comes with it is that a huge portion of Tor users actually need access to the web archive due to country-wide DNS censorship (European countries included).

As Cloudflare is deanonymizing Tor users on pretty much every website that's hosted on it, I fear they are abusing that power once again to deanonymize users of the web archive.

Cloudflare always claims it's not their issue and that it's a webmaster setting with the shitty captchas and Google's infamous Prism-sponsored PREFS cookie - but to be honest they should just not have implemented it in the first place if privacy was a core value of their company.

The "DDoS" protection basically fingerprints a machine and user inside an encrypted HTTPS connection; which makes the encryption tunnel itself obsolete.


I don't know really. Cloudflare is notoriously in conflict with different archive sites and now this announcement makes that sound not too credible.

I think we will see selective removal of certain content.


> Was worried about CF getting their claws dug into archive.org

SAME. From the title, I assumed the Wayback Machine would be using Cloudflare. Nice prank, boys.


Good News.

I also recommend using the Internet Archive add-on in the browser. Clicking it archives the current page. That way, you can archive pages you visit.


Or use this bookmarklet:

  javascript:window.location="https://web.archive.org/save/"+location.href
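That bookmarklet just builds a Wayback Machine save URL. A hedged sketch of the two common URL patterns (assuming the usual behavior: `/save/` snapshots a page, while the `/web/` form redirects to the latest existing snapshot):

```javascript
// Build Wayback Machine URLs for a given page.
function waybackSaveUrl(pageUrl) {
  // Requests a fresh snapshot of the page.
  return "https://web.archive.org/save/" + pageUrl;
}
function waybackLatestUrl(pageUrl) {
  // Redirects to the most recent archived copy, if one exists.
  return "https://web.archive.org/web/" + pageUrl;
}

console.log(waybackSaveUrl("https://example.com"));
// → https://web.archive.org/save/https://example.com
console.log(waybackLatestUrl("https://example.com"));
// → https://web.archive.org/web/https://example.com
```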

I use https://web.archive.org/web/submit?url=%s and set the keyword to "rez". That way I can type "rez example.com" and it will send me to the archived version.

Why "rez"?

Short for "resurrect" maybe?

Yup!

Another nice feature: when you hit a page that is a 404, it will automatically try to load it from the Wayback Machine if available.

This should be made very clear to Cloudflare users, ideally a warning next to the Always Online checkbox.

"Always Online" now can mean "Archive Forever" - even when a site is pre-launch.


From the blog post, an image of the checkbox: https://lh6.googleusercontent.com/J42AtNZv8xNcyQPPefVywiAGEh...

It always has. If your site is publicly available and you don't disallow bots through robots.txt, they can crawl it at any time. Even if the site is "pre-launch", because that doesn't mean anything on its own.

And of course, remember that robots.txt is only a signal to benevolent bots which respect it. If you have secrets to keep, don't put them online in the first place.
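For reference, the blanket opt-out signal being discussed is a minimal robots.txt like this (purely advisory, as noted above; only well-behaved crawlers honor it):

```text
# Asks all crawlers to stay away from the entire site.
User-agent: *
Disallow: /
```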

Or properly authenticate (and audit) access.

Looking at my Splunk logs and then asking a lot of questions, I have learned that there are a LOT of not so benevolent bots that must be tolerated anyway.

Benevolence is a continuum.


In fact robots.txt is a list of things a nefarious crawler will absolutely want to examine - no need to know those paths when they're all laid out for you!

I'd just add that, while major players like the Internet Archive do respect robots.txt, it's essentially just a flag that depends on people voluntarily respecting it. If a site is publicly available but you don't want people to find it, you're just depending on security through obscurity.

The Internet Archive stopped respecting robots.txt in 2017. See https://boingboing.net/2017/04/22/internet-archive-to-ignore...

The Wayback Machine has completely ignored robots.txt for a few years now. They did it on purpose.

I have serious issues with this and the fact that site owners have to email a human support team in archive.org to be excluded.

Yeah, I definitely expect this to bite some people, if I'm understanding correctly. A plausible scenario (among many) would be: soft launch a site, show it to some early stakeholders, have Wayback archive everything via Always Online, fix embarrassing screwups or oversharing in soft-launched version, publicize site more broadly, everyone in the world can rewind to version zero, regrets. I don't think the existing warnings really make clear that a soft launch is now a forever launch.

The solution to this is... robots.txt. Otherwise your site might turn up in Google etc. Since it's archive.org doing the crawling, and they respect robots.txt, it won't get archived.

Archive.org does not respect robots.txt IIRC. I’ve run into this problem before with them. Ironically, I ended up blocking Internet Archive’s ASN using Cloudflare.

EDIT: Internet Archive started ignoring robots.txt in 2017: https://www.digitaltrends.com/computing/internet-archive-rob...


They only started ignoring robots.txt on US government websites (as that article also says)

That is not what the article says.

It says Internet Archive had already started ignoring robots.txt on US government websites.

Now (since 2017) they ignore it on all websites.


I think that's fine. The reason we fix screwups is so the next people who arrive don't see them. We don't fix screwups to hide that sometimes we fuck up. If someone goes out of their way to find old screwups, then so be it. As long as not the majority of people see it, we're mostly fine.

Most people password-protect this; it's very common. If you contract a webdev for something, they will recommend it 100% of the time. Not necessarily the basic-auth thing, just a shared secret. Something trivial.
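Basic auth is one concrete way to do this (a simpler shared secret works too). A minimal nginx sketch, with hypothetical hostname and file paths:

```nginx
# Gate a pre-launch site behind HTTP basic auth so crawlers and
# archivers never see the content. staging.example.com and the
# htpasswd path are placeholders.
server {
    listen 80;
    server_name staging.example.com;

    auth_basic "Pre-launch";
    auth_basic_user_file /etc/nginx/.htpasswd;  # create with `htpasswd -c`

    root /var/www/staging;
}
```

Unlike robots.txt, this actually withholds the content instead of asking bots nicely to look away.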

Side rant: Sure would be nice if the Wayback Machine showed actual snapshots of web pages, instead of "hybrid" snapshots where they combine old with new (maybe it's a setting?). I recently horked a website, and thought to check the Wayback Machine. Curiously, an edit I made that day was showing on snapshots dating back several years. Until I discovered how the WBM worked, I was pulling my hair out.

This is probably due to XHRs. The IA loads all the JS, so if a website hard-codes URLs or does other complicated XHR work, the archive might not be equipped to save those responses, if it saves them at all.

If only we could get the NSA to publish their archive of public data.

There is hope. With the help of then-Senator Al Gore, the CIA made available to researchers photographs it had taken of the polar regions while searching for Soviet nuclear installations. They later became valuable for climate research.

Has Cloudflare clarified their stance on which content they allow and disallow? They’ve wavered in the past and given how their service is basically critical infrastructure for the internet, I really want them to commit to free speech, avoid deplatforming, and avoid exceeding legal minimums.

Next step, have CloudFlare start mirroring IA on their own servers so we have some redundancy in case IA ever goes bankrupt.

Ideally it would be a non-profit that does it, but as a last resort CF is one of the few companies I'd trust to do it right and do it transparently.


Unlike the Wayback Machine, Archive.is, for example, does not censor and remove certain pages over "hate speech" and other more or less political motives.

A shame, really, but I guess compromises are necessary to stay in "business", even though an internet archive should be exactly that: updated but never removed.


Just how big are Internet Archive’s servers? I can’t fathom how they’re able to store so much of the web in so many versions.

The Wayback Machine uses 9.6 petabytes. Total storage is 50 petabytes.

https://archive.org/web/petabox.php


Cloudflare is really neat unless you find yourself mysteriously blacklisted by them as a user.

Then suddenly the web is a much smaller place.


You can use archive.org to bypass the Cloudflare blocklist, especially considering the save page feature.

Perhaps they can start archiving YouTube.

Whatever fills web.archive.org is good!

Though it would be nice if someone invented technology that could prune archived 404 pages and redirects once the original page goes offline. Maybe a job for AI?


No.

Keep historical revisionism out of archive.org.


Strange that the blog doesn't have https redirection.

Encouraging to see this kind of partnership in this day and age. May it never forget why it started and only improve on it.

Interesting to find that, when I checked to see if I was using the feature, I had already agreed to the supplemental terms saying my information will be shared with IA.

(For others who need to opt out, https://support.cloudflare.com/hc/en-us/articles/200168436-U... describes how to disable "Always Online". There doesn't seem to be a way to turn off just the information sharing.)


Who owns Cloudflare? And what are they valued at?

I mean, these deals make them look cool and altruistic, but what happens when BigCompany offers them enough money to sell?


Perhaps we are getting a sugared pill? Perhaps CF are genuinely being useful here, but in order to gain trust to act nefariously in future?

I don't feel comfortable with their ability to switch off parts of the internet, nor in this case, that they have their hands near what is preserved for posterity.

As they say: "Cloudflare has become core infrastructure for the Web, and we are glad we can be helpful in making a more reliable web for everyone." They are indeed powerful.

I'm concerned that they are becoming gatekeepers to information, under the guise of providing a better internet service. They are able to operate at a level deeper than the odious restrictions youtube, facebook et al enforce on free speech.


I'm being downvoted - but we have seen major 'book burnings' on youtube, etc where billions of comments and videos have been purged. These are private platforms and can do what they like, so in a way that's acceptable as it is within their terms of service.

CF is a level deeper than that. This is a company that can effectively shut down the internet for companies and individuals. And now they are involved with archive.org? Should we be concerned about online historical revisionism as that relationship matures?

I feel uncomfortable that CF seems to be positioning itself as a guardian to all information - not at an application level, but at an architectural level.

Cloudflare is shaping up to be a key tool that an authoritarian government requires. And I'm concerned about it!


Could someone please automatically activate this for all content linked from HN? It happens all the time that many of the front-page links are down due to traffic spikes.

From what I can tell, all links submitted are automatically archived.

Cloudflare is not vpn friendly.

I'm a privacy-conscious VPN user, and in my daily browsing I have to deal with Cloudflare captchas dozens of times a day, or in some cases with Cloudflare blocking me entirely.


Is using this Chrome/Firefox addon an option for your use case?

https://support.cloudflare.com/hc/en-us/articles/11500199265...


My heart shuddered when I read the headline. I can’t be alone in the fear.

This clearly isn't to create some utopic 'more reliable Web'. In fact, Cloudflare severely undermines that, by pushing their centralised view of what the internet should be.

I was hopeful, but after reading this:

> “The Internet Archive’s Wayback Machine has an impressive infrastructure that can archive the web at scale,” said Matthew Prince, co-founder and CEO of Cloudflare. “By working together, we can take another step toward making the Internet more resilient by stopping server issues for our customers and in turn from interrupting businesses and users online.”

It's plain to see that this is a money-making venture for Cloudflare. While I do like the added functionality, I personally can't see how this 'improves' the Wayback Machine. It's just going to place more load on it.


Legal | privacy