
I'd tried this in the past, but my machine slowed to a crawl. I guess it had to do with the algorithm used for handling the list of hosts (this sounds like a job for a Bloom filter).

I've just tried the list you provided and it seems to be ok. Will try it for a while to see how I get on.




It'd be interesting to dive into parts of, e.g., the Linux source to see how it tests domains against the hosts file. It probably isn't doing anything as clever as a Bloom filter, but who knows...

A quick dive brought me to glibc: everything seems to be done in `resolv/gethnamaddr.c`. Look for mentions of `_PATH_HOSTS`, which is defined as "/etc/hosts". The parsing seems to be done by the function `gethtent`. However, it returns a single `hostent`...

I'll try to understand it tonight.


It's a brute-force linear search: _gethtent() returns one line from the hosts file at a time, and the loop itself is in _gethtbyname2(). It opens and parses the hosts file on every single lookup. If I were asked to improve this, I'd first open and parse the file only once, reparsing when the file has been updated, and maybe use an on-disk cache file. The second change would be to use a hash table. I don't think anything as sophisticated as a Bloom filter is necessary unless you have truly huge hosts files.

Since gethnamaddr.c appears to be BSD-licensed I'm willing to bet that 99% of all OSs out there (including Windows) are going to have similar if not identical code.
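
To make the two suggested changes concrete (parse once, then look names up in a hash table), here is a minimal sketch in Python rather than C. It is not how glibc does it: `parse_hosts` and `HostsCache` are hypothetical names, and using the file's modification time plus size as the "has it changed?" check is an assumption on my part.

```python
import os


def parse_hosts(path):
    """Parse a hosts-format file into a dict mapping hostname -> address."""
    table = {}
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            # Drop trailing comments, then split into address + names.
            fields = line.split("#", 1)[0].split()
            if len(fields) < 2:
                continue
            addr, names = fields[0], fields[1:]
            for name in names:
                # First entry wins, like a top-to-bottom linear scan would.
                table.setdefault(name.lower(), addr)
    return table


class HostsCache:
    """Hypothetical cache: reparse only when the file appears to have changed."""

    def __init__(self, path="/etc/hosts"):
        self.path = path
        self._stamp = None
        self._table = {}

    def lookup(self, name):
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp != self._stamp:
            # File is new or modified: rebuild the hash table once.
            self._table = parse_hosts(self.path)
            self._stamp = stamp
        # Average O(1) dict lookup instead of scanning every line.
        return self._table.get(name.lower())
```

With this, each lookup costs one `stat` call and one dict access; the full parse happens only on the first lookup and after the file is edited.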


I see. Thank you!

There are a few packages that take a serious approach to adblock-through-hosts-file, including hostsblock[0], linked below. The cost of the systematic linear search is mitigated by using a DNS caching daemon, such as dnsmasq or pdnsd.

Indeed a Bloom filter would be overkill, and I'd rather avoid false positives!
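
For anyone wondering where the false positives come from, here is a toy Bloom filter in Python. This is only an illustrative sketch: the `BloomFilter` class, its default sizes, and the choice of SHA-256 for deriving bit positions are all my own assumptions, not something any resolver uses.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

Once enough hosts are added, most bits are set, and `might_contain` starts answering True for names that were never added. For a blocklist, that would mean blocking a legitimate domain, so a positive answer would still need to be confirmed against the real list.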

0: http://gaenserich.github.io/hostsblock/

