Hacker News

Show HN: DNS over Wikipedia (github.com)

398 points | aaronjanse posted 2 months ago | 108 Comments
aaronjanse said 2 months ago:

Hey HN,

I saw a thread a while ago (linked in README) discussing how Wikipedia does a good job keeping track of the domains of websites like Sci-Hub or The Pirate Bay. Someone mentioned checking Wikipedia to find links to these sites, so I thought this would be a fun thing to automate!

To try it out, install an extension or modify your hosts file, then type in the name of a website with the TLD `.idk`.

For example: scihub.idk -> sci-hub.tw

Cheers!

Polylactic_acid said 2 months ago:

It's incredible how insane this seems from the title but how practical it sounds from the readme.

basch said 2 months ago:

Right? Basically a modern "I'm Feeling Lucky" meets meta-DNS.

mathieubordere said 2 months ago:

hehe yeah, was thinking the exact same thing

Vinnl said 2 months ago:

I created whereisscihub.now.sh a while ago for exactly this purpose (but limited to the subset of Sci-Hub, of course, and it used Wikidata as its data source). It has since been taken down by Now.sh.

Just as a heads-up of what you could expect to see happening :)

CapriciousCptl said 2 months ago:

Could use some sort of verification since Wiki can be gamed.

1. Look at past wiki edits combined with article popularity or other signals to arrive at something like a confidence level.

2. Offer some sort of confirmation check to the user.

HugoDaniel said 2 months ago:

DNS translates a name into an IP address. This is not DNS per se; it is just a search plugin for the URL bar.

If an analogy was needed with a network service perhaps this is more like a proxy redirector than DNS.

Keep in mind: with this you will still be misdirected if your DNS/hosts file points the name to a different IP than it should.

capableweb said 2 months ago:

Indeed. Even the GitHub repository's description has this error.

> Resolve DNS queries using the official link found on a topic's Wikipedia page

@aaronjanse: you probably want to correct this. "Resolving DNS records" carries a specific meaning: you have a DNS record and you "resolve" it to a value. Which, actually, you're kind of doing, in a way, I suppose.

I was convinced when I started writing this comment that calling this "resolve dns queries" is wrong. But thinking about it, DNS resolving is not necessarily resolving a "name into an IP address" as @HugoDaniel in the comment I'm replying to is saying (think CNAME records and all the others that don't have IP addresses). It's just taking something and making it into something else, traditionally over DNS servers. But I guess you could argue that this is resolving a name into a different name, which then gets resolved into an IP address. So it's like an overlay over DNS resolving.

Meh, in the end I'm torn. Anyone else wanna give it a shot?

penagwin said 2 months ago:

I mean once we account for all the different types of DNS records - regardless of its original intent, isn't it essentially just a networked, hierarchical key store? For example the TXT field is "dns".

This project is still doing key -> value. It just fetches the value from Wikipedia first, much like your normal dns servers have to fetch non-cached keys from their sources (other dns servers normally)?

sfoley said 2 months ago:

Just because it's doing key-value pairs does not mean it's DNS. If I can't `dig` it, it's not DNS. This is simply doing HTTP redirects and works with no other protocols.

capableweb said 2 months ago:

Hm, there are plenty of DNS records (not to mention all the custom ones all around the world) that you won't be able to `dig` but most people would still call DNS.

icedchai said 2 months ago:

Can you provide an example of one that you can't 'dig'? I have my doubts.

capableweb said 2 months ago:

One example: I can't seem to get dig to work with URI records (but I might be missing some flag). Doing `dig URI _kerberos.hasvickygoneonholiday.com` returns "no servers could be reached" while doing `kdig URI _kerberos.hasvickygoneonholiday.com` returns the proper records.

So seems to be a proper DNS record, but I can't make it work with dig.

icedchai said 2 months ago:

Plain old "dig" works for me! I suspect it may be an older version of dig you're using? This is DiG 9.11.5-P4-5.1ubuntu2.1-Ubuntu on Ubuntu 19.10 ...

capableweb said 2 months ago:

Strange, but thanks for letting me know! I'm on my Ubuntu laptop now, so `DiG 9.11.3-1ubuntu1.11-Ubuntu` and it works too! But initially tried on my Arch Linux desktop, where it didn't work, and I would expect my desktop to run a more recent version than my laptop. Very strange, will take another look.

capableweb said 2 months ago:

Yeah, agree! I'd take it a step further and say it doesn't even have to be "networked" (in the technical sense) but could be local or even done person-to-person, which would also work, albeit slowly.

Let's call that sneaksolving (from the infamous sneakernet/FloppyNet)

nulbyte said 2 months ago:

Arguably, the plugin and resolver "resolve" domains under the top-level domain idk. However, the primary service provided is not DNS (which is not offered at all via the plugin), but HTTP redirection. DNS, on the other hand, serves a variety of applications, not just HTTP clients.

parhamn said 2 months ago:

I agree. DNS in conversation = K/V mapping pair for routing somewhere. TXT/MX/CNAME/A/WIKI etc. For the sake of this repo and what they're trying to get across this seems fair. I'm confused that I felt compelled to write this though.

jakear said 2 months ago:

> If you Google "Piratebay", the first search result is a fake "thepirate-bay.org" (with a dash) but the Wikipedia article lists the right one. — shpx

How interesting. Bing doesn't do this, which leads me to believe it's not a matter of legality. Is Google simply electing to self-censor results that it'd prefer its users not know about? Strange move, especially given the alternative Google does index is almost definitely more nefarious.

sixhobbits said 2 months ago:

Google has been downranking sites based on copyright takedown requests since 2018 at least [0]. And it's been very hard to find torrent sites or streaming sites through Google since then in my experience.

As many have pointed out, this just makes it easier for actually malicious sites to get traffic.

[0] https://torrentfreak.com/google-downranks-65000-pirate-sites...

tomcooks said 2 months ago:

Google does list proper pirating sites!

At the bottom of the page click on the DMCA complaint, you'll find all the URLs you shouldn't ever, never ever, click on~

philips4350 said 2 months ago:

What's funny and ironic is that this actually makes finding pirated content much easier, since only actual sites that contain pirated content will be listed in a DMCA complaint.

terramex said 2 months ago:

In recent years it is less easy, as content owners are now reporting huge batches of URLs in one complaint, so finding what you are looking for in this mass is much harder. They also often report fake downloads and scam websites in DMCA complaints.

tomcooks said 2 months ago:

Yes I wonder if these URLs have to be made public by law in DMCA notices.

I assume that, if they legally could, they wouldn't show you anything

kevin_thibedeau said 2 months ago:

The notices don't have to be disclosed to anyone but the alleged infringer. The URLs don't have to be hyperlinked either. This is one part of Google giving the trolls a middle finger.

StillBored said 2 months ago:

Google should index them all on a separate page. For science of course.

More than once I've done a search for something pedestrian (no intent for piracy/etc) only to notice the "some results removed" link. Out of curiosity I've clicked it, just to see what crazy things have been removed, and been quite amused/interested in the results.

capableweb said 2 months ago:

> At the bottom of the page click on the DMCA complaint, you'll find all the URLs you shouldn't ever, never ever, click on~

I don't think everyone is seeing this, because sometimes I see this but sometimes not; it seems to depend on the query. Searching for "piratebay" doesn't show it at the bottom (I live in a European country [also in the EU]), and meanwhile the official thepiratebay website is blocked at the ISP level here.

jonchurch_ said 2 months ago:

I'm not sure how long that's been the case. The actual site at their normal domain seems to have been down for a few months, with a 522 cloudflare timeout.

I'm curious if that's the case for you as well, or if it's my ISP blocking (I wouldn't expect to see the cloudflare error if my ISP was blocking but I don't know).

I bring this up because if the site is unresponsive from wherever you're searching (or perhaps unresponsive for all, idk) then maybe it got de-ranked on google.

onion2k said 2 months ago:

For me the address on Wikipedia times out with a 522 in exactly the same way. Bing's top result of the .party address works fine. I strongly suspect this is an ISP issue, but it is interesting that Google seems to have no knowledge of the .party domain.

el_nino said 2 months ago:

Just tested over Tor and it works: piratebayztemzmv.onion

P.S. Yes, it's their new official "readable" onion site link.

shp0ngle said 2 months ago:

Note that this is their official website, not just another fake Tor proxy

https://torrentfreak.com/the-pirate-bay-moves-to-a-brand-new...

shp0ngle said 2 months ago:

Is the .party actual official website, or is it just a proxy too?

shp0ngle said 2 months ago:

That's been the case for me for months now. I have no idea where the proxies and mirrors even get the content from.

aequitas said 2 months ago:

Piratebay is blocked in a few countries (like the Netherlands[0]). So proxies (with their own ads of course) are good business.

[0] https://blog.iusmentis.com/2017/06/19/eu-hof-verklaart-the-p...

jimmaswell said 2 months ago:

This fake one seems to work fine. To what end is it there, honeypot or just ad money?

frei said 2 months ago:

Pretty neat! Similarly, I often use Wikipedia to find translations for specific technical terms that aren't in bilingual dictionaries or Google Translate. If you go to a wiki page about a term, there are usually many links on the sidebar to versions in other languages, which are usually titled with the canonical term in that language.

itaysk said 2 months ago:

I do this as well, I find that wikipedia is the best dictionary

bausano_michael said 2 months ago:

I found this to be a great method too. Especially for topics which I have been educated on in my mother tongue in high school. I know the term in Czech but I'd be unsure about the direct translation.

hk__2 said 2 months ago:

+1; I’ve used that method so many times that I wrote a Python CLI tool for that a few years ago: https://github.com/bfontaine/wptranslate

iakh said 2 months ago:

Self plugging a quick page I wrote to do exactly this some time ago: http://adamhwang.github.io/wikitranslator/

nitrogen said 2 months ago:

Out of curiosity, how well does Wiktionary fare in this regard?

greenpresident said 2 months ago:

I use it primarily for cooking ingredients. The names on some unconventional grains and vegetables are easy to translate using this method and not always available in conventional dictionaries.

It would also be useful for identifying cuts of meat, as US cuts and, for example, Italian cuts differ not only in name but in how they are made. Compare the images on this article for an example of what I mean:

https://en.wikipedia.org/wiki/Cut_of_beef

carlob said 2 months ago:

I own an illustrated encyclopedia of Italian food: there are 9 pages of regional cuts of beef!

That's what you get in a country that unified 160 years ago...

matsemann said 2 months ago:

I often do it as well. It's not perfect, but it's nice for things not directly translatable. For instance, events known by different names in different languages, where translating the name of the event with Google just does a literal translation.

nicbou said 2 months ago:

Dict.cc is excellent for that, if you're translating between German and English. Linguee can also be really good.

EE84M3i said 2 months ago:

Similarly, jisho.org for English <-> Japanese often has search results from Wikipedia too.

cpach said 2 months ago:

I also do this :) It would be cool to build a dictionary that uses this method.

segfaultbuserr said 2 months ago:

There's a risk of phishing by editing Wikipedia articles if the plugin gets popular. Perhaps it's useful to crosscheck the current URL against the 24-hour earlier and 48-hour earlier versions of the same article. Crosscheck back in time, not back in revision, since one can spam the history by making a lot of edits.
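A minimal sketch of that time-based crosscheck, in Python. It assumes the caller has already fetched the article's history as a list of (timestamp, url) pairs sorted oldest first; `url_at`, `is_stable` and the 0/24/48-hour checkpoints are illustrative names and values, not anything the extension is known to do.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def url_at(revisions, when):
    """Return the URL in effect at `when`, or None if no revision existed yet.

    `revisions` is a list of (timestamp, url) pairs sorted oldest first
    (an assumed input shape; fetching the history is out of scope here).
    """
    timestamps = [ts for ts, _ in revisions]
    # Index of the first revision strictly after `when`.
    idx = bisect_right(timestamps, when)
    return revisions[idx - 1][1] if idx else None


def is_stable(revisions, now, offsets_hours=(0, 24, 48)):
    """True only if the same URL was in effect at every checkpoint in time."""
    urls = {url_at(revisions, now - timedelta(hours=h)) for h in offsets_hours}
    return len(urls) == 1 and None not in urls
```

Because the checkpoints are points in time rather than revision counts, a burst of edits in the last hour does not push the older state out of view; only the state at each checkpoint matters.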

cxr said 2 months ago:

I jotted down some thoughts about this very thing last year. Here's the part that argues that it could work out to be fairly robust despite this apparent weakness:

> Not as trivially compromised as it sounds like it would be; could be faked with (inevitably short-lived) edits, but temporality can't be faked. If a system were rolled out tomorrow, nothing that happens after rollout [...] would alter the fact that for the last N years, Wikipedia has understood that the website for Facebook is facebook.com. Newly created, low-traffic articles and short-lived edits would fail the trust threshold. After rollout, there would be increased attention to make sure that longstanding edits getting in that misrepresent the link between domain and identity [can never reach maturity]. Would-be attackers would be discouraged to the point of not even trying.

https://www.colbyrussell.com/2019/05/15/may-integration.html...
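One way to read the "trust threshold" in that quote, as a rough Python sketch: accept a URL only if it has been in effect for most of a long look-back window. The function name, the 365-day window and the 0.9 share are assumptions made for illustration; the quoted post does not specify any of them.

```python
from datetime import datetime, timedelta


def dominant_url(revisions, now, window=timedelta(days=365), threshold=0.9):
    """Return the URL that held for at least `threshold` of the window, else None.

    `revisions` is a list of (timestamp, url) pairs sorted oldest first
    (an assumed input shape).
    """
    start = now - window
    held = {}
    for i, (ts, url) in enumerate(revisions):
        # A revision is in effect until the next one, or until `now` if it is the latest.
        end = revisions[i + 1][0] if i + 1 < len(revisions) else now
        lo, hi = max(ts, start), min(end, now)
        if hi > lo:
            held[url] = held.get(url, timedelta()) + (hi - lo)
    if not held:
        return None
    url, duration = max(held.items(), key=lambda kv: kv[1])
    return url if duration >= threshold * window else None
```

Under these assumptions a short-lived edit does not displace the long-standing value, and a newly created article returns nothing at all until it has accumulated enough history.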

Asmod4n said 2 months ago:

I believe the German version of Wikipedia had(has?) a feature where you only get verified versions of a page when you browse it anonymously.

hk__2 said 2 months ago:

> I believe the German version of Wikipedia had(has?) a feature where you only get verified versions of a page when you browse it anonymously.

What’s a “verified version”? Who verifies?

fragmede said 2 months ago:
hk__2 said 2 months ago:

Verifiability is about the ability to check some information using reliable sources ; it has nothing to do with having “verified versions” of pages.

hn_101 said 2 months ago:
hk__2 said 2 months ago:

Thanks, that’s more like this.

BillinghamJ said 2 months ago:

Nice idea! Maybe should involve some randomised offsets so it can't just be planned ahead of time

pishpash said 2 months ago:

And what would you do if there was a difference?

s_gourichon said 2 months ago:

Return an error code. Also, since the DNS protocol allows ancillary information, perhaps return additional information in whichever fields seem fit, or else in comments.

Edit: this is not DNS over Wikipedia. As others pointed out, there is no DNS involved in the linked artifact. One option would be to show alternatives with dates and let the user choose.

hk__2 said 2 months ago:

> Wikipedia keeps track of official URLs for popular websites

This should be Wikidata. Wikipedia does that, but this is increasingly being moved to Wikidata. This is a good thing, because Wikidata is much easier to query, and the official website of an entity is stored in a single place, which is then reused by all articles about that entity in all languages.

snek said 2 months ago:

The extension has nothing to do with DNS, a more accurate name would be "autocorrect over wikipedia".

The rust server set up with dnsmasq is a legit DNS server though.

MatthewWilkes said 2 months ago:

It isn't autocorrect either. It's domain name resolution.

abiogenesis said 2 months ago:

Nitpicking: Technically it's not DNS as it doesn't resolve names to addresses. Maybe CNAME over Wikipedia?

usmannk said 2 months ago:

Nitpicking nitpicking: "Technically" CNAME is DNS insofar as DNS is "technically" defined at all.

datalist said 2 months ago:

It is not even a CNAME. It is a JavaScript redirect based on the response of an HTTP request to Wikipedia.

stepanhruda said 2 months ago:

No one would click on “I’m feeling lucky for top level pages over Wikipedia” though

LinuxBender said 2 months ago:

This may be a little off topic, but has anyone ever considered a web standard that includes a cryptographically signed file in a standard "well known" location that would contain content such as

- Domains used by the site (first party)

- Domains used by the site (third party)

- Methods allowed per domain.

- CDNs used by the site

- A records and their current IP addresses

- Reporting URL for errors

Then include the public keys for that payload in DNS and in the APEX of the domain? Perhaps a browser add-on could verify the content and report errors back to a standard reporting URL with some technical data that would show which ISP is potentially being tampered with? Does something like this already exist beyond DANE? Similar to HSTS maybe the browser could cache some of this info and show diffs in the report? Maybe the crypto keys learned for a domain could also be cached and warn the user if something has changed (show diff and option to report)? Maybe more complex would be a system that allows a consensus aggregation of data to be ingested by users so they may start off in a hostile network and some trusted domains populated by the browser in advance, also similar to HSTS?

andrekorol said 2 months ago:

That's a good use case for blockchain, in regards to the "consensus aggregation of data" that you mentioned.

Spivak said 2 months ago:

Why would you need a blockchain for this? This would just be a text document sitting at $domain/.well-known/$blah and verifiable by virtue of being signed with a cert that's valid for $domain.

renewiltord said 2 months ago:

This is hecka cool. What a clever concept! I like the idea of piggy-backing on top of a mechanism that is sort of kept in the right state by consensus.

blattimwind said 2 months ago:

Wouldn't this be an excellent use case for Wikidata?

For example looking up "sci hub" on Wikidata leads to https://www.wikidata.org/wiki/Q21980377 which has an "official website" field.

oefrha said 2 months ago:

Pretty cool, although legally gray content distribution sites like Libgen, TPB, KAT, etc. are often better thought of as a collection of mirrors where any mirror (including the main site, if there is one) could be unavailable at any given time.

gbear605 said 2 months ago:

One concern is that you can’t always trust the Wikipedia link. For example, in this edit [1] to the Equifax page, a spammer changed the link to a spam site. They’re usually fixed quickly, but it’s not guaranteed. So it’s a really neat project, but be careful about actually using it, especially for sensitive websites.

[1]: https://en.wikipedia.org/w/index.php?title=Equifax&diff=9455...

edjrage said 2 months ago:

True, seems pretty risky. Maybe the extension could take advantage of the edit history and warn the user about recent changes?

Edit: Unrelated to this issue, but I have a more general idea for the kinds of inputs this extension may accept. It could be an omnibox command [0] that takes the input text, passes it through some search engine with "site:wikipedia.org", visits the first result and finally grabs the URL. So you don't have to know any part of the URL - you can just type the name of the thing.

[0]: https://developer.chrome.com/extensions/omnibox

yreg said 2 months ago:

The user should exercise caution, but in the use cases provided (a new scihub/tpb domain) that applies regardless.

29athrowaway said 2 months ago:

Many Wikipedia articles can be edited by anyone. This is not secure.

jrockway said 2 months ago:

Why does Google censor results, but not Wikipedia? It seems like you can DMCA Wikipedia just as easily as Google.

Overall this is a nifty hack and I like it a lot. Wikipedia has an edit history, and a DNS changelog is something that is very interesting to have. People can change things and phish users of this service, of course, but with the edit log you can see when and potentially why. That kind of transparency is pretty scary to someone that wants to do something malicious or nefarious.

jhasse said 2 months ago:

Google also sells copyrighted content, Wikipedia doesn't.

leoh said 2 months ago:

Nice work! Sometimes I seem to be directed to a wikipedia page as opposed to a URL. For example, with `aaronsw.idk` or `google.idk`. I wonder why that's the case?

O_H_E said 2 months ago:

I think it directs to the correct link when it is labeled `URL` in wiki. In the other cases the link is labeled `Website`.

aaronjanse said 2 months ago:

This was exactly the issue! I just pushed fixes for this problem.

cooper12 said 2 months ago:

I've written a userscript[0] before regarding official websites and I feel this is the hierarchy you should be using:

1. Try getting the Wikidata "official website" property

2. Then any link inside of a {{url}} template or |website= in an infobox

3. And if you really want to try to get something to resolve to, the first site wrapped in {{official website}}

If you need code to reference: https://en.wikipedia.org/wiki/User:Opencooper/domainRedirect...

[0]: https://en.wikipedia.org/wiki/User:Opencooper/domainRedirect
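Roughly, that fallback order could look like this in Python. It is only a sketch: the regexes are deliberate simplifications of real wikitext (they ignore named template parameters and nested templates), `pick_official_url` is a made-up name, and the Wikidata value is assumed to have been looked up separately. It is not the linked userscript.

```python
import re

# {{URL|example.org}} : capture the first positional argument.
URL_TEMPLATE = re.compile(r"\{\{\s*url\s*\|\s*([^|}]+)", re.IGNORECASE)
# | website = ... : capture the rest of an infobox parameter line.
WEBSITE_PARAM = re.compile(
    r"^[ \t]*\|[ \t]*website[ \t]*=[ \t]*(\S[^\n]*)$", re.IGNORECASE | re.MULTILINE
)
# {{Official website|https://example.org}}
OFFICIAL_TEMPLATE = re.compile(
    r"\{\{\s*official website\s*\|\s*([^|}]+)", re.IGNORECASE
)
BARE_URL = re.compile(r"https?://[^\s\]|}]+")


def pick_official_url(wikidata_url, wikitext):
    """Apply the three-step hierarchy; return None if nothing matches."""
    # 1. Prefer the Wikidata "official website" value when there is one.
    if wikidata_url:
        return wikidata_url
    # 2. A {{url}} template, or a plain link in |website=.
    match = URL_TEMPLATE.search(wikitext)
    if match:
        return match.group(1).strip()
    match = WEBSITE_PARAM.search(wikitext)
    if match:
        link = BARE_URL.search(match.group(1))
        if link:
            return link.group(0)
    # 3. Last resort: the first {{official website}} template.
    match = OFFICIAL_TEMPLATE.search(wikitext)
    return match.group(1).strip() if match else None
```

A real implementation would probably want a proper wikitext parser rather than regexes, since templates can nest and arguments can be named.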

yreg said 2 months ago:

Is this inconsistency intended?

erikig said 2 months ago:

Interesting idea but:

- How do you handle ambiguity? e.g. what happens when sci-hub.idk and scihub.idk differ?

- Aren’t you concerned by the fact that Wikipedia is open to editing by the public?

aaronjanse said 2 months ago:

> Aren’t you concerned by the fact that Wikipedia is open to editing by the public?

Arguably the thrill of uncertainty could add to the fun :D

Kye said 2 months ago:

This was the appeal of StumbleUpon.

captn3m0 said 2 months ago:

Maybe use WikiData? The slower rate of updates might work in your favour to avoid vandalism.

skissane said 2 months ago:

In my personal experience, Wikidata is often worse at detecting vandalism than Wikipedia. Wikipedia has more editors and so vandalism on Wikipedia tends to be noticed sooner. Wikidata gets less attention so vandalism can endure for much longer.

With the increasing trend to pull data from Wikidata into Wikipedia, this is I think becoming less of an issue – even if nobody is watching the Wikidata item, if some vandalised property is exposed in a Wikipedia infobox, that increases the odds that someone will notice the vandalism. However, there are always going to be more obscure items which lack Wikipedia articles, and more obscure properties which don't get displayed in any infobox, and for them the risk of vandalism is greater. (Plus, it is possible for a Wikipedia article to override the data in Wikidata with its own values; this is done for the English Wikipedia Sci-Hub article, for example – Wikidata is including all the historical web addresses, Wikipedia only wants to display the current ones – I don't think it is technically possible yet to filter out just the current ones, so instead Wikipedia is manually overriding the addresses from Wikidata.)

mmarx said 2 months ago:

> I don't think it is technically possible yet to filter out just the current ones, so instead Wikipedia is manually overriding the addresses from Wikidata.

Note that the historical ones are of “normal” rank, whereas the current ones have “preferred” rank. You can filter that when using the API, and when using the SPARQL endpoint, if you go for the “truthy triples” representation `wdt:P856` of the “official website” property, you will only get best-rank statements – in this case the preferred ones. If you want to be absolutely sure, you can go for the “reified triples” representation and query for statements that don't have any “end time” qualifiers.
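For illustration, here is a small Python sketch that builds the "truthy" query for one item and the matching request URL for the public Wikidata Query Service. It only constructs strings and performs no network request; the helper names are made up, and Q21980377 is the Sci-Hub item linked elsewhere in this thread.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def official_website_query(item_id):
    """SPARQL for the best-rank "official website" (P856) values of one item."""
    if not (item_id.startswith("Q") and item_id[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {item_id!r}")
    # wdt: is the truthy-triples prefix, so only best-rank statements match.
    return f"SELECT ?url WHERE {{ wd:{item_id} wdt:P856 ?url . }}"


def query_url(item_id):
    """GET URL that asks the endpoint for JSON results."""
    params = {"query": official_website_query(item_id), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)
```

Calling `query_url("Q21980377")` gives a URL that can be fetched with any HTTP client; the reified-triples variant with an "end time" filter would need a longer query and is not shown here.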

trishmapow2 said 2 months ago:

It also shows all the historical links. Sample: https://www.wikidata.org/wiki/Q21980377

hobofan said 2 months ago:

I was going to say "The Wikipedia page uses the data from Wikidata", as I thought I had seen that in the past. Turns out that it's not the case, and after picking a few samples, it looks like Wikidata is barely put in use in Wikipedia (aside from the inter-wiki links).

skissane said 2 months ago:

A lot of Wikipedia templates now pull data from Wikidata – https://en.wikipedia.org/wiki/Category:Templates_using_data_...

In the case of the Sci-Hub article, it actually would pull the website address from Wikidata, except that the article has been configured to override the Wikidata website address data with its own. Heaps of articles do this – https://en.wikipedia.org/wiki/Category:Official_website_diff... – but I understand the aim is to try to reduce the number of those cases over time.

hobofan said 2 months ago:

Yeah, via the overriding is how I came to the conclusion that it's not being used. I looked at the pages (specifically the infoboxes) and basically all values shown in the article were spelled out in the source. At the same time no Wikidata ID was present in the source (though I now realize that the template can automatically query that based on the page it's being used on).

> A lot of Wikipedia templates now pull data from Wikidata

It looks okay for the English wiki, but the others seem to trail behind quite a lot (though obviously number of best templates isn't a perfect metric).

English: 539 German: 128 Chinese: 93 French: 52 Swedish: 37 Spanish: 19

Given that only ~2200 English Wikipedia articles have their infobox completely from Wikidata (https://en.wikipedia.org/wiki/Category:Articles_with_infobox...), it looks like Wikidata integration still has a long long way to go.

tubbs said 2 months ago:

Your second point was my first thought - mostly because of an experience I had.

I used Pushbullet's recipe for "Google acquisitions" up until the night I got the notification "Google acquires 4chan". After being perplexed for a bit and a few more "acquisitions" were made, I discovered the recipe just used Wikipedia's List of mergers and acquisitions by Alphabet[1] page as a source.

[1]: https://en.wikipedia.org/wiki/List_of_mergers_and_acquisitio...

jneplokh said 2 months ago:

Awesome idea! It could be applied to a lot of different websites. Even ones where I'm too lazy to type out the whole URL :p

Regardless, having a system where you can base it off a website could definitely be expanded beyond Wikipedia. Great work!

snorrah said 2 months ago:

Does this comply with the terms of service? I know this won’t be a popular reply and that’s fine, but I just want to know whether your admittedly intriguing concept isn’t taking the piss :)

upgoat said 2 months ago:

Woah this is hecka cool!! Nice work to the authors.

jaimex2 said 2 months ago:

Alternatively just don't use Google.

newswasboring said 2 months ago:

And find the site through clairvoyance?

Sabinus said 2 months ago:

What search engines don't censor?

jaimex2 said 2 months ago:

Yandex and DuckDuckGo are good.

sm4rk0 said 2 months ago:

Nice hack, but you can do it much easier with DuckDuckGo's "I'm Feeling Ducky", which is used by prefixing the search with a backslash:

https://lmddgtfy.net/?q=%5Chacker%20news

That's especially useful if DDG is default search engine in your browser.

(I'm not affiliated with DDG)

kelnos said 2 months ago:

That just takes you to the first DDG result, no?

The purpose of this seems to be to treat Wikipedia as a trusted, reliable source of truth about the canonical URL for websites (debatable, of course). The idea is that you don't trust the search engines, perhaps because you live in a country where your government has required search engines to censor results in some way, but (for some reason?) lets you go to Wikipedia.

BubRoss said 2 months ago:

Wouldn't dns over github make more sense than this?

lizardmancan said 2 months ago:

name server over everything

rootsudo said 2 months ago:

This is cool!
