Using the Wikipedia page-to-page link database

Henry Haselgrove, 28/1/09. Page last updated 7/4/2016.

Inspired by Michael Nielsen's lectures on PageRank.


Wikipedia allows its entire database to be downloaded. One file that is available for download is a list of all page-to-page links. This might therefore be an excellent intermediate-sized data set to try out techniques such as PageRank.

Unfortunately, the format in which this information is offered is rather inconvenient. As of 5 March 2016 it consists of a 5GB gzipped .sql file (pagelinks.sql.gz) that decompresses to over 35GB. For each link listed inside the file, the page that is being linked from is referred to only by its page_id (an integer somewhere in the range 1 to 49659950), whereas the page that is being linked to is referred to only by its full title. There is no way to convert between the title and page_id unless you download another file (page.sql.gz), which is a further 1.3GB download, decompressing to over 4GB.

Also, these files include a lot of what I consider uninteresting links, such as all links to "talk" pages, "user" pages, and to attached files such as images. Also, many links are broken. 
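As a rough illustration of the namespace filtering described above, here is a minimal Python sketch (not the Matlab scripts I actually used). It assumes each tuple in the dump's INSERT statements has the layout (pl_from, pl_namespace, 'pl_title', pl_from_namespace); that layout is an assumption, so check the CREATE TABLE statement at the top of your own dump. The names namespace0_links and iter_dump_links are made up for this example. Detecting broken links is not attempted here, since that needs the page file as well.

```python
import gzip
import re

# ASSUMED tuple layout: (pl_from, pl_namespace, 'pl_title', pl_from_namespace).
# Verify against the CREATE TABLE statement in your dump before relying on it.
TUPLE_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)',(-?\d+)\)")


def namespace0_links(sql_line):
    """Yield (from_page_id, to_title) for links whose two ends are both in namespace 0."""
    if not sql_line.startswith("INSERT INTO"):
        return
    for match in TUPLE_RE.finditer(sql_line):
        from_id, to_ns, title, from_ns = match.groups()
        if to_ns == "0" and from_ns == "0":
            # Undo simple backslash escapes such as \' inside the quoted title.
            yield int(from_id), re.sub(r"\\(.)", r"\1", title)


def iter_dump_links(path):
    """Stream a gzipped dump line by line, so the 35GB file is never held in memory."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            yield from namespace0_links(line)
```

A regular expression like this is only an approximation of a real SQL parser, but it keeps the example short.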

If you can't be bothered repeating what I did to get around these problems, you might want to just download the following two files that I created:

        links-simple-sorted.7z    (647MB)

        titles-sorted.7z                (51MB)

These files are derived from the 5 March 2016 version of the English-language Wikipedia data dump. They contain all links between proper Wikipedia pages, that is, pages in "namespace 0". This includes disambiguation pages and redirect pages.

In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is:

    from1: to11 to12 to13 ...
    from2: to21 to22 to23 ...
    ...

where from1 is an integer labelling a page that has links from it, and to11 to12 to13 ... are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt, a UTF-8 encoded text file. Note that the integers do not correspond to the official Wikipedia page_id, since I have re-mapped them to the range [1, number of pages] and sorted the page titles alphabetically.
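To make the format concrete, here is a small Python sketch for reading the two files. The function names are made up for this example, and load_titles keeps every title in memory, which is an assumption about how much RAM you have to spare.

```python
def parse_links_line(line):
    """Split one 'from: to1 to2 ...' line into (from_index, [to_indices])."""
    src, _, rest = line.partition(":")
    return int(src), [int(t) for t in rest.split()]


def load_titles(lines):
    """Collect titles from an iterable of lines, e.g. an open titles-sorted.txt."""
    return [line.rstrip("\n") for line in lines]


def title_of(titles, n):
    """Page integers are 1-based, so page n is the n-th line of the titles file."""
    return titles[n - 1]
```

For the real file you would call something like load_titles(open("titles-sorted.txt", encoding="utf-8")).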

Here are some basic statistics from an older (2009) version of the data set:

        Total # pages                            5,716,808
        Total # links                            130,160,392
        Max. # outlinks from a single page       5,775  -- List of endangered animal species. (A bit depressing, really)
        Max. # inlinks to a single page          374,934  -- United States
        # pages with no outlinks                 10,438
        # pages with no inlinks                  1,942,943
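Statistics of this kind can be computed in a single pass over links-simple-sorted.txt. The Python sketch below is one way to do it and is not the code that produced the numbers above. It assumes you pass in the total page count (the number of lines in titles-sorted.txt), and it counts a link once for every time it appears on a line, which may or may not match how the figures above were counted.

```python
from collections import Counter


def link_stats(lines, num_pages):
    """Compute basic statistics from lines in the 'from: to1 to2 ...' format."""
    out_deg = {}
    in_deg = Counter()
    for line in lines:
        src, _, rest = line.partition(":")
        targets = rest.split()
        out_deg[int(src)] = len(targets)
        in_deg.update(int(t) for t in targets)
    pages_with_out = sum(1 for d in out_deg.values() if d > 0)
    return {
        "total_pages": num_pages,
        "total_links": sum(out_deg.values()),
        "max_outlinks": max(out_deg.values(), default=0),
        "max_inlinks": max(in_deg.values(), default=0),
        # Pages with no line of their own in the file have no outlinks.
        "no_outlinks": num_pages - pages_with_out,
        # Pages that never appear as a target have no inlinks.
        "no_inlinks": num_pages - len(in_deg),
    }
```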


Update, 30/9/10 --

By popular demand, I've uploaded the scripts which I used to do the conversion. You can download them from GitHub. You'll need Matlab in order to run them. The procedure for using them is as follows:

1-    Change the line beginning "f_in=fopen" in convert_page_file.m to refer to the correct sql page file, as you downloaded from the Wikipedia website.
2-    Run convert_page_file.m. It creates a temporary file page-simple-matlab2.txt  (and also a smaller temporary file params.mat)
3-    Run sortpages.m. It creates titles-sorted.txt, and also a temporary file sorted_out2.mat
4-    Change the line beginning "f_in=fopen" in convert_links_file.m to refer to the correct sql links file, as downloaded from the Wikipedia website
5-    Run convert_links_file.m. It creates the temporary file links-simple-matlab.txt (and also a smaller temporary file link_param.mat)
6-    Run analyse_links_file.m. It creates the file links-simple-sorted.txt
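For readers without Matlab, the core idea behind these steps can be sketched in a few lines of Python. This is not a port of the scripts above, only an illustration under some assumptions: the whole page list fits in memory, titles are sorted by Python's default string ordering (which may differ from the ordering in titles-sorted.txt), and repeated links are collapsed. The function names remap and format_line are invented for the example.

```python
def remap(pages, links):
    """Renumber pages 1..N in sorted-title order and translate the links.

    pages: iterable of (page_id, title) pairs
    links: iterable of (from_page_id, to_title) pairs
    Returns (sorted_titles, adjacency), where adjacency maps a new index
    to the sorted list of new indices it links to.
    """
    pages = list(pages)
    sorted_titles = sorted(title for _, title in pages)
    index_of_title = {t: i for i, t in enumerate(sorted_titles, start=1)}
    index_of_id = {pid: index_of_title[t] for pid, t in pages}
    adjacency = {}
    for from_id, to_title in links:
        src = index_of_id.get(from_id)
        dst = index_of_title.get(to_title)
        if src is None or dst is None:
            continue  # broken link: one end is not a known page
        adjacency.setdefault(src, set()).add(dst)
    return sorted_titles, {s: sorted(d) for s, d in adjacency.items()}


def format_line(src, targets):
    """Render one output line in the 'from: to1 to2 ...' format."""
    return f"{src}: {' '.join(str(t) for t in targets)}"
```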

Update, 30/1/09 --

I did some PageRank calculations, and put the results below. For a number of different choices of the parameter s, you can click on the appropriate button to see a list of the top-ranking pages. (Compare this with the top-ranking pages by number of in-links.)

It appears that PageRank does a better job at finding important Wikipedia pages than the simple rank of number of in-links. But it doesn't do a great job. (For example, does Geographic_coordinate_system really deserve to be ranked #4?)
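For reference, a PageRank calculation of this kind can be sketched as a plain power iteration. The Python below is a toy version, not the code behind the results above. It assumes that s is the probability of following a link (so 1 - s is the teleportation probability), that pages with no outlinks jump to a uniformly random page, and that a fixed number of iterations is good enough. All of these are assumptions on my part.

```python
def pagerank(adjacency, num_pages, s=0.85, iterations=50):
    """Power iteration for PageRank with uniform teleportation.

    adjacency maps a 1-based page index to the list of pages it links to.
    Returns a list where entry i is the rank of page i + 1.
    """
    n = num_pages
    rank = [0.0] + [1.0 / n] * n  # index 0 is unused padding
    for _ in range(iterations):
        # Rank sitting on dead-end pages is spread uniformly over all pages.
        dangling = sum(rank[p] for p in range(1, n + 1) if not adjacency.get(p))
        base = (1.0 - s) / n + s * dangling / n
        new = [0.0] + [base] * n
        for src, targets in adjacency.items():
            if not targets:
                continue
            share = s * rank[src] / len(targets)
            for dst in targets:
                new[dst] += share
        rank = new
    return rank[1:]
```

On the full data set you would want sparse matrices or at least arrays rather than Python lists, but the arithmetic is the same.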



Just for fun, what happens if we replace the teleportation step so that instead of picking a page uniformly at random it always jumps back to the Wikipedia page of a certain ex-physicist?
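A modified teleportation step like this can be sketched as follows. This is an illustrative Python toy and not the code I ran. It assumes s is the probability of following a link, and that a surfer stuck on a page with no outlinks also jumps to the chosen page; the name home is invented for the example.

```python
def pagerank_teleport_to(adjacency, num_pages, home, s=0.85, iterations=50):
    """PageRank variant where every teleport lands on the single page `home`.

    adjacency maps a 1-based page index to the list of pages it links to.
    Returns a list where entry i is the rank of page i + 1.
    """
    n = num_pages
    rank = [0.0] + [1.0 / n] * n  # index 0 is unused padding
    for _ in range(iterations):
        new = [0.0] * (n + 1)
        followed = 0.0  # total probability that moved along a link
        for src, targets in adjacency.items():
            if not targets:
                continue
            followed += s * rank[src]
            share = s * rank[src] / len(targets)
            for dst in targets:
                new[dst] += share
        # Everything else (teleports and dead ends) goes to the home page.
        new[home] += 1.0 - followed
        rank = new
    return rank[1:]
```

One consequence of this choice is that any page not reachable from the home page by following links ends up with rank zero.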