Using the Wikipedia page-to-page link database

Henry Haselgrove, 28/1/09. Page last updated 30/9/2010.

Linked to from Michael Nielsen's friendfeed room on the Google Technology Stack.


Wikipedia allows its entire database to be downloaded. One of the files available for download is a list of all page-to-page links. This could be an excellent intermediate-sized data set for trying out techniques such as PageRank.

Unfortunately, the format in which this information is offered is rather inconvenient. It consists of a 2GB gzipped .sql file (pagelinks.sql.gz) that decompresses to over 9GB. For each link listed in the file, the page being linked from is referred to only by its page_id (an integer somewhere in the range 1 to 19661939), whereas the page being linked to is referred to only by its full title. There is no way to convert between title and page_id unless you download another file (page.sql.gz), which is a further 460MB download that decompresses to over 1GB.

These files also include a lot of what I consider uninteresting links, such as all links to "talk" pages, "user" pages, and attached files such as images. In addition, many links are broken.

If you can't be bothered repeating what I did to get around these problems, you might want to just download the following two files that I created:

        links-simple-sorted.zip    (323MB)

        titles-sorted.zip                (28MB)

These files contain all links between proper Wikipedia pages, that is, pages in "namespace 0". This includes disambiguation pages and redirect pages. (English-language Wikipedia only, of course.)
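To make the conversion concrete, here is a rough Python sketch of the kind of processing described above. It is not the Matlab code I actually used. The function name build_simple_links, the tuple layouts, the plain code-point sort order and the removal of duplicate links are all assumptions made for illustration.

```python
from collections import defaultdict

def build_simple_links(pages, raw_links):
    """Hypothetical helper, not the original Matlab conversion.

    pages:     iterable of (page_id, namespace, title)
    raw_links: iterable of (from_page_id, to_namespace, to_title)
    Returns (titles, lines): sorted namespace-0 titles, and the lines
    of a links file in the 'from: to1 to2 ...' format.
    """
    pages = list(pages)
    # Keep only namespace-0 pages and number them 1..N in sorted-title order
    # (assumption: plain string sort; the real files may order differently).
    titles = sorted(title for _, ns, title in pages if ns == 0)
    index_of_title = {t: i for i, t in enumerate(titles, start=1)}
    index_of_id = {pid: index_of_title[t] for pid, ns, t in pages if ns == 0}

    out = defaultdict(set)  # assumption: duplicate links are collapsed
    for from_id, to_ns, to_title in raw_links:
        src = index_of_id.get(from_id)
        dst = index_of_title.get(to_title) if to_ns == 0 else None
        if src is None or dst is None:
            continue  # source or target outside namespace 0, or a broken link
        out[src].add(dst)

    lines = [f"{src}: {' '.join(str(d) for d in sorted(dsts))}"
             for src, dsts in sorted(out.items())]
    return titles, lines

# Toy data: one talk-namespace page, one broken link.
toy_pages = [(10, 0, "Zebra"), (11, 0, "Apple"), (12, 1, "Apple"), (13, 0, "Mango")]
toy_links = [(10, 0, "Apple"), (10, 0, "Missing_page"), (10, 1, "Apple"),
             (11, 0, "Zebra"), (11, 0, "Mango"), (12, 0, "Zebra")]
toy_titles, toy_lines = build_simple_links(toy_pages, toy_links)
print(toy_titles)  # ['Apple', 'Mango', 'Zebra']
print(toy_lines)   # ['1: 2 3', '3: 1']
```

On the real data the inputs would not fit comfortably in plain Python lists, so treat this as a description of the logic rather than something to run on the 9GB dump.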

In links-simple-sorted.txt, there is one line for each page that has links from it. The format of the lines is:

    from1: to11 to12 to13 ...
    from2: to21 to22 to23 ...
    ...

where from1 is an integer labelling a page that has links from it, and to11 to12 to13 ... are integers labelling all the pages that the page links to. To find the page title that corresponds to integer n, just look up the n-th line in the file titles-sorted.txt, a UTF-8 encoded text file.
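As a small illustration of the format just described, the following Python sketch parses lines of the links file and looks up a title by its line number. The function names are made up for this example, and it assumes the two zip files have already been unpacked into plain text files.

```python
from itertools import islice

def parse_links_line(line):
    """Split 'from: to1 to2 ...' into (from_page, [to_pages])."""
    source, _, targets = line.partition(":")
    return int(source), [int(t) for t in targets.split()]

def iter_links(path="links-simple-sorted.txt"):
    """Yield (from_page, [to_pages]) for each non-empty line of the links file."""
    with open(path, encoding="ascii") as f:
        for line in f:
            if line.strip():
                yield parse_links_line(line)

def title_of(n, path="titles-sorted.txt"):
    """Return the title on the n-th line (counting from 1) of the titles file."""
    if n < 1:
        raise IndexError("page numbers start at 1")
    with open(path, encoding="utf-8") as f:
        line = next(islice(f, n - 1, n), None)
    if line is None:
        raise IndexError(f"no line {n} in {path}")
    return line.rstrip("\n")
```

Note that title_of rescans the file on every call. If you need many lookups, it is probably better to read all the titles into a list once and index it with n - 1.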

Here are some basic statistics of the data set:

    Total # pages:                         5,716,808
    Total # links:                         130,160,392
    Max. # outlinks from a single page:    5,775 (List of endangered animal species; a bit depressing, really)
    Max. # inlinks to a single page:       374,934 (United States)
    # pages with no outlinks:              10,438
    # pages with no inlinks:               1,942,943
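Statistics like the ones above can be computed in a single pass over the links file. The Python sketch below shows one way to do it. The function name link_statistics and its input layout are assumptions for illustration, and links are counted exactly as they are listed, so the result depends on whether the input contains duplicates.

```python
from collections import Counter

def link_statistics(links, total_pages):
    """links: iterable of (from_page, [to_pages]); total_pages: number of pages."""
    in_degree = Counter()
    total_links = 0
    pages_with_outlinks = 0
    max_out_page, max_out = None, 0
    for src, dsts in links:
        if dsts:
            pages_with_outlinks += 1
        total_links += len(dsts)
        if len(dsts) > max_out:
            max_out_page, max_out = src, len(dsts)
        in_degree.update(dsts)  # one count per listed link
    max_in_page, max_in = in_degree.most_common(1)[0] if in_degree else (None, 0)
    return {
        "total_links": total_links,
        "max_outlinks": (max_out_page, max_out),
        "max_inlinks": (max_in_page, max_in),
        # Pages that never appear as a source have no outlinks.
        "pages_with_no_outlinks": total_pages - pages_with_outlinks,
        # Pages that never appear as a target have no inlinks.
        "pages_with_no_inlinks": total_pages - len(in_degree),
    }

# Toy graph with 5 pages: 1 -> 2, 3; 2 -> 3; 4 -> 3; page 5 is isolated.
toy_stats = link_statistics([(1, [2, 3]), (2, [3]), (4, [3])], total_pages=5)
print(toy_stats["total_links"])             # 4
print(toy_stats["max_inlinks"])             # (3, 3)
print(toy_stats["pages_with_no_outlinks"])  # 2
print(toy_stats["pages_with_no_inlinks"])   # 3
```

For the full data set the Counter ends up with one entry per page that has at least one inlink, which is a few million integers and should fit in memory on an ordinary machine.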

 

 

 

 

 


Update, 30/9/10 --

By popular demand, I've uploaded the scripts that I used to do the conversion. You can download them from this link. You'll need Matlab in order to run them (although the open-source Matlab clone Octave may be able to run them as well). The procedure for using them is as follows:

1-    Change line 10 of convert_page_file.m to refer to the SQL page file that you downloaded from the Wikipedia website.
2-    Run convert_page_file.m. It creates a temporary file page-simple-matlab2.txt (and also a smaller temporary file params.mat).
3-    Run sortpages.m. It creates titles-sorted.txt, and also a temporary file sorted_out2.mat.
4-    Change line 26 of convert_links_file.m to refer to the SQL links file that you downloaded from the Wikipedia website.
5-    Run convert_links_file.m. It creates the temporary file links-simple-matlab.txt (and also a smaller temporary file link_param.mat).
6-    Run analyse_links_file.m. It creates the file links-simple-sorted.txt.


I haven't actually tried the scripts on any more recent version of the data, but hopefully they will work for you.


Update, 30/1/09 --

I did some PageRank calculations, and put the results below. For a number of different choices of the parameter s, you can click on the appropriate button to see a list of the top-ranking pages (and compare this with the top-ranking pages by number of in-links).

It appears that PageRank does a better job of finding important Wikipedia pages than a simple ranking by number of in-links, but it doesn't do a great job. (For example, does Geographic_coordinate_system really deserve to be ranked #4?)
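For readers who want to try this themselves, here is a minimal pure-Python sketch of PageRank by power iteration. It is not the code that produced my results. It assumes that s is the probability of following a link (so 1 - s is the teleportation probability), that pages with no outlinks spread their rank uniformly over all pages, and that pages are numbered from 1 as in the links file. The function name and the default values are made up for the example.

```python
def pagerank(out_links, n_pages, s=0.85, iterations=50):
    """out_links: dict mapping page -> list of linked pages (pages numbered 1..n_pages).

    Returns a list `rank` where rank[p] is the score of page p (rank[0] is unused).
    Assumption: s is the probability of following a link.
    """
    rank = [0.0] + [1.0 / n_pages] * n_pages
    for _ in range(iterations):
        new = [0.0] * (n_pages + 1)
        dangling = 0.0  # total rank sitting on pages without outlinks
        for page in range(1, n_pages + 1):
            targets = out_links.get(page)
            if targets:
                share = rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                dangling += rank[page]
        # Every page receives the same share of the teleported and dangling mass.
        base = (s * dangling + (1.0 - s)) / n_pages
        rank = [0.0] + [s * new[p] + base for p in range(1, n_pages + 1)]
    return rank

# Toy graph: 1 -> 3, 2 -> 3, 3 -> 1. Page 3 collects the most rank.
toy_rank = pagerank({1: [3], 2: [3], 3: [1]}, n_pages=3, iterations=100)
print(max(range(1, 4), key=lambda p: toy_rank[p]))  # 3
```

On the full data set a plain Python loop like this will be slow, so a sparse-matrix library would be a more practical choice. The sketch is only meant to show the update rule.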



Just for fun, what happens if we replace the teleportation step so that, instead of picking a page uniformly at random, it always jumps back to the Wikipedia page of a certain ex-physicist?
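The modified walk can be sketched as follows. This is a hypothetical pure-Python version, again assuming that s is the probability of following a link. It also assumes that a surfer stuck on a page with no outlinks jumps to the chosen page, which here is called home.

```python
def pagerank_single_teleport(out_links, n_pages, home, s=0.85, iterations=100):
    """PageRank variant in which every teleportation step lands on `home`.

    out_links: dict mapping page -> list of linked pages (pages numbered 1..n_pages).
    Returns a list `rank` where rank[p] is the score of page p (rank[0] is unused).
    """
    rank = [0.0] * (n_pages + 1)
    rank[home] = 1.0  # start the walk on the home page
    for _ in range(iterations):
        new = [0.0] * (n_pages + 1)
        followed = 0.0  # probability mass that moved along a link this step
        for page, targets in out_links.items():
            if targets and rank[page] > 0.0:
                share = s * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
                followed += s * rank[page]
        # Everything else (teleports and dead ends) goes back to home.
        new[home] += 1.0 - followed
        rank = new
    return rank

# Toy graph: 1 -> 2, 2 -> 3, 3 -> 1, 4 -> 1, with page 1 as home.
toy = pagerank_single_teleport({1: [2], 2: [3], 3: [1], 4: [1]}, n_pages=4, home=1)
print(toy[4])  # 0.0
```

One consequence of this change is that any page that cannot be reached by following links from the home page ends up with a score of exactly zero, as page 4 does in the toy example.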