The Wikipedia location network is a large-scale, weighted, undirected network of roughly 720k nodes and 178M edges, constructed from toponyms in the English Wikipedia corpus. The network is built from co-occurrences of location mentions across the entire corpus. Each co-occurrence is weighted by the distance between the two toponyms in the text, and the weights are then aggregated over all co-occurrences of a pair to collapse parallel edges and produce a simple graph. The network is enriched with hierarchical information at the granularities city, country, and continent, and with additional node attributes (e.g., Wikidata information about places). The data set consists of three files:
Contains all edges of the Wikipedia Location Network. Note: this is the only file that uses whitespace as the separator instead of tabs.
Format: <source location id> <target location id> <edge weight>
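The edge format above can be read with a few lines of Python. This is a minimal sketch, not the authors' tooling: the function name `parse_edges` is hypothetical, the sample lines are made-up values, and it assumes that location IDs are plain integers and that edge weights parse as floating-point numbers.

```python
from typing import Iterable, Iterator, Tuple


def parse_edges(lines: Iterable[str]) -> Iterator[Tuple[int, int, float]]:
    """Yield (source id, target id, weight) from whitespace-separated lines."""
    for line in lines:
        parts = line.split()  # splits on any run of whitespace
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        source, target, weight = parts
        # Assumption: IDs are integers and weights are decimal numbers.
        yield int(source), int(target), float(weight)


# Illustrative lines only; these are not taken from the data set.
sample = ["64 1726 0.25", "1726 90 1.5", ""]
edges = list(parse_edges(sample))
```

Because the edge file is large (178M edges), passing an open file handle to the generator and consuming it lazily avoids loading everything into memory at once.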
Contains additional information for locations in the network. Location IDs correspond to Wikidata IDs and can be converted to them by prefixing the ID with the letter Q. The location label is the English name of the entity in Wikidata. Location types are "POI", "city", "continent", "country", "mountain", "mountain range", "river", and "NIL" if no information is available.
Format: <location id> \t <location label> \t <location type>
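A line in this format could be parsed as follows. This is a sketch under stated assumptions: the name `parse_location` and the dictionary keys are hypothetical, the sample lines are illustrative, and stray spaces around the tab separators are stripped in case they occur.

```python
from typing import Dict, Optional


def parse_location(line: str) -> Dict[str, Optional[str]]:
    """Parse one tab-separated line: id, label, type."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    location_id, label, location_type = fields
    return {
        # Prefixing "Q" turns the numeric ID into a Wikidata ID.
        "wikidata_id": "Q" + location_id,
        "label": label,
        # "NIL" marks a missing type; map it to None for convenience.
        "type": None if location_type == "NIL" else location_type,
    }


# Illustrative lines only; these are not taken from the data set.
berlin = parse_location("64\tBerlin\tcity\n")
unknown = parse_location("123\tSome Place\tNIL")
```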
Contains hierarchical information for locations in the network. The levels of the hierarchy are "continent" and "country". Since locations can belong to multiple countries or continents according to Wikidata, the entries are lists of location IDs. Multiple entries in a list are separated by commas, and each list is enclosed in [ ]. For some locations (such as countries), only the continent is known. In these cases, the entry has only two columns instead of three.
Format: <location id> \t [<continent id>*] \t [<country id>*]
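The bracketed lists and the optional third column can be handled as sketched below. The names `parse_id_list` and `parse_hierarchy` are hypothetical, the sample lines are illustrative, and the sketch assumes that the IDs inside the lists are plain integers like the location IDs themselves.

```python
from typing import List, Tuple


def parse_id_list(field: str) -> List[int]:
    """Turn a string such as '[46,48]' into [46, 48]; '[]' gives []."""
    inner = field.strip().strip("[]")
    return [int(item) for item in inner.split(",") if item.strip()]


def parse_hierarchy(line: str) -> Tuple[int, List[int], List[int]]:
    """Return (location id, continent ids, country ids) for one line."""
    columns = line.rstrip("\n").split("\t")
    location_id = int(columns[0])
    continents = parse_id_list(columns[1]) if len(columns) > 1 else []
    # Two-column entries carry no country list.
    countries = parse_id_list(columns[2]) if len(columns) > 2 else []
    return location_id, continents, countries


# Illustrative lines only; these are not taken from the data set.
full_entry = parse_hierarchy("64\t[46]\t[183]")
continent_only = parse_hierarchy("183\t[46]")
multi_entry = parse_hierarchy("999\t[46, 48]\t[]")
```

Returning an empty list for a missing country column keeps the result shape the same for every line, so downstream code does not need to distinguish two-column from three-column entries.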
Download data as tar.gz (1.4 GB compressed, 6.2 GB uncompressed)
More information about the construction of the network can be found in the original article below. If you use the data for research purposes, please cite this paper as the source:
Johanna Geiß, Andreas Spitz, Jannik Strötgen, and Michael Gertz.
The Wikipedia location network: overcoming borders and oceans.
In: Proceedings of the 9th Workshop on Geographic Information Retrieval (GIR 2015), Paris, France, November 26-27, 2015, pp. 2:1–2:3.
[DOI:10.1145/2837689.2837694]