NECKAr

The Named Entity Classifier for Wikidata (NECKAr) is a tool that assigns entities present in Wikidata to the NE classes Person, Location, and Organization.

Many Information Extraction (IE) tasks, such as Named Entity Recognition (NER) or Event Detection, require background repositories that provide a very simple classification of entities. The predominantly used classes are Location, Person, and Organization. Several available knowledge bases offer a very detailed and specific ontology of entities, but these are of limited use to IE approaches. NECKAr assigns Wikidata entities to the three main NE classes using only information present in Wikidata. The resulting Wikidata NE dataset consists of over 8M classified entities.
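
As a rough illustration of classifying entities from Wikidata's own statements, the sketch below maps an entity to a class by looking at its "instance of" (P31) values. It is a simplified stand-in and not NECKAr's actual rule set: the small class sets and the names `classify`, `instance_of`, and `make_entity` are assumptions made for this example. Only the entity JSON layout and the item IDs (Q5 human, Q515 city, Q43229 organization, Q4830453 business) are taken from Wikidata.

```python
# Simplified sketch: classify a Wikidata entity by its "instance of" (P31) values.
# The class sets below are illustrative assumptions, not NECKAr's real rules.
PERSON_CLASSES = {"Q5"}                         # human
LOCATION_CLASSES = {"Q515"}                     # city
ORGANIZATION_CLASSES = {"Q43229", "Q4830453"}   # organization, business


def instance_of(entity):
    """Return the set of item IDs the entity is an instance of (P31)."""
    ids = set()
    for claim in entity.get("claims", {}).get("P31", []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "novalue" and "somevalue" statements
        value = snak["datavalue"]["value"]
        # Older dumps may carry only "numeric-id", so fall back to it.
        ids.add(value.get("id") or "Q{}".format(value["numeric-id"]))
    return ids


def classify(entity):
    """Return "PER", "LOC", "ORG", or None if no rule matches."""
    classes = instance_of(entity)
    if classes & PERSON_CLASSES:
        return "PER"
    if classes & LOCATION_CLASSES:
        return "LOC"
    if classes & ORGANIZATION_CLASSES:
        return "ORG"
    return None


def make_entity(qid, p31_ids):
    """Build a minimal entity dict in the Wikidata JSON layout (demo helper)."""
    claims = [
        {"mainsnak": {"snaktype": "value", "property": "P31",
                      "datavalue": {"type": "wikibase-entityid",
                                    "value": {"entity-type": "item", "id": p31}}}}
        for p31 in p31_ids
    ]
    return {"id": qid, "claims": {"P31": claims}}
```

For example, `classify(make_entity("Q132345", ["Q5"]))` returns `"PER"`. A usable classifier would also have to cover subclasses of these items, which this sketch leaves out.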

Tool

The current version of the NECKAr tool can be downloaded here: NECKAr_v1.0.tar.gz

Requirements

  • Python 3
  • MongoDB 3.2+
  • Python modules as listed in requirements.txt
    • requests==2.10.0
    • pymongo==3.4.0

Installing, Configuring and Running NECKAr

  1. Download the tar file and unpack it.
  2. Download a Wikidata dump.
  3. Install and run MongoDB.
  4. Add the path to the dump and details about your database to the file NECKAr.cfg.
  5. Start one of the startNECKAr scripts.

If you want to run the complete pipeline (import the Wikidata dump into the database, extract the Named Entities, and create the LOD links), use the script startNeckar.sh. To run the parts separately, use the other scripts provided.
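
The import step has to read the Wikidata dump entity by entity. As an illustration only (this is not the NECKAr import code), the sketch below streams entities from a Wikidata JSON dump, relying on the dump layout in which the file is a single JSON array with each entity on its own line. The function names `open_dump` and `iter_entities` are hypothetical.

```python
import bz2
import gzip
import json


def open_dump(path):
    """Open a plain, gzip-compressed, or bzip2-compressed dump in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_entities(path):
    """Yield one entity dict per entity line of a Wikidata JSON dump."""
    with open_dump(path) as dump:
        for line in dump:
            line = line.strip().rstrip(",")  # entity lines end with a comma
            if line in ("", "[", "]"):
                continue  # skip the array brackets and blank lines
            yield json.loads(line)
```

Because the function is a generator, the dump never has to fit into memory. Each yielded entity could then be written to MongoDB, for instance with pymongo's `insert_one`, using the connection details you placed in NECKAr.cfg.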

Wikidata NE dataset

The Wikidata NE dataset has two parts: the Named Entity files and the link files.

The Named Entity files include the most important information about the entities, whereas the link files contain the links and IDs in other databases. The link files are available for English and German. The English link files include only entities that have a page in the English Wikipedia; these entities are linked to DBpedia. The German link files include only entities that have a page in the German Wikipedia; these entities are linked to the German DBpedia.

The datasets can be downloaded from the following tables (all files are json.gz):

Version 2

Wikidata dump 2017/03/20, NECKAr v1.0

http://event.ifi.uni-heidelberg.de/wp-content/uploads/2017/04/WikidataLODLinks_20170320_Persons_NECKAR_1_0.json_.gz

Named Entities          Links en                      Links de                      Number of entities
Persons (189 MB)        Persons_Links (53 MB)         Persons_Links (29 MB)         3,416,903
Organizations (30 MB)   Organizations_Links (22 MB)   Organizations_Links (11 MB)   1,002,937
Locations (116 MB)      Locations_Links (38 MB)       Locations_Links (17 MB)       4,842,665
all entities (336 MB)   all_Links (107 MB)            all_Links (54 MB)             9,262,505

Version 1

Wikidata dump 2016/12/05, NECKAr v1.0

complete download (484.7 MB, 3 files: all entities, all en links, all de links)

Named Entities          Links en                      Links de                      Number of entities
Persons (179 MB)        Persons_Links (51 MB)         Persons_Links (28 MB)         3,322,217
Organizations (27 MB)   Organizations_Links (20 MB)   Organizations_Links (10 MB)   939,840
Locations (105 MB)      Locations_Links (35 MB)       Locations_Links (16 MB)       4,527,356
all entities (310 MB)   all_Links (103 MB)            all_Links (52 MB)             8,789,413

Example

The following example shows one entry for each entity type:

neClass            LOC
id                 Q201117
norm name          Toyota
description        city in Aichi Prefecture, Japan
en Wikipedia       Toyota, Aichi
location type      city, settlement
population         423,343
continent          Asia
country            Japan
coordinate         35.083 137.156

neClass            ORG
id                 Q41187
norm name          Sony
description        conglomerate corporation
en Wikipedia       Sony
instance of        business enterprise
CEO                Ryoji Chubachi
Founder            Masaru Ibuka, Akio Morita
Inception          5/7/46
Headquarter        Sony City
Country            Japan
official website   http://www.sony.net/

neClass            PER
id                 Q132345
norm name          Shinzo Abe
description        Prime Minister
en Wikipedia       Shinzō Abe
occupation         politician
gender             male
date of birth      9/21/54
date of death
alias              Abe Shinzo
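
Entries like these can be read back from the downloaded json.gz files with a few lines of Python. The sketch below is a hedged example under two assumptions that are not confirmed here: that each file stores one JSON object per line, and that the class is stored under the key `neClass` as in the example above. The function names and the file name in the usage note are hypothetical.

```python
import gzip
import json
from collections import Counter


def read_entities(path):
    """Yield entity dicts from a gzip file with one JSON object per line (assumed layout)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_class(path, class_key="neClass"):
    """Count entities per value of class_key (the key name is an assumption)."""
    return Counter(entity.get(class_key) for entity in read_entities(path))
```

Calling `count_by_class("all_entities.json.gz")` on a file in that layout would return a `Counter` keyed by class label. If the files turn out to hold a single JSON array instead, `json.load` on the opened handle is the simpler choice.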

References

If you use NECKAr in your work, please cite our paper:

  • NECKAr: A Named Entity Classifier for Wikidata
    Johanna Geiß, Andreas Spitz, and Michael Gertz
    In: Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL ’17), 2017