Linguists Needed For IPTC's EXTRA Rules-based Classification Engine

Job Information


IPTC is looking for linguists to write classification rules for EXTRA, an open-source rules-based classification engine for news. The linguist will write Boolean rules that analyze the text of news articles and suggest the most relevant IPTC Media Topics, a news taxonomy of roughly 1,000 subjects. Rules will be written for a selected portion of those subjects in this project.

A browsable tree of the taxonomy is available here:

The project requires both an English and a German linguist. Applicants must complete the short application below to demonstrate rule-writing proficiency (German linguists are asked to submit their responses in English). The position is off-site and involves collaborating remotely with team members in different countries. The initial project phase is expected to run from the end of March to the end of June 2017, with an estimated 100-125 hours of work per language.


EXTRA is the EXTraction Rules Apparatus, a multilingual open-source platform for rules-based classification of news content. IPTC was awarded a grant from the first round of Google’s Digital News Initiative Innovation Fund to build and freely distribute the initial version of EXTRA. "Classification" means assigning one or more categories to the text of a news document. Rules-based classifiers use a set of Boolean rules, rather than machine-learning or statistical techniques, to determine which categories to apply.
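To give applicants a feel for the idea, here is a minimal Python sketch of Boolean rule-based classification. It is an illustration only and does not show EXTRA's actual rule syntax: the topic labels, the keywords, and the names `RULES`, `tokenize` and `classify` are all hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return the set of alphabetic words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Hypothetical rules: each illustrative topic label maps to a Boolean
# expression (AND / OR / NOT) over the words found in the article.
RULES = {
    "soccer": lambda w: ("soccer" in w or "football" in w)
                        and "goal" in w
                        and "nfl" not in w,
    "central bank": lambda w: "bank" in w
                              and ("interest" in w or "inflation" in w),
}

def classify(text):
    """Return every topic whose Boolean rule matches the text, sorted by name."""
    words = tokenize(text)
    return sorted(topic for topic, rule in RULES.items() if rule(words))

print(classify("The striker scored a late goal in the football final."))
print(classify("The central bank raised interest rates again."))
```

Each rule is an explicit, human-readable condition, so a linguist can inspect and adjust why a category was suggested. That is the main practical difference from a statistical classifier.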


* Master’s degree in Library or Information Science, or equivalent professional experience (e.g. taxonomy, classification, computational linguistics, data science or information architecture).
* Experience using rules-based categorization software, Regex, Natural Language Processing (NLP), and text mining tools.
* Familiarity with one or more query languages (ElasticSearch Query DSL, SQL, Lucene, XQueryFT, Teragram etc.).
* Familiarity with general tagging principles using taxonomies and scope notes.
* Experience with news content or working in the news industry a plus.
* Ability to work independently, while collaborating with remote team members.
* Fluency in English. For the German position: fluency in both English and German.


Starts On
March 27, 2017, 12:04 p.m.