IIPC RSS WEBINAR: Mining web archives for linguistic analysis

By International Internet Preservation Consortium

Date and time

Mon, 20 Aug 2018 07:00 - 08:00 PDT

Location

Online

Description

The IIPC Research Speaker Series (RSS) focuses on the research use of web archives and features presentations of use cases, collaborative projects and new tools for researchers. This webinar introduces two projects that use data from the UK Web Archive and from the web archive collections of the BnF. The presentations will be followed by a Q&A session.


DETECTING SEMANTIC CHANGE IN THE UK WEB ARCHIVE

Barbara McGillivray, Research Fellow at Alan Turing Institute and University of Cambridge

Pierpaolo Basile, Assistant Professor at the Department of Computer Science, University of Bari Aldo Moro, Italy

The project uses data from the UK Web Archive JISC dataset 1996-2013 to develop a system for detecting semantic change in English words. The system is based on distributional semantic models and Temporal Random Indexing, a simple and effective method for building geometrical spaces of concepts from large textual datasets.
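
To make the idea concrete, here is a minimal sketch of Temporal Random Indexing on a toy corpus. It is an illustration only, not the presenters' system. The corpus, the dimensionality, the window size and all function names are assumptions made for the example. The key point it tries to show is that every context word receives one sparse random vector that is reused in all time periods, so the word spaces built for different periods remain directly comparable. A low cosine similarity between a word's vectors in two periods then suggests a change in usage.

```python
import math
import random
from collections import defaultdict

DIM = 200      # dimensionality of the reduced space (assumed value)
NONZERO = 10   # non-zero entries per random vector (assumed value)
WINDOW = 2     # co-occurrence window on each side (assumed value)


def random_index_vector(rng):
    """Sparse ternary vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0.0] * DIM
    for i, pos in enumerate(rng.sample(range(DIM), NONZERO)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec


def build_temporal_spaces(corpus_by_period, seed=0):
    """Return {period: {word: semantic vector}}.

    Random vectors are shared across periods; a word's semantic vector in a
    period is the sum of the random vectors of its neighbours in that period.
    """
    rng = random.Random(seed)
    index_vectors = {}
    spaces = {}
    for period, sentences in corpus_by_period.items():
        space = defaultdict(lambda: [0.0] * DIM)
        for tokens in sentences:
            for i, word in enumerate(tokens):
                lo, hi = max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    ctx = tokens[j]
                    if ctx not in index_vectors:
                        index_vectors[ctx] = random_index_vector(rng)
                    target = space[word]
                    for k, value in enumerate(index_vectors[ctx]):
                        target[k] += value
        spaces[period] = dict(space)
    return spaces


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus: two periods, already tokenised.
corpus = {
    "1996": [["the", "mouse", "ate", "cheese"],
             ["a", "mouse", "ran", "away"],
             ["stable", "word", "here"]],
    "2013": [["click", "the", "mouse", "button"],
             ["plug", "the", "mouse", "cable"],
             ["stable", "word", "here"]],
}

spaces = build_temporal_spaces(corpus)
sim_mouse = cosine(spaces["1996"]["mouse"], spaces["2013"]["mouse"])
sim_stable = cosine(spaces["1996"]["word"], spaces["2013"]["word"])
print("mouse:", round(sim_mouse, 3), "word:", round(sim_stable, 3))
```

In this toy setting, "word" appears in identical contexts in both periods and keeps a similarity close to 1, while "mouse" changes contexts and scores lower. A real system would work on far larger data and would need frequency thresholds and a principled way to decide which drops in similarity are significant.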

NÉONAUTE: MINING WEB ARCHIVES FOR LINGUISTIC ANALYSIS

Emmanuel Cartier and Loïc Galand, Laboratoire d'Informatique de Paris Nord (LIPN)

Néonaute is a project that studies the use of neologisms in French using the web archive collections of the BnF. Initially a one-year project funded by the French Ministry of Culture, it uses a corpus drawn from the daily crawl of around one hundred news sites carried out by the BnF since December 2010 (representing 900 million files and 11 TB of data). Néonaute builds on the existing projects Neoveille and Logoscope, which seek to detect and track the life-cycle of neologisms.

Starting from the full-text index of the news collection built by the BnF, additional analysis and processing steps identify the documents relevant to the project (press articles), retrieve their textual content (boilerplate removal), and enrich the indexes with linguistic information (morphosyntactic analysis) and extracted metadata (named entities, domain assignment). The presentation will discuss the technical challenges and the solutions adopted.

Néonaute includes three use cases:

  • multidimensional analysis of the life-cycle of previously identified neologisms;
  • comparative use of terms recommended by the DGLFLF (General Delegation for the French language and the languages of France, in charge of linguistic policy in France), versus terms already in use (especially Anglicisms);
  • use of feminine-gender forms of terms over the period.

The search engine interface is complemented by an interactive visualization module that lets users explore the life-cycle of terms over the period according to various parameters (article themes, news sources, named entities involved, etc.).
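
As a rough illustration of the kind of measurement behind such a life-cycle view, the sketch below computes how often a recommended term and a competing Anglicism occur per 1,000 tokens, grouped by year or by source. It is not Néonaute's implementation. The records, the site names and the function name are hypothetical, and the pair "courriel" / "email" is used only as a familiar example of a recommended term and an Anglicism.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (year, source, tokenised article text).
articles = [
    (2011, "site_a", ["envoyez", "un", "email", "au", "journal"]),
    (2011, "site_b", ["un", "courriel", "a", "été", "envoyé"]),
    (2012, "site_a", ["le", "courriel", "reste", "rare"]),
    (2012, "site_b", ["son", "email", "et", "un", "autre", "email"]),
]

TERMS = ("courriel", "email")


def relative_frequency(records, terms, key):
    """Occurrences of each term per 1,000 tokens, grouped by key(record)."""
    hits = defaultdict(Counter)
    totals = Counter()
    for record in records:
        group = key(record)
        tokens = record[2]
        totals[group] += len(tokens)
        for token in tokens:
            if token in terms:
                hits[group][token] += 1
    return {
        group: {term: 1000 * hits[group][term] / totals[group] for term in terms}
        for group in sorted(totals)
    }


by_year = relative_frequency(articles, TERMS, key=lambda rec: rec[0])
by_source = relative_frequency(articles, TERMS, key=lambda rec: rec[1])

for year, freqs in by_year.items():
    print(year, freqs)
```

Passing a different `key` function changes the dimension being explored (year, source, or any other metadata attached to a record), which loosely mirrors the parameters mentioned above.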

Organised by

The IIPC is a membership organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage. Web archiving is the process of gathering data that has been published on the World Wide Web, storing it, ensuring that it is preserved in an archive, and making it available for future research. The WARC archival standard, the Heritrix crawler, and the WARC analytic tools are all products of IIPC working groups, projects and initiatives, and they make up the standard tools for archival web capture around the world.
