From the shelf to the web, exploring historical newspapers in the digital age

Lesson about the digitisation and exploration of digital newspapers

Introduction

This lesson examines how digitised newspapers that are available online change the way historians use newspapers as historical sources, and why they call for new skills in applying source criticism.

Animation: From the shelf to the web

An animation on the impact of digital technology on the newspaper as historical source

From Chinese ink to digitised objects, changes in technology are creating a new future for yesterday’s news. This animation illustrates the peculiarities of newspapers as historical sources, how these paper sources have been turned into digital archives and how methods for enriching newspapers have opened up new ways of discovering content.

Assignments (4)


1 out of 4 — Digitisation and how computers learn to read?

The process of digitisation begins by scanning a physical newspaper in order to produce an image of each page. Since images as such are not searchable, the letters in the text have to be made recognisable. This is done through a technique known as optical character recognition (OCR), in which software recognises the image of a character and converts it into a digital code that represents that single character.

For this to happen, two sub-steps have to be taken:

  1. Binarisation of the image’s colours, i.e. the transformation of the image of a page into just two colours: black and white. This simplifies the image of the page and increases the contrast between dark and light sections, thereby making the individual characters stand out from the page background.

  2. Classification of the letters. After the individual characters have been delineated, they have to be assembled into words. In order to do so, the software first compares the captured letters to known fonts and selects the font that is most likely to fit.

These are features that can be considered:

  • Are these letters from a Latin alphabet or an Arabic one?
  • Are the letters in italic or bold?
  • Is the font Times New Roman or Comic Sans MS?

A similar detection mechanism then identifies the language and compares the words that are found to a corresponding dictionary. The output of the OCR processing is a transcribed, machine-readable text. We now have a digitised newspaper: the image of the page and its text.
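
The binarisation step described above can be illustrated with a minimal sketch. It assumes a fixed global threshold of 128 and a tiny made-up patch of grayscale values (0 = black, 255 = white); the function name `binarise` and the sample data are hypothetical. OCR software generally chooses its threshold adaptively, so this only illustrates the principle and does not show how any particular tool works.

```python
def binarise(gray, threshold=128):
    """Map every grayscale pixel to pure black (0) or pure white (255).

    `gray` is a list of rows; each value runs from 0 (black) to 255 (white).
    The fixed threshold is an assumption made for this illustration.
    """
    return [[0 if pixel < threshold else 255 for pixel in row] for row in gray]


# Hypothetical 3x4 patch of a scanned page: dark ink on yellowed paper.
patch = [
    [200, 190, 60, 210],
    [195, 40, 50, 205],
    [185, 180, 70, 198],
]

binary = binarise(patch)

# Print ink as '#' and background as '.' to make the character shape visible.
for row in binary:
    print("".join("#" if pixel == 0 else "." for pixel in row))
```

After this step the greyish paper background and the uneven ink are reduced to two values, which is what makes the outline of each character easier to delineate.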

Instructions

20 min

1.a Font recognition

1.b OCR and Gothic font

1.c Improvement of OCR quality

Reading/viewing suggestions

2 out of 4 — Catching the flu with article segmentation

To make digitised newspapers searchable, a further process is needed: article segmentation. Newspapers tend to publish unrelated texts on the same page, and single articles are often divided into parts that are printed on different pages. To retrieve a meaningful version of an article, it is necessary to know which characters and words belong to the same article. This “understanding” is made possible through the technique of article segmentation. While OCR tells us which letters and words appear on which page of the newspaper, article segmentation, the technique of dividing a page into smaller units, tells us which letters and words on a page belong to the same unit.
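
The difference between page-level and article-level information can be sketched as follows. This example assumes that each recognised word already carries a page number and a segment label; the labels "A" and "B", the sample words and the function names are all hypothetical. Real segmentation has to derive such labels from the page layout, which is the hard part and is not shown here.

```python
from collections import defaultdict

# Hypothetical OCR output: (page number, segment label, word).
# Segment "A" starts on page 1 and continues on page 3.
ocr_words = [
    (1, "A", "Influenza"), (1, "A", "spreads"),
    (1, "B", "Market"), (1, "B", "prices"),
    (3, "A", "in"), (3, "A", "Madrid"),
]


def text_by_page(words):
    """What OCR alone provides: the words found on each page."""
    pages = defaultdict(list)
    for page, _segment, word in words:
        pages[page].append(word)
    return {page: " ".join(tokens) for page, tokens in pages.items()}


def text_by_article(words):
    """What segmentation adds: the words grouped per article, across pages."""
    articles = defaultdict(list)
    for _page, segment, word in words:
        articles[segment].append(word)
    return {segment: " ".join(tokens) for segment, tokens in articles.items()}


pages = text_by_page(ocr_words)
articles = text_by_article(ocr_words)
print(pages)
print(articles)
```

Without the segment labels, the two unrelated texts on page 1 run together and the continuation on page 3 is cut off from its beginning.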

After applying OCR and article segmentation, the final step is to make the digitised newspapers available to the public. The extracted texts and the images of the newspapers are published on the web and can be viewed and searched using a carefully designed interface. However, article segmentation is not always part of the digitisation process, and a user often only discovers this after conducting a search. You will notice the difference between searching digital newspaper archives where this principle has been applied and archives where it is missing.

Instructions

20 min

2.a How to find an article that deals with the Spanish flu

2.b Queries on the Spanish flu: different countries, different reactions?

Reading/viewing suggestions

3 out of 4 — Using digitised newspaper collections in practice

After taking measures to guarantee good-quality OCR, another concern is the quality of the text retrieval by the search engine and the interaction with the user through the interface. To find and read specific articles, a system has to be built and an interface has to be designed that enable you to query the database in which the digitised newspapers are stored. The database therefore needs rich information on each of the digitised objects it contains. This information is called metadata, literally “information about data”. In general, the most basic metadata elements for books and articles are the author, the title of the publication, the date of publication and the language of the document. For newspapers, the available information is usually limited to the title of the newspaper and the date of publication of the issue.

To turn all the digitised material into an online search environment, we need the following:

  • a database where the output of the digitisation is stored (the image of each newspaper page, the OCR and the metadata),
  • an interface where you can type your query and browse the results,
  • a search engine that will receive your query from the interface and look into the database to find relevant results.

The list of results that appears on your screen is the product of the interaction between these three elements. Any errors or missing elements may be the result of a problem arising in any one of these elements. As we have seen, digitisation itself can also yield errors. Furthermore, it is important to bear in mind that because we are dealing with historical texts, spellings might have changed over time or misspellings may have occurred in the historical source itself.
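
The interplay between database, search engine and query can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how any actual newspaper portal is built: the records, newspaper titles and dates are invented, the "database" is a plain list, and the "search engine" is a simple word index that requires every query term to match exactly.

```python
from collections import defaultdict

# "Database": hypothetical records holding minimal metadata and OCR text.
# Record 2 contains an invented OCR error ("f1u" instead of "flu").
records = [
    {"id": 1, "title": "Example Gazette", "date": "1918-10-12",
     "text": "The Spanish flu reaches the city"},
    {"id": 2, "title": "Example Gazette", "date": "1918-10-13",
     "text": "Spanish f1u closes schools"},
    {"id": 3, "title": "Sample Courier", "date": "1918-11-02",
     "text": "Harvest prices rise"},
]


def build_index(records):
    """Map each lower-cased word to the ids of the records that contain it."""
    index = defaultdict(set)
    for record in records:
        for token in record["text"].lower().split():
            index[token].add(record["id"])
    return index


def search(index, query):
    """Return the ids of records that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


index = build_index(records)
print(search(index, "Spanish"))      # matches records 1 and 2
print(search(index, "Spanish flu"))  # record 2 is missed because of the OCR error
```

The second query shows how an error introduced during digitisation can make a relevant issue invisible in the result list, even though the database, the index and the query all behave as designed.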

Instructions

20 min

3.a What is in the database or where is the flu hiding?

Reading/viewing suggestions

4 out of 4 — Looking for Robert Schuman(n)

Names of people and locations are more prone to OCR mistakes, as they cannot be found in the dictionaries that are used to recognise printed words. This means that if the software used for digitisation has not integrated a dictionary of names, an accurate match for a query of a name relies completely on the correct identification of each single letter of the name.
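
The consequence for searching can be illustrated with a small sketch that contrasts exact matching with approximate matching. It uses `difflib.get_close_matches` from the Python standard library; the list of tokens, the misreadings in it and the similarity cutoff of 0.75 are assumptions chosen for the illustration, and the helper `find_name_variants` is hypothetical. Archive search engines may handle name variants quite differently, or not at all.

```python
import difflib

# Hypothetical tokens taken from OCR text, including invented misreadings
# of "Schuman" (e.g. "m" read as "rn", "c" read as "e").
tokens = ["Robert", "Schuman", "Schurnan", "Sehuman", "Schumann",
          "Luxembourg", "minister"]


def find_name_variants(name, tokens, cutoff=0.75):
    """Return the distinct tokens whose similarity to `name` reaches `cutoff`."""
    candidates = sorted(set(tokens))
    # n must be positive, so fall back to 1 when there are no candidates.
    limit = max(1, len(candidates))
    return difflib.get_close_matches(name, candidates, n=limit, cutoff=cutoff)


exact = [token for token in tokens if token == "Schuman"]
variants = find_name_variants("Schuman", tokens)

print(exact)     # only the correctly recognised spelling
print(variants)  # also the misread forms and the similar name "Schumann"
```

Approximate matching recovers the misread forms, but it also returns "Schumann", which may refer to a different person. Each hit therefore still has to be checked against the source.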

Instructions

20 min

4.a How can we identify articles about “Robert Schuman”?

4.b Collecting articles on Robert Schuman(n)

4.c Looking for Robert Schuman in Luxembourg

Reading/viewing suggestions