From the shelf to the web, exploring historical newspapers in the digital age

Assignments (4)

1 out of 4: Digitisation and how computers learn to read?

The process of digitisation begins by scanning a physical newspaper in order to produce an image of each page. Since images as such are not searchable, the letters in the text have to be made recognisable. This is done through a technique known as optical character recognition (OCR), with software that is able to understand the image of a character and turn it into a digital entity that represents a single character.

For this to happen, two sub-steps have to be taken:

Binarisation of the image’s colours, i.e. the transformation of the image of a page into just two colours: black and white. This simplifies the image of the page and increases the contrast between dark and light sections, thereby making the individual characters stand out from the page background.

Classification of the letters. After the individual characters have been delineated, they have to be assembled into words. In order to do so, the software first compares the captured letters to known fonts and selects the font that is most likely to fit.
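The binarisation step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a fixed global threshold of 128 and a tiny made-up grid of grayscale values; actual OCR software typically relies on more adaptive methods, and the function name `binarise` is purely illustrative.

```python
def binarise(gray, threshold=128):
    """Map each grayscale pixel (0 = black, 255 = white) to pure black or white."""
    return [[0 if pixel < threshold else 255 for pixel in row] for row in gray]


# A made-up 3x5 "scan": dark ink (low values) on a greyish paper background.
page = [
    [200, 60, 190, 70, 210],
    [180, 40, 175, 50, 185],
    [195, 55, 205, 65, 200],
]

binary = binarise(page)

# Print the result with "#" for ink and "." for background.
for row in binary:
    print("".join("#" if pixel == 0 else "." for pixel in row))
```

After this step every pixel is either ink or background, which is what allows the character outlines to be separated from the page.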

These are features that can be considered:

Are these letters from a Latin alphabet or an Arabic one?

Are the letters in italic or bold?

Is the font Times New Roman or Comic Sans MS?

A similar detection mechanism then identifies the language and compares the words that are found to a corresponding dictionary. The output of the OCR processing is a transcribed machine readable text. We now have a digitised newspaper: the image of the page and its text.
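The dictionary comparison can be illustrated with a small sketch. It assumes a hypothetical five-word lexicon and uses string similarity from Python's standard `difflib` module as a stand-in for whatever matching a real OCR engine performs; the cutoff of 0.8 is an arbitrary choice for the example.

```python
import difflib

# Hypothetical mini-dictionary; a real system would use a full word list.
LEXICON = ["Wittwe", "Franzosen", "Republik", "Monats", "Kalenderstyls"]


def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest dictionary word, or the word unchanged if none is close."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("Wittwr"))      # → Wittwe
print(correct_word("Antoinette"))  # → Antoinette (not in the lexicon, left as is)
```

A word that is absent from the dictionary is left untouched, so its correctness depends entirely on how well each individual character was recognised.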

Instructions

20 Min

1.a Font recognition

1.b OCR and Gothic font

1.c Improvement of OCR quality

This is the scanned image of the front page of the Neue Zürcher Zeitung (NZZ) published on 26.10.1793 in Zürich, Switzerland. It reports on the trial and execution of Louis XVI’s widow Marie Antoinette in October 1793.

The archives of the NZZ were entirely digitised for the first time in 2005, using the microfilms of newspapers to produce scans that were then OCRed. The result of this process proved to be imperfect, especially for earlier texts that were published in Gothic font.

As part of the impresso project, referred to in the clip of this lesson, Phillip Ströbel and Simon Clematide from the University of Zurich experimented with software originally developed for recognising handwritten text, in order to improve the quality of the OCR on Gothic fonts.

The two outputs of the OCR are shown below. Compare them and answer the questions.

A. First lines of the front page article of the 26.10.1793 issue of the NZZ

B. OCR output in 2005

Prozeß der Marie Antoinette. Nachdem dieselbe am i g. Weinm. alten StvlS, oder am rz. des ersten Monat« im 2,en Jahre der Republik neuen KaleuderstplS, in den Audienzsaal eingesührt wurde, und sie sich auf den Sessel niedergelassen hatte- fragte sie der Präsident: Wie sie heisse? „ Ich nenne mich, antwortete sie, Marie Antoinette von Lotharingen. Oestreich re. — Wer seyd ihr ?. Ich bin dir Wittwr Ludwig Capet«, ehemaligen Königs der Frauzo« seu.— Wie alt? Z8 Jahre. — Nun wurde von demGe-richtsschreiber die Auklagsakte vorgelesen. Darin» heißt e«,daß aus den dem Tribunale rnhandengestellten Schriften erhellet ‘Daß gleich den Messalinen Brunehaut, Fredegoude»nd Medizi«, die man einstKöniainnea von Frankreich genannt habe, und deren verhaßte Namennie au« de» Jahrbüchern der Geschichte werden vertilgt werde» , Marie Antoinette , Ludwig Capets Wittwr, feit ihrem Aufenthalte inFrankreich die Plage und Blotfaugeriun der Franzosen gewesen; daß sie” noch vor der glücklichen Revoluzion

C. Improved OCR output in 2019

Prozeß der Marie Antoinette. Nachdem dieselbe am 15. Weinm. alten Styls, oder am 23. des ersten Monats im 2ten Jahre der Republik neuen Kalenderstyls, in den Audienzsaal eingeführt wurde, und sie sich auf den Sessel niedergelassen hatte, fragte sie der Präsident: Wie sie heisse? „ Ich nenne mich, antwortete sie, Marie Antoinette von Lotharingen- Oestreich ic. — Wer seyd ihr ?. Ich bin die Wittwe Ludwig Capets, ehemaligen Königs der Franzosen.— Wie alt? 38 Jahre. — Nun wurde von dem Gerichtsschreiber die Anklagsakte vorgelesen. Darinn heißt es, daß aus en dem Tribunale zuhandengestellten Schriften erhelle: Daß gleich den Messalinen Brunehaut, Fredegonde und Medizis, die man einst Königinnen von Frankreich genaunt habe, und deren verhaßte Namen nie aus den Jahrbüchern der Geschichte werden vertilgt werden, Marie Antoinette, Ludwig Capets Wittwe, seit ihrem Aufenthalte in Frankreich die Plage und Blutsaugerinn der Franzosen gewesen; daß sie noch vor der glückichen Revoluzion

How was the word “Wittwe” recognised in 2005 and 2019?
What differences do you notice in the recognition of numbers between the 2005 and 2019 outputs?

Now have a look at the manual transcription of the same passage, and compare this to how the numbers were recognised in the 2005 and 2019 outputs.

A. Manual Transcription

Prozeß der Marie Antoinette. Nachdem dieselbe am 15. Weinm. alten Styls, oder am 23. des ersten Monats im 2tem Jahre der Republik neuen Kalenderstyls, in den Audienzsaal eingeführt wurde, und sie sich auf dem Sessel niederlassen hatte, fragte sie der Präsident: Wie sie heisse? “Ich nenne mich, antwortete Sie, Marie Antoinette von Lotharingen-Oestreich - Wer seyd ihr? Ich bin die Wittwe Ludwig Capets, ehemaligen König der Franzosen. - Wie alt? 38 Jahre. - Nun wurde von dem Gerichtsschreiber die Anklagsakte vorgelesen. Darinn heißt es daß aus den dem Tribunale zuhandengestellten Schriften erhelle: Daß gleich den Messalinen Brunehaus, Fredegonde und Medizis, die man einst Königin von Frankreich genannt habe, und deren verhaßte Namen nie aus den Jahrbüchern der Geschichte werden vertilgt werden, Marie Antoinette, Ludwig Capets Wittwe, seit ihrem Aufenthalte in Frankreich die Plage und Blutsaugerinn der Franzosen gewesen: daß sie noch der glücklichen Revoluzion,
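A comparison against a manual transcription can also be expressed as a number. The sketch below is one possible way of doing so, not the method used by the project: it computes the edit distance (the number of single-character insertions, deletions and substitutions) between an OCR reading and the manual transcription, and divides it by the length of the transcription. The word used is taken from the three versions of the passage; the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(ocr, reference):
    """Edit distance relative to the length of the reference transcription."""
    return edit_distance(ocr, reference) / len(reference)


reference = "Kalenderstyls"  # manual transcription
ocr_2005 = "KaleuderstplS"   # 2005 OCR output
ocr_2019 = "Kalenderstyls"   # 2019 OCR output

print(character_error_rate(ocr_2005, reference))
print(character_error_rate(ocr_2019, reference))
```

The lower the value, the closer the OCR reading is to the manual transcription; a value of 0 means the two are identical.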

Would you have been able to find this article on the basis of the first OCR if you had searched with the following keywords? Explain why for each case: “Marie Antoinette”, “Revolution”.

Reading/viewing suggestions

2 out of 4: Catching the flu with article segmentation

To make digitised newspapers searchable, a further process is needed: article segmentation. Newspapers tend to publish texts on unrelated topics on the same page, and single articles are often split into parts that are printed on different pages. To retrieve a meaningful version of an article, it is necessary to understand which characters and words belong to the same article. This “understanding” is made possible through the technique of article segmentation. While OCR gives us information on which letters and words appear on which page of the newspaper, article segmentation, the technique of dividing a page into smaller units, gives us information on which letters and words on a page belong to the same unit.
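The result of article segmentation can be pictured as a label attached to every block of OCR text. The following sketch assumes that such labels already exist (how they are produced is not shown) and merely reassembles the blocks into articles, including an article that continues on a later page. The class `TextBlock`, the article labels and the sample sentences are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextBlock:
    page: int        # page on which the block was printed
    order: int       # reading order of the block on its page
    article_id: str  # label assigned by a (hypothetical) segmentation step
    text: str        # OCR text of the block


def assemble_articles(blocks):
    """Group OCR text blocks by article and join them in page and reading order."""
    grouped = defaultdict(list)
    for block in sorted(blocks, key=lambda b: (b.page, b.order)):
        grouped[block.article_id].append(block.text)
    return {article_id: " ".join(parts) for article_id, parts in grouped.items()}


blocks = [
    TextBlock(1, 1, "flu", "The influenza epidemic is spreading."),
    TextBlock(1, 2, "market", "Grain prices rose again this week."),
    TextBlock(3, 1, "flu", "Schools will remain closed until further notice."),
]

articles = assemble_articles(blocks)
print(articles["flu"])
```

Without the `article_id` labels, a search could only report that a word occurs somewhere on page 1 or page 3, not that both passages belong to the same article.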

After applying OCR and article segmentation, the final step is to make the digitised newspapers available to the public. The extracted texts and the images of the newspapers are published on the web and can be viewed and searched using a carefully designed interface. Article segmentation is, however, not always part of the digitisation process, and users often only notice its absence after conducting a search. You will notice the difference between searching digital newspaper archives where this principle has been applied and those where it is missing.

Instructions

20 Min

2.a How to find an article that deals with the Spanish flu

2.b Queries on the Spanish flu: different countries, different reactions?

Reading/viewing suggestions

3 out of 4: Using digitised newspaper collections in practice

After taking measures to guarantee good quality OCR, another concern is the quality of the text retrieval by the search engine and the interaction with the audience through the interface. To find and read specific articles, a system has to be built and an interface has to be designed to enable you to query the database and access the place where the digitised newspapers are stored. The database therefore needs rich information on each of the digitised objects it contains. This information is called metadata, literally “information about data”. In general the most basic elements for books and articles are the author, the title of the publication, the date of publication and the language of the document. For newspapers, the available information is usually limited to the title of the newspaper and the date of publication of the issue.

To turn all the digitised material into an online search environment, we need the following:

a database where the output of the digitisation is stored (the image of each newspaper page, the OCR and the metadata),

an interface where you can type your query and browse the results,

a search engine that will receive your query from the interface and look into the database to find relevant results.

The list of results that appears on your screen is the product of the interaction between these three elements. Any errors or missing elements may be the result of a problem arising in any one of these elements. As we have seen, digitisation itself can also yield errors. Furthermore, it is important to bear in mind that because we are dealing with historical texts, spellings might have changed over time or misspellings may have occurred in the historical source itself.
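The interplay of database, search engine and interface can be sketched as a toy example. Everything in it is an assumption made for illustration: the record structure, the image file names, the second newspaper ("Example Gazette") and the simple substring matching, which is far cruder than what a real search engine does.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PageRecord:
    newspaper: str    # metadata: title of the newspaper
    issue_date: date  # metadata: date of publication of the issue
    image_file: str   # reference to the scanned page image (hypothetical name)
    ocr_text: str     # output of the OCR


# "Database": a list of records standing in for the stored digitisation output.
DATABASE = [
    PageRecord("Neue Zürcher Zeitung", date(1793, 10, 26), "nzz_1793-10-26_p1.png",
               "Prozeß der Marie Antoinette. Nachdem dieselbe am 15. Weinm. alten Styls"),
    PageRecord("Example Gazette", date(1918, 10, 12), "gazette_1918-10-12_p1.png",
               "The Spanish influenza has reached the city"),
]


def search(query, start=None, end=None):
    """'Search engine': case-insensitive substring match, optionally limited by date."""
    hits = []
    for record in DATABASE:
        if start is not None and record.issue_date < start:
            continue
        if end is not None and record.issue_date > end:
            continue
        if query.lower() in record.ocr_text.lower():
            hits.append(record)
    return hits


def show_results(query, **filters):
    """'Interface': pass the query on and format the hits for display."""
    return [f"{r.newspaper}, {r.issue_date.isoformat()}" for r in search(query, **filters)]


print(show_results("marie antoinette"))
print(show_results("influenza", start=date(1919, 1, 1)))
```

The second query returns nothing, not because the word is missing from the text but because the date filter excludes the only matching issue, which shows how an empty result list can originate in any of the three elements.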

Instructions

20 Min

3.a What is in the database or where is the flu hiding?

Reading/viewing suggestions

4 out of 4: Looking for Robert Schuman(n)

Names of people and locations are more prone to OCR mistakes, as they cannot be found in the dictionaries that are used to recognise printed words. This means that if the software used for digitisation has not integrated a dictionary of names, an accurate match for a query of a name relies completely on the correct identification of each single letter of the name.
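The dependence on every single letter can be made concrete with a short sketch. It contrasts exact matching with an approximate match based on string similarity from Python's standard `difflib` module. The token list, the misread form "Sehuman" and the threshold of 0.8 are invented for the example, and nothing here reflects how a particular newspaper archive actually searches.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity between two strings, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_name(query, tokens, threshold=0.8):
    """Return all tokens whose similarity to the query reaches the threshold."""
    return [token for token in tokens if similarity(query, token) >= threshold]


# Invented OCR tokens: "Sehuman" stands for a misread "Schuman".
tokens = ["Robert", "Sehuman", "Schumann", "Schuman", "Luxembourg"]

exact = [token for token in tokens if token == "Schuman"]
fuzzy = find_name("Schuman", tokens)

print(exact)  # → ['Schuman']
print(fuzzy)  # → ['Sehuman', 'Schumann', 'Schuman']
```

The approximate search recovers the misread name, but it also returns "Schumann", so the hits still need to be checked by hand to see which person each article is about.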

Question	resource 1	resource 2
Type of article (event, opinion, human interest)		
Does the collection cover the years in which the Spanish flu spread in Europe?		
Are there peaks in news coverage?		
Is there a dominant frame?		
Are there specific features in the interface that limit your search output?		

Instructions

4.a How can we identify articles about “Robert Schuman”?

4.b Collecting articles on Robert Schuman(n)

4.c Looking for Robert Schuman in Luxembourg

Reading/viewing suggestions
