Curated by Marten Duering, Estelle Bunout and Stefania Scagliola
Lesson about the digitisation and exploration of digital newspapers
A lesson about how digitised newspapers available online are changing the way historians use newspapers as historical sources, and how they demand new skills in applying source criticism.
Newspapers are imperfect recorders of history, yet they are a key asset for historical research. The digitisation of newspapers and their availability online have greatly broadened the range of material within reach of historians. At the same time, historians have to be conscious of what lies behind the search results that appear on their screen. The impression of ‘completeness’ can be misleading, as a digitisation project is always determined by particular choices. Digitisation also introduces some hurdles. Searching for articles on a given topic can be hindered by mistakes made during optical character recognition (OCR). This lesson teaches how various digital technologies have changed the way we can access newspapers as historical sources, and which new questions we need to ask.
The lesson consists of two modules with increasing levels of complexity and time required.
The SMALL offers an animation lasting 7 minutes and 58 seconds, and is intended to be accessible to a broad audience. This is followed by a quiz that takes around 12 minutes.
The MEDIUM offers a series of four assignments that are suitable for collaborative work for two or three students. The time required varies from 30 to 60 minutes.
Suggestions for background information:
For a technical and historical overview and information about some applications of optical character recognition (OCR), take a look at the Wikipedia article: https://en.wikipedia.org/wiki/Optical_character_recognition
To understand how optical character recognition (OCR) identifies characters and words as a single entity (known as pattern recognition), look at the explanation by Aryaman Sharda: https://www.youtube.com/watch?v=cAkklvGE5io
For a more detailed explanation of each step of the optical character recognition (OCR) process, look at the interview with Professor Steve Simske by Computerphile: https://www.youtube.com/watch?v=ZNrteLp_SvY
For an explanation of the process of binarisation, watch Prof. Simske’s lecture: https://youtu.be/ZNrteLp_SvY?t=149
The process of digitisation begins by scanning a physical newspaper in order to produce an image of each page. Since images as such are not searchable, the letters in the text have to be made recognisable. This is done through a technique known as optical character recognition (OCR), with software that is able to understand the image of a character and turn it into a digital entity that represents a single character.
For this to happen, two sub-steps have to be taken:
Binarisation of the image’s colours, i.e. the transformation of the image of a page into just two colours: black and white. This simplifies the image of the page and increases the contrast between dark and light sections, thereby making the individual characters stand out from the page background.
Classification of the letters. After the individual characters have been delineated, they have to be assembled into words. In order to do so, the software first compares the captured letters to known fonts and selects the font that is most likely to fit.
These are features that can be considered:
- Are these letters from a Latin alphabet or an Arabic one?
- Are the letters in italic or bold?
- Is the font Times New Roman or Comic Sans MS?
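The two sub-steps described above can be illustrated with a toy sketch. It is not how any particular OCR product works: the threshold (the mean grey value of the image), the 3x3 letter templates and all function names are assumptions made for illustration, and the sketch matches a glyph against letter shapes of a single imaginary font rather than choosing between fonts.

```python
from typing import Dict, List

Grid = List[List[int]]

def binarise(gray: Grid) -> Grid:
    """Turn grey values (0 = black, 255 = white) into 1 (ink) or 0 (background).

    Assumption: the mean grey value of the image is used as the threshold.
    """
    pixels = [p for row in gray for p in row]
    threshold = sum(pixels) / len(pixels)
    return [[1 if p < threshold else 0 for p in row] for row in gray]

def mismatch(a: Grid, b: Grid) -> int:
    """Count the pixels on which two equally sized binary grids differ."""
    return sum(x != y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def classify(glyph: Grid, templates: Dict[str, Grid]) -> str:
    """Return the label of the template that differs least from the glyph."""
    return min(templates, key=lambda label: mismatch(glyph, templates[label]))

# Hypothetical 3x3 letter shapes standing in for a known font.
TEMPLATES: Dict[str, Grid] = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# A made-up scan of the letter "L" with one smudged pixel (value 90).
scan: Grid = [
    [40, 230, 220],
    [35, 225, 90],
    [30, 45, 50],
]

binary = binarise(scan)
print(binary)
print(classify(binary, TEMPLATES))
```

In this example the smudge survives binarisation as a stray ink pixel, yet the glyph is still closest to the "L" template. With a more damaged page, or a typeface that differs from the templates, the closest match may well be the wrong letter.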
A similar detection mechanism then identifies the language and compares the words that are found to a corresponding dictionary. The output of the OCR processing is a transcribed machine-readable text. We now have a digitised newspaper: the image of the page and its text.
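The comparison with a dictionary can be sketched as follows. This is a simplified illustration, assuming a tiny hand-made word list and using the string similarity measure from Python's standard `difflib` module as a stand-in for whatever a real OCR engine does; `LEXICON`, `suggest` and the cutoff value are hypothetical.

```python
import difflib
from typing import List, Optional

# Hypothetical miniature dictionary; a real one would hold many thousands of words.
LEXICON: List[str] = ["Monats", "Franzosen", "Königinnen", "Kalenderstyls", "Anklagsakte"]

def suggest(token: str, lexicon: List[str] = LEXICON, cutoff: float = 0.8) -> Optional[str]:
    """Return the token itself if it is a known word, else the closest known word.

    Returns None when no dictionary word is similar enough (assumed cutoff: 0.8).
    """
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest("Auklagsakte"))  # a misread word from a historical page
print(suggest("Zürich"))       # a word that is missing from the dictionary
```

A word that is absent from the dictionary, such as a place name, cannot be corrected this way.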
Computerphile, a channel dedicated to explaining computer science topics to a lay audience, published an interview in 2017 with Professor Steve Simske, an expert on OCR, in which he explains the underlying principles of OCR software. In the following excerpt he explains how the classification of fonts works. Watch this passage from 10’10’’ to 12’47’’.
The core principle of classification: what is needed for a word to be matched with a particular font? (Choose two elements from the four below.)
Some fonts are more difficult to process than others. A recurring difficulty that arises with historical texts is the recognition of texts in Gothic font. Open this link and compare the facsimile with the OCR text:
This is the scanned image of the front page of the Neue Zürcher Zeitung (NZZ) published on 26.10.1793 in Zürich, Switzerland. It reports on the trial and execution of Louis XVI’s widow Marie Antoinette in October 1793.
The archives of the NZZ were entirely digitised for the first time in 2005, using the microfilms of newspapers to produce scans that were then OCRed. The result of this process proved to be imperfect, especially for earlier texts that were published in Gothic font.
As part of the impresso project, referred to in the clip of this lesson, Phillip Ströbel and Simon Clematide from the University of Zurich have experimented with software developed to recognise handwritten text to improve the quality of the OCR on Gothic fonts.
The two outputs of the OCR are shown below. Compare them and answer the questions.
|A. First lines of the front page article of the 26.10.1793 issue of the NZZ|
|B. OCR output in 2005|
|Prozeß der Marie Antoinette. Nachdem dieselbe am i g. Weinm. alten StvlS, oder am rz. des ersten Monat« im 2,en Jahre der Republik neuen KaleuderstplS, in den Audienzsaal eingesührt wurde, und sie sich auf den Sessel niedergelassen hatte- fragte sie der Präsident: Wie sie heisse? „ Ich nenne mich, antwortete sie, Marie Antoinette von Lotharingen. Oestreich re. — Wer seyd ihr ?. Ich bin dir Wittwr Ludwig Capet«, ehemaligen Königs der Frauzo« seu.— Wie alt? Z8 Jahre. — Nun wurde von demGe-richtsschreiber die Auklagsakte vorgelesen. Darin» heißt e«,daß aus den dem Tribunale rnhandengestellten Schriften erhellet ‘Daß gleich den Messalinen Brunehaut, Fredegoude»nd Medizi«, die man einstKöniainnea von Frankreich genannt habe, und deren verhaßte Namennie au« de» Jahrbüchern der Geschichte werden vertilgt werde» , Marie Antoinette , Ludwig Capets Wittwr, feit ihrem Aufenthalte inFrankreich die Plage und Blotfaugeriun der Franzosen gewesen; daß sie” noch vor der glücklichen Revoluzion|
|improved OCR output in 2019|
|Prozeß der Marie Antoinette. Nachdem dieselbe am 15. Weinm. alten Styls, oder am 23. des ersten Monats im 2ten Jahre der Republik neuen Kalenderstyls, in den Audienzsaal eingeführt wurde, und sie sich auf den Sessel niedergelassen hatte, fragte sie der Präsident: Wie sie heisse? „ Ich nenne mich, antwortete sie, Marie Antoinette von Lotharingen- Oestreich ic. — Wer seyd ihr ?. Ich bin die Wittwe Ludwig Capets, ehemaligen Königs der Franzosen.— Wie alt? 38 Jahre. — Nun wurde von dem Gerichtsschreiber die Anklagsakte vorgelesen. Darinn heißt es, daß aus en dem Tribunale zuhandengestellten Schriften erhelle: Daß gleich den Messalinen Brunehaut, Fredegonde und Medizis, die man einst Königinnen von Frankreich genaunt habe, und deren verhaßte Namen nie aus den Jahrbüchern der Geschichte werden vertilgt werden, Marie Antoinette, Ludwig Capets Wittwe, seit ihrem Aufenthalte in Frankreich die Plage und Blutsaugerinn der Franzosen gewesen; daß sie noch vor der glückichen Revoluzion|
Now have a look at the manual transcription of the same passage, and compare this to how the numbers were recognised in the 2005 and 2019 outputs.
|A. Manual Transcription|
|Prozeß der Marie Antoinette. Nachdem dieselbe am 15. Weinm. alten Styls, oder am 23. des ersten Monats im 2tem Jahre der Republik neuen Kalenderstyls, in den Audienzsaal eingeführt wurde, und sie sich auf dem Sessel niederlassen hatte, fragte sie der Präsident: Wie sie heisse? “Ich nenne mich, antwortete Sie, Marie Antoinette von Lotharingen-Oestreich - Wer seyd ihr? Ich bin die Wittwe Ludwig Capets, ehemaligen König der Franzosen. - Wie alt? 38 Jahre. - Nun wurde von dem Gerichtsschreiber die Anklagsakte vorgelesen. Darinn heißt es daß aus den dem Tribunale zuhandengestellten Schriften erhelle: Daß gleich den Messalinen Brunehaus, Fredegonde und Medizis, die man einst Königin von Frankreich genannt habe, und deren verhaßte Namen nie aus den Jahrbüchern der Geschichte werden vertilgt werden, Marie Antoinette, Ludwig Capets Wittwe, seit ihrem Aufenthalte in Frankreich die Plage und Blutsaugerinn der Franzosen gewesen: daß sie noch der glücklichen Revoluzion,|
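One way to make such a comparison measurable is to count how many single-character edits separate an OCR output from the manual transcription. The sketch below uses one short phrase taken from the three texts above and the Levenshtein (edit) distance, a standard measure chosen here for illustration; the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def character_error_rate(ocr: str, reference: str) -> float:
    """Edit distance divided by the length of the reference transcription."""
    return levenshtein(ocr, reference) / len(reference)

reference = "Ich bin die Wittwe Ludwig Capets"  # manual transcription
ocr_2005 = "Ich bin dir Wittwr Ludwig Capet«"   # OCR output in 2005
ocr_2019 = "Ich bin die Wittwe Ludwig Capets"   # improved OCR output in 2019

print(character_error_rate(ocr_2005, reference))  # 3 errors in 32 characters
print(character_error_rate(ocr_2019, reference))  # identical to the reference
```

For this phrase the 2005 output contains three wrong characters out of 32, roughly 9%, while the 2019 output matches the manual transcription. A single phrase says little about a whole page, so the figures should not be generalised.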
To make digitised newspapers searchable a further process is also needed: article segmentation. Newspapers tend to publish texts on topics that are not related to each other on the same page. Single articles are also often divided up into parts that are printed on different pages. To retrieve a meaningful version of an article, it is necessary to understand which characters and words belong to the same article. This “understanding” is made possible through the technique of article segmentation. While OCR gives us information on which letters and words appear on which page of the newspaper, article segmentation, the technique of scaling a page down into smaller units, gives us information on which letters and words on a page belong to the same unit.
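The difference between knowing which words are on a page and knowing which words belong to one article can be sketched with a small data structure. Everything here is invented for illustration: the `Block` fields, the article labels (which a real segmentation step would have to produce) and the sample texts.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Block:
    page: int        # page on which the text block was found (from OCR)
    order: int       # reading order of the block on that page
    article_id: str  # label assigned by a hypothetical segmentation step
    text: str

def assemble(blocks: List[Block]) -> Dict[str, str]:
    """Join the blocks of each article in page and reading order."""
    articles: Dict[str, List[str]] = defaultdict(list)
    for block in sorted(blocks, key=lambda b: (b.page, b.order)):
        articles[block.article_id].append(block.text)
    return {article_id: " ".join(parts) for article_id, parts in articles.items()}

def contains_all(text: str, terms: List[str]) -> bool:
    """True if every search term occurs somewhere in the text."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)

# Made-up blocks: article "a1" starts on page 1 and continues on page 3.
blocks = [
    Block(1, 1, "a1", "Influenza spreads in the city."),
    Block(1, 2, "a2", "Grain prices rise again."),
    Block(3, 1, "a1", "Schools remain closed."),
]

articles = assemble(blocks)
page_one = " ".join(b.text for b in blocks if b.page == 1)

print(contains_all(page_one, ["influenza", "grain"]))  # page-level search
print(any(contains_all(t, ["influenza", "grain"]) for t in articles.values()))  # article-level search
```

Searching page 1 as a whole finds both terms, although they come from two unrelated articles. Searching the assembled articles finds no article that contains both, and the article that continues on page 3 can be read as one text.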
After applying OCR and article segmentation, the final step is to make the digitised newspapers available to the public. The extracted texts and the images of the newspapers are published on the web and can be viewed and searched using a carefully designed interface. Article segmentation is, however, not always part of the digitisation process, and users often notice its absence only after conducting a search. You will notice the difference between searching digital newspaper archives where this principle has been applied and those where it is missing.
To understand the principle of article segmentation you are going to search for the same term, but in different ways. The 1918 influenza outbreak, often called the Spanish flu, was the deadliest pandemic in recent history and was of course widely covered by newspapers. This makes it an interesting example to illustrate how ‘article segmentation’ helps to identify which words belong to which article.
1. Go to Le Temps (Swiss Newspapers)
Choose two languages that you master and two search environments to search with the query “Spanish flu” or “grippe espagnole” or “Spanische Grippe”:
Compare the results by considering the following features:
| |resource 1|resource 2|
|Type of article (event, opinion, human interest)| | |
|Does the collection cover the years in which the Spanish flu spread in Europe?| | |
|Are there peaks in news coverage?| | |
|Is there a dominant frame?| | |
|Are there specific features in the interface that limit your search output?| | |
After taking measures to guarantee good quality OCR, another concern is the quality of the text retrieval by the search engine and the interaction with the audience through the interface. To find and read specific articles, a system has to be built and an interface has to be designed to enable you to query the database and access the place where the digitised newspapers are stored. The database therefore needs rich information on each of the digitised objects it contains. This information is called metadata, literally “information about data”. In general the most basic elements for books and articles are the author, the title of the publication, the date of publication and the language of the document. For newspapers, the available information is usually limited to the title of the newspaper and the date of publication of the issue.
To turn all the digitised material into an online search environment, we need the following:
- a database where the output of the digitisation is stored (the image of each newspaper page, the OCR and the metadata),
- an interface where you can type your query and browse the results,
- a search engine that will receive your query from the interface and look into the database to find relevant results.
The list of results that appears on your screen is the product of the interaction between these three elements. Any errors or missing elements may be the result of a problem arising in any one of these elements. As we have seen, digitisation itself can also yield errors. Furthermore, it is important to bear in mind that because we are dealing with historical texts, spellings might have changed over time or misspellings may have occurred in the historical source itself.
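The interplay of the three elements can be sketched in a few lines. This is a deliberately minimal model, not the architecture of any real newspaper portal: the `Page` record, the sample issues, and the functions `build_index` and `search` are all invented for illustration, and the metadata is limited to the newspaper title and the date of the issue.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass(frozen=True)
class Page:
    newspaper: str  # metadata: title of the newspaper
    date: str       # metadata: date of the issue
    ocr_text: str   # text produced by OCR

# The "database": made-up records standing in for digitised pages.
DATABASE: List[Page] = [
    Page("Example Gazette", "1918-10-12", "The Spanish flu closes schools"),
    Page("Example Gazette", "1918-10-13", "Grain prices and the harvest"),
    Page("Example Courier", "1918-11-02", "Spanish flu: hospitals are full"),
]

def tokenize(text: str) -> List[str]:
    """Split a text into lower-case words."""
    return re.findall(r"\w+", text.lower())

def build_index(database: List[Page]) -> Dict[str, Set[int]]:
    """The "search engine": map every word to the records that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for position, page in enumerate(database):
        for token in tokenize(page.ocr_text):
            index[token].add(position)
    return index

def search(query: str, database: List[Page], index: Dict[str, Set[int]]) -> List[Page]:
    """The "interface": return the records that contain all query words."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return [database[i] for i in sorted(hits)]

INDEX = build_index(DATABASE)
for page in search("Spanish flu", DATABASE, INDEX):
    print(page.newspaper, page.date)
print(search("influenza", DATABASE, INDEX))
```

The query "Spanish flu" returns two of the three records, while "influenza" returns nothing, because the engine only matches the exact words stored in the index. A misrecognised or differently spelled word in the OCR text would be missed in the same way.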
This assignment requires you to download a folder with different kinds of data so that you can see what the files look like as they are stored in the database.
Open one of the folders
Names of people and locations are more prone to OCR mistakes, as they cannot be found in the dictionaries that are used to recognise printed words. This means that if the software used for digitisation has not integrated a dictionary of names, an accurate match for a query of a name relies completely on the correct identification of each single letter of the name.
We will now use this example of Robert Schuman(n) to understand how best to control our query parameters.
Robert Schuman, the politician, is most famous for the eponymous declaration
Robert Schumann, the composer
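How much a single letter matters for such a query can be sketched as follows. The headlines are invented, the third one simulates an OCR error, and the two matching functions are illustrative assumptions rather than the behaviour of any particular library catalogue.

```python
import re
from typing import List

def substring_hits(name: str, texts: List[str]) -> List[str]:
    """Match the name anywhere, even inside a longer word."""
    return [t for t in texts if name.lower() in t.lower()]

def whole_word_hits(name: str, texts: List[str]) -> List[str]:
    """Match the name only when it stands as a complete word."""
    pattern = re.compile(rf"\b{re.escape(name)}\b", re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]

# Made-up headlines for illustration.
HEADLINES = [
    "Robert Schuman presents his declaration",
    "Concert: works by Robert Schumann",
    "Robert Schnman visits the city",  # simulated OCR error: "u" read as "n"
]

print(substring_hits("Schuman", HEADLINES))   # politician and composer together
print(whole_word_hits("Schuman", HEADLINES))  # politician only
print(whole_word_hits("Schumann", HEADLINES)) # composer only
```

A substring search for "Schuman" also returns the composer, whereas a whole-word search keeps the two men apart. Neither approach finds the headline with the misrecognised name.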
Go to the website of the National Library of Australia and select the section dedicated to digitised newspapers
Go to the website of the National Library of Luxembourg on digitised newspapers
Conduct the same query in both collections and find one article on Robert Schuman the politician and one on Robert Schumann the composer (including advertisements for concerts)
As you will know by now, Robert Schuman was a politician born in Luxembourg, where he was not the only person with this name. There was a doctor in Bettembourg in the 1930s with the same name. How can you find an article about this person in the National Library of Luxembourg’s collection of digitised newspapers?
Read this blog post by technologist and industry expert Chris Riley to understand how progress in digital technology has improved OCR techniques.