Lesson about the digitisation and exploration of digital newspapers
This lesson deals with how digitised newspapers that are available online change the way historians use newspapers as historical sources, and how they call for new skills in applying source criticism.
The process of digitisation begins by scanning a physical newspaper in order to produce an image of each page. Since images as such are not searchable, the letters in the text have to be made recognisable. This is done through a technique known as optical character recognition (OCR), with software that is able to understand the image of a character and turn it into a digital entity that represents a single character.
For this to happen, two sub-steps have to be taken:
- Binarisation of the image’s colours, i.e. the transformation of the image of a page into just two colours: black and white. This simplifies the image of the page and increases the contrast between dark and light sections, thereby making the individual characters stand out from the page background.
- Classification of the letters. After the individual characters have been delineated, they have to be assembled into words. In order to do so, the software first compares the captured letters to known fonts and selects the font that is most likely to fit.
When classifying the letters, the software can consider features such as the following:
- Are these letters from a Latin alphabet or an Arabic one?
- Are the letters in italic or bold?
- Is the font Times New Roman or Comic Sans MS?
A similar detection mechanism then identifies the language and compares the words that are found to a corresponding dictionary. The output of the OCR processing is a transcribed, machine-readable text. We now have a digitised newspaper: the image of the page and its text.
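The binarisation step described above can be illustrated with a minimal sketch. It assumes that a scanned page is available as rows of grayscale values (0 for black, 255 for white) and that a single fixed threshold of 128 separates ink from paper. Both the function name and the threshold are assumptions made for this illustration; actual OCR software may choose the threshold differently, for example by adapting it to each page.

```python
def binarise(gray_page, threshold=128):
    """Turn every grayscale pixel into pure black (0) or pure white (255).

    The fixed threshold is an illustrative assumption, not the method
    of any particular OCR package.
    """
    return [
        [0 if pixel < threshold else 255 for pixel in row]
        for row in gray_page
    ]


# A tiny 3x4 "scan": light paper with a few darker ink pixels.
page = [
    [250, 240, 90, 245],
    [235, 60, 70, 230],
    [245, 238, 85, 240],
]

binary_page = binarise(page)
for row in binary_page:
    print(row)
```

After this step the page contains only two values, so the ink pixels stand out clearly from the background.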
To make digitised newspapers searchable, a further process is needed: article segmentation. Newspapers tend to print texts on unrelated topics on the same page, and a single article is often divided into parts that are printed on different pages. To retrieve a meaningful version of an article, it is necessary to determine which characters and words belong to the same article. This “understanding” is made possible through the technique of article segmentation. While OCR gives us information on which letters and words appear on which page of the newspaper, article segmentation, the technique of dividing a page into smaller units, gives us information on which letters and words on a page belong to the same unit.
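The difference between the two kinds of information can be sketched as follows. The example assumes that OCR has produced text blocks labelled with a page number and a block identifier, and that segmentation has already assigned each block to an article. All identifiers and texts are invented for illustration, and the sketch does not show how the segmentation itself is computed.

```python
from collections import defaultdict

# Hypothetical OCR output: (page number, block id, recognised text).
ocr_blocks = [
    (1, "b1", "Flood damages the harbour."),
    (1, "b2", "Council votes on new tram line."),
    (3, "b7", "Repairs are expected to take weeks."),
]

# Hypothetical segmentation output: which block belongs to which article.
block_to_article = {
    (1, "b1"): "flood",
    (1, "b2"): "tram",
    (3, "b7"): "flood",
}


def assemble_articles(blocks, mapping):
    """Group OCR blocks by article, keeping them in reading order."""
    parts = defaultdict(list)
    for page, block_id, text in blocks:
        parts[mapping[(page, block_id)]].append(text)
    return {article: " ".join(texts) for article, texts in parts.items()}


articles = assemble_articles(ocr_blocks, block_to_article)
print(articles["flood"])
```

Without the mapping, a search could only report that the words appear on page 1 or page 3. With it, the two parts of the flood report are returned as one article.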
After OCR and article segmentation have been applied, the final step is to make the digitised newspapers available to the public. The extracted texts and the images of the newspapers are published on the web and can be viewed and searched through a carefully designed interface. Article segmentation is, however, not always part of the digitisation process, and users often only notice its absence after conducting a search. You will notice the difference between searching digital newspaper archives where this principle has been applied and those where it is missing.
Once measures have been taken to guarantee good-quality OCR, another concern is the quality of the text retrieval by the search engine and of the interaction with users through the interface. To find and read specific articles, a system has to be built and an interface has to be designed to enable you to query the database and access the place where the digitised newspapers are stored. The database therefore needs rich information on each of the digitised objects it contains. This information is called metadata, literally “information about data”. In general, the most basic metadata elements for books and articles are the author, the title of the publication, the date of publication and the language of the document. For newspapers, the available information is usually limited to the title of the newspaper and the date of publication of the issue.
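As a rough illustration, a metadata record for a newspaper issue could look like the sketch below. The class and field names are assumptions made for this example and do not follow any particular metadata standard. The optional fields reflect the point that author and language are often not recorded for newspapers.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class IssueMetadata:
    """Hypothetical metadata record for one digitised newspaper issue."""
    newspaper_title: str
    publication_date: date
    language: Optional[str] = None  # often not recorded for newspapers
    author: Optional[str] = None    # rarely available at issue level


# Invented example values.
issue = IssueMetadata(
    newspaper_title="The Daily Example",
    publication_date=date(1912, 4, 16),
)
print(issue)
```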
To turn all the digitised material into an online search environment, we need the following:
- a database where the output of the digitisation is stored (the image of each newspaper page, the OCR and the metadata),
- an interface where you can type your query and browse the results,
- a search engine that will receive your query from the interface and look into the database to find relevant results.
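How these three elements work together can be sketched in a few lines. This is a deliberately naive illustration: the database is a dictionary with invented records, the search engine only matches whole words exactly, and the interface simply formats the hits as text. The function names are assumptions made for this example.

```python
# 1. Database: OCR text and metadata for each page (invented records).
database = {
    "page-001": {"title": "The Daily Example", "date": "1912-04-16",
                 "text": "Great storm hits the coast"},
    "page-002": {"title": "The Daily Example", "date": "1912-04-17",
                 "text": "Council debates the new tram line"},
    "page-003": {"title": "Evening Sample", "date": "1912-04-17",
                 "text": "After the storm repairs begin"},
}


# 2. Search engine: receives a query and looks it up in the database.
def search(query, db):
    term = query.lower()
    return sorted(
        page_id for page_id, record in db.items()
        if term in record["text"].lower().split()
    )


# 3. Interface: passes the query on and presents the results to the user.
def show_results(query, db):
    return [
        f'{db[page_id]["title"]}, {db[page_id]["date"]} ({page_id})'
        for page_id in search(query, db)
    ]


for line in show_results("storm", database):
    print(line)
```

Because the engine compares words exactly, a page on which OCR rendered the word with a wrong letter would not appear in the list, even though the article is present in the database.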
The list of results that appears on your screen is the product of the interaction between these three elements. Any errors or missing elements may be the result of a problem arising in any one of these elements. As we have seen, digitisation itself can also yield errors. Furthermore, it is important to bear in mind that because we are dealing with historical texts, spellings might have changed over time or misspellings may have occurred in the historical source itself.
Names of people and locations are more prone to OCR mistakes, as they cannot be found in the dictionaries that are used to recognise printed words. This means that if the software used for digitisation has not integrated a dictionary of names, an accurate match for a query of a name relies completely on the correct identification of each single letter of the name.
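The effect of a single misrecognised letter on a name query can be illustrated as follows. The example assumes that OCR has read the "m" in the invented name "Schmidt" as "rn", a plausible confusion in print. The approximate matching shown here uses `SequenceMatcher` from Python's standard `difflib` module with an arbitrarily chosen similarity threshold of 0.75. It is only one possible way of tolerating such errors, not a description of how any particular newspaper archive works.

```python
from difflib import SequenceMatcher

# Hypothetical OCR output in which "Schmidt" was read as "Schrnidt".
ocr_words = ["Mr", "Schrnidt", "arrived", "in", "Luxembourg"]


def exact_match(query, words):
    """Return only the words that are identical to the query."""
    return [word for word in words if word == query]


def fuzzy_match(query, words, min_ratio=0.75):
    """Return the words that are sufficiently similar to the query.

    min_ratio is an illustrative threshold chosen for this example.
    """
    return [
        word for word in words
        if SequenceMatcher(None, query, word).ratio() >= min_ratio
    ]


print(exact_match("Schmidt", ocr_words))
print(fuzzy_match("Schmidt", ocr_words))
```

The exact search finds nothing, while the approximate search still retrieves the misread name. A lower threshold would tolerate more errors but would also return more unrelated words.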