To make digitised newspapers searchable, a further process is needed: article segmentation. A single newspaper page usually carries texts on unrelated topics, and a single article is often split into parts that are printed on different pages. To retrieve a meaningful version of an article, it is necessary to know which characters and words belong to the same article. This “understanding” is made possible through the technique of article segmentation. While OCR tells us which letters and words appear on which page of the newspaper, article segmentation, the technique of dividing a page into smaller units, tells us which letters and words on a page belong to the same unit.
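The idea can be illustrated with a minimal Python sketch. It assumes that OCR has already produced words with page coordinates and that the article regions are already known as simple rectangles; in a real digitisation pipeline these regions come from layout analysis, which is not shown here. The names `Word`, `Region` and `group_words_by_article` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token with the position of its top-left corner on the page."""
    text: str
    x: float
    y: float


@dataclass
class Region:
    """A rectangular area of the page assumed to belong to one article."""
    article_id: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        return self.x0 <= word.x <= self.x1 and self.y0 <= word.y <= self.y1


def group_words_by_article(words, regions):
    """Assign each OCR word to the first region that contains it."""
    articles = {}
    for word in words:
        for region in regions:
            if region.contains(word):
                articles.setdefault(region.article_id, []).append(word.text)
                break
    # Join the tokens of each article back into a searchable text.
    return {aid: " ".join(tokens) for aid, tokens in articles.items()}


# Toy page: OCR alone only knows that all four words are on this page.
page_words = [
    Word("Influenza", 10, 10), Word("spreads", 60, 10),
    Word("Football", 10, 210), Word("results", 60, 210),
]
# Hypothetical segmentation result: two article regions on the page.
page_regions = [
    Region("article-1", 0, 0, 200, 100),
    Region("article-2", 0, 200, 200, 300),
]

articles = group_words_by_article(page_words, page_regions)
print(articles)
```

An article that continues on another page could be represented in the same way by giving the regions on both pages the same `article_id`.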
After applying OCR and article segmentation, the final step is to make the digitised newspapers available to the public. The extracted texts and the images of the newspapers are published on the web and can be viewed and searched using a carefully designed interface. Article segmentation is, however, not always part of the digitisation process, and a user often only notices this after conducting a search. You will notice the difference between searching digital newspaper archives where this principle has been applied and those where it is missing.
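Why the difference shows up in search results can be sketched in a few lines of Python. The example below uses invented article texts and a deliberately naive substring match; it is not how any particular archive implements its search, and the names `pages`, `search_pages` and `search_articles` are hypothetical.

```python
# Invented sample data: each page holds the texts of its articles.
pages = {
    "page-1": {
        "article-a": "Spanish authorities report on the harvest",
        "article-b": "The flu closes schools in Geneva",
    },
    "page-2": {
        "article-c": "Spanish flu cases rise in Zurich",
    },
}


def matches(text, terms):
    """True if every search term occurs somewhere in the text."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def search_pages(pages, terms):
    """Without segmentation: the whole page is one block of text."""
    return [
        page_id
        for page_id, page_articles in pages.items()
        if matches(" ".join(page_articles.values()), terms)
    ]


def search_articles(pages, terms):
    """With segmentation: every article is searched on its own."""
    return [
        article_id
        for page_articles in pages.values()
        for article_id, text in page_articles.items()
        if matches(text, terms)
    ]


page_hits = search_pages(pages, ["spanish", "flu"])
article_hits = search_articles(pages, ["spanish", "flu"])
print(page_hits)     # ['page-1', 'page-2']
print(article_hits)  # ['article-c']
```

Page 1 is returned by the page-level search even though “Spanish” and “flu” occur in two unrelated articles, while the article-level search returns only the one article that contains both words.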
To understand the principle of article segmentation, you are going to search for the same term, but in different ways. The 1918 influenza outbreak, often called the Spanish flu, was the deadliest pandemic in recent history and was widely covered by newspapers. This makes it an interesting example to illustrate how ‘article segmentation’ helps to identify which words belong to which article.
Go to Le Temps (Swiss Newspapers)
Go to E-NEWSPAPER ARCHIVES.CH
Choose two languages in which you are proficient and two search environments, and search with the query “Spanish flu”, “grippe espagnole” or “Spanische Grippe”.
Compare the results by considering the following features:
| | Resource 1 | Resource 2 |
| --- | --- | --- |
| Type of article (event, opinion, human interest) | | |
| Does the collection cover the years in which the Spanish flu spread in Europe? | | |
| Are there peaks in news coverage? | | |
| Is there a dominant frame? | | |
| Are there specific features in the interface that limit your search output? | | |