Catching the flu with article segmentation

To make digitised newspapers searchable, a further process is needed: article segmentation. Newspapers tend to print texts on unrelated topics on the same page, and a single article is often split into parts that appear on different pages. To retrieve a meaningful version of an article, it is necessary to know which characters and words belong to the same article. This “understanding” is made possible by article segmentation, the technique of dividing a page into smaller units. While OCR tells us which letters and words appear on which page of the newspaper, article segmentation tells us which letters and words on a page belong to the same unit.
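The difference between page-level and article-level units can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Article` class, the two sample texts and both search functions are invented for this example and do not describe how any particular archive is implemented.

```python
from dataclasses import dataclass


@dataclass
class Article:
    page: int   # page number on which the article is printed
    text: str   # text of the article as produced by OCR


# Invented sample data: one newspaper page that holds two unrelated articles.
articles = [
    Article(page=1, text="La grippe fait de nouvelles victimes à Genève"),
    Article(page=1, text="La délégation espagnole est arrivée à Berne"),
]


def words(text):
    """Return the set of lower-cased words in a text."""
    return set(text.lower().split())


def search_pages(articles, terms):
    """Without segmentation: all words on a page form one searchable unit."""
    pages = {}
    for article in articles:
        pages.setdefault(article.page, set()).update(words(article.text))
    return [page for page, vocab in pages.items()
            if all(term in vocab for term in terms)]


def search_articles(articles, terms):
    """With segmentation: each article is its own searchable unit."""
    return [article for article in articles
            if all(term in words(article.text) for term in terms)]


page_hits = search_pages(articles, ["grippe", "espagnole"])
article_hits = search_articles(articles, ["grippe", "espagnole"])
print(page_hits)     # the page matches, because both words occur somewhere on it
print(article_hits)  # no article matches, because the words are in different articles
```

In this sketch the page-level search returns a hit although no article on the page is about the Spanish flu, which is the kind of false match that segmentation helps to avoid.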

After OCR and article segmentation have been applied, the final step is to make the digitised newspapers available to the public. The extracted texts and the images of the newspapers are published on the web and can be viewed and searched through a carefully designed interface. Article segmentation is, however, not always part of the digitisation process, and a user only notices this after conducting a search. You will notice the difference between searching digital newspaper archives where this principle has been applied and those where it is missing.

Instructions

To understand the principle of article segmentation, you are going to search for the same term in different ways. The 1918 influenza outbreak, often called the Spanish flu, was the deadliest pandemic in recent history and was of course widely covered by newspapers. This makes it an interesting example to illustrate how article segmentation helps to identify which words belong to which article.

2.a How to find an article that deals with the Spanish flu | 20 min

Go to Le Temps (Swiss Newspapers)

  • First search for the combination “grippe espagnole” (French for “Spanish flu”, as the newspaper is in French) and document your findings.
  • Then search for the single words “grippe” and “espagnole” and document your findings.
  • How can you explain the difference in hits between the two queries?

Go to e-newspaperarchives.ch

  • Search for “grippe” and “espagnole”: what difference do you notice compared with the previous interface?
  • Document what you have learned from this comparison.

2.b Queries on the Spanish flu: different countries, different reactions?

Choose two languages in which you are proficient and two search environments, then search with the query “Spanish flu”, “grippe espagnole” or “Spanische Grippe”:

  • EN: https://trove.nla.gov.au/newspaper/?q=
  • FR/DE: http://www.eluxemburgensia.lu
  • EN/DE/FR: https://www.europeana.eu/portal/en/collections/newspapers?q=
  • FR/DE: https://www.e-newspaperarchives.ch
  • DE: http://anno.onb.ac.at/anno-suche#searchMode=simple&from=1

Compare the results by considering the following features:

Feature | Resource 1 | Resource 2
Type of article (event, opinion, human interest) | |
Does the collection cover the years in which the Spanish flu spread in Europe? | |
Are there peaks in news coverage? | |
Is there a dominant frame? | |
Are there specific features in the interface that limit your search output? | |
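
If you note down the publication date of each hit, a quick count per month can help with the question about peaks in news coverage. The following is a minimal sketch only: the dates in `hit_dates` are invented placeholders, not real search results, and `hits_per_month` is a hypothetical helper written for this illustration.

```python
from collections import Counter

# Invented placeholder dates (YYYY-MM-DD); replace them with the dates
# of the hits you recorded yourself.
hit_dates = [
    "1918-07-12",
    "1918-10-03",
    "1918-10-19",
    "1918-10-27",
    "1918-11-05",
    "1919-01-14",
]


def hits_per_month(dates):
    """Count hits per month, using the YYYY-MM prefix of each date."""
    return Counter(date[:7] for date in dates)


counts = hits_per_month(hit_dates)
peak_month, peak_count = counts.most_common(1)[0]

# Print a simple text histogram in chronological order.
for month in sorted(counts):
    print(month, "#" * counts[month])
print("Month with most hits:", peak_month, "with", peak_count, "hits")
```

Raw counts like these also depend on how many issues the collection holds for each month, so treat a peak as a prompt for closer reading rather than as a finding in itself.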

Reading/viewing suggestions