NGS resources aid Arabic Corpus Digitisation
A new case study is now available on the NGS website. It describes how NGS resources were used to accelerate the processing of a large corpus of Arabic words.
Majdi used the NGS to cut the processing time for his research dramatically. He divided the Arabic Internet Corpus into half-million-word files and then wrote a program that generates scripts to process each file in parallel. This reduced the execution time for the 176M-word corpus to only 5 days.
Majdi explained: “Roughly, an estimated execution time for lemmatizing the full Arabic Internet Corpus was 300 days using [an] ordinary uni-processor machine. By using the computational power of the NGS a massive reduction in execution time was gained – instead it only took 5 days.”
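The split-and-generate approach described above can be illustrated with a minimal Python sketch. It is not Majdi's actual program: the `lemmatize` command, the file names, and the shell-script format are all hypothetical, and the real job submission mechanism on the NGS is not described here. The sketch only shows the general idea of cutting a corpus into fixed-size files and writing one job script per file, plus the arithmetic implied by the quoted figures (176M words in half-million-word files, and 300 days reduced to 5).

```python
import tempfile
from pathlib import Path

WORDS_PER_CHUNK = 500_000  # half-million-word files, as in the article


def num_chunks(total_words, chunk_size=WORDS_PER_CHUNK):
    """Number of files needed to hold total_words (ceiling division)."""
    return -(-total_words // chunk_size)


def split_corpus(words, chunk_size=WORDS_PER_CHUNK):
    """Yield successive lists of at most chunk_size words."""
    for start in range(0, len(words), chunk_size):
        yield words[start:start + chunk_size]


def write_chunks_and_scripts(words, out_dir, chunk_size=WORDS_PER_CHUNK,
                             command="lemmatize"):
    """Write one text file and one job script per chunk.

    `command` is a hypothetical placeholder for whatever program
    processes a single file; it is an assumption, not the real tool.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    scripts = []
    for i, chunk in enumerate(split_corpus(words, chunk_size)):
        chunk_path = out_dir / f"chunk_{i:04d}.txt"
        chunk_path.write_text(" ".join(chunk), encoding="utf-8")
        script_path = out_dir / f"job_{i:04d}.sh"
        script_path.write_text(
            f"#!/bin/sh\n{command} {chunk_path.name} > {chunk_path.stem}.lemmas.txt\n",
            encoding="utf-8",
        )
        scripts.append(script_path)
    return scripts


if __name__ == "__main__":
    # Figures taken from the article.
    print("files needed:", num_chunks(176_000_000))
    print("speed-up factor:", 300 / 5)

    # Tiny demonstration with a toy corpus and a small chunk size.
    toy_words = [f"w{i}" for i in range(25)]
    with tempfile.TemporaryDirectory() as tmp:
        for script in write_chunks_and_scripts(toy_words, tmp, chunk_size=10):
            print(script.name)
```

Because each generated script touches only its own file, the jobs are independent and can be run in any order or all at once, which is what makes this kind of workload well suited to a grid.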
Majdi was not the only one to benefit from using NGS resources. He explained: “It made the processed Arabic Internet Corpus available to other translation studies and Arabic and Middle Eastern study researchers at the University of Leeds and other world-wide institutions.”
To read more about Majdi’s research, see the “Accelerating the Processing of Large Corpora: Using Grid Computing Technologies for Lemmatizing 176 Million Words Arabic Internet Corpus” case study.

