Automated information extraction
The life science literature has grown to more than 40 million articles, far too many for humans to process. Yet precisely because this body of work is so vast, consolidating the knowledge dispersed across these articles has the potential to yield particularly valuable insights. These insights could inform future research and also facilitate the development of new medical treatments. In the project Mining the life science literature (Aits, Ahmed, Kazemi Rashed, Berck), researchers draw on language technology to create a knowledge graph of domain-specific entities (e.g. proteins) related to cell death. The project uses named entity recognition (NER) and relation extraction methods. To handle the large amounts of data, the models are trained on high-performance computing (HPC) resources at the National Academic Infrastructure for Supercomputing in Sweden (NAISS), using the Berzelius supercomputing cluster. The project's knowledge graph will be used to enhance and evaluate insights gained from microscopy image analysis. Importantly, the project will also make its large-scale biomedical natural language processing (NLP) tools accessible to researchers in other domains.