Jan Voskuil


From documents to datasets: challenges and solutions in the context of IDMP and pharmacology

To obtain authorization to bring a medicinal product to market, 200,000 pages of text need to be submitted. The upcoming entry into force of the IDMP directive (EU) forces pharma companies to submit datasets instead. This has an enormous impact and poses manifold challenges. Semantic Web technology is optimally positioned to address many of these. This presentation focuses on one of these challenges. When the authorization for an existing product has to be renewed, an IDMP-compliant dataset has to be compiled. Some 70 to 80 percent of the datapoints are described in the text and are not obtainable from IT systems. Manual data entry is error-prone and does not scale, since it is estimated that the total number of datapoints may exceed 1700 for a single submission. Based on state-of-the-art entity extraction software, a solution has been developed that generates those parts of the dataset that can be obtained from the text. The presentation describes some of the major challenges that had to be overcome and details the solutions that were found. It presents some results and describes the major business requirements that need to be met.


After obtaining a PhD in theoretical linguistics, Jan worked for several start-ups in the field of artificial intelligence. He then worked as a senior solution architect at Logica, where he was involved in several large-scale, high-profile innovation programs. Jan is a technology evangelist in the field of Linked Data and Semantic Web technology, and specializes in language processing, controlled vocabularies and business glossaries. He is currently CEO of Taxonic, which he co-founded in 2012. Taxonic is a consultancy that focuses on applying Linked Data technologies to real-world business problems.