Preprocessor Component

Input: String (HTML file)

Output: String (raw text)

Removes HTML tags and performs some other preprocessing.

Probably implemented with the CoreNLP cleanxml annotator: https://stanfordnlp.github.io/CoreNLP/cleanxml.html
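
A minimal sketch of the String-to-String contract described above, using only the Java standard library. It is not the CoreNLP cleanxml annotator; it is a regex-based stand-in meant to illustrate the expected input and output. The class name PreprocessorSketch, the method name preprocess, and the specific cleanup steps (dropping script/style blocks and comments, decoding a few entities, collapsing whitespace) are assumptions, since the component's "other preprocessing" is not specified here. Regular expressions are not a robust HTML parser, so malformed markup may not be handled well.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the Preprocessor component (not CoreNLP's cleanxml).
final class PreprocessorSketch {
    // <script>...</script> and <style>...</style> blocks, including their content.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // HTML comments.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining opening, closing, or self-closing tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Runs of whitespace, including newlines.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Input: String (HTML file content). Output: String (raw text).
    static String preprocess(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        // Replace tags with a space so adjacent block elements do not fuse words.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a handful of common entities; "&amp;" goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(preprocess(html)); // prints "Title Some bold text."
    }
}
```

Because every tag is replaced by a space, inline markup inside a single word (for example "wor<b>d</b>") would be split into two tokens; a real HTML or XML cleaner would be needed to avoid that.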

Edited by Stefan Heid