| Feature name | Type | Description |
|---|---|---|
| Dictionary | Semantic | Person names; Organization names; Location names |
| Distributional | Semantic | Distributional thesaurus |
| Section | Pragmatic | Name of the section in which the sentence appears |
| Part of speech | Syntactic | Part of speech of the token in the sentence |
| Others | Lexical | Lower case token, Lemma, Prefixes, Suffixes, n-grams, Matching patterns such as beginning with a capital, etc. |

- Dictionary features: all three dictionaries contain single-token words and are obtained by removing stop words. Each dictionary corresponds to one feature, which indicates whether a token is present in that dictionary.
- Distributional features: using the Semantic Vectors package [27] trained on the text retrieved from the links obtained for the case study, each word is represented in a 2000-dimensional vector space. The vector representation is used to find, for each word, the 20 most similar words in the text. Each token thus has 20 distributional semantic features, which represent its entries in the thesaurus.
- Section features: section names are detected automatically using simple rules (e.g. a sentence ending with a semi-colon).
- Other features: about a hundred more features cover the different part-of-speech tags in Penn Treebank format, the different matching patterns used, prefixes, n-grams, etc.
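
As a rough illustration of how the per-token features described above could be assembled, the following is a minimal sketch and not the authors' implementation. It assumes that word vectors are already available as a plain Python dictionary (the Semantic Vectors package itself is not called here), that similarity is measured by cosine, and that the section name has already been detected upstream. The function names `cosine`, `nearest_words`, and `token_features`, the feature keys, and the three-character prefix and suffix length are all hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word, vectors, k=20):
    """Return up to k words most similar to `word`, i.e. its thesaurus entries.

    `vectors` is an assumed stand-in for the trained vector space:
    a dict mapping each word to a list of floats.
    """
    if word not in vectors:
        return []
    target = vectors[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def token_features(token, dictionaries, vectors, section, k=20):
    """Build a feature dict for one token (hypothetical feature names)."""
    lower = token.lower()
    feats = {
        "lower": lower,                      # lexical: lower-case token
        "prefix3": lower[:3],                # lexical: prefix (length is an assumption)
        "suffix3": lower[-3:],               # lexical: suffix (length is an assumption)
        "init_cap": token[:1].isupper(),     # lexical: begins with a capital
        "section": section,                  # pragmatic: section name, detected upstream
    }
    # One binary feature per dictionary: is the token present in it?
    for name, entries in dictionaries.items():
        feats["in_" + name] = lower in entries
    # Up to k distributional features: the token's nearest neighbours.
    for rank, neighbour in enumerate(nearest_words(lower, vectors, k)):
        feats["dist_%d" % rank] = neighbour
    return feats
```

With toy two-dimensional vectors in place of the 2000-dimensional space, a call such as `token_features("Paris", dictionaries, vectors, "Background", k=2)` yields one membership flag per dictionary plus up to two `dist_*` entries. Part-of-speech tags and n-gram features are omitted from the sketch because they depend on a tagger and a sentence context that the text does not specify.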