SDFTagger

SDFTagger is a general-purpose rule-based NLP technology that I have been developing since my PhD thesis. SDFTagger is inspired by the TULE parser, a rule-based morpho-syntactic dependency parser for Italian authored by Prof. Leonardo Lesmo, the supervisor of my PhD thesis. "SDF" stands for "Sono Davvero Felice", the greatest work of art ever written in the history of humankind.

A demo that uses SDFTagger to process case law from the European Court of Justice is available under the GNU General Public License.
You may download the demo from HERE and a user manual from HERE.

The core of SDFTagger consists of SDFRule(s). These are if-then rules in XML format that associate words of an input text with certain (prioritized) tags. Tags are strings that may be used in post-processing NLP procedures such as text classification, named entity recognition, sentiment analysis, etc. Some past applications where SDFTagger has been used are reported at the bottom of this page.

If the conditions defined within an SDFRule are met, (some of) the words that satisfy these conditions are associated with certain tags, which are given in output. Each SDFRule is associated with a priority, which is then assigned to its output tags. In addition, it is possible to parse the input text with a dependency parser and to feed SDFTagger with the parsed trees: SDFRule(s) may also pose conditions on the grammatical relations that connect the words. For instance, in the downloadable demo mentioned above, input texts are first parsed with Stanford CoreNLP (but SDFTagger is parser-neutral: any other dependency parser can be used as well). Specifically, the architecture of SDFTagger is the following:

The figure shows that SDFRule(s) may include conditions in four possible directions: right (i.e., on the next word in the surface order), left (i.e., on the preceding word in the surface order), up (i.e., on the governor of the word in the dependency tree) and down (i.e., on the dependents of the word in the dependency tree). On each word reached by the SDFRule(s), it is possible to pose conditions on the form, the lemma, the part-of-speech, etc. If these conditions are met, SDFRule(s) may assign a tag to the word. Conditions can be defined recursively: from each word reached by the SDFRule(s), it is possible to move again along one of the four directions, i.e., to pose further sub-conditions on the neighbouring words. If all conditions of an SDFRule are met, the tags assigned during the execution of the SDFRule are given in output.
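The recursive matching described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not SDFTagger's actual implementation or its XML rule format: the names Token, Condition, neighbours and match are hypothetical, the dependency tree is built by hand rather than produced by a parser, and at most one word is followed per sub-condition.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    index: int                  # position in the surface order
    form: str
    lemma: str
    pos: str
    head: Optional[int] = None  # index of the governor; None for the root


@dataclass
class Condition:
    """Hypothetical stand-in for the conditions of a rule (not the real XML)."""
    tests: dict = field(default_factory=dict)  # e.g. {"lemma": "appeal"}
    tag: Optional[str] = None                  # tag assigned if the tests hold
    sub: list = field(default_factory=list)    # (direction, Condition) pairs


def neighbours(tokens, tok, direction):
    """Words reachable from tok by one step in the given direction."""
    if direction == "right":
        return tokens[tok.index + 1: tok.index + 2]
    if direction == "left":
        return tokens[tok.index - 1: tok.index] if tok.index > 0 else []
    if direction == "up":
        return [tokens[tok.head]] if tok.head is not None else []
    if direction == "down":
        return [t for t in tokens if t.head == tok.index]
    raise ValueError(f"unknown direction: {direction}")


def match(tokens, tok, cond):
    """Return the (index, tag) pairs assigned if cond holds at tok, else None."""
    if any(getattr(tok, key) != value for key, value in cond.tests.items()):
        return None
    tags = [(tok.index, cond.tag)] if cond.tag else []
    for direction, sub_cond in cond.sub:
        for candidate in neighbours(tokens, tok, direction):
            result = match(tokens, candidate, sub_cond)
            if result is not None:
                tags.extend(result)
                break
        else:
            return None  # no reachable word satisfies this sub-condition
    return tags


# "The Court dismissed the appeal", with a hand-built dependency tree.
tokens = [
    Token(0, "The", "the", "DT", head=1),
    Token(1, "Court", "court", "NN", head=2),
    Token(2, "dismissed", "dismiss", "VBD", head=None),
    Token(3, "the", "the", "DT", head=4),
    Token(4, "appeal", "appeal", "NN", head=2),
]

# Tag the verb, then go down to a dependent "appeal" preceded by a determiner.
rule = Condition(
    tests={"lemma": "dismiss"},
    tag="DECISION",
    sub=[("down", Condition(
        tests={"lemma": "appeal"},
        tag="OBJECT",
        sub=[("left", Condition(tests={"pos": "DT"}))],
    ))],
)

print(match(tokens, tokens[2], rule))
```

In this sketch the tags are returned only if every sub-condition succeeds, which mirrors the statement that tags are given in output only when all conditions of a rule are met. A fuller version would also carry the priority of the rule along with each tag.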

Although SDFTagger is rule-based, it is intended to support hybrid (i.e., rule-based and statistical) NLP tagging. SDFRule(s), being encoded in XML format, could easily be generated via statistical approaches (e.g., statistical pattern recognition, key phrase extraction, topic modeling, etc.). Higher-priority rules may then be added manually in order to override the statistical tagging where it produces errors or in specific exceptional cases.

In my view, hybrid approaches provide an easy way to overcome the two main difficulties of current mainstream NLP technologies, which are mostly based on statistics: (1) the Pareto principle, a.k.a. the 80/20 rule, and (2) the achievement of explainable NLP.

The Pareto principle. It is well known that many phenomena, including the distribution of words and patterns in natural languages, can usually be characterized by roughly 80% standard cases and 20% exceptions. Statistical approaches may be (quickly!) trained to identify the 80% of standard cases, but they intrinsically fail to recognize instances from the 20% of exceptions that deviate from the standard trend. In order to achieve accuracy close to 100%, the idea of SDFTagger is to generate an initial knowledge base of rules via statistical methods and to assign them an initial (fixed) priority. Then, as exceptions mis-processed by these initial rules are identified, it is possible to manually add higher-priority rules that override the standard trend in these specific cases. Exceptions to exceptions (and exceptions to exceptions to exceptions, etc.) may be handled recursively by adding further rules with even higher priority. In other words, the general idea is simply to convert statistical models into a more human-readable format, such as the XML format of the SDFRule(s), rather than into huge matrices of numbers or neural networks with (numerical) weights on the connections, in order to enable subsequent manual tuning.
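The override mechanism can be illustrated with a small Python sketch. It is a deliberate simplification under stated assumptions: each rule is reduced to a hypothetical Rule object with a name, a priority, a test on a single word and a tag, rather than an SDFRule in XML, and the example rules, tags and priority values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str                       # identifies the rule that produced a tag
    priority: int                   # higher values override lower ones
    applies: Callable[[str], bool]  # hypothetical single-word condition
    tag: str


def tag_word(word, rules):
    """Return (tag, rule name) of the highest-priority rule that fires, or None."""
    fired = [rule for rule in rules if rule.applies(word)]
    if not fired:
        return None
    best = max(fired, key=lambda rule: rule.priority)
    return best.tag, best.name


rules = [
    # Stands in for a statistically generated rule with the initial priority.
    Rule("stat-ly-adverb", 1, lambda w: w.endswith("ly"), "ADVERB"),
    # Manually added exception: these words end in "ly" but are nouns.
    Rule("manual-ly-nouns", 2, lambda w: w in {"family", "supply"}, "NOUN"),
    # Exception to the exception, added with an even higher priority.
    Rule("manual-supply-verb", 3, lambda w: w == "supply", "VERB"),
]

for word in ["quickly", "family", "supply", "court"]:
    print(word, tag_word(word, rules))
```

Because the name of the winning rule is returned together with the tag, each decision in this sketch can be traced back to the rule that produced it.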

Explainable AI. It is widely expected that explainability will steer the future of Artificial Intelligence (AI), and substantial investments are consequently expected in the coming years. Most current AI systems are opaque "black boxes", as they are unable to make users aware of how their decisions are made. This is particularly true for applications based on Deep Learning, one of the main AI technologies in use today, where the link between the training set and the weights of the connections is lost during the training phase. Basic neural networks do not record how a given instance of the training set changed the weights of the connections. Thus, when the neural network processes new instances, it is unable to explain why it took a certain decision (for instance, by providing the set of instances from the training set relevant to that decision). By contrast, I believe that rules represent the path towards explainable AI, in that they make it possible to link automatic decisions to the original (training) set. More generally, if-then rules define a logic on which explanations can be built.

The first complete prototype of SDFTagger was devoted to recognizing temporal expressions in Italian texts and storing them in TimeML files (see [Robaldo et al, 2011a]). It was then used in another prototype for automatically identifying and classifying types of modifications in Italian legal text (see [Robaldo et al, 2012]). SDFTagger was also used to perform named entity recognition within legal texts in the context of the FP7 EU project EUCases, where I was responsible for the NLP tasks for the Italian language, and ultimately in the legal document management system Eunomos. It has been used in the Phrase Detectives game-with-a-purpose to implement the Italian pipeline needed to make the game available for the Italian language (see [Poesio et al, 2013]). Finally, the most successful project where SDFTagger was employed was SentiTagger, selected for funding in the WCAP Telecom Italia 2014 start-up competition. SentiTagger was a hybrid system for sentiment analysis, specifically for tagging positive/negative opinions in OpinionMining-ML.