Most of what I’ve done in the space of NLP has, for obvious reasons, been coded in Python. In my day job, however, I primarily work on the JVM. Wouldn’t it be cool to build something NLP / machine learning related in, say, Scala? Well, Stanford CoreNLP has a solid reputation and is written in Java – not a bad place to start.

There is a lot to unpack, and I will delve into the details as we go through this series. In this introductory post I give a bird’s eye view of what CoreNLP is and how it can be useful.

CoreNLP in a Nutshell

CoreNLP is a library for extracting essential linguistic features from a piece of text. It is a project by a renowned group of researchers at Stanford, and as such is fairly popular within the NLP community.

The library is written in Java, which is of particular interest to anyone who appreciates the advantages of strongly typed languages, myself included. In the course of this tutorial I will diverge from Java in favour of Scala. For now, let’s just assume we stick with the JVM and enjoy the benefits it has to offer.

Key Concepts

Here is how CoreNLP works in a nutshell.

CoreNLP works with linguistic annotations: raw text is annotated and, where syntactic analysis is involved, turned into a tree structure. Think tokens, parts of speech, named entities, coreferences etc.

“The brown fox is quick and he is jumping over the lazy dog” – Source: StackOverflow.

CoreNLP chains individual analytical steps into pipelines, not dissimilar to how other libraries do it. The exact pipeline is determined by configuration. For example, “tokenize, ssplit, parse” means that the raw text will be tokenized, split into sentences and syntactically parsed. The order matters: annotators (processing steps) further down the line depend on the output of their predecessors. See the full list of annotators and the dependencies between them. A minimal example follows the figure below.

Raw text processing pipeline. Source: Stanford CoreNLP Natural Language Processing Toolkit
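
To make this concrete, here is a minimal sketch of such a pipeline in Scala. It assumes the stanford-corenlp artifact and its English models are on the classpath; the object name is mine, while the classes and methods come from CoreNLP’s public API.

```scala
import java.util.Properties
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}

object PipelineDemo extends App {
  // Order matters: parse depends on ssplit, which in turn depends on tokenize
  val props = new Properties()
  props.setProperty("annotators", "tokenize,ssplit,parse")

  val pipeline = new StanfordCoreNLP(props)

  val doc = new CoreDocument("The brown fox is quick and he is jumping over the lazy dog.")
  pipeline.annotate(doc)

  // Each sentence now carries a constituency parse tree
  doc.sentences().forEach(s => println(s.constituencyParse()))
}
```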

A parsed document provides access to all annotations and can be efficiently serialised as a Google Protocol Buffers object.
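
Serialisation goes through CoreNLP’s ProtobufAnnotationSerializer. The sketch below round-trips an annotated document through its protobuf bytes; as before, the CoreNLP jars and models are assumed to be on the classpath.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, ProtobufAnnotationSerializer, StanfordCoreNLP}

object ProtoRoundTrip extends App {
  val props = new Properties()
  props.setProperty("annotators", "tokenize,ssplit,pos")
  val pipeline = new StanfordCoreNLP(props)

  val annotation = new Annotation("The brown fox is quick.")
  pipeline.annotate(annotation)

  // Serialize the annotated document to Protocol Buffers bytes...
  val serializer = new ProtobufAnnotationSerializer()
  val out = new ByteArrayOutputStream()
  serializer.write(annotation, out).close()

  // ...and read it back; read() returns a (document, remaining stream) pair
  val pair = serializer.read(new ByteArrayInputStream(out.toByteArray))
  val restored: Annotation = pair.first
  println(restored)
}
```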

The API also provides convenient wrappers and guarantees (a short example follows the list):

  • Lazy computation – lets you apply transformations before the pipeline is executed.
  • Fast and robust serialization via GPB
  • Thread safety
  • Optional over null – together with lazy computation, the use of Optional guarantees a function always returns a value.
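
These guarantees are advertised by CoreNLP’s “simple” API (edu.stanford.nlp.simple). A minimal sketch, assuming Scala 2.13 for the collection converters:

```scala
import scala.jdk.CollectionConverters._
import edu.stanford.nlp.simple.Document

object SimpleApiDemo extends App {
  val doc = new Document("The brown fox is quick. He is jumping over the lazy dog.")

  // Nothing has been computed yet – annotators run lazily on first access
  for (sentence <- doc.sentences().asScala) {
    println(sentence.posTags()) // triggers tokenize/ssplit/pos on demand
  }
}
```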

Let’s look at licensing and high-level features. Delving into the details reveals that commercial use of the library requires a paid licence. On the other hand, CoreNLP ships with pre-trained models for sentiment analysis (general models only!). This makes for an easy start in the upcoming parts of this tutorial.
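
As a teaser for the next part, here is a sketch of the sentiment pipeline. The annotator names and CoreSentence.sentiment() are part of CoreNLP’s API; the sample text and object name are mine.

```scala
import java.util.Properties
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}

object SentimentPeek extends App {
  val props = new Properties()
  // sentiment needs a parse tree, hence the annotators that precede it
  props.setProperty("annotators", "tokenize,ssplit,parse,sentiment")
  val pipeline = new StanfordCoreNLP(props)

  val doc = new CoreDocument("The brown fox is quick and I love it.")
  pipeline.annotate(doc)

  // Prints one sentiment label per sentence, e.g. "... -> Positive"
  doc.sentences().forEach(s => println(s"${s.text()} -> ${s.sentiment()}"))
}
```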

| Feature | CoreNLP | OpenNLP | NLTK | spaCy |
|---|---|---|---|---|
| API | Java | Java | Python | Python |
| License | GNU GPL | Apache 2.0 | Apache 2.0 | MIT |
| Commercial use | Paid | Yes | Yes | Yes |
| General pre-trained models | Yes | Yes | Yes | Yes |
| Domain-specific pre-trained models | No | No | No | Yes |
| Pre-trained models for sentiment analysis | Yes | No | No | No |
| Training on GPU | No | No | No | Yes |
| Languages supported out of the box | 6 | 7 | 10+ | 10+ |

CoreNLP vs other established NLP libraries.


A head-to-head comparison with spaCy is slightly disappointing when it comes to raw tokenizer speed, though CoreNLP fares well on the remaining metrics:

| Performance metric | CoreNLP | spaCy |
|---|---|---|
| Tokenizer | 2 ms | 0.2 ms |
| Tagger | 1 ms | 10 ms |
| Parser | 19 ms | 49 ms |
| NER accuracy | 0.79 | 0.72 |

CoreNLP vs spaCy: speed in ms and accuracy. Source: EKbana

If you are looking for a Python alternative to CoreNLP, check out the recent (March 2020) release of Stanza, which reportedly outperforms spaCy on some of these performance metrics.

Summary

In this post, I gave you a ten-thousand-foot view of CoreNLP, Stanford’s NLP library for Java. I hope it helped you understand how it compares with other popular solutions, and that there are both benefits and drawbacks to using it. My next post will provide a detailed look at sentiment analysis: specifically, what useful metadata we can collect and how to arrive at the sentiment of a larger piece of text.

Thanks for reading. Did you find this post useful? Is there anything you’d like to know more about? Please leave a comment in the section below. Thank you and stay tuned for my next post.

