Data is gold and time is of the essence. Wouldn’t it be great to get a summary of an elaborate article, the latest news or a complicated piece of legal text? Well, we aren’t quite there yet. Algorithms that generate concise, information-rich summaries that make sense and are pleasant to read remain a tough challenge. Nevertheless, I decided to give it a go and experimented with various NLP libraries. While the results aren’t perfect, they look … promising. Follow me through this brief tutorial, where I explain the absolute basics of simple text extraction algorithms and provide you with a whole notebook so you can rock on your own.

A while ago I gave an overview of text processing algorithms. It was only logical to move on and show a practical application. However, first things first …

Give Credit Where Credit’s Due

Most of this tutorial is based on a brilliant YouTube post by Jesse E. Agbe. In my notebook, I’ve highlighted the common parts of the summarization task and provided step-by-step guidance.

TL;DR

I’d appreciate it if you took the time to read this post to the end and shared your feedback. However, I understand that you are pressed for time and just want to get your hands on some code.



Your comments and criticism are highly appreciated. Google Colab makes it particularly easy to comment and share your feedback. Thanks in advance.

Without further ado, let’s delve right into it.

Mere Extraction vs Clever Abstraction

Text extraction is the simplest and most commonly used approach to generating text summaries. Its simplicity stems from a straightforward scoring formula.

Given a piece of text:

  • find prevalent words, i.e. identify all words and count the frequency of each one
  • identify key words, i.e. score individual words relative to the most frequent one
  • score the sentences, i.e. break the text down into sentences and score each of them based on the word scores
  • pick an arbitrary number of the highest-scoring sentences and print them out in the order they appear in the original text (a minimal sketch of this recipe follows below)
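
To make the recipe concrete, here is a minimal, dependency-free sketch of the idea in plain Python. The regex tokenisation, the naive sentence splitting and the absence of stop-word filtering are deliberate simplifications for illustration; the notebook relies on proper NLP tokenisers instead.

```python
import re
from collections import Counter

def naive_summarize(text, top_n=3):
    """Frequency-based extractive summary (illustrative sketch, not the notebook code)."""
    # 1. Find prevalent words: lower-cased regex tokenisation, no stop-word removal
    words = re.findall(r"[A-Za-z']+", text.lower())
    freq = Counter(words)

    # 2. Identify key words: score each word relative to the most frequent one
    max_freq = max(freq.values())
    word_scores = {word: count / max_freq for word, count in freq.items()}

    # 3. Score the sentences: naive split on end-of-sentence punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentence_scores = {
        sentence: sum(word_scores.get(w, 0) for w in re.findall(r"[A-Za-z']+", sentence.lower()))
        for sentence in sentences
    }

    # 4. Pick the top sentences and return them in their original order
    best = set(sorted(sentences, key=sentence_scores.get, reverse=True)[:top_n])
    return " ".join(s for s in sentences if s in best)
```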

Neat and simple, huh? Of course, it doesn’t work without glitches. The extracted sentences are likely to be highly disconnected from each other, which makes for a poor reading experience. Still, it provides a methodical approach to getting a sense of what the article is about. And as you can imagine, you can always leverage additional capabilities of the NLP framework of your choice – for example, you can throw in entity extraction (people, places, dates …) as an additional enrichment.
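
To give one example of such an enrichment, spaCy (one of the libraries tested below) can pull out named entities in a handful of lines. The en_core_web_sm model is simply the small English model; the snippet is a hedged illustration, not part of the benchmark.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Ernest Hemingway was born in Oak Park, Illinois, in 1899.")

# Each entity carries a label such as PERSON, GPE (a place) or DATE
for ent in doc.ents:
    print(ent.text, ent.label_)
```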

Creating authentic, human-like text summaries requires extensive knowledge of deep learning. Abstractive summarization is a hard problem that is being tackled by recent advancements in machine learning.

If you want to dig deeper into the distinction between these two approaches I recommend reading this article.

For the rest of this post I’ll stick to the pragmatic text extraction approach.

Introducing the Toolbox

The test compares four different NLP libraries. Two of them are mature, general-purpose NLP frameworks; the remaining two are small libraries specialised in working with larger texts. My objective was not to choose a winner – I didn’t put enough time and effort into thorough benchmarking for that. However, I believe you’ll get a sneak peek into each of them and a rough idea of what to expect.

Here they are, in a nutshell …

spaCy

  • https://spacy.io
  • well-adopted, general-purpose NLP framework in Python
  • 50+ languages
  • fast and efficient

NLTK

  • https://www.nltk.org
  • well established NLP toolkit in Python
  • 50+ corpora and lexical sources
  • versatile library suitable for teaching purposes

Gensim

  • https://radimrehurek.com/gensim
  • Python library specialised in finding similarity in documents
  • performant: able to process large, web-scale corpora
  • suitable for distributed computations

Sumy

  • https://github.com/miso-belica/sumy
  • Python library for extracting summaries from HTML pages or plain texts
  • started as a thesis, focused on Czech / Slovak languages
  • popular and seems to cope well with English 🙂

Text Extraction Algorithm

NLTK and spaCy share the exact same approach. You can use both to reliably identify words and sentences; the rest can be easily coded up. I hope the gist below is self-explanatory – if not, just leave me a comment and I’ll gladly help.
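
In case the embedded gist doesn’t render for you, the sketch below captures the same frequency-based approach using spaCy. It is a simplified illustration of my own, not the gist itself; the function name spacy_summarize and the en_core_web_sm model are assumptions made purely for the example.

```python
from collections import Counter
from heapq import nlargest

import spacy

# Assumes the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def spacy_summarize(text, top_n=3):
    """Frequency-based extractive summary with spaCy (illustrative sketch)."""
    doc = nlp(text)

    # Word frequencies, skipping stop words and punctuation
    freq = Counter(
        token.text.lower() for token in doc if token.is_alpha and not token.is_stop
    )
    max_freq = max(freq.values())

    # Score each sentence by the normalised frequencies of its words
    scores = {
        sent: sum(freq[token.text.lower()] / max_freq for token in sent)
        for sent in doc.sents
    }

    # Keep the highest-scoring sentences, printed in their original order
    best = set(nlargest(top_n, scores, key=scores.get))
    return " ".join(sent.text for sent in doc.sents if sent in best)
```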

Gensim and Sumy provide text summarization out of the box.
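
As a rough sketch of what “out of the box” means, the snippet below calls both. A few caveats: Gensim’s built-in summarize lives in gensim.summarization, which was removed in Gensim 4.0, so that import assumes a 3.x release; the LexRank summarizer is just one of several algorithms Sumy ships with; and the file name hemingway.txt is a placeholder for the article text.

```python
# `text` is assumed to hold the full article, e.g. a local copy of the Wikipedia page
text = open("hemingway.txt", encoding="utf-8").read()

# Gensim: built-in extractive summariser (gensim < 4.0 only)
from gensim.summarization import summarize as gensim_summarize

print(gensim_summarize(text, ratio=0.1))

# Sumy: parse plain text and run one of its summarisers (LexRank here)
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, sentences_count=3):
    print(sentence)
```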

The Benchmark

I’ve chosen an article from Wikipedia. It’s about one of my favourite writers, Ernest Hemingway. The article is quite long, so rather than posting the text excerpt here, go and explore the notebook.
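
The notebook contains the exact excerpt and the preprocessing steps. If you’d rather pull the article yourself, one common way (a hypothetical sketch, not taken from the notebook) is requests plus BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

# Download the Wikipedia article and keep only the paragraph text
url = "https://en.wikipedia.org/wiki/Ernest_Hemingway"
page = requests.get(url, timeout=30)
soup = BeautifulSoup(page.text, "html.parser")
text = " ".join(p.get_text() for p in soup.find_all("p"))
```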

Summary

All four candidates did, in my opinion, a decent job. NLTK and spaCy proved the usefulness of the custom text extraction. The specialists (Gensim and Sumy) didn’t disappoint either. With Gensim in particular, I was able to get a sensible summary in two lines of code. Impressive, especially given that I didn’t try to tweak the config or help the tool in any way.



