Data is gold and time is of the essence. Wouldn’t it be great to get a summary of an elaborate article, the latest news or a complicated piece of legal text? Well, we aren’t quite there yet. Algorithms that generate concise, information-rich summaries that make sense and are pleasant to read remain a tough challenge. Nevertheless, I decided to give it a go and experimented with various NLP libraries. While the results aren’t perfect, they look … promising. Follow me through this brief tutorial, where I explain the absolute basics of simple text extraction algorithms and provide you with a whole notebook so you can rock on your own.

A while ago I gave an overview of text processing algorithms. It was only logical to move on and show a practical application. However, first things first …

Give Credit Where Credit’s Due

Most of this tutorial is based on a brilliant YouTube post by Jesse E. Agbe. In my notebook, I’ve highlighted the common parts of the summarization task and provided step-by-step guidance.

TL;DR

I’d appreciate it if you took the time to read this post to the end and shared your feedback. However, I understand that you are pressed for time and just want to get your hands on some code.



Your comments and criticism are highly appreciated. Google Colab makes it particularly easy to comment and share your feedback. Thanks in advance.

Without further ado, let’s delve right into it.

Mere Extraction vs Clever Abstraction

Text extraction is the simplest and most commonly used approach to generating text summaries. Its simplicity stems from a straightforward scoring formula.

Given a piece of text:

  • find prevalent words, i.e. identify all words and count the frequency of each one
  • identify key words, i.e. score individual words relative to the most frequent one
  • score the sentences, i.e. break the text down into sentences and score each of them based on the word scores
  • pick an arbitrary number of the highest-scoring sentences and print them out in the order they appear in the original text (a minimal sketch of this recipe follows below)
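
To make the recipe concrete, here is a minimal, dependency-free sketch of the idea in plain Python. The regex tokenisation, the naive sentence splitting and the absence of stop-word filtering are deliberate simplifications for illustration; the notebook relies on proper NLP tokenisers instead.

```python
import re
from collections import Counter

def naive_summarize(text, top_n=3):
    """Frequency-based extractive summary (illustrative sketch, not the notebook code)."""
    # 1. Find prevalent words: lower-cased regex tokenisation, no stop-word removal
    words = re.findall(r"[A-Za-z']+", text.lower())
    freq = Counter(words)

    # 2. Identify key words: score each word relative to the most frequent one
    max_freq = max(freq.values())
    word_scores = {word: count / max_freq for word, count in freq.items()}

    # 3. Score the sentences: naive split on end-of-sentence punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentence_scores = {
        sentence: sum(word_scores.get(w, 0) for w in re.findall(r"[A-Za-z']+", sentence.lower()))
        for sentence in sentences
    }

    # 4. Pick the top sentences and return them in their original order
    best = set(sorted(sentences, key=sentence_scores.get, reverse=True)[:top_n])
    return " ".join(s for s in sentences if s in best)
```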

Neat and simple, huh? Of course, it doesn’t work without glitches. The extracted sentences are likely to be highly disconnected from each other, which makes for a poor reading experience. Still, it provides a methodical approach to getting a sense of what the article is about. And as you can imagine, you can always leverage additional capabilities of the NLP framework of your choice – for example, you can throw in entity extraction (people, places, dates …) as an additional enrichment.
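
To give one example of such an enrichment, spaCy (one of the libraries tested below) can pull out named entities in a handful of lines. The en_core_web_sm model is simply the small English model; the snippet is a hedged illustration, not part of the benchmark.

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Ernest Hemingway was born in Oak Park, Illinois, in 1899.")

# Each entity carries a label such as PERSON, GPE (a place) or DATE
for ent in doc.ents:
    print(ent.text, ent.label_)
```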

Creating authentic, human-like text summaries requires extensive knowledge of deep learning. Abstractive summarization is a hard problem that is being tackled by recent advancements in machine learning.

If you want to dig deeper into the distinction between these two approaches I recommend reading this article.

For the rest of this post I’ll stick to the pragmatic text extraction approach.

Introducing the Toolbox

The test compares four different NLP libraries. Two of them are mature, general-purpose NLP frameworks; the remaining two are small libraries specialised in working with larger texts. My objective was not to choose a winner – I didn’t put enough time and effort into thorough benchmarking for that. However, I believe you’ll get a sneak peek into each of them and a rough idea of what to expect.

Here they are, in a nutshell …

spaCy

  • https://spacy.io
  • well-adopted, general-purpose NLP framework in Python
  • 50+ languages
  • fast and efficient

NLTK

  • https://www.nltk.org
  • well established NLP toolkit in Python
  • 50+ corpora and lexical sources
  • versatile library suitable for teaching purposes

Gensim

  • https://radimrehurek.com/gensim
  • Python library specialised in finding similarity in documents
  • performant: able to process large, web-scale corpora
  • suitable for distributed computations

Sumy

  • https://github.com/miso-belica/sumy
  • Python library for extracting summaries from HTML pages or plain texts
  • started as a thesis, focused on Czech / Slovak languages
  • popular and seems to cope well with English 🙂

Text Extraction Algorithm

NLTK and spaCy share the exact same approach. You can use both to reliably identify words and sentences; the rest can be easily coded up. I hope the gist below is self-explanatory – if not, just leave me a comment and I’ll gladly help.
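
In case the embedded gist doesn’t render for you, the sketch below captures the same frequency-based approach using spaCy. It is a simplified illustration of my own, not the gist itself; the function name spacy_summarize and the en_core_web_sm model are assumptions made purely for the example.

```python
from collections import Counter
from heapq import nlargest

import spacy

# Assumes the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def spacy_summarize(text, top_n=3):
    """Frequency-based extractive summary with spaCy (illustrative sketch)."""
    doc = nlp(text)

    # Word frequencies, skipping stop words and punctuation
    freq = Counter(
        token.text.lower() for token in doc if token.is_alpha and not token.is_stop
    )
    max_freq = max(freq.values())

    # Score each sentence by the normalised frequencies of its words
    scores = {
        sent: sum(freq[token.text.lower()] / max_freq for token in sent)
        for sent in doc.sents
    }

    # Keep the highest-scoring sentences, printed in their original order
    best = set(nlargest(top_n, scores, key=scores.get))
    return " ".join(sent.text for sent in doc.sents if sent in best)
```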

Gensim and Sumy provide text summarization out of the box.
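
As a rough sketch of what “out of the box” means, the snippet below calls both. A few caveats: Gensim’s built-in summarize lives in gensim.summarization, which was removed in Gensim 4.0, so that import assumes a 3.x release; the LexRank summarizer is just one of several algorithms Sumy ships with; and the file name hemingway.txt is a placeholder for the article text.

```python
# `text` is assumed to hold the full article, e.g. a local copy of the Wikipedia page
text = open("hemingway.txt", encoding="utf-8").read()

# Gensim: built-in extractive summariser (gensim < 4.0 only)
from gensim.summarization import summarize as gensim_summarize

print(gensim_summarize(text, ratio=0.1))

# Sumy: parse plain text and run one of its summarisers (LexRank here)
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, sentences_count=3):
    print(sentence)
```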

The Benchmark

I’ve chosen an article from Wikipedia. It’s about one of my favourite writers, Ernest Hemingway. The article is quite long, so rather than posting the text excerpt here, go and explore the notebook.
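
The notebook contains the exact excerpt and the preprocessing steps. If you’d rather pull the article yourself, one common way (a hypothetical sketch, not taken from the notebook) is requests plus BeautifulSoup:

```python
import requests
from bs4 import BeautifulSoup

# Download the Wikipedia article and keep only the paragraph text
url = "https://en.wikipedia.org/wiki/Ernest_Hemingway"
page = requests.get(url, timeout=30)
soup = BeautifulSoup(page.text, "html.parser")
text = " ".join(p.get_text() for p in soup.find_all("p"))
```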

Summary

All four candidates did, in my opinion, a decent job. NLTK and spaCy proved the usefulness of the custom text extraction. The specialists (Gensim and Sumy) didn’t disappoint either. With Gensim in particular, I was able to get a sensible summary in two lines of code. Impressive, especially given that I didn’t try to tweak the config or help the tool in any way.



