Contributed by: Sandra Mitrović
This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch!
In extractive automated single-document summarization, we aim at teaching machines to recognize the most informative parts of the given text (usually sentences), which should then be extracted and concatenated into the resulting summary. The main challenge here is, obviously, to distinguish between the informative and uninformative sentences. While the definition of “informative” sentence is not very strict, literature undoubtedly agrees that the aim is to identify those sentences which are important (that is, which carry key information related to the given text) and novel (which contain some additional information, as compared to previous sentences).
A traditional way to determine sentence importance within the text is to perform statistical analysis based on word frequencies and its variations (e.g. term frequency-inverted document frequency). The assumption behind usage of this method is that repetitive occurrence of a term implies its importance. Due to different forms that words can have, there is often some kind of data pre-processing, such as stemming which let us obtain the “root” form of a word. E.g. the idea behind extensive usage of term frequency-inverted document frequency (TF-IDF) approach is to favor those terms which appear more often in a particular sentence (high TF, descriptive words) but do not appear too often in the whole text (high IDF, not discriminative words), thus avoiding frequently used conjunctions, pronouns, etc. (in literature known as so-called “stopwords”.
Similar statistical methods can be used in the task of determining novelty in a sentence (or better: comparing the content of two sentences), in the following way: we first create a vocabulary consisting of all different (stemmed) words existing in the given text, we then represent each sentence as a vector in the vocabulary space as a bag-of-words (using e.g. TF-IDF weights) and we then compare these two using a similarity measure. There is a whole plethora of different similarity measures used in the literature so far. The most simple and most widely used one is probably cosine similarity, which in its basic form can be interpreted as a measure of overlapping words in two sentences, although this can be enriched in various ways using weights, calculating overlapping phrases of different lengths (e.g. bi-grams, tri-grams) etc. Some other, also very heavily used (dis)similarity measures are Euclidean distance (and its variations: square and normalized squared), Manhattan distance, Jaccard, Dice, hamming, edit, Damerau–Levenshtein distance and so on.
All these mentioned similarity measures are focusing on the content from purely syntactical aspects, which is especially useful if the text to be summarized is long. If on the other side, we have to deal with very short texts consisting of only few sentences, existing work proved that trying to exploit the semantics of the text could be very beneficial. For example, if the writer is intensively using synonyms (different words having the same meaning), then simply using a bag-of-words approach with cosine similarity measure can lead us to conclusion that two sentences are different, while in essence they convey the same message.
Sentence similarity observed from semantic point of view boils down to phrasal (semantic) similarity and further to word (semantic) similarity. Semantic word similarity measures can be divided in two wide categories: ontology/thesaurus-based and information theory/corpus-based (also called distributional).
Ontology/thesaurus-based measures relate to the distance or shortest path between two concepts in ontology (also referred to as path-based measures). The simplest form of path-based similarity measure calculates similarity score as inversely proportional to the number of nodes along the shortest path between the concepts. Some variations include using number of edges in the shortest path linking the two concepts, taking logarithmic values of the shortest path length (known as Leacock-Chodorow measure), using the depths of concepts and their least common subsumer (Wu & Palmer similarity) etc. However, path-based measures have some significant drawbacks. First, the elementary distance between two concepts in ontology is uniform and this, in reality, is not the case. Additionally, these similarity measures are not able to discriminate based on the level of hierarchy of the nodes included in the shortest path, which is also quite important. For illustration, if we assume that similarity score is calculated as the shortest path length, if we have two concepts connected through high-level nodes in hierarchy (which are quite abstract), then these two concepts should be considered less similar than some others having the same similarity score but whose shortest path goes through lower-level nodes.
These drawbacks provide motivation for introducing information-content based similarity measures, which are based on the idea that the more two words have in common, the more similar they are. The most representative measure among these is Resnik similarity where the common information between two concepts is defined as the information content provided by their least common subsumer (that is, the lowest node in hierarchy which subsumes both concepts). There exist as well other similarity measures defined on top of Resnik similarity, such as Jiang-Conrath similarity, Lin similarity etc.
Different ontologies have now being developed for different domains and languages. WordNet is probably the most used general-purpose hierarchically organized lexical database and on-line thesaurus in English. WordNet is based on the concept of synsets, sets of near-synonyms that share a common meaning, which are connected among themselves through different semantic/ontological relations: hypernyms (from concept to superordinate), hyponyms (from concept to subordinate), meronyms (from whole to part), holonyms (from part to whole) etc. Hence, while using WordNet for checking ontology-based sentence similarity is very useful, it requires significant computational effort to develop whole sequence of steps to go from synsets level to the level of words and finally that of sentences.
Distributional similarity represents another important method for computing similarity, especially as it overcomes the problems of missing or incomplete ontology. Unlike ontology-based metrics, where words are similar in meaning if they are “near” in hierarchy or have similar definitions, in distributional algorithms words are similar if they have similar distributional contexts, that is, if their context vectors are similar. Obviously, for calculating these distributional contexts, (at least one text) corpus is required, hence the name corpus-based. Some of the popular corpora are: Brown corpus, Penn Treebank, The Reuters, CONLL etc.
On top of already mentioned distance measures, the distance between two distributions can be found using as well Kullback-Leibler or Jensen-Shannon divergence. Another well-known technique used in corpus-based similarity research area is pointwise mutual information (PMI). PMI is used in other domains as well and, in general, provides the information on the logarithmic discrepancy between the joint probability of two events and the probability of them occurring together assuming their independency. In the text mining context, these events can be associated with two words occurring together or a word occurring in a particular context etc. The former is heavily used for detecting so-called collocations – sets of words usually used together conveying a different meaning from the one we might derive by translating their parts. The latter can be used for text summarization to construct word-context matrices on which then different methods can be applied (e.g. Latent Semantic Analysis). PMI equals zero if two events are independent while it is maximal when a word and a context (or two words) occur simultaneously. Since PMI is biased toward infrequent occurrences, it is common practice to apply some of the existing smoothing techniques.
In this article, we discussed different types of similarity measures used for automated text summarization. While we did not aim to provide a comprehensive list of similarity measures, we introduced different similarity measures and to compare their advantages and purposes. It is also very important to mention that the choice of sentence similarity measure influences summarization result, as several previous works have shown.
References
- Dan Jurafsky and James Martin, Speech & Language Processing
- Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing
- Zhang Pei-Ying and Li Cun-He, Automatic text summarization based on sentences clustering and extraction
- Sandra Mitrović and Henning Müller, Summarizing Citation Contexts of Scientific Publications
- http://wordnet.princeton.edu/