# The Mathematical Intrigue of Punctuation in Major Language Literature

By the Henryk Niewodniczanski Institute of Nuclear Physics, Polish Academy of Sciences

A moment of hesitation… Yes, a full stop here – but shouldn’t there be a comma? Or would a hyphen be better? Punctuation can be a nuisance; it is often simply neglected. A mistake! The latest statistical analyses paint a different picture: punctuation seems to “grow” out of foundations common to all the (examined) languages, and its features are far from trivial.

To many, punctuation seems a necessary evil, to be happily ignored whenever possible. Recent analyses of literature written in the world's major living languages call for a change of opinion. The same statistical features of punctuation usage patterns have been observed in several hundred works written in seven languages, most of them Western.

Punctuation, whose marks can all be found in the opening paragraph of this text, turns out to be a universal and indispensable complement to the mathematical perfection of every language studied. This remarkable conclusion about the role of mere commas, exclamation marks, or full stops comes from an article by scientists from the Institute of Nuclear Physics of the Polish Academy of Sciences (IFJ PAN) in Kraków, published in the journal Chaos, Solitons, and Fractals.

“The current analyses are an extension of our previous findings on the multifractal features of sentence length variation in works of world literature. After all, what is sentence length? It is nothing more than the distance to the next specific punctuation mark – the full stop,” says Professor Stanislaw Drozdz (IFJ PAN, Cracow University of Technology).

Two sets of texts were studied. The main analyses of punctuation within each language were performed on 240 popular literary works written in seven major Western languages: English (44), German (34), French (32), Italian (32), Spanish (32), Polish (34), and Russian (32). This selection of languages followed a two-part criterion: the researchers required that at least 50 million people speak the language in question, and that works written in it had been awarded at least five Nobel Prizes for Literature.

In addition, so that the results could be validated statistically, each book had to contain at least 1,500 word sequences separated by punctuation marks. A separate collection was prepared to observe the stability of punctuation in translation. It contained 14 works, each of which was available in each of the languages studied (two of the 98 language versions, however, were omitted because they were unavailable).

In total, the authors in both collections included such writers as Conrad, Dickens, Doyle, Hemingway, Kipling, Orwell, Salinger, Woolf, Grass, Kafka, Mann, Nietzsche, Goethe, La Fayette, Dumas, Hugo, Proust, Verne, Eco, Cervantes, Sienkiewicz, and Reymont.

The attention of the Krakow researchers was primarily drawn to the statistical distribution of the distance between consecutive punctuation marks. It soon became apparent that in all the languages studied, it was best described by one of the precisely defined variants of the Weibull distribution.
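The quantity being measured can be illustrated with a minimal sketch, which counts the number of words between consecutive punctuation marks. The mark set `MARKS`, the tokenizing pattern `TOKEN`, and the function name `intervals` are assumptions made for illustration; the study's exact list of marks and its tokenization rules are not given here.

```python
import re

# Hypothetical set of punctuation marks; the study's actual list may differ.
MARKS = ".,;:!?…–"

# A token is either a word (letters/digits, with inner apostrophes or hyphens)
# or a single punctuation mark from MARKS.
TOKEN = re.compile(r"\w+(?:['’-]\w+)*|[" + re.escape(MARKS) + "]")


def intervals(text):
    """Return the word counts between consecutive punctuation marks."""
    lengths = []
    words = 0
    for tok in TOKEN.findall(text):
        if tok in MARKS:
            if words:  # ignore runs of adjacent marks such as "?!"
                lengths.append(words)
            words = 0
        else:
            words += 1
    # Words after the last mark are dropped: their interval is not closed.
    return lengths


sample = "Yes, stop here – but shouldn't there be a comma? Or would a hyphen be better?"
print(intervals(sample))  # [1, 2, 6, 6]
```

Collecting these counts over a whole book gives the empirical distribution to which a model distribution can then be fitted.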

A curve of this kind has a characteristic shape: it rises rapidly at first and then, after reaching a maximum, falls somewhat more slowly to a certain critical value, below which it approaches zero at a small and steadily diminishing rate. The Weibull distribution is usually used to describe survival phenomena (such as population as a function of age), but also various physical processes, such as increasing material stress.
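The text does not state which variant of the distribution was used. As a sketch only, one standard discrete form of the Weibull distribution assigns to an interval of k words the probability P(k) = (1 - p)^((k - 1)^β) - (1 - p)^(k^β). Whether this is the exact variant used in the article is an assumption, and the parameter names `p` and `beta` as well as the values below are illustrative, not fitted results.

```python
def discrete_weibull_pmf(k, p, beta):
    """Probability of an interval of exactly k words (k = 1, 2, ...).

    Assumed form: P(k) = (1 - p)**((k - 1)**beta) - (1 - p)**(k**beta).
    For beta == 1 this reduces to the geometric distribution.
    """
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)


# Illustrative parameters only (not taken from the study).
p, beta = 0.1, 1.3
for k in range(1, 11):
    print(k, round(discrete_weibull_pmf(k, p, beta), 4))
```

With `beta` greater than 1 and a small `p`, the probabilities first rise and then fall off, which matches the shape described above.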

“The fit between the distribution of word sequence lengths between punctuation marks and the functional form of the Weibull distribution was better the more types of punctuation marks we included in the analyses; for all marks the fit was found to be nearly perfect. At the same time, some differences in the distributions between languages appear. However, these amount to nothing more than slightly different values of the distribution parameters, specific to the language in question. Punctuation thus appears to be an integral part of all the languages studied,” notes Prof. Drozdz.

After a moment he adds with some amusement: “…and since the Weibull distribution is concerned with phenomena such as survival, it may be said, only slightly tongue in cheek, that punctuation by its very nature involves a genuine struggle for survival.”

The next stage of the analyses consisted of determining the hazard function. In the case of punctuation, it describes how the conditional probability of success (that is, the probability that the next punctuation mark appears) changes given that no such mark has yet appeared in the analysed sequence.
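As a rough illustration of this definition, the hazard at k can be estimated from data as the number of intervals that end after exactly k words divided by the number of intervals that are at least k words long. The function name `empirical_hazard` and the input list are hypothetical; this is a simple counting estimator, not necessarily the procedure used by the authors.

```python
from collections import Counter


def empirical_hazard(lengths):
    """Estimate h(k) = #(intervals of length k) / #(intervals of length >= k)."""
    if not lengths:
        return {}
    counts = Counter(lengths)
    at_risk = len(lengths)  # intervals that have not ended before k words
    hazard = {}
    for k in range(1, max(lengths) + 1):
        hazard[k] = counts[k] / at_risk
        at_risk -= counts[k]  # these intervals ended at k
    return hazard


# Hypothetical interval lengths (words between consecutive marks).
print(empirical_hazard([1, 2, 6, 6]))
```

A rising hazard curve means that the longer a run of words has gone on without a mark, the more likely the next word is to be followed by one.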

The results here are clear: the language with the least tendency to use punctuation marks is English, with Spanish not far behind; the Slavic languages proved to be the most dependent on punctuation marks. The hazard function curves for punctuation marks in six of the languages studied follow a similar pattern, differing mainly in vertical shift.

German proved to be the exception. Its hazard function is the only one that crosses most of the curves obtained for the other languages. German punctuation thus appears to combine features of the punctuation of many languages, making it a kind of Esperanto of punctuation.

The above observation is consistent with the next analysis, which examined whether the punctuation features of original literary works can be seen in their translations. As expected, the language that most faithfully carried punctuation over from the original language to the target language turned out to be German.

In spoken communication, pauses may be justified by human physiology, such as the need to catch one’s breath or take time to structure what is to be said next in one’s mind. And in written communication?

“Creating a sentence by adding one word after another while making sure the message is clear and unambiguous is a bit like drawing a bowstring: it is easy at first, but becomes more demanding with each passing moment. If there are no ordering elements in the text (and this is the role of punctuation), the difficulty of interpretation increases the longer the string of words. A bow drawn too tight can break, and a sentence that is too long can become incomprehensible. You are faced with the necessity to ‘release the arrow’, i.e. to close a passage of text with some kind of punctuation mark. This observation applies to all the languages analysed, so we are dealing with what might be called a linguistic law,” says Dr. Tomasz Stanisz (IFJ PAN), first author of the article in question.

Finally, it should be noted that the invention of punctuation marks is relatively recent – punctuation marks did not occur at all in ancient texts. The emergence of optimal punctuation patterns in modern written languages can therefore be interpreted as a result of their evolutionary development. However, an excessive need for punctuation is not necessarily a sign of such advancement.

In light of the above studies, English and Spanish, today's two most universal languages, seem less strict about the frequency of punctuation. It is likely that these languages are so formalized in terms of sentence construction that there is less room for ambiguity that would need to be resolved with punctuation.

Tomasz Stanisz et al., Universal versus system-specific features of punctuation usage patterns in major Western languages, Chaos, Solitons, and Fractals (2023). DOI: 10.1016/j.chaos.2023.113183

Provided by the Henryk Niewodniczanski Institute of Nuclear Physics, Polish Academy of Sciences

