Zipf's law and the fractal nature of social and economic phenomena

I first came across a description of Zipf's law in the course of my reading. The essence of the law: if the words of any text are ranked by frequency of use, then the product of a word's rank and its frequency is a constant:

F*R=C, where:

F is the frequency of occurrence of the word in the text;

R is the rank of the word (the most frequently used word gets rank 1, the next gets rank 2, and so on);

C is a constant.

For those who still remember a little algebra :), the formula above is easy to recognize as the equation of a hyperbola. Zipf determined experimentally that C ≈ 0.1 (with F expressed as a fraction of all the words in the text). So the graph of Zipf's law looks approximately like this:

Fig. 1. Hyperbola of Zipf's law.

Hyperbolas have a remarkable property. If we take a logarithmic scale for both axes, then the hyperbola will look like a straight line:

Fig. 2. The same hyperbola, plotted on logarithmic scales.
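Why the line is straight: taking logarithms of F*R=C gives log F = log C - log R, a line with slope -1 in log-log coordinates. The following is only a minimal Python sketch of how one might check this on a text; the function names `rank_frequency` and `loglog_slope` and the word-splitting rule are my own assumptions, not something taken from Zipf or from any search engine.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, relative frequency) pairs, most frequent word first."""
    # Assumption: a "word" is any run of Latin letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words).most_common()
    return [(rank, count / total)
            for rank, (_, count) in enumerate(counts, start=1)]


def loglog_slope(pairs):
    """Least-squares slope of log(F) plotted against log(R)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Idealized check: frequencies that follow F = C / R exactly, with C = 0.1.
ideal = [(rank, 0.1 / rank) for rank in range(1, 101)]
slope = loglog_slope(ideal)
print(round(slope, 6))  # an exact hyperbola gives a slope of -1
```

For a real text one would call `loglog_slope(rank_frequency(text))` and look at how far the result is from -1; short texts are usually too noisy for the fit to mean much.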

The question may arise: what does any of this have to do with search engine optimization? It turns out that specially generated texts stuffed with keywords do not fit the law. Search engines (Google, Yandex) check texts for "naturalness", that is, for compliance with Zipf's law, and either lower the ranking of sites with "suspicious" texts or ban such sites altogether.

The second time I encountered Zipf's law was in a book by Benoit Mandelbrot. I liked this short section so much that I will quote it in full.

Unexpected power law

In 1950, I was a young mathematics student at the University of Paris looking for a topic for my dissertation. My uncle Szolem was the local textbook example of a professor of mathematics: a deep theorist, very conservative and, despite being born in Poland, a pillar of the French scientific community. At the age of just 31 he had been elected to a full professorship at the prestigious Collège de France.

That was the era of Nicolas Bourbaki; behind this collective pseudonym stood a mathematical "club" which, like Dada in art or existentialism in literature, spread from France and for a time became extremely influential on the world stage. Abstraction and pure mathematics, mathematics for its own sake, were elevated to a cult; members of the "club" despised pragmatism, applied mathematics, and even mathematics as a tool of science. This approach was dogma for French mathematicians, and for me it was perhaps the reason to leave France and go to work at IBM. I was, to my uncle's dismay, a young rebel. While working on my doctoral dissertation, I often dropped into his office at the end of the day to chat, and these conversations often turned into arguments. Once, hoping to brighten up the long and boring subway ride home, I asked him for something to read on the way. He reached into the wastebasket and pulled out several crumpled pieces of paper.

“Here, take this,” my uncle muttered. “It's just the kind of silly stuff you love.”

It was a review of a book by sociologist George Kingsley Zipf. Zipf, a man wealthy enough not to worry about earning his daily bread, lectured at Harvard University on a discipline of his own invention, which he called statistical human ecology. In his book Human Behavior and the Principle of Least Effort, power laws were presented as ubiquitous structures in the social sciences. In physics, power laws are quite common and act as a form of what I now call fractal self-repetition across scales. Seismologists have a mathematical formula for the power-law dependence of the number of earthquakes on their strength on the famous Richter scale. In other words, weak earthquakes are common and strong ones are rare, and the frequency and strength of earthquakes are related by an exact formula. At that time there were few such examples, and they were known to only a few people. Zipf, an encyclopedist, was obsessed with the idea that power laws were not limited to the physical sciences; that all manifestations of human behavior, organization and anatomy were subject to them, even the size of the genitals.

Fortunately, the review of the book that my uncle gave me limited itself to a single, unusually elegant example: the frequency of words. In text or speech, some words, such as the English "the" (the definite article) or "this", occur frequently; others, such as "milreis" or "momus", appear rarely or never (for the most inquisitive: the first is an old Portuguese coin, the second a synonym for "critic"). Zipf proposed the following exercise: take any text and count how many times each word appears in it. Then assign a rank to each word: 1 for the most frequently used word, 2 for the second most frequent, and so on. Finally, draw a graph showing, for each rank, the number of occurrences of the corresponding word. The result is an amazing picture. The curve does not decrease uniformly from the most common word in the text to the rarest. At first it falls with dizzying speed, and then it begins to decrease more slowly, tracing the trajectory of a skier who leaps from a ski jump, lands, and glides down the relatively gentle slope of a snow-covered mountain. It is a classic example of uneven scaling. Zipf, fitting a curve to his diagrams, came up with a formula for it.

I was stunned. By the end of my long subway ride, I already had a topic for half of my doctoral dissertation. I knew exactly how to explain the mathematical foundations of the frequency distribution of words, which Zipf, not being a mathematician, could not have done. In the months that followed, amazing discoveries awaited me. This equation could be turned into a powerful tool for social research. An improved version of Zipf's formula made it possible to quantify and rank the richness of any person's vocabulary: a high value means a rich vocabulary, a low value a poor one. With such a scale, one can measure differences in vocabulary between texts or speakers. It becomes possible to quantify erudition. True, my friends and advisers were horrified by my determination to tackle this strange topic. Zipf, they told me, was a crank. I was shown his book, and I agreed that it was appalling. Counting words is not real mathematics, they assured me. If I took up this subject, I would never find a good job, and becoming a professor would not be easy either.

But I remained deaf to wise advice. Moreover, I wrote my dissertation without any advisers at all and even persuaded one of the university bureaucrats to certify it with a seal. I was determined to follow my chosen path to the end and apply Zipf's ideas to economics, because speech is not the only thing that can be reduced to a power law. Whether we are rich or poor, prosperous or starving: all this, too, seemed to me to be subject to a power law.

Mandelbrot slightly modified Zipf's formula:

F = C * R^(-1/a), where:

a is a coefficient characterizing the richness of the vocabulary; the larger the value of a, the richer the vocabulary of the text, since the curve of word frequency against rank decreases more slowly, and rare words therefore appear more often than with smaller values of a. It was this property that Mandelbrot intended to use to assess erudition.
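To see the effect of a on the tail of the curve, here is a small Python sketch that evaluates the modified formula at a high rank for several values of a. The function name `modified_zipf_frequency` and the sample values of a are my own choices for illustration, and C is simply held at Zipf's 0.1 for every a, which a real fit to a text would not do.

```python
def modified_zipf_frequency(rank, a, c=0.1):
    """Frequency predicted by F = C * R^(-1/a).

    With a = 1 this reduces to Zipf's original F = C / R.
    """
    return c * rank ** (-1.0 / a)


# Compare how often a word of rank 1000 is expected to occur.
tail = {a: modified_zipf_frequency(1000, a) for a in (0.8, 1.0, 1.2)}
for a, freq in tail.items():
    print(f"a = {a}: predicted frequency at rank 1000 is {freq:.2e}")
```

The larger a is, the higher the predicted frequency of the rare word, which is exactly the "slower decrease" described above.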

Not everything is so smooth with Zipf's law, and in specific applications it is not always possible to rely on the experimentally determined coefficient a. At the same time, Zipf's law is nothing more than Pareto's law turned inside out, since both are special cases of power laws, or ... a manifestation of the fractal nature of economic and social systems.

For myself, I formulated the essence of the fractal nature of economic systems as follows. On the one hand, there are games of chance: roulette, throwing dice. On the other hand, there is technological or physical randomness: variation in the diameter of a shaft turned on a lathe, variation in the height of adults. All of these phenomena are described by the normal (Gaussian) distribution. But there are a number of phenomena that do not follow this distribution: the wealth of countries and individuals, fluctuations in stock prices, exchange rates, the frequency of use of words, the strength of earthquakes ... What is characteristic of such phenomena is that the average value depends heavily on the sample. For example, if you take a hundred random people of different heights, adding the tallest person on Earth to them will not change the average height of the group much. But if we calculate the average wealth of a hundred random people, then adding the richest person on the planet, Carlos Slim Helú (and not Bill Gates, as many might think :)), will raise the average wealth of the group dramatically, to about 500 million dollars!
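The arithmetic behind this comparison can be sketched in a few lines of Python. All the input numbers here are assumptions made for illustration: an average height of 170 cm, a tallest person of 251 cm, an average wealth of 100,000 dollars, and a fortune of roughly 50 billion dollars (the order of magnitude implied by the "about 500 million" figure above).

```python
def mean_after_adding(group_mean, group_size, newcomer):
    """Mean of the group after one more member joins it."""
    return (group_mean * group_size + newcomer) / (group_size + 1)


# Assumed values, for illustration only.
new_height = mean_after_adding(170.0, 100, 251.0)    # centimetres
new_wealth = mean_after_adding(100_000.0, 100, 50e9)  # dollars

print(f"average height: 170.0 cm -> {new_height:.1f} cm")
print(f"average wealth: $100,000 -> ${new_wealth:,.0f}")
```

The average height moves by less than a centimetre, while the average wealth jumps by a factor of several thousand: one observation dominates the whole sample.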

Another manifestation of fractality is a significant stratification of the sample. Consider, for example,

You must agree that the resulting pattern is the spitting image of the Zipf curve!

One of the properties of fractality is self-repetition. Of the 192 countries in the list, just 18, or 9.4% (18/192), hold 80% of the world's wealth. If we now consider only these 18 countries, their total wealth of 46 trillion dollars is distributed just as unevenly: 80% of these 46 trillion is concentrated in fewer than half of the countries, and so on.
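This nested "80% in the hands of a few" check is easy to repeat mechanically. The Python sketch below does so on synthetic data, not on the country list discussed here: the 192 values are simply assumed to fall off as 1 / rank^1.5, and the function names are my own.

```python
def top_count_for_share(values, share=0.8):
    """Smallest number of largest values whose sum reaches `share` of the total."""
    ordered = sorted(values, reverse=True)
    target = share * sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= target:
            return count
    return len(ordered)


def nested_concentration(values, share=0.8, levels=3):
    """Apply the share test to the whole list, then to its top group, and so on."""
    ordered = sorted(values, reverse=True)
    steps = []
    for _ in range(levels):
        if len(ordered) < 2:
            break
        count = top_count_for_share(ordered, share)
        steps.append((len(ordered), count))
        ordered = ordered[:count]  # keep only the top group for the next level
    return steps


# Hypothetical data: 192 "countries" whose wealth falls off as 1 / rank^1.5.
synthetic = [1.0 / rank ** 1.5 for rank in range(1, 193)]
steps = nested_concentration(synthetic)
for size, count in steps:
    print(f"{count} of {size} hold 80% of the group's total")
```

On power-law data each level shows a minority holding most of the total; on normally distributed data the same test would need most of the group at every level.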

You may ask: what is the practical conclusion of all this? I would say this:

  1. Social and economic systems are not described by a Gaussian. Their patterns obey power laws [synonymous with fractal nature].
  2. Outliers from the mean are substantially more likely than the Gaussian bell curve predicts. Moreover, outliers are intrinsic to the system; they are not accidental, but regular.
  3. Risk estimates for rare undesirable events cannot be built on the normal probability distribution.
  4. … I won’t lie, I can’t think of anything else yet… but this does not mean that there are no more practical conclusions… it’s just that my knowledge is limited to this…

... but you must admit, beautiful patterns!

For fractality, see Benoit Mandelbrot

It should be noted that data from different sources vary greatly, but this is not relevant to the topic discussed here.

