Zipf's law and the fractal nature of social and economic phenomena

I first came across a description of Zipf's law while reading. The essence of the law: if the words of any text are ranked by frequency of use, then the product of a word's rank and its frequency is a constant:

F*R=C, where:

F is the frequency of occurrence of the word in the text;

R is the word's rank (the most frequently used word gets rank 1, the next rank 2, and so on);

C is a constant.

For those who still remember a little algebra :), it is easy to recognize the equation of a hyperbola in the formula above. Zipf experimentally determined that C ≈ 0.1 (with F taken as relative frequency). So the graph of Zipf's law looks approximately like this:

Fig. 1. The hyperbola of Zipf's law.
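The formula F * R = C is easy to check empirically. A minimal sketch (not from the original article; the sample sentence is made up): count word frequencies in a text, rank them, and print rank × relative frequency, which should stay roughly constant.

```python
# Empirically check that rank * relative frequency is roughly constant.
from collections import Counter

def zipf_products(text):
    """Return (rank, word, R*F) triples, where F is relative frequency."""
    words = text.lower().split()
    total = len(words)
    counts = Counter(words).most_common()
    result = []
    for rank, (word, count) in enumerate(counts, start=1):
        freq = count / total                      # F: relative frequency
        result.append((rank, word, rank * freq))  # R * F, expected ~ C
    return result

sample = "the quick fox saw the dog and the dog saw the fox"
for rank, word, product in zipf_products(sample)[:4]:
    print(rank, word, round(product, 3))
```

On a toy sentence the products fluctuate, but on long natural texts they cluster around a constant, which is exactly Zipf's observation.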


Hyperbolas have a remarkable property: if we use a logarithmic scale for both axes, the hyperbola turns into a straight line:

Fig. 2. The same hyperbola, but plotted on logarithmic scales.
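This straight-line property is easy to verify numerically. A small sketch (my own illustration, not from the article): taking logs of F = C/R gives log F = log C − log R, so a least-squares fit on log-log data should return a slope of exactly −1.

```python
# On log-log axes the hyperbola F = C/R becomes the straight line
# log F = log C - log R, i.e. a line with slope -1. Verify by fitting.
import math

C = 0.1
ranks = range(1, 101)
xs = [math.log(r) for r in ranks]
ys = [math.log(C / r) for r in ranks]

# Ordinary least-squares slope of y on x.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
print(round(slope, 6))  # -1.0
```

For real texts the fitted slope is only approximately −1, which is why deviations from the line are usable as a "naturalness" signal.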

The question may arise: what does this have to do with search engine optimization? It turns out that artificially generated texts stuffed with keywords do not fit the law. Search engines (Google, Yandex) check texts for "naturalness", that is, compliance with Zipf's law, and either lower the ranking of sites with "suspicious" texts or ban such sites outright.

The second time I met Zipf's law was in Benoit Mandelbrot's book, and I liked this short section so much that let me quote it in full.

Unexpected power law

In 1950, I was a young mathematics student at the University of Paris looking for a dissertation topic. My uncle Szolem was the local textbook example of a professor of mathematics: a deep theorist, very conservative, and, despite being born in Poland, a pillar of the French scientific community. At just 31, he had been elected full professor at the prestigious Collège de France.

That was the era of Nicolas Bourbaki; behind this collective pseudonym hid a mathematical "club" which, like Dada in art or existentialism in literature, spread from France and for a time became extremely influential on the world stage. Abstraction and pure mathematics, mathematics for its own sake, were elevated to a cult; members of the "club" despised pragmatism, applied mathematics, and even mathematics as a tool of science. This approach was dogma for French mathematicians, and for me, perhaps, the reason to leave France and go to work at IBM. To my uncle's dismay, I was a young rebel. While working on my doctoral dissertation, I often dropped into his office at the end of the day to chat, and these conversations often turned into arguments. Once, trying to brighten up the long and boring subway ride home ahead of me, I asked him for something to read on the way. He reached into the wastebasket and pulled out several crumpled sheets of paper.

"Here, take this," my uncle muttered. "The kind of stupid article you like."

It was a review of a book by the sociologist George Kingsley Zipf. Zipf, a man rich enough not to worry about his daily bread, lectured at Harvard University on a discipline he had invented himself, which he called statistical human ecology. In his book Human Behavior and the Principle of Least Effort, he treated power laws as ubiquitous structures of the social sciences. In physics, power laws are quite common and act as a form of what I now call fractal self-repetition across scales. Seismologists have a mathematical formula for the power-law dependence of the number of earthquakes on their strength on the famous Richter scale. In other words: weak earthquakes are common, while strong ones are rare, and frequency and strength are related by an exact formula. At that time there were few such examples, and they were known to only a handful of people. Zipf, an encyclopedist, was obsessed with the idea that power laws are not limited to the physical sciences; they govern all manifestations of human behavior, organization, and anatomy, even the size of the genitals.

Fortunately, the review my uncle gave me limited itself to a single, unusually elegant example: word frequencies. In text or speech, some words, such as the English the (the definite article) or this, occur frequently; others, like milreis or momus, appear rarely or never (for the most inquisitive: the first is an old Portuguese coin, the second a synonym for "critic"). Zipf proposed the following exercise: take any text and count how many times each word appears in it. Then assign each word a rank: 1 for the most frequently used word, 2 for the second most frequent, and so on. Finally, construct a graph showing, for each rank, the number of occurrences of that word. The result is an astonishing picture. The curve does not decrease uniformly from the most common word in the text to the rarest. At first it falls with dizzying speed, then it begins to decline more slowly, tracing the trajectory of a skier who has flown off a jump, landed, and is descending the relatively gentle slope of a snow-covered mountain: a classic example of non-uniform scaling. Zipf, fitting the curve to his diagrams, came up with a formula for it.

I was stunned. By the end of my long subway ride, I already had a topic for half of my doctoral dissertation. I knew exactly how to explain the mathematical foundations of the frequency distribution of words, which Zipf, not being a mathematician, could not have done. In the months that followed, amazing discoveries awaited me. Using this equation, one can build a powerful tool for social research. An improved version of Zipf's formula made it possible to quantify and rank the richness of anyone's vocabulary: a high value means a rich vocabulary, a low value a poor one. With such a scale one can measure differences in vocabulary between texts or speakers. It becomes possible to quantify erudition. True, my friends and advisers were horrified by my determination to tackle this strange topic. Zipf, they told me, was a crank. I was shown his book, and I agreed that it was dreadful. Counting words is not real mathematics, they assured me. If I took up this subject, I would never find a good job, and it would not be easy to become a professor either.

But I remained deaf to wise advice. Moreover, I wrote my dissertation without any advisers at all and even persuaded one of the university bureaucrats to certify it with a seal. I was determined to follow my chosen path to the end and apply Zipf's ideas in economics, because not only speech can be reduced to a power law. Whether we are rich or poor, thriving or starving, all of that too seemed to me the subject of a power law.

Mandelbrot slightly modified Zipf's formula:

F = C * R^(-1/a), where:

a is a coefficient characterizing the richness of the vocabulary: the larger the value of a, the richer the vocabulary of the text, since the rank-frequency curve decreases more slowly and, for example, rare words appear more often than at smaller values of a. It was this property that Mandelbrot intended to use to assess erudition.
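The coefficient a can be recovered from data. A sketch (my own illustration using the notation above): taking logs of F = C * R^(-1/a) gives log F = log C − (1/a) log R, so a least-squares fit of log-frequency on log-rank yields a slope of −1/a, from which a = −1/slope.

```python
# Estimate Mandelbrot's vocabulary-richness coefficient a in
# F = C * R**(-1/a) via a least-squares fit on log-log data.
import math

def fit_a(rank_freq_pairs):
    """Fit a from (rank, frequency) pairs; slope = -1/a on log-log axes."""
    xs = [math.log(r) for r, f in rank_freq_pairs]
    ys = [math.log(f) for r, f in rank_freq_pairs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -1 / slope   # slope = -1/a  =>  a = -1/slope

# Synthetic data generated with a = 1.25, C = 0.1: the fit recovers a.
data = [(r, 0.1 * r ** (-1 / 1.25)) for r in range(1, 51)]
print(round(fit_a(data), 3))  # 1.25
```

On real texts the fit is approximate, but comparing fitted values of a across authors is precisely the erudition measurement Mandelbrot describes.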

Not everything is so smooth with Zipf's law, and in specific applications one cannot always rely on an experimentally determined coefficient a. At the same time, Zipf's law is nothing more than Pareto's law "in reverse": both are special cases of a power law, or… a manifestation of the fractal nature of economic and social systems.

For myself, I formulated the essence of the fractal nature of economic systems as follows. On the one hand, there are games of chance: roulette, dice throws. On the other hand, technological/physical randomness: variation in the diameter of a shaft turned on a lathe, variation in the height of adults. All of these phenomena are described by the normal (Gaussian) distribution. But there is a class of phenomena that do not follow this distribution: the wealth of countries and individuals, fluctuations in stock prices and exchange rates, the frequency of word use, the strength of earthquakes… For such phenomena it is characteristic that the mean value depends heavily on the sample. For example, if you take a hundred random people of different heights, adding the tallest person on Earth will not change the group's average height much. But if you calculate the average income of a hundred random people, adding the richest person on the planet, Carlos Slim Helú (and not Bill Gates, as many might think :)), will increase everyone's average wealth dramatically, to about 500 million dollars!

Another manifestation of fractality is a significant stratification of the sample. Consider, for example,

Agree, the pattern presented is strikingly similar to the Zipf curve!

One of the properties of fractality is self-repetition. Thus, of the 192 countries on the list, 80% of the world's wealth is concentrated in just 18 countries, i.e. 9.4% of them (18/192). If we now consider only these 18 countries, their combined wealth of 46 trillion dollars is distributed just as unevenly: 80% of those 46 trillion is concentrated in fewer than half of the countries, and so on.
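This self-repeating concentration is easy to reproduce. A sketch (the 192-country count follows the text; the wealth values here are synthetic, w_n = 1/n, not real data): find what fraction of countries holds 80% of the total, then repeat the measurement inside that top group.

```python
# Self-similarity of concentration on Zipf-like synthetic "wealth" 1/n.
def share_holding(wealth, fraction=0.8):
    """Fraction of items (sorted desc) needed to reach `fraction` of the total."""
    wealth = sorted(wealth, reverse=True)
    target = fraction * sum(wealth)
    running = 0.0
    for i, w in enumerate(wealth, start=1):
        running += w
        if running >= target:
            return i / len(wealth)

tier1 = [1 / n for n in range(1, 193)]          # 192 "countries"
p1 = share_holding(tier1)                        # share holding 80% of all wealth
top = sorted(tier1, reverse=True)[: round(p1 * len(tier1))]
p2 = share_holding(top)                          # the same measurement inside the top
print(round(p1, 3), round(p2, 3))
```

A minority holds 80% at the first level, and a minority of that minority holds 80% again inside the top group: concentration repeats at each scale, which is the fractal self-repetition the text describes.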

You may ask: what is the practical conclusion of all this? I would say this:

  1. Social and economic systems are not described by a Gaussian. These patterns obey power laws [a synonym for fractal nature].
  2. Outliers from the mean are substantially more likely than the Gaussian bell curve predicts. Moreover, outliers are intrinsic to the system: they are not accidents but a regularity.
  3. Risk estimates cannot be built on the basis of a normal probability distribution of rare undesirable events.
  4. … I won’t lie, I can’t think of anything else yet… but this does not mean that there are no more practical conclusions… it’s just that my knowledge is limited to this…

... but you must admit, beautiful patterns!

For fractality, see Benoit Mandelbrot

It should be noted that data from different sources vary greatly, but this is not relevant to the topic discussed here.

Among the criteria for assessing the quality of the text, its naturalness is considered the main one. This indicator can be verified using a mathematical method discovered by the American linguist George Zipf.

The Zipf's law test is a method for assessing the naturalness of a text by checking the regularity of word frequencies: a word's frequency is inversely proportional to its rank in the frequency list.

Zipf's first law "rank - frequency"

C = (frequency of occurrence of a word × frequency rank) / number of words.

If we take the product of a word's frequency and its frequency rank, the value (C) remains constant; this is true for a document in any language, and within each language group the value will be constant.

The words that are significant for the document and determine its subject matter lie in the middle part of the hyperbola. The most frequently used words, as well as the low-frequency ones, do not carry decisive semantic weight.

Zipf's second law "quantity - frequency"

The frequency of a word and the number of words with that frequency are also related. If you plot a graph where X is the frequency of a word and Y is the number of words of that frequency, the shape of the curve will be unchanged.

The principle of writing good text is to make it as understandable as possible using the fewest possible words.

The law shows a common property for any language, since there will always be a certain number of most frequently occurring words.

An SEO text should be checked for naturalness if keywords were used in writing it, so that it remains interesting and understandable for a wide audience of readers. The indicator also matters when search engines rank sites: they determine how well a text matches key queries by sorting its words into groups of important, random, and auxiliary ones.

More:

  • The relationship between the frequency of occurrence of a word in the text f, and its place in the frequency dictionary (rank) r, is inversely proportional. The higher the rank of the word (the farther it is from the beginning of the dictionary), the lower the frequency of its occurrence in the text.
  • The graph of such a dependence is a hyperbola, which drops off very sharply at low ranks, and then, in the region of small values ​​of the frequency of occurrence, f, stretches very far, gradually, but very imperceptibly, decreasing as the rank, r, increases.
  • If the frequency of occurrence of one word is 4 per million and that of another is 3 per million, it hardly matters that the ranks of these words differ by a factor of a thousand. Both are used so rarely that many native speakers have never even heard them.
  • However, this distant region is remarkable in that the word located here can very easily reduce the value of its rank many times over. Even the smallest increase in the frequency of occurrence of a word dramatically shifts its position to the beginning of the frequency dictionary.
  • In terms of this law, the measure of the popularity of a word is its position in the frequency dictionary of the language. A more popular word is closer to the top of the dictionary than a less popular one.
  • It reflects the dependence of a word's frequency of use in the language on its place in the frequency dictionary. Popular words are used more often. Mathematically, the graph of this dependence is a hyperbola with a sharp rise as it approaches the origin and a long, gentle, almost horizontal "tail". Most of the words of the language sit in this "tail". Here a word's position in the frequency dictionary can change considerably while the frequency of its use in the language changes hardly at all.
  • But as soon as a word's position in the frequency dictionary reaches the part of the hyperbola where, approaching the origin, the curve begins to rise significantly, the situation changes. Now a small change in the frequency of occurrence no longer leads to a significant change in rank, that is, the word's position in the frequency dictionary stops changing. This means the growth of the word's popularity has slowed. For it to continue, special measures must be taken to increase the word's frequency of occurrence. For example, if the word is a product name, you need to spend money on an advertising campaign.

Hi all! Lately I hear more and more often from colleagues about a requirement in the TOR (terms of reference) to evaluate text quality according to Zipf's law, and not everyone understands how to edit a text to satisfy this law. In today's article I will try to show the simplest way to improve this score, and also explain why good authors don't really need it.

You can determine the quality of the text according to Zipf's law using several services. But, I think PR-CY is the most adequate, it combines the right formula with a simple and understandable interface. That is what I used in the preparation of this material.

What is Zipf's law

To begin with, it is worth understanding what this is. According to Wikipedia, Jean-Baptiste Estoup formulated the pattern in 1908; the law originally referred to shorthand. The first application known to the general public relates to demography, more precisely to the distribution of population among cities, and was made by Felix Auerbach.

The pattern received its modern name in 1949 thanks to the linguist George Zipf, who used it to show the gradation of wealth distribution among the population. Only later did the law come to be applied to evaluating the readability of texts.

How is it calculated

To properly use this law, you need to understand how it works. Let's analyze the formula for the calculation.

  • F is the frequency of using the word;
  • R is serial number;
  • C is a constant (the number of occurrences of the most frequent word).

In practice, another form of the formula, F = C / R, turns out to be more convenient and clearer.

This approach is handier because we know the number of repetitions of the most frequent word; that count is the starting point for everything else.

To put it simply: the second most frequent word in our text should occur half as often as the first, the third one a third as often, and so on.

Text fitting example

So much for theory; now for practice. As an experimental text, I took an article from T-Zh. Why from there? Simple: at the moment it is one of the best examples of the info style so many people love, and I was curious what a text written under the direction of Maxim Ilyakhov would show. I will say right away that its texts score well on this indicator, although, having combed through more than 40 sites, I did not find a single article with poor naturalness at all. I will also jump ahead and say that after fitting, the experimental text became noticeably worse despite the improved Zipf score, so you should not obsess over excessively increasing naturalness.

This is what the analyzer showed us after checking.

Let's take a look at what's in there. As you can see, there is a column with words, as well as some cryptic numbers. The "occurrence" column (1) shows how many times the word forms occur in the text. The "Zipf" column (2) gives the recommended number of occurrences. Markers 3 and 4 mark the ideal values for the second and third positions. Also pay attention to the recommendations block: it indicates how many words you need to remove to achieve the ideal combination.

For a better understanding, let's work through what the analyzer computed. We take the number 39 as the base (C); we also need the rank, so look at position 2 (R). We take the formula.

Substitute.

F=39/2=19.5

Rounding gives 20, which is the required number of occurrences, and the analyzer confirms it. In our case, the second most popular word is used 28 times, so 8 repetitions will need to be removed or replaced.
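The analyzer's arithmetic can be sketched in a few lines (a minimal illustration; the numbers 39 and 28 come from the example above):

```python
# With the most frequent word occurring C = 39 times, the
# Zipf-recommended count for a word of rank R is C / R, rounded.
C = 39
recommended = {rank: round(C / rank) for rank in range(1, 6)}
print(recommended)           # {1: 39, 2: 20, 3: 13, 4: 10, 5: 8}

excess = 28 - recommended[2]  # the actual second word occurs 28 times
print(excess)                 # 8 repetitions to remove or replace
```

Note that Python's `round` uses banker's rounding, so 19.5 rounds to the even value 20, matching the analyzer's recommendation in the text.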

Having dealt with the principle of the law, we begin to edit. To do this, we delete or replace with synonyms words that have more occurrences than required by Zipf. As a result, we get this picture.

As you can see, I managed to raise the score from 83% to 88%. However, the quality of the text suffered significantly. You should not strive to push this figure to 100%; in fact, if you already have 75%, that is excellent, and there is no point in contorting the text any further.

Useful advice

Pay attention not only to the first lines. Start fitting from the last positions in the list: they often affect the overall score more than the first ten words.

Zipf and SEO

Now let's move on to why a copywriter needs to know this pattern. When ordering texts, SEO specialists strive to make them as convenient as possible for search engines. It is believed (though it is unclear by whom) that Zipf's law is actively used by search algorithms. This claim is hard to prove or disprove: I could not find any sound research or experiments on the topic.

So I decided to check it myself. I took the search results for the highly competitive query "plastic windows": for Yandex I took the Moscow results, while with Google I had to improvise, though it too seemed to identify me as a resident of the capital (at least it showed me ads with Moscow geolocation). I took the first page of results, plus the site in 49th place. This is the table that resulted.

If you look closely, the Yandex results are more even with respect to the pattern we are studying. At the same time, a higher score does not guarantee victory in the fight for first place in the top.

Based on this, it can be said that if search engines apply this law, it is only one of the factors. And not the main one.

Conclusions

That's all. Now you know what text quality according to Zipf's law means, and how to adjust this indicator. There is really nothing complicated here: it is enough to understand the principle of this regularity once.

The world of SEO is constantly evolving, and optimization does not stand still. New methods keep appearing for writing texts and preparing them for better indexing. One of the parameters optimizers have paid close attention to is the naturalness of a text according to Zipf's law. So what is Zipf's law, and what is its role in SEO promotion?

By definition, Zipf's law is an empirically established regularity in the distribution of word frequencies in a text: a word's frequency is almost inversely proportional to its position in the frequency list. That is, following the law, the second most frequently mentioned word in the text should be used half as often as the first, the third one a third as often, and so on.

For ease of understanding this pattern, you should pay attention to the arrangement of letters on a computer keyboard. It is not accidental: the most frequently used letters of any language are located more conveniently than those used less often. The situation with words is identical: there are frequently used words and rarely used, more significant words that determine the subject of the text.

Separation by the importance of words is also used when ranking sites in search engine algorithms. With this in mind, the difference in words in terms of meaning and frequency of use helps to divide words into 3 groups when writing SEO texts:

  • Auxiliary. This group includes words that do not carry an independent semantic load, such as conjunctions, prepositions, pronouns, particles. All auxiliary words are perceived by search engines as informational noise and are ignored when ranking.
  • Important. Such words are less common in texts and carry a significant semantic load. Search engines perceive the words of this group as keywords.
  • Random. The words of this group are rarely used for texts on a specific subject and practically do not affect the search ranking.

According to SEO specialists, the American linguist George Zipf defined the laws that search engines began to use to determine the naturalness and uniqueness of texts by the frequency of words used.

SEOs often face problems with text promotion when uniqueness and relevance scores are high. That is, the text can be 100% unique, optimized for a keyword with high relevance, and still not reach the top or, worse, remain out of the view of position analysis programs.

It is not easy to establish how much Zipf's law individually affects search results. Most likely, the search engines take into account a combination of many factors, among which there is a check for naturalness according to Zipf. Today, content plays one of the most important roles in search promotion, therefore, when creating SEO texts, it is recommended to carefully monitor the indicators of uniqueness and naturalness. There are many services for checking texts. Let's dwell on the two most popular and proven sites - 1y.ru and pr-cy.ru.

Service 1y.ru

The site allows you to check the naturalness of the content of individual web pages, entire sites, or texts from 100 to 5000 words. The limit for anonymous users allows checking up to 2000 texts per day. The disadvantage of the site is that a web page cannot be checked without distorting the results, since the service scans all textual information it finds, including category lists, widgets, menus, and other auxiliary text.

After checking the text, 1y.ru provides content statistics with recommendations for reducing repeated words and provides a graph with three curves: the curve of the values ​​of the checked text, the curve of recommended values ​​and the curve of ideal values.

Service pr-cy.ru

This resource also provides an opportunity to evaluate the naturalness of texts and web pages. The service filters out stop words, calculates the percentage of text nausea, and also provides recommendations for reducing or increasing the number of occurrences according to Zipf's law.


Conclusion

The difference in results when checking the same text in different services can be significant: the first three paragraphs of the text you are reading scored 59% on 1y.ru and 88% on pr-cy.ru. The conclusion is simple: when writing, do not focus too much on cramming keyword occurrences into the body of the article. Write in an interesting and accessible way, and if you still need to embed keywords, check the text using the Zipf method.

Zipf's law is an empirical regularity in the distribution of word frequencies in a natural language: if all the words of a language (or simply of a sufficiently long text) are ordered by decreasing frequency of use, then the frequency of the n-th word in the list is approximately inversely proportional to its ordinal number n (the so-called rank of the word; see ordinal scale). For example, the second most used word occurs about half as often as the first, the third one-third as often, and so on.

History of creation

The pattern was first described by the French stenographer Jean-Baptiste Estoup in 1908 in "The Range of Shorthand". It was first applied to the distribution of city sizes by the German physicist Felix Auerbach in his 1913 work "The Law of Population Concentration", and it bears the name of the American linguist George Zipf, who actively popularized the pattern in 1949 and first proposed using it to describe the distribution of economic forces and social status.

An explanation of Zipf's law based on the correlation properties of additive Markov chains (with a step-like memory function) was given in 2005.

Zipf's law is mathematically described by the Pareto distribution. It is one of the basic laws used in informetrics.

Applications of the law

In 1949, George Zipf first showed that the distribution of people's incomes follows the law: the richest person has twice as much money as the next richest, and so on. This held for a number of countries (England, France, Denmark, Holland, Finland, Germany, USA) in the period from 1926 to 1936.

This law also works in relation to the distribution of the city system: the city with the largest population in any country is twice the size of the next largest city, and so on. If you arrange all the cities of a certain country in the list in descending order of population, then each city can be assigned a certain rank, that is, the number that it receives in this list. At the same time, the population size and rank obey a simple pattern expressed by the formula:

P_n = P_1 / n,

where P_n is the population of the city of rank n, and P_1 is the population of the country's largest city (rank 1).
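The rank-size rule is trivial to compute. A sketch (the population figure is hypothetical, chosen only for illustration):

```python
# Rank-size rule P_n = P_1 / n: expected populations of the
# next-ranked cities, given the largest city's population.
P1 = 12_000_000  # hypothetical population of the rank-1 city
expected = [P1 // n for n in range(1, 6)]
print(expected)  # [12000000, 6000000, 4000000, 3000000, 2400000]
```

Comparing a country's actual city populations against this sequence is exactly how the deviations discussed below (cities above or below the "ideal Zipf curve") are identified.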

Empirical studies support this assertion.

In 1999, the economist Xavier Gabaix described Zipf's law as an example of a power law: if cities grow randomly with the same standard deviation, then in the limit the distribution converges to Zipf's law.

According to the findings of researchers in relation to urban settlement in the Russian Federation, in accordance with Zipf's law:

  • most cities in Russia lie above the ideal Zipf curve, so the expected trend is a continued decline in the number and population of medium and small cities due to migration to large cities;
  • accordingly, seven cities with over a million residents (St. Petersburg, Novosibirsk, Yekaterinburg, Nizhny Novgorod, Kazan, Chelyabinsk, Omsk), which lie below the ideal Zipf curve, have a significant population growth reserve and can expect population growth;
  • there are risks of depopulation for the first-ranked city (Moscow), since the second city (St. Petersburg) and the subsequent large cities lag far behind the ideal Zipf curve, owing to falling demand for labor combined with a rising cost of living, above all the cost of buying and renting housing.

Criticism

An American bioinformatician proposed a statistical explanation of Zipf's law, showing that a random sequence of characters also obeys it. The author concludes that Zipf's law is apparently a purely statistical phenomenon that has nothing to do with the semantics of a text and bears only a superficial relation to linguistics.
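This statistical argument is easy to reproduce. A sketch (my own simulation, not the cited author's code): cut "words" from a uniformly random character stream and look at their rank-frequency list, which decays in a Zipf-like way even though the text is meaningless.

```python
# "Words" cut from a uniformly random character stream also show a
# Zipf-like rank-frequency decay: shorter words are exponentially more
# likely, producing tiers of ranks with power-law-like falloff.
import random
from collections import Counter

random.seed(0)
alphabet = "abc "  # three letters plus a word separator
stream = "".join(random.choice(alphabet) for _ in range(200_000))
counts = Counter(w for w in stream.split() if w).most_common()

for rank in (1, 2, 5, 10, 20):
    word, count = counts[rank - 1]
    print(rank, word, count)
```

The frequencies fall off steeply with rank despite the complete absence of semantics, which is the core of the criticism: a Zipf-like curve by itself proves nothing about meaning.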