The frequency of letter use in the Russian language. Frequency statistics of Russian vocabulary

The frequency of the use of letters in Russian

Did you know that some letters of the alphabet occur in words far more often than others? Moreover, vowels as a group are used more frequently than consonants.

Which letters of the Russian alphabet occur most and least often in written text?

Statistics deals with identifying and studying general patterns. With its help, the question above can be answered by counting the occurrences of each letter of the Russian alphabet in excerpts taken from the works of various authors. Anyone can do this on their own out of curiosity; here I will simply cite the figures from a study that has already been carried out.

The Russian alphabet is based on the Cyrillic script. Over its history it has gone through several reforms, which produced the modern Russian alphabet of 33 letters. According to that study, the letter frequencies are as follows (ё is counted together with е):

о - 9.28%
а - 8.66%
е - 8.10%
и - 7.45%
н - 6.35%
т - 6.30%
р - 5.53%
с - 5.45%
л - 4.32%
в - 4.19%
к - 3.47%
п - 3.35%
м - 3.29%
у - 2.90%
д - 2.56%
я - 2.22%
ы - 2.11%
ь - 1.90%
з - 1.81%
б - 1.51%
г - 1.41%
й - 1.31%
ч - 1.27%
ю - 1.03%
х - 0.92%
ж - 0.78%
ш - 0.77%
ц - 0.52%
щ - 0.49%
ф - 0.40%
э - 0.17%
ъ - 0.04%
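Anyone who wants to verify figures like these can do so with a few lines of code. Below is a minimal sketch (not the original study's code; the class name and command-line usage are illustrative) that counts Cyrillic letters in a UTF-8 text file and prints their shares in descending order.

```csharp
// Minimal letter-frequency counter for Russian text (illustrative sketch).
using System;
using System.IO;
using System.Linq;

class LetterFrequency
{
    static void Main(string[] args)
    {
        string text = File.ReadAllText(args[0]).ToLowerInvariant();
        // Keep only lowercase Cyrillic letters а..я plus ё.
        var letters = text.Where(c => (c >= 'а' && c <= 'я') || c == 'ё').ToArray();
        var groups = letters
            .GroupBy(c => c == 'ё' ? 'е' : c)   // merge ё with е, as most frequency tables do
            .OrderByDescending(g => g.Count());
        foreach (var g in groups)
            Console.WriteLine($"{g.Key} - {100.0 * g.Count() / letters.Length:F2}%");
    }
}
```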

The Russian letter with the highest frequency of use is the vowel «о», as one might well have guessed. There are striking examples such as «обороноспособность» ("defense capability"): seven «о»s in one perfectly ordinary Russian word. The high frequency of «о» owes a great deal to the grammatical phenomenon of pleophony (полногласие): «холод» rather than «хлад», «мороз» rather than «мраз».

At the very beginning of words, however, the most common letter is the consonant «п», and its lead there is just as decisive. The most likely explanation is the large number of prefixes beginning with «п»: пере-, пре-, при-, пред-, про- and others.

Letter frequency is the basis of cryptanalysis.

I want to warn you that the information presented in this article is somewhat outdated. I have deliberately not rewritten it, so that later it can be compared with how SEO standards change over time. Up-to-date information on this topic can be found in newer materials.

Hello, dear readers of this blog. Today's article will again be devoted to search engine optimization (SEO). We have already touched on many of the issues related to this topic earlier.

Today I want to continue the conversation about on-page SEO, clarifying some of the points mentioned earlier and covering what we have not yet discussed. If you can write good, unique texts but pay no attention to how search engines perceive them, those texts will not be able to make their way to the top of the results for queries related to their topics.

What affects the relevance of the text to the search query

And this is very sad, because in this way you do not realize the full potential of your project, which can be very impressive. You need to understand that search engines for the most part are stupid and straightforward programs that are not able to go beyond their capabilities and look at your project with human eyes.

They will not see much of what is good and necessary on your project (what you have prepared for visitors). They can only analyze the text, taking into account a lot of components, but they are still very far from human perception.

Therefore, we will need to put ourselves in the shoes of the search robots for a while and understand what they focus on when ranking different texts for different search queries. For that, it helps to have at least a basic idea of how search engines work.

Usually keywords are used in the page title and in some internal headings, and are distributed throughout the article as evenly and naturally as possible. Highlighting keys in the text can also be used, but do not forget about the over-optimization penalties that may follow.

Keyword density in the text also matters, but nowadays it is less a factor to maximize than one to watch out for: you must not overdo it.

Keyword density in a document is determined quite simply. It is the frequency of the keyword's use in the text: the number of its occurrences in the document divided by the length of the document in words. Previously, a site's position in the search results depended on it directly.
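To make the definition concrete, here is a rough density calculation in C# (the helper name is mine, and the word splitting is deliberately naive: whitespace only, with punctuation trimmed from word edges). For example, a 2000-word article in which the key occurs 50 times has a density of 2.5%.

```csharp
// Rough keyword-density sketch; assumes `using System;` and `using System.Linq;`.
static double KeywordDensity(string text, string keyword)
{
    var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    int hits = words.Count(w =>
        w.Trim('.', ',', '!', '?', ':', ';', '"', '(', ')')
         .Equals(keyword, StringComparison.OrdinalIgnoreCase));
    return words.Length == 0 ? 0.0 : (double)hits / words.Length;
}
```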

You probably understand that it is impossible to compose the whole text out of keys alone, because it would be unreadable, and thankfully this is not necessary. Why? Because there is a limit to keyword frequency beyond which the relevance of a document for a query containing that keyword no longer increases.

That is, it will be enough for us to reach a certain frequency and the page will be optimized as much as possible. Go beyond it, and we risk falling under a filter.

Two questions remain (and maybe three): what is the maximum keyword density beyond which it becomes dangerous to increase it, and what length of text is enough for promotion.

The fact is that keywords highlighted with emphasis tags or enclosed in the TITLE tag carry more weight for search than the same keywords simply occurring in the text. But lately webmasters have spammed this factor so thoroughly that its importance has decreased, and abuse of strong tags can even lead to a ban of the entire site.

Keys in the TITLE are still relevant; just do not repeat them there and do not try to cram too many into one page title. If the keywords are in the TITLE, we can significantly reduce their number in the article (making it easier to read and more suitable for people rather than search engines) while achieving the same relevance without the risk of falling under a filter.

I think everything is clear with this question: the more keys are wrapped in emphasis and TITLE tags, the greater the chance of losing everything at once. But if you do not use them at all, you will not achieve anything either. The most important criterion is how naturally the keywords are worked into the text. If they are present but the reader does not stumble over them, everything is generally fine.

Now it remains to figure out what frequency of using a keyword in a document is optimal, which allows you to make the page as relevant as possible without entailing sanctions. Let's first remember the formula that most (probably all) search engines use to rank.

How to determine the acceptable frequency of using a key

We have already talked about this mathematical model in the article mentioned just above. For a particular search query its essence is expressed by one simplified formula: TF*IDF, where TF is the term frequency, i.e. how often the words of this query occur in the text of the document.

IDF is the inverse document frequency: a measure of how rare this query is across all the other Internet documents indexed by the search engine (the collection).

This formula lets you determine how well a document corresponds to a search query, i.e. its relevance. The higher the value of the product TF*IDF, the more relevant the document and the higher it will rank, all other things being equal.

That is, the weight of a document for a given query (its correspondence to it) will be greater the more often the keys from that query are used in the text, and the less often those keys occur in other documents on the Internet.
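As a rough illustration, here is the classic textbook form of this score in C# (the real ranking formulas of Yandex and Google are not public, so this is only a sketch of the idea: frequent in the document, rare in the collection).

```csharp
// Textbook TF*IDF sketch; assumes `using System;`.
// tf  = how often the term occurs in this document,
// idf = how rare the term is across the whole indexed collection.
static double TfIdf(int termCountInDoc, int docLengthInWords,
                    long docsInCollection, long docsContainingTerm)
{
    double tf  = (double)termCountInDoc / docLengthInWords;
    double idf = Math.Log((double)docsInCollection / docsContainingTerm);
    return tf * idf;
}
```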

Clearly, we cannot influence IDF, other than by choosing a different query to optimize for. But we can and will influence TF, because we want to grab our share (and not a small one) of traffic from the Yandex and Google search results for the queries we are targeting.

But the fact is that search algorithms calculate TF using a rather tricky formula that takes the growth of keyword frequency into account only up to a certain limit, after which TF practically stops growing no matter how much further you raise the frequency. It is a kind of anti-spam filter.
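One publicly documented example of such a saturating formula is the term-frequency component of BM25; whether Yandex or Google use anything like it is only an assumption, but it shows the shape of the effect. With k1 = 1.2 the score is 1.0 for a single occurrence, about 1.96 for ten and only about 2.17 for a hundred, which is exactly the "growth practically stops" behaviour described above.

```csharp
// BM25-style saturating term frequency (illustrative, not any engine's real formula).
// As rawCount grows, the result approaches k1 + 1 and stops growing.
static double SaturatedTf(double rawCount, double k1 = 1.2)
{
    return rawCount * (k1 + 1) / (rawCount + k1);
}
```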

A relatively long time ago (until about 2005) TF was calculated with a fairly simple formula and was effectively equal to keyword density. Relevance calculated this way did not really suit the search engines, because it played into the hands of spammers.

Then the TF formula became more complicated, the notion of page "nausea" (a keyword-stuffing measure) appeared, and TF began to depend not only on the keyword's frequency but also on the frequency of the other words in the same text. The optimal TF value was reached when the key turned out to be the most frequently used word in the text.

It was also possible to increase TF by increasing the size of the text while keeping the same occurrence percentage: the longer the sheet of text with the same percentage of keys, the higher the document would rank.

Now the TF formula has become more complicated still, but at the same time we no longer need to push the density to the point where the text becomes unreadable and the search engines ban the project for spam. And there is no longer any need to write disproportionately long sheets of text either.

While keeping the same ideal density (we will determine it a little further down from the corresponding graph), increasing an article's length in words improves its position in the SERP only up to a certain length. Once the ideal length is reached, increasing it further no longer affects relevance (more precisely, it does, but very, very little).

All of this is clearly visible if you plot a graph based on this tricky TF. If TF is on one axis of the graph and the keyword's frequency of occurrence in the text (as a percentage) is on the other, the result is a curve resembling a hyperbola, flattening out as the density grows.

The graph is, of course, approximate, because few people know the real TF formula used by Yandex or Google. But qualitatively it lets us determine the optimal range in which the frequency should lie: roughly 2 to 3 percent of the total number of words.

If you take into account that you will still enclose some of the keys in emphasis tags and in the TITLE, this is the limit beyond which a further increase in density may well end in a ban. Saturating and disfiguring the text with a large number of keywords is simply no longer profitable: there will be more minuses than pluses.

What length of text is sufficient for promotion

Based on the same assumed TF, one can plot its value against the length of the text in words. In this case the keyword frequency can be taken as constant for any length and equal, for example, to some value from the optimal range (2 to 3 percent).

Remarkably, we get a graph of exactly the same shape as the one discussed above, only with the length of the text in thousands of words plotted along the horizontal axis. From it we can identify the optimal length range at which TF already comes close to its maximum value.

As a result, it turns out that it will lie in the range from 1000 to 2000 words. With a further increase, relevance will practically not grow, and with a shorter length, it will fall rather sharply.

Thus we can conclude that for your articles to take high places in the search results you need to use keywords in the text with a frequency of about 2-3%. That is the first and main conclusion. The second is that it is no longer necessary to write very voluminous articles in order to get into the Top.

It is enough to pass the 1000-2000 word mark and include 2-3% of keywords. That is the whole recipe for the "perfect" text, one that can compete for a place at the top for low-frequency queries even without external optimization (buying links to the article with anchors that include the keywords). Although digging around a bit in Miralinks, GGL, Rotapost or GetGoodLink will do no harm and will help your project.
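If you want to check a draft against these rough guidelines, a small sketch like the one below is enough (the thresholds are this article's estimates, not anything confirmed by the search engines, and the method name is mine).

```csharp
// Checks a draft against the rough guidelines above; assumes `using System;` and `using System.Linq;`.
static void CheckArticle(string text, string keyword)
{
    var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    int hits = words.Count(w => w.Equals(keyword, StringComparison.OrdinalIgnoreCase));
    double density = words.Length == 0 ? 0.0 : 100.0 * hits / words.Length;

    Console.WriteLine($"Length:  {words.Length} words (target 1000-2000)");
    Console.WriteLine($"Density: {density:F1}% (target 2-3%)");
}
```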

Let me remind you once again that the length of the text you have written, as well as the frequency of particular keywords in it, can be checked with specialized programs or with online services that specialize in such analysis. One of these services is ISTIO, which I have written about before.

Everything said above is not one hundred percent reliable, but it is very close to the truth; in any case, my personal experience confirms this theory. But the algorithms of Yandex and Google are constantly changing, and few people know what tomorrow will bring, except those close to their developers.

Good luck to you! See you soon on the pages of this blog.


Brief statement of the problem

There is a set of files with Russian texts, ranging from fiction of various genres to news reports. The task is to collect statistics on the use of prepositions together with other parts of speech.

Important points in the task

1. Prepositions include not only simple one-word ones, but also stable combinations of words that function as prepositions, for example «по сравнению с» ("compared to") or «несмотря на» ("in spite of"). Therefore it is impossible to simply chop the texts up by spaces.

2. There is a lot of text, several gigabytes, so processing has to be reasonably fast, finishing within a few hours at most.

Outline of the solution and results

Given previous experience with text-processing tasks, it was decided to stick to a modified "unix way": split the processing into several stages so that the result of each stage is plain text. Unlike the pure unix way, instead of passing the raw text through pipes we save everything as files on disk. Fortunately, the cost of a gigabyte on a hard drive is now negligible.

Each stage is implemented as a separate, small and simple utility that reads text files and saves the products of its silicon life.

An additional bonus of this approach, in addition to the simplicity of the utilities, lies in the incremental nature of the solution - you can debug the first stage, run all the gigabytes of text through it, then start debugging the second stage, without wasting time on repeating the first.

Breaking text into words

Since the source texts to be processed are already stored as flat files in utf-8 encoding, we skip the zero stage - parsing documents, extracting text content from them and saving them as simple text files, immediately proceeding to the task of tokenization.

Everything would be simple and boring were it not for the fact that some Russian prepositions consist of several words separated by a space, and sometimes a comma. In order not to break up such multiword prepositions, I first used the tokenization function included in the dictionary API. The C# wrapper turned out to be simple, literally a hundred lines. Here is the source. If we discard the introductory part (loading the dictionary) and the final part (releasing it), everything comes down to a couple of dozen lines.
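The general idea can be sketched without the dictionary API: split a line by whitespace and then glue back together word sequences that match a list of multiword prepositions. The list and the method below are purely illustrative; the actual project obtained both the tokenization and the preposition list from the grammar dictionary.

```csharp
// Illustrative tokenizer sketch; assumes System, System.Collections.Generic, System.Linq.
static readonly string[][] MultiwordPrepositions =
{
    new[] { "по", "сравнению", "с" },   // "compared to"
    new[] { "несмотря", "на" },         // "in spite of"
    new[] { "в", "течение" },           // "during"
};

static IEnumerable<string> Tokenize(string line)
{
    var words = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < words.Length; )
    {
        // Try to match a multiword preposition starting at position i.
        var match = MultiwordPrepositions.FirstOrDefault(p =>
            i + p.Length <= words.Length &&
            Enumerable.Range(0, p.Length).All(k =>
                string.Equals(words[i + k].Trim(','), p[k], StringComparison.OrdinalIgnoreCase)));

        if (match != null) { yield return string.Join(" ", match); i += match.Length; }
        else               { yield return words[i]; i++; }
    }
}
```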

All of this grinds through the files successfully, but tests revealed a significant drawback: very low speed. On the x64 platform it came to about 0.5 MB per minute. Of course, the tokenizer accounts for all sorts of special cases like "A.S. Pushkin", but such accuracy is unnecessary for the original problem.

As a reference point for the achievable speed there is Empirika, a statistical file-processing utility that does frequency processing of 22 GB of text in about 2 hours. It contains a smarter solution to the multiword-preposition problem, so I added a new mode enabled by the -tokenize command-line option. The run came out to about 500 seconds per 900 MB, that is, roughly 1.8 MB per second.

The result of working with these 900 MB of text is a file of about the same size, 900 MB. Each word is stored on a separate line.

Frequency of using prepositions

Since I did not want to hard-code a list of prepositions into the program text, I again hooked the grammar dictionary up to the C# project and obtained a complete list of prepositions, about 140 of them, with the sol_ListEntries function; after that everything is trivial. The program is written in C#; it collects only preposition + word pairs, but extending it would not be a problem.
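A minimal sketch of this counting step is shown below; the prepositions set stands in for the list obtained via sol_ListEntries, and the method name is mine. It reads one token per line (the output of the tokenizer) and writes tab-separated preposition, word and count.

```csharp
// Pair-counting sketch; assumes System.Collections.Generic, System.IO, System.Linq.
static void CountPairs(string tokenFile, ISet<string> prepositions, string outFile)
{
    var counts = new Dictionary<string, int>();
    string prev = null;
    foreach (var word in File.ReadLines(tokenFile).Select(w => w.Trim().ToUpperInvariant()))
    {
        if (prev != null && prepositions.Contains(prev))
        {
            string key = prev + "\t" + word;   // preposition TAB following word
            counts[key] = counts.TryGetValue(key, out var n) ? n + 1 : 1;
        }
        prev = word;
    }
    File.WriteAllLines(outFile, counts.Select(kv => $"{kv.Key}\t{kv.Value}"));
}
```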

Processing a 1 GB file of words takes only a few minutes and yields a frequency table, which we again write to disk as a text file. In it the preposition, the second word and the number of occurrences are separated by a tab character:

PRO BROKEN 3
PRO Scored 1
PRO FORM 1
PRO NORM 1
PRO HUNGRY 1
IN LEGAL 9
FROM TERRACE 1
DESPITE THE TAPE 1
OVER DRAWER 14

In total, from the initial 900 MB of text, approximately 600 thousand pairs were obtained.

Analyzing and viewing the results

It is convenient to analyze the resulting table in Excel or Access. Out of habit with SQL, I loaded the data into Access.

The first thing to do is to sort the results in descending order of frequency to see the most frequent pairs. The initial amount of processed text is too small, so the sample is not very representative and may differ from the final results, but here are the top ten:

У НАС (we have) 29193
В ТОМ (in that) 26070
У МЕНЯ (I have) 25843
О ТОМ (about that) 24410
У НЕГО (he has) 22768
В ЭТОМ (in this) 22502
В ОБЛАСТИ (in the field of) 20749
В ТЕЧЕНИЕ (during) 20545
ОБ ЭТОМ (about it) 18761
С НИМ (with him) 18411

Now you can plot a graph with the frequencies on the Y axis and the pairs lined up along the X axis in descending order. This gives the expected distribution with a long tail.

Why is this statistic needed?

Besides the fact that the two C# utilities can serve as a demonstration of how to work with the procedural API, there is another important goal: to supply the machine translator and the text reconstruction algorithm with statistical raw material. In addition to word pairs, trigrams will also be required; for that, the second of the utilities will need to be extended slightly, as sketched below.
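A hedged sketch of that expansion: keep two previous tokens instead of one and count triples. Again, this is an illustration, not the author's actual code.

```csharp
// Trigram-counting sketch; same input format as the pair counter (one token per line).
static void CountTrigrams(string tokenFile, string outFile)
{
    var counts = new Dictionary<string, int>();
    string w1 = null, w2 = null;
    foreach (var w3 in File.ReadLines(tokenFile).Select(w => w.Trim().ToUpperInvariant()))
    {
        if (w1 != null && w2 != null)
        {
            string key = $"{w1}\t{w2}\t{w3}";
            counts[key] = counts.TryGetValue(key, out var n) ? n + 1 : 1;
        }
        w1 = w2;
        w2 = w3;
    }
    File.WriteAllLines(outFile, counts.Select(kv => $"{kv.Key}\t{kv.Value}"));
}
```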
