Empirical distribution of attribute y. Estimating the distribution of empirical data

21.09.2019

To choose appropriate methods of mathematical and statistical processing, it is necessary to assess the nature of the data distribution for every parameter (characteristic). For parameters whose distribution is normal or close to normal, parametric statistics can be used, and these are often more powerful than nonparametric methods. The advantage of nonparametric methods, in turn, is that they allow statistical hypotheses to be tested regardless of the form of the distribution.

A normal distribution is a type of distribution of variables that is observed when a characteristic (variable) changes under the influence of many relatively independent factors. Such an influence is typical of mental phenomena, so the researcher often relies on the normal distribution to describe a set of empirical data statistically, to assess the population from the sample, and to norm test scores and convert them into scale scores. The statistical criteria for testing hypotheses (the r test, the χ² test, Fisher's F-test, Student's t-test, etc.) are based on the properties of the normal distribution. The main purpose of identifying a normal distribution is to determine suitable methods for mathematical and statistical data processing.

If the indicators of a psychological trait follow a normal distribution, or one close to it, described by the Gaussian curve, one can use parametric methods of mathematical statistics, which are simple and reliable: comparative analysis, assessment of the significance of differences between samples using Student's t-test or Fisher's F-test, Pearson's correlation coefficient, etc.

If the distribution curve of the indicators of a psychological trait is far from normal, the researcher has to use methods of nonparametric statistics: assessment of the significance of differences using Rosenbaum's Q criterion (for small samples) or the Mann-Whitney U-test, Spearman's rank correlation coefficient, and factorial, multifactorial, cluster and other methods of analysis.

Based on the nature of the distribution, one can obtain a general idea of the characteristics of the sample of subjects with respect to a certain attribute and of the validity of the sampling technique.

Statistical inferences based on a model approximating a normal distribution are themselves approximate. How closely an empirical curve approximates the normal one is evaluated by calculating the coefficients of skewness and kurtosis and the goodness-of-fit criteria of Pearson, Kolmogorov and Yastremsky.

The skewness coefficient As evaluates the position of the peak of the empirical curve relative to the theoretical one: it shows how far the peak is displaced horizontally (to the right, "+"; to the left, "-") (Fig. 2.3).

Fig. 2.3. Symmetric distribution of empirical data

The skewness coefficient is an indicator of the skew of the distribution to the left or right along the abscissa axis (Fig. 2.4).

Fig. 2.4. Skewed distribution of empirical data

If the right branch of the curve is longer than the left, one speaks of right-sided (positive) skewness; if the left branch is longer than the right, of left-sided (negative) skewness (Fig. 2.5).

Fig. 2.5. Bimodal distribution of empirical data (right- and left-sided asymmetry)

The skewness coefficient As is calculated using the formula:

As = Σ(x i - X avg)³ / (n·σ³).

The kurtosis coefficient Ex characterizes the vertical displacement of the peak of the empirical curve relative to the theoretical normal one (up, "+"; down, "-"). Kurtosis is an indicator of peakedness. Curves that are raised in the middle part (pointed) are called leptokurtic. As the kurtosis value decreases, the curve flattens, taking on the appearance of a plateau, and then becomes saddle-shaped, i.e. with a dip in the middle part (Fig. 2.6).

Fig. 2.6. Distributions with different kurtosis values

These parameters give a first rough idea of the nature of the distribution:

in a normal distribution, a skewness coefficient close to one in absolute value or greater (±1) is rarely found;

The kurtosis of traits with a normal distribution usually has a value in the range of 2-4.

In a simplified version, the skewness and kurtosis indicators, together with their representativeness errors, are determined using the following formulas:

As = Σ(x i - X avg)³ / (n·σ³), m As = √(6/n);

Ex = Σ(x i - X avg)⁴ / (n·σ⁴) - 3, m Ex = 2·√(6/n).

You can calculate the skewness and kurtosis of an empirical distribution using the Descriptive Statistics function in Excel.

The skewness and kurtosis indicators signal a significant difference between the empirical distribution and the normal one if their absolute value exceeds their representativeness error by 3 or more times: As/m As ≥ 3; Ex/m Ex ≥ 3.
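These indicators can be computed directly from the central moments; a minimal sketch in Python, where the representativeness errors use the common simplified approximations m As = √(6/n) and m Ex = 2·√(6/n) (an assumption of this sketch, not taken verbatim from the text):

```python
import math

def skewness_kurtosis(data):
    """Return (As, Ex): skewness and excess kurtosis of a sample,
    computed from the central moments m2, m3, m4 (divisor n)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # biased variance
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    s = math.sqrt(m2)
    As = m3 / s ** 3
    Ex = m4 / s ** 4 - 3   # excess kurtosis: 0 for the normal curve
    return As, Ex

def representativeness_errors(n):
    """Simplified approximations (an assumption of this sketch)."""
    return math.sqrt(6 / n), 2 * math.sqrt(6 / n)

# A symmetric sample: skewness is 0, kurtosis slightly negative.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
As, Ex = skewness_kurtosis(data)
print(round(As, 3), round(Ex, 3))
```

A distribution would then be flagged as significantly non-normal when |As| exceeds 3·m As (and similarly for Ex).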

The most common reason for deviation of the sample distribution of a characteristic from the normal form is a feature of the measurement procedure: the scale used may have uneven sensitivity to the measured property in different parts of the range of its variability.

Empirical deviations from normality, such as right- or left-sided skewness, moderate kurtosis, or a bimodal distribution, are often observed in practice. They are due to the characteristics of the experimental sample and of the measurement procedures used.

Methods of statistical analysis of empirical data tolerate deviations from the normal distribution (some to a greater extent, others to a lesser extent). However, if a convincing substantiation of the results and of the calculations based on them is required, simple methods of nonparametric statistics should be used in addition.

The distribution curve of test scores characterizing psychological phenomena (assessments, results of completing tasks, etc.) reflects the properties of the items (tasks) that make up the test, and also characterizes the composition of the sample of subjects (how successfully they perform the tasks, and how well the test or task differentiates the sample with respect to the corresponding quality or characteristic).

If the curve has right-sided skewness, difficult tasks predominate in the test (for the given sample); if the curve has left-sided skewness, the majority of items in the test are easy. This may be due to the following reasons:

the test (its tasks) poorly differentiates subjects with a low level of development of the abilities (properties, qualities, characteristics) being measured: most subjects receive approximately the same low score;

the test does not differentiate well between subjects with a high level of development of the abilities (properties, qualities, characteristics): most subjects receive a high score.

Analysis of the kurtosis of the distribution curve allows us to draw the following conclusions depending on the shape of the distribution of indicators (data, variant) of the psychological trait:

1) when significant positive kurtosis occurs (a peaked curve) and the scores are concentrated near the average value (Fig. 2.6, a), this may be caused by the following reasons:

the key is composed incorrectly, i.e. negatively related characteristics are combined in scoring and mutually cancel out; the use of valid and reliable techniques prevents this problem from arising;

having guessed the purpose of the test (questionnaire), the subjects use a special "median score" tactic: they artificially balance the answers "for" and "against" one of the poles of the psychological attribute being measured;

2) if the selected items are closely positively correlated with each other (i.e., the items are not statistically independent), negative kurtosis appears in the distribution of scores, which takes the form of a plateau (Fig. 2.6, b);

3) negative kurtosis reaches its maximum values as the dip at the top of the distribution deepens until two peaks form - two modes (with a trough between them, Fig. 2.6, c). Such a bimodal configuration of the distribution of scores indicates that the sample of subjects was divided into two categories or subgroups (with a smooth transition between them): some coped with most of the tasks (agreed with most of the questions), others did not (disagreed). The distribution indicates that the tasks (items) share a common feature corresponding to a certain property of the subjects: if the subjects possess this property (ability, knowledge, skill), they cope with the majority of the items and tasks; if they lack it, they do not.

Primary statistics are sensitive to outliers. Large values of kurtosis and skewness are often an indicator of errors made in manual calculations or when entering data from the keyboard for computer processing. Gross data-entry errors can be found by comparing sigma values for similar parameters: an anomalous sigma can indicate errors.

In this case, the rule is observed that all operations should be performed twice (particularly important ones - three times), preferably in different ways, varying the order in which the numeric array is accessed.

Large values of kurtosis and skewness may also be caused by insufficient reliability and validity of the methods.

A separate sample can never fully characterize the whole (the general population); there is always the possibility of an insufficiently accurate, or even erroneous, assessment of the general population based on sample data. Errors of generalization and extrapolation associated with transferring results obtained from a sample to the entire population are called representativeness errors.

Representativeness is the degree of correspondence of sample indicators to general parameters.

Statistical errors of representativeness show the extent to which partial results obtained from specific samples can deviate from the parameters of the general population (from the mathematical expectation, or true values). The greater the variation of the characteristic and the smaller the sample, the larger the error. This is reflected in the formulas for statistical errors, which characterize the variation of sample indicators around their general parameters. Therefore, primary statistics necessarily include the statistical error of the arithmetic mean, calculated using the formula:

m = σ / √n,

where σ is the standard deviation and n is the sample size.
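A minimal sketch of this calculation (assuming the usual formula m = s/√n; using the sample standard deviation with n - 1 in the denominator is an implementation choice):

```python
import math

def mean_with_error(data):
    """Arithmetic mean and its statistical (representativeness) error m = s / sqrt(n)."""
    n = len(data)
    mean = sum(data) / n
    # sample standard deviation (n - 1 in the denominator)
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return mean, s / math.sqrt(n)

mean, m = mean_with_error([10, 12, 14, 16, 18])
print(mean, round(m, 4))  # 14.0 1.4142
```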

Basic methods of parametric and nonparametric statistics make it possible to substantiate the results of empirical psychological research.

Empirical distribution function

Methods of processing empirical data (ED) rely on the basic concepts of probability theory and mathematical statistics: the general population, the sample, and the empirical distribution function.

The general population is understood as the set of all possible values of a parameter that could be recorded during unlimited observation of an object. Such a set consists of an infinite number of elements. As a result of observing an object, a set of parameter values limited in volume is formed: x1, x2, …, xn. From a formal point of view, such data represent a sample from the general population.

We will assume that the sample contains complete observations (there is no censoring). The observed values xi are called variants, and their number is the sample size n. For any conclusions to be drawn from the observation results, the sample must be representative, i.e. correctly reflect the proportions of the general population. This requirement is met if the sample size is large enough and each element of the general population has the same probability of being included in the sample.

Suppose that in the resulting sample the value x1 of the parameter was observed n1 times, the value x2 - n2 times, …, the value xk - nk times, with n1 + n2 + … + nk = n.

The set of values written in ascending order is called a variation series, the quantities ni are frequencies, and their ratios to the sample size, νi = ni/n, are relative frequencies. Obviously, the sum of the relative frequencies equals one.

A distribution is the correspondence between the observed variants and their frequencies or relative frequencies. Let nx be the number of observations in which the value of the parameter X is less than x. The frequency of the event X < x is nx/n. This ratio is a function of x and of the sample size: Fn(x) = nx/n. The quantity Fn(x) has all the properties of a distribution function: Fn(x) is a non-decreasing function, and its values belong to the segment [0; 1];

if x1 is the smallest value of the parameter and xk the greatest, then Fn(x) = 0 when x ≤ x1, and Fn(x) = 1 when x > xk.

The function Fn(x) is determined from the ED, which is why it is called the empirical distribution function. Unlike the empirical function Fn(x), the distribution function F(x) of the general population is called the theoretical distribution function; it characterizes not the frequency but the probability of the event X < x. From Bernoulli's theorem it follows that the frequency Fn(x) converges in probability to the probability F(x) as n increases without bound. Consequently, for a large volume of observations the theoretical distribution function F(x) can be replaced by the empirical function Fn(x).

The graph of the empirical function Fn(x) is a step (broken) line. Between adjacent members of the variation series, Fn(x) remains constant. When passing through points on the x axis equal to sample members, Fn(x) has a jump, increasing abruptly by 1/n, or, if l observations coincide, by l/n.
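This step behavior can be illustrated with a short sketch, using the definition above, Fn(x) = nx/n (the proportion of sample values less than x):

```python
def empirical_cdf(sample, x):
    """Fn(x) = (number of sample values less than x) / n."""
    return sum(1 for v in sample if v < x) / len(sample)

sample = [2, 3, 3, 5]              # variation series; the tie at 3 gives a jump of 2/n
print(empirical_cdf(sample, 2))    # 0.0  (below or at the smallest member)
print(empirical_cdf(sample, 3))    # 0.25
print(empirical_cdf(sample, 3.5))  # 0.75 (jump of 2/n past the tied value 3)
print(empirical_cdf(sample, 6))    # 1.0
```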

Example 2.1. Construct a variation series and the graph of the empirical distribution function based on the observation results in Table 2.1.

Table 2.1

The desired empirical function is shown in Fig. 2.1:

Fig. 2.1. Empirical distribution function

With a large sample size (the notion of "large" depends on the goals and the processing methods; here we will consider n large if n > 40), for convenience of processing and storing the information one resorts to grouping the ED into intervals. The number of intervals should be chosen so that the variety of parameter values in the set is reflected to the required extent, while the distribution pattern is not distorted by random frequency fluctuations in individual bins. There are loose guidelines for choosing the number y and the size h of such intervals, in particular:

each interval must contain at least 5-7 elements; in the extreme intervals, two elements are allowed;

the number of intervals should be neither very large nor very small: the minimum value of y must be at least 6-7. With a sample size not exceeding a few hundred elements, y is set in the range from 10 to 20. For a very large sample size (n > 1000) the number of intervals may exceed these values. Some researchers recommend using the relation y = 1.441·ln(n) + 1;

with relatively small unevenness of the distribution, it is convenient to choose intervals of the same length, equal to

h = (x max - x min)/y,

where x max is the maximum and x min the minimum value of the parameter. If the distribution is significantly uneven, the interval length can be set smaller in the region where the distribution density changes rapidly;

if the unevenness is significant, it is better to assign approximately the same number of sample elements to each bin. Then the length of a particular interval is determined by the extreme values of the sample elements grouped into it, i.e. it differs from interval to interval (in this case, when constructing a histogram, normalization by interval length is required - otherwise the height of every histogram element would be the same).
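These guidelines can be tried on the data range of Example 2.2 below (n = 44); a sketch, assuming the relation y = 1.441·ln(n) + 1 quoted above:

```python
import math

def interval_count(n):
    """Number of grouping intervals by the relation y = 1.441*ln(n) + 1."""
    return round(1.441 * math.log(n) + 1)

def interval_width(x_max, x_min, y):
    """Equal interval width h = (x_max - x_min) / y."""
    return (x_max - x_min) / y

y = interval_count(44)                 # sample size of Example 2.2
h = interval_width(29.28, 25.79, y)    # data range of Example 2.2
print(y, round(h, 2))                  # 6 0.58
```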

Grouping the observation results by intervals involves: determining the range of variation of the parameter X; choosing the number of intervals and their size; and counting, for each i-th interval [xi, xi+1), the frequency ni or the relative frequency νi with which the variants fall into the interval. As a result, a representation of the ED in the form of an interval (statistical) series is formed.

Graphically, a statistical series is displayed as a histogram, a polygon, or a step line. A histogram is often drawn as a figure consisting of rectangles whose bases are intervals of length h and whose heights equal the corresponding frequency. However, this approach is inaccurate: the height of the i-th rectangle zi should be chosen equal to ni/(nh). Such a histogram can be interpreted as a graphical representation of the empirical distribution density fn(x); in it, the total area of all the rectangles equals one. The histogram helps to select the type of theoretical distribution function for approximating the ED.
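A sketch of the corrected histogram heights zi = ni/(nh), showing that the total area of the rectangles then equals one (the counts are the frequencies of Example 2.2 below):

```python
def histogram_heights(counts, h):
    """Heights z_i = n_i / (n * h), so that the rectangles' total area is 1."""
    n = sum(counts)
    return [c / (n * h) for c in counts]

counts = [5, 9, 10, 9, 5, 6]    # interval frequencies n_i
h = 0.58                        # interval length
z = histogram_heights(counts, h)
area = sum(zi * h for zi in z)
print(round(area, 10))          # 1.0
```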



A polygon is a broken line whose segments connect points whose abscissas are the midpoints of the intervals and whose ordinates are the corresponding frequencies. The empirical distribution function is displayed as a step line: over each interval, a horizontal segment is drawn at a height proportional to the accumulated frequency up to the current interval. The accumulated frequency equals the sum of all frequencies from the first interval up to and including the given one.

Example 2.2. The results of recording the signal attenuation values xi at a frequency of 1000 Hz in a switched telephone-network channel are available. The values, measured in dB, are presented as a variation series in Table 2.3. It is necessary to construct a statistical series.

Table 2.3

i    1-11
xi   25.79 25.98 25.98 26.12 26.13 26.49 26.52 26.60 26.66 26.69 26.74
i    12-22
xi   26.85 26.90 26.91 26.96 27.02 27.11 27.19 27.21 27.28 27.30 27.38
i    23-33
xi   27.40 27.49 27.64 27.66 27.71 27.78 27.89 27.89 28.01 28.10 28.11
i    34-44
xi   28.37 28.38 28.50 28.63 28.67 28.90 28.99 28.99 29.03 29.12 29.28

Solution. The number of bins of the statistical series should be chosen as small as possible while ensuring a sufficient number of hits in each of them; let us take y = 6. Let us determine the bin size:

h = (x max - x min)/y = (29.28 - 25.79)/6 ≈ 0.58.

Let us group the observations into bins (Table 2.4).

Table 2.4

interval start xi   25.79  26.37  26.95  27.53  28.12  28.70
ni                  5      9      10     9      5      6
νi = ni/n           0.114  0.205  0.227  0.205  0.114  0.136
zi = νi/h           0.196  0.353  0.392  0.353  0.196  0.235
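The grouping in Tables 2.3-2.4 can be reproduced with a short sketch (the attenuation values are those of Table 2.3; the bin-index formula is an implementation choice):

```python
# Attenuation values of Table 2.3, in dB (n = 44)
data = [
    25.79, 25.98, 25.98, 26.12, 26.13, 26.49, 26.52, 26.60, 26.66, 26.69, 26.74,
    26.85, 26.90, 26.91, 26.96, 27.02, 27.11, 27.19, 27.21, 27.28, 27.30, 27.38,
    27.40, 27.49, 27.64, 27.66, 27.71, 27.78, 27.89, 27.89, 28.01, 28.10, 28.11,
    28.37, 28.38, 28.50, 28.63, 28.67, 28.90, 28.99, 28.99, 29.03, 29.12, 29.28,
]

y = 6
n = len(data)
x_min, x_max = min(data), max(data)
h = (x_max - x_min) / y

counts = [0] * y
for x in data:
    i = min(int((x - x_min) / h), y - 1)  # the maximum falls into the last bin
    counts[i] += 1

rel = [round(c / n, 3) for c in counts]   # relative frequencies
print(counts)  # [5, 9, 10, 9, 5, 6]
print(rel)     # [0.114, 0.205, 0.227, 0.205, 0.114, 0.136]
```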

Based on the statistical series, we construct the histogram (Fig. 2.2) and the graph of the empirical distribution function (Fig. 2.3).

The graph of the empirical distribution function in Fig. 2.3 differs from the one in Fig. 2.1 in the equality of the step between variants and the increment step of the function (when constructed from a variation series, the increment step is a multiple of 1/n, whereas for a statistical series it depends on the frequency in a particular bin).

The considered ED representations are the initial ones for subsequent processing and calculation of various parameters.

Instructions for performing and preparing laboratory work

The work is performed on A4 sheets. The title page contains the title of the work, the surname and name of the performer, the group, the department, the year and the semester.

Drawings, diagrams, pictures and tables are made using drawing tools. All of them must be accompanied by titles and the necessary labels. The running text is written in pen; important places in the work may be highlighted in color. The work may also be completed on a computer.

In all cases, the formulas used and the intermediate calculations are written down, and the necessary written explanations are given. The results obtained during data processing are highlighted.

At the end of each work, a written analysis of the results is given: hypotheses are put forward, conclusions and generalizations are made, and forecasts are offered.

Selection of numerical material to perform work

Works 1-2.

The numerical data are selected from the "Statistical data" table included in the appendix to this set of works. The variant is announced by the teacher.

Work 3.

The original numerical data are the same as the numerical data used in work 1.

Work 4.

Two groups of numerical data are required: indicator X and indicator Y. Indicator X coincides with the numerical data used in work 1. Indicator Y is taken from the row of the "Statistical data" table that follows the row used in work 1.

Work 5.

Two groups of numerical data are required: test and retest. The test coincides with the numerical data used in work 1. The retest values are taken from the second row after the row of the "Statistical data" table used in work 1.

Work 6.

Five groups of data (5 tests) are required. The work is performed for 7 athletes, whose first names are chosen independently; surnames are not mentioned.

To obtain the values of the "body weight" test, take the numerical data from the row of the "Statistical data" table used in work 1 and increase each of them by the same number taken from the range 50-100. Round the resulting numbers to whole values. Make sure the weight values are plausible.

To obtain the values of the "height" test, take the numerical data from the row of the "Statistical data" table used in work 1 and increase each of them by the same number taken from the range 100-150. Round the resulting numbers to whole values. Make sure the height values are plausible.

Adjust the resulting weight and height values to plausible ones.

The remaining five tests and their numerical values ​​are chosen independently.

Work 7.

One test and two criteria are required. The test values are taken from row 33 of the "Statistical data" table. For the first criterion, the numerical data are taken from the row used in work 1. For the second criterion, the row of the "Statistical data" table following the one used in work 1 is taken.

Topic 1. Processing of statistical material using the method of average values

Theoretical information

Processing statistical data by the method of average values is the most popular approach among physical-culture and sports workers. It consists in obtaining a number of average indicators that make it possible to analyze the statistical data.

A). Primary processing of incoming data

The sample size is established, i.e. the number of data items to be processed is determined. Bear in mind that the larger the sample size, the more accurate the resulting indicators, but the more laborious the calculations. During competitions and other events (competition protocols are used), data arrive in random order. For convenience, it is recommended to record the data in a table with five or ten numbers per line, which makes it easier to determine their number.

b). Construction of a variation series (variation table) and determination of its parameters and numerical characteristics for the population under consideration.

Each variation series is a mathematical system, i.e. a group of numbers related to each other. Such a system is characterized by the following indicators:

~ arithmetic mean, denoted X avg (x̄);

~ variance, denoted d or σ 2;

~ standard deviation, denoted σ;

~ coefficient of variation, denoted V.

2. Sequence of data processing:

1. Data ranking.

Write down the data taken from the table (see appendix) in an order convenient for you.

A). A ranking table is constructed according to the model of Table 1-1.

In the first column, the numerical values of the indicators are recorded in ascending order. It is recommended to record sequentially all values from the minimum to the maximum indicator; adjacent values may differ by the measurement accuracy.

In the second column, a mark is made for each occurrence of the corresponding indicator in the sample: a tally stroke (asterisk, dot or other sign) is placed next to the indicator as the sample is viewed sequentially. Some lines in this column may remain empty.

In the third column the number of identical indicators encountered is recorded.

b). Based on Table 1-1, a generalized Table 1-2 is constructed, consisting of two columns.

The first (left) column consists of the indicators themselves - the variants. It is denoted xi and contains the successive indicator values.

The second (right) column contains the number of occurrences of each indicator (variant), called the frequency. It shows how many identical indicators there are and is denoted ni.

The sum of frequencies determines the volume of the population.

Comment. The variant and the frequency are denoted by Latin letters; the index indicates the number of the set to which the corresponding indicator belongs. The volume of the population is denoted by a letter without an index, for example n = 40. When several variation series are considered simultaneously, it is recommended to use different letters.

2. Calculation of the arithmetic mean.

This characteristic is the easiest to calculate and is therefore often used by researchers.

X avg = (x 1 + x 2 + … + x n)/n, where n is the population volume and x 1, x 2, …, x n are the indicators taken from the original table 1-1.

To calculate the arithmetic mean, it is convenient to compile Table 1-3 and then the formula for calculating the arithmetic mean has the form:

X avg = Σ(x i · n i)/n, where x i are the variants, n i the frequencies, and n the population volume.
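A minimal sketch of this frequency-table calculation of the mean (the variants and frequencies here are hypothetical):

```python
def weighted_mean(variants, freqs):
    """X_avg = sum(x_i * n_i) / n, where n = sum(n_i) is the population volume."""
    n = sum(freqs)
    return sum(x * f for x, f in zip(variants, freqs)) / n

x = [2, 3, 4]     # variants x_i
nn = [1, 2, 1]    # frequencies n_i
print(weighted_mean(x, nn))  # 3.0
```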

In the future, other characteristics of the variation series will be considered.

Notes:

1. Table 1-3 is part of Table 1-4, so they can be combined.

2. The accuracy of the calculated results must match the accuracy of the measurements (i.e. have the same number of digits after the decimal point). Intermediate results should be kept with higher precision - one or two spare digits - and the final result rounded to the required accuracy. If rounding to the required accuracy yields zero, round to the first significant digit other than zero, counting from the left.

3. Calculation of variance.

The variance indicates the scatter (dispersion) of the original data about the arithmetic mean. It is denoted d or σ 2 and is calculated by the formula:

d = Σ(x i - X avg) 2 · n i / n.

1. A layout of Table 1-4 is drawn, into which the previously obtained data are entered - for example, in the first to fourth columns. The remaining columns are filled in as the calculations proceed. Note that the first four columns of this table repeat Table 1-3; therefore, if the researcher plans in advance to calculate the variance, Table 1-3 need not be presented separately.

2. X avg is determined.

3. The fifth column of Table 1-4 is filled in: the average is subtracted from each indicator in the second column: x i - X avg.

4. The differences found (the entries of the fifth column) are squared, (x i - X avg) 2, and entered into the sixth column of Table 1-4.

5. The resulting squares (column 6) are multiplied by the corresponding frequencies (column 3), and the products (x i - X avg) 2 · n i are entered in the last column of Table 1-4.

6. The sum S of the resulting products is found - the last column of this table is summed up.

7. The sum S is divided by the volume of the population (in this example n = 25). The result is the variance; it is rounded to the accuracy of the original (processed) indicators.

4. Calculation of standard deviation

The standard deviation is calculated using the formula σ = √d.

5. Calculation of the coefficient of variation.

The coefficient of variation is calculated using the formula V = (σ / X avg) · 100% if the coefficient is presented as a percentage; if it is to be presented as a decimal fraction, the 100% factor is omitted.

6. Analysis of the obtained indicators

The main parameters of a variation series are the arithmetic mean, the standard deviation, and the coefficient of variation.

The inequalities

A < x i < B, where A = X avg - σ and B = X avg + σ,

i.e. X avg - σ < x i < X avg + σ, are constructed.

Based on these characteristics, typical indicators are identified - those falling within the interval (A; B) - and atypical indicators, which fall outside it. The closed interval [A; B] may also be used, i.e. with the boundaries included.
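The whole chain - mean, variance, standard deviation, coefficient of variation and the typical interval (A; B) - can be sketched as follows (the frequency table here is hypothetical):

```python
import math

def series_stats(variants, freqs):
    """Return (mean, variance, std, V%) for a variation series given as
    variants x_i with frequencies n_i (population formulas, divisor n)."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(variants, freqs)) / n
    d = sum((x - mean) ** 2 * f for x, f in zip(variants, freqs)) / n
    sigma = math.sqrt(d)
    v = sigma / mean * 100      # coefficient of variation, %
    return mean, d, sigma, v

x = [4, 5, 6]     # variants x_i
nn = [1, 2, 1]    # frequencies n_i
mean, d, sigma, v = series_stats(x, nn)
A, B = mean - sigma, mean + sigma
typical = [xi for xi in x if A < xi < B]   # variants inside (A; B)
print(mean, d, typical)  # 5.0 0.5 [5]
```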

The theoretical basis for mathematical statistics is probability theory, which studies the patterns of random phenomena in abstract form. Based on these patterns, models or laws of distribution of random values ​​are developed.

The distribution law of a discrete quantity is the assignment of probabilities to its possible values X = x i. The distribution law of a continuous random variable is represented as a distribution function of the values X < x, i.e. in integral form, or as a distribution density. The probability of any individual value of a continuous random variable is 0, while the probability of the values falling within a given gradation equals the increment of the distribution function over the interval Δx occupied by that gradation.

Each theoretical distribution has characteristics similar to those of statistical distributions (expectation M, variance D, coefficients of variation, skewness and kurtosis). These or other constants associated with them are called distribution parameters.

Finding a theoretical distribution that corresponds to the empirical one, i.e. "smoothing" it, is one of the important tasks of climatological processing. If a theoretical distribution is successfully found, the climatologist obtains not only a convenient form of representation of the quantity under study, suitable for machine calculations, but also the ability to calculate characteristics not directly contained in the original series and to identify certain patterns. Thus, the extremes observed at a station are certainly of interest; however, their appearance in the available sample is largely random, so they map poorly and sometimes differ significantly at neighboring stations. If, with the help of the fitted distributions, we determine the extreme characteristics of a given exceedance probability, they are largely free of these shortcomings and are therefore more representative. It is on such calculated extremes that various regulatory requirements are based. Therefore, special attention should be paid to finding a theoretical distribution and checking its correctness.

Distribution parameters can be determined in different ways; the most accurate, but also the most complex, is the maximum-likelihood method. In climatological practice, the method of moments is used.

Statistical characteristics are considered as estimates of distribution parameters characterizing the general population of values ​​of a given random variable.

The method of moments for determining parameter estimates is as follows: the mathematical expectation and the theoretical coefficients of skewness and kurtosis are simply replaced by the empirical mean and the empirical coefficients, while the theoretical variance is taken equal to the empirical variance multiplied by n/(n - 1). If the parameters are functions of moments, they are calculated from the empirical moments.


Let's look at some probabilistic models often used in climatology.

For discrete random variables, the binomial and Poisson distributions (simple and compound) are used.

The binomial distribution (Bernoulli) arises as a result of repetition under constant conditions of the same test, which has two outcomes: the occurrence or non-occurrence of an event (in climatology, for example, the absence or presence of an event on every day of the year or month).

The random discrete variable here is the number of occurrences of some random event (phenomenon) out of n possible cases; it can take the values 0, 1, 2, …, n.

The analytical expression of the binomial distribution law has the form

p(x) = C(n, x) · p^x · (1 - p)^(n - x),   (5.1)

where C(n, x) = n!/(x!(n - x)!) is the number of combinations.

The law determines the probability that an event with probability p will occur x times in n trials. In climatology, for example, a day can be either with or without a phenomenon (fog, a certain amount of precipitation, air temperature in certain gradations, etc.). In all these cases two outcomes are possible, and the question of how many times an event (for example, a day with fog) will be observed can be answered using the binomial law (5.1). In this case p is taken equal to p*, the relative frequency - the ratio of the number of cases with the phenomenon to the total number of cases (formula (2.3)).

For example, if the number of days with fog in August is considered and it is established from a long-term series that on average there are 5 days with fog in August, then the relative frequency (probability) of a day with fog in August (31 days) is p* = 5/31 ≈ 0.16.

The parameters of the binomial distribution are n and p, which are related to the mathematical expectation (mean value), standard deviation, skewness and kurtosis coefficients of this distribution by the expressions

M = np,  σ = √(np(1 − p)),  A = (1 − 2p)/σ,  E = (1 − 6p(1 − p))/(np(1 − p)).

Fig. 5.1 shows graphs of the binomial distribution for different values of the parameters n and p.

Let us calculate, for example, using the binomial law, the probability that the station will experience three days with fog in August, if the probability of fog formation on any day in August (i.e. the ratio of the average number of days with fog in August to the total number of days in the month) is 0.16.

Since n = 31, p = 0.16 and 1 − p = 0.84, formula (5.1) gives

p(3)=0.1334≈0.13
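As a sketch, this calculation can be reproduced with the Python standard library (the function name is ours; with p rounded to 0.16 the result comes out near 0.14, in line with the rounded value above):

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x occurrences in n trials, formula (5.1)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Three foggy days in an August of n = 31 days, daily fog probability p = 0.16
p3 = binomial_pmf(3, 31, 0.16)
print(round(p3, 2))  # ≈ 0.14 with p rounded to 0.16
```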

The limiting case of the binomial distribution, when low-probability events are considered in a long series of independent trials (observations), is the Poisson distribution.

A random variable distributed according to Poisson's law can take values forming the infinite sequence of integers 0, 1, 2, ..., with probability

p(x) = λ^x e^(−λ) / x!,

where λ is the parameter of the distribution, equal to its mathematical expectation.

The law determines the probability that a random variable will be observed x times if its average value (mathematical expectation) is equal to λ.

Note that the parameter of the binomial law is the probability of the event p, so it is necessary to indicate the total number of cases n out of which the probability p(x) is determined. In Poisson's law the parameter is the average number of cases λ over the period under consideration, so the duration of the period does not enter the formula directly.

The variance of the Poisson distribution and the third central moment are equal to the mathematical expectation, that is, they are also equal to λ.

If there are large differences between the mean and the variance, Poisson's law cannot be used. The Poisson distribution is tabulated and given in all collections of statistical tables, reference books and textbooks on statistics. Fig. 5.2 shows the distribution of the number of days with thunderstorms (a rare event) according to Poisson's law for Arkhangelsk: λ = 11 days for the year and λ = 4 days for July. As can be seen from Fig. 5.2, in Arkhangelsk the probability of eight days with a thunderstorm in July is approximately 0.03, and the probability of eight days in a year is about 0.10. Let us note one circumstance. The average number of days with a phenomenon per year λ, for λ ≤ 1, is often interpreted as the reciprocal of the repetition period T (for example, λ = 0.3 means one day every three years, λ = 1 almost annually).

This "averaged" approach is fraught with errors that grow with λ. Even if the days with the phenomenon are independent of one another, years with not one but several such days are likely, so the relation T = 1/λ turns out to be incorrect. Thus, with λ = 1 the phenomenon, as is easily seen from the formula of Poisson's law, is observed not annually but only in 6-7 years out of 10. The probability that the phenomenon is not observed in a given year equals the probability of exactly one day with the phenomenon (0.37) and is close to the probability of two or more days. Only for λ ≤ 0.2 can the indicated relation be used with sufficient justification, because the probability of two or more days in a year is then less than 0.02 (less than once in 50 years).
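The figures quoted above can be checked with a short Poisson calculation (a minimal sketch; the function name is ours):

```python
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    """Probability of x occurrences when the mean number of occurrences is lam."""
    return lam**x * exp(-lam) / factorial(x)

# Arkhangelsk thunderstorm days (Fig. 5.2): lam = 4 for July, lam = 11 for the year
print(round(poisson_pmf(8, 4), 3))   # 0.03
print(round(poisson_pmf(8, 11), 3))  # 0.089

# The return-period fallacy: with lam = 1 the phenomenon is NOT annual --
# the probability of at least one day with it is only about 0.63
print(round(1 - poisson_pmf(0, 1.0), 2))  # 0.63, i.e. 6-7 years out of 10
```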

The application of Poisson's law to rare meteorological events is not always justified. Sometimes rare phenomena follow one another because the conditions causing them persist for a long time, and the assumptions of Poisson's law are not satisfied.

The compound Poisson distribution (negative binomial distribution) is more consistent with the nature of rare meteorological phenomena. It arises when the numbers of occurrences can be considered as values of different random variables (samples from different populations), each having a Poisson distribution but with different parameters λ1, λ2, ..., λk.

The compound Poisson distribution depends, on the one hand, on the distribution of the set of parameters and, on the other, on the distribution of each of the values. The probability in the case of this distribution has the form

p(x) = [Γ(γ + x) / (x! Γ(γ))] · α^x / (1 + α)^(γ + x),  (5.2)

or, in a form more convenient for calculations,

p(x) = [γ(γ + 1)...(γ + x − 1) / x!] · α^x / (1 + α)^(γ + x).

The mathematical expectation M and variance D of this distribution are related to its parameters γ and α by the formulas

M = γα,  D = γα(1 + α).  (5.3)

Replacing M and D by their estimates x̄ and S², we obtain

α ≈ (S² − x̄)/x̄,  γ ≈ x̄²/(S² − x̄).  (5.4)

Calculations of p(x) can be simplified by using the equalities

p(0) = (1 + α)^(−γ),  (5.5)

p(x + 1) = p(x) · (γ + x)/(x + 1) · α/(1 + α).  (5.6)

Hence each successive probability is obtained from the preceding one by a simple multiplication.

Calculation example. Let us calculate the distribution of the number of days with strong wind at the Chulym station in July, if x̄ = 1 day and σ = 1.7 days. Let us determine α and γ:

α ≈ (1.7² − 1)/1 = 1.89,

γ ≈ 1/1.89 ≈ 0.53.

The probability of not a single day with strong wind is

p(0) = (1 + 1.89)^(−0.53) ≈ 0.57.

The probability of exactly one day with strong wind is p(1) ≈ 0.20. The graph of the compound Poisson distribution is shown in Fig. 5.3.
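Assuming the method-of-moments estimates α = (S² − x̄)/x̄ and γ = x̄/α together with the starting value p(0) = (1 + α)^(−γ) and the recurrence p(x + 1) = p(x)(γ + x)α/((x + 1)(1 + α)), the Chulym example can be sketched as follows (the helper name is ours):

```python
def neg_binomial_probs(mean: float, var: float, x_max: int) -> list[float]:
    """Compound Poisson (negative binomial) probabilities p(0), ..., p(x_max),
    built from the mean and variance by the method of moments."""
    alpha = (var - mean) / mean          # requires var > mean
    gamma = mean / alpha
    probs = [(1 + alpha) ** (-gamma)]    # p(0)
    for x in range(x_max):
        # each successive probability from the preceding one
        probs.append(probs[-1] * (gamma + x) / (x + 1) * alpha / (1 + alpha))
    return probs

# Station Chulym, days with strong wind in July: mean = 1 day, sigma = 1.7 days
p = neg_binomial_probs(1.0, 1.7 ** 2, 5)
print(f"{p[0]:.2f} {p[1]:.2f}")  # 0.57 0.20
```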

For continuous random variables in climatology, the most commonly used distributions are normal, lognormal, Charlier distribution, gamma distribution, Weibull and Gumbel distributions, as well as the composition law of normal and uniform density.

The normal, or Gaussian, distribution law has the greatest theoretical and practical significance. It is the limiting law for many other theoretical distributions and arises when each value of a random variable can be regarded as the sum of a sufficiently large number of independent random variables.

The normal law is given by expressions for the density and the distribution function of the form

f(x) = 1/(σ√(2π)) · exp(−(x − M)²/(2σ²)),

F(x) = ∫ from −∞ to x f(t) dt.
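The normal density and distribution function can be evaluated with the standard library alone; a minimal sketch (function names are ours, with F(x) expressed through the error function):

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x: float, m: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the normal law with mean m and standard deviation sigma."""
    return exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x: float, m: float = 0.0, sigma: float = 1.0) -> float:
    """Distribution function F(x), written through the error function erf."""
    return 0.5 * (1 + erf((x - m) / (sigma * sqrt(2))))

print(round(normal_pdf(0.0), 4))    # 0.3989
print(round(normal_cdf(1.96), 3))   # 0.975
```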

When considering the basic principles of probability theory and mathematical statistics and determining the distribution parameters, we proceeded from the assumption that a sufficiently large, in the limit infinite, number of trials n → N (N → ∞) is carried out, which is practically impossible to implement.

However, there are methods that allow these parameters to be estimated from a sample of (a part of) the random events.

The general population is the set of all conceivable values of observations that could be made under a given set of conditions - in other words, all possible realizations of a random variable, of which in the limit there may be an infinite number (N → ∞). A part n < N of this totality, i.e. the results of a limited series of observations x1, x2, ..., xn of a random variable, can be considered as a sample (for example, when determining the chemical composition of alloys, their mechanical strength, etc.). If all ingots of a given grade of steel, cast iron or alloy were cut into samples and examined for chemical composition, mechanical strength and other physical characteristics, we would have the general population of observations. In practice it is possible (and expedient) to study the properties of only a very limited number of samples - this is a sample from the general population.

From the results of such a limited number of observations one can determine point estimates of the distribution laws and their parameters. An estimate (or sample statistic) Q* of a parameter Q is any function Q* = Q*(x1, x2, ..., xn) of the observed values x1, x2, ..., xn that to one extent or another reflects the actual value of the parameter Q.

As regards the characteristics of probability distributions, the characteristics of theoretical distributions (M_x, σ_x², M_o, M_e) can be considered as characteristics existing in the general population, and those characterizing the empirical distribution as their sample characteristics (estimates). The numerical parameters estimating M_x, σ_x², etc. are sometimes called statistics.

To estimate the mathematical expectation, the arithmetic mean (average value) of the measurements in the sample is used:

x̄ = (1/n) Σ x_i,  (3.1)

where x_i is a realization of the discrete random variable (or an individual observed value of a continuous one) and n is the sample size.

To characterize the spread of a random variable, estimates of the theoretical variance - the sample variances - are used (see Fig. 2.4):

S_x² = (1/n) Σ (x_i − x̄)²,  (3.2a)

S_x² = (1/(n − 1)) Σ (x_i − x̄)².  (3.2b)

The non-negative square root of the sample variance is the sample standard (root-mean-square) deviation:

S_x = √(S_x²),

where formulas (3.3a) and (3.3b) correspond to the variances (3.2a) and (3.2b).

It should be noted that in any measurement problem there are two possible ways to obtain an estimate of the value of σ_x².

In the first method, a sequence of instrument readings is taken and, by comparing the results with a known or calibrated value of the measured quantity, a sequence of deviations is found. This sequence of deviations is then used to calculate the standard deviation by formula (3.3a).

The second way to obtain an estimate of σ_x² is to take deviations from the arithmetic mean, since in this case the actual (exact) value of the measured quantity is unknown. Here it is advisable to use the other formulas for the variance and standard deviation, (3.2b) and (3.3b). Division by (n − 1) is done because the best estimate, obtained by averaging the array of X, will differ from the exact value by some amount when a sample is considered rather than the entire population.

In this case the sum of squared deviations is slightly smaller than if the true mean were used; dividing by (n − 1) instead of n partially corrects this error. Some manuals on mathematical statistics recommend always dividing by (n − 1) when calculating the sample standard deviation, although sometimes this should not be done: one should divide by (n − 1) only when the true value has not been obtained by an independent method.

The sample coefficient of variation ν, a measure of the relative variability of a random variable, is calculated by the formula

ν = S_x / x̄,  (3.4a)

or, as a percentage,

ν = (S_x / x̄) · 100 %.  (3.4b)

The sample with the larger coefficient of variation has the larger scatter.
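The sample characteristics discussed above - the mean, the two variance estimates, the standard deviation and the coefficient of variation - can be computed directly; a minimal sketch with a hypothetical helper name:

```python
from math import sqrt

def sample_stats(xs: list[float]) -> tuple[float, float, float, float]:
    """Sample mean, variance divided by n and by (n - 1),
    and coefficient of variation based on the unbiased variance."""
    n = len(xs)
    mean = sum(xs) / n                                    # arithmetic mean
    var_n = sum((x - mean) ** 2 for x in xs) / n          # divide by n
    var_n1 = sum((x - mean) ** 2 for x in xs) / (n - 1)   # divide by n - 1
    cv = sqrt(var_n1) / mean                              # relative variability
    return mean, var_n, var_n1, cv

mean, var_n, var_n1, cv = sample_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, var_n)  # 5.0 4.0; var_n1 is slightly larger, 32/7 ≈ 4.57
```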

The estimates x̄ and S_x² must satisfy the requirements of consistency, unbiasedness and efficiency.

An estimate Q* of a parameter is called consistent if, as the number of observations n grows (i.e. n → N for a finite population of size N, or n → ∞ for an infinite one), it tends to the estimated theoretical value of the parameter. For example, for the variance

lim (n → ∞) S_x² = σ_x².  (3.5)

An estimate Q* of a parameter is called unbiased if its mathematical expectation equals the true value of the parameter for any n: M(Q*) = Q. Satisfying the requirement of unbiasedness eliminates the systematic error of the parameter estimate, which depends on the sample size n and, for a consistent estimate, tends to zero as n → ∞. Two estimates, (3.2a) and (3.2b), were defined above for the variance. When the mathematical expectation (the true value of the measured quantity) is unknown, both estimates are consistent, but only the second, (3.2b) and (3.3b), is, as shown earlier, unbiased. The requirement of unbiasedness is especially important for a small number of observations, since as n → ∞ the two estimates converge.

An estimate Q1* of a parameter is called efficient if among all other estimates of the same parameter Q2*, Q3*, ... it has the smallest variance:

D(Q1*) ≤ D(Qi*),  (3.6)

where Qi* is any other estimate.

Thus, if there is a sample x1, x2, ..., xn from the general population, the mathematical expectation can be estimated in two ways:

x̄ = (1/n) Σ x_i  or  x̃ = (x_max(n) + x_min(n)) / 2,  (3.7)

where x_max(n), x_min(n) are the maximum and minimum values of the random variable in the sample of size n.

Both estimates are consistent and unbiased; however, it can be shown that the variance of the first estimate is S_x²/n, while that of the second is π²S_x²/(24 ln n), i.e. considerably larger. Thus the first method of estimating the mathematical expectation is consistent, unbiased and efficient, while the second is only consistent and unbiased. Of all unbiased and consistent estimates, one should prefer the one that turns out to be closest to the estimated parameter.
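The efficiency comparison can be illustrated by simulation: drawing many normal samples and comparing the scatter of the arithmetic mean with that of the half-sum of the extremes (a sketch; all names are ours):

```python
import random

random.seed(1)

def mean_est(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def midrange_est(xs: list[float]) -> float:
    return (max(xs) + min(xs)) / 2

def variance(v: list[float]) -> float:
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

# Repeat the estimation over many samples of size n from a standard normal law
n, trials = 100, 2000
means = [mean_est([random.gauss(0, 1) for _ in range(n)]) for _ in range(trials)]
mids = [midrange_est([random.gauss(0, 1) for _ in range(n)]) for _ in range(trials)]

# The arithmetic mean shows much less scatter, i.e. it is the efficient estimate
print(variance(means) < variance(mids))  # True
```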

Note that all of the above applies to equal-precision measurements, i.e. to measurements containing only a random error subject to the normal distribution law.