Mathematical statistical analysis. Basic concepts of mathematical statistics.

RANDOM VARIABLES AND THE LAWS OF THEIR DISTRIBUTION.

A random variable is a quantity whose value depends on a combination of random circumstances. Random variables are either discrete or continuous.

A variable is called discrete if it takes a countable set of values. (Examples: the number of patients at a doctor's office, the number of letters on a page, the number of molecules in a given volume.)

A variable is called continuous if it can take any value within a certain interval. (Examples: air temperature, body weight, human height.)

The distribution law of a random variable is the set of its possible values together with the probabilities (or frequencies of occurrence) corresponding to these values.

EXAMPLE:

Distribution given by probabilities:

x    x_1  x_2  x_3  x_4  ...  x_n
p    p_1  p_2  p_3  p_4  ...  p_n

Distribution given by frequencies:

x    x_1  x_2  x_3  x_4  ...  x_n
m    m_1  m_2  m_3  m_4  ...  m_n

NUMERICAL CHARACTERISTICS OF RANDOM VARIABLES.

In many cases, along with the distribution of a random variable or instead of it, information about the variable can be provided by numerical parameters called the numerical characteristics of a random variable. The most commonly used of them are:

1. The expected value (mean value) of a random variable is the sum of the products of all its possible values and the probabilities of these values:

$$M(X) = \sum_{i=1}^{n} x_i p_i$$

2. The variance (dispersion) of a random variable is the expected value of the squared deviation from the mean:

$$D(X) = \sum_{i=1}^{n} \bigl(x_i - M(X)\bigr)^2 p_i$$

3. The standard deviation:

$$\sigma = \sqrt{D(X)}$$
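As an illustration, the three characteristics can be computed for a discrete distribution with the usual definitions M(X) = sum of x_i p_i, D(X) = sum of (x_i - M(X))^2 p_i and sigma = sqrt(D(X)). This is only a sketch: the values and probabilities below are hypothetical and the function names are chosen for this example.

```python
import math

# Hypothetical discrete distribution: possible values and their probabilities.
values = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]


def expected_value(xs, ps):
    """M(X): sum of each value multiplied by its probability."""
    return sum(x * p for x, p in zip(xs, ps))


def variance(xs, ps):
    """D(X): probability-weighted mean of squared deviations from M(X)."""
    m = expected_value(xs, ps)
    return sum((x - m) ** 2 * p for x, p in zip(xs, ps))


def std_dev(xs, ps):
    """Standard deviation: square root of the variance."""
    return math.sqrt(variance(xs, ps))


mean = expected_value(values, probs)
var = variance(values, probs)
sd = std_dev(values, probs)
print(mean, var, sd)
```

The probabilities must sum to 1 for the result to be meaningful; the sketch does not check this.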

The THREE SIGMA rule: if a random variable is normally distributed, then the absolute deviation of its value from the mean is practically certain (probability about 0.9973) not to exceed three standard deviations.
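The rule can be checked numerically. For a normally distributed variable the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)); the sketch below evaluates it with the standard library. The function name is chosen for this example.

```python
import math


def prob_within(k):
    """P(|X - M(X)| < k * sigma) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Prints the probability for one, two and three standard deviations.
    print(k, round(prob_within(k), 4))
```

For k = 3 the probability is about 0.9973, so a deviation larger than three standard deviations is expected in roughly 3 cases out of 1000.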

GAUSS'S LAW - NORMAL DISTRIBUTION LAW

Many quantities encountered in practice are distributed according to the normal law (Gauss's law). Its main feature is that it is the limiting law that other distribution laws approach.

A random variable is normally distributed if its probability density has the form:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - M(X))^2}{2\sigma^2}\right)$$

where

M(X) is the mathematical expectation of the random variable;

σ is the standard deviation.

The probability density (distribution density function) shows how the probability dP that the random variable falls within an interval dx depends on the value of the variable itself:

$$f(x) = \frac{dP}{dx}$$


BASIC CONCEPTS OF MATHEMATICAL STATISTICS

Mathematical statistics is a branch of applied mathematics directly adjacent to probability theory. The main difference between the two is that mathematical statistics considers not operations on distribution laws and numerical characteristics of random variables, but approximate methods for finding these laws and numerical characteristics from experimental results.

The basic concepts of mathematical statistics are:

1. general population;

2. sample;

3. variation series;

4. mode;

5. median;

6. percentile;

7. frequency polygon;

8. histogram.

The general population is a large statistical population from which some objects are selected for study.

(Examples: the entire population of a region, the university students of a city.)

A sample (sample population) is a set of objects selected from the general population.

A variation series is a statistical distribution consisting of variants (values of a random variable) and their corresponding frequencies.

Example:

X, kg
m

x is the value of the random variable (body mass of 10-year-old girls);

m is the frequency of occurrence.

The mode is the value of the random variable that has the highest frequency of occurrence. (In the example above, the mode is 24 kg, the most common value: m = 20.)

The median is the value of the random variable that divides the distribution in half: half of the values lie to the right of the median and half (no more) to the left.

Example:

1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 10, 10

In this example we observe 40 values of a random variable, arranged in ascending order with each value repeated according to its frequency. Twenty (half) of the 40 values lie to the right of the 20th value, which is 7, so the median is 7.

To characterize the scatter, we find the values that 25% and 75% of the measurement results do not exceed. These values are called the 25th and 75th percentiles. While the median cuts the distribution in half, the 25th and 75th percentiles each cut off a quarter of it. (The median itself can be regarded as the 50th percentile.) In the example, the 25th and 75th percentiles are 3 and 8 (the 10th and 30th values), respectively.
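The median and percentiles of this example can be reproduced with a few lines of Python. The sketch uses the "nearest rank" convention (the smallest value that at least q percent of the data do not exceed), which is an assumption made here because it matches the values quoted in the text; other percentile conventions interpolate and may give slightly different numbers.

```python
import math
import statistics

# The 40 ordered observations from the example, written as value * frequency.
data = ([1] * 6 + [2] * 3 + [3] * 2 + [4] * 2 + [5] * 4 +
        [6] * 2 + [7] * 6 + [8] * 6 + [9] * 3 + [10] * 6)


def percentile_nearest_rank(sorted_values, q):
    """Smallest value such that at least q percent of the data are <= it."""
    rank = math.ceil(q / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


ordered = sorted(data)
median = statistics.median(ordered)          # mean of the 20th and 21st values
p25 = percentile_nearest_rank(ordered, 25)   # 10th value
p75 = percentile_nearest_rank(ordered, 75)   # 30th value
print(median, p25, p75)
```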

Both discrete (point) and continuous (interval) statistical distributions are used.

For clarity, statistical distributions are depicted graphically as a frequency polygon or a histogram.

A frequency polygon is a broken line whose segments connect the points with coordinates (x_1, m_1), (x_2, m_2), ..., or, for a polygon of relative frequencies, the points with coordinates (x_1, p*_1), (x_2, p*_2), ... (Fig. 1).


[Fig. 1: frequency polygon (vertical axis: m or m_i/n). Fig. 2: histogram (vertical axis: f(x)).]

A frequency histogram is a set of adjacent rectangles built on one straight line (Fig. 2). The bases of the rectangles are all equal to dx, and the heights are equal to the ratio of the frequency to dx, or of the relative frequency p* to dx (probability density).

Example:

x, kg 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 4.3 4.4
m

Frequency polygon

The ratio of the relative frequency to the width of the interval is called the probability density: f(x) = m_i / (n dx) = p*_i / dx

An example of constructing a histogram .

Let's use the data from the previous example.

1. Calculation of the number of class intervals

where n is the number of observations. In our case n = 100. Consequently:

2. Calculation of the interval width dx :

,

3. Drawing up an interval series:

dx, kg   2.7-2.9  2.9-3.1  3.1-3.3  3.3-3.5  3.5-3.7  3.7-3.9  3.9-4.1  4.1-4.3  4.3-4.5
m        6        15       25       17       11       12       8        5        1
f(x)     0.3      0.75     1.25     0.85     0.55     0.6      0.4      0.25     0.05

Histogram
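The heights of the histogram rectangles follow from the formula f(x) = m_i / (n dx). The sketch below recomputes the f(x) row of the interval series. The class frequencies m are an assumption: they are back-computed from the f(x) values with n = 100 and dx = 0.2, and they do sum to 100.

```python
# Lower limits of the nine class intervals and the common interval width (kg).
lower_limits = [2.7, 2.9, 3.1, 3.3, 3.5, 3.7, 3.9, 4.1, 4.3]
dx = 0.2

# Class frequencies m_i (assumed: back-computed from the f(x) row).
counts = [6, 15, 25, 17, 11, 12, 8, 5, 1]

n = sum(counts)                                        # number of observations
rel_freq = [m / n for m in counts]                     # p*_i = m_i / n
density = [round(m / (n * dx), 2) for m in counts]     # f(x) = m_i / (n * dx)

for lo, f in zip(lower_limits, density):
    print(f"{lo:.1f}-{lo + dx:.1f}: f(x) = {f}")
```

Because the heights are densities, the total area of the rectangles (the sum of f(x) multiplied by dx) equals 1.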

Odessa National Medical University Department of Biophysics, Informatics and Medical Equipment Guidelines for 1st year students on the topic “Fundamentals of Mathematical Statistics” Odessa 2009

1. Topic: “Fundamentals of mathematical statistics”.

2. Relevance of the topic.

Mathematical statistics is a branch of mathematics that studies methods for collecting, systematizing and processing the results of observations of massive random events in order to clarify and apply existing patterns in practice. Methods of mathematical statistics are widely used in clinical medicine and public health. They are used, in particular, in the development of mathematical methods for medical diagnostics, in the theory of epidemics, in planning and processing the results of a medical experiment, and in organizing healthcare. Statistical concepts, consciously or unconsciously, are used to make decisions in such matters as clinical diagnosis, predicting the course of an individual patient's illness, predicting the likely outcomes of certain programs in a given population, and choosing the appropriate program in specific circumstances. Familiarity with the ideas and methods of mathematical statistics is a necessary element of the professional education of every health worker.

3. Lesson objectives. The general goal of the lesson is to teach students to make informed use of mathematical statistics in solving biomedical problems. Specific objectives:
  1. to acquaint students with the basic ideas, concepts and methods of mathematical statistics, paying attention mainly to issues related to processing the results of observations of massive random events in order to clarify and apply existing patterns in practice;
  2. to teach students to consciously apply the basic concepts of mathematical statistics in solving the simplest problems that arise in the professional activities of a doctor.
The student must know (level 2):
  1. class frequency definition (absolute and relative)
  2. definition of the general population and the sample, sample size
  3. point and interval estimation
  4. reliable interval and reliability
  5. determination of mode, median and sample mean
  6. determination of range, interquartile range, quartile deviation
  7. determination of mean absolute deviation
  8. determination of sample covariance and variance
  9. determination of sample standard deviation and coefficient of variation
  10. determination of sample regression coefficients
  11. empirical linear regression equations
  12. determination of the sample correlation coefficient.
The student must acquire basic skills in calculating (level 3):
  1. mode, median and sample mean
  2. range, interquartile range, quartile deviation
  3. mean absolute deviation
  4. sample covariance and variance
  5. sample standard deviation and coefficient of variation
  6. reliable interval for mathematical expectation and variance
  7. sample regression coefficients
  8. sample correlation coefficient.
4. Ways to achieve the goals of the lesson. To achieve the goals of the lesson, you need the following basic knowledge:
  1. Definition of the distribution, distribution series and distribution polygon of a discrete random variable
  2. Definition of functional dependence between random variables
  3. Definition of correlation dependence between random variables
You also need to be able to calculate the probabilities of incompatible and joint events using the appropriate rules.
5. A task for students to check their initial level of knowledge. Test questions:
  1. Definition of a random event, its relative frequency and probability.
  2. The addition theorem of probabilities for incompatible events
  3. The addition theorem of probabilities for joint events
  4. The theorem of multiplication of probabilities of independent events
  5. The theorem of multiplication of probabilities of dependent events
  6. Total Probability Theorem
  7. Bayes' theorem
  8. Definition of random variables: discrete and continuous
  9. Distribution definition, distribution series and distribution polygon of a discrete random variable
  10. Definition of the distribution function
  11. Determination of distribution center location measures
  12. Determination of measures of variability of values ​​of a random variable
  13. Determination of the width of the distribution and the distribution curve of a continuous random variable
  14. Definition of functional dependence between random variables
  15. Determining the correlation between random variables
  16. Regression definition, equation and regression lines
  17. Determination of covariance and correlation coefficient
  18. Definition of a linear regression equation.
6. Information to strengthen the initial knowledge-skills can be found in the manuals:
  1. Zhumatiy P.G. Lecture “Probability Theory”. Odessa, 2009.
  2. Zhumatiy P.G. "Fundamentals of Probability Theory". Odessa, 2009.
  3. Zhumatiy P.G., Senytska Ya.R. Elements of the theory of probability. Methodical instructions for students of medical institute. Odessa, 1981.
  4. Chaly O.V., Agapov B.T., Tsekhmister Ya.V. Medical and biological physics. Kyiv, 2004.
7. The content of the educational material from this topic, highlighting the main key issues.

Mathematical statistics is a branch of mathematics that studies methods for collecting, systematizing, processing, displaying, analyzing and interpreting observational results in order to identify existing patterns.

Statistics is needed in health care both at the community level and at the level of individual patients. Medicine deals with individuals who differ from each other in many ways, and the values of the indicators by which a person can be considered healthy vary from one individual to another. No two patients or two groups of patients are exactly alike, so decisions regarding individual patients or populations must be made on the basis of experience gained from other patients or populations with similar biological characteristics. It must be recognized that, given these differences, such decisions cannot be absolutely accurate: they are always associated with some uncertainty. This is what makes medicine probabilistic in nature.

Some examples of the application of statistical methods in medicine:

interpretation of variation (the variability of the characteristics of an organism when deciding what value of a particular characteristic will be ideal, normal, average, etc., makes it necessary to use appropriate statistical methods).

diagnosis of diseases in individual patients and assessment of the health status of a population group.

predicting the outcome of a disease in individual patients or the possible outcome of a disease control program in any population group.

selection of a suitable intervention for the patient or for the population group.

planning and conducting medical research, analysis and publication of results, their reading and critical evaluation.

health planning and management.

Useful medical information is usually hidden in a mass of raw data. It is necessary to concentrate the information contained in them and present the data in such a way that the structure of the variation is clearly visible, and then select specific methods of analysis.

Presenting data requires familiarity with the following concepts and terms:

variational series (ordered arrangement) - a simple ordering of individual observations of a quantity.

class - one of the intervals into which the entire range of values ​​of a random variable is divided.

extreme points of the class - the value that limits the class, for example 2.5 and 3.0, the lower and upper limits of the class 2.5 - 3.0.

The (absolute) class frequency is the number of observations in the class.

relative class frequency - the absolute frequency of a class, expressed as fractions of the total number of observations.

cumulative class frequency - the number of observations equal to the sum of the frequencies of this class and all preceding classes.

column chart - a graphical representation of data frequencies for nominal classes using columns whose heights are directly proportional to the class frequencies.

pie chart - a graphical representation of data frequencies for nominal classes using circle sectors, the areas of which are directly proportional to the class frequencies.

histogram - a graphical representation of the frequency distribution of quantitative data by the areas of rectangles directly proportional to the frequencies of the classes.

polygon of frequencies - a graph of the frequency distribution of quantitative data; the point corresponding to the frequency of the class is placed above the middle of the interval, each two adjacent points are connected by a straight line segment.

ogive (cumulative curve) - graph of the distribution of cumulative relative frequencies.
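The absolute, relative and cumulative class frequencies defined above can be sketched as follows. The class limits and observations are hypothetical and serve only to illustrate the definitions; each class is taken to include its lower limit and exclude its upper limit, which is an assumption of this example.

```python
from itertools import accumulate

# Hypothetical class limits (classes 2.5-3.0, 3.0-3.5, 3.5-4.0) and observations.
limits = [2.5, 3.0, 3.5, 4.0]
observations = [2.6, 2.9, 3.1, 3.2, 3.4, 3.4, 3.6, 3.9]


def class_frequencies(values, class_limits):
    """Absolute frequency of each class [class_limits[i], class_limits[i + 1])."""
    counts = [0] * (len(class_limits) - 1)
    for v in values:
        for i in range(len(counts)):
            if class_limits[i] <= v < class_limits[i + 1]:
                counts[i] += 1
                break
    return counts


absolute = class_frequencies(observations, limits)
relative = [c / len(observations) for c in absolute]   # fractions of the total
cumulative = list(accumulate(absolute))                # running sum over classes
print(absolute, relative, cumulative)
```

Plotting the cumulative relative frequencies against the upper class limits gives the ogive.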

Variability is inherent in all medical data, so the analysis of measurement results is based on studying which values the random variable under study has taken.

The set of all possible values of a random variable is called the general population.

The part of the general population registered as a result of the tests is called the sample.

The number of observations included in a sample is called the sample size (usually denoted n).

The task of the sampling method is to make a correct estimate of the random variable under study from the sample obtained. The main requirement for a sample is therefore that it reflect all the features of the general population as fully as possible. A sample that satisfies this requirement is called representative. The quality of an estimate, that is, the degree to which the estimate corresponds to the parameter it characterizes, depends on the representativeness of the sample.

When estimating the parameters of the general population from a sample (parametric estimation), the following concepts are used:

point estimation - an estimate of the parameter of the general population in the form of a single value, which it can take with the highest probability.

interval estimation - estimation of a population parameter in the form of an interval of values ​​that has a given probability to cover its true value.

In interval estimation, the following concepts are used:

reliable interval (confidence interval) - an interval of values that covers the true value of the population parameter with a given probability.

reliability (reliable probability, confidence level) - the probability with which the reliable interval covers the true value of the population parameter.

reliable bounds (confidence limits) - the lower and upper bounds of the reliable interval.

The conclusions obtained by the methods of mathematical statistics are always based on a limited, sampled number of observations, so it is natural that the results may differ for a second sample. This circumstance determines the probabilistic nature of the conclusions of mathematical statistics and, as a consequence, the widespread use of probability theory in statistical research.

A typical way of statistical research is as follows:

after estimating the magnitudes or dependencies between them according to observational data, they put forward the assumption that the phenomenon that is being studied can be described by one or another stochastic model

using statistical methods, this assumption can be confirmed or rejected; when confirming, the goal is achieved - a model is found that describes the studied patterns, otherwise they continue to work, putting forward and testing a new hypothesis.

Definition of sample statistical estimates:

mode - the value that occurs most often in the sample

median - the central (median) value of the variation series

range R - the difference between the largest and smallest values ​​in a series of observations

percentiles - the values in the variation series that divide the distribution into 100 equal parts (thus, the median is the 50th percentile)

first quartile - 25th percentile

third quartile - 75th percentile

interquartile range - the difference between the third and first quartiles (covers the central 50% of observations)

quartile deviation - half of the interquartile range

sample mean - arithmetic mean of all sample values ​​(sample estimate of mathematical expectation)

mean absolute deviation - the sum of the deviations from the chosen reference point (taken without regard to sign), divided by the sample size

the average absolute deviation from the sample mean is calculated using the formula

sample variance ( X ) - (sample estimator of variance) is given by

sample covariance -- (sample estimate of covariance K ( X,Y )) equals

sample regression coefficient of Y on X (sample estimate of the regression coefficient of Y on X ) equals

the empirical linear regression equation for Y on X is

the sample X-on-Y regression coefficient (the sample estimate of the X-on-Y regression coefficient) is

the empirical linear regression equation X on Y has the form

sample standard deviation s(X) - (sample estimate of standard deviation) equals the square root of the sample variance

sample correlation coefficient - (sample estimate of the correlation coefficient) equals

sample coefficient of variation  - (sample estimate of coefficient of variation CV) equals

.
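For reference, the estimates listed above can be written in their standard forms as follows. The notation (x̄, s², K*, b, r*, CV*) and the n - 1 divisor in the variance and covariance are assumptions of this summary; the n - 1 divisor is chosen because it agrees with the variance value used in Example 5.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
  && \text{sample mean} \\
\text{MAD} &= \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
  && \text{mean absolute deviation from the sample mean} \\
s^2(X) &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
  && \text{sample variance} \\
K^*(X,Y) &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  && \text{sample covariance} \\
b_{Y/X} &= \frac{K^*(X,Y)}{s^2(X)}, \quad y - \bar{y} = b_{Y/X}\,(x - \bar{x})
  && \text{regression of } Y \text{ on } X \\
b_{X/Y} &= \frac{K^*(X,Y)}{s^2(Y)}, \quad x - \bar{x} = b_{X/Y}\,(y - \bar{y})
  && \text{regression of } X \text{ on } Y \\
s(X) &= \sqrt{s^2(X)}
  && \text{sample standard deviation} \\
r^* &= \frac{K^*(X,Y)}{s(X)\,s(Y)}
  && \text{sample correlation coefficient} \\
CV^* &= \frac{s(X)}{\bar{x}} \cdot 100\%
  && \text{sample coefficient of variation}
\end{align*}
```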

8. Task for self-training of students. 8.1 Task for independent study of material from the topic.

8.1.1 Practical calculation of sample estimates

Practical calculation of sample point estimates

Example 1 .

The duration of the disease (in days) in 20 cases of pneumonia was:

10, 11, 6, 16, 7, 13, 15, 8, 9, 10, 11, 13, 7, 8, 13, 15, 16, 13, 14, 15

Determine the mode, median, range, interquartile range, sample mean, mean absolute deviation from the sample mean, sample variance, sample coefficient of variation.

Solution.

The variation series for the sample has the form

6, 7, 7, 8, 8, 9, 10, 10, 11, 11, 13, 13, 13, 13, 14, 15, 15, 15, 16, 16

Mode

The most frequent number in the sample is 13, so the mode of the sample is 13.

Median

When a variation series contains an even number of observations, the median is the average of the two central members of the series, in this case 11 and 13, so the median is 12.

Range

The minimum value in the sample is 6 and the maximum is 16, so R = 10.

Interquartile range, quartile deviation

In the variation series, a quarter of all data have a value less than or equal to 8, so the first quartile is 8, and 75% of all data have a value less than or equal to 14, so the third quartile is 14. The interquartile range is therefore 6, and the quartile deviation is 3.

sample mean

The arithmetic mean of all sample values ​​is equal to

.

Average absolute deviation from the sample mean

.

Sample variance

Sample standard deviation

.

Sample coefficient of variation

.
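The point estimates of Example 1 can be recomputed with the Python standard library. This is a sketch under two assumptions: quartiles use the nearest-rank convention, and the variance uses the n - 1 divisor. Computed without intermediate rounding, the sample mean is 11.5 and the variance is about 10.5; the values 12 and 10.7 quoted later for this example appear to come from rounding the mean to whole days before computing the variance.

```python
import math
import statistics

# Duration of the disease (days) in 20 cases of pneumonia, from Example 1.
durations = [10, 11, 6, 16, 7, 13, 15, 8, 9, 10,
             11, 13, 7, 8, 13, 15, 16, 13, 14, 15]
ordered = sorted(durations)          # the variation series
n = len(ordered)


def quartile_nearest_rank(sorted_values, fraction):
    """Smallest value that at least the given fraction of the data do not exceed."""
    rank = math.ceil(fraction * len(sorted_values))
    return sorted_values[rank - 1]


mode = statistics.mode(ordered)
median = statistics.median(ordered)
value_range = ordered[-1] - ordered[0]
q1 = quartile_nearest_rank(ordered, 0.25)
q3 = quartile_nearest_rank(ordered, 0.75)
iqr = q3 - q1
quartile_deviation = iqr / 2
mean = statistics.mean(ordered)
mad = sum(abs(x - mean) for x in ordered) / n    # mean absolute deviation
variance = statistics.variance(ordered)          # divisor n - 1
std = math.sqrt(variance)
cv_percent = std / mean * 100                    # coefficient of variation, %

print(mode, median, value_range, q1, q3, iqr, quartile_deviation)
print(mean, round(mad, 2), round(variance, 2), round(std, 2), round(cv_percent, 1))
```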

In the following example, we consider the simplest means of studying a stochastic relationship between two random variables.

Example 2 .

When examining a group of patients, data were obtained on height H (cm) and circulating blood volume V (l):

Find empirical linear regression equations.

Solution.

The first thing to calculate is:

sample mean

sample mean

.

The second thing to calculate is:

sample variance (H)

sample variance (V)

sample covariance

The third is the calculation of sample regression coefficients:

sample regression coefficient V on H

sample regression coefficient H on V

.

Fourth, write down the desired equations:

the empirical linear regression equation for V on H has the form

the empirical linear regression equation for H on V is

.
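The four steps of Example 2 can be sketched in Python as follows. The measurements below are hypothetical and are NOT the data of Example 2; the function names are chosen for this illustration, and the n - 1 divisor is assumed for the variance and covariance.

```python
import statistics


def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def regression_line(xs, ys):
    """Return (slope, intercept) of the empirical linear regression of ys on xs."""
    slope = sample_covariance(xs, ys) / statistics.variance(xs)
    intercept = statistics.mean(ys) - slope * statistics.mean(xs)
    return slope, intercept


# Hypothetical heights (cm) and circulating blood volumes (l).
height_cm = [160, 165, 170, 175, 180]
volume_l = [4.4, 4.6, 5.0, 5.1, 5.4]

b_vh, a_vh = regression_line(height_cm, volume_l)   # regression of V on H
b_hv, a_hv = regression_line(volume_l, height_cm)   # regression of H on V

print(f"V = {b_vh:.3f} * H + ({a_vh:.2f})")
print(f"H = {b_hv:.2f} * V + ({a_hv:.1f})")
```

The two regression lines coincide only when the points lie exactly on a straight line; otherwise the regression of V on H and the regression of H on V are different lines through the point of sample means.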

Example 3 .

Using the conditions and results of Example 2, calculate the correlation coefficient and test the existence of a correlation between human height and circulating blood volume with a 95% reliable probability.

Solution.

The correlation coefficient is related to the regression coefficients by a practically useful formula

.

For the sample estimate of the correlation coefficient, this formula has the form

.

Using the values of the sample regression coefficients from Example 2, we get

.

The reliability of the correlation dependence between random variables (assuming a normal distribution for each of them) is checked as follows:

  • calculate the value of T

  • find the coefficient in the Student's distribution table

  • the existence of a correlation dependence between the random variables is confirmed when the following inequality holds

.

Since 3.5 > 2.26, the existence of a correlation between the patient's height and the volume of circulating blood can be considered established with a 95% reliable probability.
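The check can be sketched in Python under stated assumptions. The correlation coefficient is taken as r = ±sqrt(b_Y/X · b_X/Y), with the sign of the regression coefficients, and the statistic as T = |r| · sqrt(n - 2) / sqrt(1 - r²); these are the standard textbook forms and are assumed here to be the formulas meant in the text. The regression coefficients and the sample size n = 11 are hypothetical; n is chosen so that n - 2 = 9 degrees of freedom, for which the two-sided 95% Student coefficient is about 2.26, the value quoted above.

```python
import math


def correlation_from_regression(b_yx, b_xy):
    """r = +/- sqrt(b_yx * b_xy); the sign is that of the regression coefficients."""
    product = b_yx * b_xy
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    return math.copysign(math.sqrt(product), b_yx)


def t_statistic(r, n):
    """T = |r| * sqrt(n - 2) / sqrt(1 - r^2) for a sample of size n."""
    return abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Hypothetical regression coefficients and sample size (not those of Example 2).
r = correlation_from_regression(0.05, 12.5)
n = 11
T = t_statistic(r, n)

T_CRITICAL = 2.26   # Student coefficient for 95% and n - 2 = 9 degrees of freedom
significant = T > T_CRITICAL
print(round(r, 3), round(T, 2), significant)
```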

Interval estimates for mean and variance

If the random variable has a normal distribution, then the interval estimates for the mathematical expectation and variance are calculated in the following sequence:

1. find the sample mean;

2. calculate the sample variance and the sample standard deviation s;

3. in the Student's distribution table, find the Student's coefficient for the given reliable probability and sample size n;

4. write the reliable interval for the mathematical expectation as

5. in the chi-square distribution table, find the coefficients for the given reliable probability and sample size n

;

6. The reliable interval for dispersion is written as

The width of the reliable interval, the reliable probability and the sample size n depend on each other. In fact, the ratio

decreases as n grows, so for a fixed width of the reliable interval the reliable probability increases as n grows. For a fixed reliable probability, the reliable interval narrows as the sample size increases. When planning medical research, this relationship is used to determine the minimum sample size that provides the reliable interval and reliable probability required by the problem being solved.

Example 5

Using the conditions and results of Example 1, find the interval estimates of the mean and variance for the 95% reliable probability.

Solution.

In Example 1, the point estimates of the mean (sample mean = 12), the variance (sample variance = 10.7) and the standard deviation (sample standard deviation) were found. The sample size is n = 20.

From the Student's distribution table, we find the value of the coefficient

then we calculate the half-width d of the reliable interval

and write down the interval estimate of the expectation

10.5 < M(X) < 13.5 at a reliable probability of 95%

From the Pearson distribution table "chi-square" we find the coefficients

calculate the lower and upper reliable bounds

and write the interval estimate for the variance in the form

6.2 < D(X) < 23 at a reliable probability of 95%.
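The interval estimates of this example can be reproduced numerically. The sketch assumes the usual forms for a normal population: the half-width d = t · s / sqrt(n) for the mean, and the bounds (n - 1) s² / chi² for the variance. The table coefficients for 95% and 19 degrees of freedom (2.093, 32.852 and 8.907) are standard table values entered by hand, and the sample mean and variance are those quoted in the text.

```python
import math

n = 20
sample_mean = 12.0        # point estimate quoted in the text
sample_variance = 10.7    # point estimate quoted in the text

T_COEF = 2.093        # Student coefficient, 95%, n - 1 = 19 degrees of freedom
CHI2_UPPER = 32.852   # chi-square quantile of order 0.975, 19 degrees of freedom
CHI2_LOWER = 8.907    # chi-square quantile of order 0.025, 19 degrees of freedom

# Reliable interval for the mathematical expectation.
half_width = T_COEF * math.sqrt(sample_variance / n)
mean_interval = (sample_mean - half_width, sample_mean + half_width)

# Reliable interval for the variance.
var_interval = ((n - 1) * sample_variance / CHI2_UPPER,
                (n - 1) * sample_variance / CHI2_LOWER)

print([round(v, 1) for v in mean_interval])
print([round(v, 1) for v in var_interval])
```

The interval for the variance is not symmetric about the point estimate because the chi-square distribution is skewed.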

8.1.2. Tasks for independent solution

Problems 5.4 C 1 - 8 are proposed for independent solution (P.G. Zhumatiy. “Mathematical processing of biomedical data. Tasks and examples”. Odessa, 2009, p. 24-25)

8.1.3. test questions
  1. Class frequency (absolute and relative).
  2. General population and sample, sample size.
  3. Point and interval estimation.
  4. Reliable interval and reliability.
  5. Mode, median and sample mean.
  6. Range, interquartile range, quartile deviation.
  7. Average absolute deviation.
  8. Sample covariance and variance.
  9. Sample standard deviation and coefficient of variation.
  10. Sample regression coefficients.
  11. Empirical regression equations.
  12. Calculation of the correlation coefficient and the reliability of the correlation.
  13. Construction of interval estimates for normally distributed random variables.
8.2 Main literature
  1. Zhumatiy P.G. “Mathematical processing of biomedical data. Tasks and examples”. Odessa, 2009.
  2. Zhumatiy P.G. Lecture “Mathematical statistics”. Odessa, 2009.
  3. Zhumatiy P.G. "Fundamentals of Mathematical Statistics". Odessa, 2009.
  4. Zhumatiy P.G., Senytska Ya.R. Elements of the theory of probability. Methodical instructions for students of medical institute. Odessa, 1981.
  5. Chaly O.V., Agapov B.T., Tsekhmister Ya.V. Medical and biological physics. Kyiv, 2004.
8.3 Further reading
  1. Remizov O.M. Medical and biological physics. M., Higher School, 1999.
  2. Remizov O.M., Isakova N.Kh., Maksina O.G. Collection of problems from medical and biological physics. M., Higher School, 1987.
Methodological instructions were compiled by Assoc. P. G. Zhumatiy.

3.1.1 Tasks and methods of mathematical statistics

Mathematical statistics is a branch of mathematics devoted to methods of collecting, analyzing and processing statistical observational data for scientific and practical purposes. Its methods are used when studying the distribution of mass phenomena, i.e. large collections of objects or phenomena distributed according to a certain feature.

Suppose that a set of homogeneous objects united by a common feature or property, qualitative or quantitative, is being studied. The individual elements of such a set are called its members. The total number of members of a population is its size (volume). The set of all objects united by some feature is called the general population. For example, one may study the income of the population, the market value of shares, or deviations from the State Standard in the course of quality assessment of manufactured products.

Mathematical statistics is closely related to probability theory and relies on its conclusions. In particular, the concept of the general population in mathematical statistics corresponds to the concept of the space of elementary events in probability theory.

Studying the entire general population is most often impossible or impractical because of significant material costs or because the object of study is damaged or destroyed. Thus, it is impossible to obtain objective and complete information on the income of every individual inhabitant of an entire region. Because the object of study is spoiled by testing, it is likewise impossible to obtain complete information about the quality of, for example, certain medicines or food products.

The main task of mathematical statistics is to study the general population from sample data in accordance with the stated goal, that is, to study the probabilistic properties of the population (the distribution law, numerical characteristics, etc.) in order to make managerial decisions under conditions of uncertainty.

3.1.2 Sample types

One of the methods of mathematical statistics is the sampling method. In practice, it is usually not the entire population that is studied but a limited sample from it.

A sample (sample set) is a set of randomly selected objects. With the sampling method, what is examined is not the entire population but a sample (x_1, x_2, ..., x_n) obtained from a limited number of observations. The probabilistic properties of this sample are then used to draw conclusions about the entire general population. Various sampling methods are used to obtain a sample. After being studied, the objects may be returned to the general population, which corresponds to a repeated sample (sampling with replacement).

A sample is called representative if it reproduces the general population well, that is, if the probabilistic properties of the sample coincide with or are close to those of the general population itself.

The effectiveness of the sampling method increases under a number of conditions, which include the following:

    The number of sample items studied is sufficient to draw conclusions, that is, the sample is representative.

For example, the sufficient number of parts in a batch being checked for quality (defects) is established using the laws of probability theory and mathematical statistics.

    The sample items must be varied and taken at random, i.e. the principle of randomization must be respected.

    The feature being studied is typical of all elements of the set of studied objects, i.e. of the entire general population.

    The feature being studied is essential for all elements of this class.

A change in a feature of a statistical population studied by the sampling method is called variation, and the observed values x_i of the feature are called variants. The absolute frequency (or simply the frequency) of a variant x_i is the number of members of the population (general or sample) that have the value x_i (i.e. the number of items of the i-th kind).

A ranked grouping of the variants by individual values of the feature (or by intervals of its values), i.e. a sequence of variants arranged in ascending order, is called a variation series. Any function of the observed values X_1, X_2, ..., X_n of the random variable under study is called a statistic.

It is customary to denote the size of the general population by N and its absolute frequencies by N_i, and the sample size by n and its absolute frequencies by n_i. Obviously,

$$\sum_{i} N_i = N,$$
$$\sum_{i} n_i = n.$$

The ratio of a frequency to the size of the population is called the relative frequency or statistical probability and is denoted W_i:

$$W_i = \frac{n_i}{n}.$$

If the number of variants is large or close to the sample size (for a discrete distribution), or if the sample is drawn from a continuous general population, the variation series is compiled not from individual (point) values but from intervals of population values. A variation series presented as a table and constructed by this grouping procedure is called an interval series. When compiling an interval variation series, the first row of the table is filled with intervals of equal length covering the values of the studied population, and the second row with the corresponding absolute or relative frequencies.

Suppose that, as a result of $n$ observations, a sample of size $n$ has been drawn from some general population. The statistical distribution of the sample is the list of variants and their corresponding absolute or relative frequencies. The point variation series of absolute frequencies can be represented by a table:

$x_i$    $x_1$    $x_2$    ...    $x_k$
$n_i$    $n_1$    $n_2$    ...    $n_k$

and $\sum_{i=1}^{k} n_i = n$.

The point variation series of relative frequencies is represented by a table:

$x_i$    $x_1$    $x_2$    ...    $x_k$
$W_i$    $W_1$    $W_2$    ...    $W_k$

and $\sum_{i=1}^{k} W_i = 1$.

When an interval distribution is constructed, there are rules for choosing the number of intervals or the size of each interval. The criterion is an optimal trade-off: as the number of intervals increases, representativeness improves, but the amount of data and the time needed to process it also increase. The difference $x_{\max} - x_{\min}$ between the largest and smallest variant values is called the range of the sample.

To determine the number of intervals $k$, Sturges' empirical formula is usually used:

$k = 1 + 3.322 \lg n$ (3.1)

(with rounding to the nearest integer). Accordingly, the width $h$ of each interval can be calculated using the formula:

$h = (x_{\max} - x_{\min}) / k$. (3.2)

The start of the first interval is commonly taken to be $x_{\min} - 0.5h$.
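Formulas (3.1) and (3.2) can be sketched in a few lines of Python. This is only an illustration: the function names and the numbers used (a sample of 50 observations ranging from 21 to 51) are assumptions chosen for the example, not values prescribed by the text.

```python
import math


def sturges_intervals(n):
    """Number of intervals k by Sturges' formula (3.1), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


def interval_width(x_min, x_max, k):
    """Interval width h by formula (3.2)."""
    return (x_max - x_min) / k


# Hypothetical sample summary: 50 observations between 21 and 51.
k = sturges_intervals(50)
h = interval_width(21, 51, k)
print(k, round(h, 2))
```

In practice the width is usually rounded to a convenient value, so the final number of intervals may differ slightly from $k$.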

Each interval must contain at least five variants. When an interval contains fewer than five variants, it is customary to merge adjacent intervals.
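The merging rule can be sketched as follows. The text does not say which neighbour a sparse interval should be merged with, so this sketch assumes that a sparse interval is joined to the one that follows it, and that a sparse tail is joined to the last merged interval. The function name, edges, and counts are hypothetical.

```python
def merge_sparse_intervals(edges, counts, min_count=5):
    """Merge adjacent intervals until each one holds at least min_count variants.

    edges  -- interval boundaries, len(edges) == len(counts) + 1
    counts -- absolute frequencies of the intervals
    """
    merged_edges = [edges[0]]
    merged_counts = []
    accumulated = 0
    for right_edge, count in zip(edges[1:], counts):
        accumulated += count
        if accumulated >= min_count:
            # Enough variants collected: close the current merged interval.
            merged_edges.append(right_edge)
            merged_counts.append(accumulated)
            accumulated = 0
    if accumulated > 0:
        # A sparse tail remains: attach it to the last merged interval.
        if merged_counts:
            merged_counts[-1] += accumulated
            merged_edges[-1] = edges[-1]
        else:
            merged_counts.append(accumulated)
            merged_edges.append(edges[-1])
    return merged_edges, merged_counts


# Hypothetical interval series with several sparse intervals.
edges = [20, 24, 28, 32, 36, 40, 44, 48, 52]
counts = [1, 4, 22, 8, 7, 4, 2, 2]
new_edges, new_counts = merge_sparse_intervals(edges, counts)
print(new_edges, new_counts)
```

Note that after merging, the intervals are no longer of equal width, which matters if a histogram is drawn from them.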

Mathematical statistics is a branch of mathematics that studies approximate methods for collecting and analyzing experimental data in order to identify existing patterns, i.e. to find the distribution laws of random variables and their numerical characteristics.

In mathematical statistics, it is customary to distinguish two main areas of research:

1. Estimation of the parameters of the general population.

2. Testing statistical hypotheses (some a priori assumptions).

The basic concepts of mathematical statistics are: general population, sample, theoretical distribution function.

The general population is the set of all conceivable statistical data in observations of a random variable.

$X_G = (x_1, x_2, x_3, \ldots, x_N) = \{x_i;\ i = 1, \ldots, N\}$

The observed random variable $X$ is called the trait or sampling factor. The general population is the statistical analogue of a random variable. Its size $N$ is usually large, so a part of the data is selected from it, called the sample population or simply the sample.

$X_B = (x_1, x_2, x_3, \ldots, x_n) = \{x_i;\ i = 1, \ldots, n\}$

$X_B \subset X_G$, $n \le N$

A sample is a collection of observations (objects) randomly selected from the general population for direct study. The number of objects in the sample is called the sample size and is denoted by $n$. Typically, the sample makes up 5% to 10% of the general population.

Using a sample to establish the patterns that an observed random variable obeys makes it possible to avoid complete (exhaustive) observation, which is often resource-intensive or simply impossible.

For example, a population is a set of individuals. Studying an entire population is laborious and expensive, so data are collected on a sample of individuals who are considered representative of the population, which makes it possible to draw conclusions about the population as a whole.

However, the sample must satisfy the condition of representativeness, i.e. it must give a reasonable idea of the general population. How is a representative sample formed? Ideally, one aims for a random (randomized) sample: a list of all individuals in the population is compiled, and individuals are selected from it at random. Sometimes the cost of compiling such a list is unacceptable, and a convenient sample is taken instead, for example one clinic or hospital, in which all patients with the disease in question are examined.

Each item in the sample is called a variant. The number of times a variant is repeated in the sample is called its frequency of occurrence. The value $\omega_i = n_i / n$ is called the relative frequency of the variant, i.e. it is found as the ratio of the absolute frequency of the variant to the sample size. A sequence of variants written in ascending order is called a variational series.


Let us consider three forms of variation series: ranked, discrete, and interval.

A ranked series is a list of the individual units of the population in ascending order of the trait under study.

A discrete variation series is a table consisting of columns or rows: a specific value $x_i$ of the trait and the absolute frequency $n_i$ (or relative frequency $\omega_i$) with which the $i$-th value of the trait $x$ occurs.

An example of a variation series is the table

Write the distribution of relative frequencies.

Solution: Find the relative frequencies. To do this, we divide the frequencies by the sample size:

The distribution of relative frequencies has the form:

0.15 0.5 0.35

Control: 0.15 + 0.5 + 0.35 = 1.

A discrete series can be represented graphically. In a rectangular Cartesian coordinate system, the points with coordinates $(x_i, n_i)$ or $(x_i, \omega_i)$ are marked and connected by straight line segments. The resulting broken line is called a frequency polygon.

Construct a discrete variation series (DVR) and draw a distribution polygon for 45 applicants according to the number of points they received in the entrance exams:

39 41 40 42 41 40 42 44 40 43 42 41 43 39 42 41 42 39 41 37 43 41 38 43 42 41 40 41 38 44 40 39 41 40 42 40 41 42 40 43 38 39 41 41 42.

Solution: To construct the variational series, we arrange the distinct values of the trait $x$ (the variants) in ascending order and write the frequency of each value beneath it.
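The counting step described in the solution can be sketched in Python. The scores are the 45 values given in the problem statement. The variable names are chosen for this illustration only.

```python
from collections import Counter

# Entrance exam scores of the 45 applicants, as listed in the problem.
scores = [
    39, 41, 40, 42, 41, 40, 42, 44, 40, 43, 42, 41, 43, 39, 42,
    41, 42, 39, 41, 37, 43, 41, 38, 43, 42, 41, 40, 41, 38, 44,
    40, 39, 41, 40, 42, 40, 41, 42, 40, 43, 38, 39, 41, 41, 42,
]

n = len(scores)
counts = Counter(scores)

# Variants in ascending order with their absolute and relative frequencies.
variants = sorted(counts)
frequencies = [counts[x] for x in variants]
relative_frequencies = [counts[x] / n for x in variants]

for x, n_i, w_i in zip(variants, frequencies, relative_frequencies):
    print(f"{x}\t{n_i}\t{w_i:.3f}")
```

The pairs (variant, frequency) printed here are the points that are joined to form the frequency polygon.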

Let's build a polygon of this distribution:

Fig. 13.1. Frequency polygon

An interval variation series is used when the number of observations is large. To build such a series, one chooses the number of intervals of the trait and sets the interval width. The more groups there are, the narrower the interval. The number of groups in a variation series can be found using the Sturges formula $k = 1 + 3.322 \lg n$ ($k$ is the number of groups, $n$ is the sample size), and the interval width is

$h = (x_{\max} - x_{\min}) / k$,

where $x_{\max}$ is the maximum and $x_{\min}$ the minimum value of the variant, and their difference $R = x_{\max} - x_{\min}$ is called the range of variation.

We study a sample of 100 people from the totality of all students of a medical university.

Solution: Calculate the number of groups: $k = 1 + 3.322 \lg 100 = 1 + 3.322 \cdot 2 = 7.644$. Thus, to compile an interval series, it is best to divide this sample into 7 or 8 groups. The set of groups into which the observation results are divided, together with the frequencies with which results fall into each group, is called an aggregate.

A histogram is used to visualize a statistical distribution.

A frequency histogram is a stepped figure consisting of adjacent rectangles built on the same straight line. Their bases are identical and equal to the interval width, and their heights are equal either to the frequency $n_i$ of falling into the interval or to the relative frequency $\omega_i$.

Observations of the number of particles registered by a Geiger counter per minute gave the following results:

21 30 39 31 42 34 36 30 28 30 33 24 31 40 31 33 31 27 31 45 31 34 27 30 48 30 28 30 33 46 43 30 33 28 31 27 31 36 51 34 31 36 34 37 28 30 39 31 42 37.

Based on these data, build an interval variation series with equal intervals (I interval 20-24; II interval 24-28, etc.) and draw a histogram.

Solution: n = 50
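The grouping into intervals of width 4 starting at 20 can be sketched in Python using the 50 readings above. The problem does not say which interval a boundary value such as 24 or 28 belongs to. This sketch assumes each interval includes its left end and excludes its right end, so 24 is counted in interval II.

```python
# Geiger counter readings (particles per minute), as listed in the problem.
readings = [
    21, 30, 39, 31, 42, 34, 36, 30, 28, 30, 33, 24, 31, 40, 31, 33, 31,
    27, 31, 45, 31, 34, 27, 30, 48, 30, 28, 30, 33, 46, 43, 30, 33, 28,
    31, 27, 31, 36, 51, 34, 31, 36, 34, 37, 28, 30, 39, 31, 42, 37,
]

start, width = 20, 4  # interval I is 20-24, interval II is 24-28, and so on

# Number of intervals needed to cover the largest reading.
n_intervals = (max(readings) - start) // width + 1
interval_counts = [0] * n_intervals

for x in readings:
    # Assumption: left-closed, right-open intervals [a, a + width).
    interval_counts[(x - start) // width] += 1

edges = [start + i * width for i in range(n_intervals + 1)]
for left, right, n_i in zip(edges, edges[1:], interval_counts):
    print(f"{left}-{right}\t{n_i}\t{n_i / len(readings):.2f}")
```

The counts (or the relative frequencies in the last column) give the heights of the histogram rectangles. With the opposite boundary convention, some counts would shift between neighbouring intervals.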

The histogram of this distribution looks like:

Fig. 13.2. Distribution histogram

Exercises

№ 13.1. The mains voltage was measured every hour. The following values were obtained (V):

227 219 215 230 232 223 220 222 218 219 222 221 227 226 226 209 211 215 218 220 216 220 220 221 225 224 212 217 219 220.

Build a statistical distribution and draw a polygon.

№ 13.2. Observations of blood sugar in 50 people gave the following results:

3.94 3.84 3.86 4.06 3.67 3.97 3.76 3.61 3.96 4.04

3.82 3.94 3.98 3.57 3.87 4.07 3.99 3.69 3.76 3.71

3.81 3.71 4.16 3.76 4.00 3.46 4.08 3.88 4.01 3.93

3.92 3.89 4.02 4.17 3.72 4.09 3.78 4.02 3.73 3.52

3.91 3.62 4.18 4.26 4.03 4.14 3.72 4.33 3.82 4.03

Based on these data, build an interval variation series with equal intervals (I - 3.45-3.55; II - 3.55-3.65, etc.) and depict it graphically as a histogram.

№ 13.3. Construct a range of frequencies for the distribution of erythrocyte sedimentation rate (ESR) in 100 people.

Data obtained in an experiment are characterized by variability, which can be caused by random error: the error of the measuring device, the heterogeneity of the samples, and so on. After collecting a large amount of homogeneous data, the experimenter needs to process it in order to extract the most accurate information about the quantity under consideration. For processing large arrays of measurements, observations, and other data obtained during an experiment, it is convenient to use the methods of mathematical statistics.

Mathematical statistics is inextricably linked with probability theory, but there is a significant difference between the two. Probability theory works with already known distributions of random variables, from which the probabilities of events, the mathematical expectation, and so on are calculated. The task of mathematical statistics is to obtain the most reliable information about the distribution of a random variable from experimental data.

Typical areas of mathematical statistics:

  • sampling theory;
  • estimation theory;
  • testing of statistical hypotheses;
  • regression analysis;
  • analysis of variance.

Methods of mathematical statistics

Methods of estimation and hypothesis testing are based on probabilistic and hyper-random models of the origin of the data.

Mathematical statistics estimates parameters and functions of them that represent important characteristics of distributions (median, mathematical expectation, standard deviation, quantiles, etc.), as well as density and distribution functions. Both point and interval estimates are used.

Modern mathematical statistics contains a large section, statistical sequential analysis, in which the array of observations may be formed one observation at a time.

Mathematical statistics also contains a general theory of hypothesis testing and a large number of methods for testing specific hypotheses (for example, about the symmetry of a distribution, about the values of parameters and characteristics, about the agreement of the empirical distribution function with a given distribution function, or about homogeneity, i.e. the coincidence of characteristics or distribution functions in two samples).

A branch of mathematical statistics of great importance deals with conducting sample surveys, with the properties of different sampling schemes, and with the construction of adequate methods of estimation and hypothesis testing. The methods of mathematical statistics directly use the following basic concepts.

Sample

Definition 1

A sample is the data obtained during an experiment.

For example, the measured ranges of bullets fired from the same gun or from a group of guns of the same type.

Empirical distribution function

Remark 1

The distribution function makes it possible to express all the most important characteristics of a random variable.

In mathematical statistics, a distinction is made between the theoretical distribution function (not known in advance) and the empirical distribution function.

The empirical function is determined from experimental (empirical) data, i.e. from the sample.
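As an illustration, an empirical distribution function can be computed from a sample as the fraction of sample elements not exceeding a given value. This is a minimal sketch: the function name and the sample are hypothetical, and the "less than or equal" convention is an assumption (some textbooks define the function with a strict inequality instead).

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return F(x) = (number of sample elements <= x) / n for the given sample."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many ordered elements are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical sample of five observations.
F = empirical_cdf([3, 1, 2, 2, 5])
print(F(0), F(2), F(4), F(5))
```

The function is a step function that rises from 0 to 1, with a jump at each observed value proportional to that value's frequency.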

Histogram

Histograms are used to provide a visual, though rather approximate, representation of an unknown distribution.

A histogram is a graphical representation of the distribution of data.

To obtain a high-quality histogram, the following rules should be observed:

  • The number of partitioning intervals should be significantly smaller than the sample size.
  • The partitioning intervals must contain a sufficient number of sample elements.

If the sample is very large, the range of the sample values is often divided into equal parts.

Sample mean and sample variance

Using these concepts, one can obtain an estimate of the necessary numerical characteristics of an unknown distribution without resorting to the construction of a distribution function, a histogram, etc.
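A minimal sketch of both quantities using Python's standard `statistics` module. The sample is hypothetical. The text does not specify whether the sample variance divides by $n$ or by $n - 1$, so both variants are shown.

```python
import statistics

# Hypothetical sample of eight observations.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

sample_mean = statistics.mean(sample)

# Variance with divisor n (the sample treated as the whole population).
variance_n = statistics.pvariance(sample)

# Variance with divisor n - 1 (the unbiased estimate of the population variance).
variance_n_minus_1 = statistics.variance(sample)

print(sample_mean, variance_n, round(variance_n_minus_1, 3))
```

For large samples the two variance estimates are close. The difference matters mainly for small $n$.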


