Basic statistical parameters of large and small sample populations and their characteristics. The bootstrap, small samples, and their application in data analysis. Small sample theory

The methods discussed above for calculating the characteristics of a sample population (variance, average and maximum errors, etc.) assume a sufficiently large sample size (n > 30). However, a large sample is not always possible or advisable. In industrial observation and in scientific research it is often necessary to work with samples of no more than 30 units (agronomic and zootechnical experiments, product quality checks that involve the destruction of specimens, etc.). In statistics these are called small samples, while samples of more than 30 units are called large samples.

A small sample size reduces precision compared to a large sample. Nevertheless, it has been proven that results obtained from small samples can also be generalized to the general population, provided that certain features are taken into account, in particular when calculating the standard deviation. If the sample size is small, the unbiased variance estimate s² should be used.

The foundations of the theory of small samples were developed by the English mathematician and statistician W. Gosset (writing under the pseudonym Student). Student's studies showed that when the sample size is small, the standard deviation in the sample can differ significantly from the standard deviation in the general population.

Since the standard deviation of the population is one of the parameters of the normal distribution curve, using the normal distribution function to estimate population parameters from small-sample data is inappropriate because of the large errors involved.

When calculating the average error for small samples, the unbiased estimate of the variance should always be used:

s² = Σ(xᵢ − x̄)² / (n − 1),  μ = √(s²/n),

where n − 1 is the number of degrees of freedom of variation (k), understood as the number of units capable of taking arbitrary values without changing the general characteristic (the average).

For example, three observations were made: x₁ = 4; x₂ = 2; x₃ = 6. The average value is x̄ = (4 + 2 + 6)/3 = 4.

So, only two freely varying quantities remain, because the third can be recovered from the other two and the average: x₃ = 3·x̄ − x₁ − x₂ = 12 − 4 − 2 = 6.

Therefore, for this example, the number of degrees of freedom of variation is 2 (k = n - 1 = 3 - 1 = 2).
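The degrees-of-freedom argument can be checked numerically. A minimal sketch using the example's values x₁ = 4, x₂ = 2, x₃ = 6:

```python
from statistics import mean

xs = [4, 2, 6]
x_bar = mean(xs)  # average of the three observations: 4

# Given the average and any two of the values, the third is fixed:
x3 = len(xs) * x_bar - xs[0] - xs[1]
print(x3)  # recovers the third value, so only n - 1 = 2 values vary freely
```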

Student substantiated the law of distribution of deviations of sample means from the general mean for small samples. According to the Student distribution, the probability that the marginal error does not exceed t times the average error depends on the sample size.

The theoretical normalized deviation for small samples is called the t-criterion, in contrast to the z-criterion of the normal distribution, which is used for large samples. The values of Student's t-criterion are given in special tables (Appendix 3).

Let us consider the procedure for determining the average and maximum error for a small sample using an example. Suppose that, to determine the amount of losses during potato harvesting, five randomly selected plots of 4 m² each were dug up. The losses by plot were (kg): 0.6; 0.2; 0.8; 0.4; 0.5.

The average loss is x̄ = (0.6 + 0.2 + 0.8 + 0.4 + 0.5)/5 = 0.5 kg.

Judging by the individual observations, the magnitude of losses varies greatly, and an average based on only five observations may carry a large error.

To calculate the sampling errors, we determine the unbiased variance estimate: s² = Σ(xᵢ − x̄)²/(n − 1) = 0.20/4 = 0.05.

Next we calculate the average error of the sample mean, using the unbiased estimate in place of the standard deviation: μ = √(s²/n) = √(0.05/5) = 0.10 kg.

Using Student's tables (Appendix 3), we find that with confidence probability P = 0.95 (significance level α = 0.05) and k = n − 1 = 5 − 1 = 4 degrees of freedom, t = 2.78. The maximum sampling error is then ε = t·μ = 2.78 × 0.10 ≈ 0.28 kg.

So, with probability P = 0.95, we can say that the amount of losses over the entire field will be 0.5 ± 0.28 kg, i.e. from 0.22 to 0.78 kg per 4 m².
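The whole calculation for the potato example can be reproduced in a few lines of Python. This is an illustrative sketch that uses the table value t = 2.78 quoted above; note that Python's statistics.variance already divides by n − 1, i.e. it is the unbiased estimate:

```python
import math
from statistics import mean, variance  # variance() uses the n - 1 divisor

losses = [0.6, 0.2, 0.8, 0.4, 0.5]   # kg per 4 m^2 plot
n = len(losses)

x_bar = mean(losses)                 # 0.5 kg
s2 = variance(losses)                # unbiased variance estimate: 0.05
mu = math.sqrt(s2 / n)               # average error of the sample mean: 0.10 kg
t = 2.78                             # Student table, P = 0.95, k = 4 degrees of freedom
eps = t * mu                         # maximum sampling error: ~0.28 kg

print(f"{x_bar:.2f} +/- {eps:.2f} kg")  # interval from 0.22 to 0.78 kg
```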

As we can see from the example, the limits of random fluctuations with small samples are quite large and can be reduced by increasing the sample size and reducing the fluctuations (dispersion) of the characteristics.

If we had instead used the probability integral table (Appendix 2) to calculate the confidence limits of the general average, then t would equal 1.96 and ε = 1.96 × 0.10 ≈ 0.20 kg, i.e. the confidence interval would be narrower (0.30 to 0.70 kg).

Small samples, due to their small numbers, even with the most careful organization of observation, do not accurately reflect the indicators of the general population. Therefore, results from small samples are rarely used to establish reliable boundaries within which population characteristics lie.
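The section title also mentions the bootstrap: another way to gauge the sampling variability of a small sample is to resample it with replacement many times and examine the spread of the resampled means. A sketch on the same potato data (the percentile bounds here are illustrative, not the document's table-based calculation):

```python
import random
from statistics import mean

random.seed(0)                       # fixed seed for reproducibility
losses = [0.6, 0.2, 0.8, 0.4, 0.5]  # kg per 4 m^2 plot

# 10,000 bootstrap resamples of the same size, drawn with replacement
boot_means = sorted(mean(random.choices(losses, k=len(losses)))
                    for _ in range(10_000))

lo = boot_means[int(0.025 * len(boot_means))]   # 2.5th percentile
hi = boot_means[int(0.975 * len(boot_means))]   # 97.5th percentile
print(f"bootstrap 95% interval: {lo:.2f} to {hi:.2f} kg")
```

With only five observations the bootstrap interval is rough, which matches the text's warning about generalizing from small samples.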

Student's t-test is used primarily to test statistical hypotheses about the significance of differences between the indicators of two or more small samples (see Section 7).

In addition to the strictly random sample, with its clear probabilistic justification, other samples that are not completely random are also widely used, since strict application of purely random selection of units from the general population is not always possible in practice. Such samples include mechanical, typical, serial (or nested), multiphase sampling and a number of others.

It is rare for a population to be homogeneous; this is the exception rather than the rule. Therefore, when there are different types of phenomena in the population, it is often desirable to ensure a more even representation of the different types in the sample. This goal is successfully achieved by using typical sampling. The main difficulty is that we must have additional information about the entire population, which in some cases is difficult.

A typical sample is also called a layered or stratified sample; it is likewise used to represent different regions more uniformly in the sample, in which case the sample is called regionalized.

Thus, a typical sample is one in which the general population is divided into typical subgroups formed according to one or more essential characteristics (for example, the population may be divided into 3-4 subgroups by average per capita income, or by level of education: primary, secondary, higher, etc.). Units can then be selected from all the typical groups in several ways, forming:

a) a typical sample with uniform placement, where an equal number of units are selected from different types (layers). This scheme works well if in the population the layers (types) do not differ very much from each other in the number of units;

b) typical sampling with proportional placement, when it is required (as opposed to uniform placement) that the proportion (%) of selection for all strata be the same (for example, 5 or 10%);

c) a typical sample with optimal placement, when the degree of variation of characteristics in different groups of the general population is taken into account. With this placement, the proportion of selection for groups with large variability of the trait increases, which ultimately leads to a decrease in random error.
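Scheme (b), proportional placement, can be sketched with the standard library alone. The strata names and sizes below are hypothetical, chosen only to show that each stratum contributes the same share:

```python
import random

random.seed(42)

# hypothetical strata: education level -> unit ids in the general population
strata = {
    "primary":   list(range(0, 500)),
    "secondary": list(range(500, 1500)),
    "higher":    list(range(1500, 2000)),
}

fraction = 0.10  # the same 10% selection proportion in every stratum

sample = {name: random.sample(units, round(len(units) * fraction))
          for name, units in strata.items()}

for name, picked in sample.items():
    print(name, len(picked))  # primary 50, secondary 100, higher 50
```

Each stratum's sample size is proportional to its size in the population, so large and small types are represented in the same ratio as in the general population.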

The formula for the average error of a typical sample is similar to the error of a purely random sample, with the only difference that, instead of the total variance, the average of the particular within-group variances is used, which naturally reduces the error compared to a purely random sample. However, its use is not always possible (for many reasons). If great precision is not required, serial sampling is easier and cheaper.
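The claim about within-group variances can be verified directly: when the groups are formed by an essential trait, the average of the within-group variances is smaller than the total variance, so the typical-sample error is smaller. A sketch with made-up group data (the group names and values are hypothetical):

```python
import math
from statistics import mean, pvariance

# hypothetical typical groups (e.g. income strata) with similar units inside each
groups = {
    "low":    [12, 14, 11, 13],
    "middle": [20, 22, 19, 23],
    "high":   [35, 33, 36, 34],
}

n = sum(len(g) for g in groups.values())
avg_within = mean(pvariance(g) for g in groups.values())        # average of within-group variances
total_var = pvariance([x for g in groups.values() for x in g])  # total variance of all units

mu_typical = math.sqrt(avg_within / n)  # typical-sample error (repeated selection)
mu_random = math.sqrt(total_var / n)    # purely random sample error

print(mu_typical < mu_random)  # True: grouping by an essential trait reduces the error
```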

Serial (cluster) sampling consists in selecting for the sample not individual units of the population (for example, students) but whole series, or nests (for example, study groups). In other words, with serial (cluster) sampling the observation unit and the sampling unit do not coincide: certain groups of adjacent units (nests) are selected, and the units included in these nests are then examined. For example, when conducting a sample survey of housing conditions, we can randomly select a certain number of households (the sampling unit) and then find out the living conditions of the families living in them (the observation units).

Series (nests) consist of units connected to each other territorially (districts, cities, etc.), organizationally (enterprises, workshops, etc.) or in time (for example, the set of units of product produced over a given period).

Serial selection can be organized in the form of single-stage, two-stage or multi-stage selection.

Randomly selected series are subjected to continuous study. Serial sampling thus consists of two stages: random selection of series and continuous study of the units within them. Serial selection provides significant savings in manpower and resources and is therefore often used in practice. The error of serial selection differs from that of purely random selection in that the interseries (intergroup) variance is used instead of the total variance, and the number of series instead of the sample size. The accuracy is usually not very high, but in some cases it is acceptable. A serial sample can be repeated or non-repetitive, and the series can be of equal or unequal size.
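The two stages of serial selection (random choice of series, then continuous study of every unit in them) can be sketched as follows. The 40 study groups of 25 scores each are hypothetical data:

```python
import random
from statistics import mean

random.seed(7)

# hypothetical nests: 40 study groups, 25 scores in each
groups = {g: [random.gauss(70, 5) for _ in range(25)] for g in range(40)}

# stage 1: random selection of series (nests)
chosen = random.sample(sorted(groups), 4)

# stage 2: continuous study -- every unit of each chosen nest is observed
observed = [score for g in chosen for score in groups[g]]

print(len(observed))  # 4 series x 25 units = 100 observations
```

Note that the observation units (scores) were never sampled individually; only whole nests were drawn.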

Serial sampling can be organized according to different schemes. For example, a sample population can be formed in two stages: first, the series to be surveyed are selected in random order; then, from each selected series, a certain number of units are also selected in random order for direct observation (measuring, weighing, etc.). The error of such a sample depends both on the error of serial selection and on the error of individual selection; multi-stage selection, as a rule, gives less accurate results than single-stage selection, which is explained by representativeness errors arising at each stage of sampling. In this case, the sampling error formula for combined sampling must be used.

Another form of selection is multiphase selection (1, 2, 3 phases or stages). It differs in structure from multi-stage selection in that the same selection units are used in each phase. Errors in multiphase sampling are calculated separately for each phase. The main feature of a two-phase sample is that samples differ from each other according to three criteria: 1) the proportion of units studied in the first phase that are again included in the second and subsequent phases; 2) whether each sampling unit of the first phase retains an equal chance of again becoming an object of study; 3) the size of the interval separating the phases from each other.

Let us dwell on one more type of selection, mechanical (or systematic) selection. It is probably the most common, apparently because, of all the selection techniques, it is the simplest. In particular, it is much simpler than random selection, which requires the ability to use tables of random numbers, and it does not require additional information about the population and its structure. Moreover, mechanical selection is closely related to proportional stratified selection, which reduces the sampling error.

For example, the use of mechanical selection of members of a housing cooperative from a list compiled in the order of admission to this cooperative will ensure proportional representation of cooperative members with different lengths of experience. Using the same technique to select respondents from an alphabetical list of individuals ensures equal chances for surnames beginning with different letters, etc. The use of time sheets or other lists at enterprises or educational institutions, etc. can ensure the necessary proportionality in the representation of workers with different lengths of experience. Note that mechanical selection is widely used in sociology, in the study of public opinion, etc.
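Mechanical (systematic) selection from an ordered list takes every k-th unit, where the interval k is the reciprocal of the sample share. A sketch; the cooperative-member list ordered by admission date is hypothetical:

```python
def mechanical_sample(units, fraction, start=0):
    """Select every k-th unit; the interval k is the reciprocal of the sample share."""
    k = round(1 / fraction)  # 5% share -> every 20th unit, 2% -> every 50th
    return units[start::k]

# hypothetical list of cooperative members in order of admission
members = [f"member_{i}" for i in range(1, 2001)]

sample = mechanical_sample(members, 0.05)
print(len(sample))  # 100 units from a population of 2,000 at a 5% share
```

Because the list order itself carries structure (length of membership here), the every-k-th rule automatically yields proportional representation across that structure.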

In order to reduce the magnitude of the error and especially the costs of conducting a sampling study, various combinations of individual types of selection (mechanical, serial, individual, multiphase, etc.) are widely used. In such cases, more complex sampling errors should be calculated, which consist of errors that occur at different stages of the study.

A small sample is a population of fewer than 30 units. Small samples occur quite often in practice, for example when studying rare diseases or units possessing a rare trait; a small sample is also resorted to when the research is expensive or involves the destruction of products or specimens. Small samples are widely used in product quality surveys. The theoretical foundations for determining small-sample errors were laid by the English scientist W. Gosset (pseudonym Student).

It must be remembered that when determining the error for a small sample, the value (n − 1) should be taken instead of the sample size n; alternatively, before determining the average sampling error, the so-called corrected sample variance should be calculated (with (n − 1) in the denominator instead of n). Note that this correction is made only once: either when calculating the sample variance or when determining the error. The value (n − 1) is called the number of degrees of freedom. In addition, the normal distribution is replaced by the t-distribution (Student distribution), which is tabulated and depends on the number of degrees of freedom; the value (n − 1) is the Student distribution's only parameter. Let us emphasize once again that the (n − 1) correction is important and significant only for small sample populations; for n > 30 the difference approaches zero and disappears.

So far we have been talking about random samples, i.e. samples in which units are selected from the population at random (or almost at random) and all units have an equal (or almost equal) probability of being included. However, selection can also be based on the principle of non-random selection, where accessibility and purposefulness come to the fore. In such cases it is impossible to speak of the representativeness of the resulting sample, and errors of representativeness can be calculated only with information about the general population.

There are several known schemes for forming a non-random sample that have become widespread, mainly in sociological research: selection of available observation units, selection by the Nuremberg method, targeted sampling when identifying experts, etc. Quota sampling is also important: it is formed by the researcher according to a small number of significant parameters and gives a very close match to the general population. In other words, quota selection should provide the researcher with an almost complete coincidence of the sample and the general population according to the chosen parameters. This purposeful closeness of the two populations in a limited range of indicators is achieved, as a rule, with a sample of significantly smaller size than with random selection. It is this circumstance that makes quota selection attractive to a researcher who cannot rely on a large self-weighting random sample; the smaller sample size is most often also combined with lower monetary costs and less research time, which adds to the method's advantages. Note that quota sampling requires quite substantial preliminary information about the structure of the population, and that the selected characteristics (most often socio-demographic: gender, age, education) should correlate closely with the studied characteristics of the general population, i.e. with the object of research.

As already indicated, the sampling method makes it possible to obtain information about the general population with much less money, time and effort than with continuous observation. It is also clear that a complete study of the entire population is impossible in some cases, for example, when checking the quality of products, samples of which are destroyed.

At the same time, however, it should be pointed out that the population is not a complete "black box": we usually have some information about it. When conducting, for example, a sample study of the life, everyday conditions, property status, income and expenses of students, their opinions, interests, etc., we still have information about their total number and their grouping by gender, age, marital status, place of residence, course of study and other characteristics. This information is always used in sample research.

There are two main ways of extending sample characteristics to the general population: the method of direct recalculation and the method of correction factors. Recalculation of sample characteristics is carried out, as a rule, taking into account confidence intervals and can be expressed in absolute or relative values.

It is quite appropriate to emphasize here that most of the statistical information relating to the economic life of society in its most diverse manifestations and types is based on sample data. Of course, they are supplemented by complete registration data and information obtained as a result of censuses (of population, enterprises, etc.). For example, all budget statistics (on income and expenses of the population) provided by Rosstat are based on data from a sample study. Information on prices, production volumes, and trade volumes, expressed in the corresponding indices, is also largely based on sample data.

Statistical hypotheses and statistical tests. Basic Concepts

The concepts of statistical test and statistical hypothesis are closely related to sampling. A statistical hypothesis (as opposed to other scientific hypotheses) is an assumption about some properties of the population that can be tested using data from a random sample. It should be remembered that the result obtained is probabilistic in nature. Consequently, the result of the study, confirming the validity of the put forward hypothesis, can almost never serve as a basis for its final acceptance, and conversely, a result that is inconsistent with it is quite sufficient to reject the put forward hypothesis as erroneous or false. This is so because the result obtained can be consistent with other hypotheses, and not just with the one put forward.

A statistical criterion is understood as a set of rules that allow us to decide under which observation results the hypothesis is rejected and under which it is not. In other words, a statistical criterion is a kind of decision rule that, with a high degree of probability, ensures the acceptance of a true (correct) hypothesis and the rejection of a false one. Statistical tests can be one-sided or two-sided, parametric or non-parametric, more or less powerful. Some criteria are used frequently, others less often; some are intended for special problems, while others can be applied to a wide class of problems. These criteria are widespread in sociology, economics, psychology, the natural sciences, etc.

Let us introduce some basic concepts of statistical hypothesis testing. Testing begins with a null hypothesis H₀, i.e. some assumption of the researcher, together with a competing, alternative hypothesis H₁ that contradicts it. For example: H₀: x̄ = A, H₁: x̄ ≠ A, or H₀: x̄ = A, H₁: x̄ > A (where A is the general average).

The main goal of the researcher when testing a hypothesis is to reject the hypothesis put forward; as R. Fisher wrote, the purpose of testing any hypothesis is to reject it. Hypothesis testing is based on contradiction. Thus, if we believe that, for example, the average wage of workers obtained from a particular sample, equal to 186 monetary units per month, does not coincide with the actual wage for the entire population, then the null hypothesis accepted for testing is that these wages are equal.

The competing hypothesis H₁ can be formulated in different ways:

H₁: x̄ ≠ A;  H₁: x̄ > A;  H₁: x̄ < A.

Next, the Type I error (α) is determined: the probability that a true hypothesis will be rejected. Obviously, this probability should be small (usually from 0.01 to 0.1; most often the default is 0.05, the so-called 5% significance level). These levels arise from the sampling method, according to which a twofold or threefold error marks the limits beyond which random variation in sample characteristics usually does not extend. The Type II error (β) is the probability that an incorrect hypothesis will be accepted. As a rule, a Type I error is considered more "dangerous", and it is this error that the statistician fixes in advance. If at the beginning of the study we want to fix α and β simultaneously (for example, α = 0.05, β = 0.1), then we must first calculate the required sample size.

The critical zone (or region) is the set of criterion values for which H₀ is rejected. The critical point T_cr is the point separating the region of acceptance of the hypothesis from the region of rejection, the critical zone.

As already mentioned, the Type I error (α) is the probability of rejecting a correct hypothesis. The smaller α is, the less likely a Type I error becomes. But at the same time, when α decreases (for example, from 0.05 to 0.01), it becomes harder to reject the null hypothesis. Reducing α still further would in effect mean that all hypotheses, true and false, fall within the region of acceptance of the null hypothesis, making it impossible to distinguish between them.

A Type II error (β) occurs when H₀ is accepted although the alternative hypothesis H₁ is actually true. The value γ = 1 − β is called the power of the criterion. The Type II error (i.e. incorrectly accepting a false hypothesis) decreases as the sample size and the significance level increase. It follows that α and β cannot both be reduced at the same time; this can be achieved only by increasing the sample size (which is not always possible).

Most often, hypothesis-testing tasks come down to comparing two sample means or proportions; comparing the general average (or share) with a sample one; comparing empirical and theoretical distributions (goodness-of-fit criteria such as χ²); comparing two sample variances (the F-criterion); comparing two sample correlation or regression coefficients; and some other comparisons.

The decision to accept or reject the null hypothesis consists of comparing the actual value of the criterion with the tabulated (theoretical) value. If the actual value is less than the tabulated value, then it is concluded that the discrepancy is random and insignificant and the null hypothesis cannot be rejected. The opposite situation (the actual value is greater than the tabulated value) leads to the rejection of the null hypothesis.
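The decision rule in this paragraph can be sketched for the one-sample case. The H₀ value of 0.4 kg is hypothetical, and t = 2.78 is the table value quoted earlier for α = 0.05 and 4 degrees of freedom:

```python
import math
from statistics import mean, variance

sample = [0.6, 0.2, 0.8, 0.4, 0.5]  # the small-sample data from earlier
a0 = 0.4                            # hypothetical H0: the general average equals 0.4 kg

n = len(sample)
t_actual = (mean(sample) - a0) / math.sqrt(variance(sample) / n)
t_table = 2.78                      # Student table, alpha = 0.05, k = n - 1 = 4

# actual value below the tabulated value -> discrepancy is random
if abs(t_actual) < t_table:
    print("discrepancy is random and insignificant: H0 is not rejected")
else:
    print("H0 is rejected")
```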

When testing statistical hypotheses, tables of the normal distribution, the χ² distribution (read: chi-square), the t-distribution (Student distribution) and the F-distribution (Fisher distribution) are most often used.

When controlling the quality of goods in economic research, an experiment may be conducted on the basis of a small sample. A small sample is a non-continuous statistical survey in which the sample population is formed from a relatively small number of units of the general population. The volume of a small sample usually does not exceed 30 units and may be as small as 4-5 units. The average error of a small sample is calculated by the formula μ = √(s²/n), where s² is the variance of the small sample; when determining this variance, the number of degrees of freedom is n − 1: s² = Σ(xᵢ − x̄)²/(n − 1). The marginal error of a small sample is determined by the formula Δ = t·μ. Here the value of the confidence coefficient t depends not only on the given confidence probability but also on the number of sample units n. For individual values of t and n, the confidence probability of a small sample is determined from special Student tables (Table 9.1), which give the distribution of standardized deviations. Since in practice a confidence probability of 0.95 or 0.99 is usually adopted for a small sample, the corresponding readings of the Student distribution are used to determine its marginal error.

Ways of generalizing sample characteristics to the population. The sampling method is most often used to obtain characteristics of the population from the corresponding sample indicators. Depending on the purposes of the research, this is done either by direct recalculation of the sample indicators for the general population, or by calculating correction factors.

The method of direct recalculation consists in extending the sample share or average to the general population, taking into account the sampling error. In trade, for example, the number of non-standard products received in a consignment is determined this way: taking into account the accepted degree of probability, the share of non-standard products in the sample is multiplied by the number of products in the entire batch of goods.

The method of correction factors is used when the purpose of sampling is to refine the results of a complete census. In statistical practice this method is used to refine data from annual censuses of livestock owned by the population: after the data of the complete census are summarized, a 10% sample survey is used to determine the so-called "percentage of undercounting".

Methods of selecting units from the general population. In statistics, various methods of forming sample populations are used, determined by the objectives of the study and the specifics of the object of study. The main condition for conducting a sample survey is the prevention of systematic errors arising from violation of the principle of equal opportunity for each unit of the general population to be included in the sample. Prevention of systematic errors is achieved through the use of scientifically based methods of forming the sample population.
There are the following methods of selecting units from the general population: 1) individual selection, in which individual units are selected for the sample; 2) group selection, in which qualitatively homogeneous groups or series of the studied units are included in the sample; 3) combined selection, a combination of individual and group selection. Selection methods are determined by the rules for forming the sample population. A sample can be: random; mechanical; typical; serial; combined.

Properly random sampling means that the sample population is formed by random (unintentional) selection of individual units from the general population; the number of units selected is usually determined by the accepted sample proportion. The sample share is the ratio of the number of units in the sample population n to the number of units in the general population N, i.e. n/N. Thus, with a 5% sample from a batch of 2,000 units, the sample size n is 100 units (2,000 × 5/100), and with a 20% sample it is 400 units (2,000 × 20/100), etc.

Mechanical sampling means that units are selected from a general population divided into equal intervals (groups), the size of the interval being the reciprocal of the sample share. Thus, with a 2% sample every 50th unit is selected (1 : 0.02), with a 5% sample every 20th unit (1 : 0.05), etc. In accordance with the accepted proportion of selection, the general population is thus mechanically divided into equal groups, and only one unit is selected for the sample from each group. An important feature of mechanical sampling is that the sample population can be formed without compiling lists: in practice, the order in which the units of the population are actually arranged is often used.
For example, the sequence in which finished products leave a conveyor or production line, or the order in which units of a batch of goods are placed during storage, transportation and sale, may be used.

Typical sample. In typical sampling, the population is first divided into homogeneous typical groups; then units are selected individually from each typical group into the sample population by purely random or mechanical sampling. Typical sampling is usually used when studying complex statistical populations, for example in a sample survey of the labor productivity of trade workers consisting of separate groups by qualification. An important feature of the typical sample is that it gives more accurate results than other methods of selecting units. The average error of a typical sample is determined by the formulas: for repeated selection, μ = √(σ̄ᵢ²/n); for non-repetitive selection, μ = √((σ̄ᵢ²/n)(1 − n/N)), where σ̄ᵢ² is the average of the within-group variances.

In a single-stage sample, each selected unit is immediately studied according to the given characteristic, as in purely random and serial sampling. In a multi-stage sample, groups are first selected from the general population, and individual units are then selected from the groups; this is how a typical sample with mechanical selection of units is made. A combined sample can be two-stage: the population is first divided into groups, then the groups are selected, and within them the individual units.
  • 6. Types of statistical groupings, their cognitive significance.
  • 7.Statistical tables: types, construction rules, reading techniques
  • 8.Absolute quantities: types, cognitive significance. Conditions for the scientific use of absolute and relative indicators.
  • 9. Average values: content, types, types, scientific conditions of application.
  • 11.Dispersion properties. The rule for adding (decomposing) variance and its use in statistical analysis.
  • 12.Types of statistical graphs according to the content of the problems being solved and methods of construction.
  • 13. Dynamic series: types, analysis indicators.
  • 14. Methods for identifying trends in time series.
  • 15. Indices: definition, main elements of indices, problems solved with the help of indices, index system in statistics.
  • 16. Rules for constructing dynamic and territorial indices.
  • 17. Fundamentals of the theory of the sampling method.
  • 18. Small sample theory.
  • 19. Methods for selecting units in the sample population.
  • 20.Types of connections, statistical methods for analyzing relationships, the concept of correlation.
  • 21. Contents of correlation analysis, correlation models.
  • 22.Assessment of the strength (closeness) of the correlation connection.
  • 23. System of indicators of socio-economic statistics.
  • 24. Basic groupings and classifications in socio-economic statistics.
  • 25. National wealth: category content and composition.
  • 26. Contents of the land cadastre. Indicators of land composition by type of ownership, intended purpose and type of land.
  • 27. Classification of fixed assets, methods of evaluation and revaluation, indicators of movement, condition and use.
  • 28. Objectives of labor statistics. The concept and content of the main categories of the labor market.
  • 29. Statistics on the use of labor and working time.
  • 30. Labor productivity indicators and methods of analysis.
  • 31. Indicators of crop production and agricultural yields. Crops and lands.
  • 32. Indicators of livestock production and productivity of farm animals.
  • 33. Statistics of public costs and production costs.
  • 34. Statistics of wages and labor costs.
  • 35. Statistics of gross output and income.
  • 36. Indicators of movement and sales of agricultural products.
  • 37. Tasks of statistical analysis of agricultural enterprises.
  • 38. Statistics of prices and goods in sectors of the national economy: tasks and methods of analysis.
  • 39. Statistics of the market of goods and services.
  • 40. Statistics of social production indicators.
  • 41. Statistical analysis of consumer market prices.
  • 42. Inflation statistics and main indicators of its assessment.
  • 43. Tasks of financial statistics of enterprises.
  • 44. Main indicators of financial results of enterprises.
  • 45. Tasks of state budget statistics.
  • 46. System of indicators of state budget statistics.
  • 47. System of indicators of monetary circulation statistics.
  • 48. Statistics of the composition and structure of the money supply in the country.
  • 49. The main tasks of banking statistics.
  • 50. Main indicators of banking statistics.
  • 51. Concept and classification of credit. Objectives of its statistical study.
  • 52. System of credit statistics indicators.
  • 53. Basic indicators and methods of analysis of savings business.
  • 54. Tasks of statistics of the stock market and securities.
  • 56. Statistics of commodity exchanges: objectives and system of indicators.
  • 57. System of national accounts: concepts, main categories and classification.
  • 58. Basic principles of constructing the SNA (system of national accounts).
  • 59. Main macroeconomic indicators – content, methods of determination.
  • 60. Inter-industry balance: concepts, tasks, types of inter-industry balances.
  • 62. Statistics of income and expenses of the population.
  • 18. Small sample theory.

    With a large number of units in the sample population (n>100), the distribution of random errors of the sample mean in accordance with A.M. Lyapunov’s theorem is normal or approaches normal as the number of observations increases.
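This convergence can be sketched numerically. The simulation below is illustrative only (the exponential population, the sample size and the seed are assumptions of the sketch, not data from the text): it draws many samples of n = 100 from a clearly non-normal population and checks that the sample means cluster around the population mean with standard error σ/√n = 0.1.

```python
import random
import statistics

random.seed(0)  # fixed only to make the sketch reproducible

# Population model: exponential with mean 1 and sigma = 1 (clearly non-normal).
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Distribution of the sample mean for a "large" sample, n = 100.
means = [sample_mean(100) for _ in range(2000)]

# Per Lyapunov's theorem the means are approximately normal around the
# population mean 1.0 with standard error sigma / sqrt(n) = 0.1.
print(round(statistics.fmean(means), 2))  # close to 1.0
print(round(statistics.stdev(means), 2))  # close to 0.1
```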

    However, in the practice of statistical research in a market economy, one increasingly has to deal with small samples.

    A small sample is a sample observation whose number of units does not exceed 30.

    When assessing the results of a small sample, the size of the general population is not used. To determine the possible error limits, Student's t-test is used.

    The value of σ is calculated from the sample observation data.

    It characterizes only the sample under study and is not used as an approximate estimate of σ in the general population.

    The probabilistic assessment of the results of a small sample differs from the assessment in a large sample in that with a small number of observations, the probability distribution for the average depends on the number of selected units.

    However, for a small sample, the value of the confidence coefficient t is related to the probability assessment differently than for a large sample (since the distribution law differs from normal).

    According to the distribution law established by Student, the probable error depends both on the value of the confidence coefficient t and on the sample size n.

    The average error of a small sample is calculated using the formula:

    μ = √(S² / n),

    where S² is the variance of the small sample.

    In a small sample the correction factor n/(n − 1) must be taken into account: S² = σ² · n/(n − 1). When determining the variance S², the number of degrees of freedom is equal to k = n − 1.
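A minimal sketch of this calculation (the eight observations are hypothetical, chosen only for illustration): the sum of squared deviations is divided by k = n − 1, and the average error is the square root of S²/n.

```python
import math

# Hypothetical small sample, n = 8 observations.
x = [12.0, 15.0, 11.0, 13.0, 14.0, 12.0, 16.0, 13.0]
n = len(x)
mean = sum(x) / n                                  # 13.25

# Unbiased small-sample variance: divide by k = n - 1 degrees of freedom.
s2 = sum((v - mean) ** 2 for v in x) / (n - 1)     # 19.5 / 7 ≈ 2.786

# Average error of the small-sample mean: mu = sqrt(S^2 / n).
mu = math.sqrt(s2 / n)

print(round(mean, 2), round(s2, 3), round(mu, 2))  # → 13.25 2.786 0.59
```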

    The marginal error of a small sample is determined by the formula Δ = t · μ.

    In this case, the value of the confidence coefficient t depends not only on the given confidence probability, but also on the number of sampling units n. For individual values of t and n, the confidence probability of a small sample is determined from special Student tables, which give the distribution of the standardized deviation t = (x̄ − X̄)/μ, where x̄ is the sample mean and X̄ is the general mean.
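A sketch with hypothetical numbers: a small sample of n = 8 (k = 7 degrees of freedom) with unbiased variance S² = 19.5/7 ≈ 2.786; the table value t = 2.365 for confidence probability 0.95 and k = 7 is taken from standard Student tables (an assumption of this sketch, not a value from the text).

```python
import math

n = 8                   # hypothetical small sample size
s2 = 19.5 / 7           # unbiased small-sample variance (hypothetical)
mu = math.sqrt(s2 / n)  # average error of the small sample

# Student table value for P = 0.95 and k = n - 1 = 7 (assumed from
# standard two-sided tables).
t = 2.365

delta = t * mu          # marginal error of the small sample
print(round(delta, 2))  # → 1.4
```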


    19. Methods for selecting units in the sample population.

    A sample population must meet the following basic requirements:

    1. The sample population must be large enough in size.

    2. The structure of the sample population should reflect the structure of the general population as closely as possible.

    3. The selection method must be random.

    Depending on whether a selected unit participates in further selection, a distinction is made between the non-repetitive and repeated methods.

    Non-repetitive selection is a selection in which a unit included in the sample does not return to the population from which further selection is carried out.

    Calculation of the average error of a non-repetitive random sample: μ = √((σ² / n) · (1 − n/N)).

    Calculation of the maximum error of a non-repetitive random sample: Δ = t · √((σ² / n) · (1 − n/N)).
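As a sketch with assumed numbers (population N = 1000, sample n = 100, sample variance σ² = 25, confidence coefficient t = 2 — none of these come from the text), the errors of non-repetitive selection can be computed as:

```python
import math

# Assumed illustrative values: population N = 1000, sample n = 100,
# sample variance sigma2 = 25, confidence coefficient t = 2.
N, n, sigma2, t = 1000, 100, 25.0, 2

# Average error of a non-repetitive random sample: the factor
# (1 - n/N) is the finite-population correction.
mu = math.sqrt(sigma2 / n * (1 - n / N))

# Maximum (marginal) error: delta = t * mu.
delta = t * mu

print(round(mu, 3), round(delta, 3))  # → 0.474 0.949
```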

    During repeated selection, the unit included in the sample, after recording the observed characteristics, is returned to the original (general) population to participate in the further selection procedure.

    The average error of repeated simple random sampling is calculated as follows: μ = √(σ² / n).

    Calculation of the maximum error of repeated random sampling: Δ = t · √(σ² / n).
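A sketch with assumed numbers (n = 100, σ² = 25, t = 2): for repeated selection there is no finite-population correction, since each unit is returned to the population before the next draw.

```python
import math

# Assumed illustrative values: sample n = 100 drawn with replacement,
# sample variance sigma2 = 25, confidence coefficient t = 2.
n, sigma2, t = 100, 25.0, 2

# Average error of repeated simple random sampling: mu = sqrt(sigma2 / n).
mu = math.sqrt(sigma2 / n)

# Maximum error: delta = t * mu.
delta = t * mu

print(mu, delta)  # → 0.5 1.0
```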

    By the type of formation of the sample population, selection is divided into individual, group and combined.

    The selection method determines the specific mechanism for selecting units from the general population; it is divided into purely random, mechanical, typical, serial and combined.

    Purely random selection is the most common method of selection in a random sample; it is also called the drawing of lots. A ticket with a serial number is prepared for each unit of the statistical population, and the required number of units is then selected at random. Under these conditions, each unit has the same probability of being included in the sample.
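The drawing of lots can be sketched in a few lines (the population of 100 serial numbers and the seed are assumptions of the sketch):

```python
import random

random.seed(42)  # fixed only to make the sketch reproducible

# Each unit of the statistical population gets a "ticket" with a
# serial number; 10 tickets are then drawn at random, each unit
# having the same probability of being included in the sample.
population = list(range(1, 101))        # serial numbers 1..100
sample = random.sample(population, 10)  # selection without replacement

print(sorted(sample))
```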

    Mechanical sampling. It is used in cases where the general population is ordered in some way, i.e. there is a certain sequence in the arrangement of units.

    To determine the average error of mechanical sampling, the formula for the average error in actual random non-repetitive sampling is used.
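Mechanical (systematic) selection can be sketched as taking every k-th unit of the ordered population (N = 100, n = 10 and the starting unit are assumptions of the sketch):

```python
# Assumed ordered population of N = 100 units and a desired sample of
# n = 10, so the selection interval is k = N // n = 10.
N, n = 100, 10
k = N // n

# The starting unit is chosen at random within the first interval;
# here it is fixed at 3 for reproducibility.
start = 3

sample = list(range(start, N + 1, k))  # units 3, 13, 23, ..., 93
print(len(sample))                     # → 10
```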

    Typical selection. It is used when all units in the general population can be divided into several typical groups. Typical selection involves selecting units from each group in a purely random or mechanical way.

    For a typical sample, the standard error depends on the accuracy of the group means. Thus, in the formula for the maximum error of a typical sample, the average of the within-group variances is used, i.e. σ̄²i = Σ σ²i·ni / Σ ni.
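A sketch of the typical-sample error with assumed numbers (three typical groups with within-group variances 4, 9 and 5, equal group sizes, total n = 90, repeated selection): only the average of the within-group variances enters the error, which is why a typical sample is more precise than a purely random one of the same size.

```python
import math

# Assumed within-group (typical-group) variances and group sample sizes.
group_vars = [4.0, 9.0, 5.0]
group_sizes = [30, 30, 30]
n = sum(group_sizes)  # total sample size, 90

# Average of the within-group variances, weighted by group size.
avg_within = sum(v * m for v, m in zip(group_vars, group_sizes)) / n  # 6.0

# Average error of the typical sample (repeated selection).
mu = math.sqrt(avg_within / n)

print(round(mu, 3))  # → 0.258
```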

    Serial selection. It is used in cases where population units are combined into small groups or series. The essence of serial sampling lies in the actual random or mechanical selection of series, within which a continuous examination of units is carried out.

    With serial sampling, the magnitude of the sampling error depends not on the number of units studied, but on the number of surveyed series (s) and on the magnitude of the intergroup dispersion δ²: μ = √((δ² / s) · (1 − s/S)), where S is the total number of series in the general population.
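A sketch of the serial-sample error with assumed numbers (s = 6 series surveyed out of S = 30 in the population, intergroup dispersion δ² = 1.44, non-repetitive selection of series):

```python
import math

# Assumed values: s = 6 series surveyed out of S = 30 in the general
# population, between-series (intergroup) dispersion delta2 = 1.44.
s, S, delta2 = 6, 30, 1.44

# The error depends on the number of series, not on the number of
# units; (1 - s/S) corrects for non-repetitive selection of series.
mu = math.sqrt(delta2 / s * (1 - s / S))

print(round(mu, 3))  # → 0.438
```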

    Combined selection may go through one or more stages. A sample is called single-stage if the units of the population, once selected, are studied directly.

    A sample is called multi-stage if the selection proceeds in successive stages, and each stage of selection has its own unit of selection.

    "

    In the practice of statistical research one often encounters small samples, which contain fewer than 30 units. Large samples usually mean samples of more than 100 units.

    Usually small samples are used in cases where it is impossible or impractical to use a large sample. One has to deal with such samples, for example, when surveying tourists and hotel visitors.

    The magnitude of the error of a small sample is determined using formulas that differ from those for a relatively large sample (n > 100).

    With a small sample size n, the relationship between the sample and population variance should be taken into account: S² = σ² · n/(n − 1).

    Since in a small sample the fraction n/(n − 1) is significant, the variance is calculated taking into account the so-called number of degrees of freedom k = n − 1. It is understood as the number of options that can take arbitrary values without changing the value of the average.

    The average error of a small sample is determined by the formula: μ = √(S² / n).

    The maximum sampling error for the mean and the proportion is found similarly to the case of a large sample: Δ = t · μ,

    where t is the confidence coefficient, depending on the given level of significance and the number of degrees of freedom (Appendix 5).

    The values of the coefficient t depend not only on the given confidence probability, but also on the sample size n. For individual values of t and n, the confidence probability is determined from the Student distribution table, which contains the distribution of standardized deviations.

    Comment. As the sample size increases, the Student distribution approaches the normal distribution: at n = 20 it already differs little from the normal distribution. When conducting small sample surveys, it should be taken into account that the smaller the sample size n, the greater the difference between the Student distribution and the normal distribution. For example, at n = 4 this difference is quite significant, which indicates a decrease in the accuracy of the results of a small sample.
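The widening of the interval for small n can be seen by comparing two-sided 95% Student table values (the specific values are taken from standard tables, an assumption of this sketch) with the normal quantile 1.96:

```python
# Two-sided 95% Student table values for k = n - 1 degrees of freedom
# (from standard tables) versus the normal quantile 1.96.
t_table = {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045}
z = 1.96

for k in sorted(t_table):
    # The smaller the sample, the more t exceeds the normal value,
    # i.e. the wider the confidence interval must be taken;
    # the ratio t/z falls toward 1 as k grows.
    print(k, t_table[k], round(t_table[k] / z, 2))
```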
