Statistical significance: definition, concept, hypothesis testing. The level of statistical significance (p)

Statistical reliability is essential in the computational practice of physical culture and sports (FCS). As noted earlier, many samples can be drawn from the same general population:

If they are drawn correctly, their mean indicators differ from those of the general population only within the representativeness error, at the accepted confidence level;

If they are drawn from different general populations, the difference between them turns out to be significant. Comparing samples is what statistics is all about;

If the samples differ insignificantly and non-fundamentally, i.e. they actually belong to the same general population, the difference between them is called statistically unreliable.

A statistically reliable (significant) difference between samples is one in which the samples differ substantially and fundamentally, that is, they belong to different general populations.

In FCS practice, assessing the statistical significance of differences between samples means solving many practical problems. For example, the introduction of new teaching methods, programs, sets of exercises, tests and control exercises involves their experimental verification, which should show that the test group differs fundamentally from the control group. For this, special statistical methods, called criteria of statistical significance, are used to detect the presence or absence of a statistically significant difference between samples.

All criteria are divided into two groups: parametric and non-parametric. Parametric criteria require a normal distribution law, i.e. the mandatory determination of the main indicators of the normal law: the arithmetic mean and the standard deviation s. Parametric criteria are the most accurate and correct. Non-parametric criteria are based on rank (ordinal) differences between sample elements.

Here are the main criteria of statistical significance used in FCS practice: the Student's test and the Fisher test.

The Student's t-test is named after the English scientist W. Gosset (who published under the pseudonym "Student"), the discoverer of the method. The Student's test is parametric and is used to compare samples by their mean values. The samples may differ in size.

The Student's t-test is computed in the following sequence.

1. Find the Student's criterion t using the formula

t = |x̄1 − x̄2| / √(m1² + m2²),

where x̄1, x̄2 are the arithmetic means of the compared samples, and m1, m2 are the representativeness errors computed from the indicators of the compared samples.

2. FCS practice has shown that for sports work it is sufficient to accept a confidence level of P = 0.95.

For the confidence level P = 0.95 (α = 0.05), with the number of degrees of freedom k = n 1 + n 2 - 2, we find the critical value of the criterion (t gr) from the table in Appendix 4.

3. Based on the properties of the normal distribution law, compare t with t gr.

We draw conclusions:

if t > t gr, then the difference between the compared samples is statistically significant;

if t < t gr, then the difference is statistically insignificant.
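As an illustration, the three steps above can be sketched in Python (the sample figures below are invented, and the critical value t_gr is assumed to be read from a table such as Appendix 4):

```python
from math import sqrt

def student_criterion(mean1, s1, n1, mean2, s2, n2, t_gr):
    """Student's t-test for two sample means, as described above."""
    m1 = s1 / sqrt(n1)  # representativeness error of sample 1
    m2 = s2 / sqrt(n2)  # representativeness error of sample 2
    t = abs(mean1 - mean2) / sqrt(m1 ** 2 + m2 ** 2)
    return t, t > t_gr  # significant if t exceeds the critical value

# hypothetical samples: k = 18 + 18 - 2 = 34 degrees of freedom, P = 0.95 -> t_gr = 2.02
t, significant = student_criterion(130.0, 3.0, 18, 142.0, 3.0, 18, t_gr=2.02)
print(round(t, 2), significant)  # 12.0 True
```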

For researchers in the field of FCS, assessing statistical significance is the first step in solving a specific problem: whether the samples being compared are fundamentally or non-fundamentally different from each other. The next step is to evaluate this difference from a pedagogical point of view, which is determined by the conditions of the task.

Let's consider the application of the Student test using a specific example.

Example 2.14. In a group of 18 subjects, heart rate (beats/min) was measured before (x i) and after (y i) a warm-up.

Assess the effectiveness of the warm-up from the heart-rate data. Initial data and calculations are presented in Tables 2.30 and 2.31.

Table 2.30

Processing heart rate indicators before warming up


The errors for both groups coincided, since the sample sizes are equal (the same group is studied under different conditions), and the standard deviations were s x = s y = 3 beats/min. Let us now compute the Student's criterion:

We set the confidence level: P = 0.95.

The number of degrees of freedom is k = n 1 + n 2 - 2 = 18 + 18 - 2 = 34. From the table in Appendix 4 we find t gr = 2.02.

Statistical inference. Since t = 11.62 and the critical value t gr = 2.02, we have 11.62 > 2.02, i.e. t > t gr; therefore the difference between the samples is statistically significant.

Pedagogical conclusion. It was found that in terms of heart rate the difference between the state of the group before and after the warm-up is statistically reliable, i.e. significant and fundamental. Thus, based on the heart-rate indicator, we can conclude that the warm-up is effective.
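A quick arithmetic check of this example: with s x = s y = 3 beats/min and n = 18, the combined error of the difference of means is exactly 1 beat/min, so t coincides numerically with the difference between the mean heart rates (the means themselves are not reproduced above, so t = 11.62 is taken from the example):

```python
from math import sqrt

n = 18
s = 3.0                           # s_x = s_y = 3 beats/min
m = s / sqrt(n)                   # representativeness error of each sample
combined = sqrt(m ** 2 + m ** 2)  # error of the difference of the two means
print(round(combined, 9))         # 1.0

t = 11.62      # criterion value from example 2.14
t_gr = 2.02    # critical value for k = 34, P = 0.95
print(t > t_gr)  # True -> the difference is statistically significant
```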

The Fisher criterion is parametric. It is used to compare sample variances. In FCS practice this usually means comparing the stability of sports results or the stability of functional and technical indicators. The samples may be of different sizes.

The Fisher criterion is defined in the following sequence.

1. Find the Fisher criterion F using the formula

F = s1² / s2²,

where s1², s2² are the variances of the compared samples.

The conditions of the Fisher criterion stipulate that the larger variance is placed in the numerator of the formula for F, so the number F is always greater than one.

We set the confidence level P = 0.95 and determine the number of degrees of freedom for both samples: k 1 = n 1 - 1, k 2 = n 2 - 1.

Using the table in Appendix 4, we find the critical value of the criterion, F gr.

Comparing F with F gr allows us to formulate conclusions:

if F > F gr, then the difference between the samples is statistically significant;

if F < F gr, then the difference between the samples is statistically insignificant.
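A minimal sketch of this procedure (the variances below are invented, and F_gr is assumed to come from the appendix table):

```python
def fisher_criterion(var1, var2, f_gr):
    """Fisher's F-test: the larger variance goes in the numerator, so F >= 1."""
    f = max(var1, var2) / min(var1, var2)
    return f, f > f_gr

# hypothetical variances of two samples; F_gr = 2.4 from the table
f, significant = fisher_criterion(0.025, 0.010, f_gr=2.4)
print(round(f, 2), significant)  # 2.5 True
```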

Let's give a specific example.

Example 2.15. Let us analyze two groups of handball players: x i (n 1 = 16 people) and y i (n 2 = 18 people). In both groups the take-off time (s) in throws at the goal was measured.

Are the take-off indicators homogeneous?

Initial data and basic calculations are presented in Tables 2.32 and 2.33.

Table 2.32

Processing of the take-off indicators of the first group of handball players


Let us define the Fisher criterion:





From the table in Appendix 6 we find F gr = 2.4.

Note that the table in Appendix 6 lists the degrees of freedom of both the larger and the smaller variance, and the grid becomes coarser as the numbers grow. Thus, the degrees of freedom of the larger variance run in the order 8, 9, 10, 11, 12, 14, 16, 20, 24, etc., and those of the smaller variance 28, 29, 30, 40, 50, etc.

This is explained by the fact that as the sample size increases, the differences in the F-test decrease, so tabular values close to the original data can be used. Thus, in example 2.15 the value k = 17 is absent from the table, and we can take the closest value k = 16, from which we obtain F gr = 2.4.

Statistical inference. Since Fisher's criterion F = 2.5 > F gr = 2.4, the difference between the samples is statistically significant.

Pedagogical conclusion. The take-off times (s) in throws at the goal differ significantly between the handball players of the two groups. These groups should be considered different.

Further research should reveal the reason for this difference.

Example 2.20 (on the statistical reliability of samples). Has a football player's skill improved if the time (s) from the signal to the kick on the ball was x i at the beginning of training and y i at the end?

Initial data and basic calculations are given in Tables 2.40 and 2.41.

Table 2.40

Processing of the time indicators (from the signal to the kick) at the beginning of training


Let us determine the difference between groups of indicators using the Student’s criterion:

With confidence level P = 0.95 and degrees of freedom k = n 1 + n 2 - 2 = 22 + 22 - 2 = 42, using the table in Appendix 4 we find t gr = 2.02. Since t = 8.3 > t gr = 2.02, the difference is statistically significant.

Let us determine the difference between groups of indicators using Fisher’s criterion:


According to the table in Appendix 2, with confidence level P = 0.95 and degrees of freedom k = 22 - 1 = 21, the value F gr = 2.1. Since F = 1.53 < F gr = 2.1, the difference in the dispersion of the initial data is statistically insignificant.

Statistical inference. In terms of the arithmetic mean, the difference between the groups of indicators is statistically significant. In terms of dispersion (variance), the difference between the groups of indicators is statistically insignificant.

Pedagogical conclusion. The football player's qualifications have improved significantly, but attention should be paid to the stability of his results.

Preparing for work

Before this laboratory work in the discipline "Sports Metrology" is carried out, all students of the study group must form work teams of 3-4 students each, to jointly complete the work assignments of all the laboratory works.

In preparation for the work, read the relevant sections of the recommended literature (see section 6 of these methodological instructions) and the lecture notes. Study sections 1 and 2 of this laboratory work, as well as its work assignment (section 4).

Prepare a report form on standard sheets of A4 size writing paper and fill it with the materials necessary for the work.

The report must contain:

Title page indicating the department (UC and TR), study group, last name, first name and patronymic of the student, the number and title of the laboratory work, the date of its completion, as well as the last name, academic degree, academic title and position of the teacher accepting the work;

Goal of the work;

Formulas with numerical values ​​explaining intermediate and final results of calculations;

Tables of measured and calculated values;

The graphic material required by the assignment;

Brief conclusions on the results of each stage of the work assignment and on the work performed in general.

All graphs and tables are drawn carefully using drawing tools. Conventional graphic and letter symbols must comply with GOST standards. The report may be prepared using a computer.

Work assignment

Before taking any measurements, each member of the team must study the rules of the sports game Darts given in Appendix 7, which are needed to carry out the following stages of the research.

Stage I of research: "Testing the results of hits on the target in the sports game Darts by each team member for compliance with the normal distribution law using Pearson's χ² criterion and the three-sigma criterion"

1. Measure (test) your personal speed and coordination of actions by throwing darts 30-40 times at the circular target in the sports game Darts.

2. Arrange the measurement (test) results x i (in points) as a variation series and enter them in Table 4.1; complete all the necessary calculations, fill in the necessary tables and draw the appropriate conclusions on whether the resulting empirical distribution complies with the normal distribution law, by analogy with the similar calculations, tables and conclusions of example 2.12 given in section 2 of these guidelines on pages 7-10.

Table 4.1

Correspondence of the speed and coordination of the subjects’ actions to the normal distribution law


Stage II of research

"Assessment of the average indicators of the general population of hits on the Darts target by all students of the study group, based on the measurement results of the members of one team"

Assess the average indicators of speed and coordination of actions of all students in the study group (according to the study-group list in the class register) based on the results of hits on the Darts target by all team members, obtained at Stage I of this laboratory work.

1. Record the results of the measurements of speed and coordination of actions when throwing darts at the circular target in the sports game Darts for all members of your team (2-4 people), who represent a sample of measurement results from the general population (the measurement results of all students in the study group - for example, 15 people), entering them in the second and third columns of Table 4.2.

Table 4.2

Processing indicators of speed and coordination of actions

brigade members


In Table 4.2, the entries should be understood as the matched average scores (see the calculation results in Table 4.1) of the members of your team obtained at Stage I of the research. Note that Table 4.2 usually contains the calculated average value of the measurement results obtained by each individual team member at Stage I, since the likelihood that the measurement results of different team members coincide is very small. In that case, as a rule, the frequency value in each row of Table 4.2 equals 1, and the number of members of your team is written in the "Total" row of that column.

2. Perform all the calculations necessary to fill out Table 4.2, as well as the other calculations and conclusions, by analogy with example 2.13 given in section 2 of this methodological development on pages 13-14. Keep in mind that when calculating the representativeness error m it is necessary to use formula 2.4 given on page 13 of this methodological development, since the sample is small and the number of elements N of the general population is known and equal to the number of students in the study group according to the class register.

Stage III of research

"Evaluation of the effectiveness of the warm-up according to the indicator 'Speed and coordination of actions' by each team member using the Student's t-test"

Evaluate the effectiveness of the warm-up for throwing darts at the target of the sports game Darts, performed at Stage I of this laboratory work, by each team member according to the indicator "Speed and coordination of actions", using the Student's criterion, a parametric criterion of statistical significance.


2. Calculate the average, the variance and the standard deviation of the measurement results of the indicator "Speed and coordination of actions" before the warm-up, given in Table 4.3 (see the similar calculations immediately after Table 2.30 of example 2.14 on page 16 of this methodological development).

3. Each member of the work team measures (tests) his or her personal speed and coordination of actions after the warm-up.


5. Calculate the average, the variance and the standard deviation of the measurement results of the indicator "Speed and coordination of actions" after the warm-up, given in Table 4.4, and write down the overall measurement result (see the similar calculations immediately after Table 2.31 of example 2.14 on page 17 of this methodological development).

6. Perform all the necessary calculations and conclusions by analogy with example 2.14 given in section 2 of this methodological development on pages 16-17. Keep in mind that when calculating the representativeness error m it is necessary to use formula 2.1 given on page 12 of this methodological development, since the number of elements N of the general population is unknown.

Stage IV of research

Assessment of the uniformity (stability) of the indicators "Speed and coordination of actions" of two team members using the Fisher criterion

Assess the uniformity (stability) of the indicators "Speed and coordination of actions" of two team members using the Fisher criterion, based on the measurement results obtained at Stage III of this laboratory work.

To do this you need to do the following.

Using the data of Tables 4.3 and 4.4 and the variances calculated from them at Stage III, together with the methodology for calculating and applying the Fisher criterion for assessing the uniformity (stability) of sports indicators given in example 2.15 on pages 18-19 of this methodological development, draw the appropriate statistical and pedagogical conclusions.

Stage V of research

Assessment of the groups of indicators "Speed and coordination of actions" of one team member before and after the warm-up

Recently, Vladimir Davydov wrote a Facebook post about A/B and MVT testing that raised a lot of questions.

Conducting A/B or MVT testing on websites is typically very difficult, although to landing-page makers it seems elementary: "it's a no-brainer, there are special programs for it".

If you decide to test web content, remember:

1. First, you need to isolate audiences of equal size and equal quality, and conduct A/A tests. The vast majority of tests conducted by online agencies or inexperienced internet marketers are incorrect, precisely because the content is tested on different audiences.

2. Conduct dozens or better yet hundreds of tests over several months. It’s not worth testing 2-3 versions of a page for a week.

3. Remember that you can also test in the MVT format (that is, many options), and not just A and B.

4. Statistically analyze the data array with the test results (Excel is absolutely fine; you can also use SPSS). Are the results within the margin of error? How much do they deviate, and how do they depend on time? If, for example, the A/A test from point 1 showed strong deviations of one option from another, that is a failure, and you cannot test further.

5. No need to test everything. This is not entertainment (unless you really have nothing else to do). It makes sense to test only what, from the point of view of marketing and business analysis, can lead to noticeable results, and only what can actually be measured. For example, you decided to increase the font size on the website, tested a page with a larger font for a couple of weeks, and sales increased. What does that prove? By itself, nothing (see the previous points).

6. Entire paths need to be tested. That is, it is not enough to take and test the purchase page (or some action on the site) - you need to test those pages and steps that lead to this final conversion page.

The question was asked in the comments:

"How do you determine the winner? Say we tested the headline on a hard-sell page. What difference in conversion must there be between A and B to declare a winner?"

Vladimir's answer:

First, you need to conduct long-term isolated experiments (the basic rule of any statistical evaluation). Second, everything inevitably comes down to statistics and mathematics (that's why I recommend Excel and SPSS or free analogues). We need to calculate the confidence probability that the difference in values means something. There is a good article (one of many) where they take transactions from GA based on Optimizely tests (https://www.distilled.net/uploads/ga_transactions.png), compare the transactions (purchases) with the usual bell distribution and see whether the average falls within the confidence interval (https://www.distilled.net/uploads/t-test_tool.png).


The Role of Statistical Significance in Increasing Conversions: 6 Things You Need to Know

1. Exactly what it means

“The change allowed us to achieve a 20% increase in conversion with a 90% confidence level.” Unfortunately, this statement is not at all equivalent to another, very similar one: “The chances of increasing conversion by 20% are 90%.” So what is it really about?

20% is an increase that we recorded based on the results of tests on one of the samples. If we started to fantasize and speculate, we might imagine that this growth could persist permanently—if we continued testing indefinitely. But this does not mean that with a 90% probability we will get a twenty percent increase in conversion, or an increase of “at least” 20%, or “approximately” 20%.


90% is the probability of any change in conversion at all. In other words, if we ran ten A/B tests to get this result and let all ten run indefinitely, then one of them (since the probability of a change is 90%, 10% remains for the no-change outcome) would probably end up with a "post-test" result close to the original conversion rate, i.e. no change. Of the remaining nine tests, some could show an increase of much less than 20%, while others could exceed that bar.

If we misinterpret this data, we take a big risk by “rolling out” the test. It's easy to get excited when a test shows high conversion rates with a 95% confidence level, but it's wise not to expect too much until the test is taken to its logical conclusion.

2. When to use

The most obvious candidates are A/B split tests, but they are far from the only ones. You can also test for statistically significant differences between segments (for example, visits from organic versus paid search) or time periods (for example, April 2013 and April 2014).

However, it is worth remembering that correlation does not imply causation. When we run split tests, we know that we can attribute any changes in results to the elements that differentiate the pages, because special care is taken to ensure that the rest of each page is completely identical. If you are comparing groups such as visitors coming from organic and paid search, any other factors can come into play: for example, organic search may bring many visits at night, and the conversion rate among night visitors may be quite high. Significance tests can help determine whether there is a reason for a change, but they cannot tell what the reason is.

3. How to test changes in conversion rates, bounce rates and exit rates

When we look at “indicators,” we are really looking at averages of binary variables—someone either completed the target actions or they did not. If we have a sample of 10 people with a 40% conversion rate, we're actually looking at a table like this:

We need this table, along with the average, to calculate the standard deviation, a key component of statistical significance. However, the fact that every value in the table is either a zero or a one makes things easier: we can avoid copying a huge list of numbers by using an A/B test confidence calculator, starting only from the means and the sample sizes. One such tool is from KissMetrics.

(Important! This tool only takes one side of the probability distribution into account in its calculations. To use both sides and convert the result to two-sided significance, you need to double the distance from 100% - for example, one-sided 95% becomes two-sided 90%).

Although the description says “A/B test validity tool,” it can also be used for any other metric comparison—just replace conversion with bounce or exit rate. In addition, it can be used to compare segments or time periods - the calculations will be the same.

Also, it is well suited for multivariate testing (MVT) - just compare each change individually with the original.
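For reference, the kind of calculation behind such calculators can be sketched as a two-proportion z-test; this is one standard formulation, and a given tool's exact formula may differ:

```python
from math import sqrt, erf

def two_proportion_p(conv1, n1, conv2, n2):
    """Two-sided p-value for the difference between two conversion rates."""
    p1, p2 = conv1 / n1, conv2 / n2
    pooled = (conv1 + conv2) / (n1 + n2)  # common rate under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    phi = 0.5 * (1 + erf(z / sqrt(2)))    # standard normal CDF at z
    return 2 * (1 - phi)                  # both sides of the distribution

# hypothetical test: 100 conversions out of 1000 visits vs 150 out of 1000
p_value = two_proportion_p(100, 1000, 150, 1000)
print(p_value < 0.05)  # True -> significant at the 95% level
```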

4. How to test changes in the average bill

To test for means of non-binary variables, we need the full data set, so things get a little more complicated here. For example, we want to determine whether there is a significant difference in the average order value for an A/B split test - this point is often omitted in conversion optimization, although for business indicators it is as important as the conversion itself.

The first thing we need is to get from Google Analytics the full list of transactions for each test variant, A and B (before and after). The simplest way to do this is to create custom segments based on the custom variables of your split test, and then export the transaction report to an Excel spreadsheet. Make sure that all transactions are included, not just the default 10 rows.

When you have two lists of transactions, you can copy them into a tool like this:

In the case above, we did not reach the chosen confidence level of 95%. In fact, looking at the p-value of 0.63 above the bottom graph, it is clear that we do not even have 50% significance: there is a 63% chance that the difference between the page results is purely due to chance.
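If you would rather not rely on an online tool, the same comparison of average bills can be sketched offline with Welch's t-statistic, which does not assume equal variances (the transaction lists below are invented):

```python
from math import sqrt
from statistics import mean, variance

# hypothetical order values exported for variants A and B
orders_a = [56, 72, 35, 90, 48, 61, 77, 52, 44, 83]
orders_b = [64, 58, 95, 41, 70, 88, 53, 66, 79, 49]

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    return abs(mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

t = welch_t(orders_a, orders_b)
# the critical value for P = 0.95 at roughly 18 degrees of freedom is about 2.10
print(t < 2.10)  # True -> the difference in average bill is not significant
```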

5. How to predict the required duration of an A/B split test

Evanmiller.org has another handy tool for conversion optimization: a sample size calculator.

This tool allows you to answer the question “How long will it take to get reliable test results?”, and this answer is not worth trying to guess.

There are a few things worth noting. First, the tool has an absolute/relative switch: if you want to detect the difference between a base conversion rate of 5% and a variant conversion rate of 6%, that is 1% in absolute terms (6-5=1) or 20% in relative terms (6/5=1.2). Second, at the bottom of the page there are two sliders. The lower one sets the required significance level: if your goal is 95% significance, set it to 5%. The upper slider sets the probability that the calculated number of visits will be sufficient: for example, if you want to know the number of visits required for an 80% chance of finding 95% significance, set the upper slider to 80% and the lower slider to 5%.
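The arithmetic behind such a calculator can be approximated with the standard sample-size formula for two proportions (a sketch; the exact method evanmiller.org uses may differ slightly):

```python
from math import ceil
from statistics import NormalDist

def visits_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect p1 vs p2 (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% significance
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# base conversion 5%, hoping to detect 6% (1% absolute, 20% relative)
n = visits_per_variant(0.05, 0.06)
print(n)  # on the order of eight thousand visits per variant
```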

6. What not to do

There are several simple ways to identify the unsuitability of a split test, which, however, are not always obvious at first glance:

A) Split testing of non-binary ordinal values

For example, your goal is to find out whether there is a significant difference in the likelihood that visitors from the "before" and "after" groups will buy certain products. You label three products "1", "2" and "3" and then enter these values into the significance-test fields. Unfortunately, this approach won't work: product 2 is not the average of products 1 and 3.

B) Traffic distribution settings

At the beginning of the test, you decide not to take risks and set the traffic distribution to 90/10. After some time, you see that the change did not lead to a noticeable change in conversion, and you move the slider to 50/50. But returning visitors still belong to their original group, so you end up in a situation where the "pre-change" version has a higher proportion of returning visitors showing a high likelihood of converting. Things get complicated very quickly, and the only simple way to get data you can rely on is to look at new and returning visitors separately. However, in this case it will take longer to obtain meaningful results. And even if both subgroups show significant results, what if one of them actually generates more returning visitors? In general, there is no need to do this and change the traffic distribution during the test.

C) Planning

It seems obvious, but don't compare data collected at one time of day with data collected over the whole day or at other times of day. If you want to test for a specific time of day, you have two options.

1. Handle visitor requests throughout the day as usual, but show them the original version of the page at a time of day in which you are not interested.

2. Compare apples to apples – If you are only looking at change data for the first half of the day, compare it to the original data for the first half of the day.

I hope you find some of the above helpful in optimizing your conversion rates. If you have your own know-how, please share it in the comments.

STATISTICAL RELIABILITY

- English: credibility/validity, statistical; German: Validität, statistische. Consistency, objectivity and lack of ambiguity in a statistical test or in some set of measurements. Statistical reliability can be tested by repeating the same test (or questionnaire) on the same subject to see whether the same results are obtained, or by comparing different parts of a test that are supposed to measure the same object.

Antinazi. Encyclopedia of Sociology, 2009


Statistical significance, or the p-level of significance, is the main result of testing a statistical hypothesis. In technical terms, it is the probability of obtaining a given result of a sample study provided that the null statistical hypothesis is in fact true for the general population - that is, that there is no relationship. In other words, it is the probability that the detected relationship is random and is not a property of the population. It is statistical significance, the p-level, that gives a quantitative assessment of the reliability of a relationship: the lower this probability, the more reliable the relationship.

Suppose, when comparing two sample means, a level value was obtained

statistical significance p=0.05. This means that testing the statistical hypothesis about

equality of means in the population showed that if it is true, then the probability

The random occurrence of detected differences is no more than 5%. In other words, if

two samples were repeatedly drawn from the same population, then in 1 of

20 cases would reveal the same or greater difference between the means of these samples.

That is, there is a 5% chance that the differences found are due to chance.

character, and are not a property of the aggregate.
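The "1 case in 20" interpretation can be checked with a small simulation: repeatedly draw two samples from the same population and count how often a two-sided test reports p < 0.05. This is a sketch using only Python's standard library and a normal approximation for the test statistic; the population parameters and sample sizes are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(42)

def two_sided_p(a, b):
    """Two-sided p-value for the difference of two sample means,
    using a normal approximation to the test statistic."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Draw two samples from the SAME population (so the null hypothesis
# is true) many times and count how often p < 0.05.
trials = 2000
false_positives = sum(
    two_sided_p([random.gauss(100, 15) for _ in range(50)],
                [random.gauss(100, 15) for _ in range(50)]) < 0.05
    for _ in range(trials)
)
print(false_positives / trials)  # close to 0.05, i.e. about 1 in 20
```

The printed fraction hovers around 0.05: even with no real difference in the population, roughly one comparison in twenty crosses the 0.05 threshold by chance.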

In relation to a scientific hypothesis, the level of statistical significance is a quantitative indicator of the degree of distrust in the conclusion that a relationship exists, calculated from the results of a sample-based, empirical test of that hypothesis. The lower the p-value, the higher the statistical significance of a research result confirming the scientific hypothesis.

It is useful to know what influences the level of significance. Other things being equal, significance is higher (the p-value is lower) if:

The magnitude of the relationship (difference) is greater;

The variability of the trait(s) is lower;

The sample size(s) is larger.

One-sided and two-sided significance tests

If the purpose of a study is to identify differences between the parameters of two general populations that correspond to different natural conditions (living conditions, age of the subjects, etc.), it is often unknown in advance which of these parameters will be greater and which smaller.

For example, if one is interested in the variability of results in a control group and an experimental group, then, as a rule, there is no certainty about the sign of the difference between the variances or standard deviations by which variability is assessed. In this case the null hypothesis states that the variances are equal, and the purpose of the study is to prove the opposite, i.e., that the variances differ. The difference is allowed to be of either sign. Such hypotheses are called two-sided.

But sometimes the task is to prove an increase or decrease in a parameter; for example, that the average result in the experimental group is higher than in the control group. In this case a difference of the opposite sign is no longer allowed. Such hypotheses are called one-sided.

Significance tests used to check two-sided hypotheses are called two-sided, and those used for one-sided hypotheses are called one-sided.
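The relationship between the two kinds of tests can be illustrated numerically: when the observed effect lies in the predicted direction, the one-sided p-value of a z-statistic is half the two-sided one. A minimal sketch using Python's standard library (the value of the statistic is an illustrative assumption):

```python
from statistics import NormalDist

z = 1.96  # an illustrative value of the test statistic

# Two-sided test: deviations of either sign count as evidence against H0.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# One-sided test: only deviations in the predicted direction count.
p_one_sided = 1 - NormalDist().cdf(z)

print(round(p_two_sided, 3))  # 0.05
print(round(p_one_sided, 3))  # 0.025
```

This is why a one-sided test reaches significance more easily, and also why the direction must be fixed before the experiment: choosing the one-sided test after seeing the data effectively doubles the risk of a false positive.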

The question arises of which test to choose in a given case. The answer to this question lies beyond formal statistical methods and depends entirely on the goals of the study. Under no circumstances should a test be chosen after the experiment has been conducted, on the basis of an analysis of the experimental data, as this may lead to incorrect conclusions. If, before the experiment, it is assumed that the difference between the compared parameters may be either positive or negative, a two-sided test should be used; if a difference of only one sign is expected, a one-sided test is appropriate.

Hypotheses are tested using statistical analysis. Statistical significance is determined from the P-value, which corresponds to the probability of obtaining a given result under the assumption that a certain statement (the null hypothesis) is true. If the P-value is less than the specified level of statistical significance (usually 0.05), the experimenter can conclude that the null hypothesis is false and proceed to consider the alternative hypothesis. Using Student's t-test, you can calculate the P-value and determine significance for two data sets.

Steps

Part 1

Setting up the experiment

    Define your hypothesis. The first step in assessing statistical significance is to choose the question you want to answer and formulate a hypothesis. A hypothesis is a statement about experimental data, their distribution and properties. For any experiment, there is both a null and an alternative hypothesis. Generally speaking, you will have to compare two sets of data to determine whether they are similar or different.

    • The null hypothesis (H0) typically states that there is no difference between the two sets of data. For example: students who read the material before class do not receive higher grades.
    • The alternative hypothesis (Ha) is the opposite of the null hypothesis and is the statement that needs to be supported by the experimental data. For example: students who read the material before class do get higher grades.
  1. Set the significance level to determine how much the data must deviate from what is expected under the null hypothesis to be considered a significant result. The significance level (also called the α-level) is the threshold you set for statistical significance. If the P-value is less than or equal to the significance level, the data are considered statistically significant.

    • As a rule, the significance level (the value of α) is taken to be 0.05; in this case the probability of mistaking a purely random difference between the data sets for a real one is only 5%.
    • The lower the significance level, the stricter the criterion and the more reliable a statistically significant result is.
    • If you want more reliable results, lower the significance level to 0.01. Lower significance levels are typically used in manufacturing when it is necessary to detect defects in products; high reliability is required to be sure that all parts work as expected.
    • For most hypothesis-testing experiments, a significance level of 0.05 is sufficient.
  2. Decide which test you will use: one-sided or two-sided. One of the assumptions of Student's t-test is that the data are normally distributed. A normal distribution is a bell-shaped curve with the maximum number of results in the middle of the curve. Student's t-test is a mathematical test that allows you to determine whether the data fall outside the normal range (above, below, or in both "tails" of the curve).

    • If you are not sure whether the data is above or below the control group values, use a two-tailed test. This will allow you to determine significance in both directions.
    • If you know in which direction the data might fall outside the normal distribution, use a one-tailed test. In the example above, we expect students' grades to increase, so a one-tailed test can be used.
  3. Determine the sample size using statistical power. The statistical power of a study is the probability that, given the sample size, the expected effect will be detected. A common power threshold is 80% (power equals 1 − β, where β is the probability of a type II error). Analyzing statistical power without any prior data can be challenging, because it requires some information about the expected means in each group of data and their standard deviations. Use an online power-analysis calculator to determine the optimal sample size for your data.

    • Typically, researchers conduct a small pilot study that provides data for statistical power analysis and determines the sample size needed for a larger, more complete study.
    • If you are unable to conduct a pilot study, try to estimate possible averages based on the literature and other people's results. This may help you determine the optimal sample size.
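As a rough alternative to an online calculator, the required sample size per group for comparing two means can be approximated with the normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)² (σ/δ)², where δ is the expected difference between the means. A sketch in Python's standard library; the 5-point difference and 10-point standard deviation below are illustrative assumptions, not values from the text:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample comparison of means
    (normal approximation, two-sided alpha)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    return ceil(2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2)

# Illustrative assumption: we expect a 5-point difference in mean grades,
# with a standard deviation of about 10 points in each group.
print(sample_size_per_group(delta=5, sigma=10))  # 63 per group
```

Note how the result scales with (σ/δ)²: halving the expected difference quadruples the required sample size, which is why small effects demand large studies.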

    Part 2

    Calculate standard deviation
    1. Write down the formula for the standard deviation. The standard deviation shows how much spread there is in the data; it allows you to judge how close together the values obtained from a sample are. At first glance the formula seems quite complicated, but the explanations below will help you understand it. The formula is as follows: s = √(∑(x_i − µ)² / (N − 1)).

      • s is the standard deviation;
      • the sign ∑ means that all the values obtained from the sample are summed;
      • x_i is the i-th value, that is, an individual result;
      • µ is the mean for the given group;
      • N is the total number of data points in the sample.
    2. Find the mean of each group. To calculate the standard deviation, you must first find the mean for each study group. The mean is denoted by the Greek letter µ (mu). To find it, simply add all the obtained values and divide by the number of data points (the sample size).

      • For example, to find the average grade for a group of students who study before class, consider a small data set. For simplicity, we use a set of five points: 90, 91, 85, 83 and 94.
      • Let's add all the values together: 90 + 91 + 85 + 83 + 94 = 443.
      • Let's divide the sum by the number of values, N = 5: 443/5 = 88.6.
      • Thus, the average for this group is 88.6.
    3. Subtract the mean from each value obtained. The next step is to calculate the differences (x_i − µ). To do this, subtract the mean from each obtained value. In our example, we need to find five differences:

      • (90 – 88.6), (91 – 88.6), (85 – 88.6), (83 – 88.6) and (94 – 88.6).
      • As a result, we get the following values: 1.4, 2.4, -3.6, -5.6 and 5.4.
    4. Square each value obtained and add them together. Each of the quantities just found should be squared; at this step, all negative values disappear. If you still have negative numbers after this step, you forgot to square them.

      • For our example, we get 1.96, 5.76, 12.96, 31.36 and 29.16.
      • We add up the resulting values: 1.96 + 5.76 + 12.96 + 31.36 + 29.16 = 81.2.
    5. Divide by the sample size minus 1. In the formula, the sum is divided by N − 1 because we are working with a sample rather than the entire population.

      • Subtract: N – 1 = 5 – 1 = 4
      • Divide: 81.2/4 = 20.3
    6. Take the square root. After dividing the sum by the sample size minus one, take the square root of the resulting value. This is the last step in calculating the standard deviation. There are statistical programs that perform all the necessary calculations once the raw data are entered.

      • In our example, the standard deviation of the grades of the students who read the material before class is s = √20.3 ≈ 4.51.
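The worked example can be verified with Python's standard library, whose statistics.stdev implements exactly this N − 1 (sample) formula:

```python
from statistics import mean, stdev

grades = [90, 91, 85, 83, 94]  # the five grades from the example

mu = mean(grades)  # arithmetic mean
s = stdev(grades)  # sample standard deviation (divides by N - 1)

print(mu)           # 88.6
print(round(s, 2))  # 4.51
```

Note that statistics.pstdev would instead divide by N (the population formula) and give a slightly smaller value; for data treated as a sample, stdev is the right choice.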

      Part 3

      Determine significance
      1. Calculate the standard error of the difference between the two groups of data. Up to this point we have considered an example with only one group of data. To compare two groups, you obviously need data from both. Calculate the standard deviation for the second group of data, and then find the standard error of the difference between the two experimental groups. It is calculated using the following formula: s_d = √(s₁²/N₁ + s₂²/N₂).
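Continuing the example, a minimal sketch of this step in Python; the second group's grades are a hypothetical control group, not data from the text:

```python
from math import sqrt
from statistics import stdev

experimental = [90, 91, 85, 83, 94]  # students who read before class
control = [80, 85, 78, 82, 88]       # hypothetical control group

s1, s2 = stdev(experimental), stdev(control)
n1, n2 = len(experimental), len(control)

# Standard error of the difference between the two sample means:
# s_d = sqrt(s1^2/N1 + s2^2/N2)
s_d = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

print(round(s_d, 2))  # 2.69
```

Dividing the difference of the two group means by s_d then yields the t-statistic used to look up the P-value.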
