2017
COMMERCE
Paper: 203
(Research Methodology and Statistical Analysis)
Full Marks: 80
Time: 3 hours
The figures in the margin indicate full marks for the questions
1. (a) The difference in earnings of different income categories is responsible for different buying and usage habits of washing powder. You have been asked to conduct a survey and choose a sample. What kind of sampling will you avail? Reason out your answer. (16)
(b) Discuss the difference between ratio scale and interval scale. (16)
-> The interval scale and the ratio scale are both quantitative measurement scales: each assigns numbers to the attribute of a variable.
The difference between them lies in the zero point. An interval scale has no true zero, so its zero is arbitrary and values can fall below it. For example, you can measure temperature below 0 degrees Celsius, such as -10 degrees.
A ratio scale, on the other hand, has a true zero that marks the absence of the quantity. Height and weight are measured from 0 upward and never fall below it.
On an interval scale, measurements can be ranked, counted, added, or subtracted, and equal intervals separate each number on the scale. However, these measurements do not provide any sense of ratio between one another.
A ratio scale has all the properties of an interval scale: you can use it to add, subtract, or count measurements. It differs by having a character of origin, which is the starting or zero point of the scale, and this is what makes ratios meaningful.
Interval-ratio scales comparison
Measuring temperature is an excellent example of interval scales. The temperature in an air-conditioned room is 16 degrees Celsius, while the temperature outside the room is 32 degrees Celsius. You can conclude the temperature outside is 16 degrees higher than inside the room.
But if you said, “It is twice as hot outside as inside,” you would be incorrect. That statement treats 0 degrees Celsius as the reference point for comparing the two temperatures. Because 0 degrees Celsius is an arbitrary point on the scale and temperatures can fall below it, it cannot serve as that reference point. You must state the comparison as a difference (16 degrees) instead.
Interval variables are commonly known as scaled variables. They are often expressed in a unit, such as degrees. In statistics, the mean, median, and mode can all be used to describe interval variables.
A ratio scale shows both the order of values and the exact distance between them, and zero is a meaningful value on it. This scale allows a researcher to apply statistical techniques such as the geometric mean and the harmonic mean.
Whereas you cannot say that it is twice as warm outside, because temperature in Celsius is an interval scale, you can say that you are twice another person’s age, because age is a ratio variable.
Age, money, and weight are common ratio scale variables. For example, if you are 50 years old and your child is 25 years old, you can accurately claim you are twice their age.
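The contrast can be checked numerically. The sketch below reuses the 16 and 32 degree readings and the 50 and 25 year ages from the text; the conversion to Fahrenheit is an added assumption, used only to show that a ratio of interval-scale values changes with the unit while a ratio of ratio-scale values does not.

```python
# Readings and ages taken from the examples in the text.
inside_c, outside_c = 16.0, 32.0
parent_years, child_years = 50, 25


def celsius_to_fahrenheit(c):
    """Standard Celsius-to-Fahrenheit conversion (an assumed second unit)."""
    return c * 9 / 5 + 32


# Differences are meaningful on an interval scale: they only pick up
# the constant factor 9/5 when the unit changes.
diff_c = outside_c - inside_c
diff_f = celsius_to_fahrenheit(outside_c) - celsius_to_fahrenheit(inside_c)

# Ratios are not meaningful: the "twice as hot" claim disappears
# as soon as the same temperatures are written in Fahrenheit.
ratio_c = outside_c / inside_c
ratio_f = celsius_to_fahrenheit(outside_c) / celsius_to_fahrenheit(inside_c)

# Age has a true zero, so the ratio is the same in years and in months.
ratio_years = parent_years / child_years
ratio_months = (parent_years * 12) / (child_years * 12)

print("difference:", diff_c, "C,", round(diff_f, 1), "F")
print("temperature ratio:", ratio_c, "in C,", round(ratio_f, 3), "in F")
print("age ratio:", ratio_years, "in years,", ratio_months, "in months")
```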
Interval scale Vs Ratio scale: Points of difference
Features | Interval scale | Ratio scale |
Variable property | Values measured on an interval scale can be added and subtracted. You cannot calculate a meaningful ratio between them. | A ratio scale has all the characteristics of an interval scale and, in addition, allows ratios to be calculated. That is, you can compare numbers on the scale against 0. |
Absolute Point Zero | The zero point on an interval scale is arbitrary. For example, the temperature can be below 0 degrees Celsius and into negative temperatures. | The ratio scale has an absolute zero or character of origin. Height and weight cannot fall below zero. |
Calculation | Statistically, on an interval scale, the arithmetic mean can be calculated. | Statistically, on a ratio scale, the geometric and harmonic means can be calculated in addition to the arithmetic mean. |
Measurement | Interval scale can measure size and magnitude as multiple factors of a defined unit. | Ratio scale can measure size and magnitude as a factor of one defined unit in terms of another. |
Example | A classic example of an interval scale is the temperature in Celsius. The difference in temperature between 50 degrees and 60 degrees is 10 degrees; this is the same difference between 70 degrees and 80 degrees. | Classic examples of a ratio scale are any variable that possesses an absolute zero characteristic, like age, weight, height, or sales figures. |
2. (a) What is data and what are the types of data? Give examples and explain the types of data. (8+8=16)
-> Data can be defined as a systematic record of a particular quantity. It is the set of different values of that quantity represented together. It is a collection of facts and figures to be used for a specific purpose such as a survey or analysis. When arranged in an organized form, data can be called information. The source of the data (primary or secondary) is also an important factor.
Types of Data
Primary Data
· Primary data is original data collected directly from a source by the researcher according to his or her requirements.
· It is the data collected by the investigator himself or herself for a specific purpose.
· Data gathered by finding out first-hand the attitudes of a community towards health services, ascertaining the health needs of a community, evaluating a social program, determining the job satisfaction of the employees of an organization, and ascertaining the quality of service provided by a worker are the examples of primary data.
Secondary Data
· Secondary data refers to the data which has already been collected for a certain purpose and documented somewhere else.
· Data collected by someone else for some other purpose (but being utilized by the investigator for another purpose) is secondary data.
· Gathering information with the use of census data to obtain information on the age-sex structure of a population, the use of hospital records to find out the morbidity and mortality patterns of a community, the use of an organization’s records to ascertain its activities, and the collection of data from sources such as articles, journals, magazines, books and periodicals to obtain historical and other types of information, are examples of secondary data.
Cross-Sectional Data
· Cross-sectional data is a type of data collected by observing many subjects (such as individuals, firms, countries, or regions) at the same point of time, or without regard to differences in time.
· It is the data for a single time point or single space point.
· This type of data is limited in that it cannot describe changes over time or cause and effect relationships in which one variable affects the other.
Categorical Data
· Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level.
· Data which cannot be measured numerically is called categorical data. Categorical data is qualitative in nature.
· Categorical data is also known as attributes.
· A data set consisting of observations on a single characteristic is a univariate data set. A univariate data set is categorical if the individual observations are categorical responses.
Time-Series Data
· Time series data occurs wherever the same measurements are recorded on a regular basis.
· It consists of quantities that trace the values taken by a variable over a period such as a month, quarter, or year.
· The values of different phenomena such as temperature, weight, and population can be recorded over successive periods of time.
· Over time, the values of the variable may increase, decrease, or remain constant.
· Data arranged according to time periods is called time-series data, e.g. the population in different time periods.
Spatial Data
· Also known as geospatial data or geographic information, spatial data is the data or information that identifies the geographic location of features and boundaries on Earth, such as natural or constructed features, oceans, and more.
· Spatial data is usually stored as coordinates and topology and is data that can be mapped.
· Spatial data is used in geographical information systems (GIS) and other location or positioning services.
· Spatial data consists of points, lines, polygons and other geographic and geometric data primitives, which can be mapped by location, stored with an object as metadata or used by a communication system to locate end-user devices.
· Spatial data may be classified as raster or vector data. Each provides distinct information pertaining to geographical or spatial locations.
Ordered Data
· Data arranged in ordered categories is called ordered data.
· Ordered data is similar to categorical data except that there is a clear ordering of the categories.
· For example, for the category economic status, the ordered data may be low, medium, and high.
The most common data types (with examples)
Qualitative vs Quantitative Data
1. Quantitative data
Quantitative data is the easiest type to explain. It answers key questions such as “how many”, “how much”, and “how often”.
Quantitative data can be expressed as a number or can be quantified. Simply put, it can be measured by numerical variables.
Quantitative data is easily amenable to statistical manipulation and can be represented by a wide variety of graphs and charts, such as line graphs, bar graphs, and scatter plots.
Examples of quantitative data:
· Scores on tests and exams, e.g. 85, 67, 90.
· The weight of a person or a subject.
· Your shoe size.
· The temperature in a room.
There are 2 general types of quantitative data: discrete data and continuous data. Both are explained below.
2. Qualitative data
Qualitative data can’t be expressed as a number and can’t be measured. Qualitative data consist of words, pictures, and symbols, not numbers.
Qualitative data is also called categorical data because the information can be sorted by category, not by number.
Qualitative data can answer questions such as “how this has happened” and “why this has happened”.
Examples of qualitative data:
· Colors, e.g. the color of the sea.
· Your favorite holiday destination, such as Hawaii or New Zealand.
· Names, such as John or Patricia.
· Ethnicity, such as American Indian or Asian.
There are 2 general types of qualitative data: nominal data and ordinal data.
Nominal vs Ordinal Data
3. Nominal data
Nominal data is used just for labeling variables, without any type of quantitative value. The name ‘nominal’ comes from the Latin word “nomen” which means ‘name’.
Nominal data simply names a thing without implying any order. In fact, nominal data could just be called “labels.”
Examples of Nominal Data:
· Gender (Women, Men)
· Hair color (Blonde, Brown, Brunette, Red, etc.)
· Marital status (Married, Single, Widowed)
· Ethnicity (Hispanic, Asian)
As you see from the examples there is no intrinsic ordering to the variables.
Eye color is a nominal variable having a few categories (Blue, Green, Brown) and there is no way to order these categories from highest to lowest.
4. Ordinal data
Ordinal data shows where a value stands in an order. This is the crucial difference from nominal data.
Ordinal data is data that is placed in some kind of order by its position on a scale. Ordinal data may indicate superiority.
However, you cannot do arithmetic with ordinal numbers because they only show sequence.
Ordinal variables are considered to be “in between” qualitative and quantitative variables.
In other words, ordinal data is qualitative data for which the values are ordered.
Nominal data, by comparison, is qualitative data for which the values cannot be placed in an order.
We can also assign numbers to ordinal data to show their relative position, for example “first, second, third”, but we cannot do math with those numbers.
Examples of Ordinal Data:
· The first, second and third person in a competition.
· Letter grades: A, B, C, etc.
· When a company asks a customer to rate the sales experience on a scale of 1-10.
· Economic status: low, medium and high.
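As a small illustration of what ordinal data does and does not allow, the sketch below ranks the economic-status categories from the example above. The survey responses are hypothetical. Sorting and taking the middle value use only the order of the categories, which is legitimate for ordinal data; no arithmetic is done on the labels.

```python
# Ordinal scale from the text: low < medium < high.
LEVELS = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(LEVELS)}

# Hypothetical survey responses (an odd number, so a single middle value exists).
responses = ["medium", "low", "high", "medium", "low", "medium", "high"]

# Ordering is meaningful for ordinal data.
ordered = sorted(responses, key=rank.get)

# The median is the middle value of the ordered list; it depends only on order.
median_response = ordered[len(ordered) // 2]

print(ordered)
print("median:", median_response)
```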
Discrete vs Continuous Data
As we mentioned above discrete and continuous data are the two key types of quantitative data.
In statistics, marketing research, and data science, many decisions depend on whether the basic data is discrete or continuous.
5. Discrete data
Discrete data is a count that involves only integers. The discrete values cannot be subdivided into parts.
For example, the number of children in a class is discrete data. You can count whole individuals. You can’t count 1.5 kids.
In other words, discrete data can take only certain values. The data variables cannot be divided into smaller parts.
It has a limited number of possible values e.g. days of the month.
Examples of discrete data:
· The number of students in a class.
· The number of workers in a company.
· The number of home runs in a baseball game.
· The number of test questions you answered correctly
6. Continuous data
Continuous data is information that can be meaningfully divided into finer levels. It can be measured on a scale or continuum and can have almost any numeric value.
For example, you can measure your height at very precise scales: meters, centimeters, millimeters, and so on.
You can record continuous data for many different kinds of measurement, such as width, temperature, and time. This is where the key difference from discrete data lies.
Continuous variables can take any value between two numbers. For example, between 50 and 72 inches, there are literally millions of possible heights: 52.04762 inches, 69.948376 inches, and so on.
A good rule for deciding whether data is continuous or discrete is that if the unit of measurement can be cut in half and still make sense, the data is continuous.
Examples of continuous data:
· The amount of time required to complete a project.
· The height of children.
· The square footage of a two-bedroom house.
· The speed of cars.
3. (a) Discuss conditional probability and Bayes’ Theorem. (16)
-> Conditional probability:-
Conditional probability is the probability of an event occurring given that another event has already occurred. It is one of the fundamental concepts in probability theory. Note that conditional probability does not imply that there is a causal relationship between the two events, nor does it indicate that both events occur simultaneously.
The concept of conditional probability is closely related to Bayes’ theorem, which is one of the most influential theorems in statistics.
Formula for Conditional Probability
P(A|B) = P(A ∩ B) / P(B)
Where:
· P(A|B) – the conditional probability; the probability of event A occurring given that event B has already occurred
· P(A ∩ B) – the joint probability of events A and B; the probability that both events A and B occur
· P(B) – the probability of event B
The formula above is applied to the calculation of the conditional probability of events that are neither independent nor mutually exclusive.
Another way of calculating conditional probability is by using Bayes’ theorem. The theorem gives the conditional probability of event A, given that event B has occurred, from the conditional probability of event B, given that event A has occurred, together with the individual probabilities of events A and B. Mathematically, Bayes’ theorem can be written in the following way:
P(A|B) = P(B|A) × P(A) / P(B)
Conditional Probability for Independent Events
Two events are independent if the probability of the outcome of one event does not influence the probability of the outcome of another event. Due to this reason, the conditional probability of two independent events A and B is:
P (A|B) = P (A)
P (B|A) = P (B)
Conditional Probability for Mutually Exclusive Events
In probability theory, mutually exclusive events are events that cannot occur simultaneously. In other words, if one event has already occurred, the other event cannot occur. Thus, the conditional probability of mutually exclusive events is always zero.
P(A|B) = 0
P(B|A) = 0
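The three cases above (the general case, independent events, and mutually exclusive events) can be verified on a small sample space. The sketch below assumes one roll of a fair six-sided die, which is not an example from the text; the events and the helper names `prob` and `cond_prob` are chosen only for illustration.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = set(range(1, 7))


def prob(event):
    """Probability of an event (a set of outcomes), all outcomes equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))


def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B); only defined when P(B) > 0."""
    return prob(a & b) / prob(b)


even = {2, 4, 6}
greater_than_3 = {4, 5, 6}
up_to_4 = {1, 2, 3, 4}
odd = {1, 3, 5}

# General case: knowing the roll is above 3 changes the chance of "even".
print("P(even | >3)  =", cond_prob(even, greater_than_3))

# Independent events: P(A|B) equals P(A).
print("P(even | <=4) =", cond_prob(even, up_to_4), " P(even) =", prob(even))

# Mutually exclusive events: P(A|B) is zero.
print("P(even | odd) =", cond_prob(even, odd))
```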
Bayes’ Theorem:-
In statistics and probability theory, Bayes’ theorem (also known as Bayes’ rule) is a mathematical formula used to determine the conditional probability of events. Essentially, Bayes’ theorem describes the probability of an event based on prior knowledge of the conditions that might be relevant to the event.
The theorem is named after the English statistician Thomas Bayes, whose work containing the formula was published posthumously in 1763. It is considered the foundation of the statistical inference approach called Bayesian inference.
Besides statistics, Bayes’ theorem is also used in various disciplines, with medicine and pharmacology as the most notable examples. In addition, the theorem is commonly employed in different fields of finance. Some of the applications include, but are not limited to, modeling the risk of lending money to borrowers and forecasting the probability of the success of an investment.
Formula for Bayes’ Theorem
Bayes’ theorem is expressed in the following formula:
P(A|B) = P(B|A) × P(A) / P(B)
Where:
· P(A|B) – the probability of event A occurring, given event B has occurred
· P(B|A) – the probability of event B occurring, given event A has occurred
· P(A) – the probability of event A
· P(B) – the probability of event B
Note that the formula requires P(B) > 0. Events A and B need not be independent; if they are, P(A|B) = P(A) and the theorem adds no new information.
A special case of Bayes’ theorem is when event A is a binary variable. In such a case, the theorem is expressed in the following way:
P(A^{+}|B) = P(B|A^{+}) × P(A^{+}) / [P(B|A^{–}) × P(A^{–}) + P(B|A^{+}) × P(A^{+})]
Where:
- P(B|A^{–}) – the probability of event B occurring given that event A^{–} has occurred
- P(B|A^{+}) – the probability of event B occurring given that event A^{+} has occurred
In the special case above, events A^{–} and A^{+} are mutually exclusive outcomes of event A.
Example of Bayes’ Theorem
Imagine you are a financial analyst at an investment bank. According to your research of publicly traded companies, 60% of the companies that increased their share price by more than 5% in the last three years replaced their CEOs during the period.
At the same time, only 35% of the companies that did not increase their share price by more than 5% in the same period replaced their CEOs. Knowing that the probability that a company’s share price grows by more than 5% is 4%, find the probability that the shares of a company that replaces its CEO will increase by more than 5%.
Before finding the probabilities, you must first define the notation of the probabilities.
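· P(A) – the probability that the stock price increases by more than 5%
· P(B) – the probability that the CEO is replaced
· P(A|B) – the probability that the stock price increases by more than 5%, given that the CEO has been replaced
· P(B|A) – the probability that the CEO is replaced, given that the stock price has increased by more than 5%
Using Bayes’ theorem with P(B|A) = 0.60, P(A) = 0.04, and a 0.35 probability of CEO replacement when the price did not increase, we can find the required probability:
P(A|B) = (0.60 × 0.04) / (0.60 × 0.04 + 0.35 × 0.96) = 0.024 / 0.36 ≈ 0.0667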
Thus, the probability that the shares of a company that replaces its CEO will grow by more than 5% is 6.67%.
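The arithmetic of this example can be reproduced with a few lines of code. The sketch below uses the binary form of Bayes’ theorem with the three figures given in the example (60%, 35%, and 4%); the function name `bayes_binary` and its parameter names are hypothetical.

```python
def bayes_binary(p_b_given_a, p_b_given_not_a, p_a):
    """Return P(A|B) for a binary event A using Bayes' theorem.

    P(B) is expanded as P(B|A) * P(A) + P(B|not A) * P(not A).
    """
    numerator = p_b_given_a * p_a
    evidence = numerator + p_b_given_not_a * (1 - p_a)
    return numerator / evidence


# Figures from the example: A = price rises by more than 5%, B = CEO replaced.
posterior = bayes_binary(p_b_given_a=0.60, p_b_given_not_a=0.35, p_a=0.04)

print(f"P(A|B) = {posterior:.4f}")
```

Although replacing the CEO is more common among the companies whose shares rose, the posterior stays small because only 4% of companies see such a rise in the first place.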
(b) Discuss the Statistical Decision Theory. (16)
-> Every individual has to make decisions of one kind or another in the course of everyday activity. Decisions of a routine nature do not involve high risks and are consequently trivial. When business executives make decisions, however, their decisions affect other people, such as consumers of the product, shareholders of the business unit, and employees of the organization.
Decisions that affect other people in society call for a very careful and objective analysis of their consequences. The statistician’s task is to split a decision problem into its simple components and to study whether any of them are amenable to scientific treatment. He then tries to devise a method by which these components can be woven into a coherent and consistent decision on the problem as a whole.
Decision problems can be classified into five types:
1. Decision Making Under Certainty:
There are a few problems where the decision maker has almost complete information, so that he knows which state of nature will occur and what its consequences will be. In such a situation, the problem of decision making is simple because the decision maker has only to choose the strategy which will give him the maximum pay-off in terms of utility.
In cases where the number of strategies is so large that it is impossible even to list them, techniques of operational research such as linear, nonlinear, and geometric programming have to be used to find the optimal strategy.
2. Decision Making Under Risk:
A problem of this kind arises when the state of nature is unknown, but based on the objective or empirical evidence, we can possibly assign probabilities to various states of nature. In a number of problems on the basis of historical data and past experience, we are able to assign probabilities to various states of nature. In such cases, the pay-off matrix is of immense help for reaching an optimal decision by assigning probabilities to various states of nature.
3. Decision Making Under Uncertainty:
The process of making decisions under conditions of uncertainty takes place when there is hardly any knowledge about the states of nature and no objective information about their probabilities of occurrence. In the absence of historical data and relative frequencies, the probability of occurrence of a particular state of nature cannot be indicated.
Such situations arise when a new product is introduced or a new plant is set up. Of course, even in such cases some market surveys are conducted and relevant information is gathered though it is not generally sufficient to indicate a probability figure for the occurrence of a particular state of nature.
4. Decision Making Under Partial Information:
This type of situation lies somewhere between the conditions of risk and the conditions of uncertainty. Under conditions of risk, the probabilities of the various states of nature are known on the basis of past experience, and under conditions of uncertainty, no such data is available. But many situations arise where data is partially available. In such circumstances, we can say that decision making is done on the basis of partial information.
5. Decision Making Under Conflict:
A condition of conflict occurs when we are dealing with a rational opponent rather than with the state of nature. The decision maker, therefore, has to choose a strategy taking into consideration the action or counter-action of his opponent. Brand competition, military weapons, market place, etc. are problems which come under this category. The choice of strategy is made on the basis of game theory, where a decision maker anticipates the action of the opponent and then determines his own strategy.
The main purpose of studying decision theory is to put the problem into a suitable logical framework. This involves identifying the problem, for which personal perception and innovativeness are essential, then generating alternative courses of action, and finally evolving criteria for evaluating the different alternatives to arrive at the best choice of action.
The basic components of a decision situation are the following:
1. Acts:
There are many alternative courses of action in any decision problem. But only some relevant alternatives need be considered. For instance, the business firm may decide to market its goods within the state or within the country or beyond the boundaries of the country. Here, there are three alternatives. There may be more such alternatives. The final choice of any one will depend upon the payoffs from each strategy.
2. States of Nature:
These are the possible events, or states of nature, which are uncertain but are vital for the choice of any one of the alternative acts. For example, the radio dealer does not know how many radios he will be able to sell. There is an element of uncertainty about it, and for this reason he cannot decide how many radios to buy. This uncertainty is known as the state of nature or the state of the world.
3. Outcomes:
Each combination of a likely act and a possible state of nature has an outcome. This is otherwise known as the conditional value. The outcome has little significance unless we calculate the pay-off in terms of monetary gain or loss for each outcome. Thus, outcome refers to the result of the combination of an act and each of the states of nature.
4. Pay-off:
The pay-off is the monetary gain or loss from each of the outcomes. It can also be expressed in terms of cost-saving or time-saving, but the pay-off should always be expressed in quantitative terms to allow precise analysis. Where the value of an outcome is expressed directly in terms of monetary gain, it is called the pay-off. The pay-off or utility of each outcome has to be calculated carefully.
5. Expected Values of Each Act:
In practical business situations, there is risk and uncertainty. In the case of risk, the probability of each state of nature is known, and in uncertainty, it is unknown. Therefore, each likely outcome of an act has to be appraised with reference to the probability of occurrence.
The expected value of a given act can be calculated by the following formula:
EV(A_{j}) = P_{1} O_{1j} + P_{2} O_{2j} + ... + P_{n} O_{nj}
Where P_{1} to P_{n} are the probabilities of events E_{1} to E_{n}, and O_{ij} is the pay-off of the outcome from the combination of event E_{i} and act A_{j}. The expected value of each alternative is thus calculated with reference to the probability assigned to each state of nature.
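A minimal sketch of this calculation follows, built on the radio dealer mentioned earlier. All the numbers are hypothetical: the three stock levels (acts), the three demand levels (states of nature) with their probabilities, a gain of 50 per radio sold, and a loss of 30 per radio left unsold. The expected value of each act is the probability-weighted sum of its pay-offs, and the act with the highest expected value is chosen.

```python
# Hypothetical states of nature: demand for radios and its probability.
probabilities = {10: 0.3, 20: 0.5, 30: 0.2}

# Hypothetical acts: how many radios the dealer stocks.
acts = [10, 20, 30]

GAIN_PER_SALE = 50      # assumed profit on each radio sold
LOSS_PER_UNSOLD = 30    # assumed loss on each radio left unsold


def payoff(stock, demand):
    """Pay-off of one act (stock) under one state of nature (demand)."""
    sold = min(stock, demand)
    unsold = max(stock - demand, 0)
    return GAIN_PER_SALE * sold - LOSS_PER_UNSOLD * unsold


def expected_value(stock):
    """Sum of probability times pay-off over all states of nature."""
    return sum(p * payoff(stock, demand) for demand, p in probabilities.items())


expected = {stock: expected_value(stock) for stock in acts}
best_act = max(expected, key=expected.get)

for stock in acts:
    row = [payoff(stock, demand) for demand in probabilities]
    print(f"stock {stock}: pay-offs {row}, expected value {expected[stock]:.1f}")
print("act with the highest expected value: stock", best_act)
```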
4. (a) What do you understand by testing of hypotheses? (16)
-> Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.
There are 5 main steps in hypothesis testing:
1. State your research hypothesis as a null (H_{0}) and alternate (H_{a}) hypothesis.
2. Collect data in a way designed to test the hypothesis.
3. Perform an appropriate statistical test.
4. Decide whether the null hypothesis is supported or refuted.
5. Present the findings in your results and discussion section.
Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.
Step 1: State your null and alternate hypothesis
After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H_{0}) and alternate (H_{a}) hypothesis so that you can test it mathematically.
The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.
Step 2: Collect data
For a statistical test to be valid, it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.
Step 3: Perform a statistical test
There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).
If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p-value. This means it is unlikely that the differences between these groups came about by chance.
Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p-value. This means it is likely that any difference you measure between groups is due to chance.
Your choice of statistical test will be based on the type of data you collected.
Step 4: Decide whether the null hypothesis is supported or refuted
Based on the outcome of your statistical test, you will have to decide whether your null hypothesis is supported or refuted.
In most cases you will use the p-value generated by your statistical test to guide your decision. And in most cases, your cutoff for refuting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.
Step 5: Present your findings
The results of hypothesis testing will be presented in the results and discussion sections of your research paper.
In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p-value). In the discussion, you can discuss whether your initial hypothesis was supported or refuted.
In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis.
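The five steps can be traced in a short sketch. It assumes a one-sample, two-sided z-test with a known population standard deviation, which is only one of many possible tests and is not named in the text; the sample values, the hypothesised mean of 50, and the standard deviation of 4 are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_one_sample(sample, mu0, sigma):
    """Two-sided one-sample z-test; sigma is the assumed known population SD."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Step 1: H0: population mean = 50, Ha: population mean != 50.
mu0, sigma, alpha = 50, 4, 0.05

# Step 2: hypothetical sample data.
sample = [52, 55, 49, 58, 54, 53, 57, 56, 51, 55]

# Step 3: perform the test.
z, p_value = z_test_one_sample(sample, mu0, sigma)

# Step 4: compare the p-value with the 0.05 cutoff.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

# Step 5: report the result.
print(f"sample mean = {mean(sample)}, z = {z:.3f}, p = {p_value:.4f}: {decision}")
```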
Point estimation-
· Point estimators are functions used to find an approximate value of a population parameter from random samples of the population.
· To calculate a point estimate, a point estimator uses the sample data to compute a statistic that serves as the best estimate of an unknown parameter of the population.
Properties of Point Estimators
The following are the important characteristics of point estimators:
· Bias
We can define the bias of a point estimator as the difference between the expected value of the estimator and the value of the parameter being estimated. When the expected value of the estimator equals the value of the parameter being estimated, the estimator is said to be unbiased.
Also, the closer the expected value of the estimator is to the value of the parameter being estimated, the smaller the bias.
· Consistency
Consistency tells us how close the point estimator stays to the value of the parameter as the sample grows in size. A point estimator generally requires a large sample size to be more consistent and more accurate.
We can also check whether a point estimator is consistent by looking at its expected value and its variance. For a point estimator to be consistent, its expected value should move towards the true value of the parameter and its variance should shrink as the sample size grows.
· Most Efficient or Unbiased
The most efficient point estimator is the one with the smallest variance among all the unbiased and consistent estimators. The variance measures the dispersion of the estimate, so the estimator with the smallest variance varies the least from one sample to another.
Generally, the efficiency of an estimator depends on the distribution of the population.
Point Estimate Formulas
Four different point estimate formulas can be used:
1. Maximum Likelihood Estimation (MLE)
2. Wilson Estimation
3. Laplace Estimation
4. Jeffrey Estimation
To calculate the point estimate, you will need the following values:
1. The number of successes, denoted by S: for example, the number of heads you got while tossing a coin.
2. The number of trials, denoted by T: in the coin example, it is the total number of tosses.
3. The confidence level: the probability that the interval around your best point estimate (the estimate plus or minus the margin of error) contains the true value.
4. The Z-score, denoted by z: it is determined by the confidence level.
Once you know all the values listed above, you can calculate the point estimate according to the following equations:
- Maximum Likelihood Estimation: MLE = S / T
· Laplace Estimation: Laplace equals (S + 1) / (T + 2)
· Jeffrey Estimation: Jeffrey equals (S + 0.5) / (T + 1)
· Wilson Estimation: Wilson equals (S + z²/2) / (T + z²)
Once all four values have been calculated, you need to choose the most accurate one. This should be done according to the following rules:
· If MLE ≤ 0.5, the Wilson Estimation is the most accurate.
· If 0.5 < MLE < 0.9, then the Maximum Likelihood Estimation is the most accurate.
· If 0.9 < MLE, then the smaller of Jeffrey and Laplace Estimations is said to be the most accurate.
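The four formulas and the selection rules can be put together as in the sketch below. The function names are hypothetical, the z-score is taken as the two-sided value for the chosen confidence level (an assumption), and the case MLE = 0.9, which the rules do not cover, is handled with the last rule (also an assumption).

```python
from statistics import NormalDist


def point_estimates(successes, trials, confidence=0.95):
    """Return the four point estimates of a proportion as a dict."""
    # Two-sided z-score for the confidence level (assumption: two-sided).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return {
        "mle": successes / trials,
        "laplace": (successes + 1) / (trials + 2),
        "jeffrey": (successes + 0.5) / (trials + 1),
        "wilson": (successes + z ** 2 / 2) / (trials + z ** 2),
    }


def best_point_estimate(successes, trials, confidence=0.95):
    """Pick the estimate named by the selection rules; returns (name, value)."""
    estimates = point_estimates(successes, trials, confidence)
    mle = estimates["mle"]
    if mle <= 0.5:
        return "wilson", estimates["wilson"]
    if mle < 0.9:
        return "mle", mle
    # MLE == 0.9 is not covered by the rules; treated like MLE > 0.9 here.
    name = min(("jeffrey", "laplace"), key=estimates.get)
    return name, estimates[name]


# Coin example: 4 heads in 10 tosses, then 7 in 10, then 19 in 20.
for s, t in ((4, 10), (7, 10), (19, 20)):
    name, value = best_point_estimate(s, t)
    print(f"S = {s}, T = {t}: use {name} = {value:.4f}")
```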
Interval estimation
Interval estimation, in statistics, is the evaluation of a parameter (for example, the mean or average) of a population by computing an interval, or range of values, within which the parameter is most likely to be located. Intervals are commonly chosen such that the parameter falls within them with a 95 or 99 percent probability, called the confidence coefficient. Hence, the intervals are called confidence intervals; the end points of such an interval are called the upper and lower confidence limits.
The interval containing a population parameter is established by calculating the corresponding statistic from values measured on a random sample taken from the population and by applying the knowledge (derived from probability theory) of the fidelity with which the properties of a sample represent those of the entire population.
The probability tells what percentage of the time the assignment of the interval will be correct but not what the chances are that it is true for any given sample. Of the intervals computed from many samples, a certain percentage will contain the true value of the parameter being sought.
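The last point can be checked with a simulation: build a 95 percent interval from many samples and count how often it contains the true mean. This is only a sketch. The normal population, its mean and standard deviation, the sample size, the seed, and the use of a known standard deviation (so that a z-interval applies) are all assumptions.

```python
import random
from statistics import NormalDist, fmean

random.seed(1)                    # fixed seed for repeatability (assumption)
MU, SIGMA, N = 50.0, 10.0, 30     # hypothetical population mean, sd, sample size
Z = NormalDist().inv_cdf(0.975)   # two-sided z-score for 95 percent confidence


def confidence_interval(sample):
    """Lower and upper confidence limits for the mean, with SIGMA known."""
    centre = fmean(sample)
    margin = Z * SIGMA / len(sample) ** 0.5
    return centre - margin, centre + margin


def coverage(repetitions=2000):
    """Share of simulated intervals that contain the true mean MU."""
    hits = 0
    for _ in range(repetitions):
        sample = [random.gauss(MU, SIGMA) for _ in range(N)]
        lower, upper = confidence_interval(sample)
        hits += lower <= MU <= upper
    return hits / repetitions


observed_coverage = coverage()
print(f"share of intervals containing the true mean: {observed_coverage:.3f}")
```

The share should come out close to 0.95, although any single interval either contains the true mean or does not.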
(b) How is sampling distribution done in large and small samples? (16)
-> Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population. The methodology used to sample from a larger population depends on the type of analysis being performed, but it may include simple random sampling or systematic sampling.
In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given random-sample-based statistic. If an arbitrarily large number of samples, each involving multiple observations (data points), were separately used in order to compute one value of a statistic (such as, for example, the sample mean or sample variance) for each sample, then the sampling distribution is the probability distribution of the values that the statistic takes on. In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.
Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the joint probability distribution of all the individual sample values.
The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n.
For example, consider a normal population with mean μ and variance σ². The mean of a random sample of size n taken from this population is itself normally distributed, with mean μ and variance σ²/n.
The mean of a sample from a population having a normal distribution is an example of a simple statistic taken from one of the simplest statistical populations. For other statistics and other populations the formulas are more complicated, and often they do not exist in closed form. In such cases the sampling distributions may be approximated through Monte Carlo simulations, bootstrap methods, or asymptotic distribution theory.
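A Monte Carlo approximation of the sampling distribution of the mean, for one small and one large sample size, might look like the sketch below. The normal population, its parameters, the two sample sizes, the number of repetitions, and the seed are assumptions made for illustration.

```python
import random
import statistics

random.seed(2)              # fixed seed for repeatability (assumption)
MU, SIGMA = 100.0, 15.0     # hypothetical population mean and standard deviation


def simulate_sample_means(n, repetitions=3000):
    """Draw `repetitions` samples of size n and return their means."""
    return [
        statistics.fmean([random.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(repetitions)
    ]


results = {}
for n in (5, 50):           # a small and a large sample size
    means = simulate_sample_means(n)
    results[n] = {
        "mean_of_means": statistics.fmean(means),
        "sd_of_means": statistics.stdev(means),
        "theoretical_sd": SIGMA / n ** 0.5,   # sigma / sqrt(n)
    }
    print(n, {key: round(value, 2) for key, value in results[n].items()})
```

For both sample sizes the simulated means centre on the population mean, and their spread is close to σ/√n, so the larger sample gives a much narrower sampling distribution.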
5. (a) Explain the Wilcoxon signed test. (16)
-> The Wilcoxon signed rank test (also called the Wilcoxon signed rank sum test) is a non-parametric test to compare data. When the word “non-parametric” is used in statistics, it doesn’t quite mean that you know nothing about the population. It usually means that you know the population data does not have a normal distribution. The Wilcoxon signed rank test should be used if the differences between pairs of data are non-normally distributed.
Two slightly different versions of the test exist:
- The Wilcoxon signed rank test compares your sample median against a hypothetical median.
- The Wilcoxon matched-pairs signed rank test computes the difference between each set of matched pairs, then follows the same procedure as the signed rank test to compare the sample against some median.
The term “Wilcoxon” is often used for either test. This usually isn’t confusing, as it should be obvious if the data is matched, or not matched.
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used to compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ (i.e., it is a paired difference test). It can be used as an alternative to the paired Student’s t-test (also known as “t-test for matched pairs” or “t-test for dependent samples”) when the distribution of the difference between two samples’ means cannot be assumed to be normally distributed. A Wilcoxon signed-rank test is a nonparametric test that can be used to determine whether two dependent samples were selected from populations having the same distribution.
Assumptions
1. Data are paired and come from the same population.
2. Each pair is chosen randomly and independently.
3. The data are measured on at least an interval scale when, as is usual, within-pair differences are calculated to perform the test (though it does suffice that within-pair comparisons are on an ordinal scale).
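The matched-pairs procedure described above (take within-pair differences, rank their absolute values, then sum the ranks by sign) can be sketched as follows. The data and function names are hypothetical, and dropping zero differences and giving tied values their average rank are common conventions assumed here rather than taken from the text.

```python
def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for value in set(values):
        first_position = ordered.index(value) + 1
        count = ordered.count(value)
        rank_of[value] = first_position + (count - 1) / 2
    return [rank_of[value] for value in values]


def wilcoxon_signed_rank(before, after):
    """Return (W plus, W minus, test statistic W) for matched pairs."""
    # Within-pair differences; zero differences are dropped (assumed convention).
    differences = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in differences])
    w_plus = sum(r for d, r in zip(differences, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(differences, ranks) if d < 0)
    return w_plus, w_minus, min(w_plus, w_minus)


# Hypothetical scores for six subjects measured twice.
before = [10, 12, 9, 14, 11, 13]
after = [12, 11, 13, 14, 16, 10]

w_plus, w_minus, w = wilcoxon_signed_rank(before, after)
print("W+ =", w_plus, " W- =", w_minus, " W =", w)
```

The smaller rank sum W is then compared with the critical value from a Wilcoxon signed-rank table for the number of non-zero differences.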
(b) Explain the Kruskal-Wallis Test. (16)
-> The Kruskal-Wallis test is the non-parametric alternative to the one-way ANOVA. Non-parametric means that the test doesn’t assume your data comes from a particular distribution. The H test is used when the assumptions for ANOVA aren’t met (like the assumption of normality). It is sometimes called the one-way ANOVA on ranks, as the ranks of the data values are used in the test rather than the actual data points.
The test determines whether the medians of two or more groups are different. Like most statistical tests, you calculate a test statistic and compare it to a distribution cut-off point. The test statistic used in this test is called the H statistic. The hypotheses for the test are:
- H₀: population medians are equal.
- H₁: population medians are not equal.
The Kruskal Wallis test will tell you if there is a significant difference between groups. However, it won’t tell you which groups are different.
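The H statistic can be computed from the ranks as in the sketch below, using the usual formula H = 12 / (N(N + 1)) × Σ(Rᵢ² / nᵢ) − 3(N + 1), where Rᵢ is the rank sum of group i, nᵢ its size and N the total number of observations. The data and function names are hypothetical, and no correction for ties is applied (a simplification).

```python
def average_ranks(values):
    """Rank values from 1 upwards; tied values share the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for value in set(values):
        first_position = ordered.index(value) + 1
        count = ordered.count(value)
        rank_of[value] = first_position + (count - 1) / 2
    return [rank_of[value] for value in values]


def kruskal_wallis_h(groups):
    """H statistic for a list of groups, without a tie correction."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)      # ranks over all observations together
    total = len(pooled)

    weighted_rank_sums = 0.0
    start = 0
    for group in groups:
        # Ranks belonging to this group, in the order the data were pooled.
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)

    return 12 / (total * (total + 1)) * weighted_rank_sums - 3 * (total + 1)


# Hypothetical data: three independent groups with three observations each.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(groups)
print(f"H = {h:.2f}")
```

With three groups, H is usually compared with the chi-square cut-off for 2 degrees of freedom (5.991 at the 5 percent level), although that approximation is rough for groups as small as these.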
Assumptions
When you choose to analyse your data using a Kruskal-Wallis H test, part of the process involves checking that the data can actually be analysed with this test. It is only appropriate to use a Kruskal-Wallis H test if your data “passes” the four assumptions required for the test to give a valid result. In practice, checking these assumptions adds a little time to the analysis (for example, a few extra steps in SPSS Statistics) and requires thinking a little more about your data, but it is not a difficult task.
With real-world data, as opposed to textbook examples, it is not uncommon for one or more of these assumptions to be violated (i.e., not met). Even when your data fails certain assumptions, there is often a solution to overcome this. The four assumptions are:
Assumption 1: Your dependent variable should be measured at the ordinal or continuous level (i.e., interval or ratio). Examples of ordinal variables include Likert scales (e.g., a 7-point scale from “strongly agree” through to “strongly disagree”), amongst other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from “Not very much”, to “It is OK”, to “Yes, a lot”). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.
Assumption 2: Your independent variable should consist of two or more categorical, independent groups. Typically, a Kruskal-Wallis H test is used when you have three or more categorical, independent groups, but it can be used for just two groups (although a Mann-Whitney U test is more commonly used for two groups). Example independent variables that meet this criterion include ethnicity (e.g., three groups: Caucasian, African American and Hispanic), physical activity level (e.g., four groups: sedentary, low, moderate and high), profession (e.g., five groups: surgeon, doctor, nurse, dentist, therapist), and so forth.
Assumption 3: You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves. For example, there must be different participants in each group with no participant being in more than one group. This is more of a study design issue than something you can test for, but it is an important assumption of the Kruskal-Wallis H test. If your study fails this assumption, you will need to use another statistical test instead of the Kruskal-Wallis H test (e.g., a Friedman test).
As the Kruskal-Wallis H test does not assume normality in the data and is much less sensitive to outliers, it can be used when these assumptions have been violated and the use of a one-way ANOVA is inappropriate. In addition, if your data is ordinal, a one-way ANOVA is inappropriate, but the Kruskal-Wallis H test is not. However, the Kruskal-Wallis H test does come with an additional data consideration.
Assumption 4: In order to know how to interpret the results from a Kruskal-Wallis H test, you have to determine whether the distributions in each group (i.e., the distribution of scores for each group of the independent variable) have the same shape (which also means the same variability).