Basic Statistics

1 Statistics

One of the most important tasks in analyzing any time series is to describe and summarize the data in forms that readily convey their important characteristics.

Key statistical characteristics often described include: a measure of the central tendency of the data, a measure of spread or variability, a measure of the symmetry of the data distribution, and perhaps estimates of extremes such as some large or small percentile (Snedecor and Cochran 1980).

1.1 Population and Sample

According to Helsel et al. (2020), the data about which a statement or summary is to be made are called the ‘population’ or sometimes the ‘target population’. It may be physically or economically impossible to collect all data of interest. Instead, a subset of the data, called a ‘sample’, is selected and measured in such a way that conclusions about the sample may be extended to the entire population.

1.2 Measures of Location

In statistics, measures of location or central tendency are used to summarize and describe the central or typical value in a dataset. Here are five common measures of location (Machiwal and Jha 2012):

  • Mean: The mean, often referred to as the average, is calculated by summing all the values in a dataset and dividing by the number of values. It represents the balance point of the data.

  • Median: The median is the middle value when the data is sorted in ascending order. It’s less sensitive to extreme values (outliers) than the mean and is a good measure of the central value when the data is skewed.

  • Mode: The mode is the value that appears most frequently in the dataset. There can be multiple modes in a dataset, and it’s useful for categorical or discrete data.

  • Geometric Mean: The geometric mean suits positive-valued data that combine multiplicatively or are positively skewed, such as financial returns or growth rates. It’s calculated by taking the nth root of the product of n values.

  • Trimmed Mean: The trimmed mean is a variation of the mean that discards a specified percentage of the values from each tail of the distribution before averaging the rest. This makes it more robust to outliers.

Among these measures, the mean and median are the most widely used for summarizing data.
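The mode and the trimmed mean from the list above are not computed elsewhere in this document, so the following minimal sketch illustrates them using only the Python standard library. The `sample` values and the `trimmed_mean` helper are hypothetical and introduced only for illustration; the helper assumes the same proportion is cut from each tail.

```python
from statistics import mean, multimode


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    return mean(ordered[k:len(ordered) - k])


# Hypothetical sample with one outlier (100) and two modes (3 and 5)
sample = [1, 3, 3, 4, 5, 5, 6, 7, 8, 100]

print(mean(sample))               # 14.2, pulled upward by the outlier
print(multimode(sample))          # [3, 5]
print(trimmed_mean(sample, 0.1))  # 5.125, the outlier is discarded
```

With 10% trimmed from each tail, the single extreme value no longer dominates the result, which is the robustness property described in the list.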

1.2.1 Arithmetic Mean

The arithmetic mean (\(\overline{{x}}\)) is calculated by summing all data values \(x_{\mathrm{i}}\) and dividing the sum by the sample size \(n\):

\[ {\overline{{x}}}=\sum_{i=1}^{n}{\frac{x_{\mathrm{i}}}{n}} \]
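As a small illustration of the formula, the sketch below computes the mean of the sample dataset used in the script section of this document and checks the "balance point" property, namely that the deviations from the mean sum to zero (up to floating-point rounding). The variable names are illustrative.

```python
# Sample dataset taken from the script section of this document
data = [12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50]

n = len(data)
x_bar = sum(data) / n  # arithmetic mean: sum of the values divided by n

# Deviations from the mean cancel out, which is why the mean is the balance point
deviation_sum = sum(x - x_bar for x in data)

print(round(x_bar, 4))               # 29.0909
print(abs(deviation_sum) < 1e-9)     # True
```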

1.2.2 Median

The median is the middle value of a dataset when the data are ordered from smallest to largest. It is a robust measure of central tendency that is largely unaffected by extreme values (outliers).

For an ordered dataset with ‘n’ values:

  • If ‘n’ is odd, the median is the middle value: \[ \text{M} = x_{\frac{n+1}{2}} \]

  • If ‘n’ is even, the median is the average of the two middle values: \[ \text{M} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} \]
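The two cases above can be sketched as a short function. This is a minimal illustration rather than a replacement for library routines; the function name `median_value` is hypothetical, and the indices are shifted by one because Python lists are 0-based while the formulas are 1-based.

```python
def median_value(values):
    """Median following the odd/even rule for an ordered dataset."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: x_{(n+1)/2} in 1-based notation is ordered[mid] in 0-based
        return ordered[mid]
    # n even: average of x_{n/2} and x_{n/2 + 1}
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50]  # n = 11
even_sample = odd_sample[:-1]                              # n = 10

print(median_value(odd_sample))   # 28
print(median_value(even_sample))  # 26.0
```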

1.2.3 Geometric Mean

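The geometric mean (GM) is often used as a summary statistic for positively skewed datasets (Machiwal and Jha 2012). It is the exponential of the mean of the log-transformed values, so it is defined only for strictly positive data: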

\[ {\mathrm{GM}}={\mathrm{exp}}\left[\sum_{i=1}^{n}{\frac{\ln\left(x_{\mathrm{i}}\right)}{n}}\right] \]

For positively skewed data series, the GM is usually fairly close to the median of the series. In fact, the GM is an unbiased estimate of the median when the logarithms of the data are symmetrically distributed (Helsel et al. 2020).
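The relationship between the GM and the median can be illustrated with a small sketch. The `skewed` values are a hypothetical example chosen so that their logarithms are exactly symmetric; the function name `geometric_mean` is likewise illustrative, and the positivity check is an assumption about how non-positive inputs should be handled.

```python
import math
from statistics import mean, median


def geometric_mean(values):
    """GM as the exponential of the mean of the natural logarithms."""
    if any(x <= 0 for x in values):
        raise ValueError("the geometric mean requires strictly positive values")
    return math.exp(sum(math.log(x) for x in values) / len(values))


# Hypothetical positively skewed sample whose logarithms are symmetric
skewed = [1, 2, 4, 8, 16]

print(mean(skewed))            # 6.2, pulled toward the long right tail
print(median(skewed))          # 4
print(geometric_mean(skewed))  # approximately 4.0, matching the median
```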

1.3 Measures of Spread/Dispersion

1.3.1 Variance and Standard Deviation

The ‘sample variance’ (\(s^{2}\)) and its square root, the ‘sample standard deviation’ (\(s\)), are the classical and most common measures of spread (dispersion) (Machiwal and Jha 2012).

\[ s^{2}=\sum_{i=1}^{n}\frac{\left(x_{\mathrm{i}}-{\overline{{x}}}\,\right)^{2}}{\left(n-1\right)} \]

\[ s={\sqrt{\sum_{i=1}^{n}{\frac{\left(x_{i}-{\overline{{x}}}\,\right)^{2}}{\left(n-1\right)}}}} \]
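The following sketch evaluates both formulas directly, with the \(n-1\) denominator, for the sample dataset used in the script section of this document. The function names are illustrative; the standard library functions `statistics.variance` and `statistics.stdev` use the same denominator and serve as a cross-check.

```python
import math
from statistics import stdev, variance


def sample_variance(values):
    """Sum of squared deviations from the mean divided by (n - 1)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50]

print(round(sample_variance(data), 4))  # 153.8909
print(round(sample_std(data), 4))       # 12.4053

# Cross-check against the standard library (also n - 1 in the denominator)
print(math.isclose(sample_variance(data), variance(data)))  # True
print(math.isclose(sample_std(data), stdev(data)))          # True
```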

1.3.2 Robust Measures

Robust measures of spread about the mean include the ‘range’, ‘interquartile range’, ‘coefficient of variation’ and ‘median absolute deviation’ (Machiwal and Jha 2012).

1.3.2.1 Quantiles

Quantiles are values that divide a dataset into equally sized subsets. Common quantiles include quartiles (dividing data into four parts), quintiles (dividing into five parts), deciles (dividing into ten parts), and percentiles (dividing into one hundred parts).

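The quantile for a probability \(q\) (with \(0 < q < 1\)) can be computed as follows:

  • Sort the dataset in ascending order.
  • Compute the position ‘i’ as

\[ i = (n+1) \cdot q \]

  • If ‘i’ is an integer, the quantile is

\[ \text{Q}(q) = x_i \]

  • If ‘i’ is not an integer, the quantile is interpolated as

\[ \text{Q}(q) = x_{\lfloor i \rfloor} + (i - \lfloor i \rfloor) \cdot (x_{\lfloor i \rfloor + 1} - x_{\lfloor i \rfloor}) \]

Statistical software may use other position rules. R’s quantile() and NumPy’s percentile(), for example, default to \(i = (n-1) \cdot q + 1\), which can give slightly different values.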

Quantiles are used to understand the spread and distribution of data and are often used in box plots and histograms to visualize data distribution.
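The quantile procedure can be sketched as follows, assuming the position \(i = (n+1) \cdot q\) is left unrounded so that the interpolation step can apply. The function name `quantile` is illustrative, and clamping the position to the range 1 to n for very small or very large \(q\) is an added assumption that the text does not specify.

```python
import math


def quantile(values, q):
    """Quantile at probability q using the 1-based position i = (n + 1) * q."""
    ordered = sorted(values)
    n = len(ordered)
    i = (n + 1) * q
    i = min(max(i, 1), n)  # assumption: clamp positions that fall outside 1..n
    lower = math.floor(i)
    frac = i - lower
    if frac == 0:
        return ordered[lower - 1]  # integer position: take x_i directly
    # non-integer position: interpolate between x_floor(i) and x_floor(i)+1
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


data = [12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50]

print(quantile(data, 0.25))  # 18
print(quantile(data, 0.50))  # 28
print(quantile(data, 0.75))  # 40
print(quantile(data, 0.10))  # approximately 12.6 (interpolated)
```

Library routines that use a different position rule can return different quartiles for the same data.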

1.3.2.2 Coefficient of Variation

The coefficient of variation (CV) gives a normalized measure of spread about the mean, expressed as a percentage, and is estimated as (Machiwal and Jha 2012):

\[ \text{CV}(\%)={\frac{s}{\bar{x}}}\times100 \]

Hydrologic variables with larger CV values are more variable than those with smaller values. Wilding (in Nielsen and Bouma 1985) suggested a classification scheme for the extent of variability of soil properties based on their CV values, in which CV values of 0-15%, 16-35% and >36% indicate little, moderate and high variability, respectively.
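A minimal sketch of the CV and the variability classes is given below. The function names are hypothetical. Because the class limits quoted above leave gaps between the ranges, the cut-offs used here (at most 15, at most 35, otherwise high) are an assumption made only so that every value falls into a class.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean, times 100."""
    return stdev(values) / mean(values) * 100


def variability_class(cv_percent):
    """Variability class; the exact cut-offs are an assumption (see lead-in)."""
    if cv_percent <= 15:
        return "little"
    if cv_percent <= 35:
        return "moderate"
    return "high"


data = [12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50]
cv = coefficient_of_variation(data)

print(round(cv, 2))           # 42.64
print(variability_class(cv))  # high
```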

1.3.2.3 Quartile Coefficient

The quartile coefficient of dispersion (QC) is another descriptive statistic that measures dispersion and is used to make comparisons within and between datasets. It is computed from the first (\(P_{25}\)) and third (\(P_{75}\)) quartiles of each dataset (Machiwal and Jha 2012):

\[ \text{QC}={\frac{P_{75}-P_{25}}{P_{75}+P_{25}}} \]
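The QC can be sketched with the standard library, as shown below. The quartiles come from `statistics.quantiles`, whose default "exclusive" method places quantiles at position \((n+1) \cdot q\); other quartile conventions would give a somewhat different QC. The function name is illustrative.

```python
from statistics import quantiles


def quartile_coefficient(values):
    """QC = (P75 - P25) / (P75 + P25)."""
    p25, _p50, p75 = quantiles(values, n=4)  # default method: "exclusive"
    return (p75 - p25) / (p75 + p25)


data = [12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50]

print(quantiles(data, n=4))                   # [18.0, 28.0, 40.0]
print(round(quartile_coefficient(data), 4))   # 0.3793
```

Because the QC is dimensionless, it can be compared across datasets measured in different units.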

1.4 Measures of Skewness

Hydrologic time series data are usually skewed, which means that data in the time series are not symmetric around the mean or median, with extreme values extending out longer in one direction (Machiwal and Jha 2012).

1.4.1 Coefficient of Skewness

The coefficient of skewness (\(g\)) is defined as the adjusted third moment about the mean divided by the cube of the standard deviation (\(s\)), and is mathematically expressed as follows:

\[ g={\frac{n}{\left(n-1\right)\,\left(n-2\right)}}\sum_{i=1}^{n}{\frac{\left(x_{i}-{\overline{{x}}}\,\right)^{3}}{s^{3}}} \]

A positively skewed hydrologic time series, whose distribution has an extended right tail, has a positive coefficient of skewness, whereas a negatively skewed series with an extended left tail has a negative coefficient of skewness (Machiwal and Jha 2012).
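The sketch below evaluates the adjusted formula directly, using the sample standard deviation \(s\) with the \(n-1\) denominator. The function name is illustrative. Moment-based routines that omit the \(n/((n-1)(n-2))\) adjustment return a smaller magnitude for the same data.

```python
from statistics import mean, stdev


def skewness_coefficient(values):
    """Adjusted coefficient of skewness g (requires n >= 3)."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation, denominator n - 1
    cubed = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


data = [12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50]
g = skewness_coefficient(data)

print(round(g, 4))  # 0.3224, positive: the right tail is longer

# Mirroring the data about zero flips the sign of g
mirrored = [-x for x in data]
print(round(skewness_coefficient(mirrored), 4))  # -0.3224
```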

1.4.2 Quartile Skew Coefficient (Robust Measure)

A robust measure of skewness is the ‘quartile skew coefficient (QS)’, defined as the difference between the distances of the upper and lower quartiles from the median, divided by the interquartile range (IQR) (Kenney 1939). Mathematically, it is expressed as:

\[ \text{QS}=\frac{\left(P_{75}-P_{50}\,\right)-\left(P_{50}-P_{25}\,\right)}{P_{75}-P_{25}} \]
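A minimal sketch of the QS follows. As an assumption, the quartiles are taken from `statistics.quantiles` with its default "exclusive" method (position \((n+1) \cdot q\)); the function name is illustrative.

```python
from statistics import quantiles


def quartile_skew(values):
    """QS = ((P75 - P50) - (P50 - P25)) / (P75 - P25)."""
    p25, p50, p75 = quantiles(values, n=4)  # default method: "exclusive"
    return ((p75 - p50) - (p50 - p25)) / (p75 - p25)


data = [12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50]

# Quartiles are 18, 28 and 40, so QS = (12 - 10) / 22
print(round(quartile_skew(data), 4))  # 0.0909
```

Since only the quartiles enter the formula, changing the most extreme values of the dataset leaves the QS unchanged, which is what makes it robust.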

2 Script (R & Python)

# R version
library(moments)

# Sample dataset (replace with your data)
data <- c(12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50)

# Calculate Mean
mean_value <- mean(data)

# Calculate Median
median_value <- median(data)

# Calculate Variance (sample variance, denominator n - 1)
variance_value <- var(data)

# Calculate Standard Deviation (sample standard deviation)
std_deviation_value <- sd(data)

# Calculate Quantiles (25th, 50th, and 75th percentiles)
quantiles_values <- quantile(data, probs = c(0.25, 0.5, 0.75))

# Calculate Skewness (moment-based; moments::skewness does not apply
# the n / ((n - 1)(n - 2)) adjustment)
skewness_value <- moments::skewness(data)

# Print the results (console output shown in the comment below each call)
cat("Mean:", mean_value, "\n")
# Mean: 29.09091
cat("Median:", median_value, "\n")
# Median: 28
cat("Variance:", variance_value, "\n")
# Variance: 153.8909
cat("Standard Deviation:", std_deviation_value, "\n")
# Standard Deviation: 12.40528
cat("Quantiles (25th, 50th, 75th percentiles):", quantiles_values, "\n")
# Quantiles (25th, 50th, 75th percentiles): 20 28 37.5
cat("Skewness:", skewness_value, "\n")
# Skewness: 0.2766313

# Python version
import numpy as np
from scipy.stats import skew

# Sample dataset (replace with your data)
data = np.array([12, 15, 18, 22, 24, 28, 31, 35, 40, 45, 50])

# Calculate Mean
mean_value = np.mean(data)

# Calculate Median
median_value = np.median(data)

# Calculate Variance (ddof=1 gives the sample variance, denominator n - 1)
variance_value = np.var(data, ddof=1)

# Calculate Standard Deviation (ddof=1 gives the sample standard deviation)
std_deviation_value = np.std(data, ddof=1)

# Calculate Quantiles (25th, 50th, and 75th percentiles)
quantiles_values = np.percentile(data, [25, 50, 75])

# Calculate Skewness (moment-based; the default bias=True applies no adjustment)
skewness_value = skew(data)

# Print the results (console output shown in the comment below each call)
print(f"Mean: {mean_value:.4f}")
# Mean: 29.0909
print("Median:", median_value)
# Median: 28.0
print(f"Variance: {variance_value:.4f}")
# Variance: 153.8909
print(f"Standard Deviation: {std_deviation_value:.4f}")
# Standard Deviation: 12.4053
print("Quantiles (25th, 50th, 75th percentiles):", quantiles_values)
# Quantiles (25th, 50th, 75th percentiles): [20.  28.  37.5]
print(f"Skewness: {skewness_value:.4f}")
# Skewness: 0.2766

References

Helsel, Dennis R., Robert M. Hirsch, Karen R. Ryberg, Stacey A. Archfield, and Edward J. Gilroy. 2020. “Statistical Methods in Water Resources.” 4-A3. Techniques and Methods. U.S. Geological Survey. https://doi.org/10.3133/tm4A3.
Kenney, John F. 1939. Mathematics of Statistics, Part One. Toronto and New York: D. Van Nostrand Company, Inc.
Machiwal, Deepesh, and Madan Kumar Jha. 2012. Hydrologic Time Series Analysis: Theory and Practice. New Delhi: Capital Publishing Company.
Nielsen, D. R., and Johan Bouma. 1985. “Soil Spatial Variability.” Pudoc Wageningen, January, 2–30.
Snedecor, George W., and William G. Cochran. 1980. Statistical Methods. 7th ed. Iowa State University Press. ISBN 0813815606.