Univariate analysis focuses on exploring and summarizing the distribution of individual variables in a dataset. For continuous variables, which can take any value within a range, it provides insight into the central tendency, spread, and shape of the variable’s distribution. Here are some common techniques for univariate analysis of continuous variables:
1. Descriptive Statistics:
- Measures of Central Tendency:
  - Mean: The average value of the variable.
  - Median: The middle value when the data is sorted.
  - Mode: The most frequently occurring value.
- Measures of Spread:
  - Range: The difference between the maximum and minimum values.
  - Variance: The average of the squared differences from the mean.
  - Standard Deviation: The square root of the variance.
- Percentiles:
  - Identify values below which a given percentage of observations fall (e.g., 25th percentile, 75th percentile).
- Example (using Python and pandas):
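    # Descriptive statistics for a continuous variable.
    # Assumes df is a pandas DataFrame with a numeric 'Age' column.
    # describe() reports count, mean, std, min, quartiles, and max.
    print(df['Age'].describe())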
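To tie the definitions above to code, the following sketch computes each statistic individually. The `ages` values are made up for illustration and do not come from any real dataset. One detail worth knowing: pandas' `var()` and `std()` default to the sample versions (`ddof=1`), so `ddof=0` is passed here to match the "average of the squared differences" definition.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
ages = pd.Series([22, 25, 25, 31, 38, 44, 60], name="Age")

mean_age = ages.mean()                 # arithmetic average
median_age = ages.median()             # middle value of the sorted data
mode_age = ages.mode().iloc[0]         # most frequent value (first one if tied)
age_range = ages.max() - ages.min()    # maximum minus minimum
pop_var = ages.var(ddof=0)             # population variance: mean squared deviation
pop_std = ages.std(ddof=0)             # square root of the population variance
q1, q3 = ages.quantile([0.25, 0.75])   # 25th and 75th percentiles

print(f"mean={mean_age:.2f}, median={median_age}, mode={mode_age}")
print(f"range={age_range}, variance={pop_var:.2f}, std={pop_std:.2f}")
print(f"25th percentile={q1}, 75th percentile={q3}")
```

For truly continuous data the mode is often uninformative, because exact repeats are rare; it becomes more useful after rounding or binning.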
2. Histograms:
- Visualize the distribution of a continuous variable using histograms. Histograms show the frequency or probability density of different value ranges.
- Example (using Python and matplotlib):
    import matplotlib.pyplot as plt

    # Histogram for Age variable
    plt.hist(df['Age'], bins=10, color='blue', edgecolor='black')
    plt.title('Histogram of Age')
    plt.xlabel('Age')
    plt.ylabel('Frequency')
    plt.show()
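Behind the plot, a histogram is only a count of observations per bin. The sketch below uses `numpy.histogram` to show both the frequency and the probability-density views without drawing anything. The ages and the bin edges are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical ages; made up for illustration.
ages = np.array([22, 25, 25, 31, 38, 44, 60])

# Four equal-width bins covering 20 to 60 (the edges are an arbitrary choice).
# Each bin includes its left edge; only the last bin also includes its right edge.
edges = np.array([20, 30, 40, 50, 60])

counts, _ = np.histogram(ages, bins=edges)                  # observations per bin
density, _ = np.histogram(ages, bins=edges, density=True)   # scaled so the total area is 1

for left, right, c, d in zip(edges[:-1], edges[1:], counts, density):
    print(f"{left}-{right}: count={c}, density={d:.4f}")

# Density values multiplied by the bin widths sum to 1.
total_area = np.sum(density * np.diff(edges))
print("total area:", total_area)
```

`plt.hist` accepts the same `bins` and `density=True` arguments, so the numbers printed here correspond to the bar heights it would draw.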
3. Kernel Density Plots:
- Kernel density plots provide a smooth estimate of the distribution. They are especially useful for visualizing its shape, since the curve does not depend on bin edges the way a histogram does, although it does depend on the chosen bandwidth.
- Example (using Python and seaborn):
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Kernel density plot for Age variable
    sns.kdeplot(df['Age'], fill=True, color='green')
    plt.title('Kernel Density Plot of Age')
    plt.xlabel('Age')
    plt.ylabel('Density')
    plt.show()
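To see what a kernel density estimate actually computes, the sketch below evaluates one numerically with `scipy.stats.gaussian_kde` instead of plotting it. The ages and the evaluation grid are assumptions made for illustration. The check at the end confirms the defining property of a density: it is non-negative and the area under it is approximately 1.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical ages; made up for illustration.
ages = np.array([22, 25, 25, 31, 38, 44, 60], dtype=float)

kde = gaussian_kde(ages)             # Gaussian kernels; bandwidth from Scott's rule (the default)
grid = np.linspace(-40, 120, 1601)   # deliberately wide grid so both tails are covered
density = kde(grid)                  # estimated density at each grid point

# Approximate the area under the curve with a simple Riemann sum.
area = np.sum(density) * (grid[1] - grid[0])
print(f"approximate area under the curve: {area:.4f}")
print(f"grid point with the highest density: {grid[np.argmax(density)]:.1f}")
```

A smaller bandwidth (for example `gaussian_kde(ages, bw_method=0.3)`) gives a bumpier curve that follows individual points more closely, while a larger one smooths detail away.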
4. Box Plots:
- Box plots provide a visual summary of the distribution, indicating the median, quartiles, and potential outliers.
- Example (using Python and seaborn):
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Box plot for Age variable
    sns.boxplot(x=df['Age'])
    plt.title('Box Plot of Age')
    plt.xlabel('Age')
    plt.show()
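The numbers a box plot draws can be computed directly. The sketch below derives the quartiles and applies the conventional 1.5 * IQR rule to flag potential outliers; that rule is also the default whisker setting in matplotlib and seaborn. The data are made up, with one deliberately extreme value.

```python
import pandas as pd

# Hypothetical ages; 95 is a deliberately extreme value.
ages = pd.Series([22, 25, 25, 31, 38, 44, 95], name="Age")

q1 = ages.quantile(0.25)     # lower edge of the box
median = ages.median()       # line inside the box
q3 = ages.quantile(0.75)     # upper edge of the box
iqr = q3 - q1                # interquartile range (height of the box)

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print("potential outliers:", outliers.tolist())
```

A flagged point is only a candidate for inspection; whether it is an error or a genuine extreme value depends on the data.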
5. Summary Tables:
- Create summary tables to present key statistics, such as mean, median, and standard deviation.
- Example (using Python and pandas):
    # Summary table for Age variable
    summary_table = df['Age'].agg(['mean', 'median', 'std']).to_frame()
    summary_table.columns = ['Age Statistics']
    print(summary_table)
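The same idea extends to several continuous variables at once. In the sketch below, the DataFrame and its `Income` column are hypothetical and exist only to show the layout: one row per variable, one column per statistic.

```python
import pandas as pd

# Hypothetical data; both columns are made up for illustration.
df = pd.DataFrame({
    "Age": [22, 25, 25, 31, 38, 44, 60],
    "Income": [28, 35, 33, 48, 52, 61, 58],  # in thousands, purely illustrative
})

# agg() returns one row per statistic; transposing puts variables on the rows.
summary = df.agg(["mean", "median", "std", "min", "max"]).T.round(2)
print(summary)
```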
6. Q-Q Plots (Quantile-Quantile Plots):
- Q-Q plots compare the quantiles of the variable against those of a theoretical distribution, most commonly the normal. Points that stray from the straight reference line suggest departures from normality.
- Example (using Python and scipy.stats):
    import matplotlib.pyplot as plt
    from scipy.stats import probplot

    # Q-Q plot for Age variable (probplot compares against a normal distribution by default)
    probplot(df['Age'], plot=plt)
    plt.title('Q-Q Plot of Age')
    plt.show()
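When called without the `plot` argument, `probplot` returns the quantile pairs together with a least-squares fit, including the correlation coefficient `r` between the sample and theoretical quantiles. The sketch below uses that value as a rough numeric companion to the visual check. Both samples are synthetic and generated with a fixed seed; the distributions and their parameters are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic samples: one drawn from a normal distribution, one strongly right-skewed.
normal_sample = rng.normal(loc=40, scale=10, size=500)
skewed_sample = rng.exponential(scale=10, size=500)

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r)).
(_, _), (_, _, r_normal) = probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = probplot(skewed_sample, dist="norm")

print(f"r for the normal sample: {r_normal:.4f}")
print(f"r for the skewed sample: {r_skewed:.4f}")
```

An `r` close to 1 means the points lie near the reference line. It is a summary rather than a formal test, so it complements the plot instead of replacing it.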
7. Skewness and Kurtosis:
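- Skewness measures the asymmetry of the distribution (positive values indicate a longer right tail, negative values a longer left tail), and kurtosis measures the “tailedness” of the distribution, that is, how much weight sits in the tails relative to a normal distribution.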
- Example (using Python and scipy.stats):
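    from scipy.stats import skew, kurtosis

    # Skewness and kurtosis for Age variable.
    # kurtosis() returns excess kurtosis by default (0 for a normal distribution).
    # Both return nan if the column has missing values; pass nan_policy='omit' to skip them.
    skewness = skew(df['Age'])
    kurt = kurtosis(df['Age'])
    print(f'Skewness: {skewness:.2f}')
    print(f'Kurtosis: {kurt:.2f}')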
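Both statistics are ratios of central moments, which the sketch below computes by hand and compares with SciPy's defaults (`bias=True`, and `fisher=True` for kurtosis). The ages are made up for illustration.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical ages; made up for illustration.
ages = np.array([22, 25, 25, 31, 38, 44, 60], dtype=float)

deviations = ages - ages.mean()
m2 = np.mean(deviations ** 2)   # second central moment (population variance)
m3 = np.mean(deviations ** 3)   # third central moment
m4 = np.mean(deviations ** 4)   # fourth central moment

manual_skew = m3 / m2 ** 1.5             # standardized third moment
manual_excess_kurt = m4 / m2 ** 2 - 3    # standardized fourth moment minus 3 (normal = 0)

print(f"skewness: manual={manual_skew:.4f}, scipy={skew(ages):.4f}")
print(f"excess kurtosis: manual={manual_excess_kurt:.4f}, scipy={kurtosis(ages):.4f}")
```

The skewness here is positive because the single large value (60) stretches the right tail. Passing `bias=False` to either SciPy function applies a small-sample correction, so the result would then differ from the manual formula.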
Univariate analysis provides a foundational understanding of the characteristics of continuous variables. It helps identify potential issues, such as outliers or skewed distributions, and informs further analysis and modeling decisions.