Sampling and Bootstrapping

The first topic I wanted to review was sampling methods, particularly resampling, since this was on the list of topics provided to me. For the purposes of this post, I’ll be reviewing the topics presented in the DataCamp course Sampling in Python, which touches on sampling, resampling (or sampling with replacement), selection bias, sample size and population parameters, the Central Limit Theorem, bootstrapping, standard error, and confidence intervals. The course is broken into four sections, which I’ll follow here; the most important is the final section, which focuses on bootstrapping.

1. Sampling basics, selection bias, and random number generating functions

The first chapter of the Sampling in Python DataCamp course introduces sampling gently using the pandas .sample method. At its simplest, this method takes a single argument, n, the number of rows you want in your sample. You can also use the random_state parameter to set a seed (pick any integer) to make your random sample reproducible. As an example, if I had a DataFrame, df, and I wanted a sample of 1000 rows from it, with a reproducible state, I would write the following:

df.sample(n=1000, random_state=13)

In this section, James Chapman, the course instructor, also differentiates between two important terms: the population and the sample. The population refers to the entire group of interest, and the sample is, of course, a smaller sub-group that we are choosing to work with. Usually this is because the larger group is either unavailable (it is expensive or impossible to measure the population), or working with the full population dataset would be costly and data-intensive.

Next, Chapman goes on to explain a bit about selection bias, though I feel this topic is not covered in as much depth as is probably necessary for the Google interview. He uses the example of a presidential poll that was conducted in 1936 for the election between Landon and Roosevelt. The poll predicted Landon would win 57% of the votes and Roosevelt 43%. When the election rolled around, however, Roosevelt won by a landslide, earning 62% of the votes while Landon got 38%. The reason for this, Chapman explained, is that the poll was conducted via telephone. At the time, only the very wealthy had phones in their homes. Since household income is a contributing factor in which party one votes for, this introduced bias into the sample, leading to results that did not reflect the true population value. Chapman then goes on to note that this is why you should always use a random sample instead of a convenience sample of your dataset.

Finally, the author takes time to explore the different types of random samples you can generate with numpy, each from a different statistical distribution. They are: Beta, Binomial, Chi-Squared, Exponential, F, Gamma, Geometric, Hypergeometric, Lognormal, Negative binomial, Normal, Poisson, t, and Uniform. I think it would be good at a later point to come back and review each of these distributions, what their probability density functions (pdfs) look like, and what they’re primarily used for, since I don’t think I have an excellent grasp on all of them. For the time being, though, it’s fair to say that the most commonly used are the normal and uniform distributions, with which I’m very familiar. The next few to check up on would probably be the binomial, chi-squared, exponential, and Poisson distributions.

2. The Four Random Sampling Methods

The next section focuses on introducing and implementing the four random sampling methods in Python. These are:

  • Simple
  • Systematic
  • Stratified
  • Cluster

Simple Sampling Method

This is the most basic form of sampling and is performed as shown above in the previous section. Simple sampling means choosing randomly, without replacement, from your population until you reach the desired number of samples.

Systematic Sampling Method

Systematic sampling isn’t really “random”; rather, it chooses every nth observation, where n is $\frac{\text{total number of observations}}{\text{number of samples you’d like}}$. So with 100 observations and 20 desired samples, you would take the 5th, 10th, 15th, and so on up to the 100th. This method has the drawback that if the data is ordered in any meaningful way by a certain attribute, you will introduce bias into your sample.
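
As a minimal sketch (assuming a DataFrame df and a hypothetical desired sample size stored in sample_size), systematic sampling could look something like this:

# spacing between chosen rows
interval = len(df) // sample_size
# take every interval-th row using positional indexing
df_systematic = df.iloc[::interval]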

Stratified Sampling Method

Stratified sampling allows you to sample your data by subgroups. You can either take an even split of the data so that each subgroup contributes the same number of rows, or take a fractional split where each subgroup keeps the same relative size it has in the population dataset. To do this you would call .groupby on your attribute of interest and then pass either frac or n as your parameter to .sample.
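
For instance, here is a rough sketch of both variants, assuming a DataFrame df with a grouping column 'attribute' (the fraction and count used here are just hypothetical):

# proportional split: 10% of each subgroup
df_strat = df.groupby('attribute').sample(frac=0.1, random_state=13)

# equal split: 50 rows from each subgroup
df_equal = df.groupby('attribute').sample(n=50, random_state=13)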

In this section Chapman also details weighted random sampling, where each row is given a pre-determined weight that changes how likely it is to be selected.
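
A minimal sketch of this, assuming a hypothetical condition that should make some rows twice as likely to be picked:

import numpy as np

# rows meeting the condition get weight 2, all others get weight 1
condition = df['attribute'] == 'some_value'
df['weight'] = np.where(condition, 2, 1)
df_weighted = df.sample(frac=0.1, weights='weight', random_state=13)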

Cluster Sampling Method

Cluster sampling is a twist on stratified sampling in which we use only a random sample of the subgroups, and then take a random sample within each of those subgroups. This allows us to use less data. To achieve this, you would do the following, where df is your DataFrame, attribute is your attribute of interest, and num_cat is the number of subgroups you wish to sample.

import random

# all unique subgroup values for the attribute of interest
attribute_pop = list(df['attribute'].unique())
# randomly choose num_cat subgroups (the clusters)
attribute_samp = random.sample(attribute_pop, k=num_cat)
# keep only the rows belonging to the chosen subgroups
attribute_condition = df['attribute'].isin(attribute_samp)
df_cluster = df[attribute_condition].copy()
# drop the now-empty category levels (assumes 'attribute' is a categorical column)
df_cluster['attribute'] = df_cluster['attribute'].cat.remove_unused_categories()
# finally, take a random sample within each remaining subgroup
df_cluster.groupby('attribute').sample(n=your_n, random_state=your_random_state)

3. Sample Size & Population Parameters, Standard Error, Sampling Distributions, and the CLT

In this chapter, Chapman introduces the concept of the population parameter versus the point estimate. The population parameter is just a fancy term for a statistic calculated on the whole population, such as the mean. The point estimate is the same statistic, but calculated on the sample. He demonstrates that as you increase the sample size, the point estimate approximates the population parameter more and more closely.

Chapman then goes on to explain a common way to measure the difference between the point estimate and the population parameter: the relative error, given by $100 \times \frac{|\text{population mean} - \text{sample mean}|}{\text{population mean}}$. As the sample size increases, this number decreases, eventually reaching zero when the sample size is the same as the population size. In the beginning, adding just a few more rows to your sample can greatly reduce the relative error. However, as your number of samples increases, a few additional rows provide less and less improvement.
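
A minimal sketch of how the relative error tends to shrink as the sample size grows, assuming a DataFrame df with a numeric column 'attribute':

pop_mean = df['attribute'].mean()

for n in [10, 100, 1000]:
  sample_mean = df.sample(n=n, random_state=13)['attribute'].mean()
  rel_error = 100 * abs(pop_mean - sample_mean) / pop_mean
  print(n, rel_error)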

From there, Chapman introduces sampling distributions: the distribution you get by sampling over and over again from the same population (with the same sample size) and calculating the same point estimate each time, producing a distribution of replicates. However, you cannot always produce the exact sampling distribution, since enumerating every possible sample quickly becomes infeasible. In that case, you simulate it with an approximate sampling distribution; with enough iterations, the approximate sampling distribution will look close enough to the exact one.
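
A minimal sketch of building an approximate sampling distribution of the mean, again assuming a DataFrame df with a numeric column 'attribute':

import numpy as np

sample_means = []
# repeatedly draw samples of the same size and record the mean of each
for i in range(500):
  sample_means.append(np.mean(df.sample(n=100)['attribute']))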

Chapman then demonstrates an important concept in statistics: the Central Limit Theorem. This theorem states that averages of independent samples have approximately normal distributions. This is important to know as we move farther in our studies towards hypothesis testing, where a key tenet is that observations are assumed to be independent, so that further calculations can be performed with the assumption that the averages are normally distributed. Chapman also notes that as the sample size increases, the distribution of the averages gets closer to being normally distributed. Additionally, the width of the sampling distribution gets narrower.

Finally, Chapman shows by example that if you divide the population standard deviation by the square root of the sample size, you approximate the standard deviation of the sampling distribution. He refers to this standard deviation of the sampling distribution as the standard error, which is a different quantity from the relative error described above.
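
A quick sketch of that relationship, reusing the hypothetical sample_means list from above (with its sample size of 100):

import numpy as np

pop_std = df['attribute'].std(ddof=0)
print(pop_std / np.sqrt(100))        # population std over sqrt(n)
print(np.std(sample_means, ddof=1))  # std of the sampling distribution (standard error)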

This was a tricky chapter with a lot of information to cover, so I highly recommend taking the course and walking through the videos and exercises yourself in case I missed anything in this summary.

4. Bootstrapping and Confidence Intervals

Finally, the chapter we’ve all been waiting for! This final chapter opens by describing the difference between sampling with replacement (resampling) and sampling without replacement. Next, it shows how to sample with replacement, which is done by simply adding the parameter replace=True to the .sample method.
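
For example, a resample of 5 rows from a hypothetical sample DataFrame df_sample:

# with replace=True, the same row can be drawn more than once
df_sample.sample(n=5, replace=True, random_state=13)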

Bootstrapping is then introduced as “the opposite of sampling from a population”. Using that logic, bootstrapping would then mean “creating a (theoretical) population by resampling multiple times from an existing sample”.

Here is the process for bootstrapping:

  1. Take a random resample, with replacement, that is the same size as the original sample.
  2. Calculate the statistic of interest for this bootstrap sample.
  3. Repeat steps 1 and 2 many times.

In code, this would look something like this:

import numpy as np

mean_attribute_1000 = []

# 1000 bootstrap replicates of the sample mean of 'attribute'
for i in range(1000):
  mean_attribute_1000.append(
    np.mean(df_sample.sample(frac=1, replace=True)['attribute'])
  )

In the above code snippet, we use frac=1 to resample the full size of the original sample, and replace=True to ensure we are sampling with replacement.

Next, the course goes on to show that the bootstrapped mean is usually very close to the original sample mean. However, if there is any bias or too small of a sample, this mean may not be very close to the true population mean.

On the other hand, the standard deviation of the bootstrap distribution will not resemble the standard deviation of the original sample. Instead, as shown in the previous chapter, if you multiply that standard deviation (the standard error) by the square root of the sample size, you get an estimate of the population standard deviation. This is an incredibly useful aspect of bootstrapping.
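
A minimal sketch of that estimate, reusing mean_attribute_1000 from the bootstrap loop above:

import numpy as np

# standard error = std of the bootstrap distribution of the mean
standard_error = np.std(mean_attribute_1000, ddof=1)
# scale up by the square root of the original sample size
est_pop_std = standard_error * np.sqrt(len(df_sample))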

The final part of the course delves into confidence intervals, another important topic for interview prep. Three methods for calculating confidence intervals are shown:

  1. Standard Deviation
  2. Quantile method (e.g. 5th to 95th percentile)
  3. Inverse cumulative distribution function.

I’ll skip over the first two as they’re fairly self-explanatory, but the last one is quite interesting. To do this you take the PDF of the normal distribution, which is simply the bell curve we’re all familiar with. The CDF of the normal distribution is then the cumulative area under that curve up to a given value. Flipping the x and y axes of the CDF gives you the inverse CDF (also called the quantile function or percent point function). You can compute it with the following code:

from scipy.stats import norm

# ppf is the inverse CDF (percent point function) of the normal distribution
norm.ppf(quantile, loc=0, scale=1)

In this case, loc and scale correspond to the mean and standard deviation of the normal curve. We would then call norm.ppf using the mean of the bootstrap distribution for loc and its standard deviation (the standard error) for scale. Providing lower and upper quantiles such as 0.025 and 0.975 would then give us the lower and upper bounds of our confidence interval. Remember, confidence intervals are a way of estimating an unknown value; in this case, it is the mean of the population.
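
A minimal sketch of the quantile and inverse-CDF methods for a 95% confidence interval, reusing mean_attribute_1000 from the bootstrap loop above:

import numpy as np
from scipy.stats import norm

# quantile method: take the middle 95% of the bootstrap distribution
lower_q = np.quantile(mean_attribute_1000, 0.025)
upper_q = np.quantile(mean_attribute_1000, 0.975)

# inverse CDF method: assume the bootstrap distribution is roughly normal
point_estimate = np.mean(mean_attribute_1000)
standard_error = np.std(mean_attribute_1000, ddof=1)
lower_i = norm.ppf(0.025, loc=point_estimate, scale=standard_error)
upper_i = norm.ppf(0.975, loc=point_estimate, scale=standard_error)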



Well, that’s all I have for today! Hope this information was useful in your job hunt or simply in honing your data science skills.

–Em

This post is part of a series on Prepping for the Google Product Analyst Interview.
Written on June 23, 2022