A Pandas surprise - NaNs and groupby

import pandas as pd
import numpy as np
from pandas.errors import UnsupportedFunctionCall
I figured out something about pandas today that I was very surprised by: aggregating after .groupby on a pd.DataFrame automatically ignores NaN values. This is intended behavior, but sometimes you actually want the NaN to show up in the result, e.g. to check whether your data frame is correct and to find possible corruptions.
Here is a little example:
# Create a sample DataFrame:
DF = pd.DataFrame.from_dict({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                             'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                             'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})
Averaging the entries in DF, we would expect a NaN in group (a, d), but we get 3.0!
DF.groupby(['g1', 'g2']).mean()
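The output looks roughly like this; the 3.0 in group (a, d) comes from skipping the NaN and averaging the single remaining value 3:

         d1
g1 g2
a  c    0.5
   d    3.0
b  c    4.5
   d    6.5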
If you apply the pandas .mean() method on a DataFrame directly, you can specify skipna=False in the function call. This, unfortunately, doesn't work after using .groupby.
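For comparison, on the plain data this works as expected (a quick check, selecting the numeric column d1):

DF['d1'].mean(skipna=False)  # nan, because the NaN is not skipped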
# This creates an error
try:
    DF.groupby(['g1', 'g2']).mean(skipna=False)
except UnsupportedFunctionCall:
    print('UnsupportedFunctionCall')
I think I have seen one solution for this issue suggesting that using .apply(np.mean) instead of .mean() might solve the problem. However:
DF.groupby(['g1', 'g2']).apply(np.mean)
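The output is the same as before: the (a, d) group still comes out as 3.0, not NaN.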
Calling np.mean causes pandas to bypass the NumPy function: np.mean dispatches back to pandas' own .mean(), which again runs with skipna=True! As far as I know, you have to create a new function to solve the issue.
def mean_w_nan(x):
    # Don't forget the np.array call! It strips the pandas type,
    # so np.mean cannot dispatch back to the skipna-ing pandas method.
    return np.mean(np.array(x))

# Select the numeric column d1, so only numbers reach np.mean:
DF.groupby(['g1', 'g2'])['d1'].apply(mean_w_nan)
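Now the (a, d) group shows the NaN we wanted; the result looks roughly like:

g1  g2
a   c    0.5
    d    NaN
b   c    4.5
    d    6.5
Name: d1, dtype: float64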