Pandas Basics¶
Much like numpy, pandas can be used to query and analyze data. The reason both tools are used in the data science community is that they fill different requirements and excel at different things. In this lecture we will learn more about pandas and its strengths, specifically:
Pandas versus Numpy
Pandas indexes
Working with data in Pandas
SQL Introduction
Querying DataFrames
Merging DataFrames
Aggregations
Pandas versus Numpy¶
In the last lecture we learned about Numpy and its power for making Python very fast. We saw how to query data, and we learned how Numpy "thinks" about data: as a tensor. In this lecture we will look at Pandas, which thinks about data primarily like a database table. Pandas has an obvious primitive for one dimensional tensors, called a Series, and an obvious primitive for two dimensional tensors, called a DataFrame, but it has no obvious primitive for "higher order" tensors. You can do some things to create higher order tensors in pandas, but honestly it's kind of tough to work with that set of primitives.
While pandas is limited in the order of dimensions it can represent, most data is either one dimensional or two dimensional anyway, or at the very least can be decomposed to an order 1 or order 2 tensor, so this isn’t as much of a restriction as you might think. In this way you can think of pandas as “specializing” in the typical case and not worrying about the exotic cases.
Additionally, generally speaking pandas is somewhat slower than numpy. For this reason it is possible to access the numpy equivalents of any of the pandas data structures, allowing you to use numpy's speed while keeping pandas' ease of use, in some cases. That said, not all pandas operations can be done with the numpy objects.
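For example, here is a minimal sketch (using the standard to_numpy method) of dropping down from pandas objects to the underlying numpy arrays:
import numpy as np
import pandas as pd

series = pd.Series([1, 4, 7])
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# to_numpy() returns the underlying data as a plain numpy array,
# so fast numpy functions can be applied to it directly
print(type(series.to_numpy()))
print(np.mean(df.to_numpy()))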
Let’s look at some simple examples of how to do things in numpy and equivalently in pandas:
import numpy as np
import time
start = time.time()
vector_one = np.array([1, 4, 7])
vector_two = np.array([2, 4, 6])
print("Result", np.matmul(vector_one, vector_two.T))
print("took ", time.time() - start, "seconds")
Result 60
took 0.0005822181701660156 seconds
import pandas as pd
start = time.time()
series_one = pd.Series(vector_one)
series_two = pd.Series(vector_two)
print("Result", series_one.dot(series_two.T))
print("took", time.time() - start, "seconds")
Result 60
took 0.0007469654083251953 seconds
As you can see, the numpy version is faster, though only modestly so for inputs this small, while the pandas version is a little bit easier to read.
Next let’s look at how to slice pandas versus numpy data structures:
array = np.random.normal(0, 1, size=10000)
start = time.time()
print("Result", len(array[array > 0.5])/len(array))
print("took", time.time() - start, "seconds")
Result 0.3077
took 0.0017809867858886719 seconds
series = pd.Series(array)
start = time.time()
print("Result", len(series[series > 0.5])/len(array))
print("took", time.time() - start, "seconds")
Result 0.3077
took 0.002260923385620117 seconds
So as you can see, numpy flatly beats pandas for speed. However, there are two ways in which pandas is the better choice. First, look at the number of public methods associated with a pandas Series versus a numpy order 1 tensor:
numpy_methods = [method for method in dir(array) if "_" not in method]
pandas_methods = [method for method in dir(series) if "_" not in method]
print(len(numpy_methods))
print(len(pandas_methods))
71
161
It should be clear that you can just do a lot more with a pandas Series out of the box than with a numpy order 1 tensor. While this isn't always a good thing, it's worth keeping in mind the strengths and deficiencies of both frameworks.
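As a quick illustration, here is a minimal sketch of a couple of Series methods with no single-call numpy equivalent, run against the same series as above:
# summary statistics (count, mean, std, quartiles) in one call
print(series.describe())

# counts of each unique value, most frequent first
print(series.value_counts().head())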
Pandas Indexes¶
Pandas dataframes come equipped with three indexes:
an index for columns
an index for rows
an ordering for rows that serves as an implicit index.
The third is the least straightforward without an example, and we'll look at it soon.
Let’s look at a motivating example for the rows and columns first:
import pandas as pd
supply_matrix = np.random.rand(3,3) * 100
demand_vector = np.random.rand(3) * 100
df = pd.DataFrame(supply_matrix)
df.columns = ["Manufacturing", "Technology", "Real Estate"]
df
Manufacturing | Technology | Real Estate | |
---|---|---|---|
0 | 62.186098 | 92.382639 | 74.024296 |
1 | 62.984121 | 42.616047 | 11.667722 |
2 | 53.225074 | 52.152148 | 27.873661 |
As you can see, the columns are listed above the dataframe's data and tell us how each column is semantically named. If you've ever seen an excel spreadsheet, a database table, or a CSV, this carries the same meaning.
Additionally there is a row index, shown on the left hand side of the dataframe. This can either be a simple enumeration of the rows, indexed by the natural numbers, or it can take on a semantic meaning as follows:
import pandas as pd
supply_matrix = np.random.rand(3,3) * 100
df = pd.DataFrame(supply_matrix)
df.columns = ["Manufacturing", "Technology", "Real Estate"]
df.index = ["Manufacturing", "Technology", "Real Estate"]
df
Manufacturing | Technology | Real Estate | |
---|---|---|---|
Manufacturing | 87.277123 | 7.146411 | 83.609861 |
Technology | 35.083035 | 76.971169 | 30.210426 |
Real Estate | 72.918688 | 34.753327 | 33.607585 |
But typically it is the former rather than the latter which is used for the row index. Let’s return to our previous example and show the “third” index:
import pandas as pd
supply_matrix = np.random.rand(3,3) * 100
df = pd.DataFrame(supply_matrix)
df.columns = ["Manufacturing", "Technology", "Real Estate"]
df
Manufacturing | Technology | Real Estate | |
---|---|---|---|
0 | 31.050919 | 6.621009 | 47.343123 |
1 | 52.103564 | 76.781645 | 46.231320 |
2 | 8.868035 | 27.525854 | 27.718102 |
df = df.sort_values("Manufacturing")
df
Manufacturing | Technology | Real Estate | |
---|---|---|---|
2 | 8.868035 | 27.525854 | 27.718102 |
0 | 31.050919 | 6.621009 | 47.343123 |
1 | 52.103564 | 76.781645 | 46.231320 |
As you can see, the row index is now in conflict with the order in which the rows appear. This means we can query for the first row in two ways:
df.iloc[0]
Manufacturing 8.868035
Technology 27.525854
Real Estate 27.718102
Name: 2, dtype: float64
df.loc[2]
Manufacturing 8.868035
Technology 27.525854
Real Estate 27.718102
Name: 2, dtype: float64
Notice these two ways of selecting the first row in the dataframe differ in which index they use, but return the same information. This will become especially important when we move on to merging dataframes, because the row index is used rather than the "implicit" third index defined by the order in which the data appears in the dataframe. This explicit row index can of course be mutated, but it need not change when the ordering of the data changes. This may seem complex, but it can be desirable, especially given the ability to control when and if the explicit row index is updated.
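If you do want the explicit row index to follow the new ordering, you can rebuild it after sorting; a minimal sketch:
# reset_index(drop=True) rebuilds the row index as 0, 1, 2, ... in the new order
df = df.sort_values("Manufacturing").reset_index(drop=True)
df.iloc[0]  # first row by position
df.loc[0]   # now the same row, since the index was rebuilt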
Working With Data In Pandas¶
One of pandas' greatest strengths is its ability to easily read in data and write out data. Let's look at an example:
supply_matrix = np.round(np.random.rand(500,3) * 100, 2)
df1 = pd.DataFrame(supply_matrix)
df1.columns = ["Manufacturing", "Technology", "Real Estate"]
df1.to_csv("economy.csv", index=False)
Here we choose not to save our index to the csv, which is why we pass index=False; this tells the method to save just the data, not the index. Now let's see how to read the dataframe back in:
df2 = pd.read_csv("economy.csv")
df2.equals(df1)
True
We can also read CSVs directly from the internet with the same method:
churn_df = pd.read_csv("https://raw.githubusercontent.com/EricSchles/datascience_book/master/Churn_Modelling.csv")
churn_df
As you can see, all we need to do is supply a URL pointing to a CSV and we are all set! In addition to being able to save to and read from CSVs, we can also read data directly from a database into pandas! This is why pandas is sometimes thought of as an "in-memory" database.
For this we are going to need a connection object to the database and a relevant query:
import pandas as pd
import sqlite3 as sql
connection = sql.connect("economy.db")
df = pd.read_sql_query("SELECT * FROM industry", connection)
df.head()
id | manufacturing | technology | real_estate | |
---|---|---|---|---|
0 | 0 | 18.24 | 66.41 | 72.19 |
1 | 1 | 14.89 | 11.78 | 73.49 |
2 | 2 | 95.07 | 67.08 | 12.53 |
3 | 3 | 58.37 | 45.39 | 75.58 |
4 | 4 | 47.39 | 80.52 | 43.65 |
As you can see, we can pull any information we might want from a database connection directly into python. This is a huge asset for doing data analysis and can make working on real world projects much easier. Notice that we get all the column names automatically, as they appear in the database, without any further manipulation in python. This makes pandas truly a batteries-included library.
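We can also go the other way and write a dataframe into a database table with to_sql. Here is a minimal sketch, purely as an assumption about how a table like industry might have been created in the first place (the variable names are illustrative):
import sqlite3 as sql
import numpy as np
import pandas as pd

connection = sql.connect("economy.db")
supply_matrix = np.round(np.random.rand(500, 3) * 100, 2)
economy_df = pd.DataFrame(supply_matrix, columns=["manufacturing", "technology", "real_estate"])

# write the dataframe out as a table, replacing it if it already exists;
# the row index is stored as a column named "id"
economy_df.to_sql("industry", connection, if_exists="replace", index_label="id")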
SQL Introduction¶
Since reading data from a database is so important for working as a data scientist, we'll do a brief review of SQL now and look at the pandas equivalents alongside our SQL queries.
The most basic SQL query is:
SELECT * FROM [SCHEMA NAME].[TABLE NAME];
Notice that this statement, like all SQL statements, ends with a ;, which is required in SQL, unlike in python. We've already seen an example, but let's look at it again:
engine = connection.cursor()
sql_data = engine.execute("SELECT * FROM industry;").fetchall()
Notice that we need to call the fetchall method in order to actually retrieve the results of the query. Whenever we are getting data from a database connection object, we need to do a fetch of some kind. There are multiple fetch methods available:
[
method
for method in dir(engine.execute("SELECT * FROM industry;"))
if "fetch" in method and "_" not in method
]
['fetchall', 'fetchmany', 'fetchone']
We will make use of fetchall most often; however, for large tables that can't be read in all at once, fetchmany and fetchone are useful primitives.
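For instance, here is a minimal sketch of reading the table in pieces, first with fetchmany and then with the chunksize argument to read_sql_query on the pandas side:
result = engine.execute("SELECT * FROM industry;")
while True:
    rows = result.fetchmany(100)  # pull up to 100 rows at a time
    if not rows:
        break
    # ... process this batch of rows ...

# pandas equivalent: iterate over the table 100 rows at a time
for chunk in pd.read_sql_query("SELECT * FROM industry;", connection, chunksize=100):
    pass  # each chunk is a DataFrame with up to 100 rows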
Now let’s look at the pandas equivalence:
#id manufacturing technology real_estate
df[["id", "manufacturing", "technology", "real_estate"]]
id | manufacturing | technology | real_estate | |
---|---|---|---|---|
0 | 0 | 18.24 | 66.41 | 72.19 |
1 | 1 | 14.89 | 11.78 | 73.49 |
2 | 2 | 95.07 | 67.08 | 12.53 |
3 | 3 | 58.37 | 45.39 | 75.58 |
4 | 4 | 47.39 | 80.52 | 43.65 |
... | ... | ... | ... | ... |
495 | 495 | 55.45 | 17.79 | 33.08 |
496 | 496 | 87.85 | 39.17 | 31.16 |
497 | 497 | 10.35 | 92.77 | 63.09 |
498 | 498 | 50.74 | 67.49 | 54.13 |
499 | 499 | 60.79 | 22.34 | 22.36 |
500 rows × 4 columns
As you can see, we have to be explicit about which columns we want in this version of the select statement. However we can choose specific columns to select in both SQL and in pandas:
sql_data = engine.execute("SELECT manufacturing FROM industry;").fetchall()
df["manufacturing"]
0 18.24
1 14.89
2 95.07
3 58.37
4 47.39
...
495 55.45
496 87.85
497 10.35
498 50.74
499 60.79
Name: manufacturing, Length: 500, dtype: float64
In general SQL statements function sort of like list comprehensions. And most of them are built around the basic SELECT
statement we just looked at. Now let’s explore some modifiers to this SELECT
statement. First let’s look at how to recover information about the data from our query.
These sorts of statements will take specific columns, apply functions to those columns, and then return the resultant transformation. Here are some examples:
engine.execute("SELECT COUNT(manufacturing) FROM industry;").fetchall()
[(500,)]
Here we want to know how many rows the manufacturing column of the industry table has. In general, every column in a database table has the same number of rows, however we'll see how other modifiers can break this rule. In addition to being able to call COUNT
on a single column, we can also call it across all columns:
engine.execute("SELECT COUNT(*) FROM industry;").fetchall()
[(500,)]
Here are the pandas equivalences:
len(df["manufacturing"]), df["manufacturing"].shape[0]
(500, 500)
The above are two different ways to return the number of rows for a specific column. We can also do this for the whole dataframe:
len(df), df.shape[0]
(500, 500)
Next let’s look at how to get the maximum or minimum of a column in SQL:
engine.execute(
"SELECT MAX(manufacturing), MIN(manufacturing) FROM industry;"
).fetchall()
[(99.98, 0.01)]
As you can see, we can return multiple transformations on a single column in one SELECT
statement. This makes SQL very flexible, however this can also make things complicated quickly. We’ll see how to resolve that in a moment. But first let’s look at the pandas equivalence:
df["manufacturing"].max(), df["manufacturing"].min()
(99.98, 0.01)
One difference between SQL and pandas is with pandas we can operate over the entire dataframe at once:
df.max()
id 499.00
manufacturing 99.98
technology 99.60
real_estate 99.92
dtype: float64
df.min()
id 0.00
manufacturing 0.01
technology 0.18
real_estate 0.04
dtype: float64
The equivalent is not as easily accomplished in SQL. Next let’s see how to query for rows that meet specific conditions. For this we will need a WHERE
clause as follows:
SELECT * FROM [SCHEMA NAME].[TABLE NAME] WHERE [TABLE COLUMN MEETS CONDITION];
Let’s look at some specific examples:
sql_data = engine.execute("SELECT * FROM industry WHERE manufacturing>0.3;").fetchall()
The pandas equivalence is:
df[
df["manufacturing"] > 0.3
]
id | manufacturing | technology | real_estate | |
---|---|---|---|---|
0 | 0 | 18.24 | 66.41 | 72.19 |
1 | 1 | 14.89 | 11.78 | 73.49 |
2 | 2 | 95.07 | 67.08 | 12.53 |
3 | 3 | 58.37 | 45.39 | 75.58 |
4 | 4 | 47.39 | 80.52 | 43.65 |
... | ... | ... | ... | ... |
495 | 495 | 55.45 | 17.79 | 33.08 |
496 | 496 | 87.85 | 39.17 | 31.16 |
497 | 497 | 10.35 | 92.77 | 63.09 |
498 | 498 | 50.74 | 67.49 | 54.13 |
499 | 499 | 60.79 | 22.34 | 22.36 |
498 rows × 4 columns
We can modify both of these statements to ensure that the number of rows is the same for both:
engine.execute("SELECT COUNT(*) FROM industry WHERE manufacturing>0.3;").fetchall()
[(498,)]
df[df["manufacturing"] > 0.3].shape[0]
498
As you can see the same number of elements is returned in both cases. Next let’s look at an equality condition:
engine.execute("SELECT * FROM industry WHERE id=10;").fetchall()
[(10, 97.79, 94.64, 95.03)]
And the pandas equivalence:
df[df["id"] == 10]
id | manufacturing | technology | real_estate | |
---|---|---|---|---|
10 | 10 | 97.79 | 94.64 | 95.03 |
Notice the use of double equals in pandas versus single equals in SQL as a difference between the two statements. Sometimes our SQL statements can get very verbose; for this reason it's sometimes useful to name our selections, especially if they are being modified. To do this we use the AS keyword as follows:
SELECT
[SELECTED COLUMN OR TRANSFORMATION] as transform
FROM [SCHEMA NAME].[TABLE NAME];
Let’s look at an example:
engine.execute("""
SELECT
MIN(manufacturing) as min_manufacturing,
MAX(technology) as max_technology,
COUNT(real_estate) as size
FROM
industry
WHERE
real_estate >0.5;
""").fetchall()
[(0.01, 99.6, 495)]
There isn't really a pandas equivalent to this naming convention, but it becomes necessary when multiple SELECT statements are combined:
engine.execute("""
SELECT
MIN(manufacturing)
FROM (
SELECT *
FROM industry
WHERE real_estate>0.5
) as expensive_real_estate;
""").fetchall()
[(0.01,)]
The multiple select statement query does have an equivalence in pandas:
expensive_real_estate = df[df["real_estate"] > 0.5]
expensive_real_estate["manufacturing"].min()
0.01
One final point to make, and an important difference between SQL statements and querying a pandas dataframe:
Order is not guaranteed in a SQL SELECT
statement.¶
This point may seem trivial but it will come up at some point on a project, so I mention it here. You can always impose an ordering with an ORDER BY statement, which we will see in a moment. But if you fail to impose an ordering, it is not guaranteed that your rows will be returned in the same order. Let's see an example now:
sql_data_one = engine.execute("""
SELECT *
FROM industry
ORDER BY manufacturing;
""").fetchall()
sql_data_two = engine.execute("""
SELECT *
FROM industry;
""").fetchall()
sql_data_one == sql_data_two
False
While we cannot guarantee ordering in SQL, we can in pandas. That said, we can choose to change how the data is ordered with a call to sort_values:
df = df.sort_values("manufacturing")
df
id | manufacturing | technology | real_estate | |
---|---|---|---|---|
399 | 399 | 0.01 | 37.60 | 9.43 |
328 | 328 | 0.25 | 71.99 | 21.01 |
161 | 161 | 0.76 | 80.10 | 23.50 |
374 | 374 | 0.77 | 34.61 | 74.78 |
182 | 182 | 0.88 | 92.66 | 29.24 |
... | ... | ... | ... | ... |
409 | 409 | 98.32 | 97.11 | 84.44 |
87 | 87 | 98.66 | 98.61 | 6.73 |
140 | 140 | 99.67 | 65.88 | 80.85 |
133 | 133 | 99.69 | 58.57 | 93.57 |
414 | 414 | 99.98 | 71.30 | 53.31 |
500 rows × 4 columns
Notice that if we do not mutate the dataframe by reassigning it, the new ordering will not be maintained. This is why we assign the result of the sort_values call back to df.
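Alternatively, sort_values accepts an inplace flag, so the reassignment isn't needed; a small sketch:
# modifies df directly, so no reassignment is required
df.sort_values("manufacturing", inplace=True)
df.head()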
Because this is a course on python and not SQL, this ends our demonstration of SQL statements and their pandas equivalents. The point is hopefully clear: pandas can do most, if not all, of the same things as a SQL query.
Pandas Querying Techniques¶
Next we'll look at some more querying techniques, this time just in pandas. First let's see how to create a compound SELECT statement: we simply enclose each component statement in parentheses and then join them with a boolean operator. Here are the boolean operators we will consider:
& - the AND operator. Its truth table is:
First | Second | First AND Second
---|---|---
True | True | True
True | False | False
False | True | False
False | False | False
As you can see, only when both inputs of the operator are True
does the AND
operator return True
.
| - the OR operator. Its truth table is:
First | Second | First OR Second
---|---|---
True | True | True
True | False | True
False | True | True
False | False | False
As you can see, when either input is True
the OR
operator returns True
.
~ - the NOT operator. Its truth table is:
First | NOT First
---|---
True | False
False | True
As you can see, the NOT
operator simply reverses the truth of the input.
You may recall how slicing works from the numpy lecture. It works the same way in pandas:
import numpy as np
import pandas as pd
matrix = np.random.rand(500, 5) * 100
df = pd.DataFrame(matrix, columns=["A", "B", "C", "D", "E"])
df[
(df["A"] > 50) &
(df["B"] < 25)
].shape
(78, 5)
Now let’s look at an OR
example:
df[
(df["A"] > 50) |
(df["B"] < 25)
].shape
(337, 5)
Notice, significantly more data is returned because only one of the component statements needs to be True, not both.
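And here is a minimal sketch of a NOT example with the ~ operator on the same dataframe:
# rows where A is NOT greater than 50, i.e. where A <= 50
df[
    ~(df["A"] > 50)
].shape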
Pandas Merging¶
In this section we will go over merging two or more dataframes together in pandas. For this we will look at the following methods:
append
concat
merge
While we will introduce the topic here, we will not provide an exhaustive enumeration of the functionality of these methods. Please consider checking out the pandas docs for more information.
Append¶
The most common technique for combining dataframes is the append. It works similarly to the built-in list's append method. Let's look at some examples:
import pandas as pd
import random
df = pd.DataFrame()
for _ in range(100):
df = df.append({
"A": round(random.random() * 100, 2),
"B": round(random.random() * 1000, 3)
}, ignore_index=True)
df.shape
(100, 2)
As you can see, we can append dictionaries to a dataframe, so long as we use consistent keys. Those keys get mapped to the columns of the dataframe:
df.head()
A | B | |
---|---|---|
0 | 17.81 | 679.673 |
1 | 51.49 | 674.874 |
2 | 92.51 | 879.666 |
3 | 61.93 | 631.964 |
4 | 10.77 | 984.096 |
Also notice the use of ignore_index - this is required when we append a dictionary to a pandas dataframe, because dictionaries don't have an index and pandas needs one to know where to put the data. When no index is supplied, the data just gets appended to the end of the dataframe. However, if you append a pandas dataframe to a pandas dataframe and the indexes are the same, something different happens:
import pandas as pd
import numpy as np
matrix_one = np.random.rand(3, 2) * 100
df_one = pd.DataFrame(matrix_one, columns=["A", "B"])
df_two = pd.DataFrame()
for index in range(3):
df_two = df_two.append({
"A": index,
"B": index
}, ignore_index=True)
df_two
A | B | |
---|---|---|
0 | 0.0 | 0.0 |
1 | 1.0 | 1.0 |
2 | 2.0 | 2.0 |
df_two = df_two.append(df_one)
df_two
A | B | |
---|---|---|
0 | 0.000000 | 0.000000 |
1 | 1.000000 | 1.000000 |
2 | 2.000000 | 2.000000 |
0 | 53.021144 | 81.479154 |
1 | 63.573473 | 54.581931 |
2 | 47.762291 | 76.338762 |
Notice, now the index repeats itself since the dataframes have the same index. Let’s see what happens if we try to select the first element:
df_two.loc[0]
A | B | |
---|---|---|
0 | 0.000000 | 0.000000 |
0 | 53.021144 | 81.479154 |
This returns two elements, which is certainly unexpected behavior. Many of pandas' other methods rely on the index elements being unique, so this will break a bunch of code and is a very common source of pandas bugs. Luckily it is very easy to resolve:
df_two = df_two.reset_index()
df_two.loc[0]
index 0.0
A 0.0
B 0.0
Name: 0, dtype: float64
By resetting the index, we recover an ordering with a unique row index. Be careful when using this that simply resetting the row index is what you want.
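Notice that reset_index kept the old index around as a new column called index. If you don't want that, reset_index takes a drop flag; a minimal, self-contained sketch (the small example frame is made up for illustration):
import pandas as pd

# a tiny frame with a duplicated row index, like the appended frame above
example = pd.DataFrame({"A": [1.0, 2.0]}, index=[0, 0])

# drop=True discards the old index entirely instead of keeping it as a column
example = example.reset_index(drop=True)
print(example.loc[0])  # now a single, unambiguous row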
Concat¶
Next let’s look at how to “concatenate” two dataframes together:
import pandas as pd
import numpy as np
matrix_one = np.random.rand(3, 2) * 100
df_one = pd.DataFrame(matrix_one, columns=["A", "B"])
df_two = pd.DataFrame()
for index in range(3):
df_two = df_two.append({
"A": index,
"B": index
}, ignore_index=True)
pd.concat([df_one, df_two])
A | B | |
---|---|---|
0 | 21.141901 | 66.162747 |
1 | 42.382557 | 43.663889 |
2 | 13.975004 | 73.809950 |
0 | 0.000000 | 0.000000 |
1 | 1.000000 | 1.000000 |
2 | 2.000000 | 2.000000 |
Notice this does the same thing as an append, except it allows us to provide a list of dataframes rather than having to append them one at a time. Note that concat expects the columns to be the same for both dataframes; if you try to concatenate two dataframes with different columns, you will get unexpected behavior:
import pandas as pd
import numpy as np
matrix_one = np.random.rand(3, 2) * 100
df_one = pd.DataFrame(matrix_one, columns=["A", "B"])
df_two = pd.DataFrame()
for index in range(3):
df_two = df_two.append({
"C": index,
"D": index
}, ignore_index=True)
pd.concat([df_one, df_two])
A | B | C | D | |
---|---|---|---|---|
0 | 54.594504 | 14.399045 | NaN | NaN |
1 | 13.776155 | 17.707866 | NaN | NaN |
2 | 8.499159 | 68.064523 | NaN | NaN |
0 | NaN | NaN | 0.0 | 0.0 |
1 | NaN | NaN | 1.0 | 1.0 |
2 | NaN | NaN | 2.0 | 2.0 |
Notice that concat
doesn’t know how to handle the missing data, so it just sticks NaN
s in as a placeholder.
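Two further notes on concat: it can also place dataframes side by side with axis=1, and in recent pandas releases DataFrame.append has been deprecated and then removed, so concat is the recommended way to stack frames going forward. A minimal sketch:
# stack rows, rebuilding a clean 0..n-1 index as we go
stacked = pd.concat([df_one, df_one], ignore_index=True)

# place frames side by side, aligning rows on the index
side_by_side = pd.concat([df_one, df_one], axis=1)

print(stacked.shape, side_by_side.shape)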
Merge¶
Next we'll look at merge, which relies on a join. The nice thing about this method is that it doesn't require all the columns in the two dataframes to be the same; however, they do need at least one column in common, which is used to "merge" the data.
import pandas as pd
import numpy as np
df_one = pd.DataFrame()
for index in range(3):
df_one = df_one.append({
"A": index,
"E": index * 3
}, ignore_index=True)
df_two = pd.DataFrame()
for index in range(3):
df_two = df_two.append({
"A": index,
"D": index * 5
}, ignore_index=True)
pd.merge(df_one, df_two, how="inner", on="A")
A | E | D | |
---|---|---|---|
0 | 0.0 | 0.0 | 0.0 |
1 | 1.0 | 3.0 | 5.0 |
2 | 2.0 | 6.0 | 10.0 |
You can make an explicit index column if there is no obvious choice:
import pandas as pd
import numpy as np
matrix_one = np.random.rand(3, 2) * 100
matrix_two = np.random.rand(3, 2) * 100
df_one = pd.DataFrame(matrix_one, columns=["A", "B"])
df_two = pd.DataFrame(matrix_two, columns=["C", "D"])
df_one["index"] = df_one.index
df_two["index"] = df_two.index
pd.merge(df_one, df_two, how="inner", on="index")
A | B | index | C | D | |
---|---|---|---|---|---|
0 | 58.813212 | 23.542619 | 0 | 72.604185 | 78.022299 |
1 | 97.875475 | 31.352383 | 1 | 19.036230 | 92.510758 |
2 | 12.437097 | 61.702843 | 2 | 47.217048 | 82.834555 |
Notice we’ve been using something called how
as one of the keywords in our merge. Here how
refers to the type of “join”. There are some variants of this technique, but we will only consider two here:
inner - selects rows whose key appears in both the left AND the right dataframe
outer - selects rows whose key appears in either the left OR the right dataframe
As you can see, a join is just another way of doing an AND or an OR, except it operates explicitly over a provided key column which appears in both dataframes.
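To see the difference concretely, here is a minimal sketch with keys that only partially overlap (the two small frames are made up for illustration):
left = pd.DataFrame({"A": [0, 1, 2], "E": [0, 3, 6]})
right = pd.DataFrame({"A": [1, 2, 3], "D": [5, 10, 15]})

# inner: only keys present in both frames (A = 1 and 2)
print(pd.merge(left, right, how="inner", on="A"))

# outer: keys present in either frame (A = 0, 1, 2, 3), with NaN where data is missing
print(pd.merge(left, right, how="outer", on="A"))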
Aggregations¶
Before we close out this chapter we will look at one more technique that can be very useful, an aggregation. This is often used in conjunction with something called a groupby
. Let’s see an example of a groupby
first:
import pandas as pd
import numpy as np
matrix = np.round(np.random.rand(500, 2) * 100, 2)
df = pd.DataFrame(matrix, columns=["average_income", "age"])
df["city"] = np.nan
df.loc[0:250, "city"] = "New York"
df.loc[250:, "city"] = "Boston"
df
average_income | age | city | |
---|---|---|---|
0 | 23.51 | 84.12 | New York |
1 | 0.64 | 50.05 | New York |
2 | 27.92 | 79.04 | New York |
3 | 54.22 | 0.33 | New York |
4 | 40.76 | 0.58 | New York |
... | ... | ... | ... |
495 | 93.54 | 79.50 | Boston |
496 | 33.72 | 31.29 | Boston |
497 | 56.44 | 70.34 | Boston |
498 | 55.95 | 52.34 | Boston |
499 | 99.10 | 26.83 | Boston |
500 rows × 3 columns
As you can see, one of these columns has strings for values instead of numbers. By using a groupby we can select subsections of the above dataframe, conditional on which city we are referring to:
for group, tmp in df.groupby("city"):
print(group, tmp.shape)
Boston (250, 3)
New York (250, 3)
Here we print out the slice of the dataframe for Boston and then for New York, and we observe each slice's number of rows and columns. Next let's get the average income for each slice:
for group, tmp in df.groupby("city"):
print(group, tmp["average_income"].mean())
Boston 50.42168
New York 51.88596
We can also accomplish the above with an aggregation:
df.groupby("city").agg({"average_income":"mean"})
average_income | |
---|---|
city | |
Boston | 50.42168 |
New York | 51.88596 |
Notice we no longer need the for loop in order to get the average income by city. We can also specify a function explicitly:
df.groupby("city").agg({"average_income": np.mean})
average_income | |
---|---|
city | |
Boston | 50.42168 |
New York | 51.88596 |
I find this more useful than the string equivalent because it gives us far more control. Additionally, we can call an aggregation without using a groupby:
df.agg({"average_income": "mean"})
average_income 51.15382
dtype: float64
We can also supply a function; however, this means we must transform across the entire axis:
df.agg(np.mean, axis=0)
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/apply.py:169: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
return getattr(obj, f)()
average_income 51.15382
age 51.64458
dtype: float64
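Finally, agg can apply several functions at once, and different functions to different columns; a minimal sketch on the same dataframe:
# multiple aggregations per group in a single call;
# the result has one row per city and one column per (column, function) pair
df.groupby("city").agg({
    "average_income": ["mean", "max"],
    "age": "mean"
})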