pandas DataFrames
Data Structures and Types
pandas.Series
The Series object holds data from a single input variable and, much like numpy arrays, is required to be homogeneous in type. You can create Series objects from lists or numpy arrays quite easily:
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 9, 13])
s

0     1.0
1     3.0
2     5.0
3     NaN
4     9.0
5    13.0
dtype: float64
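To make the homogeneity requirement concrete, here is a small sketch of how pandas settles on a single dtype for a Series. The variable names are illustrative and not part of the original example. It assumes only that numpy and pandas are installed.

```python
import numpy as np
import pandas as pd

# All integers: stored with an integer dtype
ints = pd.Series([1, 3, 5])

# np.nan is a float, so the whole Series is upcast to float64
with_nan = pd.Series([1, 3, 5, np.nan])

# Mixed numbers and strings fall back to the generic object dtype
mixed = pd.Series([1, "three", 5.0])

# Building from a numpy array keeps the array's dtype
from_array = pd.Series(np.array([0.5, 1.5, 2.5]))

print(ints.dtype, with_nan.dtype, mixed.dtype, from_array.dtype)
```

This is why the Series `s` above prints its integers as `1.0`, `3.0`, and so on: the single `np.nan` forces every element to be stored as a float.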
s2 = pd.Series(np.arange(1, 20))
s2

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
14    15
15    16
16    17
17    18
18    19
dtype: int64
You can access elements of a Series by their index label, much like looking up a key in a dict:
s2[4]
5
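The dict analogy goes a little further than square-bracket lookup. The following sketch rebuilds `s2` and shows two other dict-like operations, membership testing and `get` with a default. The names `value`, `has_label_4`, `has_label_19`, and `missing` are illustrative.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(np.arange(1, 20))

# Lookup by index label, as with a dict key
value = s2[4]            # label 4 holds the value 5

# Membership tests check the index labels, not the values
has_label_4 = 4 in s2    # 4 is one of the labels 0..18
has_label_19 = 19 in s2  # 19 is a value, but not a label

# .get returns a default instead of raising KeyError for a missing label
missing = s2.get(100, default=-1)

print(value, has_label_4, has_label_19, missing)
```

Note that `in` looks at the labels, just as `in` on a dict looks at the keys rather than the values.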
There is no requirement that the index of a Series be numeric. The labels can be any hashable scalar object, such as strings:
s3 = pd.Series(np.random.normal(0, 1, (5,)), index=['a', 'b', 'c', 'd', 'e'])
s3

a   -0.283473
b    0.157530
c    1.051739
d    0.859905
e    1.178951
dtype: float64
s3['d']
0.859904696094078
s3['a':'d']
a   -0.283473
b    0.157530
c    1.051739
d    0.859905
dtype: float64
Well, slicing worked, but it gave us something different than expected. The result includes both the start and the end of the slice, which is unlike what we've encountered so far.
It turns out that in pandas, slicing by index label includes both endpoints. This differs from numpy and Python in general, where the end of a slice is excluded, so we have to be careful about it.
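One way to keep the two behaviors apart is to say explicitly which kind of slice you mean, using the `.loc` (label-based) and `.iloc` (position-based) accessors. The sketch below uses fixed values in place of the random ones so that the result is reproducible; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

s3 = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: both endpoints 'a' and 'd' are included
by_label = s3.loc['a':'d']

# Position-based slice: position 3 ('d') is excluded, as in plain Python
by_position = s3.iloc[0:3]

print(list(by_label.index))     # ['a', 'b', 'c', 'd']
print(list(by_position.index))  # ['a', 'b', 'c']
```

The label-based slice returns four elements, while the positional slice over `0:3` returns three.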
You can extract the actual values into a numpy array:
s3.to_numpy()
array([-0.28347282, 0.1575304 , 1.05173885, 0.8599047 , 1.17895111])
In fact, you'll see that many of pandas' structures are built on top of numpy arrays. This is a good thing, since you can take advantage of the powerful numpy functions that are built for fast, efficient scientific computing.
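As a small illustration of this interoperability, numpy functions can be applied directly to a Series. This is a minimal sketch with made-up values; the names `roots` and `total` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

# numpy ufuncs operate elementwise and return a Series with the same index
roots = np.sqrt(s)

# numpy reductions accept a Series as well
total = np.sum(s)

print(roots)
print(total)
```

The elementwise result keeps the labels `a`, `b`, and `c`, so you do not have to convert to a numpy array and back to use these functions.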
To make the point about slicing again, here is a positional slice of the underlying numpy array:
s3.to_numpy()[0:3]
array([-0.28347282, 0.1575304 , 1.05173885])
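The positional slice excludes the end position, so we get three elements. This is different from the label-based slicing done earlier, which included the end label.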