pandas DataFrames
Data Import and Export
Most data sets you will work with are laid out as tables, so they are rectangular in shape, much like an Excel spreadsheet. In pandas, the structure that holds this kind of data is a `DataFrame`. We can read external data into a `DataFrame` using one of the many `read_*` functions, and we can write a `DataFrame` out to a variety of formats using the `to_*` methods. The most common of these are listed below:
Format type | Description | reader | writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
binary | Excel | read_excel | to_excel |
text | JSON | read_json | to_json |
binary | Feather | read_feather | to_feather |
binary | SAS | read_sas | |
SQL | SQL | read_sql | to_sql |
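As a minimal sketch of how the readers and writers in the table pair up, the example below writes a tiny made-up DataFrame to CSV and JSON text and reads each one back. The data and variable names are illustrative assumptions, and in-memory buffers stand in for files on disk.

```python
import io

import pandas as pd

# A tiny, made-up table (not the mtcars data) used only for illustration.
df = pd.DataFrame({"make": ["A", "B", "C"], "mpg": [21.0, 22.8, 18.7]})

# Writers are DataFrame methods; to_csv returns a string when no path is given.
csv_text = df.to_csv(index=False)

# Readers are top-level pandas functions; StringIO lets us read from the string.
from_csv = pd.read_csv(io.StringIO(csv_text))

# The same write-then-read pattern works for JSON.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Passing a file path instead of a buffer, as in `df.to_csv('some_file.csv')`, writes to disk; the file name here is hypothetical.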
We'll start by reading in the `mtcars` dataset, which is stored as a CSV file (this assumes pandas has been imported with `import pandas as pd`):
pd.read_csv('data/mtcars.csv')
                       make   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
    0             Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
    1         Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
    2            Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
    3        Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
    4     Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
    5               Valiant  18.1    6  225.0  105  ...  20.22   1   0     3     1
    6            Duster 360  14.3    8  360.0  245  ...  15.84   0   0     3     4
    7             Merc 240D  24.4    4  146.7   62  ...  20.00   1   0     4     2
    8              Merc 230  22.8    4  140.8   95  ...  22.90   1   0     4     2
    9              Merc 280  19.2    6  167.6  123  ...  18.30   1   0     4     4
    10            Merc 280C  17.8    6  167.6  123  ...  18.90   1   0     4     4
    11           Merc 450SE  16.4    8  275.8  180  ...  17.40   0   0     3     3
    12           Merc 450SL  17.3    8  275.8  180  ...  17.60   0   0     3     3
    13          Merc 450SLC  15.2    8  275.8  180  ...  18.00   0   0     3     3
    14   Cadillac Fleetwood  10.4    8  472.0  205  ...  17.98   0   0     3     4
    15  Lincoln Continental  10.4    8  460.0  215  ...  17.82   0   0     3     4
    16    Chrysler Imperial  14.7    8  440.0  230  ...  17.42   0   0     3     4
    17             Fiat 128  32.4    4   78.7   66  ...  19.47   1   1     4     1
    18          Honda Civic  30.4    4   75.7   52  ...  18.52   1   1     4     2
    19       Toyota Corolla  33.9    4   71.1   65  ...  19.90   1   1     4     1
    20        Toyota Corona  21.5    4  120.1   97  ...  20.01   1   0     3     1
    21     Dodge Challenger  15.5    8  318.0  150  ...  16.87   0   0     3     2
    22          AMC Javelin  15.2    8  304.0  150  ...  17.30   0   0     3     2
    23           Camaro Z28  13.3    8  350.0  245  ...  15.41   0   0     3     4
    24     Pontiac Firebird  19.2    8  400.0  175  ...  17.05   0   0     3     2
    25            Fiat X1-9  27.3    4   79.0   66  ...  18.90   1   1     4     1
    26        Porsche 914-2  26.0    4  120.3   91  ...  16.70   0   1     5     2
    27         Lotus Europa  30.4    4   95.1  113  ...  16.90   1   1     5     2
    28       Ford Pantera L  15.8    8  351.0  264  ...  14.50   0   1     5     4
    29         Ferrari Dino  19.7    6  145.0  175  ...  15.50   0   1     5     6
    30        Maserati Bora  15.0    8  301.0  335  ...  14.60   0   1     5     8
    31           Volvo 142E  21.4    4  121.0  109  ...  18.60   1   1     4     2

    [32 rows x 12 columns]
This just prints the data, which is then lost. To use the data, we have to give it a name so that it is stored in Python's memory:
mtcars = pd.read_csv('data/mtcars.csv')
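Once the DataFrame has a name, later commands can reuse it. The sketch below illustrates this with three rows and four columns copied from the printed output above, read from an in-memory string rather than from `data/mtcars.csv` so that it runs without the file. The name `cars` and the use of `io.StringIO` are assumptions made for the example.

```python
import io

import pandas as pd

# Three rows and four columns copied from the printed output; this string
# stands in for the file data/mtcars.csv so the example is self-contained.
csv_text = """make,mpg,cyl,hp
Mazda RX4,21.0,6,110
Datsun 710,22.8,4,93
Hornet 4 Drive,21.4,6,110
"""

# Without an assignment the result is displayed (or discarded) and then gone.
pd.read_csv(io.StringIO(csv_text))

# With an assignment the DataFrame stays in memory under the name `cars`.
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)             # (number of rows, number of columns)
print(cars.columns.tolist())  # the column names
print(cars["mpg"].mean())     # the named object can be reused in later steps
```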
From a data science perspective, one of the big differences between a spreadsheet program and a programming language is that you have to load data into the programming language explicitly; it's not "just there" as it is in Excel. This is a good thing: the same functionality can be applied across multiple data sets, and the original data set stays pristine. Excel users, by contrast, can run into problems and corrupt their data if they are not careful.
If we wanted to write this data set back out to an Excel file, say, we could do:
mtcars.to_excel('data/mtcars.xlsx')
You may get an error if you don't have the `openpyxl` package installed. You can install it from the Anaconda prompt by running `conda install openpyxl` and following the prompts.
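One way to handle the missing-package case gracefully is sketched below, under a few assumptions: the helper `save_excel`, the made-up data, and the temporary output path are all hypothetical and not part of the original example. It also passes `index=False`, which leaves the row labels out of the spreadsheet; by default `to_excel` writes them as an extra first column.

```python
import os
import tempfile

import pandas as pd

# A small made-up table; any DataFrame, such as mtcars, could be used instead.
df = pd.DataFrame({"make": ["A", "B"], "mpg": [21.0, 22.8]})


def save_excel(frame, path):
    """Try to write `frame` to an .xlsx file; return True on success."""
    try:
        # index=False leaves out the row labels (0, 1, 2, ...) as a column.
        frame.to_excel(path, index=False)
        return True
    except ImportError:
        # Raised when no Excel writer engine (e.g. openpyxl) is installed.
        print("Excel support is missing; try: conda install openpyxl")
        return False


# Write to a temporary directory (an assumption, to avoid touching data/).
out_path = os.path.join(tempfile.mkdtemp(), "example.xlsx")
wrote_file = save_excel(df, out_path)
print(wrote_file)
```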