Data Import and Export

Most data sets you will work with are set up in tables, so are rectangular in shape. Think Excel spreadsheets. In pandas the structure that will hold this kind of data is a DataFrame. We can read external data into a DataFrame using one of many read_* functions. We can also write from a DataFrame to a variety of formats using to_* functions. The most common of these are listed below:

Format type Description reader writer
text CSV read_csv to_csv
Excel read_excel to_excel
text JSON read_json to_json
binary Feather read_feather to_feather
binary SAS read_sas
SQL SQL read_sql to_sql


We'll start by reading in the mtcars dataset stored as a CSV file

  pd.read_csv('data/mtcars.csv') 
                    make   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
0             Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
1         Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
2            Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
3        Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
4     Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
5               Valiant  18.1    6  225.0  105  ...  20.22   1   0     3     1
6            Duster 360  14.3    8  360.0  245  ...  15.84   0   0     3     4
7             Merc 240D  24.4    4  146.7   62  ...  20.00   1   0     4     2
8              Merc 230  22.8    4  140.8   95  ...  22.90   1   0     4     2
9              Merc 280  19.2    6  167.6  123  ...  18.30   1   0     4     4
10            Merc 280C  17.8    6  167.6  123  ...  18.90   1   0     4     4
11           Merc 450SE  16.4    8  275.8  180  ...  17.40   0   0     3     3
12           Merc 450SL  17.3    8  275.8  180  ...  17.60   0   0     3     3
13          Merc 450SLC  15.2    8  275.8  180  ...  18.00   0   0     3     3
14   Cadillac Fleetwood  10.4    8  472.0  205  ...  17.98   0   0     3     4
15  Lincoln Continental  10.4    8  460.0  215  ...  17.82   0   0     3     4
16    Chrysler Imperial  14.7    8  440.0  230  ...  17.42   0   0     3     4
17             Fiat 128  32.4    4   78.7   66  ...  19.47   1   1     4     1
18          Honda Civic  30.4    4   75.7   52  ...  18.52   1   1     4     2
19       Toyota Corolla  33.9    4   71.1   65  ...  19.90   1   1     4     1
20        Toyota Corona  21.5    4  120.1   97  ...  20.01   1   0     3     1
21     Dodge Challenger  15.5    8  318.0  150  ...  16.87   0   0     3     2
22          AMC Javelin  15.2    8  304.0  150  ...  17.30   0   0     3     2
23           Camaro Z28  13.3    8  350.0  245  ...  15.41   0   0     3     4
24     Pontiac Firebird  19.2    8  400.0  175  ...  17.05   0   0     3     2
25            Fiat X1-9  27.3    4   79.0   66  ...  18.90   1   1     4     1
26        Porsche 914-2  26.0    4  120.3   91  ...  16.70   0   1     5     2
27         Lotus Europa  30.4    4   95.1  113  ...  16.90   1   1     5     2
28       Ford Pantera L  15.8    8  351.0  264  ...  14.50   0   1     5     4
29         Ferrari Dino  19.7    6  145.0  175  ...  15.50   0   1     5     6
30        Maserati Bora  15.0    8  301.0  335  ...  14.60   0   1     5     8
31           Volvo 142E  21.4    4  121.0  109  ...  18.60   1   1     4     2

[32 rows x 12 columns] 

This just prints out the data, but then it's lost. To use this data, we have to give it a name, so it's stored in Python's memory

  mtcars = pd.read_csv('data/mtcars.csv') 

One of the big differences between a spreadsheet program and a programming language from the data science perspective is that you have to load data into the programming language. It's not "just there" like Excel. This is a good thing since it allows the common functionality of the programming language to work across multiple data sets and keeps the original data set pristine. Excel users can run into problems and corrupt their data if they are not careful.

If we wanted to write this data set back out into an Excel file, say, we could do

  mtcars.to_excel('data/mtcars.xlsx') 

You may get an error if you don't have the openpyxl package installed. You can easily install it from the Anaconda prompt using conda install openpyxl and following the prompts.