联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Web作业Web作业

日期:2021-12-12 12:14



Pandas is the most commonly used package for working with datasets

in Python. The name pandas comes from "Panel Data" (a statistical

method for data captured over time) and "Python Data Analysis"

(analysing data). Nothing to do with the animals, though I am a big fan!

We're going to barely scratch the surface this week. You can find full

Pandas documentation here.

Motivation

Consider the following data on local department store Fenwick (I may have mentioned earlier my

love of the department store vibe - seeing the wooden escalators was one of my favourite bits of

New York!)

Year Window Revenue (Millions £)* White Christmas**

2015 Once Upon a Christmas 298.2 True

2016 Beatrix Potter 302.0 True

2017 Paddington Bear 297.9 True

2018 Charlie and the Chocolate Factory 288.0 False

2019 The Snowman 246.32 False

2020 Wind in the Willows 218.0 False

* Financial data from https://craft.co/fenwick-limited

** White Christmas in England according to the Met Office

Looking at the table above, we just don't currently know how to handle this with what we've

learnt so far: it is a table with multiple different data types that would naturally be represented (in

order of columns) as integers, strings, float values and booleans. And we could well have

information not available to us, for example missing data in one particular year.

What we really want is a data type that acts like a table: that can store data of different types,

that can have missing data, that we can query and manipulate and make plots from. That's what

pandas offers.

We are not trying to be statisticians (yet)! With a statistics mindset we might ask, for example,

whether a White Christmas has a statistically significant impact on revenue (probably not!). When

we look at a larger dataset later then it will open up lots of other things we could ask, but right

now we are just interested in handling this data. We'll effectively be laying the foundations for

doing some "proper" statistics, but actually the programming language of choice for statistics in

our degree programme is R, which you will meet next semester.

Week 9 Handout - MAS1803 Problem Solving with Python 2021-22

Dr Chris Graham 2

Importing pandas

We can import pandas with

import pandas as pd

Series and Dataframes

Pandas has two objects that we are interested in:

A Series you can think of as a column in a table and is very like an array in NumPy.

A DataFrame you can think of as the table in its entirety and is made up of Series.

Creating a series

Let's create a Series for the year's column above. The function pd.Series takes in a list of

values and turns it into a Series:

years = pd.Series([2015,2016,2017,2018,2019,2020])

type(years)

pandas.core.series.Series

Printing years , we get something a bit different from printing a list:

print(years)

0 2015

1 2016

2 2017

3 2018

4 2019

5 2020

dtype: int64

Note the presence of an "index column" which runs from 0 upwards.

We can do some familiar things like this

years[0]

2015

and some other unfamiliar things. Let's create a Series for revenue to demonstrate:

revenue = pd.Series([298.2,302.0,297.9,288.0,246.32,218.0])

Now check this out... we can query rows in a series like this

revenue[revenue > 280]

0 298.2

1 302.0

2 297.9

3 288.0

dtype: float64

Week 9 Handout - MAS1803 Problem Solving with Python 2021-22

Dr Chris Graham 3

which is really cool, but will be even better once we put this data into a DataFrame instead.

Creating a DataFrame

A pandas DataFrame is much like a table of data.

To create a DataFrame manually (we're more likely to create a DataFrame by importing data from

a CSV as we'll do later in the practical), we first create a dictionary. Recall (week 6) that a

dictionary is created as follows, made up of key:value pairs separated by commas and inside

braces {}

car = {"manufacturer": "Citreon", "model": "Picasso"}

print(car["manufacturer"])

Citreon

Here we go with the Fenwick data... I recommend copy and pasting this, or be very careful with

your commas etc...

# Dictionary - the indenting isn't important, it's just for easy reading

data = {

"year": [2015,2016,2017,2018,2019,2020],

"window": ["Once Upon a Christmas","Beatrix Potter","Paddington

Bear",

"Charlie and the Chocolate Factory","The Snowman","Wind

in the Willows"],

"revenue": [298.2,302.0,297.9,288.0,246.32,218.0],

"white_christmas": [True,True,True,False,False,False]

}


# load data into a DataFrame

df = pd.DataFrame(data)

We've got ourselves a DataFrame! We're going to use df for the variable name, but this could be

anything.

type(df)

pandas.core.frame.DataFrame

Let the fun commence!

Getting to know our DataFrame

We've just got a small bit of data here, but these methods/attributes will be extremely useful

when we look at the larger data set later on. Try these in the Console:

df.head() # show the top of the data frame

df.tail() # show the bottom of the data frame

df.columns # show the column names

df.dtypes # show the data types

Now let me show you something amazing...

df.describe()

Week 9 Handout - MAS1803 Problem Solving with Python 2021-22

Dr Chris Graham 4

year revenue

count 6.000000 6.000000

mean 2017.500000 275.070000

std 1.870829 34.733739

min 2015.000000 218.000000

25% 2016.250000 256.740000

50% 2017.500000 292.950000

75% 2018.750000 298.125000

max 2020.000000 302.000000

without any effort from us, we've got summary statistics there from the two numeric columns in

our DataFrame. I already know the mean revenue for Fenwick over those 6 years is £275.07 and

we've barely raised a finger!

Here's one more:

df.groupby("white_christmas").size()

white_christmas

False 3

True 3

dtype: int64

df.groupby("white_christmas").mean()

Sorting the DataFrame

We can sort the DataFrame like this:

df.sort_values(by="revenue")

which orders by revenue ascending, or descending with

df.sort_values(by="revenue",ascending=False)

Note this does not modify the DataFrame. If we want to make this permanent we could assign to

a new DataFrame though:

df_by_revenue = df.sort_values(by="revenue",ascending=False)

Obtaining Series from the DataFrame

We can isolate a column in our DataFrame like this

df["revenue"]

This itself is a Series. We can check like this:

revenue = df["revenue"]

type(revenue)

pandas.core.series.Series

If we wanted to then we could change this to a NumPy array using the to_numpy() method:

Week 9 Handout - MAS1803 Problem Solving with Python 2021-22

Dr Chris Graham 5

revenue = df["revenue"].to_numpy()

type(revenue)

numpy.ndarray

which would be one way to, for example, make some plots

# Make a plot using matplotlib

import matplotlib.pyplot as plt


# Convert Series to numpy arrays

revenue = df["revenue"].to_numpy()

year = df["year"].to_numpy()


# Plot using the usual command

plt.plot(year, revenue, '-o')

There is another, more direct way, which we will see shortly.

We can also isolate a subset of columns, for example just the year and revenue

df[["year","white_christmas"]]

year white_christmas

0 2015 True

1 2016 True

2 2017 True

3 2018 False

4 2019 False

5 2020 False

The output of this is itself a DataFrame.

Querying the DataFrame

We can query the DataFrame like this:

# find rows where revenue was greater than 280

df[df["revenue"] > 280]

year window revenue white_christmas

0 2015 Once Upon a Christmas 298.2 True

1 2016 Beatrix Potter 302.0 True

2 2017 Paddington Bear 297.9 True

3 2018 Charlie and the Chocolate Factory 288.0 False

Note that the part in df[] , df["revenue"] > 280 is a Series of booleans:

df["revenue"] > 280

Week 9 Handout - MAS1803 Problem Solving with Python 2021-22

Dr Chris Graham 6

0 True

1 True

2 True

3 True

4 False

5 False

Name: revenue, dtype: bool

Here's some more example:

# Find rows where it was a white christmas

# Note: same as df[df["white_christmas"]==True]

df[df["white_christmas"]]

year window revenue white_christmas

0 2015 Once Upon a Christmas 298.2 True

1 2016 Beatrix Potter 302.0 True

2 2017 Paddington Bear 297.9 True

# Find rows where the window was "The Snowman"

df[df["window"] == "The Snowman"]

year window revenue white_christmas

4 2019 The Snowman 246.32 False

and since each of these outputs is itself a DataFrame, we can even do this sort of thing:

# Get summary statistics for years with a white christmas

df[df["white_christmas"]].describe()

year revenue

count 3.0 3.000000

mean 2016.0 299.366667

std 1.0 2.285461

min 2015.0 297.900000

25% 2015.5 298.050000

50% 2016.0 298.200000

75% 2016.5 300.100000

max 2017.0 302.000000

Plotting DataFrame data

I hope you are appreciating how powerful pandas is. It is also is integrated with Matplotlib, so that

there is an alternative to converting to NumPy arrays and then using Matplotlib. Try this:

# Plot revenue versus year with the '-o' marker style

df.plot("year","revenue",style="-o")

Week 9 Handout - MAS1803 Problem Solving with Python 2021-22

Dr Chris Graham 7

The plot method for a DataFrame is a little less flexible and there are some subtle differences

plotting this way (for example setting line styles), but it's provided for convenience and certainly

ticks that box.

Importing datasets from CSV

We're going to work for most of the remainder of this session on a data set of music albums

(adapted from this data set on Kaggle, which is a good source for finding datasets to work with).

Download the data set

Import the dataset with the following (assumes music.csv is in the same folder as your Python

script):

import pandas as pd

df = pd.read_csv('music.csv')

Note that because of its size, if you do

print(df)

or type df in the console then you may find that you see only part of the DataFrame. If you're

working on a large screen, you may wish to set pandas to show all columns in the console with

pd.set_option('display.max_columns', None) # removes the maximum

or if you're working on a particularly small screen then you might like to set your own value for

this:

pd.set_option('display.max_columns', 4) # show 4 columns

Exercise 9.1

Let's familiarise ourselves with the DataFrame.

Show Exercise

Querying the music dataframe

Let's see if we can find some classic albums...

Week 9 Handout - MAS1803 Problem Solving with Python 2021-22

Dr Chris Graham 8

# Album exactly matches the word Thriller

df[df["Album"]=="Thriller"]

We also might want to match by a string, which can be done using str.contains . What was the

name of that Dire Straits album? Brother something...

# Album contains the word "Brother"

df[df["Album"].str.contains("Brother")]

Brothers in Arms... that was it... lots of other "brothers" in there too!

Before we go any further, we might get a good feel for some of the ranges of values that there

are in the DataFrame (for example what years are covered, what the maximum rating is etc) by

using describe() :

df.describe()


版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp