Unsupervised Data Mining Workbook

Prof Emel Aktas

8 March 2021

1 A Simple Affinity Example

A typical application of data mining is in marketing: we can ask a customer who buys a product if they would like to buy another, similar product. This type of analysis is called ‘affinity analysis’; it studies things that appear together. It is a correlation analysis, so do not mistake correlation for causation.

The example we will investigate is from a hardware wholesaler. For the purposes of class demonstration, we processed the raw data from the wholesaler and shortened it to 100 observations and five products:

1. Adjustable Wristband

2. Portable Counterweight

3. Canvas Pouch

4. Rescue Harness

5. Cellphone Holster

In the exercise, we will investigate which of these products were ordered together by 100 customers. A practical application of this analysis could inform where in the warehouse the products should be located to minimise the picking time.

Disclaimer: Product images from https://www.3m.co.uk/3M/en_GB/company-uk/3m-products/~/All-3M-Products/Safety/DBI-SALA/?N=5002385+8709322+8711017+8734872&rt=r3 are used as representations of products. The data is realistic but processed to disguise any commercial information. No conclusions about any company or volumes of sales shall be inferred from the data.

[1]: # import the numpy package

import numpy as np

# import seaborn for graphs

import seaborn as sn

sn.set(color_codes = True)

# import matplotlib for plotting

import matplotlib.pyplot as plt

# import pandas for data analysis

import pandas as pd

# define the dataset file name

dataset_filename = "affinity_dataset.txt"

# load the dataset

X = np.loadtxt(dataset_filename)

[2]: X.shape

[2]: (100, 5)

[3]: X[:5]

[3]: array([[0., 1., 0., 0., 0.],

[1., 1., 0., 0., 0.],

[0., 0., 1., 0., 1.],

[1., 1., 0., 0., 0.],

[0., 0., 1., 1., 1.]])

[4]: # assign number of observations and number of variables

n_samples, n_features = X.shape

[5]: # names of variables (features)

features = ["Wristband", "Counterweight", "Pouch", "Harness", "Holster"]

[6]: # support and confidence for the rule: "if a customer orders Holster, they also buy X"

# third row of the X as an example sample

sample = X[2]

[7]: # take a look at what's in sample and compare with X[:5]

sample

[7]: array([0., 0., 1., 0., 1.])

[8]: # the fifth element of sample (corresponds to Holster)

sample[4]

[8]: 1.0

[9]: # create a default dictionary to capture valid and invalid rules

from collections import defaultdict

valid_rules = defaultdict(int)

invalid_rules = defaultdict(int)

num_occurences = defaultdict(int)

[10]: # check the entire dataset for each feature as a premise and check the conclusion.
# when the premise is true, if the conclusion is also true, the rule is valid.
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
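The counting loop above can also be expressed with matrix operations. The following is a minimal sketch, assuming the data is a 0/1 matrix like the one loaded earlier; the tiny array `X_demo` and the names `cooccurrence`, `occurrences`, and `confidence_matrix` are made up for illustration and are not part of the workbook.

```python
import numpy as np

# Hypothetical 0/1 order matrix: rows are customers, columns are products.
X_demo = np.array([
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# Entry (i, j) counts the customers who ordered both product i and product j.
cooccurrence = X_demo.T @ X_demo

# The diagonal holds how many customers ordered each product.
occurrences = np.diag(cooccurrence)

# Confidence of "i -> j" is the pair count divided by the count of premise i.
# This assumes every product was ordered at least once (no division by zero).
confidence_matrix = cooccurrence / occurrences[:, np.newaxis]

print(cooccurrence)
print(confidence_matrix.round(3))
```

The off-diagonal entries correspond to the pair counts collected by the loop; the diagonal of `confidence_matrix` is always 1 and can be ignored, since a rule X -> X carries no information.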

[11]: # how many times each product is bought

num_occurences

[11]: defaultdict(int, {1: 52, 0: 28, 2: 39, 4: 57, 3: 43})

[12]: # how many times each pair of products was ordered together

valid_rules

[12]: defaultdict(int,

{(0, 1): 13,

(1, 0): 13,

(2, 4): 20,

(4, 2): 20,

(2, 3): 22,

(3, 2): 22,

(3, 4): 27,

(4, 3): 27,

(1, 3): 18,

(3, 1): 18,

(1, 4): 27,

(4, 1): 27,

(0, 2): 5,

(2, 0): 5,

(0, 4): 16,

(4, 0): 16,

(1, 2): 11,

(2, 1): 11,

(0, 3): 9,

(3, 0): 9})

[13]: # support of the rule

support = valid_rules

[14]: # confidence calculation

confidence = defaultdict(float)

for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]

[15]: # confidence of the rule. percentage of times the rule applies when the premise applies

confidence

[15]: defaultdict(float,

{(0, 1): 0.4642857142857143,

(1, 0): 0.25,

(2, 4): 0.5128205128205128,

(4, 2): 0.3508771929824561,

(2, 3): 0.5641025641025641,

(3, 2): 0.5116279069767442,

(3, 4): 0.627906976744186,

(4, 3): 0.47368421052631576,

(1, 3): 0.34615384615384615,

(3, 1): 0.4186046511627907,

(1, 4): 0.5192307692307693,

(4, 1): 0.47368421052631576,

(0, 2): 0.17857142857142858,

(2, 0): 0.1282051282051282,

(0, 4): 0.5714285714285714,

(4, 0): 0.2807017543859649,

(1, 2): 0.21153846153846154,

(2, 1): 0.28205128205128205,

(0, 3): 0.32142857142857145,

(3, 0): 0.20930232558139536})
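As a check on the numbers above, the confidence of a rule can be written as the pair count divided by the premise count. The symbols are notation introduced here for illustration, not the workbook's own: $n(A, B)$ is the number of customers who ordered both products (the `valid_rules` entry) and $n(A)$ is the number who ordered the premise (the `num_occurences` entry).

```latex
\[
\operatorname{confidence}(A \rightarrow B) = \frac{n(A, B)}{n(A)}
\]
\[
\operatorname{confidence}(\text{Harness} \rightarrow \text{Holster}) = \frac{27}{43} \approx 0.628,
\qquad
\operatorname{confidence}(\text{Holster} \rightarrow \text{Harness}) = \frac{27}{57} \approx 0.474
\]
```

Both rules share the same support (27) but differ in confidence because their premises were ordered a different number of times.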

[16]: for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a customer orders {0} they will also order {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

Rule: If a customer orders Wristband they will also order Counterweight

- Confidence: 0.464

- Support: 13

Rule: If a customer orders Counterweight they will also order Wristband

- Confidence: 0.250

- Support: 13

Rule: If a customer orders Pouch they will also order Holster

- Confidence: 0.513

- Support: 20

Rule: If a customer orders Holster they will also order Pouch

- Confidence: 0.351

- Support: 20

Rule: If a customer orders Pouch they will also order Harness

- Confidence: 0.564

- Support: 22

Rule: If a customer orders Harness they will also order Pouch

- Confidence: 0.512

- Support: 22

Rule: If a customer orders Harness they will also order Holster

- Confidence: 0.628

- Support: 27

Rule: If a customer orders Holster they will also order Harness

- Confidence: 0.474

- Support: 27

Rule: If a customer orders Counterweight they will also order Harness

- Confidence: 0.346

- Support: 18

Rule: If a customer orders Harness they will also order Counterweight

- Confidence: 0.419

- Support: 18

Rule: If a customer orders Counterweight they will also order Holster

- Confidence: 0.519

- Support: 27

Rule: If a customer orders Holster they will also order Counterweight

- Confidence: 0.474

- Support: 27

Rule: If a customer orders Wristband they will also order Pouch

- Confidence: 0.179

- Support: 5

Rule: If a customer orders Pouch they will also order Wristband

- Confidence: 0.128

- Support: 5

Rule: If a customer orders Wristband they will also order Holster

- Confidence: 0.571

- Support: 16

Rule: If a customer orders Holster they will also order Wristband

- Confidence: 0.281

- Support: 16

Rule: If a customer orders Counterweight they will also order Pouch

- Confidence: 0.212

- Support: 11

Rule: If a customer orders Pouch they will also order Counterweight

- Confidence: 0.282

- Support: 11

Rule: If a customer orders Wristband they will also order Harness

- Confidence: 0.321

- Support: 9

Rule: If a customer orders Harness they will also order Wristband

- Confidence: 0.209

- Support: 9

[17]: from operator import itemgetter

sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)

[18]: sorted_support

[18]: [((3, 4), 27),

((4, 3), 27),

((1, 4), 27),

((4, 1), 27),

((2, 3), 22),

((3, 2), 22),

((2, 4), 20),

((4, 2), 20),

((1, 3), 18),

((3, 1), 18),

((0, 4), 16),

((4, 0), 16),

((0, 1), 13),

((1, 0), 13),

((1, 2), 11),

((2, 1), 11),

((0, 3), 9),

((3, 0), 9),

((0, 2), 5),

((2, 0), 5)]

[19]: sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    print("Rule: If a customer orders {0} they will also order {1}".format(features[premise], features[conclusion]))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

Rule #1

Rule: If a customer orders Harness they will also order Holster

- Confidence: 0.628

- Support: 27

Rule #2

Rule: If a customer orders Holster they will also order Harness

- Confidence: 0.474

- Support: 27

Rule #3

Rule: If a customer orders Counterweight they will also order Holster

- Confidence: 0.519

- Support: 27

Rule #4

Rule: If a customer orders Holster they will also order Counterweight

- Confidence: 0.474

- Support: 27

Rule #5

Rule: If a customer orders Pouch they will also order Harness

- Confidence: 0.564

- Support: 22

[20]: sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print("Rule: If a customer orders {0} they will also order {1}".format(features[premise], features[conclusion]))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

Rule #1

Rule: If a customer orders Harness they will also order Holster

- Confidence: 0.628

- Support: 27

Rule #2

Rule: If a customer orders Wristband they will also order Holster

- Confidence: 0.571

- Support: 16

Rule #3

Rule: If a customer orders Pouch they will also order Harness

- Confidence: 0.564

- Support: 22

Rule #4

Rule: If a customer orders Counterweight they will also order Holster

- Confidence: 0.519

- Support: 27

Rule #5

Rule: If a customer orders Pouch they will also order Holster

- Confidence: 0.513

- Support: 20
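The printing loops above repeat the same formatting code three times. One way to avoid the repetition is a small helper. The function name `format_rule` and the `demo_` dictionaries below are hypothetical; they only stand in for the workbook's `features`, `support`, and `confidence` objects.

```python
from operator import itemgetter


def format_rule(premise, conclusion, support, confidence, features):
    """Return the description of one rule as a multi-line string."""
    rule = (premise, conclusion)
    return (
        "Rule: If a customer orders {0} they will also order {1}\n"
        " - Confidence: {2:.3f}\n"
        " - Support: {3}"
    ).format(features[premise], features[conclusion], confidence[rule], support[rule])


# Stand-in data with the same shape as the workbook's dictionaries.
demo_features = ["Wristband", "Counterweight"]
demo_support = {(0, 1): 13, (1, 0): 13}
demo_confidence = {(0, 1): 13 / 28, (1, 0): 13 / 52}

# Print the rules from highest to lowest confidence.
ranked = sorted(demo_confidence.items(), key=itemgetter(1), reverse=True)
for index, ((premise, conclusion), _) in enumerate(ranked, start=1):
    print("Rule #{0}".format(index))
    print(format_rule(premise, conclusion, demo_support, demo_confidence, demo_features))
    print("")
```

Sorting by support instead only requires passing `demo_support.items()` to `sorted`; the helper itself stays unchanged.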

[21]: from matplotlib import pyplot as plt

plt.plot([confidence[rule[0]] for rule in sorted_confidence])

plt.ylabel('Confidence')

plt.xlabel('Rule') # possibly use the first five rules

[21]: Text(0.5, 0, 'Rule')

2 Cluster Analysis

Cluster analysis groups data into categories without knowing the categories in advance. For this exercise, we have customer data with four features and 3184 customers. The features are:

1. Transportation Cost

2. Warehouse Cost

3. Overhead

4. Profit

Without knowing how many customer groups exist in the data, we will read, plot, and cluster the customers into an acceptable number of groups using the k-means clustering algorithm.
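k-means alternates between assigning each point to its nearest centre and moving each centre to the mean of its assigned points. The following is a minimal NumPy sketch of that loop, not the scikit-learn implementation used later in the workbook; the function name `simple_kmeans`, the initialisation from randomly chosen data points, and the toy data are assumptions made for illustration.

```python
import numpy as np


def simple_kmeans(points, k, n_iter=100, seed=0):
    """Cluster points into k groups; returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centre, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centre to the mean of its points; keep it if the cluster is empty.
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels


# Toy data: two well-separated groups of three points each.
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centres, labels = simple_kmeans(toy, k=2)
print(centres)
print(labels)
```

The number of groups `k` has to be supplied up front, which is why the workbook later tries several values and compares the results.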

[59]: customer = pd.read_csv("cluster_dataset.csv")

[60]: customer.head()

[60]:    transportation_cost  warehouse_cost    overhead      profit
      0         18890.063700     1510.207094  683.145441  59675.1824
      1           242.019333      716.213643  122.229621  58091.2485
      2           119.276458      358.106821  122.229621  47502.0468
      3         33730.627370     1510.207094  683.145441  38209.7592
      4          7591.826018     1168.491532  325.997463  38197.7255

[61]: customer.shape

[61]: (3184, 4)

[62]: customer.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3184 entries, 0 to 3183

Data columns (total 4 columns):

 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   transportation_cost  3184 non-null   float64
 1   warehouse_cost       3184 non-null   float64
 2   overhead             3184 non-null   float64
 3   profit               3184 non-null   float64

dtypes: float64(4)

memory usage: 99.6 KB

[63]: customer.describe()

[63]:        transportation_cost  warehouse_cost     overhead          profit
      count          3184.000000     3184.000000  3184.000000     3184.000000
      mean           2107.769227     1101.700615   462.705894     3993.600789
      std            2894.888155      439.223708   262.686723     5284.903650
      min               0.094115      226.138945   122.229621  -137377.462000
      25%             513.508606      810.384711   325.997463     1377.042925
      50%            1374.763639     1168.491532   325.997463     3126.758200
      75%            2704.144414     1510.207094   683.145441     5214.488175
      max           49106.959210     2028.357506   915.429857    59675.182400

[64]: # box and whisker plots

customer.plot(kind='box', sharex=False, sharey=False)

[64]: <AxesSubplot:>

[65]: # box and whisker plots

customer["transportation_cost"].plot(kind='box', sharex=False, sharey=False)

[65]: <AxesSubplot:>

[66]: # box and whisker plots

customer["warehouse_cost"].plot(kind='box', sharex=False, sharey=False)

[66]: <AxesSubplot:>

[67]: # box and whisker plots

customer["overhead"].plot(kind='box', sharex=False, sharey=False)

[67]: <AxesSubplot:>

[68]: # box and whisker plots

customer["profit"].plot(kind='box', sharex=False, sharey=False)

[68]: <AxesSubplot:>
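A box-and-whisker plot draws points beyond the whiskers as individual outliers. Assuming matplotlib's default whisker length of 1.5 times the interquartile range, the limits for `profit` can be worked out from the quartiles reported by `customer.describe()` earlier. The function name `whisker_limits` is hypothetical.

```python
def whisker_limits(q1, q3, whis=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr


# Quartiles of profit copied from the describe() output.
profit_q1 = 1377.042925
profit_q3 = 5214.488175
lower, upper = whisker_limits(profit_q1, profit_q3)
print("lower limit: {0:.2f}, upper limit: {1:.2f}".format(lower, upper))

# The reported minimum and maximum of profit both fall outside the limits.
profit_min = -137377.462
profit_max = 59675.1824
print(profit_min < lower, profit_max > upper)
```

Extreme values like these stretch the axis of the box plot and, later on, can pull cluster centres towards themselves.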

[70]: # histograms

customer.hist(edgecolor='white', linewidth=1.1)

[70]: array([[<AxesSubplot:title={'center':'transportation_cost'}>,

<AxesSubplot:title={'center':'warehouse_cost'}>],

[<AxesSubplot:title={'center':'overhead'}>,

<AxesSubplot:title={'center':'profit'}>]], dtype=object)

[71]: from pandas.plotting import scatter_matrix

# scatter plot matrix

scatter_matrix(customer,figsize=(10,10))

plt.show()

[73]: # updating the diagonal elements in a pairplot to show a kernel density estimation (kde)

sn.pairplot(customer,diag_kind="kde")

[73]: <seaborn.axisgrid.PairGrid at 0x7fc1919d5b80>

[74]: # import KMeans

from sklearn.cluster import KMeans

[78]: # create kmeans object

kmeans = KMeans(n_clusters=4)

# create np array for data points

points = np.array(customer)

# fit kmeans object to data

kmeans.fit(points)

# print location of clusters learned by kmeans object

print(kmeans.cluster_centers_)

# save cluster labels for chart (labels_ comes from the fit above, so it matches the printed centres)
y_km = kmeans.labels_

[[ 1.23540923e+03 1.00672039e+03 4.26418588e+02 2.18647926e+03]

[ 3.67597186e+03 1.34914009e+03 5.54594706e+02 7.63349887e+03]

[ 2.52645291e+03 1.07432046e+03 1.22229621e+02 -1.37377462e+05]

[ 1.17679998e+04 1.40823540e+03 6.09630815e+02 2.19542255e+04]]
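The third centre printed above has a profit of about -137377, which equals the minimum profit in the `describe()` output, so that cluster may contain only a single extreme customer. One way to check is to count the members of each cluster. The sketch below uses a made-up label array `demo_labels` in place of the workbook's `y_km`.

```python
import numpy as np

# Made-up cluster labels standing in for y_km (one label per customer).
demo_labels = np.array([0, 0, 1, 0, 2, 1, 0, 3, 1, 0])

# bincount returns how many customers fall into each cluster label.
cluster_sizes = np.bincount(demo_labels)
for label, size in enumerate(cluster_sizes):
    print("cluster {0}: {1} customers".format(label, size))
```

With the workbook's data, `np.bincount(y_km)` would give the corresponding counts for the fitted clusters.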

[79]: plt.scatter(points[y_km ==0,0], points[y_km == 0,1], s=100, c='red')

plt.scatter(points[y_km ==1,0], points[y_km == 1,1], s=100, c='black')

plt.scatter(points[y_km ==2,0], points[y_km == 2,1], s=100, c='blue')

plt.scatter(points[y_km ==3,0], points[y_km == 3,1], s=100, c='cyan')

[79]: <matplotlib.collections.PathCollection at 0x7fc193094dc0>

[88]: # create kmeans object

kmeans = KMeans(n_clusters=5)

# create np array for data points

points = np.array(customer)

# fit kmeans object to data

kmeans.fit(points)

# print location of clusters learned by kmeans object

print(kmeans.cluster_centers_)

# save cluster labels for chart (labels_ comes from the fit above, so it matches the printed centres)
y_km = kmeans.labels_

[[ 2.46353548e+03 1.28705755e+03 5.53547056e+02 4.61479988e+03]

[ 1.08885603e+04 1.35750884e+03 5.62405013e+02 2.99614482e+04]

[ 2.52645291e+03 1.07432046e+03 1.22229621e+02 -1.37377462e+05]

[ 6.84868290e+02 8.42693534e+02 3.47143638e+02 1.19178766e+03]

[ 5.71651402e+03 1.40609811e+03 5.67251782e+02 1.08938860e+04]]

[91]: plt.scatter(points[y_km ==0,0], points[y_km == 0,1], s=100, c='orange')

plt.scatter(points[y_km ==1,0], points[y_km == 1,1], s=100, c='blue')

plt.scatter(points[y_km ==2,0], points[y_km == 2,1], s=100, c='green')

plt.scatter(points[y_km ==3,0], points[y_km == 3,1], s=100, c='black')

plt.scatter(points[y_km ==4,0], points[y_km == 4,1], s=100, c='red')

[91]: <matplotlib.collections.PathCollection at 0x7fc1938a7ee0>

[96]: from sklearn.metrics import silhouette_score

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_features = scaler.fit_transform(customer)
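`StandardScaler` rescales each column to zero mean and unit variance. As a rough illustration of what that means, the same transformation can be written directly in NumPy. The small array `demo` is made up, and the population standard deviation (`ddof=0`) is used, which matches scikit-learn's documented behaviour.

```python
import numpy as np

# Made-up data: two columns on very different scales.
demo = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 3.0],
    [400.0, 6.0],
])

# Subtract each column's mean and divide by its population standard deviation.
column_means = demo.mean(axis=0)
column_stds = demo.std(axis=0)  # ddof=0 by default
scaled_demo = (demo - column_means) / column_stds

print(scaled_demo.mean(axis=0).round(6))
print(scaled_demo.std(axis=0).round(6))
```

Scaling matters for k-means because it relies on distances: in the customer data the standard deviation of `profit` (about 5285) is roughly twenty times that of `overhead` (about 263), so without scaling the large-valued columns dominate the clustering.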

[107]: kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 42,
}
# A list holds the SSE values for each k
sse = []
for k in range(1, 50):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_features)
    sse.append(kmeans.inertia_)

[108]: plt.plot(range(1, 50), sse)

plt.xticks(range(1, 50))

plt.xlabel("Number of Clusters")

plt.ylabel("SSE")

plt.show()
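The workbook imports `silhouette_score` but does not use it. As a possible complement to the SSE curve, the sketch below computes the silhouette score for several values of k. It runs on synthetic data from `make_blobs` rather than the customer data, and the range of k and the settings are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups.
demo_points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# The silhouette score needs at least two clusters, so k starts at 2.
silhouette_by_k = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, init="random", n_init=10, max_iter=300, random_state=42)
    labels = model.fit_predict(demo_points)
    silhouette_by_k[k] = silhouette_score(demo_points, labels)

for k, score in silhouette_by_k.items():
    print("k={0}: silhouette={1:.3f}".format(k, score))

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print("highest silhouette at k =", best_k)
```

Scores closer to 1 indicate tighter, better separated clusters. On the workbook's data one would pass `scaled_features` instead of `demo_points`.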
