
STAT0041 Algorithms and Data Structures: ICA 2022


Instructions

The deadline is 9th of December, 5pm, via Moodle. Please see Moodle

instructions for submission.

This ICA must be done in pairs. If you have problems finding a pair, contact

me.

The first question is worth 20% of the total marks. The other two, 40% each.

You can write solutions in either Python or R, or a combination of both for different

questions. No single question should be written in two different languages!

In my instructions, I always refer to arrays starting from index 1. If you are using

Python, feel free to make them always start from index 0. This is not an important

point. Just be coherent.

Commenting code is recommended, but keep it light. If you think particular pieces

of code are clear enough, no need to clutter them with redundant comments. Your

future co-workers will thank you for your habit of keeping code uncluttered.

No need to worry about language-specific efficiency, e.g. both Python and R are

terrible with loops, but let's not worry about this. Programming language

pyrotechnics is not the point of STAT0041.

It's OK to deliver questions in separate files, .py or .R , but no more than one

file per question! Jupyter notebooks or R Markdown are especially welcome (if you

follow this route, please just create one file for all answers). You can only upload

one file, so you need to zip whatever files you have. No unusual compression

formats like RAR please! Zip only.

Any questions where you need to write in plain English must be done as

comments/Markdown text in the respective .py / .R / .ipynb / .Rmd files.

Indicate very clearly which answer corresponds to which question.

Despite the verbosity of each question (needed to remove ambiguities), rest

assured that each deliverable amounts to actually fairly little coding. Expect that

your full solution for each question will be shorter than the corresponding question

setup!

The deadline is 9th of December, 5pm, via Moodle. Please see Moodle

instructions for submission.


Question 1

Calculating order statistics is a very standard task in descriptive statistics. Let's

investigate empirically how much of a cost it is to use sorting as a way of

calculating them.

Long story short: you will investigate how long it takes to compute order

statistics using sorting, if every now and then you need to sort the data. This will

be done by implementing a pretend simulation of changes to the data that force a

re-sorting. You will compare your implementation against whatever built-in way of

finding medians R or Python has to offer.

Here are the details. At the end of this question there are several suggestions on

how to code some of the finer details in R or Python.

Deliverable 1: Write a function simulate_order_sort(m, n, p) to simulate

requests for order statistics as if they were coming from real people. Argument m

indicates that we are going to run m simulated requests for order statistics.

Argument n indicates that we are going to simulate arrays of length n from

which order statistics will come: this will be done by sorting the data and picking

the corresponding position o in the sorted data, where o is chosen randomly in 1, 2, 3, …, n. Argument p is a number between 0 and 1 that controls how often

we need to re-sort the data because of pretend changes to it.

The simulator must work as follows:

1. Simulate an array x of length n , each entry drawn independently from a

standard Gaussian.

2. Sort x .

3. Create an array y of length m that will store results.

4. Repeat the following m times. First, draw an integer o in {1, 2, …, n} uniformly at random (assuming x starts from index 1), and copy x[o] to y[i] at the corresponding iteration number i in 1, 2, 3, …, m. As seen in Chapter 2, x[o] will be the order statistic corresponding to the request o. At the end of the iteration, with probability p you will change the data: this means drawing x again and re-sorting it (therefore, with probability 1 − p there is no change of data and no re-sorting).

5. At the end of the m iterations, return y. (A minimal sketch of this procedure is given below.)
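For concreteness, a minimal sketch of what such a simulator could look like in Python is below (this is only one possible shape, using NumPy and 0-based indexing; your own structure may differ):

    # A minimal sketch of simulate_order_sort (Python, 0-based indexing).
    import numpy as np

    def simulate_order_sort(m, n, p):
        x = np.random.normal(size=n)    # step 1: n standard Gaussian draws
        x.sort()                        # step 2: sort the data
        y = np.empty(m)                 # step 3: array for the results
        for i in range(m):              # step 4: m simulated requests
            o = np.random.choice(n)     # a random order-statistic request (0-indexed)
            y[i] = x[o]
            if np.random.random() < p:  # with probability p, the data "changes"
                x = np.random.normal(size=n)
                x.sort()
        return y                        # step 5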

Deliverable 2: Write a short script to test your simulate_order_sort function. For some m (say, m = 10000) and n (say, n = 10000 also), choose a range of values for p (say, {0, 0.01, 0.1, 0.5}) and print the wall-clock time required to run simulate_order_sort for each choice of p. Follow up the code by writing a sentence or two explaining where the variability in wall-clock time caused by p comes from.
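One possible shape for such a timing script is sketched below (assuming the simulate_order_sort function above; time.time() plays the role that Sys.time() plays in R):

    # A rough timing loop for Deliverable 2.
    import time

    m, n = 10000, 10000
    for p in [0, 0.01, 0.1, 0.5]:
        start = time.time()
        simulate_order_sort(m, n, p)
        print(f"p = {p}: {time.time() - start:.2f} seconds")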

Deliverable 3: Say all we care about is the median (here defined as the order statistic at position ⌈n/2⌉). Create a modified version of simulate_order_sort, to be called simulate_median_sort, where the only difference is that o is always set to ⌈n/2⌉. Write a second function, simulate_median(m, n, p), with analogous inputs but which does no sorting at all: all medians are calculated by some existing choice of median computation function you can choose from R or Python. Do the wall-clock time analysis for both functions at different choices of p. Report the differences you see and whether they correspond to your expectations.
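A sketch of what the no-sorting variant might look like (again only an outline, assuming the statistics.median function mentioned in the hints below):

    # A minimal sketch of simulate_median: no explicit sorting on our side.
    import numpy as np
    from statistics import median

    def simulate_median(m, n, p):
        x = np.random.normal(size=n)
        y = np.empty(m)
        for i in range(m):
            y[i] = median(x)            # built-in median computation
            if np.random.random() < p:  # pretend the data changed
                x = np.random.normal(size=n)
        return y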

Programming hints:

As we have seen in the notes/slides, for any n we have that ⌈n/2⌉ is the "rounded-up" version of n/2. This corresponds to functions like ceiling in R or numpy.ceil in Python.

If you are answering this question in Python, it's OK to use Python lists instead

of a "proper" array of the likes you would see in NumPy. Feel free to use

arrays/lists which start at 0, but adapt the range of o accordingly.

Several exercises in the Exercise Sheets involve wall-clock timing and

simulated data. For instance, see Exercise 8 of Exercise Sheet #1. In R,

function Sys.time() has a similar functionality.

To sample from a set of integers in Python, you can look at the NumPy

function np.random.choice , which by the way was used in the solution of

Exercise 2 from Exercise Sheet #2. In R, function sample has a similar

functionality. The same functions can be used to simulate a coin flip with

probability p , which you need in order to decide whether it's time to pretend

that the data has changed. For standard Gaussians, np.random.normal in

Python and rnorm in R are what you will look for.

To calculate medians in Python, feel free to use function median from the

package statistics . In R, function median is built-in already.
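The hints above can be combined into a tiny demo of the building blocks (a sketch only; the exact calls you use are up to you):

    # A quick demo of the hinted building blocks (Python).
    import numpy as np
    from statistics import median

    n, p = 10, 0.3
    x = np.random.normal(size=n)      # n standard Gaussian draws
    o = np.random.choice(n)           # a random order-statistic request (0-indexed)
    changed = np.random.random() < p  # a coin flip with probability p
    print(np.sort(x)[o], median(x), changed)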

Question 2


You will implement a set of functions that manipulate an array-like data structure.

Let's call this data structure a "stretchable array" (s-array), for lack of a better name. The idea is to build upon a standard array, but deliver a data structure whose size can be extended in a way that the user of the data structure doesn't need to know about.

Long story short: every time we modify an element that is outside the range of

the array, we will double the array in size. You will write the record and relevant

functions that manipulate this data structure.

Here are the details. They are long and somewhat tedious in order to make the

question as clear as possible. Your solution will probably be shorter than the

framing of the question...

Background: A variable of "type" s-array is a record which we will describe

shortly. We will assume that the data in a s-array are always integers. We will

operate on a s-array only via the following functions:

s_array_create(l) . Creates a s-array of capacity at least l with all

entries initialised to 0 . It should return a new record of "type" s-array. The

definition of "capacity" is given below. The idea of "at least l " will also be

explained below.

s_array_set(x, i, value) . Set the i-th position of s-array x to

value . Informally, this is what we would normally code as x[i] = value

in Python or R. But x is not meant to be a standard built-in array that comes

pre-defined in the language. To avoid fiddling with Python's or R's syntax, all

operations in a s-array will be done by explicit functions. This function doesn't

need to return anything (but see the very bottom of the Programming hints

section if you are using R).

s_array_get(x, i) . This function returns the value stored at the i-th

position of s-array x . Informally, this is equivalent to returning x[i] if we

were to use a friendlier syntax. This will be a trivial function to implement, as

you should quickly realise from the specification below.

s_array_len(x) . This function returns the attribute length of x , as

described below.

A s-array must be implemented as a record with the following attributes.

data : an internal standard array where the data is actually stored. This

means that e.g. s_array_get(x, i) should, under the hood, return

x.data[i] .

capacity : the length of data .


length : that's where things get interesting. Here, length is the highest

index among all elements which have been modified by a call to

s_array_set .

What's this weird business about the difference between capacity and

length ? The idea is that we are allowed to set the value of an entry which lies

beyond the size of data . Since data is an array that is assumed to be of fixed

size, this will mean creating a new, larger, array and destroying the old, while

copying information. In particular, every time there is a call s_array_set(x, i,

value) where i > x.capacity then, inside the s_array_set function, the

following should happen:

1. We create a new array new_data that has twice as many elements as

data , setting all of its entries to zero.

2. We copy all entries of x.data to the corresponding positions of

new_data .

3. We set x.data to now point to new_data .

4. We update x.capacity and x.length accordingly (the former is just the

size of data , the latter is just i if we start counting positions from 1

instead of 0).
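One way the resizing logic in steps 1-4 could look in Python is sketched below (a sketch only, assuming a small SArray class for the record and 1-based positions as in the text; the while loop keeps doubling until position i fits):

    # A sketch of the s-array record and the doubling logic in s_array_set.
    class SArray:
        def __init__(self, data, capacity, length):
            self.data = data          # internal standard array (a Python list here)
            self.capacity = capacity  # length of data
            self.length = length      # highest position modified so far

    def s_array_set(x, i, value):
        while i > x.capacity:                  # steps 1-3: grow until position i fits
            new_data = [0] * (2 * x.capacity)  # new array, twice as large, zero-filled
            new_data[:x.capacity] = x.data     # copy the old entries across
            x.data = new_data
            x.capacity = 2 * x.capacity        # step 4: update the capacity
        x.data[i - 1] = value                  # 1-based position i -> list index i - 1
        x.length = max(x.length, i)            # update length if i is the new highest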

The idea is, if we want to print the entries of some s-array x , we stop at the last

position that we know to be non-zero. We will treat zeroes as a special,

"uninteresting", value. Given this explanation, your functions must also do the

following:

If we call s_array_set(x, i, value) where i exceeds the length but

not the capacity of x , then all we do is to update the length of x to indicate

where the highest-indexed modified element can be found.

For what follows, you will provide code in either R or Python. For instance, step 3

above can be done in Python and R by simply assigning an array (or list) to

another. A record in R can be implemented just as a list with named fields (you do

need to name the fields, e.g. you want your code to read as x$data as opposed

to x[[1]] , for instance). A record in Python can be implemented as a class. A

"standard array" in Python can be just a Python list. No need to invoke NumPy or

any other special data structures for this question.

Deliverable 1: Write all functions above, adding some basic comments to the

code to explain the logic of what you are doing. A special requirement is the

following: assume that you have a project manager that tells you that the length of

the data attribute should always be a power of 2, with minimal waste. That is, a


call that creates a s-array by, say, x = s_array_create(12) should create a

record where attribute data is an array of length 16, which is the smallest power

of 2 greater than or equal to 12. A call to s_array_create(17) should create

an array data of size 32 and so on. The 12 in the call x =

s_array_create(12) just means that x.capacity must be at least 12, not

exactly 12.

(This may sound arbitrary, but it has a reason: this is because the hypothetical

manager believes that empirically, given the experience they have, this will provide

a good trade-off between memory use and multiple array destruction/copying.

Such trade-offs are not that uncommon when engineering a data structure.)
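A sketch of how the power-of-2 rule could be implemented (assuming the SArray class sketched earlier; math.ceil and math.log2 do the rounding up):

    # A sketch of s_array_create with the "smallest power of 2 >= l" capacity rule.
    import math

    def s_array_create(l):
        capacity = 2 ** math.ceil(math.log2(l))  # e.g. l = 12 -> capacity = 16
        return SArray(data=[0] * capacity, capacity=capacity, length=0)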

Deliverable 2: Write a short explanation (< 200 words) of what you personally

think are the good and bad properties of the above implementation when the goal

is to have a data structure where: (i) most of the entries are 0s and are not

interesting; (ii) the size of the data structure may change as we use it. Briefly

compare to some alternative way, of your choice, of implementing such a

structure that is not based on standard arrays but on another starting point like a

linked list.

Deliverable 3: Say that from time to time you may want to check whether you

could shrink the physical length of the data attribute of a particular s-array x

(as some non-zero values of data get turned to 0 by s_array_set calls).

Implement a function s_array_pack(x) which checks the position of the last

entry of x.data that is not zero (we assume that 0s mean "irrelevant"). We then

check whether it is possible to resize x.data to a smaller power of 2 size while

not losing any "relevant" data, and do the reduction. For example: if x.data is

an array of size 64 ( x.capacity == 64 ) and the non-zero element of x.data

of highest index is x[30] ( x.length == 30 , assuming we start counting from

1), then we can update x.data to a size 32 array, updating x.capacity to 32,

accordingly.
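A rough sketch of what s_array_pack could look like, assuming the SArray record above and 1-based positions (your own version may organise this differently):

    # A sketch of s_array_pack: shrink data to a smaller power of 2 if possible.
    import math

    def s_array_pack(x):
        last = 0                                  # highest position with a non-zero entry
        for pos in range(x.capacity, 0, -1):
            if x.data[pos - 1] != 0:
                last = pos
                break
        new_capacity = 2 ** math.ceil(math.log2(last)) if last > 0 else 1
        if new_capacity < x.capacity:             # only shrink when it saves space
            x.data = x.data[:new_capacity]
            x.capacity = new_capacity
        x.length = last                           # length matches the last non-zero entry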

Deliverable 4: Provide a short criticism (< 200 words) of your implementation of

s_array_pack , and how you could improve it for a future version (no need to

write any further code!). Or, if you think it is perfect, you can make the case of why

you think so!

Deliverable 5: Write a short piece of code demonstrating the functionality of a s-

array, in any way you want. Feel free to be minimalist. Just make sure that each

function above is called at least once. The main point of this question is to force


you to test your code.

Programming hints:

check out how logarithms can be calculated in R or Python so that you know how to "round up" numbers to the closest power of 2 (for instance, log2(12) ≈ 3.58 < 4, which means that 2^4 = 16 is the capacity that should be allocated initially). So basically you need something like 2^⌈log2(l)⌉ to get the smallest power of 2 no smaller than l. Look for functions like ceil or ceiling , pow or ^ , numpy.log2 or just log2 , depending on your choice of language. Make sure the result is converted to an integer in the end ( int() in Python or round() in R etc.)

You can make your s-array start at index 0 or index 1, it's up to you (probably whichever is easier in Python or R). Bear in mind that if you start at 0, you may want to think a bit about what x.length and x.capacity

should be...

If you are implementing this in R, you may have difficulties with functions that

modify the contents of an array: R's default behaviour is "passing by value".

Unfortunately there is no elegant way out (if you find one that doesn't require

obnoxious overuses of object-oriented programming, let me know!). You may

need to use a function like s_array_set(x, i, value) by ridiculously

returning the entire array back, resulting in calls like x <- s_array_set(x,

i, value) . R wasn't created with efficiency in mind, but readability, and

there is something positive to be said about avoiding calls by reference. Still,

don't shoot the messenger...

Question 3

Graph traversal algorithms like breadth-first (BFS) and depth-first (DFS) search

are among the most common framings of search problems that exist in practice.

How they perform empirically may depend on the type of problem and how they

are coded, as in the worst-case scenario they are equivalent.

Long story short: you will analyse empirically how BFS and DFS work across a

variety of simulated problems and metrics.

Here are the details.


Background: You will use the package igraph, which exists for both Python and R

(find installation instructions at https://igraph.org). Here's the Python

documentation and here's the R documentation. Both Python and R users may

benefit from the Python tutorial, keeping in mind that the syntax will be different in

R.

Part of the idea of this question is that you will be (partially) left to your own wits

on how to understand a package documentation based on your knowledge of

algorithms and data structures (without which the documentation would look like

gibberish). That's what you should expect in your professional life! But I know your

time is limited, so I'll point out some of the functions you will need.

There are different ways graphs can be represented in igraph. I'd recommend you

to first browse the Graph generation section of the documentation. It suffices to understand that we can use adjacency matrices and adjacency lists to represent a

graph.

Deliverable 1: Write a function, to be called search_eval(m, n, p) , which

runs m simulations as follows. In each simulation, this function generates a

random directed graph over n vertices. At the end of each simulation, we report

two numbers: the wall-clock time of running BFS and DFS on each of the m

cases. So, search_eval(m, n, p) must return an m × 2 matrix r , where the

first column of r are the wall-clock times for BFS, and the second column are

the corresponding times for DFS.

To generate a random graph, I suggest using an adjacency matrix (say, called g )

for representation, and independently setting each entry g[i, j] to 1 with

probability p (the third argument in the call to search_eval ), 0 otherwise. You

can implement this in a variety of ways using random number generators of your

choice, with or without loops, even in NumPy, but you should do this without

making use of any specific igraph function that generates random graphs.

For each simulation, your function must choose one vertex uniformly at random to

be a "source" (called a "root", by igraph's documentation) vertex from which a

search starts. Your function must then run BFS and DFS using igraph, recording

the wall-clock time separately for each. See Exercise 8 from Exercise Sheet #1 for

an example of wall-clock time experiment. There is no need for a "destination"

vertex. When we run a BFS or a DFS in a graph using igraph, it just calculates how

many steps it takes to reach all other vertices (if at all reachable).

The function for depth-first search is called igraph_dfs in R, and just dfs in

Python. You should figure out the details of how to use them from the igraph


docs (you can just do a web search to find the corresponding pages, for instance).

You are expected to be more independent here, as you would if someone assigned

you this task as part of a job.
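A rough sketch of how search_eval could be organised is below (a sketch under the assumption that python-igraph's Graph.Adjacency constructor and its bfs()/dfs() methods are used; the details are for you to work out):

    # A sketch of search_eval: random directed graphs, timed BFS and DFS.
    import time
    import numpy as np
    from igraph import Graph

    def search_eval(m, n, p):
        r = np.zeros((m, 2))                     # column 0: BFS times, column 1: DFS times
        for k in range(m):
            adj = (np.random.random((n, n)) < p).astype(int)  # random adjacency matrix
            graph = Graph.Adjacency(adj.tolist())             # directed igraph graph
            root = np.random.choice(n)                        # random "root"/source vertex
            start = time.time()
            graph.bfs(root)
            r[k, 0] = time.time() - start
            start = time.time()
            graph.dfs(root)
            r[k, 1] = time.time() - start
        return r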

Deliverable 2: In this question, we will test whether there is any empirical

difference in the DFS and BFS implementations in igraph for your randomly

generated graphs. You can think of this as a sanity check of this library. We will do

this in two ways.

First, write a script that calls r = search_eval(m, n, p) with m = 1000 , n

= 200 and a variety of choices of p (I suggest varying from small numbers like 0.01 up to 0.1; it's up to you). For each instance of p , do a z-test of whether the difference between the first and second columns of r comes from a

distribution with zero mean. The hypothesis is rejected if the resulting p-value is

less than a threshold (say, 0.05).

Second, plot either a scatterplot or a qqplot of the two columns of r to visually

judge how different (if at all) the run-time of BFS and DFS in igraph may look to

you. Write a sentence or two with your interpretation.
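One possible shape for this script (a sketch only; the z-test here uses statsmodels' ztest, which is just one of several functions that would do):

    # A sketch of the Deliverable 2 script, assuming search_eval from above.
    import matplotlib.pyplot as plt
    from statsmodels.stats.weightstats import ztest

    for p in [0.01, 0.05, 0.1]:
        r = search_eval(1000, 200, p)
        diff = r[:, 0] - r[:, 1]           # BFS time minus DFS time, per simulation
        stat, pval = ztest(diff, value=0)  # H0: the differences have zero mean
        print(f"p = {p}: z = {stat:.2f}, p-value = {pval:.3f}")
        plt.scatter(r[:, 0], r[:, 1], s=5) # scatterplot of BFS vs DFS times
        plt.xlabel("BFS wall-clock time")
        plt.ylabel("DFS wall-clock time")
        plt.show()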

Deliverable 3: A call to BFS or DFS in igraph returns an array (among other

outputs, read the docs) indicating the ordering by which each vertex was found in

our search (if they were found at all, as there might be no path linking the source

to a particular vertex). Let's call this array order from now on. This means that

the DFS/BFS implementation of igraph is meant to list how many steps it takes to

go to all vertices from a particular source vertex, according to the respective

search schedules of each algorithm.

Let's create a family of graphs to empirically see how it affects DFS vs BFS. You will write a function depth_charge(m, w, d) . This function will create m simulations of a directed graph defined as follows. Each graph has a "root" vertex v_0, and w children v_1, v_2, …, v_w. We then choose one v_i of v_1, v_2, …, v_w, uniformly at random, to start a chain v_i → v_{w+1} → v_{w+2} → … → v_{w+d}. We then call BFS and DFS on this graph, and peek into output order to see how many search steps it took to find v_{w+d} starting from v_0. We store this information for all m trials and return an m × 2 matrix, similarly to what you did in Deliverable 2.
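A sketch of how depth_charge could build such a graph is below (assuming python-igraph's Graph constructor and the same bfs()/dfs() calls as before; vertex 0 plays the role of v_0):

    # A sketch of depth_charge: a root, w children, and a chain of length d.
    import numpy as np
    from igraph import Graph, OUT

    def depth_charge(m, w, d):
        r = np.zeros((m, 2))
        for k in range(m):
            edges = [(0, j) for j in range(1, w + 1)]           # v_0 -> v_1, ..., v_w
            i = int(np.random.choice(np.arange(1, w + 1)))      # child starting the chain
            edges.append((i, w + 1))                            # v_i -> v_{w+1}
            edges += [(j, j + 1) for j in range(w + 1, w + d)]  # chain down to v_{w+d}
            graph = Graph(n=w + d + 1, edges=edges, directed=True)
            bfs_order = graph.bfs(0, mode=OUT)[0]  # vertices in BFS visiting order
            dfs_order = graph.dfs(0, mode=OUT)[0]  # vertices in DFS visiting order
            r[k, 0] = bfs_order.index(w + d)       # steps until v_{w+d} is found by BFS
            r[k, 1] = dfs_order.index(w + d)       # ... and by DFS
        return r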

Given your code for depth_charge , write a script that calls it for m = 1000 , w

= 100 and d = 100 . Do a comparison between igraph's implementation of BFS

and DFS by plotting a histogram of the first and second columns of the output r = depth_charge(1000, 100, 100).

Redo the analysis above, using a modified version of depth_charge (call it depth_charge_1 ) which always chooses v_i to be v_1. Again, do the same by setting v_i to be v_w, and call that depth_charge_final .

What are your conclusions, particularly in light of the following conjecture?

a standard implementation of BFS or DFS would treat the ordering of the children of v_0 in whatever order you entered them when creating the graph variable. That is, if you created an adjacency matrix, then the children of a vertex would be peeked into by following the same ordering of the columns of the original matrix. This means that, in the second variation ( depth_charge_1 ), DFS would be much more efficient than BFS, taking about d steps to find v_{w+d}, while BFS would take about w + d steps instead.

I actually don't know whether this is true, as I haven't looked at how igraph is

implemented. Let's find out empirically, and assess its implications!

Programming hints:

The solution to Exercise 8 in Exercise Sheet #1 and several exercises in

Exercise Sheet #2 (Exercise 2 in particular) provide some ideas on how to run

simulations, including a demonstration of np.random.choice to simulate

drawing integer values (useful to draw whether an edge exists, and which

vertex among v_1, …, v_w will grow a chain). In R, function sample is the

one to look for.

The Graph generation link above should tell you how to create a igraph graph

out of an adjacency matrix. For the R version, it should be reasonably clear

what the corresponding R function is after you realise what needs to be done

in Python.

to do a z-test in Python, check out this function. For a scatterplot or qqplot,

see here and here.

in R, a simple implementation of a z-test and a demonstration of

scatterplot/qqplot can be found here.

If you haven't done any statistics module before and need help with what a

scatterplot or a qqplot means, please feel free to contact me.

