
Academic Session: 2023-24

Lab Exercise 3: Spellchecking (Better Trade-offs)

Duration: 4 weeks

You should do all your work in the lab3 directory of the COMP26120 2023 repository - see Blackboard

for further details. You will need to make use of the existing code in the branch as a starting point.

Important: You submit this lab via a quiz on Blackboard. This will:

1. Ask you some questions about your implementation, including the hash of the commit you want

us to mark (see below for details of this).

2. Ask you to upload a zip file of the python, c or java folder you have been working in.

3. Ask you to upload a PDF report of the experiments you will run in Part C of this lab.

4. Ask you for a statement on whether and (if relevant) how you used generative AI tools in your

work. If you used generative AI you also have to upload a file containing transcripts of your

usage.

You can save your answers and submit later so we recommend filling in the questions for each part as

you complete it, rather than entering everything at once at the end.

• You MUST fill in and submit the quiz (at least some of it) to get marked.

• Reports MUST be uploaded to Blackboard if you want them marked.

• Code you want marked MUST be in the commit hash you give us in the Blackboard quiz.

We will not mark anything pushed to GitLab if we don’t have the commit hash we can

identify from Blackboard. Submission time is recorded as the time the Blackboard quiz

is submitted, not the time you last pushed to GitLab or the last time you saved the quiz.

Code Submission Details

You have the choice to complete the lab in C, Java or Python. Program stubs for this exercise exist

for each language. Only one language solution will be marked.

Because people have had a number of issues with GitLab over the years we take a multiple redundancy approach to submission of code. This involves both pushing a commit to GitLab and uploading

a zip of your code to Blackboard. By preference we will mark the code you submitted to GitLab but if we can't find it or it doesn't check out properly then we will look in the zip file on Blackboard. Please do both to maximise the chance that one of them will work.

Figure 1: Identifying the hash of your most recent commit in GitLab

Figure 2: Identifying the hashes of previous commits in GitLab

When you submit the assignment through Blackboard you will be asked for the hash of the commit

you want marked. This is to make sure the TAs can identify exactly which GitLab commit you want

marked.

You can find the hash of your most recent commit by looking in your repository on GitLab as

shown in figure 1.

You can also find the hash for a previous commit by clicking on the “commits” link and then

identifying the commit you are interested in. This is shown in figure 2.

Note that while the full hashes for commits are quite long, we only need the first 8 characters (as shown in the screenshots) to identify the commit for marking.

For this assignment, you may use built-in libraries in your chosen language but bear in

mind that we ask you to implement hash sets - simply using the Java HashSet class (for

instance) will not get you the marks for implementation.

Reminder: It is bad practice to include automatically generated files in source control (e.g. your git

repositories). This applies to object files (C), class files (Java), and compiled bytecode files (Python).


It’s not fatal if you do this but it can cause issues during marking so try not to.

While it is fine to discuss this coursework with your friends and compare notes, the work submitted

should be your own. In particular this means you should not have copied any of the source code,

or the report. We will be using the turnitin tool to compare reports for similarities.

Generative AI You may use generative AI for this assignment. If you choose to do so you should

read the guidance notes and watch the guidance video in the Blackboard folder for the lab. At the

start of the quiz you are asked to make a statement about your use of generative AI. If you leave

this blank it will be understood as a declaration that you have not used generative AI

for the coursework. The statement should make the following things clear:

1. Which generative AI tool you used.

2. What parts of the coursework you used it for: if you used it for the code was it for all the code

or only some of it? If just some, which bits? If you used it to help you answer quiz questions,

which quiz questions? If you used it to help with the report, which parts of the report?

3. What steps you took to ensure that the generative AI tool had provided correct information.

4. What evidence there is either in the statement, or in your coursework submission, that demonstrates you have met the coursework learning outcomes.

If you used generative AI you should also upload a zip file containing the transcripts of your usage in

question 2 of the Blackboard quiz.


Learning Objectives

By the end of this lab you will be able to:

• Explain the role of the hash function in the implementation of hash tables and describe at least one hash collision strategy.

• Explain the basic structure of a binary search tree and the basic operations on this structure.

• Produce C, Java, or Python code to implement the above concepts.

• Evaluate experimentally the behaviour of hash sets and binary search trees.

Introduction

The aim of this exercise is to use the context of a simple spell-checking program to explore the hash

table and binary search tree data structures.

This exercise builds on work in Labs 1 and 2. If you have not done these labs we recommend you

take a look at them and their model solutions in order to familiarise yourself with the spell-checking

program and our expectations for creating and writing up experiments.

Getting the Support Code

You should be able to see a lab3 folder in your GitLab that contains all the support files for this lab. You can find instructions for accessing your GitLab on Blackboard.

Data Structures

The spell-checking program stores a dictionary of words using a Set datatype. There are two data

structures used to implement this Set datatype in this lab. In Lab 1 we introduced this datatype

but we repeat that information here. You may also need to look online (e.g. the Wikipedia page) to

complete your knowledge.

Set. This is an abstract datatype that is used to store the dictionary of words in our spell-checking

program. The operations required for this application are:

• find whether a given value (i.e. string, alphabetic word) appears in the set.

• insert a given value in the set. Note that there is no notion of multiplicity in sets - a value either appears or it does not. Therefore, if insert is called with a duplicate value (i.e. it is already in the set) it can be ignored.

There would usually also be a remove function but this is not required for this application. We also include two further utility functions, which we ask you to implement, that are useful for printing what is happening:

• print_set to list the contents of a Set.

• print_stats to output statistical information to show how well your code is working.

You should have already used a dynamic array to implement this datatype in Lab 1.

In this exercise we use hash tables and binary search trees to implement the Set datatype. These

have been introduced in lectures and we describe these briefly below. You may want to look at the

recommended textbooks or online to complete your knowledge.


Hash Table. The underlying memory structure for a hash table is an array. A hash function is

then used to map from a large ‘key’ domain to the (much smaller) set of indices into the array, where

a value can be stored. When two values in the input domain map to the same index in the array this

is called a collision and there are multiple ways to resolve these collisions. To use a hash table to

represent a set we make the key and value the same - the result is usually called a hash set.

Binary Search Tree. You can think of a tree as a linked list where every node has two ‘children’.

This allows us to differentiate between what happens in each sub-tree. In a binary search tree the

general idea is that the ‘left’ sub-tree holds values smaller than that found in the current node, and

the ‘right’ sub-tree holds larger values.

Lab 3: Better Storage

In Lab 1 we achieved a faster find function by first sorting the input. In this part we explore two data

structures that organise the data in a way that should make it faster to perform the find operation.

Part a: Hash Table Lookup

Time Budget: Students generally find Part a of this lab more challenging than Part b. If you have

not attempted Lab 1, then you should budget around 1 hour for familiarising yourself with the code

stubs we provide, the spell-checking program, command line flags, test harness and set data-type

implementation. You might want to look at the sample answers to Lab 1 to assist with this. You

should then budget 2-3 hours for the implementation itself. If you have spent more than 4 hours

on Part a, we recommend skipping ahead to Part b and only returning to Part a if you have time

remaining.

So far our solution to a fast find function has been sorting the input and in Part b we will look at

storing it in a sorted order. In this part you will take a different approach that relies on a hash

function to distribute the values to store into a fixed size array.

In this part you need to edit the program stub in your chosen language for hash sets. You should:

1. Complete the insert function for hash sets. What should you do if the value to be inserted

already exists? Hint: this is representing a set.

2. Complete the find function. Importantly, it should follow the same logic as insertion.

3. Implement the print_set function – This should print out a sensible representation of the hash set when called with the -vvv flag.

4. Implement the print_stats function – This should print out a sensible set of statistics. This should include the total number of collisions (including those incurred when re-inserting after a rehash), total number of rehashes and the average number of collisions per access. print_stats is automatically called by the spell-checking program on completion. We will look at this output so do not change or disable it. This will require some extra fields in the class (Java, Python) or node struct (C); see the sketch after this list.
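As a rough illustration of the bookkeeping item 4 asks for, here is a minimal Python sketch. The field and method names (num_collisions, num_rehashes, num_accesses, print_stats) are assumptions for this sketch, not the names used in the stubs; adapt them to whatever the stub in your chosen language already defines.

# Minimal sketch (Python). All names here are hypothetical - the actual stubs
# define their own class and fields, which you should extend rather than replace.
class HashSetStats:
    def __init__(self):
        self.num_entries = 0      # how many values are currently stored
        self.num_accesses = 0     # how many insert/find calls have been made
        self.num_collisions = 0   # total collisions, including those during rehashing
        self.num_rehashes = 0     # how many times the table has been rebuilt

    def print_stats(self):
        average = (self.num_collisions / self.num_accesses) if self.num_accesses else 0.0
        print("Entries:", self.num_entries)
        print("Collisions:", self.num_collisions)
        print("Rehashes:", self.num_rehashes)
        print("Average collisions per access: {:.3f}".format(average))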

You can find a discussion of how to test your code in the section of these instructions on Testing (after

Part b).

The hash-value(s) needed for inserting a string into your hash-table should be derived from the

string. For example, you can consider a simple summation key based on ASCII values of the individual characters, or a more sophisticated polynomial hash code, in which different letter positions are

associated with different weights. (Warning: if your algorithm is too simple, you may find

that your program is very slow when using a realistic dictionary and causes issues when


you perform your experimental evaluation.)
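To make the polynomial idea concrete, here is a minimal Python sketch. The multiplier 31 and the function name are illustrative choices for this sketch, not something the lab requires.

# Illustrative polynomial hash (Python). The constant 31 is a common choice,
# not something mandated by the lab; any reasonable odd multiplier works.
def polynomial_hash(word, table_size):
    h = 0
    for ch in word:
        # Multiply the running value by the weight and add the next character code,
        # so different letter positions contribute differently to the result.
        h = h * 31 + ord(ch)
    return h % table_size

# Example: polynomial_hash("cat", 101) and polynomial_hash("act", 101) differ,
# whereas a simple summation of ASCII values would give both words the same index.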

We want you to implement an open addressing hashing strategy so that collisions are dealt

with by using a collision resolution function. The simplest of these is linear probing, but you could also choose to implement quadratic probing or double hashing. To understand what is going on you should add code to count the number of collisions so you can calculate the average per access in print_stats. You are welcome to implement whatever

open addressing strategy you like.
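For illustration, the following Python sketch shows linear probing with a collision counter. The names (table, stats, probe_insert, probe_find) are assumptions for this sketch rather than the names in the stubs, and the hash_value argument would come from a hash function like the one sketched above.

# Illustrative linear probing (Python). This assumes the table always has at
# least one empty slot; the rehashing discussed below is what guarantees that.
def probe_insert(table, value, hash_value, stats):
    size = len(table)
    index = hash_value % size
    while table[index] is not None:
        if table[index] == value:
            return False                 # duplicate: sets ignore repeated inserts
        stats["collisions"] += 1         # slot occupied by a different value: a collision
        index = (index + 1) % size       # linear probe: step to the next slot
    table[index] = value
    return True

def probe_find(table, value, hash_value):
    size = len(table)
    index = hash_value % size
    while table[index] is not None:      # same probe sequence as insert
        if table[index] == value:
            return True
        index = (index + 1) % size
    return False                         # an empty slot means the value is absent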

The program stubs allow you to implement several collision strategies and hash functions which

can be accessed by modes already given in the stubs. The implementation you want us to mark

should run in mode 0. You may find this useful if you are attempting the extension exercise in the

report. More details of these are contained in the language specific instructions. Do not change these

since they are used in our testing processes. Write your code so that you can use the -m parameter, which sets the mode.

An issue with open addressing is what to do when it is not possible to insert a value. To make things

simple to begin with you can simply fail in such cases. Your code should keep the num_entries field

of the table up to date. Your insert function should check for a full table, and exit the program

cleanly should this occur. Once this is working you should increase (double?) the hash table size

when the table is getting full and then rehash into a larger table. Implement this resize and

rehash functionality. When the program is called using the -vvv flag it should print out a message when it rehashes and then call the print_set function to show the new state of the hash set once rehashing completes. In C, beware of memory leaks! [1]
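As a hedged sketch of one way to organise this step: the 0.7 load-factor threshold, the doubling policy and the helper names below are assumptions for illustration, not requirements of the lab.

# Illustrative resize-and-rehash (Python). Existing values must be re-inserted
# because their indices depend on the table size.
LOAD_FACTOR = 0.7   # assumed threshold for "getting full"

def maybe_rehash(table, num_entries, insert_fn, verbose=0):
    if num_entries / len(table) < LOAD_FACTOR:
        return table                        # still enough free slots
    new_table = [None] * (len(table) * 2)   # doubling is the simplest growth policy;
                                            # growing to the next prime is also common
    for value in table:
        if value is not None:
            insert_fn(new_table, value)     # indices depend on the new size, so re-insert
    if verbose >= 3:
        print("Rehashed into a table of size", len(new_table))
    return new_table

In your real implementation the re-inserts should also update the collision and rehash counters, and at verbosity -vvv you would call print_set after rehashing, as described above.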

Mod of negative numbers:

Most of you are likely to use the mod operator in your hash functions by doing something

like

h = x % size

to try to get a value of h that is between 0 and size - 1.

If x is negative then the behaviour of this operator varies by language: in particular in C and Java it will return a negative number. See the discussion at

https://torstencurdt.com/tech/posts/modulo-of-negative-numbers/

You might think that x cannot be negative. However, if you are adding or multiplying large

numbers together (e.g., when hashing a string) then you are likely to get integer overflow.

In hashing, we don’t care about such overflows as it is deterministic and just adds to the

‘chaos’ a bit but Java, for instance, may throw ArrayIndexOutOfBoundsException if a

negative number is passed to the hash data structure, if you have not accounted for this

possibility in your implementation.

In C you could use an unsigned integer to make sure that you only have positive numbers.
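If you are working in Python, this particular pitfall is milder: Python integers do not overflow and % with a positive right-hand operand always returns a non-negative result. The tiny sketch below, with a hypothetical helper name, just records that fact alongside the usual correction for C and Java.

# Hypothetical helper: turn an arbitrary integer hash value into a valid index.
def index_for(hash_value, size):
    # In Python, hash_value % size is always in range(0, size) when size > 0,
    # even for negative hash_value (e.g. (-7) % 10 == 3).
    # In C or Java you would instead write ((x % size) + size) % size,
    # use Math.floorMod(x, size) in Java, or use unsigned arithmetic in C.
    return hash_value % size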

Once you have completed this implementation you should look at the Blackboard submission form

where you will be asked the following questions about Part a.

1. What did you implement as your hash function? How does it work? How “good” is it? (No

more than 100 words).

[1] One can also get memory leaks in memory managed languages but these are more semantic e.g. preserving a reference to something you will never use again.


2. How do find and insert for hash sets work? How have you implemented them? You do not need

to discuss the details of your collision resolution strategy here (this is the next question) but

should cover the basics of how a word is inserted into the dictionary and how a word is found

(or not) in the dictionary. (No more than 100 words, not counting any code snippets included).

3. What is your collision resolution strategy? How does it work? How have you implemented

it? Illustrate your explanation with a sample of your code - this may be tidied up a bit for

presentation but should clearly be the same as the code you have submitted (No more than 200

words, not counting the code snippet).

4. Did you implement rehashing? If so briefly explain your implementation including explaining

when you choose to rehash. (No more than 200 words, not counting any code snippet included).

Part b: Binary Search Tree Lookup

Time Budget: This part of the lab should take you about 2 hours to complete. If it is taking you

more than 3 hours and you have completed a basic hash set implementation for Part a you might

want to skip forward and do the first section of Part c (which only needs the hash set implementation)

before returning to this. If you have not completed any of Part a it is worth persevering with this to

get it working before moving on to Part c where you should focus on those sections relating to binary

trees.

In this part you need to edit the program stub in your chosen language for binary search trees.

You should:

1. Complete the insert function by comparing the value to insert with the existing value (see the sketch after this list).

2. Complete the find function. Importantly, it should follow the same logic as insertion.

3. Implement the print_set function – This should print out a sensible representation of the binary search tree when called with the -vvv flag.

4. Implement the print_stats function – This should print out a sensible set of statistics, including the height of the tree and average number of comparisons per insert and find. print_stats is automatically called by the spell-checking program on completion. We will look at this output so do not change or disable it.
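As a hedged illustration of the logic items 1 and 2 describe, here is a small recursive Python sketch. The node fields (value, left, right) are assumptions for this sketch and will not match the stub's names exactly.

# Illustrative binary search tree insert/find (Python). Use the structures
# already defined in the stubs rather than this Node class.
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None     # values smaller than self.value go here
        self.right = None    # values larger than self.value go here

def bst_insert(node, value):
    if node is None:
        return Node(value)
    if value < node.value:
        node.left = bst_insert(node.left, value)
    elif value > node.value:
        node.right = bst_insert(node.right, value)
    # equal: the value is already present, so a set insert ignores the duplicate
    return node

def bst_find(node, value):
    if node is None:
        return False
    if value < node.value:
        return bst_find(node.left, value)
    if value > node.value:
        return bst_find(node.right, value)
    return True              # same comparison logic as insert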

Once you have completed this implementation you should look at the Blackboard submission form

where you will be asked the following question about Part b.

1. How do find and insert for binary trees work? How have you implemented them? Include your

code for insert and find and relate your explanation to this code. (No more than 200 words, not

including the code snippet(s))

In this lab exercise we will stop there with trees. However, it is worth noting that this implementation

is not optimal. What will happen if the input dictionary is already sorted? In the next lab we will be

exploring self-balancing trees.

Testing

There are instructions in the individual language sections explaining how to test your algorithm on a

single example and you should try that first. Once that is done you can test them on some test suites.

We have provided a test script, test.py in the top level directory of the COMP26120 2023 repository and some data in the data directory. test.py takes up to five arguments: the first is the language

you are using; the second is the test suite (simple, large and collision tests are provided but

you can create more); the third is the data structure (bstree or hashset) you are testing; the fourth


(optional, defaults to 0) is the mode you want to test; and the fifth (optional) is the initial hash set

size (only applicable for the hashset).

For example, to run the simple tests for mode 2, your language is Java and you want to test your

binary tree implementation, you should run

python3 test.py java simple bstree 2

You might want to edit the script to do different things but we will use the one provided when marking

your code so make sure it runs with this. We have also provided a large test (replace simple by

large above) which will take a lot longer to run (see help box below). You will need to unzip the

files by running unzip henry.zip in the data/large directory. This should be particularly useful to

check if you have implemented rehashing correctly.

Finally we have provided some collision_tests tests. These are dictionaries containing exactly

five items. If you run them with an initial hash set size of 5 it is highly likely that at least one insert

will result in a collision and this will allow you to check if your collision resolution mechanism is

working correctly. These can also be used to test if your rehashing implementation is working.

For example, to run the collision tests for mode 0 with python as the language, you should run

python3 test.py python collision_tests hashset 0 5

This will run your hash set implementation on these tests with an initial hash set size of 5.

In general, to be sure that your collision resolution strategy is working, we would recommend adding

a message that gets printed out when run in verbose mode (e.g., with the -vv flag) when a collision

occurs, possibly including printing out the set after the element is inserted, and then running tests

manually using the instructions in the language specific sections to make sure at least one of your

tests is encountering a collision and it is handled correctly when it does.

Failing Tests:

If the tests are failing there are two possible explanations. Either your implementation is

wrong or the test script is being more picky than you thought.

For the first reason, make sure you understand what you expect to happen. Create a

small dictionary and input file yourself and test using these or use one of the individual

simple tests. There are nine of these and you will find them organised in the data/simple

directory where each sub-directory contains an example dictionary, input file and the

answer expected by the test. Most of these are small and easy to understand. Remember

that the program should print out all of the words that are in the input file but not in the

dictionary. Make sure that your code is connected up properly in find so that it returns

true/false correctly.

For the second reason, make sure that you don’t print anything extra at verbosity level 0. If

you look at test.py you’ll see that what it’s doing is comparing the output of your program

between “Spellchecking:” and “Usage statistics:” against some expected output. It runs the

program at verbosity level 0 and expects that only spelling errors are output in-between

those two points.

As part of marking, we will run your implementations on the simple and collision tests test

suites and may also run them on the large test suite if we feel additional testing is required.

Part c: Making a Comparison

Time Budget: If you have not attempted Lab 2 then you should budget 1 hour to familiarise

yourself with the concept of running experiments, generating data and so on. You should then expect


this part of the lab to take you 2-3 hours to complete. However, actually generating data and

running experiments is likely to take more time than that, but you can leave your computer running

and do something else while this is happening – particularly if you write scripts to automate data

generation and run experiments. You should plan your time with this in mind. We recommend writing

up the experiments as you go so you have something to submit if you run out of time.

You should now perform an experimental evaluation like the one you did for Lab 2. We want to

address the question:

Under what conditions are your different implementations of the dictionary data structure

preferable?

Note that for hash sets the initial size of the data structure will potentially play a role. You may

want to design your experiment so this is not a concern. You may also find it useful to correlate your

findings with statistics such as the number of collisions in the hash table.

We recommend you read, if you have not already done so, the Lab 2 instructions on how to generate

inputs for your experiments. You may also (if possible) reuse the inputs you generated as part of that

lab as part of your evaluation here in order to save time. Note that generation of inputs is likely to

take some time (hours for large dictionaries and queries), so you should plan your work so you can

set a generation script running and then leave it while you do something else.

You should perform two experiments for your evaluation and then draw a conclusion that answers

the question above. You can then devise and run a third optional (set of) experiments as an extension

exercise.

The two experiments you should perform should answer the questions:

1. Does your implementation of insert for hash sets work as expected in the average case in terms of complexity?

2. Does your implementation of insert for binary trees work as expected in the average case in terms of complexity?

Important: In your hypothesis, experimental design and results you should be very careful to

consider whether you are discussing or measuring a single insertion, or several insertions and you

should be consistent around this throughout. Of course, you can measure several insertions and from

that calculate the cost of a single insertion, but you should be clear about this.

Your presentation of each experiment should have four sections – these are likely to be very similar

for each experiment:

Hypothesis You should state, as a hypothesis, what you expect the performance to be in big O

notation and then write a short paragraph explaining why you believe this to be the case, based

on the theoretical complexities of insert for the data structure. Be careful about how you present

big O notation. In particular O is not a function so it doesn’t make sense to perform arithmetic

with it such as O(n)/n or O(n) × n.

(This section should be about 200 words and take up no more than half a page in your report).

Experimental Design You should describe how you designed the experiment. This should include:

• Your intended independent variable – this is the thing you are varying.

• Your intended dependent variable – this is the thing you are measuring whose value your hypothesis says depends upon the value of the independent variable.

• Anything that you could vary but you are not – for instance to avoid confusing your results with other variables that might also influence performance.

• What input data you generated and how many input files you used for each value of your independent variable.

• What program you ran on what inputs. How many times you ran it on each input.

• What you measured and how.

You should also mention anything else you think is relevant that will help your marker judge

how well you designed your experiment to test your hypothesis. (This section should be about

500 words and take up no more than a page in your report – not counting any scripts included

as appendices).

Results You should present your results, ideally as a graph showing a best fit line. If you do any

processing on results, such as generating best fit lines, computing averages, etc., then these

should be described. Raw data should be presented in an appendix or included with your code

submission. You should state clearly whether your hypothesis was confirmed or refuted (or it is

equivocal or difficult to tell). If your hypothesis was refuted you should discuss why that might

be (e.g., incorrect implementation of insert, some problem with the experimental design) and

sketch how you might investigate further. (This section should be about 200 words and take up

no more than half a page in your report – not counting graphs and tables).

Note that it is more important to be able to clearly explain your experimental design (justifying

your decisions) than it is to have excellent results. The answers to the wrong question are useless.

Top marks can be obtained for Part c for a well-designed experiment that has disappointing results

(either disproving a hypothesis or just being unclear in what they show), so long as the conclusions

you draw make it clear you are aware of the issues with the results.

Disappointing/Surprising Results:

While you can get full marks for results that disprove your hypothesis, these often are

indicative of issues with the implementation or, if you are using them, with any scripts

written to automate the experiments. Therefore it is alway worthwhile double-checking these

things since problems with these will effect your marks. The commonest implementation

issue we see here is that find and/or insert are implemented in a way that returns the correct

answer but does more work than necessary. The commonest issue with automation scripts

is that they are not running the program at all, trying to run it on non-existent or empty

input files, or causing it to crash somehow. If your results seem to show that your program

takes the same amount of time on all inputs (particularly if that time is very short) then

it is worth double-checking that your automation script is actually running the program

correctly.

Once you have written up your experiments you should write two further sections.

Conclusion Use your results to address the initial question: when should you use your hash set

implementation and when should you use your binary tree implementation? You do not need to

validate this experimentally since for some implementations this is likely to take a lot of time,

even with a good implementation on a fast machine. Theoretically, there is a crossover point

where one becomes better than the other that you should be able to calculate from your existing

results so even if you can’t run large enough experiments to find this point we expect to see the

answer you have calculated.
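For example (with invented numbers purely to show the calculation), if your fitted results gave a roughly constant per-insert time for the hash set and a logarithmic per-insert time for the binary search tree, the crossover could be estimated like this:

# Hypothetical fitted models from your own results (the numbers here are
# invented for illustration): hash set insert costs roughly a constant t_hash,
# while a BST insert costs roughly b * log2(n) + c per insert.
t_hash = 2.0e-6          # seconds per insert (example value)
b, c = 1.5e-7, 5.0e-7    # example fitted BST coefficients

# Crossover: solve t_hash = b * log2(n) + c for n.
n_crossover = 2 ** ((t_hash - c) / b)
print("Under these made-up fits the BST is cheaper per insert for n below about",
      round(n_crossover))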

Data Statement It is good practice to inform readers where they can find your data and code in

order to check your results and re-run experiments for themselves. This should appear in a Data

statement.


We’ve asked you not to include large input files in your gitlab, but you should include any scripts

you created - for instance to generate input files or run experiments, and the “raw” output data

- e.g., a table (e.g., as a .csv file or spreadsheet) of independent variable values matched with

dependent variable values. You may also include this as an appendix in the report.

The data statement should clearly state where all the input data (if included), scripts and output

data can be found.

Hints

• You should be able to do all of the analysis you need to do for this whole lab using a simple spreadsheet application. However the unix utility gnuplot and the python matplotlib library also both allow you to create plots from data and fit lines to those plots.

• As the numbers are all small you might want to count n in lots of 10k e.g. for an input of 10,000 say it's n = 1, for 20,000 say it's n = 2 and so on. Alternatively, you may wish to measure time in milliseconds rather than seconds. This is just to make numbers look more sensible.

• You can use the UNIX time utility to measure running time. This gives you times for real, user and sys. The most accurate way to judge the running time of your algorithm is to take the sum of user and sys (real includes the time when other resources might be using your CPU and can't account for computation time across multiple cores).

• You may want to plot/calculate average time per insert, rather than plotting the total time taken to build the dictionary and query it for words. When doing this (particularly for Java) be aware that there may be a fixed "overhead" time that is independent of the size of dictionary or query – if this is the case and you are calculating averages you may see that the average time appears to get better as the dictionary size increases, not worse as you would expect. If this is happening, you may want to start by plotting a graph of total time and determining where it crosses the vertical axis, in order to give you a guess at the overhead, and then deduct this start-up value from the total time before calculating the average (a small sketch of this calculation follows this list). If you do this, make sure to document it in your report.

• If you are convinced your implementation is correct but your results don't seem to be as expected, particularly if you are looking at an overhead that is making time per insert hard to calculate, you might want to investigate correlations between other factors – e.g., the relationship between dictionary size and tree height for binary trees to see if that can illuminate what is happening.
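As a small sketch of the overhead correction mentioned in the hints above (all numbers invented for illustration, and the overhead estimate is a hypothetical value read off your own total-time plot):

# Sketch of the overhead correction described in the hints above.
# total_times maps dictionary size to measured total time (user + sys) in seconds;
# the values are invented for illustration.
total_times = {10000: 0.31, 20000: 0.42, 40000: 0.63}

overhead = 0.20   # hypothetical start-up cost estimated from where the
                  # total-time line meets the vertical axis

for n, total in sorted(total_times.items()):
    per_insert = (total - overhead) / n
    print(f"n = {n}: about {per_insert * 1e6:.2f} microseconds per insert")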

Report Extension. As an extension activity you can include a third (set of) experiment(s) in your

report. This experiment can be to explore any aspect of the practical complexity of your implementation of hash sets or binary trees that you wish – good things to look at include: query time; the other

collision resolution strategies for hash sets (if you have implemented them); other hash functions;

and the effect of rehashing on the performance of hash sets, but there are many opportunities here.

Note that you can get 90% of the marks for this coursework without attempting the extension. This

is for students who wish to stretch themselves and should not be attempted unless the rest of the

coursework has been completed in good time.

Instructions for C Solutions

If you intend to provide a solution in C you should work in the lab3/c directory of the COMP26120 2023

repository. All program stubs and other support files for C programs can be found in this directory.

The completed programs will consist of several “.c” and “.h” files, combined together using make.

You are given a number of complete and incomplete components to start with:


• global.h and global.c - define some global variables and functions, including the verbosity level.

• speller.c - the driver for each spell-checking program, which:

  1. reads strings from a dictionary file and inserts them in your data-structure

  2. reads strings from a second text file and finds them in your data-structure - if a string is not found then it is assumed to be a spelling mistake and is reported to the user, together with the line number on which it occurred

  3. processes input flags to set the dictionary, query file, size, mode and verbosity level to be used.

  You should not need to change speller.c – if you do, make sure that the input flags for running the program continue to work as specified below.

• set.h - defines the generic interface for the data-structure

• hashset.h and hashset.c - includes a partial implementation for hash sets that you need to complete in Part a.

• bstree.h and bstree.c - includes a partial implementation for binary search trees that you need to complete in Part b.

Note: The code in speller.c that reads words treats any non-alphabetic character as a

separator, so e.g. “non-alphabetic” would be read as the two words “non” and “alphabetic”.

This is intended to extract words from any text for checking (is that “-” a hyphen or a

subtraction?) so we must also do it for the dictionary to be consistent. This means that

your code has to be able to deal with duplicates i.e. recognise and ignore them. For example,

on my PC /usr/share/dict/words is intended to contain 479,829 words (1 per line) but is

read as 526,065 words of which 418,666 are unique and 107,399 are duplicates.

Running your code

To test your implementation, you will need a sample dictionary and a sample text file with some text

that contains the words from the sample dictionary (and perhaps also some spelling mistakes). You

are given several such files in the data directory of the COMP26120 2023 repository, and you will

probably need to create some more to help debug your code. These files will need to be copied to

the directory where you are working or you will need to set your PATH so that your program can be

executed from anywhere.

You should also test your program using a larger dictionary. One example is the Linux dictionary

that can be found in /usr/share/dict/words.

Compile and link your code using make hashset (for part a), or make bstree (for part b). These will

create executables called speller_hashset and speller_bstree respectively.

When you run your spell-checker program, you can:

• use the -d flag to specify a dictionary.

• (for part a) use the -s flag to specify a hash table size: try a prime number greater than twice the size of the dictionary (e.g. 1,000,003 for /usr/share/dict/words). Do not hard code the dictionary size into your code. You should let it be set by the size argument to the initialize_set function.

• use the -m flag to specify a particular mode of operation for your code (e.g. to use a particular hashing algorithm). You can access the mode setting by using the corresponding mode variable in the code you write. Note that the modes you should use have already been specified in hashset.h. You should not need to use this unless you want to investigate additional hash functions or addressing strategies.

• use the -v flag to turn on diagnostic printing, or -vv for more printing (or -vvv etc. - the more "v"s, the higher the value of the corresponding variable verbose). We suggest using this verbose value to control your own debugging output. You can access the verbosity level as the global verbose. This is 0 by default and will be set to 1, 2 or 3 by use of the -v, -vv and -vvv flags.

So the general form of a call to your program should be

speller_hashset -d <dictionary> -s <size> -m <mode> <input-file>

For instance to run your hash set program on the third of the simple test cases, in mode 1, with a

medium level of verbosity, you would call:

speller_hashset -d ../data/simple/3/dict -m 1 -vv ../data/simple/3/infile

NOTE IF CUTTING AND PASTING THIS that the underscores often get lost when you cut and

paste.

Instructions for Java Solutions

If you intend to provide a solution in Java you should work in the lab3/java directory of the

COMP26120 2023 repository. The completed programs will form a Java package called comp26120.

You can find this as a sub-directory of the java directory. All program stubs and other support files

for Java programs can be found in this directory but you will need to copy in either your solution or

the model solutions for Lab 1 into this directory – this should be the file darray.java.

You are given a number of complete and incomplete components to start with:

• speller_config.java - defines configuration options for the program, including the verbosity level.

• speller.java - the driver for each spell-checking program, which:

  1. reads strings from a dictionary file and inserts them in your data-structure

  2. reads strings from a second text file and finds them in your data-structure - if a string is not found then it is assumed to be a spelling mistake and is reported to the user, together with the line number on which it occurred

  3. processes input flags to set the dictionary, query file, size, mode and verbosity level to be used.

  You should not need to change speller.java – if you do, make sure that the input flags for running the program continue to work as specified below.

• set.java - defines the generic interface for the data-structure

• set_factory.java - defines a factory class to return the appropriate data structure to the program.

• hashset.java - includes a partial implementation for hash sets that you need to complete in Part a.

• speller_hashset.java - sub-classes speller for use with hashset.

• bstree.java - includes a partial implementation for binary search trees that you need to complete in Part b. You will need to edit this.

• speller_bstree.java - sub-classes speller for use with bstree.

Note: The code in speller.java that reads words treats any non-alphabetic character as a

separator, so e.g. “non-alphabetic” would be read as the two words “non” and “alphabetic”.

This is intended to extract words from any text for checking (is that “-” a hyphen or a

subtraction?) so we must also do it for the dictionary to be consistent. This means that

your code has to be able to deal with duplicates i.e. recognise and ignore them. For example,

on my PC /usr/share/dict/words is intended to contain 479,829 words (1 per line) but is

read as 526,065 words of which 418,666 are unique and 107,399 are duplicates.

Running your code

To test your implementation, you will need a sample dictionary and a sample text file with some text

that contains the words from the sample dictionary (and perhaps also some spelling mistakes). You

are given several such files in the data directory of the COMP26120 2023 repository, and you will

probably need to create some more to help debug your code. These files will need to be copied to

the java directory or you will need to set your CLASSPATH so that your program can be executed

from anywhere.

You should also test your program using a larger dictionary. One example is the Linux dictionary

that can be found in /usr/share/dict/words.

Compile your code using javac *.java. This will create an executable called speller_bstree.class (for Part b) and speller_hashset.class (for Part a). To run your program you should call java comp26120.speller_bstree sample-file and java comp26120.speller_hashset sample-file respectively. Note that you will either need to set your CLASSPATH to the java directory of the COMP26120 2023 repository or call java comp26120.speller_bstree sample-file (resp. java comp26120.speller_hashset sample-file) in that directory.

When you run your spell-checker program, you can:

• use the -d flag to specify a dictionary.

• (for part a) use the -s flag to specify a hash table size: try a prime number greater than twice the size of the dictionary (e.g. 1,000,003 for /usr/share/dict/words). Do not hard code the dictionary size into your code. You should let it be set by the config.init_size field of the config object supplied as an argument to the hashset constructor.

• use the -m flag to specify a particular mode of operation for your code (e.g. to use a particular sorting or hashing algorithm). You can access the mode setting by using the corresponding mode variable in the code you write. Note that the modes you should use have already been specified in hashset.java. You should not need to use this unless you want to investigate additional hash functions or addressing strategies.

• use the -v flag to turn on diagnostic printing, or -vv for more printing (or -vvv etc. - the more "v"s, the higher the value of the corresponding variable verbose). We suggest using this verbose value to control your own debugging output. You can access the verbosity level as the verbose field of the hashset object. This is 0 by default and will be set to 1, 2 or 3 by use of the -v, -vv and -vvv flags.


So the general form of a call to your program should be

java comp26120.speller_hashset -d <dictionary> -s <size> -m <mode> -v <input-file>

For instance to run your hash set program on the third of the simple test cases, in mode 1, with a

medium level of verbosity, you would call:

java comp26120.speller_hashset -d ../data/simple/3/dict -m 1 -vv ../data/simple/3/infile

NOTE IF CUTTING AND PASTING THIS that the underscores often get lost when you cut and

paste.

Instructions for Python Solutions

If you intend to provide a solution in Python you should work in the lab3/python directory of the

COMP26120 2023 repository. All program stubs and other support files for Python programs can be

found in this directory but you will need to copy in either your solution or the model solutions for

Lab 1 into this directory – this should be the file darray.py. You will need to use Python 3.

You are given a number of complete and incomplete components to start with:

• config.py - defines configuration options for the program, including the verbosity level.

• speller.py - the driver for each spell-checking program, which:

  1. reads strings from a dictionary file and inserts them in your data-structure

  2. reads strings from a second text file and finds them in your data-structure - if a string is not found then it is assumed to be a spelling mistake and is reported to the user, together with the line number on which it occurred

  You should not need to change speller.py – if you do, make sure that the input flags for running the program continue to work as specified below.

• set_factory.py - defines a factory class to return the appropriate data structure to the program.

• hashset.py - includes a partial implementation for hash sets that you need to complete in Part a. You will need to edit this.

• speller_hashset.py - front end for use with hashset which then calls the functionality in speller.py.

• bstree.py - includes a partial implementation for binary search trees that you need to complete in Part b. You will need to edit this.

• speller_bstree.py - front end for use with bstree which then calls the functionality in speller.py.

Note: The code in speller.py that reads words treats any non-alphabetic character as a

separator, so e.g. “non-alphabetic” would be read as the two words “non” and “alphabetic”.

This is intended to extract words from any text for checking (is that “-” a hyphen or a

subtraction?) so we must also do it for the dictionary to be consistent. This means that

your code has to be able to deal with duplicates i.e. recognise and ignore them. For example,

on my PC /usr/share/dict/words is intended to contain 479,829 words (1 per line) but is

read as 526,065 words of which 418,666 are unique and 107,399 are duplicates.


Running your code

Important: To run the provided program stubs in python2.7 you will need to install the enum34

package. You can do this from the UNIX command line. We advise that you use Python 3 instead.

To test your implementation, you will need a sample dictionary and a sample text file with some

text that contains the words from the sample dictionary (and perhaps also some spelling mistakes).

You are given several such files in the data directory of the COMP26120 2023 repository, and you

will probably need to create some more to help debug your code. These files will need to be copied

to the python directory or you will need to set your PYTHONPATH so that your program can be

executed from anywhere.

You should also test your program using a larger dictionary. One example is the Linux dictionary

that can be found in /usr/share/dict/words.

To run your program you should call python3 speller_hashset.py sample-file (where sample-file is a sample input file for spell checking) for Part a and python3 speller_bstree.py sample-file for Part b. Note that you will either need to set your PYTHONPATH to the python directory of the COMP26120 2023 repository or call python3 speller_hashset.py sample-file (resp. python3 speller_bstree.py sample-file) in that directory.

When you run your spell-checker program, you can:

• use the -d flag to specify a dictionary.

• (for part a) use the -s flag to specify a hash table size: try a prime number greater than twice the size of the dictionary (e.g. 1,000,003 for /usr/share/dict/words). Do not hard code the dictionary size into your code. You should use self.hash_table_size which is created from the configuration options when the hash set object is created by __init__.

• use the -m flag to specify a particular mode of operation for your code (e.g. to use a particular sorting or hashing algorithm). You can access the mode setting by using the corresponding mode variable in the code you write. Note that the modes you should use have already been specified in hashset.py. You should not need to use this unless you want to investigate additional hash functions or addressing strategies.

• use the -v flag to turn on diagnostic printing, or -vv for more printing (or -vvv etc. - the more "v"s, the higher the value of the corresponding variable verbose). We suggest using this verbose value to control your own debugging output. You can access the verbosity level as the self.verbose field of the hashset object. This is 0 by default and will be set to 1, 2 or 3 by use of the -v, -vv and -vvv flags.

So the general form of a call to your program should be

python3 speller_hashset.py -d <dictionary> -s <size> -m <mode> <input-file>

For instance to run your hash set program on the third of the simple test cases, in mode 1, with a

medium level of verbosity, you would call:

python3 speller_hashset.py -d ../data/simple/3/dict -m 1 -vv ../data/simple/3/infile

NOTE IF CUTTING AND PASTING THIS that the underscores often get lost when you cut and

paste.

Marking Rubric

This coursework is worth 20% of your final mark for COMP26120. This means each mark in this

mark scheme is worth 0.5% of your final mark for the module.


In the rubric below, the marks that an item is worth are slightly approximate due to the way

Blackboard allocates percentages across sections. In general unsatisfactory performance will get you

10% of the available marks for a section, satisfactory will get you 50% and excellent will get you

100%. This can vary for items worth more than 1 mark because several criteria may be involved each

of which can be marked unsatisfactory, satisfactory or excellent.

Note that on Blackboard, most of the execution marks are given for the Code Upload question

since the TA will run a bunch of tests on your code in order to determine how well it executes – we

separate out these marks here into the sections related to specific parts of the implementation.

If you have used generative AI and have produced excellent quality answers in any section but we

are not convinced by your statement that you have met the learning outcomes then your mark will

be capped at satisfactory. Depending upon how many people are affected by this we may then hold

vivas to see if the mark should be raised.

Code Submission (1 mark)

Commit Hash (1 mark)

Unsatisfactory: A markable submission can be identified (with a bit of work) but a tag (or some other indicator which is not a commit hash) has been supplied by mistake.

Satisfactory: A markable submission can be identified but there is a minor error (e.g., a mistranscription of a character).

Excellent: The commit hash supplied accurately identifies a markable submission.

Part A (13 marks)

Basic implementation (10 marks)

Execution (1 mark)

Unsatisfactory: Code passes at least one of the simple and collision tests when run with mode 0.

Satisfactory: Code passes more than half the tests when run on the simple and collision tests with mode 0.

Excellent: Code passes all tests when called using test.py on the simple and collision tests with mode 0.

Hash Function (2 marks)

Unsatisfactory: There is an attempt to describe a hash function but it doesn't appear to be related to the one in the implementation.

Satisfactory: There is a description and discussion of the implemented hash function, but

• the implementation has errors; or

• the description or discussion is very difficult to understand; or

• the description or discussion are so vague and high-level they could relate to almost any implementation; or

• there is no evaluation of how good the function is; or

• the discussion merely states that the hash function is good or bad and does not discuss why.

Excellent: The implemented hash function is correct and is described and discussed clearly and concisely. The discussion includes an evaluation of how good the function is.

Find and Insert (2 marks)

Unsatisfactory: An attempt has been made to implement find and/or insert but

• there are significant logic errors; or

• the description is non-existent; or

• the description does not seem to relate to the implementation; or

• the description is so vague and high-level it could relate to almost any implementation.

Satisfactory: An implementation and description of find and insert exist and any included code snippets match the submitted code but

• there are minor errors in the implementation although the basic logic is sound; or

• there are minor differences in the logic for find and insert; or

• the discussion is over-length, very flawed, or very difficult to follow; or

• duplicates are not handled.

Excellent: Find and Insert have been implemented correctly, using the same logic, and are described clearly, concisely and correctly.

Collision Resolution Strategy (3 marks)

Unsatisfactory: An attempt has been made to implement an open addressing strategy, but

• there are significant logic errors meaning the strategy does not function as it should on most inputs; or

• the description is non-existent; or

• the description does not seem to relate to the implementation; or

• the description is so vague and high-level it could relate to almost any implementation.

Satisfactory: An implementation and description of a collision resolution strategy exist and are illustrated by a code snippet which clearly matches the submitted code, but

• there are minor errors in the implementation although the basic logic is sound; or

• it is only implemented for one of find or insert (not both); or

• the strategy is misidentified (e.g., described as linear probing when it is double-hashing); or

• the discussion is over-length, very flawed, or very difficult to follow.

Excellent: The collision resolution strategy is clearly, concisely and correctly described and implemented and the same logic is used for both find and insert.

Print Set (1 mark)

Unsatisfactory: Something is printed out but it isn't clear how it relates to the current state of the hash set.

Satisfactory: The current state of the hash set is printed but is difficult to interpret.

Excellent: A nice, easy to understand, representation of the hash set is printed.

Statistics (1 mark)

Unsatisfactory: Some statistics are printed but they are incorrectly implemented.

Satisfactory: Some correctly calculated statistics are printed but these do not include all the ones requested.

Excellent: Statistics are printed including number of collisions, rehashes and average collisions per access/insert.

Rehashing (3 marks)

Execution (1 mark)

Unsatisfactory: -

Satisfactory: Code passes the collision tests when run using a small initial dictionary size (e.g., 5) but no information is printed out to judge whether rehashing is actually working as intended.

Excellent: Code passes collision tests when run using a small initial dictionary size (e.g., 5). Enough information is printed to confirm that resizing is taking place sensibly.

Implementation & Discussion (1 mark)

Unsatisfactory: There seems to be an attempt at implementing rehashing but

• the implementation contains significant logic flaws; or

• the description is missing; or

• the description does not appear to relate to the implementation; or

• the description is so vague and high-level it could relate to any implementation.

Satisfactory: There is an implementation of rehashing that is described but

• the implementation contains minor errors; or

• the calling of the implementation is not well thought through (e.g., it is called too eagerly or not eagerly enough); or

• the description does not explain when rehashing is called; or

• the description is very difficult to understand or is over-length.

Excellent: The implementation of rehashing is correct and sensible and is clearly, concisely and correctly described.


Part B (6 marks)

Basic implementation (6 marks)

Execution (1 mark)

Unsatisfactory: Code passes at least one of the simple tests when run with bstree.

Satisfactory: Code passes more than half the simple tests when run with bstree.

Excellent: Code passes all tests when called using test.py on the simple tests with bstree.

Find and Insert (3 marks)

Unsatisfactory: An attempt has been made to implement find and/or insert but

• there are significant logic errors meaning the implementation does not behave as intended; or

• the description is missing; or

• the description contains major flaws; or

• the description does not appear to relate to the implementation; or

• the description is so vague and high-level it could relate to any implementation.

Satisfactory: Find and Insert have both been implemented using the same logic and described but

• there are minor errors; or

• there are minor differences in the logic used for the two operations; or

• duplicates are not handled; or

• no code snippets are included in the description; or

• the description is very difficult to understand or is over-length.

Excellent: Find and Insert have been implemented correctly, using the same logic, and are clearly and concisely described including code snippets. Duplicates are handled sensibly.

Print Set (1 mark)

Unsatisfactory: Something is printed out but it isn't clear how it relates to the current state of the binary tree.

Satisfactory: The current state of the binary tree is printed but is difficult to interpret.

Excellent: An easy to understand (provided it is not too large) representation of the binary tree is printed.

Statistics (1 mark)

Unsatisfactory: Some statistics are printed but they are incorrectly implemented.

Satisfactory: Some correctly calculated statistics are printed but these do not include all the ones requested.

Excellent: Correctly calculated statistics are printed including the height of the tree and average comparisons per insert and find.


Part C (20 marks)

Experiments 1 & 2 (10 marks)

Hypothesis (3 marks)

Unsatisfactory: A vaguely relevant hypothesis is stated for at least one experiment but its relation to the question is tenuous at best.

Satisfactory: A sensible hypothesis is given with a justification for at least one experiment but:
• there is no sensible hypothesis for the other experiment; or
• not all details are precisely supplied (e.g., what n represents in O(n) and similar formulae); or
• the justification reveals misunderstandings about the complexity of hash set insertion; or
• Big O notation is handled incorrectly (e.g., by performing arithmetic upon it); or
• the hypothesis and justification do not make it clear whether they are talking about the complexity of performing a single insertion, or several insertions; or
• the justification is very difficult to understand.

Excellent: A sensible hypothesis is clearly stated and explained.


Design (4 marks)

Unsatisfactory: An experimental design is outlined for at least one experiment but is deeply flawed and unlikely to yield results that can confirm or disprove the hypothesis.

Satisfactory: A reasonable design is described for at least one experiment but:
• there is no sensible design for the other experiment; or
• some of the requested elements are not included; or
• there are some flaws that might make results hard to interpret; or
• it is unclear whether the design relates to the complexity of performing a single insertion, or several insertions; or
• the design is very difficult to understand.

Excellent: A good design, likely to confirm or disprove the hypothesis, well-described with all the requested elements.
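Purely as an illustration of the kind of measurement a design might describe (the file name, the sizes and the structure being timed are placeholders, not part of the required design), an in-process timing loop could look like the sketch below. Note that it measures the total time for n insertions rather than the cost of a single one; a design should be explicit about which of these it is reporting.

    # Illustrative sketch only: time inserting the first n words of a word
    # list for increasing n, taking the best of several repeats to reduce noise.
    import time

    def load_words(path, limit):
        with open(path) as f:
            return [line.strip() for line in f][:limit]

    def time_insertions(make_structure, words, repeats=5):
        best = float("inf")
        for _ in range(repeats):
            structure = make_structure()
            start = time.perf_counter()
            for word in words:
                structure.insert(word)
            best = min(best, time.perf_counter() - start)
        return best

    sizes = [1000, 2000, 5000, 10000, 20000, 50000]   # placeholder sizes
    for n in sizes:
        words = load_words("dictionary.txt", n)       # placeholder file name
        print(n, time_insertions(BSTree, words))      # or SimpleHashSet, as above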


Results (3 marks)

Unsatisfactory: Some results are presented for at least one experiment but:
• the results don't seem to relate to the hypothesis or design; or
• there is no statement of whether the hypothesis is confirmed or refuted; or
• any statement of whether the hypothesis is confirmed or refuted appears clearly contrary to the presented results.

Satisfactory: Results are sensibly presented for at least one experiment but:
• there are no sensible results for the other experiment; or
• there are missing details such as labels on axes; or
• only a table is given; or
• if there is a graph there is no best fit line; or
• some sort of processing seems to have happened but this isn't described or is unclear; or
• there is confusion around whether the results relate to the complexity of performing a single insertion, or several insertions; or
• statements around confirmation or refutation of the hypothesis are over-optimistic or over-pessimistic; or
• if the hypothesis was refuted then there is no discussion of why this may have happened or any discussion of possible follow-up investigations.

Excellent: Results are presented in a clearly labelled graph with a best fit line. Any processing of raw data is clearly described. There is a good discussion of whether the hypothesis is confirmed or refuted.
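For the presentation side, a minimal plotting sketch that produces labelled axes and a least-squares best fit line might look like the following. It assumes numpy and matplotlib are available and that sizes and times have been collected as in the timing sketch above; the function and file names are hypothetical.

    # Illustrative sketch only: scatter the measurements, fit and draw a
    # straight line, and label both axes.
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_results(sizes, times, filename="insertion_times.png"):
        sizes = np.asarray(sizes, dtype=float)
        times = np.asarray(times, dtype=float)
        slope, intercept = np.polyfit(sizes, times, 1)    # degree-1 best fit
        plt.scatter(sizes, times, label="measured")
        plt.plot(sizes, slope * sizes + intercept, label="best fit")
        plt.xlabel("dictionary size (words inserted)")
        plt.ylabel("total insertion time (s)")
        plt.legend()
        plt.savefig(filename)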


Conclusion (2 marks)

Unsatisfactory: A conclusion is drawn that relates to the question but:
• it does not appear to relate to the experimental results; or
• it is so vague and high-level it could refer to almost any results.

Satisfactory: A conclusion is drawn that relates to the question but:
• it contains no attempt to calculate the crossover point; or
• it reflects confusion about what the experimental results show.

Excellent: A conclusion is drawn, giving convincing information or estimates about when which technique is preferable based on the experimental results.
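If straight lines have been fitted to the two sets of timings (for example with np.polyfit as in the plotting sketch above), one way to estimate the crossover point is simply where the two fitted lines meet. The helper below is a sketch of that calculation; the names and the assumption of linear fits are illustrative only.

    # Illustrative sketch only: intersection of two fitted straight lines,
    # each given as a (slope, intercept) pair.
    def crossover(fit_a, fit_b):
        (m1, c1), (m2, c2) = fit_a, fit_b
        if m1 == m2:
            return None                   # parallel lines never cross
        return (c2 - c1) / (m1 - m2)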

Data Statement (1 mark)

Unsatisfactory: Some attempt has been made to provide the information necessary to reproduce the experiments or validate the results.

Satisfactory: Enough information is given to either reproduce the experiments, or the raw data created for the results sections is provided (either in the results sections themselves, in an appendix, or as a file included in the commit/zip file), but some detail or some files seem to be missing.

Excellent: All program calls, scripts used and raw data are clearly documented and can be found in the report, appendices or in separate files as appropriate.


Professional Presentation (2 marks)

Presentation (2 marks)

Unsatisfactory: The report was submitted as a PDF. Care has been taken over presentation such as appropriate use of punctuation and capital letters, making sure sentences are complete, and making sure graphs appear fully on a page and don't obscure each other or the text, even if there are some minor issues. All requested sections appear, but:
• there are persistent spelling errors that should have been caught by a spell-checker; or
• although all requested sections appear, the correct information is not always in the correct section; or
• graphs are difficult to read (e.g., fuzzy fonts; labels too small).

Satisfactory: The report was submitted as a PDF and a good attempt has been made to make it look professional but:
• the language used isn't appropriate for academic writing – for instance over-use of slang terms or colloquial expressions; or
• sentence/paragraph construction is frequently inelegant or obfuscates meaning – e.g., many sentences that run on over several lines with multiple sub-clauses making them difficult to understand, or paragraphs that appear as a wall of text taking up whole pages; or
• aesthetics are inelegant, suggesting a poor grasp of academic writing tools – e.g., awkwardly sized or placed images and figures, or equations not created using an equation editor; or
• not all figures are mentioned in the text so it may not be entirely clear what they refer to.

Excellent: The report is submitted as a PDF. The presentation is excellent with only minor spelling errors. The language is consistent with professional academic writing and could appear with only minor changes in an academic paper.


Extension (5 marks)

Question and Conclusion (1 mark)

Unsatisfactory: A research question is stated or a conclusion drawn, but not both.

Satisfactory: A research question is posed and a conclusion drawn but they are very difficult to understand, or do not appear to relate to each other, the experiment or the results.

Excellent: There is a clearly stated research question and conclusion that is well-considered and follows from the results.

Basics (2 marks)

Unsatisfactory: The extension experiment is poorly presented, lacking at least one of hypothesis, design or results.

Satisfactory: The extension experiment is competently structured but there are issues with the hypothesis, design or results (e.g. missing details, missing labelling on graphs, incorrect statements etc., as in the sections above).

Excellent: All the basics of the experiment are in place, with a clearly stated and justified hypothesis, all necessary detail in a well-considered design, and well-presented results.

Ambition (2 marks)

Unsatisfactory: The extension experiment is a simple re-run of one of the previous experiments with a minor change (e.g., with a different hash function or collision resolution strategy).

Satisfactory: The extension experiment has required additional design work beyond the previous experiments but still follows from the suggestions given (e.g., it looks at query rather than insertion time, compares collision resolution strategies or hash functions, or attempts to look at the effect of rehashing), or was generally straightforward.

Excellent: The hypothesis chosen has challenging elements for the design which have been excellently handled. It was not suggested in the coursework.


