联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Database作业Database作业

日期:2023-04-11 07:05

2023/4/9 21:33 COMP9315 23T1 - Assignment 2

https://cgi.cse.unsw.edu.au/~cs9315/23T1/assignment/2/ 1/7

COMP9315 23T1 Assignment 2

Relational Operators and Memory Buffer

Management

DBMS Implementation

Last updated: Saturday 8th April 6:48pm

Most recent changes are shown in red ... older changes are shown in brown.


Recent Updates

[2023-04-07 22:40]

A minor update for test4.

[2023-04-07 15:00]

We have prepared four additional test cases. You can download the updated code from

the same address. Note that the updated version is only about test cases. No code is

changed. Please let us know if you find some errors in the source code.

We also provides more marking details. In real applications, you cannot make any

assumption about the input. Therefore, around 10--15 test cases (including the provided

ones) will be used in marking. If you implement the buffer correctly, your output should pass

all these cases. To facilitate the verification of your solution in marking, db.c may be

modified at two places. First, information in log_read_page, log_release_page,

log_open_file, and log_close_file will be printed as described in code comments.

Second, as you can see in init_db, the current version simply assigns page id by

increasing an integer from 0. In marking, a different way will be used to produce page id.

That means the only way you get the page id is from the first INT64 in each page, and it can

be any possible value. The only guaranteed thing is that each page id is unique for each

table (i.e., page id of pages from different tables may be same).

We also provide some tips to do assignment. As we introduced in lectures, you do not need

to consider the data in current memory buffers when analyzing IOs (i.e., you can assume the

memory buffer is empty). A reference can be found in cost models. For example, when join

two tables via nested for loop, you need to choose one table for outer for loop to minimize

IOs. Some table pages may already in the memory buffer, but you do not need to consider

them. The clock-sweep strategy is an important in this assignment. To help you do the

assignment, we present the following pseudocode about the process when requesting a

page from the buffer. You can compare it with your implementation.

... initialization ...

nvb = 0

foreach slot in buffer[]: slot.pin=0, slot.usage=0

... request a page from buffer ...

function request_page(page)

if(page in buffer)

buffer[page index].usage++

2023/4/9 21:33 COMP9315 23T1 - Assignment 2

https://cgi.cse.unsw.edu.au/~cs9315/23T1/assignment/2/ 2/7

buffer[page index].pin=1

return the page

while(true)

if(buffer[nvb].pin == 0 && buffer[nvb].usage == 0)

read page from disk to buffer[nvb]

buffer[nvb].pin = 1

buffer[nvb].usage = 1

nvb = (nvb+1) % buffer_size

return the page

else

if(buffer[nvb].usage > 0) buffer[nvb].usage --

nvb = (nvb+1) % buffer_size

... notify buffer stop using the page ...

function release_page(page)

buffer[page index].pin = 0

[2023-04-03]

We provide some tips for doing assignment and marking. To simplify the assignment, you

do not need to leave an output buffer for both select and join. For example, assume

that N represent the number of buffer slots. To conduct a nested for-loop join, you can use

N-1 slots to store N-1 pages of one table and use 1 slot to store one page of the other table

(note that in lecture notes, you need to allocate one slot as an output page). You can also

use additional space (e.g., arrays) for sorting or building hash table. All additional space use

in each join need to be freed. The other tip for marking the join operation is that you need to

consider the performance of the join order. For instance, in the nested for-loop join, using

a different table in the outer loop may produce different read IOs. Therefore, you need to

choose the reasonable table in the outer for-loop. As we discussed in lectures, when

analyzing the performance (number of IOs) of an operation, you do not need to consider the

data in the memory buffer slots. If you are not clear about any requirement of the

assignment, feel free to ask questions on the forum.

[2023-03-28]

Fix the bug of writing unexpected values in db.c. The bug would not affect you, and the

update is just for rigor.

Aims

This assignment aims to give you

an understanding of how data is organized inside a DBMS

practice in implementing simple relational operators

2023/4/9 21:33 COMP9315 23T1 - Assignment 2

https://cgi.cse.unsw.edu.au/~cs9315/23T1/assignment/2/ 3/7

Deadline

Pre-requisites:

Late Penalty:

Marks:

Submission:

practice in implementing the memory buffer

The goal is to implement two relational operators, selection and join. Given the memory

buffer slots and the page size, you need to implement your own memory buffer to read/write

data files from hard drive.

Summary

Monday 17 April, 9pm

Buffer Pools, File Management, and Relational Operations

5% of the max assessment mark per-day reduction, for up to 5 days

This assignment contributes 20 marks toward your total mark for this

course.

Moodle > Assignment > Assignment 2

Make sure that you read this assignment specification carefully and completely before

starting work on the assignment. Questions which indicate that you haven't done this will

simply get the response "Please read the spec". We will provide an extension for all students

if there is a major code update in the last three days. Let us know if there is any error in

the assignment specification or in the source code.

Introduction

The course is DBMS implementation. It is time to implement a database. The database is

somehow a software or a system to retrieve data from files in hard drive. Users

communicate with the database via SQL, and SQL is intepreted into the combination of

several relational operators. In this assignment, you are required to implement two typical

relational operators, select and join. To implement the opeartors, you mush read data from

hard drive and manage the memory buffer given that there is a limited number of buffer

slots. Related lecture notes can be found as follows. Undoubtedly, a real model database

involves a great amount of features and techniques. We simulate a mini database for this

assignment. To simplify the problem, we make the folowing assumptions.

The server contains only one database.

The database never updates.

All attributes are integers, i.e., the size of each tuple is fixed.

Each table takes one file, i.e., there is no overflow files.

The select condition is always the equality test for a single attribute.

Get Ready

You do not need the PostgreSQL server for this assignment. All tasks are about C

programming. That means you may do the assignment in any Linux environemnt (windows

is not supported since linux system functions are used). Remember to test your code on

2023/4/9 21:33 COMP9315 23T1 - Assignment 2

https://cgi.cse.unsw.edu.au/~cs9315/23T1/assignment/2/ 4/7

vxdb server before submission since we will test your code there. The following process

assumes that you are on the vxdb server. You may change some paths according your local

environment. You should start by getting the template codes.

... assume you are on the vxdb server ...

$ cd /localstorage/$USER

$ unzip /web/cs9315/23T1/assignment/2/ass2.zip

You should see a new directory called ass2 and the following files inside.

ass2/

|--- README.md // log for recent updates

|--- db.c // global database information

|--- db.h // definitions for all data types

|--- main.c // main entry

|--- run.sh // script to run the code

|--- ro.c // relational operators

|--- ro.h // definitions for ro.c

|--- Makefile // compile rules

|--- test1/ // directory containing testing files

|--- data_1.txt // testing data

|--- query_1.txt // testing queries

|--- expected_log_1.txt // expected results

Next, you can complile and run the code.

$ cd /localstorage/$USER/ass2

$ make

... should no error here ...

$ ./run.sh

... execture the program and examine the output ...

After running the code, a log file called log.txt will be created under the test1/ directory.

Code structure

A great amount of comments are provided in each file. Read all codes to fully understand

the data structure. You can start from reading db.h and then main.c. For instance, db.h

contains a set of important data types to represent the database, the table, and the global

configuration. Certain code snippets will give you inspirations to complete the task. We

briefly describe what main.c does below.

1. load several input arguments.

2. init_db() reads an input .txt data file and produce all table files in a certain folder.

3. init() is an interface for your implementation.

4. run() loads and inteprets queries from a .txt query file and invoke your

implementations of sel() and join() to process queries. Query results will be written

2023/4/9 21:33 COMP9315 23T1 - Assignment 2

https://cgi.cse.unsw.edu.au/~cs9315/23T1/assignment/2/ 5/7

into a log file.

5. release() is an interface for your implementation to release your data structure used

for query processing.

An example of table file structure. The following figure gives an example of the file

structure for each table stored in hard drive. Each table file is named as the table's oid,

which is exactly same as PostgreSQL. The oid of a table can be identified from Table.oid

(see db.h for details). Since we are dealing with fixed-size tuples, we do not need to record

the meta information in each table. For each page, we first add an INT64 to represent the

page id. Then, we sequentially add each tuples until the sapce is not enough to hold a tuple.

We add 0 to the end and get the whole page. The example shows a table file for three

attributes. You can identify the details by reading the implementation of init_db() in

db.c.

Find more details by reading the code. All following content in this specification assumes

that you have already read the code and at least understood the role of each file.

Tasks

In this assignment, you need to implement two functions, sel() and join(), which

represent select and join relational operators, respectively. You are also free to add any

additional functions and files. In your implementation, you are given the page size

(Conf.page_size), the number of available memory buffer slots (Conf.buf_slots), and

the maximum allowed number of opend files (Conf.file_limit) at the same time.

To support the two operators, you need to write your own code for memory buffer

management based on the input configuration, which means the number of data pages

cannot exceed the number of buffer slots. You are only required to implement the clock

sweep policy for memory buffer replacement, which is also used by Postgresql. To verified

your process of buffer memory for marking, you need to invoke log_read_page()

every time you read a page from the hard drive and invoke log_release_page()

every time you release a page from the hard drive. The input argument for both functions

are the page id. You can check what we do in these functions in db.c.

As introduced in lectures, opening a file is relatively costly. To improve the performance, we

can maintain several opened file pointers. When reading data from a file, you can directly

used the file pointer if the file has already been opened and maintained. Given the

2023/4/9 21:33 COMP9315 23T1 - Assignment 2

https://cgi.cse.unsw.edu.au/~cs9315/23T1/assignment/2/ 6/7

Conf.file_limit, you need to close a file pointer when the number of opened files

reaches the limit. You are free to use any replacement stratey for opened file pointers. To

verified your process of file pointers management, you need to invoke

log_open_file() every time you open a file and log_close_file() every time you

close a file. The input argument for both functions are the table oid. You can check what we

do in these functions in db.c. You will be reduced 3/20 marks if maintaining opened file

pointers is not implemented (i.e., open a file when you need, and close the file when finish

reading).

Task 1: select

The function for select operator is presented as follows. You can see it in ro.h. For

simplicity, we only consider the equality testing for a single attribute in the condition of select.

Instead of giving an attribute name, we directly provide the idx argument, which the offset

of the attribute in the tuple. idx starts from 0. cond_val is the value for equality testing of

the (idx+1)-th attribute value. Clearly, table_name represents the table name. The

returned value is a _Table pointer. See db.h for details of _Table. Below is an example of

output.

_Table* sel(const UINT idx, const INT cond_val, const char* table_name);

# a table example, named "T1"

# the table contains three attributes for each tuple

1 2 3

5 8 13

21 34 55

2 8 6

# the following result table is derived after running sel(1,8,"T1")

5 8 13

2 8 6

Task 2: join

The function for (inner) join operator is presented as follows. You can see it in ro.h. Similar

to sel(), we directly provide the attribute offset of each table for joining. table1_name and

table2_name are names for two tables, respectively. idx1 and idx2 are the offsets of two

attributes in table1_name and table2_name, respectively. The returned value is a

_Table pointer. Below is an example of output.

_Table* join(const UINT idx1, const char* table1_name, const UINT idx2,

# a table example, named "T1"

# the table contains three attributes for each tuple

1 2 3

5 8 13

21 34 55

5 8 6

2023/4/9 21:33 COMP9315 23T1 - Assignment 2

https://cgi.cse.unsw.edu.au/~cs9315/23T1/assignment/2/ 7/7

5 8 6

# a table example, named "T2"

# the table contains three attributes for each tuple

1 2 3

4 5 6

7 8 9

10 11 12

# the following result table is derived after running join(0, "T1", 1, "T2")

5 8 13 4 5 6

5 8 6 4 5 6

Assume we aim to join two tables named T1 and T2. T1.npages and T2.npages denote

the number of pages of two tables, respectively. For the strategy of join, you are only require

to implement the naive nested for-loop join when the number of buffer slots is not

enough to hold two tables, i.e., Conf.buf_slots < T1.npages + T2.npages. The

number of pages for a table can be calculated by Conf.page_size and Table.ntuples.

If the buffer slots are enough (i.e., Conf.buf_slots >= T1.npages + T2.npages),

you can load all data in memory and perform an in-memory join. In this case, you need to

implement either sort-merge join or hash join.

Submission & Evaluation

In this assignment, you are required to complete ro.c and ro.h. You may add some

additional .c and .h files, and if that happens, you definitely need to update Makefile to

make your new files work. Therefore, you need to submit at least three files, ro.c, ro.h,

and Makefile. You need to submit all files on Moodle. You do not need to upload all green

files marked in the file structure since we will add them to your code in testing. That means

you are free to change them for developing and debugging, but all of them will be

overwritten even you submit them. Specifically, below is the process to test your code.

1. download your solution.

2. copy db.h, db.c and main.c to the folder (the same ones as you can see in ass2/).

Overwrite files with the same name.

3. execute make to compile the code.

4. copy run.sh and test/ to the folder, which provide different and much more test

cases.

5. execute ./run.sh to load data and process queries.

6. compared the output log file with the expected result.

Have Fun~


相关文章

版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp