
Individual Assignment

Artificial Intelligence in Games

September 23, 2021

In this assignment, you will train a Deep Q-Network [Mnih et al., 2015] to play Atari Breakout. Please read

this entire document and the paper before you start working on the assignment.

Figure 1: Atari Breakout.

1 Programming environment

You will use Google Colaboratory to train a Deep Q-Network on a graphics processing unit (GPU). If you are

not familiar with Google Colaboratory, please read this tutorial. You will need a personal Google account.

As a first step, create a new notebook on Google Colaboratory. On the menu at the top, select Runtime,

select Change runtime type, select GPU as the Hardware accelerator, and select Save. After this, your notebook

will have access to a GPU.

The next step is to install OpenAI baselines. In a new code cell, type !pip install baselines --no-deps and execute the cell (note the exclamation mark).

The next step is to mount your Google Drive so that it is accessible from your notebook. This can be

accomplished by executing a code cell containing the code shown in Listing 1 and enabling access as required.

Listing 1: Mounting a Google Drive.

from google.colab import drive
drive.mount('/content/drive')

In order to import the Atari Breakout ROM, you need to download the file Roms.rar from the Atari 2600

VCS ROM Collection. You should extract this file and upload the two zip files that it contains to a folder called

game_ai/roms in your Google Drive. You do not need to extract these zip files. Finally, you can import the ROMs by executing !python -m atari_py.import_roms /content/drive/MyDrive/game_ai/roms in a new

code cell. If this process is successful, the ROMs will be listed as they are imported.


2 Baseline implementation

This assignment is based on a didactic but inefficient implementation of Deep Q-Networks that is available here.

As a first step, copy and paste this implementation into a code cell in your notebook. Although you could

run this code cell and observe the training process, this would not be very useful: your virtual machine would

run out of memory or time (a Google Colaboratory virtual machine dies after twelve hours, and so everything

that is not saved to Google Drive will be lost), and the trained model (convolutional neural network) would not

be saved to Google Drive.

This baseline implementation uses OpenAI Gym [Brockman et al., 2016] to create the Atari Breakout environment.

If you are not familiar with the OpenAI Gym, please read this introduction. You should become

familiar with the main OpenAI Gym environment methods, such as reset and step.
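If it helps, the following minimal snippet illustrates the reset/step interaction loop that the baseline relies on. It is only a sketch: it assumes the classic (pre-0.26) Gym API in use at the time of this assignment, where reset returns an observation and step returns four values, and it drives the raw environment with random actions.

import gym

# Minimal sketch of the core Gym interaction loop (classic Gym API assumed):
# reset() returns the initial observation, step() returns (observation, reward, done, info).
env = gym.make("BreakoutNoFrameskip-v4")
observation = env.reset()
done = False
while not done:
    action = env.action_space.sample()                    # random action, for illustration only
    observation, reward, done, info = env.step(action)    # advance the environment one frame
env.close()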

The implementation also uses several Gym wrappers to transform the original Atari Breakout environment

for our purposes. You can find more information about the Atari wrappers here.

Finally, the implementation is also based on Keras, an open-source Python library for artificial neural

networks based on TensorFlow. If you are not familiar with Keras, please read the introduction to Keras for

researchers and the functional interface tutorial. If you are not familiar with NumPy, you may also read the

NumPy quickstart tutorial and the NumPy broadcasting tutorial.

After reading the paper and completing the steps listed above, you should be able to understand the baseline

implementation of Deep Q-Networks.

3 Improved implementation

This section lists some improvements that you should make to the baseline implementation.

The baseline implementation uses an optimizer called Adam [Kingma and Ba, 2014]. In order to follow the

paper, your first task is to replace the optimizer created by a call to keras.optimizers.Adam with an optimizer

created by a call to tf.keras.optimizers.RMSprop. You should use a learning rate of 0.0001 and a discounting

factor (rho) of 0.99.
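As a rough sketch, the substitution could look like the line below; the variable name optimizer is assumed from the baseline code and may differ in your copy.

import tensorflow as tf

# Sketch of the optimizer substitution (the name `optimizer` is an assumption
# based on the baseline code; adjust it to match your copy).
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0001, rho=0.99)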

The baseline implementation uses the variable epsilon_random_frames to control how many frames (states)

should be observed before greedy actions are taken. You should eliminate this variable and remove it from the

conditional test where it appears.
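A hedged sketch of the resulting action selection is shown below. The helper function and its arguments (epsilon, num_actions) are illustrative only; they follow the spirit of the baseline rather than its exact code.

import numpy as np

# Illustrative helper: with epsilon_random_frames removed, exploration depends only on
# epsilon. The function name and arguments are hypothetical, not taken from the baseline.
def select_action(model, state, epsilon, num_actions):
    if epsilon > np.random.rand(1)[0]:
        return np.random.choice(num_actions)              # explore with probability epsilon
    q_values = model(np.expand_dims(state, axis=0), training=False)
    return int(np.argmax(q_values[0]))                    # otherwise act greedily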

The baseline implementation has a replay buffer of size 100000. This will cause memory problems on Google

Colaboratory. You should reduce the size of the replay buffer to 10000.

The baseline implementation will train the Deep Q-Network until the average return obtained during the

last 100 episodes surpasses 40. This will take longer than the maximum lifetime of a virtual machine. You

should change the outermost while loop so that training is interrupted after 2000000 frames (states) have been

observed.
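One possible shape of the modified loop is sketched below; the name frame_count is assumed from the baseline code and may differ in your copy, and the loop body is elided.

# Sketch of the interrupted training loop (frame_count is assumed from the baseline code).
max_frames = 2000000
frame_count = 0
while frame_count < max_frames:    # replaces the baseline's unbounded loop
    frame_count += 1
    # ... environment step, epsilon decay, replay-buffer update, bookkeeping ...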

The baseline implementation draws a batch and updates the Deep Q-Network every four frames only after

the replay buffer size is larger than the batch size (note that the corresponding comment found in the code is

inaccurate). You should draw a batch and update the Deep Q-Network every four frames only after the replay

buffer is full.
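A sketch of the modified update condition is given below. The names update_after_actions, done_history, and max_memory_length are assumptions based on the baseline code and may not match your copy exactly.

# Illustrative check for when to sample a batch and update the Q-network.
# All names here are assumptions based on the baseline code.
update_after_actions = 4          # update every four frames
max_memory_length = 10000         # replay buffer capacity after the change above
done_history = []                 # stands in for one of the replay-buffer lists

def should_update(frame_count):
    # Baseline: update once the buffer exceeds the batch size.
    # Modified: update only once the replay buffer is full.
    return frame_count % update_after_actions == 0 and len(done_history) >= max_memory_length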

The baseline implementation stores the return of the last 100 episodes in a list called episode_reward_history.

Instead, you should store the return of every episode. The variable running reward should still only contain the

average return of the last 100 episodes.
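For instance, the bookkeeping could be changed along the lines sketched below; the variable names follow the baseline, and the exact placement inside the training loop is up to you.

import numpy as np

# Sketch: keep the return of every episode, but compute the running reward over the last 100.
episode_reward_history = []

def record_episode(episode_reward):
    episode_reward_history.append(episode_reward)            # keep the full history, no truncation
    return float(np.mean(episode_reward_history[-100:]))     # running reward (last 100 episodes)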

The baseline implementation computes the one-step return for each state in a batch incorrectly. Although

this implementation works for Atari Breakout, it would not work for many other Atari games. You should

replace the two assignments in the baseline implementation with the single assignment shown in Listing 2.

Listing 2: Corrected one-step return computation.

# Baseline implementation:
# updated_q_values = rewards_sample + gamma * tf.reduce_max(
#     future_rewards, axis=1
# )
# updated_q_values = updated_q_values * (1 - done_sample) - done_sample

# Correct implementation:
updated_q_values = rewards_sample + (1 - done_sample) * gamma * tf.reduce_max(
    future_rewards, axis=1
)


After the training loop, you should use Keras to save your model (convolutional neural network) to a Google

Drive folder (for instance, /content/drive/MyDrive/game_ai/model). You should also use NumPy to save the list episode_reward_history that contains the return for each training episode.
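A minimal sketch of this saving step is shown below, assuming that model and episode_reward_history are the objects from your training cell; the paths are examples only, so adjust them to your own Drive layout.

import numpy as np

# Sketch of saving the trained network and the per-episode returns to Google Drive.
# The paths are examples; `model` and `episode_reward_history` come from the training cell.
model.save('/content/drive/MyDrive/game_ai/model')
np.save('/content/drive/MyDrive/game_ai/episode_reward_history.npy',
        np.array(episode_reward_history))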

4 Training

As a next step, train your Deep Q-Network for 2000000 frames. This may take more than ten hours. The

average return during the last 100 episodes should be close to 10.

Hint: You should test your implementation by training your Deep Q-Network for a few frames only. Once

you are confident in your implementation, you should train your Deep Q-Network overnight.

5 Testing

In order to test your Deep Q-Network after training, you should create an additional code cell.

As a first step, load the trained model (convolutional neural network) from your Google Drive. After that,

create the Atari Breakout environment and employ the same wrappers used for training the model.

You should use the wrapper gym.wrappers.Monitor to record videos of your trained agent interacting with

the environment. An environment wrapped by this class behaves normally, except for the fact that the episodes

are recorded into a video that is stored to disk. The wrapper gym.wrappers.Monitor is able to see past the

other wrappers in order to record raw frames.

You should record ten episodes of interaction using a greedy (not ε-greedy) policy based on your trained

Deep Q-Network. Listing 3 presents an incomplete version of the testing code.

Important: The videos will be written to your Google Drive only after the method env.close is called.

Listing 3: Recording videos of a Deep Q-Network agent.

from google.colab import drive
drive.mount('/content/drive')

from baselines.common.atari_wrappers import make_atari, wrap_deepmind
import numpy as np
import tensorflow as tf
from tensorflow import keras
import gym

seed = 42

model = keras.models.load_model('/content/drive/MyDrive/path/to/your/model')

env = make_atari("BreakoutNoFrameskip-v4")
env = wrap_deepmind(env, frame_stack=True, scale=True)
env.seed(seed)
env = gym.wrappers.Monitor(env, '/content/drive/MyDrive/path/to/your/videos',
                           video_callable=lambda episode_id: True, force=True)

n_episodes = 10
returns = []

for _ in range(n_episodes):
    ret = 0
    state = np.array(env.reset())
    done = False
    while not done:
        # FIXME: Incomplete
        ret += reward
    returns.append(ret)

env.close()
print('Returns: {}'.format(returns))
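The FIXME in Listing 3 is deliberately left for you to complete. As a hedged illustration only, one possible loop body that acts greedily with respect to the trained network is sketched below; it reuses the variables defined in Listing 3 and is not the only acceptable solution.

# One possible completion of the inner loop of Listing 3 (a sketch, not the required answer):
# pick the greedy action from the trained Q-network, then step the wrapped environment.
while not done:
    state_tensor = tf.expand_dims(tf.convert_to_tensor(state), 0)   # add a batch dimension
    q_values = model(state_tensor, training=False)                  # Q-values for this state
    action = int(tf.argmax(q_values[0]).numpy())                    # greedy action
    state, reward, done, _ = env.step(action)
    state = np.array(state)
    ret += reward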

6 Submission instructions

This assignment corresponds to 20% of the final grade for this module. You will work individually. The deadline

for submitting this assignment is Dec 17th, 2021. Penalties for late submissions will be applied in accordance

with the School policy. The submission cut-off date is 7 days after the deadline. Submissions should be made

through QM+. Submissions by e-mail will be ignored. Please always check whether the files were uploaded

correctly to QM+. Cases of extenuating circumstances have to go through the proper procedure in accordance

with the School policy. Only cases approved by the School in due time will be considered.

If you are unsure about what constitutes plagiarism, please ask.

The (individual) submission must be a single zip file. This file must contain a single folder named [student

id] [student name]. This folder must contain a report, a folder named videos, and a folder named code.

The code folder must contain two files: dqn_train.py and dqn_test.py. These files should contain the content

of the code cells used for training and testing, respectively. Based solely on the correctness and clarity of your

code, you will receive the following number of points for accomplishing each of the following tasks:

• Implementing the improvements listed in Section 3. [20/100]

• Implementing the testing described in Section 5. [10/100]

The videos folder must contain 10 video files. These files should be the result of the testing procedure

described in Section 5. You will receive the following number of points for accomplishing the following task:

• Recording videos that provide evidence of successful training (see Section 4). [10/100]

The report must be a single pdf file. Other formats are not acceptable. The report must be excellently

organized and identified with your name, student number, and module identifier. You will receive the following

number of points for answering each of the following questions:

1. During training, why is it necessary to act according to an ε-greedy policy instead of a greedy policy (with

respect to Q)? [10/100, please write less than 100 words]

2. How do the authors of the paper [Mnih et al., 2015] explain the need for a target Q-network in addition

to an online Q-network? [10/100, please write less than 100 words]

3. Explain why the one-step return for each state in a batch is computed incorrectly by the baseline implementation

and compare it to the correct implementation. [10/100, please write less than 300 words]

4. Plot a moving average of the returns that you stored in the list episode_reward_history. Use a window of length 100. Hint: np.convolve(episode_reward_history, np.ones(100)/100, mode='valid'). [10/100]

5. Several OpenAI Gym wrappers were employed to transform the original Atari Breakout environment.

Briefly explain the role of each of the following wrappers: MaxAndSkipEnv, EpisodicLifeEnv, WarpFrame,

ScaledFloatFrame, ClipRewardEnv, and FrameStack. [20/100, please write less than 500 words]

You may also work on the following tasks for your own edification:

• Adapt your code to train a Deep Q-Network to play a different Atari game available in the OpenAI Gym.

• Write your own wrapper for an Atari game. This wrapper should transform observations or rewards in order to make it much easier to find a high-performing policy (a minimal example of the wrapper pattern is sketched below).
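As a starting point, the general wrapper pattern under the classic Gym API looks roughly like the hypothetical example below, which subclasses gym.RewardWrapper; the class name and the penalty value are made up for illustration.

import gym

# Hypothetical reward-shaping wrapper illustrating the general pattern:
# subclass gym.RewardWrapper and override reward() to transform every reward.
class LivingPenaltyWrapper(gym.RewardWrapper):
    def __init__(self, env, penalty=0.01):
        super().__init__(env)
        self.penalty = penalty

    def reward(self, reward):
        # Subtract a small constant per step to discourage stalling.
        return reward - self.penalty

# Usage sketch:
# env = LivingPenaltyWrapper(gym.make("BreakoutNoFrameskip-v4"), penalty=0.01)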

References

[Brockman et al., 2016] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and

Zaremba, W. (2016). OpenAI gym.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv

preprint arXiv:1412.6980.

[Mnih et al., 2015] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A.,

Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement

learning. Nature, 518(7540):529.


