Konpat’s Record of Struggles

Pickling PyTorch views is pickling the whole underlying memory

2019-03-11T00:00:00+07:00

It is important to know, at pickling time, what we are really saving to disk. Saving a PyTorch tensor most likely means saving its whole underlying storage. This is easy to overlook with views, because many views can share the same underlying storage. A view can be very narrow while its underlying storage is still very large, so saving that view to disk produces a large file as well.

For example:

import torch

x = torch.randn(100)
torch.save(x[0], 'saved.pt')

The saved.pt file would have the size of 100 floats instead of one. To save just the viewed portion of the storage, we need to create a distinct storage containing only that portion of the old one. This can be done using x[0].clone().
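A quick way to see the difference is to serialize into an in-memory buffer and compare sizes. This is only a sketch: the helper name saved_size and the tensor size are my own choices, and the exact byte counts depend on the PyTorch version.

```python
import io

import torch


def saved_size(tensor: torch.Tensor) -> int:
    """Return the number of bytes torch.save writes for `tensor`."""
    buffer = io.BytesIO()
    torch.save(tensor, buffer)
    return buffer.getbuffer().nbytes


x = torch.randn(100_000)

view_bytes = saved_size(x[0])           # a view: shares x's storage
clone_bytes = saved_size(x[0].clone())  # a copy: owns a one-element storage

print(f"view:  {view_bytes} bytes")
print(f"clone: {clone_bytes} bytes")
```

The view should come out far larger than the clone, because the whole storage of x travels with it.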

Setting up LXC with Intel GPU (Proxmox), keyboard, mouse and audio

2019-03-11T00:00:00+07:00

The goal is to create a Linux container on Proxmox that uses the integrated Intel GPU. In addition, the keyboard, mouse and audio must work.

Some details in this guide are specific to my setup. I use Xubuntu (Xfce) as the desktop environment, so things might differ somewhat on others, but I hope that overall the steps are transferable.

Let’s start by assuming that we have already created a new container and installed the desktop environment in it.

Container with GPU

You should be familiar with lxc.cgroup.devices.allow. We need to declare it for every device we want the container to access, be it the GPU, keyboard, mouse or audio.

For example, to grant access to container number 123, edit /etc/pve/lxc/123.conf.

The overall constraint of using a GPU inside the container boils down to this: we install the GPU driver on the host, allow the container to access the device, and install the GPU driver in the container as well. Because the container cannot load or change kernel modules, the driver module must already be loaded by the host.

In my case, the Intel GPU driver is already loaded on the Proxmox host as the i915 kernel module. You can see for yourself using lspci -v: find the Intel GPU and look at its loaded kernel module. Please note that i915 seems to be the module name for all Intel GPUs.

Granting the container access

What then needs to be granted to the container? In my experience:

/dev/dri/card0
/dev/dri/renderD128
/dev/fb0. This is the framebuffer for card0.
/dev/tty7. Actually, you can use any TTY that is not currently in use. To my understanding this is like allowing the container to access our monitor, I guess.

To grant the container access, we need the devices’ major and minor numbers, which can be obtained like this:

root@desktop:/dev# ls -l /dev/dri/card0
crw-rw---- 1 root video 226, 0 Mar 11 00:53 /dev/dri/card0

Device	Major	Minor
`/dev/dri/card0`	226	0
`/dev/dri/renderD128`	226	128
`/dev/fb0`	29	0
`/dev/tty7`	4	7
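If you would rather not read the numbers off ls -l by eye, the same information can be pulled out with Python's standard library. This is a small sketch; the function name device_numbers is my own, and the device paths are the ones from my machine, so the loop simply skips any that do not exist on yours.

```python
import os
import stat
from typing import Tuple


def device_numbers(path: str) -> Tuple[str, int, int]:
    """Return (type, major, minor) for a device node; type is 'c' or 'b'."""
    info = os.stat(path)
    if stat.S_ISCHR(info.st_mode):
        kind = "c"  # character device
    elif stat.S_ISBLK(info.st_mode):
        kind = "b"  # block device
    else:
        raise ValueError(f"{path} is not a device node")
    return kind, os.major(info.st_rdev), os.minor(info.st_rdev)


for path in ["/dev/dri/card0", "/dev/dri/renderD128", "/dev/fb0", "/dev/tty7"]:
    if os.path.exists(path):  # skip devices this machine does not have
        print(path, *device_numbers(path))
```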

So the first change we need to make to /etc/pve/lxc/<id>.conf is:

lxc.cgroup.devices.allow: c 226:0 rwm
lxc.cgroup.devices.allow: c 226:128 rwm
lxc.cgroup.devices.allow: c 4:7 rwm
lxc.cgroup.devices.allow: c 29:0 rwm
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
lxc.mount.entry: /dev/tty7 dev/tty7 none bind,optional,create=file
lxc.mount.entry: /dev/fb0 dev/fb0 none bind,optional,create=file

Explanations

lxc.cgroup.devices.allow: c 226:0 rwm allows the container to rwm (read, write, mknod) the character device (c) with major number 226 and minor number 0. You can use a wildcard here, e.g. c 226:* rwm.

Granting the permission alone is not enough if the device is not present in the container’s /dev directory. The second part bind-mounts the host device nodes onto corresponding files in the container’s /dev. If you want to map a whole directory (and everything within it), you can use the create=dir option instead, i.e. lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir. This can be problematic if you have more than one GPU. Letting the container see a GPU it is not meant to use is a bad idea because it leads to many problems down the line. For example, in my case, I also have an Nvidia card. The container’s Xorg tries to use both GPUs via Nvidia’s Optimus technology, which might not be what you want.

Please note that dev/tty7, without a leading /, is the correct container path. With a leading / it won’t work.

As said, the host’s TTY number can be anything, but the container’s TTY needs to be tty7 unless configured otherwise. It seems that Xorg looks for this particular TTY; if it is not present, you will see an error in the container’s /var/log/Xorg.0.log. So it is possible to configure it this way: lxc.mount.entry: /dev/ttyN dev/tty7 where N is the number you like.
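Writing these lines by hand gets tedious once there are more devices, so here is a small sketch that generates them. The function name lxc_device_lines is my own, the device list is just the table from my machine, and it assumes every entry is a character device mapped to the same path inside the container.

```python
from typing import List, Tuple

# (host path, major, minor), copied from the device table above.
DEVICES: List[Tuple[str, int, int]] = [
    ("/dev/dri/card0", 226, 0),
    ("/dev/dri/renderD128", 226, 128),
    ("/dev/tty7", 4, 7),
    ("/dev/fb0", 29, 0),
]


def lxc_device_lines(devices: List[Tuple[str, int, int]]) -> List[str]:
    """Build the cgroup allow rules followed by the bind-mount entries."""
    allow = [
        f"lxc.cgroup.devices.allow: c {major}:{minor} rwm"
        for _, major, minor in devices
    ]
    mounts = [
        # The container-side target is relative: no leading slash.
        f"lxc.mount.entry: {path} {path.lstrip('/')} none bind,optional,create=file"
        for path, _, _ in devices
    ]
    return allow + mounts


print("\n".join(lxc_device_lines(DEVICES)))
```

The output can be pasted into /etc/pve/lxc/<id>.conf as is.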

You can now stop and start your container:

pct stop <id>
pct start <id>

You should see the monitor displaying your desktop environment, although the keyboard and mouse might not work yet.

Many would suggest installing the Intel driver for accelerated video decoding on the container as well:

apt install i965-va-driver

Switching between TTY’s

You can use ctrl+alt+f<number> to switch to TTY<number>. This might not always work if the desktop manager on the current TTY ignores these keys.

Preventing the screen tearing

With an Intel GPU you may see tearing artifacts while scrolling. It feels like playing games without vertical sync. If you don’t encounter this problem, feel free to skip this part.

File: /etc/X11/xorg.conf.d/20-intel.conf

Section "Device"
    Identifier "Intel Graphics"
    Driver "intel"
    Option "TearFree" "true"
EndSection

Thanks to https://wiki.archlinux.org/index.php/intel_graphics

Keyboard and mouse in LXC

Keyboard and mouse are /dev/input devices even though they are connected via USB ports. It is easy to be misled by /dev/usb, but it has nothing to do with them. You don’t need to grant access to /dev/usb to make the keyboard and mouse work.

We do the same thing as before using:

ls -l /dev/input
total 0
drwxr-xr-x 2 root root     180 Mar 10 17:05 by-id
drwxr-xr-x 2 root root     200 Mar 10 17:05 by-path
crw-rw---- 1 root input 13, 64 Mar 10 17:05 event0
crw-rw---- 1 root input 13, 65 Mar 10 17:05 event1
crw-rw---- 1 root input 13, 74 Mar 10 17:05 event10
crw-rw---- 1 root input 13, 75 Mar 10 17:05 event11
... a lot more ...

Personally, I grant and map them all to the container by updating /etc/pve/lxc/<id>.conf:

lxc.cgroup.devices.allow: c 13:* rwm
lxc.mount.entry: /dev/input dev/input none bind,optional,create=dir

Next, we need to tell Xorg to use these input devices and which driver to use for each. I have found the evdev driver to work well for both mouse and keyboard.

To install the driver in the container:

apt install xserver-xorg-input-evdev

To tell Xorg which devices to load, we need to know which /dev/input/eventN nodes belong to our mouse and keyboard. This can be done using evtest. You can install the program on the host or in the container:

apt install evtest

Running it gives:

root@desktop:/dev# evtest
No device specified, trying to scan all of /dev/input/event*
Available devices:
/dev/input/event0:	Sleep Button
/dev/input/event1:	Power Button
/dev/input/event2:	Power Button
/dev/input/event3:	Logitech USB Receiver
/dev/input/event4:	Logitech USB Receiver
/dev/input/event5:	Microsoft Microsoft® Nano Transceiver v2.0
/dev/input/event6:	Microsoft Microsoft® Nano Transceiver v2.0
/dev/input/event7:	Microsoft Microsoft® Nano Transceiver v2.0
/dev/input/event8:	Video Bus
/dev/input/event9:	PC Speaker
/dev/input/event10:	HDA NVidia HDMI/DP,pcm=3
/dev/input/event11:	HDA NVidia HDMI/DP,pcm=7
/dev/input/event12:	HDA NVidia HDMI/DP,pcm=8
/dev/input/event13:	HDA NVidia HDMI/DP,pcm=9
/dev/input/event14:	HDA Intel PCH Front Mic
/dev/input/event15:	HDA Intel PCH Rear Mic
/dev/input/event16:	HDA Intel PCH Line
/dev/input/event17:	HDA Intel PCH Line Out
/dev/input/event18:	HDA Intel PCH Front Headphone
/dev/input/event19:	HDA Intel PCH HDMI/DP,pcm=3
/dev/input/event20:	HDA Intel PCH HDMI/DP,pcm=7
/dev/input/event21:	HDA Intel PCH HDMI/DP,pcm=8
/dev/input/event22:	HDA Intel PCH HDMI/DP,pcm=9
/dev/input/event23:	HDA Intel PCH HDMI/DP,pcm=10
Select the device event number [0-23]: 

The mouse is Logitech in this case, and the keyboard is Microsoft. I could tell Xorg to load them all. However, if you are in doubt, you can always test a device by entering its number and then typing or moving the mouse to see whether it triggers any events.

In the container, edit the file /usr/share/X11/xorg.conf.d/10-lxc-input.conf:

Section "InputDevice"
    Identifier "event3"
    Option "Device" "/dev/input/event3"
    Option "AutoServerLayout" "true"
    Driver "evdev"
EndSection

Section "InputDevice"
    Identifier "event5"
    Option "Device" "/dev/input/event5"
    Option "AutoServerLayout" "true"
    Driver "evdev"
EndSection

... for any event you want ...
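Since every section has the same shape, the file can also be generated. A minimal sketch, assuming the event numbers from my evtest listing (Logitech mouse on 3 and 4, Microsoft keyboard on 5 to 7); the function name input_device_section is my own, and your numbers will almost certainly differ.

```python
def input_device_section(event_number: int) -> str:
    """Render one Xorg InputDevice section for /dev/input/event<N>."""
    return "\n".join([
        'Section "InputDevice"',
        f'    Identifier "event{event_number}"',
        f'    Option "Device" "/dev/input/event{event_number}"',
        '    Option "AutoServerLayout" "true"',
        '    Driver "evdev"',
        "EndSection",
    ])


# Event numbers taken from the evtest listing above; adjust to your machine.
events = [3, 4, 5, 6, 7]
print("\n\n".join(input_device_section(n) for n in events))
```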

Restart the container again. Mouse and keyboard should now work.

Audio in LXC

Most of the configuration in this guide needs some understanding of the Linux hardware and software stack, which I lack, and nowhere is this worse than with audio. It is convoluted. I just managed to make it work, so don’t expect my explanation to be correct.

I think a good starting point, if you are interested in the keywords of the Linux audio stack, is: https://www.cnblogs.com/little-ant/p/4016180.html. This is not the original post, but it is a working link.

Disclaimer: I have seen many approaches to audio in containers. This guide covers just one of them.

Make audio work on the host

To be safe, I think it is a good start to make sound work on the host first, and then make it work in the container. We will need several tools from alsa-utils, so we first install it on the host.

apt install alsa-utils

Now, we are going to list the audio devices:

root@desktop:/dev# aplay -L
null
    Discard all samples (playback) or generate zero samples (capture)
default:CARD=PCH
    HDA Intel PCH, ALC887-VD Analog
    Default Audio Device
sysdefault:CARD=PCH
    HDA Intel PCH, ALC887-VD Analog
    Default Audio Device
front:CARD=PCH,DEV=0
    HDA Intel PCH, ALC887-VD Analog
    Front speakers
surround21:CARD=PCH,DEV=0
    HDA Intel PCH, ALC887-VD Analog
    2.1 Surround output to Front and Subwoofer speakers
surround40:CARD=PCH,DEV=0
    HDA Intel PCH, ALC887-VD Analog
    4.0 Surround output to Front and Rear speakers
surround41:CARD=PCH,DEV=0
    HDA Intel PCH, ALC887-VD Analog
    4.1 Surround output to Front, Rear and Subwoofer speakers
surround50:CARD=PCH,DEV=0
    HDA Intel PCH, ALC887-VD Analog
    5.0 Surround output to Front, Center and Rear speakers
surround51:CARD=PCH,DEV=0
    HDA Intel PCH, ALC887-VD Analog
    5.1 Surround output to Front, Center, Rear and Subwoofer speakers
surround71:CARD=PCH,DEV=0
    HDA Intel PCH, ALC887-VD Analog
    7.1 Surround output to Front, Center, Side, Rear and Woofer speakers
hdmi:CARD=PCH,DEV=0
    HDA Intel PCH, HDMI 0
    HDMI Audio Output
... many more ...

It is not obvious which of these is responsible for the output we are listening to. In my setup, the monitor is connected via an HDMI cable, so I’m looking for HDMI in particular.

To find out which one, we need to test each output by producing some noise. This can be done using speaker-test. You might need to test all of them, but the command is something like:

speaker-test -D <name> -c 2
# for example
speaker-test -D hdmi:CARD=PCH,DEV=0 -c 2

One of them will produce noise. You then know which one it is, and also that the audio works.

Make audio work on the container

For the container, we need /dev/snd, whose devices have the major number 116. We will map them all into the container.

root@desktop:/dev# ls -l /dev/snd
total 0
drwxr-xr-x 2 root root       80 Mar 10 17:05 by-path
crw-rw---- 1 root audio 116,  8 Mar 10 17:05 controlC0
crw-rw---- 1 root audio 116,  2 Mar 10 17:05 controlC1
crw-rw---- 1 root audio 116, 17 Mar 10 17:05 hwC0D0
crw-rw---- 1 root audio 116, 18 Mar 10 17:05 hwC0D2
crw-rw---- 1 root audio 116,  7 Mar 10 17:05 hwC1D0
crw-rw---- 1 root audio 116, 10 Mar 10 17:05 pcmC0D0c
crw-rw---- 1 root audio 116,  9 Mar 10 17:05 pcmC0D0p
crw-rw---- 1 root audio 116, 16 Mar 10 17:05 pcmC0D10p
... many more ...

Please note that if the audio producer in the container is not root but, say, joe, you need to add joe to the audio group. In the container:

sudo usermod -aG audio joe

At the host, we need to update the container configuration again:

lxc.cgroup.devices.allow: c 116:* rwm
lxc.mount.entry: /dev/snd dev/snd none bind,optional,create=dir

After restarting the container, test if the audio devices are available:

aplay -L

You should see the very same output as on the host.

Finally, we tell PulseAudio in the container to use an ALSA sink on the specific device we have tested. Modify the file /etc/pulse/default.pa by looking for load-module module-alsa-sink. You might find it commented out; uncomment it and update it like so:

load-module module-alsa-sink device=<name>
# for example
load-module module-alsa-sink device=hdmi:CARD=PCH,DEV=0

After that you might need to restart PulseAudio. In the container:

pulseaudio -k

Restarting the container should yield the same result. Your audio should now work. You can open the PulseAudio app and see whether the audio device is recognized.

However, it is possible that the device is muted (by some external means). To guarantee it is not, we run, in the container:

alsamixer

You should see the mixer screen with one column per channel. Any channel showing MM is muted; select it and press M to unmute it. Once no channel shows MM, nothing is muted.

Approximately Optimal Approximate Reinforcement Learning (Kakade & Langford, 2002)

2019-03-09T00:00:00+07:00

In this article, we will try to retell the paper in a simpler, easier-to-follow way. For now, we focus only on the first part of the paper, which tries to answer the following question:

Is there a way to guarantee policy improvement?

And the answer is yes.

To show this we need 3 ingredients:

Policy performance measurement
Policy improvement algorithm
Improved policy performance estimation

The overall idea is that if we can give a lower bound on how much the improved policy gains over the current one, and show that this bound is > 0, we guarantee policy improvement.

So the path forward is to look at ways of estimating the performance of the improved policy.

Basics

$V_\pi(s)$ is a state-value function.

$Q_\pi(s, a)$ is an action-value function.

$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$ is an advantage function.

Policy performance

We first define the policy performance as the average performance over start states.

$\eta_D(\pi) = \mathrm{E}_{s \sim D} \left[ V_\pi(s) \right]$

where $D$ is the start state distribution. In the paper, $D$ can be replaced with another distribution at will, under the notion of a restart distribution, but we don’t care much about that here. Let’s just say it is some start state distribution.

Conservative greedy policy improvement

The usual policy improvement is to change the current policy to $\mathrm{argmax}_a A(s, a)$ for all $s$. Here we look at a more general case that lets us change the policy more gradually, using $\alpha$ as a parameter.

$\pi_{new} = (1-\alpha)\pi + \alpha\pi'$

where $\pi'$ is a greedy improvement of $\pi$ .

So our goal is to guarantee that the policy improves under the conservative greedy improvement, that is:

$\eta(\pi_{new}) - \eta(\pi) > 0 \label{eq:eta}$

That is, at any moment we need to find an $\alpha$ that satisfies the above inequality. In other words, how small should $\alpha$ be so that it still improves the policy?

Improved policy performance estimation

As you can see from $\eqref{eq:eta}$, we need the performance of the improved policy, $\eta(\pi_{new})$, but we want to get it cheaply because we might need to try several values to find the right $\alpha$. It is not viable to simply rerun policy evaluation (on a new set of experience from $\pi_{new}$); that is just too slow. Instead, we estimate a lower bound on it.

In the paper, the authors show two ways to estimate it:

Using a Taylor series to the first order. Unfortunately this approach does not get us any closer to a lower bound. But it is a useful starting point anyway.
Using the authors’ proposed approach. This gives a lower bound.

Using Taylor’s series to approximate

If we write $\eta(\pi)$ using Taylor’s expansion to the first degree we will get:

$\eta(\pi+x) = \eta(\pi) + x \nabla_\pi \eta(\pi) + \mathrm{O}(x^2) \label{eq:eta_x}$

Here we have an approximation error in the order of $\mathrm{O}(x^2)$ albeit not knowing its constant factor.

Since our policy improvement is not exactly in the form of aforementioned $x$ , we rather want it to be in the form of $\alpha$ (recall the conservative policy improvement).

So we want to get the estimate of something like:

$\eta_\pi(\alpha) = \eta((1-\alpha)\pi + \alpha \pi') = \eta(\pi) + \alpha \nabla_\alpha \eta(\pi) + \mathrm{O}(\alpha^2) \label{eq:eta_alpha}$

Compared with $\eqref{eq:eta_x}$, the only problematic part is the second term (first derivative): we want $\nabla_\alpha$, not $\nabla_\pi$.

We now derive $\nabla_\alpha \eta(\pi)$.

The gradient of policy performance was first derived in the policy gradient theorem (Sutton et al., 1999). We state it here without further ado:

$\begin{equation} \begin{aligned} \nabla_\pi \eta(\pi) &= \sum_{s, a} d_\pi(s) Q_\pi(s, a) \nabla \pi(a|s) \\ &= \sum_{s, a} d_\pi(s) A_\pi(s, a) \nabla \pi(a|s) \end{aligned} \label{eq:policy_gradient} \end{equation}$

where $d_\pi(s)$ is a discounted state visitation probability. For completeness:

$d_\pi(s) = \sum_{t=0}^\infty \gamma^t \mathrm{P}(s_t=s, \pi)$

where $\mathrm{P}(s_t=s, \pi)$ is the probability of visiting state $s$ after taking $t$ steps under policy $\pi$. Please note that $d_\pi$ is not a probability distribution (it sums to $\frac{1}{1-\gamma}$, not $1$), but we can normalize it by multiplying it by $1-\gamma$ (since $\sum_{i=0}^\infty \gamma^i = \frac{1}{1-\gamma}$).

In $\eqref{eq:policy_gradient}$, we replace $\nabla_\pi$ with $\nabla_\alpha$ and write the policy as a function of $\alpha$:

$\begin{align} \nabla_\alpha \eta(\pi) &= \sum_{s, a} d_\pi(s) A_\pi(s, a) \nabla_\alpha \left[ (1-\alpha) \pi + \alpha \pi' \right] \label{eq:eta_alpha1} \end{align}$

Consider $\nabla_\alpha \left[ (1-\alpha) \pi + \alpha \pi' \right]$ :

$\nabla_\alpha \left[ (1-\alpha) \pi + \alpha \pi' \right] = -\pi + \pi' \label{eq:pi_grad}$

We substitute $\eqref{eq:pi_grad}$ into $\eqref{eq:eta_alpha1}$ followed by some algebra:

$\begin{equation} \begin{aligned} \nabla_\alpha \eta(\pi) &= \sum_{s, a} d_\pi(s) A_\pi(s, a) (-\pi + \pi') \\ &= - \sum_{s, a} d_\pi(s) A_\pi(s, a) \pi(a|s) + \sum_{s,a} d_\pi(s) A_\pi(s, a) \pi'(a|s) \\ &= - \sum_{s} d_\pi(s) \cancel{\sum_a A_\pi(s, a) \pi(a|s)} + \sum_{s,a} d_\pi(s) A_\pi(s, a) \pi'(a|s) \\ &= \sum_{s,a} d_\pi(s) A_\pi(s, a) \pi'(a|s) \end{aligned} \label{eq:eta_grad} \end{equation}$

This gradient can be computed without the need to further interact with the environment. We just need to change $\pi$ to $\pi'$ and then rerun on the previous experience.

Policy advantage

The quantity in $\eqref{eq:eta_grad}$ is closely related to the policy advantage, which is defined as:

$\mathbb{A}_\pi(\tilde{\pi}) = \mathrm{E}_{s \sim d_\pi} \mathrm{E}_{a \sim \tilde{\pi}} A_\pi(s, a) \label{eq:policy_adv}$

Since it is written as an expectation, it uses the normalized state distribution $(1-\gamma) d_\pi$. Hence: $\mathbb{A}_\pi(\tilde{\pi}) = (1-\gamma) \nabla_\alpha \eta(\pi)$

Intuitively, the policy advantage tells us how much $\tilde{\pi}$ tries to take large advantages (be greedy). If $\tilde{\pi} = \pi$, this quantity is $0$. It is maximized when $\tilde{\pi}$ is the greedy policy with respect to $\pi$.

Don’t be confused! $\mathbb{A}_\pi$ is a policy advantage, which looks at all states, but $A_\pi$ is an advantage function, which looks at a particular state and action.
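To make the difference concrete, here is a tiny tabular sketch of the policy advantage. All the numbers (two states, two actions, the visitation weights and the Q values) are made up for illustration, and the function names are my own.

```python
def policy_advantage(d, pi_tilde, adv):
    """E_{s~d} E_{a~pi_tilde} [A_pi(s, a)] for a normalized state distribution d."""
    return sum(
        d_s * sum(p * a for p, a in zip(pi_s, adv_s))
        for d_s, pi_s, adv_s in zip(d, pi_tilde, adv)
    )


def greedy_policy(adv):
    """Deterministic policy that picks the highest-advantage action in each state."""
    policy = []
    for adv_s in adv:
        best = max(range(len(adv_s)), key=lambda a: adv_s[a])
        policy.append([1.0 if a == best else 0.0 for a in range(len(adv_s))])
    return policy


# Made-up numbers: 2 states, 2 actions.
d = [0.6, 0.4]                 # normalized visitation, (1 - gamma) * d_pi
pi = [[0.5, 0.5], [0.5, 0.5]]  # current policy pi(a|s)
q = [[1.0, 0.0], [0.0, 2.0]]   # Q_pi(s, a)

# V_pi(s) = E_{a~pi} Q_pi(s, a) and A_pi(s, a) = Q_pi(s, a) - V_pi(s)
v = [sum(p * qa for p, qa in zip(pi_s, q_s)) for pi_s, q_s in zip(pi, q)]
adv = [[qa - v_s for qa in q_s] for q_s, v_s in zip(q, v)]

print(policy_advantage(d, pi, adv))                  # pi against itself
print(policy_advantage(d, greedy_policy(adv), adv))  # greedy policy against pi
```

The first number is zero because the advantage of $\pi$ averages out under $\pi$ itself; the second is positive because the greedy policy puts all its weight on the best action in each state.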

Taylor’s expansion of policy performance

We now get:

$\begin{equation} \begin{aligned} \eta(\pi_{new}) &= \eta(\pi) + \alpha \nabla_\alpha \eta(\pi) + \mathrm{O}(\alpha^2) \\ &= \eta(\pi) + \frac{\alpha}{1-\gamma} \mathbb{A}_\pi(\pi') + \mathrm{O}(\alpha^2) \end{aligned} \end{equation}$

Now, we can draw some conclusions from the above equation:

With policy improvement the second term (first derivative) is positive (if the policy is not optimal).
If $\alpha$ is small enough, the second term will dominate the third term (second derivative) resulting in policy improvement.

The only problem is that we don’t know how small $\alpha$ must be to guarantee the policy improvement, because the constant hidden in $\mathrm{O}(\alpha^2)$ is unknown. We now turn to a different approach.

Using the author’s approach

In order to guarantee policy improvement, we need to show that $\eta(\pi_{new}) - \eta(\pi) > 0$.

We first rewrite it in a different form.

Lemma 6.1

$\eta(\tilde{\pi}) - \eta(\pi) = \frac{1}{1-\gamma} \mathrm{E}_{a,s \sim \tilde{\pi}, d_{\tilde{\pi}}} \left[ A_\pi(s, a) \right]$

Proof:

$\begin{equation*} \begin{aligned} & \frac{1}{1-\gamma} \mathrm{E}_{a,s \sim \tilde{\pi}, d_{\tilde{\pi}}} \left[ A_\pi(s, a) \right] \\ &= \mathrm{E}_{s_0, a_0, s_1, a_1, \dots \sim \tilde{\pi}} \left[ A_\pi(s_0, a_0) + \gamma A_\pi(s_1, a_1) + \dots \right] \\ &= \mathrm{E}_{s_0, a_0, s_1, a_1, \dots \sim \tilde{\pi}} \left[ r_1 + \cancel{\gamma V_1} - V_0 + \gamma r_2 + \cancel{\gamma^2 V_2} - \cancel{\gamma V_1} + \dots \right] \\ &= \mathrm{E}_{s_0, a_0, s_1, a_1, \dots \sim \tilde{\pi}} \left[ \sum_{t=0}^\infty \gamma^t r_{t+1} - V_\pi(s_0) \right] \\ &= \mathrm{E}_{s_0 \sim D} \left[ V_{\tilde{\pi}}(s_0) - V_\pi(s_0) \right] \\ &= \eta(\tilde{\pi}) - \eta(\pi) \end{aligned} \end{equation*}$

Here $V_t$ is shorthand for $V_\pi(s_t)$, and the second step uses $A_\pi(s_t, a_t) = \mathrm{E} \left[ r_{t+1} + \gamma V_\pi(s_{t+1}) \right] - V_\pi(s_t)$.

With Lemma 6.1 we now have:

$\begin{equation} \begin{aligned} \eta(\pi_{new}) - \eta(\pi) &= \frac{1}{1-\gamma} \mathrm{E}_{a,s \sim \pi_{new}, d_{\pi_{new}}} \left[ A_\pi(s, a) \right] \\ &= \sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t, \pi_{new})} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] \end{aligned} \label{eq:eta_delta} \end{equation}$

where $P(s_t, \pi_{new})$ is the probability of visiting state $s_t$ at time $t$ under policy $\pi_{new}$ .

Evidently, we have $P(s_t, \pi)$ but not $P(s_t, \pi_{new})$. A way forward is to estimate $\eqref{eq:eta_delta}$ with what we have. Since the error in our estimate comes from the mismatch between $P(s_t, \pi_{new})$ and $P(s_t, \pi)$, intuitively a small $\alpha$ should result in a small mismatch, and vice versa. This suggests that $P(s_t, \pi_{new})$ shares a part with $P(s_t, \pi)$ that we can work with. This lets us make an informed estimate of that part and put a bound on the part we cannot work with.

The two parts

Consider the policy $\pi_{new}$. We know from its definition that it is a mixture policy. Another way to look at it is that we have two policies, $\pi$ and $\pi'$: with probability $\alpha$ we select an action according to $\pi'$, and with probability $1-\alpha$ we select an action from $\pi$.

At time $t$ , we define our two parts as:

Part one: we follow $\pi$ from $t=0$ until now.
Part two: at some point we selected an action from $\pi'$ .

If we have been following $\pi$.

The probability is $P(\text{follow } \pi) = (1-\alpha)^t = 1 - \rho_t$ .

The expected advantage function for this part is:

$\begin{equation*} (1-\rho_t) \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] \end{equation*}$

If we have followed $\pi'$ at any point before $t$.

The probability is $P(\text{not follow } \pi) = \rho_t = 1-(1-\alpha)^t$ .

The expected advantage function for this part is:

$\begin{equation*} \rho_t \mathrm{E}_{s \sim P(s_t|\text{not follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] \end{equation*}$

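We can bound this value from above. Because the advantage of $\pi$ averages to zero under $\pi$ itself, $\mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] = \alpha \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]$ (this is derived right after), so:

$\begin{align} \mathrm{E}_{s \sim P(s_t|\text{not follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] &\leq \max_s \left\vert \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] \right\vert \\ &= \alpha \max_s \left\vert \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \right\vert \\ &= \alpha \epsilon \end{align}$

where $\epsilon = \max_s \left\vert \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \right\vert$. The first step is obvious: an average over states cannot exceed the $\max$ over states. The same argument bounds the value from below by $-\alpha \epsilon$. As you shall see later on, the smaller the $\epsilon$, the tighter our estimate will be.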

The total expected advantage function at time $t$ is then the sum of both parts, where the second part is bounded from below by $-\alpha \epsilon$ because $\mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] = \alpha \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]$:

$\begin{equation} \begin{aligned} \mathrm{E}_{s \sim P(s_t, \pi_{new})} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] &= (1-\rho_t) \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] \\ & \quad + \rho_t \mathrm{E}_{s \sim P(s_t|\text{not follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] \\ & \geq (1-\rho_t) \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] \\ & \quad - \alpha \rho_t \epsilon \end{aligned} \label{eq:two_paths} \end{equation}$

Furthermore, we can show that $\mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] = \alpha \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]$ :

$\begin{equation} \begin{aligned} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] &= \sum_a ((1-\alpha) \pi(a|s) + \alpha \pi'(a|s)) A_\pi(s, a) \\ &= (1-\alpha) \cancel{\sum_a \pi(a|s) A_\pi(s, a)} + \alpha \sum_a \pi'(a|s) A_\pi(s, a) \\ &= \alpha \sum_a \pi'(a|s) A_\pi(s, a) \end{aligned} \label{eq:pi'_a} \end{equation}$
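If you don’t trust the cancellation, it is easy to check numerically for a single state. This is only a sanity check with random made-up numbers: the policies, the Q values and $\alpha = 0.3$ are arbitrary choices of mine.

```python
import random

random.seed(0)
n_actions = 4
alpha = 0.3


def random_distribution(n):
    """Sample a random probability vector of length n."""
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


pi = random_distribution(n_actions)        # current policy pi(.|s)
pi_prime = random_distribution(n_actions)  # any other policy pi'(.|s)
q = [random.uniform(-1.0, 1.0) for _ in range(n_actions)]  # made-up Q_pi(s, .)

v = sum(p * qa for p, qa in zip(pi, q))  # V_pi(s) = E_{a~pi} Q_pi(s, a)
adv = [qa - v for qa in q]               # A_pi(s, a) = Q_pi(s, a) - V_pi(s)

# The conservative mixture (1 - alpha) * pi + alpha * pi'
pi_new = [(1 - alpha) * p + alpha * pp for p, pp in zip(pi, pi_prime)]

lhs = sum(p * a for p, a in zip(pi_new, adv))            # E_{a~pi_new} A_pi
rhs = alpha * sum(p * a for p, a in zip(pi_prime, adv))  # alpha * E_{a~pi'} A_pi
under_pi = sum(p * a for p, a in zip(pi, adv))           # E_{a~pi} A_pi, the cancelled term

print(lhs, rhs, under_pi)
```

The first two printed values agree up to floating-point error, and the third is numerically zero, which is exactly the cancelled term.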

Substitute $\eqref{eq:pi'_a}$ into $\eqref{eq:two_paths}$ :

$\begin{equation} \begin{aligned} \mathrm{E}_{s \sim P(s_t, \pi_{new})} \mathrm{E}_{a \sim \pi_{new}} \left[ A_\pi(s, a) \right] &\geq \alpha (1-\rho_t) \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \\ & \quad - \alpha \rho_t \epsilon \end{aligned}\label{eq:two_paths2} \end{equation}$

This is just for a single time step $t$. We still need to incorporate it into the whole trajectory, which extends from $t=0$ to $t=\infty$.

Apply to all time steps

Substitute $\eqref{eq:two_paths2}$ into $\eqref{eq:eta_delta}$ :

$\begin{equation} \begin{aligned} \eta(\pi_{new}) - \eta(\pi) &\geq \alpha \sum_{t=0}^\infty \gamma^t (1-\rho_t) \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \\ & \quad - \alpha \epsilon \sum_{t=0}^\infty \gamma^t \rho_t \end{aligned} \label{eq:two_parts3} \end{equation}$

Looking more carefully at the first term, $\rho_t$ depends on $\alpha$, which is what we want to find (remember, we are looking for a policy-improving $\alpha$). In this form, solving for $\alpha$ is very hard because it does not have a closed form. We want the $\sum$ in the first term to be a constant independent of $\alpha$, so that solving for $\alpha$ becomes easy.

To achieve this, we split the first term of $\eqref{eq:two_parts3}$ and bound the part that contains $\rho_t$ using $\epsilon$:

$\begin{equation} \begin{aligned} \eta(\pi_{new}) - \eta(\pi) &\geq \alpha \sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \\ & \quad - \alpha \sum_{t=0}^\infty \gamma^t \rho_t \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \\ & \quad - \alpha \epsilon \sum_{t=0}^\infty \gamma^t \rho_t \\ &\geq \alpha \sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \\ & \quad - \alpha \epsilon \sum_{t=0}^\infty \gamma^t \rho_t \\ & \quad - \alpha \epsilon \sum_{t=0}^\infty \gamma^t \rho_t \\ &= \alpha \sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t|\text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right] \\ & \quad - 2\alpha \epsilon \sum_{t=0}^\infty \gamma^t \rho_t \\ \end{aligned} \label{eq:two_parts4} \end{equation}$

Consider $\sum_{t=0}^\infty \gamma^t \mathrm{E}_{s \sim P(s_t \vert \text{follow }\pi)} \mathrm{E}_{a \sim \pi'} \left[ A_\pi(s, a) \right]$. Following $\pi$ up to time $t$ means the state is distributed as $P(s_t, \pi)$, so this is in fact the unnormalized policy advantage (see $\eqref{eq:policy_adv}$). It equals $\frac{1}{1-\gamma} \mathbb{A}_\pi(\pi')$.

Now consider $\sum_{t=0}^\infty \gamma^t \rho_t$. Substituting the definition of $\rho_t$ gives:

$\begin{equation} \begin{aligned} \sum_{t=0}^\infty \gamma^t \rho_t &= \sum_{t=0}^\infty \gamma^t (1-(1-\alpha)^t) \\ &= \sum_{t=0}^\infty \gamma^t - \sum_{t=0}^\infty \gamma^t (1-\alpha)^t \\ &= \frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)} \\ &= \frac{\gamma\alpha}{(1-\gamma)(1-\gamma(1-\alpha))} \end{aligned} \end{equation}$
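The closed form is easy to double-check against a long truncated sum. The values $\gamma = 0.9$, $\alpha = 0.1$ and the horizon are arbitrary choices of mine for this sketch.

```python
def rho_sum_closed_form(gamma: float, alpha: float) -> float:
    """gamma * alpha / ((1 - gamma) * (1 - gamma * (1 - alpha)))."""
    return gamma * alpha / ((1 - gamma) * (1 - gamma * (1 - alpha)))


def rho_sum_truncated(gamma: float, alpha: float, horizon: int = 10_000) -> float:
    """Sum gamma^t * (1 - (1 - alpha)^t) for t = 0 .. horizon - 1."""
    return sum(gamma**t * (1 - (1 - alpha) ** t) for t in range(horizon))


gamma, alpha = 0.9, 0.1
print(rho_sum_closed_form(gamma, alpha), rho_sum_truncated(gamma, alpha))
```

Both numbers should agree to many decimal places, since the truncated tail is of order $\gamma^{10000}$.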

Substitute them into $\eqref{eq:two_parts4}$ :

$\begin{equation} \begin{aligned} \eta(\pi_{new}) - \eta(\pi) &\geq \frac{\alpha}{1-\gamma} \mathbb{A}_\pi(\pi') \\ & \quad - 2\alpha \epsilon \left[ \frac{\gamma\alpha}{(1-\gamma)(1-\gamma(1-\alpha))} \right] \\ &= \frac{\alpha}{1-\gamma} \left[ \mathbb{A}_\pi(\pi') - \frac{2\epsilon\gamma\alpha}{1-\gamma(1-\alpha)} \right] \end{aligned} \label{eq:two_parts5} \end{equation}$

Equation $\eqref{eq:two_parts5}$ is Theorem 4.1 of the paper.

Finding the right step

Finally, we want to guarantee the policy improvement by selecting a proper $\alpha$. We need to find an $\alpha$ for which the lower bound is positive:

$\begin{equation} \begin{aligned} \eta(\pi_{new}) - \eta(\pi) &\geq \frac{\alpha}{1-\gamma} \left[ \mathbb{A}_\pi(\pi') - \frac{2\epsilon\gamma\alpha}{1-\gamma(1-\alpha)} \right] \gt 0 \end{aligned} \label{eq:two_parts6} \end{equation}$

Such an $\alpha$ is guaranteed to improve the policy because we compute it from the pessimistic estimate (the lower bound).
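As a sketch of what solving for $\alpha$ looks like: rearranging the bracket (my own algebra, not taken from the paper) gives $\alpha < \frac{\mathbb{A}_\pi(\pi') (1-\gamma)}{\gamma (2\epsilon - \mathbb{A}_\pi(\pi'))}$, assuming $0 < \gamma < 1$ and $0 < \mathbb{A}_\pi(\pi') \leq \epsilon$. The function names and the numbers below are made up for illustration.

```python
def improvement_lower_bound(alpha, gamma, policy_adv, eps):
    """alpha / (1 - gamma) * (A - 2 * eps * gamma * alpha / (1 - gamma * (1 - alpha)))."""
    penalty = 2 * eps * gamma * alpha / (1 - gamma * (1 - alpha))
    return alpha / (1 - gamma) * (policy_adv - penalty)


def max_safe_alpha(gamma, policy_adv, eps):
    """Supremum of the alpha in (0, 1] for which the lower bound stays positive.

    Assumes 0 < gamma < 1 and 0 < policy_adv <= eps.
    """
    threshold = policy_adv * (1 - gamma) / (gamma * (2 * eps - policy_adv))
    return min(1.0, threshold)


# Made-up values for gamma, the policy advantage and epsilon.
gamma, policy_adv, eps = 0.9, 0.1, 0.5
limit = max_safe_alpha(gamma, policy_adv, eps)

print(limit)
print(improvement_lower_bound(0.5 * limit, gamma, policy_adv, eps))  # below the limit
print(improvement_lower_bound(2.0 * limit, gamma, policy_adv, eps))  # above the limit
```

Any $\alpha$ strictly below the limit makes the bound positive, while going above it makes the bound negative, at which point the guarantee is lost (the policy may still improve, we just cannot promise it).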

NFS file attribute caching causes reading inconsistency in multi-producer scenario

2019-02-17T00:00:00+07:00

With two producers, e.g. one local and one remote, changes made locally might take time to be seen by the remote.

Even without file data caching this can still be a problem. I experienced this first-hand while using Python. I think Python might use some kind of file attribute to detect file updates, and NFS caches these attributes. This is called “attribute caching”.

The actimeo mount option sets how long (in seconds) the NFS client caches file attributes. To disable attribute caching, consider using the noac option.

There are mentions of performance degradation with the noac option. The source suggests that actimeo=0 has a lower performance impact than noac.

Personally, I find both noac and actimeo=0 too much of a drag on performance. I now use actimeo=3 (3 seconds) and see a much lower drag.

Reference NFS mounting configuration in fstab (set actimeo to the timeout you want, e.g. actimeo=3):

203.0.113.0:/home       /nfs/home      nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0

NFS client documentation

Vim conflicting with VSCode

2019-02-17T00:00:00+07:00

The VSCode Vim extension binds <Ctrl-d> for its own use. VSCode itself uses this key for “selecting next word occurrence” in a multi-cursor fashion. If you would like VSCode to handle this key (and others of your choice), you can do so via the vim.handleKeys option.

The following table is copied and pasted from the project:

Setting	Description	Type	Default Value
vim.handleKeys	Delegate configured keys to be handled by VSCode instead of by the VSCodeVim extension. Any key in `keybindings` section of the package.json that has a `vim.use<C-...>` in the when argument can be delegated back to VS Code by setting `"<C-...>": false`. Example: to use `ctrl+f` for find (native VS Code behaviour): `"vim.handleKeys": { "<C-f>": false }`.	String	`"<C-d>": true`

Using the vim.handleKeys option, you can delegate keys to VSCode by setting each to false, like so:

"vim.handleKeys": {
    "<C-d>": false
}

Vim navigation between wrapped Lines in VSCode

2019-02-17T00:00:00+07:00

Long lines may be soft-wrapped automatically, and navigating with j and k then jumps over the whole wrapped line at once (because it is in fact a single line). Vim uses gj and gk to move between the displayed lines instead.

If you want j and k to be your default keys for moving through a wrapped block, one option is to remap them to gj and gk, as mentioned in this Stackoverflow thread.

The key mapping for Vim extension for VSCode is done thus:

"vim.normalModeKeyBindingsNonRecursive": [
    {
        "before": [
            "j"
        ],
        "after": [
            "g",
            "j"
        ]
    },
    {
        "before": [
            "k"
        ],
        "after": [
            "g",
            "k"
        ]
    }
],

Put it in your settings.json.

Off-policy Importance Sampling

2019-02-02T00:00:00+07:00

Learning a good policy in reinforcement learning usually involves two parts. The first is called prediction: it answers “if we follow this path, how good will it be?”, that is, it predicts the value of $v_\pi$ and/or $q_\pi$.

The other part is called control: with (or without) this information ( $v_\pi, q_\pi$ ), how can we find a good policy? In GPI (Generalized Policy Iteration), control (finding a good policy) relies on good prediction first, which is why prediction and control are often treated as two inseparable parts.

Here we focus on the prediction problem. In particular, we want to predict values such as $v_\pi$ or $q_\pi$ when we have no experience from playing policy $\pi$ at all, only experience from some other source. We call that source the behavioral policy, or $b$.

Is it possible, and if so how, to obtain $v_\pi$ when all we have is experience from $b$ ?

This problem is formally called off-policy prediction, because the experience we have is “off” from policy $\pi$.

Before going further, note that off-policy prediction has no one-size-fits-all recipe; each method has its own strengths and weaknesses (like everything in RL). Because off-policy prediction is much harder than on-policy prediction (the case without $b$ ), research in this area is also less mature.

So this post only gives an introduction to long-established methods for off-policy prediction. The idea presented here is Importance Sampling, a statistical technique that is easy to understand even without much RL background.

On-policy prediction

When $\pi$ is $b$, i.e. $\pi(a \vert s) = b(a \vert s)$ for all $a, s$, we can solve the prediction problem with a Monte Carlo method, namely

$\mathrm{E}_{r_{t+1}, r_{t+2}, ... \sim \pi} \left[ r_{t+1} + r_{t+2} + ... \right] = v_\pi$

In plain words, we can find $v_\pi$ (prediction) by averaging over a large amount of experience to get the “expected” value, provided that the experience comes from policy $\pi$ only.

Note: we sample $r$ repeatedly until the episode ends.

For simplicity we define $G_t = r_{t+1} + r_{t+2} + ...$ and call $G_t$ the return: the outcome of one playthrough (play once until the end, then sum all the rewards received).

The question for off-policy prediction is then: if $r_{t+1}, r_{t+2}, ...$ come not from $\pi$ but from $b$, can we still estimate $v_\pi$ as before?

Off-policy prediction using Importance Sampling

Importance Sampling is a statistical method that can be illustrated simply as follows.

Suppose we have a function $f(x)$ and a probability distribution $p(x)$, and we want the “expected value of $f(x)$ under $p(x)$ ”, that is

$\mathrm{E}_{x \sim p(x)} f(x)$

But what if we sample $x$ not from $p(x)$ but from $g(x)$ instead? We can still compute the expectation, as follows

$\begin{equation} \begin{split} \mathrm{E}_{x \sim p(x)} f(x) &= \sum_x p(x) f(x) \\ &= \sum_x \frac{g(x)}{g(x)} p(x) f(x) \\ &= \sum_x g(x) \frac{p(x)}{g(x)} f(x) \\ &= \mathrm{E}_{x \sim g(x)} \frac{p(x)}{g(x)} f(x) \end{split} \end{equation}$

Multiplying by $\frac{g(x)}{g(x)}$ is a trick that does not change the value at all, since it is multiplication by $1$ (this requires $g(x) > 0$ wherever $p(x) f(x) \neq 0$ ), but it lets us change where the samples come from: instead of sampling from $p(x)$ we sample from $g(x)$.

So if every time we sample $x \sim g(x)$, instead of using $f(x)$ directly, we first multiply it by $\frac{p(x)}{g(x)}$ (called the importance sampling ratio), i.e. we use $\frac{p(x)}{g(x)} f(x)$, we recover the original expectation. This also means we must know the value of $g(x)$.
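The identity can be checked numerically on a small discrete example. The sketch below is only an illustration: the three-point distributions p and g and the values of f are made-up numbers, and importance_sampling_estimate is a hypothetical helper name. It computes the expectation exactly in both forms, then estimates it from samples drawn from g only.

```python
import random

# Toy discrete example; every number here is made up for illustration.
xs = [0, 1, 2]
p = [0.7, 0.2, 0.1]    # target distribution p(x)
g = [0.2, 0.3, 0.5]    # sampling distribution g(x), nonzero wherever p is
f = [1.0, 4.0, 10.0]   # f(x)

# Exact expectation under p, computed directly and via the reweighted sum over g.
direct = sum(p[x] * f[x] for x in xs)
reweighted = sum(g[x] * (p[x] / g[x]) * f[x] for x in xs)


def importance_sampling_estimate(n_samples, seed=0):
    """Monte Carlo estimate of E_p[f] using samples drawn from g only."""
    rng = random.Random(seed)
    samples = rng.choices(xs, weights=g, k=n_samples)
    return sum((p[x] / g[x]) * f[x] for x in samples) / n_samples


estimate = importance_sampling_estimate(20000)
print(direct, reweighted, estimate)
```

The two exact sums agree, and the sampled estimate should land close to them even though no sample was ever drawn from p.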

Application to Reinforcement Learning

We start from the on-policy prediction case, which samples many $r$ from $\pi$. Now we have to sample from $b$ instead.

It may be easier to look at a single reward first.

To sample each reward, we need to:

Sample an action from the policy $\pi(a \vert s)$
Sample a reward from the reward distribution $p(r \vert s, a)$
Sample the next state (if we need to sample a further reward) from the state transition probability $p(s' \vert s, a)$

For the first reward, here $r_{t+1}$ (assuming we start at $t$ ), we have

$r_{t+1} \sim \pi(a_t|s_t) p(r_{t+1}|s_t, a_t)$

For the next reward $r_{t+2}$ we get

$r_{t+2} \sim \pi(a_t|s_t) p(s_{t+1}|s_t, a_t) \pi(a_{t+1}|s_{t+1}) p(r_{t+2}|s_{t+1}, a_{t+1})$

If we sample actions not from $\pi$ but from $b$, we can rewrite these as

$r_{t+1} \sim b(a_t|s_t) p(r_{t+1}|s_t, a_t)$

and

$r_{t+2} \sim b(a_t|s_t) p(s_{t+1}|s_t, a_t) b(a_{t+1}|s_{t+1}) p(r_{t+2}|s_{t+1}, a_{t+1})$

If we look at the first reward $r_{t+1}$ only and compare it with the $f(x), p(x), g(x)$ case, we see that:

$\pi(a_t \vert s_t) p(r_{t+1} \vert s_t, a_t)$ plays the role of the original $p(x)$,
$b(a_t \vert s_t) p(r_{t+1} \vert s_t, a_t)$ is the new $g(x)$, and
$r_{t+1}$ is our $f(x)$.

Given this, we can write the importance sampling ratio between $\pi$ and $b$ for reward $r_{t+1}$ as

$\begin{equation} \begin{split} \rho_{t:t} &= \frac{\pi(a_t|s_t) \cancel{p(r_{t+1}|s_t, a_t)}}{b(a_t|s_t) \cancel{p(r_{t+1}|s_t, a_t)}} \\ &= \frac{\pi(a_t|s_t)}{b(a_t|s_t)} \end{split} \end{equation}$

We use the letter $\rho$ (rho) for the importance sampling ratio.

Note: if $b(a_t \vert s_t)$ is very small, $\rho$ can blow up. The behavioral policy therefore must not be too “sharp”: it must give every action a chance of being chosen that is not too small.

Continuing to the second reward, the importance sampling ratio between $\pi$ and $b$ for $r_{t+2}$ is

$\begin{equation} \begin{split} \rho_{t:t+1} &= \frac{\pi(a_t|s_t) \cancel{p(s_{t+1}|s_t, a_t)} \pi(a_{t+1}|s_{t+1}) \cancel{p(r_{t+2}|s_{t+1}, a_{t+1})}}{b(a_t|s_t) \cancel{p(s_{t+1}|s_t, a_t)} b(a_{t+1}|s_{t+1}) \cancel{p(r_{t+2}|s_{t+1}, a_{t+1})}} \\ &= \frac{\pi(a_t|s_t)\pi(a_{t+1}|s_{t+1})}{b(a_t|s_t)b(a_{t+1}|s_{t+1})} \end{split} \end{equation}$

The subscript in $\rho_{t:t+1}$ says that we multiply the per-step ratios from $t$ through $t+1$.

Although the product contains state transition probabilities and reward probabilities, they cancel out in the end, because both $\pi$ and $b$ live in the same environment (MDP). We only need to care about the difference between the two policies.

For the other rewards we can extrapolate from here: $r_T$ has its own importance sampling ratio, equal to

$\begin{equation} \begin{split} \rho_{t:T-1} &= \prod_{k=t}^{T-1} \frac{\pi(a_k|s_k)}{b(a_k|s_k)} \end{split} \end{equation}$

Putting everything together: originally, in the on-policy case, we have

$\mathrm{E}_{r_{t+1}, r_{t+2}, \dots \sim \pi} \left[ r_{t+1} + r_{t+2} + \dots \right] = v_\pi$

In the off-policy case we have to multiply each reward by its own $\rho$ to get the same value

$\mathrm{E}_{r_{t+1}, r_{t+2}, \dots \sim b} \left[ \rho_{t:t}r_{t+1} + \rho_{t:t+1}r_{t+2} + \dots \right] = v_\pi$

Since we use a separate $\rho$ for each reward, this method is called Per-decision Importance Sampling: we build an importance sampling ratio for each decision (each action, each reward).
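For one recorded trajectory, the per-decision weighted sum can be computed with a running product. This is a minimal sketch without discounting; the function name and the list-based inputs (per-step probabilities of the actions actually taken) are assumptions made for illustration.

```python
def per_decision_return(pi_probs, b_probs, rewards):
    """Sum of rho_{t:k} * r_{k+1} over one trajectory starting at time t.

    pi_probs[k], b_probs[k]: pi(a_k|s_k) and b(a_k|s_k) for the action taken at step k.
    rewards[k]: the reward r_{k+1} received after that action.
    """
    total, rho = 0.0, 1.0
    for pi_p, b_p, r in zip(pi_probs, b_probs, rewards):
        rho *= pi_p / b_p      # extend the running product by one more decision
        total += rho * r       # each reward is weighted by the ratio up to its own step
    return total


# Two steps: the ratios are 1.0 and then 1.0 * 2.0 = 2.0
print(per_decision_return([0.5, 1.0], [0.5, 0.5], [1.0, 2.0]))  # 5.0
```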

In an actual implementation we need to spell out how $v(s)$ is computed. We can write

$\begin{equation} v(s) = \frac{\sum_{t \in \tau(s)} \sum_{k=t}^{T-1} \rho_{t:k} r_{k+1}}{|\tau(s)|} \end{equation}$

It may look a bit hard to read, but taken piece by piece:

$\sum_{k=t}^{T-1} \rho_{t:k} r_{k+1}$ is just the sum of the rewards after each one has been multiplied by its importance sampling ratio.
The rest is taking the “average” over many occurrences (because an expectation is an average).

The averaging uses the notation $\tau(s)$, which we define here as the set of all timesteps at which the trajectory is exactly in state $s$. Averaging over every time we pass through $s$ is then “the average of the summed rewards over every time we start from $s$ ”.

Additionally: defining $\tau(s)$ as the set of all timesteps that pass through state $s$ is called every-visit Monte Carlo. Another option is to “consider only the first $s$ of each episode”; we adjust the meaning of $\tau(s)$ slightly and call it first-visit Monte Carlo. The difference is that every-visit is easier to implement, because we do not have to remember whether this is the first visit to a state, but it may give odd-looking values: if we pass through $s$ several times in one episode, $v(s)$ is the average over all of those visits, unlike first-visit, which only considers the first one.
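The two definitions of $\tau(s)$ differ by a single slice. A sketch for one episode, where the function name and the list-of-states representation are assumptions for illustration:

```python
def visit_times(states, s, first_visit=False):
    """Timesteps of one episode at which the state equals s.

    every-visit: all matching timesteps; first-visit: only the earliest one.
    """
    times = [t for t, state in enumerate(states) if state == s]
    return times[:1] if first_visit else times


episode = ["A", "B", "A", "C"]
print(visit_times(episode, "A"))                    # [0, 2]
print(visit_times(episode, "A", first_visit=True))  # [0]
```

With several episodes, $\tau(s)$ is the collection of these timesteps across all of them.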

Importance Sampling for the whole return

There is another approach where we look at “all rewards as a whole”: instead of treating them piece by piece, we treat them together as the return $G_t$.

That is

$G_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots \quad \sim \pi$

We then look for the probability of sampling $r_{t+1}, r_{t+2}, \dots$ all together, which can be done in the same manner. Here we show the case of the first 2 rewards

$r_{t+1}, r_{t+2} \sim \pi(a_t|s_t) p(r_{t+1}|s_t, a_t) p(s_{t+1}|s_t, a_t) \pi(a_{t+1}|s_{t+1}) p(r_{t+2}|s_{t+1}, a_{t+1})$

We can then write the general case as

$r_{t+1}, r_{t+2}, \dots, r_{T} \sim \prod_{k=t}^{T-1} \pi(a_k|s_k) p(r_{k+1}|s_k, a_k) p(s_{k+1}|s_k, a_k) \label{eq:pG}$

From equation $\eqref{eq:pG}$ we can find the importance sampling ratio between $\pi$ and $b$ for $G_t$ as follows

$\begin{equation} \begin{split} \rho_{t:T-1} &= \prod_{k=t}^{T-1} \frac{\pi(a_k|s_k) \cancel{p(r_{k+1}|s_k, a_k)} \cancel{p(s_{k+1}|s_k, a_k)}}{b(a_k|s_k) \cancel{p(r_{k+1}|s_k, a_k)} \cancel{p(s_{k+1}|s_k, a_k)}} \\ &= \prod_{k=t}^{T-1} \frac{\pi(a_k|s_k)}{b(a_k|s_k)} \end{split} \end{equation}$

It looks very much like before, except that now we can multiply the whole $G_t$ by $\rho_{t:T-1}$ to get the correct value, instead of using a separate $\rho$ for each reward.

$\begin{equation} \mathrm{E}_{r_{t+1}, r_{t+2}, \dots \sim b} \left[ \rho_{t:T-1} \sum_{k=t}^{T-1} r_{k+1} \right] = v_\pi \end{equation}$

This method is called ordinary importance sampling (over the whole trajectory).

We can define $v(s)$ in the same way, by averaging over the many times we pass through state $s$.

$\begin{equation} v(s) = \frac{\sum_{t \in \tau(s)} \rho_{t:T-1} G_t}{|\tau(s)|} \label{eq:ord_is} \end{equation}$
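A minimal sketch of this estimator, assuming the per-step probabilities of each trajectory are available as lists; the function names and the numbers are made up for illustration.

```python
import math


def trajectory_ratio(pi_probs, b_probs):
    """rho_{t:T-1}: product of pi(a_k|s_k) / b(a_k|s_k) along one trajectory."""
    return math.prod(p / q for p, q in zip(pi_probs, b_probs))


def ordinary_is_value(rhos, returns):
    """Average of rho * G over all visits; the divisor is the number of visits."""
    return sum(rho * g for rho, g in zip(rhos, returns)) / len(returns)


rhos = [
    trajectory_ratio([1.0, 1.0], [0.5, 0.5]),  # 4.0: pi always takes these actions
    trajectory_ratio([0.0, 1.0], [0.5, 0.5]),  # 0.0: pi never takes the first action
]
returns = [1.0, 1.0]
print(ordinary_is_value(rhos, returns))  # 2.0
```

Note that the estimate (2.0) is larger than every observed return (1.0), which hints at how widely this estimator can swing.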

Comparing Importance Sampling and Per-decision Importance Sampling

The problem with importance sampling in general is the repeated multiplication of $\frac{\pi}{b}$, which makes the resulting values fluctuate a lot: the ratio depends on a long sequence of sampled actions, and multiplying the terms together widens the range of values even further. Importance sampling therefore has trouble with the size of its variance. Sutton & Barto (2018) mention that the variance of importance sampling can be infinite, which is easy to believe since an episode can be arbitrarily long.

So any work that uses importance sampling has to think carefully about limiting the variance. Methods that limit the variance well also tend to work well in practice.

It is known that using (whole-trajectory) importance sampling directly makes training very hard, and it may not converge at all. Per-decision importance sampling also contains the term $\rho_{t:T-1}$, but that term only affects the last rewards, whose importance may already have been reduced a lot by the discount $\gamma$, so the variance goes down.

In practice, however, the popular choice is a method that is “close to” but is neither importance sampling nor per-decision importance sampling: weighted importance sampling. This method actually “does not give the correct value” (it is biased), but it controls the variance very well, so it performs much better in practice.

Weighted Importance Sampling

Multiplying $\prod \frac{\pi}{b}$ many times makes $\rho$ fluctuate a lot, and more and more as the number of factors grows. One way to “limit” the fluctuation is to “divide” by something that is about as large.

We make a small modification to equation $\eqref{eq:ord_is}$, changing the denominator so that it fluctuates together with the numerator.

$v(s) = \frac{\sum_{t \in \tau(s)} \rho_{t:T-1} G_t}{\sum_{t \in \tau(s)} \rho_{t:T-1}} \label{eq:weighted_is}$

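What we see immediately is that this new $v(s)$ fluctuates much less: it is a weighted average of the observed returns $G_t$, so it can never leave their range. The variance goes down accordingly, which makes this method better suited to practical use.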
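The weighted estimator only changes the divisor. A sketch, where the function name is hypothetical and the fallback of returning 0 when all ratios are zero is an assumption of this example, not something the formula prescribes:

```python
def weighted_is_value(rhos, returns):
    """Weighted importance sampling: divide by the sum of ratios, not the count."""
    total_weight = sum(rhos)
    if total_weight == 0.0:
        return 0.0  # assumption: no trajectory was possible under pi, fall back to 0
    return sum(rho * g for rho, g in zip(rhos, returns)) / total_weight


print(weighted_is_value([4.0, 0.0], [1.0, 1.0]))   # 1.0
print(weighted_is_value([4.0, 1.0], [0.0, 10.0]))  # 2.0
```

With ratios [4.0, 0.0] and returns [1.0, 1.0], averaging by the count instead would give 2.0, while the weighted version stays at 1.0, inside the range of the observed returns.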

The next thing we notice is: surely we cannot just change the formula arbitrarily like this? The short answer is that we accept being “wrong”: equation $\eqref{eq:weighted_is}$ is biased and does not give the correct value on average the way ordinary importance sampling does, but we accept that cost because it makes the estimator much easier to work with.

The next question is: how biased is it? If the bias were so large that nothing of the original remained, it would not be any good. Indeed, equation $\eqref{eq:weighted_is}$ is biased, but the bias gets “smaller” as we average over more samples, and the estimate “approaches” the true value as the number of samples goes to infinity. Of course we never average over infinitely many samples, so we have to accept that it is still biased.

We can show that it approaches the true value in the limit of infinitely many samples as follows.

We rely on the fact that $\mathrm{E}[\rho] = 1$
and we show that weighted importance sampling approaches ordinary importance sampling as $\tau(s)$ grows to infinite size.

$\begin{equation} \frac{\sum_{t \in \tau(s)} \rho_{t:T-1}}{\left| \tau(s) \right|} = \mathrm{E}_b \left[ \rho_{t:T-1} \right] = 1 \label{eq:expect_rho_tau} \end{equation}$

Equation $\eqref{eq:expect_rho_tau}$ holds only by the law of large numbers, so if $\left \vert \tau(s) \right \vert$ is not large (approaching $\infty$ ) we cannot claim it.

We then rearrange slightly:

$\begin{equation} \sum_{t \in \tau(s)} \rho_{t:T-1} = \left| \tau(s) \right| \mathrm{E}_b \left[ \rho_{t:T-1} \right] = \left| \tau(s) \right| \end{equation}$

Substituting this into equation $\eqref{eq:weighted_is}$ gives

$\begin{equation} \begin{split} v(s) &= \frac{\sum_{t \in \tau(s)} \rho_{t:T-1} G_t}{\sum_{t \in \tau(s)} \rho_{t:T-1}} \\ &= \frac{\sum_{t \in \tau(s)} \rho_{t:T-1} G_t}{\left|\tau(s)\right|} \\ \end{split} \end{equation}$

We see that weighted importance sampling and ordinary importance sampling are in fact equal when $\vert \tau(s) \vert$ is infinite (large enough).

Note: doing the same for per-decision importance sampling is less straightforward, because each reward has a different $\rho$, which makes it hard to say what the appropriate weight is. There is, however, work that proposes weighted per-decision importance sampling: Eligibility Traces for Off-Policy Policy Evaluation (Precup, Sutton & Singh, 2000).

Showing that $\mathrm{E}[\rho] = 1$

For clarity we work through the case of $r_{t+1}, r_{t+2}$ in detail.

$\rho_{t:t+1} = \frac{\pi(a_t|s_t)\pi(a_{t+1}|s_{t+1})}{b(a_t|s_t)b(a_{t+1}|s_{t+1})}$

Take the expectation of $\rho$ under the behavioral policy $b$

$\begin{equation} \begin{split} \mathrm{E}_{a_t, a_{t+1} \sim b} \left[ \frac{\pi(a_t|s_t)\pi(a_{t+1}|s_{t+1})}{b(a_t|s_t)b(a_{t+1}|s_{t+1})} \right] &= \sum_{a_t} b(a_t|s_t) \sum_{a_{t+1}} b(a_{t+1}|s_{t+1}) \frac{\pi(a_t|s_t)\pi(a_{t+1}|s_{t+1})}{b(a_t|s_t)b(a_{t+1}|s_{t+1})} \\ &= \sum_{a_t} \cancel{b(a_t|s_t)} \frac{\pi(a_t|s_t)}{\cancel{b(a_t|s_t)}} \sum_{a_{t+1}} \cancel{b(a_{t+1}|s_{t+1})} \frac{\pi(a_{t+1}|s_{t+1})}{\cancel{b(a_{t+1}|s_{t+1})}} \\ &= \sum_{a_t} \pi(a_t|s_t) \sum_{a_{t+1}} \pi(a_{t+1}|s_{t+1}) \\ &= \sum_{a_t} \pi(a_t|s_t) 1 \\ &= 1 \end{split} \end{equation}$

Because $a_t$ and $a_{t+1}$ are independent of each other (once the states $s$ are given), each factor in $\rho$ can be treated separately, and the whole expectation becomes 1 because each part is 1.
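The same result can be checked by brute-force enumeration on a tiny example. Everything in this sketch is made up for illustration: three states, two actions, a deterministic transition out of the start state, and arbitrary probability tables for $\pi$ and $b$.

```python
from itertools import product

# Hypothetical policies over two actions in three states (each row sums to 1).
pi = {"s0": [0.9, 0.1], "s1": [0.3, 0.7], "s2": [0.5, 0.5]}
b = {"s0": [0.5, 0.5], "s1": [0.6, 0.4], "s2": [0.2, 0.8]}
next_state = {0: "s1", 1: "s2"}  # deterministic transition from s0, chosen by a_t

expected_rho = 0.0
for a0, a1 in product(range(2), repeat=2):
    s1 = next_state[a0]
    prob_under_b = b["s0"][a0] * b[s1][a1]            # chance that b produces (a0, a1)
    rho = (pi["s0"][a0] * pi[s1][a1]) / prob_under_b  # rho_{t:t+1} for this pair
    expected_rho += prob_under_b * rho

print(expected_rho)
```

The printed value is 1 up to floating-point rounding, whatever tables are used, as long as $b$ gives nonzero probability to every action.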

Application to n-step TD

Ordinarily TD (Temporal Difference) uses only 1 sampled reward to update the estimate of $v$ or $q$. It is well known that TD sits on the opposite side from Monte Carlo in terms of variance and bias, namely

TD has high bias and low variance
Monte Carlo has no bias but high variance

n-step TD tries to find a “middle ground” between these 2 methods. Instead of using a single reward as TD does:

$G_{t:t+1} = r_{t+1} + \gamma v(s_{t+1})$

we use several rewards, for example 2, as follows

$G_{t:t+2} = r_{t+1} + \gamma r_{t+2} + \gamma^2 v(s_{t+2})$

From here we can guess what $G_{t:t+n}$ looks like.

n-step TD uses $G_{t:t+n}$ in place of $G_{t:t+1}$. The n-step SARSA update then looks like this

$q(s_t,a_t) \leftarrow q(s_t, a_t) + \alpha \left[ G_{t:t+n} - q(s_t, a_t) \right]$

$G_{t:t+n}$ needs $v$, but when we only have $q$ we can compute $G_{t:t+n}$ as follows

$G_{t:t+n} = \left( \sum_{i=0}^{n-1} \gamma^i r_{t+i+1} \right) + \gamma^n \mathrm{E}_{a \sim \pi} \left[ q(s_{t+n}, a) \right]$

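If we estimate $v$ by sampling, i.e. we plug in $q(s_{t+n}, a_{t+n})$ for a single sampled action $a_{t+n} \sim \pi$ in the expression above, the method is called plain SARSA. If instead we compute the average exactly, as follows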

$G_{t:t+n} = \left( \sum_{i=0}^{n-1} \gamma^i r_{t+i+1} \right) + \gamma^n \sum_a \pi(a|s_{t+n}) q(s_{t+n}, a)$

we call the method n-step expected SARSA.
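The two targets differ only in the bootstrap term. A minimal on-policy sketch, where the function names and the numbers are assumptions for illustration:

```python
def discounted_sum(rewards, gamma):
    """sum_i gamma^i * r_{t+i+1} over the n rewards of the window."""
    return sum(gamma**i * r for i, r in enumerate(rewards))


def n_step_sarsa_target(rewards, gamma, q_sampled):
    """Bootstrap from q(s_{t+n}, a_{t+n}) for the single action that was sampled."""
    return discounted_sum(rewards, gamma) + gamma ** len(rewards) * q_sampled


def n_step_expected_sarsa_target(rewards, gamma, pi_next, q_next):
    """Bootstrap from the exact average sum_a pi(a|s_{t+n}) q(s_{t+n}, a)."""
    expected_q = sum(p * q for p, q in zip(pi_next, q_next))
    return discounted_sum(rewards, gamma) + gamma ** len(rewards) * expected_q


rewards, gamma = [1.0, 1.0], 0.5
q_next = [2.0, 4.0]  # hypothetical q(s_{t+2}, a) for two actions
print(n_step_sarsa_target(rewards, gamma, q_next[1]))                      # 2.5
print(n_step_expected_sarsa_target(rewards, gamma, [0.5, 0.5], q_next))    # 2.25
```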

When we combine this with off-policy importance sampling, what we need to ask is “does this term involve sampling an action?”, because we must multiply by $\rho$ wherever an action is sampled.

In n-step SARSA there are n-1 sampled actions behind the first n rewards. It is n-1 because SARSA learns the function $q(s,a)$, so the first action $a_t$ is given rather than sampled. There is 1 more sampled action when we sample to estimate $v$.

So for n-step SARSA the off-policy version is as follows, where the ratios start at $a_{t+1}$ and $\rho_{t+1:t} = 1$ (an empty product)

$G_{t:t+n} = \left( \sum_{i=0}^{n-1} \rho_{t+1:t+i} \gamma^i r_{t+i+1} \right) + \rho_{t+1:t+n} \gamma^n \mathrm{E}_{a \sim \pi} \left[ q(s_{t+n}, a) \right]$

For n-step expected SARSA we likewise have n-1 sampled actions for the first n rewards, but no further sampling when estimating $v$, because we compute it exactly. So the off-policy version of n-step expected SARSA is

$G_{t:t+n} = \left( \sum_{i=0}^{n-1} \rho_{t+1:t+i} \gamma^i r_{t+i+1} \right) + \rho_{t+1:t+n-1} \gamma^n \sum_a \pi(a|s_{t+n}) q(s_{t+n}, a)$

Note: we can use ordinary importance sampling, per-decision importance sampling or weighted importance sampling, simply by adjusting $G_{t:t+n}$ accordingly.
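A per-decision version of the off-policy n-step target can be sketched as below. It follows the counting in the text: $a_t$ is given, so the ratios cover only the sampled actions $a_{t+1}, a_{t+2}, \dots$; the function name and argument layout are assumptions made for this example.

```python
def off_policy_n_step_target(rewards, ratios, gamma, bootstrap, expected=False):
    """Per-decision importance-weighted n-step target.

    rewards[i]: r_{t+i+1} for i = 0..n-1.
    ratios[j]:  pi/b for the sampled action a_{t+1+j}; a_t is given, so it has no ratio.
                SARSA needs n ratios, expected SARSA only n-1.
    bootstrap:  q(s_{t+n}, a_{t+n}) for SARSA, or sum_a pi(a|s_{t+n}) q(s_{t+n}, a)
                for expected SARSA.
    """
    n = len(rewards)
    target, rho = 0.0, 1.0  # rho starts as the empty product
    for i in range(n):
        target += rho * gamma**i * rewards[i]  # here rho covers a_{t+1}..a_{t+i}
        if i < n - 1 or not expected:
            rho *= ratios[i]  # extend the product by the next sampled action
    return target + rho * gamma**n * bootstrap


print(off_policy_n_step_target([1.0, 1.0], [2.0, 0.5], 0.5, 4.0))             # 3.0
print(off_policy_n_step_target([1.0, 1.0], [2.0], 0.5, 3.0, expected=True))   # 3.5
```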

Application to Experience Replay

Algorithms that use off-policy data usually use experience replay, since old data can be reused (unlike on-policy methods, which need fresh experience only and therefore rarely use experience replay).

Note: DQN and Q-learning in general do not need to multiply by the importance sampling ratio, because the Q-learning update does not care which policy the experience came from. Q-learning is therefore off-policy by nature. This is not true for n-step Q-learning (and n-step algorithms in general).

Importance sampling requires knowing what the data-collecting policy looked like, i.e. the value of $b(a \vert s)$, because it goes into the denominator. So besides state, action and reward, the experience replay also has to store the value of $b(a \vert s)$ at the time the data was collected.
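A replay buffer that keeps the behavior probability next to each transition could look like the sketch below. The class, field and function names are made up for illustration and do not come from any particular library.

```python
import random
from collections import deque, namedtuple

# behavior_prob is b(a|s) at the time the transition was collected.
Transition = namedtuple("Transition", "state action reward next_state behavior_prob")


class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, behavior_prob):
        self.buffer.append(Transition(state, action, reward, next_state, behavior_prob))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)


def importance_ratio(transition, pi_prob):
    """pi(a|s) / b(a|s), with b(a|s) read back from the stored transition."""
    return pi_prob / transition.behavior_prob


replay = ReplayBuffer(capacity=2)
replay.add("s0", 0, 1.0, "s1", behavior_prob=0.5)
replay.add("s1", 1, 0.0, "s2", behavior_prob=0.25)
replay.add("s2", 0, 2.0, "s3", behavior_prob=0.5)  # pushes out the oldest entry
batch = replay.sample(2)
print([importance_ratio(tr, pi_prob=1.0) for tr in batch])
```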

References

Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.

Hernandez-Garcia, J. F., & Sutton, R. S. (2019). Understanding Multi-Step Deep Reinforcement Learning: A Systematic Study of the DQN Target. Retrieved from http://arxiv.org/abs/1901.07510

Precup, D., Sutton, R. S., & Singh, S. (2000). Eligibility Traces for Off-Policy Policy Evaluation.

Academic Machine Learning Development Process

2019-01-23T00:00:00+07:00

Update (3 May 2019): This is outdated. VSCode team just released a set of extensions which works better in this scenario. Please visit https://code.visualstudio.com/blogs/2019/05/02/remote-development.

Machine learning projects nowadays cannot easily fit into a laptop. While we optimize laptops for maximum portability, they do not usually come with a powerful GPU. The conventional wisdom is to buy our own machine learning rig with powerful enough GPU(s). This splits our setup into the development machine, where we actually write code, and the running machine, where we have all the libraries and resources. Of course, this scenario does not include a GPU farm; if you have one, another layer of considerations must be in place.

Splitting the development machine from the running machine gives rise to a lot of headaches, because development demands low latency: iterations are fast. How could we manage that? Worse yet, while coding we usually want access to all our development tools, including but not limited to code suggestion and auto-completion. If all the libraries reside on the running machine, how could we get precise code suggestions?

I think Jupyterlab (and all other IDEs with a client-side UI) tries to solve the right problem. It is a problem developers have not conventionally faced; it is just beginning to emerge and our tools are not up to it yet. Jupyterlab is quite limited by itself. The project began as a language-agnostic platform, so it does not provide any language-specific code suggestion or completion. According to the project philosophy, this should be done via extensions. The problem is rather that the project is still very young and no such extension exists. This renders the tool only good enough for small code bases, where interactivity, its main selling point, outshines its flaws.

We now move on to other candidates where the core resides on the running machine but the UI is on the client side. The solution I found is quite unexpected: Visual Studio Code. I don’t think Code was designed to be a server-side IDE at all (only recently does it seem to have received enough requests for that), but since the introduction of its Live Share feature, it has the capability to act as one.

Live Share solves our problem in the sense that we can remote into our running machine and edit the code there while keeping the laptop’s portability. If all operations are done on the server side (e.g. file operations, debugging, versioning), leaving only the visuals and controls to be transmitted over the wire, the cost of being remote should be rather small since these operations are not latency sensitive. If the dynamics of the GUI are known on the client side, there should be no delay between typing and seeing the characters typed.

Live Share is still in beta, though, and only recently reached usable quality. It still has a long way to go, e.g. versioning is not fully supported from the client, but we can still use a separate terminal for that, which is not that hard.

To run Live Share we need to run Code, and to run Code we need a desktop environment. The fact that Code does not support a headless mode is still a caveat of this solution, but for me it is a minor one since I can just install a VNC server of some kind (I personally use TurboVNC) and run Code from there.

Visual Studio Code now seems to focus on the machine learning community as well, as you can see from its powerful Python extension and its IntelliCode extension, which supports many deep learning frameworks.

Moving to Jekyll

2019-01-23T00:00:00+07:00

It has been more than a year without an update to the blog. That is because I quit being a developer and became a machine learning researcher. The blog, which was initially conceived for jotting down my development struggles, seemed unfit for my new direction.

Now I aim to revive the blog with a new goal under new requirements. The contents before were mostly code snippets, but now I have to write from an academic perspective, and the blog may need better support for equations. I take this opportunity as a fresh start to try out Jekyll, a static-site blog platform.

Currently, I have migrated contents from the old blog to here.

Install Canon MP280 Driver on Ubuntu 18.04

2018-07-10T00:00:00+07:00

You need libtiff4, which cannot be installed via apt. Download it from here (direct link: http://old-releases.ubuntu.com/ubuntu/pool/universe/t/tiff3/libtiff4_3.9.7-2ubuntu1_amd64.deb).

Install the libtiff4 package:

dpkg -i libtiff4_3.9.7-2ubuntu1_amd64.deb

You also need libpng12-0, which cannot be found via apt-get either. Download it from here (direct link: http://mirrors.kernel.org/ubuntu/pool/main/libp/libpng/libpng12-0_1.2.54-1ubuntu1_amd64.deb).

Install the libpng12-0 package:

dpkg -i libpng12-0_1.2.54-1ubuntu1_amd64.deb

Now, install the dependencies of the driver:

apt install libatk1.0-0 libgtk2.0-0 libpango1.0-0

You can download the driver from http://support-in.canon-asia.com/contents/IN/EN/0100301402.html.

After extracting the archive, you will see:

packages
resources
install.sh

Just go to the packages directory, where you will see:

cnijfilter-common_3.40-1_amd64.deb
cnijfilter-common_3.40-1_i386.deb
cnijfilter-mp280series_3.40-1_amd64.deb
cnijfilter-mp280series_3.40-1_i386.deb

I will assume that you use the amd64 architecture. Go ahead and install both the common and the mp280series packages:

dpkg -i cnijfilter-common_3.40-1_amd64.deb
dpkg -i cnijfilter-mp280series_3.40-1_amd64.deb

That should be all!
