The unifying theme of my research is **learning** in the **temporal** domain.
Many challenging problems cannot be cast as i.i.d., but rather involve
long temporal sequences, possibly with changing dynamics of
the environment and agents that change their policy while learning.
In the passive setting, the goal is to learn how to map observed sequences
to next-step predictions, target sequences or class labels.
In the active setting, the task is generally framed as
reinforcement learning, where the learning agent
takes decisions that directly affect its environment, and the rewards it can obtain.
Useful for both scenarios are function approximators that
can handle input sequences of arbitrary length, and among the
most powerful of them are recurrent neural networks.

Reinforcement learning (RL) concerns itself with learning which actions to take in each sequential state, in an environment which provides positive or negative reward signals (i.e., the proverbial carrot and stick). In the past, I have helped develop an expectation-maximization RL algorithm[2] designed for use in conjunction with recurrent neural networks, and define a taxonomic distinction between ontogenetic and phylogenetic methods[7].

My recent focus has been on **modular RL**:
Most reinforcement learning solutions assume a single, monolithic learning mechanism, whereas
in practice, there are many reasons to decompose the learning task and spread
it among many specialist components[1], that
each learn a limited domain of
expertise; only together can they solve the task as a whole.
The novel modular architecture Mark Ring and I are developing relies on
**Q-error prediction** to choose among its components,
which enables multiple linear controllers to jointly solve highly non-linear
problems[22], and
permits **spatially organizing** RL modules by their
behavior[25] (akin to the motor cortex).
We developed a learning rule that produces
an emerging **temporal coherence**, where sequential actions are grouped together
[28],
simultaneously with their spatial organization.

In addition, I am currently looking into methods that produce **predictive representations**
(based on Sutton's General Value Functions, or '**forecasts**'), evaluating their usefulness
and studying how well they generalize[32],
to determine whether they are a good candidate for a general-purpose representation in RL, and **continual learning** more
specifically (hint: they probably are).

Optimization problems that are too difficult or complex to model directly
can be solved in a 'black box' manner[5].
Our **Natural Evolution Strategies**
(NES[3], see
the dedicated page for details),
is one such family of algorithms, which iteratively update a **search distribution**
from which search points are drawn and evaluated,
which is then updated in the direction of higher expected fitness,
using ascent along the **natural gradient**.
At the BBOB-2012 workshop, NES variants were extensively benchmarked and found to achieve comparably to the
state-of-the-art[W4][W5][W6][W7][W8].
Further recent developments include
an extension of the approach to non-Gaussian
distributions[21],
a variant for multi-objective problems[19],
two algorithm variants that scale linearly with problem dimension[30],
and a convergence proof[27].

Large optimization problems can be solved much more efficiently
whenever it is possible to obtain gradient information
(e.g. when training large, but differentiable neural networks).
The bread-and-butter algorithm in this case is
**stochastic gradient descent** (SGD), for which we are currently developing
algorithmic improvements, like our **adaptive learning rate**
(schedules)[33]
that rely on tracking sample gradient statistics during optimization;
these are particularly helpful if for streaming or non-stationary datasets, and avoid
the hassle of parameter tuning.
This scheme can also be extended to apply to **non-smooth** loss functions, **sparse** gradients,
and mini-batch **parallelization**[31].

Artificial neural networks are a powerful class of non-linear function approximators.
When connections between neurons can be time-delayed, they are known as
**recurrent** neural networks (RNN), which
are state-of-the-art on a range of problems with temporal structure, like time series prediction
or connected handwriting recognition.

One particular type of RNN uses gating units that protect an internal
state for an arbitrary amount of time. These 'long short-term memory' cells (**LSTM**)
capture long-term time dependencies,
a crucial property for RL,
where typical reward structures involve long and variable time lags.
I have used **multi-dimensional** RNNs (that is with more than one
temporal dimension) on games with strong
spatial structure[6][8].
Training RNNs with gradient methods is harder than for feed-forward networks,
such that black-box methods can provide a robust alternative,
despite making use of less information.

Currently, I am investigating how to best combine recurrent architectures with
**deep** networks, and how **convolutions** along the time-axis can reduce the
temporal resolution at the top. In the process, we found improved
stochastic optimization methods that speed up learning in such large networks.

Humans explore the world quite efficiently, going for areas/topics that they do not know well, but where they also expect to be able to learn more about the world. In AI this idea has been formally introduced as artificial curiosity.

We're currently aiming to improve the **exploration strategies** of existing algorithms
based on this idea, namely in the context of costly optimization[20]
(with applications to metamaterials[29]), where evaluations of the objective are too expensive to explore randomly.

**Coherence progress** is a restricted measure of interestingness, that only depends on
compression[23][24].
Because of its applicability to any type of compressor, it allows for an easy and domain-specific
implementation, but at the cost of no longer considering the agent's adaptation.

Recent advances in brain implants and brain-machine interfaces have unlocked vast amounts of temporal data (like voltage readings and spike trains), that are expected to encode a wealth of information (e.g., movement intentions), even if the dependencies are highly non-linear.

Novel **decoding** algorithms are likely to be necessary for
extracting high-dimensional sequences from this type of data.
Working with Bijan Peseran's team,
I am currently applying
recurrent neural networks (with temporal convolutions) to this task,
and using advanced stochastic optimization algorithms to train them.

Games are very appropriate application domains for a variety of machine learning techniques (most notably for reinforcement learning), because they are clean, have controllable complexity and are fun to work with.

In fact, they may even lend themselves to
practical **measures of intelligence**[13][T1][36]
if a suitable game description language is used[35].
A draft implementation of such a language is py-vgdl[34], built on top of pygame.

I'm particularly interested in board games
with **spatial structure** and patterns like Go
(for which used a customized type of multi-dimensional recurrent neural
network[6][8][12]),
or games with complex graph structure like Sokoban, where a modular approach based on artificial economies[1]
was successful.