C Our approach to artificial general intelligence - OCCAM

Our approach to artificial general intelligence

This summary contains a collection of ideas that we have developed over the last years concerning the creation of a system exhibiting artificial general intelligence (AGI). Although we came up with them on our own, most if not all are not new and spread all over the literature.

The notion “general” is to be emphasized here. Unfortunately, after early unsuccessful attempts research in artificial intelligence (AI) has moved its focus on solving narrowly defined problems and tasks, which became known as “narrow AI” (Kurzweil, 2005): world level in chess, jeopardy, backgammon, self-driving cars, talking personal assistants and a myriad of other commercial applications. Although those are impressive achievements and the usefulness of such applications is beyond any doubt, a system that exhibits general intelligence seems still to be far away.

After compiling a large set of definitions in the literature Legg et al. (2007) came up with a definition of intelligence that is consistent with most other attempts:

“Intelligence measures an agent’s ability to achieve goals in a wide range of environments.”

This is exactly, what narrow AI does not achieve: it is programmed for a very specific well-defined set of environments. Any deviation from that narrow set most probably leads to failure of the system.

Conversely, humans are usually able to solve all sorts of tasks in very diverse environments. Moreover, neuroscientific evidence teaches us, that brains are able to process data cross-modally, e.g. by transforming visual data to auditory or tactile stimuli in sensory substitution devices. It is also known that in newborn ferrets neurons in the auditory cortex adopt characteristics of visual cells, if fed with stimuli from the visual pathway (Sur et al., 1988). Those observations point to the hypothesis that the human brain is a general processor of quite diversely structured data.

This idea is, of course, not new and is around at least since Simon and Newell’s General Problem Solver developed in 1957. Although the problem is far from being solved practically, Hutter (2005) has developed a mathematical formulation and theoretical solution to the universal AGI problem, called AIXI. Even though AIXI is incomputable, a lot can be learned from the formulation and general thrust of research. The basic idea is the following. An AGI agent receives input data from its sensors and picks an action at every time step while trying to maximized reward. All data can be expressed as a binary sequence. In order to act successfully, sequences have to be predicted, which is achieved through Solomonoff’s universal theory of induction. Solomonoff derived an optimal way of predicting future data, given previous observations, provided the data is sampled from a computable probability distribution. In a nutshell, Hutter defines AIXI by espousing the Bellman equation of reinforcement learning to Solomonoff’s sequence prediction.

Solomonoff (1964); Solomonoff (1978) has defined his famous universal prior that assigns a prior probability (a semimeasure to be precise) to every sequence,

\( M(x)\equiv\sum_{p:U(p)=x}2^{-\text{|}p|}\)

where the sum is over all halting programs \(p\) of length \(|p|\) for which the universal prefix Turing machine \(U\) outputs the sequence \(x\). The universal prior exhibits an Occam bias: by far the most probability mass is captured by short explanations (programs) for an observation \(x\). Impressively, Solomonoff has proved that this prior correctly predicts any computable sequence: \(M(x_{t}|x_{1},\ldots,x_{t-1})\rightarrow1\) as \(t\rightarrow\infty\), where \(x_{i}\) denotes the \(i\)th sequence entry. In essence, we learn that if we are able to find short programs for arbitrary sequences the problem of universal inference is provably solved. Intuitively, the scientific method itself is about the search of simple (short) explanations of phenomena. Arguably, it is a formal and institutionalized reasoning method, but people, even infants, seem use it in more simple everyday situations (Gopnik et al., 1999). If understanding the world means to compress sensory data, we need a general data compressor.

Unfortunately, Solomonoff induction is not computable. Therefore, Hutter and colleagues have developed approximations to AIXI, e.g. a Monte-Carlo approximation that uses prediction suffix trees that enable predicting binary D-order Markov sequences (Veness et al., 2011). This is an impressive achievement leading to a single system being able to play various games (Pac-Man, Kuhn poker, TicTacToe, biased rock-paper-scissors, 1d-maze, cheese maze, tiger and extended tiger) without having specifically been programmed for them — a notable step towards generality of AI. In spite of that, it seems questionable whether this approximation can be extended any further beyond Markov sequences, since a well-known computational problem awaits: the curse of dimensionality. We will come back to that later.

It may seem not intuitive that data compression plus reinforcement learning can lead to the solution of such diverse and non-trivial tasks. Traditionally, one may suspect that various cognitive processes must be involved in the solution of such tasks. Hutter shows how data compression implicitly incorporates those processes. It may be objected that simple deep search of a chess program also replaces all sorts of reasoning processes that presumably go on inside a human chess player’s brain. However, AIXI is not a short-cut narrow-AI-like solution, but provably a genuinely general approach. This has convinced us that general data compression is the way to go if we head for general intelligence.

Approaching general data compression

Simple but general

Given the form of the universal prior one may consider universal search. For example, Levin search executes all possible programs, starting with the shortest,

until one of them generates the required sequence. Although general, it is not surprising that it is a computationally costly approach and rarely applicable in practice.

On the other side of the spectrum, we have non-general but computationally tractable approaches: common AI algorithms and machine learning techniques. Why could they not be generalized? The problem that all those techniques face at some point is known as the curse of dimensionality. Considering the (algorithmic) complexity and diversity of tasks solved by typical today’s algorithms, we observe that most if not all will be highly specific and many will be able to solve quite complex tasks. Algorithms from the field of data compression are no exception. For example, the celebrated Lempel-Ziv compression algorithm (see e.g. Cover et al., 2012) handles stationary sequences but fails at compressing a simple non-stationary sequence efficiently. AI algorithms undoubtedly exhibit some intelligence, but when comparing them to humans, a striking difference comes to mind: the tasks solvable by humans seem to be much less complex albeit very diverse. After all, it is very hard for humans to perform depth search in chess 10 moves ahead or learn the transition probabilities of a variable-order stochastic Markov process, while they can do both to some extent. For example, fitting the latter is performed by Hutter’s Monte-Carlo AIXI approximation. Although Hutter has found a general, but incomputable solution to the AGI problem, in the Monte-Carlo approximation he uses again a narrow-AI-like approach. Others try to fill the task space by “gluing together” various narrow algorithms that would, hopefully, synergistically cancel each other’s combinatorial explosions (Goertzel, 2009).

In a nutshell, we suggest that we should not try to beat the curse of dimensionality mercilessly awaiting us at high complexities, but instead head for general algorithms at low complexity levels and fill the task cup from the bottom up.

Given that we have set the goal to compress general but simple data sets, the question arises whether the algorithm that performs that task can expected to be complex or rather simple as well. From the point of view of “narrow AI” the programmer has to anticipate exhaustively all data situations that his algorithm could possibly be exposed to, which would otherwise lead to bugs. Such an approach naturally leads to very complex and still not general algorithms. However, as mentioned earlier, it is the very hallmark of generality that the algorithm itself is required to be able to deal with the whole variability of data situations. Does it mean that the general AI algorithm could actually be quite simple itself?

A biological argument points in that direction (Berglas, 2008). Human intelligence must ultimately be encoded in the DNA. The human DNA consists of only 3 billion base pairs. Since there are four bases (A, C, T and G), one base carries the information of 2 bits. Therefore, the amount of information encoded in the DNA is merely \(3\cdot10^{9}\cdot2/8/1024^{2}=715\) megabytes. It fits on a single Compact Disk and is much smaller than substantial pieces of non-intelligent software such as Microsoft Vista, Office, or the Oracle database.

“Further”, Berglas writes, “only about 1.5% of the DNA actually encodes genes [although it is currently debated whether the rest is just redundant repetitive junk]. Of the gene producing portions of DNA, only a small proportion appears to have anything to do with intelligence (say 10%). The difference between Chimpanzee DNA and man is only about 1% of gene encoding regions, 5% non-gene. Much of this can be attributed to non-intelligent related issues such as the quickly changing immune system and human’s very weak sense of smell. So the difference in the “software” between humans and chimpanzees might be as little as \(715\cdot10\%\cdot1.5\%\cdot1\%=11\) kilobytes of real data.” Of course, we are dealing with a quite compact representation and Berglas may be wrong about one or two orders of magnitude, but hardly more. “In computer software terms even 1.0 megabytes is tiny.”

We therefore conclude that the algorithm for general intelligence, at least as general as human intelligence, must be simple compared to modern software. We are facing a software problem, not a memory problem.

Incremental and hierarchical compression

In accordance with this general thrust of research we have developed a theory of incremental compression, which can find short representations efficiently for arbitrary strings that can be generated by a composition of functions (features), \(x = f_1\circ\cdots\circ f_k\). It is probably fair to say that the overwhelming fraction of data occurring naturally can be generated in this fashion. Similarly, exploiting locality of correlations (power law correlations functions), that are usually present in natural data, can lead to even more efficient compression algorithms, which is work in progress.

Grounded reasoning

Although general data compression seems to be a central ingredient for AGI, several other important issues like language, memory, reasoning, commonsense knowledge, resourcefulness, brittleness and many others keep staring at the researcher intimidatingly. In the following, far from claiming to have solved anything, we will introduce my ideas about some of them and highlight the way, general data compression bears on them. Suppose, general compression works. What then? One of the most burning questions in AI is the problem that AI systems do not really know anything about the world, commonsense knowledge possessed by any 3-year-old. As Marvin Minksy has put it, “no program today can look around a room and then identify the things that meet its eyes” (Minsky, 2011). The problem is not just about identification but about being able to understand and describe the objects and knowing about their function. For more on this topic the reader is referred to our position paper where we argue why grounded reasoning is an important step towards commonsense reasoning an thereby towards AGI and how it densely and naturally interacts with general compression.

References

  1. Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
  2. Alison Gopnik, Andrew N Meltzoff, and Patricia K Kuhl. The scientist in the crib: Minds, brains, and how children learn. William Morrow & Co, 1999.
  3. Marcus Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005. ISBN 3-540-22139-5. doi: 10. 1007/b138233. URL http://www.hutter1.net/ai/uaibook.htm. 300 pages, http://www.hutter1.net/ai/uaibook.htm
  4. Ray Kurzweil. The singularity is near: When humans transcend biology. Penguin, 2005.
  5. Shane Legg and Marcus Hutter. A collection of definitions of intelligence. In B. Goertzel and P. Wang, editors, Advances in Artificial General Intelligence: Concepts, Architectures and Algorithms, volume 157 of Frontiers in Artificial Intelligence and Applications, pages 17–24, Amsterdam, NL, 2007. IOS Press. ISBN 978-1-58603-758-1. URL http://arxiv.org/abs/0706.3639.
  6. Marvin Minsky. Interior grounding, reflection, and self-consciousness. Information and Computation, pages 287-305, 2011.
  7. Ray J Solomonoff. A formal theory of inductive inference. part i. Information and control, 7(1):1–22, 1964.
  8. Ray Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. Information Theory, IEEE Transactions on, 24(4):422–432,1978.
  9. Mriganka Sur, Preston E Garraghty, and Anna W Roe. Experimentally induced visual projections into auditory thalamus and cortex. Science, 242(4884): 1437–1441, 1988.
  10. Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011. ISSN 1076-9757. doi: 10.1613/jair.3125. URL http:// arxiv.org/abs/0909.0801.