DeepMind built a “metaverse” to keep its AI fighting and leveling up.

Spark Global Limited reports:

DeepMind has a little surprise for us.

As we all know, reinforcement learning generalizes poorly and often has to learn each individual task from scratch.

Although AlphaZero, which DeepMind developed earlier, can play Go, chess and shogi (Japanese chess), it has to be trained from scratch for each board game.

Poor generalization is also a major reason why AI has been mocked as “artificial stupidity”. One of the strengths of human intelligence is that it can learn from earlier experience and adapt quickly to a new environment. For example, you would not stare helplessly at a yuanyang (split) hotpot just because it is your first time eating Sichuan food: you have eaten Chaoshan hotpot, and it all comes down to cooking things in the broth.

However, generalization doesn’t happen overnight. When we play games, we start with simple tasks and then work our way up to more complex ones. In Hollow Knight, you start by walking around and slashing monsters, but in the nightmarishly difficult “Path of Pain” level, you cannot make any progress without the skills you have accumulated along the way.


A multitask metaverse

DeepMind is using this “curriculum learning” approach, allowing agents to learn in an ever-expanding and escalating open world. In other words, the AI’s new tasks (its training data) are constantly generated from the old tasks.

In this world, agents can train on tasks as simple as “approach the purple cube” or as complex as “approach the purple cube or place the yellow sphere on the red floor,” and can even play against other agents, as in hide-and-seek (“find the opponent without being found”).

Each small game exists in a small corner of the world, and thousands of small corners are pieced together into a huge physics simulation world, such as the geometric “Earth” shown below.

In general, a task in this world is made up of three elements, task = game + world + player, and the complexity of a task is determined by the relationship between the three.
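The three-part decomposition can be written down as a small data structure. This is only an illustrative sketch: the class name, the field names and the string identifiers are assumptions made for the example, not DeepMind's actual representation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Task:
    """A task as the article describes it: game + world + player(s)."""
    game: str                 # the goal, e.g. "approach the purple cube"
    world: str                # hypothetical identifier of a simulated world
    players: Tuple[str, ...]  # hypothetical names of the agents taking part


def is_multiplayer(task: Task) -> bool:
    """A task is multiplayer when more than one agent takes part."""
    return len(task.players) > 1


# The same world can host different games with different players,
# and each combination counts as a different task.
solo = Task(game="approach the purple cube",
            world="world-0",
            players=("agent",))
hide_and_seek = Task(game="find the opponent without being found",
                     world="world-0",
                     players=("agent", "opponent"))
```

Changing any one of the three fields yields a new task, which is one way to see how a modest number of games and worlds can combine into a very large task space.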

Complexity is measured in four dimensions: competition, balance, options, and difficulty of exploration.

For example, in the game “Block Grab,” the blue agent needs to put the yellow block in the white area, and the red agent needs to put the yellow block in the blue area. These two goals contradict each other, so competition is high. At the same time, both sides start under equal conditions, so balance is relatively high. Because the goal is simple, there are few options. DeepMind rated the difficulty of exploration here as above medium, probably because the target areas sit in a more complex scene.

As another example, in the “Ball likes to play with blocks” game, the blue and red agents share a common goal: placing spheres and blocks of the same color close to each other.

Here competition is naturally very low, and balance is undeniably high. There are many more options than in the previous game. As for the difficulty of exploration, there is no target area, so the agent can place the spheres and blocks wherever it wants, which makes exploration easier.

Based on these four dimensions, DeepMind has built a supersized metaverse of task space, of which the geometric Earth is only a small corner, confined to a single point in this four-dimensional task space. This “metaverse”, which DeepMind calls XLand, contains billions of tasks.

Viewed as a whole, XLand consists of a series of games, each of which can be played in many different simulated worlds whose topologies and features change smoothly.


Lifelong learning

Once you have the data, you have to find the right algorithm. DeepMind found that the Goal Attention Network (GOAT) can learn more general policies.

Specifically, the agent’s input includes first-person RGB images, proprioception, and the goal. After initial processing, an intermediate output is generated and passed to the GOAT module, which attends to specific parts of that intermediate output according to the agent’s current goal and carries out logical analysis of the goal.

Logical analysis here means that, for each game, other games can be constructed whose values bound the optimal value of the policy’s value function.

At this point, DeepMind asks a question: what is the best set of tasks for each agent? In other words, what kind of levels turn a player into a “real” master of fighting monsters, rather than one who simply hits for 9999 damage with every swing?

DeepMind’s answer is that each new task is built on top of an old one and is “neither too difficult nor too easy”. In fact, this is exactly what makes learning exciting.

Tasks that are too difficult or too easy at the start of training may encourage early learning, but can lead to learning saturation or stagnation later in training.

In fact, we do not require agents to be good at one task, but rather encourage lifelong learning, that is, constantly adapting to new tasks.

But “too difficult” and “too easy” are rather vague descriptions. What we need is a quantitative way to make an elastic connection between the new task and the old task.
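One simple way to make “neither too difficult nor too easy” quantitative is to estimate how often the agent currently succeeds at a candidate task and keep only tasks whose success rate falls inside a band. The sketch below is an assumption-laden illustration, not DeepMind's published criterion: the function names, the thresholds `low` and `high`, and the `play_episode` callback are all hypothetical.

```python
def estimate_success_rate(play_episode, task, episodes=20):
    """Fraction of episodes in which the agent reaches the task's goal.

    `play_episode(task)` is a hypothetical callback that runs one episode
    and returns True on success.
    """
    wins = sum(1 for _ in range(episodes) if play_episode(task))
    return wins / episodes


def is_useful_task(success_rate, low=0.1, high=0.9):
    """Keep a task only if it is neither too hard nor too easy.

    The thresholds are illustrative assumptions.
    """
    return low <= success_rate <= high


def filter_tasks(play_episode, candidate_tasks, episodes=20):
    """Return the candidate tasks that are currently worth training on."""
    return [
        task for task in candidate_tasks
        if is_useful_task(estimate_success_rate(play_episode, task, episodes))
    ]
```

Because the success rate is re-estimated as the agent improves, a task that was once too hard can enter the training set later, and a task the agent has mastered drops out.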

How do we keep an agent from “dying” on a new task it cannot adapt to? Evolutionary learning offers great flexibility. In general, the new task and the old task are run simultaneously, and multiple agents “compete” on each task. Agents that are well adapted to the old task are selected to continue learning on the new task.

In the new task, the weights, current task distribution and hyperparameters of the agents that excelled in the old task are copied and take part in a new round of “competition”.
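The copy step described above resembles population-based training. The following is a minimal sketch under stated assumptions: the `Agent` fields mirror the three things the article says are copied, while the `fitness` score, the `evolve` function and the random perturbation of hyperparameters are hypothetical additions for illustration.

```python
import copy
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Agent:
    weights: List[float]                # network parameters
    task_distribution: Dict[str, float] # how often each task is sampled
    hyperparameters: Dict[str, float]   # e.g. learning rate
    fitness: float = 0.0                # hypothetical performance score


def evolve(population: List[Agent], rng: random.Random,
           perturb: float = 0.2) -> List[Agent]:
    """Overwrite the weakest agent with a copy of the strongest one.

    Weights and task distribution are copied unchanged. Hyperparameters
    are copied and then scaled by a random factor (an assumption of this
    sketch) so that the copy can explore a slightly different setting.
    """
    ranked = sorted(population, key=lambda a: a.fitness, reverse=True)
    best, worst = ranked[0], ranked[-1]
    worst.weights = copy.deepcopy(best.weights)
    worst.task_distribution = copy.deepcopy(best.task_distribution)
    worst.hyperparameters = {
        name: value * rng.uniform(1 - perturb, 1 + perturb)
        for name, value in best.hyperparameters.items()
    }
    return population
```

Only the weakest agent is overwritten here, so the rest of the population keeps its own weights and task distribution, which preserves diversity between rounds.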

In addition to the good agents from the old tasks, many newcomers also take part, which introduces randomness, innovation and flexibility, and means there is no need to fear sudden death.

Of course, there isn’t just one good agent per task. Because tasks are also constantly generated and dynamically changing, a single task can train agents with different strengths that evolve in different directions (depending on the relative performance and robustness of the agents).

Eventually, each agent ends up with a different set of tasks it is good at, much like the “contention of a hundred schools of thought” in the Spring and Autumn and Warring States periods. Calling this “fighting monsters and leveling up” undersells it: it is practically a simulation of the Earth.

“The iterative nature of this combinatorial learning system, which optimizes not bounded performance metrics but the general-purpose range of capabilities defined by the iteration, allows the agent to learn in an open way, limited only by the environment space and the agent’s neural network expressive capabilities,” DeepMind said.


First signs of intelligence

In the end, what kind of excellent species do the agents that upgrade, evolve and diverge in this complex metaverse become?

According to DeepMind, the agents show remarkable zero-shot abilities, such as using tools, taking detours, counting, cooperating and competing.

Let’s look at a few concrete examples.

First, agents learn to improvise. In this example the agent is given three goals:

1. Place the black pyramid next to the yellow sphere;

2. Place the purple sphere next to the yellow pyramid;

3. Place the black pyramid on the orange floor.

The AI initially finds a black pyramid and tries to take it to the orange floor (goal 3), but while moving it, it catches sight of a yellow sphere and changes its mind: “I can achieve goal 1.” It then places the black pyramid next to the yellow sphere.
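The agent's switch from goal 3 to goal 1 makes sense if the three goals are alternatives, so that satisfying any one of them is enough. That reading is an assumption based on the behavior described above. Under it, the goal check can be sketched as follows; the `State` representation and the function names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical world state: maps an object to whatever it is next to or on.
State = Dict[str, str]
Option = Callable[[State], bool]

# One predicate per goal from the list above.
options: List[Option] = [
    lambda s: s.get("black pyramid") == "yellow sphere",   # goal 1
    lambda s: s.get("purple sphere") == "yellow pyramid",  # goal 2
    lambda s: s.get("black pyramid") == "orange floor",    # goal 3
]


def satisfied_options(state: State) -> List[int]:
    """Return the 1-based numbers of the goals that hold in this state."""
    return [i for i, holds in enumerate(options, start=1) if holds(state)]


def goal_reached(state: State) -> bool:
    """Assumes the goals are alternatives: any single one is enough."""
    return len(satisfied_options(state)) > 0
```

With this structure, an agent carrying the black pyramid toward the orange floor loses nothing by stopping at the yellow sphere instead, since either placement satisfies the overall goal.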

The second example: how can you get to the purple pyramid on the high platform if you don’t know how to jump?

In this task, the agent needs to find a way past the obstacles to reach the purple pyramid on the high platform. There are no steps, ramps or similar paths up to the platform.

Unable to get up there, the frustrated agent “flips the table” and knocks down several of the upright slabs around it. Then, as luck would have it, a black slab lands against the edge of the platform: “Oh, wait, isn’t that just what I need?”

Whether this process reflects the agent’s intelligence is not yet certain; it may just be a stroke of luck. So let’s look at the statistics.

After five generations of training, agents had played around 700,000 unique games in 4,000 unique worlds within XLand, gaining experience from 3.4 million unique tasks, with each agent in the final generation going through 200 billion training steps.

Currently, agents can successfully participate in almost every evaluation task, except for a few tasks that even humans cannot perform.

The DeepMind study may partly illustrate the importance of “intensive learning”: not only must the amount of data be large, the number of tasks must be large as well. This is also what allows agents to generalize well. For example, the data show that agents can quickly adapt to new complex tasks after 30 minutes of focused training, while agents trained from scratch with reinforcement learning cannot learn these tasks at all.

In the years to come, we can expect this metaverse to become even more complex and dynamic, as AI continues to evolve and give us amazing (and terrifying) experiences.