So there’s a new paper from Nvidia that you might wanna see.
They got an AI to play Minecraft.
That might sound like a novelty, but after reading this paper, I'm kind of blown away.
The implications of this work are considerably bigger than I first realized. Let's start with the results, but stick with me, because how they achieved those results is, in my view, much more impactful. That's especially true as we start to see AI writing and improving its own code.
If you're intrigued by the future of software development and the role AI is slated to play in it, this development will definitely capture your interest. The team used GPT-4 as a reasoning engine, but as you'll soon see, it isn't directly wired into Minecraft itself.
It’s crucial to remember that GPT-4 does not possess vision capabilities. Even though vision is in the pipeline, it has yet to be rolled out. Hence, GPT-4 cannot visually perceive the screen or understand what’s transpiring. Keep this in mind as we delve deeper into this fascinating topic.
You can find related links in the description below, allowing you to navigate through different sections. And remember, subscribing to this channel enhances your intellectual prowess. So don’t miss out on this opportunity.
Let’s turn our attention to this paper, titled ‘Voyager: An Open-Ended Embodied Agent with Large Language Models’. I first stumbled upon it when Dr. Jim Fan, one of the researchers involved in the paper, posted about it on Twitter. Voyager is the name of the AI agent, and Dr. Fan mentions that Voyager continually improves itself by writing, refining, committing, and retrieving code from a skill library.
Furthermore, all our resources are open-source. So if you’re keen to experiment with this on your own, you can download Voyager and immerse it in the world of Minecraft, or any other environment of your choice. You might be interested to learn about some of the skills it can perform in action.
Let’s quickly delve into an introduction of Voyager. This AI agent represents the first Large Language Model (LLM) powered, embodied lifelong learning entity in the realm of Minecraft. Voyager continually explores the world, acquiring an array of skills and making novel discoveries autonomously, without any human intervention.
From an empirical standpoint, Voyager demonstrates a robust in-context, lifelong learning capability, and shows exceptional proficiency in Minecraft gameplay. It acquires 3.3 times more unique items, traverses 2.3 times longer distances, and unlocks key tech tree milestones up to 15.3 times faster than its predecessor models.
In addition, Voyager has the ability to utilize its learned skill library in a new Minecraft world, thereby solving novel tasks from scratch. This capability stands out as other techniques typically struggle to generalize. This point, as you’ll see shortly, is quite significant.
Now, let’s discuss the Minecraft tech tree. If you’re unfamiliar with Minecraft, the game entails exploring a randomly generated world. Each world is unique and deeply complex, packed with numerous technologies and mysteries to unearth. Being a sandbox game, it provides the freedom to engage in a multitude of activities, including diverse interactions with animals, exploration of cave systems, and adaptation to day-night cycles. The world’s complexity is amplified by its 3D nature, which poses an added challenge for AI navigation.
The game also incorporates realistic survival mechanisms like hunger, necessitating players to eat to survive. Additional elements like health and breath mechanics (required for underwater exploration), and the need for light sources in dark places like dungeons, contribute to the game’s intricacy.
The game progresses by first collecting basic materials to create elementary tools. Gradually, the tools advance, moving from stone to iron and beyond, as the player ascends the tech tree and crafts increasingly advanced items.
Coming back to Voyager, it comprises three key components: an automatic curriculum designed for open-ended exploration, a skill library that facilitates increasingly complex behaviors, and an iterative prompting mechanism that employs code as an action space.
At first glance, I found the integration of these components a bit confounding. So, let’s take a moment to step back and delve into how these components collaboratively function.
Firstly, let's discuss the Mineflayer API, a JavaScript library that lets developers drop a bot into a Minecraft game and perform a variety of tasks. This includes automating movement, triggering various in-game events, and essentially interacting with the game world in numerous ways.
Mineflayer is a standalone tool for interacting with Minecraft; it doesn't require any AI or GPT-4 to function. It's simply a library. What the researchers did is have GPT-4 issue a multitude of these commands to control the player in the game world.
For instance, if we want GPT-4 to mine something in the game, it invokes a specific function for that. Similarly, if we want it to place a crafting table near the player, GPT-4 triggers a script to execute this action, and the in-game character proceeds to do just that.
Usually, a human player would use a keyboard and mouse to interact with the game, but GPT-4 uses these commands. It doesn’t have any special abilities; it is constrained by the same rules and physics that any typical player would encounter.
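To make this concrete, here's a minimal toy sketch in Python of what command-based control looks like. The class, method names, and item names are all hypothetical stand-ins, loosely modeled on the idea of Mineflayer-style primitives; this is not the real Mineflayer interface, which is a JavaScript library.

```python
# Toy stand-in for command-based control of an in-game character.
# ToyBot, mine_block, and place_item are illustrative names only.

class ToyBot:
    """A mock in-game character whose only state is an inventory."""

    def __init__(self):
        self.inventory = {}

    def mine_block(self, block: str, count: int = 1):
        # In the real game this would pathfind to the block and dig it;
        # here we just add the drops to the inventory.
        self.inventory[block] = self.inventory.get(block, 0) + count

    def place_item(self, item: str):
        # Placing consumes one item from the inventory.
        if self.inventory.get(item, 0) < 1:
            raise ValueError(f"no {item} in inventory")
        self.inventory[item] -= 1

bot = ToyBot()
bot.mine_block("oak_log", 3)     # "mine something" becomes a function call
bot.place_item("oak_log")        # "place a block" becomes another call
print(bot.inventory["oak_log"])  # 2
```

The point is that every action a human would perform with keyboard and mouse is exposed to the model as a callable function, subject to the same game rules.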
So far, this makes sense, right? We have GPT-4, Minecraft, and a mechanism for GPT-4 to control Minecraft. Notably, GPT-4 perceives the world of Minecraft not through vision – as we’ve mentioned before – but through prompts.
Voyager receives regular updates about its surroundings. On the left side, you can see the kind of information it is provided with, like an inventory report: "You have these items in your inventory. What do you want to do next?" GPT-4 then reasons, "Since you possess a wooden pickaxe and some stones, it would be advantageous to upgrade your pickaxe to a stone one for better efficiency." Subsequently, it generates a task for itself: "Craft one stone pickaxe."
Voyager processes the information about its environment, contemplates its next move, and then self-assigns a task. Here, a ‘chain-of-thought’ reasoning approach is employed to help it think through the steps and then execute them.
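The observe-then-decide step above can be sketched as assembling a text prompt from the bot's state. Note that the field names and wording here are illustrative assumptions, not the exact prompt format Voyager uses.

```python
# Sketch of assembling an observation prompt from game state.
# Field names and phrasing are hypothetical, not Voyager's real format.

def build_observation_prompt(inventory, nearby, time_of_day):
    inv = ", ".join(f"{n}x {item}" for item, n in inventory.items()) or "empty"
    lines = [
        f"Inventory: {inv}",
        f"Nearby: {', '.join(nearby) or 'nothing notable'}",
        f"Time of day: {time_of_day}",
        "Given this state, what task should you do next? "
        "Answer with one concrete task.",
    ]
    return "\n".join(lines)

prompt = build_observation_prompt(
    {"wooden_pickaxe": 1, "cobblestone": 3},
    ["river", "acacia_tree"],
    "day",
)
print(prompt)
```

A prompt like this is what replaces vision: the model never sees pixels, only a textual summary of the world, and its "chain-of-thought" reasoning happens over that text.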
Let's look at another example. Suppose we inform it, "You're next to a river, and you have a fishing rod. What do you want to do?" What's shown here is a simplified overview; in practice, it is provided with a much more detailed, structured set of information, and the same kind of data is relayed with every prompt. Soon, you'll see precisely what it gets each time. In this instance, they're merely highlighting the factors it considers when making a decision.
In response to being by a river with a fishing rod, it decides, “We should fish,” and generates the task, “Catch one fish.” If it’s nighttime and a zombie is nearby, it surmises, “We should fight this creature to protect ourselves,” and generates the task, “Kill one zombie.”
So far, everything seems straightforward, right? We provide GPT-4 with some information and tell it what it’s perceiving. Then it reasons and decides its next move based on that information.
Here's where things get more intriguing: Voyager generates code to accomplish its tasks. For example, it creates a function called 'CombatZombie' to fight the zombie. Essentially, this function equips a stone sword for the combat. If a stone sword isn't available, it crafts one and also crafts and equips a shield for additional protection. After replenishing hunger by eating cooked steak, it seeks out a zombie, engages in combat, and eliminates it.
This process turns ‘fighting a zombie’ into a skill, which is then stored in its skill library. Voyager is creating tasks for itself, then crafting skills to fulfill those tasks, and saving those skills in its library for future use.
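A minimal sketch of such a skill library follows. The real Voyager indexes skills by an embedding of their description and retrieves by similarity; as a simplification, this toy version scores skills by word overlap with the task, and the class and skill names are invented for illustration.

```python
# Toy skill library: skills stored as (description, code) pairs,
# retrieved by word overlap with the current task. Voyager itself
# uses embedding similarity; word overlap is a stand-in.

class SkillLibrary:
    def __init__(self):
        self.skills = []  # list of (description, code) pairs

    def add(self, description: str, code: str):
        self.skills.append((description, code))

    def retrieve(self, task: str, k: int = 1):
        # Rank stored skills by how many words they share with the task.
        task_words = set(task.lower().split())
        ranked = sorted(
            self.skills,
            key=lambda s: len(task_words & set(s[0].lower().split())),
            reverse=True,
        )
        return ranked[:k]

lib = SkillLibrary()
lib.add("combat zombie with stone sword", "def combat_zombie(): ...")
lib.add("catch one fish with fishing rod", "def catch_fish(): ...")
best_desc, best_code = lib.retrieve("fight a zombie at night")[0]
print(best_desc)  # combat zombie with stone sword
```

Once 'fighting a zombie' is stored this way, the agent never has to re-derive that code; the next time a similar task comes up, the skill is pulled from the library instead.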
Let’s consider the iterative prompting mechanism. Suppose Voyager comes across Acacia trees and decides that the next logical step would be to craft an axe from the Acacia wood. It generates corresponding code, creating a function called ‘CraftAcaciaAxe’ and everything else needed to execute that task. However, an error message returns stating that there’s no such item as an ‘Acacia Axe’ in Minecraft. Recognizing the error, GPT-4 amends the code to ‘CraftWoodenAxe’ instead of ‘CraftAcaciaAxe’.
Similarly, it might realize that it requires two more planks to craft a particular shovel, and will adjust its code accordingly.
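The error-feedback loop described above can be sketched as: generate code, try to execute it, and feed any error message back into the next prompt. Here the "LLM" is a mock that corrects a known mistake when it sees the error, purely for illustration; all names are hypothetical.

```python
# Sketch of the iterative prompting loop: attempt, capture the error,
# feed it back, retry. mock_llm stands in for GPT-4.

VALID_ITEMS = {"wooden_axe", "stone_pickaxe"}

def try_craft(item):
    if item not in VALID_ITEMS:
        raise ValueError(f"no such item as '{item}' in Minecraft")
    return f"crafted {item}"

def mock_llm(task, error=None):
    # First attempt guesses a wood-specific axe; given the error
    # message, it generalizes to the valid item name.
    if error and "acacia_axe" in error:
        return "wooden_axe"
    return "acacia_axe"

def refine_until_success(task, max_attempts=3):
    error = None
    for _ in range(max_attempts):
        item = mock_llm(task, error)
        try:
            return try_craft(item)
        except ValueError as e:
            error = str(e)  # fed back into the next prompt
    raise RuntimeError("gave up after repeated failures")

print(refine_until_success("craft an axe"))  # crafted wooden_axe
```

The key design choice is that the environment's error messages become part of the next prompt, so each failed attempt carries information forward rather than being thrown away.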