At long last, the perfect score for arcade classic Ms. Pac-Man has been achieved, though not by a human. Maluuba, a deep learning team acquired by Microsoft in January, has created an AI system that’s learned how to reach the game’s maximum point value of 999,900 on the Atari 2600, using a unique combination of reinforcement learning and a divide-and-conquer method.
AI researchers have a documented penchant for using video games to test machine learning, because such games mimic real-world chaos in a controlled environment better than more static games like chess do. In 2015, Google’s DeepMind AI learned to master 49 Atari games using reinforcement learning, which provides positive or negative feedback each time the AI attempts to solve a problem.
Though AI has conquered a wealth of retro games, Ms. Pac-Man has remained elusive for years, due to the game’s intentional lack of predictability. It turns out to be a toughie for humans as well. Many have tried to reach Ms. Pac-Man’s top score, only coming as close as 266,330 on the Atari 2600 version. The game’s maximum of 999,900, though, has so far been achieved by mortals only via cheats.
Maluuba was able to use AI to beat the game by dividing up responsibilities, breaking the game into bite-sized jobs assigned to more than 150 agents. The team taught the AI using what it calls Hybrid Reward Architecture, a combination of reinforcement learning with a divide-and-conquer method. Individual agents were assigned piecemeal tasks (like finding a specific pellet) and worked in tandem with other agents to achieve greater goals. Maluuba then designated a top agent (Microsoft likens this to a senior manager at a company) that took suggestions from all the agents in order to decide where to move Ms. Pac-Man.
The best results came when individual agents “acted very egotistically” and the top agent focused on what was best for the overall team, taking into account not only how many agents wanted to go in a particular direction, but the importance of that direction. (Example: a few agents wanting to avoid a ghost took priority over a larger number of agents wanting to pursue a pellet.) “There’s this nice interplay,” says Harm van Seijen, a researcher with Maluuba, “between how they have to, on the one hand, cooperate based on the preferences of all the agents, but at the same time each agent cares only about one particular problem. It benefits the whole.”
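To make the top agent's role concrete, here is a minimal Python sketch of one way such an aggregation could work. It is an illustration, not Maluuba's actual code: the names `AgentVote` and `choose_action`, the four-move action set, and the specific numbers are all hypothetical, and the choice to simply sum each agent's signed preference per direction is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical action set for this sketch.
ACTIONS = ("up", "down", "left", "right")


@dataclass
class AgentVote:
    """One agent's preferences: a signed score per direction (assumed format)."""
    name: str
    values: Dict[str, float]


def choose_action(votes: List[AgentVote]) -> Tuple[str, Dict[str, float]]:
    """Sum every agent's score per direction and pick the highest total.

    Summing signed scores means intensity matters, not just head count:
    one strong objection can outweigh several mild preferences.
    """
    totals = {action: 0.0 for action in ACTIONS}
    for vote in votes:
        for action in ACTIONS:
            totals[action] += vote.values.get(action, 0.0)
    best = max(ACTIONS, key=lambda action: totals[action])
    return best, totals


# Three pellet agents mildly prefer "left"; one ghost agent strongly objects.
votes = [
    AgentVote("pellet_1", {"left": 1.0}),
    AgentVote("pellet_2", {"left": 1.0}),
    AgentVote("pellet_3", {"left": 1.0, "up": 0.5}),
    AgentVote("ghost_1", {"left": -10.0}),
]
best, totals = choose_action(votes)
print(best, totals["left"])  # up -7.0
```

In this toy setup the three pellet agents are outvoted by a single ghost agent, mirroring the article's example: each agent only scores its own narrow concern, and the combined total steers Ms. Pac-Man away from the ghost.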
Maluuba says the Hybrid Reward Architecture approach to AI learning has expansive, practical applications, like helping to predict a company’s sales leads, or making advances in natural language processing.