A naive brute force algorithm would take a lot longer to complete some of these levels.
Let's say it takes 45 seconds to complete a level. That's 45 * (60 / 12) = 225 moves. The size of the action space is 14, so you'd be looking at 14 ^ 255 (give or take a few orders of magnitude) trajectories before finding the solution.
To brute force in a reasonable a time, you would have to look at the environment and have the algorithm iterate through all trajectories in a clever way. For instance, you may choose to only try trajectories that are constantly moving right. This strategy may find a solution to 1-1 fairly quick, but this does not generalize to other levels, especially ones that requires backtracking or waiting.
You'd have to design a pretty gnarly algorithm for it to beat 1-1, 1-4 and 2-2. This gets even more complicated if you bring in other environments: the original paper also trained on Montezuma's Revenge, Private Eye, Venture, Freeway and Gravitar
They have 350 hrs of playing per level. They could brute force on these parameters.
1. Detect to_avoid moving obstacles
2. Detect to_avoid stationary obstacles (gaps)
3. Player moves (jump (left, right)(up, far, farthest), walk)
4. Success (hitting the flag, reaching princess)
Once above information is available on the screen, it becomes easily bruteforcible. I was hoping use of genetic algorithms i.e. they could have taken success from one part (path) of level and crossed it with another and tried that with across levels. But there won't be a generic strategy or learning anyways as there's fair bit of randomness. So this does seem like brute forcing for success path
The whole point is that the algorithm doesn't know about obstacles or success as a concept baked into the algorithm. Likewise, this is pretty initial research, meant to inform and promote
In other words, this isn't meant to be super useful by itself. It seems tailor made (as many of these things do) to play super-simple 80's video games and literally nothing else, but it's an interesting proof of concept. I'd also be interested in different iterations on this general pattern - for instance, something that didn't translate directly from screen + button -> prediction, and instead had some interstitial systems - translating from screen -> entities, then predicting entity state of entities given button presses. It'd also be interesting to see how this performs with ML algorithms designed to learn on the fly instead of through training from a static set of data (at least, this looked like it learned through back propagation - I skimmed).
But I can see broader practical applications for this in, for instance, recommender systems trying to break users out of the closed feedback loop that people tend to end up in when going down certain rabbit holes (e.g. watch one Flat Earther conspiracy video and suddenly that's all you see for a week because the recommender system knows that people who look at one will look at more). The point being: the real test comes when this strategy is exposed to more diverse problem spaces, it's just that those are harder to model and we need to weed out the pointless stuff first.
Let's say it takes 45 seconds to complete a level. That's 45 * (60 / 12) = 225 moves. The size of the action space is 14, so you'd be looking at 14 ^ 255 (give or take a few orders of magnitude) trajectories before finding the solution.
To brute force in a reasonable a time, you would have to look at the environment and have the algorithm iterate through all trajectories in a clever way. For instance, you may choose to only try trajectories that are constantly moving right. This strategy may find a solution to 1-1 fairly quick, but this does not generalize to other levels, especially ones that requires backtracking or waiting.
You'd have to design a pretty gnarly algorithm for it to beat 1-1, 1-4 and 2-2. This gets even more complicated if you bring in other environments: the original paper also trained on Montezuma's Revenge, Private Eye, Venture, Freeway and Gravitar