It's not flawed. Wouldn't you assume that whatever bias (due to "peeking" at results) was there for one algorithm would also be there for another? Also, to offset this effect somewhat, I waited till I saw statistical significance at least 10 times. (I know it's a heuristic) Moreover, I'm curious if there's a theoretical result that compares statistical power of randomization and MAB.
Though like a commenter said, comparing these two algorithms is actually like comparing apples to oranges, and that was precisely the conclusion.
Yes, I agree it is not a rigorous statistical analysis but I tried doing a fair comparison (based on assumptions/heuristics clearly described in the post). The blog post simply claims what is described there.
tldr: The methodology is flawed.