Update:
In the end I used Zwired's code. It performed better with large data sets (as you already mentioned, @Zwired). I changed only two small things and it worked perfectly.
Tim's code had a few small bugs (for example, an uninitialized shift register, and it ignored the last two plateaus).
Thank you again, everybody.