In January I wrote a post on parallel error diffusion. In the meantime the paper has been published with this citation: Yao Zhang, John L. Recker, Robert Ulichney, Giordano B. Beretta, Ingeborg Tastl, I-Jong Lin and John D. Owens, "A parallel error diffusion implementation on a GPU", Proc. SPIE 7872, 78720K (2011); doi:10.1117/12.872616. The link is http://dx.doi.org/10.1117/12.872616. In that paper we focused on making the CUDA implementation of the BIPED algorithm as efficient as possible.
A new paper has just appeared in JEI: Yan Zhou, Chun Chen, Qiang Wang, Jiajun Bu and Hua Zhou, "Block-based threshold modulation error diffusion", J. Electron. Imaging 20, 013018 (Mar 25, 2011); doi:10.1117/1.3555132. The link is http://dx.doi.org/10.1117/1.3555132. Their focus is on achieving the highest possible image quality with BIPED. Lacking performance data, I do not know how it performs compared to sequential ED.