If you're profiling a possible solution, write the inner loop first. If you can't get the inner loop to perform, you've saved yourself a lot of work.
I was enamoured with the idea of using this method to draw antialiased lines. Basically, you evaluate four edge functions, take max[d1..d4], and look that distance up in a table, which gives you the coverage for each pixel. It's beautiful because it's an implicit function, and it looks as though it maps well onto SSE2 instructions: there should be no branches, the vector width is a perfect match, etc.
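To make the shape of that inner loop concrete, here is a minimal scalar sketch in C. It is not the SSE2 assembly I actually wrote, and every name in it (`Edge`, `build_coverage_table`, `shade_span`, `TABLE_SIZE`, the `scale` parameter) is made up for illustration. It assumes each edge function is positive outside the line, and it uses a plain linear ramp as a stand-in for a real coverage table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256  /* hypothetical table resolution */

/* One edge function d(x, y) = a*x + b*y + c.
   Assumed convention: d > 0 means the point is outside that edge. */
typedef struct { float a, b, c; } Edge;

/* Placeholder table: a linear ramp from full coverage (255) at
   distance index 0 down to zero coverage at the last index.
   A real table would encode the antialiasing filter instead. */
static void build_coverage_table(uint8_t table[TABLE_SIZE]) {
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = (uint8_t)(255 - i);
}

/* Map a distance to a coverage value via the table.
   scale converts distance units to table indices (an assumption). */
static uint8_t coverage_from_distance(const uint8_t table[TABLE_SIZE],
                                      float d, float scale) {
    float t = d * scale;
    if (t < 0.0f) t = 0.0f;                                   /* inside: index 0 */
    if (t > (float)(TABLE_SIZE - 1)) t = (float)(TABLE_SIZE - 1);
    return table[(int)t];
}

/* The inner loop: walk n pixels along one scanline starting at (x0, y).
   Each edge function is evaluated once, then stepped by its x
   coefficient per pixel, so the loop body is four adds, a max,
   and a table lookup. */
static void shade_span(const Edge e[4], const uint8_t table[TABLE_SIZE],
                       float x0, float y, float scale,
                       uint8_t *out, size_t n) {
    assert(out != NULL || n == 0);

    float d[4];
    for (int i = 0; i < 4; i++)
        d[i] = e[i].a * x0 + e[i].b * y + e[i].c;

    for (size_t p = 0; p < n; p++) {
        float m = d[0];
        for (int i = 1; i < 4; i++)
            if (d[i] > m) m = d[i];          /* max[d1..d4] */
        out[p] = coverage_from_distance(table, m, scale);
        for (int i = 0; i < 4; i++)
            d[i] += e[i].a;                  /* step one pixel in x */
    }
}
```

The comparisons and the clamp are branches as written here; the appeal of the SSE2 version was that the four distances fit in one register and the max and clamp could, in principle, be done with vector min/max operations instead.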
Alas, the inner loop wound up being a lot more instructions than I expected. Maybe I just didn't try hard enough at the assembly, but the principle holds. My biggest mistake here, though, was that the inner loop was the last thing I wrote. I wrote the entire algorithm setup in assembly first, only to find out at the very end that I couldn't get the inner loop fast enough.