1) Pentium 4s do not have "more pipelines"; they actually have fewer than an Athlon (3 for the Athlon, 2 for the P4). The reason they need more memory bandwidth is the number of stages in the pipeline: 31 on Prescott-based P4s (20 on Northwoods), versus only 10/15 (ALU/FPU) for a traditional Athlon. A deeper pipeline loses more work every time it stalls or has to be flushed, so it has to be kept fed. (The Athlon 64 has 12/17 stages. AMD had to lengthen the pipeline, or else clock scaling might never have happened.)
2) Intel is making 64-bit desktop CPUs:
http://www.anandtech.com/cpuchipsets...oc.aspx?i=2152
3) Intel's 64-bit clone of the AMD64 architecture looks solid so far, but their implementation has neither the integrated memory controller nor HyperTransport, so scaling in MP environments is poor, and even single-processor machines appear to be slightly behind the Opteron / Athlon 64.
4) Cache size can often be irrelevant. It depends on what type of application you are running, whether or not it can fit directly into a larger L2 cache, and the microarchitecture of the CPU itself.
If you are running an application whose working set (the code and data it actually touches) is tiny and fits easily into the Athlon's 256KB L2, then having a 1MB L2 cache is pointless and shouldn't result in any higher performance. If, however, your application needs more than 256KB to fit into the L2, then the P4 w/1MB cache will obviously do better.
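To make the "fits or doesn't fit" point concrete, here is a rough sketch of a toy cache model. It assumes a single fully associative LRU cache with 64-byte lines and a program that just sweeps its working set over and over; real L2 caches are set associative and sit behind an L1, so treat the numbers as illustrative only. The function name hit_rate and its parameters are made up for this example.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size, for illustration only


def hit_rate(cache_kb, working_set_kb, passes=4):
    """Hit rate of a fully associative LRU cache over repeated sequential sweeps."""
    capacity = cache_kb * 1024 // LINE_BYTES      # lines the cache can hold
    lines = working_set_kb * 1024 // LINE_BYTES   # lines the program touches
    cache = OrderedDict()                         # keys ordered oldest -> newest
    hits = accesses = 0
    for _ in range(passes):
        for line in range(lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)           # mark as most recently used
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)     # evict least recently used
                cache[line] = True
    return hits / accesses


for ws_kb in (128, 512):
    print(ws_kb, "KB working set:",
          "256KB cache ->", hit_rate(256, ws_kb),
          "| 1MB cache ->", hit_rate(1024, ws_kb))
```

With a 128KB working set both cache sizes give the same hit rate in this model (only the first sweep misses), so the extra 768KB buys nothing. With a 512KB working set the 256KB cache thrashes while the 1MB cache behaves as before.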
Also, remember that the L2 cache is not the first cache on your CPU. If you have a large L1 cache, like the Athlon does (128KB), then your L2 doesn't have to be as large to get the same performance. The P4 only has about 20K of L1, and 12K of that is trace cache (which holds decoded micro-ops), NOT conventional L1 data or instruction cache; only 8KB is L1 data cache. The P4's L2 cache is also inclusive, so whatever is in L1 gets replicated into L2: take 1024KB-20KB=1004KB of usable cache. (A small difference, but present nonetheless.) Both the Athlon and Athlon 64 have exclusive L2 caches, where L1 is not replicated into L2. The reason Intel tends to favor inclusive caches is to reduce latency. Since the Athlon XP is not such an unwieldy 31-stage architecture, an occasional cache miss is not the big deal that it is on the P4.
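The inclusive vs. exclusive arithmetic above can be written out as a small sketch. The helper names are hypothetical, and the sizes are simply the figures quoted in this post (20KB L1 / 1024KB L2 for the P4, 128KB L1 / 256KB L2 for the Athlon); counting the P4's trace cache in kilobytes is a simplification carried over from the text.

```python
def unique_l2_kb(l1_kb, l2_kb, inclusive):
    """L2 capacity left over for data that is not already sitting in L1."""
    # An inclusive L2 keeps a copy of everything in L1, so that much is duplicated.
    return l2_kb - l1_kb if inclusive else l2_kb


def total_unique_kb(l1_kb, l2_kb, inclusive):
    """Total distinct data that L1 and L2 can hold together."""
    # With an exclusive L2 the two levels never overlap, so capacities add up.
    return l2_kb if inclusive else l1_kb + l2_kb


# Figures as quoted in the post (assumptions, not measured values).
print("P4 unique L2:      ", unique_l2_kb(20, 1024, inclusive=True), "KB")
print("Athlon unique L2:  ", unique_l2_kb(128, 256, inclusive=False), "KB")
print("P4 total unique:   ", total_unique_kb(20, 1024, inclusive=True), "KB")
print("Athlon total unique:", total_unique_kb(128, 256, inclusive=False), "KB")
```

So under these assumptions the P4's inclusive design gives up 20KB of its L2 to duplicates, while the Athlon's exclusive design effectively behaves like 384KB of distinct cache rather than 256KB.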
Different CPU architectures also use cache more or less efficiently. The P4's L2 cache is fast and wide, but is only 8-way set associative, while the Athlon's is not as fast (it still runs at full speed, but with only a 64-bit data path), yet it is 16-way set associative.
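For anyone wondering what associativity changes in practice, here is a minimal sketch of how an address maps to a cache set. The 64-byte line size is an assumption for illustration, real CPUs index on physical addresses with their own details, and the function names are made up.

```python
def num_sets(size_kb, ways, line_bytes=64):
    """Number of sets in a set-associative cache of the given geometry."""
    return size_kb * 1024 // (ways * line_bytes)


def set_index(addr, size_kb, ways, line_bytes=64):
    """Which set a byte address falls into (simple modulo indexing)."""
    return (addr // line_bytes) % num_sets(size_kb, ways, line_bytes)


# Geometries as described in the post; line size is assumed.
athlon_sets = num_sets(256, 16)    # 256KB, 16-way
p4_sets = num_sets(1024, 8)        # 1MB, 8-way
print("Athlon L2 sets:", athlon_sets, "| P4 L2 sets:", p4_sets)

# Addresses spaced (sets * line size) apart all land in the same set.
stride = athlon_sets * 64
addrs = [i * stride for i in range(4)]
print([set_index(a, 256, 16) for a in addrs])
```

Each set can hold as many lines as there are ways before something has to be evicted, so in this model the 16-way cache tolerates 16 addresses that collide on the same set, while the 8-way cache starts evicting after 8.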
In the scheme of things, the P4 also needs a larger cache. Because it has a tiny L1 cache and is bandwidth-deprived, the P4 benefits much more from a larger L2. Tests have shown that an Athlon XP w/256KB L2 cache performs roughly the same as an Athlon XP w/512KB L2 cache (Barton). The K7 architecture is not nearly as bandwidth-deprived as the P4.
So - the short answer to your long question is: "It depends on which application you are planning on using". Some applications don't care much about the cache, and some make a big deal about it.