My progress so far:
This appears to be a different program from the one bigrigdriver reported, as only once in several examples did the 'yq' combination pop up.
I have, however, found other patterns. For one, all of the random strings are 8 letters long! I've saved a log file of each incident, and when I strip out HTML tags and the "http://" and ".com" parts, I end up with neat columns of 8-letter strings.
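The stripping step described above could be done along these lines. This is only a minimal sketch: the function name strip_wrappers is made up for illustration, and it assumes the log is a plain text file in which tags, "http://" and ".com" are the only wrappers around the random strings.

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption, not part of the original script).
# Reduces a saved log file to bare tokens, one per line.
strip_wrappers() {
    # 1. drop anything that looks like an HTML tag
    # 2. drop the "http://" prefix and the ".com" suffix
    # 3. put each remaining whitespace-separated token on its own line
    sed -e 's/<[^>]*>//g' -e 's|http://||g' -e 's/\.com//g' "$1" \
        | tr -s '[:space:]' '\n' \
        | grep -v '^$'
}
```

Called as strip_wrappers logfile, it should print the column of strings, which makes the fixed 8-letter length easy to see.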
Furthermore, I wrote a letter-frequency script. I might as well post it:
Code:
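#!/bin/bash
# Count how often each letter occurs in the file given as $1 (case-insensitive).
# Start clean, so counts from an earlier run are not appended to.
rm -f temp
for LETTER in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
    # Lowercase the file, print every occurrence of the letter on its own line, count the lines.
    LETTER_COUNT=$(tr 'A-Z' 'a-z' < "$1" | grep -o "$LETTER" | wc -l)
    echo $LETTER_COUNT ":" $LETTER >> temp
done
# Most frequent letters first.
sort -gr temp > frequency_count
rm temp
exit 0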
Running it on the log file gives me a startlingly different pattern from normal English usage. The spambot's 10 most frequent letters are bwagpmjizl, as opposed to etaoinshrd for standard English.
So this gives me three ways to kick it out (and I can hard-code them into my PHP comment script, so the spam gets rejected on the posting attempt): (1) check that the URL actually exists, (2) check whether all fields show the 8-letter word pattern, (3) check the letter frequency. For the letter-frequency part, I hope to create some kind of "fuzzy-match" method, because I don't want to block legitimate users with bad spelling (or 'AOLbonics')... but having the letter 'z' in the top ten letters is of course a dead giveaway!
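Checks (2) and (3) might look something like the sketch below. It is written in shell to match the script above, rather than the PHP the real filter would live in. The function names are placeholders, and counting how many of the ten most frequent letters also appear in "etaoinshrd" is just one assumed way to do a fuzzy match, not a tested method.

```shell
#!/bin/bash
# Sketch only: all function names here are hypothetical.

# Check (2): succeed only if every argument is exactly 8 letters.
all_eight_letters() {
    local field
    [ "$#" -gt 0 ] || return 1
    for field in "$@"; do
        [[ $field =~ ^[A-Za-z]{8}$ ]] || return 1
    done
    return 0
}

# Print the (up to) ten most frequent letters of the text on stdin,
# most frequent first; ties are broken alphabetically.
top_ten_letters() {
    tr 'A-Z' 'a-z' | grep -o '[a-z]' | sort | uniq -c \
        | sort -k1,1nr -k2,2 | head -n 10 \
        | awk '{printf "%s", $2} END {print ""}'
}

# Check (3), fuzzy version: print how many of those top letters
# also appear in the usual English top ten, "etaoinshrd".
english_overlap() {
    local top i overlap=0
    top=$(top_ten_letters)
    for (( i = 0; i < ${#top}; i++ )); do
        case "etaoinshrd" in
            *"${top:i:1}"*) overlap=$((overlap + 1)) ;;
        esac
    done
    echo "$overlap"
}
```

For example, echo "$comment" | english_overlap prints a score from 0 to 10, and a comment could be rejected when the score falls below some cutoff. The cutoff would need tuning on real comments, since short comments have few distinct letters. For comparison, the spambot's bwagpmjizl shares only two letters (a and i) with etaoinshrd.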
I hope not just to solve this problem, but to come up with a general method for kicking out random-letter spam of all kinds, since I bet it will become more frequent as a means of undermining Bayesian filtering.