Mar 04 2012

Cleaning up a Mediawiki spam mess

Published by at 4:06 pm under Tech

I run a wiki for CURATEcamp, using MediaWiki. I don’t run it well, so it got full of spam. I learned how to add a little math script to each page edit, and that slowed down the spam for a while, but it’s easy to get around and the spam started flowing again. So now I have 700+ pages of spam and more coming in every day. That leaves me with 3 problems to solve:

  1. Stop the addition of new users without confirmation
  2. Stop new spam
  3. Clean up all the spam pages

I found the ConfirmAccount extension and installed it. That fixed #1.

Next, I found the MediaWiki manual page “Preventing access” and followed its instructions to add these lines to the LocalSettings.php file:

# Disable anonymous editing
$wgGroupPermissions['*']['edit'] = false;

That stopped the random addition of new spam by anonymous visitors.

Next, I started looking for easy cleanup tools, and didn’t really find any. I could list all of the pages on the wiki, but I’d have to visit each one and delete it, a real pain for 700+ pages. I also had about 20 pages that I wanted to keep. Then I found the DeleteBatch extension, which would allow me to put the spam page names into a text box (or text file) and delete them all at once.

Now I needed to generate a list of spam page names, so I went to the Special Page that lists All pages, and copied and pasted the titles into an Excel spreadsheet. It was a bit of a pain because the list was in three columns and split across three pages, but I just dragged and dropped the list around in Excel until I had it all in one column. Most of the spam pages are user pages, and their titles end in a number. So I set up a second column that takes the last 2 characters of the page title and tries to convert them to a number:

=VALUE(RIGHT(A115,2))

Then I added a third column, a conditional that repeats the page title if it ends in a number:

=IF((ISNUMBER(B115)),A115)

I bet I could have made it simpler with some function that converts a cell made up of a word and a number, like “ClardyGarces959”, into just “959”, but I couldn’t remember how to do that.
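
For anyone who would rather script this than fight spreadsheet functions, here is a minimal Python sketch of the “word plus number” split wondered about above. This is not what I used; the function name trailing_number is made up for illustration. It uses a regular expression to pull the whole run of digits off the end of a title, however long, instead of only looking at the last 2 characters.

```python
import re


def trailing_number(title):
    """Return the digits at the end of a page title as an int, or None if there are none."""
    match = re.search(r"[0-9]+$", title.strip())
    return int(match.group()) if match else None


print(trailing_number("ClardyGarces959"))  # 959
print(trailing_number("Main Page"))        # None
```

A title with no trailing digits gives None, which plays the same role as the failed VALUE() conversion in the spreadsheet.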

Next, I sorted by this column, which grouped all of the page titles that ended in a number.  I visually inspected the list, and I’m glad I did because some of my legitimate pages also ended in numbers.  I deleted those from the list, then pasted the list of known spam page titles into DeleteBatch.
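
The same filtering could probably be scripted. The following is a rough Python sketch, not the spreadsheet method I actually used. It assumes the page titles have already been copied into a list, one title per entry, and that the pages worth keeping are listed by hand. The function name spam_candidates and the sample titles other than “ClardyGarces959” are hypothetical.

```python
import re


def spam_candidates(titles, keep):
    """Return titles that end in a digit and are not on the keep list, sorted for review."""
    keep_set = {t.strip() for t in keep}
    candidates = set()
    for raw in titles:
        title = raw.strip()
        if not title or title in keep_set:
            continue  # skip blank lines and pages we want to keep
        if re.search(r"[0-9]$", title):
            candidates.add(title)
    return sorted(candidates)


# Hypothetical sample data; real titles would come from the All pages list.
all_titles = ["ClardyGarces959", "Main Page", "CURATEcamp 2012", "User:Xyz88", "ClardyGarces959"]
pages_to_keep = ["Main Page", "CURATEcamp 2012"]

# One title per line, to be checked by eye before pasting into DeleteBatch.
print("\n".join(spam_candidates(all_titles, pages_to_keep)))
```

The keep list matters for the same reason the visual inspection did: legitimate pages can end in numbers too, so the output is only a list of candidates and still deserves a read-through before anything gets deleted.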

This left me with a handful of spam pages that I had to pick through individually, but way fewer than before.

Hope this helps someone else with the same problem!

UPDATE

Make sure to look for pages in namespaces other than Main. I found a bunch more User: pages full of spam, and used the same methods as above to quickly get rid of them.

2 Responses to “Cleaning up a Mediawiki spam mess”

  1. Andy Beverley on 16 Sep 2012 at 8:14 am

    Same problem on my wiki. However, it looks like the Extension:Nuke has now been updated, which makes it a lot quicker to delete pages. A search can be done of all pages, and quickly deleted using checkboxes.

    The remaining problem is that the database is still full of the history of all the old pages and edits. It would be nice to clear this out (as well as delete all the spam accounts that have been created). Extension for this functionality please!

    Andy

  2. Andy Beverley on 16 Sep 2012 at 8:22 am

    2 other things:

    Rather than deleting the spam users (not recommended) I set their password to an invalid value:

    mysql> update user set user_password='XXX' where user_name='Admin';

    I also prevented new page creations (in LocalSettings.php):

    $wgGroupPermissions['*']['createpage'] = false;
    $wgGroupPermissions['user']['createpage'] = false;
