By: hatmandu

hatmandu — Mon, 07 Dec 2009 09:00:30 +0000

Hi, and thanks for the comments.

Twitter usernames: yes, others have raised this. I probably will do that (though in a sense, I guess, how often other people’s usernames crop up shows how important they are to the writer). To be honest, the Twitter option isn’t ideal anyway, as the text samples aren’t huge (up to 200 tweets) – but it helps spread the word!

New coinage: um, good point. The BNC isn’t massively current. There are internet-based corpora, so for longevity one of those might ultimately be better.

I’d like to implement something with bigrams in the future, as and when I get time. Definitely agree they provide more sophisticated analysis. This current tool is inevitably rough and ready.

Oh, and ‘;•’ – yeah, somehow that’s slipped through the net, and I’m going to zap it!

By: Pete G

Pete G — Sun, 06 Dec 2009 18:23:55 +0000

Interesting. It’s almost surprising nobody’s done this before.

Immediate observation on the twitter version is that it ought to filter out usernames, or at the very least the target’s own username (in one I just tried, there were 20 occurrences of the target username, with just 3 of the next most common word).
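To make that concrete, here is a rough sketch of the kind of filtering I mean. It assumes the tweets arrive as plain strings; `count_words` and `target_username` are names I have made up for illustration, not anything from the actual tool.

```python
import re
from collections import Counter

# Matches @mentions such as "@pete_g" (assumed mention syntax).
MENTION = re.compile(r"@\w+")

def count_words(tweets, target_username=None):
    """Count words across tweets, ignoring @mentions.

    target_username is hypothetical: if given, bare occurrences of the
    target's own name (without the @) are dropped as well.
    """
    counts = Counter()
    for tweet in tweets:
        # Strip mentions before tokenising so usernames never reach the counts.
        text = MENTION.sub(" ", tweet.lower())
        counts.update(re.findall(r"[a-z']+", text))
    if target_username:
        counts.pop(target_username.lower(), None)
    return counts
```

Dropping every mention is the blunt option; keeping other people's usernames and removing only the target's own would be a one-line change.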

How do you weight new coinages?

I guess the most obvious extension is to include bigrams. Does the form you have the BNC in allow you to obtain reference frequencies for those, too? Even at its simplest, this would allow you to distinguish between “bike ride” and “fairground ride”, which are obviously different keywords. More generally, you can get useful information by looking at clusters of words. As something to aspire to, what would it take to correctly identify the keyword ‘can’ for http://en.wikipedia.org/wiki/Can_%28band%29 ? (Incidentally, I’m curious about how your current lexer decides that ;· is a word…)
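For what it's worth, a minimal sketch of the bigram idea, assuming you can get reference bigram counts out of the corpus at all. The `ref_counts` figures below are invented for illustration and are not real BNC numbers; the scoring is a plain log ratio with a crude add-one floor so that unseen pairs (new coinages included) do not divide by zero, which is one possible choice rather than how the current tool works.

```python
import math
import re
from collections import Counter

def bigrams(text):
    """Split text into lowercase words and pair each word with the next."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))

def keyness(sample_text, ref_counts, ref_total):
    """Score each sample bigram by log(sample rate / reference rate).

    ref_counts and ref_total are assumed inputs: bigram counts from a
    reference corpus and the total number of bigrams in it.
    """
    sample = Counter(bigrams(sample_text))
    sample_total = sum(sample.values())
    scores = {}
    for pair, n in sample.items():
        p_sample = n / sample_total
        # Add-one floor: a pair absent from the reference gets a tiny rate.
        p_ref = (ref_counts.get(pair, 0) + 1) / (ref_total + 1)
        scores[pair] = math.log(p_sample / p_ref)
    return scores

# Invented reference counts, purely illustrative.
ref_counts = {("bike", "ride"): 50, ("fairground", "ride"): 2}
scores = keyness("fairground ride then bike ride", ref_counts, ref_total=1000)
```

With these made-up numbers "fairground ride" scores higher than "bike ride", because it is rarer in the reference. Note that unseen pairs score highest of all, so how you weight new coinages matters a great deal.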

Comments on: What’s it all about, Alfie
