I’ve just launched a new tool at Hatmandu.net, a text content and keyword analyser – in theory useful for search engine optimisation, but also to get the general gist of a text.From the notes:
This text content and keyword analyser is intended to give a more precise indication of a text’s most important words than other tools available. Most keyword analysers use simple word frequency (which is also shown here anyway), but that doesn’t relate the specific text to the language in general – common terms such as ‘people’ and ‘time’, for example, appear in many documents, but do not necessarily indicate the essence of the particular text being analysed. This analyser uses the TF-IDF statistical method to relate the frequencies of words in the specific text to their general frequencies in the British National Corpus. I am indebted to Adam Kilgarriff‘s version of the BNC, which I have adapted considerably for this tool. This analyser mainly uses the nouns in the BNC, on the basis that these are the parts of speech that best indicate the subject matter of a text. (At some point I hope to produce a version using an American English corpus, though I’d be surprised if the results were very different.)
It works with Twitter accounts (though it only reads the last 200 tweets, which may not form a usefully large body of text), and URLs where my humble scraping tool is able to extract the text successfully – most useful is the ‘paste text’ field, which will accept up to 1Mb of text (about 200,000 words) – so will analyse entire books if desired. Livejournal users can enter their URL (http://username.livejournal.com) assuming their account is public.
It’s a bit experimental at the moment, but hopefully might migrate from ‘possibly fun’ to ‘possibly useful’ in due course!