So I have been working all day on what I am from now on calling the "comment uniqueness index rater" (referred to as CUIR in this post).
Turns out interacting with a website is actually pretty easy! Almost as if people have been doing this for decades and created special tools, libraries and packages... *big hmmm*
I am now able to scrape the jam page for all the usernames, game names and URLs. Following the links to each game page, collecting all the comments and running them through my CUI calculator should be a breeze now and will be done by tomorrow.
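Here is a minimal sketch of that scraping step in Python with requests and BeautifulSoup. The CSS selectors (`.game_cell`, `.game_title`, `.game_author`) are made-up placeholders; the real ones depend on the jam page's actual markup:

```python
import requests
from bs4 import BeautifulSoup

def scrape_jam_page(jam_url):
    """Return a list of (author, game_name, game_url) tuples from a jam page."""
    response = requests.get(jam_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    entries = []
    for cell in soup.select(".game_cell"):              # hypothetical selector
        title_link = cell.select_one(".game_title a")   # hypothetical selector
        author = cell.select_one(".game_author")        # hypothetical selector
        if title_link and author:
            entries.append((author.get_text(strip=True),
                            title_link.get_text(strip=True),
                            title_link["href"]))
    return entries
```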
The whole functionality will be as follows:
- After entering the jam URL, the program automatically retrieves each game's name, its author and the link to its page.
- This data is then saved in a database with an additional slot for the list of comments. In this database the author is the primary key attribute (see the schema sketch after this list).
- We then go through each link and retrieve the comments/ratings, saving each comment into the comment list of the respective author.
- If the website hasn't kicked us by this point for generating too much traffic, we can go offline and analyze the data we collected.
- Depending on how fast the results should come in and how cheat-resistant the CUIR is supposed to be, we can enable or disable the comment spellcheck. (I would personally do a first pass without and a second pass with spellcheck.)
- For each author's comments we calculate the Jaccard index of every pair and take the arithmetic mean to get the average comment similarity. This value will then be known as the CUI, ranging from 0 for completely unique comments to 1 for a single comment copy-pasted everywhere (see the calculation sketch after this list).
- The results are then saved into the database, where we can filter out the cheaters.
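For the database, here is a sketch of what I have in mind, using SQLite; all table and column names are placeholders. Instead of a literal list slot, comments get their own table keyed by the author, which amounts to the same thing:

```python
import sqlite3

conn = sqlite3.connect("cuir.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS entries (
    author  TEXT PRIMARY KEY,   -- the author is the primary key attribute
    game    TEXT NOT NULL,
    url     TEXT NOT NULL,
    cui     REAL                -- filled in after the analysis step
);
CREATE TABLE IF NOT EXISTS comments (
    author  TEXT REFERENCES entries(author),
    body    TEXT NOT NULL       -- one row per comment, the "comment list"
);
""")
conn.commit()
```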
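The core calculation from the last few steps is simple enough to sketch already: treat each comment as a set of words, compute the Jaccard index for every pair of an author's comments, and average the results. `tokenize` here is a stand-in for whatever normalization (including the optional spellcheck) ends up in the final version:

```python
from itertools import combinations
import re

def tokenize(comment):
    """Lowercase and keep only word characters, so case and punctuation don't matter."""
    return set(re.findall(r"\w+", comment.lower()))

def jaccard(a, b):
    """Jaccard index of two word sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cui(comments):
    """Arithmetic mean of the pairwise Jaccard indices of an author's comments."""
    word_sets = [tokenize(c) for c in comments]
    pairs = list(combinations(word_sets, 2))
    if not pairs:
        raise ValueError("CUI needs more than one comment per author")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Two identical comments give 1.0 and completely disjoint ones give 0.0, and since everything is reduced to word sets, `cui(["Nice game, loved the art!", "Loved the art, nice game."])` also comes out as 1.0, which is exactly the order-invariance feature listed below.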
Features:
- Detects copy-pasted comments
- Swapping sentences or parts of sentences around won't change the result, since comments are compared as word sets and word order is ignored
- Spellcheck removes the possibility of evading detection through deliberate misspelling
Flaws:
- The spellcheck will shit itself with foreign languages. In theory there shouldn't be too many non-English comments, since only a few users would actually understand them. Foreign scripts by themselves won't cause any issues though, since they must be UTF-8 encoded to be properly displayed on the website anyway.
- A low number of comments may result in a misleadingly high CUI, which means there must be more than one comment per analyzed person. These mutes can be filtered out and put into a separate list for a different use.
- I don't know how it will handle images and emoticons, as well as special formatting. This might crash the system, so it will require some testing.
- One can evade detection by swapping words for their synonyms, since the Jaccard index only cares about the surface form of a word, not its actual meaning. This isn't actually an issue, since we only want to find blatant copy-pasters, and swapping in synonyms for every word of a single proto-comment would require more work than writing a completely new one (and this is what we want).
When everything is done and in a presentable (and hopefully working) condition I will release the source code on my GitHub for you to scrutinize. If we're lucky the program will do what it's supposed to do by the weekend.