A comparative study of text classification algorithms on user submitted bug reports

In software industries, various open source projects utilize the services of Bug Tracking Systems that let users submit software issues or bugs and allow developers to respond to and fix them. The users label the reports as bugs or any other relevant class. This classification helps to decide which team or personnel would be responsible for dealing with an issue. A major problem here is that users tend to wrongly classify the issues, because of which a middleman called a bug triager is required to resolve any misclassifications. This ensures no time is wasted at the developer end. This approach is very time consuming and therefore it has been of great interest to automate the classification process, not only to speed things up, but to lower the amount of errors as well. In the literature, several approaches including machine learning techniques have been proposed to automate text classification. However, there has not been an extensive comparison on the performance of different natural language classifiers in this field. In this paper we compare general natural language data classifying techniques using five different machine learning algorithms: Naive Bayes, kNN, Pegasos, Rocchio and Perceptron. The performance comparison of these algorithms was done on the basis of their apparent error rates. The data-set involved four different projects, Httpclient, Jackrabbit, Lucene and Tomcat5, that used two different Bug Tracking Systems - Bugzilla and Jira. An experimental comparison of pre-processing techniques was also performed.

paper