Finding the Needle in a Haystack: On the Automatic Identification of Accessibility User Reviews


In recent years, mobile accessibility has become an important concern, with the goal of enabling all users to use any app without limitations. User reviews include insights that are useful for app evolution. However, as the volume of received reviews grows, manually analyzing them becomes tedious and time-consuming, especially when searching for accessibility reviews. The goal of this paper is to support the automated identification of accessibility reviews among user reviews, to help technology professionals prioritize their handling and, thus, create more inclusive apps. In particular, we design a model that takes user reviews as input, learns their keyword-based features, and makes a binary decision, for a given review, on whether it is about accessibility or not. The model is evaluated using a total of 5326 mobile app reviews. The findings show that (1) our model can accurately identify accessibility reviews, outperforming two baselines, namely a keyword-based detector and a random classifier; and (2) our model achieves an accuracy of 85% with a relatively small training dataset; however, the accuracy improves as the size of the training dataset increases.

More specifically, the research questions that we investigated are:

RQ1. To what extent can machine learning models accurately distinguish accessibility reviews from non-accessibility reviews?

To answer this research question, we rely on a manually curated dataset of 2663 accessibility reviews, which we augment with another 2663 non-accessibility reviews. Then, we perform a comparative study between state-of-the-art binary classification models to identify the best model that can properly distinguish accessibility reviews from non-accessibility reviews.
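As an illustration of such a comparative study, the sketch below feeds TF-IDF features of the review text into several off-the-shelf binary classifiers and compares them with cross-validation. The file name labeled_reviews.csv, the column names, and the particular classifiers are assumptions made for illustration; they are not necessarily the exact setup used in the paper.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Balanced dataset: 2663 accessibility + 2663 non-accessibility reviews.
# File and column names are hypothetical placeholders.
df = pd.read_csv("labeled_reviews.csv")
texts, labels = df["review_text"], df["is_accessibility"]

# Candidate binary classifiers to compare (illustrative selection).
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
}

for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(pipe, texts, labels, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")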

RQ2. How effective is our machine learning approach in identifying accessibility reviews?

To answer this research question, we compare our best performing model against two baseline approaches, namely a keyword-based detector and a random classifier, to measure the extent to which our learning-based approach improves the identification of accessibility reviews.
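For reference, the two baselines could be sketched as follows; the keyword list and the function names are hypothetical and only meant to illustrate the idea behind each baseline, not the exact lists used in the paper.

import random

# Hypothetical, non-exhaustive list of accessibility-related keywords.
ACCESSIBILITY_KEYWORDS = {
    "accessibility", "screen reader", "talkback", "voiceover",
    "blind", "low vision", "font size", "contrast", "hearing",
}

def keyword_baseline(review_text):
    """Flag a review as accessibility-related if any keyword occurs in it."""
    text = review_text.lower()
    return any(keyword in text for keyword in ACCESSIBILITY_KEYWORDS)

def random_baseline(review_text):
    """Assign the accessibility label by a fair coin flip."""
    return random.random() < 0.5

print(keyword_baseline("TalkBack keeps skipping the submit button"))  # True
print(keyword_baseline("The app crashes when I open my cart"))        # False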

RQ3. What is the size of the training dataset needed for the classification to effectively identify accessibility reviews?

In this research question, we empirically determine the minimum number of training instances, i.e., labeled accessibility reviews, needed for our best performing model to achieve its best performance. Such information is useful for practitioners to estimate the amount of manual work (i.e., preparation of training data) needed to build this solution.
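One possible way to run this experiment is to grow the training set in fixed increments and record the test accuracy at each step, as in the sketch below; the increments, split ratio, classifier, and file/column names are illustrative assumptions rather than the configuration reported in the paper.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Same hypothetical labeled dataset as in the RQ1 sketch above.
df = pd.read_csv("labeled_reviews.csv")
texts, labels = df["review_text"], df["is_accessibility"]

# Hold out a fixed test set, then train on progressively larger subsets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

for size in (250, 500, 1000, 2000, len(X_train)):
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"),
                         LogisticRegression(max_iter=1000))
    pipe.fit(X_train.iloc[:size], y_train.iloc[:size])
    acc = accuracy_score(y_test, pipe.predict(X_test))
    print(f"training instances: {size:>5}  test accuracy: {acc:.3f}")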



Web Service

How to use our deployed Azure web service?


Web service request and response


    Request
    -------------
{
  "Inputs": {
    "input1": {
      "ColumnNames": [
        "review_text"
      ],
      "Values": [
        [
          "value"
        ],
        [
          "value"
        ]
      ]
    }
  },
  "GlobalParameters": {}
}

    Response
    -------------
{
  "Results": {
    "output1": {
      "type": "DataTable",
      "value": {
        "ColumnNames": [
          "Scored Labels"
        ],
        "ColumnTypes": [
          "String"
        ],
        "Values": [
          [
            "value"
          ],
          [
            "value"
          ]
        ]
      }
    }
  }
}

Python script to call the web service


import urllib2
# If you are using Python 3+, import urllib instead of urllib2

import json 


data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["review_text"],
            "Values": [["value"], ["value"]]
        }
    },
    "GlobalParameters": {}
}

body = str.encode(json.dumps(data))

url = 'https://ussouthcentral.services.azureml.net/workspaces/41654fb2238f449daf8dc7954f22ee9b/services/128f31001e794cd19ab042010d1f4a0e/execute?api-version=2.0&details=true'
api_key = 'FrykEwu75lBeqhdG/Iz8NdwmY0GOrAJVNl+f+BcXCPChMrDkYRL4S/F7E33YxFRkJrese1giJ9NrWOBxJjWgag==' # API key for the web service
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}

req = urllib2.Request(url, body, headers) 

try:
    response = urllib2.urlopen(req)

    # If you are using Python 3+, replace urllib2 with urllib.request in the above code:
    # req = urllib.request.Request(url, body, headers) 
    # response = urllib.request.urlopen(req)

    result = response.read()
    print(result) 
except urllib2.HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())

    print(json.loads(error.read()))                 
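For users on Python 3, the same call can be written with urllib.request, following the substitutions hinted at in the comments above; the payload, endpoint URL, and API key are the ones shown in the Python 2 script.

import json
import urllib.request
import urllib.error

# Same request payload as in the Python 2 script above.
data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["review_text"],
            "Values": [["value"], ["value"]]
        }
    },
    "GlobalParameters": {}
}

body = json.dumps(data).encode("utf-8")
url = 'https://ussouthcentral.services.azureml.net/workspaces/41654fb2238f449daf8dc7954f22ee9b/services/128f31001e794cd19ab042010d1f4a0e/execute?api-version=2.0&details=true'
api_key = 'FrykEwu75lBeqhdG/Iz8NdwmY0GOrAJVNl+f+BcXCPChMrDkYRL4S/F7E33YxFRkJrese1giJ9NrWOBxJjWgag==' # API key for the web service
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + api_key}

req = urllib.request.Request(url, body, headers)
try:
    with urllib.request.urlopen(req) as response:
        print(response.read().decode("utf-8"))
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # The headers include the request ID and the timestamp, useful for debugging the failure
    print(error.info())
    print(json.loads(error.read()))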

If you are interested in learning more about the process we followed, please refer to our paper.


Related Paper

E. A. AlOmar, W. Aljedaani, M. Tamjeed, M. W. Mkaouer, and Y. El-Glaly, "Finding the Needle in a Haystack: On the Automatic Identification of Accessibility User Reviews," in Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI 2021). [preprint]