
Sentiment Analysis and Topic Detection with Microsoft Cognitive Services using Python

Microsoft’s Cognitive Services is a grab-bag of amazing capabilities that you can purchase by the transaction. The Cognitive Services APIs are grouped by vision, speech, language, knowledge and search. This article is about using the Text Analytics API (in the language group) to score the sentiment and detect the topics of a large number of text phrases. More specifically, the scenario we’ll explore is that we’ve been given a file containing thousands of responses to a survey conducted on Facebook, and our client would like to know the sentiment of the text comments, the topics mentioned by those comments, and the frequency of those topics.

The text analytics API is well documented by Microsoft. The request body has to have exactly this format:

{ "documents": [ { "id": "string", "text": "string" } ] }

For example:

{"documents":[{"id":"1190","text":"thank you!"},{"id":"1191","text":"i thought it was perfect, as is."},{"id":"1783","text":"more tellers on certain busy days ,for example mondays."}]}

In practice I found that the API would accept request bodies up to about 64 KB. So, if you have more than 64 KB of comments, you have some looping in your future.
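
As a rough sketch, the request body can also be built with the standard json module instead of by hand, which takes care of quoting and commas and makes the size easy to check before sending. This is written for Python 3 (the code in this post is Python 2). The names build_request_body and MAX_BODY_BYTES are made up for illustration, and the 64 KB figure is the empirical limit mentioned above, not a documented one.

```python
import json

MAX_BODY_BYTES = 64 * 1024  # empirical limit from this post, not an official figure


def build_request_body(comments):
    """Return the JSON request body for a dict of {id: text}."""
    documents = [{"id": str(doc_id), "text": text} for doc_id, text in comments.items()]
    return json.dumps({"documents": documents})


# The second comment contains double quotes; json.dumps escapes them for us.
body = build_request_body({1190: "thank you!", 1191: 'i thought it was "perfect", as is.'})
print(body)
print(len(body.encode("utf-8")) <= MAX_BODY_BYTES)
```

Measuring the encoded length rather than the number of comments is what matters here, since the limit applies to the whole body.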

Here are the imports. (There may be one or more here that aren’t needed. This was a long project.)

 

import urllib2
import urllib
import sys
import base64
import json
import os
import pandas as pd

 

Assuming our input text file is formatted as a CSV file, we import it into a Pandas DataFrame.

 

fileDir = 'C:/Users/Administrator/Documents/'
fileName = 'SurveySentiment.csv'
mydata = pd.read_csv(fileDir + fileName, header=0)

mydata.head() # look at the first few rows

 

First, let’s add a new column to our dataset containing a cleaned version of the survey comment text.

 

import re

#%% clean data
def clean_text(mystring):
    mystring = unicode(str(mystring), errors='ignore') # professional developer, don't try this at home
    mystring = mystring.decode('utf8') # change encoding
    mystring = re.sub(r"\d", "", mystring) # remove digits
    mystring = re.sub(r"_+", "", mystring) # remove consecutive underscores
    mystring = mystring.lower() # transform to lower case
    mystring = mystring.replace("  "," ") # replace double spaces with single spaces

    return mystring.strip()

mydata["Comment_cleaned"] = mydata.Comment.apply(clean_text) # adds the new column
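
For readers on Python 3, where unicode() no longer exists, here is a hedged sketch of the same cleaning steps. The name clean_text_py3 is hypothetical. It assumes that the intent of the first two lines above is to drop non-ASCII characters, and it deliberately collapses any run of spaces, whereas the version above replaces each pair of spaces once.

```python
import re


def clean_text_py3(value):
    """Python 3 take on clean_text: same steps, without the byte/unicode juggling."""
    text = str(value)
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII characters
    text = re.sub(r"\d", "", text)      # remove digits
    text = re.sub(r"_+", "", text)      # remove runs of underscores
    text = text.lower()                 # transform to lower case
    text = re.sub(r" {2,}", " ", text)  # collapse runs of spaces into one
    return text.strip()


print(clean_text_py3("  Thank YOU___ 100%   café!  "))  # prints: thank you % caf!
```

Note that, like the original, this turns a missing value into the literal string "nan" because of the str() call.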

 

Now that we have the data in a form we can iterate through, we need to feed it into a structure that can be sent to the Text Analytics API. In this case I broke the input into ten equally sized segments to keep each request body under 64 KB.

 

input_texts = pd.Series() #init a series to hold strings for submission to API
num_of_batches = 10

l = len(mydata)
for j in range(0,num_of_batches): # this loop will add num_of_batches strings to input_texts
    input_texts.set_value(j,"")   # initialize input_texts string j
    for i in range(j*l//num_of_batches,(j+1)*l//num_of_batches): #loop through a window of rows from the dataset
        comment = str(mydata['Comment_cleaned'][i])            #grab the comment from the current row
        comment = comment.replace("\"", "'") #swap double quotes for single quotes so they can't end the JSON string early

        #add the current comment to the end of the string we're building in input_texts string j
        input_texts.set_value(j, input_texts[j] + '{"id":"' + str(i) + '","text":"'+ comment + '"},')

    #after we've looped through this window of the input dataset, strip the trailing comma and add the request head and tail
    input_texts.set_value(j, '{"documents":[' + input_texts[j].rstrip(',') + ']}')
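
Ten equal segments happened to fit my data. A more general approach, sketched below under the assumption that the limit is on the byte size of the body, is to pack comments greedily until the next one would push the body over the limit. This is Python 3, batch_by_size is a hypothetical helper, and the 64 KB default is the empirical figure from this post.

```python
import json


def batch_by_size(comments, max_bytes=64 * 1024):
    """Greedily pack (id, text) pairs into JSON request bodies of at most max_bytes.

    A single comment larger than max_bytes still gets a body of its own.
    """
    head, tail, sep = '{"documents": [', "]}", ", "
    empty_size = len(head) + len(tail)
    bodies, parts, size = [], [], empty_size
    for doc_id, text in comments:
        # json.dumps escapes quotes and emits ASCII, so len() equals the byte count.
        part = json.dumps({"id": str(doc_id), "text": text})
        extra = len(part) + (len(sep) if parts else 0)
        if parts and size + extra > max_bytes:
            bodies.append(head + sep.join(parts) + tail)  # close the current batch
            parts, size, extra = [], empty_size, len(part)
        parts.append(part)
        size += extra
    if parts:
        bodies.append(head + sep.join(parts) + tail)
    return bodies


comments = [(i, "more tellers on busy days " * 20) for i in range(500)]
bodies = batch_by_size(comments)
print(len(bodies), max(len(b) for b in bodies))
```

The number of batches then falls out of the data instead of being fixed up front.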

 

Sentiment Analysis

Okay, now we have a series of ten strings in the correct format to be sent as requests to the sentiment API. Let’s loop through that series and call the API with each of the strings.

 

# Azure portal URL.
base_url = 'https://westus.api.cognitive.microsoft.com/'
account_key = '00e000e000f000b0aeb0000b0edec0000' # Your account key goes here.
headers = {'Content-Type':'application/json', 'Ocp-Apim-Subscription-Key':account_key}

num_detect_langs = 1

Sentiment = pd.Series() #initialize a new series to hold our sentiment results


batch_sentiment_url = base_url + 'text/analytics/v2.0/sentiment'

for j in range(0,num_of_batches):
    # Detect sentiment for each batch.
    req = urllib2.Request(batch_sentiment_url, input_texts[j], headers)
    response = urllib2.urlopen(req)
    result = response.read()
    obj = json.loads(result)

    #loop through each result, extracting the sentiment score associated with each id
    for sentiment_analysis in obj['documents']:
        Sentiment.set_value(sentiment_analysis['id'], sentiment_analysis['score'])

#tack our new sentiment series onto our original dataframe
#(this relies on every row getting a score, in the same order as the rows)

mydata.insert(len(mydata.columns),'Sentiment',Sentiment.values)
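
Here is a hedged Python 3 sketch of the same two steps: preparing the request, and pulling the scores out of the response keyed by id so they can be matched to rows by id instead of by position. The function names are hypothetical, the key below is a dummy, and the sample response is invented to mirror the documents/id/score fields used above. Nothing is sent over the network.

```python
import json
import urllib.request


def build_sentiment_request(url, body, account_key):
    """Prepare (but do not send) a POST request carrying one JSON batch."""
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": account_key,
    }
    return urllib.request.Request(url, data=body.encode("utf-8"), headers=headers, method="POST")


def scores_by_id(response_text):
    """Map each document id in a sentiment response to its score."""
    return {doc["id"]: doc["score"] for doc in json.loads(response_text)["documents"]}


request = build_sentiment_request(
    "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment",
    '{"documents": [{"id": "0", "text": "thank you!"}]}',
    "your-account-key",  # dummy value
)
# Sending it would be: urllib.request.urlopen(request).read().decode("utf-8")

sample_response = '{"documents": [{"id": "0", "score": 0.93}, {"id": "1", "score": 0.12}]}'
print(scores_by_id(sample_response))
```

With a dictionary keyed by id, a comment that comes back without a score simply has no entry, rather than shifting every score after it by one row.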

 

This is what the head of the dataframe looked like at this point in my project:

image

 

Now, just save our dataframe with sentiment to a file:

 

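Sentiment_file = fileDir + 'SurveySentiment_scored.csv' # output path (placeholder name, choose your own)
mydata.to_csv(Sentiment_file)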

 

Topic Detection

The topic detection API looks through a set of documents (like our comments), detects the topics in those documents, and scores the topics by frequency.

We still have our input_texts series containing ten strings in the format: { "documents": [ { "id": "string", "text": "string" } ] }

The request format for the detect topics API is a superset of the above format, but the format we have will work just fine if we don’t need to specify stop words or exclude topics.

 

{ "stopWords": [ "string" ], "topicsToExclude": [ "string" ], "documents": [ { "id": "string", "text": "string" } ] }
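
If you do want stop words or excluded topics, a small sketch like the following (Python 3, with a hypothetical helper name and made-up example words) adds the two optional fields only when they are supplied, so the plain documents-only body is still what you get by default.

```python
import json


def build_topics_body(documents, stop_words=None, topics_to_exclude=None):
    """Build a topic detection request body; the two extra lists are optional."""
    body = {"documents": documents}
    if stop_words:
        body["stopWords"] = list(stop_words)
    if topics_to_exclude:
        body["topicsToExclude"] = list(topics_to_exclude)
    return json.dumps(body)


docs = [{"id": "1783", "text": "more tellers on certain busy days ,for example mondays."}]
print(build_topics_body(docs))
print(build_topics_body(docs, stop_words=["bank"], topics_to_exclude=["survey"]))
```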

 

So, let’s iterate again through our series, calling the topic detection API this time:

 

import time

# Simple program that demonstrates how to invoke Azure ML Text Analytics API: topic detection.
headers = {'Content-Type':'application/json', 'Ocp-Apim-Subscription-Key':account_key}

TopicObj = pd.Series()

for i in range(0,num_of_batches):
    # Start topic detection and get the URL we need to poll for results.
    print('Starting topic detection.')
    uri = base_url + 'text/analytics/v2.0/topics'
    req = urllib2.Request(uri, input_texts[i], headers)
    response_headers = urllib2.urlopen(req).info()
    uri = response_headers['operation-location']

    # Poll the service every few seconds to see if the job has completed.
    while True:
        req = urllib2.Request(uri, None, headers)
        response = urllib2.urlopen(req)
        result = response.read()
        TopicObj.set_value(i,json.loads(result))

        if (TopicObj[i]['status'].lower() == "succeeded"):
            break

        print('Request processing.' + str(time.localtime()))
        time.sleep(10)

    print('Topic detection complete.')
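
The loop above waits forever if a job never reaches "succeeded". As a sketch only, here is a Python 3 polling helper with a timeout. The "failed" and "Running" status values are assumptions on my part (only "succeeded" appears in the code above), so check them against the API documentation. The HTTP call is replaced by a stand-in so the example runs offline.

```python
import time


def poll_until_done(fetch_status, interval=10, timeout=3600, sleep=time.sleep):
    """Call fetch_status() until the job succeeds, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch_status()
        status = result["status"].lower()
        if status == "succeeded":
            return result
        if status == "failed":  # assumed terminal status, not confirmed by this post
            raise RuntimeError("topic detection failed: %r" % (result,))
        if time.monotonic() >= deadline:
            raise TimeoutError("gave up after %s seconds" % timeout)
        sleep(interval)


# Stand-in for the real HTTP call: two "Running" answers, then success.
fake_responses = iter([{"status": "Running"}, {"status": "Running"}, {"status": "Succeeded"}])
result = poll_until_done(lambda: next(fake_responses), interval=0)
print(result["status"])  # prints: Succeeded
```

In real use, fetch_status would wrap the request to the operation-location URL and return the parsed JSON.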

 

This API takes some time. For my 4471 row dataset it took over half an hour. It returns a JSON string. Here’s how I used Pandas to process it:

 

TopicDF = pd.read_json(json.dumps(TopicObj[0]))

for i in range(1,num_of_batches):
    TopicDF = TopicDF.append(pd.read_json(json.dumps(TopicObj[i]))) # append returns a new dataframe, so keep the result


 

Then, to extract the ids, scores and key phrases out of that dataframe:

 

Topics_id = pd.Series()
for i in range(0,len(TopicDF.iloc[2][1])):
    Topics_id.set_value(i,TopicDF.iloc[2][1][i]['id'])

Topics_score = pd.Series()
for i in range(0,len(TopicDF.iloc[2][1])):
    Topics_score.set_value(i,TopicDF.iloc[2][1][i]['score'])

Topics_keyPhrase = pd.Series()
for i in range(0,len(TopicDF.iloc[2][1])):
    Topics_keyPhrase.set_value(i,TopicDF.iloc[2][1][i]['keyPhrase'])

Topics = Topics_id.to_frame()
Topics.insert(len(Topics.columns),'score',Topics_score.values)
Topics.insert(len(Topics.columns),'keyValues',Topics_keyPhrase.values)
Topics.rename(columns= {0:'id'}, inplace=True)

Topics_file = fileDir + 'Topics.csv'                      # output path (placeholder name, choose your own)
Topics.to_csv(Topics_file)                                # write out the topics to a file
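
For comparison, here is a hedged sketch of the same extraction without Pandas, working directly on a list of topic dictionaries with the id, score and keyPhrase fields used above. It is Python 3, topics_to_csv is a hypothetical helper, and the sample topics are invented. It also sorts by score so the most frequent topics come first.

```python
import csv
import io


def topics_to_csv(topics):
    """Write topic dicts (id, score, keyPhrase) as CSV text, highest score first."""
    ordered = sorted(topics, key=lambda t: t["score"], reverse=True)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "score", "keyPhrase"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()


sample_topics = [
    {"id": "t1", "score": 12, "keyPhrase": "tellers"},
    {"id": "t2", "score": 40, "keyPhrase": "wait time"},
]
print(topics_to_csv(sample_topics))
```

If you prefer to stay in Pandas, passing a list of dictionaries like this straight to pd.DataFrame gives one column per key, which should do the work of the three loops above.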

 

Now, to tie the topics back to individual comments:

 

TopicAssignments_documentId = pd.Series()
for i in range(0,len(TopicDF.iloc[1][1])):
    TopicAssignments_documentId.set_value(i,TopicDF.iloc[1][1][i]['documentId'])

TopicAssignments_topicId = pd.Series()
for i in range(0,len(TopicDF.iloc[1][1])):
    TopicAssignments_topicId.set_value(i,TopicDF.iloc[1][1][i]['topicId'])

TopicsAssignments_distance = pd.Series()
for i in range(0,len(TopicDF.iloc[1][1])):
    TopicsAssignments_distance.set_value(i,TopicDF.iloc[1][1][i]['distance'])

TopicAssignments = TopicAssignments_documentId.to_frame()
TopicAssignments.insert(len(TopicAssignments.columns),'topicId',TopicAssignments_topicId.values)
TopicAssignments.insert(len(TopicAssignments.columns),'distance',TopicsAssignments_distance.values)
TopicAssignments.rename(columns= {0:'documentId'}, inplace=True)

TopicAssignments_file = fileDir + 'TopicAssignments.csv' # output path (placeholder name, choose your own)
TopicAssignments.to_csv(TopicAssignments_file)
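
To make the link between topics and comments concrete, here is a small Python 3 sketch that joins the two lists on topic id and returns the key phrases for each comment. The helper name and the sample data are invented, and it assumes that a smaller distance means a closer match, which is worth verifying against the API documentation.

```python
from collections import defaultdict


def key_phrases_by_document(topics, assignments):
    """Return {documentId: [keyPhrase, ...]}, smallest distance first."""
    phrase_by_topic = {t["id"]: t["keyPhrase"] for t in topics}
    grouped = defaultdict(list)
    for a in sorted(assignments, key=lambda a: a["distance"]):
        grouped[a["documentId"]].append(phrase_by_topic[a["topicId"]])
    return dict(grouped)


topics = [
    {"id": "t1", "score": 12, "keyPhrase": "tellers"},
    {"id": "t2", "score": 40, "keyPhrase": "wait time"},
]
assignments = [
    {"documentId": "1783", "topicId": "t1", "distance": 0.35},
    {"documentId": "1783", "topicId": "t2", "distance": 0.10},
    {"documentId": "1191", "topicId": "t2", "distance": 0.20},
]
print(key_phrases_by_document(topics, assignments))
```

The same join is what a report tool does when you relate the two CSV files on the topic id column.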

 

Usage

Now that we have the sentiment of each comment, the topics across all comments, the frequency of each topic, and the association of each topic back to its comments, it’s straightforward to pull these CSVs into Power BI and build reports on them.


Good luck!
