As part of my grad school coursework, I had half a dozen XML files whose content needed to be analyzed for sentiment. AWS Comprehend is a service that analyzes text in a number of ways, one of which is sentiment analysis.

My options were to either copy and paste the content of 400 comments from these XML files by hand, or come up with a programmatic solution. Naturally, I chose the latter.

The XML file is formatted like so:

 

        <posts>
          <post id="123456">
            <parent>0</parent>
            <userid>user id</userid>
            <created>timestamp</created>
            <modified>timestamp</modified>
            <mailed>1</mailed>
            <subject>Post title</subject>
            <message> Message content </message>
          </post>
        </posts>

 

What I needed to get at was the message element of each post, as well as the post id.

The script imports BeautifulSoup to work with the XML and boto3 to work with AWS. We next define a string buffer (in practice, a Python list) to hold the results of the analysis.

Next we define the client, which tells AWS everything it needs to know: the service you’re after, the AWS region, and the access key ID and secret access key you’d use to authenticate against AWS.

After that we provide a list of XML files that the script needs to parse, and tell it to loop through and read each one.

We next tell BeautifulSoup to find every element named “post”. This saves us having to drill down through the entire hierarchy of the XML file.
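To try the extraction step without any AWS calls, here is a minimal sketch. It is not what the script uses: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs with no extra packages. The sample XML, the second post, and the extract_posts helper are hypothetical and only mirror the structure described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the post structure described above
SAMPLE_XML = """
<posts>
  <post id="123456">
    <parent>0</parent>
    <subject>Post title</subject>
    <message> Message content </message>
  </post>
  <post id="123457">
    <parent>123456</parent>
    <subject>Re: Post title</subject>
    <message>A reply</message>
  </post>
</posts>
"""


def extract_posts(xml_text):
    """Return a list of (post id, message text) pairs, one per <post> element."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    # iter('post') walks the whole tree, so nesting depth does not matter
    for post in root.iter("post"):
        message = post.findtext("message", default="")
        pairs.append((post.get("id"), message.strip()))
    return pairs


if __name__ == "__main__":
    for post_id, message in extract_posts(SAMPLE_XML):
        print(post_id, message)
```

The idea is the same as find_all('post'): ask for every post element wherever it sits, then read the id attribute and the message child from each one.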

 

Armed with a list of all of the posts in the current XML file, we loop through it. We first examine the size of the message (content) of the current post. If it exceeds 4999 bytes, we don’t send it to the API and instead record that the post was too big. The Comprehend API has a 5000 byte limit, which applies to the UTF-8 encoded text rather than to the number of characters.
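Because the limit is counted in bytes, a character count can understate the size of text that contains accented letters, curly quotes, or emoji. The sketch below shows one way to measure it; the helper names and the default limit of 5000 are assumptions taken from the figure quoted above, not something the Comprehend SDK provides.

```python
def utf8_size(text):
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits_limit(text, limit=5000):
    """True if the text encodes to fewer than `limit` bytes (assumed limit)."""
    return utf8_size(text) < limit


if __name__ == "__main__":
    plain = "a" * 4999       # 4999 characters, 4999 bytes
    accented = "é" * 3000    # 3000 characters, but 2 bytes each in UTF-8
    print(len(plain), utf8_size(plain), fits_limit(plain))
    print(len(accented), utf8_size(accented), fits_limit(accented))
```

The second string is well under 5000 characters yet still too large to send, which is the case a plain len() check on the string would miss.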

If the current post’s message is under 5000 bytes, we come to the point where we actually send it to the Comprehend service. We call the client’s detect_sentiment method, passing the message text and its language code, and store the result in the response object. This is the line that tells Comprehend specifically what you want it to do with the text you’re sending.
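The call itself can be wrapped in a small function that takes the client as a parameter, which makes it possible to dry-run the logic offline. This is only a sketch: get_sentiment and FakeComprehendClient are hypothetical names, and the fake returns fixed, made-up scores in the same shape that detect_sentiment returns. In real use you would pass the boto3 Comprehend client instead.

```python
class FakeComprehendClient:
    """Hypothetical offline stand-in; returns fixed, made-up scores."""

    def detect_sentiment(self, Text, LanguageCode):
        return {
            "Sentiment": "NEUTRAL",
            "SentimentScore": {
                "Positive": 0.0,
                "Negative": 0.0,
                "Neutral": 1.0,
                "Mixed": 0.0,
            },
        }


def get_sentiment(client, text, language_code="en"):
    """Send one piece of text for analysis; return (label, score dict)."""
    response = client.detect_sentiment(Text=text, LanguageCode=language_code)
    return response["Sentiment"], response["SentimentScore"]


if __name__ == "__main__":
    label, scores = get_sentiment(FakeComprehendClient(), "Message content")
    print(label, scores["Neutral"])
```

Keeping the client as an argument means the same function works with the real service or with a stub, so the XML handling can be tested without spending API calls.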

Comprehend should send back the response, from which we’ll extract the relevant attributes. Below is the format of the JSON that AWS sends back as the response:

 

{
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Positive": 0.0007247643661685288,
        "Negative": 0.012237872928380966,
        "Neutral": 0.9870284795761108,
        "Mixed": 0.000008856321983330417
    }
}

 

To access just the overall sentiment, read response['Sentiment']. To retrieve an individual score, read response['SentimentScore']['Positive'], substituting whichever score you’re after. The script below pulls every attribute out of the response and stores them in the string buffer.
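As an illustration of how those lookups turn into one caret-delimited line per post, here is a small sketch. The format_result helper and the sample values are hypothetical; only the key names come from the response format.

```python
SCORE_KEYS = ("Positive", "Negative", "Neutral", "Mixed")


def format_result(post_id, response, sep="^"):
    """Join the post id, sentiment label, and four scores into one line."""
    scores = response["SentimentScore"]
    fields = [str(post_id), response["Sentiment"]]
    fields += [str(scores[key]) for key in SCORE_KEYS]
    return sep.join(fields)


if __name__ == "__main__":
    # Made-up scores, chosen only to keep the output short
    sample = {
        "Sentiment": "NEUTRAL",
        "SentimentScore": {
            "Positive": 0.25,
            "Negative": 0.125,
            "Neutral": 0.5,
            "Mixed": 0.125,
        },
    }
    print(format_result("123456", sample))
    # prints 123456^NEUTRAL^0.25^0.125^0.5^0.125
```

A caret is unlikely to appear in the ids or scores, so the lines can be pasted into a spreadsheet and split on that character.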

The final step in the script is to print the string buffer that is now holding all of our responses.  Here’s the full script:

 


from bs4 import BeautifulSoup
import boto3

# Provide a string buffer (a list) to store the results in
stringBuffer = []

# Define the client, with which we will connect to AWS
client = boto3.client(
    service_name='comprehend',
    region_name='us-west-2',
    aws_access_key_id='<>',
    aws_secret_access_key='<>',
)

# Provide a list of files to loop through
files = ['file1.xml', 'file2.xml']

for file in files:
    with open(file, 'r', encoding="utf8") as f:
        data = f.read()

    # Pass the stored data to the BeautifulSoup parser
    # (the 'xml' parser requires the lxml package to be installed)
    xmlData = BeautifulSoup(data, 'xml')

    # Find all instances of a "post" element in the XML
    xmlPosts = xmlData.find_all('post')

    for post in xmlPosts:
        # Use only the text inside <message>, not the surrounding tags
        message = post.message.get_text()

        # The limit is in bytes, so measure the UTF-8 encoding, not the character count
        if len(message.encode('utf-8')) > 4999:
            stringBuffer.append(str(post['id']) + "^" + "was too big")
        elif not message.strip():
            # Comprehend rejects empty text, so skip the API call
            stringBuffer.append(str(post['id']) + "^" + "was empty")
        else:
            # Get the response
            response = client.detect_sentiment(
                Text=message,
                LanguageCode='en',
            )
            scores = response['SentimentScore']
            stringBuffer.append("^".join([
                str(post['id']),
                str(response['Sentiment']),
                str(scores['Positive']),
                str(scores['Negative']),
                str(scores['Neutral']),
                str(scores['Mixed']),
            ]))

# Print the results once, after every file has been processed
for line in stringBuffer:
    print(line)

 

 

 
