Spying on the Office Step 2 - The Path to Natural Language Processing

Using Natural Language Processing to analyze Slack messages.

In our first post about spying on the office, we outlined how we were capturing Slack messages. Now we want to analyze those messages using Natural Language Processing (NLP). With no prior experience in NLP, we started exploring our options. Our first goal was to uncover any recurring topics. In a later post, we will discuss how we used Apache Spark to determine the most popular topic over a period of time.

hotdog in mac and cheese

What do we even do with this message?

In our initial review of available NLP libraries, it seemed they were geared toward analyzing full sentences. While we all have that friend who writes two or three text messages to communicate a single thought, people rarely write in full sentences in a single instant message. We have a few ideas on how to combine some of these abbreviated messages, but one challenge is the time span between messages, which could throw off the results if we combine them before analyzing. We'll get to that in a future post. For now, we'll analyze each message individually and accept that some results might not make sense.

Since we wanted to use the NLP results with Apache Spark, we needed a good integration tool. A few Google searches later, we found the Stanford CoreNLP wrapper for Apache Spark. After tinkering with it a bit, we learned that it works best when analyzing one full sentence at a time. We also determined that the part-of-speech tagger was the most useful component for us. A little investigation into the parts of speech returned by Stanford CoreNLP led us to the Penn Treebank Project, whose tags let us filter the results of each message down to the proper nouns found in it. The results out of the gate were fairly decent, but we were essentially limited to one-word nouns as potential topics of a message, which misses subjects like "ice cream" or "New York City". That type of analysis left us with just "City", "ice", or "cream" in the results. Distilling each message to single nouns didn't really reveal the true topic and would make it difficult to identify a popular one.
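
That first pass looked roughly like the sketch below, built on the wrapper's tokenize and pos helpers. Treat it as a minimal illustration rather than our exact code: the "messages" DataFrame and its "text" column are placeholder names standing in for our captured Slack data.

  import org.apache.spark.sql.functions.{col, udf}
  import com.databricks.spark.corenlp.functions.{pos, tokenize}

  // Keep a token only when its Penn Treebank tag marks a proper noun (NNP/NNPS).
  val properNouns = udf((words: Seq[String], tags: Seq[String]) =>
    words.zip(tags).collect { case (word, tag) if tag.startsWith("NNP") => word })

  // "messages" and its "text" column stand in for our captured Slack data.
  val topics = messages.select(
    col("text"),
    properNouns(tokenize(col("text")), pos(col("text"))).as("topics"))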

We continued researching our open-source NLP options to find something that would create noun phrases for us. While looking at OpenNLP, we stumbled on a presentation about spaCy, which revealed a "phrases" module that extracts noun phrases from text. Since spaCy is written in Python, we looked more closely at OpenNLP and Stanford's NLP libraries to see if they offered a similar feature. Luckily, Stanford's NLP offered a Lexicalized Parser that would do what we wanted, but it wasn't built into the Scala wrapper we found. That didn't stop us from using it, though, since Scala runs on the JVM.

  import java.io.StringReader

  import scala.collection.mutable.ListBuffer

  import edu.stanford.nlp.ling.CoreLabel
  import edu.stanford.nlp.parser.lexparser.LexicalizedParser
  import edu.stanford.nlp.process.{CoreLabelTokenFactory, PTBTokenizer, TokenizerFactory}
  import org.apache.spark.sql.functions.udf

  def getNounPhrases = udf((x: String) => {
    // Load the English PCFG model and a Penn Treebank tokenizer.
    val lp: LexicalizedParser = LexicalizedParser.loadModel(
           "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
    val tokenizerFactory: TokenizerFactory[CoreLabel] =
           PTBTokenizer.factory(new CoreLabelTokenFactory(), "")
    lp.setOptionFlags("-outputFormat", "penn,typedDependenciesCollapsed",
           "-retainTmpSubcategories")

    // Tokenize the message and take the parser's best parse tree.
    val rawWords = tokenizerFactory.getTokenizer(new StringReader(x)).tokenize
    val bestParse = lp.parseTree(rawWords)
    val nounPhrases = new ListBuffer[String]

    // Walk every subtree and keep any labeled as a noun phrase (NP).
    val iterator = bestParse.iterator()
    while (iterator.hasNext) {
      val subtree = iterator.next()
      if (subtree.label().value == "NP") {
        // Rebuild the phrase by joining the subtree's words with spaces.
        val sb = new StringBuffer
        val phraseIterator = subtree.labeledYield().iterator()
        while (phraseIterator.hasNext) {
          sb.append(phraseIterator.next().value)
          if (phraseIterator.hasNext) sb.append(" ")
        }
        nounPhrases += sb.toString
      }
    }
    nounPhrases.toList
  })
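
With the UDF defined, applying it to a DataFrame of captured messages is a one-liner. As in the earlier sketch, "messages" and its "text" column are placeholder names rather than our exact pipeline:

  import org.apache.spark.sql.functions.col

  // "messages" is a hypothetical DataFrame with one Slack message per row.
  val withTopics = messages.withColumn("nounPhrases", getNounPhrases(col("text")))
  withTopics.select("text", "nounPhrases").show(truncate = false)

One caveat: because the parser model is loaded inside the UDF, it gets reloaded for every message, so sharing a single parser instance per executor would be a worthwhile optimization.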

Below is a random sampling of the phrases and subjects the Scala code above extracted. We hope to keep refining it, but it did pick out xcode and IPA! We know someone was talking about programming a mobile app and beer!

  • a file
  • a wiki page that basically was an export
  • the app
  • the ipa
  • xcode

This analysis is much closer to the final result we want. In most cases, it does a great job of keeping phrases like "New York City" and "ice cream" together. As we continue to experiment, one of our next steps will be to provide training data to Stanford CoreNLP to see if that improves the results. We would also like to try combining some of the messages into a single text grouping before analysis, which we think may further improve the output.

Stay tuned for our next blog post on applying Apache Spark to the NLP results to determine which topics were most frequently discussed in the office Slack channels.