Unstructured Survey Data: A Practitioner’s Approach to Extracting Value from and Adding Value to Customer Feedback
We frequently field inquiries from customers about how we can help them handle ever-rising volumes of unstructured data. The curious thing about these inquiries is that they often come from companies already employing sophisticated natural language processing (NLP) tools. It’s not so much that the tools don’t work. It’s that what they produce is neither particularly insightful nor actionable. That comes as both a shock and a disappointment to customers who bought into the promise of tools that replace the time- and labor-intensive task of coding and making sense of vast quantities of open-text data.

No matter how sophisticated these tools may be, however, they can deliver only limited value unless they are part of an accompanying strategy. That strategy should, in most cases, derive from the company’s Voice of the Customer (VoC) program, because most unstructured data takes the form of customer feedback – positive, but more often negative, since consumers typically complain more than they praise. Effective VoC programs have processes that take customer feedback and translate it into action. In fact, the implementation of NLP should simply be a matter of automating a single task in what is already a defined, refined, and proven process. This, then, is how we as practitioners help our customers extract value from their unstructured survey data.
It’s all about the process
Figure 1: Automated text coding yields the greatest dividends when done within the context of a formal process to capture and act on customer feedback.
Introducing NLP into a process that doesn’t already operate effectively is just automating inefficiency. NLP simply allows a company to increase dramatically the volume of unstructured data it can process. At a basic level, therefore, NLP tools should mimic the way that real people currently make sense of unstructured data. That means teaching the tools how to allocate individual instances of feedback into meaningful buckets. This can be tricky. As anyone who has ever spent time coding open-text feedback will tell you, an individual’s feedback can often be legitimately allocated to one of several plausible buckets. Take this simple suggestion.
It contains words that could qualify it for allocation to any of several typical feedback buckets (information-related, need-related, usability-related, findability-related). Human coders will often keep track in their heads of the different possibilities and then allocate items in the way they feel best balances the overall analytical picture. These allocation decisions, repeated item after item, make the manual process daunting. Consider the range of topics covered in just this tiny segment of a brand site’s feedback on suggested changes to its website.
Figure 2: Companies must often face a daunting volume of feedback covering an immense range of topics
Teaching the tools to undertake the coding, however, begins with a manual coding exercise that you KNOW reflects the most accurate picture of the feedback. Coding about 1,000 responses is usually sufficient. The picture that emerges typically looks something like this.
Figure 3: Unstructured data becomes infinitely more versatile when it becomes structured
Once you have your coding protocols in place, the next step is to convert them into rules the NLP tools will understand. Stand-alone NLP tools can be very powerful, but the real value of putting structure to unstructured data lies in making it usable alongside all your other structured VoC data: cross-tabbing it, using it to add context, or using it to indicate causality. Most of the VoC data we handle for our customers comes in the form of survey data, primarily (but not exclusively) from website surveys. We use Tableau to visualize, organize, and cross-tab the data. Tableau has limited text-handling tools, but they can be used to great effect if employed imaginatively. The first task we set for Tableau is to mimic our coding protocol. We do this by creating a sorting formula for Tableau to apply against the open-text data. The formulas themselves are simple, but they can get rather long, so we tend to build them in a tool like Excel and then copy them into Tableau. (NLP software has technical terms like stemming and lemmatization for some of these rules, but most of them are elaborate descriptions of common sense.)
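To make the idea concrete, here is a minimal sketch of how such a formula might be generated programmatically instead of assembled by hand in Excel. The bucket names, keywords, and the field name `[Open Text]` are illustrative assumptions, not our actual protocol; the output is a prioritized IF/ELSEIF calculated field of the kind Tableau accepts.

```python
# Build a prioritized IF / ELSEIF CONTAINS(...) calculated-field formula
# for Tableau from an ordered list of (bucket, keywords) rules.
# Bucket names, keywords, and the field name are illustrative only.

def build_tableau_formula(rules, field="[Open Text]"):
    lines = []
    for i, (bucket, keywords) in enumerate(rules):
        # Any keyword match sends the response to this bucket.
        test = " OR ".join(
            f"CONTAINS(LOWER({field}), '{kw}')" for kw in keywords
        )
        prefix = "IF" if i == 0 else "ELSEIF"
        lines.append(f"{prefix} {test} THEN '{bucket}'")
    lines.append("ELSE 'Other' END")
    return "\n".join(lines)

rules = [
    ("Need-related",        ["need"]),
    ("Findability-related", ["find", "search", "locate"]),
    ("Usability-related",   ["confusing", "hard to use", "navigation"]),
    ("No change",           ["nothing", "no change"]),
]

print(build_tableau_formula(rules))
```

Because the rule list lives in one place, reprioritizing the buckets is a matter of reordering the list and regenerating the formula, rather than hand-editing hundreds of lines.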
Figure 4: Coding instructions typically run to hundreds of lines; we find it’s often more efficient to build them in Excel
There are several keys to executing this step effectively. Because automated routines don’t have the imagination to “balance” allocations on the fly the way a human coder does, the routine needs to be strictly prioritized: it must identify the most important (actionable, insightful) buckets first. Otherwise, the most valuable pieces of feedback risk getting lost in larger, less useful buckets simply because they contain elements the routine could match to those buckets. (Take a piece of feedback like this: “There’s really nothing I would change except for perhaps a simpler way to help me recover my password.” An automated routine could easily seize on the phrase “nothing I would change,” allocate the item to the “No change” bucket because it occurs in the main clause, and miss the value of the qualification that follows.) In the example above, we have allocated any piece of feedback that contains the word “need” to the first bucket. (Site visitors whose needs cannot be met should be thoroughly investigated. Are they the target audience? Do they have a legitimate need we have yet to identify? Is that need widespread? Is it a quantifiable opportunity? Is it costing us money, loyalty, advocacy?) The sequence then runs down the list of priorities appropriate to the channel: the buckets and keywords for a billing center will differ from those for a website, a retail store, or a support hotline. This step may require several iterations before the picture it yields resembles the one created by the manual coding exercise. But once it is done, you can be confident that the routine operates on the same principles as your best human coder. Now, however, it can process vast amounts of data in no more than the blink of an eye!
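The first-match-wins logic described above can be sketched in a few lines. This is an illustrative toy, not our production routine; the buckets and keywords are invented, but the priority ordering shows how the “password” qualification survives even though the comment also contains “nothing I would change.”

```python
# First-match-wins bucketing: rules are checked in priority order, so
# high-value buckets are tested before catch-alls like "No change".
# Bucket names and keywords are illustrative, not a real protocol.

RULES = [
    ("Need-related",        ["need"]),
    ("Password-related",    ["password", "log in", "login"]),
    ("Merchandise-related", ["merchandise", "product", "stock"]),
    ("No change",           ["nothing", "no change"]),
]

def allocate(feedback: str) -> str:
    text = feedback.lower()
    for bucket, keywords in RULES:  # priority order matters
        if any(kw in text for kw in keywords):
            return bucket
    return "Other"

# Because "Password-related" outranks "No change", the qualified
# suggestion below is not lost in the catch-all bucket.
print(allocate("There's really nothing I would change except for "
               "perhaps a simpler way to help me recover my password."))
# -> Password-related
```

Reversing the rule order would send the same comment to “No change,” which is exactly the failure mode the prioritization guards against.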
Adding structure to the unstructured
At this point, we have put structure to our unstructured data. That, in itself, is valuable, but it is only the start of what can be done to extract value, because now the data can be depicted in visually compelling formats and cross-tabbed against all our other structured data. Cross-tabbing adds context and insight. It is now possible, for example, to show how feedback relates to the different types of site visitor. Is there a difference in topic popularity by role? In the Treemap below, you can determine at a glance that merchandise-related issues are a much bigger focus for purchasing managers than they are for chefs. Why is that? It’s likely a productive path to follow. And knowing which analytical paths to investigate is half the battle when looking at survey data. Most analysts practice triage because they come up with far more hypotheses than they have time to pursue. Being able to recognize gaps quickly means that you can run more top-line hypotheses in the same amount of available time. Consider what becomes visible, for example, if you cross-tab the suggestions by type of device used. Is there a difference by region, by income, by online campaign? Visualization can make these answers readily accessible.
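The underlying operation is a simple cross-tabulation of the coded bucket against any structured field. A stdlib sketch, using made-up roles and counts purely for illustration:

```python
# Cross-tab coded feedback buckets against a structured field (visitor
# role). The responses below are invented sample data.
from collections import Counter

responses = [
    {"role": "Purchasing manager", "bucket": "Merchandise-related"},
    {"role": "Purchasing manager", "bucket": "Merchandise-related"},
    {"role": "Chef",               "bucket": "Findability-related"},
    {"role": "Chef",               "bucket": "Merchandise-related"},
]

# Count each (role, bucket) pair, then print a small table.
crosstab = Counter((r["role"], r["bucket"]) for r in responses)
for (role, bucket), n in sorted(crosstab.items()):
    print(f"{role:20} {bucket:22} {n}")
```

Swapping “role” for device type, region, or campaign is the same one-line change, which is what makes structured open-text data so cheap to interrogate from new angles.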
Figure 5: Visualization types should serve the cause of pattern (or gap) recognition
When you want to introduce additional dimensions into your scenarios, however, relationships can be easier to discern when presented in tabular form. Tabular nesting also allows for the inclusion of actual text, which adds definitive context.
Figure 6: Nesting is an analytically convenient way of following hypotheses
Over time, the coded data can reveal the impact of remedial actions or opportunities pursued. In this next illustration, we see that the volume of “No changes” responses (blue line) rises over time and some of the key areas of concern trend downwards (khaki line for Merchandise-related) as remedial actions take effect. Other buckets show increased rates of response (green line for Page layout-related), which may indicate emerging priorities.
These scenarios, however, are only the first level of value you can add to your unstructured data. The same methods can be replicated to process the data from a variety of perspectives. Instead of coding the data from a need/problem viewpoint, the tool can parse it in a noun-first manner to present a topic-frequency view. The same can be done with a verb-first routine to look at a user-behavior perspective. You might also want to run a “sentiment” analysis by looking for positive nouns and adjectives, along with negatives and negative contractions. You can identify which nouns, verbs, or adjectives to rank at the top of your routines by undertaking a simple exercise in reducing each piece of feedback to a two-word essence. These two words might be noun-verb, verb-adverb, or noun-adjective combinations.
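As one example of such an additional pass, here is a crude “sentiment” scorer of the kind described above: it counts hits against small positive and negative word lists, with negative contractions included in the negative lexicon. The word lists are illustrative assumptions; a real pass would be built from your own coded feedback.

```python
# A crude lexicon-based "sentiment" pass: score each comment by
# positive hits minus negative hits, including negative contractions.
# The word lists are illustrative, not a real sentiment lexicon.
import re

POSITIVE = {"great", "easy", "love", "helpful", "fast"}
NEGATIVE = {"slow", "confusing", "broken", "can't", "won't", "didn't"}

def sentiment(comment: str) -> int:
    # Keep apostrophes so contractions like "can't" survive tokenizing.
    words = re.findall(r"[a-z']+", comment.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("Love the site, but search is slow and confusing"))  # -> -1
```

The score itself matters less than the trend: run the same pass over each month’s feedback and the direction of travel becomes visible.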
Figure 7: Reducing a piece of feedback to its two-word essence is a productive exercise
If you go through the exercise for, say, 100 instances of feedback, you will be able to rank the parts of speech by frequency and then set your coding priorities accordingly. Each pass at the data will provide a different perspective, and each perspective will add another dimension of insight. These additional passes take set-up time, but that time is an essential investment, one that assures you that the automated routine is a true facsimile of how your best analyst would have interpreted the same data. Now, however, you have the routines in place to run huge volumes of unstructured data, confident in the integrity of the output and, most importantly, in the decisions you make on the basis of it.
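The tallying step of that exercise is trivial to automate once the two-word essences exist. The essences below are invented examples; the pattern is simply a frequency count over pairs and over individual words.

```python
# Tally two-word "essences" to decide which terms should rank at the
# top of the coding routines. The essences below are invented examples.
from collections import Counter

essences = [
    ("password", "recover"),
    ("search",   "improve"),
    ("password", "recover"),
    ("layout",   "simplify"),
    ("password", "reset"),
]

# Frequency of whole essences...
freq = Counter(essences)
for (first, second), n in freq.most_common():
    print(f"{first}-{second}: {n}")

# ...and of individual words across all essences, which suggests which
# nouns/verbs deserve top priority in the automated routine.
words = Counter(w for pair in essences for w in pair)
print(words.most_common(2))
```

In this toy sample, “password” dominates, so password-related keywords would move to the head of the routine.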
-Roger Beynon, CSO, Usability Sciences Corporation