
Cracking the challenge of unstructured medical text

Published On
June 1, 2020

Natural Language Processing (NLP) is a technology built to help computers understand human language. Many advances have been made in recent years as artificial intelligence research has intersected with NLP.  Today, NLP is used for a wide range of applications like translation, voice assistants, document classification, and more.

At DigitalOwl, we apply NLP to medical insurance, helping Underwriters and Claim Analysts assess the medical records of applicants and insureds. With advanced algorithms, we identify the meaningful information in medical documents (medical conditions, dates, body parts, treatments, outcomes, etc.). Just as important, we extract pertinent non-medical phrases that are critical to understanding the full context of the subject’s medical history for insurance purposes (return to work, ADLs, restrictions and limitations, etc.).

As pioneers in applying NLP in the insurance industry, we face unique challenges that arise from combining NLP with medical information, such as the variety of writing styles among physicians and the amount of information in each case.

This article focuses on the fascinating solution we’ve developed for understanding the context of words in a medical document: analyzing the position of words in the document.

The meaning of the position of words in a sentence:

The order of words in a sentence matters. Different orderings of the same words produce different meanings. The set of words I / Like / Do / Not / Why / Trips can carry a positive or a negative meaning depending on the order of the words:

"Why do I not like trips?" -Vs.- "I do like trips, why not?"

Imagine that you come home after a long working day, and your partner says "You seem to have gone through a hard day, you deserve a long rest," but the words are mixed, and instead you hear "You seem to have gone through a long day of rest, you deserve hard work."
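As a small illustration of the point above, the following Python sketch compares the two "trips" sentences. The `tokens` helper is a hypothetical name introduced here, not part of any DigitalOwl code. It shows that the two sentences contain exactly the same words and differ only in order.

```python
from collections import Counter

def tokens(sentence):
    """Lowercase a sentence and split it into words, dropping punctuation."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in sentence.lower())
    return cleaned.split()

a = tokens("Why do I not like trips?")
b = tokens("I do like trips, why not?")

# Same words with the same counts, so an order-blind comparison sees no difference.
same_bag = Counter(a) == Counter(b)

# The sequences differ, and the order is what carries the change in meaning.
same_order = a == b

print(same_bag, same_order)
```

A representation that ignores order would treat both sentences as identical, which is why sequence information is kept in the first place.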

In most NLP tasks, text is analyzed as a sequence. That means every word is assigned a number according to its place in that sequence. The computer goes through the text line by line without considering the page structure at all.

The meaning of the position of words on the page:

To understand the text, mainstream NLP models index each word using a simple sequence.  For example, the top left word is “1”, the next word to the right is “2”, and so forth, line by line.
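The line-by-line numbering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `index_by_reading_order`, the `(text, x, y)` tuple layout, and the page content are all made up for the example, and words on the same line are assumed to share exactly the same `y` value (real OCR output would need line grouping first).

```python
def index_by_reading_order(words):
    """Number words line by line: top to bottom, then left to right.

    `words` is a list of (text, x, y) tuples; y grows downward.
    Returns a list of (index, text) pairs, with indices starting at 1.
    """
    ordered = sorted(words, key=lambda w: (w[2], w[1]))
    return [(i, text) for i, (text, _x, _y) in enumerate(ordered, start=1)]

# Made-up page content, deliberately given out of order.
page = [
    ("pain", 50, 10),
    ("General", 10, 5),
    ("reports", 30, 10),
    ("Hospital", 30, 5),
    ("Patient", 10, 10),
]

indexed = index_by_reading_order(page)
for index, text in indexed:
    print(index, text)
```

Once the page is flattened this way, only the index survives. The `x` and `y` values, and with them the page layout, are discarded.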

But this isn’t good enough.  As humans, when we read a document, we not only scan the text from left to right, but our brain also directs us to "strategic" places on the page, searching for familiar patterns.

For example, in medical records, the date in one of the top corners of the page is usually the visit date (even if several dates appear in the text), and the name in the top right corner is often the hospital name.

That's why we've developed a model that is aware of the locations of the words on the page. Let's say you have a page with two lists of medical findings:

As we mentioned, one way to process the words is to index them by sequence from left to right.

With this processing method, the model receives the following input:

[Figure: NLP models index each word using a simple sequence.]

Given only this input, how can the model possibly know whether Anamnesis (word 12) is an existing or a non-existent condition?

Our solution is to give the model all of the information:

[Figure: With DigitalOwl's AI, every word is coordinated in space.]

In this way, every word is assigned coordinates on the page. The word “Hand” gets the coordinates (20, 14), and “Anamnesis” gets (28, 57). The model therefore receives the full structure of the page and can easily tell that Anamnesis is a non-existent condition.
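To make the benefit of coordinates concrete, here is a minimal Python sketch. It is not DigitalOwl's model: a hand-written rule stands in for what a trained model would learn from the coordinates. The `Word` class, the `assign_to_heading` function, the headings "Present" and "Absent", and all findings and coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # horizontal position in made-up units, 0 = left edge
    y: float  # vertical position in made-up units, 0 = top edge

def assign_to_heading(word, headings):
    """Return the text of the heading that sits above the word and is
    horizontally closest to it. Assumes at least one heading is above."""
    above = [h for h in headings if h.y < word.y]
    return min(above, key=lambda h: abs(h.x - word.x)).text

# A made-up page with two side-by-side lists of findings.
headings = [Word("Present", 10, 5), Word("Absent", 60, 5)]
findings = [
    Word("Fracture", 10, 10), Word("Diabetes", 60, 10),
    Word("Edema", 10, 15), Word("Asthma", 60, 15),
]

by_position = {w.text: assign_to_heading(w, headings) for w in findings}
print(by_position)
```

With a plain left-to-right sequence, the two lists would be interleaved ("Fracture", "Diabetes", "Edema", "Asthma") and the column membership would be lost. With coordinates available, each finding can be tied back to the list it belongs to.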

Sometimes it is not just the context between words that is location-dependent, but also the role of each word. A page may contain many dates, but each page has only one printed date. This date is often written in the top right corner (as you can see in the following image).

These capabilities make our NLP model more precise and faster.

Left Picture: The focus on finding the visit date. Right Picture: The focus on finding medical conditions.

Of course, this does not mean the model relies only on location, but location certainly helps it assign a better meaning to each word.
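One way to picture location as a helpful signal rather than the only one is to blend a position score with a text-based score. The Python sketch below is an assumption-laden illustration, not the actual model: `corner_prior`, `pick_visit_date`, `position_weight`, the normalized page coordinates, the `text_score` values, and the candidate dates are all invented for the example.

```python
import math

def corner_prior(x, y):
    """Score in [0, 1] that is highest at the top right corner.

    Coordinates are normalized to the page: x=0 is the left edge, x=1 the
    right edge, y=0 the top, y=1 the bottom.
    """
    distance = math.hypot(1.0 - x, y)
    return 1.0 - distance / math.sqrt(2)

def pick_visit_date(candidates, position_weight=0.5):
    """Pick the most likely visit date from (date_text, x, y, text_score) tuples.

    `text_score` is a hypothetical score in [0, 1] from the surrounding words;
    `position_weight` controls how much the page position contributes.
    """
    def score(candidate):
        _text, x, y, text_score = candidate
        return position_weight * corner_prior(x, y) + (1 - position_weight) * text_score
    return max(candidates, key=score)[0]

# Made-up candidates with identical text scores, so position decides.
candidates = [
    ("03/12/2019", 0.90, 0.05, 0.5),  # printed near the top right corner
    ("01/05/2018", 0.20, 0.40, 0.5),  # mentioned in the body text
    ("11/20/2017", 0.35, 0.70, 0.5),  # mentioned further down the page
]

visit_date = pick_visit_date(candidates)
print(visit_date)
```

Setting `position_weight` to 0 makes the choice depend on the text alone, which mirrors the point that location informs the decision without fully determining it.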

Imagine if you were tasked with locating a doctor's name within a document. Would you instinctively begin at the top left corner and methodically scan each line? Probably not. Similarly, NLP models don’t have to rely on such rigid sequential processing.

Matanya Hatan
Data Science Team Lead, DigitalOwl
About the author

Matanya is the Data Science Team Lead at DigitalOwl, bringing over four years of invaluable experience to the role. His steadfast guidance ensures the delivery of impactful outcomes and propels the team's success.