Posted on June 16, 2016
An ongoing challenge in large-scale epidemiological studies is processing and using occupational exposure data, the first step in identifying and characterizing occupational exposures that may contribute to cancer risk. Typically, study participants describe their jobs in free-text responses to open-ended questions about their title, employer, and work tasks. Standardizing these job descriptions is crucial, but manual coding is time-consuming and expensive.
Melissa Friesen, Ph.D., a tenure-track investigator in the Occupational and Environmental Epidemiology Branch (OEEB), and Daniel Russ, Ph.D., in the NIH Center for Information Technology, sought to address this problem by developing SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiologic Research), an algorithm to efficiently identify, classify, and code occupations. Their application, which is publicly available, makes it easier for epidemiological researchers to incorporate occupational exposures into their studies.
SOCcer assigns standardized occupational codes, or SOCs, to free-text job descriptions using machine learning techniques based on natural language processing. The application is unique in its ability to estimate occupational exposure measurements from several components of a job, not just its title, thus providing insight into the full experience of the employee.
“Our goal was to estimate workplace exposures over an employee’s entire working life—sometimes as long as 50 years,” Dr. Friesen said. Lifetime occupational questionnaires were a key tool, in addition to detailed follow-up questions to certain subsets of participants.
To assign the most accurate code, SOCcer incorporates study participants’ responses to open-ended questions about their job title, job description, and other aspects of their employment history. Some job descriptions have multiple plausible codes; therefore, each candidate SOC code is weighted and assigned a rank. By analyzing additional details of a job description, SOCcer can distinguish between similar jobs, or jobs with similar or identical titles, and the highest-ranking SOC code is assigned to the job.
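The idea of ranking candidate codes can be illustrated with a small sketch. This is not the SOCcer algorithm: the keyword profiles, the weights, and the function name rank_codes are all hypothetical, and simple token overlap stands in for the machine learning model. It only shows how adding task details can separate jobs that share a title.

```python
# Toy illustration only: keyword overlap stands in for SOCcer's actual model.
# The keyword profiles and weights below are hypothetical.
PROFILES = {
    "47-2111": {"electrician", "wiring", "install", "circuits"},
    "17-2071": {"electrical", "engineer", "design", "circuits"},
    "49-2094": {"electrical", "repair", "equipment", "industrial"},
}


def rank_codes(title, tasks, title_weight=2.0, task_weight=1.0):
    """Score every candidate code and return (code, score) pairs, best first."""
    title_tokens = set(title.lower().split())
    task_tokens = set(tasks.lower().split())
    scores = {}
    for code, keywords in PROFILES.items():
        # Weighted count of profile keywords found in each free-text field.
        scores[code] = (title_weight * len(keywords & title_tokens)
                        + task_weight * len(keywords & task_tokens))
    # Highest score first; ties are broken by code so the order is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # The title alone leaves two codes tied; the task text breaks the tie.
    print(rank_codes("electrical worker", ""))
    print(rank_codes("electrical worker", "repair industrial equipment"))
```

In this toy example the title "electrical worker" fits two codes equally well, and only the task description moves one of them clearly to the top of the ranking.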
Evaluation of the algorithm in two case studies found that it reliably replicated manually assigned occupation codes; SOCcer’s overall agreement with expert coders ranged from 45% at the most detailed level of the classification system to 76% at a major grouping level. Its agreement with manual coding increased with algorithm score; low-scoring job descriptions are more likely to require expert review than high-scoring job descriptions.
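A rough sketch of how agreement can be measured at two levels of the classification, and how scores might be used to flag descriptions for expert review, is shown here. It assumes SOC-style codes in which the first two digits identify the major group; the function names, the record layout, and the 0.3 threshold are hypothetical and are not taken from the SOCcer evaluation.

```python
def agreement(pairs, level="detailed"):
    """Fraction of (algorithm_code, expert_code) pairs that match.

    level="detailed" compares full codes; level="major" compares only the
    first two digits, which identify the major group in SOC-style codes.
    """
    def key(code):
        return code[:2] if level == "major" else code

    matches = sum(key(algo) == key(expert) for algo, expert in pairs)
    return matches / len(pairs)


def needs_review(records, threshold=0.3):
    """Return the records whose score falls below a hypothetical threshold."""
    return [record for record in records if record["score"] < threshold]


if __name__ == "__main__":
    # Made-up code pairs, for illustration only.
    pairs = [
        ("47-2111", "47-2111"),  # exact match
        ("47-2111", "47-2152"),  # same major group, different detailed code
        ("17-2071", "49-2094"),  # different major group
        ("29-1141", "29-1141"),  # exact match
    ]
    print(agreement(pairs, "detailed"))
    print(agreement(pairs, "major"))

    records = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.1}]
    print(needs_review(records))
```

Agreement at the major grouping level can only be equal to or higher than agreement at the detailed level, because any two codes that match in full also share a major group.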
“The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” Dr. Friesen said.
Dr. Friesen credits the success of SOCcer to wide-ranging support from collaborators and Division leadership. “Dr. Russ, a computer scientist, was especially critical to this project,” she said. “In addition, Dr. Chanock’s support of technological tools that can advance NCI research allowed for SOCcer to be made available as an online application.”
The team was awarded funds from the 2015 DCEG Informatics Tool Challenge for its companion software SOCAssign, which loads the SOCcer output into a visual display to aid expert coders in manual code assignment and reconciliation of poor computer-based matches. SOCAssign is freely available for download to users of the SOCcer website.
Dr. Friesen and collaborators continue to make improvements to the original SOCcer algorithm; SOCcer 2.0 will better capture the specific tasks associated with each job code. “It’s not your job title, but what you actually do at your job that matters,” Dr. Russ said. Better use of the additional text on work tasks may assist in building an even more accurate algorithm. Dr. Friesen and collaborators recently were awarded funds from the 2016 DCEG Informatics Tool Challenge for SOCcer 2.0.
Read more information about the SOCcer algorithm.