
Clinician Data Scientists?

It is widely accepted that data science is an in-demand profession for all industries including healthcare. The unfortunate reality is that there are not enough data scientists to meet the needs of the various fields. Data science is a complex field, demanding programming skills, database management, statistics, higher mathematics, machine learning, soft skills, and domain expertise. Most data scientists have advanced degrees; 49 percent hold a Master's degree, and 41 percent hold a PhD.

Because the shortage of data scientists is expected to persist for the foreseeable future, alternatives have been suggested. Gartner and others have proposed the new role of the “citizen data scientist.” According to Carlie Idoine at Gartner, a citizen data scientist is “a person who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics.” Furthermore, Gartner states, “citizen data science bridges the gap between mainstream self-service data discovery by business users and the advanced analytics techniques of data scientists.” Clearly, the definition is somewhat vague, as is the required education.

Is a “clinician data scientist” (CDS) the citizen data scientist equivalent in the healthcare domain? Clinicians (physicians or nurses) clearly have domain expertise by virtue of their training and experience, although they generally lack business and technical training. Those with training in health informatics, clinical informatics, and nursing informatics have the greatest potential to assume the role of CDS. While these informaticians are good candidates to become CDSs, they would require additional training in data science to assume the role.

In a large healthcare system, there is likely to be a collaborative data science team. Its members might include: data scientists, data analysts, clinicians, quality improvement department reps, administrative reps, a chief information officer, a chief medical information officer, a computer scientist if available, a statistician, a database manager, and a project manager. In smaller healthcare organizations there would be far fewer team members and each would likely wear several hats. Moreover, the organization might have to contract out for additional data services.

So, what role would the clinician data scientist play in a data science team? Based on the Microsoft Team Data Science Process below, there are several areas where a CDS could have a significant impact.

(Adapted from Microsoft Team Data Science Process)

The initial step is to identify a significant problem, such as a quality or cost issue. For example, how can a hospital or healthcare system reduce the number of asthmatics who frequently visit the emergency department? In this scenario, a seasoned clinician knows the disease, the clinical workflow, and the likelihood that a new intervention might succeed. The second step, data acquisition and engineering, would also benefit from a clinician’s input. Is it the right data and the right population? The next step is modeling, where model selection, training, and evaluation take place. This usually requires further tweaking and feature engineering. A CDS could provide input regarding the relative importance of false positives and false negatives in a model. Lastly, model deployment means promoting and implementing the model into the workflow of clinicians and other staff. It also includes the soft skill of presentation to the C-suite.
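To make the point about false positives and false negatives concrete, here is a minimal sketch in plain Python. It assumes a model that outputs a risk score between 0 and 1 for each patient, and it picks the decision threshold that minimizes total misclassification cost. The labels, scores, candidate thresholds, and cost values are all made up for illustration; in practice the relative costs are exactly the kind of judgment a CDS would supply.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total misclassification cost when scores >= threshold are flagged positive."""
    # False positive: patient flagged as high risk who did not have the outcome.
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    # False negative: patient with the outcome who was not flagged.
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return fp * cost_fp + fn * cost_fn


def best_threshold(y_true, scores, thresholds, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        thresholds,
        key=lambda th: total_cost(y_true, scores, th, cost_fp, cost_fn),
    )


# Hypothetical data: 1 = frequent emergency department visitor, 0 = not.
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.70, 0.72, 0.90]
thresholds = [0.25, 0.50, 0.75]

# Hypothetical costs: first treat both errors as equally bad,
# then treat a missed high-risk patient as ten times worse.
equal_costs = best_threshold(y_true, scores, thresholds, cost_fp=1, cost_fn=1)
fn_heavy = best_threshold(y_true, scores, thresholds, cost_fp=1, cost_fn=10)

print("Threshold with equal costs:", equal_costs)
print("Threshold when false negatives cost 10x:", fn_heavy)
```

With these invented numbers, equal costs favor the 0.50 threshold, while weighting false negatives ten times more heavily pushes the choice down to 0.25, so that more patients are flagged and fewer high-risk patients are missed.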

What then would be the optimal training and education of a CDS? Because a majority of a data scientist’s time is spent “data wrangling”, familiarity with data preparation and exploratory data analysis programs would be essential. A high comfort level with spreadsheets would be important for the initial steps of data exploration. Data visualization could utilize spreadsheet software or alternatives such as Tableau or Power BI. While a CDS does not need to be an expert in biostatistics, some working knowledge would be important. Programming expertise in SQL, Python, and/or R is essential for a senior data scientist but unrealistic for most CDSs.

There is a panoply of data science and machine learning courses offered online. Many are fee-based, particularly if the participant desires a certificate upon completion. In addition, there are many free machine learning courses of high quality available. Potential students should research them carefully to see if they include data preparation and exploratory data analysis in addition to modeling and algorithms. Also, what is the mechanism to ask questions? Do the courses center around learning a programming language? How long are the courses?

Most data science training programs use a programming language for machine learning, but there are alternatives, such as menu-driven software. The following table lists some of the available machine learning software packages that are open source and/or free for educational purposes.

RapidMiner, the last software program listed in the table, has automated tools such as TurboPrep® and Automodel® that greatly expedite data exploration and predictive analytics. These tools cover both supervised and unsupervised learning, and they qualify the program as automated machine learning (AutoML). Multiple other commercial AutoML programs are currently available: H2O Driverless AI, Google Cloud AutoML, dotData, and DataRobot, to mention a few. Although these programs make the machine learning process more automated and user friendly, several of them assume that the data owner will clean and prepare the data appropriately before creating a model.

Clearly, newer software tools will help democratize data science for more individuals. Gartner optimistically predicts that more than 40% of data science tasks will be automated by 2020. Therefore, it is not beyond the realm of possibility for a CDS to use these tools to create a predictive analytical model, after proper data preparation and exploration.

To date, there is no “AutoAI” software or platform as the next step after AutoML. Deep learning is based on complex neural networks that require programming skills as well as advanced mathematics. Is automated AI around the corner or in the remote future?

While AutoML greatly facilitates the creation of classification and regression models, it requires background knowledge and experience for meaningful and accurate results. Importantly, numerous potential pitfalls await a novice CDS. For example, a CDS must know how to deal with missing data or imbalanced classes, why correlation does not equal causation, and how model accuracy measures can be misleading. Not understanding these issues leads to garbage in, garbage out (GIGO).
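The pitfall of misleading accuracy on imbalanced classes can be shown with a minimal sketch in plain Python. The cohort below is entirely hypothetical: 1,000 patients, of whom 50 have the outcome of interest. The "model" is a deliberately useless one that predicts the majority class for everyone.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positive cases that the model actually flagged."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical imbalanced cohort: 50 positives (1) and 950 negatives (0).
y_true = [1] * 50 + [0] * 950

# A useless model that always predicts the majority class (0).
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print("Accuracy:", acc)
print("Recall:", rec)
```

Here accuracy is 0.95 even though recall is 0: the model never identifies a single patient with the outcome. A novice who looks only at accuracy would judge this model to be excellent.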

It is not clear how realistic it would be to “re-engineer” clinicians to become more knowledgeable in data science. Will medical schools start to add data science components to their curricula? Will health, clinical, and nursing informatics programs enhance their required courses so that their graduates can become clinician data scientists? Can you have CDSs without proficiency testing or certification?

Perhaps it would be easier for healthcare organizations to hire data analysts and contract out for the services of a senior data scientist. As pointed out by Gregory Piatetsky, “There is already a good title for such a job - Data Analyst.” Only time will tell whether a true sea change in medical education, resulting in clinician data scientists, will occur.

Article also posted on January 12, 2020
