In the new data age, number nerds are thriving

Books on data analysis were for sale at the Joint Statistical Meetings in Washington last week. (NYT)

But she was drawn to what she calls “all the computer and math stuff” that was part of the job.

“People think of field archeology as Indiana Jones, but much of what you really do is data analysis,” she said. Now Grimes does a different kind of digging. She works at Google, where she uses statistical analysis of mounds of data to come up with ways to improve its search engine.

Grimes is an internet-age statistician, one of many who are changing the image of the profession as a place for dronish number nerds. They are finding themselves increasingly in demand — and even cool.

“I keep saying that the sexy job in the next 10 years will be statisticians,” said Hal Varian, chief economist at Google. “And I’m not kidding.” The rising stature of statisticians, who can earn $125,000 at top companies in their first year after getting a doctorate, is a byproduct of the recent explosion of digital data.

In field after field, computing and the web are creating new realms of data to explore — sensor signals, surveillance tapes, social network chatter, public records and more. And the digital data surge only promises to accelerate, rising fivefold by 2012, according to a projection by IDC, a research firm.

Yet data is merely the raw material of knowledge. “We’re rapidly entering a world where everything can be monitored and measured,” said Erik Brynjolfsson, an economist and director of MIT’s Centre for Digital Business. “But the big problem is going to be the ability of humans to use, analyse and make sense of the data.”

The new breed of statisticians tackles that problem. They use powerful computers and sophisticated mathematical models to hunt for meaningful patterns and insights in vast troves of data. The applications are as diverse as improving internet search and online advertising, culling gene sequencing information for cancer research and analysing sensor and location data to optimise the handling of food shipments.

Though at the fore, statisticians are only a small part of an army of experts using modern statistical techniques for data analysis. Computing and numerical skills, experts say, matter far more than degrees. So the new data sleuths come from backgrounds like economics, computer science and mathematics.

In another sign of the growing interest in the field, an estimated 6,400 people are attending the statistics profession’s annual conference in Washington this week, up from around 5,400 in recent years, according to the American Statistical Association. The data surge is elevating a profession that traditionally tackled less-visible and less-lucrative work, like figuring out life expectancy rates for insurance companies.

Grimes, 32, got her doctorate in statistics from Stanford in 2003 and joined Google later that year. She is now one of many statisticians in a group of 250 data analysts. She uses statistical modelling to help improve the company’s search technology.

For example, Grimes worked on an algorithm to fine-tune Google’s crawler software, which roams the web to constantly update its search index. The model increased the chances that the crawler would scan frequently updated web pages and make fewer trips to more static ones.
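The idea of visiting fast-changing pages more often can be illustrated with a minimal sketch. This is not Google's method, which the article does not describe; the `PageHistory` record, the add-one smoothing, the proportional split of the crawl budget and the example URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageHistory:
    """Hypothetical record of what past crawls saw for one page."""
    url: str
    visits: int   # number of past crawls
    changes: int  # crawls on which the content had changed


def change_rate(page: PageHistory) -> float:
    """Smoothed share of visits on which the page had changed."""
    # Add-one smoothing keeps little-visited pages away from 0 or 1.
    return (page.changes + 1) / (page.visits + 2)


def allocate_crawls(pages, budget):
    """Split a crawl budget across pages in proportion to change rate."""
    rates = {p.url: change_rate(p) for p in pages}
    total = sum(rates.values())
    return {url: budget * rate / total for url, rate in rates.items()}


# Made-up histories: a busy front page and a nearly static archive page.
pages = [
    PageHistory("news.example/front", visits=98, changes=89),
    PageHistory("archive.example/1999", visits=98, changes=9),
]
plan = allocate_crawls(pages, budget=100)
for url, crawls in plan.items():
    print(f"{url}: about {crawls:.0f} crawls")
```

In this toy setup the front page receives most of the budget and the archive page only a small share, which is the behaviour the paragraph above describes.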

It is the size of the data sets on the web that opens new worlds of discovery.

Traditionally, social sciences tracked people’s behaviour by interviewing or surveying them. “But the web provides this amazing resource for observing how millions of people interact,” said Jon Kleinberg, a computer scientist and social networking researcher at Cornell.

For example, in research just published, Kleinberg and two colleagues followed the flow of ideas across cyberspace. They tracked 1.6 million news sites and blogs during the 2008 presidential campaign, using algorithms that scanned for phrases associated with news topics like ‘lipstick on a pig’.

The Cornell researchers found that, generally, the traditional media leads and the blogs follow, typically by 2.5 hours. But a handful of blogs were the quickest to pick up quotes that later gained wide attention.

The rich lode of web data, experts warn, has its perils. Its sheer volume can easily overwhelm statistical models. Statisticians also caution that strong correlations of data do not necessarily prove a cause-and-effect link.

For example, in the late 1940s, before there was a polio vaccine, public health experts in America noted that polio cases increased in step with the consumption of ice cream and soft drinks, according to David Alan Grier, a historian and statistician at George Washington University. Eliminating such treats was even recommended as part of an anti-polio diet.
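A small synthetic example shows how such a misleading correlation can arise. The numbers below are invented, not historical data, and the choice of temperature as the hidden common factor is an assumption made for illustration: both series are built to rise with monthly temperature, and neither depends on the other.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Invented monthly figures: temperature drives both series.
temps = [2, 4, 9, 14, 19, 24, 27, 26, 22, 15, 9, 4]
wiggle_a = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
wiggle_b = [1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, -1]
ice_cream = [10 + 3 * t + 2 * a for t, a in zip(temps, wiggle_a)]
polio = [5 + 2 * t + 2 * b for t, b in zip(temps, wiggle_b)]

raw = pearson(ice_cream, polio)
adjusted = partial_corr(ice_cream, polio, temps)
print(f"raw correlation: {raw:.2f}")
print(f"after controlling for temperature: {adjusted:.2f}")
```

The raw correlation between the two series is very high, yet it all but vanishes once temperature is accounted for, so cutting ice cream would do nothing to the other series in this toy world.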

If the data explosion magnifies longstanding issues in statistics, it also opens up new frontiers. “The key is to let computers do what they are good at, which is trawling these massive data sets for something that is mathematically odd,” said Daniel Gruhl, an IBM researcher whose recent work includes mining medical data to improve treatment. “And that makes it easier for humans to do what they are good at — explain those anomalies.”

The New York Times
