When I drive to work every day, most of my journey is recorded by cameras. Information about my route and speed of travel, as well as that of other people on the road at the time, can be used by governments to improve our national infrastructure, among many other things. It is the job of politicians to decide how much to invest, not only on roads but also alternatives, including incentives to avoid rush hour or switch to public transport. But what exactly feeds into this policymaking?
Transport has been a hotly debated issue for many years in the Netherlands, and the main question is whether the government should tax people for driving on the roads during rush hour or rather invest the money differently, for instance in better railways, more bicycle routes, highways and city roads. Yet policy decisions are rarely made quickly.
Policymakers often make decisions based on many years’ worth of data. Yet these decisions are often designed for an average day – and we all know that traffic is worse in heavy rain, snow or ice. Governments will not necessarily change our infrastructure based on the experience of those one, two or three bad days a year. Yet on those rainy days, when conditions are at their worst, that is exactly when I would love guidance on how to move faster.
For those situations, ‘big data’ can help. If you drive your car on the highway, your trip, time of travel and route are recorded by cameras, and that information is stored. We know how many cars were on the road and where they were at each time of day, and we can analyse the data. Having big data readily available allows us to capture ‘reality’ within days: how traffic changes when the weather changes, and so on. And we can do that without surveys or other data collection methods that cost time and money. Put simply, we can use big data to streamline traffic relatively quickly, even in difficult conditions. This is exactly what the people in the Centre for Big Data Statistics focus on.
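As a toy illustration of the kind of summary that road-sensor data make possible without any survey, the sketch below aggregates camera records (timestamp, camera ID) into hourly vehicle counts. The record format and camera names are invented for illustration, not CBS’s actual data model.

```python
from collections import Counter

# Invented camera records: (ISO timestamp, camera id).
records = [
    ("2018-11-08T07:15", "A2-km12"),
    ("2018-11-08T07:40", "A2-km12"),
    ("2018-11-08T08:05", "A2-km12"),
]

# Bucket by hour: the first 13 characters of the timestamp ("YYYY-MM-DDTHH").
counts = Counter(ts[:13] for ts, _cam in records)
print(counts["2018-11-08T07"])  # vehicles passing camera A2-km12 between 07:00 and 08:00
```

From millions of such records, the same one-liner yields traffic profiles per road, per hour, per weather condition – within days rather than survey cycles.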
With our PhD students, including the first year GPAC2 and IEGD PhD fellows and some PhD students from other Maastricht University faculties, we visited the Netherlands Statistics office on 8 November 2018. This visit is a yearly activity that is part of our PhD training, organised jointly with Prof. Hans Schmeets (Faculty of Arts and Social Science, and Statistics Netherlands). During the day we are informed about the practice and use of censuses, samples and surveys, and more on developments in big data.
Sofie De Broe offered us a very informative lecture on the use of big data at Statistics Netherlands. She explained that Statistics Netherlands is committed to using these new data sources to answer complex policy questions. Issues the unit tackles include ‘How do we start the energy transition as a city?’, ‘Can we better predict migrant flows?’ and ‘How can we better match demand and supply of jobs on the labour market?’
Sofie kindly agreed to assist in writing this blog, and will share how big data impacts her work. PhD alumnus Florian Henning, who now works in the big data unit at Statistics Netherlands, also offers his views on the value and use of big data. However, Statistics Netherlands is not the only organisation exploring the field: GPAC2 PhD fellow Stafford Nichols, who currently works for GALLUP, plans to integrate big data in his PhD study. Their Q&As appear below.
SOFIE DE BROE
Sofie De Broe is head of methodology and scientific director of the Centre for Big Data Statistics, which aims to use big data combined with survey and administrative data to produce official statistics. To that end, the Centre makes (initially) experimental products that should eventually become part of the official statistical process. For an overview of these so-called beta products, have a look here.
Can you give an example where big data provides innovative ways to analyse a complex policy issue – something that cannot be done just with census or survey data?
The big disadvantage of survey and census data is the interaction between interviewers and interviewees: in other words, there may be measurement errors. Censuses may also miss parts of the population due to under-coverage. A big data source, on the other hand, gives you the advantage of measuring the phenomenon you are interested in directly (for example through sensor-produced data). However, a big data source will also cover only a selection of the study population, and you may not know which part; still, compared to a census, big data collection is much cheaper. For example, it is much easier to use smart meter data to gauge energy consumption than to interview people, to use travel and destination data from apps and sensors to derive time use than to have people keep a diary, or to read weight from a smart scale than to ask people about it. Big data also offers the potential to make new statistical products, as with road sensor data, or to measure concepts differently by looking at what people post online instead of their answers in a questionnaire.
What are the pitfalls of working with big data?
There are still some challenges facing the use of big data. First, we need to deal with this type of unstructured data, which needs to be validated internally and externally – and these validation mechanisms need to be transparent. Second, we need to better understand the data-generating processes and how concepts are measured in different data sources. Third, we have to guarantee privacy for the user, ensure the right (data science) skills are available, and make sure that data delivery is stable.
There is still a way to go before official statistics can be based on big data. We aim to combine it with other secondary sources such as survey and administrative data, but primary data will continue to have its value for getting information about society and answering policy-relevant questions.
FLORIAN HENNING
Dr. Florian Henning is a Data Officer at Statistics Netherlands and an alumnus of UNU-MERIT.
How do you manage big data in your daily research activities?
As a data scout, my role is to facilitate our big data research. I focus on supporting the data collection activities of CBS through usage of new data sources, in particular big data. To this end, I am scouting new data sources, and building partnerships with key organisations. It is a very interesting role, maybe comparable to a football scout: continuously looking for promising new ‘talents’ in the world that can contribute to our ‘club’, getting in contact with them and arranging for them to join the team.
Can you give an example where big data provides innovative ways to analyse a complex policy issue – something that cannot be done just with census or survey data?
One example is our analysis of innovative businesses in the Netherlands. CBS is currently only surveying companies that have more than 10 employees about their level of innovation, which, by definition, excludes many small companies and start-ups. To collect information about these businesses as well, a big data method has been developed that analyses the text on a company’s website. This ‘web scraping’ method is useful for identifying small innovative businesses, such as start-ups. Details on this and on other ‘beta products’ that the CBS has developed based on big data sources can be found here.
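The mechanics of such a text-based classifier can be sketched in a few lines. The keyword list, scoring rule and sample page below are invented for illustration; CBS’s actual method is trained on labelled company websites, not a fixed dictionary.

```python
import re

# Hypothetical innovation-related vocabulary (illustrative only).
INNOVATION_TERMS = {"innovation", "patent", "r&d", "prototype", "machine learning"}

def innovation_score(html: str) -> float:
    """Return the fraction of innovation-related terms found in the page text."""
    text = re.sub(r"<[^>]+>", " ", html).lower()  # crudely strip HTML tags
    hits = sum(1 for term in INNOVATION_TERMS if term in text)
    return hits / len(INNOVATION_TERMS)

page = ("<html><body><p>We build prototype sensors and hold two patents; "
        "our R&D team applies machine learning.</p></body></html>")
print(innovation_score(page))
```

A real pipeline would crawl company websites at scale and use a trained model rather than keyword matching, but the idea is the same: turn website text into a score of how innovative a business appears to be.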
What are the pitfalls of working with big data?
Big data statistics is a new field, and statistics agencies worldwide are continuously learning about it. It offers many exciting benefits, such as the potential to increase speed, accuracy and level of detail in statistics production, at lower costs and requiring less administrative effort.
That being said, no data source is perfect. Challenges with big data fall into three categories: methodological, organisational and technical. Methodologically, a key difference between big data and survey data is that whereas a traditional survey is deliberately designed to answer a set of specific research questions, big data is often a digital by-product of everyday activities and, as such, often lacks a well-defined target population, structure and quality. Big data sources can be selective, and lack the relevant background characteristics needed for data editing and estimation methods. Organisational challenges concern privacy and legal issues, such as data ownership. The identity of individuals must be safeguarded against disclosure at all times. We have a team of legal experts advising us, and we are partnering with research institutions to develop innovative methods for privacy-preserving data analytics. This also relates to technological issues; these stem in large part from the sheer size of big data, which creates new challenges for processing, storing and transferring large data sets.
STAFFORD NICHOLS
Stafford Nichols works for GALLUP and is a GPAC2 PhD fellow.
How do you manage big data in your daily research activities?
I design and manage social research surveys across a couple of dozen countries. Most of these countries are still developing, so face-to-face surveys are necessary to represent the country accurately. Big data offers wonderful advantages in two ways.
First, all of my interviewers use smartphones to conduct interviews. This not only provides them with an easy user interface for entering data; it also enables us to record the exact time each question was answered and the GPS location of the interview, to make audio recordings of the interview, and even to take a picture of the household if we want. Furthermore, the phone can automatically send the data to our central server halfway around the world as soon as it has an internet connection. As data streams in from thousands of interviewers around the world, software programmes run automatic quality-control checks and gauge whether to accept or reject each interview.
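Automatic acceptance rules of this kind are straightforward to express in code. The sketch below rejects interviews that finished implausibly fast or lack a GPS fix; the threshold and record format are illustrative assumptions, not GALLUP’s actual rules.

```python
from datetime import datetime, timedelta

# Illustrative threshold; real survey QC rules are far more elaborate.
MIN_DURATION = timedelta(minutes=5)

def accept_interview(record: dict) -> bool:
    """Reject interviews completed implausibly fast or lacking a GPS fix."""
    duration = record["end"] - record["start"]
    has_gps = record.get("lat") is not None and record.get("lon") is not None
    return duration >= MIN_DURATION and has_gps

rec = {"start": datetime(2018, 11, 8, 10, 0),
       "end": datetime(2018, 11, 8, 10, 2),
       "lat": 50.85, "lon": 5.69}
print(accept_interview(rec))  # rejected: only two minutes long
```

In production, checks like this run server-side as each interview arrives, alongside many others (answer patterns, audio spot-checks, photo verification).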
Second, big data sources tell us more about the environments where respondents live, and this helps us better model and explain various social phenomena by supplementing the survey data. For example, cell phone companies can track the speed at which cell phones travel. In developing countries, this is a strong indicator of economic progress – someone who can afford a motorbike is probably better off than someone who has to walk to work. If I am conducting a survey about perceptions of the economy, I can compare the economic sentiment in a particular community to this cell phone data. If I want to append a second non-survey source, I can look at the change in the intensity of night light data over time. This has been shown to be another useful proxy for economic development. Having a better picture of the respondent’s community allows me to validate and enhance the survey data I collect.
Can you give an example where big data provides innovative ways to analyse a complex policy issue – something that cannot be done just with census or survey data?
Mapping poverty is important for resource allocation, for aid distribution and for designing surveys. However, traditional data sources cannot give policymakers a timely, accurate picture of where poverty is concentrated. Many developing countries are experiencing rapid urbanisation, and census data are too outdated or collected too infrequently.
Luckily, researchers have developed machine learning techniques that use computer vision to estimate poverty, by looking at high-resolution satellite imagery. They are able to accurately estimate the socioeconomic status of communities, by having computers analyse characteristics like building height, construction material and road surfaces.
There are so many satellites now that the entire surface of the earth is photographed every single day at a granular level, so computers have millions of fresh images to analyse each day. While training these computers is still very difficult, this combination of up-to-date satellite imagery and machine learning offers policymakers a powerful tool to track the status of entire populations in a timely manner.
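Leaving out the computer-vision model that would extract features from imagery, the estimation step can be sketched as image-derived features combined into a wealth index. The feature names and weights below are purely illustrative, not from any published model.

```python
# Toy stand-in for the computer-vision pipeline: in practice a deep
# network extracts features (building height, roofing material, road
# surface) from satellite tiles; here the features are given directly.

def wealth_index(features: dict) -> float:
    """Weighted sum of hand-picked image features; weights are illustrative."""
    weights = {"metal_roof_share": 0.5, "paved_road_share": 0.3, "building_density": 0.2}
    return sum(weights[k] * features[k] for k in weights)

tile = {"metal_roof_share": 0.6, "paved_road_share": 0.4, "building_density": 0.5}
print(round(wealth_index(tile), 2))
```

The real systems learn both the features and the weights from ground-truth survey data, which is what makes training them difficult – but once trained, they can score every tile on earth daily.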
What are the pitfalls of working with big data?
Many big data datasets are generated as a byproduct of a particular activity – credit card usage, social media posts, phone call history – and so they only offer researchers a broad brush with which to investigate questions. While they have tremendous breadth, they typically do not have the necessary depth to offer conclusive evidence. On the other hand, surveys offer researchers a highly customised and targeted instrument, and so are often better at answering specific research questions. This is primarily why there is a general consensus in the field of survey research today that big data will not replace surveys anytime soon.
Big data is usually cheaper than surveys, so if a policymaker or business leader can make decisions based solely on big data – great! But if the decision-making process requires more accurate inferential statistics, then paying for an expensive survey may be necessary.
The future will not be about choosing between one or the other, but rather how the two can be used together to enhance each other. A big data dataset can offer general information about a population that allows for a more accurate survey design, for example. Or perhaps a detailed survey in year one enables you to create a highly explanatory model, so in years two, three and four, you can use an inexpensive big data source to estimate the covariates for your model, instead of a survey. We can expect hybrid solutions like these to become more common in the future.
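The survey-in-year-one, big-data-thereafter idea can be sketched with a toy example: fit a simple regression on survey data once, then reuse it with a cheap big-data covariate in a later year. The night-light and income figures below are invented for illustration.

```python
# Sketch of the hybrid approach: calibrate once on a survey, then
# estimate from a big-data proxy alone in subsequent years.

def fit_line(xs, ys):
    """Ordinary least squares for a single covariate: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Year 1: survey-measured income vs. satellite night-light intensity (invented).
lights_y1 = [1.0, 2.0, 3.0, 4.0]
income_y1 = [10.0, 14.0, 18.0, 22.0]
slope, intercept = fit_line(lights_y1, income_y1)

# Year 2: no survey; estimate income from the new night-light reading alone.
estimate_y2 = slope * 5.0 + intercept
print(estimate_y2)
```

The survey provides the ground truth that calibrates the model; the big data source then carries the estimate forward cheaply until the model needs recalibrating.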
NOTA BENE
The opinions expressed here are the authors’ own; they do not necessarily reflect the views of UNU.
MEDIA CREDITS
Flickr / P.Jansen