The Women in Data Science (WiDS) conference took place on March 5th at Stanford University. The conference had hundreds of participants on site, as well as thousands more on live stream across the world. This year, a datathon was offered in advance. A total of 1,160 people from 26 countries participated in the competition. The objective of the data challenge was to predict gender using survey data from developing countries. The winning team was “Minions” from Massachusetts. The competition used a Kaggle leaderboard to track performance, and the winning team achieved an area under the ROC curve (AUC) above 0.975. If you want to learn more, here is the Kaggle site for the datathon.
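For context, the leaderboard metric, area under the ROC curve, can be computed with the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. This is a minimal sketch of the metric itself, not the datathon's actual scoring code, and the `auc` helper name is mine:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney formulation:
    the fraction of (positive, negative) pairs where the positive
    example receives the higher score, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one misranked pair out of four gives AUC = 0.75.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise formulation is O(n²) but makes the meaning of “AUC above 0.975” concrete: nearly every positive/negative pair was ranked correctly.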
Keynotes and talks focused on data science breakthroughs from Google and Microsoft Research, accompanied by talks on applied data science in healthcare, social networks, and financial investment. Among all of them, I found “More Data, More Problems” and “Data Science to Save the World” the most fascinating.
More Data, More Problems
by Daniela Witten, Associate Professor of Statistics and Biostatistics, University of Washington
“While it is certainly true that more data brings with it incredible opportunities, it is also true that more data can bring new and previously unimaginable statistical challenges.”
Daniela’s talk reminded me of the Experimental Design course I took in grad school. I remembered all those chapters on evaluating control and treatment groups, and random and fixed effects. But in the big data era, especially when we have millions or billions of user logs per day, it is impossible to apply the same practices. As Daniela emphasized, the truth is that we cannot completely randomize observations, and we cannot assign treatments to observations at random. Daniela made a good point about first thinking through which questions are directly answerable from the data before jumping into a data science solution, and then trying to expand the findings, using causal inference, for example. Also, prior to the conference, Daniela posted a tweet asking, “Within the context of interdisciplinary data science, what’s the #1 thing that rigorous statistical training brings to the table?” Surprisingly, the answers mostly centered on skepticism of data biases, trouble with reproducibility, and flaws in experiments. Other popular answers were about using statistical learning to unmask latent variables and get closer to the real problem, as in the quote from Samuel Karlin: “The purpose of models is not to fit the data but to sharpen the questions.”
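Daniela’s point about non-randomized observations can be made concrete with a toy simulation (entirely my own construction, not from the talk): when treatment assignment depends on a confounder, a naive comparison of outcome means is biased, while inverse-propensity weighting, one basic causal-inference tool, recovers the true effect when the assignment probability is known.

```python
import random

random.seed(0)

# Toy observational data: treatment is NOT randomized. Users with higher
# "engagement" x are more likely to be treated, so x confounds a naive
# comparison. The true treatment effect on y is 1.0.
n = 20000
data = []
for _ in range(n):
    x = random.random()                  # confounder
    p_treat = 0.2 + 0.6 * x             # assignment probability depends on x
    t = 1 if random.random() < p_treat else 0
    y = 2.0 * x + 1.0 * t + random.gauss(0, 0.1)
    data.append((x, t, y))

# Naive difference in means: biased upward, because treated users
# tend to have higher x to begin with.
naive = (sum(y for x, t, y in data if t) / sum(t for _, t, _ in data)
         - sum(y for x, t, y in data if not t) / sum(1 - t for _, t, _ in data))

# Inverse-propensity-weighted estimate, reweighting each observation by
# the (here, known) probability of the treatment it actually received.
ipw = sum(t * y / (0.2 + 0.6 * x) - (1 - t) * y / (1 - (0.2 + 0.6 * x))
          for x, t, y in data) / n

print(f"naive estimate: {naive:.2f}")  # noticeably above 1.0
print(f"IPW estimate:   {ipw:.2f}")    # close to 1.0
```

In real logs the propensity is unknown and must itself be estimated, which is where the hard statistical work Daniela described begins.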
Data Science to Save the World
by Latanya Sweeney, Professor of Government and Technology in Residence, Harvard University
“As technology progresses, every societal value and every state rule comes up for grabs and will likely be redefined by what technology enables or not. Data science allows us to do experiments to show how it all fits together or falls apart.”
Latanya’s talk began with how she re-identified a state governor’s medical record using only zip code, birth date, and gender. She also talked about how Google’s ad recommendation algorithm used to exhibit demographic bias. Another example involved evaluating job applicants, where one algorithm suggested hiring more younger applicants. Astonishingly, the Civil Rights Act, the Fair Credit Reporting Act, health privacy, and equal employment have all come up for grabs in the data science world. The best use of data science is really to “harness tech for public interests.” A good example Latanya gave was uncovering price discrimination in SAT online courses: the analysis found that zip codes with more Asian families were charged higher fees.
At Intuit, we deeply respect and steward data, making sure it is secure and used appropriately at all times. As such, there are many use cases for data scientists: assessing new applicants, routing customer care requests, recommending financial support, and so on. It was great to see concerns about statistical randomness, data privacy, and data ethics emerge at this year’s WiDS conference.
Latanya Sweeney also noted, “Technology happens in days and policy happens in years.” Therefore, to work proactively, I would encourage data scientists to keep these questions in mind in their daily practice: Are my observations randomly sampled from the population? Which data regulations must I comply with? Is it appropriate to use the data for this purpose?
Above all, I hope this post is helpful on every data science journey you take.