Synthetic data for machine learning combats privacy, bias issues

In the age of Big Data, the owners of the largest datasets have a distinct advantage when building ML- and AI-based products. Smaller players, startups, and projects with high sensitivity to data privacy cannot draw on such massive datasets, which limits the audience and scope their products can address. Hence the rise of synthetic data: large datasets, generated by ML models, that resemble the smaller "real" datasets to which developers already have access. On the surface, synthetic data would seem to level the playing field and reduce the monopolistic power wielded by the larger platforms. However, as this piece and other articles point out, using synthetic data also increases risk: synthesizing from biased data can cause algorithms to accentuate that bias, and it can create a false sense of security through automation bias, a powerful human cognitive bias prevalent in the field of AI. The immediate solution to bias in synthetic data remains the same: deploy controls across the model's full lifecycle, continuously monitor your models in production, and frequently audit every model decision with full reproducibility.
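One way to make the audit step concrete is to check whether a synthetic dataset widens an outcome gap that already exists in the real data it was generated from. The following is a minimal sketch, not a method taken from the article: it assumes bias is measured as the difference in positive-outcome rates between groups (a demographic-parity-style gap), and the function names, the `group` and `approved` fields, and the toy rows are all hypothetical, hand-written for illustration.

```python
from collections import defaultdict


def positive_rate_by_group(records, group_key, label_key):
    """Return {group: share of records in that group with a truthy label}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[label_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(records, group_key="group", label_key="approved"):
    """Largest difference in positive rate between any two groups."""
    rates = positive_rate_by_group(records, group_key, label_key)
    return max(rates.values()) - min(rates.values())


# Toy, hand-written rows (assumption: NOT real data or real results).
# "real" stands in for the small dataset the developers already hold.
real = (
    [{"group": "A", "approved": 1}] * 6 + [{"group": "A", "approved": 0}] * 4
    + [{"group": "B", "approved": 1}] * 4 + [{"group": "B", "approved": 0}] * 6
)
# "synthetic" stands in for a larger-looking sample from a generator
# that has exaggerated the imbalance present in "real".
synthetic = (
    [{"group": "A", "approved": 1}] * 8 + [{"group": "A", "approved": 0}] * 2
    + [{"group": "B", "approved": 1}] * 3 + [{"group": "B", "approved": 0}] * 7
)

real_gap = parity_gap(real)
synthetic_gap = parity_gap(synthetic)

# Flag the synthetic set if it widens the gap found in the real data.
bias_amplified = synthetic_gap > real_gap
print(f"real gap={real_gap:.2f}, synthetic gap={synthetic_gap:.2f}, "
      f"amplified={bias_amplified}")
```

A check like this would typically run each time a synthetic dataset is regenerated and be logged alongside the generator version, so that the audit can be reproduced later. A single gap metric is only one possible signal; which groups and outcomes to compare depends on the application.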