jerryking : data_wrangling   3

The promise of synthetic data
February 4, 2020 | Financial Times | by Anjana Ahuja.

* Race after Technology by Ruha Benjamin.
Where anonymization fails, synthetic data might yet succeed. Synthetic data is artificially generated. It is most often created by funnelling real-world data through a noise-adding algorithm to construct a new data set. The resulting data set captures the statistical features of the original information without being a giveaway replica. Its usefulness hinges on a principle known as differential privacy: that anybody mining synthetic data could make the same statistical inferences as they would from the true data — without being able to identify individual contributions........Synthetic data has the potential to squeeze useful information from tightly-controlled databases. Uncovering fraud, for example, can be challenging because regulations restrict how information can be shared, even within banks. Synthetic data can help to unveil useful patterns, while masking individual incidents.......“If you’re trying to train an algorithm to detect fraud, you don’t care about specific transactions and who made them,” he says. “You care about the statistics, like whether the amounts are just below the limit needed to trigger an audit, or if they tend to occur close to the end of the quarter.” Those kinds of numbers can be shaken out of synthetic data as well as from the original........the UK’s Office for National Statistics says synthetic data offers a “safer, easier and faster way to share data between government, academia and the private sector”........ The data does not have to be rooted in the real world to have value: it can be fabricated and slotted in where some is missing or hard to get hold of........Synthetic data could, of course, be framed as fake data — but in some circumstances that is a bonus. 
Artificial intelligence that is trained on real-life information flaunts a baked-in bias: algorithmic decision-making in fields such as criminal justice and credit scoring shows evidence of racial discrimination........discrimination is not something that AI should perpetuate ..... synthetic data could help tackle complex social issues such as poverty: “We could modify that bias. People could release synthetic data that reflects the world we would like to have. Why not use those as training sets for AI?”
algorithms  anonymity  anonymized  biases  books  dark_side  data  data_wrangling  differential_privacy  fairness   inequality  noise  privacy  racial_discrimination  synthetic_data 
7 weeks ago by jerryking
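A rough sketch of the "noise-adding algorithm" idea from the synthetic data excerpt above. The article does not say which algorithm is used; this example assumes the Laplace mechanism applied to a histogram, which is one standard way to obtain differential privacy. The function names, the bucket labels, and the epsilon value are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(values, bins, epsilon, rng):
    # Adding or removing one record changes one bucket count by 1,
    # so Laplace noise with scale 1/epsilon is added to every bucket.
    # `bins` must be fixed in advance, not derived from the data.
    counts = Counter(values)
    return {
        b: max(0.0, counts.get(b, 0) + laplace_noise(1.0 / epsilon, rng))
        for b in bins
    }


def sample_synthetic(noisy, n, rng):
    # Drawing from the noisy counts is post-processing: it reads only the
    # noisy histogram, never the original records.
    bins = list(noisy)
    weights = [noisy[b] for b in bins]
    if sum(weights) <= 0:
        weights = [1.0] * len(bins)  # fall back to uniform if all counts hit 0
    return rng.choices(bins, weights=weights, k=n)


# Hypothetical transaction-amount buckets (not taken from the article).
BUCKETS = ["0-5k", "5k-9k", "9k-10k", "10k+"]
real = ["0-5k"] * 600 + ["5k-9k"] * 250 + ["9k-10k"] * 120 + ["10k+"] * 30

rng = random.Random(2020)
noisy = noisy_histogram(real, BUCKETS, epsilon=0.5, rng=rng)
synthetic = sample_synthetic(noisy, n=1000, rng=rng)
print(Counter(synthetic))
```

The synthetic sample keeps the broad shape of the bucket distribution (for instance, how many amounts sit just under a threshold) while no synthetic row corresponds to a real transaction. Note that `n` is passed as a fixed public number here rather than computed from the real data.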
Data Challenges Are Halting AI Projects, IBM Executive Says
May 28, 2019 | WSJ | By Jared Council.

About 80% of the work with an AI project is collecting and preparing data. Some companies aren’t prepared for the cost and work associated with that going in,......“And so you run out of patience along the way, because you spend your first year just collecting and cleansing the data,”.....“And you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And you kind of bail on it.”....A report this month by Forrester Research Inc. found that data quality is among the biggest AI project challenges. Forrester analyst Michele Goetz said companies pursuing such projects generally lack an expert understanding of what data is needed for machine-learning models and struggle with preparing data in a way that’s beneficial to those systems.

She said producing high-quality data involves more than just reformatting or correcting errors: Data needs to be labeled to be able to provide an explanation when questions are raised about the decisions machines make.

While AI failures aren’t much talked about, Ms. Goetz said companies should be prepared for them and use them as teachable moments. “Rather than looking at it as a failure, be mindful about, ‘What did you learn from this?’”
artificial_intelligence  data_collection  data_quality  data_wrangling  IBM  IBM_Watson  teachable_moments 
may 2019 by jerryking
The Three Sexy Skills of Data Geeks : Dataspora Blog
Hal Varian, Google’s Chief Economist, was interviewed a few
months ago, and said the following in the McKinsey Quarterly:
“The sexy job in the next ten years will be statisticians… The ability
to take data—to be able to understand it, to process it, to extract
value from it, to visualize it, to communicate it—that’s going to be a
hugely important skill.” Put All Three Skills Together: Sexy. Thus with
the Age of Data upon us, those who can model, munge (data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics), and visually
communicate data — call us statisticians or data geeks — are a hot
analytics  data  data_scientists  data_wrangling  career_paths  Hal_Varian  Information_Rules  statistics  visualization  value_extraction 
july 2009 by jerryking
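To make the definition of data wrangling quoted above concrete, here is a small sketch using only the Python standard library. The raw records, the column names, and the helper names `parse_date` and `wrangle` are invented for illustration; nothing here comes from the Dataspora post.

```python
import csv
import io
from datetime import datetime

# Hypothetical "raw" export: mixed date formats, a thousands separator,
# a missing amount, and inconsistent spacing and capitalisation.
RAW = '''date,amount,customer
28/05/2019,"1,200.50", Alice
04/02/2020,,BOB
2020-02-05,75,carol
'''


def parse_date(text):
    # Try each known format in turn; return None if none of them match.
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def wrangle(raw_text):
    # Map each raw row into a typed, normalised record.
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        amount_text = rec["amount"].replace(",", "").strip()
        rows.append({
            "date": parse_date(rec["date"]),
            "amount": float(amount_text) if amount_text else None,
            "customer": rec["customer"].strip().lower(),
        })
    return rows


for row in wrangle(RAW):
    print(row)
```

Missing amounts are kept as `None` rather than guessed, so a later analysis step can decide how to treat them.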
