
jtyost2 : datamining   65

Just discovered the most INSANE thing. The ORDER OF THE EPISODES for Netflix's new series Love Death & Robots changes based on whether Netflix thinks you're gay or straight.
netflix  algorithm  privacy  gender  datamining  machinelearning  software  technology  culture 
march 2019 by jtyost2
Almost None of the Women in the Ashley Madison Database Ever Used the Site
Overall, the picture is grim indeed. Out of 5.5 million female accounts, roughly zero percent had ever shown any kind of activity at all, after the day they were created.

The men’s accounts tell a story of lively engagement with the site, with over 20 million men hopefully looking at their inboxes, and over 10 million of them initiating chats. The women’s accounts show so little activity that they might as well not be there.

Sure, some of these inactive accounts were probably created by real, live women (or men pretending to be women) who were curious to see what the site was about. Some probably wanted to find their cheating husbands. Others were no doubt curious journalists like me. But they were still overwhelmingly inactive. They were not created by women wanting to hook up with married men. They were static profiles full of dead data, whose sole purpose was to make men think that millions of women were active on Ashley Madison.

Ashley Madison employees did a pretty decent job making their millions of women’s accounts look alive. They left the data in these inactive accounts visible to men, showing nicknames, pictures, sexy comments. But when it came to data that was only visible to company admins, they got sloppy. The women’s personal email addresses and IP addresses showed marked signs of fakery. And as for the women’s user activity, the fundamental sign of life online? Ashley Madison employees didn’t even bother faking that at all.

There are definitely other possible explanations for these data discrepancies. It could be that the women’s data in these three fields just happened to get hopelessly corrupted, even though the men’s data didn’t. Or maybe most of those accounts weren’t deliberately faked, but just represented real women who came to the site once, never to return.

Either way, we’re left with data that suggests Ashley Madison is a site where tens of millions of men write mail, chat, and spend money for women who aren’t there.
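
The check described above is essentially a grouped activity count over account records. A minimal sketch of that kind of analysis, assuming a hypothetical CSV export with gender, created_at and last_activity_at columns (not the schema of the actual dump):

    import csv
    from collections import Counter
    from datetime import datetime

    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%d") if ts else None

    active, total = Counter(), Counter()
    with open("accounts.csv", newline="") as f:      # hypothetical export
        for row in csv.DictReader(f):
            gender = row["gender"]
            total[gender] += 1
            created = parse(row["created_at"])
            last_seen = parse(row.get("last_activity_at", ""))
            # "activity" here means anything recorded after the day the account was created
            if created and last_seen and last_seen.date() > created.date():
                active[gender] += 1

    for gender in total:
        share = 100.0 * active[gender] / total[gender]
        print(f"{gender}: {active[gender]}/{total[gender]} ever active ({share:.2f}%)")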
privacy  datamining  information 
august 2015 by jtyost2
Consequences of an Insightful Algorithm
We have ethical responsibilities when coding. We’re able to extract remarkably precise intuitions about an individual. But do we have a right to know what they didn’t consent to share, even when they willingly shared the data that leads us there? A major retailer’s data-driven marketing accidentally revealed to a teen’s family that she was pregnant. Eek.

What are our obligations to people who did not expect themselves to be so intimately known without sharing directly? How do we mitigate against unintended outcomes? For instance, an activity tracker carelessly revealed users’ sexual activity data to search engines. A social network’s algorithm accidentally triggered painful memories for grieving families who’d recently experienced the death of a child or other loved ones.

We design software for humans. Balancing human needs and business specs can be tough. It’s crucial that we learn how to build in systematic empathy.

In this talk, we’ll delve into specific examples of uncritical programming, and painful results from using insightful data in ways that were benignly intended. You’ll learn ways we can integrate practices for examining how our code might harm individuals. We’ll look at how to flip the paradigm, netting consequences that can be better for everyone.

Conferences (2015): PyConAU, WDCNZ, Open Source Bridge
algorithm  privacy  datamining  civilrights  ethics  programmer  programming  software 
august 2015 by jtyost2
Student Data Could Pose Privacy Risk
The fear is that the multi-billion-dollar education technology (or “ed-tech”) industry that seeks to individualize learning and reduce drop-out rates could also pose a threat to privacy, as a rush to commercialize student data could leave children tagged for life with indicators based on their childhood performance.

“What if potential employers can buy the data about you growing up and in school?” asks mathematician Cathy O’Neil, who’s finishing a book on big data and blogs at mathbabe.org. In some of the educational tracking systems, which literally log a child’s progress on software keystroke by keystroke, “We’re giving a persistence score as young as age 7 — that is, how easily do you give up or do you keep trying? Once you track this and attach this to [a child’s] name, the persistence score will be there somewhere.” O’Neil worries that just as credit scores are now being used in hiring decisions, predictive analytics based on educational metrics may be applied in unintended ways.

Such worries came to the fore last week when educational services giant Pearson announced that it was selling the company PowerSchool, which tracks student performance, to a private equity firm for $350 million. The company was started independently; sold to Apple; then to Pearson; and now to Vista Equity Partners. Each owner in turn has to decide how to manage the records of some 15 million students across the globe, according to Pearson. The company did not sign an initiative called the Student Privacy Pledge, whose signatories promise not to sell student information or behaviorally target advertising (151 other companies including Google have signed the non-binding pledge).

A Pearson spokesperson said, “We do not use personal student data to sell or market Pearson products or services. The data is entrusted to us as a part of our work with schools and institutions and is guarded by federal and state laws. From a security perspective, when an education institution or agency entrusts Pearson with personally identifiable student information, we work directly with the organization to ensure the data is protected and our controls are consistent with relevant requirements.”

PowerSchool intakes a large variety of data. Its site touts administrator tools including discipline management and reporting; student and staff demographics; and family management. Brendan O’Grady, VP of media and communities for Pearson, says the company has provided ways of allowing educators to track the performance of individual students and groups of students in order to serve them better. “Big data and all of the associated technologies have really improved all of the technologies in the world, the way we travel and communicate and more,” he says. “But we haven’t seen a similar advance in the way we use data in education. There are very legitimate questions about data security and around what works best for schools. But there should be some very positive experiences using big data to give better feedback on what needs to be learned. That’s the biggest opportunity.”

The biggest flame-out so far in the ed-tech arena has been inBloom, a company that had a stellar lineup of support — $100 million of it — from sources including the Bill & Melinda Gates Foundation and the Carnegie Corporation of New York. In Louisiana, parents were incensed that school officials had uploaded student social security numbers to the platform. After several other states ended relationships, the last remaining client — New York — changed state law to forbid giving student data to companies storing it in dashboards and portals. InBloom announced in April 2014 that it was shutting down.
legal  privacy  education  civilrights  government  regulation  business  datamining 
june 2015 by jtyost2
US Government Labeled Al Jazeera Journalist as Al Qaeda
The U.S. government labeled a prominent journalist as a member of Al Qaeda and placed him on a watch list of suspected terrorists, according to a top-secret document that details U.S. intelligence efforts to track Al Qaeda couriers by analyzing metadata.

The briefing singles out Ahmad Muaffaq Zaidan, Al Jazeera’s longtime Islamabad bureau chief, as a member of the terrorist group. A Syrian national, Zaidan has focused his reporting throughout his career on the Taliban and Al Qaeda, and has conducted several high-profile interviews with senior Al Qaeda leaders, including Osama bin Laden.

A slide dated June 2012 from a National Security Agency PowerPoint presentation bears his photo, name, and a terror watch list identification number, and labels him a “member of Al-Qa’ida” as well as the Muslim Brotherhood. It also notes that he “works for Al Jazeera.”

The presentation was among the documents provided by NSA whistleblower Edward Snowden.

In a brief phone interview with The Intercept, Zaidan “absolutely” denied that he is a member of Al Qaeda or the Muslim Brotherhood. In a statement provided through Al Jazeera, Zaidan noted that his career has spanned many years of dangerous work in Afghanistan and Pakistan, and required interviewing key people in the region — a normal part of any journalist’s job.
usa  legal  journalism  politics  terrorism  technology  software  datamining  AlJazeera  AlQaeda 
may 2015 by jtyost2
Troy Hunt: Mobile app privacy insanity – we’re still failing massively at this
I feel a bit matronly saying this, but I’m disappointed. No really, I was sure things were looking up there for a bit and it seemed harder to find egregious examples of security shortcomings in a random selection of apps. Whilst this is but a very small selection here, the problems were found very quickly and are extremely worrying. Both Aussie Farmers and Nando’s have serious security risks in how they handle data and as for PayPal, you may not call it a security risk but it’s sure as hell an invasion of privacy. The network I’m connected to when using their service and where I physically am in the world is my business and I don’t particularly want to share it with them. Perhaps I should just stick to the browser that doesn’t leak this class of data yet one would assume is still sufficiently secure.

What has me a little worried with all this is that we’re heading in a direction where we have more data to share via more channels which are all racing to be first to market with something. This week the Apple Watch will hit and it will arguably be the point at which wearables seriously take off. Now we’re talking about very personal data – health data – and there will be all new ways of sharing it. But it’s to improve the experience for you, the consumer, right?
software  programming  webdevelopment  security  privacy  datamining 
april 2015 by jtyost2
Privacy vs. User Experience • Dustin Curtis
Cook is being disingenuous, because he knows that the same information Google uses to target advertising is also used to make its products, like Google Maps, so great. I find it very odd that Cook implies the only use for such data is to “monetize” through advertising. iPhone and iCloud could be made much better if the computer systems could analyze the data people are storing in them. This is obvious.

The real issue that Apple is trying to address is not really privacy, but rather security. Though Google has all of my data, it is still private. Google does not sell access to my data; it sells access to my attention. Advertisers do not get my information from Google. So as long as I trust Google’s employees, the only two potential breaches of my privacy are from the government or from a hacker. If we accept this as a fact, the fundamental privacy question changes from, “Do you respect my privacy?” to “Is the user experience improvement worth the security risk to my private information?”

As long as people understand the potential risks, the answer to the second question is almost always, “Yes.” And with the emergence of artificial intelligence, the answer to that question will become increasingly more clear. The vast improvements in user experience far, far outweigh the potential security risks to private information.

Unfortunately, Apple has answered, “No.”
privacy  software  technology  artificialintelligence  datamining  security  apple  google 
february 2015 by jtyost2
Data Scientists Can Link Your Instagrams To Your Credit Card Purchases
When I tweeted from a Knicks game at Madison Square Garden on Dec. 2, I had no idea that data scientists could use that information to find out I’d used my MasterCard to buy an overpriced $12 beer — as well as identify all my other credit card purchases.

But with as few as four publicly available geo-tagged data points, scientists can accurately connect 90 percent of people to their credit card transactions, according to research published in the journal Science on Friday. That data is supposed to be anonymous, but it’s not really, and women and high-income people have less anonymity than others.

The study used metadata from three months of credit card transactions made by 1.1 million people who shopped at 10,000 stores in an unnamed (for now?) wealthy country. This metadata had no names, no account numbers, nor any other information that would make it easy to identify someone. The only transaction data available was the day it took place, the rough location and — in a separate model — the amount spent.

The researchers were then able to take geo-tagged information — such as Instagram photos, tweets and Facebook posts — and use it to mine the “anonymous” credit card metadata. So, in my case, they could combine my tweet from M.S.G. with three other data points — maybe when I posted on Facebook from Whole Foods, the public library and the gym — to match my name to my user ID in the transactions.
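
The matching step being described works like a filter: each public geo-tagged post rules out every anonymous transaction history that is inconsistent with it, and with roughly four constraints only one history usually survives. A toy sketch with made-up field values and a crude exact day/venue comparison rather than the paper's spatio-temporal tolerances:

    # anonymized records: (opaque_user_id, day, venue_area); public clues: (day, venue_area)
    def consistent(history, clue):
        day, area = clue
        return any(t_day == day and t_area == area for t_day, t_area in history)

    def reidentify(transactions, clues):
        histories = {}
        for uid, day, area in transactions:
            histories.setdefault(uid, []).append((day, area))
        # keep only the users whose history matches every public clue
        return [uid for uid, h in histories.items()
                if all(consistent(h, c) for c in clues)]

    transactions = [
        ("u1", "12-02", "MSG"), ("u1", "12-03", "WholeFoods"),
        ("u1", "12-05", "Library"), ("u1", "12-06", "Gym"),
        ("u2", "12-02", "MSG"), ("u2", "12-04", "Cafe"),
    ]
    clues = [("12-02", "MSG"), ("12-03", "WholeFoods"),
             ("12-05", "Library"), ("12-06", "Gym")]
    print(reidentify(transactions, clues))   # -> ['u1'], only one candidate left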
privacy  datamining  civilrights  research  technology 
february 2015 by jtyost2
With a Few Bits of Data, Researchers Identify ‘Anonymous’ People
Even when real names and other personal information are stripped from big data sets, it is often possible to use just a few pieces of the information to identify a specific person, according to a study to be published Friday in the journal Science.

In the study, titled “Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata,” a group of data scientists analyzed credit card transactions made by 1.1 million people in 10,000 stores over a three-month period. The data set contained details including the date of each transaction, amount charged and name of the store.

Although the information had been “anonymized” by removing personal details like names and account numbers, the uniqueness of people’s behavior made it easy to single them out.

In fact, knowing just four random pieces of information was enough to reidentify 90 percent of the shoppers as unique individuals and to uncover their records, researchers calculated. And that uniqueness of behavior — or “unicity,” as the researchers termed it — combined with publicly available information, like Instagram or Twitter posts, could make it possible to reidentify people’s records by name.
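
The "unicity" number is just the fraction of people whose records are uniquely pinned down by k points drawn from their own history. A rough sketch of the definition on a toy dataset (an illustration of the concept, not the paper's estimator):

    import random

    def unicity(histories, k=4, trials=500):
        """histories: dict of user_id -> set of (date, store) points."""
        users = list(histories)
        unique = 0
        for _ in range(trials):
            target = random.choice(users)
            points = random.sample(sorted(histories[target]), k)
            # how many users' histories contain all k sampled points?
            matches = sum(1 for u in users if all(p in histories[u] for p in points))
            if matches == 1:
                unique += 1
        return unique / trials   # e.g. 0.9 would mean 90 percent reidentifiable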
privacy  research  datamining  science  technology 
january 2015 by jtyost2
How three small credit card transactions could reveal your identity | Computerworld
Just three small clues -- receipts for a pizza, a coffee and a pair of jeans -- are enough information to identify a person's credit card transactions from among those of a million people, according to a new study.

The findings, published in the journal Science, add to other research showing that seemingly anonymous data sets may not protect people's privacy under rigorous analysis.
legal  privacy  civilrights  humanrights  datamining  database  information 
january 2015 by jtyost2
To avoid detection, terrorists purposely sent emails with spammy subject lines
By now, it’s common knowledge the National Security Agency collects plenty of data on suspected terrorists as well as ordinary citizens. But the agency also has algorithms in place to filter out information that doesn’t need to be collected or stored for further analysis, such as spam emails—a fact terrorists used to their advantage.

Much of the debate around the NSA’s overreach has focused on selectors, the terms it uses to describe its requests for information collected. According to a transparency report it published last summer, the agency was approved to use 423 selectors in 2013 under its telephone metadata program. However, filters, which specify data the agency does not want, also play an important role in reducing noise.

In a paper published by the American Mathematical Society, the agency’s research director, Michael Wertheimer, recalled an instance when the US seized laptops left by Taliban members soon after the 9/11 attacks. The only email written in English found on the computers contained a purposely spammy subject line: “CONSOLIDATE YOUR DEBT.” According to Wertheimer, the email was sent to and from nondescript addresses that were later confirmed to belong to combatants.

“It is surely the case that the sender and receiver attempted to avoid allied collection of this operational message by triggering presumed ‘spam’ filters,” he said, noting the agency is constantly refining its algorithms to discover new threats.
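
The trick works because bulk-collection pipelines throw away anything a cheap classifier scores as spam before an analyst ever sees it. A deliberately naive sketch of such a keyword filter, showing why a subject line like "CONSOLIDATE YOUR DEBT" would be dropped (the NSA's real filters are of course not public):

    SPAM_MARKERS = {"consolidate your debt", "act now", "miracle", "free offer"}

    def looks_like_spam(subject):
        s = subject.lower()
        return any(marker in s for marker in SPAM_MARKERS)

    def keep_for_analysis(messages):
        # retain only messages the naive filter does NOT flag as spam
        return [m for m in messages if not looks_like_spam(m["subject"])]

    inbox = [{"subject": "CONSOLIDATE YOUR DEBT"}, {"subject": "meeting notes"}]
    print([m["subject"] for m in keep_for_analysis(inbox)])   # the spammy one never reaches an analyst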
terrorism  security  datamining  email  communication 
january 2015 by jtyost2
This Algorithm Knows You Better Than Your Facebook Friends Do | FiveThirtyEight
In a paper published online today in the Proceedings of the National Academy of Sciences, the researchers show that yes, computers can know us better than we know each other, at least as measured by a computerized personality test.

Wu and her colleagues exploited a database of myPersonality’s 100-item questionnaires that measured users on a Five Factor Model of Personality, gathering data on how open, conscientious, extroverted, agreeable and neurotic people were. This wasn’t one of those sham personality tests, either — the app was created in 2007 by David Stillwell, deputy director of the Psychometrics Centre, and based on a scientifically validated model of personality.

Users could also ask friends to assess their personalities using an abbreviated, 10-question version of the test, and the myPersonality database now contains more than 300,000 such friend-ratings. People who use the app can opt in to share their anonymized personality ratings and Facebook data for research purposes, and more than 40 percent of the app’s 7.5 million users have done so.

Wu and Kosinski developed an algorithm that predicts somebody’s Five Factor personality type using only Facebook likes. Using a sample of 17,622 U.S. participants who had been judged by at least one friend and a group of 14,410 users who’d had two friends fill out the 10-question survey, the researchers measured correlations between the self-judgments of personality and the judgments made by Facebook friends.
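
At its core, the algorithm is a regression from a very sparse user-by-page "likes" matrix onto the five trait scores. A hedged sketch of that setup with scikit-learn; dimensionality reduction followed by linear regression is a common recipe for this kind of data, though the paper's exact pipeline may differ, and the data below is random noise standing in for real likes and scores:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    n_users, n_pages = 1000, 5000
    X = sparse_random(n_users, n_pages, density=0.01, format="csr")  # users x liked pages
    y = np.random.randn(n_users)            # stand-in for one trait score, e.g. openness

    model = make_pipeline(TruncatedSVD(n_components=50), Ridge(alpha=1.0))
    model.fit(X, y)
    predicted = model.predict(X)
    print(np.corrcoef(predicted, y)[0, 1])  # compare against self- and friend-judgments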
research  science  psychology  datamining  technology  computerscience 
january 2015 by jtyost2
Mark Zuckerberg Is Not Oprah
At the moment, Amazon–especially with the acquisition of Goodreads–knows a whole lot about readers' habits: their purchases, libraries, and conversations. But Facebook has always been a leader in personal-data collection, and so easily has the potential to pinpoint where reading fits into its users' lifestyles.

There has been talk of Facebook becoming a one-stop shop someday: the app you open whenever you need to buy, see, or say anything. There's the potential for a secondary marketplace here, too: very personal, granular, and potentially influential user data. Mark Zuckerberg's book club may be just an unambitious, unambiguous book club, but it's resting on a goldmine of powerful, hard-to-come-by information. Whether Facebook collects this data strategically or not, such information could be very useful for the publishing industry. As with Nielsen BookScan or Next Big Book, the publishing industry might even be willing to pay for it.

Transparency around data collection–what's collected, who can see it, and what it's used for – is a bit of a pipe dream. With the present fervor for data has come a push for that data to stay proprietary, to be hoarded: even data that isn't particularly relevant to a company's core business goals is often squirreled away, just in case. (Grease for the ever-possible pivot.) But publishers provide the books that fuel "A Year of Books," just as users provide the data that fuels Facebook. It wouldn't take much for data about readers, about people–in a spirit not dissimilar to the kumbaya of an Internet-wide book club–to be open to those who generate it, and available in some capacity to those who might need to rely on it.

This will never happen, of course. When people hang out in privately owned spaces on the Internet, it effectively throws a barricade up against sociological inquiry. Plus, if the predictions about a Facebook marketplace ring true, then books are a perfect first foray: Facebook can sell to a book club of book-buyers, while simultaneously creating a database that's highly valuable to publishers. It seems unlikely that this information will ever be shared, or free–but I suppose it's not Facebook's responsibility to be a tool for the publishing industry, generous as that would be.

Though his platform falls short in sustaining a conversation about books, Zuckerberg has shown that he can start one. For publishers, Facebook should be considered as more than just a booster platform. For readers, the most generous, benevolent thing Zuckerberg can do now is encourage people to move the conversation he's started somewhere else. Just as the consolidation of data on a single platform can be unwise, there's no need to stuff all of life's small pleasures into one comment thread. When in doubt, turn to Oprah: she held up the books, but viewers happily threw their parties without her.
amazon.com  business  publishing  datamining  privacy  books  facebook  Goodreads 
january 2015 by jtyost2
I Quant NY - How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips
The receipt on the left from RidLinQ (the payment app for CMT) shows the same 20% tip amount of $3.20 as the receipt on the right from Way2Ride (the payment app for Verifone). But the base fares in green are different. Here, CMT is calculating tips including tax, giving the driver the same default tip on a smaller base fare than the Verifone driver.

So what does this mean? Well, in short, cab drivers who are driving CMT-programmed cabs are making more money in tips from New Yorkers than those driving Verifone-programmed cabs.
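
The difference comes down to which base the default 20% button is applied to. A quick worked example with illustrative numbers:

    fare = 14.50          # metered fare (illustrative)
    tax_and_fees = 1.50   # tax and surcharges (illustrative)

    tip_on_fare_only = 0.20 * fare                     # Verifone-style default
    tip_including_tax = 0.20 * (fare + tax_and_fees)   # CMT-style default

    print(f"{tip_on_fare_only:.2f}")                       # 2.90
    print(f"{tip_including_tax:.2f}")                      # 3.20, the amount on the CMT receipt
    print(f"{tip_including_tax - tip_on_fare_only:.2f}")   # 0.30 more per ride, which adds up citywide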
datamining  software  technology  taxi 
january 2015 by jtyost2
The Future of Getting Arrested
Even the most straightforward arrest is built upon an incredibly complex foundation: the moment the handcuffs go on is the moment some of our society’s most hotly contested ideas about justice, security, and liberty are brought to bear on an individual. It’s also a moment that’s poised to change dramatically, as law-enforcement agencies around the country adopt new technology—from predictive-policing software to surveillance cameras programmed to detect criminal activity—and incorporate emerging research into the work of apprehending suspects.

Not all of the innovations that are in the works will necessarily become widely used, of course. Experts say that many of them will ultimately require trade-offs that the public may not be willing to make. “We’re approaching a world where it’s becoming technologically possible to ensure 100 percent compliance with a lot of laws,” says Jay Stanley, a senior policy analyst at the American Civil Liberties Union. “For example, we could now pretty easily, if we wanted to, enforce 100 percent compliance with speed limits.” That doesn’t mean we will.

Here, drawn from interviews with a range of thinkers and practitioners, is a glimpse of how tomorrow’s police officers may go about identifying, pursuing, and arresting their targets.
police  legal  ethics  government  politics  justice  crime  technology  software  hardware  datamining  civilrights  humanrights 
january 2015 by jtyost2
Schneier on Security: Who Might Control Your Telephone Metadata
Remember last winter when President Obama called for an end to the NSA's telephone metadata collection program? He didn't actually call for an end to it; he just wanted it moved from an NSA database to some commercial database. (I still think this is a bad idea, and that having the companies store it is worse than having the government store it.)

Anyway, the Director of National Intelligence solicited companies who might be interested and capable of storing all this data. Here's the list of companies that expressed interest. Note that Oracle is on the list -- the only company I've heard of. Also note that many of these companies are just intermediaries that register for all sorts of things.
legal  privacy  technology  government  surveillance  nsa  telephone  datamining  freedom  civilrights  freedomfromsearchandseizure  telecommunications  metadata 
december 2014 by jtyost2
Operation Socialist: How GCHQ Spies Hacked Belgium’s Largest Telecom
When the incoming emails stopped arriving, it seemed innocuous at first. But it would eventually become clear that this was no routine technical problem. Inside a row of gray office buildings in Brussels, a major hacking attack was in progress. And the perpetrators were British government spies.

It was in the summer of 2012 that the anomalies were initially detected by employees at Belgium’s largest telecommunications provider, Belgacom. But it wasn’t until a year later, in June 2013, that the company’s security experts were able to figure out what was going on. The computer systems of Belgacom had been infected with a highly sophisticated malware, and it was disguising itself as legitimate Microsoft software while quietly stealing data.

Last year, documents from National Security Agency whistleblower Edward Snowden confirmed that British surveillance agency Government Communications Headquarters was behind the attack, codenamed Operation Socialist. And in November, The Intercept revealed that the malware found on Belgacom’s systems was one of the most advanced spy tools ever identified by security researchers, who named it “Regin.”

The full story about GCHQ’s infiltration of Belgacom, however, has never been told. Key details about the attack have remained shrouded in mystery—and the scope of the attack unclear.

Now, in partnership with Dutch and Belgian newspapers NRC Handelsblad and De Standaard, The Intercept has pieced together the first full reconstruction of events that took place before, during, and after the secret GCHQ hacking operation.

Based on new documents from the Snowden archive and interviews with sources familiar with the malware investigation at Belgacom’s networks, The Intercept and its partners have established that the attack on Belgacom was more aggressive and far-reaching than previously thought. It occurred in stages between 2010 and 2011, each time penetrating deeper into Belgacom’s systems, eventually compromising the very core of the company’s networks.
GCHQ  legal  surveillance  privacy  government  Belgacom  communication  encryption  technology  datamining 
december 2014 by jtyost2
The Simple Reason Why Goodreads Is So Valuable to Amazon
The United States is not, sadly, a country of lit buffs. In 2008, a little more than half of all American adults reported reading a book that was not required for work or school during the past year, according to the National Endowment for the Arts. And as shown in the graph below, which like the other charts in this piece come courtesy of the industry researchers at Codex Group and updates the sample data to match the 2010 Census, just 19 percent read a dozen or more titles.

Or, to put it another way, according to Codex just 19 percent of Americans do 79 percent of all our (non-required) book readin'.
goodreads  books  ebooks  datamining  information  kindle  amazon.com 
december 2014 by jtyost2
I Asked a Privacy Lawyer What Facebook's New Terms and Conditions Will Mean for You | VICE | United States
Your expertise covers the crazy stuff companies try to hide in their TOS. How does Facebook compare to most of what you see?
​Out of all the TOS I have dealt with in 20 years, Facebook's are the most intrusive. To be granted rights to track an individual's movements, and thus the people that would be with those individuals, and to potentially commercially exploit without permission all pictures posted on Facebook without specific consent, is breath-taking.

Users must take responsibility for their data. Facebook's ability to exploit our data is contingent upon our allowing them to do so. It is up to us to value our privacy and to spend a few minutes setting some restrictions on the privacy settings.
facebook  eula  privacy  civilrights  business  advertising  legal  technology  database  datamining  information 
december 2014 by jtyost2
Why Google Has Become Germany's Bogeyman — And Why It Matters
IN ITS nearly 500-year history, Unter den Linden, Berlin’s main boulevard, has seen many political protests. But the one held recently in front of Google’s offices in the German capital must be among the most peculiar.
Activists demonstrated against iWright, new software that can supposedly write novels and is supposedly backed by Google. “It’s a declaration of war against all authors”, an organiser said.

If all this sounds like a PR stunt, it probably was—though it is not clear what for. Still, it nicely captures the mood in Germany. Although Germans seem to love Google’s services (it has a 91% market share in online search), the firm itself is seen as a digital glutton that intends to ingest everything: personal data, intellectual property, industry, even democracy. Sigmar Gabriel, Germany’s economy minister and vice-chancellor, has gone so far as to suggest that the company be broken up.

In recent weeks things have calmed somewhat: next month Mr Gabriel is due to appear on a discussion panel with Eric Schmidt, Google’s executive chairman (who is also on the board of The Economist’s parent company). But Germany’s Googlephobia is a big reason why Joaquín Almunia, Europe’s competition commissioner, is likely to announce shortly that he plans to renegotiate a deal with Google to settle antitrust charges.

One reason for the aversion is, predictably, Edward Snowden’s revelations. Since any German attempts to curb American spooks’ snooping are doomed, Google and other big American tech firms have become ersatz targets. This is in tune with a mounting digital revulsion in Germany.
google  germany  privacy  civilrights  humanrights  information  datamining  government  legal  advertising  business  searchengine 
september 2014 by jtyost2
BBC News - Privacy fears over FBI facial recognition database
Campaigners have raised privacy concerns over a facial recognition database being developed by the FBI that could contain 52m images by 2015.

The civil liberties group Electronic Frontier Foundation (EFF) obtained information about the project through a freedom of information request.

It said it was concerned that images of non-criminals would be stored alongside those of criminals.

The FBI say the database will reduce terrorist and criminal activities.

The facial recognition database is part of the bureau's Next Generation Identification (NGI) programme which is a large biometric database being developed to replace the current Integrated Automated Fingerprint Identification System (IAFIS).

The programme, which is being rolled out over a number of years, will offer "state of the art biometric identification services" according to the bureau's website.

As well as facial recognition images, the programme is being developed to include the capture and storage of fingerprints, iris scans and palm prints.
eff  legal  civilrights  freedom  privacy  datamining  fbi  government  police 
april 2014 by jtyost2
4 Surprising Things Facebook Has Learned From Your Relationship Status | Mental Floss
Facebook has a lot of data about its users, and it also has a data science division dedicated to transforming all that data into interesting information. They recently publicized a series of studies around the topic of love. While many of the results match up well with our expectations (e.g., people tend to marry within their religion), not all of them were so obvious. Here are some things Facebook has learned from looking at your relationship status.
facebook  datamining  privacy  culture  socialmedia  socialnetworking 
february 2014 by jtyost2
Twitter Opens Its Enormous Archives to Data-Hungry Academics | Wired Enterprise | Wired.com
Twitter is sharing its massive trove of data with the academic world — for free.

The social networking outfit has long sold access to its enormous collection of tweets — a record of what the people of the world are doing and saying — hooking companies like Google and Yahoo into the “Twitter fire hose.” But now, through a new grant program, it wants to make it easier for social scientists and other academics to explore its tweet archive, which stretches back to 2006.

Twitter previously worked with researchers from Johns Hopkins University to predict where flu outbreaks will hit, and the new program aims to open doors for similar projects. The company is now accepting applications from researchers, who have until March 15 to submit a proposal.
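
Studies like the flu one usually start from something as simple as counting symptom keywords per week in a tweet archive and comparing the resulting curve with official case counts. A minimal, hypothetical sketch over a JSON-lines dump; the field names and keyword list are assumptions, not Twitter's schema or the researchers' method:

    import json
    from collections import Counter
    from datetime import datetime

    FLU_TERMS = ("flu", "influenza", "fever", "sore throat")

    weekly = Counter()
    with open("tweets.jsonl") as f:              # hypothetical archive file
        for line in f:
            tweet = json.loads(line)
            if any(term in tweet["text"].lower() for term in FLU_TERMS):
                day = datetime.strptime(tweet["created_at"], "%Y-%m-%d")
                weekly[day.strftime("%Y-%W")] += 1

    for week in sorted(weekly):
        print(week, weekly[week])    # compare this series against CDC surveillance numbers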

Academics see huge value in the data collected by social media companies like Twitter and Facebook. “You’ve got potentially the largest data set on human interaction ever,” Devin Gaffney — a developer at a tech startup called Little Bird who holds a master’s degree in Social Science of the Internet from Oxford University — told us last year. “It will be biased towards people who are on the internet, but it’s still better than before. Plus, it’s less work. You don’t have to talk to 10,000 people. You just write some code to do it for you.”

But researchers often struggle to gain access to the troves of data jealously guarded by social media companies. Facebook has shared its data with a few well known researchers, but it’s hard for most people to get a look at. And Twitter only makes a small portion of its data available through its API, or application programming interface. If you want access to what Twitter calls the fire hose, you’ll have to pay a premium to be one of its partner companies. Access to the fire hose generally starts at about $500 a month. Twitter’s Data Grants program gives researchers a different route to the data, providing access through a reseller called Gnip.

It’s unclear whether researchers can share these data sets with other academics in order to do peer review, and the company did not respond to a request for comment. But if the program follows the same terms of service as the Twitter API, then researchers won’t be able to re-publish their data.

The lack of peer review can make it hard to evaluate the data studies published by social media companies themselves. For example, Facebook has published some of its own research on migration patterns and the evolution of memes within the social network, but it hasn’t allowed outsiders to verify its results.

But such verification is key part of doing science. Pete Warden, a former Apple developer now at Jetpac, experienced this problem first hand in 2010 when he published an analysis of location data he scraped from Twitter. He originally shared both his data set and his results, but eventually took the data set down due to legal pressure from Facebook, making it impossible to conduct any sort of peer review on his work.

Regardless, Twitter’s program is welcome news. Some access to this enormous dataset is far better than none.
twitter  datamining  research  science  socialmedia 
february 2014 by jtyost2
Ford Exec: 'We Know Everyone Who Breaks The Law' Thanks To Our GPS In Your Car
Farley was trying to describe how much data Ford has on its customers, and illustrate the fact that the company uses very little of it in order to avoid raising privacy concerns: "We know everyone who breaks the law, we know when you're doing it. We have GPS in your car, so we know what you're doing. By the way, we don't supply that data to anyone," he told attendees.

Rather, he said, he imagined a day when the data might be used anonymously and in aggregate to help other marketers with traffic related problems. Suppose a stadium is holding an event; knowing how much traffic is making its way toward the arena might help the venue change its parking lot resources accordingly, he said.

A Ford spokesperson later told Business Insider that in general, GPS units in Ford cars are not routinely pinging out their whereabouts as customers drive around. Rather, Ford cars have several on-board services such as "Sync Services Directions" (a navigation device that works with drivers' phones) and 911 Assist, which users have to switch on and opt into. And employers can use a service called "Crew Chief" to monitor their corporate car fleet. Data coming from those services is generally used only to improve services, a spokesperson says.

Farley himself then walked back the statement, saying "I absolutely left the wrong impression about how Ford operates. We do not track our customers in their cars without their approval or consent."
ford  privacy  legal  automotive  business  data  datamining 
january 2014 by jtyost2
How Netflix Reverse Engineered Hollywood - Alexis C. Madrigal - The Atlantic
How do you systematically dismember thousands of movies using a bunch of different people who all need to have the same understanding of what a given microtag means? In 2006, Yellin holed up with a couple of engineers and spent months developing a document called "Netflix Quantum Theory," which Yellin now derides as "our pretentious name." The name refers to what Yellin used to call "quanta," the little "packets of energy" that compose each movie. He now prefers the term "microtag."

The Netflix Quantum Theory doc spelled out ways of tagging movie endings, the "social acceptability" of lead characters, and dozens of other facets of a movie. Many values are "scalar," that is to say, they go from 1 to 5. So, every movie gets a romance rating, not just the ones labeled "romantic" in the personalized genres. Every movie's ending is rated from happy to sad, passing through ambiguous. Every plot is tagged. Lead characters' jobs are tagged. Movie locations are tagged. Everything. Everyone.

That's the data at the base of the pyramid. It is the basis for creating all the altgenres that I scraped. Netflix's engineers took the microtags and created a syntax for the genres, much of which we were able to reproduce in our generator.

Netflix's personalized genres are, in their own weird way, a tool for introspection.
To me, that's the key step: It's where the human intelligence of the taggers gets combined with the machine intelligence of the algorithms. There's something in the Netflix personalized genres that I think we can tell is not fully human, but is revealing in a way that humans alone might not be.

For example, the adjective "feel good" gets attached to movies that have a certain set of features, most importantly a happy ending. It's not a direct tag that people attach so much as a computed movie category based on an underlying set of tags.

The only semi-similar project that I could think of is Pandora's once-lauded Music Genome Project, but what's amazing about Netflix is that its descriptions of movies are foregrounded. It's not just that Netflix can show you things you might like, but that it can tell you what kinds of things those are. It is, in its own weird way, a tool for introspection.

That distinguishes it from Netflix's old way of recommending movies to you, too. The company used to trumpet the fact that it could kind of predict how many stars you might give a movie. And so, the company encouraged its users to rate movie after movie, so that it could take those numeric values and develop a taste profile for you.

They even offered a $1 million prize to the team that could design an algorithm that would improve the company's ability to predict how many stars users would give movies. It took years to improve the algorithm by a mere 10 percent.

The prize was awarded in 2009, but Netflix never actually incorporated the new models. That's in part because of the work required, but also because Netflix had decided to "go beyond the 5 stars," which is where the personalized genres come in.

The human language of the genres helps people identify with the recommendations. "Predicting something is 3.2 stars is kind of fun if you have an engineering sensibility, but it would be more useful to talk about dysfunctional families and viral plagues. We wanted to put in more language," Yellin said. "We wanted to highlight our personalization because we pride ourselves on putting the right title in front of the right person at the right time."

And nothing highlights their personalization like throwing you a very, very specific altgenre.

So why aren't they ultraspecific, which is to say, super long, like the gonzo genres that our play generator can create?

Yellin said that the genres were limited by three main factors: 1) they only want to display 50 characters for various UI reasons, which eliminates most long genres; 2) there had to be a "critical mass" of content that fit the description of the genre, at least in Netflix's extended DVD catalog; and 3) they only wanted genres that made syntactic sense.

"We're gonna tag how much romance is in a movie. We're not gonna tell you how much romance is in it, but we're gonna recommend it."
We ignore all of these constraints and that's precisely why our generator is hilarious. In Netflix's real world, there are no genres that have more than five descriptors. Four descriptors are rare, but they do show up for users: Scary Cult Mad-Scientist Movies from the 1970s. Three descriptors are more common: Feel-good Foreign Comedies for Hopeless Romantics. Two are widely used: Steamy Mind Game Movies. And, of course, there are many ones: Quirky Movies.
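
Given those constraints (a character budget, a minimum amount of matching content, and a fixed syntax), the genre names themselves fall out of a very small template over the microtags. A toy sketch in the spirit of the article's generator; the vocabulary and thresholds are invented:

    import random

    ADJECTIVES = ["Feel-good", "Quirky", "Steamy", "Gritty", "Scary"]
    SUBJECTS   = ["Foreign Comedies", "Mind Game Movies", "Cult Mad-Scientist Movies",
                  "Action Adventure Movies"]
    QUALIFIERS = ["for Hopeless Romantics", "from the 1970s", "Based on Real Life", ""]

    def altgenre(matching_titles, max_chars=50, critical_mass=25):
        while True:
            name = " ".join(part for part in (random.choice(ADJECTIVES),
                                              random.choice(SUBJECTS),
                                              random.choice(QUALIFIERS)) if part)
            # constraint 1: fits the UI character budget
            # constraint 2: enough titles in the catalog actually carry these tags
            if len(name) <= max_chars and matching_titles(name) >= critical_mass:
                return name

    # stand-in for a real lookup against the tagged catalog
    print(altgenre(lambda name: 40))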

A fascinating thing I learned from Yellin is that the underlying tagging data isn't just used to create genres, but also to increase the level of personalization in all the movies a user is shown. So, if Netflix knows you love Action Adventure movies with high romantic ratings (on their 1-5 scale), it might show you that kind of movie, without ever saying, "Romantic Action Adventure Movies."

"We're gonna tag how much romance is in a movie. We're not gonna tell you how much romance is in it, but we're gonna recommend it," Yellin said. "You're gonna get an action row and it may have more or less romance in it based on what we know about you."

As Yellin talked, it occurred to me that Netflix has built a system that really only has one analog in the tech world: Facebook's NewsFeed. But instead of serving you up the pieces of web content that the algorithm thinks you'll like, Netflix is serving you up filmed entertainment.

Which makes its hybrid human and machine intelligence approach that much more impressive. They could have purely used computation. For example, looking at people with similar viewing habits and recommending movies based on what they watched. (And Netflix does use this kind of data, too.) But they went beyond that approach to look at the content itself.

"It's a real combination: machine-learned, algorithms, algorithmic syntax," Yellin said, "and also a bunch of geeks who love this stuff going deep."

As a thought experiment: Imagine if Facebook broke down individual websites according to a 36-page tagging document that let the company truly understand what it was people liked about Atlantic or Popular Science or 4chan or ViralNova?

It might be impossible with web content. But if Netflix's system didn't already exist, most people would probably say that it couldn't exist either.
netflix  software  technology  datamining  research  personalization  algorithm 
january 2014 by jtyost2
As New Services Track Habits, the E-Books Are Reading You - NYTimes.com
Before the Internet, books were written — and published — blindly, hopefully. Sometimes they sold, usually they did not, but no one had a clue what readers did when they opened them up. Did they skip or skim? Slow down or speed up when the end was in sight? Linger over the sex scenes?

A wave of start-ups is using technology to answer these questions — and help writers give readers more of what they want. The companies get reading data from subscribers who, for a flat monthly fee, buy access to an array of titles, which they can read on a variety of devices. The idea is to do for books what Netflix did for movies and Spotify for music.

“Self-published writers are going to eat this up,” said Mark Coker, the chief executive of Smashwords, a large independent publisher. “Many seem to value their books more than their kids. They want anything that might help them reach more readers.”

Last week, Smashwords made a deal to put 225,000 books on Scribd, a digital library here that unveiled a reading subscription service in October. Many of Smashwords’ books are already on Oyster, a New York-based subscription start-up that also began in the fall.

The move to exploit reading data is one aspect of how consumer analytics is making its way into every corner of the culture. Amazon and Barnes & Noble already collect vast amounts of information from their e-readers but keep it proprietary. Now the start-ups — which also include Entitle, a North Carolina-based company — are hoping to profit by telling all.

“We’re going to be pretty open about sharing this data so people can use it to publish better books,” said Trip Adler, Scribd’s chief executive.

Quinn Loftis, a writer of young adult paranormal romances who lives in western Arkansas, interacts extensively with her fans on Facebook, Pinterest, Twitter, Goodreads, YouTube, Flickr and her own website. These efforts at community, most of which did not exist a decade ago, have already given the 33-year-old a six-figure annual income. But having actual data about how her books are being read would take her market research to the ultimate level.

“What writer would pass up the opportunity to peer into the reader’s mind?” she asked.

Scribd is just beginning to analyze the data from its subscribers. Some general insights: The longer a mystery novel is, the more likely readers are to jump to the end to see who done it. People are more likely to finish biographies than business titles, but a chapter of a yoga book is all they need. They speed through romances faster than religious titles, and erotica fastest of all.

At Oyster, a top book is “What Women Want,” promoted as a work that “brings you inside a woman’s head so you can learn how to blow her mind.” Everyone who starts it finishes it. On the other hand, Arthur M. Schlesinger Jr.’s “The Cycles of American History” blows no minds: fewer than 1 percent of the readers who start it get to the end.

Oyster data shows that readers are 25 percent more likely to finish books that are broken up into shorter chapters. That is an inevitable consequence of people reading in short sessions during the day on an iPhone.
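
Findings like these are straightforward aggregates over reading-event logs. A hedged sketch of the completion-rate calculation, with an invented event schema rather than either company's real one:

    import csv
    from collections import defaultdict

    # events.csv: reader_id, book_id, genre, furthest_position (0.0-1.0) -- hypothetical schema
    started, finished = defaultdict(int), defaultdict(int)

    with open("events.csv", newline="") as f:
        for row in csv.DictReader(f):
            genre = row["genre"]
            started[genre] += 1
            if float(row["furthest_position"]) >= 0.95:   # treat reaching 95% as "finished"
                finished[genre] += 1

    for genre in sorted(started):
        rate = 100.0 * finished[genre] / started[genre]
        print(f"{genre}: {rate:.0f}% completion")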

A few writers might be repelled by too much knowledge. But others would be fascinated, as long as they retained control.
ebooks  business  publishing  writing  analytics  information  datamining  privacy 
december 2013 by jtyost2
Will A Computer Decide Whether You Get Your Next Job?
With these new techniques, Xerox says it has been able to improve its hiring and significantly reduce turnover at its call centers.

Other companies that parse employee data are finding surprising results. Michael Rosenbaum of Pegged Software, a company that works with hospitals, says one piece of conventional wisdom is flat-out wrong: “We find zero statistically significant correlation between a college degree or a master’s degree and success as a software developer.”
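
A claim of "zero statistically significant correlation" typically comes from a simple test of association between the credential and some performance metric. A minimal sketch of such a check with SciPy; the numbers are placeholders, not Pegged Software's data:

    import numpy as np
    from scipy import stats

    # 1 = has a degree, 0 = does not; performance = some on-the-job metric
    has_degree  = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
    performance = np.array([72, 74, 81, 80, 69, 70, 75, 77, 78, 76])

    r, p_value = stats.pointbiserialr(has_degree, performance)
    print(f"r = {r:.2f}, p = {p_value:.3f}")
    # a large p-value is what "no statistically significant correlation" means in practice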

Of course, using data to drive hiring decisions has its problems. Employers guided by data could wind up skipping over promising candidates. But Barbara Marder of the consulting firm Mercer points out that the way companies hire now has its own flaws. We like to hire people who are like us. People who went to schools we know. People who were referred to us by our friends.

“A lot of these new techniques do have the potential to eliminate biases,” Marder says.
employment  technology  datamining  database  software  softwareengineering  hardware 
december 2013 by jtyost2
How Iron Maiden found its worst music pirates -- then went and played for them
In the case of Iron Maiden, still a top-drawing band in the U.S. and Europe after thirty years, it noted a surge in traffic in South America. Also, it saw that Brazil, Venezuela, Mexico, Colombia, and Chile were among the top 10 countries with the most Iron Maiden Twitter followers. There was also a huge amount of BitTorrent traffic in South America, particularly in Brazil.

Rather than send in the lawyers, Maiden sent itself in. The band has focused extensively on South American tours in recent years, one of which was filmed for the documentary "Flight 666." After all, fans can't download a concert or t-shirts. The result was massive sellouts. The São Paulo show alone grossed £1.58 million (US$2.58 million).

And in a positive cycle, Maiden's online fanbase grew. According to Musicmetric, in the 12 months ending May 31, 2012, the band attracted more than 3.1 million social media fans. After its Maiden England world tour, which ran from June 2012 to October 2013, Maiden's fan base grew by five million online fans, with a significant increase in popularity in South America.

music  business  copyright  privacy  bittorrent  analytics  information  datamining 
december 2013 by jtyost2
Ways to make fake data look meaningful
If you didn't get the joke, I don't recommend you do any of these things. What I do recommend is that when you are a consumer of other people's data, you are skeptical by default and are on the lookout for these tricks and others. And if you are the person analyzing the data, show respect to your readers and give them the necessary information to confirm your conclusions.

Be especially wary when there is more of an incentive to be interesting than to be accurate. The easiest way to come up with an interesting and defensible story is with fake statistics. Fake may strike you as a strong word to use, but I think it is fair. Statistics are either presented in a mathematically defensible way or they aren't, and it really doesn't matter if it is due to ignorance or malice.

There is a common internet phrase "pics or it didn't happen", which is often posted as a reply to impressive claims without evidence. While a shorthand response like this can be viewed as rude, I believe that people making impressive claims without the proper evidence is more rude. Therefore, I propose a similar phrase, raw data or it didn't happen, as a quick response for people who make impressive analytical claims without the necessary statistical evidence.
statistics  datamining  information  science 
december 2013 by jtyost2
CIA-backed Palantir Technologies raises $107.5 million
Palantir Technologies, the data-mining company that is partly backed by the Central Intelligence Agency, has raised another $107.5 million, according to a filing.

The funding round, which brings the total raised by the company to over $800 million, values the company at more than $9 billion, according to a person familiar with the situation.

Clients such as the National Security Agency, the Federal Bureau of Investigation and the CIA, along with corporate customers such as banks, use Palantir software to piece together information on terrorist plots, financial frauds and the like.

Palantir’s chief executive, Alex Karp, did not immediately respond to a request for comment.

The Palo Alto., California-based start-up has drawn attention because of its Prism software product, which has the same name as an NSA program that monitors emails and other communications of ordinary people on a mass scale. Palantir has said its product is separate from the NSA program.

Investment bank Morgan Stanley & Co. earned a $7 million commission on the funding round, according to the filing, which was made on Tuesday at the Securities and Exchange Commission.
Palantir  datamining  cloudcomputing  technology  business 
december 2013 by jtyost2
The Brilliant Hack That Brought Foursquare Back From the Dead | Wired Business | Wired.com
It all began when engineer Anoop Ranganth sat down for a chat with data scientist Blake Shaw.

In January, Ranganth took on the task of building a prototype for a new Foursquare app. By the spring, even he had to admit that the project was a mess. It caused batteries to drain after just a few hours. It gave bad directions. It sent alerts at the wrong times — tossing users recommendations for a nearby fashion boutique when they were comfortably seated at a bar around the corner.

The problem was the method the prototype was using to identify location — a straightforward combination of GPS, Wi-Fi signals, and cell towers. It couldn’t always find the right signals, and even if it did, it tended to seriously drain the battery as it searched.

But when Ranganth told Shaw about the problems, the data scientist had an idea. Why not take a shortcut? Foursquare already had a massive database of check-ins — location information about the places its users most liked to go. And this data didn’t just include the place where someone had checked in. It showed how strong the GPS signal was at the time, how strong each surrounding Wi-Fi hotspot signal was, what local cell towers were nearby, and so on. Leveraging this data meant that Foursquare could still grab a good current location even if users were underground, near a source of radio interference, or facing some other signal obstacle. Chances are, some prior Foursquare user had seen the world through the same flawed eyes and reported his or her location.

“It’s one thing for us to match one point to another point, but we have a lot more options when we can match a cloud of points to another cloud of points,” Ranganth says. “It was very much an ah-ha moment for everybody.”
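
The "cloud of points to a cloud of points" idea amounts to treating each observation as a signal fingerprint (the visible Wi-Fi networks, cell IDs and rough GPS fix) and finding the past check-in whose fingerprint overlaps it most. A rough sketch of that matching step with an invented similarity score, not Foursquare's actual model:

    def similarity(observed, historical):
        """Jaccard overlap between two fingerprints (sets of Wi-Fi BSSIDs / cell IDs)."""
        if not observed or not historical:
            return 0.0
        return len(observed & historical) / len(observed | historical)

    def best_venue(observed, checkin_db):
        """checkin_db: list of (venue_name, fingerprint_set) built from past check-ins."""
        return max(checkin_db, key=lambda rec: similarity(observed, rec[1]))

    checkins = [
        ("Corner Bar",       {"wifi:aa:11", "wifi:bb:22", "cell:310-410-555"}),
        ("Fashion Boutique", {"wifi:cc:33", "cell:310-410-556"}),
    ]
    now_visible = {"wifi:aa:11", "cell:310-410-555", "wifi:zz:99"}
    print(best_venue(now_visible, checkins)[0])   # -> Corner Bar, even with no GPS fix at all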
hardware  geolocation  datamining  software  programming  FourSquare  mobile 
december 2013 by jtyost2
Apple Acquires Local Data Outfit Locationary
Last September, Grant Ritchie, CEO of crowdsourced location data company Locationary, penned an article for TechCrunch describing five challenges Apple faces as it builds out its new mapping service. Ten months later, he has become part of the effort to overcome them.

Apple has acquired the Toronto-based Locationary, a small Canadian startup, backed by Extreme Venture Partners and Plazacorp Ventures. Multiple sources familiar with the deal tell AllThingsD it closed recently and includes Locationary’s technology and team, both. The price of the acquisition couldn’t immediately be learned.

Apple spokesman Steve Dowling confirmed the deal with the statement the company typically releases when news of one of its acquisitions surfaces: “Apple buys smaller technology companies from time to time, and we generally do not discuss our purpose or plans.”

Apple’s plans in this case are fairly obvious: Beef up its new mapping service. The troubled launch of Apple’s home-brewed mapping software last year sparked a world-wide consumer backlash capped by a rare apology from CEO Tim Cook. Since that time, Apple has been working hard behind the scenes to improve the service. “We’re putting all of our energy into making it right,” Cook said last December.

And this acquisition will undoubtedly figure prominently in that effort. Locationary is a sort of Wikipedia for local business listings. It uses crowdsourcing and a federated data exchange platform called Saturn to collect, merge and continuously verify a massive database of information on local businesses and points of interest around the world, solving one of location’s biggest problems: Out-of-date information.

Not only does Locationary ensure that business listing data is positionally accurate (i.e., the restaurant I searched for is where Apple said it would be), it ensures that it is temporally accurate as well (i.e., the restaurant I searched for is still open for business and not closed for renovation or shuttered entirely). And that sort of clean location data could go a long way toward improving Apple Maps and distinguishing it from rivals getting their business-location data from regional Yellow Pages directories and the like.
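As a rough illustration of what “merge and continuously verify” might involve, here is a toy Python sketch: duplicate listings for the same place are collapsed, the most recently verified record wins, and anything not verified within a year is flagged as possibly stale. The record format, merge rule, and thresholds are invented for the example; this is not Locationary's Saturn platform.

```python
# Toy example: invented records and rules, not Locationary's actual technology.
from datetime import date, timedelta

listings = [
    {"name": "Cafe Milano", "lat": 43.6487, "lon": -79.3817, "verified": date(2013, 6, 1), "source": "crowd"},
    {"name": "Café Milano", "lat": 43.6488, "lon": -79.3816, "verified": date(2012, 2, 9), "source": "directory"},
    {"name": "Harbour Deli", "lat": 43.6402, "lon": -79.3770, "verified": date(2011, 5, 3), "source": "directory"},
]

def dedupe_key(listing):
    """Treat listings as the same place if the normalized name and rounded location match."""
    name = listing["name"].lower().replace("é", "e").replace(" ", "")
    return (name, round(listing["lat"], 3), round(listing["lon"], 3))

merged = {}
for listing in listings:
    key = dedupe_key(listing)
    # Keep whichever record for this place was verified most recently.
    if key not in merged or listing["verified"] > merged[key]["verified"]:
        merged[key] = listing

cutoff = date(2013, 8, 1) - timedelta(days=365)
for place in merged.values():
    status = "ok" if place["verified"] >= cutoff else "needs re-verification"
    print(place["name"], place["verified"], status)
```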
location  information  datamining  database  business  apple  search 
august 2013 by jtyost2
cfpb/qu
qu is an in-progress data platform created by the CFPB to serve their public data sets. This is a public domain work of the US Government. This is the stable version; day-to-day development is at https://github.com/cndreisbach/qu.
http://cfpb.github.io/qu/
government  usa  database  datamining  information 
august 2013 by jtyost2
I Flirt and Tweet. Follow Me at #Socialbot.
FROM the earliest days of the Internet, robotic programs, or bots, have been trying to pass themselves off as human. Chatbots greet users when they enter an online chat room, for example, or kick them out when they get obnoxious. More insidiously, spambots indiscriminately churn out e-mails advertising miracle stocks and unattended bank accounts in Nigeria. Bimbots deploy photos of gorgeous women to hawk work-from-home job ploys and illegal pharmaceuticals.

Now come socialbots. These automated charlatans are programmed to tweet and retweet. They have quirks, life histories and the gift of gab. Many of them have built-in databases of current events, so they can piece together phrases that seem relevant to their target audience. They have sleep-wake cycles so their fakery is more convincing, making them less prone to repetitive patterns that flag them as mere programs. Some have even been souped up by so-called persona management software, which makes them seem more real by adding matching Facebook, Reddit or Foursquare accounts, giving them an online footprint over time as they amass friends and like-minded followers.

Researchers say this new breed of bots is being designed not just with greater sophistication but also with grander goals: to sway elections, to influence the stock market, to attack governments, even to flirt with people and one another.

“Bots are getting smarter and easier to create, and people are more susceptible to being fooled by them because we’re more inundated with information,” said Filippo Menczer, a professor at Indiana University and one of the principal investigators for Truthy, a research program at Indiana University that tracks bots and Twitter trends.

Socialbots are being circulated around the Web for many purposes. To irritate his adversaries, a software developer from Australia designed a bot that automatically responds to tweets from climate change deniers, sending them counterarguments and links to studies debunking their claims. A security engineer in California programmed a bot to scoop up reservations for State Bird Provisions, a trendy restaurant in San Francisco. Mercenary armies of bots can be bought on the Web for as little as $250.

For some, the goal is increasing popularity. Last month, computer scientists from the Federal University of Ouro Preto in Brazil revealed that Carina Santos, a much-followed journalist on Twitter, was actually not a real person but a bot that they had created. Based on the circulation of her tweets, two commonly used ranking sites, Twitalyzer and Klout, ranked Ms. Santos as having more online “influence” than Oprah Winfrey.

Other bots have more underhanded ambitions. Last year, officials from Mexico’s governing Institutional Revolutionary Party were accused of using bots to sabotage the party’s critics by appropriating some of their hashtags and flooding Twitter with identical posts, designed to trip Twitter’s spam filter. Believing the posts to be spam, Twitter soon began blocking those hashtags entirely, temporarily silencing the critics, which was exactly what the government officials intended.

During a dispute over a Russian parliamentary election in 2011, thousands of Twitter bots, created months before but largely dormant, suddenly began posting hundreds of messages a day targeting anti-Kremlin activists, aiming to drown them out, according to security analysts. Researchers say similar tactics have been used more recently by the government in Syria.

Socialbots are tapping into an ever-expanding universe of social media. Last year, the number of Twitter accounts topped 500 million. Some researchers estimate that only 35 percent of the average Twitter user’s followers are real people. In fact, more than half of Internet traffic already comes from nonhuman sources like bots or other types of algorithms. Within two years, about 10 percent of the activity occurring on social online networks will be masquerading bots, according to technology researchers.

Dating sites provide especially fertile ground for socialbots. Swindlers routinely seek to dupe lonely people into sending money to fictitious suitors or to lure viewers toward pay-for-service pornography pages. Christian Rudder, a co-founder and general manager of OkCupid, said that when his dating site recently bought and redesigned a smaller site, they witnessed not just a sharp decline in bots, but also a sudden 15 percent drop in use of the new site by real people. This decrease in traffic occurred, he maintains, because the flirtatious messages and automated “likes” that bots had been posting to members’ pages had imbued the former site with a false sense of intimacy and activity. “Love was in the air,” Mr. Rudder said. “Robot love.”

Mr. Rudder added that his programmers are seeking to design their own bots that will flirt with invader bots, courting them into a special room, “a purgatory of sorts,” to talk to one another rather than fooling the humans.

Marketers and political groups are in on the game, too. Last year, researchers at the Health Media Collaboratory of the University of Illinois at Chicago found that e-cigarettes were being heavily marketed on social media largely through bots dispersing messages about weaning people from regular cigarettes.

In 2010, researchers with Truthy, the Indiana University research group, discovered a number of Twitter accounts sending out duplicate messages and re-tweeting messages from the same few accounts in a closely connected network. Two accounts, for example, sent out 20,000 similar tweets, most of them linking to, or promoting, the Web site of John A. Boehner, then the House minority leader, before the last midterm elections.

Much of social media remains unregulated by campaign finance and transparency laws. So far, the Federal Election Commission has been reluctant to venture into this realm.

But the bots are likely to venture into ours, said Tim Hwang, chief scientist at the Pacific Social Architecting Corporation, which creates bots and technologies that can shape social behavior. “Our vision is that in the near future automatons will eventually be able to rally crowds, open up bank accounts, write letters,” he said, “all through human surrogates.”
automation  socialmedia  socialnetworking  twitter  technology  communication  information  datamining 
august 2013 by jtyost2
Aral Balkan — Schnail Mail: free real mail for life!
Only with Gmail, Google doesn’t just read your email to create a profile of you, they also pull in all the other data you give them via their plethora of services and devices: e.g., Google drive (all your files), Google+ (your friends and family, status updates,…), Google Now, Google Maps, etc. (where you’ve been, where you are, who you’re with, where you will be tomorrow), and so on and so forth.

And it doesn’t stop there either. They also sell you cheap hardware (Android phones, tablets, and Chromebooks) to make it easier than ever for you to sign in to their services.

In fact, they’re now venturing into providing Internet access. When your Internet sign‐in is your Google sign‐in, they can capture and analyse all of your online activity, regardless of which device or services you use.

Surely, Schnail Mail isn’t nearly as invasive?

So, what do you say?

Show a poor schnail some love?
privacy  advertising  google  gmail  email  information  datamining 
august 2013 by jtyost2
Hands-On with the 'Automatic' Connected Driving Assistant System
From there, it asks you to start your vehicle, and you’re on your way. Amusingly enough, to get the setup to actually finish, the engine of your car has to start. I drive a 2011 Prius, and the internal combustion engine only fires up when it’s actually needed. So, there was a bit of confusion between what the app was asking me to do (simply start my car) and what I needed to do, which amounted to just driving around the block so the gas engine started.

The Automatic app runs in the background and automatically connects to the Smart Driving Assistant whenever you get in your car. Regardless of whether or not you even have the app open, once you start driving, it begins tracking everything you’re doing. Data points captured include how long you were driving (in both time and distance), your miles per gallon, how many times you braked or accelerated too hard, and how many minutes you were driving over 70 miles per hour. Your route is also saved and plotted on a map, and by tracking local gas prices the app computes how much each trip cost you.

All of this data is tallied together for your weekly totals and averages, which are displayed at the top of your driving timeline. Additionally, using the information the app collects, it computes a “Drive Score” to grade you on how efficiently it thinks you’re driving. In its current implementation this scoring system seems crazy, as right now I’m rocking a 35 out of 100 in my Prius, regardless of the fact that I’m exceeding the EPA estimated MPG of my car. The Automatic blog mentions tweaking this formula, as right now it is not computed on a specific car-by-car basis and instead just grades you on hard brakes, acceleration, and how often you’re driving over 70 MPH.

Arguably the most useful feature of the Automatic Smart Driving Assistant’s current implementation is seamlessly saving the location of where you parked your car. When you turn off your car, the app tags your current GPS location, and a simple tap loads up a full-screen map showing where you are in relation to your car. In my experience, the accuracy of this feature has been fantastic, and way more useful than my typical routine of wandering through the parking lot pressing the lock button on my key fob over and over when I can’t find my car.

Without a doubt, the geek factor of the Automatic Smart Driving Assistant is off the charts. Being able to load up an app and see exactly where your car is, exactly how much each trip cost you in gas, and everything else feels futuristic — particularly with how seamless this all is with the automatic Bluetooth connection and background data collection. It’s also by far the most user-friendly OBD-II device I’ve seen, in that it parses the data the port can deliver into a very easy-to-understand format, even for the least mechanically minded drivers out there. The system also remains in beta testing, although it is unclear whether any additional features will be added before the official launch.

However, just how useful the Smart Driving Assistant actually is in reducing fuel consumption is debatable. It aims to save gas by reducing the amount of hard braking you do, how much of a lead foot you have, and how much you speed. But, do you really need a $70 gizmo to tell you that? Just simply making an effort to drive more slowly and conservatively, and both gradually accelerating and braking will have the same effect — all without spending $70.
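Automatic has not published how the Drive Score is computed, so the sketch below is only a guess at the penalty-style formula the review describes (hard brakes, hard accelerations, and minutes over 70 mph), with weights made up for illustration.

```python
# Guessed formula with invented weights; not Automatic's actual Drive Score.
def drive_score(trips):
    """Start from 100 and subtract weighted penalties, normalized per hour of driving."""
    hours = sum(t["minutes"] for t in trips) / 60
    if hours == 0:
        return 100.0
    hard_brakes = sum(t["hard_brakes"] for t in trips)
    hard_accels = sum(t["hard_accels"] for t in trips)
    speeding_min = sum(t["minutes_over_70mph"] for t in trips)
    penalty = 8 * hard_brakes / hours + 8 * hard_accels / hours + 2 * speeding_min / hours
    return max(0.0, round(100 - penalty, 1))

week = [
    {"minutes": 35, "hard_brakes": 2, "hard_accels": 1, "minutes_over_70mph": 5},
    {"minutes": 50, "hard_brakes": 4, "hard_accels": 3, "minutes_over_70mph": 12},
]
print(drive_score(week))
```

Even a toy formula like this shows why an efficient hybrid driver can still score badly: the score never looks at actual fuel economy, only at the penalty counters.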
software  technology  automotive  hardware  datamining  information 
august 2013 by jtyost2
Apple’s Tim Cook, tech executives meet with Barack Obama to talk surveillance - Tony Romm
President Barack Obama hosted Apple CEO Tim Cook, AT&T CEO Randall Stephenson, Google computer scientist Vint Cerf and other tech executives and civil liberties leaders on Thursday for a closed-door meeting about government surveillance, sources tell POLITICO.

The session, which Obama attended himself, followed a similar gathering earlier this week between top administration officials, tech-industry lobbyists and leading privacy hawks, the sources said. Those earlier, off-the-record discussions centered on the controversy surrounding the NSA as well as commercial privacy issues such as online tracking of consumers.

The White House has declined to provide any details about its new outreach since the beginning of the week. A spokesman didn’t comment Thursday about the high-level meeting with the president — and the companies and groups invited also kept quiet when contacted by POLITICO.
barackobama  politics  government  business  legal  privacy  datamining  database  nsa  PRISM  freedom  civilrights  freedomfromsearchandseizure  advertising 
august 2013 by jtyost2
Dark data is more important than big data
Imagine if…

Imagine if you had Google Glass, or the Iron Man suit, and your heads up display (HUD) could tell you anything you wanted to know about everything in your field of vision.

What would you want to know? What would you benefit from knowing?

How old is this?
Who owns this?
How much does it cost?
How was it manufactured?
What material is it made of?
Where did it come from?
Who else has been here?
These are just a few of the many questions that you could ask of your surroundings.

What is “Dark Data”?

There are three types of dark data. Let me briefly define them and provide an example for each:

1) There is data that is not currently being collected.
An example of this is location data before Foursquare, or social data before Facebook. Where did the people go? Who did the people know? Now we know.
2) There is data that is being collected, but that is difficult to access at the right time and place.
In front of you, there is a pine tree. How do you know it is not a fir? Because, in some book, in some library, there is an explanation of the difference. That’s useless. Here and now, we need information applied to the present.
3) There is data that is collected and available, but that has not yet been productized, or fully applied.
You’re walking down Fifth Avenue in Manhattan. Every building you look at, Wikipedia has vast amounts of data about. But technology startups are only just beginning to figure out how to bring that data to you, and make it valuable. The burgeoning field of augmented reality is full of opportunities like this.
What’s the difference between “Dark Data” and “Big Data”?

Big data problems are problems caused not by the inaccessibility of data, but by the abundance of it.

That’s why big data opportunities are smaller than dark data opportunities. Dark data is a bigger problem, because it hasn’t been surfaced yet. And the bigger the problem, the bigger the opportunity.

Big companies tend to have big data problems, and they know it. That’s why big data is a great market. Lots of customers with lots of data willing to pay startups to help them make sense of it all. Think banks, insurance companies, telcos, hospitals, and on and on…

Startups going after dark data problems are usually not playing in existing markets with customers self-aware of their problems. They are creating new markets by surfacing new kinds of data and creating unimagined applications with that data. But when they succeed, they become big companies, ironically, with big data problems.

Dark data is everywhere

In my “useless” liberal arts background I learned about this dude named Immanuel Kant. Kant split experience in two. There is the experience of reality itself. Reality is infinite, multi-layered and complex. Kant called this the “phenomenal” realm. Then there is the way we interpret and understand reality, as we describe it with language and data. Kant called this the “noumenal” realm. To make sense of reality, and to navigate our way through it, we have to abstract meaning from it by simplifying it: creating models, frameworks, world-views, etc.

If reality is infinite, multi-layered and complex, the good news is that there are always more types of data to extract, and new types of applications to create on top of that data. That’s why there are so many dark data opportunities all around us.

Great companies that are surfacing dark data

If your startup is surfacing dark data, I’d like to hear about it, feel free to reach out. Several of these companies I am either friends with or advise, so full disclosure, but here are some that come to mind:

Boxes — a social network for stuff. Stuff is dark data. All of your stuff is not online. There’s no place online that has all the things that I own, all the things that I want to own, etc.

NewHive — the blank canvas for the web; a social network for creativity. Expression and art is dark data, but create the right platform, and all of a sudden, all of it springs into light.

Xola — a booking and distribution platform that powers businesses offering lifestyle experiences. Their software helps these businesses manage their back-office and online reservations, payment processing, calendaring, inventory and guide management, and customer relationship management. All of this is dark data: until Xola, most of these businesses were being run with pen and paper, out of a cigar box. Now, all of their data is running through their platform.

The Tip Network — these guys are taking tips at restaurants (and eventually bars, hotels, casinos, etc.), which are currently all handled old-school, with receipts and cash and paper records (dark data), and moving them into the digital era, with beautiful software that adds value (in multiple ways) to both servers and restaurants. They will be processing the $35B in tips in the US every year, and soon will be adding other services for restaurants and services, from payroll to banking, on top of that platform.

Newtrust — Louis Anslow’s startup idea is based on the realization that everything from the school you go to, to your LinkedIn profile, is ultimately about signaling credibility to create trust, so that you can be employable and well compensated, but that instead of relying on proxies for trust, we should go right to the source: the work itself, as it is done, every hour of every day, and track and measure that — it is valuable dark data.

NeuroVigil — your brain activity is dark data.

Nest — your home energy consumption patterns are dark data.

23andme — your DNA is dark data.

My friend Louis Anslow trotted out this great line recently:

“Often that is treated as important which happens to be accessible to measurement”
Friedrich Von Hayek
That which is not accessible to measurement may be very important tomorrow; even though it is dark to us today, it just needs to be brought to light.
datamining  DarkData  BigData  information  technology 
august 2013 by jtyost2
Health Data Exploration Survey Seeks Participants Who Self-Track Health
The Health Data Exploration project has announced a call for participants in an online survey that seeks to uncover insights into how individuals, companies and researchers are using the data that are captured through digital devices such as fitness apps.

Another goal of the survey is to determine how willing individuals are to share their digitally captured health data with others for research purposes.

This initiative – housed at the California Institute for Telecommunications and Information Technology (Calit2) and supported by the Robert Wood Johnson Foundation (RWJF) – will explore how new technologies like smartphones and digital apps are yielding an increasingly large amount of data that can be mined for insights into individual and population health and well-being.

“The future of personal health and how healthcare will be provided is poised to be radically transformed by these new technologies,” said Larry Smarr, director of Calit2 and a member of the Health Data Exploration (HDE) advisory board. “It’s important to set the stage now for how this will happen so that researchers from disciplines ranging from computer science to systems biology to exercise science can contribute their best efforts to help make this happen.”

Self-tracking devices and other technologies that generate “digital footprints” now make it possible for individuals, patients, providers and researchers to generate and access an abundance of health-related data. Until now, no large-scale survey has been conducted to determine how supportive these populations are of opening up and sharing these data for research purposes.

The initiative also seeks a better understanding of current barriers preventing this type of data from being used in research – such as data quality, privacy and confidentiality.

Stephen J. Downs, chief technology and information officer of RWJF, says he and his collaborators at Calit2 are “hoping to hear from people who use smartphone apps or wearable devices to track different health-related variables like fitness, meals, sleep and mood and from representatives of the companies that provide those apps and devices."

According to a report released earlier this year by the Pew Internet Project, 69 percent of Americans track some kind of health data, and many researchers and medical practitioners believe these data could revolutionize how health is promoted, how care is provided and how individuals can become more confident in managing their health and wellness needs.

The HDE survey consists of 15-20 questions and requires fewer than 20 minutes to complete. Questions differ depending on how the participant self-identifies: As an individual, company representative or academic researcher. Individuals are asked under what conditions they might share their self-tracked data, for example, while researchers are asked if there are institutional barriers that would stop them from using self-tracked health data if they were offered from companies or individuals. Although most questions require only “yes” or “no” responses, some request additional (optional) comments.

Participants in the survey can remain anonymous, and the results of the survey will be made available in summary by the end of this year on the HDE website.

“We seek information and input from a range of academic disciplines -- from health services researchers, to those who study the social and biological sciences,” says Lori Melichar, senior program officer for RWJF. “We hope to hear from those who are enticed by the opportunity to explore these new datasets, and also need to hear from researchers who are skeptical that analyses of these types of data will produce valid, reliable, useful findings.”

The HDE project is led by Kevin Patrick and Jerry Sheehan of the UC San Diego division of Calit2 (now called the Qualcomm Institute). The project team also includes Judith Gregory, Matthew Bietz and Scout Calvert of the UC Irvine EVOKE Lab, with support from the Intel Science and Technology Center for Social Computing.

In addition to Smarr, members of the HDE advisory board include 23andMe founder Linda Avey, ePatient advocate Hugo Campos, Robert M. Kaplan of the National Institutes of Health, ideas42 founder Sendhil Mullainathan, Tim O’Reilly of O’Reilly Media, Aetna CarePass’ Martha Wofford and Gary Wolf, one of the founders of the Quantified Self movement.

The survey is available through the end of August 2013. To take the survey and find out more about the HDE project, visit http://www.calit2.net/hdexplore/.
health  healthcare  information  research  science  privacy  datamining  technology  hardware 
august 2013 by jtyost2
The Economist explains: Who owns your data when you're dead? | The Economist
AFTER we die, our bodies are reduced to dust or ash, through burial or cremation. The fate of the digital corpuses we leave behind is rather more complicated. Before the advent of internet-hosted storage and services, your digital remains would have been accessible only to those with physical access to your computers, and only then if you had not applied encryption or password protection. But these days many people leave traces of their lives spread across the internet. Facebook knows who we love and hate, Google knows what we are interested in, Amazon knows what we buy, and so on. Specialist services may even store information about your genetic makeup (23andme) or archives of your files (Dropbox, CrashPlan, and many others). Who owns your data when you're dead?

No one, not even a probate lawyer, will tell you that the process of transferring property by writing a will—or dealing with the absence of one—is a simple matter. But when it comes to financial assets, physical goods or property, thousands of years of tradition and many hundreds of years of legal precedent provide a basis on which to proceed in even the most esoteric cases. Digital assets that are stored on shared servers in the cloud, by contrast, are so new that legal systems have not yet caught up. Five American states have passed legislation to provide executors and other parties with a legal basis on which to assert authority over digital assets, and others are considering similar rules, but these laws vary widely in what they cover (the oldest of them covers only e-mail). There are no federal laws. The same is true in other countries. To complicate matters further, internet firms may be based in different countries from their users and may store data in servers in many countries, making it unclear whose laws would apply.

A paper by Maria Perrone in the journal CommLaw Conspectus explains how internet firms and digital service-providers sit in final judgment when it comes to deciding the fate of data belonging to the dead. Some firms cite an American law from 1986, the Stored Communications Act, as clearly prohibiting many forms of data handover to heirs or estates, even with verified written instructions asking for data to be released. The law provides no exemptions and involves hefty prison sentences for violators. But every company seems to have its own set of rules, procedures and terms of service. Some require a legal executor to make a request, while others honour requests from anyone who can prove a family connection or even a link to an online obituary. Facebook limits valid parties to requesting either that an account be removed or be turned into a memorial site. Twitter says bluntly that it can deactivate an account on presentation of several bits of information, but it is "unable to provide account access to anyone regardless of his or her relationship to the deceased." Some firms delete accounts after inactivity; others refuse to allow renewals to keep the data alive; others won't allow any changes, and leave a user's data frozen in time, to the distress of those left behind. Several companies, such as Cirrus Legacy and LegacyLocker, offer digital safes for passwords and documents, releasing them only to authorised parties in the event of the owner's death. But such firms state clearly that their contract is not legally binding in two regards: a judge or executor might compel them to release information to people other than those specified by the owner, and the passwords may be useless if they relate to an account that has been separately deactivated or shut down.

All this can be maddening for those dealing with grief. But there are signs of progress. In April, Google released the Inactive Account Manager, which in effect allows users of its service to set up a digital will. When enabled, it activates a dead-man's switch, and if the account is not used for a specified period (between three and 18 months) an e-mail can be sent to a trusted contact, and there is an option to delete the account automatically. The trusted contact can then follow a procedure to gain access to the account. Other internet giants may follow suit and offer similar features. More broadly, America's Uniform Law Commission, a non-partisan group that creates model legislation that is then adopted unchanged by many American states, has a "Fiduciary Access to Digital Assets" committee working on amendments to existing ULC laws that would give executors many of the same powers over digital assets that they have over financial and physical ones, while absolving service providers of any liability. These adjustments could be incorporated into some states' laws as soon as 2015, though some federal fiddles may be required as well. In her paper, Ms Perrone notes that such uniformity would mean that "people would no longer have to rely on companies' varying terms of use to determine how to manage digital assets." When dealing with death, a little certainty can be a great comfort.
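The Inactive Account Manager behaves like a classic dead-man's switch, and the decision logic is simple enough to sketch. The Python below is illustrative only (the thresholds and the extra deletion grace period are assumptions, not Google's implementation): an account that sits idle past its chosen timeout triggers a notification to the trusted contact, and continued silence eventually triggers deletion.

```python
# Illustrative dead-man's-switch logic; the grace period is an assumption, not Google's design.
from datetime import datetime, timedelta

def check_account(last_activity: datetime, timeout_months: int, now: datetime,
                  delete_after_notice_days: int = 90):
    """Return the action to take for one account: 'active', 'notify', or 'delete'."""
    idle_cutoff = now - timedelta(days=30 * timeout_months)
    if last_activity >= idle_cutoff:
        return "active"
    delete_cutoff = idle_cutoff - timedelta(days=delete_after_notice_days)
    return "delete" if last_activity < delete_cutoff else "notify"

now = datetime(2013, 7, 1)
print(check_account(datetime(2013, 6, 20), timeout_months=3, now=now))  # active
print(check_account(datetime(2013, 2, 1), timeout_months=3, now=now))   # notify
print(check_account(datetime(2012, 10, 1), timeout_months=3, now=now))  # delete
```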
privacy  information  socialmedia  socialnetwork  socialnetworking  facebook  google  database  datamining  death  culture  legal 
july 2013 by jtyost2
With Big Data surveillance, the government doesn’t need to know “why” anymore. - Slate Magazine
The good news—at least to Big Data proponents—is that we don't need to understand what any of these clicks or videos mean. We just need to establish some relationship between the unknown terrorists of tomorrow and the established terrorists of today. If the terrorists we do know have a penchant for, say, hummus, then we might want to apply extra scrutiny to anyone who's ever bought it—without ever developing a hypothesis as to why the hummus is so beloved. (In fact, for a brief period of time in 2005 and 2006, the FBI, hoping to find some underground Iranian terrorist cells, did just that: They went through customer data collected by grocery stores in the San Francisco area searching for sales records of Middle Eastern food.)
The great temptation of Big Data is that we can stop worrying about comprehension and focus on preventive action instead. Instead of wasting precious public resources on understanding the “why”—i.e., exploring the reasons as to why terrorists become terrorists—one can focus on predicting the “when” so that a timely intervention could be made. And once someone has been identified as a suspect, it's wise to get to know everyone in his social network: Catching just one Tsarnaev brother early on may not have stopped the Boston bombing. Thus, one is simply better off recording everything—you never know when it might be useful.
Gus Hunt, the chief technology officer of the CIA, said as much earlier this year. "The value of any piece of information is only known when you can connect it with something else that arrives at a future point in time,” he said at a Big Data conference. Thus, “since you can't connect dots you don't have … we fundamentally try to collect everything and hang on to it forever." The end of theory, which Chris Anderson predicted in Wired a few years ago, has reached the intelligence community: Just like Google doesn't need to know why some sites get more links from other sites—securing a better place on its search results as a result—the spies do not need to know why some people behave like terrorists. Acting like a terrorist is good enough.
As the media academic Mark Andrejevic points out in Infoglut, his new book on the political implications of information overload, there is an immense—but mostly invisible—cost to the embrace of Big Data by the intelligence community (and by just about everyone else in both the public and private sectors). That cost is the devaluation of individual and institutional comprehension, epitomized by our reluctance to investigate the causes of actions and jump straight to dealing with their consequences. But, argues Andrejevic, while Google can afford to be ignorant, public institutions cannot.
"If the imperative of data mining is to continue to gather more data about everything," he writes, "its promise is to put this data to work, not necessarily to make sense of it. Indeed, the goal of both data mining and predictive analytics is to generate useful patterns that are far beyond the ability of the human mind to detect or even explain." In other words, we don't need to inquire why things are the way they are as long as we can affect them to be the way we want them to be. This is rather unfortunate. The abandonment of comprehension as a useful public policy goal would make serious political reforms impossible.
Forget terrorism for a moment. Take more mundane crime. Why does crime happen? Well, you might say that it’s because youths don't have jobs. Or you might say that's because the doors of our buildings are not fortified enough. Given some limited funds to spend, you can either create yet another national employment program or you can equip houses with even better cameras, sensors, and locks. What should you do?
If you're a technocratic manager, the answer is easy: Embrace the cheapest option. But what if you are that rare breed, a responsible politician? Just because some crimes have now become harder doesn't mean that the previously unemployed youths have finally found employment. Surveillance cameras might reduce crime—even though the evidence here is mixed—but no studies show that they result in greater happiness of everyone involved. The unemployed youths are still as stuck as they were before—only that now, perhaps, they displace anger onto one another. On this reading, fortifying our streets without inquiring into the root causes of crime is a self-defeating strategy, at least in the long run.
Big Data is very much like the surveillance camera in this analogy: Yes, it can help us avoid occasional jolts and disturbances and, perhaps, even stop the bad guys. But it can also blind us to the fact that the problem at hand requires a more radical approach. Big Data buys us time, but it also gives us a false illusion of mastery.
We can draw a distinction here between Big Data—the stuff of numbers that thrives on correlations—and Big Narrative—a story-driven, anthropological approach that seeks to explain why things are the way they are. Big Data is cheap where Big Narrative is expensive. Big Data is clean where Big Narrative is messy. Big Data is actionable where Big Narrative is paralyzing.
The promise of Big Data is that it allows us to avoid the pitfalls of Big Narrative. But this is also its greatest cost. With an extremely emotional issue such as terrorism, it's easy to believe that Big Data can do wonders. But once we move to more pedestrian issues, it becomes obvious that the supertool it's made out to be is a rather feeble instrument that tackles problems quite unimaginatively and unambitiously. Worse, it prevents us from having many important public debates.
As Band-Aids go, Big Data is excellent. But Band-Aids are useless when the patient needs surgery. In that case, trying to use a Band-Aid may result in amputation. This, at least, is the hunch I drew from Big Data.
database  datamining  terrorism  information  government  surveillance  privacy  logic  politics  crime 
july 2013 by jtyost2
Never Give Stores Your ZIP Code. Here's Why - Forbes
Why make such a big deal over five digits that only record that someone lives in the same area as many thousands of others? Because, along with other information, the ZIP code may provide the final clue to figuring out your address, phone number and past purchasing details, if a sales clerk sees your name while swiping your credit card.

How does this work? In one of their brochures, direct marketing services company Harte-Hanks describes the GeoCapture service they offer retail businesses as follows: “Users simply capture name from the credit card swipe and request a customer’s ZIP code during the transaction. GeoCapture matches the collected information to a comprehensive consumer database to return an address.” In a promotional brochure, they claim accuracy rates as high as 100%.

Fair Isaac Corp., a company best known for its FICO credit scores, also offers a similar service which they say can boost direct marketing efforts by as much as 400%. “FICO Contact Builder helps you overcome the common challenges of gathering contact information from shoppers—such as complicating or jeopardizing the sales process by asking for an address or phone number, or complying with regulations,” it says. “It requires minimal customer information captured at point-of-sale, just customer name or telephone number and the customer or store ZIP code.”
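Mechanically, this kind of “reverse append” is just a database join. The sketch below uses an invented consumer table and a naive exact-match rule (it is not Harte-Hanks' GeoCapture or FICO's product) to show why the ZIP code is the final clue: a name alone matches many records, but name plus ZIP often narrows the match to one household.

```python
# Invented consumer database and matching rule; illustrative only.
CONSUMER_DB = [
    {"name": "jane doe", "zip": "94110", "address": "123 Valencia St", "phone": "415-555-0100"},
    {"name": "jane doe", "zip": "10001", "address": "456 W 28th St",   "phone": "212-555-0188"},
    {"name": "john roe", "zip": "94110", "address": "789 Guerrero St", "phone": "415-555-0142"},
]

def reverse_append(name_from_card_swipe: str, zip_given_at_register: str):
    """Return the unique consumer record matching the swiped name and volunteered ZIP, if any."""
    matches = [row for row in CONSUMER_DB
               if row["name"] == name_from_card_swipe.lower()
               and row["zip"] == zip_given_at_register]
    return matches[0] if len(matches) == 1 else None

print(reverse_append("Jane Doe", "94110"))
# -> {'name': 'jane doe', 'zip': '94110', 'address': '123 Valencia St', 'phone': '415-555-0100'}
```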
privacy  information  marketing  datamining  business  legal  advertising 
july 2013 by jtyost2
EFF Joins Over 100 Civil Liberties Organizations and Internet Companies in Demanding a Full-Scale Congressional Investigation Into NSA Surveillance | Electronic Frontier Foundation
Dozens of civil liberties organizations and Internet companies—including the Electronic Privacy Information Center, National Association of Criminal Defense Lawyers, ThoughtWorks, and Americans for Limited Government—today joined a coalition demanding Congress initiate a full-scale investigation into the NSA’s surveillance programs. This morning, we sent an updated letter to Congress with 115 organizations and companies demanding public transparency and an end to illegal spying.

The letter comes even as dozens of groups are organizing a nationwide call-in campaign to demand transparency and an end to the NSA’s unconstitutional surveillance program via https://call.stopwatching.us

It’s been less than two weeks since the first NSA revelations were published in the Guardian, and it’s clear the American people want Congress to act. The first step is organizing an independent investigation, similar to the Church Committee in the 1970s, to publicly account for all of the NSA’s surveillance capabilities. This type of public process will ensure the American people are informed, once and for all, about government surveillance conducted in their name. Our letter tells Congress:

This type of blanket data collection by the government strikes at bedrock American values of freedom and privacy. This dragnet surveillance violates the First and Fourth Amendments of the U.S. Constitution…

In addition, our StopWatching.us global petition has gathered more than 200,000 signatures since it was launched one week ago. The petition calls on Congress and the President to provide public access and scrutiny of the United States' domestic spying capabilities and to bring an end to illegal surveillance. Please support our efforts to rein in the NSA’s surveillance program by adding your name now.
privacy  government  usa  eff  legal  freedom  freedomfromsearchandseizure  warrant  PRISM  nsa  internet  information  datamining  politics 
june 2013 by jtyost2
Google+
I heard a rumour this morning that if your profile is set to male, you'll get different What's Hot (Explore) content than if you're set to female.

So I opened a tab with my profile which was set to female, and from that opened a fresh tab with WH.  Then I immediately changed my profile to male and opened another WH tab to compare as closely in time as possible the difference.

Holy. Fucking. Shit.

When I'm a girl, I get piles of those graphical platitudes I hate.  I have understood all along why +EcoSnark gets those more than my real profile, because she shares them to her NerdSmacked community to make fun of them (see https://plus.google.com/communities/117382348426379346016 ).  I don't plus them or share them, but they're always there.  I loathe them and it's why I have had the slider for WH turned down to zero for ages.

Well boy howdy wouldn'tcha just know it, when I become a guy, ALL of those things are gone and instead I get way more nerdy and technical stuff.

I noted the URLs of the top listed 12 posts from my female and male WH versions, and vaguely categorized them by type.   The difference is astounding.
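The comparison itself is nothing fancier than tallying post types from the two feeds side by side; a tiny sketch with invented category labels (not the author's actual data) might look like this.

```python
# Invented example categories, not the author's real tally.
from collections import Counter

whats_hot_as_female = ["platitude graphic", "platitude graphic", "celebrity", "recipe",
                       "platitude graphic", "fashion", "celebrity", "platitude graphic"]
whats_hot_as_male = ["science", "gadget", "programming", "science",
                     "gadget", "space photo", "science", "gaming"]

for label, posts in [("female", whats_hot_as_female), ("male", whats_hot_as_male)]:
    print(label, Counter(posts).most_common(3))
```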
googleplus  gender  feminism  information  media  datamining  google 
june 2013 by jtyost2
BBC News - Europe alarmed by US surveillance
The EU is demanding assurances that Europeans' rights are not being infringed by massive, newly revealed US surveillance programmes.

Justice Commissioner Viviane Reding plans to raise the concerns with US Attorney General Eric Holder on Friday.

Last week a series of leaks by a former CIA worker led to claims the US had a vast surveillance network with much less oversight than previously thought.

The US insists its snooping is legal under domestic law.

The Obama administration is investigating whether the disclosures by former CIA worker Edward Snowden were a criminal offence.
cia  legal  government  warrant  privacy  information  datamining  PRISM  nsa  EuropeanUnion  humanrights  civilrights 
june 2013 by jtyost2
Edward Snowden: Why did the NSA whistleblower have access to PRISM and other sensitive systems?
Edward Snowden sounds like a thoughtful, patriotic young man, and I’m sure glad he blew the whistle on the NSA’s surveillance programs. But the more I learned about him this afternoon, the angrier I became. Wait, him? The NSA trusted its most sensitive documents to this guy? And now, after it has just proven itself so inept at handling its own information, the agency still wants us to believe that it can securely hold on to all of our data? Oy vey!

According to the Guardian, Snowden is a 29-year-old high-school dropout who trained for the Army Special Forces before an injury forced him to leave the military. His IT credentials are apparently limited to a few “computer” classes he took at a community college in order to get his high-school equivalency degree—courses that he did not complete. His first job at the NSA was as a security guard. Then, amazingly, he moved up the ranks of the United States’ national security infrastructure: The CIA gave him a job in IT security. He was given diplomatic cover in Geneva. He was hired by Booz Allen Hamilton, the government contractor, which paid him $200,000 a year to work on the NSA’s computer systems.

Let’s note what Snowden is not: He isn’t a seasoned FBI or CIA investigator. He isn’t a State Department analyst. He’s not an attorney with a specialty in national security or privacy law.

Instead, he’s the IT guy, and not a very accomplished, experienced one at that. If Snowden had sent his résumé to any of the tech companies that are providing data to the NSA’s PRISM program, I doubt he’d have even gotten an interview. Yes, he could be a computing savant anyway—many well-known techies dropped out of school. But he was given access way beyond what even a supergeek should have gotten. As he tells the Guardian, the NSA let him see “everything.” He was accorded the NSA’s top security clearance, which allowed him to see and to download the agency’s most sensitive documents. But he didn’t just know about the NSA’s surveillance systems—he says he had the ability to use them. “I, sitting at my desk, certainly had the authorities [sic] to wiretap anyone from you or your accountant to a federal judge to even the president if I had a personal email,” he says in a video interview with the paper.

Because Snowden is now in Hong Kong, it’s unclear what the United States can do to him. But watch for officials to tar Snowden—he’ll be called unpatriotic, unprofessional, treasonous, a liar, grandiose, and worse. As in the Bradley Manning case, though, the more badly Snowden is depicted, the more rickety the government’s case for surveillance becomes. After all, they hired him. They gave him unrestricted access to their systems, from court orders to PowerPoint presentations depicting the crown jewels of their surveillance infrastructure. (Also of note: They made a hideous PowerPoint presentation depicting the crown jewels of their surveillance infrastructure—who does that? I’ve been reading a lot of Le Carré lately, and when I saw the PRISM presentation, I remembered how Le Carré’s veteran spy George Smiley endeavored to never write down his big secrets. Now our spies aren’t just writing things down—they’re trying to make their secrets easily presentable to large audiences.)
nsa  datamining  legal  warrant  privacy  information  government  usa  PRISM  freedom  freedomfromsearchandseizure  civilrights 
june 2013 by jtyost2
This abuse of the Patriot Act must end - Jim Sensenbrenner
The administration claims authority to sift through details of our private lives because the Patriot Act says that it can. I disagree. I authored the Patriot Act, and this is an abuse of that law.

I was the chairman of the House judiciary committee when the US was attacked on 11 September 2001. Five days later, the Justice Department delivered its proposal for new legislation. Although I, along with every other American, knew we had to strengthen our ability to combat those targeting our country, this version went too far. I believed then and now that we can defend our country and our liberty at the same time.

I immediately called then-House Speaker Dennis Hastert and asked him for time to redraft the legislation. I told the speaker that if the legislation moved forward as drafted, I would not only vote against it, but would actively oppose it.

The country wanted action, and the pressure from the White House was intense. To his credit, Speaker Hastert gave us more time. There were endless meetings and non-stop negotiations with the White House, the FBI and the intelligence community. The question could not have been more fundamental: how could we defend our liberty and protect the American people at the same time?

The legislation had to be narrowly tailored – everyone agreed that we could not allow unrestrained surveillance. The Patriot Act had 17 provisions. To prevent abuse, I insisted on sunsetting all the provisions so that they would automatically expire if Congress did not renew them. This would allow Congress to conduct oversight of the administration’s implementation of the act.

In 2006, Congress made 14 of the provisions permanent because they were noncontroversial. The three remaining provisions, including the so-called business records provision the administration relied on for the programs in question, will expire in 2015 if they are not reauthorized.

The final draft was bipartisan and passed the judiciary committee unanimously. The Patriot Act has saved lives by ensuring that information is shared among those responsible for defending our country and by giving the intelligence community the tools it needs to identify and track terrorists.

In his press conference on Friday, President Obama described the massive collection of phone and digital records as “two programs that were originally authorized by Congress, have been repeatedly authorized by Congress”. But Congress has never specifically authorized these programs, and the Patriot Act was never intended to allow the daily spying the Obama administration is conducting.

To obtain a business records order like the one the administration obtained, the Patriot Act requires the government to prove to a special federal court, known as a Fisa court, that it is complying with specific guidelines set by the attorney general and that the information sought is relevant to an authorized investigation. Intentionally targeting US citizens is prohibited.

Technically, the administration’s actions were lawful insofar as they were done pursuant to an order from the Fisa court. But based on the scope of the released order, both the administration and the Fisa court are relying on an unbounded interpretation of the act that Congress never intended.

The released Fisa order requires daily productions of the details of every call that every American makes, as well as calls made by foreigners to or from the United States. Congress intended to allow the intelligence communities to access targeted information for specific investigations. How can every call that every American makes or receives be relevant to a specific investigation?

This is well beyond what the Patriot Act allows.

President Obama’s claim that “this is the most transparent administration in history” has once again proven false. In fact, it appears that no administration has ever peered more closely or intimately into the lives of innocent Americans. The president should immediately direct his administration to stop abusing the US constitution.

We all know the saying “eternal vigilance is the price of liberty.” We are seeing that truth demonstrated once again.

Our liberties are secure only so long as we are prepared to defend them. I and many other members of Congress intend to take immediate action to ensure that such abuses are not repeated.
legal  PatriotAct  privacy  information  datamining  PRISM  nsa  fisa  government  warrant  freedomfromsearchandseizure  civilrights  humanrights  barackobama 
june 2013 by jtyost2
Is data mining company Palantir's software behind PRISM surveillance? Palantir says no | The Verge
Earlier today, Talking Points Memo surfaced a new theory about the NSA and FBI's PRISM surveillance program: that it originated with data mining software company Palantir. "Palantir has a software package called 'Prism': 'Prism is a software component that lets you quickly integrate external databases into Palantir,'" wrote an anonymous source. "That sounds like exactly the tool you'd want if you were trying to find patterns in data from multiple companies."

Palantir, which offers its services to a variety of industries, is well-known for its national security work. Likely coincidentally, it even once proposed a smear campaign against journalist Glenn Greenwald, who released information about both PRISM and a court order requiring Verizon to turn over call logs. But the company insists its Prism program has nothing to do with surveillance.

"Palantir's Prism platform is completely unrelated to any US government program of the same name," the company told us. "Prism is Palantir's name for a data integration technology used in the Palantir Metropolis platform (formerly branded as Palantir Finance). This software has been licensed to banks and hedge funds for quantitative analysis and research." Palantir's overview of the program describes it as a way to integrate databases into its software.
nsa  legal  PRISM  Palantir  civilrights  freedom  freedomfromsearchandseizure  warrant  datamining  government 
june 2013 by jtyost2
Full-Text Indexing PDFs In Javascript
I once worked for a company that sold access to legal and financial databases (as they call it, “intelligent information”). Most court records are PDFs available through PACER, a website developed specifically to distribute court records. Meaningful database products on this dataset require building a processing pipeline that can extract and index text from the 200+ million PDFs that represent 20+ years of U.S. litigation. These processes can take many months of machine time, which puts a lot of pressure on the software teams that build them.

Mozilla Labs has received a lot of attention lately for a project impressive in its ambitions: rendering PDFs in a browser using only Javascript. The PDF spec is incredibly complex, so best of luck to the pdf.js team! In a different vein, Oliver Nightingale is implementing a full-text indexer in Javascript – combining these two projects allows reproducing the PDF processing pipeline entirely in web browsers.

As a refresher, full text indexing lets a user search unstructured text, ranking resulting documents by a relevance score determined by word frequencies. The indexer counts how often each word occurs per document and makes minor modifications to the text, removing grammatical features that are irrelevant to search. E.g., it might subtract “-ing” and change vowels to phonetic common denominators. If a word shows up frequently across the document set, it is automatically considered less important, and its effect on the resulting ranking is minimized. This differs from the basic concept behind Google PageRank, which boosts the rank of documents based on a citation graph.

Most database software provides full-text indexing support, but large scale installations are typically handled in more powerful tools. The predominant open-source product is Solr/Lucene, Solr being a web-app wrapper around the Lucene library. Both are written in Java.

Building a Javascript full-text indexer enables search in places where it was previously difficult, such as Phonegap apps, end-user machines, or user data that will be stored encrypted. There is a whole field of research into encrypted search indices, but indexing and encrypting data on a client machine seems like a good way around this naturally challenging problem.

To test building this processing pipeline, we first look at how to extract text from PDFs, which will later be inserted into a full text index. The code for pdf.js is instructive, in that the Mozilla developers use browser features that aren’t in common use. Web Workers, for instance, let you set up background processing threads.
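The indexing step described above reduces to counting terms per document and down-weighting terms that show up everywhere, i.e. TF-IDF scoring. The sketch below shows the idea in Python rather than Javascript, with stemming and stop words omitted; it illustrates the concept, not the specific libraries the post discusses.

```python
# Concept sketch of full-text indexing (TF-IDF), not the Javascript libraries discussed above.
import math
import re
from collections import Counter

docs = {
    "opinion.pdf": "the court grants the motion to dismiss",
    "brief.pdf":   "plaintiff files a motion for summary judgment",
    "order.pdf":   "the court denies summary judgment",
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
doc_freq = Counter(term for counts in term_counts.values() for term in counts)

def score(query, name):
    """Sum TF-IDF weights of the query terms for one document."""
    counts = term_counts[name]
    total = sum(counts.values())
    result = 0.0
    for term in tokenize(query):
        if term in counts:
            tf = counts[term] / total
            idf = math.log(len(docs) / doc_freq[term])
            result += tf * idf
    return result

query = "summary judgment"
for name in sorted(docs, key=lambda n: score(query, n), reverse=True):
    print(name, round(score(query, name), 3))
```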
javascript  webdevelopment  pdf  search  datamining 
may 2013 by jtyost2
Obamacare’s Other Surprise - NYTimes.com
Todd Park, the White House’s chief technology officer, said many new apps being developed have been further fueled by the decision by Health and Human Services to make available massive amounts of data that it had gathered over the years but that had largely not been accessible in computer-readable forms that could be used to improve health care.

It started in March 2010 when Health and Human Services met with “45 rather skeptical entrepreneurs,” said Park, “and rather meekly put an initial pile of H.H.S. data in front of them — aggregate data on hospital quality, nursing home patient satisfaction and regional health care system performance. We asked the entrepreneurs what, if anything, they might be able to do with this data, if we made it supereasy to find, download and use.” They were told that in 90 days the department would hold a “Health Datapalooza,” — a public event to showcase innovators who harnessed the power of this data to improve health and care.

Ninety days later, entrepreneurs showed up and demonstrated more than 20 new or upgraded apps they had built that leveraged open data to do everything from helping patients find the best health care providers to enabling health care leaders to better understand patterns of health care system performance across communities, said Park. In 2012, another “Health Datapalooza” was held, and this time, he added, “1,600 entrepreneurs and innovators packed into rooms at the Washington Convention Center, hearing presentations from about 100 companies who were selected from a field of over 230 companies who had applied to present.” Most had been started in the last 24 months.

Among the start-ups I met with are Eviti, which uses technology to help cancer patients get the right combination of drugs or radiation from Day 1, which can lower costs and improve outcomes; Teladoc, which takes unused slices of doctors’ time and makes use of it by connecting them with remote patients, reducing visits to emergency wards; Humedica, which helps health care providers analyze their electronic patient records, tracking what was done to a patient, and did they actually get better; and Lumeris, which does health care analytics that uses real-time data about every aspect of a patient’s care, to improve medical decision-making, collaboration and cost-saving.

Obamacare will be a success only if it can deliver improved health care for more people at affordable prices. That remains to be seen. But at least it is already spurring the innovation necessary to make that happen.
healthcare  innovation  business  technology  datamining 
may 2013 by jtyost2
This Is the Most Detailed Picture of the Internet Ever (and Making it Was Very Illegal) | Motherboard
Why would you need a map of the Internet? The Internet is not like the Grand Canyon. It is not a destination in a voyage that requires so many right turns and so many left turns. The Internet, as the name suggests and many of you already know, is nothing but the sum of decentralized connections between various interconnected computers that are speaking roughly the same language. To map out those connections and visualize the place where I spend so much of my time may not have any clear use, but it intrigues the pants off me. 

An anonymous researcher with a lot of time on his hands apparently shares the sentiment. In a newly published research paper, this unnamed data junkie explains how he used some stupid simple hacking techniques to build a 420,000-node botnet that helped him draw the most detailed map of the Internet known to man. Not only does it show where people are logging in, it also shows changes in traffic patterns over time with an impressive amount of precision. This is all possible, of course, because the researcher hacked into nearly half a million computers so that he could ping each one, charting the resulting paths in order to make such a complex and detailed map. Along those lines, the project has as much to do with hacking as it does with mapping. 

The resultant map isn't perfect, but it is beautiful. Based on the parameters of the researcher's study, the map is already on its way to becoming obsolete, since it shows only devices with IPv4 addresses. (The latest standard is IPv6, but IPv4 is still pretty common.) The map is further limited to Linux-based computers with a certain amount of processing power. And finally, because of the parameters of the hack, it shows some amount of bias towards naive users who don't put passwords on their computers.

But on a general, half-a-million-computer level, this is what the Internet looks like in all of its gorgeous motion:
internet  security  information  datamining  hacking  research 
may 2013 by jtyost2
Schneier on Security: Intelligence Analysis and the Connect-the-Dots Metaphor
Connecting the dots in a coloring book is easy and fun. They're right there on the page, and they're all numbered. All you have to do is move your pencil from one dot to the next, and when you're done, you've drawn a sailboat. Or a tiger. It's so simple that 5-year-olds can do it.

But in real life, the dots can only be numbered after the fact. With the benefit of hindsight, it's easy to draw lines from a Russian request for information to a foreign visit to some other piece of information that might have been collected.

In hindsight, we know who the bad guys are. Before the fact, there are an enormous number of potential bad guys.

How many? We don't know. But we know that the no-fly list had 21,000 people on it last year. The Terrorist Identities Datamart Environment, also known as the watch list, has 700,000 names on it.

We have no idea how many potential "dots" the FBI, CIA, NSA and other agencies collect, but it's easily in the millions. It's easy to work backwards through the data and see all the obvious warning signs. But before a terrorist attack, when there are millions of dots -- some important but the vast majority unimportant -- uncovering plots is a lot harder.

Rather than thinking of intelligence as a simple connect-the-dots picture, think of it as a million unnumbered pictures superimposed on top of each other. Or a random-dot stereogram. Is it a sailboat, a puppy, two guys with pressure-cooker bombs, or just an unintelligible mess of dots? You try to figure it out.

It's not a matter of not enough data, either.

Piling more data onto the mix makes it harder, not easier. The best way to think of it is a needle-in-a-haystack problem; the last thing you want to do is increase the amount of hay you have to search through. The television show Person of Interest is fiction, not fact.

There's a name for this sort of logical fallacy: hindsight bias. First explained by psychologists Daniel Kahneman and Amos Tversky, it's surprisingly common. Since what actually happened is so obvious once it happens, we overestimate how obvious it was before it happened.
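
A toy base-rate calculation makes the needle-in-a-haystack point concrete. The numbers below are invented for illustration (they are not Schneier's), but even a wildly optimistic screening accuracy leaves almost every flag pointing at an innocent person.

    # Back-of-the-envelope: why more hay hurts. All figures are hypothetical.
    population = 300_000_000      # people generating "dots"
    true_plotters = 100           # actual plotters (invented)
    sensitivity = 0.99            # P(flagged | plotter)
    false_positive_rate = 0.001   # P(flagged | innocent) -- absurdly optimistic

    true_hits = true_plotters * sensitivity
    false_hits = (population - true_plotters) * false_positive_rate
    precision = true_hits / (true_hits + false_hits)

    print(f"flags raised: {true_hits + false_hits:,.0f}")       # ~300,000
    print(f"chance a flag is a real plotter: {precision:.3%}")  # ~0.033%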
statistics  information  politics  police  legal  datamining  crime  cia  fbi  government 
may 2013 by jtyost2
Automated License Plate Readers Threaten Our Privacy | Electronic Frontier Foundation
Law enforcement agencies are increasingly using sophisticated cameras, called “automated license plate readers” or ALPR, to scan and record the license plates of millions of cars across the country. These cameras, mounted on top of patrol cars and on city streets, can scan up to 1,800 license plates per minute, day or night, allowing one squad car to record more than 14,000 plates during the course of a single shift.

Photographing a single license plate one time on a public city street may not seem problematic, but when that data is put into a database, combined with other scans of that same plate on other city streets, and stored forever, it can become very revealing. Information about your location over time can show not only where you live and work, but your political and religious beliefs, your social and sexual habits, your visits to the doctor, and your associations with others. And, according to recent research reported in Nature, it’s possible to identify 95% of individuals with as few as four randomly selected geospatial datapoints (location + time), making location data the ultimate biometric identifier.
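
A minimal sketch of why retained scans become a location history: group individual (plate, time, location) reads by plate and sort them by time. The reads below are invented; real ALPR databases hold millions of them.

    # Sketch: turning isolated plate reads into per-vehicle tracks. Data is made up.
    from collections import defaultdict
    from datetime import datetime

    reads = [
        ("7ABC123", "2012-08-01 08:05", "Main St & 1st Ave"),
        ("7ABC123", "2012-08-01 18:40", "clinic parking lot, Oak St"),
        ("7ABC123", "2012-08-04 09:55", "Temple Ave"),
        ("8XYZ987", "2012-08-01 12:10", "Main St & 1st Ave"),
    ]

    tracks = defaultdict(list)
    for plate, when, where in reads:
        tracks[plate].append((datetime.strptime(when, "%Y-%m-%d %H:%M"), where))

    for plate, points in tracks.items():
        print(plate)
        for when, where in sorted(points):
            print(f"  {when:%a %H:%M}  {where}")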

To better gauge the real threat to privacy posed by ALPR, EFF and the ACLU of Southern California asked LAPD and LASD for information on their systems, including their policies on retaining and sharing information and all the license plate data each department collected over the course of a single week in 2012. After both agencies refused to release most of the records we asked for, we sued. We hope to get access to this data, both to show just how much data the agencies are collecting and how revealing it can be.

ALPRs are often touted as an easy way to find stolen cars — the system checks a scanned plate against a database of stolen or wanted cars and can instantly identify a hit, allowing officers to set up a sting to recover the car and catch the thief. But even when there’s no match in the database and no reason to think a car is stolen or involved in a crime, police keep the data. According to the LA Weekly, LAPD and LASD together already have collected more than 160 million “data points” (license plates plus time, date, and exact location) in the greater LA area—that’s more than 20 hits for each of the more than 7 million vehicles registered in L.A. County. That’s a ton of data, but it’s not all — law enforcement officers also have access to private databases containing hundreds of millions of plates and their coordinates collected by “repo” men.
legal  privacy  freedom  police  ethics  government  datamining 
may 2013 by jtyost2
Tangible assets | The Bookseller
Yeah, that’s about the size of things. When Amazon or Google want to test out a commercial proposition—moving a certain button a few pixels over to see how it performs relative to the old spot, say—they get to make the change, run it against a few thousand users, and examine the data on the spot. When a publisher wants to try out new cover art or different catalogue copy, it makes the change, waits six months or a year, and makes a guess about whether that was a good idea or a bad idea.
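
For readers who haven't run one, the experiment being described is an A/B test; a minimal sketch (with simulated traffic standing in for real retailer data) looks like this:

    # Sketch of an A/B test: random assignment, then compare conversion rates.
    import random

    random.seed(42)
    counts = {"old_button": [0, 0], "moved_button": [0, 0]}  # [visitors, clicks]

    for _ in range(10_000):
        variant = random.choice(sorted(counts))
        # pretend the moved button converts slightly better (made-up rates)
        rate = 0.030 if variant == "old_button" else 0.034
        counts[variant][0] += 1
        counts[variant][1] += random.random() < rate

    for variant, (visitors, clicks) in counts.items():
        print(f"{variant}: {clicks}/{visitors} = {clicks / visitors:.2%}")

The point of the excerpt is the turnaround, not the dozen lines of code: the retailer gets this answer in an afternoon, while the publisher waits six months or a year.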

Is it any wonder that the e-book channel is running circles around publishing? They’ve chucked in all kinds of creepy, privacy-invading rubbish that lets them know how and where you’re reading, which search terms you’re using to get to where in the book, and they won’t even share it with authors or publishers in realtime! This is the worst of all possible worlds—e-books that threaten the intellectual liberty of their readers and provide virtually no realtime intelligence to publishers and authors.

If the publishers want to go to the mats with Amazon, Apple, Google, Kobo, and BN.com, this is the issue they should be fighting over and for: realtime equitable retail intelligence and a reader’s bill of rights to ensure the long-term health of books’ special penumbra of virtue—this latter is an intangible asset far more important than the “intellectual property” rights, and the DRM deployed in the name of the latter lays waste to the former.

And once you’ve got a realtime view into the e-book numbers, hire a couple of coders (for God’s sake, don’t outsource this, especially to a lumbering Big Six consultancy!) and start to build apps that do stuff with it. Little apps, the kind of thing you can build and deploy in a week or two. Try a million things.

The future is unpredictable, and it’s not the sort of thing upon which you want to be making big all-or-nothing bets in the dark.
ebooks  publishing  information  business  DataMining 
april 2013 by jtyost2
Smart Campaigns, Meet Smart Voters
There is a rich opportunity for data collection and analysis here. The same technology that allows messages to be targeted and delivered can also help voters work together to better understand what politicians are up to. Voters deserve to know how the campaigns are selling themselves.
politics  advertising  datamining  information  technology 
december 2012 by jtyost2
Paris and the Data Mind - The Morning News
I can’t help but see an element of self-preservation amid our data collection. Preservation embedded deep within our check-ins, our food photos, our tracked steps and mapped run routes. We are collecting like never before.

We used to collect privately. The physical possessions one owns when one dies constitute, perhaps, an idealization of the self. Those possessions, however, have always been unnetworked. And they were limited by physics; you could only collect so much. Closets filled, things decayed, people moved, treasures were thrown away.

Now the collection is boundless. The space near infinite. Every single item collected is plugged into the network. And so that self—that idealization—suddenly flows fast and far. It touches other selves, other idealizations. It can be reconstituted by data mappers.

What a strange thing to think: It can be reconstituted by data mappers. But it’s true.

It’s no longer just the edges of a life, a general amassed physicality. It’s the millimeter precision of runs, the numbers of times “Hey Jude” was played, the minutes spent reading Harry Potter, the version-controlled genesis of an essay.

How specific and formful our collections—these collections that constitute our selves—have become. Still not entirely whole, but closer than they’ve ever been. We play them back—literally, scrolling out timelines. A life of thoughts, granular GPS, and time-coded data. Holograms of ourselves, transparent and broken, from another time and place. They skip like a worn record, or a dusty movie reel, with pieces missing here and there. But they are us, however scratchy, and their resolution increases daily.
DataMining  information  privacy  culture  technology  politics 
october 2012 by jtyost2
alexbw/Netflix-Prize
The code I used to get into the top 150 in the Netflix Prize
netflix  datamining  programming  politics 
august 2012 by jtyost2
Electronic Scores Rank Consumers by Potential Value
These digital scores, known broadly as consumer valuation or buying-power scores, measure our potential value as customers. What’s your e-score? You’ll probably never know. That’s because they are largely invisible to the public. But they are highly valuable to companies that want — or in some cases, don’t want — to have you as their customer.

Online consumer scores are calculated by a handful of start-ups, as well as a few financial services stalwarts, that specialize in the flourishing field of predictive consumer analytics. It is a Google-esque business, one fueled by almost unimaginable amounts of data and powered by complex computer algorithms. The result is a private, digital ranking of American society unlike anything that has come before.

It’s true that credit scores, based on personal credit reports, have been around for decades. And direct marketing companies have long ranked consumers by their socioeconomic status. But e-scores go further. They can take into account everything from occupation, salary and home value to spending on luxury goods or pet food, and do it all with algorithms that their creators say accurately predict spending.
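
As a toy illustration of what such a score might look like mechanically, here is a hand-weighted linear score over a few consumer attributes. The fields and weights are invented; real e-scores use far more data and proprietary models.

    # Toy "e-score": a weighted sum of whatever attributes the scorer holds.
    WEIGHTS = {
        "salary_thousands": 0.4,
        "home_value_thousands": 0.1,
        "luxury_spend_monthly": 0.8,
        "pet_food_spend_monthly": 0.2,
    }

    def e_score(profile):
        """Score a consumer profile; missing attributes simply count as zero."""
        return sum(weight * profile.get(field, 0.0) for field, weight in WEIGHTS.items())

    shopper = {"salary_thousands": 72, "home_value_thousands": 310,
               "luxury_spend_monthly": 40, "pet_food_spend_monthly": 25}
    print(f"score: {e_score(shopper):.1f}")  # the number that decides which offer you see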

A growing number of companies, including banks, credit and debit card providers, insurers and online educational institutions are using these scores to choose whom to woo on the Web. These scores can determine whether someone is pitched a platinum credit card or a plain one, a full-service cable plan or none at all. They can determine whether a customer is routed promptly to an attentive service agent or relegated to an overflow call center.

Federal regulators and consumer advocates worry that these scores could eventually put some consumers at a disadvantage, particularly those under financial stress. In effect, they say, the scores could create a new subprime class: people who are bypassed by companies online without even knowing it. Financial institutions, in particular, might avoid people with low scores, reducing those people’s access to home loans, credit cards and insurance.
privacy  business  information  datamining  ecommerce  politics 
august 2012 by jtyost2
Stuff Harvard, MIT, Stanford & Caltech People Like (echen.me)
What types of students go to which schools? There are, of course, the classic stereotypes:

MIT has the hacker engineers.
Stanford has the laid-back, social folks.
Harvard has the prestigious leaders of the world.
Berkeley has the activist hippies.
Caltech has the hardcore science nerds.
But how well do these perceptions match reality? What are students at Stanford, Harvard, MIT, Caltech, and Berkeley really interested in? Following the path of my previous data-driven post on differences between Silicon Valley and NYC, I scraped the Quora profiles of a couple hundred followers of each school to find out.
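
The counting step behind a post like this is simple once the profiles are scraped: tally topic mentions per school and look for topics that are over-represented. A minimal sketch, with invented profiles standing in for the scraped Quora data:

    # Sketch: which topics are disproportionately followed at each school?
    from collections import Counter, defaultdict

    profiles = [  # (school, topics followed) -- invented examples
        ("MIT", ["Hacking", "Robotics", "Startups"]),
        ("Stanford", ["Startups", "Hiking", "Photography"]),
        ("Harvard", ["Politics", "Economics", "Startups"]),
        ("Caltech", ["Physics", "Mathematics", "Robotics"]),
    ]

    by_school = defaultdict(Counter)
    overall = Counter()
    for school, topics in profiles:
        by_school[school].update(topics)
        overall.update(topics)

    for school, counts in by_school.items():
        # crude distinctiveness: what share of a topic's mentions come from this school?
        top = max(counts, key=lambda t: counts[t] / overall[t])
        print(f"{school}: most distinctive topic = {top}")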
information  datamining  education  college  politics 
august 2012 by jtyost2
What we learned from 5 million books | Video on TED.com
Have you played with Google Labs' Ngram Viewer? It's an addicting tool that lets you search for words and ideas in a database of 5 million books from across centuries. Erez Lieberman Aiden and Jean-Baptiste Michel show us how it works, and a few of the surprising things we can learn from 500 billion words.
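
The core computation behind the viewer is just relative frequency: a word's count in a given year divided by the total words published that year. A minimal sketch with invented counts (the real tool runs over the Google Books corpus):

    # Sketch: yearly relative frequency, the quantity the Ngram Viewer plots.
    word_counts = {1900: 1_200, 1950: 9_800, 2000: 64_000}   # occurrences of a word (invented)
    total_words = {1900: 2.1e9, 1950: 6.4e9, 2000: 18.9e9}   # corpus size per year (invented)

    for year in sorted(word_counts):
        freq = word_counts[year] / total_words[year]
        print(f"{year}: {freq * 1e6:.3f} occurrences per million words")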
google  information  datamining  books  culture  politics 
october 2011 by jtyost2
nathanmarz/storm - GitHub
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
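
The "any programming language" part works through Storm's multilang protocol, in which a bolt is just a process that reads tuples on stdin and emits tuples on stdout. A rough sketch of the classic sentence-splitting bolt in Python, assuming the storm.py helper module that ships with Storm's multilang support (a sketch of the pattern, not a drop-in topology):

    # Word-splitting bolt, in the style of Storm's multilang examples.
    import storm  # helper module distributed with Storm's multilang support

    class SplitSentenceBolt(storm.BasicBolt):
        def process(self, tup):
            # tup.values[0] is the incoming sentence; emit one tuple per word
            for word in tup.values[0].split(" "):
                storm.emit([word])

    SplitSentenceBolt().run()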
storm  programming  batchprocessing  datamining  language  politics 
september 2011 by jtyost2
3 Data Rights We Must Demand from Companies - ReadWriteCloud
Last week while covering a tool for analyzing your iPhone location data (or as it turns out, your nearby cell tower and hotspot location data), I mused on my long-time interest in data portability - giving users access to and control over their own data. It's an idea we've been covering here for years.

This week, customer control over data received more attention, with a write-up in the New York Times, a new Facebook acquisition and the revelation that TomTom sold data on its customers' driving habits to law enforcement. These are three different matters: access, use and control. But they are all connected, and as more of our data is stored in the cloud, I'm glad these matters are starting to get more attention.

Here's what we should be demanding of companies:
privacy  business  advertising  information  datamining  dataportability  usa  ftc  politics 
may 2011 by jtyost2
This Tech Bubble Is Different - BusinessWeek
Even if Cloudera doesn't find a cure for cancer, rid Silicon Valley of ad-think, and persuade a generation of brainiacs to embrace the adventure that is business software, Price argues, the tech industry will have the same entrepreneurial fervor of yesteryear. "You can make a lot of jokes about Zynga and playing FarmVille, but they are generating billions of dollars," the Flite CEO says. "The greatest thing about the Valley is that people come and work in these super-intense, high-pressure environments and see what it takes to create a business and take risk." A parade of employees has left Google and Facebook to start their own companies, dabbling in everything from more ad systems to robotics and publishing. "It's almost a perpetual-motion machine," Price says.

Perpetual-motion machines sound great until you remember that they don't exist. So far, the Wants have failed to carry the rest of the industry toward higher ground. "It's clear that the new industry that is building around Internet advertising and these other services doesn't create that many jobs," says Christophe Lécuyer, a historian who has written numerous books about Silicon Valley's economic history. "The loss of manufacturing and design knowhow is truly worrisome."

Dial back the clock 25 years to an earlier tech boom. In 1986, Microsoft, Oracle (ORCL), and Sun Microsystems went public. Compaq went from launch to the Fortune 500 in four years—the quickest run in history. Each of those companies has waxed and waned, yet all helped build technology that begat other technologies. And now? Groupon, which e-mails coupons to people, may be the fastest-growing company of all time. Its revenue could hit $4 billion this year, up from $750 million last year, and the startup has reached a valuation of $25 billion. Its technological legacy is cute e-mail.

There have always been foundational technologies and flashier derivatives built atop them. Sometimes one cycle's glamour company becomes the next one's hard-core technology company; witness Amazon.com's (AMZN) transformation over the past decade from mere e-commerce powerhouse to e-commerce powerhouse and purveyor of cloud-computing capabilities to other companies. Has the pendulum swung too far? "It's a safe bet that sometime in the next 20 months, the capital markets will close, the music will stop, and the world will look bleak again," says Bridgescale Partners' Cowan. "The legitimate concern here is that we are not diversifying, so that we have roots to fall back on when we enter a different part of the cycle."
business  technology  computerscience  hardware  software  engineering  mathematics  advertising  statistics  datamining  politics 
april 2011 by jtyost2
PeteSearch: How to split up the US
"As I've been digging deeper into the data I've gathered on 210 million public Facebook profiles, I've been fascinated by some of the patterns that have emerged. My latest visualization shows the information by location, with connections drawn between places that share friends. For example, a lot of people in LA have friends in San Francisco, so there's a line between them."
statistics  research  datamining  facebook  politics 
february 2010 by jtyost2
Project ‘Gaydar’: An MIT experiment raises new questions about online privacy - The Boston Globe
"Using data from the social network Facebook, they made a striking discovery: just by looking at a person’s online friends, they could predict whether the person was gay. They did this with a software program that looked at the gender and sexuality of a person’s friends and, using statistical analysis, made a prediction. The two students had no way of checking all of their predictions, but based on their own knowledge outside the Facebook world, their computer program appeared quite accurate for men, they said. People may be effectively “outing” themselves just by the virtual company they keep." Cool.
research  socialnetworking  privacy  socialmedia  facebook  ethics  sexual  statistics  datamining  politics 
september 2009 by jtyost2
