Humans learn first by naturally absorbing and processing data about the world around them. Later, as they begin to understand language, they can be “programmed” through formal representations of information — listening to older people and asking questions, reading, and watching/listening to video and audio content.
There are a number of factors that determine how well a person learns — including the quality of their “teachers” and the individual’s desire to learn — but perhaps the most critical factor is what they’re taught. What are the data inputs?
Tell a small child repeatedly that blue is orange, or that a tree is a car, and they’ll believe blue is orange or a tree is a car until they receive enough evidence (or enough compelling evidence) to believe otherwise. And even then, it may be difficult to shake them from their initial beliefs.
Which brings this technology blog post to the topic of technology. Artificial intelligence (AI) and related techniques, such as machine learning and neural networks, are already being integrated into enterprise networks across a number of industries, including the financial sector, healthcare, and manufacturing.
Organizations are using intelligent machines to improve efficiency, cut costs, automate processes, help enterprise leaders make faster and better decisions, predict internal and external events, and much more. AI has even been used by a couple of music technologists to create a CD of original “black metal” music.
So why did a neural network make a CD of black metal music? Because that’s what it was taught to do through the algorithm built for it, and an algorithm can only work with the data it’s presented. The music technologists fed it “data” in the form of audio bits from a black metal CD made by a human band.
But what if they had also fed it random audio bits from a Mozart concerto or a Pete Seeger folk ballad? Unless the algorithm ignored these audio bits because it didn’t recognize them, their introduction into the neural network would render the intended black metal CD less authentic-sounding (as bassoons and banjos tend to do).
“The problem is as old as data-processing itself: garbage in, garbage out,” Cory Doctorow writes in BoingBoing. “Assembling the large, well-labeled datasets needed to train machine learning systems is a tedious job (indeed, the whole point and promise of machine learning is to teach computers to do this work, which humans are generally not good at and do not enjoy). The shortcuts we take to produce datasets come with steep costs that are not well-understood by the industry.”
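To make "garbage in, garbage out" concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the one-dimensional toy data, the "metal" and "folk" labels, the choice of a 1-nearest-neighbor classifier, and the helper names. It only shows that the same simple algorithm gets worse when a few training labels are wrong.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-nearest-neighbor)."""
    _, closest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return closest_label


def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(1 for x, label in test if nearest_neighbor_predict(train, x) == label)
    return correct / len(test)


def true_label(x):
    # Hypothetical ground truth: values below 5 are "metal", the rest are "folk".
    return "metal" if x < 5 else "folk"


def flip(label):
    """Swap one label for the other, simulating a labeling mistake."""
    return "folk" if label == "metal" else "metal"


# Clean training set: ten points, all labeled correctly.
clean_train = [(x, true_label(x)) for x in range(10)]

# Noisy training set: the same points, but three are mislabeled (garbage in).
mislabeled = {1, 3, 7}
noisy_train = [(x, flip(label) if x in mislabeled else label) for x, label in clean_train]

# Test points sit just next to each training point.
test = [(x + 0.25, true_label(x + 0.25)) for x in range(10)]

clean_accuracy = accuracy(clean_train, test)
noisy_accuracy = accuracy(noisy_train, test)
print(f"clean training data: {clean_accuracy:.0%}")
print(f"noisy training data: {noisy_accuracy:.0%}")
```

The algorithm is identical in both runs; only the training labels differ. In this toy setup, each mislabeled training example turns into one wrong prediction.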
Pete Warden, an engineer and technology author, writes in detail about how progress in machine learning has been impeded because too much time and energy is being spent on improving algorithms, and too little effort on the quality of training data.
“As part of my job I work closely with a lot of researchers and product teams, and my belief in the power of data improvements comes from the massive gains I’ve seen them achieve when they concentrate on that side of their model building,” Warden says. “The biggest barrier to using deep learning in most applications is getting high enough accuracy in the real world, and improving the training set is the fastest route I’ve seen to accuracy improvements.”
Accomplishing that means not giving in to the shortcuts to which Doctorow alludes.
“It may seem obvious, but your very first step should be to randomly browse through the training data you’re starting with,” Warden writes. “I always feel a bit silly going through this process, but I’ve never regretted it afterwards.”
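Randomly browsing a training set takes very little code. The sketch below is one possible way to do it with Python's standard library, not Warden's own tooling; the `sample_for_review` helper, the file names, and the labels are all hypothetical.

```python
import random


def sample_for_review(examples, k=5, seed=None):
    """Return up to k examples chosen uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(examples, min(k, len(examples)))


# Hypothetical training set of (file name, label) pairs.
training_data = [(f"clip_{i:03d}.wav", "black_metal") for i in range(200)]

# Print a handful of examples to inspect by hand.
for file_name, label in sample_for_review(training_data, k=5, seed=0):
    print(f"{file_name}\t{label}")
```

Passing a fixed `seed` makes the sample repeatable, which helps when sharing a suspicious example with a colleague. Leaving it as `None` gives a fresh sample on every run.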
Bottom line: Ensuring data quality is imperative if you care about the quality of your AI and machine learning initiatives.
Poor-quality data is undoubtedly a real problem in the world of AI and machine learning. In the next post, I’ll discuss a more ominous problem.