How Things Work: Machine Learning
Today’s article was written by Steve Zheng, Contributor.
It all started with the TOM case on Watson. After a class-long discussion on what is Toronto, Anton and I were amazed by how we were able to cover one of the greatest machine learning triumphs in human history without actually talking much about machine learning. When Anton started planning for the new Harbus column How Things Work, it occurred to us that now might be a good time to double-click on the omnipresent yet not-so-obvious buzz term: machine learning.
Being a Harbus article, How Things Work: Machine Learning has two goals: (1) to unravel machine learning with a focus on its applications to business, and (2) to distill some actionable insights on how to communicate with a software developer or data scientist who may be your future client, colleague, friend, significant other, or, in all likelihood, your interviewer.
Understanding Machine Learning
There is a pile of 300 HBS cases, each case being either a BGIE or a FIN2 case. How would you correctly divide the cases into a BGIE pile and a FIN2 pile? An immediate reaction would be to count the number of pages for each case. If the number is low, that case is likely FIN2. If the page count is double-digit or more, that case is probably BGIE. Following this strategy, you proceed with the following steps:
- Pick 20 cases from the pile, read through them carefully, and manually label them as either BGIE or FIN2.
- Feed those 20 cases to a computer, and set an initial cutoff page count (say, 15). Any case with fewer than 15 pages will be classified as FIN2, otherwise BGIE.
- Allow the computer to update the cutoff value (doing so is trivial in the machine learning field) so that the highest percentage of the 20 cases are classified correctly, using your manual labels in Step 1 as the answer key.
- Let’s say the computer settles on number 10 as the optimal cutoff page count.
- Starting from the 21st case, you can just blindly feed it to your computer algorithm. Any case with fewer than 10 pages will be automatically classified as FIN2, otherwise BGIE.
You have just implemented a machine learning algorithm for classifying HBS cases! It should be quite accurate in classifying most BGIE vs. FIN2 cases, barring some outliers (e.g., the 20-page FIN2 case on Burger King would be incorrectly classified as BGIE). By labeling the first 20 cases, you were essentially injecting human insights into the machine learning algorithm, which can then deduce the optimal cutoff page count and automate the remaining classification tasks. In other words, you were “supervising” the machine to learn from the first 20 labeled samples. This machine learning paradigm is called, unsurprisingly, supervised learning.
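For readers who want to see the cutoff search spelled out, here is a minimal Python sketch. It is an illustration under assumptions rather than a definitive implementation: the function names, the candidate range of 1 to 40 pages, and the ten labeled page counts are all made up for this example (we use ten cases instead of 20 to keep it short).

```python
def classify(pages, cutoff):
    """Fewer pages than the cutoff means FIN2, otherwise BGIE."""
    return "FIN2" if pages < cutoff else "BGIE"


def accuracy(cases, cutoff):
    """Fraction of labeled (pages, label) cases that the cutoff gets right."""
    correct = sum(1 for pages, label in cases if classify(pages, cutoff) == label)
    return correct / len(cases)


def learn_cutoff(cases, candidates=range(1, 41)):
    """Try every candidate cutoff and keep the most accurate one."""
    return max(candidates, key=lambda c: accuracy(cases, c))


# Hypothetical hand-labeled sample (page count, label); not real case data.
labeled = [
    (5, "FIN2"), (6, "FIN2"), (7, "FIN2"), (8, "FIN2"), (9, "FIN2"),
    (10, "BGIE"), (12, "BGIE"), (15, "BGIE"), (18, "BGIE"), (22, "BGIE"),
]

cutoff = learn_cutoff(labeled)       # the "learning" step
print(cutoff, accuracy(labeled, cutoff))
print(classify(20, cutoff))          # a new, unlabeled 20-page case
```

With this toy sample the search settles on a cutoff of 10, and a 20-page case is labeled BGIE, which is exactly how the Burger King outlier would be misclassified.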
Business applications of supervised learning include bioinformatics (classifying malignant vs. benign genes), text analysis (classifying spam vs. non-spam), sentiment analysis (predicting election outcomes based on social media feeds), information retrieval (search engines and Watson), speech recognition, digital marketing, and algorithmic trading.
The antithesis of supervised learning is unsupervised learning. A good example is a set of data points with x and y coordinates. Feed the data points into a clustering algorithm, and the computer will group together the points that lie close to one another, for instance by the quadrant each point falls into. In this example, the computer is automatically clustering raw data points in a two-dimensional space without you labeling the first 20 data points. Unsupervised learning is therefore characterized by a lack of pre-labeled data in its algorithmic constructs. Applications of unsupervised learning include recommendation systems (clustering similar products into a “you may also like” bucket), security (malware detection), bio-imaging, language translation, and conversational agents such as Siri.
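To make clustering without labels concrete, here is a minimal sketch using k-means, one common clustering technique that we are swapping in for illustration; the article's example does not depend on it. The six points and the two starting centroids are made-up assumptions, and real implementations usually choose the starting centroids randomly.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, rounds=10):
    """Alternate between assigning points to centroids and moving the centroids."""
    for _ in range(rounds):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return [nearest(p, centroids) for p in points]


# Made-up, unlabeled points forming two loose blobs.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # one cluster index per point
```

Nobody told the algorithm which point belongs where; the cluster indices emerge purely from how close the points are to one another.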
Naturally, unsupervised learning is more scalable, as it bypasses the need to generate the manual labels required in supervised settings. As such, many AI practitioners consider unsupervised learning a stronger proxy for human intelligence, which is characterized by self-awareness and continuous learning with little external intervention. We envision future generations of smart machines as hybrid models that blend scalability with versatility and contextual awareness: a “Watson” for everything.
Communicating Machine Learning
Besides supervised vs. unsupervised learning, what other elements could you discuss with a software engineer or data scientist to bring the conversation to the next level? We would like to share three ideas.
- Ask about the input data.
Data is the new oil. If a machine learning algorithm is a jet engine, high-volume, high-quality data is the premium oil that lets the algorithm unleash its maximum power. In the technology industry, more and more strategic acquisitions focus on acquiring “big data” assets with long-lasting business value. Given the mission-critical nature of input data, you can never go wrong by prompting your counterparts to talk more about their data sources. Common jargon in data-related discussions includes sampling, bias, variance, noise, normalization, and dimensionality reduction. Wikipedia provides excellent information on each concept.
- Ask about how they measure the effectiveness of their machine learning algorithms.
Professor Clayton Christensen says that general management research is missing a critical theoretical building block: a theory of metrics. His observation seems to hold true in the domain of artificial intelligence as well. By optimizing our supervised learning algorithm so that “the highest percentage of the 20 cases are classified correctly,” we were implicitly using accuracy as the metric to evaluate our algorithm. However, there is a much larger pool of metrics we may draw from, and the industry is still missing a coherent framework for machine learning metrics. Try asking your counterparts what metrics they are using for their latest projects. You will run into a wide array of terms such as coverage, error rate, precision/recall, likelihood, and information gain. Again, a basic Wikipedia survey would be more than sufficient to get you started.
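To see why the choice of metric matters, here is a small sketch that computes accuracy, precision, and recall from scratch. The ten true labels and predictions are invented for illustration, and treating FIN2 as the "positive" class is an arbitrary assumption.

```python
def evaluate(truth, predicted, positive):
    """Return (accuracy, precision, recall) for the chosen positive label."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    acc = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against dividing by zero when nothing was predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return acc, precision, recall


# Hypothetical pile: 2 FIN2 cases and 8 BGIE cases; the classifier misses one FIN2.
truth = ["FIN2", "FIN2"] + ["BGIE"] * 8
predicted = ["FIN2", "BGIE"] + ["BGIE"] * 8

acc, precision, recall = evaluate(truth, predicted, positive="FIN2")
print(acc, precision, recall)
```

On this toy pile the classifier is right on nine cases out of ten, yet it finds only half of the FIN2 cases. A single accuracy number would hide that gap, which is why it is worth asking which metric a team is actually optimizing.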
- “Do you believe in Symbolism or Neural Networks?”
Our HBS case classifier belongs to the Symbolism tribe, where each HBS case is reduced to a human-readable symbolic representation (i.e., total page count). The Neural Network tribe, on the other hand, tackles machine learning problems by modeling the human brain and nervous system. The tribes of Symbolism and Neural Networks are like the Democratic and Republican parties of AI. Bonus points if you run into a Symbolism follower and a Neural Network follower at the same dinner table: ask the restaurant if they have some popcorn.
Now that you know machine learning …
How would you teach a machine to differentiate between FIN1 and FIN2 cases? Shallow symbolic features such as total page count may no longer suffice. Instead, we will need to leverage the semantics of our input data: a FIN1 case is more likely to talk about CAPM and portfolio theory, while a FIN2 case will probably touch upon debt financing and options. Unfortunately, we are running out of space.
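For the curious, one very rough way to picture such semantic features is to count topic keywords in the case text. The sketch below is an assumption-laden toy: the keyword lists come from the examples above, the rules are hand-written rather than learned, and the sample sentences are invented. A real system would more likely learn word weights from labeled cases.

```python
# Keyword lists taken from the examples in the text; a learned model would
# discover and weight such terms from labeled cases instead.
FIN1_TERMS = ("capm", "portfolio theory")
FIN2_TERMS = ("debt financing", "options")


def classify_by_keywords(text):
    """Label a case by whichever course's keywords appear more often."""
    lowered = text.lower()
    fin1_hits = sum(lowered.count(term) for term in FIN1_TERMS)
    fin2_hits = sum(lowered.count(term) for term in FIN2_TERMS)
    if fin1_hits == fin2_hits:
        return "unsure"  # no evidence either way
    return "FIN1" if fin1_hits > fin2_hits else "FIN2"


# Invented one-sentence stand-ins for full case texts.
print(classify_by_keywords("The fund applies CAPM and portfolio theory to rebalance."))
print(classify_by_keywords("The firm weighs debt financing against issuing options."))
print(classify_by_keywords("A case about marketing."))
```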
Steve Zheng (HBS ’18) was a program manager at Microsoft’s Silicon Valley office, driving product development in AI and Search. Born and raised in Shanghai, he studied computer science at the Institute down the Charles River.