FALL 2020

Instructor: Chris Tanner
TFs: Will Claybaugh, Shahab Asoodeh, Phoebe Wong, Isaac Slavitt, Zona Kostic, Nick Stern

Network Recommendation for Small Businesses

With over 5 million members spread across more than 30,000 local communities, Alignable aims to help business owners and operators establish their online presence, grow their networks by finding peers with related professional interests, and build valuable relationships through referrals and recommendations. The key question we address is “How do we best recommend potential connections?”, or in other words, “Who should connect with whom?” Our solution is to build a recommendation system that generates ranked lists of businesses so that Alignable can send out referral emails accordingly.

Understanding Catastrophic Network Forgetting

Neural network models suffer from the phenomenon of catastrophic forgetting: a model can drastically lose its generalization ability on a task after being trained on a new task. This usually means that training on a new task overwrites the weights learned in the past, and thus degrades the model's performance on the past tasks. In this project, we explore how commonly used deep learning methods (e.g., batch-norm, dropout, data augmentation, weight decay) mitigate or exacerbate the degree of forgetting. Further, we would like to select one or several of these methods and try to understand the causes of their effects.

Bringing BERT to the field: Transformer models for gene expression prediction in maize

Inspired by the rapid developments in the field of Natural Language Processing (NLP), we are interested in exploring the possibility of applying powerful language representation models like transformers to the field of genomics. In simple terms, gene expression levels, typically measured in transcripts per million, represent the number of copies of a particular gene within a cell. For plants, a higher expression level of a gene for a protein that resists heat stress could translate into higher resiliency. We develop CornBERT to predict gene expression in maize.

Thomas Edison Award

Improving Named Entity Disambiguation using Entity Relatedness within Wikipedia

Named Entity Disambiguation (NED) is a research area of Natural Language Processing (NLP) focused on linking a reference within a unit of text to its corresponding entity in some knowledge base, such as a node in a knowledge graph. Accurate NED is critical for modern technology companies to process and link documents, reports, press releases and other written material with meaningful contextual information to entries in knowledge bases. Toward this goal, we explore "Congruence" -- our idea for explicitly leveraging/enforcing a holistic, shared agreement of contextual entities. We used open-source Wikimedia data to demonstrate the potential performance gains on entity disambiguation tasks.

Finding Your Dream Home: House Recommendations for REX Real Estate

The process of house-buying is often complicated and tedious. There is a plethora of information about listings online, and it takes a long time for a home buyer to sift through it to find a home that meets his or her criteria. Often, a lot of effort is wasted searching for a good house even before beginning the extended process of buying a home. We aim to improve the overall user experience of finding an ideal home. Specifically, we developed an application for REX that serves open-minded house-hunters with personalized matches for discovering their perfect home.

The GSD Award

Towards a Revamped Real Estate Index

This semester we worked with REX, a real estate technology company that is trying to bring innovation to an industry that hasn’t seen much of it over the past 50+ years. In the spirit of REX’s mission, our goal was to address two weaknesses of traditional real estate indices. First, we sought to predict the market conditions in any given market area between one and six months into the future. Second, to make these predictions as targeted as possible, we tailored our forecast to each census tract (CT) in the given market area.

Uniting Somerville’s Tree Inventory Datasets for Tree Growth and Survival Analysis

Somerville is home to nearly 14,000 publicly owned trees, which provide over one million dollars in ecosystem service benefits (e.g., reducing stormwater runoff and water pollution, moderating temperatures, preventing erosion). Further, the city's recent survey results revealed a correlation between tree-rich neighborhoods and happiness. On average, Somerville needs to invest $1,000 to plant and maintain a new tree. Consequently, understanding how trees grow and survive within Somerville is essential for efficiently allocating city resources while also producing the most public benefit. Toward this goal, we developed models to correctly link unique trees from cross-temporal inventories.

Winston Churchill Award

Reimagining Inventory Management for Fashion Retail

The apparel & fashion industry faces a huge challenge: inventory allocation. In addition to the disruptive nature of rising and falling fashion trends and the emergence of fast fashion, the intense financial pressure brought on by the ongoing pandemic has made effective inventory management more important than ever.  Unfortunately, the allocation process is in dire need of modernization. Syrup, an early-stage Harvard Business School startup, is tackling this problem by helping retailers automate their inventory management processes. We partnered with them and built models that: (1) predict sales for a client’s stores; and (2) optimize how these projected garments will be allocated across stores.

NOTE: The team signed an NDA, and the blog and poster video were not further edited for public release. 

The Effect of Fast-shipping

Wayfair offers 14 million items from more than 11,000 global suppliers. "Fast-shipping" products can be delivered to a portion of Wayfair’s customers in two business days or less. While this increases customer purchases, it comes at a price for Wayfair. We investigate if adding a fast shipping option to a product would increase its profits. In particular, we: (1) estimate the average boost in sales across all products, for each category, and for each individual product; (2) measure how the sales boost changes as the portion of products offered with fast shipping increases; (3) identify key product characteristics that will make the product sales more sensitive to fast shipping.

Read the Reviews: Analyzing NLP Signals of Wayfair Products

E-retailers need to predict future return rates for quality control and pricing applications. We investigate NLP methods for feature extraction from free-text reviews. These signals meaningfully improve the prediction of future product return rates.


Instructor: Chris Tanner
TFs: None

Working with Austin Pets Alive! (APA!), an Austin, Texas-based no-kill animal shelter, AdOptimize is revolutionizing the pet adoption process. Namely, one of the limiting factors in adoption rates is the poor quality of the animals' photos -- from a lack of photos, bad angles, lighting, etc. By using Machine Learning, AdOptimize greatly improves the photo process. Empirically, great success has been shown by capturing photos of dogs (increased adoption rates). A natural extension is to focus on making similar progress for cats.

Kensho, a private finance technology firm, aims to understand and predict financial markets, ideally by using the wealth of text data on the Internet. In an attempt to understand any text, it is often crucial to know which entities (e.g., people, locations, organizations) are in the text.

Independent of this, our world knowledge can be represented in a structured format called a knowledge graph, where each node represents a real-world entity, and every node is connected to others via their real-world relationships (e.g., Michael Jordan was a teammate_with Scottie Pippen, and was coached_by Phil Jackson). Given a knowledge graph constructed from all of Wikipedia's data, can we use this structured information to better understand and identify entities within other non-structured text documents (e.g., financial reports)?
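As a rough illustration of the knowledge graph idea described above, the sketch below stores the two example relationships as (subject, relation, object) triples and looks up an entity's neighbors. The function names and the triple layout are assumptions made for this example, not the project's actual data model.

```python
from collections import defaultdict

# Example relationships from the text, stored as (subject, relation, object).
TRIPLES = [
    ("Michael Jordan", "teammate_with", "Scottie Pippen"),
    ("Michael Jordan", "coached_by", "Phil Jackson"),
]


def build_graph(triples):
    """Map each node to a list of (relation, other_node) edges.

    Each triple is recorded in both directions so that either endpoint
    can be used as a starting point for a lookup (an assumption of this
    sketch; a real knowledge graph may keep edges directed).
    """
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
        graph[obj].append((relation, subject))
    return graph


def neighbors(graph, node):
    """Return the set of entities directly connected to `node`."""
    return {other for _, other in graph.get(node, [])}


graph = build_graph(TRIPLES)
print(sorted(neighbors(graph, "Michael Jordan")))
```

Entities that share many neighbors in such a graph are likely to be related, which is the kind of signal that could help decide which real-world entity an ambiguous mention in a document refers to.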

FALL 2019

Instructors: Pavlos Protopapas and Chris Tanner
TFs: Kevin Rader, Isaac Slavitt, Javier Zazo, Cecilia Garraffo, Gonzalo Mena

Building An Image Recommendation System For News Articles using Word and Sentence Embeddings

Working in collaboration with the Associated Press (AP), this capstone group built a Text-to-Image recommendation system to recommend a set of images using headline captions. 


Since Machine Learning methods cannot optimize text directly, the team converted text to a numerical representation using word embeddings, which are means by which a word can be represented as a vector of numbers.
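To make the idea of word embeddings concrete, here is a minimal sketch that averages word vectors into a text vector and ranks image captions by cosine similarity to a headline. The three-dimensional vectors are made-up toy values and the function names are hypothetical; the team's actual embeddings and ranking method are not described here.

```python
import math

# Toy 3-dimensional word vectors (invented values, for illustration only).
EMBEDDINGS = {
    "storm": [0.9, 0.1, 0.0],
    "flood": [0.8, 0.2, 0.1],
    "election": [0.0, 0.9, 0.3],
    "vote": [0.1, 0.8, 0.4],
}
DIMENSIONS = 3


def embed(text):
    """Represent a text as the average of its known word vectors."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIMENSIONS
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_captions(headline, captions):
    """Sort captions from most to least similar to the headline."""
    target = embed(headline)
    return sorted(captions, key=lambda c: cosine(target, embed(c)), reverse=True)


print(rank_captions("storm", ["election vote", "flood"]))
```

Averaging is only one simple way to combine word vectors into a sentence vector; it ignores word order, which is one reason dedicated sentence embeddings are also used.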

Optimal Real-time Scheduling for Black Hole Imaging


In April 2019, the Event Horizon Telescope (EHT) Collaboration released the first image of a black hole. To accomplish this, the EHT used radio dishes across the globe simultaneously recording radio waves from near the black hole, synchronized by Global Positioning System (GPS) timing and referenced to atomic clocks for stability. EHT observations typically take place during a 10-12 day window with 5-6 days to be triggered when conditions are optimal. This project's goal is to use machine learning and/or prediction methods to help the EHT determine which nights should be triggered for global observations. This is an opportunity for students to work with EHT scientists and engineers on various aspects of black hole science in order to assess the probability that observations will lead to breakthrough results.

The need for efficient Neural Architecture Search (NAS)

Deep learning frees us from feature engineering, but creates a new problem of “architecture engineering”. Numerous neural network architectures have been invented, but the design of architectures often feels more like an art than a science. In this project, we investigate an efficient gradient-based search method called DARTS (Differentiable Architecture Search). DARTS is shown to require ~100x fewer GPU hours than previous methods like NASNet and AmoebaNet, and is competitive with the ENAS approach from Google Brain. We will compare DARTS to random search and state-of-the-art, hand-designed architectures such as ResNet.

Named Entity Disambiguation Boosted with Knowledge Graphs

Named Entity Disambiguation (NED), or Named Entity Linking, is a natural language processing (NLP) task which assigns a unique identity to entities mentioned in text. This can be helpful in text analysis. For example, a financial company may want to identify all companies mentioned within a news article, and subsequently investigate how the relations between the companies might affect the markets.

Computer Vision for Automatic Road Damage Detection

Deteriorating roads plague areas with highly volatile weather and budgetary constraints. It’s a constant challenge for municipal governments to keep ahead of the wear and tear as they catalogue and target hot spots to fix. In the U.S., most states only employ semi-automated methods for keeping track of road damage, and in other parts of the world, the process is completely manual, or forgone altogether. The costly and time-consuming procedure for collecting these data is only compounded by the fact that it must be done with relatively high frequency to ensure the data are up to date. This raises the question: can computer vision help?

Machine Learning for Urban Planning: Estimating Parking Capacity

If everything continues as planned, Somerville, Massachusetts — a city just outside of Boston — will be getting a new subway line in 2021. Though the new line is exciting, it may cause issues for the existing citywide resident on-street parking program. To address transportation planning questions, Somerville is conducting an audit of their parking supply. They have a good estimate of on-street parking capacity, but they have much less data about off-street parking. Their question is deceptively simple: how many residential units in Somerville have off-street parking?

Spotify Challenge: Offline Recommender System

One of the main challenges for Spotify is to recommend the right music to each user. Users' satisfaction can be monitored based on whether they skip the recommendation. Therefore, the goal of a good recommender system is to show users content they like, and to minimize the probability that they will skip a song. In this project, we present the problem of sequential music recommendations.

Back-Translation for Named Entity Recognition

The original question of our project was whether we could incorporate information from a knowledge base such as WikiData to improve performance on NER. We explore several methods for constructing type-specific vocabularies compiled from the knowledge base and show the non-triviality of compiling and cleaning this data. We then explore several methods of incorporating these vocabularies to learn an NER classifier trained on Wikipedia articles in a weakly-supervised way. We demonstrate the challenges of incorporating non-contextual information in a setting where context is key. Lastly, we show how we can incorporate ideas from low-resource neural machine translation to improve the generalizability of NER classification.
