Although data scientists focus on model development, they spend a substantial portion of their time analyzing, cleaning, and transforming raw data, because model quality depends on input quality ("garbage in, garbage out"). Each time a new model was needed, they relied on data engineers to build model-specific pipelines, which delayed experimentation and reduced agility. Without a centralized feature store, ML scientists also found it hard to reuse data, which led to duplicated effort.
Conceptual Solution: Centralized Seed and Feature Data Store
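To address these inefficiencies, Wayfair aimed to establish a unified repository for all base data (called "seed data") and feature data. Seed data corresponds to primary keys and basic definitions (such as user IDs, product IDs, and interaction events), while feature data is derived by transforming seed data, for example "the number of products a user viewed in the past 30 days" or "purchase preferences within a given time range." This unified platform would let all ML scientists easily access, create, and manage their own feature sets, greatly improving their productivity.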
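To make the distinction between seed data and feature data concrete, here is a minimal Python sketch of how a feature such as "products viewed in the past 30 days" could be derived from seed events. It is an illustration under stated assumptions, not Wayfair's implementation: the event tuple layout, the sample records, the function name `products_viewed_last_n_days`, and the choice to count distinct products rather than total views are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical seed data: (user_id, product_id, event_type, timestamp).
# The schema and values are invented for illustration only.
SEED_EVENTS = [
    ("u1", "p1", "view", datetime(2024, 1, 5)),
    ("u1", "p2", "view", datetime(2024, 1, 20)),
    ("u1", "p2", "view", datetime(2024, 1, 25)),      # repeat view of p2
    ("u1", "p3", "view", datetime(2023, 11, 1)),      # outside a 30-day window
    ("u2", "p1", "purchase", datetime(2024, 1, 10)),  # not a view event
    ("u2", "p4", "view", datetime(2024, 1, 28)),
]


def products_viewed_last_n_days(events, as_of, days=30):
    """Count distinct products each user viewed in (as_of - days, as_of]."""
    window_start = as_of - timedelta(days=days)
    viewed = defaultdict(set)
    for user_id, product_id, event_type, ts in events:
        # Keep only view events that fall inside the lookback window.
        if event_type == "view" and window_start < ts <= as_of:
            viewed[user_id].add(product_id)
    # Feature data: one derived value per seed key (here, per user ID).
    return {user_id: len(products) for user_id, products in viewed.items()}


features = products_viewed_last_n_days(SEED_EVENTS, as_of=datetime(2024, 1, 31))
print(features)
```

In a centralized store, a derived table like `features` would be computed once and keyed by the seed identifier, so that several models can read the same values instead of each rebuilding the pipeline.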