
Create a dataset about apples

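To produce some meaningful data (and a model) for us to log to MLflow, we need a dataset. In keeping with our theme of modeling demand for produce sales, this data needs to be about apples.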

The chance of finding an actual dataset about this on the internet is vanishingly small, so we will make our own.

Defining a dataset generator

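For our examples to work, we need something that can actually be fit, but not something that fits too well. We will train over multiple iterations to show the effect of modifying the model's hyperparameters, so the feature set needs some amount of unexplained variance. At the same time, there must be some degree of correlation between the target variable (demand, in the apple sales data we want to predict) and the feature set.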

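We can introduce this correlation by crafting a relationship between our features and our target. The random elements of some of the factors will handle the unexplained variance.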
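The idea of "correlated, but not perfectly" can be sketched in a few lines. The example below is only an illustration and is not part of the generator: the names `signal`, `noise`, and `corr` are made up for this sketch, and the coefficients merely mirror the price effect and noise scale used in the generator.

```python
import numpy as np

# Hypothetical toy example: one feature, one target.
rng = np.random.default_rng(0)
n = 1_000

price = rng.uniform(0.5, 3, n)   # feature
signal = 1000 - 50 * price       # deterministic relationship with the target
noise = rng.normal(0, 50, n)     # unexplained variance
demand = signal + noise          # target = signal + noise

# The correlation is clearly negative, but well short of -1 because of the noise.
corr = np.corrcoef(price, demand)[0, 1]
print(f"correlation between price and demand: {corr:.2f}")
```

Increasing the noise scale pushes the correlation toward zero, while shrinking it makes the target almost perfectly predictable from the feature.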

import pandas as pd
import numpy as np
from datetime import datetime, timedelta


def generate_apple_sales_data_with_promo_adjustment(
    base_demand: int = 1000, n_rows: int = 5000
):
    """
    Generates a synthetic dataset for predicting apple sales demand with seasonality
    and inflation.

    This function creates a pandas DataFrame with features relevant to apple sales.
    The features include date, average_temperature, rainfall, weekend flag, holiday flag,
    promotional flag, price_per_kg, and the previous day's demand. The target variable,
    'demand', is generated based on a combination of these features with some added noise.

    Args:
        base_demand (int, optional): Base demand for apples. Defaults to 1000.
        n_rows (int, optional): Number of rows (days) of data to generate. Defaults to 5000.

    Returns:
        pd.DataFrame: DataFrame with features and target variable for apple sales prediction.

    Example:
        >>> df = generate_apple_sales_data_with_promo_adjustment(base_demand=1200, n_rows=6000)
        >>> df.head()
    """

    # Set seed for reproducibility
    np.random.seed(9999)

    # Create date range (one row per day, ending today, in chronological order)
    dates = [datetime.now() - timedelta(days=i) for i in range(n_rows)]
    dates.reverse()

    # Generate features
    df = pd.DataFrame(
        {
            "date": dates,
            "average_temperature": np.random.uniform(10, 35, n_rows),
            "rainfall": np.random.exponential(5, n_rows),
            "weekend": [(date.weekday() >= 5) * 1 for date in dates],
            "holiday": np.random.choice([0, 1], n_rows, p=[0.97, 0.03]),
            "price_per_kg": np.random.uniform(0.5, 3, n_rows),
            "month": [date.month for date in dates],
        }
    )

    # Introduce inflation over time (years)
    df["inflation_multiplier"] = (
        1 + (df["date"].dt.year - df["date"].dt.year.min()) * 0.03
    )

    # Incorporate seasonality due to apple harvests
    df["harvest_effect"] = np.sin(2 * np.pi * (df["month"] - 3) / 12) + np.sin(
        2 * np.pi * (df["month"] - 9) / 12
    )

    # Modify the price_per_kg based on harvest effect
    df["price_per_kg"] = df["price_per_kg"] - df["harvest_effect"] * 0.5

    # Adjust promo periods to coincide with periods lagging peak harvest by 1 month
    peak_months = [4, 10]  # months following the peak availability
    df["promo"] = np.where(
        df["month"].isin(peak_months),
        1,
        np.random.choice([0, 1], n_rows, p=[0.85, 0.15]),
    )

    # Generate target variable based on features
    base_price_effect = -df["price_per_kg"] * 50
    seasonality_effect = df["harvest_effect"] * 50
    promo_effect = df["promo"] * 200

    df["demand"] = (
        base_demand
        + base_price_effect
        + seasonality_effect
        + promo_effect
        + df["weekend"] * 300
        + np.random.normal(0, 50, n_rows)  # adding random noise
    ) * df["inflation_multiplier"]

    # Add previous day's demand
    df["previous_days_demand"] = df["demand"].shift(1)
    # Backfill the first row, which has no previous day
    df["previous_days_demand"] = df["previous_days_demand"].bfill()

    # Drop temporary columns
    df.drop(columns=["inflation_multiplier", "harvest_effect", "month"], inplace=True)

    return df
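One detail of the generator is worth checking before relying on it. The two sine terms in `harvest_effect` are exactly half a period apart (the month offsets 3 and 9 differ by 6 of 12 months), and since sin(θ − π) = −sin(θ), they cancel. The sketch below evaluates the same expression for each calendar month. The variable names `theta`, `max_abs_effect`, and `single_term` are introduced only for this illustration.

```python
import numpy as np

months = np.arange(1, 13)

# Same expression as in the generator, evaluated once per calendar month.
theta = 2 * np.pi * (months - 3) / 12
harvest_effect = np.sin(theta) + np.sin(2 * np.pi * (months - 9) / 12)

# Only floating-point rounding error remains.
max_abs_effect = np.abs(harvest_effect).max()
print(max_abs_effect)

# For comparison only (not the tutorial's formula): a single sine term
# does vary over the year, peaking in month 6.
single_term = np.sin(theta)
print(single_term.round(2))
```

As written, then, the harvest term contributes essentially nothing to `price_per_kg` or `demand`. The data still has a yearly pattern, because the promo rule is always active in months 4 and 10. The dataset remains suitable for the tutorial either way, so the generator is left unchanged here.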

Generate the data using the function we just defined and save the result.

data = generate_apple_sales_data_with_promo_adjustment(base_demand=1_000, n_rows=1_000)

data[-20:]
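One column worth a closer look in the output is `previous_days_demand`, which lags `demand` by one row. The following toy sketch shows how `shift(1)` followed by a backfill behaves. The four demand values are made up for illustration and do not come from the generator.

```python
import pandas as pd

# Hypothetical demand values for four consecutive days.
demand = pd.Series([10.0, 12.0, 9.0, 15.0])

# shift(1) moves every value down one row, leaving NaN in the first row.
shifted = demand.shift(1)

# bfill() fills that NaN with the next valid value, so the first row
# ends up with its own day's demand as its "previous day" value.
lagged = shifted.bfill()

print(pd.DataFrame({"demand": demand, "previous_days_demand": lagged}))
```

Because of the backfill, the first row's lag equals its own target, so that one row leaks the label into a feature. This is negligible across a thousand rows, but worth keeping in mind when reusing the pattern.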

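In the next section, we will use this generator both for its output (the dataset) and as an example of how to leverage MLflow Tracking during the prototyping phase of a project.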