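07. PyTorch Experiment Tracking¶

Note: This notebook uses torchvision's new multi-weight support API (available in torchvision v0.13+).

We've trained a fair few models now on the journey to making FoodVision Mini (an image classification model to classify images of pizza, steak or sushi).

So far we've kept track of them via Python dictionaries.

Or just by comparing the metric print outs during training.

What if you wanted to run a dozen (or more) different models at once?

Surely there's a better way...

Experiment tracking.

Since experiment tracking is so important to machine learning, you can consider this notebook your first milestone project.

So welcome to Milestone Project 1: FoodVision Mini Experiment Tracking.

We're going to answer the question: how do I track my machine learning experiments?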

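What is experiment tracking?¶

Machine learning and deep learning are very experimental.

You have to put on your artist's beret/chef's hat to cook up lots of different models.

And you have to put on your scientist's coat to track the results of various combinations of data, model architectures and training regimes.

That's where experiment tracking comes in.

If you're running lots of different experiments, experiment tracking helps you figure out what works and what doesn't.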

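Why track experiments?¶

If you're only running a handful of models (like we've done so far), it might be fine just to track their results in print outs and a few dictionaries.

However, as the number of experiments you run starts to increase, this naive way of tracking could get out of hand.

So if you're following the machine learning practitioner's motto of experiment, experiment, experiment!, you'll want a way to track them.

After building a few models and tracking their results, you'll start to notice how quickly it can get out of hand.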

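Different ways to track machine learning experiments¶

There are as many different ways to track machine learning experiments as there are experiments to run.

This table covers a few.

Method	Setup	Pros	Cons	Cost
Python dictionaries, CSV files, print outs	None	Easy to set up, runs in pure Python	Hard to keep track of large numbers of experiments	Free
TensorBoard	Minimal, install `tensorboard`	Extensions built into PyTorch, widely recognized and used, scales easily	User experience not as nice as other options	Free
Weights & Biases Experiment Tracking	Minimal, install `wandb`, make an account	Excellent user experience, make experiments public, tracks almost anything	Requires an external resource outside of PyTorch	Free for personal use
MLflow	Minimal, install `mlflow` and start tracking	Fully open-source MLOps lifecycle management, many integrations	A little harder to set up a remote tracking server than other services	Free

Various places and techniques you can use to track your machine learning experiments. Note: There are other options similar to Weights & Biases and open-source options similar to MLflow, but I've left them out for brevity. You can find more by searching "machine learning experiment tracking".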
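
As a rough illustration of the first row of the table (dictionaries and CSV files), the sketch below appends one row per experiment to a CSV file using only the Python standard library. The helper name `log_experiment`, the file name and the metric values are made up for illustration; they are not part of the course code.

```python
import csv
import tempfile
from pathlib import Path

def log_experiment(csv_path: Path, row: dict) -> None:
    """Appends one experiment's settings and metrics as a row in a CSV file."""
    write_header = not csv_path.exists()
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if write_header:
            writer.writeheader()  # only written once, when the file is new
        writer.writerow(row)

# Hypothetical results, written to a temporary directory so nothing is left behind
log_file = Path(tempfile.mkdtemp()) / "experiments.csv"
log_experiment(log_file, {"model": "effnetb0", "epochs": 5, "test_acc": 0.77})
log_experiment(log_file, {"model": "effnetb2", "epochs": 10, "test_acc": 0.81})

# Read everything back to compare runs (note: csv returns all values as strings)
with open(log_file, newline="") as f:
    rows = list(csv.DictReader(f))

print(rows)
```

This is fine for a handful of runs, but every new metric or setting means changing the columns by hand, which is the kind of bookkeeping the other tools in the table take care of.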

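What we're going to cover¶

We're going to run several modelling experiments with different levels of data, model size and training time to try and improve FoodVision Mini.

Because of its tight integration with PyTorch and its widespread use, this notebook focuses on using TensorBoard to track our experiments.

However, the principles we'll cover are similar across all of the other experiment tracking tools.

Topic	Contents
0. Getting setup	We've written a fair bit of useful code over the past few sections, let's download it and make sure we can use it again.
1. Get data	Let's get the pizza, steak and sushi image classification dataset we've been using to try and improve our FoodVision Mini model's results.
2. Create Datasets and DataLoaders	We'll use the `data_setup.py` script we wrote in chapter 05. PyTorch Going Modular to set up our DataLoaders.
3. Get and customize a pretrained model	Just like the last section, we'll download a pretrained model from `torchvision.models` and customize it to our own problem.
4. Train model and track results	Let's see what it's like to train a single model and track its training results using TensorBoard.
5. View our model's results in TensorBoard	Previously we visualized our model's loss curves with a helper function, now let's see what they look like in TensorBoard.
6. Create a helper function to track experiments	If we're going to follow the machine learning practitioner's motto of experiment, experiment, experiment!, we'd best create a function that helps us save our modelling experiment results.
7. Set up a series of modelling experiments	Rather than running experiments one by one, how about we write some code to run several experiments at once, with different models, different amounts of data and different training times.
8. View modelling experiments in TensorBoard	By this stage we'll have run eight modelling experiments in one go, which is quite a few to keep track of, so let's see what their results look like in TensorBoard.
9. Load in the best model and make predictions with it	The point of experiment tracking is to figure out which model performs best, so let's load the best performing model and make some predictions with it to visualize, visualize, visualize!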

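Where can you get help?¶

All of the materials for this course are available on GitHub.

If you run into trouble, you can ask a question on the course GitHub Discussions page.

And of course, there's the PyTorch documentation and the PyTorch developer forums, a very helpful place for all things PyTorch.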

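0. Getting setup¶

Let's start by downloading all of the modules we'll need for this section.

To save us writing extra code, we're going to use some of the Python scripts (such as data_setup.py and engine.py) we created in section 05. PyTorch Going Modular.

Specifically, we're going to download the going_modular directory from the pytorch-deep-learning repository (if we don't already have it).

We'll also get the torchinfo package if it's not already installed.

torchinfo will help later on to give us a visual summary of our model(s).

And since we're using a newer version of the torchvision package (v0.13 as of June 2022), we'll make sure we've got the latest versions.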

In [1]:

# For this notebook to run with updated APIs, we need torch 1.12+ and torchvision 0.13+
try:
    import torch
    import torchvision
    # Compare (major, minor) tuples so newer releases such as torch 2.x also pass the check
    torch_version = tuple(int(x) for x in torch.__version__.split(".")[:2])
    torchvision_version = tuple(int(x) for x in torchvision.__version__.split(".")[:2])
    assert torch_version >= (1, 12), "torch version should be 1.12+"
    assert torchvision_version >= (0, 13), "torchvision version should be 0.13+"
    print(f"torch version: {torch.__version__}")
    print(f"torchvision version: {torchvision.__version__}")
except (ImportError, AssertionError):
    print("[INFO] torch/torchvision versions not as required, installing newer versions.")
    !pip3 install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
    import torch
    import torchvision
    print(f"torch version: {torch.__version__}")
    print(f"torchvision version: {torchvision.__version__}")

torch version: 1.13.0.dev20220620+cu113
torchvision version: 0.14.0.dev20220620+cu113

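Note: If you're using Google Colab, you may have to restart your runtime after running the above cell. After restarting, you can run the cell again and verify you've got the right versions of torch (1.12+) and torchvision (0.13+).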
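
When verifying versions by hand, comparing only the minor number can mislead (for example, torch 2.0 has a minor number of 0, which is less than 12). A minimal sketch of comparing (major, minor) tuples instead follows; `parse_major_minor` is a hypothetical helper for illustration, not something torch provides.

```python
def parse_major_minor(version: str) -> tuple:
    """Turns a version string like '1.13.0.dev20220620+cu113' into (1, 13)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

print(parse_major_minor("1.13.0.dev20220620+cu113"))  # (1, 13)
print(parse_major_minor("2.0.1") >= (1, 12))           # True
print(parse_major_minor("1.11.0") >= (1, 12))          # False
```

Python compares tuples element by element, so (2, 0) is correctly treated as newer than (1, 12).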

In [2]:

# Continue with regular imports
import matplotlib.pyplot as plt
import torch
import torchvision

from torch import nn
from torchvision import transforms

# Try to get torchinfo, install it if it doesn't work
try:
    from torchinfo import summary
except:
    print("[INFO] Couldn't find torchinfo... installing it.")
    !pip install -q torchinfo
    from torchinfo import summary

# Try to import the going_modular directory, download it from GitHub if it doesn't work
try:
    from going_modular.going_modular import data_setup, engine
except:
    # Get the going_modular scripts
    print("[INFO] Couldn't find going_modular scripts... downloading them from GitHub.")
    !git clone https://github.com/mrdbourke/pytorch-deep-learning
    !mv pytorch-deep-learning/going_modular .
    !rm -rf pytorch-deep-learning
    from going_modular.going_modular import data_setup, engine

Now let's set up device-agnostic code.

Note: If you're using Google Colab and you don't have a GPU turned on yet, now is the time to turn one on via Runtime -> Change runtime type -> Hardware accelerator -> GPU.

In [3]:

device = "cuda" if torch.cuda.is_available() else "cpu"
device

Out[3]:

'cuda'

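Create a helper function to set seeds¶

Since we've been setting random seeds a whole bunch throughout previous sections, how about we functionize it?

Let's create a function to "set the seeds" called set_seeds().

Note: Recall that a random seed is a starting value that determines the sequence of pseudo-random numbers a computer generates. Seeds aren't always necessary when running machine learning code, however, they help with reproducibility (the numbers I get with my code are similar to the numbers you get with your code). Outside of an educational or experimental setting, random seeds generally aren't required.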
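
To make the note above concrete, here is a small sketch (assuming only that torch is installed) showing that re-seeding reproduces the same values, while continuing without re-seeding gives new ones. The variable names are illustrative.

```python
import torch

# Same seed -> same "random" numbers
torch.manual_seed(42)
first = torch.rand(3)

torch.manual_seed(42)
second = torch.rand(3)

# Without re-seeding, the generator simply keeps advancing
third = torch.rand(3)

print(torch.equal(first, second))  # True
print(torch.equal(second, third))
```

The second comparison should print False, since `third` is drawn from a later point in the same sequence.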

In [4]:

# Set seeds
def set_seeds(seed: int=42):
    """Sets random seeds for torch operations.

    Args:
        seed (int, optional): Random seed to set. Defaults to 42.
    """
    # Set the seed for general torch operations
    torch.manual_seed(seed)
    # Set the seed for CUDA torch operations (ones that happen on the GPU)
    torch.cuda.manual_seed(seed)

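1. Get data¶

As always, before we can run machine learning experiments, we'll need a dataset.

We're going to continue trying to improve upon the results we've been getting on FoodVision Mini.

In the previous section, 06. PyTorch Transfer Learning, we saw how powerful a pretrained model and transfer learning can be when classifying images of pizza, steak and sushi.

So how about we run some experiments and try to further improve our results?

To do so, we'll use similar code to the previous section to download pizza_steak_sushi.zip (if the data doesn't already exist), except this time it's been functionized.

This will allow us to use it again later.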

In [5]:

import os
import zipfile

from pathlib import Path

import requests

def download_data(source: str, 
                  destination: str,
                  remove_source: bool = True) -> Path:
    """Downloads a zipped dataset from source and unzips to destination.

    Args:
        source (str): A link to a zipped file containing data.
        destination (str): A target directory to unzip data to.
        remove_source (bool): Whether to remove the source after downloading and extracting.
    
    Returns:
        pathlib.Path to downloaded data.
    
    Example usage:
        download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                      destination="pizza_steak_sushi")
    """
    # Setup path to data folder
    data_path = Path("data/")
    image_path = data_path / destination

    # If the image folder doesn't exist, download it and prepare it... 
    if image_path.is_dir():
        print(f"[INFO] {image_path} directory exists, skipping download.")
    else:
        print(f"[INFO] Did not find {image_path} directory, creating one...")
        image_path.mkdir(parents=True, exist_ok=True)
        
        # Download pizza, steak, sushi data
        target_file = Path(source).name
        with open(data_path / target_file, "wb") as f:
            request = requests.get(source)
            print(f"[INFO] Downloading {target_file} from {source}...")
            f.write(request.content)

        # Unzip pizza, steak, sushi data
        with zipfile.ZipFile(data_path / target_file, "r") as zip_ref:
            print(f"[INFO] Unzipping {target_file} data...") 
            zip_ref.extractall(image_path)

        # Remove .zip file
        if remove_source:
            os.remove(data_path / target_file)
    
    return image_path

image_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                           destination="pizza_steak_sushi")
image_path

[INFO] data/pizza_steak_sushi directory exists, skipping download.

Out[5]:

PosixPath('data/pizza_steak_sushi')

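Excellent! Looks like we've got our pizza, steak and sushi images in standard image classification format, ready to go.

2. Create Datasets and DataLoaders¶

Now we've got some data, let's turn it into PyTorch DataLoaders.

We can do so using the create_dataloaders() function we created in 05. PyTorch Going Modular section 2.

And since we'll be using transfer learning with pretrained models from torchvision.models, we'll create a transform to prepare our images correctly.

To transform our images into tensors, we can use:

1. Manually created transforms using torchvision.transforms.
2. Automatically created transforms using torchvision.models.MODEL_NAME.MODEL_WEIGHTS.DEFAULT.transforms().
   - Where MODEL_NAME is a specific torchvision.models architecture, MODEL_WEIGHTS is a specific set of pretrained weights and DEFAULT means the "best available weights".

We saw an example of each of these in 06. PyTorch Transfer Learning section 2.

Let's first see an example of manually creating a torchvision.transforms pipeline (creating a transforms pipeline this way gives the most customization but can result in worse performance if the transforms don't match those the pretrained model was trained with).

The main manual transformation we need to be sure of is that all of our images are normalized in ImageNet format (this is because the pretrained torchvision.models we're using were pretrained on ImageNet).

We can do this with: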

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
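
As a worked example of what this normalization computes, the sketch below applies (value - mean) / std to a single RGB pixel in plain Python. The helper `normalize_pixel` is made up for illustration; transforms.Normalize applies the same per-channel arithmetic to whole image tensors.

```python
# ImageNet per-channel statistics (red, green, blue), as used above
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

def normalize_pixel(pixel):
    """Applies (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]

# A pixel whose channels equal the ImageNet means maps to zeros
print(normalize_pixel([0.485, 0.456, 0.406]))  # [0.0, 0.0, 0.0]

# A pure white pixel lands a little over two standard deviations above the mean
print([round(v, 3) for v in normalize_pixel([1.0, 1.0, 1.0])])  # [2.249, 2.429, 2.64]
```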

2.1 Create DataLoaders using manually created transforms¶

In [6]:

# Setup directories
train_dir = image_path / "train"
test_dir = image_path / "test"

# Setup ImageNet normalization levels (turns all images into similar distribution as ImageNet)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Create transform pipeline manually
manual_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    normalize
])           
print(f"Manually created transforms: {manual_transforms}")

# Create data loaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=manual_transforms, # use manually created transforms
    batch_size=32
)

train_dataloader, test_dataloader, class_names

Manually created transforms: Compose(
    Resize(size=(224, 224), interpolation=bilinear, max_size=None, antialias=None)
    ToTensor()
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)

Out[6]:

(<torch.utils.data.dataloader.DataLoader at 0x7febf1d218e0>,
 <torch.utils.data.dataloader.DataLoader at 0x7febf1d216a0>,
 ['pizza', 'steak', 'sushi'])

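2.2 Create DataLoaders using automatically created transforms¶

Data transformed and DataLoaders created!

Let's now see what the same transformation pipeline looks like when using automatic transforms.

We can do this by first instantiating a set of pretrained weights (for example weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT) and then calling the transforms() method on it.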

In [7]:

# Setup dirs
train_dir = image_path / "train"
test_dir = image_path / "test"

# Setup pretrained weights (plenty of these available in torchvision.models)
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT

# Get transforms from weights (these are the transforms that were used to obtain the weights)
automatic_transforms = weights.transforms() 
print(f"Automatically created transforms: {automatic_transforms}")

# Create data loaders
train_dataloader, test_dataloader, class_names = data_setup.create_dataloaders(
    train_dir=train_dir,
    test_dir=test_dir,
    transform=automatic_transforms, # use automatic created transforms
    batch_size=32
)

train_dataloader, test_dataloader, class_names

Automatically created transforms: ImageClassification(
    crop_size=[224]
    resize_size=[256]
    mean=[0.485, 0.456, 0.406]
    std=[0.229, 0.224, 0.225]
    interpolation=InterpolationMode.BICUBIC
)

Out[7]:

(<torch.utils.data.dataloader.DataLoader at 0x7febf1d213a0>,
 <torch.utils.data.dataloader.DataLoader at 0x7febf1d21490>,
 ['pizza', 'steak', 'sushi'])
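
Note the difference between the two pipelines: the manual Resize((224, 224)) squashes every image to a square, while the automatic transforms printed above use resize_size=[256] followed by crop_size=[224]. The sketch below works through that second geometry in plain Python, under the assumption that the shorter side is scaled to 256 with the aspect ratio kept and a 224 x 224 square is then cut from the center. `resize_then_center_crop` is a hypothetical helper, and torchvision's exact rounding may differ by a pixel.

```python
def resize_then_center_crop(width, height, resize_size=256, crop_size=224):
    """Returns the resized (width, height) and the (left, top) corner of the crop box.

    Assumes the shorter side is scaled to `resize_size` with the aspect ratio kept,
    then a `crop_size` x `crop_size` square is taken from the center.
    """
    short_side = min(width, height)
    new_width = round(width * resize_size / short_side)
    new_height = round(height * resize_size / short_side)
    left = (new_width - crop_size) // 2
    top = (new_height - crop_size) // 2
    return (new_width, new_height), (left, top)

# A 512 x 384 image becomes 341 x 256, then the crop starts at (58, 16)
print(resize_then_center_crop(512, 384))  # ((341, 256), (58, 16))
```

Keeping the aspect ratio avoids distorting the food in the image, at the cost of discarding some pixels near the edges.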

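3. Getting a pretrained model, freezing the base layers and changing the classifier head¶

Before we run and track multiple modelling experiments, let's see what it's like to run and track a single one.

And since our data is ready, the next thing we'll need is a model.

Let's download the pretrained weights for a torchvision.models.efficientnet_b0() model and prepare it for use with our own data.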

In [8]:

# Note: This is how a pretrained model would be created in torchvision < 0.13; the pretrained argument is deprecated from 0.13 onwards.
# model = torchvision.models.efficientnet_b0(pretrained=True).to(device) # OLD 

# Download the pretrained weights for EfficientNet_B0
weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT # NEW in torchvision 0.13, "DEFAULT" means "best weights available"

# Setup the model with the pretrained weights and send it to the target device
model = torchvision.models.efficientnet_b0(weights=weights).to(device)

# View the output of the model
# model

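Wonderful!

Now we've got a pretrained model, let's turn it into a feature extractor model.

In essence, we'll freeze the base layers of the model (we'll use these to extract features from our input images) and we'll change the classifier head (output layer) to suit the number of classes we're working with (we've got 3 classes: pizza, steak, sushi).

Note: The idea of creating a feature extractor model (what we're doing here) was covered in more depth in 06. PyTorch Transfer Learning section 3.2: Setting up a pretrained model.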

In [9]:

# Freeze all base layers by setting requires_grad attribute to False
for param in model.features.parameters():
    param.requires_grad = False
    
# Since we're creating a new layer with random weights (torch.nn.Linear), 
# let's set the seeds
set_seeds() 

# Update the classifier head to suit our problem
model.classifier = torch.nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features=1280, 
              out_features=len(class_names),
              bias=True).to(device))

Base layers frozen, classifier head changed, let's get a summary of our model with torchinfo.summary().

In [10]:

from torchinfo import summary

# # Get a summary of the model (uncomment for full output)
# summary(model, 
#         input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape" (batch_size, color_channels, height, width)
#         verbose=0,
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"]
# )

Output of torchinfo.summary() with our feature extractor EffNetB0 model. Notice how the base layers are frozen (not trainable) and the output layers are customized to our own problem.
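
If you'd like a quick check without torchinfo, one option is to count trainable and frozen parameters directly. The sketch below uses a tiny toy model as a stand-in for EfficientNet-B0 (an assumption made so it runs quickly); the same two sum() lines should work on any torch.nn.Module.

```python
import torch
from torch import nn

# Toy stand-in for a pretrained model: a "features" block and a "classifier" head
toy_model = nn.Sequential()
toy_model.add_module("features", nn.Linear(in_features=8, out_features=4))
toy_model.add_module("classifier", nn.Linear(in_features=4, out_features=3))

# Freeze the base layers, in the same way as for model.features above
for param in toy_model.features.parameters():
    param.requires_grad = False

# Count parameters by whether they will receive gradient updates
trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in toy_model.parameters() if not p.requires_grad)

print(f"Trainable parameters: {trainable}")  # 15 (4*3 weights + 3 biases)
print(f"Frozen parameters: {frozen}")        # 36 (8*4 weights + 4 biases)
```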

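4. Train model and track results¶

Model ready to go!

Let's get ready to train it by creating a loss function and an optimizer.

Since we're working on a multi-class classification problem, we'll use torch.nn.CrossEntropyLoss() as the loss function.

And we'll stick with torch.optim.Adam() with a learning rate of 0.001 as our optimizer.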

In [11]:

# Define loss and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

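Adjust `train()` function to track results with `SummaryWriter()`¶

Beautiful!

All of the pieces of our training code are starting to come together.

Let's now add the final piece to track our experiments.

Previously, we've tracked our modelling experiments using multiple Python dictionaries (one for each model).

But you can imagine this could get out of hand if we were running anything more than a few experiments.

Not to worry, there's a better option!

We can use PyTorch's torch.utils.tensorboard.SummaryWriter() class to save various parts of our model's training progress to file.

By default, the SummaryWriter() class saves information about our model to files in the directory set by the log_dir parameter.

The default location for log_dir is runs/CURRENT_DATETIME_HOSTNAME, where HOSTNAME is the name of your computer.

But of course, you can change where your experiments are tracked (the directory name is as customizable as you'd like).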
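
To illustrate the naming pattern, the sketch below builds a path in the default runs/CURRENT_DATETIME_HOSTNAME style and then a more descriptive custom one, using only the standard library. The helper `make_log_dir` and the experiment and model names are hypothetical; the date format is my reading of the example path style (such as Jun21_00-46-03), not something taken from the PyTorch source.

```python
import socket
from datetime import datetime
from pathlib import Path

# Mirrors the default pattern described above: runs/CURRENT_DATETIME_HOSTNAME
timestamp = datetime.now().strftime("%b%d_%H-%M-%S")
default_style_dir = Path("runs") / f"{timestamp}_{socket.gethostname()}"

def make_log_dir(experiment_name: str, model_name: str, extra: str = "") -> Path:
    """Builds a descriptive log directory: runs/YYYY-MM-DD/experiment/model[/extra]."""
    date = datetime.now().strftime("%Y-%m-%d")
    log_dir = Path("runs") / date / experiment_name / model_name
    return log_dir / extra if extra else log_dir

custom_dir = make_log_dir("data_10_percent", "effnetb0", "5_epochs")
print(default_style_dir)
print(custom_dir)
```

A path like this could then be passed to the writer with SummaryWriter(log_dir=str(custom_dir)), so that each experiment's logs end up in a directory whose name says what was run.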

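The outputs of SummaryWriter() are saved in TensorBoard format.

TensorBoard is part of the TensorFlow deep learning library and is an excellent way to visualize different parts of your model.

To start tracking our modelling experiments, let's create a default SummaryWriter() instance.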

In [12]:

from torch.utils.tensorboard import SummaryWriter

# Create a writer with all default settings
writer = SummaryWriter()

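Now to use the writer, we could write a new training loop or we could adjust the existing train() function we created in 05. PyTorch Going Modular section 4.

Let's take the latter option.

We'll get the train() function from engine.py and adjust it to use writer.

Specifically, we'll add the ability for our train() function to log our model's training and test loss and accuracy values.

We can do this with writer.add_scalars(main_tag, tag_scalar_dict), where:

- main_tag (string) - the name for the scalars being tracked (e.g. "Loss")
- tag_scalar_dict (dict) - a dictionary of the values being tracked (e.g. {"train_loss": 0.3454})
  - Note: The method is called add_scalars() because our loss and accuracy values are generally scalars (single values).

Once we've finished tracking values, we'll call writer.close() to tell the writer to stop looking for values to track.

To start modifying train(), we'll also import train_step() and test_step() from engine.py.

Note: You can track information about your model almost anywhere in your code. But quite often experiments are tracked while a model is training (inside a training/testing loop).

The torch.utils.tensorboard.SummaryWriter() class also has many other methods to track different things about your model/data, such as add_graph(), which tracks the computation graph of your model. For more options, check the SummaryWriter() documentation.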

In [13]:

from typing import Dict, List
from tqdm.auto import tqdm

from going_modular.going_modular.engine import train_step, test_step

# Import train() function from: 
# https://github.com/mrdbourke/pytorch-deep-learning/blob/main/going_modular/going_modular/engine.py
def train(model: torch.nn.Module, 
          train_dataloader: torch.utils.data.DataLoader, 
          test_dataloader: torch.utils.data.DataLoader, 
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device) -> Dict[str, List]:
    """Trains and tests a PyTorch model.

    Passes a target PyTorch models through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.

    Calculates, prints and stores evaluation metrics throughout.

    Args:
      model: A PyTorch model to be trained and tested.
      train_dataloader: A DataLoader instance for the model to be trained on.
      test_dataloader: A DataLoader instance for the model to be tested on.
      optimizer: A PyTorch optimizer to help minimize the loss function.
      loss_fn: A PyTorch loss function to calculate loss on both datasets.
      epochs: An integer indicating how many epochs to train for.
      device: A target device to compute on (e.g. "cuda" or "cpu").
      
    Returns:
      A dictionary of training and testing loss as well as training and
      testing accuracy metrics. Each metric has a value in a list for 
      each epoch.
      In the form: {train_loss: [...],
                train_acc: [...],
                test_loss: [...],
                test_acc: [...]} 
      For example if training for epochs=2: 
              {train_loss: [2.0616, 1.0537],
                train_acc: [0.3945, 0.3945],
                test_loss: [1.2641, 1.5706],
                test_acc: [0.3400, 0.2973]} 
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
    }

    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                           dataloader=train_dataloader,
                                           loss_fn=loss_fn,
                                           optimizer=optimizer,
                                           device=device)
        test_loss, test_acc = test_step(model=model,
                                        dataloader=test_dataloader,
                                        loss_fn=loss_fn,
                                        device=device)

        # Print out what's happening
        print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
        )

        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)

        ### New: Experiment tracking ###
        # Add loss results to SummaryWriter
        writer.add_scalars(main_tag="Loss", 
                           tag_scalar_dict={"train_loss": train_loss,
                                            "test_loss": test_loss},
                           global_step=epoch)

        # Add accuracy results to SummaryWriter
        writer.add_scalars(main_tag="Accuracy", 
                           tag_scalar_dict={"train_acc": train_acc,
                                            "test_acc": test_acc}, 
                           global_step=epoch)
        
        # Track the PyTorch model architecture
        writer.add_graph(model=model, 
                         # Pass in an example input
                         input_to_model=torch.randn(32, 3, 224, 224).to(device))
    
    # Close the writer
    writer.close()
    
    ### End new ###

    # Return the filled results at the end of the epochs
    return results

Woohoo!

Our train() function is now updated to use a SummaryWriter() instance to track our model's results.

How about we try it out for 5 epochs?

In [14]:

# Train model
# Note: Not using engine.train() since the original script isn't updated to use writer
set_seeds()
results = train(model=model,
                train_dataloader=train_dataloader,
                test_dataloader=test_dataloader,
                optimizer=optimizer,
                loss_fn=loss_fn,
                epochs=5,
                device=device)

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.0924 | train_acc: 0.3984 | test_loss: 0.9133 | test_acc: 0.5398
Epoch: 2 | train_loss: 0.8975 | train_acc: 0.6562 | test_loss: 0.7838 | test_acc: 0.8561
Epoch: 3 | train_loss: 0.8037 | train_acc: 0.7461 | test_loss: 0.6723 | test_acc: 0.8864
Epoch: 4 | train_loss: 0.6769 | train_acc: 0.8516 | test_loss: 0.6698 | test_acc: 0.8049
Epoch: 5 | train_loss: 0.7065 | train_acc: 0.7188 | test_loss: 0.6746 | test_acc: 0.7737

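Note: You might notice the results here are slightly different to what our model got in 06. PyTorch Transfer Learning. The difference comes from using engine.train() versus our modified train() function. Can you guess why? The PyTorch documentation on randomness may help.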
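
A hint, offered as one plausible explanation rather than a definitive answer: the modified train() makes an extra random call every epoch (torch.randn() creates the example input for add_graph()), and any extra draw from a seeded generator shifts every number drawn after it. The sketch below illustrates the idea with Python's built-in random module as a stand-in for torch's generator; `simulate_run` is a hypothetical helper.

```python
import random

def simulate_run(extra_random_call: bool) -> list:
    """Draws one "shuffle order" per epoch, optionally consuming extra random numbers."""
    random.seed(42)
    orders = []
    for _ in range(3):
        # Stands in for the data shuffling that happens each epoch
        orders.append(random.sample(range(10), k=10))
        if extra_random_call:
            # Stands in for an extra call such as torch.randn()
            random.random()
    return orders

baseline = simulate_run(extra_random_call=False)
modified = simulate_run(extra_random_call=True)

print(baseline[0] == modified[0])  # True: nothing extra has been drawn yet
print(baseline[1] == modified[1])
```

The first epoch matches because both runs start from the same seed; from the second epoch on, the extra draw has moved the generator along, so the second comparison should print False.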

运行上面的单元格，我们得到了与06. PyTorch迁移学习第4节：训练模型相似的输出，但不同之处在于我们的 writer 实例已经创建了一个 runs/ 目录，用于存储我们模型的结果。

例如，保存位置可能看起来像：

runs/Jun21_00-46-03_daniels_macbook_pro

其中默认格式是 runs/CURRENT_DATETIME_HOSTNAME。
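To make the default naming concrete, here is a minimal sketch that rebuilds the same style of path by hand. It assumes the pattern is runs/MonthDay_Hour-Minute-Second_hostname, as in the example above. The function name default_style_log_dir is hypothetical and not part of PyTorch; it only builds a string and does not create a writer.

```python
import os
import socket
from datetime import datetime
from typing import Optional


def default_style_log_dir(now: Optional[datetime] = None,
                          hostname: Optional[str] = None) -> str:
    """Builds a path in the style runs/CURRENT_DATETIME_HOSTNAME (hypothetical helper)."""
    now = now or datetime.now()
    hostname = hostname or socket.gethostname()
    # e.g. "Jun21_00-46-03" for 21 June, 00:46:03
    current_time = now.strftime("%b%d_%H-%M-%S")
    return os.path.join("runs", f"{current_time}_{hostname}")


# Rebuild the example path from a fixed datetime and hostname
print(default_style_log_dir(now=datetime(2022, 6, 21, 0, 46, 3),
                            hostname="daniels_macbook_pro"))
```

Because the datetime and hostname are part of the name, two runs started at different times (or on different machines) end up in different directories.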

We'll take a look at these in a moment, but as a reminder, we were previously tracking our model's results in a dictionary.

In [15]:

# Check out the model results
results

Out[15]:

{'train_loss': [1.0923754647374153,
  0.8974628075957298,
  0.803724929690361,
  0.6769256368279457,
  0.7064960040152073],
 'train_acc': [0.3984375, 0.65625, 0.74609375, 0.8515625, 0.71875],
 'test_loss': [0.9132757981618246,
  0.7837507526079813,
  0.6722926497459412,
  0.6698453426361084,
  0.6746167540550232],
 'test_acc': [0.5397727272727273,
  0.8560606060606061,
  0.8863636363636364,
  0.8049242424242425,
  0.7736742424242425]}

Hmm, we could format this into a nice plot, but can you imagine keeping track of a whole bunch of these dictionaries?

There has to be a better way...
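As a point of comparison, here is a rough sketch of what manual tracking looks like: a small helper (the name best_epoch is hypothetical) that picks the best epoch out of one results dictionary. The values are the ones from the output above, rounded to four decimal places.

```python
from typing import Dict, List, Tuple

# Results from the 5-epoch run above, rounded to 4 decimal places
results = {
    "train_loss": [1.0924, 0.8975, 0.8037, 0.6769, 0.7065],
    "train_acc": [0.3984, 0.6562, 0.7461, 0.8516, 0.7188],
    "test_loss": [0.9133, 0.7838, 0.6723, 0.6698, 0.6746],
    "test_acc": [0.5398, 0.8561, 0.8864, 0.8049, 0.7737],
}


def best_epoch(results: Dict[str, List[float]],
               metric: str = "test_acc",
               higher_is_better: bool = True) -> Tuple[int, float]:
    """Returns the (1-indexed epoch, value) with the best value for a metric."""
    values = results[metric]
    pick = max if higher_is_better else min
    best_index = pick(range(len(values)), key=lambda i: values[i])
    return best_index + 1, values[best_index]


print(best_epoch(results, metric="test_acc"))  # (3, 0.8864)
print(best_epoch(results, metric="test_loss", higher_is_better=False))  # (4, 0.6698)
```

This works for a single run, but every new experiment would need its own dictionary and its own bookkeeping, which is the part that gets out of hand.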

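5. View our model's results in TensorBoard¶

The SummaryWriter() class stores our model's results in TensorBoard format, by default in a directory called runs/.

TensorBoard is a visualization program created by the TensorFlow team to view and inspect information about models and data.

You know what that means?

It's time to follow the data visualizer's motto and visualize, visualize, visualize!

You can view TensorBoard in a number of ways:

Code environment	How to view TensorBoard	Resource
VS Code (notebooks or Python scripts)	Press `SHIFT + CMD + P` (`SHIFT + CTRL + P` on Windows/Linux) to open the Command Palette and search for the command "Python: Launch TensorBoard".	VS Code guide: TensorBoard and PyTorch
Jupyter and Colab notebooks	Make sure TensorBoard is installed, load it with `%load_ext tensorboard`, then view your results with `%tensorboard --logdir DIR_WITH_LOGS`.	`torch.utils.tensorboard` and Get started with TensorBoard

You can also upload your experiments to tensorboard.dev to share them publicly with others.

Running the following code in a Google Colab or Jupyter Notebook will start an interactive TensorBoard session to view the TensorBoard files in the runs/ directory.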

# Load the TensorBoard notebook extension (line magic)
%load_ext tensorboard
# Start a TensorBoard session that reads from the "runs/" directory
%tensorboard --logdir runs

In [16]:

# Example code to run in Jupyter or Google Colab Notebook (uncomment to try it out)
# %load_ext tensorboard
# %tensorboard --logdir runs

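If all went correctly, you should see something like the following:

Viewing a single modelling experiment's results for accuracy and loss in TensorBoard.

Note: For more information on running TensorBoard in notebooks or in other locations, see the following:

Using TensorBoard in Notebooks guide by TensorFlow

Get started with TensorBoard.dev (helpful for uploading your TensorBoard logs to a shareable link)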

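6. Create a helper function to build `SummaryWriter()` instances¶

The SummaryWriter() class logs various information to a directory specified by the log_dir parameter.

How about we make a helper function that creates a custom directory per experiment?

In essence, each experiment gets its own logs directory.

For example, say we'd like to track things like:

Experiment date/timestamp - when did the experiment take place?
Experiment name - is there something we'd like to call the experiment?
Model name - what model was used?
Extra - should anything else be tracked?

You could track almost anything here and be as creative as you want, but these should be enough to start.

Let's create a helper function called create_writer() that produces a SummaryWriter() instance that logs to a custom log_dir.

Ideally, we'd like the log_dir to be something like:

runs/YYYY-MM-DD/experiment_name/model_name/extra

Where YYYY-MM-DD is the date the experiment was run (you could add the time too if you wanted).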

In [17]:

def create_writer(experiment_name: str, 
                  model_name: str, 
                  extra: str=None) -> torch.utils.tensorboard.writer.SummaryWriter:
    """Creates a torch.utils.tensorboard.writer.SummaryWriter() instance saving to a specific log_dir.

    log_dir is a combination of runs/timestamp/experiment_name/model_name/extra.

    Where timestamp is the current date in YYYY-MM-DD format.

    Args:
        experiment_name (str): Name of experiment.
        model_name (str): Name of model.
        extra (str, optional): Anything extra to add to the directory. Defaults to None.

    Returns:
        torch.utils.tensorboard.writer.SummaryWriter(): Instance of a writer saving to log_dir.

    Example usage:
        # Create a writer saving to "runs/2022-06-04/data_10_percent/effnetb2/5_epochs/"
        writer = create_writer(experiment_name="data_10_percent",
                               model_name="effnetb2",
                               extra="5_epochs")
        # The above is the same as:
        writer = SummaryWriter(log_dir="runs/2022-06-04/data_10_percent/effnetb2/5_epochs/")
    """
    from datetime import datetime
    import os

    # Get timestamp of current date (all experiments on certain day live in same folder)
    timestamp = datetime.now().strftime("%Y-%m-%d") # returns current date in YYYY-MM-DD format

    if extra:
        # Create log directory path
        log_dir = os.path.join("runs", timestamp, experiment_name, model_name, extra)
    else:
        log_dir = os.path.join("runs", timestamp, experiment_name, model_name)
        
    print(f"[INFO] Created SummaryWriter, saving to: {log_dir}...")
    return SummaryWriter(log_dir=log_dir)

Beautiful!

Now that we've got a create_writer() function, let's try it out.

In [18]:

# Create an example writer
example_writer = create_writer(experiment_name="data_10_percent",
                               model_name="effnetb0",
                               extra="5_epochs")

[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_10_percent/effnetb0/5_epochs...

Looking good, now we've got a way to log and trace back our various experiments.
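With a date-only timestamp, two runs of the same experiment on the same day share one log_dir. The sketch below shows one possible way to add the time to the timestamp. The helper name build_log_dir and its include_time and now parameters are hypothetical; it only builds the path string and does not create a SummaryWriter().

```python
import os
from datetime import datetime
from typing import Optional


def build_log_dir(experiment_name: str,
                  model_name: str,
                  extra: Optional[str] = None,
                  include_time: bool = False,
                  now: Optional[datetime] = None) -> str:
    """Builds runs/timestamp/experiment_name/model_name[/extra] (hypothetical helper)."""
    now = now or datetime.now()
    # Date only by default, date plus time when include_time=True
    fmt = "%Y-%m-%d_%H-%M-%S" if include_time else "%Y-%m-%d"
    parts = ["runs", now.strftime(fmt), experiment_name, model_name]
    if extra:
        parts.append(extra)
    return os.path.join(*parts)


example_time = datetime(2022, 6, 23, 14, 30, 5)
print(build_log_dir("data_10_percent", "effnetb0", "5_epochs", now=example_time))
print(build_log_dir("data_10_percent", "effnetb0", "5_epochs",
                    include_time=True, now=example_time))
```

The returned string could then be passed along as SummaryWriter(log_dir=...). Whether the extra granularity is worth it depends on how often you re-run the same configuration in a day.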

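6.1 Update the `train()` function to include a `writer` parameter¶

Our create_writer() function works fantastically.

How about we give our train() function the ability to take in a writer parameter, so we can actively update the SummaryWriter() instance we're using each time we call train()?

For example, say we're running a series of experiments, calling train() multiple times for multiple different models. It would be good if each experiment used a different writer.

One writer per experiment = one logs directory per experiment.

To adjust the train() function, we'll add a writer parameter and then add some code to check whether there's a writer; if there is, we'll log our information there.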

In [19]:

from typing import Dict, List
from tqdm.auto import tqdm

# Add writer parameter to train()
def train(model: torch.nn.Module, 
          train_dataloader: torch.utils.data.DataLoader, 
          test_dataloader: torch.utils.data.DataLoader, 
          optimizer: torch.optim.Optimizer,
          loss_fn: torch.nn.Module,
          epochs: int,
          device: torch.device, 
          writer: torch.utils.tensorboard.writer.SummaryWriter # new parameter to take in a writer
          ) -> Dict[str, List]:
    """Trains and tests a PyTorch model.

    Passes a target PyTorch model through train_step() and test_step()
    functions for a number of epochs, training and testing the model
    in the same epoch loop.

    Calculates, prints and stores evaluation metrics throughout.

    Stores metrics to specified writer log_dir if present.

    Args:
      model: A PyTorch model to be trained and tested.
      train_dataloader: A DataLoader instance for the model to be trained on.
      test_dataloader: A DataLoader instance for the model to be tested on.
      optimizer: A PyTorch optimizer to help minimize the loss function.
      loss_fn: A PyTorch loss function to calculate loss on both datasets.
      epochs: An integer indicating how many epochs to train for.
      device: A target device to compute on (e.g. "cuda" or "cpu").
      writer: A SummaryWriter() instance to log model results to.

    Returns:
      A dictionary of training and testing loss as well as training and
      testing accuracy metrics. Each metric has a value in a list for 
      each epoch.
      In the form: {train_loss: [...],
                train_acc: [...],
                test_loss: [...],
                test_acc: [...]} 
      For example if training for epochs=2: 
              {train_loss: [2.0616, 1.0537],
                train_acc: [0.3945, 0.3945],
                test_loss: [1.2641, 1.5706],
                test_acc: [0.3400, 0.2973]} 
    """
    # Create empty results dictionary
    results = {"train_loss": [],
               "train_acc": [],
               "test_loss": [],
               "test_acc": []
    }

    # Loop through training and testing steps for a number of epochs
    for epoch in tqdm(range(epochs)):
        train_loss, train_acc = train_step(model=model,
                                          dataloader=train_dataloader,
                                          loss_fn=loss_fn,
                                          optimizer=optimizer,
                                          device=device)
        test_loss, test_acc = test_step(model=model,
          dataloader=test_dataloader,
          loss_fn=loss_fn,
          device=device)

        # Print out what's happening
        print(
          f"Epoch: {epoch+1} | "
          f"train_loss: {train_loss:.4f} | "
          f"train_acc: {train_acc:.4f} | "
          f"test_loss: {test_loss:.4f} | "
          f"test_acc: {test_acc:.4f}"
        )

        # Update results dictionary
        results["train_loss"].append(train_loss)
        results["train_acc"].append(train_acc)
        results["test_loss"].append(test_loss)
        results["test_acc"].append(test_acc)


        ### New: Use the writer parameter to track experiments ###
        # See if there's a writer, if so, log to it
        if writer:
            # Add results to SummaryWriter
            writer.add_scalars(main_tag="Loss",
                               tag_scalar_dict={"train_loss": train_loss,
                                                "test_loss": test_loss},
                               global_step=epoch)
            writer.add_scalars(main_tag="Accuracy",
                               tag_scalar_dict={"train_acc": train_acc,
                                                "test_acc": test_acc},
                               global_step=epoch)

    # Close the writer once, after all epochs have been logged
    if writer:
        writer.close()
    ### End new ###

    # Return the filled results at the end of the epochs
    return results

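7. Setting up a series of modelling experiments¶

It's time to step things up a notch.

Previously we've been running various experiments and inspecting the results one by one.

But what if we could run multiple experiments and then inspect the results all together?

You in?

C'mon, let's go.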

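7.1 What kind of experiments should you run?¶

That's the million dollar question in machine learning.

Because there's really no limit to the experiments you can run.

Such freedom is why machine learning is so exciting and terrifying at the same time.

This is where you'll have to put on your scientist coat and remember the machine learning practitioner's motto: experiment, experiment, experiment!

Every hyperparameter stands as a starting point for a different experiment:

Change the number of epochs.
Change the number of layers/hidden units.
Change the amount of data.
Change the learning rate.
Try different kinds of data augmentation.
Choose a different model architecture.

With practice and running many different experiments, you'll start to build an intuition of what might help your model.

I say might on purpose because there are no guarantees.

But generally, in light of The Bitter Lesson (I've mentioned this twice now because it's an important essay in the world of AI), the bigger your model (more learnable parameters) and the more data you have (more opportunities to learn), the better the performance.

However, when you're first approaching a machine learning problem: start small and, if something works, scale it up.

Your first batch of experiments should take no longer than a few seconds to a few minutes to run.

The quicker you can experiment, the faster you can work out what doesn't work and, in turn, what does work.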

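7.2 What experiments are we going to run?¶

Our goal is to improve the model powering FoodVision Mini without it getting too big.

In essence, our ideal model achieves a high level of test set accuracy (90%+) but doesn't take too long to train or perform inference (make predictions).

We've got plenty of options, but how about we keep things simple?

Let's try a combination of:

A different amount of data (10% of pizza, steak, sushi images vs. 20%)
A different model (torchvision.models.efficientnet_b0 vs. torchvision.models.efficientnet_b2)
A different training time (5 epochs vs. 10 epochs)

Breaking these down we get:

Experiment number	Training dataset	Model (pretrained on ImageNet)	Number of epochs
1	Pizza, steak, sushi 10%	EfficientNetB0	5
2	Pizza, steak, sushi 10%	EfficientNetB2	5
3	Pizza, steak, sushi 10%	EfficientNetB0	10
4	Pizza, steak, sushi 10%	EfficientNetB2	10
5	Pizza, steak, sushi 20%	EfficientNetB0	5
6	Pizza, steak, sushi 20%	EfficientNetB2	5
7	Pizza, steak, sushi 20%	EfficientNetB0	10
8	Pizza, steak, sushi 20%	EfficientNetB2	10

Notice how we're slowly scaling things up.

Across the experiments we gradually increase the amount of data, the model size and the length of training.

By the end, experiment 8 will be using double the data, a model nearly double the size and double the length of training compared to experiment 1.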
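The eight rows of the table are every combination of two datasets, two epoch counts and two models. As a rough sketch (not the training loop itself), the same grid can be generated with itertools.product. The labels "data_10_percent", "effnetb0" and so on, and the dictionary keys, are illustrative names chosen for this example.

```python
from itertools import product

# The three options varied in the table above
datasets = ["data_10_percent", "data_20_percent"]
epoch_options = [5, 10]
model_names = ["effnetb0", "effnetb2"]

# product() varies the last argument fastest, which reproduces the
# table order: dataset -> number of epochs -> model
experiments = [
    {"number": number, "dataset": dataset, "model": model_name, "epochs": epochs}
    for number, (dataset, epochs, model_name) in enumerate(
        product(datasets, epoch_options, model_names), start=1
    )
]

for experiment in experiments:
    print(experiment)
```

Adding a third value to any of the lists grows the grid multiplicatively (for example, three models would give 2 x 2 x 3 = 12 experiments), which is a good reason to start with a small grid.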

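Note: I want to be clear that there truly is no limit to the number of experiments you can run. What we've designed here is only a very small subset of the options. However, you can't test everything, so it's best to try a few things to begin with and then follow up on the ones that work best.

And as a reminder, the datasets we're using are a subset of the Food101 dataset (3 classes, pizza, steak and sushi, instead of 101) and use only 10% and 20% of the images rather than 100%. If our experiments work, we could start to run more experiments on more data (though this will take longer to compute). You can see how the datasets were created via the 04_custom_data_creation.ipynb notebook.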

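7.3 Download different datasets¶

Before we start running our series of experiments, we need to make sure our datasets are ready.

We'll need two forms of a training set:

A training set with 10% of the data of Food101 pizza, steak, sushi images (we've already created this above, but we'll do it again for completeness).
A training set with 20% of the data of Food101 pizza, steak, sushi images.

For consistency, all experiments will use the same testing dataset (the one from the 10% data split).

We'll start by downloading the various datasets we need using the download_data() function we created earlier.

Both datasets are available from the course GitHub: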

In [20]:

# Download 10 percent and 20 percent training data (if necessary)
data_10_percent_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip",
                                     destination="pizza_steak_sushi")

data_20_percent_path = download_data(source="https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi_20_percent.zip",
                                     destination="pizza_steak_sushi_20_percent")

[INFO] data/pizza_steak_sushi directory exists, skipping download.
[INFO] data/pizza_steak_sushi_20_percent directory exists, skipping download.

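Data downloaded!

Now let's set up the filepaths to the data we'll be using for the different experiments.

We'll create different training directory paths, but we only need one testing directory path since all experiments will use the same test dataset (the test dataset from pizza, steak, sushi 10%).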

In [21]:

# Setup training directory paths
train_dir_10_percent = data_10_percent_path / "train"
train_dir_20_percent = data_20_percent_path / "train"

# Setup testing directory paths (note: use the same test dataset for both to compare the results)
test_dir = data_10_percent_path / "test"

# Check the directories
print(f"Training directory 10%: {train_dir_10_percent}")
print(f"Training directory 20%: {train_dir_20_percent}")
print(f"Testing directory: {test_dir}")

Training directory 10%: data/pizza_steak_sushi/train
Training directory 20%: data/pizza_steak_sushi_20_percent/train
Testing directory: data/pizza_steak_sushi/test

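7.4 Transform datasets and create DataLoaders¶

Next we'll create a series of transforms to prepare our images for our model(s).

To keep things consistent, we'll manually create a transform (just like we did above) and use the same transform across all of the datasets.

The transform will:

Resize all the images (we'll start with 224x224, but this could be changed).
Turn them into tensors with values between 0 and 1.
Normalize them so their distributions are in line with the ImageNet dataset (we do this because the models from torchvision.models have been pretrained on ImageNet).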
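To see what the normalization step does to the numbers, here is a small pure-Python sketch of the per-channel arithmetic, (value - mean) / std, applied to a single RGB pixel. It uses the ImageNet mean and standard deviation values from the transform in this section; the helper name normalize_pixel is hypothetical and only illustrates the calculation, it is not how torchvision implements it.

```python
from typing import List

# Per-channel ImageNet statistics in [red, green, blue] order
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb: List[float]) -> List[float]:
    """Applies (value - mean) / std to one RGB pixel with values in [0, 1]."""
    return [(value - mean) / std
            for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


# A pixel exactly at the ImageNet mean maps to 0 in every channel
print(normalize_pixel([0.485, 0.456, 0.406]))
# A pure white pixel ends up a little over 2 standard deviations above the mean
print(normalize_pixel([1.0, 1.0, 1.0]))
```

After normalization the values are no longer confined to the 0 to 1 range, which is expected: they are now expressed relative to the ImageNet statistics the pretrained models saw during training.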

In [22]:

from torchvision import transforms

# Create a transform to normalize data distribution to be inline with ImageNet
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], # values per colour channel [red, green, blue]
                                 std=[0.229, 0.224, 0.225]) # values per colour channel [red, green, blue]

# Compose transforms into a pipeline
simple_transform = transforms.Compose([
    transforms.Resize((224, 224)), # 1. Resize the images
    transforms.ToTensor(), # 2. Turn the images into tensors with values between 0 & 1
    normalize # 3. Normalize the images so their distributions match the ImageNet dataset 
])

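Transform ready!

Now let's create our DataLoaders using the create_dataloaders() function from the data_setup.py module we created in 05. PyTorch Going Modular section 2.

We'll create the DataLoaders with a batch size of 32.

For all of our experiments we'll be using the same test_dataloader (to keep comparisons consistent).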

In [23]:

BATCH_SIZE = 32

# Create 10% training and test DataLoaders
train_dataloader_10_percent, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir_10_percent,
    test_dir=test_dir, 
    transform=simple_transform,
    batch_size=BATCH_SIZE
)

# Create 20% training and test data DataLoaders
train_dataloader_20_percent, test_dataloader, class_names = data_setup.create_dataloaders(train_dir=train_dir_20_percent,
    test_dir=test_dir,
    transform=simple_transform,
    batch_size=BATCH_SIZE
)

# Find the number of samples/batches per dataloader (using the same test_dataloader for both experiments)
print(f"Number of batches of size {BATCH_SIZE} in 10 percent training data: {len(train_dataloader_10_percent)}")
print(f"Number of batches of size {BATCH_SIZE} in 20 percent training data: {len(train_dataloader_20_percent)}")
print(f"Number of batches of size {BATCH_SIZE} in testing data: {len(test_dataloader)} (all experiments will use the same test set)")
print(f"Number of classes: {len(class_names)}, class names: {class_names}")

Number of batches of size 32 in 10 percent training data: 8
Number of batches of size 32 in 20 percent training data: 15
Number of batches of size 32 in testing data: 3 (all experiments will use the same test set)
Number of classes: 3, class names: ['pizza', 'steak', 'sushi']
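The training batch counts follow from ceil(number of samples / batch size), assuming the DataLoaders keep the final partial batch (drop_last=False, the torch.utils.data.DataLoader default). A minimal sketch of that arithmetic is below. The sample counts 225 and 450 are assumptions for illustration (they aren't printed above), chosen to be consistent with the 8 and 15 training batches reported, and the helper name num_batches is hypothetical.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches produced for a dataset of num_samples items."""
    if drop_last:
        return num_samples // batch_size  # incomplete final batch is discarded
    return math.ceil(num_samples / batch_size)  # incomplete final batch is kept


BATCH_SIZE = 32
print(num_batches(225, BATCH_SIZE))  # assumed size of the 10% training set -> 8
print(num_batches(450, BATCH_SIZE))  # assumed size of the 20% training set -> 15
```

Doubling the data roughly doubles the number of batches per epoch, so each epoch on the 20% split takes about twice as many optimizer steps.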

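7.5 Create feature extractor models¶

Time to start building our models.

We're going to create two feature extractor models:

torchvision.models.efficientnet_b0() pretrained backbone + custom classifier head (EffNetB0 for short).
torchvision.models.efficientnet_b2() pretrained backbone + custom classifier head (EffNetB2 for short).

To do this, we'll freeze the base layers (the feature layers) and update the model's classifier heads (output layers) to suit our problem, just like we did in 06. PyTorch Transfer Learning section 3.4.

We saw in the previous chapter that the in_features parameter of the EffNetB0 classifier head is 1280 (the backbone turns the input image into a feature vector of size 1280).

Since EffNetB2 has a different number of layers and parameters, we'll need to adapt it accordingly.

Note: Whenever you use a different model, one of the first things you should inspect is the input and output shapes. That way you'll know how to prepare your input data/update the model to get the correct output shape.

We can find the input and output shapes of EffNetB2 using torchinfo.summary() and passing in the input_size=(32, 3, 224, 224) parameter ((32, 3, 224, 224) is equivalent to (batch_size, color_channels, height, width), i.e. we pass in an example of what a single batch of data would look like to our model).

Note: Many modern models can handle input images of varying sizes thanks to the torch.nn.AdaptiveAvgPool2d() layer, which pools feature maps of any spatial size down to a fixed output_size. You can try this out by passing different-sized input images to torchinfo.summary() or to your own models.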
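As a quick illustration of that behaviour, the sketch below passes two random feature maps of different spatial sizes through the same torch.nn.AdaptiveAvgPool2d() layer. The tensor shapes (8 channels, 7x7 and 9x9) are arbitrary values picked for the example and are not taken from EfficientNet.

```python
import torch
from torch import nn

# Pool any spatial size down to a fixed 1x1 output per channel
pool = nn.AdaptiveAvgPool2d(output_size=1)

# Two feature maps with the same channel count but different height/width
small_feature_map = torch.randn(2, 8, 7, 7)  # (batch_size, channels, height, width)
large_feature_map = torch.randn(2, 8, 9, 9)

# Both outputs have shape (2, 8, 1, 1), regardless of the input height/width
print(pool(small_feature_map).shape)
print(pool(large_feature_map).shape)
```

Because the pooled output has the same shape either way, the layers that follow (such as a classifier head) see a fixed-size input even when the image size changes.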

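To find out the required input shape to the final layer of EffNetB2, let's:

Create an instance of torchvision.models.efficientnet_b2() with pretrained weights (weights=torchvision.models.EfficientNet_B2_Weights.DEFAULT).
See the various input and output shapes by running torchinfo.summary().
Print out the number of in_features by inspecting the state_dict() of the classifier portion of EffNetB2 and printing the length of one row of the weight matrix.
- Note: You can also just inspect the output of effnetb2.classifier.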

In [24]:

import torchvision
from torchinfo import summary

# 1. Create an instance of EffNetB2 with pretrained weights
effnetb2_weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT # "DEFAULT" means best available weights
effnetb2 = torchvision.models.efficientnet_b2(weights=effnetb2_weights)

# # 2. Get a summary of standard EffNetB2 from torchvision.models (uncomment for full output)
# summary(model=effnetb2, 
#         input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
#         # col_names=["input_size"], # uncomment for smaller output
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"]
# ) 

# 3. Get the number of in_features of the EfficientNetB2 classifier layer
print(f"Number of in_features to final layer of EfficientNetB2: {len(effnetb2.classifier.state_dict()['1.weight'][0])}")

Number of in_features to final layer of EfficientNetB2: 1408

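Model summary of the EffNetB2 feature extractor model (output of torchinfo.summary()) with all layers unfrozen (trainable) and the default classifier head from ImageNet pretraining.

Now that we know the required number of in_features for the EffNetB2 model, let's create a couple of helper functions to set up our EffNetB0 and EffNetB2 feature extractor models.

We want these functions to:

Get the base model from torchvision.models
Freeze the base layers in the model (set requires_grad=False)
Set the random seeds (we don't need to do this, but since we're running a series of experiments and initializing a new layer with random weights, we want the randomness to be similar for each experiment)
Change the classifier head (to suit our problem)
Give the model a name (e.g. "effnetb0" for EffNetB0)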

In [25]:

import torchvision
from torch import nn

# Get num out features (one for each class pizza, steak, sushi)
OUT_FEATURES = len(class_names)

# Create an EffNetB0 feature extractor
def create_effnetb0():
    # 1. Get the base model with pretrained weights and send to target device
    weights = torchvision.models.EfficientNet_B0_Weights.DEFAULT
    model = torchvision.models.efficientnet_b0(weights=weights).to(device)

    # 2. Freeze the base model layers
    for param in model.features.parameters():
        param.requires_grad = False

    # 3. Set the seeds
    set_seeds()

    # 4. Change the classifier head
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.2),
        nn.Linear(in_features=1280, out_features=OUT_FEATURES)
    ).to(device)

    # 5. Give the model a name
    model.name = "effnetb0"
    print(f"[INFO] Created new {model.name} model.")
    return model

# Create an EffNetB2 feature extractor
def create_effnetb2():
    # 1. Get the base model with pretrained weights and send to target device
    weights = torchvision.models.EfficientNet_B2_Weights.DEFAULT
    model = torchvision.models.efficientnet_b2(weights=weights).to(device)

    # 2. Freeze the base model layers
    for param in model.features.parameters():
        param.requires_grad = False

    # 3. Set the seeds
    set_seeds()

    # 4. Change the classifier head
    model.classifier = nn.Sequential(
        nn.Dropout(p=0.3),
        nn.Linear(in_features=1408, out_features=OUT_FEATURES)
    ).to(device)

    # 5. Give the model a name
    model.name = "effnetb2"
    print(f"[INFO] Created new {model.name} model.")
    return model

Those functions are looking good!

Let's test them out by creating an instance of EffNetB0 and EffNetB2 and checking out their summary().

In [26]:

effnetb0 = create_effnetb0() 

# Get an output summary of the layers in our EffNetB0 feature extractor model (uncomment to view full output)
# summary(model=effnetb0, 
#         input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
#         # col_names=["input_size"], # uncomment for smaller output
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"]
# )

[INFO] Created new effnetb0 model.

Model summary of the EffNetB0 model (output of torchinfo.summary()) with base layers frozen (untrainable) and an updated classifier head (suited for pizza, steak, sushi image classification).

In [27]:

effnetb2 = create_effnetb2()

# Get an output summary of the layers in our EffNetB2 feature extractor model (uncomment to view full output)
# summary(model=effnetb2, 
#         input_size=(32, 3, 224, 224), # make sure this is "input_size", not "input_shape"
#         # col_names=["input_size"], # uncomment for smaller output
#         col_names=["input_size", "output_size", "num_params", "trainable"],
#         col_width=20,
#         row_settings=["var_names"]
# )

[INFO] Created new effnetb2 model.

Model summary of EffNetB2 with the base layers frozen (untrainable) and an updated classifier head (suited for pizza, steak, sushi image classification), as output by torchinfo.summary().

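Looking at the outputs of the summaries, the EffNetB2 backbone has nearly double the number of parameters of EffNetB0.

Model	Total parameters (before freezing/changing head)	Total parameters (after freezing/changing head)	Total trainable parameters (after freezing/changing head)
EfficientNetB0	5,288,548	4,011,391	3,843
EfficientNetB2	9,109,994	7,705,221	4,227

This gives the backbone of the EffNetB2 model more opportunities to form a representation of our pizza, steak and sushi data.

However, the trainable parameters for each model (the classifier heads) aren't very different.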
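The trainable parameter counts in the table can be checked by hand. The following is a minimal sketch, assuming each classifier head is a single fully connected layer with 3 outputs, fed by 1280 features for EffNetB0 and 1408 features for EffNetB2 (the torchvision defaults; these input sizes and the `linear_layer_params` helper are assumptions, not taken from the cells above).

```python
# Parameter count of a fully connected layer: one weight per (input, output)
# pair plus one bias per output.
def linear_layer_params(in_features: int, out_features: int) -> int:
    return in_features * out_features + out_features

NUM_CLASSES = 3  # pizza, steak, sushi

# Assumed feature sizes feeding each classifier head (torchvision defaults).
head_in_features = {"EfficientNetB0": 1280, "EfficientNetB2": 1408}

trainable = {name: linear_layer_params(n_in, NUM_CLASSES)
             for name, n_in in head_in_features.items()}
print(trainable)

# Ratio of total parameters after changing the head (values from the table).
backbone_ratio = 7_705_221 / 4_011_391
print(f"EffNetB2 / EffNetB0 total parameters: {backbone_ratio:.2f}x")
```

The heads differ by only a few hundred parameters, while the frozen backbones differ by almost a factor of two.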

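Will these extra parameters lead to better results?

We'll have to wait and see...

Note: In the spirit of experimenting, you could try almost any model from torchvision.models in a similar fashion to what we're doing here. I've only chosen EffNetB0 and EffNetB2 as examples. Perhaps you might want to throw something like torchvision.models.convnext_tiny() or torchvision.models.convnext_small() into the mix.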

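7.6 Create experiments and set up training code¶

We've prepared our data and prepared our models, so the time has come to set up some experiments!

We'll start by creating two lists and a dictionary:

A list of the number of epochs we'd like to test ([5, 10])
A list of the models we'd like to test (["effnetb0", "effnetb2"])
A dictionary of the different training DataLoaders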

# 1. Create epochs list
num_epochs = [5, 10]

# 2. Create models list (need to create a new model for each experiment)
models = ["effnetb0", "effnetb2"]

# 3. Create dataloaders dictionary for various dataloaders
train_dataloaders = {"data_10_percent": train_dataloader_10_percent,
                     "data_20_percent": train_dataloader_20_percent}

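Lists and dictionary created!

Now we can write code to iterate through each of the different options and try out each of the different combinations.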
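To see how many combinations that is before training anything, here is a small sketch that enumerates them with `itertools.product`. The placeholder strings stand in for the real DataLoader objects and are an assumption for illustration only; the notebook itself uses three nested `for` loops, which visit the combinations in the same order.

```python
from itertools import product

num_epochs = [5, 10]
models = ["effnetb0", "effnetb2"]
# Placeholder strings stand in for the real DataLoader objects.
train_dataloaders = {"data_10_percent": "<DataLoader 10%>",
                     "data_20_percent": "<DataLoader 20%>"}

# Same order as nested loops over DataLoader, then epochs, then model.
# Iterating over a dictionary yields its keys.
experiments = list(product(train_dataloaders, num_epochs, models))

for number, (dataloader_name, epochs, model_name) in enumerate(experiments, start=1):
    print(f"Experiment {number}: {model_name} | {dataloader_name} | {epochs} epochs")
```

Two data amounts, two epoch counts and two models give 2 x 2 x 2 = 8 experiments.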

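We'll also save the model at the end of each experiment so that later on we can load the best model back in and use it to make predictions.

Specifically, let's go through the following steps:

1. Set the random seeds (so our experiment results are reproducible; in practice, you might run the same experiment across ~3 different seeds and average the results).
2. Keep track of the different experiment numbers (this is mostly for pretty print outs).
3. Loop through the train_dataloaders dictionary items for each of the different training DataLoaders.
4. Loop through the list of epoch numbers.
5. Loop through the list of different model names.
6. Create information print outs for the current running experiment (so we know what's happening).
7. Check which model is the target model and create a new EffNetB0 or EffNetB2 instance (we create a new model instance for each experiment so all models start from the same standpoint).
8. Create a new loss function (torch.nn.CrossEntropyLoss()) and optimizer (torch.optim.Adam(params=model.parameters(), lr=0.001)) for each new experiment.
9. Train the model with the modified train() function, passing the appropriate details to the writer parameter.
10. Save the trained model with an appropriate file name using save_model() from utils.py.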
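Steps 9 and 10 rely on consistent naming so each experiment's logs and saved weights can be matched up later. The sketch below reproduces the naming pattern that shows up in the training output (a dated log directory and a `07_..._epochs.pth` filename). `build_log_dir` and `build_save_filename` are hypothetical helpers written for illustration; they are not the notebook's `create_writer()` or `save_model()`.

```python
from datetime import date
from pathlib import PurePosixPath

def build_log_dir(experiment_name, model_name, extra=None, day=None):
    """Hypothetical helper: runs/YYYY-MM-DD/experiment_name/model_name[/extra]."""
    day = day or date.today()
    parts = ["runs", day.strftime("%Y-%m-%d"), experiment_name, model_name]
    if extra:
        parts.append(extra)
    return str(PurePosixPath(*parts))

def build_save_filename(model_name, dataloader_name, epochs):
    """Hypothetical helper mirroring the filename pattern used in this section."""
    return f"07_{model_name}_{dataloader_name}_{epochs}_epochs.pth"

# A fixed date is passed in so the result is repeatable.
log_dir = build_log_dir("data_10_percent", "effnetb0", "5_epochs", day=date(2022, 6, 23))
save_filename = build_save_filename("effnetb0", "data_10_percent", 5)
print(log_dir)
print(save_filename)
```

Putting the data amount, model name and epoch count into both paths means a run seen in TensorBoard can be traced straight to its saved file.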

We can also use the %%time magic to see how long all of our experiments take together in a single Jupyter/Google Colab cell.
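Note that %%time is a cell magic, so it has to be the first line of the cell and only works in notebooks. If you move the experiments into a plain Python script, one possible equivalent is `time.perf_counter()`. In this sketch, `run_all_experiments()` is a hypothetical stand-in that sleeps briefly instead of training.

```python
import time

def run_all_experiments():
    """Stand-in for the experiment loop; sleeps briefly instead of training."""
    time.sleep(0.05)

start = time.perf_counter()          # high-resolution clock for measuring durations
run_all_experiments()
elapsed = time.perf_counter() - start
print(f"Wall time: {elapsed:.2f} s")
```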

Let's do it!

%%time
from going_modular.going_modular.utils import save_model

# 1. Set the random seeds
set_seeds(seed=42)

# 2. Keep track of experiment numbers
experiment_number = 0

# 3. Loop through each DataLoader
for dataloader_name, train_dataloader in train_dataloaders.items():

    # 4. Loop through each number of epochs
    for epochs in num_epochs: 

        # 5. Loop through each model name and create a new model based on the name
        for model_name in models:

            # 6. Create information print outs
            experiment_number += 1
            print(f"[INFO] Experiment number: {experiment_number}")
            print(f"[INFO] Model: {model_name}")
            print(f"[INFO] DataLoader: {dataloader_name}")
            print(f"[INFO] Number of epochs: {epochs}")  

            # 7. Select the model
            if model_name == "effnetb0":
                model = create_effnetb0() # creates a new model each time (important because we want each experiment to start from scratch)
            else:
                model = create_effnetb2() # creates a new model each time (important because we want each experiment to start from scratch)
            
            # 8. Create a new loss and optimizer for every model
            loss_fn = nn.CrossEntropyLoss()
            optimizer = torch.optim.Adam(params=model.parameters(), lr=0.001)

            # 9. Train target model with target dataloaders and track experiments
            train(model=model,
                  train_dataloader=train_dataloader,
                  test_dataloader=test_dataloader, 
                  optimizer=optimizer,
                  loss_fn=loss_fn,
                  epochs=epochs,
                  device=device,
                  writer=create_writer(experiment_name=dataloader_name,
                                       model_name=model_name,
                                       extra=f"{epochs}_epochs"))
            
            # 10. Save the model to file so we can get back the best model
            save_filepath = f"07_{model_name}_{dataloader_name}_{epochs}_epochs.pth"
            save_model(model=model,
                       target_dir="models",
                       model_name=save_filepath)
            print("-"*50 + "\n")

[INFO] Experiment number: 1
[INFO] Model: effnetb0
[INFO] DataLoader: data_10_percent
[INFO] Number of epochs: 5
[INFO] Created new effnetb0 model.
[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_10_percent/effnetb0/5_epochs...

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.0528 | train_acc: 0.4961 | test_loss: 0.9217 | test_acc: 0.4678
Epoch: 2 | train_loss: 0.8747 | train_acc: 0.6992 | test_loss: 0.8138 | test_acc: 0.6203
Epoch: 3 | train_loss: 0.8099 | train_acc: 0.6445 | test_loss: 0.7175 | test_acc: 0.8258
Epoch: 4 | train_loss: 0.7097 | train_acc: 0.7578 | test_loss: 0.5897 | test_acc: 0.8864
Epoch: 5 | train_loss: 0.5980 | train_acc: 0.9141 | test_loss: 0.5676 | test_acc: 0.8864
[INFO] Saving model to: models/07_effnetb0_data_10_percent_5_epochs.pth
--------------------------------------------------

[INFO] Experiment number: 2
[INFO] Model: effnetb2
[INFO] DataLoader: data_10_percent
[INFO] Number of epochs: 5
[INFO] Created new effnetb2 model.
[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_10_percent/effnetb2/5_epochs...

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.0928 | train_acc: 0.3711 | test_loss: 0.9557 | test_acc: 0.6610
Epoch: 2 | train_loss: 0.9247 | train_acc: 0.6445 | test_loss: 0.8711 | test_acc: 0.8144
Epoch: 3 | train_loss: 0.8086 | train_acc: 0.7656 | test_loss: 0.7511 | test_acc: 0.9176
Epoch: 4 | train_loss: 0.7191 | train_acc: 0.8867 | test_loss: 0.7150 | test_acc: 0.9081
Epoch: 5 | train_loss: 0.6851 | train_acc: 0.7695 | test_loss: 0.7076 | test_acc: 0.8873
[INFO] Saving model to: models/07_effnetb2_data_10_percent_5_epochs.pth
--------------------------------------------------

[INFO] Experiment number: 3
[INFO] Model: effnetb0
[INFO] DataLoader: data_10_percent
[INFO] Number of epochs: 10
[INFO] Created new effnetb0 model.
[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_10_percent/effnetb0/10_epochs...

  0%|          | 0/10 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.0528 | train_acc: 0.4961 | test_loss: 0.9217 | test_acc: 0.4678
Epoch: 2 | train_loss: 0.8747 | train_acc: 0.6992 | test_loss: 0.8138 | test_acc: 0.6203
Epoch: 3 | train_loss: 0.8099 | train_acc: 0.6445 | test_loss: 0.7175 | test_acc: 0.8258
Epoch: 4 | train_loss: 0.7097 | train_acc: 0.7578 | test_loss: 0.5897 | test_acc: 0.8864
Epoch: 5 | train_loss: 0.5980 | train_acc: 0.9141 | test_loss: 0.5676 | test_acc: 0.8864
Epoch: 6 | train_loss: 0.5611 | train_acc: 0.8984 | test_loss: 0.5949 | test_acc: 0.8864
Epoch: 7 | train_loss: 0.5573 | train_acc: 0.7930 | test_loss: 0.5566 | test_acc: 0.8864
Epoch: 8 | train_loss: 0.4702 | train_acc: 0.9492 | test_loss: 0.5176 | test_acc: 0.8759
Epoch: 9 | train_loss: 0.5728 | train_acc: 0.7773 | test_loss: 0.5095 | test_acc: 0.8873
Epoch: 10 | train_loss: 0.4794 | train_acc: 0.8242 | test_loss: 0.4640 | test_acc: 0.9072
[INFO] Saving model to: models/07_effnetb0_data_10_percent_10_epochs.pth
--------------------------------------------------

[INFO] Experiment number: 4
[INFO] Model: effnetb2
[INFO] DataLoader: data_10_percent
[INFO] Number of epochs: 10
[INFO] Created new effnetb2 model.
[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_10_percent/effnetb2/10_epochs...

  0%|          | 0/10 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 1.0928 | train_acc: 0.3711 | test_loss: 0.9557 | test_acc: 0.6610
Epoch: 2 | train_loss: 0.9247 | train_acc: 0.6445 | test_loss: 0.8711 | test_acc: 0.8144
Epoch: 3 | train_loss: 0.8086 | train_acc: 0.7656 | test_loss: 0.7511 | test_acc: 0.9176
Epoch: 4 | train_loss: 0.7191 | train_acc: 0.8867 | test_loss: 0.7150 | test_acc: 0.9081
Epoch: 5 | train_loss: 0.6851 | train_acc: 0.7695 | test_loss: 0.7076 | test_acc: 0.8873
Epoch: 6 | train_loss: 0.6111 | train_acc: 0.7812 | test_loss: 0.6325 | test_acc: 0.9280
Epoch: 7 | train_loss: 0.6127 | train_acc: 0.8008 | test_loss: 0.6404 | test_acc: 0.8769
Epoch: 8 | train_loss: 0.5202 | train_acc: 0.9336 | test_loss: 0.6200 | test_acc: 0.8977
Epoch: 9 | train_loss: 0.5425 | train_acc: 0.8008 | test_loss: 0.6227 | test_acc: 0.8466
Epoch: 10 | train_loss: 0.4908 | train_acc: 0.8125 | test_loss: 0.5870 | test_acc: 0.8873
[INFO] Saving model to: models/07_effnetb2_data_10_percent_10_epochs.pth
--------------------------------------------------

[INFO] Experiment number: 5
[INFO] Model: effnetb0
[INFO] DataLoader: data_20_percent
[INFO] Number of epochs: 5
[INFO] Created new effnetb0 model.
[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_20_percent/effnetb0/5_epochs...

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 0.9577 | train_acc: 0.6167 | test_loss: 0.6545 | test_acc: 0.8655
Epoch: 2 | train_loss: 0.6881 | train_acc: 0.8438 | test_loss: 0.5798 | test_acc: 0.9176
Epoch: 3 | train_loss: 0.5798 | train_acc: 0.8604 | test_loss: 0.4575 | test_acc: 0.9176
Epoch: 4 | train_loss: 0.4930 | train_acc: 0.8646 | test_loss: 0.4458 | test_acc: 0.9176
Epoch: 5 | train_loss: 0.4886 | train_acc: 0.8500 | test_loss: 0.3909 | test_acc: 0.9176
[INFO] Saving model to: models/07_effnetb0_data_20_percent_5_epochs.pth
--------------------------------------------------

[INFO] Experiment number: 6
[INFO] Model: effnetb2
[INFO] DataLoader: data_20_percent
[INFO] Number of epochs: 5
[INFO] Created new effnetb2 model.
[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_20_percent/effnetb2/5_epochs...

  0%|          | 0/5 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 0.9830 | train_acc: 0.5521 | test_loss: 0.7767 | test_acc: 0.8153
Epoch: 2 | train_loss: 0.7298 | train_acc: 0.7604 | test_loss: 0.6673 | test_acc: 0.8873
Epoch: 3 | train_loss: 0.6022 | train_acc: 0.8458 | test_loss: 0.5622 | test_acc: 0.9280
Epoch: 4 | train_loss: 0.5435 | train_acc: 0.8354 | test_loss: 0.5679 | test_acc: 0.9186
Epoch: 5 | train_loss: 0.4404 | train_acc: 0.9042 | test_loss: 0.4462 | test_acc: 0.9489
[INFO] Saving model to: models/07_effnetb2_data_20_percent_5_epochs.pth
--------------------------------------------------

[INFO] Experiment number: 7
[INFO] Model: effnetb0
[INFO] DataLoader: data_20_percent
[INFO] Number of epochs: 10
[INFO] Created new effnetb0 model.
[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_20_percent/effnetb0/10_epochs...

  0%|          | 0/10 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 0.9577 | train_acc: 0.6167 | test_loss: 0.6545 | test_acc: 0.8655
Epoch: 2 | train_loss: 0.6881 | train_acc: 0.8438 | test_loss: 0.5798 | test_acc: 0.9176
Epoch: 3 | train_loss: 0.5798 | train_acc: 0.8604 | test_loss: 0.4575 | test_acc: 0.9176
Epoch: 4 | train_loss: 0.4930 | train_acc: 0.8646 | test_loss: 0.4458 | test_acc: 0.9176
Epoch: 5 | train_loss: 0.4886 | train_acc: 0.8500 | test_loss: 0.3909 | test_acc: 0.9176
Epoch: 6 | train_loss: 0.3705 | train_acc: 0.8854 | test_loss: 0.3568 | test_acc: 0.9072
Epoch: 7 | train_loss: 0.3551 | train_acc: 0.9250 | test_loss: 0.3187 | test_acc: 0.9072
Epoch: 8 | train_loss: 0.3745 | train_acc: 0.8938 | test_loss: 0.3349 | test_acc: 0.8873
Epoch: 9 | train_loss: 0.2972 | train_acc: 0.9396 | test_loss: 0.3092 | test_acc: 0.9280
Epoch: 10 | train_loss: 0.3620 | train_acc: 0.8479 | test_loss: 0.2780 | test_acc: 0.9072
[INFO] Saving model to: models/07_effnetb0_data_20_percent_10_epochs.pth
--------------------------------------------------

[INFO] Experiment number: 8
[INFO] Model: effnetb2
[INFO] DataLoader: data_20_percent
[INFO] Number of epochs: 10
[INFO] Created new effnetb2 model.
[INFO] Created SummaryWriter, saving to: runs/2022-06-23/data_20_percent/effnetb2/10_epochs...

  0%|          | 0/10 [00:00<?, ?it/s]

Epoch: 1 | train_loss: 0.9830 | train_acc: 0.5521 | test_loss: 0.7767 | test_acc: 0.8153
Epoch: 2 | train_loss: 0.7298 | train_acc: 0.7604 | test_loss: 0.6673 | test_acc: 0.8873
Epoch: 3 | train_loss: 0.6022 | train_acc: 0.8458 | test_loss: 0.5622 | test_acc: 0.9280
Epoch: 4 | train_loss: 0.5435 | train_acc: 0.8354 | test_loss: 0.5679 | test_acc: 0.9186
Epoch: 5 | train_loss: 0.4404 | train_acc: 0.9042 | test_loss: 0.4462 | test_acc: 0.9489
Epoch: 6 | train_loss: 0.3889 | train_acc: 0.9104 | test_loss: 0.4555 | test_acc: 0.8977
Epoch: 7 | train_loss: 0.3483 | train_acc: 0.9271 | test_loss: 0.4227 | test_acc: 0.9384
Epoch: 8 | train_loss: 0.3862 | train_acc: 0.8771 | test_loss: 0.4344 | test_acc: 0.9280
Epoch: 9 | train_loss: 0.3308 | train_acc: 0.8979 | test_loss: 0.4242 | test_acc: 0.9384
Epoch: 10 | train_loss: 0.3383 | train_acc: 0.8896 | test_loss: 0.3906 | test_acc: 0.9384
[INFO] Saving model to: models/07_effnetb2_data_20_percent_10_epochs.pth
--------------------------------------------------

CPU times: user 29.5 s, sys: 1min 28s, total: 1min 58s
Wall time: 2min 33s

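8. View experiments in TensorBoard¶

Ho, ho!

Look at us go!

Training eight models in one go?

Now that's living up to the motto!

Experiment, experiment, experiment!

How about we check out the results in TensorBoard?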

# Viewing TensorBoard in Jupyter and Google Colab Notebooks (uncomment to view full TensorBoard instance)
# %load_ext tensorboard
# %tensorboard --logdir runs

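Running the cell above, we should get output similar to the following.

Note: Depending on the random seeds and hardware you use, your numbers might not be exactly the same as what's shown here. That's okay; it's due to the inherent randomness of deep learning. What matters most is the trend, that is, where your numbers are heading. If they're off by a large amount, something may be wrong and it's best to go back and check the code. But if they're off by a small amount (say, a couple of decimal places), that's okay.

Visualizing the test loss values of the different modelling experiments in TensorBoard (with the lowest-test-loss model highlighted), you can see that the EffNetB0 model trained for 10 epochs on 20% of the data achieves the lowest loss. This fits the overall trend of the experiments: more data, a larger model and a longer training time are generally better.

You can also upload your TensorBoard experiment results to tensorboard.dev to host them publicly for free.

For example, running code similar to the following: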

# # Upload the results to TensorBoard.dev (uncomment to try it out)
# !tensorboard dev upload --logdir runs \
#     --name "07. PyTorch Experiment Tracking: FoodVision Mini model results" \
#     --description "Comparing results of different model size, training data amount and training time."

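Running the cell above results in the experiments from this notebook being publicly viewable at: https://tensorboard.dev/experiment/VySxUYY7Rje0xREYvCvZXA/

Note: Beware that anything you upload to tensorboard.dev is publicly available for anyone to see. So if you do upload your experiments, be sure they don't contain sensitive information.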

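9. Load in the best model and make predictions with it¶

Looking at the TensorBoard logs for our eight experiments, it seems experiment number eight achieved the best overall results (among the highest test accuracies, with the second-lowest test loss).

This is the experiment that used:

EffNetB2 (roughly double the parameters of EffNetB0)
20% pizza, steak, sushi training data (double the original training data)
10 epochs (double the original training time)

In essence, our biggest model achieved the best results.

Though it wasn't as if these results were far better than the other models.

The same model on the same data achieved similar results in half the training time (experiment number 6).

This suggests that potentially the most influential parts of our experiments were the number of parameters and the amount of data.

Inspecting the results further, it seems that a model trained with more data (20% pizza, steak, sushi training data) generally performs better (lower test loss and higher test accuracy), and the model with more parameters (EffNetB2) often reaches a higher test accuracy.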
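The same comparison can be made without TensorBoard. This is a rough sketch that uses only the final-epoch test metrics printed in the training logs of section 7.6 (typed in by hand here), ranks the experiments by test loss, and averages test accuracy per data amount. It looks at a single run per configuration, so treat the numbers as indicative rather than conclusive.

```python
from statistics import mean

# Final-epoch test metrics copied from the training logs in section 7.6:
# (experiment number, model, data, epochs, test_loss, test_acc)
results = [
    (1, "effnetb0", "data_10_percent", 5, 0.5676, 0.8864),
    (2, "effnetb2", "data_10_percent", 5, 0.7076, 0.8873),
    (3, "effnetb0", "data_10_percent", 10, 0.4640, 0.9072),
    (4, "effnetb2", "data_10_percent", 10, 0.5870, 0.8873),
    (5, "effnetb0", "data_20_percent", 5, 0.3909, 0.9176),
    (6, "effnetb2", "data_20_percent", 5, 0.4462, 0.9489),
    (7, "effnetb0", "data_20_percent", 10, 0.2780, 0.9072),
    (8, "effnetb2", "data_20_percent", 10, 0.3906, 0.9384),
]

# Rank experiments from lowest to highest final test loss.
ranked_by_loss = sorted(results, key=lambda row: row[4])
for number, model, data, epochs, loss, acc in ranked_by_loss[:3]:
    print(f"Experiment {number}: {model} | {data} | {epochs} epochs | "
          f"test_loss={loss:.4f} | test_acc={acc:.4f}")

# Average final test accuracy for each amount of training data.
mean_acc_by_data = {
    data: mean(row[5] for row in results if row[2] == data)
    for data in ("data_10_percent", "data_20_percent")
}
print(mean_acc_by_data)
```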

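More experimentation could be done to further test this, but for now let's import our best performing model from experiment eight (saved to: models/07_effnetb2_data_20_percent_10_epochs.pth; you can download this model from the course GitHub) and perform some qualitative evaluations.

In other words, let's visualize, visualize, visualize!

We can import the best saved model by creating a new instance of EffNetB2 using the create_effnetb2() function and then loading in the saved state_dict() with torch.load().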
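The save-then-load pattern is easier to see on a tiny model. The sketch below assumes PyTorch is installed and uses a hypothetical `create_tiny_model()` in place of `create_effnetb2()`. The key point is that the architecture must be recreated first, because a state_dict() only stores tensors, not the model's structure.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn

def create_tiny_model() -> nn.Module:
    """Hypothetical stand-in for create_effnetb2(): same idea, far fewer parameters."""
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

trained_model = create_tiny_model()  # pretend this model has been trained

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = Path(tmp_dir) / "tiny_model.pth"
    torch.save(trained_model.state_dict(), save_path)

    # Recreate the same architecture, then load the saved weights into it.
    loaded_model = create_tiny_model()
    state_dict = torch.load(save_path, map_location="cpu")
    load_result = loaded_model.load_state_dict(state_dict)

print(load_result)

# Every tensor in the loaded model should equal the one that was saved.
weights_match = all(
    torch.equal(saved, loaded)
    for saved, loaded in zip(trained_model.state_dict().values(),
                             loaded_model.state_dict().values())
)
print(f"Weights match: {weights_match}")
```

Passing `map_location="cpu"` is optional here; it is a common way to load weights that were saved on a GPU onto a machine without one.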

# Setup the best model filepath
best_model_path = "models/07_effnetb2_data_20_percent_10_epochs.pth"

# Instantiate a new instance of EffNetB2 (to load the saved state_dict() to)
best_model = create_effnetb2()

# Load the saved best model state_dict()
best_model.load_state_dict(torch.load(best_model_path))

[INFO] Created new effnetb2 model.

Out[32]:

<All keys matched successfully>

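Best model loaded!

While we're here, let's check its file size.

This is an important consideration later on when deploying the model (incorporating it into an app).

If the model is too large, it can be hard to deploy.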

# Check the model file size
from pathlib import Path

# Get the model size in bytes then convert to megabytes
effnetb2_model_size = Path(best_model_path).stat().st_size // (1024*1024)
print(f"EfficientNetB2 feature extractor model size: {effnetb2_model_size} MB")

EfficientNetB2 feature extractor model size: 29 MB

Looks like our best model so far is 29 MB in size. We'll keep this in mind if we want to deploy it later on.
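That figure lines up with a back-of-the-envelope estimate. The sketch below assumes every parameter is stored as a 32-bit float (4 bytes) and takes the EffNetB2 parameter count from the table in section 7.5. It ignores non-parameter buffers and file-format overhead, so the real file will be slightly larger than the estimate.

```python
BYTES_PER_FLOAT32 = 4
total_params = 7_705_221  # EffNetB2 total parameters after changing the head

estimated_bytes = total_params * BYTES_PER_FLOAT32
estimated_mb = estimated_bytes / (1024 * 1024)

print(f"Estimated size: {estimated_mb:.2f} MB")
# The notebook cell uses floor division (//), which drops the fractional part.
print(f"Floor-divided: {estimated_bytes // (1024 * 1024)} MB")
```

Because of the floor division, a file anywhere from 29.0 MB up to just under 30 MB is reported as 29 MB.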

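Time to make and visualize some predictions.

We created a pred_and_plot_image() function in 06. PyTorch Transfer Learning section 6 to use a trained model to make predictions on an image.

We can reuse this function by importing it from going_modular.going_modular.predictions.py (I put the pred_and_plot_image() function in a script so we can reuse it).

So to make predictions on various images the model hasn't seen before, we'll first get a list of all the image file paths from the 20% pizza, steak, sushi test dataset and then randomly select a subset of these file paths to pass to our pred_and_plot_image() function.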
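The path-gathering step can be tried on its own. This sketch builds a throwaway folder of empty placeholder files with the same class-per-folder layout (the folder and file names are made up for illustration), then collects and samples the paths the same way.

```python
import random
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    test_dir = Path(tmp_dir) / "test"

    # Build a throwaway class-per-folder layout with empty placeholder files.
    for class_name in ["pizza", "steak", "sushi"]:
        class_dir = test_dir / class_name
        class_dir.mkdir(parents=True)
        for i in range(4):
            (class_dir / f"{i}.jpg").touch()

    # "*/*.jpg" matches any .jpg file exactly one folder below test_dir.
    image_paths = sorted(test_dir.glob("*/*.jpg"))

    random.seed(42)  # optional: makes the sample repeatable
    sample = random.sample(population=image_paths, k=3)  # no path is picked twice
    sampled_classes = [path.parent.name for path in sample]

print(f"Found {len(image_paths)} images, sampled classes: {sampled_classes}")
```

The parent folder name of each path doubles as the ground-truth label, which is handy for checking predictions by eye.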

# Import function to make predictions on images and plot them 
# See the function previously created in section: https://www.learnpytorch.io/06_pytorch_transfer_learning/#6-make-predictions-on-images-from-the-test-set
from going_modular.going_modular.predictions import pred_and_plot_image

# Get a random list of 3 images from 20% test set
import random
num_images_to_plot = 3
test_image_path_list = list(Path(data_20_percent_path / "test").glob("*/*.jpg")) # get all test image paths from 20% dataset
test_image_path_sample = random.sample(population=test_image_path_list,
                                       k=num_images_to_plot) # randomly select k number of images

# Iterate through random test image paths, make predictions on them and plot them
for image_path in test_image_path_sample:
    pred_and_plot_image(model=best_model,
                        image_path=image_path,
                        class_names=class_names,
                        image_size=(224, 224))


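Nice!

Running the cell above a few times, we can see our model performs quite well and often has higher prediction probabilities than previous models we've built.

This suggests the model is more confident in the decisions it's making.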
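To make "more confident" concrete: assuming the prediction probability comes from a softmax over the model's raw outputs (the usual choice for multi-class classification), a larger gap between the top score and the rest gives a higher top probability. The logits below are made up for illustration and are not outputs of the models in this notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["pizza", "steak", "sushi"]

# Made-up logits: both predict "pizza", but with different margins.
less_confident = softmax([1.0, 0.5, 0.2])
more_confident = softmax([4.0, 0.5, 0.2])

for label, probs in [("less confident", less_confident),
                     ("more confident", more_confident)]:
    best_index = probs.index(max(probs))
    print(f"{label}: {class_names[best_index]} with probability {max(probs):.3f}")
```

A high probability only says the model is confident, not that it is right, so it is still worth checking predictions against the true labels.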

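9.1 Predict on a custom image with the best model¶

Making predictions on the test dataset is cool, but the real magic of machine learning is making predictions on your own custom images.

So let's import the trusty pizza dad image (a photo of my dad in front of a pizza) we've been using for the past couple of sections and see how our model performs on it.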

# Download custom image
import requests

# Setup custom image path
custom_image_path = Path("data/04-pizza-dad.jpeg")

# Download the image if it doesn't already exist
if not custom_image_path.is_file():
    with open(custom_image_path, "wb") as f:
        # When downloading from GitHub, need to use the "raw" file link
        request = requests.get("https://raw.githubusercontent.com/mrdbourke/pytorch-deep-learning/main/images/04-pizza-dad.jpeg")
        print(f"Downloading {custom_image_path}...")
        f.write(request.content)
else:
    print(f"{custom_image_path} already exists, skipping download.")

# Predict on custom image
pred_and_plot_image(model=best_model, # use the loaded best model rather than the last model left over from the experiment loop
                    image_path=custom_image_path,
                    class_names=class_names)

data/04-pizza-dad.jpeg already exists, skipping download.

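Woah!

Two thumbs up again!

Our best model predicts "pizza" correctly, and this time with a higher prediction probability (0.978) than the first feature extraction model we trained and used in 06. PyTorch Transfer Learning section 6.1.

This again suggests our current best model (the EffNetB2 feature extractor trained on 20% of the pizza, steak, sushi training data for 10 epochs) has learned patterns that make it more confident in its decision to predict pizza.

I wonder what else could improve our model's performance?

I'll leave that as a challenge for you to investigate.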

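Main takeaways¶

We've now gone full circle on the PyTorch workflow introduced in 01. PyTorch Workflow Fundamentals: we've gotten data ready, we've built and picked a pretrained model, we've used our various helper functions to train and evaluate the model, and in this notebook we've improved our FoodVision Mini model by running and tracking a series of experiments.

Figure: a PyTorch workflow flowchart.

You should be proud of yourself, this is no small feat!

The main ideas you should take away from this Milestone Project 1 are:

The machine learning practitioner's motto: experiment, experiment, experiment! (though we've been doing plenty of that already).
In the beginning, keep your experiments small so you can work fast; your first few experiments shouldn't take more than a few seconds to a few minutes to run.
The more experiments you do, the quicker you can figure out what doesn't work.
Scale up when you find something that works. For example, since we've found a pretty good performing model with EffNetB2 as a feature extractor, perhaps you'd now like to see what happens when you scale it up to the whole Food101 dataset.
Programmatically tracking your experiments takes a few steps to set up, but it's worth it in the long run so you can figure out what works and what doesn't.
There are many different machine learning experiment trackers out there, so explore a few and try them out.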
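As a reminder of how little is needed to start tracking programmatically, here is a minimal sketch of the simplest option from the comparison table at the top of this notebook: appending each result to a CSV file. The `log_result()` helper and the column names are assumptions for illustration; the two rows reuse the final test metrics of experiments 7 and 8 from the logs in section 7.6.

```python
import csv
import tempfile
from pathlib import Path

FIELDNAMES = ["experiment", "model", "data", "epochs", "test_loss", "test_acc"]

def log_result(csv_path, row):
    """Append one experiment result to a CSV file, writing the header once."""
    write_header = not csv_path.exists()
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerow(row)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "experiments.csv"
    log_result(csv_path, {"experiment": 7, "model": "effnetb0", "data": "data_20_percent",
                          "epochs": 10, "test_loss": 0.2780, "test_acc": 0.9072})
    log_result(csv_path, {"experiment": 8, "model": "effnetb2", "data": "data_20_percent",
                          "epochs": 10, "test_loss": 0.3906, "test_acc": 0.9384})

    # Read everything back; values come back as strings.
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

best = min(rows, key=lambda r: float(r["test_loss"]))
print(f"Lowest test loss: experiment {best['experiment']} ({best['model']})")
```

This works for a handful of runs; a dedicated tracker such as TensorBoard becomes more useful once you want per-epoch curves or side-by-side comparisons.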

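Exercises¶

Note: These exercises expect the use of torchvision v0.13+ (released July 2022); previous versions may work but will likely have errors.

All of the exercises are focused on practicing the code above.

You should be able to complete them by referencing each section or by following the resource(s) linked.

All exercises should be completed using device-agnostic code.

Resources:

Exercise template notebook for 07
Example solutions notebook for 07 (try the exercises before looking at this)
See a live video walkthrough of the solutions on YouTube (errors and all)

Pick a larger model from torchvision.models to add to the list of experiments (for example, EffNetB3 or higher).
- How does it perform compared to our existing models?
Introduce data augmentation to the list of experiments using the 20% pizza, steak, sushi training and test datasets. Does this change anything?
- For example, you could have one training DataLoader that uses data augmentation (e.g. train_dataloader_20_percent_aug) and one that doesn't (e.g. train_dataloader_20_percent_no_aug), and then compare the results of two of the same model types training on these two DataLoaders.
- Note: You may need to alter the create_dataloaders() function to be able to take a transform for the training data and the testing data (because you don't need to perform data augmentation on the test data). See 04. PyTorch Custom Datasets section 6 for examples of using data augmentation, or the script below for an example: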

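# Note: a data augmentation transform like this should only be performed on training data
# (`transforms` is torchvision.transforms; `normalize` is assumed to be the transforms.Normalize() instance created earlier in the notebook)
train_transform_data_aug = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
    normalize
])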

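# Helper function to view images in a DataLoader (works with or without data augmentation transforms)
import matplotlib.pyplot as plt

def view_dataloader_images(dataloader, n=10):
    if n > 10:
        print("Having n higher than 10 will create messy plots, lowering to 10.")
        n = 10
    imgs, labels = next(iter(dataloader))
    plt.figure(figsize=(16, 8))
    for i in range(n):
        # Min-max scale the image for display purposes
        targ_image = imgs[i]
        sample_min, sample_max = targ_image.min(), targ_image.max()
        sample_scaled = (targ_image - sample_min)/(sample_max - sample_min)

        # Plot the image with its class name as the title and hide the axes
        plt.subplot(1, 10, i+1)
        plt.imshow(sample_scaled.permute(1, 2, 0)) # rearrange from (C, H, W) to (H, W, C) for Matplotlib
        plt.title(class_names[labels[i]]) # `class_names` comes from the notebook's earlier cells
        plt.axis(False)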

# Have to update `create_dataloaders()` to handle different augmentations
import os
from torch.utils.data import DataLoader
from torchvision import datasets

NUM_WORKERS = os.cpu_count() # use the maximum number of CPUs for workers to load data

# Note: this is an updated version of data_setup.create_dataloaders to handle differing train and test transforms
def create_dataloaders(
    train_dir, 
    test_dir, 
    train_transform, # add parameter for train transform (applied to the training dataset)
    test_transform,  # add parameter for test transform (applied to the test dataset)
    batch_size=32, num_workers=NUM_WORKERS
):
    # Use ImageFolder to create the datasets
    train_data = datasets.ImageFolder(train_dir, transform=train_transform)
    test_data = datasets.ImageFolder(test_dir, transform=test_transform)

    # Get class names
    class_names = train_data.classes

    # Turn images into data loaders
    train_dataloader = DataLoader(
        train_data,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True,
    )
    test_dataloader = DataLoader(
        test_data,
        batch_size=batch_size,
        shuffle=False, # no need to shuffle test data; a fixed order keeps evaluation repeatable
        num_workers=num_workers,
        pin_memory=True,
    )

    return train_dataloader, test_dataloader, class_names

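Scale up the dataset to turn FoodVision Mini into FoodVision Big using the entire Food101 dataset from torchvision.datasets.
- You could take the best performing model from your various experiments, or even the EffNetB2 feature extractor we created in this notebook, and see how it goes fitting for 5 epochs on all of Food101.
- If you try more than one model, it would be good to have the model results tracked.
- If you load the Food101 dataset from torchvision.datasets, you'll have to create PyTorch DataLoaders to use it in training.
- Note: Due to the larger amount of data in Food101 compared to our pizza, steak, sushi dataset, this model will take longer to train.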

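Extra-curriculum¶

Read The Bitter Lesson blog post by Richard Sutton to get an idea of how many of the latest advancements in AI have come from increased scale (bigger datasets and bigger models) and more general (less meticulously crafted) methods.
Go through the PyTorch YouTube/code tutorial for TensorBoard for 20 minutes and see how it compares to the code we've written in this notebook.
Perhaps you'd like to view and rearrange your model's TensorBoard logs with a DataFrame (so you can sort the results by lowest loss or highest accuracy); there's a guide for this in the TensorBoard documentation.
If you like to use VSCode for developing with scripts or notebooks (VSCode can now use Jupyter Notebooks natively), you can set up TensorBoard right within VSCode using the PyTorch Development in VSCode guide.
To go further with experiment tracking and see how your PyTorch model is performing from a speed perspective (are there any bottlenecks that could be improved to speed up training?), see the PyTorch documentation for the PyTorch profiler.
Made With ML is an outstanding resource for all things machine learning by Goku Mohandas, and its guide on experiment tracking contains a fantastic introduction to tracking machine learning experiments with MLflow.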
