MRFI

Overview

Multi-Resolution Fault Injector (MRFI) is a neural network fault injector built on PyTorch.

Compared with other fault injection frameworks, its main feature is flexibility: injection configurations can be adjusted to fit different experimental needs. The injection configuration and the observations for each layer can be set independently in a single, clear config file. MRFI also provides many commonly used error injection methods and error models, and supports customization.

[Figure: MRFI overview]

In preliminary experiments, you may not want to deal with complex experimental configurations. For example, you may only want to observe the parameters of the network model, or run error injection experiments with a simple global configuration. For these cases, MRFI also provides a simple API for observation and coarse-grained fault injection.

See MRFI Basic usage to learn how to use MRFI.

In our MRFI paper on arXiv, we explain the background of the problem and the composition and principles of MRFI in detail, and demonstrate the importance of fine-grained evaluation through experiments using MRFI.

Supported Features

Activation injection

Fixed position (Permanent fault)
Runtime random position (Transient fault)

Weight injection

Fixed position (Permanent fault)
Runtime random position (Transient fault)
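The difference between fixed-position (permanent) and runtime-random (transient) faults can be sketched in a few lines. This is a minimal illustration in plain Python, not MRFI's API: `flip_bit` and `inject` are hypothetical helpers, and a sign-bit flip on a float32 value stands in for whatever error mode an experiment uses.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) of a float32 value."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def inject(values, positions, bit):
    """Return a copy of `values` with `bit` flipped at each index in `positions`."""
    out = list(values)
    for i in positions:
        out[i] = flip_bit(out[i], bit)
    return out


rng = random.Random(0)
weights = [0.5, -1.25, 2.0, 0.75]

# Permanent fault: positions are chosen once and reused for every inference.
fixed_positions = rng.sample(range(len(weights)), k=1)
permanent_runs = [inject(weights, fixed_positions, bit=31) for _ in range(3)]

# Transient fault: positions are re-drawn at runtime for each inference.
transient_runs = [
    inject(weights, rng.sample(range(len(weights)), k=1), bit=31)
    for _ in range(3)
]
```

Every entry of `permanent_runs` is identical because the same position is hit each time, while the entries of `transient_runs` may differ from run to run.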

Injection on quantized models

Post-training quantization
Dynamic quantization
Fine-grained quantization parameters config
Add custom quantization
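To see why the quantization scheme matters for fault injection, consider a bit flip applied to the integer representation of a value. The sketch below assumes symmetric per-tensor 8-bit quantization; the function names are hypothetical and this is not MRFI's quantization code.

```python
def quantize(values, bits=8):
    """Symmetric per-tensor quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def flip_bit_int(q, bit, bits=8):
    """Flip one bit of a two's-complement integer of width `bits`."""
    mask = (1 << bits) - 1
    raw = (q & mask) ^ (1 << bit)
    # Reinterpret the raw bit pattern as a signed integer.
    return raw - (1 << bits) if raw >= 1 << (bits - 1) else raw


def dequantize(q, scale):
    """Map quantized integers back to real values."""
    return [x * scale for x in q]


activations = [0.1, -0.4, 1.27, 0.0]
q, scale = quantize(activations)

# Inject a single bit flip into the first quantized activation.
faulty_q = list(q)
faulty_q[0] = flip_bit_int(faulty_q[0], bit=6)
faulty = dequantize(faulty_q, scale)
```

With these inputs the first activation is stored as the integer 10, and flipping bit 6 turns it into 74, so the error magnitude is bounded by the quantization range rather than by the float32 exponent.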

Error mode

Internal observation & visualization

Activation & Weight observer
Error propagation observer
Easy to save and visualize results; works well with numpy and matplotlib
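An error propagation observer can be pictured as comparing a fault-free ("golden") run with a faulty run at every layer. The following is a toy sketch in plain Python under that assumption; the three-layer network, `run`, and the choice of RMSE as the metric are illustrative and not taken from MRFI.

```python
import math


def rmse(golden, faulty):
    """Root-mean-square error between two equally sized activation lists."""
    return math.sqrt(sum((g - f) ** 2 for g, f in zip(golden, faulty)) / len(golden))


def relu(xs):
    return [max(0.0, x) for x in xs]


def scale_by(factor):
    return lambda xs: [factor * x for x in xs]


# A toy three-layer "network"; each layer maps a list to a list.
layers = [scale_by(2.0), relu, scale_by(0.5)]


def run(x, fault_at=None, fault_delta=0.0):
    """Run the network, optionally adding `fault_delta` to element 0 of the
    output of layer `fault_at`. Returns the output of every layer."""
    outputs = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == fault_at:
            x = [x[0] + fault_delta] + x[1:]
        outputs.append(x)
    return outputs


inputs = [1.0, -2.0, 3.0]
golden = run(inputs)
faulty = run(inputs, fault_at=0, fault_delta=-8.0)

# One RMSE value per layer shows how the injected error propagates.
per_layer_rmse = [rmse(g, f) for g, f in zip(golden, faulty)]
```

In this toy case the per-layer RMSE shrinks with depth, because the ReLU clips the corrupted value and the last layer halves what remains.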

Flexibility

Add custom error_mode, selector, quantization and observer
Distinguish fault tolerance differences at the network, layer, channel, neuron and bit levels

Performance

Automatically use GPU for network inference and fault injection
The selector-injector design is significantly faster than generating a probability for every position when performing random error injection
Accelerate error impact analysis by using internal observer metrics rather than the original accuracy metric
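The speed advantage of separating selection from injection can be illustrated as follows. This sketch assumes one possible selector strategy (skipping ahead by geometrically distributed gaps) so that the number of random draws scales with the number of faults rather than with the tensor size; it is not necessarily how MRFI implements its selectors, and all names are hypothetical.

```python
import math
import random


def select_all_positions(n, p, rng):
    """Baseline: draw one random number per position (n draws)."""
    return [i for i in range(n) if rng.random() < p]


def select_by_skipping(n, p, rng):
    """Selector: jump directly between faulty positions using geometric gaps,
    so the number of random draws scales with the fault count, not with n."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    positions = []
    log_keep = math.log1p(-p)  # log of the per-position survival probability
    i = -1
    while True:
        u = 1.0 - rng.random()  # in (0, 1], avoids log(0)
        i += 1 + int(math.log(u) / log_keep)
        if i >= n:
            return positions
        positions.append(i)


def inject(values, positions, error_mode):
    """Injector: apply `error_mode` only at the selected positions."""
    out = list(values)
    for i in positions:
        out[i] = error_mode(out[i])
    return out


rng = random.Random(0)
n, p = 100_000, 1e-3
positions = select_by_skipping(n, p, rng)
faulty = inject([1.0] * n, positions, error_mode=lambda v: -v)
```

Both selectors mark each position as faulty independently with probability `p`, but at low error rates the skipping version touches only a small fraction of the positions.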

Fine-grained configuration

By Python code
By .yaml config file
By GUI

Evaluation fault tolerance policy

Selective protection at different levels
More fault tolerance methods may be supported later (e.g. fault-tolerant retraining, range-based filter)