Commit | Line | Data |
---|---|---|
db2ab7a0 AL |
1 | The health mechanism is targeted for Real Time Alerting, in order to know when |
2 | something bad has happened to a PCI device | |
3 | - Provide alert debug information | |
4 | - Self healing | |
5 | - If a problem needs vendor support, provide a way to gather all needed debugging | |
6 | information. | |
7 | ||
8 | The main idea is to unify and centralize driver health reports in the | |
9 | generic devlink instance and allow the user to set different | |
10 | attributes of the health reporting and recovery procedures. | |
11 | ||
12 | The devlink health reporter: | |
13 | The device driver creates a "health reporter" for each error/health type. | |
14 | An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX error) | |
15 | or unknown (driver specific). | |
16 | For each registered health reporter, a driver can issue error/health reports | |
17 | asynchronously. All health report handling is done by devlink. | |
18 | The device driver can provide specific callbacks for each "health reporter", e.g. | |
19 | - Recovery procedures | |
20 | - Diagnostics and object dump procedures | |
21 | - Out-of-box (OOB) initial parameters | |
22 | Different parts of the driver can register different types of health reporters | |
23 | with different handlers. | |
24 | ||
25 | Once an error is reported, devlink health will do the following actions: | |
26 | * A log is sent to the kernel trace events buffer | |
27 | * Health status and statistics are updated for the reporter instance | |
28 | * An object dump is taken and saved at the reporter instance (as long as | |
29 | no other dump is already stored) | |
30 | * An auto-recovery attempt is made, depending on: | |
31 | - Auto-recovery configuration | |
32 | - Grace period vs. time passed since the last recovery | |
33 | ||
34 | The user interface: | |
35 | The user can access/change each reporter's parameters and driver-specific callbacks | |
36 | via devlink, e.g. per error type (per health reporter): | |
37 | - Configure reporter's generic parameters (like: disable/enable auto recovery) | |
38 | - Invoke recovery procedure | |
39 | - Run diagnostics | |
40 | - Object dump | |
41 | ||
42 | The devlink health interface (via netlink): | |
43 | DEVLINK_CMD_HEALTH_REPORTER_GET | |
44 | Retrieves status and configuration info per DEV and reporter. | |
45 | DEVLINK_CMD_HEALTH_REPORTER_SET | |
46 | Allows reporter-related configuration setting. | |
47 | DEVLINK_CMD_HEALTH_REPORTER_RECOVER | |
48 | Triggers a reporter's recovery procedure. | |
49 | DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE | |
50 | Retrieves diagnostics data from a reporter on a device. | |
51 | DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET | |
52 | Retrieves the last stored dump. Devlink health | |
53 | saves a single dump. If a dump is not already stored by devlink | |
54 | for this reporter, devlink generates a new dump. | |
55 | The dump output is defined by the reporter. | |
56 | DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR | |
57 | Clears the last saved dump file for the specified reporter. | |
58 | ||
59 | ||
60 |                                                netlink | |
61 |                                       +--------------------------+ | |
62 |                                       |                          | | |
63 |                                       |            +             | | |
64 |                                       |            |             | | |
65 |                                       +--------------------------+ | |
66 |                                                    |request for ops | |
67 |                                                    |(diagnose, | |
68 |  mlx5_core                             devlink     |recover, | |
69 |                                                    |dump) | |
70 | +--------+                            +--------------------------+ | |
71 | |        |                            |    reporter|             | | |
72 | |        |                            |  +---------v----------+  | | |
73 | |        |   ops execution            |  |                    |  | | |
74 | |     <----------------------------------+                    |  | | |
75 | |        |                            |  |                    |  | | |
76 | |        |                            |  + ^------------------+  | | |
77 | |        |                            |    | request for ops     | | |
78 | |        |                            |    | (recover, dump)     | | |
79 | |        |                            |    |                     | | |
80 | |        |                            |  +-+------------------+  | | |
81 | |        |     health report          |  | health handler     |  | | |
82 | |        +------------------------------->                    |  | | |
83 | |        |                            |  +--------------------+  | | |
84 | |        |     health reporter create |                          | | |
85 | |        +---------------------------->                          | | |
86 | +--------+                            +--------------------------+ |
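
The report-handling steps listed in the text (log, status and statistics update, single stored dump, conditional auto-recovery) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct reporter_model`, `model_report()` and all field names are hypothetical and are not the kernel's devlink API, the log goes to stderr instead of the trace events buffer, and recovery is assumed to always succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of one health reporter; not the kernel's structure. */
struct reporter_model {
	bool     auto_recover;      /* auto-recovery configuration */
	uint64_t grace_period_ms;   /* minimum gap between automatic recoveries */
	uint64_t last_recovery_ms;  /* time of the last recovery, if any */
	bool     recovered_before;  /* false until the first recovery */
	bool     dump_stored;       /* only a single dump is kept */
	bool     healthy;           /* current health status */
	unsigned error_count;       /* statistics: errors reported */
	unsigned recover_count;     /* statistics: recoveries performed */
};

/*
 * Handle one health report at time now_ms.
 * Returns true if an automatic recovery was attempted.
 */
static bool model_report(struct reporter_model *r, uint64_t now_ms)
{
	assert(r != NULL);

	/* Step 1: log the event (stand-in for the kernel trace buffer). */
	fprintf(stderr, "health report at %llu ms\n",
		(unsigned long long)now_ms);

	/* Step 2: update health status and statistics. */
	r->error_count++;
	r->healthy = false;

	/* Step 3: take a dump only if none is already stored. */
	if (!r->dump_stored)
		r->dump_stored = true;

	/* Step 4: auto-recovery depends on configuration and grace period. */
	if (!r->auto_recover)
		return false;
	if (r->recovered_before &&
	    now_ms - r->last_recovery_ms < r->grace_period_ms)
		return false;

	/* Assumption: the (hypothetical) recovery callback always succeeds. */
	r->recover_count++;
	r->last_recovery_ms = now_ms;
	r->recovered_before = true;
	r->healthy = true;
	return true;
}
```

In this model, a second report that arrives inside the grace period is still logged and counted, but no recovery is attempted and the reporter stays unhealthy until a later report or a manual recovery.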