Block device exception causes zfs.io and checksum events within a short time #9716

Closed
homerl opened this issue Dec 12, 2019 · 1 comment
Labels
Status: Stale No recent activity for issue

Comments


homerl commented Dec 12, 2019

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.6
Linux Kernel 3.10.0-957.el7_lustre
Architecture x86_64
ZFS Version 0.7.9-1
SPL Version 0.7.9-1

Describe the problem you're observing

While replacing a bad HDD, a technician pulled out the whole JBOD (together with the SAS cable). This triggered errors, and a group of SAS devices was impacted.

My question:
When a block device fails, why does ZFS report checksum errors within such a short time, just 8~16 seconds?
It looks like the driver (mpt3sas 26.00.00.00) has no time to handle the failure.

Yes, I will replace the bad SAS cable. Can I configure ZFS to be more fault tolerant?
Thanks.
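For reference, a minimal sketch of where the relevant timeouts live on this setup; sdv is the affected disk (detailed below), and these are only the knobs to inspect, not recommended values:

# Block/SCSI layer: per-device command timeout (seconds) and queue depth
$ cat /sys/block/sdv/device/timeout
$ cat /sys/block/sdv/device/queue_depth

# ZFS 0.7.x deadman parameters: whether hung zios are reported and after how long
$ cat /sys/module/zfs/parameters/zfs_deadman_enabled
$ cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms

As far as I can tell, the ZED diagnosis engine in 0.7.x degrades a vdev only after compiled-in defaults of roughly 10 I/O or checksum errors within 10 minutes, so at this version the block-layer timeout and the deadman parameters above are the main tunables.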

Let's have a look at this device:

NAME                        STATE     READ WRITE CKSUM
scsi-35000cca25d6d70b8  ONLINE       0     0     2

[1:0:21:0]   disk    HGST     HUS726040AL4210  AD05  /dev/sdv   35000cca25d6d70b8  /dev/sg23  4.00TB

Here are the dmesg errors for sdv:

[Fri Nov 29 13:58:30 2019] blk_update_request: I/O error, dev sdp, sector 2764468056
[Fri Nov 29 13:58:30 2019] sd 1:0:21:0: [sdv] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Nov 29 13:58:30 2019] sd 1:0:21:0: [sdv] tag#80 Sense Key : Aborted Command [current] [descriptor]
[Fri Nov 29 13:58:30 2019] sd 1:0:21:0: [sdv] tag#80 Add. Sense: Nak received
[Fri Nov 29 13:58:30 2019] sd 1:0:21:0: [sdv] tag#80 CDB: Read(10) 28 00 1f 5c 1d 59 00 00 0c 00
[Fri Nov 29 13:58:30 2019] blk_update_request: I/O error, dev sdv, sector 4209044168
[Fri Nov 29 13:58:30 2019] mpt3sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)

......

[Fri Nov 29 13:58:41 2019] sd 1:0:21:0: [sdv] tag#19 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Fri Nov 29 13:58:41 2019] sd 1:0:21:0: [sdv] tag#19 CDB: Read(10) 28 00 1f 62 d5 d4 00 00 07 00
[Fri Nov 29 13:58:41 2019] blk_update_request: I/O error, dev sdv, sector 4212567712
......
[Fri Nov 29 13:58:46 2019] sd 1:0:21:0: [sdv] tag#66 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Nov 29 13:58:46 2019] sd 1:0:21:0: [sdv] tag#66 Sense Key : Aborted Command [current] [descriptor]
[Fri Nov 29 13:58:46 2019] sd 1:0:21:0: [sdv] tag#66 Add. Sense: Nak received
[Fri Nov 29 13:58:46 2019] sd 1:0:21:0: [sdv] tag#66 CDB: Read(10) 28 00 2f 8c 9d 88 00 00 1e 00
[Fri Nov 29 13:58:46 2019] blk_update_request: I/O error, dev sdv, sector 6381956160

Here are the zpool events for sdv:

$ grep -B 16 35000cca25d6d70b8 zpool_events | grep -Ei 'Nov 29'
Nov 29 2019 13:58:39.602222815 ereport.fs.zfs.io
Nov 29 2019 13:58:39.602222815 ereport.fs.zfs.io
Nov 29 2019 13:58:40.579243400 ereport.fs.zfs.io
Nov 29 2019 13:58:40.579243400 ereport.fs.zfs.io
Nov 29 2019 13:58:40.646244812 ereport.fs.zfs.io
Nov 29 2019 13:58:40.646244812 ereport.fs.zfs.io
Nov 29 2019 13:58:40.810248268 ereport.fs.zfs.io
Nov 29 2019 13:58:40.810248268 ereport.fs.zfs.io
Nov 29 2019 13:58:40.810248268 ereport.fs.zfs.io
Nov 29 2019 13:58:41.200256485 ereport.fs.zfs.io
Nov 29 2019 13:58:41.200256485 ereport.fs.zfs.io
Nov 29 2019 13:58:41.200256485 ereport.fs.zfs.io
Nov 29 2019 13:58:41.200256485 ereport.fs.zfs.io
Nov 29 2019 13:58:41.491262616 ereport.fs.zfs.io
Nov 29 2019 13:58:41.491262616 ereport.fs.zfs.io
Nov 29 2019 13:58:41.491262616 ereport.fs.zfs.io
Nov 29 2019 13:58:41.491262616 ereport.fs.zfs.io
Nov 29 2019 13:58:49.219425441 ereport.fs.zfs.io
Nov 29 2019 13:58:54.429535213 ereport.fs.zfs.io
Nov 29 2019 13:58:54.429535213 ereport.fs.zfs.io
Nov 29 2019 13:58:54.452535698 ereport.fs.zfs.checksum
Nov 29 2019 13:58:54.454535740 ereport.fs.zfs.checksum
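If the full payload of those checksum ereports is useful, it can be dumped with the verbose flag; the grep window below is just an example:

$ zpool events -v | grep -A 32 'ereport.fs.zfs.checksum'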

Describe how to reproduce the problem

This happened in a production environment; I can't reproduce it.

Include any warning/errors/backtraces from the system logs

dmesg
zpool events
zpool status
zfs/spl module parameters
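A sketch of one way to collect these (output file names are placeholders, adjust as needed):

$ dmesg -T > dmesg.txt
$ zpool events > zpool_events
$ zpool status -v > zpool_status.txt
$ grep . /sys/module/zfs/parameters/* /sys/module/spl/parameters/* > zfs_spl_parameters.txt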


stale bot commented Dec 11, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale (No recent activity for issue) label on Dec 11, 2020
stale bot closed this as completed on Mar 11, 2021