Block device exception causes zfs.io and checksum events within a short time #9716

Closed
homerl opened this issue Dec 12, 2019 · 1 comment
Labels
Status: Stale No recent activity for issue

Comments


homerl commented Dec 12, 2019

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.6
Linux Kernel 3.10.0-957.el7_lustre
Architecture x86_64
ZFS Version 0.7.9-1
SPL Version 0.7.9-1

Describe the problem you're observing

While replacing a bad HDD, a technician pulled out the whole JBOD (together with the SAS cable). This triggered errors, and a group of SAS devices was impacted.

My question:
When a block device fails, why does ZFS report checksum errors within such a short time, just 8~16 seconds?
It looks like the driver (mpt3sas 26.00.00.00) has no time to handle the failure.

Yes, I will replace the bad SAS cable. Can I configure ZFS to be more fault tolerant?
Thanks.
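For reference, a minimal sketch of where the relevant timeouts live on this setup; sdv is the affected disk (detailed below), and these are only the knobs to inspect, not recommended values:

# Block/SCSI layer: per-device command timeout (seconds) and queue depth
$ cat /sys/block/sdv/device/timeout
$ cat /sys/block/sdv/device/queue_depth

# ZFS 0.7.x deadman parameters: whether hung zios are reported and after how long
$ cat /sys/module/zfs/parameters/zfs_deadman_enabled
$ cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms

As far as I can tell, the ZED diagnosis engine in 0.7.x degrades a vdev only after compiled-in defaults of roughly 10 I/O or checksum errors within 10 minutes, so at this version the block-layer timeout and the deadman parameters above are the main tunables.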

Let's have a look at this device:

NAME                        STATE     READ WRITE CKSUM
scsi-35000cca25d6d70b8  ONLINE       0     0     2

[1:0:21:0]   disk    HGST     HUS726040AL4210  AD05  /dev/sdv   35000cca25d6d70b8  /dev/sg23  4.00TB

Here are the dmesg errors for sdv:

[Fri Nov 29 13:58:30 2019] blk_update_request: I/O error, dev sdp, sector 2764468056
[Fri Nov 29 13:58:30 2019] sd 1:0:21:0: [sdv] tag#80 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Nov 29 13:58:30 2019] sd 1:0:21:0: [sdv] tag#80 Sense Key : Aborted Command [current] [descriptor]
[Fri Nov 29 13:58:30 2019] sd 1:0:21:0: [sdv] tag#80 Add. Sense: Nak received
[Fri Nov 29 13:58:30 2019] sd 1:0:21:0: [sdv] tag#80 CDB: Read(10) 28 00 1f 5c 1d 59 00 00 0c 00
[Fri Nov 29 13:58:30 2019] blk_update_request: I/O error, dev sdv, sector 4209044168
[Fri Nov 29 13:58:30 2019] mpt3sas_cm0: log_info(0x31120302): originator(PL), code(0x12), sub_code(0x0302)

......

[Fri Nov 29 13:58:41 2019] sd 1:0:21:0: [sdv] tag#19 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[Fri Nov 29 13:58:41 2019] sd 1:0:21:0: [sdv] tag#19 CDB: Read(10) 28 00 1f 62 d5 d4 00 00 07 00
[Fri Nov 29 13:58:41 2019] blk_update_request: I/O error, dev sdv, sector 4212567712
......
[Fri Nov 29 13:58:46 2019] sd 1:0:21:0: [sdv] tag#66 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Nov 29 13:58:46 2019] sd 1:0:21:0: [sdv] tag#66 Sense Key : Aborted Command [current] [descriptor]
[Fri Nov 29 13:58:46 2019] sd 1:0:21:0: [sdv] tag#66 Add. Sense: Nak received
[Fri Nov 29 13:58:46 2019] sd 1:0:21:0: [sdv] tag#66 CDB: Read(10) 28 00 2f 8c 9d 88 00 00 1e 00
[Fri Nov 29 13:58:46 2019] blk_update_request: I/O error, dev sdv, sector 6381956160

Here are the zpool events for sdv:

$ grep -B 16 35000cca25d6d70b8 zpool_events | grep -Ei 'Nov 29'
Nov 29 2019 13:58:39.602222815 ereport.fs.zfs.io
Nov 29 2019 13:58:39.602222815 ereport.fs.zfs.io
Nov 29 2019 13:58:40.579243400 ereport.fs.zfs.io
Nov 29 2019 13:58:40.579243400 ereport.fs.zfs.io
Nov 29 2019 13:58:40.646244812 ereport.fs.zfs.io
Nov 29 2019 13:58:40.646244812 ereport.fs.zfs.io
Nov 29 2019 13:58:40.810248268 ereport.fs.zfs.io
Nov 29 2019 13:58:40.810248268 ereport.fs.zfs.io
Nov 29 2019 13:58:40.810248268 ereport.fs.zfs.io
Nov 29 2019 13:58:41.200256485 ereport.fs.zfs.io
Nov 29 2019 13:58:41.200256485 ereport.fs.zfs.io
Nov 29 2019 13:58:41.200256485 ereport.fs.zfs.io
Nov 29 2019 13:58:41.200256485 ereport.fs.zfs.io
Nov 29 2019 13:58:41.491262616 ereport.fs.zfs.io
Nov 29 2019 13:58:41.491262616 ereport.fs.zfs.io
Nov 29 2019 13:58:41.491262616 ereport.fs.zfs.io
Nov 29 2019 13:58:41.491262616 ereport.fs.zfs.io
Nov 29 2019 13:58:49.219425441 ereport.fs.zfs.io
Nov 29 2019 13:58:54.429535213 ereport.fs.zfs.io
Nov 29 2019 13:58:54.429535213 ereport.fs.zfs.io
Nov 29 2019 13:58:54.452535698 ereport.fs.zfs.checksum
Nov 29 2019 13:58:54.454535740 ereport.fs.zfs.checksum
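If the full payload of those checksum ereports is useful, it can be dumped with the verbose flag; the grep window below is just an example:

$ zpool events -v | grep -A 32 'ereport.fs.zfs.checksum'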

Describe how to reproduce the problem

This happened in a production environment; I can't reproduce it.

Include any warning/errors/backtraces from the system logs

dmesg
zpool events
zpool status
zfs/spl module parameters
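A sketch of one way to collect these (output file names are placeholders, adjust as needed):

$ dmesg -T > dmesg.txt
$ zpool events > zpool_events
$ zpool status -v > zpool_status.txt
$ grep . /sys/module/zfs/parameters/* /sys/module/spl/parameters/* > zfs_spl_parameters.txt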


stale bot commented Dec 11, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale (No recent activity for issue) label on Dec 11, 2020
stale bot closed this as completed on Mar 11, 2021