Gentoo Forums
Weird issue with disk
Dwosky
Tux's lil' helper


Joined: 07 Nov 2018
Posts: 87

Posted: Mon Mar 04, 2019 10:54 am    Post subject: Weird issue with disk

Hello,

First of all, I'm not sure this is the correct section of the forum for this; sorry if I got it wrong.

Last week I had a problem with one of the disks in my server. At first glance, it seems the disk hit an error while either reading or writing; the kernel retried and tried to recover, but in the end it couldn't. I found errors in both the kernel log and the SMART log for the affected HDD:
Code:
Feb 28 06:36:50 server kernel: ata2.00: exception Emask 0x0 SAct 0x8000000 SErr 0x0 action 0x0
Feb 28 06:36:50 server kernel: ata2.00: irq_stat 0x40000008
Feb 28 06:36:50 server kernel: ata2.00: failed command: READ FPDMA QUEUED
Feb 28 06:36:50 server kernel: ata2.00: cmd 60/00:d8:80:37:1c/01:00:fb:00:00/40 tag 27 ncq dma 131072 in\x0a         res 41/40:00:80:37:1c/00:00:fb:00:00/40 Emask 0x409 (media error) <F>
Feb 28 06:36:50 server kernel: ata2.00: status: { DRDY ERR }
Feb 28 06:36:50 server kernel: ata2.00: error: { UNC }
Feb 28 06:36:50 server kernel: ata2.00: configured for UDMA/133
Feb 28 06:36:50 server kernel: sd 1:0:0:0: [sdb] tag#27 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 28 06:36:50 server kernel: sd 1:0:0:0: [sdb] tag#27 Sense Key : Medium Error [current]
Feb 28 06:36:50 server kernel: sd 1:0:0:0: [sdb] tag#27 Add. Sense: Unrecovered read error - auto reallocate failed
Feb 28 06:36:50 server kernel: sd 1:0:0:0: [sdb] tag#27 CDB: Read(16) 88 00 00 00 00 00 fb 1c 37 80 00 00 01 00 00 00
Feb 28 06:36:50 server kernel: print_req_error: I/O error, dev sdb, sector 4212930432
Feb 28 06:36:50 server kernel: ata2: EH complete
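
The failing sector and the request size can be read straight out of the CDB line in the log above. A minimal Python sketch, assuming the standard READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13) and 512-byte logical sectors; the function name is made up for illustration:

```python
def decode_read16_cdb(cdb_hex):
    """Return (lba, sector_count) from a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")     # starting logical block address
    count = int.from_bytes(cdb[10:14], "big")  # number of sectors requested
    return lba, count


# CDB copied from the kernel log above.
lba, count = decode_read16_cdb("88 00 00 00 00 00 fb 1c 37 80 00 00 01 00 00 00")
# 512 bytes per logical sector is an assumption about this drive.
print(lba, count, count * 512)
```

With that assumption the decoded values line up with the rest of the log: the LBA matches the sector in the print_req_error line and the byte count matches the "ncq dma 131072" figure.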

Code:
Error 1 occurred at disk power-on lifetime: 13293 hours (553 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.
 
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 80 2e f8 e8  Error: UNC at LBA = 0x08f82e80 = 150482560
 
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 80 2e f8 e8 00  18d+21:33:56.755  READ DMA
  b0 d1 01 01 4f c2 00 00  18d+21:33:16.888  SMART READ ATTRIBUTE THRESHOLDS [OBS-4]
  b0 d0 01 00 4f c2 00 00  18d+21:33:16.884  SMART READ DATA


When I checked the server, the HDD had already been dropped (/dev/sdb was not mounted). I rebooted, unmounted /dev/sdb as soon as the system came back up, and ran a smartctl Conveyance test over the LBAs that failed, followed by an Extended test. Both tests were successful and no other error was reported. I also opened a support request with WDC and ran their Data Lifeguard diagnostics (both the quick and the extended test finished without issues). As for the SATA cable, it's fairly new (I think it's a Sharkoon) and hasn't shown any problems over the past year (I bought it in May 2018, and the server is usually online 24/7).

Right now I'm kind of lost: I don't know why the error happened or what caused it, so I can't work out how to avoid or fix it. Does anyone have a hint about what else to check?

Here are the kernel and full smartctl logs:
Kernel: https://pastebin.com/vTNhdE3e
SMART: https://pastebin.com/HZ7NyKPd
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1518
Location: KUUSANKOSKI, Finland

Posted: Mon Mar 04, 2019 11:11 am

Do any of the other disks share the same controller/card as the one that seems to be failing?
Oh and also: I once had similar problems. I "solved" it by adding libata.force=3.0Gbps to the kernel command line.
It's actually still in use... Hmmmm.
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
Dwosky
Tux's lil' helper


Joined: 07 Nov 2018
Posts: 87

Posted: Mon Mar 04, 2019 11:21 am

Zucca wrote:
Do any of the other disks share the same controller/card as the one that seems to be failing?

I think so. This is an Intel H77 motherboard, which has 6 ports on the same controller: 4 at 3Gb/s and 2 at 6Gb/s. I haven't seen any failures on the other disks, but I should add that the one that failed is the most active. It's active in the sense that it's available and may be read throughout the day, but it isn't under stress or heavy R/W.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1518
Location: KUUSANKOSKI, Finland

Posted: Mon Mar 04, 2019 3:41 pm

Quote:
Code:
SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)

I'd try setting the SATA speed to 3Gbps and see if you actually have the same problem I had.

EDIT: Have you made any hardware changes lately? Or did this problem pop up out of nowhere?
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43178
Location: 56N 3W

Posted: Mon Mar 04, 2019 3:51 pm

Dwosky,

Code:
Feb 28 06:36:50 server kernel: ata2.00: cmd 60/00:d8:80:37:1c/01:00:fb:00:00/40 tag 27 ncq dma 131072 in\x0a         res 41/40:00:80:37:1c/00:00:fb:00:00/40 Emask 0x409 (media error)
Feb 28 06:36:50 server kernel: sd 1:0:0:0: [sdb] tag#27 Add. Sense: Unrecovered read error - auto reallocate failed

That media error is ugly. It suggests an internal drive error.

Run
Code:
smartctl -a /dev/sd...
on the drive and post the result.
Emerge smartmontools if you need to.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Dwosky
Tux's lil' helper


Joined: 07 Nov 2018
Posts: 87

Posted: Mon Mar 04, 2019 6:50 pm

NeddySeagoon wrote:
That media error is ugly. It suggests an internal drive error.

Yeah, from the logs it seems there really is something wrong with the HDD, but what I don't understand is why the subsequent SMART extended test completed successfully and didn't find any issue.

SMART info is here: https://pastebin.com/HZ7NyKPd

@Zucca, I'll check the force parameter, but setting a lower bandwidth isn't really optimal, since I also have an SSD, which might be affected by the performance degradation.
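
On the SSD concern: the kernel's libata.force parameter takes a comma-separated list of [ID:]VAL entries, where ID is a port (optionally port.device), so the limit can be applied to a single port rather than to every disk. Going by the "ata2.00" prefix in the kernel log, something like libata.force=2:3.0Gbps would be the per-port form for the affected drive. A small Python sketch of how such a command line breaks down; the helper name and the root= value in the sample are placeholders, not taken from this system:

```python
def parse_libata_force(cmdline):
    """Return a list of (port_id or None, value) pairs from libata.force=."""
    result = []
    for token in cmdline.split():
        if not token.startswith("libata.force="):
            continue
        for item in token.split("=", 1)[1].split(","):
            # "2:3.0Gbps" applies to port 2 only; a bare "3.0Gbps" applies globally.
            ident, sep, value = item.rpartition(":")
            result.append((ident if sep else None, value))
    return result


# Hypothetical kernel command line; on a live system read /proc/cmdline instead.
sample = "root=/dev/sda2 libata.force=2:3.0Gbps"
print(parse_libata_force(sample))
```

A port ID of None in the result means the setting would hit every port, which is the case to avoid if the SSD should keep its 6.0 Gb/s link.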
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43178
Location: 56N 3W

Posted: Mon Mar 04, 2019 7:04 pm

Dwosky,

It looks like the drive has an intermittent fault.

It really failed to read those blocks.
Code:
40 51 08 c0 2e fd e8  Error: UNC 8 sectors at LBA = 0x08fd2ec0 = 150810304
40 51 00 40 2e fd e8  Error: UNC at LBA = 0x08fd2e40 = 150810176
40 51 00 80 2e f8 e8  Error: UNC at LBA = 0x08f82e80 = 150482560
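
For anyone wanting to check those LBAs by hand, this is a rough Python sketch of how the register dump maps to an address. It assumes the 28-bit layout used by the READ DMA commands in this error log (low nibble of DH, then CH, CL, SN, with the columns in smartctl's ER ST SC SN CL CH DH order); the function name is only illustrative:

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register rows copied from the SMART error log entries above.
rows = [
    "40 51 08 c0 2e fd e8",
    "40 51 00 40 2e fd e8",
    "40 51 00 80 2e f8 e8",
]
for row in rows:
    er, st, sc, sn, cl, ch, dh = (int(b, 16) for b in row.split())
    lba = lba28_from_registers(sn, cl, ch, dh)
    print(f"0x{lba:08x} = {lba}")
```

A 28-bit address cannot exceed 268435455, so these three entries refer to different sectors than the 4212930432 reported in the kernel log.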


That taken together with
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
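
To keep an eye on those three counters over time, the attribute table can be parsed from the text output. A minimal Python sketch, assuming the ten-column layout shown above (as printed by smartctl -A); the sample rows are the ones from this drive, and the function name is hypothetical:

```python
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text):
    """Map attribute ID to its name, normalized value, threshold and raw value."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip the header and anything that is not an attribute row
        attrs[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "raw": fields[9].strip(),
        }
    return attrs


attrs = parse_attributes(SAMPLE)
# IDs 5, 196 and 197 are the reallocation and pending-sector counters quoted above.
nonzero = [i for i in (5, 196, 197) if attrs[i]["raw"].split()[0] != "0"]
print(nonzero)
```

An empty list here matches the table above: none of the reallocation or pending-sector counters has moved, despite the logged read errors.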


says that the drive didn't even try to allocate spare sectors, probably because when it tried, reads failed.

Yet
Code:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 3  Extended offline    Completed without error       00%     13310
shows that there was no error at that time.

The drive needs to be replaced. Plug
Code:
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:   
into the Warranty Checker and if your warranty is good, return the drive.