Step-by-Step Guide: Checking Ceph OSD Disk Health

Brandon Lee

I have been working a ton with Ceph lately in the home lab. Here are some notes on how to check whether the disk behind a Ceph OSD is slow or failing. You can see your Ceph health with the command:

ceph status

or 

ceph -s
[Screenshot: viewing the slow alert in Ceph]

Step 1: Identify the Problem OSD

# Check overall cluster health
ceph status
# Get detailed health information (shows which OSD has issues)
ceph health detail
Example output:
[WRN] BLUESTORE_SLOW_OP_ALERT: 1 OSD(s) experiencing slow operations in BlueStore
osd.6 observed slow operation indications in BlueStore
Note the OSD number (in this case: osd.6)
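
If more than one OSD is flagged, picking the IDs out by eye gets tedious. Here is a minimal sketch that pulls the OSD names out of the health text. The function name `extract_osd_ids` is my own, not a Ceph tool, and it assumes the OSDs appear as `osd.<number>` like in the example output above.

```bash
# Print the unique OSD names (e.g. osd.6) found in text read from stdin.
# extract_osd_ids is a hypothetical helper name, not part of Ceph.
extract_osd_ids() {
    grep -oE 'osd\.[0-9]+' | sort -u
}

# Typical use on a cluster node:
#   ceph health detail | extract_osd_ids
```

The sort is lexical, so osd.10 would be listed before osd.6.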

Step 2: Locate the OSD's Host

# Find which physical host contains the OSD
ceph osd find <osd-number>
# Example:
ceph osd find 6
Example output:
{
    "osd": 6,
    "addrs": {
        "addrvec": [
            {
                "type": "v1",
                "addr": "10.3.33.204:6804",
                "nonce": 1234
            }
        ]
    },
    "osd_fsid": "900daf28-d681-4637-90db-9764bcfd2f11",
    "host": "pvehost04",
    "crush_location": {
        "host": "pvehost04",
        "root": "default"
    }
}
Note the hostname (in this case: pvehost04)
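
To grab the hostname in a script, something like the sketch below works on the pretty-printed JSON shown above. It assumes one key per line, which is how the output is laid out here. The function name `osd_host_from_json` is made up for this example. If jq is installed, `jq -r .host` is the more robust choice.

```bash
# Print the value of the first "host" key in JSON read from stdin.
# Assumes pretty-printed output with one key per line.
# osd_host_from_json is a hypothetical helper name.
osd_host_from_json() {
    sed -n 's/.*"host": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Typical use:
#   ceph osd find 6 | osd_host_from_json
```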

Step 3: Connect to the Host

# SSH to the host containing the problematic OSD
ssh root@pvehost04
 

Step 4: Identify the Physical Disk

# Find the OSD's logical volume
ceph-volume lvm list | grep -A 10 "osd.<number>"
# Example:
ceph-volume lvm list | grep -A 10 "osd.6"
Example output:
====== osd.6 =======
[block] /dev/ceph-46ed1f42-7685-4bfd-b64f-ad525bddc935/osd-block-900daf28...
block device /dev/ceph-46ed1f42-7685-4bfd-b64f-ad525bddc935/osd-block-900daf28...
block uuid f94nrL-KRDg-D648-Ia7B-F3Yx-hjwQ-HppqAT
Note the VG name (in this case: ceph-46ed1f42-7685-4bfd-b64f-ad525bddc935)

Find the underlying physical disk:

# Show the complete disk hierarchy
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,FSTYPE
# Or find the physical volume for the VG
pvs | grep <vg-name>
# Example:
pvs | grep ceph-46ed1f42-7685-4bfd-b64f-ad525bddc935
Example output:
/dev/nvme1n1 ceph-46ed1f42-7685-4bfd-b64f-ad525bddc935 lvm2 a-- <953.87g
Note the physical device (in this case: /dev/nvme1n1)
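
If you would rather match the VG name exactly than grep for it, here is a small sketch. It reads rows where the first column is the PV and the second is the VG, which is what `pvs` prints by default. The function name `pv_for_vg` is just for illustration.

```bash
# Print the physical volume(s) that back the volume group named in $1.
# Reads rows of the form "PV VG ..." from stdin.
# pv_for_vg is a hypothetical helper name.
pv_for_vg() {
    awk -v vg="$1" '$2 == vg { print $1 }'
}

# Typical use:
#   pvs --noheadings -o pv_name,vg_name | pv_for_vg ceph-46ed1f42-7685-4bfd-b64f-ad525bddc935
```

An exact comparison on the VG column avoids false hits when one VG name is a prefix of another.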

Step 5: Check Disk Health

For NVMe Drives:

# Install nvme-cli if not present
apt install nvme-cli -y
# Check SMART health summary
nvme smart-log /dev/nvme1n1
# Or using smartctl
smartctl -a /dev/nvme1n1
# Check for errors
nvme error-log /dev/nvme1n1

For SATA/SAS Drives:

# Install smartmontools if not present
apt install smartmontools -y
# Quick health check
smartctl -H /dev/sdX
# Full SMART information
smartctl -a /dev/sdX
# Check for specific error indicators
smartctl -a /dev/sdX | grep -E "Reallocated|Pending|Current_Pending|Offline_Uncorrectable|UDMA_CRC_Error"
 

Step 6: Interpret Health Results

Critical Values to Check:

For NVMe:

  • critical_warning: Should be 0 (anything else is bad)
  • temperature: Should be < 70°C (< 158°F)
  • available_spare: Should be > 10%
  • percentage_used: Wear indicator (100% = end of life)
  • media_errors: Should be 0
  • error log entries: Review for I/O errors
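
The NVMe thresholds above can be checked with a short awk filter. This is only a sketch: the function name `check_nvme_smart` is my own, and it assumes the `name : value` text layout of `nvme smart-log`, which differs a bit between nvme-cli versions (some print the field names or units differently).

```bash
# Read `nvme smart-log` text from stdin and print one line per failed check.
# Thresholds follow the list above. check_nvme_smart is a hypothetical helper name.
check_nvme_smart() {
    awk -F':' '
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)     # "critical_warning    " -> "critical_warning"
            gsub(/^[ \t]+/, "", val)   # drop leading blanks from the value
            num = val + 0              # "38 C" -> 38, "100%" -> 100
        }
        # Compare critical_warning as text so a hex value such as 0x04 is not read as 0
        key == "critical_warning" && val !~ /^(0|0x0+)[ \t]*$/ { print "critical_warning is " val }
        key == "temperature"      && num >= 70  { print "temperature is " val }
        key == "available_spare"  && num <= 10  { print "available_spare is " val }
        key == "percentage_used"  && num >= 100 { print "percentage_used is " val }
        key == "media_errors"     && num > 0    { print "media_errors is " val }
    '
}

# Typical use:
#   nvme smart-log /dev/nvme1n1 | check_nvme_smart
```

No output means none of the checks tripped. It does not look at the error log, so still review `nvme error-log` by hand.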

For SATA/SAS:

  • SMART overall-health: Should be PASSED
  • Reallocated_Sector_Ct: Should be 0 (or very low)
  • Current_Pending_Sector: Should be 0
  • Offline_Uncorrectable: Should be 0
  • UDMA_CRC_Error_Count: High values indicate cable/connection issues
  • Temperature: Should be < 55°C
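
For SATA drives, the counters above can be filtered out of the SMART attribute table the same way. A rough sketch, assuming the usual ten-column ATA attribute table from `smartctl -A` with the raw value in the last column. The function name `check_ata_attrs` is made up, and SAS drives report health in a different format, so this will not match anything there.

```bash
# Read `smartctl -A` attribute rows from stdin and print the watched
# attributes whose raw value (column 10) is above zero.
# check_ata_attrs is a hypothetical helper name.
check_ata_attrs() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 > 0 {
            print $2 "=" $10
        }
    '
}

# Typical use:
#   smartctl -A /dev/sdX | check_ata_attrs
```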

Step 7: Check OSD Performance Metrics

# From any Ceph node, check OSD performance
ceph osd perf
# Check OSD utilization
ceph osd df
# Check for current slow operations (run on the OSD's host)
ceph daemon osd.<number> dump_ops_in_flight
# Check historic slow operations (run on the OSD's host)
ceph daemon osd.<number> dump_historic_slow_ops
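
On a cluster with many OSDs, `ceph osd perf` is easier to read with the healthy rows filtered out. Here is a sketch that keeps only OSDs at or above a latency limit. The function name `slow_osds` and the default of 50 ms are my own arbitrary choices, and it assumes the three-column layout (osd, commit latency, apply latency).

```bash
# Read `ceph osd perf` output from stdin and print OSDs whose commit or
# apply latency (ms) is at or above the limit in $1 (default 50, arbitrary).
# slow_osds is a hypothetical helper name.
slow_osds() {
    awk -v limit="${1:-50}" '
        # Only data rows start with a numeric OSD id, so the header is skipped
        $1 ~ /^[0-9]+$/ && ($2 + 0 >= limit + 0 || $3 + 0 >= limit + 0) {
            printf "osd.%s commit=%sms apply=%sms\n", $1, $2, $3
        }
    '
}

# Typical use:
#   ceph osd perf | slow_osds 100
```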

Step 8: Monitor I/O Performance (Optional)

# Install sysstat if not present
apt install sysstat -y
# Monitor real-time I/O stats (watch for high await times or %util)
iostat -x <device> 2 5
# Example for NVMe:
iostat -x nvme1n1 2 5
# Example for SATA:
iostat -x sda 2 5

Key metrics to watch:

  • %util: > 90% consistently = saturated disk
  • await: > 10ms = slow responses
  • r_await / w_await: Read and write latency reported separately (recent sysstat versions print only these two columns and no combined await)
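
To flag busy disks automatically, the sketch below applies the thresholds above to `iostat -x` output. It finds the columns by header name because their order changes between sysstat versions. The function name `flag_busy_disks` is made up for this example, and it only looks at r_await, w_await and %util.

```bash
# Read `iostat -x` output from stdin and print devices with %util above 90
# or read/write await above 10 ms.
# flag_busy_disks is a hypothetical helper name.
flag_busy_disks() {
    awk '
        $1 ~ /^Device/ {                       # header row: remember column positions
            split("", col)                     # portable way to empty the array
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Device rows have at least as many fields as the header; CPU rows do not
        ("%util" in col) && NF >= col["%util"] {
            util = $(col["%util"])
            r = ("r_await" in col) ? $(col["r_await"]) : 0
            w = ("w_await" in col) ? $(col["w_await"]) : 0
            if (util + 0 > 90 || r + 0 > 10 || w + 0 > 10)
                printf "%s util=%s r_await=%s w_await=%s\n", $1, util, r, w
        }
    '
}

# Typical use:
#   iostat -x nvme1n1 2 5 | flag_busy_disks
```

Keep in mind the first report from iostat is an average since boot, so the later samples are the ones that show current load.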

Step 9: Check System Logs

# Check for disk-related errors in dmesg
dmesg -T | grep -i "<device>" | tail -50
# Example:
dmesg -T | grep -i nvme1n1 | tail -50
# Check systemd journal for Ceph or disk issues
journalctl -u ceph-osd@<number> --since "1 hour ago"
# Example:
journalctl -u ceph-osd@6 --since "1 hour ago"

Step 10: Common Issues and Resolutions

Issue: Slow Operations During Rebalancing

Cause: Normal during data migration
Solution: Wait for rebalancing to complete or mute the warning:
 
ceph health mute BLUESTORE_SLOW_OP_ALERT --sticky

Issue: High Media Errors or Reallocated Sectors

Cause: Failing disk
Solution: Replace the disk:
 
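# Mark OSD out (triggers data migration)
ceph osd out <osd-number>
# Monitor rebalancing until all PGs are active+clean
watch ceph -s
# Confirm the OSD can be removed without risking data
ceph osd safe-to-destroy osd.<osd-number>
# Stop the OSD daemon (run on the OSD's host) so it stays down
systemctl stop ceph-osd@<osd-number>
# Remove the OSD from the CRUSH map, delete its key, then remove the OSD itself
ceph osd crush rm osd.<osd-number>
ceph auth del osd.<osd-number>
ceph osd rm <osd-number>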

Issue: High Temperature

Cause: Poor cooling or failing fan
Solution: Improve airflow, check datacenter HVAC

Issue: Disk Full

Cause: Imbalanced data distribution
Solution: Check utilization per OSD, then lower the reweight value (between 0 and 1) of the overfull OSD so data moves off it:
ceph osd df tree
ceph osd reweight <osd-number> <weight>
 

Quick Reference Checklist

# 1. Identify problem OSD
ceph health detail
# 2. Find host
ceph osd find <osd-number>
# 3. SSH to host
ssh <hostname>
# 4. Find physical disk
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,FSTYPE
# 5. Check health (NVMe)
nvme smart-log /dev/<device>
# 5. Check health (SATA)
smartctl -a /dev/<device>
# 6. Check OSD performance
ceph osd perf
ceph daemon osd.<number> dump_historic_slow_ops
# 7. Monitor I/O
iostat -x <device> 2 5
# 8. Check logs
journalctl -u ceph-osd@<number> --since "1 hour ago"