pool resize disk removal unknown internal error and no UI counterpart #1722

Fix disk removal timeout failure re "Unknown internal error doing a PUT
.../remove" by asynchronously executing 'btrfs dev remove'. The
pool_balance model was extended to accommodate what are arbitrarily
named (within Rockstor) 'internal' balances: those automatically
initiated upon every 'btrfs dev delete' by the btrfs subsystem itself.
A complication of 'internal' balances is their invisibility via 'btrfs
balance status'. An inference mechanism was thus constructed to 'fake'
the output of a regular balance status so that our existing Web-UI
balance surfacing mechanisms could be extended to serve these 'internal'
variants similarly. The new 'in removal' device state and the above
inference mechanism required that we now track and update devid and
per-device allocation. These were added as Disk model fields and
surfaced appropriately at the pool details level within the Web-UI.

Akin to regular balances, btrfs dev delete 'internal' balances were
found to negatively impact Web-UI interactivity. This was in part
alleviated by refactoring the lowest levels of our disk/pool scan
mechanisms. In essence this refactoring significantly reduces the number
of system and Python calls required to attain the same system-wide
dev/pool info, and simplifies low-level device name handling. Existing
unit tests were employed to aid in this refactoring. Minor additional
code was required to account for regressions (predominantly in LUKS
device name handling) introduced by these low-level device name changes.
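Since 'btrfs balance status' reports nothing for these 'internal' balances, their state has to be inferred. A minimal sketch of the idea (function and field names here are illustrative assumptions, not Rockstor's actual API): poll the per-device allocation reported by 'btrfs fi show' as it drains, and treat the device's disappearance from the pool listing as completion.

```python
# Hypothetical sketch of the inference mechanism; names are illustrative
# assumptions, not Rockstor's actual functions.
def infer_internal_balance(allocated_kb, device_present):
    """Fake a 'btrfs balance status'-like result for an 'internal' balance.

    'btrfs dev delete' migrates chunks off the device, so its allocation
    shrinks toward zero; once the kernel releases the device it vanishes
    from 'btrfs fi show' output, which we read as completion.
    """
    if not device_present:
        # Device no longer listed in the pool: internal balance finished.
        return {'status': 'finished', 'percent_done': 100}
    # Still present: in progress. Total work isn't reported for internal
    # balances, so the Web-UI surfaces progress as 'unavailable' rather
    # than a real percentage; allocated_kb only shows direction of travel.
    return {'status': 'running', 'percent_done': 0}
```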

Summary:

- Execute device removal asynchronously.
- Monitor the consequent 'internal' balances by existing mechanisms
where possible.
- Only remove pool members' pool associations once their associated
'internal' balance has finished.
- Improve low-level efficiency/clarity re device/pool scanning by moving
to a single call of the lighter get_dev_pool_info() rather than calling
the slower get_pool_info() once per btrfs disk; get_pool_info() is
retained for pool import duties as its structure is ideally suited to
that task. Multiple prior temp_name/by-id conversions are also avoided.
- Improve user messaging re system performance / Web-UI responsiveness
during a balance operation, regular or 'internal'.
- Fix bug re reliance on "None" disk label, removing a fragility
concerning disk/pool association within the Web-UI.
- Improve auto pool labeling subsystem by abstracting and generalising
ready for pool renaming capability.
- Improve pool uuid tracking and add related Web-UI element.
- Add Web-UI element in balance status tab to identify regular or
'internal' balance type.
- Add devid tracking and related Web-UI element.
- Use devid Disk model info to ascertain pool info for detached disks.
- Add per device allocation tracking and related Web-UI element.
- Fix prior TODO: re btrfs in partition failure point introduced in git
tag 3.9.2-32.
- Fix prior TODO: re unlabeled pools caveat.
- Add pool details disks table 'Page refresh required' indicator keyed
from devid=0.
- Add guidance on common detached disk removal reboot requirement.
- Remove a low level special case for LUKS dev matching (mapped devices)
which affected the performance of all dev name by-id lookups.
- Add TODO re removing legacy formatted disk raid role pre openSUSE
move.
- Update scan_disks() unit tests for new 'path included' output.
- Address TODO in scan_disks() unit tests and normalise on pre-sort
method.
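The single-call scan referenced above can be pictured as one parse of 'btrfs fi show' output into a device-keyed map, replacing a per-disk call loop. The sketch below is an assumption about the shape of that refactor (field positions follow a typical 'btrfs fi show --raw' layout), not the actual get_dev_pool_info() implementation:

```python
def get_dev_pool_info_sketch(fi_show_lines):
    """Parse 'btrfs fi show --raw'-style lines once into a map of
    device path -> pool/device info (assumed field layout)."""
    info = {}
    label = uuid = None
    for line in fi_show_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == 'Label:':
            # e.g. Label: 'rock-pool'  uuid: 2a4b...
            label, uuid = fields[1].strip("'"), fields[3]
        elif fields[0] == 'devid':
            # e.g. devid 1 size 5368709120 used 2157969408 path /dev/sda
            info[fields[7]] = {'label': label, 'uuid': uuid,
                               'devid': int(fields[1]),
                               'allocated': int(fields[5])}
    return info
```

One pass yields info for every attached device of every pool, so callers index the map instead of shelling out per disk.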
phillxnet committed Jan 15, 2019
1 parent 33917f1 commit f5cd593
Showing 18 changed files with 871 additions and 414 deletions.
289 changes: 242 additions & 47 deletions src/rockstor/fs/btrfs.py

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions src/rockstor/scripts/flash_optimize.py
@@ -73,7 +73,7 @@ def trim_support(disk):
run_command([YUM, 'install', '-y', 'hdparm'])
logging.info('Installed hdparm successfully')

o, e, rc = run_command(['hdparm', '-I', '/dev/%s' % disk])
o, e, rc = run_command(['hdparm', '-I', '{}'.format(disk)])
for l in o:
if (re.search('Data Set Management TRIM supported', l) is not None):
logging.debug('TRIM supported. info: %s' % l)
@@ -84,7 +84,7 @@ def trim_support(disk):

def is_flash(disk):
flash = False
o, e, rc = run_command(['udevadm', 'info', '--path=/sys/block/%s' % disk])
o, e, rc = run_command(['udevadm', 'info', '--name', disk])
for l in o:
if (re.search('ID_BUS=', l) is not None):
if (l.strip().split()[1].split('=')[1] != 'usb'):
@@ -98,6 +98,8 @@ def is_flash(disk):
# /sys/block/disk/queue/rotational is not reliable, but if [deadline] is in
# /sys/block/disk/queue/scheduler, it's fair to assume flash
logging.debug('Checking if scheduler is set to [deadline] for %s' % disk)
disk = disk.split('/')[-1] # strip off the path
# Note that the following may fail for sys on luks dev.
with open('/sys/block/%s/queue/scheduler' % disk) as sfo:
for l in sfo.readlines():
if (re.search('\[deadline\]', l) is not None):
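The udevadm change above matters because disk names now carry their full path: 'udevadm info --name /dev/sda' (or a /dev/mapper path) resolves directly, where the old '--path=/sys/block/<name>' form required the bare kernel name and broke for mapped devices. A small illustrative parser for the resulting property output (the helper name is an assumption, not code from this commit):

```python
import re


def id_bus_is_usb(udevadm_lines):
    """Return True if udevadm property output reports ID_BUS=usb.

    Expects lines such as 'E: ID_BUS=ata' from 'udevadm info --name <dev>'.
    """
    for line in udevadm_lines:
        match = re.search(r'ID_BUS=(\w+)', line)
        if match:
            return match.group(1) == 'usb'
    return False
```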
29 changes: 29 additions & 0 deletions src/rockstor/storageadmin/migrations/0008_auto_20190115_1637.py
@@ -0,0 +1,29 @@
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('storageadmin', '0007_auto_20181210_0740'),
]

operations = [
migrations.AddField(
model_name='disk',
name='allocated',
field=models.BigIntegerField(default=0),
),
migrations.AddField(
model_name='disk',
name='devid',
field=models.PositiveSmallIntegerField(default=0),
),
migrations.AddField(
model_name='poolbalance',
name='internal',
field=models.BooleanField(default=False),
),
]
8 changes: 7 additions & 1 deletion src/rockstor/storageadmin/models/disk.py
@@ -44,8 +44,14 @@ class Disk(models.Model):
mostly derived from model and serial number.
"""
name = models.CharField(max_length=128, unique=True)
"""total size in KB"""
"""btrfs devid 0 is place holder as real devids start from 1"""
devid = models.PositiveSmallIntegerField(default=0) # 0 to 32767
"""total size in KB. Zero if btrfs device detached/last stage of delete."""
size = models.BigIntegerField(default=0)
"""allocated in KB: ie per device 'used' in 'btrfs fi show' and total
listed per device in 'btrfs fi usage /mnt_pt'.
"""
allocated = models.BigIntegerField(default=0)
"""true if disk went offline"""
offline = models.BooleanField(default=False)
"""whether the disk is partitioned at the moment. relevent for root disks
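The new allocated and devid fields drive a pool-details display where a size of zero flags a detached member or a removal in progress. The display rule is roughly as follows (an illustrative Python mirror of the Web-UI helper's logic, not shipped code):

```python
def allocated_percent(allocated_kb, size_kb):
    """Render per-device allocation as the Web-UI does (illustrative).

    One decimal place of % is roughly 1 GB per TB, the normal btrfs
    allocation unit.
    """
    if size_kb == 0:
        # Zero size flags a detached device or a removal in progress.
        return 'Missing or removal in progress'
    return '{:.1f}%'.format(allocated_kb / size_kb * 100)
```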
2 changes: 2 additions & 0 deletions src/rockstor/storageadmin/models/pool_balance.py
@@ -30,6 +30,8 @@ class PoolBalance(models.Model):
start_time = models.DateTimeField(auto_now=True)
end_time = models.DateTimeField(null=True)
percent_done = models.IntegerField(default=0)
# Flag to denote internal auto initiated balance ie during dev delete.
internal = models.BooleanField(default=False)

class Meta:
app_label = 'storageadmin'
@@ -30,6 +30,7 @@
{{else}}
<span style="color:red">{{model.mount_status}}</span>
{{/if}}
</strong>
</strong><br/>
UUID: <strong>{{model.uuid}}</strong>
</div> <!-- module-content -->

@@ -26,9 +26,11 @@
<thead>
<tr>
<th scope="col" abbr="Id">Id</th>
<th scope="col" abbr="Status">Status</th>
<th scope="col" abbr="Status">Status <i class="fa fa-info-circle" title="Both balance types: unknown, running, finished, failed.&#013Regular only: cancelling, cancelled, pausing, paused." /> </th>
<th scope="col" abbr="Type">Type <i class="fa fa-info-circle" title="Regular or Disk Removal" /></th>
<th scope="col" abbr="STime">Start Time</th>
<th scope="col" abbr="Updates">Percent finished</th>
<!-- <th scope="col" abbr="ETime">End Time</th> -->
<th scope="col" abbr="Updates">Percent finished <i class="fa fa-info-circle" title="Progress unavailable for Disk Removal." /></th>
<th scope="col" abbr="Infos">Errors or Notes</th>
</tr>
</thead>
@@ -6,6 +6,17 @@
</div>
<br>
<div class="alert alert-success">
<h4>Resize initiated - disk addition or raid change entails a subsequent Rockstor visible balance which may take several hours. Check status in the Balances tab. Disk delete progress is currently unmonitored.</h4>
<h4>Resize initiated - the associated balance can take several hours to complete and usually negatively impacts system performance.
<br>Check the Balances tab for status. A page refresh will be required.
</h4>
</div>
<div class="alert alert-warning">
<h4>
<strong>Expect reduced Web-UI responsiveness until this balance has finished.</strong>
<br>Removed disks will have progressively smaller Allocated GB until they are finally removed.
<br>Please note: a detached disk removal can fail with status ending:
<br><i>"... no missing devices found to remove"</i>
<br> If you encounter this error see the <u>Maintenance required</u> section for guidance then try again after a reboot.
</h4>
</div>

@@ -42,7 +42,9 @@
<tr>
<th scope="col" abbr="Name">Name</th>
<th scope="col" abbr="Temp Name">Temp Name</th>
<th scope="col" abbr="Btrfs DevID">Btrfs DevID</th>
<th scope="col" abbr="Capacity">Capacity</th>
<th scope="col" abbr="Allocated (%)">Allocated (%)</th>
<th scope="col" abbr="write_io_errs">Write I/O errors</th>
<th scope="col" abbr="read_io_errs">Read I/O errors</th>
<th scope="col" abbr="flush_io_errs">Flush I/O errors</th>
@@ -66,7 +68,11 @@
<td>
{{this.temp_name}}
</td>
<td>
{{btrfsDevID this.devid}}
</td>
<td>{{humanReadableSize this.size}}</td>
<td>{{humanReadableAllocatedPercent this.allocated this.size}}</td>
{{ioErrorStatsTableData this.io_error_stats}}
</tr>
{{/each}}
@@ -91,7 +97,11 @@
<td>
{{this.temp_name}}
</td>
<td>
{{btrfsDevID this.devid}}
</td>
<td>{{humanReadableSize this.size}}</td>
<td>{{humanReadableAllocatedPercent this.allocated this.size}}</td>
{{ioErrorStatsTableData this.io_error_stats}}
</tr>
{{/each}}
@@ -263,7 +263,7 @@ PoolDetailsLayoutView = RockstorLayoutView.extend({
if (confirm('If any detached members are listed use the Resize/ReRaid button - "Remove disks" option instead. Click OK only if "(Some Missing)" and no "detached-..." appear in the Pool page Disks sub-section?')) {
var raid_level = _this.pool.get('raid');
var disk_names = ['missing'];
var delete_missing_msg = ('Delete missing is initiated (can take several hours), a progress report is currently unavailable. Balance attempts are blocked for this period.');
var delete_missing_msg = ('Delete missing initiated - associated balance can take several hours and negatively impact system performance. Check Balances tab for status.');
$.ajax({
url: url,
type: 'PUT',
@@ -487,6 +487,29 @@ PoolDetailsLayoutView = RockstorLayoutView.extend({
return humanize.filesize(size * 1024);
});

Handlebars.registerHelper('humanReadableAllocatedPercent', function(allocated, size) {
var html = '';
html += humanize.filesize(allocated * 1024);
// One decimal place of % = 1 GB per TB = normal allocation unit.
if (size == 0) {
// we likely have a disk delete/removal in operation or a
// missing / detached device so flag.
html += '<strong><span style="color:darkred"> Missing or removal in progress </span></strong>';
} else {
html += ' <strong>(' + ((allocated / size) * 100).toFixed(1) + '%)</strong>';
}
return new Handlebars.SafeString(html);
});

Handlebars.registerHelper('btrfsDevID', function(devid){
if (devid !== 0) {
return devid;
}
var html = '<strong><span style="color:darkred"> Page refresh required </span></strong>';
return new Handlebars.SafeString(html);
});

Handlebars.registerHelper('isRoot', function(role){
if (role == 'root') {
return true;
@@ -115,11 +115,31 @@ PoolRebalanceTableModule = RockstorModuleView.extend({
html += '<td>' + poolrebalance.get('id') + '</td>';
html += '<td>' + poolrebalance.get('status') + '</td>';
html += '<td>';
var internal_balance = poolrebalance.get('internal');
var percent_done = poolrebalance.get('percent_done');
if (internal_balance) {
html += 'Disk Removal';
} else {
html += 'Regular';
}
html += '</td>';
html += '<td>';
if (poolrebalance.get('start_time')) {
html += moment(poolrebalance.get('start_time')).format(RS_DATE_FORMAT);
}
html += '</td>';
html += '<td>' + poolrebalance.get('percent_done') + '</td>';
// html += '<td>';
// if (poolrebalance.get('end_time')) {
// html += moment(poolrebalance.get('end_time')).format(RS_DATE_FORMAT);
// }
// html += '</td>';
html += '<td>';
if (percent_done != 100 && internal_balance) {
html += 'unavailable';
} else {
html += percent_done;
}
html += '</td>';
html += '<td>';
if (poolrebalance.get('message') != null) {
html += poolrebalance.get('message');
35 changes: 19 additions & 16 deletions src/rockstor/storageadmin/views/command.py
@@ -23,9 +23,8 @@
from storageadmin.auth import DigestAuthentication
from rest_framework.permissions import IsAuthenticated
from storageadmin.views import DiskMixin
from system.osi import (uptime, kernel_info, get_dev_byid_name,
get_device_path)
from fs.btrfs import (mount_share, mount_root, get_pool_info,
from system.osi import (uptime, kernel_info, get_device_mapper_map)
from fs.btrfs import (mount_share, mount_root, get_dev_pool_info,
pool_raid, mount_snap)
from system.ssh import (sftp_mount_map, sftp_mount)
from system.services import systemctl
@@ -56,6 +55,10 @@ class CommandView(DiskMixin, NFSExportMixin, APIView):
@staticmethod
@transaction.atomic
def _refresh_pool_state():
# Get map of dm-0 to /dev/mapper members ie luks-.. devices.
mapped_devs = get_device_mapper_map()
# Get temp_names (kernel names) to btrfs pool info for attached devs.
dev_pool_info = get_dev_pool_info()
for p in Pool.objects.all():
# If our pool has no disks, detached included, then delete it.
# We leave pools with all detached members in place intentionally.
@@ -70,22 +73,22 @@ def _refresh_pool_state():
continue
try:
# Get and save what info we can prior to mount.
first_attached_dev = p.disk_set.attached().first()
is_root_pool = (p.role == 'root')
# Observe any redirect role by using target_name.
byid_disk_name, is_byid = get_dev_byid_name(
get_device_path(first_attached_dev.target_name))
if is_byid:
pool_info = get_pool_info(first_attached_dev.target_name,
is_root_pool)
pool_name = pool_info['label']
else:
first_dev = p.disk_set.attached().first()
# Use target_name to account for redirect role.
if first_dev.target_name == first_dev.temp_name:
logger.error('Skipping pool ({}) mount as attached disk '
'({}) has no by-id name (no serial # ?)'.
format(p.name,
first_attached_dev.target_name))
format(p.name, first_dev.target_name))
continue
p.name = pool_name
if first_dev.temp_name in mapped_devs:
dev_tmp_name = '/dev/mapper/{}'.format(
mapped_devs[first_dev.temp_name])
else:
dev_tmp_name = '/dev/{}'.format(first_dev.temp_name)
# For now we call get_dev_pool_info() once for each pool.
pool_info = dev_pool_info[dev_tmp_name]
p.name = pool_info.label
p.uuid = pool_info.uuid
p.save()
mount_root(p)
p.raid = pool_raid('%s%s' % (settings.MNT_PT, p.name))['data']
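get_device_mapper_map() supplies the kernel-name (dm-N) to /dev/mapper name translation used in the hunk above. One plausible sketch of such a helper, built from the /dev/mapper symlinks (an assumed implementation, not necessarily Rockstor's):

```python
import os


def get_device_mapper_map(mapper_dir='/dev/mapper'):
    """Map kernel dm names (e.g. 'dm-0') to their /dev/mapper names
    (e.g. 'luks-...') by resolving the symlinks in mapper_dir."""
    dm_map = {}
    for name in os.listdir(mapper_dir):
        if name == 'control':
            # /dev/mapper/control is the device-mapper control node,
            # not a mapped device.
            continue
        target = os.path.realpath(os.path.join(mapper_dir, name))
        # e.g. /dev/mapper/luks-123 -> /dev/dm-0 gives dm_map['dm-0'] = 'luks-123'
        dm_map[os.path.basename(target)] = name
    return dm_map
```

With this map in hand, _refresh_pool_state() can translate a temp_name such as 'dm-0' into the '/dev/mapper/luks-...' key that the pool info dictionary uses.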