Linkstation LS-WVL is unreachable when booted with daily images #17

Open
val-kulkov opened this issue May 31, 2018 · 24 comments

Comments

@val-kulkov

Attention: @rogers0

Booting the Linkstation LS-WVL with the daily images from https://d-i.debian.org/daily-images/armel/daily/kirkwood/network-console/buffalo/ placed in its first ext3-formatted partition (/dev/sda1) does not bring the Linkstation to a point where one can connect to it using "ssh installer@linkstation" to continue the Debian installation.

The Linkstation does not respond to ping requests. The LEDs on the network switch to which the Linkstation is connected are glowing green and therefore a gigabit connection appears to have been established. Nonetheless, the Linkstation is unreachable.

Specifically, these images were used in the tests described above:

initrd.buffalo 2018-05-30 01:05 12M
uImage.buffalo 2018-05-30 01:05 1.9M

I repeated the process with the daily images for May 1, 2018, going back in time as far as I could: https://d-i.debian.org/daily-images/armel/20180501-01:11/kirkwood/network-console/buffalo/ls-wvl/. The result was the same: the Linkstation did not respond to ping requests.

The hard disk I used in the tests was WD Red 4 TB, with the first partition formatted as ext3, size: 1 GB.

I do not have the means to establish a serial connection to the Linkstation console. I am not sure what other tests I can run to investigate this issue further.

Notably, the same Linkstation successfully boots with initrd.buffalo and uImage.buffalo from http://ftp.debian.org/debian/dists/stable/main/installer-armel/current/images/kirkwood/network-console/buffalo/ls-wvl/ (Debian Stretch).

@val-kulkov
Author

I should add the following observations.

If Debian Stretch is installed on a Linkstation using images from http://ftp.debian.org/debian/dists/stable/main/installer-armel/current/images/kirkwood/network-console/buffalo/ls-wvl/ and then dist-upgraded to Buster, the Linkstation becomes unreachable after a reboot.

After the upgrade but before the reboot, I edited /etc/rsyslog.conf to make sure that kernel logs get written to /var/log/dmesg.log, so that the logs can be analysed if the Linkstation does not come online successfully. Well, the Linkstation did not come online successfully. At that point, I powered off the Linkstation, took out the disk, connected it to my workstation and examined the contents of /var/log/. No log files were written during the reboot after the upgrade. The only file in /var/log/ that appeared to have been touched on reboot was /var/log/wtmp, that's it.
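For the record, the rsyslog change was a single rule along these lines (not necessarily my exact line, just the idea; the leading "-" makes the write asynchronous):

kern.*    -/var/log/dmesg.log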

It appears that the current Debian version in trunk has some issue that bricks the Linkstation on reboot. The last Debian kernel I was able to boot my Linkstation into successfully was 4.14.0-3 (circa February 2018).

Note to all current users of OpenLinkstation: if you installed Debian from daily images per the official instructions in this Git repo, do NOT upgrade your Debian to the current version in the trunk (kernel 4.16), at least until the problem is resolved, as it will likely brick your Linkstation.

@rogers0
Owner

rogers0 commented Jun 1, 2018

@val-kulkov thanks for your report!

I tried upgrading my stretch box to the latest 4.16 kernel from backports, and it boots fine.

$ uname -a
Linux LS-VL 4.16.0-0.bpo.1-marvell #1 Debian 4.16.5-1~bpo9+1 (2018-05-06) armv5tel GNU/Linux

I guess it is just a d-i image issue.
I changed some features from built-in to modules around the kernel 4.15 stage, so the d-i config needs to be updated.
I'll keep you updated after I finish that change.

@val-kulkov
Author

@rogers0 : thank you very much for looking into it!

I hope it is just the d-i image issue as you indicated.

There may be something else, however. I tried running the same upgrade using a very old HDD and, unexpectedly, the upgrade completed successfully! Then I tried the other HDDs and got the same result as before:

Hitachi HDS725050KLA360, 500 GB (Feb 2006): success
WD Red WD40EFRX, 4 TB (Jan 2018): Linkstation is unreachable
WD Green WD10EACS, 1 TB (Jan 2009): Linkstation is unreachable

Could it be some race condition on the first boot after upgrade?

I look forward to testing your update. Hopefully it is just the d-i image issue. Thank you!

@val-kulkov
Author

Just checked daily images to see if the problem is still there. "sudo parted /dev/sdi" and then:

mklabel gpt
mkpart boot 2048s 1024MiB
mkpart root 1024MiB 6144MiB
mkpart swap 6144MiB 6656MiB
mkpart data 6656MiB -1
print
quit

After that:

sudo mkfs.ext3 /dev/sdi1
sudo mount /dev/sdi1 /mnt
wget https://d-i.debian.org/daily-images/armel/daily/kirkwood/network-console/buffalo/ls-wvl/initrd.buffalo
wget https://d-i.debian.org/daily-images/armel/daily/kirkwood/network-console/buffalo/ls-wvl/uImage.buffalo
sudo cp *buffalo /mnt/
sudo umount /mnt

Then I insert the prepared disk into the LinkStation and turn the LinkStation on. The LED lights come on green on the network hub, indicating a gigabit link. However, the Linkstation does not even respond to ping requests:

--- cloud10 ping statistics ---
527 packets transmitted, 0 received, +525 errors, 100% packet loss, time 528807ms

I am going to leave it powered on for a few hours and see if anything changes. If it starts responding to ping requests, I'll post an update here.

@nhhuayt

nhhuayt commented Nov 1, 2018

Is there any news? I have this problem too.

@thatguyatgithub

thatguyatgithub commented Nov 9, 2018

Greetings from a thankful user!

I experienced a relatively similar issue with the Debian-Installer daily builds last week.
The installation and everything in it report no issues, but after finishing D-I and rebooting, the unit seems to enter a boot loop, judging by the sound of the fans and the LEDs, which get re-powered every 10 seconds.

In my case, I wanted to run Stretch, which it now does! Flawlessly, if I may add! 👍

I presume this might be related to something on the flash-kernel end, but I've nothing to validate that.

Please don't hesitate to contact me for any kind of tests you might want to run, @rogers0. I'm running a Kirkwood LS-WXL.

@thatguyatgithub

thatguyatgithub commented Nov 10, 2018

I can confirm that updating to both Debian sid's latest 4.18.0-2-marvell kernel and flash-kernel 3.95 does indeed appear to work as expected.

Maybe there's an issue with the partitioning? I'll try to build a UART interface to see what U-Boot says about the D-I installation.

Please don't hesitate to contact me for any kind of tests you might want to run, @rogers0. I'm running a Kirkwood LS-WXL.

Little typo: I'm in fact running an LS-WVL, not an LS-WXL, sorry about that.

@nhhuayt

nhhuayt commented Nov 14, 2018

I installed Debian on my Buffalo NAS successfully.
This is how to do it.
Please note your NAS model (below it is an ls-wvl) and insert the HDD into slot 2 of the dual-bay Linkstation.

  1. Create boot partition
    sudo parted /dev/sdc
    (parted) mklabel gpt
    (parted) mkpart boot 2048s 1024MiB
    (parted) quit
  2. Format and mount the partition
    $ sudo mkfs.ext3 /dev/sdc1
    $ sudo mount /dev/sdc1 /mnt
    $ cd /mnt
    $ sudo wget ftp.debian.org/debian/dists/stable/main/installer-armel/current/images/kirkwood/network-console/buffalo/ls-wvl/uImage.buffalo
    $ sudo wget ftp.debian.org/debian/dists/stable/main/installer-armel/current/images/kirkwood/network-console/buffalo/ls-wvl/initrd.buffalo
    $ cd / && sudo umount /mnt
  3. Connect over SSH and install Debian as you would on a normal PC. I recommend choosing "Guided - use entire disk" when you reach the partitioning step.
    ssh installer@<IP address> with password "install"
  4. Get SSH root access
  5. Change tmpfs size to install apps
    $ nano /etc/fstab
    Add this line
    tmpfs /run tmpfs nosuid,noexec,size=256M,nr_inodes=4096 0 0
  6. Install net-tools
    apt install net-tools

@val-kulkov
Author

Confirming a successful install of the latest daily Debian image on the LS-WVL: 4.18.0-2-marvell #1 Debian 4.18.10-2 (2018-11-03) armv5tel GNU/Linux, but with one very important "but": it takes over 40 minutes for the LinkStation to complete booting. Here is the output of "dmesg | tail -12":

[   10.002201] 0x000000070000-0x000000080000 : "uboot_env"
[   11.459072] Adding 523260k swap on /dev/md2.  Priority:-2 extents:1 across:523260k FS
[   13.369193] EXT4-fs (md0): mounting ext3 file system using the ext4 subsystem
[   13.558505] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
[   33.010300] random: crng init done
[   33.013723] random: 7 urandom warning(s) missed due to ratelimiting
[ 2524.385850] EXT4-fs (md3): mounting ext3 file system using the ext4 subsystem
[ 2524.769672] EXT4-fs (md3): mounted filesystem with ordered data mode. Opts: (null)
[ 2525.671062] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 2525.828230] NET: Registered protocol family 17
[ 2528.161504] mv643xx_eth_port mv643xx_eth_port.0 eth0: link up, 1000 Mb/s, full duplex, flow control disabled
[ 2528.171421] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

It may be that the LinkStation performs fsck on every reboot now. I have no way to confirm that because I cannot set up a serial connection to the LinkStation's serial interface. dmesg output does not provide the necessary details. I have two ext3-formatted WD Red 4TB disks assembled as RAID1. See the details of my partition setup earlier in this thread. The size of my data partition is almost 4TB. If fsck runs on every reboot, then this explains the delay. I should make it clear that the boot delay occurs not only on the first boot after installation, but on every subsequent reboot.

Since my LinkStation was not coming online for a long time after the Debian installation, it looked to me like the installs were failing. It was only when I got distracted and left the LinkStation powered on for a couple of hours after a Debian install that I discovered it was finally online. It is possible that all my "unsuccessful" attempts to install Debian were in fact successful.

I should note that the RAID setup probably makes no difference to the boot delay. When trying to install Debian on the system with just one WD Red 4TB, the installation process seemed to have completed successfully, but then I did not see the LinkStation coming online for about 5 minutes, which I took as a sign that the installation had failed.

What causes the boot delay is a mystery to me. Again, without a serial connection I can't do much to investigate it. Any suggestions will be much appreciated.

@val-kulkov
Author

My hunch about fsck was probably wrong. When I ssh into the LS-WVL as root and run fsck on the unmounted 3.6 TB data partition, it takes about 17 seconds for fsck to complete. So something else must be causing the 40+ minute boot delay.
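For reference, the kind of check I ran looks roughly like this (a sketch; I am assuming the data array is /dev/md3, as in the dmesg output above):

umount /dev/md3               # fsck needs the file system unmounted
time fsck.ext3 /dev/md3       # regular check: about 17 seconds here
time fsck.ext3 -f /dev/md3    # forced full check, closer to what the slow boots appear to be doing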

@nhhuayt

nhhuayt commented Nov 19, 2018

When you boot and connect with installer@IP, please note that you must choose "Guided - use entire disk" to wipe out anything on the disk that prevents booting.
Because the Debian installer is loaded into memory, you don't need to worry about wiping the disk.

@val-kulkov
Author

@nhhuayt : what do you mean by "anything on the disk that prevents booting"? A broken MBR? A broken partition?

@rogers0
Owner

rogers0 commented Dec 23, 2018

@val-kulkov finally I tried the installer image on my LS-WVL box.
Both stretch and latest buster images boot well.

  • /debian/dists/buster/main/installer-armel/20181206/images/kirkwood/network-console/buffalo/ls-wvl/
  • /debian/dists/stretch/main/installer-armel/20170615+deb9u5/images/kirkwood/network-console/buffalo/ls-wvl/

I can ping, and ssh to the box.

ssh installer@<IP address>

It shouldn't have a broken file system if you use a new HDD or re-partition the whole disk.

@val-kulkov
Author

@rogers0 : yes, as of 2018-11-16 (maybe earlier, I did not try earlier installer images), ssh installer@ works again. Please see my 2018-11-16 post above.

The problem now is that it takes more than 40 minutes to reboot the LS-WVL. I wrote a script that pings the LS-WVL every minute while it reboots to see how long it takes to come online. My LS-WVL starts responding to ping requests after about 43 minutes. This behaviour is consistent. I wish I had access to the serial console to see what is going on; 'dmesg' does not provide useful information.
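The script is nothing fancy; roughly this (a sketch, assuming the box answers to the hostname cloud10 used above):

#!/bin/sh
# ping the LinkStation once a minute and report how long it takes to come up
host=cloud10
start=$(date +%s)
until ping -c 1 -W 2 "$host" > /dev/null 2>&1; do
    echo "$(date): $host still down"
    sleep 60
done
echo "$host came up after $(( ($(date +%s) - start) / 60 )) minutes"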

@rogers0
Owner

rogers0 commented Dec 24, 2018

@val-kulkov I think a serial console is not easy to set up on the LS-WVL.

But you can try netconsole, which you can get the dmesg output over the network.
Please refer: https://forum.doozan.com/read.php?2,9522,9702

You need to:

  • edit /etc/initramfs-tools/modules, add:
mv643xx_eth
netconsole 6666@<LinkStation IP>/,6666@192.168.11.1/
mvmdio
marvell
  • update initrd.buffalo by: update-initramfs -u
  • please ignore the mkimage command from that post, because it is already handled by the flash-kernel process.
  • reboot linkstation
  • on your PC side, which has IP 192.168.11.1, get linkstation dmesg output by: nc -l -u -p 6666 | tee ~/netconsole.log
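Putting the steps above together, the whole sequence is roughly this (a sketch; replace <LinkStation IP> with the box's own address, and use the explicit netconsole=... parameter form if modprobe rejects the bare value):

# on the LinkStation
cat >> /etc/initramfs-tools/modules << 'EOF'
mv643xx_eth
netconsole 6666@<LinkStation IP>/,6666@192.168.11.1/
mvmdio
marvell
EOF
update-initramfs -u        # flash-kernel regenerates initrd.buffalo, no manual mkimage needed
reboot

# on the PC side (192.168.11.1)
nc -l -u -p 6666 | tee ~/netconsole.log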

@rogers0
Owner

rogers0 commented Dec 24, 2018

@val-kulkov I see I have already described the netconsole setup on the slides, pages 18-19.

@val-kulkov
Author

@rogers0 : thank you for the reminder about netconsole. I considered using it when I reported the issue, but back then I was unable to log in to the LS-WVL at all, and therefore I could not edit /etc/initramfs-tools/modules and run update-initramfs -u.

With 4.18.0-3-marvell #1 Debian 4.18.20-2 (2018-11-23), I can log in to the LS-WVL and enable netconsole, but I am not getting any output from it at all. Apparently, netconsole is loaded too early, when eth0 does not yet exist. See line 164 in the dmesg output.

The dmesg output shows that eth0 is created about 9 seconds after boot: line 213. However, the Ethernet link does not come up until about 44 minutes after boot. During this time, the link LED for the LS-WVL port on the network switch is off and the LS-WVL does not respond to ping requests.

I wonder if there is a way to make U-Boot more verbose and capture its output into some log file during the boot process? Alternatively, I wonder if it is possible to delay netconsole loading by about 10 seconds?
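One thing I may try (just a guess on my part, not something suggested so far): skip the initramfs hook entirely and load netconsole from a late boot script, once eth0 exists, for example:

# e.g. from /etc/rc.local or a systemd unit ordered after network-online.target
modprobe netconsole netconsole=6666@<LinkStation IP>/eth0,6666@192.168.11.1/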

@val-kulkov
Author

It turns out that clock reset on reboot is the root cause of the boot delay. Since the LinkStation has no hardware RTC, the system clock is set to Unix epoch on reboot but then early in the boot process systemd advances the clock to the OS build time. Here is the record of it from dmesg:

[    4.789191] systemd[1]: System time before build time, advancing clock.

In my case, the system build time is 'Dec 21 13:53'. Later in the boot process when systemd gets to perform file system checks, the system time is weeks in the past because the NTP time synchronization is performed after the file system checks have been completed. This trips fsck because it checks the drive's superblock last mount time and last write time and finds that both are in the future. fsck sees this as a problem with the file system and therefore decides to perform a full (forced) file system check that takes much more time than a regular file system check. In my case, the full file system check takes 11 minutes for a 1 TB drive and 44 minutes for a 4 TB drive. A regular fsck of the 1 TB drive takes about 6 seconds.

root@cloud10:~# journalctl -u systemd-fsck*
-- Logs begin at Fri 2018-12-21 13:53:35 EST, end at Thu 2019-01-17 12:34:10 EST. --
Dec 21 13:53:38 cloud10 systemd[1]: Starting File System Check on /dev/disk/by-uuid/0287ec1e-341b-4763-9f7e-427ec6fff75e...
Dec 21 13:53:38 cloud10 systemd[1]: Starting File System Check on /dev/disk/by-uuid/2f4a6cf2-21ab-482a-9ad8-1bc53eb9dac0...
Dec 21 13:53:38 cloud10 systemd[1]: Started File System Check Daemon to report status.
Dec 21 13:53:40 cloud10 systemd-fsck[191]: home: Superblock last mount time is in the future.
Dec 21 13:53:40 cloud10 systemd-fsck[191]:         (by less than a day, probably due to the hardware clock being incorrectly set)
Dec 21 13:53:40 cloud10 systemd-fsck[191]: home: Superblock last write time (Thu Jan 17 12:00:40 2019,
Dec 21 13:53:40 cloud10 systemd-fsck[191]:         now = Fri Dec 21 13:53:38 2018) is in the future.
Dec 21 13:53:40 cloud10 systemd-fsck[191]: FIXED.
Dec 21 14:04:34 cloud10 systemd-fsck[191]: home: 22/60628992 files (0.0% non-contiguous), 4640040/242486528 blocks
Dec 21 14:04:34 cloud10 systemd[1]: Started File System Check on /dev/disk/by-uuid/0287ec1e-341b-4763-9f7e-427ec6fff75e.
Dec 21 14:04:34 cloud10 systemd-fsck[193]: boot: Superblock last write time (Thu Jan 17 12:21:18 2019,
Dec 21 14:04:34 cloud10 systemd-fsck[193]:         now = Fri Dec 21 14:04:34 2018) is in the future.
Dec 21 14:04:34 cloud10 systemd-fsck[193]: FIXED.
Dec 21 14:04:35 cloud10 systemd-fsck[193]: boot: 29/65536 files (27.6% non-contiguous), 13674/261888 blocks
Dec 21 14:04:35 cloud10 systemd[1]: Started File System Check on /dev/disk/by-uuid/2f4a6cf2-21ab-482a-9ad8-1bc53eb9dac0.
Jan 17 12:21:45 cloud10 systemd[1]: systemd-fsckd.service: Succeeded.
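For anyone who wants to check whether their disks are in the same state, the superblock timestamps that fsck compares against the system clock can be read directly (a sketch; I am assuming the data array is /dev/md3, as in the dmesg output above):

dumpe2fs -h /dev/md3 | grep -iE 'mount time|write time'
date    # the system time, which starts out at the build time after a reboot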

Installing "fake-hwclock" package seems to solve this problem. How fake-hwclock does it is a bit of a mystery to me, because fake-hwclock.service appears to be dependent on local-fs.target, which involves running fsck: see the output of systemctl list-dependencies. Anyway, enough time already spent on solving this problem, I am leaving the fake-hwclock mystery for another day.

The problem that was initially reported here where LinkStation was unreachable for hours could be related to fsck, too. Debian Bug report log #878843 described a problem that was somewhat similar to the one initially reported here.

@nwizard74

I have the same issue with my LS-WVL: a boot loop after the first step, debootstrap. I use Wireshark as a netconsole receiver and can see Debian start to boot; then, after some garbled output followed by
[   14.891060] sd 0:0:0:0: [sda] Synchronizing SCSI cache
it goes back into a reboot.
Any suggestions? How can I start debootstrap again? Remove the HDD and connect it to a Debian PC? I have two 2 TB WD drives in RAID0.

@1000001101000

1000001101000 commented Jun 19, 2019

@val-kulkov @nwizard74 @rogers0 @nhhuayt

I believe the issue with the system taking so long to come up is due to a lack of entropy for the RNG. For security reasons, Debian has been disabling CONFIG_RANDOM_TRUST_CPU in its kernels since around the same time you started seeing the issue. I've resolved this for my devices by installing haveged.

This issue is discussed in detail here:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=898814
https://daniel-lange.com/archives/152-Openssh-taking-minutes-to-become-available,-booting-takes-half-an-hour-...-because-your-server-waits-for-a-few-bytes-of-randomness.html

You can read more about haveged here:
http://www.issihosts.com/haveged/faq.html
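A minimal way to apply and verify the fix (a sketch, assuming a stock Debian install):

apt install haveged
systemctl status haveged                      # should be active (running)
cat /proc/sys/kernel/random/entropy_avail     # should climb well above a few hundred bits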

@1000001101000

This has also been the cause of the installer images failing for me on armhf devices. When the netconsole installer starts, it tries to generate a private key for the SSH server, fails because of a lack of entropy, and then hangs forever.

For the armhf installer, I added a call to rngd from the rng-tools package (easier to embed than haveged) to generate additional entropy before the installer tries to start sshd. I bet the same thing is happening with the armel installer.
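I am not reproducing my installer change here, but the call is essentially of this form (a sketch; feeding /dev/urandom back into the pool is a stop-gap for the installer environment only, not something to leave on an installed system):

# from the rng-tools package; run before sshd generates its host keys
rngd -r /dev/urandom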

@val-kulkov
Author

@1000001101000 : your observations may explain why the LinkStation is unreachable over ssh for a long time, or forever. But the network subsystem should not depend on sshd. Therefore the LinkStation should still pick up an IP address from a DHCP server and respond to ping requests, correct?

@1000001101000

I just uninstalled haveged on my ls421de and rebooted. It's behaving more or less as you describe: it grabbed an IP address via DHCP and responds to ping, but it still hasn't finished starting sshd after 15+ minutes.

I tried booting an installer image without the rngd modification and it never came online. It's possible I messed something up, but it matches my memory from when I was originally testing this. I believe the script that generates the private key for sshd runs before the network interfaces are started, which results in the device never joining the network.

If I get some time in the next couple of days I'll pull out my ls-wvl and try this out. If I get the same failure I'll try adding the same rngd modification to the armel installer and see if that solves it.

@1000001101000

I just loaded the daily installer image on an ls-wxl and it booted without issue.

It failed to load most of the modules needed for detecting disks, which is rather frustrating; unfortunately, the installer images for testing often have issues like that.

It seems the issues I experienced when first testing the installer under 4.19 somehow do not apply to the current one.
