feat(ansible_snippets#Filter json data): Filter json data
To select a single element or a data subset from a complex data structure in JSON format (for example, Ansible facts), use the `community.general.json_query` filter. The `community.general.json_query` filter lets you query a complex JSON structure and iterate over it using a loop structure. This filter is built upon jmespath, and you can use the same syntax. For examples, see [jmespath examples](http://jmespath.org/examples.html). A complex example would be: ```yaml "{{ ec2_facts | json_query('instances[0].block_device_mappings[?device_name!=`/dev/sda1` && device_name!=`/dev/xvda`].{device_name: device_name, id: ebs.volume_id}') }}" ``` This snippet: - Gets all dictionaries under the block_device_mappings list which `device_name` is not equal to `/dev/sda1` or `/dev/xvda` - From those results it extracts and flattens only the desired values. In this case `device_name` and the `id` which is at the key `ebs.volume_id` of each of the items of the block_device_mappings list. feat(ansible_snippets#Do asserts): Do asserts ```yaml - name: After version 2.7 both 'msg' and 'fail_msg' can customize failing assertion message ansible.builtin.assert: that: - my_param <= 100 - my_param >= 0 fail_msg: "'my_param' must be between 0 and 100" success_msg: "'my_param' is between 0 and 100" ``` feat(ansible_snippets#Split a variable in ansible ): Split a variable in ansible ```yaml {{ item | split ('@') | last }} ``` feat(ansible_snippets#Get a list of EC2 volumes mounted on an instance an their mount points): Get a list of EC2 volumes mounted on an instance an their mount points Assuming that each volume has a tag `mount_point` you could: ```yaml - name: Gather EC2 instance metadata facts amazon.aws.ec2_metadata_facts: - name: Gather info on the mounted disks delegate_to: localhost block: - name: Gather information about the instance amazon.aws.ec2_instance_info: instance_ids: - "{{ ansible_ec2_instance_id }}" register: ec2_facts - name: Gather volume tags amazon.aws.ec2_vol_info: 
filters: volume-id: "{{ item.id }}" # We exclude the root disk as they are already mounted and formatted loop: "{{ ec2_facts | json_query('instances[0].block_device_mappings[?device_name!=`/dev/sda1` && device_name!=`/dev/xvda`].{device_name: device_name, id: ebs.volume_id}') }}" register: volume_tags_data - name: Save the required volume data set_fact: volumes: "{{ volume_tags_data | json_query('results[0].volumes[].{id: id, mount_point: tags.mount_point}') }}" - name: Display volumes data debug: msg: "{{ volumes }}" - name: Make sure that all volumes have a mount point assert: that: - item.mount_point is defined - item.mount_point|length > 0 fail_msg: "Configure the 'mount_point' tag on the volume {{ item.id }} on the instance {{ ansible_ec2_instance_id }}" success_msg: "The volume {{ item.id }} has the mount_point tag well set" loop: "{{ volumes }}" ``` feat(ansible_snippets#Create a list of dictionaries using ansible): Create a list of dictionaries using ansible ```yaml - name: Create and Add items to dictionary set_fact: userdata: "{{ userdata | default({}) | combine ({ item.key : item.value }) }}" with_items: - { 'key': 'Name' , 'value': 'SaravAK'} - { 'key': 'Email' , 'value': '[email protected]'} - { 'key': 'Location' , 'value': 'Coimbatore'} - { 'key': 'Nationality' , 'value': 'Indian'} ``` feat(ansible_snippets#Merge two dictionaries on a key ): Merge two dictionaries on a key If you have these two lists: ```yaml "list1": [ { "a": "b", "c": "d" }, { "a": "e", "c": "f" } ] "list2": [ { "a": "e", "g": "h" }, { "a": "b", "g": "i" } ] ``` And want to merge them using the value of key "a": ```yaml "list3": [ { "a": "b", "c": "d", "g": "i" }, { "a": "e", "c": "f", "g": "h" } ] ``` If you can install the collection community.general use the filter lists_mergeby. 
The expression below gives the same result:

```yaml
list3: "{{ list1|community.general.lists_mergeby(list2, 'a') }}"
```

feat(ages): Add 2024 Hidden Cup 5 awesome match

[Semifinal Viper vs Lierey](https://yewtu.be/watch?v=Ol-mqMeQ7OQ)

feat(anticolonialism#poems): Add Rafeef Ziadah pro-palestine poem

[Rafeef Ziadah - "Nosotros enseñamos vida, señor"](https://www.youtube.com/watch?v=neYO0kJ-6XQ)

feat(bash_snippets#Compare two semantic versions): Compare two semantic versions

[This article](https://www.baeldung.com/linux/compare-dot-separated-version-string) gives a lot of ways to do it. For my case the simplest is to use `dpkg` to compare two strings in dot-separated version format in bash.

```bash
Usage: dpkg --compare-versions <ver1> <condition> <ver2>
```

If the condition is `true`, the status code returned by `dpkg` will be zero (indicating success). So, we can use this command in an `if` statement to compare two version numbers:

```bash
$ if dpkg --compare-versions "2.11" "lt" "3"; then echo true; else echo false; fi
true
```

feat(bash_snippets#Exclude list of extensions from find command ): Exclude list of extensions from find command

```bash
find . -not \( -name '*.sh' -o -name '*.log' \)
```

feat(python_snippets#Investigate a class attributes): Investigate a class attributes

[Investigate a class attributes with inspect](https://docs.python.org/3/library/inspect.html)

feat(python_snippets#Expire the cache of the lru_cache): Expire the cache of the lru_cache

The `lru_cache` decorator caches forever. One way to prevent it is to add one more parameter to your expensive function: `ttl_hash=None`. This new parameter is a so-called "time-sensitive hash"; its only purpose is to affect `lru_cache`. For example:

```python
from functools import lru_cache
import time


@lru_cache()
def my_expensive_function(a, b, ttl_hash=None):
    del ttl_hash  # to emphasize we don't use it and to shut pylint up
    return a + b  # horrible CPU load...


def get_ttl_hash(seconds=3600):
    """Return the same value within `seconds` time period"""
    return round(time.time() / seconds)


res = my_expensive_function(2, 2, ttl_hash=get_ttl_hash())
```

feat(sql#Get the last row of a table ): Get the last row of a table

```sql
SELECT * FROM Table ORDER BY ID DESC LIMIT 1
```

feat(ecc): Introduce ECC RAM

[Error Correction Code](https://www.memtest86.com/ecc.htm) (ECC) is a mechanism used to detect and correct errors in memory data due to environmental interference and physical defects. ECC memory is used in high-reliability applications that cannot tolerate failure due to corrupted data.

**Installation**

Due to the additional circuitry required for ECC protection, specialized ECC hardware support is required by the CPU chipset, motherboard and DRAM module. This includes the following:

- Server-grade CPU chipset with ECC support (Intel Xeon, AMD Ryzen)
- Motherboard supporting ECC operation
- ECC RAM

Consult the motherboard and/or CPU documentation for the specific model to verify whether the hardware supports ECC. Use the vendor-supplied list of certified ECC RAM, if provided.

Most ECC-supported motherboards allow you to configure ECC settings from the BIOS setup. They are usually on the Advanced tab. The specific option depends on the motherboard vendor or model, such as the following:

- DRAM ECC Enable (American Megatrends, ASUS, ASRock, MSI)
- ECC Mode (ASUS)

**Monitorization**

The mechanism for how ECC errors are logged and reported to the end-user depends on the BIOS and operating system. In most cases, corrected ECC errors are written to system/event logs. Uncorrected ECC errors may result in kernel panic or blue screen.

The Linux kernel supports reporting ECC errors for ECC memory via the EDAC (Error Detection And Correction) driver subsystem.
Depending on the Linux distribution, ECC errors may be reported by the following:

- [`rasdaemon`](rasdaemon.md): monitor ECC memory and report both correctable and uncorrectable memory errors on recent Linux kernels.
- `mcelog` (Deprecated): collects and decodes MCA error events on x86.
- `edac-utils` (Deprecated): fills DIMM labels data and summarizes memory errors.

To configure rasdaemon follow [this article](rasdaemon.md).

**Confusion on boards supporting ECC**

I've read that even if some motherboards say that they "Support ECC", some of them don't do anything with it.

[This post](https://forums.servethehome.com/index.php?threads/has-anyone-gotten-ecc-logging-rasdaemon-edac-whea-etc-to-work-on-xeon-w-1200-or-w-1300-or-core-12-or-13-gen-processors.39257/) and the [kernel docs](https://www.kernel.org/doc/html/latest/firmware-guide/acpi/apei/einj.html) show that you should see references to ACPI/WHEA in the specs manual, ideally ACPI5 support.

From the kernel docs: EINJ provides a hardware error injection mechanism. It is very useful for debugging and testing APEI and RAS features in general.

You need to check whether your BIOS supports EINJ first. For that, look for early boot messages similar to this one:

```
ACPI: EINJ 0x000000007370A000 000150 (v01 INTEL 00000001 INTL 00000001)
```

This shows that the BIOS is exposing an EINJ table, which is the mechanism through which the injection is done.

Alternatively, look in `/sys/firmware/acpi/tables` for an "EINJ" file, which is a different representation of the same thing.

If neither of those exists, it doesn't necessarily mean that EINJ is not supported: before you give up, go into the BIOS setup to see if the BIOS has an option to enable error injection. Look for something called `WHEA` or similar. Often, you need to enable an `ACPI5` support option first, in order to see the `APEI`, `EINJ`, ... functionality supported and exposed by the BIOS menu.
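The `/sys/firmware/acpi/tables` check described above can be scripted. This is only a minimal sketch: the function name `einj_table_exposed` is made up for illustration, and a missing file does not prove that the hardware lacks EINJ support (the BIOS option may just be disabled).

```python
from pathlib import Path


def einj_table_exposed(tables_dir: str = "/sys/firmware/acpi/tables") -> bool:
    """Return True if an EINJ table file exists in the ACPI tables directory."""
    return (Path(tables_dir) / "EINJ").exists()


if __name__ == "__main__":
    if einj_table_exposed():
        print("EINJ table found: the BIOS exposes error injection")
    else:
        # Hypothetical hint text, mirroring the advice in the notes above.
        print("No EINJ table: check the BIOS setup for a WHEA/ACPI5 option")
```

The directory is a parameter so the check can be pointed at a copy of the tables or at a test directory.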
To use `EINJ`, make sure the following options are enabled in your kernel configuration:

```
CONFIG_DEBUG_FS
CONFIG_ACPI_APEI
CONFIG_ACPI_APEI_EINJ
```

One way to test it can be to run [memtest](memtest.md) as it sometimes [shows ECC errors](https://forum.level1techs.com/t/asrock-taichi-x570-ecc-options-no-longer-in-bios/178045) such as `** Warning** ECC injection may be disabled for AMD Ryzen (70h-7fh)`.

Other people ([1](https://www.memtest86.com/ecc.htm), [2](https://www.reddit.com/r/ASRock/comments/jlsw5z/x570_pro4_correctable_ecc_errors_no_response_from/)) say that there are a lot of motherboards that NEVER report any corrected errors to the OS. In order to see corrected errors, PFEH (Platform First Error Handling) has to be disabled. On some motherboards and firmware versions this setting is hidden from the user and always enabled, which results in zero correctable errors being reported.

[They also suggest](https://www.memtest86.com/ecc.htm) disabling "Quick Boot". In order to initialize ECC, memory has to be written before it can be used. Usually this is done by the BIOS, but with some motherboards this step is skipped if "Quick Boot" is enabled.
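The three kernel options required for `EINJ` can be checked against the running kernel's configuration. The sketch below assumes the config file lives at `/boot/config-<release>`, which is common on Debian-like systems but not guaranteed; the function name `missing_options` is hypothetical.

```python
import platform
from pathlib import Path

REQUIRED = ("CONFIG_DEBUG_FS", "CONFIG_ACPI_APEI", "CONFIG_ACPI_APEI_EINJ")


def missing_options(config_text: str, required=REQUIRED) -> list[str]:
    """Return the required options that are not built in (y) or modular (m)."""
    enabled = set()
    for line in config_text.splitlines():
        # Lines such as "# CONFIG_X is not set" contain no "=" and are skipped.
        name, sep, value = line.strip().partition("=")
        if sep and value in ("y", "m"):
            enabled.add(name)
    return [option for option in required if option not in enabled]


if __name__ == "__main__":
    # Assumed location of the kernel config; adjust for your distribution.
    config_path = Path(f"/boot/config-{platform.release()}")
    if config_path.is_file():
        missing = missing_options(config_path.read_text())
        print(f"Missing: {missing}" if missing else "All EINJ options enabled")
    else:
        print(f"{config_path} not found")
```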
The people behind [memtest](memtest.md) have a [paid tool to test ECC](https://www.passmark.com/products/ecc-tester/index.php) feat(fastapi): Launch the server from within python ```python import uvicorn if __name__ == "__main__": uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True) ``` feat(fastapi): Add the request time to the logs For more information on changing the logging read [1](https://nuculabs.dev/p/fastapi-uvicorn-logging-in-production) To set the datetime of the requests [use this configuration](https://stackoverflow.com/questions/62894952/fastapi-gunicon-uvicorn-access-log-format-customization) ```python @asynccontextmanager async def lifespan(api: FastAPI): logger = logging.getLogger("uvicorn.access") console_formatter = uvicorn.logging.ColourizedFormatter( "{asctime} {levelprefix} : {message}", style="{", use_colors=True ) logger.handlers[0].setFormatter(console_formatter) yield api = FastAPI(lifespan=lifespan) ``` feat(git#Update all git submodules): Update all git submodules If it's the first time you check-out a repo you need to use `--init` first: ```bash git submodule update --init --recursive ``` To update to latest tips of remote branches use: ```bash git submodule update --recursive --remote ``` feat(goodconf#Initialize the config with a default value if the file doesn't exist): Initialize the config with a default value if the file doesn't exist ```python def load(self, filename: Optional[str] = None) -> None: self._config_file = filename if not self.store_dir.is_dir(): log.warning("The store directory doesn't exist. Creating it") os.makedirs(str(self.store_dir)) if not Path(self.config_file).is_file(): log.warning("The yaml store file doesn't exist. Creating it") self.save() super().load(filename) ``` feat(goodconf#Config saving) So far [`goodconf` doesn't support saving the config](lincolnloop/goodconf#12). 
Until it's ready you can use the next snippet:

```python
class YamlStorage(GoodConf):
    """Adapter to store and load information from a yaml file."""

    @property
    def config_file(self) -> str:
        """Return the path to the config file."""
        return str(self._config_file)

    @property
    def store_dir(self) -> Path:
        """Return the path to the store directory."""
        return Path(self.config_file).parent

    def reload(self) -> None:
        """Reload the contents of the authentication store."""
        self.load(self.config_file)

    def load(self, filename: Optional[str] = None) -> None:
        """Load a configuration file."""
        if not filename:
            filename = f"{self.store_dir}/data.yaml"
        super().load(self.config_file)

    def save(self) -> None:
        """Save the contents of the authentication store."""
        with open(self.config_file, "w+", encoding="utf-8") as file_cursor:
            yaml = YAML()
            yaml.default_flow_style = False
            yaml.dump(self.dict(), file_cursor)
```

feat(google_chrome#Open a specific profile): Open a specific profile

```bash
google-chrome --profile-directory="Profile Name"
```

Where `Profile Name` is one of the profiles listed under `ls ~/.config/chromium | grep -i profile`.

feat(zfs#Monitorization): Monitor dbgmsg with loki

If you use [loki](loki.md) remember to monitor the `/proc/spl/kstat/zfs/dbgmsg` file:

```yaml
- job_name: zfs
  static_configs:
    - targets:
        - localhost
      labels:
        job: zfs
        __path__: /proc/spl/kstat/zfs/dbgmsg
```

fix(zfs): Add loki alerts on the kernel panic error

You can monitor this issue with loki using the next alerts:

```yaml
groups:
  - name: zfs
    rules:
      - alert: SlowSpaSyncZFSError
        expr: |
          count_over_time({job="zfs"} |~ `spa_deadman.*slow spa_sync` [5m])
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Slow sync traces found in the ZFS debug logs at {{ $labels.hostname}}"
          message: "This usually happens before the ZFS becomes unresponsive"
```

feat(linux_resilience): Introduce linux resilience

Increasing the resilience of the servers is critical when hosting services for others.
This is the roadmap I'm following for my servers.

**Autostart services if the system reboots**

Use init system services to manage your services.

**Get basic metrics traceability and alerts**

Set up [Prometheus](prometheus.md) with:

- The [blackbox exporter](blackbox_exporter.md) to track if the services are available to your users and to monitor SSL certificates health.
- The [node exporter](node_exporter.md) to keep track of the resource usage of your machines and set alerts to get notified when concerning events happen (disks are getting filled, CPU usage is too high)

**Get basic logs traceability and alerts**

Set up [Loki](loki.md) and clear up your system log errors.

**Improve the resilience of your data**

If you're still using `ext4` for your filesystems instead of [`zfs`](zfs.md) you're missing a big improvement. To set it up:

- [Plan your zfs storage architecture](zfs_storage_planning.md)
- [Install ZFS](zfs.md)
- [Create ZFS local and remote backups](sanoid.md)
- Monitor your ZFS

**Automatically react to system failures**

- [Kernel panics](https://www.supertechcrew.com/kernel-panics-and-lockups/)
- [watchdog](watchdog.md)

**Future undeveloped improvements**

- Handle the system reboots after kernel upgrades

feat(linux_snippets#Send multiline messages with notify-send): Send multiline messages with notify-send

The title can't have new lines, but the body can.

```bash
notify-send "Title" "This is the first line.\nAnd this is the second."
```

feat(linux_snippets#Find BIOS version): Find BIOS version

```bash
dmidecode | less
```

feat(linux_snippets#Reboot server on kernel panic ): Reboot server on kernel panic

The `/proc/sys/kernel/panic` file gives read/write access to the kernel variable `panic_timeout`. If this is zero, the kernel will loop on a panic; if nonzero it indicates that the kernel should autoreboot after this number of seconds. When you use the software watchdog device driver, the recommended setting is `60`.
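Before changing it, you may want to see what the current value means. A minimal sketch, assuming a Linux system where `/proc/sys/kernel/panic` is readable; the helper names `read_panic_timeout` and `describe` are made up for illustration, and only zero and positive values are handled.

```python
from pathlib import Path

PANIC_FILE = Path("/proc/sys/kernel/panic")


def read_panic_timeout(path: Path = PANIC_FILE) -> int:
    """Return the integer stored in the panic_timeout file."""
    return int(path.read_text().strip())


def describe(timeout: int) -> str:
    """Explain the behaviour for a zero or positive timeout."""
    if timeout == 0:
        return "the kernel loops on a panic (no automatic reboot)"
    return f"the kernel reboots {timeout} seconds after a panic"


if __name__ == "__main__":
    if PANIC_FILE.exists():
        print(describe(read_panic_timeout()))
    else:
        print(f"{PANIC_FILE} not found")
```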
To set the value, add the following to `/etc/sysctl.d/99-panic.conf`:

```
kernel.panic = 60
```

Or with an ansible task:

```yaml
- name: Configure reboot on kernel panic
  become: true
  lineinfile:
    path: /etc/sysctl.d/99-panic.conf
    line: kernel.panic = 60
    create: true
    state: present
```

feat(linux_snippets#Share a calculated value between github actions steps): Share a calculated value between github actions steps

You need to set a step's output parameter. Note that the step needs an `id` so that you can retrieve the output value later.

```bash
echo "{name}={value}" >> "$GITHUB_OUTPUT"
```

For example:

```yaml
- name: Set color
  id: color-selector
  run: echo "SELECTED_COLOR=green" >> "$GITHUB_OUTPUT"
- name: Get color
  env:
    SELECTED_COLOR: ${{ steps.color-selector.outputs.SELECTED_COLOR }}
  run: echo "The selected color is $SELECTED_COLOR"
```

feat(linux_snippets#Split a zip into sizes with restricted size ): Split a zip into sizes with restricted size

Something like:

```bash
zip -9 myfile.zip *
zipsplit -n 250000000 myfile.zip
```

Would produce `myfile1.zip`, `myfile2.zip`, etc., all independent of each other, and none larger than 250MB (in powers of ten). `zipsplit` will even try to organize the contents so that each resulting archive is as close as possible to the maximum size.

feat(linux_snippets#find files that were modified between dates): find files that were modified between dates

The best option is `-newerXY`. The `m` and `t` flags can be used.

- `m`: the modification time of the file reference
- `t`: reference is interpreted directly as a time

So the solution is:

```bash
find . -type f -newermt 20111222 \! -newermt 20111225
```

The lower bound is inclusive, and the upper bound is exclusive, so I added 1 day to it. And it is recursive.
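If you need the same selection from Python, the sketch below walks a directory tree and keeps files whose modification time falls in a half-open range, following the inclusive lower bound and exclusive upper bound described above. The function name `modified_between` is hypothetical and this is not a byte-for-byte reimplementation of `find`.

```python
import os
from datetime import datetime
from pathlib import Path


def modified_between(root, start: datetime, end: datetime):
    """Yield files under root whose mtime is in the range [start, end)."""
    start_ts, end_ts = start.timestamp(), end.timestamp()
    # os.walk descends into subdirectories, so the search is recursive.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if start_ts <= path.stat().st_mtime < end_ts:
                yield path


if __name__ == "__main__":
    for found in modified_between(".", datetime(2011, 12, 22), datetime(2011, 12, 25)):
        print(found)
```

The dates are interpreted in the local timezone because naive `datetime` objects are used.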
fix(loki#installation): Use `fake` when using one loki instance in docker

If you only have one Loki instance you need to save the rule yaml files in `/etc/loki/rules/fake/`, otherwise Loki will silently ignore them (it took me a lot of time to figure this out `-.-`).

feat(loki): Add alerts

Surprisingly I haven't found any compilation of Loki alerts. I'll gather here the ones I create.

There are two kinds of rules: alerting rules and recording rules.

- [ECC error alerts](rasdaemon.md#monitorization)
- [ZFS errors](zfs.md#zfs-pool-is-stuck)
- [Sanoid errors](sanoid.md#monitorization)

feat(luddites): Nice comic about the luddites

[Comic about luddites](https://www.technologyreview.com/2024/02/28/1088262/luddites-resisting-automated-future-technology/)

feat(magic_keys): Introduce the Magic Keys

The magic SysRq key is a key combination understood by the Linux kernel, which allows the user to perform various low-level commands regardless of the system's state. It is often used to recover from freezes, or to reboot a computer without corrupting the filesystem. Its effect is similar to the computer's hardware reset button (or power switch) but with many more options and much more control.

This key combination provides access to powerful features for software development and disaster recovery. In this sense, it can be considered a form of escape sequence. Principal among the offered commands are means to forcibly unmount file systems, kill processes, recover keyboard state, and write unwritten data to disk. With respect to these tasks, this feature serves as a tool of last resort.

The magic SysRq key cannot work under certain conditions, such as a kernel panic or a hardware failure preventing the kernel from running properly.

The key combination consists of Alt+Sys Req and another key, which controls the command issued. On some devices, notably laptops, the Fn key may need to be pressed to use the magic SysRq key.
**Reboot the machine** A common use of the magic SysRq key is to perform a safe reboot of a Linux computer which has otherwise locked up (abbr. REISUB). This can prevent a fsck being required on reboot and gives some programs a chance to save emergency backups of unsaved work. The QWERTY (or AZERTY) mnemonics: "Raising Elephants Is So Utterly Boring", "Reboot Even If System Utterly Broken" or simply the word "BUSIER" read backwards, are often used to remember the following SysRq-keys sequence: * unRaw (take control of keyboard back from X), * tErminate (send SIGTERM to all processes, allowing them to terminate gracefully), * kIll (send SIGKILL to all processes, forcing them to terminate immediately), * Sync (flush data to disk), * Unmount (remount all filesystems read-only), * reBoot. When magic SysRq keys are used to kill a frozen graphical program, the program has no chance to restore text mode. This can make everything unreadable. The commands textmode (part of SVGAlib) and the reset command can restore text mode and make the console readable again. On distributions that do not include a textmode executable, the key command Ctrl+Alt+F1 may sometimes be able to force a return to a text console. (Use F1, F2, F3,..., F(n), where n is the highest number of text consoles set up by the distribution. Ctrl+Alt+F(n+1) would normally be used to reenter GUI mode on a system on which the X server has not crashed.) **[Interact with the sysrq through the commandline](https://unix.stackexchange.com/questions/714910/what-is-a-good-way-to-test-watchdog-script-or-command-to-deliberately-overload)** It can also be used by echoing letters to `/proc/sysrq-trigger`, for example to trigger a system crash and take a crashdump you can: ```bash echo c > /proc/sysrq-trigger ``` feat(memtest): Introduce memtest [memtest86](https://www.memtest86.com/) is a testing software for RAM. 
**Installation**

```bash
apt-get install memtest86+
```

After the installation you'll get Memtest entries in grub which you can boot.

For some unknown reason the memtest of the boot menu didn't work for me. So I [downloaded the latest free version of memtest](https://www.memtest86.com/download.htm) (it's at the bottom of the page), burned it to a USB drive and booted from there.

**Usage**

It will run by itself. For 64GB of ECC RAM it took approximately 100 minutes to run all the tests.

**[Check ECC errors](https://www.memtest86.com/ecc.htm)**

MemTest86 directly polls ECC errors logged in the chipset/memory controller registers and displays them to the user on-screen. In addition, ECC errors are written to the log and report file.

fix(nas): Add suggestions when buying a motherboard

When choosing a motherboard make sure that:

- If you want [ECC](ecc.md) that it [truly supports ECC](ecc.md#confusion-on-boards-supporting-ecc).
- [It is IPMI compliant, if you want to have hardware watchdog support](watchdog.md#watchdog-hardware-is-disabled-error-on-boot)

feat(pass): Add rofi launcher

[pass](http://www.passwordstore.org/) is a command line password store.

**Configure rofi launcher**

- Save [this script](https://raw.githubusercontent.com/carnager/rofi-pass/master/rofi-pass) somewhere in your `$PATH`
- Configure your window manager to launch it whenever you need a password.

feat(process_exporter): Introduce the process exporter

[`process_exporter`](https://github.com/ncabatoff/process-exporter?tab=readme-ov-file) is a Prometheus exporter that mines `/proc` to report on selected processes.

**References**

- [Source](https://github.com/ncabatoff/process-exporter?tab=readme-ov-file)
- [Grafana dashboard](https://grafana.com/grafana/dashboards/249-named-processes/)

feat(promtail#Installation): Install with ansible

Use [patrickjahns ansible role](https://github.com/patrickjahns/ansible-role-promtail).
Some interesting variables are:

```yaml
loki_url: localhost
promtail_system_user: root
promtail_config_clients:
  - url: "http://{{ loki_url }}:3100/loki/api/v1/push"
    external_labels:
      hostname: "{{ ansible_hostname }}"
```

feat(promtail#Troubleshooting): Troubleshooting promtail

Find where the `positions.yaml` file is and check whether it changes over time.

Sometimes if you are not seeing the logs in loki it's because the query you're running is not correct.

feat(python_protocols): Introduce Python Protocols

The Python type system supports two ways of deciding whether two objects are compatible as types: nominal subtyping and structural subtyping.

Nominal subtyping is strictly based on the class hierarchy. If class `Dog` inherits class `Animal`, it’s a subtype of `Animal`. Instances of `Dog` can be used when `Animal` instances are expected. This form of subtyping is what Python’s type system predominantly uses: it’s easy to understand and produces clear and concise error messages, and matches how the native `isinstance` check works – based on class hierarchy.

Structural subtyping is based on the operations that can be performed with an object. Class `Dog` is a structural subtype of class `Animal` if the former has all attributes and methods of the latter, and with compatible types.

Structural subtyping can be seen as a static equivalent of duck typing, which is well known to Python programmers. See [PEP 544](https://peps.python.org/pep-0544/) for the detailed specification of protocols and structural subtyping in Python.

**Usage**

You can define your own protocol class by inheriting the special Protocol class:

```python
from typing import Iterable
from typing_extensions import Protocol


class SupportsClose(Protocol):
    # Empty method body (explicit '...')
    def close(self) -> None: ...


class Resource:  # No SupportsClose base class!

    def close(self) -> None:
        self.resource.release()

    # ... other methods ...


def close_all(items: Iterable[SupportsClose]) -> None:
    for item in items:
        item.close()


close_all([Resource(), open('some/file')])  # OK
```

`Resource` is a subtype of the `SupportsClose` protocol since it defines a compatible close method. Regular file objects returned by `open()` are similarly compatible with the protocol, as they support `close()`.

If you want to define a docstring on the method use the next syntax:

```python
def load(self, filename: Optional[str] = None) -> None:
    """Load a configuration file."""
    ...
```

**[Make protocols work with `isinstance`](https://mypy.readthedocs.io/en/stable/protocols.html#using-isinstance-with-protocols)**

To check an instance against the protocol using `isinstance`, we need to decorate our protocol with `@runtime_checkable`.

**[Make a protocol property variable](https://mypy.readthedocs.io/en/stable/protocols.html#invariance-of-protocol-attributes)**

**[Make protocol of functions](https://mypy.readthedocs.io/en/stable/protocols.html#callback-protocols)**

**References**

- [Mypy article on protocols](https://mypy.readthedocs.io/en/stable/protocols.html)
- [Predefined protocols reference](https://mypy.readthedocs.io/en/stable/protocols.html#predefined-protocol-reference)

feat(rasdaemon): Introduce rasdaemon the ECC monitor

[`rasdaemon`](https://github.com/mchehab/rasdaemon) is a RAS (Reliability, Availability and Serviceability) logging tool. It records memory errors, using the EDAC tracing events. EDAC is a Linux kernel subsystem which handles detection of ECC errors from memory controllers for most chipsets on i386 and x86_64 architectures. EDAC drivers for other architectures like arm also exist.
**Installation** ```bash apt-get install rasdaemon ``` The output will be available via syslog but you can show it to the foreground (`-f`) or to an sqlite3 database (`-r`) To post-process and decode received MCA errors on AMD SMCA systems, run: ```bash rasdaemon -p --status <STATUS_reg> --ipid <IPID_reg> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM> ``` Status and IPID Register values (in hex) are mandatory. The smca flag with family and model are required if not decoding locally. Bank parameter is optional. You may also start it via systemd: ```bash systemctl start rasdaemon ``` The rasdaemon will then output the messages to journald. **[Usage](https://www.setphaserstostun.org/posts/monitoring-ecc-memory-on-linux-with-rasdaemon/)** At this point `rasdaemon` should already be running on your system. You can now use the `ras-mc-ctl` tool to query the errors that have been detected. If everything is well configured you'll see something like: ```bash $: ras-mc-ctl --error-count Label CE UE mc#0csrow#2channel#0 0 0 mc#0csrow#2channel#1 0 0 mc#0csrow#3channel#1 0 0 mc#0csrow#3channel#0 0 0 ``` If it's not you'll see: ```bash ras-mc-ctl: Error: No DIMMs found in /sys or new sysfs EDAC interface not found. ``` The `CE` column represents the number of corrected errors for a given DIMM, `UE` represents uncorrectable errors that were detected. The label on the left shows the EDAC path under `/sys/devices/system/edac/mc/` of every DIMM. This is not very readable, if you wish to improve the labeling [read this article](https://www.setphaserstostun.org/posts/monitoring-ecc-memory-on-linux-with-rasdaemon/) More ways to check is to run: ```bash $: ras-mc-ctl --status ras-mc-ctl: drivers are loaded. ``` You can also see a summary of the state with: ```bash $: ras-mc-ctl --summary No Memory errors. No PCIe AER errors. No Extlog errors. DBD::SQLite::db prepare failed: no such table: devlink_event at /usr/sbin/ras-mc-ctl line 1183. 
Can't call method "execute" on an undefined value at /usr/sbin/ras-mc-ctl line 1184. ``` **Monitorization** You can use [loki](loki.md) to monitor ECC errors shown in the logs with the next alerts: ```yaml groups: - name: ecc rules: - alert: ECCError expr: | count_over_time({job="systemd-journal", unit="rasdaemon.service", level="error"} [5m]) > 0 for: 1m labels: severity: critical annotations: summary: "Possible ECC error detected in {{ $labels.hostname}}" - alert: ECCWarning expr: | count_over_time({job="systemd-journal", unit="rasdaemon.service", level="warning"} [5m]) > 0 for: 1m labels: severity: warning annotations: summary: "Possible ECC warning detected in {{ $labels.hostname}}" - alert: ECCAlert expr: | count_over_time({job="systemd-journal", unit="rasdaemon.service", level!~"info|error|warning"} [5m]) > 0 for: 1m labels: severity: info annotations: summary: "ECC log trace with unknown severity level detected in {{ $labels.hostname}}" ``` **References** - [Source](https://github.com/mchehab/rasdaemon) feat(rofi): Introduce Rofi [Rofi](https://github.com/davatorium/rofi?tab=readme-ov-file) is a window switcher, application launcher and dmenu replacement. **[Installation](https://github.com/davatorium/rofi/blob/next/INSTALL.md)** ```bash sudo apt-get install rofi ``` **[Usage](https://github.com/davatorium/rofi?tab=readme-ov-file#usage)** To launch rofi directly in a certain mode, specify a mode with `rofi -show <mode>`. To show the run dialog: ```bash rofi -show run ``` Or get the options from a script: ```bash ~/my_script.sh | rofi -dmenu ``` Specify an ordered, comma-separated list of modes to enable. Enabled modes can be changed at runtime. Default key is Ctrl+Tab. If no modes are specified, all configured modes will be enabled. To only show the run and ssh launcher: ```bash rofi -modes "run,ssh" -show run ``` The modes to combine in combi mode. For syntax to `-combi-modes` , see `-modes`. 
To get one merge view, of window,run, and ssh: ```bash rofi -show combi -combi-modes "window,run,ssh" -modes combi ``` **[Configuration](https://github.com/davatorium/rofi/blob/next/CONFIG.md)** The configuration lives at `~/.config/rofi/config.rasi` to create this file with the default conf run: ```bash rofi -dump-config > ~/.config/rofi/config.rasi ``` **[Use fzf to do the matching]()** To run once: ```bash rofi -show run -sorting-method fzf -matching fuzzy ``` To persist them change those same values in the configuration. **Theme changing** To change the theme: - Choose the one you like most looking [here](https://davatorium.github.io/rofi/themes/themes/) - Run `rofi-theme-selector` to select it - Accept it with `Alt + a` **[Keybindings change](https://davatorium.github.io/rofi/current/rofi-keys.5/)** **[Plugins](https://github.com/davatorium/rofi/wiki/User-scripts)** You can write your custom plugins. If you're on python using [`python-rofi`](https://github.com/bcbnz/python-rofi) seems to be the best option although it looks unmaintained. 
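If you'd rather not depend on `python-rofi`, the `rofi -dmenu` mode shown earlier can be driven with the standard library alone. This is a sketch under that assumption, not a replacement for a full plugin: the function name `choose` is made up, and the command is a parameter so that any dmenu-style program can be swapped in.

```python
import subprocess


def choose(options, command=("rofi", "-dmenu")):
    """Pipe the options to a dmenu-style command and return the chosen line.

    Returns None when nothing is selected (empty output).
    """
    result = subprocess.run(
        list(command),
        input="\n".join(options),  # one option per line on stdin
        capture_output=True,
        text=True,
        check=False,  # a cancelled menu exits non-zero; treat it as "no choice"
    )
    return result.stdout.strip() or None
```

Calling `choose(["first", "second"])` on a machine with rofi installed and a running display should open the menu and return the selected entry.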
Some interesting examples are:

- [Python based plugin](https://framagit.org/Daguhh/naivecalendar/-/tree/master?ref_type=heads)
- [Creation of nice menus](https://gitlab.com/vahnrr/rofi-menus/-/tree/master?ref_type=heads)
- [Nice collection of possibilities](https://github.com/adi1090x/rofi/tree/master)
- [Date picker](https://github.com/DMBuce/i3b/blob/master/bin/pickdate)
- [Orgmode capture](https://github.com/wakatara/rofi-org-todo/blob/master/rofi-org-todo.py)

Other interesting references are:

- [List of key bindings](https://davatorium.github.io/rofi/current/rofi-keys.5/)
- [Theme guide](https://davatorium.github.io/rofi/current/rofi-theme.5/#examples)

**References**

- [Source](https://github.com/davatorium/rofi?tab=readme-ov-file)
- [Docs](https://davatorium.github.io/rofi/)
- [Plugins](https://github.com/davatorium/rofi/wiki/User-scripts)

feat(sanoid#Monitorization): Monitorization

You can monitor this issue with loki using the following alerts:

```yaml
groups:
  - name: zfs
    rules:
      - alert: ErrorInSanoidLogs
        expr: |
          count_over_time({job="systemd-journal", syslog_identifier="sanoid"} |= `ERROR` [5m])
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Errors found on sanoid log at {{ $labels.hostname}}"
```

feat(signal#Use the Molly FOSS android client): Use the Molly FOSS android client

Molly is an independent Signal fork for Android. The advantages are:

- Contains no proprietary blobs, unlike Signal.
- Protects database with passphrase encryption.
- Locks down the app automatically when you are gone for a set period of time.
- Securely shreds sensitive data from RAM.
- Automatic backups on a daily or weekly basis.
- Supports SOCKS proxy and Tor via Orbot.

**[Migrate from Signal](https://github.com/mollyim/mollyim-android/wiki/Migrating-From-Signal)**

Note that the migration should be done when the available Molly version is equal to or later than the currently installed Signal app version.

- Verify your Signal backup passphrase.
In the Signal app: Settings > Chats > Chat backups > Verify backup passphrase.
- Optionally, put your phone offline (enable airplane mode or disable data services) until after Signal is uninstalled in a later step. This prevents losing any Signal messages that are received during or after the creation of the backup.
- Create a Signal backup. In the Signal app, go to Settings > Chats > Chat backups > Create backup.
- Uninstall the Signal app. Now you can put your phone back online (disable airplane mode or re-enable data services).
- Install the Molly or Molly-FOSS app.
- Open the Molly app. Enable database encryption if desired. As soon as the option is given, tap Transfer or restore account. Answer any permissions questions.
- Choose to Restore from backup and tap Choose backup. Navigate to your Signal backup location (Signal/Backups/, by default) and choose the backup that you created earlier.
- Check the backup details and then tap Restore backup to confirm. Enter the backup passphrase when requested.
- If asked, choose a new folder for backup storage. Or choose Not Now and do it later.

Consider also:

- Any previously linked devices will need to be re-linked. Go to Settings > Linked devices in the Molly app. If Signal Desktop is not detecting that it is no longer linked, try restarting it.
- Verify your Molly backup settings and passphrase at Settings > Chats > Chat backups (to change the backup folder, disable and then enable backups). Tap Create backup to create your first Molly backup.
- When you are satisfied that Molly is working, you may want to delete the old Signal backups (in Signal/Backups, by default).

fix(time_management_abstraction_levels): Rename Task to Action

To remove the productivity capitalist load from the concept.

feat(watchdog): Introduce the watchdog

A [watchdog timer](https://en.wikipedia.org/wiki/Watchdog_timer) (WDT, or simply a watchdog), sometimes called a computer operating properly timer (COP timer), is an electronic or software timer that is used to detect and recover from computer malfunctions. Watchdog timers are widely used in computers to facilitate automatic correction of temporary hardware faults, and to prevent errant or malevolent software from disrupting system operation.

During normal operation, the computer regularly restarts the watchdog timer to prevent it from elapsing, or "timing out". If, due to a hardware fault or program error, the computer fails to restart the watchdog, the timer will elapse and generate a timeout signal. The timeout signal is used to initiate corrective actions. The corrective actions typically include placing the computer and associated hardware in a safe state and invoking a computer reboot.

Microcontrollers often include an integrated, on-chip watchdog. In other computers the watchdog may reside in a nearby chip that connects directly to the CPU, or it may be located on an external expansion card in the computer's chassis.

**Hardware watchdog**

Before you start using the hardware watchdog you need to check whether your hardware actually supports it.

If you see the [Watchdog hardware is disabled error on boot](#watchdog-hardware-is-disabled-error-on-boot), things are not looking good.

**Check if the hardware watchdog is enabled**

You can check whether the hardware watchdog is loaded by running `wdctl`.
For example, on a machine that has it enabled you'll see:

```
Device:        /dev/watchdog0
Identity:      iTCO_wdt [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
Timeleft:      30 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
```

On a machine that doesn't you'll see:

```
wdctl: No default device is available.: No such file or directory
```

Another option is to run `dmesg | grep wd` or `dmesg | grep watc -i`. For example, on a machine that has the hardware watchdog enabled you'll see something like:

```
[   20.708839] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[   20.708894] iTCO_wdt: Found a Intel PCH TCO device (Version=4, TCOBASE=0x0400)
[   20.709009] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
```

On one that doesn't you'll see:

```
[    1.934999] sp5100_tco: SP5100/SB800 TCO WatchDog Timer Driver
[    1.935057] sp5100-tco sp5100-tco: Using 0xfed80b00 for watchdog MMIO address
[    1.935062] sp5100-tco sp5100-tco: Watchdog hardware is disabled
```

If you're out of luck and your hardware doesn't support it, you can delegate the task to the software watchdog or get a [usb watchdog](https://github.com/zatarra/usb-watchdog).

**[Systemd watchdog](https://0pointer.de/blog/projects/watchdog.html)**

Starting with version 183, systemd provides full support for hardware watchdogs (as exposed in /dev/watchdog to userspace), as well as supervisor (software) watchdog support for individual system services. The basic idea is the following: if enabled, systemd will regularly ping the watchdog hardware. If systemd or the kernel hang, this ping will not happen anymore and the hardware will automatically reset the system. This way systemd and the kernel are protected from boundless hangs -- by the hardware.
To make the chain complete, systemd then exposes a software watchdog interface for individual services so that they can also be restarted (or some other action taken) if they begin to hang. This software watchdog logic can be configured individually for each service, both in the ping frequency and in the action to take. Putting both parts together (i.e. hardware watchdogs supervising systemd and the kernel, as well as systemd supervising all other services) we have a reliable way to watchdog every single component of the system.

**[Configuring the watchdog](https://0pointer.de/blog/projects/watchdog.html)**

To make use of the hardware watchdog it is sufficient to set the `RuntimeWatchdogSec=` option in `/etc/systemd/system.conf`. It defaults to `0` (i.e. no hardware watchdog use). Set it to a value like `20s` and the watchdog is enabled. After 20s of no keep-alive pings the hardware will reset itself. Note that `systemd` will send a ping to the hardware at half the specified interval, i.e. every 10s.

Note that the hardware watchdog device (`/dev/watchdog`) is single-user only. That means that you can either enable this functionality in systemd, or use a separate external watchdog daemon, such as the aptly named `watchdog`. That said, the built-in hardware watchdog support of systemd does not conflict with other watchdog software by default: systemd does not make use of `/dev/watchdog` unless you enable it, and you are welcome to use external watchdog daemons in conjunction with systemd if this better suits your needs.

`ShutdownWatchdogSec=` is another option that can be configured in `/etc/systemd/system.conf`. It controls the watchdog interval to use during reboots. It defaults to 10min, and adds extra reliability to the system reboot logic: if a clean reboot is not possible and shutdown hangs, we rely on the watchdog hardware to reset the system abruptly, as an extra safety net.

Now, let's have a look at how to add watchdog logic to individual services.

First of all, to make software watchdog-supervisable it needs to be patched to send out "I am alive" signals at regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the `WATCHDOG_USEC=` environment variable. If it is set, it contains the watchdog interval in microseconds, formatted as an ASCII text string, as it is configured for the service. The daemon should then issue `sd_notify("WATCHDOG=1")` calls every half of that interval. A daemon patched this way should transparently support watchdog functionality by checking whether the environment variable is set and honouring the value it is set to.

To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above) it is sufficient to set `WatchdogSec=` to the desired failure latency. See `systemd.service(5)` for details on this setting. This causes `WATCHDOG_USEC=` to be set for the service's processes and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.

The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure set `Restart=on-failure` for the service. To configure how many times systemd attempts to restart a service, use the combination of `StartLimitBurst=` and `StartLimitInterval=`, which let you configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with `StartLimitAction=`. The default is `none`, i.e. no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are `reboot`, `reboot-force` and `reboot-immediate`.

- `reboot` attempts a clean reboot, going through the usual, clean shutdown logic.
- `reboot-force` is more abrupt: it will not actually try to cleanly shut down any services, but immediately kills all remaining services and unmounts all file systems and then forcibly reboots (this way all file systems will be clean but the reboot will still be very fast).
- `reboot-immediate` does not attempt to kill any process or unmount any file systems. Instead it just hard reboots the machine without delay. `reboot-immediate` hence comes closest to a reboot triggered by a hardware watchdog.

All these settings are documented in `systemd.service(5)`.

Putting this all together we now have pretty flexible options to watchdog-supervise a specific service and configure automatic restarts of the service if it hangs, plus take ultimate action if that doesn't help.

Here's an example unit file:

```ini
[Unit]
Description=My Little Daemon
Documentation=man:mylittled(8)

[Service]
ExecStart=/usr/bin/mylittled
WatchdogSec=30s
Restart=on-failure
StartLimitInterval=5min
StartLimitBurst=4
StartLimitAction=reboot-force
```

This service will automatically be restarted if it hasn't pinged the system manager for longer than 30s or if it fails otherwise. If it is restarted this way more than 4 times in 5min, action is taken and the system is quickly rebooted, with all file systems being clean when it comes up again.

To write the code of the watchdog service you can follow one of these guides:

- [Python based watchdog](https://sleeplessbeastie.eu/2022/08/15/how-to-create-watchdog-for-systemd-service/)
- [Bash based watchdog](https://www.medo64.com/2019/01/systemd-watchdog-for-any-service/)

**[Testing a watchdog](https://serverfault.com/questions/375220/how-to-check-what-if-hardware-watchdogs-are-available-in-linux)**

One simple way to test a watchdog is to trigger a kernel panic. This can be done as root with:

```bash
echo c > /proc/sysrq-trigger
```

The kernel will stop responding to the watchdog pings, so the watchdog will trigger.

SysRq is a 'magical' key combo you can hit which the kernel will respond to regardless of whatever else it is doing, unless it is completely locked up. It can also be used by echoing letters to /proc/sysrq-trigger, like we're doing here. In this case, the letter c means perform a system crash and take a crashdump if configured.

**Troubleshooting**

**Watchdog hardware is disabled error on boot**

According to the discussion at [the kernel mailing list](https://lore.kernel.org/linux-watchdog/[email protected]/T/#u) it means that the system contains a hardware watchdog but it has been disabled (probably by the BIOS) and Linux cannot enable the hardware.

If your BIOS doesn't have a switch to enable it, consider the watchdog hardware broken for your system.

Some people blacklist the module so that it's not loaded and therefore doesn't return the error ([1](https://www.reddit.com/r/openSUSE/comments/a3nmg5/watchdog_hardware_is_disabled_on_boot/), [2](https://bbs.archlinux.org/viewtopic.php?id=239075)).

**References**

- [0pointer post on systemd watchdogs](https://0pointer.de/blog/projects/watchdog.html)
- [Heckel post on how to reboot using watchdogs](https://blog.heckel.io/2020/10/08/reliably-rebooting-ubuntu-using-watchdogs/)
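
As a rough illustration of the service-side keep-alive logic described in the systemd watchdog section (read `WATCHDOG_USEC`, then send `WATCHDOG=1` at half that interval), here is a minimal Python sketch that talks to the `NOTIFY_SOCKET` datagram socket directly instead of linking against `libsystemd`. The function names and the `rounds` parameter are hypothetical and only there to keep the example finite; a real daemon would ping from its own event loop.

```python
import os
import socket
import time


def watchdog_ping_interval(environ=os.environ):
    """Return the seconds to wait between pings, or None without a watchdog."""
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    # WATCHDOG_USEC is in microseconds; ping at half the configured interval.
    return int(usec) / 1_000_000 / 2


def sd_notify(message, environ=os.environ):
    """Send `message` to the service manager; return False if no socket is set."""
    address = environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        # A leading "@" denotes a Linux abstract namespace socket.
        address = "\0" + address[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), address)
    return True


def run_with_keepalive(do_work, rounds, environ=os.environ):
    """Run `do_work` `rounds` times, pinging the watchdog after each round.

    Illustrative only: each round must finish well within half the
    watchdog interval, otherwise the pings arrive too late.
    """
    interval = watchdog_ping_interval(environ)
    for _ in range(rounds):
        do_work()
        sd_notify("WATCHDOG=1", environ)
        if interval is not None:
            time.sleep(interval)
```

When the service is not started by systemd, or `WatchdogSec=` is not set, both environment variables are missing and the helpers silently do nothing, which matches the "transparently support watchdog functionality" behaviour described above.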