When we lock a mutex, eventually `Thread::block` is invoked which could
in turn invoke `Process::big_lock().restore_exclusive_lock()`. This
would then try to add the current thread to a different blocked thread
list then the one in use for the original mutex being locked, and
because it's an intrusive list, the thread is removed from its original
list during the `.append()`. When the original mutex eventually
unblocks, we no longer have the thread in the intrusive blocked threads
list and we panic.
Solve this by making the big lock mutex special and giving it its own
blocked thread list. Because the process big lock is temporary and is
being actively removed from e.g. syscalls, it's a matter of time before
we can also remove the fix introduced by this commit.
Fixes issue #9401.
If we unregister from the RegionTree before unmapping, there's a race
where a new region can get inserted at the same address that we're about
to unmap. If this happens, ~Region() will then unmap the newly inserted
region, which now finds itself with cleared-out page table entries.
This had no business being in RegionTree, since RegionTree doesn't track
identity-mapped regions anyway. (We allow *any* address to be identity
mapped, not just the ones that are part of the RegionTree's range.)
This patch adds RegionTree::get_lock() which exposes the internal lock
inside RegionTree. We can then lock it from the outside when doing
lookups or traversal.
This solution is not very beautiful, we should find a way to protect
this data with SpinlockProtected or something similar. This is a stopgap
patch to try and fix the currently flaky CI.
This syscall ends up disabling interrupts while changing the time,
and the clock is a global resource anyway, so preventing threads in the
same process from running wouldn't solve anything.
Let's use terminology from the the Intel manual to avoid confusion.
Also add `_string` suffixes to better distinguish the numeric values
from the string values.
...and remove the last remaining client of the API. It's no longer
possible to ask the RegionTree for a VM range. You can only ask it to
place your Region somewhere in available space.
This patch move AddressSpace (the per-process memory manager) to using
the new atomic "place" APIs in RegionTree as well, just like we did for
MemoryManager in the previous commit.
This required updating quite a few places where VM allocation and
actually committing a Region object to the AddressSpace were separated
by other code.
All you have to do now is call into AddressSpace once and it'll take
care of everything for you.
Instead of first allocating the VM range, and then inserting a region
with that range into the MM region tree, we now do both things in a
single atomic operation:
- RegionTree::place_anywhere(Region&, size, alignment)
- RegionTree::place_specifically(Region&, address, size)
To reduce the number of things we do while locking the region tree,
we also require callers to provide a constructed Region object.
This patch ports MemoryManager to RegionTree as well. The biggest
difference between this and the userspace code is that kernel regions
are owned by extant OwnPtr<Region> objects spread around the kernel,
while userspace regions are owned by the AddressSpace itself.
For kernelspace, there are a couple of situations where we need to make
large VM reservations that never get backed by regular VMObjects
(for example the kernel image reservation, or the big kmalloc range.)
Since we can't make a VM reservation without a Region object anymore,
this patch adds a way to create unbacked Region objects that can be
used for this exact purpose. They have no internal VMObject.)
RegionTree holds an IntrusiveRedBlackTree of Region objects and vends a
set of APIs for allocating memory ranges.
It's used by AddressSpace at the moment, and will be used by MM soon.
This patch stops using VirtualRangeAllocator in AddressSpace and instead
looks for holes in the region tree when allocating VM space.
There are many benefits:
- VirtualRangeAllocator is non-intrusive and would call kmalloc/kfree
when used. This new solution is allocation-free. This was a source
of unpleasant MM/kmalloc deadlocks.
- We consolidate authority on what the address space looks like in a
single place. Previously, we had both the range allocator *and* the
region tree both being used to determine if an address was valid.
Now there is only the region tree.
- Deallocation of VM when splitting regions is no longer complicated,
as we don't need to keep two separate trees in sync.
This is important for dmidecode because it does an fstat on the DMI
blobs, trying to figure out their size. Because we already know the size
of the blobs when creating the SysFS components, there's no performance
penalty whatsoever, and this allows dmidecode to not use the /dev/mem
device as a fallback.
The current implementation of read/write will fail in StorageDevice
when the request length is less than the block size of the underlying
device. Fix it by calculating the offset within a block for such cases
and using it for copying data from the bounce buffer.
8233da3398 introduced a not-so-subtle bug
where an application with an existing pledge set containing `no_error`
could elevate its pledge set by pledging _anything_, this commit makes
sure that no new promise is accepted.
Some error indication was done by returning bool. This was changed to
propagate the error by ErrorOr from the underlying functions. The
returntype of the underlying functions was also changed to propagate the
error.
We're now able to detect all the regular CPUID feature flags from
ECX/EDX for EAX=1 :^)
None of the new ones are being used for anything yet, but they will show
up in /proc/cpuinfo and subsequently lscpu and SystemMonitor.
Note that I replaced the periods from the SSE 4.1 and 4.2 instructions
with underscores, which matches the internal enum names, Linux's
/proc/cpuinfo and the general pattern of replacing special characters
with underscores to limit feature names to [a-z0-9_].
The enum member stringification has been moved to a new function for
better re-usability and to avoid cluttering up Processor.cpp.
This will make it possible to add many, many more CPU features - more
than the current limit 32 and later limit of 64 if we stick with an enum
class to be specific :^)
Checks of ECX go before EDX, and the bit indices are now ordered
properly. Additionally, handling of the EDX[11] bit has been moved into
a lambda function to keep the series of if statements neatly together.
All of this makes it *a lot* easier to follow along and compare the
implementation to the tables in the Intel manual, e.g. to find missing
checks.
This solves a problem where any non-trivial member in the global BSP
Processor instance would get re-initialized (improperly), losing data
that was already initialized earlier.
Expose the block size variable via a member function in the
AsyncBlockDeviceRequest so that the driver doesn't need to assume any
value such as 512 bytes.
The underlying driver does not need to recalculate the buffer size as
it is passed in the AsyncBlockDevice struct anyway. This also helps in
removing any assumptions of the underlying block size of the device.
This makes pledge() ignore promises that would otherwise cause it to
fail with EPERM, which is very useful for allowing programs to run under
a "jail" so to speak, without having them termiate early due to a
failing pledge() call.
Contrary to the past, we don't attempt to assume the real name of a TTY
device, but instead, we generate a pseudo name only when needed to do so
which is still OK because we don't break abstraction layer rules and we
still can provide userspace with the required information.
Now that we reclaim the memory range that is created by KASLR before
the start of the kernel image, there's no need to be conservative with
the KASLR offset.
The obsolete ttyname and ptsname syscalls are removed.
LibC doesn't rely on these anymore, and it helps simplifying the Kernel
in many places, so it's an overall an improvement.
In addition to that, /proc/PID/tty node is removed too as it is not
needed anymore by userspace to get the attached TTY of a process, as
/dev/tty (which is already a character device) represents that as well.
This ioctl operation will allow userspace to determine the index number
of a MasterPTY after opening /dev/ptmx and actually getting an internal
file descriptor of MasterPTY.
This will replace the /dev/tty symlink created by SystemServer, so
instead of a symlink, a character device will be created. When doing
read(2), write(2) and ioctl(2) on this device, it will "redirect" these
operations to the attached TTY of the current process.
This ensures we don't just waste the memory range between the default
base load address and the actual load address that was shifted by the
KASLR offset.
This requirement comes from the fact the Prekernel mapping logic only
uses 2 MiB pages.
This unfortunately reduces the bits of entropy in kernel addresses from
16 bits to 7, but it could be further improved in the future by making
the Prekernel mapping logic a bit more dynamic.
To do so, we now check that the framebuffer type is RGB so we know that
the Multiboot bootloader actually provided a valid framebuffer to work
with.
This fixes a problem I observed on my ICH7 test machine that apparently
the multiboot_framebuffer_addr was not null but there was no framebuffer
that was set up for RGB colors, and by initializing that console, there
was a memory curroption caused somewhere in the EBDA area to probably
cause a complete system lockup.
This helps solving an issue when we boot with text mode screen so the
Kernel initializes an early text mode console, but even after disabling
it, that console can still access VGA ports. This wouldn't be a problem
for emulated hardware but bare metal hardware might have a "conflict",
especially if the native driver explicitly request to disable the VGA
emulation.
This class already has variables named m_lock, and it's also strange
that locals are named with the `m_` prefix. So lets fix that to make
the code more readable.
Found by PVS-Studio.
C++20 provides the `requires` clause which simplifies the ability to
limit overload resolution. Prefer it over `EnableIf`
With all uses of `EnableIf` being removed, also remove the
implementation so future devs are not tempted.
Since the allocated memory is going to be zeroed immediately anyway,
let's avoid redundantly scrubbing it with MALLOC_SCRUB_BYTE just before
that.
The latest versions of gcc and Clang can automatically do this malloc +
memset -> calloc optimization, but I've seen a couple of places where it
failed to be done.
This commit also adds a naive kcalloc function to the kernel that
doesn't (yet) eliminate the redundancy like the userland does.
This is mainly useful when adding an HostController but due to OOM
condition, we abort temporary Vector insertion of a DeviceIdentifier
and then exit the iteration loop to report back the error if occured.
Instead, hold the lock while we copy the contents to a stack-based
Vector then iterate on it without any locking.
Because we rely on heap allocations, we need to propagate errors back
in case of OOM condition, therefore, both PCI::enumerate API function
and PCI::Access::add_host_controller_and_enumerate_attached_devices use
now a ErrorOr<void> return value to propagate errors. OOM Error can only
occur when enumerating the m_device_identifiers vector under a spinlock
and trying to expand the temporary Vector which will be used locklessly
to actually iterate over the PCI::DeviceIdentifiers objects.
This code attempts to copy the `Protocol::Resource3DSpecification`
struct into request, starting at `Protocol::ResourceCreate3D::target`
member of the `Protocol::ResourceCreate3D` struct.
The problem is that the `Protocol::Resource3DSpecification` struct
does not having the trailing `u32 padding` that the `ResourceCreate3D`
struct has. Leading to memcopy overrunning the struct and corrupting
32 bits of data trailing the struct.
Found by SonarCloud:
- Memory copy function overflows the destination buffer.
As there is no need for a Prekernel on aarch64, the Prekernel code was
moved into Kernel itself. The functionality remains the same.
SERENITY_KERNEL_AND_INITRD in run.sh specifies a kernel and an inital
ramdisk to be used by the emulator. This is needed because aarch64
does not need a Prekernel and the other ones do.
These fences should not be needed, since we force the use of
synchronous operations through synchronous_virtio_gpu_command. The use
of these fences also causes severe lag when SERENITY_GL is enabled.
This commit flips VirtIOGPU back to using a Mutex for its operation
lock (instead of a spinlock). This is necessary for avoiding a few
system hangs when queuing actions on the driver from multiple
processes, which becomes much more of an issue when using VirGL from
multiple userspace process.
This does result in a few code paths where we inevitably have to grab
a mutex from inside a spinlock, the only way to fix both issues is to
move to issuing asynchronous virtio gpu commands.
The project appears to build just fine without it, and the explicit use
of `LibC` causes it to conflict with the system-wide `fd_set.h` when
building inside of Serenity.
This additionally refactors FramebufferDevice::try_to_initialize to not
leave the FramebufferDevice in an invalid state on errors.
This also unifies the logic between FramebufferDevice::mmap and
FramebufferDevice::try_to_initialize.
This comes with the drawback of removing the UNMAP_AFTER_INIT attribute
from this function, which wasn't honoured by IntelNativeGraphicsAdapter
anyway.
If init crashes, all other userspace processes exit too, thus rendering
the system unusable. Previously, the kernel would still keep running
even without a userland, showing just a black screen without any
indication of the issue.
We now panic the kernel, which shows a message on the console. In the
case of the CI runners, it shuts down the virtual machine, so we don't
have to wait for the 1 hour timeout if an issue arises with
SystemServer.
The stack is misaligned at this point for some reason, this is a hack
that makes the resulting object "correctly" aligned, thus avoiding a
KUBSAN error.
Mere mortals like myself cannot understand more than two lines of
assembly without a million comments explaining what's happening, so do
that and make sure no one has to go on a wild stack state chase when
hacking on these.
The comments were confusing, and had a mathematical error, stop trying
to be clever and just let the computer do the math.
Also assert that we're pushing exactly as many stack elements as we're
using for the alignment calculations.
POSIX requires that sigaction() and friends set a _process-wide_ signal
handler, so move signal handlers and flags inside Process.
This also fixes a "pid/tid confusion" FIXME, as we can now send the
signal to the process and let that decide which thread should get the
signal (which is the thread with tid==pid, but that's now the Process's
problem).
Note that each thread still retains its signal mask, as that is local to
each thread.
Previously register_string would return incorrect values when
called multiple times with the same input. This patch makes this
function return the same index, identical strings. This change was
required, as this functionality is now being used with read syscall
profiling, (#12465), which uses 'register_string' to registers file
path on every read syscall.
If there's no PCI bus, then it's safe to assume that we run on a x86
machine that has an ISA IDE controller in the system. In such case, we
just instantiate a ISAIDEController object that assumes fixed locations
of IDE IO ports.
If there's no PCI bus, then it's safe to assume that the x86 machine we
run on supports VGA text mode console output with an ISA VGA adapter.
If this is the case, we just instantiate a ISAVGAAdapter object that
assumes this situation and allows us to boot into VGA text mode console.
Reading from /proc/pci assumes we have PCI enabled and also enumerated.
However, if PCI is disabled for some reason, we can't allow the user to
read from it as there's no valuable data we can supply.
To declare that we don't have a PCI bus in the system we do two things:
1. Probe IO ports before enabling access -
In case we are using the QEMU ISA-PC machine type, IO probing results in
floating bus condition (returning 0xFF values), thus, we know we don't
have PCI bus on the system.
2. Allow the user to specify to not use the PCI bus at all in the kernel
commandline.
This change allow the user to request the kernel to not use any PCI
resources/devices at all.
Also, don't try to initialize devices that rely on PCI if disabled.
Instead of winging it with "width * 4", use the actual pitch since it
may be different.
This makes the kernel text console show up in native 1368x768 on my
ThinkPad X250. :^)
We now only reset the PCM out channel during initialization, and handle
the case where the channel's current index has passed the last valid
index properly.
This fixes issues with stuttering audio between multiple subsequent
`aplay` invocations, for example.
This might help with debugging on bare metal. Since the minimum version
that can be specified is revision 2.1, and we do not use any feature
from revision 2.2 or newer, this is merely future-proofing ourselves
for new features yet to be built. Additionally, removing the `VERIFY()`
ensures we will not crash on cards that only support earlier revisions.
These are not technically required, since the Thread constructor
already sets these, but they are set on i686, so let's try and keep
consistent behaviour between the different archs.
The Qemu AC'97 device stops its PCM channel's DMA engine when it is
running and the sample rate is changed. We now make sure the DMA engine
is restarted after changing the sample rate, allowing you to e.g. run
`asctl set r 22050` during `aplay` playback.
In short: QEMU supports both Memory-Mapped-IO and classic IO methods
for controlling the emulated VGA device. Bochs and VirtualBox only
support the classic IO method. An excellent write up on the history of
these interfaces can be found here:
https://www.kraxel.org/blog/2018/10/qemu-vga-emulation-and-bochs-display
The IO method was how things were done originally in SerenityOS. Commit
6a728e2d76 introduced the MMIO method for
all devices, breaking Bochs and VirtualBox compatibility. Later in
commit 6a9dc5562d the classic IO method
was restored for VirtualBox graphics adapters.
QEMU and Bochs use the same PCI VID/DID (0x1234/0x1111) for the emulated
VGA adapter. To distinguish betwen QEMU and Bochs we use the PCI
revision ID field (0=Bochs, 2=QEMU).
This driver is not tested and probably not used on any modern hardware
machine, because it is plugged into the ISA bus and not the PCI bus.
Also, the run script doesn't utilize this device anymore, making it more
hard to test this driver and to ensure it doesn't rot.
The AP boot code was partially adapted to build on x86_64 but didn't
properly jump into 64 bit mode. Furthermore, the APIC code was still
using 32 bit pointers.
Fixes#12662
We already init receive buffer if we have singleport console, but if
we have multiport console that dynamically allocates ports we never
initted their receive buffers.
Some hardware controllers might reset when trying to do self-test, so
keep the configuration byte to restore it later on.
To ensure we are not missing the response from the i8042 controller,
bump the attempts count to 20 times after initiating self-test check.
Also, try to drain the i8042 controller output buffer as it might be a
early good indication on whether i8042 is present or not.
To ensure we drain all the output buffer, we attempt to read from the
buffer 50 times and not 20 times.
While investigating why gdb is failing when it calls `PT_CONTINUE`
against Serenity I noticed that the names of the programs in the
System Monitor didn't make sense. They were seemingly stale.
After inspecting the kernel code, it became apparent that the sequence
occurs as follows:
1. Debugger calls `fork()`
2. The forked child calls `PT_TRACE_ME`
3. The `PT_TRACE_ME` instructs the forked process to block in the
kernel waiting for a signal from the tracer on the next call
to `execve(..)`.
4. Debugger waits for forked child to spawn and stop, and then it
calls `PT_ATTACH` followed by `PT_CONTINUE` on the child.
5. Currently the `PT_CONTINUE` fails because of some other yet to
be found bug.
6. The process name is set immediately AFTER we are woken up by
the `PT_CONTINUE` which never happens in the case I'm debugging.
This chain of events leaves the process suspended, with the name of
the original (forked) process instead of the name we inherit from
the `execve(..)` call.
To avoid such confusion in the future, we set the new name before we
block waiting for the tracer.
This is very similar to the change that was done in 32053e8, except it
turned out that the new limit of 50 iterations was not enough when
testing on bare metal - most IO operations would succeed in the first or
second iteration, but two of them took 140 and 150 iterations
respectively.
Increase the limit from 50 to 250 to account for this, and have some
additional headroom.
This caused an initialization failure of the i8042 when I tested on
bare metal. We cannot entirely get rid of this method as QEMU for
example doesn't indicate the existence of an i8042 via ACPI, but we can
get away with only doing the manual probing if ACPI is disabled or we
didn't get a 'yes' from it.
Increasing the number of maximum loops did eventually lead to a
successful return from the function, but would later fail the actual
self test.
Arguments larger than 32bit need to be passed as a pointer on a 32bit
architectures. sys$profiling_enable has u64 event_mask argument,
which means that it needs to be passed as an pointer. Previously upper
32bits were filled by garbage.
This prevents a kernel panic found in CI when m_receive_queue's size is
queried and found to be non-zero, then a different thread clears the
queue, and finally the first thread continues into the if block and
calls the queue's first() method, which then fails an assertion that
the queue's size is non-zero.
The SB16 card driver doesn't swallow more than 4096 bytes of data at
once, so instead of asserting just return ENOSPC for now.
To test this, either play normal sound or just this (very!) loud noise:
dd if=/dev/random of=/dev/audio/0 bs=4096
We have 3 new components:
1. The AudioManagement singleton. This class like in other subsystems,
is responsible to find hardware audio controllers and keep a reference
to them.
2. AudioController class - this class is the parent class for hardware
controllers like the Sound Blaster 16 or Intel 82801AA (AC97). For now,
this class has simple interface for getting and controlling sample rate
of audio channels, as well a write interface for specific audio channel
but not reading from it. One AudioController object might have multiple
AudioChannel "child" objects to hold with reference counting.
3. AudioChannel class - this is based on the CharacterDevice class, and
represents hardware PCM audio channel. It facilitates an ioctl interface
which should be consistent across all supported hardware currently.
It has a weak reference to a parent AudioController, and when trying to
write to a channel, it redirects the data to the parent AudioController.
Each audio channel device should be added into a new directory under the
/dev filesystem called "audio".
Since we're in an IRQ each of these evaluate_block_conditions() calls
enqueues a new deferred call, so to save on some space in the deferred
call queue let's just do it once.
This matches the likes of the adopt_{own, ref}_if_nonnull family and
also frees up the name to allow us to eventually add OOM-fallible
versions of these functions.
Move the definitions for maximum argument and environment size to
Process.h from execve.cpp. This allows sysconf(_SC_ARG_MAX) to return
the actual argument maximum of 128 KiB to userspace.
Error codes can leak information about veiled paths, if the path
resolution fails with e.g. EACCESS.
This is non-trivial to fix, as there is a group of error codes we want
to propagate to the caller, such as ENOMEM.
VirtualFileSystem::mkdir() relies on resolve_path() returning an error,
since it is only interested in the out_parent passed as a pointer. Since
resolve_path_without_veil returns an error, no process veil validation
is done by resolve_path() in that case. Due to this problem, mkdir()
should use resolve_path_without_veil() and then manually validate if the
parent directory of the to-be-created directory is unveiled with 'c'
permissions.
This fixes a bug where the mkdir syscall would not respect the process
veil at all.
Previously, VirtualFileSystem::resolve_path() could return a non-null
RefPtr<Custody>* out_parent even if the function errored because the
path has been veiled.
If code relies on recieving the parent custody even if the path is
veiled, it should just call resolve_path_without_veil and do the veil
validation manually. This is because it could be that the parent is
unveiled but the child isn't or the other way round.
Apparently on VirtualBox the keyboard device refused to complete the
reset sequence. With longer delays and more attempts before giving up,
it seems like the problem is gone.
If we crashed in the middle of mapping in Regions, some of the regions
may not have a page directory yet, and will result in a crash when
Region::remap() is called.
We were frequently dropping packets when downloading large files.
Then we had to wait for TCP retransmission which slowed things down.
This patch dramatically improves E1000 throughput by increasing the
number of RX/TX buffers from 32/8 to 256/256.
The largest chunk of JavaScript from Discord now downloads in roughly
1 second instead of 7 seconds. :^)
If someone specifically wants contiguous memory in the low-physical-
address-for-DMA range ("super pages"), they can use the
allocate_dma_buffer_pages() helper.
Not only does it makes the code more robust and correct as it allows
error propagation, it allows us to enforce timeouts on waiting loops so
we don't hang forever, by waiting for the i8042 controller to respond to
us.
Therefore, it makes the i8042 more resilient against faulty hardware and
bad behaving chipsets out there.
If we don't do so, we just hang forever because we assume there's i8042
controller in the system, which is not a valid assumption for modern PC
hardware.
If the bootloader that loaded us is providing a framebuffer details from
the Multiboot protocol then we can instantiate a framebuffer console.
Otherwise, we should use a text mode console, assuming that the BIOS and
the bootloader didn't try to modeset the screen resolution so we have is
a VGA 80x25 text mode being displayed on screen.
Since "boot_framebuffer_console" is no longer a good representative as a
global variable name, it's changed to g_boot_console to match the fact
that it can be assigned with a text mode console and not framebuffer
console if needed.
Not sure how it's useful to do so, let's not assert if something tries
to disable it. If we will use TextModeConsole as a boot console, that
console will be disabled after loading an appropriate console to replace
it.
Instead, we can construct this type of object without having to
instantiate a VGACompatibleAdapter object first.
This can help instantiate such console very early on boot to aid debug
issues on bare metal hardware.
Function-local `static constexpr` variables can be `constexpr`. This
can reduce memory consumption, binary size, and offer additional
compiler optimizations.
These changes result in a stripped x86_64 kernel binary size reduction
of 592 bytes.
This is not ASCII-betical because `_` comes after all the uppercase
characters. Treating `_` as a ` ` (space character), these lists are
now alphabetical.
The global variable use in these functions is super thread-unsafe and
means that any concurrent calls to sprintf or fprintf in a process
could race with each other and end up writing unexpected results.
We can just replace the function + global variable with a lambda that
captures the relevant argument when calling printf_internal instead.