By constraining two implementations, the compiler will select the best
fitting one. All this will require is duplicating the implementation and
simplifying for the `void` case.
This constraining also informs both the caller and compiler by passing
the callback parameter types as part of the constraint
(e.g.: `IterationFunction<int>`).
Some `for_each` functions in LibELF only take functions which return
`void`. This is a minimal correctness check, as it removes one way for a
function to incompletely do something.
There seems to be a possible idiom where inside a lambda, a `return;` is
the same as `continue;` in a for-loop.
Regressed in 8a4cc735b9.
We stopped generating "process created" when enabling profiling,
which led to Profiler getting confused about the missing events.
Previously calls to perf_event() would end up in a process-specific
perfcore file even though global profiling was enabled. This changes
the behavior for perf_event() so that these events are stored into
the global profile instead.
This change looks more involved than it actually is. This simply
reshuffles the previous Process constructor and splits out the
parts which can fail (resource allocation) into separate methods
which can be called from a factory method. The factory is then
used everywhere instead of the constructor.
We were using ELF::Image::section(0) to indicate the "undefined"
section, when what we really wanted was just Optional<Section>.
So let's use Optional instead. :^)
The function fstatat can do the same thing as the stat and lstat
functions. However, it can be passed the file descriptor of a directory
which will be used when as the starting point for relative paths. This
is contrary to stat and lstat which use the current working directory as
the starting for relative paths.
This updates the profiling subsystem to use a separate timer to
trigger CPU sampling. This timer has a higher resolution (1000Hz)
and is independent from the scheduler. At a later time the
resolution could even be made configurable with an argument for
sys$profiling_enable() - but not today.
Modify the API so it's possible to propagate error on OOM failure.
NonnullOwnPtr<T> is not appropriate for the ThreadTracer::create() API,
so switch to OwnPtr<T>, use adopt_own_if_nonnull() to handle creation.
This patch modifies InodeWatcher to switch to a one watcher, multiple
watches architecture. The following changes have been made:
- The watch_file syscall is removed, and in its place the
create_iwatcher, iwatcher_add_watch and iwatcher_remove_watch calls
have been added.
- InodeWatcher now holds multiple WatchDescriptions for each file that
is being watched.
- The InodeWatcher file descriptor can be read from to receive events on
all watched files.
Co-authored-by: Gunnar Beutner <gunnar@beutner.name>
Previously we'd try to load ELF images which did not have
an interpreter set with an incorrect load offset of 0, i.e. way
outside of the part of the address space where we'd expect either
the dynamic loader or the user's executable to reside.
This fixes the problem by using get_load_offset for both executables
which have an interpreter set and those which don't. Notably this
allows us to actually successfully execute the Loader.so binary:
courage:~ $ /usr/lib/Loader.so
You have invoked `Loader.so'. This is the helper program for programs
that use shared libraries. Special directives embedded in executables
tell the kernel to load this program.
This helper program loads the shared libraries needed by the program,
prepares the program to run, and runs it. You do not need to invoke
this helper program directly.
courage:~ $
The current method of emitting performance events requires a bit of
boiler plate at every invocation, as well as having to ignore the
return code which isn't used outside of the perf event syscall. This
change attempts to clean that up by exposing high level API's that
can be used around the code base.
The fact that current_time can "fail" makes its use a bit awkward.
All callers in the Kernel are trusted besides syscalls, so assert
that they never get there, and make sure all current callers perform
validation of the clock_id with TimeManagement::is_valid_clock_id().
I have fuzzed this change locally for a bit to make sure I didn't
miss any obvious regression.
The error handling in all these cases was still using the old style
negative values to indicate errors. We have a nicer solution for this
now with KResultOr<T>. This change switches the interface and then all
implementers to use the new style.
Previously, TLS data was always zero-initialized.
To support initializing the values of TLS data, sys$allocate_tls now
receives a buffer with the desired initial data, and copies it to the
master TLS region of the process.
The DynamicLinker gathers the initial TLS image and passes it to
sys$allocate_tls.
We also now require the size passed to sys$allocate_tls to be
page-aligned, to make things easier. Note that this doesn't waste memory
as the TLS data has to be allocated in separate pages anyway.
For example Linux accepts an additional argument for flags in accept4()
that let the user specify what flags they want. However, by default
accept() should not inherit those flags from the listener socket.
Theoretically the append should never fail as we have in-line storage
of FD_SETSIZE, which should always be enough. However I'm planning on
removing the non-try variants of AK::Vector when compiling in kernel
mode in the future, so this will need to go eventually. I suppose it
also protects against some unforeseen bug where we we can append more
than FD_SETSIZE items.
Theoretically the append should never fail as we have in-line storage
of 2, which should be enough. However I'm planning on removing the
non-try variants of AK::Vector when compiling in kernel mode in the
future, so this will need to go eventually. I suppose it also protects
against some unforeseen bug where we we can append more than 2 items.
sys$purge() is a bit unique, in that it is probably in the systems
advantage to attempt to limp along if we hit OOM while processing
the vmobjects to purge. This change modifies the algorithm to observe
OOM and continue trying to purge any previously visited VMObjects.
This turns the perfcore format into more a log than it was before,
which lets us properly log process, thread and region
creation/destruction. This also makes it unnecessary to dump the
process' regions every time it is scheduled like we did before.
Incidentally this also fixes 'profile -c' because we previously ended
up incorrectly dumping the parent's region map into the profile data.
Log-based mmap support enables profiling shared libraries which
are loaded at runtime, e.g. via dlopen().
This enables profiling both the parent and child process for
programs which use execve(). Previously we'd discard the profiling
data for the old process.
The Profiler tool has been updated to not treat thread IDs as
process IDs anymore. This enables support for processes with more
than one thread. Also, there's a new widget to filter which
process should be displayed.
Userspace can provide a null argument for the `act` argument to the
`sigaction` syscall to not set any new behavior. This is described
here:
https://pubs.opengroup.org/onlinepubs/007904875/functions/sigaction.html
Without this fix, the `copy_from_user(...)` invocation on `user_act`
fails and makes the syscall return early.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
While profiling all processes the profile buffer lives forever.
Once you have copied the profile to disk, there's no need to keep it
in memory. This syscall surfaces the ability to clear that buffer.
This adds PT_PEEKDEBUG and PT_POKEDEBUG to allow for reading/writing
the debug registers, and updates the Kernel's debug handler to read the
new information from the debug status register.
Previously the dynamic loader would become unused after it had invoked
the program's entry function. However, in order to support exceptions
and - at a later point - dlfcn functionality we need to call back
into the dynamic loader at runtime.
Because the dynamic loader has a static copy of LibC it'll attempt to
invoke syscalls directly from its text segment. For this to work the
executable region for the dynamic loader needs to have syscalls enabled.
Reading from the mapping doesn't work when the text segment has a non-zero
offset because in that case the first mapped page doesn't contain the ELF
header.
This should provide some speed up, as currently searches for regions
containing a given address were performed in O(n) complexity, while
this container allows us to do those in O(logn).
This does not affect functionality right now, but it means that the
regions vector will now never have any overlapping regions, which will
allow the use of balance binary search trees instead of a vector in the
future. (since they require keys to be exclusive)
The end goal of this commit is to allow to boot on bare metal with no
PS/2 device connected to the system. It turned out that the original
code relied on the existence of the PS/2 keyboard, so VirtualConsole
called it even though ACPI indicated the there's no i8042 controller on
my real machine because I didn't plug any PS/2 device.
The code is much more flexible, so adding HID support for other type of
hardware (e.g. USB HID) could be much simpler.
Briefly describing the change, we have a new singleton called
HIDManagement, which is responsible to initialize the i8042 controller
if exists, and to enumerate its devices. I also abstracted a bit
things, so now every Human interface device is represented with the
HIDDevice class. Then, there are 2 types of it - the MouseDevice and
KeyboardDevice classes; both are responsible to handle the interface in
the DevFS.
PS2KeyboardDevice, PS2MouseDevice and VMWareMouseDevice classes are
responsible for handling the hardware-specific interface they are
assigned to. Therefore, they are inheriting from the IRQHandler class.
Previously, Process::do_write would error if the O_APPEND flag was set
on a non-seekable file.
Other systems (such as Linux) seem to be OK with doing this, so we now
do not attempt to seek to the end the file if it's not seekable.
Alot of code is shared between i386/i686/x86 and x86_64
and a lot probably will be used for compatability modes.
So we start by moving the headers into one Directory.
We will probalby be able to move some cpp files aswell.
We previously ignored these values in the return value of
load_elf_object, which causes us to not allocate a TLS region for
statically-linked programs.
This implements a fallback to munmap that unmaps multiple regions at a
time, with splitting some when needed.
The way it is implemented is possibly not optimal, due to it searching
without looking into the cache
The previous architecture had a huge flaw: the pointer to the protected
data was itself unprotected, allowing you to overwrite it at any time.
This patch reorganizes the protected data so it's part of the Process
class itself. (Actually, it's a new ProcessBase helper class.)
We use the first 4 KB of Process objects themselves as the new storage
location for protected data. Then we make Process objects page-aligned
using MAKE_ALIGNED_ALLOCATED.
This allows us to easily turn on/off write-protection for everything in
the ProcessBase portion of Process. :^)
Thanks to @bugaevc for pointing out the flaw! This is still not perfect
but it's an improvement.
Process member variable like m_euid are very valuable targets for
kernel exploits and until now they have been writable at all times.
This patch moves m_euid along with a whole bunch of other members
into a new Process::ProtectedData struct. This struct is remapped
as read-only memory whenever we don't need to write to it.
This means that a kernel write primitive is no longer enough to
overwrite a process's effective UID, you must first unprotect the
protected data where the UID is stored. :^)
This returns ENOSYS if you are running in the real kernel, and some
other result if you are running in UserspaceEmulator.
There are other ways we could check if we're inside an emulator, but
it seemed easier to just ask. :^)
Switch to using type-safe bitwise operators for the BlockFlags class,
this cleans up a lot of boilerplate casts which are necessary when the
enum is declared as `enum class`.
The expression
(u8*)params.m_stack_location + stack_size
… causes UBSan to spit out the warning
KUBSAN: addition of unsigned offset to 0x00000002 overflowed to 0xb0000003
… even though there is no actual overflow happening here.
This can be reproduced by running:
$ syscall create_thread 0 [ 0 0 0 0 0xb0000001 2 ]
Technically, this is a true-positive: The C++-reference is incredibly strict
about pointer-arithmetic:
> A pointer to non-array object is treated as a pointer to the first element
> of an array with size 1. […] [A]ttempts to generate a pointer that isn't
> pointing at an element of the same array or one past the end invoke
> undefined behavior.
https://en.cppreference.com/w/cpp/language/operator_arithmetic
Frankly, this feels silly. So let's just use FlatPtr instead.
Found by fuzz-syscalls. Undocumented bug.
Note that FlatPtr is an unsigned type, so
user_esp.value() - 4
is defined even if we end up with a user_esp of 0 (this can happen for example
when params.m_stack_size = 0 and params.m_stack_location = 0). The result would
be a Kernelspace-pointer, which would then be immediately flagged by
'MM.validate_user_stack' as invalid, as intended.
Since we know for sure that the virtual memory regions in the new
process being created are not being used on any CPU, there's no need
to do TLB flushes for every mapped page.
Dynamic Vector allocations in sys$select() were showing up in the
full-system profile and since there will never be more than FD_SETSIZE
file descriptors to worry about, we can confidently add enough inline
capacity to this Vector that it never has to kmalloc.
To compensate for the increased stack usage, reduce the size of the
FDInfo struct while we're here. :^)
The perfcore file format was previously limited to a single process
since the pid/executable/regions data was top-level in the JSON.
This patch moves the process-specific data into a top-level array
named "processes" and we now add entries for each process that has
been sampled during the profile run.
This makes it possible to see samples from multiple threads when
viewing a perfcore file with Profiler. This is extremely cool! :^)
The superuser can now call sys$profiling_enable() with PID -1 to enable
profiling of all running threads in the system. The perf events are
collected in a global PerformanceEventBuffer (currently 32 MiB in size.)
The events can be accessed via /proc/profile
If we can't allocate a PerformanceEventBuffer to store the profiling
events, we now fail sys$profiling_enable() and sys$perf_event()
with ENOMEM instead of carrying on with a broken buffer.
I don't dare touch the multi-threading logic and locking mechanism, so it stays
timespec for now. However, this could and should be changed to AK::Time, and I
bet it will simplify the "increment_time_since_boot()" code.
This commit is very invasive, because Thread likes to take a pointer and write
to it. This means that translating between timespec/timeval/Time would have been
more difficult than just changing everything that hands a raw pointer to Thread,
in bulk.
fuzz-syscalls found a bunch of unaligned accesses into struct sigaction
via this syscall. This patch fixes that issue by porting the syscall
to Userspace<T> which we should have done anyway. :^)
Fixes#5500.
This was another vestige from a long time ago, when exiting a thread
would mutate global data structures that were only protected by the
interrupt flag.
This was necessary in the past when crash handling would modify
various global things, but all that stuff is long gone so we can
simplify crashes by leaving the interrupt flag alone.
Make more of the kernel compile in 64-bit mode, and make some things
pointer-size-agnostic (by using FlatPtr.)
There's a lot of work to do here before the kernel will even compile.
Instead of writing to the userspace utsname struct one field at a time,
build up a utsname on the kernel stack and copy it out to userspace
once it's finished. This is both simpler and gets validity checking
built-in for free.
Found by KUBSAN! :^)
Fixes#5499.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
- We've already computed the number of fds * sizeof(pollfd), so use it
instead of needlessly doing it again.
- Use fds_copy.data() instead off address of indexing the vector.
The `default_signal_action(u8 signal)` function already has the
full mapping. The only caveat being that now we need to make
sure the thread constructor and clear_signals() method do the work
of resetting the m_signal_action_data array, instead or relying on
the previous logic in set_default_signal_dispositions.
This is a new promise that guards access to mmap() with MAP_FIXED.
Fixed-address mappings are rarely used, but can be useful if you are
trying to groom the process address space for malicious purposes.
None of our programs need this at the moment, as the only user of
MAP_FIXED is DynamicLoader, but the fixed mappings are constructed
before the process has had a chance to pledge anything.
If we have a tracer process waiting for us to exec, we need to release
the ptrace lock before stopping ourselves, since otherwise the tracer
will block forever on the lock.
Fixes#5409.
sys$mmap() and related syscalls must pad to the nearest page boundary
below the base address *and* above the end address of the specified
range. Since we have to do this in many places, let's make a helper.
This bug is a good example why copy-paste code should eventually be eliminated
from the code base: Apparently the code was copied from read.cpp before
c6027ed7cc, so the same bug got introduced here.
To recap: A malicious program can ask the Kernel to prepare sys-ing to
a huge amount of iovecs. The Kernel must first copy all the vector locations
into 'vecs', and before that allocates an arbitrary amount of memory:
vecs.resize(iov_count);
This can cause Kernel memory exhaustion, triggered by any malicious userland
program.
This patch adds a random offset between 0 and 4096 to the initial
stack pointer in new processes. Since the stack has to be 16-byte
aligned, the bottom bits can't be randomized.
Yet another thing to make things less predictable. :^)
We were doing stack and syscall-origin region validations before
taking the big process lock. There was a window of time where those
regions could then be unmapped/remapped by another thread before we
proceed with our syscall.
This patch closes that window, and makes sys$get_stack_bounds() rely
on the fact that we now know the userspace stack pointer to be valid.
Thanks to @BenWiederhake for spotting this! :^)
If we try to align a number above 0xfffff000 to the next multiple of
the page size (4 KiB), it would wrap around to 0. This is most likely
never what we want, so let's assert if that happens.
The signal trampoline was previously in kernelspace memory, but with
a special exception to make it user-accessible.
This patch moves it into each process's regular address space so we
can stop supporting user-allowed memory above 0xc0000000.
We were failing to round down the base of partial VM ranges. This led
to split regions being constructed that could have a non-page-aligned
base address. This would then trip assertions in the VM code.
Found by fuzz-syscalls. :^)
If a program attempts to write from more than a million different locations,
there is likely shenaniganery afoot! Refuse to write to prevent kmem exhaustion.
Found by fuzz-syscalls. Can be reproduced by running this in the Shell:
$ syscall writev 1 [ 0 ] 0x08000000
Found by fuzz-syscalls. Can be reproduced by running this in the Shell:
$ syscall exit_thread
This leaves the process in the 'Dying' state but never actually removes it.
Therefore, avoid this scenario by pretending to exit the entire process.
Add a per-process ptrace lock and use it to prevent ptrace access to a
process after it decides to commit to a new executable in sys$execve().
Fixes#5230.
This patch adds Space, a class representing a process's address space.
- Each Process has a Space.
- The Space owns the PageDirectory and all Regions in the Process.
This allows us to reorganize sys$execve() so that it constructs and
populates a new Space fully before committing to it.
Previously, we would construct the new address space while still
running in the old one, and encountering an error meant we had to do
tedious and error-prone rollback.
Those problems are now gone, replaced by what's hopefully a set of much
smaller problems and missing cleanups. :^)
Wrap thread creation in a Thread::try_create() helper that first
allocates a kernel stack region. If that allocation fails, we propagate
an ENOMEM error to the caller.
This avoids the situation where a thread is half-constructed, without a
valid kernel stack, and avoids having to do messy cleanup in that case.
This patch adds sys$msyscall() which is loosely based on an OpenBSD
mechanism for preventing syscalls from non-blessed memory regions.
It works similarly to pledge and unveil, you can call it as many
times as you like, and when you're finished, you call it with a null
pointer and it will stop accepting new regions from then on.
If a syscall later happens and doesn't originate from one of the
previously blessed regions, the kernel will simply crash the process.
We had an exception that allowed SOL_SOCKET + SO_PEERCRED on local
socket to support LibIPC's PID exchange mechanism. This is no longer
needed so let's just remove the exception.
It's useful for programs to change their thread names to say something
interesting about what they are working on. Let's not require "thread"
for this since single-threaded programs may want to do it without
pledging "thread".
This prevents sys$mmap() and sys$mprotect() from creating executable
memory mappings in pledged programs that don't have this promise.
Note that the dynamic loader runs before pledging happens, so it's
unaffected by this.
This adds another layer of defense against introducing new code into a
running process. The only permitted way of doing so is by mmapping an
open file with PROT_READ | PROT_EXEC.
This does make any future JIT implementations slightly more complicated
but I think it's a worthwhile trade-off at this point. :^)
This patch adds enforcement of two new rules:
- Memory that was previously writable cannot become executable
- Memory that was previously executable cannot become writable
Unfortunately we have to make an exception for text relocations in the
dynamic loader. Since those necessitate writing into a private copy
of library code, we allow programs to transition from RW to RX under
very specific conditions. See the implementation of sys$mprotect()'s
should_make_executable_exception_for_dynamic_loader() for details.
When mounting an Ext2FS, a block device source is required. All other
filesystem types are unaffected, as most of them ignore the source file
descriptor anyway.
Fixes#5153.
`allocate_randomized` assert an already sanitized size but `mmap` were
just forwarding whatever the process asked so it was possible to
trigger a kernel panic from an unpriviliged process just by asking some
randomly placed memory and a size non alligned with the page size.
This fixes this issue by rounding up to the next page size before
calling `allocate_randomized`.
Fixes#5149
This can be used to request random VM placement instead of the highly
predictable regular mmap(nullptr, ...) VM allocation strategy.
It will soon be used to implement ASLR in the dynamic loader. :^)
When passing nullptr for either promises or execpromises to pledge(),
the expected behaviour is to not change their current value at all - we
were accidentally resetting them to 0, effectively dropping previously
pledge()'d promises.
We now move the execpromises state into the regular promises, and clear
the execpromises state.
Also make sure to duplicate the promise state on fork.
This fixes an issue where "su" would launch a shell which immediately
crashed due to not having pledged "stdio".
Let's force callers to provide a VM range when allocating a region.
This makes ENOMEM error handling more visible and removes implicit
VM allocation which felt a bit magical.
This tells the kernel that the process wants to use pledge, but without
pledging anything - effectively restricting it to syscalls that don't
require a certain promise. This is part of OpenBSD's pledge() as well,
which served as basis for Serenity's.
Instead of letting each File subclass do range allocation in their
mmap() override, do it up front in sys$mmap().
This makes us honor alignment requests for file-backed memory mappings
and simplifies the code somwhat.
This was done with the help of several scripts, I dump them here to
easily find them later:
awk '/#ifdef/ { print "#cmakedefine01 "$2 }' AK/Debug.h.in
for debug_macro in $(awk '/#ifdef/ { print $2 }' AK/Debug.h.in)
do
find . \( -name '*.cpp' -o -name '*.h' -o -name '*.in' \) -not -path './Toolchain/*' -not -path './Build/*' -exec sed -i -E 's/#ifdef '$debug_macro'/#if '$debug_macro'/' {} \;
done
# Remember to remove WRAPPER_GERNERATOR_DEBUG from the list.
awk '/#cmake/ { print "set("$2" ON)" }' AK/Debug.h.in
For some reason we were keeping the bits 04777 in file modes. That
doesn't seem right and I can't think of a reason why the set-uid bit
should be allowed to slip through.
This was just an alias for "unix" that I added early on back when there
was some belief that we might be compatible with OpenBSD. We're clearly
never going to be compatible with their pledges so just drop the alias.
..and allow implicit creation of KResult and KResultOr from ErrnoCode.
This means that kernel functions that return those types can finally
do "return EINVAL;" and it will just work.
There's a handful of functions that still deal with signed integers
that should be converted to return KResults.
This adds support for FUTEX_WAKE_OP, FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET,
FUTEX_REQUEUE, and FUTEX_CMP_REQUEUE, as well well as global and private
futex and absolute/relative timeouts against the appropriate clock. This
also changes the implementation so that kernel resources are only used when
a thread is blocked on a futex.
Global futexes are implemented as offsets in VMObjects, so that different
processes can share a futex against the same VMObject despite potentially
being mapped at different virtual addresses.
This sort-of matches what some other systems do and seems like a
generally sane thing to do instead of allowing programs to spawn a
child with a nearly full stack.
We forgot to remove the automatic SMAP disablers after fixing up all
this code to not access userspace memory directly. Let's lock things
down at last. :^)
All users of this mechanism have been switched to anonymous files and
passing file descriptors with sendfd()/recvfd().
Shbufs got us where we are today, but it's time we say good-bye to them
and welcome a much more idiomatic replacement. :^)