This ensures that everything allocated on the stack in Process::exec()
gets cleaned up. We had a few leaks related to the parsing of shebang
(#!) executables that get fixed by this.
This function is an extended version of `chmod(2)` that lets one control
whether to dereference symlinks, and specify a file descriptor to a
directory that will be used as the base for relative paths.
Previously we would crash the process immediately when a promise
violation was found during a syscall. This is error prone, as we
don't unwind the stack. This means that in certain cases we can
leak resources, like an OwnPtr / RefPtr tracked on the stack. Or
even leak a lock acquired in a ScopeLockLocker.
To remedy this situation we move the promise violation handling to
the syscall handler, right before we return to user space. This
allows the code to follow the normal unwind path, and grantees
there is no longer any cleanup that needs to occur.
The Process::require_promise() and Process::require_no_promises()
functions were modified to return ErrorOr<void> so we enforce that
the errors are always propagated by the caller.
This file refers to the controlling terminal associated with the current
process. It's specified by POSIX, and is used by ports like openssh to
interface with the terminal even if the standard input/output is
redirected to somewhere else.
Our implementation leverages ProcFS's existing facilities to create
process-specific symbolic links. In our setup, `/dev/tty` is a symbolic
link to `/proc/self/tty`, which itself is a symlink to the appropriate
`/dev/pts` entry. If no TTY is attached, `/dev/tty` is left dangling.
Now that the userland has a compatiblity wrapper for select(), the
kernel doesn't need to implement this syscall natively. The poll()
interface been around since 1987, any code still using select()
should be slapped silly.
Note: the SerenityOS source tree mostly uses select() and not poll()
despite SerenityOS having support for poll() since early 2019...
This includes a new Thread::Blocker called SignalBlocker which blocks
until a signal of a matching type is pending. The current Blocker
implementation in the Kernel is very complicated, but cleaning it up is
a different yak for a different day.
This allows userspace to trigger a full (FIXME) flush of a shared file
mapping to disk. We iterate over all the mapped pages in the VMObject
and write them out to the underlying inode, one by one. This is rather
naive, and there's lots of room for improvement.
Note that shared file mappings are currently not possible since mmap()
returns ENOTSUP for PROT_WRITE+MAP_SHARED. That restriction will be
removed in a subsequent commit. :^)
This isn't a complete conversion to ErrorOr<void>, but a good chunk.
The end goal here is to propagate buffer allocation failures to the
caller, and allow the use of TRY() with formatting functions.
... In files included from Kernel/Thread.cpp or Kernel/Process.cpp
Some places the warning is suppressed, because we do not want a const
object do have non-const access to the returned sub-object.
`off_t` is a 64-bit signed integer, so passing it in a register on i686
is not the best idea.
This fix gets us one step closer to making the LLVM port work.
Instead of signalling allocation failure with a bool return value
(false), we now use ErrorOr<void> and return ENOMEM as appropriate.
This allows us to use TRY() and MUST() with Vector. :^)
In particular, fstatvfs used to assume that a file that was earlier
opened using some path will forever be at that path. This is wrong, and
in the meantime new mounts and new filesystems could take up the
filename or directories, leading to a completely inaccurate result.
This commit improves the situation:
- All filesystem information is now always accurate.
- The mount flags *might* be erroneously zero, if the custody for the
open file is not available. I don't know when that might happen, but
it is definitely not the typical case.
We now use AK::Error and AK::ErrorOr<T> in both kernel and userspace!
This was a slightly tedious refactoring that took a long time, so it's
not unlikely that some bugs crept in.
Nevertheless, it does pass basic functionality testing, and it's just
real nice to finally see the same pattern in all contexts. :^)
The OpenFileDescription class already offers the necessary functionlity,
so implementing this was only a matter of following the structure for
`read` while handling the additional `offset` argument.
This change removes the halt and reboot syscalls, and create a new
mechanism to change the power state of the machine.
Instead of how power state was changed until now, put a SysFS node as
writable only for the superuser, that with a defined value, can result
in either reboot or poweroff.
In the future, a power group can be assigned to this node (which will be
the GroupID responsible for power management).
This opens an opportunity to permit to shutdown/reboot without superuser
permissions, so in the future, a userspace daemon can take control of
this node to perform power management operations without superuser
permissions, if we enforce different UserID/GroupID on that node.
These interfaces are broken for about 9 months, maybe longer than that.
At this point, this is just a dead code nobody tests or tries to use, so
let's remove it instead of keeping a stale code just for the sake of
keeping it and hoping someone will fix it.
To better justify this, I read that OpenBSD removed loadable kernel
modules in 5.7 release (2014), mainly for the same reason we do -
nobody used it so they had no good reason to maintain it.
Still, OpenBSD had LKMs being effectively working, which is not the
current state in our project for a long time.
An arguably better approach to minimize the Kernel image size is to
allow dropping drivers and features while compiling a new image.
This patch converts all the usage of AK::String around sys$execve() to
using KString instead, allowing us to catch and propagate OOM errors.
It also required changing the kernel CommandLine helper class to return
a vector of KString for the userspace init program arguments.
Make use of the new FileDescription::try_serialize_absolute_path() to
avoid String in favor of KString throughout much of sys$execve() and
its helpers.
We previously allowed Thread to exist in a state where its m_name was
null, and had to work around that in various places.
This patch removes that possibility and forces those who would create a
thread (or change the name of one) to provide a NonnullOwnPtr<KString>
with the name.
The default template argument is only used in one place, and it
looks like it was probably just an oversight. The rest of the Kernel
code all uses u8 as the type. So lets make that the default and remove
the unused template argument, as there doesn't seem to be a reason to
allow the size to be customizable.
Instead of checking it at every call site (to generate EBADF), we make
file_description(fd) return a KResultOr<NonnullRefPtr<FileDescription>>.
This allows us to wrap all the calls in TRY(). :^)
The only place that got a little bit messier from this is sys$mount(),
and there's a whole bunch of things there in need of cleanup.
This is the idiomatic way to declare type aliases in modern C++.
Flagged by Sonar Cloud as a "Code Smell", but I happen to agree
with this particular one. :^)
This function is currently only ever used to create the init process
(SystemServer). It had a few idiosyncratic things about it that this
patch cleans up:
- Errors were returned in an int& out-param.
- It had a path for non-0 process PIDs which was never taken.
REQUIRE_PROMISE and REQUIRE_NO_PROMISES were macros for some reason,
and used all over the place.
This patch adds require_promise(Pledge) and require_no_promises()
to Process and makes the macros call these on the current process
instead of inlining code everywhere.
Prior to this change, both uid_t and gid_t were typedef'ed to `u32`.
This made it easy to use them interchangeably. Let's not allow that.
This patch adds UserID and GroupID using the AK::DistinctNumeric
mechanism we've already been employing for pid_t/ProcessID.
Previously, we would try to acquire a reference to the all processes
lock or other contended resources while holding both the scheduler lock
and the thread's blocker lock. This could lead to a deadlock if we
actually have to block on those other resources.
There are callers of processes().with or processes().for_each that
require interrupts to be disabled. Taking a Mutexe with interrupts
disabled is a recipe for deadlock, so convert this to a Spinlock.
This has several benefits:
1) We no longer just blindly derefence a null pointer in various places
2) We will get nicer runtime error messages if the current process does
turn out to be null in the call location
3) GCC no longer complains about possible nullptr dereferences when
compiling without KUBSAN
We are not using this for anything and it's just been sitting there
gathering dust for well over a year, so let's stop carrying all this
complexity around for no good reason.
This makes the following scenario impossible with an SMP setup:
1) CPU A enters unref() and decrements the link count to 0.
2) CPU B sees the process in the process list and ref()s it.
3) CPU A removes the process from the list and continues destructing.
4) CPU B is now holding a destructed Process object.
By holding the process list lock before doing anything with it, we
ensure that other CPUs can't find this process in the middle of it being
destructed.
This allows us to 1) let go of the Process when an inode is ref'ing for
ProcFSExposedComponent related reasons, and 2) change our ref/unref
implementation.
The only two paths for copying strings in the kernel should be going
through the existing Userspace<char const*>, or StringArgument methods.
Lets enforce this by removing the option for using the raw cstring APIs
that were previously available.
The compiler can re-order the structure (class) members if that's
necessary, so if we make Process to inherit from ProcFSExposedComponent,
even if the declaration is to inherit first from ProcessBase, then from
ProcFSExposedComponent and last from Weakable<Process>, the members of
class ProcFSExposedComponent (including the Ref-counted parts) are the
first members of the Process class.
This problem made it impossible to safely use the current toggling
method with the write-protection bit on the ProcessBase members, so
instead of inheriting from it, we make its members the last ones in the
Process class so we can safely locate and modify the corresponding page
write protection bit of these values.
We make sure that the Process class doesn't expand beyond 8192 bytes and
the protected values are always aligned on a page boundary.
Making userspace provide a global string ID was silly, and made the API
extremely difficult to use correctly in a global profiling context.
Instead, simply make the kernel do the string ID allocation for us.
This also allows us to convert the string storage to a Vector in the
kernel (and an array in the JSON profile data.)
This syscall allows userspace to register a keyed string that appears in
a new "strings" JSON object in profile output.
This will be used to add custom strings to profile signposts. :^)
This patch adds a vDSO-like mechanism for exposing the current time as
an array of per-clock-source timestamps.
LibC's clock_gettime() calls sys$map_time_page() to map the kernel's
"time page" into the process address space (at a random address, ofc.)
This is only done on first call, and from then on the timestamps are
fetched from the time page.
This first patch only adds support for CLOCK_REALTIME, but eventually
we should be able to support all clock sources this way and get rid of
sys$clock_gettime() in the kernel entirely. :^)
Accesses are synchronized using two atomic integers that are incremented
at the start and finish of the kernel's time page update cycle.
I had to move the thread list out of the protected base area of Process
so that it could live with its lock (which needs to be mutable).
Ideally it would live in the protected area, so maybe we can figure out
a way to do that later.
This isn't needed for Process / Thread as they only reference it
by pointer and it's already part of Kernel/Forward.h. So just include
it where the implementation needs to call it.
The way the Process::FileDescriptions::allocate() API works today means
that two callers who allocate back to back without associating a
FileDescription with the allocated FD, will receive the same FD and thus
one will stomp over the other.
Naively tracking which FileDescriptions are allocated and moving onto
the next would introduce other bugs however, as now if you "allocate"
a fd and then return early further down the control flow of the syscall
you would leak that fd.
This change modifies this behavior by tracking which descriptions are
allocated and then having an RAII type to "deallocate" the fd if the
association is not setup the end of it's scope.
Currently all syscalls run under the Process:m_big_lock, which is an
obvious bottleneck. Long term we would like to remove the big lock and
replace it with the required fine grained locking.
To facilitate this goal we need a way of gradually decomposing the big
lock into the all of the required fine grained locks. This commit
introduces instrumentation to the syscall table, allowing the big lock
requirement to be toggled on/off per syscall.
Eventually when we are finished, no syscall will required the big lock,
and we'll be able to remove all of this instrumentation.
It's possible that another thread might try to exit the process just
about the same time another thread does the same, or a crash happens.
Also, we may not be able to kill all other threads instantly as they
may be blocked in the kernel (though in this case they would get killed
before ever returning back to user mode. So keep track of whether
Process::die was already called and ignore it on subsequent calls.
Fixes#8485
This re-arranges the order of how things are initialized so that we
try to initialize process and thread management earlier. This is
neccessary because a lot of the code uses the Lock class, which really
needs to have a running scheduler in place so that we can properly
preempt.
This also enables us to potentially initialize some things in parallel.
The ProtectedDataMutationScope cannot blindly assume that there is only
exactly one thread at a time that may want to unprotect the Process.
Most of the time the big lock guaranteed this, but there are some cases
such as finalization (among others) where this is not necessarily
guaranteed.
This fixes random panics due to access violations when the
ProtectedDataMutationScope protects the Process instance while another
is still modifying it.
Fixes#8512
This was an old SerenityOS-specific syscall for donating the remainder
of the calling thread's time-slice to another thread within the same
process.
Now that Threading::Lock uses a pthread_mutex_t internally, we no
longer need this syscall, which allows us to get rid of a surprising
amount of unnecessary scheduler logic. :^)
The new ProcFS design consists of two main parts:
1. The representative ProcFS class, which is derived from the FS class.
The ProcFS and its inodes are much more lean - merely 3 classes to
represent the common type of inodes - regular files, symbolic links and
directories. They're backed by a ProcFSExposedComponent object, which
is responsible for the functional operation behind the scenes.
2. The backend of the ProcFS - the ProcFSComponentsRegistrar class
and all derived classes from the ProcFSExposedComponent class. These
together form the entire backend and handle all the functions you can
expect from the ProcFS.
The ProcFSExposedComponent derived classes split to 3 types in the
manner of lifetime in the kernel:
1. Persistent objects - this category includes all basic objects, like
the root folder, /proc/bus folder, main blob files in the root folders,
etc. These objects are persistent and cannot die ever.
2. Semi-persistent objects - this category includes all PID folders,
and subdirectories to the PID folders. It also includes exposed objects
like the unveil JSON'ed blob. These object are persistent as long as the
the responsible process they represent is still alive.
3. Dynamic objects - this category includes files in the subdirectories
of a PID folder, like /proc/PID/fd/* or /proc/PID/stacks/*. Essentially,
these objects are always created dynamically and when no longer in need
after being used, they're deallocated.
Nevertheless, the new allocated backend objects and inodes try to use
the same InodeIndex if possible - this might change only when a thread
dies and a new thread is born with a new thread stack, or when a file
descriptor is closed and a new one within the same file descriptor
number is opened. This is needed to actually be able to do something
useful with these objects.
The new design assures that many ProcFS instances can be used at once,
with one backend for usage for all instances.
The Process::Handler type has KResultOr<FlatPtr> as its return type.
Using a different return type with an equally-sized template parameter
sort of works but breaks once that condition is no longer true, e.g.
for KResultOr<int> on x86_64.
Ideally the syscall handlers would also take FlatPtrs as their args
so we can get rid of the reinterpret_cast for the function pointer
but I didn't quite feel like cleaning that up as well.
Without this the ProcessBase class is placed into the padding for the
ProtectedProcessBase class which then causes the members of the
RefCounted class to end up without the first 4096 bytes of the Process
class:
BP 1, Kernel::Process::protect_data (this=this@entry=0xc063b000)
205 {
(gdb) p &m_ref_count
$1 = (AK::Atomic<unsigned int, (AK::MemoryOrder)5> *) 0xc063bffc
Note how the difference between 'this' and &m_ref_count is less than
4096.
This fixes a bug where unveiling a subdirectory of an already unveiled
path would sometimes be allowed and sometimes not (depending on what
other unveil calls have been made).
Now, it is always allowed to unveil a subdirectory of an already
unveiled directory, even if it has higher permissions.
This removes the need for the permissions_inherited_from_root flag in
UnveilMetadata, so it has been removed.
This adds two new arguments to the thread_exit system call which let
a thread unmap an arbitrary VM range on thread exit. LibPthread
uses this functionality to unmap the thread stack.
Fixes#7267.
With -Og, all calls to create_kernel_process were triggering -Wnonnull
when creating these lambdas that get implicitly converted to function
pointers. A different design of create_kernel_process to use
AK::Function instead might avoid this awkward behavior.
Previously the process' m_profiling flag was ignored for all event
types other than CPU samples.
The kfree tracing code relies on temporarily disabling tracing during
exec. This didn't work for per-process profiles and would instead
panic.
This updates the profiling code so that the m_profiling flag isn't
ignored.
Problem:
- `static` variables consume memory and sometimes are less
optimizable.
- `static const` variables can be `constexpr`, usually.
- `static` function-local variables require an initialization check
every time the function is run.
Solution:
- If a global `static` variable is only used in a single function then
move it into the function and make it non-`static` and `constexpr`.
- Make all global `static` variables `constexpr` instead of `const`.
- Change function-local `static const[expr]` variables to be just
`constexpr`.
Unlike accept() the new accept4() system call lets the caller specify
flags for the newly accepted socket file descriptor, such as
SOCK_CLOEXEC and SOCK_NONBLOCK.
By constraining two implementations, the compiler will select the best
fitting one. All this will require is duplicating the implementation and
simplifying for the `void` case.
This constraining also informs both the caller and compiler by passing
the callback parameter types as part of the constraint
(e.g.: `IterationFunction<int>`).
Some `for_each` functions in LibELF only take functions which return
`void`. This is a minimal correctness check, as it removes one way for a
function to incompletely do something.
There seems to be a possible idiom where inside a lambda, a `return;` is
the same as `continue;` in a for-loop.
This change looks more involved than it actually is. This simply
reshuffles the previous Process constructor and splits out the
parts which can fail (resource allocation) into separate methods
which can be called from a factory method. The factory is then
used everywhere instead of the constructor.
Modify the API so it's possible to propagate error on OOM failure.
NonnullOwnPtr<T> is not appropriate for the ThreadTracer::create() API,
so switch to OwnPtr<T>, use adopt_own_if_nonnull() to handle creation.
This patch modifies InodeWatcher to switch to a one watcher, multiple
watches architecture. The following changes have been made:
- The watch_file syscall is removed, and in its place the
create_iwatcher, iwatcher_add_watch and iwatcher_remove_watch calls
have been added.
- InodeWatcher now holds multiple WatchDescriptions for each file that
is being watched.
- The InodeWatcher file descriptor can be read from to receive events on
all watched files.
Co-authored-by: Gunnar Beutner <gunnar@beutner.name>
The current method of emitting performance events requires a bit of
boiler plate at every invocation, as well as having to ignore the
return code which isn't used outside of the perf event syscall. This
change attempts to clean that up by exposing high level API's that
can be used around the code base.
Previously, TLS data was always zero-initialized.
To support initializing the values of TLS data, sys$allocate_tls now
receives a buffer with the desired initial data, and copies it to the
master TLS region of the process.
The DynamicLinker gathers the initial TLS image and passes it to
sys$allocate_tls.
We also now require the size passed to sys$allocate_tls to be
page-aligned, to make things easier. Note that this doesn't waste memory
as the TLS data has to be allocated in separate pages anyway.
This turns the perfcore format into more a log than it was before,
which lets us properly log process, thread and region
creation/destruction. This also makes it unnecessary to dump the
process' regions every time it is scheduled like we did before.
Incidentally this also fixes 'profile -c' because we previously ended
up incorrectly dumping the parent's region map into the profile data.
Log-based mmap support enables profiling shared libraries which
are loaded at runtime, e.g. via dlopen().
This enables profiling both the parent and child process for
programs which use execve(). Previously we'd discard the profiling
data for the old process.
The Profiler tool has been updated to not treat thread IDs as
process IDs anymore. This enables support for processes with more
than one thread. Also, there's a new widget to filter which
process should be displayed.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
While profiling all processes the profile buffer lives forever.
Once you have copied the profile to disk, there's no need to keep it
in memory. This syscall surfaces the ability to clear that buffer.