6323 commits

Author SHA1 Message Date
Andreas Kling 2a5cff232b Kernel: Use slab allocation automagically for small kmalloc() requests
This patch adds generic slab allocators to kmalloc. In this initial
version, the slab sizes are 16, 32, 64, 128, 256 and 512 bytes.

Slabheaps are backed by 64 KiB block-aligned blocks with freelists,
similar to what we do in LibC malloc and LibJS Heap.
2021-12-26 21:22:59 +01:00
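The size-class selection described in 2a5cff232b can be sketched roughly as follows. Only the six slab sizes come from the commit message; the helper name `slab_size_for` and the convention of returning 0 for "too large for any slab" are assumptions made for illustration, not the kernel's actual code.

```cpp
#include <cassert>
#include <cstddef>

// Slab sizes listed in the commit message.
static constexpr size_t slab_sizes[] = { 16, 32, 64, 128, 256, 512 };

// Hypothetical helper: returns the smallest slab size that can hold `size`,
// or 0 if the request is too large for any slab and would have to be served
// by the general-purpose heap instead.
inline size_t slab_size_for(size_t size)
{
    // Zero-sized requests are treated as a caller error in this sketch.
    assert(size > 0);
    for (size_t slab_size : slab_sizes) {
        if (size <= slab_size)
            return slab_size;
    }
    return 0;
}
```

For example, a 17-byte request would land in the 32-byte slab, while a 513-byte request would bypass the slabs entirely.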
Andreas Kling f6c594fa29 Kernel: Remove arbitrary alignment requirement from kmalloc_aligned()
We were not allowing alignments greater than PAGE_SIZE for some reason.
2021-12-26 21:22:59 +01:00
Andreas Kling 9182653a0f Kernel: Log purported size of bogus kfree_sized() requests 2021-12-26 21:22:59 +01:00
Andreas Kling c6c786c992 Kernel: Remove kfree(), leaving only kfree_sized() :^)
There are no more users of the C-style kfree() API in the kernel,
so let's get rid of it and enjoy the new world where we always know
how much memory we are freeing. :^)
2021-12-26 21:22:59 +01:00
Andreas Kling 6eb48f7df6 Kernel: Consolidate kmalloc_aligned() and use kfree_sized() within
This patch does two things:

- Combines kmalloc_aligned() and kmalloc_aligned_cxx(). Templatizing
  the alignment parameter doesn't seem like a valuable enough
  optimization to justify having two almost-identical implementations.

- Stores the real allocation size of an aligned allocation along with
  the other alignment metadata, and uses it to call kfree_sized()
  instead of kfree().
2021-12-26 21:22:59 +01:00
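The second point in 6eb48f7df6 (storing the real allocation size next to the alignment metadata so that a sized free can be used) might look something like the sketch below. It uses `std::malloc`/`std::free` as stand-ins for the kernel allocator, and the header layout and all names (`AlignedHeader`, `aligned_alloc_sketch`, `sized_free`) are assumptions, not the kernel's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Stand-in for a sized free; it records the size it was given so the
// behavior can be observed.
static size_t g_last_freed_size = 0;
static void sized_free(void* ptr, size_t size)
{
    g_last_freed_size = size;
    std::free(ptr);
}

// Metadata stored immediately before the aligned pointer (assumed layout).
struct AlignedHeader {
    size_t real_allocation_size; // size of the underlying allocation
    size_t offset_from_base;     // distance from the underlying pointer to the aligned one
};

void* aligned_alloc_sketch(size_t size, size_t alignment)
{
    // The alignment must be a nonzero power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    // Over-allocate so there is room for the header and for alignment slack.
    size_t real_size = size + alignment + sizeof(AlignedHeader);
    auto* base = static_cast<uint8_t*>(std::malloc(real_size));
    if (!base)
        return nullptr;

    uintptr_t first_usable = reinterpret_cast<uintptr_t>(base) + sizeof(AlignedHeader);
    uintptr_t aligned = (first_usable + alignment - 1) & ~(static_cast<uintptr_t>(alignment) - 1);

    // memcpy avoids assuming the header slot itself is suitably aligned.
    AlignedHeader header { real_size, static_cast<size_t>(aligned - reinterpret_cast<uintptr_t>(base)) };
    std::memcpy(reinterpret_cast<void*>(aligned - sizeof(AlignedHeader)), &header, sizeof(header));
    return reinterpret_cast<void*>(aligned);
}

void aligned_free_sketch(void* ptr)
{
    if (!ptr)
        return;
    AlignedHeader header;
    std::memcpy(&header, static_cast<uint8_t*>(ptr) - sizeof(AlignedHeader), sizeof(header));
    // The stored size lets us call the sized free on the original pointer.
    sized_free(static_cast<uint8_t*>(ptr) - header.offset_from_base, header.real_allocation_size);
}
```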
Andreas Kling 83dd93ff13 Kernel: Use kfree_sized() in SlabAllocator 2021-12-26 21:22:59 +01:00
Andreas Kling 8f3b3af5ea Kernel: Remove no-longer-used Lockable template 2021-12-26 21:22:59 +01:00
Andreas Kling fcf6ccd771 Kernel: Make KernelRng not inherit from Lockable
This class was misusing the outdated Lockable template and didn't take
advantage of the lock/resource separation mechanism fully anyway.

Since the underlying PRNG has its own SpinLock, and we already use that
for synchronization everywhere anyway, we can simply remove the Lockable
inheritance from this class.
2021-12-26 21:22:59 +01:00
Pankaj Raghav 1a27220bca Kernel: Encapsulate APIC initialization inside InterruptManagement
Currently the APIC class is constructed irrespective of whether it
is used or not.

So, move APIC initialization from init to the InterruptManagement
class and construct the APIC class only when it is needed.
2021-12-26 16:22:09 +02:00
Idan Horowitz 7757d874ad Kernel: Assert that a KmallocSubheap fits inside a page
Since we allocate the subheap in the first page of the given storage,
let's assert that the subheap can actually fit in a single page, to
prevent the possible future headache of trying to debug the cause of
random kernel memory corruption :^)
2021-12-26 11:26:39 +01:00
Andreas Kling 1c99f99e99 Kernel: Make kmalloc expansions scale to incoming allocation request
This allows kmalloc() to satisfy arbitrary allocation requests instead
of being limited to a static subheap expansion size.
2021-12-26 10:43:07 +01:00
Andreas Kling f49649645c Kernel: Allocate page tables for the entire kmalloc VM range up front
This avoids getting caught with our pants down when heap expansion fails
due to missing page tables. It also avoids a circular dependency on
kmalloc() by way of HashMap::set() in MemoryManager::ensure_pte().
2021-12-26 02:42:49 +01:00
Andreas Kling d58880b5b0 Kernel: Write to debug log when creating new kmalloc subheaps 2021-12-26 01:25:02 +01:00
Andreas Kling 16850423cf Kernel: Fix deadlock caused by page faults while holding disk cache lock
If the data passed to sys$write() is backed by a not-yet-paged-in inode
mapping, we could end up in a situation where we get a page fault when
trying to copy data from userspace.

If that page fault handler tried reading from an inode that someone else
had locked while waiting for the disk cache lock, we'd deadlock.

This patch fixes the issue by copying the userspace data into a local
buffer before acquiring the disk cache lock. This is not ideal since it
incurs an extra copy, and I'm sure we can think of a better solution
eventually.

This was a frequent cause of startup deadlocks on x86_64 for me. :^)
2021-12-26 00:42:51 +01:00
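The ordering fix in 16850423cf (copy first, lock second) can be illustrated with an ordinary userspace sketch. Here `std::mutex` stands in for the disk cache lock and a plain pointer stands in for the userspace buffer; `DiskCacheSketch` and `write_block_sketch` are hypothetical names, and the real kernel code differs.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Stand-in for the disk cache and its lock.
struct DiskCacheSketch {
    std::mutex lock;
    std::vector<uint8_t> block;
};

// `user_data` stands in for a userspace buffer whose pages may not be
// resident yet, so reading it may trigger a page fault.
void write_block_sketch(DiskCacheSketch& cache, const uint8_t* user_data, size_t size)
{
    // Step 1: copy while holding no locks. Any page fault taken while reading
    // `user_data` happens here, where it cannot deadlock against the cache lock.
    std::vector<uint8_t> local_copy(user_data, user_data + size);

    // Step 2: take the lock and only touch memory that is already resident.
    std::lock_guard<std::mutex> guard(cache.lock);
    cache.block = std::move(local_copy);
}
```

The cost is the extra copy the commit message mentions; the benefit is that nothing that can fault runs inside the critical section.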
Andreas Kling 4d585cdb82 Kernel: Set NX bit on expanded kmalloc memory mappings if supported
We never want to execute kmalloc memory.
2021-12-25 22:07:59 +01:00
Andreas Kling da5c257e2e Kernel: Remove unused function declaration for kmalloc_impl() 2021-12-25 22:07:59 +01:00
Andreas Kling f7a4c34929 Kernel: Make kmalloc heap expansion kmalloc-free
Previously, the heap expansion logic could end up calling kmalloc
recursively, which was quite messy and hard to reason about.

This patch redesigns heap expansion so that it's kmalloc-free:

- We make a single large virtual range allocation at startup
- When expanding, we bump allocate VM from that region
- When expanding, we populate page tables directly ourselves,
  instead of going via MemoryManager.

This makes heap expansion a great deal simpler. However, do note that it
introduces two new flaws that we'll need to deal with eventually:

- The single virtual range allocation is limited to 64 MiB and once
  exhausted, kmalloc() will fail. (Actually, it will PANIC for now..)

- The kmalloc heap can no longer shrink once expanded. Subheaps stay
  in place once constructed.
2021-12-25 22:07:59 +01:00
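The "bump allocate VM from that region" step in f7a4c34929 can be sketched as below. The class name `BumpRangeSketch` and its interface are assumptions for illustration; the sketch also mirrors the two flaws the commit message lists, since the region has a fixed size and nothing is ever returned to it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// A fixed virtual range reserved once at startup; each expansion is carved
// off the front, and the cursor only ever moves forward.
class BumpRangeSketch {
public:
    BumpRangeSketch(uintptr_t base, size_t size)
        : m_next(base)
        , m_end(base + size)
    {
        assert(size > 0);
    }

    // Returns the start of a fresh `size`-byte range, or nothing once the
    // region is exhausted. (The commit notes the kernel PANICs in that case.)
    std::optional<uintptr_t> allocate(size_t size)
    {
        if (size > m_end - m_next)
            return std::nullopt;
        uintptr_t result = m_next;
        m_next += size;
        return result;
    }

private:
    uintptr_t m_next;
    uintptr_t m_end;
};
```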
Andreas Kling 9965e59ad8 Kernel: Remove unnecessary SocketHandle<T> class
This was used to return a pre-locked UDPSocket in one place, but there
was really no need for that mechanism in the first place since the
caller ends up locking the socket anyway.
2021-12-25 11:23:57 +01:00
Brian Gianforcaro 1c950773fb Kernel: Make MemoryManager::protect_ksyms_after_init UNMAP_AFTER_INIT
The function to protect ksyms after initialization is only used during
boot of the system, so it can be UNMAP_AFTER_INIT as well.

This requires we switch the order of the init sequence, so we now call
`MM.protect_ksyms_after_init()` before `MM.unmap_text_after_init()`.
2021-12-24 14:28:59 -08:00
Brian Gianforcaro e88e4967d1 Kernel: Mark PTYMultiplexer init & parse_hex_digit as UNMAP_AFTER_INIT
Noticed these boot-only functions are not currently UNMAP_AFTER_INIT.
Let's fix that :^)
2021-12-24 14:28:59 -08:00
Liav A 52e01b46eb Kernel: Move Multi Processor Parser code to a separate directory 2021-12-23 23:18:58 -08:00
Guilherme Gonçalves da6aef9fff Kernel: Make msync return EINVAL when regions are too large
As a small cleanup, this also makes `page_round_up` verify its
precondition with `page_round_up_would_wrap` (which callers are expected
to call), rather than having its own logic.

Fixes #11297.
2021-12-23 17:43:12 -08:00
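The relationship between `page_round_up` and `page_round_up_would_wrap` described in da6aef9fff might look like the following sketch. The two function names come from the commit message; the bodies, the 4 KiB page size constant, and the use of `assert` in place of the kernel's own verification macro are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Assumed page size for this sketch.
static constexpr uintptr_t PAGE_SIZE_SKETCH = 4096;

// True if rounding `x` up to the next page boundary would overflow.
constexpr bool page_round_up_would_wrap(uintptr_t x)
{
    return x > UINTPTR_MAX - (PAGE_SIZE_SKETCH - 1);
}

// Rounds `x` up to a page boundary. Callers are expected to have checked
// page_round_up_would_wrap() first; the function verifies that precondition
// by reusing the same predicate rather than duplicating the logic.
inline uintptr_t page_round_up(uintptr_t x)
{
    assert(!page_round_up_would_wrap(x));
    return (x + PAGE_SIZE_SKETCH - 1) & ~(PAGE_SIZE_SKETCH - 1);
}
```

A syscall such as msync would then test `page_round_up_would_wrap(size)` up front and return EINVAL instead of calling `page_round_up` with a value that cannot be rounded.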
Daniel Bertalan 8e3d1a42e3 Kernel+UE+LibC: Store address as void* in SC_m{re,}map_params
Most other syscalls pass address arguments as `void*` instead of
`uintptr_t`, so let's do that here too. Besides improving consistency,
this commit makes `strace` correctly pretty-print these arguments in
hex.
2021-12-23 23:08:10 +01:00
Daniel Bertalan 77f9272aaf Kernel+UE: Add MAP_FIXED_NOREPLACE mmap() flag
This feature was introduced in version 4.17 of the Linux kernel, and
while it's not specified by POSIX, I think it will be a nice addition to
our system.

MAP_FIXED_NOREPLACE provides a less error-prone alternative to
MAP_FIXED: while regular fixed mappings would cause any intersecting
ranges to be unmapped, MAP_FIXED_NOREPLACE returns EEXIST instead. This
ensures that we don't corrupt our process's address space if something
is already at the requested address.

Note that the more portable way to do this is to use regular
MAP_ANONYMOUS, and check afterwards whether the returned address matches
what we wanted. This, however, has a large performance impact on
programs like Wine which try to reserve large portions of the address
space at once, as the non-matching addresses have to be unmapped
separately.
2021-12-23 23:08:10 +01:00
Daniel Bertalan 4195a7ef4b Kernel: Return EEXIST in VirtualRangeAllocator::try_allocate_specific()
This error only ever gets propagated to the userspace if
MAP_FIXED_NOREPLACE is requested, as MAP_FIXED unmaps intersecting
ranges beforehand, and non-fixed mmap() calls will just fall back to
allocating anywhere.

Linux specifies MAP_FIXED_NOREPLACE to return EEXIST when it can't
allocate; we now match that behavior.
2021-12-23 23:08:10 +01:00
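The EEXIST behavior from 4195a7ef4b and 77f9272aaf can be modeled with a toy range tracker. This is only a sketch: the real VirtualRangeAllocator is structured differently, and `RangeAllocatorSketch` with its map of allocated ranges is an assumption chosen to keep the example short.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <map>

class RangeAllocatorSketch {
public:
    // Returns 0 on success, or EEXIST if [base, base + size) intersects a
    // range that is already allocated. Nothing is unmapped or replaced.
    int try_allocate_specific(uintptr_t base, size_t size)
    {
        assert(size > 0);
        // Find the first range that starts at or after the end of the request.
        auto it = m_ranges.lower_bound(base + size);
        if (it != m_ranges.begin()) {
            // Step back to the last range that starts before the request ends.
            // Because stored ranges never overlap, it is the only candidate.
            --it;
            if (it->first + it->second > base)
                return EEXIST;
        }
        m_ranges.emplace(base, size);
        return 0;
    }

private:
    std::map<uintptr_t, size_t> m_ranges; // start address -> size, non-overlapping
};
```

Under this model, a MAP_FIXED_NOREPLACE request simply surfaces the EEXIST to the caller, whereas MAP_FIXED would first unmap whatever intersects and a non-fixed mmap() would fall back to another address.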
Liav A 9eb08bdb0f Kernel: Make major and minor numbers to be DistinctNumerics
This helps avoid confusion in general, and makes constructors, methods
and code patterns much cleaner and more understandable.
2021-12-23 23:02:39 +01:00
Andreas Kling 1d08b671ea Kernel: Enter new address space before destroying old in sys$execve()
Previously we were assigning to Process::m_space before actually
entering the new address space (assigning it to CR3.)

If a thread was preempted by the scheduler while destroying the old
address space, we'd then attempt to resume the thread with CR3 pointing
at a partially destroyed address space.

We could then crash immediately in write_cr3(), right after assigning
the new value to CR3. I am hopeful that this may have been the bug
haunting our CI for months. :^)
2021-12-23 01:18:26 +01:00
Andreas Kling 601a9321d9 Kernel: Don't honor userspace SIGSTOP requests in Thread::block()
Instead, wait until we transition back to userspace. This stops
userspace from being able to suspend a thread indefinitely while it's
running in kernelspace (potentially holding some blocking mutex.)
2021-12-23 00:57:36 +01:00
Brian Gianforcaro 8afcf2441c Kernel: Initialize SupriousInterruptHandler::m_enabled on construction
Found by PVS Studio Static Analysis
2021-12-22 13:29:31 -08:00
Brian Gianforcaro 0348d9afbe Kernel: Always initialize ext2_inode and ext_super_block structs
Found by PVS Studio Static Analysis
2021-12-22 13:29:31 -08:00
Brian Gianforcaro b8e210deea Kernel: Initialize PhysicalRegion::m_large_zones, remove m_small_zones
Found by PVS Studio Static Analysis.
2021-12-22 13:29:31 -08:00
Brian Gianforcaro c724955d54 LibC: Add support for posix_madvise(..)
Add the `posix_madvise(..)` LibC implementation that just forwards
to the normal `madvise(..)` implementation.

Also define POSIX_MADV_DONTNEED and POSIX_MADV_NORMAL, as they
are part of the POSIX API for `posix_madvise(..)`.

This is needed by the `fio` port.
2021-12-22 13:28:13 -08:00
Idan Horowitz 7a662c2638 Kernel: Add the si_errno and si_band siginfo_t members
These 2 members are required by POSIX and are also used by some ports.
Zero is a valid value for both of these, so no further work to support
them is required.
2021-12-22 22:53:56 +02:00
Idan Horowitz b2f0697afc Kernel: Switch KUBSAN prints to use critical_dmesgln instead of dbgln
This allows KUBSAN to print correctly in stricter memory
conditions. This patch also removes some useless curly braces around
single line ifs.
2021-12-22 00:02:36 -08:00
Idan Horowitz 5f4a67434c Kernel: Move userspace virtual address range base to 0x10000
Now that the shared bottom 2 MiB virtual address mappings are gone,
userspace can use lower virtual addresses.
2021-12-22 00:02:36 -08:00
Idan Horowitz fccd0432a1 Kernel: Don't share the bottom 2 MiB of kernel mappings with processes
Now that the last 2 users of these mappings (the Prekernel and the APIC
ap boot environment) were removed, these are no longer used.
2021-12-22 00:02:36 -08:00
Daniel Bertalan 4fc28bfe02 Kernel: Unmap Prekernel pages after they are no longer needed
The Prekernel's memory is only accessed until MemoryManager has been
initialized. Keeping its pages around afterwards is both unnecessary and bad,
as it prevents the userland from using the 0x100000-0x155000 virtual
address range.

Co-authored-by: Idan Horowitz <idan.horowitz@gmail.com>
2021-12-22 00:02:36 -08:00
Daniel Bertalan 2f1b4b8a81 Kernel: Exclude PROT_NONE regions from coredumps
As PROT_NONE regions can't be accessed by processes, and their only real
use is for reserving ranges of virtual memory, there's no point in
including them in coredumps.
2021-12-22 00:02:36 -08:00
Daniel Bertalan ce1bf3724e Kernel: Replace intersecting ranges in mmap when MAP_FIXED is specified
This behavior is mandated by POSIX and is used by software like Wine
after reserving large chunks of the address range.
2021-12-22 00:02:36 -08:00
Idan Horowitz fd3be7ffcc Kernel: Setup APIC AP cores boot environment before init_stage2
Since this range is already mapped in the kernel page directory, we
can initialize it before jumping into the first kernel process which
lets us avoid mapping in the range into init_stage2's address space.

This brings us half-way to removing the shared bottom 2 MiB mapping in
every process, leaving only the Prekernel.
2021-12-22 00:02:36 -08:00
Idan Horowitz 7b24fc6fb8 Kernel+LibC: Stub out getifaddrs() and freeifaddrs()
These are required for some ports.
2021-12-22 00:02:36 -08:00
Idan Horowitz 468ae105d8 Kernel+LibC: Stub out if_nameindex() and if_freenameindex()
These should allow users to receive the names of network interfaces in
the system, but for now these are only stubs required to compile some
ports.
2021-12-22 00:02:36 -08:00
Idan Horowitz 3a1ff175e8 Kernel: Define and return the ARPHRD_* device type in SIOCGIFHWADDR
The sa_family field in SIOCGIFHWADDR specifies the underlying network
interface's device type. This is hardcoded to generic "Ethernet" right
now, as we don't have a nice way to query it.
2021-12-22 00:02:36 -08:00
Nick Johnson 08e4a1a4dc AK+Everywhere: Replace __builtin bit functions
In order to reduce our reliance on __builtin_{ffs, clz, ctz, popcount},
this commit removes all calls to these functions and replaces them with
the equivalent functions in AK/BuiltinWrappers.h.
2021-12-21 22:13:51 +01:00
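The kind of wrapper that 08e4a1a4dc refers to might be sketched as follows: one generic function that dispatches to the right compiler builtin by operand width, so call sites no longer name a builtin directly. The function names with the `_sketch` suffix are hypothetical and are not claimed to match AK/BuiltinWrappers.h; the sketch also assumes GCC or Clang and C++17.

```cpp
#include <cassert>
#include <type_traits>

// Counts set bits, picking the builtin that matches the operand width.
template<typename IntType>
constexpr int popcount_sketch(IntType value)
{
    static_assert(std::is_unsigned_v<IntType>);
    static_assert(sizeof(IntType) <= sizeof(unsigned long long));
    if constexpr (sizeof(IntType) <= sizeof(unsigned int))
        return __builtin_popcount(value);
    else if constexpr (sizeof(IntType) == sizeof(unsigned long))
        return __builtin_popcountl(value);
    else
        return __builtin_popcountll(value);
}

// Counts trailing zero bits. The underlying builtin is undefined for zero,
// so this sketch makes that an explicit precondition.
template<typename IntType>
inline int count_trailing_zeroes_sketch(IntType value)
{
    static_assert(std::is_unsigned_v<IntType>);
    static_assert(sizeof(IntType) <= sizeof(unsigned long long));
    assert(value != 0);
    if constexpr (sizeof(IntType) <= sizeof(unsigned int))
        return __builtin_ctz(value);
    else if constexpr (sizeof(IntType) == sizeof(unsigned long))
        return __builtin_ctzl(value);
    else
        return __builtin_ctzll(value);
}
```

Centralizing the width dispatch in one header means a caller passing a 64-bit value cannot accidentally pick the 32-bit builtin and silently truncate.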
Martin Bříza 86b249f02f Kernel: Implement sysconf(_SC_SYMLOOP_MAX)
Not much to say here, this is an implementation of this call that
accesses the actual limit constant that's used by the VirtualFileSystem
class.

As a side note, this is required for my eventual Qt port.
2021-12-21 12:54:11 -08:00
Martin Bříza f75bab2a25 Kernel: Move symlink recursion limit to .h, increase it to 8
As pointed out by BertalanD on Discord, POSIX specifies that
_SC_SYMLOOP_MAX (implemented in the following commit) must always be
greater than or equal to _POSIX_SYMLOOP_MAX (8, defined in
LibC/bits/posix1_lim.h), hence I've increased it to that value to
comply with the standard.

The move to header is required for the following commit - to make this
constant accessible outside of the VFS class, namely in sysconf.
2021-12-21 12:54:11 -08:00
Liav A 30659040ed Kernel: Ensure SMP mode is not enabled if IOAPIC mode is disabled
We need to use the IOAPIC in SMP mode, so if the user requested to
disable it, we can't enable SMP mode either.
2021-12-20 11:00:31 -08:00
Liav A 5a649d0fd5 Kernel: Return EINVAL when specifying -1 for setuid and similar syscalls
For setreuid and setresuid syscalls, -1 means to keep the current
uid/euid/gid/egid value, which is convenient for programming.
However, for other syscalls where we pass only one argument, there's no
justification to specify -1.

This behavior is identical to how Linux handles the value -1, and is
influenced by the fact that the manual pages for the group of
one-argument syscalls that handle ID operations are ambiguous about this
topic.
2021-12-20 11:32:16 +01:00
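The distinction drawn in 5a649d0fd5 can be modeled with a small sketch: in the multi-argument calls, -1 is a "leave unchanged" sentinel, while in the one-argument calls it is rejected with EINVAL. Permission checks are deliberately omitted, and `CredentialsSketch`, `setuid_sketch` and `setreuid_sketch` are hypothetical names rather than kernel code.

```cpp
#include <cerrno>
#include <cstdint>

using UserIDSketch = uint32_t;

// (uid_t)-1, used by the multi-argument calls to mean "leave unchanged".
constexpr UserIDSketch no_change = static_cast<UserIDSketch>(-1);

struct CredentialsSketch {
    UserIDSketch uid { 0 };
    UserIDSketch euid { 0 };
};

// One-argument form: there is nothing for -1 to mean, so reject it.
int setuid_sketch(CredentialsSketch& credentials, UserIDSketch new_uid)
{
    if (new_uid == no_change)
        return EINVAL;
    credentials.uid = new_uid;
    credentials.euid = new_uid;
    return 0;
}

// Two-argument form: -1 keeps the corresponding ID as it is.
int setreuid_sketch(CredentialsSketch& credentials, UserIDSketch new_ruid, UserIDSketch new_euid)
{
    if (new_ruid != no_change)
        credentials.uid = new_ruid;
    if (new_euid != no_change)
        credentials.euid = new_euid;
    return 0;
}
```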
Andreas Kling e0521cfb9d Kernel: Stop ProcFS stack walk on bogus userspace->kernel traversal
Unsurprisingly, the /proc/PID/stacks/TID stack walk had the same
arbitrary memory read problem as the perf event stack walk.

It would be nice if the kernel had a single stack walk implementation,
but that's outside the scope of this commit.
2021-12-19 18:18:38 +01:00
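The check described in e0521cfb9d can be illustrated with a simulated frame-pointer walk. A legitimate walk may go from kernel frames down into userspace frames, but a userspace frame whose saved frame pointer leads back into kernel addresses is bogus and must not be followed. The memory model, the `kernel_base_sketch` split, and all names here are assumptions for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Assumed boundary between userspace and kernel addresses.
constexpr uintptr_t kernel_base_sketch = 0xc0000000;

struct FrameSketch {
    uintptr_t next_frame_pointer; // saved frame pointer of the caller
    uintptr_t return_address;
};

// Simulated memory: frame pointer address -> frame contents.
using MemorySketch = std::map<uintptr_t, FrameSketch>;

std::vector<uintptr_t> walk_stack_sketch(MemorySketch const& memory, uintptr_t frame_pointer, size_t max_depth = 64)
{
    std::vector<uintptr_t> trace;
    bool seen_userspace_frame = false;
    while (frame_pointer != 0 && trace.size() < max_depth) {
        bool is_kernel_frame = frame_pointer >= kernel_base_sketch;
        // Userspace controls its own saved frame pointers, so a hop from a
        // userspace frame back into kernel memory is never trusted.
        if (seen_userspace_frame && is_kernel_frame)
            break;
        if (!is_kernel_frame)
            seen_userspace_frame = true;
        auto it = memory.find(frame_pointer);
        if (it == memory.end())
            break; // unreadable frame
        trace.push_back(it->second.return_address);
        frame_pointer = it->second.next_frame_pointer;
    }
    return trace;
}
```

Without the check, a process could plant a kernel address in its own stack frame and have the walker read arbitrary kernel memory on its behalf.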
Andreas Kling bc518e39bf Kernel: Make perfcore files owned by UID=0, GID=0
Since perfcore files can be generated during process finalization,
we can't just allow them to contain sensitive kernel information
if they're gonna be owned by the process's own UID+GID.

So instead, perfcores are now owned by 0:0. This is not the most
ergonomic solution, but I'm not sure what we could do to make it nicer.
We'll have to think more about that. In the meantime, this patches up
a kernel info leak. :^)
2021-12-19 18:18:38 +01:00