Merge branch 'master' into update-urls-queue-quickly

Daoud Clarke 2023-02-24 21:37:54 +00:00
commit bc6be8b6d5
11 changed files with 1326 additions and 982 deletions

57
.github/workflows/ci.yml vendored Normal file

@ -0,0 +1,57 @@
name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      #----------------------------------------------
      # check-out repo and set-up python
      #----------------------------------------------
      - name: Check out repository
        uses: actions/checkout@v3
      - name: Set up python
        id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      #----------------------------------------------
      # ----- install & configure poetry -----
      #----------------------------------------------
      - name: Install Poetry
        uses: snok/install-poetry@v1.3.3
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true
      #----------------------------------------------
      # load cached venv if cache exists
      #----------------------------------------------
      - name: Load cached venv
        id: cached-poetry-dependencies
        uses: actions/cache@v3
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}
      #----------------------------------------------
      # install dependencies if cache does not exist
      #----------------------------------------------
      - name: Install dependencies
        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
        run: poetry install --no-interaction --no-root
      #----------------------------------------------
      # install your root project, if required
      #----------------------------------------------
      - name: Install project
        run: poetry install --no-interaction
      #----------------------------------------------
      # run test suite
      #----------------------------------------------
      - name: Run tests
        run: |
          poetry run pytest

15
.vscode/launch.json vendored Normal file

@ -0,0 +1,15 @@
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "mwmbl",
            "type": "python",
            "request": "launch",
            "module": "mwmbl.main",
            "python": "${workspaceFolder}/.venv/bin/python",
            "stopOnEntry": false,
            "console": "integratedTerminal",
            "justMyCode": true
        }
    ]
}

128
CODE_OF_CONDUCT.md Normal file

@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
  overall community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or
  advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
  address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
https://matrix.to/#/#mwmbl:matrix.org.
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series
of actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within
the community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see the FAQ at
https://www.contributor-covenant.org/faq. Translations are available at
https://www.contributor-covenant.org/translations.

5
CONTRIBUTING.md Normal file

@ -0,0 +1,5 @@
Contributions are very welcome!

Please join the discussion at https://matrix.to/#/#mwmbl:matrix.org and let us know what you're planning to do.

See https://book.mwmbl.org/page/developers/ for a guide to development.


@ -13,6 +13,8 @@ the web front-end and search technology on a small index.
 Our vision is a community working to provide top quality search
 particularly for hackers, funded purely by donations.
+![mwmbl](https://user-images.githubusercontent.com/1283077/218265959-be4220b4-dcf0-47ab-acd3-f06df0883b52.gif)
 Crawling
 ========


@ -29,6 +29,7 @@ def setup_args():
     parser = argparse.ArgumentParser(description="Mwmbl API server and background task processor")
     parser.add_argument("--num-pages", type=int, help="Number of pages of memory (4096 bytes) to use for the index", default=2560)
     parser.add_argument("--data", help="Path to the data folder for storing index and cached batches", default="./devdata")
+    parser.add_argument("--port", type=int, help="Port for the server to listen at", default=5000)
     parser.add_argument("--background", help="Enable running the background tasks to process batches",
                         action='store_true')
     args = parser.parse_args()
@ -74,7 +75,7 @@ def run():
     app.include_router(crawler_router)
     # Initialize uvicorn server using global app instance and server config params
-    uvicorn.run(app, host="0.0.0.0", port=5000)
+    uvicorn.run(app, host="0.0.0.0", port=args.port)
 if __name__ == "__main__":
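
The hunk above swaps the hard-coded port for a `--port` command-line option. A minimal sketch of that pattern, runnable without a server: `build_parser` and `parse_port` are hypothetical helper names used only for illustration, and the `uvicorn.run` call is deliberately left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: mirrors only the --port option shown in the diff.
    parser = argparse.ArgumentParser(description="Port option sketch")
    parser.add_argument("--port", type=int, default=5000,
                        help="Port for the server to listen at")
    return parser


def parse_port(argv: list[str]) -> int:
    # type=int makes argparse convert the string argument to an integer.
    return build_parser().parse_args(argv).port


if __name__ == "__main__":
    # With no arguments the default applies; an explicit value overrides it.
    print(parse_port([]))
    print(parse_port(["--port", "8080"]))
```

Keeping 5000 as the default means existing invocations that pass no `--port` should behave as before.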


@ -10,13 +10,16 @@ TERMS_PATH = Path(__file__).parent.parent / 'resources' / 'mwmbl-crawl-terms.csv
 class Completer:
     def __init__(self, num_matches: int = 3):
         # Load term data
-        terms = pd.read_csv(TERMS_PATH)
+        terms = self.get_terms()
         terms_dict = terms.sort_values('term').set_index('term')['count'].to_dict()
         self.terms = list(terms_dict.keys())
         self.counts = list(terms_dict.values())
         self.num_matches = num_matches
         print("Terms", self.terms[:100], self.counts[:100])
+    def get_terms(self):
+        return pd.read_csv(TERMS_PATH)
     def complete(self, term) -> list[str]:
         term_length = len(term)
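
Moving the CSV read into `get_terms()` gives tests a single method to patch, so the constructor can run without the real terms file. The sketch below illustrates the idea with the standard library's `unittest.mock`; `TermSource` and `build_with_fake_terms` are hypothetical stand-ins, not part of the mwmbl code, and plain lists are used instead of a pandas DataFrame.

```python
from unittest import mock


class TermSource:
    """Hypothetical stand-in for Completer: loading is isolated in get_terms()."""

    def __init__(self):
        # The constructor calls the overridable loader instead of reading a file directly.
        self.terms = sorted(self.get_terms())

    def get_terms(self) -> list[str]:
        # The real class reads a CSV here; this sketch has no file, so it fails loudly.
        raise FileNotFoundError("no term file in this sketch")


def build_with_fake_terms(fake_terms: list[str]) -> TermSource:
    # Patch the loader on the class so __init__ picks up the fake data.
    with mock.patch.object(TermSource, "get_terms", return_value=fake_terms):
        return TermSource()


if __name__ == "__main__":
    print(build_with_fake_terms(["build", "announce"]).terms)
```

The test file in this commit does the same thing through the `mocker` fixture from `pytest-mock`, patching `Completer.get_terms` by its dotted path.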


@ -122,7 +122,7 @@ class TinyIndex(Generic[T]):
     def __enter__(self):
         self.index_file = open(self.index_path, 'r+b')
         prot = PROT_READ if self.mode == 'r' else PROT_READ | PROT_WRITE
-        self.mmap = mmap(self.index_file.fileno(), 0, offset=METADATA_SIZE, prot=prot)
+        self.mmap = mmap(self.index_file.fileno(), 0, prot=prot)
         return self
     def __exit__(self, exc_type, exc_val, exc_tb):
@ -146,7 +146,7 @@ class TinyIndex(Generic[T]):
         return [self.item_factory(*item) for item in results]
     def _get_page_tuples(self, i):
-        page_data = self.mmap[i * self.page_size:(i + 1) * self.page_size]
+        page_data = self.mmap[i * self.page_size + METADATA_SIZE:(i + 1) * self.page_size + METADATA_SIZE]
         try:
             decompressed_data = self.decompressor.decompress(page_data)
         except ZstdError:
@ -186,7 +186,7 @@ class TinyIndex(Generic[T]):
         page_data = _get_page_data(self.compressor, self.page_size, data)
         logger.debug(f"Got page data of length {len(page_data)}")
-        self.mmap[i * self.page_size:(i+1) * self.page_size] = page_data
+        self.mmap[i * self.page_size + METADATA_SIZE:(i+1) * self.page_size + METADATA_SIZE] = page_data
     @staticmethod
     def create(item_factory: Callable[..., T], index_path: str, num_pages: int, page_size: int):
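
These hunks stop passing `offset=METADATA_SIZE` to `mmap` and instead map the whole file, adding `METADATA_SIZE` to each page slice. One plausible motivation is that Python requires the `mmap` offset to be a multiple of `mmap.ALLOCATIONGRANULARITY`, which a small metadata header usually is not. The sketch below shows the slicing approach on a temporary file. The sizes and the `page_bounds` and `demo` helpers are assumptions for illustration only, not values or functions from mwmbl.

```python
import mmap
import os
import tempfile

METADATA_SIZE = 16  # assumed value for this sketch only
PAGE_SIZE = 32      # assumed value for this sketch only
NUM_PAGES = 4


def page_bounds(i: int) -> tuple[int, int]:
    # Pages start after the metadata header, so both ends are shifted by it.
    start = i * PAGE_SIZE + METADATA_SIZE
    return start, start + PAGE_SIZE


def demo() -> tuple[bytes, bytes]:
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            # File layout: metadata header followed by zero-filled pages.
            f.write(b"M" * METADATA_SIZE + bytes(PAGE_SIZE * NUM_PAGES))
            f.flush()
            # Map the whole file (no offset), then slice past the header.
            with mmap.mmap(f.fileno(), 0) as mm:
                start, end = page_bounds(2)
                # Slice assignment on an mmap must match the slice length exactly.
                mm[start:end] = b"x" * PAGE_SIZE
                return bytes(mm[:METADATA_SIZE]), bytes(mm[start:end])
    finally:
        os.remove(path)


if __name__ == "__main__":
    header, page = demo()
    print(header, page)
```

Because the slice length has to equal the length of the assigned bytes, the start and end of a page slice need the same shift. Shifting only one bound changes the slice length and the assignment fails.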

2007
poetry.lock generated

File diff suppressed because it is too large


@ -18,6 +18,8 @@ boto3 = "^1.20.37"
 requests = "^2.27.1"
 psycopg2-binary = "^2.9.3"
 spacy = "==3.2.1"
+pytest = "^7.2.1"
+pytest-mock = "^3.10.0"
 # Optional dependencies do not get installed by default. Look under tool.poetry.extras section
 # to see which extras to use.

78
test/test_completer.py Normal file

@ -0,0 +1,78 @@
import mwmbl.tinysearchengine.completer
import pytest
import pandas as pd


def mockCompleterData(mocker, data):
    testDataFrame = pd.DataFrame(data, columns=['','term','count'])
    mocker.patch('mwmbl.tinysearchengine.completer.Completer.get_terms',
                 return_value = testDataFrame)


def test_correctCompletions(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 4],
        [1, 'builder', 3],
        [2, 'announce', 2],
        [3, 'buildings', 1]]
    mockCompleterData(mocker, testdata)
    completer = mwmbl.tinysearchengine.completer.Completer()
    completion = completer.complete('build')
    assert ['build', 'builder', 'buildings'] == completion


def test_correctSortOrder(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 4],
        [1, 'builder', 1],
        [2, 'announce', 2],
        [3, 'buildings', 3]]
    mockCompleterData(mocker, testdata)
    completer = mwmbl.tinysearchengine.completer.Completer()
    completion = completer.complete('build')
    assert ['build', 'buildings', 'builder'] == completion


def test_noCompletions(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 4],
        [1, 'builder', 3],
        [2, 'announce', 2],
        [3, 'buildings', 1]]
    mockCompleterData(mocker, testdata)
    completer = mwmbl.tinysearchengine.completer.Completer()
    completion = completer.complete('test')
    assert [] == completion


def test_singleCompletions(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 4],
        [1, 'builder', 3],
        [2, 'announce', 2],
        [3, 'buildings', 1]]
    mockCompleterData(mocker, testdata)
    completer = mwmbl.tinysearchengine.completer.Completer()
    completion = completer.complete('announce')
    assert ['announce'] == completion


def test_idempotencyWithSameScoreCompletions(mocker):
    # Mock completer with custom data
    testdata = [
        [0, 'build', 1],
        [1, 'builder', 1],
        [2, 'announce', 1],
        [3, 'buildings', 1]]
    mockCompleterData(mocker, testdata)
    completer = mwmbl.tinysearchengine.completer.Completer()
    for i in range(3):
        print(f"iteration: {i}")
        completion = completer.complete('build')
        # Results expected in reverse order
        expected = ['buildings','builder','build']
        assert expected == completion