33 Commits

Author SHA1 Message Date
Daoud Clarke 204304e18e Add term info to index 2023-11-18 18:49:41 +00:00
Daoud Clarke a2b872008f Add a script to evaluate how much it costs to add the term to the index
Old sizes mean 33.3673 0.08148019988498635
New sizes mean 32.1322 0.07700185221489449
2023-11-16 17:42:18 +00:00
Daoud Clarke 918eaa8709 Rename django app to mwmbl 2023-10-10 13:51:06 +01:00
Daoud Clarke 77e39b4a89 Optimise URL update 2023-01-22 20:28:18 +00:00
Daoud Clarke 66700f8a3e Speed up domain parsing 2023-01-20 20:53:50 +00:00
Daoud Clarke 4779371cf3 Use a custom tokenizer 2022-08-23 21:57:38 +01:00
Daoud Clarke b1eea2457f Script to index local batch for evaluation 2022-08-22 22:47:42 +01:00
Daoud Clarke 00705703f3 Require matching at least half the terms 2022-08-11 23:27:30 +01:00
Daoud Clarke 74107667b4 Improve printing of search results in script 2022-08-10 21:43:13 +01:00
Daoud Clarke c1d361c0a0 New LTR model trained on more data 2022-08-08 22:52:37 +01:00
Daoud Clarke ae658906dd Store the best items, not the worst ones 2022-07-31 22:55:15 +01:00
Daoud Clarke 93307ad1ec Add util script to send batch; add logging 2022-07-18 21:37:19 +01:00
Daoud Clarke ff2312a5ca Use different scores for same domain links 2022-06-27 22:46:06 +01:00
Daoud Clarke e27d749e18 Investigate duplication of URLs in batches 2022-06-26 21:11:51 +01:00
Daoud Clarke eb571fc5fe Add a script to count urls in the index 2022-06-21 21:55:38 +01:00
Daoud Clarke e2eb405083 Combine crawler and search servers 2022-06-16 22:49:41 +01:00
Daoud Clarke 14107acc75 Use new server 2022-06-09 22:24:54 +01:00
Daoud Clarke aaca8b2b6e Record historical batches via the API 2022-06-05 09:15:04 +01:00
Daoud Clarke f5b20d0128 Index link counts 2022-02-24 20:47:36 +00:00
Daoud Clarke b5b2005323 Store computed link counts 2022-02-23 22:13:38 +00:00
Daoud Clarke 00d18c3474 Remove unused code 2022-02-23 21:59:24 +00:00
Daoud Clarke e03e379ccf Refactor to enable easier evaluation 2022-02-09 22:43:47 +00:00
Daoud Clarke 2fc999b402 Count unique domains instead of links 2022-02-02 20:09:59 +00:00
Daoud Clarke d77b72d7df Analyse links to find most popular ones 2022-02-02 19:47:38 +00:00
Daoud Clarke ef36513f64 Analyse the pages that are crawled most often 2022-01-29 07:06:53 +00:00
Daoud Clarke 70254ae160 Analyse crawled URLs and domains 2022-01-26 18:51:58 +00:00
Daoud Clarke 171fa645d2 Add script to export top domains 2022-01-23 22:04:30 +00:00
Daoud Clarke 25918e42ef Export URLs to sqlite for evaluation purposes 2022-01-02 20:06:13 +00:00
nitred 11eedcde84 renamed package to mwmbl
- renamed package to mwmbl in pyproject.toml
- tinysearchengine and indexer modules have been moved into mwmbl package folder
- analyse module has been left as is in the root of the repo
- import statements in tinysearchengine now use mwmbl.tinysearchengine
- import statements in indexer now use mwmbl.indexer or mwmbl.tinysearchengine or relative imports like .paths
- import statements in analyse now use mwmbl.indexer or mwmbl.tinysearchengine
- final CMD in Dockerfile now uses updated path mwmbl.tinysearchengine.app
- fixed a couple of import statement errors in tinysearchengine/indexer.py
2021-12-28 12:35:46 +01:00
Daoud Clarke baede32298 Move indexer code to a separate package 2021-12-26 08:55:09 +00:00
Daoud Clarke 9c65bf3c8f WIP: implement docker image. TODO: copy index and set the correct index path using env var 2021-12-22 23:21:23 +00:00
Daoud Clarke 9ee6f37a60 Analysis to confirm that 'leek and potato soup' page was really missing 2021-12-19 21:09:00 +00:00
Daoud Clarke 4cbed29c08 Show the extract 2021-12-19 20:48:28 +00:00