Commit graph

  • bd0cc3863e Don't try and update an empty list of URLs Daoud Clarke 2023-01-09 21:02:40 +0000
  • d347a17d63 Update URL queue separately from the other background process to speed it up Daoud Clarke 2023-01-09 20:50:28 +0000
  • 7bd12c1ead Fix some bugs in URL fetching query Daoud Clarke 2023-01-02 20:51:23 +0000
  • a50f1d8ae3 Fix postgres install Daoud Clarke 2023-01-02 12:19:10 +0000
  • 1ab16b1fb4 Install postgres client Daoud Clarke 2023-01-02 12:18:03 +0000
  • dda5a25ad0 Add core domains Daoud Clarke 2023-01-02 12:05:22 +0000
  • ab37bbe0a5 Exclude google plus Daoud Clarke 2023-01-01 22:18:47 +0000
  • 2336ed7f7d Allow posting extra links with lower score weighting Daoud Clarke 2023-01-01 20:37:41 +0000
  • 6edf48693b Check the domain is correct, potential bug in psql Daoud Clarke 2023-01-01 01:30:44 +0000
  • b7984684c9 Tidy, improve logging Daoud Clarke 2023-01-01 01:14:05 +0000
  • 7c14cd99f8 Update the URL queue earlier Daoud Clarke 2022-12-31 23:37:59 +0000
  • 0d33b4f68f Merge pull request #86 from mwmbl/improve-crawling Daoud Clarke 2022-12-31 22:56:21 +0000
  • a86e172bf3 Reinstate background tasks #86 improve-crawling Daoud Clarke 2022-12-31 22:52:17 +0000
  • d9cd3c585b Get results from other domains Daoud Clarke 2022-12-31 22:51:00 +0000
  • 77f08d8f0a Update URL status Daoud Clarke 2022-12-31 22:25:05 +0000
  • 36af579f7c Sample domains Daoud Clarke 2022-12-31 17:04:38 +0000
  • ea16e7b5cd WIP: improve method of getting URLs for crawling Daoud Clarke 2022-12-31 13:37:40 +0000
  • 7dae39b780 WIP: improve method of getting URLs for crawling Daoud Clarke 2022-12-31 13:32:15 +0000
  • c69108cfcc Don't delete an index if the sizes don't match Daoud Clarke 2022-12-27 10:52:46 +0000
  • bb8a36a612 Number of pages is an int Daoud Clarke 2022-12-27 10:40:53 +0000
  • c01129cdb9 Merge branch 'master' of github.com:mwmbl/mwmbl Daoud Clarke 2022-12-27 10:25:41 +0000
  • 26351a1072 Use the correct storage location in prod Daoud Clarke 2022-12-27 10:24:48 +0000
  • f3f3831a97 Merge pull request #83 from omasanori/spacy-deps-rework Daoud Clarke 2022-12-27 10:20:52 +0000
  • 71187a3938 Rework installation of spaCy models for clarity #83 Masanori Ogino 2022-12-27 11:30:35 +0900
  • b8f7775b49 Add configuration for Devbox #82 Masanori Ogino 2022-12-27 03:33:47 +0900
  • d85067ec09 Remove apt command Daoud Clarke 2022-12-24 20:20:53 +0000
  • 1ef60e8d5d Put install in correct place Daoud Clarke 2022-12-24 20:18:02 +0000
  • 8e613dd368 Install psql client Daoud Clarke 2022-12-24 20:13:53 +0000
  • 80282cfc7a Exclude a domain Daoud Clarke 2022-12-24 19:59:56 +0000
  • 8676abbc63 Format fetched url Daoud Clarke 2022-12-24 19:59:15 +0000
  • 57295846cb Update README.md Daoud Clarke 2022-12-21 21:49:56 +0000
  • 0a4e1e4aee Add endpoint to fetch a URL and return title and extract Daoud Clarke 2022-12-21 21:15:34 +0000
  • c7571120cc Implement validation Daoud Clarke 2022-12-21 15:32:30 +0000
  • 061462460b Separate out the curation to make it easier to store in a comment Daoud Clarke 2022-12-20 19:11:01 +0000
  • 6cf27fa47f Fix serialisation issue Daoud Clarke 2022-12-19 23:19:32 +0000
  • b559a50506 Require the whole result Daoud Clarke 2022-12-19 22:18:28 +0000
  • 5eab543f3b Merge branch 'master' into user-registration Daoud Clarke 2022-12-19 21:53:11 +0000
  • a88a1a3e95 Rename some parameters; return curation ID Daoud Clarke 2022-12-19 21:51:26 +0000
  • efc8e8e383 Merge pull request #78 from mwmbl/make-dev-easier Daoud Clarke 2022-12-19 21:50:54 +0000
  • 31c27daca4 Add curations Daoud Clarke 2022-12-11 18:48:25 +0000
  • f89e1d6043 Create a post when beginning curation Daoud Clarke 2022-12-10 23:45:10 +0000
  • eadb7f3e28 Follow a begin curate/update curation workflow Daoud Clarke 2022-12-10 22:49:06 +0000
  • f8ab6092b0 Suggest using dokku instead of docker directly #78 Daoud Clarke 2022-12-08 22:33:58 +0000
  • 8aa51e548b Allow login Daoud Clarke 2022-12-08 22:23:48 +0000
  • cf6ceedfd5 Actually allow registration Daoud Clarke 2022-12-07 22:56:20 +0000
  • a50bc28436 Make it easier to rum mwmbl locally Daoud Clarke 2022-12-07 20:01:31 +0000
  • d8d7149f4a Start to implement user registration using Lemmy as a back end Daoud Clarke 2022-12-06 22:36:38 +0000
  • c0f89ba6c3 Update matrix badge Daoud Clarke 2022-12-05 18:47:26 +0000
  • dd4dd8a752 Exclude an annoying web site Daoud Clarke 2022-12-02 21:29:06 +0000
  • 40f9eade9a Update index name Daoud Clarke 2022-08-27 09:38:39 +0100
  • b6183e00ea Merge pull request #74 from mwmbl/evaluate-indexing Daoud Clarke 2022-08-27 09:37:22 +0100
  • cf253ae524 Split out URL updating from indexing #74 Daoud Clarke 2022-08-26 22:20:35 +0100
  • f4fb9f831a Use terms and bigrams from the beginning of the string only Daoud Clarke 2022-08-26 17:20:11 +0100
  • 619b6c3a93 Don't remove stopwords Daoud Clarke 2022-08-24 21:08:33 +0100
  • 578b705609 Don't replace full stops and commas Daoud Clarke 2022-08-23 22:06:43 +0100
  • 4779371cf3 Use a custom tokenizer Daoud Clarke 2022-08-23 21:57:38 +0100
  • b1eea2457f Script to index local batch for evaluation Daoud Clarke 2022-08-22 22:47:42 +0100
  • 480be85cfd Fix bug in completions with duplicated terms Daoud Clarke 2022-08-14 22:03:50 +0100
  • f7660bcd27 Merge pull request #73 from mwmbl/completion Daoud Clarke 2022-08-13 23:55:22 +0100
  • 627f82d19f Suggest searching Google if there are no search results #73 Daoud Clarke 2022-08-13 23:54:57 +0100
  • f1c77d1389 Search google if there are no results Daoud Clarke 2022-08-13 23:47:48 +0100
  • fe5eff7b64 Exclude web.archive.org as we're only crawling that right now Daoud Clarke 2022-08-13 10:52:31 +0100
  • 9920fc5ddd Disguise URLs so Firefox doesn't recognise them and filter them out local Daoud Clarke 2022-08-13 10:49:55 +0100
  • a8bbb9f303 Missing import Daoud Clarke 2022-08-13 10:14:28 +0100
  • 6022d867a3 Merge branch 'completion' into local Daoud Clarke 2022-08-13 10:08:37 +0100
  • 00705703f3 Require matching at least half the terms Daoud Clarke 2022-08-11 23:27:30 +0100
  • eda7870788 Restrict to https and strip the prefix and / on the end Daoud Clarke 2022-08-11 22:23:14 +0100
  • 23e47e963b Simplify completions Daoud Clarke 2022-08-11 17:34:52 +0100
  • c6773b46c4 Merge pull request #72 from mwmbl/improve-ranking-with-multi-term-search Daoud Clarke 2022-08-10 21:43:51 +0100
  • 74107667b4 Improve printing of search results in script #72 Daoud Clarke 2022-08-10 21:43:13 +0100
  • 3bcb7f42c1 Use heuristic ranker Daoud Clarke 2022-08-09 22:56:12 +0100
  • 9b22c32322 Merge branch 'improve-ranking-with-multi-term-search' into local Daoud Clarke 2022-08-09 22:50:56 +0100
  • c1b9e70743 Add new LTR model Daoud Clarke 2022-08-09 22:47:59 +0100
  • 57476ed2c8 Tweak features Daoud Clarke 2022-08-09 22:23:36 +0100
  • c99e813398 Get best-performing configuration Daoud Clarke 2022-08-09 20:56:15 +0100
  • 8b50643303 Add in match score feature (although it hurts the results) Daoud Clarke 2022-08-09 00:08:55 +0100
  • c60b73a403 Create a get_features function and make it work like the heuristic approach Daoud Clarke 2022-08-08 23:42:34 +0100
  • c1d361c0a0 New LTR model trained on more data Daoud Clarke 2022-08-08 22:52:37 +0100
  • b99d9d1c6a Search for the term itself as well as its completion Daoud Clarke 2022-08-01 23:38:14 +0100
  • a40259af30 Search for the term itself as well as its completion Daoud Clarke 2022-08-01 23:38:14 +0100
  • 87d2e9474c Merge branch 'improve-ranking-with-multi-term-search' into local Daoud Clarke 2022-08-01 23:33:52 +0100
  • f40d82c449 Allow running with no background script Daoud Clarke 2022-08-01 23:33:02 +0100
  • 046f86f7e3 Merge pull request #71 from mwmbl/fix-missing-scores Daoud Clarke 2022-08-01 23:32:24 +0100
  • ae658906dd Store the best items, not the worst ones #71 Daoud Clarke 2022-07-31 22:55:15 +0100
  • aa5878fd2f Merge pull request #70 from mwmbl/reduce-new-batch-contention Daoud Clarke 2022-07-31 21:02:05 +0100
  • fc1742e24f Reinstate correct num_pages #70 Daoud Clarke 2022-07-31 00:45:00 +0100
  • bb5186196f Use an in-memory queue Daoud Clarke 2022-07-31 00:43:58 +0100
  • 62ba9ddc7e Use a randomised timeout for getting a new batch Daoud Clarke 2022-07-30 23:10:37 +0100
  • a54e093cf1 Merge pull request #69 from mwmbl/reduce-contention-for-client-queries Daoud Clarke 2022-07-30 17:11:34 +0100
  • 2942d83673 Get URL scores in batches #69 Daoud Clarke 2022-07-30 14:35:21 +0100
  • 3709cb236f Use correct index path; retrieve historical batches Daoud Clarke 2022-07-30 11:08:15 +0100
  • 063ebb4504 args.index no longer exists Daoud Clarke 2022-07-30 10:57:15 +0100
  • ea32c0ba00 Double index size Daoud Clarke 2022-07-30 10:37:07 +0100
  • 2d5235f6f6 More threads for retrieving batches Daoud Clarke 2022-07-30 10:09:27 +0100
  • 218d873654 Delete unused SQL Daoud Clarke 2022-07-30 09:27:44 +0100
  • 3137068c77 More threads for retrieving batches Daoud Clarke 2022-07-30 10:09:27 +0100
  • e79f1ce10b Delete unused SQL Daoud Clarke 2022-07-30 09:27:44 +0100
  • c52faeaddc Merge branch 'reduce-contention-for-client-queries' into local Daoud Clarke 2022-07-24 17:02:37 +0100
  • 6209382d76 Index batches in memory Daoud Clarke 2022-07-24 15:44:01 +0100
  • 1bceeae3df Implement new indexing approach Daoud Clarke 2022-07-23 23:19:36 +0100