40 commits

Author SHA1 Message Date
Brian Huisman 30630c6c60 Start enforcing PHP and SQL version limitations. 2023-06-26 15:00:42 -04:00
Brian Huisman 3307baac4d Update crawler.php
Run mb_convert_encoding in ALL cases to remove potentially invalid UTF-8 characters.
Add the "replacement" UTF-8 character to the whitespace array to ensure it's removed.
2023-06-22 15:35:40 -04:00
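A minimal sketch of the cleanup this commit describes, assuming the substitute character is set to U+FFFD before conversion. The function name `clean_utf8_sketch` and the contents of the whitespace array are hypothetical and are not taken from crawler.php.

```php
<?php
// Hypothetical helper illustrating the idea; not the project's actual code.
function clean_utf8_sketch(string $raw): string
{
    // Ask mbstring to emit U+FFFD for every byte sequence it cannot decode.
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);

    // Converting UTF-8 to UTF-8 leaves valid text alone and replaces
    // invalid sequences with the substitute character.
    $converted = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);

    // Assumed whitespace array: the replacement character is listed next to
    // ordinary whitespace so it is normalized away with everything else.
    $whitespace = ["\u{FFFD}", "\u{00A0}", "\t", "\r", "\n"];
    $spaced = str_replace($whitespace, ' ', $converted);

    // Collapse runs of spaces and trim the ends.
    return trim(preg_replace('/ {2,}/', ' ', $spaced));
}

$cleaned = clean_utf8_sketch("abc\xFF def");
```

Running the conversion unconditionally means content that claims to be UTF-8 but contains stray bytes is still sanitized before it is stored.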
Brian Huisman 042339d3ef Update crawler.php
Don't assume that other data from a PDF is the same as the content. Bypasses some still-unfixed PDFParser encoding issues.
Also exit the crawler script if we are in debug mode and there is a crawl already running.
2023-06-21 17:23:08 -04:00
Brian Huisman 0a83546411 Update crawler.php
Make sure regexp lines in require and ignore URL fields are actually treated as regexps.
2023-06-19 11:51:57 -04:00
Brian Huisman e76fdf730c s_show_orphans cleanup
Make 's_show_orphans' a runtime variable and normalize the SQL queries it's used in.
Also change generic '$select' variable to more semantic '$crawldata'.
2023-06-15 10:19:05 -04:00
Brian Huisman fd2bbf745f Add 'resumed' flag to sp_progress
Add a third value to the sp_progress config value to let the script know if a crawl was resumed or not.
Also restore the sp_sha1 data from the crawltemp table on a resumed crawl.
2023-06-12 12:19:00 -04:00
Brian Huisman 5cfeb0a414 Update crawler.php
Also rebuild the domains list if a crawl is resumed.
2023-06-08 09:03:45 -04:00
Brian Huisman 87ecb553a7 Update crawler.php
Whoops, remove debug code.
2023-06-05 11:00:40 -04:00
Brian Huisman 783f1d97ca Update crawler.php
Merge function of $updateNotModified SQL statement with $insertNotModified.
2023-06-05 10:58:32 -04:00
Brian Huisman 8b024c438c Remove some unnecessary continues
Also add documentation for the crawler debug mode.
Scope fixes for JS output, still need to work on this.
2023-06-02 14:05:52 -04:00
Brian Huisman 56c84a89cb Prevent endless loop
If an orphan URL is blocked by a user rule, then remove it from the 'sp_exist' list so it doesn't keep coming back again and again. This only happens on the next crawl after the user adds new rules.
Other misc edits.
2023-06-01 12:20:09 -04:00
Brian Huisman 3f9d713633 Simplify logging 2023-05-30 16:12:01 -04:00
Brian Huisman 727936cb80 Update crawler.php
$url => $row['url'] fix, and a couple other tweaks.
2023-05-30 16:07:41 -04:00
Brian Huisman 89f6fc2393 Try to resume a failed crawl.
Attempt to resume a crawl if it exited without going through the shutdown function
2023-05-30 15:53:41 -04:00
Brian Huisman c7c4960e1e Strict in_array checking 2023-05-30 15:01:24 -04:00
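For context, a short illustration of why strict `in_array` checking matters in PHP. The `$codes` array is a made-up example and is not data from the crawler.

```php
<?php
// Hypothetical list of values stored as strings.
$codes = ['404', '410'];

// Loose comparison (the default) juggles types, so the integer 404
// matches the string '404'.
$loose = in_array(404, $codes);

// Passing true as the third argument compares with ===, so the same
// lookup fails unless the types also match.
$strict = in_array(404, $codes, true);

var_dump($loose, $strict);
```

With strict checking, a lookup only succeeds when both value and type agree, which avoids accidental matches between numbers and numeric strings.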
Brian Huisman 9e551324f3 data transferred
'sp_data_transferred' is now an ODATA variable.
2023-05-29 19:15:47 -04:00
Brian Huisman d8e9d5dc91 Admin UI edits for when crawl is in progress
Automatically encode/decode json when saving/reading ODATA config values.
Remove 'sp_links_crawled' config table value, now stored in 'sp_progress'.
Update Crawl Information window in real-time while crawler is running. Be more aggressive at reloading the page to get the latest data once a crawl has finished.
Time the setting of certain config values while crawling in a more sensible way.
2023-05-16 12:00:28 -04:00
Brian Huisman f16c4f9e0a Refactor character normalization
Refactor the whitespace and punctuation normalization arrays.
2023-05-12 13:41:36 -04:00
Brian Huisman 4bb28031b6 Enable downloading Page Index
Allow downloading of the page index as a csv.
Remove unnecessary database columns url_base and status_noindex
Store list of domains at crawl so we don't need to request them every page-load; you will need to reinstall fresh because of this change
2023-05-12 10:06:57 -04:00
Brian Huisman 803155547d Rename to sp_punct
Rename sp_smart ("smart" punctuation) to the more general and accurate sp_punct
2023-05-05 11:54:07 -04:00
Brian Huisman 83f8fc9ed2 JavaScript crawl support enhancement
Don't require reloading the page after a crawl has completed.
JavaScript will dynamically update the Crawler Information values if we are on the Crawler Management page.
2023-04-28 13:55:26 -04:00
Brian Huisman ddc601697c Ping server to see if crawl has started
If admin UI is loaded while a crawl is not running, add a ping every 5 seconds to check if one has started. Fix issue where reloading the page while a crawl was running would cause a JS error that would cancel the crawl.
2023-04-28 12:26:58 -04:00
Brian Huisman 0c733426db Update crawler.php
Use the previously crawled page's category value if available.
2023-04-27 13:22:20 -04:00
Brian Huisman 41f6b25f0f Allow specifying Default Category 2023-04-27 13:10:22 -04:00
Brian Huisman ba04173c29 Daily updates
Keep Page Index pagination page within limits; add UTF-8 BOM to CSV and TXT download output; use utf8mb4_unicode_520_ci collation to remove need for SQL REGEXP; add more Latin accent-equivalent characters.
2023-04-26 15:16:13 -04:00
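One way the UTF-8 BOM change could look, as a sketch only. The function name `csv_with_bom_sketch` and the column names are hypothetical; the actual download code is not shown in this history.

```php
<?php
// Hypothetical helper: builds CSV text that starts with a UTF-8 byte order mark.
function csv_with_bom_sketch(array $rows): string
{
    $handle = fopen('php://temp', 'r+');

    // The three BOM bytes let spreadsheet software detect UTF-8.
    fwrite($handle, "\xEF\xBB\xBF");

    foreach ($rows as $row) {
        // Delimiter, enclosure and escape are passed explicitly.
        fputcsv($handle, $row, ',', '"', '\\');
    }

    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);

    return $csv;
}

$csv = csv_with_bom_sketch([
    ['url', 'title'],
    ['https://example.com/', 'Café'],
]);
```

Without the BOM, some spreadsheet applications assume a legacy encoding and display accented characters incorrectly.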
Brian Huisman 53e86085bc Update crawler.php
Don't need trim() here. OS_cleanTextUTF8 already does it.
2023-04-25 13:35:22 -04:00
Brian Huisman 8d091c8195 Update crawler.php
Add error condition for empty PDF, don't index.
2023-04-25 12:46:38 -04:00
Brian Huisman 2665cff354 Change If-Modified-Since calculation
Use the last_modified date of the individual file for the If-Modified-Since header instead of the date of the last successful crawl.
2023-04-25 10:01:53 -04:00
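A minimal sketch of building the conditional request header from a per-file timestamp, as this commit describes. The function name `conditional_headers_sketch` and the Unix-timestamp input are assumptions; how crawler.php stores `last_modified` is not shown here.

```php
<?php
// Hypothetical helper: returns request headers for a conditional GET.
function conditional_headers_sketch(?int $lastModified): array
{
    // A page that has never been crawled has no stored date,
    // so it is requested unconditionally.
    if ($lastModified === null) {
        return [];
    }

    // HTTP dates are always expressed in GMT.
    $httpDate = gmdate('D, d M Y H:i:s', $lastModified) . ' GMT';

    return ['If-Modified-Since: ' . $httpDate];
}

$headers = conditional_headers_sketch(0);
```

The returned array is in the form accepted by cURL's `CURLOPT_HTTPHEADER` option. Using each file's own date avoids sending a date that is newer than the copy already stored for that file, which is what can happen when the date of the last successful crawl is used instead.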
Brian Huisman b3b40a9194 Implement filetype: searching 2023-04-24 16:31:27 -04:00
Brian Huisman 84e38a5663 Re-upload 3rd party libraries 2023-04-20 10:47:11 -04:00
Brian Huisman 358fa42aee Update crawler.php 2023-04-19 16:23:42 -04:00
Brian Huisman 1363370840 Fix for dynamic classes deprecation in PHP 8.2 2023-04-19 11:50:48 -04:00
Brian Huisman 6c78df9d92 Choose entity flag based on DOCTYPE 2023-04-18 18:36:11 -04:00
Brian Huisman ec2b7aa075 Daily updates, big flow change in crawler.php 2023-04-18 17:20:27 -04:00
Brian Huisman a57cb3ca83 Proper breaks in switch 2023-04-17 18:48:02 -04:00
Brian Huisman 553fc019fe Daily update 2023-04-17 17:47:22 -04:00
Brian Huisman 17fa8fae05 Tighten up file headings 2023-04-13 08:27:41 -04:00
Brian Huisman 062f009829 Updates for the day 2023-04-12 19:08:00 -04:00
Brian Huisman 595740962e Update name to Orcinus 2023-04-12 08:28:29 -04:00
Brian Huisman bffa144421 move os3/ to orcinus/ 2023-04-12 08:08:11 -04:00
Renamed from os3/crawler.php