47 commits

Author SHA1 Message Date
Brian Huisman 4bbe1d967b Misc fixes
Save the process id of the crawler in the sp_crawling DB value instead of just a flag; we can use it to compare and further prevent race conditions which still seem to happen occasionally.
2023-10-17 10:36:34 -04:00
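The commit above does not spell out how the stored process id is compared, so the following is only a minimal sketch of one way a PID-based lock check might work. It is written in Python rather than the crawler's PHP; the function names `pid_is_running` and `may_start_crawl` are hypothetical, and the liveness probe assumes a POSIX system.

```python
import os


def pid_is_running(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks that a process exists without signalling it."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def may_start_crawl(stored_pid: int, my_pid: int) -> bool:
    """Decide whether this process may take (or already holds) the crawl lock.

    stored_pid stands in for the value kept in sp_crawling; 0 is assumed
    here to mean "no crawl running".
    """
    if stored_pid == 0:
        return True  # nobody holds the lock
    if stored_pid == my_pid:
        return True  # we already hold it
    # Another PID is recorded: only proceed if that process is gone.
    return not pid_is_running(stored_pid)
```

Compared with a plain 0/1 flag, a recorded PID lets a second crawler tell a crawl that is still alive apart from a stale value left behind by a crashed one.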
Brian Huisman 1860d1f8ce Totally forgot to actually implement this feature
The "remove text from titles" feature was coded into the admin UI from the previous version, but was never actually implemented in the crawler. Wow. It works now.
2023-09-27 15:33:06 -04:00
Brian Huisman 4c78a5245f Use REPLACE INTO for resiliency 2023-09-12 10:44:14 -04:00
Brian Huisman 511207e0b2 Add PDF Last Modified multiplier 2023-09-08 15:11:27 -04:00
Brian Huisman 382511077a Misc updates
Prettify some SQL code.
Add some error-response code for fatal failed SQL statements.
2023-07-21 13:04:51 -04:00
Brian Huisman 229129a9e4 Update crawler.php
Get and set sp_crawling in real-time to minimize race conditions.
2023-07-06 15:09:31 -04:00
Brian Huisman a5ff604f58 Update crawler.php
Update crawler.php to also try using XMP metadata from updated PDFParser
2023-07-04 13:46:12 -04:00
Brian Huisman 30630c6c60 Start enforcing PHP and SQL version limitations. 2023-06-26 15:00:42 -04:00
Brian Huisman 3307baac4d Update crawler.php
Run mb_convert_encoding in ALL cases to remove potentially invalid UTF-8 characters.
Add the "replacement" UTF-8 character to the whitespace array to ensure it's removed.
2023-06-22 15:35:40 -04:00
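As a rough illustration of the idea in the commit above, the sketch below decodes bytes leniently so that invalid UTF-8 sequences become the replacement character (U+FFFD), then treats that character as whitespace so it disappears from the cleaned text. This is a Python analogue, not the crawler's PHP code; `clean_utf8` is a hypothetical name.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode "replacement" character


def clean_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, dropping invalid sequences and collapsing whitespace."""
    # Invalid byte sequences are turned into U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    # Treat the replacement character like whitespace so it does not reach the index.
    text = text.replace(REPLACEMENT_CHAR, " ")
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())
```

For example, `clean_utf8(b"caf\xc3\xa9 \xff menu")` yields `"café menu"`: the stray `0xFF` byte is replaced, then removed along with the extra spacing.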
Brian Huisman 042339d3ef Update crawler.php
Don't assume that other data from a PDF is the same as the content. Bypasses some still-unfixed PDFParser encoding issues.
Also exit the crawler script if we are in debug mode and there is a crawl already running.
2023-06-21 17:23:08 -04:00
Brian Huisman 0a83546411 Update crawler.php
Make sure regexp lines in require and ignore URL fields are actually treated as regexps.
2023-06-19 11:51:57 -04:00
Brian Huisman e76fdf730c s_show_orphans cleanup
Make 's_show_orphans' a runtime variable and normalize the SQL queries it's used in.
Also change generic '$select' variable to more semantic '$crawldata'.
2023-06-15 10:19:05 -04:00
Brian Huisman fd2bbf745f Add 'resumed' flag to sp_progress
Add a third value to the sp_progress config value to let the script know if a crawl was resumed or not.
Also restore the sp_sha1 data from the crawltemp table on a resumed crawl.
2023-06-12 12:19:00 -04:00
Brian Huisman 5cfeb0a414 Update crawler.php
Also rebuild the domains list if a crawl is resumed.
2023-06-08 09:03:45 -04:00
Brian Huisman 87ecb553a7 Update crawler.php
Whoops, remove debug code.
2023-06-05 11:00:40 -04:00
Brian Huisman 783f1d97ca Update crawler.php
Merge function of $updateNotModified SQL statement with $insertNotModified.
2023-06-05 10:58:32 -04:00
Brian Huisman 8b024c438c Remove some unnecessary continues
Also add documentation for the crawler debug mode.
Scope fixes for JS output, still need to work on this.
2023-06-02 14:05:52 -04:00
Brian Huisman 56c84a89cb Prevent endless loop
If an orphan URL is blocked by a user rule, then remove it from the 'sp_exist' list so it doesn't keep coming back again and again. This only happens the next crawl after the user adds new rules.
Other misc edits.
2023-06-01 12:20:09 -04:00
Brian Huisman 3f9d713633 Simplify logging 2023-05-30 16:12:01 -04:00
Brian Huisman 727936cb80 Update crawler.php
$url => $row['url'] fix, and a couple other tweaks.
2023-05-30 16:07:41 -04:00
Brian Huisman 89f6fc2393 Try to resume a failed crawl.
Attempt to resume a crawl if it exited without going through the shutdown function
2023-05-30 15:53:41 -04:00
Brian Huisman c7c4960e1e Strict in_array checking 2023-05-30 15:01:24 -04:00
Brian Huisman 9e551324f3 data transferred
'sp_data_transferred' is now an ODATA variable.
2023-05-29 19:15:47 -04:00
Brian Huisman d8e9d5dc91 Admin UI edits for when crawl is in progress
Automatically encode/decode json when saving/reading ODATA config values.
Remove 'sp_links_crawled' config table value, now stored in 'sp_progress'.
Update Crawl Information window in real-time while crawler is running. Be more aggressive at reloading the page to get the latest data once a crawl has finished.
Time the setting of certain config values while crawling in a more sensible way.
2023-05-16 12:00:28 -04:00
Brian Huisman f16c4f9e0a Refactor character normalization
Refactor the whitespace and punctuation normalization arrays.
2023-05-12 13:41:36 -04:00
Brian Huisman 4bb28031b6 Enable downloading Page Index
Allow downloading of the page index as a csv.
Remove unnecessary database columns url_base and status_noindex
Store list of domains at crawl so we don't need to request them every page-load; you will need to reinstall fresh because of this change
2023-05-12 10:06:57 -04:00
Brian Huisman 803155547d Rename to sp_punct
Rename sp_smart ("smart" punctuation) to the more general and accurate sp_punct
2023-05-05 11:54:07 -04:00
Brian Huisman 83f8fc9ed2 JavaScript crawl support enhancement
Don't require reloading the page after a crawl has completed.
JavaScript will dynamically update the Crawler Information values if we are on the Crawler Management page.
2023-04-28 13:55:26 -04:00
Brian Huisman ddc601697c Ping server to see if crawl has started
If admin UI is loaded while a crawl is not running, add a ping every 5 seconds to check if one has started. Fix issue where reloading the page while a crawl was running would cause a JS error that would cancel the crawl.
2023-04-28 12:26:58 -04:00
Brian Huisman 0c733426db Update crawler.php
Use the previously crawled page's category value if available.
2023-04-27 13:22:20 -04:00
Brian Huisman 41f6b25f0f Allow specifying Default Category 2023-04-27 13:10:22 -04:00
Brian Huisman ba04173c29 Daily updates
Keep Page Index pagination page within limits; add UTF-8 BOM to CSV and TXT download output; use utf8mb4_unicode_520_ci collation to remove need for SQL REGEXP; add more Latin accent equivalent characters.
2023-04-26 15:16:13 -04:00
Brian Huisman 53e86085bc Update crawler.php
Don't need trim() here. OS_cleanTextUTF8 already does it.
2023-04-25 13:35:22 -04:00
Brian Huisman 8d091c8195 Update crawler.php
Add error condition for empty PDF, don't index.
2023-04-25 12:46:38 -04:00
Brian Huisman 2665cff354 Change If-Modified-Since calculation
Use the last_modified date of the individual file for the If-Modified-Since header instead of the date of the last successful crawl.
2023-04-25 10:01:53 -04:00
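A minimal sketch of the per-file approach described in the commit above, in Python rather than the crawler's PHP: each URL's own stored last-modified timestamp is formatted as an HTTP date for the conditional request. The function name and the assumption that the timestamp is stored as Unix seconds are illustrative only.

```python
from email.utils import formatdate


def if_modified_since_header(last_modified_ts: float) -> dict:
    """Build a conditional-request header from one file's last-modified time.

    last_modified_ts is assumed to be a Unix timestamp stored per URL.
    """
    # usegmt=True produces the HTTP date format, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    return {"If-Modified-Since": formatdate(last_modified_ts, usegmt=True)}
```

A server that honours the header can answer 304 Not Modified for a file that has not changed since its own recorded date, regardless of when the last full crawl finished.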
Brian Huisman b3b40a9194 Implement filetype: searching 2023-04-24 16:31:27 -04:00
Brian Huisman 84e38a5663 Re-upload 3rd party libraries 2023-04-20 10:47:11 -04:00
Brian Huisman 358fa42aee Update crawler.php 2023-04-19 16:23:42 -04:00
Brian Huisman 1363370840 Fix for dynamic classes deprecation in PHP 8.2 2023-04-19 11:50:48 -04:00
Brian Huisman 6c78df9d92 Choose entity flag based on DOCTYPE 2023-04-18 18:36:11 -04:00
Brian Huisman ec2b7aa075 Daily updates, big flow change in crawler.php 2023-04-18 17:20:27 -04:00
Brian Huisman a57cb3ca83 Proper breaks in switch 2023-04-17 18:48:02 -04:00
Brian Huisman 553fc019fe Daily update 2023-04-17 17:47:22 -04:00
Brian Huisman 17fa8fae05 Tighten up file headings 2023-04-13 08:27:41 -04:00
Brian Huisman 062f009829 Updates for the day 2023-04-12 19:08:00 -04:00
Brian Huisman 595740962e Update name to Orcinus 2023-04-12 08:28:29 -04:00
Brian Huisman bffa144421 move os3/ to orcinus/ 2023-04-12 08:08:11 -04:00
Renamed from os3/crawler.php