beenull/orcinus-search

Author	SHA1	Message	Date
Brian Huisman	4bbe1d967b	Misc fixes Save the process id of the crawler in the sp_crawling DB value instead of just a flag; we can use it to compare and further prevent race conditions which still seem to happen occasionally.	2023-10-17 10:36:34 -04:00
Brian Huisman	1860d1f8ce	Totally forgot to actually implement this feature The "remove text from titles" feature was coded into the admin UI from the previous version, but was never actually implemented in the crawler. Wow. It works now.	2023-09-27 15:33:06 -04:00
Brian Huisman	4c78a5245f	Use REPLACE INTO for resiliency	2023-09-12 10:44:14 -04:00
Brian Huisman	511207e0b2	Add PDF Last Modified multiplier	2023-09-08 15:11:27 -04:00
Brian Huisman	382511077a	Misc updates Prettify some SQL code. Add some error-response code for fatal failed SQL statements.	2023-07-21 13:04:51 -04:00
Brian Huisman	229129a9e4	Update crawler.php Get and set sp_crawling in real-time to minimize race conditions.	2023-07-06 15:09:31 -04:00
Brian Huisman	a5ff604f58	Update crawler.php Update crawler.php to also try using XMP metadata from updated PDFParser	2023-07-04 13:46:12 -04:00
Brian Huisman	30630c6c60	Start enforcing PHP and SQL version limitations.	2023-06-26 15:00:42 -04:00
Brian Huisman	3307baac4d	Update crawler.php Run mb_convert_encoding in ALL cases to remove potentially invalid UTF-8 characters. Add the "replacement" UTF-8 character to the whitespace array to ensure it's removed.	2023-06-22 15:35:40 -04:00
Brian Huisman	042339d3ef	Update crawler.php Don't assume that other data from a PDF is the same as the content. Bypasses some still-unfixed PDFParser encoding issues. Also exit the crawler script if we are in debug mode and there is a crawl already running.	2023-06-21 17:23:08 -04:00
Brian Huisman	0a83546411	Update crawler.php Make sure regexp lines in require and ignore URL fields are actually treated as regexps.	2023-06-19 11:51:57 -04:00
Brian Huisman	e76fdf730c	s_show_orphans cleanup Make 's_show_orphans' a runtime variable and normalize the SQL queries it's used in. Also change generic '$select' variable to more semantic '$crawldata'.	2023-06-15 10:19:05 -04:00
Brian Huisman	fd2bbf745f	Add 'resumed' flag to sp_progress Add a third value to the sp_progress config value to let the script know if a crawl was resumed or not. Also restore the sp_sha1 data from the crawltemp table on a resumed crawl.	2023-06-12 12:19:00 -04:00
Brian Huisman	5cfeb0a414	Update crawler.php Also rebuild the domains list if a crawl is resumed.	2023-06-08 09:03:45 -04:00
Brian Huisman	87ecb553a7	Update crawler.php Whoops, remove debug code.	2023-06-05 11:00:40 -04:00
Brian Huisman	783f1d97ca	Update crawler.php Merge function of $updateNotModified SQL statement with $insertNotModified.	2023-06-05 10:58:32 -04:00
Brian Huisman	8b024c438c	Remove some unnecessary continues Also add documentation for the crawler debug mode. Scope fixes for JS output, still need to work on this.	2023-06-02 14:05:52 -04:00
Brian Huisman	56c84a89cb	Prevent endless loop If an orphan URL is blocked by a user rule, then remove it from the 'sp_exist' list so it doesn't keep coming back again and again. This only happens the next crawl after the user adds new rules. Other misc edits.	2023-06-01 12:20:09 -04:00
Brian Huisman	3f9d713633	Simplify logging	2023-05-30 16:12:01 -04:00
Brian Huisman	727936cb80	Update crawler.php $url => $row['url'] fix, and a couple other tweaks.	2023-05-30 16:07:41 -04:00
Brian Huisman	89f6fc2393	Try to resume a failed crawl. Attempt to resume a crawl if it exited without going through the shutdown function	2023-05-30 15:53:41 -04:00
Brian Huisman	c7c4960e1e	Strict in_array checking	2023-05-30 15:01:24 -04:00
Brian Huisman	9e551324f3	data transferred 'sp_data_transferred' is now an ODATA variable.	2023-05-29 19:15:47 -04:00
Brian Huisman	d8e9d5dc91	Admin UI edits for when crawl is in progress Automatically encode/decode json when saving/reading ODATA config values. Remove 'sp_links_crawled' config table value, now stored in 'sp_progress'. Update Crawl Information window in real-time while crawler is running. Be more aggressive at reloading the page to get the latest data once a crawl has finished. Time the setting of certain config values while crawling in a more sensible way.	2023-05-16 12:00:28 -04:00
Brian Huisman	f16c4f9e0a	Refactor character normalization Refactor the whitespace and punctuation normalization arrays.	2023-05-12 13:41:36 -04:00
Brian Huisman	4bb28031b6	Enable downloading Page Index Allow downloading of the page index as a CSV. Remove unnecessary database columns url_base and status_noindex. Store list of domains at crawl so we don't need to request them every page-load; you will need to reinstall fresh because of this change.	2023-05-12 10:06:57 -04:00
Brian Huisman	803155547d	Rename to sp_punct Rename sp_smart ("smart" punctuation) to the more general and accurate sp_punct	2023-05-05 11:54:07 -04:00
Brian Huisman	83f8fc9ed2	Javascript crawl support enhancement Don't require reloading the page after a crawl has completed. Javascript will dynamically update the Crawler Information values if we are on the Crawler Management page.	2023-04-28 13:55:26 -04:00
Brian Huisman	ddc601697c	Ping server to see if crawl has started If admin UI is loaded while a crawl is not running, add a ping every 5 seconds to check if one has started. Fix issue where reloading the page while a crawl was running would cause a JS error that would cancel the crawl.	2023-04-28 12:26:58 -04:00
Brian Huisman	0c733426db	Update crawler.php Use the previously crawled page's category value if available.	2023-04-27 13:22:20 -04:00
Brian Huisman	41f6b25f0f	Allow specifying Default Category	2023-04-27 13:10:22 -04:00
Brian Huisman	ba04173c29	Daily updates Keep Page Index pagination page within limits; add UTF-8 BOM to CSV and TXT download output; use utf8mb4_unicode_520_ci collation to remove need for SQL REGEXP; add more Latin accent equivalent characters.	2023-04-26 15:16:13 -04:00
Brian Huisman	53e86085bc	Update crawler.php Don't need trim() here. OS_cleanTextUTF8 already does it.	2023-04-25 13:35:22 -04:00
Brian Huisman	8d091c8195	Update crawler.php Add error condition for empty PDF, don't index.	2023-04-25 12:46:38 -04:00
Brian Huisman	2665cff354	Change If-Modified-Since calculation Use the last_modified date of the individual file for the If-Modified-Since header instead of the date of the last successful crawl.	2023-04-25 10:01:53 -04:00
Brian Huisman	b3b40a9194	Implement filetype: searching	2023-04-24 16:31:27 -04:00
Brian Huisman	84e38a5663	Re-upload 3rd party libraries	2023-04-20 10:47:11 -04:00
Brian Huisman	358fa42aee	Update crawler.php	2023-04-19 16:23:42 -04:00
Brian Huisman	1363370840	Fix for dynamic classes deprecation in PHP 8.2	2023-04-19 11:50:48 -04:00
Brian Huisman	6c78df9d92	Choose entity flag based on DOCTYPE	2023-04-18 18:36:11 -04:00
Brian Huisman	ec2b7aa075	Daily updates, big flow change in crawler.php	2023-04-18 17:20:27 -04:00
Brian Huisman	a57cb3ca83	Proper breaks in switch	2023-04-17 18:48:02 -04:00
Brian Huisman	553fc019fe	Daily update	2023-04-17 17:47:22 -04:00
Brian Huisman	17fa8fae05	Tighten up file headings	2023-04-13 08:27:41 -04:00
Brian Huisman	062f009829	Updates for the day	2023-04-12 19:08:00 -04:00
Brian Huisman	595740962e	Update name to Orcinus	2023-04-12 08:28:29 -04:00
Brian Huisman	bffa144421	move os3/ to orcinus/	2023-04-12 08:08:11 -04:00

47 commits