All required information. [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present, version 4.2024a, published February 12, 2024, is the current release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and phishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierarchical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
To allow Sphider-plus to be integrated into existing sites, HTML templates are prepared for
- Search form
- Text result listing
- Media result listing
- Most popular queries
- etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate files
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holds lists of details.
All text links, media links and thumbnails are actively linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing
Each new URL added to the Admin backend can be assigned a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/. The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
. . .
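The naming scheme above can be taken apart mechanically, for example when sorting or archiving the files in /admin/log/. The following is a minimal Python sketch, not part of Sphider-plus; the function name `parse_log_name` and the regular expression are assumptions derived only from the two sample names shown above.

```python
import re
from datetime import datetime

# Pattern inferred from the sample names in the text (an assumption):
# db<number>_<yymmdd>-<hh.mm.ss>_<thread suffix>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>.+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, start time and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised log file name: " + name)
    # 100524 is read as yymmdd, i.e. May 24, 2010, as stated in the text.
    started = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],  # kept as text, whatever follows the last underscore
    }

info = parse_log_name("db2_100524-21.47.56_1.html")
print(info["database"], info["started"].isoformat(), info["thread"])
```

The thread suffix is deliberately kept as free text, so the sketch does not depend on how the suffix is chosen.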
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php . . .
. . .
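The MD5 check mentioned above rests on a simple idea: store a checksum per page and skip the page when a fresh download produces the same checksum. The following Python sketch illustrates only that idea. Which bytes Sphider-plus actually hashes and where it stores the value are not stated here, so the helper names `md5sum` and `needs_reindex` are hypothetical.

```python
import hashlib

def md5sum(content):
    """Return the hexadecimal MD5 digest of the page content (bytes)."""
    return hashlib.md5(content).hexdigest()

def needs_reindex(stored_md5, content):
    """True when the freshly fetched content differs from what was indexed before."""
    return stored_md5 != md5sum(content)

old_page = b"<html><body>Opening hours: 9-17</body></html>"
new_page = b"<html><body>Opening hours: 9-18</body></html>"
stored = md5sum(old_page)  # value assumed to be kept from the previous index run

print(needs_reindex(stored, old_page))  # unchanged page can be skipped
print(needs_reindex(stored, new_page))  # changed page has to be indexed again
```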
4.4 Ignoring parts of a page
Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>
Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added to the page. A more flexible method, . . .
. . .
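The behaviour described in 4.4 (text between the tags is dropped, links inside it are still followed) can be sketched as follows. This is an illustration in Python, not the Sphider-plus implementation. The pattern accepts the tag spelling shown above and, as an assumption, also an HTML-comment spelling with dashes; `split_page` and both patterns are hypothetical names.

```python
import re

# Matches an opening noindex tag, everything up to the closing tag, and the closing tag.
# Optional dashes are tolerated in case the tags are written as HTML comments (assumption).
NOINDEX = re.compile(
    r"<!-*\s*sphider_noindex\s*-*>.*?<!-*\s*/sphider_noindex\s*-*>",
    re.DOTALL | re.IGNORECASE,
)
# Deliberately simple link pattern, sufficient for the illustration only.
HREF = re.compile(r"href=[\"']([^\"']+)[\"']", re.IGNORECASE)

def split_page(html):
    """Return (text to index, links to follow) for one page."""
    links = HREF.findall(html)          # links are collected from the whole page
    indexable = NOINDEX.sub(" ", html)  # excluded parts are removed before indexing
    return indexable, links

page = (
    'Intro <! sphider_noindex ><a href="/menu.html">Menu</a>'
    '<! /sphider_noindex > Body text'
)
indexable, links = split_page(page)
print(indexable)
print(links)
```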
a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>
If enabled in the Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
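The `*/ ... /` entry syntax used in the list-files can be pictured with a small Python sketch. It assumes that surrounding blanks are ignored (the example above is written with blanks), that entries without the `*/` prefix are literal ids, and that the whole id has to match. None of these details is confirmed by the text, and `compile_entry` and `id_matches` are hypothetical helpers.

```python
import re

def compile_entry(entry):
    """Turn one list-file entry into a compiled pattern."""
    entry = entry.strip()
    # "*/ pattern /" marks a regexp; anything else is treated as a literal id.
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        return re.compile(entry[2:-1].strip())
    return re.compile(re.escape(entry))

def id_matches(entry, div_id):
    """True when the div id is covered by the list-file entry (full match assumed)."""
    return compile_entry(entry).fullmatch(div_id) is not None

print(id_matches("*/ menu[0-5] /", "menu3"))
print(id_matches("*/ menu[0-5] /", "menu7"))
print(id_matches("sidebar", "sidebar"))
```

Note that Python's `re` syntax stands in for whatever regexp dialect the indexer uses, so complex patterns may behave differently.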
a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>
This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>
This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037 cp424 cp500 cp856 cp875 cp1006 cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes

DSP implementation for NeXT:
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
. . .
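What "convert text into UTF-8" means for one of the listed charsets can be shown in a few lines. The sketch below uses Python's built-in codecs as a stand-in for the ConvertCharset function, which is an assumption made purely for illustration. Not every name in the list above is necessarily known to Python, and `to_utf8` is a hypothetical helper.

```python
def to_utf8(raw, source_charset):
    """Decode bytes from the page's charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")

# A Cyrillic word as it would arrive from a windows-1251 encoded page.
raw_page_text = "Привет".encode("windows-1251")
converted = to_utf8(raw_page_text, "windows-1251")

print(len(raw_page_text), len(converted))  # one byte per letter before, two after
print(converted.decode("utf-8"))
```

If the declared charset is wrong, decoding either fails or silently yields wrong characters, which is why the charset has to be identified correctly before conversion.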
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data
Up to version 1.5 of Sphider-plus, no output was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not shown during the index / re-index procedure, waiting for results became tedious. Selectable in the Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words
Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters at the end of a word might be erased accidentally.

13. Media search for images, audio streams and videos
13.1 Media indexing
Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input
'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename size of original image, link- and thumb-id
- Option to delete the thumbnail
In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move
This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables, as described in chapter 'Definition and configuration'. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be searched. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
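A client that consumes the JSON result output could read it roughly like this. The Python sketch below uses a shortened sample built only from field names that appear in the output above. How the JSON is requested from the search script is not covered by this excerpt, so the sample string and the helper `summarize` are assumptions for illustration.

```python
import json

# Shortened sample using field names from the output shown above.
SAMPLE = (
    r'{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,'
    r'"text_results":[{"num":1,"weight":"100.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",'
    r'"title":" Info_eng","page_size":"5,661.6kb",'
    r'"domain_name":"www.english.le-piaggie.info"}]}'
)

def summarize(payload):
    """Return (rank, weight, title, url) for every text result in the payload."""
    data = json.loads(payload)
    return [
        # weight arrives as a string and titles carry a leading blank in the sample
        (hit["num"], float(hit["weight"]), hit["title"].strip(), hit["url"])
        for hit in data["text_results"]
    ]

for rank, weight, title, url in summarize(SAMPLE):
    print(rank, weight, title, url)
```

The escaped slashes (`\/`) in the URLs are valid JSON and are turned back into plain slashes by the parser.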
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable the customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
- Search form
- Text result listing
- Media result listing
- Most popular queries
- etc.

Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details.

All text links, media links and thumbnails are presented as active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing

Each new URL added to the Admin backend can be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
. . .
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the name of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
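The log file names shown above can be taken apart again with a small script. The following is only a sketch in Python, not part of Sphider-plus: the function name parse_log_name is hypothetical, and it assumes that the date is always written as YYMMDD and that everything between the last underscore-separated time and ".html" is the thread number or ID.

```python
import re
from datetime import datetime

# Assumed layout: db<number>_<YYMMDD>-<HH.MM.SS>_<thread or ID>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>.+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, start time and thread ID."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised log file name: " + name)
    started = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }
```

For db2_100524-21.48.12_ID2.html this yields database 2, a start time of May 24, 2010 at 21:48:12, and the thread ID "ID2".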
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php . . .
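Starting several threads by hand can be scripted. The sketch below is written in Python and only assumes what is shown above: that each thread is one call of php spider.php with an option such as -new or -erased followed by an individual ID. The helper names build_commands and start_threads are hypothetical, and running each thread as a separate operating system process is an assumption of this sketch.

```python
import subprocess

def build_commands(option, ids, php="php", script="spider.php"):
    """Return one command line per thread, e.g. php spider.php -erased1."""
    return [[php, script, "-" + option + str(thread_id)] for thread_id in ids]

def start_threads(commands):
    """Start every command as its own process and return the process handles."""
    return [subprocess.Popen(command) for command in commands]

# Example (not executed here):
#   processes = start_threads(build_commands("erased", [1, 2]))
#   for process in processes:
#       process.wait()
```

The commands are built as argument lists rather than shell strings, so an ID containing spaces or shell characters is passed to PHP unchanged.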
. . .
4.4 Ignoring parts of a page

Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>

Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
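The behavior of the sphider_noindex markers from section 4.4 (text between them is skipped, links inside are still followed) can be illustrated with a short Python sketch. It is not the Sphider-plus implementation. The function names are hypothetical, and the pattern deliberately accepts the markers both as written above and as HTML comments with dashes, because the exact spelling expected by the indexer is not confirmed here.

```python
import re

# Accepts "<! sphider_noindex >" as well as "<!--sphider_noindex-->" (assumption).
NOINDEX_BLOCK = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>.*?<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.DOTALL | re.IGNORECASE,
)
HREF = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def strip_noindex(html):
    """Return the page text that would still be indexed."""
    return NOINDEX_BLOCK.sub("", html)

def find_links(html):
    """Collect links from the whole page, including the excluded parts."""
    return HREF.findall(html)
```

Links are collected from the unmodified page, while only the stripped text is passed on for keyword indexing.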
. . .
a regexp pattern. The regexp must start with */ and end with another slash. Example:
*/ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>

If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp must start with */ and end with another slash. Example:
*/ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>

This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must start with */ and end with another slash. Example:
*/nav[0-5]/
Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>

This is the vice versa . . .
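The entry format of the list files (a plain name, or a regexp written as */ pattern /) can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, trimming the blanks inside the delimiters is an assumption based on the examples */ menu[0-5] / and */nav[0-5]/, and matching the whole name (rather than a part of it) is also an assumption.

```python
import re

def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading */ and the closing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the name literally.
    return re.compile(re.escape(entry))

def entry_matches(entry, name):
    """True if the id or element name is covered by the list-file entry."""
    return compile_entry(entry).fullmatch(name) is not None
```

With these assumptions the entry */ menu[0-5] / covers menu0 to menu5 but not menu7, and the plain entry nav does not cover NAV.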
. . .
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes

DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
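What the conversion into UTF-8 means for a single page can be shown with a few lines of Python. This sketch uses Python's built-in codecs instead of the ConvertCharset function, so it only illustrates the principle. The function name to_utf8 is hypothetical, and not every charset name from the list above is necessarily known to Python.

```python
def to_utf8(raw, charset):
    """Decode bytes from the given source charset and re-encode them as UTF-8."""
    # bytes.decode raises LookupError for charset names Python does not know
    # and UnicodeDecodeError for bytes that are invalid in that charset.
    text = raw.decode(charset)
    return text.encode("utf-8")
```

For example, the six bytes CF F0 E8 E2 E5 F2 read as windows-1251 become the UTF-8 encoding of the Russian word "Привет".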
. . .
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, no printout was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words

Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing

Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail

In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move

This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be searched. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link>
<title>warp.gif</title>
<x_size>635</x_size>
<y_size>98</y_size>
</media_result>
<media_result>
<num>2</num>
<type>audio</type>
<url>http://www.abc.de/index.php</url>
. . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
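A client that receives the JSON output shown above could read it as sketched below in Python. The sample string is an abridged copy of that output (the fulltxt and page_size fields are left out, and only a few top-level fields are kept), and the function name summarize is hypothetical.

```python
import json

# Abridged sample built from the field names of the output shown above.
SAMPLE = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","domain_name":"www.english.le-piaggie.info"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","domain_name":"www.english.le-piaggie.info"}]}'''

def summarize(payload):
    """Return (num, weight, url, title) for every text result in the payload."""
    data = json.loads(payload)
    return [
        (result["num"], float(result["weight"]), result["url"], result["title"].strip())
        for result in data.get("text_results", [])
    ]
```

Note that the weight is delivered as a string and the title carries a leading blank, so both are normalized here. The escaped slashes (\/) in the URLs are resolved by the JSON parser itself.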
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 25 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 25 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the names of the corresponding log files, like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be defined according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
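The naming scheme of the log files can be taken apart programmatically, for example when sorting the logs of several threads. The following is a minimal Python sketch, not part of Sphider-plus; the function name parse_log_name and the regular expression are assumptions derived only from the example file names shown here.

```python
import re
from datetime import datetime

# Pattern inferred from the example names in this documentation:
#   db<number>_<yymmdd>-<hh.mm.ss>_<thread number or ID>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})"
    r"_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread ID."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name}")
    # 100524 is read as yymmdd (May 24, 2010), 21.47.56 as hh.mm.ss
    started = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }
```

Calling parse_log_name("db2_100524-21.48.12_ID2.html") yields database 2, the start time of the thread and the thread ID "ID2".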
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
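The idea behind the MD5 check mentioned above can be illustrated with a few lines of Python. This is only a sketch of the general technique (compare a stored checksum with the checksum of the freshly fetched page); the function names are hypothetical and it does not reproduce the PHP code of Sphider-plus.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the MD5 checksum of a page body as a hex string."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(stored_md5: str, fetched_content: bytes) -> bool:
    """A page only needs to be re-indexed if its checksum has changed."""
    return stored_md5 != md5sum(fetched_content)
```

A page whose checksum is unchanged can be skipped, which is what makes the re-index procedure fast.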
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
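For the <! sphider_noindex > markers of section 4.4, the removal of the enclosed text can be sketched as follows. This is a Python illustration under assumptions: the pattern accepts the markers as printed in this documentation and also an HTML comment spelling, and the function name strip_noindex is hypothetical. Following the links inside the excluded part, as Sphider-plus does, is not covered by the sketch.

```python
import re

# Accepts "<! sphider_noindex >" as printed here, and also the
# HTML comment spelling "<!--sphider_noindex-->" (an assumption).
NOINDEX = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>"
    r".*?"
    r"<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Replace every marked block by a blank before the text is indexed."""
    return NOINDEX.sub(" ", html)
```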
. . .
a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
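How an entry of such a list-file could be interpreted is sketched below in Python. The sketch makes several assumptions that this documentation does not confirm: blanks around the pattern are ignored, a regexp has to match the complete name, and plain entries are compared literally and case-sensitively. The function name compile_entry is hypothetical.

```python
import re


def compile_entry(entry: str):
    """Turn one list-file entry into a test for an id or element name."""
    entry = entry.strip()
    # Entries written as */ ... / are treated as regular expressions
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        pattern = re.compile(entry[2:-1].strip())
        return lambda name: pattern.fullmatch(name) is not None
    # All other entries are literal, case-sensitive names
    return lambda name: name == entry
```

With these assumptions, compile_entry("*/ menu[0-5] /") accepts the names menu0 to menu5 and rejects menu7, while compile_entry("nav") accepts only the exact name nav.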
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
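To illustrate how duplicates can be mapped to one canonical URL, the following Python sketch reads the canonical URL from a page. It assumes that the page announces it by a <link rel="canonical" href="..."> element, which is the common convention; how Sphider-plus itself evaluates it is not shown in this excerpt. The names CanonicalFinder and canonical_url are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = attributes.get("href")


def canonical_url(html: str, page_url: str) -> str:
    """Return the announced canonical URL, or the page URL if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url
```

Both duplicate URLs of the example would then be stored under http://www.example.com/product.php?item=swedish-fish, provided their pages carry that link element.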
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
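The conversion into UTF-8 can be illustrated with Python's built-in codecs. This is only a sketch of the principle and not the ConvertCharset function used by Sphider-plus; Python knows many, but not all, of the charsets listed above (mazovia, for example, is not available), and the function name to_utf8 is hypothetical.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the given source charset and re-encode them as UTF-8."""
    # bytes.decode raises LookupError for a charset Python does not know
    return raw.decode(charset).encode("utf-8")


# The single byte 0xE9 is "é" in iso-8859-1 and becomes two bytes in UTF-8
latin = to_utf8(b"\xe9", "iso-8859-1")

# The single byte 0xC0 is the Cyrillic capital "А" in windows-1251
cyrillic = to_utf8(b"\xc0", "windows-1251")
```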
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, no printout was available during index / re-index because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixels - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
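A client could evaluate such a JSON result as sketched below in Python. The keys are taken from the example output above; the shortened sample string, its example.com URLs and the function name summarize are made up for illustration only.

```python
import json

# Shortened sample with the same keys as the JSON output shown above
SAMPLE = r'''{"query":"colour","total_results":2,"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.example.com\/a.pdf","title":" A"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.example.com\/b.html","title":" B"}]}'''


def summarize(payload: str):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(payload)
    return [
        (item["num"], float(item["weight"]), item["url"], item["title"].strip())
        for item in data["text_results"]
    ]


for num, weight, url, title in summarize(SAMPLE):
    print(num, weight, url, title)
```

The escaped slashes (\/) of the output are valid JSON and are decoded to plain slashes by the parser.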
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. A wide range of settings is provided for Sphider-plus, separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 45 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 65 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 165 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE: Cyrillic (DOS))
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes
DSP implementation for NeXT: stdenc, symbol, zdingbat
And specially for old Polish programs: mazovia

This list is to be read . . .
. . .
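The conversion step itself (decode the page with its declared charset, then store it as UTF-8) can be sketched in a few lines of Python. This uses Python's built-in codecs rather than the ConvertCharset function named above, and the helper name to_utf8 is hypothetical. Some names in the list, for example mazovia, are not available as Python standard codecs, so the sketch only covers the common cases.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the given source charset and re-encode as UTF-8.

    Raises LookupError for a charset Python does not know and
    UnicodeDecodeError for bytes that are invalid in that charset.
    """
    text = raw.decode(charset)
    return text.encode("utf-8")


# The Russian word for "hello", encoded in windows-1251 (one byte per letter).
sample = b"\xcf\xf0\xe8\xe2\xe5\xf2"
converted = to_utf8(sample, "windows-1251")
print(converted.decode("utf-8"))
print(len(sample), "bytes before,", len(converted), "bytes after")
```

The byte count doubles in this example because every Cyrillic letter needs two bytes in UTF-8, which is worth keeping in mind when field lengths in a database are sized for single-byte charsets.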
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, no output was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.

As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
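The buffering problems listed above are often worked around by padding the first chunk of output and flushing after every message. The Python sketch below shows only that generic technique; it is not the AJAX mechanism of Sphider-plus, and the function name and the HTML comment used as padding are assumptions made for illustration.

```python
import io
import sys

# Number of bytes some Internet Explorer versions wait for, per the text above.
MIN_FIRST_CHUNK = 256


def emit_progress(messages, out=sys.stdout):
    """Write progress messages so that a browser can show them early."""
    # An HTML comment is invisible on the page but counts as received output.
    out.write("<!--" + " " * MIN_FIRST_CHUNK + "-->\n")
    out.flush()
    for message in messages:
        out.write(message + "<br>\n")
        # Hand each line to the server instead of waiting for script end.
        out.flush()


buffer = io.StringIO()
emit_progress(["Indexing page 1", "Indexing page 2"], out=buffer)
print(len(buffer.getvalue().splitlines()[0]))
```

Flushing inside the script does not help if the web server itself buffers the complete response, which is one reason why a polling approach driven from the browser can be the more reliable choice.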
characters in front of words

Warning: This option should be used with special care and should not be activated for non-ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing

Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixels
- Image details like title, filename, size of original image, link-id and thumb-id
- Option to delete the thumbnail

All tables contain active links for opening the media files. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move

This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables, as described in chapter 'Definition and configuration'. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
</media_result>
<media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
. . .
. . .
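A client that consumes the XML media output could read the <media_result> entries as sketched below in Python. The element names inside <media_result> are taken from the excerpt above; the <results> wrapper element and the values of <num> and <type> for the first entry are assumptions, because that part of the output is not visible in this document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened document; only the child element names are
# taken from the excerpt above.
SAMPLE = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
  <media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
  </media_result>
</results>"""


def read_media_results(xml_text):
    """Return one dict per <media_result>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter("media_result"):
        # Children differ per media type, so collect whatever is present.
        results.append({child.tag: (child.text or "").strip() for child in node})
    return results


for entry in read_media_results(SAMPLE):
    print(entry["num"], entry["type"], entry.get("title", "(no title)"))
```

Collecting the children into a dict avoids hard-coding which fields exist for images, audio streams and videos.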
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
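The JSON output shown above can be consumed with any standard JSON parser. The following Python sketch uses a shortened copy of the example in which the 'fulltxt' fields and some header fields are left out; the function name summarize is hypothetical.

```python
import json

# Shortened copy of the example output; 'fulltxt' and some header fields
# are omitted. A raw string keeps the escaped slashes (\/) intact.
SAMPLE = r"""{"query":"colour","total_results":2,"num_of_results":2,
"from":1,"to":2,"text_results":[
{"num":1,"weight":"100.0",
 "url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
 "title":" Info_eng","page_size":"5,661.6kb",
 "domain_name":"www.english.le-piaggie.info"},
{"num":2,"weight":"50.0",
 "url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
 "title":" Description","page_size":"26.4 kb",
 "domain_name":"www.english.le-piaggie.info"}]}"""


def summarize(json_text):
    """Return the query, the total hit count and one tuple per text result."""
    data = json.loads(json_text)
    rows = []
    for hit in data["text_results"]:
        # 'weight' arrives as a string, and titles carry a leading blank.
        rows.append((hit["num"], float(hit["weight"]),
                     hit["url"], hit["title"].strip()))
    return data["query"], data["total_results"], rows


query, total, rows = summarize(SAMPLE)
print(query, total)
for row in rows:
    print(row)
```

The parser turns the escaped \/ sequences back into plain slashes, so the URLs can be used directly after decoding.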
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
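As a rough illustration of how a client might read the XML excerpt above, the following Python sketch collects every `<media_result>` element into a dictionary. The enclosing `<results>` root element and the function name are assumptions made for this example; the sample contains only the elements visible in the excerpt.

```python
import xml.etree.ElementTree as ET

# The <results> wrapper is hypothetical; the real root element is not shown above.
SAMPLE = """<results>
  <media_result>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
  <media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
  </media_result>
</results>"""


def parse_media_results(xml_text):
    """Return one dict per <media_result>, mapping child tag to its text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("media_result")
    ]
```

All values are returned as strings, so numeric fields such as `x_size` need to be converted by the caller.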
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
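A consumer of the JSON output could read it as sketched below. The sample is a shortened copy of the output above that keeps only some of the keys and leaves out the `fulltxt` extracts; the function name `summarize` is hypothetical.

```python
import json

# Shortened copy of the JSON shown above (fulltxt and some header keys left out).
SAMPLE = (
    r'{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,'
    r'"text_results":['
    r'{"num":1,"weight":"100.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",'
    r'"title":" Info_eng","page_size":"5,661.6kb",'
    r'"domain_name":"www.english.le-piaggie.info"},'
    r'{"num":2,"weight":"50.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",'
    r'"title":" Description","page_size":"26.4 kb",'
    r'"domain_name":"www.english.le-piaggie.info"}]}'
)


def summarize(result_json):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(result_json)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]
```

Note that the weight is delivered as a string and the title carries a leading blank, so both are normalized here.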
templates
22.2 Embed the search engine into existing HTML code
22.3 The different style sheet files
23. JSON, XML and RSS result output
[ Documentation ]
1. Settings, customizing and statistics
To change the settings, behavior and design of Sphider-plus, use the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus such as:
Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains
Categories:
- Add, edit, delete
- Create new subcategory under
Index:
- Basic indexing options
- Advanced options
Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database
Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights
Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize
Templates:
To let customers integrate Sphider-plus into existing sites, HTML templates are prepared for:
- Search form
- Text result listing
- Media result listing
- Most popular queries
- etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate files:
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css
Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds a list of details.
All text links, media links and thumbnails are active links.
As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.
2.5 Preferred Re-indexing
Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
. . .
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
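The naming scheme of the log files can be taken apart as sketched below. The regular expression is derived only from the example names above (database number, date as yymmdd, time as hh.mm.ss, thread number or ID) and is therefore an assumption; `parse_log_name` is a hypothetical helper, not part of Sphider-plus.

```python
import re
from datetime import datetime

# Pattern inferred from the example file names; treat it as an assumption.
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})"
    r"_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread ID."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    started = datetime.strptime(
        match["date"] + match["time"], "%y%m%d%H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],  # kept as text, e.g. "1" or "ID1"
    }
```

The thread part is returned unchanged, because IDs may be chosen freely.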
but will not erase the content of all the other tables. Thus the check of whether the content of a page has changed (MD5 sum) remains available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php . . .
. . .
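The idea behind the MD5 check mentioned above can be sketched as follows: a page is re-indexed only if its checksum differs from the stored one. This is a minimal Python illustration; the dictionary used as storage and both function names are assumptions, whereas Sphider-plus keeps its checksums in database tables.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the MD5 checksum of a page body as a hex string."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(url, content, stored_sums):
    """Return True if the page is new or its content has changed.

    stored_sums is a hypothetical dict {url: md5 hex string} standing in
    for the stored checksums; it is updated in place.
    """
    new_sum = md5sum(content)
    if stored_sums.get(url) == new_sum:
        return False  # unchanged page, skip the expensive re-index
    stored_sums[url] = new_sum
    return True
```

Skipping unchanged pages is what makes the re-index procedure fast: only the download and one hash computation are needed per page.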
4.4 Ignoring parts of a page
Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or menu). Any part of a page between the <!--sphider_noindex--> and <!--/sphider_noindex--> tags is not indexed; however, links in it are followed.
4.5 Ignoring parts of a page by <div id='abc'>
Ignoring parts of a page with the <!--sphider_noindex--> tags requires direct access to the page, because the tags need to be added to the page by editing it. A more flexible method, . . .
. . .
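The behaviour described in 4.4 (text between the noindex tags is skipped, links inside are still followed) can be sketched as below. The sketch assumes that the tags are written as the HTML comments `<!--sphider_noindex-->` and `<!--/sphider_noindex-->`; the regular expressions and the function name are illustrative and not taken from the Sphider-plus source.

```python
import re

# Assumption: the markers are HTML comments, optionally with inner blanks.
NOINDEX = re.compile(
    r"<!--\s*sphider_noindex\s*-->.*?<!--\s*/sphider_noindex\s*-->",
    re.IGNORECASE | re.DOTALL,
)
HREF = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def split_for_indexing(html):
    """Return (indexable_html, links) for one page."""
    links = HREF.findall(html)          # links are taken from the whole page
    indexable = NOINDEX.sub(" ", html)  # marked sections are not indexed
    return indexable, links
```

Collecting the links before removing the marked sections is what keeps a menu out of the index while its targets are still spidered.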
a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example:
*/ menu[0-5] /
4.6 Indexing only parts of a page by <div id='abc'>
If enabled in the Admin settings, the values defined in the list file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example:
*/ table[0-5] /
4.7 Ignore HTML elements defined by <tagname> . . </tagname>
This option is intended to work with the new HTML5 elements such as section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example:
*/nav[0-5]/
Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively.
4.8 Index only HTML elements defined by <tagname> . . </tagname>
This is the reverse . . .
. . .
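The list-file convention used in sections 4.5 to 4.8 (plain names, or a regexp introduced by */ and closed by a slash) could be evaluated as in this Python sketch. It assumes that plain entries must match the whole name, that blanks around the pattern are ignored, and that matching is case-sensitive; these are assumptions for illustration, and both function names are hypothetical.

```python
import re


def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: strip the leading */ and the closing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the name literally.
    return re.compile(re.escape(entry))


def is_listed(name, entries):
    """Return True if name matches any non-empty entry completely."""
    return any(
        compile_entry(entry).fullmatch(name)
        for entry in entries
        if entry.strip()
    )
```

With the entries `footer` and `*/ menu[0-5] /`, the names `footer` and `menu3` are matched, while `menu7` and `Footer` are not.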
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL:
http://www.example.com/product.php?item=swedish-fish
The . . .
. . .
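How a crawler can map duplicate URLs to one canonical URL is sketched below in Python. The sketch assumes that the page declares its canonical URL with the standard `<link rel="canonical" href="...">` element; the class and function names are hypothetical and do not describe the Sphider-plus implementation.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attributes.get("href")


def canonical_url(html, fallback_url):
    """Return the declared canonical URL, or fallback_url if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fallback_url
```

If both duplicate URLs from the example above serve a page with the same canonical link, `canonical_url` returns the same value for both, so the page needs to be stored only once.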
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:
WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS
DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2
MAC (Apple)
x-mac-cyrillic, x-mac-greek, x-mac-icelandic, x-mac-ce, x-mac-roman
ISO
iso-8859-1, iso-8859-2, iso-8859-3, iso-8859-4, iso-8859-5, iso-8859-6, iso-8859-7, iso-8859-8, iso-8859-9, iso-8859-10, iso-8859-11, iso-8859-12, iso-8859-13, iso-8859-14, iso-8859-15, iso-8859-16
MISCELLANEOUS
gsm0338 (ETSI GSM 03.38), cp037, cp424, cp500, cp856, cp875, cp1006, cp1026, koi8-r (Cyrillic), koi8-u (Cyrillic Ukrainian), nextstep, us-ascii, us-ascii-quotes
DSP implementation for NeXT
stdenc, symbol, zdingbat
And specially for old Polish programs:
mazovia
This list is to be read . . .
. . .
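The conversion step itself, turning text from one of the listed charsets into UTF-8, looks roughly like this in Python. The sketch uses Python's built-in codecs rather than the ConvertCharset function, so the set of accepted charset names is not identical to the list above; the function name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode raw page bytes from the given charset and re-encode as UTF-8.

    Raises LookupError if the charset name is unknown to Python and
    UnicodeDecodeError if the bytes are not valid in that charset.
    """
    return raw.decode(charset).encode("utf-8")
```

For example, `to_utf8(page_bytes, "windows-1251")` converts a Cyrillic page, provided that the charset named by the page is the one its bytes are really encoded in.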
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/. The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
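The naming scheme above can be taken apart again, for example to sort or filter the files in /admin/log/. The following is a minimal sketch, not code from Sphider-plus: the pattern is inferred only from the example names in this text, the six-digit date is assumed to be year-month-day as in "100524 - Date (May 24, 2010)", and parse_log_name is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern inferred from the example file names in the text (assumption):
# db<number>_<yymmdd>-<hh.mm.ss>_<thread or ID>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<stamp>\d{6}-\d{2}\.\d{2}\.\d{2})_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread label."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    # The date part is read as two-digit year, month, day (assumption).
    started = datetime.strptime(match.group("stamp"), "%y%m%d-%H.%M.%S")
    return {
        "database": int(match.group("db")),
        "started": started,
        "thread": match.group("thread"),
    }


if __name__ == "__main__":
    print(parse_log_name("db2_100524-21.47.56_1.html"))
```

The thread part is matched loosely so that names carrying an individual ID instead of a plain thread number are also accepted.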
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen freely, but the file name limitations of the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
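Starting several indexer threads by hand can be scripted. The sketch below only builds and launches the command lines quoted in this text (php spider.php -erased1, -erased2 and so on). It assumes that each "thread" is a separate php process started from the directory holding spider.php; build_commands and launch are hypothetical helper names.

```python
import subprocess


def build_commands(option, ids, php="php", script="spider.php"):
    """Return one command line per thread ID, e.g. php spider.php -erased1."""
    return [[php, script, f"-{option}{thread_id}"] for thread_id in ids]


def launch(commands):
    """Start all commands without waiting, then collect their exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Only print the command lines; call launch(...) to actually run them.
    for command in build_commands("erased", [1, 2]):
        print(" ".join(command))
```

The same helper covers the -new1, -new2 form by passing "new" as the option.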
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; links in it are still followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
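The behaviour described in 4.4 (text between the tags is skipped, links inside it are still followed) can be illustrated as follows. This is a sketch in Python, not the indexer's own code. The tag spelling is an assumption: the pattern accepts the form printed here as well as an HTML comment form with dashes. split_page is a hypothetical helper.

```python
import re

# Accepts "<! sphider_noindex >" as printed in this text and, as an
# assumption, the HTML comment spelling "<!--sphider_noindex-->".
NOINDEX = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>.*?<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.DOTALL,
)
HREF = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def split_page(html):
    """Return (text to index, links to follow) for one page."""
    links = HREF.findall(html)          # links come from the whole page
    indexable = NOINDEX.sub(" ", html)  # excluded blocks are dropped
    return indexable, links


if __name__ == "__main__":
    page = (
        "<p>Body</p>"
        '<! sphider_noindex ><a href="/menu.html">Menu</a><! /sphider_noindex >'
        "<p>More</p>"
    )
    print(split_page(page))
```

Collecting the links before removing the excluded blocks is what keeps them available for crawling.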
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
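The list-file convention (plain values, or a regexp introduced by */ and ended with another slash) could be interpreted as in the sketch below. Several points are assumptions: the spaces around the pattern in the example are treated as insignificant, plain entries are compared literally, and a div id has to match the whole pattern. compile_entry and id_is_listed are hypothetical helpers.

```python
import re


def compile_entry(entry):
    """Turn one list-file line into a compiled pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the closing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the text literally (assumption).
    return re.compile(re.escape(entry))


def id_is_listed(div_id, entries):
    """True if the div id matches any non-empty entry completely."""
    return any(
        compile_entry(entry).fullmatch(div_id)
        for entry in entries
        if entry.strip()
    )


if __name__ == "__main__":
    entries = ["footer", "*/ menu[0-5] /"]
    for candidate in ("menu3", "menu7", "footer"):
        print(candidate, id_is_listed(candidate, entries))
```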
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended for use with the new HTML5 elements such as section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
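Resolving duplicates to a canonical URL presupposes that the preferred address can be read from each page. The sketch below assumes the page declares it with a link element carrying rel="canonical" in its head, which is the common convention; how Sphider-plus itself reads the value is not shown in this excerpt. CanonicalFinder and canonical_url are hypothetical names.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(html):
    """Return the canonical URL declared in the page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


if __name__ == "__main__":
    page = (
        "<html><head>"
        '<link rel="canonical" '
        'href="http://www.example.com/product.php?item=swedish-fish">'
        "</head><body></body></html>"
    )
    print(canonical_url(page))
```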
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
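Converting text from one of the listed charsets into UTF-8 amounts to decoding the raw bytes with the source charset and encoding the result again. The sketch below uses Python's built-in codecs as a stand-in for the ConvertCharset function named above. Not every name in the list has a Python codec, so availability is checked first. to_utf8 and is_supported are hypothetical helpers.

```python
import codecs


def is_supported(charset):
    """True if Python ships a codec under this charset name."""
    try:
        codecs.lookup(charset)
    except LookupError:
        return False
    return True


def to_utf8(raw, charset):
    """Decode bytes from the given charset and re-encode them as UTF-8."""
    if not is_supported(charset):
        raise ValueError(f"no codec available for {charset!r}")
    return raw.decode(charset).encode("utf-8")


if __name__ == "__main__":
    cyrillic = "Привет".encode("windows-1251")
    print(to_utf8(cyrillic, "windows-1251").decode("utf-8"))
```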
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not shown during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in the Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
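A client consuming the XML output can read each media_result element into a simple mapping. The sketch below uses only the element names visible in this excerpt. The surrounding root element (called results here) and the sample document as a whole are assumptions, because the real output is shown only in part.

```python
import xml.etree.ElementTree as ET

# Sample built from the fragments shown above; the <results> root element
# is an assumption made only so that the document is well-formed.
SAMPLE = """<results>
  <media_result>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
  <media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
  </media_result>
</results>"""


def media_results(xml_text):
    """Return one dict (child tag -> text) per media_result element."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in result}
        for result in root.iter("media_result")
    ]


if __name__ == "__main__":
    for entry in media_results(SAMPLE):
        print(entry)
```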
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
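The JSON output can be processed in the same way. The sketch below parses a shortened sample that keeps the field names and values shown above (the fulltxt extracts are left elided) and converts page_size into a number. Reading "5,661.6kb" as 5661.6 kB, with the comma as thousands separator, is an assumption. summarize and page_size_kb are hypothetical helpers.

```python
import json

# Shortened sample using the field names from the output above; the
# fulltxt values are left elided exactly as in the excerpt.
SAMPLE = r"""{"query":"colour","total_results":2,"num_of_results":2,
"from":1,"to":2,"text_results":[
{"num":1,"weight":"100.0",
 "url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
 "title":" Info_eng","fulltxt":" . . . ","page_size":"5,661.6kb",
 "domain_name":"www.english.le-piaggie.info"},
{"num":2,"weight":"50.0",
 "url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
 "title":" Description","fulltxt":" . . . ","page_size":"26.4 kb",
 "domain_name":"www.english.le-piaggie.info"}]}"""


def page_size_kb(text):
    """Convert strings like '5,661.6kb' or '26.4 kb' to a float (kB)."""
    number = text.lower().replace("kb", "").replace(",", "").strip()
    return float(number)


def summarize(payload):
    """Return (num, weight, url, size in kB) for every text result."""
    data = json.loads(payload)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], page_size_kb(hit["page_size"]))
        for hit in data["text_results"]
    ]


if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

The escaped slashes in the URLs are valid JSON and are turned back into plain slashes by the parser.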
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be assigned a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
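The log file naming scheme described above can be sketched as a small parser. This is only an illustration in Python, not part of Sphider-plus: the function name `parse_log_name` and the returned keys are hypothetical, and the date is assumed to be in YYMMDD order, as suggested by the example "100524 - Date (May 24, 2010)".

```python
import re
from datetime import datetime

# Pattern for names like db2_100524-21.47.56_1.html or db2_100524-21.47.56_ID1.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>[^.]+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, time stamp and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised log file name: " + name)
    # Assumption: the six-digit date is YYMMDD and the time is HH.MM.SS
    stamp = datetime.strptime(
        match["date"] + " " + match["time"], "%y%m%d %H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "time_stamp": stamp,
        "thread": match["thread"],  # thread number or the ID given on the command line
    }
```

The thread suffix is kept as a string, because the IDs can be chosen freely within the limits of the file system.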
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
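Starting several threads with individual IDs, as described above, amounts to building one command line per ID. The following Python sketch only assembles the argument lists; the helper name `build_thread_commands` is hypothetical, and the only option names used are those shown in this documentation (such as -new1 and -erased1).

```python
def build_thread_commands(option, ids, script="spider.php"):
    """Return one 'php spider.php -<option><id>' argument list per thread ID."""
    return [["php", script, "-" + option + str(thread_id)] for thread_id in ids]
```

Each returned list could then be handed to a process launcher of your choice, one process per thread. How many threads are sensible depends on the server and is not addressed by this sketch.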
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in the same part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
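The effect of the sphider_noindex markers from chapter 4.4 can be illustrated with a short Python sketch. It is not the Sphider-plus implementation. The marker strings are copied as they are printed in this documentation and are passed as parameters, so they can be adapted to whatever the pages really contain; the function name `strip_noindex` is hypothetical.

```python
import re

# Marker strings exactly as printed in this documentation (adapt if your pages differ)
OPEN_TAG = "<! sphider_noindex >"
CLOSE_TAG = "<! /sphider_noindex >"

def strip_noindex(html, open_tag=OPEN_TAG, close_tag=CLOSE_TAG):
    """Remove every region enclosed by the two markers before the text is indexed."""
    pattern = re.compile(
        re.escape(open_tag) + r".*?" + re.escape(close_tag),
        re.DOTALL,  # regions may span several lines
    )
    return pattern.sub("", html)
```

Because links inside the excluded region are still followed, link extraction would have to run on the original page, and only the text handed to the indexer would be stripped this way.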
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
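The "*/ ... /" notation used in the list files of chapters 4.5 to 4.7 can be sketched as follows. This Python fragment is an illustration under assumptions, not the Sphider-plus code: it assumes that surrounding blanks are insignificant, that plain entries are compared literally, and that a pattern has to match the whole name. The function names are hypothetical.

```python
import re

def parse_list_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Entry is a regexp: drop the leading '*/' and the closing slash
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the text literally
    return re.compile(re.escape(entry))

def entry_matches(entry, name):
    """Check whether a div id or element name is covered by the entry."""
    return parse_list_entry(entry).fullmatch(name) is not None
```

With these assumptions the entry */nav[0-5]/ covers nav0 to nav5, while a plain entry nav does not cover NAV, in line with the case-sensitive processing mentioned above.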
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
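What converting text from one of the listed charsets into UTF-8 means can be shown with a minimal Python sketch. It uses Python's built-in codecs and is not the ConvertCharset function mentioned above; the function name `convert_to_utf8` is hypothetical, and not every charset in the list is necessarily known to Python.

```python
def convert_to_utf8(raw_bytes, source_charset):
    """Decode bytes from the given charset and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_charset)  # raises LookupError for unknown charsets
    return text.encode("utf-8")
```

For example, the single byte 0xE9 in iso-8859-1 becomes the two-byte UTF-8 sequence 0xC3 0xA9.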
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
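Reading such an XML media result on the client side might look like the following Python sketch. Only the element names visible in the excerpt above are taken from this documentation. The wrapping <results> element and the values of <num> and <type> in the sample are assumptions made for illustration, and `read_media_results` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Sample built from the excerpt above; <results>, <num>1</num> and
# <type>image</type> are assumed for illustration only.
SAMPLE = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""

def read_media_results(xml_text):
    """Return one dict per <media_result>, mapping child tag names to their text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in result}
        for result in root.iter("media_result")
    ]
```

All values are returned as strings, so sizes such as x_size need to be converted with int() before they are used in calculations.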
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
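A client consuming the JSON output shown above could process it as in the following Python sketch. The sample is an abbreviated version of the example above: the field names are taken from it, while the long 'fulltxt' extracts and some header fields are left out. The helper name `list_hits` is hypothetical.

```python
import json

# Abbreviated sample using the field names of the example above
SAMPLE = r"""{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
  "title":" Info_eng","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
  "title":" Description","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]}"""

def list_hits(json_text):
    """Return (num, weight, url) for every entry in 'text_results'."""
    data = json.loads(json_text)
    return [
        (hit["num"], float(hit["weight"]), hit["url"])  # weight arrives as a string
        for hit in data["text_results"]
    ]
```

The escaped slashes (\/) in the URLs are valid JSON and are turned into plain slashes by the parser.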
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
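The behaviour of the sphider_noindex tags described in chapter 4.4 (text between the tags is skipped, links are still followed) can be illustrated with a short Python sketch. It is not the code used by Sphider-plus. The function name split_page is hypothetical, and the markers are matched loosely because the exact spelling, including an HTML comment form, is an assumption.

```python
import re

# The markers are matched loosely so that both the spelling printed in this
# documentation and an HTML comment form (<!--sphider_noindex-->) are caught.
# The comment form is an assumption of this sketch.
NOINDEX_BLOCK = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>(.*?)<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.DOTALL | re.IGNORECASE,
)
HREF = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)

def split_page(html):
    """Return (text_to_index, links_to_follow) for one page."""
    links = HREF.findall(html)                # links are collected from the whole page
    indexable = NOINDEX_BLOCK.sub(" ", html)  # excluded blocks never reach the index
    return indexable, links
```

The links are collected before the excluded blocks are removed, so a menu wrapped in the tags still leads the spider to the linked pages.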
a regexp pattern. The regexp must be introduced by */ and terminated with another slash.
Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>

If enabled in the Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp must be introduced by */ and terminated with another slash.
Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>

This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and terminated with another slash.
Example: */nav[0-5]/

Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>

This is the vice versa . . .
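The notation used in the list-files of chapters 4.5 to 4.7 (plain names, or a regexp introduced by */ and closed by a slash) can be read as in the following Python sketch. The function name parse_list_entry is hypothetical. Treating the blanks around the pattern as insignificant and requiring the whole name to match are assumptions of this example, not confirmed behaviour of Sphider-plus.

```python
import re

def parse_list_entry(entry):
    """Turn one list-file line into a predicate over id / element names."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the closing "/".
        # Surrounding blanks, as in "*/ menu[0-5] /", are assumed insignificant.
        pattern = re.compile(entry[2:-1].strip())
        return lambda name: pattern.fullmatch(name) is not None
    # Plain entry: exact, case-sensitive comparison.
    return lambda name: name == entry
```

With the entry */nav[0-5]/ the names nav0 to nav5 are accepted, while nav7 and NAV3 are not, which mirrors the case-sensitive processing mentioned above.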
. . .
of all the duplicate content URLs:

http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678

and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
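How a canonical URL can be read from a page is sketched below in Python. The sketch assumes that the canonical URL is announced with the standard link tag carrying rel="canonical" in the page head. The class and function names are hypothetical, and this is not the implementation used by Sphider-plus.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attributes.get("href")

def canonical_url(html):
    """Return the canonical URL declared in the page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

An indexer could store a page under the returned URL and skip any duplicate that declares the same canonical address.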
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE Cyrillic (DOS))
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes

DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
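What such a conversion into UTF-8 does can be shown with a few lines of Python. The sketch uses Python's own codec registry instead of the ConvertCharset function, so it only illustrates the principle. The function name to_utf8 is hypothetical, and not every charset of the list is known to Python.

```python
def to_utf8(raw, charset):
    """Decode bytes in the declared charset and re-encode them as UTF-8."""
    # Python's codec registry stands in for ConvertCharset here. Names like
    # windows-1251, cp866, iso-8859-2 and koi8-r are known to it, others
    # from the list may not be and would raise LookupError.
    return raw.decode(charset).encode("utf-8")

# Six bytes in windows-1251 become twelve bytes in UTF-8.
cyrillic = to_utf8(b"\xcf\xf0\xe8\xe2\xe5\xf2", "windows-1251")
print(len(cyrillic))
```

The declared charset has to be correct, because the same bytes decode to different characters under, for example, windows-1251 and koi8-r.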
. . .
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, there was no printout available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.

As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words

Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing

Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail

In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move

This section of the Database Management presents only those databases which are correctly configured, assigned and have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link>
<title>warp.gif</title>
<x_size>635</x_size>
<y_size>98</y_size>
</media_result>
<media_result>
<num>2</num>
<type>audio</type>
<url>http://www.abc.de/index.php</url>
. . .
. . .
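A client that receives the XML result output could read the media results as sketched below in Python. Only the element names inside media_result are taken from the excerpt. The root element results, the sample document and the function name media_results are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the element names inside <media_result> are taken
# from the excerpt. The <results> root element is an assumption of this sketch.
SAMPLE = """<results>
  <media_result>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""

def media_results(xml_text):
    """Yield one dict per <media_result>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    for node in root.iter("media_result"):
        yield {child.tag: (child.text or "").strip() for child in node}

for row in media_results(SAMPLE):
    print(row["title"], row["x_size"], row["y_size"])
```

All values arrive as text, so sizes such as x_size and y_size have to be converted with int() before they are used in calculations.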
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
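The JSON result output can be consumed with any JSON library. The following Python sketch uses a shortened, hypothetical response: the keys follow the excerpt above, while the fulltxt values are placeholders and the function name summarize is an assumption of this example.

```python
import json

# Hypothetical, shortened response. The keys follow the excerpt above,
# the "fulltxt" values are placeholders.
SAMPLE = r"""{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
  "title":" Info_eng","fulltxt":"...","page_size":"5,661.6kb"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
  "title":" Description","fulltxt":"...","page_size":"26.4 kb"}]}"""

def summarize(payload):
    """Return the query, the total hit count and a compact list of hits."""
    data = json.loads(payload)
    hits = [
        # "weight" is delivered as a string, so it is converted for sorting.
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]
    return data["query"], data["total_results"], hits
```

The escaped slashes in the URLs are valid JSON and are turned back into plain slashes by the parser.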
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of settings available for Sphider-plus, separated into different submenus like:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
- Search form
- Text result listing
- Media result listing
- Most popular queries
- etc.

Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file

/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details.

All text links, media links and thumbnails are actively linked.

As stated in . . .
. . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixels - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases which are correctly configured, assigned, and have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
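A client could read the XML media results with a few lines of Python, sketched below. The element names inside `<media_result>` are taken from the excerpt above; the enclosing `<results>` root element and the `type` and `url` values of the first entry are assumptions made only to obtain a complete sample, and `media_results` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Sample document: the <results> wrapper and the first <type>/<url>
# values are assumptions; the other fields follow the excerpt.
SAMPLE = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <url>http://www.abc.de/index.php</url>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""


def media_results(xml_text):
    """Return one dict per <media_result>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.iter("media_result"):
        rows.append({child.tag: (child.text or "").strip() for child in node})
    return rows
```

Because `iter("media_result")` searches the whole tree, the sketch does not depend on the actual name of the root element.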
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
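The JSON output shown above can be consumed as in the following Python sketch. The sample string is a shortened copy of the example (the "fulltxt", "page_size" and "domain_name" fields are left out for brevity), and `summarize` is a hypothetical helper name, not part of Sphider-plus.

```python
import json

# Shortened copy of the documented output; "\/" is a valid JSON escape for "/".
SAMPLE = (
    r'{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,'
    r'"text_results":['
    r'{"num":1,"weight":"100.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng"},'
    r'{"num":2,"weight":"50.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",'
    r'"title":" Description"}]}'
)


def summarize(json_text):
    """Return the query, the total hit count and one tuple per text result."""
    data = json.loads(json_text)
    hits = []
    for hit in data.get("text_results", []):
        # The weight is delivered as a string and converted to a number here.
        hits.append((hit["num"], float(hit["weight"]),
                     hit["title"].strip(), hit["url"]))
    return data["query"], data["total_results"], hits
```

Note that the weight arrives as a string and the titles carry a leading blank, so a client may want to normalize both as done here.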
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings provided for Sphider-plus, separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: To let customers integrate Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be given a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen according to personal requirements, but the file name limitations of the OS should be taken . . .
. . .
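The naming scheme of the thread log files can be decoded with a small Python sketch like the one below. It assumes the layout db<number>_<YYMMDD>-<HH.MM.SS>_<thread number or ID>.html, as read from the examples above; `parse_log_name` and the dictionary keys are hypothetical names.

```python
import re
from datetime import datetime

# db<number>_<YYMMDD>-<HH.MM.SS>_<thread number or ID>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})"
    r"_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, timestamp and thread."""
    match = LOG_NAME.match(name)
    if match is None:
        return None
    stamp = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "timestamp": stamp,
        "thread": match["thread"],
    }
```

A helper of this kind is useful for sorting or pruning the files in the log folder by database or by age.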
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
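The MD5-based change check mentioned above follows a simple principle, sketched here in Python. How Sphider-plus stores and compares the checksum internally is not shown in this document; the function names and the idea of passing the stored checksum as a string are assumptions for illustration.

```python
import hashlib


def md5sum(content):
    """Return the hex MD5 digest of the page content (bytes)."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(content, stored_md5):
    """A page is re-indexed if it is new or its checksum has changed."""
    return stored_md5 is None or md5sum(content) != stored_md5
```

Pages whose checksum is unchanged can be skipped, which is what makes the re-index procedure fast.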
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
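The behaviour of the noindex tags (text inside is skipped, links inside are still followed) can be sketched in Python as follows. This is an illustration of the rule, not the indexer's code; the patterns accept the tag spelling shown above as well as an HTML comment spelling with dashes, which is an assumption, and `split_page` is a hypothetical name.

```python
import re

# Matches an opening marker, everything up to the closing marker, and the
# closing marker itself. Optional dashes cover an HTML comment spelling.
NOINDEX = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>.*?<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.IGNORECASE | re.DOTALL,
)
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def split_page(html):
    """Return (indexable text, links); links are taken from the whole page."""
    links = HREF.findall(html)          # links inside the markers still count
    indexable = NOINDEX.sub(" ", html)  # text inside the markers is dropped
    text = " ".join(TAG.sub(" ", indexable).split())
    return text, links
```

A menu wrapped in the two tags therefore contributes its link targets to the crawl, but none of its words to the keyword index.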
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
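How a list-file entry with the */ ... / notation might be evaluated is sketched below in Python. Several points are assumptions and not confirmed by this document: that blanks around the pattern are ignored, that the pattern must match the complete id or element name, and that plain entries are compared literally. The function names are hypothetical.

```python
import re


def compile_entry(entry):
    """Turn one list-file line into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the trailing "/".
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the text literally.
    return re.compile(re.escape(entry))


def matches_any(name, entries):
    """True if the name is matched completely by at least one entry."""
    return any(
        compile_entry(entry).fullmatch(name)
        for entry in entries
        if entry.strip()
    )
```

With the example entry */nav[0-5]/ this sketch accepts nav0 to nav5 but rejects Nav1, in line with the case-sensitive processing noted above.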
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable the customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
Search form
Text result listing
Media result listing
Most popular queries
etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details.

All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing

Each new URL added to the Admin backend can be assigned a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
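The naming scheme above can be decoded mechanically. A minimal Python sketch, assuming the date part is YYMMDD and the trailing part is the thread number (or a user-defined ID); `parse_log_name` is a hypothetical helper, not part of Sphider-plus:

```python
import re
from datetime import datetime

# db<number>_<YYMMDD>-<HH.MM.SS>_<thread>.html
LOG_NAME = re.compile(r"^db(\d+)_(\d{6}-\d{2}\.\d{2}\.\d{2})_(.+)\.html$")


def parse_log_name(file_name):
    """Split a log file name into (database number, start time, thread)."""
    match = LOG_NAME.match(file_name)
    if match is None:
        raise ValueError(f"not an index log file name: {file_name!r}")
    database = int(match.group(1))
    started = datetime.strptime(match.group(2), "%y%m%d-%H.%M.%S")
    return database, started, match.group(3)


print(parse_log_name("db2_100524-21.47.56_1.html"))
```

Sorting the parsed tuples by start time gives the order in which the threads were launched.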
. . .
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the name of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen freely, but the OS limitations on file names should be taken . . .
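The command lines for several threads can be generated instead of typed by hand. The Python sketch below only builds the argument lists; the helper name `build_index_commands` and the restriction of IDs to letters and digits are assumptions (a conservative reading of the file-name limitation), not Sphider-plus rules.

```python
import re

# Assumption: letters and digits are safe in file names on common systems.
SAFE_ID = re.compile(r"[A-Za-z0-9]+")


def build_index_commands(ids, option="new"):
    """Return one 'php spider.php -<option><ID>' argument list per ID."""
    commands = []
    for thread_id in ids:
        thread_id = str(thread_id)
        if not SAFE_ID.fullmatch(thread_id):
            raise ValueError(f"ID not safe for a log file name: {thread_id!r}")
        commands.append(["php", "spider.php", f"-{option}{thread_id}"])
    return commands


for command in build_index_commands([1, 2]):
    print(" ".join(command))
```

Each argument list could then be handed to `subprocess.Popen` from the Sphider-plus admin folder to start the threads in parallel.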
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page

Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
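The behavior described in section 4.4 (text between the noindex markers is skipped, links inside it are still followed) can be sketched in Python as below. This is an illustration only, not the Sphider-plus implementation: the pattern accepts the markers as printed here and, as an assumption, also an HTML-comment spelling with dashes; `split_for_indexing` and the simple href pattern are hypothetical.

```python
import re

# Matches '<! sphider_noindex > ... <! /sphider_noindex >' and, as an
# assumption, the comment form with dashes around the marker name.
NOINDEX = re.compile(
    r"<!-*\s*sphider_noindex\s*-*>.*?<!-*\s*/sphider_noindex\s*-*>",
    re.DOTALL | re.IGNORECASE,
)
HREF = re.compile(r'href="([^"]+)"')


def split_for_indexing(html):
    """Return (text to index, links to follow) for one page."""
    links = HREF.findall(html)          # links are taken from the whole page
    indexable = NOINDEX.sub(" ", html)  # marked parts are dropped from the text
    return indexable, links


page = (
    '<p>Body text</p><! sphider_noindex >'
    '<a href="/menu.html">Menu</a><! /sphider_noindex ><p>More</p>'
)
print(split_for_indexing(page))
```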
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>

This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/
Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>

This is the vice versa . . .
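The list-file notation used in sections 4.5 to 4.7 (plain names, or a regexp introduced by */ and closed by a slash) could be interpreted as in the following Python sketch. The helper names are hypothetical, and two points are assumptions: blanks around the pattern are ignored, and a name must match the entry completely. Matching is case-sensitive, as stated for elements_not.txt.

```python
import re


def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # '*/ menu[0-5] /' -> regexp 'menu[0-5]'
        return re.compile(entry[2:-1].strip())
    # Anything else is treated as a literal name.
    return re.compile(re.escape(entry))


def is_listed(name, entries):
    """True if the id or element name matches any entry of the list."""
    return any(compile_entry(e).fullmatch(name) for e in entries)


entries = ["footer", "*/nav[0-5]/", "*/ menu[0-5] /"]
for name in ("nav3", "nav7", "menu0", "Footer"):
    print(name, is_listed(name, entries))
```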
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
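How a crawler can map duplicates to one canonical URL is sketched below in Python. The sketch assumes the canonical URL is announced with a `<link rel="canonical" href="...">` element in the page, which is the common convention; `CanonicalFinder` and `canonical_url` are hypothetical names, not Sphider-plus functions.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "canonical" and self.canonical is None:
            self.canonical = attributes.get("href")


def canonical_url(html, page_url):
    """Return the announced canonical URL, or the page's own URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url


head = ('<head><link rel="canonical" '
        'href="http://www.example.com/product.php?item=swedish-fish" /></head>')
duplicate = "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy"
print(canonical_url(head, duplicate))
```

Indexing under the returned URL means both duplicate addresses end up as a single entry.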
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE Cyrillic (DOS))
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037 cp424 cp500 cp856 cp875 cp1006 cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes
DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
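What such a conversion does can be illustrated in Python. The sketch uses Python's built-in codecs as a stand-in for the ConvertCharset function; `to_utf8` is a hypothetical helper, and Python does not necessarily know every charset name from the list above.

```python
def to_utf8(raw_bytes, source_charset):
    """Decode bytes from the page's charset and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_charset)
    return text.encode("utf-8")


# The same Cyrillic word as it would arrive from a windows-1251 page.
raw = "Привет".encode("windows-1251")
converted = to_utf8(raw, "windows-1251")
print(len(raw), len(converted), converted.decode("utf-8"))
```

The single-byte input grows to two bytes per Cyrillic letter in UTF-8, which is why the charset of a page has to be known before its text is stored.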
. . .
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, no printout was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words
Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing

Index of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input
'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixels
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail
In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be supplied with a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
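The naming scheme above can be illustrated with a short Python sketch. It is not part of Sphider-plus: the function name parse_log_name and the returned keys are assumptions made for illustration, the pattern is derived only from the example file names shown here, and the trailing item is assumed to be the thread number (or thread ID).

```python
import re
from datetime import datetime

# Pattern for names such as db2_100524-21.47.56_1.html.
# Hypothetical helper: derived only from the example names in this text.
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>\w+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, start time and thread label."""
    m = LOG_NAME.match(name)
    if m is None:
        return None  # name does not follow the assumed scheme
    # Date is read as YYMMDD, time as HH.MM.SS.
    started = datetime.strptime(m["date"] + " " + m["time"], "%y%m%d %H.%M.%S")
    return {"database": int(m["db"]), "started": started, "thread": m["thread"]}

if __name__ == "__main__":
    print(parse_log_name("db2_100524-21.47.56_1.html"))
```

The same pattern also accepts names that carry an individual ID instead of a plain thread number, for example db2_100524-21.48.12_ID2.html.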
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen freely, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
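The effect of the noindex tags can be sketched in a few lines of Python. This is an illustration only, not the Sphider-plus implementation: the function name strip_noindex is hypothetical, and tolerating extra whitespace and upper/lower case inside the tags is an assumption.

```python
import re

# Matches the opening tag, everything up to the next closing tag, and the closing tag.
# Whitespace tolerance and case-insensitivity are assumptions of this sketch.
NOINDEX = re.compile(
    r"<!\s*sphider_noindex\s*>.*?<!\s*/sphider_noindex\s*>",
    re.DOTALL | re.IGNORECASE,
)

def strip_noindex(html):
    """Return the page text with all noindex sections replaced by a blank."""
    return NOINDEX.sub(" ", html)

if __name__ == "__main__":
    page = (
        "<p>Keep</p>"
        "<! sphider_noindex ><ul><li>Menu</li></ul><! /sphider_noindex >"
        "<p>Body</p>"
    )
    print(strip_noindex(page))
```

Because links inside the excluded part are still followed, an indexer working this way would extract links from the original page and apply the stripping only to the text that goes into the keyword index.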
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements such as section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
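How a list-file entry with the */ ... / notation could be interpreted is sketched below in Python. This is a sketch under stated assumptions, not the Sphider-plus code (which is PHP): the helper names compile_entry and matches are hypothetical, blanks around the pattern are assumed to be ignored (as in the example */ menu[0-5] /), and a name is assumed to have to match the whole pattern.

```python
import re

def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    # Entries introduced by */ and ended by another slash are treated as regexps.
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        return re.compile(entry[2:-1].strip())
    # Every other entry is taken literally.
    return re.compile(re.escape(entry))

def matches(entry, name):
    """True if the id or element name is covered by the list-file entry."""
    return compile_entry(entry).fullmatch(name) is not None

if __name__ == "__main__":
    for candidate in ("nav0", "nav3", "nav7", "Nav3"):
        print(candidate, matches("*/nav[0-5]/", candidate))
```

With these assumptions, */nav[0-5]/ covers nav0 to nav5 but neither nav7 nor Nav3, which reflects the case-sensitive processing mentioned above.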
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
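What converting text from one of the listed charsets into UTF-8 means in practice can be shown with a small Python sketch. It uses Python's built-in codecs rather than the ConvertCharset function named above, and the function name to_utf8 is hypothetical. Not every name in the list (for example mazovia or gsm0338) is known to Python's codec registry.

```python
def to_utf8(raw, source_charset):
    """Decode bytes from the given source charset and re-encode them as UTF-8."""
    text = raw.decode(source_charset)  # raises LookupError for unknown charsets
    return text.encode("utf-8")

if __name__ == "__main__":
    # The same Cyrillic word stored in two legacy charsets from the list.
    word = "Привет"
    for charset in ("windows-1251", "koi8-r"):
        legacy_bytes = word.encode(charset)
        utf8_bytes = to_utf8(legacy_bytes, charset)
        print(charset, legacy_bytes, utf8_bytes.decode("utf-8"))
```

The legacy byte sequences differ between the two charsets, while the UTF-8 result is identical, which is why the source charset of a page has to be known before conversion.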
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, no printout was available during index / re-index because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
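Reading such an XML result with Python's standard library could look like the sketch below. The sample is assembled from the element names visible in the excerpt above; the wrapper element results, the values of num and type for the first entry, and the function name media_results are assumptions, because the enclosing structure is not shown here.

```python
import xml.etree.ElementTree as ET

# Sample built from the element names in the excerpt.
# The <results> wrapper and the first entry's num/type values are assumptions.
SAMPLE = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""

def media_results(xml_text):
    """Return one dict per media_result element, keyed by child element name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("media_result")
    ]

if __name__ == "__main__":
    for entry in media_results(SAMPLE):
        print(entry["title"], entry["x_size"], "x", entry["y_size"])
```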
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
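A consumer of the JSON output could process it as in the following Python sketch. The sample is shortened from the output shown above (the fulltxt extracts and some header fields are left out), and the function name summarize is hypothetical.

```python
import json

# Shortened copy of the JSON output shown above; fulltxt fields are omitted.
SAMPLE = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
  "title":" Info_eng","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
  "title":" Description","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}
]}'''

def summarize(payload):
    """Return (num, weight, url, title) for every text result in the payload."""
    data = json.loads(payload)
    return [
        # weight arrives as a string; title carries a leading blank.
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]

if __name__ == "__main__":
    for num, weight, url, title in summarize(SAMPLE):
        print(num, weight, title, url)
```

The escaped slashes (\/) in the url values are valid JSON and are turned back into plain slashes by the parser.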
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the names of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen to suit personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
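The log file naming scheme described above can be sketched as a small parser. This is only an illustration in Python, not part of Sphider-plus; the function name parse_log_name and the returned dictionary keys are assumptions made for this example. It accepts both the numbered form (_1.html) and the ID form (_ID1.html).

```python
import re
from datetime import datetime

# Matches names such as db2_100524-21.47.56_1.html or db2_100524-21.47.56_ID1.html.
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>.+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread label."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    # The date is written as YYMMDD, the time as HH.MM.SS.
    started = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }
```

For example, parse_log_name("db2_100524-21.48.12_ID2.html") reports database 2, a start time of May 24, 2010 at 21:48:12, and the thread label "ID2".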
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
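Starting several threads with individual IDs, as described above, could be scripted roughly as follows. This is a hedged sketch in Python: the helper names build_commands and start_threads are invented for this example, and it assumes that php and spider.php are reachable from the current working directory. Only the option forms shown in the text (-new1, -new2, -erased1, ...) are generated.

```python
import subprocess


def build_commands(option, ids, script="spider.php"):
    """Return one argument list per thread, e.g. ['php', 'spider.php', '-new1']."""
    return [["php", script, f"-{option}{thread_id}"] for thread_id in ids]


def start_threads(commands):
    """Start every command as a separate process and return the process handles."""
    return [subprocess.Popen(command) for command in commands]


if __name__ == "__main__":
    # Print the command lines only; call start_threads() to actually launch them.
    for command in build_commands("erased", [1, 2]):
        print(" ".join(command))
```

Each ID ends up in the name of the corresponding log file, so the IDs should only contain characters that are valid in file names on the target OS.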
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
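The effect of the noindex tags from section 4.4 can be illustrated with a short Python sketch. It is not the Sphider-plus implementation: the function name split_page is hypothetical, the link extraction is deliberately simplistic, and the regular expression accepts the tag spelling shown above as well as a variant written with dashes, which is an assumption.

```python
import re

# Everything between the opening and the closing noindex tag, across line breaks.
NOINDEX = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>.*?<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.DOTALL | re.IGNORECASE,
)
LINK = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def split_page(html):
    """Return (text to index, links to follow) for one page."""
    # Links are collected from the complete page, so links inside an excluded
    # block are still followed.
    links = LINK.findall(html)
    # The excluded blocks are removed before the text is handed to the indexer.
    indexable = NOINDEX.sub(" ", html)
    return indexable, links
```

Collecting the links before removing the excluded blocks mirrors the rule that excluded parts are not indexed while their links are still followed.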
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
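The list-file convention used in sections 4.5 to 4.8 (plain names, or a regexp introduced by */ and ended with a slash) can be sketched in Python as below. The function names are hypothetical, and two details are assumptions rather than documented behaviour: blanks around the pattern are ignored, and a name must match the whole pattern.

```python
import re


def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading */ and the trailing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the name literally.
    return re.compile(re.escape(entry))


def matches_any(name, entries):
    """True if the div id or element name matches at least one list entry."""
    return any(
        compile_entry(entry).fullmatch(name) for entry in entries if entry.strip()
    )
```

With the entries "footer" and "*/ menu[0-5] /", the names footer and menu3 match, while menu7 and Menu3 do not.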
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
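How an indexer might discover the canonical URL of a page can be sketched as follows. The sketch assumes that the canonical URL is declared with the standard link element carrying rel="canonical" in the page head; the class and function names are invented for this example and do not describe the Sphider-plus code.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Duplicate URLs that declare the same canonical URL could then be stored once, under that URL.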
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
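The conversion step itself, from one of the listed charsets into UTF-8, can be illustrated with Python's built-in codecs. This is a stand-in for the ConvertCharset function, not a description of it; the function name to_utf8 is an assumption, and some names in the list above may have no matching Python codec.

```python
def to_utf8(raw, charset):
    """Decode raw bytes from the given source charset and re-encode them as UTF-8."""
    # Bytes that are invalid in the source charset become U+FFFD instead of
    # aborting the conversion.
    text = raw.decode(charset, errors="replace")
    return text.encode("utf-8")
```

A page delivered as windows-1251, for example, would be passed through to_utf8(page_bytes, "windows-1251") before its words are stored.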
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, no output was available during index / re-index because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
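A client could read the media part of the XML result output roughly as shown below. This is a sketch under assumptions: the root element name results in the sample is invented because the surrounding structure is not shown here, the num and type values of the first result are made up for illustration, and the function name read_media_results is hypothetical. The element names inside media_result are taken from the excerpt above.

```python
import xml.etree.ElementTree as ET

# Sample built from the element names in the excerpt; the <results> wrapper is assumed.
SAMPLE_XML = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""


def read_media_results(xml_text):
    """Return one dictionary per <media_result> element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter("media_result"):
        results.append({child.tag: (child.text or "").strip() for child in node})
    return results
```

All values arrive as strings, so sizes such as x_size need an explicit int() conversion before they are used in calculations.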
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
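The JSON result output shown above can be consumed with any JSON parser. The following Python sketch uses a shortened sample that keeps only some of the keys from the example (query, total_results, text_results, num, weight, url, title); the function name summarize is an assumption made for this illustration.

```python
import json

# Shortened sample using key names from the output above.
SAMPLE_JSON = (
    r'{"query":"colour","total_results":2,"text_results":['
    r'{"num":1,"weight":"100.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng"},'
    r'{"num":2,"weight":"50.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",'
    r'"title":" Description"}]}'
)


def summarize(payload):
    """Return (num, weight, url, title) for every text result in the JSON output."""
    data = json.loads(payload)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]
```

The escaped slashes in the URLs are resolved by the JSON parser, and the weight is delivered as a string, so it is converted to a number here.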
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-214756ID1html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
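Resolving duplicates to a canonical URL can be illustrated with a short Python sketch. It assumes that the canonical URL is announced by a <link rel="canonical" href="..."> tag in the page head; the class CanonicalFinder and the function canonical_url are hypothetical names, not part of Sphider-plus.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            # Keep the first canonical link only.
            self.canonical = self.canonical or attr.get("href")


def canonical_url(html: str, fetched_url: str) -> str:
    """Return the announced canonical URL, or the fetched URL if none is given."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fetched_url


duplicate = "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy"
head = '<head><link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish"></head>'
resolved = canonical_url(head, duplicate)
```

Both duplicate URLs from the example would resolve to the same canonical address, so the page needs to be stored only once.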
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:
WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS
DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE Cyrillic (DOS))
cp869 - Greek2
MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman
ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16
MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes
DSP implementation for NeXT
stdenc
symbol
zdingbat
And specially for old Polish programs:
mazovia
This list is to be read . . .
. . .
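The principle of converting text from one of the listed charsets into UTF-8 can be shown with a few lines of Python. Note that this sketch uses Python's built-in codecs, not the ConvertCharset function, and that the helper name to_utf8 is an assumption made for illustration.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the declared charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")


# A Cyrillic word as it would arrive from a windows-1251 encoded page.
sample = "Привет".encode("windows-1251")
converted = to_utf8(sample, "windows-1251")
```

The same call works for other charsets known to Python, for example iso-8859-7 for Greek pages; a wrongly declared charset usually results in garbled text or a decode error.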
Access denied; you need the RELOAD privilege. . .
10. Enable real-time output of logging data
Up to version 1.5 of Sphider-plus, no output was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not shown during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words
Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of a word ending might be erased accidentally.
13. Media search for images, audio streams and videos
13.1 Media indexing
Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input
'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixels
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail
All tables contain active links for opening the media files. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.
16.5 Copy & Move
This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables, as described in chapter 'Definition and configuration'. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
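A consumer of the XML output might read the media results as sketched below. The element names inside <media_result> are taken from the excerpt above; the enclosing <results> root element and the values of <num> and <type> for the first result are assumptions, because that part of the output is not shown here.

```python
import xml.etree.ElementTree as ET

# Assumption: <results> is a hypothetical root element; only the children of
# <media_result> are taken from the documented output.
sample = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""


def media_results(xml_text: str) -> list[dict[str, str]]:
    """Return one dict per <media_result>, mapping child tag to its text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "") for child in item}
        for item in root.iter("media_result")
    ]


results = media_results(sample)
```

Numeric fields such as x_size and y_size arrive as text and need to be converted with int() before they are used for layout calculations.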
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]}
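The JSON output shown above can be consumed with any standard JSON parser. The following Python sketch uses an abbreviated copy of the example, in which the fulltxt fields and some header fields are left out for brevity; the variable names are assumptions made for illustration.

```python
import json

# Abbreviated copy of the documented output (fulltxt and some fields omitted).
sample = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","page_size":"5,661.6kb"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","page_size":"26.4 kb"}
]}'''

data = json.loads(sample)

# Weights are delivered as strings and titles carry a leading blank,
# so both are normalized here.
hits = [
    (r["num"], float(r["weight"]), r["url"], r["title"].strip())
    for r in data["text_results"]
]
```

The escaped slashes (\/) in the URLs are valid JSON and are turned back into plain slashes by the parser.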
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output
[ Documentation ]
1. Settings, customizing and statistics
If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is available for Sphider-plus, separated into different submenus like:
Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains
Categories:
- Add, edit, delete
- Create new subcategory under
Index:
- Basic indexing options
- Advanced options
Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database
Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights
Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize
Templates:
In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
- Search form
- Text result listing
- Media result listing
- Most popular queries
- etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css
Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details.
All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.
2.5 Preferred Re-indexing
Each new URL added to the Admin backend can be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the sub folder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
. . .
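Taking such a log file name apart again can be sketched in Python. The function name parse_log_name is hypothetical, and treating the last item before .html as the thread number (or thread ID) is an assumption derived from the examples, since the description of that item is not shown above.

```python
import re
from datetime import datetime

LOG_NAME = re.compile(
    r"db(?P<db>\d+)_(?P<stamp>\d{6}-\d{2}\.\d{2}\.\d{2})_(?P<thread>.+)\.html"
)


def parse_log_name(name: str) -> dict:
    """Split a log file name into database number, start time and thread."""
    m = LOG_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected log file name: {name}")
    return {
        "database": int(m.group("db")),
        # The date is written as YYMMDD, the time as HH.MM.SS.
        "started": datetime.strptime(m.group("stamp"), "%y%m%d-%H.%M.%S"),
        # Assumption: the trailing item identifies the thread.
        "thread": m.group("thread"),
    }


info = parse_log_name("db2_100524-21.47.56_1.html")
```

For the first example name this yields database 2, a start time of May 24, 2010 at 21:47:56, and thread 1.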
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen freely, but the limitations for file names with respect to the OS should be taken . . .
. . .
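Starting several indexer threads with individual IDs could be scripted as in the following Python sketch. The helper names spider_command and launch_threads are assumptions; the sketch only builds the command lines in the form shown above and starts processes when launch_threads is called from a directory that contains spider.php.

```python
import subprocess


def spider_command(mode: str, thread_id: str) -> list[str]:
    """Build one command line, e.g. php spider.php -new1."""
    return ["php", "spider.php", f"-{mode}{thread_id}"]


def launch_threads(mode: str, thread_ids: list[str]) -> list[subprocess.Popen]:
    """Start one indexer process per ID (requires php and spider.php)."""
    return [subprocess.Popen(spider_command(mode, tid)) for tid in thread_ids]


# Command lines for two threads indexing new sites.
commands = [spider_command("new", tid) for tid in ("1", "2")]
```

Since each ID becomes part of a log file name, it is safest to restrict IDs to letters and digits.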
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php . . .
. . .
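The MD5 based check for changed page content mentioned above works on a simple principle, sketched here in Python. The function names and the dictionary used as storage are assumptions for illustration; Sphider-plus keeps its checksums in database tables.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the MD5 checksum of the page content as a hex string."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(url: str, content: bytes, stored: dict[str, str]) -> bool:
    """True if the page is new or its checksum differs from the stored one."""
    return stored.get(url) != md5sum(content)


# Checksums remembered from the previous index run (hypothetical storage).
stored = {"http://www.example.com/": md5sum(b"old content")}
unchanged = needs_reindex("http://www.example.com/", b"old content", stored)
changed = needs_reindex("http://www.example.com/", b"new content", stored)
```

Pages with an unchanged checksum can be skipped, which is what makes the re-index procedure fast.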
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"56616kbdomainnamewwwenglishle-piaggieinfo}{num2weight500urlhttp\/\/wwwenglishle-piaggieinfo\/html\/descriptionhtmltitle","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus like:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
Search form
Text result listing
Media result listing
Most popular queries
etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details.

All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing

Each new URL added to the Admin backend can be supplied with a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
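The naming scheme above can be taken apart mechanically. The following is a minimal sketch in Python, not part of Sphider-plus; the function name `parse_log_name` and the returned keys are hypothetical, and the trailing item is simply treated as the thread identifier because the examples label the files as belonging to the first and second thread.

```python
import re
from datetime import datetime

# Pattern for names such as db2_100524-21.47.56_1.html
# (database number, date as YYMMDD, time as HH.MM.SS, thread identifier).
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})"
    r"_(?P<thread>\w+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, timestamp and thread id."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    stamp = datetime.strptime(
        match["date"] + " " + match["time"], "%y%m%d %H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "timestamp": stamp,
        "thread": match["thread"],
    }


if __name__ == "__main__":
    print(parse_log_name("db2_100524-21.47.56_1.html"))
```

Because the thread part is matched with `\w+`, the same pattern also accepts names that carry a user-defined ID, such as db2_100524-21.47.56_ID1.html.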
. . .
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be defined according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php . . .
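The MD5sum check mentioned above can be pictured as comparing a stored digest with the digest of the freshly fetched page. This is only an illustrative Python sketch of that idea, not the Sphider-plus implementation; `needs_reindex` and the in-memory dictionary `stored_sums` are hypothetical stand-ins for whatever the indexer keeps in its database tables.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the hexadecimal MD5 digest of the page content."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(url: str, content: bytes, stored_sums: dict) -> bool:
    """Report whether the page changed since its digest was last stored.

    The new digest is remembered, so an unchanged page is skipped next time.
    """
    new_sum = md5sum(content)
    if stored_sums.get(url) == new_sum:
        return False
    stored_sums[url] = new_sum
    return True


if __name__ == "__main__":
    sums = {}
    print(needs_reindex("http://www.example.com/", b"hello", sums))  # first visit
    print(needs_reindex("http://www.example.com/", b"hello", sums))  # unchanged
```

Keeping the digests while erasing the indexed words is what makes the fast path possible: an unchanged page costs one hash comparison instead of a full parse.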
. . .
4.4 Ignoring parts of a page

Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>

Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
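The rule from section 4.4 (text between the noindex tags is skipped, links inside it are still followed) can be sketched as follows. This is a hedged Python illustration, not Sphider-plus code; `split_for_indexing` is a hypothetical name, the tag spelling is taken from the text above, and the pattern additionally tolerates an HTML-comment style spelling as an assumption.

```python
import re

# Matches everything from the opening to the closing noindex tag.
# The optional "--" is an assumption that tolerates a comment-style spelling.
NOINDEX_BLOCK = re.compile(
    r"<!\s*(?:--)?\s*sphider_noindex\s*(?:--)?\s*>"
    r".*?"
    r"<!\s*(?:--)?\s*/sphider_noindex\s*(?:--)?\s*>",
    re.DOTALL | re.IGNORECASE,
)
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def split_for_indexing(html):
    """Return (text to index, links to follow) for one page."""
    links = HREF.findall(html)                # links come from the whole page
    indexable = NOINDEX_BLOCK.sub(" ", html)  # tagged parts are dropped
    return indexable, links


if __name__ == "__main__":
    page = (
        "<p>Body</p>"
        '<! sphider_noindex ><a href="/menu.html">Menu</a><! /sphider_noindex >'
        "<p>More</p>"
    )
    print(split_for_indexing(page))
```

Collecting the links before removing the tagged blocks is what keeps a menu out of the search results while its targets are still crawled.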
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash.
Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>

If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash.
Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>

This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must end with another slash.
Example: */nav[0-5]/
Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>

This is the vice versa . . .
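The list-file convention described above (an entry introduced by */ and closed by a slash is a regexp) could be interpreted along these lines. This Python sketch is an assumption about the matching rules, not the actual Sphider-plus logic: `compile_entry` and `is_listed` are hypothetical names, plain entries are assumed to match the whole name literally, and regexp entries are assumed to match the whole name as well.

```python
import re


def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading */ and the trailing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the text literally.
    return re.compile(re.escape(entry))


def is_listed(name, entries):
    """Check whether an element name or div id is covered by any entry."""
    return any(
        compile_entry(entry).fullmatch(name)
        for entry in entries
        if entry.strip()
    )


if __name__ == "__main__":
    entries = ["footer", "*/nav[0-5]/"]
    for name in ("nav3", "nav7", "footer", "Footer"):
        print(name, is_listed(name, entries))
```

No `re.IGNORECASE` flag is used, which mirrors the note that names in the list-file are processed case-sensitively.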
. . .
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL:
http://www.example.com/product.php?item=swedish-fish. The . . .
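How an indexer might resolve the duplicates above to one canonical URL can be sketched as follows. The sketch assumes the canonical URL is declared with the standard `<link rel="canonical" href="...">` element in the page head; it uses Python's standard `html.parser` module and is not the Sphider-plus implementation. `CanonicalFinder` and `canonical_url` are hypothetical names.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(html, page_url):
    """Return the declared canonical URL, or the page's own URL if none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url


if __name__ == "__main__":
    duplicate = (
        "http://www.example.com/product.php"
        "?item=swedish-fish&category=gummy-candy"
    )
    head = (
        '<html><head><link rel="canonical" '
        'href="http://www.example.com/product.php?item=swedish-fish">'
        "</head><body></body></html>"
    )
    print(canonical_url(head, duplicate))
```

Storing results under the returned URL rather than the fetched one is what collapses several duplicate addresses into a single index entry.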
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE Cyrillic (DOS))
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes
DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
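To illustrate what converting text from one of the listed charsets into UTF-8 means, here is a minimal Python sketch. It relies on Python's built-in codecs as a stand-in and does not call or reproduce the ConvertCharset function; `to_utf8` is a hypothetical helper, and not every name in the list above is known to Python (an unknown name raises LookupError).

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the source charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")


if __name__ == "__main__":
    # A Cyrillic word as it would be stored in a windows-1251 encoded page.
    cyrillic = "Привет".encode("windows-1251")
    print(to_utf8(cyrillic, "windows-1251").decode("utf-8"))

    # Byte 0xA4 is the euro sign in iso-8859-15.
    print(to_utf8(b"\xa4", "iso-8859-15").decode("utf-8"))
```

The same byte means different characters in different charsets, which is why the source charset has to be known before the conversion.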
. . .
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, no printout was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words

Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing

Index of media files is enabled by separated Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail

In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move

This section of the Database Management presents only those databases that are correctly configured and assigned and that have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
  <link>http://www.abc.de/images/warp.gif</link>
  <title>warp.gif</title>
  <x_size>635</x_size>
  <y_size>98</y_size>
</media_result>
<media_result>
  <num>2</num>
  <type>audio</type>
  <url>http://www.abc.de/index.php</url>
  . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]}
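A client consuming the JSON result output shown above could read it as follows. This is a minimal Python sketch using the standard `json` module; the sample string is an abridged copy of the output above (the ip, host_name, query_time, consumed and fulltxt fields are left out for brevity), and `summarize` is a hypothetical helper name.

```python
import json

# Abridged copy of the sample output; "\/" is a valid JSON escape for "/".
RAW = r'''
{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
 "text_results":[
  {"num":1,"weight":"100.0",
   "url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
   "title":" Info_eng","page_size":"5,661.6kb",
   "domain_name":"www.english.le-piaggie.info"},
  {"num":2,"weight":"50.0",
   "url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
   "title":" Description","page_size":"26.4 kb",
   "domain_name":"www.english.le-piaggie.info"}]}
'''


def summarize(raw):
    """Return the query, the total hit count and one tuple per text result."""
    data = json.loads(raw)
    rows = [
        (hit["num"], float(hit["weight"]), hit["title"].strip(), hit["url"])
        for hit in data.get("text_results", [])
    ]
    return data["query"], data["total_results"], rows


if __name__ == "__main__":
    query, total, rows = summarize(RAW)
    print(f"{total} results for {query!r}")
    for num, weight, title, url in rows:
        print(f"{num}. {title} [weight {weight}] {url}")
```

Note that the weight arrives as a string in the sample, so it is converted with `float` before any numeric comparison or sorting.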
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1-1 iso-8859-1-2 iso-8859-1-3 iso-8859-1-4 iso-8859-1-5 iso-8859-1-6 iso-8859-1-7 iso-8859-1-8 iso-8859-1-9 iso-8859-1-10 iso-8859-1-11 iso-8859-1-12 iso-8859-1-13 iso-8859-1-14 iso-8859-1-15 iso-8859-1-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-1-12 iso-8859-1-13 iso-8859-1-14 iso-8859-1-15 iso-8859-1-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859-1 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
- Search form
- Text result listing
- Media result listing
- Most popular queries
- etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate files
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holding lists of details.
All text links, media links and thumbnails are active links.
As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing
Each new URL added to the Admin backend can be supplied with a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
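The naming scheme of the log files can be decoded mechanically. The following is a minimal Python sketch, assuming the pattern db<number>_<YYMMDD>-<HH.MM.SS>_<thread>.html as inferred from the two example names; the function name parse_log_name is hypothetical and not part of Sphider-plus.

```python
import re
from datetime import datetime

# Pattern inferred from the example names, e.g. db2_100524-21.47.56_1.html
# (assumption: the last part is the thread number or a user-defined ID).
LOG_NAME = re.compile(r"^db(\d+)_(\d{6})-(\d{2}\.\d{2}\.\d{2})_(\w+)\.html$")

def parse_log_name(name):
    """Split a log file name into database number, start time and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name}")
    db, date, time, thread = match.groups()
    started = datetime.strptime(f"{date} {time}", "%y%m%d %H.%M.%S")
    return {"database": int(db), "started": started, "thread": thread}

info = parse_log_name("db2_100524-21.47.56_1.html")
print(info["database"], info["started"].isoformat(), info["thread"])
```

The same pattern also accepts names that carry an individual ID instead of a plain thread number.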
. . .
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
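Starting several indexer threads by hand can be tedious. The following Python sketch only illustrates the idea of building one command line per thread ID; the helper names build_commands and launch are hypothetical, and it assumes that php and spider.php are reachable from the current working directory.

```python
import subprocess

def build_commands(option, ids):
    """Return one argument list per thread, e.g. ['php', 'spider.php', '-new1']."""
    return [["php", "spider.php", f"-{option}{thread_id}"] for thread_id in ids]

def launch(commands):
    """Start every command as its own process and wait for all of them to finish."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]

commands = build_commands("new", [1, 2])
print(commands)
# launch(commands)  # only meaningful on a host where php and spider.php are available
```

The same helper could be used with the option name erased for multithreaded re-indexing.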
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php . . .
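The MD5-based change check can be illustrated as follows. This is a sketch only, assuming that the stored checksum is the hexadecimal MD5 digest of the page content; the function names are hypothetical and the actual Sphider-plus implementation may differ.

```python
import hashlib

def md5sum(content: bytes) -> str:
    """Hexadecimal MD5 digest of the raw page content."""
    return hashlib.md5(content).hexdigest()

def needs_reindex(content: bytes, stored_md5: str) -> bool:
    """A page only needs to be re-indexed when its checksum has changed."""
    return md5sum(content) != stored_md5

stored = md5sum(b"<html>old page</html>")
unchanged = needs_reindex(b"<html>old page</html>", stored)
changed = needs_reindex(b"<html>new page</html>", stored)
print(unchanged, changed)
```

Skipping unchanged pages is what makes the re-index procedure fast: only pages with a different checksum have to be parsed again.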
. . .
4.4 Ignoring parts of a page
Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>
Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added to the page itself. A more flexible method, . . .
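The effect of the sphider_noindex tags described in section 4.4 can be sketched in a few lines of Python. This is an illustration under assumptions, not the Sphider-plus code: the regular expressions tolerate both the spelling shown in this document and an HTML comment form with dashes, and the function name split_page is hypothetical.

```python
import re

# Assumption: opening and closing tags may be written with or without comment dashes.
NOINDEX = re.compile(
    r"<!-*\s*sphider_noindex\s*-*>.*?<!-*\s*/sphider_noindex\s*-*>",
    re.DOTALL | re.IGNORECASE,
)
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def split_page(html):
    """Return (text_to_index, links_to_follow) for one page."""
    links = HREF.findall(html)          # links are collected from the whole page
    indexable = NOINDEX.sub(" ", html)  # excluded regions are dropped before indexing
    return indexable, links

page = ('<p>Main text</p><! sphider_noindex ><ul><li><a href="/menu.html">Menu</a></li></ul>'
        '<! /sphider_noindex ><p>More text</p>')
indexable, links = split_page(page)
print(indexable)
print(links)
```

The link inside the excluded region is still reported, while the word "Menu" no longer reaches the indexed text.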
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>
If enabled in Admin settings, the values defined in the list file /include/common/divs_use.txt will be used to index only the content between <div . . .
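The regexp notation for list-file entries (introduced by */ and closed by a slash) could be interpreted along these lines. This Python sketch rests on several assumptions: blanks around the pattern are ignored, a regexp matches if it is found anywhere in the id, and plain entries must match the id exactly. The function name entry_to_matcher is hypothetical.

```python
import re

def entry_to_matcher(entry):
    """Turn one line of a list file into a function that tests a div id."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: strip the leading */ and the closing slash.
        pattern = re.compile(entry[2:-1].strip())
        return lambda div_id: pattern.search(div_id) is not None
    # Plain entry: assumed to require an exact match.
    return lambda div_id: div_id == entry

is_menu = entry_to_matcher("*/ menu[0-5] /")
is_footer = entry_to_matcher("footer")
print(is_menu("menu3"), is_menu("menu7"), is_footer("footer"))
```

With the example entry, the ids menu0 to menu5 are matched, whereas menu7 is not.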
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>
This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/
Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>
This is the vice versa . . .
. . .
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL:
http://www.example.com/product.php?item=swedish-fish
The . . .
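How a canonical URL can be read from a duplicate page is sketched below in Python. The sketch assumes that the canonical URL is announced by a link element with rel="canonical" in the page head; the class and function names are hypothetical and do not describe the Sphider-plus implementation.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attributes.get("href")

def find_canonical(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

duplicate_page = (
    '<html><head><link rel="canonical" '
    'href="http://www.example.com/product.php?item=swedish-fish"></head>'
    "<body>Swedish fish</body></html>"
)
canonical = find_canonical(duplicate_page)
print(canonical)
```

A page without such an element yields None, in which case the page URL itself would be used.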
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE Cyrillic (DOS) )
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes
DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
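The principle of converting text from one of these charsets into UTF-8 can be shown with a short Python sketch. It uses Python's built-in codecs rather than the ConvertCharset function, so it is only an illustration of the idea; the function name to_utf8 is hypothetical, and not every charset of the list above is known to Python under the same name.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes from the source charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")

# The same kind of text stored in two different legacy charsets.
latin = to_utf8(b"caf\xe9", "windows-1252")
cyrillic = to_utf8(b"\xd0\xd2\xc9\xd7\xc5\xd4", "koi8-r")

print(latin.decode("utf-8"))
print(cyrillic.decode("utf-8"))
```

After conversion, both strings share one encoding and can be stored and compared in the same index.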
. . .
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data
Up to version 1.5 of Sphider-plus, no output was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words
Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos
13.1 Media indexing
Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input
'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixels
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail
In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move
This section of the Database Management will present only those databases that are correctly configured and assigned and that have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
  <link>http://www.abc.de/images/warp.gif</link>
  <title>warp.gif</title>
  <x_size>635</x_size>
  <y_size>98</y_size>
</media_result>
<media_result>
  <num>2</num>
  <type>audio</type>
  <url>http://www.abc.de/index.php</url>
  . . .
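A client could read such XML media results as sketched below in Python. Because the XML shown here is only an excerpt, the sample document is an assumption: the root element name results and the num, type and url values of the first entry are hypothetical, and only the element names inside media_result are taken from the excerpt.

```python
import xml.etree.ElementTree as ET

# Hypothetical complete document built around the element names of the excerpt.
SAMPLE = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <url>http://www.abc.de/index.php</url>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""

def read_media_results(xml_text):
    """Return one dictionary per <media_result> element."""
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter("media_result"):
        results.append({child.tag: (child.text or "").strip() for child in node})
    return results

media = read_media_results(SAMPLE)
print(media[0]["title"], media[0]["x_size"], media[0]["y_size"])
```

All values arrive as strings, so sizes have to be converted to integers before they are used in calculations.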
. . .
the following content:
{"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]}
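The JSON result output can be consumed by any JSON parser. The following Python sketch uses a shortened copy of the output shown above (the fulltxt values are cut down to placeholders); the function name summarize is hypothetical.

```python
import json

# Shortened copy of the JSON output; the raw string keeps the escaped slashes intact.
SAMPLE = r'''{"query":"colour","ip":"::1","host_name":"guard_007-hoster",
"query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,
"num_of_results":2,"from":1,"to":2,"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
 "title":" Info_eng","fulltxt":" . . . ","page_size":"5,661.6kb",
 "domain_name":"www.english.le-piaggie.info"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
 "title":" Description","fulltxt":" . . . ","page_size":"26.4 kb",
 "domain_name":"www.english.le-piaggie.info"}]}'''

def summarize(json_text):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(json_text)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]

hits = summarize(SAMPLE)
for hit in hits:
    print(hit)
```

Note that the weight is delivered as a string and the titles carry a leading blank, so both are normalized before use.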
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus like:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable the customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
Search form
Text result listing
Media result listing
Most popular queries
etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details.

All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing

Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
. . .
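As an illustration of the naming scheme above, the following minimal sketch splits a log file name into its parts. The regular expression and the function name parse_log_name are assumptions derived only from the two example names in the text, not code taken from Sphider-plus.

```python
import re
from datetime import datetime

# Pattern assumed from the example names in the text:
# db<number>_<YYMMDD>-<HH.MM.SS>_<thread>.html
LOG_NAME = re.compile(r"^db(\d+)_(\d{6})-(\d{2}\.\d{2}\.\d{2})_(\w+)\.html$")


def parse_log_name(name):
    """Split a log file name into database number, start time and thread tag."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    db, date, time, thread = match.groups()
    # 100524 is read as year-month-day, matching "May 24, 2010" in the text.
    started = datetime.strptime(f"{date}-{time}", "%y%m%d-%H.%M.%S")
    return {"database": int(db), "started": started, "thread": thread}


info = parse_log_name("db2_100524-21.47.56_1.html")
print(info["database"], info["started"].isoformat(), info["thread"])
```

The thread part is matched loosely, so names that carry a user-defined ID instead of a plain number should be accepted as well.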
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be defined according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
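Starting several threads by hand can also be scripted. The sketch below only builds the command lines in the form shown above (php spider.php -new1, php spider.php -new2) and, optionally, starts them as parallel processes. The helper names thread_commands and start_threads are hypothetical, and it is assumed that the script is run from the folder that contains spider.php.

```python
import subprocess


def thread_commands(option, ids, script="spider.php"):
    """Build one 'php spider.php -<option><ID>' command per thread ID."""
    return [["php", script, f"-{option}{thread_id}"] for thread_id in ids]


def start_threads(commands):
    """Start every command without waiting, then collect the exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


commands = thread_commands("new", [1, 2])
print(commands)
# start_threads(commands)  # run only on a host where php and spider.php exist
```

Each indexer thread runs as its own operating system process here, so the number of IDs should be chosen with the available server resources in mind.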
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php . . .
. . .
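The idea behind the MD5sum check mentioned above can be sketched as follows: a page is only processed again if the checksum of its current content differs from the stored one. This is a generic illustration; the function names are hypothetical and the way Sphider-plus stores and compares the checksum is not shown in the text.

```python
import hashlib


def md5sum(page_content: bytes) -> str:
    """Return the MD5 checksum of a page as a hex string."""
    return hashlib.md5(page_content).hexdigest()


def needs_reindex(stored_md5: str, page_content: bytes) -> bool:
    """A page needs re-indexing only if its checksum has changed."""
    return stored_md5 != md5sum(page_content)


stored = md5sum(b"<html>old content</html>")
print(needs_reindex(stored, b"<html>old content</html>"))
print(needs_reindex(stored, b"<html>new content</html>"))
```

Because unchanged pages are skipped after a cheap comparison, a re-index run spends its time only on pages whose content has actually changed.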
4.4 Ignoring parts of a page

Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>

Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
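The rule from section 4.4 (text between the two markers is skipped, while links inside it are still followed) can be sketched like this. The markers are written exactly as they appear in this documentation; the regular expressions and the function split_page are assumptions for illustration, not the parser used by Sphider-plus.

```python
import re

# Markers as written in this documentation; inner whitespace is treated as optional.
NOINDEX = re.compile(
    r"<!\s*sphider_noindex\s*>.*?<!\s*/sphider_noindex\s*>",
    re.DOTALL | re.IGNORECASE,
)
# Very simple link finder, sufficient for the illustration only.
LINK = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def split_page(html):
    """Return the indexable text and all links of a page."""
    links = LINK.findall(html)          # links are collected from the whole page
    indexable = NOINDEX.sub(" ", html)  # excluded parts are dropped before indexing
    return indexable, links


page = 'Intro <! sphider_noindex ><a href="/menu.html">Menu</a><! /sphider_noindex > Body'
text, links = split_page(page)
print(text)
print(links)
```

Collecting the links before removing the excluded parts is what keeps a menu out of the index while the pages it points to are still reached.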
a regexp pattern. The regexp must be introduced by */ and terminated by another slash.
Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>

If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp must be introduced by */ and terminated by another slash.
Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>

This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and terminated by another slash.
Example: */nav[0-5]/
Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>

This is the vice versa . . .
. . .
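The list-file convention used in sections 4.5 to 4.7 (an entry introduced by */ and terminated by another slash is a regexp) can be illustrated with a small matcher. Treating all other entries as literal names, ignoring spaces around the pattern and requiring the whole name to match are assumptions of this sketch; the text does not state how Sphider-plus handles these details.

```python
import re


def compile_entry(entry):
    """Turn one list-file line into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading '*/' and the closing '/'.
        return re.compile(entry[2:-1].strip())
    # Assumption: every other entry is a literal name.
    return re.compile(re.escape(entry))


def matches_any(name, entries):
    """True if the id or element name matches at least one entry completely."""
    return any(compile_entry(entry).fullmatch(name) for entry in entries)


entries = ["footer", "*/ menu[0-5] /"]
for candidate in ("menu3", "menu7", "footer", "Footer"):
    print(candidate, matches_any(candidate, entries))
```

With the example pattern from the text, menu0 to menu5 match, while menu7 and the differently cased Footer do not.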
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL:
http://www.example.com/product.php?item=swedish-fish
The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE Cyrillic (DOS))
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes
DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
. . .
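What such a conversion does can be shown with a few single bytes. The sketch below uses Python's built-in codecs as a stand-in for the ConvertCharset function, which is an assumption for illustration only; Python does not know every name in the list above (mazovia, for example), and the helper to_utf8 is hypothetical.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the given charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")


# The same letter has different byte values in different charsets,
# but always the same byte sequence after conversion to UTF-8.
samples = [
    (b"\xe4", "windows-1252"),  # a with diaeresis in Latin I
    (b"\x84", "cp437"),         # a with diaeresis in DOS Latin US
    (b"\xe1", "iso-8859-7"),    # Greek small letter alpha
]
for raw, charset in samples:
    print(charset, to_utf8(raw, charset))
```

The first two samples produce identical UTF-8 bytes, which is the reason a page must be converted with the charset it was actually written in.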
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words

Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing

Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail

In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move

This section of the Database Management presents only those databases that are correctly configured and assigned and that have a set of installed tables, as described in chapter 'Definition and configuration'. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
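A client can read such a JSON result with any standard JSON parser. The following sketch uses a shortened sample that keeps only some of the keys visible in the output above (the elided "fulltxt" extracts are left out); the function summarize is hypothetical and not part of Sphider-plus.

```python
import json

# Shortened sample modeled on the output shown above; not a complete response.
sample = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
  "title":" Info_eng","page_size":"5,661.6kb"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
  "title":" Description","page_size":"26.4 kb"}]}'''


def summarize(payload):
    """Return (num, weight, title, url) for every text result."""
    data = json.loads(payload)
    return [
        (hit["num"], float(hit["weight"]), hit["title"].strip(), hit["url"])
        for hit in data["text_results"]
    ]


for row in summarize(sample):
    print(row)
```

Note that the weight is delivered as a string and the escaped slashes in the URLs are resolved by the parser, so both are normalized here before further use.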
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, no printout was available during index / re-index because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results was tedious. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixels - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
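A client consuming this JSON output could be sketched in Python as follows. The payload is a shortened copy of the example above with the fulltxt fields left out; the function name summarize is hypothetical. Note that weight and page_size arrive as strings in the example, so the sketch converts weight explicitly.

```python
import json

# Shortened copy of the example result; "\/" is a valid JSON escape for "/".
payload = r"""{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
  "title":" Info_eng","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
  "title":" Description","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]}"""

def summarize(result_json):
    """Return the query, the total hit count and one tuple per text result."""
    data = json.loads(result_json)
    rows = []
    for hit in data.get("text_results", []):
        # weight is a string in the output, title carries a leading blank.
        rows.append((hit["num"], float(hit["weight"]), hit["title"].strip(), hit["url"]))
    return data["query"], data["total_results"], rows

query, total, rows = summarize(payload)
print(query, total)
for num, weight, title, url in rows:
    print(num, weight, title, url)
```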
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. A wide range of settings is provided for Sphider-plus, separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info, each item holding lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be supplied with a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/. The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
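The naming scheme above can be taken apart with a small Python sketch. The regular expression and the function name parse_log_name are assumptions derived only from the two example names; the date is read as year, month, day with two digits each, which matches the stated example 100524 = May 24, 2010. The thread part is kept as text so that other suffixes also parse.

```python
import re
from datetime import datetime

LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>.+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, start time and thread."""
    m = LOG_NAME.match(name)
    if m is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    # 100524 + 21.47.56 -> 2010-05-24 21:47:56
    started = datetime.strptime(m["date"] + m["time"], "%y%m%d%H.%M.%S")
    return {"database": int(m["db"]), "started": started, "thread": m["thread"]}

info = parse_log_name("db2_100524-21.47.56_1.html")
print(info["database"], info["started"].isoformat(), info["thread"])
```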
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
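Starting several indexer threads by hand can be replaced by a small launcher. The following Python sketch only assembles the command lines in the form shown above (option parameter plus ID) and offers a helper to start them in parallel. The function names are hypothetical, and the sketch assumes that php is on the PATH and that it is run from the folder holding spider.php.

```python
import subprocess

def build_commands(option, ids, script="spider.php"):
    """Return one argument list per thread, e.g. ['php', 'spider.php', '-erased1']."""
    return [["php", script, f"{option}{thread_id}"] for thread_id in ids]

def launch(commands):
    """Start every command without waiting, then collect all exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]

# Only print the command lines here; call launch(...) to actually run them.
for cmd in build_commands("-erased", [1, 2]):
    print(" ".join(cmd))
```

Starting all processes before waiting for any of them is what makes the threads run concurrently; waiting inside the first loop would serialize them.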
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
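The behaviour of the noindex markers from section 4.4 (marked text is not indexed, but links inside it are still followed) can be sketched in Python as follows. The pattern accepts the marker as printed above and, as an assumption, also an HTML comment form with dashes; the function name split_page and the simple href pattern are hypothetical simplifications.

```python
import re

# Matches an opening marker, everything up to the closing marker, and the
# closing marker itself. Optional dashes allow an HTML comment form.
NOINDEX_BLOCK = re.compile(
    r"<!-*\s*sphider_noindex\s*-*>.*?<!-*\s*/sphider_noindex\s*-*>",
    re.DOTALL,
)
HREF = re.compile(r'href="([^"]+)"')

def split_page(html):
    """Return (text to index, links to follow) for one page."""
    links = HREF.findall(html)                # links come from the whole page
    indexable = NOINDEX_BLOCK.sub(" ", html)  # marked parts are dropped
    return indexable, links

page = (
    "<p>Welcome</p>"
    '<!--sphider_noindex--><a href="/menu.html">Menu</a><!--/sphider_noindex-->'
    "<p>Body</p>"
)
indexable, links = split_page(page)
print(indexable)
print(links)
```

Collecting the links before removing the marked blocks is what keeps the menu link crawlable while its text stays out of the index.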
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
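A client consuming the JSON output shown above might process it as follows. This is a minimal Python sketch under stated assumptions: the sample is a shortened copy of the excerpt (the "fulltxt" extracts and some top-level fields are left out for brevity), and the helper name `summarize` is hypothetical.

```python
import json

# Shortened sample: field names and values are taken from the excerpt above.
SAMPLE = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description"}
]}'''


def summarize(json_text):
    """Return the query string and a list of (num, weight, title, url) tuples."""
    data = json.loads(json_text)
    rows = []
    for hit in data.get("text_results", []):
        # "weight" arrives as a string, and titles carry a leading blank.
        rows.append((hit["num"], float(hit["weight"]), hit["title"].strip(), hit["url"]))
    return data["query"], rows


query, rows = summarize(SAMPLE)
for num, weight, title, url in rows:
    print(f"{num}. {title} ({weight:.1f}%) {url}")
```

Note that the escaped slashes (`\/`) in the URLs are valid JSON and are decoded to plain slashes by the parser.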
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. A wide range of settings is provided for Sphider-plus, separated into different submenus such as: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (latest 100) - Index log offering: . . .
. . .
Country, Host name (latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds a list of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the names of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen freely, but the file name limitations of the OS should be taken . . .
. . .
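The naming scheme of the thread log files described above (database number, date, start time, thread number or ID) can be decoded by a small script, for example when sorting logs by thread. The following Python sketch is an illustration only. The function name `parse_log_name` is hypothetical, and it assumes that the date is always written as YYMMDD and that everything after the last underscore is the thread number or ID.

```python
import re
from datetime import datetime

# db<number>_<YYMMDD>-<HH.MM.SS>_<thread number or ID>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>.+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread ID."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    started = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }


first = parse_log_name("db2_100524-21.47.56_1.html")
second = parse_log_name("db2_100524-21.48.12_ID2.html")
print(first["database"], first["started"].isoformat(), first["thread"])
print(second["database"], second["started"].isoformat(), second["thread"])
```

Both forms from the text are accepted: the plain thread number (`_1`) and the user-defined ID (`_ID1`).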
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
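The MD5 check mentioned above follows a common pattern: a checksum of the page content is stored at index time, and on re-index the page is only processed again if the checksum differs. The Python sketch below illustrates that idea in general terms. How Sphider-plus stores and compares the checksum internally is not shown in this text, so the function names are hypothetical.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the hexadecimal MD5 checksum of the page content."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(stored_md5: str, content: bytes) -> bool:
    """A page needs re-indexing only if its checksum has changed."""
    return stored_md5 != md5sum(content)


# Checksum as it would have been stored during the first index run.
stored = md5sum(b"<html>old page</html>")

print(needs_reindex(stored, b"<html>old page</html>"))
print(needs_reindex(stored, b"<html>new page</html>"))
```

Because unchanged pages are skipped after a cheap checksum comparison, a re-index run spends its time only on pages whose content actually differs.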
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
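The effect of the sphider_noindex markers from chapter 4.4 on the indexed text can be pictured with a short Python sketch. It is an illustration, not the Sphider-plus implementation. As an assumption, the markers are matched loosely, so that both the spelling used in this text and an HTML comment spelling are accepted. The sketch covers only the removal of text; following the links inside the excluded part, as described above, is outside its scope.

```python
import re

# Assumption: optional blanks and dashes around the marker name are tolerated,
# so "<! sphider_noindex >" and "<!--sphider_noindex-->" both match.
NOINDEX = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>.*?<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Replace every marked block with a blank before the text is indexed."""
    return NOINDEX.sub(" ", html)


page = (
    "<p>Welcome</p>"
    "<! sphider_noindex ><ul><li>Menu</li></ul><! /sphider_noindex >"
    "<p>Content</p>"
)
cleaned = strip_noindex(page)
print(cleaned)
```

The non-greedy `.*?` ensures that two separate marked blocks on one page are removed individually, while the text between them is kept.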
a regexp pattern. The regexp must start with */ and end with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
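The list-file convention described above, with plain values and regexp entries written between */ and a closing slash, can be sketched in Python as follows. This is a hedged illustration: the function names are hypothetical, blanks around the pattern are ignored as an assumption based on the example `*/ menu[0-5] /`, and the pattern is required to match the whole id, which may differ from what Sphider-plus does.

```python
import re


def compile_entry(entry: str):
    """Turn one list-file entry into a compiled regular expression."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the trailing "/".
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the literal text only.
    return re.compile(re.escape(entry))


def matches_any(div_id: str, entries) -> bool:
    """True if the div id matches at least one non-empty list entry."""
    return any(
        compile_entry(entry).fullmatch(div_id)
        for entry in entries
        if entry.strip()
    )


entries = ["sidebar", "*/ menu[0-5] /"]
for div_id in ("menu3", "menu7", "sidebar", "content"):
    print(div_id, matches_any(div_id, entries))
```

The same parsing applies to the other list files mentioned in this chapter, because they share the */ ... / notation.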
a regexp pattern. The regexp must start with */ and end with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended for use with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must start with */ and end with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
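How an indexer can map duplicate URLs to one canonical URL may be sketched as follows. The Python example assumes that the canonical URL is declared with the standard `<link rel="canonical" href="...">` element in the page head. The class and function names are hypothetical, and the sketch does not claim to reproduce the Sphider-plus implementation.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(html: str, page_url: str) -> str:
    """Return the declared canonical URL, or the page URL if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url


HEAD = (
    '<html><head><link rel="canonical" '
    'href="http://www.example.com/product.php?item=swedish-fish"></head></html>'
)
duplicate = "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy"
print(canonical_url(HEAD, duplicate))
```

Indexing each page under the returned URL means that all duplicates with tracking or session parameters end up as a single entry.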
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
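The conversion of text from one of the listed charsets into UTF-8 can be illustrated with a few lines of Python. Note that this sketch uses Python's built-in codecs and not the ConvertCharset function named above. The helper name `to_utf8` is hypothetical, and not every charset in the list is necessarily known to Python under the same name.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the given source charset and re-encode them as UTF-8."""
    # Bytes that are invalid in the source charset become U+FFFD
    # instead of aborting the whole conversion.
    text = raw.decode(charset, errors="replace")
    return text.encode("utf-8")


# "Привет" encoded in windows-1251 (Cyrillic)
cyrillic = b"\xcf\xf0\xe8\xe2\xe5\xf2"
# "ä" encoded in iso-8859-1
latin = b"\xe4"

print(to_utf8(cyrillic, "windows-1251").decode("utf-8"))
print(to_utf8(latin, "iso-8859-1"))
```

The same byte sequence yields different characters depending on the charset it is read with, which is why the correct source charset must be known before conversion.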
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, no printout was available during index / re-index because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters at the end of a word might be erased by accident. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
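The naming scheme described above can be taken apart mechanically, for example when sorting or archiving the files in /admin/log/. The following is only a minimal sketch and not part of Sphider-plus: the function name parse_log_name and the IndexLogName tuple are hypothetical, the date is assumed to be YYMMDD (as the "May 24, 2010" example suggests), and the trailing field is kept as a string so that non-numeric thread suffixes also parse.

```python
import re
from datetime import datetime
from typing import NamedTuple


class IndexLogName(NamedTuple):
    """Hypothetical container for the parts of an index log file name."""
    database: int      # number taken from the "db2" prefix
    started: datetime  # date and time fields combined
    thread: str        # trailing thread field, e.g. "1"


# Pattern derived from the documented example db2_100524-21.47.56_1.html
_LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})"
    r"_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(filename: str) -> IndexLogName:
    """Split a log file name into database number, start time and thread."""
    match = _LOG_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an index log file name: {filename!r}")
    # Assumption: the six-digit date is YYMMDD, the time is HH.MM.SS.
    started = datetime.strptime(
        f"{match['date']} {match['time']}", "%y%m%d %H.%M.%S"
    )
    return IndexLogName(int(match["db"]), started, match["thread"])
```

Calling parse_log_name("db2_100524-21.47.56_1.html") yields database 2, a start time of May 24, 2010 at 21:47:56 and thread "1".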
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen freely, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in the same part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
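The entry convention for the div id list files (a plain id, or a regexp introduced by */ and closed by another slash) can be illustrated with a short sketch. This is not the Sphider-plus implementation: the function name entry_matches is hypothetical, and it is an assumption that surrounding blanks are ignored and that the pattern may match anywhere in the id.

```python
import re


def entry_matches(entry: str, div_id: str) -> bool:
    """Check one list-file entry against a div id.

    Entries of the form */pattern/ are treated as regular expressions,
    all other entries as literal ids.
    """
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Assumption: blanks around the pattern are not significant.
        pattern = entry[2:-1].strip()
        return re.search(pattern, div_id) is not None
    return entry == div_id
```

With the documented example entry */ menu[0-5] / this sketch accepts the id menu3 and rejects menu7, while a plain entry such as footer only matches the id footer.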
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
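A client that receives the JSON result output shown above could read it as sketched below. This is only an illustration under stated assumptions: the function name summarize_results is hypothetical, the sample string is a shortened copy of the documented output (the "fulltxt" field and some header fields are left out), and only the field names visible in that output are used.

```python
import json

# Shortened copy of the documented JSON result output.
SAMPLE = """
{"query": "colour", "total_results": 2, "text_results": [
  {"num": 1, "weight": "100.0",
   "url": "http://www.english.le-piaggie.info/Info_eng.pdf",
   "title": " Info_eng", "page_size": "5,661.6kb",
   "domain_name": "www.english.le-piaggie.info"},
  {"num": 2, "weight": "50.0",
   "url": "http://www.english.le-piaggie.info/html/description.html",
   "title": " Description", "page_size": "26.4 kb",
   "domain_name": "www.english.le-piaggie.info"}]}
"""


def summarize_results(raw: str) -> list[tuple[int, float, str, str]]:
    """Return (num, weight, title, url) for every text result."""
    data = json.loads(raw)
    return [
        # The weight arrives as a string and the title has a leading blank.
        (hit["num"], float(hit["weight"]), hit["title"].strip(), hit["url"])
        for hit in data.get("text_results", [])
    ]


if __name__ == "__main__":
    for num, weight, title, url in summarize_results(SAMPLE):
        print(f"{num}. {title} ({weight:.1f}%) {url}")
```

Note that "weight" is delivered as a string in the documented output, so it is converted to a number before any sorting or comparison.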
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
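The naming scheme of the log files can be illustrated with a small parser. This is only a sketch in Python (Sphider-plus itself is written in PHP); the function name parse_log_name and the returned dictionary keys are hypothetical, and the pattern merely assumes the layout shown in the example names above.

```python
import re
from datetime import datetime

# Matches names such as db2_100524-21.47.56_1.html:
# database number, date as YYMMDD, time as HH.MM.SS, thread label.
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread label."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    # 100524 is read as May 24, 2010; the time part uses dots as separators.
    started = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }


info = parse_log_name("db2_100524-21.48.12_2.html")
print(info["database"], info["started"].isoformat(), info["thread"])
```

The thread label is kept as a string, so the sketch does not depend on the label being purely numeric.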
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
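The calls shown above (php spider.php -new1, php spider.php -new2) follow a simple pattern: the option name with the individual ID appended. As a sketch only, the following Python helper builds such command lines for several threads without starting them. The helper name build_spider_commands is hypothetical and not part of Sphider-plus; only the option/ID pattern is taken from the examples above.

```python
def build_spider_commands(option, ids):
    """Return one argument list per thread, e.g. ['php', 'spider.php', '-new1']."""
    return [["php", "spider.php", f"-{option}{thread_id}"] for thread_id in ids]


# Build the two calls from the example above and print them as shell lines.
for argv in build_spider_commands("new", [1, 2]):
    print(" ".join(argv))
```

Each argument list could be handed to a process launcher of your choice; starting the processes is deliberately left out of the sketch.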
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
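The MD5 check mentioned above can be pictured as follows. This is an illustrative Python sketch, not the Sphider-plus implementation; the function names and the idea of passing the stored checksum as a hex string are assumptions made for the example.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the MD5 checksum of the page content as a hex string."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(stored_md5: str, content: bytes) -> bool:
    """A page only needs to be re-indexed if its checksum has changed."""
    return stored_md5 != md5sum(content)


old_page = b"<html><body>Hello</body></html>"
stored = md5sum(old_page)
print(needs_reindex(stored, old_page))                 # unchanged page
print(needs_reindex(stored, old_page + b"<!-- x -->"))  # modified page
```

Because only the checksums are compared, an unchanged page can be skipped without parsing its content again.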
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
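The behaviour of the noindex tags (content is skipped, links are still followed) can be sketched like this. The Python code is illustrative only: the function name split_page is hypothetical, links are collected with a deliberately simple pattern, and the tag pattern is an assumption that accepts both the form printed above and an HTML comment form such as <!--sphider_noindex-->.

```python
import re

# Tolerant pattern: matches "<! sphider_noindex >" as well as "<!--sphider_noindex-->".
NOINDEX = re.compile(
    r"<!-*\s*sphider_noindex\s*-*>.*?<!-*\s*/sphider_noindex\s*-*>",
    re.DOTALL,
)
HREF = re.compile(r'href="([^"]+)"')


def split_page(html):
    """Return (text to index, links to follow) for one page."""
    links = HREF.findall(html)         # links are taken from the whole page
    indexable = NOINDEX.sub("", html)  # the marked parts are dropped before indexing
    return indexable, links


page = (
    'Intro <!--sphider_noindex--><a href="/menu.html">Menu</a>'
    "<!--/sphider_noindex--> Body"
)
indexable, links = split_page(page)
print(repr(indexable))
print(links)
```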
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
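The convention for list-file entries (plain values, or a regexp introduced by */ and ended with another slash) can be sketched as follows. This Python example is an assumption about how such an entry might be evaluated; in particular, matching the whole id against the pattern and ignoring blanks around it are choices made for the sketch, and compile_entry is a hypothetical name.

```python
import re


def compile_entry(entry):
    """Turn one list-file line into a test function for div ids."""
    entry = entry.strip()
    # Entries like "*/ menu[0-5] /" carry a regexp between "*/" and the final "/".
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        pattern = re.compile(entry[2:-1].strip())
        return lambda div_id: pattern.fullmatch(div_id) is not None
    # Every other entry is compared literally.
    return lambda div_id: div_id == entry


is_menu = compile_entry("*/ menu[0-5] /")
print(is_menu("menu3"), is_menu("menu7"))
print(compile_entry("footer")("footer"))
```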
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
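How a canonical URL could be read from a page is sketched below. The example assumes that the canonical URL is announced by a link element with rel="canonical" in the page head; the class and function names are hypothetical, and the code is a Python illustration rather than the Sphider-plus implementation.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def canonical_url(html):
    """Return the canonical URL of a page, or None if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = (
    '<html><head><link rel="canonical" '
    'href="http://www.example.com/product.php?item=swedish-fish"></head></html>'
)
print(canonical_url(page))
```

All duplicate URLs that declare the same canonical URL could then be stored under that single address.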
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
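The conversion into UTF-8 can be illustrated with a few lines of Python. This is not the ConvertCharset function used by Sphider-plus, but a sketch based on Python's built-in codecs; the function name to_utf8 is hypothetical, and not every charset name of the list above is guaranteed to be known to Python.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the given source charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")


# The bytes below are a Russian greeting encoded as windows-1251.
cyrillic = b"\xcf\xf0\xe8\xe2\xe5\xf2"
converted = to_utf8(cyrillic, "windows-1251")
print(converted.decode("utf-8"))
```

An unknown charset name raises LookupError, and bytes that are invalid for the chosen charset raise UnicodeDecodeError, so both cases can be handled separately by the caller.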
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accident. 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
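Reading such an XML result on the client side might look like the following Python sketch. The sample is reduced to the elements visible in the excerpt above; the enclosing root element results and the function name media_results are assumptions, as the complete structure of the output is not shown here.

```python
import xml.etree.ElementTree as ET

# Reduced sample; the <results> wrapper is a hypothetical root element.
SAMPLE = """<results>
  <media_result>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
  <media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
  </media_result>
</results>"""


def media_results(xml_text):
    """Return one dictionary per <media_result>, mapping element name to text."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in item} for item in root.iter("media_result")]


for result in media_results(SAMPLE):
    print(result)
```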
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
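A client could evaluate the JSON output along these lines. The Python sketch uses a shortened copy of the example above (the fulltxt fields and some header fields are left out); the function name summarize is hypothetical.

```python
import json

# Shortened copy of the example output; "\/" is a valid JSON escape for "/".
SAMPLE = r"""{"query":"colour","total_results":2,"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
 "title":" Info_eng","page_size":"5,661.6kb"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
 "title":" Description","page_size":"26.4 kb"}]}"""


def summarize(payload):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(payload)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]


for row in summarize(SAMPLE):
    print(row)
```

The weight is delivered as a string in the example output, so the sketch converts it to a number before use.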
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2100524-2148122html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
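The duplicate-to-canonical mapping can be illustrated with a small Python sketch. It assumes the canonical URL is declared with the standard <link rel="canonical"> element in the page head; the class and function names are hypothetical and do not come from Sphider-plus.

```python
from html.parser import HTMLParser
from typing import Optional

class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")

def canonical_url(html: str, page_url: str) -> str:
    """Return the declared canonical URL, or the page's own URL if none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url

duplicate = "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy"
head = (
    '<head><link rel="canonical" '
    'href="http://www.example.com/product.php?item=swedish-fish" /></head>'
)
print(canonical_url(head, duplicate))
```

An indexer following this idea would store the page once, under the returned URL, no matter which duplicate it was reached by.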
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
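The principle of the conversion (decode the bytes with the charset declared for the page, then re-encode as UTF-8) is shown in the Python sketch below. It uses Python's built-in codecs rather than the ConvertCharset function, so it is only an analogy; several names from the list above (for example mazovia or gsm0338) have no built-in Python codec.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the declared charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")

# Sample byte strings in three of the listed charsets.
samples = {
    "windows-1251": "Привет".encode("windows-1251"),
    "iso-8859-2": "Łódź".encode("iso-8859-2"),
    "koi8-r": "Москва".encode("koi8-r"),
}
for charset, raw in samples.items():
    print(charset, to_utf8(raw, charset).decode("utf-8"))
```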
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixels - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases that are correctly configured, assigned and have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
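A client reading such XML output could collect the media results as in the following Python sketch. The element names are taken from the excerpt above; the <results> wrapper element, the values of <num> and <type> in the sample, and the function name are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Illustrative sample: element names follow the excerpt above, while the
# <results> wrapper and the values of <num> and <type> are assumptions.
SAMPLE = """
<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>
"""

def media_results(xml_text: str) -> list[dict]:
    """Return one dict per <media_result>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("media_result")
    ]

for item in media_results(SAMPLE):
    print(item["title"], item["x_size"], "x", item["y_size"])
```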
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
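A consumer of the JSON output might process it as in the Python sketch below. The sample is a shortened version of the output shown above, using only keys that appear there (the "fulltxt" extracts are left out); the function name is hypothetical.

```python
import json

# Shortened sample using keys visible in the output above; the "fulltxt"
# extracts are omitted for brevity.
SAMPLE = r"""
{"query": "colour", "total_results": 2, "num_of_results": 2,
 "text_results": [
   {"num": 1, "weight": "100.0",
    "url": "http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
    "title": " Info_eng", "domain_name": "www.english.le-piaggie.info"},
   {"num": 2, "weight": "50.0",
    "url": "http:\/\/www.english.le-piaggie.info\/html\/description.html",
    "title": " Description", "domain_name": "www.english.le-piaggie.info"}
 ]}
"""

def summarize(json_text: str) -> list[tuple[float, str, str]]:
    """Return (weight, title, url) for every text result, best match first."""
    data = json.loads(json_text)
    rows = [
        (float(r["weight"]), r["title"].strip(), r["url"])
        for r in data.get("text_results", [])
    ]
    return sorted(rows, key=lambda row: row[0], reverse=True)

for weight, title, url in summarize(SAMPLE):
    print(f"{weight:5.1f}  {title}  {url}")
```

Note that the weights arrive as strings and the escaped slashes (\/) in the URLs are resolved by the JSON parser, so no manual unescaping is needed.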
All required information. [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present, version 4.2024a, published February 12, 2024, is the current release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and phishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierarchical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. A wide range of settings is provided for Sphider-plus, separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/. The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
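The naming scheme above can be taken apart mechanically. The following is a minimal Python sketch, not part of Sphider-plus: the function name parse_log_name and the regular expression are assumptions derived only from the example file names shown here (database number, date as YYMMDD, time as HH.MM.SS, thread number or ID).

```python
import re
from datetime import datetime

# Pattern assumed from the example names in the text:
# db<database>_<YYMMDD>-<HH.MM.SS>_<thread number or ID>.html
LOG_NAME = re.compile(r"^db(\d+)_(\d{6})-(\d{2}\.\d{2}\.\d{2})_(\w+)\.html$")


def parse_log_name(name):
    """Split a log file name into database number, start time and thread tag."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised log file name: " + name)
    database, date_part, time_part, thread = match.groups()
    started = datetime.strptime(date_part + " " + time_part, "%y%m%d %H.%M.%S")
    return {"database": int(database), "started": started, "thread": thread}


if __name__ == "__main__":
    print(parse_log_name("db2_100524-21.47.56_1.html"))
```

Because the thread part is matched as a general word, names carrying an individual ID such as db2_100524-21.47.56_ID1.html are accepted as well.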
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen freely, but the OS limitations for file names should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
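The MD5sum check mentioned above can be illustrated with a short Python sketch. It only shows the general idea of comparing a stored checksum with a fresh one; the function names md5sum and needs_reindex are hypothetical, and how Sphider-plus stores and compares the sums internally is not described in this excerpt.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Hex MD5 digest of the fetched page content."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(stored_md5: str, content: bytes) -> bool:
    """True when the page content differs from what was indexed before."""
    return stored_md5 != md5sum(content)


if __name__ == "__main__":
    old_sum = md5sum(b"<html>old text</html>")
    print(needs_reindex(old_sum, b"<html>old text</html>"))
    print(needs_reindex(old_sum, b"<html>new text</html>"))
```

A page whose checksum is unchanged can be skipped, which is what makes the re-index run fast.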
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
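The list-file convention described above (a plain name, or a regexp introduced by */ and ended with another slash) could be interpreted roughly as follows. This is a hedged Python sketch, not the Sphider-plus implementation: the function name compile_entry is hypothetical, and matching the pattern against the whole name is an assumption. Case-sensitive comparison follows the note in the text.

```python
import re


def compile_entry(line):
    """Turn one list-file entry into a test function for element or div names."""
    entry = line.strip()
    # Regexp entries are introduced by */ and closed by another slash.
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        pattern = re.compile(entry[2:-1].strip())
        # Assumption: the pattern has to cover the complete name.
        return lambda name: pattern.fullmatch(name) is not None
    # Plain entries are compared literally and case-sensitively.
    return lambda name: name == entry


if __name__ == "__main__":
    is_nav = compile_entry("*/nav[0-5]/")
    print(is_nav("nav3"), is_nav("nav7"), is_nav("Nav3"))
```

Surrounding blanks inside the slashes, as in the example */ menu[0-5] /, are stripped before the pattern is compiled.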
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
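How a crawler can resolve such duplicates is sketched below in Python. The sketch assumes that the canonical URL is announced by a <link rel="canonical"> element in the page, which this excerpt does not show explicitly; the names CanonicalFinder and canonical_url are hypothetical and not taken from Sphider-plus.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(html, fetched_url):
    """Return the announced canonical URL, or the fetched URL if none is given."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fetched_url


if __name__ == "__main__":
    page = ('<html><head><link rel="canonical" '
            'href="http://www.example.com/product.php?item=swedish-fish">'
            '</head><body></body></html>')
    print(canonical_url(page, "http://www.example.com/product.php"
                              "?item=swedish-fish&trackingid=1234&sessionid=5678"))
```

Indexing under the returned URL means that every duplicate ends up as one entry.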
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And especially for old Polish programs: mazovia This list is to be read . . .
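The conversion step itself, taking text in one of the listed charsets and re-encoding it as UTF-8, can be illustrated in a few lines of Python. This sketch uses Python's built-in codecs and not the ConvertCharset function named above; the function name to_utf8 is hypothetical, and several names from the list (for example gsm0338 or mazovia) are not known to Python, in which case the sketch raises an error instead of guessing.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Re-encode bytes from the given source charset into UTF-8."""
    try:
        text = raw.decode(charset)
    except LookupError:
        # The charset name is not available in this Python installation.
        raise ValueError("unsupported charset: " + charset)
    return text.encode("utf-8")


if __name__ == "__main__":
    source = "Gr\u00fc\u00dfe".encode("iso-8859-1")   # sample text as ISO-8859-1 bytes
    print(to_utf8(source, "iso-8859-1"))
```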
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
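A client consuming this XML output might read the media results as in the following Python sketch. The element names media_result, num, type, link, title, x_size and y_size come from the excerpt above; the enclosing root element media_results, the num and type values of the first result and the function name read_media_results are assumptions made only to obtain a complete, parseable sample.

```python
import xml.etree.ElementTree as ET

# Sample rebuilt from the excerpt; root element, <num> and <type> of the
# first result are assumed for illustration.
SAMPLE = """<media_results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</media_results>"""


def read_media_results(xml_text):
    """Return one dict per <media_result>, with numeric fields converted to int."""
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter("media_result"):
        item = {child.tag: (child.text or "").strip() for child in node}
        for key in ("num", "x_size", "y_size"):
            if item.get(key):
                item[key] = int(item[key])
        results.append(item)
    return results


if __name__ == "__main__":
    for result in read_media_results(SAMPLE):
        print(result["title"], result["x_size"], result["y_size"])
```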
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
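Reading such a JSON result on the client side could look like the following Python sketch. The field names are those visible in the output above; the sample is shortened (the fulltxt and page_size fields are left out), and the function name summarize is hypothetical.

```python
import json

# Shortened sample using only fields visible in the output above.
SAMPLE = r'''{"query":"colour","total_results":2,"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","domain_name":"www.english.le-piaggie.info"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","domain_name":"www.english.le-piaggie.info"}]}'''


def summarize(payload):
    """Return (num, weight, url) for every text result in the JSON payload."""
    data = json.loads(payload)
    return [(r["num"], float(r["weight"]), r["url"]) for r in data["text_results"]]


if __name__ == "__main__":
    for num, weight, url in summarize(SAMPLE):
        print(num, weight, url)
```

The escaped slashes (\/) in the URLs are valid JSON and are turned back into plain slashes by the parser, and the weight arrives as a string, so it is converted to a number before use.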
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierarchical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of the database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
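The naming scheme above can be taken apart mechanically, for example when sorting log files by database or thread. The following Python sketch is an illustration only: the function name parse_log_name and the dictionary keys are assumptions, and the thread part is accepted as any suffix so that both numbered threads and threads started with individual IDs fit.

```python
import re
from datetime import datetime

# db<number>_<yymmdd>-<hh.mm.ss>_<thread>.html, as in db2_100524-21.47.56_1.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<stamp>\d{6}-\d{2}\.\d{2}\.\d{2})_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, timestamp and thread part."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized log file name: {name!r}")
    return {
        "database": int(match.group("db")),
        # The two-digit year 10 is read as 2010, matching the example above.
        "timestamp": datetime.strptime(match.group("stamp"), "%y%m%d-%H.%M.%S"),
        "thread": match.group("thread"),
    }


print(parse_log_name("db2_100524-21.47.56_1.html"))
```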
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
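The MD5 check mentioned above boils down to comparing a stored checksum with the checksum of the freshly fetched page. A minimal Python sketch of that idea follows; it is not the code used by Sphider-plus, and the function names md5sum and has_changed are hypothetical.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the hexadecimal MD5 checksum of the raw page content."""
    return hashlib.md5(content).hexdigest()


def has_changed(content: bytes, stored_md5: str) -> bool:
    """True if the page differs from the version whose checksum was stored."""
    return md5sum(content) != stored_md5


stored = md5sum(b"<html>first version</html>")
print(has_changed(b"<html>first version</html>", stored))
print(has_changed(b"<html>second version</html>", stored))
```

Only pages for which the comparison reports a change need to be parsed and indexed again, which is what makes the re-index procedure fast.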
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
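The effect of the noindex markers on the indexed text can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the Sphider-plus implementation: the pattern accepts the markers as printed above and also an HTML comment spelling with dashes, and the function name strip_noindex is hypothetical. Links inside the removed part would still have to be collected from the unmodified page, since they are followed.

```python
import re

# Opening marker, any content, closing marker. Blanks or dashes around the
# marker name are tolerated (an assumption about how the tags may be written).
NOINDEX = re.compile(
    r"<![\s-]*sphider_noindex[\s-]*>.*?<![\s-]*/sphider_noindex[\s-]*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Remove every marked block before the page text is indexed."""
    return NOINDEX.sub("", html)


page = (
    "<p>Intro</p>"
    "<! sphider_noindex ><ul><li>Menu</li></ul><! /sphider_noindex >"
    "<p>Body</p>"
)
print(strip_noindex(page))
```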
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
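How a list-file entry is told apart as plain name or regexp can be sketched as follows. This Python example is only an illustration under stated assumptions: entries starting with */ and ending with a slash are compiled as regexps, all other entries are compared literally, matching is case-sensitive and applies to the whole name, and the function names compile_entry and is_listed are hypothetical.

```python
import re


def compile_entry(entry: str):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading */ and the trailing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the name literally.
    return re.compile(re.escape(entry))


def is_listed(name: str, entries) -> bool:
    """True if the name matches any non-empty entry completely."""
    return any(compile_entry(e).fullmatch(name) for e in entries if e.strip())


entries = ["footer", "*/nav[0-5]/"]
for name in ("nav3", "nav7", "footer", "Footer"):
    print(name, is_listed(name, entries))
```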
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
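The conversion step itself, decoding text from a named source charset and re-encoding it as UTF-8, can be illustrated in Python. This sketch uses Python's built-in codecs as a stand-in for the ConvertCharset function named above; the function name to_utf8 is hypothetical, and not every charset in the list (for example mazovia or gsm0338) is available as a Python codec.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes from the given charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# The same word stored in two of the listed charsets.
word = "łódź"
for charset in ("windows-1250", "iso-8859-2"):
    legacy_bytes = word.encode(charset)
    print(charset, legacy_bytes, to_utf8(legacy_bytes, charset))
```

Both legacy byte sequences differ from each other, but convert to the same UTF-8 result, which is why the source charset must be known before conversion.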
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accident. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present, version 4.2024a, published February 12, 2024, is the current release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and phishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierarchical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of the database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accident. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present, version 4.2024a, published February 12, 2024, is the current release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and phishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierarchical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
To let customers integrate Sphider-plus into existing sites, HTML templates are prepared for:
- Search form
- Text result listing
- Media result listing
- Most popular queries
- etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate files:
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (latest 100)
- Index log offering: File-name, index date and delete option
- Sitemap log offering: sitemap.xml output; sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details.

All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing

Each new URL added to the Admin backend can be supplied with a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/. The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
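The naming scheme above can be taken apart mechanically. The following is only a minimal sketch in Python, not part of Sphider-plus: the function name parse_log_name and the regular expression are assumptions made for illustration, and the meaning of the trailing field is taken to be the thread number (or ID) as in the examples.

```python
import re
from datetime import datetime

# Assumed layout: db<number>_<YYMMDD>-<HH.MM.SS>_<thread>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})"
    r"_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, timestamp and thread label."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    timestamp = datetime.strptime(
        match["date"] + " " + match["time"], "%y%m%d %H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "timestamp": timestamp,
        "thread": match["thread"],
    }


info = parse_log_name("db2_100524-21.47.56_1.html")
print(info["database"], info["timestamp"].isoformat(), info["thread"])
```

The thread label is kept as text so that freely chosen IDs are accepted as well as plain numbers.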
. . .
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files, like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen to suit personal requirements, but the OS limitations for file names should be taken . . .
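Starting several indexer threads by hand can also be scripted. The sketch below is an illustration only: the helper names build_commands and run_threads are hypothetical, the option spelling (-new followed by the ID) is taken from the examples above, and it is assumed that the script is started from the folder containing spider.php. Each "thread" is launched here as a separate operating system process.

```python
import subprocess


def build_commands(option, ids, script="spider.php"):
    """Return one 'php spider.php -<option><id>' command per thread ID."""
    return [["php", script, f"-{option}{thread_id}"] for thread_id in ids]


def run_threads(commands):
    """Start all indexer processes, then wait for each one to finish."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


commands = build_commands("new", [1, 2])
for cmd in commands:
    print(" ".join(cmd))

# Uncomment on a host where PHP and spider.php are available:
# exit_codes = run_threads(commands)
```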
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter, like: php spider.php -erased1 php . . .
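The idea behind the MD5sum check can be sketched in a few lines. This is not the Sphider-plus implementation: the function names and the dictionary standing in for the stored checksums are assumptions for illustration. A page only needs to be re-indexed when its checksum differs from the stored one.

```python
import hashlib


def md5sum(content):
    """Return the hex MD5 digest of page content given as bytes."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(url, content, stored_sums):
    """True if the page is new or its digest differs from the stored one."""
    return stored_sums.get(url) != md5sum(content)


# Hypothetical store of checksums from the previous index run.
stored = {"http://www.example.com/": md5sum(b"<html>old</html>")}

print(needs_reindex("http://www.example.com/", b"<html>old</html>", stored))
print(needs_reindex("http://www.example.com/", b"<html>new</html>", stored))
```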
. . .
4.4 Ignoring parts of a page

Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>

Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
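For the noindex tags of section 4.4, the described behavior (content between the tags is skipped, links inside are still followed) can be sketched as follows. This is an illustration under assumptions, not the code used by Sphider-plus: the function name split_page is hypothetical, the link extraction is deliberately simplistic, and the pattern accepts the tags as printed above as well as an HTML comment spelling.

```python
import re

# Assumption: the markers may be written as printed in this text or as
# HTML comments (<!--sphider_noindex-->); the pattern tolerates both.
NOINDEX_BLOCK = re.compile(
    r"<!-*\s*sphider_noindex\s*-*>(.*?)<!-*\s*/sphider_noindex\s*-*>",
    re.DOTALL | re.IGNORECASE,
)
LINK = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def split_page(html):
    """Return (text_to_index, links_to_follow) for one page."""
    links = LINK.findall(html)                # links are collected everywhere
    indexable = NOINDEX_BLOCK.sub(" ", html)  # excluded blocks are dropped
    return indexable, links


page = (
    'Intro <! sphider_noindex ><a href="/menu.html">Menu</a>'
    "<! /sphider_noindex > Body"
)
indexable, links = split_page(page)
print(indexable)
print(links)
```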
. . .
a regexp pattern. The regexp must be introduced by */ and must end with another slash. Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>

If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp must be introduced by */ and must end with another slash. Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>

This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and must end with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>

This is the vice versa . . .
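How entries of such a list-file might be interpreted can be sketched as follows. This is a hedged illustration, not the Sphider-plus parser: the function names are hypothetical, and it is assumed that a plain entry must equal the element name exactly, that a regexp entry must match the whole name, and that blanks around the pattern are ignored.

```python
import re


def compile_entry(entry):
    """Turn one list-file line into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: keep what lies between '*/' and the closing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the name literally.
    return re.compile(re.escape(entry))


def is_listed(name, entries):
    """True if the element name fully matches any non-empty entry."""
    return any(
        compile_entry(entry).fullmatch(name)
        for entry in entries
        if entry.strip()
    )


entries = ["aside", "*/nav[0-5]/"]
for name in ["nav3", "nav7", "aside", "Aside"]:
    print(name, is_listed(name, entries))
```

No re.IGNORECASE flag is used, which mirrors the case-sensitive processing mentioned above.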
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes

DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
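What such a conversion does can be shown with a short sketch. It uses Python's built-in codecs rather than the ConvertCharset function, so it is only an analogy: the helper name to_utf8 is hypothetical, and not every charset name in the list above is known to Python (mazovia and gsm0338, for example, are not standard Python codecs).

```python
def to_utf8(raw, source_charset):
    """Decode bytes from the source charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# A word with non-ASCII letters, stored in a single-byte Windows charset.
raw = "Größe".encode("windows-1252")
converted = to_utf8(raw, "windows-1252")

print(len(raw), len(converted))
print(converted.decode("utf-8"))
```

The byte count grows because the two non-ASCII letters each need two bytes in UTF-8.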
. . .
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, no output was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words

Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters at the end of a word might be erased by accident.

13. Media search for images, audio streams and videos

13.1 Media indexing

Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail

All tables contain active links to open the media files. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move

This section of the Database Management presents only those databases that are correctly configured and assigned and that have a set of installed tables, as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
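A consumer of the XML output could read the media results roughly as sketched below. The sample is an assumption built only from the element names visible in the fragment above: the wrapping results element and the type value image of the first entry are invented for illustration and may differ from the real output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; only the child element names are taken
# from the fragment shown in this section.
SAMPLE = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""


def media_results(xml_text):
    """Return one dict per media_result element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in item}
        for item in root.iter("media_result")
    ]


for result in media_results(SAMPLE):
    print(result["title"], result["x_size"], "x", result["y_size"])
```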
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
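The JSON output can be consumed with any standard JSON parser. The sketch below is an illustration only: the sample is a shortened version of the output shown above (the long "fulltxt" extracts and several fields are left out), and the function name summarize is hypothetical.

```python
import json

# Shortened sample using keys that appear in the output above.
SAMPLE = (
    r'{"query":"colour","total_results":2,"text_results":['
    r'{"num":1,"weight":"100.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",'
    r'"title":" Info_eng"},'
    r'{"num":2,"weight":"50.0",'
    r'"url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",'
    r'"title":" Description"}]}'
)


def summarize(json_text):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(json_text)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]


for num, weight, url, title in summarize(SAMPLE):
    print(num, weight, title, url)
```

Note that the weight is delivered as a string and the URLs use escaped slashes; the parser turns the latter back into plain slashes.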
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
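The MD5sum check mentioned above can be illustrated as follows. This is a minimal Python sketch of the general idea only; how Sphider-plus stores and compares the checksum internally is not shown in this documentation, and the function names content_md5 and needs_reindex are hypothetical.

```python
import hashlib

def content_md5(page_bytes):
    """Return the MD5 checksum of the raw page content as a hex string."""
    return hashlib.md5(page_bytes).hexdigest()

def needs_reindex(page_bytes, stored_md5):
    """A page is re-indexed if no checksum is stored yet or the content changed."""
    return stored_md5 is None or content_md5(page_bytes) != stored_md5
```

Because unchanged pages are skipped after a single checksum comparison, a re-index run only has to do the full work for pages whose content actually differs.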
4.4 Ignoring parts of a page
Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>
Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
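The behaviour of the <! sphider_noindex > tags described in 4.4 (content is not indexed, links are still followed) can be sketched like this. It is a minimal Python illustration, not the Sphider-plus implementation; the exact spelling of the markers is an assumption taken from the text above, and the names NOINDEX_BLOCK, LINK_HREF and split_for_indexing are hypothetical.

```python
import re

# Marker spelling is an assumption: the text shows "<! sphider_noindex >";
# the optional "--" also accepts an HTML-comment spelling of the same marker.
NOINDEX_BLOCK = re.compile(
    r"<!(?:--)?\s*sphider_noindex\s*(?:--)?>.*?<!(?:--)?\s*/sphider_noindex\s*(?:--)?>",
    re.DOTALL | re.IGNORECASE,
)
LINK_HREF = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)

def split_for_indexing(html):
    """Return (text_to_index, links_to_follow) for one page."""
    links = LINK_HREF.findall(html)           # links are collected from the whole page
    indexable = NOINDEX_BLOCK.sub(" ", html)  # excluded blocks are dropped before indexing
    return indexable, links
```

For a page holding a menu between the two markers, the menu text is removed from the indexable part while the menu links remain in the list of links to follow.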
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.
Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>
If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
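How a list-file entry introduced by */ could be told apart from a plain entry is sketched below in Python. This is an illustration under assumptions, not the Sphider-plus code: stripping the blanks inside "*/ menu[0-5] /" and matching the complete id value are both assumptions, and the names compile_entry and is_listed are hypothetical.

```python
import re

def compile_entry(entry):
    """Turn one list-file line into a compiled pattern for id values."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the trailing "/".
        # Stripping the inner blanks is an assumption based on "*/ menu[0-5] /".
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the literal text only.
    return re.compile(re.escape(entry))

def is_listed(div_id, entries):
    """True if div_id fully matches any non-empty entry (full match is assumed)."""
    return any(compile_entry(e).fullmatch(div_id) for e in entries if e.strip())
```

With the entries "footer" and "*/ menu[0-5] /", the ids footer and menu0 to menu5 would be treated as listed, while menu7 would not.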
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.
Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>
This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash.
Example: */nav[0-5]/
Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>
This is the vice versa . . .
. . .
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL:
http://www.example.com/product.php?item=swedish-fish
The . . .
. . .
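On the indexer side, resolving duplicates to the canonical URL could look roughly like the following Python sketch. It assumes the page declares its canonical URL with a <link rel="canonical" href="..."> element in the head section, which this part of the text does not show; the names CanonicalFinder and canonical_url are hypothetical, and the parser is Python's standard library, not Sphider-plus code.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")

def canonical_url(html, fetched_url):
    """Return the declared canonical URL, or the fetched URL if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fetched_url
```

Both duplicate URLs from the example above would then be stored under the single canonical address, provided each page carries the link element.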
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes

DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
. . .
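The conversion into UTF-8 can be illustrated with a few lines of Python. This sketch uses Python's built-in codecs and not the ConvertCharset function named above, so it only demonstrates the principle; several names from the list (for example the x-mac and gsm0338 entries) may be unknown to Python, which is why a lookup helper is included. The names is_known and to_utf8 are hypothetical.

```python
import codecs

def is_known(charset):
    """True if Python ships a codec for the given charset name."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False

def to_utf8(raw, source_charset):
    """Decode bytes from the source charset and re-encode them as UTF-8."""
    text = raw.decode(source_charset, errors="replace")  # undecodable bytes become U+FFFD
    return text.encode("utf-8")
```

A page delivered as windows-1251 would be passed through to_utf8(raw_bytes, "windows-1251") before its words are stored.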
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data
Up to version 1.5 of Sphider-plus, there was no printout available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words
Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos
13.1 Media indexing
Index of media files is enabled by separated Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixels
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail

In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move
This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link>
<title>warp.gif</title>
<x_size>635</x_size>
<y_size>98</y_size>
</media_result>
<media_result>
<num>2</num>
<type>audio</type>
<url>http://www.abc.de/index.php</url>
. . .
. . .
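A client could read the media results from such an XML output as sketched below in Python. The sample document is assembled only from the elements visible in the excerpt above; the root element name results is a placeholder, since the real root element is not shown here, and the names SAMPLE_XML and media_results are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample built from the visible elements only; <results> is a placeholder root.
SAMPLE_XML = """<results>
  <media_result>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
  <media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
  </media_result>
</results>"""

def media_results(xml_text):
    """Return one dict per <media_result>, mapping child tag names to their text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("media_result")
    ]
```

Since every result is returned as a plain dictionary, elements that are missing in one result (like x_size for an audio stream) simply do not appear as keys.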
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]}
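The JSON output shown above can be consumed with any JSON library. The following Python sketch uses a shortened sample that keeps only some of the fields visible in the example (the fulltxt, page_size and domain_name fields are left out); the names SAMPLE_JSON and summarize are hypothetical.

```python
import json

# Shortened sample; field names and values are taken from the example above.
SAMPLE_JSON = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description"}]}'''

def summarize(payload):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(payload)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]
```

Note that the weight is delivered as a string and the slashes in the URLs are escaped; the JSON decoder restores the plain URL, while the weight has to be converted explicitly.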
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics
If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table . . .
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
Search form
Text result listing
Media result listing
Most popular queries
etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100)
- Index log offering: . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable the customer's integration of Sphider-plus into existing sites, HTML templates are prepared for
Search form
Text result listing
Media result listing
Most popular queries
etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- Sitemap log offering: sitemap.xml output, sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details.

All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing

Each new URL added in the Admin backend can be assigned a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/. The names of the log files look like:

db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)

and are built from the following items:

db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
. . .
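The naming scheme above can be taken apart mechanically. The following is a minimal sketch in Python (not part of Sphider-plus, which is written in PHP); the function name parse_log_name and the dictionary keys are hypothetical, and the pattern is inferred only from the example file names shown above.

```python
import re
from datetime import datetime

# Pattern inferred from the example names, e.g. db2_100524-21.47.56_1.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>\w+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, timestamp and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name}")
    # The date part is read as yymmdd and the time part as hh.mm.ss.
    timestamp = datetime.strptime(
        match["date"] + " " + match["time"], "%y%m%d %H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "timestamp": timestamp,
        "thread": match["thread"],
    }
```

Such a helper could be used, for instance, to sort the files in /admin/log/ by database and time before reviewing them.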
spider.php -new1
php spider.php -new2
etc.

The IDs will be added to the names of the corresponding log files, like:

db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)

IDs can be chosen according to personal requirements, but the file name limitations of the OS should be taken . . .
. . .
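Starting several threads by hand can also be scripted. The sketch below is an illustration in Python, assuming that php is on the PATH and that it is run from the folder containing spider.php; the helper names build_commands and launch are hypothetical. Only the option form -new followed by an ID is taken from the text above.

```python
import subprocess

def build_commands(option, ids, script="spider.php"):
    """Return one command line per thread, e.g. ['php', 'spider.php', '-new1']."""
    return [["php", script, f"-{option}{thread_id}"] for thread_id in ids]

def launch(commands):
    """Start every command as its own process and wait for all of them.

    Returns the exit codes in the same order as the commands.
    """
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]
```

Calling launch(build_commands("new", [1, 2])) would start two indexer processes in parallel and block until both have finished.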
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:

php spider.php -erased1
php . . .
. . .
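The idea behind the MD5sum check can be sketched as follows. This is an illustration in Python, not the Sphider-plus code: how the page content is normalised before hashing and where the checksum is stored are not stated here, so md5sum and needs_reindex are hypothetical helpers that work on the raw page bytes.

```python
import hashlib

def md5sum(content: bytes) -> str:
    """Return the hexadecimal MD5 checksum of the page content."""
    return hashlib.md5(content).hexdigest()

def needs_reindex(stored_checksum, content: bytes) -> bool:
    """A page is re-indexed if it is new or its checksum has changed."""
    return stored_checksum is None or stored_checksum != md5sum(content)
```

Because only the checksum is compared, an unchanged page can be skipped without parsing it again, which is what makes the re-index procedure fast.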
4.4 Ignoring parts of a page

Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in the same part of most pages (like a header, footer or menu). Any part of a page between the <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>

Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added to the page. A more flexible method, . . .
. . .
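The behaviour described in 4.4 (excluded text is not indexed, but its links are still followed) can be sketched in Python. This is an assumption-laden illustration: the markers are assumed to be HTML comments written as <!--sphider_noindex--> and <!--/sphider_noindex-->, which this text shows only in the abbreviated form <! sphider_noindex >, and split_page is a hypothetical helper, not a Sphider-plus function.

```python
import re

# Assumed marker syntax: HTML comments around the excluded part.
NOINDEX = re.compile(
    r"<!--\s*sphider_noindex\s*-->.*?<!--\s*/sphider_noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def split_page(html):
    """Return (indexable_html, links).

    Links are collected from the whole page, so links inside an excluded
    part are still followed; the excluded text itself is dropped.
    """
    links = HREF.findall(html)
    indexable = NOINDEX.sub(" ", html)
    return indexable, links
```

Collecting the links before removing the marked parts is the design choice that keeps navigation menus crawlable while keeping their words out of the index.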
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>

If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
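How an entry of such a list-file might be interpreted can be sketched in Python. Only the delimiter rule (a regexp is introduced by */ and ended with another slash) comes from the text above; that plain entries must match the id exactly, that blanks around the pattern are ignored, and that the pattern must match the whole id are assumptions, and compile_entry is a hypothetical helper.

```python
import re

def compile_entry(entry):
    """Turn one list-file entry into a function that tests a div id."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading '*/' and the trailing '/'.
        pattern = re.compile(entry[2:-1].strip())
        return lambda value: pattern.fullmatch(value) is not None
    # Plain entry: assumed to be an exact, case-sensitive comparison.
    return lambda value: value == entry
```

With the example from above, compile_entry("*/ menu[0-5] /") accepts the ids menu0 to menu5 and rejects menu7.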
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>

This option is intended for use with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/

Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>

This is the vice versa . . .
. . .
of all the duplicate content URLs:

http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678

and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
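The text leading up to the URLs above is cut off here, so the following Python sketch rests on an assumption: that the canonical URL is announced by the standard <link rel="canonical" href="..."> element in the head of each duplicate page. CanonicalFinder and canonical_url are hypothetical names and not part of Sphider-plus.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]

def canonical_url(html, fetched_url):
    """Return the canonical URL of a page, or the fetched URL if none is given."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fetched_url
```

An indexer using this helper would store each page under the returned URL, so both duplicate URLs from the example would end up as one entry.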
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes

DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
. . .
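The conversion step itself can be illustrated in Python. Note that this sketch uses Python's built-in codecs rather than the ConvertCharset function named above, so the set of available charsets differs and some names from the list may not be known to Python; to_utf8 is a hypothetical helper.

```python
import codecs

def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the given charset and re-encode them as UTF-8."""
    try:
        codecs.lookup(charset)
    except LookupError:
        raise ValueError(f"charset not available in Python: {charset}")
    return raw.decode(charset).encode("utf-8")
```

Checking the charset name first separates "unknown charset" from "bytes that are invalid in a known charset", which still raises UnicodeDecodeError.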
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, no output was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.

As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words

Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing

Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail

In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move

This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables, as described in chapter 'Definition and configuration'. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link>
<title>warp.gif</title>
<x_size>635</x_size>
<y_size>98</y_size>
</media_result>
<media_result>
<num>2</num>
<type>audio</type>
<url>http://www.abc.de/index.php</url>
. . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
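A client that consumes the JSON result output could process it as in the following Python sketch. The field names are taken from the example above; the sample string is an abridged copy of it with the text extracts shortened to " . . . ", and SAMPLE and summarize are hypothetical names.

```python
import json

# Abridged copy of the JSON output shown above (text extracts shortened).
SAMPLE = r'''{"query":"colour","total_results":2,"num_of_results":2,
"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
 "title":" Info_eng","fulltxt":" . . . ","page_size":"5,661.6kb",
 "domain_name":"www.english.le-piaggie.info"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
 "title":" Description","fulltxt":" . . . ","page_size":"26.4 kb",
 "domain_name":"www.english.le-piaggie.info"}]}'''

def summarize(payload):
    """Return the query and a list of (num, weight, url) sorted by weight."""
    data = json.loads(payload)
    # The weight is delivered as a string, so it is converted before sorting.
    hits = sorted(
        data["text_results"], key=lambda hit: float(hit["weight"]), reverse=True
    )
    return data["query"], [(h["num"], float(h["weight"]), h["url"]) for h in hits]
```

The escaped slashes (\/) in the URLs are valid JSON and are turned back into plain slashes by the parser.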
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
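A minimal Python sketch of how a client might consume JSON output of this shape. The sample is a shortened version of the result above (the `fulltxt` extracts and some fields are omitted), and the helper names `page_size_kb` and `read_text_results` are hypothetical. Note that `page_size` is delivered as a string with a unit and a thousands separator, so it needs cleaning before numeric use.

```python
import json

# Shortened sample based on the output above; "fulltxt" and some fields are omitted.
SAMPLE_JSON = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
  "title":" Info_eng","page_size":"5,661.6kb"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
  "title":" Description","page_size":"26.4 kb"}]}'''

def page_size_kb(raw):
    """Convert strings such as '5,661.6kb' or '26.4 kb' into a float (kilobytes)."""
    return float(raw.lower().replace("kb", "").replace(",", "").strip())

def read_text_results(json_text):
    """Return a simplified list of the text results found in the JSON output."""
    data = json.loads(json_text)
    results = []
    for hit in data.get("text_results", []):
        results.append({
            "num": hit["num"],
            "weight": float(hit["weight"]),
            "url": hit["url"],              # json.loads already unescapes "\/"
            "title": hit["title"].strip(),
            "page_size_kb": page_size_kb(hit["page_size"]),
        })
    return results

if __name__ == "__main__":
    for hit in read_text_results(SAMPLE_JSON):
        print(hit["num"], hit["weight"], hit["url"])
```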
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. A wide range of settings is provided for Sphider-plus, separated into submenus such as: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
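The naming scheme above can be taken apart with a short Python sketch. The regular expression and the function name `parse_log_name` are assumptions derived only from the two example file names; the trailing part is kept as an opaque string because it identifies the thread.

```python
import re
from datetime import datetime

# Pattern derived from the example names above (an assumption, not the original code):
# db<number>_<YYMMDD>-<HH.MM.SS>_<thread part>.html
LOG_NAME = re.compile(r"^db(\d+)_(\d{6}-\d{2}\.\d{2}\.\d{2})_(.+)\.html$")

def parse_log_name(filename):
    """Split a log file name into database number, start time and thread part."""
    match = LOG_NAME.match(filename)
    if match is None:
        raise ValueError("unexpected log file name: " + filename)
    db_number, stamp, thread = match.groups()
    started = datetime.strptime(stamp, "%y%m%d-%H.%M.%S")
    return {"db": int(db_number), "started": started, "thread": thread}

if __name__ == "__main__":
    info = parse_log_name("db2_100524-21.47.56_1.html")
    print(info["db"], info["started"].isoformat(), info["thread"])
```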
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the names of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen to suit personal requirements, but the OS limitations for file names should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
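A minimal Python sketch of starting several indexer threads with individual IDs. Only the command form `php spider.php -erased1` is taken from the text; the helper names `build_commands` and `launch`, and the idea of starting the processes from Python rather than from separate shells, are assumptions for illustration.

```python
import subprocess

def build_commands(option, ids):
    """Build one command line per thread, e.g. ['php', 'spider.php', '-erased1']."""
    return [["php", "spider.php", "-" + option + str(thread_id)] for thread_id in ids]

def launch(commands, workdir):
    """Start every command as its own process and wait until all have finished.

    Assumes that 'php' is on the PATH and that workdir contains spider.php.
    """
    processes = [subprocess.Popen(cmd, cwd=workdir) for cmd in commands]
    return [proc.wait() for proc in processes]

if __name__ == "__main__":
    # Only prints the command lines; call launch() to actually start the indexer.
    for cmd in build_commands("erased", [1, 2]):
        print(" ".join(cmd))
```

Each thread writes its own log file, so the IDs chosen here will also appear in the log file names.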
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
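The behaviour of the noindex tags (content skipped, links still followed) can be sketched in Python as follows. This is not the Sphider-plus implementation: the regular expressions and the function name `split_page` are assumptions, and the tag pattern deliberately tolerates optional dashes and whitespace because the exact spelling of the tags may differ from how it is rendered here.

```python
import re

# Tolerant pattern for the opening and closing noindex tags (an assumption).
NOINDEX_BLOCK = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>.*?<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.IGNORECASE | re.DOTALL,
)
# Simplified link extraction; a real crawler would use an HTML parser.
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def split_page(html):
    """Return (text to index, links to follow).

    Links are collected from the complete page, so links inside a noindex
    block are still followed, while the block itself is dropped from indexing.
    """
    links = HREF.findall(html)
    indexable = NOINDEX_BLOCK.sub(" ", html)
    return indexable, links

if __name__ == "__main__":
    page = ('<p>Body</p><! sphider_noindex ><a href="/menu.html">Menu</a>'
            '<! /sphider_noindex ><p>More</p>')
    text, links = split_page(page)
    print(text)
    print(links)
```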
. . .
a regexp pattern. The regexp must begin with */ and end with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp must begin with */ and end with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended for use with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must begin with */ and end with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
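The list-file entry syntax (plain names, or a regexp introduced by */ and closed by a slash) can be illustrated with a small Python sketch. The function names `compile_entry` and `is_listed` are hypothetical, and two details are assumptions rather than documented behaviour: surrounding blanks inside the slashes are ignored, and a name must match the whole pattern.

```python
import re

def compile_entry(line):
    """Turn one list-file line into a compiled, case-sensitive pattern."""
    entry = line.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the closing "/".
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the literal name only.
    return re.compile(re.escape(entry))

def is_listed(name, entries):
    """True if the element name or div id matches any entry of the list."""
    return any(compile_entry(entry).fullmatch(name) for entry in entries)

if __name__ == "__main__":
    entries = ["footer", "*/nav[0-5]/"]
    for candidate in ("nav3", "nav7", "Nav3", "footer"):
        print(candidate, is_listed(candidate, entries))
```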
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And especially for old Polish programs: mazovia This list is to be read . . .
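The principle of converting text from one of the listed charsets into UTF-8 can be shown with a short Python sketch. It uses Python's built-in codecs, not the ConvertCharset function of Sphider-plus, and the function name `to_utf8` is hypothetical; not every name in the list above is known to Python under the same spelling.

```python
def to_utf8(raw_bytes, source_charset):
    """Decode bytes from the given charset and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_charset)
    return text.encode("utf-8")

if __name__ == "__main__":
    # The same Cyrillic word stored in three of the charsets listed above.
    word = "Привет"
    for charset in ("windows-1251", "cp866", "koi8-r"):
        legacy = word.encode(charset)
        print(charset, legacy, to_utf8(legacy, charset) == word.encode("utf-8"))
```

The three legacy byte sequences differ from each other, which is why the source charset must be known before the conversion.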
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be assigned a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in subfolder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
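The naming scheme above can be decoded mechanically. The following is a minimal sketch in Python, not part of Sphider-plus; the function name parse_log_name and the returned dictionary keys are hypothetical, and the date is assumed to be encoded as YYMMDD, as the example "100524 - Date (May 24, 2010)" suggests.

```python
import re
from datetime import datetime

# Matches names such as db2_100524-21.47.56_1.html
# (database number, date, start time, thread suffix).
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>[^.]+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, start time and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised log file name: " + name)
    # Assumption: the date part is YYMMDD and the time part is HH.MM.SS.
    started = datetime.strptime(
        match["date"] + " " + match["time"], "%y%m%d %H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],  # kept as text, e.g. "1"
    }

info = parse_log_name("db2_100524-21.47.56_1.html")
print(info["database"], info["started"].isoformat(), info["thread"])
```

The thread suffix is kept as text on purpose, so that suffixes other than plain numbers are not rejected.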
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the names of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen freely, but the file-name limitations of the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
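Starting several threads by hand, as in "php spider.php -new1" and "php spider.php -erased1", can also be scripted. The sketch below is an illustration in Python, not a tool shipped with Sphider-plus; the helper names build_commands and launch are hypothetical, and it assumes that the php executable is on the PATH and that the working directory passed to launch is the folder containing spider.php.

```python
import subprocess

def build_commands(option, ids, script="spider.php"):
    """Build one command line per thread, e.g. php spider.php -new1."""
    return [["php", script, "-{}{}".format(option, thread_id)] for thread_id in ids]

def launch(commands, cwd):
    """Start all threads in parallel and wait until every one has finished."""
    processes = [subprocess.Popen(command, cwd=cwd) for command in commands]
    return [process.wait() for process in processes]

# Only print the command lines here; call launch(...) to actually run them.
for command in build_commands("new", [1, 2]):
    print(" ".join(command))
```

Keeping the command construction separate from the launch step makes it possible to inspect the command lines before any indexer thread is started.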
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
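The behaviour of the <! sphider_noindex > tags described in 4.4 (text between the tags is skipped, links inside are still followed) can be illustrated with a small Python sketch. This is not the Sphider-plus implementation; the function name split_page is hypothetical, the tags are assumed to be written exactly as shown in this documentation (optional whitespace tolerated), and links are found with a deliberately simple href pattern.

```python
import re

# Assumption: the markers look like "<! sphider_noindex >" and "<! /sphider_noindex >".
NOINDEX = re.compile(
    r"<!\s*sphider_noindex\s*>.*?<!\s*/sphider_noindex\s*>",
    re.DOTALL | re.IGNORECASE,
)
HREF = re.compile(r"""href=["']([^"']+)["']""", re.IGNORECASE)

def split_page(html):
    """Return (text to index, links to follow) for one page."""
    links = HREF.findall(html)          # links are taken from the whole page
    indexable = NOINDEX.sub(" ", html)  # excluded parts are dropped before indexing
    return indexable, links

page = (
    '<p>Body</p><! sphider_noindex ><a href="/menu.html">Menu</a>'
    "<! /sphider_noindex ><p>More</p>"
)
indexable, links = split_page(page)
print(indexable)
print(links)
```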
a regexp pattern. The regexp must be introduced by */ and terminated with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
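How a list entry such as */ menu[0-5] / might be interpreted can be sketched as follows. This Python example is an assumption-laden illustration, not the code used by Sphider-plus: the helper names compile_entry and id_is_listed are hypothetical, whitespace around the pattern is assumed to be ignored, entries without the */ ... / wrapper are assumed to be literal ids, and the pattern is assumed to match the complete id.

```python
import re

def compile_entry(entry):
    """Turn one list-file entry into a compiled pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the trailing "/".
        return re.compile(entry[2:-1].strip())
    # Plain entry: treat it as a literal id.
    return re.compile(re.escape(entry))

def id_is_listed(div_id, entries):
    """True if the div id matches any entry of the list."""
    return any(compile_entry(entry).fullmatch(div_id) for entry in entries)

entries = ["*/ menu[0-5] /", "footer"]
for candidate in ("menu3", "menu7", "footer"):
    print(candidate, id_is_listed(candidate, entries))
```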
a regexp pattern. The regexp must be introduced by */ and terminated with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and terminated with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
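As an illustration of how a crawler can learn the canonical URL of a duplicate page, the following Python sketch reads a link element with rel="canonical" from the page head. It assumes that the canonical URL is announced in this standard way; the class and function names are hypothetical and this is not the parser used by Sphider-plus.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attributes = dict(attrs)
            if (attributes.get("rel") or "").lower() == "canonical":
                self.canonical = attributes.get("href")

def find_canonical(html):
    """Return the canonical URL named in the page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

duplicate_page = (
    "<html><head>"
    '<link rel="canonical" '
    'href="http://www.example.com/product.php?item=swedish-fish">'
    "</head><body>Swedish fish</body></html>"
)
print(find_canonical(duplicate_page))
```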
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And especially for old Polish programs: mazovia This list is to be read . . .
. . .
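The effect of converting a page from one of the listed charsets into UTF-8 can be demonstrated in a few lines. The sketch below uses Python's built-in codecs rather than the ConvertCharset function mentioned above, so it only illustrates the principle; the function name to_utf8 is hypothetical, and not every charset name from the list is necessarily known to Python.

```python
def to_utf8(raw, charset):
    """Decode bytes from the given source charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")

# The same Russian word, stored as windows-1251 bytes.
cyrillic = b"\xcf\xf0\xe8\xe2\xe5\xf2"
print(to_utf8(cyrillic, "windows-1251").decode("utf-8"))

# A single accented letter, stored as iso-8859-1.
print(to_utf8(b"\xe9", "iso-8859-1"))
```

In the source charset each of these characters occupies one byte; after conversion the Cyrillic and accented letters take two bytes each, which is why the charset of a page has to be known before its text is stored.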
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accident. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases that are correctly configured, assigned and have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
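A client can read the media_result elements shown above with any XML parser. The following Python sketch is an illustration only: the sample document contains just the elements visible in the excerpt, the enclosing results root element is a hypothetical wrapper added to make the fragment well-formed, and the function name read_media_results is not part of Sphider-plus.

```python
import xml.etree.ElementTree as ET

# Only elements visible in the excerpt are used; <results> is a hypothetical root.
SAMPLE = """<results>
  <media_result>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
  <media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
  </media_result>
</results>"""

def read_media_results(xml_text):
    """Return one dictionary per <media_result>, keyed by child element name."""
    root = ET.fromstring(xml_text)
    results = []
    for node in root.iter("media_result"):
        results.append({child.tag: (child.text or "").strip() for child in node})
    return results

for result in read_media_results(SAMPLE):
    print(result)
```

Each result is returned as a plain dictionary, so elements that are missing from a particular result simply do not appear as keys.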
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
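The JSON output shown above can be consumed with a standard JSON parser. The Python sketch below uses an abridged copy of that output, limited to keys that appear in the example and with the long fulltxt extracts left out; the function name summarize is hypothetical.

```python
import json

# Abridged copy of the example output above (fulltxt and some fields left out).
SAMPLE = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description"}
]}'''

def summarize(json_text):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(json_text)
    rows = []
    for result in data.get("text_results", []):
        rows.append((
            result["num"],
            float(result["weight"]),   # the weight is delivered as a string
            result["url"],             # "\/" is decoded to "/" by the parser
            result["title"].strip(),   # titles carry a leading blank
        ))
    return rows

for row in summarize(SAMPLE):
    print(row)
```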
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing
Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the sub folder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
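The naming scheme above can be taken apart mechanically, for example when sorting or filtering the files in /admin/log/. The following is only a minimal Python sketch; the function name parse_log_name and the returned keys are hypothetical and not part of Sphider-plus, and the date is assumed to be encoded as yymmdd, as the example "100524 - Date (May 24, 2010)" suggests.

```python
import re
from datetime import datetime

# Matches names such as db2_100524-21.47.56_1.html
# (database number, date, start time, thread number or ID).
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    # Assumption: the date part is yymmdd and the time part is hh.mm.ss.
    started = datetime.strptime(
        match["date"] + " " + match["time"], "%y%m%d %H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }


if __name__ == "__main__":
    print(parse_log_name("db2_100524-21.47.56_1.html"))
```

The thread part is kept as a string, so that suffixes other than plain numbers are accepted as well.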
. . .
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files, like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen freely, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter, like:
php spider.php -erased1
php . . .
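Starting several such threads by hand quickly becomes repetitive, so the command lines can be generated. The sketch below is only an illustration in Python: the helper names reindex_commands and start_threads are hypothetical, and it assumes that "php" is on the PATH and that the working directory passed in is the folder containing spider.php. Only the option forms shown in this documentation (-erased1, -erased2, ...) are generated.

```python
import subprocess


def reindex_commands(thread_ids, option="-erased"):
    """Build one command line per thread, e.g. php spider.php -erased1."""
    return [["php", "spider.php", f"{option}{tid}"] for tid in thread_ids]


def start_threads(commands, cwd):
    """Start every command as its own process and return the handles.

    Assumption: cwd is the folder that contains spider.php.
    """
    return [subprocess.Popen(cmd, cwd=cwd) for cmd in commands]


if __name__ == "__main__":
    # Only prints the commands; call start_threads() to actually run them.
    for cmd in reindex_commands([1, 2, 3]):
        print(" ".join(cmd))
```

Keeping command construction separate from process start makes it possible to review the generated lines before anything is executed.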
. . .
4.4 Ignoring parts of a page
Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in the same part of most pages (like a header, footer or a menu). Any part of a page between the <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>
Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
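The effect of the <! sphider_noindex > markers described in section 4.4 can be pictured as removing the enclosed text before the words of a page are collected. The following Python sketch is an illustration only and not the code used by Sphider-plus; the name strip_noindex is hypothetical. As an assumption, the pattern tolerates optional blanks and dashes around the marker name, so that both the spelling printed above and an HTML comment spelling are matched.

```python
import re

# Assumption: blanks and dashes around the marker name are optional.
NOINDEX_BLOCK = re.compile(
    r"<!\s*-*\s*sphider_noindex\s*-*\s*>.*?<!\s*-*\s*/sphider_noindex\s*-*\s*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html):
    """Return the page text with every marked block replaced by a blank.

    Links inside a marked block are still meant to be followed, so link
    extraction should run on the original page, not on this result.
    """
    return NOINDEX_BLOCK.sub(" ", html)


if __name__ == "__main__":
    page = "Intro <! sphider_noindex > Menu <a href='/a'>A</a> <! /sphider_noindex > Body"
    print(strip_noindex(page))
```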
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>
If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>
This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>
This is the vice versa . . .
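The list-file convention used in sections 4.5 to 4.7 (a plain value, or a regexp introduced by */ and closed by a slash) can be sketched as follows. This is a hedged Python illustration, not the Sphider-plus implementation: the names compile_entry and is_listed are hypothetical, and two details are assumptions, namely that blanks around the pattern are not significant and that an entry has to match the whole id or element name.

```python
import re


def compile_entry(entry):
    """Turn one list-file line into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry, e.g. "*/ menu[0-5] /".
        # Assumption: surrounding blanks are not part of the pattern.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the literal text only.
    return re.compile(re.escape(entry))


def is_listed(value, entries):
    """True if value (a div id or element name) matches any entry."""
    return any(
        compile_entry(entry).fullmatch(value)
        for entry in entries
        if entry.strip()
    )


if __name__ == "__main__":
    entries = ["footer", "*/ menu[0-5] /"]
    for candidate in ("menu3", "menu7", "footer", "Footer"):
        print(candidate, is_listed(candidate, entries))
```

No ignore-case flag is set, which mirrors the case-sensitive processing mentioned for /include/common/elements_not.txt.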
. . .
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
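How a crawler might resolve such duplicates can be sketched in a few lines of Python. This is an illustration only and says nothing about how Sphider-plus does it internally; it assumes that the canonical URL is announced through the standard link element with rel="canonical" in the page head. The names CanonicalFinder and canonical_url are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(html, fallback):
    """Return the canonical URL named by the page, or fallback if none is given."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fallback


if __name__ == "__main__":
    page = (
        '<html><head><link rel="canonical" '
        'href="http://www.example.com/product.php?item=swedish-fish" />'
        "</head><body>Swedish fish</body></html>"
    )
    requested = "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy"
    print(canonical_url(page, requested))
```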
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
iso-8859-7
iso-8859-8
iso-8859-9
iso-8859-10
iso-8859-11
iso-8859-12
iso-8859-13
iso-8859-14
iso-8859-15
iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037
cp424
cp500
cp856
cp875
cp1006
cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes

DSP implementation for NeXT
stdenc
symbol
zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
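What converting a page from one of the charsets listed above into UTF-8 amounts to can be shown with a few lines of Python. This is only a sketch of the idea: it uses Python's own codec registry as a stand-in for the ConvertCharset function, so not every name in the list above is necessarily recognised, and the function name to_utf8 is hypothetical.

```python
def to_utf8(raw, charset):
    """Decode raw bytes from the given charset and re-encode them as UTF-8.

    Raises LookupError if Python does not know the charset name and
    UnicodeDecodeError if the bytes are not valid in that charset.
    """
    text = raw.decode(charset)
    return text.encode("utf-8")


if __name__ == "__main__":
    # A Cyrillic word as it would arrive from a windows-1251 page.
    raw = "Привет".encode("windows-1251")
    print(to_utf8(raw, "windows-1251").decode("utf-8"))
```

Declaring the wrong charset does not necessarily raise an error; single-byte charsets usually decode to different, wrong characters instead, which is why the charset declared for a page matters.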
. . .
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data
Up to version 1.5 of Sphider-plus, no output was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not shown during the index / re-index procedure, waiting for results was tedious. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words

Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that form part of a word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing
Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link-id and thumb-id
- Option to delete the thumbnail

All tables contain active links to open the media files. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move
This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables, as described in chapter 'Definition and configuration'. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link>
<title>warp.gif</title>
<x_size>635</x_size>
<y_size>98</y_size>
</media_result>
<media_result>
<num>2</num>
<type>audio</type>
<url>http://www.abc.de/index.php</url>
. . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
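A client that receives the JSON result output shown above can read it with any JSON library. The Python sketch below is only an illustration: the key names are taken from the sample above, while the sample string itself is shortened (the 'fulltxt' extracts and some top-level keys are left out) and the function name summarize is hypothetical.

```python
import json

# Shortened copy of the sample output; the 'fulltxt' extracts are omitted.
SAMPLE = r"""
{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
 "text_results":[
  {"num":1,"weight":"100.0",
   "url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
   "title":" Info_eng","page_size":"5,661.6kb",
   "domain_name":"www.english.le-piaggie.info"},
  {"num":2,"weight":"50.0",
   "url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
   "title":" Description","page_size":"26.4 kb",
   "domain_name":"www.english.le-piaggie.info"}]}
"""


def summarize(payload):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(payload)
    return [
        (item["num"], float(item["weight"]), item["url"], item["title"].strip())
        for item in data.get("text_results", [])
    ]


if __name__ == "__main__":
    for num, weight, url, title in summarize(SAMPLE):
        print(f"{num}. {title} ({weight:.1f}) {url}")
```

The escaped slashes in the sample are valid JSON and are turned back into plain slashes by the parser.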
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
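The list-file convention described above (plain entries, plus regexp entries introduced by */ and closed by a slash) could be interpreted as in the following Python sketch. The function names are hypothetical. Matching the whole name and trimming spaces around the pattern are assumptions; the text itself only states the delimiters and the case-sensitive processing.

```python
import re

def compile_entry(entry: str) -> re.Pattern:
    """Turn one list-file line into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading */ and the trailing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the name literally.
    return re.compile(re.escape(entry))

def is_listed(name: str, entries: list) -> bool:
    """True if the name fully matches any entry of the list file."""
    return any(compile_entry(e).fullmatch(name) for e in entries)

entries = ["footer", "*/nav[0-5]/", "*/ menu[0-5] /"]
for candidate in ["nav3", "nav7", "NAV3", "menu0", "footer"]:
    print(candidate, is_listed(candidate, entries))
```

With these entries, nav0 to nav5 and menu0 to menu5 are covered by one line each, while NAV3 is not, because no case-insensitive flag is set.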
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL:
http://www.example.com/product.php?item=swedish-fish
The . . .
. . .
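How a crawler might resolve the duplicate URLs above to one canonical URL is sketched below in Python. It assumes the canonical URL is declared with the standard <link rel="canonical" href="..."> element in the page head. The class and function names are hypothetical and do not come from Sphider-plus.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")

def canonical_url(html: str, page_url: str) -> str:
    """Return the declared canonical URL, or the page's own URL if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url

head = ('<html><head><link rel="canonical" '
        'href="http://www.example.com/product.php?item=swedish-fish"></head></html>')
duplicate = "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy"
print(canonical_url(head, duplicate))
```

Storing the result of canonical_url instead of the fetched URL means that both duplicate URLs end up under the same index entry.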
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:
WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS
DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE Cyrillic (DOS))
cp869 - Greek2
MAC (Apple)
x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman
ISO
iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16
MISCELLANEOUS
gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes
DSP implementation for NeXT: stdenc symbol zdingbat
And specially for old Polish programs: mazovia
This list is to be read . . .
. . .
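The principle of converting legacy charsets into UTF-8 can be shown with a short Python sketch. It uses Python's built-in codecs, not the ConvertCharset function named above, and the helper name to_utf8 is hypothetical. Not every charset name in the list necessarily has a Python codec.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the declared charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")

# Polish sample text stored in a legacy Windows code page.
polish = "Zażółć gęślą jaźń"
legacy_bytes = polish.encode("windows-1250")

# Russian sample text stored in KOI8-R.
russian = "Привет"
koi8_bytes = russian.encode("koi8-r")

print(to_utf8(legacy_bytes, "windows-1250").decode("utf-8"))
print(to_utf8(koi8_bytes, "koi8-r").decode("utf-8"))
```

The conversion only works if the declared charset matches the bytes actually delivered; decoding with the wrong charset yields garbled text or a decoding error.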
Access denied; you need the RELOAD privilege. . .
10. Enable real-time output of logging data
Up to version 1.5 of Sphider-plus, no output was displayed during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not shown during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in the Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words
Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.
13. Media search for images, audio streams and videos
13.1 Media indexing
Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input
'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail
In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.
16.5 Copy & Move
This section of the Database Management presents only those databases that are correctly configured and assigned and that have a set of installed tables, as described in chapter 'Definition and configuration'. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
</media_result>
<media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]}
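A client consuming the JSON result output could read it as in the following Python sketch. The sample is an abbreviated version of the output shown above: it reuses the keys and values printed there and leaves out the "fulltxt" extracts, because they are elided in this document. How the JSON is fetched from the search script is not shown and would depend on the installation.

```python
import json

# Abbreviated sample built from the keys and values shown above ("fulltxt" omitted).
raw = r'''{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
  "title":" Info_eng","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
  "title":" Description","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}
]}'''

data = json.loads(raw)
print(data["query"], data["total_results"])
for hit in data["text_results"]:
    # "weight" is delivered as a string, so convert it before comparing or sorting.
    print(hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
```

The escaped slashes (\/) in the URLs are plain JSON escapes and are turned back into ordinary slashes by the parser. Titles carry a leading space in the sample, hence the strip() call.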
All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. The current release is version 4.2024a, published February 12, 2024. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and phishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierarchical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings provided for Sphider-plus, separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be assigned a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
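The naming scheme above can be taken apart mechanically. The following is a minimal Python sketch, not part of Sphider-plus; the function name parse_log_name and the returned keys are illustrative assumptions, and the trailing part is simply treated as the thread suffix.

```python
import re
from datetime import datetime

# Pattern for names such as "db2_100524-21.47.56_1.html":
#   db<number>_<YYMMDD>-<HH.MM.SS>_<thread suffix>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>\w+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, timestamp and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    # Date and time are joined and parsed in one step (two-digit year).
    timestamp = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {"db": int(match["db"]), "timestamp": timestamp, "thread": match["thread"]}

info = parse_log_name("db2_100524-21.47.56_1.html")
print(info["db"], info["timestamp"].isoformat(), info["thread"])
# prints: 2 2010-05-24T21:47:56 1
```

Because the thread suffix is matched as a run of word characters, names carrying an individual ID in that position are accepted as well.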
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen freely, but the limitations for file names with respect to the OS should be taken . . .
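As a rough illustration of the per-thread invocations above, the following Python sketch only builds the command lines and does not start anything. The helper build_thread_commands and the set of characters accepted for an ID are assumptions made for this example; they are not taken from Sphider-plus.

```python
import re

# Characters assumed to be safe in file names on common operating systems
# (an assumption of this sketch, since the ID ends up in the log file name).
SAFE_ID = re.compile(r"^[A-Za-z0-9_-]+$")

def build_thread_commands(option, thread_ids):
    """Return one argument list per thread, e.g. ["php", "spider.php", "-new1"]."""
    commands = []
    for thread_id in thread_ids:
        thread_id = str(thread_id)
        if not SAFE_ID.match(thread_id):
            raise ValueError(f"ID {thread_id!r} is not safe for a log file name")
        # The ID is appended directly to the option parameter.
        commands.append(["php", "spider.php", f"-{option}{thread_id}"])
    return commands

for command in build_thread_commands("new", [1, 2]):
    print(" ".join(command))
# prints: php spider.php -new1
#         php spider.php -new2
```

Each returned list could be handed to subprocess.Popen to start the threads in parallel from the folder that holds spider.php.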
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
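The behaviour of the noindex tags (marked text is skipped, links inside it are still followed) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used by Sphider-plus: the tags are matched exactly as written in this text, and links are collected with a simple href pattern.

```python
import re

# A block between the opening and closing tags, written as in the text above:
# <! sphider_noindex > ... <! /sphider_noindex >
NOINDEX_BLOCK = re.compile(
    r"<!\s*sphider_noindex\s*>.*?<!\s*/sphider_noindex\s*>",
    re.DOTALL,
)
# Simplified link extraction; a real spider would use an HTML parser.
HREF = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def split_page(html):
    """Return (text_to_index, links_to_follow) for one page."""
    links = HREF.findall(html)                # links are taken from the whole page
    indexable = NOINDEX_BLOCK.sub(" ", html)  # marked blocks are dropped from the text
    return indexable, links

page = (
    "<p>Welcome</p>"
    '<! sphider_noindex ><a href="/menu.html">Menu</a><! /sphider_noindex >'
    "<p>Content</p>"
)
indexable, links = split_page(page)
print(indexable)  # prints: <p>Welcome</p> <p>Content</p>
print(links)      # prints: ['/menu.html']
```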
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the reverse . . .
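The list-file convention described above (a plain name, or a regexp introduced by */ and ended with another slash) could be interpreted as in the following Python sketch. The helper names are hypothetical; trimming the blanks around the pattern and requiring a full match are assumptions of this example, not confirmed behaviour of Sphider-plus.

```python
import re

def compile_entry(entry):
    """Turn one list-file line into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the closing "/".
        return re.compile(entry[2:-1].strip())
    # Plain entry: the name is matched literally.
    return re.compile(re.escape(entry))

def is_listed(name, entries):
    """True if name fully matches any non-empty entry of the list."""
    return any(compile_entry(e).fullmatch(name) for e in entries if e.strip())

entries = ["footer", "*/nav[0-5]/", "*/ menu[0-5] /"]
print(is_listed("nav3", entries))    # prints: True
print(is_listed("Nav3", entries))    # prints: False (case-sensitive)
print(is_listed("menu5", entries))   # prints: True
print(is_listed("menu7", entries))   # prints: False
```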
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
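How a spider might resolve duplicates to their canonical URL can be sketched as follows. The example assumes that the canonical URL is declared with the standard link element (rel="canonical") in the page head; the class and function names are illustrative and not part of Sphider-plus.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = attributes.get("href")

def canonical_url(html, page_url):
    """Return the declared canonical URL, or the page's own URL if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url

html = (
    '<html><head><link rel="canonical" '
    'href="http://www.example.com/product.php?item=swedish-fish" /></head></html>'
)
duplicate = "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy"
print(canonical_url(html, duplicate))
# prints: http://www.example.com/product.php?item=swedish-fish
```

Indexing under the returned URL means that all duplicates of a page end up as one entry.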
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
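The conversion step itself, decoding text from its source charset and re-encoding it as UTF-8, can be illustrated with Python's built-in codecs. This is only a stand-in for the ConvertCharset function; the built-in codecs may not cover every charset of the list above, and the helper name to_utf8 is an assumption.

```python
def to_utf8(raw, source_charset):
    """Decode bytes from source_charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")

# Bytes as a windows-1251 (Cyrillic) page would deliver them.
cyrillic = "Привет".encode("windows-1251")
converted = to_utf8(cyrillic, "windows-1251")
print(converted.decode("utf-8"))  # prints: Привет

# The same call works for other supported charsets, e.g. iso-8859-7 (Greek).
greek = "Ελλάδα".encode("iso-8859-7")
print(to_utf8(greek, "iso-8859-7").decode("utf-8"))  # prints: Ελλάδα
```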
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results was tedious. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
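A client consuming the JSON result output could read it as in the following Python sketch. The sample is an abbreviated copy of the output shown above, reduced to a few of its fields; the function name summarize is an assumption of this example.

```python
import json

# Abbreviated sample using field names from the JSON output shown above.
sample = r'''{"query":"colour","total_results":2,"num_of_results":2,"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description"}]}'''

def summarize(raw):
    """Return the query string and one (num, weight, url, title) tuple per text result."""
    data = json.loads(raw)
    rows = []
    for hit in data["text_results"]:
        # Weights arrive as strings; titles may carry a leading blank.
        rows.append((hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip()))
    return data["query"], rows

query, rows = summarize(sample)
print(query)
for num, weight, url, title in rows:
    print(num, weight, url, title)
```

The escaped slashes (\/) in the URLs are valid JSON and are turned back into plain slashes by the parser.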
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output
[ Documentation ]
1. Settings, customizing and statistics
If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus:
Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains
Categories:
- Add, edit, delete
- Create new subcategory under
Index:
- Basic indexing options
- Advanced options
Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database
Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights
Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize
Templates:
In order to enable customers to integrate Sphider-plus into existing sites, HTML templates are prepared for
- Search form
- Text result listing
- Media result listing
- Most popular queries
- etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate files
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css
Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info.
Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.
2.5 Preferred Re-indexing
Each new URL added to the Admin backend can be assigned a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/
The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
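The naming scheme above can be taken apart mechanically, for example when sorting or archiving the log files. The following Python sketch is only an illustration and not part of Sphider-plus. It assumes the date is encoded as YYMMDD (as the example "100524 - May 24, 2010" suggests) and that the last item before ".html" is the thread label; the function name `parse_log_name` and the dictionary keys are hypothetical.

```python
import re
from datetime import datetime

# Pattern for names such as db2_100524-21.47.56_1.html
# (assumed layout: db<number>_<YYMMDD>-<HH.MM.SS>_<thread label>.html).
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>\w+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, timestamp and thread label."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised log file name: " + name)
    timestamp = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "timestamp": timestamp,
        "thread": match["thread"],
    }


info = parse_log_name("db2_100524-21.47.56_1.html")
print(info["database"], info["timestamp"].isoformat(), info["thread"])
```

The thread label is matched as any alphanumeric string, so labels longer than a single digit are accepted as well.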
. . .
spider.php -new1
php spider.php -new2
etc.
The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter, like:
php spider.php -erased1
php . . .
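Starting several threads by hand can also be scripted. The following Python sketch only illustrates the idea and is not shipped with Sphider-plus. It assumes that `php` is on the PATH and that the script is run from the folder containing spider.php; the helper names `build_commands` and `launch` are hypothetical. Only the option spelling (`-erased` followed by an ID) is taken from the text above.

```python
import subprocess


def build_commands(option, ids, script="spider.php"):
    """Build one 'php spider.php -<option><id>' command line per thread ID."""
    return [["php", script, "-{}{}".format(option, thread_id)] for thread_id in ids]


def launch(commands):
    """Start all commands at once, then wait for each of them to finish.

    Returns the list of exit codes. Not called here, because it needs
    a working PHP installation and a configured Sphider-plus.
    """
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


commands = build_commands("erased", [1, 2, 3])
for command in commands:
    print(" ".join(command))
```

Calling `launch(commands)` would then run the three re-index threads in parallel, each writing its own log file.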
. . .
4.4 Ignoring parts of a page
Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or menu). Any part of a page between the <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.
4.5 Ignoring parts of a page by <div id='abc'>
Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added to (edited into) the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example:
*/ menu[0-5] /
4.6 Indexing only parts of a page by <div id='abc'>
If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example:
*/ table[0-5] /
4.7 Ignore HTML elements defined by <tagname> . . </tagname>
This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example:
*/nav[0-5]/
Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.
4.8 Index only HTML elements defined by <tagname> . . </tagname>
This is the vice versa . . .
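The list-file convention described above (plain names, or a regexp introduced by */ and ended by a slash) can be sketched as follows. This Python example is an illustration only and does not reproduce the Sphider-plus code. It assumes that surrounding blanks are ignored, that an entry has to match the whole name, and that matching is case-sensitive; the function names `compile_entry` and `matches_any` are hypothetical.

```python
import re


def compile_entry(entry):
    """Turn one list-file entry into a compiled pattern.

    Entries of the form '*/ ... /' are treated as regular expressions,
    everything else as a literal name.
    """
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Drop the leading '*/' and the trailing '/'.
        return re.compile(entry[2:-1].strip())
    return re.compile(re.escape(entry))


def matches_any(name, entries):
    """Return True if the name matches at least one non-empty entry."""
    return any(compile_entry(e).fullmatch(name) for e in entries if e.strip())


entries = ["*/nav[0-5]/", "footer"]
for candidate in ["nav3", "nav7", "footer", "Footer"]:
    print(candidate, matches_any(candidate, entries))
```

Because no case-insensitive flag is set, "Footer" is not matched by the entry "footer", in line with the case-sensitive processing mentioned above.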
. . .
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
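How duplicates can be mapped to one canonical URL may be sketched like this. The Python example below is an illustration, not the Sphider-plus implementation. It assumes that the canonical URL is announced by a `<link rel="canonical" href="...">` element in the page head; the class `CanonicalFinder`, the function `canonical_url` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(page_url, html):
    """Return the announced canonical URL, or the page URL if none is given."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url


# Assumed sample page: both duplicate URLs deliver this head section.
PAGE = (
    '<html><head><link rel="canonical" '
    'href="http://www.example.com/product.php?item=swedish-fish"></head></html>'
)
duplicates = [
    "http://www.example.com/product.php?item=swedish-fish&category=gummy-candy",
    "http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678",
]
targets = {canonical_url(url, PAGE) for url in duplicates}
print(targets)
```

Both duplicates collapse to a single entry, so an indexer following this idea would store the page only once.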
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:
WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai (this file is also for DOS)
DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE: Cyrillic (DOS))
cp869 - Greek2
MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman
ISO
iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16
MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037 cp424 cp500 cp856 cp875 cp1006 cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes
DSP implementation for NeXT
stdenc
symbol
zdingbat
And especially for old Polish programs:
mazovia
This list is to be read . . .
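What such a conversion into UTF-8 does can be shown with a few lines of Python. This sketch uses Python's built-in codecs and not the ConvertCharset function of Sphider-plus, so it only illustrates the principle. The function name `to_utf8` is hypothetical, and not every charset name from the list above is necessarily known to Python (an unknown name raises a LookupError).

```python
def to_utf8(raw, source_charset):
    """Decode bytes from the given legacy charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


# A Polish city name as it would be stored in a windows-1250 page.
legacy = "Łódź".encode("windows-1250")
utf8 = to_utf8(legacy, "windows-1250")

print(len(legacy), "bytes in windows-1250")
print(len(utf8), "bytes in UTF-8")
print(utf8.decode("utf-8"))
```

The text itself is unchanged by the conversion; only the byte representation differs, because characters outside ASCII need more than one byte in UTF-8.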
. . .
Access denied; you need the RELOAD privilege. . .
10. Enable real-time output of logging data
Up to version 1.5 of Sphider-plus, no printout was available during index / re-index because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.
As progress was not presented during the index / re-index procedure, waiting for results became tedious. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words
Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.
13. Media search for images, audio streams and videos
13.1 Media indexing
Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input
'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixels
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail
All tables contain active links for opening the media files. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.
16.5 Copy & Move
This section of the Database Management presents only those databases that are correctly configured and assigned and that have a set of installed tables, as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be searched. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","pagesize56616kbdomainnamewwwenglishle-piaggieinfo}{num2weight500urlhttp\/\/wwwenglishle-piaggieinfo\/html\/descriptionhtmltitle":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","pagesize56616kbdomainnamewwwenglishle-piaggieinfo}{num2weight500urlhttp\/\/wwwenglishle-piaggieinfo\/html\/descriptionhtmltitle":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - Sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/. The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
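The naming scheme above can be taken apart mechanically. The following is only a minimal sketch in Python, not part of Sphider-plus: the function name parse_log_name is hypothetical, and it assumes that the trailing part of the name (1, 2, or an ID such as ID1) identifies the thread, as the examples suggest.

```python
import re
from datetime import datetime

# Matches names such as db2_100524-21.47.56_1.html
# Assumption: the last part before ".html" identifies the thread.
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>\w+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, start time and thread part."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    # The date is written as yymmdd and the time as hh.mm.ss
    started = datetime.strptime(
        match["date"] + " " + match["time"], "%y%m%d %H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }

if __name__ == "__main__":
    print(parse_log_name("db2_100524-21.47.56_1.html"))
```

A helper like this could be used to sort or group the files in /admin/log/ by database or by start time.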
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the names of the corresponding log files, like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen freely, but the file name limitations of the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added to the page by editing it. A more flexible method, . . .
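The behaviour of the <! sphider_noindex > tags described in 4.4 (content between the tags is not indexed, links inside it are still followed) can be illustrated with a small Python sketch. This is not the Sphider-plus implementation: the function name split_noindex and the simple href pattern are assumptions made only for illustration.

```python
import re

# Text between the opening and closing noindex tags (tag spelling as in 4.4)
NOINDEX = re.compile(
    r"<!\s*sphider_noindex\s*>.*?<!\s*/sphider_noindex\s*>",
    re.DOTALL,
)
# Simplified link pattern, an assumption for this sketch only
HREF = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def split_noindex(html):
    """Return (text_to_index, links_to_follow) for one page."""
    # Links are collected from the whole page, including excluded parts
    links = HREF.findall(html)
    # Excluded parts are dropped before the text would be indexed
    indexable = NOINDEX.sub(" ", html)
    return indexable, links

if __name__ == "__main__":
    page = (
        "<p>Body</p>"
        '<! sphider_noindex ><a href="/menu.html">Menu</a><! /sphider_noindex >'
        "<p>More</p>"
    )
    print(split_noindex(page))
```

The point of the sketch is the order of the two steps: links are extracted first, so they survive even though the surrounding text is removed from the indexable content.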
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
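A JSON result of the shape shown above can be consumed with any JSON parser. The following Python sketch is only an illustration: the function name summarize is hypothetical, and the sample payload uses made-up example.com URLs, reusing only field names that appear in the output above (num, weight, url, title, text_results).

```python
import json

# Hypothetical sample payload; only the field names come from the output above
SAMPLE = r'''
{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,
 "text_results":[
  {"num":1,"weight":"100.0","url":"http:\/\/www.example.com\/a.pdf","title":" A"},
  {"num":2,"weight":"50.0","url":"http:\/\/www.example.com\/b.html","title":" B"}
 ]}
'''

def summarize(payload):
    """Return (num, weight, url, title) tuples for all text results."""
    data = json.loads(payload)
    rows = []
    for result in data.get("text_results", []):
        rows.append((
            result["num"],
            float(result["weight"]),   # weight is delivered as a string
            result["url"],             # "\/" is decoded to "/" by the parser
            result["title"].strip(),   # titles carry a leading blank
        ))
    return rows

if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

Note that the weight arrives as a string and the titles start with a blank, so a client will usually want to convert and trim these values as shown.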
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
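The idea behind the MD5sum check can be illustrated as follows. This is a minimal sketch in Python and not the code used by Sphider-plus: the names md5sum and needs_reindex are hypothetical, and it is assumed that the checksum of the previous index run is available as a hex string (or None for a page never indexed before).

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the MD5 checksum of the page content as a hex string."""
    return hashlib.md5(content).hexdigest()


def needs_reindex(stored_md5, content: bytes) -> bool:
    """A page is re-indexed only if its checksum differs from the stored one.

    stored_md5 is None for pages that were never indexed before.
    """
    return stored_md5 != md5sum(content)
```

Because unchanged pages are recognized by comparing two short strings, their content does not need to be parsed again, which is what makes the re-index procedure fast.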
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can for example be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links within it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags must be added to the page source. A more flexible method, . . .
. . .
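The behaviour of the sphider_noindex tags described in chapter 4.4 (content between the tags is not indexed, links inside are still followed) can be sketched like this. The Python code is only an illustration, not the Sphider-plus implementation: the function name split_for_indexing is hypothetical, the pattern accepts the tag both as written in this text and in HTML comment notation (an assumption), and links are collected with a deliberately simple href pattern.

```python
import re

# Accepts "<! sphider_noindex >" as written in this text and, as an
# assumption, also the HTML comment form "<!--sphider_noindex-->".
NOINDEX = re.compile(
    r"<!-*\s*sphider_noindex\s*-*>.*?<!-*\s*/sphider_noindex\s*-*>",
    re.DOTALL,
)
# Deliberately simple link pattern, sufficient for the illustration only.
LINK = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def split_for_indexing(html):
    """Return (text to be indexed, links to be followed).

    Links are collected from the complete page first, so links inside a
    noindex block are still followed, while the block itself is removed
    from the text handed to the indexer.
    """
    links = LINK.findall(html)
    text = NOINDEX.sub(" ", html)
    return text, links
```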
a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
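How list-file entries with and without a regexp could be evaluated is sketched below in Python. The sketch rests on several assumptions that are not confirmed by this documentation: blanks around the pattern (as in the example */ menu[0-5] /) are ignored, plain entries are compared literally, and an id must match the entry completely. The names compile_entry and id_is_listed are hypothetical.

```python
import re


def compile_entry(entry):
    """Turn one line of the list-file into a compiled pattern.

    Entries introduced by */ and terminated by a slash are treated as
    regexp, everything else as a literal value.
    """
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        return re.compile(entry[2:-1].strip())
    return re.compile(re.escape(entry))


def id_is_listed(div_id, entries):
    """True if the id of a div matches any non-empty entry completely."""
    return any(
        compile_entry(entry).fullmatch(div_id)
        for entry in entries
        if entry.strip()
    )
```

With the entries footer and */ menu[0-5] /, the ids footer and menu3 would be matched, while menu7 and footer2 would not.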
a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and terminated by another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
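Reading the canonical URL from a duplicate page can be sketched as follows. The Python example assumes that the canonical URL is announced by the standard link element with rel="canonical" in the page head; it uses the parser of the Python standard library and is not the Sphider-plus implementation. The names CanonicalFinder and canonical_url are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several blank-separated values.
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def canonical_url(html):
    """Return the canonical URL named in the page, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

An indexer could then store the page under the returned URL instead of the requested one, so that all duplicates end up as a single entry.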
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And especially for old Polish programs: mazovia This list is to be read . . .
. . .
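What the conversion into UTF-8 means for a single string can be shown with a few lines of Python. The sketch uses the codecs built into Python instead of the ConvertCharset function, so it only illustrates the principle; the function name to_utf8 is hypothetical, and not every charset name of the list above is necessarily known to Python (an unknown name raises LookupError).

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Decode bytes from the given source charset and re-encode them as UTF-8.

    Raises LookupError for a charset name unknown to Python and
    UnicodeDecodeError for bytes that are invalid in that charset.
    """
    return raw.decode(charset).encode("utf-8")
```

For example, the single byte 0xE4 taken from a windows-1252 page becomes the two UTF-8 bytes of the letter ä, and the windows-1251 bytes C4 EE EC become the Cyrillic word Дом.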
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, there was no printout available during index / re-index because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became tedious. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accident. 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management presents only those databases that are correctly configured and assigned and that have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be searched. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
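A client could read the media results of the XML output roughly as follows. The Python sketch is based only on the element names visible in the fragment above; the enclosing results element and the shortened SAMPLE document are assumptions made for this illustration, and the function name media_results is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <results> root element is an assumption, the
# child elements and their values are taken from the fragment in the text.
SAMPLE = """<results>
  <media_result>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""


def media_results(xml_text):
    """Return one dictionary per <media_result> element (values stay strings)."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in item}
        for item in root.iter("media_result")
    ]
```

Using iter() instead of a fixed path keeps the sketch independent of how deeply the media_result elements are nested in the real output.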
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
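The JSON output can be processed with any JSON library. The following Python sketch assumes the key names shown in the example above (text_results, weight, url, title); the shortened SAMPLE string and the function name ranked_urls are made up for this illustration and are not part of Sphider-plus.

```python
import json

# Shortened sample using only keys that appear in the output shown in the
# text; the escaped slashes (\/) are valid JSON and decode to plain slashes.
SAMPLE = r'''{"query":"colour","total_results":2,"text_results":[
 {"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng"},
 {"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description"}]}'''


def ranked_urls(json_text):
    """Return (title, url) pairs of the text results, highest weight first."""
    data = json.loads(json_text)
    # The weight is delivered as a string, so it is converted for sorting.
    hits = sorted(
        data["text_results"],
        key=lambda hit: float(hit["weight"]),
        reverse=True,
    )
    return [(hit["title"].strip(), hit["url"]) for hit in hits]
```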
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. A wide range of settings is provided for Sphider-plus, separated into different submenus such as: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: To let customers integrate Sphider-plus into existing sites, HTML templates are provided for Search form Text result . . .
. . .
Search form, Text result listing, Media result listing, Most popular queries, etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

If you want to change the settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. A wide range of settings is provided for Sphider-plus, separated into different submenus:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
In order to enable customers to integrate Sphider-plus into existing sites, HTML templates are prepared for
- Search form
- Text result listing
- Media result listing
- Most popular queries
etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate files
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holding lists of details.

All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site.

2.5 Preferred Re-indexing

Each new URL added to the Admin backend can be supplied with a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the sub folder /admin/log/

The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
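The naming scheme above can be taken apart mechanically, for example when sorting or archiving log files. The following is a minimal Python sketch, not part of Sphider-plus; the function name `parse_log_name` and the exact pattern are assumptions derived only from the two example names shown above.

```python
import re
from datetime import datetime

# Layout assumed from the examples: db<number>_<YYMMDD>-<HH.MM.SS>_<thread>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>[^.]+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, start time and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    # Date and time are joined and parsed in one step (two-digit year).
    started = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],  # kept as text, since IDs may be free-form
    }
```

The thread part is returned as a string on purpose, so that names carrying a user-defined ID instead of a plain number are parsed as well.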
. . .
spider.php -new1
php spider.php -new2
etc.

The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)

IDs can be chosen according to personal requirements, but the operating system's limitations for file names should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like:
php spider.php -erased1
php . . .
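Starting several threads by hand can also be scripted. The following Python sketch only builds and launches the command lines shown in this chapter (`php spider.php -new<ID>` and `php spider.php -erased<ID>`). The helper names `thread_commands` and `start_threads` are hypothetical, and it is assumed that the script is run from the folder containing spider.php with `php` available on the search path.

```python
import subprocess

def thread_commands(mode, ids, script="spider.php"):
    """Build one command line per thread, e.g. ['php', 'spider.php', '-new1']."""
    if mode not in ("new", "erased"):
        raise ValueError("mode must be 'new' or 'erased'")
    return [["php", script, f"-{mode}{thread_id}"] for thread_id in ids]

def start_threads(mode, ids):
    """Start all threads in parallel, wait for them and return their exit codes."""
    processes = [subprocess.Popen(command) for command in thread_commands(mode, ids)]
    return [process.wait() for process in processes]
```

For example, `start_threads("erased", [1, 2])` would launch two re-index threads with the IDs 1 and 2 and block until both have finished.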
. . .
4.4 Ignoring parts of a page

Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in a certain part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>

Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added to the page by editing it. A more flexible method, . . .
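The effect of the sphider_noindex tags described in chapter 4.4 can be illustrated with a short Python sketch. It is not the Sphider-plus implementation. The function name `strip_noindex` is hypothetical, and accepting the HTML comment spelling `<!--sphider_noindex-->` in addition to the spelling shown above is an assumption.

```python
import re

# Matches everything from the opening tag up to the nearest closing tag.
# Both "<! sphider_noindex >" and "<!--sphider_noindex-->" are accepted (assumption).
NOINDEX = re.compile(
    r"<!(?:--)?\s*sphider_noindex\s*(?:--)?>"
    r".*?"
    r"<!(?:--)?\s*/sphider_noindex\s*(?:--)?>",
    re.DOTALL | re.IGNORECASE,
)

def strip_noindex(html):
    """Return the page text with all excluded parts replaced by a blank."""
    return NOINDEX.sub(" ", html)
```

Since links inside the excluded part are still followed, a crawler built along these lines would extract links from the original page and pass only the stripped text to the indexer.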
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */ menu[0-5] /

4.6 Indexing only parts of a page by <div id='abc'>

If enabled in Admin settings, the values defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */ table[0-5] /

4.7 Ignore HTML elements defined by <tagname> . . </tagname>

This option is intended to work with the new HTML5 elements such as section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must end with another slash. Example: */nav[0-5]/

Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively.

4.8 Index only HTML elements defined by <tagname> . . </tagname>

This is the vice versa . . .
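The list-file entries used in chapters 4.5 to 4.7 are either plain names or regexp patterns introduced by */ and ended with a slash. How such an entry could be interpreted is sketched below in Python. The function names are hypothetical, and two points are assumptions rather than documented behavior: blanks around the pattern are ignored, and a name must match the pattern completely.

```python
import re

def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading "*/" and the trailing "/".
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the literal text only.
    return re.compile(re.escape(entry))

def matches(entry, name):
    """True if the id or element name is covered by the list-file entry."""
    return compile_entry(entry).fullmatch(name) is not None
```

With the example entry */nav[0-5]/ this accepts nav0 to nav5 and rejects nav7. No ignore-case flag is set, so the comparison stays case-sensitive.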
. . .
of all the duplicate content URLs:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678
and Sphider-plus will understand that the duplicates all refer to the canonical URL:
http://www.example.com/product.php?item=swedish-fish.
The . . .
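How a crawler might read the canonical URL from a page is sketched below in Python. It is assumed here that the page declares it with the standard `<link rel="canonical" href="...">` element in its head section. The class and function names are hypothetical and this is not the Sphider-plus code.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")

def find_canonical(html):
    """Return the canonical URL declared in the page, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

An indexer could then store the content only under the returned URL and treat every other address delivering the same page as a duplicate.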
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE "Cyrillic (DOS)")
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman

ISO
iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat

And specially for old Polish programs:
mazovia

This list is to be read . . .
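The conversion step itself is simple in principle: the bytes are decoded with the charset of the page and encoded again as UTF-8. The following Python sketch illustrates this with Python's built-in codecs, which are a stand-in for the ConvertCharset function and do not cover every name in the list above (for example, there is no built-in mazovia codec). The function name `to_utf8` is hypothetical.

```python
def to_utf8(raw, charset):
    """Decode bytes from the given charset and re-encode them as UTF-8.

    Raises LookupError for a charset name Python does not know and
    UnicodeDecodeError for bytes that are invalid in that charset.
    """
    return raw.decode(charset).encode("utf-8")
```

For example, the single byte 0xE9 in windows-1252 is the letter é, which becomes the two bytes 0xC3 0xA9 in UTF-8.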
. . .
Access denied; you need the RELOAD privilege. . .

10. Enable real-time output of logging data

Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because:
- Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser.
- . . .
. . .
is seen.
- Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output.

As no progress was shown during the index / re-index procedure, waiting for results became tedious. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words

Warning: This option should be used with special care and should not be activated for non ISO-8859 charsets. Some special characters that are part of the word ending might be erased accidentally.

13. Media search for images, audio streams and videos

13.1 Media indexing

Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at'
- Total clicks
- Last clicked
- Query input

'Indexed Image Thumbnails' presenting:
- Thumbnail 150 x 100 pixel
- Image details like title, filename, size of original image, link- and thumb-id
- Option to delete the thumbnail

In order to open the media files, all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure.

16.5 Copy & Move

This section of the Database Management presents only those databases that are correctly configured and assigned and that have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be searched. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link>
<title>warp.gif</title>
<x_size>635</x_size>
<y_size>98</y_size>
</media_result>
<media_result>
<num>2</num>
<type>audio</type>
<url>http://www.abc.de/index.php</url>
. . .
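A client that receives this XML output could read the media results as sketched below in Python. The element names `media_result`, `num`, `type`, `link`, `title`, `x_size` and `y_size` are taken from the excerpt above. The root element `results` and the values of `num` and `type` in the sample are assumptions, because the beginning of the XML file is not shown here.

```python
import xml.etree.ElementTree as ET

# Shortened sample; the root element name is hypothetical.
SAMPLE = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""

def media_results(xml_text):
    """Return one dictionary per <media_result>, mapping tag names to text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("media_result")
    ]
```

All values are returned as text, so numeric fields such as x_size still need to be converted by the caller.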
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
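The JSON output can be processed with any JSON library. The following Python sketch reads the text results from a shortened copy of the example above. The field names are those of the example, the sample omits the long `fulltxt` extracts, and the function name `summarize` is hypothetical.

```python
import json

# Shortened copy of the example output (raw string, so "\/" stays as written).
SAMPLE = r'''{"query":"colour","total_results":2,"text_results":[
{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng"},
{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description"}]}'''

def summarize(json_text):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(json_text)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data.get("text_results", [])
    ]
```

Note that the weight is delivered as a string and the title may carry a leading blank, so both are normalized here before use.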
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
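A client that consumes the JSON result output might read it roughly as follows. This is only a sketch in Python: the sample string is a shortened stand-in that reuses field names from the output shown above and leaves out the truncated `fulltxt` extracts, and `summarize_results` is a hypothetical helper name, not part of Sphider-plus.

```python
import json

# Shortened stand-in for the JSON result output. Only field names that
# appear in the documented sample are used; the text extracts are omitted.
SAMPLE = (
    '{"query":"colour","total_results":2,"num_of_results":2,"from":1,"to":2,'
    '"text_results":['
    '{"num":1,"weight":"100.0",'
    '"url":"http:\\/\\/www.english.le-piaggie.info\\/Info_eng.pdf",'
    '"title":" Info_eng"},'
    '{"num":2,"weight":"50.0",'
    '"url":"http:\\/\\/www.english.le-piaggie.info\\/html\\/description.html",'
    '"title":" Description"}]}'
)


def summarize_results(raw):
    """Return (num, weight, url, title) tuples for all text results."""
    data = json.loads(raw)
    rows = []
    for hit in data.get("text_results", []):
        # "weight" arrives as a string, so convert it before doing arithmetic.
        rows.append(
            (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        )
    return rows


if __name__ == "__main__":
    for row in summarize_results(SAMPLE):
        print(row)
```

The escaped slashes (`\/`) in the output are valid JSON and decode to plain `/`, so no extra unescaping step is needed.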
All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present, version 4.2024a, published February 12, 2024, is the current release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and phishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierarchical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output

[ Documentation ]

1. Settings, customizing and statistics

To change the settings, behavior and design of Sphider-plus, use the Admin interface. It offers a wide range of settings, separated into submenus such as:

Sites:
- Add Site
- Index only the new
- Re-index all
- Re-index only preferred URLs
- Erase Re-index (available also for individual URLs)
- Import/export URL list
- Approve sites
- Banned domains

Categories:
- Add, edit, delete
- Create new subcategory under

Index:
- Basic indexing options
- Advanced options

Clean:
- Clean keywords not associated with any link
- Clean links not associated with any site
- Clean Category table not associated with any site
- Clean Media links
- Clear Temp table
. . .
- Clear Search log
- Clear 'Most Popular Page Links' log
- Clear 'Most Popular Media Links' log
- Clear Spider log, separate and bulk delete
- Clear Thumbnail images, separate and bulk delete
- Clear Text cache
- Clear Media cache
- Clear IDS log file
- Clear flood attempts log file
- Clear all entries in addurl or banned table
- Truncate all tables in database

Settings:
- General Settings
- Index Log Settings
- Spider Settings
- Search Settings
- Order of Result listing
- Suggest Options
- Page Indexing Weights

Database:
- Configure up to 5 databases with unlimited number of table sets
- Activate separately for 'Admin', 'Search' user and 'Suggest URL user'
- Backup / Restore
- Copy / Move
- Optimize

Templates:
To let customers integrate Sphider-plus into existing sites, HTML templates are prepared for
Search form
Text result listing
Media result listing
Most popular queries
etc.
Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file
/templates/My_template/adminstyle.css
/templates/My_template/userstyle.css

Statistics output:
- Top keywords (Top 50 with hit counter).
- All indexed thumbnails with ID3 and EXIF info.
- Largest pages offering link URL and file size.
- Most Popular Searches for text links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50)
- Most Popular Links (click counter).
- Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100)
- Index log offering: File-name, index date and delete option
- sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes
- IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion.
- Flood attempts log offering: IP, query, date and time of flood attempt.
- Auto Re-index log file
- Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details.

All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added in the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in the subfolder /admin/log/ The names of the log files look like:
db2_100524-21.47.56_1.html (log file of first thread)
db2_100524-21.48.12_2.html (log file of second thread)
and are built from the following items:
db2 - Number of database.
100524 - Date (May 24, 2010)
21.47.56 - Time when this thread . . .
. . .
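The naming scheme of the thread log files can be taken apart again with a few lines of code. The following Python sketch assumes that the date part is ordered year, month, day (inferred from "100524 - Date (May 24, 2010)") and that the last part of the name is the thread identifier; `parse_log_name` is a hypothetical helper, not a Sphider-plus function.

```python
import re
from datetime import datetime

# db<number>_<YYMMDD>-<HH.MM.SS>_<thread>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})"
    r"_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised log file name: " + name)
    # Assumption: the six date digits are year, month, day.
    started = datetime.strptime(
        match["date"] + " " + match["time"], "%y%m%d %H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }


if __name__ == "__main__":
    print(parse_log_name("db2_100524-21.47.56_1.html"))
```

Such a helper could be used, for example, to sort the files in /admin/log/ by database and start time before reviewing them.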
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the names of the corresponding log files like:
db2_100524-21.47.56_ID1.html (log file of first thread)
db2_100524-21.48.12_ID2.html (log file of second thread)
IDs can be chosen according to personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
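The MD5 comparison mentioned above can be illustrated with a short Python sketch. How Sphider-plus stores the checksum and which part of the page it hashes is not described here, so the functions below are hypothetical and only show the general idea: a page is re-indexed only if the checksum of its content differs from the stored one.

```python
import hashlib


def md5sum(content: bytes) -> str:
    """Return the hexadecimal MD5 checksum of the given page content."""
    return hashlib.md5(content).hexdigest()


def page_changed(stored_md5: str, content: bytes) -> bool:
    """True if the content no longer matches the checksum stored earlier."""
    return md5sum(content) != stored_md5


if __name__ == "__main__":
    stored = md5sum(b"<html>old content</html>")
    print(page_changed(stored, b"<html>old content</html>"))
    print(page_changed(stored, b"<html>new content</html>"))
```

Because unchanged pages can be skipped after a single checksum comparison, keeping the stored checksums is what makes the fast re-index procedure possible.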
4.4 Ignoring parts of a page
Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in the same part of most pages (like a header, footer or menu). Any part of a page between the <! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed.

4.5 Ignoring parts of a page by <div id='abc'>
Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
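The list-file entries described in sections 4.5 to 4.7 are either plain names or regexp patterns introduced by */ and ended with another slash. The Python sketch below shows one way such an entry could be interpreted. It rests on assumptions that the text does not confirm: blanks around the pattern are ignored, the pattern has to match the whole name, and comparison is case-sensitive as stated for /include/common/elements_not.txt. `compile_entry` is a hypothetical name.

```python
import re


def compile_entry(entry):
    """Turn one list-file entry into a function that tests a name."""
    text = entry.strip()
    # Entries such as "*/nav[0-5]/" are treated as regexp patterns.
    if text.startswith("*/") and text.endswith("/") and len(text) > 3:
        pattern = re.compile(text[2:-1].strip())
        return lambda name: pattern.fullmatch(name) is not None
    # Every other entry is compared literally and case-sensitively.
    return lambda name: name == text


if __name__ == "__main__":
    is_nav = compile_entry("*/nav[0-5]/")
    for candidate in ("nav3", "nav7", "NAV3"):
        print(candidate, is_nav(candidate))
```

With these assumptions, the entry */nav[0-5]/ covers nav0 to nav5 in a single line instead of six literal entries.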
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode:

WINDOWS
windows-1250 - Central Europe
windows-1251 - Cyrillic
windows-1252 - Latin I
windows-1253 - Greek
windows-1254 - Turkish
windows-1255 - Hebrew
windows-1256 - Arabic
windows-1257 - Baltic
windows-1258 - Viet Nam
cp874 - Thai - this file is also for DOS

DOS
cp437 - Latin US
cp737 - Greek
cp775 - BaltRim
cp850 - Latin1
cp852 - Latin2
cp855 - Cyrillic
cp857 - Turkish
cp860 - Portuguese
cp861 - Iceland
cp862 - Hebrew
cp863 - Canada
cp864 - Arabic
cp865 - Nordic
cp866 - Cyrillic Russian (this is the one used in IE: Cyrillic (DOS))
cp869 - Greek2

MAC (Apple)
x-mac-cyrillic
x-mac-greek
x-mac-icelandic
x-mac-ce
x-mac-roman

ISO
iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16

MISCELLANEOUS
gsm0338 (ETSI GSM 03.38)
cp037 cp424 cp500 cp856 cp875 cp1006 cp1026
koi8-r (Cyrillic)
koi8-u (Cyrillic Ukrainian)
nextstep
us-ascii
us-ascii-quotes
DSP implementation for NeXT
stdenc
symbol
zdingbat

And especially for old Polish programs:
mazovia

This list is to be read . . .
. . .
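The effect of such a conversion can be shown with a short Python sketch. It does not use the ConvertCharset function; it relies on Python's built-in codecs instead, and `to_utf8` is a hypothetical helper. Not every name in the list above has a built-in Python codec (mazovia, for instance, does not), so the sketch only covers charsets that Python knows.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes from the source charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


if __name__ == "__main__":
    # The same single-byte value means different characters per charset,
    # which is why the source charset must be known before converting.
    print(to_utf8(b"caf\xe9", "iso-8859-1"))
    print(to_utf8(b"\xcf\xf0\xe8\xe2\xe5\xf2", "windows-1251").decode("utf-8"))
```

Once all text is stored as UTF-8, indexing and searching no longer need to care about the charset of the original page.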
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in the Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer could be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend, could be supplied with a priority level. This level will be used by the option 'Re-index only preferred sites'. Level 1 will be interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and is build by the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs could be defined by personal requirements, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing could be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. Thi 775e s can for example be used to prevent search result flooding when certain keywords appear on certain part in most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is foreseen to cooperate with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitive. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrylic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrylic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
. . .
Access denied; you need the RELOAD privilege. . . Top 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re- index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased by accidental. Top 13. Media search for images, audio streams and videos 13.1 Media indexing Index of media files is enabled by separated Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Never the less to find these x results, the complete database had to been browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db could be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} Top . . .
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails with ID3 and EXIF info. - Largest pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest 100) - Index log offering: . . .
. . .
Country, Host name (Latest 100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output, sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file, PHP integration, PHP security info. Each item holds lists of details. All text links, media links and thumbnails are active links. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added in the Admin backend can be assigned a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
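The naming scheme above can be taken apart mechanically. The following is a minimal Python sketch, not part of Sphider-plus: the function name `parse_log_name` is hypothetical, and it assumes that the date is written as YYMMDD (as the example "100524 - May 24, 2010" suggests) and that the final field identifies the thread.

```python
import re
from datetime import datetime

# Matches names such as db2_100524-21.47.56_1.html
# (database number, date, start time, thread suffix).
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})_(?P<thread>\w+)\.html$"
)

def parse_log_name(name):
    """Split a log file name into database number, start time and thread suffix."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised log file name: {name!r}")
    # Assumption: the date is YYMMDD and the time is HH.MM.SS.
    started = datetime.strptime(match["date"] + match["time"], "%y%m%d%H.%M.%S")
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }
```

Such a helper could be used to sort or filter the files found in /admin/log/ by database or start time.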
. . .
spider.php -new1 php spider.php -new2 etc. The IDs are added to the names of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen according to personal requirements, but the OS limitations for file names should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
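Starting several threads by hand can be scripted. The sketch below is an illustration in Python and not part of Sphider-plus; the helper names `build_commands` and `launch` are hypothetical. It only assumes what the text shows: each thread is started as `php spider.php -<option><ID>`, for example `-erased1`.

```python
import subprocess

def build_commands(option, ids, php="php", script="spider.php"):
    """Return one command line per thread, e.g. ['php', 'spider.php', '-erased1']."""
    return [[php, script, f"-{option}{thread_id}"] for thread_id in ids]

def launch(commands):
    """Start all commands in parallel and wait until every one has finished."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]
```

`build_commands("erased", [1, 2])` yields the two command lines for the first and second thread; `launch` would have to be called from the folder that contains spider.php.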
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in the same part of most pages (like a header, footer or menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed; however, links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added to the page by editing it. A more flexible method, . . .
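The effect of the <! sphider_noindex > tags from 4.4 can be illustrated with a short Python sketch. This is not the Sphider-plus implementation; the function name `strip_noindex` is hypothetical, and tolerating variable whitespace inside the tags is an assumption.

```python
import re

# Everything from the opening tag up to the next closing tag, across line breaks.
NOINDEX = re.compile(
    r"<!\s*sphider_noindex\s*>.*?<!\s*/sphider_noindex\s*>",
    re.DOTALL,
)

def strip_noindex(html):
    """Return the page text with all noindex sections removed."""
    return NOINDEX.sub("", html)
```

Because links inside the excluded part are still followed, a crawler built this way would collect links from the unmodified page and pass only the stripped text to the indexer.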
. . .
a regexp pattern. The regexp must be introduced by */ and must end with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values defined in the list file /include/common/divs_use.txt are used to index only the content between <div . . .
. . .
a regexp pattern. The regexp must be introduced by */ and must end with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended for the new HTML5 elements such as section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp must be introduced by */ and must end with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt are processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
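The list-file convention (plain entries, or regexp entries written as */pattern/) can be sketched in Python as follows. This is an illustration only: the helper names `compile_entry` and `is_listed` are hypothetical, and both the trimming of blanks around the pattern and the use of a full, case-sensitive match are assumptions rather than documented Sphider-plus behavior.

```python
import re

def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern."""
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        # Regexp entry: drop the leading */ and the trailing slash.
        return re.compile(entry[2:-1].strip())
    # Plain entry: match the text literally.
    return re.compile(re.escape(entry))

def is_listed(name, entries):
    """True if the id or element name matches any entry of the list file."""
    return any(compile_entry(entry).fullmatch(name) for entry in entries)
```

With the entries `*/nav[0-5]/` and `footer`, the names `nav3` and `footer` are matched, while `NAV3` and `nav9` are not.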
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
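The conversion step itself, turning text in one of the listed charsets into UTF-8, can be shown with Python's built-in codecs. This is only an analogy to the ConvertCharset function mentioned above, not its implementation; the helper name `to_utf8` is hypothetical, and Python does not know every charset in the list.

```python
def to_utf8(raw, source_charset):
    """Decode bytes from the given charset and re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")

# Six bytes in windows-1251 (Cyrillic) become twelve bytes in UTF-8.
cyrillic = b"\xcf\xf0\xe8\xe2\xe5\xf2"
converted = to_utf8(cyrillic, "windows-1251")
```

An unknown charset name raises `LookupError`, and bytes that are invalid in the source charset raise `UnicodeDecodeError`, so a real converter would need to handle both cases.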
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, no output was available during index / re-index because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during the index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin settings together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and should not be activated for non-ISO-8859 charsets. Some special characters at the end of a word might be erased by accident. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail All tables contain active links for opening the media files. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management presents only those databases that are correctly configured, assigned and have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be browsed. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. The main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
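A client could read such media results with Python's standard XML parser. The sketch below is an assumption-laden illustration: the enclosing `<results>` element, the `image` type value and the function name `read_media_results` are made up for the example, while the child element names are taken from the excerpt above.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; only the element names inside
# <media_result> come from the excerpt.
SAMPLE = """<results>
  <media_result>
    <num>1</num>
    <type>image</type>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
</results>"""

def read_media_results(xml_text):
    """Return one dict (tag -> text) per <media_result> element."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "") for child in node}
        for node in root.iter("media_result")
    ]
```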
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
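The JSON output shown above can be consumed with any JSON parser. The following Python sketch uses a shortened, made-up sample (the example.com URLs and titles are invented; the field names `text_results`, `num`, `weight`, `url` and `title` are taken from the excerpt), and the function name `list_hits` is hypothetical.

```python
import json

# Shortened sample in the same shape as the output above.
SAMPLE = (
    r'{"query":"colour","total_results":2,"text_results":['
    r'{"num":1,"weight":"100.0","url":"http:\/\/www.example.com\/a.pdf","title":" A"},'
    r'{"num":2,"weight":"50.0","url":"http:\/\/www.example.com\/b.html","title":" B"}]}'
)

def list_hits(json_text):
    """Return (num, weight, url, title) for every text result."""
    data = json.loads(json_text)
    return [
        (hit["num"], float(hit["weight"]), hit["url"], hit["title"].strip())
        for hit in data["text_results"]
    ]
```

Note that the weight is delivered as a string and the slashes in URLs are escaped as `\/`; the parser converts the escapes back, and the weight has to be converted explicitly.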
5a2 All required information. Introduction Release and Legal Info Installation Documentation Change Log [ Documentation Summary ] Preamble: The info presented here is valid only for the latest release of Sphider-plus. At present version 4.2024a published February 12, 2024 is the actual release. 1. Settings, customizing and statistics 2. Indexing . . .
. . .
2. Indexing 2.1 Various options 2.2 Allow other hosts in same domain 2.3 Word stemming 2.4 Periodical Re-indexing 2.5 Preferred indexing 2.6 Multithreaded indexing 2.7 Create thumbnails during index procedure 2.8 Prevent indexing of known malware and pishing pages 2.9 Follow and create sitemap files 2.10 Use private sitemap instead of global . . .
. . .
Sitemap file 3. Using the indexer from command line 3.1 All options 3.2 Multithreaded indexing 4. Keeping pages, words and files from being indexed 4.1 robots.txt 4.2 Must include / must not include string list 4.3 Ignoring links 4.4 Ignoring parts of a page by <! sphider_noindex > 4.5 Ignoring parts of a page by <div id='abc'> . . .
. . .
charset' 6. Search modes 6.1 Search with wildcards * 6.2 Strict search ! 6.3 Tolerant search 6.4 Link search 6.5 Media search 6.6 Search only in one domain 6.7 Search in categories 6.8 Greek language support 6.9 Block queries 7. Chronological order for result listing 7.1 Text result listing 7.2 Media result listing 8. PDF converter 9. Clean . . .
. . .
of logging data 11. Error messages and Debug mode 12. Delete secondary characters 13. Media search for images, audio streams and videos 13.1 Media indexing 13.2 Not supported media content 13.3 Search for media content 13.4 Statistics for media content 14. Feed support 14.1 XML product feeds 14.2 RDF, RSD, RSS and Atom feeds 15. Result . . .
. . .
Overview 16.2 Definition and configuration 16.3 Activate / disable database 16.4 Backup & Restore of databases 16.5 Copy & and Move 16.6 Enhancing functionality of multiple database support 17. Search in categories 17.1 Hierachical structure 17.2 Parallel structure 18. User suggested sites 19. Vulnerability protection 19.1 Prevent queries . . .
. . .
templates 22.2 Embed the search engine into existing HTML code 22.3 The different style sheet files 23. JSON, XML and RSS result output [ Documentation ] 1. Settings, customizing and statistics If you want to change settings, behavior and design of Sphider-plus, you can do so by means of the Admin interface. There is a wide range of . . .
. . .
interface. There is a wide range of settings foreseen for Sphider-plus. Separated into different submenus like: Sites: - Add Site - Index only the new - Re-index all - Re-index only preferred URLs - Erase Re-index (available also for individual URLs) - Import/export URL list - Approve sites - Banned domains Categories: - Add, edit, delete - . . .
. . .
URL list - Approve sites - Banned domains Categories: - Add, edit, delete - Create new subcategory under Index: - Basic indexing options - Advanced options Clean: - Clean keywords not associated with any link - Clean links not associated with any site - Clean Category table not associated with any site - Clean Media links - Clear Temp table . . .
. . .
- Clear Search log - Clear 'Most Popular Page Links' log - Clear 'Most Popular Media Links' log - Clear Spider log, separate and bulk delete - Clear Thumbnail images, separate and bulk delete - Clear Text cache - Clear Media cache - Clear IDS log file - Clear flood attempts log file - Clear all entries in addurl or banned table - Truncate all . . .
. . .
flood attempts log file - Clear all entries in addurl or banned table - Truncate all tables in database Settings: - General Settings - Index Log Settings - Spider Settings - Search Settings - Order of Result listing - Suggest Options - Page Indexing Weights Database: - Configure up to 5 databases with unlimited number of table sets - Activate . . .
. . .
Database: - Configure up to 5 databases with unlimited number of table sets - Activate separately for 'Admin', 'Search' user and 'Suggest URL user' - Backup / Restore - Copy / Move - Optimize Templates: In order to enable customer's integration of Sphider-plus into existing sites, HTML templates are prepared for Search form Text result . . .
. . .
Search form Text result listing Media result listing Most popular queries etc. Three different designs are offered, which may be selected in submenu 'Settings'. If the layout does not fit the design of your site (which is normal), you may create new designs and modify the appropriate file /templates/My_template/adminstyle.css . . .
. . .
the appropriate file /templates/My_template/adminstyle.css /templates/My_template/userstyle.css Statistics output: - Top keywords (Top 50 with hit counter). - All indexed thumbnails w 3e80 ith ID3 and EXIF info. - Larges pages offering link URL and file size. - Most Popular Searches for text links offering: Link addr., total clicks, last . . .
. . .
Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Searches for media links offering: Link addr., total clicks, last clicked, last query (Top 50) - Most Popular Links (click counter). - Search log offering: Query, Results, Queried at, Time taken, User IP, Country, Host name (Latest100) - Index log offering: . . .
. . .
Country, Host name (Latest100) - Index log offering: File-name, index date and delete option - sitemap log offering: sitemap.xml output sitemap list offering file/page suffixes - IDS log offering: IP, host, query, impact, involved tags, date and time of intrusion. - Flood attempts log offering: IP, query, date and time of flood attempt. - Auto . . .
. . .
attempts log offering: IP, query, date and time of flood attempt. - Auto Re-index log file - Server info offering: Server software, environment, MySQL, PDF-converter, image functions, php.ini file PHP integration, PHP security info. Each item holding lists of details. All text links, media links and thumbnails are active linked. As stated in . . .
. . .
individual Re-indexing, the periodical Re-indexer can be started and aborted in the "Options" menu of each site. 2.5 Preferred Re-indexing Each new URL added to the Admin backend can be given a priority level. This level is used by the option 'Re-index only preferred sites'. Level 1 is interpreted as most important, while . . .
. . .
less all index results will be stored in log files in sub folder /admin/log/ The names of the log files look like: db2_100524-21.47.56_1.html (log file of first thread) db2_100524-21.48.12_2.html (log file of second thread) and are built from the following items: db2 - Number of database. 100524 - Date (May 24, 2010) 21.47.56 - Time when this thread . . .
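The naming scheme above can be illustrated with a small parser. This is only a sketch in Python, not part of Sphider-plus: the function name parse_log_name and the returned dictionary keys are hypothetical, and the pattern assumes the layout shown in the examples (database number, date as YYMMDD, time as HH.MM.SS, thread number or ID).

```python
import re
from datetime import datetime

# Assumed layout, taken from the example names above:
#   db<database>_<YYMMDD>-<HH.MM.SS>_<thread or ID>.html
LOG_NAME = re.compile(
    r"^db(?P<db>\d+)_(?P<date>\d{6})-(?P<time>\d{2}\.\d{2}\.\d{2})"
    r"_(?P<thread>[^.]+)\.html$"
)


def parse_log_name(name):
    """Split a log file name into database number, start time and thread."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised log file name: " + name)
    started = datetime.strptime(
        match["date"] + match["time"], "%y%m%d%H.%M.%S"
    )
    return {
        "database": int(match["db"]),
        "started": started,
        "thread": match["thread"],
    }


if __name__ == "__main__":
    print(parse_log_name("db2_100524-21.47.56_1.html"))
```

Because the thread part accepts any characters except a dot, the same sketch also handles names that carry an individual ID instead of a plain thread number.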
. . .
spider.php -new1 php spider.php -new2 etc. The IDs will be added to the name of the corresponding log files like: db2_100524-21.47.56_ID1.html (log file of first thread) db2_100524-21.48.12_ID2.html (log file of second thread) IDs can be chosen as required, but the limitations for file names with respect to the OS should be taken . . .
. . .
but will not erase the content of all the other tables. So the check whether the content of a page has changed (MD5sum) is still available for a fast re-index procedure. Once prepared, multithreaded re-indexing can be invoked by starting several threads and adding individual IDs to the option parameter like: php spider.php -erased1 php . . .
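The idea behind the MD5sum check mentioned above can be sketched as follows. This is a generic Python illustration, not the Sphider-plus code: the function names md5sum and has_changed are hypothetical, and it is assumed that the checksum of the page content was stored during the previous index run.

```python
import hashlib


def md5sum(content):
    """Return the MD5 checksum of the page content as a hex string."""
    return hashlib.md5(content).hexdigest()


def has_changed(content, stored_md5):
    """True if the freshly fetched content differs from the indexed version."""
    return md5sum(content) != stored_md5


if __name__ == "__main__":
    stored = md5sum(b"<html>old page</html>")          # saved at first index
    print(has_changed(b"<html>old page</html>", stored))  # unchanged page
    print(has_changed(b"<html>new page</html>", stored))  # modified page
```

A page whose checksum is unchanged can be skipped, which is why keeping the checksums makes the re-index procedure fast.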
. . .
4.4 Ignoring parts of a page Sphider-plus includes an option to exclude parts of pages from being indexed. This can, for example, be used to prevent search result flooding when certain keywords appear in the same part of most pages (like a header, footer or a menu). Any part of a page between <! sphider_noindex > and <! . . .
. . .
<! sphider_noindex > and <! /sphider_noindex > tags is not indexed, however links in it are followed. 4.5 Ignoring parts of a page by <div id='abc'> Ignoring parts of a page by the <! sphider_noindex > tags requires direct access to the page, because the tags need to be added (edited) to the page. A more flexible method, . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ menu[0-5] / 4.6 Indexing only parts of a page by <div id='abc'> If enabled in Admin settings, the values as defined in the list-file /include/common/divs_use.txt will be used to index only the content between <div . . .
. . .
a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */ table[0-5] / 4.7 Ignore HTML elements defined by <tagname> . . </tagname> This option is intended to work with the new HTML5 elements like section, nav, aside, hgroup, article, header, footer, etc. HTML elements . . .
. . .
contain a regexp pattern. The regexp needs to be introduced by */ and must be ended with another slash. Example: */nav[0-5]/ Please keep in mind that element names placed in /include/common/elements_not.txt will be processed case-sensitively. 4.8 Index only HTML elements defined by <tagname> . . </tagname> This is the vice versa . . .
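The */ . . . / notation used for regexp entries in the list-files above can be illustrated with a short Python sketch. It is not the Sphider-plus implementation: the helper names compile_entry and is_listed are hypothetical, and it is assumed that a plain entry must match the whole name literally, that a regexp entry must match the whole name, and that blanks around the pattern are ignored.

```python
import re


def compile_entry(entry):
    """Turn one list-file entry into a compiled, case-sensitive pattern.

    Entries introduced by '*/' and ended by '/' are treated as regexps;
    every other entry is treated as a literal name (assumption).
    """
    entry = entry.strip()
    if entry.startswith("*/") and entry.endswith("/") and len(entry) > 3:
        return re.compile(entry[2:-1].strip())
    return re.compile(re.escape(entry))


def is_listed(name, entries):
    """True if the id or element name is covered by any entry of the list."""
    return any(compile_entry(e).fullmatch(name) for e in entries)


if __name__ == "__main__":
    entries = ["footer", "*/nav[0-5]/"]
    for name in ("footer", "nav3", "nav7", "Footer"):
        print(name, is_listed(name, entries))
```

With these assumptions, nav0 to nav5 and footer are covered by the list, while nav7 and Footer (different case) are not.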
. . .
of all the duplicate content URLs: http://www.example.com/product.php?item=swedish-fish&category=gummy-candy http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678 and Sphider-plus will understand that the duplicates all refer to the canonical URL: http://www.example.com/product.php?item=swedish-fish. The . . .
. . .
charset are supported by the ConvertCharset function and will be used to convert text into UTF-8 Unicode: WINDOWS windows-1250 - Central Europe windows-1251 - Cyrillic windows-1252 - Latin I windows-1253 - Greek windows-1254 - Turkish windows-1255 - Hebrew windows-1256 - Arabic windows-1257 - Baltic windows-1258 - Viet Nam cp874 - Thai - this . . .
. . .
- Baltic windows-1258 - Viet Nam cp874 - Thai - this file is also for DOS DOS cp437 - Latin US cp737 - Greek cp775 - BaltRim cp850 - Latin1 cp852 - Latin2 cp855 - Cyrillic cp857 - Turkish cp860 - Portuguese cp861 - Iceland cp862 - Hebrew cp863 - Canada cp864 - Arabic cp865 - Nordic cp866 - Cyrillic Russian (this is the one, used in IE . . .
. . .
IE Cyrillic (DOS) ) cp869 - Greek2 MAC (Apple) x-mac-cyrillic x-mac-greek x-mac-icelandic x-mac-ce x-mac-roman ISO iso-8859-1 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-9 iso-8859-10 iso-8859-11 iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 . . .
. . .
iso-8859-12 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 MISCELLANEOUS gsm0338 (ETSI GSM 03.38) cp037 cp424 cp500 cp856 cp875 cp1006 cp1026 koi8-r (Cyrillic) koi8-u (Cyrillic Ukrainian) nextstep us-ascii us-ascii-quotes DSP implementation for NeXT stdenc symbol zdingbat And specially for old Polish programs: mazovia This list is to be read . . .
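What converting text from one of the listed charsets into UTF-8 means can be shown with a minimal Python sketch. It uses Python's built-in codecs and not the ConvertCharset function itself; the function name to_utf8 is hypothetical, and some charsets from the list (for example gsm0338 or mazovia) are not available as Python codecs.

```python
def to_utf8(raw, charset):
    """Decode bytes given in the source charset and re-encode them as UTF-8."""
    return raw.decode(charset).encode("utf-8")


if __name__ == "__main__":
    # One byte in iso-8859-1 becomes a two-byte sequence in UTF-8.
    print(to_utf8(b"\xe9", "iso-8859-1"))
    # Cyrillic text stored as koi8-r, converted for a UTF-8 index.
    koi8_bytes = "Привет".encode("koi8-r")
    print(to_utf8(koi8_bytes, "koi8-r").decode("utf-8"))
```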
. . .
Access denied; you need the RELOAD privilege. . . 10. Enable real-time output of logging data Up to version 1.5 of Sphider-plus, during index / re-index there was no printout available because: - Several servers, especially on Win32, buffer the output from the script until it terminates before transmitting the results to the browser. - . . .
. . .
is seen. - Some versions of Microsoft Internet Explorer only start to display the page after they have received 256 bytes of output. As progress was not presented during index / re-index procedure, waiting for results became a pain in the neck. Selectable in Admin setting together with the update interval (1 - 10 seconds), AJAX technology was . . .
. . .
characters in front of words Warning: This option should be used with special care and not be activated for non ISO-8859 charsets. Some special characters as part of the word ending might be erased accidentally. 13. Media search for images, audio streams and videos 13.1 Media indexing Indexing of media files is enabled by separate Admin . . .
. . .
and 'Found at' - Total clicks - Last clicked - Query input 'Indexed Image Thumbnails' presenting: - Thumbnail 150 x 100 pixel - Image details like title, filename, size of original image, link- and thumb-id - Option to delete the thumbnail In order to open the media files all tables contain active links. Media results are also stored in . . .
. . .
Settings' menu of the admin backend: First option: For results of XML product feeds, present the text extract of 350 characters as part of product attributes. If this option is not activated, all products of the XML product feed will be presented in search results, and the hits of the query string will be highlighted in all involved products. . . .
. . .
content of the database tables (those with the same table prefix) will be destroyed by the restore procedure. 16.5 Copy & Move This section of the Database Management will present only those databases, which are correctly configured, assigned and do have a set of installed tables as described in chapter Definition and configuration. This . . .
. . .
results. Nevertheless, to find these x results, the complete database had to be searched. Starting with version 2.5, an additional clean option is offered as part of the Admin backend. Main advantage of this option is a significant reduction of the search time for any query, because the content of the db can be limited to offer only x . . .
. . .
<link>http://www.abc.de/images/warp.gif</link> <title>warp.gif</title> <x_size>635</x_size> <y_size>98</y_size> </media_result> <media_result> <num>2</num> <type>audio</type> <url>http://www.abc.de/index.php</url> . . .
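A client could read such media results along the following lines. This Python sketch only uses the element names visible in the excerpt above; the wrapping <results> element, the SAMPLE text and the function name media_results are assumptions made for illustration, because the root element of the real output is not shown here.

```python
import xml.etree.ElementTree as ET

# Sample built from the fields visible in the excerpt; <results> is a
# hypothetical root element added so that the fragment is well-formed.
SAMPLE = """<results>
  <media_result>
    <link>http://www.abc.de/images/warp.gif</link>
    <title>warp.gif</title>
    <x_size>635</x_size>
    <y_size>98</y_size>
  </media_result>
  <media_result>
    <num>2</num>
    <type>audio</type>
    <url>http://www.abc.de/index.php</url>
  </media_result>
</results>"""


def media_results(xml_text):
    """Return one dictionary per <media_result>, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("media_result")
    ]


if __name__ == "__main__":
    for result in media_results(SAMPLE):
        print(result)
```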
. . .
the following content: {"query":"colour","ip":"::1","host_name":"guard_007-hoster","query_time":"2016-03-18 10:56:23 AM","consumed":0.016,"total_results":2,"num_of_results":2,"from":1,"to":2,"text_results":[{"num":1,"weight":"100.0","url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf","title":" Info_eng","fulltxt":" . . . . . .
. . .
placed on the ground floor, today's big kitchen with the heating fireplace for open and close mode of . . . ","page_size":"5,661.6kb","domain_name":"www.english.le-piaggie.info"},{"num":2,"weight":"50.0","url":"http:\/\/www.english.le-piaggie.info\/html\/description.html","title":" Description","fulltxt":" . . . cable. Additional active . . .
. . .
D.O.C.G. as far as to the Pratomagno mountains 160 olive-trees Detached house, about 800 m to reach the . . . ","page_size":"26.4 kb","domain_name":"www.english.le-piaggie.info"}]} . . .
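The JSON result shown above could be processed by a client roughly as follows. This is a Python sketch, not part of Sphider-plus: the SAMPLE text repeats the keys of the example with the long 'fulltxt' extracts left out, and the function name summarize is hypothetical.

```python
import json

# Shortened copy of the example result; the 'fulltxt' extracts are omitted.
SAMPLE = r'''{"query":"colour","ip":"::1","host_name":"guard_007-hoster",
"query_time":"2016-03-18 10:56:23 AM","consumed":0.016,
"total_results":2,"num_of_results":2,"from":1,"to":2,
"text_results":[
 {"num":1,"weight":"100.0",
  "url":"http:\/\/www.english.le-piaggie.info\/Info_eng.pdf",
  "title":" Info_eng","page_size":"5,661.6kb",
  "domain_name":"www.english.le-piaggie.info"},
 {"num":2,"weight":"50.0",
  "url":"http:\/\/www.english.le-piaggie.info\/html\/description.html",
  "title":" Description","page_size":"26.4 kb",
  "domain_name":"www.english.le-piaggie.info"}]}'''


def summarize(json_text):
    """Return (num, weight, url, title) for every entry in 'text_results'."""
    data = json.loads(json_text)
    return [
        (r["num"], float(r["weight"]), r["url"], r["title"].strip())
        for r in data["text_results"]
    ]


if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(row)
```

Note that the weight is delivered as a string and the URLs use escaped slashes; json.loads resolves the escapes, and the sketch converts the weight to a number.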