Dataset: Diagnostic Evaluation of Wildcard Processing in Web of Science Smart Search
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/7z9fbs8wmn
Description:
This dataset presents a diagnostic evaluation of wildcard processing in the Web of Science (WoS) Smart Search interface. The study tests whether “smart” search tools reliably execute fundamental retrieval functions – specifically, right truncation (*), internal masking (?), left truncation (*), and their combinations.
Data collection: 65 structured test points across seven dimensions (Right Truncation, Internal Masking, Left Truncation, Combined Wildcards, Special Characters, Case Variants, Edge Cases). Each test records: user input, Smart Search parsed query, Smart Search result count, Advanced Search benchmark query, and benchmark result count. All searches were performed on the same day on the WoS Core Collection without date restrictions.
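As an illustration of the per-test record described above, the following minimal Python sketch models one test point and the two ratios used in the findings. The class name, field names, and method names are hypothetical and are not taken from the dataset files. The counts for *omics come from the findings; the parsed query and benchmark query strings are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TestPoint:
    """One structured test point (hypothetical field names)."""
    dimension: str                   # e.g. "Left Truncation"
    user_input: str                  # what the user typed into Smart Search
    parsed_query: str                # how Smart Search parsed the input
    smart_count: int                 # Smart Search result count
    benchmark_query: str             # Advanced Search benchmark query
    benchmark_count: Optional[int]   # None if no benchmark count is available

    def coverage_pct(self) -> Optional[float]:
        """Smart Search results as a percentage of the benchmark."""
        if not self.benchmark_count:
            return None
        return 100.0 * self.smart_count / self.benchmark_count

    def relative_diff_pct(self) -> Optional[float]:
        """Signed over-retrieval (+) or under-retrieval (-) vs. the benchmark."""
        if not self.benchmark_count:
            return None
        diff = self.smart_count - self.benchmark_count
        return 100.0 * diff / self.benchmark_count


# Counts are from the description; both query strings are assumed.
omics = TestPoint(
    dimension="Left Truncation",
    user_input="*omics",
    parsed_query="omics",        # assumption: wildcard dropped
    smart_count=51_164,
    benchmark_query="*omics",    # assumption: same string used as benchmark
    benchmark_count=710_319,
)

print(f"coverage: {omics.coverage_pct():.1f}%")
print(f"relative difference: {omics.relative_diff_pct():+.1f}%")
```

Treating a missing or zero benchmark count as "no ratio" rather than raising an error is a design choice of this sketch, so that test points without a usable benchmark can still be stored.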
Key findings reveal systematic failures:
Right truncation (*) ignored in simple queries: cell* misses 8.1% of relevant documents. For irregular words (mouse*), opaque thesaurus expansion adds mice, causing over‑retrieval (+94.8%) with no user transparency.
Left truncation (*) completely ignored: *omics retrieves only 7.2% of the benchmark (51,164 vs. 710,319), with results skewed toward the rare standalone “omics” rather than intended prefixed terms (genomics, proteomics).
Internal masking (?) replaced by space: c?ll becomes a meaningless fragment search, c AND ll.
Multi‑word phrases with truncation (STEM educat*) suffer “stem‑chopping”: 20 vs. 7,623 results.
Quoted wildcards (“3D print*”) have the asterisk replaced by a space, missing >98% of intended records.
Complex combinations (CRISPR*Cas*) parsed as CRISPR Cas, causing massive over‑retrieval (11,553 vs. 260) – 97% of results contain separate words, not exact CRISPR‑Cas* variants.
Short stems (5G*, AI*) are executed even though they are invalid in Advanced Search, but the wildcard is ignored and the results are polluted with noise (84% of AI* results are irrelevant due to language mismatches).
Conditional activation: Wildcards processed correctly in Boolean queries (cell* AND growth matches benchmark), revealing an undocumented dual‑parser architecture.
Depth analysis for eight critical cases includes manual coding of sampled records, confirming mechanisms behind quantitative differences.
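To make the intended wildcard semantics behind these findings concrete, here is a minimal Python sketch that translates a single wildcard term into a regular expression, with * matching zero or more characters and ? matching exactly one. It is a simplified single-token model, not the Web of Science parser. The function names are hypothetical, and `masked_as_space` only mimics the "? replaced by space" behaviour reported above.

```python
import re
from typing import List


def wildcard_to_regex(term: str) -> "re.Pattern[str]":
    """Translate a single-token wildcard term into a compiled regex.

    Simplified model: '*' = zero or more non-space characters,
    '?' = exactly one non-space character, everything else literal.
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\S*")
        elif ch == "?":
            parts.append(r"\S")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(term: str, word: str) -> bool:
    """True if the whole word satisfies the wildcard term."""
    return wildcard_to_regex(term).fullmatch(word) is not None


def masked_as_space(term: str) -> List[str]:
    """Mimic the reported failure: '?' becomes a space, splitting the term."""
    return term.replace("?", " ").split()


# Intended behaviour of the three wildcard positions.
print(matches("cell*", "cells"))      # right truncation
print(matches("c?ll", "cell"))        # internal masking
print(matches("*omics", "genomics"))  # left truncation

# 'mice' is not a truncation match for mouse*; retrieving it implies
# some other expansion step, as the findings describe.
print(matches("mouse*", "mice"))

# The reported rewrite of c?ll yields two unrelated fragments.
print(masked_as_space("c?ll"))
```

Comparing the words a term should match under this model with the records a search actually returns is one way to see why c AND ll or a standalone "omics" search diverges from the user's intent.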
Dataset contents:
1_Methodology_Overview.xlsx – Core dimensions, test stem selection, variation matrix, defect categorisation.
2_Test_Results.xlsx – Complete raw data for all 65 test points.
3_Depth_Analysis.xlsx – Detailed case studies with sampled records and coding.
Implications: Data inform information literacy instruction, reference services, and database evaluation by revealing hidden limitations of “smart” search tools.
Created:
2026-03-11



