A Hierarchical Approach to exploiting Multiple Datasets from TalkBank

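Summary

TalkBank is an online database that facilitates the sharing of linguistics research data.
However, TalkBank's existing API offers only limited data filtering and batch processing.
To overcome these limitations, this paper introduces a pipeline framework that uses a hierarchical search approach to make complex data selection efficient.
The approach first runs a quick preliminary screening for corpora a researcher may need, then performs a detailed search for target data based on specific criteria.
The identified files are indexed so that they are easy to access in later analyses.
The paper also shows how data from different studies, collected with the framework, can be integrated by standardizing and cleaning metadata, so that researchers can extract insights from a large, integrated dataset.
Although designed for TalkBank, the framework can also be adapted to process data from other open-science platforms.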
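
The two-stage selection, indexing, and metadata standardization described above can be illustrated with a minimal sketch. This is not the paper's implementation and it does not call any TalkBank API: the `Corpus` and `TranscriptFile` classes, the function names, the metadata keys (`age_months`, `sex`), and the sample file paths are all hypothetical and chosen only to show the idea on in-memory data.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptFile:
    path: str        # hypothetical file path, for illustration only
    metadata: dict   # hypothetical keys, e.g. {"age_months": 30, "sex": "F"}


@dataclass
class Corpus:
    name: str
    language: str
    files: list = field(default_factory=list)


def screen_corpora(corpora, language):
    """Stage 1: quick screening on a coarse, corpus-level attribute."""
    return [c for c in corpora if c.language == language]


def search_files(corpora, predicate):
    """Stage 2: detailed file-level search, run only on screened corpora."""
    return [(c.name, f) for c in corpora for f in c.files if predicate(f.metadata)]


def build_index(matches):
    """Index matched file paths by corpus name for later access."""
    index = {}
    for corpus_name, f in matches:
        index.setdefault(corpus_name, []).append(f.path)
    return index


def normalize_sex(value):
    """Standardize differing spellings to one label (illustrative mapping)."""
    mapping = {"f": "female", "female": "female", "m": "male", "male": "male"}
    return mapping.get(str(value).strip().lower(), "unknown")


# Made-up sample data standing in for corpora from different studies.
corpora = [
    Corpus("CorpusA", "eng", [
        TranscriptFile("CorpusA/001.cha", {"age_months": 30, "sex": "F"}),
        TranscriptFile("CorpusA/002.cha", {"age_months": 60, "sex": "male"}),
    ]),
    Corpus("CorpusB", "fra", [
        TranscriptFile("CorpusB/001.cha", {"age_months": 28, "sex": "f"}),
    ]),
]

screened = screen_corpora(corpora, "eng")
matches = search_files(
    screened,
    lambda m: m["age_months"] < 36 and normalize_sex(m["sex"]) == "female",
)
index = build_index(matches)
print(index)  # {'CorpusA': ['CorpusA/001.cha']}
```

Screening at the corpus level first means the more expensive per-file predicate runs only on corpora that already passed the coarse filter, which is the point of a hierarchical search.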

Abstract (original)

TalkBank is an online database that facilitates the sharing of linguistics research data. However, TalkBank’s existing API has limited data filtering and batch processing capabilities. To overcome these limitations, this paper introduces a pipeline framework that employs a hierarchical search approach, enabling efficient complex data selection. This approach involves a quick preliminary screening of relevant corpora that a researcher may need, followed by an in-depth search for target data based on specific criteria. The identified files are then indexed, providing easier access for future analysis. Furthermore, the paper demonstrates how data from different studies curated with the framework can be integrated by standardizing and cleaning metadata, allowing researchers to extract insights from a large, integrated dataset. Although designed for TalkBank, the framework can also be adapted to process data from other open-science platforms.

arXiv information

Author	Man Ho Wong
Published	2023-06-21 22:37:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
