Improving Startup Success with Text Analysis

Summary

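Investors want to predict the future success of startup companies, preferably from publicly available data that can be gathered from free online sources.
Public-only data has been shown to work, but there is still much room for improvement.
Two of the best-performing prediction experiments use 17 and 49 features respectively, most of them numeric or categorical.
In this paper, the authors significantly expand and diversify both the sources and the number of features (to 171) to achieve better prediction.
Data collected from Crunchbase, the Google Search API, and Twitter (now X) are used to predict whether a company will raise a round of funding within a fixed time horizon.
Many of the new features are textual, and the Twitter subset includes linguistic metrics such as measures of passive voice and parts of speech.
A total of ten machine learning models are also evaluated to find the best performer.
The adaptable model can predict funding 1 to 5 years into the future, with a variable cutoff threshold that favors either precision or recall.
Prediction under comparable assumptions generally achieves F scores above 0.730, outperforming previous attempts in the literature (0.531), and does so with fewer examples.
Furthermore, the vast majority of the performance impact comes from the top 18 of the 171 features, which are mostly generic company observations, including the best-performing individual feature: the free-form text description of the company.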
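
The variable cutoff threshold mentioned in the summary can be illustrated with a minimal sketch. The labels and scores below are made-up toy values, not data from the paper, and `precision_recall_f1` is a hypothetical helper written for this illustration rather than the authors' code. It assumes a classifier that outputs a funding probability per company, with companies at or above the threshold predicted to raise a round.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Return (precision, recall, F1) when scores >= threshold count as positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1  # predicted funding, company did raise
        elif predicted_positive and label == 0:
            fp += 1  # predicted funding, company did not raise
        elif not predicted_positive and label == 1:
            fn += 1  # missed a company that did raise
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (assumption): 1 = raised a round within the horizon, 0 = did not.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
# Toy model scores (assumption): predicted probability of raising a round.
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.3, 0.5, 0.7):
    p, r, f = precision_recall_f1(y_true, scores, threshold)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

On this toy data, lowering the threshold to 0.3 recovers every funded company at the cost of more false positives, while raising it to 0.7 removes the false positives but misses half of the funded companies. This is the sense in which a cutoff can favor either recall or precision.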

Summary (original)

Investors are interested in predicting the future success of startup companies, preferably using publicly available data which can be gathered using free online sources. Using public-only data has been shown to work, but there is still much room for improvement. Two of the best-performing prediction experiments use 17 and 49 features respectively, mostly numeric and categorical in nature. In this paper, we significantly expand and diversify both the sources and the number of features (to 171) to achieve better prediction. Data collected from Crunchbase, the Google Search API, and Twitter (now X) are used to predict whether a company will raise a round of funding within a fixed time horizon. Many of the new features are textual, and the Twitter subset includes linguistic metrics such as measures of passive voice and parts of speech. A total of ten machine learning models are also evaluated for best performance. The adaptable model can be used to predict funding 1-5 years into the future, with a variable cutoff threshold to favor either precision or recall. Prediction with comparable assumptions generally achieves F scores above 0.730, which outperforms previous attempts in the literature (0.531), and does so with fewer examples. Furthermore, we find that the vast majority of the performance impact comes from the top 18 of 171 features, which are mostly generic company observations, including the best-performing individual feature, which is the free-form text description of the company.

arXiv information

Authors	Emily Gavrilenko, Foaad Khosmood, Mahdi Rastad, Sadra Amiri Moghaddam
Published	2023-12-11 09:22:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
