Many entries tagged with language=Swedish are in fact in German #13
Comments
It doesn't look like very thorough research. I'm curious how much labor it takes to write 3 million tweets in 2 years. Besides being in several languages, the tweets don't appear to correlate with the election; they seem more correlated with normal internet use.
@ericodavis I got info from one person who works with commercial fake FB entries: each 120-word entry paid $2, at a rate of about 10 entries per hour.
@olofhagsand This is a frequent problem that has to do with Twitter's language detection algorithm, particularly on short tweets. The same happens between Danish and Norwegian.
@johannessweater OK thanks.
@olofhagsand Language is part of the metadata that comes with tweets so I'm guessing that's where the language label comes from. But yikes, 50 percent is bad. In my experience it's usually more like 95 percent.
@johannessweater Going through them, I verified 66 entries as Swedish out of the 1021 marked as Swedish (out of 3M total).
I suspect the language field is meaningless. For example,
@jpallas @olofhagsand Yeah, this doesn't seem to be Twitter's language tag. I checked some of these tweets against duplicate tweets I was able to find in my own database from the election, and the language fields don't match up. |
I went through the entries and found 1021 marked as language=Swedish. On closer inspection, many of these are actually German.
One example is entry no. 322020 in IRAhandle_tweets_1.csv:
7.25000000000e+17,BERLINBOTE,Bernd Krömer: Amri-Ausschuss will früheren Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma,Unknown,Swedish,9/8/2017 5:51,9/8/2017 5:51,2230,1779,22274,,German,0,0,NonEnglish
The tweet is definitely German, not Swedish, as all tweets from BERLINBOTE seem to be.
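To check whether the mislabeled rows cluster on a few accounts such as BERLINBOTE, a rough tally of authors among Swedish-tagged rows could look like the sketch below. The column positions (author at index 1, language at index 4) are an assumption read off the sample row quoted above, and `authors_tagged` is a hypothetical helper name, not something from the repository.

```python
import csv
import io
from collections import Counter

# Assumed column positions, inferred from the sample row quoted above.
AUTHOR_COL = 1
LANGUAGE_COL = 4


def authors_tagged(rows, language):
    """Count rows per author among rows whose language field equals `language`."""
    counts = Counter()
    for row in rows:
        # Skip short or malformed rows instead of raising IndexError.
        if len(row) > LANGUAGE_COL and row[LANGUAGE_COL] == language:
            counts[row[AUTHOR_COL]] += 1
    return counts


# The single row quoted in this issue, used as a tiny stand-in for the file.
sample = (
    "7.25000000000e+17,BERLINBOTE,Bernd Krömer: Amri-Ausschuss will früheren "
    "Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma,Unknown,Swedish,"
    "9/8/2017 5:51,9/8/2017 5:51,2230,1779,22274,,German,0,0,NonEnglish\n"
)

swedish_counts = authors_tagged(csv.reader(io.StringIO(sample)), "Swedish")
print(swedish_counts.most_common(10))
```

For the real data, the same function can be fed `csv.reader(open("IRAhandle_tweets_1.csv", newline="", encoding="utf-8"))`; a header row, if present, is skipped naturally because its language field is not the literal string "Swedish".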
You may have made the mistake of treating "ö" and "ä" as identifiers for Swedish?
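To illustrate why "ö" and "ä" alone cannot separate the two languages: both alphabets contain them, whereas "å" is used in Swedish but not German, and "ü" and "ß" are used in German but not in native Swedish words. The sketch below is only a rough hint built on that observation, not a language detector and not how the dataset was labeled; `umlaut_hint` is a hypothetical name, and foreign names (for example "Müller" in a Swedish tweet) will fool it.

```python
import unicodedata

# Letters that point to one language; "ä" and "ö" are deliberately absent
# because both Swedish and German use them.
GERMAN_ONLY = set("üÜß")
SWEDISH_ONLY = set("åÅ")


def umlaut_hint(text):
    """Return 'german', 'swedish', or 'ambiguous' based on distinctive letters only."""
    # Normalize so that decomposed forms (letter + combining mark) still match.
    chars = set(unicodedata.normalize("NFC", text))
    has_german = bool(chars & GERMAN_ONLY)
    has_swedish = bool(chars & SWEDISH_ONLY)
    if has_german and not has_swedish:
        return "german"
    if has_swedish and not has_german:
        return "swedish"
    return "ambiguous"


tweet = (
    "Bernd Krömer: Amri-Ausschuss will früheren "
    "Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma"
)
print(umlaut_hint(tweet))
```

On the tweet quoted above, the "ü" in "früheren" is what tips the hint toward German; the "ö" and "ä" contribute nothing either way.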