Many entries tagged with language=swedish are in fact in german #13

Open
olofhagsand opened this issue Aug 1, 2018 · 9 comments

@olofhagsand

I went through the entries and found 1021 marked as language=Swedish. On closer inspection, many of these are actually German.
For example, entry nr 322020 in IRAhandle_tweets_1.csv:
7.25000000000e+17,BERLINBOTE,Bernd Krömer: Amri-Ausschuss will früheren Innenstaatssekretär vernehmen https://t.co/qKqGZyBEma,Unknown,Swedish,9/8/2017 5:51,9/8/2017 5:51,2230,1779,22274,,German,0,0,NonEnglish
The tweet is definitely German, not Swedish, as all tweets from BERLINBOTE seem to be.
Did you perhaps make the mistake of treating "ö" and "ä" as identifiers of Swedish?
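
To illustrate why "ö" and "ä" alone cannot separate the two languages, here is a minimal sketch. The function name `umlaut_hint` and the character sets are illustrative assumptions, not anything known about how the dataset was labelled. It relies only on the alphabets: "ä" and "ö" are in both the Swedish and German alphabets, "å" is in the Swedish one but not the German one, and "ü" and "ß" are in the German one but not the Swedish one.

```python
# Letters that appear in both alphabets; seeing them says nothing
# about which of the two languages a text is written in.
SHARED = set("äöÄÖ")
SWEDISH_ONLY = set("åÅ")   # also used in Norwegian and Danish, hence only a hint
GERMAN_ONLY = set("üÜß")


def umlaut_hint(text: str) -> str:
    """Return a weak hint based purely on which special letters occur."""
    chars = set(text)
    has_swedish = bool(chars & SWEDISH_ONLY)
    has_german = bool(chars & GERMAN_ONLY)
    if has_swedish and not has_german:
        return "maybe Swedish"
    if has_german and not has_swedish:
        return "maybe German"
    if chars & SHARED:
        return "ambiguous"
    return "no hint"


tweet = "Bernd Krömer: Amri-Ausschuss will früheren Innenstaatssekretär vernehmen"
print(umlaut_hint(tweet))
```

A classifier that only looked for the shared letters would call a word like "Krömer" Swedish just as readily as German, which would be consistent with the mislabelling seen here.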

@ericodavis

It doesn't look like very thorough research. I'm curious how much labor it takes to write 3 million tweets in 2 years. Besides being in several languages, the tweets don't appear to correlate with the election. They seem to be more correlated with normal internet use.

@olofhagsand
Author

olofhagsand commented Aug 2, 2018

@ericodavis I got info from one person working with commercial fake FB entries: each 120-word entry paid $2, and a writer produced about 10 entries per hour.
These tweets are shorter. But extrapolating from those figures and freely speculating, and assuming no automated bots, this could yield a rate of ~100 tweets per person per day, so that 3M tweets could be produced by ~100 persons in two years at a cost of ~$100M.
Again, this may be wildly off; I hope there are better estimates out there.
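
For anyone who wants to redo the back-of-envelope estimate, here is a minimal sketch. It uses only the figures quoted in this comment and two assumptions of mine: people work every day of the two years, and each tweet is paid at the same $2 rate as a Facebook entry. Under those assumptions the headcount comes out near 41 and the cost near $6M, so the rounder figures above probably rest on different assumptions.

```python
TOTAL_TWEETS = 3_000_000
TWEETS_PER_PERSON_PER_DAY = 100   # speculative rate from the comment
DAYS = 2 * 365                    # assumption: work every day for two years
PAY_PER_ENTRY_USD = 2             # quoted rate for a 120-word Facebook entry

# Person-days of work needed, then spread over the two-year period.
person_days = TOTAL_TWEETS / TWEETS_PER_PERSON_PER_DAY
persons = person_days / DAYS

# Assumption: every tweet is paid like a Facebook entry.
cost_if_paid_per_tweet = TOTAL_TWEETS * PAY_PER_ENTRY_USD

print(f"persons needed: {persons:.0f}")
print(f"cost at $2 per tweet: ${cost_if_paid_per_tweet:,}")
```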

@johannessweater

@olofhagsand This is a frequent problem that has to do with Twitter's language detection algorithm, particularly on short tweets. The same happens between Danish and Norwegian.

@olofhagsand
Author

@johannessweater OK thanks.
So this is Twitter's own classification? I thought it might have been done by the research group.
After browsing the 2252 entries marked as "Norwegian", I can find no Norwegian at all.
At least maybe 50% of the Swedish entries were actually Swedish.
So I conclude the language classification is useless.

@johannessweater

@olofhagsand Language is part of the metadata that comes with tweets so I'm guessing that's where the language label comes from. But yikes, 50 percent is bad. In my experience it's usually more like 95 percent.

@olofhagsand
Author

@johannessweater Going through them, I verified 66 entries as Swedish out of the 1021 marked as Swedish (out of 3M total).
That means about 6.5% are correctly labelled as Swedish.
There may be more entries in actual Swedish that are marked as other languages. I have not looked for them.
Even worse, among the 2252 marked as Norwegian, 581 marked as Finnish and 499 marked as Icelandic, I detected 0% correctness. I.e., none of these were actually in Norwegian, Finnish or Icelandic.
But I did find one entry in Swedish that was marked as Norwegian (nr 2826690).
And BTW, for your interest, here are the 66 entries marked as Swedish that I confirmed to be actual Swedish:
95420 95696 98517 102129 102153 102594 102870 102909 104564 106679 109151 109931 111427 111749 112482 115190 116317 119412 121071 122736 123911 124397 124684 207390 651135 897811 904937 907197 907285 907428 907763 970049 970728 980633 1073220 1208471 1231385 1231797 1231853 1235321 1235424 1235561 1235613 1324426 1624583 1648180 2109106 2109525 2114772 2115070 2115306 2115722 2198212 2826196 2826215 2826494 2830468 2830606 2830993 2831009 2831134 2831184 2935489 2944234 2968211 2968230
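
The percentages above can be reproduced with a short sketch. The counts are the ones reported in this comment; the function name `precision_pct` is just an illustrative helper, and the check only measures how many tagged entries were correct, not how many Swedish entries were tagged as something else.

```python
# (label, entries carrying that label, entries verified by hand as that language)
counts = [
    ("Swedish", 1021, 66),
    ("Norwegian", 2252, 0),
    ("Finnish", 581, 0),
    ("Icelandic", 499, 0),
]


def precision_pct(tagged: int, verified: int) -> float:
    """Share of tagged entries that really are in the tagged language, in percent."""
    return 100.0 * verified / tagged


for label, tagged, verified in counts:
    print(f"{label}: {verified}/{tagged} = {precision_pct(tagged, verified):.1f}%")
```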

@ericodavis

This is how you do a Troll Army. I wonder when this will be sorted out.

@jpallas

jpallas commented Aug 11, 2018

I suspect the language field is meaningless. For example, BERLINBOTE has tweets tagged in 26 different languages, but samples tagged Spanish, Vietnamese, and Polish are clearly in German. I suspect all of its tweets are in German. So either the language tagging done by Twitter is crap, or the language tag attached to the data is not the language tag inferred by Twitter.
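
A check like this can be sketched as follows. The column names `author` and `language` are assumptions inferred from the sample row quoted earlier in this thread, and the rows in `SAMPLE` are made-up placeholders rather than real data; with the actual files you would pass an open file handle instead of the in-memory string.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only; not taken from the dataset.
SAMPLE = """author,language,content
BERLINBOTE,Swedish,placeholder text 1
BERLINBOTE,German,placeholder text 2
BERLINBOTE,Spanish,placeholder text 3
OTHER_HANDLE,English,placeholder text 4
"""


def languages_per_author(csv_file):
    """Map each author to the set of distinct language tags on its rows."""
    tags = defaultdict(set)
    for row in csv.DictReader(csv_file):
        tags[row["author"]].add(row["language"])
    return tags


tags = languages_per_author(io.StringIO(SAMPLE))
for author, langs in sorted(tags.items()):
    print(author, len(langs), sorted(langs))
```

An account that really writes in one language but carries many distinct tags is a quick signal that the field is unreliable for that account.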

@johannessweater

@jpallas @olofhagsand Yeah, this doesn't seem to be Twitter's language tag. I checked some of these tweets against duplicate tweets I was able to find in my own database from the election, and the language fields don't match up.
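
Roughly, that kind of cross-check can be sketched like this. It is not the code I actually ran: the function name `tag_mismatches` is hypothetical, the rows are placeholders, duplicates are matched on exact tweet text, and both sources are assumed to use the same naming scheme for languages (in practice one side may need mapping, for example from language codes to names).

```python
def tag_mismatches(dataset_rows, reference_rows):
    """Return (text, dataset_tag, reference_tag) for duplicates whose tags differ.

    Both arguments are iterables of (tweet_text, language_tag) pairs.
    """
    reference = {text: lang for text, lang in reference_rows}
    return [
        (text, lang, reference[text])
        for text, lang in dataset_rows
        if text in reference and reference[text] != lang
    ]


# Placeholder rows for illustration only.
dataset_rows = [("placeholder tweet A", "Swedish"), ("placeholder tweet B", "German")]
reference_rows = [("placeholder tweet A", "German"), ("placeholder tweet B", "German")]

for text, ours, theirs in tag_mismatches(dataset_rows, reference_rows):
    print(f"{text!r}: dataset says {ours}, reference says {theirs}")
```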
